Joshua Mathias. 2016. PDF Text Processing. https://github.com/JoshuaMathias/text-correction
This was a BYU Special Undergraduate Project in the Computer Science Department, supervised by Dr. Bill Barrett. Its purpose was to correct text extracted from PDF files of The Church of Jesus Christ of Latter-day Saints (33,285 of which were provided for this project) so that the text could be used as domain-specific machine translation training data.
The published portion of this project consists mainly of splitting combined words and removing unwanted characters and text. The project was later continued during a professional internship with the LDS Church.