
OCR Programs versus Manual Entry

When digitizing historical documents, one of the greatest challenges is taking a large volume of material and making hundreds of years of text searchable for users. That makes creating usable online archival documents a difficult task. A problem I have run into recently involves class readings that come as PDFs but are only scanned images of books and papers. Without searchable text, a 50-page reading is difficult to navigate.

Cohen and Rosenzweig discuss this issue in their book Digital History:

“Manually correcting OCR probably makes sense only on relatively small-scale projects and especially texts that yield particularly clean OCR. You should also keep in mind that if you use a typist, you don’t need to invest in hardware or software or spend time learning new equipment and programs. Despite our occasional euphoria over futuristic technologies like OCR, sometimes tried-and-true methods like typing are more effective and less costly.”

It’s great that there are all sorts of programs available that turn imaged text into searchable text cheaply, and some are even free. The newer versions of Microsoft Office even include a document imaging GUI that is pretty intuitive and can provide upwards of 98% accuracy with typed text.

OCR Software By Microsoft

But 98% accuracy still falls short for historical documents, especially when those documents are primary sources. OCR technology is great, but, as Cohen and Rosenzweig suggest above, it is hard to beat having someone type a document word for word. Another advantage of using a typist is that when a character is hard to read, an understanding of the language and a knowledge of words and their spelling limit the number of possible mistakes. A typist can also consult with others about what a document says, which a computer is unable to do.
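To put the 98% figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the accuracy is measured per character and that a page holds about 2,000 characters; both numbers are my own assumptions for illustration, not figures from Microsoft or from Cohen and Rosenzweig.

```python
def expected_ocr_errors(pages, chars_per_page, char_accuracy):
    """Estimate how many characters OCR gets wrong in a document."""
    total_chars = pages * chars_per_page
    return round(total_chars * (1 - char_accuracy))


# Assumed figures: a 50-page reading at roughly 2,000 characters per page,
# with 98% of characters recognized correctly.
errors = expected_ocr_errors(pages=50, chars_per_page=2000, char_accuracy=0.98)
print(errors)        # total wrong characters in the reading
print(errors / 50)   # average wrong characters per page
```

Under those assumptions, a 50-page reading would carry about 2,000 wrong characters, or around 40 per page, which is a lot of cleanup for a primary source.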

Something somewhat related that I have read about is the CAPTCHA system. It is anti-spam protection software that shows an image of words that the end user must type correctly into a dialogue box to complete a transaction.

CAPTCHA

Spammers have recently circumvented this system by using OCR software, so CAPTCHA does have some security downsides. But I think one really useful aspect of the system for the academic world is that the words OCR could not read reliably in processed documents are fed into CAPTCHA, so that as users type what they see, the documents are crowdsourced and corrected.
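The crowdsourcing idea can be sketched as a simple majority vote. This is only my own illustration of the principle, not how any real CAPTCHA service is implemented; the function name, the 60% agreement threshold, and the sample answers are all hypothetical.

```python
from collections import Counter


def crowdsourced_reading(transcriptions, min_agreement=0.6):
    """Return the most common transcription if enough users agree, else None."""
    # Normalize answers so stray spaces and capitalization do not split the vote.
    normalized = [t.strip().lower() for t in transcriptions if t.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    # Accept the word only when the share of matching answers meets the threshold.
    return word if votes / len(normalized) >= min_agreement else None


# Hypothetical answers from five users shown the same hard-to-read word.
answers = ["archive", "archive", "archlve", "Archive ", "archive"]
print(crowdsourced_reading(answers))
```

Lowercasing every answer is a simplification; a real archive would want to preserve the capitalization found in the original document.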

Research Question:

How has the historical use of US foreign aid prevented development in developing countries?

~ by William Hammill on September 10, 2012


