Our HackOH5 Hackathon has newspaper OCR scans saved as *.xml files with two different structures. One makes it easier to simply grab all the text on the page. The other is slightly more difficult to parse but provides a lot more syntactic and semantic information that would be valuable for downstream natural language processing.
The full *.xml dataset was released a few days ago. You can download it as a *.zip file (514MB) which decompresses into one large *.xml file (1.47GB). The dataset will be available in a variety of batch file formats (pdf, jp2, html, xml) as well as via online interactive REST-like API requests based upon OCLC's CONTENTdm digital content server. A sample dataset with several college newspapers is available on the HackOH5 website and offers a wider variety of formats, including one with a richer *.xml tag structure.
The OCR/scan configuration used to generate the *.xml files differs between the full *.xml dataset and the smaller sample dataset. These different XML tag structures can be explored using online XML visualizers.
The full *.xml dataset has each page of OCR text embedded within <pagetext> tags. The *.xml files in the smaller sample dataset have OCR text set as the value of the CONTENT attribute in <String> tags.
Full *.xml dataset: simplified structure with all page text between <pagetext> tags

        <pagetext>(all the newspaper text as one long string)</pagetext>
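Pulling page text out of this simplified format is a one-liner with BeautifulSoup4. A minimal sketch, using a toy two-page document (the real file is one large *.xml with many pages; the sample text is invented for illustration):

```python
from bs4 import BeautifulSoup

# Toy document in the simplified full-dataset format.
xml = """
<record>
  <pagetext>First page text as one long string</pagetext>
  <pagetext>Second page text as one long string</pagetext>
</record>
"""

# The "xml" parser feature requires lxml to be installed.
soup = BeautifulSoup(xml, "xml")
pages = [tag.get_text() for tag in soup.find_all("pagetext")]
for i, text in enumerate(pages, 1):
    print(f"page {i}: {text}")
```

Each <pagetext> element becomes one flat string per page; note that this is exactly the point where all sentence and paragraph structure is already gone.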
The full *.xml dataset makes it easy to pull out each page of OCR text, since everything is concatenated within a single <pagetext> tag per scanned page, but this discards potentially valuable structural information found in the smaller sample dataset's *.xml files. For example, the more complex *.xml markup in the smaller sample dataset has the following tag structure.
Sample dataset with complex *.xml tag structure that preserves more syntactic and semantic information
        <TextBlock>        (any number of TextBlocks, e.g. paragraphs)
          <TextLine>       (any number of TextLines, e.g. sentences)
            <String HEIGHT="<float>" HPOS="<float>" VPOS="<float>" CONTENT="word"/>
            (any number of String elements, e.g. words, alternating with SP elements, e.g. spaces)
Although it takes more work to parse, the more complex *.xml structure of the smaller sample dataset provides us with structural/syntactic information that conveys meaning/semantics:
- <TextLine> tags mark sentence units as far as the ABBYY engine can detect them (splitting on periods cannot reliably segment a character stream into sentences, especially with noisy text from newspaper OCR)
- <TextBlock> tags mark paragraph units (this valuable syntactic/semantic information is completely lost when all page text is concatenated together)
- <String … HEIGHT="<float>"> the HEIGHT attribute of the <String> tag gives valuable semantic clues as to which words on the page are titles and which belong to the body text.
- <String … STYLEREFS=""> may give some clue about special text like titles or italics, depending on scanner settings, font sets, etc. I didn't see much information conveyed by this attribute in the few files I looked at, but it may apply in other scans or future rescans.
- <String … HPOS="<float>" VPOS="<float>"> give the exact position of each word on the page and could prove useful in disambiguating layout. This is probably too complex for this exercise, but as a rule no metadata should be unnecessarily deleted.
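The walk from TextBlock to TextLine to String described above can be sketched with lxml. This is a minimal example against a toy fragment mirroring the snippet shown earlier; real files may declare an XML namespace, which would require passing a namespace map to the lookups:

```python
from lxml import etree

# Toy fragment in the sample-dataset (ALTO-style) structure:
# one title block and one body block.
xml = b"""
<Page>
  <TextBlock>
    <TextLine>
      <String HEIGHT="42.0" HPOS="10.0" VPOS="5.0" CONTENT="COLLEGE"/>
      <SP/>
      <String HEIGHT="42.0" HPOS="80.0" VPOS="5.0" CONTENT="NEWS"/>
    </TextLine>
  </TextBlock>
  <TextBlock>
    <TextLine>
      <String HEIGHT="11.5" HPOS="10.0" VPOS="60.0" CONTENT="Classes"/>
      <SP/>
      <String HEIGHT="11.5" HPOS="55.0" VPOS="60.0" CONTENT="resume"/>
    </TextLine>
  </TextBlock>
</Page>
"""

root = etree.fromstring(xml)
paragraphs = []
for block in root.iter("TextBlock"):      # one TextBlock per paragraph
    lines = []
    for line in block.iter("TextLine"):   # one TextLine per sentence unit
        words = [s.get("CONTENT") for s in line.iter("String")]
        lines.append(" ".join(words))
    paragraphs.append(" ".join(lines))

text = "\n\n".join(paragraphs)            # blank line preserves paragraph breaks
print(text)
```

Unlike the <pagetext> version, the output keeps sentence and paragraph boundaries, which downstream NLP can use directly.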
Of all the formats potentially available for our HackOH5 hackathon, the richer XML tag structure provides the most information in a ready-to-use format. Much of this information could also be retrieved from *.htm files. While HTML markup is not as precise as XML, the OCR conversion to *.htm does some of the fuzzy categorization for us by binning font sizes for titles (in the *.xml files, font sizes are given as non-uniform floating-point numbers that require statistical analysis to bin). Although ABBYY FineReader can output *.htm files, none were provided in either dataset, so they may not exist.
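The statistical binning of font sizes mentioned above can be as simple as thresholding the HEIGHT values. The sketch below uses a mean-plus-one-standard-deviation cutoff, which is an assumption of mine rather than anything prescribed by the dataset, and the HEIGHT values are invented for illustration:

```python
from statistics import mean, stdev

# Hypothetical HEIGHT values harvested from <String> tags on one page:
# a couple of large title words among many body-text words.
heights = [42.0, 41.5, 11.5, 11.8, 12.0, 11.4, 11.9, 12.1, 11.6, 11.7]

# Crude, assumed cutoff: anything taller than mean + 1 stdev is "title-like".
threshold = mean(heights) + stdev(heights)
title_like = [h for h in heights if h > threshold]
body_like = [h for h in heights if h <= threshold]

print(f"threshold={threshold:.1f}, titles={len(title_like)}, body={len(body_like)}")
```

On real pages a histogram or clustering (e.g. two-means) would be more robust than a single global threshold, since body, subhead, and title sizes vary across issues.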
The only other format that provides more information is the *.jp2 image files of the original microfiche, but these would need to be run through another OCR program to extract information. My initial experiments suggest that rescanning the older *.jp2 files is only practical on a sampled basis, or for small subsets first identified by NLP algorithms.
In sum, when creating ABBYY OCR documents, opt for outputting the following four file types: *.txt (raw text), *.htm (marked-up text), *.xml (more precisely marked-up text), and *.jp2 (the default resolution should enable future re-scans with newer OCR software, although a higher-resolution lossless TIFF format might better take advantage of future OCR enhancements). For the *.xml configuration, preserve as much structural and font information in XML tags as possible rather than simplifying the tag structure.
Parsing libraries like BeautifulSoup4 and lxml make it simple to extract text from even the most complex XML structures. But once you simplify your XML tag structure, you lose potentially valuable semantic information for subsequent NLP that can never be recovered, resulting in less accurate textual analysis.