There are a variety of ways to parse structured/marked up text like HTML and XML files. A few of the most common alternatives using Python are:
- Python built-in RegEx module (re)
- Python built-in xml.etree.ElementTree API
- BeautifulSoup4 (using the lxml as the plug-in parser)
- lxml direct
- Other parsing libraries (untangle, xmltodict, html5lib, HTMLParser, htmlfill, Genshi)
Each suits a different scenario:
- Python RegEx: non-nested, simpler searches in relatively well-formed marked-up text (RegEx may be the fastest option when the pattern is compiled and cached as an object)
- Python ElementTree XML: Although part of the standard Python distribution, this solution excels at neither speed nor leniency and is insecure in the face of malicious data, so we will not consider it here.
- BeautifulSoup: a popular, fairly high-level and friendly parsing framework whose performance characteristics depend upon which parse engine is plugged in (there is a great summary in the book Data Science Essentials in Python):
  - html.parser (built into Python): fast and inflexible for relatively simple HTML
  - lxml: very fast and flexible
  - xml: for XML only
  - html5lib: very slow and very lenient, for complicated HTML or where speed is not an issue
- Parse with lxml directly: less friendly, but very complete parsing of complicated marked-up text (extremely expressive and fast due to the underlying C parsing libraries)
- Other solutions are slower and/or less capable of handling malformed markup, but the well-maintained alternatives find some use for XML parsing in simpler cases with simpler syntax.
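To make the RegEx scenario concrete, here is a minimal sketch of a pattern that is compiled once and reused to pull attribute values out of flat, tidy markup. The sample string and the names HREF_RE and extract_links are illustrative assumptions, and the pattern only holds up when attribute values use straight double quotes and contain no escaped quotes.

```python
import re

# Compile once at import time so repeated calls reuse the same pattern object.
# Assumes tidy markup: double-quoted attribute values with no embedded quotes.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref="([^"]*)"', re.IGNORECASE)


def extract_links(markup):
    """Return the href value of every <a ...> tag, in document order."""
    return HREF_RE.findall(markup)


sample = (
    '<p><a class="issue" href="/issues/1964-11-04">Nov 4</a> and '
    '<a href="/issues/1976-09-17">Sep 17</a></p>'
)
print(extract_links(sample))
```

A pattern like this breaks down as soon as tags nest or attributes are quoted inconsistently, which is where a real parser earns its keep.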
I’ve been told our dataset will be provided in at least two formats: (a) an apparently older set of ABBYY OCR scans stored as one scanned page per file according to the standard ALTO Open XML Schema, which we’ll call the ALTO XML Format, and (b) an apparently newer ABBYY OCR scan stored in a single 1.47GB text file, stripped of all layout and formatting information, with each scanned page stored within <pagetext></pagetext> tags, which we’ll call the Simplified XML Format.
There are two fundamental and meaningful distinctions between the two formats in which the dataset will be provided: (1) XML tag hierarchy complexity and (2) file size. This suggests a different approach to processing each format:
ALTO XML Format: numerous small *.xml files within a nested file directory structure, with a more complex XML tag hierarchy
APPROACH: With thousands of files to process, this will be an I/O-bound task, so we’ll probably want to distribute the corpus across 5 computers (one college subset per machine). Secondarily, we’ll place the largest subsets on the fastest SSD-equipped machines. Finally, we’ll use the faster lxml parser within BeautifulSoup to extract the actual text. The text is stored as individual words in deeply nested <String> tags, each word held in the attribute CONTENT="<word>".
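As a rough sketch of the extraction step, the code below walks a directory tree and joins the CONTENT attributes of every <String> element into one text per page file. Note the swap: it calls the etree API directly rather than going through BeautifulSoup, and falls back to the standard library when lxml is not installed, purely so the sketch runs anywhere. The function names (local_name, page_words, corpus_pages) are hypothetical, and the namespace handling is an assumption, since real ALTO files typically declare a namespace on the root element.

```python
from pathlib import Path

try:
    from lxml import etree  # C-backed parser, used when available
except ImportError:  # fall back to the standard library's compatible API
    import xml.etree.ElementTree as etree


def local_name(tag):
    """Strip a namespace prefix such as '{http://...}String' from a tag name."""
    # lxml yields comments whose .tag is not a string, so guard against that.
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def page_words(xml_path):
    """Return the CONTENT attribute of every <String> element in one page file."""
    tree = etree.parse(str(xml_path))
    return [
        el.get("CONTENT", "")
        for el in tree.iter()
        if local_name(el.tag) == "String"
    ]


def corpus_pages(root_dir):
    """Yield (path, page text) for every *.xml file anywhere under root_dir."""
    for path in sorted(Path(root_dir).rglob("*.xml")):
        yield path, " ".join(page_words(path))
```

Calling corpus_pages on one college's subdirectory on each machine would give the per-machine split described above; the generator keeps only one page in memory at a time.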
In addition, we want to use the richer ALTO XML tag hierarchy to identify words of special importance such as (Sub)Titles and proper nouns. We can do this by searching within each page scan for words with larger font size and/or special capitalization. Font size information is contained in each <String> tag within the attribute fields of HEIGHT and WIDTH while capitalization is reflected in the actual word of the CONTENT attribute. See the block below for an illustration of 4 levels of importance for words.
Normal Font Words (indicated by HEIGHT and WIDTH attributes)
- Regular text words (e.g. “):
- Proper Noun text words (e.g. “Wednesday”, “Kenyon Lords”):
- ALL CAPS text words (e.g. “):
Larger Font Words (indicated by HEIGHT and WIDTH attributes)
- Regular text words (e.g. “sees”, “unexpected”, “growth”)
- Proper Noun text words (e.g. “President Johnson”)
- ALL CAPS text words (e.g. “LORDS WINS IN OVERTIME”)
Here is an example of (Sub)Titles from the scan of the first page of the Nov 4th, 1964 Kenyon Review (notice the size of the HEIGHT and WIDTH attributes compared with the normal text further below):
<String ID="TB.Img0001.6_1_0" STYLEREFS="TS_10.0" HEIGHT="284.0" WIDTH="1884.0" HPOS="864.0" VPOS="9704.0" CONTENT="Johnson" WC="1.0"/>
<SP WIDTH="448.0" HPOS="2748.0" VPOS="9680.0"/>
<String ID="TB.Img0001.6_1_1" STYLEREFS="TS_10.0" HEIGHT="284.0" WIDTH="2244.0" HPOS="3196.0" VPOS="9680.0" CONTENT="Landslide" WC="1.0"/>
</TextLine>
<TextLine ID="TB.Img0001.6_2" HEIGHT="300.0" WIDTH="4700.0" HPOS="884.0" VPOS="10140.0">
<String ID="TB.Img0001.6_2_0" STYLEREFS="TS_10.0" HEIGHT="276.0" WIDTH="2228.0" HPOS="884.0" VPOS="10164.0" CONTENT="Democrats" WC="1.0"/>
<SP WIDTH="244.0" HPOS="3112.0" VPOS="10140.0"/>
<String ID="TB.Img0001.6_2_1" STYLEREFS="TS_10.0" HEIGHT="276.0" WIDTH="848.0" HPOS="3356.0" VPOS="10144.0" CONTENT="Add" WC="1.0"/>
<SP WIDTH="244.0" HPOS="4204.0" VPOS="10140.0"/>
<String ID="TB.Img0001.6_2_2" STYLEREFS="TS_10.0" HEIGHT="272.0" WIDTH="1136.0" HPOS="4448.0" VPOS="10140.0" CONTENT="Seats" WC="1.0"/>
TEXT: “Johnson Landslide Democrats Add Seats”
Here is an example of normal text scanned from the same page of the Kenyon Review.
<String ID="TB.Img0001.6_6_2" STYLEREFS="TS_10.0" HEIGHT="72.0" WIDTH="76.0" HPOS="1776.0" VPOS="11388.0" CONTENT="a" WC="1.0"/>
<SP WIDTH="52.0" HPOS="1852.0" VPOS="11332.0"/>
<String ID="TB.Img0001.6_6_3" STYLEREFS="TS_10.0" HEIGHT="124.0" WIDTH="580.0" HPOS="1904.0" VPOS="11356.0" CONTENT="majority" WC="1.0"/>
<SP WIDTH="72.0" HPOS="2484.0" VPOS="11332.0"/>
<String ID="TB.Img0001.6_6_4" STYLEREFS="TS_10.0" HEIGHT="104.0" WIDTH="128.0" HPOS="2556.0" VPOS="11352.0" CONTENT="of" WC="1.0"/>
<SP WIDTH="60.0" HPOS="2684.0" VPOS="11332.0"/>
<String ID="TB.Img0001.6_6_5" STYLEREFS="TS_10.0" HEIGHT="108.0" WIDTH="632.0" HPOS="2744.0" VPOS="11348.0" CONTENT="36802500" WC="1.0"/>
<SP WIDTH="68.0" HPOS="3376.0" VPOS="11332.0"/>
<String ID="TB.Img0001.6_6_6" STYLEREFS="TS_10.0" HEIGHT="100.0" WIDTH="336.0" HPOS="3444.0" VPOS="11352.0" CONTENT="votes" WC="1.0"/>
<SP WIDTH="68.0" HPOS="3780.0" VPOS="11332.0"/>
<String ID="TB.Img0001.6_6_7" STYLEREFS="TS_10.0" HEIGHT="100.0" WIDTH="128.0" HPOS="3848.0" VPOS="11348.0" CONTENT="to" WC="1.0"/>
<SP WIDTH="64.0" HPOS="3976.0" VPOS="11332.0"/>
<String ID="TB.Img0001.6_6_8" STYLEREFS="TS_10.0" HEIGHT="108.0" WIDTH="800.0" HPOS="4040.0" VPOS="11336.0" CONTENT="Goldwateis" WC="1.0"/>
TEXT: “a majority of 36802500 votes to Goldwateis” (an OCR error for Goldwater)
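The word categories above can be sketched as a small classifier that combines a font-size class taken from HEIGHT with a capitalization class taken from CONTENT. This is only a sketch: the 200.0 cutoff is an assumption eyeballed from the two samples (headline heights around 272 to 284, body text around 72 to 124), the names case_class and classify_strings are hypothetical, and the sample markup is trimmed to the attributes used and carries no namespace.

```python
import xml.etree.ElementTree as ET

LARGE_FONT_MIN_HEIGHT = 200.0  # assumed cutoff; tune against real pages


def case_class(word):
    """Classify a word by capitalization alone."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "regular"  # digits and punctuation carry no case signal
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return "all_caps"
    if word[0].isupper():
        # Heuristic only: sentence-initial words land here as well.
        return "proper_noun"
    return "regular"


def classify_strings(alto_xml):
    """Return (word, size class, case class) for each <String> element."""
    root = ET.fromstring(alto_xml)
    results = []
    for el in root.iter("String"):
        word = el.get("CONTENT", "")
        height = float(el.get("HEIGHT", "0"))
        size = "larger" if height >= LARGE_FONT_MIN_HEIGHT else "normal"
        results.append((word, size, case_class(word)))
    return results


sample = """<TextBlock>
<TextLine><String HEIGHT="284.0" CONTENT="Johnson"/><String HEIGHT="284.0" CONTENT="Landslide"/></TextLine>
<TextLine><String HEIGHT="124.0" CONTENT="majority"/><String HEIGHT="100.0" CONTENT="votes"/></TextLine>
</TextBlock>"""

for row in classify_strings(sample):
    print(row)
```

HEIGHT also varies with letter shape within one font size (the lone “a” above is 72.0 while “majority” is 124.0), so a fixed cutoff needs a generous margin.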
Simplified XML Format: One 1.47GB file has all scanned pages from all newspaper editions across all 5 Ohio Colleges. In contrast to the ALTO XML Format, the Simplified XML Format has collapsed all text scanned on each page into a single <pagetext></pagetext> tag, losing all original layout information and associated semantic meaning.
APPROACH: Because of the large size of this file, we may have to read it in a streaming fashion, unlike the approach for the ALTO XML Format, where we simply read each entire file into memory before searching for and extracting terms. With 8GB of memory, we may not encounter problems using the same BeautifulSoup/lxml parser technique as above, but we’ll prepare to use one of two streaming techniques: (a) lxml with SAX or (b) RegEx reading in a line at a time.
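A minimal sketch of the first streaming option follows. It uses the incremental iterparse interface rather than SAX callbacks, and the standard library rather than lxml so that it runs without extra installs (lxml.etree offers an iterparse with a similar interface). It assumes the big file is well-formed XML with a single root element; the wrapper element names corpus and record in the sample are hypothetical, as is the function name iter_pagetext.

```python
import io
import xml.etree.ElementTree as ET


def iter_pagetext(source):
    """Yield the text of each <pagetext> element without loading the whole file.

    source may be a filename or a binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "pagetext":
            yield elem.text or ""
            root.clear()  # drop already-processed children so memory stays flat


sample = io.BytesIO(
    b"<corpus>"
    b"<record><unmapped>Oberlin Review (Oberlin, Ohio), 1976-09-17</unmapped>"
    b"<pagetext>Page 2 THE OBERLIN REVIEW</pagetext></record>"
    b"<record><pagetext>second page</pagetext></record>"
    b"</corpus>"
)
for text in iter_pagetext(sample):
    print(text)
```

Clearing the root after each page is what keeps memory use roughly constant; without it the parser would still build the full tree behind the scenes.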
Here is an extract from the Simplified XML Corpus:
<unmapped>Oberlin Review (Oberlin, Ohio), 1976-09-17</unmapped>
<pagetext>Page 2 THE OBERLIN REVIEW Friday September 17 1976 v vv VVVVV Ve Tl I II i ne aamimstration aeiays while the union pays Administration backpedalling is needlessly prolongingimportant contract negotiations with College secretaries and administrative assistants The delay has been costly to the union After four months of talks between Oberlin College Office and Professional Employees union OCOPE and theadministration an agreement on principles was reached June 30 On August 17 the Colleges lawyer presented a first draft of the contract supposedly settled a month and a half earlier except for details in language The draft differed in almost every major area from the agreement of June 30 Items not even discussed such as a costofliving decrease were included and OCOPEs four and a half to six percentincrease mysteriously became a straight four percent hike Neither OCOPE nor its members are rich Administrative assistants receive as little as 4400 for the school year and they are not forced to join the union If dues are increased as they almost certainly will be thanks to the expense of extra months of negotiations then membership could very well fall off In this light the administrations deliberate footdragging looks less like an understandable attempt at cutting costs and more like unionbreaking A four to six and a half percent pay increase to some of the lowest paid workers at Oberlin is both overdue and thoroughly acceptable Further administration delay of OCOPEs contract is not Tht REVIEW encourages comrades and adversaries to submit articles and letters Both must be typed on a S space line double spaced and signed and may be mailed to The Oberlin REVIEW Student Union Box 34 Deadlines are Sunday and Wednesday after noons for the Tuesday and Friday issues respectively IK emove Darners Grant disabled students some independence Graduate laments To the editor I looked at my transcriptyesterday On it appeared the pattern of my education for four years Once or twice 
I selected courses that were minor variations ofinformation I already knew I chose those courses because I knew I could do well with average effort I did it for the old CPA Other courses I winced to think about floundering helplessly I had insufficentbackground to even formulate the questions that might have rescued me from incomprehension I did respectably only by the grace of fanatically researched term papers And about the remaining courses Im not complaining I made good choices The point of this letter is that I made those choices alone The academic advisor was a person I visited as an irritating formality because I couldntregister without his signature At least I was comforted by his efficency I was always out the door in ten minutes he never inquired about my course grades or capabilities I was happy with this freedom to take what I pleased withoutchallenge I encouraged the faculty belief that Oberlin students could decide for themselves that it was our decision and in the end we knew best The purpose of anadvisor was to agree But Ive discovered that I didnt know everything about the courses I selected Nor did I know much The College can do itself andhandicapped students a favor by making campus buildings more accessible By quickly making the top priority changes suggested in a recent report about obstacles on campus for the handicapped the College can put out a welcome mat for handicapped prospective students and make life easier for those now on campus The little expense required to make the changes could easily be offset by theadditional sources of outside financial aid that handicapped students can tap Augmenting the natural accessibility provided by the flat terrain here can make the College a desirable campus for the one out of every six Americans who ishandicapped A drawing card for talented highly motivated handicapped students would add handsomely to Oberlins other attractions After the priority items Mudds steep ramp and heavy doors for example have 
been taken care of more expensive projects like elevators for Kettering and Severance should be considered The Colleges first order of business however should be removing those first barriers that prevent most disabledstudents from leading a reasonably inde pendent life on campus f V VVWVV v about designing an integrated and comprehensive education for a four year period My foresight was based upon the experience of previous semesters an experience which was less than the four years of college I now possess It is only now that I can begin to suggest the sequence of courses a person might pursue in English or Psychology my major areas Half the time I felt no connectionbetween one course and another I was not building upon knowledge I was only widening my mud hole of unconnected facts Now that college is over I wish an advisor had come alongheedless of the current attitudes toward the Oberlin studentindependent wants to make his own decision who scrutinized my course schedule who was even at times disagreeable Jonathan Brakarsh Class of 76 Injustice to dancers To the editor We are writing to inform this community about what we feel is an injustice to the dancers in our College Basically we do not agree with the priorities that the Oberlin Dance Company has established for this year We feel our dance company should serve the Collegecommunity by being a place where dancers can learn more about all aspects of dance It should not be an organization whose primary concern is to further the creative work of the dance professors Realizing that the professors are artists themselves struggling to do their own work makesapparent the conflict betweensimultaneously being an artist and a teacher of art Metzker and Woideck are being paid to teach dance to Oberlin students This should be their first responsibility By accepting only eight new dancers to the company and devoting a large part of the companys time to Metzkers work the company is serving the professors needs more than 
the students needs As shown by the large turnout at the auditions there is a significant number of dancers who desire the experience of being in thecompany This year especially when there is no advanced class and all other classes are filled with long waiting lists it is apparent to us that this small company is going to be an organization which fails to serve the needs of too many eager dancers in this College We hope that in the future the Oberlin Dance Company can be more responsive to the needs of the dancers in this community Ann Scheman Marcy Olmsted and other members of the Oberlin Dance Community Ms murder political To the editor The Oberlin Tradition and Oberlini history linger on our campus like latenight fog and I for one am weary of the lie Rhetoric is useless and I will not waste my time or yours explaining the numerous times Third World people have had to hear a motto and know that it overtly ignored them I simply want to expressfrustration and disappointment at the difficulty so many students of all colors seem to have in expressing support of a campus movement to express disapproval of the daily murders of black colored people in South Africa One student refused to voice an opinion on the subject saying Im trying to lay low on political issues I ask all of you is murder political I have seen Oberlin students campaign for saving trees byrecycling paper with moreenthusiasm than they have shown in See LETTERS p 3 obSEVIEW VOLUME 105 NUMBER 3 FRIDAY SEPTEMBER 17 1978 Published by the students of Oberlin College every Tuesday and Friday during the lad and spring semester excepting holidays and examination periods and on Fridays during Winter Term Subscription 1800 per year Second Class postage paid at Oberlin Ohio Entered as second class matter at the Oberlin Ohio post office April 2 1911 Offices 60 South Pleasant Street Oberlin Ohio 44074 Telephone 2161 775 8123 775 5440 TOM ROSENSTIEL EXECUTIVE EDITOR JEFF HORTY BUSINESS MANAGER Carolyn Butter Scon 
Maier Evelyn Shunaman Managing Editors C S Heinbockel Steve Maas Editorial Board Chairman Kiren Ghei Dave Meardon Hal Straus Naws Editors Josh Levin Pern Sommera Commentary Editors C S Heinbockel Robin Wallace Arts Editors Peggy Dorf Bill Warner Sports Editors Daniel Friedman Photography Editor Welling Had Advertising Managar Francit Alley Assistant Business Managar Editorial comment and policy are collectively determined by the members of the editorial board composed of the editors business manager and senior staff The opinions expressed in editorials are the ultimate responsibility of the elected chairman and are not necessarily those of Oberlin College or of the Association of Students of Oberlin College CAROLYN DULLER ISSUE EDITOR
Page 2 of the Oberlin Review, Sep 17th, 1976
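The second streaming option, reading a line at a time with RegEx, might look like the sketch below for records shaped like the extract above. It pairs each page with the most recent <unmapped> header and collapses line breaks inside a page. The names iter_pages and SAMPLE_TEXT are hypothetical, and the code assumes at most one <pagetext> element opens on any given line and that a header never appears inside a page.

```python
import io
import re

UNMAPPED_RE = re.compile(r"<unmapped>(.*?)</unmapped>")
OPEN_TAG, CLOSE_TAG = "<pagetext>", "</pagetext>"


def iter_pages(lines):
    """Yield (header, text) pairs from an iterable of lines.

    header is the most recent <unmapped> value seen, or None if there was none.
    """
    header = None
    buffer = None  # None means we are currently outside a <pagetext> element
    for line in lines:
        match = UNMAPPED_RE.search(line)
        if match and buffer is None:
            header = match.group(1).strip()
        if buffer is None and OPEN_TAG in line:
            line = line.split(OPEN_TAG, 1)[1]  # keep only what follows the tag
            buffer = []
        if buffer is not None:
            if CLOSE_TAG in line:
                buffer.append(line.split(CLOSE_TAG, 1)[0])
                # Join the pieces and collapse runs of whitespace and newlines.
                yield header, " ".join(" ".join(buffer).split())
                buffer = None
            else:
                buffer.append(line)


SAMPLE_TEXT = (
    "<unmapped>Oberlin Review (Oberlin, Ohio), 1976-09-17</unmapped>\n"
    "<pagetext>Page 2 THE OBERLIN REVIEW Friday\n"
    "September 17 1976</pagetext>\n"
    "<pagetext>one line page</pagetext>\n"
)
for header, text in iter_pages(io.StringIO(SAMPLE_TEXT)):
    print(header, "|", text)
```

Passing an open file object instead of the StringIO gives the same behavior on the real file, since Python file objects iterate one line at a time.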
Potential additional areas of improvement:
- Spell Check for topic words (too expensive for entire corpus)
- Contrast/correct spelling based upon the output of different OCR engines (FineReader versions 9 and 12)
- Identify Topic Sentences and Summary Sentences at the start and end of each identifiable paragraph to help find key terms.
- Exploit journalism “funnel” style to augment key word/topic discovery
- Use known domain space (news: political, sports, arts, etc) to help clean text and guide machine learning algorithms
- Some form of Entity Recognition to identify multi-word Proper Nouns like “Kenyon College”
- Compare/contrast unsupervised categorization on individual pages as well as on entire newspaper issues, then synthesize the results to overcome the chopping of articles across page boundaries
- Seed supervised classification with mined topic words
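As one illustration of the spell-check idea for topic words only, the sketch below snaps an OCR-damaged term to the closest entry in a small list of known terms using the standard library's difflib. The vocabulary, the 0.8 similarity cutoff, and the function name correct_term are all assumptions for illustration; a real run would need a vocabulary mined from the corpus or from period reference sources.

```python
import difflib


def correct_term(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical vocabulary of known topic words.
known_terms = ["Goldwater", "Johnson", "Democrats", "Oberlin", "Kenyon"]

print(correct_term("Goldwateis", known_terms))
print(correct_term("votes", known_terms))
```

Restricting the lookup to candidate topic words keeps the cost manageable, since each call compares the word against every vocabulary entry.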
A Note on why the ALTO XML Format is of particular value for Digital Humanities Research as opposed to the Simplified XML Format:
The ALTO (Analyzed Layout and Text Object) XML Schema was designed to preserve as much layout meta information as necessary to enable recreation of the original appearance of the document. ALTO XML Schema is currently in version 3.1 as of January 2016.
Although the single 1.47GB file may have slightly better OCR accuracy and be marginally easier to parse, it is not nearly as useful without all the layout meta information encoded in the ALTO tag set. The OCR accuracy over the entire corpus ranges from acceptable to horrible, depending largely upon the quality of the scanned microfiche and the complexity of the original columnar newsprint layout. The number of articles per page varies tremendously, and frequently articles on the same page are scrambled together while numerous articles are split across pages. Since a scanned page is the smallest unit of text that can be fed into machine learning algorithms, the OCR errors combined with the scrambled text severely handicap common algorithms like topic extraction or sentiment analysis.
Much of what machine learning algorithms could tell us in the best possible case is already encoded in the ALTO layout tags. For example, we could directly extract nearly all article titles and subtitles based upon font size tags. Positional tags could also potentially indicate topic sentences and summary sentences, which are stylistically more informative in newspapers. The larger fonts could also assist spell correction of key terms and focus more computationally intensive machine learning algorithms on key subsets of the relatively large corpus.
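A rough sketch of title extraction by font size: instead of a fixed cutoff, it treats a text line as a headline when its tallest word is at least twice the median word height on the page. The ratio of 2.0, the function name headline_lines, and the trimmed, namespace-free sample are assumptions; the heights and words are taken from the two samples shown earlier.

```python
import statistics
import xml.etree.ElementTree as ET

HEADLINE_RATIO = 2.0  # assumed: headline words are at least 2x the median height


def headline_lines(alto_xml, ratio=HEADLINE_RATIO):
    """Return the text of each <TextLine> whose tallest word clears the cutoff."""
    root = ET.fromstring(alto_xml)
    heights = [float(s.get("HEIGHT", "0")) for s in root.iter("String")]
    if not heights:
        return []
    cutoff = ratio * statistics.median(heights)
    headlines = []
    for line in root.iter("TextLine"):
        strings = line.findall("String")
        if strings and max(float(s.get("HEIGHT", "0")) for s in strings) >= cutoff:
            headlines.append(" ".join(s.get("CONTENT", "") for s in strings))
    return headlines


PAGE = """<TextBlock>
<TextLine><String HEIGHT="284.0" CONTENT="Johnson"/><String HEIGHT="284.0" CONTENT="Landslide"/></TextLine>
<TextLine><String HEIGHT="276.0" CONTENT="Democrats"/><String HEIGHT="276.0" CONTENT="Add"/><String HEIGHT="272.0" CONTENT="Seats"/></TextLine>
<TextLine><String HEIGHT="72.0" CONTENT="a"/><String HEIGHT="124.0" CONTENT="majority"/><String HEIGHT="104.0" CONTENT="of"/><String HEIGHT="108.0" CONTENT="36802500"/><String HEIGHT="100.0" CONTENT="votes"/><String HEIGHT="100.0" CONTENT="to"/><String HEIGHT="108.0" CONTENT="Goldwateis"/></TextLine>
</TextBlock>"""

print(headline_lines(PAGE))
```

A relative cutoff only works when body text dominates the page, which should hold for most newspaper pages but would fail on a page that is mostly display type.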