I modified a Python script from a Stack Exchange post so that we can split our large XML file into individual files, sorted by college into one of five subdirectories (a sixth, unk_college, catches any unmatched records). This is a common task, and there are many examples available in most popular languages.

import sys
import os
import re
import xml.etree.ElementTree as ET
import time

SOURCE_XML = 'oh5_all.xml'  # Big source file with mixed records to be split up
START_LINE = 7500  # To resume an incomplete run, skip records numbered below this

context = ET.iterparse(SOURCE_XML, events=('end', ))

# Create a subdirectory to split/write each newspaper record to the
# appropriate college

# Directory names must match the values returned by collegeDir() below
college_dirs = ["kenyon", "oberlin", "wooster", "ohio_wesleyan", "denison", "unk_college"]

for college_dir in college_dirs:
    if not os.path.exists(college_dir):
        os.makedirs(college_dir)

# For performance boost, precompile regex search expressions we'll use
# to match college name string on text within <title> tag.
# Matching runs on the title with all whitespace removed, hence 'TheDen'.

regex_ken = re.compile('^Ken')     # "Kenyon Collegian…"
regex_obe = re.compile('^Obe')     # "Oberlin College Review…"
regex_den = re.compile('^TheDen')  # "The Denisonian…"
regex_woo = re.compile('^TheWoo')  # "The Wooster Voice…"
regex_ohw = re.compile('^TheOhi')  # "The Ohio Wesleyan…"

# FUTURE: Clean up and shorten filename based upon title string
# regex_newstitle = re.compile(‘^([^\P))
def collegeDir(title):
    """Passed the <title> string of a newspaper issue record (with
    whitespace already removed by the caller), match the start of the
    string against the precompiled regex for each college newspaper to
    determine which subdirectory the processed newspaper issue file
    should be written to.
    """

    if regex_ken.match(title):
        # print('DEBUG: Kenyon Collegian')
        subdir = "kenyon"
    elif regex_obe.match(title):
        # print('DEBUG: Oberlin College Review')
        subdir = "oberlin"
    elif regex_den.match(title):
        # print('DEBUG: The Denisonian')
        subdir = "denison"
    elif regex_woo.match(title):
        # print('DEBUG: The Wooster Voice')
        subdir = "wooster"
    elif regex_ohw.match(title):
        # print('DEBUG: The Ohio Wesleyan')
        subdir = "ohio_wesleyan"
    else:
        # print('DEBUG: Unmatched Record, unk_ subdirectory')
        subdir = "unk_college"

    return subdir

# Start timer

start_time = time.time()
print('Start Time: %s' % (start_time))

# Loop through big file and copy/split out each newspaper issue <record>
# to the associated college subdirectory

recno = 0
print('Starting processing first record...')
for event, elem in context:
    if elem.tag == 'record':

        recno += 1

        if recno < START_LINE:
            elem.clear()  # free memory held by skipped records
            continue

        # Give command line visual feedback since long-running process
        if recno % 500 == 0:
            print('...processing record %s' % (recno))

        title = elem.find('title').text
        filename = title + ".xml"
        # delete edge and embedded whitespaces from title
        filename = ''.join(filename.split())
        # create full path to file by prefixing with matched subdirectory
        filename = "%s/%s" % (collegeDir(filename), filename)
        with open(filename, 'wb') as f:
            # file is opened in binary mode, so write bytes rather than str
            f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(ET.tostring(elem))
        elem.clear()  # release the record now that it is on disk

# Write summary statistics

sys.stdout.write('\n')
end_time = time.time()
print('End Time: %s' % (end_time))
print('----------------')
print('Processed: %s Records' % (recno))
# start_time is a float, not a function, so it must not be called
print('Execution Time: %s seconds' % (end_time - start_time))

After running the Python script above, we found some interesting variations in the newspaper names and assigned them to their modern equivalents:

The Kenyon Review

  • No variation on title

TOTAL Kenyon File Count:  1950

The Oberlin College Review (1875-02-24 thru 2012-04-27)

  • Oberlin College Review (1874-04-01 thru 1875-01-20)
  • The Elephant (1936-05-08 thru 1948-05-09)

TOTAL Oberlin File Count:  5956

The Denisonian (1896-02-15 thru 2012-11-13)

  • The Denisonian Collegian (1875-09-25 thru 1893-06-24)
  • The Zenith (1881-10-01 thru 1884-02-15)
  • The Commencement Daily (1885-06-11 thru 1885-06-25)
  • The Denison Weekly News (1885-09-30 thru 1885-12-10)
  • Granvilletimes (1890-09-23 thru 1899-01-20)
  • The Denison (1892-09-17 thru 1894-06-02)
  • The Daily Denison (1893-06-10 thru 1884-06-26)
  • The Weekly Denisonian (1901-10-05 thru 1903-06-06)

TOTAL Denison File Count:  3150

The Wooster Voice (1940-09-20 thru 2011-05-06)

  • Wooster Voice (1890-09-12 thru 1911-06-14)

TOTAL Wooster File Count:  2045

The Transcript (1972-04-06 thru 2006-05-04)

  • The Western Collegian (1867-10-01 thru 1874-06-25)
  • The College Transcript (1874-09-26 thru 1902-06-07)
  • The Practical Student (1888-06-22 thru 1895-06-08)
  • The Ohio Wesleyan Transcript (1902-06-18 thru 1972-03-02)

TOTAL Ohio Wesleyan File Count: 4420

GRAND TOTAL All Five Ohio College File Count:  17,521
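
The variant titles listed above could be folded into a lookup table so that older names land in the right college directory instead of unk_college. What follows is only a minimal sketch: the table name VARIANT_PREFIXES and the helpers normalize() and college_for() are hypothetical and not part of the script above, and it assumes each title in the XML begins with one of the names listed here (the "kenyon" entry is taken from the script's '^Ken' pattern).

```python
# Hypothetical prefix table built from the title variants listed above.
# Keys are lowercased, whitespace-free, with any leading "the" removed.
VARIANT_PREFIXES = [
    ("kenyon", "kenyon"),
    ("oberlincollegereview", "oberlin"),
    ("elephant", "oberlin"),
    ("denison", "denison"),
    ("zenith", "denison"),
    ("commencementdaily", "denison"),
    ("granvilletimes", "denison"),
    ("dailydenison", "denison"),
    ("weeklydenisonian", "denison"),
    ("woostervoice", "wooster"),
    ("transcript", "ohio_wesleyan"),
    ("westerncollegian", "ohio_wesleyan"),
    ("collegetranscript", "ohio_wesleyan"),
    ("practicalstudent", "ohio_wesleyan"),
    ("ohiowesleyantranscript", "ohio_wesleyan"),
]


def normalize(title):
    """Lowercase the title, drop all whitespace and a leading 'the'."""
    key = "".join(title.split()).lower()
    return key[3:] if key.startswith("the") else key


def college_for(title):
    """Return the subdirectory for a title, or 'unk_college' if unmatched."""
    key = normalize(title)
    for prefix, subdir in VARIANT_PREFIXES:
        if key.startswith(prefix):
            return subdir
    return "unk_college"


if __name__ == "__main__":
    for sample in ("The Elephant", "Granvilletimes", "The Practical Student"):
        print(sample, "->", college_for(sample))
```

A table like this keeps the title-to-college decisions in one place, which makes it easier to add a newly discovered variant than extending a chain of regex checks.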

 

The next step will be to merge, reorder and filter out unnecessary XML fields.
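
As a rough idea of what the filtering and reordering part might look like, here is a small sketch using the same ElementTree module. The field names in KEEP_ORDER other than title are assumptions made for illustration (only title appears in the script above), and filter_record() is a hypothetical helper, not existing code.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of fields to keep, in the order they should appear.
# Only <title> is known from the split script; the others are placeholders.
KEEP_ORDER = ["title", "date", "text"]


def filter_record(record, keep_order=KEEP_ORDER):
    """Return a new element holding only the wanted children, in keep_order."""
    slim = ET.Element(record.tag, record.attrib)
    for tag in keep_order:
        for child in record.findall(tag):
            slim.append(child)
    return slim


if __name__ == "__main__":
    sample = ET.fromstring(
        "<record><text>body</text><junk>x</junk>"
        "<title>Kenyon Collegian</title></record>"
    )
    print(ET.tostring(filter_record(sample), encoding="unicode"))
```

Building a new element rather than deleting children in place leaves the original record untouched, which is convenient while deciding which fields are really unnecessary.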

 
