Technology, Cognition and The Digital Humanities

Why PolyCogBlog?

In the technical world there is a ‘thing’ called polyglot programming.  The idea is that no one programming language is the best tool for every problem in every situation.  It follows that programmers should ideally become fluent in a variety of programming languages so they can choose the most suitable language for the problem at hand.

For example, one might learn object-oriented Java to develop on many existing systems, procedural Python for data analysis, asynchronous JavaScript for the web and functional Scala for distributed big data processing.  In addition to learning the syntax of various languages, polyglot programmers acquire valuable knowledge and perspectives, and synthesize a variety of mental models, idioms and best practices.  The polyglot programmer is analogous to the Hollywood ‘triple threat’.


Hollywood Triple Threats that can Sing, Dance and Act


Indeed, much of the recent progress made across various technologies has been a direct result of polyglot programmers and the intellectual cross-pollination they enable.  José Valim grafted a variety of important features from other languages onto the rather special-purpose Erlang/OTP platform.  This brought the power of more traditional web/application platforms to his Elixir platform, which inherited the reliability and throughput of telco switches.  Chris McCord ported some of the best ideas from the Ruby/Rails world to build the impressive Phoenix web framework atop Elixir/Erlang.  Nearly every major programming language is borrowing heavily from functional programming languages like Haskell to address demands for distributed, high-throughput and more reliable software arising from big data and complex processing.


Zooming out from the narrow world of technology to society at large, the same argument holds.  While we undoubtedly live in a world of specialists, the world is an increasingly interconnected place and the most interesting and consequential problems require more interdisciplinary teams.  For example, a generation ago healthcare innovation was rarely driven by anyone without an M.D.  Today, economists, statisticians, computer scientists, behavioral economists, ethicists and countless others have joined medical specialists in improving all aspects of medical care including prevention, management, outcomes, satisfaction and costs.

The term “polycog” stands for multiple modes of cognition, or ways of thinking.





Featured post

ProgHum 101: Laptop Configuration

Identical Hardware and Software Configuration:


MacBook Pro 13″ mid-2012 2.5GHz i5 4GB DDR3 500GB HDD

Operating System:

macOS Sierra 10.12.5

Desktop Image: (download from this link)

User Accounts:

(We may want to leave this to the students, so each creates a DH Gmail account they can keep after the course is over.  An account is needed to sign up for some of the online resources we’ll be using.)

  1. A <KenyonDigHum>@gmail account to sign up for many applications

Client-side Software Installs (in preferred order):

  1. Java 8 SE w/SDK (not the default Java8 install on homepage) (from
  2. RapidMiner Studio (not Server) requires Java 8 (from
  3. Install iTerm2 enhanced command line app (from
  4. Anaconda 4.4.0 with Python 3.6 (from
  5. Install Tensorflow (instructions for MacOS with pip3)
  6. R Programming Language (from
  7. R Studio Development Environment (from
  8. From within R Studio, install R package “tidyverse”
  9. Visual Studio Code (from Microsoft at
  10. Tableau Public (from
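Once the installs above are done, a short script can confirm that the command-line pieces are reachable.  What follows is only a sketch: the executable names in EXPECTED_TOOLS are assumptions about what each installer places on the PATH (for example, Visual Studio Code only provides `code` after its shell command is installed), and GUI-only applications such as RapidMiner and Tableau are not covered.

```python
import shutil
import sys

# Assumed executable names; adjust to match what the installers actually provide.
EXPECTED_TOOLS = ["java", "conda", "pip3", "R", "code"]


def check_tools(names, which=shutil.which):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


def python_ok(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Python >= 3.6:", python_ok())
    for name, path in check_tools(EXPECTED_TOOLS).items():
        print(f"{name:8} {path or 'NOT FOUND'}")
```

Run it from iTerm2 with the Anaconda Python; any line reading NOT FOUND points to an install that needs another look.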

macOS Sierra Applications Dock at bottom of screen (Only these Icons from Left to Right, all other icons removed)

  1. File Manager
  2. Chrome Browser
    1. Config home page:
    2. Config all extensions disabled by default
    3. Install latest Adobe Flash (required for Scratch)
  3. Anaconda-Navigator
  4. R Studio
  5. Visual Studio Code
  6. Microsoft Excel
  7. RapidMiner
  8. Tableau
  9. iTerm2

Chrome Browser Extensions (all disabled by default for privacy and stability):

  1. Sourcegraph for Github
  2. Restlet Client
  3. Postman Application (discontinued support mid-2017 but still viable)
  4. Ghostery
  5. BuiltWith
  6. What Runs It
  7. Simple Material Design Pallet
  8. Web Developer
  9. Code Cola
  10. Web Maker
  11. aXe
  12. Lighthouse
  13. Web Developer Checklist
  14. SEOquake


TOS and Privacy Policy: Slack

Terms of Service (TOS):

Please review the User Terms of Service

Effective: November 17, 2016

These User Terms of Service (the “User Terms”) govern your access and use of our online workplace productivity tools and platform (the “Services”). Please read them carefully. Even though you are signing onto an existing team, these User Terms apply to you —the prospective user reading these words. We are grateful you’re here.

First things First

These User Terms are Legally Binding

These User Terms are a legally binding contract between you and us. As part of these User Terms, you agree to comply with the most recent version of our Acceptable Use Policy, which is incorporated by reference into these User Terms. If you access or use the Services, or continue accessing or using the Services after being notified of a change to the User Terms or the Acceptable Use Policy, you confirm that you have read, understand and agree to be bound by the User Terms and the Acceptable Use Policy. “We”, “our” and “us” currently refers to Slack Technologies, Inc.

Customer’s Choices and Instructions

You are an Authorized User on a Team Controlled by a “Customer”

An organization or other third party that we refer to in these User Terms as “Customer” has invited you to a team (i.e., a unique URL where a group of users may access the Services, as further described in our Help Center pages). If you are joining one of your employer’s teams, for example, Customer is your employer. If you are joining a team created by your friend using her personal email address to work on her new startup idea, she is our Customer and she is authorizing you to join her team.

What This Means for You—and for Us

Customer has separately agreed to our Customer Terms of Service or entered into a written agreement with us (in either case, the “Contract”) that permitted Customer to create and configure a team so that you and others could join (each invitee granted access to the Services, including you, is an “Authorized User”). The Contract contains our commitment to deliver the Services to Customer, who may then invite Authorized Users to join its team(s). When an Authorized User (including, you) submits content or information to the Services, such as messages or files (“Customer Data”), you acknowledge and agree that the Customer Data is owned by Customer and the Contract provides Customer with many choices and control over that Customer Data. For example, Customer may provision or deprovision access to the Services, enable or disable third party integrations, manage permissions, retention and export settings, transfer or assign teams, share channels, or consolidate your team or channels with other teams or channels, and these choices and instructions may result in the access, use, disclosure, modification or deletion of certain or all Customer Data. Please check out our Help Center pages for more detail on our different Service plans and the options available to Customer.

The Relationship Between You, Customer and Us


A Few Ground Rules

You Must be Over the Age of 13

The Services are not intended for and should not be used by anyone under the age of thirteen. You represent that you are over the age of 13 and are the intended recipient of Customer’s invitation to the Services. You may not access or use the Services for any purpose if either of the representations in the preceding sentence is not true.

While You Are Here, You Must Follow the Rules

To help ensure a safe and productive work environment, all Authorized Users must comply with our Acceptable Use Policy and remain vigilant in reporting inappropriate behavior or content to Customer and us.

You Are Here At the Pleasure of Customer (and Us)

These User Terms remain effective until Customer’s subscription for you expires or terminates, or your access to the Services has been terminated by Customer or us. Please contact Customer if you at any time or for any reason wish to terminate your account, including due to a disagreement with any updates to these User Terms or the Acceptable Use Policy.

Limitation of Liability

If we believe that there is a violation of the Contract, User Terms, the Acceptable Use Policy, or any of our other policies that can simply be remedied by Customer’s removal of certain Customer Data or taking other action, we will, in most cases, ask Customer to take action rather than intervene. We may directly step in and take what we determine to be appropriate action (including disabling your account) if Customer does not take appropriate action or we believe there is a credible risk of harm to us, the Services, Authorized Users, or any third parties. IN NO EVENT WILL YOU OR WE HAVE ANY LIABILITY TO THE OTHER FOR ANY LOST PROFITS OR REVENUES OR FOR ANY INDIRECT, SPECIAL, INCIDENTAL, CONSEQUENTIAL, COVER OR PUNITIVE DAMAGES HOWEVER CAUSED, WHETHER IN CONTRACT, TORT OR UNDER ANY OTHER THEORY OF LIABILITY, AND WHETHER OR NOT THE PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. UNLESS YOU ARE ALSO A CUSTOMER (AND WITHOUT LIMITATION TO OUR RIGHTS AND REMEDIES UNDER THE CONTRACT), YOU WILL HAVE NO FINANCIAL LIABILITY TO US FOR A BREACH OF THESE USER TERMS. OUR MAXIMUM AGGREGATE LIABILITY TO YOU FOR ANY BREACH OF THE USER TERMS IS ONE HUNDRED DOLLARS ($100) IN THE AGGREGATE. THE FOREGOING DISCLAIMERS WILL NOT APPLY TO THE EXTENT PROHIBITED BY APPLICABLE LAW AND DO NOT LIMIT EITHER PARTY’S RIGHT TO SEEK AND OBTAIN EQUITABLE RELIEF.


The sections titled “The Relationship Between You, Customer, and Us”, “Limitation of Liability”, and “Survival”, and all of the provisions under the general heading “General Provisions” will survive any termination or expiration of the User Terms.

General Provisions

Email and Slack Messages

Except as otherwise set forth herein, all notices under the User Terms will be by email, although we may instead choose to provide notice to Authorized Users through the Services (e.g., a slackbot notification). Notices to Slack should be sent to, except for legal notices, which must be sent to A notice will be deemed to have been duly given (a) the day after it is sent, in the case of a notice sent through email; and (b) the same day, in the case of a notice sent through the Services. Notices under the Contract will be delivered solely to Customer in accordance with the terms of that agreement.

Privacy Policy

Please review our Privacy Policy for more information on how we collect and use data relating to the use and performance of our products.


As our business evolves, we may change these User Terms or the Acceptable Use Policy. If we make a material change to the User Terms or the Acceptable Use Policy, we will provide you with reasonable notice prior to the change taking effect either by emailing the email address associated with your account or by messaging you through the Services. You can review the most current version of the User Terms at any time by visiting this page, and by visiting the following for the most current versions of the other pages that are referenced in these User Terms: Acceptable Use Policy and Privacy Policy. Any material revisions to these User Terms will become effective on the date set forth in our notice, and all other changes will become effective on the date we publish the change. If you use the Services after the effective date of any changes, that use will constitute your acceptance of the revised terms and conditions.


No failure or delay by either party in exercising any right under the User Terms, including the Acceptable Use Policy, will constitute a waiver of that right. No waiver under the User Terms will be effective unless made in writing and signed by an authorized representative of the party being deemed to have granted the waiver.


The User Terms, including the Acceptable Use Policy, will be enforced to the fullest extent permitted under applicable law. If any provision of the User Terms is held by a court of competent jurisdiction to be contrary to law, the provision will be modified by the court and interpreted so as best to accomplish the objectives of the original provision to the fullest extent permitted by law, and the remaining provisions of the User Terms will remain in effect.


You may not assign any of your rights or delegate your obligations under these User Terms, including the Acceptable Use Policy, whether by operation of law or otherwise, without the prior written consent of us (not to be unreasonably withheld). We may assign these User Terms in their entirety (including all terms and conditions incorporated herein by reference), without your consent, to a corporate affiliate or in connection with a merger, acquisition, corporate reorganization, or sale of all or substantially all of our assets.

Governing Law

The Contract, and any disputes arising out of or related hereto, will be governed exclusively by the internal laws of the State of California, without regard to its conflicts of laws rules or the United Nations Convention on the International Sale of Goods.

Venue; Waiver of Jury Trial; Fees

The state and federal courts located in San Francisco County, California will have exclusive jurisdiction to adjudicate any dispute arising out of or relating to these User Terms, including the Acceptable Use Policy, or their formation as a contract between us or their enforcement. Each party hereby consents and submits to the exclusive jurisdiction of such courts. Each party also hereby waives any right to jury trial in connection with any action or litigation in any way arising out of or related to the User Terms. In any action or proceeding to enforce rights under the User Terms, the prevailing party will be entitled to recover its reasonable costs and attorney’s fees.

Entire Agreement

The User Terms, including any terms incorporated by reference into the User Terms, constitute the entire agreement between you and us and supersede all prior and contemporaneous agreements, proposals or representations, written or oral, concerning its subject matter. To the extent of any conflict or inconsistency between the provisions in these User Terms and any pages referenced in these User Terms, the terms of these User Terms will first prevail; provided, however, that if there is a conflict or inconsistency between the Contract and the User Terms, the terms of the Contract will first prevail, followed by the provisions in these User Terms, and then followed by the pages referenced in these User Terms (e.g., the Privacy Policy). Customer will be responsible for notifying Authorized Users of those conflicts or inconsistencies and until such time the terms set forth herein will be binding.

Contacting Slack

Please also feel free to contact us if you have any questions about Slack’s User Terms of Service. You may contact us at or at our mailing address below:

Slack Technologies
155 5th Street, 6th Floor
San Francisco, CA

Privacy Policy:

Effective: November 17, 2016

You can see past versions of our Privacy Policy in our Policy Archive.


Our privacy policy will help you understand what information we collect at Slack, how Slack uses it, and what choices you have.

When we talk about “Slack,” “we,” “our,” or “us” in this policy, we are referring to Slack Technologies, Inc., the company which provides the Services. When we talk about the “Services” in this policy, we are referring to our online workplace productivity tools and platform. Our Services are currently available for use via a web browser or applications specific to your desktop or mobile device, as further described in our Help Center.

Information we collect and receive

1. Customer Data

Content and information submitted by users to the Services is referred to in this policy as “Customer Data.” As further explained below, Customer Data is controlled by the organization or other third party that created the team (the “Customer”). Where Slack collects or processes Customer Data, it does so on behalf of the Customer. Here are some examples of Customer Data (but keep in mind they are only examples and there may be others): messages (including those in channels and direct messages), pictures, videos, edits to messages or deleted messages, and other types of files. A user may also choose to enter information into their profile, such as first and last name, job, a photo and a phone number.

If you join a team and create a user account, you are a “user,” as further described in the User Terms of Service. If you are using the Services by invitation of a Customer, whether that Customer is your employer, another organization, or an individual, that Customer determines its own policies regarding storage, access, modification, deletion, sharing, and retention of Customer Data which may apply to your use of the Services. Please check with the Customer about the policies and settings it has in place.

2. Other information

Slack may also collect and receive the following information:

  • Account creation information. Users provide information such as an email address and password to create an account.
  • Team setup information. When a Customer creates a team using the Services, we collect an email address, a team name, domain details (such as, user name for the individual setting up the team, and password. For more information on team set-up, click here.
  • Billing and other information. For Customers that purchase a paid version of the Services, our corporate affiliates and our third party payment processors may collect and store billing address and credit card information on our behalf or we may do this ourselves.
  • Services usage information. This is information about how you are accessing and using the Services, which may include administrative and support communications with us and information about the teams, channels, people, features, content, and links you interact with, and what third party integrations you use (if any).
  • Contact information. With your permission, any contact information you choose to import is collected (such as an address book from a device) when using the Services.
  • Log data. When you use the Services our servers automatically record information, including information that your browser sends whenever you visit a website or your mobile app sends when you are using it. This log data may include your Internet Protocol address, the address of the web page you visited before using the Services, your browser type and settings, the date and time of your use of the Services, information about your browser configuration and plug-ins, language preferences, and cookie data.
  • Device information. We may collect information about the device you are using the Services on, including what type of device it is, what operating system you are using, device settings, application IDs, unique device identifiers, and crash data. Whether we collect some or all of this information often depends on what type of device you are using and its settings.
  • Geo-location information. Precise GPS location from mobile devices is collected only with your permission. WiFi and IP addresses received from your browser or device may be used to determine approximate location.
  • Services integrations. If, when using the Services, you integrate with a third party service, we will connect that service to ours. The third party provider of the integration may share certain information about your account with Slack. However, we do not receive or store your passwords for any of these third party services. For more information on service integrations, click here.
  • Third party data. Slack may also receive information from affiliates in our corporate group, our partners, or others that we use to make our own information better or more useful. This might be aggregate level information, such as which IP addresses go with which zip codes, or it might be more specific information, such as about how well an online marketing or email campaign performed.

Our Cookie Policy

Slack uses cookies and similar technologies like single-pixel gifs and web beacons, to record log data. We use both session-based and persistent cookies.

Cookies are small text files sent by us to your computer and from your computer or mobile device to us each time you visit our website or use our desktop application. They are unique to your account or your browser. Session-based cookies last only while your browser is open and are automatically deleted when you close your browser. Persistent cookies last until you or your browser delete them or until they expire.

Some cookies are associated with your account and personal information in order to remember that you are logged in and which teams you are logged into. Other cookies are not tied to your account but are unique and allow us to carry out site analytics and customization, among other similar things. If you access the Services through your browser, you can manage your cookie settings there but if you disable some or all cookies you may not be able to use the Services.

Slack sets and accesses our own cookies on the domains operated by Slack and its corporate affiliates. In addition, we use third parties like Google Analytics for website analytics. You may opt-out of third party cookies from Google Analytics on its website. We do not currently recognize or respond to browser-initiated Do Not Track signals as there is no consistent industry standard for compliance.

How we use your information

We use your information to provide and improve the Services.

1. Customer Data

Slack may access and use Customer Data as reasonably necessary and in accordance with Customer’s instructions to (a) provide, maintain and improve the Services; (b) to prevent or address service, security, technical issues or at a Customer’s request in connection with customer support matters; (c) as required by law or as permitted by the Data Request Policy and (d) as set forth in our agreement with the Customer or as expressly permitted in writing by the Customer. Additional information about Slack’s confidentiality and security practices with respect to Customer Data is available at our Security Practices page.

2. Other information

We use other kinds of information in providing the Services. Specifically:

  • To understand and improve our Services. We carry out research and analyze trends to better understand how users are using the Services and improve them.
  • To communicate with you by:
    • Responding to your requests. If you contact us with a problem or question, we will use your information to respond.
    • Sending emails and Slack messages. We may send you Service and administrative emails and messages. We may also contact you to inform you about changes in our Services, our Service offerings, and important Service related notices, such as security and fraud notices. These emails and messages are considered part of the Services and you may not opt-out of them. In addition, we sometimes send emails about new product features or other news about Slack. You can opt out of these at any time.
  • Billing and account management. We use account data to administer accounts and keep track of billing and payments.
  • Communicating with you and marketing. We often need to contact you for invoicing, account management and similar reasons. We may also use your contact information for our own marketing or advertising purposes. You can opt out of these at any time.
  • Investigating and preventing bad stuff from happening. We work hard to keep the Services secure and to prevent abuse and fraud.

This policy is not intended to place any limits on what we do with data that is aggregated and/or de-identified so it is no longer associated with an identifiable user or Customer of the Services.

Your choices

1. Customer Data

Customer provides us with instructions on what to do with Customer Data. A Customer has many choices and control over Customer Data. For example, Customer may provision or deprovision access to the Services, enable or disable third party integrations, manage permissions, retention and export settings, transfer or assign teams, share channels, or consolidate teams or channels with other teams or channels. Since these choices and instructions may result in the access, use, disclosure, modification or deletion of certain or all Customer Data, please review the Help Center pages for more information about these choices and instructions.

2. Other information

If you have any questions about your information, our use of this information, or your rights when it comes to any of the foregoing, contact us at

Other Choices

In addition, the browser you use may provide you with the ability to control cookies or other types of local data storage. Your mobile device may provide you with choices around how and whether location or other data is collected and shared. Slack does not control these choices, or default settings, which are offered by makers of your browser or mobile device operating system.

Sharing and Disclosure

There are times when information described in this privacy policy may be shared by Slack. This section discusses only how Slack may share such information. Customers determine their own policies for the sharing and disclosure of Customer Data. Slack does not control how Customers or their third parties choose to share or disclose Customer Data.

1. Customer Data

Slack may share Customer Data in accordance with our agreement with the Customer and the Customer’s instructions, including:

  • With third party service providers and agents. We may engage third party companies or individuals to process Customer Data.
  • With affiliates. We may engage affiliates in our corporate group to process Customer Data.
  • With third party integrations. Slack may, acting on our Customer’s behalf, share Customer Data with the provider of an integration added by Customer. Slack is not responsible for how the provider of an integration may collect, use, and share Customer Data.

2. Other information

Slack may share other information as follows:

  • About you with the Customer. There may be times when you contact Slack to help resolve an issue specific to a team of which you are a member. In order to help resolve the issue and given our relationship with our Customer, we may share your concern with our Customer.
  • With third party service providers and agents. We may engage third party companies or individuals, such as third party payment processors, to process information on our behalf.
  • With affiliates. We may engage affiliates in our corporate group to process other information.

3. Other types of disclosure

Slack may share or disclose Customer Data and other information as follows:

  • During changes to our business structure. If we engage in a merger, acquisition, bankruptcy, dissolution, reorganization, sale of some or all of Slack’s assets, financing, acquisition of all or a portion of our business, a similar transaction or proceeding, or steps in contemplation of such activities (e.g. due diligence).
  • To comply with laws. To comply with legal or regulatory requirements and to respond to lawful requests, court orders and legal process.
  • To enforce our rights, prevent fraud and for safety. To protect and defend the rights, property, or safety of us or third parties, including enforcing contracts or policies, or in connection with investigating and preventing fraud.

We may disclose or use aggregate or de-identified information for any purpose. For example, we may share aggregated or de-identified information with our partners or others for business or research purposes like telling a prospective Slack Customer the average number of messages sent within a Slack team in a day or partnering with research firm or academics to explore interesting questions about workplace communications.


Slack takes security seriously. We take various steps to protect information you provide to us from loss, misuse, and unauthorized access or disclosure. These steps take into account the sensitivity of the information we collect, process and store, and the current state of technology.

To learn more about current practices and policies regarding security and confidentiality of Customer Data and other information, please see our Security Practices; we keep that document updated as these practices evolve over time.

Children’s information

Our Services are not directed to children under 13. If you learn that a child under 13 has provided us with personal information without consent, please contact us.

Changes to this Privacy Policy

We may change this policy from time to time, and if we do we will post any changes on this page. If you continue to use the Services after those changes are in effect, you agree to the revised policy.

EU-U.S. Privacy Shield and Swiss-U.S. Privacy Shield

Slack has self-certified to the EU-U.S. and Swiss-U.S. Privacy Shield frameworks set forth by the U.S Department of Commerce with respect to collection, use and retention of Customer Data. For more information, see our Privacy Shield Notice. We may process some personal data from individuals or companies via other compliance mechanisms, including data processing agreements based on the EU Standard Contractual Clauses. To learn more about the Privacy Shield program, refer to

Contacting Slack

Please also feel free to contact us if you have any questions about Slack’s Privacy Policy or practices. You may contact us at or at our mailing address below:

Slack Technologies
155 5th Street, 6th Floor
San Francisco, CA

HackOH5 Photos

Snaps from our hackathon this past weekend at the College of Wooster.


Not so much sleep, but plenty of sugar, caffeine and an intense out-of-the-box learning experience.  Best part was overcoming adversity, coming together as a team and producing something of value that carries the collective insights and experiences of everyone.




HackOH5 Presentation

Tableau Presentation:  The Evolution of Interest in Black Music as Seen thru Ohio5 Student Journalism

Slide Presentation:  What’s Goin’ On:  A Social and Political History of Black Music Consumption Told thru Ohio5 Journalism

Well done!




Keyword Searching the Dataset

Here are some preliminary keyword searches of the HackOH5 dataset.  I ran these on the dataset before our hackathon to give the team ideas and seed our brainstorming session.  Time will be at a premium during the hackathon, so we need to come as prepared as possible.
As you can see, I picked several general ideas along with specific terms we could search the corpus for related to each idea.
justice – 14,422
protest* – 22,340 (includes protestant and protest)
protest – 10,947
riot – 13,910
race – 56,232
black – 79,591 (not necessarily race)
african-american – 0
“ negro” – 10,293
“latin*” – 29,766
“Hispan*” – 1,076 (incl hispanic, hispano)
chican – 231 (incl chicano, chicana)
asian – 8,142
chinam – 257 (incl chinamen, chinaman)
oriental – 2,832
jap – 504 (excl japanese, japan)
jap* – 20,108 (incl japanese, japan)
“ Nip ” – 0 (excl nippon, nip* but could be verb to nip)
“ Hun ” – 933
“feminis*” – 4,771 (incl feminist, feminism)
suffrage* – 1,315 (incl suffrage, suffragette)
“ gender*” – 4,206 (excl engender, etc)
drug – 29,987
sex –
rock –
protestant – 6
catholic – 5,366
bible – 11,629
holy – 6,764
divin* – 13,471 (incl divine, divinity)
sacred – 4,739
Jesus – 7,746
God – 40,416
jew* – 35,873 (includes jew,jews,jewish)
muslim – 2,089
islam* – 1,860 (includes islam, islamic, etc)


econom* – 47,324
job – 53,563
career – 26,045
interview – 22,440
market – 17,377
money – 60,287
dollar – 27,183
vote – 40,250
election – 50,565

war – 607,649

europe – 27,540
asia – 16,441
latin america – 3,294
africa – 30,044
vietnam – 10,811
president – 218,626
professor –
class –
campus – 193,087
restaurant – 11,038
police – 21,098
community – 77,232
local – 8
mayor –
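
The post does not say which search tool produced the counts above, so the following is only a minimal sketch of how comparable counts could be reproduced in Python once the text is loaded as a string. The count_keyword helper and the sample sentence are made up for illustration, and the wildcard handling (a trailing * matches any suffix, otherwise whole words only) is an assumption about how the original searches behaved.

```python
import re

def count_keyword(text, pattern):
    """Count case-insensitive matches of a simple search pattern.

    A trailing '*' means "any word starting with this stem"; without it,
    only the whole word is counted. (Hypothetical helper for illustration.)
    """
    if pattern.endswith("*"):
        regex = r"\b" + re.escape(pattern[:-1]) + r"\w*"
    else:
        regex = r"\b" + re.escape(pattern) + r"\b"
    return len(re.findall(regex, text, flags=re.IGNORECASE))

sample = "Students protest downtown. Protestant groups joined the protests."
print(count_keyword(sample, "protest"))   # whole word only
print(count_keyword(sample, "protest*"))  # stem plus any suffix
```

This also shows why protest and protest* give such different totals: the wildcard form sweeps in unrelated words such as protestant.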

I modified a Python script from a Stack Exchange post so we can cut up our large XML file into individual files, sorted by college into one of five subdirectories.  This is a common task, and there are many examples out there in most popular languages.

import sys
import os
import re
import xml.etree.ElementTree as ET
import time

SOURCE_XML = 'oh5_all.xml'  # Big source file with mixed records to be split up
START_LINE = 7500  # In case of incomplete run, restart run after this record no

context = ET.iterparse(SOURCE_XML, events=('end', ))

# Create a subdirectory to split/write each newspaper record to the
# appropriate college

college_dirs = ["kenyon", "oberlin", "wooster", "oh_wesleyan", "denison", "unk_college"]

for college_dir in college_dirs:
    if not os.path.exists(college_dir):
        os.makedirs(college_dir)

# For performance boost, precompile regex search expressions we'll use
# to match college name string on text within <title> tag
# (whitespace is stripped from the title before matching)

regex_ken = re.compile('^Ken')  # "Kenyon Collegian…"
regex_obe = re.compile('^Obe')  # "Oberlin College Review…"
regex_den = re.compile('^TheDen')  # "The Denisonian…"
regex_woo = re.compile('^TheWoo')  # "The Wooster Voice…"
regex_ohw = re.compile('^TheOhi')  # "The Ohio Wesleyan…"

# FUTURE: Clean up and shorten filename based upon title string
# regex_newstitle = re.compile('^([^\P))

def collegeDir(title):
    """Passed the variable <title> string of a newspaper issue record, match
    the start of the <title> string against precompiled regex for each
    college newspaper to determine which subdirectory the processed
    newspaper issue file should be written to
    """
    if regex_ken.match(title):
        # print('DEBUG: Kenyon Collegian')
        subdir = "kenyon"
    elif regex_obe.match(title):
        # print('DEBUG: Oberlin College Review')
        subdir = "oberlin"
    elif regex_den.match(title):
        # print('DEBUG: The Denisonian')
        subdir = "denison"
    elif regex_woo.match(title):
        # print('DEBUG: The Wooster Voice')
        subdir = "wooster"
    elif regex_ohw.match(title):
        # print('DEBUG: The Ohio Wesleyan')
        subdir = "oh_wesleyan"  # must match the name in college_dirs
    else:
        # print('DEBUG: Unmatched Record, unk_ subdirectory')
        subdir = "unk_college"

    return subdir

# Start timer

start_time = time.time()
print('Start Time: %s' % (start_time))

# Loop through big file and copy/split out each newspaper issue <record>
# to the associated college subdirectory

recno = 0
print('Starting processing first record…')
for event, elem in context:
    if elem.tag == 'record':

        recno += 1

        if (recno < START_LINE):
            # Skip records already written by an earlier, incomplete run
            elem.clear()
            continue

        # Give command line visual feedback since long-running process
        if recno % 500 == 0:
            print('…processing record %s' % (recno))

        title = elem.find('title').text
        filename = title + ".xml"
        # delete edge and embedded whitespaces from title
        filename = ''.join(filename.split())
        # create full path to file by prefixing with matched subdirectory
        filename = "%s/%s" % (collegeDir(filename), filename)
        with open(filename, 'wb') as f:
            # File is opened in binary mode, so write bytes
            f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem))
        # Free the memory held by the record just written
        elem.clear()

# Write summary statistics

end_time = time.time()
print('End Time: %s' % (end_time))
print('Processed: %s Records' % (recno))
print('Execution Time: %s' % (end_time - start_time))

After executing the above Python script, we find some interesting variations on the names of newspapers and assign them to their modern equivalents:

The Kenyon Collegian

  • No variation on title

TOTAL Kenyon File Count:  1950

The Oberlin College Review (1875-02-24 thru 2012-04-27)

  • Oberlin College Review (1874-04-01 thru 1875-01-20)
  • The Elephant (1936-05-08 thru 1948-05-09)

TOTAL Oberlin File Count:  5956

The Denisonian (1896-02-15 thru 2012-11-13)

  • The Denisonian Collegian (1875-09-25 thru 1893-06-24)
  • The Zenith (1881-10-01 thru 1884-02-15)
  • The Commencement Daily (1885-06-11 thru 1885-06-25)
  • The Denison Weekly News (1885-09-30 thru 1885-12-10)
  • Granvilletimes (1890-09-23 thru 1899-01-20)
  • The Denison (1892-09-17 thru 1894-06-02)
  • The Daily Denison (1893-06-10 thru 1884-06-26)
  • The Weekly Denisonian (1901-10-05 thru 1903-06-06)

TOTAL Denison File Count:  3150

The Wooster Voice (1940-09-20 thru 2011-05-06)

  • Wooster Voice (1890-09-12 thru 1911-06-14)

TOTAL Wooster File Count:  2045

The Transcript (1972-04-06 thru 2006-05-04)

  • The Western Collegian (1867-10-01 thru 1874-06-25)
  • The College Transcript (1874-09-26 thru 1902-06-07)
  • The Practical Student (1888-06-22 thru 1895-06-08)
  • The Ohio Wesleyan Transcript (1902-06-18 thru 1972-03-02)

TOTAL Ohio Wesleyan File Count: 4420

GRAND TOTAL All Five Ohio College File Count:  17,521


The next step will be to merge, reorder and filter out unnecessary XML fields.
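
As a rough idea of what that filtering step could look like, here is a minimal sketch using ElementTree. The sample record and the field names other than record and title are hypothetical (the real dataset's fields may differ), and filter_record is an illustrative helper rather than code from the hackathon.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the real field names in the dataset may differ.
record_xml = """<record>
  <title>Kenyon Collegian</title>
  <date>1900-01-01</date>
  <scanner_id>abc123</scanner_id>
  <text>Sample page text</text>
</record>"""

KEEP_FIELDS = ["date", "title", "text"]  # assumed fields, in the desired order

def filter_record(record, keep_fields):
    """Return a new <record> holding only keep_fields, in that order."""
    slim = ET.Element("record")
    for name in keep_fields:
        child = record.find(name)
        if child is not None:
            slim.append(child)
    return slim

slim = filter_record(ET.fromstring(record_xml), KEEP_FIELDS)
print(ET.tostring(slim, encoding="unicode"))
```

Listing the fields to keep, rather than the fields to drop, means the same list also controls the reordering.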


Exploring Text with Python and NLTK

Because our newspaper OCR text is noisy, article text is often scrambled, and articles are clipped across pages, we’re limited in what information we can extract with more advanced Natural Language Processing (NLP) algorithms.

For example, the earlier pages of the corpus have a very high error rate (~20%) with few complete sentences.  In addition, it appears that all sentence-terminating punctuation (periods ‘.’) has been stripped from both XML datasets, which trips up even basic utilities like the NLTK word and sentence tokenizers.  As a result, we cannot do more sophisticated textual analysis that assumes complete English sentences and depends upon grammatically correct construction.

As a consequence we’ll first characterize our word corpus with simple statistical metrics and then limit ourselves to fundamental NLP analysis via the Python NLTK toolkit.  For a great introduction to NLP we’ll be looking at excerpts from NLTK’s excellent online manual.

In rough order of increasing complexity, here are some statistics we’d like to gather for each scanned page, perhaps averaged across each newspaper issue:

  • Python Package “textstat”
    • syllable_count(text)
    • lexicon_count(text, True/False) – True (the default) removes punctuation first
    • sentence_count(text)
    • flesch_reading_ease(text)
    • flesch_kincaid_grade(text)
    • dale_chall_readability_score(text), uses lookup table of 3000 English words
  • NLTK (see NLTK online book chapter 1)
    • Concordance (context within which a particular word appears)
      • from nltk.book import *
      • text1.concordance("monstrous") # context around word in Moby Dick
    • Similar (what other words appear in similar contexts)
      • text1.similar("monstrous")
    • Common Context (find common context shared by 2+ words)
      • text2.common_contexts(["monstrous", "very"])
    • Dispersion Plots (relative offset into text of word(s))
      • text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
    • Generate (generate random text in same style)
      • text3.generate()
    • Counting Vocabulary
      • len(text3)  # total words
      • sorted(set(text3))  # unique words
      • def lexical_diversity(text):
        • return len(set(text)) / len(text)  # lexical richness, % distinct words
      • text3.count("smote")  # specific word count
      • 100 * text4.count('a') / len(text4)
      • def percentage(count, total):
        • return 100 * count / total  # percent of total that is count
      • percentage(text4.count('a'), len(text4))
    • Tokenizer
    • Frequency Distributions
      • fdist1 = FreqDist(text1)
      • fdist1.most_common(50)  # 50 most common words with count
      • fdist1['whale']  # count for word 'whale'
      • Functions Defined for NLTK’s FreqDist (Table 3.1)
        • fdist = FreqDist(samples)
        • fdist[sample] += 1
        • fdist['monstrous']  # count of word
        • fdist.freq('monstrous')  # freq of sample
        • fdist.N()  # total number of samples
        • fdist.most_common(n)
        • for sample in fdist:  # iterate over the samples
        • fdist.max()  # sample with the greatest count
        • fdist.tabulate()  # tabulate freq dist
        • fdist.plot()
        • fdist.plot(cumulative=True)
        • fdist1 |= fdist2  # update fdist1 with counts from fdist2
        • fdist1 < fdist2  # test if samples in fdist1 occur less freq than in fdist2
    • Fine-grained Selection of Words
      • Min length
        • V = set(text1)
        • long_words = [w for w in V if len(w) > 15]
        • sorted(long_words)
      • Combined length with frequency
        • fdist5 = FreqDist(text5)
        • sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
    • Collocations and Bigrams
      • Bigrams
        • list(bigrams(["more", "is", "said", "than", "done"]))
      • Collocations
        • text4.collocations()
    • Stylistics
      • NLTK Brown million word corpus 1961
      • “news” genre from Chicago Tribune: Society Reportage
        • import nltk
        • from nltk.corpus import brown
        • news_text = brown.words(categories='news')
        • fdist = nltk.FreqDist(w.lower() for w in news_text)
        • modals = ['can', 'could', 'may', 'might', 'must', 'will']
        • for m in modals:
          • print(m + ':', fdist[m], end=' ')
        • cfd = nltk.ConditionalFreqDist(
          • (genre, word)
          • for genre in brown.categories()
          • for word in brown.words(categories=genre))
        • genres = ['news', 'religion', …]
        • modals = ['can', 'could', …]
        • cfd.tabulate(conditions=genres, samples=modals)
      • Reuters Corpus 1.3 million words from 10,788 news documents, 90 topics, divided into two sets: training/test
        • from nltk.corpus import reuters
        • reuters.fileids()
        • reuters.categories()
      • Inaugural Address Corpus to chart word count over time
        • from nltk.corpus import inaugural
        • cfd = nltk.ConditionalFreqDist(
          • (target, fileid[:4])
          • for fileid in inaugural.fileids()
          • for w in inaugural.words(fileid)
          • for target in ['american', 'citizen']
          • if w.lower().startswith(target))
        • cfd.plot()
    • Loading Your Own Corpus
      • from nltk.corpus import PlaintextCorpusReader
      • corpus_root = '/usr/share/dict'
      • wordlists = PlaintextCorpusReader(corpus_root, '.*')
      • wordlists.fileids()
      • wordlists.words('connectives')
  • Word Count
    • from nltk.tokenize import RegexpTokenizer
    • tokenizer = RegexpTokenizer(r'\w+')
    • with open("filename", "r") as file:
      • text = file.read().strip()
    • tokens = tokenizer.tokenize(text)
    • len(tokens)  # word count
  • Sentence Length
  • Vocabulary Diversity
  • Reading Difficulty Level
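
The simpler metrics in the list above (word count, vocabulary diversity) can be sketched with the standard library alone. This is a minimal illustration, not the code we ran: page_stats and the sample sentence are made up here, re.findall with \w+ stands in for NLTK's RegexpTokenizer(r'\w+'), and collections.Counter stands in for FreqDist.

```python
import re
from collections import Counter

def page_stats(text):
    """Basic statistics for one page of OCR text (illustrative helper)."""
    # Lower-case first so "The" and "the" count together, then take runs of
    # word characters, the same pattern passed to RegexpTokenizer above.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        "word_count": total,
        "unique_words": len(counts),
        # Share of distinct words; 0.0 for an empty page
        "lexical_diversity": len(counts) / total if total else 0.0,
        "avg_word_length": sum(len(t) for t in tokens) / total if total else 0.0,
        "most_common": counts.most_common(3),
    }

sample = "The college voted and the college paper reported the vote"
stats = page_stats(sample)
print(stats)
```

Because these measures need only word tokens, they still work on pages where the missing periods defeat sentence-level tools.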


A good overview of Natural Language Processing at SlideShare


Visualizing and Extracting XML

We have our HackOH5 dataset in two formats, (1) ALTO XML and (2) Simplified XML, which require different extraction techniques.  First, we should visualize each file to get a better idea of its internal structure and identify the exact items we want to extract.  Each piece of data in the XML file can be uniquely identified by its XPath (like an internal URL), which we’ll need in the extraction process.
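
To make the XPath idea concrete, here is a small sketch using Python's built-in ElementTree, which supports a limited subset of XPath. The excerpt is invented for illustration: only the record and title tags come from our dataset, while the records wrapper, the page element and its n attribute are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; tag names other than <record> and <title> are assumptions.
xml_text = """<records>
  <record><title>The Wooster Voice</title><page n="1">First page text</page></record>
  <record><title>The Denisonian</title><page n="1">Other page text</page></record>
</records>"""

root = ET.fromstring(xml_text)

# A path is written relative to the element being searched
titles = [t.text for t in root.findall("./record/title")]
print(titles)

# Predicates can select elements by attribute value
first_pages = root.findall(".//page[@n='1']")
print(len(first_pages))
```

For full XPath 1.0 support, a library such as lxml would be needed; ElementTree is enough for simple paths like these.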

There are a number of websites where you can paste in XML text and have it nicely formatted into a tree structure, along with the XPath to each element within the XML document.  For our 1.47GB Simplified XML document, we’ll have to cut and paste in small well-formed (balanced tags) excerpts.  Here are several XML visualization websites I found useful.

Various IDEs and text editors have the ability to parse XML documents as well, such as the “XML Tools” extension for the free Microsoft IDE VS Code.
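
For the small well-formed excerpts mentioned above, one option is to pull the first few records out of the big file programmatically instead of cutting by hand. This is a minimal sketch under a few assumptions: first_records is a hypothetical helper, the excerpt wrapper element is an arbitrary name, n is assumed to be at least 1, and a tiny in-memory document stands in for the real 1.47GB file.

```python
import io
import xml.etree.ElementTree as ET

def first_records(source, n, tag="record"):
    """Return a well-formed XML string holding the first n <tag> elements.

    iterparse streams the input, so parsing stops as soon as n records
    have been collected. The <excerpt> wrapper name is chosen arbitrarily.
    """
    wrapper = ET.Element("excerpt")
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            wrapper.append(elem)
            if len(wrapper) >= n:
                break
    return ET.tostring(wrapper, encoding="unicode")

# Small in-memory stand-in for the large file
demo = io.BytesIO(b"<all><record><title>A</title></record>"
                  b"<record><title>B</title></record>"
                  b"<record><title>C</title></record></all>")
excerpt = first_records(demo, 2)
print(excerpt)
```

Passing a filename such as the source XML instead of the BytesIO object should work the same way, since iterparse accepts either.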



Digital Humanities Hackathon

This weekend’s hackathon is qualitatively different from typical hackathons.  For us as digital humanists it is about maintaining focus on the problem domain rather than untangling the technology.  As a technical facilitator my biggest job is in helping to abstract out the low-level complexity with judicious tool selection and workflows that free you to devote as much mental processing as possible to the more interesting and creative high-level problems.  As such, we’ll be focusing on the forest more than the trees.

On the other hand, one of the biggest weaknesses of Digital Humanities is that it is often too disconnected from technology, which results in less than optimal research.  This is not a problem for leaders in Digital Humanities who have begun to apply the most sophisticated technologies to Digital Humanities research, such as Machine Learning techniques and Data Visualizations borrowed from Big Data genomic research.  However, the field as a whole still spends far too much time deep in the woods of XML Schema, at times to the detriment of more ambitious goals and of student interest.

Here is an overview of what value we can extract out of our experience tomorrow.  Given competing demands with other classes and the general background of our team, I’ve slightly modified goals to a more realistic subset rather than dilute the core objectives.  As I’ve mentioned in our run-up meetings, it is my hope that this experience will give you insights beyond the classroom that will prove invaluable in grad school, your career and beyond.  Oh, and let’s have fun with this tomorrow.


Domain Experts/Analysts

  • How to read and think about Data
  • Survey similar Digital Humanities Research
  • Formulate Good Research Questions
  • Effectively Presenting a Narrative with Data
  • Communicating with Technical Teams
  • Go beyond Theory to Create a Project of Genuine Interest

Analysts/Programmers ( * beyond the resources of this event )

  • Understanding Data Acquisition:  Acquisition (OCR), Formats, etc
  • * ( XML Markup Language including XPath Syntax )
  • XML/HTML Parsing Engines (BeautifulSoup4, lxml)
  • Regular Expression Syntax (RegEx)
  • * ( Python, Pandas, and other Python Libraries )
  • Jupyter Notebooks for Data Exploration
  • * ( Simpler Visual ML/Natural Language Processing with RapidMiner )

Programmers/Data Visualizers

  • Visualization Guidelines/Types
  • * ( Visualizations with Jupyter Notebooks )
  • Visualizations with Tableau Public
  • Create Static and Interactive Visualizations
  • Story and Visual Narrative Presentation in Data Science

All Team Members

  • Be able to Think in a more Data-Driven Manner
  • Improve visual storytelling
  • Work and communicate across team specialties
  • Understand the complete Data Science / Analytic Pipeline Process
  • Learn higher-level abstractions that don’t require coding


