Our HackOH5 dataset comes in two formats, (1) ALTO XML and (2) Simplified XML, which require different extraction techniques. First, we should visualize each file to get a better idea of its internal structure and identify the exact items we want to extract. Each piece of data in an XML file can be uniquely identified by its XPath (like an internal URL), which we’ll need in the extraction process.
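To make the XPath idea concrete, here is a minimal sketch using Python’s built-in `xml.etree.ElementTree`. The sample document, its tag names, and the `list_xpaths` helper are all made up for illustration and are not the real HackOH5 schema. Also note that ElementTree supports only a subset of XPath, and that in namespaced documents (ALTO files typically declare a namespace) each tag is reported with a `{namespace}` prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag names are placeholders, not the HackOH5 schema.
SAMPLE = """<library>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</library>"""


def list_xpaths(elem, path):
    """Return a positional XPath for elem and every element below it."""
    paths = [path]
    seen = {}  # how many children with each tag we have met so far
    for child in elem:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
        paths.extend(list_xpaths(child, child_path))
    return paths


root = ET.fromstring(SAMPLE)
xpaths = list_xpaths(root, f"/{root.tag}")
for p in xpaths:
    print(p)

# Look an element up by a path relative to the root element.
second_title = root.find("./book[2]/title").text
print(second_title)  # prints "Second"
```

The same lookup can be written with an attribute predicate, `root.find(".//book[@id='b2']/title")`, which is often more robust than a positional index when the order of records is not guaranteed.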
A number of websites let you paste in XML text and will format it into a tree structure as well as give you the XPath to each element within the document. For our 1.47GB Simplified XML document, we’ll have to cut and paste small, well-formed (balanced tags) excerpts. Here are several XML visualization websites I found useful:
- CodeBeautify.org: view overall XML tree structure
- xmlgrid.net: find XPath within collapsible XML tree
- xmlbeautifier.com: most comprehensive XML viewer with attributes and namespaces
Various IDEs and text editors can parse XML documents as well, for example the “XML Tools” extension for Microsoft’s free VS Code editor.
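Cutting a balanced excerpt out of a 1.47GB file by hand is awkward, so one option is to let a streaming parser do it. The sketch below is only an assumption about how this could be done, not the method used to prepare the dataset: the hypothetical `write_excerpt` helper reads the file incrementally with `ET.iterparse`, copies the first few direct children of the root element, and writes them out under a copy of the root tag so the result stays well-formed.

```python
import xml.etree.ElementTree as ET


def write_excerpt(src_path, dst_path, max_children=3):
    """Copy the first max_children children of the root into a small well-formed file."""
    with open(src_path, "rb") as src:
        context = ET.iterparse(src, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        excerpt = ET.Element(root.tag, root.attrib)
        depth = 0  # nesting depth below the root
        for event, elem in context:
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 0:  # a direct child of the root has just closed
                    excerpt.append(elem)
                    if len(excerpt) >= max_children:
                        break  # stop reading; the rest of the file is never parsed
    ET.ElementTree(excerpt).write(dst_path, encoding="utf-8", xml_declaration=True)
    return len(excerpt)


# Example call (file names are placeholders):
# write_excerpt("simplified.xml", "excerpt.xml", max_children=5)
```

Because parsing stops as soon as enough records have been collected, only the beginning of the large file is read, and the resulting excerpt is small enough to paste into any of the viewers listed above.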