If 0.3 nanoseconds were 1 second, it would take 32 millennia to reboot your computer


In our HackOH5 hackathon we have 24 hours to clean, analyze, and visualize over 170,000 files.  The files consist of unstructured, relatively noisy OCR’d newspaper text dating from 1856.  Even in the best of situations (e.g., with clean, structured numeric data), machine learning and especially neural network model building can take a long time on even modestly sized datasets.

The quality of our analysis will depend on the quality of our dataset, which we’ll have to clean up in a number of ways.  To make the most of our limited 24-hour window, we want to optimize both the cleaning of the dataset and our ML/NN algorithms.  In this post, we’ll talk about cleaning the dataset.

Our Python data scrubbing, like any other program, could be optimized at a number of points.  In thinking about optimization, there is a clear priority order that will yield the fastest performance gains.  Although it was published in 2012 and needs updating around the edges (the 7200rpm HDD and SSD numbers are probably different now, but not by orders of magnitude), here is a good guide for setting your priorities:

Latency Comparison Numbers
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

By Jeff Dean:               http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers

Some updates from:       https://gist.github.com/2843375
'Humanized' comparison:  https://gist.github.com/2843375
Visual comparison chart: http://i.imgur.com/k0t1e.png
Animated presentation:   http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt

There is also a good Prezi visualization that expands upon these numbers, and a great presentation that shows how these relative numbers have changed year by year through 2017.

Basically, what we see is that we should prioritize our Python data scrubbing to efficiently utilize (1) network bandwidth, (2) disk I/O, and (3) memory I/O, with each element having an order of magnitude or two less impact than the previous one.  As our datasets get larger and our algorithms more complex, even optimizing the lesser factors becomes noticeable.  Here are common ways to optimize these three factors:

  1. Network Bandwidth
    1. Filter out and prune noise that shrinks file size, like unused metatags or misspelled words that cannot be corrected
    2. Where possible, preprocess the dataset into a more condensed form, semantically or syntactically, without losing valuable information
    3. Split the dataset across multiple computers and transfer the data in parallel
    4. Compress datasets that shrink significantly, like plaintext, before transferring them over the network, then decompress on the other end in a streaming or parallel fashion
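As a quick illustration of point 4, plaintext like our newspaper files compresses well with Python's standard gzip module (the sample text below is placeholder data, not an actual file from the dataset):

```python
import gzip

# Hypothetical sample standing in for one OCR'd newspaper file.
text = ("The quick brown fox jumps over the lazy dog. " * 2000).encode("utf-8")

compressed = gzip.compress(text)          # compress once before transfer
ratio = len(text) / len(compressed)

print(f"original:   {len(text):>8} bytes")
print(f"compressed: {len(compressed):>8} bytes ({ratio:.0f}x smaller)")

# On the receiving end, decompress back to the original bytes.
restored = gzip.decompress(compressed)
assert restored == text
```

Repetitive text like this compresses dramatically; real OCR noise will compress less, but a 3-10x reduction on plaintext is common and directly cuts network transfer time.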
  2. Disk I/O
    1. Use the fastest disk possible:  SSD > 7200rpm HDD > 5400rpm HDD.  Also look at the less-advertised throughput numbers, which factor in caching strategies and other variables, for a more meaningful overall measure.
    2. If available, use an SSD over a mechanical spinning HDD; some virtual storage cloud providers offer this at a premium price.  Intel claims SSDs are 8x faster than mechanical HDDs, but SSDs also cost about 5x more per GB (2017).
    3. If possible, configure the disk for optimal performance, including:  multiple striped disks on a fast bus, a performant file system type (e.g., ZFS with a lot of fast memory vs. the default Linux ext4), and file system configuration (e.g., journaling off, not fragmented).
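Before tuning any of this, it helps to know what the disk actually delivers. Here is a rough sketch that times a sequential read of a scratch file; note that the OS page cache will inflate the number for a file just written, so treat it as an upper bound:

```python
import os
import time

# Write a 64 MB scratch file of random bytes, then time reading it back
# sequentially in 1 MB blocks. Path and size are arbitrary choices.
path = "scratch.bin"
size = 64 * 1024 * 1024

with open(path, "wb") as f:
    f.write(os.urandom(size))

start = time.perf_counter()
with open(path, "rb") as f:
    while f.read(1024 * 1024):   # sequential 1 MB reads
        pass
elapsed = time.perf_counter() - start

print(f"sequential read: {size / elapsed / 1e6:.0f} MB/s")
os.remove(path)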
  3. Memory
    1. The general rule is that more memory at every level is better, since all data must pass through memory of some sort to be processed or transmitted/received.  With proper configuration and programming techniques, more memory means fewer disk and network accesses, smaller relative overhead for handling those data transfers, and faster CPU/GPU processing of the data, whether by a machine learning algorithm or de/compression.
    2. Faster memory is also generally better: DDR3 SDRAM, for example, offers up to twice the data transfer bandwidth of its predecessor, DDR2.  DDR2, DDR3, and the newest DDR4 all have different physical specs, are not interoperable, and require computer systems specifically designed for each type.  DDR5 is in the design phase, and DDR4 has not yet reached full PC market penetration.  As an aside, CPU cache hit rate is an interesting topic that relates more to ML/NN algorithms but explains some of the underlying mechanisms at work.
  4. Programming
    1. Unix command-line programs like awk may be faster when the processing is simple and single pass
    2. To reduce file I/O at the Python level, read in large blocks or entire files at a time
    3. Open each file as few times as possible rather than doing multiple open/closes across processing steps (at the cost of less reusable code)
    4. Using different versions of Python, Cython, etc. may make a difference for unusual edge cases, like network processing that bypasses TCP for a faster protocol for certain data types/sizes.  It also helps to understand the interactions between Python, the C code it is often calling, and the underlying Unix OS.
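Points 2 and 3 above can be sketched together: open each file once, read it in a single call, and run every scrubbing pass on the in-memory string. The cleaning steps shown are illustrative stand-ins, not our actual scrubbing rules:

```python
import tempfile

def clean_file(path):
    # One open, one read for the whole file.
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # All scrubbing passes reuse the same in-memory copy.
    text = text.replace("\x00", "")     # strip stray NUL bytes from OCR
    text = " ".join(text.split())       # collapse whitespace noise
    return text

# Demo on a throwaway file standing in for one noisy OCR'd page.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("THE  DAILY\t\tHERALD\n\n  June   1856  ")
    page = tmp.name

print(clean_file(page))  # THE DAILY HERALD June 1856
```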

Finally, all of these are rough estimates.  The only real way to know what improves performance is to find well-designed, realistic, and applicable benchmarks and refine iteratively.  Both Python and Unix provide a wealth of tools for assessing performance and automating testing.
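For instance, Python's standard timeit module lets us benchmark two candidate scrubbing implementations on the same sample before committing to either (both functions and the sample text here are illustrative, not our actual scrubber):

```python
import timeit

sample = "teh  quick\tbrown   fox\n" * 1000

def scrub_split(text):
    # Collapse all whitespace in one pass over the whole string.
    return " ".join(text.split())

def scrub_loop(text):
    # Same result, but line by line.
    out = []
    for line in text.splitlines():
        out.append(" ".join(line.split()))
    return " ".join(out)

# Sanity-check that the candidates agree before timing them.
assert scrub_split(sample) == scrub_loop(sample)

for fn in (scrub_split, scrub_loop):
    t = timeit.timeit(lambda: fn(sample), number=200)
    print(f"{fn.__name__}: {t:.3f}s for 200 runs")
```

cProfile serves the same purpose at whole-program scale, and on the Unix side time(1) and iostat help separate CPU time from I/O wait.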

These optimizations are for the data cleaning stage.  For our analytic stage there are additional optimization steps, with distributed processing and multiple GPUs being the most significant, but that remains for another post.