If 0.3 nanoseconds were 1 second, it would take 32 millennia to reboot your computer
In our HackOH5 hackathon we have 24 hours to clean, analyze, and visualize over 170,000 files. The files consist of unstructured, relatively noisy OCR'd newspaper text dating back to 1856. Even in the best of situations (e.g., with clean, structured numeric data), building machine learning models, and especially neural networks, can take a long time on even modestly sized datasets.
The quality of our analysis will depend on the quality of our dataset, which we'll have to clean up in a number of ways. To make the most of our limited 24-hour window, we want to optimize both our cleaning of the dataset and our ML/NN algorithms. In this post, we'll talk about cleaning our dataset.
Our Python data scrubbing, like any other program, could be optimized at a number of points. In thinking about optimization, there is a clear priority ordering that yields the fastest performance gains. Although it was published in 2012 and needs some updating around the edges (the 7200 RPM HDD and SSD numbers have likely shifted, but not by orders of magnitude), here is a good guide for setting priorities:
```
Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

Credit
------
By Jeff Dean:               http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers

Contributions
-------------
Some updates from:       https://gist.github.com/2843375
'Humanized' comparison:  https://gist.github.com/2843375
Visual comparison chart: http://i.imgur.com/k0t1e.png
Animated presentation:   http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
```
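As a sanity check on the post's title, here is a minimal Python sketch of the "humanized" scaling, assuming a roughly 5-minute reboot (the reboot duration is my assumption, not from the table):

```python
# Humanize latency numbers: pretend 0.3 ns lasts 1 second, as in the title.
SCALE = 1 / 0.3e-9  # seconds of "human time" per real second

def humanize(real_seconds):
    """Return real_seconds stretched by the 0.3 ns -> 1 s scaling, in years."""
    return real_seconds * SCALE / (365.25 * 24 * 3600)

# A ~5 minute reboot, humanized:
print(f"Reboot: ~{humanize(5 * 60):,.0f} years")  # ~31,689 years, i.e. about 32 millennia
# A 10 ms disk seek, humanized, stretches to roughly a year.
print(f"Disk seek: ~{humanize(10e-3):.1f} years")
```

The same scaling turns a 100 ns main-memory reference into about 5.5 minutes, which is why the table's ratios matter more than the absolute numbers.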
There is also a good Prezi visualization that expands on these numbers, and a great presentation showing how the relative numbers change year by year through 2017.
Basically, what we see is that we need to prioritize our Python data scrubbing to efficiently utilize, in order, (1) network bandwidth, (2) disk I/O, and (3) memory I/O, with each element having an order of magnitude or two less impact than the previous one. As our datasets get larger and our algorithms more complex, even optimizing the lesser factors becomes noticeable. Here are common ways to optimize these three factors:
- Network Bandwidth
- Filter out and prune noise in the data files, such as unused metatags or misspelled words that cannot be corrected, to shrink file size
- Preprocess the dataset into a more condensed form, semantically or syntactically, without losing valuable information
- Split the dataset across multiple computers and transfer data in parallel
- Compress datasets like plaintext, which shrink significantly, before transferring over the network; decompress on the other end in a streaming or parallel fashion
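The compress-then-stream idea above can be sketched with the standard-library `gzip` module. The sample text is a made-up stand-in for an OCR'd newspaper file:

```python
import gzip
import io

# Hypothetical stand-in for a noisy OCR'd newspaper file.
text = ("THE DAILY GAZETTE, 1856. " * 2000).encode("utf-8")

# Compress before sending over the network; repetitive plaintext shrinks a lot.
compressed = gzip.compress(text)
print(f"raw: {len(text):,} bytes, gzipped: {len(compressed):,} bytes "
      f"({len(compressed) / len(text):.1%})")

# Streaming decompression on the receiving end: read the gzipped payload in
# chunks, as you would from a socket, instead of inflating it all at once.
chunks = []
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
    while block := stream.read(64 * 1024):
        chunks.append(block)
assert b"".join(chunks) == text
```

Real OCR text compresses less than this artificially repetitive sample, but plaintext ratios of 3-5x are common, which directly multiplies effective network bandwidth.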
- Disk I/O
- Use the fastest disk possible: SSD > HDD. Among HDDs, 7200 RPM generally beats 5400 RPM, although the less-advertised throughput figure, which factors in caching strategies and more, is a more meaningful overall measure.
- If available, use SSDs over mechanical spinning HDDs; some cloud storage providers offer this at a premium price. Intel claims SSDs are 8x faster than mechanical HDDs, but SSDs also cost about 5x more per GB (2017).
- If possible, configure the disk for optimal performance: multiple striped disks on a fast bus, a performant file system type (e.g., ZFS with plenty of fast memory versus the default Linux ext4), and file system tuning (e.g., journaling off, defragmented, etc.).
- Memory I/O
- The general rule is that more memory at every level is better, since all data must pass through memory of some sort to be processed or transmitted/received. With proper configuration and programming techniques, more memory means fewer disk and network accesses, lower relative overhead for handling data transfers, and faster CPU/GPU processing of the data, whether that is a machine learning algorithm or de/compression.
- The general rule is that faster memory, like DDR3 SDRAM with up to twice the data-transfer bandwidth of its predecessor DDR2, is always good. DDR2, DDR3, and the newest DDR4 all have different physical specs, are not interoperable, and require computer systems designed specifically for each type. DDR5 is in the design phase, and DDR4 has not yet reached PC market penetration. As an aside, CPU cache hit rate is an interesting topic that relates more to ML/NN algorithms but explains some of the underlying mechanisms at work.
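One programming technique that stretches whatever memory is available is streaming with generators, so that only one line of a file is ever resident at a time. A minimal sketch (the scrubbing steps shown, blank-line removal and lowercasing, are illustrative assumptions, not our actual pipeline):

```python
def clean_lines(path):
    """Yield scrubbed lines one at a time so the whole file never
    has to fit in memory at once."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line:                 # drop blank OCR lines
                yield line.lower()   # normalize case

# Usage: the generator is consumed lazily, keeping the working set small.
# for line in clean_lines("some_newspaper_page.txt"):
#     process(line)
```

Chaining several such generators builds a cleaning pipeline whose memory footprint stays constant no matter how large the input file is.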
- Unix command-line programs like awk may be faster when the processing is simple and single-pass
- To reduce file I/O at the Python level, read in large blocks or entire files at a time
- Open a file as few times as possible rather than using multiple open/close cycles for separate processing steps (at the cost of less reusable code)
- Using different versions of Python, Cython, etc. may make a difference for unusual edge cases, like network processing that bypasses TCP in favor of a faster protocol for certain data types/sizes. It also helps to understand the interactions between Python, the C code it often calls, and the underlying Unix OS.
Finally, all of these are general guesstimates. The only real way to know what improves performance is to find well-designed, realistic, and applicable benchmarks and iteratively refine. Both Python and Unix provide a wealth of tools for assessing performance and automating testing.
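For example, the standard-library `timeit` module makes micro-benchmarks easy. A sketch comparing two hypothetical ways of stripping punctuation from a noisy OCR line (the sample string and iteration count are arbitrary assumptions):

```python
import timeit

# Shared setup: a made-up noisy OCR line plus the two cleaning approaches.
setup = """
import re, string
line = "Th3 qu!ck, brown;; fox 1856..." * 100
table = str.maketrans('', '', string.punctuation)
pattern = re.compile('[%s]' % re.escape(string.punctuation))
"""

t_translate = timeit.timeit("line.translate(table)", setup=setup, number=2_000)
t_regex = timeit.timeit("pattern.sub('', line)", setup=setup, number=2_000)
print(f"str.translate: {t_translate:.3f}s   regex sub: {t_regex:.3f}s")
```

Beyond `timeit`, Python's `cProfile` shows where whole programs spend their time, and Unix tools like `time`, `iostat`, and `vmstat` reveal whether the bottleneck is CPU, disk, or memory.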
These optimizations cover the data-cleaning stage. The analysis stage has additional optimization steps, distributed processing and multiple GPUs being the most significant, but that remains for another post.