Fast Search for Word Count in Large File

One of the ideas we had for the HackOH5 hackathon was tracking the lifecycle of words related to larger social phenomena. To do this we need to create various dictionaries of terms related to each phenomena we like to observe with the frequency of these dictionary terms over time reflecting the strength of the phenomena.

For example, how do the rise and fall of the use of terms like “Chinamen”, “Chinese”, “Japanese”, “Oriental”, “Jap”, “Asian”, etc. over time reflect upon how society perception of Asian-Americans has changed over time? How to commonly accepted and respectable terms like “Negro” (eg United Negro College Fund 1944- and the American Negro League for professional baseball established 1937-1960) fall into disrepute and/or become eclipsed by newer terms? How do the timing and profile of the life and death of such terms compare across the Ohio5 Colleges and relative to society at large? What does this say about the distinctive cultures within and across the Small Liberal Arts College?

To accomplish this we will be doing a lot of word searches over the relatively large corpus of approximately 1.47GB. As such, we’ll need to find the fastest way to batch search a dictionary of terms against this large file. What are our options?

[1] Unix Command Line: fgrep, grep and egrep utilities

The differences between fgrep, grep and egrep are indicated in the names: [fgrep] fast/fixed-sized string (no regex interpretation), [grep] regular and [egrep] extended regular expression variation of the popular Unix grep utility. Here is a simple test run comparing execution speed:

[2] Python:

StackOverflow has a post with an excellent explaination of how you can optimize python code to achieve a 100x speedup over a simple interpreted “grep” command within Python. The key is to use mmap to efficiently stream read the file without consuming a lot of memory.

import io
import mmap

def grep_3(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

Here is a comparison of speed improvements over several variations of Python code searching for a word in a file:

10 loops, best of 3: 639 msec per loop       # grep()
10 loops, best of 3: 78.7 msec per loop      # grep_1()
10 loops, best of 3: 19.4 msec per loop      # grep_2()
100 loops, best of 3: 5.32 msec per loop     # grep_3()

Another improvement is to use “\b” boundry in search expressions where possible as the following examples show. Notice an ~80% reduction in execution time when you can define both ends/identify termination character at both ends of the string.

$time fgrep ‘money’ oh5_newspapers.xml | wc -l

32954

real 0m25.526s
user 0m24.346s
sys 0m0.548s

$time fgrep ‘\bmoney’ oh5_newspapers.xml | wc -l

real 0m4.189s
user 0m3.197s
sys 0m0.268s

$time fgrep ‘\bmoney\b’ oh5_newspapers.xml | wc -l

real 0m4.035s
user 0m3.234s
sys 0m0.263s

Another speedup is provided by temporarily setting the local environment in which the grep command executes to 128 character ASCII set rather than the more typical 110,000 character UTF-8 set using LC_ALL=C or LANG=C. An unscientific single run of this change shows an ~% reduction in execution time.

This comes with cautions that it primarily helps for case-insensitive searches and could screw up sort and other commands that assume an UTF-8 encoding.

Yet another potential speed up on the Mac is by installing GNU grep (script to install other duplicate GNU tools) alongside the OS X default FreeBSD grep. I witnessed irregular GNU grep (renamed to ggrep on my Mac) speedups of ~50% reduction over fgrep and ~60% reduction over grep doing a simple ‘word’ search. Using LC_ALL=C/LANG=C I was ~66% reductions over grep. Stacking on “\bstring\b” made little gave approximately the same result.

There are more esoteric tips on how to optimize grep but I suspect the combination of tip above overlap with them to some degree and there is a point of diminishing return here. But to summarize our grep research:

Use GNU grep (ggrep on Mac OS X or fgrep where possible
Bound your search string wherever possible with “\b” at one or both ends (<“-w”)
Execute grep command in temporary environment with LC_ALL or LANG=C
Try not to use case insensitive switches “-i” (make lowercase first if possible)
DO NOT USE the “-c” switch will display word counts lines with word, so lines with multiple word instances are undercounted as only once instance
If multiple search words, make one pass with “-F” or “-f”
(Depricated) “-mmap” to read input instead of default read(2) system call
Switches “-ow” may help, o=show only match/not line, w=whole word, P=Perl RegEx?
Set max number of matches with “-m” in case of typos
Search for longer strings if possible, they should be faster with grep’s algorithm

$LC_ALL=C ggrep -ow ‘word’ <filename> | wc -l

Other tuning that can be done outside ggrep and switches:

Split up file into smaller units and run in parallel on separate machines
If multicore CPU machine (overview), use “parallel –progress -a file1 ‘grep -F {} file2’
Parallel on mac install, distributed computing (parallel is not faster for me on 2 core Intel i5 2014 mac mini), GNU parallel tutorial (check core with “sysctl -n hw.ncpu” / 2 )
grep is generally I/O bound so parallelism on the same machine will be of little value
Get more memory, better CPU (more cores), SSD HDD, GPU for some utilities

< filename.txt parallel –pipepart LANG=C grep -FC 5 ‘word’

or

< filename.txt parallel –pipepart –block 10M grep -FC 5 ‘word’

# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames 
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]

# Now use split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep $id >> matches.txt ; done
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files

Here are a few other well-known or promising grep-like utilities for quickly processing large files:

Awk offers a mini-programming language and can be used for quick filtering and report writing

Gawk or GNU Awk

Sed is stream editor in the great tradition of UNIX pipes and can modifying a stream of data/text in complex ways

Agrep does fuzzy matching-like grep on Linux system (not Mac OS X)

Goagrep does fuzzy matching like agrep and is sometimes faster and can match longer words by creating either a standalone server or index.db to speed file searches.

TRE is another fuzzy matching utility with time complexity of O(M^2N) where M is the length of the regular expression and N is the length of the ext.

ripgrep claims to be the fastest grep-like search engine. Built on Rust, it accomplishes this in part by sacrificing some less used features of egrep like backreferences and look around. It defaults to recursively searching files in a directory tree, but on a single file search it was about 2x as slow as “$LC_ALL=C ggrep -oc ‘word’ <filename>”.

PolyCogBlog

Technology, Cognition and The Digital Humanities