With the proliferation of Data Science tools, from programming languages like Python (general), R (statistical) and Scala (big data) and their specialized libraries, to advanced drag-and-drop solutions like RapidMiner and Orange, to similar offerings appearing in the cloud like AzureML, it’s easy to forget the lowly Data Science tools available at the command line (mostly Unix-based).
Unix command line tools are particularly suited for the early phases of Data Science, and there are both low-profile books and courses on the subject that try to bring the bearded 70’s hippie Unix guru into the age of Deep Neural Nets. The evolution of IPython into Jupyter, which integrates multiple popular Data Science languages like Python, R and Julia with graphical libraries like matplotlib, ggplot and D3.js in a sharable and replayable notebook, has replaced much of command line Data Science. Even the command line, including shell scripts, can be integrated into Jupyter notebooks.
Still, for those who are familiar with this ubiquitous environment, there is nothing as fast and effortless as the command line. Early and agile data munging and exploration, automated big data processing and working on GUI-less servers in the cloud all seem ideal use cases for command line Data Science over Jupyter notebooks.
Doing Data Science on the command line has several inherent advantages over graphical interfaces like Jupyter:
- Nearly every Data Science platform has an underlying Unix-like command line shell
- There is a plethora of well-used and reliable Unix commands suitable for data exploration
- These Unix commands are generally simple but can easily be combined into more complex operations with Unix pipelines
- Related to the last point, many of these commands work on streams rather than entire files, so they can easily work on very large files without the typical memory constraints.
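As a minimal sketch of the last two points, the pipeline below chains a handful of small commands to find the most frequent words in a stream of text. The function name `top_words` and the sample sentence are illustrative assumptions, not taken from any of the books mentioned here.

```shell
#!/usr/bin/env bash
# top_words N: read text on stdin and print the N most frequent words
# with their counts. Hypothetical helper, for illustration only.
top_words() {
  tr -cs '[:alpha:]' '\n' |        # turn every run of non-letters into a newline
    tr '[:upper:]' '[:lower:]' |   # normalize case
    sort |                         # bring identical words together
    uniq -c |                      # count each group of identical lines
    sort -k1,1nr -k2,2 |           # highest count first, ties in alphabetical order
    head -n "$1"                   # keep only the top N lines
}

# Made-up input: "the" appears three times, "and" twice.
printf 'the cat and the hat and the bat\n' | top_words 2
```

On the sample sentence this reports "the" with a count of 3, then "and" with a count of 2. Each stage reads from the previous one as data arrives. The `sort` stages have to see all of their input before emitting anything, but GNU sort spills to temporary files rather than holding everything in memory, so the same pipeline should still cope with inputs larger than RAM.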
Jeroen Janssens’ Data Science at the Command Line by O’Reilly offers one of the more interesting illustrations and reviews of this topic. The author also generously compiled a Vagrant configuration/VirtualBox Ubuntu 14.04 instance that contains dozens of Unix commands, utilities, programs and shell scripts useful for doing Data Science at the Command Line. You can see Janssens present his perspective, his argument and an interesting demonstration of doing Data Science on the command line on YouTube.
I also recommend the excellent reference Unix Power Tools, 3rd Edition from O’Reilly. Among command line tools, sed and awk deserve their own book to explain how to harness their powerful capabilities (Sed & Awk, 2nd Edition by O’Reilly). A more in-depth reference on Unix shell scripting, like Mastering Unix Shell Scripting by Randal Michael (Wiley), helps to explain more complex automated workflows from a sysadmin perspective.
Here is a summary of the most useful Unix commands for Data Science per Jeroen Janssens’ book:
Retrieving Information from the Internet:
curl, curlicue, wget, httpie, scrape
Displaying Text Files:
cat, more, head, tail, less (big files), body, header
Manipulating Text Files:
sort, tr, uniq, cols, cut, fieldsplit, paste, split, wc
echo, seq, sample, shuf
dseq (generates a sequence of dates relative to today)
CSV Manipulations: (csvkit)
csvcut, csvgrep, csvjoin, csvlook, csvsort, csvsql, csvstack, csvstat, in2csv, sql2csv
unpack, unrar, unzip
drake (workflow), env, alias, parallel, type, tree, which, find
Rio (CSV to data.frame to PNG/CSV), Rio-scatter
display (imagemagick), feedgnuplot
tapkee (dimensionality reduction), weka
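To give a feel for how a few of the commands listed above fit together, here is a small sketch that inspects a CSV file with head, tail, cut, sort, uniq, wc, paste and shuf, plus a short awk step for formatting. It deliberately sticks to standard tools rather than csvkit or the scripts bundled with Janssens’ book. The file `sales.csv`, its columns and the helper `data_rows` are made up for illustration, and `shuf` is assumed to be available (it ships with GNU coreutils).

```shell
#!/usr/bin/env bash
# Write a tiny, made-up CSV file into a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/sales.csv" <<'EOF'
region,product,units
north,apple,10
south,pear,4
north,pear,7
east,apple,3
EOF

# First line is the header; everything from line 2 onward is data.
header=$(head -n 1 "$workdir/sales.csv")
data_rows() { tail -n +2 "$workdir/sales.csv"; }   # hypothetical helper

# Number of data rows (tr strips the padding some wc versions add).
rows=$(data_rows | wc -l | tr -d ' ')

# Rows per region: take column 1, sort it, count duplicates,
# then reshape "count value" into "value=count" on a single line.
region_counts=$(data_rows | cut -d, -f1 | sort | uniq -c |
  awk '{print $2 "=" $1}' | paste -sd' ' -)

# Two randomly chosen data rows; the selection differs from run to run.
sample=$(data_rows | shuf -n 2)

echo "columns: $header"
echo "rows: $rows"
echo "per region: $region_counts"
echo "sample:"
echo "$sample"
```

With the data above, the script reports four data rows, of which two are in the north region and one each in east and south. Splitting on commas with `cut` is only safe for simple files like this one; fields that contain quoted commas are a reason to reach for csvkit instead.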