Dmitry Zinoviev does a decent job of isolating and explaining the core of Python used most frequently for Data Science in his book Data Science Essentials in Python (Pragmatic Bookshelf, 2016).  In it, he focuses on several areas of Python in which Data Scientists need to become especially fluent:

  • String functions
    • Case:  lower, upper, capitalize
    • Predicates (T/F): isupper, islower, isspace, isdigit, isalpha
    • Encoding: b"<bin array>" vs "<string>"; decode bytes to string, encode string to bytes
    • String Cleaning: lstrip, rstrip, strip
    • String Munging:  split("x"), " ".join(ls)
    • String Searching/Counting:  find(".com") returns the index of the first match (or -1), count(".") the number of occurrences
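A minimal sketch of these string methods in action (the sample strings are illustrative, not from the book):

```python
# Cleaning, splitting, and joining
phrase = "  Data Science in Python  "
cleaned = phrase.strip()                 # remove surrounding whitespace
words = cleaned.split(" ")               # split on a delimiter -> list
rejoined = "-".join(words)               # join a list back into a string
print(cleaned.lower())                   # "data science in python"

# Predicates and searching/counting
print("42".isdigit(), "abc".isalpha())   # True True
print("www.example.com".find(".com"))    # 11 (index of first match)
print("a.b.c".count("."))                # 2

# Encoding round-trip: str -> bytes -> str
raw = "café".encode("utf-8")             # bytes object
print(raw.decode("utf-8"))               # "café"
```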
  • Data structures
    • Lists:  dynamic arrays; membership tests are O(n), so not ideal for large lookups
    • Tuples:  immutable lists, also O(n) membership
    • Sets:  unordered/unindexed, hash-based, ~O(1) membership tests on average
    • Dictionaries:  map hashable keys (num, bool, str, tuple) -> values, ~O(1) lookup
    • Dictionaries from sequences:  dict(enumerate(seq)), dict(zip(kseq, vseq)), range
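A quick sketch of the dictionary-construction idioms and the membership-speed point (sample data is illustrative):

```python
# Building dictionaries from sequences
seq = ["a", "b", "c"]
d1 = dict(enumerate(seq))           # {0: 'a', 1: 'b', 2: 'c'}
d2 = dict(zip(["x", "y"], [1, 2]))  # {'x': 1, 'y': 2}

# Sets give fast membership tests compared with lists
big = set(range(100000))
print(99999 in big)                 # True, ~O(1) on average
```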
  • List comprehensions
    • Transform a collection into a List
    • Faster and cleaner than loops
    • Nested for performance [line for line in [l.strip() for l in infile] if line]
    • Generator expression with ():  (x**2 for x in myList)  # evaluates to <generator object <genexpr> ...>
    • Counter class to find most/least common in resulting list
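The nested-comprehension pattern and the generator-expression variant can be sketched as follows (the input lines are made up for illustration):

```python
# Nested comprehension: strip every line, then keep only non-empty ones
lines = ["  alpha \n", "\n", " beta\n"]
nonblank = [line for line in [l.strip() for l in lines] if line]
print(nonblank)                     # ['alpha', 'beta']

# Parentheses instead of brackets yield a lazy generator expression
squares = (x ** 2 for x in [1, 2, 3])
sq = list(squares)                  # consume the generator
print(sq)                           # [1, 4, 9]
```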
  • Counters
    • Dictionary-style collection for counting items in another collection
    • from collections import Counter
    • cntr = Counter(phrase.split())
    • cntr.most_common(n)
    • cf:  pandas:  uniqueness, counting, membership
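Putting the Counter lines above together (the phrase is an illustrative example):

```python
from collections import Counter

phrase = "to be or not to be"
cntr = Counter(phrase.split())
print(cntr.most_common(2))   # [('to', 2), ('be', 2)]
print(cntr["or"])            # 1; missing keys count as 0, not KeyError
```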
  • File
    • f = open(name, mode="<r|w|a or rb|wb|ab>"); <read the file>; f.close()
    • with open(name, mode="<r|w|a>") as f: <read the file>  (auto-closed)
    • f.read(<n>), f.readline(<n>), f.readlines():  \n not removed; unsafe unless file is reasonably small
    • f.write(line), f.writelines(["list", "of", "strings"]):  \n not added
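A minimal round-trip using the with-block form (the file name "demo.txt" is a placeholder; note that writelines does not add newlines and readlines does not strip them):

```python
# Write two lines; the with-block closes the handle automatically
with open("demo.txt", "w") as f:
    f.writelines(["first\n", "second\n"])   # \n must be supplied by you

# Read them back; strip the trailing \n ourselves
with open("demo.txt") as f:
    lines = [line.rstrip("\n") for line in f.readlines()]
print(lines)   # ['first', 'second']
```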
  • Web
    • urllib.request.urlopen(URL); consider caching downloads into a local directory
    • like a read-only file handle:  read, readline, and readlines
    • Higher failure rate:  wrap in try:/except:/finally: exception handling
    • urllib.parse.urlparse(URL) for decomposing URL
    • urllib.parse.urlunparse(parts) for building URL
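URL decomposition and reassembly can be sketched without any network access (the URL is illustrative; an actual urlopen call would be wrapped in try:/except: as noted above):

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse("https://example.com/path?q=1")
print(parts.scheme, parts.netloc, parts.path)  # https example.com /path

rebuilt = urlunparse(parts)
print(rebuilt)   # round-trips to the original URL
```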
  • Regular expressions
    • compiledPattern = re.compile(pattern, flags=0); flags can be given at compile time or per call
    • Most common flags:  re.I(gnore case), re.M(ultiline) makes ^start/end$ match each line
    • Raw strings do not interpret \ as an escape character (r"\n" == "\\n")
    • Two forms:  re.function(rawPattern, ...) or compiledPattern.function(...)
    • split(pattern, string, maxsplit=0, flags=0) returns List of substrings
    • match(pattern, string, flags=0) returns match obj/None if beg of str matches
    • mo = re.match(r"\d+", "067 string"); mo.group(), mo.start(), mo.end()
    • search(pattern, string, flags=0)
    • re.search(r"[a-z]+", "001 Has at least one 010 letter", re.I)
    • findall(pattern, string, flags=0)
    • re.findall(r"[a-z]+", "0010 Has at least one 010 letter", re.I)
    • sub(pattern, repl, string, count=0, flags=0) replaces non-overlapping matches with repl; the optional count parameter (not a flag) limits the number of replacements
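The regex functions above, run on the sample strings from the list:

```python
import re

# match: only succeeds at the beginning of the string
mo = re.match(r"\d+", "067 string")
print(mo.group(), mo.start(), mo.end())   # 067 0 3

# search: first match anywhere; re.I makes it case-insensitive
print(re.search(r"[a-z]+", "001 Has at least one 010 letter", re.I).group())  # Has

# findall: all non-overlapping matches as a list
print(re.findall(r"[a-z]+", "0010 Has at least one 010 letter", re.I))

# split and sub; count= limits the number of substitutions
print(re.split(r"\s*,\s*", "a, b ,c"))         # ['a', 'b', 'c']
print(re.sub(r"\d+", "#", "a1b22c", count=1))  # a#b22c
```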
  • Globbing
    • Match file names with wildcards:  * (zero or more chars), ? (exactly one char)
    • import glob; glob.glob("*.txt")
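A self-contained globbing sketch; it creates a throwaway directory with made-up file names so the match result is predictable:

```python
import glob
import os
import tempfile

# Illustrative files in a temporary directory
tmp = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.csv"):
    open(os.path.join(tmp, name), "w").close()

# * matches zero or more characters; only the .txt files match
matches = sorted(glob.glob(os.path.join(tmp, "*.txt")))
print([os.path.basename(m) for m in matches])   # ['a.txt', 'b.txt']
```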
  • Data pickling
    • Can store more than one object, read out sequentially
    • Can store intermediate results, faster
    • with open("myData.pickle", "wb") as oFile:  pickle.dump(object, oFile)
    • with open("myData.pickle", "rb") as iFile:  object = pickle.load(iFile)
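The sequential-storage point above can be demonstrated by dumping two objects into one file and loading them back in order (the file name follows the example in the list):

```python
import pickle

# Dump two objects sequentially into the same file
with open("myData.pickle", "wb") as oFile:
    pickle.dump({"a": 1}, oFile)
    pickle.dump([1, 2, 3], oFile)

# Each load() call reads back the next object, in order
with open("myData.pickle", "rb") as iFile:
    first = pickle.load(iFile)
    second = pickle.load(iFile)
print(first, second)   # {'a': 1} [1, 2, 3]
```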


This is one of the most concise summaries of Data Science-specific commands in Python.  The book does not go into depth, but I highly recommend it for a quick and simple overview of the various aspects of Python core to Data Science (tabular data, databases, network data, visualization, etc.).

