
If a lot of data parsing and cleaning needs to be done before modeling, I tend to follow one of three paths:

  • Path 1: Use Python to clean the data and export the data structure to a file/database. Leave the Python environment and move to R to do the modeling.
  • Path 2: Use Python to clean the data, stay in the Python environment and invoke R to do the modeling. Rpy is the go-to module in this context.
  • Path 3: Painfully do the data cleaning in R, despite R hogging memory, and then model in R.

Path 3 is the one I take most often. However, Paths 1 and 2 are also interesting because they open up a ton of modules that one can use from Python. A few years back I used some Python data types, mainly the dictionary, for something I don’t even remember properly. It was an ad-hoc task, and since then I have not used Python in a big way, only for some basic data cleaning tasks. Over the years I have slowly graduated to performing the entire data cleaning exercise in R itself and avoiding Python completely. Lately I have realized that I have been following a convenient path instead of the harder but more worthwhile ones (Paths 1 and 2). So I picked up this book to get a decent understanding of data types and modules in Python. In this post, I will list all the points that I found relevant in this book for a newbie like me:

  • The three common elements of natural languages, i.e. ambiguity, redundancy and “not literal in meaning”, do not apply to programming languages, which have exactly the opposite attributes: they are unambiguous, non-redundant and literal.
  • Syntax rules come in two types, tokens and structures
  • Python is an interpreted language. An error in Python code can be a syntax error, a runtime error or a semantic error.
  • The >>> prompt is called the chevron.
  • “invalid syntax” and “invalid token” are the usual SyntaxError messages that you come across.
  • Python variables are case sensitive.
  • PEMDAS: a useful mnemonic for remembering the order of operations in Python.
  • ^ is not the exponent operator in Python; it is bitwise XOR. Exponentiation is written **.
  • type() is a function that tells you the type of an object in Python. It is similar to the class function in R.
  • Semantic errors are tough to catch.
  • There are at least 31 keywords in Python; the exact number depends on the version. Type import keyword; print(keyword.kwlist) to get the list.
  • import math loads the math module from the Python standard library; its functions are then called as math.sqrt(x), math.sin(x) and so on.
  • A function definition has to be executed before the first time the function is called.
  • Stack diagrams are used for understanding function environments
  • Void functions: a nice name for functions that don’t return anything.
  • To see where Python searches for modules, type import sys; sys.path. The output is a list, and the first item is typically an empty string, which stands for the current directory.
  • Passwords should never be stored as plain text, be it in a file or a database. They are converted to a hash code and stored. When a user enters the password, its hash is checked against the stored hash code. The best thing about a hash code is that it is a mathematical one-way process from password to code: it is very unlikely that you will be able to recover the password given only the hash code.
  • Python code can also be compiled. These days some apps distribute compiled Python code instead of the usual .py files.
  • Set the environment variable PYTHONPATH to a list of directories (colon-separated on Unix-like systems, semicolon-separated on Windows) so that Python also searches those folders for modules.
  • The trick if __name__ == "__main__": exists in Python so that Python files can act either as reusable modules or as standalone programs.
  • How to draw a fractal using Python.
  • Functions that return something are given a catchy name: fruitful functions.
  • If a function returns no value and you try to print its result, it will print None.
  • Some examples of recursion mentioned are gcd function, palindrome function
  • Interesting use of the word bisection : Debugging by bisection. Well, basically apply the bisection method to finding out the bug.
  • The eval function can be used to evaluate a Python expression held in a string.
  • Strings support len, traversal with a for loop, slicing, and methods such as upper, lower and find.
  • An index range in a string slice runs from the first index up to, but not including, the second index.
  • The word “in” is a boolean operator that takes two strings and returns True if the first appears as a substring in the second
  • Python does not handle uppercase and lowercase letters the same way that people do. In comparisons, all the uppercase letters come before all the lowercase letters.
  • Strings are immutable
  • The slice operator takes an optional third argument: x[a:b:c] means from a to b in steps of c.
  • The following code reverses a string: x[::-1]
  • Program testing can be used to show the presence of bugs, but never to show their absence!
  • There are many string methods built into Python that can be used for text processing.
  • The string method zfill() can be used to pad a numeric string with leading zeroes.
  • int() is similar to as.integer() in R; float() is the closer match for as.numeric().
  • Saw similarities between the map operator and functions in R. Maybe this is the reason why people the world over love Python.
  • The list object stores pointers to objects, not the actual objects themselves. The size of a list in memory depends on the number of objects in the list
  • The time needed to get or set an individual item is constant, no matter what the size of the list is.
  • The time needed to insert an item depends on the size of the list, or more exactly, on how many items are to the right of the inserted item (O(n)). In other words, inserting an item at the end is faster than inserting at the beginning.
  • The time needed to reverse a list is O(n).
  • The time needed to sort a list varies; the worst case is O(n log n).
  • Useful operators, functions and methods that go with lists: in, extend, append, sort, sum, +, * and slicing with [a:b].
  • A nice analogy between map, filter and reduce functions applicable to lists. Capitalize is like map where you apply a function to each element, filter is like selecting only some items from the list , reduce is like summing up all the elements or counting all the elements in a list
  • list(s) creates a list out of the characters of the string s.
  • split is a string method that returns a list.
  • The most dangerous aspect of Python, unlike R, is that if you assign a list to a variable X and then say Y = X, any change you make to Y is reflected in X. Y is just another name (an alias) for the same list, not a copy. The same holds for function arguments: in R, the external object is not changed because the function works on a copy, whereas in Python the function receives a reference to the caller’s object and can modify it in place. This means most of the functions in R are pure functions.
  • Use the in operator to check for the presence of an element in a list. The index method returns the position of an element, but raises ValueError if the element is absent.
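  • To reverse a word, use the following code: x[::-1]
  • One can use the sorted function on a list; it returns a new sorted list and leaves the original unchanged.
  • append modifies the list and returns None.
  • Prefer append to a = a + [x]; the latter builds a new list every time.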
  • This is a fantastic thing in Python: the “in” operator on a dictionary takes about the same time irrespective of the size of the dictionary.
  • While checking for a word in a list of words, the “in” operator is slower than the bisection method, which is slower than access through the hash table in a dictionary.
  • For dictionaries, Python uses a data structure called a “hash table”, which has the remarkable property that the “in” operator takes about the same amount of time irrespective of the size of the dictionary.
  • Hash tables apparently store entries in an array: each key is run through a hash function, and the resulting hash value determines where the corresponding value is kept. Basically, a hash table is extremely useful when doing stuff with a large number of strings.
  • You cannot use lists as keys in a dictionary, as lists are mutable. Keys have to be hashable: if a key could change after insertion, its hash would change too and the dictionary would no longer find it. Similarly, a dictionary cannot be used as a key.
  • A previously computed value that is stored for later use is called a memo.
  • Any variable defined outside a function, at the top level of a module, is treated as a global variable. You can happily read it in a function. However, you cannot simply assign to it in a function, because an assignment creates a new local variable whose scope lasts only while the function is running.
  • To set a global variable in a function, you have to declare the variable as global.
  • You can add, remove or replace elements of a global list without a declaration, but if you want to reassign the variable, you have to declare it global.
  • Learnt about a way to check for duplicates using “set” in Python.
  • A tuple is a sequence of values. The values can be of any type, and they are indexed by integers. Unlike lists and dicts, tuples are immutable; this is the key to understanding them.
  • You cannot modify a tuple, but you can replace one tuple with another.
  • Tuples are good for swapping operations and for return values.
  • Functions like divmod take separate arguments, so a tuple has to be scattered into them with the * operator, as in divmod(*t).
  • zip is a built-in function that works well with tuples: it pairs up the elements of two or more sequences into tuples.
  • The items method on a dict returns (key, value) tuples.
  • You can compare tuples; the comparison goes element by element.
  • The Decorate, Sort, Undecorate pattern is useful in many situations like sorting, counting, etc.
  • You can use tuples or dictionaries to pass parameters. To pass a tuple as parameters, prefix it with *. To pass a dictionary as parameters, prefix it with **.
  • A few days back I stumbled on to Zipf’s law in an NY Times article. I managed to use dicts, tuples and lists to empirically check Zipf’s law on Jane Austen’s novel, Emma.
  • Also, I have started exploring the Rpy module to invoke R from Python. There is some initial pain in learning how to install Rpy. Once that is done, R can be used seamlessly in Python.
  • Computed a Markov Chain for Phrases in Jane Austen’s novel Emma
  • Tuples are very useful in sorting stuff or creating results similar to table in R
  • You can sort a list whose elements are tuples; they are ordered by the first element, then the second, and so on.
  • After reading the chapters on text parsing and string handling, I have this feeling that Python is the king for text parsing. No wonder it is used in Google and other places where they have to deal with a ton of text.
  • To randomly select items from a histogram, one can create a list where each element is repeated x number of times, where x is frequency of the word
  • random is a useful module which has functions like random(), randint() and choice().
  • If you want to remove punctuation from strings, you can use string.punctuation to check which characters need to be removed.
  • The os module has many useful functions like os.getcwd(), os.listdir(), os.path.abspath(), os.path.exists(), os.path.isdir() and os.path.isfile().
  • There is a mention of pipe in python that is useful in reading very big zip files.
  • repr(s) can be useful for debugging.
  • Learnt to screen scrape using Python
  • Reorganized my iTunes folder with Python. My iTunes folder had duplicate files and I had to remove those duplicates. Obviously, going over them manually was a nightmare. First I removed some obvious files, like video files and other non-music files, from the 1500 files in the folder. Then I used an MD5 function to find duplicates in my music folder and removed them programmatically. Now I have about 978 music files in total that I will categorize someday into various playlists.
  • Classes are pretty peculiar in Python. A class can be defined with no attributes at all, and you can still assign values to instance.x and instance.y afterwards. This is weird for me, as I have always thought of classes as declaring their attributes and functions up front.
  • I am coming from R to Python and I am alarmed at the fact that Python passes references to objects. In R, objects get copied. In Python, a reference to the object is sent. This means that a function can totally change a mutable object that you pass to it.
  • You can check whether two variables are aliases for the same object by using the “is” operator.
  • The copy module has two copy functions: copy for a shallow copy and deepcopy for a deep copy.
  • hasattr is a function that can be used to check whether an object has a given attribute.
  • Learnt new terms like invariants, pure functions and modifiers.
  • Functional programming is a style of program design in which the majority of the functions are pure. Pure means that the function does not modify any of the objects passed to it as arguments and has no other side effects; it only returns a value.
  • Came to know about the datetime module, probably the most important module that I will use.
  • There is a strange thing about invoking methods in Python. When I first came across methods, I was left wondering why there is a need to pass self into each of them. At least in the languages that I have coded in, I have never passed a self object. This chapter made me realize that the major reason for passing self is that there are basically two types of invocation in Python. Let’s say I have a class X that has a method test. I can invoke it either via X.test(obj) or via obj.test(). If the method takes further arguments, you write the function with self as the first parameter and give the rest of the parameters in the usual way.
  • Use the __init__ method to initialize a new object; it plays the role of a constructor.
  • Use __str__ to give a string representation of the object.
  • Use __add__ for operator overloading of +.
  • To overload some operator, look at the Python docs to find the exact name to use, let’s say it is YYY, then write a method with the name __YYY__, and your class has the operator overloading set.
  • Use __dict__ to get the attributes of an object as a dictionary.
  • Use the getattr function to get the value of an attribute by name.
  • Learnt about the use of __radd__, which handles + when the object is the right-hand operand.
  • The pass statement has no effect in a class definition. It is only necessary because a compound statement must have something in its body.
  • I came across the concept of deep copy and shallow copy again in the context of Python, after my initial encounter with it in C++.
  • Default values get evaluated ONCE, when the function is defined; they don’t get evaluated again when the function is called.
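  • Even though two lists or tuples are distinct objects with different memory addresses, the == operator on them tests for deep equality, comparing the contents element by element. In some other instances, such as user-defined objects by default, shallow equality is tested, i.e. whether the two references point to the same object.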
  • There is a difference between instance attributes and class attributes.
  • At the time of class definition itself, the inheritance structure is defined in Python
  • With the help of cards, deck and hand, the chapter on inheritance gives a good introduction to the various concepts related to inheritance.
  • I have ignored chapter 19 as I am planning to use R for visualization. Only if I cannot do something in R, would I probably come back to this book and learn about GUI capabilities of Python.
  • The appendix talks about debugging. Typically there are three types of errors that one comes across in programming. First are syntax errors. Second are runtime errors such as NameError, TypeError, KeyError, AttributeError and IndexError. The third type is the semantic error, which is often more difficult to crack than the first two types.
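
The aliasing points in the list above (Y = X, the is operator, shallow versus deep copies) are easier to see in code. This is only a minimal sketch; the variable names are made up for illustration and are not taken from the book.

```python
import copy

x = [1, [2, 3]]
y = x                       # y is an alias: both names refer to the same list
y.append(4)
print(x)                    # [1, [2, 3], 4]  (the change made via y shows up in x)
print(y is x)               # True

shallow = copy.copy(x)      # new outer list, but the inner list is still shared
deep = copy.deepcopy(x)     # inner list is copied as well

shallow[1].append(99)       # modifies the inner list that x also points to
print(x)                    # [1, [2, 3, 99], 4]
print(deep)                 # [1, [2, 3], 4]

print(shallow == x)         # True: == compares contents
print(shallow is x)         # False: they are distinct objects
```

If a function must not touch the caller's list, passing copy.deepcopy(x) (or a slice x[:] for a flat list) is one way to get R-like behaviour.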
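
The tuple items in the list above (swapping, scatter with *, gather, zip, and passing a dictionary with **) can be sketched in a few lines. The functions total and show and the sample values are hypothetical, chosen only for this illustration.

```python
# Swap two values with tuple assignment
a, b = 1, 2
a, b = b, a

# Scatter: divmod wants two separate arguments, so unpack the tuple with *
t = (7, 3)
quotient, remainder = divmod(*t)

# Gather: *args collects any number of positional arguments into a tuple
def total(*args):
    return sum(args)

# zip pairs up elements of two sequences into tuples
pairs = list(zip("abc", [1, 2, 3]))

# ** passes the items of a dictionary as keyword arguments
def show(name, greeting="hello"):
    return greeting + ", " + name

params = {"name": "Emma", "greeting": "hi"}
message = show(**params)

print(a, b)                  # 2 1
print(quotient, remainder)   # 2 1
print(total(1, 2, 3))        # 6
print(pairs)                 # [('a', 1), ('b', 2), ('c', 3)]
print(message)               # hi, Emma
```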
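
The word counting behind the Zipf's law check mentioned above combines a dict for the histogram with the Decorate, Sort, Undecorate pattern on tuples. The sketch below is not the code I ran on Emma; it uses a made-up one-line sample text so that it stays self-contained.

```python
import string

# Made-up sample text; in practice this would be read from a file
text = "The cat and the dog, and the bird."

# Histogram: word -> frequency, after stripping punctuation and lowercasing
counts = {}
for word in text.split():
    word = word.strip(string.punctuation).lower()
    counts[word] = counts.get(word, 0) + 1

# Decorate: build (frequency, word) tuples so that sorting orders by frequency
decorated = [(freq, word) for word, freq in counts.items()]
# Sort: tuples compare element by element, frequency first
decorated.sort(reverse=True)
# Undecorate: back to (word, frequency), most frequent first
ranked = [(word, freq) for freq, word in decorated]

for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
```

Zipf's law says that frequency falls off roughly in proportion to 1/rank, so on a real corpus one would plot log(freq) against log(rank) and look for a roughly straight line. A toy sentence like this one is far too small to show that.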
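
The special methods noted above (__init__, __str__, __add__, __radd__) and the two ways of invoking a method fit into one small class. Point here is a hypothetical example class of my own, not necessarily the one the book uses.

```python
class Point:
    def __init__(self, x=0, y=0):
        # Runs when Point(...) is called; sets the instance attributes
        self.x = x
        self.y = y

    def __str__(self):
        # Used by print() and str()
        return "(%g, %g)" % (self.x, self.y)

    def __add__(self, other):
        # Handles point + point and point + number
        if isinstance(other, Point):
            return Point(self.x + other.x, self.y + other.y)
        return Point(self.x + other, self.y + other)

    def __radd__(self, other):
        # Handles number + point, where the Point is the right-hand operand
        return self.__add__(other)


p = Point(1, 2)
q = Point(3, 4)

print(p + q)                # (4, 6)
print(1 + p)                # (2, 3)
print(sum([p, q]))          # (4, 6)  (sum starts with 0 + p, which needs __radd__)

# The two invocation styles are equivalent
print(Point.__str__(p) == p.__str__())   # True

print(p.__dict__)           # {'x': 1, 'y': 2}
print(hasattr(p, "x"), getattr(p, "y"))  # True 2
```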
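
Two of the gotchas from the list above, default values being evaluated only once and the rules for global variables, are worth a quick sketch. The function names add_item, add_item_safe and record are invented for this example.

```python
def add_item(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call without an explicit bucket shares the same list
    bucket.append(item)
    return bucket

first = add_item(1)
second = add_item(2)
print(first is second)      # True
print(second)               # [1, 2]

def add_item_safe(item, bucket=None):
    # Common workaround: use None as the default and build the list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

print(add_item_safe(1))     # [1]
print(add_item_safe(2))     # [2]

counter = 0
seen = []

def record(x):
    global counter          # needed because counter is reassigned
    counter += 1
    seen.append(x)          # modifying the global list needs no declaration

record("a")
record("b")
print(counter, seen)        # 2 ['a', 'b']
```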