
Books on R are tricky to choose, especially because the sheer range of things R can do is mind-boggling. The books on it run from the very specialized to the very generic, and one has no choice but to pick from this gigantic collection based on one's needs. The flip side of this vastness is that a first-timer is likely to fail to see the forest for the trees.

Paul Teetor's book is different from the existing books on R. For one thing, it cannot be your first book on R. So, where does it fit in the gamut of documents, vignettes, and books on R? Well, if you have coded in R for some time or used R at work, there is every chance that you have already carried out the tasks mentioned in the book in one way or another. Maybe you have written your own functions, your own code, your own packages, or forked a package. Whatever the case, the tasks in the book will have figured in your work. That is where this book is useful. You can look at a task, form a mental image of the code you would write, compare it with Paul Teetor's solution, and reflect on the discussion provided. In most cases, your solutions might match. However, the discussion at the end of each solution is illuminating. There are specific things you will have missed, options in the function prototype that you never bothered to use, and ways of thinking about a function that never crossed your mind. If you approach the book with the mindset that you are going to examine the gaps between your mental image of the code and the code given in the book, then you are likely to enjoy it. I think one should just go over the entire book on a rainy day, instead of waiting for a task to crop up at work and only then referring to the book.

I approached the book with the question, "Can this book teach me a better way to do things than what I am doing currently?" Accordingly, I religiously went over the discussion provided for each task to note all the gaps in my code. Here is a list of points that this book taught, reiterated, or explained in a better way. Chapters 1 and 2 deal with the very basic things one needs to start working in R; you will probably only need them in your first few days of coding in R. So I will skip the first two chapters and list my takeaways from Chapter 3 onwards.

Chapter 3: Navigating the Software

  • Startup sequence for R (see the sketch after this list):
    • R executes the Rprofile.site script
    • R executes the .Rprofile script
    • R loads the workspace from .RData
    • R executes the .First function
    • R executes the .First.sys function
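
To make the sequence concrete, here is a minimal sketch of a personal .Rprofile; the option values and the startup message are illustrations of my own, not recommendations from the book:

```r
# ~/.Rprofile -- executed at startup, right after Rprofile.site
options(digits = 4)                                   # personal default
options(repos = c(CRAN = "https://cloud.r-project.org"))

# .First runs after the workspace (.RData) has been restored
.First <- function() {
  cat("Workspace restored at", date(), "\n")
}
```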

Chapter 4: Input and Output

  • read.csv assumes that a header exists. I had always written header=TRUE unnecessarily. A classic case of not reading the function prototype properly.
  • Pass row.names=FALSE to write.csv so that the output file carries no unnecessary row numbers.
  • Use dput and dump to save data in ASCII format.
  • Use format along with cat to control the number of digits printed (a sketch of these recipes follows this list).
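
A minimal sketch of these recipes; the file names are hypothetical:

```r
# header = TRUE is already the default, so this is enough:
df <- read.csv("input.csv")

# suppress the row-number column when writing:
write.csv(df, "output.csv", row.names = FALSE)

# save an object in readable ASCII form:
dput(df, "df.txt")    # writes a deparsed version of df
dump("df", "df.R")    # writes an assignment you can source() back

# control printed digits with format() + cat():
cat(format(pi, digits = 3), "\n")   # prints 3.14
```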

Chapter 5: Data Structures

  • Usually one creates a matrix that is purely numeric or purely character data. Here was a surprise for me in this chapter: you can have mixed data types in a matrix. Just take a list that holds both numeric and character data and give it a dim attribute. Voila, you have a matrix with mixed types. A pathological case arises!
  • The append function: I have never felt the need to use it to date.
  • Use stack to combine a list into a two-column data frame.
  • lst[[n]] is an element, not a list; lst[n] is a list, not an element.
  • To remove an element from a list, assign NULL to it. One can create a list with NULL elements, but if one assigns NULL to an existing element, that element is removed.
  • Initializing a data matrix: try typing the data itself in a rectangular shape that reveals the matrix structure.
  • The drop = FALSE argument is used to retain the matrix structure when any sub-selection is made.
  • Initializing a data frame from row data: store each row in a one-row data frame, store the one-row data frames in a list, and use do.call with rbind to bind them into one large data frame.
  • It is better not to use matrix notation to access data frame columns. Why? The result could be a vector or a data frame depending on the number of columns you access. It is better to use the list-based syntax, as it leads to clearer code, or to use drop = FALSE so that the return type is always a data frame.
  • An elegant way to select rows and columns from a data frame is subset with its select argument (see the sketch after this list).
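
A quick sketch of several of these recipes; the object names are mine:

```r
# list vs element indexing
lst <- list(a = 1, b = 2, c = 3)
lst[[2]]          # the element itself: 2
lst[2]            # a one-element list: list(b = 2)
lst$b <- NULL     # removes element b from the list

# a "matrix" with mixed types: give a list a dim attribute
m <- list(1, 2, "a", "b")
dim(m) <- c(2, 2)

# keep the data frame structure when selecting one column
df <- data.frame(x = 1:3, y = 4:6)
df[, "x", drop = FALSE]   # still a data frame, not a vector

# build a data frame from one-row pieces
rows <- list(data.frame(x = 1, y = 4),
             data.frame(x = 2, y = 5))
big <- do.call(rbind, rows)

# select rows and columns in one call
subset(df, x > 1, select = y)
```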

Chapter 6: Data Transformations

  • The s in sapply stands for simplify.
  • mapply is used when the function to be applied is not vectorized (for example, a gcd function).
  • If you want to apply a function to specific blocks of rows, use by (a sketch of these three follows this list).
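
A small sketch of the three, with a toy gcd as the non-vectorized function:

```r
# sapply simplifies the result to a vector/matrix where it can
sapply(1:5, function(n) n^2)             # c(1, 4, 9, 16, 25)

# mapply walks several argument vectors in parallel -- handy when
# the function itself is not vectorized, e.g. a scalar gcd
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)
mapply(gcd, c(12, 15, 9), c(18, 5, 6))   # c(6, 5, 3)

# by() applies a function to blocks of rows, one block per factor level
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 10))
by(df, df$g, function(blk) mean(blk$x))
```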

Chapter 7: Strings and Dates

  • I had used date objects but never carefully understood the reasoning behind the naming convention.
  • POSIXct stores the number of seconds since 1970. Why "ct"? It stands for compact, and now it makes sense: the date is stored in the most compact form.
  • POSIXlt stores the date in the form of a list with 9 elements. A pretty elaborate structure, hence the "lt" (list) in POSIXlt.
  • Mention of the lubridate package, which I have never explored to date. Since it is from Hadley Wickham and co., it must be very useful. Will check it out someday.
  • cat will not reveal special characters, but print does.
  • Do not forget to add 1900 to the year in POSIXlt (see the sketch after this list).
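
A sketch of the two classes side by side; the timestamp is arbitrary:

```r
# POSIXct: seconds since 1970-01-01, one compact number
now.ct <- as.POSIXct("2011-07-15 10:30:00")
unclass(now.ct)      # a single numeric value

# POSIXlt: broken-down list (sec, min, hour, mday, mon, year, ...)
now.lt <- as.POSIXlt(now.ct)
now.lt$year + 1900   # year is stored as years since 1900
now.lt$mon + 1       # months are 0-based, too
```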

Chapter 8: Probability

  • Remember that the distribution parameters R asks for are sometimes not the natural parameters of the distribution. For the exponential, for example, you don't pass the mean beta; you pass the rate 1/beta.
  • The MASS library has fitdistr, which fits distributions by maximum likelihood (MLE) for all the usual flavors one comes across.
  • If you specify lower.tail = FALSE in pnorm/pt/pgamma, you get the survival probability.
  • Set the density parameter in polygon to 10 if you want 45-degree shading lines in the selected area; otherwise set density to -1 and specify a color (a sketch of these follows this list).
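
A sketch of these points; the cut-off 1.96 and the mean of 10 are arbitrary choices of mine:

```r
# R parameterizes the exponential by rate = 1/mean, not the mean itself
mean.wait <- 10
pexp(5, rate = 1 / mean.wait)      # P(X <= 5) when the mean is 10

# survival probability directly, without writing 1 - pnorm(...)
pnorm(1.96, lower.tail = FALSE)    # about 0.025

# shade a tail region under a curve with 45-degree lines
x <- seq(-3, 3, length.out = 200)
plot(x, dnorm(x), type = "l")
region <- x >= 1.96
polygon(c(1.96, x[region], 3), c(0, dnorm(x[region]), 0), density = 10)
```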

Chapter 9: General Statistics

A nice recap of conventional hypothesis testing. For almost all the tasks, I would have written the same code given in the book. So, nothing new here for me.

Chapter 10: Graphics

  • A nice recap of base graphics tasks, though I must admit that after working with ggplot2, I don't think I will ever go back to base graphics. Even for quick-and-dirty plots, I think ggplot2 is better. There is some learning curve to ggplot2, but the effort is worth it, as you will start producing publication-ready graphics. I missed a lecture recently on lattice and have plans to explore it soon. I have been told that whatever can be done in lattice can be done in ggplot2, so I am basically waiting until I have enough motivation to go over lattice.
  • Using the xpd option in barplot2 is a nice way to keep bars from spilling outside the plot region.
  • Quantile plots for other distributions: qqnorm and qqline are part of the basic vocabulary of any data analyst, but there is a task that generates quantile plots for other distributions. You can do the same with a couple of statements, but the author does it in a nifty way.
  • Plotting a variable in multiple colors: to date, I had always drawn multiple variables with a separate color for each variable, but had never come across a situation where a single variable needs to be shown in multiple colors. In that sense the task's solution and description were a learning for me.
  • Instead of the default R window, use win.graph(width = 7, height = 5.5), where the graph dimensions are based on the golden ratio.
  • plot(y ~ x) is far more intuitive than plot(x, y).
  • Changing a global parameter has a global effect, meaning that if you call packages that create their own graphics, the changes will affect them too. So one needs to be careful when coding par(globalparam = value) (see the sketch after this list).
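
A sketch of the last two points; the data are simulated:

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# the formula interface reads naturally as "y against x"
plot(y ~ x)

# change a global graphics parameter safely: save, set, restore
old.par <- par(lwd = 2)   # par() returns the values being replaced
plot(y ~ x)
abline(lm(y ~ x))         # drawn with the doubled line width
par(old.par)              # restore, so other packages' plots are unaffected
```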

Chapter 11: Linear Regression and ANOVA

  • poly(x, n) can be used for polynomial regression, and adding raw=TRUE computes with simple polynomials instead of orthogonal polynomials.
  • The boxcox() function gives you the best transformation (MLE is the intuition behind it).
  • outlier.test() (outlierTest() in current versions of the car package) captures outliers.
  • influence.measures() is a most important function, and its output needs to be interpreted carefully. Hat values (leverage, which measures each observation's potential influence purely from the structure of the independent variables; the dependent variable plays no role) and Cook's distance (which combines leverage with the residual, and so also reflects each observation's effect on the fit of the dependent variable) are both needed for healthy diagnostics. This function does everything for you.
  • oneway.test(x ~ f) is better than anova(x ~ f), as the former does not assume equal variance across the factor levels.
  • kruskal.test(x ~ f) is a safer test (as it is non-parametric) than oneway.test or anova, which assume a Gaussian realization.
  • One of the best uses of anova is to compare competing models (see the sketch after this list).
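
A sketch of the polynomial fit, the diagnostics, and the model comparison, on simulated data:

```r
set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + 3 * x^2 + rnorm(100, sd = 0.1)

# polynomial regression; raw = TRUE uses plain powers of x
m2 <- lm(y ~ poly(x, 2, raw = TRUE))

# hat values, Cook's distance, and friends in one report
summary(influence.measures(m2))

# anova() to compare nested models: is the quadratic term worth it?
m1 <- lm(y ~ x)
anova(m1, m2)
```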

Chapter 12: Useful Tricks

  • You can define your own binary operator %XYZ% by assigning a function to it. I have never done this in my code to date.
  • match can be used to find the position of a specific element in a vector.
  • do.call(cbind, complex.list) is useful when the list structure is complex; unlist might throw junk in such cases, hence the importance of this recipe (a sketch follows this list).
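
A sketch of the three recipes; the operator name %+strcat% is my own invention:

```r
# a user-defined binary operator is just a function with a %...% name
`%+strcat%` <- function(a, b) paste(a, b, sep = "")
"foo" %+strcat% "bar"         # "foobar"

# match() returns the position of the first hit (NA if absent)
match("b", c("a", "b", "c"))  # 2

# flatten a list of columns into a matrix without unlist() mangling it
cols <- list(x = 1:3, y = 4:6)
do.call(cbind, cols)
```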

Chapter 13: Beyond Basic Numerics and Statistics

  • One needs to be careful when using optimize, as it reports only one minimum even if there are multiple minima in the given search range.
  • Orthogonal regression might be a better choice than simple regression. The visual explaining the basic problem with simple regression is excellent.
  • Three lines of code can give you a cluster analysis 101: use dist, hclust, and cutree (see the sketch after this list).
  • I always used to write my own code for bootstrapping. I came to know that the boot package has quite a few functions that can be used, especially for the various confidence intervals of the estimates. I have got to explore it in detail someday.
  • princomp and factanal are a good start toward a full-fledged factor analysis.
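
The three-line cluster analysis, sketched on the built-in mtcars data:

```r
d <- dist(scale(mtcars))   # pairwise distances on standardized data
h <- hclust(d)             # build the hierarchical cluster tree
grp <- cutree(h, k = 3)    # cut the tree into 3 groups
table(grp)                 # group sizes
```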

Chapter 14: Time Series Analysis

  • Padding or filling a time series with empty data is something I found interesting. I think it is an immensely useful hack (a sketch follows this list).
  • na.pad = TRUE is an option I should be using from now on, as it makes the code much cleaner. A classic case of me not reading the function arguments properly.
  • There are a few tasks on fitting ARIMA models, checking for a unit root, and smoothing a time series. As the book clearly mentions, the solutions merely scratch the surface; each of these topics would need a book's length of content to do it justice. In any case, the solutions presented in this chapter will at least give you a start.
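
A sketch of the padding recipe using the zoo package; the dates and values are arbitrary:

```r
library(zoo)

# a sparse daily series with gaps
x <- zoo(c(1, 3, 7), as.Date(c("2012-01-01", "2012-01-03", "2012-01-07")))

# merge with an empty zoo object over the full calendar:
# the missing days appear as NA, i.e. the series is "padded"
full.dates <- seq(from = start(x), to = end(x), by = "day")
x.padded <- merge(x, zoo(, full.dates))

# optionally fill the NAs, e.g. carry the last observation forward
x.filled <- na.locf(x.padded)
```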

Takeaway:

Whatever you program in R, the tasks mentioned in this book are going to be part of your code, forming its building blocks. The book covers 246 tasks that generally arise in the daily life of an R programmer. Even if only 10% of those tasks teach you something new, the book is still worth your while.