September 2015


reproducible-research

The book starts by explaining an example project that one can download from the author’s GitHub account. The project files serve as an introduction to reproducible research. I guess it makes sense to download this project, try to follow the instructions and create the relevant files. By compiling the example project, one gets a sense of what one can accomplish by reading through the book.

Introducing Reproducible Research
The highlight of an RR document is that data, analysis and results are all in one document. There is no separation between announcing the results and doing the number crunching. The author gives a list of benefits that accrue to any researcher generating RR documents. They are:

  • better work habits
  • better team work
  • changes are easier
  • high research impact

The author uses knitr and rmarkdown in the book to discuss reproducibility. The primary difference between the two is that the former demands that the document be written in the markup language associated with the desired output, whereas the latter is more straightforward in the sense that one markup can be used to produce a variety of outputs.
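As a rough sketch of that difference (file names are my own): in a knitr .Rnw file the chunk delimiters are tied to LaTeX, while in an rmarkdown .Rmd file the same chunk sits in a fenced block that can be rendered to PDF, HTML or Word.

```
% report.Rnw -- knitr source tied to LaTeX output
<<summary-stats, echo=TRUE>>=
summary(cars)
@
```

````
<!-- report.Rmd -- the same chunk; one markup, many output formats -->
```{r summary-stats, echo=TRUE}
summary(cars)
```
````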

Getting Started with Reproducible Research
The thing to keep in mind is that reproducibility is not an afterthought – it is something you build into the project from the beginning. The author discusses some general aspects of RR here. If you are not yet convinced of the benefits of RR, it is worth reading this chapter carefully, as it spells out those benefits and gives some RR tips to a newbie. The chapter also gives the reader a road map of what to expect from the book. In any research project there is a data-gathering stage, a data-analysis stage and a presentation stage, and the book contains a set of chapters addressing each stage. More importantly, it shows ways to tie the stages together so as to produce a single compendium for your entire project.

Getting started with R, RStudio and knitr/rmarkdown
This chapter gives a basic introduction to R and subsequently dives into knitr and rmarkdown commands. It shows how one can create a .Rnw or .Rtex document and convert it into a PDF, either through RStudio or from the command line. rmarkdown documents, on the other hand, are more convenient for reproducing simple projects where there are not many interdependencies between the various tasks. Obviously the content in this chapter gives only a general idea; one has to dig through the documentation to make things work. One takeaway for me from this chapter is the option of creating .Rtex documents, in which the syntax can be less baroque.
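For what it is worth, a minimal sketch of the .Rtex syntax (the file name is mine): chunks are marked with `%% begin.rcode` / `%% end.rcode`, and the code lines themselves are prefixed with `%`, so the file remains valid LaTeX to editors and tools.

```
% report.Rtex -- code chunks hide inside LaTeX comments
%% begin.rcode summary-stats, echo=TRUE
% summary(cars)
%% end.rcode
```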

Getting started with File Management
This chapter gives a basic directory structure that one can follow for organizing project files. One can use the structure as a guideline for one’s own projects. The example project uses a GNU make file for data munging. The chapter also gives a crash course in bash.
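A sketch of the kind of layout the chapter suggests (the directory names here are my own, not necessarily the author’s exact structure):

```
project/
├── data/         # raw data, plus the makefile and scripts that gather/clean it
├── analysis/     # .Rnw/.Rmd source files for the analysis
├── paper/        # final article, slides and figures
└── README.md     # what the project is and how to reproduce it
```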

Storing, Collaborating, Accessing Files, and Versioning
The four activities mentioned in the chapter title can be done in many ways. The chapter focuses on Dropbox and GitHub. It is fairly easy to learn the limited functionality one gets from Dropbox. GitHub, on the other hand, demands some learning from a newbie: one needs to know the basic terminology of git. The author does a commendable job of highlighting the main aspects of git version control and its tight integration with RStudio.
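The basic git vocabulary amounts to a handful of commands; a minimal sketch of a first workflow (the repo path, identity and remote URL are all placeholders, not from the book):

```shell
# Minimal git workflow sketch; path, identity and remote are placeholders.
rm -rf /tmp/rr-demo && mkdir -p /tmp/rr-demo && cd /tmp/rr-demo
git init -q
git config user.email "you@example.com"   # identity just for this demo repo
git config user.name "Demo User"
echo "# my analysis" > README.md
git add README.md
git commit -q -m "Initial commit"
git log --oneline                         # shows the single commit
# git remote add origin https://github.com/USER/rr-demo.git  # then: git push -u origin master
```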

Gathering Data with R
This chapter talks about the way in which one can use the GNU make utility to create a systematic way of gathering data. The use of a make file makes it easy for others to reproduce the data-preparation stage of a project. If you have written a make file for C++ or in some other context, it is pretty easy to follow the basic steps mentioned in the chapter; else it might involve some learning curve. My guess is that once you start writing make files for specific tasks, you will realize their tremendous value in any data analysis project. A nice starting point for learning about make files is robjhyndman’s site.
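A minimal sketch of such a makefile (gather.R and clean.R are hypothetical scripts, not the book’s): each target declares its prerequisites, so make re-runs only the steps whose inputs have changed.

```
# data-gathering sketch; gather.R and clean.R are placeholder scripts
all: cleaned_data.csv

raw_data.csv: gather.R
	Rscript gather.R            # download/collect the raw data

cleaned_data.csv: raw_data.csv clean.R
	Rscript clean.R             # merge and clean, writing cleaned_data.csv

clean:
	rm -f raw_data.csv cleaned_data.csv
```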

Preparing Data for Analysis
This chapter gives a whirlwind tour of data-munging operations and data analysis in R.

Statistical Modeling and knitr
The chapter gives a brief description of the chunk options that are frequently used in an RR document. Of all the options, cache.extra and dependson are ones I had never used in the past, and they were a learning for me. One of the reasons I like knitr is its ability to cache objects. In the Sweave era, I had to load separate packages and do all sorts of things to run a time-intensive RR document; it was very painful, to say the least. Thanks to knitr, it is extremely easy now. Even though the cache option is described at the end, I think it is one of the most useful features of the package. Another good thing is that you can combine various languages in an RR document. Currently knitr supports the following language engines:

  • Awk
  • Bash shell
  • CoffeeScript
  • Gawk
  • Haskell
  • Highlight
  • Python
  • R (default)
  • Ruby
  • SAS
  • Bourne shell
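A hedged sketch of how these options appear in chunk headers (the chunk names and code are invented for illustration): cache=TRUE stores a chunk’s results, dependson invalidates that cache when a named chunk changes, and engine switches the language.

```
<<fit-model, cache=TRUE>>=
fit <- lm(dist ~ speed, data = cars)
@

<<plot-fit, cache=TRUE, dependson="fit-model">>=
plot(fit)
@

<<count-rows, engine="bash">>=
wc -l data.csv
@
```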


Showing results with tables
In whatever analysis you do using R, there are always situations where your output is a data.frame, a matrix or some sort of list structure that is formatted to display on the console as a table. One can use kable to show data.frame and matrix structures; it is simple and effective but limited in scope. The xtable package, on the other hand, is extremely powerful: one can pass various fitted statistical model objects to the xtable function to obtain table and tabular environments encoding the results. The chapter also mentions texreg, which is far more powerful than the previously mentioned packages; with texreg, you can show the output of more than one statistical model as a table in your RR document. There are times when the output classes are not supported by xtable. In such cases, one has to manually hunt down the relevant results, create a data frame or matrix from them and then use the xtable function.
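A minimal sketch of the two approaches (the model and data are invented for illustration): kable for a quick table from a data.frame, xtable for a LaTeX table from a fitted model, with results="asis" letting the generated LaTeX pass through untouched.

```
<<results-table, results="asis">>=
library(xtable)
fit <- lm(dist ~ speed, data = cars)
knitr::kable(head(cars))   # quick table from a data.frame
print(xtable(fit))         # LaTeX tabular from the fitted model
@
```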

Showing results with figures
It is often better to know basic LaTeX syntax for embedding graphics before using knitr. One problem I have always faced with knitr-embedded graphics is that all the chunk options must be mentioned on one single line; you cannot spread them over two lines. I learnt a nice hack from this chapter whereby some of the environment-level code can be written as markup rather than as chunk options. This chapter touches upon the main chunk options relating to graphics and does it well, without overwhelming the reader.
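For illustration, a typical graphics chunk header, with everything on one line (the chunk name, caption and sizes are my own):

```
<<speed-plot, fig.width=5, fig.height=4, fig.cap="Stopping distance vs. speed", fig.pos="htb">>=
plot(cars)
@
```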

Presentation with knitr/LaTeX
The author says that much of the LaTeX in the book has been written using the Sublime Text editor. I think this is the case with most people who intend to create an RR document: even though RStudio has a good environment for creating a LaTeX file, I usually go back to my old editor to write LaTeX markup. How to cite a bibliography in your document, and how to cite R packages, are questions that every researcher has to think about in producing RR documents, and the author does a good job of highlighting the main aspects of this thought process. The chapter ends with a brief discussion of Beamer – a 10,000 ft. view of it. I stumbled onto a nice link in this chapter that gives the reason for using fragile in Beamer.
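A rough sketch of the two citation tasks (the .bib file names and entry key are placeholders): BibTeX handles the literature, and knitr::write_bib generates entries for the R packages used.

```
% literature: entries live in rr.bib; package entries in packages.bib
A result due to \cite{someauthor2013}.
\bibliographystyle{apalike}
\bibliography{rr,packages}
```

```
<<write-pkg-bib, include=FALSE>>=
# generate BibTeX entries for the packages used, into packages.bib
knitr::write_bib(c("knitr", "xtable"), file = "packages.bib")
@
```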

Large knitr/LaTeX Documents: Theses, Books, and Batch Reports
This chapter is extremely useful for creating long RR documents. In fact, if your RR document is large, it makes sense to subdivide it logically into separate child documents. In knitr there are chunk options to specify parent and child relationships; these options make it possible to knit child documents independently of the other documents embedded in the parent. You do not have to repeat the preamble code in each of the child documents, as each child inherits it from the parent document. The author also shows a way to use Pandoc to convert an rmarkdown document to TeX, which can then be included in the RR document.
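A sketch of the parent/child mechanics (file names are mine): the parent pulls in children with the child option, and each child calls knitr::set_parent so it can also be knitted on its own with the parent’s preamble.

```
% parent.Rnw -- includes the chapters
<<ch1, child="chapter1.Rnw">>=
@

% chapter1.Rnw -- also knits stand-alone
<<set-parent, echo=FALSE, cache=FALSE>>=
knitr::set_parent("parent.Rnw")   # inherit the parent's preamble
@
```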

The penultimate chapter is on rmarkdown. The concluding chapter of the book discusses some general issues of reproducible research.

Takeaway:

This book gives a nice overview of the main tools one can use in creating an RR document. Even though the title of the book has the term “RStudio” in it, the tools and hacks mentioned are IDE-agnostic. One could read a book-length treatment of each of the tools mentioned here and easily get lost in the details. Books such as this give a nice overview of all the tools and hence motivate the reader to dive into the specifics as and when the requirement arises.

This book can be used as a companion to a more pedagogical text on survival analysis. For someone looking for the appropriate R command for fitting a certain kind of survival model, this book is apt. It gives neither the intuition nor the math behind the various models; it reads like an elaborate help manual for all the R packages related to event history analysis.

I guess one of the reasons for the author writing this book is to highlight his package eha on CRAN. The author’s package is basically a layer on the survival package, adding some advanced techniques that I guess only a serious researcher in this field can appreciate. The book takes the reader through the entire gamut of models in a pretty dry format: it gives the basic form of a model, the R commands to fit it, and some commentary on how to interpret the output. The difficulty level is not a linear function from start to end; I found some very intricate material interspersed among some very elementary estimators. An abrupt discussion of Poisson regression breaks the flow in understanding the Cox model and its extensions, and the chapter on Cox regression contains a detailed and unnecessary discussion of some elementary aspects of any regression framework. Keeping these cribs aside, the book is useful as a quick reference to functions from the survival, coxme, cmprsk and eha packages.
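For flavour, the sort of command the book catalogues – a Cox proportional hazards fit with the survival package, using its bundled lung data set (the covariates chosen here are just illustrative):

```
library(survival)
# Cox proportional hazards model on the bundled lung data
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)
```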


As the title suggests, this book is truly a self-learning text. There is minimal math in the book, even though the subject is essentially about estimating functions (survival, hazard, cumulative hazard). I think the highlight of the book is its unique layout: each page is divided into two parts, where the left-hand side runs like a pitch and the right-hand side runs like a commentary on the pitch. Every aspect of estimation and inference is explained in plain, simple English. Obviously one cannot expect to learn the math behind the subject this way. In any case, I guess the target audience comprises those who would like to understand survival analysis, run the models using some software package and interpret the output; in that sense, the book is spot on.

The book is 700 pages long, so all said and done, this is not a book that can be read in one or two sittings. Even though the content is easily understood, I think it takes a while to get used to the various terms and assumptions for the whole gamut of models one comes across in survival analysis. Needless to say, this is a beginner’s book. If one wants to understand the actual math behind the estimation and inference of the various functions, this book will at least equip a curious reader with a 10,000 ft. view of the subject, which in turn can be very helpful in motivating oneself to slog through the math.

Here is a document that gives a brief summary of the main chapters of the book.