It’s quite an investment to pick up a new programming language (syntax, semantics, types), along with all its periphery: tools, libraries, interfaces, environment, documentation, culture, user groups, history, idioms, and quirks.  I’m not trying to force myself to learn a new language every year, though looking back it’s pretty much turned out that way.  I actually think there’s a skill-diluting effect in going shallow and trying to memorize syntax for a breadth of languages without knowing much about those other necessary peripheral pieces.  You really need a native language as a foundation by which you judge the others.  That foundation may consist of a few languages, but I think there must be a small core.

Over the years my primary foundation has become Python.  I actually tend to steer away from the majority of new languages (especially since I rarely get into a situation where Python won’t handle the job), though the temptation of the new is sometimes great.  This year I’m having an especially hard time holding back the urge for the new.  It’s R.  I’ve spent the last month telling myself I can’t make the commitment now given how critical my development momentum is to survival.  Well, R keeps taunting me, showing up everywhere I look.  I’ve got to figure out why it won’t leave me alone.

So here are the compelling pieces I’ve discovered thus far about R, that have me tinkering in its REPL, wanting to buy book after book, and even having some strange dreams.  I’m comparing against Python to evaluate whether learning another new language is worth the investment.

Here are some great things I’ve discovered about R:

  • Functional. Although I’ve dabbled in Erlang, Haskell, and ML, and grown towards using Python in various functional ways (Mertz, AMK), I can’t say I’m fluent in any sanctioned functional language (okay, I just added R to that list, but it should have been there: “R’s functional parts come from Scheme.”)  I want to continue “breaking my brain in useful and creative ways”, and R fits the bill for completing this repertory need (I’ll probably head back to Haskell if/when a strong need for concurrency arises).
  • Statistics and data analysis. The Numerati and other sources say statisticians rule the world.  My eyes have been opened and I believe they’re right, so I’m working my math chops back up.  It would appear that R’s statistics packages are more expansive than Python’s, and more is built into the language itself.
  • Mathematical learning tool. It’s been a decade since I finished that math degree, but now the investment is finally paying off… I, uh, hypothesize.  I’ll blame the hiatus on not having found very interesting work to do until now — another reason to be self-employed, and why I’m loving life this year.  Anyway, I’m already getting re-versed in stats just by starting to use the language.  It’s looking like playing with R is a much better way to learn statistics than muddling through textbooks and paper-based homework.
  • Graphical. Don’t know where to start.  There are so many graphical tools for R it is mind-numbing.  I’m starting to look at rggobi, but there are lots of others to get acquainted with, including what’s built in.  I’ve worked through a couple of graphical tutorials, and they seem to just magically pop up amazing graphs without having installed anything.  Try demo('graphics') at the > prompt and notice how few lines of code are doing all that.  Wow!
  • Geospatial. I’ve got a need to be plotting data on maps in a variety of formats.  I’ve found a lot of ways to do this, but R seems to be the lightest, and very capable.
  • High-level. Everything is a data structure.  Operations applied to a variable act on the whole set at once (they are vectorized), with no loops or treatment of individual items.  The syntax appears to be a bit richer (higher level, more sugar) than found in Python.
  • I/O. R seamlessly slurps table-oriented text files for processing.  Output is also automatically formatted in nice text tables.  A mini-book (PDF) describes this and much more.  My next learning task is to start interfacing with PostgreSQL.
  • Kind of Python-like. I’ve started outlining another article that I hope to actually write someday called “R For Python Programmers” (since I can’t find such a guide).  Isn’t Python the gold standard to which great books make comparisons these days?  I won’t duplicate that here, but simply say that I can’t believe how comfortable the syntax feels coming from Python nativity.  And R appears to integrate very well with Python (this probably being the more important point).
  • Best REPL evar! I start evaluating any language simply by firing up its REPL and comparing its facilities to those of IPython.  R’s REPL is on par with it (readline editing with vim-mode support, tab completion for everything, extensive help), and even has some extra niceties; e.g., function parameter tab-completion.  While getting started, note that the help system uses ? and ?? prefixes instead of IPython’s suffix notation.  IOW, in R you type ?topic rather than topic? as in IPython.  I’m getting the impression that a common workflow is spending lots of time in the REPL working with files and graphics.  This is probably the killer feature that enables me to quickly get up to speed.
  • Documentation. The pages aren’t pretty, but there is a mass of info on the R site.  And AFAICT R is the foremost language used in recent Statistics textbooks.  There’s also a free 100-page (a good length, compare to Python Tutorial) intro book.  It’s horrible for non-programmers, and less than perfect for non-statisticians, but will get you familiar with some language features.  Appendix A offers a nice REPL walk-thru of language features.  I can’t find any Python books with much treatment of numerics, except for this one which seems to touch on some but shares space with language basics.
  • CRAN. Incredibly diverse libraries (list of packages) for statistics, graphics, and even Epidemiology (probably overkill for my present needs).
  • Mature and well-designed. R has been growing as the de facto FOSS statistical/graphical language for over a decade (inception in 1993, in the golden age).  It has grown up from the learnings of its ancestor S (shouldn’t R then be T??; I guess similar to C->B->A progression), which came onto the scene circa 1976 (a good year :-) .
  • Widely used. A survey of university statistics courses and Public Health curricula shows R as a prevalent tool; e.g., UC Berkeley and Iowa State.
  • UNIX-friendly. I was amused to see that R is more apt to borrow names from UNIX commands than from other languages.  The commands for managing a namespace are rm and ls, which are easy for me to remember. :-)   And I’m glad to see it has good roots.
  • Trivial Ubuntu installation. Try this: apt-get install r-recommended r-cran-<tab><tab>
  • Script-friendly. Rscript enables R to act as a scripting language.
  • PDX-visualization group. This month we’re starting up a group to discuss visualization technology/advances, and R will be the primary language under discussion.  My statistician friend, Ed, won’t stop talking about R, and he’s someone I’ve come to listen intently to.
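
For a baseline on the Python side, here is roughly what the table-slurping and whole-column operations from the I/O and High-level bullets look like using only the standard library.  This is a sketch, not a benchmark: the sample data, the column names, and the helpers read_table and numeric are all made up for illustration.  In R the equivalent would be along the lines of read.table() with a header, followed by mean() on a column.

```python
import io
import statistics

# Hypothetical sample data: a whitespace-delimited table with a header row.
SAMPLE = """\
city      queries  latency_ms
portland  120      35.5
seattle   95       41.0
eugene    30       28.5
"""


def read_table(stream):
    """Read a whitespace-delimited table with a header row into a dict of columns."""
    header = stream.readline().split()
    columns = {name: [] for name in header}
    for line in stream:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        for name, value in zip(header, fields):
            columns[name].append(value)
    return columns


def numeric(column):
    """Convert a column of strings to floats."""
    return [float(value) for value in column]


table = read_table(io.StringIO(SAMPLE))
latency = numeric(table["latency_ms"])
queries = numeric(table["queries"])

# Plain Python has no vectorized arithmetic, so a whole-column
# operation is spelled out as a comprehension.
latency_s = [ms / 1000 for ms in latency]
mean_latency = statistics.mean(latency)
total_queries = sum(queries)

# Formatting the output as a text table is also manual.
print(f"{'city':<10}{'latency_s':>10}")
for city, secs in zip(table["city"], latency_s):
    print(f"{city:<10}{secs:>10.4f}")
print(f"mean latency (ms): {mean_latency:.1f}")
print(f"total queries: {total_queries:.0f}")
```

The parsing, type conversion, per-column loop, and output formatting are all explicit here, which is the overhead the bullets above suggest R takes care of for you.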

It’s a bit early to be writing an article on a language I haven’t done much with yet.  But I had to explore what’s pushing me towards R.  Looks pretty compelling now, so I’m very close to diving in (actually using it in a project).  I’ve sprinkled a number of resources throughout this post which I hope will be helpful for newcomers to R (including myself).  I should also mention that the best articles I’ve seen on introducing R are by David Mertz (a very capable Pythonista): Statistical Programming with R, Part I and Part II.

I can’t ignore what’s been growing out of SciPy, especially the maps.  And matplotlib’s gallery is incredible (and I have some of them working).  The above items are all great features of R, but if I can accomplish them nearly as well in Python I probably shouldn’t invest too much into R.  At this point it does appear that R offers some facilities beyond Python.  Have you worked with both Python and R and found compelling reasons to prefer R?  For which types of tasks?

My pain points with Python today are in analyzing query data (ad hoc in loops and heuristics), and creating a bunch of different formats to send to various third-party visualizations.  My analysis is going to need to get more sophisticated, and I’d like to be able to look at data visually quickly with less overhead.

I’m probably overstating the investment required in getting up and running with R; it’s supposed to be easy to pick up.  I’d really like to get to being comfortable with applying non-trivial statistics, which means working through some R books.  I’ve made it this far (and now so have you), so I’m going for it, hoping it will be a pretty quick learn.  Please feel free to share your experiences with learning and using R!


6 Responses to “Considering R as a Python Supplement”

  1. Parand says:

    I also intended to take a brief look at R, but I started using SciPy and matplotlib and it turned out to have more functionality than I'll need, and was in Python, so I never made it to R. I'm happily using SciPy.

  2. Charlie Roosen says:

    Take a look at rpy2 (http://rpy.sourceforge.net/rpy2.html). You can use it to stick with Python as your main language and start using R for statistics/graphics as needed.

  3. @Charlie: Thanks for the rpy2 link. I was reading about rpy but nice to know there's a second version (which I hadn't noticed). I will certainly have a need to be integrating the languages. This could make bridging the two feasible for a lot of folks and go a long way in helping R adoption, if Pythonistas are convinced that R's features are worth the combination effort (which doesn't look too bad).

  4. Micah – Great overview. As an avid R user for a number of years, I can tell you the hard truth that R is far from perfect. Its syntax can be frustrating. On occasion, matrices become vectors without notice. But, like you, having munged large data sets in a variety of languages — Perl, Python, Matlab, Mathematica — I find R is the most capable at slicing and dicing data. For me, no other language allows so many ways to index into a data matrix. With R, unique character names, numeric indices, boolean values all work.

    Finally, it's what R can do after the data is sorted out. Few tools offer such high-level functionality for statistical analysis and graphics visualization.

  5. @Parand: Wonder how many of us are in that boat. I still think it will be worth broadening the horizon to see what's out there. I don't expect to hear from a lot of people who have done both. I'll plan to report on how compelling it turns out to be over the various Python facilities (which are obviously many).

  6. @Michael: Thanks for sharing those insights. Yes, I'm finding the "after" features to be really impressive. And having a lot of fun with some of the free literature and tutorials now. I'm looking forward to more of your illuminating posts (on your blog). I'd be really interested in making it to one of your Bay Area R group meetings if I can time a trip right. We were getting our own R study group into the works here in PDX, but recently switched it to the more general PDX-data-viz. We'll be covering R quite a bit there, first meeting Monday.
