Purpose
My colleague and board member, Walt DeGrange, asked me to put together some thoughts about R at the junior, middle, and senior levels. The intended reader is an analyst who uses R as a tool to get things done, not someone who would become a professional R coder to the exclusion of their job.
Having this bit of homework made me stop to reflect on my own journey with R, and what appropriate bits of wisdom I would want to impart at various places along the trail.
But First… What is Data Science?
We define Data Science as the edge that connects “Statistics” with “Computer Programming”. Therefore, an effective Data Scientist is either a computer programmer who has learned Statistics beyond the entry level, or (in my case) a Statistician who has learned to code to support their work.
Why R?
Before covering why one should become proficient at R, we should probably touch on the strengths and weaknesses of the platform, and when it is (or isn’t) the right choice.
R is a computer programming language purpose-built for statistics. This means that if you are going to interact with R, you will be writing code. If this makes you uncomfortable, or you don’t think you would ever be a coder, then you should probably choose another platform. R is free, and - along with packages - can be downloaded from CRAN, which is the “Comprehensive R Archive Network”.
There are many contributors to R, and many professional users. In addition to the base language, there are both packages (see below) and IDEs (Integrated Development Environments). There are several IDEs to choose from; the remainder of this piece will focus on RStudio.
Disadvantages
Having said this, R has some disadvantages and places where it probably isn’t the right tool. A few that come to mind (not exhaustive) are:
- If the dataset is small, and what needs to be done can be done quickly in a native tool, such as Excel
- If the delivery method is a Microsoft product and the editors will want to change the graphs (R graphics objects are not rendered in MS-editable forms)
- If you hate to code
- If R won’t run on your machine
Having discussed these caveats, we’re now ready to think about the lessons that an R user should have at every level. In the sections that follow I am going to restrict myself to a few sentences per issue. This is harder than it seems.
The Skill Recommendations
Basic
The entry-level R coder will need to figure out how to get R installed (as well as the IDE) and perform some basic tasks. Beyond this, there are some habits that will pay off throughout the user's later development if they are learned at the start.
Interacting with the IDE
Perhaps the first thing that a user needs to learn is how to leverage the IDE to its maximum advantage. For a standard R + RStudio setup this includes using some of the management functions, like History, and data import/export.
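The import/export buttons in the IDE generate ordinary R calls behind the scenes, so it helps to see what a round trip looks like in code. The following is a minimal sketch using base R only; the data frame, its column names, and the temporary file path are made up for illustration.

```r
# A small, made-up data frame to export (names and values are illustrative only)
scores <- data.frame(
  analyst = c("A", "B", "C"),
  hours   = c(12.5, 8, 10),
  stringsAsFactors = FALSE
)

# Export to a CSV file; a temporary path is used here so nothing is left behind
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Import the same file back into a new data frame
scores_in <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure of what was read
str(scores_in)
```

In real work the path would point at a project file rather than a temporary one, and the History pane will show these same calls after using the import dialog.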
Understanding Data Types
R has several different atomic data types and several different composite data types. One of the first things a new useR needs to understand is how to choose and interact with these structures. This is foundational and everything builds upon it.
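As a quick illustration of the distinction, the sketch below builds a few atomic vectors and then some of the composite structures made from them. All object names and values are invented for the example.

```r
# Atomic vectors: every element shares one type
n_obs   <- 3L                      # integer
heights <- c(1.62, 1.75, 1.80)     # double (numeric)
people  <- c("Ann", "Bo", "Cy")    # character
passed  <- c(TRUE, FALSE, TRUE)    # logical

typeof(n_obs)
typeof(heights)

# Mixing types in c() coerces everything to the most flexible type
mixed <- c(1, "two", TRUE)
typeof(mixed)                      # "character"

# Composite structures built from atomic vectors
grp    <- factor(c("low", "high", "low"))                    # categorical data
m      <- matrix(1:6, nrow = 2)                              # 2 x 3, one type
person <- list(name = "Ann", height = 1.62, passed = TRUE)   # mixed types allowed
df     <- data.frame(name = people, height = heights, passed = passed)

# str() is a handy way to check what you actually have
str(df)
```

The coercion in `mixed` is a common source of surprises for new users, which is one reason checking types with typeof() or str() is worth making a habit.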
Base Package Plots
Because we’d like to do something with the data other than just stare at the screen, the new user should understand how to use the base plot functions, such as plot(), boxplot() and pairs(). While the user will largely replace these functions with other plotting tools when they move to the intermediate level, they are the analyst’s first line of defense.
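Here is a minimal sketch of the three functions named above, using the mtcars and iris datasets that ship with R. Writing to a temporary PDF is an assumption made so the code runs without a display; in an interactive RStudio session you would normally drop the pdf() and dev.off() lines and see the plots in the Plots pane.

```r
# Open a PDF graphics device on a temporary file (one page per plot)
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)

# Scatter plot: fuel economy against weight
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")

# Box plot: fuel economy grouped by number of cylinders
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")

# Scatter plot matrix of the four numeric iris measurements,
# colored by species (integer codes 1 to 3 map to the default palette)
pairs(iris[, 1:4], col = as.integer(iris$Species))

# Close the device so the file is written
dev.off()
```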
Advanced Level
At the advanced level, the user will be able to rapidly gain insights from data. Some attributes (not all-inclusive) are:
Deploying your work
You’ve built your analysis, and now you want to share it. This requires learning tools such as Shiny, RMarkdown, R Notebooks and flexdashboard. These are integrated in the RStudio environment and feature one-click publishing. This isn’t hard, but you have to do it.
This document was rendered in RMarkdown
Models
Building models requires some basic understanding of statistics, which presumably you have since you’re reading a manual on coding in R! Model building and evaluation requires learning the modeling functions that ship with R, such as lm() and glm() in the stats package, and also functions from packages that must be loaded separately, such as gam() in mgcv and the recursive partitioning tools in rpart.
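To make this concrete, the sketch below fits a linear model and a logistic regression on the built-in mtcars data, compares two nested models, and makes a prediction. The choice of variables and the new_car values are purely illustrative, and the rpart step is guarded because that package may not be installed everywhere.

```r
# Linear regression: fuel economy as a function of weight and horsepower
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)

# Logistic regression: transmission type (am: 0 = automatic, 1 = manual) vs. weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit_glm)

# Evaluate: does adding hp improve on a weight-only model?
fit_small <- lm(mpg ~ wt, data = mtcars)
anova(fit_small, fit_lm)

# Predict for a hypothetical car (values chosen only for illustration)
new_car <- data.frame(wt = 3, hp = 110)
predict(fit_lm, newdata = new_car)

# Recursive partitioning, only if the rpart package is available
if (requireNamespace("rpart", quietly = TRUE)) {
  fit_tree <- rpart::rpart(Species ~ ., data = iris)
  print(fit_tree)
}
```

The same formula interface (response ~ predictors) carries across lm(), glm(), and rpart(), which is part of why learning it once pays off.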
Writing code that others can use
You will eventually want to write code that other people will adopt and use. I’m not talking about writing your own packages, but rather following good practice with variable names, and also building chunks in Markdown / Shiny in a way that makes them easy for others to edit/collaborate with.
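As one small example of what good practice with names can look like, compare the two versions below. Both do the same thing; the names, the cutoff value, and the helper function are hypothetical and chosen only to illustrate the point.

```r
# Harder for a collaborator to follow: terse names and an unexplained number
x <- mtcars
y <- x[x$mpg > 25, ]

# Easier to follow: names say what the objects hold, and the cutoff is named
min_mpg        <- 25   # hypothetical threshold for an "efficient" car
efficient_cars <- mtcars[mtcars$mpg > min_mpg, ]

# A small, named helper keeps a repeated step in one place
summarise_mpg <- function(cars) {
  c(n = nrow(cars), mean_mpg = mean(cars$mpg))
}

summarise_mpg(efficient_cars)
```

In a Markdown document, giving each chunk a single job and a descriptive label follows the same idea.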
Wrapping it all up
These 9 building blocks will not make one a complete R expert, but instead will put them in a position to continue their own growth without external assistance, and instill in them good habits for a productive career in Data Science.
