Purpose

My colleague and board member, Walt DeGrange, asked me to put together some thoughts about R at the junior, middle, and senior levels. The intended reader is an analyst who uses R as a tool to get things done, not someone who would become a professional R coder to the exclusion of their job.

Having this bit of homework made me stop to reflect on my own journey with R, and what appropriate bits of wisdom I would want to impart at various places along the trail.

But First… What is Data Science?

We define Data Science as the edge that connects “Statistics” with “Computer Programming”. Therefore, an effective Data Scientist is either a computer programmer who has learned Statistics beyond the entry level, or (in my case) a Statistician who has learned to code to support their work.

Why R?

Before covering why one should become proficient at R, we should probably touch on the strengths and weaknesses of the platform, and when it is (or isn’t) the right choice.

R is a computer programming language purpose-built for statistics. This means that if you are going to interact with R, you will be writing code. If this makes you uncomfortable, or you don’t think you would ever be a coder, then you should probably choose another platform. R is free, and both R and its packages can be downloaded from CRAN, the “Comprehensive R Archive Network”.
There are many contributors to R, and many professional users. In addition to the base language, there are packages (see below) and IDEs (Integrated Development Environments). There are several IDEs to choose from; the remainder of this piece will focus on RStudio.

Disadvantages

That said, R has some disadvantages, and there are situations where it probably isn’t the right tool. A few that come to mind (not exhaustive) are:

  1. If the dataset is small, and what needs to be done can be done quickly in a native tool, such as Excel.
  2. If the deliverable is a Microsoft product and the editors will want to change the graphs (R graphics objects are not rendered in MS-editable forms).
  3. If you hate to code.
  4. If R won’t run on your machine

Having discussed these caveats, we’re now ready to think about the lessons that an R user should learn at each level. In the sections that follow I am going to restrict myself to a few sentences per issue. This is harder than it seems.

The Skill Recommendations

Basic

The entry-level R coder will need to figure out how to get R installed (as well as the IDE) and perform some basic tasks. Beyond this, there are some habits that will benefit the user’s future development if they are learned at the beginning.

Interacting with the IDE

Perhaps the first thing that a user needs to learn is how to leverage the IDE to its maximum advantage. For a standard R + RStudio setup this includes using some of the management functions, like History, and data import/export.

Understanding Data Types

R has several different atomic data types and several different composite data types. One of the first things a new useR needs to understand is how to choose and interact with these structures. This is foundational and everything builds upon it.
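As a minimal sketch of the distinction, the snippet below builds a few atomic vectors and then combines them into composite structures. The variable names and values are made up for illustration.

```r
# Atomic vectors: every element has the same type.
counts  <- c(3L, 7L, 12L)            # integer
weights <- c(61.5, 72.0, 80.3)       # double (numeric)
names_  <- c("Ann", "Bob", "Cy")     # character
passed  <- c(TRUE, FALSE, TRUE)      # logical

# A factor stores categorical data as integer codes plus labels.
group <- factor(c("control", "treated", "treated"))

# A list can hold elements of different types and lengths.
record <- list(id = 1L, name = "Ann", scores = c(90, 85))

# A data frame is a list of equal-length columns.
people <- data.frame(name = names_, weight = weights,
                     passed = passed, group = group)

# Inspect what you have before you compute on it.
str(people)
typeof(weights)   # "double"
class(people)     # "data.frame"
```

Calling str() on an unfamiliar object is a cheap habit that often explains a confusing error before it happens.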

Base Package Plots

Because we’d like to do something with the data other than just stare at the screen, the new user should understand how to use the base plot functions, such as plot(), boxplot() and pairs(). While these functions will be largely superseded by other tools once the user moves to the intermediate level, they are the analyst’s first line of defense.
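A short sketch of the three functions named above, using the mtcars and iris datasets that ship with R. The axis labels and titles are my own choices, not anything prescribed.

```r
# mtcars and iris ship with R, so no data import is needed.

# Scatter plot: fuel economy against weight.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")

# Box plot: distribution of mpg for each cylinder count.
# boxplot() also returns the summary statistics it drew, invisibly.
mpg_by_cyl <- boxplot(mpg ~ cyl, data = mtcars,
                      xlab = "Cylinders", ylab = "Miles per gallon")

# Scatter plot matrix: pairwise relationships among the numeric columns.
pairs(iris[, 1:4], main = "Iris measurements")
```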

Intermediate Level

At the intermediate level, the user should be able to begin building their own code. The key enablers here are:

Finding and using other people’s code

Most things that a user would want to do have already been done - and published - by someone else. The user at this stage should be adept at finding new packages on CRAN, installing them, and integrating them into their own code.
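The install-and-load cycle might look like the sketch below. The helper ensure_package() is hypothetical (it is not part of R); it simply wraps the real functions requireNamespace() and install.packages(). The package named in the commented lines is only an example.

```r
# Hypothetical helper (not part of R): install a package only if it is missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from the CRAN mirror set for the session
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers a download.
ensure_package("stats")

# Typical interactive workflow for a CRAN package (needs network access):
# install.packages("ggplot2")    # one-time install
# library(ggplot2)               # attach it in each session
# help(package = "ggplot2")      # browse its documentation index
```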

Finding Help

Your colleagues will rapidly become frustrated with you if you are always knocking on their doors. R has lots of online resources available, but you have to know where to find them. This isn’t just googling ‘stack overflow’, but rather the fine art of writing a query that will actually find you help. Examples:

**Bad: I HATE GGPLOT!! IT DOESN’T WORK!!!**

Good: r ggplot2 rotate x axis factors

Applying a Function across a Vector and Program Flow

Real R programmers will tell you that you should never write a loop, but in practice you end up writing one every now and again for repetitive tasks. The user at this stage should know the options for applying a function across a vector: explicit loops, the lapply() family of functions, and dplyr.
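To make the options concrete, here is a small sketch that squares a vector three ways in base R and then computes a grouped summary without a loop. The data are made up, and dplyr is left out only because it needs a separate install; its group_by() and summarise() functions cover the same grouped-summary ground.

```r
values <- c(2, 4, 6, 8)

# 1. Explicit loop: preallocate the result, then fill it in.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# 2. Apply family: vapply() calls the function on each element and
#    checks that every result is a single number.
squares_apply <- vapply(values, function(x) x^2, numeric(1))

# 3. Vectorised arithmetic: often the simplest option of all.
squares_vec <- values^2

# Grouped summary without a loop: mean mpg for each cylinder count.
mean_mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
```

All three approaches give the same answer here; the choice is mostly about readability and how naturally the task maps onto each form.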

An experienced R user will see that the capstone of this level is to install and effectively use the ‘Hadleyverse’ (the collection of packages now known as the tidyverse).

Advanced Level

At the advanced level, the user will be able to rapidly gain insights from data. Some attributes (not all-inclusive) are:

Deploying your work

You’ve built your analysis, and now you want to share it. This requires learning tools such as Shiny, RMarkdown, R Notebooks and flexdashboard. These are integrated into the RStudio environment and feature one-click publishing. This isn’t hard, but you have to do it.

This document was rendered in RMarkdown

Models

Building models requires some basic understanding of statistics, which presumably you have since you’re reading a manual on coding in R! Model building and evaluation require learning the modeling functions that ship with R, such as lm() and glm() in the stats package, as well as functions from add-on packages, such as gam() in mgcv and rpart() in rpart (Recursive Partitioning).
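As a minimal sketch of the fit, inspect, predict cycle, the snippet below fits a linear and a logistic regression to the mtcars dataset that ships with R. The choice of predictors and the new observation are made up purely for illustration.

```r
# Linear regression: mpg as a function of weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)    # coefficients, standard errors, R-squared

# Logistic regression: probability of a manual transmission (am = 1).
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit_glm)

# Predict for a new observation (values are made up for illustration).
new_car <- data.frame(wt = 3.0, hp = 120)
pred_mpg  <- predict(fit_lm, newdata = new_car)
pred_prob <- predict(fit_glm, newdata = new_car, type = "response")
```

The same formula-plus-data pattern carries over to most modeling packages, which is a large part of why learning lm() first pays off.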

Writing code that others can use

You will eventually want to write code that other people will adopt and use. I’m not talking about writing your own packages, but rather following good practice with variable names, and also building chunks in Markdown / Shiny in a way that makes them easy for others to edit and collaborate on.
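One way to picture the difference is the sketch below, which contrasts a cryptic one-liner with a version that a colleague could pick up. The function name, argument names, and threshold are all hypothetical.

```r
# Harder to share: cryptic names and an unexplained constant.
# f <- function(x) x[x > 2.5]

# Easier to share: descriptive names, documented arguments, an input check.

# Keep only the measurements above a threshold.
#   measurements: numeric vector of readings
#   threshold:    single number; readings at or below it are dropped
filter_above_threshold <- function(measurements, threshold = 2.5) {
  stopifnot(is.numeric(measurements), length(threshold) == 1)
  measurements[measurements > threshold]
}

petal_lengths <- iris$Petal.Length
long_petals   <- filter_above_threshold(petal_lengths, threshold = 5)
length(long_petals)
```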

Wrapping it all up

These 9 building blocks will not make anyone a complete R expert, but they will put the user in a position to continue growing without external assistance, and instill good habits for a productive career in Data Science.

