Food for Thought on R

Author

Collin Paschall

Why tidyverse?

As we approach the submission deadline for Problem Set 2, I want to step back a bit and offer a 30,000-foot view of the material we have covered in the course over the last several weeks.

When the course began, our goal was to get you oriented to the major components of any programming language: operators, variables, control flow, data types, data structures, file input/output and functions. From these foundations, you could theoretically write programs that can do almost any kind of data manipulation you want on spreadsheets/rectangular datasets with rows and columns.

What we’ve been doing in the last few weeks is introducing you to a specific set of tools that make your life much easier when you try to do these tasks in R. The tidyverse is the most common and best set of tools available for working with most kinds of data in R.

With the tidyverse, as you’ve seen, you can take a rectangular data set and extract and transform information in a huge (practically unlimited) number of ways. You can rename columns, recode data, calculate new values, and summarize data. When you add group operations, you can start to leverage hierarchical structures for your data analysis.
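To make that list concrete, here is a minimal sketch of those steps chained together with the pipe. The data frame and its column names (`scores`, `sec`, `pts`) are made up for illustration and are not taken from the problem set; I am also assuming each assignment is worth 10 points.

```r
library(dplyr)

# Hypothetical toy data: one row per student per assignment
scores <- tibble(
  student = c("Ana", "Ana", "Ben", "Ben", "Cy", "Cy"),
  sec     = c("A", "A", "A", "A", "B", "B"),
  pts     = c(8, 9, 6, 7, 10, 6)
)

section_summary <- scores %>%
  # Rename columns: new_name = old_name
  rename(section = sec, points = pts) %>%
  mutate(
    # Recode data: replace the section codes with readable labels
    section = recode(section, A = "Morning", B = "Evening"),
    # Calculate a new value: percent score, assuming 10 points per assignment
    pct = points * 10
  ) %>%
  # Group operation: everything after this runs once per section
  group_by(section) %>%
  # Summarize: collapse each group down to a single row
  summarize(mean_pct = mean(pct), n = n())

print(section_summary)
```

Each verb takes a data frame and returns a data frame, which is why the steps can be piped one into the next. Dropping the `group_by()` line would give one summary row for the whole data set instead of one per section.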

So, in today’s world, being able to analyze data in R really means transforming data and extracting insights from it using the tidyverse.

How am I supposed to remember this?

Here’s the thing to remember, though: the tidyverse is big. Unless you are at your desk programming absolutely every day, you are going to forget some of these functions. You’ll forget that certain kinds of transformations are possible, or you’ll forget how the arguments of the functions work. This is normal, and that’s why the most important thing you take from this class is not how to program in R, but rather how to read R documentation. The skill that matters is not being able to do this stuff in a closed-book test format, but rather knowing how to refresh yourself on R basics and the tidyverse functions when you need them.

So, going forward, here are the basic resources you should have a handle on, which might be helpful as you tackle Problem Set 2.

The tidyverse website
- If ever you are unsure about something, use the tidyverse website. It is a great, great resource, full of code snippets and examples. Learn to read the documentation.
The RStudio cheatsheets
- Most of the tidyverse packages have a cheatsheet maintained by RStudio (now Posit). These are amazing resources. If you cannot remember something that you know you’ve been introduced to, look here first for bite-sized refreshers.
r4ds
- Wickham’s book R for Data Science is the essential text for data management using R. Its table of contents is essentially a comprehensive list of everything you can do in the tidyverse, which by extension is almost everything you will practically need to do in R.
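
Much of the same documentation is also available without leaving the R console. The snippet below is a small sketch of the lookups I tend to reach for, using dplyr as the example package; the help-page calls are wrapped in `interactive()` only so the script can also run non-interactively, which is my own choice and not something R requires.

```r
library(dplyr)

# args() prints a function's signature right in the console
args(rename)

# formals() returns the arguments as a list, handy for a quick name check
mutate_args <- names(formals(mutate))
print(mutate_args)

# These calls open help pages, so run them from an interactive session
# such as the RStudio console
if (interactive()) {
  ?mutate                               # help page for a single function
  help("summarise", package = "dplyr")  # same idea, with the package spelled out
  help(package = "dplyr")               # index of the package's help topics
  vignette("dplyr")                     # long-form introduction shipped with dplyr
}
```

The help pages end with an Examples section; copying one of those examples and changing it a piece at a time is often the fastest way to remember how an argument behaves.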

Problem Set 2 is designed to make you use a series of piped tidyverse functions to manipulate data, using several different tidyverse packages. It also requires you to show that you know how to learn something new in R using a little Googling.

Good luck on PS 2.

A quick note about Python

Next week, we are going to dive into Python. Our efforts will be similar to what we did in R. We’ll do a lightning-fast review of programming basics in Python. They aren’t very different from R, except for some minor syntactic/formatting differences. Dispensing with those quickly, we’ll dive into pandas. pandas is essentially Python’s version of the tidyverse. Basically, the same transformations and data wrangling from the tidyverse are available to you in pandas, just with Python syntax. It’s a piece of cake, as you’ll see soon enough.