This document offers guidance on starting a new project in R and introduces several frequently used, and tremendously helpful, R commands for navigating and evaluating your data.

We assume you are using RStudio to interact with R. If not, we highly recommend you install it (it will require administrative privileges), as it provides a trove of helpful features relevant to using R.

Getting started

Typically, it’s easiest to manage the R code and files (e.g., data) associated with a given project/study in its own R project. You can start an R project using an existing directory of files or allow RStudio to create a new directory for you.

R project creation involves the following steps:

  1. Open RStudio
  2. Under File, select New Project...
  3. Choose between using a new or existing directory
    • if an existing directory, navigate to and select the folder
  4. Press Create Project

We won’t take the project creation discussion any further, but R projects allow you to keep the code and data related to a project organized and readily accessible.
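As a small illustration of that accessibility: when a project is open, RStudio uses the project folder as the working directory, so files can be referenced with short relative paths. The sketch below assumes a hypothetical data subfolder and file name; neither comes from this document.

```r
# Where is R currently looking for files? With an RStudio project open,
# this is the project folder.
getwd()

# Build a path relative to the working directory.
# "data" and "my_data.csv" are hypothetical names used only for illustration.
data_path <- file.path("data", "my_data.csv")
data_path

# Check whether the file is really there before trying to read it
file.exists(data_path)
```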

Video tutorials

Some useful (and short) introductory videos to R (in RStudio) are available here.

Common errors (i.e., frustrations) when using R

Like any programming language, R expects things to be formatted in a specific way, and many (most?) of the “errors” that occur after you execute a command in R are due to punctuation errors. So, if (when) something doesn’t work as you expected, the first line of defense is to check the punctuation of the command you ran.

Another common mistake is confusing the names of variables in a data frame with the name of an object you’ve created.
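The sketch below illustrates both kinds of mistake. The broken commands and the names fuel and hwy_mpg are invented for illustration and are not an exhaustive list of what can go wrong. Each broken command is stored as text and only checked for valid syntax, so the script keeps running.

```r
# Each string holds a deliberately broken command. parse() checks the syntax
# without running anything, and try() keeps the error from stopping the script.
broken <- c(
  "mean(c(1, 2, 3)",   # missing closing parenthesis
  "c(1 2, 3)",         # missing comma
  "label <- 'low"      # missing closing quote
)
is_valid <- vapply(
  broken,
  function(code) !inherits(try(parse(text = code), silent = TRUE), "try-error"),
  logical(1)
)
is_valid  # FALSE for every broken command

# Confusing a column name with an object name:
# hwy_mpg is a column inside fuel, not an object of its own.
fuel <- data.frame(hwy_mpg = c(29, 31, 26))  # hypothetical data
msg <- tryCatch(mean(hwy_mpg), error = function(e) conditionMessage(e))
msg                 # the error message R would have shown
mean(fuel$hwy_mpg)  # works: the column is reached through the data frame
```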

Useful R functions for checking your data

There are a handful of functions that are quite useful for checking the structure/format/quality of data that we import into (or modify in) R.

Let’s walk through them using a data set available in the ggplot2 package containing information on the fuel economy of various cars. The data set is called mpg. Here’s how you get access to it. By the way, pound signs (#) at the beginning of lines indicate comments; R doesn’t run these lines.

# First install it if you don't have it.  I already have it, so I skip this step
# install.packages("ggplot2")

# Then, load the library into the current session
library(ggplot2)
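If you would rather not comment and uncomment the install line by hand, one option is to install only when the package is missing. This is a minimal sketch; load_or_install is a hypothetical helper name, and it is demonstrated with the stats package (which ships with R) so that nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# "stats" is always available, so this call only loads it
load_or_install("stats")
```

Calling load_or_install("ggplot2") would behave the same way, installing ggplot2 first if it is not already on your machine.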

The first function, str, shows you the structure of the data — how many variables, what kind of variables, etc.

str(mpg)
## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

So, mpg is a data.frame with 234 observations (rows) of 11 variables. Some variables are factor variables (i.e., categorical variables), some are integer variables, and one (displ) is a numeric variable. This is a good first step to make sure the data has the variables (columns) and format you expected.
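Depending on your version of ggplot2, str(mpg) may report the text columns as chr (character) rather than Factor; recent releases ship mpg with character columns. Functions that rely on factor levels (such as levels, used later in this document) then return NULL until the column is converted. The sketch below uses a small made-up data frame, cars_small, to show the conversion.

```r
# A small hypothetical data frame with a character column
# (stringsAsFactors = FALSE keeps the text as character on any R version)
cars_small <- data.frame(
  manufacturer = c("audi", "audi", "toyota"),
  cty = c(18L, 21L, 20L),
  stringsAsFactors = FALSE
)
str(cars_small)                   # manufacturer is reported as chr
levels(cars_small$manufacturer)   # NULL: character vectors have no levels

# Convert the column to a factor so that levels() and the factor-style
# summary() output become available
cars_small$manufacturer <- factor(cars_small$manufacturer)
levels(cars_small$manufacturer)
```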

The second function, summary, does just what it says — provides a summary of the object. In the case of a data frame, it will summarize each column depending on the format of that column. I only request the summary for the first few variables so that it prints nicely.

summary(mpg[, 1:6])
##      manufacturer                 model         displ            year           cyl               trans   
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999   Min.   :4.000   auto(l4)  :83  
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000   manual(m5):58  
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004   Median :6.000   auto(l5)  :39  
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004   Mean   :5.889   manual(m6):19  
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000   auto(s6)  :16  
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008   Max.   :8.000   auto(l6)  : 6  
##  (Other)   :74    (Other)            :177                                                  (Other)   :13

For factor variables, summary shows you the count of each level of the factor in the data set (e.g., 37 records of manufacturer = “dodge”). For integer and numeric variables, summary provides a mean and some quantiles. summary is thus useful to look for any gross errors in the range of values or possible misspecification of factor levels (e.g., a typo resulting in multiple levels for the same thing – “toyota” is not the same as “Toyota” to R), etc.
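To make the typo scenario concrete, here is a minimal sketch using a made-up factor, makes, that contains both “toyota” and “Toyota”. Lowercasing is only one possible fix and assumes the levels differ in capitalization alone.

```r
# Hypothetical data with an accidental capital letter
makes <- factor(c("toyota", "Toyota", "dodge", "toyota"))
summary(makes)   # "Toyota" and "toyota" are counted as separate levels
levels(makes)

# One possible fix: standardize the case, then rebuild the factor
makes_clean <- factor(tolower(as.character(makes)))
summary(makes_clean)
```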

You’ll notice that not all the levels of the factor variables are shown because there are too many to display. You can view the levels of a factor variable using the levels function. For example, to view all the manufacturers in the mpg data frame:

levels(mpg$manufacturer)
##  [1] "audi"       "chevrolet"  "dodge"      "ford"       "honda"      "hyundai"    "jeep"       "land rover" "lincoln"    "mercury"    "nissan"     "pontiac"    "subaru"     "toyota"     "volkswagen"

The $ operator is useful to extract a specific variable (column) from a data frame.
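A short sketch of column extraction, using a small made-up data frame named fleet. Double square brackets with the column name in quotes are an equivalent alternative to $.

```r
# Hypothetical data frame for illustration
fleet <- data.frame(
  model = c("a4", "civic", "jetta"),
  hwy = c(29L, 33L, 29L)
)

fleet$hwy        # extract the hwy column as a vector
fleet[["hwy"]]   # same result, with the column name given as a string

# Both forms return the same vector
identical(fleet$hwy, fleet[["hwy"]])

# The extracted vector can be passed straight to other functions
mean(fleet$hwy)
```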

But this only gives us the names of the levels for manufacturer. What if we want to know how many rows are associated with each manufacturer, similar to what was provided by the summary function? The with and table functions to the rescue…

# Only show the first 8 to print nicely
with(mpg, table(manufacturer))[1:8]
## manufacturer
##       audi  chevrolet      dodge       ford      honda    hyundai       jeep land rover 
##         18         19         37         25          9         14          8          4
# This is equivalent using the $ operator
# table(mpg$manufacturer)

with tells R we’re going to use a specific data frame so we can refer directly to variable (column) names (manufacturer) without using the $ operator (mpg$manufacturer). table simply tabulates the number of records (rows) for each level of the specified column.
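A self-contained sketch of the same idea on a small made-up data frame, fleet. It also sorts the counts, which is an addition not covered above but often handy when there are many levels.

```r
# Hypothetical data frame for illustration
fleet <- data.frame(
  manufacturer = c("audi", "dodge", "dodge", "toyota", "dodge", "audi"),
  cyl = c(4, 6, 8, 4, 8, 6)
)

# Count rows per manufacturer, referring to the column by name inside with()
counts <- with(fleet, table(manufacturer))
counts

# The $ form produces the same counts
table(fleet$manufacturer)

# Sort so the most common manufacturer comes first
sorted_counts <- sort(counts, decreasing = TRUE)
sorted_counts
```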

You can pass multiple columns to the table function as well. Subaru likes 4 cylinder engines!

with(mpg, table(manufacturer, cyl))
##             cyl
## manufacturer  4  5  6  8
##   audi        8  0  9  1
##   chevrolet   2  0  3 14
##   dodge       1  0 15 21
##   ford        0  0 10 15
##   honda       9  0  0  0
##   hyundai     8  0  6  0
##   jeep        0  0  3  5
##   land rover  0  0  0  4
##   lincoln     0  0  0  3
##   mercury     0  0  2  2
##   nissan      4  0  8  1
##   pontiac     0  0  4  1
##   subaru     14  0  0  0
##   toyota     18  0 13  3
##   volkswagen 17  4  6  0
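Two-way tables can be extended with totals or converted to proportions. The sketch below uses a small made-up data frame, fleet, rather than mpg; addmargins and prop.table are standard R functions not otherwise covered in this document.

```r
# Hypothetical data frame for illustration
fleet <- data.frame(
  manufacturer = c("audi", "dodge", "dodge", "toyota", "dodge", "audi"),
  cyl = c(4, 6, 8, 4, 8, 6)
)

# Cross-tabulate manufacturer by number of cylinders
tab <- with(fleet, table(manufacturer, cyl))
tab

# Add row and column totals
tab_totals <- addmargins(tab)
tab_totals

# Proportion of each manufacturer's rows in each cylinder class
# (margin = 1 makes each row sum to 1)
row_props <- prop.table(tab, margin = 1)
row_props
```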

Other useful functions relevant to data frames are names, nrow, and ncol. They do what you might expect, returning the column names, number of rows, and number of columns, respectively.

names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"          "trans"        "drv"          "cty"          "hwy"          "fl"           "class"
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
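If you want both counts at once, dim returns the number of rows and columns together. A brief sketch on a made-up data frame, fleet:

```r
# Hypothetical data frame for illustration
fleet <- data.frame(
  manufacturer = c("audi", "dodge", "toyota"),
  cyl = c(4L, 8L, 4L),
  hwy = c(29L, 17L, 31L)
)

# dim() returns rows first, then columns
fleet_dim <- dim(fleet)
fleet_dim

# It matches nrow() and ncol() taken separately
identical(fleet_dim, c(nrow(fleet), ncol(fleet)))
```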

Another useful function to get a feel for a data frame is head. This will show you, by default, the first 6 rows of the data frame. You can change the number of rows that are printed.

head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
# To change the number of rows...
# head(mpg, 10) # would display 10 rows
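The companion function tail shows the last rows instead, which helps catch problems at the end of an imported file. A minimal sketch on a made-up data frame, scores:

```r
# Hypothetical data frame with 10 rows
scores <- data.frame(id = 1:10, value = (1:10)^2)

# Last 6 rows by default
tail(scores)

# Last 3 rows only
last_three <- tail(scores, 3)
last_three
```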

Lastly, although there are many more useful functions (see this cheat sheet, for example), ls() lists all the objects that are present in the current working environment. We haven’t created any objects yet, so none show up.

ls()
## character(0)

But, if I create an object (in this case, a vector x assigned the value 100), ls() lets us know it’s there.

x <- 100 # assignment of values to objects in R uses this "backwards arrow"

x # Prints x to the console
## [1] 100
ls()
## [1] "x"
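Objects can be removed again with rm, after which ls() no longer lists them. A minimal sketch, using a hypothetical object name so that nothing you have already created is touched:

```r
# Create a throwaway object (temp_value is a hypothetical name)
temp_value <- 100
listed_before <- "temp_value" %in% ls()
listed_before

# Remove it from the working environment
rm(temp_value)
listed_after <- "temp_value" %in% ls()
listed_after
```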

Getting help in R

Invariably, you’re going to run into problems (e.g., error messages, warnings). But there is lots of R help at your fingertips, and much of it is useful! For example: