This document provides guidance on starting a new project in R and introduces several frequently used, and tremendously helpful, R commands for navigating and evaluating your data.
We assume you are using RStudio to interact with R. If not, we highly recommend you install it (installation requires administrative privileges), as it provides a trove of helpful features for working with R.
Typically, it’s easiest to manage the R code and files (e.g., data) associated with a given project/study in its own R project. You can start an R project using an existing directory of files or allow RStudio to create a new directory for you.
R project creation involves the following steps:
In the File menu, select New Project...
Create Project
We won’t take the project creation discussion any further, but R projects allow you to keep the code and data related to a project organized and readily accessible.
Some useful (and short) introductory videos to R (in RStudio) are available here.
Like any programming language, R expects things to be formatted in a specific way, and many (most?) of the “errors” that occur after you execute a command in R are due to punctuation errors. So, if (when) something doesn’t work as you expected, the first line of defense is to check the punctuation of the command you just ran.
Another common mistake is confusing the names of variables in a data frame with the name of an object you’ve created.
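Both kinds of mistake can be reproduced safely. The sketch below is only an illustration: it uses a small made-up data frame (toy, a hypothetical name) so it runs without any extra packages, and it wraps the failing commands in tryCatch so the script keeps going after each error.

```r
# Toy data frame (hypothetical), so the example runs without extra packages
toy <- data.frame(model = c("a4", "civic", "jetta"),
                  hwy   = c(29, 33, 29))

# Mistake 1: a missing closing parenthesis is a syntax error.
# parse() only checks the syntax of the text; it does not run it.
syntax_msg <- tryCatch(
  { parse(text = "mean(c(1, 2, 3)"); "no error" },
  error = function(e) "syntax error"
)
syntax_msg

# Mistake 2: hwy is a column of toy, not an object in the workspace,
# so asking for it by itself fails with an error message...
lookup_msg <- tryCatch(
  { mean(hwy); "no error" },
  error = function(e) conditionMessage(e)
)
lookup_msg

# ...while naming the data frame works.
mean(toy$hwy)
```

In everyday use you would not wrap commands in tryCatch; it is used here only so that both errors can be shown in one runnable script.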
There are a handful of functions that are quite useful for checking the structure/format/quality of data that we import into (or modify in) R.
Let’s walk through them using a data set available in the ggplot2 package containing information on the fuel economy of various cars. The data set is called mpg. Here’s how you get access to it. By the way, pound signs (#) at the beginning of lines indicate comments; R doesn’t run these lines.
# First install it if you don't have it. I already have it, so I skip this step
# install.packages("ggplot2")
# Then, load the library into the current session
library(ggplot2)
The first function, str, shows you the structure of the data — how many variables, what kind of variables, etc.
str(mpg)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
So, mpg is a data.frame with 234 observations (rows) of 11 variables. Some variables are factor variables (i.e., categorical variables), some are integer variables, and one (displ) is a numeric variable. This is a good first step to make sure the data has the variables (columns) and format you expected.
The second function, summary, does just what it says — provides a summary of the object. In the case of a data frame, it will summarize each column depending on the format of that column. I only request the summary for the first few variables so that it prints nicely.
summary(mpg[, 1:6])
## manufacturer model displ year cyl trans
## dodge :37 caravan 2wd : 11 Min. :1.600 Min. :1999 Min. :4.000 auto(l4) :83
## toyota :34 ram 1500 pickup 4wd: 10 1st Qu.:2.400 1st Qu.:1999 1st Qu.:4.000 manual(m5):58
## volkswagen:27 civic : 9 Median :3.300 Median :2004 Median :6.000 auto(l5) :39
## ford :25 dakota pickup 4wd : 9 Mean :3.472 Mean :2004 Mean :5.889 manual(m6):19
## chevrolet :19 jetta : 9 3rd Qu.:4.600 3rd Qu.:2008 3rd Qu.:8.000 auto(s6) :16
## audi :18 mustang : 9 Max. :7.000 Max. :2008 Max. :8.000 auto(l6) : 6
## (Other) :74 (Other) :177 (Other) :13
For factor variables, summary shows you the count of each level of the factor in the data set (e.g., 37 records of manufacturer = “dodge”). For integer and numeric variables, summary provides a mean and some quantiles. summary is thus useful to look for any gross errors in the range of values or possible misspecification of factor levels (e.g., a typo resulting in multiple levels for the same thing – “toyota” is not the same as “Toyota” to R), etc.
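To see how a capitalization typo shows up, here is a minimal sketch using a made-up factor (maker is a hypothetical name, not part of mpg). Converting to lower case is just one possible repair, and it assumes the levels differ only in capitalization.

```r
# Hypothetical manufacturer values with one capitalization typo
maker <- factor(c("toyota", "Toyota", "toyota", "honda"))

# summary counts "toyota" and "Toyota" as separate levels
summary(maker)
nlevels(maker)        # 3 levels instead of the intended 2

# One possible repair: convert to lower case, then rebuild the factor
maker_fixed <- factor(tolower(as.character(maker)))
summary(maker_fixed)
nlevels(maker_fixed)  # 2 levels
```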
You’ll notice that not all the levels of the factor variables are shown because there are too many to display. You can view the levels of a factor variable using the levels function. For example, to view all the manufacturers in the mpg data frame:
levels(mpg$manufacturer)
## [1] "audi" "chevrolet" "dodge" "ford" "honda" "hyundai" "jeep" "land rover" "lincoln" "mercury" "nissan" "pontiac" "subaru" "toyota" "volkswagen"
The $ operator is useful to extract a specific variable (column) from a data frame.
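As a small illustration of the $ operator, the sketch below uses a made-up data frame (toy is a hypothetical name) and compares $ with double-bracket indexing, which extracts the same column. It also shows that asking for a column that does not exist returns NULL rather than an error, which is worth knowing when a result is unexpectedly empty.

```r
# Toy data frame (hypothetical) standing in for a larger data set
toy <- data.frame(manufacturer = c("audi", "honda", "audi"),
                  cty          = c(18L, 28L, 21L))

toy$cty        # extract the cty column as a vector
toy[["cty"]]   # the same column, using double brackets

same_column <- identical(toy$cty, toy[["cty"]])
same_column    # TRUE

# A misspelled column name gives NULL, not an error
is.null(toy$city)
```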
But this only gives us the names of the levels for manufacturer. What if we want to know how many rows are associated with each manufacturer, similar to what was provided by the summary function? with and table functions to the rescue…
# Only show the first 8 to print nicely
with(mpg, table(manufacturer))[1:8]
## manufacturer
## audi chevrolet dodge ford honda hyundai jeep land rover
## 18 19 37 25 9 14 8 4
# This is equivalent using the $ operator
# table(mpg$manufacturer)
with tells R we’re going to use a specific data frame so we can refer directly to variable (column) names (manufacturer) without using the $ operator (mpg$manufacturer). table simply tabulates the number of records (rows) for each level of the specified column.
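The two forms can be compared directly. This sketch uses a made-up data frame (toy is a hypothetical name) so it runs without ggplot2. The counts agree; the only difference is that the with version keeps the column name as a label on the table.

```r
# Toy data frame (hypothetical)
toy <- data.frame(manufacturer = c("audi", "honda", "audi", "subaru"))

t_with   <- with(toy, table(manufacturer))
t_dollar <- table(toy$manufacturer)

t_with     # labelled with the column name "manufacturer"
t_dollar   # same counts, without the label

# The counts themselves are the same
all(as.vector(t_with) == as.vector(t_dollar))
```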
You can pass multiple columns to the table function as well. Subaru likes 4-cylinder engines!
with(mpg, table(manufacturer, cyl))
## cyl
## manufacturer 4 5 6 8
## audi 8 0 9 1
## chevrolet 2 0 3 14
## dodge 1 0 15 21
## ford 0 0 10 15
## honda 9 0 0 0
## hyundai 8 0 6 0
## jeep 0 0 3 5
## land rover 0 0 0 4
## lincoln 0 0 0 3
## mercury 0 0 2 2
## nissan 4 0 8 1
## pontiac 0 0 4 1
## subaru 14 0 0 0
## toyota 18 0 13 3
## volkswagen 17 4 6 0
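A two-way table can also be stored and queried like any other object. The sketch below is an illustration on a made-up data frame (toy and tab are hypothetical names); it pulls a single count out of the table by its row and column labels.

```r
# Toy data frame (hypothetical)
toy <- data.frame(
  manufacturer = c("subaru", "subaru", "audi", "audi", "audi"),
  cyl          = c(4, 4, 4, 6, 6)
)

tab <- with(toy, table(manufacturer, cyl))
tab

# Index a single cell by row label and column label (labels are text)
tab["subaru", "4"]   # 2 subaru rows with 4 cylinders
tab["subaru", "6"]   # 0

dim(tab)             # 2 manufacturers by 2 cylinder values
```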
Other useful functions relevant to data frames are names, nrow, and ncol. They do what you might expect, returning the column names, the number of rows, and the number of columns, respectively.
names(mpg)
## [1] "manufacturer" "model" "displ" "year" "cyl" "trans" "drv" "cty" "hwy" "fl" "class"
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
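A related base R function, dim, returns the number of rows and columns together. The sketch below uses a made-up data frame (toy is a hypothetical name) and also shows one way these functions might be used as a quick sanity check after importing data; the expected values in the check are assumptions for this toy example.

```r
# Toy data frame (hypothetical)
toy <- data.frame(id = 1:4, grp = c("a", "b", "c", "d"))

names(toy)
dim(toy)    # rows first, then columns: 4 2

# A quick check that the data has the shape we expected
stopifnot(nrow(toy) == 4, ncol(toy) == 2)
```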
Another useful function to get a feel for a data frame is head. This will show you, by default, the first 6 rows of the data frame. You can change the number of rows that are printed.
head(mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
# To change the number of rows...
# head(mpg, 10) # would display 10 rows
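The companion function tail shows the last rows in the same way. A minimal sketch on a made-up data frame (toy is a hypothetical name):

```r
# Toy data frame (hypothetical) with 20 rows
toy <- data.frame(id = 1:20, value = (1:20)^2)

first3 <- head(toy, 3)     # first 3 rows
last3  <- tail(toy, 3)     # last 3 rows
first3
last3

# A negative n means "all but the last n rows"
most <- head(toy, -15)     # drops the last 15 rows, leaving the first 5
nrow(most)
```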
Lastly, although there are many more useful functions (see this cheat sheet, for example), ls() lists all the objects that are present in the current working environment. We haven’t created any objects yet, so none show up.
ls()
## character(0)
But, if I create an object (in this case, a vector x assigned the value 100), ls() lets us know it’s there.
x <- 100 # assignment of values to objects in R uses this "backwards arrow"
x # Prints x to the console
## [1] 100
ls()
## [1] "x"
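Objects can be removed again with rm, after which ls() no longer lists them. The sketch below checks for x with %in% rather than printing the whole listing, since your own workspace may contain other objects.

```r
x <- 100
had_x <- "x" %in% ls()   # TRUE: x is in the workspace
had_x

rm(x)                    # remove x from the workspace
has_x <- "x" %in% ls()   # FALSE: x is gone
has_x
```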
Invariably, you’re going to run into problems (e.g., error messages, warnings). But there is lots of R help at your fingertips, and much of it is useful! For example:
?function will give you the help page of a function or data set (e.g., ?head, ?summary)
example(function) asks R to run the example(s) from that function’s help page
??term will search for “term” in the help pages of all installed packages (e.g., ??split)
RSiteSearch("term") will search the R site for your search term (e.g., RSiteSearch("plyr"))
RSiteSearch("{search phrase}") will search for an exact phrase (e.g., RSiteSearch("{mixed model}"))