Developed by Dan Holstein
Some exercises adapted from those developed by J. Irrison
The goal of these exercises is to introduce you to R and RStudio. R is the name of the programming language and architecture, and RStudio is a piece of software that makes working in R convenient and straightforward. And they’re both free!
The first step is to download the software onto your personal computers. To download R visit this website: http://cran.rstudio.com/index.html
Select your platform, and download the BASE (Windows) or the latest package (Mac).
After installation, visit this website to download RStudio: http://www.rstudio.com/ide/download/desktop
Select the download for your operating system, download and install.
Take a minute to investigate the window panels in RStudio.
I’ve rearranged them a bit, and you can too. On the upper right are the Environment and History windows. The Environment stores your variables and datasets. The History window is pretty straightforward, and records the commands you make.
When you generate plots and look up help files, they appear on the lower right.
The console is on the lower left. When you start RStudio it will display text letting you know what version of R you are running. This window is where you can type commands. If you open an RScript it will appear in the upper left.
Staying organized and knowing where your stuff is located is crucial to getting comfortable in R. RStudio always works out of a directory. Unless you tell it where you want to work, it will use a default directory (typically your home or Documents folder, depending on your settings). That’s not ideal, because then all of your stuff - your scripts, your data output, everything - gets saved into one jumbled directory.
Just FYI, you can always tell where you are working from by looking at the top of the RStudio window. If it just says “RStudio”, then no project is open and you are working out of that default directory.
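You can also ask R directly where it is working. This is a minimal sketch using the base R functions getwd() and list.files(); the path you see will depend on your own machine.

```r
# Ask R which directory it is currently working out of
wd <- getwd()
print(wd)

# List the files and folders that live in that directory
print(list.files(wd))
```

Once you have opened a project, running getwd() again should show the project's directory instead.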
The first thing to do is to create a new project for the workshop. Everything we do during the workshop should occur inside this project’s directory. You can imagine that you might make new projects for your thesis, or for each chapter of your dissertation. Your organization is up to you.
To start, go to File> New Project…
And then, “Empty Project”
Now you get to choose where this project will live. RStudio will CREATE a new directory for your project, so just tell R where to put it and what to name it. For example, you might name the project “R Workshop” and put the project directory in your Documents folder.
Take a minute and go find your project on your computer, either in Finder (Mac) or Explorer (Windows). There should be a folder and, inside it, an .Rproj file.
You may see documents like this one online. Text boxes, like the one below, denote R code and R output.
# Like this one. This is the box that contains R code, and you should type what is in these boxes into
# your console window or into an RScript.
# The "#" comes before NOTES (comments), and text after a "#" will not execute.
When text looks like this it also denotes R code or R output. Sometimes you might see two text boxes in a row, like below. It means that the first box is evaluated and the second box is the R output:
156/7
## [1] 22.28571
R is basically a calculator, and over time the R language has been developed to provide a fairly natural way to access, organize and describe data, even very complex data. Give it a shot and do some computations in the console:
5 + 11
1.2 * 47
3^4
R can also evaluate logical expressions, which return TRUE or FALSE. Try something like these:
5 > 2
## [1] TRUE
2 > 5
## [1] FALSE
Note that the double “=” (==) is a logical test for equality, as opposed to a single “=”, which assigns a value.
5 == 2
## [1] FALSE
Unlike simple calculators, you can create and store variables in R. For example:
x = 500
Notice that you now have a variable “x” stored in your Environment (upper right) as a value.
The value of x can be easily overwritten. Keep this in mind!
x = 5000
Now that x is a stored variable, we can use it:
x + 10
## [1] 5010
x * 5
## [1] 25000
x^15
## [1] 3.051758e+55
You can imagine that we can create any number of variables. For example:
y = x * 5
It’s important to be aware that y does not remain dependent on the value of x. If you change the value of x, you must re-run the above command to update the value of y.
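Here is a short sketch of what that means in practice. It reuses the x and y from above; the new value of 1 is just an arbitrary choice for illustration.

```r
x = 5000
y = x * 5    # y is computed once, from the current value of x

x = 1        # overwrite x
print(y)     # still 25000: y did not follow the change in x

y = x * 5    # re-run the definition to update y
print(y)     # now 5
```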
Let’s see what happens when we plot our data:
plot(x, y)
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists. Right now we will be dealing with vectors.
Vectors contain a set of numbers, characters or logical values. Below we will create 3 different vectors - a, b and carrot. The function c() means ‘combine’ or ‘concatenate’.
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
carrot <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector
Notice the ‘<-’ notation. This notation is similar to ‘=’, but it is directional. If we want to set x equal to 10, we could type, equivalently:
x <- 10
10 -> x
Each member of a vector is called an “element”. Vector a has 6 elements, and vector b has 3 elements. You can refer to specific elements using subscripts. For example:
a # Returns all elements of vector a
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
a[6] # Returns the 6th element of vector a
## [1] 4
a[1:3] # Returns the first through third element of vector a
## [1] 1.0 2.0 5.3
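Subscripts can do a bit more than pick out positions. The sketch below reuses vector a and shows a few other common patterns in base R; the threshold of 2 is an arbitrary choice for illustration.

```r
a <- c(1, 2, 5.3, 6, -2, 4)

length(a)       # number of elements: 6
a[c(1, 6)]      # first and sixth elements: 1 4
a[-1]           # everything except the first element
a[a > 2]        # only the elements greater than 2: 5.3 6.0 4.0
```

The last line works because a > 2 produces a logical vector, and R keeps the elements where it is TRUE.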
You can apply functions and operations to an entire vector. For example, below we will create a vector x and a vector y - Note, this will overwrite your original stored values for these variables.
x <- c(3, 7, 19, 4, 34, 11)
y <- x * 2
plot(x,y)
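To see the difference between element-by-element operations and functions that summarize a whole vector, here is a small sketch using the same x. All of the functions are part of base R.

```r
x <- c(3, 7, 19, 4, 34, 11)

x + 1      # adds 1 to every element: 4 8 20 5 35 12
sqrt(x)    # square root of every element
sum(x)     # one number, the total of all elements: 78
max(x)     # one number, the largest element: 34
```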
Can you find the mean of vector x?
I’d like you to create a folder inside your project directory called “Data”. This is not required by R or RStudio, but I find that if I keep the data I’m using in one place, I’m less likely to lose track of stuff. During the workshop, we will be reading data files from this directory.
Next, I’d like you to create a new RScript, either from the File menu, or by clicking the button in the upper left. RScripts behave like the console, but don’t automatically execute when you press “enter”. It’s a place to write your code and keep notes. When you are ready, you can run an entire RScript at once, or you can run short sections of an RScript. Save your RScript as “Intro2R”. The RScript will save to whatever directory you’re working out of - Hopefully your project directory!
I have provided you with two datasets. One is meteo-temp.csv, and the other is meteo-rain.txt. These two datasets are temperature and rain measurements from several cities in France. Make sure they are inside your Data directory. We want to load these into our workspace, and give the datasets appropriate names.
To import a dataset, we will use one of R’s read functions. In this case, our data is in .csv format, so we will use read.csv(). Note that at any time you can use R’s help files for help on a specific function:
?read.csv
# Here we are loading meteo-temp.csv, and saving it to our environment with the name "temp"
temp <- read.csv("Data/meteo-temp.csv")
Notice that, if you are working within your project, you can access your “Data” directory directly.
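If read.csv() complains that it cannot open the file, it usually means R is not looking where you think it is. A quick check, assuming the file name and “Data” folder described above:

```r
# Build the path relative to the project directory
path <- file.path("Data", "meteo-temp.csv")

if (file.exists(path)) {
  message("Found ", path)
} else {
  # Show where R is actually looking, to help track down the problem
  message("Could not find ", path, " from ", getwd())
}
```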
We’ve created a dataframe “temp” in our workspace, with 36 observations of 14 variables. Let’s take a look at it. You can display the data by typing its name:
temp
## year station Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 2009 Ajaccio 13.4 13.7 15.6 19.0 25.2 26.5 28.8 30.4 27.5 22.9 19.0 14.5
## 2 2009 Auxerre 4.4 7.2 11.9 18.0 21.1 23.2 25.9 28.3 22.8 17.3 13.1 6.9
## 3 2009 Biarritz 11.0 11.9 14.2 15.4 19.2 23.0 25.1 25.0 22.4 20.6 17.3 11.8
## 4 2009 Bordeaux 8.8 11.5 15.3 17.4 22.4 25.7 27.1 28.5 25.1 20.6 15.2 10.7
## 5 2009 Boulogne 5.0 6.3 9.4 14.0 15.9 18.4 19.6 21.5 18.6 14.9 12.5 6.9
## 6 2009 Brest 8.3 9.2 12.7 13.9 16.7 20.7 19.9 20.2 18.5 16.7 13.3 9.5
## 7 2009 La Rochelle 7.5 10.2 13.5 15.1 19.9 23.4 22.8 24.8 23.5 18.7 14.4 9.9
## 8 2009 Langres 1.7 4.1 9.3 16.2 20.0 21.0 23.0 25.1 20.1 13.4 10.3 3.8
## 9 2009 Lille 3.4 6.8 11.3 16.8 18.9 21.6 23.7 25.5 21.3 15.4 12.4 5.5
## 10 2009 Lorient 7.9 9.6 13.3 14.4 17.9 21.8 20.8 21.4 21.5 17.5 13.6 9.7
## 11 2009 Nice 12.8 12.9 14.9 17.8 22.2 24.7 26.8 28.6 25.9 21.3 17.7 12.7
## 12 2009 Perpignan 11.0 13.1 16.9 17.6 22.9 28.2 29.9 31.3 26.9 22.4 17.3 12.5
## 13 2010 Ajaccio 12.3 13.7 15.6 18.7 21.0 25.1 29.9 27.6 25.8 21.7 16.7 13.6
## 14 2010 Auxerre 2.6 6.9 12.1 17.3 17.5 23.4 27.9 24.5 20.7 15.8 9.3 2.5
## 15 2010 Biarritz 10.0 11.1 14.6 18.1 18.3 20.9 24.3 24.8 23.4 18.8 13.9 10.9
## 16 2010 Bordeaux 6.8 NA 14.2 19.8 19.9 24.5 28.2 26.8 24.4 18.6 12.7 8.3
## 17 2010 Boulogne 3.6 6.0 9.0 NA 13.2 18.5 20.6 18.8 NA NA 8.7 3.3
## 18 2010 Brest 7.1 8.7 10.8 14.9 15.9 19.9 21.0 19.5 19.3 16.2 10.9 7.1
## 19 2010 La Rochelle 6.3 NA 11.9 17.1 17.9 NA 24.3 22.6 21.2 17.2 11.7 6.6
## 20 2010 Langres 0.4 4.0 8.9 15.1 15.0 21.6 25.9 21.9 17.4 13.5 6.9 1.2
## 21 2010 Lille 2.5 5.4 11.1 15.8 16.2 22.4 26.0 21.7 19.0 14.7 8.6 1.4
## 22 2010 Lorient 6.8 NA 10.8 16.2 17.1 22.2 23.3 21.6 20.3 16.6 11.5 6.8
## 23 2010 Nice 10.6 12.2 14.0 17.9 19.5 23.6 28.9 26.5 23.9 19.9 15.5 11.3
## 24 2010 Perpignan 10.1 11.4 13.9 19.9 20.9 26.5 30.6 29.5 25.3 20.0 15.2 11.4
## 25 2011 Ajaccio 13.4 14.1 15.9 19.6 23.2 25.5 27.4 28.9 27.2 23.1 20.4 16.1
## 26 2011 Auxerre 6.1 9.0 14.3 21.1 23.7 24.1 23.3 25.1 24.0 18.3 14.4 9.3
## 27 2011 Biarritz 11.0 13.9 14.9 20.4 21.6 22.3 22.0 25.0 25.5 21.1 17.9 13.6
## 28 2011 Bordeaux 9.3 13.1 15.7 22.7 24.9 24.8 24.8 27.4 25.5 21.1 16.4 12.9
## 29 2011 Boulogne 6.5 8.1 10.3 16.2 15.7 17.6 18.2 19.0 19.2 16.1 12.7 9.8
## 30 2011 Brest 8.9 11.3 12.9 18.4 17.4 18.4 19.7 20.1 20.4 17.5 14.9 11.2
## 31 2011 La Rochelle 8.1 11.1 14.4 20.4 20.8 21.4 22.0 23.8 22.9 18.8 16.3 12.3
## 32 2011 Langres 4.0 6.1 12.0 NA 21.0 NA 20.6 23.6 NA 15.3 10.0 NA
## 33 2011 Lille 6.5 8.4 12.4 19.6 20.0 21.8 20.8 22.3 21.7 16.7 11.9 9.5
## 34 2011 Lorient 9.3 11.1 13.4 19.2 18.6 19.9 20.9 21.0 20.5 17.8 15.4 11.7
## 35 2011 Nice 12.2 12.8 14.6 18.3 22.7 24.2 26.0 27.1 26.2 21.2 18.3 15.4
## 36 2011 Perpignan 11.4 13.7 15.2 21.1 24.4 25.4 27.5 29.3 27.8 23.3 18.2 14.8
Describe this dataset. Put your notes in your R script by using hashtags (#) to create a comment.
Some commands to consider using:
names() # Gives the names of the variables in the dataset
dim() # Gives the dimensions of the dataset
head() # Gives the first ~6 rows of the dataset
str() # Gives the structure of the dataset
summary() # Gives a summary of the dataset, with some basic statistics.
?summary # A "?" before a command name will give you the R help file on that command!
??summary # Two "?" searches the help for that word or phrase
Now bring in the next dataset, meteo-rain.txt. This file is not a CSV, so we will use the read.table() command:
rain <- read.table("Data/meteo-rain.txt", header = TRUE)
We used the command read.table() rather than read.csv() because the data was in a TXT document. We also included a new argument: header = TRUE. The first row of information in the TXT file was not data, but variable names. We had to tell R to be sure to make the first row variable names by telling it the data had a header.
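The two functions are closely related: read.csv() is essentially read.table() with defaults suited to comma-separated files. The sketch below shows this on a tiny temporary file (not one of the workshop datasets); its two rows are copied from the temperature table above.

```r
# Write a tiny comma-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,station,Jan",
             "2009,Ajaccio,13.4",
             "2009,Auxerre,4.4"), tmp)

viaCsv   <- read.csv(tmp)                               # header = TRUE and sep = "," by default
viaTable <- read.table(tmp, header = TRUE, sep = ",")   # same settings, spelled out

all.equal(viaCsv, viaTable)   # TRUE: the same data frame either way
```

With read.table() you have to state these choices yourself, which is why header = TRUE was needed for the rain data.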
Describe this dataset. How is it different from the temperature dataset? Do they look similar?
How would you go about finding the mean temperature for January (including all cities)? We need a way to isolate just the column in the data that holds data for January.
One way we can do this is with the $ symbol. If you look at the temp dataframe, either by typing temp or by clicking on the dataframe in the Environment window, you can see that each column has a header or name. In this case, there is a column for each month.
You can isolate a column from a dataframe using this syntax: dataframe$columnname
For example, the following code will return all temperature values from January:
temp$Jan
## [1] 13.4 4.4 11.0 8.8 5.0 8.3 7.5 1.7 3.4 7.9 12.8 11.0 12.3 2.6 10.0 6.8 3.6 7.1 6.3 0.4 2.5 6.8 10.6
## [24] 10.1 13.4 6.1 11.0 9.3 6.5 8.9 8.1 4.0 6.5 9.3 12.2 11.4
To find the mean of this new vector of temperatures, do the following:
mean(temp$Jan) # This will give you the mean of all the data in column "Jan" in dataframe "temp"
## [1] 7.805556
You can “index” a dataset using the $ operator, and also through more traditional indexing, where you tell R what rows and columns you’d like to see. Indexing is done using brackets [rows, columns].
For example, we can do the exact same thing as above using indexing brackets:
temp[,3] # Leaving a blank space before the comma means "all", so we're telling R to show us all rows of temp, and column 3.
## [1] 13.4 4.4 11.0 8.8 5.0 8.3 7.5 1.7 3.4 7.9 12.8 11.0 12.3 2.6 10.0 6.8 3.6 7.1 6.3 0.4 2.5 6.8 10.6
## [24] 10.1 13.4 6.1 11.0 9.3 6.5 8.9 8.1 4.0 6.5 9.3 12.2 11.4
temp$Jan == temp[,3] # Check for yourself that the two vectors are equal
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
mean(temp[,3])
## [1] 7.805556
Play around with indexing before you move on - you can select any part of the dataset this way.
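If you want something small to practice on first, here is a sketch with a toy data frame (called toy, a name I made up) built from three rows of the temperature table above. It shows a few [rows, columns] patterns, including selecting columns by name.

```r
# A small data frame built from three rows of the temperature table
toy <- data.frame(
  station = c("Ajaccio", "Auxerre", "Biarritz"),
  Jan     = c(13.4, 4.4, 11.0),
  Feb     = c(13.7, 7.2, 11.9)
)

toy[2, ]                        # row 2, all columns
toy[, "Jan"]                    # all rows, one column selected by name
toy[1:2, c("station", "Feb")]   # rows 1 to 2, two named columns
toy[3, 2]                       # a single cell: row 3, column 2 (11)
```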
a) Can you do the same for May?
b) Find the median temperatures for January and May, as well.
c) Now find the mean and median temperature for September. If you have trouble, why? Check the help (?mean, ?median).
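If you got NA as an answer, missing values are the likely cause. Here is a small sketch of how mean() and median() treat them, using the three September values for Boulogne from the table above; the vector name boulogneSep is just for illustration.

```r
# September temperatures for Boulogne in 2009, 2010 and 2011
boulogneSep <- c(18.6, NA, 19.2)

mean(boulogneSep)                   # NA: one missing value makes the mean unknown
mean(boulogneSep, na.rm = TRUE)     # 18.9: the NA is dropped before averaging
median(boulogneSep, na.rm = TRUE)   # 18.9
sum(is.na(boulogneSep))             # 1: how many values are missing
```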
What if you want to look at January temperatures for one city only? For this we can use the which() command.
# Let's look at data from Ajaccio in January
AjaccioJan <- temp$Jan[which(temp$station == 'Ajaccio')]
Don’t freak out!
Let’s look at this command piece by piece. We’ve created a new vector named “AjaccioJan”.
To assign values to it, we are asking R to look at all January temperatures temp$Jan, and within that, only at stations with the name “Ajaccio” [which(temp$station == "Ajaccio")].
Try which(temp$station == "Ajaccio") on its own to see what it’s doing:
which(temp$station == "Ajaccio")
## [1] 1 13 25
Those are index locations! They are the rows where temp$station is equal to “Ajaccio”.
If you’d like, you can save those locations to a variable:
Aj <- which(temp$station == "Ajaccio")
and get the same result:
temp$Jan[Aj]
## [1] 13.4 12.3 13.4
Take a look at AjaccioJan.
AjaccioJan
## [1] 13.4 12.3 13.4
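It can help to see the intermediate step that which() works on. The sketch below uses a toy data frame (six January values for two stations, copied from the table above) and shows that the comparison alone already gives a TRUE/FALSE vector you can index with.

```r
# Six rows copied from the temperature table: two stations, three years
toy <- data.frame(
  year    = c(2009, 2009, 2010, 2010, 2011, 2011),
  station = c("Ajaccio", "Auxerre", "Ajaccio", "Auxerre", "Ajaccio", "Auxerre"),
  Jan     = c(13.4, 4.4, 12.3, 2.6, 13.4, 6.1)
)

isAjaccio <- toy$station == "Ajaccio"   # TRUE or FALSE for every row
which(isAjaccio)                        # positions of the TRUEs: 1 3 5
toy$Jan[which(isAjaccio)]               # 13.4 12.3 13.4
toy$Jan[isAjaccio]                      # same values, indexing with the logical vector directly
```

The two forms agree here; they can differ when the comparison produces NA, because which() drops those positions.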
Find the mean August temperature at Lorient. What is the standard deviation?
There are more ways to isolate data from the rest of the dataframe. For example, this is how to see only January data when January temperatures exceed 3C:
temp[which(temp$Jan>3), c(1,2,3)]
## year station Jan
## 1 2009 Ajaccio 13.4
## 2 2009 Auxerre 4.4
## 3 2009 Biarritz 11.0
## 4 2009 Bordeaux 8.8
## 5 2009 Boulogne 5.0
## 6 2009 Brest 8.3
## 7 2009 La Rochelle 7.5
## 9 2009 Lille 3.4
## 10 2009 Lorient 7.9
## 11 2009 Nice 12.8
## 12 2009 Perpignan 11.0
## 13 2010 Ajaccio 12.3
## 15 2010 Biarritz 10.0
## 16 2010 Bordeaux 6.8
## 17 2010 Boulogne 3.6
## 18 2010 Brest 7.1
## 19 2010 La Rochelle 6.3
## 22 2010 Lorient 6.8
## 23 2010 Nice 10.6
## 24 2010 Perpignan 10.1
## 25 2011 Ajaccio 13.4
## 26 2011 Auxerre 6.1
## 27 2011 Biarritz 11.0
## 28 2011 Bordeaux 9.3
## 29 2011 Boulogne 6.5
## 30 2011 Brest 8.9
## 31 2011 La Rochelle 8.1
## 32 2011 Langres 4.0
## 33 2011 Lille 6.5
## 34 2011 Lorient 9.3
## 35 2011 Nice 12.2
## 36 2011 Perpignan 11.4
Here we are asking to isolate observations where January temperatures were greater than 3C, and we are displaying the 1st, 2nd and 3rd columns of “temp” where that condition is met (c(1,2,3)).
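Column positions like c(1,2,3) are easy to get wrong if the dataset changes, so you may prefer to select columns by name. A sketch on a toy data frame (three rows copied from the table above) showing that both forms give the same result:

```r
# Three rows copied from the temperature table
toy <- data.frame(
  year    = c(2009, 2009, 2009),
  station = c("La Rochelle", "Langres", "Lille"),
  Jan     = c(7.5, 1.7, 3.4),
  Feb     = c(10.2, 4.1, 6.8)
)

byPosition <- toy[which(toy$Jan > 3), c(1, 2, 3)]
byName     <- toy[which(toy$Jan > 3), c("year", "station", "Jan")]

identical(byPosition, byName)   # TRUE
byName                          # Langres (1.7) is dropped
```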
Create a data frame called “warmJan” that contains all observations of temperature greater than 5C in January, but also contains March temperature data from the same years and cities.
Now you can extract data from datasets. But sometimes you don’t get data in the format you expected.
By now you may have noticed that the “temp” and “rain” datasets look very different, despite containing observations that occurred at the same time and at the same sites. “temp” is wide, whereas “rain” is tall. In “temp” there are multiple observations per row, but in “rain” each row represents a single observation.
We want to combine these two datasets so that all of our data is in one place, and easy to explore. First, you need to install a new library or package. Many of the more advanced operations you will want to do with R will require the installation of external libraries. There are thousands of libraries that contain valuable functions for statistical analysis, many of which are stored and curated by CRAN. Right now we will be installing reshape2 and plyr.
Go to Tools>Install Packages. As you begin to type “reshape2” into the search box, the correct package should appear in the dropdown. Select it and install. Do the same for “plyr”. After installation, you will need to load the library:
library(reshape2)
library(plyr)
You only need to install a library once, but it must be loaded with library() in every new R session before you can use it.
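If you are not sure whether a package is already installed, you can check from the console. This is a sketch using base R's requireNamespace(); the variable names are my own, and the install line is commented out so nothing is downloaded unless you choose to.

```r
pkgs <- c("reshape2", "plyr")

# TRUE for each package that is already installed on this machine
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
notInstalled <- pkgs[!installed]

if (length(notInstalled) > 0) {
  message("Not installed yet: ", paste(notInstalled, collapse = ", "))
  # install.packages(notInstalled)   # uncomment to download them from CRAN
}
```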
Reshape2 calls the tall and wide data structures “melted” and “cast”, respectively. So, to go from wide to tall is “melting”, and from tall to wide is “casting”. Look at the documentation for the melt() function.
# Don't forget to use the help files!
?reshape2
?melt.data.frame
# Reshaping temp data wide to tall (melting)
meltedTemp <- melt(temp, id.vars = c("year","station"), variable.name = "monthName", value.name = "temp")
Let’s break it down again!
We had to state which variables should remain associated with each observation (id.vars), and because we want to make sure the year and station remain associated with each observation, we set id.vars to "year" and "station" (we had to use c() because there was more than one ID variable).
The remaining variables (months), are moved into a column we’ve titled "monthName", to mimic the corresponding column name in the “rain” dataframe. The temperature values are moved into a column we’ve titled "temp". Compare “rain” and “meltedTemp” to see if they look similar now.
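Melting is easier to follow on a dataset small enough to read at a glance. The sketch below uses a toy wide table (two stations, two months, values copied from the temperature table); the names wide and tall are my own. It only runs the melt if reshape2 is installed.

```r
# Two rows and two month columns copied from the temperature table
wide <- data.frame(
  year    = c(2009, 2009),
  station = c("Ajaccio", "Auxerre"),
  Jan     = c(13.4, 4.4),
  Feb     = c(13.7, 7.2)
)

if (requireNamespace("reshape2", quietly = TRUE)) {
  # id.vars is the full name of the ID-variable argument
  tall <- reshape2::melt(wide, id.vars = c("year", "station"),
                         variable.name = "monthName", value.name = "temp")
  print(tall)   # 4 rows: one per station per month
}
```

Two rows with two month columns each become four rows, each holding a single temperature.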
Now we want to combine these two datasets. Look up the help for join().
?join
joinedData<-join(meltedTemp, rain, type="inner")
## Joining by: year, station, monthName
We have successfully joined these two datasets together, using an “inner” join. Now we should have a dataframe that describes both temperature and rainfall at each city, in each month.
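To see what “inner” means, here is a sketch on two toy tables. Note that it uses base R's merge() as a stand-in for plyr's join(), so it runs without any extra packages. The temperatures come from the table above, but the rain values are made up purely for illustration.

```r
# Toy tables: temperatures from the table above, rain values invented
tempToy <- data.frame(station = c("Ajaccio", "Auxerre", "Brest"),
                      temp    = c(13.4, 4.4, 8.3))
rainToy <- data.frame(station = c("Ajaccio", "Auxerre", "Lille"),
                      rain    = c(10, 20, 30))

# By default merge() keeps only keys present in BOTH tables (an inner join)
both <- merge(tempToy, rainToy, by = "station")
both   # Ajaccio and Auxerre only; Brest and Lille have no partner row
```

An inner join silently drops rows that have no match, so it is worth comparing row counts before and after a join.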
Take a look at the new dataframe. We can still extract the same information we did before the join:
#average temp in France in January
mean(joinedData$temp[which(joinedData$monthName=="Jan")])
## [1] 7.805556
Or
#median rainfall in France in January
median(joinedData$rain[which(joinedData$monthName=="Jan")])
## [1] 68.85
BUT, here’s where the real fun begins.
There are many many ways to plot data in R, and much of your time will be spent getting plots to look and behave how you want. A particularly useful - and beautiful - plotting tool is in the package ggplot2. Download, install and load that package now.
ggplot is not as immediately straightforward as the native R plot() function, but it allows us to do some really interesting things. I will give some examples, and you should follow along and experiment as we go.
# Remember to load the library
library(ggplot2)
ggplot(aes(x = month, y = temp), data = joinedData) +
geom_point()
What we’ve done here is create a ggplot (ggplot()) and add a point ‘geom’ (geom_point()). aes stands for “aesthetic mapping”, and this is where we tell the plot how to behave. We define x as month and y as temp, and let the plot know where to look for the data (joinedData). This is a very similar plot to the one we saw before.
But what if we want to be able to see which station is which? Easy!
ggplot(aes(x = month, y = temp), data = joinedData) +
geom_point(aes(colour = station))
The only difference here is that we’ve told the plot that we want to color the points by station.
But the real point was to get rain and temperature data together on the same graph, right? Let’s try incorporating rain data for each data point:
ggplot(aes(x = month, y = temp), data = joinedData) +
geom_point(aes(colour = station, size = rain))
Here are some other examples. Experiment and create at least three figures that express the data uniquely. Consider all of your variables (rain, temp, year, station). Consider your axes, as well. Does it make sense that month is a continuous variable?
ggplot() + geom_point(aes(x=month,y=rain,size=temp,shape=factor(station)), data= joinedData)
ggplot() + geom_point(aes(x=month,y=rain,size=temp), data= joinedData, colour="blue",alpha=0.1)
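Notice that colour appears inside aes() in some of these examples and outside it in others. The sketch below shows the difference on a toy data frame (four values copied from the temperature table; the plot names mapped and fixed are my own). It only builds the plots if ggplot2 is installed.

```r
# January and February 2009 temperatures for two stations
toy <- data.frame(
  month   = c(1, 2, 1, 2),
  temp    = c(13.4, 13.7, 4.4, 7.2),
  station = c("Ajaccio", "Ajaccio", "Auxerre", "Auxerre")
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Inside aes(): colour is MAPPED to a column, so each station gets its own colour
  mapped <- ggplot(toy, aes(x = month, y = temp)) +
    geom_point(aes(colour = station))

  # Outside aes(): colour is a FIXED value, so every point is blue
  fixed <- ggplot(toy, aes(x = month, y = temp)) +
    geom_point(colour = "blue")
}
```

Type mapped or fixed at the console to draw each plot. Only the mapped version gets a legend, because only there does colour carry information from the data.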
If you haven’t yet, create a ggplot with temperature and rain as x and y axes. Color data points by month. Do you see any patterns? Describe them here, be specific, and save the plot.
You can save plots using the Export button in the plot window, or by using ggsave():
ggsave("myplot.png", width = 6, height = 6)
That’s it for this session! Next up is introducing pipe functions and the Tidyverse, for helping with data manipulations and keeping everything organized.