Everyone should have R and RStudio downloaded. We will be using these datasets: http://dartgo.org/r-getting-started
RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.
RStudio is highly customizable, and allows you to change appearances and general layout without much fuss. For instance, to rearrange the order of panes, you can do the following:
Tools > Global Options... > Pane Layout
There are a lot of other features to take advantage of in RStudio, and many of them are bound to shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K (Alt + Shift + K on Windows and Linux). You can also find a handy cheat sheet here.
help(paste)
People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages in R:
install.packages("tidyverse")
install.packages("renv")
We then load a package into our working environment with a function called library():
library(tidyverse)
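Before installing, it can be handy to check which packages are already available. The sketch below uses a hypothetical helper, missing_packages(), which is not part of the workshop materials. It relies on base R's requireNamespace(), which also loads a package's namespace when the package is found.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "renv"))
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install.packages() call is left commented out so that running the sketch never downloads anything by accident.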
renv
Part of reproducibility means keeping track of which packages you’re using. The renv package helps with that by taking snapshots of package versions so that you have an accounting of your R environment and package library. This will allow others - and you - to accurately reproduce your work. You can have renv begin tracking and take a snapshot like this:
renv::init()
renv::snapshot()
This will create a folder called renv/ and a file called renv.lock. renv/ will contain copies of installed packages and their dependencies, and renv.lock is a JSON file that contains metadata about each of these packages and their versions. The function renv::restore() will attempt to re-create the appropriate R Environment from an existing renv.lock file. You can read more about this here.
Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper right-hand corner of RStudio and choose to begin a new project.
Even if you’re not using the RStudio projects feature, it’s still a good idea to keep work for any given project in a single directory (folder). You can make a new folder in Finder or File Explorer. Once you have that, you can set your working directory in R like this:
setwd("PATH/TO/PROJECT")
You can also see your current working directory by using this:
getwd()
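As a minimal sketch of keeping a project in one folder, the code below creates a project directory with a nested data/processed folder and builds a file path inside it. The folder name my_project is only an illustration, and the example works inside tempdir() so that it does not touch your real files.

```r
# Hypothetical project location inside the session's temporary directory
project_dir <- file.path(tempdir(), "my_project")

# Create the project folder and a nested data/processed folder in one step
dir.create(file.path(project_dir, "data", "processed"),
           recursive = TRUE, showWarnings = FALSE)

# Build a path to a file inside the project without calling setwd()
cats_path <- file.path(project_dir, "data", "processed", "cats.csv")
dir.exists(dirname(cats_path))
```

Using file.path() instead of pasting strings together keeps paths portable between macOS, Windows, and Linux.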
One of the most important things you can do for reproducibility is to comment your code, using #. This is a symbol that tells the computer to ignore the rest of that line. You can use it inline, but I typically recommend having comments on their own lines.
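Here is a small illustration of both comment styles. The variable names and the conversion factor are made up for this example.

```r
# Convert a cat's weight from kilograms to pounds.
# The conversion factor below is approximate.
kg_to_lb <- 2.20462
weight_kg <- 3.2
weight_lb <- weight_kg * kg_to_lb  # an inline comment also works
weight_lb
```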
You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.
new_int <- 4
new_int
## [1] 4
Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).
cos(new_int)
## [1] -0.6536436
cos(4)
## [1] -0.6536436
Vectors are one of the basic data structures in R. In short, they are groups of values of a single data type. We can make them with the c() function.
coat <- c("calico", "tortoiseshell", "tabby")
weight <- c(2.1, 5.0, 3.2)
likes_string <- c(TRUE, FALSE, TRUE)
When we say they are of a single data type, we are referring to the “atomic” data types in R. There are five that you will commonly see (a sixth, raw, is rarely needed). Let’s see those:
typeof(coat)
## [1] "character"
typeof(weight)
## [1] "double"
typeof(likes_string)
## [1] "logical"
typeof(1 + 1i)
## [1] "complex"
typeof(1L)
## [1] "integer"
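The difference between doubles and integers is easy to miss, so here is a short sketch using only base R functions. Both types count as "numeric" for most purposes.

```r
typeof(1)          # "double": plain numbers are doubles by default
typeof(1L)         # "integer": the L suffix asks for an integer
class(1)           # "numeric"
is.numeric(1)      # TRUE
is.numeric(1L)     # TRUE: integers and doubles are both numeric
is.character("1")  # TRUE: quotes make it a character string
```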
Vectors must be made of one of these five data types. Let’s see what happens when we try to mix them up.
test <- c(0, 2, 4)
typeof(test)
## [1] "double"
test <- c("0", "2", "4")
typeof(test)
## [1] "character"
test <- c(0, 2, "4")
typeof(test)
## [1] "character"
When we tried to mix numeric and character data types, the entire test vector became a character vector. This is called type coercion. Type coercion follows this pattern: Logical -> Integer -> Double (numeric) -> Complex -> Character
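To see each step of that pattern, we can combine two neighbouring types at a time and check the result with typeof(). This is only an illustration of the rule described above.

```r
typeof(c(TRUE, 1L))   # "integer":   logical is promoted to integer
typeof(c(1L, 2.5))    # "double":    integer is promoted to double
typeof(c(2.5, 1i))    # "complex":   double is promoted to complex
typeof(c(1i, "a"))    # "character": everything can become character

mixed <- c(TRUE, 2.5, "a")
mixed                 # "TRUE" "2.5" "a"
```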
We can force vectors to go in the opposite direction, but this sometimes doesn’t work. Other times, it produces unexpected behaviors.
as.numeric(likes_string)
## [1] 1 0 1
as.numeric(test)
## [1] 0 2 4
as.logical(test)
## [1] NA NA NA
as.logical(as.numeric(test))
## [1] FALSE TRUE TRUE
Notice that test had to be made into a numeric vector before it could be made into a logical vector. Also notice that it was converted to FALSE, TRUE, TRUE. That’s because any number other than 0 defaults to TRUE when it is forced into a logical format.
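As a sketch of the "sometimes doesn't work" case: values that cannot be converted become NA. For as.numeric(), R also raises a warning, which is silenced here with suppressWarnings() only to keep the output tidy.

```r
mixed <- c("1", "two", "3")

# "two" cannot be read as a number, so it becomes NA
converted <- suppressWarnings(as.numeric(mixed))
converted          # 1 NA 3
is.na(converted)   # FALSE TRUE FALSE

# as.logical() only understands a few spellings of true and false
flags <- as.logical(c("TRUE", "false", "T", "yes"))
flags              # TRUE FALSE TRUE NA
```

Checking for NA with is.na() after a conversion is a quick way to find the values that did not convert cleanly.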
We can add to existing vectors with c():
test <- c(test, 8)
test
## [1] "0" "2" "4" "8"
We can create series of numbers easily using a :
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
10:1
## [1] 10 9 8 7 6 5 4 3 2 1
We can also create sequences of numbers using functions like rep() and seq().
rep(8, 80)
## [1] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [39] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [77] 8 8 8 8
seq(1, 10, by = 0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
## [16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
## [31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
## [46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
## [61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
## [76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
## [91] 10.0
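Both functions take a few other arguments that are worth knowing about. The calls below are a small sampler using base R only, and the variable names are just for illustration.

```r
repeat_whole <- rep(c("a", "b"), times = 3)  # repeat the whole vector
repeat_whole                                 # "a" "b" "a" "b" "a" "b"

repeat_each <- rep(c("a", "b"), each = 3)    # repeat each element in turn
repeat_each                                  # "a" "a" "a" "b" "b" "b"

five_points <- seq(0, 1, length.out = 5)     # ask for a number of points
five_points                                  # 0.00 0.25 0.50 0.75 1.00

countdown <- seq(10, 1, by = -3)             # negative steps count down
countdown                                    # 10 7 4 1

seq_len(4)                                   # 1 2 3 4
```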
Vectors are interesting (and powerful) because we can perform vectorized operations on the entire structure at once.
seq_example <- seq(1, 10, by = 0.1)
seq_example * 2
## [1] 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8
## [16] 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
## [31] 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8
## [46] 11.0 11.2 11.4 11.6 11.8 12.0 12.2 12.4 12.6 12.8 13.0 13.2 13.4 13.6 13.8
## [61] 14.0 14.2 14.4 14.6 14.8 15.0 15.2 15.4 15.6 15.8 16.0 16.2 16.4 16.6 16.8
## [76] 17.0 17.2 17.4 17.6 17.8 18.0 18.2 18.4 18.6 18.8 19.0 19.2 19.4 19.6 19.8
## [91] 20.0
seq_example - 1
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8
## [20] 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [39] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5 5.6
## [58] 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5
## [77] 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0
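Vectorized operations also work between two vectors, element by element. The short vectors below are made up for illustration. The last line shows recycling, where R reuses the shorter vector to match the longer one.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b        # 11 22 33: first with first, second with second, ...
a * b        # 10 40 90
a > 1        # FALSE TRUE TRUE: comparisons are vectorized too
sum(a > 1)   # 2: TRUE counts as 1 when summed

# Recycling: c(10, 20) is reused as 10, 20, 10, 20
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled     # 11 22 13 24
```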
Data frames are probably the most common data structure used by R programmers. They are a rectangular data format, and under the hood, they are typically lists of equal-length vectors. Let’s make one with some of the vectors we made earlier.
cats <- data.frame(coat, weight, likes_string)
cats
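To make the "list of equal-length vectors" idea concrete, the sketch below rebuilds cats and inspects it with base R functions. Setting stringsAsFactors = FALSE is an addition here so that the result is the same on older and newer versions of R.

```r
coat <- c("calico", "tortoiseshell", "tabby")
weight <- c(2.1, 5.0, 3.2)
likes_string <- c(TRUE, FALSE, TRUE)
cats <- data.frame(coat, weight, likes_string, stringsAsFactors = FALSE)

is.list(cats)          # TRUE: a data frame is a special kind of list
names(cats)            # "coat" "weight" "likes_string"
sapply(cats, length)   # every column has the same length (3)
cats[["weight"]]       # pull one column out as a plain vector
```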
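Data frames are very easy to write out to local files, like a csv, and very easy to read in from a csv. The write_csv() and read_csv() functions used here come from the tidyverse, and the ./data/processed/ folder needs to exist before you write to it.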
write_csv(cats, "./data/processed/cats.csv")
cats <- read_csv("./data/processed/cats.csv")
## Parsed with column specification:
## cols(
## coat = col_character(),
## weight = col_double(),
## likes_string = col_logical()
## )
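If the tidyverse is not available, base R can do the same round trip with write.csv() and read.csv(). This is a sketch under that assumption. It writes to a temporary file from tempfile() rather than the ./data/processed/ path used above.

```r
cats <- data.frame(
  coat = c("calico", "tortoiseshell", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Write to a temporary file; row.names = FALSE avoids an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(cats, csv_path, row.names = FALSE)

# Read it back and check the column types
cats_from_disk <- read.csv(csv_path, stringsAsFactors = FALSE)
str(cats_from_disk)
```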
We can take a look at some of the individual variables using $ as a selector.
cats$weight
## [1] 2.1 5.0 3.2
cats$coat
## [1] "calico" "tortoiseshell" "tabby"
Let’s also take a look at the overall structure of the data:
dim(cats)
## [1] 3 3
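Notice that coat is a character vector here. In versions of R before 4.0, data.frame() and read.csv() converted character columns into factors by default. Factors are another data structure we won’t be talking about today. They are useful for categorical variables, but they can be tricky to use. If you read data with base R’s read.csv(), you can make sure character columns stay as simple character vectors like this: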
cats <- read.csv("./data/processed/cats.csv", stringsAsFactors = FALSE)
We can still perform vectorized operations with the vectors within a data frame. When an operation won’t work, such as adding a number to a character string, R will stop and report an error:
cats$weight + 2
## [1] 4.1 7.0 5.2
paste("My cat is", cats$coat)
## [1] "My cat is calico" "My cat is tortoiseshell"
## [3] "My cat is tabby"
cats$weight + cats$coat
## Error in cats$weight + cats$coat: non-numeric argument to binary operator
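An error like this stops a script. One way to handle it, shown here only as a sketch, is base R's tryCatch(), which lets us report the problem and carry on with a fallback value. Returning NA as the fallback is a choice made for this example.

```r
cats <- data.frame(
  coat = c("calico", "tortoiseshell", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  stringsAsFactors = FALSE
)

result <- tryCatch(
  cats$weight + cats$coat,
  error = function(e) {
    # conditionMessage() extracts the text of the error
    message("Could not add the columns: ", conditionMessage(e))
    NA
  }
)
result   # NA, because the addition failed
```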
There are different ways to subset data structures based on the type of structure. We’ll look at vectors and data frames.
Let’s use the seq_example we made before.
seq_example
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
## [16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
## [31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
## [46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
## [61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
## [76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
## [91] 10.0
We can use square brackets to get just the first ten elements, like this:
seq_example[1:10]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
Or, we can get the elements that match conditions we set up, like this:
seq_example[seq_example < 4]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8
## [20] 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
seq_example[seq_example <= 4]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8
## [20] 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
seq_example[seq_example != 3]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
## [16] 2.5 2.6 2.7 2.8 2.9 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
## [31] 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5
## [46] 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0
## [61] 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5
## [76] 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0
We can add multiple conditions, using & for “and”, and | for “or”:
seq_example[seq_example < 4 & seq_example > 2]
## [1] 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
seq_example[seq_example < 4 | seq_example > 8]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
## [16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
## [31] 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5
## [46] 9.6 9.7 9.8 9.9 10.0
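A few more base R subsetting tools are sketched below on a short made-up vector: negative indices drop elements, %in% matches against a set of values, and which() returns the positions where a condition is TRUE.

```r
x <- c(10, 20, 30, 40, 50)

x[-1]                 # 20 30 40 50: everything except the first element
x[-(1:2)]             # 30 40 50:    everything except the first two
x[x %in% c(20, 40)]   # 20 40:       elements that appear in a set
which(x > 25)         # 3 4 5:       positions, not values
```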
Square brackets work for data frames too:
mtcars
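mtcars[1]     # first column, returned as a data frame
mtcars[1:2]   # first two columns, returned as a data frame
mtcars[1, ]   # first row, all columns
mtcars[, 1]   # first column, returned as a vector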
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
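Rows and columns can be chosen together by putting a condition before the comma and column names after it. This sketch uses the built-in mtcars data, and the name efficient is just for illustration.

```r
# Rows where mpg is above 25, keeping only three named columns
efficient <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "hp")]
efficient
nrow(efficient)             # 6

# Row names and column names work as selectors too
mtcars["Fiat 128", "mpg"]   # 32.4
```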
There are ways to use these for more advanced subsetting, but they can be tricky to read and write, especially for new programmers. Another way of subsetting that’s more readable and intuitive is by using functions from the tidyverse package we loaded earlier.
We can select columns we want from a dataset by name:
mt_select <- mtcars %>%
  select(mpg, cyl, hp)
mt_select
We can filter for rows that we want based on one or more conditions:
mt_filtered <- mtcars %>%
  filter(mpg > 25)
mt_filtered
Notice that we’re using a new operator, the pipe %>%. It is used heavily in the tidyverse, an ecosystem of packages that you can read more about here, and there is an entire book that uses it here.
The pipe is great because it allows us to chain commands together. So if I’m interested in just a few columns of data from my efficient cars dataset, I can create that quickly like this:
mt_efficient_select <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, cyl, hp)
mt_efficient_select
You could then write this dataset out to your hard drive like this:
write_csv(mt_efficient_select, "./data/processed/mt_efficient_select.csv")
We can find out some things about the basic structure of our data.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
We can use specific parts of the data, too, such as the mpg variable. Then we can find out more about that with some built-in functions.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
length(mtcars$mpg)
## [1] 32
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
prod(mtcars$mpg)
## [1] 1.264241e+41
sum(mtcars$mpg)
## [1] 642.9
sqrt(mtcars$mpg)
## [1] 4.582576 4.582576 4.774935 4.626013 4.324350 4.254409 3.781534 4.939636
## [9] 4.774935 4.381780 4.219005 4.049691 4.159327 3.898718 3.224903 3.224903
## [17] 3.834058 5.692100 5.513620 5.822371 4.636809 3.937004 3.898718 3.646917
## [25] 4.381780 5.224940 5.099020 5.513620 3.974921 4.438468 3.872983 4.626013
var(mtcars$mpg)
## [1] 36.3241
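Real datasets often contain missing values, which mtcars does not. The sketch below adds an NA by hand to show how these functions react and how the na.rm argument helps. It also shows sd(), range(), and quantile() from base R.

```r
mpg <- mtcars$mpg

sd(mpg)                         # standard deviation: the square root of var(mpg)
range(mpg)                      # smallest and largest value: 10.4 33.9
quantile(mpg, c(0.25, 0.75))    # first and third quartiles

# Add a missing value by hand to see what happens
mpg_with_missing <- c(mpg, NA)
mean(mpg_with_missing)                 # NA: one missing value spoils the mean
mean(mpg_with_missing, na.rm = TRUE)   # drops the NA first, same as mean(mpg)
```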
There are several ways you can control flow in R. For conditional statements, the most commonly used approaches are the constructs:
## if
# if (condition is true) {
#   perform action
# }
## if ... else
# if (condition is true) {
#   perform action
# } else {  # that is, if the condition is false,
#   perform alternative action
# }
## if ... else if ... else
# if (condition is true) {
#   perform action
# } else if (another condition is true) {  # checked only if the first condition is false
#   perform alternative action
# } else {
#   perform this action if none of the above conditions are satisfied
# }
x <- 8
if (x >= 10) {
  print("x is greater than or equal to 10")
}
# Nothing is printed out - why?
if (x >= 10) {
  print("x is greater than or equal to 10")
} else {
  print("x is less than 10")
}
## [1] "x is less than 10"
# How about this time?
if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5 && x < 10) {
  print("x is greater than 5, but less than 10")
} else {
  print("x is less than or equal to 5")
}
## [1] "x is greater than 5, but less than 10"
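The if statement expects a single TRUE or FALSE. To apply the same kind of test to every element of a vector, base R provides ifelse(). The sketch below mirrors the example above on a made-up vector of three values.

```r
x_values <- c(3, 8, 12)

# ifelse(test, value if TRUE, value if FALSE), applied element by element
size <- ifelse(x_values >= 10, "10 or more",
               ifelse(x_values > 5, "between 5 and 10", "5 or less"))
size   # "5 or less" "between 5 and 10" "10 or more"
```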
Repeating operations
Use a for() loop when you know in advance how many iterations you need and the order of iteration matters, i.e. when the calculation at each iteration depends on the results of previous iterations.
# The basic structure of a for() loop is:
# for (iterator in set of values) {
#   do a thing
# }
# For example:
for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
# Nested for loop:
for (i in 1:5) {
  for (j in c('a', 'b', 'c', 'd', 'e')) {
    print(paste(i, j))
  }
}
## [1] "1 a"
## [1] "1 b"
## [1] "1 c"
## [1] "1 d"
## [1] "1 e"
## [1] "2 a"
## [1] "2 b"
## [1] "2 c"
## [1] "2 d"
## [1] "2 e"
## [1] "3 a"
## [1] "3 b"
## [1] "3 c"
## [1] "3 d"
## [1] "3 e"
## [1] "4 a"
## [1] "4 b"
## [1] "4 c"
## [1] "4 d"
## [1] "4 e"
## [1] "5 a"
## [1] "5 b"
## [1] "5 c"
## [1] "5 d"
## [1] "5 e"
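The loops above print values but do not depend on earlier iterations. As a sketch of the case where each step needs the previous results, the loop below fills in the first ten Fibonacci numbers. The sequence is chosen only as a familiar illustration.

```r
n <- 10
fib <- numeric(n)   # create the full-length result vector up front
fib[1:2] <- 1       # the first two values are given

for (i in 3:n) {
  # each value is the sum of the two values before it
  fib[i] <- fib[i - 1] + fib[i - 2]
}
fib   # 1 1 2 3 5 8 13 21 34 55
```

Creating the result vector before the loop, rather than growing it with c() on every iteration, is usually faster for long loops.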
# Sometimes you don't know how many iterations you need to do, but find yourself needing to repeat an operation as long as a certain condition is met. You can do this with a while() loop.
z <- 0
while (z < 10) {
  z <- z + 1
  cat(z, "\n")
}
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
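In the example above we could have guessed the number of iterations. Here is a sketch where the count is less obvious: keep halving a number until it drops to 1 or below, and count the steps. The starting value of 100 is arbitrary.

```r
value <- 100
steps <- 0

while (value > 1) {
  value <- value / 2    # halve the value
  steps <- steps + 1    # count how many times we did it
}

steps   # 7
value   # 0.78125
```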
Functions are ways of running the same piece of code on something that changes. They can save us a lot of typing. A common rule of thumb says that if you have to copy and paste the same code three times, you should write a function instead. Let’s try writing a simple function to show how this can work.
# Let’s define a function fahr_to_kelvin() that converts temperatures from Fahrenheit to Kelvin:
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
# freezing point of water
fahr_to_kelvin(32)
## [1] 273.15
In R, the return statement is not required: a function automatically returns the value of the last expression evaluated in its body. But for clarity, we will explicitly include the return statement.
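For comparison, here is a sketch of the same conversion written without return(). The name fahr_to_kelvin_short is made up so that it does not overwrite the function defined above. Because the arithmetic is vectorized, the function also accepts several temperatures at once.

```r
fahr_to_kelvin_short <- function(temp) {
  # the value of this last expression is what the function returns
  ((temp - 32) * (5 / 9)) + 273.15
}

fahr_to_kelvin_short(32)           # freezing point of water: 273.15
fahr_to_kelvin_short(c(32, 212))   # freezing and boiling points together
```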
The real power of functions comes from mixing, matching and combining them into ever-larger chunks to get the effect we want.
Let’s define two functions that will convert temperature from Fahrenheit to Kelvin, and Kelvin to Celsius, and then a third that combines them to convert Fahrenheit to Celsius:
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}
fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}
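A quick way to check functions like these is to try values whose answers we already know. The sketch below repeats the three definitions so that it runs on its own, then tests the freezing and boiling points of water. all.equal() is used for comparison because floating-point arithmetic can be off by a tiny amount.

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}
fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

fahr_to_celsius(32)                      # freezing point of water in Celsius: 0
all.equal(fahr_to_celsius(212), 100)     # boiling point: TRUE

# Nesting the two smaller functions directly gives the same answer
kelvin_to_celsius(fahr_to_kelvin(32))    # 0
```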
If you’ve been writing these functions down in a separate R script (a good idea!), you can load the functions into your R session by using the source() function:
source("./src/R/myfirstRfunction.R")
## Three functions are defined in the script:
## * The first function - fahr_to_kelvin() - is used to convert a temperature from Fahrenheit to Kelvin.
## * The second function - kelvin_to_celsius() - is used to convert a temperature from Kelvin to Celsius.
## * The third function - fahr_to_celsius()- is used to convert a temperature from Fahrenheit to Celsius.
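To try source() without the workshop's script, the sketch below writes a one-function script to a temporary file and then sources it. The file location comes from tempfile(), and the script contents are a shortened stand-in for myfirstRfunction.R, not a copy of it.

```r
# Write a tiny script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "kelvin_to_celsius <- function(temp) {",
  "  temp - 273.15",
  "}"
), script_path)

# source() runs the script, which defines the function in our session
source(script_path)
kelvin_to_celsius(273.15)   # 0
```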