Peter Caya
August 20, 2016
John Chambers on S, the precursor to R:
“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”
This philosophy is evident:
(This presentation was put together using R!)
If there is time:
Command line computing in R is fairly straightforward:
2+2
2^2
2*2
Scientific notation can be entered by using “e” in numbers:
2e10
[1] 2e+10
class(2)
[1] "numeric"
as.integer(2.6)
[1] 2
z = 1 + 2i
z
[1] 1+2i
Logical - TRUE or FALSE
x=1
z =1
test <-x==z
class(test)
[1] "logical"
If statements in R work similarly to the equivalent in other languages.
Logical evaluations are performed with the following operators:
OR and AND statements come in two varieties:
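As a rough sketch of the two varieties (the vectors a and b below are made up for illustration): & and | compare vectors element by element, while && and || expect single values and are the forms normally used inside if().

```r
# Hypothetical example vectors
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

# Element-wise forms return one result per element
elementwise_and <- a & b
elementwise_or  <- a | b

# Scalar forms work on single values and stop as soon as the answer is known
x <- 1
scalar_and <- (x > 0) && (x < 10)
scalar_or  <- (x < 0) || (x == 1)

print(elementwise_and)
print(elementwise_or)
print(scalar_and)
print(scalar_or)
```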
Character:
class("abc")
[1] "character"
The class() function is useful for determining the type of a variable. There are other data types, but these are the ones involved in most of the programming you will be doing.
Variable types can be coerced by using an operator or a coercion function. Coercion by using an operator:
TRUE*1
[1] 1
z
[1] 1
class(z)
[1] "numeric"
as.integer(z)
[1] 1
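The following is a small sketch of the common coercion functions; the variable names are invented for illustration. It also shows what happens when a value cannot be converted.

```r
# Coercion functions convert a value to the requested type
num_from_text <- as.numeric("3.14")   # character to numeric
text_from_num <- as.character(42)     # numeric to character
flag_from_num <- as.logical(0)        # 0 becomes FALSE, non-zero becomes TRUE

# Operators coerce implicitly: logicals are treated as 0/1 in arithmetic
count_true <- sum(c(TRUE, FALSE, TRUE))

# Values that cannot be converted become NA (R normally warns about this)
failed <- suppressWarnings(as.numeric("abc"))

print(num_from_text)
print(text_from_num)
print(flag_from_num)
print(count_true)
print(failed)
```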
R is meant to be used in a way where operations are performed on entire vectors or matrices. These are composed of the basic data types discussed earlier:
Some basic ways to generate a vector:
vec1 <- c(1,2,3)
vec2 <- seq(from =1, to =3, by=1)
vec3 <- rep(1,10)
vec1
[1] 1 2 3
vec2
[1] 1 2 3
vec3
[1] 1 1 1 1 1 1 1 1 1 1
Vectors can also be added, subtracted, multiplied and divided element by element:
vec1+vec2
[1] 2 4 6
vec1*vec2
[1] 1 4 9
We can also perform more conventional matrix-algebra style operations using vectors. To do this we need to use the following:
t(vec1)
[,1] [,2] [,3]
[1,] 1 2 3
Now, find \( V_1^T V_1 \):
t(vec1)%*% vec1
[,1]
[1,] 14
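As an illustrative sketch, the same two tools, t() and %*%, also give the outer product, and crossprod() is a base R shortcut for t(x) %*% x. The variable names here are only for the example.

```r
vec1 <- c(1, 2, 3)

# Inner product: a 1 x 1 matrix
inner <- t(vec1) %*% vec1

# Outer product: a 3 x 3 matrix
outer_prod <- vec1 %*% t(vec1)

# crossprod(x) computes t(x) %*% x in one call
inner_alt <- crossprod(vec1)

print(inner)
print(outer_prod)
print(inner_alt)
```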
Matrix creation is somewhat more complicated than vector creation. A matrix of all 1's:
mat1 <- matrix(rep(1,3*3),nrow = 3,ncol = 3)
A subset of a matrix can be obtained using brackets and numbers indicating relevant columns and rows:
mat1[2:3,2:3]
[,1] [,2]
[1,] 1 1
[2,] 1 1
The same operations can be used on matrices that were used on vectors:
mat2 <- matrix(seq(from =1 , to =9, by=1 ),nrow = 3, ncol = 3)
# Original matrix:
mat2
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# Transposition:
t(mat2)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
# Matrix multiplication:
t(mat2)%*%mat2
[,1] [,2] [,3]
[1,] 14 32 50
[2,] 32 77 122
[3,] 50 122 194
Lists are a kind of catch-all object in R. They act as a way of storing one or more objects of any type.
mat1 <- matrix(seq(from = 1,to = 9,by=1),nrow = 3,ncol=3)
string1 <- "abc"
reg1 <- lm(mat1[,1]~mat1[,2])
example_list <- list(mat1,string1,reg1)
example_list
[[1]]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[[2]]
[1] "abc"
[[3]]
Call:
lm(formula = mat1[, 1] ~ mat1[, 2])
Coefficients:
(Intercept) mat1[, 2]
-3 1
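Once a list exists, its elements usually need to be pulled back out. A minimal sketch, using a small made-up list with named elements:

```r
# Hypothetical list with named elements
small_list <- list(numbers = c(1, 2, 3), label = "abc")

# [[ ]] and $ extract the element itself
first_element <- small_list[[1]]
label_element <- small_list$label

# [ ] returns a shorter list, not the element
sub_list <- small_list[1]

print(first_element)
print(label_element)
print(class(sub_list))
```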
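NA values are placeholders for missing values. They behave like numbers in arithmetic, so an operation whose result does not depend on the missing value still returns a number: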
NA^0
[1] 1
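In most other cases the missing value propagates through the calculation. A short sketch with a made-up vector, showing how to detect NA values and how to skip them in a summary:

```r
values <- c(1, NA, 3)

# Most arithmetic with NA gives NA, because the result is unknown
na_sum <- NA + 1

# is.na() finds the missing entries
missing_flags <- is.na(values)

# Many summary functions accept na.rm = TRUE to skip missing values
mean_with_na    <- mean(values)
mean_without_na <- mean(values, na.rm = TRUE)

print(na_sum)
print(missing_flags)
print(mean_with_na)
print(mean_without_na)
```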
Most of our needs for working in directories can be satisfied with getwd(), which gives the name of the current working directory, and setwd(), which changes it.
We can also list the files in the directory using list.files().
Any function can have its documentation brought up using ? or help(). For example:
?t
# Or:
help(t)
Other useful resources can be readily found at:
One of the best features of R is how easy it is to extend the language with packages written by other people and distributed through CRAN. Once you know which package you want, download it with the install.packages() function. Then, to load the package, use the library() function:
If you wanted to download and load the stringr package you would simply do the following:
install.packages("stringr")
library(stringr)
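A possible defensive variant, sketched here under the assumption that you would rather check for a package than reinstall it on every run: requireNamespace() reports whether a package is available, and the pkg::function form calls a function without attaching the whole package. The fallback to base R's toupper() is only for illustration.

```r
pkg <- "stringr"

# TRUE if the package is installed and can be loaded, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  # Call a stringr function directly through its namespace
  shouted <- stringr::str_to_upper("abc")
} else {
  # Base R equivalent, used when stringr is not available
  shouted <- toupper("abc")
}

print(shouted)
```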
Here's an example:
test <- c(0,1,0,1,0,1,0,1)
a <- 2
if(a == 1){TRUE
}else if(a==2){print("twice")
}else{print("Nope")}
[1] "twice"
Note: Each else if passes control to another if statement, which is evaluated only when the preceding condition is not true.
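if() expects a single TRUE or FALSE. When the condition is a whole vector, base R's ifelse() is the usual tool. A small sketch with a made-up vector:

```r
flags <- c(0, 1, 0, 1)

# ifelse() checks the condition for every element and picks a value for each
labels <- ifelse(flags == 1, "one", "zero")

print(labels)
```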
R offers the same ability to use loops that is available in most other languages. The syntax used in R is shown below:
for(i in 1:10)
{ print(i)}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Unlike in C, the loop variable does not need to be declared or initialized beforehand. The loop creates i, assigns it each value of 1:10 in turn, and stops once the sequence is exhausted.
This being said, for-loops should generally be avoided in favor of vectorization unless absolutely necessary.
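To make the comparison concrete, here is a minimal sketch that squares the numbers 1 to 10 both ways. The variable names are invented for the example.

```r
x <- 1:10

# Loop version: fill a preallocated vector one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the operator is applied to the whole vector at once
squares_vec <- x^2

# Both approaches give the same result
print(identical(squares_loop, squares_vec))
```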
R provides a set of functions which are preferred when applying a function over the elements, columns or rows of a matrix or data frame.
These are the apply() family of functions:
apply() type functions take several arguments:
Let's take the mean of each of the columns of the mtcars dataset:
apply(mtcars,2,mean)
mpg cyl disp hp drat wt
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
qsec vs am gear carb
17.848750 0.437500 0.406250 3.687500 2.812500
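In the call above, the arguments are the data, the margin (1 for rows, 2 for columns) and the function to apply. The sketch below shows a few related base R functions on the same dataset; the result names are made up for illustration.

```r
# sapply() applies a function to each column of a data frame
# and simplifies the result to a named vector
col_means_sapply <- sapply(mtcars, mean)

# lapply() does the same but always returns a list
col_means_list <- lapply(mtcars, mean)

# Margin 1 means rows: the largest value in each row
row_max <- apply(mtcars, 1, max)

# colMeans() is a built-in shortcut for column means
col_means_builtin <- colMeans(mtcars)

print(col_means_sapply)
print(head(row_max))
```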
R allows users to define their own functions in the same manner as other languages:
new_function <-function(x){
x+1
}
new_function(2)
[1] 3
We can also use return() to explicitly specify the value that will be returned to the user. If it is not used, the value of the last evaluated expression is returned:
new_function <-function(x){
return(10)
x+1
}
new_function(2)
[1] 10
Lists are useful for returning more than one object from a function:
new_function <-function(x){
answer<- x+1
return(list(x,answer))
}
new_function(2)
[[1]]
[1] 2
[[2]]
[1] 3
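One optional refinement, sketched here with hypothetical element names: naming the list elements lets the caller pick out each result with $ instead of by position.

```r
named_function <- function(x) {
  answer <- x + 1
  # Naming the elements lets callers refer to them with $
  return(list(input = x, answer = answer))
}

result <- named_function(2)

print(result$input)
print(result$answer)
```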
A data frame is a structure used in R to store data as a list of named vectors. It is the default way to represent data in R and allows users to review and edit data.
A simple example is the iris data frame which comes preloaded into R:
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Note that the columns do not all have the same variable type. Most of the data is numeric, but Species is a factor, which is R's type for categorical data with text labels.
Using the $ operator allows the user to select a column of the data frame and to then use it as a vector:
iris$Sepal.Length>5
[1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[12] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[23] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
[34] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[45] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[56] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[89] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[100] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[122] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[144] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Specific rows and columns can be selected in several different ways:
Columns can be selected by name. This gives the values of the column as a vector:
irisdata <- head(iris,10)
names(irisdata)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species"
## Selecting Columns and Rows:
irisdata$Sepal.Length
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
class(irisdata$Sepal.Length)
[1] "numeric"
The columns can also be accessed by using bracket notation. There are two ways to get a column:
# Produces a vector which is equivalent to the earlier example:
irisdata[,1]
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
# Produces a one-column data frame rather than a vector:
irisdata[1]
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
Rows are also accessed through the bracket operator:
irisdata[1,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
The bracket notation can also be used to select rows and columns:
irisdata[3:6,2:5]
Sepal.Width Petal.Length Petal.Width Species
3 3.2 1.3 0.2 setosa
4 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
6 3.9 1.7 0.4 setosa
Finally, a vector can be passed to the brackets in order to specify which rows and columns we want:
rows <- seq(from =2,to = 10, by= 2)
irisdata[rows,2:4]
Sepal.Width Petal.Length Petal.Width
2 3.0 1.4 0.2
4 3.1 1.5 0.2
6 3.9 1.7 0.4
8 3.4 1.5 0.2
10 3.1 1.5 0.1
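Two further bracket idioms, shown as a brief sketch: a character vector selects columns by name, and negative numbers drop rows or columns instead of selecting them. The result names are invented for the example.

```r
irisdata <- head(iris, 10)

# Select columns by name
by_name <- irisdata[, c("Sepal.Length", "Species")]

# Negative indices drop rows or columns instead of selecting them
without_first_row <- irisdata[-1, ]
without_species   <- irisdata[, -5]

print(names(by_name))
print(nrow(without_first_row))
print(names(without_species))
```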
A vector of logical values can be passed to the brackets of a data frame in order to specify a subset of rows. For instance:
example <- head(mtcars)
logivec <- c(TRUE,FALSE,TRUE,FALSE,FALSE,TRUE)
example[logivec,]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
We can use the fact that a vector can be evaluated using logical criteria to subset a data frame. For example, say that we want to take the mtcars data frame and select only rows where disp < 150:
mtcars.subset <- mtcars[mtcars$disp <150,]
head(mtcars.subset)
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
We can also produce more complex criteria using the & and | operators:
mtcars.subset <- mtcars[mtcars$disp <150 & mtcars$wt>3,]
mtcars.subset
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
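As a sketch of the other operator and of an alternative spelling: | keeps rows where either condition holds, and base R's subset() evaluates the condition inside the data frame, so the mtcars$ prefix is not needed. The cutoff wt > 5 is an arbitrary value chosen for illustration.

```r
# Rows where either condition holds
either <- mtcars[mtcars$disp < 150 | mtcars$wt > 5, ]

# subset() looks up disp and wt inside mtcars
both <- subset(mtcars, disp < 150 & wt > 3)

print(rownames(either))
print(rownames(both))
```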
Reordering a data frame is somewhat more complex, but it takes only a one-line command that employs the order() function. This function produces a numeric vector of row positions which we can pass as the first argument in the brackets to rearrange the data frame.
order(mtcars$disp)
[1] 20 19 18 26 28 3 21 27 32 9 30 8 1 2 10 11 6 4 12 13 14 31 23
[24] 22 24 29 5 7 25 17 16 15
order_on_disp <- mtcars[order(mtcars$disp),]
head(order_on_disp,10)
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
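order() can also sort in descending order or by several columns at once. A minimal sketch, with result names invented for the example:

```r
# Largest displacement first
desc_on_disp <- mtcars[order(mtcars$disp, decreasing = TRUE), ]

# Sort by number of cylinders, then by mpg within each cylinder group
by_cyl_mpg <- mtcars[order(mtcars$cyl, mtcars$mpg), ]

print(head(desc_on_disp[, c("disp", "cyl")], 3))
print(head(by_cyl_mpg[, c("cyl", "mpg")], 3))
```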
Importing data to R is fairly simple and can be done from nearly any source imaginable with some work. Some of the more common sources are:
Most of your file importing needs can be fulfilled with the read.table() function.
For example, to load the file example.csv into the workspace requires the one line command:
example_data <- read.table("example.csv", sep = ",", header = TRUE)
head(example_data)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
read.table() offers a wide array of options which can be seen by using ?read.table. Some of these are:
Because of the flexibility of this function, it should satisfy most of your data import needs.
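The sketch below illustrates three commonly used options: header, sep and stringsAsFactors. It is self-contained: it first writes a few rows of iris to a temporary file (the path comes from tempfile() and is not a file from this presentation) and then reads them back.

```r
# Write a small data frame to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.table(head(iris), csv_path, sep = ",", row.names = FALSE)

# header = TRUE: the first line holds the column names
# sep = ",": fields are separated by commas
# stringsAsFactors = FALSE: keep text columns as character vectors
example_data <- read.table(csv_path, header = TRUE, sep = ",",
                           stringsAsFactors = FALSE)

str(example_data)

# Remove the temporary file
unlink(csv_path)
```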
The workhorse of basic R plotting is the plot() function, which allows the user to generate a wide range of graphics and specify the way they appear.
An example using the iris dataset:
plot(iris)
We just produced a scatterplot of all the variables against all the other variables!
Let's put together a more descriptive plot of sepal width versus sepal length:
plot(iris$Sepal.Length,iris$Sepal.Width)
Now, add titles to the plot so it is more readable:
plot(iris$Sepal.Length,iris$Sepal.Width,xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width")
Add color and change the shape of the dots used in the scatterplot:
plot(iris$Sepal.Length,iris$Sepal.Width,xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width",pch = 5,col = "darkblue")
We can also add vertical and horizontal lines to the plot. In this case, let's plot the means of the width and length using the abline() function:
# Add a vertical line for the sepal length's mean:
plot(iris$Sepal.Length,iris$Sepal.Width,xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width",pch = 5,col = "darkblue")
abline(v = mean(iris$Sepal.Length))
# Add a horizontal line for the sepal width's mean:
abline(h = mean(iris$Sepal.Width))
Let's finish this plot by graphing everything from before and add labels to the data points based on species type. This is done with the text() function.
# Redraw the scatterplot, then label each point with its species:
plot(iris$Sepal.Length,iris$Sepal.Width,xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width",pch = 5,col = "darkblue")
text(iris$Sepal.Length,iris$Sepal.Width,iris$Species,pos = 3,cex = .6)
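With 150 points the text labels overlap heavily. One possible alternative, sketched here with an arbitrary choice of colors, is to color the points by species and add a legend() instead.

```r
# One color per species level (the colors are an arbitrary choice)
species_colors <- c("darkblue", "darkgreen", "red")

# as.integer() on a factor gives the level number of each observation,
# which is used here to look up that observation's color
point_colors <- species_colors[as.integer(iris$Species)]

plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Plot of Sepal Length Against Width",
     pch = 5, col = point_colors)

# The legend maps each color back to its species name
legend("topright", legend = levels(iris$Species),
       col = species_colors, pch = 5)
```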
R also offers other functions for plotting:
virginica_vec <- iris$Species=="virginica"
setosa_vec <- iris$Species=="setosa"
versicolor_vec <- iris$Species=="versicolor"
virginica <- iris[virginica_vec,]$Sepal.Width
setosa <- iris[setosa_vec,]$Sepal.Width
versicolor <- iris[versicolor_vec,]$Sepal.Width
new_frame <- data.frame(virginica,setosa,versicolor)
boxplot(new_frame,main = "Box and Whisker Plot of the Sepal Width",ylab = "Width in Centimeters")
# Add color:
boxplot(new_frame,main = "Box and Whisker Plot of the Sepal Width",ylab = "Width in Centimeters",col =c("green", "blue","red"))
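A shorter route to a similar plot, sketched here as an alternative, is boxplot()'s formula interface, which splits Sepal.Width by Species without building a new data frame. The boxes then appear in the order of the factor levels rather than the column order used above.

```r
# Sepal.Width ~ Species means "Sepal.Width grouped by Species"
bp <- boxplot(Sepal.Width ~ Species, data = iris,
              main = "Box and Whisker Plot of the Sepal Width",
              ylab = "Sepal Width (cm)",
              col = c("green", "blue", "red"))

# boxplot() invisibly returns the statistics it plotted, including group names
print(bp$names)
```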
library(datasets)
toothy_data <- ToothGrowth
# In a histogram the x-axis shows the measured values and the y-axis shows counts
hist(toothy_data$len,breaks = length(toothy_data$len)/4,xlab = "Tooth Length",ylab = "Frequency", main = "Histogram of Tooth Length",col= "darkgreen" )
R has several packages which extend its graphics capabilities even further: