The popular, open source statistical programming language “R” is widely used in both academia and industry or data analytics. According to a recent survey, R is the 15th most used programming language. We can work directly in R, but RStudio provides a more friendly programming environment. Importantly, Rstudio can be used only in conjunction with R. In this workshop, we will present in RStudio.
Both R and RStudio are free. The links below provide two short videos indicating how to download and install R and RStudio (Please install both):
The introductory workshop includs three sections.
At the end of the workshop, participants will:
Now let’s open Rstudio
Click the buttom at the top left corner and choose R Script.
Now
# Always add comments
# R is CASE SENSITIVE, which means A =/= a.
In R, a vector is a collection of numbers or characters.
The first function we are going to learn is c( ) , which is used to generate a vector.
The round brakets indicate xx() is a function and xx is the function name.
Eg1: a vector contains numbers
c(1, 2, 3, 4, 5, 6, 7, 8)
c(5*6, 72/8)
c(sqrt(3.14), log(1), exp(1), tan(pi/4), sin(pi/3))
Eg2: a vector contains characters
c("Regression", "ANOVA", "SVM")
c("hope=","hold","on","pain","ends")
Question: If a vector contains both numbers and characters, is there anything we need to pay attention to? Run the following code and tell me what you find.
c("R Intro Workshop", 2017)
Eg3: generate a vector with certain pattern
This expression generate a vector from m to n and the increment is 1 or -1.
This function is a generalization of m:n.
1:16
5:1
seq(7,28,3)
seq(67,53, 1)
Question: How to calculate the following expressions? \[ 1^2, 2^2, 3^2,4^2,5^2 \]
\[ 1^2, 2^3, 3^4, 4^5, 5^6 \]
\[ 3^{-1}, 3^{-2}, 3^{-3}, 3^{-4}, 3^{-5},3^{-6},3^{-7},3^{-8} \]
# Let's look at the Environment
x <- 1:5
# vector x: start at 1, end in 5, increment is 1
y <- seq(13,10,by= -0.5)
# vector y: start at 13, end in 10, increment is -0.5
# Names can be fancy -- no space!
Question: Why we cannot see how \(x\) and \(y\) look like? To check them, what can we do?
x*2
x-y
mean(y)
sum(x)
In R, getting a subset of vector, matrix and data frame, one option is ‘[ ]’.
y
y[7]
y[8]
y[8] <- 14
y
Questions:
You must enable Javascript to view this page properly.
The function used to create a matrix in R is ‘matrix( )’.
Suppose we would like to create the following matrix M. \[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]
M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M
Question: Given an output only, how can we distinguish whether it is a vector or a matrix?
x
Usually, we apply same operation, like mean or sum, to all the rows or all the columns to a matrix.
The related function is called ‘apply()’.
Next, we will calculte the row mean and column sum for our matrix M.
apply(M, 1, mean) ## calculate row mean
apply(M, 2, sum) ## calculate column sum
Similar to vector, we also use ‘[]’ to obtain a subset of a matrix. Since matrix is two dimensional, we need to specify each dimension clearly.
Eg1: extract an element from a matrix.
M
M[3,1]
Eg2: extract a row or a column.
M[2,]
M[,3]
Questions:
We use function ‘data.frame( )’ to create a data frame. Eg:
A <- 2:7
B <- A^2
C <- c('These','Are','Words','Not','Numbers','Eh')
mydata <- data.frame(First=A, b=B, Char=C)
mydata
## First b Char
## 1 2 4 These
## 2 3 9 Are
## 3 4 16 Words
## 4 5 25 Not
## 5 6 36 Numbers
## 6 7 49 Eh
Usually, the datasets we work on are date frames.
The function used to read a .csv file is ‘read.csv( )’.
The function used to read a .txt file is ‘read.table( )’.
If you have a .xls file, save it as .csv file before you read it into RStudio.
eg1: read .txt file
Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)
## length breadth species id
## 1 21.7 16.1 meadow.pipit 21
## 2 22.6 17.0 meadow.pipit 22
## 3 20.9 16.2 meadow.pipit 23
## 4 21.6 16.2 meadow.pipit 24
## 5 22.2 16.9 meadow.pipit 25
## 6 22.5 16.9 meadow.pipit 26
eg2: read .csv file
Airplane <- read.csv("airplane.csv",sep=",", header = TRUE)
head(Airplane)
## paper distance
## 1 light 3.1
## 2 light 3.3
## 3 light 2.1
## 4 light 1.9
## 5 medium 4.0
## 6 medium 3.5
eg3: read data from a URL
Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)
## Color Spine Width Satellite Weight
## 1 3 3 28.3 8 3050
## 2 4 3 22.5 0 1550
## 3 2 1 26.0 9 2300
## 4 4 3 24.8 0 2100
## 5 4 3 26.0 4 2600
## 6 3 3 23.8 0 2100
Let’s use the Airplane data as an example
eg1: extract one column
Airplane$distance # method 1
mydata[,2] # method 2
eg2: extract multiple columns
Airplane[,c("paper","distance")] # method 1
Airplane[,c(1,2)] # method 2
Question: In both eg1 and eg2, which method is better? Why?
eg3: extract the row where paper is medium
Airplane[paper=="medium",]
Airplane[paper=="medium ",]
Airplane[Airplane$paper=="medium ",]
Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed.
class(Airplane$paper)
Airplane$paper
as.character(Airplane$paper)
as.numeric(Airplane$paper)
We can imagine that list is like a magic bag, which can contain any type of objects. It’s very handy when we want to commbine data with different lengths and types.
Question: When?
The function we use to create a list is ‘list( )’
eg:
List1 <- list(X = x, Crab.Data = Crab)
List1
| Data type | Dimension | Features |
|---|---|---|
| Vector | 1 | |
| Matrix | 2 | Only numbers. Rectangle shaped. |
| Data Frame | 2 | Can contain both numbers and characters. Rectangle shaped. |
| List | n | Able to hold any data types. No shape restriction. |
plot(Weight~Width,data=Crab)
## In plot( ) function
# add header -- main
# change the shape of the point -- pch
# change the x-lab -- xlab
# change the y-lab -- ylab
# add line to an existing graph -- abline()
# abline(-3944,242.6)
The location of the tilde between Weight and Width on the key board is shown below:
boxplot(distance~paper,data=Airplane)
hist(Crab$Width)
# change the main
# change the x-lab
barplot(Airplane$distance)
# fill color of each type of paper
# add legend
# add ylab
# add main
Question: What’s the difference between a histogram and a bar plot?
par(mfrow=c(2,2))
plot(Weight~Width,data=Crab)
boxplot(distance~paper,data=Airplane)
hist(Crab$Width)
barplot(Airplane$distance)
par(mfrow=c(2,2))
par(mar=c(1,1,1,1))
plot(Weight~Width,data=Crab)
boxplot(distance~paper,data=Airplane)
hist(Crab$Width)
barplot(Airplane$distance)
The dataset below are from open government portal.
Milk.Canada <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Milk production, total",]
plot(as.numeric(Milk.Canada$Value),type = "l", main="Canada Milk Production Total",ylab="Value",xlab="Time")
Milk.Sold <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Milk sold off farms, total",]
plot(as.numeric(Milk.Sold$Value), type="l", main="Canada Milk Sold Off Farms Total",ylab = "Value",xlab="Year",axes = FALSE,ylim=c(9000,14000))
axis(side=1, at=seq(1,495,by=12), labels= 1976:2017, las=1)
axis(side=2, at=seq(9000,14000,by=1000), lab=seq(9000,14000,by=1000), las=2)
Milk.Fluid <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Fluid purposes", ]
plot(as.numeric(Milk.Fluid$Value),type = "l", main="Canada Milk Fluid Purposes", ylab="Value", xlab="Time",axes = FALSE,ylim=c(1800,8000))
axis(side=1, at=seq(1,855,by=12), labels= 1946:2017, las=1)
axis(side=2, at=seq(1800,8000,by=1000), lab=seq(1800,8000,by=1000), las=2)
Fluid <- Milk[Milk$DAI=="Fluid purposes",]
boxplot(as.numeric(Value) ~ substr(GEO,1,7),data=Fluid,pch=20)
Fluid.Pro <- Fluid[!(Fluid$GEO=="Canada"),]
boxplot(as.numeric(Value) ~ substr(GEO,1,7),data=Fluid.Pro,pch=20)
Mod0 <- lm(as.numeric(Value) ~ GEO, data=Fluid.Pro)
anova(Mod0)
## Analysis of Variance Table
##
## Response: as.numeric(Value)
## Df Sum Sq Mean Sq F value Pr(>F)
## GEO 9 7.5143e+10 8349269857 1135 < 2.2e-16 ***
## Residuals 4940 3.6341e+10 7356483
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(residuals(Mod0),pch=20,cex=0.4)
table(Fluid.Pro$GEO)
##
## Alberta British Columbia
## 495 495
## Canada Manitoba
## 0 495
## New Brunswick Newfoundland and Labrador
## 495 495
## Nova Scotia Ontario
## 495 495
## Prince Edward Island Quebec
## 495 495
## Saskatchewan
## 495
Use R dataset ‘iris’. Load the data using data(iris).
data(iris)
plot and identify two variables that are most linear with each other.
fit a regression model based on the variable you choose and add the fitted line to the scatter plot.
Try: fit regression model for each species.
## plot the data
plot(iris) # the most linear two variables are Petal.Length & Petal.Width
## regression model on Petal.Length & Petal.Width
Mod <- lm(Petal.Length ~ Petal.Width, data=iris)
summary(Mod)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33542 -0.30347 -0.02955 0.25776 1.39453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08356 0.07297 14.85 <2e-16 ***
## Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
plot(Petal.Length ~ Petal.Width, data=iris,
main="Linear model", pch=20,
xlab="Petal width", ylab="Petal length",
xlim=c(0,3),ylim=c(0,8))
abline(Mod, col=2)
## regression model on each species
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
Mod.Setosa <- lm(Petal.Length ~ Petal.Width,
data = subset(iris,Species == 'setosa'))
Mod.Versicolor <- lm(Petal.Length ~ Petal.Width,
data = subset(iris,Species == 'versicolor'))
Mod.Virginica <- lm(Petal.Length ~ Petal.Width,
data = subset(iris,Species == 'virginica'))
plot(Petal.Length ~ Petal.Width, data=iris,
main="Linear model", pch=as.numeric(factor(Species)),
xlab="Petal width", ylab="Petal length",
xlim=c(0,3),ylim=c(0,8),col=factor(Species))
abline(Mod.Setosa,col=1)
abline(Mod.Versicolor,col=2)
abline(Mod.Virginica,col=3)
legend("topleft",legend = c("Setosa","Versicolor","Virginica"),
pch=c(1,2,3),col=1:3)