The popular, open source statistical programming language “R” is widely used in both academia and industry or data analytics. According to a recent survey, R is the 15th most used programming language. We can work directly in R, but RStudio provides a more friendly programming environment. Importantly, Rstudio can be used only in conjunction with R. In this workshop, we will present in RStudio.

Both R and RStudio are free. The links below provide two short videos indicating how to download and install R and RStudio (Please install both):

The introductory workshop includs three sections.

  1. In the first section, we focus on the basic data types used in R. We will introduce vector, matrix, data frame and list.
  2. In the second section, we introduce the basic graphic skills, such as:
    • how to plot different types of graphics, including scatter plots, box plots, and bar charts;
    • how to add points and lines;
    • how to add legends;
    • how to adjust the margins;
    • how to lay out multiple graphics.
  3. In the last section, we will work on a real case/data together.

At the end of the workshop, participants will:



Now let’s open Rstudio


Click the buttom at the top left corner and choose R Script.


Now

# Always add comments
# R is CASE SENSITIVE, which means A =/= a.



1 Basic Data Types

1.1 Vector

In R, a vector is a collection of numbers or characters.

The first function we are going to learn is c( ) , which is used to generate a vector.

The round brakets indicate xx() is a function and xx is the function name.


Eg1: a vector contains numbers

c(1, 2, 3, 4, 5, 6, 7, 8)

c(5*6, 72/8)

c(sqrt(3.14), log(1), exp(1), tan(pi/4), sin(pi/3))


Eg2: a vector contains characters

c("Regression", "ANOVA", "SVM")

c("hope=","hold","on","pain","ends")


Question: If a vector contains both numbers and characters, is there anything we need to pay attention to? Run the following code and tell me what you find.

c("R Intro Workshop", 2017)


Eg3: generate a vector with certain pattern

  • m:n

This expression generate a vector from m to n and the increment is 1 or -1.

  • seq()

This function is a generalization of m:n.

1:16

5:1

seq(7,28,3)

seq(67,53, 1)

Question: How to calculate the following expressions? \[ 1^2, 2^2, 3^2,4^2,5^2 \]

\[ 1^2, 2^3, 3^4, 4^5, 5^6 \]

\[ 3^{-1}, 3^{-2}, 3^{-3}, 3^{-4}, 3^{-5},3^{-6},3^{-7},3^{-8} \]

# Let's look at the Environment
x <- 1:5 
# vector x: start at 1, end in 5, increment is 1 
y <- seq(13,10,by= -0.5)
# vector y: start at 13, end in 10, increment is -0.5 
# Names can be fancy -- no space!


Question: Why we cannot see how \(x\) and \(y\) look like? To check them, what can we do?


1.1.1 Operations on a Vector

x*2

x-y

mean(y)

sum(x)


1.1.2 Subsets of a Vector

In R, getting a subset of vector, matrix and data frame, one option is ‘[ ]’.

y

y[7] 

y[8] 

y[8] <- 14 

y


1.2 Matrix

Questions:

  • What is a matrix?
  • What the difference between matrix and vector?
  • Matrices application?

You must enable Javascript to view this page properly.

The function used to create a matrix in R is ‘matrix( )’.

Suppose we would like to create the following matrix M. \[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]

M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M


Question: Given an output only, how can we distinguish whether it is a vector or a matrix?

x


1.2.1 Operations on a Matrix

Usually, we apply same operation, like mean or sum, to all the rows or all the columns to a matrix.

The related function is called ‘apply()’.

Next, we will calculte the row mean and column sum for our matrix M.

apply(M, 1, mean) ## calculate row mean

apply(M, 2, sum) ## calculate column sum


1.2.2 Subsets of a Matrix

Similar to vector, we also use ‘[]’ to obtain a subset of a matrix. Since matrix is two dimensional, we need to specify each dimension clearly.


Eg1: extract an element from a matrix.

M
M[3,1]


Eg2: extract a row or a column.

M[2,]
M[,3]


1.3 Data Frame

Questions:

  • What is a data frame?
  • What the differences between matrix and data frame?

We use function ‘data.frame( )’ to create a data frame. Eg:

A <- 2:7
B <- A^2
C <- c('These','Are','Words','Not','Numbers','Eh')
mydata <- data.frame(First=A, b=B, Char=C)
mydata
##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers
## 6     7 49      Eh


Usually, the datasets we work on are date frames.

The function used to read a .csv file is ‘read.csv( )’.

The function used to read a .txt file is ‘read.table( )’.

If you have a .xls file, save it as .csv file before you read it into RStudio.

eg1: read .txt file

Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)
##   length breadth      species id
## 1   21.7    16.1 meadow.pipit 21
## 2   22.6    17.0 meadow.pipit 22
## 3   20.9    16.2 meadow.pipit 23
## 4   21.6    16.2 meadow.pipit 24
## 5   22.2    16.9 meadow.pipit 25
## 6   22.5    16.9 meadow.pipit 26

eg2: read .csv file

Airplane <- read.csv("airplane.csv",sep=",", header = TRUE)
head(Airplane)
##      paper distance
## 1   light       3.1
## 2   light       3.3
## 3   light       2.1
## 4   light       1.9
## 5 medium        4.0
## 6 medium        3.5

eg3: read data from a URL

Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)
##   Color Spine Width Satellite Weight
## 1     3     3  28.3         8   3050
## 2     4     3  22.5         0   1550
## 3     2     1  26.0         9   2300
## 4     4     3  24.8         0   2100
## 5     4     3  26.0         4   2600
## 6     3     3  23.8         0   2100


1.3.1 Subsets of a Data Frame

Let’s use the Airplane data as an example

eg1: extract one column

Airplane$distance # method 1
mydata[,2] # method 2


eg2: extract multiple columns

Airplane[,c("paper","distance")] # method 1
Airplane[,c(1,2)] # method 2


Question: In both eg1 and eg2, which method is better? Why?


eg3: extract the row where paper is medium

Airplane[paper=="medium",]
Airplane[paper=="medium  ",]
Airplane[Airplane$paper=="medium  ",] 


1.4 Factor

Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed.

class(Airplane$paper) 
Airplane$paper
as.character(Airplane$paper)
as.numeric(Airplane$paper)

1.5 List

We can imagine that list is like a magic bag, which can contain any type of objects. It’s very handy when we want to commbine data with different lengths and types.

Question: When?


The function we use to create a list is ‘list( )’

eg:

List1 <- list(X = x, Crab.Data = Crab)
List1


1.6 summary

Data type Dimension Features
Vector 1
Matrix 2 Only numbers. Rectangle shaped.
Data Frame 2 Can contain both numbers and characters. Rectangle shaped.
List n Able to hold any data types. No shape restriction.



2 Basic Graphic Skills

2.1 Scatter Plot, Box Plot, Histogram, and Bar Chart

plot(Weight~Width,data=Crab)

## In plot( ) function
# add header -- main
# change the shape of the point -- pch
# change the x-lab -- xlab
# change the y-lab -- ylab

# add line to an existing graph -- abline()
# abline(-3944,242.6)

The location of the tilde between Weight and Width on the key board is shown below:

boxplot(distance~paper,data=Airplane)

hist(Crab$Width)

# change the main
# change the x-lab
barplot(Airplane$distance)

# fill color of each type of paper
# add legend
# add ylab
# add main

Question: What’s the difference between a histogram and a bar plot?

2.2 Lay Out Multiple Graphics

par(mfrow=c(2,2))
plot(Weight~Width,data=Crab)
boxplot(distance~paper,data=Airplane)
hist(Crab$Width)
barplot(Airplane$distance)

2.3 Adjust the Margins

par(mfrow=c(2,2))
par(mar=c(1,1,1,1))
plot(Weight~Width,data=Crab)
boxplot(distance~paper,data=Airplane)
hist(Crab$Width)
barplot(Airplane$distance)



3 Case Study

The dataset below are from open government portal.

3.2 Milk production and utilization

Milk.Canada <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Milk production, total",]
plot(as.numeric(Milk.Canada$Value),type = "l", main="Canada Milk Production Total",ylab="Value",xlab="Time")

Milk.Sold <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Milk sold off farms, total",]

plot(as.numeric(Milk.Sold$Value), type="l", main="Canada Milk Sold Off Farms Total",ylab = "Value",xlab="Year",axes = FALSE,ylim=c(9000,14000))

axis(side=1, at=seq(1,495,by=12), labels= 1976:2017, las=1)
axis(side=2, at=seq(9000,14000,by=1000), lab=seq(9000,14000,by=1000), las=2)

Milk.Fluid <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Fluid purposes", ]

plot(as.numeric(Milk.Fluid$Value),type = "l", main="Canada Milk Fluid Purposes", ylab="Value", xlab="Time",axes = FALSE,ylim=c(1800,8000))

axis(side=1, at=seq(1,855,by=12), labels= 1946:2017, las=1)
axis(side=2, at=seq(1800,8000,by=1000), lab=seq(1800,8000,by=1000), las=2)

Fluid <- Milk[Milk$DAI=="Fluid purposes",]
boxplot(as.numeric(Value) ~ substr(GEO,1,7),data=Fluid,pch=20)

Fluid.Pro <- Fluid[!(Fluid$GEO=="Canada"),]
boxplot(as.numeric(Value) ~ substr(GEO,1,7),data=Fluid.Pro,pch=20)

Mod0 <- lm(as.numeric(Value) ~ GEO, data=Fluid.Pro)
anova(Mod0)
## Analysis of Variance Table
## 
## Response: as.numeric(Value)
##             Df     Sum Sq    Mean Sq F value    Pr(>F)    
## GEO          9 7.5143e+10 8349269857    1135 < 2.2e-16 ***
## Residuals 4940 3.6341e+10    7356483                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(residuals(Mod0),pch=20,cex=0.4)

table(Fluid.Pro$GEO)
## 
##                   Alberta          British Columbia 
##                       495                       495 
##                    Canada                  Manitoba 
##                         0                       495 
##             New Brunswick Newfoundland and Labrador 
##                       495                       495 
##               Nova Scotia                   Ontario 
##                       495                       495 
##      Prince Edward Island                    Quebec 
##                       495                       495 
##              Saskatchewan 
##                       495

One-way ANOVA assumptions.


3.4 Previous Practice

Use R dataset ‘iris’. Load the data using data(iris).

data(iris)
  • plot and identify two variables that are most linear with each other.

  • fit a regression model based on the variable you choose and add the fitted line to the scatter plot.

  • Try: fit regression model for each species.

## plot the data 
plot(iris) # the most linear two variables are Petal.Length & Petal.Width

## regression model on Petal.Length & Petal.Width
Mod <- lm(Petal.Length ~ Petal.Width, data=iris)
summary(Mod)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
plot(Petal.Length ~ Petal.Width, data=iris,
     main="Linear model", pch=20,
     xlab="Petal width", ylab="Petal length",
     xlim=c(0,3),ylim=c(0,8))
abline(Mod, col=2)

## regression model on each species 
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
Mod.Setosa <- lm(Petal.Length ~ Petal.Width, 
                 data = subset(iris,Species == 'setosa'))
Mod.Versicolor <- lm(Petal.Length ~ Petal.Width, 
                 data = subset(iris,Species == 'versicolor'))
Mod.Virginica <- lm(Petal.Length ~ Petal.Width, 
                 data = subset(iris,Species == 'virginica'))

plot(Petal.Length ~ Petal.Width, data=iris,
     main="Linear model", pch=as.numeric(factor(Species)),
     xlab="Petal width", ylab="Petal length",
     xlim=c(0,3),ylim=c(0,8),col=factor(Species))
abline(Mod.Setosa,col=1)
abline(Mod.Versicolor,col=2)
abline(Mod.Virginica,col=3)
legend("topleft",legend = c("Setosa","Versicolor","Virginica"),
       pch=c(1,2,3),col=1:3)



4 What’s more

5 After this workshop


the book that taught me R

the book that taught me R