The popular, open source statistical programming language “R” is widely used in both academia and industry or data analytics. According to a recent survey, R is the 15th most used programming language. We can work directly in R, but RStudio provides a more friendly programming environment. Importantly, Rstudio can be used only in conjunction with R. In this workshop, we will present in RStudio.

Both R and RStudio are free. The links below provide two short videos indicating how to download and install R and RStudio (Please install both):

Now

# Always add comments
# R is CASE SENSITIVE, which means A =/= a.

1 Basic Data Types

1.1 Vector

In R, a vector is a collection of numbers or characters.

The first function we are going to learn is c( ) , which is used to generate a vector.

The round brakets indicate xx() is a function and xx is the function name.

Eg1: a vector contains numbers

c(1, 2, 3, 4, 5, 6, 7, 8)

c(5*6, 72/8)

c(sqrt(3.14), log(1), exp(1), tan(pi/4), sin(pi/3))

Eg2: a vector contains characters

c("Regression", "ANOVA", "SVM")

c("hope=","hold","on","pain","ends")

Question: If a vector contains both numbers and characters, is there anything we need to pay attention to? Run the following code and tell me what you find.

c("R Intro Workshop", 2017)

Eg3: generate a vector with certain pattern

This expression generate a vector from m to n and the increment is 1 or -1.

seq()

This function is a generalization of m:n.

1:16

5:1

seq(7,28,3)

seq(67,53, 1)

Question: How to calculate the following expressions? \[ 1^2, 2^2, 3^2,4^2,5^2 \]

\[ 1^2, 2^3, 3^4, 4^5, 5^6 \]

\[ 3^{-1}, 3^{-2}, 3^{-3}, 3^{-4}, 3^{-5},3^{-6},3^{-7},3^{-8} \]

# Let's look at the Environment

x <- 1:5 
# vector x: start at 1, end in 5, increment is 1 
y <- seq(13,10,by= -0.5)
# vector y: start at 13, end in 10, increment is -0.5

# Names can be fancy -- no space!

Question: Why we cannot see how \(x\) and \(y\) look like? To check them, what can we do?

1.1.1 Operations on a Vector

x*2

x-y

mean(y)

sum(x)

1.1.2 Subsets of a Vector

In R, getting a subset of vector, matrix and data frame, one option is ‘[ ]’.

y

y[7] 

y[8] 

y[8] <- 14 

y

1.2 Matrix

Retrieved from Math Plane

Questions:

What is a matrix?
What the difference between matrix and vector?
Matrices application?

You must enable Javascript to view this page properly.

The function used to create a matrix in R is ‘matrix( )’.

Suppose we would like to create the following matrix M. \[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]

M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M

Question: Given an output only, how can we distinguish whether it is a vector or a matrix?

1.2.1 Operations on a Matrix

Usually, we apply same operation, like mean or sum, to all the rows or all the columns to a matrix.

The related function is called ‘apply()’.

Next, we will calculte the row mean and column sum for our matrix M.

apply(M, 1, mean) ## calculate row mean

apply(M, 2, sum) ## calculate column sum

1.2.2 Subsets of a Matrix

Similar to vector, we also use ‘[]’ to obtain a subset of a matrix. Since matrix is two dimensional, we need to specify each dimension clearly.

Eg1: extract an element from a matrix.

M
M[3,1]

Eg2: extract a row or a column.

M[2,]
M[,3]

1.3 Data Frame

Questions:

What is a data frame?
What the differences between matrix and data frame?

We use function ‘data.frame( )’ to create a data frame. Eg:

A <- 2:7
B <- A^2
C <- c('These','Are','Words','Not','Numbers','Eh')
mydata <- data.frame(First=A, b=B, Char=C)
mydata

##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers
## 6     7 49      Eh

Usually, the datasets we work on are date frames.

The function used to read a .csv file is ‘read.csv( )’.

The function used to read a .txt file is ‘read.table( )’.

If you have a .xls file, save it as .csv file before you read it into RStudio.

eg1: read .txt file

Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)

##   length breadth      species id
## 1   21.7    16.1 meadow.pipit 21
## 2   22.6    17.0 meadow.pipit 22
## 3   20.9    16.2 meadow.pipit 23
## 4   21.6    16.2 meadow.pipit 24
## 5   22.2    16.9 meadow.pipit 25
## 6   22.5    16.9 meadow.pipit 26

eg2: read .csv file

Airplane <- read.csv("airplane.csv",sep=",", header = TRUE)
head(Airplane)

##      paper distance
## 1   light       3.1
## 2   light       3.3
## 3   light       2.1
## 4   light       1.9
## 5 medium        4.0
## 6 medium        3.5

eg3: read data from a URL

Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)

##   Color Spine Width Satellite Weight
## 1     3     3  28.3         8   3050
## 2     4     3  22.5         0   1550
## 3     2     1  26.0         9   2300
## 4     4     3  24.8         0   2100
## 5     4     3  26.0         4   2600
## 6     3     3  23.8         0   2100

1.3.1 Subsets of a Data Frame

Let’s use the Airplane data as an example

eg1: extract one column

Airplane$distance # method 1
mydata[,2] # method 2

eg2: extract multiple columns

Airplane[,c("paper","distance")] # method 1
Airplane[,c(1,2)] # method 2

Question: In both eg1 and eg2, which method is better? Why?

eg3: extract the row where paper is medium

Airplane[paper=="medium",]

Airplane[paper=="medium  ",]

Airplane[Airplane$paper=="medium  ",]

1.4 Factor

Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed.

class(Airplane$paper)

Airplane$paper

as.character(Airplane$paper)

as.numeric(Airplane$paper)

1.5 List

Retrieved from Magic Inkwell

We can imagine that list is like a magic bag, which can contain any type of objects. It’s very handy when we want to commbine data with different lengths and types.

Question: When?

The function we use to create a list is ‘list( )’

eg:

List1 <- list(X = x, Crab.Data = Crab)
List1

1.6 summary

Data type	Dimension	Features
Vector	1
Matrix	2	Only numbers. Rectangle shaped.
Data Frame	2	Can contain both numbers and characters. Rectangle shaped.
List	n	Able to hold any data types. No shape restriction.

2 Basic Graphic Skills

2.1 Scatter Plot, Box Plot, Histogram, and Bar Chart

plot(Weight~Width,data=Crab)

## In plot( ) function
# add header -- main
# change the shape of the point -- pch
# change the x-lab -- xlab
# change the y-lab -- ylab

# add line to an existing graph -- abline()
# abline(-3944,242.6)

The location of the tilde between Weight and Width on the key board is shown below:

boxplot(distance~paper,data=Airplane)

hist(Crab$Width)

# change the main
# change the x-lab

barplot(Airplane$distance)

# fill color of each type of paper
# add legend
# add ylab
# add main

Question: What’s the difference between a histogram and a bar plot?

2.2 Lay Out Multiple Graphics

par(mfrow=c(2,2))
plot(Weight~Width,data=Crab)
boxplot(distance~paper,data=Airplane)
hist(Crab$Width)
barplot(Airplane$distance)

2.3 Adjust the Margins

par(mfrow=c(2,2))
par(mar=c(1,1,1,1))
plot(Weight~Width,data=Crab)
boxplot(distance~paper,data=Airplane)
hist(Crab$Width)
barplot(Airplane$distance)

3 Case Study

The dataset below are from open government portal.

3.1 Distribution of the household population by Children’s body mass index (BMI) - Center for Disease Control (CDC) classification system, by sex and age group

3.2 Milk production and utilization

Milk.Canada <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Milk production, total",]
plot(as.numeric(Milk.Canada$Value),type = "l", main="Canada Milk Production Total",ylab="Value",xlab="Time")

Milk.Sold <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Milk sold off farms, total",]

plot(as.numeric(Milk.Sold$Value), type="l", main="Canada Milk Sold Off Farms Total",ylab = "Value",xlab="Year",axes = FALSE,ylim=c(9000,14000))

axis(side=1, at=seq(1,495,by=12), labels= 1976:2017, las=1)
axis(side=2, at=seq(9000,14000,by=1000), lab=seq(9000,14000,by=1000), las=2)

Milk.Fluid <- Milk[Milk$GEO=="Canada" & Milk$DAI=="Fluid purposes", ]

plot(as.numeric(Milk.Fluid$Value),type = "l", main="Canada Milk Fluid Purposes", ylab="Value", xlab="Time",axes = FALSE,ylim=c(1800,8000))

axis(side=1, at=seq(1,855,by=12), labels= 1946:2017, las=1)
axis(side=2, at=seq(1800,8000,by=1000), lab=seq(1800,8000,by=1000), las=2)

Fluid <- Milk[Milk$DAI=="Fluid purposes",]
boxplot(as.numeric(Value) ~ substr(GEO,1,7),data=Fluid,pch=20)

Fluid.Pro <- Fluid[!(Fluid$GEO=="Canada"),]
boxplot(as.numeric(Value) ~ substr(GEO,1,7),data=Fluid.Pro,pch=20)

Mod0 <- lm(as.numeric(Value) ~ GEO, data=Fluid.Pro)
anova(Mod0)

## Analysis of Variance Table
## 
## Response: as.numeric(Value)
##             Df     Sum Sq    Mean Sq F value    Pr(>F)    
## GEO          9 7.5143e+10 8349269857    1135 < 2.2e-16 ***
## Residuals 4940 3.6341e+10    7356483                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(residuals(Mod0),pch=20,cex=0.4)

table(Fluid.Pro$GEO)

## 
##                   Alberta          British Columbia 
##                       495                       495 
##                    Canada                  Manitoba 
##                         0                       495 
##             New Brunswick Newfoundland and Labrador 
##                       495                       495 
##               Nova Scotia                   Ontario 
##                       495                       495 
##      Prince Edward Island                    Quebec 
##                       495                       495 
##              Saskatchewan 
##                       495

One-way ANOVA assumptions.

3.3 Estimates of population, by age group and sex, Canada, provinces and territories

3.4 Previous Practice

Use R dataset ‘iris’. Load the data using data(iris).

data(iris)

plot and identify two variables that are most linear with each other.
fit a regression model based on the variable you choose and add the fitted line to the scatter plot.
Try: fit regression model for each species.

## plot the data 
plot(iris) # the most linear two variables are Petal.Length & Petal.Width

## regression model on Petal.Length & Petal.Width
Mod <- lm(Petal.Length ~ Petal.Width, data=iris)
summary(Mod)

## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

plot(Petal.Length ~ Petal.Width, data=iris,
     main="Linear model", pch=20,
     xlab="Petal width", ylab="Petal length",
     xlim=c(0,3),ylim=c(0,8))
abline(Mod, col=2)

## regression model on each species 
table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

Mod.Setosa <- lm(Petal.Length ~ Petal.Width, 
                 data = subset(iris,Species == 'setosa'))
Mod.Versicolor <- lm(Petal.Length ~ Petal.Width, 
                 data = subset(iris,Species == 'versicolor'))
Mod.Virginica <- lm(Petal.Length ~ Petal.Width, 
                 data = subset(iris,Species == 'virginica'))

plot(Petal.Length ~ Petal.Width, data=iris,
     main="Linear model", pch=as.numeric(factor(Species)),
     xlab="Petal width", ylab="Petal length",
     xlim=c(0,3),ylim=c(0,8),col=factor(Species))
abline(Mod.Setosa,col=1)
abline(Mod.Versicolor,col=2)
abline(Mod.Virginica,col=3)
legend("topleft",legend = c("Setosa","Versicolor","Virginica"),
       pch=c(1,2,3),col=1:3)

4 What’s more

4.1 Shiny

Shiny example 1

Shiny example 2

Shiny example 3

5 After this workshop

the book that taught me R

Thanks.

An Introduction to R

Fang He

June 1 2017