Let’s open RStudio.
Before we get started, I want to clarify that this session is intended for these of you who have little to no experience with programming languages, and so those of you with experience might find the pace a little slow. Please take time to help your neighbour, and use this opportunity to pay very close attention to details of your own programming.
Click the circled icon on the tool bar as indicated above and choose ‘R Script’.
Why we need add comments to our code?
In R we use “#” to start a comment. In a script file, the color of comment by default is dark green.
# We use '#' to start a comment.
# R is case-sensitive.
## What does case-sensitive mean?
## Any Example?
## What do we need to pay attention to?
There are two types of operators. One is for numeric calculation, let’s name them ‘common operators’; the other is for logical comparison, we call them ‘logical operators’.
| Operator | Meaning | Examples | Math formula |
|---|---|---|---|
| + | plus | 3 + cos(pi/10) | \(3 + cos(\frac{\pi}{10})\) |
| - | minus | log(4) - 5 | \(\ln(4) - 5\) |
| * | times | 3*x | \(3\times x\) |
| / | divide | 5/exp(2)*3 | \(\frac{5}{e^2}\times 3\) |
| ^ | power | 5^2 | \(5^2\) |
The result of a calculation with a common operator is usually a number.
| Operator | Example | Meaning | Result |
|---|---|---|---|
| \(>\) | 3 > 5 | Is 3 greater than 5? | FALSE |
| \(<\) | 1 < 100 | Is 1 less than 100? | TRUE |
| \(>=\) | 7 >= 7 | Is 7 greater or equal to 7? | TRUE |
| \(<=\) | 41 <= 37 | Is 41 less or equal to 37? | FALSE |
| \(==\) | 19 == 17 | Is 19 equal to 17? | FALSE |
| \(!=\) | 19 != 17 | Is 19 not equal to 17? | TRUE |
The result of a logical operation is either TRUE or FALSE.
Usually, we use logical operator in a function’s ‘if( )’ condition.
Identify which are the valid names:
| (1) Mod.1 | (2) xyz | (3) hight/width | (4) humidity% |
| ———————– | ——————— | ———————- | ———————— |
| (5) pressure&reading | (6) my model | (7) .file. | (8) ID# |
| ———————– | ——————— | ———————- | ———————— |
| (9) pressure_reading | (10) my.model | (11) 3rd.file | (12) address@ |
Some names are ‘registered’ for special uses, such as flow control, logical indication, etc. Here are some examples:
function, for, repeat, next, return, if, else, while, break.
Other common special names are summarized in the following table:
| Variable Name | Meaning |
|---|---|
| c (lower case) | create a vector |
| NA | not available |
| NaN | not a number |
| TRUE | true |
| FALSE | false |
| Inf | infinity |
There are several ways to run your code.
Put the cursor at any location of the row of code you want to execute.
click the ‘Run’ button.
Use shortcut
Select all the code you want to run.
click the ‘Run’ button.
Use shortcut
In R we can use either ‘<-’ or ‘=’ to assign a value to a variable. I personally prefer ‘<-’.
## c() is the function to create a vector.
a <- c(1,2,3) # assign vector (1,2,3) to variable a
A <- c(1,2) # assign vector (1,2) to variable A
## To check the value of a variable, call its name as follows:
a
# running the above line of code means we want let Rstudio tell us the value of a.
A
# running the above line of code means we want let Rstudio tell us the value of A.
## let's try
B
## Let's continue
b <- c(2,3,4)
a <- b
A <- c(3,4)
# What is the value of a and A? Why?
# Can we name different objects under the same name?
Match the following formula and the code:
| Math Formula | Computer code |
|---|---|
| \(1.\;7 + 3x-2\) | \(a. 7+3x -2\) |
| \(b. (7+3)x/4 -2\) | |
| \(2.\;(7+3)x-2\) | \(c. (7+3)*x-2\) |
| \(d. (7+3*x)-2\) | |
| \(3.\;\frac{7+3x}{4}-2\) | \(e. (7+3)x -2\) |
| \(f. (7+3*x)/4-2\) | |
| \(4.\;7+\frac{3x}{4}-2\) | \(g. 7 + (3*x/4 -2)\) |
Everytime we calculate the fractions, do pay extra attention.
Eg: We want to calculate
\[ \frac{7+5x}{3-2x} - 2. \] Without brackets, the code \(7+5*x/3-2*x-2\) actually calculates \[ 7 + \frac{5x}{3} -2x -2. \] The correct code should be (\(7+5*x\))/( \(3-2*x\) )\(-2\).
Always remember to use ‘( )’ to group your numerator and denominator.
A function usually has the following format:
\[ \texttt{FunctionName}(\texttt{argument 1}, \texttt{ argument 2}, ...) \] A function usually has three components:
We already used one function ‘c( )’, which means combine values into a vector or list.
Usually it’s hard to memorize the arguments for all functions. But once we know the name of a function, we can always use ‘?FunctionName’ to check the arguments and useage of the function.
For example:
?c
In R the index of a vector starts from 1.
If we define \(\texttt{x <- c(5,6,1,2,3)}\), the index of each element of x is
How to check the argument of a function when we know its name?
Let’s use stem cells as a metaphor to explain the importance of mastering the data types.
We will introduce three data types - vector, matrix and data frame.
Each data type will be introduced through the following steps:
In R, a vector is a collection of numbers or characters.
c(sqrt(3.14), log(1), sin(pi/3)) # a vector with numbers
c("Matrix","Data Frame","Linear Regression") # a vector with characters
\(m\):\(n\) This expression generates a vector from \(m\) to \(n\) and the increment is 1 (when \(n>m\)) or -1 (when \(n<m\)).
seq(from, to, by) This function is a generalization of m:n.
1:6
7:2
x <- seq(1,12,by=3)
In R, getting a subset of a vector, matrix and data frame, one option is ‘[ ]’.
x
x[3] # show me the third element of x
x[c(2,4)] # show me the second and fourth elements of x
The function used to create a matrix in R is matrix( ).
Suppose we would like to create the following matrix M.
\[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]
M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M
## [,1] [,2] [,3]
## [1,] 1 2 4
## [2,] 3 8 5
## [3,] 6 9 0
Question: Given an output only, how can we distinguish whether it is a vector or a matrix?
Similar to vector, we also use ‘[ ]’ to obtain a subset of a matrix. Since a matrix is two dimensional, we need to specify each dimension clearly.
M[1,] # the first index of [,] indicates the row, while the second index of [,] indicates the column
## [1] 1 2 4
M[3,1]
## [1] 6
A data frame can contain both numbers and characters at the same time. It’s the most common data type we work with.
For a data frame, each column represent a variable, and each row stands for an observation.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We use function data.frame( ) to create a data frame. Eg:
A <- 2:6
B <- A^2
C <- c('These','Are','Words','Not','Numbers')
My.DataFrame <- data.frame(First=A, b=B, Char=C)
My.DataFrame
## First b Char
## 1 2 4 These
## 2 3 9 Are
## 3 4 16 Words
## 4 5 25 Not
## 5 6 36 Numbers
In addition to ‘[ ]’, we can also use ‘$’ to retrieve a column of a data frame.
My.DataFrame[,c(1,3)]
## First Char
## 1 2 These
## 2 3 Are
## 3 4 Words
## 4 5 Not
## 5 6 Numbers
My.DataFrame$C
## [1] These Are Words Not Numbers
## Levels: Are Not Numbers These Words
| Data type | Dimension | Features |
|---|---|---|
| Vector | 1 | |
| Matrix | 2 | Only numbers. Rectangle shaped. |
| Data Frame | 2 | Can contain both numbers and characters. Rectangle shaped. |
Instead of creating a data frame from scratch in R, we usually read a file into R.
The function used to read a .csv file is read.csv( ).
The function used to read a .txt file is read.table( ).
If you have a .xls file, save it as .csv file before you read it into RStudio.
Click here to download the data. (right click – open in a new tab)
To read a dataset we need its path. The function to retrieve the path is
eg1: read .txt file
Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)
## length breadth species id
## 1 21.7 16.1 meadow.pipit 21
## 2 22.6 17.0 meadow.pipit 22
## 3 20.9 16.2 meadow.pipit 23
## 4 21.6 16.2 meadow.pipit 24
## 5 22.2 16.9 meadow.pipit 25
## 6 22.5 16.9 meadow.pipit 26
eg2: read .csv file
Sites <- read.csv("SitesLatLong_CrlseOut.csv", header = TRUE)
head(Sites)
## Site Lat Long Elevation Start End Vegetation
## 1 BR-Sa1 -2.8567 -54.9589 88 2002 2011 EBF
## 2 BR-Sa3 -3.0180 -54.9714 100 2000 2004 EBF
## 3 CA-Gro 48.2167 -82.1556 340 2003 2014 MF
## 4 CA-NS1 55.8792 -98.4839 260 2001 2005 ENF
## 5 CA-NS2 55.9058 -98.5247 260 2001 2005 ENF
## 6 CA-NS3 55.9117 -98.3822 260 2001 2005 ENF
eg3: read data from a URL
Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)
## Color Spine Width Satellite Weight
## 1 3 3 28.3 8 3050
## 2 4 3 22.5 0 1550
## 3 2 1 26.0 9 2300
## 4 4 3 24.8 0 2100
## 5 4 3 26.0 4 2600
## 6 3 3 23.8 0 2100
Made a plot in three steps:
1.________________________________________
2.________________________________________
3.________________________________________
Based on the feedback from our previous workshops, compared to the conventional plotting method, ggplot is easier for beginners. (ggplot looks nicer). Today we introduce the steps for producing graphics using the package ‘ggplot2’ (Hadley Wickham, 2016). Here is a comparison:
data(beavers) # we use data() function to use R built-in dataset.
plot(beaver1$temp~beaver1$time)
library(ggplot2)
ggplot(beaver1,aes(x=time,y=temp)) + geom_point()
The ‘gg’ in ggplot means ‘grammar of graphics’. The idea behind the ggplot graph is to add a layer on top of previous layers. We can group all layers into three categories: data, geom, and additional.
Now let’s use an example to illustrate each step. The dataset we are going to work on is called ‘iris’.
data(iris) # we use data() function to use R built-in dataset.
head(iris) # we use head() function to check the first a few rows of the data set.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# From the above information, we see five variables in iris data.
# The first four variables are numeric. They measure the Length and Width of the Sepal and Petal.
# The last variable, 'Species', may be categorical.
# To check the summary information of a categorical variable
# we can use function table().
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
# install.packages("ggplot2") # For any package, we need to install them once on each computer.
library(ggplot2) # Once we install the package, before we use it we need to library it each time.
To display the information of groups of numeric data there are many ways. We can display their mean and standard deviation. Also we can use a boxplot.
Let’s display the petal length of each species.
ggplot(iris,aes(x=Species,y=Petal.Length)) + geom_boxplot()
# to add labels we use function labs(x="", y="")
# to add a title we use function ggtitle("")
# to adjust the title to the center: theme_update(plot.title = element_text(hjust = 0.5))
# to add color for the box plot, histogram, bar chart we use option 'fill'
Let’s try another example:
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(iris,aes(x=Species, y=Sepal.Length, fill=Species)) + geom_boxplot() + labs(y="Sepal Length (cm)") + ggtitle("Boxplot of Sepal Length of each Iris Species")
Exercise: Follow the steps and produce boxplots:
We use scatte plot to display the the relationship of two numeric variables.
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(iris, aes(x=Sepal.Length,y=Sepal.Width, color=Species)) + geom_point() + labs(x="Sepal Length (cm)",y="Sepal Width (cm)") + ggtitle("Scatter Plot of Sepal Width against Sepal Length")
Exercise: Produce a scatter plot of petal length and petal width.
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(iris, aes(x=Sepal.Length,y=Petal.Width, color=Species)) + geom_point() + labs(x="Sepal Length (cm)",y="Petal Width (cm)") + ggtitle("Scatter Plot of Petal Width against Sepal Length")
# geom_smooth(method=) # 'lm' 'loess'
Three basic steps of plotting: 1._______________________________ 2._______________________________ 3._______________________________
ggplot syntax: \(ggplot() + geom\_point()/geom\_boxplot()/geom\_smooth() + labs() + ggtitle()\)
Linear regression may not sound familiar at first, but it actually works `undercover’ of a wide range of analyses, such as ANOVA, ANCOVA, MANOVA, MANCOVA. Moreover, generalized linear regression has even wider applications.
Today we take simple linear regression for an example. The model of simple linear regression is: \[ y = \beta_0 + \beta_1x + \epsilon, \] where \(\epsilon\) follows Normal distribution with mean 0 and constant variance \(\sigma^2\).
Our fundamental mission is to find the estimation of \(\beta_0\) and \(\beta_1\) when a bunch of \(x\) and \(y\) values are given. The meaning of the two coeficients \((\beta_0,\beta_1)\) is explained below:
The red line is called the ‘fitted line’, whose intercept is \(\beta_0\) and slope is \(\beta_1\).
The syntax in R to fit a linear regression model is:
# lm(y ~ x, data)
## the function we use is lm().
## 'lm' stands for linear model. It is not one_m(1m), Im,...
## In the lm() function, we need to specify the data, our independent variable x and the depended variable y.
## The sign between y and x can be found in following picture of a keyboard. It is a tilde (~) not the minus sign (-).
# For any package, we need to install them once on each computer.
# Once we have installed the package, before we use it we need to library it each time.
library(datasauRus)
data("datasaurus_dozen") # we use data() to use the built-in dataset
head(datasaurus_dozen)
## # A tibble: 6 x 3
## dataset x y
## <chr> <dbl> <dbl>
## 1 dino 55.4 97.2
## 2 dino 51.5 96.0
## 3 dino 46.2 94.5
## 4 dino 42.8 91.4
## 5 dino 40.8 88.3
## 6 dino 38.7 84.9
# we use head() to check the first few rows of the dataset
table(datasaurus_dozen$dataset) # we use table() to check the summary information of a categorical variable
##
## away bullseye circle dino dots h_lines
## 142 142 142 142 142 142
## high_lines slant_down slant_up star v_lines wide_lines
## 142 142 142 142 142 142
## x_shape
## 142
Let’s take a blind journey first.
Mod.away <- lm(y~x,data=subset(datasaurus_dozen,dataset=='away'))
Mod.away
Mod.bullseye <- lm(y~x, data=subset(datasaurus_dozen,dataset=='bullseye'))
Mod.bullseye
Mod.star <- lm(y~x, data=subset(datasaurus_dozen,dataset=='star'))
Mod.star
Mod.dino <- lm(y~x, data=subset(datasaurus_dozen,dataset=='dino'))
Mod.dino
Let’s check the four sets of extimated \((\beta_0, \beta_1)\).
The estimations are quite __________.
ggplot(subset(datasaurus_dozen,dataset=='away'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm') + ggtitle("away")
ggplot(subset(datasaurus_dozen,dataset=='bullseye'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm')+ ggtitle("bullseye")
ggplot(subset(datasaurus_dozen,dataset=='star'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm')+ ggtitle("star")
ggplot(subset(datasaurus_dozen,dataset=='dino'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm')+ ggtitle("dino")
Fit a linear regression model of the Cuckoos with vivid graphics.
head(Cuckoos)
## length breadth species id
## 1 21.7 16.1 meadow.pipit 21
## 2 22.6 17.0 meadow.pipit 22
## 3 20.9 16.2 meadow.pipit 23
## 4 21.6 16.2 meadow.pipit 24
## 5 22.2 16.9 meadow.pipit 25
## 6 22.5 16.9 meadow.pipit 26
After this workshop, I recommend some books to systematically learn R: