Let’s open RStudio.

Before we get started, I want to clarify that this session is intended for these of you who have little to no experience with programming languages, and so those of you with experience might find the pace a little slow. Please take time to help your neighbour, and use this opportunity to pay very close attention to details of your own programming.

Basics

Use Script File

Click the circled icon on the tool bar as indicated above and choose ‘R Script’.

Interface

Add Comments

Why we need add comments to our code?
In R we use “#” to start a comment. In a script file, the color of comment by default is dark green.

# We use '#' to start a comment.

# R is case-sensitive.

## What does case-sensitive mean?

## Any Example?

## What do we need to pay attention to?

Operators

There are two types of operators. One is for numeric calculation, let’s name them ‘common operators’; the other is for logical comparison, we call them ‘logical operators’.

Common Operators

Operator	Meaning	Examples	Math formula
+	plus	3 + cos(pi/10)	$3 + cos(\frac{\pi}{10})$
-	minus	log(4) - 5	$\ln(4) - 5$
*	times	3*x	$3\times x$
/	divide	5/exp(2)*3	$\frac{5}{e^2}\times 3$
^	power	5^2	$5^2$

The result of a calculation with a common operator is usually a number.

Logical Operators

Operator	Example	Meaning	Result
$>$	3 > 5	Is 3 greater than 5?	FALSE
$<$	1 < 100	Is 1 less than 100?	TRUE
$>=$	7 >= 7	Is 7 greater or equal to 7?	TRUE
$<=$	41 <= 37	Is 41 less or equal to 37?	FALSE
$==$	19 == 17	Is 19 equal to 17?	FALSE
$!=$	19 != 17	Is 19 not equal to 17?	TRUE

The result of a logical operation is either TRUE or FALSE.

Usually, we use logical operator in a function’s ‘if( )’ condition.

Naming Variable

The Basic Rules of Naming Variable

A valide R variable name can start with a letter or a dot.
Any name can contain letters, numbers, underscores (_) and dots (.).
Names are limited to 10,000 bytes (and limited to 256 bytes in versions of R before 2.13.0).

Exercise of Naming Variable

Identify which are the valid names:

(1) Mod.1	(2) xyz	(3) hight/width	(4) humidity%
———————–	———————	———————-	————————
(5) pressure&reading	(6) my model	(7) .file.	(8) ID#
———————–	———————	———————-	————————
(9) pressure_reading	(10) my.model	(11) 3rd.file	(12) address@

Some Special Names and Don’t Name Your Variable After Them

Some names are ‘registered’ for special uses, such as flow control, logical indication, etc. Here are some examples:

function, for, repeat, next, return, if, else, while, break.

Other common special names are summarized in the following table:

Variable Name	Meaning
c (lower case)	create a vector
NA	not available
NaN	not a number
TRUE	true
FALSE	false
Inf	infinity

Run the Code

There are several ways to run your code.

Line by Line

Put the cursor at any location of the row of code you want to execute.

click the ‘Run’ button.
Use shortcut

Windows: control $+$ enter
Mac: command $+$ enter

Run a Chunk of Code At Once

Select all the code you want to run.

click the ‘Run’ button.
Use shortcut

Windows: control + enter
Mac: command + enter

Assign Value to Varible and Check the Value of Variable

In R we can use either ‘<-’ or ‘=’ to assign a value to a variable. I personally prefer ‘<-’.

## c() is the function to create a vector.

a <- c(1,2,3) # assign vector (1,2,3) to variable a
A <- c(1,2)   # assign vector (1,2) to variable A 

## To check the value of a variable, call its name as follows: 
a 
# running the above line of code means we want let Rstudio tell us the value of a.
A 
# running the above line of code means we want let Rstudio tell us the value of A.

## let's try
B

## Let's continue
b <- c(2,3,4)
a <- b
A <- c(3,4)

# What is the value of a and A? Why?
# Can we name different objects under the same name?

The Power of Parentheses

Match the following formula and the code:

Math Formula	Computer code
$1.\;7 + 3x-2$	$a. 7+3x -2$
	$b. (7+3)x/4 -2$
$2.\;(7+3)x-2$	$c. (7+3)*x-2$
	$d. (7+3*x)-2$
$3.\;\frac{7+3x}{4}-2$	$e. (7+3)x -2$
	$f. (7+3*x)/4-2$
$4.\;7+\frac{3x}{4}-2$	$g. 7 + (3*x/4 -2)$

Everytime we calculate the fractions, do pay extra attention.

Eg: We want to calculate

\[ \frac{7+5x}{3-2x} - 2. \] Without brackets, the code $7+5*x/3-2*x-2$ actually calculates \[ 7 + \frac{5x}{3} -2x -2. \] The correct code should be ($7+5*x$)/( $3-2*x$ )$-2$.

Always remember to use ‘( )’ to group your numerator and denominator.

Function

A function usually has the following format:

\[ \texttt{FunctionName}(\texttt{argument 1}, \texttt{ argument 2}, ...) \] A function usually has three components:

function name
arguments (seperated by ‘,’)
round brackets ( )

We already used one function ‘c( )’, which means combine values into a vector or list.

Usually it’s hard to memorize the arguments for all functions. But once we know the name of a function, we can always use ‘?FunctionName’ to check the arguments and useage of the function.

For example:

?c

Index

In R the index of a vector starts from 1.

If we define $\texttt{x <- c(5,6,1,2,3)}$, the index of each element of x is

Section Summary

Always write code in (console, a script file).
We add comments in R starting with (#, %).
What does case-sensitive mean?
The categories of operators are __________ and ___________.
The rules of naming a variable:

a variable can start with ____________ and _____________.
a variable can contain _____________, _____________, ______________, and _____________.
the length of a variable ___________________________________.

Can we name different object under the same name?
Parentheses:_______________________________________.
The components of a function:

______________.
______________.
______________.

How to check the argument of a function when we know its name?

In R the index of a vector starts from ____________.

Data Types

Let’s use stem cells as a metaphor to explain the importance of mastering the data types.

We will introduce three data types - vector, matrix and data frame.

Each data type will be introduced through the following steps:

the function used to create a certain type of data.
how to retrieve a subset of that type of data.

Vector

In R, a vector is a collection of numbers or characters.

c(sqrt(3.14), log(1), sin(pi/3))             # a vector with numbers

c("Matrix","Data Frame","Linear Regression") # a vector with characters

Generate a Numeric Vector with a Pattern

$m$:$n$ This expression generates a vector from $m$ to $n$ and the increment is 1 (when $n>m$) or -1 (when $n<m$).
seq(from, to, by) This function is a generalization of m:n.

1:6

7:2

x <- seq(1,12,by=3)

Get a Sub-Vector

In R, getting a subset of a vector, matrix and data frame, one option is ‘[ ]’.

x

x[3]      # show me the third element of x

x[c(2,4)] # show me the second and fourth elements of x

Matrix

The function used to create a matrix in R is matrix( ).

Suppose we would like to create the following matrix M.

\[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]

M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M

##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0

Question: Given an output only, how can we distinguish whether it is a vector or a matrix?

Subsets of a Matrix

Similar to vector, we also use ‘[ ]’ to obtain a subset of a matrix. Since a matrix is two dimensional, we need to specify each dimension clearly.

M[1,] # the first index of [,] indicates the row, while the second index of [,] indicates the column

## [1] 1 2 4

M[3,1]

## [1] 6

Data Frame

A data frame can contain both numbers and characters at the same time. It’s the most common data type we work with.

For a data frame, each column represent a variable, and each row stands for an observation.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

We use function data.frame( ) to create a data frame. Eg:

A <- 2:6
B <- A^2
C <- c('These','Are','Words','Not','Numbers')
My.DataFrame <- data.frame(First=A, b=B, Char=C)
My.DataFrame

##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers

Get a Sub-Data-Frame

In addition to ‘[ ]’, we can also use ‘$’ to retrieve a column of a data frame.

My.DataFrame[,c(1,3)]

##   First    Char
## 1     2   These
## 2     3     Are
## 3     4   Words
## 4     5     Not
## 5     6 Numbers

My.DataFrame$C

## [1] These   Are     Words   Not     Numbers
## Levels: Are Not Numbers These Words

Section Summary

Data type	Dimension	Features
Vector	1
Matrix	2	Only numbers. Rectangle shaped.
Data Frame	2	Can contain both numbers and characters. Rectangle shaped.

Read Data

Instead of creating a data frame from scratch in R, we usually read a file into R.

The function used to read a .csv file is read.csv( ).

The function used to read a .txt file is read.table( ).

If you have a .xls file, save it as .csv file before you read it into RStudio.

Click here to download the data. (right click – open in a new tab)

To read a dataset we need its path. The function to retrieve the path is

Mac: file.choose()
Windows: choose.files()

eg1: read .txt file

Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)

##   length breadth      species id
## 1   21.7    16.1 meadow.pipit 21
## 2   22.6    17.0 meadow.pipit 22
## 3   20.9    16.2 meadow.pipit 23
## 4   21.6    16.2 meadow.pipit 24
## 5   22.2    16.9 meadow.pipit 25
## 6   22.5    16.9 meadow.pipit 26

eg2: read .csv file

Sites <- read.csv("SitesLatLong_CrlseOut.csv", header = TRUE)
head(Sites)

##     Site     Lat     Long Elevation Start  End Vegetation
## 1 BR-Sa1 -2.8567 -54.9589        88  2002 2011        EBF
## 2 BR-Sa3 -3.0180 -54.9714       100  2000 2004        EBF
## 3 CA-Gro 48.2167 -82.1556       340  2003 2014         MF
## 4 CA-NS1 55.8792 -98.4839       260  2001 2005        ENF
## 5 CA-NS2 55.9058 -98.5247       260  2001 2005        ENF
## 6 CA-NS3 55.9117 -98.3822       260  2001 2005        ENF

eg3: read data from a URL

Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)

##   Color Spine Width Satellite Weight
## 1     3     3  28.3         8   3050
## 2     4     3  22.5         0   1550
## 3     2     1  26.0         9   2300
## 4     4     3  24.8         0   2100
## 5     4     3  26.0         4   2600
## 6     3     3  23.8         0   2100

Plot

Go Behind the Scenes

Made a plot in three steps:

1.________________________________________

2.________________________________________

3.________________________________________

Based on the feedback from our previous workshops, compared to the conventional plotting method, ggplot is easier for beginners. (ggplot looks nicer). Today we introduce the steps for producing graphics using the package ‘ggplot2’ (Hadley Wickham, 2016). Here is a comparison:

data(beavers) # we use data() function to use R built-in dataset.
plot(beaver1$temp~beaver1$time)

library(ggplot2)
ggplot(beaver1,aes(x=time,y=temp)) + geom_point()

ggplot

The ‘gg’ in ggplot means ‘grammar of graphics’. The idea behind the ggplot graph is to add a layer on top of previous layers. We can group all layers into three categories: data, geom, and additional.

Now let’s use an example to illustrate each step. The dataset we are going to work on is called ‘iris’.

data(iris) #  we use data() function to use R built-in dataset.
head(iris) #  we use head() function to check the first a few rows of the data set.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# From the above information, we see five variables in iris data. 
# The first four variables are numeric. They measure the Length and Width of the Sepal and Petal.
# The last variable, 'Species', may be categorical.

# To check the summary information of a categorical variable
# we can use function table().
table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

# install.packages("ggplot2") # For any package, we need to install them once on each computer.
library(ggplot2) # Once we install the package, before we use it we need to library it each time.

Boxplot

To display the information of groups of numeric data there are many ways. We can display their mean and standard deviation. Also we can use a boxplot.

Let’s display the petal length of each species.

ggplot(iris,aes(x=Species,y=Petal.Length)) + geom_boxplot()

# to add labels we use function labs(x="", y="")
# to add a title we use function ggtitle("")
# to adjust the title to the center: theme_update(plot.title = element_text(hjust = 0.5))
# to add color for the box plot, histogram, bar chart we use option 'fill'

Let’s try another example:

theme_update(plot.title = element_text(hjust = 0.5))
ggplot(iris,aes(x=Species, y=Sepal.Length, fill=Species)) + geom_boxplot() + labs(y="Sepal Length (cm)") + ggtitle("Boxplot of Sepal Length of each Iris Species")

Exercise: Follow the steps and produce boxplots:

Petal.Wodth
Sepal.Width

Scatter Plot

We use scatte plot to display the the relationship of two numeric variables.

theme_update(plot.title = element_text(hjust = 0.5))
ggplot(iris, aes(x=Sepal.Length,y=Sepal.Width, color=Species)) + geom_point() + labs(x="Sepal Length (cm)",y="Sepal Width (cm)") + ggtitle("Scatter Plot of Sepal Width against Sepal Length")

Exercise: Produce a scatter plot of petal length and petal width.

Add Smooth

theme_update(plot.title = element_text(hjust = 0.5))
ggplot(iris, aes(x=Sepal.Length,y=Petal.Width, color=Species)) + geom_point() + labs(x="Sepal Length (cm)",y="Petal Width (cm)") + ggtitle("Scatter Plot of Petal Width against Sepal Length")

# geom_smooth(method=) # 'lm' 'loess'

Section Summary

Three basic steps of plotting: 1._______________________________ 2._______________________________ 3._______________________________
ggplot syntax: $ggplot() + geom\_point()/geom\_boxplot()/geom\_smooth() + labs() + ggtitle()$

Linear Regression

Linear regression may not sound familiar at first, but it actually works `undercover’ of a wide range of analyses, such as ANOVA, ANCOVA, MANOVA, MANCOVA. Moreover, generalized linear regression has even wider applications.

Today we take simple linear regression for an example. The model of simple linear regression is: \[ y = \beta_0 + \beta_1x + \epsilon, \] where $\epsilon$ follows Normal distribution with mean 0 and constant variance $\sigma^2$.

Our fundamental mission is to find the estimation of $\beta_0$ and $\beta_1$ when a bunch of $x$ and $y$ values are given. The meaning of the two coeficients $(\beta_0,\beta_1)$ is explained below:

The red line is called the ‘fitted line’, whose intercept is $\beta_0$ and slope is $\beta_1$.

The syntax in R to fit a linear regression model is:

# lm(y ~ x, data)

## the function we use is lm().

## 'lm' stands for linear model. It is not one_m(1m), Im,...

## In the lm() function, we need to specify the data, our independent variable x and the depended variable y. 

## The sign between y and x can be found in following picture of a keyboard. It is a tilde (~) not the minus sign (-).

Linear Regression – Dinosaur

# For any package, we need to install them once on each computer.
# Once we have installed the package, before we use it we need to library it each time.
library(datasauRus)
data("datasaurus_dozen") # we use data() to use the built-in dataset
head(datasaurus_dozen)

## # A tibble: 6 x 3
##   dataset     x     y
##   <chr>   <dbl> <dbl>
## 1 dino     55.4  97.2
## 2 dino     51.5  96.0
## 3 dino     46.2  94.5
## 4 dino     42.8  91.4
## 5 dino     40.8  88.3
## 6 dino     38.7  84.9

# we use head() to check the first few rows of the dataset
table(datasaurus_dozen$dataset) # we use table() to check the summary information of a categorical variable

## 
##       away   bullseye     circle       dino       dots    h_lines 
##        142        142        142        142        142        142 
## high_lines slant_down   slant_up       star    v_lines wide_lines 
##        142        142        142        142        142        142 
##    x_shape 
##        142

Let’s take a blind journey first.

Mod.away <- lm(y~x,data=subset(datasaurus_dozen,dataset=='away'))
Mod.away

Mod.bullseye <- lm(y~x, data=subset(datasaurus_dozen,dataset=='bullseye'))
Mod.bullseye

Mod.star <- lm(y~x, data=subset(datasaurus_dozen,dataset=='star'))
Mod.star

Mod.dino <- lm(y~x, data=subset(datasaurus_dozen,dataset=='dino'))
Mod.dino

Let’s check the four sets of extimated $(\beta_0, \beta_1)$.

The estimations are quite __________.

ggplot(subset(datasaurus_dozen,dataset=='away'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm') + ggtitle("away")

ggplot(subset(datasaurus_dozen,dataset=='bullseye'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm')+ ggtitle("bullseye")

ggplot(subset(datasaurus_dozen,dataset=='star'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm')+ ggtitle("star")

ggplot(subset(datasaurus_dozen,dataset=='dino'),aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm')+ ggtitle("dino")

Practice

Fit a linear regression model of the Cuckoos with vivid graphics.

head(Cuckoos)

##   length breadth      species id
## 1   21.7    16.1 meadow.pipit 21
## 2   22.6    17.0 meadow.pipit 22
## 3   20.9    16.2 meadow.pipit 23
## 4   21.6    16.2 meadow.pipit 24
## 5   22.2    16.9 meadow.pipit 25
## 6   22.5    16.9 meadow.pipit 26

What’s More

After this workshop, I recommend some books to systematically learn R:

Math Formula	Computer code
\(1.\;7 + 3x-2\)	\(a. 7+3x -2\)
	\(b. (7+3)x/4 -2\)
\(2.\;(7+3)x-2\)	\(c. (7+3)*x-2\)
	\(d. (7+3*x)-2\)
\(3.\;\frac{7+3x}{4}-2\)	\(e. (7+3)x -2\)
	\(f. (7+3*x)/4-2\)
\(4.\;7+\frac{3x}{4}-2\)	\(g. 7 + (3*x/4 -2)\)

R Introductory Workshop

Fang He

2018-05-11

Basics

Use Script File

Interface

Add Comments

Operators

Common Operators

Logical Operators

Naming Variable

The Basic Rules of Naming Variable

Exercise of Naming Variable

Some Special Names and Don’t Name Your Variable After Them

Run the Code

Assign Value to Varible and Check the Value of Variable

The Power of Parentheses

Function

Index

Section Summary

Data Types

Vector

Generate a Numeric Vector with a Pattern

Get a Sub-Vector

Matrix

Subsets of a Matrix

Data Frame

Get a Sub-Data-Frame

Section Summary

Read Data

Plot

ggplot

Boxplot

Scatter Plot

Add Smooth

Section Summary

Linear Regression

Linear Regression – Dinosaur

Practice

What’s More

Operator	Meaning	Examples	Math formula
+	plus	3 + cos(pi/10)	\(3 + cos(\frac{\pi}{10})\)
-	minus	log(4) - 5	\(\ln(4) - 5\)
*	times	3*x	\(3\times x\)
/	divide	5/exp(2)*3	\(\frac{5}{e^2}\times 3\)
^	power	5^2	\(5^2\)

Operator	Example	Meaning	Result
\(>\)	3 > 5	Is 3 greater than 5?	FALSE
\(<\)	1 < 100	Is 1 less than 100?	TRUE
\(>=\)	7 >= 7	Is 7 greater or equal to 7?	TRUE
\(<=\)	41 <= 37	Is 41 less or equal to 37?	FALSE
\(==\)	19 == 17	Is 19 equal to 17?	FALSE
\(!=\)	19 != 17	Is 19 not equal to 17?	TRUE