1 Background

1.1 Why we are here

We are here to have fun and, along the way, learn some R.
We are here to type something. What we type is sometimes not what we thought we typed.
We are here to make some mistakes and correct them.

1.2 Objectives

know the commonly used data types and their differences
be able to read data into RStudio
be able to produce simple graphics
be able to conduct linear regression
have an idea of how to write a simple function

1.3 Before Asking for Help

Did you try your idea on the computer?

‘If I change part xxx of the code, what will happen?’ – If you have time to wait for an answer, why not try it and let the computer answer?

Did you ask Google for help?

Now let’s open RStudio.

1.4 Tips/Attention

R is case sensitive. (eg: TRUE\(\ne\)true)
R runs the code in the script file line by line.
Always add comments to your code.
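
The first tip is easy to check for ourselves. The lines below are only a sketch: ‘tryCatch()’ is used so that the error caused by the undefined name ‘true’ is captured as a message instead of stopping the script, and ‘msg’ is a name made up for this illustration.

```r
# TRUE (all capitals) is a built-in logical constant
is.logical(TRUE)

# 'true' is just an undefined name, so evaluating it raises an error;
# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(true, error = function(e) conditionMessage(e))
msg
```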

Let’s try something.

2*2

## [1] 4

sqrt(100)

## [1] 10

55/sin(2*pi/9)

## [1] 85.56481

exp(1)

## [1] 2.718282

The form xx() denotes a function call, where xx is the function name.
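
A few more calls may help to show the pattern of a function name followed by its arguments in parentheses. This is a small sketch using only base R functions; the numbers are arbitrary.

```r
# a function call is the function name followed by its arguments in ()
sqrt(16)                    # one argument

# arguments can be passed by name; 'base' is an argument of log()
log(100, base = 10)

# calls can be nested: the inner call is evaluated first
round(exp(1), digits = 2)

# ?log opens the help page of a function (uncomment to try it)
# ?log
```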

Practice. Calculate the following expression:

\[ \frac{25\log(\tan(\frac{\pi}{4})) + 2\sqrt{169}}{\exp(\sin(\pi))} \]

Pay attention to the ‘()’.

2 Data Types

Q: Why is knowing data types important?

2.1 Vector

In R, a vector is a collection of numbers or characters.

2.1.1 create an arbitrary vector

One way to create a vector is to use the function ‘c()’.

Eg1: a vector containing numbers

c(1, 2, 3, 4, 5, 6, 7, 8)

## [1] 1 2 3 4 5 6 7 8

c(5*6, 72/8)

## [1] 30  9

c(sqrt(3.14), log(1), exp(1), tan(pi/4), sin(pi/3))

## [1] 1.7720045 0.0000000 2.7182818 1.0000000 0.8660254

Eg2: a vector containing characters

c("sleep", "reading week", "trips")

## [1] "sleep"        "reading week" "trips"

c("hope=","hold","on","pain","ends")

## [1] "hope=" "hold"  "on"    "pain"  "ends"

Q: If a vector contains both numbers and characters, is there anything we need to pay attention to? Run the following code and tell me what you notice.

c("R Intro Workshop", 2017)
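
One way to investigate is to ask R for the type of the result with ‘class()’. The sketch below stores the vector under the made-up name ‘mixed’; it is only one possible way to look at the question.

```r
mixed <- c("R Intro Workshop", 2017)

class(mixed)               # the whole vector has a single type
mixed[2]                   # 2017 is now stored as text

# as.numeric() turns text that looks like a number back into a number
as.numeric(mixed[2]) + 1
```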

Q: Next, we want to use some of the results in another calculation. For example, we want to calculate

\[ 3^1, 3^2, 3^3, 3^4, 3^5, 3^6, 3^7, 3^8 \]

What would you do?

2.1.2 numeric vector with pattern

x <- 1:5
# vector x: start at 1, end at 5, increment is 1
y <- seq(13,10,by= -0.5)
# vector y: start at 13, end at 10, increment is -0.5

Q: Why can’t we see what \(x\) and \(y\) look like?

In order to check what \(x\) and \(y\) are, what can we do?
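
A few possibilities are sketched below. An assignment with ‘<-’ stores a value without printing it, so we have to ask for the value explicitly. The name ‘powers’ is made up for this illustration.

```r
x <- 1:5
y <- seq(13, 10, by = -0.5)

x            # typing the name prints the stored value
print(y)     # print() does the same explicitly
length(y)    # number of elements in y

# wrapping an assignment in () assigns and prints in one step
(powers <- 3^(1:8))
```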

2.1.3 operations using vectors

x*2

x-y

mean(y)

sum(x)
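
Arithmetic on vectors works element by element. When the two vectors differ in length, R reuses (recycles) the shorter one, and it gives a warning if the longer length is not a multiple of the shorter one, as happens with a vector of 5 elements and a vector of 7 elements. The sketch below uses two made-up vectors ‘a’ and ‘b’ whose lengths fit, so no warning appears.

```r
x <- 1:5
x * 2                 # each element is multiplied by 2

a <- c(1, 2, 3, 4)    # made-up example vectors
b <- c(10, 20)
a + b                 # b is recycled: 1+10, 2+20, 3+10, 4+20

mean(a)               # functions such as mean() and sum() return one value
sum(a)
```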

2.1.4 subsets of a vector

In R, we can use ‘[]’ to get a subset of a vector, matrix, or data frame.

y

y[7] 

y[8] 

y[8] <- 14 

y
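
The index inside ‘[]’ does not have to be a single position. The sketch below starts again from the original ‘y’ and shows a few other forms of index; it is an illustration, not a complete list.

```r
y <- seq(13, 10, by = -0.5)
length(y)       # y has 7 elements
y[8]            # position 8 does not exist yet, so the result is NA

y[c(1, 3)]      # several positions at once
y[-1]           # a negative index drops that position
y[y > 12]       # a logical condition keeps the elements where it is TRUE
```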

Practice. Suppose \(z=1,2,3,\dots,100\). Get all the odd numbers from \(z\).

2.2 Matrix

Retrieved from Math Plane

Q: What is a matrix?

What are matrices used for?

The function used to create a matrix is ‘matrix()’.

Suppose we would like to create the following matrix M. \[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]

M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M

##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0
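
Note that ‘matrix()’ fills the matrix column by column, which is why the numbers above are listed as 1, 3, 6, 2, 8, 9, 4, 5, 0. As a sketch of an alternative, the argument ‘byrow=TRUE’ lets us type the numbers row by row instead. The name ‘M_byrow’ is made up for this illustration.

```r
M <- matrix(c(1,3,6,2,8,9,4,5,0), ncol=3)            # filled by column
M_byrow <- matrix(c(1,2,4,
                    3,8,5,
                    6,9,0), ncol=3, byrow=TRUE)      # filled by row

identical(M, M_byrow)   # both calls give the same matrix
dim(M)                  # number of rows and columns
is.matrix(M)            # TRUE for a matrix
is.matrix(1:10)         # FALSE for a plain vector
```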

Q: Given only the output, how can we tell whether it is a vector or a matrix?

##  [1]  1  2  3  4  5  6  7  8  9 10

2.2.1 operations associated with a matrix

Usually, we apply the same operation, like mean or sum, to all the rows or all the columns of a matrix.

The related function is called ‘apply()’. Its second argument chooses the direction: 1 for rows and 2 for columns.

Next, we will calculate the row mean and column sum for our matrix M.

apply(M, 1, mean) ## calculate row mean

## [1] 2.333333 5.333333 5.000000

apply(M, 2, sum) ## calculate column sum

## [1] 10 19  9
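
For means and sums in particular, base R also has the shortcut functions ‘rowMeans()’ and ‘colSums()’. The sketch below checks that they agree with the ‘apply()’ calls; ‘apply()’ is still the general tool when we want some other function.

```r
M <- matrix(c(1,3,6,2,8,9,4,5,0), ncol=3)

apply(M, 1, mean)   # 1 = rows
rowMeans(M)         # same result with a dedicated function

apply(M, 2, sum)    # 2 = columns
colSums(M)          # same result with a dedicated function

apply(M, 2, max)    # any function of a vector can be used, e.g. max
```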

2.2.2 subsets of a matrix

Similar to a vector, we also use ‘[]’ to obtain a subset of a matrix. Since a matrix is two-dimensional, we need to specify each dimension clearly: the row index goes before the comma and the column index after it.

Eg1: extract an element from a matrix.

##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0

M[3,1]

## [1] 6

Eg2: extract a row or a column.

M[2,]

## [1] 3 8 5

M[,3]

## [1] 4 5 0

Q: How do we extract the first column of matrix M in matrix form?

Ans: Use ‘drop=FALSE’ in ‘[]’ when extracting a single column or row to keep the result two-dimensional.

M[,1]

## [1] 1 3 6

M[,1, drop=FALSE]

##      [,1]
## [1,]    1
## [2,]    3
## [3,]    6

M[2,]

## [1] 3 8 5

M[2, ,drop=FALSE]

##      [,1] [,2] [,3]
## [1,]    3    8    5
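
The difference can also be checked with ‘dim()’, which returns NULL for a plain vector. A small sketch:

```r
M <- matrix(c(1,3,6,2,8,9,4,5,0), ncol=3)

dim(M[,1])               # NULL: the result is a plain vector
dim(M[,1, drop=FALSE])   # 3 rows, 1 column: still a matrix
dim(M[2, ,drop=FALSE])   # 1 row, 3 columns: still a matrix
```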

2.3 Data Frame

Q: What is a data frame?

What are the differences from a matrix?

We use the function ‘data.frame()’ to create a data frame.

A <- 2:7
B <- A^2
C <- c('These','Are','Words','Not','Numbers','Eh')
mydata <- data.frame(First=A, b=B, Char=C)
mydata

##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers
## 6     7 49      Eh
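
To look at the structure of a data frame, functions such as ‘str()’, ‘dim()’ and ‘names()’ are handy. The sketch below rebuilds ‘mydata’ so that it runs on its own. Note that the type reported for the ‘Char’ column depends on the R version: character from R 4.0.0 onwards, factor in older versions.

```r
A <- 2:7
B <- A^2
C <- c('These','Are','Words','Not','Numbers','Eh')
mydata <- data.frame(First=A, b=B, Char=C)

str(mydata)             # compact overview: one line per column
dim(mydata)             # number of rows and columns
names(mydata)           # column names
sapply(mydata, class)   # each column has its own type
```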

2.3.1 subsets of a data frame

eg1: extract one column

mydata$First # method 1

## [1] 2 3 4 5 6 7

mydata[,1] # method 2

## [1] 2 3 4 5 6 7

eg2: extract multiple columns

mydata[,c("First","Char")] # method 1

##   First    Char
## 1     2   These
## 2     3     Are
## 3     4   Words
## 4     5     Not
## 5     6 Numbers
## 6     7      Eh

mydata[,c(1,3)] # method 2

##   First    Char
## 1     2   These
## 2     3     Are
## 3     4   Words
## 4     5     Not
## 5     6 Numbers
## 6     7      Eh

Q: In both eg1 and eg2, which method is better? Why?

eg3: extract the row in which Char is “Not”.

mydata[mydata$Char=="Not",] # don't forget the comma ~

##   First  b Char
## 4     5 25  Not
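
It may help to look at the condition on its own. The comparison returns one TRUE or FALSE per row, and ‘[]’ keeps the rows marked TRUE. The sketch below rebuilds ‘mydata’ so that it runs on its own; ‘keep’ is a made-up name, and ‘subset()’ is shown as one alternative way to write the same selection.

```r
mydata <- data.frame(First=2:7, b=(2:7)^2,
                     Char=c('These','Are','Words','Not','Numbers','Eh'))

keep <- mydata$Char == "Not"   # one TRUE/FALSE per row
keep
mydata[keep, ]                 # rows where keep is TRUE, all columns

# subset() looks up Char inside mydata, so no mydata$ is needed
subset(mydata, Char == "Not")
```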

Q: How about ‘mydata[Char=="Not",]’? If it doesn’t work, what are the possible reasons?

2.4 List

Retrieved from Magic Inkwell

We can imagine that a list is a magic bag. It’s very handy when we want to combine various types of data with different lengths.

Q: When?

The function we use to create a list is ‘list()’.

eg:

List1 <- list(Y = y, MyData = mydata)
List1

## $Y
##  [1] 0.031022328 1.926457236 0.002292707 0.439684028 0.178286256
##  [6] 2.294487004 0.320546231 1.817832689 0.032440342 0.500020061
## 
## $MyData
##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers
## 6     7 49      Eh

List2 <- list(Mat = M, MeanFun = mean)
List2

## $Mat
##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0
## 
## $MeanFun
## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x7fbac4a82060>
## <environment: namespace:base>
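
To get things back out of the bag, we can use ‘$’ or ‘[[]]’ with the element name. The sketch below rebuilds ‘List2’ so that it runs on its own.

```r
M <- matrix(c(1,3,6,2,8,9,4,5,0), ncol=3)
List2 <- list(Mat = M, MeanFun = mean)

names(List2)                      # what is in the list
List2$Mat[3, 1]                   # take out the matrix, then subset it
List2[["MeanFun"]](c(1, 2, 3))    # take out the function, then call it

class(List2["Mat"])               # single [] returns a smaller list
class(List2[["Mat"]])             # double [[]] returns the element itself
```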

2.5 summary

Data type	Dimension	Features
Vector	1
Matrix	2	All elements share one type (usually numbers). Rectangle shaped.
Data Frame	2	Can contain both numbers and characters. Rectangle shaped.
List	n	Able to hold any data types. No shape restriction.

3 Read data

eg1: read .txt file

Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)

##   length breadth      species id
## 1   21.7    16.1 meadow.pipit 21
## 2   22.6    17.0 meadow.pipit 22
## 3   20.9    16.2 meadow.pipit 23
## 4   21.6    16.2 meadow.pipit 24
## 5   22.2    16.9 meadow.pipit 25
## 6   22.5    16.9 meadow.pipit 26

eg2: read .csv file

Airplane <- read.csv("airplane.csv", header = TRUE)
head(Airplane)

##   paper..distance
## 1      light  3.1
## 2      light  3.3
## 3      light  2.1
## 4      light  1.9
## 5        medium 4
## 6      medium 3.5
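
In the output above, everything ended up in a single column. One possible reason is that the separator in the file does not match the one the reading function expects: ‘read.csv()’ expects commas, while ‘read.table()’ splits on white space by default. The sketch below does not use the workshop file; it writes a small made-up file to a temporary location (the contents and the names ‘tmp’, ‘wrong’ and ‘right’ are assumptions for illustration) and reads it both ways.

```r
# write a small white-space separated file (made-up contents)
tmp <- tempfile(fileext = ".txt")
writeLines(c("paper distance",
             "light 3.1",
             "medium 4"), tmp)

wrong <- read.csv(tmp, header = TRUE)     # expects commas: one column
right <- read.table(tmp, header = TRUE)   # splits on white space: two columns

ncol(wrong)
ncol(right)
str(right)    # always check what was actually read
```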

eg3: read data from a URL

Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)

##   Color Spine Width Satellite Weight
## 1     3     3  28.3         8   3050
## 2     4     3  22.5         0   1550
## 3     2     1  26.0         9   2300
## 4     4     3  24.8         0   2100
## 5     4     3  26.0         4   2600
## 6     3     3  23.8         0   2100

4 Simple Plot

eg1: scatter plot

x <- seq(0,6*pi,len=100)
y <- sin(x)*2 + x*0.25 + rnorm(100,sd=0.3)
plot(x,y)
y.true <- sin(x)*2 + x*0.25
lines(x, y.true, col=2)
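
A plot is usually easier to read with axis labels, a title and a legend. The sketch below repeats the scatter plot with these additions; the label texts are made up, and ‘set.seed()’ is added only so that the random noise is the same on every run.

```r
set.seed(1)                                  # reproducible random noise
x <- seq(0, 6*pi, length.out = 100)
y <- sin(x)*2 + x*0.25 + rnorm(100, sd = 0.3)
y.true <- sin(x)*2 + x*0.25

plot(x, y, xlab = "x", ylab = "y", main = "Noisy data and true curve")
lines(x, y.true, col = 2)                    # col = 2 is red
legend("topleft", legend = c("data", "true curve"),
       pch = c(1, NA), lty = c(NA, 1), col = c(1, 2))
```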

eg2: histogram

hist(Crab$Width)

eg3: boxplot

boxplot(length~species, data=Cuckoos)

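The tilde (~) between length and species separates the variable to plot from the grouping variable. On a standard US keyboard it is typed with Shift and the key to the left of the 1 key.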

5 Linear Regression Example

Retrieved from Dataaspirant

The dataset we are going to use is called ‘mtcars’, which is one of the built-in datasets in R.

The following table shows the first 5 rows of the dataset.

First five rows of dataset mtcars
	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2

Use ?mtcars to get more information.

data(mtcars)
attach(mtcars)

qqnorm(mpg)
qqline(mpg, col=2, lty="dashed")

plot(mpg~wt, pch=16)

5.1 Fit linear model for all the data.

‘lm’ stands for linear model; its first character is the lowercase letter L, NOT the digit one.

Mod0 <- lm(mpg ~ wt)
summary(Mod0)

## 
## Call:
## lm(formula = mpg ~ wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

plot(mpg~wt, pch=16)
abline(Mod0, col="red")
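
Once a model is fitted, its pieces can be pulled out and used. The sketch below refits the same model with the ‘data=’ argument, so it runs without ‘attach()’, then extracts the coefficients and predicts mpg for a car with wt = 3. The value 3 and the name ‘pred’ are chosen only for illustration.

```r
Mod0 <- lm(mpg ~ wt, data = mtcars)

coef(Mod0)       # intercept and slope, as in the Coefficients table
confint(Mod0)    # 95% confidence intervals for the coefficients

# predicted mpg for a (made-up) car with wt = 3
pred <- predict(Mod0, newdata = data.frame(wt = 3))
pred
```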

5.2 Fit linear model for each transmission type.

mean(mpg[am==0]) # am=0: automatic

## [1] 17.14737

mean(mpg[am==1]) # am=1: manual

## [1] 24.39231

Q: Are these two means statistically significantly different?

t.test(mpg[am==0],mpg[am==1])

## 
##  Welch Two Sample t-test
## 
## data:  mpg[am == 0] and mpg[am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231
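
The same test can also be written with a formula, in the same style as ‘lm()’ and ‘boxplot()’. The sketch below uses the ‘data=’ argument, so it runs without ‘attach()’, and stores the result under the made-up name ‘res’ so that single numbers can be pulled out.

```r
# mpg split by the two groups of am (0 = automatic, 1 = manual)
res <- t.test(mpg ~ am, data = mtcars)
res

res$p.value     # the p-value on its own
res$conf.int    # the 95 percent confidence interval
```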

Mod.auto <- lm(mpg~wt, data=subset(mtcars,am==0))
Mod.manu <- lm(mpg~wt, data=subset(mtcars,am==1))

plot(mpg~wt, col=am+1, pch=am+12)
abline(Mod.auto)
abline(Mod.manu, col=2)

5.3 Summary of model fitting steps

Know the purpose of the study.
Check the dataset and plot the data.

?name_of_dataset
head()
plot()

Choose proper model/tool to conduct the analysis.

lm(), t.test() …
summary()

Model checking. (This part won’t be covered in this workshop.)

5.4 Practice

Use the R dataset ‘iris’.

plot the data and identify the two variables that have the most linear relationship with each other.
fit a regression model based on the variables you chose and add the fitted line to the scatter plot.
Try: fit a regression model for each species.

6 Write a simple function

We want to have a function with the following properties:

take two data vectors as input
conduct linear regression
produce the model summary information
produce the plot with the fitted line

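LM <- function(x,y){
  Mod <- lm(y~x)              # regress y on x
  print(summary(Mod))         # inside a function, results must be printed explicitly
  plot(x,y, pch=16)           # scatter plot of the data
  abline(Mod, col=2)          # add the fitted line in red
}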

LM(wt, mpg)

LM(Cuckoos$length, Cuckoos$breadth)
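
If we also want to keep working with the fitted model afterwards, the function can return it. The sketch below is one possible variation under the made-up name ‘LM2’: it prints the summary explicitly, draws the plot, and returns the model with ‘invisible()’ so that it is not printed a second time.

```r
LM2 <- function(x, y){
  Mod <- lm(y ~ x)          # regress y on x
  print(summary(Mod))       # show the model summary
  plot(x, y, pch=16)        # scatter plot of the data
  abline(Mod, col=2)        # add the fitted line in red
  invisible(Mod)            # return the model without printing it again
}

fit <- LM2(mtcars$wt, mtcars$mpg)
coef(fit)                   # the returned model can be used further
```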

7 What’s more

7.1 Shiny

Shiny example 1

Shiny example 2

7.2 more about plot
