1 Background


1.1 Why we are here

  • We are here to have fun and, along the way, learn some R.

  • We are here to type something. What we typed is sometimes not what we thought we typed.

  • We are here to make some mistakes and correct them.


1.2 Objectives

  • know the commonly used data types and their differences
  • be able to read data into RStudio
  • be able to produce simple graphics
  • be able to conduct linear regression
  • have an idea of how to write a simple function


1.3 Before Asking for Help

  • Did you try your idea on the computer?

‘If I change xxx part of the code, what will happen?’ If you have time to wait for an answer, why not try it and let the computer answer?

  • Did you ask Google for help?


Now let’s open RStudio.




1.4 Tips/Attention

  • R is case sensitive (e.g. TRUE \(\ne\) true).
  • R runs the code in the script file line by line.
  • Always add comments to your code.
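
A small sketch of the first tip, under the assumption of a fresh R session: TRUE is a built-in constant, while the lowercase name `true` is simply not defined. The call to exists() is used here only so the example runs without stopping on an error.

```r
# TRUE is a built-in logical constant
is.logical(TRUE)     # TRUE

# 'true' (lowercase) is not defined by default; typing it at the
# console would give the error "object 'true' not found"
exists("true")       # FALSE in a fresh session
```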

Let’s try something.

2*2
## [1] 4
sqrt(100)
## [1] 10
55/sin(2*pi/9)
## [1] 85.56481
exp(1)
## [1] 2.718282

A name followed by parentheses, xx(), is a function call; xx is the function name.


Practice. Calculate the following expression:

\[ \frac{25\log(\tan(\frac{\pi}{4})) + 2\sqrt{169}}{\exp(\sin(\pi))} \]

Pay attention to the ‘()’.


2 Data Types

Q: Why is knowing data types important?



2.1 Vector

In R, a vector is a collection of numbers or characters.


2.1.1 create random vector

One way to create a vector is to use the function ‘c()’.


Eg1: a vector containing numbers

c(1, 2, 3, 4, 5, 6, 7, 8)
## [1] 1 2 3 4 5 6 7 8
c(5*6, 72/8)
## [1] 30  9
c(sqrt(3.14), log(1), exp(1), tan(pi/4), sin(pi/3))
## [1] 1.7720045 0.0000000 2.7182818 1.0000000 0.8660254


Eg2: a vector containing characters

c("sleep", "reading week", "trips")
## [1] "sleep"        "reading week" "trips"
c("hope=","hold","on","pain","ends")
## [1] "hope=" "hold"  "on"    "pain"  "ends"


Q: If a vector contains both numbers and characters, is there anything we need to pay attention to? Run the following code and tell me what you notice.

c("R Intro Workshop", 2017)
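
One way to investigate is to ask R for the class of the result. This is a minimal sketch; the variable name `v` is an assumption and is not part of the workshop code.

```r
v <- c("R Intro Workshop", 2017)
v           # both elements are printed in quotes
class(v)    # "character": the number 2017 was converted to text

# as.numeric() converts text back to a number where that is possible
as.numeric(v[2]) + 1   # 2018
```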


Q: Suppose we want to use some of these results in our next calculation. For example, we want to calculate

\[ 3^1, 3^2, 3^3, 3^4, 3^5, 3^6, 3^7, 3^8 \]

What would you do?
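
One possible approach, sketched with the hypothetical name `powers`: store the exponents in a variable and let R apply the operation to every element at once.

```r
powers <- 1:8     # exponents 1, 2, ..., 8 ('powers' is a made-up name)
res <- 3^powers   # the power is computed element by element
res               # 3 9 27 81 243 729 2187 6561
```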


2.1.2 numeric vector with pattern

x <- 1:5 
# vector x: start at 1, end at 5, increment is 1 
y <- seq(13,10,by= -0.5)
# vector y: start at 13, end at 10, increment is -0.5 


Q: Why can’t we see what \(x\) and \(y\) look like?

What can we do to check what \(x\) and \(y\) are?
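
A short sketch of some common ways to look at a stored value. The name `w` is hypothetical and only used for illustration.

```r
x <- 1:5
y <- seq(13, 10, by = -0.5)

x            # typing the name prints the value
print(y)     # print() does the same thing explicitly

# wrapping an assignment in parentheses assigns and prints in one step
(w <- x * 2)
```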


2.1.3 operations using vectors

x*2

x-y

mean(y)

sum(x)
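
One thing worth noticing in x-y: the two vectors have different lengths. The sketch below shows how R recycles the shorter one. suppressWarnings() is used only to keep the example quiet, and `d` is a hypothetical name.

```r
x <- 1:5
y <- seq(13, 10, by = -0.5)
length(x)   # 5
length(y)   # 7

# x is recycled to length 7 (1 2 3 4 5 1 2) before subtracting;
# without suppressWarnings(), R warns that 7 is not a multiple of 5
d <- suppressWarnings(x - y)
d           # -12.0 -10.5 -9.0 -7.5 -6.0 -9.5 -8.0
```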


2.1.4 subsets of a vector

In R, we use ‘[]’ to get a subset of a vector, matrix, or data frame.

y

y[7] 

y[8] 

y[8] <- 14 

y


Practice. Suppose \(z=1,2,3,\dots,100\). Get all the odd numbers from \(z\).
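
Two possible solutions are sketched below; try the practice yourself first. The names `odd1` and `odd2` are made up for this example.

```r
z <- 1:100
odd1 <- z[seq(1, 100, by = 2)]   # pick positions 1, 3, 5, ...
odd2 <- z[z %% 2 == 1]           # keep values whose remainder after dividing by 2 is 1
identical(odd1, odd2)            # TRUE
```

The second form uses a logical condition inside ‘[]’, which still works when the odd numbers are not in regular positions.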


2.2 Matrix

Q: What is a matrix?

What are matrices used for?


The function used to create a matrix is ‘matrix()’.

Suppose we would like to create the following matrix M. \[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]

M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M
##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0


Q: Given only the output, how can we tell whether it is a vector or a matrix?

x
##  [1]  1  2  3  4  5  6  7  8  9 10
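
In printed output, a matrix shows row labels like [1,] and column labels like [,1], while a vector only shows a running index like [1]. If only the object is available, a sketch of how to check (the name `v` is an assumption):

```r
v <- 1:10
M <- matrix(c(1,3,6,2,8,9,4,5,0), ncol = 3)

is.matrix(v)   # FALSE
is.matrix(M)   # TRUE
dim(v)         # NULL: a plain vector has no dimensions
dim(M)         # 3 3
```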


2.2.1 operations associated with a matrix

Usually, we apply the same operation, like mean or sum, to all the rows or all the columns of a matrix.

The related function is called ‘apply()’.

Next, we will calculate the row means and column sums for our matrix M.

apply(M, 1, mean) ## calculate row mean
## [1] 2.333333 5.333333 5.000000
apply(M, 2, sum) ## calculate column sum
## [1] 10 19  9
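
For these two particular operations, base R also has the shortcut functions rowMeans() and colSums(). A brief sketch showing that they agree with apply():

```r
M <- matrix(c(1,3,6,2,8,9,4,5,0), ncol = 3)

rowMeans(M)   # same values as apply(M, 1, mean)
colSums(M)    # same values as apply(M, 2, sum)
```

apply() remains the more general tool, since it accepts any function, not only mean and sum.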


2.2.2 subsets of a matrix

As with a vector, we use ‘[]’ to obtain a subset of a matrix. Since a matrix is two-dimensional, we need to specify each dimension clearly.


Eg1: extract an element from a matrix.

M
##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0
M[3,1]
## [1] 6


Eg2: extract a row or a column.

M[2,]
## [1] 3 8 5
M[,3]
## [1] 4 5 0


Q: How do we extract the first column of matrix M in matrix form?

Ans: Use ‘drop=FALSE’ inside ‘[]’ when extracting a single column/row to keep the result two-dimensional.

M[,1]
## [1] 1 3 6
M[,1, drop=FALSE]
##      [,1]
## [1,]    1
## [2,]    3
## [3,]    6
M[2,] 
## [1] 3 8 5
M[2, ,drop=FALSE]
##      [,1] [,2] [,3]
## [1,]    3    8    5


2.3 Data Frame

Q: What is a data frame?

What are the differences from a matrix?

We use the function ‘data.frame()’ to create a data frame.

A <- 2:7
B <- A^2
C <- c('These','Are','Words','Not','Numbers','Eh')
mydata <- data.frame(First=A, b=B, Char=C)
mydata
##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers
## 6     7 49      Eh
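
To see one difference from a matrix, we can inspect the type of each column. This is a sketch that rebuilds the same data frame so it runs on its own; str() prints one line per column.

```r
A <- 2:7
mydata <- data.frame(First = A, b = A^2,
                     Char = c('These','Are','Words','Not','Numbers','Eh'))

str(mydata)   # each column keeps its own type
dim(mydata)   # 6 3

# as.matrix() forces every cell into a single type, here text
is.character(as.matrix(mydata))   # TRUE
```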


2.3.1 subsets of a data frame

eg1: extract one column

mydata$First # method 1
## [1] 2 3 4 5 6 7
mydata[,1] # method 2
## [1] 2 3 4 5 6 7


eg2: extract multiple columns

mydata[,c("First","Char")] # method 1
##   First    Char
## 1     2   These
## 2     3     Are
## 3     4   Words
## 4     5     Not
## 5     6 Numbers
## 6     7      Eh
mydata[,c(1,3)] # method 2
##   First    Char
## 1     2   These
## 2     3     Are
## 3     4   Words
## 4     5     Not
## 5     6 Numbers
## 6     7      Eh


Q: In both eg1 and eg2, which method is better? Why?


eg3: extract the row in which Char is “Not”.

mydata[mydata$Char=="Not",] # don't forget the comma ~
##   First  b Char
## 4     5 25  Not

Q: How about mydata[Char=="Not",]? If it doesn’t work, what are the possible reasons?
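
One likely reason: Char is a column inside mydata, not an object of its own, so R has to be told where to look for it. The sketch below rebuilds the data frame and shows three equivalent ways to do that. The names `r1`, `r2` and `r3` are made up for this example.

```r
A <- 2:7
mydata <- data.frame(First = A, b = A^2,
                     Char = c('These','Are','Words','Not','Numbers','Eh'))

r1 <- mydata[mydata$Char == "Not", ]          # name the data frame explicitly
r2 <- subset(mydata, Char == "Not")           # subset() looks inside mydata
r3 <- with(mydata, mydata[Char == "Not", ])   # with() does the same for any expression
r2
```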


2.4 List

We can think of a list as a magic bag. It’s very handy when we want to combine various types of data with different lengths.

Q: When?


The function we use to create a list is ‘list()’

eg:

List1 <- list(Y = y, MyData = mydata)
List1
## $Y
##  [1] 0.031022328 1.926457236 0.002292707 0.439684028 0.178286256
##  [6] 2.294487004 0.320546231 1.817832689 0.032440342 0.500020061
## 
## $MyData
##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers
## 6     7 49      Eh
List2 <- list(Mat = M, MeanFun = mean)
List2
## $Mat
##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0
## 
## $MeanFun
## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x7fbac4a82060>
## <environment: namespace:base>
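
Once things are in the bag, we also need to take them out again. A short sketch of the usual ways to access list elements, rebuilding List2 so it runs on its own:

```r
M <- matrix(c(1,3,6,2,8,9,4,5,0), ncol = 3)
List2 <- list(Mat = M, MeanFun = mean)

names(List2)                     # "Mat" "MeanFun"
List2$Mat[3, 1]                  # 6: take out the matrix, then subset it
List2[["MeanFun"]](c(1, 2, 3))   # 2: take out the function, then call it

# single brackets return a smaller list, not the element itself
class(List2["Mat"])              # "list"
```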


2.5 summary

Data type    Dimension   Features
Vector       1           A collection of numbers or characters.
Matrix       2           All elements of one type (usually numbers). Rectangular.
Data Frame   2           Can contain both numbers and characters. Rectangular.
List         n           Able to hold any data types. No shape restriction.


3 Read data

eg1: read .txt file

Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)
##   length breadth      species id
## 1   21.7    16.1 meadow.pipit 21
## 2   22.6    17.0 meadow.pipit 22
## 3   20.9    16.2 meadow.pipit 23
## 4   21.6    16.2 meadow.pipit 24
## 5   22.2    16.9 meadow.pipit 25
## 6   22.5    16.9 meadow.pipit 26

eg2: read .csv file

Airplane <- read.csv("airplane.csv", header = TRUE)
head(Airplane)
##   paper..distance
## 1      light  3.1
## 2      light  3.3
## 3      light  2.1
## 4      light  1.9
## 5        medium 4
## 6      medium 3.5
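
The single column named paper..distance suggests that the values in this file may be separated by spaces rather than commas. That is an assumption, since the file itself is not shown here. The sketch below uses a small made-up text in place of the file to show the difference between read.csv() and read.table().

```r
# stand-in text for a whitespace-separated file (assumed layout)
txt <- "paper distance
light 3.1
light 3.3
medium 4"

one_col <- read.csv(text = txt, header = TRUE)     # expects commas
two_col <- read.table(text = txt, header = TRUE)   # splits on whitespace

ncol(one_col)    # 1
names(two_col)   # "paper" "distance"
```

Always check the result with head() after reading a file.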

eg3: read data from a URL

Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)
##   Color Spine Width Satellite Weight
## 1     3     3  28.3         8   3050
## 2     4     3  22.5         0   1550
## 3     2     1  26.0         9   2300
## 4     4     3  24.8         0   2100
## 5     4     3  26.0         4   2600
## 6     3     3  23.8         0   2100


4 Simple Plot

eg1: scatter plot

x <- seq(0,6*pi,len=100)
y <- sin(x)*2 + x*0.25 + rnorm(100,sd=0.3)
plot(x,y)
y.true <- sin(x)*2 + x*0.25
lines(x, y.true, col=2)

eg2: histogram

hist(Crab$Width)

eg3: boxplot

boxplot(length~species, data=Cuckoos)

The tilde (~) between length and species is typed with Shift plus the key just below Esc on a standard US keyboard.


5 Linear Regression Example

The dataset we are going to use is called ‘mtcars’, which is one of the built-in datasets in R.

The following table shows the first 5 rows of the dataset.

First five rows of dataset mtcars
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Use ?mtcars to get more information.


data(mtcars)
attach(mtcars)

qqnorm(mpg)
qqline(mpg, col=2, lty="dashed")

plot(mpg~wt, pch=16)


5.1 Fit a linear model to all the data.

‘lm’ stands for linear model; it starts with the letter l, NOT the digit 1.

Mod0 <- lm(mpg ~ wt)
summary(Mod0)
## 
## Call:
## lm(formula = mpg ~ wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
plot(mpg~wt, pch=16)
abline(Mod0, col="red")
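
Once the model is fitted, the estimates can be pulled out and used. A small sketch; it passes data=mtcars instead of relying on attach(), and the weight value 3 is an arbitrary choice for illustration.

```r
Mod0 <- lm(mpg ~ wt, data = mtcars)

coef(Mod0)   # intercept and slope, as in the summary table

# predicted mpg for a car with wt = 3 (i.e. 3000 lbs)
p <- predict(Mod0, newdata = data.frame(wt = 3))
p
```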


5.2 Fit a linear model for each transmission type.

mean(mpg[am==0]) # am=0: automatic
## [1] 17.14737
mean(mpg[am==1]) # am=1: manual
## [1] 24.39231

Q: Is the difference between these two means statistically significant?

t.test(mpg[am==0],mpg[am==1])
## 
##  Welch Two Sample t-test
## 
## data:  mpg[am == 0] and mpg[am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231


Mod.auto <- lm(mpg~wt, data=subset(mtcars,am==0))
Mod.manu <- lm(mpg~wt, data=subset(mtcars,am==1))
plot(mpg~wt, col=am+1, pch=am+12)
abline(Mod.auto)
abline(Mod.manu, col=2)
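
To compare the two fitted lines numerically, one option is to split the data by am and fit the same model to each piece. This is just one possible sketch; the names `by_am` and `fits` are made up.

```r
by_am <- split(mtcars, mtcars$am)   # list with elements "0" and "1"

# fit mpg ~ wt within each group and collect the coefficients
fits <- sapply(by_am, function(d) coef(lm(mpg ~ wt, data = d)))
fits   # one column per transmission type
```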


5.3 Summary of model fitting steps

  1. Know the purpose of the study.

  2. Check the dataset and plot the data.
     • ?name_of_dataset
     • head()
     • plot()

  3. Choose a proper model/tool to conduct the analysis.
     • lm(), t.test() …
     • summary()

  4. Model checking. (This part won’t be covered in this workshop.)


5.4 Practice

Use R dataset ‘iris’

  • plot the data and identify the two variables with the most linear relationship.

  • fit a regression model based on the variables you chose and add the fitted line to the scatter plot.

  • Try: fit a regression model for each species.
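
As a hint for the first step, a plot such as pairs(iris[, 1:4]) shows all pairs at once, and the correlation matrix gives a numeric check. The sketch below is one possible way to find the most strongly correlated pair. The names `num`, `cm` and `best` are made up, and correlation is only a rough guide, so still look at the plot.

```r
num <- iris[, 1:4]      # the four numeric columns
cm  <- cor(num)         # pairwise correlations
diag(cm) <- 0           # ignore each variable's correlation with itself

# row and column position of the largest absolute correlation
best <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
colnames(num)[best]     # names of the most strongly correlated pair
```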


6 Write a simple function

We want to have a function that fits a linear model, prints its summary, and plots the data with the fitted line:

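LM <- function(x,y){
  Mod <- lm(y~x)          # fit the simple linear regression of y on x
  print(summary(Mod))     # print() is needed: results inside a function are not auto-printed
  plot(x,y, pch=16)       # scatter plot of the data
  abline(Mod, col=2)      # add the fitted line in red
}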

LM(wt, mpg)

LM(Cuckoos$length, Cuckoos$breadth)

7 What’s more

7.2 more about plot


8 After this workshop


the book that taught me R
