We are here to have fun and, along the way, learn some R.
We are here to type things. What we typed is sometimes not what we thought we typed.
We are here to make some mistakes and correct them.
‘If I change xxx part of the code, what will happen?’ If you have time to wait for an answer, why not try it and let the computer answer?
Now let’s open RStudio.
Let’s try something.
2*2
## [1] 4
sqrt(100)
## [1] 10
55/sin(2*pi/9)
## [1] 85.56481
exp(1)
## [1] 2.718282
A name followed by parentheses, xx(), is a function call; xx is the function name.
Practice. Calculate the following expression:
\[ \frac{25\log\left(\tan\left(\frac{\pi}{4}\right)\right) + 2\sqrt{169}}{\exp\left(\sin(\pi)\right)} \]
Pay attention to the ‘()’.
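To see why the parentheses matter, here is a small sketch on a different expression (chosen only for illustration; it is not the practice answer). Nested calls are evaluated from the innermost parentheses outward, so the one-line form and the step-by-step form should agree:

```r
# Nested calls are evaluated from the innermost parentheses outward.
one_line <- log(sqrt(exp(4)))

# The same calculation, one step at a time.
step1 <- exp(4)        # innermost call
step2 <- sqrt(step1)   # then the square root
step3 <- log(step2)    # finally the natural logarithm

one_line
step3
```

If a long expression gives a surprising result, splitting it into steps like this is an easy way to find the misplaced parenthesis.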
In R, a vector is a collection of numbers or characters.
One way to create a vector is to use the function ‘c()’.
Eg1: a vector containing numbers
c(1, 2, 3, 4, 5, 6, 7, 8)
## [1] 1 2 3 4 5 6 7 8
c(5*6, 72/8)
## [1] 30 9
c(sqrt(3.14), log(1), exp(1), tan(pi/4), sin(pi/3))
## [1] 1.7720045 0.0000000 2.7182818 1.0000000 0.8660254
Eg2: a vector containing characters
c("sleep", "reading week", "trips")
## [1] "sleep" "reading week" "trips"
c("hope=","hold","on","pain","ends")
## [1] "hope=" "hold" "on" "pain" "ends"
Q: If a vector contains both numbers and characters, is there anything we need to pay attention to? Run the following code and tell me what you notice.
c("R Intro Workshop", 2017)
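One thing worth checking after running it is the type of the result. The sketch below uses made-up values (not the workshop example) and assumes nothing beyond base R:

```r
# A vector holds a single type, so mixing types forces a conversion.
mixed <- c("a", 1)     # hypothetical example values
class(mixed)           # the type of the whole vector
mixed[2]               # the number is now stored as text

# Converting back works only for text that looks like a number.
as.numeric(mixed[2])
```

Arithmetic on such a vector will fail until the relevant elements are converted back with as.numeric().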
Q: Suppose we want to reuse some results in the next calculation. For example, we want to calculate
\[ 3^1, 3^2, 3^3, 3^4, 3^5, 3^6, 3^7, 3^8 \] What would you do?
x <- 1:5
# vector x: starts at 1, ends at 5, increment is 1
y <- seq(13,10,by= -0.5)
# vector y: starts at 13, ends at 10, increment is -0.5
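A few more ways of building sequences, as a rough sketch. The variable names here are made up for illustration; ‘by’ and ‘length.out’ are arguments of the base R function seq():

```r
evens <- seq(2, 10, by = 2)           # fixed step size
grid  <- seq(0, 1, length.out = 5)    # fixed number of points

evens
grid

# How many elements does the decreasing sequence from the text have?
length(seq(13, 10, by = -0.5))
```

Use ‘by’ when the step matters and ‘length.out’ when the number of points matters.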
Q: Why can’t we see what \(x\) and \(y\) look like?
What can we do to check what \(x\) and \(y\) are?
x*2
x-y
mean(y)
sum(x)
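Arithmetic on vectors works element by element. The sketch below uses small made-up vectors (not the x and y above) to show this, and to show what happens when the lengths differ:

```r
a <- c(1, 2, 3, 4)        # hypothetical vectors for illustration
b <- c(10, 20, 30, 40)

a * 2        # every element is doubled
b - a        # element-by-element subtraction

# When lengths differ, the shorter vector is reused from its start.
short <- c(1, 2)
b - short    # same as b - c(1, 2, 1, 2)
```

R gives a warning when the longer length is not a multiple of the shorter one, which appears to be the situation for x - y above (5 elements against 7).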
In R, we can use ‘[]’ to get a subset of a vector, matrix or data frame.
y
y[7]
y[8]
y[8] <- 14
y
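Besides a single position, ‘[]’ accepts several other kinds of index. A minimal sketch on a made-up vector:

```r
v <- c(10, 20, 30, 40, 50)   # hypothetical vector for illustration

v[c(1, 3)]   # positions 1 and 3
v[-1]        # everything except position 1
v[v > 25]    # only the elements that satisfy a condition
v[6]         # past the end of the vector: NA
```

The condition form is often the most useful, because the selection follows the data rather than fixed positions.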
Practice. Suppose \(z=1,2,3,\dots,100\). Get all the odd numbers from \(z\).
Q: What is a matrix?
What are matrices used for?
The function used to create a matrix is ‘matrix()’.
Suppose we would like to create the following matrix M. \[ M= \left[ \begin{matrix} 1 & 2 & 4 \\ 3 & 8 & 5 \\ 6 & 9 & 0 \end{matrix} \right] \]
M <- matrix(c(1,3,6,2,8,9,4,5,0),ncol=3)
M
##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0
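Note that ‘matrix()’ fills the matrix column by column by default, which is why the numbers above are not typed in reading order. As a sketch, the ‘byrow’ argument lets us type them row by row instead (M_byrow is a name made up for this example):

```r
# Column-wise filling (the default), as in the text.
M <- matrix(c(1, 3, 6, 2, 8, 9, 4, 5, 0), ncol = 3)

# Row-wise filling: the numbers are typed in reading order.
M_byrow <- matrix(c(1, 2, 4,
                    3, 8, 5,
                    6, 9, 0), ncol = 3, byrow = TRUE)

identical(M, M_byrow)   # both calls build the same matrix
dim(M)                  # number of rows, then number of columns
```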
Q: Given the output only, how can we tell whether it is a vector or a matrix?
x
## [1] 1 2 3 4 5 6 7 8 9 10
Often we want to apply the same operation, such as mean or sum, to every row or every column of a matrix.
The function for this is called ‘apply()’.
Next, we will calculate the row means and column sums of our matrix M.
apply(M, 1, mean) ## calculate row mean
## [1] 2.333333 5.333333 5.000000
apply(M, 2, sum) ## calculate column sum
## [1] 10 19 9
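For these two common cases, base R also provides the shortcuts rowMeans() and colSums(). A short sketch comparing them with ‘apply()’ on the same matrix:

```r
M <- matrix(c(1, 3, 6, 2, 8, 9, 4, 5, 0), ncol = 3)

rowMeans(M)   # same values as apply(M, 1, mean)
colSums(M)    # same values as apply(M, 2, sum)

# apply() remains the general tool, e.g. the largest value in each row.
apply(M, 1, max)
```

In ‘apply()’, the second argument chooses the direction: 1 means rows and 2 means columns.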
As with vectors, we use ‘[]’ to obtain a subset of a matrix. Since a matrix is two-dimensional, we need to specify both the row and the column.
Eg1: extract an element from a matrix.
M
##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    3    8    5
## [3,]    6    9    0
M[3,1]
## [1] 6
Eg2: extract a row or a column.
M[2,]
## [1] 3 8 5
M[,3]
## [1] 4 5 0
Q: How can we extract the first column of matrix M and keep it in matrix form?
Ans: Use ‘drop=FALSE’ inside ‘[]’ when extracting a single column or row to keep the result two-dimensional.
M[,1]
## [1] 1 3 6
M[,1, drop=FALSE]
##      [,1]
## [1,]    1
## [2,]    3
## [3,]    6
M[2,]
## [1] 3 8 5
M[2, ,drop=FALSE]
##      [,1] [,2] [,3]
## [1,]    3    8    5
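Besides looking at the printed output, the difference can be checked with dim() and is.matrix(). A minimal sketch (col_vec and col_mat are names made up for this example):

```r
M <- matrix(c(1, 3, 6, 2, 8, 9, 4, 5, 0), ncol = 3)

col_vec <- M[, 1]                  # dimensions dropped: a plain vector
col_mat <- M[, 1, drop = FALSE]    # dimensions kept: a 3 x 1 matrix

dim(col_vec)         # NULL, because a plain vector has no dimensions
dim(col_mat)
is.matrix(col_mat)
```

Keeping the matrix form matters when the result is passed on to code that expects two dimensions, such as matrix multiplication.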
Q: What is a data frame?
What are the differences from a matrix?
We use the function ‘data.frame()’ to create a data frame.
A <- 2:7
B <- A^2
C <- c('These','Are','Words','Not','Numbers','Eh')
mydata <- data.frame(First=A, b=B, Char=C)
mydata
##   First  b    Char
## 1     2  4   These
## 2     3  9     Are
## 3     4 16   Words
## 4     5 25     Not
## 5     6 36 Numbers
## 6     7 49      Eh
eg1: extract one column
mydata$First # method 1
## [1] 2 3 4 5 6 7
mydata[,1] # method 2
## [1] 2 3 4 5 6 7
eg2: extract multiple columns
mydata[,c("First","Char")] # method 1
## First Char
## 1 2 These
## 2 3 Are
## 3 4 Words
## 4 5 Not
## 5 6 Numbers
## 6 7 Eh
mydata[,c(1,3)] # method 2
## First Char
## 1 2 These
## 2 3 Are
## 3 4 Words
## 4 5 Not
## 5 6 Numbers
## 6 7 Eh
Q: In both eg1 and eg2, which method is better? Why?
eg3: extract the row in which Char is "Not".
mydata[mydata$Char=="Not",] # don't forget the comma ~
## First b Char
## 4 5 25 Not
Q: How about ‘mydata[Char=="Not",]’? If it doesn’t work, what are the possible reasons?
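One likely reason, assuming no separate object called Char exists in the workspace, is that Char only lives inside mydata. The sketch below rebuilds the data frame and shows two base R helpers, subset() and with(), that look column names up inside the data frame:

```r
mydata <- data.frame(First = 2:7, b = (2:7)^2,
                     Char = c("These", "Are", "Words", "Not", "Numbers", "Eh"))

# Char is a column of mydata, so it is reached through the data frame.
mydata[mydata$Char == "Not", ]

# subset() and with() look column names up inside the data frame for us.
subset(mydata, Char == "Not")
with(mydata, mydata[Char == "Not", ])
```

All three calls should return the same single row.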
We can think of a list as a magic bag. It’s very handy when we want to combine various types of data with different lengths.
Q: When?
The function we use to create a list is ‘list()’.
eg:
List1 <- list(Y = y, MyData = mydata)
List1
## $Y
## [1] 0.031022328 1.926457236 0.002292707 0.439684028 0.178286256
## [6] 2.294487004 0.320546231 1.817832689 0.032440342 0.500020061
##
## $MyData
## First b Char
## 1 2 4 These
## 2 3 9 Are
## 3 4 16 Words
## 4 5 25 Not
## 5 6 36 Numbers
## 6 7 49 Eh
List2 <- list(Mat = M, MeanFun = mean)
List2
## $Mat
## [,1] [,2] [,3]
## [1,] 1 2 4
## [2,] 3 8 5
## [3,] 6 9 0
##
## $MeanFun
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x7fbac4a82060>
## <environment: namespace:base>
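Once a list exists, its elements can be pulled out by name. A small sketch on a list built only for illustration (the list and its element names are made up):

```r
# A small list built only for illustration.
bag <- list(Numbers = c(1.5, 2.5, 3.5),
            Words   = c("a", "b"),
            MeanFun = mean)

bag$Numbers        # extract one element by name
bag[["Words"]]     # the same idea with double brackets
bag["Words"]       # single brackets return a smaller list instead

# A function stored in a list can be called like any other function.
bag$MeanFun(bag$Numbers)
```

The difference between ‘[[ ]]’ and ‘[ ]’ is worth remembering: the first returns the element itself, the second returns a list containing it.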
| Data type | Dimension | Features |
|---|---|---|
| Vector | 1 | |
| Matrix | 2 | All elements of one type (typically numbers). Rectangular. |
| Data Frame | 2 | Columns can differ in type (e.g. numbers and characters). Rectangular. |
| List | n | Can hold any data types. No shape restriction. |
eg1: read .txt file
Cuckoos <- read.table("cuckoos.txt", header = TRUE)
head(Cuckoos)
## length breadth species id
## 1 21.7 16.1 meadow.pipit 21
## 2 22.6 17.0 meadow.pipit 22
## 3 20.9 16.2 meadow.pipit 23
## 4 21.6 16.2 meadow.pipit 24
## 5 22.2 16.9 meadow.pipit 25
## 6 22.5 16.9 meadow.pipit 26
eg2: read .csv file
Airplane <- read.csv("airplane.csv", header = TRUE)
head(Airplane)
## paper..distance
## 1 light 3.1
## 2 light 3.3
## 3 light 2.1
## 4 light 1.9
## 5 medium 4
## 6 medium 3.5
eg3: read data from a URL
Crab <- read.csv("http://www.hofroe.net/stat557/data/crab.txt", header=TRUE, sep="\t")
head(Crab)
## Color Spine Width Satellite Weight
## 1 3 3 28.3 8 3050
## 2 4 3 22.5 0 1550
## 3 2 1 26.0 9 2300
## 4 4 3 24.8 0 2100
## 5 4 3 26.0 4 2600
## 6 3 3 23.8 0 2100
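After reading a file, it is worth checking that the columns arrived as expected. The sketch below does not need any of the files above: it writes a tiny made-up data frame to a temporary file and reads it back (toy, tmp and the column names are all hypothetical):

```r
# Write a tiny data frame to a temporary file, then read it back.
toy <- data.frame(group = c("g1", "g2"), value = c(3.1, 4.0))
tmp <- tempfile(fileext = ".csv")   # temporary file, removed at the end
write.csv(toy, tmp, row.names = FALSE)

toy_back <- read.csv(tmp, header = TRUE)
str(toy_back)    # column names and types
dim(toy_back)    # rows, then columns

unlink(tmp)      # remove the temporary file
```

If str() shows everything squeezed into a single column, the ‘sep’ argument probably does not match the separator used in the file.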
eg1: scatter plot
x <- seq(0,6*pi,length.out=100)
y <- sin(x)*2 + x*0.25 + rnorm(100,sd=0.3)
plot(x,y)
y.true <- sin(x)*2 + x*0.25
lines(x, y.true, col=2)
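Because rnorm() draws random noise, the scatter plot will look slightly different on every run. As a sketch, fixing the random seed with set.seed() makes the numbers repeatable (the seed value 1 is an arbitrary choice):

```r
set.seed(1)                      # arbitrary seed; any fixed number works
noise1 <- rnorm(5, sd = 0.3)

set.seed(1)                      # resetting the seed replays the same draws
noise2 <- rnorm(5, sd = 0.3)

identical(noise1, noise2)
```

Calling set.seed() once before generating y would make the whole plot reproducible.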
eg2: histogram
hist(Crab$Width)
eg3: boxplot
boxplot(length~species, data=Cuckoos)
The tilde (~) between length and species is usually found at the top-left of the keyboard, next to the 1 key (on a US layout).
The dataset we are going to use is called ‘mtcars’, which is one of the built-in datasets in R.
The following table shows the first 5 rows of the dataset.
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Use ?mtcars to get more information.
data(mtcars)
attach(mtcars)
qqnorm(mpg)
qqline(mpg, col=2, lty="dashed")
plot(mpg~wt, pch=16)
‘lm’ stands for linear model: it is the lowercase letter L followed by m, NOT the digit one followed by m.
Mod0 <- lm(mpg ~ wt)
summary(Mod0)
##
## Call:
## lm(formula = mpg ~ wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
plot(mpg~wt, pch=16)
abline(Mod0, col="red")
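The fitted model can also be used directly, without reading numbers off the summary. A minimal sketch, refitting the same model with the ‘data’ argument so that it does not rely on attach(); the car with wt = 3 is a made-up example:

```r
Mod0 <- lm(mpg ~ wt, data = mtcars)   # data= avoids relying on attach()

coef(Mod0)   # intercept and slope, as in the summary table

# Predicted mpg for a hypothetical car with wt = 3 (wt is in 1000 lbs).
new_car <- data.frame(wt = 3)
predict(Mod0, newdata = new_car)
```

The prediction is simply the intercept plus the slope times 3.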
mean(mpg[am==0]) # am=0: automatic
## [1] 17.14737
mean(mpg[am==1]) # am=1: manual
## [1] 24.39231
Q: Are these two means statistically significantly different?
t.test(mpg[am==0],mpg[am==1])
##
## Welch Two Sample t-test
##
## data: mpg[am == 0] and mpg[am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
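The same test can be written with a formula, which avoids subsetting by hand, and the result can be stored so that individual pieces are available. A sketch (res is a name made up for this example):

```r
# Formula interface: compare mpg between the two levels of am.
res <- t.test(mpg ~ am, data = mtcars)

res$p.value     # the p-value on its own
res$conf.int    # the 95 percent confidence interval
res$estimate    # the two group means
```

This should reproduce the Welch test above, since ‘am’ has exactly two values.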
Mod.auto <- lm(mpg~wt, data=subset(mtcars,am==0))
Mod.manu <- lm(mpg~wt, data=subset(mtcars,am==1))
plot(mpg~wt, col=am+1, pch=am+12)
abline(Mod.auto)
abline(Mod.manu, col=2)
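To compare the two fitted lines numerically as well as visually, the coefficients can be placed side by side. A small sketch that refits both models so it runs on its own (the row labels are chosen here for readability):

```r
Mod.auto <- lm(mpg ~ wt, data = subset(mtcars, am == 0))
Mod.manu <- lm(mpg ~ wt, data = subset(mtcars, am == 1))

# One row per model: intercept first, then the slope for wt.
both <- rbind(automatic = coef(Mod.auto), manual = coef(Mod.manu))
both
```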
Know the purpose of the study.
Use the R dataset ‘iris’.
Plot the data and identify the two variables that have the most linear relationship with each other.
Fit a regression model based on the variables you choose and add the fitted line to the scatter plot.
Try: fit a regression model for each species.
We want to have a function with the following properties:
LM <- function(x,y){
  Mod <- lm(y~x)
  print(summary(Mod))   # print() is needed: results are not shown automatically inside a function
  plot(x,y, pch=16)
  abline(Mod, col=2)
}
LM(wt, mpg)
LM(Cuckoos$length, Cuckoos$breadth)