R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. Users have created packages to augment the functions of the R language.
To work with R, you need to download the program and install it. I recommend that you also install R-Studio. R-Studio is a separate program that makes R easier to use .
Download the latest R installer (.exe) for Windows. Install the downloaded file as any other windows app.
Now that R is installed, you need to download and install RStudio. First download the installer for Windows. Run the installer (.exe file) and follow the instructions.
First download the latest release (“R-version.pkg”) of R Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, you need to download and install RStudio.
First download the the version for Mac. After downloading, double-click the file to open it, and then drag and drop it to your applications folder.
# create create variable "a" that is a vector of one number.
a <- 7
a = 7
a
## [1] 7
# create vectors that consists of multiple numbers.
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
# create a regular sequence of whole numbers
c <- 1:10
c
## [1] 1 2 3 4 5 6 7 8 9 10
# seq(from, to, by) function ro creat sequence of numbers
d <- seq(1,10,0.5)
d
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
# rep() function to repeat a single number, or a sequence of numbers.
rep(99, times=5)
## [1] 99 99 99 99 99
rep(c(1,2,3), times=5)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),each = 5)
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
x <- "Monday"
class(x)
y <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
class(y)
A factor is a nominal (categorical) variable with a set of known
possible values called levels. They can be created using the
as.factor function.
yy <- as.factor(y)
yy
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:
+-*/^ Ex: \(2^3\) = 2^3 = 2*2*2 = 8%% Ex: 20 modulo 3 = 20%%3 = 2 (3 6 9 12 15
18+2=20)# An addition
5 + 5
# A subtraction
5 - 5
# A multiplication
3 * 5
# A division
(5 + 5) / 2
# create a vector
x <- c(3, 5, 7, 1, 9, 20)
# multiply each element by 5
##TODO: your code here
# create a second vector
y <- c(17, 15, 13, 19, 11, 0)
# add both vectors
##TODO
# multiply both vectors
##TODO
To extract elements based on their position we simply write the
position inside the [ ]. For example, to extract the 3rd
value of z
z <- c(2, 3, 1, 6, 4, 3, 3, 7)
z[3] # extract the 3rd value
# if you want to store this value in another object
z_3 <- z[3]
z_3
We can also extract more than one value by using the c()
function inside the square brackets. Here we extract the 1st, 5th, 6th
and 8th element from the z vector
z[c(1, 5, 6, 8)]
Or we can extract a range of values using the :
notation. To extract the values from the 3rd to the 8th elements
z[3:8]
Another really useful way to extract data from a vector is to use a logical expression as an index.
z[z >= 4] # values greater or equal to 4
z[z < 4] # values less than 4
z[z <= 4] # values less than or equal to 4
z[z == 4] # values equal to 4
z[z != 4] # values not equal to 4
val26 <- z[z < 6 & z > 2] # extract values which are less than 6 AND greater than 2
val26
val63 <- z[z > 6 | z < 3]#extract values that are greater than 6 OR less than 3
mat0 <- matrix(NA, ncol=3, nrow=2)
mat0
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
mat1 <- matrix(1:6, ncol=3, nrow=2)
mat1
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
mat2 <- matrix(1:6, ncol=3, nrow=2,byrow = TRUE)
mat2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
A list is a very flexible container to store data. Each element of a list can contain any type of R object, e.g. a vector, matrix, data.frame, another list, or more complex data types.
lst <- list(d,yy,mat2)
lst
The data.frame is commonly used for statistical data
analysis in R. It is a special type of list that requires that all
elements (variables) have the same length.
R works with numerous data types. Some of the most basic types to get started are:
4.5 are called
numerics.4 are called
integers. Integers are also numerics.TRUE or FALSE) are called
logical.# Loading Built in Datasets in R
data(ToothGrowth)
# Print the first 6 rows
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
# Print the first 15 rows
ToothGrowth[1:15,]
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 13 15.2 VC 1.0
## 14 17.3 VC 1.0
## 15 22.5 VC 1.0
# Check data type of variables in the dataframe
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Your tasks are to:
aabb containing each digit of your student
ID (For example, studentID is 650510999, create vector
c(6,5,0,5,1,0,9,9,9))aaand bb, store the result
in vector cccc, extract values less than 2cc, extract values not equal to 3cc, extract values which are less than 5
AND greater than 2cc, extract values which are less than 5
OR greater than 2dd that contains 12 months character
namesdd as a factordd 3 timesdd by 4 timesaa,bb,cc,
and ddThe data.frame is commonly used for statistical data
analysis in R. It is a special type of list that requires that all
elements (variables) have the same length.
R works with numerous data types. Some of the most basic types to get started are:
4.5 are called
numerics.4 are called
integers. Integers are also numerics.TRUE or FALSE) are called
logical.We will use dataset in data frame Insurance. The data
frame consist of the numbers of policyholders of an insurance company
who were exposed to risk, and the numbers of car insurance claims made
by those policyholders in the third quarter of 1973. This data frame
contains the following columns:
District as factor: district of residence of
policyholder (1 to 4): 4 is major cities.Group as an ordered factor: group of car with levels
<1 litre, 1–1.5 litre, 1.5–2 litre, >2 litre.Age as an ordered factor: the age of the insured in 4
groups labelled <25, 25–29, 30–35, >35.Holders numbers of policyholders.Claims numbers of claimslibrary(MASS)
data("Insurance")
head(Insurance)
str(Insurance)
summary() functiontable() functionprop.table()
function# job, marital, education, housing, loan
summary(Insurance)
#frequency table for one categorical variable
table(Insurance$District)
table(Insurance$Age)
table(Insurance$Group)
prop.table(table(Insurance$Age))
prop.table(table(Insurance$Age))*100
#Two-Way Frequency Tables
tb <- table(Insurance$District,Insurance$Age)
# Add margin totals to your table.
addmargins(tb)
#Calculate percentage
prop.table(tb)*100
summary() functionmax() and min functionmean() functionmedian() functiongetmode() function defined below.#define function to calculate mode
getmode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
range() functionquantile(, probs = ) function where probs
can be 0.25, 0.5, 0.75 for first, second, and third quantileIQR() functionsd() and var() function#age, balance, duration, campaign, pdays
summary(Insurance$Holders)
min(Insurance$Holders)
range(Insurance$Holders)
mean(Insurance$Holders)
getmode(Insurance$Holders)
IQR(Insurance$Holders)
sd(Insurance$Holders)
quantile(Insurance$Holders,probs = 0.25)
quantile(Insurance$Holders,probs = 0.50)
quantile(Insurance$Holders,probs = 0.75)
quantile(Insurance$Holders,probs = c(0.25,0.50,0.75))
cv_Holders <- sd(Insurance$Holders)/mean(Insurance$Holders)
cv_Holders
#cv of Claims by District (1,2)
dt1 <- Insurance$Claims[Insurance$District=="1"]
mean1 <- mean(dt1)
sd1 <- sd(dt1)
cv1 <- sd1/mean1
dt2 <- Insurance$Claims[Insurance$District=="2"]
mean2 <- mean(dt2)
sd2 <- sd(dt2)
cv2 <- sd2/mean2
aggregate() function is used to get the summary
statistics of the data by group.# aggregate: compute summary statistics of data subsets
aggregate(Claims~Group, data = Insurance, FUN=summary)
aggregate(Claims~District+Age, data = Insurance, FUN=summary)
aggregate(Claims~District+Age, data = Insurance, FUN=mean)
Perform the following numerical data presentation for the
Insurance dataset:
Age, with proportionAge by
Group category, with percentageClaimsClaims, also the IQRClaims between Age < 25
and 25-29claims by
AgeClaims by Age and
GroupWe will use dataset survey. This data frame contains the
responses of 237 Statistics I students at the University of Adelaide to
a number of questions. The components of the data frame are:
Sex : The sex of the student. (Factor with levels
“Male” and “Female”.)Wr.Hnd : span (distance from tip of thumb to tip of
little finger of spread hand) of writing hand, in centimetres.NW.Hnd : span of non-writing hand.W.Hnd : writing hand of student. (Factor, with levels
“Left” and “Right”.)Fold : “Fold your arms! Which is on top” (Factor, with
levels “R on L”, “L on R”, “Neither”.)Pulse : pulse rate of student (beats per minute).Clap : ‘Clap your hands! Which hand is on top?’
(Factor, with levels “Right”, “Left”, “Neither”.)Exer : how often the student exercises. (Factor, with
levels “Freq” (frequently), “Some”, “None”.)Smoke : how much the student smokes. (Factor, levels
“Heavy”, “Regul” (regularly), “Occas” (occasionally), “Never”.)Height : height of the student in centimetres.M.I : whether the student expressed height in imperial
(feet/inches) or metric (centimetres/metres) units. (Factor, levels
“Metric”, “Imperial”.)Age : age of the student in years.library(MASS)
data("survey")
head(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00 Metric
## 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80 Imperial
## 3 Male 18.0 13.3 Right L on R 87 Neither None Occas NA <NA>
## 4 Male 18.8 18.9 Right R on L NA Neither None Never 160.00 Metric
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 6 Female 18.0 17.7 Right L on R 64 Right Some Never 172.72 Imperial
## Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000
#check data structure
str(survey)
## 'data.frame': 237 obs. of 12 variables:
## $ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
## $ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
## $ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
## $ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
## $ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
## $ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
## $ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
## $ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
## $ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
## $ Height: num 173 178 NA 160 165 ...
## $ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
## $ Age : num 18.2 17.6 16.9 20.3 23.7 ...
In the case that you have a factor variable, you can display a frequency table or a bar graph.
You can use the good ol’ summary() on a factor variable
to display a frequency table.
freq <- table(survey$Smoke)
freq
##
## Heavy Never Occas Regul
## 11 189 19 17
barplot(freq)
Stacked Bar Plot with Colors and Legend
counts <- table(survey$Sex, survey$Smoke)
barplot(counts,
legend = rownames(counts))
Grouped Bar Plot
counts <- table(survey$Sex, survey$Smoke)
barplot(counts,col=c("blue","red"),
legend = rownames(counts), beside=TRUE)
counts <- table(survey$Exer)
pie(counts, main = "Exercise pie chart")
hist()hist(survey$Height)
Sometimes, you want to change the title of the plot, and also the
label of the horizontal axis and vertical axis. To do that, set the
parameters main, xlab and
ylab
hist(survey$Height, main = "Histogram of Height", xlab = "height", ylab = "counts")
To change the number of bins, set the breaks parameter.
However, R will decide what is the best number of bins close to the
number that your specify. For example, the following code displays a
histogram with the number of bins close to 9:
hist(survey$Height, breaks = 20,
main = "Histogram of Height",
xlab = "height",
ylab = "counts")
stem().stem(survey$Height)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 15 | 0224
## 15 | 555566777777899
## 16 | 00000000333333334444
## 16 | 555555555555555555677777777888888888888899999
## 17 | 000000000000000000111112222222233333333334
## 17 | 555555555566777778888999999
## 18 | 0000000000000000023333333344
## 18 | 55555555777888899
## 19 | 00011123
## 19 | 56
## 20 | 0
#compare with its histogram
boxplot().boxplot(survey$Height,main="Boxplot of height")
horizontal = TRUEboxplot(survey$Height, horizontal = TRUE)
You can display boxplots of by group variable.
boxplot(survey$Height~survey$Sex,xlab="Sex",ylab="Height")
boxplot(survey$Height~survey$Exer,xlab="Exercise",ylab="Height")
Note that you need two vectors to plot a scatter plot: one for the x-axis, and one for the y-axis.
Suppose that you have two variables: var1 and
var2. To display a scatter plot between var1
and var2, use plot(x = var1, y = var2).
plot(x = survey$Age, y = survey$Height,xlab="Age",ylab="Height")
Perform the following graphical data presentation for the
survey dataset:
W.Hnd data)?W.Hnd) and how often the student
exercises (Exer), which graph shall you use to present such
information?Pulse
data)?Pulse) between students with different how often the
student exercises (Exer), which graph shall you use to
present such information?Pulse) and their age (Age), which
graph shall you use to present such information?Use rbinom to generate random numbers from a binomial
distribution.
Recall that binomial distribution requires two parameters: \(n\) (number of trials) and \(p\) (probability of success).
The following code generate random numbers of \(X\sim \text{Binomial}(20,1/6)\)
rbinom( n = 40, size = 20, prob = 1/6 )
Use dbinom to compute probabilities of Binomial
variable
The following code compute \(P(X=4)\), where \(X\sim \text{Binomial}(20,1/6)\)
dbinom( x = 4, size = 20, prob = 1/6 )
## [1] 0.2022036
You can also compute cumulative probability.
The following code compute \(P(X\leq 4)\), where \(X\sim \text{Binomial}(20,1/6)\)
Note that \(P(X\leq 4)=P(X= 0)+P(X= 1)+P(X= 2)+P(X= 3)+P(X= 4)\)
pbinom( q= 4, size = 20, prob = 1/6)
## [1] 0.7687492
From this,we can compute \(P(X>4)\) using \(P(X\leq 4)+P(X>4) = 1\):
##P(X>4)
1 - pbinom( q= 4, size = 20, prob = 1/6)
## [1] 0.2312508
The Bernoulli distribution is a special case of the Binomial
distribution where \(n=1\): \(\text{Binomial}(1,p) =
\text{Bernoulli}(p)\) Thus we can generate random Bernoulli
numbers using rbinom( n=..., size = 1, p=...)
For example: suppose we want to flip a fair coin 40 times:
rbinom( n=40, size = 1, p=0.5)
##TODO: your code here
##P(X=5)
##TODO: your code here
##P(X<=5) = P(X<5) + P(X=5)
##TODO: your code here
##P(X<5)
## TODO: your code here
## P(X>10) = 1 - P(X <= 10)
Recall that normal distribution can be used to generate numerics, with numbers concentrated around the mean.
The normal distribution requires two parameters: the mean and the standard deviation (sd).
To generate number from normal distribution, use
rnorm
##Generate 40 random numbers from N(mu = 0, sigma = 1)
rnorm( n=40, mean=10, sd=100 )
If you want to generate from the standard normal
distribution, you can ignore the mean and sd
parameters:
rnorm(n=40)
rnorm(n=40,mean=0,sd=1)
Let’s look at the histogram of generated data.
normal.a <- rnorm( n=100, mean=0, sd=1 )
hist( normal.a )
Suppose IQ’s are normally distributed with a mean of 100 and a standard deviation of 15.
Thus IQ of a random person can be represented by a random variable \(X\sim N(100,15)\)
## P(X <= 125)
pnorm(125, mean = 100, sd = 15)
## [1] 0.9522096
Set lower.tail = FALSE to compute the upper tail
probability.
## P(X > 110)
pnorm(110, mean = 100, sd = 15, lower.tail=FALSE)
## [1] 0.2524925
You can check for the total law of probability by adding the lower tail and the upper tail.
pnorm(110, mean = 100, sd = 15) + pnorm(110, mean = 100, sd = 15, lower.tail=FALSE)
## [1] 1
You need to think of these problems as calculating the area under the curve!
$P(110<X<125) = $ area under the curve of \(N(100,15)\) between 110 and 125 which is
lower tail area of 125 -
lower tail area of 110
## P(110<X<125)
pnorm(125, mean = 100, sd = 15) - pnorm(110, mean = 100, sd = 15)
## [1] 0.2047022
##TODO: your code here
##Note: P(X<170) = P(X <= 170) (only true for continuous distributions!)
##TODO: your code here
##TODO: your code here
Function qnorm() takes the probability value and gives a
number whose cumulative value matches the probability value.
qnorm(0.05, mean = 0, sd = 1)
## [1] -1.644854
qnorm(0.1, mean = 0, sd = 1)
## [1] -1.281552
qnorm(0.5, mean = 0, sd = 1)
## [1] 0
qnorm(0.9, mean = 0, sd = 1)
## [1] 1.281552
qnorm(0.95, mean = 0, sd = 1)
## [1] 1.644854
qnorm(0.975, mean = 0, sd = 1)
## [1] 1.959964
The rt() function generates random deviates of the
t-distribution and is written as rt(n, df). We may easily
generate n number of random samples. Recall that he number
of degrees of freedom for a t-distribution is equal to the sample size
minus one, that is,\(df=n−1\).
n <- 30
df <- n - 1
samp <- rt(n, df)
plot them as a histogram.
hist(samp)
Further we may generate a very large number of random samples and plot them as a histogram.
n <- 1000
df <- n - 1
samples <- rt(n, df)
hist(samples)
By using the dt() function we may calculate the
probability density function, and thus, the vertical distance between
the horizontal axis and the t-curve at any point. For the purpose of
demonstration we construct a t-distribution with \(df=5\) and calculate the probability
density function at \(t=−4,−2,0,2,4.\)
x <- seq(-4, 4, by = 2)
dt(x, df = 5)
## [1] 0.005123727 0.065090310 0.379606690 0.065090310 0.005123727
Another very useful function is the pt() function, which
returns the area under the t-curve for any given interval.
pt(-2, df = df, lower.tail = TRUE)
## [1] 0.02288531
pt(-1, df = df, lower.tail = TRUE)
## [1] 0.1587763
pt(0, df = df, lower.tail = TRUE)
## [1] 0.5
pt(1, df = df, lower.tail = TRUE)
## [1] 0.8412237
pt(2, df = df, lower.tail = TRUE)
## [1] 0.9771147
The qt() function returns the quantile function, and
thus it is the reversing function of pt().
qt(0.5, df = 5, lower.tail = TRUE)
## [1] 0
qt(0.95, df = 5, lower.tail = TRUE)
## [1] 2.015048
qt(0.975, df = 5, lower.tail = TRUE)
## [1] 2.570582
The distribution of everyone’s IQ score is \(N(100,15)\).
Here’s we use rnorm() to simulate the IQ’s of 1,000
people in a city.
The function round() is used to round the numbers to
integers (e.g. round(93.7)=93)
pop <- rnorm(1000, mean = 100, sd = 15)
pop <- round(pop)
Let’s look at the histogram of this data. Also, calculate the mean and standard deviation of the population’s data.
hist(pop)
mean(pop)
sd(pop)
Now we do some sampling from pop. Imagine sampling as
randomly drawing balls from a box full of balls labeled with numbers in
pop.
In sampling with replacement, each ball that we draw is put back in the box for the next sampling.
To sample with replacement, use
sample(iq, size, replace = TRUE)
Here, we sample IQ’s of 10 people (size = 10).
# Sample IQ's of 10 people with replacement
iq20_w_replace = sample(pop, size = 10, replace = TRUE)
iq20_w_replace
In sampling with replacement, each ball that we draw is put back is thrown away. This means that the numbers that we draw are different.
Here, we sample IQ’s of 10 people without replacement
(replace = FALSE).
# Sample IQ's of 10 people with replacement
iq20_wo_replace = sample(pop, size = 10, replace = FALSE)
iq20_wo_replace
mean() function to
compute the average IQ of the population , average IQ of the sample with
replacement (iq10_w_replace), and the sample without
replacement (iq10_wo_replace).##Exercise 1
##TODO:
The law of large numbers tells you that, the larger the sample size, the closer the sample mean to the population mean.
pop with replacement and compute their
average.##Exercise2
#TODO:
pop with replacement and compute their
average.##Exercise3
#TODO:
From the law of large number, we can use the sample mean to estimate the population mean. But are we confident that the sample mean is actually close to the population mean? We will look at the distribution of the sample mean to answer this question.
First, we need to sample from the population a couple of times; let’s say 20 times, each with 100 observations (\(n=10\)). Then we compute the means of these 20 samples.
#calculate the mean of first set of the sample
m1 = mean(sample(pop, size = 10, replace = TRUE))
m2 = mean(sample(pop, size = 10, replace = TRUE))
m3 = mean(sample(pop, size = 10, replace = TRUE))
m4 = mean(sample(pop, size = 10, replace = TRUE))
m5 = mean(sample(pop, size = 10, replace = TRUE))
m6 = mean(sample(pop, size = 10, replace = TRUE))
m7 = mean(sample(pop, size = 10, replace = TRUE))
m8 = mean(sample(pop, size = 10, replace = TRUE))
m9 = mean(sample(pop, size = 10, replace = TRUE))
m10 = mean(sample(pop, size = 10, replace = TRUE))
m11 = mean(sample(pop, size = 10, replace = TRUE))
m12 = mean(sample(pop, size = 10, replace = TRUE))
m13 = mean(sample(pop, size = 10, replace = TRUE))
m14 = mean(sample(pop, size = 10, replace = TRUE))
m15 = mean(sample(pop, size = 10, replace = TRUE))
m16 = mean(sample(pop, size = 10, replace = TRUE))
m17 = mean(sample(pop, size = 10, replace = TRUE))
m18 = mean(sample(pop, size = 10, replace = TRUE))
m19 = mean(sample(pop, size = 10, replace = TRUE))
m20 = mean(sample(pop, size = 10, replace = TRUE))
Let’s combine these 20 sample means into a single vector.
sample_means = c(m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13,m14,m15,m16,m17,m18,m19,m20)
sample_means
## [1] 97.2 95.6 109.3 105.8 95.1 93.0 102.5 101.0 100.5 94.9 100.4 95.9
## [13] 96.3 101.3 97.8 102.8 96.0 103.0 100.9 93.2
Then we look at the histogram of these sample means.
hist(sample_means)
If you are lucky, the histogram will look like that of a normal distribution.
#Exercise4
#TODO: population mean
#TODO: population sd
#TODO: mean of sample means
#TODO: s.d. of sample means
Next, we will perform the same experiment, but with \(n=100\) for 20 sample sets.
n1 = mean(sample(pop, size = 100, replace = TRUE))
n2 = mean(sample(pop, size = 100, replace = TRUE))
n3 = mean(sample(pop, size = 100, replace = TRUE))
n4 = mean(sample(pop, size = 100, replace = TRUE))
n5 = mean(sample(pop, size = 100, replace = TRUE))
n6 = mean(sample(pop, size = 100, replace = TRUE))
n7 = mean(sample(pop, size = 100, replace = TRUE))
n8 = mean(sample(pop, size = 100, replace = TRUE))
n9 = mean(sample(pop, size = 100, replace = TRUE))
n10 = mean(sample(pop, size = 100, replace = TRUE))
n11 = mean(sample(pop, size = 100, replace = TRUE))
n12 = mean(sample(pop, size = 100, replace = TRUE))
n13 = mean(sample(pop, size = 100, replace = TRUE))
n14 = mean(sample(pop, size = 100, replace = TRUE))
n15 = mean(sample(pop, size = 100, replace = TRUE))
n16 = mean(sample(pop, size = 100, replace = TRUE))
n17 = mean(sample(pop, size = 100, replace = TRUE))
n18 = mean(sample(pop, size = 100, replace = TRUE))
n19 = mean(sample(pop, size = 100, replace = TRUE))
n20 = mean(sample(pop, size = 100, replace = TRUE))
sample_means2 = c(n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12,n13,n14,n15,n16,n17,n18,n19,n20)
sample_means2
## [1] 98.83 98.58 98.73 98.82 100.38 100.49 96.54 99.72 99.23 99.91
## [11] 101.11 100.48 98.79 99.27 101.20 100.63 97.45 100.82 100.47 104.69
#Exercise5
#TODO: Histogram of 20 sample means with n=100
#TODO: mean of sample means
#TODO: s.d. of sample means
Example 4.1: In a study of maximal aerobic capacity, 12 women were used as subjects, and one measurement that was made was blood plasma volume. The following data give their blood plasma volumes in liters: 3.15, 2.99, 2.77, 3.12, 2.45, 3.85, 2.99, 3.87, 4.06,2.94,3.53, and 3.20.
Assume that blood plasma volumes are normally distributed.
#create data vector
bpm <- c(3.15, 2.99, 2.77, 3.12, 2.45, 3.85, 2.99, 3.87, 4.06,2.94,3.53,3.20)
# count number of observations in vector
n <- length(bpm)
calculate the point estimator of \(\mu\) which is sample mean (\(\bar{x}\))
xbar <- mean(bpm)
calculate the point estimator of \(\sigma\) which is sample sd (s)
s <- sd(bpm)
Find a 90% confidence interval for \(\mu\) (case: unknown \(\sigma\) and \(n < 30\))
# alpha = 0.10 , we need 1-(alpha/2) = 1-(0.1/2)=0.95
tval <- qt(0.95,n-1)
tval
## [1] 1.795885
lower <- xbar-tval*s/sqrt(n)
lower
## [1] 2.99083
upper <- xbar+tval*s/sqrt(n)
upper
## [1] 3.495837
Example 4.2: Suppose average pizza delivery times are normally distributed with an unknown population mean and an unknown population standard deviation A random sample of 50 pizza delivery restaurants is taken and has a sample mean delivery time of 36 minutes with standard deviation of 6 minutes.
Find a 95% confidence interval for the mean delivery time.
n <- 50
xbar <- 36
s <- 6
# alpha = 0.05 , we need 1-(alpha/2) = 1-(0.05/2)=0.975
zval <- qnorm(0.975)
zval
lower <- xbar-zval*s/sqrt(n)
lower
upper <- xbar+zval*s/sqrt(n)
upper
A teacher wants to estimate the mean height of all 400 students at her school. She takes a simple random sample of 45 students and finds the sample mean height is 63 inches with a standard deviation of 1.5 inches.
Construct a 90% confidence interval for the population mean height.
##Exercise6
##TODO:
Let X equal the weight in grams of a “52-grams” snack pack of candies. Assume that the distribution of X is \(N(\mu,\sigma=4)\). A random sample of \(n=10\) observation s of X yielded the following data: 55.95, 56.54, 57.58, 55.13,57.48,56.06,59.93,58.30,52.57 and 58.46.
##Exercise7
##TODO: calculate a point estimator of mu
##TODO: find a 95% confidence interval for mu
We will use birthwt data frame that has 189 rows and 10
columns. The data were collected at Baystate Medical Center,
Springfield, Mass during 1986. This data frame contains the following
columns:
low indicator of birth weight less than 2.5 kg.age mother’s age in years.lwt mother’s weight in pounds at last menstrual
period.race racenmother’s race (1 = white, 2 = black, 3 =
other).smoke smoking status during pregnancy.ptl number of previous premature labours.ht history of hypertension.ui presence of uterine irritability.ftv number of physician visits during the first
trimester.bwt birth weight in grams.library(MASS)
data("birthwt")
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
#change qualitative variables to be factor
birthwt$low <- as.factor(birthwt$low)
birthwt$race <- as.factor(birthwt$race)
birthwt$smoke <- as.factor(birthwt$smoke)
birthwt$ht <- as.factor(birthwt$ht)
birthwt$ui <- as.factor(birthwt$ui)
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ui : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
summary(birthwt)
bwt birth weight in grams by
smoke smoking status during pregnancy using
boxplot() and aggregate() function.boxplot(bwt~smoke,data=birthwt)
aggregate(bwt~smoke,data=birthwt,FUN=summary)
t.test function, viewed in isolation by adding
$conf.int to the command.t.test(bwt~smoke, data = birthwt, var.equal = TRUE, conf.level=.95)$conf.int
## [1] 72.75612 494.79735
## attr(,"conf.level")
## [1] 0.95
d1 <- birthwt$bwt[birthwt$smoke=="0"]
d2 <- birthwt$bwt[birthwt$smoke=="1"]
n1 <- length(d1)
n2 <- length(d2)
xbar1 <- mean(d1)
xbar2 <- mean(d2)
s1 <- sd(d1)
s2 <- sd(d2)
# at alpha=0.05, we need 1-(alpha/2)=1-(0.05/2)=0.975
tval <- qt(0.975,n1+n2-2)
tval
## [1] 1.972731
sp <- sqrt(((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2))
sp
## [1] 717.7792
lower <- (xbar1-xbar2)-tval*sp*sqrt(1/n1+1/n2)
lower
## [1] 72.75612
upper <- (xbar1-xbar2)+tval*sp*sqrt(1/n1+1/n2)
upper
## [1] 494.7973
bwtby
ht history of hypertension (assumed equal variances for two
groups)##Exercise 1
##TODO: calculate n1,n2,xbar1,xbar2,s1,s2,sp
##TODO: find 90% confidence interval for mu1-mu2
##TODO: find 90% confidence interval for mu1-mu2 using t.test function
A researcher wants to examine whether music has an effect on the perceived psychological effort required to perform an exercise session. To test this, the researcher recruited 12 runners who each ran three times on a treadmill for 30 minutes. For consistency, the treadmill speed was the same for all three runs. In a random order, each subject ran: (a) listening to no music at all; (b) listening to classical music; and (c) listening to dance music. At the end of each run, subjects were asked to record how hard the running session felt on a scale of 1 to 10, with 1 being easy and 10 extremely hard.
#create data
music_data <- data.frame(Runner = rep(seq(1,12,by=1),times=3),
Type =rep(c("None","Classical","Dance"),each=12),
Scale = c(8,7,6,8,5,9,7,8,8,7,7,9,8,6,8,9,8,7,7,
7,6,6,8,9,7,6,6,7,5,7,7,7,8,6,6,6)
)
#set 'Type' to be a factor
music_data$Type <- as.factor(music_data$Type)
#create boxplot of 'Scale' by 'Type'
boxplot(Scale~Type,data=music_data)
Confidence intervals for \(\mu_{d}\)
can be obtained using the t.test function with
paired=TRUE, viewed in isolation by adding
$conf.int to the command.
x <- music_data$Scale[music_data$Type=="None"]
y <- music_data$Scale[music_data$Type=="Dance"]
t.test(x,y, paired = TRUE,conf.level=0.90)$conf.int
## [1] 0.4499076 1.3834258
## attr(,"conf.level")
## [1] 0.9
n <- length(x)
d <- x-y
dbar <- mean(d)
sd <- sd(d)
# at alpha=0.10, 1-(alpha/2)=1-(0.10/2)=0.95
tval <- qt(0.95,n-1)
lower <- dbar - tval*sd/sqrt(n)
lower
## [1] 0.4499076
upper <- dbar + tval*sd/sqrt(n)
upper
## [1] 1.383426
##Exercise 2
##TODO: calculate n, dbar, sd
##TODO: find 90% confidence interval for mu_d
##TODO: find 90% confidence interval for mu_d using t.test function
In this example, we will use dataset ToothGrowth which
is about the effect of Vitamin C on tooth growth in Guinea Pigs. A data
frame with 60 observations on 3 variables.
len numeric Tooth lengthsupp factor Supplement type (VC or OJ).dose numeric Dose in milligrams/daydata("ToothGrowth")
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#data summary
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
## aggregate: compute summary statistics of data subsets by categorical data
aggregate(len~supp, data = ToothGrowth, FUN=summary)
## supp len.Min. len.1st Qu. len.Median len.Mean len.3rd Qu. len.Max.
## 1 OJ 8.20000 15.52500 22.70000 20.66333 25.72500 30.90000
## 2 VC 4.20000 11.20000 16.50000 16.96333 23.10000 33.90000
aggregate(len~supp, data = ToothGrowth, FUN=length)
## supp len
## 1 OJ 30
## 2 VC 30
boxplot(len~supp, data = ToothGrowth)
We will take the data of Tooth length for Vitamin C (VC).
Use the subset() function to select data by their
values.
#Take the Tooth length for VC
vc_dat = subset(ToothGrowth, supp == "VC" )
head(vc_dat)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
vc_dat_len = vc_dat$len
n_vc = length(vc_dat_len)
Let \(\mu\) be the average tooth
length of all observations for VC supplement
We will test the following hypotheses:
\(H_0: \mu \leq 24.5\)
\(H_1: \mu > 24.5\)
Thus this is an upper test. In R, z-test can be
performed via z.test function for when \(n > 30\) .
Package BSDA is required before using
z.test function
t-test can be performed via t.test function.
#t-test
t.test(vc_dat_len, mu = 24.5, alternative = "greater")
##
## One Sample t-test
##
## data: vc_dat_len
## t = -4.9939, df = 29, p-value = 1
## alternative hypothesis: true mean is greater than 24.5
## 95 percent confidence interval:
## 14.39907 Inf
## sample estimates:
## mean of x
## 16.96333
Answer:
Answer:
Now we will test the following hypotheses:
\(H_0: \mu \geq 25.5\)
\(H_1: \mu < 25.5\)
Thus this is a lower test.
#t-test
t.test(vc_dat_len, mu = 25.5, alternative = "less")
##
## One Sample t-test
##
## data: vc_dat_len
## t = -5.6566, df = 29, p-value = 2.051e-06
## alternative hypothesis: true mean is less than 25.5
## 95 percent confidence interval:
## -Inf 19.52759
## sample estimates:
## mean of x
## 16.96333
In this case, if we were to test at 0.05 significance level, since
p-value> 0.05, we would accept \(H_0\).
if we were to test at 0.10 significance level, since p-value< 0.10, we would reject \(H_0\) (accept \(H_1\)).
Now we will test the following hypotheses:
\(H_0: \mu = 24.5\)
\(H_1: \mu \not= 24.5\)
Thus this is a two-sided test.
#t-test
t.test(vc_dat_len, mu = 24.5, alternative = "two.sided")
##
## One Sample t-test
##
## data: vc_dat_len
## t = -4.9939, df = 29, p-value = 2.58e-05
## alternative hypothesis: true mean is not equal to 24.5
## 95 percent confidence interval:
## 13.87675 20.04992
## sample estimates:
## mean of x
## 16.96333
In this case, if we were to test at 0.05 significance level, sincep-value>0.05, we would accept \(H_0\).
#TODO: use subset() to select part of the data of interest
#TODO: take only the len column
Test between the following hypotheses at 0.05 significance level:
\(H_0: \mu \leq 20.5\)
\(H_1: \mu > 20.5\)
##TODO: your code here
Which hypothesis would you accept?
Answer:
Test between the following hypotheses at 0.1 significance level:
\(H_0: \mu \geq 21\)
\(H_1: \mu < 21\)
##TODO: your code here
Which hypothesis would you accept?
Answer:
Test between the following hypotheses at 0.1 significance level:
\(H_0: \mu = 21\)
\(H_1: \mu \not= 21\)
##TODO: your code here
Which hypothesis would you accept?
Answer:
Hypothesis testing between two independent populations means \(\mu_{1}-\mu_{2}\)
Example 1: Toy data
Example 2: Video game sales data
In this lab, we will focus on testing between two independent population means from two samples \(\mu_{1}-\mu_{2}\)
There are two ways that you typically obtain two samples.
data1 with Two samples in two
separate columnsdata1 = data.frame( Sample1 = c(2.5,6.3,7.3,2.8,4.3,2.8,6.1,9.5,3.8,1.5),
Sample2 = c(6.3,8.1,9.5,9.6,1.6,7.5,2.5,6.7,2.1,4.6))
First, you need to check if two samples have approximately the same variance.
#Use `sd` to compute the sample standard deviation
sd(data1$Sample1)
## [1] 2.529581
sd(data1$Sample2)
## [1] 3.000093
#Compute the percent difference in variances for data1
(3.000093-2.529581)/2.529581*100
## [1] 18.60039
We will assume the variances are equal if the sd difference is not more than 10%.
The standard deviations (and so the variances) are different, so we will perform a two sample t-test with unpooled variances.
To perform two sample t-test, use the t-test function
with first sample as the first argument, and the second sample as the
second argument.
# H1: mu1 > mu2 -> alternative = "greater"
t.test(data1$Sample1, data1$Sample2, alternative = "greater")
# H1: mu1 < mu2 -> alternative = "less"
t.test(data1$Sample1, data1$Sample2, alternative = "less")
# H1: mu1 != mu2 -> alternative = "two.sided"
t.test(data1$Sample1, data1$Sample2, alternative = "two.sided")
Let’s try another example.
data2 = data.frame( Sample1 = c(2.5,6.3,7.3,2.8,4.3,2.8,6.1,9.5,3.8,1.5),
Sample2 = c(6.3,6.1,6.5,9.6,1.6,7.5,2.5,5.7,2.1,3.6))
Again, we compute the sample standard deviations of both samples.
#Use `sd` to compute the sample standard deviation
sd(data2$Sample1)
## [1] 2.529581
sd(data2$Sample2)
## [1] 2.603523
In practice, use F-test between two population variances.
For our purpose, assume that pop. variances are equal if the sd difference is not more than 10%
#Compute the percent difference in variances for data2
(2.60-2.53)/2.53*100
## [1] 2.766798
Now they are really close, so the population variances of the two populations might be the same. We will perform a two sample t-test with pooled variance.
To test with pooled variance, specify var.equal = TRUE
in t.test.
# H1: mu1 > mu2 -> alternative = "greater"
t.test(data2$Sample1, data2$Sample2, alternative = "greater", var.equal = TRUE)
# H1: mu1 < mu2 -> alternative = "less"
t.test(data2$Sample1, data2$Sample2, alternative = "less", var.equal = TRUE)
# H1: mu1 != mu2 -> alternative = "two.sided"
t.test(data2$Sample1, data2$Sample2, alternative = "two.sided", var.equal = TRUE)
data3 Two samples in one column,
separated by two levels (Group) of another column.data3 = data.frame( Sample = c(2.5,6.3,7.3,2.8,4.3,2.8,6.1,9.5,3.8,1.5),
Group = c("A","B","A","A","B","B","B","A","B","A"))
Now we want to test the population means between Group A and Group B.
First, we need to check the sample standard deviations of Group A and Group B.
#Take a subset where Group == "A"
data3_A = subset(data3, Group == "A")
#Compute the standard deviation of the `Sample` variable in data3
sd(data3_A$Sample)
## [1] 3.481666
#Take a subset where Group == "B"
data3_B = subset(data3, Group == "B")
#Compute the standard deviation of the `Sample` variable in data3
sd(data3_B$Sample)
## [1] 1.507647
#Compute the percent difference in variances for data 3
The standard deviations are not equal, so we will use t-test with unpooled variances.
In this case, we will use the formula in t-test. The
format is
formula = sample~group,
where sample = sample data
and group = variable that you use to split
sample
You also need to specify the data that you take these variables as an
argument of data in t.test. In our case, the
data is data3 that I defined above.
#Two sample test on `Sample` data between Group == "A" and Group == "B",
# H1: muA > muB -> alternative = "greater"
t.test( formula = Sample~Group, data = data3, alternative = "greater" )
# H1: muA < muB -> alternative = "less"
t.test( formula = Sample~Group, data = data3, alternative = "less" )
# H1: muA != muB -> alternative = "two.sided"
t.test( formula = Sample~Group, data = data3, alternative = "two.sided" )
The UScereal data frame has 65 rows and 11 columns. The
data come from the 1993 ASA Statistical Graphics Exposition, and are
taken from the mandatory F&DA food label. The data have been
normalized here to a portion of one American cup. This data frame
contains the following columns:
mfr Manufacturer, represented by its first initial:
G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston
Purina.calories number of calories in one portion.protein grams of protein in one portion.fat grams of fat in one portion.sodium milligrams of sodium in one portion.fibre grams of dietary fibre in one portion.carbo grams of complex carbohydrates in one
portion.sugars grams of sugars in one portion.shelf display shelf (1, 2, or 3, counting from the
floor).potassium grams of potassium.vitamins vitamins and minerals (none, enriched, or
100%).data("UScereal")
head(UScereal)
## mfr calories protein fat sodium fibre
## 100% Bran N 212.1212 12.121212 3.030303 393.9394 30.303030
## All-Bran K 212.1212 12.121212 3.030303 787.8788 27.272727
## All-Bran with Extra Fiber K 100.0000 8.000000 0.000000 280.0000 28.000000
## Apple Cinnamon Cheerios G 146.6667 2.666667 2.666667 240.0000 2.000000
## Apple Jacks K 110.0000 2.000000 0.000000 125.0000 1.000000
## Basic 4 G 173.3333 4.000000 2.666667 280.0000 2.666667
## carbo sugars shelf potassium vitamins
## 100% Bran 15.15152 18.18182 3 848.48485 enriched
## All-Bran 21.21212 15.15151 3 969.69697 enriched
## All-Bran with Extra Fiber 16.00000 0.00000 3 660.00000 enriched
## Apple Cinnamon Cheerios 14.00000 13.33333 1 93.33333 enriched
## Apple Jacks 11.00000 14.00000 2 30.00000 enriched
## Basic 4 24.00000 10.66667 3 133.33333 enriched
str(UScereal)
## 'data.frame': 65 obs. of 11 variables:
## $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
## $ calories : num 212 212 100 147 110 ...
## $ protein : num 12.12 12.12 8 2.67 2 ...
## $ fat : num 3.03 3.03 0 2.67 0 ...
## $ sodium : num 394 788 280 240 125 ...
## $ fibre : num 30.3 27.3 28 2 1 ...
## $ carbo : num 15.2 21.2 16 14 11 ...
## $ sugars : num 18.2 15.2 0 13.3 14 ...
## $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
## $ potassium: num 848.5 969.7 660 93.3 30 ...
## $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
#create boxplot
boxplot(protein~mfr,data=UScereal)
We will perform three comparison tasks between two population means.
Task 1: you are asked to test whether protein of cereals from manufacturer K is greater than those from manufacturer G.
Let \(\mu_1\) be the average protein
of cereals from manufacturer K
Let \(\mu_2\) be the average protein of
cereals from manufacturer G
Answer: \(H_0:\) \(\mu_1\) ? \(\mu_2\), \(H_1:\) \(\mu_1\) ? \(\mu_2\)
sd function.##TODO: compute the standard deviation of protein of cereals from manufacturer K
##TODO: compute the standard deviation of protein of cereals from manufacturer G
Answer:
t.test.##TODO: perform t-test
Answer:
Example: PlantGrowth data
Exercise: Feedtype data
First, we need to install car package in order to make
normal probability plot
We will use the built-in R PlantGrowth data, which
contains results from an experiment to compare yields (measured by dried
weight of plants) obtained under a control (ctrl) and two
different treatment conditions (trt1 and
trt2).
Assume that the samples are independent
data(PlantGrowth)
str(PlantGrowth)
## 'data.frame': 30 obs. of 2 variables:
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
Let’s see the boxplots of each group.
boxplot(weight ~ group, data = PlantGrowth)
ctrl, trt and
trt2 follow the normal distribution.To check the normality, we plot the Normal Probability
Plot qqPlot.
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
qqPlot(~weight, data= PlantGrowth, subset = group == "ctrl")
## [1] 4 1
qqPlot(~weight, data= PlantGrowth, subset = group == "trt1")
## 17 15
## 7 5
qqPlot(~weight, data= PlantGrowth, subset = group == "trt2")
## 21 28
## 1 8
All the samples are within the confidence interval, so they follow the normal distribution.
ctrl,
trt and trt2 are equal.We need to check that the largest s.d. among the three is smaller than 2xthe smallest s.d.
ctrl_group = subset(PlantGrowth, group == "ctrl")
sd(ctrl_group$weight)
trt1_group = subset(PlantGrowth, group == "trt1")
sd(trt1_group$weight)
trt2_group = subset(PlantGrowth, group == "trt2")
sd(trt2_group$weight)
#Check if largest sd < 2xsmallest sd
0.7936757 < 2*0.4425733
## [1] TRUE
So we pass the standard deviation test.
This depends on how we obtained the data. Here, we will believe the assumption that the samples are independent.
Our goal is to test if the population means of ctrl,
trt1 and trt2 are equal.
Let \(\mu_1\) be the average weight
of all plants that are subjected to the same condition as those in
ctrl.
Let \(\mu_2\) be the average weight of
all plants that are subjected to the same condition as those in
trt1.
Let \(\mu_3\) be the average weight of
all plants that are subjected to the same condition as those in
trt2.
PlantGrowth
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
## 7 5.17 ctrl
## 8 4.53 ctrl
## 9 5.33 ctrl
## 10 5.14 ctrl
## 11 4.81 trt1
## 12 4.17 trt1
## 13 4.41 trt1
## 14 3.59 trt1
## 15 5.87 trt1
## 16 3.83 trt1
## 17 6.03 trt1
## 18 4.89 trt1
## 19 4.32 trt1
## 20 4.69 trt1
## 21 6.31 trt2
## 22 5.12 trt2
## 23 5.54 trt2
## 24 5.50 trt2
## 25 5.37 trt2
## 26 5.29 trt2
## 27 4.92 trt2
## 28 6.15 trt2
## 29 5.80 trt2
## 30 5.26 trt2
Hypotheses:
\(H_0: \mu_1=\mu_2=\mu_3\)
\(H_1: \mu_1\not= \mu_2\) or \(\mu_2\not= \mu_3\) or \(\mu_3 \not=\mu_1\)
plant_anova <- aov(weight ~ group, data = PlantGrowth)
#look at the summary of the model
summary(plant_anova)
In the summary above, the Mean Square of Treatment (MST) is 1.8832 and the Mean Square of Error (MSE) is 0.3886.
The F-value is \(\frac{MST}{MSE} = \frac{1.8832}{0.3886} = 4.846\).
The p-value is \(0.0159\), so we would reject \(H_0\) if we were to test at 0.05 significance level, but we would not reject if we were to test at 0.01 significance level.
20 young pigs are assigned at random among 4 experimental groups. Each group is fed a different diet. (This design is a completely randomized design.) The data are the pig’s weight, in kilograms, after being raised on these diets for 10 months
##TODO: your code here
data = data.frame(weight = c(60.8,57.1,65,58.7,61.8,68.3,67.7,74,66.3,69.9,102.6,102.2,100.5,97.5,98.9,87.9,84.7,83.2,85.8,90.3),
FeedType=as.factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4))
)
str(data)
## 'data.frame': 20 obs. of 2 variables:
## $ weight : num 60.8 57.1 65 58.7 61.8 68.3 67.7 74 66.3 69.9 ...
## $ FeedType: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 2 2 2 2 2 ...
data
## weight FeedType
## 1 60.8 1
## 2 57.1 1
## 3 65.0 1
## 4 58.7 1
## 5 61.8 1
## 6 68.3 2
## 7 67.7 2
## 8 74.0 2
## 9 66.3 2
## 10 69.9 2
## 11 102.6 3
## 12 102.2 3
## 13 100.5 3
## 14 97.5 3
## 15 98.9 3
## 16 87.9 4
## 17 84.7 4
## 18 83.2 4
## 19 85.8 4
## 20 90.3 4
##TODO: your code here
##TODO: check Assumption #1
##TODO: check Assumption #2
Answer:
#TODO: your code here
Answer:
Answer:
Example: cars dataset
Exercise: TV vs sales data
We will be working with the cars dataset (which comes with R).
The dataset cars give the speed of cars and the
distances taken to stop. Note that the data were recorded in the
1920s.
The dataset has two variables:
1. speed: numeric Speed (mph)
2. dist: numeric Stopping distance (ft)
data(cars)
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
We now want to visualize the data to see if we can get an understanding of how it is structured.
plot(dist~speed, data=cars)
This plot makes sense, as the more running speed of a car, the longer the distance it takes to stop the car.
cor(cars$dist,cars$speed)
We will let speed be the independent
variable \(X\)
and dist be the dependent variable \(Y\).
And we will perform a linear regression analysis between \(X\) and \(Y\).
That is, we will find an appropriate values of the coefficients: \(\beta_0\) and \(\beta_1\) such that
\[ Y \approx \beta_0+\beta_1X \].
To run a linear regression, call the function
lm(Y~X, data=name_of_data).
In the following code, we run a linear regression and save the result
in a variable called model.
model <- lm(dist~speed, data=cars)
To look at the result of the analysis, call
summary(model)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The result shows that:
beta0 = -17.58
beta1 = 3.93
SEbeta0 = 6.76
SEbeta1 = 0.42
#Compute the confidence intervals of beta0 and beta1
beta0 - 2*SEbeta0
## [1] -31.1
beta0 + 2*SEbeta0
## [1] -4.06
beta1 - 2*SEbeta1
## [1] 3.09
beta1 + 2*SEbeta1
## [1] 4.77
The 95% confidence interval of \(\beta_0\) is \([-31.1,-4.06]\)
The 95% confidence interval of \(\beta_1\) is \([3.09,4.77]\)
“1 mph increase speed leads to somewhere between 3.09-4.77 increase in
stopping distance with 95% probability”
To plot the regression line, use abline(model) after the
scatter plot.
plot(dist~speed, data=cars)
abline(model, col="red")
Your task is to study how much TV advertising of a product affects the sales volume of that product. The data is given in the following.
dat2 <- data.frame(TV = c(230.1,44.5,17.2,151.5,180.8,8.7,57.5,120.2,8.6,199.8,
66.1,214.7,23.8,97.5,204.1,195.4,67.8,281.4,69.2,147.3),
sales = c(22.1,10.4,9.3,18.5,12.9,7.2,11.8,13.2,4.8,
10.6,8.6,17.4,9.2,9.7,19,22.4,12.5,24.4,11.3,14.6)
)
str(dat2)
## 'data.frame': 20 obs. of 2 variables:
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ sales: num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
Task 1: Get the correlation between the two variables
#TODO: your code here
Task 2: Perform a linear regression with
TV as the independent variable and sales as
the dependent variable. Then view the summary of the regression.
#TODO: your code here
Task 3: From the result, answer the following questions:
What are beta0 and beta1?
Answer:
From this model, how much, on average, would
sales volume increase from $1 additional spending on TV
advertising?
Answer:
What are standard errors (SE) of beta0 and beta1?
Answer:
What are t-statistics of beta0 and beta1?
Answer:
Suppose that you want to test whether the TV advertising
affects the sales volume at \(0.05\) significance level. Write down
appropriate null and alternative hypotheses.
Answer:
From the hypothesis testing in Question 5., would you accept H0
or H1? Explain your answer.
Answer:
Compute the 95% confidence interval of beta0 and beta1
##TODO: compute the confidence interval
Answer:
Task 4: Plot the scatter plot of the data and the
regression line. TV is on the horizontal axis, and
sales is on the vertical axis.
#TODO: your code here