Lecture Notes Chapter 1-7

Lab 1: Introduction to R

  1. Installing R and RStudio
  2. Basic data types
  3. Arithmetic with R
  4. Working with vectors
  5. Data structure
  6. Lab1 Assignment

1. Installing R and RStudio

R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. Users have created packages to augment the functions of the R language.

To work with R, you need to download and install the program. I recommend that you also install RStudio. RStudio is a separate program that makes R easier to use.

Windows

Install R

Download the latest R installer (.exe) for Windows. Run the downloaded installer as you would any other Windows application.

Install RStudio

Now that R is installed, you need to download and install RStudio. First download the installer for Windows. Run the installer (.exe file) and follow the instructions.

Mac

Install R

First download the latest release (“R-version.pkg”) of R. Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, you need to download and install RStudio.

Install RStudio

First download the version for Mac. After downloading, double-click the file to open it, and then drag and drop it into your Applications folder.

RStudio: integrated development environment for the programming language R

2. Basic data types

Numeric and integer values

# create a variable "a" that is a vector of one number; both "<-" and "=" can be used for assignment
a <- 7
a = 7
a
## [1] 7
# create a vector that consists of multiple numbers
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
# create a regular sequence of whole numbers
c <- 1:10
c
##  [1]  1  2  3  4  5  6  7  8  9 10
# seq(from, to, by) function to create a sequence of numbers
d <- seq(1,10,0.5)
d
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
# rep() function to repeat a single number, or a sequence of numbers.

rep(99, times=5)
## [1] 99 99 99 99 99
rep(c(1,2,3), times=5)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),each = 5)
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Character values

x <- "Monday"
class(x)
y <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
class(y)

Factors

A factor is a nominal (categorical) variable with a set of known possible values called levels. They can be created using the as.factor function.

yy <- as.factor(y)
yy
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed

3. Arithmetic with R

In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /
  • Exponentiation: ^ Ex: \(2^3\) = 2^3 = 2*2*2 = 8
  • Modulo (remainder): %% Ex: 20 modulo 3 = 20 %% 3 = 2, since 20 = 6*3 + 2
# An addition
5 + 5 
# A subtraction
5 - 5 
# A multiplication
3 * 5
 # A division
(5 + 5) / 2 
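A small added illustration of the exponentiation and modulo operators from the list above:

# An exponentiation
2^3
# A modulo (remainder of 20 divided by 3)
20 %% 3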

 # create a vector
x <- c(3, 5, 7, 1, 9, 20)

# multiply each element by 5
##TODO: your code here

# create a second vector
y <- c(17, 15, 13, 19, 11, 0)

# add both vectors
##TODO

# multiply both vectors
##TODO

4. Working with vectors

Positional index

To extract elements based on their position we simply write the position inside the [ ]. For example, to extract the 3rd value of z

z <- c(2, 3, 1, 6, 4, 3, 3, 7)
z[3]     # extract the 3rd value


# if you want to store this value in another object
z_3 <- z[3]
z_3

We can also extract more than one value by using the c() function inside the square brackets. Here we extract the 1st, 5th, 6th and 8th element from the z vector

z[c(1, 5, 6, 8)]

Or we can extract a range of values using the : notation. To extract the values from the 3rd to the 8th elements

z[3:8]

Logical index

Another really useful way to extract data from a vector is to use a logical expression as an index.

z[z >= 4]        # values greater or equal to 4

z[z < 4]         # values less than 4

z[z <= 4]        # values less than or equal to 4

z[z == 4]        # values equal to 4

z[z != 4]        # values not equal to 4

val26 <- z[z < 6 & z > 2] # extract values which are less than 6 AND greater than 2
val26

val63 <- z[z > 6 | z < 3] # extract values that are greater than 6 OR less than 3
val63

5. Data structure

Matrix

mat0 <- matrix(NA, ncol=3, nrow=2)
mat0
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
mat1 <- matrix(1:6, ncol=3, nrow=2)
mat1
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
mat2 <- matrix(1:6, ncol=3, nrow=2,byrow = TRUE)
mat2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
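As a small addition (not in the original examples), elements of a matrix can be extracted with [row, column] indexing, similar to vectors:

# extract single elements, whole rows, or whole columns of mat2
mat2[1, 3]   # element in row 1, column 3
mat2[2, ]    # the entire second row
mat2[, 1]    # the entire first column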

List

A list is a very flexible container to store data. Each element of a list can contain any type of R object, e.g. a vector, matrix, data.frame, another list, or more complex data types.

lst <- list(d,yy,mat2)
lst
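As a brief added note, individual elements of a list can be extracted with double square brackets:

lst[[1]]   # the first element of the list (the vector d)
lst[[3]]   # the third element of the list (the matrix mat2)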

Data frame

The data.frame is commonly used for statistical data analysis in R. It is a special type of list that requires that all elements (variables) have the same length.

Basic data types in R

R works with numerous data types. Some of the most basic types to get started are:

  • Decimal values like 4.5 are called numerics.
  • Whole numbers like 4 are called integers. Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) (e.g. “world”) values are called characters.
# Loading Built in Datasets in R
data(ToothGrowth)
# Print the first 6 rows
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
# Print the first 15 rows
ToothGrowth[1:15,]
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 13 15.2   VC  1.0
## 14 17.3   VC  1.0
## 15 22.5   VC  1.0
# Check data type of variables in the dataframe
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

6. Lab1 Assignment

Your tasks are to:

    1. Create a sequence of numbers 10 to 18 and store it in vector aa
    2. Create vector bb containing each digit of your student ID (for example, if the student ID is 650510999, create vector c(6,5,0,5,1,0,9,9,9))
    3. Subtract vectors aa and bb, and store the result in vector cc
    4. In the vector cc, extract values less than 2
    5. In the vector cc, extract values not equal to 3
    6. In the vector cc, extract values which are less than 5 AND greater than 2
    7. In the vector cc, extract values which are less than 5 OR greater than 2
    8. Create vector dd that contains the names of the 12 months as characters
    9. Set vector dd as a factor
    10. Replicate vector dd 3 times
    11. Replicate each element in vector dd 4 times
    12. Create a list of aa, bb, cc, and dd

Lab 2: Numerical Presentation

  1. Data structure
  2. Import data from excel (.csv file) into R
  3. Numerical presentation for qualitative data or categorical data
  4. Numerical presentation for quantitative data
  5. Lab2 Assignment

1. Data structure

Data frame

The data.frame is commonly used for statistical data analysis in R. It is a special type of list that requires that all elements (variables) have the same length.

Basic data types in R

R works with numerous data types. Some of the most basic types to get started are:

  • Decimal values like 4.5 are called numerics.
  • Whole numbers like 4 are called integers. Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) (e.g. “world”) values are called characters.
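As a small illustration (with made-up values, not a dataset used in this lab), a data frame can be built from equal-length vectors of different basic types and inspected with str():

# a toy data frame with a numeric, a character, and a logical column
df <- data.frame(id = 1:3,
                 day = c("Mon", "Tue", "Wed"),
                 passed = c(TRUE, FALSE, TRUE))
str(df)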

2. Numerical presentation for qualitative data or categorical data

We will use the Insurance data frame. It consists of the numbers of policyholders of an insurance company who were exposed to risk, and the numbers of car insurance claims made by those policyholders in the third quarter of 1973. This data frame contains the following columns:

  • District as factor: district of residence of policyholder (1 to 4); district 4 corresponds to major cities.
  • Group as an ordered factor: group of car with levels <1 litre, 1–1.5 litre, 1.5–2 litre, >2 litre.
  • Age as an ordered factor: the age of the insured in 4 groups labelled <25, 25–29, 30–35, >35.
  • Holders: numbers of policyholders.
  • Claims: numbers of claims
library(MASS)
data("Insurance")
head(Insurance)
str(Insurance)
  • summary() function
  • frequency table with table() function
  • contingency table (Two-Way Frequency Tables)
  • percentage or proportion with prop.table() function
# overall summary of all variables in the data frame
summary(Insurance)
#frequency table for one categorical variable
table(Insurance$District)
table(Insurance$Age)
table(Insurance$Group)
prop.table(table(Insurance$Age))
prop.table(table(Insurance$Age))*100
#Two-Way Frequency Tables
tb <- table(Insurance$District,Insurance$Age)

# Add margin totals to your table. 
addmargins(tb)
#Calculate percentage
prop.table(tb)*100

3. Numerical presentation for quantitative data

  • summary() function
  • max() and min() functions
  • mean() function
  • median() function
  • There is no built-in function to compute the mode in R. To compute the mode, use the getmode() function defined below.
#define function to calculate mode
getmode <- function(x) {
  u <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}
  • range() function
  • quantile(, probs = ) function, where probs can be 0.25, 0.5, 0.75 for the first, second, and third quartiles
  • IQR() function
  • sd() and var() function
# numerical summaries for the Holders variable
summary(Insurance$Holders)
min(Insurance$Holders)
range(Insurance$Holders)
mean(Insurance$Holders)
getmode(Insurance$Holders)
IQR(Insurance$Holders)
sd(Insurance$Holders)
quantile(Insurance$Holders,probs = 0.25)
quantile(Insurance$Holders,probs = 0.50)
quantile(Insurance$Holders,probs = 0.75)
quantile(Insurance$Holders,probs = c(0.25,0.50,0.75))
cv_Holders <- sd(Insurance$Holders)/mean(Insurance$Holders)
cv_Holders
#cv of Claims by District (1,2)
dt1 <- Insurance$Claims[Insurance$District=="1"]
mean1 <- mean(dt1)
sd1 <- sd(dt1)
cv1 <- sd1/mean1

dt2 <- Insurance$Claims[Insurance$District=="2"]
mean2 <- mean(dt2)
sd2 <- sd(dt2)
cv2 <- sd2/mean2
  • aggregate() function is used to get the summary statistics of the data by group.
# aggregate: compute summary statistics of data subsets
aggregate(Claims~Group, data = Insurance, FUN=summary)
aggregate(Claims~District+Age, data = Insurance, FUN=summary)
aggregate(Claims~District+Age, data = Insurance, FUN=mean)

4. Lab2 Assignment

Perform the following numerical data presentation for the Insurance dataset:

    1. Create a frequency table of Age, with proportions
    2. Create a two-way frequency table of Age by Group category, with percentages
    3. Compute summary statistics of Claims
    4. Compute Q1, Q2, and Q3 of Claims, as well as the IQR
    5. Compute the CV of Claims between Age < 25 and 25-29
    6. Compute summary statistics of Claims by Age
    7. Compute the sd of Claims by Age and Group

Lab 3: Graphical Presentation

  1. Graphical presentation for qualitative data or categorical data
  2. Graphical presentation for quantitative data
  3. Lab3 Assignment

We will use the survey dataset. This data frame contains the responses of 237 Statistics I students at the University of Adelaide to a number of questions. The components of the data frame are:

  • Sex : The sex of the student. (Factor with levels “Male” and “Female”.)
  • Wr.Hnd : span (distance from tip of thumb to tip of little finger of spread hand) of writing hand, in centimetres.
  • NW.Hnd : span of non-writing hand.
  • W.Hnd : writing hand of student. (Factor, with levels “Left” and “Right”.)
  • Fold : “Fold your arms! Which is on top” (Factor, with levels “R on L”, “L on R”, “Neither”.)
  • Pulse : pulse rate of student (beats per minute).
  • Clap : ‘Clap your hands! Which hand is on top?’ (Factor, with levels “Right”, “Left”, “Neither”.)
  • Exer : how often the student exercises. (Factor, with levels “Freq” (frequently), “Some”, “None”.)
  • Smoke : how much the student smokes. (Factor, levels “Heavy”, “Regul” (regularly), “Occas” (occasionally), “Never”.)
  • Height : height of the student in centimetres.
  • M.I : whether the student expressed height in imperial (feet/inches) or metric (centimetres/metres) units. (Factor, levels “Metric”, “Imperial”.)
  • Age : age of the student in years.
library(MASS)
data("survey")
head(survey)
##      Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height      M.I
## 1 Female   18.5   18.0 Right  R on L    92    Left Some Never 173.00   Metric
## 2   Male   19.5   20.5  Left  R on L   104    Left None Regul 177.80 Imperial
## 3   Male   18.0   13.3 Right  L on R    87 Neither None Occas     NA     <NA>
## 4   Male   18.8   18.9 Right  R on L    NA Neither None Never 160.00   Metric
## 5   Male   20.0   20.0 Right Neither    35   Right Some Never 165.00   Metric
## 6 Female   18.0   17.7 Right  L on R    64   Right Some Never 172.72 Imperial
##      Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000
#check data structure 
str(survey)
## 'data.frame':    237 obs. of  12 variables:
##  $ Sex   : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
##  $ Wr.Hnd: num  18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
##  $ NW.Hnd: num  18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
##  $ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
##  $ Fold  : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
##  $ Pulse : int  92 104 87 NA 35 64 83 74 72 90 ...
##  $ Clap  : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
##  $ Exer  : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
##  $ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
##  $ Height: num  173 178 NA 160 165 ...
##  $ M.I   : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
##  $ Age   : num  18.2 17.6 16.9 20.3 23.7 ...

1. Graphical presentation for qualitative data or categorical data

In the case that you have a factor variable, you can display a frequency table or a bar graph.

You can use the good ol’ summary() on a factor variable to display a frequency table.
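For example, applying summary() to the Smoke factor returns the count of each level (and the number of missing values, if any):

summary(survey$Smoke)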

Bar graph

freq <- table(survey$Smoke)
freq
## 
## Heavy Never Occas Regul 
##    11   189    19    17
barplot(freq)

Stacked Bar Plot with Colors and Legend

counts <- table(survey$Sex, survey$Smoke)
barplot(counts,
  legend = rownames(counts))

Grouped Bar Plot

counts <- table(survey$Sex, survey$Smoke)
barplot(counts,col=c("blue","red"),
  legend = rownames(counts), beside=TRUE)

Pie chart

counts <- table(survey$Exer)
pie(counts, main = "Exercise pie chart")

2. Graphical presentation for quantitative data

Histogram

  • To display a histogram of a vector, use hist()
hist(survey$Height)

Sometimes, you want to change the title of the plot, and also the label of the horizontal axis and vertical axis. To do that, set the parameters main, xlab and ylab

hist(survey$Height, main = "Histogram of Height", xlab = "height", ylab = "counts")

To change the number of bins, set the breaks parameter. R treats this value as a suggestion and will choose a number of bins close to the number that you specify. For example, the following code displays a histogram with a number of bins close to 20:

hist(survey$Height, breaks = 20, 
     main = "Histogram of Height", 
     xlab = "height", 
     ylab = "counts")

Stem and leaf plot

  • To display a stem and leaf plot of a vector, use stem().
stem(survey$Height)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   15 | 0224
##   15 | 555566777777899
##   16 | 00000000333333334444
##   16 | 555555555555555555677777777888888888888899999
##   17 | 000000000000000000111112222222233333333334
##   17 | 555555555566777778888999999
##   18 | 0000000000000000023333333344
##   18 | 55555555777888899
##   19 | 00011123
##   19 | 56
##   20 | 0
#compare with its histogram 

Boxplot

  • To display a boxplot of a vector, use boxplot().
boxplot(survey$Height,main="Boxplot of height")

  • To display the boxplot horizontally, set the parameter horizontal = TRUE
boxplot(survey$Height, horizontal = TRUE)

You can also display boxplots by a grouping variable.

boxplot(survey$Height~survey$Sex,xlab="Sex",ylab="Height")

boxplot(survey$Height~survey$Exer,xlab="Exercise",ylab="Height")

Scatterplot

Note that you need two vectors to plot a scatter plot: one for the x-axis, and one for the y-axis.

Suppose that you have two variables: var1 and var2. To display a scatter plot between var1 and var2, use plot(x = var1, y = var2).

plot(x = survey$Age, y = survey$Height,xlab="Age",ylab="Height")

3. Lab3 Assignment

Perform the following graphical data presentation for the survey dataset:

    1. How would you present the writing hand of the students (W.Hnd data)?
    2. If you would like to examine the number of students by writing hand (W.Hnd) and how often they exercise (Exer), which graph would you use to present such information?
    3. How would you present the pulse rate of the students (Pulse data)?
    4. If you would like to compare the pulse rate (Pulse) between students who exercise with different frequencies (Exer), which graph would you use to present such information?
    5. If you would like to examine the relationship between the pulse rate of the students (Pulse) and their age (Age), which graph would you use to present such information?

Lab 4: Probability distribution

  1. Binomial distribution
  2. Normal distribution
  3. t distribution

1. Binomial distribution

Use rbinom to generate random numbers from a binomial distribution.

Recall that binomial distribution requires two parameters: \(n\) (number of trials) and \(p\) (probability of success).

The following code generates 40 random numbers from \(X\sim \text{Binomial}(20,1/6)\):

rbinom( n = 40, size = 20, prob = 1/6 )

Use dbinom to compute probabilities of a Binomial variable.

The following code computes \(P(X=4)\), where \(X\sim \text{Binomial}(20,1/6)\):

dbinom( x = 4, size = 20, prob = 1/6 )
## [1] 0.2022036

You can also compute cumulative probability.

The following code computes \(P(X\leq 4)\), where \(X\sim \text{Binomial}(20,1/6)\).

Note that \(P(X\leq 4)=P(X= 0)+P(X= 1)+P(X= 2)+P(X= 3)+P(X= 4)\)

pbinom( q= 4, size = 20, prob = 1/6)
## [1] 0.7687492

From this, we can compute \(P(X>4)\) using \(P(X\leq 4)+P(X>4) = 1\):

##P(X>4)

1 - pbinom( q= 4, size = 20, prob = 1/6)
## [1] 0.2312508

The Bernoulli distribution is a special case of the Binomial distribution where \(n=1\): \(\text{Binomial}(1,p) = \text{Bernoulli}(p)\). Thus we can generate random Bernoulli numbers using rbinom( n=..., size = 1, prob=...)

For example: suppose we want to flip a fair coin 40 times:

rbinom( n=40, size = 1, prob=0.5)

Exercise 1: There are 40 washing machines in a laundromat. Each machine has a 20% chance (probability 0.2) of malfunctioning by the end of the year.
Exercise 1.1: Compute the probability that there will be 5 malfunctioning washing machines at the end of the year.
##TODO: your code here
##P(X=5)
Exercise 1.2: Compute the probability that there will be less than or equal to 5 malfunctioning washing machines at the end of the year.
##TODO: your code here
##P(X<=5) = P(X<5) + P(X=5)
Exercise 1.3: Compute the probability that there will be less than 5 malfunctioning washing machines at the end of the year (Hint: use answers from 1.1 and 1.2).
##TODO: your code here
##P(X<5)
Exercise 1.4: Compute the probability that there will be more than 10 malfunctioning washing machines at the end of the year.
## TODO: your code here
## P(X>10) = 1 - P(X <= 10)

2. Normal distribution

Recall that the normal distribution can be used to generate numeric values, with numbers concentrated around the mean.

The normal distribution requires two parameters: the mean and the standard deviation (sd).

To generate numbers from a normal distribution, use rnorm

##Generate 40 random numbers from N(mu = 10, sigma = 100)
rnorm( n=40, mean=10, sd=100 )

If you want to generate numbers from the standard normal distribution, you can omit the mean and sd parameters (they default to 0 and 1):

rnorm(n=40)
rnorm(n=40,mean=0,sd=1)

Let’s look at the histogram of generated data.

normal.a <- rnorm( n=100, mean=0, sd=1 )
hist( normal.a )

Examples:

Suppose IQ’s are normally distributed with a mean of 100 and a standard deviation of 15.

Thus IQ of a random person can be represented by a random variable \(X\sim N(100,15)\)

What percentage of people have an IQ less than or equal to 125?
## P(X <= 125)

pnorm(125, mean = 100, sd = 15)
## [1] 0.9522096

What percentage of people have an IQ greater than 110?

Set lower.tail = FALSE to compute the upper tail probability.

## P(X > 110)

pnorm(110, mean = 100, sd = 15, lower.tail=FALSE)
## [1] 0.2524925

You can check the law of total probability by adding the lower tail and the upper tail.

pnorm(110, mean = 100, sd = 15) + pnorm(110, mean = 100, sd = 15, lower.tail=FALSE)
## [1] 1

What percentage of people have an IQ between 110 and 125?

You need to think of these problems as calculating the area under the curve!

\(P(110<X<125)\) = area under the curve of \(N(100,15)\) between 110 and 125, which is the lower tail area of 125 minus the lower tail area of 110

## P(110<X<125)

pnorm(125, mean = 100, sd = 15) - pnorm(110, mean = 100, sd = 15)
## [1] 0.2047022

Exercise 2: The distribution of the height of males in the US is N(178,8) (in centimetres)
Exercise 2.1: Compute a probability that a random male is shorter than 170cm.
##TODO: your code here 
##Note: P(X<170) = P(X <= 170) (only true for continuous distributions!)
Exercise 2.2: Compute a probability that a random male is shorter than 190cm.
##TODO: your code here
Exercise 2.3: Compute a probability that a random male is between 170cm and 180cm.
##TODO: your code here

The function qnorm() is the inverse of pnorm(): it takes a probability value and returns the number whose cumulative probability matches that value.

qnorm(0.05, mean = 0, sd = 1)
## [1] -1.644854
qnorm(0.1, mean = 0, sd = 1)
## [1] -1.281552
qnorm(0.5, mean = 0, sd = 1)
## [1] 0
qnorm(0.9, mean = 0, sd = 1)
## [1] 1.281552
qnorm(0.95, mean = 0, sd = 1)
## [1] 1.644854
qnorm(0.975, mean = 0, sd = 1)
## [1] 1.959964
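As an added illustration, applied to the IQ example above, qnorm() returns the IQ score below which a given proportion of people fall:

# the IQ value such that 95% of people score below it
qnorm(0.95, mean = 100, sd = 15)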

3. t distribution

The rt() function generates random deviates of the t-distribution and is written as rt(n, df). We may easily generate n random samples. Recall that the number of degrees of freedom for a t-distribution is equal to the sample size minus one, that is, \(df=n-1\).

n <- 30
df <- n - 1
samp <- rt(n, df)

Plot the samples as a histogram.

hist(samp)

Further we may generate a very large number of random samples and plot them as a histogram.

n <- 1000
df <- n - 1
samples <- rt(n, df)
hist(samples)

By using the dt() function we may calculate the probability density function, and thus, the vertical distance between the horizontal axis and the t-curve at any point. For the purpose of demonstration we construct a t-distribution with \(df=5\) and calculate the probability density function at \(t=−4,−2,0,2,4.\)

x <- seq(-4, 4, by = 2)
dt(x, df = 5)
## [1] 0.005123727 0.065090310 0.379606690 0.065090310 0.005123727

Another very useful function is the pt() function, which returns the area under the t-curve to the left of a given value (the cumulative probability) when lower.tail = TRUE.

pt(-2, df = df, lower.tail = TRUE)
## [1] 0.02288531
pt(-1, df = df, lower.tail = TRUE)
## [1] 0.1587763
pt(0, df = df, lower.tail = TRUE)
## [1] 0.5
pt(1, df = df, lower.tail = TRUE)
## [1] 0.8412237
pt(2, df = df, lower.tail = TRUE)
## [1] 0.9771147

The qt() function returns the quantile function, and thus it is the inverse of pt().

qt(0.5, df = 5, lower.tail = TRUE)
## [1] 0
qt(0.95, df = 5, lower.tail = TRUE)
## [1] 2.015048
qt(0.975, df = 5, lower.tail = TRUE)
## [1] 2.570582

Lab 5: Sampling distribution and Confidence interval for a population mean

  1. Sampling from the population
  2. Law of large numbers
  3. Sampling distribution of sample mean \(\bar{X}\)
  4. Confidence interval for a population mean, \(\mu\)

1. Sampling from the population

The distribution of everyone’s IQ score is \(N(100,15)\).

Here we use rnorm() to simulate the IQ's of 1,000 people in a city.

The function round() is used to round the numbers to integers (e.g. round(93.7) = 94).

pop <- rnorm(1000, mean = 100, sd = 15)
pop <- round(pop)

Let’s look at the histogram of this data. Also, calculate the mean and standard deviation of the population’s data.

hist(pop)
mean(pop)
sd(pop)

2. Law of large numbers

Sample with replacement

Now we do some sampling from pop. Imagine sampling as randomly drawing balls from a box full of balls labeled with numbers in pop.

In sampling with replacement, each ball that we draw is put back in the box for the next sampling.

To sample with replacement, use sample(pop, size, replace = TRUE)

Here, we sample IQ’s of 10 people (size = 10).

# Sample IQ's of 10 people with replacement
iq10_w_replace = sample(pop, size = 10, replace = TRUE)

iq10_w_replace

Sample without replacement

In sampling without replacement, each ball that we draw is not put back in the box. This means that we never draw the same ball twice.

Here, we sample IQ’s of 10 people without replacement (replace = FALSE).

# Sample IQ's of 10 people without replacement
iq10_wo_replace = sample(pop, size = 10, replace = FALSE)

iq10_wo_replace
Exercise 1: Use the mean() function to compute the average IQ of the population, the average IQ of the sample with replacement (iq10_w_replace), and the average IQ of the sample without replacement (iq10_wo_replace).
##Exercise 1
##TODO:

The law of large numbers tells you that the larger the sample size, the closer the sample mean is to the population mean.

Exercise 2: sample 100 random numbers from pop with replacement and compute their average.
##Exercise2
#TODO:
Exercise 3: sample 500 random numbers from pop with replacement and compute their average.
##Exercise3
#TODO: 

3. Sampling distribution of sample mean \(\bar{X}\)

From the law of large number, we can use the sample mean to estimate the population mean. But are we confident that the sample mean is actually close to the population mean? We will look at the distribution of the sample mean to answer this question.

First, we need to sample from the population a number of times; let's say 20 times, each with 10 observations (\(n=10\)). Then we compute the means of these 20 samples.

#calculate the means of 20 samples of size 10, one at a time
m1 = mean(sample(pop, size = 10, replace = TRUE))
m2 = mean(sample(pop, size = 10, replace = TRUE))
m3 = mean(sample(pop, size = 10, replace = TRUE))
m4 = mean(sample(pop, size = 10, replace = TRUE))
m5 = mean(sample(pop, size = 10, replace = TRUE))
m6 = mean(sample(pop, size = 10, replace = TRUE))
m7 = mean(sample(pop, size = 10, replace = TRUE))
m8 = mean(sample(pop, size = 10, replace = TRUE))
m9 = mean(sample(pop, size = 10, replace = TRUE))
m10 = mean(sample(pop, size = 10, replace = TRUE))
m11 = mean(sample(pop, size = 10, replace = TRUE))
m12 = mean(sample(pop, size = 10, replace = TRUE))
m13 = mean(sample(pop, size = 10, replace = TRUE))
m14 = mean(sample(pop, size = 10, replace = TRUE))
m15 = mean(sample(pop, size = 10, replace = TRUE))
m16 = mean(sample(pop, size = 10, replace = TRUE))
m17 = mean(sample(pop, size = 10, replace = TRUE))
m18 = mean(sample(pop, size = 10, replace = TRUE))
m19 = mean(sample(pop, size = 10, replace = TRUE))
m20 = mean(sample(pop, size = 10, replace = TRUE))

Let’s combine these 20 sample means into a single vector.

sample_means = c(m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13,m14,m15,m16,m17,m18,m19,m20)

sample_means
##  [1]  97.2  95.6 109.3 105.8  95.1  93.0 102.5 101.0 100.5  94.9 100.4  95.9
## [13]  96.3 101.3  97.8 102.8  96.0 103.0 100.9  93.2
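As an aside, the 20 repeated calls above can be written more compactly with replicate(); a minimal sketch (the values will differ from those above because the sampling is random):

# 20 sample means, each computed from a sample of size 10 drawn with replacement
sample_means_alt <- replicate(20, mean(sample(pop, size = 10, replace = TRUE)))
sample_means_alt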

Then we look at the histogram of these sample means.

hist(sample_means)

If you are lucky, the histogram will look like that of a normal distribution.

Exercise 4: compute the mean and the standard deviation of these 20 sample means, as well as the population mean and s.d.
#Exercise4
#TODO: population mean

#TODO: population sd

#TODO: mean of sample means

#TODO: s.d. of sample means

Next, we will perform the same experiment, but with \(n=100\) for 20 sample sets.

n1 = mean(sample(pop, size = 100, replace = TRUE))
n2 = mean(sample(pop, size = 100, replace = TRUE))
n3 = mean(sample(pop, size = 100, replace = TRUE))
n4 = mean(sample(pop, size = 100, replace = TRUE))
n5 = mean(sample(pop, size = 100, replace = TRUE))
n6 = mean(sample(pop, size = 100, replace = TRUE))
n7 = mean(sample(pop, size = 100, replace = TRUE))
n8 = mean(sample(pop, size = 100, replace = TRUE))
n9 = mean(sample(pop, size = 100, replace = TRUE))
n10 = mean(sample(pop, size = 100, replace = TRUE))
n11 = mean(sample(pop, size = 100, replace = TRUE))
n12 = mean(sample(pop, size = 100, replace = TRUE))
n13 = mean(sample(pop, size = 100, replace = TRUE))
n14 = mean(sample(pop, size = 100, replace = TRUE))
n15 = mean(sample(pop, size = 100, replace = TRUE))
n16 = mean(sample(pop, size = 100, replace = TRUE))
n17 = mean(sample(pop, size = 100, replace = TRUE))
n18 = mean(sample(pop, size = 100, replace = TRUE))
n19 = mean(sample(pop, size = 100, replace = TRUE))
n20 = mean(sample(pop, size = 100, replace = TRUE))


sample_means2 = c(n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12,n13,n14,n15,n16,n17,n18,n19,n20)

sample_means2
##  [1]  98.83  98.58  98.73  98.82 100.38 100.49  96.54  99.72  99.23  99.91
## [11] 101.11 100.48  98.79  99.27 101.20 100.63  97.45 100.82 100.47 104.69
Exercise 5: plot the histogram of the sample means. Compared with the histogram from the previous experiment, which histogram is more concentrated? Also, compute the mean and the standard deviation of these 20 sample means with \(n=100\)
#Exercise5
#TODO: Histogram of 20 sample means with n=100

#TODO: mean of sample means

#TODO: s.d. of sample means

4. Confidence interval for a population mean (\(\mu\))

Example 4.1: In a study of maximal aerobic capacity, 12 women were used as subjects, and one measurement that was made was blood plasma volume. The following data give their blood plasma volumes in liters: 3.15, 2.99, 2.77, 3.12, 2.45, 3.85, 2.99, 3.87, 4.06, 2.94, 3.53, and 3.20.

Assume that blood plasma volumes are normally distributed.

#create data vector
bpm <- c(3.15, 2.99, 2.77, 3.12, 2.45, 3.85, 2.99, 3.87, 4.06,2.94,3.53,3.20)
# count number of observations in vector
n <- length(bpm)

Calculate the point estimate of \(\mu\), which is the sample mean (\(\bar{x}\))

xbar <- mean(bpm)

Calculate the point estimate of \(\sigma\), which is the sample standard deviation (s)

s   <-  sd(bpm)

Find a 90% confidence interval for \(\mu\) (case: unknown \(\sigma\) and \(n < 30\))

# alpha = 0.10 , we need 1-(alpha/2) = 1-(0.1/2)=0.95 
tval <- qt(0.95,n-1)
tval
## [1] 1.795885
lower <- xbar-tval*s/sqrt(n)
lower
## [1] 2.99083
upper <- xbar+tval*s/sqrt(n)
upper
## [1] 3.495837
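The same interval can also be obtained directly from the t.test() function (used again in later labs); this should reproduce the lower and upper limits computed above:

t.test(bpm, conf.level = 0.90)$conf.int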

Example 4.2: Suppose average pizza delivery times are normally distributed with an unknown population mean and an unknown population standard deviation. A random sample of 50 pizza delivery restaurants is taken and has a sample mean delivery time of 36 minutes with a standard deviation of 6 minutes.

Find a 95% confidence interval for the mean delivery time.

n <- 50
xbar <- 36
s  <- 6
# alpha = 0.05 , we need 1-(alpha/2) = 1-(0.05/2)=0.975 
zval <- qnorm(0.975)
zval
lower <- xbar-zval*s/sqrt(n)
lower
upper <- xbar+zval*s/sqrt(n)
upper
Exercise 6:

A teacher wants to estimate the mean height of all 400 students at her school. She takes a simple random sample of 45 students and finds the sample mean height is 63 inches with a standard deviation of 1.5 inches.

Construct a 90% confidence interval for the population mean height.

##Exercise6
##TODO:
Exercise 7:

Let X equal the weight in grams of a “52-grams” snack pack of candies. Assume that the distribution of X is \(N(\mu,\sigma=4)\). A random sample of \(n=10\) observations of X yielded the following data: 55.95, 56.54, 57.58, 55.13, 57.48, 56.06, 59.93, 58.30, 52.57 and 58.46.

##Exercise7
##TODO: calculate a point estimator of mu

##TODO: find a 95% confidence interval for mu

Lab 6: Confidence interval for difference of two means

  1. Confidence interval for difference of means from two independent groups, \(\mu_{1}- \mu_{2}\)
  2. Confidence interval for difference of means from paired groups, \(\mu_{d}\)

1. Confidence interval for difference of means from two independent groups, \(\mu_{1}-\mu_{2}\)

We will use the birthwt data frame, which has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass. during 1986. This data frame contains the following columns:

  • low indicator of birth weight less than 2.5 kg.
  • age mother’s age in years.
  • lwt mother’s weight in pounds at last menstrual period.
  • race mother’s race (1 = white, 2 = black, 3 = other).
  • smoke smoking status during pregnancy.
  • ptl number of previous premature labours.
  • ht history of hypertension.
  • ui presence of uterine irritability.
  • ftv number of physician visits during the first trimester.
  • bwt birth weight in grams.
library(MASS)
data("birthwt")
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622
str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
#change qualitative variables to be factor
birthwt$low <- as.factor(birthwt$low)
birthwt$race <- as.factor(birthwt$race)
birthwt$smoke <- as.factor(birthwt$smoke)
birthwt$ht <- as.factor(birthwt$ht)
birthwt$ui <- as.factor(birthwt$ui)

str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
  1. Get the data summary
summary(birthwt)
  2. Suppose we would like to find a 95% confidence interval for the mean difference of birth weight (bwt) by smoking status (smoke). We first look at the numerical and graphical presentation of bwt by smoke using the boxplot() and aggregate() functions.
boxplot(bwt~smoke,data=birthwt)

aggregate(bwt~smoke,data=birthwt,FUN=summary)
  3. Confidence intervals are part of the output of the t.test function; they can be viewed in isolation by adding $conf.int to the command.
t.test(bwt~smoke, data = birthwt, var.equal = TRUE, conf.level=.95)$conf.int
## [1]  72.75612 494.79735
## attr(,"conf.level")
## [1] 0.95
  4. Compute the 95% confidence interval using the formula \((\bar{x}_{1} - \bar{x}_{2}) \pm t_{1-\frac{\alpha}{2},n_{1}+n_{2}-2} s_{p} \sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}\) for unknown \(\sigma\) and \(n_{1},n_{2} < 30\), with variances assumed equal for the two groups.
d1 <- birthwt$bwt[birthwt$smoke=="0"]
d2 <- birthwt$bwt[birthwt$smoke=="1"]
n1 <- length(d1)
n2 <- length(d2)
xbar1 <- mean(d1)
xbar2 <- mean(d2)
s1 <- sd(d1)
s2 <- sd(d2)
# at alpha=0.05, we need 1-(alpha/2)=1-(0.05/2)=0.975
tval <- qt(0.975,n1+n2-2)
tval
## [1] 1.972731
sp <- sqrt(((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2))
sp
## [1] 717.7792
lower <- (xbar1-xbar2)-tval*sp*sqrt(1/n1+1/n2)
lower
## [1] 72.75612
upper <- (xbar1-xbar2)+tval*sp*sqrt(1/n1+1/n2)
upper
## [1] 494.7973
Exercise 1: Suppose we would like to find a 90% confidence interval for the mean difference of bwt by ht (history of hypertension), assuming equal variances for the two groups.
##Exercise 1

##TODO: calculate n1,n2,xbar1,xbar2,s1,s2,sp


##TODO: find 90% confidence interval for mu1-mu2


##TODO: find 90% confidence interval for mu1-mu2 using t.test function

2. Confidence interval for difference of means from paired groups, \(\mu_{d}\)

A researcher wants to examine whether music has an effect on the perceived psychological effort required to perform an exercise session. To test this, the researcher recruited 12 runners who each ran three times on a treadmill for 30 minutes. For consistency, the treadmill speed was the same for all three runs. In a random order, each subject ran: (a) listening to no music at all; (b) listening to classical music; and (c) listening to dance music. At the end of each run, subjects were asked to record how hard the running session felt on a scale of 1 to 10, with 1 being easy and 10 extremely hard.

#create data
music_data <- data.frame(Runner = rep(seq(1,12,by=1),times=3),
                      Type =rep(c("None","Classical","Dance"),each=12),
                      Scale = c(8,7,6,8,5,9,7,8,8,7,7,9,8,6,8,9,8,7,7,
                                7,6,6,8,9,7,6,6,7,5,7,7,7,8,6,6,6)
)
                          
#set 'Type' to be a factor
music_data$Type <- as.factor(music_data$Type)
#create boxplot of 'Scale' by 'Type'
boxplot(Scale~Type,data=music_data)

  1. Suppose we would like to find a 90% confidence interval of the mean scale difference between running while listening to no music and to dance music. In this case, we need a confidence interval for \(\mu_{d}\) as the samples are paired (each subject runs under every type of music).

Confidence intervals for \(\mu_{d}\) can be obtained using the t.test function with paired=TRUE, viewed in isolation by adding $conf.int to the command.

x <- music_data$Scale[music_data$Type=="None"]
y <- music_data$Scale[music_data$Type=="Dance"]
t.test(x,y, paired = TRUE,conf.level=0.90)$conf.int
## [1] 0.4499076 1.3834258
## attr(,"conf.level")
## [1] 0.9
  2. We can compute the 90% confidence interval for \(\mu_{d}\) with the following formula: \(\bar{d} \pm t_{1-\frac{\alpha}{2},n-1} \frac{s_{d}}{\sqrt{n}}\)
n <- length(x)
d <- x-y
dbar <- mean(d)
sd   <- sd(d)
# at alpha=0.10, 1-(alpha/2)=1-(0.10/2)=0.95
tval <- qt(0.95,n-1)
lower <- dbar - tval*sd/sqrt(n)
lower
## [1] 0.4499076
upper <- dbar + tval*sd/sqrt(n)
upper
## [1] 1.383426
Exercise 2: Find 95% confidence interval of mean scale difference between running while listening to Classical music and Dance music.
##Exercise 2

##TODO: calculate n, dbar, sd

##TODO: find 95% confidence interval for mu_d

##TODO: find 95% confidence interval for mu_d using t.test function

Lab 7: Hypothesis testing for one-sample

  1. Test for population mean (\(\mu\)) using \(p\)-value

1. Test for a population mean (\(\mu\)) using the \(p\)-value

In this example, we will use dataset ToothGrowth which is about the effect of Vitamin C on tooth growth in Guinea Pigs. A data frame with 60 observations on 3 variables.

  • len numeric Tooth length
  • supp factor Supplement type (VC or OJ).
  • dose numeric Dose in milligrams/day
data("ToothGrowth")
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#data summary
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
## aggregate: compute summary statistics of data subsets by categorical data
aggregate(len~supp, data = ToothGrowth, FUN=summary)
##   supp len.Min. len.1st Qu. len.Median len.Mean len.3rd Qu. len.Max.
## 1   OJ  8.20000    15.52500   22.70000 20.66333    25.72500 30.90000
## 2   VC  4.20000    11.20000   16.50000 16.96333    23.10000 33.90000
aggregate(len~supp, data = ToothGrowth, FUN=length)
##   supp len
## 1   OJ  30
## 2   VC  30
boxplot(len~supp, data = ToothGrowth)

Example 1: VC Supplement type

We will take the data of Tooth length for Vitamin C (VC).

Use the subset() function to select data by their values.

#Take the Tooth length for VC
vc_dat = subset(ToothGrowth, supp == "VC" )
head(vc_dat)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
vc_dat_len = vc_dat$len
n_vc = length(vc_dat_len)

Let \(\mu\) be the average tooth length of all observations for VC supplement
We will test the following hypotheses:
\(H_0: \mu \leq 24.5\)
\(H_1: \mu > 24.5\)

Thus this is an upper test. In R, a z-test can be performed via the z.test function when \(n > 30\).

The BSDA package is required before using the z.test function.

A t-test can be performed via the t.test function.

#t-test
t.test(vc_dat_len, mu = 24.5, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  vc_dat_len
## t = -4.9939, df = 29, p-value = 1
## alternative hypothesis: true mean is greater than 24.5
## 95 percent confidence interval:
##  14.39907      Inf
## sample estimates:
## mean of x 
##  16.96333

Question 1: Do you accept \(H_1\) at 0.05 significance level? Explain your answer.

Answer:

Question 2: Do you accept \(H_1\) at 0.01 significance level? Explain your answer.

Answer:

Now we will test the following hypotheses:
\(H_0: \mu \geq 25.5\)
\(H_1: \mu < 25.5\)
Thus this is a lower test.

#t-test
t.test(vc_dat_len, mu = 25.5, alternative = "less")
## 
##  One Sample t-test
## 
## data:  vc_dat_len
## t = -5.6566, df = 29, p-value = 2.051e-06
## alternative hypothesis: true mean is less than 25.5
## 95 percent confidence interval:
##      -Inf 19.52759
## sample estimates:
## mean of x 
##  16.96333

In this case, if we were to test at the 0.05 significance level, since the p-value < 0.05, we would reject \(H_0\) (accept \(H_1\)).

The same conclusion holds at the 0.10 significance level, since the p-value < 0.10.

Now we will test the following hypotheses:
\(H_0: \mu = 24.5\)
\(H_1: \mu \not= 24.5\)
Thus this is a two-sided test.

#t-test
t.test(vc_dat_len, mu = 24.5, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  vc_dat_len
## t = -4.9939, df = 29, p-value = 2.58e-05
## alternative hypothesis: true mean is not equal to 24.5
## 95 percent confidence interval:
##  13.87675 20.04992
## sample estimates:
## mean of x 
##  16.96333

In this case, if we were to test at the 0.05 significance level, since the p-value < 0.05, we would reject \(H_0\) (accept \(H_1\)).

Example 2: OJ Supplement type

Question 3: Select the data of OJ Supplement type. Test the following hypotheses, all at 0.05 significance level

#TODO: use subset() to select part of the data of interest

#TODO: take only the len column

Question 4: Let \(\mu\) be tooth length of all observations for OJ supplement . Test the following hypotheses:

  1. Test between the following hypotheses at 0.05 significance level:

    \(H_0: \mu \leq 20.5\)
    \(H_1: \mu > 20.5\)

##TODO: your code here

Which hypothesis would you accept?
Answer:

  2. Test between the following hypotheses at 0.1 significance level:

    \(H_0: \mu \geq 21\)
    \(H_1: \mu < 21\)

##TODO: your code here

Which hypothesis would you accept?
Answer:

  3. Test between the following hypotheses at 0.1 significance level:

    \(H_0: \mu = 21\)
    \(H_1: \mu \not= 21\)

##TODO: your code here

Which hypothesis would you accept?
Answer:

Lab 8: Hypothesis testing for two samples

Hypothesis testing between two independent population means \(\mu_{1}-\mu_{2}\)

  • Example 1: Toy data

  • Example 2: Nutritional and Marketing Information on US Cereals (UScereal data)

Hypothesis testing between two independent population means \(\mu_{1}-\mu_{2}\)

In this lab, we will focus on testing between two independent population means from two samples \(\mu_{1}-\mu_{2}\)

Example 1: Toy data

There are two ways that you typically obtain two samples.

  1. The data are in a data frame data1 with two samples in two separate columns:
data1 = data.frame( Sample1 = c(2.5,6.3,7.3,2.8,4.3,2.8,6.1,9.5,3.8,1.5),
                    Sample2 = c(6.3,8.1,9.5,9.6,1.6,7.5,2.5,6.7,2.1,4.6))

First, you need to check if two samples have approximately the same variance.

#Use `sd` to compute the sample standard deviation 
sd(data1$Sample1)
## [1] 2.529581
sd(data1$Sample2)
## [1] 3.000093
#Compute the percent difference in variances for data1
(3.000093-2.529581)/2.529581*100
## [1] 18.60039

We will assume the variances are equal if the sd difference is not more than 10%.

The standard deviations (and so the variances) are different, so we will perform a two sample t-test with unpooled variances.

To perform a two sample t-test, use the t.test function with the first sample as the first argument and the second sample as the second argument.

# H1: mu1 > mu2 -> alternative = "greater"
t.test(data1$Sample1, data1$Sample2, alternative = "greater")
# H1: mu1 < mu2 -> alternative = "less"
t.test(data1$Sample1, data1$Sample2, alternative = "less")
# H1: mu1 != mu2 -> alternative = "two.sided"
t.test(data1$Sample1, data1$Sample2, alternative = "two.sided")

Let’s try another example.

data2 = data.frame( Sample1 = c(2.5,6.3,7.3,2.8,4.3,2.8,6.1,9.5,3.8,1.5),
                    Sample2 = c(6.3,6.1,6.5,9.6,1.6,7.5,2.5,5.7,2.1,3.6))

Again, we compute the sample standard deviations of both samples.

#Use `sd` to compute the sample standard deviation 
sd(data2$Sample1)
## [1] 2.529581
sd(data2$Sample2)
## [1] 2.603523

In practice, use an F-test between two population variances (a var.test() sketch is shown after the pooled t-test code below).

For our purpose, assume that the population variances are equal if the sd difference is not more than 10%.

#Compute the percent difference in standard deviations for data2
(2.60-2.53)/2.53*100
## [1] 2.766798

Now they are really close, so the population variances of the two populations might be the same. We will perform a two sample t-test with pooled variance.

To test with pooled variance, specify var.equal = TRUE in t.test.

# H1: mu1 > mu2 -> alternative = "greater"
t.test(data2$Sample1, data2$Sample2, alternative = "greater", var.equal = TRUE)
# H1: mu1 < mu2 -> alternative = "less"
t.test(data2$Sample1, data2$Sample2, alternative = "less", var.equal = TRUE)
# H1: mu1 != mu2 -> alternative = "two.sided"
t.test(data2$Sample1, data2$Sample2, alternative = "two.sided", var.equal = TRUE)
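For reference, the F-test mentioned above can be carried out with the var.test() function from base R; a minimal sketch using data2 (a large p-value is consistent with equal population variances):

# F-test for equality of two population variances
var.test(data2$Sample1, data2$Sample2)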
  2. The data are in a data frame data3 with two samples in one column, separated by the two levels of another column (Group).
data3 = data.frame( Sample = c(2.5,6.3,7.3,2.8,4.3,2.8,6.1,9.5,3.8,1.5),
                    Group = c("A","B","A","A","B","B","B","A","B","A"))

Now we want to test the population means between Group A and Group B.

First, we need to check the sample standard deviations of Group A and Group B.

#Take a subset where Group == "A"
data3_A = subset(data3, Group == "A")

#Compute the standard deviation of the `Sample` variable in data3_A
sd(data3_A$Sample)
## [1] 3.481666
#Take a subset where Group == "B"
data3_B = subset(data3, Group == "B")

#Compute the standard deviation of the `Sample` variable in data3_B
sd(data3_B$Sample)
## [1] 1.507647
#Compute the percent difference in standard deviations for data3
(3.481666-1.507647)/1.507647*100

The standard deviations are not equal, so we will use t-test with unpooled variances.

In this case, we will use the formula in t-test. The format is
formula = sample~group,
where sample = sample data
and group = variable that you use to split sample

You also need to pass the data frame that contains these variables via the data argument of t.test. In our case, the data is data3, which was defined above.

#Two sample test on `Sample` data between Group == "A" and Group == "B", 
# H1: muA > muB -> alternative = "greater"
t.test( formula = Sample~Group, data = data3, alternative = "greater" )
# H1: muA < muB -> alternative = "less"
t.test( formula = Sample~Group, data = data3, alternative = "less" )
# H1: muA != muB -> alternative = "two.sided"
t.test( formula = Sample~Group, data = data3, alternative = "two.sided" )

Example 2: Nutritional and Marketing Information on US Cereals

The UScereal data frame has 65 rows and 11 columns. The data come from the 1993 ASA Statistical Graphics Exposition, and are taken from the mandatory F&DA food label. The data have been normalized here to a portion of one American cup. This data frame contains the following columns:

  • mfr Manufacturer, represented by its first initial: G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston Purina.
  • calories number of calories in one portion.
  • protein grams of protein in one portion.
  • fat grams of fat in one portion.
  • sodium milligrams of sodium in one portion.
  • fibre grams of dietary fibre in one portion.
  • carbo grams of complex carbohydrates in one portion.
  • sugars grams of sugars in one portion.
  • shelf display shelf (1, 2, or 3, counting from the floor).
  • potassium grams of potassium.
  • vitamins vitamins and minerals (none, enriched, or 100%).
data("UScereal")
head(UScereal)
##                           mfr calories   protein      fat   sodium     fibre
## 100% Bran                   N 212.1212 12.121212 3.030303 393.9394 30.303030
## All-Bran                    K 212.1212 12.121212 3.030303 787.8788 27.272727
## All-Bran with Extra Fiber   K 100.0000  8.000000 0.000000 280.0000 28.000000
## Apple Cinnamon Cheerios     G 146.6667  2.666667 2.666667 240.0000  2.000000
## Apple Jacks                 K 110.0000  2.000000 0.000000 125.0000  1.000000
## Basic 4                     G 173.3333  4.000000 2.666667 280.0000  2.666667
##                              carbo   sugars shelf potassium vitamins
## 100% Bran                 15.15152 18.18182     3 848.48485 enriched
## All-Bran                  21.21212 15.15151     3 969.69697 enriched
## All-Bran with Extra Fiber 16.00000  0.00000     3 660.00000 enriched
## Apple Cinnamon Cheerios   14.00000 13.33333     1  93.33333 enriched
## Apple Jacks               11.00000 14.00000     2  30.00000 enriched
## Basic 4                   24.00000 10.66667     3 133.33333 enriched
str(UScereal)
## 'data.frame':    65 obs. of  11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num  212 212 100 147 110 ...
##  $ protein  : num  12.12 12.12 8 2.67 2 ...
##  $ fat      : num  3.03 3.03 0 2.67 0 ...
##  $ sodium   : num  394 788 280 240 125 ...
##  $ fibre    : num  30.3 27.3 28 2 1 ...
##  $ carbo    : num  15.2 21.2 16 14 11 ...
##  $ sugars   : num  18.2 15.2 0 13.3 14 ...
##  $ shelf    : int  3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num  848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
#create boxplot
boxplot(protein~mfr,data=UScereal)

We will perform three comparison tasks between two population means.

Task 1: You are asked to test whether the protein of cereals from manufacturer K is greater than that of cereals from manufacturer G.

Let \(\mu_1\) be the average protein of cereals from manufacturer K
Let \(\mu_2\) be the average protein of cereals from manufacturer G

Exercise 1: Write down the null and alternative hypotheses.

Answer: \(H_0:\) \(\mu_1\) ? \(\mu_2\), \(H_1:\) \(\mu_1\) ? \(\mu_2\)

Exercise 2: check the standard deviations using sd function.

##TODO: compute the standard deviation of protein of cereals from manufacturer K


 
##TODO: compute the standard deviation of protein of cereals from manufacturer G

Exercise 3: from the outputs above, will you use the t-test with pooled or unpooled variance? Explain your answer.

Answer:

Exercise 4: perform the t-test between the means protein of cereals from manufacturer K and manufacturer G at 0.01 significance level using t.test.

##TODO: perform t-test

Exercise 5: What is your conclusion about protein of cereals from manufacturer K VS manufacturer G from the result of this test?

Answer:


Lab 9: Analysis of Variance (ANOVA)

  1. Example: PlantGrowth data

  2. Exercise: Feedtype data

1. Example: PlantGrowth data

First, we need to install the car package in order to make normal probability plots.

We will use the built-in R PlantGrowth data, which contains results from an experiment to compare yields (measured by dried weight of plants) obtained under a control (ctrl) and two different treatment conditions (trt1 and trt2).

Assume that the samples are independent

data(PlantGrowth)
str(PlantGrowth)
## 'data.frame':    30 obs. of  2 variables:
##  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

Let’s see the boxplots of each group.

boxplot(weight ~ group, data = PlantGrowth) 

Checking assumptions

  1. Check if the data of groups ctrl, trt1 and trt2 follow the normal distribution.

To check the normality, we plot the Normal Probability Plot qqPlot.

library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
qqPlot(~weight, data= PlantGrowth, subset = group == "ctrl")

## [1] 4 1
qqPlot(~weight, data= PlantGrowth, subset = group == "trt1")

## 17 15 
##  7  5
qqPlot(~weight, data= PlantGrowth, subset = group == "trt2")

## 21 28 
##  1  8

All the samples are within the confidence interval, so they follow the normal distribution.

  2. Check if the population variances of ctrl, trt1 and trt2 are equal.

We need to check that the largest s.d. among the three is smaller than 2x the smallest s.d.

ctrl_group = subset(PlantGrowth, group == "ctrl")

sd(ctrl_group$weight)

trt1_group = subset(PlantGrowth, group == "trt1")

sd(trt1_group$weight)

trt2_group = subset(PlantGrowth, group == "trt2")

sd(trt2_group$weight)
#Check if largest sd < 2xsmallest sd
0.7936757 < 2*0.4425733
## [1] TRUE

So we pass the standard deviation test.

  3. Check if the samples are independent.

This depends on how we obtained the data. Here, we will believe the assumption that the samples are independent.

Perform Analysis of Variance (ANOVA)

Our goal is to test if the population means of ctrl, trt1 and trt2 are equal.

Let \(\mu_1\) be the average weight of all plants that are subjected to the same condition as those in ctrl.
Let \(\mu_2\) be the average weight of all plants that are subjected to the same condition as those in trt1.
Let \(\mu_3\) be the average weight of all plants that are subjected to the same condition as those in trt2.

PlantGrowth
##    weight group
## 1    4.17  ctrl
## 2    5.58  ctrl
## 3    5.18  ctrl
## 4    6.11  ctrl
## 5    4.50  ctrl
## 6    4.61  ctrl
## 7    5.17  ctrl
## 8    4.53  ctrl
## 9    5.33  ctrl
## 10   5.14  ctrl
## 11   4.81  trt1
## 12   4.17  trt1
## 13   4.41  trt1
## 14   3.59  trt1
## 15   5.87  trt1
## 16   3.83  trt1
## 17   6.03  trt1
## 18   4.89  trt1
## 19   4.32  trt1
## 20   4.69  trt1
## 21   6.31  trt2
## 22   5.12  trt2
## 23   5.54  trt2
## 24   5.50  trt2
## 25   5.37  trt2
## 26   5.29  trt2
## 27   4.92  trt2
## 28   6.15  trt2
## 29   5.80  trt2
## 30   5.26  trt2

Hypotheses:
\(H_0: \mu_1=\mu_2=\mu_3\)
\(H_1: \mu_1\not= \mu_2\) or \(\mu_2\not= \mu_3\) or \(\mu_3 \not=\mu_1\)

plant_anova <- aov(weight ~ group, data = PlantGrowth)

#look at the summary of the model
summary(plant_anova)

In the summary above, the Mean Square of Treatment (MST) is 1.8832 and the Mean Square of Error (MSE) is 0.3886.

The F-value is \(\frac{MST}{MSE} = \frac{1.8832}{0.3886} = 4.846\).

The p-value is \(0.0159\), so we would reject \(H_0\) if we were to test at 0.05 significance level, but we would not reject if we were to test at 0.01 significance level.
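These notes use the p-value approach; equivalently (an added aside), the observed F-value can be compared with critical values of the F distribution obtained from qf():

# critical values of the F distribution with df1 = 2 (groups) and df2 = 27 (residuals)
qf(0.95, df1 = 2, df2 = 27)   # reject H0 at the 0.05 level if the F-value exceeds this
qf(0.99, df1 = 2, df2 = 27)   # reject H0 at the 0.01 level if the F-value exceeds this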


2. Exercise: FeedType data

20 young pigs are assigned at random among 4 experimental groups. Each group is fed a different diet. (This design is a completely randomized design.) The data are the pigs' weights, in kilograms, after being raised on these diets for 10 months.

Here’s the data



data = data.frame(weight = c(60.8,57.1,65,58.7,61.8,68.3,67.7,74,66.3,69.9,102.6,102.2,100.5,97.5,98.9,87.9,84.7,83.2,85.8,90.3),
                  FeedType = as.factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)))

str(data)
## 'data.frame':    20 obs. of  2 variables:
##  $ weight  : num  60.8 57.1 65 58.7 61.8 68.3 67.7 74 66.3 69.9 ...
##  $ FeedType: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 2 2 2 2 2 ...
data
##    weight FeedType
## 1    60.8        1
## 2    57.1        1
## 3    65.0        1
## 4    58.7        1
## 5    61.8        1
## 6    68.3        2
## 7    67.7        2
## 8    74.0        2
## 9    66.3        2
## 10   69.9        2
## 11  102.6        3
## 12  102.2        3
## 13  100.5        3
## 14   97.5        3
## 15   98.9        3
## 16   87.9        4
## 17   84.7        4
## 18   83.2        4
## 19   85.8        4
## 20   90.3        4

Task 1: Plot the boxplot of this data.

##TODO: your code here

Task 2: Check if this data passes Assumptions #1 and #2 in order to perform an ANOVA test.

##TODO: check Assumption #1
##TODO: check Assumption #2

Task 3: What conclusion about the data do you obtain after checking these assumptions?

Answer:

Task 4: Perform an ANOVA test to see if on average, pigs with different diets have different weights at 0.05 significance level.

#TODO: your code here

Task 5: What are (1) MST, (2) MSE, (3) F-value of this test?

Answer:

Task 6: What can you conclude about the average weights of the pig with different diets from the result of the ANOVA test?

Answer:

Lab 10: Linear regression and correlation

  1. Example: cars dataset

  2. Exercise: TV vs sales data

1. Example: cars dataset

We will be working with the cars dataset (which comes with R).

The dataset cars gives the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.

The dataset has two variables:
1. speed: numeric Speed (mph)
2. dist: numeric Stopping distance (ft)

data(cars)
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

1. Visualize the data

We now want to visualize the data to see if we can get an understanding of how it is structured.

plot(dist~speed, data=cars)

This plot makes sense: the faster a car is travelling, the longer the distance it takes to stop.

2. Get the correlation between the two variables

cor(cars$dist,cars$speed)

3. Perform linear regression analysis

We will let speed be the independent variable \(X\)
and dist be the dependent variable \(Y\).

And we will perform a linear regression analysis between \(X\) and \(Y\).
That is, we will find appropriate values of the coefficients \(\beta_0\) and \(\beta_1\) such that
\[ Y \approx \beta_0+\beta_1 X. \]

To run a linear regression, call the function lm(Y~X, data=name_of_data).

In the following code, we run a linear regression and save the result in a variable called model.

model <- lm(dist~speed, data=cars)

To look at the result of the analysis, call summary(model)

summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

4. Interpreting the result

The result shows that:

  1. \(\hat\beta_0=-17.58, \hat\beta_1 = 3.93\)
  2. so the linear regression equation is \(Y = -17.58 + 3.93X\)
  3. a 1 mph increase in speed leads to a 3.93 ft increase in stopping distance on average
  4. \(SE(\beta_0)=6.76, SE(\beta_1)=0.42\)
  5. The \(t\)-statistic of \(\hat\beta_0\) is \(\frac{\hat\beta_0}{SE(\beta_0)}=\frac{-17.58}{6.76}=-2.60\)
  6. The \(t\)-statistic of \(\hat\beta_1\) is \(\frac{\hat\beta_1}{SE(\beta_1)}=\frac{3.93}{0.42}=9.46\)
  7. The \(p\)-value of \(\hat\beta_0\) is \(0.0123\), so for the hypothesis testing:
    \[ H_0: \beta_0 = 0, \quad H_1: \beta_0 \not= 0\]
    we accept \(H_1\) at \(0.05,0.1\) significance levels, but accept \(H_0\) at \(0.01\) significance level.
  8. The \(p\)-value of \(\hat\beta_1\) is \(1.49\times 10^{-12}\), so for the hypothesis testing:
    \[ H_0: \beta_1 = 0, \quad H_1: \beta_1 \not= 0\]
    we accept \(H_1\) at \(0.01,0.05,0.1\) significance levels.
  9. The \(R^2\) statistic of this model is \(0.6511\).
  10. The formula of the 95% confidence interval of \(\beta_0,\beta_1\) is: \[ 95\% \text{ confidence interval of }\beta_0: [\hat\beta_0-2SE(\hat\beta_0),\hat\beta_0+2SE(\hat\beta_0)] \]
    \[ 95\% \text{ confidence interval of }\beta_1: [\hat\beta_1-2SE(\hat\beta_1),\hat\beta_1+2SE(\hat\beta_1)] \]
beta0 = -17.58
beta1 = 3.93
SEbeta0 = 6.76
SEbeta1 = 0.42

#Compute the confidence intervals of beta0 and beta1
beta0 - 2*SEbeta0
## [1] -31.1
beta0 + 2*SEbeta0
## [1] -4.06
beta1 - 2*SEbeta1
## [1] 3.09
beta1 + 2*SEbeta1
## [1] 4.77
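
These intervals use the quick ±2·SE approximation from the formula above. R can also compute exact t-based confidence intervals directly from the fitted model with confint(); the results should be very close to the values above.

confint(model, level = 0.95)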

5. Confidence interval

The 95% confidence interval of \(\beta_0\) is \([-31.1,-4.06]\)
The 95% confidence interval of \(\beta_1\) is \([3.09,4.77]\)
“A 1 mph increase in speed leads to an increase in stopping distance of somewhere between 3.09 and 4.77 ft, with 95% confidence.”

6. Plotting the regression line

To plot the regression line, use abline(model) after the scatter plot.

plot(dist~speed, data=cars)
abline(model, col="red")
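
As an optional extra (not part of this lab), the fitted model can also be used to predict the stopping distance at a new speed; the value 15 mph below is just an illustrative choice.

# predicted stopping distance (ft) for a car travelling at 15 mph
predict(model, newdata = data.frame(speed = 15))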


2. Exercise: TV vs sales data

Your task is to study how much TV advertising of a product affects the sales volume of that product. The data is given in the following.

dat2 <- data.frame(TV = c(230.1,44.5,17.2,151.5,180.8,8.7,57.5,120.2,8.6,199.8,
                          66.1,214.7,23.8,97.5,204.1,195.4,67.8,281.4,69.2,147.3),
                   sales = c(22.1,10.4,9.3,18.5,12.9,7.2,11.8,13.2,4.8,10.6,
                             8.6,17.4,9.2,9.7,19,22.4,12.5,24.4,11.3,14.6))
str(dat2)
## 'data.frame':    20 obs. of  2 variables:
##  $ TV   : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ sales: num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...

Task 1: Get the correlation between the two variables

#TODO: your code here

Task 2: Perform a linear regression with TV as the independent variable and sales as the dependent variable. Then view the summary of the regression.

#TODO: your code here

Task 3: From the result, answer the following questions:

  1. What are beta0 and beta1?
    Answer:

  2. From this model, how much, on average, would sales volume increase from $1 additional spending on TV advertising?
    Answer:

  3. What are standard errors (SE) of beta0 and beta1?
    Answer:

  4. What are t-statistics of beta0 and beta1?
    Answer:

  5. Suppose that you want to test whether the TV advertising affects the sales volume at \(0.05\) significance level. Write down appropriate null and alternative hypotheses.
    Answer:

  6. From the hypothesis testing in Question 5., would you accept H0 or H1? Explain your answer.
    Answer:

  7. Compute the 95% confidence interval of beta0 and beta1

##TODO: compute the confidence interval  

Answer:

  8. What is your interpretation of the 95% confidence interval of beta1?
    Answer:

Task 4: Plot the scatter plot of the data and the regression line. TV is on the horizontal axis, and sales is on the vertical axis.

#TODO: your code here