This document contains the solutions to the exercises from the practical session of Day 3 of the Introduction to R workshop. The session focuses on data analysis, including one-sample t-test, two-sample t-test, linear regression and anova, as well as some non-parametric variants.
Before you start:
- Set working directory
- Read in titanic dataset (‘titanic.csv’)
- Check your datasets
titanic <- read.csv("titanic.csv", header=T)
str(titanic)
#> 'data.frame': 1313 obs. of 7 variables:
#> $ passenger_class : chr "1st" "1st" "1st" "1st" ...
#> $ name : chr "Allen,MissElisabethWalton" "Allison,MissHelenLoraine" "Allison,MrHudsonJoshuaCreighton" "Allison,MrsHudsonJ.C.(BessieWaldoDaniels)" ...
#> $ age : num 29 2 30 25 0.917 ...
#> $ embarked : chr "Southampton" "Southampton" "Southampton" "Southampton" ...
#> $ home_destination: chr "StLouis,MO" "Montreal,PQ/Chesterville,ON" "Montreal,PQ/Chesterville,ON" "Montreal,PQ/Chesterville,ON" ...
#> $ sex : chr "female" "female" "male" "female" ...
#> $ survive : chr "yes" "no" "no" "no" ...
View(titanic)
Test whether more than a third of the passengers onboard survived.
table(titanic$survive) #number of people that survived: 449
#>
#> no yes
#> 864 449
length(titanic$survive) #total number of people: 1313
#> [1] 1313
prop.test(x=449,n=1313,p=1/3, alternative='greater')
#>
#> 1-sample proportions test with continuity correction
#>
#> data: 449 out of 1313, null probability 1/3
#> X-squared = 0.40223, df = 1, p-value = 0.263
#> alternative hypothesis: true p is greater than 0.3333333
#> 95 percent confidence interval:
#> 0.3204021 1.0000000
#> sample estimates:
#> p
#> 0.341965
# P-value is larger than 0.05, so there is no evidence that the proportion of people that survived is greater than 1/3
Create a 99% confidence interval for the percentage of people that did not survive.
prop.test(x=864,n=1313, alternative='two.sided',conf.level = 0.99)
#>
#> 1-sample proportions test with continuity correction
#>
#> data: 864 out of 1313, null probability 0.5
#> X-squared = 130.54, df = 1, p-value < 2.2e-16
#> alternative hypothesis: true p is not equal to 0.5
#> 99 percent confidence interval:
#> 0.6232072 0.6912559
#> sample estimates:
#> p
#> 0.658035
#We are 99% confident that the proportion of people that did not survive the disaster lies between 62% and 69%
Test whether the proportion of people in first, second and third class respectively is equal to 0.25, 0.25, 0.5.
chisq.test(table(titanic$passenger_class),p=c(0.25,0.25,0.5))
#>
#> Chi-squared test for given probabilities
#>
#> data: table(titanic$passenger_class)
#> X-squared = 11.736, df = 2, p-value = 0.002829
# The p-value is lower than 0.05, so we can conclude that the proportion of people is first, second and third class is not equal to 0.25, 0.25, 0.5 respectively.
Are sex and passenger class independent from each other?
table(titanic$passenger_class,titanic$sex)
#>
#> female male
#> 1st 143 179
#> 2nd 107 173
#> 3rd 213 498
chisq.test(table(titanic$passenger_class,titanic$sex))
#>
#> Pearson's Chi-squared test
#>
#> data: table(titanic$passenger_class, titanic$sex)
#> X-squared = 21.636, df = 2, p-value = 2.004e-05
# The p-value is lower than 0.05, so we can conclude that passenger class and sex are not independent.
Test whether the mean age of the passengers is equal to 30.
t.test(titanic$age,mu=30)
#>
#> One Sample t-test
#>
#> data: titanic$age
#> t = 2.0373, df = 632, p-value = 0.04204
#> alternative hypothesis: true mean is not equal to 30
#> 95 percent confidence interval:
#> 30.04312 32.34524
#> sample estimates:
#> mean of x
#> 31.19418
# The p-value is below 0.05, so we can conclude that the mean age of the passengers is not equal to 30.
Construct a QQplot for age.
qqnorm(titanic$age)
qqline(titanic$age,col='red')
Test if age is normally distributed.
shapiro.test(titanic$age)
#>
#> Shapiro-Wilk normality test
#>
#> data: titanic$age
#> W = 0.98264, p-value = 7.771e-07
# The variable age is not normally distributed (p-value below 0.05). Therefore, a nonparametric test needs to be performed.
wilcox.test(titanic$age,,mu=30)
#>
#> Wilcoxon signed rank test with continuity correction
#>
#> data: titanic$age
#> V = 96658, p-value = 0.2449
#> alternative hypothesis: true location is not equal to 30
# The p-value is larger than 0.05, indicating that there is no evidence that the median age is different from 30
Construct a 90% confidence interval for the mean age.
t.test(titanic$age, conf.level = 0.9)
#>
#> One Sample t-test
#>
#> data: titanic$age
#> t = 53.218, df = 632, p-value < 2.2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 90 percent confidence interval:
#> 30.22862 32.15975
#> sample estimates:
#> mean of x
#> 31.19418
#We are 90% confident that the mean age of the passengers lies between 30 and 32.
Test whether the mean age of female passengers is lower than the mean age of male passengers.
# First test if variances are equal
var.test(titanic$age~titanic$sex)
#>
#> F test to compare two variances
#>
#> data: titanic$age by titanic$sex
#> F = 1.0262, num df = 242, denom df = 389, p-value = 0.8163
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#> 0.8199801 1.2921285
#> sample estimates:
#> ratio of variances
#> 1.026194
# The p-value is greater than 0.05, so we can conclude that the variances for males and females are equal.
t.test(titanic$age~titanic$sex, alternative='less', var.equal=T)
#>
#> Two Sample t-test
#>
#> data: titanic$age by titanic$sex
#> t = -0.83671, df = 631, p-value = 0.2015
#> alternative hypothesis: true difference in means between group female and group male is less than 0
#> 95 percent confidence interval:
#> -Inf 0.9771769
#> sample estimates:
#> mean in group female mean in group male
#> 30.57270 31.58141
#The p-value is greater than 0.05, indicating that there is no evidence that the mean value of age for females is smaller than the mean age of male passengers
Are the assumptions to perform the parametric t-test test fulfilled?
tapply(titanic$age,titanic$sex,shapiro.test)
#> $female
#>
#> Shapiro-Wilk normality test
#>
#> data: X[[i]]
#> W = 0.9815, p-value = 0.00294
#>
#>
#> $male
#>
#> Shapiro-Wilk normality test
#>
#> data: X[[i]]
#> W = 0.97848, p-value = 1.495e-05
# Both p-values are lower than 0.05, so the assumption of normality is not met. We need to perform a nonparametric test.
wilcox.test(titanic$age~titanic$sex, alternative="less")
#>
#> Wilcoxon rank sum test with continuity correction
#>
#> data: titanic$age by titanic$sex
#> W = 45105, p-value = 0.154
#> alternative hypothesis: true location shift is less than 0
Use the dataset mtcars available in the R environment. Check which variables are contained in the dataset
Data <- mtcars
str(Data)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Produce a scatterplot to visually check the relation between weight (wt) and miles per gallon (mpg). Have a look at what kind of relation you expect to find with the regression.
# different ways to use plot function
plot(Data$wt, Data$mpg, xlab = "Weight", ylab ="Miles per gallon")
plot(Data$mpg ~ Data$wt, xlab = "Weight", ylab ="Miles per gallon")
plot(mpg ~ wt, data=Data, xlab = "Weight", ylab ="Miles per gallon")
Test whether there is a significant effect of weight on the miles per gallon of a car.
model1 <- lm(mpg ~ wt, data=Data)
summary(model1)
#>
#> Call:
#> lm(formula = mpg ~ wt, data = Data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.5432 -2.3647 -0.1252 1.4096 6.8727
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
#> wt -5.3445 0.5591 -9.559 1.29e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.046 on 30 degrees of freedom
#> Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
#> F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
# The p-value for wt is smaller than 0.05, so wt does have a significant effect on mpg
Plot the relevant plots to check for assumptions
# Check assumptions: normality of residuals
qqnorm(residuals(model1))
qqline(residuals(model1), col="red")
shapiro.test(residuals(model1))
#>
#> Shapiro-Wilk normality test
#>
#> data: residuals(model1)
#> W = 0.94508, p-value = 0.1044
# Check assumptions: homoscedasticity & linearity
plot(fitted(model1),residuals(model1), main="Residual Plot",pch=19)
abline(h=0, lty='dotted', col='blue')
# All assumptions can be considered to be fulfilled
In the scatterplot, try to add a line that shows the regression function you predicted.
# different ways to use plot function
plot(Data$wt, Data$mpg, xlab = "Weight", ylab ="Miles per gallon")
abline(model1, col='red')
Test whether there is a difference in the mean age for passengers in different classes.
model2 <- lm(age~factor(passenger_class), data=titanic)
anova(model2)
#> Analysis of Variance Table
#>
#> Response: age
#> Df Sum Sq Mean Sq F value Pr(>F)
#> factor(passenger_class) 2 26690 13344.8 75.903 < 2.2e-16 ***
#> Residuals 630 110764 175.8
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The p-value is smaller than 0.05, indicating that there is a significant difference in the mean age for passengers from different classes.
Use an appropriate correction for multiple testing and check which classes of passengers are significantly different when it comes to age.
#Note that you can use different kinds of corrections
pairwise.t.test(titanic$age, titanic$passenger_class, p.adj="bonferroni")
#>
#> Pairwise comparisons using t tests with pooled SD
#>
#> data: titanic$age and titanic$passenger_class
#>
#> 1st 2nd
#> 2nd <2e-16 -
#> 3rd <2e-16 0.013
#>
#> P value adjustment method: bonferroni
# All the classes are significantly different when it come to the mean value of age.
Check whether the assumptions for a parametric ANOVA are fulfilled.
# Check assumptions - Normality
tapply(titanic$age, titanic$passenger_class,shapiro.test)
#> $`1st`
#>
#> Shapiro-Wilk normality test
#>
#> data: X[[i]]
#> W = 0.98537, p-value = 0.02003
#>
#>
#> $`2nd`
#>
#> Shapiro-Wilk normality test
#>
#> data: X[[i]]
#> W = 0.97749, p-value = 0.001793
#>
#>
#> $`3rd`
#>
#> Shapiro-Wilk normality test
#>
#> data: X[[i]]
#> W = 0.98207, p-value = 0.01353
# The assumption of normality is not met
# Check assumption - Equal variances
library(car)
#> Loading required package: carData
leveneTest(age~factor(passenger_class), data=titanic)
#> Levene's Test for Homogeneity of Variance (center = median)
#> Df F value Pr(>F)
#> group 2 12.835 3.44e-06 ***
#> 630
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The assumption of equal variances is not met
# Perform a nonparametric anova
kruskal.test(age~factor(passenger_class), data=titanic)
#>
#> Kruskal-Wallis rank sum test
#>
#> data: age by factor(passenger_class)
#> Kruskal-Wallis chi-squared = 116.08, df = 2, p-value < 2.2e-16