This document contains the solutions to the exercises from the practical session of Day 3 of the Introduction to R workshop. The session focuses on data analysis, including one-sample t-test, two-sample t-test, linear regression and anova, as well as some non-parametric variants.

Before you start:
- Set working directory
- Read in titanic dataset (‘titanic.csv’)
- Check your datasets

titanic <- read.csv("titanic.csv", header=T)
str(titanic)
#> 'data.frame':    1313 obs. of  7 variables:
#>  $ passenger_class : chr  "1st" "1st" "1st" "1st" ...
#>  $ name            : chr  "Allen,MissElisabethWalton" "Allison,MissHelenLoraine" "Allison,MrHudsonJoshuaCreighton" "Allison,MrsHudsonJ.C.(BessieWaldoDaniels)" ...
#>  $ age             : num  29 2 30 25 0.917 ...
#>  $ embarked        : chr  "Southampton" "Southampton" "Southampton" "Southampton" ...
#>  $ home_destination: chr  "StLouis,MO" "Montreal,PQ/Chesterville,ON" "Montreal,PQ/Chesterville,ON" "Montreal,PQ/Chesterville,ON" ...
#>  $ sex             : chr  "female" "female" "male" "female" ...
#>  $ survive         : chr  "yes" "no" "no" "no" ...
View(titanic)

1 Simple hypothesis tests

1.1 Binary variable

Test whether more than a third of the passengers onboard survived.

table(titanic$survive) #number of people that survived: 449
#> 
#>  no yes 
#> 864 449

length(titanic$survive) #total number of people: 1313
#> [1] 1313

prop.test(x=449,n=1313,p=1/3, alternative='greater')
#> 
#>  1-sample proportions test with continuity correction
#> 
#> data:  449 out of 1313, null probability 1/3
#> X-squared = 0.40223, df = 1, p-value = 0.263
#> alternative hypothesis: true p is greater than 0.3333333
#> 95 percent confidence interval:
#>  0.3204021 1.0000000
#> sample estimates:
#>        p 
#> 0.341965

# P-value is larger than 0.05, so there is no evidence that the proportion of people that survived is greater than 1/3

Create a 99% confidence interval for the percentage of people that did not survive.

prop.test(x=864,n=1313, alternative='two.sided',conf.level = 0.99)
#> 
#>  1-sample proportions test with continuity correction
#> 
#> data:  864 out of 1313, null probability 0.5
#> X-squared = 130.54, df = 1, p-value < 2.2e-16
#> alternative hypothesis: true p is not equal to 0.5
#> 99 percent confidence interval:
#>  0.6232072 0.6912559
#> sample estimates:
#>        p 
#> 0.658035

#We are 99% confident that the proportion of people that did not survive the disaster lies between 62% and 69%

1.2 Categorical variable (>2 levels)

Test whether the proportion of people in first, second and third class respectively is equal to 0.25, 0.25, 0.5.

chisq.test(table(titanic$passenger_class),p=c(0.25,0.25,0.5))
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  table(titanic$passenger_class)
#> X-squared = 11.736, df = 2, p-value = 0.002829

# The p-value is lower than 0.05, so we can conclude that the proportion of people is first, second and third class is not equal to 0.25, 0.25, 0.5 respectively.

1.3 Categorical & Categorical

Are sex and passenger class independent from each other?

table(titanic$passenger_class,titanic$sex)
#>      
#>       female male
#>   1st    143  179
#>   2nd    107  173
#>   3rd    213  498
chisq.test(table(titanic$passenger_class,titanic$sex))
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  table(titanic$passenger_class, titanic$sex)
#> X-squared = 21.636, df = 2, p-value = 2.004e-05

# The p-value is lower than 0.05, so we can conclude that passenger class and sex are not independent.

1.4 One-sample t-test

Test whether the mean age of the passengers is equal to 30.

t.test(titanic$age,mu=30)
#> 
#>  One Sample t-test
#> 
#> data:  titanic$age
#> t = 2.0373, df = 632, p-value = 0.04204
#> alternative hypothesis: true mean is not equal to 30
#> 95 percent confidence interval:
#>  30.04312 32.34524
#> sample estimates:
#> mean of x 
#>  31.19418
# The p-value is below 0.05, so we can conclude that the mean age of the passengers is not equal to 30.

Construct a QQplot for age.

qqnorm(titanic$age)
qqline(titanic$age,col='red')

Test if age is normally distributed.

shapiro.test(titanic$age)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  titanic$age
#> W = 0.98264, p-value = 7.771e-07

# The variable age is not normally distributed (p-value below 0.05). Therefore, a nonparametric test needs to be performed.

wilcox.test(titanic$age,,mu=30)
#> 
#>  Wilcoxon signed rank test with continuity correction
#> 
#> data:  titanic$age
#> V = 96658, p-value = 0.2449
#> alternative hypothesis: true location is not equal to 30

# The p-value is larger than 0.05, indicating that there is no evidence that the median age is different from 30

Construct a 90% confidence interval for the mean age.

t.test(titanic$age, conf.level = 0.9)
#> 
#>  One Sample t-test
#> 
#> data:  titanic$age
#> t = 53.218, df = 632, p-value < 2.2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 90 percent confidence interval:
#>  30.22862 32.15975
#> sample estimates:
#> mean of x 
#>  31.19418

#We are 90% confident that the mean age of the passengers lies between 30 and 32.

1.5 Two-sample t-test

Test whether the mean age of female passengers is lower than the mean age of male passengers.

# First test if variances are equal

var.test(titanic$age~titanic$sex)
#> 
#>  F test to compare two variances
#> 
#> data:  titanic$age by titanic$sex
#> F = 1.0262, num df = 242, denom df = 389, p-value = 0.8163
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#>  0.8199801 1.2921285
#> sample estimates:
#> ratio of variances 
#>           1.026194

# The p-value is greater than 0.05, so we can conclude that the variances for males and females are equal.

t.test(titanic$age~titanic$sex, alternative='less', var.equal=T)
#> 
#>  Two Sample t-test
#> 
#> data:  titanic$age by titanic$sex
#> t = -0.83671, df = 631, p-value = 0.2015
#> alternative hypothesis: true difference in means between group female and group male is less than 0
#> 95 percent confidence interval:
#>       -Inf 0.9771769
#> sample estimates:
#> mean in group female   mean in group male 
#>             30.57270             31.58141

#The p-value is greater than 0.05, indicating that there is no evidence that the mean value of age for females is smaller than the mean age of male passengers

Are the assumptions to perform the parametric t-test test fulfilled?

tapply(titanic$age,titanic$sex,shapiro.test)
#> $female
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  X[[i]]
#> W = 0.9815, p-value = 0.00294
#> 
#> 
#> $male
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  X[[i]]
#> W = 0.97848, p-value = 1.495e-05

# Both p-values are lower than 0.05, so the assumption of normality is not met. We need to perform a nonparametric test.

wilcox.test(titanic$age~titanic$sex, alternative="less")
#> 
#>  Wilcoxon rank sum test with continuity correction
#> 
#> data:  titanic$age by titanic$sex
#> W = 45105, p-value = 0.154
#> alternative hypothesis: true location shift is less than 0

2 Linear Models

2.1 Simple linear regression (continuous & continuous)

Use the dataset mtcars available in the R environment. Check which variables are contained in the dataset

Data <- mtcars
str(Data)
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Produce a scatterplot to visually check the relation between weight (wt) and miles per gallon (mpg). Have a look at what kind of relation you expect to find with the regression.

# different ways to use plot function
plot(Data$wt, Data$mpg, xlab = "Weight", ylab ="Miles per gallon")


plot(Data$mpg ~ Data$wt, xlab = "Weight", ylab ="Miles per gallon")


plot(mpg ~ wt, data=Data, xlab = "Weight", ylab ="Miles per gallon")

Test whether there is a significant effect of weight on the miles per gallon of a car.

model1 <- lm(mpg ~ wt, data=Data)
summary(model1)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = Data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.5432 -2.3647 -0.1252  1.4096  6.8727 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
#> wt           -5.3445     0.5591  -9.559 1.29e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.046 on 30 degrees of freedom
#> Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
#> F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

# The p-value for wt is smaller than 0.05, so wt does have a significant effect on mpg

Plot the relevant plots to check for assumptions

# Check assumptions: normality of residuals
qqnorm(residuals(model1))
qqline(residuals(model1), col="red")


shapiro.test(residuals(model1))
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  residuals(model1)
#> W = 0.94508, p-value = 0.1044

# Check assumptions: homoscedasticity & linearity
plot(fitted(model1),residuals(model1), main="Residual Plot",pch=19)
abline(h=0, lty='dotted', col='blue')


# All assumptions can be considered to be fulfilled

In the scatterplot, try to add a line that shows the regression function you predicted.

# different ways to use plot function
plot(Data$wt, Data$mpg, xlab = "Weight", ylab ="Miles per gallon")

abline(model1, col='red')

2.2 Simple linear regression: ANOVA (continuous & categorical)

Test whether there is a difference in the mean age for passengers in different classes.

model2 <- lm(age~factor(passenger_class), data=titanic)
anova(model2)
#> Analysis of Variance Table
#> 
#> Response: age
#>                          Df Sum Sq Mean Sq F value    Pr(>F)    
#> factor(passenger_class)   2  26690 13344.8  75.903 < 2.2e-16 ***
#> Residuals               630 110764   175.8                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The p-value is smaller than 0.05, indicating that there is a significant difference in the mean age for passengers from different classes.

Use an appropriate correction for multiple testing and check which classes of passengers are significantly different when it comes to age.

#Note that you can use different kinds of corrections

pairwise.t.test(titanic$age, titanic$passenger_class, p.adj="bonferroni")
#> 
#>  Pairwise comparisons using t tests with pooled SD 
#> 
#> data:  titanic$age and titanic$passenger_class 
#> 
#>     1st    2nd  
#> 2nd <2e-16 -    
#> 3rd <2e-16 0.013
#> 
#> P value adjustment method: bonferroni

# All the classes are significantly different when it come to the mean value of age.

Check whether the assumptions for a parametric ANOVA are fulfilled.

# Check assumptions - Normality
tapply(titanic$age, titanic$passenger_class,shapiro.test)
#> $`1st`
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  X[[i]]
#> W = 0.98537, p-value = 0.02003
#> 
#> 
#> $`2nd`
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  X[[i]]
#> W = 0.97749, p-value = 0.001793
#> 
#> 
#> $`3rd`
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  X[[i]]
#> W = 0.98207, p-value = 0.01353

# The assumption of normality is not met

# Check assumption - Equal variances
library(car)
#> Loading required package: carData
leveneTest(age~factor(passenger_class), data=titanic)
#> Levene's Test for Homogeneity of Variance (center = median)
#>        Df F value   Pr(>F)    
#> group   2  12.835 3.44e-06 ***
#>       630                     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The assumption of equal variances is not met

# Perform a nonparametric anova

kruskal.test(age~factor(passenger_class), data=titanic)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  age by factor(passenger_class)
#> Kruskal-Wallis chi-squared = 116.08, df = 2, p-value < 2.2e-16

R Workshop: Practical DAY 3

dr. Annelies Agten

2026-06-11