Unit of observation: 1 individual student.
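Variables: sex (0 = male, 1 = female), age, higher (wish to take higher education: 0 = no, 1 = yes), Dalc (workday alcohol consumption), Walc (weekend alcohol consumption, 1-5 scale).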
Sample size: 395.
In this exercise I compare the variable Walc (weekend alcohol consumption) between male and female students => hypothesis about the difference between two population arithmetic means (independent samples).
H0: The average alcohol consumption by men on weekends is the same as the average alcohol consumption by women (𝜇Walc male = 𝜇Walc female).
H1: The average alcohol consumption by men on weekends is not the same as the average alcohol consumption by women (𝜇Walc male ≠ 𝜇Walc female).
Assumptions:
The variable is numeric - yes, measured on a 1-5 Likert scale.
The distribution of the variable is normal in both populations - to be checked.
The data must come from two independent populations - yes, there are male and female populations.
Variable has the same variance in both populations - if not, Welch correction to be applied.
library(readxl)
mydata <- read_xlsx("./student-mat1.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## sex age higher Dalc Walc
## 1 1 18 1 1 1
## 2 1 17 1 1 1
## 3 1 15 1 2 3
## 4 1 15 1 1 1
## 5 1 16 1 1 2
## 6 0 16 1 1 2
mydata$sexF <- factor(mydata$sex,
levels = c(0, 1),
labels = c("Male", "Female"))
library(psych)
describeBy(mydata$Walc, g = mydata$sexF)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 187 2.66 1.42 3 2.58 1.48 1 5 4 0.23 -1.31 0.1
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 208 1.96 1.06 2 1.82 1.48 1 5 4 0.82 -0.21 0.07
The median weekend alcohol consumption is 3 (medium) for men and 2 (low) for women: at least half of the men reported a level of 3 or higher, and at least half of the women reported a level of 2 or higher.
On average, male weekend alcohol consumption is equal to 2.66, while female weekend alcohol consumption is equal to 1.96 on average.
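The group standard deviations printed by describeBy() also give a first look at the equal-variance assumption. The following is only a minimal sketch, assuming nothing beyond the rounded summary values above; the object names are illustrative, and the F test itself presumes normality, so its p-value should be read as a rough guide rather than a formal check.

```r
# Summary values copied from the describeBy() output above (SDs are rounded)
n_male <- 187;   sd_male <- 1.42
n_female <- 208; sd_female <- 1.06

# Ratio of the two sample variances (larger variance in the numerator)
f_ratio <- sd_male^2 / sd_female^2

# Two-sided F test of equal variances; this test is sensitive to non-normality
p_value <- 2 * pf(f_ratio, df1 = n_male - 1, df2 = n_female - 1,
                  lower.tail = FALSE)

f_ratio
p_value
```

A variance ratio well above 1 is consistent with applying the Welch correction instead of assuming equal variances.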
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
Male <- ggplot(mydata[mydata$sexF == "Male", ], aes(x = Walc)) +
theme_linedraw() +
geom_bar(fill = "dodgerblue4") +
ylab("Frequency") +
ggtitle("Male")
Female <- ggplot(mydata[mydata$sexF == "Female", ], aes(x = Walc)) +
theme_linedraw() +
geom_bar(fill = "hotpink2") +
ylab("Frequency") +
ggtitle("Female")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggarrange(Male, Female,
ncol = 2, nrow = 1)
Normality already looks doubtful, because both bar charts are skewed to the right (more clearly for females), but I will also check normality with the Shapiro-Wilk test.
H0-1: Weekend alcohol consumption is normally distributed for male.
H1-1: Weekend alcohol consumption is not normally distributed for male.
H0-2: Weekend alcohol consumption is normally distributed for female.
H1-2: Weekend alcohol consumption is not normally distributed for female.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(sexF) %>%
shapiro_test(Walc)
## # A tibble: 2 × 4
## sexF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male Walc 0.870 1.34e-11
## 2 Female Walc 0.814 4.83e-15
We reject the null hypothesis H0-1 with p<0.001.
We reject the null hypothesis H0-2 with p<0.001.
H0: The average alcohol consumption by men on weekends is the same as the average alcohol consumption by women (𝜇Walc male = 𝜇Walc female).
H1: The average alcohol consumption by men on weekends is not the same as the average alcohol consumption by women (𝜇Walc male ≠ 𝜇Walc female).
t.test(mydata$Walc ~ mydata$sexF,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$Walc by mydata$sexF
## t = 5.5666, df = 341.41, p-value = 5.263e-08
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## 0.4567768 0.9559649
## sample estimates:
## mean in group Male mean in group Female
## 2.663102 1.956731
We reject H0 at p<0.001 => there is a difference between male and female weekend alcohol consumption.
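To see where the Welch statistic and its fractional degrees of freedom come from, they can be approximated from the group summaries alone. This is a sketch under the assumption that only the rounded standard deviations from describeBy() are available, so the values come out close to, but not exactly equal to, the t.test() output; all object names are illustrative.

```r
# Group summaries from the outputs above (SDs are rounded to 2 decimals)
m1 <- 2.663102; s1 <- 1.42; n1 <- 187   # Male
m2 <- 1.956731; s2 <- 1.06; n2 <- 208   # Female

# Squared standard error contributed by each group
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch <- (m1 - m2) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_welch <- 2 * pt(abs(t_welch), df = df_welch, lower.tail = FALSE)

c(t = t_welch, df = df_welch, p = p_welch)
```

Because the variances are not pooled, the degrees of freedom fall below n1 + n2 - 2 = 393, which is why the test reports a non-integer df.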
library(effectsize)
effectsize::cohens_d(mydata$Walc ~ mydata$sexF,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## ------------------------
## 0.57 | [0.36, 0.77]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.57, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
The effect size is medium => there is a medium difference between male and female weekend alcohol consumption. Comparing the means (2.66 vs. 1.96), males consume more alcohol on weekends on average than females.
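The size of Cohen's d can be checked by hand from the group means and standard deviations. This sketch assumes that the un-pooled standardizer is the square root of the average of the two group variances; that is my reading of the pooled_sd = FALSE option, not something shown in the output. Since it uses the rounded SDs, the result may differ from the reported 0.57 in the second decimal.

```r
# Group means and (rounded) standard deviations from the outputs above
m1 <- 2.663102; s1 <- 1.42   # Male
m2 <- 1.956731; s2 <- 1.06   # Female

# Un-pooled standardizer: root of the mean of the two variances (assumption)
sd_unpooled <- sqrt((s1^2 + s2^2) / 2)

# Standardized mean difference
d <- (m1 - m2) / sd_unpooled
round(d, 2)
```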
H0: The distribution location of weekend alcohol consumption is the same for men and women.
H1: The distribution location of weekend alcohol consumption is not the same for men and women.
wilcox.test(mydata$Walc ~ mydata$sexF,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Walc by mydata$sexF
## W = 24838, p-value = 7.4e-07
## alternative hypothesis: true location shift is not equal to 0
We reject H0 at p<0.001 => the distribution locations of weekend alcohol consumption differ between males and females.
library(effectsize)
effectsize(wilcox.test(mydata$Walc ~ mydata$sexF,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.28 | [0.17, 0.38]
interpret_rank_biserial(0.28)
## [1] "medium"
## (Rules: funder2019)
The effect size is medium => there is a medium difference between the male and female distribution locations of weekend alcohol consumption. Comparing the medians (3 vs. 2), males consume more alcohol on weekends than females.
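The rank-biserial correlation can be recovered from the W statistic and the two group sizes. This is a small sketch, assuming that the W reported by wilcox.test() is the Mann-Whitney U statistic of the first group (Male); the object names are illustrative.

```r
# Values taken from the wilcox.test() and describeBy() outputs above
W  <- 24838   # rank sum statistic for the first group (Male)
n1 <- 187     # number of males
n2 <- 208     # number of females

# Share of all male-female pairs "won" by males, rescaled to [-1, 1]
r_rank_biserial <- 2 * W / (n1 * n2) - 1
round(r_rank_biserial, 2)
```

A positive value means that, in a randomly chosen male-female pair, the male more often has the higher weekend consumption.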
The results of the Wilcoxon rank sum test should be used, since the assumption of normality was violated in both groups.
library(readxl)
mydata2 <- read_xlsx("./student-mat1.xlsx")
mydata2 <- as.data.frame(mydata2)
head(mydata2)
## sex age higher Dalc Walc
## 1 1 18 1 1 1
## 2 1 17 1 1 1
## 3 1 15 1 2 3
## 4 1 15 1 1 1
## 5 1 16 1 1 2
## 6 0 16 1 1 2
Is there a correlation between students' workday alcohol consumption and weekend alcohol consumption?
H0: ρ_Dalc, Walc = 0
H1: ρ_Dalc, Walc ≠ 0
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata2[, c("Dalc", "Walc")], smooth = FALSE)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(mydata2, columns = c("Dalc", "Walc"))
Reject H0 at p<0.001 => the linear relationship between workday alcohol consumption and weekend alcohol consumption is semi-strong and positive.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata2[, c("Dalc", "Walc")]),
type = "pearson")
## Dalc Walc
## Dalc 1.00 0.65
## Walc 0.65 1.00
##
## n= 395
##
##
## P
## Dalc Walc
## Dalc 0
## Walc 0
cor(mydata2$Dalc, mydata2$Walc,
method = "pearson",
use = "complete.obs")
## [1] 0.6475442
The linear relationship between workday alcohol consumption and weekend alcohol consumption is semi-strong and positive (r = 0.65).
cor.test(mydata2$Dalc, mydata2$Walc,
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: mydata2$Dalc and mydata2$Walc
## t = 16.846, df = 393, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5863365 0.7014001
## sample estimates:
## cor
## 0.6475442
Reject H0 at p<0.001 => the workday alcohol consumption and weekend alcohol consumption are positively correlated.
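The test statistic and the confidence interval follow directly from r and n. The sketch below assumes the usual t transformation of r and a Fisher z interval, which reproduces the numbers reported above; the object names are illustrative.

```r
# Values from the cor.test() output above
r <- 0.6475442
n <- 395

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

t_stat
p_val
ci
```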
library(readxl)
mydata3 <- read_xlsx("./student-mat1.xlsx")
mydata3 <- as.data.frame(mydata3)
head(mydata3)
## sex age higher Dalc Walc
## 1 1 18 1 1 1
## 2 1 17 1 1 1
## 3 1 15 1 2 3
## 4 1 15 1 1 1
## 5 1 16 1 1 2
## 6 0 16 1 1 2
What is the relationship between the sex of the students and their wish to take higher education?
H0: There is no association between sex of the students and their wish to take higher education.
H1: There is an association.
mydata3$sexF <- factor(mydata3$sex,
levels = c(0, 1),
labels = c("Male", "Female"))
mydata3$higherF <- factor(mydata3$higher,
levels = c(0, 1),
labels = c("NO", "YES"))
results <- chisq.test(mydata3$sexF, mydata3$higherF,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata3$sexF and mydata3$higherF
## X-squared = 7.6859, df = 1, p-value = 0.005565
addmargins(results$observed)
## mydata3$higherF
## mydata3$sexF NO YES Sum
## Male 16 171 187
## Female 4 204 208
## Sum 20 375 395
round(results$expected, 2)
## mydata3$higherF
## mydata3$sexF NO YES
## Male 9.47 177.53
## Female 10.53 197.47
round(results$residuals, 2)
## mydata3$higherF
## mydata3$sexF NO YES
## Male 2.12 -0.49
## Female -2.01 0.46
We reject H0 at p=0.006 => There is an association between sex of the students and their wish to take higher education.
Among the students who do not wish to take higher education, there are more males than expected (Pearson residual 2.12 > 1.96, 𝛼 = 0.05).
Among the students who do not wish to take higher education, there are fewer females than expected (Pearson residual -2.01 < -1.96, 𝛼 = 0.05).
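The expected counts, the Pearson residuals and the Yates-corrected statistic can all be rebuilt from the observed table. This is a minimal base R sketch that types in the observed counts from the output above; the object names are illustrative.

```r
# Observed counts copied from addmargins(results$observed)
observed <- matrix(c(16, 171,
                      4, 204),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(sex = c("Male", "Female"),
                                   higher = c("NO", "YES")))
n <- sum(observed)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)

# Chi-squared statistic with Yates' continuity correction (2 x 2 table, df = 1)
chi_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_yates   <- pchisq(chi_yates, df = 1, lower.tail = FALSE)

round(expected, 2)
round(pearson_res, 2)
c(X_squared = chi_yates, p = p_yates)
```

The sign of each residual shows the direction of the deviation: positive means more observations than expected in that cell, negative means fewer.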
oddsratio(mydata3$sexF, mydata3$higherF)
## Odds ratio | 95% CI
## --------------------------
## 4.77 | [1.57, 14.54]
The odds of wishing to pursue higher education are 4.77 times higher for female students than for male students.
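The direction of the odds ratio is easiest to see when it is computed by hand from the observed counts. In this sketch the confidence interval is assumed to be a Wald interval on the log scale, which matches the reported [1.57, 14.54]; the object names are illustrative.

```r
# Observed counts from the contingency table above
male_no   <- 16; male_yes   <- 171
female_no <- 4;  female_yes <- 204

# Odds of wishing to take higher education within each group
odds_female <- female_yes / female_no
odds_male   <- male_yes / male_no

# Odds ratio: females relative to males
or <- odds_female / odds_male

# 95% Wald confidence interval, computed on the log scale (assumption)
se_log_or <- sqrt(1 / male_no + 1 / male_yes + 1 / female_no + 1 / female_yes)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(or, 2)
round(ci, 2)
```

The same value results from comparing the odds of answering NO for males against those for females, so the number alone does not reveal which group is in the numerator.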
interpret_oddsratio(4.77)
## [1] "medium"
## (Rules: chen2010)
The association between the sex of students and their wish to take a higher education is medium.