Student Alcohol Consumption.

RQ1: Hypothesis.

Unit of observation: 1 individual student.

Variables:

  1. sex - student’s sex (binary: ‘0’ - male or ‘1’ - female)
  2. age - student’s age (numeric: from 15 to 22)
  3. higher - wants to take higher education (binary: ‘0’ - no or ‘1’ - yes)
  4. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  5. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

Sample size: 395.

In this exercise I compare the variable Walc (weekend alcohol consumption) between male and female students => hypothesis about the difference between two population arithmetic means (independent samples).

H0: The average alcohol consumption by men on weekends is the same as the average alcohol consumption by women (𝜇Walc male = 𝜇Walc female).

H1: The average alcohol consumption by men on weekends is not the same as the average alcohol consumption by women (𝜇Walc male ≠ 𝜇Walc female).

Assumptions:

  1. Variable is numeric - yes, measured on a 1-5 scale (ordinal, treated here as numeric).

  2. The distribution of the variable is normal in both populations - to be checked.

  3. The data must come from two independent populations - yes, male and female students form two separate groups.

  4. Variable has the same variance in both populations - if not, Welch correction to be applied.

library(readxl)
mydata <- read_xlsx("./student-mat1.xlsx")

mydata <- as.data.frame(mydata)

head(mydata)
##   sex age higher Dalc Walc
## 1   1  18      1    1    1
## 2   1  17      1    1    1
## 3   1  15      1    2    3
## 4   1  15      1    1    1
## 5   1  16      1    1    2
## 6   0  16      1    1    2
mydata$sexF <- factor(mydata$sex, 
                         levels = c(0, 1), 
                         labels = c("Male", "Female"))
library(psych)
describeBy(mydata$Walc, g = mydata$sexF)
## 
##  Descriptive statistics by group 
## group: Male
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis  se
## X1    1 187 2.66 1.42      3    2.58 1.48   1   5     4 0.23    -1.31 0.1
## ------------------------------------------------------------ 
## group: Female
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 208 1.96 1.06      2    1.82 1.48   1   5     4 0.82    -0.21 0.07

Half of the men reported a weekend alcohol consumption of 3 (medium) or less, and half of the women reported a weekend alcohol consumption of 2 (low) or less (medians 3 and 2).

On average, male weekend alcohol consumption is equal to 2.66, while female weekend alcohol consumption is equal to 1.96 on average.
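Assumption 4 (equal variances) can be given a rough check from the standard deviations reported above. The sketch below uses only the rounded summary values from the describeBy() output, so the result is approximate, and the F test it relies on is itself sensitive to non-normality. The variable names are illustrative.

```r
# Summary values copied from the describeBy() output (SDs rounded to 2 decimals)
n_m <- 187; sd_m <- 1.42   # males
n_f <- 208; sd_f <- 1.06   # females

# Variance ratio (larger variance in the numerator) and two-sided F test
f_ratio <- sd_m^2 / sd_f^2
p_value <- 2 * pf(f_ratio, df1 = n_m - 1, df2 = n_f - 1, lower.tail = FALSE)

round(f_ratio, 2)
p_value < 0.05   # TRUE speaks against equal variances
```

A ratio well above 1 supports applying the Welch correction. With the raw data, var.test(Walc ~ sexF, data = mydata) would run the same test directly.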

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
Male <- ggplot(mydata[mydata$sexF == "Male",  ], aes(x = Walc)) +
               theme_linedraw() + 
               geom_bar(fill = "dodgerblue4") +
               ylab("Frequency") +
               ggtitle("Male")

Female <- ggplot(mydata[mydata$sexF == "Female",  ], aes(x = Walc)) +
                   theme_linedraw() + 
                   geom_bar(fill = "hotpink2") +
                   ylab("Frequency") +
                   ggtitle("Female")

library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggarrange(Male, Female,
          ncol = 2, nrow = 1)

Normality is doubtful already from the graphs, because both distributions are discrete and skewed to the right (skewness 0.23 for males and 0.82 for females), but I will also check normality with the Shapiro-Wilk test.

H0-1: Weekend alcohol consumption is normally distributed for males.

H1-1: Weekend alcohol consumption is not normally distributed for males.

H0-2: Weekend alcohol consumption is normally distributed for females.

H1-2: Weekend alcohol consumption is not normally distributed for females.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(sexF) %>%
  shapiro_test(Walc)
## # A tibble: 2 × 4
##   sexF   variable statistic        p
##   <fct>  <chr>        <dbl>    <dbl>
## 1 Male   Walc         0.870 1.34e-11
## 2 Female Walc         0.814 4.83e-15

We reject the null hypothesis H0-1 with p<0.001.

We reject the null hypothesis H0-2 with p<0.001.
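With only five possible values and close to 200 observations per group, a rejection by the Shapiro-Wilk test is to be expected for most shapes of the distribution. A minimal sketch with made-up counts illustrates this (the counts below are hypothetical and are NOT taken from the real data):

```r
# Hypothetical counts for the answers 1..5 (not the real data), 187 in total
counts <- c(50, 40, 45, 30, 22)
walc_fake <- rep(1:5, times = counts)

# Shapiro-Wilk test on a discrete 5-point variable
res <- shapiro.test(walc_fake)
res$statistic
res$p.value < 0.001
```

The many ties in such data pull the W statistic away from 1, so the test mainly confirms that a 5-point scale cannot be normally distributed.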

Parametric test - Independent samples t-test.

H0: The average alcohol consumption by men on weekends is the same as the average alcohol consumption by women (𝜇Walc male = 𝜇Walc female).

H1: The average alcohol consumption by men on weekends is not the same as the average alcohol consumption by women (𝜇Walc male ≠ 𝜇Walc female).

t.test(mydata$Walc ~ mydata$sexF, 
       var.equal = FALSE,
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  mydata$Walc by mydata$sexF
## t = 5.5666, df = 341.41, p-value = 5.263e-08
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  0.4567768 0.9559649
## sample estimates:
##   mean in group Male mean in group Female 
##             2.663102             1.956731

We reject H0 at p<0.001 => there is a difference between male and female weekend alcohol consumption.
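To make the Welch test more transparent, the statistic and the degrees of freedom can be recomputed by hand from the group summaries. This is only a sketch: the means come from the t.test() output and the standard deviations from describeBy(), where they are rounded to two decimals, so the values differ slightly from t = 5.5666 and df = 341.41. The variable names are illustrative.

```r
# Summary values from the outputs above (SDs are rounded to 2 decimals)
n_m <- 187; mean_m <- 2.663102; sd_m <- 1.42   # males
n_f <- 208; mean_f <- 1.956731; sd_f <- 1.06   # females

# Squared standard errors of the two group means
v_m <- sd_m^2 / n_m
v_f <- sd_f^2 / n_f

# Welch t statistic
t_stat <- (mean_m - mean_f) / sqrt(v_m + v_f)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_m + v_f)^2 / (v_m^2 / (n_m - 1) + v_f^2 / (n_f - 1))

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

round(c(t = t_stat, df = df_welch), 2)
p_value
```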

library(effectsize)
effectsize::cohens_d(mydata$Walc ~ mydata$sexF, 
                     pooled_sd = FALSE)
## Cohen's d |       95% CI
## ------------------------
## 0.57      | [0.36, 0.77]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.57, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)

The effect size is medium => there is a medium difference between male and female weekend alcohol consumption. Comparing the means, males consume more alcohol on weekends on average than females (2.66 vs. 1.96).
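Cohen's d with an un-pooled SD can be approximated from the same group summaries. The sketch assumes that the un-pooled SD is the square root of the average of the two group variances, and it uses the rounded SDs from describeBy(), so the result is only close to the reported 0.57.

```r
# Group summaries (means from t.test(), SDs rounded to 2 decimals)
mean_m <- 2.663102; sd_m <- 1.42   # males
mean_f <- 1.956731; sd_f <- 1.06   # females

# Un-pooled SD: square root of the average of the two variances (assumption)
sd_unpooled <- sqrt((sd_m^2 + sd_f^2) / 2)

# Standardized mean difference
d <- (mean_m - mean_f) / sd_unpooled
d
```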

Non-parametric test - Wilcoxon Rank Sum Test.

H0: The distribution location of weekend alcohol consumption is the same for men and women.

H1: The distribution location of weekend alcohol consumption is not the same for men and women.

wilcox.test(mydata$Walc ~ mydata$sexF,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Walc by mydata$sexF
## W = 24838, p-value = 7.4e-07
## alternative hypothesis: true location shift is not equal to 0

We reject H0 at p<0.001 => the distribution locations of weekend alcohol consumption differ between males and females.

library(effectsize)
effectsize(wilcox.test(mydata$Walc ~ mydata$sexF,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |       95% CI
## --------------------------------
## 0.28              | [0.17, 0.38]
interpret_rank_biserial(0.28)
## [1] "medium"
## (Rules: funder2019)

The effect size is medium => there is a medium difference between the male and female distribution locations of weekend alcohol consumption. Comparing the medians (3 vs. 2), males consume more alcohol on weekends than females.
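The rank-biserial correlation can be recovered from the W statistic and the two group sizes. A short sketch, assuming W is the Mann-Whitney U statistic of the first group (males), as reported by wilcox.test():

```r
# Values taken from the outputs above
W   <- 24838   # Wilcoxon rank sum statistic
n_m <- 187     # males
n_f <- 208     # females

# Rank-biserial correlation: share of favourable pairs minus unfavourable pairs
r_rb <- 2 * W / (n_m * n_f) - 1
round(r_rb, 2)
```

The value rounds to the 0.28 reported by effectsize().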

The results of Wilcoxon Rank Sum Test should be used, since the assumption of normality was violated.

RQ2: Pearson correlation analysis.

library(readxl)
mydata2 <- read_xlsx("./student-mat1.xlsx")

mydata2 <- as.data.frame(mydata2)

head(mydata2)
##   sex age higher Dalc Walc
## 1   1  18      1    1    1
## 2   1  17      1    1    1
## 3   1  15      1    2    3
## 4   1  15      1    1    1
## 5   1  16      1    1    2
## 6   0  16      1    1    2

H0: ρ_Dalc, Walc = 0

H1: ρ_Dalc, Walc ≠ 0

Is there a correlation between student workday alcohol consumption and weekend alcohol consumption?

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(mydata2[, c("Dalc", "Walc")], smooth = FALSE)

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(mydata2, columns = c("Dalc", "Walc"))

Reject H0 at p<0.001 => the linear relationship between workday alcohol consumption and weekend alcohol consumption is semi-strong and positive.

Another way to check the correlation between the 2 variables:

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata2[, c("Dalc", "Walc")]), 
      type = "pearson")
##      Dalc Walc
## Dalc 1.00 0.65
## Walc 0.65 1.00
## 
## n= 395 
## 
## 
## P
##      Dalc Walc
## Dalc       0  
## Walc  0
cor(mydata2$Dalc, mydata2$Walc,
    method = "pearson",
    use = "complete.obs")
## [1] 0.6475442

The linear relationship between workday alcohol consumption and weekend alcohol consumption is semi-strong and positive (r = 0.65).

cor.test(mydata2$Dalc, mydata2$Walc,
         method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  mydata2$Dalc and mydata2$Walc
## t = 16.846, df = 393, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5863365 0.7014001
## sample estimates:
##       cor 
## 0.6475442

Reject H0 at p<0.001 => the workday alcohol consumption and weekend alcohol consumption are positively correlated.
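The t statistic and the confidence interval of the correlation test follow directly from r and n. The sketch below recomputes them, assuming the interval is based on the Fisher z transformation; the variable names are illustrative.

```r
# Values taken from the output above
r <- 0.6475442
n <- 395

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(t_stat, 3)
round(ci, 4)
```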

RQ3: Pearson Chi2 test.

library(readxl)
mydata3 <- read_xlsx("./student-mat1.xlsx")

mydata3 <- as.data.frame(mydata3)

head(mydata3)
##   sex age higher Dalc Walc
## 1   1  18      1    1    1
## 2   1  17      1    1    1
## 3   1  15      1    2    3
## 4   1  15      1    1    1
## 5   1  16      1    1    2
## 6   0  16      1    1    2
  • sex - student’s sex (binary: ‘0’ - male or ‘1’ - female)
  • higher - wants to take higher education (binary: ‘0’ - no or ‘1’ - yes)

What is the relationship between the sex of the students and their wish to take higher education?

H0: There is no association between sex of the students and their wish to take higher education.

H1: There is an association.

mydata3$sexF <- factor(mydata3$sex, 
                         levels = c(0, 1), 
                         labels = c("Male", "Female"))

mydata3$higherF <- factor(mydata3$higher, 
                                levels = c(0, 1), 
                                labels = c("NO", "YES"))
results <- chisq.test(mydata3$sexF, mydata3$higherF, 
                      correct = TRUE)

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata3$sexF and mydata3$higherF
## X-squared = 7.6859, df = 1, p-value = 0.005565
addmargins(results$observed)
##             mydata3$higherF
## mydata3$sexF  NO YES Sum
##       Male    16 171 187
##       Female   4 204 208
##       Sum     20 375 395
round(results$expected, 2)
##             mydata3$higherF
## mydata3$sexF    NO    YES
##       Male    9.47 177.53
##       Female 10.53 197.47
round(results$residuals, 2)
##             mydata3$higherF
## mydata3$sexF    NO   YES
##       Male    2.12 -0.49
##       Female -2.01  0.46

We reject H0 at p=0.006 => There is an association between sex of the students and their wish to take higher education.
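The test statistic can be recomputed from the observed table. The sketch below rebuilds the 2x2 table from the counts shown above, derives the expected frequencies under independence, and applies the Yates continuity correction by hand; the object names are illustrative.

```r
# Observed counts from the output above (filled column by column)
observed <- matrix(c(16, 4, 171, 204), nrow = 2,
                   dimnames = list(sex = c("Male", "Female"),
                                   higher = c("NO", "YES")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic with Yates' continuity correction (subtract 0.5)
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)

round(expected, 2)
round(c(chi2 = chi2_yates, p = p_value), 4)
```

All expected frequencies are above 5, which is the usual rule of thumb for using the chi-squared approximation.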

Among the students who do not wish to take higher education, there are more males than expected (𝛼 = 0.05).

Among the students who do not wish to take higher education, there are fewer females than expected (𝛼 = 0.05).
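The residuals used for these statements are Pearson residuals, (observed - expected) / sqrt(expected). A positive residual means more observations than expected, a negative one fewer. A sketch that recomputes them from the observed counts and flags the cells beyond the critical value of about 1.96 (object names are illustrative):

```r
# Observed counts from the output above (filled column by column)
observed <- matrix(c(16, 4, 171, 204), nrow = 2,
                   dimnames = list(sex = c("Male", "Female"),
                                   higher = c("NO", "YES")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals and the cells that exceed the 5% critical value
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)
abs(pearson_res) > qnorm(0.975)
```

Only the two cells in the NO column exceed the critical value, with opposite signs for males and females.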

oddsratio(mydata3$sexF, mydata3$higherF)
## Odds ratio |        95% CI
## --------------------------
## 4.77       | [1.57, 14.54]

The odds of wishing to pursue higher education are 4.77 times higher for females than for males (95% CI [1.57, 14.54]).
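The odds ratio and its confidence interval can be recomputed from the observed counts. The sketch assumes a Wald interval on the log scale, which reproduces the reported bounds; the variable names are illustrative.

```r
# Observed counts from the contingency table above
yes_f <- 204; no_f <- 4     # females
yes_m <- 171; no_m <- 16    # males

# Odds of wishing to take higher education, females relative to males
odds_ratio <- (yes_f / no_f) / (yes_m / no_m)

# Wald 95% confidence interval on the log scale (assumption)
se_log <- sqrt(1 / yes_f + 1 / no_f + 1 / yes_m + 1 / no_m)
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log)

round(odds_ratio, 2)
round(ci, 2)
```

The same number can be read the other way round: the odds of not wishing to take higher education are 4.77 times higher for males than for females.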

interpret_oddsratio(4.77)
## [1] "medium"
## (Rules: chen2010)

The association between the sex of students and their wish to take a higher education is medium.