data <- read.table("C:/Julia/SEB LU/IMB/SEMESTER 2/Multivariate Analysis/HOMEWORK/student-mat.csv", header=TRUE, sep=",", dec=".")
head(data)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## 4 mother 1 3 0 no yes yes yes
## 5 father 1 2 0 no yes yes no
## 6 mother 1 2 0 no yes yes yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
The dataset was found on kaggle.com, and investigates the alcohol consumption of students attending Math and Portugese courses.
The unit of observation is a student attending Portugese or Math courses. It incorporates 33 variables in total, with 395 observations.
Data <- data [c(-1, -5, -6, -7, -8, -9, -10, -11, -13, -14, -15, -16, -17, -18, -19, -20, -21, -22, -23, -24, -25, -26, -29, -30)]
head(Data)
## sex age address guardian Dalc Walc G1 G2 G3
## 1 F 18 U mother 1 1 5 6 6
## 2 F 17 U father 1 1 5 5 6
## 3 F 15 U mother 2 3 7 8 10
## 4 F 15 U mother 1 1 15 14 15
## 5 F 16 U father 1 2 6 10 10
## 6 M 16 U mother 1 2 15 15 15
Description of variables:
sex: student’s sex (binary: ‘F’ - female or ‘M’ - male)
age: student’s age (numeric: from 15 to 22)
address: student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
guardian: student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
G1: first period Math or Portuguese grade (numeric: from 0 to 20)
G2: second period Math or Portuguese grade (numeric: from 0 to 20)
G3: final Math or Portuguese grade (numeric: from 0 to 20, output target)
RQ: Does alcohol consumption of students differ on workdays in comparison to weekends? What are the characteristics of alcohol consumption among students?
For this example we will compare Dalc, of the entire sample (urban and rural) to Walc, also of the entire sample.
Data$Difference <- Data$Dalc - Data$Walc
library(ggplot2)
ggplot(Data, aes(x = Difference)) +
geom_histogram(binwidth = 1, fill = "hotpink2") +
xlab("Differences")
We perform the Shapiro-Wilk normality test.
H0: variable is normally distributed
H1: variable is not normally distributed
shapiro.test(Data$Difference)
##
## Shapiro-Wilk normality test
##
## data: Data$Difference
## W = 0.82267, p-value < 2.2e-16
We reject the null hypothesis at p<2.2e-16. We cannot do a paired samples t-test, so we do the non-parametric alternative.
(We can also doublecheck normality this way:)
library(ggpubr)
ggqqplot(Data$Difference)
## Warning: The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
The Wilcoxon test compares the medians of the differences between the paired observations, rather than the means used in the t-test.
H0: Distribution locations of alcohol consumption are the same
H1: Distribution locations of alcohol consumption are different
wilcox.test(Data$Dalc, Data$Walc,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon signed rank test
##
## data: Data$Dalc and Data$Walc
## V = 317.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
We reject the null hypothesis and accept H1, at p<2.2e-16. As the p-value suggests that there is a statistically significant difference between the medians of Dalc and Walc. (= The difference between the medians of Dalc and Walc is very likely to be real and not due to random chance.)
library(effectsize)
effectsize(wilcox.test(Data$Dalc, Data$Walc,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.97 | [-0.98, -0.96]
interpret_rank_biserial(-0.97)
## [1] "very large"
## (Rules: funder2019)
Conclusions: Using the sample data, we find that the consumption of alcohol during weekdays differs from the alcohol consumption during weekends (p<2.2e-16). Moreover, the rank biserial, in this case is a negative correlation coefficient of -0.97 and indicates a strong negative association between the two variables - the difference is very large.
For this example we will compare Walc of students living in an urban environment to those students living in a rural one.
Data$addressR <- factor(Data$address,
levels = c("U", "R"),
labels = c("Urban", "Rural"))
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:effectsize':
##
## phi
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(Data$Walc, g = Data$addressR)
##
## Descriptive statistics by group
## group: Urban
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 307 2.22 1.27 2 2.08 1.48 1 5 4 0.67 -0.75 0.07
## ------------------------------------------------------------
## group: Rural
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 88 2.53 1.31 2 2.43 1.48 1 5 4 0.41 -0.97 0.14
library(ggplot2)
ggplot(Data, aes(x = Walc)) +
geom_histogram(binwidth = 1, fill="lightblue") +
facet_wrap(~addressR, ncol = 1) +
ylab("Frequency")
We must perform the Wilcoxon Rank Sum test:
H0: distribution locations are the same
H1: distribution locations are different
library(rstatix)
##
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
##
## cohens_d, eta_squared
## The following object is masked from 'package:stats':
##
## filter
Data %>%
group_by(addressR) %>%
shapiro_test(Walc)
## # A tibble: 2 × 4
## addressR variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Urban Walc 0.834 1.87e-17
## 2 Rural Walc 0.882 8.46e- 7
H0: distribution is normal
H1: distribution is not normal
=> Reject null hypothesis for Urban based on p-value (alpha = 5%).
=> Reject null hypothesis for Rural based on p-value.
wilcox.test(Data$Walc ~ Data$addressR,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: Data$Walc by Data$addressR
## W = 11614, p-value = 0.03684
## alternative hypothesis: true location shift is not equal to 0
Students from rural env. consume more alcohol.
library(effectsize)
effectsize(wilcox.test(Data$Walc ~ Data$addressR,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.14 | [-0.27, 0.00]
interpret_rank_biserial(-0.14)
## [1] "small"
## (Rules: funder2019)
Conclusions: Based on the sample data, we found that students from urban environments and students from rural environments differ in their alcohol consumption during weekends (p=0.03684) - rural students consume more alcohol during weekends, the difference in distribution is small (r=-0.14).
For the final hypothesis testing, we will check if there is a difference between final grades of the students (G3), based on who is their guardian (mother, father, other).
shapiro.test(Data$G3)
##
## Shapiro-Wilk normality test
##
## data: Data$G3
## W = 0.92873, p-value = 8.836e-13
H0: variable is normally distributed
H1: variable is not normally distributed
We reject the null hypothesis based on p=8.836e-13. We must perform a non-parametric test.
Data1 <- Data [c(-1, -2, -3, -5, -6, -7, -8)]
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data2 <- Data1 %>%
mutate(id = seq_len(nrow(Data1)))
library(dplyr)
Data2 <- Data2 %>%
select(id, everything())
head(Data2)
## id guardian G3 Difference addressR
## 1 1 mother 6 0 Urban
## 2 2 father 6 0 Urban
## 3 3 mother 10 -1 Urban
## 4 4 mother 15 0 Urban
## 5 5 father 10 -1 Urban
## 6 6 mother 15 -1 Urban
kruskal.test(G3 ~ guardian,
data = Data2)
##
## Kruskal-Wallis rank sum test
##
## data: G3 by guardian
## Kruskal-Wallis chi-squared = 3.8131, df = 2, p-value = 0.1486
H0: loc. distrib. of happiness are the same in all three groups
H1: at least 1 loc. distrib. is different
=> We fail to reject the null hypothesis based on p=0.1486 (alpha=0.05)
kruskal_effsize(G3 ~ guardian,
data = Data2)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 G3 395 0.00463 eta2[H] small
library(rstatix)
groups_nonpar <- wilcox_test(G3 ~ guardian,
paired = FALSE,
p.adjust.method = "bonferroni",
data = Data2)
groups_nonpar
## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 G3 father mother 90 273 12679 0.647 1 ns
## 2 G3 father other 90 32 1778. 0.048 0.144 ns
## 3 G3 mother other 273 32 5198. 0.078 0.233 ns
pwc <- Data2 %>%
wilcox_test(G3 ~ guardian,
paired = FALSE,
p.adjust.method = "bonferroni")
Kruskal_results <- kruskal_test(G3 ~ guardian,
data = Data2)
library(rstatix)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
ggboxplot(Data2, x = "guardian", y = "G3", add = "point", ylim=c(0, 20)) +
stat_pvalue_manual(pwc, hide.ns = FALSE) +
stat_summary(fun = median, geom = "point", shape = 16, size = 4,
aes(group = guardian), color = "purple",
position = position_dodge(width = 0.8)) +
stat_summary(fun = median, colour = "purple",
position = position_dodge(width = 0.8),
geom = "text", vjust = -0.5, hjust = -1,
aes(label = round(after_stat(y), digits = 2), group = guardian)) +
labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
caption = get_pwc_label(pwc))
Conclusions: We found that at a p-value of 0.15 (which means that there is a 15% chance that the difference in median values between the groups is due to random chance), we fail to reject the null hypothesis. This means that there is not enough evidence to suggest that there is a statistically significant difference in the median values of the the final grade based on grouping by types of guardians (x^2=3.81, p=0.15), moreover, the effect size was small (=0.00463). The post-hoc analysis with wilcox_test was done with a multiple comparison correction (Bonferroni correction), and it indicates that there is no significant difference between the groups as well. In conclusion, based on the analysis, there is no evidence to suggest that there is a statistically significant difference in the median values of the final grades of the students based on who their guardian is.