We are interested in whether the amount of sleep students get is associated with their academic achievement. We hypothesize that students who sleep the least and the most have lower grades, while students who sleep a normal amount (around 7-8 hours a day) have the highest grades.
First, let us prepare our data.
The chosen dataset, together with its description, can be found at the following link:
https://www.kaggle.com/datasets/hopesb/student-depression-dataset/data
It is a big dataset with more than 27k observations. It covers many groups of people by education status (school students, students of different majors and even PhD students). In our analysis we focus only on school students (Class 12 is the last school grade in India). At least these students are homogeneous in terms of their degree, which makes the analysis more meaningful. There are about 6k such observations.
The distribution of amount of sleep for school students can be found below:
data = read.csv("SDD.csv")
library(dplyr)
data = data %>% mutate_if(is.character, as.factor)
data = data %>% filter(Degree == "Class 12", CGPA > 0)
data$Sleep.Duration = ordered(data$Sleep.Duration, levels = c("Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"))
table(data$Sleep.Duration)
##
## Less than 5 hours         5-6 hours         7-8 hours More than 8 hours
##              1800              1308              1559              1406
There are more than 1300 observations in each group. We removed students whose GPA was 0 (there were only 4 or 5 of them; these observations seem to be mistakes made during data collection).
We have a long night ahead, GPA as a continuous variable and hours of sleep as an ordinal one, so why not suffer through an ANOVA?
First, let us check the ANOVA assumptions. The observations are independent (one observation per student).
Let us check normality in each group (we have a long night, so let us check it in 3 different ways; we suppose, however, that the Shapiro-Wilk test would be too sensitive for groups this large to be useful).
library(ggplot2)
ggplot(data, aes(x = CGPA, fill = Sleep.Duration)) + geom_density(alpha = 0.2)
Judging by the density plots, the data do not seem to be normally distributed in any group.
library(psych)
library(kableExtra)
describeBy(data$CGPA, data$Sleep.Duration, mat = TRUE) %>%
  select(Sleep_duration = group1, N = n, Mean = mean, SD = sd, Median = median, Min = min, Max = max,
         Skew = skew, Kurtosis = kurtosis, st.error = se) %>%
  kable(align = c("lrrrrrrrrr"), digits = 2, row.names = FALSE, caption = "Sleeping hours and GPA")
| Sleep_duration | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | st.error |
|---|---|---|---|---|---|---|---|---|---|
| Less than 5 hours | 1800 | 7.61 | 1.41 | 7.75 | 5.03 | 10.00 | -0.04 | -1.13 | 0.03 |
| 5-6 hours | 1308 | 7.65 | 1.40 | 7.77 | 5.03 | 9.98 | -0.10 | -1.19 | 0.04 |
| 7-8 hours | 1559 | 7.66 | 1.44 | 7.83 | 5.03 | 10.00 | -0.08 | -1.22 | 0.04 |
| More than 8 hours | 1406 | 7.49 | 1.44 | 7.48 | 5.03 | 10.00 | 0.07 | -1.22 | 0.04 |
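Skew is very close to zero in each group, which on its own could make us think the data are normally distributed. Kurtosis of about -1.2, however, points to distributions that are flatter than a normal one.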
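The kurtosis values of about -1.2 in the table deserve a second look. Assuming `psych` reports excess kurtosis (0 for a normal distribution), -1.2 is the theoretical value for a flat, uniform distribution. The sketch below illustrates this on simulated data only (not the CGPA values); the helper functions `skewness()` and `excess_kurtosis()` are written here for illustration and use simple moment estimators, which may differ slightly from the estimator `psych` applies.

```r
# Simulated stand-in data, NOT the CGPA values from the dataset.
set.seed(1)

# Moment-based skewness and excess kurtosis (a normal distribution gives 0 for both).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

flat <- runif(1e5, min = 5, max = 10)     # flat distribution on the CGPA range
bell <- rnorm(1e5, mean = 7.6, sd = 1.4)  # normal distribution for comparison

skew_flat <- skewness(flat)         # theoretical value: 0, same as for the normal
kurt_flat <- excess_kurtosis(flat)  # theoretical value for a uniform: -1.2
kurt_bell <- excess_kurtosis(bell)  # theoretical value for a normal: 0

round(c(skew_flat = skew_flat, kurt_flat = kurt_flat, kurt_bell = kurt_bell), 2)
```

So a skew near zero combined with a kurtosis near -1.2 is what a symmetric but flat distribution would produce.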
library(ggpubr)
data1 = data %>% filter(Sleep.Duration == "Less than 5 hours")
data2 = data %>% filter(Sleep.Duration == "5-6 hours")
data3 = data %>% filter(Sleep.Duration == "7-8 hours")
data4 = data %>% filter(Sleep.Duration == "More than 8 hours")
# "Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"
ggqqplot(data1$CGPA, title = "Less than 5 hours")
ggqqplot(data2$CGPA, title = "5-6 hours")
ggqqplot(data3$CGPA, title = "7-8 hours")
ggqqplot(data4$CGPA, title = "More than 8 hours")
But the QQ-plots, together with the density plots, show that the data are not really normal.
Levene’s test gives no evidence that the variances differ among the groups (Pr(>F) = 0.16 > 0.05). From the table above we can also see that the standard deviations are more or less the same in all groups (around 1.4).
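For completeness, here is a minimal sketch of the Shapiro-Wilk check mentioned earlier, run on one simulated group of 1500 values (an assumption chosen to resemble the group sizes, not the real CGPA data). With samples this large the p-value drops far below 0.05 for a flat distribution even though the W statistic itself can stay fairly high, so the plots are arguably more informative than the p-value.

```r
set.seed(42)
n <- 1500  # comparable to the group sizes here; shapiro.test() accepts 3 to 5000 values

# Simulated stand-in for one group: flat on [5, 10], NOT the real CGPA data.
flat_group <- runif(n, min = 5, max = 10)

sw <- shapiro.test(flat_group)
sw$statistic  # W statistic (1 would mean a perfect match with normality)
sw$p.value    # p-value of the test

# On the real data (assuming `data` as prepared above), one test per group could be:
# tapply(data$CGPA, data$Sleep.Duration, function(x) shapiro.test(x)$p.value)
```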
library(car)
leveneTest(data$CGPA ~ data$Sleep.Duration)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    3  1.7103 0.1626
##       6069
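What Levene's test with `center = median` does can be sketched in base R: take the absolute deviation of every value from its own group median and run an ordinary one-way ANOVA on those deviations. The data below are simulated with equal spread in four hypothetical groups `A` to `D`; they are an illustration, not the dataset used here.

```r
set.seed(7)
# Simulated stand-in data with four groups of equal spread (NOT the real dataset).
g <- factor(rep(c("A", "B", "C", "D"), each = 200))
y <- rnorm(length(g), mean = 7.6, sd = 1.4)

# Step 1: absolute deviation of each value from the median of its own group.
abs_dev <- abs(y - ave(y, g, FUN = median))

# Step 2: ordinary one-way ANOVA on those deviations.
fit <- anova(lm(abs_dev ~ g))
f_value <- fit[["F value"]][1]
p_value <- fit[["Pr(>F)"]][1]
c(F = f_value, p = p_value)
```

A large p-value here means the typical distance from the group median is similar across groups, which is the equal-variance reading used above.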
Let us proceed to the test.
From a simple visualisation it looks like there are no real differences between the groups: in all cases the distributions seem quite similar. Only for those who sleep more than 8 hours on average do the median, and the distribution in general, point to a somewhat lower GPA.
ggplot(data, aes(x = Sleep.Duration, y = CGPA)) + geom_boxplot()
Let us first conduct a parametric test (even though the distributions are not normal).
anova = aov(data$CGPA ~ data$Sleep.Duration)
summary(anova)
##                       Df Sum Sq Mean Sq F value  Pr(>F)
## data$Sleep.Duration    3     24   8.109       4 0.00742 **
## Residuals           6069  12304   2.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pr(>F) is smaller than 0.05, so there is a statistically significant difference between at least two groups.
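As a quick check of where these numbers come from, the F value is the ratio of the two mean squares, and Pr(>F) is the upper tail of the F distribution with 3 and 6069 degrees of freedom. The sketch below uses the rounded values printed in the table, so the result can differ slightly in the last digits.

```r
# Values copied from the ANOVA table above (rounded as printed).
ms_between <- 8.109  # Mean Sq for Sleep.Duration
ms_within  <- 2.027  # Mean Sq for Residuals
df_between <- 3
df_within  <- 6069

f_value <- ms_between / ms_within
# Upper-tail probability of the F distribution, i.e. the Pr(>F) column.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 2)
signif(p_value, 3)
```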
layout(matrix(1:4, 2, 2))
plot(anova, 1)
plot(anova, 2)
plot(anova, 3)
residuals = residuals(object = anova)
The residuals are distributed more or less OK. Let us proceed to post-hoc.
# Holm adjustment, as reported in the output below
pairwise.t.test(data$CGPA, data$Sleep.Duration, p.adjust.method = "holm", pool.sd = TRUE)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$CGPA and data$Sleep.Duration
##
##                   Less than 5 hours 5-6 hours 7-8 hours
## 5-6 hours         0.937             -         -
## 7-8 hours         0.937             0.955     -
## More than 8 hours 0.108             0.019     0.012
##
## P value adjustment method: holm
We can see that only those sleeping more than 8 hours on average differ in GPA from two other groups (5-6 hours a day and 7-8 hours a day). So, those who sleep “too much” tend to achieve less. At the same time (even though it is not statistically significant) those who sleep less than 5 hours also have a very slightly lower mean GPA than those sleeping 5-8 hours.
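The output above reports Holm-adjusted p-values, which is the default method of `pairwise.t.test()` (set through its `p.adjust.method` argument). As a sketch of how Holm differs from Bonferroni, the example below applies both to six hypothetical raw p-values; these numbers are made up for illustration and are not the unadjusted p-values of this analysis.

```r
# Hypothetical raw p-values for six pairwise comparisons (NOT taken from this analysis).
raw_p <- c(0.002, 0.004, 0.03, 0.4, 0.5, 0.9)

holm <- p.adjust(raw_p, method = "holm")
bonf <- p.adjust(raw_p, method = "bonferroni")

# Holm by hand: multiply the k-th smallest p-value by (m - k + 1),
# enforce monotonicity with a running maximum, and cap at 1.
m <- length(raw_p)
ord <- order(raw_p)
holm_manual <- numeric(m)
holm_manual[ord] <- pmin(1, cummax(raw_p[ord] * (m - seq_len(m) + 1)))

data.frame(raw_p, holm, bonf, holm_manual)
```

Holm-adjusted values are never larger than the Bonferroni ones, so Holm keeps the same error control while being less conservative.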
library(sjstats)
anova_stats(anova)$omegasq
## [1] 0.001 NA
For the parametric test, omega-squared (0.001) indicates a very small effect.
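The omega-squared value can be roughly reproduced from the ANOVA table with the usual one-way formula, (SS_between - df_between * MS_within) / (SS_total + MS_within). This is a sketch on the rounded values printed above, and it assumes `anova_stats()` uses this standard formula.

```r
# Values copied from the ANOVA table above (rounded as printed).
df_between <- 3
ms_between <- 8.109
ms_within  <- 2.027
ss_within  <- 12304
ss_between <- df_between * ms_between  # about 24.3, printed as 24 in the table
ss_total   <- ss_between + ss_within

omega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)
round(omega_sq, 3)
```

In other words, sleep duration accounts for well under 1% of the variance in GPA.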
Since distributions are not normal, let us also do a non-parametric test.
kruskal.test(data$CGPA, data$Sleep.Duration)
##
## Kruskal-Wallis rank sum test
##
## data: data$CGPA and data$Sleep.Duration
## Kruskal-Wallis chi-squared = 11.983, df = 3, p-value = 0.007443
library(DescTools)
DunnTest(data$CGPA ~ data$Sleep.Duration, data)
##
## Dunn's test of multiple comparisons using rank sums : holm
##
##                                     mean.rank.diff   pval
## 5-6 hours-Less than 5 hours              61.864027 0.8443
## 7-8 hours-Less than 5 hours              65.331399 0.8443
## More than 8 hours-Less than 5 hours    -133.812965 0.1280
## 7-8 hours-5-6 hours                       3.467373 0.9579
## More than 8 hours-5-6 hours            -195.676992 0.0183 *
## More than 8 hours-7-8 hours            -199.144364 0.0121 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Generally, all of these results agree with those of the parametric test.
library(rcompanion)
epsilonSquared(data$CGPA, data$Sleep.Duration)
## epsilon.squared
## 0.00197
Here, too, the effect size is very small.
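The epsilon-squared value follows directly from the Kruskal-Wallis statistic: with H the chi-squared value and n the total sample size, epsilon-squared is H / ((n^2 - 1) / (n + 1)), which simplifies to H / (n - 1). A minimal check using the numbers printed above:

```r
h <- 11.983                      # Kruskal-Wallis chi-squared from the output above
n <- 1800 + 1308 + 1559 + 1406   # total sample size from the group counts

epsilon_sq <- h / (n - 1)        # same as h / ((n^2 - 1) / (n + 1))
round(epsilon_sq, 5)
```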
And now let us make a big visualization of our results.
library(ggstatsplot)
ggbetweenstats(data, x = Sleep.Duration, y = CGPA, type = "nonparametric")
To sum up, we can say that those who sleep too much perform slightly worse than those who sleep less. However, the difference and the effect size are very small. We should keep in mind that academic achievement is a complex outcome affected by many factors; as our data show, sleep is one of them but probably not the most important one.