Hypothesis

We are interested in whether the amount of sleep students get is associated with their academic achievement. We suppose that students who sleep the least and the most will have lower grades, while students who sleep a normal amount (around 7-8 hours a day) will have the highest grades.

Data preparation

First, let us prepare our data.

The chosen dataset, together with its description, can be found at the link below.

https://www.kaggle.com/datasets/hopesb/student-depression-dataset/data

It is a big dataset with more than 27k observations. There are many groups of people in terms of their education status (school students, students of different majors and even PhD students); in our analysis we focus only on school students (grade 12 is the last school grade in India). These students are at least homogeneous in terms of their degree, which makes the analysis more meaningful. There are about 6k such observations.

The distribution of sleep duration among school students is shown below:

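library(dplyr)

# Load the data and convert all text columns to factors
data = read.csv("SDD.csv")
data = data %>% mutate_if(is.character, as.factor)

# Keep school students only and drop records with a zero CGPA
data = data %>% filter(Degree == "Class 12", CGPA > 0)

# Treat sleep duration as an ordered factor, from shortest to longest
data$Sleep.Duration = ordered(data$Sleep.Duration, levels = c("Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"))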

table(data$Sleep.Duration)
## 
## Less than 5 hours         5-6 hours         7-8 hours More than 8 hours 
##              1800              1308              1559              1406

There are more than 1300 observations in each group. We removed students whose GPA is 0 (there were only 4 or 5 of them, and these observations look like errors made during data collection).
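One detail of the preparation step is worth a small illustration: `ordered()` with an explicit `levels` argument silently turns any value outside those levels into `NA`, and `table()` does not show `NA` by default. The sketch below uses a made-up vector, and the "Others" category in it is hypothetical; we have not checked here whether the real file contains such values.

```r
# Made-up responses; "Others" is a hypothetical category not listed in the levels.
sleep_raw <- c("5-6 hours", "Less than 5 hours", "Others",
               "7-8 hours", "More than 8 hours")
sleep_levels <- c("Less than 5 hours", "5-6 hours",
                  "7-8 hours", "More than 8 hours")

sleep <- ordered(sleep_raw, levels = sleep_levels)

# Values outside `levels` become NA ...
n_missing <- sum(is.na(sleep))

# ... and table() leaves NA out unless asked to show it.
counts_default <- table(sleep)
counts_with_na <- table(sleep, useNA = "ifany")
print(counts_with_na)
```

Comparing the total of `table()` with `nrow(data)` is a quick way to see whether any rows were dropped this way.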

We have a long night ahead, GPA is a continuous variable and hours of sleep is an ordinal one with four levels, so why not suffer through a one-way ANOVA.

Assumption check

First, let us check the ANOVA assumptions. The observations are independent (one observation is one student).

Let us check normality in each group. We have a long night, so let us check it in three different ways; we skip the Shapiro-Wilk test because we expect it to be too sensitive for such big groups.

library(ggplot2)
ggplot(data, aes(x = CGPA, fill = Sleep.Duration)) + geom_density(alpha = 0.2)

Judging by the density plots, the data do not seem to be normally distributed in any group.

library(psych)
library(kableExtra)

describeBy(data$CGPA, data$Sleep.Duration, mat = TRUE) %>% 
  select(Sleep_duration = group1, N = n, Mean = mean, SD = sd, Median = median, Min = min, Max = max, 
         Skew = skew, Kurtosis = kurtosis, st.error = se) %>% 
  kable(align = c("lrrrrrrrrr"), digits = 2, row.names = FALSE, caption = "Sleeping hours and GPA")

Sleeping hours and GPA

Sleep_duration        N  Mean    SD  Median   Min    Max   Skew  Kurtosis  st.error
Less than 5 hours  1800  7.61  1.41    7.75  5.03  10.00  -0.04     -1.13      0.03
5-6 hours          1308  7.65  1.40    7.77  5.03   9.98  -0.10     -1.19      0.04
7-8 hours          1559  7.66  1.44    7.83  5.03  10.00  -0.08     -1.22      0.04
More than 8 hours  1406  7.49  1.44    7.48  5.03  10.00   0.07     -1.22      0.04

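Skewness is very close to zero in each group, which could make us think the data are distributed normally. Kurtosis, however, is around -1.2 in every group, which means the distributions are noticeably flatter than a normal one.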
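To put the kurtosis values in context, here is a minimal sketch on simulated data, assuming the table reports excess kurtosis (0 for a normal distribution), which is our understanding of what `psych` does. The helper functions `skewness()` and `excess_kurtosis()` are our own simple moment-based versions, not the exact estimators used by `describeBy()`. A perfectly flat (uniform) distribution has an excess kurtosis of -1.2, so values like those in the table point towards a flat shape rather than a bell shape.

```r
set.seed(1)

# Simple moment-based estimators; both are 0 for a normal distribution.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

n <- 1e5
flat <- runif(n, min = 5, max = 10)      # flat distribution on the CGPA range
bell <- rnorm(n, mean = 7.6, sd = 1.4)   # normal with a similar mean and SD

shape <- rbind(
  uniform = c(skew = skewness(flat), kurtosis = excess_kurtosis(flat)),
  normal  = c(skew = skewness(bell), kurtosis = excess_kurtosis(bell))
)
print(round(shape, 2))
```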

library(ggpubr)

data1 = data %>% filter(Sleep.Duration == "Less than 5 hours")
data2 = data %>% filter(Sleep.Duration == "5-6 hours")
data3 = data %>% filter(Sleep.Duration == "7-8 hours")
data4 = data %>% filter(Sleep.Duration == "More than 8 hours")

# "Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"
ggqqplot(data1$CGPA, title = "Less than 5 hours")

ggqqplot(data2$CGPA, title = "5-6 hours")

ggqqplot(data3$CGPA, title = "7-8 hours")

ggqqplot(data4$CGPA, title = "More than 8 hours")

The QQ-plots, together with the density plots, show that the data are not really normal.
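The earlier remark that the Shapiro-Wilk test is too sensitive for big groups can be illustrated with a small simulation. This is only a sketch on made-up data: the scores are drawn from a t distribution with 5 degrees of freedom, which is bell-shaped with somewhat heavier tails than a normal one. With thousands of observations the test has enough power to flag even this modest departure, while the verdict for a small sample depends on the particular draw.

```r
set.seed(42)

# Simulated values from a bell-shaped, slightly heavy-tailed distribution.
small <- rt(30, df = 5)
large <- rt(4000, df = 5)   # shapiro.test() accepts between 3 and 5000 values

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value

cat(sprintf("n = %d: p = %.4f\n", length(small), p_small))
cat(sprintf("n = %d: p = %.2e\n", length(large), p_large))
```

This is why we rely on plots and descriptive statistics instead of a formal normality test for groups of 1300 to 1800 students.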

Equality of variances

Levene’s test gives no evidence that the variances differ between groups (Pr(>F) = 0.16, which is above 0.05). From the table above we can also see that the standard deviations are nearly the same in all groups (around 1.4).

library(car)
leveneTest(data$CGPA ~ data$Sleep.Duration)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    3  1.7103 0.1626
##       6069
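For readers who wonder what the test does, the median-centred Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on simulated data; the groups `a` to `d` and their scores are made up for illustration.

```r
set.seed(7)

# Simulated scores for four hypothetical groups with the same spread.
group <- factor(rep(c("a", "b", "c", "d"), each = 200))
score <- rnorm(800, mean = 7.6, sd = 1.4)

# Absolute deviation of each score from the median of its own group.
abs_dev <- abs(score - ave(score, group, FUN = median))

# A one-way ANOVA on these deviations gives the Levene F and p-value.
levene_tab <- anova(lm(abs_dev ~ group))
print(levene_tab)

levene_F <- levene_tab[["F value"]][1]
levene_p <- levene_tab[["Pr(>F)"]][1]
```

A large F here would mean that the typical distance from the median differs between groups, that is, the spreads are unequal.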

Test

Simple visualisation

Let us proceed to the test.

A simple visualisation suggests that there are no large differences between the groups: in all cases the distributions look quite similar. Only for those who sleep more than 8 hours on average do the median and the distribution as a whole point to a somewhat lower GPA.

ggplot(data, aes(x = Sleep.Duration, y = CGPA)) + geom_boxplot()

Parametric test

Let us first conduct a parametric test (even though the distributions are not normal).

anova = aov(data$CGPA ~ data$Sleep.Duration)
summary(anova)
##                       Df Sum Sq Mean Sq F value  Pr(>F)   
## data$Sleep.Duration    3     24   8.109       4 0.00742 **
## Residuals           6069  12304   2.027                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pr(>F) is smaller than 0.05, so there is a statistically significant difference between at least two of the groups.
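As a quick check of how the table fits together, the F value is the ratio of the two mean squares, and Pr(>F) is the upper tail of the F distribution with the corresponding degrees of freedom. The numbers below are copied from the rounded ANOVA table, so the result agrees with the printed one only up to rounding.

```r
# Values copied from the rounded ANOVA table above.
ms_between <- 8.109   # Mean Sq for Sleep.Duration
ms_within  <- 2.027   # Mean Sq for Residuals
df_between <- 3
df_within  <- 6069

f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

cat(sprintf("F = %.3f, p = %.4f\n", f_value, p_value))
```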

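# Diagnostic plots: residuals vs fitted, normal Q-Q, scale-location
layout(matrix(1:4, 2, 2))
plot(anova, 1)
plot(anova, 2)
plot(anova, 3)
layout(1)  # back to a single-panel layout

residuals = residuals(object = anova)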

The residuals look acceptable overall. Let us proceed to the post-hoc comparisons.

pairwise.t.test(data$CGPA, data$Sleep.Duration, p.adjust.method = "holm", pool.sd = TRUE)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$CGPA and data$Sleep.Duration 
## 
##                   Less than 5 hours 5-6 hours 7-8 hours
## 5-6 hours         0.937             -         -        
## 7-8 hours         0.937             0.955     -        
## More than 8 hours 0.108             0.019     0.012    
## 
## P value adjustment method: holm

We can see that only those sleeping more than 8 hours on average differ significantly in GPA from two other groups (5-6 hours a day and 7-8 hours a day), and their mean GPA is lower. So those who sleep “too much” tend to achieve slightly less. The group sleeping less than 5 hours also has a marginally lower mean than those sleeping 5-8 hours, but this difference is far from statistically significant (adjusted p = 0.94), so we cannot conclude anything from it.
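The output above reports Holm-adjusted p-values. As a small sketch of what this adjustment does, and how it compares with the more conservative Bonferroni correction, here is `p.adjust()` applied to made-up raw p-values (they are not taken from our data).

```r
# Made-up raw p-values for four hypothetical comparisons.
raw_p <- c(0.01, 0.02, 0.03, 0.50)

# Holm: the smallest p is multiplied by 4, the next by 3, and so on,
# and the adjusted values are kept non-decreasing.
holm <- p.adjust(raw_p, method = "holm")

# Bonferroni: every p is multiplied by 4 and capped at 1.
bonferroni <- p.adjust(raw_p, method = "bonferroni")

print(rbind(raw = raw_p, holm = holm, bonferroni = bonferroni))
```

Holm never gives larger adjusted p-values than Bonferroni, so it keeps the same protection against false positives with somewhat more power.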

library(sjstats)
anova_stats(anova)$omegasq
## [1] 0.001    NA

For the parametric test, omega-squared indicates a very small effect: sleep duration accounts for only about 0.1% of the variance in GPA.
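Omega-squared can be recovered from the ANOVA table by hand. The sketch below uses the textbook formula for a one-way design and the rounded values from the table above, so it matches the printed 0.001 only up to rounding; eta-squared is shown next to it for comparison.

```r
# Values taken from the rounded ANOVA table above.
df_between <- 3
ms_between <- 8.109
ms_within  <- 2.027
ss_within  <- 12304

ss_between <- df_between * ms_between   # about 24.3, shown as 24 in the table
ss_total   <- ss_between + ss_within

# Textbook one-way formulas for the two effect sizes.
omega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)
eta_sq   <- ss_between / ss_total

cat(sprintf("omega^2 = %.4f, eta^2 = %.4f\n", omega_sq, eta_sq))
```

By the commonly used rule of thumb, values below 0.01 are read as negligible, so both measures tell the same story.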

Non-parametric test

Since distributions are not normal, let us also do a non-parametric test.

kruskal.test(data$CGPA, data$Sleep.Duration)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$CGPA and data$Sleep.Duration
## Kruskal-Wallis chi-squared = 11.983, df = 3, p-value = 0.007443

library(DescTools)
DunnTest(data$CGPA ~ data$Sleep.Duration, data)
## 
##  Dunn's test of multiple comparisons using rank sums : holm  
## 
##                                     mean.rank.diff   pval    
## 5-6 hours-Less than 5 hours              61.864027 0.8443    
## 7-8 hours-Less than 5 hours              65.331399 0.8443    
## More than 8 hours-Less than 5 hours    -133.812965 0.1280    
## 7-8 hours-5-6 hours                       3.467373 0.9579    
## More than 8 hours-5-6 hours            -195.676992 0.0183 *  
## More than 8 hours-7-8 hours            -199.144364 0.0121 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Overall, these results agree with those of the parametric test: the same two pairs of groups differ.
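To see what the Kruskal-Wallis statistic measures, here is a minimal sketch that computes it by hand from rank sums and compares it with `kruskal.test()`. The data are simulated and the groups `a` to `d` are hypothetical; the scores are continuous, so there are no ties and no tie correction is needed.

```r
set.seed(3)

# Simulated continuous scores (no ties) for four hypothetical groups.
group <- factor(rep(c("a", "b", "c", "d"), times = c(40, 50, 60, 50)))
score <- rnorm(length(group), mean = 7.6, sd = 1.4)

n_total   <- length(score)
r         <- rank(score)               # ranks over the pooled sample
rank_sums <- tapply(r, group, sum)     # sum of ranks in each group
n_group   <- tapply(r, group, length)  # group sizes

# H statistic without tie correction.
h_manual <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / n_group) -
  3 * (n_total + 1)
p_manual <- pchisq(h_manual, df = nlevels(group) - 1, lower.tail = FALSE)

kw <- kruskal.test(score, group)
h_builtin <- unname(kw$statistic)

cat(sprintf("manual H = %.4f, kruskal.test H = %.4f\n", h_manual, h_builtin))
```

The statistic grows when the average ranks of the groups drift apart, which is why the test does not need the normality assumption.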

library(rcompanion)
epsilonSquared(data$CGPA, data$Sleep.Duration)
## epsilon.squared 
##         0.00197

Here, too, the effect size is very small.
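The epsilon-squared value can be checked by hand, assuming the usual definition for the Kruskal-Wallis test, H / (n - 1). Both inputs are taken from the outputs shown earlier: the chi-squared statistic and the four group sizes.

```r
h_stat  <- 11.983                       # Kruskal-Wallis chi-squared from the output above
n_total <- 1800 + 1308 + 1559 + 1406    # group sizes from the sleep duration table

epsilon_sq <- h_stat / (n_total - 1)

cat(sprintf("n = %d, epsilon^2 = %.5f\n", n_total, epsilon_sq))
```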

Visualisation

And now let us put our results into one big visualisation.

library(ggstatsplot)
ggbetweenstats(data, x = Sleep.Duration, y = CGPA, type = "nonparametric", var.equal = TRUE)

Results

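To sum up, we can say that those who sleep too much (more than 8 hours) perform slightly worse than those who sleep 5-8 hours. However, the difference and the effect size are very small. Our hypothesis is therefore only partly supported: the longest sleepers have somewhat lower grades, but we found no significant disadvantage for those who sleep the least. We should keep in mind that academic achievement is a complex outcome affected by many factors and, as our data show, sleep is one of them but probably not the most important one.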