Data Analysis of Student Stress Factors

Author

James Leleji

Introduction

This project analyzes stress factors among students in higher education institutions. The data is a secondary dataset obtained from Kaggle. The goal of the project is to determine how these stress factors affect students' academic performance. A combination of exploratory data analysis and linear regression is used to uncover insights from which informed conclusions can be drawn.

Variable Description

The data contains 6 numeric variables, each rated on a scale of 1 to 5. They are described as follows:

Sleep quality: The quality of the student's sleep, rated on a scale of 1 to 5.
Headache: How many times a week the student suffers a headache.
Academic performance: How the student rates their academic performance.
Study load: The number of credits enrolled in a given semester.
Extracurricular activities: How many times a week the student takes part in extracurricular activities.
Stress levels: How the student rates their stress level.

Data Cleaning

Loading necessary packages

library(psych)
library(ggcorrplot)
library(tidyverse)
library(readxl)
library(janitor)

# import data
student <- read_excel("student_stress_factors.xlsx")

After running this line of code, the data is imported into the R environment and ready for analysis.

# make all variable names lower case and consistent
student <- student |> 
  clean_names()

The variable names have now been cleaned: they are lower case, and any spaces have been replaced with underscores.
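As a small illustration of this renaming, janitor's make_clean_names() applies the same cleaning to a plain character vector. The raw headers below are hypothetical stand-ins, since the original spreadsheet headers are not shown in this report.

```r
library(janitor)

# Hypothetical raw headers (assumed for illustration; not taken from the dataset)
raw_names <- c("Sleep Quality", "Academic Performance", "Stress Levels")

# make_clean_names() lower-cases the names and replaces spaces with underscores
cleaned <- make_clean_names(raw_names)
print(cleaned)
```

clean_names() performs this step on the column names of a whole data frame, which is why it is used on student above.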

# A snapshot of the data using the glimpse function
glimpse(student)

Rows: 53
Columns: 7
$ timestamp                  <chr> "27/10/2023 21:54:15", "28/10/2023 12:24:40…
$ sleep_quality              <dbl> 3, 4, 2, 3, 2, 3, 3, 4, 2, 1, 2, 3, 2, 4, 4…
$ headache                   <dbl> 1, 1, 1, 2, 3, 1, 5, 3, 1, 2, 3, 1, 3, 1, 1…
$ academic_performance       <dbl> 3, 2, 2, 3, 1, 3, 1, 1, 4, 3, 5, 5, 3, 4, 3…
$ study_load                 <dbl> 4, 3, 1, 2, 5, 2, 4, 4, 4, 2, 5, 1, 4, 4, 2…
$ extracurricular_activities <dbl> 2, 3, 4, 3, 5, 1, 3, 1, 5, 5, 2, 4, 4, 1, 5…
$ stress_levels              <dbl> 3, 2, 4, 3, 3, 1, 5, 1, 1, 2, 4, 1, 3, 1, 2…

We can see that there are 53 observations and 7 columns in the data.

# doubles already have class "numeric" in R, so this conversion leaves the values unchanged
student <- student |>
  mutate(across(where(is.double), as.numeric))
# check the class of each column
sapply(student, class)

                 timestamp              sleep_quality 
               "character"                  "numeric" 
                  headache       academic_performance 
                 "numeric"                  "numeric" 
                study_load extracurricular_activities 
                 "numeric"                  "numeric" 
             stress_levels 
                 "numeric"

R reports double-precision columns as class "numeric", so the six rating variables (with values from 1 to 5) are confirmed to be numeric, while timestamp remains a character column.
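To make the relationship between "double" and "numeric" concrete, here is a minimal base R sketch. The vector x is a made-up example, not a column from the data.

```r
# A made-up vector of ratings (illustrative only)
x <- c(3, 4, 2)

# Numeric literals are stored as doubles, and class() reports them as "numeric"
storage_type <- typeof(x)
reported_class <- class(x)
print(c(storage_type, reported_class))

# as.numeric() on a double vector returns an identical vector
y <- as.numeric(x)
print(identical(x, y))
```

The storage type (double) and the class (numeric) describe the same vector, so checking with sapply(student, class) is enough to confirm that the rating columns are usable in numeric summaries.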

Exploratory Data Analysis

describe(student)

                           vars  n  mean    sd median trimmed   mad min max
timestamp*                    1 53 27.00 15.44     27   27.00 19.27   1  53
sleep_quality                 2 53  3.15  1.20      3    3.16  1.48   1   5
headache                      3 53  1.98  1.26      1    1.79  0.00   1   5
academic_performance          4 53  3.23  1.15      3    3.28  1.48   1   5
study_load                    5 53  2.81  1.43      2    2.77  1.48   1   5
extracurricular_activities    6 53  2.89  1.45      3    2.86  1.48   1   5
stress_levels                 7 53  2.79  1.38      3    2.74  1.48   1   5
                           range  skew kurtosis   se
timestamp*                    52  0.00    -1.27 2.12
sleep_quality                  4  0.04    -1.00 0.16
headache                       4  0.93    -0.37 0.17
academic_performance           4 -0.36    -0.52 0.16
study_load                     4  0.17    -1.44 0.20
extracurricular_activities     4  0.16    -1.34 0.20
stress_levels                  4  0.19    -1.23 0.19

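This is a concise statistical summary of each variable. The asterisk on timestamp marks a non-numeric column that describe() converted to numeric codes, so its statistics are not meaningful.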

#count total missing values in each column
sapply(student, function(x) sum(is.na(x)))

                 timestamp              sleep_quality 
                         0                          0 
                  headache       academic_performance 
                         0                          0 
                study_load extracurricular_activities 
                         0                          0 
             stress_levels 
                         0

There are no missing values in the data.

# select numeric variables
student_numeric <- dplyr::select(student, where(is.numeric))

Only the numeric variables are selected so that the correlation coefficients among them can be computed.

# compute the correlation matrix to see relationships among the variables
student_cor <- cor(student_numeric, use="complete.obs")
round(student_cor,2)

                           sleep_quality headache academic_performance
sleep_quality                       1.00     0.03                 0.27
headache                            0.03     1.00                -0.12
academic_performance                0.27    -0.12                 1.00
study_load                          0.10     0.10                 0.07
extracurricular_activities          0.00    -0.18                 0.05
stress_levels                       0.29    -0.04                 0.01
                           study_load extracurricular_activities stress_levels
sleep_quality                    0.10                       0.00          0.29
headache                         0.10                      -0.18         -0.04
academic_performance             0.07                       0.05          0.01
study_load                       1.00                       0.05          0.34
extracurricular_activities       0.05                       1.00          0.18
stress_levels                    0.34                       0.18          1.00

ggcorrplot(student_cor, 
           lab = TRUE)

Figure 1: Correlation matrix

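From the matrix, study load has the strongest correlation with stress levels (r = 0.34), so a higher study load is associated with higher stress levels. Sleep quality is also positively correlated with stress levels (r = 0.29), while academic performance is essentially uncorrelated with stress levels (r = 0.01).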
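As a rough check on whether the correlation between study load and stress levels could be due to chance, the t statistic for a Pearson correlation can be computed from r and n alone. This sketch uses the rounded value r = 0.34 from the matrix and n = 53, so the result is only approximate; running cor.test() on the two columns would give the exact figure.

```r
# Values read from the report above (r is rounded to two decimals)
r <- 0.34   # correlation between study_load and stress_levels
n <- 53     # number of observations

# t statistic for testing H0: true correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(c(t = t_stat, p = p_value), 3))
```

Under these assumptions the p-value falls below 0.05, which is consistent with the correlation being unlikely to arise from chance alone.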

# Create a kernel density plot of stress levels
ggplot(data = student, aes(x = stress_levels)) +
  geom_density(fill = "deepskyblue") +
  scale_x_continuous(breaks = seq(1, 5, 1), limits = c(1, 5)) +
  ggtitle("Kernel Density Plot of Stress Level Values")

Figure 2: Kernel Density Plot

The graph shows the distribution of stress level values. For example, the proportion of observations with stress levels between 1 and 3 is represented by the area under the curve between 1 and 3 on the x-axis.
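The area interpretation can be sketched numerically with base R's density(). The ratings below are simulated stand-ins for stress_levels (an assumption, not the real data), and the area is approximated with a simple Riemann sum over the evaluation grid.

```r
set.seed(1)

# Simulated 1-5 ratings standing in for stress_levels (not the real data)
x <- sample(1:5, size = 53, replace = TRUE)

# Gaussian kernel density estimate evaluated on an evenly spaced grid
d <- density(x)
step <- d$x[2] - d$x[1]

# Riemann-sum approximation of the area under the curve between 1 and 3
in_range <- d$x >= 1 & d$x <= 3
area_1_to_3 <- sum(d$y[in_range]) * step

# For comparison: total area under the curve and the raw proportion of observations
total_area <- sum(d$y) * step
empirical <- mean(x >= 1 & x <= 3)

print(round(c(area_1_to_3 = area_1_to_3, total_area = total_area, empirical = empirical), 3))
```

Because the kernel spreads each observation's weight over neighbouring values, the area between 1 and 3 will generally differ somewhat from the raw proportion of observations in that range.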

Linear Regression Analysis

Linear regression allows us to explore the relationship between a quantitative response variable and an explanatory variable while the other explanatory variables are held constant.

library(kableExtra)
library(jtools)
student_lm <- lm(stress_levels ~ study_load + sleep_quality + extracurricular_activities,
                data = student)
summ(student_lm)

Observations	53
Dependent variable	stress_levels
Type	OLS linear regression

F(3,49)	4.30
R²	0.21
Adj. R²	0.16

	Est.	S.E.	t val.	p
(Intercept)	0.58	0.67	0.86	0.39
study_load	0.30	0.12	2.41	0.02
sleep_quality	0.30	0.15	2.02	0.05
extracurricular_activities	0.16	0.12	1.29	0.20
Standard errors: OLS

From the results, we estimate that a one-unit increase in study load is associated with a 0.30-unit increase in stress level, holding the other variables constant (p = 0.02).
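To give a sense of the uncertainty around this estimate, an approximate 95% confidence interval can be built from the reported estimate and standard error. The sketch below uses the rounded values from the table (0.30 and 0.12) and the 49 residual degrees of freedom from the F statistic, so it is only an approximation; confint(student_lm) would return the exact interval.

```r
# Rounded values read from the regression table above
est <- 0.30       # study_load coefficient
se <- 0.12        # its standard error
df_resid <- 49    # residual degrees of freedom, from F(3,49)

# critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# approximate 95% confidence interval for the study_load coefficient
ci <- est + c(-1, 1) * t_crit * se

print(round(ci, 2))
```

Under these rounded inputs the interval lies entirely above zero, in line with the reported p-value of 0.02.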

library(ggplot2)
library(visreg)
visreg(student_lm, "study_load", gg = TRUE)

Figure 3: Conditional plot of study load and stress levels

The graph suggests that, after controlling for sleep quality and extracurricular activities, stress levels increase with study load.
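The line in a conditional plot can be reproduced by hand from the fitted coefficients. The sketch below uses the rounded estimates from the regression table and holds sleep quality and extracurricular activities at their medians (both 3 in the summary statistics), which we understand to be visreg's default for numeric predictors; treat that as an assumption.

```r
# Rounded coefficients from the regression table above
coefs <- c(intercept = 0.58,
           study_load = 0.30,
           sleep_quality = 0.30,
           extracurricular_activities = 0.16)

# Medians of the predictors being held fixed, from the describe() output
sleep_median <- 3
extra_median <- 3

# Fitted stress level across the observed range of study load
study_load <- 1:5
fitted_stress <- coefs[["intercept"]] +
  coefs[["study_load"]] * study_load +
  coefs[["sleep_quality"]] * sleep_median +
  coefs[["extracurricular_activities"]] * extra_median

print(data.frame(study_load, fitted_stress))
```

Each one-unit step in study load raises the fitted stress level by the study_load coefficient, which is the upward slope seen in the plot.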