library(psych)
library(ggcorrplot)
library(tidyverse)
library(readxl)
library(janitor)
Data Analysis of Students Stress Factors
Introduction
This project analyzes stress factors among students in higher education. The data used are secondary data from Kaggle. The goal of this project is to determine how these stress factors affect students' academic performance. A combination of exploratory data analysis and linear regression is used to uncover useful insights and draw informed conclusions.
Variable Description
There are six variables in the data, all of them numeric. Each is rated on a scale of 1 to 5, and they are explained as follows:
Sleep quality: The quality of sleep on a scale of 1 to 5.
Headache: How many times a week the student suffers a headache.
Academic performance: How the student would rate their academic performance.
Study load: The number of credits enrolled in a given semester.
Extracurricular activities: How many times a week the student practices an extracurricular activity.
Stress levels: How the student would rate their stress level.
Data Cleaning
Loading necessary packages
# import data
student <- read_excel("student_stress_factors.xlsx")
After running this line of code, the data are imported into the R environment and ready for analysis.
# make all variable names lower case and consistent
student <- student |>
  clean_names()
The variable names have now been cleaned, and any spaces in the names have been replaced with underscores.
# A snapshot of the data using the glimpse function
glimpse(student)
Rows: 53
Columns: 7
$ timestamp <chr> "27/10/2023 21:54:15", "28/10/2023 12:24:40…
$ sleep_quality <dbl> 3, 4, 2, 3, 2, 3, 3, 4, 2, 1, 2, 3, 2, 4, 4…
$ headache <dbl> 1, 1, 1, 2, 3, 1, 5, 3, 1, 2, 3, 1, 3, 1, 1…
$ academic_performance <dbl> 3, 2, 2, 3, 1, 3, 1, 1, 4, 3, 5, 5, 3, 4, 3…
$ study_load <dbl> 4, 3, 1, 2, 5, 2, 4, 4, 4, 2, 5, 1, 4, 4, 2…
$ extracurricular_activities <dbl> 2, 3, 4, 3, 5, 1, 3, 1, 5, 5, 2, 4, 4, 1, 5…
$ stress_levels <dbl> 3, 2, 4, 3, 3, 1, 5, 1, 1, 2, 4, 1, 3, 1, 2…
We can see that there are 53 observations and 7 columns in the data.
# convert double columns to numeric
student <- student |>
  mutate(across(where(is.double), as.numeric))
# check if the conversion has been done
sapply(student, class)
                 timestamp              sleep_quality
               "character"                  "numeric"
                  headache       academic_performance
                 "numeric"                  "numeric"
                study_load extracurricular_activities
                 "numeric"                  "numeric"
             stress_levels
                 "numeric"
All six rating variables are stored as numeric, with values ranging from 1 to 5. In R a double column already has class "numeric", so the conversion leaves the values unchanged.
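The relationship between doubles, the numeric class, and factors can be checked directly in base R. The following is a minimal sketch using a hypothetical vector `ratings` (not taken from the survey data); it only illustrates how R reports these types.

```r
# Hypothetical ratings, for illustration only (not the survey data)
ratings <- c(3, 4, 2, 3, 5)

is.double(ratings)   # plain numbers in R are stored as doubles
class(ratings)       # a double vector reports the class "numeric"

# as.numeric() leaves a double vector unchanged
identical(as.numeric(ratings), ratings)

# An ordered factor treats each rating as a ranked category instead
ratings_fct <- factor(ratings, levels = 1:5, ordered = TRUE)
class(ratings_fct)
levels(ratings_fct)
```

Whether to keep 1 to 5 ratings as numbers or as ordered factors is a modelling choice; the analysis here keeps them numeric.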
Exploratory Data Analysis
describe(student)
                           vars  n  mean    sd median trimmed   mad min max
timestamp*                    1 53 27.00 15.44     27   27.00 19.27   1  53
sleep_quality                 2 53  3.15  1.20      3    3.16  1.48   1   5
headache                      3 53  1.98  1.26      1    1.79  0.00   1   5
academic_performance          4 53  3.23  1.15      3    3.28  1.48   1   5
study_load                    5 53  2.81  1.43      2    2.77  1.48   1   5
extracurricular_activities    6 53  2.89  1.45      3    2.86  1.48   1   5
stress_levels                 7 53  2.79  1.38      3    2.74  1.48   1   5
                           range  skew kurtosis   se
timestamp*                    52  0.00    -1.27 2.12
sleep_quality                  4  0.04    -1.00 0.16
headache                       4  0.93    -0.37 0.17
academic_performance           4 -0.36    -0.52 0.16
study_load                     4  0.17    -1.44 0.20
extracurricular_activities     4  0.16    -1.34 0.20
stress_levels                  4  0.19    -1.23 0.19
This is a concise statistical summary of each of the variables.
# count total missing values in each column
sapply(student, function(x) sum(is.na(x)))
                 timestamp              sleep_quality
                         0                          0
                  headache       academic_performance
                         0                          0
                study_load extracurricular_activities
                         0                          0
             stress_levels
                         0
There are no missing values in the data.
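The same check can be written with `colSums()`, and it helps to see what the result looks like when a value is actually missing. This is a minimal sketch on a hypothetical data frame `toy` with one `NA` inserted on purpose; it is not the survey data.

```r
# Hypothetical data frame with one missing value, for illustration only
toy <- data.frame(
  sleep_quality = c(3, 4, NA, 2),
  stress_levels = c(3, 2, 4, 1)
)

# Count of missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE if any value anywhere in the data frame is missing
any(is.na(toy))

# Keep only the rows with no missing values
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```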
# select numeric variables
student_numeric <- dplyr::select_if(student, is.numeric)
Select only the numeric variables to view the correlation coefficients among them.
# create a correlation plot to see relationships among the variables
student_cor <- cor(student_numeric, use="complete.obs")
round(student_cor,2)
                           sleep_quality headache academic_performance
sleep_quality                       1.00     0.03                 0.27
headache                            0.03     1.00                -0.12
academic_performance                0.27    -0.12                 1.00
study_load                          0.10     0.10                 0.07
extracurricular_activities          0.00    -0.18                 0.05
stress_levels                       0.29    -0.04                 0.01
                           study_load extracurricular_activities stress_levels
sleep_quality                    0.10                       0.00          0.29
headache                         0.10                      -0.18         -0.04
academic_performance             0.07                       0.05          0.01
study_load                       1.00                       0.05          0.34
extracurricular_activities       0.05                       1.00          0.18
stress_levels                    0.34                       0.18          1.00
ggcorrplot(student_cor,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)
Figure 1: Correlation matrix
From the matrix, study load has the strongest correlation with stress levels (r = 0.34): a higher study load is associated with higher stress levels.
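A correlation coefficient alone does not say whether the association could be due to chance in a sample of this size. One way to check is base R's `cor.test()`. The sketch below uses hypothetical rating vectors (not the survey data), so its numbers say nothing about the actual students; it only shows the calls.

```r
# Hypothetical 1-5 ratings, for illustration only (not the survey data)
study_load    <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 1)
stress_levels <- c(1, 1, 3, 2, 4, 3, 5, 4, 5, 2)

# Pearson correlation with a test of H0: correlation = 0
pearson <- cor.test(study_load, stress_levels)
pearson$estimate   # sample correlation
pearson$p.value    # p-value for the test
pearson$conf.int   # 95% confidence interval

# Spearman rank correlation, a common choice for ordinal ratings;
# exact = FALSE skips the exact p-value, which is unavailable with ties
spearman <- cor.test(study_load, stress_levels,
                     method = "spearman", exact = FALSE)
spearman$estimate
```

On the real data the same calls would take `student$study_load` and `student$stress_levels` as arguments.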
# Create a kernel density plot of stress levels
ggplot(data = student, aes(x = stress_levels)) +
  geom_density(fill = "deepskyblue") +
  scale_x_continuous(breaks = seq(1, 5, 1), limits = c(1, 5)) +
  ggtitle("Kernel Density Plot of Stress Levels Values")
Figure 2: Kernel Density Plot
The graph shows the distribution of stress level values. For example, the proportion of observations with a stress level between 1 and 3 is represented by the area under the curve between 1 and 3 on the x-axis.
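The area interpretation can be made concrete by comparing the empirical proportion with a numerical approximation of the area under the density curve. This is a rough sketch on a hypothetical vector `stress` (not the survey data), using base R's `density()` and a simple Riemann sum.

```r
# Hypothetical stress ratings, for illustration only (not the survey data)
stress <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5)

# Empirical proportion of observations with a rating from 1 to 3
prop_1_to_3 <- mean(stress >= 1 & stress <= 3)
prop_1_to_3

# Approximate area under the kernel density curve between 1 and 3
dens <- density(stress, from = 1, to = 5)
in_range <- dens$x >= 1 & dens$x <= 3
step <- diff(dens$x)[1]                  # spacing of the evenly spaced grid
area_1_to_3 <- sum(dens$y[in_range]) * step
area_1_to_3
```

The two numbers need not agree exactly: the kernel smooths each rating into a small bump, so some of the weight for ratings 1 and 3 falls outside the interval.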
Linear Regression Analysis
Linear regression allows us to explore the relationship between a quantitative response variable and an explanatory variable while other variables are held constant.
library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
library(jtools)
student_lm <- lm(stress_levels ~ study_load + sleep_quality + extracurricular_activities,
                 data = student)
summ(student_lm)
| Observations | 53 |
| Dependent variable | stress_levels |
| Type | OLS linear regression |
| F(3,49) | 4.30 |
| R² | 0.21 |
| Adj. R² | 0.16 |

| | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 0.58 | 0.67 | 0.86 | 0.39 |
| study_load | 0.30 | 0.12 | 2.41 | 0.02 |
| sleep_quality | 0.30 | 0.15 | 2.02 | 0.05 |
| extracurricular_activities | 0.16 | 0.12 | 1.29 | 0.20 |
Standard errors: OLS
From the results, we estimate that a one-unit increase in study load is associated with a 0.30-unit increase in stress level, holding the other variables constant.
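The phrase "holding the other variables constant" can be illustrated by predicting for two students who differ only in study load. The sketch below fits a model to simulated data (the data frame `toy` and its coefficients are invented for illustration, not the survey data) and shows that the difference between the two predictions equals the fitted `study_load` coefficient.

```r
# Simulated data, for illustration only (not the survey data)
set.seed(1)
n <- 100
toy <- data.frame(
  study_load    = sample(1:5, n, replace = TRUE),
  sleep_quality = sample(1:5, n, replace = TRUE)
)
toy$stress_levels <- 0.5 + 0.3 * toy$study_load + 0.2 * toy$sleep_quality +
  rnorm(n, sd = 0.5)

toy_lm <- lm(stress_levels ~ study_load + sleep_quality, data = toy)

# Two hypothetical students who differ only by one unit of study load
new_students <- data.frame(study_load = c(2, 3), sleep_quality = c(3, 3))
pred <- predict(toy_lm, newdata = new_students)

# The difference in predictions equals the study_load coefficient
diff(pred)
coef(toy_lm)["study_load"]
```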
library(ggplot2)
library(visreg)
visreg(student_lm, "study_load", gg = TRUE)
Figure 3: Conditional plot of study load and stress levels
The graph suggests that, after controlling for sleep quality and extracurricular activities, stress levels increase with study load in a linear fashion.