Student lifestyles and academic performance are of interest to many pursuing higher education and to academic institutions, and how students use the limited hours of their day concerns everyone involved in these educational systems, including school faculty, students, and students’ parents. The stress levels and mental health of students have also been of great concern in recent times.
We use a dataset (1) of 2000 observations of students, based on the hours they report spending on different aspects of their lives, including studying, extracurricular activities, and sleep, as well as their GPA and self-reported stress level, recorded across the following 8 variables:
Student_ID (numeric): identification of each individual observation, 1-2000
Study_Hours_Per_Day (numeric): Approximate hours spent studying per day, to the nearest tenth of an hour
Extracurricular_Hours_Per_Day (numeric): Approximate hours spent on extracurricular activities per day, to the nearest tenth of an hour
Sleep_Hours_Per_Day (numeric): Approximate hours spent sleeping per day, to the nearest tenth of an hour
Social_Hours_Per_Day (numeric): Approximate hours spent on social activities per day, to the nearest tenth of an hour
Physical_Activity_Hours_Per_Day (numeric): Approximate hours spent on physical activity per day, to the nearest tenth of an hour
GPA (numeric): GPA on a 4-point Grade Point Average scale (2)
Stress_Level (character): Self-reported stress level of the student (Low, Moderate, High)
We will use various unsupervised learning methods to identify possible patterns among students in the dataset and to reduce the dimensionality of the data. We will also use anomaly detection to attempt to identify students with GPAs above the 95th percentile.
We will begin with exploratory data analysis to look for possible correlations between the numeric variables, which could allow us to reduce dimensionality later on. We will also check for missing values and do any feature engineering as necessary. As we are not fitting any models, there are no model assumptions whose violations we need to check for.
We will use k-means and hierarchical clustering methods to examine patterns that exist in the data. K-means clustering partitions the dataset into k clusters so as to minimize the total within-cluster variation. We will scale the data beforehand to minimize issues caused by differences in scale among the variables, and then use an elbow plot to identify a reasonable number of clusters for this dataset. We will also use agglomerative hierarchical clustering with the hclust() function, which begins with each observation as its own cluster and repeatedly merges the closest clusters until a stopping criterion is reached. Again, an elbow plot will be used to identify the ideal number of clusters.
Principal Component Analysis (PCA) will be used to attempt to reduce the dimensionality of the data. PCA can simplify a dataset, especially when it contains a number of highly correlated variables, by creating new variables that are uncorrelated linear combinations of the original variables and that concentrate as much of the variance as possible in the first few principal components. We will also use the clustering methods in combination with PCA to create plots of the clusters created through both k-means and hierarchical clustering.
For anomaly detection, we will artificially create a sparse category for students with roughly the top 5% of GPAs in the dataset, then use the local outlier factor (LOF) to attempt to identify data points that deviate from the norm. LOF compares the local density of each point to the densities of its neighbors, flagging points that lie in noticeably sparser regions.
We will begin with a look at the pairwise correlation plots of the numeric variables, coloring them by stress level.
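A sketch of how such a plot might be produced with `GGally::ggpairs()`, assuming the data have been loaded into a data frame called `students` (a hypothetical name):

```r
library(GGally)    # ggpairs() for the pairwise plot matrix
library(ggplot2)

# `students` is a hypothetical name for the data frame holding the dataset
hour.vars <- c("Study_Hours_Per_Day", "Extracurricular_Hours_Per_Day",
               "Sleep_Hours_Per_Day", "Social_Hours_Per_Day",
               "Physical_Activity_Hours_Per_Day", "GPA")

ggpairs(students, columns = hour.vars,
        mapping = aes(colour = Stress_Level))
```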
We observe certain patterns between the numeric variables and the stress levels of students in the data. Those with “High” stress report more hours spent studying per day and higher GPAs than students with “Low” stress, while the opposite is true for sleep hours and hours spent on physical activity. Looking purely at the pairwise correlation plots, distinct clusters are not apparent in any of the pairwise plots between the numeric variables, and not much correlation is obvious in most of the plots, which will limit the efficacy of PCA.
To use stress level in the analysis, we will create a numeric coding for it with dummy variables. We will continue by checking for missing values, which can interfere with the efficacy of different unsupervised learning methods.
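One way the dummy coding and the missingness check might be done, continuing with the hypothetical `students` data frame; the dummy names match the `HighStress` and `MedStress` columns in the summary below:

```r
# Dummy coding for the self-reported stress level
students$HighStress <- as.numeric(students$Stress_Level == "High")
students$MedStress  <- as.numeric(students$Stress_Level == "Moderate")

# Numeric working copy without the ID and the original character column
stud.num <- subset(students, select = -c(Student_ID, Stress_Level))

summary(stud.num)          # distribution of each variable
colSums(is.na(stud.num))   # count of missing values per variable
```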
Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day
Min. : 5.000 Min. :0.00 Min. : 5.000
1st Qu.: 6.300 1st Qu.:1.00 1st Qu.: 6.200
Median : 7.400 Median :2.00 Median : 7.500
Mean : 7.476 Mean :1.99 Mean : 7.501
3rd Qu.: 8.700 3rd Qu.:3.00 3rd Qu.: 8.800
Max. :10.000 Max. :4.00 Max. :10.000
Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA
Min. :0.000 Min. : 0.000 Min. :2.240
1st Qu.:1.200 1st Qu.: 2.400 1st Qu.:2.900
Median :2.600 Median : 4.100 Median :3.110
Mean :2.705 Mean : 4.328 Mean :3.116
3rd Qu.:4.100 3rd Qu.: 6.100 3rd Qu.:3.330
Max. :6.000 Max. :13.000 Max. :4.000
HighStress MedStress
Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.000
Median :1.0000 Median :0.000
Mean :0.5145 Mean :0.337
3rd Qu.:1.0000 3rd Qu.:1.000
Max. :1.0000 Max. :1.000
There do not appear to be any missing values in the data, so we will continue with the unsupervised learning methods.
We will begin with two different clustering methods: k-means clustering and hierarchical clustering.
Taking the dataframe of all the numeric variables and dummy-coded categorical variable, we build a scaled version of the data and use it to create a heat map to identify clustering in the data.
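A minimal sketch of this step, assuming the numeric data frame from above is called `stud.num`; the base-R `heatmap()` function is used here as one possible choice:

```r
# Scale the numeric variables (mean 0, sd 1) and draw a heat map
stud.scaled <- scale(stud.num)
heatmap(stud.scaled, Colv = NA)   # row dendrogram only
```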
We will continue by using k-means clustering with different numbers of clusters to identify the best number of clusters possible for this dataset.
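One way the elbow plot might be produced, computing the total within-cluster sum of squares for a range of k on the scaled data:

```r
# Total within-cluster sum of squares for k = 1, ..., 10 (elbow plot)
set.seed(1)   # kmeans uses random starts
wss <- sapply(1:10, function(k) {
  kmeans(stud.scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```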
The elbow plot is relatively smooth without a very obvious elbow point, but a value between k = 4 and k = 6 seems most appropriate. We will use k = 5 to create the final clusters and graph them on a pairwise plot.
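A sketch of the final k-means fit and the pairwise plot, with the cluster assignments stored under the `km.clust.ID` name used later:

```r
# Final k-means fit with k = 5 and pairwise plots colored by cluster
set.seed(1)
km.out      <- kmeans(stud.scaled, centers = 5, nstart = 25)
km.clust.ID <- km.out$cluster
pairs(stud.num, col = km.clust.ID, pch = 19, cex = 0.4)
```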
Once again, we scale the numeric variables. A dendrogram can be used to illustrate the agglomerative clustering process; here we highlight a five-cluster cut as an example.
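A sketch of the hierarchical fit and dendrogram, reusing the scaled matrix `stud.scaled` from above; the linkage method (complete, here) is an assumption:

```r
# Agglomerative clustering on the scaled data; complete linkage assumed
hc.out <- hclust(dist(stud.scaled), method = "complete")
plot(hc.out, labels = FALSE, main = "Cluster dendrogram")
rect.hclust(hc.out, k = 5)   # highlight a five-cluster cut
```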
Once again, we will use an elbow plot to identify the optimal number of clusters for the data.
The elbow plot is once again rather smooth without an obvious elbow, but seems to level off at around 3-5 clusters. We choose to create 4 clusters and once again examine the pairwise scatterplots.
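A sketch of the four-cluster cut, reusing the `hc.out` object from above; the cluster IDs are stored under the `hierarch.clust.ID` name used later:

```r
# Cut the tree into four clusters and repeat the pairwise plots
hierarch.clust.ID <- cutree(hc.out, k = 4)
pairs(stud.num, col = hierarch.clust.ID, pch = 19, cex = 0.4)
```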
The hierarchical clustering process seems to indicate slightly different clusters than the k-means clustering process. For example, clear vertical bands of color are observable in the graph of study hours against extracurricular hours for the k-means clustering method, but the same is not true for the hierarchical clustering method; similarly, clear horizontal bands of color are visible in the graph of extracurricular hours against physical activity hours for hierarchical clustering, which is less true of the k-means method. For the k-means clustering method with the stress level variables, the green and purple clusters almost all had values of 1 for HighStress and 0 for MedStress, while the blue and red clusters almost all had values of 0 for HighStress and 1 for MedStress. Similar groupings by stress level are not obvious in the hierarchical clustering method, where almost every cluster contains values of both 1 and 0 for high and moderate self-reported stress levels.
We will continue the analysis by using PCA to attempt to reduce the dimensionality of the data. Once again, we scale the data to minimize the influence of units, then use PCA to create the different principal components.
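A sketch of the PCA step, continuing with the hypothetical `stud.num` data frame; `prcomp()` is asked to center and scale the variables:

```r
# PCA on the centered and scaled variables
pca.out <- prcomp(stud.num, center = TRUE, scale. = TRUE)
round(pca.out$rotation, 2)   # loadings (first table below)
summary(pca.out)             # variance explained (second table below)
```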
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Study_Hours_Per_Day | 0.53 | -0.13 | -0.15 | -0.18 | 0.22 | -0.50 | -0.46 | 0.37 |
| Extracurricular_Hours_Per_Day | 0.05 | -0.28 | -0.07 | 0.88 | 0.21 | 0.11 | 0.04 | 0.30 |
| Sleep_Hours_Per_Day | -0.06 | -0.56 | -0.32 | -0.21 | -0.58 | 0.09 | 0.22 | 0.38 |
| Social_Hours_Per_Day | -0.01 | -0.13 | 0.87 | -0.15 | 0.02 | 0.09 | 0.06 | 0.44 |
| Physical_Activity_Hours_Per_Day | -0.28 | 0.61 | -0.28 | -0.08 | 0.10 | 0.12 | 0.07 | 0.66 |
| GPA | 0.48 | -0.10 | -0.15 | -0.25 | 0.37 | 0.73 | 0.11 | 0.00 |
| HighStress | 0.53 | 0.27 | 0.05 | 0.10 | -0.15 | -0.29 | 0.73 | 0.00 |
| MedStress | -0.37 | -0.34 | -0.11 | -0.23 | 0.63 | -0.30 | 0.44 | 0.00 |
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Standard deviation | 1.656126 | 1.33241 | 1.118281 | 1.030038 | 0.8712267 | 0.5539886 | 0.3232012 | 0 |
| Proportion of Variance | 0.342840 | 0.22191 | 0.156320 | 0.132620 | 0.0948800 | 0.0383600 | 0.0130600 | 0 |
| Cumulative Proportion | 0.342840 | 0.56476 | 0.721080 | 0.853700 | 0.9485800 | 0.9869400 | 1.0000000 | 1 |
Looking at the principal components, the earlier observed lack of obvious correlation between most of the numeric variables means that it still takes several principal components to explain the variability in the data: the first 4 PCs are needed to cumulatively explain about 85% of the variance. We can identify the number of PCs to retain through a scree plot.
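The scree plot can be drawn directly from the `prcomp` object, for example:

```r
# Scree plot of the variances of the principal components
screeplot(pca.out, type = "lines", main = "Scree plot")
```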
There does not appear to be a very defined elbow in this scree plot, but retaining around 3 or 4 PCs seems reasonable. We can extract the PC scores as follows.
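A sketch of the extraction, assuming the PCA object is called `pca.out` as above; the table below shows the first 15 rows of scores:

```r
# PC scores for each observation; the first 15 rows are shown below
head(pca.out$x, 15)
```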
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|
| -1.1672177 | -2.1822916 | -0.1249299 | 1.0153085 | 0.5438707 | -0.0839218 | 0.1804290 | 0 |
| -1.4984740 | -0.7014110 | 1.1502302 | 1.6268448 | -1.0537595 | 0.5894087 | -0.3538513 | 0 |
| -1.8865623 | -0.5869440 | -0.7997225 | 2.0547629 | -1.5612748 | 0.4823707 | -0.1822521 | 0 |
| -2.0085699 | 0.1075797 | -0.6842495 | 0.0371472 | 0.8090259 | -0.3073326 | 0.0809085 | 0 |
| 1.3692914 | 1.6294454 | -0.3391425 | -1.0839506 | 0.2234911 | 0.5495583 | 0.1837988 | 0 |
| -2.3853914 | 0.2373238 | -1.6363262 | 0.0931161 | 0.4031003 | -0.1834714 | 0.3331274 | 0 |
| 0.9273259 | 1.3798924 | 2.1971438 | -0.6818970 | 0.1198075 | -0.4360473 | -0.0594960 | 0 |
| 1.2195250 | 1.3507013 | 0.4766510 | 0.1596198 | 0.4131674 | -0.2682482 | -0.1271483 | 0 |
| -1.5604943 | 0.3875483 | 1.1839003 | 1.8616289 | -0.2157019 | 0.7821936 | -0.4991009 | 0 |
| -1.3591464 | -2.1041963 | 0.7704956 | -1.5272223 | -0.6117881 | -1.0930387 | -0.0621739 | 0 |
| 2.5999825 | -1.3801528 | -0.1128093 | 1.0152809 | 0.0661149 | -0.1104255 | -0.1986859 | 0 |
| -1.6472945 | -0.8870448 | -0.7386952 | -1.1057721 | -0.0150320 | -0.2717845 | 0.1973882 | 0 |
| -0.0985165 | 1.3174771 | 1.7871043 | 0.8737519 | -0.3274455 | -0.3568988 | 0.4600152 | 0 |
| -1.4265589 | -0.9185140 | 1.1492661 | 1.3258458 | -1.1918218 | 0.9994507 | -0.1461467 | 0 |
| 1.4023148 | 1.8351653 | -1.2681364 | -1.2530709 | 0.0481127 | -0.0633514 | -0.1095925 | 0 |
Using the first two PCs and the two clustering methods, we can create plots identifying the clusters in the data by graphing the first two PCs against each other.
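One way these plots and the cluster counts below might be produced, reusing the names from the earlier sketches:

```r
# First two PC scores colored by each clustering, plus cluster sizes
scores <- as.data.frame(pca.out$x[, 1:2])
par(mfrow = c(1, 2))
plot(scores$PC1, scores$PC2, col = km.clust.ID, pch = 19, cex = 0.5,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters")
plot(scores$PC1, scores$PC2, col = hierarch.clust.ID, pch = 19, cex = 0.5,
     xlab = "PC1", ylab = "PC2", main = "Hierarchical clusters")
table(km.clust.ID)
table(hierarch.clust.ID)
```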
k-means cluster sizes (km.clust.ID):

| Cluster | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Size | 281 | 297 | 430 | 674 | 318 |

Hierarchical cluster sizes (hierarch.clust.ID):

| Cluster | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Size | 749 | 638 | 475 | 138 |
We can see that the two clustering methods drew very different clusters in the data. The k-means clustering method created clusters largely in a “vertical” fashion in this plot, while the hierarchical clustering method created “horizontal” clusters that span the width of the first PC. Looking at the plot of the first two principal components, there appear to be two fairly distinct groups to the left and right of the graph. Based on this, the k-means method seems to capture the clustering in the data in a slightly more intuitive fashion for this dataset.
We begin by creating a variable that flags students with a GPA above 3.6, since the 95th percentile of GPA in the data is 3.61. This creates a sparse category of 107 of the 2000 students in the dataset with a “high” GPA.
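A sketch of how this flag might be created, continuing with the hypothetical `students` data frame; `dplyr` is assumed to be available for the count shown below:

```r
# Sparse "high GPA" category at roughly the 95th percentile
quantile(students$GPA, 0.95)                  # 95th percentile of GPA
students$HighGPA <- as.numeric(students$GPA > 3.6)
dplyr::count(students, HighGPA)               # class sizes
```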
95%
3.61
# A tibble: 2 Ă— 2
HighGPA n
<dbl> <int>
1 0 1893
2 1 107
We will continue by creating LOF scores using the numeric data, excluding GPA since it was the variable used to create the categories. We will look at the LOF detection rates and examine its performance using an ROC curve.
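A sketch of how the LOF scores and the ROC curve might be computed, assuming the `dbscan` and `pROC` packages are available; the `minPts` argument of `dbscan::lof()` plays the role of k here, and including the stress dummies alongside the hours variables is an assumption:

```r
library(dbscan)   # lof()
library(pROC)     # roc(), auc()

# LOF on the scaled numeric variables, excluding GPA (the labeling variable)
lof.vars   <- scale(subset(stud.num, select = -GPA))
lof.scores <- lof(lof.vars, minPts = 100)     # k = 100 nearest neighbors

# ROC curve of the LOF scores against the sparse HighGPA label
roc.100 <- roc(students$HighGPA, lof.scores)
plot(roc.100)
auc(roc.100)
```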
The LOF method with k = 100 does not appear to perform much better than random chance, though its AUC is slightly above 0.5. We will try the same method with different k values and construct ROC curves.
Once again, the performance of the different LOF settings does not seem much better than random chance. However, k values of 50 and 400 seem to perform the best of the ones we tried.
Unsupervised learning methods can be powerful ways of exploring patterns in a dataset which can then be exploited in supervised learning methods. PCA, for example, can be used to create simpler datasets that can then be used for a variety of supervised learning methods. For further analysis using this data, a differently coded sparse category could be used to check for a better performance using the LOF method. A dataset with more correlation between numeric variables would have led to a more meaningful usage of PCA.