Student lifestyles and academic performance are of interest to many pursuing higher education and to academic institutions, and how students use the limited hours of their day concerns everyone involved in these educational systems, including school faculty, students, and students’ parents. The stress levels and mental health of students have also been of great concern in recent times.
We use a dataset (1) of 2000 observations of students, based on the hours they report spending on different aspects of their lives, including studying, extracurricular activities, and sleep, as well as their GPA and self-reported stress level, recorded across the following 8 variables:
Student_ID (numeric): identification of each individual observation, 1-2000
Study_Hours_Per_Day (numeric): Approximate hours spent studying per day, to the nearest tenth of an hour
Extracurricular_Hours_Per_Day (numeric): Approximate hours spent on extracurricular activities per day, to the nearest tenth of an hour
Sleep_Hours_Per_Day (numeric): Approximate hours spent sleeping per day, to the nearest tenth of an hour
Social_Hours_Per_Day (numeric): Approximate hours spent on social activities per day, to the nearest tenth of an hour
Physical_Activity_Hours_Per_Day (numeric): Approximate hours spent on physical activity per day, to the nearest tenth of an hour
GPA (numeric): GPA on a 4-point Grade Point Average scale (2)
Stress_Level (character): Self-reported stress level of the student (Low, Moderate, High)
We will use various unsupervised learning methods to identify possible patterns among students in the dataset and to reduce the dimensionality of the data. We will also use anomaly detection to attempt to identify students with GPAs above the 95th percentile.
We will begin with exploratory data analysis to look for possible correlations between the numeric variables, which could allow us to reduce dimensionality later on. We will also check for missing values and do any feature engineering as necessary. As we are not fitting any models, there are no model assumptions whose violations we need to check for.
We will use k-means and hierarchical clustering methods to examine patterns that exist in the data. K-means clustering partitions the dataset into k clusters so as to minimize the total within-cluster variation. We will scale the data beforehand to minimize issues caused by differences in scale among the variables, and then use an elbow plot to identify a reasonable number of clusters for this dataset. We will also use agglomerative hierarchical clustering with the hclust() function, which begins with each observation as its own cluster and repeatedly merges the closest clusters until a stopping criterion is reached. Again, an elbow plot will be used to identify the ideal number of clusters.
Principal Component Analysis (PCA) will be used to attempt to reduce the dimensionality of the data. PCA can simplify a dataset, especially when it contains a number of highly correlated variables, by creating new variables that are uncorrelated linear combinations of the original variables and that concentrate as much of the variance as possible in the first few principal components. We will also use the clustering methods in combination with PCA to create plots of the clusters created through both k-means and hierarchical clustering.
For anomaly detection, we will artificially create a sparse category for students with roughly the top 5% of GPAs in the dataset, then use the local outlier factor (LOF) to attempt to identify data points that deviate from the norm. LOF compares the local density of each point to the densities of its neighbors, flagging points that lie in noticeably sparser regions.
We will begin with a look at the pairwise correlation plots of the numeric variables, coloring them by stress level.
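A sketch of how such a plot might be produced with `GGally::ggpairs()`, assuming the data have been loaded into a data frame called `students` (a hypothetical name):

```r
library(GGally)    # ggpairs() for the pairwise plot matrix
library(ggplot2)

# `students` is a hypothetical name for the data frame holding the dataset
hour.vars <- c("Study_Hours_Per_Day", "Extracurricular_Hours_Per_Day",
               "Sleep_Hours_Per_Day", "Social_Hours_Per_Day",
               "Physical_Activity_Hours_Per_Day", "GPA")

ggpairs(students, columns = hour.vars,
        mapping = aes(colour = Stress_Level))
```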
We observe certain patterns between the numeric variables and the stress levels of students in the data. Those with “High” stress report more hours spent studying per day and higher GPAs than students with “Low” stress, while the opposite is true for sleep hours and hours spent on physical activity. Looking purely at the pairwise correlation plots, distinct clusters are not apparent in any of the pairwise plots between the numeric variables, and not much correlation is obvious in most of the plots, which will limit the efficacy of PCA.
To use stress level in the analysis, we will create a numeric coding for it with dummy variables. We will continue by checking for missing values, which can interfere with the efficacy of different unsupervised learning methods.
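One way the dummy coding and the missingness check might be done, continuing with the hypothetical `students` data frame; the dummy names match the `HighStress` and `MedStress` columns in the summary below:

```r
# Dummy coding for the self-reported stress level
students$HighStress <- as.numeric(students$Stress_Level == "High")
students$MedStress  <- as.numeric(students$Stress_Level == "Moderate")

# Numeric working copy without the ID and the original character column
stud.num <- subset(students, select = -c(Student_ID, Stress_Level))

summary(stud.num)          # distribution of each variable
colSums(is.na(stud.num))   # count of missing values per variable
```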
Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day
Min. : 5.000 Min. :0.00 Min. : 5.000
1st Qu.: 6.300 1st Qu.:1.00 1st Qu.: 6.200
Median : 7.400 Median :2.00 Median : 7.500
Mean : 7.476 Mean :1.99 Mean : 7.501
3rd Qu.: 8.700 3rd Qu.:3.00 3rd Qu.: 8.800
Max. :10.000 Max. :4.00 Max. :10.000
Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA
Min. :0.000 Min. : 0.000 Min. :2.240
1st Qu.:1.200 1st Qu.: 2.400 1st Qu.:2.900
Median :2.600 Median : 4.100 Median :3.110
Mean :2.705 Mean : 4.328 Mean :3.116
3rd Qu.:4.100 3rd Qu.: 6.100 3rd Qu.:3.330
Max. :6.000 Max. :13.000 Max. :4.000
HighStress MedStress
Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.000
Median :1.0000 Median :0.000
Mean :0.5145 Mean :0.337
3rd Qu.:1.0000 3rd Qu.:1.000
Max. :1.0000 Max. :1.000
There do not appear to be any missing values in the data, so we will continue with the unsupervised learning methods.
We will begin with two different clustering methods: k-means clustering and hierarchical clustering.
Taking the dataframe of all the numeric variables and dummy-coded categorical variable, we build a scaled version of the data and use it to create a heat map to identify clustering in the data.
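A minimal sketch of this step, assuming the numeric data frame from above is called `stud.num`; the base-R `heatmap()` function is used here as one possible choice:

```r
# Scale the numeric variables (mean 0, sd 1) and draw a heat map
stud.scaled <- scale(stud.num)
heatmap(stud.scaled, Colv = NA)   # row dendrogram only
```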
We will continue by using k-means clustering with different numbers of clusters to identify the best number of clusters possible for this dataset.
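One way the elbow plot might be produced, computing the total within-cluster sum of squares for a range of k on the scaled data:

```r
# Total within-cluster sum of squares for k = 1, ..., 10 (elbow plot)
set.seed(1)   # kmeans uses random starts
wss <- sapply(1:10, function(k) {
  kmeans(stud.scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```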
The elbow plot is relatively smooth without a very obvious elbow point, but a value between k = 4 and k = 6 seems most appropriate. We will use k = 5 to create the final clusters and graph them on a pairwise plot.
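A sketch of the final k-means fit and the pairwise plot, with the cluster assignments stored under the `km.clust.ID` name used later:

```r
# Final k-means fit with k = 5 and pairwise plots colored by cluster
set.seed(1)
km.out      <- kmeans(stud.scaled, centers = 5, nstart = 25)
km.clust.ID <- km.out$cluster
pairs(stud.num, col = km.clust.ID, pch = 19, cex = 0.4)
```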
Once again, we scale the numeric variables. A dendrogram can be used to illustrate the agglomerative clustering process; here we highlight a five-cluster cut as an example.
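A sketch of the hierarchical fit and dendrogram, reusing the scaled matrix `stud.scaled` from above; the linkage method (complete, here) is an assumption:

```r
# Agglomerative clustering on the scaled data; complete linkage assumed
hc.out <- hclust(dist(stud.scaled), method = "complete")
plot(hc.out, labels = FALSE, main = "Cluster dendrogram")
rect.hclust(hc.out, k = 5)   # highlight a five-cluster cut
```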
Once again, we will use an elbow plot to identify the optimal number of clusters for the data.
The elbow plot is once again rather smooth without an obvious elbow, but seems to level off at around 3-5 clusters. We choose to create 4 clusters and once again examine the pairwise scatterplots.
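A sketch of the four-cluster cut, reusing the `hc.out` object from above; the cluster IDs are stored under the `hierarch.clust.ID` name used later:

```r
# Cut the tree into four clusters and repeat the pairwise plots
hierarch.clust.ID <- cutree(hc.out, k = 4)
pairs(stud.num, col = hierarch.clust.ID, pch = 19, cex = 0.4)
```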
The hierarchical clustering process seems to indicate slightly different clusters than the k-means clustering process. For example, clear vertical bands of color are observable in the graph of study hours against extracurricular hours for the k-means clustering method, but the same is not true for the hierarchical clustering method; similarly, clear horizontal bands of color are visible in the graph of extracurricular hours against physical activity hours for hierarchical clustering, which is less true of the k-means method. For the k-means clustering method with the stress level variables, the green and purple clusters almost all had values of 1 for HighStress and 0 for MedStress, while the blue and red clusters almost all had values of 0 for HighStress and 1 for MedStress. Similar groupings by stress level are not obvious in the hierarchical clustering method, where almost every cluster contains values of both 1 and 0 for high and moderate self-reported stress levels.
We will continue the analysis by using PCA to attempt to reduce the dimensionality of the data. Once again, we scale the data to minimize the influence of units, then use PCA to create the different principal components.
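A sketch of the PCA step, continuing with the hypothetical `stud.num` data frame; `prcomp()` is asked to center and scale the variables:

```r
# PCA on the centered and scaled variables
pca.out <- prcomp(stud.num, center = TRUE, scale. = TRUE)
round(pca.out$rotation, 2)   # loadings (first table below)
summary(pca.out)             # variance explained (second table below)
```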
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Study_Hours_Per_Day | 0.53 | -0.13 | -0.15 | -0.18 | 0.22 | -0.50 | -0.46 | 0.37 |
| Extracurricular_Hours_Per_Day | 0.05 | -0.28 | -0.07 | 0.88 | 0.21 | 0.11 | 0.04 | 0.30 |
| Sleep_Hours_Per_Day | -0.06 | -0.56 | -0.32 | -0.21 | -0.58 | 0.09 | 0.22 | 0.38 |
| Social_Hours_Per_Day | -0.01 | -0.13 | 0.87 | -0.15 | 0.02 | 0.09 | 0.06 | 0.44 |
| Physical_Activity_Hours_Per_Day | -0.28 | 0.61 | -0.28 | -0.08 | 0.10 | 0.12 | 0.07 | 0.66 |
| GPA | 0.48 | -0.10 | -0.15 | -0.25 | 0.37 | 0.73 | 0.11 | 0.00 |
| HighStress | 0.53 | 0.27 | 0.05 | 0.10 | -0.15 | -0.29 | 0.73 | 0.00 |
| MedStress | -0.37 | -0.34 | -0.11 | -0.23 | 0.63 | -0.30 | 0.44 | 0.00 |
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Standard deviation | 1.656126 | 1.33241 | 1.118281 | 1.030038 | 0.8712267 | 0.5539886 | 0.3232012 | 0 |
| Proportion of Variance | 0.342840 | 0.22191 | 0.156320 | 0.132620 | 0.0948800 | 0.0383600 | 0.0130600 | 0 |
| Cumulative Proportion | 0.342840 | 0.56476 | 0.721080 | 0.853700 | 0.9485800 | 0.9869400 | 1.0000000 | 1 |
Looking at the principal components, the earlier observed lack of obvious correlation between most of the numeric variables means that it still takes several principal components to explain the variability in the data: the first 4 PCs are needed to cumulatively explain about 85% of the variance. We can identify the number of PCs to retain through a scree plot.
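The scree plot can be drawn directly from the `prcomp` object, for example:

```r
# Scree plot of the variances of the principal components
screeplot(pca.out, type = "lines", main = "Scree plot")
```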
There does not appear to be a very defined elbow in this scree plot, but retaining around 3 or 4 PCs seems reasonable. We can extract the PC scores as follows.
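A sketch of the extraction, assuming the PCA object is called `pca.out` as above; the table below shows the first 15 rows of scores:

```r
# PC scores for each observation; the first 15 rows are shown below
head(pca.out$x, 15)
```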
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|
| -1.1672177 | -2.1822916 | -0.1249299 | 1.0153085 | 0.5438707 | -0.0839218 | 0.1804290 | 0 |
| -1.4984740 | -0.7014110 | 1.1502302 | 1.6268448 | -1.0537595 | 0.5894087 | -0.3538513 | 0 |
| -1.8865623 | -0.5869440 | -0.7997225 | 2.0547629 | -1.5612748 | 0.4823707 | -0.1822521 | 0 |
| -2.0085699 | 0.1075797 | -0.6842495 | 0.0371472 | 0.8090259 | -0.3073326 | 0.0809085 | 0 |
| 1.3692914 | 1.6294454 | -0.3391425 | -1.0839506 | 0.2234911 | 0.5495583 | 0.1837988 | 0 |
| -2.3853914 | 0.2373238 | -1.6363262 | 0.0931161 | 0.4031003 | -0.1834714 | 0.3331274 | 0 |
| 0.9273259 | 1.3798924 | 2.1971438 | -0.6818970 | 0.1198075 | -0.4360473 | -0.0594960 | 0 |
| 1.2195250 | 1.3507013 | 0.4766510 | 0.1596198 | 0.4131674 | -0.2682482 | -0.1271483 | 0 |
| -1.5604943 | 0.3875483 | 1.1839003 | 1.8616289 | -0.2157019 | 0.7821936 | -0.4991009 | 0 |
| -1.3591464 | -2.1041963 | 0.7704956 | -1.5272223 | -0.6117881 | -1.0930387 | -0.0621739 | 0 |
| 2.5999825 | -1.3801528 | -0.1128093 | 1.0152809 | 0.0661149 | -0.1104255 | -0.1986859 | 0 |
| -1.6472945 | -0.8870448 | -0.7386952 | -1.1057721 | -0.0150320 | -0.2717845 | 0.1973882 | 0 |
| -0.0985165 | 1.3174771 | 1.7871043 | 0.8737519 | -0.3274455 | -0.3568988 | 0.4600152 | 0 |
| -1.4265589 | -0.9185140 | 1.1492661 | 1.3258458 | -1.1918218 | 0.9994507 | -0.1461467 | 0 |
| 1.4023148 | 1.8351653 | -1.2681364 | -1.2530709 | 0.0481127 | -0.0633514 | -0.1095925 | 0 |
Using the first two PCs and the two clustering methods, we can create plots identifying the clusters in the data by graphing the first two PCs against each other.
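One way these plots and the cluster counts below might be produced, reusing the names from the earlier sketches:

```r
# First two PC scores colored by each clustering, plus cluster sizes
scores <- as.data.frame(pca.out$x[, 1:2])
par(mfrow = c(1, 2))
plot(scores$PC1, scores$PC2, col = km.clust.ID, pch = 19, cex = 0.5,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters")
plot(scores$PC1, scores$PC2, col = hierarch.clust.ID, pch = 19, cex = 0.5,
     xlab = "PC1", ylab = "PC2", main = "Hierarchical clusters")
table(km.clust.ID)
table(hierarch.clust.ID)
```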
k-means cluster sizes (km.clust.ID):

| Cluster | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Size | 281 | 297 | 430 | 674 | 318 |

Hierarchical cluster sizes (hierarch.clust.ID):

| Cluster | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Size | 749 | 638 | 475 | 138 |
We can see that the two clustering methods drew very different clusters in the data. The k-means clustering method created clusters largely in a “vertical” fashion in this plot, while the hierarchical clustering method created “horizontal” clusters that span the width of the first PC. Looking at the plot of the first two principal components, there appear to be two fairly distinct groups to the left and right of the graph. Based on this, the k-means method seems to capture the clustering in the data in a slightly more intuitive fashion for this dataset.
We begin by creating a variable that flags students with a GPA above 3.6, since the 95th percentile of GPA in the data is 3.61. This creates a sparse category of 107 of the 2000 students in the dataset with a “high” GPA.
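A sketch of how this flag might be created, continuing with the hypothetical `students` data frame; `dplyr` is assumed to be available for the count shown below:

```r
# Sparse "high GPA" category at roughly the 95th percentile
quantile(students$GPA, 0.95)                  # 95th percentile of GPA
students$HighGPA <- as.numeric(students$GPA > 3.6)
dplyr::count(students, HighGPA)               # class sizes
```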
95%
3.61
# A tibble: 2 Ă— 2
HighGPA n
<dbl> <int>
1 0 1893
2 1 107
We will continue by creating LOF scores using the numeric data, excluding GPA since it was the variable used to create the categories. We will look at the LOF detection rates and examine its performance using an ROC curve.
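A sketch of how the LOF scores and the ROC curve might be computed, assuming the `dbscan` and `pROC` packages are available; the `minPts` argument of `dbscan::lof()` plays the role of k here, and including the stress dummies alongside the hours variables is an assumption:

```r
library(dbscan)   # lof()
library(pROC)     # roc(), auc()

# LOF on the scaled numeric variables, excluding GPA (the labeling variable)
lof.vars   <- scale(subset(stud.num, select = -GPA))
lof.scores <- lof(lof.vars, minPts = 100)     # k = 100 nearest neighbors

# ROC curve of the LOF scores against the sparse HighGPA label
roc.100 <- roc(students$HighGPA, lof.scores)
plot(roc.100)
auc(roc.100)
```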
The LOF method with k = 100 does not appear to perform much better than random chance, though its AUC is slightly above 0.5. We will try the same method with different k values and construct ROC curves.
Once again, the performance of the different LOF settings does not seem much better than random chance. However, k values of 50 and 400 seem to perform the best of the ones we tried.
Unsupervised learning methods can be powerful ways of exploring patterns in a dataset which can then be exploited in supervised learning methods. PCA, for example, can be used to create simpler datasets that can then be used for a variety of supervised learning methods. For further analysis using this data, a differently coded sparse category could be used to check for a better performance using the LOF method. A dataset with more correlation between numeric variables would have led to a more meaningful usage of PCA.