Unsupervised Learning - Clustering and Dimension Reduction Project

Adrianna Łazuga

418397

Project Introduction

This project analyzes student lifestyles with the aim of supporting students more effectively in the future. The main goal is to divide the students into smaller groups so that each group can receive more personalized care based on its needs. The dataset contains information about different areas of students’ lifestyles. The first step is to reduce the number of features using Principal Component Analysis (PCA). In the second step, I cluster the students with the K-Means algorithm applied to the resulting components.

Libraries

library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(stats)
library(cluster)
## Warning: package 'cluster' was built under R version 4.4.2
library(pheatmap)
## Warning: package 'pheatmap' was built under R version 4.4.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Dataset

df <- read.csv("student_lifestyle_dataset.csv")
summary(df)
##    Student_ID     Study_Hours_Per_Day Extracurricular_Hours_Per_Day
##  Min.   :   1.0   Min.   : 5.000      Min.   :0.00                 
##  1st Qu.: 500.8   1st Qu.: 6.300      1st Qu.:1.00                 
##  Median :1000.5   Median : 7.400      Median :2.00                 
##  Mean   :1000.5   Mean   : 7.476      Mean   :1.99                 
##  3rd Qu.:1500.2   3rd Qu.: 8.700      3rd Qu.:3.00                 
##  Max.   :2000.0   Max.   :10.000      Max.   :4.00                 
##  Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day
##  Min.   : 5.000      Min.   :0.000        Min.   : 0.000                 
##  1st Qu.: 6.200      1st Qu.:1.200        1st Qu.: 2.400                 
##  Median : 7.500      Median :2.600        Median : 4.100                 
##  Mean   : 7.501      Mean   :2.705        Mean   : 4.328                 
##  3rd Qu.: 8.800      3rd Qu.:4.100        3rd Qu.: 6.100                 
##  Max.   :10.000      Max.   :6.000        Max.   :13.000                 
##       GPA        Stress_Level      
##  Min.   :2.240   Length:2000       
##  1st Qu.:2.900   Class :character  
##  Median :3.110   Mode  :character  
##  Mean   :3.116                     
##  3rd Qu.:3.330                     
##  Max.   :4.000

Data preparation

The following columns of the dataset are used in the further analysis:

Study_Hours_Per_Day - number of hours spent daily on studying

Extracurricular_Hours_Per_Day - number of hours spent daily on extracurricular activities

Sleep_Hours_Per_Day - number of hours spent daily on sleeping

Social_Hours_Per_Day - number of hours spent daily on social activities

Physical_Activity_Hours_Per_Day - number of hours spent daily on physical activities

GPA - grade point average

features <- df[c("Study_Hours_Per_Day", "Extracurricular_Hours_Per_Day",
                 "Sleep_Hours_Per_Day", "Social_Hours_Per_Day",
                 "Physical_Activity_Hours_Per_Day", "GPA")]
sum(is.na(features))
## [1] 0
head(features)
##   Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day
## 1                 6.9                           3.8                 8.7
## 2                 5.3                           3.5                 8.0
## 3                 5.1                           3.9                 9.2
## 4                 6.5                           2.1                 7.2
## 5                 8.1                           0.6                 6.5
## 6                 6.0                           2.1                 8.0
##   Social_Hours_Per_Day Physical_Activity_Hours_Per_Day  GPA
## 1                  2.8                             1.8 2.99
## 2                  4.2                             3.0 2.75
## 3                  1.2                             4.6 2.67
## 4                  1.7                             6.5 2.88
## 5                  2.2                             6.6 3.51
## 6                  0.3                             7.6 2.85
str(features)
## 'data.frame':    2000 obs. of  6 variables:
##  $ Study_Hours_Per_Day            : num  6.9 5.3 5.1 6.5 8.1 6 8 8.4 5.2 7.7 ...
##  $ Extracurricular_Hours_Per_Day  : num  3.8 3.5 3.9 2.1 0.6 2.1 0.7 1.8 3.6 0.7 ...
##  $ Sleep_Hours_Per_Day            : num  8.7 8 9.2 7.2 6.5 8 5.3 5.6 6.3 9.8 ...
##  $ Social_Hours_Per_Day           : num  2.8 4.2 1.2 1.7 2.2 0.3 5.7 3 4 4.5 ...
##  $ Physical_Activity_Hours_Per_Day: num  1.8 3 4.6 6.5 6.6 7.6 4.3 5.2 4.9 1.3 ...
##  $ GPA                            : num  2.99 2.75 2.67 2.88 3.51 2.85 3.08 3.2 2.82 2.76 ...

Correlation

The heatmap below shows the pairwise correlations between the features. As expected, Study Hours and GPA are strongly positively correlated. The heatmap also shows that the Physical Activity feature is negatively correlated with all other features.

pheatmap(cor(features))
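
A heatmap makes strong correlations easy to spot, but the exact values are harder to read off. As a complement, the sketch below ranks variable pairs by absolute Pearson correlation. The helper top_correlations and the simulated toy data are hypothetical and serve only as an illustration; on the project data the call would presumably be top_correlations(features).

```r
# Hypothetical helper: rank variable pairs by absolute Pearson correlation.
top_correlations <- function(data) {
  cm <- cor(data)
  # Positions of the upper triangle, so each pair appears only once
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Simulated data (assumption): b is almost a copy of a, c is independent.
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)
toy$c <- rnorm(50)

res <- top_correlations(toy)
print(res)
```

The pair listed first is the most strongly correlated one, regardless of sign.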

Dimension reduction

This step focuses on dimension reduction using Principal Component Analysis (PCA).

pca <- prcomp(features, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5       PC6
## Standard deviation     1.4663 1.1677 1.1035 0.9948 0.5282 1.282e-15
## Proportion of Variance 0.3584 0.2273 0.2029 0.1650 0.0465 0.000e+00
## Cumulative Proportion  0.3584 0.5856 0.7886 0.9535 1.0000 1.000e+00
fviz_eig(pca, barfill = "lightblue2", addlabels = TRUE)
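
The last component in the summary has a standard deviation of about 1.3e-15, which is numerically zero. This usually indicates that the input columns are linearly dependent. In the six rows printed by head(features), the five daily-hour columns add up to exactly 24, which would explain it. The sketch below reproduces the effect on simulated data; the matrix hours and its column names are made up for illustration and are not taken from the dataset.

```r
# Simulated data (assumption): five activity columns that always sum to 24.
set.seed(42)
n <- 200
raw <- matrix(runif(n * 5), ncol = 5)
hours <- 24 * raw / rowSums(raw)
colnames(hours) <- paste0("activity_", 1:5)

# Because the columns are linearly dependent, the last component has
# (numerically) zero standard deviation.
pca_toy <- prcomp(hours, center = TRUE, scale. = TRUE)
last_sd <- pca_toy$sdev[5]
print(signif(pca_toy$sdev, 3))

# On the project data, the same check could be done with:
# range(rowSums(features[, 1:5]))
```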

fviz_pca_var(pca)

get_eigenvalue(pca)
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.150141e+00     3.583568e+01                    35.83568
## Dim.2 1.363524e+00     2.272541e+01                    58.56109
## Dim.3 1.217668e+00     2.029446e+01                    78.85555
## Dim.4 9.896821e-01     1.649470e+01                    95.35025
## Dim.5 2.789850e-01     4.649750e+00                   100.00000
## Dim.6 1.642402e-30     2.737336e-29                   100.00000

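The first four principal components explain over 95% of the variance of the original features, so the feature set can be reduced to four variables. The sixth component has essentially zero variance (an eigenvalue of about 1.6e-30) and therefore carries no information.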
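
The choice of four components can also be made programmatically instead of being read off the table. The following sketch uses the standard deviations printed by summary(pca); the helper components_for_variance and the 0.95 threshold are assumptions made for illustration.

```r
# Standard deviations copied from the summary(pca) output
sdev <- c(1.4663, 1.1677, 1.1035, 0.9948, 0.5282, 1.282e-15)

# Hypothetical helper: smallest number of components whose cumulative
# share of variance reaches the given threshold.
components_for_variance <- function(sdev, threshold = 0.95) {
  explained <- sdev^2 / sum(sdev^2)
  which(cumsum(explained) >= threshold)[1]
}

k <- components_for_variance(sdev)
print(k)  # 4
```

With the fitted object available, the same helper could be called as components_for_variance(pca$sdev).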

Clustering

The next step is clustering with the K-Means algorithm, applied to the principal components chosen in the previous part.

To find the optimal number of clusters, I will use the Elbow method and the Silhouette score.

pca_variables <- pca$x[, 1:4]
fviz_nbclust(pca_variables, kmeans, method = "wss")

fviz_nbclust(pca_variables, kmeans, method = "silhouette")
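
To make the two criteria more concrete, the sketch below computes them by hand: the total within-cluster sum of squares for the Elbow method and the average silhouette width for the Silhouette score. It runs on simulated data with two well-separated groups (an assumption for illustration) and is not meant to reproduce the internals of fviz_nbclust exactly.

```r
library(cluster)

# Simulated data (assumption): two well-separated groups of 50 points each.
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = -3), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2)
)

# Elbow method: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Silhouette score: average silhouette width for k = 2..6
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, dist(toy))[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
print(round(wss, 1))
print(round(avg_sil, 3))
print(best_k)
```

The Elbow method looks for the k after which wss stops dropping sharply, while the Silhouette score picks the k with the highest average width.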

Based on these results, the optimal number of clusters is 2.

clusters <- kmeans(pca_variables, 2)
clusters$size
## [1]  971 1029
clusters$centers
##         PC1        PC2         PC3          PC4
## 1  1.238473  0.1979269 -0.02470234 -0.005916837
## 2 -1.168666 -0.1867707  0.02330998  0.005583332
fviz_cluster(list(data=pca_variables, cluster=clusters$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
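
K-Means starts from randomly chosen centers, so the cluster labels and sizes can differ between runs. One common way to make the result reproducible is to fix the random seed and to use several random starts. The sketch below shows this on simulated data; the wrapper run_kmeans, the seed value and nstart = 25 are assumptions, not settings used in this project.

```r
# Simulated data (assumption): two groups of 50 points each.
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = -2), ncol = 2),
  matrix(rnorm(100, mean = 2), ncol = 2)
)

# Hypothetical wrapper: fix the seed, then keep the best of several starts.
run_kmeans <- function(x, k, seed = 123) {
  set.seed(seed)
  kmeans(x, centers = k, nstart = 25)
}

fit1 <- run_kmeans(toy, 2)
fit2 <- run_kmeans(toy, 2)

# Both runs give exactly the same assignment
print(identical(fit1$cluster, fit2$cluster))
print(fit1$size)
```

On the project data the analogous call would be run_kmeans(pca_variables, 2).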

sil <- silhouette(clusters$cluster, dist(pca_variables))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  971          0.25
## 2       2 1029          0.22

The pairwise plots below show how the two clusters are distributed across the features of the original dataset.

df$cluster <- as.factor(clusters$cluster)
ggpairs(df, columns = 2:7, aes(color = cluster, alpha = 0.5))
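
Besides the plots, a table of per-cluster means can help to describe each group. The sketch below shows the idea with base R on a tiny made-up data frame; the column names and values in toy are hypothetical.

```r
# Made-up data (assumption): four students already assigned to two clusters.
toy <- data.frame(
  study = c(5, 6, 9, 10),
  gpa = c(2.5, 2.7, 3.6, 3.8),
  cluster = factor(c(1, 1, 2, 2))
)

# Mean of every feature within each cluster
profile <- aggregate(cbind(study, gpa) ~ cluster, data = toy, FUN = mean)
print(profile)

# On the project data, a similar table could be obtained with:
# aggregate(. ~ cluster, data = df[, c(names(features), "cluster")], FUN = mean)
```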

Conclusion

Using Principal Component Analysis (PCA), the six original features were reduced to four principal components. K-Means clustering on these components then divided the students into two groups based on their daily time allocation and GPA. The average silhouette widths of 0.25 and 0.22 suggest that the two groups overlap considerably, so the split is best treated as a rough segmentation. Such a grouping could help in offering workshops tailored to the areas in which each group most needs improvement, which may bring better results than a single program for all students.

Data source: https://www.kaggle.com/datasets/steve1215rogg/student-lifestyle-dataset