Adrianna Łazuga
418397
This project analyzes student lifestyles with the aim of helping students improve in the future. The main goal is to divide the students into smaller groups so that each group can receive better, more personalized care based on its needs. The dataset used in this project contains information about different areas of students' lifestyles. The first step is to reduce the number of features using Principal Component Analysis (PCA). In the second step, based on the PCA results, I will cluster the students using the K-Means algorithm.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(stats)
library(cluster)
## Warning: package 'cluster' was built under R version 4.4.2
library(pheatmap)
## Warning: package 'pheatmap' was built under R version 4.4.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
##   method from
##   +.gg   ggplot2
df <- read.csv("student_lifestyle_dataset.csv")
summary(df)
##    Student_ID     Study_Hours_Per_Day Extracurricular_Hours_Per_Day
##  Min.   :   1.0   Min.   : 5.000      Min.   :0.00
##  1st Qu.: 500.8   1st Qu.: 6.300      1st Qu.:1.00
##  Median :1000.5   Median : 7.400      Median :2.00
##  Mean   :1000.5   Mean   : 7.476      Mean   :1.99
##  3rd Qu.:1500.2   3rd Qu.: 8.700      3rd Qu.:3.00
##  Max.   :2000.0   Max.   :10.000      Max.   :4.00
##  Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day
##  Min.   : 5.000      Min.   :0.000        Min.   : 0.000
##  1st Qu.: 6.200      1st Qu.:1.200        1st Qu.: 2.400
##  Median : 7.500      Median :2.600        Median : 4.100
##  Mean   : 7.501      Mean   :2.705        Mean   : 4.328
##  3rd Qu.: 8.800      3rd Qu.:4.100        3rd Qu.: 6.100
##  Max.   :10.000      Max.   :6.000        Max.   :13.000
##       GPA          Stress_Level
##  Min.   :2.240   Length:2000
##  1st Qu.:2.900   Class :character
##  Median :3.110   Mode  :character
##  Mean   :3.116
##  3rd Qu.:3.330
##  Max.   :4.000
Columns of the dataset used in the further analysis:
Study_Hours_Per_Day - number of hours spent daily on studying
Extracurricular_Hours_Per_Day - number of hours spent daily on extracurricular activities
Sleep_Hours_Per_Day - number of hours spent daily on sleeping
Social_Hours_Per_Day - number of hours spent daily on social activities
Physical_Activity_Hours_Per_Day - number of hours spent daily on physical activities
GPA - grade point average
features <- df[c("Study_Hours_Per_Day", "Extracurricular_Hours_Per_Day",
                 "Sleep_Hours_Per_Day", "Social_Hours_Per_Day",
                 "Physical_Activity_Hours_Per_Day", "GPA")]
sum(is.na(features))
## [1] 0
head(features)
##   Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day
## 1                 6.9                           3.8                 8.7
## 2                 5.3                           3.5                 8.0
## 3                 5.1                           3.9                 9.2
## 4                 6.5                           2.1                 7.2
## 5                 8.1                           0.6                 6.5
## 6                 6.0                           2.1                 8.0
##   Social_Hours_Per_Day Physical_Activity_Hours_Per_Day  GPA
## 1                  2.8                             1.8 2.99
## 2                  4.2                             3.0 2.75
## 3                  1.2                             4.6 2.67
## 4                  1.7                             6.5 2.88
## 5                  2.2                             6.6 3.51
## 6                  0.3                             7.6 2.85
str(features)
## 'data.frame':    2000 obs. of  6 variables:
##  $ Study_Hours_Per_Day            : num  6.9 5.3 5.1 6.5 8.1 6 8 8.4 5.2 7.7 ...
##  $ Extracurricular_Hours_Per_Day  : num  3.8 3.5 3.9 2.1 0.6 2.1 0.7 1.8 3.6 0.7 ...
##  $ Sleep_Hours_Per_Day            : num  8.7 8 9.2 7.2 6.5 8 5.3 5.6 6.3 9.8 ...
##  $ Social_Hours_Per_Day           : num  2.8 4.2 1.2 1.7 2.2 0.3 5.7 3 4 4.5 ...
##  $ Physical_Activity_Hours_Per_Day: num  1.8 3 4.6 6.5 6.6 7.6 4.3 5.2 4.9 1.3 ...
##  $ GPA                            : num  2.99 2.75 2.67 2.88 3.51 2.85 3.08 3.2 2.82 2.76 ...
The heatmap below shows the pairwise correlations between the features. As expected, Study Hours and GPA have a strong positive correlation. The heatmap also shows that the Physical Activity feature is negatively correlated with all other features.
pheatmap(cor(features))
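The heatmap can be complemented by a numeric listing of the strongest pairwise correlations. The sketch below is one possible way to do this in base R. `cor_pairs` is a hypothetical helper and `toy` is made-up data standing in for `features`, so the printed values say nothing about the real dataset.

```r
# Made-up stand-in for `features`: five activities sharing a 24-hour day,
# plus a GPA-like column loosely tied to study time.
set.seed(1)
n <- 200
raw <- matrix(runif(n * 5), ncol = 5)
hours <- 24 * raw / rowSums(raw)
colnames(hours) <- c("Study", "Extracurricular", "Sleep", "Social", "Physical")
toy <- data.frame(hours, GPA = 2 + 0.15 * hours[, "Study"] + rnorm(n, sd = 0.2))

# Hypothetical helper: one row per variable pair, strongest correlation first.
cor_pairs <- function(data) {
  cm <- cor(data)
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, 1]],
    var2 = colnames(cm)[idx[, 2]],
    r = cm[upper.tri(cm)],
    row.names = NULL
  )
  pairs[order(-abs(pairs$r)), ]
}

pairs_toy <- cor_pairs(toy)
head(pairs_toy, 5)
```

With the real data, the same helper could be called as `cor_pairs(features)`.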
This step focuses on dimensionality reduction using Principal Component Analysis (PCA). The features are centered and scaled to unit variance before the components are computed.
pca <- prcomp(features, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5       PC6
## Standard deviation     1.4663 1.1677 1.1035 0.9948 0.5282 1.282e-15
## Proportion of Variance 0.3584 0.2273 0.2029 0.1650 0.0465 0.000e+00
## Cumulative Proportion  0.3584 0.5856 0.7886 0.9535 1.0000 1.000e+00
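A standard deviation of about 1e-15 for PC6 is effectively zero, which is what PCA produces when the input columns are linearly dependent. The sketch below first checks the six rows printed by `head(features)` (values copied from that output), then reproduces the effect on made-up data in which five columns are forced to sum to 24. `head_rows` and `toy` are illustrative names, not part of the original analysis.

```r
# Daily-hours values copied from the head(features) output above.
head_rows <- data.frame(
  Study           = c(6.9, 5.3, 5.1, 6.5, 8.1, 6.0),
  Extracurricular = c(3.8, 3.5, 3.9, 2.1, 0.6, 2.1),
  Sleep           = c(8.7, 8.0, 9.2, 7.2, 6.5, 8.0),
  Social          = c(2.8, 4.2, 1.2, 1.7, 2.2, 0.3),
  Physical        = c(1.8, 3.0, 4.6, 6.5, 6.6, 7.6)
)
row_totals <- rowSums(head_rows)
round(row_totals, 1)  # each of the six rows totals 24

# Made-up data with the same constraint: five columns that sum to 24.
set.seed(1)
n <- 200
raw <- matrix(runif(n * 5), ncol = 5)
hours <- 24 * raw / rowSums(raw)
colnames(hours) <- names(head_rows)
toy <- data.frame(hours, GPA = 2 + 0.15 * hours[, "Study"] + rnorm(n, sd = 0.2))

# The last component carries no variance because one column is determined
# by the other four.
sdev_toy <- prcomp(toy, center = TRUE, scale. = TRUE)$sdev
signif(sdev_toy, 3)
```

Whether the constraint holds for all 2000 rows of the real data could be checked with `range(rowSums(features[1:5]))`.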
fviz_eig(pca, barfill = "lightblue2", addlabels = TRUE)
fviz_pca_var(pca)
get_eigenvalue(pca)
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.150141e+00     3.583568e+01                    35.83568
## Dim.2 1.363524e+00     2.272541e+01                    58.56109
## Dim.3 1.217668e+00     2.029446e+01                    78.85555
## Dim.4 9.896821e-01     1.649470e+01                    95.35025
## Dim.5 2.789850e-01     4.649750e+00                   100.00000
## Dim.6 1.642402e-30     2.737336e-29                   100.00000
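The first 4 principal components explain over 95% of the variance of the original features, so the feature set can be reduced to 4 variables. The sixth component has essentially zero variance (an eigenvalue of about 1e-30), which indicates that the features are linearly dependent: in the rows printed by head(features), the five daily-hours columns sum to 24.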
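The choice of 4 components can be reproduced from the eigenvalues alone. The sketch below copies the eigenvalues from the `get_eigenvalue(pca)` output and picks the smallest number of components whose cumulative share of variance reaches a threshold. The 0.95 cut-off is an assumption chosen to match the "over 95%" statement, and the variable names are illustrative.

```r
# Eigenvalues copied from the get_eigenvalue(pca) output above.
eigenvalues <- c(2.150141, 1.363524, 1.217668, 0.9896821, 0.2789850, 1.642402e-30)

# Cumulative share of the total variance explained by the first k components.
cumulative <- cumsum(eigenvalues) / sum(eigenvalues)

threshold <- 0.95  # assumed cut-off
n_components <- which(cumulative >= threshold)[1]

round(100 * cumulative, 2)
n_components  # 4
```

With the fitted object, the same eigenvalues are available as `pca$sdev^2`.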
The next step is clustering with the K-Means algorithm, applied to the principal components selected in the previous part.
To find the optimal number of clusters, I will use the Elbow method and the Silhouette score.
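The Elbow method tracks the total within-cluster sum of squares (WSS) as the number of clusters grows and looks for the point where adding clusters stops helping much. The sketch below computes that quantity in base R on made-up two-dimensional data. `toy_scores` is a stand-in for the PCA scores, and `max_k = 6` and `nstart = 10` are arbitrary choices.

```r
# Made-up data: two groups of 50 points each.
set.seed(42)
toy_scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

max_k <- 6
wss <- numeric(max_k)

# For k = 1 the WSS is the total sum of squares around the column means.
wss[1] <- sum(scale(toy_scores, scale = FALSE)^2)

# For k >= 2, take the total within-cluster sum of squares from kmeans().
for (k in 2:max_k) {
  wss[k] <- kmeans(toy_scores, centers = k, nstart = 10)$tot.withinss
}

data.frame(k = 1:max_k, wss = round(wss, 1))
```

Plotting `wss` against `k` gives the kind of curve that `fviz_nbclust(..., method = "wss")` draws.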
pca_variables <- pca$x[, 1:4]
fviz_nbclust(pca_variables, kmeans, method = "wss")
fviz_nbclust(pca_variables, kmeans, method = "silhouette")
Based on these results, the optimal number of clusters is 2.
clusters <- kmeans(pca_variables, 2)
clusters$size
## [1] 971 1029
clusters$centers
##         PC1        PC2         PC3          PC4
## 1  1.238473  0.1979269 -0.02470234 -0.005916837
## 2 -1.168666 -0.1867707  0.02330998  0.005583332
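K-Means starts from randomly chosen centers, so cluster labels and sizes such as those above can differ between runs unless the random seed is fixed. The sketch below shows this on made-up data. The seed value and `nstart = 25` are arbitrary assumptions, not settings used in the original analysis.

```r
# Made-up data standing in for the PCA scores.
set.seed(42)
toy_scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# Fixing the seed immediately before each call makes the result repeatable;
# nstart tries several random starts and keeps the best one.
set.seed(123)
fit_a <- kmeans(toy_scores, centers = 2, nstart = 25)
set.seed(123)
fit_b <- kmeans(toy_scores, centers = 2, nstart = 25)

identical(fit_a$cluster, fit_b$cluster)
fit_a$size
```

Applied to the analysis, this would mean calling `set.seed()` once before `kmeans(pca_variables, 2)`.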
fviz_cluster(list(data = pca_variables, cluster = clusters$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())
sil <- silhouette(clusters$cluster, dist(pca_variables))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  971          0.25
## 2       2 1029          0.22
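The silhouette width of a point compares its mean distance to the other points of its own cluster, a, with its mean distance to the points of the nearest other cluster, b, as (b - a) / max(a, b). Values near 1 indicate a well-placed point and values near 0 a point lying between clusters. The sketch below computes it by hand in base R on made-up, well-separated data. `silhouette_widths` is a hypothetical helper, not the `cluster::silhouette` implementation.

```r
# Hypothetical helper: silhouette width of every point, given cluster labels.
silhouette_widths <- function(x, labels) {
  d <- as.matrix(dist(x))
  groups <- sort(unique(labels))
  sapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # convention for a single-point cluster
    a <- mean(d[i, own])
    others <- setdiff(groups, labels[i])
    b <- min(sapply(others, function(g) mean(d[i, labels == g])))
    (b - a) / max(a, b)
  })
}

# Made-up data: two clearly separated groups with known labels.
set.seed(42)
toy_scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)
toy_labels <- rep(1:2, each = 50)

s <- silhouette_widths(toy_scores, toy_labels)
tapply(s, toy_labels, mean)  # average silhouette width per cluster
```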
The plots below show how the two clusters are distributed across pairs of features from the original dataset.
df$cluster <- as.factor(clusters$cluster)
ggpairs(df, columns = 2:7, aes(color = cluster, alpha = 0.5))
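Alongside the pairwise plots, a table of per-cluster feature means gives a compact profile of each group. The sketch below shows the idea on made-up data. `toy` stands in for `df`, and the seed and `nstart` values are arbitrary.

```r
# Made-up stand-in for the student data.
set.seed(7)
n <- 200
raw <- matrix(runif(n * 5), ncol = 5)
hours <- 24 * raw / rowSums(raw)
colnames(hours) <- c("Study", "Extracurricular", "Sleep", "Social", "Physical")
toy <- data.frame(hours, GPA = 2 + 0.15 * hours[, "Study"] + rnorm(n, sd = 0.2))

# Cluster the standardized features and attach the labels as a factor.
fit <- kmeans(scale(toy), centers = 2, nstart = 25)
toy$cluster <- as.factor(fit$cluster)

# Mean of every feature within each cluster.
profiles <- aggregate(. ~ cluster, data = toy, FUN = mean)
profiles
```

With the real data, the same `aggregate()` call could be applied to the six feature columns of `df` together with `cluster`.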
Using Principal Component Analysis (PCA), the number of variables was reduced from 6 to 4. Then, using the K-Means clustering method, the students were divided into two groups based on their scores on these four principal components. This grouping could help provide workshops better fitted to the areas in which each group needs improvement, and focusing on the topics a specific group needs could bring better results in the future.
Data source: https://www.kaggle.com/datasets/steve1215rogg/student-lifestyle-dataset