2025-11-16

Background and Problem Definition

Using a data set from Kaggle.com containing 8 variables on student performance, we want to discover possible relationships between the variables. After some initial data wrangling, we undertake an exploratory data analysis and visualization.

The fundamental question is:

Which variables can best predict student GPA?

Let’s have a look at our loaded data set:

datatotal = read.csv("student_lifestyle_dataset.csv", sep=",", header=TRUE)
str(datatotal)
## 'data.frame':    2000 obs. of  8 variables:
##  $ Student_ID                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Study_Hours_Per_Day            : num  6.9 5.3 5.1 6.5 8.1 6 8 8.4 5.2 7.7 ...
##  $ Extracurricular_Hours_Per_Day  : num  3.8 3.5 3.9 2.1 0.6 2.1 0.7 1.8 3.6 0.7 ...
##  $ Sleep_Hours_Per_Day            : num  8.7 8 9.2 7.2 6.5 8 5.3 5.6 6.3 9.8 ...
##  $ Social_Hours_Per_Day           : num  2.8 4.2 1.2 1.7 2.2 0.3 5.7 3 4 4.5 ...
##  $ Physical_Activity_Hours_Per_Day: num  1.8 3 4.6 6.5 6.6 7.6 4.3 5.2 4.9 1.3 ...
##  $ GPA                            : num  2.99 2.75 2.67 2.88 3.51 2.85 3.08 3.2 2.82 2.76 ...
##  $ Stress_Level                   : chr  "Moderate" "Low" "Low" "Moderate" ...
table(is.na(datatotal))
## 
## FALSE 
## 16000

Data Wrangling, Munging, Cleaning

With no NAs, the data looks fairly clean. We will shorten the column names for easier manipulation and remove the column Student_ID, which is irrelevant for our analysis. First, however, we need to inspect the character values in Stress_Level:

unique(datatotal$Stress_Level)
## [1] "Moderate" "Low"      "High"
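
The renaming and column removal described above could look like the following minimal sketch. It rebuilds the first four rows from the str() output so that it runs without the CSV file. The short names are taken from the regression output later in this report, and the object name data is assumed from the lm() call shown there; the exact code used for the report is not shown.

```r
# First four rows, copied from the str() output above, so the sketch runs on its own
datatotal <- data.frame(
  Student_ID                      = 1:4,
  Study_Hours_Per_Day             = c(6.9, 5.3, 5.1, 6.5),
  Extracurricular_Hours_Per_Day   = c(3.8, 3.5, 3.9, 2.1),
  Sleep_Hours_Per_Day             = c(8.7, 8.0, 9.2, 7.2),
  Social_Hours_Per_Day            = c(2.8, 4.2, 1.2, 1.7),
  Physical_Activity_Hours_Per_Day = c(1.8, 3.0, 4.6, 6.5),
  GPA                             = c(2.99, 2.75, 2.67, 2.88),
  Stress_Level                    = c("Moderate", "Low", "Low", "Moderate")
)

# Drop the identifier column, which carries no information about GPA
data <- datatotal[, names(datatotal) != "Student_ID"]

# Shorten the remaining names (same order as the original columns)
names(data) <- c("Study", "EC", "Sleep", "Social", "Phys", "GPA", "Stress")
str(data)
```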

We convert Stress_Level to a factor, since visualization is easier when it is treated as a category.
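
As a sketch of the factor conversion, assuming we want the levels ordered Low, Moderate, High rather than alphabetically (the example vector below is illustrative, built from the three labels found above):

```r
# Example values; the three labels are those returned by unique() above
stress <- c("Moderate", "Low", "Low", "Moderate", "High")

# State the levels explicitly so plots follow Low, Moderate, High
# instead of the default alphabetical order (High, Low, Moderate)
stress_f <- factor(stress, levels = c("Low", "Moderate", "High"))

levels(stress_f)
table(stress_f)
```

Setting the level order at this point means every later plot and table lists the categories in a meaningful sequence.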

Data Visualization

To visualize our data, we start by plotting each variable against GPA:
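
One possible way to draw such plots is sketched below with base R graphics; the plotting package actually used for the report is not shown, so this is an assumption. The sketch uses only the first ten rows of the numeric columns, copied from the str() output, so its panels will look much sparser than plots of all 2000 students.

```r
# First ten rows of the numeric columns, copied from the str() output above
d <- data.frame(
  Study  = c(6.9, 5.3, 5.1, 6.5, 8.1, 6.0, 8.0, 8.4, 5.2, 7.7),
  EC     = c(3.8, 3.5, 3.9, 2.1, 0.6, 2.1, 0.7, 1.8, 3.6, 0.7),
  Sleep  = c(8.7, 8.0, 9.2, 7.2, 6.5, 8.0, 5.3, 5.6, 6.3, 9.8),
  Social = c(2.8, 4.2, 1.2, 1.7, 2.2, 0.3, 5.7, 3.0, 4.0, 4.5),
  Phys   = c(1.8, 3.0, 4.6, 6.5, 6.6, 7.6, 4.3, 5.2, 4.9, 1.3),
  GPA    = c(2.99, 2.75, 2.67, 2.88, 3.51, 2.85, 3.08, 3.20, 2.82, 2.76)
)
predictors <- setdiff(names(d), "GPA")

# One scatter plot per predictor, each with a least-squares line
old_par <- par(mfrow = c(2, 3))
for (v in predictors) {
  plot(d[[v]], d$GPA, xlab = v, ylab = "GPA", main = paste("GPA vs.", v))
  abline(lm(d$GPA ~ d[[v]]), col = "blue")
}
par(old_par)

# Pairwise correlations with GPA as a numeric companion to the plots
cors <- sapply(predictors, function(v) cor(d[[v]], d$GPA))
round(cors, 2)
```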

Study Hours and Stress Level have a clear positive correlation with GPA, while Physical Activity has a negative correlation. Extracurricular, Sleep, and Social hours show no apparent relationship with GPA at first glance. Next, we visualize multiple variables on one graph:

Exploratory Data Analysis

Now, we can look for relationships between GPA and the other variables using linear regression. We re-code the stress levels Low, Moderate, and High as the integers 0, 1, and 2 so that they can be included in the regression calculations.
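
A minimal sketch of the recoding step, using an illustrative vector of labels. Note that coding the levels as 0, 1, 2 treats the step from Low to Moderate as equal in size to the step from Moderate to High, which is a modeling assumption.

```r
# Example labels; the real column has 2000 entries
stress <- c("Moderate", "Low", "Low", "Moderate", "High")

# match() returns the position of each label in the ordered vector (1, 2, 3);
# subtracting 1 gives the codes 0, 1, 2
stress_code <- match(stress, c("Low", "Moderate", "High")) - 1L
stress_code
```

With the recoded column in place, the model is fitted with the call shown in the output below, lm(GPA ~ ., data = data).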

## 
## Call:
## lm(formula = GPA ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5944 -0.1352 -0.0030  0.1344  0.7874 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.0073566  0.0379219  52.934   <2e-16 ***
## Study        0.1543056  0.0051288  30.086   <2e-16 ***
## EC          -0.0074958  0.0039608  -1.892   0.0586 .  
## Sleep       -0.0045156  0.0035812  -1.261   0.2075    
## Social       0.0013135  0.0027898   0.471   0.6378    
## Phys                NA         NA      NA       NA    
## Stress       0.0002071  0.0104942   0.020   0.9843    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2026 on 1994 degrees of freedom
## Multiple R-squared:  0.541,  Adjusted R-squared:  0.5398 
## F-statistic:   470 on 5 and 1994 DF,  p-value: < 2.2e-16

The summary notes that one coefficient (in this case, “Phys”) was not defined because of singularities. This points to perfect collinearity, where one predictor can be computed exactly from one or more other predictors already in the data set.
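
The behavior can be reproduced on simulated data. In the sketch below, all values are simulated and are not the Kaggle data; the ranges and coefficients are arbitrary choices. The last hours column is defined as whatever remains of a 24-hour day, and lm() then reports NA for it.

```r
set.seed(1)
n <- 200

# Simulated hours (NOT the Kaggle data): four free categories,
# and Phys takes the remainder of the 24-hour day
Study  <- runif(n, 5, 10)
EC     <- runif(n, 0, 2)
Sleep  <- runif(n, 5, 8)
Social <- runif(n, 0, 3)
Phys   <- 24 - (Study + EC + Sleep + Social)

# Simulated response; the coefficients here are arbitrary
GPA <- 2 + 0.15 * Study + rnorm(n, sd = 0.2)

sim <- data.frame(Study, EC, Sleep, Social, Phys, GPA)
fit <- lm(GPA ~ ., data = sim)

# Phys is an exact linear combination of the intercept and the other
# four columns, so lm() cannot estimate it and reports NA
coef(fit)
```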

To check whether this is the case, we create a new column containing the sum of all hourly categories recorded in the survey.

## [1] "24"
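
The check can be sketched on the first ten rows, copied from the str() output near the top of this report. The column name Total is a hypothetical choice, and the sums are rounded to one decimal to avoid floating-point noise.

```r
# First ten rows of the five hours columns, copied from the str() output
d <- data.frame(
  Study  = c(6.9, 5.3, 5.1, 6.5, 8.1, 6.0, 8.0, 8.4, 5.2, 7.7),
  EC     = c(3.8, 3.5, 3.9, 2.1, 0.6, 2.1, 0.7, 1.8, 3.6, 0.7),
  Sleep  = c(8.7, 8.0, 9.2, 7.2, 6.5, 8.0, 5.3, 5.6, 6.3, 9.8),
  Social = c(2.8, 4.2, 1.2, 1.7, 2.2, 0.3, 5.7, 3.0, 4.0, 4.5),
  Phys   = c(1.8, 3.0, 4.6, 6.5, 6.6, 7.6, 4.3, 5.2, 4.9, 1.3)
)

# Total hours per student across all five categories
d$Total <- rowSums(d[, c("Study", "EC", "Sleep", "Social", "Phys")])

# A single distinct value means every row adds up to the same total
unique(round(d$Total, 1))
```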

This confirms that one of the columns is redundant for our model: the hours entered always add up to 24, so any one hours column is fully determined by the other four.

This also raises the question of whether the distribution of hours spent by students with a similar GPA follows a pattern. We grouped the students by GPA and computed the distribution of hours within each group.
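
A sketch of such a grouping, again on the first ten rows from the str() output. The GPA cut points below are hypothetical, since the report does not state which bands were used.

```r
# First ten rows of the numeric columns, copied from the str() output
d <- data.frame(
  Study  = c(6.9, 5.3, 5.1, 6.5, 8.1, 6.0, 8.0, 8.4, 5.2, 7.7),
  EC     = c(3.8, 3.5, 3.9, 2.1, 0.6, 2.1, 0.7, 1.8, 3.6, 0.7),
  Sleep  = c(8.7, 8.0, 9.2, 7.2, 6.5, 8.0, 5.3, 5.6, 6.3, 9.8),
  Social = c(2.8, 4.2, 1.2, 1.7, 2.2, 0.3, 5.7, 3.0, 4.0, 4.5),
  Phys   = c(1.8, 3.0, 4.6, 6.5, 6.6, 7.6, 4.3, 5.2, 4.9, 1.3),
  GPA    = c(2.99, 2.75, 2.67, 2.88, 3.51, 2.85, 3.08, 3.20, 2.82, 2.76)
)

# Hypothetical GPA bands: (2.5,3], (3,3.5], (3.5,4]
d$Band <- cut(d$GPA, breaks = c(2.5, 3.0, 3.5, 4.0))

# Mean hours per category within each band
means <- aggregate(cbind(Study, EC, Sleep, Social, Phys) ~ Band,
                   data = d, FUN = mean)
means
```

Because every student's hours sum to 24, the mean hours in each band also sum to 24, which makes the bands directly comparable as shares of a day.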

We can confirm visually what was shown in the individual graphs: As GPA rises, the hours spent studying rise, and the hours spent in physical activity fall.

Among the variables Social, Sleep, and Extracurricular, Social had the highest p-value (0.6378), so we removed it as the least significant of the three. With Social removed, the remaining hours columns no longer sum to a constant, and the coefficient for Phys can be estimated.

We now run our model again:

## 
## Call:
## lm(formula = GPA ~ ., data = datanoSocial)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5944 -0.1352 -0.0030  0.1344  0.7874 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.0388815  0.0609810  33.435   <2e-16 ***
## Study        0.1529921  0.0056712  26.977   <2e-16 ***
## EC          -0.0088093  0.0045097  -1.953   0.0509 .  
## Sleep       -0.0058291  0.0041219  -1.414   0.1575    
## Phys        -0.0013135  0.0027898  -0.471   0.6378    
## Stress       0.0002071  0.0104942   0.020   0.9843    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2026 on 1994 degrees of freedom
## Multiple R-squared:  0.541,  Adjusted R-squared:  0.5398 
## F-statistic:   470 on 5 and 1994 DF,  p-value: < 2.2e-16
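
The two summaries report identical residuals, residual standard error, and R-squared, and the Phys estimate equals the Social estimate with the sign flipped. This is expected when the hours sum to a constant: swapping Social for Phys re-expresses the same model rather than fitting a different one. The sketch below illustrates this on simulated data (not the Kaggle data; ranges and coefficients are arbitrary).

```r
set.seed(1)
n <- 200

# Simulated hours that always sum to 24 (NOT the Kaggle data)
Study  <- runif(n, 5, 10)
EC     <- runif(n, 0, 2)
Sleep  <- runif(n, 5, 8)
Social <- runif(n, 0, 3)
Phys   <- 24 - (Study + EC + Sleep + Social)
GPA    <- 2 + 0.15 * Study + rnorm(n, sd = 0.2)
sim    <- data.frame(Study, EC, Sleep, Social, Phys, GPA)

# Same model, once with Social and once with Phys as the fourth predictor
fit_social <- lm(GPA ~ Study + EC + Sleep + Social, data = sim)
fit_phys   <- lm(GPA ~ Study + EC + Sleep + Phys,   data = sim)

# Fitted values agree, so residuals and R-squared agree as well
all.equal(fitted(fit_social), fitted(fit_phys))

# The two coefficients cancel: their sum is zero up to rounding error
coef(fit_social)[["Social"]] + coef(fit_phys)[["Phys"]]
```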

Study hours is the only predictor that is significant at the 0.05 level (p < 2e-16), which makes it clearly the best predictor of GPA. We graph the linear regression of GPA vs. Study hours, including the 95% confidence interval.

The very narrow confidence band indicates that the regression line is estimated precisely, which is consistent with a strong relationship between Study Hours and GPA.
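
A confidence band of this kind can be computed with predict(). The sketch below uses base R and only the first ten rows from the str() output, so its band is far wider than one fitted to all 2000 students; the plotting code used for the report itself is not shown.

```r
# First ten rows of Study and GPA, copied from the str() output
d <- data.frame(
  Study = c(6.9, 5.3, 5.1, 6.5, 8.1, 6.0, 8.0, 8.4, 5.2, 7.7),
  GPA   = c(2.99, 2.75, 2.67, 2.88, 3.51, 2.85, 3.08, 3.20, 2.82, 2.76)
)

fit <- lm(GPA ~ Study, data = d)

# Evaluate the fitted line and its 95% confidence limits on a grid of Study values
grid <- data.frame(Study = seq(min(d$Study), max(d$Study), length.out = 50))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

# Scatter plot with the fitted line (solid) and confidence limits (dashed)
plot(d$Study, d$GPA, xlab = "Study hours per day", ylab = "GPA")
lines(grid$Study, band[, "fit"])
lines(grid$Study, band[, "lwr"], lty = 2)
lines(grid$Study, band[, "upr"], lty = 2)

# Average width of the band in GPA points
mean(band[, "upr"] - band[, "lwr"])
```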

As a contrast, we plot the linear regression of GPA vs. Hours Spent Socializing:

The much broader grey band indicates a much wider confidence interval, meaning the fitted line is far less certain and this variable is not a good predictor of GPA.

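In summary, high GPAs were strongly correlated with more hours spent studying and with higher stress levels, although stress level added no predictive power once study hours were included in the model (p = 0.98).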

Using common sense, we can recommend that students who want to increase their GPAs invest more time in studying, while being prepared for a rise in their stress level.