How Study Habits and Lifestyle Affect Student Grades

Author

Yohannes Gebretsadik

Introductory Essay

This project analyzes student academic performance using a real-world dataset containing demographic, behavioral, and academic variables. The dataset includes information such as study time, number of past failures, absences, parental education levels, and final grades.

The goal of this analysis is to understand which factors have the strongest impact on student performance, particularly final grades. This topic is important because identifying key predictors of academic success can help improve educational strategies and student outcomes.

The dataset was obtained from Kaggle and is based on a survey of secondary school students. Before analysis, the data was cleaned by selecting relevant variables and checking for missing values to ensure accuracy and consistency.

Research Question

What factors significantly affect students’ final grades (G3), and how do variables such as study time, past failures, absences, alcohol consumption, and parental education influence academic performance?

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
df <- read_csv("student_data.csv") 
Rows: 395 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
dbl (16): age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Dataset Description

The dataset includes both categorical variables (such as sex and school) and numeric variables (such as age, absences, and the grades G1, G2, and G3). Mother’s education (Medu) is an ordinal variable stored as numeric codes from 0 to 4.

For this analysis, G3 (final grade) is used as the response variable. The predictor variables in the regression model are studytime, absences, failures, Walc (weekend alcohol consumption), and Medu (mother’s education). The variable sex (gender) is kept for a group comparison of final grades.

The selected variables were chosen based on their potential influence on academic performance. Study time represents student effort and engagement in learning, while absences capture attendance and participation in class. Past failures were included as an indicator of prior academic difficulty, which may strongly impact future performance. Weekend alcohol consumption (Walc) was selected to represent lifestyle habits that could negatively affect academic outcomes. Mother’s education level (Medu) serves as a proxy for socioeconomic background, and gender (sex) was included as a categorical variable to explore potential differences in performance between groups.

Data Cleaning

Data cleaning consisted of selecting the relevant variables for analysis: final grade (G3), study time, absences, past failures, weekend alcohol consumption (Walc), mother’s education (Medu), and sex. The dataset was checked for missing values, and none were found in any column, so the analysis uses all 395 observations.

colSums(is.na(df))
    school        sex        age    address    famsize    Pstatus       Medu 
         0          0          0          0          0          0          0 
      Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
         0          0          0          0          0          0          0 
  failures  schoolsup     famsup       paid activities    nursery     higher 
         0          0          0          0          0          0          0 
  internet   romantic     famrel   freetime      goout       Dalc       Walc 
         0          0          0          0          0          0          0 
    health   absences         G1         G2         G3 
         0          0          0          0          0 
df_clean <- df %>%
  select(G3, studytime, absences, failures, Walc, Medu, sex)
summary(df_clean)
       G3          studytime        absences         failures     
 Min.   : 0.00   Min.   :1.000   Min.   : 0.000   Min.   :0.0000  
 1st Qu.: 8.00   1st Qu.:1.000   1st Qu.: 0.000   1st Qu.:0.0000  
 Median :11.00   Median :2.000   Median : 4.000   Median :0.0000  
 Mean   :10.42   Mean   :2.035   Mean   : 5.709   Mean   :0.3342  
 3rd Qu.:14.00   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.:0.0000  
 Max.   :20.00   Max.   :4.000   Max.   :75.000   Max.   :3.0000  
      Walc            Medu           sex           
 Min.   :1.000   Min.   :0.000   Length:395        
 1st Qu.:1.000   1st Qu.:2.000   Class :character  
 Median :2.000   Median :3.000   Mode  :character  
 Mean   :2.291   Mean   :2.749                     
 3rd Qu.:3.000   3rd Qu.:4.000                     
 Max.   :5.000   Max.   :4.000                     
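
The summary above reports sex only as a character column (Length, Class, Mode), so it gives no group counts. The following is a minimal sketch of one way to get them, assuming the df_clean object created above. The helper name as_factor_cols is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: convert every character column to a factor so that
# summary() reports level counts instead of Length / Class / Mode.
as_factor_cols <- function(data) {
  is_chr <- vapply(data, is.character, logical(1))
  data[is_chr] <- lapply(data[is_chr], factor)
  data
}

# Applied only if the cleaned data from this report is in the session.
if (exists("df_clean")) {
  print(summary(as_factor_cols(df_clean)$sex))
}
```

Storing sex as a factor also makes its role as a grouping variable explicit in later plots and tests.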

Visualization

The following visualization shows the relationship between study time and final grades, with points colored by weekend alcohol consumption.

ggplot(df_clean, aes(x = studytime, y = G3, color = as.factor(Walc))) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Study Time vs Final Grade",
    x = "Study Time",
    y = "Final Grade (G3)",
    color = "Alcohol Consumption",
    caption = "Source: UCI Student Performance Dataset"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Interactive Visualization

ggplotly(
  ggplot(df_clean, aes(x = studytime, y = G3, color = as.factor(Walc))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(
      title = "Interactive: Study Time vs Final Grade",
      x = "Study Time",
      y = "Final Grade (G3)"
    )
)
`geom_smooth()` using formula = 'y ~ x'
  • This interactive visualization allows users to explore the relationship between study time and final grades by hovering over individual data points.

Interpretation

A multiple linear regression model was used to analyze how study habits and lifestyle factors influence students’ final grades.

The results indicate that past failures have a strong negative association with academic performance: students with more past failures tend to achieve lower final grades. Mother’s education shows a smaller but statistically significant positive association.

In contrast, study time, absences, and weekend alcohol consumption do not have a statistically significant effect on final grades in this model. The estimated coefficient for absences is small and positive rather than negative. This suggests that prior academic performance is a more important predictor than short-term behaviors.

ggplot(df_clean, aes(x = studytime, y = G3, color = as.factor(Walc))) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Effect of Study Time on Final Grades",
    x = "Study Time",
    y = "Final Grade (G3)",
    color = "Weekend Alcohol Consumption",
    caption = "Source: UCI Student Performance Dataset"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The plot shows the relationship between study time and students’ final grades (G3), with color representing weekend alcohol consumption. There is a general positive trend, indicating that students who study more tend to achieve higher final grades. The color grouping hints that higher alcohol consumption may be associated with lower performance, although neither pattern is statistically significant in the regression model reported in this analysis.
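
Because study time takes only four values, many points in the scatter plot overlap, and a table of group means can make the trend easier to read. The sketch below assumes the df_clean object from this report. The helper name group_means is hypothetical, and no output is shown because it has not been run on the data.

```r
# Hypothetical helper: count and mean of an outcome within each level of a
# grouping variable, using base R only.
group_means <- function(data, outcome, group) {
  means <- tapply(data[[outcome]], data[[group]], mean)
  counts <- tapply(data[[outcome]], data[[group]], length)
  data.frame(
    level = names(means),
    n = as.vector(counts),
    mean = as.vector(means),
    row.names = NULL
  )
}

# Applied only if the cleaned data from this report is in the session.
if (exists("df_clean")) {
  print(group_means(df_clean, "G3", "studytime"))
  print(group_means(df_clean, "G3", "Walc"))
}
```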

Boxplot

ggplot(df_clean, aes(x = sex, y = G3)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Final Grades by Gender",
    x = "Gender",
    y = "Final Grade (G3)"
  ) +
  theme_minimal()

The boxplot shows the difference in final grades across gender, providing insight into potential variation in performance.
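
A boxplot shows the difference visually but does not say whether it is larger than chance variation. One possible follow-up, sketched here under the assumption that df_clean is available and that sex has exactly two levels, is a Welch two-sample t-test from base R. The helper name compare_groups is hypothetical, and the test is an addition for illustration rather than part of the original analysis.

```r
# Hypothetical helper: group medians plus a Welch two-sample t-test for a
# numeric outcome split by a two-level grouping variable.
compare_groups <- function(data, outcome, group) {
  f <- reformulate(group, response = outcome)  # builds e.g. G3 ~ sex
  list(
    medians = tapply(data[[outcome]], data[[group]], median),
    welch = t.test(f, data = data)             # unequal variances by default
  )
}

# Applied only if the cleaned data from this report is in the session.
if (exists("df_clean")) {
  print(compare_groups(df_clean, "G3", "sex"))
}
```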

Regression Model

# A multiple linear regression model was built to examine the relationship between study habits and students’ final grades.

model <- lm(G3 ~ studytime + absences + failures + Walc + Medu, data = df_clean)

summary(model)

Call:
lm(formula = G3 ~ studytime + absences + failures + Walc + Medu, 
    data = df_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0145  -1.9988   0.3402   3.0205   8.1631 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.97283    0.96864   9.263  < 2e-16 ***
studytime    0.19821    0.26654   0.744  0.45755    
absences     0.02483    0.02723   0.912  0.36252    
failures    -2.00738    0.30238  -6.639 1.07e-10 ***
Walc         0.01392    0.17401   0.080  0.93630    
Medu         0.55872    0.20279   2.755  0.00614 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.248 on 389 degrees of freedom
Multiple R-squared:  0.1513,    Adjusted R-squared:  0.1404 
F-statistic: 13.87 on 5 and 389 DF,  p-value: 1.773e-12
# The regression model can be written as:

# G3 = β₀ + β₁(studytime) + β₂(absences) + β₃(failures) + β₄(Walc) + β₅(Medu) + ε
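
To make the equation concrete, the sketch below plugs the coefficient estimates printed in the summary output into it for one hypothetical student. The student's values are set to the sample medians from summary(df_clean) and are an illustrative assumption, not a real observation.

```r
# Coefficients copied from the summary(model) output in this report.
coefs <- c(
  intercept = 8.97283, studytime = 0.19821, absences = 0.02483,
  failures = -2.00738, Walc = 0.01392, Medu = 0.55872
)

# Hypothetical student: each predictor set to its sample median.
student <- c(
  intercept = 1, studytime = 2, absences = 4,
  failures = 0, Walc = 2, Medu = 3
)

# Predicted G3 is the sum of coefficient times predictor value.
predicted <- sum(coefs * student[names(coefs)])

# The same student with one past failure: shift by the failures coefficient.
predicted_one_failure <- predicted + coefs[["failures"]]

round(c(no_failures = predicted, one_failure = predicted_one_failure), 2)
```

Under these assumptions the predicted final grade is about 11.17 with no past failures and about 9.17 with one, which illustrates the roughly two-point drop per failure.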

Interpretation of Regression model

A multiple linear regression model was used to examine how study habits and lifestyle factors influence students’ final grades (G3). The model identifies past academic performance and parental education as the most meaningful predictors of student outcomes.

Among all variables, past failures have the strongest impact on final grades. The coefficient for failures is approximately -2.01, meaning that each additional past failure is associated with an average decrease of about 2 points in the final grade, holding other variables constant. This effect is also highly statistically significant (p < 0.001), indicating a strong and reliable relationship.

Mother’s education (Medu) also shows a statistically significant positive effect (p < 0.01). The coefficient of approximately +0.56 means that each one-level increase in a mother’s education is associated with an average increase of about 0.56 points in the final grade, holding other variables constant. This highlights the influence of socioeconomic and educational background on academic outcomes.

In contrast, study time, absences, and weekend alcohol consumption (Walc) do not appear to be statistically significant predictors in this model, as their p-values are relatively high. Although the scatter plot suggests a slight positive relationship between study time and final grades, this effect does not remain significant when controlling for other variables, indicating that study time alone may not strongly predict performance.
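
The significance statements above can be checked directly from the estimates and standard errors in the summary output. The sketch below recomputes the t values and approximate 95% confidence intervals for three of the coefficients, using the 389 residual degrees of freedom reported by the model. The intervals are an added illustration, not part of the original output.

```r
# Estimates and standard errors copied from the summary(model) output.
estimate <- c(failures = -2.00738, Medu = 0.55872, studytime = 0.19821)
std_error <- c(failures = 0.30238, Medu = 0.20279, studytime = 0.26654)
df_resid <- 389

# t value = estimate / standard error.
t_value <- estimate / std_error

# 95% interval = estimate +/- critical t value * standard error.
crit <- qt(0.975, df = df_resid)
ci <- cbind(
  lower = estimate - crit * std_error,
  upper = estimate + crit * std_error
)

round(cbind(t_value, ci), 3)
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level, which is the case for failures and Medu but not for studytime.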

The model has an adjusted R² of approximately 0.14, meaning that about 14% of the variation in final grades is explained by the variables included in the model. While this indicates that the model captures some important factors, a large portion of variation remains unexplained, suggesting that other variables not included in this analysis may also play a significant role in student performance.
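
The adjusted R² can be reproduced from the multiple R² with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A short check using the values reported in this analysis:

```r
# Values taken from this report: multiple R-squared, rows, and predictors.
r_squared <- 0.1513
n <- 395
p <- 5

# Adjusted R-squared penalizes the fit for the number of predictors used.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)
```

The result rounds to 0.1404, matching the summary output. The small gap between 0.1513 and 0.1404 reflects the penalty for using five predictors.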

Diagnostic Plot

plot(model, which = 1)

  • The Residuals vs Fitted plot shows that residuals are mostly centered around zero, indicating that the linearity assumption is reasonably satisfied. However, a slight curve suggests minor non-linearity in the model.
plot(model, which = 2)

  • The Normal Q-Q plot shows that most points lie close to the diagonal line, suggesting that the residuals are approximately normally distributed, with slight deviations at the extremes.
plot(model, which = 3)

  • The Scale-Location plot indicates that the spread of residuals is relatively constant, suggesting homoscedasticity, although minor variations are present.
plot(model, which = 5)

  • The Residuals vs Leverage plot shows that most observations have low leverage, with a few moderately influential points, but no extreme outliers that significantly affect the model.
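
The visual reading of influential points can be complemented with Cook's distance. The sketch below assumes the fitted model object from this report and uses the common rule-of-thumb cutoff of 4/n, which is a convention rather than a strict test. The helper name flag_influential is hypothetical.

```r
# Hypothetical helper: return the observations whose Cook's distance
# exceeds the rule-of-thumb cutoff 4 / n.
flag_influential <- function(fit) {
  d <- cooks.distance(fit)
  cutoff <- 4 / length(d)
  d[d > cutoff]
}

# Applied only if the fitted model from this report is in the session.
if (exists("model") && inherits(model, "lm")) {
  print(flag_influential(model))
}
```

Refitting the model without any flagged observations and comparing the coefficients is one way to see whether they change the conclusions.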

Works Cited

UCI Student Performance Dataset (Kaggle)

Course materials and notes

ChatGPT for explanation