Factors Affecting Student Academic Performance

This project analyzes the Student Performance dataset from the UCI Machine Learning Repository. The dataset used here contains 33 academic, social, and lifestyle variables for 395 secondary school students. The purpose of this project is to examine how study habits, alcohol consumption, absences, and family background relate to students' final grade (G3). Several visualizations and a multiple linear regression model were used to explore patterns within the data.

Research Questions

  • Which lifestyle and social factors are the strongest predictors of student academic performance?
library(tidyverse)
library(ggplot2)
library(plotly)

library(GGally)
library(dplyr)
df <- read.csv("student_data.csv")
head(df)
  school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
2     GP   F  17       U     GT3       T    1    1  at_home    other     course
3     GP   F  15       U     LE3       T    1    1  at_home    other      other
4     GP   F  15       U     GT3       T    4    2   health services       home
5     GP   F  16       U     GT3       T    3    3    other    other       home
6     GP   M  16       U     LE3       T    4    3 services    other reputation
  guardian traveltime studytime failures schoolsup famsup paid activities
1   mother          2         2        0       yes     no   no         no
2   father          1         2        0        no    yes   no         no
3   mother          1         2        3       yes     no  yes         no
4   mother          1         3        0        no    yes  yes        yes
5   father          1         2        0        no    yes  yes         no
6   mother          1         2        0        no    yes  yes        yes
  nursery higher internet romantic famrel freetime goout Dalc Walc health
1     yes    yes       no       no      4        3     4    1    1      3
2      no    yes      yes       no      5        3     3    1    1      3
3     yes    yes      yes       no      4        3     2    2    3      3
4     yes    yes      yes      yes      3        2     2    1    1      5
5     yes    yes       no       no      4        3     2    1    2      5
6     yes    yes      yes       no      5        4     2    1    2      5
  absences G1 G2 G3
1        6  5  6  6
2        4  5  5  6
3       10  7  8 10
4        2 15 14 15
5        4  6 10 10
6       10 15 15 15
str(df)
'data.frame':   395 obs. of  33 variables:
 $ school    : chr  "GP" "GP" "GP" "GP" ...
 $ sex       : chr  "F" "F" "F" "F" ...
 $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
 $ address   : chr  "U" "U" "U" "U" ...
 $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
 $ Pstatus   : chr  "A" "T" "T" "T" ...
 $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
 $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
 $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
 $ Fjob      : chr  "teacher" "other" "other" "services" ...
 $ reason    : chr  "course" "course" "other" "home" ...
 $ guardian  : chr  "mother" "father" "mother" "mother" ...
 $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
 $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
 $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
 $ schoolsup : chr  "yes" "no" "yes" "no" ...
 $ famsup    : chr  "no" "yes" "no" "yes" ...
 $ paid      : chr  "no" "no" "yes" "yes" ...
 $ activities: chr  "no" "no" "no" "yes" ...
 $ nursery   : chr  "yes" "no" "yes" "yes" ...
 $ higher    : chr  "yes" "yes" "yes" "yes" ...
 $ internet  : chr  "no" "yes" "yes" "yes" ...
 $ romantic  : chr  "no" "no" "no" "yes" ...
 $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
 $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
 $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
 $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
 $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
 $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
 $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
 $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
 $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
 $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
summary(df)
    school              sex                 age         address         
 Length:395         Length:395         Min.   :15.0   Length:395        
 Class :character   Class :character   1st Qu.:16.0   Class :character  
 Mode  :character   Mode  :character   Median :17.0   Mode  :character  
                                       Mean   :16.7                     
                                       3rd Qu.:18.0                     
                                       Max.   :22.0                     
   famsize            Pstatus               Medu            Fedu      
 Length:395         Length:395         Min.   :0.000   Min.   :0.000  
 Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
 Mode  :character   Mode  :character   Median :3.000   Median :2.000  
                                       Mean   :2.749   Mean   :2.522  
                                       3rd Qu.:4.000   3rd Qu.:3.000  
                                       Max.   :4.000   Max.   :4.000  
     Mjob               Fjob              reason            guardian        
 Length:395         Length:395         Length:395         Length:395        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   traveltime      studytime        failures       schoolsup        
 Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
 Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
 Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
 Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
    famsup              paid            activities          nursery         
 Length:395         Length:395         Length:395         Length:395        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    higher            internet           romantic             famrel     
 Length:395         Length:395         Length:395         Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:4.000  
 Mode  :character   Mode  :character   Mode  :character   Median :4.000  
                                                          Mean   :3.944  
                                                          3rd Qu.:5.000  
                                                          Max.   :5.000  
    freetime         goout            Dalc            Walc      
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
 Median :3.000   Median :3.000   Median :1.000   Median :2.000  
 Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
 3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
     health         absences            G1              G2       
 Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
 1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
 Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
 Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
 3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
 Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
       G3       
 Min.   : 0.00  
 1st Qu.: 8.00  
 Median :11.00  
 Mean   :10.42  
 3rd Qu.:14.00  
 Max.   :20.00  

Background Research

Previous research has shown that study habits, attendance, parental education, and alcohol consumption can influence student academic performance. Students with higher study time and stronger family educational support often perform better academically, while higher alcohol consumption and school absences are associated with lower grades and increased academic difficulties. These factors are important because they help explain how both lifestyle behaviors and social environments contribute to student success in school. This research supports the purpose of this project, which is to analyze how academic, family, and behavioral variables affect final student grades using statistical analysis and data visualization.

APA Citation

UCI Machine Learning Repository

Cortez, P., & Silva, A. (2008). Using data mining to predict secondary school student performance. UCI Machine Learning Repository. Retrieved from https://archive.ics.uci.edu/ml/datasets/student+performance

Variables Used

Response Variable

  • G3 (Final Grade)

Quantitative Variables

  • studytime

  • absences

  • failures

  • Walc

  • Dalc

  • Medu

  • Fedu

Categorical Variables

  • sex

  • school

df_clean <- df %>%
  drop_na()
sum(is.na(df_clean))
[1] 0

Data Cleaning

  • The dataset was checked for missing values before analysis. drop_na() removed no rows, so all 395 observations were retained. Variables relevant to the research question were then selected, and the categorical variables sex, school, romantic, and higher were converted to factors for visualization and regression analysis. Variable types were checked with str() before modeling.
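
The missing-value check described above can be made explicit by counting missing values per column before dropping anything. This is a minimal sketch in base R, assuming a small made-up data frame (`toy`) in place of `student_data.csv`; the values are illustrative only, and `complete.cases()` is used here as a base R stand-in for `drop_na()`.

```r
# Toy stand-in for the student data (values are made up for illustration)
toy <- data.frame(
  G3       = c(10, 12, NA, 15),
  absences = c(4, 0, 6, 2),
  sex      = c("F", "M", "F", NA)
)

# Count missing values per column before deciding how to handle them
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only complete rows and report how many rows were removed
toy_clean <- toy[complete.cases(toy), ]
rows_removed <- nrow(toy) - nrow(toy_clean)
print(rows_removed)
```

Reporting the number of removed rows makes it clear whether the cleaning step changed the sample at all.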
# Selecting relevant variables
numeric_data <- df_clean %>%
  select(G3, studytime, absences, failures, Walc, Dalc, Medu, Fedu)
# Average final grade by sex
df_clean %>%
  group_by(sex) %>%
  summarise(avg_grade = mean(G3))
# A tibble: 2 × 2
  sex   avg_grade
  <chr>     <dbl>
1 F          9.97
2 M         10.9 
# Filtering students with high absences
high_absence <- df_clean %>%
  filter(absences > 10)

head(high_absence)
  school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob reason
1     GP   M  17       U     GT3       T    3    2 services services course
2     GP   F  16       U     GT3       T    2    2 services services   home
3     GP   M  16       U     GT3       T    4    4  teacher  teacher   home
4     GP   F  16       U     LE3       T    2    2    other    other   home
5     GP   F  16       U     LE3       T    2    2    other  at_home course
6     GP   F  16       U     LE3       A    3    3    other services   home
  guardian traveltime studytime failures schoolsup famsup paid activities
1   mother          1         1        3        no    yes   no        yes
2   mother          1         1        2        no    yes  yes         no
3   mother          1         2        0        no    yes  yes        yes
4   mother          2         2        1        no    yes   no        yes
5   father          2         2        1       yes     no   no        yes
6   mother          1         2        0        no    yes   no         no
  nursery higher internet romantic famrel freetime goout Dalc Walc health
1     yes    yes      yes       no      5        5     5    2    4      5
2      no    yes      yes       no      1        2     2    1    3      5
3     yes    yes      yes      yes      4        4     5    5    5      5
4      no    yes      yes      yes      3        3     3    1    2      3
5     yes    yes      yes       no      4        3     3    2    2      5
6     yes    yes      yes       no      2        3     5    1    4      3
  absences G1 G2 G3
1       16  6  5  5
2       14  6  9  8
3       16 10 12 11
4       25  7 10 11
5       14 10 10  9
6       12 11 12 11
df_clean$sex <- as.factor(df_clean$sex)
df_clean$school <- as.factor(df_clean$school)
df_clean$romantic <- as.factor(df_clean$romantic)
df_clean$higher <- as.factor(df_clean$higher)
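
The four conversions above handle one column at a time. As an alternative sketch, assuming a made-up data frame (`toy`) rather than the real data, `mutate()` with `across()` can convert every character column to a factor in one step. Whether all character columns should become factors is a judgment call for the analysis, not something the original code requires.

```r
library(dplyr)

# Toy stand-in for the student data (values are made up for illustration)
toy <- data.frame(
  sex    = c("F", "M", "F"),
  school = c("GP", "MS", "GP"),
  G3     = c(10, 12, 15)
)

# Convert every character column to a factor; numeric columns are left alone
toy <- toy %>%
  mutate(across(where(is.character), as.factor))

str(toy)
```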

Heatmap

  • The correlation heatmap shows relationships between academic, lifestyle, and family-related variables in the dataset. Final grades (G3) show a negative relationship with absences and past failures, while study time and parental education show slight positive relationships with academic performance. The visualization helps identify patterns and relationships between variables before regression analysis.
numeric_data <- df_clean %>%
  select(G3, studytime, absences, failures, Walc, Dalc, Medu, Fedu)

cor_matrix <- cor(numeric_data)

cor_data <- as.data.frame(as.table(cor_matrix))

ggplot(cor_data, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = round(Freq, 2)), color = "white", size = 4) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1,1)) +
  labs(
    title = "Correlation Heatmap of Student Performance Variables",
    x = "",
    y = "",
    fill = "Correlation",
    caption = "Source: UCI Student Performance Dataset"
  ) +
  theme_minimal()
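
The heatmap is easier to read alongside a ranked list of each variable's correlation with G3. The following is a minimal sketch, assuming a small made-up data frame (`toy`) with a few of the same column names; the numbers are illustrative and are not results from the student data.

```r
# Toy numeric data (made-up values, not the real dataset)
toy <- data.frame(
  G3        = c(6, 10, 15, 11, 8, 14),
  failures  = c(3, 1, 0, 0, 2, 0),
  studytime = c(1, 2, 3, 2, 1, 4),
  absences  = c(10, 4, 2, 6, 0, 8)
)

cor_matrix <- cor(toy)

# Take the G3 row and drop the self-correlation
g3_cor <- cor_matrix["G3", colnames(cor_matrix) != "G3"]

# Sort by absolute size so the strongest relationships come first
g3_sorted <- g3_cor[order(abs(g3_cor), decreasing = TRUE)]
print(round(g3_sorted, 2))
```

Applied to `numeric_data`, the same three steps would list the predictors in order of their pairwise correlation with the final grade.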

Violin Plot

  • The violin plot shows the distribution of final grades across different levels of weekend alcohol consumption. Students with lower alcohol consumption generally show higher and more concentrated grade distributions, while higher alcohol consumption levels display greater variation and slightly lower academic performance.
ggplot(df_clean, aes(x = as.factor(Walc), y = G3, fill = as.factor(Walc))) +
  geom_violin(trim = FALSE) +
  labs(
    title = "Distribution of Final Grades by Weekend Alcohol Consumption",
    x = "Weekend Alcohol Consumption",
    y = "Final Grade (G3)",
    fill = "Walc",
    caption = "Source: UCI Student Performance Dataset"
  ) +
  theme_minimal()
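
The shapes in a violin plot can be backed up with per-group summary statistics. This sketch assumes a made-up data frame (`toy`) with `Walc` and `G3` columns; the values are invented for illustration and do not come from the student data.

```r
library(dplyr)

# Toy data: weekend alcohol level and final grade (made-up values)
toy <- data.frame(
  Walc = c(1, 1, 1, 2, 2, 3, 3, 3),
  G3   = c(12, 14, 10, 11, 9, 8, 13, 0)
)

# Group size, median, and interquartile range of G3 for each Walc level
grade_by_walc <- toy %>%
  group_by(Walc) %>%
  summarise(
    n         = n(),
    median_G3 = median(G3),
    iqr_G3    = IQR(G3),
    .groups   = "drop"
  )

print(grade_by_walc)
```

Including the group size matters because the widest violins can come from the smallest groups.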

Interactive Visualization

An interactive visualization was created in RStudio using the plotly package to explore the relationship between study time and final grades. The interactive plot allows users to hover over data points and visually examine patterns between academic performance and study habits.

Color was used to represent different levels of weekend alcohol consumption, helping identify how lifestyle factors may relate to student grades. The visualization provides a more engaging way to analyze the dataset compared to static graphs and supports the overall findings of the project.

p <- ggplot(df_clean, aes(x = studytime, y = G3, color = as.factor(Walc))) +
  geom_point(size = 2) +
  labs(
    title = "Interactive Study Time vs Final Grade",
    x = "Study Time",
    y = "Final Grade"
  ) +
  theme_minimal()

ggplotly(p)
model <- lm(G3 ~ studytime + absences + failures + Walc + Dalc + Medu + Fedu, data = df_clean)

summary(model)

Call:
lm(formula = G3 ~ studytime + absences + failures + Walc + Dalc + 
    Medu + Fedu, data = df_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0168  -2.0499   0.3525   3.0284   8.1268 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.12142    1.01771   8.963  < 2e-16 ***
studytime    0.18631    0.26823   0.695   0.4877    
absences     0.02456    0.02732   0.899   0.3694    
failures    -2.01597    0.30698  -6.567 1.65e-10 ***
Walc         0.06903    0.22333   0.309   0.7574    
Dalc        -0.12196    0.31798  -0.384   0.7015    
Medu         0.62212    0.25554   2.435   0.0154 *  
Fedu        -0.09514    0.25582  -0.372   0.7102    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.257 on 387 degrees of freedom
Multiple R-squared:  0.1519,    Adjusted R-squared:  0.1365 
F-statistic:   9.9 on 7 and 387 DF,  p-value: 2.226e-11
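
The t value and p-value columns in the summary above follow directly from the estimate and its standard error. As a worked check, the sketch below recomputes them for the failures row and adds a 95% confidence interval. The three input numbers are copied from the output; the variable names are illustrative.

```r
# Values copied from the "failures" row of the summary output
estimate  <- -2.01597
std_error <- 0.30698
resid_df  <- 387  # residual degrees of freedom reported by summary(model)

# t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# 95% confidence interval for the coefficient
t_crit <- qt(0.975, df = resid_df)
ci_95 <- c(lower = estimate - t_crit * std_error,
           upper = estimate + t_crit * std_error)

print(round(t_value, 3))
print(signif(p_value, 3))
print(round(ci_95, 2))
```

On the fitted object itself, `confint(model)` returns the same kind of interval for every coefficient.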
# Model: G3 = β0 + β1(studytime) + β2(absences) + β3(failures) + β4(Walc) + β5(Dalc) + β6(Medu) + β7(Fedu) + ε
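
The model equation can be evaluated by hand using the estimates from the regression output. The sketch below does this for one hypothetical student; the coefficient values are copied from the output, while the student's predictor values are assumptions chosen only to illustrate the arithmetic.

```r
# Coefficient estimates copied from summary(model)
coefs <- c(
  "(Intercept)" = 9.12142,
  studytime     = 0.18631,
  absences      = 0.02456,
  failures      = -2.01597,
  Walc          = 0.06903,
  Dalc          = -0.12196,
  Medu          = 0.62212,
  Fedu          = -0.09514
)

# Hypothetical student (illustrative values, not a row from the data)
student <- c(studytime = 2, absences = 4, failures = 0,
             Walc = 1, Dalc = 1, Medu = 4, Fedu = 4)

# Predicted G3 = intercept + sum of (coefficient * predictor value)
predicted_G3 <- coefs[["(Intercept)"]] + sum(coefs[names(student)] * student)
print(round(predicted_G3, 2))

# Same student with one past failure: the prediction shifts by the failures coefficient
student_one_failure <- replace(student, "failures", 1)
predicted_one_failure <- coefs[["(Intercept)"]] +
  sum(coefs[names(student_one_failure)] * student_one_failure)
print(round(predicted_one_failure - predicted_G3, 2))
```

With the fitted object available, `predict(model, newdata = ...)` gives the same result without copying coefficients by hand.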

Regression Diagnostic Plots

Regression diagnostic plots were used to evaluate the assumptions of the multiple linear regression model. These plots help identify issues such as non-linearity, unequal variance, outliers, and whether the residuals are approximately normally distributed.

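The Residuals vs Fitted plot was used to check for patterns in the residuals, while the Q-Q plot was used to examine the normality of residuals. The residual summary in the model output (minimum of -12.02 against a maximum of 8.13, with a median of 0.35) indicates some left skew, so the normality assumption should be treated as approximate. With that caveat, the diagnostic plots suggest that the regression model is reasonably appropriate for analyzing the relationship between the selected variables and final student grades.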

par(mfrow = c(2,2))
plot(model)
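
The visual checks can be complemented with numeric ones. This is a minimal sketch, assuming simulated data (`toy`) and a smaller model than the one fitted above. It runs a Shapiro-Wilk test on the residuals and flags observations whose Cook's distance exceeds 4/n, a common rule of thumb rather than a fixed standard.

```r
set.seed(42)

# Simulated data (not the student dataset): grades driven by two predictors plus noise
n <- 100
toy <- data.frame(
  failures = sample(0:3, n, replace = TRUE),
  Medu     = sample(0:4, n, replace = TRUE)
)
toy$G3 <- 10 - 2 * toy$failures + 0.6 * toy$Medu + rnorm(n, sd = 3)

toy_model <- lm(G3 ~ failures + Medu, data = toy)
res <- residuals(toy_model)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw$p.value)

# Cook's distance: flag observations above the 4/n rule of thumb
cook <- cooks.distance(toy_model)
influential <- which(cook > 4 / n)
print(influential)
```

The same calls work on `model` directly, replacing `toy_model` and using `nrow(df_clean)` for `n`.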

Conclusion

This project analyzed factors affecting student academic performance using the UCI Student Performance dataset. Multiple visualizations and a multiple linear regression model were used to examine relationships between study time, absences, alcohol consumption, parental education, and final grades.

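The regression analysis showed that the number of previous class failures had the strongest relationship with final grades: each additional failure was associated with a final grade about 2.0 points lower (p < 0.001), holding the other predictors constant. Mother's education (Medu) showed a small positive relationship of about 0.62 points per level (p = 0.015). Study time, absences, weekend and weekday alcohol consumption, and father's education were not statistically significant in this model, which explained about 15% of the variance in final grades (R-squared = 0.15). The visualizations suggested that higher alcohol consumption and absences may be associated with lower grades, but the regression did not confirm these relationships.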

The interactive plotly visualization provided additional exploration of relationships between academic and lifestyle variables. Overall, the project demonstrates how statistical analysis and visualization techniques can be used to better understand student performance patterns.