This project analyzes the Student Performance dataset from the UCI Machine Learning Repository. The dataset contains academic, social, and lifestyle variables related to secondary school students. The purpose of this project is to examine how study habits, alcohol consumption, absences, and family background influence students’ final academic performance. Multiple visualizations and a multiple linear regression model were used to explore patterns within the data.
Research Questions
Research Question: Which lifestyle and social factors are the strongest predictors of student academic performance?
library(tidyverse)
library(ggplot2)
library(plotly)
library(GGally)
library(dplyr)
df <- read.csv("student_data.csv")
head(df)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
1 GP F 18 U GT3 A 4 4 at_home teacher course
2 GP F 17 U GT3 T 1 1 at_home other course
3 GP F 15 U LE3 T 1 1 at_home other other
4 GP F 15 U GT3 T 4 2 health services home
5 GP F 16 U GT3 T 3 3 other other home
6 GP M 16 U LE3 T 4 3 services other reputation
guardian traveltime studytime failures schoolsup famsup paid activities
1 mother 2 2 0 yes no no no
2 father 1 2 0 no yes no no
3 mother 1 2 3 yes no yes no
4 mother 1 3 0 no yes yes yes
5 father 1 2 0 no yes yes no
6 mother 1 2 0 no yes yes yes
nursery higher internet romantic famrel freetime goout Dalc Walc health
1 yes yes no no 4 3 4 1 1 3
2 no yes yes no 5 3 3 1 1 3
3 yes yes yes no 4 3 2 2 3 3
4 yes yes yes yes 3 2 2 1 1 5
5 yes yes no no 4 3 2 1 2 5
6 yes yes yes no 5 4 2 1 2 5
absences G1 G2 G3
1 6 5 6 6
2 4 5 5 6
3 10 7 8 10
4 2 15 14 15
5 4 6 10 10
6 10 15 15 15
summary(df)
school sex age address
Length:395 Length:395 Min. :15.0 Length:395
Class :character Class :character 1st Qu.:16.0 Class :character
Mode :character Mode :character Median :17.0 Mode :character
Mean :16.7
3rd Qu.:18.0
Max. :22.0
famsize Pstatus Medu Fedu
Length:395 Length:395 Min. :0.000 Min. :0.000
Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
Mode :character Mode :character Median :3.000 Median :2.000
Mean :2.749 Mean :2.522
3rd Qu.:4.000 3rd Qu.:3.000
Max. :4.000 Max. :4.000
Mjob Fjob reason guardian
Length:395 Length:395 Length:395 Length:395
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
traveltime studytime failures schoolsup
Min. :1.000 Min. :1.000 Min. :0.0000 Length:395
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
Median :1.000 Median :2.000 Median :0.0000 Mode :character
Mean :1.448 Mean :2.035 Mean :0.3342
3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
Max. :4.000 Max. :4.000 Max. :3.0000
famsup paid activities nursery
Length:395 Length:395 Length:395 Length:395
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
higher internet romantic famrel
Length:395 Length:395 Length:395 Min. :1.000
Class :character Class :character Class :character 1st Qu.:4.000
Mode :character Mode :character Mode :character Median :4.000
Mean :3.944
3rd Qu.:5.000
Max. :5.000
freetime goout Dalc Walc
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
Median :3.000 Median :3.000 Median :1.000 Median :2.000
Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
health absences G1 G2
Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
Median :4.000 Median : 4.000 Median :11.00 Median :11.00
Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
G3
Min. : 0.00
1st Qu.: 8.00
Median :11.00
Mean :10.42
3rd Qu.:14.00
Max. :20.00
Background Research
Previous research has shown that study habits, attendance, parental education, and alcohol consumption can influence student academic performance. Students with higher study time and stronger family educational support often perform better academically, while higher alcohol consumption and school absences are associated with lower grades and increased academic difficulties. These factors are important because they help explain how both lifestyle behaviors and social environments contribute to student success in school. This research supports the purpose of this project, which is to analyze how academic, family, and behavioral variables affect final student grades using statistical analysis and data visualization.
APA Citation
UCI Machine Learning Repository
Cortez, P., & Silva, A. (2008). Using data mining to predict secondary school student performance. UCI Machine Learning Repository. Retrieved from https://archive.ics.uci.edu/ml/datasets/student+performance
Variables Used
Response Variable
G3 (Final Grade)
Quantitative Variables
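studytime (weekly study time, coded 1 to 4)
absences (number of school absences)
failures (number of past class failures, 0 to 3)
Walc (weekend alcohol consumption, 1 = very low to 5 = very high)
Dalc (workday alcohol consumption, 1 = very low to 5 = very high)
Medu (mother's education, coded 0 to 4)
Fedu (father's education, coded 0 to 4)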
Categorical Variables
sex
school
df_clean <- df %>%
  drop_na()
sum(is.na(df_clean))
[1] 0
Data Cleaning
The dataset was cleaned before analysis by removing rows with missing values using drop_na(); the follow-up check sum(is.na(df_clean)) returned 0, confirming that no missing values remain. Variables relevant to the research question were selected, and categorical variables were converted into factors to improve visualization and regression analysis. The dataset was also checked for consistency and variable types before modeling.
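The variable selection and factor conversion described above could look roughly like the sketch below. It is illustrative only: `toy` is a hypothetical four-row stand-in that borrows the dataset's column names, and `toy_model` is a name introduced here, not an object from the original analysis.

```r
library(dplyr)

# Hypothetical stand-in rows using the dataset's column names;
# in the report itself this would be df_clean.
toy <- data.frame(
  sex    = c("F", "M", "F", "M"),
  school = c("GP", "GP", "MS", "MS"),
  Walc   = c(1, 3, 2, 5),
  G3     = c(12, 9, 15, 8),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest, then convert every
# character column to a factor in one step.
toy_model <- toy %>%
  select(G3, Walc, sex, school) %>%
  mutate(across(where(is.character), as.factor))

# Inspect the resulting column types.
str(toy_model)
```

Converting with `across(where(is.character), as.factor)` avoids listing each categorical column by hand, which may be convenient given how many yes/no columns the full dataset contains.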
# Average final grade by sex
df_clean %>%
  group_by(sex) %>%
  summarise(avg_grade = mean(G3))
# A tibble: 2 × 2
sex avg_grade
<chr> <dbl>
1 F 9.97
2 M 10.9
# Filtering students with high absences
high_absence <- df_clean %>%
  filter(absences > 10)
head(high_absence)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
1 GP M 17 U GT3 T 3 2 services services course
2 GP F 16 U GT3 T 2 2 services services home
3 GP M 16 U GT3 T 4 4 teacher teacher home
4 GP F 16 U LE3 T 2 2 other other home
5 GP F 16 U LE3 T 2 2 other at_home course
6 GP F 16 U LE3 A 3 3 other services home
guardian traveltime studytime failures schoolsup famsup paid activities
1 mother 1 1 3 no yes no yes
2 mother 1 1 2 no yes yes no
3 mother 1 2 0 no yes yes yes
4 mother 2 2 1 no yes no yes
5 father 2 2 1 yes no no yes
6 mother 1 2 0 no yes no no
nursery higher internet romantic famrel freetime goout Dalc Walc health
1 yes yes yes no 5 5 5 2 4 5
2 no yes yes no 1 2 2 1 3 5
3 yes yes yes yes 4 4 5 5 5 5
4 no yes yes yes 3 3 3 1 2 3
5 yes yes yes no 4 3 3 2 2 5
6 yes yes yes no 2 3 5 1 4 3
absences G1 G2 G3
1 16 6 5 5
2 14 6 9 8
3 16 10 12 11
4 25 7 10 11
5 14 10 10 9
6 12 11 12 11
The correlation heatmap shows relationships between academic, lifestyle, and family-related variables in the dataset. Final grades (G3) show a negative relationship with absences and past failures, while study time and parental education show slight positive relationships with academic performance. The visualization helps identify patterns and relationships between variables before regression analysis.
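The code behind the heatmap is not shown in this report, so the following is only one possible way to build such a plot, not necessarily the method used here. It computes Pearson correlations with base R and draws them with ggplot2. The data frame `toy` is simulated (it reuses the dataset's column names but contains no real student records), and the coefficients used to generate `G3` are arbitrary assumptions.

```r
library(ggplot2)

# Simulated stand-in data with the report's column names.
# These values are NOT the real student records.
set.seed(1)
n <- 100
toy <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  failures  = sample(0:3, n, replace = TRUE, prob = c(0.7, 0.15, 0.1, 0.05)),
  absences  = rpois(n, 5),
  Medu      = sample(0:4, n, replace = TRUE)
)
# Arbitrary, assumed relationship used only to produce a plausible G3.
toy$G3 <- 11 + 0.5 * toy$studytime - 2 * toy$failures + rnorm(n, sd = 3)
toy$G3 <- round(pmin(20, pmax(0, toy$G3)))

# Pairwise Pearson correlations between the numeric columns.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Reshape the matrix to long format: one row per pair of variables.
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("var1", "var2", "r")

heatmap_plot <- ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(r, 2)), size = 3) +
  scale_fill_gradient2(
    low = "steelblue", mid = "white", high = "firebrick",
    midpoint = 0, limits = c(-1, 1)
  ) +
  labs(title = "Correlation Heatmap (simulated data)",
       x = NULL, y = NULL, fill = "r") +
  theme_minimal()

heatmap_plot
```

With the real data, `toy` would be replaced by the numeric columns of `df_clean`. A diverging color scale centered at zero makes negative correlations (such as G3 with failures) easy to tell apart from positive ones.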
The violin plot shows the distribution of final grades across different levels of weekend alcohol consumption. Students with lower alcohol consumption generally show higher and more concentrated grade distributions, while higher alcohol consumption levels display greater variation and slightly lower academic performance.
ggplot(df_clean, aes(x = as.factor(Walc), y = G3, fill = as.factor(Walc))) +
  geom_violin(trim = FALSE) +
  labs(
    title = "Distribution of Final Grades by Weekend Alcohol Consumption",
    x = "Weekend Alcohol Consumption",
    y = "Final Grade (G3)",
    fill = "Walc",
    caption = "Source: UCI Student Performance Dataset"
  ) +
  theme_minimal()
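A numeric summary can back up what the violin plot suggests visually. The sketch below reports the group size, median, and interquartile range of G3 at each Walc level. It runs on simulated data (`toy` is a hypothetical stand-in, and the downward trend built into it is an assumption), so the numbers it prints say nothing about the real students.

```r
library(dplyr)

# Simulated stand-in data; the negative slope on Walc is an assumption
# made only so the example has some structure to summarise.
set.seed(3)
toy <- data.frame(Walc = sample(1:5, 120, replace = TRUE))
toy$G3 <- round(pmin(20, pmax(0, 12 - 0.4 * toy$Walc + rnorm(120, sd = 3))))

# Center (median) and spread (IQR) of final grades per alcohol level.
walc_summary <- toy %>%
  group_by(Walc) %>%
  summarise(
    count     = n(),
    median_G3 = median(G3),
    iqr_G3    = IQR(G3),
    .groups   = "drop"
  )

walc_summary
```

Applied to `df_clean`, a smaller IQR at low Walc levels would correspond to the "more concentrated" distributions described above, and the `count` column shows how many students each violin is based on.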
The interactive Tableau dashboard allows users to explore relationships between study habits, absences, alcohol consumption, and final grades in a more dynamic way. Users can interact with filters and visual elements to examine how different variables influence student academic performance.
The dashboard includes visualizations comparing study time and final grades, as well as absences and academic performance. Interactive filters such as sex and alcohol consumption levels allow users to customize the analysis and identify trends within specific student groups.
This interactive component helps provide a clearer understanding of patterns in the dataset and supports the findings from the regression analysis and static visualizations.
Interactive Visualization
An interactive visualization was created in RStudio using the plotly package to explore the relationship between study time and final grades. The interactive plot allows users to hover over data points and visually examine patterns between academic performance and study habits.
Color was used to represent different levels of weekend alcohol consumption, helping identify how lifestyle factors may relate to student grades. The visualization provides a more engaging way to analyze the dataset compared to static graphs and supports the overall findings of the project.
p <- ggplot(df_clean, aes(x = studytime, y = G3, color = as.factor(Walc))) +
  geom_point(size = 2) +
  labs(
    title = "Interactive Study Time vs Final Grade",
    x = "Study Time",
    y = "Final Grade"
  ) +
  theme_minimal()
ggplotly(p)
model <- lm(G3 ~ studytime + absences + failures + Walc + Dalc + Medu + Fedu, data = df_clean)
summary(model)
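Because the research question asks which predictors are strongest, it may help to put the coefficients on a common scale: raw slopes are in each variable's own units (for example, absences are counted in days while studytime is a 1 to 4 code), so they are not directly comparable. One common approach, sketched below on simulated data, is to refit the model on z-scored variables. The data frame `toy`, the reduced three-predictor formula, and the generating coefficients are all assumptions for illustration; this is not the report's fitted model.

```r
# Simulated stand-in data; generating coefficients are arbitrary assumptions.
set.seed(42)
n <- 200
toy <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  absences  = rpois(n, 5),
  failures  = sample(0:3, n, replace = TRUE, prob = c(0.7, 0.15, 0.1, 0.05))
)
toy$G3 <- 11 + 0.5 * toy$studytime - 0.05 * toy$absences -
  2 * toy$failures + rnorm(n, sd = 3)

# Model on the original scale: slopes are in grade points per unit.
raw_model <- lm(G3 ~ studytime + absences + failures, data = toy)

# Refit on z-scored variables so every slope is expressed in
# standard deviations of G3 per standard deviation of the predictor.
toy_z <- as.data.frame(scale(toy))
std_model <- lm(G3 ~ studytime + absences + failures, data = toy_z)

# Rank predictors by absolute standardized coefficient (intercept dropped).
std_coefs <- coef(std_model)[-1]
sort(abs(std_coefs), decreasing = TRUE)
```

With the real data, the same idea would apply to `df_clean` and the full seven-predictor formula. Standardized coefficients are only a rough guide for ordinal codes such as Walc or Medu, so the ranking is best read alongside the p-values from `summary(model)`.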
Regression diagnostic plots were used to evaluate the assumptions of the multiple linear regression model. These plots help identify issues such as non-linearity, unequal variance, outliers, and whether the residuals are approximately normally distributed.
The Residuals vs Fitted plot was used to check for patterns in the residuals, while the Q-Q plot was used to examine the normality of residuals. Overall, the diagnostic plots suggest that the regression model is reasonably appropriate for analyzing the relationship between the selected variables and final student grades.
par(mfrow = c(2, 2))
plot(model)
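The visual checks above can be supplemented with a couple of numeric ones from base R. The sketch below runs a Shapiro-Wilk test on the residuals and flags potentially influential observations with Cook's distance. It uses a simulated model (`toy` and `toy_model` are hypothetical stand-ins), and the 4/n cutoff is a common rule of thumb rather than a strict threshold.

```r
# Simulated stand-in data and model; not the report's fitted model.
set.seed(7)
n <- 150
toy <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  failures  = sample(0:3, n, replace = TRUE, prob = c(0.7, 0.15, 0.1, 0.05))
)
toy$G3 <- 11 + 0.5 * toy$studytime - 2 * toy$failures + rnorm(n, sd = 3)

toy_model <- lm(G3 ~ studytime + failures, data = toy)
res <- residuals(toy_model)

# Shapiro-Wilk test: a small p-value suggests the residuals
# depart from a normal distribution.
shapiro_result <- shapiro.test(res)

# Cook's distance measures each observation's influence on the fit;
# 4/n is a rule-of-thumb cutoff, used here as an assumption.
cooks <- cooks.distance(toy_model)
influential <- which(cooks > 4 / n)

shapiro_result$p.value
influential
```

For the real model, `toy_model` would be replaced by `model`. Since G3 in the dataset includes a number of zero grades, a formal normality test may well reject even when the Q-Q plot looks acceptable over most of its range, so the two are best read together.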
Conclusion
This project analyzed factors affecting student academic performance using the UCI Student Performance dataset. Multiple visualizations and a multiple linear regression model were used to examine relationships between study time, absences, alcohol consumption, parental education, and final grades.
The regression analysis showed that previous class failures had the strongest negative relationship with final grades, while parental education showed a small positive relationship. Visualizations also suggested that higher alcohol consumption and more absences are associated with lower academic performance, although these observational data cannot establish that either one causes lower grades.
The interactive Tableau dashboard provided additional exploration of relationships between academic and lifestyle variables. Overall, the project demonstrates how statistical analysis and visualization techniques can be used to better understand student performance patterns.