Parents and schools do not always know what factors help their children succeed in high school. Post secondary education relies heavily on high school performance as part of their acceptance criteria. In addition, many students just completely fail high school, which ultimately sets them up for increased financial hardship later in life.
This project will predict student performance (measured as a final grade) in Mathematics and Portuguese Language in two Portuguese schools. This will be useful because parents and schools can see what demographic, social, and school related features affect student performance and can begin to make improvements.
30 predictor variables
Dataset Source:
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
theme_set(theme_classic())
setwd("C:/Users/Cat/OneDrive/UW/Data Mining & Predictive Analytics Essentials/Final Project/")
df_mat <- read.csv2("student-mat.csv")
df_por <- read.csv2("student-por.csv")
df <- bind_rows(df_mat, df_por)
df_var <- df %>% select(-(G1:G3))
str(df_var)
## 'data.frame': 1044 obs. of 30 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
df_lab <- df %>% select(G1:G3)
summary(df_lab)
## G1 G2 G3
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00 1st Qu.:10.00
## Median :11.00 Median :11.00 Median :11.00
## Mean :11.21 Mean :11.25 Mean :11.34
## 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :19.00 Max. :20.00
Student performance measured as grades:
Missing values: none
Amelia::missmap(df, main = "Missing Values vs. Observed")
Label distribution: fairly normal with some outliers on the lower end.
#histogram with Freedman-Diaconis rule for binwidth
h <- hist(df$G3, breaks = "FD", plot = FALSE)
df %>% ggplot(aes(G3))+
geom_histogram(breaks = h$breaks, col = "white")
Correlations: Some variables have high correlations
df_cor <- df %>% mutate_if(is.factor, as.numeric)
#run a correlation and since there are so many different relationships, drop the insignificant ones
corr <- cor(df_cor)
corr[lower.tri(corr,diag=TRUE)] <- NA #Prepare to drop duplicates and correlations of 1
corr[corr == 1] <- NA #drop perfect
corr <- as.data.frame(as.table(corr)) #Turn into a 3-column table
corr <- na.omit(corr) #remove the NA values from above
corr <- subset(corr, abs(Freq) > 0.3) #select significant values
corr <- corr[order(-abs(corr$Freq)),] #Sort by highest correlation
#turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
#plot correlations visually
corrplot::corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ")
Some data types will need to be changed for more appropriate analysis.
Feature Engineering will be required to handle correlations, but this is generally high quality data that does not require much cleaning or preparation.
This will be done using a function to walk through multiple visualizations.
Visualizations will include:
gg_plot <- function(x_col, y_col=df$G3){
if(is.numeric(df[[x_col]])){
p1 <- df %>% ggplot(mapping=aes_string(x_col, y_col))+
geom_jitter(alpha=0.5)+
geom_smooth(method="lm", se=FALSE)+
labs(title=str_c("Student Performance: ", x_col, " by Final Grade"), y="Final Grade")
p1 %>% print()
h <- hist(df[[x_col]], breaks = "FD", plot = FALSE)
p2 <- ggplot(df, aes_string(x_col))+
geom_histogram(aes(y = ..density..), breaks = h$breaks, alpha = 0.3, col = "white")+
geom_density(size = 1) +
labs(title=str_c("Histogram and density for ", x_col))
p2 %>% print()
}
else{
p3 <- ggplot(df, aes_string(x_col, y_col))+
geom_boxplot()+
geom_hline(yintercept=mean(y_col), color="red")+
geom_hline(yintercept=median(y_col), color="blue", linetype="dashed")+
labs(title=str_c("Student Performance: Final Grade by ", x_col), subtitle="Showing mean(red), median(blue)", y="Final Grade")
p3 %>% print()
}
}
cols <- df %>% select(-G3) %>% names()
cols %>% walk(gg_plot)
Linear Regression and Decision Trees will be used.
The label is a continuous value which suggests Linear Regression vs. Logistic Regression. Decision Trees will also be used as another model type to see which model is better.
Data will be split up into 80% training and 20% test sets.
The following performance metrics will be checked with the test set:
Performance will also be visually inspected with plots for:
There will be a section describing the overall model performance as well as factors that may have helped to create a better model. I suspect these would be things like Feature Engineering for different variables or using different model types.
My goal is to predict student performance as accurately as possible so that parents and schools can start making necessary changes in their children’s lives.