Problem

Parents and schools do not always know what factors help their children succeed in high school. Post secondary education relies heavily on high school performance as part of their acceptance criteria. In addition, many students just completely fail high school, which ultimately sets them up for increased financial hardship later in life.

This project will predict student performance (measured as a final grade) in Mathematics and Portuguese Language in two Portuguese schools. This will be useful because parents and schools can see what demographic, social, and school related features affect student performance and can begin to make improvements.

Dataset Predictors

30 predictor variables

Dataset Source:

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
theme_set(theme_classic())
setwd("C:/Users/Cat/OneDrive/UW/Data Mining & Predictive Analytics Essentials/Final Project/")

df_mat <- read.csv2("student-mat.csv")
df_por <- read.csv2("student-por.csv")

df <- bind_rows(df_mat, df_por)

df_var <- df %>% select(-(G1:G3))
str(df_var)
## 'data.frame':    1044 obs. of  30 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...

Dataset Output

df_lab <- df %>% select(G1:G3)
summary(df_lab)
##        G1              G2              G3       
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 9.00   1st Qu.: 9.00   1st Qu.:10.00  
##  Median :11.00   Median :11.00   Median :11.00  
##  Mean   :11.21   Mean   :11.25   Mean   :11.34  
##  3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :19.00   Max.   :19.00   Max.   :20.00

Ground Truths

Student performance measured as grades:

  • G1 - 1st period grade
  • G2 - 2nd period grade
  • G3 - final grade

Label

  • G3 - final grade (issued at 3rd period)

Data Quality & Preparations

Missing values: none

Amelia::missmap(df, main = "Missing Values vs. Observed")

Label distribution: fairly normal with some outliers on the lower end.

#histogram with Freedman-Diaconis rule for binwidth
h <- hist(df$G3, breaks = "FD", plot = FALSE)

df %>% ggplot(aes(G3))+ 
  geom_histogram(breaks = h$breaks, col = "white")

Correlations: Some variables have high correlations

df_cor <- df %>% mutate_if(is.factor, as.numeric)

#run a correlation and since there are so many different relationships, drop the insignificant ones
corr <- cor(df_cor)
corr[lower.tri(corr,diag=TRUE)] <- NA  #Prepare to drop duplicates and correlations of 1
corr[corr == 1] <- NA #drop perfect
corr <- as.data.frame(as.table(corr)) #Turn into a 3-column table
corr <- na.omit(corr) #remove the NA values from above
corr <- subset(corr, abs(Freq) > 0.3) #select significant values
corr <- corr[order(-abs(corr$Freq)),] #Sort by highest correlation

#turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")

#plot correlations visually
corrplot::corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ")

Some data types will need to be changed for more appropriate analysis.

Feature Engineering will be required to handle correlations, but this is generally high quality data that does not require much cleaning or preparation.

Exploratory Data Analysis

This will be done using a function to walk through multiple visualizations.

Visualizations will include:

gg_plot <- function(x_col, y_col=df$G3){
  if(is.numeric(df[[x_col]])){
    p1 <- df %>% ggplot(mapping=aes_string(x_col, y_col))+
      geom_jitter(alpha=0.5)+
      geom_smooth(method="lm", se=FALSE)+
      labs(title=str_c("Student Performance: ", x_col, " by Final Grade"), y="Final Grade")
    p1 %>% print()
    h <- hist(df[[x_col]], breaks = "FD", plot = FALSE)
    p2 <- ggplot(df, aes_string(x_col))+
      geom_histogram(aes(y = ..density..), breaks = h$breaks, alpha = 0.3, col = "white")+
      geom_density(size = 1) +
      labs(title=str_c("Histogram and density for ", x_col))
    p2 %>% print()
    }
  else{
    p3 <- ggplot(df, aes_string(x_col, y_col))+
      geom_boxplot()+
      geom_hline(yintercept=mean(y_col), color="red")+
      geom_hline(yintercept=median(y_col), color="blue", linetype="dashed")+
      labs(title=str_c("Student Performance: Final Grade by ", x_col), subtitle="Showing mean(red), median(blue)", y="Final Grade")
    p3 %>% print()
    }
}

cols <- df %>% select(-G3) %>% names()
cols %>% walk(gg_plot)

Models

Linear Regression and Decision Trees will be used.

The label is a continuous value which suggests Linear Regression vs. Logistic Regression. Decision Trees will also be used as another model type to see which model is better.

Data will be split up into 80% training and 20% test sets.

Performance

The following performance metrics will be checked with the test set:

  • Statistical significance (p-value)
  • AIC statistic
  • R squared value
  • Root Mean Square Error (RMSE)

Performance will also be visually inspected with plots for:

  • Quantile-quantile
  • Residuals vs. fitted values

Conclusion

There will be a section describing the overall model performance as well as factors that may have helped to create a better model. I suspect these would be things like Feature Engineering for different variables or using different model types.

My goal is to predict student performance as accurately as possible so that parents and schools can start making necessary changes in their children’s lives.