Data

This dataset is from the UCI Machine Learning Repository and consists of student performance information, including demographic, social, and school-related features.

The task associated with this dataset is to predict, via regression, the grades earned in each of two course subjects, Math and Portuguese:

# Read the Math and Portuguese course data (semicolon-delimited CSV)
df_mat = read.table("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Final%20Project/student-mat.csv",sep=";",header=TRUE)
df_por = read.table("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Final%20Project/student-por.csv",sep=";",header=TRUE)
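
As a quick sanity check on the load, the Math file should contain 395 students and the Portuguese file 649, each with 33 columns (30 features plus the three grades):

# Verify dimensions: expect 395 x 33 (math) and 649 x 33 (Portuguese)
dim(df_mat)
dim(df_por)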

Abstract

Key Words

Introduction

The goal of this project is to use features from a student’s personal life and activities to predict grades in Math and Portuguese classes. Feature importance and explainability will be particularly useful, as they will indicate which factors contribute most to grades. Two datasets containing the same features will be used: one for Math grades (df_mat) and one for Portuguese grades (df_por). Since there is some overlap in students between the Math and Portuguese classes, separate models will be created for each dataset. This will also be a useful indicator of whether feature importance varies between math and language skills.

Methodology

The first step in predicting test scores for the Math and Portuguese classes is data exploration and preprocessing. There are no missing values in the dataset, so no imputation is needed. However, there are many categorical and ordinal variables. Categorical variables will be encoded prior to use in linear models, while ordinal variables will be kept as-is. To encode the categorical variables, one-hot encoding will be used to create dummy variables, as sketched below.
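
As a minimal illustration of this encoding (the actual encoding later in this report uses the onehot package), base R’s model.matrix produces treatment-coded dummy variables, one fewer than the number of levels of each factor; the ex data frame below is a made-up example:

# Hypothetical two-column example: 'sex' is categorical, 'studytime' is ordinal
ex <- data.frame(sex = factor(c("F", "M", "F")), studytime = c(2, 1, 3))
# model.matrix drops one reference level per factor ('F' becomes the baseline)
model.matrix(~ sex + studytime, data = ex)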

We then divide the variables into two separate datasets, one for Math and one for Portuguese, and split each of those datasets into training and test data. We use stepwise regression to build a linear model on each training set and evaluate its performance on the corresponding test set.

Experimentation and Results

Data Exploration

Correlation

The correlation plots below show correlations for the numeric columns only and indicate very limited collinearity in the dataset. The last three columns/rows are the test scores (G1, G2, and G3), which are highly correlated with one another.

library(corrplot)

# Correlation matrix of numeric columns, math data
df_mat <- as.data.frame(df_mat)
corrplot(cor(df_mat[, sapply(df_mat, is.numeric)], use = "complete.obs"),
         method = "color", type = "lower", tl.col = "black", tl.srt = 5)

# Correlation matrix of numeric columns, Portuguese data
df_por <- as.data.frame(df_por)
corrplot(cor(df_por[, sapply(df_por, is.numeric)], use = "complete.obs"),
         method = "color", type = "lower", tl.col = "black", tl.srt = 5)

Distributions

The following plots show feature and target variable distributions for the Math and Portuguese data. For both datasets, all grades (G1, G2, and G3) appear to be approximately normally distributed.

library(ggplot2)
library(gridExtra)

# Plot the count distribution of every column in a grid of bar charts
plot_distributions <- function(df, ncol = 6) {
  plots <- lapply(names(df), function(col) {
    ggplot(df, aes(.data[[col]])) + geom_bar()
  })
  grid.arrange(grobs = plots, ncol = ncol)
}

plot_distributions(df_mat)

plot_distributions(df_por)

Data Preparation

Encoding

The summaries of df_mat and df_por reveal that the datasets have multiple categorical variables, which will be encoded into dummy variables. The number of dummy variables used for each field will be one less than the number of categories in the feature. This encoding should only be used for linear regression models, not tree-based models. The following variables will be encoded:

  • school
  • sex
  • address
  • famsize
  • Pstatus
  • Mjob
  • Fjob
  • reason
  • guardian
  • schoolsup
  • famsup
  • paid
  • activities
  • nursery
  • higher
  • internet
  • romantic

library(onehot)

# Encode math data: expand each factor into dummy columns, then drop one
# reference level per variable to avoid perfect collinearity
encoder <- onehot(df_mat, stringsAsFactors = TRUE)
df_mat <- predict(encoder, df_mat)
drop <- c("school=GP", "sex=F", "address=U", "famsize=LE3", "Pstatus=T",
          "Mjob=at_home", "Fjob=at_home", "reason=course", "guardian=other",
          "schoolsup=no", "famsup=no", "paid=no", "activities=no", "nursery=no",
          "higher=no", "internet=no", "romantic=no")
df_mat <- df_mat[, !(colnames(df_mat) %in% drop)]

# Encode Portuguese data the same way, dropping the same reference levels
encoder <- onehot(df_por, stringsAsFactors = TRUE)
df_por <- predict(encoder, df_por)
df_por <- df_por[, !(colnames(df_por) %in% drop)]

Model Building

Let’s split each dataset into training and test sets using an 80/20 ratio. The Portuguese training set has 511 observations and its test set has 138. The Math training set has 309 observations and its test set has 86.

library(caTools)

set.seed(95)

# Portuguese data: split on the target so train/test have similar G3 distributions
df_por2 <- as.data.frame(df_por)
sample <- sample.split(df_por2$G3, SplitRatio = 0.8)

train_por <- subset(df_por2, sample == TRUE)
test_por <- subset(df_por2, sample == FALSE)

# Math data
df_mat2 <- as.data.frame(df_mat)
sample <- sample.split(df_mat2$G3, SplitRatio = 0.8)

train_mat <- subset(df_mat2, sample == TRUE)
test_mat <- subset(df_mat2, sample == FALSE)

We will use stepwise regression to select the variables for our model.
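
Stepwise selection compares candidate models using the Akaike information criterion, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; lower values are better. With direction = "both", stepAIC repeatedly adds or drops single terms until no change lowers the AIC.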

library(MASS)

# Portuguese data: fit the full model, then run stepwise selection by AIC
model_por <- lm(G3 ~ ., train_por)
step_model_por <- stepAIC(model_por, direction = "both", trace = FALSE)
summary(step_model_por)
## 
## Call:
## lm(formula = G3 ~ `sex=M` + `address=R` + `Mjob=other` + `reason=other` + 
##     traveltime + failures + health + absences + G1 + G2, data = train_por)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9311 -0.5011 -0.0138  0.6186  5.0442 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.29551    0.37503   0.788 0.431096    
## `sex=M`        -0.24348    0.11710  -2.079 0.038107 *  
## `address=R`    -0.27737    0.13551  -2.047 0.041196 *  
## `Mjob=other`   -0.20197    0.11479  -1.759 0.079109 .  
## `reason=other` -0.41281    0.18469  -2.235 0.025850 *  
## traveltime      0.18289    0.08432   2.169 0.030555 *  
## failures       -0.18177    0.11216  -1.621 0.105736    
## health         -0.06079    0.04066  -1.495 0.135453    
## absences        0.02012    0.01216   1.654 0.098784 .  
## G1              0.14774    0.04079   3.622 0.000323 ***
## G2              0.87552    0.03891  22.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.268 on 500 degrees of freedom
## Multiple R-squared:  0.8509, Adjusted R-squared:  0.8479 
## F-statistic: 285.3 on 10 and 500 DF,  p-value: < 2.2e-16

# Math data: fit the full model, then run stepwise selection by AIC
model_mat <- lm(G3 ~ ., train_mat)
step_model_mat <- stepAIC(model_mat, direction = "both", trace = FALSE)
summary(step_model_mat)
## 
## Call:
## lm(formula = G3 ~ age + `reason=other` + studytime + `activities=yes` + 
##     `nursery=yes` + `higher=yes` + `internet=yes` + famrel + 
##     goout + absences + G1 + G2, data = train_mat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3331 -0.5044  0.2201  0.9194  3.7953 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.61142    1.72810  -0.354 0.723734    
## age              -0.19488    0.08796  -2.216 0.027479 *  
## `reason=other`    0.67213    0.36155   1.859 0.064020 .  
## studytime        -0.22862    0.12462  -1.834 0.067583 .  
## `activities=yes` -0.31706    0.20297  -1.562 0.119346    
## `nursery=yes`    -0.62767    0.25998  -2.414 0.016370 *  
## `higher=yes`      1.09842    0.48365   2.271 0.023858 *  
## `internet=yes`   -0.53247    0.28516  -1.867 0.062856 .  
## famrel            0.42623    0.11176   3.814 0.000167 ***
## goout             0.19007    0.09498   2.001 0.046299 *  
## absences          0.04437    0.01276   3.476 0.000585 ***
## G1                0.16679    0.06460   2.582 0.010305 *  
## G2                0.97407    0.05854  16.640  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.761 on 296 degrees of freedom
## Multiple R-squared:  0.8537, Adjusted R-squared:  0.8478 
## F-statistic: 143.9 on 12 and 296 DF,  p-value: < 2.2e-16

Let’s plot the residuals.

#Portuguese Data
res <- residuals(step_model_por)
res <- as.data.frame(res)
qplot(res, data=res, geom='histogram')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

par(mfrow=c(2,2))
plot(step_model_por)

#Math Data
res <- residuals(step_model_mat)
res <- as.data.frame(res)
qplot(res, data=res, geom='histogram')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

par(mfrow=c(2,2))
plot(step_model_mat)

The residuals look reasonably good, although they appear slightly left-skewed. The models seem to be a good fit, but there are some exceptions we might want to look into.
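
As a quick follow-up, here is a minimal sketch for flagging those exceptions; the cutoff of 2 standardized residuals is an arbitrary choice, not part of the original analysis:

# Flag training observations with large standardized residuals (|r| > 2)
out_por <- which(abs(rstandard(step_model_por)) > 2)
train_por[out_por, c("G1", "G2", "G3")]

out_mat <- which(abs(rstandard(step_model_mat)) > 2)
train_mat[out_mat, c("G1", "G2", "G3")]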

Model Selection

Model Evaluation

We will now make predictions on the test datasets and calculate R^2 to evaluate the performance of our linear models.

library(caret)

# Predict final grades on the held-out test sets
mat_pred <- predict(step_model_mat, test_mat, type = "response")
summary(mat_pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1956  8.0839 10.5988 10.4892 13.5411 18.9162
por_pred <- predict(step_model_por, test_por, type = "response")
summary(por_pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9598  9.5369 11.4096 11.4573 13.3340 18.5929
R2(mat_pred, test_mat$G3)
## [1] 0.7674924
R2(por_pred, test_por$G3)
## [1] 0.8663727

The models perform well and have very good R^2 values: they explain 77% of the variability in the Math data and 87% in the Portuguese data. The Portuguese dataset shows a better result than the Math dataset; the reason might be the fact that we have more observations in that dataset, which makes the model more accurate.
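
For reference, a minimal sketch of two common R^2 definitions; this assumes the R2 helper above is caret’s, which defaults to the squared Pearson correlation:

# Squared correlation between predictions and observations (caret's default)
r2_corr <- function(pred, obs) cor(pred, obs)^2
# Classical definition: 1 - SSE/SST
r2_ss <- function(pred, obs) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

r2_corr(por_pred, test_por$G3)
r2_ss(por_pred, test_por$G3)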


Our linear models indicate that the following factors help explain final grades. For Portuguese: gender, address, mother’s job, reason for choosing the school, travel time, number of past class failures, health, absences, and earlier grades. For Math: age, reason for choosing the school, study time, extracurricular activities, nursery attendance, plans for higher education, internet access, quality of family relationships, going out, absences, and earlier grades.
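
A quick way to review the selected predictors and the direction of their effects is to inspect the fitted coefficients directly:

# Coefficients of the final stepwise models
coef(step_model_por)
coef(step_model_mat)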

Discussion and Conclusions

References

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Appendices