Predicting Student Performance

The purpose of this exercise is to build a linear regression model that will predict final test scores for students.

1. Get our Data

We will use the Student Performance Data Set from UC Irvine’s Machine Learning Repository!

A. Lets read the data and look at the head of our dataset

df <- read.csv('student-mat.csv', sep=';')
head(df)

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15

B. Lets look at the summary of our dataset

summary(df)

##  school   sex          age       address famsize   Pstatus      Medu      
##  GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
##  MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
##                   Median :17.0                             Median :3.000  
##                   Mean   :16.7                             Mean   :2.749  
##                   3rd Qu.:18.0                             3rd Qu.:4.000  
##                   Max.   :22.0                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason   
##  Min.   :0.000   at_home : 59   at_home : 20   course    :145  
##  1st Qu.:2.000   health  : 34   health  : 18   home      :109  
##  Median :2.000   other   :141   other   :217   other     : 36  
##  Mean   :2.522   services:103   services:111   reputation:105  
##  3rd Qu.:3.000   teacher : 58   teacher : 29                   
##  Max.   :4.000                                                 
##    guardian     traveltime      studytime        failures      schoolsup
##  father: 90   Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :344  
##  mother:273   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 51  
##  other : 32   Median :1.000   Median :2.000   Median :0.0000            
##               Mean   :1.448   Mean   :2.035   Mean   :0.3342            
##               3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000            
##               Max.   :4.000   Max.   :4.000   Max.   :3.0000            
##  famsup     paid     activities nursery   higher    internet  romantic 
##  no :153   no :214   no :194    no : 81   no : 20   no : 66   no :263  
##  yes:242   yes:181   yes:201    yes:314   yes:375   yes:329   yes:132  
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##      famrel         freetime         goout            Dalc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :4.000   Median :3.000   Median :3.000   Median :1.000  
##  Mean   :3.944   Mean   :3.235   Mean   :3.109   Mean   :1.481  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       Walc           health         absences            G1       
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   : 3.00  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00  
##  Median :2.000   Median :4.000   Median : 4.000   Median :11.00  
##  Mean   :2.291   Mean   :3.554   Mean   : 5.709   Mean   :10.91  
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :5.000   Max.   :75.000   Max.   :19.00  
##        G2              G3       
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 9.00   1st Qu.: 8.00  
##  Median :11.00   Median :11.00  
##  Mean   :10.71   Mean   :10.42  
##  3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :19.00   Max.   :20.00

2. Clean the Data

A. Check for NA values - Let’s see if we have any NA values

any(is.na(df))

## [1] FALSE

Great, looks like there are no NA values in this dataset.

B. Categorical Variables

Let’s check to see which categorical variables have been set as factors and if we need to change the datatype of any of these variables to factors.

str(df)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

We may convert some of the predictors such as ‘Medu’, ‘Fedu’ as factors but we will leave the data as is for now.

3. Exploratory Data Analysis

We will use ggplot2 to explore our dataset

library(ggplot2)
library(corrplot)
library(corrgram)
library(dplyr)

A. Correlation and Corrplots

Correlation plots are a great way of exploring data and seeing if there are any interaction terms. Let’s start off by just grabbing the numeric data (we can’t see correlation for categorical data):

# Grab only numeric columns
num.cols <- sapply(df, is.numeric)

# Filter to numeric columns for correlation
cor.data <- cor(df[,num.cols])

Lets look at our correlation data

corrplot::corrplot(cor.data, method='color')

Cleary we have very high correlation between G1, G2, and G3 which makes sense since those are grades:

G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)

Meaning good students do well each period, and poor students do poorly each period, etc. Also a high G1,G2, or G3 value has a negative correlation with failure (number of past class failures). Also Mother and Father education levels are correlated, which also makes sense.

Corrgram - We can also use the corrgram which allows to just automatically do these type of figures by just passing in the dataframe directly.

library(corrgram)
corrgram(df, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie,text.panel=panel.txt)

B. Histograms

Lets look at a histogram of our response variable G3 that we are going to predict.

library(ggplot2)
qplot(G3, data=df, geom='histogram', bins=20, fill=..count.., xlab='G3 (final grade)')

Looks like quite a few students get a zero. This is a good place to ask questions, like are students missing the test? Also why is the mean occurence so high? Is this test curved?

Let’s continue by building a model

4. Build the model

A. Train and Test Data

We’ll need to split our data into a training set and a testing set in order to test our accuracy.

library(caTools)

# Set the seed
set.seed(101)

# Split the data into test and train sets
sample <- sample.split(df$age, SplitRatio = 0.7)

# Training data
train <- subset(df,sample==TRUE)

# Test data
test <- subset(df,sample==FALSE)

B. Train our Model

Let’s train out model on our training data, then ask for a summary of that model:

model <- lm(G3~., train)
summary(model)

## 
## Call:
## lm(formula = G3 ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7681 -0.6423  0.2294  1.0691  4.5942 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.329568   2.474569  -0.537 0.591574    
## schoolMS          0.838581   0.470545   1.782 0.076016 .  
## sexM              0.034883   0.275586   0.127 0.899382    
## age              -0.214994   0.119579  -1.798 0.073472 .  
## addressU          0.067190   0.326035   0.206 0.836905    
## famsizeLE3       -0.111068   0.283228  -0.392 0.695302    
## PstatusT         -0.153653   0.401679  -0.383 0.702417    
## Medu              0.279949   0.171111   1.636 0.103164    
## Fedu             -0.221275   0.151103  -1.464 0.144422    
## Mjobhealth        0.002065   0.610532   0.003 0.997304    
## Mjobother         0.509947   0.403195   1.265 0.207209    
## Mjobservices      0.475476   0.435332   1.092 0.275857    
## Mjobteacher       0.285345   0.550640   0.518 0.604802    
## Fjobhealth        0.433172   0.774191   0.560 0.576343    
## Fjobother        -0.296792   0.577217  -0.514 0.607611    
## Fjobservices     -0.311595   0.593628  -0.525 0.600148    
## Fjobteacher      -0.321205   0.712695  -0.451 0.652628    
## reasonhome       -0.431435   0.319907  -1.349 0.178755    
## reasonother       0.159612   0.454480   0.351 0.725755    
## reasonreputation -0.051845   0.317894  -0.163 0.870589    
## guardianmother    0.267462   0.311371   0.859 0.391226    
## guardianother    -0.157335   0.554872  -0.284 0.777003    
## traveltime        0.274301   0.197865   1.386 0.166968    
## studytime        -0.140650   0.155149  -0.907 0.365577    
## failures         -0.185333   0.211040  -0.878 0.380739    
## schoolsupyes      0.562716   0.379303   1.484 0.139268    
## famsupyes         0.369402   0.268848   1.374 0.170745    
## paidyes           0.060643   0.270971   0.224 0.823107    
## activitiesyes    -0.286006   0.247519  -1.155 0.249063    
## nurseryyes       -0.426858   0.315064  -1.355 0.176774    
## higheryes         0.503353   0.677346   0.743 0.458148    
## internetyes      -0.097405   0.331040  -0.294 0.768834    
## romanticyes      -0.243837   0.268414  -0.908 0.364577    
## famrel            0.494479   0.136663   3.618 0.000363 ***
## freetime         -0.139869   0.138297  -1.011 0.312879    
## goout             0.078871   0.128794   0.612 0.540879    
## Dalc             -0.248633   0.178366  -1.394 0.164651    
## Walc              0.221434   0.139178   1.591 0.112950    
## health            0.027495   0.095100   0.289 0.772748    
## absences          0.060830   0.017915   3.396 0.000804 ***
## G1                0.162966   0.076956   2.118 0.035255 *  
## G2                0.994677   0.066450  14.969  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.868 on 235 degrees of freedom
## Multiple R-squared:  0.8616, Adjusted R-squared:  0.8374 
## F-statistic: 35.68 on 41 and 235 DF,  p-value: < 2.2e-16

C. Let’s plot the residuals

# Grab residuals
res <- residuals(model)

# Convert to data frame for plot
res <- as.data.frame(res)

# Plot residuals
qplot(res, data=res, geom='histogram')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looks like there are some suspicious residual values that have a value less than -5. Model seems to be predicting that poor performing students will get negative test results. The lowest possible score however is 0. We can further explore this by just calling plot on our model.

par(mfrow=c(2,2))
plot(model)

5. Make Prediction

A. Let’s test our model by predicting on our testing set:

G3.predictions <- predict(model, test)
summary(G3.predictions)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.832   7.776  10.280  10.240  13.140  20.260

Looking at the range, it looks like our model is predicting negative test scores.

B. Let’s create a dataset of actual and predicted results to check model performance:

results <- cbind(G3.predictions, test$G3)
colnames(results) <- c('Predicted', 'Real')
results <- as.data.frame(results)

C. Let’s check the performance of our model:

sse <- sum((results$Predicted - results$Real)^2)
sst <- sum((mean(df$G3) - results$Real)^2)
R2 <- 1-sse/sst
R2

## [1] 0.7747253

Current performance of our model is R2 = 0.774725

D. Improve performance

Now let’s take care of negative predictions by creating a simple function to replace them with 0.

# Create function to replace negative values with 0
to_zero <- function(x){
  if(x<0){
    return(0)
  } else {
    return(x)
  }
}

# Apply the function to predicted results
results$Predicted <- sapply(results$Predicted, to_zero)

# Check the range of predicted values
range(results$Predicted)

## [1]  0.00000 20.25837

D. Let’s check the performance of our model again:

sse <- sum((results$Predicted - results$Real)^2)
sst <- sum((mean(df$G3) - results$Real)^2)
R2 <- 1-sse/sst
RMSE <- sqrt(mean((results$Real - results$Predicted)^2))
R2

## [1] 0.7779023

RMSE

## [1] 2.100335

Conclusion - Performance of our improved model is slightly better with R2 = 0.7779023. It means our model can explain about 78% variance in our test data.