The purpose of this exercise is to build a linear regression model that will predict final test scores for students.
We will use the Student Performance Data Set from UC Irvine’s Machine Learning Repository!
df <- read.csv('student-mat.csv', sep=';')
head(df)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
summary(df)
## school sex age address famsize Pstatus Medu
## GP:349 F:208 Min. :15.0 R: 88 GT3:281 A: 41 Min. :0.000
## MS: 46 M:187 1st Qu.:16.0 U:307 LE3:114 T:354 1st Qu.:2.000
## Median :17.0 Median :3.000
## Mean :16.7 Mean :2.749
## 3rd Qu.:18.0 3rd Qu.:4.000
## Max. :22.0 Max. :4.000
## Fedu Mjob Fjob reason
## Min. :0.000 at_home : 59 at_home : 20 course :145
## 1st Qu.:2.000 health : 34 health : 18 home :109
## Median :2.000 other :141 other :217 other : 36
## Mean :2.522 services:103 services:111 reputation:105
## 3rd Qu.:3.000 teacher : 58 teacher : 29
## Max. :4.000
## guardian traveltime studytime failures schoolsup
## father: 90 Min. :1.000 Min. :1.000 Min. :0.0000 no :344
## mother:273 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 yes: 51
## other : 32 Median :1.000 Median :2.000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery higher internet romantic
## no :153 no :214 no :194 no : 81 no : 20 no : 66 no :263
## yes:242 yes:181 yes:201 yes:314 yes:375 yes:329 yes:132
##
##
##
##
## famrel freetime goout Dalc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
## Median :4.000 Median :3.000 Median :3.000 Median :1.000
## Mean :3.944 Mean :3.235 Mean :3.109 Mean :1.481
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Walc health absences G1
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 3.00
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00
## Median :2.000 Median :4.000 Median : 4.000 Median :11.00
## Mean :2.291 Mean :3.554 Mean : 5.709 Mean :10.91
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00
## Max. :5.000 Max. :5.000 Max. :75.000 Max. :19.00
## G2 G3
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :10.71 Mean :10.42
## 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :20.00
any(is.na(df))
## [1] FALSE
Great, looks like there are no NA values in this dataset.
Let’s check to see which categorical variables have been set as factors and if we need to change the datatype of any of these variables to factors.
str(df)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
We may convert some of the predictors such as ‘Medu’, ‘Fedu’ as factors but we will leave the data as is for now.
We will use ggplot2 to explore our dataset
library(ggplot2)
library(corrplot)
library(corrgram)
library(dplyr)
Correlation plots are a great way of exploring data and seeing if there are any interaction terms. Let’s start off by just grabbing the numeric data (we can’t see correlation for categorical data):
# Grab only numeric columns
num.cols <- sapply(df, is.numeric)
# Filter to numeric columns for correlation
cor.data <- cor(df[,num.cols])
Lets look at our correlation data
corrplot::corrplot(cor.data, method='color')
Cleary we have very high correlation between G1, G2, and G3 which makes sense since those are grades:
Meaning good students do well each period, and poor students do poorly each period, etc. Also a high G1,G2, or G3 value has a negative correlation with failure (number of past class failures). Also Mother and Father education levels are correlated, which also makes sense.
Corrgram - We can also use the corrgram which allows to just automatically do these type of figures by just passing in the dataframe directly.
library(corrgram)
corrgram(df, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie,text.panel=panel.txt)
Lets look at a histogram of our response variable G3 that we are going to predict.
library(ggplot2)
qplot(G3, data=df, geom='histogram', bins=20, fill=..count.., xlab='G3 (final grade)')
Looks like quite a few students get a zero. This is a good place to ask questions, like are students missing the test? Also why is the mean occurence so high? Is this test curved?
Let’s continue by building a model
We’ll need to split our data into a training set and a testing set in order to test our accuracy.
library(caTools)
# Set the seed
set.seed(101)
# Split the data into test and train sets
sample <- sample.split(df$age, SplitRatio = 0.7)
# Training data
train <- subset(df,sample==TRUE)
# Test data
test <- subset(df,sample==FALSE)
Let’s train out model on our training data, then ask for a summary of that model:
model <- lm(G3~., train)
summary(model)
##
## Call:
## lm(formula = G3 ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7681 -0.6423 0.2294 1.0691 4.5942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.329568 2.474569 -0.537 0.591574
## schoolMS 0.838581 0.470545 1.782 0.076016 .
## sexM 0.034883 0.275586 0.127 0.899382
## age -0.214994 0.119579 -1.798 0.073472 .
## addressU 0.067190 0.326035 0.206 0.836905
## famsizeLE3 -0.111068 0.283228 -0.392 0.695302
## PstatusT -0.153653 0.401679 -0.383 0.702417
## Medu 0.279949 0.171111 1.636 0.103164
## Fedu -0.221275 0.151103 -1.464 0.144422
## Mjobhealth 0.002065 0.610532 0.003 0.997304
## Mjobother 0.509947 0.403195 1.265 0.207209
## Mjobservices 0.475476 0.435332 1.092 0.275857
## Mjobteacher 0.285345 0.550640 0.518 0.604802
## Fjobhealth 0.433172 0.774191 0.560 0.576343
## Fjobother -0.296792 0.577217 -0.514 0.607611
## Fjobservices -0.311595 0.593628 -0.525 0.600148
## Fjobteacher -0.321205 0.712695 -0.451 0.652628
## reasonhome -0.431435 0.319907 -1.349 0.178755
## reasonother 0.159612 0.454480 0.351 0.725755
## reasonreputation -0.051845 0.317894 -0.163 0.870589
## guardianmother 0.267462 0.311371 0.859 0.391226
## guardianother -0.157335 0.554872 -0.284 0.777003
## traveltime 0.274301 0.197865 1.386 0.166968
## studytime -0.140650 0.155149 -0.907 0.365577
## failures -0.185333 0.211040 -0.878 0.380739
## schoolsupyes 0.562716 0.379303 1.484 0.139268
## famsupyes 0.369402 0.268848 1.374 0.170745
## paidyes 0.060643 0.270971 0.224 0.823107
## activitiesyes -0.286006 0.247519 -1.155 0.249063
## nurseryyes -0.426858 0.315064 -1.355 0.176774
## higheryes 0.503353 0.677346 0.743 0.458148
## internetyes -0.097405 0.331040 -0.294 0.768834
## romanticyes -0.243837 0.268414 -0.908 0.364577
## famrel 0.494479 0.136663 3.618 0.000363 ***
## freetime -0.139869 0.138297 -1.011 0.312879
## goout 0.078871 0.128794 0.612 0.540879
## Dalc -0.248633 0.178366 -1.394 0.164651
## Walc 0.221434 0.139178 1.591 0.112950
## health 0.027495 0.095100 0.289 0.772748
## absences 0.060830 0.017915 3.396 0.000804 ***
## G1 0.162966 0.076956 2.118 0.035255 *
## G2 0.994677 0.066450 14.969 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.868 on 235 degrees of freedom
## Multiple R-squared: 0.8616, Adjusted R-squared: 0.8374
## F-statistic: 35.68 on 41 and 235 DF, p-value: < 2.2e-16
# Grab residuals
res <- residuals(model)
# Convert to data frame for plot
res <- as.data.frame(res)
# Plot residuals
qplot(res, data=res, geom='histogram')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looks like there are some suspicious residual values that have a value less than -5. Model seems to be predicting that poor performing students will get negative test results. The lowest possible score however is 0. We can further explore this by just calling plot on our model.
par(mfrow=c(2,2))
plot(model)
G3.predictions <- predict(model, test)
summary(G3.predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.832 7.776 10.280 10.240 13.140 20.260
Looking at the range, it looks like our model is predicting negative test scores.
results <- cbind(G3.predictions, test$G3)
colnames(results) <- c('Predicted', 'Real')
results <- as.data.frame(results)
sse <- sum((results$Predicted - results$Real)^2)
sst <- sum((mean(df$G3) - results$Real)^2)
R2 <- 1-sse/sst
R2
## [1] 0.7747253
Current performance of our model is R2 = 0.774725
Now let’s take care of negative predictions by creating a simple function to replace them with 0.
# Create function to replace negative values with 0
to_zero <- function(x){
if(x<0){
return(0)
} else {
return(x)
}
}
# Apply the function to predicted results
results$Predicted <- sapply(results$Predicted, to_zero)
# Check the range of predicted values
range(results$Predicted)
## [1] 0.00000 20.25837
sse <- sum((results$Predicted - results$Real)^2)
sst <- sum((mean(df$G3) - results$Real)^2)
R2 <- 1-sse/sst
RMSE <- sqrt(mean((results$Real - results$Predicted)^2))
R2
## [1] 0.7779023
RMSE
## [1] 2.100335