This dataset is from the UCI Machine Learning Repository and comprises student performance information. The data contains the following features:
- school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
- sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
- age - student’s age (numeric: from 15 to 22)
- address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
- famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
- Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
- Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
- guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - with a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)
The task associated with this dataset is regression to determine the grades for the course subjects:
- G1 - first period grade (numeric: from 0 to 20)
- G2 - second period grade (numeric: from 0 to 20)
- G3 - final grade (numeric: from 0 to 20, output target)
df_mat = read.table("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Final%20Project/student-mat.csv",sep=";",header=TRUE)
df_por = read.table("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Final%20Project/student-por.csv",sep=";",header=TRUE)
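A quick sketch to confirm the size of each dataset after loading (the Math file has 395 students, the Portuguese file 649):
# check the number of rows and columns in each dataset
dim(df_mat)
dim(df_por)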
The goal of this project is to use features from a student’s personal life and activities to predict grades in Math and Portuguese classes. Feature importance and explainability will be particularly useful, as they indicate which factors contribute most to grades. Two datasets containing the same features will be used - one for Math grades (df_mat) and one for Portuguese grades (df_por). There is some overlap in students between the Math and Portuguese classes, so separate models will be created for each dataset. This will also be a useful indicator of whether feature importance varies between math and language skills.
The first step in predicting test scores for the Math and Portuguese classes is data exploration and preprocessing. There are no missing values in the datasets, so no imputation is needed. However, there are many categorical and ordinal variables. Categorical variables will be one-hot encoded into dummy variables before use in the linear models, while ordinal variables will be kept as-is.
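A small sketch verifying both points - counting missing values and listing the non-numeric (categorical) columns:
# confirm there are no missing values in either dataset
sum(is.na(df_mat))
sum(is.na(df_por))
# list the categorical (non-numeric) columns that will need encoding
names(df_mat)[!sapply(df_mat, is.numeric)]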
The correlation plots below show correlations among the numeric columns only and indicate very limited collinearity among the predictors. The last three columns/rows are the grades (G1, G2, and G3), which are highly correlated with one another.
library(corrplot)
library(dplyr)
corrplot(cor(select_if(df_mat, is.numeric), use = "complete.obs"), method = "circle", tl.pos='n')
corrplot(cor(select_if(df_por, is.numeric), use = "complete.obs"), method = "circle", tl.pos='n')
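As a numeric complement to the plots, a short sketch (illustrative only; cm and cm_long are throwaway names) lists the strongest pairwise correlations in the Math data:
# rank the absolute pairwise correlations among numeric columns (Math data)
cm <- cor(select_if(df_mat, is.numeric), use = "complete.obs")
cm[lower.tri(cm, diag = TRUE)] <- NA                 # keep each pair only once
cm_long <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
head(cm_long[order(-abs(cm_long$Freq)), ], 5)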
The following plots show feature and target variable distributions for the Math and Portuguese data. For both datasets, all grades (G1, G2, and G3) appear to be approximately normally distributed; a brief numeric summary follows the plots.
library(ggplot2)
library(gridExtra)
# one count plot per column: distributions of every feature and grade in the Math data
plots_mat <- lapply(names(df_mat), function(col)
  ggplot(df_mat, aes(.data[[col]])) + geom_histogram(stat = "count") + xlab(col))
grid.arrange(grobs = plots_mat, ncol = 6)
# the same distribution plots for the Portuguese data
plots_por <- lapply(names(df_por), function(col)
  ggplot(df_por, aes(.data[[col]])) + geom_histogram(stat = "count") + xlab(col))
grid.arrange(grobs = plots_por, ncol = 6)
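As a quick numeric check of the normality observation, a sketch of the five-number summaries of the grade columns (not used later in the analysis):
# summaries of the three grade variables in each dataset
summary(df_mat[, c("G1", "G2", "G3")])
summary(df_por[, c("G1", "G2", "G3")])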
The summary of df_mat and df_por reveals that the datasets have multiple categorical variables that will be encoded into dummy variables. The number of dummy variables used for each field will be one less than the number of categories in that feature. This encoding should only be used for the linear regression models, not for tree-based models.
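As a side note, base R’s model.matrix() applies the same k-1 treatment coding automatically by dropping one reference level per factor; a minimal sketch for comparison (mm_mat is an illustrative object that is not used below):
# equivalent k-1 dummy coding with base R (for comparison only)
mm_mat <- model.matrix(G3 ~ ., data = df_mat)
dim(mm_mat)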
library(onehot)
# encode the Math data: expand each categorical variable into 0/1 dummy columns
encoder <- onehot(df_mat, stringsAsFactors = TRUE)
df_mat <- predict(encoder, df_mat)
# drop one reference level per categorical variable (k-1 dummies) to avoid perfect collinearity
drop <- c("school=GP", "sex=F", "address=U", "famsize=LE3", "Pstatus=T",
"Mjob=at_home", "Fjob=at_home","reason=course", "guardian=other",
"schoolsup=no", "famsup=no", "paid=no", "activities=no", "nursery=no",
"higher=no", "internet=no", "romantic=no")
df_mat <- df_mat[,!(colnames(df_mat) %in% drop)]
# encode the Portuguese data the same way and drop the same reference levels
encoder <- onehot(df_por, stringsAsFactors = TRUE)
df_por <- predict(encoder, df_por)
df_por <- df_por[,!(colnames(df_por) %in% drop)]
Let’s split each dataset into training and test sets.
library(caTools)
set.seed(95)
# Portuguese data: split into training and test sets (SplitRatio = 0.8)
df_por2 <- as.data.frame(df_por)
sample <- sample.split(df_por2, SplitRatio = 0.8)
train_por <- subset(df_por2, sample == TRUE)
test_por <- subset(df_por2, sample == FALSE)
# Math data: split into training and test sets (SplitRatio = 0.8)
df_mat2 <- as.data.frame(df_mat)
sample <- sample.split(df_mat2, SplitRatio = 0.8)
train_mat <- subset(df_mat2, sample == TRUE)
test_mat <- subset(df_mat2, sample == FALSE)
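A quick sketch to confirm the resulting split sizes:
# number of rows in each training and test set
c(train_por = nrow(train_por), test_por = nrow(test_por),
  train_mat = nrow(train_mat), test_mat = nrow(test_mat))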
We use stepwise regression (stepAIC, searching in both directions) to select the variables for each model.
library(MASS)
#Portuguese Data
model_por <- lm(G3 ~ ., train_por)
step_model_por <- stepAIC(model_por, direction = "both", trace = FALSE)
summary(step_model_por)
##
## Call:
## lm(formula = G3 ~ `sex=M` + `address=R` + `Mjob=other` + `reason=other` +
## traveltime + failures + health + absences + G1 + G2, data = train_por)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9311 -0.5011 -0.0138 0.6186 5.0442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29551 0.37503 0.788 0.431096
## `sex=M` -0.24348 0.11710 -2.079 0.038107 *
## `address=R` -0.27737 0.13551 -2.047 0.041196 *
## `Mjob=other` -0.20197 0.11479 -1.759 0.079109 .
## `reason=other` -0.41281 0.18469 -2.235 0.025850 *
## traveltime 0.18289 0.08432 2.169 0.030555 *
## failures -0.18177 0.11216 -1.621 0.105736
## health -0.06079 0.04066 -1.495 0.135453
## absences 0.02012 0.01216 1.654 0.098784 .
## G1 0.14774 0.04079 3.622 0.000323 ***
## G2 0.87552 0.03891 22.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.268 on 500 degrees of freedom
## Multiple R-squared: 0.8509, Adjusted R-squared: 0.8479
## F-statistic: 285.3 on 10 and 500 DF, p-value: < 2.2e-16
#Math Data
model_mat <- lm(G3~., train_mat)
step_model_mat <- stepAIC(model_mat, direction = "both", trace = FALSE)
summary(step_model_mat)
##
## Call:
## lm(formula = G3 ~ age + `reason=other` + studytime + `activities=yes` +
## `nursery=yes` + `higher=yes` + `internet=yes` + famrel +
## goout + absences + G1 + G2, data = train_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3331 -0.5044 0.2201 0.9194 3.7953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.61142 1.72810 -0.354 0.723734
## age -0.19488 0.08796 -2.216 0.027479 *
## `reason=other` 0.67213 0.36155 1.859 0.064020 .
## studytime -0.22862 0.12462 -1.834 0.067583 .
## `activities=yes` -0.31706 0.20297 -1.562 0.119346
## `nursery=yes` -0.62767 0.25998 -2.414 0.016370 *
## `higher=yes` 1.09842 0.48365 2.271 0.023858 *
## `internet=yes` -0.53247 0.28516 -1.867 0.062856 .
## famrel 0.42623 0.11176 3.814 0.000167 ***
## goout 0.19007 0.09498 2.001 0.046299 *
## absences 0.04437 0.01276 3.476 0.000585 ***
## G1 0.16679 0.06460 2.582 0.010305 *
## G2 0.97407 0.05854 16.640 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.761 on 296 degrees of freedom
## Multiple R-squared: 0.8537, Adjusted R-squared: 0.8478
## F-statistic: 143.9 on 12 and 296 DF, p-value: < 2.2e-16
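To gauge how the selected models generalize, a sketch of an out-of-sample check on the held-out test sets; rmse() is a small helper defined here for illustration only:
# out-of-sample RMSE of the stepwise models on the held-out test sets (sketch)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test_por$G3, predict(step_model_por, newdata = test_por))
rmse(test_mat$G3, predict(step_model_mat, newdata = test_mat))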
Let’s plot the residuals and the standard diagnostic plots for each model.
#Portuguese Data
# histogram of the residuals from the stepwise Portuguese model
res <- residuals(step_model_por)
res <- as.data.frame(res)
ggplot(res, aes(res)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# standard lm diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow=c(2,2))
plot(step_model_por)
#Math Data
# histogram of the residuals from the stepwise Math model
res <- residuals(step_model_mat)
res <- as.data.frame(res)
ggplot(res, aes(res)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# standard lm diagnostic plots for the Math model
par(mfrow=c(2,2))
plot(step_model_mat)
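As a final sketch beyond the plots, a formal normality check of each model’s residuals:
# Shapiro-Wilk test of residual normality for the stepwise models
shapiro.test(residuals(step_model_por))
shapiro.test(residuals(step_model_mat))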
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.