Math scores are a predictor of overall academic success, as well as a key indicator of precollegiate momentum. Moreover, “…kids who do well in math in high school end up doing well in the labor market,” according to Greg Duncan, a professor of education at the University of California, Irvine (Mongeau). Effectively predicting students’ high school math scores can therefore help predict students’ future success.
For our project, we analyzed data published by the UCI Machine Learning Repository and subsequently made available through Kaggle. The data were obtained from a survey of math students in secondary schools, and the dataset contains a rich set of social, demographic, and academic information about the students.
The dataset contains 30 variables, 17 of which are categorical predictors. The objective is to find the statistical model that best predicts students’ final grades. Grades are recorded on a scale from 0 to 20, with 20 being the highest possible grade. The grades are approximately normally distributed, so no transformation of the response variable was necessary. We ran an initial exploratory analysis to understand the linear relationships among the predictor variables and between the predictors and the response variable (G3 = math score). None of the predictors had a high correlation with math scores, and none of the predictors appeared to be highly correlated with each other.
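As a minimal sketch of this exploratory check (assuming the data have already been read into the data frame `math` used throughout the code below):
# Exploratory check: correlations of the numeric predictors with G3
num.vars <- sapply(math, is.numeric)  # keep the numeric columns only
cor.g3 <- cor(math[, num.vars])[, "G3"]  # correlations with the response
round(sort(abs(cor.g3[names(cor.g3) != "G3"]), decreasing = TRUE), 2)  # none are large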
k <- 5  # number of folds for k-fold CV
n <- seq_len(nrow(math))
index <- sample(n, size = nrow(math), replace = FALSE)
math$index <- index
# CV: 357 rows / 5 folds = 71 rows in each of the first 4 folds, 73 in the last
folds <- ifelse(math$index <= 71, 1,
         ifelse(math$index <= 71 * 2, 2,
         ifelse(math$index <= 71 * 3, 3,
         ifelse(math$index <= 71 * 4, 4, 5))))
table(folds)
## folds
## 1 2 3 4 5
## 71 71 71 71 73
math$index <- NULL
cv.errors <- matrix(NA, k, 6, dimnames = list(NULL, paste(1:6)))
n.pcr <- rep(0, k)  # to record the number of PCs used in the PC regression
We explored five competing methods to find the best model: ordinary least squares, principal component regression, LASSO, random forest, and boosting. For each model, k-fold cross-validation was used to compute an average root mean squared error (RMSE), and the models were then compared on that basis. The model with the lowest RMSE is considered to have the best predictive power. However, the best model for prediction is not always the best for interpretation, so we also judged the models on interpretability.
Ordinary least squares (OLS), also known as linear least squares regression, is a technique commonly used to estimate the relationship between variables. Regression is typically used for two distinct purposes: prediction or inference. We were interested in using OLS regression for prediction, specifically predicting students’ math grades from the information provided by our predictor variables. The key assumptions that must be upheld include linearity, independence, constant variance, and normality; these can be checked graphically by plotting the residuals, as sketched after the OLS cross-validation code below. Based on the residual diagnostic plots, we have no reason to believe any of these assumptions are violated.
# OLS
for (j in 1:k) {
    math.train <- math[folds != j, ]
    math.test <- math[folds == j, ]
    ols.fit <- glm(G3 ~ ., data = math.train)
    # ols.fit <- step(ols.fit, trace = 0)
    # names(ols.fit)
    pred <- predict(ols.fit, math.test)
    cv.errors[j, 1] <- mean((math.test$G3 - pred)^2)
}
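As a hedged sketch of the residual checks described above (fitting one OLS model on the full data rather than within the CV loop; the object name `ols.full` is illustrative):
# Residual diagnostics for an OLS fit on the full data
ols.full <- glm(G3 ~ ., data = math)
par(mfrow = c(2, 2))
plot(ols.full)  # residuals vs. fitted, normal Q-Q, scale-location, and leverage plots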
We utilized two tree-based methods in our analysis. Random forest is a tree-based procedure that expands upon the more basic procedure of bagging. In bagging, a single dataset is expanded into B separate training sets by bootstrapping the original dataset, and a regression tree is built on each of the B training sets. An overall prediction for each observation is obtained by averaging that observation’s predictions from the B regression trees. In random forest, bagging still occurs, but when each tree is built only a random subset of m predictors is considered at each split. By excluding some predictors from each regression tree, we avoid the possibility that a few strong predictors will dominate the tree-building process and lead to highly correlated trees. Random forest therefore reduces variance more than bagging does, which helps us in our effort to build the best model for our data.
We chose to consider 5 predictors at each split (m = 5). It is recommended that this number equal approximately the square root of the total number of predictors, which in our case was 26 (\(\sqrt{26} \approx 5.1\), rounded to 5). We built 5000 trees to estimate our outcome variable, math score; using a large number of trees does not cause a random forest to overfit. The final prediction for each math score was the average of the estimates from the individual trees.
We also built a boosting model, which is conceptually much like random forest; in practice, however, the two procedures have significant differences. Boosting does not use bootstrapping to build its trees. Instead, trees are grown sequentially, with each new tree fit to the residuals of the current model. This process “learns slowly,” improving the fit as each tree is grown, so each additional tree explains something not explained by the previous trees. The speed of learning is determined by the shrinkage parameter \(\lambda\), where a small \(\lambda\) slows the process down. We chose \(\lambda\) = 0.0005 through cross-validation. The interaction depth (d) dictates the number of splits in each tree; because the trees are grown sequentially on residuals, d does not need to be large. We chose d = 3, and as with random forest, we built 5000 trees (B). Again, the result was the predicted math scores accumulated over all of the trees built.
# random forest
for (j in 1:k) {
    math.train <- math[folds != j, ]
    math.test <- math[folds == j, ]
    tree.model <- randomForest(G3 ~ ., data = math.train, mtry = 5, ntree = 5000)
    pred2 <- predict(tree.model, newdata = math.test)
    cv.errors[j, 2] <- mean((math.test$G3 - pred2)^2)
    # boosting
    boost.model <- gbm(G3 ~ ., data = math.train, distribution = "gaussian",
        n.trees = 5000, interaction.depth = 3)
    pred3 <- predict(boost.model, newdata = math.test, n.trees = 5000)
    cv.errors[j, 3] <- mean((math.test$G3 - pred3)^2)
}
cv.errors
##             1        2        3  4  5  6
## [1,] 9.557610 8.710734 8.875719 NA NA NA
## [2,] 9.590165 8.557641 8.925515 NA NA NA
## [3,] 7.500245 7.351461 7.257087 NA NA NA
## [4,] 7.322210 7.936264 7.480028 NA NA NA
## [5,] 9.632696 8.824497 8.826394 NA NA NA
# mean((math.test$G3 - pred3)^2)
Principal component analysis (PCA) is a statistical method used to reduce the dimensions of a dataset that contains many variables. PCA reduces the data by finding linear combinations of the original predictors that explain the greatest amount of variance among the predictor variables. These linear combinations transform the original predictors into a smaller set of features that are linearly uncorrelated; each new feature is called a principal component. PCA is considered an unsupervised learning technique because the user provides no prior information about the individual principal components. The first principal component accounts for as much of the variation in the predictors as possible, and each subsequent principal component accounts for as much of the remaining variation as possible after the previous components have been established. Principal component regression (PCR) is a regression method that uses the principal components found by PCA as the regressors for a response variable of interest. After running PCA on our data, we decided to include 15 principal components in our PCR model. This number was selected through cross-validation, using MSEP (mean squared error of prediction) as our criterion.
Since none of the predictors are highly correlated with our response, the number of principal components suggested by cross-validation varied each time we ran PCA. We ran PCA 50 times, and 15 was the average number of suggested principal components across all runs. After conducting PCA, we built a PCR model to try to improve predictive power over our baseline OLS model.
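A minimal sketch of that repeated selection is below; the loop is illustrative rather than the original code (no seed is set), and it uses the cross-validated PRESS stored by the pls package as the selection criterion:
# Illustrative: repeat the CV-based choice of the number of components and average it
library(pls)
n.comp <- replicate(50, {
    fit <- pcr(G3 ~ ., data = math, scale = TRUE, validation = "CV")
    which.min(fit$validation$PRESS)  # components minimizing the cross-validated PRESS
})
mean(n.comp)  # averaged to roughly 15 across runs in our analysis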
# For the GAM, principal components are computed first and then used as
# predictors in the gam.
for (j in 1:k) {
    math.train <- math[folds != j, ]
    math.test <- math[folds == j, ]
    # PCR: the commented-out loop below chooses the number of principal
    # components achieving the minimum test MSE
    check <- rep(NA, ncol(math.train) - 1)
    # for (m in 1:(ncol(math.train) - 1)) {
    #     pcr.fit <- pcr(G3 ~ ., data = math.train, scale = TRUE, validation = "CV")
    #     pcr.pred <- predict(pcr.fit, math.test, ncomp = m)
    #     check[m] <- mean((pcr.pred - math.test$G3)^2)
    # }
    # cv.errors[j, 4] <- min(check)
    # n.pcr[j] <- which.min(check)  # number of components achieving the minimum MSE
    pcr.fit <- pcr(G3 ~ ., data = math.train, scale = TRUE, validation = "CV")
    # validationplot(pcr.fit, val.type = "MSEP"); summary(pcr.fit)
    pcr.pred <- predict(pcr.fit, math.test, ncomp = 3)
    cv.errors[j, 4] <- mean((pcr.pred - math.test$G3)^2)
    # Generalized additive model (GAM) on the principal components
    loading <- as.matrix(pcr.fit$loadings)  # PC loadings
    x.train <- model.matrix(G3 ~ ., math.train)[, -1]  # X matrix
    comp.train <- x.train %*% loading  # principal component scores
    pls.train <- data.frame(G3 = math.train$G3, comp.train)  # training set for the GAM
    # The GAM uses smoothing splines with default smoothing parameters on the
    # leading components. Because the scree plot has no clear elbow, choosing the
    # number of components is difficult; PCR performed better than the GAM and is
    # simpler to interpret, so PCR may be preferred over the GAM.
    add.model <- gam(G3 ~ s(Comp.1) + s(Comp.2) + s(Comp.3) + s(Comp.4) + s(Comp.5) +
        s(Comp.6) + s(Comp.7), data = pls.train)
    x.test <- model.matrix(G3 ~ ., math.test)[, -1]
    comp.test <- x.test %*% loading
    pls.test <- data.frame(Y = math.test$G3, comp.test)
    gam.pred <- predict(add.model, newdata = pls.test)
    # gam.pred <- exp(gam.pred)
    cv.errors[j, 5] <- mean((math.test$G3 - gam.pred)^2)
}
LASSO (least absolute shrinkage and selection operator) is a variable selection technique for ordinary least squares models. LASSO uses an L1 penalty, which limits the sum of the absolute values of the coefficients allowed in the model and thereby shrinks the coefficient estimates toward zero. The tuning parameter \(\lambda\) scales this penalty and thus dictates how many parameters effectively remain in the model. When \(\lambda\) is large, a greater penalty is imposed, some coefficients shrink all the way to zero, and those parameters are removed from the model. When \(\lambda\) = 0, no penalty is imposed and ordinary OLS is performed. Using cross-validation, we selected \(\lambda\) = 0.0989, because this was the value of \(\lambda\) that minimized the RMSE.
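Formally, in the notation of James et al. (2014), the LASSO coefficient estimates minimize the penalized residual sum of squares
\[
\hat{\beta}^{L} = \underset{\beta}{\arg\min}\left\{ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \right\}.
\]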
# LASSO (x is the predictor model matrix and math2 the data frame, both built earlier)
for (j in 1:k) {
    math2.train <- math2[folds != j, ]
    math2.test <- math2[folds == j, ]
    g3.train <- math2$G3[folds != j]
    g3.test <- math2$G3[folds == j]
    x.train <- x[folds != j, ]
    x.test <- x[folds == j, ]
    set.seed(1)
    cv.out <- cv.glmnet(x.train, g3.train, alpha = 1)
    # plot(cv.out)
    bestlam <- cv.out$lambda.min
    # bestlam <- cv.out$lambda.1se
    lasso.model <- glmnet(x.train, g3.train, alpha = 1, lambda = bestlam)
    pred5 <- predict(lasso.model, s = bestlam, newx = x.test)
    cv.errors[j, 6] <- mean((math2.test$G3 - pred5)^2)
}
# Scree plot
X <- model.matrix(G3 ~ ., math)[, -1]  # numeric design matrix (encodes the categorical predictors)
ev <- eigen(t(X) %*% X)
lam <- ev$values
VECT <- ev$vectors
pct <- lam / sum(lam)  # proportion of variation explained by each component
cpct <- cumsum(lam) / sum(lam)  # cumulative proportion explained
par(mfrow = c(1, 1))
plot(pct, type = "b", xlab = "j", ylim = 0:1, ylab = "Percent variation explained",
    main = "Scree plot")
plot(cpct, type = "b", xlab = "j", ylim = 0:1,
    ylab = "Cumulative percent variation explained")
abline(h = 0.50, col = "green")
abline(h = 0.80, col = "yellow")
abline(h = 0.90, col = "orange")
abline(h = 0.95, col = "red")
We compared five model-building methods to predict students’ math scores: (1) ordinary least squares, (2) principal component regression, (3) LASSO, (4) random forest, and (5) boosting. We used root mean squared error (RMSE) to compare the efficacy of the different methods. RMSE is the square root of the mean squared prediction error and describes how spread out the model residuals are. When the residuals are more concentrated, the model fits better and the RMSE is lower; our goal in model selection is therefore to minimize RMSE.
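In symbols, for predicted grades \(\hat{y}_i\) and observed grades \(y_i\),
\[
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.
\]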
For each method, k-fold cross-validation was used to compute the RMSE. K-fold cross-validation splits the data into k mutually exclusive parts of roughly equal size. The model is built on k-1 parts and then used to predict the response on the remaining part, which was not used in model building. This process is repeated so that each part is held out once, and each time the resulting test error is recorded. In the end, the errors are averaged across the k iterations to get a robust estimate of prediction error for the specified method. We chose k = 5, as this provides a good bias-variance tradeoff: a large k (such as leave-one-out) would give estimates with very low bias but higher variance, and would also be computationally expensive. Therefore k = 5 gives us an accurate estimate at reasonable cost, without overfitting to any single split of the data.
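In terms of the fold-level errors, the quantity we report is
\[
\text{RMSE}_{CV} = \sqrt{\frac{1}{k}\sum_{j=1}^{k}\text{MSE}_j},
\]
where \(\text{MSE}_j\) is the test mean squared error on the j-th held-out fold; this corresponds to the sqrt(colMeans(cv.errors)) computation below.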
To select the best model, we compared the RMSE for each of the model building methods we considered. The results are shown in the table below.
cv_rmse <- matrix(sqrt(colMeans(cv.errors)), nrow = 1)
colnames(cv_rmse) <- c("OLS", "RandomForest", "Boosting", "PCA", "GAM", "LASSO")
rownames(cv_rmse) <- c("RMSECV")
cv_rmse
##             OLS RandomForest Boosting      PCA      GAM    LASSO
## RMSECV 2.953064     2.876825 2.876273 3.126554 2.994361 2.920632
n.pcr  # number of PCs used for the regression; with 5-fold CV there are 5 values
       # (all zero here because the component-selection loop above was left commented out)
## [1] 0 0 0 0 0
Random forest and boosting have the best performance in terms of predictive power. LASSO is third, however, and is a simpler model with better interpretability. LASSO likely performs better than OLS because it is less flexible and therefore has lower variance, which suggests OLS is overfitting somewhat. Similarly, PCR and the GAM are likely overfitting by using too many components, thereby inflating the variance. Random forest and boosting are less susceptible to overfitting and are powerful predictors, so it is not surprising that they perform well. We would select LASSO as the final model because of its relative simplicity and good performance.
For future research, we would be interested in comparing other techniques’ predictive power on our dataset. Specifically, we would like to determine whether nonparametric methods such as smoothing splines or generalized additive models would further reduce our prediction error. We also believe that the addition of socioeconomic variables, such as students’ family income, could provide powerful predictors that would allow us to predict students’ final math grades more accurately.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
Mongeau, Lillian. “Early Math Matters.” EdSource, 1 Dec. 2013, edsource.org/2013/early-math-matters-top-researcher-discusses-his-work/50061.