For this project, I wanted to take on a task at the school where I work that has long interested me. The reason I'm in this program (besides finding the material fun and interesting) is so that I can use some of these skills to help my school improve. To that end, I wanted to build a simple predictor of student scores in courses.
At my school, students take courses from a local community college during their high school years. We have had great difficulty deciding when to recommend students for particular college courses, so my goal with this project was to add another data point to the recommendation process: the grade we think they'll receive.
The first step was to load the required libraries and the datasets in their base form. There were quite a few libraries that were needed for this project!
library(ggplot2)
library(tidyverse)
## -- Attaching packages ----------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.5
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## v purrr 0.3.3
## Warning: package 'dplyr' was built under R version 3.6.3
## -- Conflicts -------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ranger)
library(missRanger)
## Warning: package 'missRanger' was built under R version 3.6.3
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(mongolite)
## Warning: package 'mongolite' was built under R version 3.6.3
library(keyring)
The first dataset was loaded and given some initial cleaning in a single pipeline. The cleaning just separated the course code data into usable columns and removed some unneeded columns. I also created a column indicating whether a course was a college course (many of the students at my school take college courses).
set.seed(52638)

# the target course to predict (geometry, in this example)
course <- "MGS22"
url <- "https://github.com/dmoste/DATA607/blob/master/Final/all_transcripts.csv?raw=true"
df <- read.csv(url, header = TRUE) %>%
  # split the raw course code on "*" into the code itself and a GPA flag
  separate(Table, c("Code","GPA.Modify"), sep = "\\*") %>%
  # drop columns that won't be used for prediction
  select(-GPA.Modify,
         -Term.Average.Display,
         -Credits.Total,
         -Credit.Earned.Total,
         -Credits.Average,
         -Credits.Total.1,
         -Cumulative.Term.Average1,
         -Mark.1,
         -Credits.Average.1,
         -Credit.Earned.Total.1,
         -CreditEarned.1,
         -Mark,
         -CreditEarned) %>%
  # flag college courses: codes ending in "U", plus one special case
  mutate(College = ifelse(grepl("U$", Code) | Code == "HBS11UA", "Y", "N")) %>%
  tbl_df()
## Warning: Expected 2 pieces. Additional pieces discarded in 290 rows [2, 3, 5,
## 83, 84, 87, 114, 294, 301, 359, 430, 663, 787, 867, 994, 1083, 1341, 1545, 1796,
## 2007, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 22556 rows [1, 4,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, ...].
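Those warnings are expected: not every raw Table value contains exactly one "*". As a tiny made-up illustration (these example values are invented, not from the real data), a clean "CODE*flag" value splits in two, a bare code leaves GPA.Modify as NA, and a value with two "*"s has its extra piece discarded:

# toy demonstration of both warnings; the values here are made up
toy <- tibble(Table = c("MGS22*4", "EES21", "ABC*1*2"))
separate(toy, Table, c("Code", "GPA.Modify"), sep = "\\*")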
The second dataset was attendance data, loaded from MongoDB (my second data source). No cleaning was required for this set.
# build the connection string, pulling my password from the OS keyring
mongo_url = str_c('mongodb+srv://dmoste:',
                  key_get("MongoDB","dmoste"),
                  '@cuny-msds-wnup9.gcp.mongodb.net/test')

mongo_data <- mongo(collection = "Attendance", db = "DATA607_Final",
                    url = mongo_url,
                    verbose = TRUE)

# an empty query returns every document in the collection
attendance <- mongo_data$find('{}')
## Found 516 records...
## Imported 516 records. Simplifying into dataframe...
Each row of the data describes a single course taken by a student, with rows grouped by student. Unfortunately, StudentID is only listed in the last row for each student, so I used nafill with "next observation carried backwards" to pull each StudentID up into the rest of that student's rows. I also converted the Code values into syntactically valid names so that each course code could later serve as a feature column for my predictor.
df$StudentID <- nafill(df$StudentID, type = "nocb")   # fill IDs upward
df$Code <- make.names(df$Code, unique = FALSE, allow_ = FALSE)   # valid column names
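To see what type = "nocb" does here, a quick example with made-up IDs: the next non-missing value is copied upward over the NAs above it, exactly the pattern needed when the ID sits on each student's last row.

# made-up IDs, just to illustrate the direction of the fill
nafill(c(NA, NA, 101L, NA, 102L), type = "nocb")
## [1] 101 101 101 102 102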
Here, I used pivot_wider to continue the cleaning, turning each course into its own column so that courses could serve as features for prediction. I also joined my two datasets together at this step.
df <- df %>%
  # one column per course code, holding the numeric grade
  pivot_wider(names_from = Code,
              values_from = NumericEquivalent) %>%
  group_by(StudentID) %>%
  left_join(attendance, by = "StudentID") %>%
  # collapse to one row per student; max() over an all-NA group yields -Inf
  summarise_if(is.numeric,
               max,
               na.rm = TRUE)
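As a small made-up example of the reshape (the grades here are invented), each distinct Code becomes its own column holding that student's grade:

# made-up grades, just to show the long-to-wide reshape
long <- tibble(StudentID = c(1, 1, 2),
               Code = c("MGS21", "MGS22", "MGS21"),
               NumericEquivalent = c(85, 90, 78))

# yields one row per student with MGS21 and MGS22 columns;
# student 2's MGS22 is NA, since they never took that course
pivot_wider(long, names_from = Code, values_from = NumericEquivalent)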
Next, I replaced all the -Inf values created by my summarise_if statement with NAs so that I could impute them in the next step. I also removed columns where more than 95% of students in the school had not taken the course; such sparse courses won't be relevant predictors, and dropping them saves time in the prediction phase.
df[df == -Inf] <- NA

# keep only columns that are at least 5% non-missing
df <- df[, which(colMeans(!is.na(df)) > 0.05)]

df <- df %>%
  # flag students whose grade in the target course will be imputed
  mutate(Imputed = ifelse(is.na(df[course]), 1, 0)) %>%
  select(-TermCD, -SchoolYear)
At this point, I still have a bunch of NAs, which would cause major issues for my predictor, since ranger cannot handle missing values. I used the missRanger package here to impute (predict) these values.
df <- missRanger(df, num.trees = 100, maxiter = 5)
##
## Missing value imputation by random forests
##
## Variables to impute: EES21, FSS63X, FSS64X, HES11, MQS11U, PHS11, PPS87, RQS43T, EES88X, FSS62QA, HUS22, MRS22, PPS86, RQS42T, SPS22, TQS22, EES87X, FSS61QA, HUS21, MRS21, PPS85, RQS41T, SPS21, TQS11U, CHS22, EES86, HBS11U, HGS43, HGS44, MGS22, PPS84, RVS22T, SCS22, CHS21, EES85, HGS41, HGS42, MGS21, PPS83, RVS21T, SCS21, ANS21, EES83, EES84, MES22, PPS82, SLS22, SLS22QL, SQS22T, EES81, EES82, MES21, PPS81, RZS21T, SLS21, SLS21QL, SQS21T, HXRUE., MXRNE., SXRPE., HXRTE., MXRKE., SXRXE., EXRCR., MXRCE., SXRKE., HVS11, MPS22, PPS88, RQS44T, EWS11U, EES87, TCS11U, RZS22T, SQS11T, MKS11, PHS22, PHS21, HXRGE., MES41, MES42, HUS21X, MQS21, ANS22, FSS62, FSS61, SES22, MXRCR., EXRCE., AUS22, AUS21, MES43, MES44, MXRKG., HXRCE., MXRCG., MPS21, SES21, SES21QL, SXRUE., AJS21, EWS11UA, SQS21QA, SKS21XP, SWS21X, MEN11A, SLN11A, FSN31A, FX2SE., MES21S, AJS11, EES88, SDS22, SDS21
## Variables used to impute: StudentID, Credits, EES21, FSS63X, FSS64X, HES11, MQS11U, PHS11, PPS87, RQS43T, EES88X, FSS62QA, HUS22, MRS22, PPS86, RQS42T, SPS22, TQS22, EES87X, FSS61QA, HUS21, MRS21, PPS85, RQS41T, SPS21, TQS11U, CHS22, EES86, HBS11U, HGS43, HGS44, MGS22, PPS84, RVS22T, SCS22, CHS21, EES85, HGS41, HGS42, MGS21, PPS83, RVS21T, SCS21, ANS21, EES83, EES84, MES22, PPS82, SLS22, SLS22QL, SQS22T, EES81, EES82, MES21, PPS81, RZS21T, SLS21, SLS21QL, SQS21T, HXRUE., MXRNE., SXRPE., HXRTE., MXRKE., SXRXE., EXRCR., MXRCE., SXRKE., HVS11, MPS22, PPS88, RQS44T, EWS11U, EES87, TCS11U, RZS22T, SQS11T, MKS11, PHS22, PHS21, HXRGE., MES41, MES42, HUS21X, MQS21, ANS22, FSS62, FSS61, SES22, MXRCR., EXRCE., AUS22, AUS21, MES43, MES44, MXRKG., HXRCE., MXRCG., MPS21, SES21, SES21QL, SXRUE., AJS21, EWS11UA, SQS21QA, SKS21XP, SWS21X, MEN11A, SLN11A, FSN31A, FX2SE., MES21S, AJS11, EES88, SDS22, SDS21, Attendance, Imputed
## iter 1: ..................................................................................................................
## iter 2: ..................................................................................................................
## iter 3: ..................................................................................................................
## iter 4: ..................................................................................................................
## iter 5: ..................................................................................................................
I’ve almost made it to my predictor! I just have to create a training dataset and a holdout dataset.
# 75/25 train/holdout split
sample_size = floor(0.75*nrow(df))
picked = sample(seq_len(nrow(df)),
                size = sample_size)

training = df[picked,]
holdout = df[-picked,]
Finally, I get to create my random forest predictor! The parameter values I used came from a grid search I set up on the side to determine which values gave me the lowest root mean squared error.
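That grid search isn't reproduced in this writeup, but a minimal sketch of what it looked like follows; the parameter ranges here are illustrative, not my exact grid.

# illustrative sketch of the side grid search (ranges are assumptions)
hyper_grid <- expand.grid(mtry = c(20, 28, 36, 44),
                          min.node.size = c(1, 3, 5),
                          sample.fraction = c(0.6, 0.7, 0.8))
hyper_grid$OOB_RMSE <- NA_real_

for(i in seq_len(nrow(hyper_grid))) {
  fit <- ranger(formula = MGS22 ~ .,
                data = training,
                num.trees = 500,
                mtry = hyper_grid$mtry[i],
                min.node.size = hyper_grid$min.node.size[i],
                sample.fraction = hyper_grid$sample.fraction[i])
  hyper_grid$OOB_RMSE[i] <- sqrt(fit$prediction.error)
}

# the combination with the lowest out-of-bag RMSE wins
hyper_grid[which.min(hyper_grid$OOB_RMSE), ]

With the winning combination in hand, I then refit 100 models to check the stability of the out-of-bag RMSE.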
OOB_RMSE <- vector(mode = "numeric", length = 100)

# Use the best parameter values to create 100 different models
for(i in seq_along(OOB_RMSE)) {
  optimal_ranger <- ranger(
    formula = MGS22 ~ .,
    data = training,
    num.trees = 500,
    mtry = 36,
    min.node.size = 1,
    sample.fraction = .8,
    importance = 'impurity')

  # record each model's out-of-bag RMSE
  OOB_RMSE[i] <- sqrt(optimal_ranger$prediction.error)
}
Next, I plotted a histogram of the out-of-bag RMSE values just to make sure their distribution was approximately normal.
hist(OOB_RMSE, breaks = 20)
After that confirmation, I loaded the variable importance values into a dataframe and plotted them so I could easily see what the most important predictors were.
importance <- data.frame(optimal_ranger$variable.importance)
# convert to a data.table, keeping the variable names as a column
setDT(importance, keep.rownames = TRUE)[]
## rn optimal_ranger.variable.importance
## 1: StudentID 44.141286
## 2: Credits 2.020782
## 3: EES21 183.531950
## 4: FSS63X 30.573187
## 5: FSS64X 26.721342
## ---
## 113: EES88 1144.127942
## 114: SDS22 1347.114951
## 115: SDS21 66.323161
## 116: Attendance 19.250699
## 117: Imputed 5.096869
names(importance) <- c("Variable", "Value")
importance %>%
arrange(desc(Value)) %>%
top_n(20) %>%
ggplot(aes(reorder(Variable, Value), Value)) +
geom_col() +
coord_flip() +
ggtitle("Top 20 important variables")
## Selecting by Value
Now I wanted to run my predictor on my holdout dataset and plot the results.
pred_ranger <- predict(optimal_ranger, holdout)
plot(pred_ranger[["predictions"]], holdout[[course]])
Then I created a linear model to see how good my fit was.
m_all <- lm(holdout[[course]] ~ pred_ranger[["predictions"]])
summary(m_all)
##
## Call:
## lm(formula = holdout[[course]] ~ pred_ranger[["predictions"]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5893 -1.5402 -0.2764 1.8775 12.5151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.13247 2.71666 -2.625 0.00972 **
## pred_ranger[["predictions"]] 1.09063 0.03546 30.757 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.537 on 127 degrees of freedom
## Multiple R-squared: 0.8816, Adjusted R-squared: 0.8807
## F-statistic: 946 on 1 and 127 DF, p-value: < 2.2e-16
plot(pred_ranger[["predictions"]], holdout[[course]])
abline(m_all)
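As a quick numeric check alongside the fitted line (this line is an addition, not part of the original workflow), the holdout RMSE of the raw predictions can be computed directly:

# RMSE of the ranger predictions on the holdout set
sqrt(mean((pred_ranger[["predictions"]] - holdout[[course]])^2))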
Before interpreting the statistical output of the model, I stated the null and alternative hypotheses.
\(H_0:\) the true linear model has a slope of zero.
\(H_A:\) the true linear model has a slope different from zero; that is, the ranger predictions are predictive of course scores.
The equation created by the linear model:
\(\widehat{score}=-7.13+1.09\times prediction\)
A 95% confidence interval for this model:
\(CI=b\pm{t}^{*}_{df}\times SE_{b}\)
\(CI=1.09\pm 1.979\times 0.03546\)
\(CI=\left(1.02,1.16\right)\)
This confidence interval indicates that we are 95% confident that each one-point increase in the prediction corresponds, on average, to an increase of 1.02 to 1.16 points in the actual score.
It's pretty clear from this analysis that we can reject the null hypothesis.
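For reference, the same interval can be pulled straight from the fitted model rather than computed by hand; the slope row should match the calculation above.

# 95% CI for the intercept and slope of the fit
confint(m_all, level = 0.95)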
Next, I circled back: I knew that some of the target values I had imputed might be inflating the apparent quality of this model. I wanted to look at that breakdown, then remove the imputed values and fit a new model.
# data.frame() mangles the prediction column's name into
# "pred_ranger...predictions..."
pf <- data.frame(holdout, pred_ranger[["predictions"]])
pf$Imputed <- as.factor(pf$Imputed)

ggplot(pf,
       aes_string(x = "pred_ranger...predictions...",
                  y = course,
                  color = "Imputed")) + geom_point()

# keep only students with a real (non-imputed) grade in the course
pf <- filter(pf, Imputed == 0)

m_real <- lm(pf[[course]] ~ pf$pred_ranger...predictions...)
plot(pf$pred_ranger...predictions..., pf[[course]])
abline(m_real)
summary(m_real)
##
## Call:
## lm(formula = pf[[course]] ~ pf$pred_ranger...predictions...)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0429 -4.1705 -0.4948 4.3954 12.6548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.90104 4.62861 -2.787 0.00675 **
## pf$pred_ranger...predictions... 1.16584 0.06075 19.192 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.781 on 74 degrees of freedom
## Multiple R-squared: 0.8327, Adjusted R-squared: 0.8304
## F-statistic: 368.3 on 1 and 74 DF, p-value: < 2.2e-16
As with the first model, I stated the null and alternative hypotheses before interpreting the output.
\(H_0:\) the true linear model has a slope of zero.
\(H_A:\) the true linear model has a slope different from zero; that is, the ranger predictions are predictive of course scores.
The equation created by the linear model:
\(\widehat{score}=-12.9+1.166\times prediction\)
A 95% confidence interval for this model:
\(CI=b\pm{t}^{*}_{df}\times SE_{b}\)
\(CI=1.166\pm 1.99\times 0.06075\)
\(CI=\left(1.05,1.29\right)\)
This confidence interval indicates that we are 95% confident that each one-point increase in the prediction corresponds, on average, to an increase of 1.05 to 1.29 points in the actual score.
It's pretty clear from this analysis that we can reject the null hypothesis here as well.
This was a super fun project to work on, but I really wish I had MORE DATA POINTS! While this predictor does a decent job for courses with lots of data points, such as geometry (the course used in this example), the reliability obviously decreases significantly when you try it out on college courses, for which there is much less data. I’d love to keep the data I have and add to it each year so that I can build up a more robust predictor. I’d also like to add in professor data if possible, because I imagine that would make a difference as well.