During this tutorial we will explore a complex dataset and use it to predict student dropout. The main goals of this tutorial are to extract predictors and outcomes from raw activity data, train and evaluate classification and regression models, and use those models in a prediction dashboard.
This dataset is available at https://analyse.kmi.open.ac.uk/open_dataset and corresponds to student activity in several MOOCs of the Open University UK.
First, we will import the CSV files with the data.
library(readr)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v dplyr 1.0.2
## v tibble 3.0.3 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
assessments <- read_csv("data/assessments.csv")
## Parsed with column specification:
## cols(
## code_module = col_character(),
## code_presentation = col_character(),
## id_assessment = col_double(),
## assessment_type = col_character(),
## date = col_double(),
## weight = col_double()
## )
courses <- read_csv("data/courses.csv")
## Parsed with column specification:
## cols(
## code_module = col_character(),
## code_presentation = col_character(),
## module_presentation_length = col_double()
## )
studentAssessment <- read_csv("data/studentAssessment.csv")
## Parsed with column specification:
## cols(
## id_assessment = col_double(),
## id_student = col_double(),
## date_submitted = col_double(),
## is_banked = col_double(),
## score = col_double()
## )
studentInfo <- read_csv("data/studentInfo.csv")
## Parsed with column specification:
## cols(
## code_module = col_character(),
## code_presentation = col_character(),
## id_student = col_double(),
## gender = col_character(),
## region = col_character(),
## highest_education = col_character(),
## imd_band = col_character(),
## age_band = col_character(),
## num_of_prev_attempts = col_double(),
## studied_credits = col_double(),
## disability = col_character(),
## final_result = col_character()
## )
studentRegistration <- read_csv("data/studentRegistration.csv")
## Parsed with column specification:
## cols(
## code_module = col_character(),
## code_presentation = col_character(),
## id_student = col_double(),
## date_registration = col_double(),
## date_unregistration = col_double()
## )
studentVle <- read_csv("data/studentVle.csv")
## Parsed with column specification:
## cols(
## code_module = col_character(),
## code_presentation = col_character(),
## id_student = col_double(),
## id_site = col_double(),
## date = col_double(),
## sum_click = col_double()
## )
vle <- read_csv("data/vle.csv")
## Parsed with column specification:
## cols(
## id_site = col_double(),
## code_module = col_character(),
## code_presentation = col_character(),
## activity_type = col_character(),
## week_from = col_double(),
## week_to = col_double()
## )
Then we create a set of functions that will enable us to extract the outcomes from the different tables. This information includes the final status (Pass, Fail), whether the student has withdrawn (dropout), and the final score in the course.
First, we get the final state (Pass, Fail) of the students in a given course. If the student has a “Withdrawn” state, we remove them from the list. If the student has “Distinction” as a final state, we count it as a “Pass”.
getFinalState<-function(course)
{
studentInfo%>%
filter(code_module==course) %>%
filter(final_result!="Withdrawn") %>%
mutate(final_result=ifelse(final_result=="Distinction","Pass",final_result))%>%
mutate(final_result=as.factor(final_result))%>%
select("id_student", "final_result")
}
Explanation:
This function takes as a parameter the name of the course from which we want to extract the information.
We take the studentInfo dataset and first filter only those rows in which the “code_module” column matches the course that we want to select.
Then we keep only those rows in which the “final_result” is NOT “Withdrawn” (we use != to mean “not equal to”).
Then we use mutate to change the final_result value. We use the ifelse function to check if the content is “Distinction”. If it is, we change it to “Pass”; if not, we leave whatever value was there originally.
Finally, we only retain the “id_student” and the “final_result” columns.
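As a quick, optional check (a sketch; output not shown), we could inspect the Pass/Fail distribution for the “DDD” course that we will use later in this tutorial:
# Illustrative check: distribution of final states for course DDD
table(getFinalState("DDD")$final_result)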
Next, we will create a function to determine whether a student has dropped out of the course. We do this by checking if the final state is “Withdrawn”.
getDropout<-function(course)
{
studentInfo%>%
filter(code_module==course) %>%
mutate(dropout=ifelse(final_result=="Withdrawn",1,0))%>%
mutate(dropout=as.factor(dropout))%>%
select("id_student", "dropout")
}
Explanation:
This function takes as a parameter the name of the course from which we want to extract the information.
We take the studentInfo dataset and first filter only those rows in which the “code_module” column matches the course that we want to select.
Then we create a new column (“dropout”) that will contain a 1 if the “final_result” column contains “Withdrawn”, or a 0 if it contains anything else.
Finally, we only retain the “id_student” and the “dropout” columns.
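For illustration (a sketch; output not shown), the dropout rate of a course can be computed directly from the output of this function:
# Illustrative check: proportion of students that dropped out of course DDD
prop.table(table(getDropout("DDD")$dropout))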
Continuing, we will create a function to obtain the final grade of the student in a given course. This is a little more complicated than the previous functions because we need to use two datasets. First, we need to find the ID code of the “Exam” assessment for the course in the “assessments” dataset. Then we need to use those ID codes to select only those assessments from the “studentAssessment” dataset.
getFinalGrade<-function(course)
{
finalExam<-assessments%>%
filter(code_module==course)%>%
filter(assessment_type=="Exam")
inner_join(studentAssessment,finalExam, by = c("id_assessment" = "id_assessment"))%>%
select("id_student","score")
}
Explanation:
This function takes as a parameter the name of the course from which we want to extract the information.
First we create a new dataset (“finalExam”) that will contain only the information of the “Exam” assessments for the given course. To create this dataset, we start from the “assessments” dataset, filter only the course that we want, and then keep only those rows that contain the word “Exam” in the “assessment_type” column.
Then we combine the “studentAssessment” dataset with our newly created “finalExam” dataset. To do this, we use the inner_join function, which matches rows from both datasets that share a value. In this case we use the value in the “id_assessment” column of both datasets to create the link. This inner join preserves only the rows in the first dataset (“studentAssessment”) that correspond to the ID codes of the exams that we extracted in “finalExam”.
Finally, we only retain the “id_student” and the “score” columns.
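As an optional check (a sketch; output not shown), we could summarise the distribution of final exam scores returned by this function:
# Illustrative check: distribution of final exam scores for course DDD
summary(getFinalGrade("DDD")$score)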
For the next functions, which will extract the predictors, we need to specify not only the course from which we want the information, but also the number of days since the start of the course at which we want to make the prediction.
We will start with the information about the assessments delivered by the students. We will extract two predictors: the average grade of the assessments submitted up to that date, and the total number of assessments submitted.
getAssessmentPredictors<-function(course,days)
{
courseAssessments<-assessments%>%
filter(code_module==course)%>%
filter(date<days)
studentAssessment%>%
filter(id_assessment %in% courseAssessments$id_assessment)%>%
group_by(id_student)%>%
summarise(avgScore=mean(score),delivered=n())
}
Explanation:
This function takes two parameters: 1) the name of the course from which we want to extract the information, and 2) the day at which we cut the information (we only consider information prior to this day in the course).
First we create a new dataset (“courseAssessments”) that will contain only the assessments in the specified course with a deadline that is before the cut-off day.
Then, from the studentAssessment dataset, we extract only the rows in which the ID of the assessment is contained in the “courseAssessments” list obtained before. Then we group the resulting information by student and calculate the average score (“mean(score)”), which we store in the “avgScore” variable, and the number of delivered assignments (“n()”), which we store in the “delivered” variable.
The resulting dataset has only three columns: “id_student”, because it was used to group the calculation, and the “avgScore” and “delivered” columns that we calculated.
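For example (a sketch; output not shown), the assessment predictors for course “DDD” at day 50 can be previewed with:
# Illustrative preview: assessment predictors for course DDD at day 50
head(getAssessmentPredictors("DDD", 50))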
Next, we obtain information about the lateness of assessment submissions. For this we need the deadline of each assessment, and we compare it with the submission day to calculate whether the assessment was delivered late.
getLateAssessments<-function(course,days)
{
courseAssessments<-assessments%>%
filter(code_module==course)%>%
filter(date<days)
fullAssessments<-inner_join(studentAssessment,courseAssessments,by = c("id_assessment" = "id_assessment"))
fullAssessments%>%
mutate(delay=ifelse(date<date_submitted,1,0))%>%
group_by(id_student)%>%
summarise(sumDelays=sum(delay))
}
Explanation:
This function takes two parameters: 1) the name of the course from which we want to extract the information, and 2) the day at which we cut the information (we only consider information prior to this day in the course).
First we create a new dataset (“courseAssessments”) that will contain only the assessments in the specified course with a deadline that is before the cut-off day.
Then, we join (inner_join) that information with the “studentAssessment” dataset, where both assessment IDs (id_assessment) are equal.
We use this merged dataset (“fullAssessments”) to calculate a “delay” column that will be 1 if the date of submission (“date_submitted”) is later than the deadline. Then we group by the ID of the student and calculate the number of delays (summing the delay column).
The resulting dataset has only two columns: “id_student”, because it was used to group the calculation, and the “sumDelays” column that we calculated.
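As an illustration (a sketch; output not shown), we could count how many students submitted at least one assessment late:
# Illustrative check: students with at least one late submission in course DDD by day 50
sum(getLateAssessments("DDD", 50)$sumDelays > 0)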
Finally, we will get information from the number of clicks in the VLE (Virtual Learning Environment). We will extract three predictors: the total number of clicks, the average number of clicks per day, and the number of active days.
getClickInfo<-function(course,days)
{
studentVle%>%
filter(code_module==course)%>%
filter(date<days+1)%>%
group_by(id_student,date)%>%
summarise(daily_clicks=sum(sum_click),daily_elements=n())%>%
group_by(id_student)%>%
summarise(total_clicks=sum(daily_clicks),total_elements=sum(daily_elements),active_days=n())%>%
mutate(average_daily_clicks=total_clicks/active_days,average_elements=total_elements/active_days)
}
Explanation:
This function takes two parameters: 1) the name of the course from which we want to extract the information, and 2) the day at which we cut the information (we only consider information prior to this day in the course).
From the “studentVle” dataset, we filter only the information related to the selected course.
Then we only select information that happened before the cut-off day.
Then we group the data by student and day, so we can get the daily activity per student. We obtain the number of daily clicks by adding all the clicks made during that day on different elements (“sum(sum_click)”) and the number of elements clicked on that day (“n()”).
Then we group only by student to obtain totals for each student.
We obtain the total clicks by adding all the daily_clicks, the total number of elements by adding the daily_elements, and the total number of days that the student was active in the VLE (“n()”).
Finally, we calculate the daily average values of clicks and elements by dividing them by the total number of active days.
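Again as an optional check (a sketch; output not shown), the VLE predictors for course “DDD” at day 50 can be previewed with:
# Illustrative preview: VLE click predictors for course DDD at day 50
head(getClickInfo("DDD", 50))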
First we will put together the extracted numerical predictors for a given course (“DDD”) and cut-off day (50).
course="DDD"
days=50
clicksInfo<-getClickInfo(course,days)
## `summarise()` regrouping output by 'id_student' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
assessInfo<-getAssessmentPredictors(course,days)
## `summarise()` ungrouping output (override with `.groups` argument)
latenessInfo<-getLateAssessments(course,days)
## `summarise()` ungrouping output (override with `.groups` argument)
temp1<-merge(clicksInfo,assessInfo,by="id_student")
predictors<-merge(temp1,latenessInfo,by="id_student")
Explanation:
First we define “DDD” as the course that we are interested in.
Then we set 50 as the cut-off day at which the prediction will be performed.
In the dataset clicksInfo, we store the features from the interactions with the VLE.
In the dataset assessInfo, we store the features from the assessments.
In the dataset latenessInfo, we store the features describing which assessments were submitted late.
Then we merge these 3 datasets into predictors. First we merge clicksInfo and assessInfo by the ID of the student, then we merge the result with latenessInfo, again by the ID of the student. The resulting “predictors” dataset has all the variables that we extracted from the VLE and assessment information.
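The same merge can also be written as a single chain of dplyr joins (a sketch; “predictorsAlt” is a hypothetical name used only for illustration, and inner_join with its default settings behaves like merge here, keeping only students present in all three datasets):
# Equivalent chained version (illustrative only)
predictorsAlt<-clicksInfo%>%
inner_join(assessInfo,by="id_student")%>%
inner_join(latenessInfo,by="id_student")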
Now we will add additional predictors from the studentInfo dataset (minus the final_result and the course information).
predictors<-left_join(predictors,studentInfo,by="id_student")%>%
select(!c("final_result", "code_module", "code_presentation"))
Explanation:
We now join the “predictors” dataset with studentInfo by the ID of the student. We use left_join to keep all the rows in the predictors dataset.
We eliminate the “final_result”, “code_module” and “code_presentation” columns because they will not be used as predictors.
Now, we create the outcome values of Pass/Fail (finalState), the final grade (finalGrade) and the dropout (dropout)
finalState<-getFinalState(course)
finalGrade<-getFinalGrade(course)
dropout<-getDropout(course)
Explanation:
We use the previously created functions to store the final state (Pass/Fail), the final grade (0 to 100) and the dropout status (1 if the student dropped out, 0 if they finished) in their respective datasets.
Now we will create three datasets, each one with the predictors and one of the outcomes. We also eliminate the id_student column as it is not needed for the prediction.
datasetState<-merge(predictors,finalState,by="id_student")%>%
select(!"id_student")
Explanation:
We merge the outcomes with the predictors to obtain three different datasets: one for the final state, one for the grades and one for the dropout.
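The other two datasets follow exactly the same pattern (a sketch; “datasetDropout” is a hypothetical name not used later in the tutorial, and datasetGrade is rebuilt further below with a different cut-off day):
datasetDropout<-merge(predictors,dropout,by="id_student")%>%
select(!"id_student")
datasetGrade<-merge(predictors,finalGrade,by="id_student")%>%
select(!"id_student")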
We will now learn how to build and evaluate classification models in R. For this we will use the “caret” library, which provides a common interface to many existing libraries implementing machine learning models.
First we will split our dataset into two parts: one for training, and another to validate how effective the model is.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(123)
index <- createDataPartition(datasetState$final_result, p = .7, list = FALSE, times = 1)
trainState <- datasetState[ index,]
testState <- datasetState[-index,]
fitControl <- trainControl(method = "cv", number = 10)
preProcess = c("center", "scale")
Explanation:
First, we import the caret library (install it if you do not have it).
Then we set a random seed to obtain the same results every time we run the code. If we do not do this, each run will produce a different random selection (which is preferred in real-world settings).
Then, we use the createDataPartition function to divide datasetState into two parts. The first one will have 70% of the data (p = .7) and the other the remaining 30%. The function returns the row indices of the observations selected for the training set.
We create the two datasets by selecting from datasetState those rows that are in the index (the training set) and those that are not (-index, the test set).
Finally, we set up the way in which the training will select the best model: 10-fold cross-validation (“cv” with number = 10). We also define the pre-processing steps (centering and scaling) that will be applied to the predictors.
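As an optional sanity check (a sketch; output not shown), we can verify that the partition preserved the Pass/Fail proportions:
# Illustrative check: class proportions in the train and test partitions
prop.table(table(trainState$final_result))
prop.table(table(testState$final_result))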
We will start by building a simple classification model that will determine if a student will pass or fail the “DDD” course based on only the first 50 days of data.
We will try different models. First, a Decision Tree. For that we will use the “rpart” library, which implements that algorithm.
modelDT <- train(final_result~., data=trainState, method="rpart", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
print(modelDT)
## CART
##
## 3275 samples
## 16 predictor
## 2 classes: 'Fail', 'Pass'
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2781, 2782, 2782, 2782, 2782, 2782, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.008266129 0.7576793 0.39198486
## 0.015793011 0.7415012 0.34613048
## 0.205645161 0.6887775 0.09325761
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.008266129.
Explanation:
We use the train function from the caret library. This function needs the formula that specifies which column will be the outcome (“final_result” in our case) and which columns will be the predictors (all other columns in our case). This is represented by the formula “final_result~.”. It also needs the dataset that will be used to train the model; here we use the trainState dataset. Then, it needs the algorithm that will be used to create the model; we use “rpart”, which creates Decision Trees. We then specify the type of training to perform (10-fold cross-validation), the pre-processing steps, and finally that any row with missing data should be omitted from the model (“na.action=na.omit”).
Once the training is finished, we print the information of the model. It says that the best model had a cross-validated accuracy of 0.7577.
That is it. We have trained a Machine Learning model with the data. This model is able to classify new student data into “Pass” or “Fail” final state.
One nice thing about Decision Trees is that we can visualize their internal rules. We use the rpart.plot library to do it.
library(rpart.plot)
## Loading required package: rpart
rpart.plot(modelDT$finalModel)
Explanation:
We import the rpart.plot library (install it if you do not have it).
Then we plot the final selected model.
In the result, we can see that each node has three numbers. The first indicates the majority class in that node (Pass or Fail). The second is the proportion of elements in that node that belong to that class. The third is the percentage of the training dataset that falls in that node. For example, in the first node, the majority class is Pass, 68% of the elements in that node are of the Pass type, and it contains 100% of the dataset elements.
Below the nodes there are rules. In the first division, the tree asks whether the average score is lower than 67: if yes, you go to a node that is mainly Fail; if no, you go to a node that is mainly Pass. These rules are nested inside each other and divide the dataset into increasingly precise partitions.
Interesting rules. They tell us that if your average score in the assessments by day 50 is higher than 71, you are likely to pass (more than 81% of the students in that group pass the course).
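If you prefer the rules as text rather than as a plot, the rpart.plot package also provides the rpart.rules function (a sketch; output not shown, and the exact formatting depends on your rpart.plot version):
# Illustrative: print the rules of the final tree, one row per leaf
rpart.rules(modelDT$finalModel)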
But how well does the prediction work? We need to evaluate the model. For that, we use the test partition that we created earlier.
predict_unseen <-predict(modelDT, testState, type = 'raw', na.action = na.omit)
confusionMatrix(predict_unseen, na.omit(testState)$final_result)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 166 56
## Pass 258 845
##
## Accuracy : 0.763
## 95% CI : (0.7392, 0.7857)
## No Information Rate : 0.68
## P-Value [Acc > NIR] : 1.789e-11
##
## Kappa : 0.3769
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3915
## Specificity : 0.9378
## Pos Pred Value : 0.7477
## Neg Pred Value : 0.7661
## Prevalence : 0.3200
## Detection Rate : 0.1253
## Detection Prevalence : 0.1675
## Balanced Accuracy : 0.6647
##
## 'Positive' Class : Fail
##
Explanation:
To evaluate the model we use the “predict” function, which takes the model and the dataset that will be used for the prediction. In this case we use the “raw” type to get the final prediction and na.omit to skip rows with incomplete data.
Then we create a confusion matrix that compares the prediction (predict_unseen) with the actual values (testState$final_result). We use na.omit to also eliminate those rows that have incomplete data.
The accuracy of our model is 76%, that is, it predicts the real outcome 76% of the time. But are there more false positives or more false negatives (where the “Positive” class is Fail)? According to the confusion matrix, most of the errors are false negatives (failing students that are wrongly classified as passing students). That leads to a low sensitivity (39%): our model is not able to cleanly isolate the failing students.
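These values can be reproduced by hand from the confusion matrix above (a small arithmetic check):
# Sensitivity: correctly predicted Fails over all actual Fails
166/(166+258)
# Specificity: correctly predicted Passes over all actual Passes
845/(845+56)
# Accuracy: all correct predictions over all predictions
(166+845)/(166+56+258+845)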
Let’s try a more advanced model, such as Random Forest. Training will take approximately 5 minutes.
modelRF <- train(final_result~., data=trainState, method="rf", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
print(modelRF)
## Random Forest
##
## 3275 samples
## 16 predictor
## 2 classes: 'Fail', 'Pass'
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2782, 2783, 2782, 2782, 2782, 2782, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7674053 0.3796327
## 20 0.7948894 0.5068242
## 39 0.7994160 0.5183236
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 39.
Explanation:
We use the same train function, with the only difference that the algorithm is Random Forest (“rf”).
The selected model has an estimated accuracy of about 0.80.
Now we evaluate the Random Forest model in the test dataset.
predict_unseen <-predict(modelRF, testState, na.action = na.omit)
confusionMatrix(predict_unseen, na.omit(testState)$final_result)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 275 95
## Pass 149 806
##
## Accuracy : 0.8158
## 95% CI : (0.7939, 0.8364)
## No Information Rate : 0.68
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5621
##
## Mcnemar's Test P-Value : 0.0006914
##
## Sensitivity : 0.6486
## Specificity : 0.8946
## Pos Pred Value : 0.7432
## Neg Pred Value : 0.8440
## Prevalence : 0.3200
## Detection Rate : 0.2075
## Detection Prevalence : 0.2792
## Balanced Accuracy : 0.7716
##
## 'Positive' Class : Fail
##
This more advanced model increases the accuracy to about 82% and the sensitivity to about 65%.
Let’s see other models. Support Vector Machines:
modelSVM <- train(final_result ~ ., data=trainState, method="svmRadial", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
predict_unseen <-predict(modelSVM, testState, na.action = na.omit)
confusionMatrix(predict_unseen, na.omit(testState)$final_result)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 232 86
## Pass 192 815
##
## Accuracy : 0.7902
## 95% CI : (0.7673, 0.8118)
## No Information Rate : 0.68
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4837
##
## Mcnemar's Test P-Value : 3.025e-10
##
## Sensitivity : 0.5472
## Specificity : 0.9046
## Pos Pred Value : 0.7296
## Neg Pred Value : 0.8093
## Prevalence : 0.3200
## Detection Rate : 0.1751
## Detection Prevalence : 0.2400
## Balanced Accuracy : 0.7259
##
## 'Positive' Class : Fail
##
Support Vector Machines seem to be better than the Decision Tree, but worse than Random Forest.
Let’s try K-Nearest Neighbors:
modelKNN <- train(final_result ~ ., data=trainState, method="knn", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
predict_unseen <-predict(modelKNN, testState, na.action = na.omit)
confusionMatrix(predict_unseen, na.omit(testState)$final_result)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 202 49
## Pass 222 852
##
## Accuracy : 0.7955
## 95% CI : (0.7727, 0.8169)
## No Information Rate : 0.68
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4731
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4764
## Specificity : 0.9456
## Pos Pred Value : 0.8048
## Neg Pred Value : 0.7933
## Prevalence : 0.3200
## Detection Rate : 0.1525
## Detection Prevalence : 0.1894
## Balanced Accuracy : 0.7110
##
## 'Positive' Class : Fail
##
Similar results. Let’s try Linear Discriminant Analysis:
modelLDA <- train(final_result ~ ., data=trainState, method="lda", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
predict_unseen <-predict(modelLDA, testState, na.action = na.omit)
confusionMatrix(predict_unseen, na.omit(testState)$final_result)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 224 88
## Pass 200 813
##
## Accuracy : 0.7826
## 95% CI : (0.7594, 0.8046)
## No Information Rate : 0.68
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.463
##
## Mcnemar's Test P-Value : 6.122e-11
##
## Sensitivity : 0.5283
## Specificity : 0.9023
## Pos Pred Value : 0.7179
## Neg Pred Value : 0.8026
## Prevalence : 0.3200
## Detection Rate : 0.1691
## Detection Prevalence : 0.2355
## Balanced Accuracy : 0.7153
##
## 'Positive' Class : Fail
##
Not different.
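Before settling on a model, we can also compare the cross-validated results of all the trained models side by side with caret’s resamples function (a sketch; output not shown, and it assumes all models were trained with the same fitControl):
# Illustrative comparison of the resampled accuracy of each model
results<-resamples(list(DT=modelDT,RF=modelRF,SVM=modelSVM,KNN=modelKNN,LDA=modelLDA))
summary(results)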
From our analysis, it seems that Random Forest is the best model for our data. Let’s explore which predictors are important according to the model.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
varImpPlot(modelRF$finalModel)
As we can see, the average score at day 50 is the most important indicator of whether the student passes or fails the course, followed by the total and average number of elements that the student has clicked on in the VLE and how active the student is in the VLE in general.
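caret offers a model-agnostic view of the same information through its varImp function (a sketch; output not shown):
# Illustrative: scaled variable importance as computed by caret
plot(varImp(modelRF))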
We will retrain the model on the whole dataset and save it so that we can use it in our dashboard.
modelRFFinal <- train(final_result~., data=datasetState, method="rf", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
saveRDS(modelRFFinal, "./classification_model.rds")
Now we want to estimate the final grade of the students. For this we will select again the “DDD” course, but this time we will use information from the first 100 days. Additionally, we create the train and test datasets.
course="DDD"
days=100
clicksInfo<-getClickInfo(course,days)
## `summarise()` regrouping output by 'id_student' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
assessInfo<-getAssessmentPredictors(course,days)
## `summarise()` ungrouping output (override with `.groups` argument)
latenessInfo<-getLateAssessments(course,days)
## `summarise()` ungrouping output (override with `.groups` argument)
temp1<-merge(clicksInfo,assessInfo,by="id_student")
predictors<-merge(temp1,latenessInfo,by="id_student")
datasetGrade<-merge(predictors,finalGrade,by="id_student")%>%
select(!"id_student")
index <- createDataPartition(datasetGrade$score, p = 0.7, list = FALSE)
trainGrade <- datasetGrade[index, ]
testGrade <- datasetGrade[-index, ]
Explanation:
We recreate the datasets for course “DDD” and the first 100 days of data.
We create a 70/30 partition for the train and test data using the createDataPartition function.
Now we train a linear model.
modelLM <- train(score ~ ., data = trainGrade, method = "lm", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
print(modelLM)
## Linear Regression
##
## 2128 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1914, 1915, 1915, 1915, 1915, 1917, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 15.19527 0.3383661 12.22378
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
This model is able to predict the grade of the students with a root mean squared error of about 15 points. However, it is only able to explain about a third of the variation (Rsquared of 0.34).
To have a better idea of how well the model predicts unseen data, we plot the predicted versus the actual values.
plot(predict(modelLM,testGrade),na.omit(testGrade)$score)
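To complement the plot with numbers, we can compute the error on the test set with caret’s postResample function (a sketch that follows the same na.omit pattern used above; “predGrade” is a hypothetical name):
# Illustrative: RMSE, R squared and MAE on the unseen test data
predGrade<-predict(modelLM,testGrade,na.action=na.omit)
postResample(pred=predGrade,obs=na.omit(testGrade)$score)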
Now we try more sophisticated models, for example Ridge regression:
modelRidge <- train(score ~ ., data = trainGrade, method = "ridge", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
print(modelRidge)
## Ridge Regression
##
## 2128 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1915, 1914, 1915, 1915, 1914, 1916, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0e+00 15.20021 0.3376306 12.22654
## 1e-04 15.19965 0.3376713 12.22644
## 1e-01 15.14822 0.3419997 12.21331
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(predict(modelRidge,testGrade),na.omit(testGrade)$score)
And MARS (Multivariate Adaptive Regression Splines):
modelEarth <- train(score ~ ., data = trainGrade, method = "earth", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
print(modelEarth)
## Multivariate Adaptive Regression Spline
##
## 2128 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1914, 1917, 1914, 1916, 1915, 1915, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 15.46431 0.3163037 12.49410
## 8 15.00975 0.3543607 11.99766
## 15 14.94976 0.3600060 11.98836
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 1.
plot(predict(modelEarth,testGrade),na.omit(testGrade)$score)
As you can see, the more sophisticated models do not substantially improve the estimation of the final grade with the data that we have, so we will train the linear model on the whole dataset and save it.
modelLMFinal <- train(score ~ ., data = datasetGrade, method = "lm", trControl=fitControl, preProcess=preProcess, na.action=na.omit)
saveRDS(modelLMFinal, "./regression_model.rds")
We will use both of our models (the classification and the regression model) to create a dashboard. In this dashboard, the user will input the variables for each student and the system will predict whether that student will pass or fail (classification) and what grade the student will get (regression).
Let’s create the App.R. Remember to copy the models that you saved into the same directory as App.R for it to work.
#
# This is a Shiny web application. You can run the application by clicking
# the 'Run App' button above.
#
# Find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com/
#
library(shiny)
library(shinydashboard)
##
## Attaching package: 'shinydashboard'
## The following object is masked from 'package:graphics':
##
## box
classificationModel= readRDS("./classification_model.rds")
print("model")
## [1] "model"
print(classificationModel)
## Random Forest
##
## 4677 samples
## 16 predictor
## 2 classes: 'Fail', 'Pass'
##
## Pre-processing: centered (39), scaled (39)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3975, 3974, 3974, 3975, 3974, 3974, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7814777 0.4271093
## 20 0.8211084 0.5697518
## 39 0.8202035 0.5676693
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 20.
regressionModel= readRDS("./regression_model.rds")
ui <- dashboardPage(
dashboardHeader(title = "Prediction Dashboard"),
dashboardSidebar(
sidebarMenu(
h3("Prediction Types"),
menuItem("Classification", tabName = "classification", icon = icon("dashboard")),
menuItem("Regression", tabName = "regression", icon = icon("th")),
h3("VLE Data"),
numericInput("total_clicks", "Total Clicks", value = 100),
numericInput("total_elements", "Total Elements", value = 100),
sliderInput("active_days", "Active Days", min = 0, max = 50, value = 25),
numericInput("average_daily_clicks", "Average Daily Clicks", value = 25),
numericInput("average_elements", "Average Daily Elements", value = 25),
h3("Assessment Data"),
sliderInput("avgScore", "Average Score", min = 0, max = 100, value = 50),
numericInput("delivered", "Deliverd Assessments", value = 5),
numericInput("sumDelays", "Delayed Days", value = 0)
)
),
dashboardBody(
tabItems(
# First tab content
tabItem(tabName = "classification",
fluidRow(
box(title="Student Info",
radioButtons("gender", "Gender",
choices = list("Male" = "M", "Female" = "F"),selected = "M"),
selectInput("region", "Region",
choices = list("East Anglian Region"="East Anglian Region",
"Yorkshire Region"="Yorkshire Region",
"East Midlands Region"="East Midlands Region",
"South East Region"="South East Region",
"North Western Region"= "North Western Region",
"Scotland"="Scotland",
"South West Region"="South West Region",
"West Midlands Region"="West Midlands Region",
"Wales"="Wales",
"Ireland"="Ireland",
"South Region"="South Region",
"London Region"="London Region",
"North Region"="North Region"),selected = "London Region"),
selectInput("highest_education", "Highest Level of Education",
choices =list("A Level or Equivalent",
"Lower Than A Level",
"HE Qualification",
"Post Graduate Qualification",
"No Formal quals"
), selected="HE Qualification"),
selectInput("imd_band", "IMD Band",
choices =list("0-10%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"
), selected="50-60%"),
selectInput("age_band", "Age Band",
choices =list("0-35",
"35-55",
"55<="
), selected="0-35"),
numericInput("num_of_prev_attempts", "Previous Attempts", value = 0),
numericInput("studied_credits", "Studied Credits", value = 60),
radioButtons("disability", "Disability",
choices = list("Yes" = "Y", "No" = "N"),selected = "N")
),
valueBoxOutput("classificationPrediction"),
),
),
# Second tab content
tabItem(tabName = "regression",
fluidRow(
valueBoxOutput("regressionPrediction"),
)
)
)
)
)
server <- function(input, output) {
output$classificationPrediction <- renderValueBox({
dataset=data.frame("total_clicks"=input$total_clicks,
"total_elements"=input$total_elements,
"active_days"= input$active_days,
"average_daily_clicks"=input$average_daily_clicks,
"average_elements" = input$average_elements,
"avgScore" = input$avgScore,
"delivered" =input$delivered,
"sumDelays" = input$sumDelays,
"gender"= input$gender,
"region"= input$region,
"highest_education"= input$highest_education,
"imd_band"=input$imd_band,
"age_band"=input$age_band,
"num_of_prev_attempts"=input$num_of_prev_attempts,
"studied_credits"=input$studied_credits,
"disability"=input$disability,
"final_result"=NA
)
print(classificationModel)
predictedValue=predict(classificationModel,dataset)
print(predictedValue)
valueBox(
ifelse(predictedValue[1]=="Pass","Pass","Fail"),"Prediction", icon = icon(ifelse(predictedValue[1]=="Pass","check","exclamation")),
color = ifelse(predictedValue[1]=="Pass","green","red")
)
})
output$regressionPrediction <- renderValueBox({
datasetRegression=data.frame("total_clicks"=input$total_clicks,
"total_elements"=input$total_elements,
"active_days"= input$active_days,
"average_daily_clicks"=input$average_daily_clicks,
"average_elements" = input$average_elements,
"avgScore" = input$avgScore,
"delivered" =input$delivered,
"sumDelays" = input$sumDelays,
"score"=NA
)
value=predict(regressionModel,datasetRegression)
valueBox(
format(value[1], digits=2, nsmall=2),"Final Grade", icon = icon(ifelse(value[1]>70,"check",ifelse(value[1]>50,"exclamation","times"))),
color = ifelse(value[1]>70,"green",ifelse(value>50,"yellow","red"))
)
})
}
shinyApp(ui, server)