Introduction and Objective

This RMarkdown document is a learning exercise: we build a model that classifies each student's tech profile based on the number of hours spent studying, the number and type of courses taken, and the average score per course type on an online education platform called Tortuga Code.

We will use data from Kaggle: https://www.kaggle.com/scarecrow2020/tech-students-profile-prediction?select=dataset-tortuga.csv. The dataset consists of 20,000 rows and 16 columns, including the target variable PROFILE.

By classifying the tech profile, the model can help Tortuga Code (the online education platform) prepare a specific promotional catalog for its prospective students. Using the catalog, a prospective student can see which types of courses match their interests, how long the courses take to complete, and the average score needed for course completion.

Libraries Used

library(tidyverse)
library(ggplot2)
library(e1071)
library(caret)
library(randomForest)

Read Data and Exploratory Data Analysis

We read the dataset first, then take a look at each column's data type.

profile <- read.csv("data_input/dataset-tortuga.csv")
glimpse(profile)

## Rows: 20,000
## Columns: 16
## $ X                                <int> 28, 81, 89, 138, 143, 169, 179, 22...
## $ NAME                             <chr> "Stormy Muto", "Carlos Ferro", "Ro...
## $ USER_ID                          <int> 58283940, 1357218, 63212105, 23239...
## $ HOURS_DATASCIENCE                <dbl> 7, 32, 45, 36, 61, 24, 54, 0, 40, ...
## $ HOURS_BACKEND                    <dbl> 39, 0, 0, 19, 78, 69, 52, 42, 20, ...
## $ HOURS_FRONTEND                   <dbl> 29, 44, 59, 28, 38, 68, 12, 13, 34...
## $ NUM_COURSES_BEGINNER_DATASCIENCE <dbl> 2, 2, 0, 0, 6, 3, 4, 0, 0, 0, 0, 2...
## $ NUM_COURSES_BEGINNER_BACKEND     <dbl> 4, 0, 5, 5, 11, 7, 3, 5, 5, 4, 4, ...
## $ NUM_COURSES_BEGINNER_FRONTEND    <dbl> 0, 0, 4, 7, 0, 0, 0, 5, 2, 0, 0, 4...
## $ NUM_COURSES_ADVANCED_DATASCIENCE <dbl> 2, 0, 0, 0, 4, 4, 5, 0, 0, 0, 0, 1...
## $ NUM_COURSES_ADVANCED_BACKEND     <dbl> 5, 5, 4, 5, 3, 5, 9, 5, 7, 7, 5, 5...
## $ NUM_COURSES_ADVANCED_FRONTEND    <dbl> 0, 0, 1, 3, 0, 0, 0, 3, 0, 4, 2, 5...
## $ AVG_SCORE_DATASCIENCE            <dbl> 84, 67, NA, NA, 66, 66, 87, NA, NA...
## $ AVG_SCORE_BACKEND                <dbl> 74, 45, 54, 71, 85, 75, 51, 74, 53...
## $ AVG_SCORE_FRONTEND               <dbl> NA, NA, 47, 89, NA, NA, NA, 67, 65...
## $ PROFILE                          <chr> "beginner_front_end", "beginner_fr...

Below is a description of each column in the dataset:

X : Index column with no predictive value (dropped later).
NAME : Name of the student.
USER_ID : ID of each student.
HOURS_DATASCIENCE : Number of hours spent studying Data Science courses.
HOURS_BACKEND : Number of hours spent studying Web (Back-End) courses.
HOURS_FRONTEND : Number of hours spent studying Web (Front-End) courses.
NUM_COURSES_BEGINNER_DATASCIENCE : Number of beginner Data Science courses completed by the student.
NUM_COURSES_BEGINNER_BACKEND : Number of beginner Web (Back-End) courses completed by the student.
NUM_COURSES_BEGINNER_FRONTEND : Number of beginner Web (Front-End) courses completed by the student.
NUM_COURSES_ADVANCED_DATASCIENCE : Number of advanced Data Science courses completed by the student.
NUM_COURSES_ADVANCED_BACKEND : Number of advanced Web (Back-End) courses completed by the student.
NUM_COURSES_ADVANCED_FRONTEND : Number of advanced Web (Front-End) courses completed by the student.
AVG_SCORE_DATASCIENCE : Average score in the Data Science courses completed by the student.
AVG_SCORE_BACKEND : Average score in the Web (Back-End) courses completed by the student.
AVG_SCORE_FRONTEND : Average score in the Web (Front-End) courses completed by the student.
PROFILE : Tech profile of the student (target variable).

Next, we check whether the dataset contains any NA values and how the data in the numeric columns are distributed.

colSums(is.na(profile))

##                                X                             NAME 
##                                0                                0 
##                          USER_ID                HOURS_DATASCIENCE 
##                                0                               14 
##                    HOURS_BACKEND                   HOURS_FRONTEND 
##                               53                               16 
## NUM_COURSES_BEGINNER_DATASCIENCE     NUM_COURSES_BEGINNER_BACKEND 
##                               26                               18 
##    NUM_COURSES_BEGINNER_FRONTEND NUM_COURSES_ADVANCED_DATASCIENCE 
##                               39                                2 
##     NUM_COURSES_ADVANCED_BACKEND    NUM_COURSES_ADVANCED_FRONTEND 
##                                8                               37 
##            AVG_SCORE_DATASCIENCE                AVG_SCORE_BACKEND 
##                              220                               84 
##               AVG_SCORE_FRONTEND                          PROFILE 
##                              168                                0

ggplot(gather(profile[,-c(1:3, 16)], cols, value), aes(x = value)) + 
       geom_histogram(bins = 10) + facet_wrap(~cols, scales = 'free_x')

Data Pre-processing

Since many columns contain NA values and several of them are not normally distributed, we replace the NA values with each column's median, which is less sensitive to skew than the mean.

# NA median impute
prevalues <- preProcess(profile, method=c("medianImpute"))
profile <- predict(prevalues, profile)
colSums(is.na(profile))

##                                X                             NAME 
##                                0                                0 
##                          USER_ID                HOURS_DATASCIENCE 
##                                0                                0 
##                    HOURS_BACKEND                   HOURS_FRONTEND 
##                                0                                0 
## NUM_COURSES_BEGINNER_DATASCIENCE     NUM_COURSES_BEGINNER_BACKEND 
##                                0                                0 
##    NUM_COURSES_BEGINNER_FRONTEND NUM_COURSES_ADVANCED_DATASCIENCE 
##                                0                                0 
##     NUM_COURSES_ADVANCED_BACKEND    NUM_COURSES_ADVANCED_FRONTEND 
##                                0                                0 
##            AVG_SCORE_DATASCIENCE                AVG_SCORE_BACKEND 
##                                0                                0 
##               AVG_SCORE_FRONTEND                          PROFILE 
##                                0                                0
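What median imputation does to a column can be sketched in base R. This is only an illustration on a hypothetical toy data frame (`toy`, `impute_median` and the values are made up for the example), not the internals of caret's preProcess():

```r
# Toy data frame (hypothetical values) with missing entries
toy <- data.frame(
  hours = c(10, 20, NA, 40),
  score = c(70, NA, 90, NA)
)

# Replace each NA with the median of the observed values in the same column
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

toy_imputed <- as.data.frame(lapply(toy, impute_median))
print(toy_imputed)
```

Each column is imputed independently, so a missing score is filled with the median score, never with a value derived from another column.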

Next, we drop the X, NAME, and USER_ID columns, since they are not relevant for modeling. We also convert the PROFILE column to a factor.

profile <- profile %>% select(-c(X, NAME, USER_ID)) %>% 
  mutate(PROFILE = as.factor(PROFILE))

glimpse(profile)

## Rows: 20,000
## Columns: 13
## $ HOURS_DATASCIENCE                <dbl> 7, 32, 45, 36, 61, 24, 54, 0, 40, ...
## $ HOURS_BACKEND                    <dbl> 39, 0, 0, 19, 78, 69, 52, 42, 20, ...
## $ HOURS_FRONTEND                   <dbl> 29, 44, 59, 28, 38, 68, 12, 13, 34...
## $ NUM_COURSES_BEGINNER_DATASCIENCE <dbl> 2, 2, 0, 0, 6, 3, 4, 0, 0, 0, 0, 2...
## $ NUM_COURSES_BEGINNER_BACKEND     <dbl> 4, 0, 5, 5, 11, 7, 3, 5, 5, 4, 4, ...
## $ NUM_COURSES_BEGINNER_FRONTEND    <dbl> 0, 0, 4, 7, 0, 0, 0, 5, 2, 0, 0, 4...
## $ NUM_COURSES_ADVANCED_DATASCIENCE <dbl> 2, 0, 0, 0, 4, 4, 5, 0, 0, 0, 0, 1...
## $ NUM_COURSES_ADVANCED_BACKEND     <dbl> 5, 5, 4, 5, 3, 5, 9, 5, 7, 7, 5, 5...
## $ NUM_COURSES_ADVANCED_FRONTEND    <dbl> 0, 0, 1, 3, 0, 0, 0, 3, 0, 4, 2, 5...
## $ AVG_SCORE_DATASCIENCE            <dbl> 84, 67, 65, 65, 66, 66, 87, 65, 65...
## $ AVG_SCORE_BACKEND                <dbl> 74, 45, 54, 71, 85, 75, 51, 74, 53...
## $ AVG_SCORE_FRONTEND               <dbl> 68, 68, 47, 89, 68, 68, 68, 67, 65...
## $ PROFILE                          <fct> beginner_front_end, beginner_front...

Model Fitting and Evaluation

Cross-Validation

After preparing the dataset, we randomly split it into train and test sets in an 80:20 proportion.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- sample(x = nrow(profile), nrow(profile) * 0.8)

profile_train <- profile[index,]
profile_test <- profile[-index,]
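The indexing pattern used for the split can be checked on a small example. The sketch below uses a hypothetical 10-row data frame (`toy`) to show that the positive index selects the training rows and the negative index selects everything else, so the two sets never overlap:

```r
set.seed(100)

# Hypothetical 10-row dataset standing in for `profile`
toy <- data.frame(id = 1:10)

# Draw 80% of the row numbers without replacement
index <- sample(x = nrow(toy), size = nrow(toy) * 0.8)

toy_train <- toy[index, , drop = FALSE]   # rows in the index
toy_test  <- toy[-index, , drop = FALSE]  # all remaining rows

nrow(toy_train)
nrow(toy_test)
```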

Then we take a look at the class proportions of the target variable.

prop.table(table(profile$PROFILE))

## 
##      advanced_backend advanced_data_science    advanced_front_end 
##               0.16695               0.16650               0.16685 
##      beginner_backend beginner_data_science    beginner_front_end 
##               0.16660               0.16635               0.16675

As seen above, the classes of the target variable are well balanced, so we can proceed to model fitting and evaluation with Naive Bayes and Random Forest in the next steps.

Naive Bayes

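Here we build a classifier based on Bayes' theorem. We also set laplace = 1 as a safeguard against data scarcity (zero counts). Note that Laplace smoothing only applies to categorical predictors; since all predictors here are numeric and are modeled with per-class Gaussian densities, it should not change the predictions of this particular model.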

# model fitting
naive_model <- naiveBayes(x = profile_train %>% select(-PROFILE), 
                          y = profile_train$PROFILE,
                          laplace = 1)
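To make the idea concrete, the following is a minimal hand-rolled sketch of Gaussian Naive Bayes with a single numeric predictor. The data, the variable names, and the `posterior()` helper are hypothetical; the sketch only illustrates the principle (prior times per-class normal density, then normalise) and is not the e1071 implementation:

```r
# Hypothetical toy data: one numeric predictor, two classes
hours <- c(5, 7, 6, 30, 35, 32)
label <- factor(c("beginner", "beginner", "beginner",
                  "advanced", "advanced", "advanced"))

classes <- levels(label)

# Per-class prior, mean and standard deviation (in factor-level order)
prior <- as.numeric(table(label)) / length(label)
mu    <- as.numeric(tapply(hours, label, mean))
sigma <- as.numeric(tapply(hours, label, sd))

# Posterior probability of each class for a new observation x:
# prior * Gaussian likelihood, normalised to sum to 1
posterior <- function(x) {
  unnorm <- prior * dnorm(x, mean = mu, sd = sigma)
  setNames(unnorm / sum(unnorm), classes)
}

post <- posterior(8)
names(which.max(post))
```

With several predictors, the per-predictor densities are multiplied together under the "naive" assumption that predictors are independent within each class.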

Using this model, we predict the target class for profile_test. The predictions are stored in the profile_test_pred_naive object and used for model evaluation via the confusionMatrix() function.

# predict data test
profile_test_pred_naive <- predict(object = naive_model,
        newdata = profile_test,
        type = "class")

# confusion matrix on the test data
confusionMatrix(data = profile_test_pred_naive, reference = profile_test$PROFILE)

## Confusion Matrix and Statistics
## 
##                        Reference
## Prediction              advanced_backend advanced_data_science
##   advanced_backend                   491                    13
##   advanced_data_science               88                   466
##   advanced_front_end                  15                    24
##   beginner_backend                    56                    22
##   beginner_data_science                5                    52
##   beginner_front_end                  17                    53
##                        Reference
## Prediction              advanced_front_end beginner_backend
##   advanced_backend                      15               87
##   advanced_data_science                 14               93
##   advanced_front_end                   505               29
##   beginner_backend                      62              356
##   beginner_data_science                 27               44
##   beginner_front_end                    52               73
##                        Reference
## Prediction              beginner_data_science beginner_front_end
##   advanced_backend                         19                 21
##   advanced_data_science                    22                 19
##   advanced_front_end                       57                 34
##   beginner_backend                         49                 50
##   beginner_data_science                   498                 12
##   beginner_front_end                       28                532
## 
## Overall Statistics
##                                                
##                Accuracy : 0.712                
##                  95% CI : (0.6977, 0.726)      
##     No Information Rate : 0.1705               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6545               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: advanced_backend Class: advanced_data_science
## Sensitivity                           0.7307                       0.7397
## Specificity                           0.9534                       0.9300
## Pos Pred Value                        0.7601                       0.6638
## Neg Pred Value                        0.9460                       0.9503
## Prevalence                            0.1680                       0.1575
## Detection Rate                        0.1227                       0.1165
## Detection Prevalence                  0.1615                       0.1755
## Balanced Accuracy                     0.8420                       0.8348
##                      Class: advanced_front_end Class: beginner_backend
## Sensitivity                             0.7481                  0.5220
## Specificity                             0.9522                  0.9280
## Pos Pred Value                          0.7605                  0.5983
## Neg Pred Value                          0.9490                  0.9043
## Prevalence                              0.1688                  0.1705
## Detection Rate                          0.1263                  0.0890
## Detection Prevalence                    0.1660                  0.1487
## Balanced Accuracy                       0.8502                  0.7250
##                      Class: beginner_data_science Class: beginner_front_end
## Sensitivity                                0.7400                    0.7964
## Specificity                                0.9579                    0.9331
## Pos Pred Value                             0.7806                    0.7046
## Neg Pred Value                             0.9479                    0.9581
## Prevalence                                 0.1683                    0.1670
## Detection Rate                             0.1245                    0.1330
## Detection Prevalence                       0.1595                    0.1888
## Balanced Accuracy                          0.8489                    0.8647

Since the target variable is multi-class and its classes are well balanced, accuracy is a suitable metric for evaluating model performance.

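The first model, built with the Naive Bayes algorithm, already performs reasonably well, with 71.2% accuracy on the test data. Its weakest class is beginner_backend, with a sensitivity of only 52.2%. Let's try building another model using the Random Forest algorithm.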
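The accuracy and per-class sensitivity reported by confusionMatrix() can be recomputed by hand from the table. The sketch below re-enters the Naive Bayes confusion matrix shown above (rows are predictions, columns are the reference); the object names `cm`, `accuracy` and `sensitivity` are only chosen for this example:

```r
classes <- c("advanced_backend", "advanced_data_science", "advanced_front_end",
             "beginner_backend", "beginner_data_science", "beginner_front_end")

# Naive Bayes confusion matrix copied from the output above
cm <- matrix(c(491,  13,  15,  87,  19,  21,
                88, 466,  14,  93,  22,  19,
                15,  24, 505,  29,  57,  34,
                56,  22,  62, 356,  49,  50,
                 5,  52,  27,  44, 498,  12,
                17,  53,  52,  73,  28, 532),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Accuracy: correctly classified observations (diagonal) over all observations
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the actual class totals
sensitivity <- diag(cm) / colSums(cm)

accuracy
round(sensitivity, 4)
```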

Random Forest

By using the Random Forest algorithm we can:

Reduce the variance of a single Decision Tree by aggregating many trees, which generally improves predictive performance.
Get automatic feature selection.
Use the out-of-bag error as a built-in estimate of model performance.

In the chunk below, we set up repeated k-fold cross-validation and then build the Random Forest model. Afterwards, we save the model with the saveRDS() function, because re-training a Random Forest usually takes a long time due to the number of decision trees involved.

set.seed(100)

# k-fold cross validation
ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k fold = 5
                     repeats = 3) # 5 folds x 3 repeats = 15 resamples per mtry value, each fitting a 500-tree forest

profile_forest <- train(PROFILE ~ .,
                   data = profile_train,
                   method = "rf",
                   trControl = ctrl)

saveRDS(profile_forest, "profile_forest.RDS")

# read model
profile_forest <- readRDS("profile_forest.RDS")
profile_forest

## Random Forest 
## 
## 16000 samples
##    12 predictor
##     6 classes: 'advanced_backend', 'advanced_data_science', 'advanced_front_end', 'beginner_backend', 'beginner_data_science', 'beginner_front_end' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 12799, 12800, 12802, 12801, 12798, 12801, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9131457  0.8957736
##    7    0.9117292  0.8940741
##   12    0.9039164  0.8846985
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Based on the results above, the chosen model uses mtry = 2, which has the highest accuracy: 91.31% on average across the cross-validation resamples.
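The resampling scheme behind trainControl(method = "repeatedcv", number = 5, repeats = 3) can be sketched in base R. This is a simplified illustration on 20 hypothetical observations; unlike caret, which builds its folds with class stratification, the sketch assigns folds purely at random:

```r
set.seed(100)

n       <- 20  # hypothetical number of observations
k       <- 5   # folds per repeat
repeats <- 3   # number of repeats

# For each repeat, shuffle the fold labels 1..k across the n observations
folds <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Each fold of each repeat is held out once while the model
# is trained on the remaining folds
n_resamples   <- k * repeats
holdout_sizes <- table(folds[[1]])

n_resamples
holdout_sizes
```

The accuracy reported for each mtry value is the average over these resamples, which is why the result is more stable than a single train/validation split.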

Out-of-Bag (OOB) Error

During bootstrap sampling, each tree is trained on a sample drawn with replacement, so some observations are left out of that tree's training data. These are known as Out-of-Bag (OOB) data. The Random Forest model uses the OOB data as test data for the trees that did not see them and calculates the resulting error, known as the OOB error. For classification, the OOB error is the percentage of misclassified OOB observations.

profile_forest$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 8.39%
## Confusion matrix:
##                       advanced_backend advanced_data_science advanced_front_end
## advanced_backend                  2423                   120                 25
## advanced_data_science               30                  2404                 80
## advanced_front_end                  22                    32               2481
## beginner_backend                    71                    68                 43
## beginner_data_science               23                    38                 40
## beginner_front_end                  25                    18                 49
##                       beginner_backend beginner_data_science beginner_front_end
## advanced_backend                    73                     9                 17
## advanced_data_science               45                    83                 58
## advanced_front_end                  41                    17                 69
## beginner_backend                  2356                    75                 37
## beginner_data_science               43                  2482                 28
## beginner_front_end                  47                    16               2512
##                       class.error
## advanced_backend       0.09148856
## advanced_data_science  0.10962963
## advanced_front_end     0.06799399
## beginner_backend       0.11094340
## beginner_data_science  0.06480784
## beginner_front_end     0.05811774

The OOB Error value in the profile_forest model is 8.39%. In other words, the model’s accuracy on OOB data is 91.61%.
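Where the OOB data come from can be illustrated with a single bootstrap sample. The sketch below is a generic base R illustration (the sample size of 10,000 is arbitrary) and does not touch the fitted model:

```r
set.seed(100)

n <- 10000  # arbitrary number of observations for the illustration

# One bootstrap sample: n row numbers drawn with replacement
in_bag <- sample(n, size = n, replace = TRUE)

# Rows never drawn are out-of-bag for the tree trained on this sample
oob <- setdiff(seq_len(n), in_bag)

# The expected OOB share is (1 - 1/n)^n, which approaches exp(-1), about 0.368
oob_fraction <- length(oob) / n
oob_fraction
```

Because roughly a third of the rows are out-of-bag for any given tree, every observation can be evaluated by the trees that never saw it, without setting aside a separate validation set.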

Then, if we compare the predictions with the test data set, we find that the model predicts the class correctly for 3,707 of the 4,000 test observations (the diagonal of the table below), which is about 92.7% accuracy.

profile_test_pred_forest <- predict(profile_forest, profile_test, type = "raw")
cm_profile <- confusionMatrix(data = profile_test_pred_forest,
                         reference = profile_test$PROFILE)
cm_profile$table

##                        Reference
## Prediction              advanced_backend advanced_data_science
##   advanced_backend                   626                     2
##   advanced_data_science               22                   570
##   advanced_front_end                   6                    18
##   beginner_backend                    12                     9
##   beginner_data_science                3                    15
##   beginner_front_end                   3                    16
##                        Reference
## Prediction              advanced_front_end beginner_backend
##   advanced_backend                       7               20
##   advanced_data_science                  9               18
##   advanced_front_end                   633                8
##   beginner_backend                      11              605
##   beginner_data_science                  5               21
##   beginner_front_end                    10               10
##                        Reference
## Prediction              beginner_data_science beginner_front_end
##   advanced_backend                          4                  5
##   advanced_data_science                     7                  5
##   advanced_front_end                        6                  9
##   beginner_backend                         13                  9
##   beginner_data_science                   637                  4
##   beginner_front_end                        6                636

Interpretation

In machine learning, there is a trade-off between interpretability and performance. A Random Forest model can outperform models built with other algorithms, but it is not easy to interpret because it combines many randomized trees. We can, however, at least see which predictors contribute most to the Random Forest model through the variable importance:

varImp(profile_forest)

## rf variable importance
## 
##                                  Overall
## HOURS_BACKEND                    100.000
## HOURS_DATASCIENCE                 89.092
## AVG_SCORE_FRONTEND                84.258
## AVG_SCORE_DATASCIENCE             84.003
## HOURS_FRONTEND                    75.250
## AVG_SCORE_BACKEND                 45.527
## NUM_COURSES_BEGINNER_BACKEND      40.116
## NUM_COURSES_BEGINNER_FRONTEND     38.535
## NUM_COURSES_ADVANCED_BACKEND      32.005
## NUM_COURSES_ADVANCED_FRONTEND     18.185
## NUM_COURSES_BEGINNER_DATASCIENCE   9.597
## NUM_COURSES_ADVANCED_DATASCIENCE   0.000

plot(varImp(profile_forest))
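The 0 to 100 range in the output above is consistent with a min-max rescaling of the raw importance scores, which is our understanding of what varImp() does with its default scale = TRUE. A minimal sketch with hypothetical raw scores (the names and values are made up, and this is not caret's source code):

```r
# Hypothetical raw importance scores for three predictors
raw_importance <- c(pred_a = 120, pred_b = 300, pred_c = 660)

# Min-max rescaling: the least important predictor maps to 0,
# the most important one to 100
scaled <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100

round(scaled, 3)
```

Under this scaling, a score of 0 (as for NUM_COURSES_ADVANCED_DATASCIENCE) means "least important among these predictors", not necessarily "no contribution at all".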

Conclusion

After comparing both models (naive_model vs. profile_forest), the model built with the Random Forest algorithm (profile_forest) performs better: its accuracy on the test data is about 92.7% (3,707 of 4,000 observations correct), compared with 71.2% for Naive Bayes.

Further, as mentioned above, a Random Forest model is not easy to interpret, but we can see which predictors have high variable importance, meaning they contribute most to the model. Based on the result above, the top 5 predictors by importance are: HOURS_BACKEND, HOURS_DATASCIENCE, AVG_SCORE_FRONTEND, AVG_SCORE_DATASCIENCE, and HOURS_FRONTEND.

Thus, the importance pattern suggests that the number of hours spent studying and the average scores are the most influential predictors for classifying each student's tech profile.

Overall, the initial objective of creating a good model has been achieved with the Random Forest algorithm. Tortuga Code, as an online education platform, can use the model (profile_forest) to classify each student's tech profile and prepare a specific promotional catalog for its prospective students. Using the catalog, a prospective student can see which types of courses match their interests, how long the courses take to complete, and the average score needed for course completion.

Classify online educational platform students’ tech profile using Naive Bayes and Random Forest

Margareth Devina

11 April 2021