Introduction and Objective
This RMarkdown document is a learning exercise: we build a model that classifies each student’s tech profile from the number of hours studied, the number and type of courses taken, and the average score per course type on an online education platform called Tortuga Code.
We use data from Kaggle: https://www.kaggle.com/scarecrow2020/tech-students-profile-prediction?select=dataset-tortuga.csv. The dataset has 20,000 rows and 16 columns, including the target variable PROFILE.
By classifying the tech profile, the model can help Tortuga Code (the online education platform) prepare a targeted promotional catalog for prospective students. Using the catalog, a prospective student can see which types of courses match their interests, how long the courses take to complete, and the average score needed to complete them.
Library Used
library(tidyverse)
library(ggplot2)
library(e1071)
library(caret)
library(randomForest)
Read Data and Exploratory Data Analysis
We read the dataset first, then look at each column’s data type.
profile <- read.csv("data_input/dataset-tortuga.csv")
glimpse(profile)
## Rows: 20,000
## Columns: 16
## $ X <int> 28, 81, 89, 138, 143, 169, 179, 22...
## $ NAME <chr> "Stormy Muto", "Carlos Ferro", "Ro...
## $ USER_ID <int> 58283940, 1357218, 63212105, 23239...
## $ HOURS_DATASCIENCE <dbl> 7, 32, 45, 36, 61, 24, 54, 0, 40, ...
## $ HOURS_BACKEND <dbl> 39, 0, 0, 19, 78, 69, 52, 42, 20, ...
## $ HOURS_FRONTEND <dbl> 29, 44, 59, 28, 38, 68, 12, 13, 34...
## $ NUM_COURSES_BEGINNER_DATASCIENCE <dbl> 2, 2, 0, 0, 6, 3, 4, 0, 0, 0, 0, 2...
## $ NUM_COURSES_BEGINNER_BACKEND <dbl> 4, 0, 5, 5, 11, 7, 3, 5, 5, 4, 4, ...
## $ NUM_COURSES_BEGINNER_FRONTEND <dbl> 0, 0, 4, 7, 0, 0, 0, 5, 2, 0, 0, 4...
## $ NUM_COURSES_ADVANCED_DATASCIENCE <dbl> 2, 0, 0, 0, 4, 4, 5, 0, 0, 0, 0, 1...
## $ NUM_COURSES_ADVANCED_BACKEND <dbl> 5, 5, 4, 5, 3, 5, 9, 5, 7, 7, 5, 5...
## $ NUM_COURSES_ADVANCED_FRONTEND <dbl> 0, 0, 1, 3, 0, 0, 0, 3, 0, 4, 2, 5...
## $ AVG_SCORE_DATASCIENCE <dbl> 84, 67, NA, NA, 66, 66, 87, NA, NA...
## $ AVG_SCORE_BACKEND <dbl> 74, 45, 54, 71, 85, 75, 51, 74, 53...
## $ AVG_SCORE_FRONTEND <dbl> NA, NA, 47, 89, NA, NA, NA, 67, 65...
## $ PROFILE <chr> "beginner_front_end", "beginner_fr...
Below is a description of each column in the dataset:
- X: Column with no modeling value (dropped later).
- NAME: Name of the student.
- USER_ID: ID of each student.
- HOURS_DATASCIENCE: Number of hours spent studying Data Science courses.
- HOURS_BACKEND: Number of hours spent studying Web (Back-End) courses.
- HOURS_FRONTEND: Number of hours spent studying Web (Front-End) courses.
- NUM_COURSES_BEGINNER_DATASCIENCE: Number of beginner Data Science courses completed by the student.
- NUM_COURSES_BEGINNER_BACKEND: Number of beginner Web (Back-End) courses completed by the student.
- NUM_COURSES_BEGINNER_FRONTEND: Number of beginner Web (Front-End) courses completed by the student.
- NUM_COURSES_ADVANCED_DATASCIENCE: Number of advanced Data Science courses completed by the student.
- NUM_COURSES_ADVANCED_BACKEND: Number of advanced Web (Back-End) courses completed by the student.
- NUM_COURSES_ADVANCED_FRONTEND: Number of advanced Web (Front-End) courses completed by the student.
- AVG_SCORE_DATASCIENCE: Average score in the Data Science courses completed by the student.
- AVG_SCORE_BACKEND: Average score in the Web (Back-End) courses completed by the student.
- AVG_SCORE_FRONTEND: Average score in the Web (Front-End) courses completed by the student.
- PROFILE: Tech profile of the student (the target variable).
Next, we check for NA values in the dataset and look at the distribution of the numeric columns.
colSums(is.na(profile))
## X NAME
## 0 0
## USER_ID HOURS_DATASCIENCE
## 0 14
## HOURS_BACKEND HOURS_FRONTEND
## 53 16
## NUM_COURSES_BEGINNER_DATASCIENCE NUM_COURSES_BEGINNER_BACKEND
## 26 18
## NUM_COURSES_BEGINNER_FRONTEND NUM_COURSES_ADVANCED_DATASCIENCE
## 39 2
## NUM_COURSES_ADVANCED_BACKEND NUM_COURSES_ADVANCED_FRONTEND
## 8 37
## AVG_SCORE_DATASCIENCE AVG_SCORE_BACKEND
## 220 84
## AVG_SCORE_FRONTEND PROFILE
## 168 0
ggplot(gather(profile[,-c(1:3, 16)], cols, value), aes(x = value)) +
  geom_histogram(bins = 10) + facet_wrap(~cols, scales = 'free_x')
Data Pre-processing
Since many columns contain NA values and not all numeric columns are normally distributed, we replace the NA values with each column’s median, which is less sensitive to skew and outliers than the mean.
# NA median impute
prevalues <- preProcess(profile, method=c("medianImpute"))
profile <- predict(prevalues, profile)
colSums(is.na(profile))
## X NAME
## 0 0
## USER_ID HOURS_DATASCIENCE
## 0 0
## HOURS_BACKEND HOURS_FRONTEND
## 0 0
## NUM_COURSES_BEGINNER_DATASCIENCE NUM_COURSES_BEGINNER_BACKEND
## 0 0
## NUM_COURSES_BEGINNER_FRONTEND NUM_COURSES_ADVANCED_DATASCIENCE
## 0 0
## NUM_COURSES_ADVANCED_BACKEND NUM_COURSES_ADVANCED_FRONTEND
## 0 0
## AVG_SCORE_DATASCIENCE AVG_SCORE_BACKEND
## 0 0
## AVG_SCORE_FRONTEND PROFILE
## 0 0
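As a rough illustration of what median imputation does, the base R sketch below fills missing values by hand on a small made-up data frame. The data, the column names, and the choice to learn the medians from a designated set of training rows are all assumptions for illustration. The analysis above imputes the full dataset at once, and learning the medians from training rows only is one possible stricter variant that keeps test information out of the pre-processing step.

```r
# Toy data with missing values (hypothetical, not the Tortuga dataset)
toy <- data.frame(
  hours = c(10, NA, 30, 1000, NA),  # 1000 is a deliberate outlier
  score = c(NA, 60, 70, 80, 90)
)

# Rows 1-4 play the role of the training set in this sketch
train_rows <- 1:4

# Learn one median per column from the training rows only
medians <- vapply(toy[train_rows, ], median, numeric(1), na.rm = TRUE)

# Fill every missing value with the median learned for its column
imputed <- toy
for (col in names(medians)) {
  missing <- is.na(imputed[[col]])
  imputed[[col]][missing] <- medians[[col]]
}

print(medians)
print(imputed)
```

In this toy example the outlier barely moves the median of hours, whereas it would pull the mean far above the typical values. That is the usual argument for median imputation on skewed columns.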
Next, we drop the X, NAME, and USER_ID columns, since they are not relevant to the model, and convert the PROFILE column to a factor.
profile <- profile %>% select(-c(X, NAME, USER_ID)) %>%
  mutate(PROFILE = as.factor(PROFILE))
glimpse(profile)
## Rows: 20,000
## Columns: 13
## $ HOURS_DATASCIENCE <dbl> 7, 32, 45, 36, 61, 24, 54, 0, 40, ...
## $ HOURS_BACKEND <dbl> 39, 0, 0, 19, 78, 69, 52, 42, 20, ...
## $ HOURS_FRONTEND <dbl> 29, 44, 59, 28, 38, 68, 12, 13, 34...
## $ NUM_COURSES_BEGINNER_DATASCIENCE <dbl> 2, 2, 0, 0, 6, 3, 4, 0, 0, 0, 0, 2...
## $ NUM_COURSES_BEGINNER_BACKEND <dbl> 4, 0, 5, 5, 11, 7, 3, 5, 5, 4, 4, ...
## $ NUM_COURSES_BEGINNER_FRONTEND <dbl> 0, 0, 4, 7, 0, 0, 0, 5, 2, 0, 0, 4...
## $ NUM_COURSES_ADVANCED_DATASCIENCE <dbl> 2, 0, 0, 0, 4, 4, 5, 0, 0, 0, 0, 1...
## $ NUM_COURSES_ADVANCED_BACKEND <dbl> 5, 5, 4, 5, 3, 5, 9, 5, 7, 7, 5, 5...
## $ NUM_COURSES_ADVANCED_FRONTEND <dbl> 0, 0, 1, 3, 0, 0, 0, 3, 0, 4, 2, 5...
## $ AVG_SCORE_DATASCIENCE <dbl> 84, 67, 65, 65, 66, 66, 87, 65, 65...
## $ AVG_SCORE_BACKEND <dbl> 74, 45, 54, 71, 85, 75, 51, 74, 53...
## $ AVG_SCORE_FRONTEND <dbl> 68, 68, 47, 89, 68, 68, 68, 67, 65...
## $ PROFILE <fct> beginner_front_end, beginner_front...
Model Fitting and Evaluation
Cross-Validation
After preparing the dataset, we randomly split it into training and test sets in an 80:20 proportion.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(x = nrow(profile), nrow(profile) * 0.8)
profile_train <- profile[index,]
profile_test <- profile[-index,]
Then we look at the class proportions of the target variable.
prop.table(table(profile$PROFILE))
##
## advanced_backend advanced_data_science advanced_front_end
## 0.16695 0.16650 0.16685
## beginner_backend beginner_data_science beginner_front_end
## 0.16660 0.16635 0.16675
As seen above, the target classes are well balanced, so we can proceed to model fitting and evaluation with Naive Bayes and Random Forest in the next step.
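The proportions above are computed on the full dataset. As a small sketch of how the same check could be run on each subset after a random split, the base R example below uses a made-up data frame with three balanced classes. The toy data and all object names are hypothetical.

```r
set.seed(100)

# Toy data: 600 students in three balanced classes (hypothetical)
toy <- data.frame(
  id = 1:600,
  profile = factor(rep(c("backend", "data_science", "front_end"), each = 200))
)

# Random 80:20 split, in the same style as the split above
index <- sample(nrow(toy), size = floor(nrow(toy) * 0.8))
toy_train <- toy[index, ]
toy_test <- toy[-index, ]

# Class proportions in each subset; a purely random split keeps them
# close to the overall proportions, but not exactly equal
print(round(prop.table(table(toy_train$profile)), 3))
print(round(prop.table(table(toy_test$profile)), 3))
```

Small deviations between the two subsets are expected with simple random sampling. A stratified split would be one way to keep them closer, at the cost of a slightly more involved sampling step.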
Naive Bayes
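Here we build a classifier based on Bayes’ theorem. We also set Laplace smoothing (laplace = 1) to guard against categories that never occur together with a class in the training data. Note that in e1071 this smoothing applies to categorical predictors; since all predictors here are numeric and are modeled with per-class Gaussian distributions, the setting most likely has no effect on this particular model.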
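To make the idea concrete, here is a minimal hand-rolled sketch of Gaussian Naive Bayes for numeric predictors, written in base R on made-up data. It is not the e1071 implementation, and the toy data frame, its column names, and the new_student vector are hypothetical. For each class it combines the class prior with a normal density per predictor, treating predictors as independent given the class.

```r
# Toy training data: two numeric predictors, two classes (hypothetical)
toy <- data.frame(
  hours_backend  = c(60, 70, 65, 10, 15, 12),
  hours_frontend = c(10, 12, 8, 55, 60, 58),
  profile = factor(c("backend", "backend", "backend",
                     "front_end", "front_end", "front_end"))
)
predictors <- c("hours_backend", "hours_frontend")

# A new observation to classify
new_student <- c(hours_backend = 62, hours_frontend = 11)

# Log posterior (up to a constant) for each class:
# log prior + sum of log Gaussian densities, one per predictor
log_posterior <- sapply(levels(toy$profile), function(cls) {
  rows <- toy[toy$profile == cls, predictors]
  log_prior <- log(nrow(rows) / nrow(toy))
  log_lik <- sum(dnorm(new_student[predictors],
                       mean = colMeans(rows),
                       sd = sapply(rows, sd),
                       log = TRUE))
  log_prior + log_lik
})

# The predicted class is the one with the largest log posterior
predicted <- names(which.max(log_posterior))
print(predicted)
```

Working on the log scale avoids numerical underflow when many small densities are multiplied together.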
# model fitting
naive_model <- naiveBayes(x = profile_train %>% select(-PROFILE),
                          y = profile_train$PROFILE,
                          laplace = 1)
Using this model, we predict the target class for profile_test. The predictions are stored in the profile_test_pred_naive object and evaluated with the confusionMatrix() function.
# predict data test
profile_test_pred_naive <- predict(object = naive_model,
                                   newdata = profile_test,
                                   type = "class")
# confusion matrix on the test data
confusionMatrix(data = profile_test_pred_naive, reference = profile_test$PROFILE)
## Confusion Matrix and Statistics
##
## Reference
## Prediction advanced_backend advanced_data_science
## advanced_backend 491 13
## advanced_data_science 88 466
## advanced_front_end 15 24
## beginner_backend 56 22
## beginner_data_science 5 52
## beginner_front_end 17 53
## Reference
## Prediction advanced_front_end beginner_backend
## advanced_backend 15 87
## advanced_data_science 14 93
## advanced_front_end 505 29
## beginner_backend 62 356
## beginner_data_science 27 44
## beginner_front_end 52 73
## Reference
## Prediction beginner_data_science beginner_front_end
## advanced_backend 19 21
## advanced_data_science 22 19
## advanced_front_end 57 34
## beginner_backend 49 50
## beginner_data_science 498 12
## beginner_front_end 28 532
##
## Overall Statistics
##
## Accuracy : 0.712
## 95% CI : (0.6977, 0.726)
## No Information Rate : 0.1705
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6545
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: advanced_backend Class: advanced_data_science
## Sensitivity 0.7307 0.7397
## Specificity 0.9534 0.9300
## Pos Pred Value 0.7601 0.6638
## Neg Pred Value 0.9460 0.9503
## Prevalence 0.1680 0.1575
## Detection Rate 0.1227 0.1165
## Detection Prevalence 0.1615 0.1755
## Balanced Accuracy 0.8420 0.8348
## Class: advanced_front_end Class: beginner_backend
## Sensitivity 0.7481 0.5220
## Specificity 0.9522 0.9280
## Pos Pred Value 0.7605 0.5983
## Neg Pred Value 0.9490 0.9043
## Prevalence 0.1688 0.1705
## Detection Rate 0.1263 0.0890
## Detection Prevalence 0.1660 0.1487
## Balanced Accuracy 0.8502 0.7250
## Class: beginner_data_science Class: beginner_front_end
## Sensitivity 0.7400 0.7964
## Specificity 0.9579 0.9331
## Pos Pred Value 0.7806 0.7046
## Neg Pred Value 0.9479 0.9581
## Prevalence 0.1683 0.1670
## Detection Rate 0.1245 0.1330
## Detection Prevalence 0.1595 0.1888
## Balanced Accuracy 0.8489 0.8647
Because the six target classes are balanced and no class matters more than another, accuracy is a reasonable metric for evaluating model performance.
The first model, built with the Naive Bayes algorithm, already performs reasonably well, with about 71% accuracy on the test set. Let’s try building another model with the Random Forest algorithm.
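The accuracy and kappa values reported above can be reproduced by hand from the confusion matrix counts. The base R sketch below re-enters the Naive Bayes table (rows are predictions, columns are reference classes) and applies the usual definitions. The object names are chosen here for illustration.

```r
classes <- c("advanced_backend", "advanced_data_science", "advanced_front_end",
             "beginner_backend", "beginner_data_science", "beginner_front_end")

# Naive Bayes confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(c(
  491,  13,  15,  87,  19,  21,
   88, 466,  14,  93,  22,  19,
   15,  24, 505,  29,  57,  34,
   56,  22,  62, 356,  49,  50,
    5,  52,  27,  44, 498,  12,
   17,  53,  52,  73,  28, 532
), nrow = 6, byrow = TRUE,
dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = cohen_kappa), 4))
```

With six balanced classes, chance agreement is close to 1/6, so kappa sits somewhat below accuracy while telling the same story.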
Random Forest
By using the Random Forest algorithm we can:
- Reduce the variance of a single decision tree by combining many trees, which usually improves predictive performance.
- Let the trees pick informative features automatically, with a variable importance ranking as a by-product.
- Use the out-of-bag error as a built-in estimate of model performance.
In the chunk below, we set up repeated k-fold cross-validation and then train the Random Forest model. Afterwards, we save the model with the saveRDS() function, because retraining usually takes a long time given the number of decision trees involved.
set.seed(100)
# k-fold cross validation
ctrl <- trainControl(method = "repeatedcv",
                     number = 5,  # k = 5 folds
                     repeats = 3) # 5 x 3 = 15 resamples x 500 trees = 7,500 trees per candidate mtry
profile_forest <- train(PROFILE ~ .,
                        data = profile_train,
                        method = "rf",
                        trControl = ctrl)
saveRDS(profile_forest, "profile_forest.RDS")
# read model
profile_forest <- readRDS("profile_forest.RDS")
profile_forest
## Random Forest
##
## 16000 samples
## 12 predictor
## 6 classes: 'advanced_backend', 'advanced_data_science', 'advanced_front_end', 'beginner_backend', 'beginner_data_science', 'beginner_front_end'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 12799, 12800, 12802, 12801, 12798, 12801, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9131457 0.8957736
## 7 0.9117292 0.8940741
## 12 0.9039164 0.8846985
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Based on the results above, the chosen model uses mtry = 2, which gives the highest accuracy, 91.31%, averaged over the hold-out folds of the repeated cross-validation.
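As a sketch of what 5-fold cross-validation repeated 3 times means in terms of resamples, the base R example below builds the fold assignments by hand for 20 made-up observations. This is a simplified, unstratified illustration. caret builds its own folds internally and may do so differently, for example by balancing the classes across folds.

```r
set.seed(100)

n <- 20       # number of toy observations (hypothetical)
k <- 5        # folds per repeat
repeats <- 3  # number of repeats

resamples <- list()
for (r in seq_len(repeats)) {
  # Shuffle fold labels so each repeat uses a different partition
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    resamples[[length(resamples) + 1]] <- list(
      train   = which(fold_id != f),  # rows used to fit the model
      holdout = which(fold_id == f)   # rows used to score it
    )
  }
}

# 5 folds x 3 repeats = 15 train/hold-out pairs per candidate mtry
print(length(resamples))
```

Each candidate mtry value is fitted and scored once per resample, and the accuracy shown in the table is the average over those hold-out scores.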
Out-of-Bag (OOB) Error
During bootstrap sampling, some observations are left out of each tree’s training sample; these are known as Out-of-Bag (OOB) data. The Random Forest model uses each tree’s OOB data as test data and computes the resulting error, known as the OOB Error. For classification, the OOB error is the percentage of OOB observations that are misclassified.
profile_forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 8.39%
## Confusion matrix:
## advanced_backend advanced_data_science advanced_front_end
## advanced_backend 2423 120 25
## advanced_data_science 30 2404 80
## advanced_front_end 22 32 2481
## beginner_backend 71 68 43
## beginner_data_science 23 38 40
## beginner_front_end 25 18 49
## beginner_backend beginner_data_science beginner_front_end
## advanced_backend 73 9 17
## advanced_data_science 45 83 58
## advanced_front_end 41 17 69
## beginner_backend 2356 75 37
## beginner_data_science 43 2482 28
## beginner_front_end 47 16 2512
## class.error
## advanced_backend 0.09148856
## advanced_data_science 0.10962963
## advanced_front_end 0.06799399
## beginner_backend 0.11094340
## beginner_data_science 0.06480784
## beginner_front_end 0.05811774
The OOB Error value in the profile_forest model is 8.39%. In other words, the model’s accuracy on OOB data is 91.61%.
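To see where out-of-bag data comes from, the base R sketch below draws a single bootstrap sample of the same size as the training set (16,000 rows) and counts how many rows were never drawn. It illustrates the sampling step only, not the randomForest internals, and the object names are chosen here for illustration.

```r
set.seed(100)

n <- 16000  # size of the training set used above

# One bootstrap sample: n draws with replacement
in_bag <- sample(n, size = n, replace = TRUE)

# Rows never drawn are out-of-bag for this tree
oob <- setdiff(seq_len(n), in_bag)
oob_fraction <- length(oob) / n

# Theoretical probability that a given row is never drawn
expected_fraction <- (1 - 1 / n)^n

print(c(observed = oob_fraction, expected = expected_fraction))
```

The expected share is close to exp(-1), so roughly 37% of the training rows are out-of-bag for any one tree. Each row is therefore scored by a sizeable subset of the 500 trees that never saw it.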
Next, comparing the predictions against the test set, we find that the model predicts most observations correctly.
profile_test_pred_forest <- predict(profile_forest, profile_test, type = "raw")
cm_profile <- confusionMatrix(data = profile_test_pred_forest,
                              reference = profile_test$PROFILE)
cm_profile$table
## Reference
## Prediction advanced_backend advanced_data_science
## advanced_backend 626 2
## advanced_data_science 22 570
## advanced_front_end 6 18
## beginner_backend 12 9
## beginner_data_science 3 15
## beginner_front_end 3 16
## Reference
## Prediction advanced_front_end beginner_backend
## advanced_backend 7 20
## advanced_data_science 9 18
## advanced_front_end 633 8
## beginner_backend 11 605
## beginner_data_science 5 21
## beginner_front_end 10 10
## Reference
## Prediction beginner_data_science beginner_front_end
## advanced_backend 4 5
## advanced_data_science 7 5
## advanced_front_end 6 9
## beginner_backend 13 9
## beginner_data_science 637 4
## beginner_front_end 6 636
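Only the table is printed above, so the test-set accuracy is not shown directly. As a sketch, it can be recovered from the counts in base R, together with the per-class recall (the share of each reference class that was predicted correctly). The matrix below is re-entered by hand from the output, and the object names are chosen here for illustration.

```r
classes <- c("advanced_backend", "advanced_data_science", "advanced_front_end",
             "beginner_backend", "beginner_data_science", "beginner_front_end")

# Random Forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm_rf <- matrix(c(
  626,   2,   7,  20,   4,   5,
   22, 570,   9,  18,   7,   5,
    6,  18, 633,   8,   6,   9,
   12,   9,  11, 605,  13,   9,
    3,  15,   5,  21, 637,   4,
    3,  16,  10,  10,   6, 636
), nrow = 6, byrow = TRUE,
dimnames = list(Prediction = classes, Reference = classes))

# Overall accuracy on the 4,000 test observations
accuracy_rf <- sum(diag(cm_rf)) / sum(cm_rf)

# Recall per class: correct predictions divided by reference totals
recall_rf <- diag(cm_rf) / colSums(cm_rf)

print(round(accuracy_rf, 4))
print(round(recall_rf, 4))
```

The result is about 92.7% accuracy (3,707 of 4,000), in line with the cross-validated and OOB estimates. The beginner_backend class has the lowest recall, which matches it having the highest OOB class error.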
Interpretation
In machine learning there is often a trade-off between interpretability and performance. A Random Forest model can outperform models built with other algorithms, but it is harder to interpret because it combines many trees, each grown with random sampling of rows and features. We can still see which predictors matter most to the model through the variable importance:
varImp(profile_forest)
## rf variable importance
##
## Overall
## HOURS_BACKEND 100.000
## HOURS_DATASCIENCE 89.092
## AVG_SCORE_FRONTEND 84.258
## AVG_SCORE_DATASCIENCE 84.003
## HOURS_FRONTEND 75.250
## AVG_SCORE_BACKEND 45.527
## NUM_COURSES_BEGINNER_BACKEND 40.116
## NUM_COURSES_BEGINNER_FRONTEND 38.535
## NUM_COURSES_ADVANCED_BACKEND 32.005
## NUM_COURSES_ADVANCED_FRONTEND 18.185
## NUM_COURSES_BEGINNER_DATASCIENCE 9.597
## NUM_COURSES_ADVANCED_DATASCIENCE 0.000
plot(varImp(profile_forest))
Conclusion
Comparing the two models (naive_model vs. profile_forest), the Random Forest model (profile_forest) performs better. On the same test set it classifies 3,707 of 4,000 observations correctly (about 92.7%, from the diagonal of its confusion matrix), against 71.2% for Naive Bayes.
As mentioned above, a Random Forest model is not easy to interpret, but the variable importance shows which predictors contribute most to the model. Based on the result above, the five predictors with the highest importance are: HOURS_BACKEND, HOURS_DATASCIENCE, AVG_SCORE_FRONTEND, AVG_SCORE_DATASCIENCE, HOURS_FRONTEND. The scores are rescaled so that the top predictor gets 100 and the bottom one 0, so a score of 0 marks the least important predictor rather than a useless one.
The importance pattern suggests that the number of hours spent studying and the average scores are the most informative inputs for classifying each student’s tech profile, while most of the course counts contribute less.
Overall, the initial objective of building a good model has been achieved with the Random Forest algorithm. Tortuga Code, as an online education platform, can use the model (profile_forest) to classify each student’s tech profile and prepare a targeted promotional catalog for prospective students. Using the catalog, a prospective student can see which types of courses match their interests, how long the courses take to complete, and the average score needed to complete them.