The data used in this analysis is from https://www.kaggle.com/aljarah/xAPI-Edu-Data.
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
non-uniform 'Rounding' sampler used
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
grades <- read.csv("xAPI-Edu-Data.csv", stringsAsFactors = TRUE)
grades$Class <- factor(grades$Class, levels = c("L", "M", "H"))
head(grades, 10)
summary(grades)
gender  NationalITy    PlaceofBirth     StageID           GradeID       SectionID  Topic
F:175   KW       :179  KuwaIT     :180  HighSchool  : 33  G-02   :147   A:283      IT     : 95
M:305   Jordan   :172  Jordan     :176  lowerlevel  :199  G-08   :116   B:167      French : 65
        Palestine: 28  Iraq       : 22  MiddleSchool:248  G-07   :101   C: 30      Arabic : 59
        Iraq     : 22  lebanon    : 19                    G-04   : 48              Science: 51
        lebanon  : 17  SaudiArabia: 16                    G-06   : 32              English: 45
        Tunis    : 12  USA        : 16                    G-11   : 13              Biology: 30
        (Other)  : 50  (Other)    : 51                    (Other): 23              (Other):135

Semester  Relation    raisedhands       VisITedResources  AnnouncementsView  Discussion
F:245     Father:283  Min.   :  0.00    Min.   : 0.0      Min.   : 0.00      Min.   : 1.00
S:235     Mum   :197  1st Qu.: 15.75    1st Qu.:20.0      1st Qu.:14.00      1st Qu.:20.00
                      Median : 50.00    Median :65.0      Median :33.00      Median :39.00
                      Mean   : 46.77    Mean   :54.8      Mean   :37.92      Mean   :43.28
                      3rd Qu.: 75.00    3rd Qu.:84.0      3rd Qu.:58.00      3rd Qu.:70.00
                      Max.   :100.00    Max.   :99.0      Max.   :98.00      Max.   :99.00

ParentAnsweringSurvey  ParentschoolSatisfaction  StudentAbsenceDays  Class
No :210                Bad :188                  Above-7:191         L:127
Yes:270                Good:292                  Under-7:289         M:211
                                                                     H:142
glimpse(grades)
Rows: 480
Columns: 17
$ gender <fct> M, M, M, M, M, F, M, M, F, F, M, M, M, M, F, F, M, M, F, M, F, F, ~
$ NationalITy <fct> KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, lebanon, KW, K~
$ PlaceofBirth <fct> KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, Ku~
$ StageID <fct> lowerlevel, lowerlevel, lowerlevel, lowerlevel, lowerlevel, lowerl~
$ GradeID <fct> G-04, G-04, G-04, G-04, G-04, G-04, G-07, G-07, G-07, G-07, G-07, ~
$ SectionID <fct> A, A, A, A, A, A, A, A, A, B, A, B, A, A, A, A, B, A, A, B, A, B, ~
$ Topic <fct> IT, IT, IT, IT, IT, IT, Math, Math, Math, IT, Math, Math, IT, Math~
$ Semester <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, ~
$ Relation <fct> Father, Father, Father, Father, Father, Father, Father, Father, Fa~
$ raisedhands <int> 15, 20, 10, 30, 40, 42, 35, 50, 12, 70, 50, 19, 5, 20, 62, 30, 36,~
$ VisITedResources <int> 16, 20, 7, 25, 50, 30, 12, 10, 21, 80, 88, 6, 1, 14, 70, 40, 30, 1~
$ AnnouncementsView <int> 2, 3, 0, 5, 12, 13, 0, 15, 16, 25, 30, 19, 0, 12, 44, 22, 20, 35, ~
$ Discussion <int> 20, 25, 30, 35, 50, 70, 17, 22, 50, 70, 80, 12, 11, 19, 60, 66, 80~
$ ParentAnsweringSurvey <fct> Yes, Yes, No, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No, No~
$ ParentschoolSatisfaction <fct> Good, Good, Bad, Bad, Bad, Bad, Bad, Good, Good, Good, Good, Good,~
$ StudentAbsenceDays <fct> Under-7, Under-7, Above-7, Above-7, Above-7, Above-7, Above-7, Und~
$ Class <fct> M, M, L, L, M, M, L, M, M, M, H, M, L, L, H, M, M, M, M, H, M, M, ~
anyNA(grades)
[1] FALSE
There are no missing values in the dataframe.
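anyNA() only reports whether any value at all is missing. If a per-column breakdown is wanted, a minimal sketch could look like the following. The toy data frame is a hypothetical stand-in, not the xAPI data, which has no missing values.

```r
# Hypothetical stand-in data frame; the real `grades` data has no missing values.
toy <- data.frame(
  raisedhands = c(15, NA, 10),
  Topic       = c("IT", "Math", NA),
  Class       = c("M", "L", "H"),
  stringsAsFactors = FALSE
)

# Count missing values per column.
na_counts <- colSums(is.na(toy))
na_counts

# Same check as in the analysis: TRUE if anything is missing.
anyNA(toy)
```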
grades_cleaned <- grades %>%
  rename(
    RaisedHands = raisedhands,
    VisitedResources = VisITedResources,
    AnnouncementsViewed = AnnouncementsView,
    Nationality = NationalITy,
    Education = StageID,
    GradeLevel = GradeID,
    Classroom = SectionID,
    Grade = Class
  )
head(grades_cleaned)
The distributions of the four numeric activity variables (RaisedHands, VisitedResources, AnnouncementsViewed and Discussion) are mainly bimodal, but we won’t know whether this hurts the model until we train it.
dist1 <- ggplot(data=grades_cleaned, aes(RaisedHands)) +
geom_histogram(bins=30) +
ggtitle("Distribution of raised hands")
dist2 <- ggplot(data=grades_cleaned, aes(VisitedResources)) +
geom_histogram(bins=30) +
ggtitle("Distribution of visited resources")
dist3 <- ggplot(data=grades_cleaned, aes(AnnouncementsViewed)) +
geom_histogram(bins=30) +
ggtitle("Distribution of viewed announcements")
dist4 <- ggplot(data=grades_cleaned, aes(Discussion)) +
geom_histogram(bins=30) +
ggtitle("Distribution of discussion participation")
grid.arrange(dist1, dist2, dist3, dist4, nrow=2)
library(caret)
Loading required package: lattice
# Sample 70% of the rows for training; the remaining 30% form the test set
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
sample_index <- sample(seq_len(nrow(grades_cleaned)), size = floor(0.70*nrow(grades_cleaned)), replace = FALSE)
train <- grades_cleaned[sample_index,]
test <- grades_cleaned[-sample_index,]
grades_actual <- test$Grade
head(train)
glimpse(train)
Rows: 336
Columns: 17
$ gender <fct> F, M, M, F, M, F, M, F, M, M, M, M, M, M, F, M, M, M, M, M, M, F, ~
$ Nationality <fct> Jordan, Jordan, KW, Jordan, Jordan, KW, KW, Jordan, KW, KW, Jordan~
$ PlaceofBirth <fct> Egypt, Jordan, KuwaIT, Jordan, Jordan, KuwaIT, KuwaIT, Jordan, Kuw~
$ Education <fct> MiddleSchool, lowerlevel, MiddleSchool, MiddleSchool, MiddleSchool~
$ GradeLevel <fct> G-07, G-02, G-08, G-08, G-08, G-07, G-04, G-08, G-04, G-08, G-08, ~
$ Classroom <fct> A, B, A, A, A, B, A, A, A, C, A, C, A, A, A, B, B, B, A, A, B, A, ~
$ Topic <fct> Quran, Arabic, Arabic, Chemistry, Geology, IT, Math, Geology, Hist~
$ Semester <fct> F, S, S, S, S, F, S, F, S, S, S, S, S, S, F, F, F, F, S, F, S, S, ~
$ Relation <fct> Mum, Mum, Father, Mum, Mum, Father, Father, Mum, Father, Father, M~
$ RaisedHands <int> 100, 32, 15, 84, 71, 10, 15, 70, 10, 5, 81, 87, 50, 11, 70, 88, 11~
$ VisitedResources <int> 80, 82, 43, 92, 84, 12, 6, 69, 17, 21, 84, 81, 90, 70, 4, 90, 2, 5~
$ AnnouncementsViewed <int> 95, 59, 42, 29, 67, 4, 32, 46, 12, 42, 77, 42, 83, 32, 39, 76, 0, ~
$ Discussion <int> 90, 63, 33, 43, 80, 80, 40, 45, 14, 14, 85, 19, 13, 29, 90, 81, 50~
$ ParentAnsweringSurvey <fct> No, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes, ~
$ ParentschoolSatisfaction <fct> Bad, Bad, Good, Good, Good, Bad, Good, Good, Bad, Good, Good, Good~
$ StudentAbsenceDays <fct> Under-7, Above-7, Under-7, Under-7, Under-7, Under-7, Under-7, Abo~
$ Grade <fct> H, M, M, H, M, M, H, M, L, L, H, H, H, M, H, H, L, H, M, M, M, H, ~
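The index-based 70/30 split used above can be checked on a stand-in data frame. This is only a sketch: the toy data frame is hypothetical and merely has the same number of rows (480) as the xAPI data, so the resulting sizes match the 336 training rows shown in the glimpse() output.

```r
# Stand-in data frame with the same number of rows as the xAPI data (480).
toy <- data.frame(id = seq_len(480))

set.seed(123)
n_train   <- floor(0.70 * nrow(toy))            # 336 rows for training
train_idx <- sample(seq_len(nrow(toy)), size = n_train, replace = FALSE)

toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]    # the remaining 144 rows

c(train = nrow(toy_train), test = nrow(toy_test))
```

Because the test set is selected with negative indexing, no row can end up in both sets.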
The grade classes in the training set are imbalanced. Because the training set is small (336 rows), we up-sample the smaller classes rather than discard rows from the largest one.
table(train$Grade)
L M H
95 142 99
ggplot(train, aes(fill=Grade, x=Grade)) +
geom_bar() +
ggtitle("Number of students in each Grade")
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
x <- train%>%select(-Grade)
y <- train$Grade
up_train <- upSample(x = x, y = y)
table(up_train$Class)
L M H
142 142 142
ggplot(up_train, aes(fill=Class, x=Class)) +
geom_bar() +
ggtitle("Number of rows in each Grade after up-sampling")
The classes are now balanced.
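To make the balancing step concrete: up-sampling draws extra rows, with replacement, from each smaller class until every class is as large as the biggest one. The sketch below is a base-R approximation of that idea, not caret's implementation. The helper upsample_base and the toy data are hypothetical; only the class counts (95, 142, 99) are taken from the training set.

```r
# Hypothetical helper: a base-R approximation of up-sampling, not caret's code.
upsample_base <- function(df, label) {
  groups <- split(df, df[[label]])
  target <- max(vapply(groups, nrow, integer(1)))   # size of the largest class
  resampled <- lapply(groups, function(g) {
    extra <- sample(seq_len(nrow(g)), target - nrow(g), replace = TRUE)
    g[c(seq_len(nrow(g)), extra), , drop = FALSE]   # keep originals, add duplicates
  })
  do.call(rbind, resampled)
}

# Toy data with the same class counts as the training set (95 / 142 / 99).
set.seed(123)
toy <- data.frame(
  x     = rnorm(336),
  Grade = factor(rep(c("L", "M", "H"), times = c(95, 142, 99)),
                 levels = c("L", "M", "H"))
)

balanced <- upsample_base(toy, "Grade")
table(balanced$Grade)
```

Every original row is kept, so the added rows are exact duplicates of existing students. This is worth remembering when reading any error estimate computed on the up-sampled data.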
library(tree)
Registered S3 method overwritten by 'tree':
method from
print.tree cli
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
grades_tree <- tree(Class~., up_train)
cv_grades <- cv.tree(grades_tree, FUN=prune.misclass)
optimal_nodes <- cv_grades$size[which.min(cv_grades$dev)]
prune_grades <- prune.misclass(grades_tree, best = optimal_nodes)
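One detail of the size selection above: which.min() returns the first minimum, so when several tree sizes tie on cross-validated error, the chosen size depends on the order in which sizes are listed. The sketch below uses made-up numbers (cv_toy is hypothetical, shaped like a cv.tree() result) to show an alternative that prefers the smallest tied tree.

```r
# Hypothetical cross-validation results, shaped like the output of cv.tree():
# `size` is the number of terminal nodes, `dev` the CV misclassification count.
cv_toy <- list(size = c(12, 9, 6, 3, 1), dev = c(110, 104, 104, 131, 280))

# First minimum, as in the analysis above.
first_min <- cv_toy$size[which.min(cv_toy$dev)]

# Alternative: among tied sizes, prefer the smallest (simplest) tree.
smallest_min <- min(cv_toy$size[cv_toy$dev == min(cv_toy$dev)])

c(first_min = first_min, smallest_min = smallest_min)
```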
predict_gradestree <- predict(prune_grades, newdata=test, type="class")
confusionMatrix(predict_gradestree, grades_actual, mode='everything')
Confusion Matrix and Statistics
Reference
Prediction L M H
L 22 5 1
M 10 51 17
H 0 13 25
Overall Statistics
Accuracy : 0.6806
95% CI : (0.5978, 0.7557)
No Information Rate : 0.4792
P-Value [Acc > NIR] : 8.327e-07
Kappa : 0.4835
Mcnemar's Test P-Value : 0.3618
Statistics by Class:
Class: L Class: M Class: H
Sensitivity 0.6875 0.7391 0.5814
Specificity 0.9464 0.6400 0.8713
Pos Pred Value 0.7857 0.6538 0.6579
Neg Pred Value 0.9138 0.7273 0.8302
Precision 0.7857 0.6538 0.6579
Recall 0.6875 0.7391 0.5814
F1 0.7333 0.6939 0.6173
Prevalence 0.2222 0.4792 0.2986
Detection Rate 0.1528 0.3542 0.1736
Detection Prevalence 0.1944 0.5417 0.2639
Balanced Accuracy 0.8170 0.6896 0.7263
The pruned decision tree reaches a test accuracy of 68.06%. Its weakest class is H (students in the highest score range), with an F1 score of 61.73%.
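The per-class figures reported by confusionMatrix() can be reproduced by hand from the confusion matrix, which may help when reading the remaining model outputs. The sketch below re-enters the decision tree's matrix from the output above and recomputes precision, recall, F1 and accuracy in base R.

```r
# Decision tree confusion matrix from the output above
# (rows = predicted class, columns = actual class).
cm <- matrix(
  c(22,  5,  1,
    10, 51, 17,
     0, 13, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("L", "M", "H"), Reference = c("L", "M", "H"))
)

precision <- diag(cm) / rowSums(cm)   # correct predictions / all predictions of that class
recall    <- diag(cm) / colSums(cm)   # correct predictions / all actual members of that class
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(diag(cm)) / sum(cm)

round(rbind(precision, recall, f1), 4)
round(accuracy, 4)
```

The F1 row comes out as 0.7333, 0.6939 and 0.6173 for L, M and H, and the accuracy as 0.6806, matching the reported statistics.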
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:gridExtra’:
combine
The following object is masked from ‘package:ggplot2’:
margin
The following object is masked from ‘package:dplyr’:
combine
#training the model
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
grades_rf <- randomForest(Class~., data=up_train, importance=TRUE)
grades_rf
Call:
randomForest(formula = Class ~ ., data = up_train, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 15.02%
Confusion matrix:
L M H class.error
L 134 8 0 0.05633803
M 12 108 22 0.23943662
H 0 22 120 0.15492958
importance(grades_rf, type=1, scale = F)
MeanDecreaseAccuracy
gender 0.011084420
Nationality 0.023522779
PlaceofBirth 0.019591909
Education 0.004608699
GradeLevel 0.021099785
Classroom 0.005401456
Topic 0.035241177
Semester 0.002425242
Relation 0.028116559
RaisedHands 0.094552762
VisitedResources 0.126330957
AnnouncementsViewed 0.060741853
Discussion 0.027718799
ParentAnsweringSurvey 0.046165198
ParentschoolSatisfaction 0.014272386
StudentAbsenceDays 0.170948524
varImpPlot(grades_rf)
The out-of-bag (OOB) error estimate for the trained model is 15.02%, which corresponds to an estimated accuracy of about 84.98%. Because up_train contains duplicated rows from up-sampling, this estimate is likely optimistic.
According to the Mean Decrease Accuracy graph, the three most important variables, meaning those whose permutation reduces the model’s accuracy the most, are whether a student was absent for more or fewer than seven days (StudentAbsenceDays), the number of times a student visited resources (VisitedResources), and the number of times a student raised their hand in class (RaisedHands).
The three least important variables are the semester (Semester), the education level (Education) and the classroom section (Classroom).
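Mean decrease in accuracy is a permutation-based measure: shuffle one variable, re-score the model, and record how much accuracy drops. randomForest does this per tree on out-of-bag samples. The sketch below is a deliberately simplified illustration in which a fixed rule stands in for the fitted forest. The toy data, the column names absences, noise and passed, and the helper functions are all hypothetical.

```r
set.seed(123)
n <- 200
toy <- data.frame(
  absences = runif(n, 0, 14),   # informative (hypothetical) predictor
  noise    = runif(n)           # predictor the rule never uses
)
toy$passed <- toy$absences < 7

# Stand-in "model": a fixed rule instead of a fitted forest.
predict_rule <- function(df) df$absences < 7
accuracy     <- function(df) mean(predict_rule(df) == df$passed)

baseline <- accuracy(toy)

# Importance of a variable = baseline accuracy minus accuracy after shuffling it.
perm_importance <- sapply(c("absences", "noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - accuracy(shuffled)
})
perm_importance
```

Shuffling the unused noise column leaves accuracy unchanged, so its importance is exactly zero. Shuffling absences breaks the link to the outcome, which produces a clear drop.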
#test the model
predict_gradesrf <- predict(grades_rf, newdata=test)
confusionMatrix(predict_gradesrf, grades_actual, mode='everything')
Confusion Matrix and Statistics
Reference
Prediction L M H
L 26 7 0
M 6 48 10
H 0 14 33
Overall Statistics
Accuracy : 0.7431
95% CI : (0.6636, 0.8122)
No Information Rate : 0.4792
P-Value [Acc > NIR] : 1.019e-10
Kappa : 0.5977
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: L Class: M Class: H
Sensitivity 0.8125 0.6957 0.7674
Specificity 0.9375 0.7867 0.8614
Pos Pred Value 0.7879 0.7500 0.7021
Neg Pred Value 0.9459 0.7375 0.8969
Precision 0.7879 0.7500 0.7021
Recall 0.8125 0.6957 0.7674
F1 0.8000 0.7218 0.7333
Prevalence 0.2222 0.4792 0.2986
Detection Rate 0.1806 0.3333 0.2292
Detection Prevalence 0.2292 0.4444 0.3264
Balanced Accuracy 0.8750 0.7412 0.8144
When the model is applied to the test data, the accuracy drops to 74.31%, well below the OOB estimate of 84.98%, which suggests some overfitting during training. Judging by F1 score, the model predicts students with medium grades (M) least accurately (72.18%).
library(gbm)
package ‘gbm’ was built under R version 4.1.2
Loaded gbm 2.1.8
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
grades_boost <- gbm(Class ~ . ,data = up_train, distribution = "multinomial", n.trees = 500, shrinkage = 0.01, interaction.depth = 4)
Setting `distribution = "multinomial"` is ill-advised as it is currently broken. It exists only for backwards compatibility. Use at your own risk.
grades_boost
gbm(formula = Class ~ ., distribution = "multinomial",
data = up_train, n.trees = 500, interaction.depth = 4, shrinkage = 0.01)
A gradient boosted model with multinomial loss function.
500 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
summary(grades_boost)
According to the variable importance table, the three most important and three least important variables are the same as for Random Forest. However, the order of the top three differs: VisitedResources is now the most important variable instead of StudentAbsenceDays.
predict_gradesboost <- predict(grades_boost, test)
Using 500 trees...
predictions <- colnames(predict_gradesboost)[apply(predict_gradesboost, 1, which.max)]
# Use the reference factor's level order (L, M, H) so both factors match
confusionMatrix(factor(predictions, levels = levels(grades_actual)), grades_actual, mode='everything')
Confusion Matrix and Statistics
Reference
Prediction L M H
L 25 8 0
M 7 44 11
H 0 17 32
Overall Statistics
Accuracy : 0.7014
95% CI : (0.6196, 0.7747)
No Information Rate : 0.4792
P-Value [Acc > NIR] : 5.545e-08
Kappa : 0.5343
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: L Class: M Class: H
Sensitivity 0.7812 0.6377 0.7442
Specificity 0.9286 0.7600 0.8317
Pos Pred Value 0.7576 0.7097 0.6531
Neg Pred Value 0.9369 0.6951 0.8842
Precision 0.7576 0.7097 0.6531
Recall 0.7812 0.6377 0.7442
F1 0.7692 0.6718 0.6957
Prevalence 0.2222 0.4792 0.2986
Detection Rate 0.1736 0.3056 0.2222
Detection Prevalence 0.2292 0.4306 0.3403
Balanced Accuracy 0.8549 0.6988 0.7879
The overall accuracy is 70.14%. The lowest F1 score is 67.18%, for class M (students in the medium score range).
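The step that turns the boosted model's per-class scores into class labels can be sketched in isolation. The score matrix below is hypothetical. The point is that as.factor() would order the levels alphabetically (H, L, M), whereas setting the levels explicitly keeps the L, M, H order of the reference factor.

```r
# Hypothetical class scores for four students (columns = classes).
scores <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.6, 0.3,
    0.2, 0.3, 0.5,
    0.4, 0.5, 0.1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("L", "M", "H"))
)

# Pick the highest-scoring class per row, then fix the level order so that it
# matches the reference factor (L, M, H) instead of alphabetical order.
labels <- colnames(scores)[apply(scores, 1, which.max)]
labels <- factor(labels, levels = c("L", "M", "H"))
labels
```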
library(e1071)
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
# `type` and `kernel` are the argument names svm() recognises
grades_svm <- svm(Class~., data=up_train,
type="C-classification", kernel="radial",
gamma=0.1, cost=10)
summary(grades_svm)
Call:
svm(formula = Class ~ ., data = up_train, type = "C-classification", kernel = "radial",
gamma = 0.1, cost = 10)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
Number of Support Vectors: 252
( 47 117 88 )
Number of Classes: 3
Levels:
L M H
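For intuition about the gamma setting, the radial kernel measures similarity between two observations as exp(-gamma * squared distance). The sketch below implements that formula directly in base R. The helper rbf_kernel and the example points are hypothetical and not part of e1071.

```r
# Radial (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2)
rbf_kernel <- function(x, z, gamma = 0.1) {
  exp(-gamma * sum((x - z)^2))
}

a <- c(0, 0)
b <- c(1, 3)                    # squared distance to `a` is 1 + 9 = 10

rbf_kernel(a, a)                # identical points give 1
rbf_kernel(a, b)                # exp(-0.1 * 10) = exp(-1)
rbf_kernel(a, b, gamma = 1)     # a larger gamma makes similarity decay faster
```

A small gamma such as 0.1 lets relatively distant observations still influence each other, which tends to give a smoother decision boundary.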
predict_gradessvm <- predict(grades_svm, test)
confusionMatrix(predict_gradessvm, grades_actual, mode='everything')
Confusion Matrix and Statistics
Reference
Prediction L M H
L 26 5 0
M 6 55 9
H 0 9 34
Overall Statistics
Accuracy : 0.7986
95% CI : (0.7237, 0.8608)
No Information Rate : 0.4792
P-Value [Acc > NIR] : 3.046e-15
Kappa : 0.6804
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: L Class: M Class: H
Sensitivity 0.8125 0.7971 0.7907
Specificity 0.9554 0.8000 0.9109
Pos Pred Value 0.8387 0.7857 0.7907
Neg Pred Value 0.9469 0.8108 0.9109
Precision 0.8387 0.7857 0.7907
Recall 0.8125 0.7971 0.7907
F1 0.8254 0.7914 0.7907
Prevalence 0.2222 0.4792 0.2986
Detection Rate 0.1806 0.3819 0.2361
Detection Prevalence 0.2153 0.4861 0.2986
Balanced Accuracy 0.8839 0.7986 0.8508
SVM performs better than Random Forest, with an accuracy of 79.86% versus 74.31%. It also predicts class H better, with an F1 score of 79.07% versus 73.33%.
The best model for predicting students’ grade classes is SVM, with the highest accuracy of 79.86%. It predicts class L (students in the lowest score range) best, with an F1 score of 82.54%, and class H (students in the highest score range) worst, with an F1 score of 79.07%.
The three most important features for predicting students’ grade classes are StudentAbsenceDays, VisitedResources and RaisedHands, according to the feature importance from both Random Forest and Gradient Boosting. These features are likely, though not verified here, to matter in the SVM as well.
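The model comparison behind this conclusion can be tabulated directly. The sketch below simply re-enters the test-set accuracies reported in the sections above and sorts them. The object names results and best_model are illustrative.

```r
# Test-set accuracies reported in the sections above.
results <- data.frame(
  model    = c("Decision Tree", "Random Forest", "Gradient Boosting", "SVM"),
  accuracy = c(0.6806, 0.7431, 0.7014, 0.7986),
  stringsAsFactors = FALSE
)

# Sort from best to worst and pick the top model.
results <- results[order(results$accuracy, decreasing = TRUE), ]
results
best_model <- results$model[1]
best_model
```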