The data used in this analysis is from https://www.kaggle.com/aljarah/xAPI-Edu-Data.

Data Pre-Processing

library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
non-uniform 'Rounding' sampler used
library(gridExtra)

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine
grades <- read.csv("xAPI-Edu-Data.csv", stringsAsFactors = TRUE)
grades$Class <- factor(grades$Class, levels = c("L", "M", "H"))
head(grades, 10)
summary(grades)
 gender     NationalITy       PlaceofBirth         StageID       GradeID    SectionID     Topic    
 F:175   KW       :179   KuwaIT     :180   HighSchool  : 33   G-02   :147   A:283     IT     : 95  
 M:305   Jordan   :172   Jordan     :176   lowerlevel  :199   G-08   :116   B:167     French : 65  
         Palestine: 28   Iraq       : 22   MiddleSchool:248   G-07   :101   C: 30     Arabic : 59  
         Iraq     : 22   lebanon    : 19                      G-04   : 48             Science: 51  
         lebanon  : 17   SaudiArabia: 16                      G-06   : 32             English: 45  
         Tunis    : 12   USA        : 16                      G-11   : 13             Biology: 30  
         (Other)  : 50   (Other)    : 51                      (Other): 23             (Other):135  
 Semester   Relation    raisedhands     VisITedResources AnnouncementsView   Discussion   
 F:245    Father:283   Min.   :  0.00   Min.   : 0.0     Min.   : 0.00     Min.   : 1.00  
 S:235    Mum   :197   1st Qu.: 15.75   1st Qu.:20.0     1st Qu.:14.00     1st Qu.:20.00  
                       Median : 50.00   Median :65.0     Median :33.00     Median :39.00  
                       Mean   : 46.77   Mean   :54.8     Mean   :37.92     Mean   :43.28  
                       3rd Qu.: 75.00   3rd Qu.:84.0     3rd Qu.:58.00     3rd Qu.:70.00  
                       Max.   :100.00   Max.   :99.0     Max.   :98.00     Max.   :99.00  
                                                                                          
 ParentAnsweringSurvey ParentschoolSatisfaction StudentAbsenceDays Class  
 No :210               Bad :188                 Above-7:191        L:127  
 Yes:270               Good:292                 Under-7:289        M:211  
                                                                   H:142  
                                                                          
                                                                          
                                                                          
                                                                          
glimpse(grades)
Rows: 480
Columns: 17
$ gender                   <fct> M, M, M, M, M, F, M, M, F, F, M, M, M, M, F, F, M, M, F, M, F, F, ~
$ NationalITy              <fct> KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, KW, lebanon, KW, K~
$ PlaceofBirth             <fct> KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, KuwaIT, Ku~
$ StageID                  <fct> lowerlevel, lowerlevel, lowerlevel, lowerlevel, lowerlevel, lowerl~
$ GradeID                  <fct> G-04, G-04, G-04, G-04, G-04, G-04, G-07, G-07, G-07, G-07, G-07, ~
$ SectionID                <fct> A, A, A, A, A, A, A, A, A, B, A, B, A, A, A, A, B, A, A, B, A, B, ~
$ Topic                    <fct> IT, IT, IT, IT, IT, IT, Math, Math, Math, IT, Math, Math, IT, Math~
$ Semester                 <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, ~
$ Relation                 <fct> Father, Father, Father, Father, Father, Father, Father, Father, Fa~
$ raisedhands              <int> 15, 20, 10, 30, 40, 42, 35, 50, 12, 70, 50, 19, 5, 20, 62, 30, 36,~
$ VisITedResources         <int> 16, 20, 7, 25, 50, 30, 12, 10, 21, 80, 88, 6, 1, 14, 70, 40, 30, 1~
$ AnnouncementsView        <int> 2, 3, 0, 5, 12, 13, 0, 15, 16, 25, 30, 19, 0, 12, 44, 22, 20, 35, ~
$ Discussion               <int> 20, 25, 30, 35, 50, 70, 17, 22, 50, 70, 80, 12, 11, 19, 60, 66, 80~
$ ParentAnsweringSurvey    <fct> Yes, Yes, No, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No, No~
$ ParentschoolSatisfaction <fct> Good, Good, Bad, Bad, Bad, Bad, Bad, Good, Good, Good, Good, Good,~
$ StudentAbsenceDays       <fct> Under-7, Under-7, Above-7, Above-7, Above-7, Above-7, Above-7, Und~
$ Class                    <fct> M, M, L, L, M, M, L, M, M, M, H, M, L, L, H, M, M, M, M, H, M, M, ~
anyNA(grades)
[1] FALSE

There are no missing values in the dataframe.

grades_cleaned <- grades %>%
  rename(RaisedHands = raisedhands, VisitedResources = VisITedResources,
         AnnouncementsViewed = AnnouncementsView, Nationality = NationalITy,
         Education = StageID, GradeLevel = GradeID, Classroom = SectionID, Grade = Class)
head(grades_cleaned)

The four numeric variables (RaisedHands, VisitedResources, AnnouncementsViewed and Discussion) have mainly bimodal distributions, but we won’t know whether this negatively impacts the models until we train them.

dist1 <- ggplot(data=grades_cleaned, aes(RaisedHands)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of raised hands")
dist2 <- ggplot(data=grades_cleaned, aes(VisitedResources)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of visited resources")
dist3 <- ggplot(data=grades_cleaned, aes(AnnouncementsViewed)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of viewed announcements")
dist4 <- ggplot(data=grades_cleaned, aes(Discussion)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of discussion participation")
grid.arrange(dist1, dist2, dist3, dist4, nrow=2)

library(caret)
Loading required package: lattice
# sample 70% of the rows for training; the rest form the test set
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
sample_index <- sample(seq_len(nrow(grades_cleaned)), size = floor(0.70*nrow(grades_cleaned)), replace = FALSE)
train <- grades_cleaned[sample_index,]
test <- grades_cleaned[-sample_index,]
grades_actual <- test$Grade
head(train)
glimpse(train)
Rows: 336
Columns: 17
$ gender                   <fct> F, M, M, F, M, F, M, F, M, M, M, M, M, M, F, M, M, M, M, M, M, F, ~
$ Nationality              <fct> Jordan, Jordan, KW, Jordan, Jordan, KW, KW, Jordan, KW, KW, Jordan~
$ PlaceofBirth             <fct> Egypt, Jordan, KuwaIT, Jordan, Jordan, KuwaIT, KuwaIT, Jordan, Kuw~
$ Education                <fct> MiddleSchool, lowerlevel, MiddleSchool, MiddleSchool, MiddleSchool~
$ GradeLevel               <fct> G-07, G-02, G-08, G-08, G-08, G-07, G-04, G-08, G-04, G-08, G-08, ~
$ Classroom                <fct> A, B, A, A, A, B, A, A, A, C, A, C, A, A, A, B, B, B, A, A, B, A, ~
$ Topic                    <fct> Quran, Arabic, Arabic, Chemistry, Geology, IT, Math, Geology, Hist~
$ Semester                 <fct> F, S, S, S, S, F, S, F, S, S, S, S, S, S, F, F, F, F, S, F, S, S, ~
$ Relation                 <fct> Mum, Mum, Father, Mum, Mum, Father, Father, Mum, Father, Father, M~
$ RaisedHands              <int> 100, 32, 15, 84, 71, 10, 15, 70, 10, 5, 81, 87, 50, 11, 70, 88, 11~
$ VisitedResources         <int> 80, 82, 43, 92, 84, 12, 6, 69, 17, 21, 84, 81, 90, 70, 4, 90, 2, 5~
$ AnnouncementsViewed      <int> 95, 59, 42, 29, 67, 4, 32, 46, 12, 42, 77, 42, 83, 32, 39, 76, 0, ~
$ Discussion               <int> 90, 63, 33, 43, 80, 80, 40, 45, 14, 14, 85, 19, 13, 29, 90, 81, 50~
$ ParentAnsweringSurvey    <fct> No, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes, ~
$ ParentschoolSatisfaction <fct> Bad, Bad, Good, Good, Good, Bad, Good, Good, Bad, Good, Good, Good~
$ StudentAbsenceDays       <fct> Under-7, Above-7, Under-7, Under-7, Under-7, Under-7, Under-7, Abo~
$ Grade                    <fct> H, M, M, H, M, M, H, M, L, L, H, H, H, M, H, H, L, H, M, M, M, H, ~

The grade classes are currently unbalanced. We oversample the smaller classes instead of discarding rows from the largest one, because the training set is small (336 rows).

table(train$Grade)

  L   M   H 
 95 142  99 
ggplot(train, aes(fill=Grade, x=Grade)) +
  geom_bar() +
  ggtitle("Number of students in each grade class")

RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
x <- train%>%select(-Grade)
y <- train$Grade
up_train <- upSample(x = x, y = y)                         
table(up_train$Class)

  L   M   H 
142 142 142 
ggplot(up_train, aes(fill=Class, x=Class)) +
  geom_bar() +
  ggtitle("Number of students in each grade class (upsampled)")

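The classes are now balanced. Note that upSample() names the outcome column Class by default, which is why the models below use Class rather than Grade.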

Decision Tree

library(tree)
Registered S3 method overwritten by 'tree':
  method     from
  print.tree cli 
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
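# grow the full classification tree on the upsampled training data
grades_tree <- tree(Class~., up_train)
# cross-validate the pruning sequence, scored by misclassifications
cv_grades <- cv.tree(grades_tree, FUN=prune.misclass)
# keep the tree size with the lowest cross-validated error
optimal_nodes <- cv_grades$size[which.min(cv_grades$dev)]
prune_grades <- prune.misclass(grades_tree, best = optimal_nodes)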
predict_gradestree <- predict(prune_grades, newdata=test, type="class")
confusionMatrix(predict_gradestree, grades_actual, mode='everything')
Confusion Matrix and Statistics

          Reference
Prediction  L  M  H
         L 22  5  1
         M 10 51 17
         H  0 13 25

Overall Statistics
                                          
               Accuracy : 0.6806          
                 95% CI : (0.5978, 0.7557)
    No Information Rate : 0.4792          
    P-Value [Acc > NIR] : 8.327e-07       
                                          
                  Kappa : 0.4835          
                                          
 Mcnemar's Test P-Value : 0.3618          

Statistics by Class:

                     Class: L Class: M Class: H
Sensitivity            0.6875   0.7391   0.5814
Specificity            0.9464   0.6400   0.8713
Pos Pred Value         0.7857   0.6538   0.6579
Neg Pred Value         0.9138   0.7273   0.8302
Precision              0.7857   0.6538   0.6579
Recall                 0.6875   0.7391   0.5814
F1                     0.7333   0.6939   0.6173
Prevalence             0.2222   0.4792   0.2986
Detection Rate         0.1528   0.3542   0.1736
Detection Prevalence   0.1944   0.5417   0.2639
Balanced Accuracy      0.8170   0.6896   0.7263

The pruned decision tree reaches 68.06% accuracy on the test set. Its lowest F1 score, 61.73%, is for class H (students in the highest score range).

Random Forest (Bagging)

library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:gridExtra':

    combine

The following object is masked from 'package:ggplot2':

    margin

The following object is masked from 'package:dplyr':

    combine
#training the model
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
grades_rf <- randomForest(Class~., data=up_train, importance=TRUE)
grades_rf

Call:
 randomForest(formula = Class ~ ., data = up_train, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 15.02%
Confusion matrix:
    L   M   H class.error
L 134   8   0  0.05633803
M  12 108  22  0.23943662
H   0  22 120  0.15492958
importance(grades_rf, type=1, scale = F)
                         MeanDecreaseAccuracy
gender                            0.011084420
Nationality                       0.023522779
PlaceofBirth                      0.019591909
Education                         0.004608699
GradeLevel                        0.021099785
Classroom                         0.005401456
Topic                             0.035241177
Semester                          0.002425242
Relation                          0.028116559
RaisedHands                       0.094552762
VisitedResources                  0.126330957
AnnouncementsViewed               0.060741853
Discussion                        0.027718799
ParentAnsweringSurvey             0.046165198
ParentschoolSatisfaction          0.014272386
StudentAbsenceDays                0.170948524
varImpPlot(grades_rf)

The out-of-bag (OOB) error estimate is 15.02%, which corresponds to an estimated accuracy of about 84.98%.

According to the mean decrease in accuracy, the three most important variables, i.e. those whose permutation reduces the model's accuracy the most, are the student's absence category (StudentAbsenceDays, above or under 7 days), the number of times the student visited resources (VisitedResources), and the number of times the student raised their hand in class (RaisedHands).

The three least important variables are the semester (Semester), the education stage (Education) and the classroom section (Classroom).

#test the model
predict_gradesrf <- predict(grades_rf, newdata=test)
confusionMatrix(predict_gradesrf, grades_actual, mode='everything')
Confusion Matrix and Statistics

          Reference
Prediction  L  M  H
         L 26  7  0
         M  6 48 10
         H  0 14 33

Overall Statistics
                                          
               Accuracy : 0.7431          
                 95% CI : (0.6636, 0.8122)
    No Information Rate : 0.4792          
    P-Value [Acc > NIR] : 1.019e-10       
                                          
                  Kappa : 0.5977          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: L Class: M Class: H
Sensitivity            0.8125   0.6957   0.7674
Specificity            0.9375   0.7867   0.8614
Pos Pred Value         0.7879   0.7500   0.7021
Neg Pred Value         0.9459   0.7375   0.8969
Precision              0.7879   0.7500   0.7021
Recall                 0.8125   0.6957   0.7674
F1                     0.8000   0.7218   0.7333
Prevalence             0.2222   0.4792   0.2986
Detection Rate         0.1806   0.3333   0.2292
Detection Prevalence   0.2292   0.4444   0.3264
Balanced Accuracy      0.8750   0.7412   0.8144

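When the model is applied to the test data, the accuracy drops to 74.31%, about 10.7 percentage points below the OOB estimate (84.98%). This suggests some overfitting during training. The OOB estimate is also likely to be optimistic here, because upsampling duplicates rows, and a duplicate of an out-of-bag row can still sit in that tree's bootstrap sample. Judging by F1 score, the model predicts class M (students in the medium score range) least accurately, at 72.18%.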

Gradient Boosting

library(gbm)
package 'gbm' was built under R version 4.1.2
Loaded gbm 2.1.8
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
grades_boost <- gbm(Class ~ ., data = up_train, distribution = "multinomial",
                    n.trees = 500, shrinkage = 0.01, interaction.depth = 4)
Setting `distribution = "multinomial"` is ill-advised as it is currently broken. It exists only for backwards compatibility. Use at your own risk.
grades_boost
gbm(formula = Class ~ ., distribution = "multinomial", 
    data = up_train, n.trees = 500, interaction.depth = 4, shrinkage = 0.01)
A gradient boosted model with multinomial loss function.
500 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
summary(grades_boost)

According to the variable importance table, the three most important and three least important variables are the same as for Random Forest (Bagging). However, the order of the top three differs: VisitedResources is now the most important variable, ahead of StudentAbsenceDays.

predict_gradesboost <- predict(grades_boost, newdata = test, n.trees = 500)
# pick the highest-scoring class per row, keeping the reference level order (L, M, H)
predictions <- colnames(predict_gradesboost)[apply(predict_gradesboost, 1, which.max)]
predictions <- factor(predictions, levels = levels(grades_actual))
confusionMatrix(predictions, grades_actual, mode='everything')
Confusion Matrix and Statistics

          Reference
Prediction  L  M  H
         L 25  8  0
         M  7 44 11
         H  0 17 32

Overall Statistics
                                          
               Accuracy : 0.7014          
                 95% CI : (0.6196, 0.7747)
    No Information Rate : 0.4792          
    P-Value [Acc > NIR] : 5.545e-08       
                                          
                  Kappa : 0.5343          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: L Class: M Class: H
Sensitivity            0.7812   0.6377   0.7442
Specificity            0.9286   0.7600   0.8317
Pos Pred Value         0.7576   0.7097   0.6531
Neg Pred Value         0.9369   0.6951   0.8842
Precision              0.7576   0.7097   0.6531
Recall                 0.7812   0.6377   0.7442
F1                     0.7692   0.6718   0.6957
Prevalence             0.2222   0.4792   0.2986
Detection Rate         0.1736   0.3056   0.2222
Detection Prevalence   0.2292   0.4306   0.3403
Balanced Accuracy      0.8549   0.6988   0.7879

The overall test accuracy is 70.14%. The lowest F1 score, 67.18%, is for class M (students in the medium score range).

Support Vector Machine (SVM)

library(e1071)
RNGkind(sample.kind = "Rounding")
non-uniform 'Rounding' sampler used
set.seed(123)
# C-classification with a radial kernel (both are also the e1071 defaults for a factor response)
grades_svm <- svm(Class~., data=up_train, 
          type="C-classification", kernel="radial", 
          gamma=0.1, cost=10)
summary(grades_svm)

Call:
svm(formula = Class ~ ., data = up_train, type = "C-classification", kernel = "radial", 
    gamma = 0.1, cost = 10)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  10 

Number of Support Vectors:  252

 ( 47 117 88 )


Number of Classes:  3 

Levels: 
 L M H
predict_gradessvm <- predict(grades_svm, test)
confusionMatrix(predict_gradessvm, grades_actual, mode='everything')
Confusion Matrix and Statistics

          Reference
Prediction  L  M  H
         L 26  5  0
         M  6 55  9
         H  0  9 34

Overall Statistics
                                          
               Accuracy : 0.7986          
                 95% CI : (0.7237, 0.8608)
    No Information Rate : 0.4792          
    P-Value [Acc > NIR] : 3.046e-15       
                                          
                  Kappa : 0.6804          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: L Class: M Class: H
Sensitivity            0.8125   0.7971   0.7907
Specificity            0.9554   0.8000   0.9109
Pos Pred Value         0.8387   0.7857   0.7907
Neg Pred Value         0.9469   0.8108   0.9109
Precision              0.8387   0.7857   0.7907
Recall                 0.8125   0.7971   0.7907
F1                     0.8254   0.7914   0.7907
Prevalence             0.2222   0.4792   0.2986
Detection Rate         0.1806   0.3819   0.2361
Detection Prevalence   0.2153   0.4861   0.2986
Balanced Accuracy      0.8839   0.7986   0.8508

SVM reaches 79.86% accuracy on the test set, about 5.5 percentage points above the random forest (74.31%). It also predicts class H better, with an F1 score of 79.07% compared with 73.33% for the random forest.

Conclusion

The best model for predicting students’ grade classes is the SVM, with the highest test accuracy (79.86%). It predicts class L (students in the lowest score range) best, with an F1 score of 82.54%. Its lowest F1 score, 79.07%, is for class H (students in the highest score range), only marginally below class M (79.14%).

The three most important features for predicting students’ grade classes are StudentAbsenceDays, VisitedResources and RaisedHands, according to the feature importance from both Random Forest (Bagging) and Gradient Boosting. These features are likely to matter for the SVM as well, although this was not measured here.

---
title: "Students' Grades Prediction"
output: html_notebook
---

The data used in this analysis is from https://www.kaggle.com/aljarah/xAPI-Edu-Data.

# Data Pre-Processing

```{r}
library(dplyr)
library(ggplot2)
library(gridExtra)
```

```{r}
grades <- read.csv("xAPI-Edu-Data.csv", stringsAsFactors = TRUE)
grades$Class <- factor(grades$Class, levels = c("L", "M", "H"))
head(grades, 10)
```

```{r}
summary(grades)
```

```{r}
glimpse(grades)
```


```{r}
anyNA(grades)
```

There are no missing values in the dataframe.

```{r}
grades_cleaned <- grades %>%
  rename(RaisedHands = raisedhands, VisitedResources = VisITedResources,
         AnnouncementsViewed = AnnouncementsView, Nationality = NationalITy,
         Education = StageID, GradeLevel = GradeID, Classroom = SectionID, Grade = Class)
head(grades_cleaned)
```

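The four numeric variables (`RaisedHands`, `VisitedResources`, `AnnouncementsViewed` and `Discussion`) have mainly bimodal distributions, but we won't know whether this negatively impacts the models until we train them.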

```{r}
dist1 <- ggplot(data=grades_cleaned, aes(RaisedHands)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of raised hands")
```

```{r}
dist2 <- ggplot(data=grades_cleaned, aes(VisitedResources)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of visited resources")
```

```{r}
dist3 <- ggplot(data=grades_cleaned, aes(AnnouncementsViewed)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of viewed announcements")
```

```{r}
dist4 <- ggplot(data=grades_cleaned, aes(Discussion)) + 
  geom_histogram(bins=30) +
  ggtitle("Distribution of discussion participation")
```


```{r}
grid.arrange(dist1, dist2, dist3, dist4, nrow=2)
```

```{r}
library(caret)
```

```{r}
# sample 70% of the rows for training; the rest form the test set
RNGkind(sample.kind = "Rounding")
set.seed(123)
sample_index <- sample(seq_len(nrow(grades_cleaned)), size = floor(0.70*nrow(grades_cleaned)), replace = FALSE)
train <- grades_cleaned[sample_index,]
test <- grades_cleaned[-sample_index,]
grades_actual <- test$Grade
head(train)
glimpse(train)
```

The grade classes are currently unbalanced. We oversample the smaller classes instead of discarding rows from the largest one, because the training set is small (336 rows).

```{r}
table(train$Grade)
ggplot(train, aes(fill=Grade, x=Grade)) +
  geom_bar() +
  ggtitle("Number of students in each grade class")
```


```{r}
RNGkind(sample.kind = "Rounding")
set.seed(123)
x <- train%>%select(-Grade)
y <- train$Grade
up_train <- upSample(x = x, y = y)                         
table(up_train$Class)
ggplot(up_train, aes(fill=Class, x=Class)) +
  geom_bar() +
  ggtitle("Number of students in each grade class (upsampled)")
```
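The classes are now balanced. Note that `upSample()` names the outcome column `Class` by default, which is why the models below use `Class` rather than `Grade`.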
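
The idea behind upsampling can be sketched in base R: every class keeps its original rows, and the smaller classes receive extra rows drawn with replacement until they match the largest class. The `toy` data frame and all variable names below are hypothetical, and this illustrates the idea only, not caret's actual implementation of `upSample()`.

```r
set.seed(123)

# hypothetical toy data: 2 L, 5 M and 3 H students
toy <- data.frame(
  score = 1:10,
  Grade = factor(rep(c("L", "M", "H"), times = c(2, 5, 3)),
                 levels = c("L", "M", "H"))
)

# size of the largest class; every class is grown to this size
target <- max(table(toy$Grade))

# row indices per class, plus extra rows drawn with replacement where needed
rows_by_class <- split(seq_len(nrow(toy)), toy$Grade)
balanced_idx <- unlist(lapply(rows_by_class, function(rows) {
  extra <- rows[sample.int(length(rows), target - length(rows), replace = TRUE)]
  c(rows, extra)
}))

balanced <- toy[balanced_idx, ]
table(balanced$Grade)
```

Because the extra rows are exact copies, upsampling should only ever be applied to the training set, as is done here, so that no copy of a test row leaks into training.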

# Decision Tree

```{r}
library(tree)
```

```{r}
RNGkind(sample.kind = "Rounding")
set.seed(123)
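# grow the full classification tree on the upsampled training data
grades_tree <- tree(Class~., up_train)
# cross-validate the pruning sequence, scored by misclassifications
cv_grades <- cv.tree(grades_tree, FUN=prune.misclass)
# keep the tree size with the lowest cross-validated error
optimal_nodes <- cv_grades$size[which.min(cv_grades$dev)]
prune_grades <- prune.misclass(grades_tree, best = optimal_nodes)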
```
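
One detail of the size selection is worth a small sketch: `which.min()` returns the first minimum, so when several tree sizes tie on cross-validated error, the chosen size depends on the order of the vector. The numbers below are hypothetical (they are not the output of `cv.tree()` on this data) and assume sizes are listed from the largest tree to the smallest.

```r
# hypothetical cross-validation results, listed from the largest tree to the smallest
cv_size <- c(12, 9, 6, 4, 2, 1)
cv_dev  <- c(150, 148, 148, 160, 200, 284)

# which.min() returns the first minimum, i.e. the larger of the tied trees here
first_min <- cv_size[which.min(cv_dev)]

# alternative: among tied sizes, prefer the smallest (simplest) tree
smallest_min <- min(cv_size[cv_dev == min(cv_dev)])

c(first_min = first_min, smallest_min = smallest_min)
```

Preferring the smaller tree among ties is a common parsimony choice, but either rule is defensible; the analysis above uses the `which.min()` rule.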

```{r}
predict_gradestree <- predict(prune_grades, newdata=test, type="class")
confusionMatrix(predict_gradestree, grades_actual, mode='everything')
```
The pruned decision tree reaches 68.06% accuracy on the test set. Its lowest F1 score, 61.73%, is for class H (students in the highest score range).
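
As a check on how these figures are derived, the sketch below recomputes accuracy, precision, recall and F1 in base R from the decision tree's test confusion matrix. The matrix is typed in by hand from the `confusionMatrix()` output (rows are predictions, columns are reference classes); the variable names are illustrative.

```r
# decision tree confusion matrix on the test set (rows: prediction, columns: reference)
cm <- matrix(c(22,  5,  1,
               10, 51, 17,
                0, 13, 25),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("L", "M", "H"),
                             Reference  = c("L", "M", "H")))

accuracy  <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
precision <- diag(cm) / rowSums(cm)    # correct / predicted as that class
recall    <- diag(cm) / colSums(cm)    # correct / actually in that class
f1        <- 2 * precision * recall / (precision + recall)

round(accuracy, 4)
round(f1, 4)
```

For class H, precision is 25/38 and recall is 25/43, so F1 simplifies to 50/81, which rounds to the 61.73% quoted above.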

# Random Forest (Bagging)

```{r}
library(randomForest)
```

```{r}
#training the model
RNGkind(sample.kind = "Rounding")
set.seed(123)
grades_rf <- randomForest(Class~., data=up_train, importance=TRUE)
grades_rf
importance(grades_rf, type=1, scale = F)
varImpPlot(grades_rf)
```

The out-of-bag (OOB) error estimate is 15.02%, which corresponds to an estimated accuracy of about 84.98%.
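
The OOB figures can be reproduced from the OOB confusion matrix that `randomForest` prints, where rows are the true classes. A minimal sketch in base R, with the matrix typed in by hand from that output and illustrative variable names:

```r
# OOB confusion matrix from the fitted forest (rows: true class, columns: OOB prediction)
oob <- matrix(c(134,   8,   0,
                 12, 108,  22,
                  0,  22, 120),
              nrow = 3, byrow = TRUE,
              dimnames = list(True = c("L", "M", "H"),
                              Predicted = c("L", "M", "H")))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # share of misclassified OOB rows
class_error <- 1 - diag(oob) / rowSums(oob)    # per-class error, as in class.error

round(100 * oob_error, 2)
round(class_error, 4)
```

Of the 426 upsampled training rows, 64 are misclassified out of bag, which gives the 15.02% error.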

According to the mean decrease in accuracy, the three most important variables, i.e. those whose permutation reduces the model's accuracy the most, are the student's absence category (`StudentAbsenceDays`, above or under 7 days), the number of times the student visited resources (`VisitedResources`), and the number of times the student raised their hand in class (`RaisedHands`).

The three least important variables are the semester (`Semester`), the education stage (`Education`) and the classroom section (`Classroom`).
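
The ranking can be read off programmatically rather than from the plot. The sketch below sorts the unscaled mean-decrease-accuracy values; they are copied by hand from the `importance()` output so the example is self-contained, and the vector name `imp` is illustrative.

```r
# unscaled mean decrease in accuracy, copied from importance(grades_rf, type=1, scale = F)
imp <- c(gender = 0.011084420, Nationality = 0.023522779, PlaceofBirth = 0.019591909,
         Education = 0.004608699, GradeLevel = 0.021099785, Classroom = 0.005401456,
         Topic = 0.035241177, Semester = 0.002425242, Relation = 0.028116559,
         RaisedHands = 0.094552762, VisitedResources = 0.126330957,
         AnnouncementsViewed = 0.060741853, Discussion = 0.027718799,
         ParentAnsweringSurvey = 0.046165198, ParentschoolSatisfaction = 0.014272386,
         StudentAbsenceDays = 0.170948524)

top3    <- head(sort(imp, decreasing = TRUE), 3)   # most important first
bottom3 <- head(sort(imp), 3)                      # least important first

names(top3)
names(bottom3)
```

With the fitted model available, the same ordering could be obtained by sorting the first column of the `importance()` matrix directly.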

```{r}
#test the model
predict_gradesrf <- predict(grades_rf, newdata=test)
confusionMatrix(predict_gradesrf, grades_actual, mode='everything')
```

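When the model is applied to the test data, the accuracy drops to 74.31%, about 10.7 percentage points below the OOB estimate (84.98%). This suggests some overfitting during training. The OOB estimate is also likely to be optimistic here, because upsampling duplicates rows, and a duplicate of an out-of-bag row can still sit in that tree's bootstrap sample. Judging by F1 score, the model predicts class M (students in the medium score range) least accurately, at 72.18%.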

# Gradient Boosting
```{r}
library(gbm)
```

```{r}
RNGkind(sample.kind = "Rounding")
set.seed(123)
grades_boost <- gbm(Class ~ ., data = up_train, distribution = "multinomial",
                    n.trees = 500, shrinkage = 0.01, interaction.depth = 4)
grades_boost
summary(grades_boost)
```

According to the variable importance table, the three most important and three least important variables are the same as for Random Forest (Bagging). However, the order of the top three differs: `VisitedResources` is now the most important variable, ahead of `StudentAbsenceDays`.

```{r}
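predict_gradesboost <- predict(grades_boost, newdata = test, n.trees = 500)
# pick the highest-scoring class per row, keeping the reference level order (L, M, H)
predictions <- colnames(predict_gradesboost)[apply(predict_gradesboost, 1, which.max)]
predictions <- factor(predictions, levels = levels(grades_actual))
confusionMatrix(predictions, grades_actual, mode='everything')
```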
```
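
Turning per-class scores into labels has one pitfall that a small sketch makes visible: `as.factor()` orders levels alphabetically (H, L, M), which differs from the L, M, H order of the reference factor. The score matrix below is hypothetical and only stands in for the model's prediction output.

```r
# hypothetical class scores for three students (rows) and three classes (columns)
scores <- matrix(c(0.7, 0.2, 0.1,
                   0.1, 0.3, 0.6,
                   0.2, 0.5, 0.3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("L", "M", "H")))

# label each row with the column that has the highest score
labels <- colnames(scores)[apply(scores, 1, which.max)]

# as.factor() sorts levels alphabetically
alphabetical <- levels(as.factor(labels))

# factor() with explicit levels keeps the intended low-to-high order
ordered_pred <- factor(labels, levels = c("L", "M", "H"))

labels
alphabetical
levels(ordered_pred)
```

Setting the levels explicitly keeps the rows and columns of the confusion matrix in the same L, M, H order for every model, without relying on caret to reorder them.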

The overall test accuracy is 70.14%. The lowest F1 score, 67.18%, is for class M (students in the medium score range).

# Support Vector Machine (SVM)

```{r}
library(e1071)
```

```{r}
RNGkind(sample.kind = "Rounding")
set.seed(123)
# C-classification with a radial kernel (both are also the e1071 defaults for a factor response)
grades_svm <- svm(Class~., data=up_train, 
          type="C-classification", kernel="radial", 
          gamma=0.1, cost=10)
summary(grades_svm)
predict_gradessvm <- predict(grades_svm, test)
confusionMatrix(predict_gradessvm, grades_actual, mode='everything')
```

SVM reaches 79.86% accuracy on the test set, about 5.5 percentage points above the random forest (74.31%). It also predicts class H better, with an F1 score of 79.07% compared with 73.33% for the random forest.

# Conclusion
The best model for predicting students' grade classes is the SVM, with the highest test accuracy (79.86%). It predicts class L (students in the lowest score range) best, with an F1 score of 82.54%. Its lowest F1 score, 79.07%, is for class H (students in the highest score range), only marginally below class M (79.14%).
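
The comparison behind this conclusion can be tabulated from the accuracy and lowest per-class F1 values reported in each section. The sketch below also recomputes an exact binomial 95% interval for the SVM accuracy (115 correct out of 144 test rows) with `binom.test()`; the data frame and its column names are illustrative.

```r
# test-set results reported above for each model
results <- data.frame(
  model    = c("Decision tree", "Random forest", "Gradient boosting", "SVM"),
  accuracy = c(0.6806, 0.7431, 0.7014, 0.7986),
  min_f1   = c(0.6173, 0.7218, 0.6718, 0.7907)
)

# rank the models from most to least accurate
ranked <- results[order(results$accuracy, decreasing = TRUE), ]
ranked

# exact binomial 95% interval for the SVM accuracy: 115 correct out of 144 test rows
svm_ci <- binom.test(115, 144)$conf.int
svm_ci
```

The intervals reported by `confusionMatrix()` for the SVM (0.7237 to 0.8608) and the random forest (0.6636 to 0.8122) overlap, so on a 144-row test set the ranking between these two models should be read with some caution.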

The three most important features for predicting students' grade classes are `StudentAbsenceDays`, `VisitedResources` and `RaisedHands`, according to the feature importance from both Random Forest (Bagging) and Gradient Boosting. These features are likely to matter for the SVM as well, although this was not measured here.