Students’ Academic Performance Prediction

Introduction

This study aims to identify the factors that affect a student's academic success and the extent of their effect. The dataset has a total of 16 features, which span multiple categories:

Demographic features such as gender and nationality.
Academic background features such as educational stage, grade level and section.
Behavioral features such as raising hands in class, opening resources, parents answering the survey, and parents' satisfaction with the school.

The performance of the student is measured by the “class” variable, which is divided into three groups:

L: Low level, given to students who score from 0 to 69.
M: Medium level, given to students who score between 70 and 89.
H: High level, comprising the most successful students, who score more than 89 in their tests.
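The three bands can be written as a simple binning rule. The sketch below assumes a hypothetical vector of total marks called marks; the dataset itself only contains the already-binned class, so this is purely illustrative.

```r
# Hypothetical total marks; not part of the dataset
marks <- c(0, 45, 69, 70, 89, 90, 100)

# Intervals are closed on the right: (-Inf, 69], (69, 89], (89, Inf)
class_label <- cut(marks,
                   breaks = c(-Inf, 69, 89, Inf),
                   labels = c("L", "M", "H"))

data.frame(marks, class_label)
```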

As the dataset has only 16 independent features, we will explore each of them in detail and note their relationship with the class (our response) variable. This will enable us to locate the features that considerably impact the performance of the students. This initial exploration can help us maximize insight into the dataset by extracting the important variables, and thereby prove instrumental in developing parsimonious models. As “class” is a categorical variable, we can also use supervised classification techniques to predict the class of new students based on the attributes (or features) that they possess. Algorithms that can be tried include logistic regression, decision trees, random forests and neural networks.

The exploratory data analysis, followed by the development of an appropriate model (random forest in our case), is aimed at helping schools, students and parents adopt appropriate measures to ensure the success of students at school.

Packages Required

The following packages will be used in our analysis:

tibble: To view the data in a concise way
knitr: To display an aligned table on the screen
dplyr: To manipulate the data
corrplot: To draw the correlation table
ggplot2: To plot visually appealing charts and graphs
tidyr: To tidy messy data, if the case arises
DT: To enable scrolling of the table
plotly: To plot interactive charts
rattle: To plot decision trees
rpart.plot: To customize the decision trees
RColorBrewer: To impart color to the trees
randomForest: To generate the random forest model

#loading the packages
library(knitr)
library(tibble)
library(dplyr)
library(corrplot)
library(ggplot2)
library(tidyr)
library(DT)
library(plotly)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(randomForest)
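The library() calls stop with an error if a package has not been installed yet. As a hedged sketch (not part of the original analysis), the availability of the packages listed above can be checked beforehand without stopping the session:

```r
# Packages used in this analysis
required <- c("knitr", "tibble", "dplyr", "corrplot", "ggplot2", "tidyr",
              "DT", "plotly", "rattle", "rpart.plot", "RColorBrewer",
              "randomForest")

# requireNamespace() returns FALSE instead of stopping when a package is absent
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}
```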

Data Preparation

The dataset is obtained from Kaggle. The link to access the data is https://www.kaggle.com/aljarah/xAPI-Edu-Data/data. This is an educational dataset collected from a learning management system (LMS) called Kalboard 360 using a learner activity tracker tool, the experience API (xAPI). The xAPI is a component of the training and learning architecture (TLA) that enables monitoring of learning progress and of learner actions such as reading an article or watching a training video. The experience API helps learning activity providers determine the learner, activity and objects that describe a learning experience.

The use of this dataset is subject to the citations below, as requested on the Kaggle site.

Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student’s performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.

The dataset has a total of 480 observations of 16 features and contains both integer and categorical variables. It is quite clean and has no missing values.

We now import the dataset to explore it in further detail.

#importing the dataset
#stringsAsFactors = TRUE keeps the text columns as factors on R >= 4.0.0
students<-read.csv("data.csv", stringsAsFactors = TRUE)
colnames(students)

##  [1] "gender"                   "NationalITy"             
##  [3] "PlaceofBirth"             "StageID"                 
##  [5] "GradeID"                  "SectionID"               
##  [7] "Topic"                    "Semester"                
##  [9] "Relation"                 "raisedhands"             
## [11] "VisITedResources"         "AnnouncementsView"       
## [13] "Discussion"               "ParentAnsweringSurvey"   
## [15] "ParentschoolSatisfaction" "StudentAbsenceDays"      
## [17] "Class"

dim(students)

## [1] 480  17

As can be seen from the output, the dataset consists of 17 columns. However, as class is our response variable, we consider only the remaining 16 features and try to visualise and understand their effect on class. Only 4 variables are numeric; the rest can be treated as factors with a number of levels. Hence, we let read.csv convert the strings to factors. This was the default behaviour before R 4.0.0; newer versions need stringsAsFactors = TRUE. A few columns have quite lengthy names, so we rename them.

#renaming the columns
#(col_names avoids shadowing the base function names())
col_names<-c("gender","nation","birthplace",
  "stageid","gradeid","sectionid","topic",
  "semester","relation","raisedhands",
  "n_visit","n_view","discussion","p_answer",
  "p_satis","n_absent","class")
colnames(students)<-col_names
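Assigning a full vector of names relies on the columns arriving in a fixed order. A slightly more defensive variant, sketched here on a hypothetical two-column data frame (toy) rather than the real data, looks up each old name explicitly:

```r
# Hypothetical miniature of the dataset, only to illustrate the lookup
toy <- data.frame(VisITedResources = c(16, 20),
                  NationalITy = c("KW", "KW"))

# Map old names to new ones; the column order no longer matters
new_names <- c(NationalITy = "nation", VisITedResources = "n_visit")
idx <- match(names(new_names), colnames(toy))
colnames(toy)[idx] <- unname(new_names)

colnames(toy)
```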

The variables are thereby renamed for easy interpretation. Next, we will create a data dictionary.

#creating the data dictionary
Variable <- colnames(students)

Attributes <- c("Gender of the student",
"Nationality of the student",
"Place of birth of the student",
"Educational level that the student belongs to",
"The grade that the student is enrolled in",
"Classroom the student belongs to",
"Course topic",
"School year semester",
"Parent responsible for the student",
"Number of times the student has raised hands",
"Number of times the student visited the course content",
"Number of times the student checked the announcements",
"Number of times the student participated in group discussions",
"Did the parent answer the survey",
"The degree of parent satisfaction from school",
"Whether the student was absent for more or fewer than 7 days",
"Indicator of the performance of the student")

#creating the datatype variable
#n_absent is a two-level category ("Above-7"/"Under-7"), not a count
Datatype<-c(rep("string",9),rep("numeric",4),rep("string",4))

#creating the variable description
#tibble() replaces the deprecated as_data_frame()
Description <- tibble(Variable = Variable,
                      Datatype = Datatype,
                      Description = Attributes)
kable(Description)

Variable	Datatype	Description
gender	string	Gender of the student
nation	string	Nationality of the student
birthplace	string	Place of birth of the student
stageid	string	Educational level that the student belongs to
gradeid	string	The grade that the student is enrolled in
sectionid	string	Classroom the student belongs to
topic	string	Course topic
semester	string	School year semester
relation	string	Parent responsible for the student
raisedhands	numeric	Number of times the student has raised hands
n_visit	numeric	Number of times the student visited the course content
n_view	numeric	Number of times the student checked the announcements
discussion	numeric	Number of times the student participated in group discussions
p_answer	string	Did the parent answer the survey
p_satis	string	The degree of parent satisfaction from school
n_absent	string	Whether the student was absent for more or fewer than 7 days
class	string	Indicator of the performance of the student

#checking the structure of the dataset
str(students)

## 'data.frame':    480 obs. of  17 variables:
##  $ gender     : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 2 2 1 1 ...
##  $ nation     : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ birthplace : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ stageid    : Factor w/ 3 levels "HighSchool","lowerlevel",..: 2 2 2 2 2 2 3 3 3 3 ...
##  $ gradeid    : Factor w/ 10 levels "G-02","G-04",..: 2 2 2 2 2 2 5 5 5 5 ...
##  $ sectionid  : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 2 ...
##  $ topic      : Factor w/ 12 levels "Arabic","Biology",..: 8 8 8 8 8 8 9 9 9 8 ...
##  $ semester   : Factor w/ 2 levels "F","S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ relation   : Factor w/ 2 levels "Father","Mum": 1 1 1 1 1 1 1 1 1 1 ...
##  $ raisedhands: int  15 20 10 30 40 42 35 50 12 70 ...
##  $ n_visit    : int  16 20 7 25 50 30 12 10 21 80 ...
##  $ n_view     : int  2 3 0 5 12 13 0 15 16 25 ...
##  $ discussion : int  20 25 30 35 50 70 17 22 50 70 ...
##  $ p_answer   : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 1 2 2 2 ...
##  $ p_satis    : Factor w/ 2 levels "Bad","Good": 2 2 1 1 1 1 1 2 2 2 ...
##  $ n_absent   : Factor w/ 2 levels "Above-7","Under-7": 2 2 1 1 1 1 1 2 2 2 ...
##  $ class      : Factor w/ 3 levels "H","L","M": 3 3 2 2 3 3 2 3 3 3 ...

We can see that the factor levels of the class variable are in alphabetical order (H, L, M), which is not meaningful. Hence, let's rearrange the levels into the order low, medium, high.

#changing the levels of class
students$class<-factor(students$class,levels = c("L","M","H"))
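The effect of re-levelling can be seen on a small example. The vector below is hypothetical and only illustrates that factor() sorts levels alphabetically unless told otherwise:

```r
# Hypothetical class labels
x <- factor(c("M", "L", "H", "M"))
default_levels <- levels(x)   # alphabetical by default
default_levels

# Re-create the factor with an explicit, meaningful level order
x <- factor(x, levels = c("L", "M", "H"))
levels(x)

# Tables and plots now follow the order L, M, H
table(x)
```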

Next, we will make use of the summary command to understand the basic statistics of the variables which will also help us spot any missing values or outliers that may be present in the data.

#checking the summary statistics
summary(students)

##  gender        nation          birthplace          stageid   
##  F:175   KW       :179   KuwaIT     :180   HighSchool  : 33  
##  M:305   Jordan   :172   Jordan     :176   lowerlevel  :199  
##          Palestine: 28   Iraq       : 22   MiddleSchool:248  
##          Iraq     : 22   lebanon    : 19                     
##          lebanon  : 17   SaudiArabia: 16                     
##          Tunis    : 12   USA        : 16                     
##          (Other)  : 50   (Other)    : 51                     
##     gradeid    sectionid     topic     semester   relation  
##  G-02   :147   A:283     IT     : 95   F:245    Father:283  
##  G-08   :116   B:167     French : 65   S:235    Mum   :197  
##  G-07   :101   C: 30     Arabic : 59                        
##  G-04   : 48             Science: 51                        
##  G-06   : 32             English: 45                        
##  G-11   : 13             Biology: 30                        
##  (Other): 23             (Other):135                        
##   raisedhands        n_visit         n_view        discussion    p_answer 
##  Min.   :  0.00   Min.   : 0.0   Min.   : 0.00   Min.   : 1.00   No :210  
##  1st Qu.: 15.75   1st Qu.:20.0   1st Qu.:14.00   1st Qu.:20.00   Yes:270  
##  Median : 50.00   Median :65.0   Median :33.00   Median :39.00            
##  Mean   : 46.77   Mean   :54.8   Mean   :37.92   Mean   :43.28            
##  3rd Qu.: 75.00   3rd Qu.:84.0   3rd Qu.:58.00   3rd Qu.:70.00            
##  Max.   :100.00   Max.   :99.0   Max.   :98.00   Max.   :99.00            
##                                                                           
##  p_satis       n_absent   class  
##  Bad :188   Above-7:191   L:127  
##  Good:292   Under-7:289   M:211  
##                           H:142  
##                                  
##                                  
##                                  
##

The output indicates the absence of any missing values in the dataset. The data is now clean and ready for further analysis. As the dataset has many observations, let's display only the first few to get a brief idea of the data.

#observing the first few observations
datatable(head(students, 20))
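The absence of missing values can also be checked directly instead of being read off the summary output. The sketch below uses a hypothetical data frame (toy) with one deliberately missing value; on the real data the equivalent call would be colSums(is.na(students)).

```r
# Hypothetical data frame with one missing value, for illustration only
toy <- data.frame(raisedhands = c(15, NA, 10),
                  gender = factor(c("M", "F", "M")))

# Number of missing values per column; all zeros means nothing is missing
na_counts <- colSums(is.na(toy))
na_counts

# TRUE only when no column contains an NA
all(na_counts == 0)
```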

Exploratory Analysis

We will start by exploring the categorical variables in detail.

Gender

Let us try to find out if there is any relationship between gender and the performance of the students.

#exploring the gender variable
table(students$gender)

## 
##   F   M 
## 175 305

table(students$gender,students$class)

##    
##       L   M   H
##   F  24  76  75
##   M 103 135  67

round(prop.table(table(students$gender,students$class),1),2)

##    
##        L    M    H
##   F 0.14 0.43 0.43
##   M 0.34 0.44 0.22

Male students make up a considerably larger share of our sample than female students (305 versus 175). 43 % of females scored more than 89 marks in their assessments, whereas only 22 % of male students crossed this threshold. Also, only 14 % of females scored less than 70, compared to 34 % of males. Thus, according to the data we have, the female students have outperformed the male students. This is depicted visually in the graph below.

#noting its effect on the performance of the students
p <- ggplot(students, aes(gender,fill=class))
p +
  stat_count()+ ggtitle("Performance by gender") +
  labs(x="Gender",y="Count") + 
  theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
  theme(axis.title = element_text(color="#666666", face="bold", size=10))
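The row percentages quoted above can be reproduced from the raw counts, which makes the role of the margin argument of prop.table explicit. The chi-squared test at the end is an addition that the original analysis does not run; it is shown only as one possible way to check whether gender and class are associated.

```r
# Counts copied from the gender-by-class table above
tab <- matrix(c(24, 76, 75,
                103, 135, 67),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("F", "M"),
                              class = c("L", "M", "H")))

# margin = 1 divides each count by its row total, so each row sums to 1
row_share <- round(prop.table(tab, 1), 2)
row_share

# Chi-squared test of independence between gender and class
chisq.test(tab)
```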

Nationality

Let us try to observe if any particular nationality is dominant in our analysis. This can be done with the stat_count function in ggplot2.

#exploring the nationality variable
ggplot(students, aes(x=nation))+
  stat_count(color="darkblue", fill="lightblue")+
  ggtitle("Distribution of students by nationality")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The graph reveals that Jordan and Kuwait are over-represented in our sample compared to other nationalities. Egypt, Iran, Libya, Morocco, Syria, Tunis, USA and Venezuela have very few observations. The topics pursued by students of the various nationalities can be studied from the graph below.

ggplot(students, aes(x = topic, fill = nation)) + geom_bar() +
    labs(x = "Topic", y = "Student Count")+
  coord_flip()+
  ggtitle("Topic By Nationality")

Chemistry has the least diversity among all the topics. Most of the enrolled students in chemistry are from Jordan. Also, most of the students who have pursued IT are from Kuwait. Topics like French, English and Arabic have the most diversity.

Next, we will study the effect of nationality of the students on their performance.

#noting the effect of nationality on the performance of students
table(students$nation)

## 
##       Egypt        Iran        Iraq      Jordan          KW     lebanon 
##           9           6          22         172         179          17 
##       Lybia     Morocco   Palestine SaudiArabia       Syria       Tunis 
##           6           4          28          11           7          12 
##         USA    venzuela 
##           6           1

round(prop.table(table(students$nation,students$class),1),2)

##              
##                  L    M    H
##   Egypt       0.33 0.44 0.22
##   Iran        0.33 0.67 0.00
##   Iraq        0.00 0.36 0.64
##   Jordan      0.22 0.48 0.31
##   KW          0.38 0.42 0.20
##   lebanon     0.12 0.35 0.53
##   Lybia       1.00 0.00 0.00
##   Morocco     0.25 0.50 0.25
##   Palestine   0.00 0.57 0.43
##   SaudiArabia 0.09 0.36 0.55
##   Syria       0.29 0.43 0.29
##   Tunis       0.33 0.42 0.25
##   USA         0.17 0.33 0.50
##   venzuela    0.00 0.00 1.00

p <- ggplot(students, aes(nation,fill=class))
p + geom_bar(position = "fill")+ ggtitle("Performance by nationality") +
  labs(x="Nationality") + 
  coord_flip()

It must be noted that the students from Iraq have performed better than their counterparts: none of them scored less than 70 in the test. This is indeed commendable. Venezuela has only one student in the survey and a 100 % success rate, but this cannot be trusted because the country is under-represented in our sample data.
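How little a single observation says can be made concrete with an exact binomial confidence interval. This is a sketch that goes beyond the original analysis; the counts are taken from the tables above (1 of 1 Venezuelan student in class H, 0 of 22 Iraqi students in class L).

```r
# Venezuela: 1 student, 1 in class H
ci_venezuela <- binom.test(x = 1, n = 1)$conf.int

# Iraq: 22 students, none in class L
ci_iraq <- binom.test(x = 0, n = 22)$conf.int

ci_venezuela   # very wide: one observation says little about the true rate
ci_iraq        # a far tighter bound on the share of low performers
```

The interval for Venezuela spans almost the whole range from 0 to 1, while the one for Iraq is much narrower because it rests on 22 students.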

StageID

#exploring the stageid variable
ggplot(students, aes(x=stageid))+
  stat_count(color="darkblue", fill="lightblue")+
  ggtitle("Distribution of students across different stageid")

Most of the data available is for the middle school students and high school has very few observations. Next we will try to note if these levels have considerable effect on our class variable.

#noting its effect on the performance
table(students$stageid)

## 
##   HighSchool   lowerlevel MiddleSchool 
##           33          199          248

round(prop.table(table(students$stageid,students$class),1),2)

##               
##                   L    M    H
##   HighSchool   0.24 0.42 0.33
##   lowerlevel   0.33 0.40 0.28
##   MiddleSchool 0.22 0.48 0.31

p <- ggplot(students, aes(stageid,fill=class))
p + stat_count()+ ggtitle("Performance by stageid") +
  labs(x="StageID",y="Count") + 
  theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
  theme(axis.title = element_text(face="bold", size=10))

The school level does not tend to affect the performance of the students considerably. The largest share of students scores in the 70-89 range irrespective of schooling level.

Topic

#exploring the topic variable
ggplot(students, aes(x=topic))+
  stat_count(color="darkblue", fill="lightblue")+
  ggtitle("Count of students enrolled in different topics")

The most popular topics are IT, French, Arabic and Science. Though IT and Science are technical subjects, students seem to be interested in learning languages like French as well. As most of the countries in the dataset are in the Middle East, Arabic is undoubtedly a popular choice.

#noting its effect on the performance
table(students$topic)

## 
##    Arabic   Biology Chemistry   English    French   Geology   History 
##        59        30        24        45        65        24        19 
##        IT      Math     Quran   Science   Spanish 
##        95        21        22        51        25

round(prop.table(table(students$topic,students$class),1),2)

##            
##                L    M    H
##   Arabic    0.29 0.39 0.32
##   Biology   0.13 0.33 0.53
##   Chemistry 0.33 0.25 0.42
##   English   0.22 0.40 0.38
##   French    0.25 0.45 0.31
##   Geology   0.00 0.75 0.25
##   History   0.16 0.63 0.21
##   IT        0.40 0.44 0.16
##   Math      0.33 0.38 0.29
##   Quran     0.27 0.36 0.36
##   Science   0.20 0.49 0.31
##   Spanish   0.32 0.48 0.20

p <- ggplot(students, aes(topic,fill=class))
p + stat_count()+ ggtitle("Performance by topic") +
  labs(x="Topic",y="Count") + 
  theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
  theme(axis.title = element_text(face="bold", size=10))

Students have been performing really well in biology. We can also note that no student has scored less than 70 in Geology.

Semester

#exploring the semester variable
ggplot(students, aes(x=semester))+
  stat_count(color="darkblue", fill="lightblue")+
  ggtitle("Count of students enrolled in different semesters")

The two semesters, first and second, are almost equally represented. Let us try to observe if students tend to perform better in either of these semesters.

#noting its effect on the performance
table(students$semester)

## 
##   F   S 
## 245 235

round(prop.table(table(students$semester,students$class),1),2)

##    
##        L    M    H
##   F 0.31 0.43 0.25
##   S 0.21 0.45 0.34

p <- ggplot(students, aes(semester,fill=class))
p + stat_count()+ ggtitle("Performance by semester") +
  labs(x="Semester",y="Count") + 
  theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
  theme(axis.title = element_text(face="bold", size=10))

The largest share of students scores between 70 and 89 in both semesters. However, the proportion of students scoring more than 89 is higher in the second semester.

Section ID

ggplot(students, aes(x = sectionid, fill = topic))+
  geom_bar() +
  labs(x = "Section ID", y = "Count")+
  ggtitle("Distribution of students across the sections")

Section A has the most varied course offerings, while the other sections have fewer topics taught. Section C has only IT and Science courses, which might be the reason for the low enrollment in this section.

Parents’ satisfaction

#exploring the parents' satisfaction variable
ggplot(students, aes(x=p_satis))+
  stat_count(color="darkblue", fill="lightblue")+
  ggtitle("Parents' satisfaction")

Most of the parents seem to be satisfied with the school. Let us see if this variable can be instrumental in predicting the performance of the students.

#noting its effect on the performance
table(students$p_satis)

## 
##  Bad Good 
##  188  292

round(prop.table(table(students$p_satis,students$class),1),2)

##       
##           L    M    H
##   Bad  0.45 0.43 0.13
##   Good 0.15 0.45 0.40

p <- ggplot(students, aes(p_satis,fill=class))
p + stat_count()+ ggtitle("Performance by parents' satisfaction") +
  labs(x="Parents' satisfaction",y="Count") + 
  theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
  theme(axis.title = element_text(face="bold", size=10))

We can note that, on average, the children of parents who are satisfied with the school perform better than the children of dissatisfied parents.

#Plotting p_answer vs p_satis
ggplot(students, aes(x = p_answer, fill = p_satis)) +
    geom_bar() + 
    labs(x = "Survey answered by parents", y = "count")+
  ggtitle("Survey answered and parents satisfaction")

It can be seen that the parents who are not satisfied with the school tend to fill in the survey less often.

Let us start exploring the 4 continuous variables to note their distribution and effect on the performance of the students.

Exploring the continuous variables

We will now check if the continuous variables have any considerable correlation among them.

#Checking the correlation between continuous variables
corrplot(cor(students[,c(10:13)]),method="number")

We can conclude that the variables n_visit and raisedhands are quite strongly correlated. Hence, the students who visited the resources more often were also more likely to raise their hands in class than those who did not.

Let us analyze the continuous variables in more detail.
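The numbers drawn by corrplot come from an ordinary correlation matrix. The sketch below uses simulated activity counts (hypothetical, with n_visit deliberately constructed to move with raisedhands) to show the matrix itself and a test on a single pair; cor.test is an addition that the original analysis does not use.

```r
set.seed(1)
# Hypothetical activity counts; n_visit is built to move with raisedhands
raisedhands <- sample(0:100, 50, replace = TRUE)
n_visit <- pmin(99, pmax(0, raisedhands + sample(-20:20, 50, replace = TRUE)))
discussion <- sample(1:99, 50, replace = TRUE)
toy <- data.frame(raisedhands, n_visit, discussion)

# Pearson correlation matrix, the same kind of input that corrplot() draws
cor_mat <- cor(toy)
round(cor_mat, 2)

# Test whether one pairwise correlation differs from zero
cor.test(toy$raisedhands, toy$n_visit)
```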

#exploring the continuous variables
p <- plot_ly(students, x = ~class, y = ~raisedhands, type = "box") %>%
  layout(boxmode = "group",title= "Performance by raisedhands",
         xaxis = list(title = 'class',
                      zeroline = TRUE),
         yaxis = list(title = 'Raised hands'
                      ))
p

q<- plot_ly(students, x = ~class, y = ~n_visit, type = "box") %>%
  layout(boxmode = "group",title= "Performance by number of visits",
         xaxis = list(title = 'class',
                      zeroline = TRUE),
         yaxis = list(title = 'n_visits'
                      ))
q

r <- plot_ly(students, x = ~class, y = ~n_view, type = "box") %>%
  layout(boxmode = "group",title= "Performance by number of views",
         xaxis = list(title = 'class',
                      zeroline = TRUE),
         yaxis = list(title = 'n_views'
                      ))
r

s <- plot_ly(students, x = ~class, y = ~discussion, type = "box") %>%
  layout(boxmode = "group",title= "Performance by number of discussions",
         xaxis = list(title = 'class',
                      zeroline = TRUE),
         yaxis = list(title = 'Discussions'
                      ))
s

The performance of the students seems to depend on the number of times they raised their hands, which can be a measure of their involvement in the class. The students who raised their hands more often generally outperformed the others. A few students, however, performed well despite their lower involvement in class; they are shown as outliers in the first graph above.

Also, the students who viewed the announcements, participated in discussions and visited resources more often usually scored better than the ones who did not.
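What the box plots show visually can also be summarised as one number per class. The data frame below is a hypothetical miniature, not the real data; on the real data the call would be tapply(students$raisedhands, students$class, median).

```r
# Hypothetical miniature dataset, for illustration only
toy <- data.frame(
  class = factor(c("L", "L", "M", "M", "H", "H"), levels = c("L", "M", "H")),
  raisedhands = c(5, 15, 40, 60, 80, 90)
)

# Median number of raised hands within each class
med <- tapply(toy$raisedhands, toy$class, median)
med
```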

Now that we know that a student’s performance is dependent on the number of times he has raised his hands in the class, we will try to figure out if this number is in turn influenced by other variables in the dataset.

ggplot(students, aes(x = raisedhands, color = gender))+
  geom_density()+
  ggtitle("Raised hands Vs gender")

The female students can be observed to have raised their hands more than the male students. Our previous analysis showed that the females have outperformed the males. This is consistent with the idea that the number of raised hands is a potential factor in determining the academic performance of the students.

ggplot(students, aes(x = raisedhands, color = topic)) +
  geom_density() +
   ggtitle("Raised hands Vs topic")

Geology can be said to have been the most engaging subject of all. IT, on the other hand, has seen very little student participation.

Model development

#Checking the proportion of students belonging to different classes
round(table(students$class)/nrow(students),2)

## 
##    L    M    H 
## 0.26 0.44 0.30

The largest group of students (44 %) belongs to the medium category and scores between 70 and 89. 30 % of the students scored more than 89, and 127 students (26 %) were placed in the low category.

Let us now split the dataset into training and testing samples. An initial seed is set so that the training sample is reproducible. It comprises 80 % of the total observations (384 in total). The test set is composed of the remaining 20 % of the observations (96).

#Subsetting the data into train and test samples
set.seed(123)
#train_idx holds the row numbers drawn for the training sample
train_idx <- sample(nrow(students), nrow(students) * 0.8)
train <- students[train_idx, ]
test <- students[-train_idx, ]
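A simple random split does not guarantee that the three classes keep their overall proportions in both samples. One alternative, not used in the original analysis, is to sample within each class. The sketch below assumes a hypothetical stand-in data frame (toy) with the same class counts as the real data.

```r
set.seed(123)
# Hypothetical stand-in with the same class counts as the real data
toy <- data.frame(
  id = 1:480,
  class = factor(rep(c("L", "M", "H"), times = c(127, 211, 142)),
                 levels = c("L", "M", "H"))
)

# Sample 80% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$class)
strat_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, floor(0.8 * length(idx)))
}))

toy_train <- toy[strat_idx, ]
toy_test <- toy[-strat_idx, ]

# Class shares in both parts stay close to the overall shares
round(prop.table(table(toy_train$class)), 2)
round(prop.table(table(toy_test$class)), 2)
```

Because each class is rounded down separately, the training part here has 382 rows rather than 384.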

class(students$class)

## [1] "factor"

As our response variable, class, is a factor, we need a classification model. Decision tree and random forest models will be developed to predict the performance of the students.

Decision tree

Let us begin by training a decision tree.

#Building a decision tree
tree_model <- rpart(class ~ ., data = train, method = "class")
fancyRpartPlot(tree_model)

#Predicting on training set
train$prediction <- predict(tree_model,type = "class")
table(train$class, train$prediction, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth   L   M   H
##     L  95  12   1
##     M  10 134  19
##     H   0  13 100

The accuracy of the model on the training set comes out to be 85.7 % (329 of 384 students correctly classified), which is quite good. Let us now test the performance of our model on the test data.
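Accuracy is the share of observations on the diagonal of the confusion matrix. As a sketch, the training-set matrix printed above can be re-entered by hand and the figure computed from it; in a live session the same two lines work on the table object itself.

```r
# Training-set confusion matrix of the decision tree, copied from above
cm <- matrix(c(95,  12,   1,
               10, 134,  19,
                0,  13, 100),
             nrow = 3, byrow = TRUE,
             dimnames = list(Truth = c("L", "M", "H"),
                             Predicted = c("L", "M", "H")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```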

#Predicting on test set
test$prediction <- predict(tree_model,test,type = "class")
table(test$class, test$prediction, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth  L  M  H
##     L 11  8  0
##     M  3 36  9
##     H  1  5 23

The accuracy comes out to be 72.9 % (70 of 96), which is considerably lower than the accuracy on the training set. This can be attributed to the model having overfit the training data.
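A common remedy for an overfit tree is cost-complexity pruning. The original analysis does not prune its tree, so the following is only a sketch of a possible next step; it uses the built-in iris data as a stand-in because the student data is not bundled with R, and the control settings are illustrative assumptions.

```r
library(rpart)

set.seed(123)
# iris is used here only as a built-in stand-in for the student data
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 5))

# cptable holds the cross-validated error (xerror) for each tree size
best <- which.min(fit$cptable[, "xerror"])
best_cp <- fit$cptable[best, "CP"]

# Cut the tree back to the size with the lowest cross-validated error
pruned <- prune(fit, cp = best_cp)
printcp(pruned)
```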

Random Forest

Next, we will explore the random forest model and see if we can improve our model accuracy.

The data frame holds the independent variables as columns, so we can combine their names into a formula and pass it as an argument to randomForest.

#Making a formula for providing as an argument to the random forest model
varNames <- names(students)
#Excluding the response variable "class"
varNames <- varNames[!varNames %in% c("class")]
#Adding a + sign between the explanatory variables
varNames1 <- paste(varNames, collapse = "+")
#Adding the "class" variable and converting it to a formula object
rf_input <- as.formula(paste("class", varNames1, sep = " ~ "))
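Base R also offers reformulate(), which builds the same kind of formula without pasting strings together. A minimal sketch with three of the renamed predictors:

```r
# A few of the renamed predictors, for illustration
predictors <- c("gender", "nation", "raisedhands")

# reformulate() builds response ~ term1 + term2 + ... in one step
rf_formula <- reformulate(predictors, response = "class")
rf_formula
```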

Now that we have the required training data and the formula for building our random forest model, let's build one using 500 decision trees.

#Building the random forest model
rf_model<-randomForest(rf_input,train,ntree=500,importance=TRUE)
plot(rf_model,main="Error rate")

A total of 500 decision trees are used in building the model, and the votes of all these trees are considered in deciding the final class of each student. The graph obtained suggests that the error rate does not fall considerably after 100 trees.
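The voting step can be illustrated in a few lines of base R. The votes below are hypothetical (three students, three trees); ties here go to the alphabetically first class, which is a simplification of what the randomForest package does.

```r
# Hypothetical votes: one row per student, one column per tree
votes <- matrix(c("L", "L", "M",
                  "H", "M", "H",
                  "M", "M", "M"),
                nrow = 3, byrow = TRUE)

# The forest's prediction is the class that most trees voted for
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
majority
```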

varImpPlot(rf_model,sort=TRUE,main="Variable Importance")

The features that are most useful in determining the performance of the students are depicted in the graphs above, listed in decreasing order of importance. We can conclude that the number of times a student visits the course content, the number of times he raises his hand in the classroom and the number of days he is absent from class are extremely crucial in determining his final grades. Based on these insights, the variables can be selected for any other predictive modeling technique as well.

#Inspecting the variable importance scores
#(importance() has no sort argument, so the rows keep the column order)
importance(rf_model)

##                      L          M         H MeanDecreaseAccuracy
## gender       5.3967537  2.1643174  7.473829             7.948149
## nation       7.7988923  6.9537419  5.374124            11.758944
## birthplace   5.0077935  5.1432231  5.998259             9.544412
## stageid     -1.3331670  3.5293190  2.196716             3.491243
## gradeid      1.9149513  5.9541111  4.051593             7.515748
## sectionid   -0.8462757 -1.3912504  2.320246            -0.188291
## topic        3.7970192  9.5222092  8.005472            13.079494
## semester    -0.5688386 -2.3392693 -2.069180            -3.222765
## relation     7.3177978  0.8382449 15.603990            13.243749
## raisedhands 26.3740333  2.4617059 22.707164            29.646391
## n_visit     28.0237609  5.4651543 22.839899            32.792444
## n_view      19.0286792  7.5321836 12.539526            22.125791
## discussion   8.6672447  4.0938156 10.601223            12.474780
## p_answer    15.0679052  6.4273333 11.914579            17.746647
## p_satis      8.6629530  2.5764839  9.481245            10.369000
## n_absent    39.7714385  5.7953042 38.951456            48.456443
##             MeanDecreaseGini
## gender              4.088496
## nation             11.173286
## birthplace         10.661968
## stageid             2.534392
## gradeid            10.349722
## sectionid           3.079700
## topic              19.754304
## semester            1.797588
## relation            7.276900
## raisedhands        39.433665
## n_visit            43.946514
## n_view             28.217505
## discussion         19.973013
## p_answer            8.923195
## p_satis             5.183095
## n_absent           32.791574
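Because the importance scores come back as a plain matrix in column order, ranking them is a separate step. The sketch below re-enters a few MeanDecreaseAccuracy values from the table above; on the fitted model the same vector would be importance(rf_model)[, "MeanDecreaseAccuracy"].

```r
# MeanDecreaseAccuracy values copied from the importance table above
mda <- c(gender = 7.948149, sectionid = -0.188291, semester = -3.222765,
         raisedhands = 29.646391, n_visit = 32.792444, n_absent = 48.456443)

# Rank from most to least important
ranked <- sort(mda, decreasing = TRUE)
ranked

# Candidates for removal: permuting them did not reduce accuracy
low_value <- names(mda)[mda <= 0]
low_value
```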

The sectionid and semester of the students do not seem to affect their performance significantly and hence we will build a model excluding these two variables.

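#Final model
#the decision tree's "prediction" column added to train earlier is excluded
#so that it is not used as a predictor
final_rf_model <- randomForest(class ~ . - semester - sectionid - prediction,
                         data = train, importance = TRUE,
                         ntree = 500)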

Next, we predict the performance of the students using the random forest model. This needs to be carried out on both the training and the test sets. Let us start with the training set and then calculate the confusion matrix. This will help us obtain a count of the misclassified grades.

#predicting the performance of the students in the training set
train$predicted_class<-predict(final_rf_model)
table(train$class, train$predicted_class, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth   L   M   H
##     L  98   9   1
##     M  13 135  15
##     H   0  15  98

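The accuracy of the random forest model on the training set is 86.2 % (331 of 384). Because predict is called without new data, these are out-of-bag predictions. Now we can predict the class for the test sample and calculate the model accuracy for that sample.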

#predicting the performance of the students in the testing set
test$predicted_class<-predict(final_rf_model,test)
table(test$class, test$predicted_class, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth  L  M  H
##     L 12  7  0
##     M  3 38  7
##     H  0  7 22

The accuracy has dropped to 75 % (72 of 96), but it is still better than that of the decision tree model. Hence we will use the random forest model as our final model to predict the performance of the students.
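Overall accuracy hides how well each class is predicted. As a sketch that goes slightly beyond the original analysis, per-class recall and precision can be computed from the test-set confusion matrix printed above:

```r
# Test-set confusion matrix of the random forest, copied from above
cm <- matrix(c(12,  7,  0,
                3, 38,  7,
                0,  7, 22),
             nrow = 3, byrow = TRUE,
             dimnames = list(Truth = c("L", "M", "H"),
                             Predicted = c("L", "M", "H")))

# Recall: share of each true class that was predicted correctly
recall <- diag(cm) / rowSums(cm)

# Precision: share of each predicted class that was actually correct
precision <- diag(cm) / colSums(cm)

round(rbind(recall, precision), 2)
```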

Summary

The exploratory data analysis and the final predictive model built using random forest revealed some interesting findings, which are listed below:

Female students have outperformed the male students in the school. 43 % of females scored more than 89 marks in their assessments, whereas only 22 % of the male students were able to cross this threshold.
Jordan and Kuwait are over-represented in our sample compared to other nationalities. Egypt, Iran, Libya, Morocco, Syria, Tunis, USA and Venezuela, on the other hand, have very few observations.
Chemistry has the least diversity among all the topics. Also, most of the students enrolled in Chemistry are from Jordan.
Most of the students who have pursued IT are from Kuwait. Topics like French, English and Arabic have the most diversity.
Students from Iraq have performed better than their counterparts.
The largest share of students scores in the 70-89 range irrespective of their schooling level.
The most popular subjects among the students are IT, French, Arabic and Science.
Section A has the most varied course offerings, while the other two sections have fewer topics taught.
Section C has only two courses: Science and IT.
The students who viewed the announcements, raised their hands in class, participated in discussions and visited resources more often scored better than the ones who did not.
The number of times a student visits the course content, the number of times he raises his hand in the classroom and the number of days he is absent from class are the most important variables in predicting the performance of the students (as suggested by the random forest model).

Limitations

Although random forest was used to develop the model, it gave us an accuracy of only 75 % on the test set. It would be advisable to try other classification techniques like logistic regression and neural networks and see if they can further increase the accuracy of the model.