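This study aims to identify the factors that affect a student's academic success and the extent to which they do so. The dataset contains 16 features spanning several categories, such as demographic attributes (gender, nationality), academic background (educational stage, grade, section) and behavioural measures (raised hands, resource visits, parental survey responses).
The performance of the student is measured by the “class” variable, which is divided into three groups: low (L, marks below 70), medium (M, 70 to 89) and high (H, 90 and above).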
As the dataset has only 16 independent features, we will explore each of them in detail and note its relationship with the class (our response) variable. This will enable us to locate the features that impact the performance of the students considerably. This initial exploration helps us extract the important variables and thereby proves instrumental in developing parsimonious models. As “class” is a categorical variable, we can also use supervised classification techniques to predict the class of new students from their attributes (or features). The algorithms that can be tried include logistic regression, decision trees, random forests and neural networks.
The exploratory data analysis, followed by the development of an appropriate model (random forest in our case), is aimed at helping schools, students and parents adopt appropriate measures to ensure the success of students at school.
The following packages will be used in our analysis:
#loading the packages
library(knitr)
library(tibble)
library(dplyr)
library(corrplot)
library(ggplot2)
library(tidyr)
library(DT)
library(plotly)
library(rattle)
library(rpart) #provides rpart(), used to fit the decision tree
library(rpart.plot)
library(RColorBrewer)
library(randomForest)
The dataset is obtained from Kaggle. The link to access the data is https://www.kaggle.com/aljarah/xAPI-Edu-Data/data. This is an educational dataset collected from a learning management system (LMS) called Kalboard 360 using a learner activity tracker tool, the experience API (xAPI). The xAPI is a component of the training and learning architecture (TLA) that enables monitoring of learning progress and of a learner’s actions, like reading an article or watching a training video. The experience API helps the learning activity providers to determine the learner, activity and objects that describe a learning experience.
The usage of this dataset is subject to the citations below, as requested on the Kaggle site.
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student’s performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.
The dataset has a total of 480 observations on 16 features and contains both integer and categorical variables. It is quite clean and has no missing values.
We now import the dataset to explore it in further detail.
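#importing the dataset
#stringsAsFactors = TRUE reads the text columns as factors (the default became FALSE in R 4.0.0)
students<-read.csv("data.csv", stringsAsFactors = TRUE)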
colnames(students)
## [1] "gender" "NationalITy"
## [3] "PlaceofBirth" "StageID"
## [5] "GradeID" "SectionID"
## [7] "Topic" "Semester"
## [9] "Relation" "raisedhands"
## [11] "VisITedResources" "AnnouncementsView"
## [13] "Discussion" "ParentAnsweringSurvey"
## [15] "ParentschoolSatisfaction" "StudentAbsenceDays"
## [17] "Class"
dim(students)
## [1] 480 17
As can be seen from the output, the dataset consists of 17 columns. As class is our response variable, we consider only the remaining 16 features and try to visualise and understand their effect on class. Only 4 variables are continuous; the rest can be treated as factors with a number of levels. The text columns are therefore read in as factors, which was the default behaviour of read.csv before R 4.0.0 (later versions need stringsAsFactors = TRUE). A few columns have quite lengthy names, so we rename them.
#renaming the columns
names<-c("gender","nation","birthplace",
"stageid","gradeid","sectionid","topic",
"semester","relation","raisedhands",
"n_visit","n_view","discussion","p_answer",
"p_satis","n_absent","class")
colnames(students)<-names
The variables are thereby renamed for easy interpretation. Next, we will create a data dictionary.
#creating the data dictionary
names<-colnames(students)
Attributes <- c("Gender of the student",
"Nationality of the student",
"Place of birth of the student",
"Educational level that the student belongs to",
"The grade that the student is enrolled in",
"Classroom the student belongs to",
"Course topic",
"School year semester",
"Parent responsible for the student",
"Number of times the student has raised hands",
"Number of times the student visited the course content",
"Number of times the student checked the announcements",
"Number of times the student participated in group discussions",
"Did the parent answer the survey",
"The degree of parent satisfaction from school ",
"Number of days a student was absent",
"Indicator of the performance of the student")
#creating the datatype variable
#n_absent is recorded as two categories (Above-7, Under-7), so it is a string column
Datatype<-c(rep("string",9),rep("numeric",4),rep("string",4))
#creating the variable description
Description <- tibble(Variable = names, Datatype = Datatype, Description = Attributes)
kable(Description)
Variable | Datatype | Description |
---|---|---|
gender | string | Gender of the student |
nation | string | Nationality of the student |
birthplace | string | Place of birth of the student |
stageid | string | Educational level that the student belongs to |
gradeid | string | The grade that the student is enrolled in |
sectionid | string | Classroom the student belongs to |
topic | string | Course topic |
semester | string | School year semester |
relation | string | Parent responsible for the student |
raisedhands | numeric | Number of times the student has raised hands |
n_visit | numeric | Number of times the student visited the course content |
n_view | numeric | Number of times the student checked the announcements |
discussion | numeric | Number of times the student participated in group discussions |
p_answer | string | Did the parent answer the survey |
p_satis | string | The degree of parent satisfaction from school |
n_absent | string | Number of days a student was absent |
class | string | Indicator of the performance of the student |
#checking the structure of the dataset
str(students)
## 'data.frame': 480 obs. of 17 variables:
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 2 2 1 1 ...
## $ nation : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ birthplace : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ stageid : Factor w/ 3 levels "HighSchool","lowerlevel",..: 2 2 2 2 2 2 3 3 3 3 ...
## $ gradeid : Factor w/ 10 levels "G-02","G-04",..: 2 2 2 2 2 2 5 5 5 5 ...
## $ sectionid : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 2 ...
## $ topic : Factor w/ 12 levels "Arabic","Biology",..: 8 8 8 8 8 8 9 9 9 8 ...
## $ semester : Factor w/ 2 levels "F","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ relation : Factor w/ 2 levels "Father","Mum": 1 1 1 1 1 1 1 1 1 1 ...
## $ raisedhands: int 15 20 10 30 40 42 35 50 12 70 ...
## $ n_visit : int 16 20 7 25 50 30 12 10 21 80 ...
## $ n_view : int 2 3 0 5 12 13 0 15 16 25 ...
## $ discussion : int 20 25 30 35 50 70 17 22 50 70 ...
## $ p_answer : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 1 2 2 2 ...
## $ p_satis : Factor w/ 2 levels "Bad","Good": 2 2 1 1 1 1 1 2 2 2 ...
## $ n_absent : Factor w/ 2 levels "Above-7","Under-7": 2 2 1 1 1 1 1 2 2 2 ...
## $ class : Factor w/ 3 levels "H","L","M": 3 3 2 2 3 3 2 3 3 3 ...
We can see that the factor levels of the class variable are in alphabetical order (H, L, M), which is not meaningful. Hence, let's re-arrange the levels into the order low, medium, high.
#changing the levels of class
students$class<-factor(students$class,levels = c("L","M","H"))
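By default R orders factor levels alphabetically, which is why H came before L in the structure output. A minimal sketch on a made-up vector (the values in `grades` are illustrative only, not taken from the dataset) showing how the `levels` argument changes the order without changing the data:

```r
# toy vector, illustrative values only
grades <- factor(c("M", "L", "H", "M", "L"))
default_levels <- levels(grades)   # alphabetical order

# same data, levels listed from low to high
grades_fixed <- factor(grades, levels = c("L", "M", "H"))
fixed_levels <- levels(grades_fixed)

# the underlying values are unchanged, only the level order differs
same_values <- identical(as.character(grades), as.character(grades_fixed))

print(default_levels)
print(fixed_levels)
```

The level order matters later because tables and plot legends follow it.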
Next, we will make use of the summary command to understand the basic statistics of the variables which will also help us spot any missing values or outliers that may be present in the data.
#checking the summary statistics
summary(students)
## gender nation birthplace stageid
## F:175 KW :179 KuwaIT :180 HighSchool : 33
## M:305 Jordan :172 Jordan :176 lowerlevel :199
## Palestine: 28 Iraq : 22 MiddleSchool:248
## Iraq : 22 lebanon : 19
## lebanon : 17 SaudiArabia: 16
## Tunis : 12 USA : 16
## (Other) : 50 (Other) : 51
## gradeid sectionid topic semester relation
## G-02 :147 A:283 IT : 95 F:245 Father:283
## G-08 :116 B:167 French : 65 S:235 Mum :197
## G-07 :101 C: 30 Arabic : 59
## G-04 : 48 Science: 51
## G-06 : 32 English: 45
## G-11 : 13 Biology: 30
## (Other): 23 (Other):135
## raisedhands n_visit n_view discussion p_answer
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 1.00 No :210
## 1st Qu.: 15.75 1st Qu.:20.0 1st Qu.:14.00 1st Qu.:20.00 Yes:270
## Median : 50.00 Median :65.0 Median :33.00 Median :39.00
## Mean : 46.77 Mean :54.8 Mean :37.92 Mean :43.28
## 3rd Qu.: 75.00 3rd Qu.:84.0 3rd Qu.:58.00 3rd Qu.:70.00
## Max. :100.00 Max. :99.0 Max. :98.00 Max. :99.00
##
## p_satis n_absent class
## Bad :188 Above-7:191 L:127
## Good:292 Under-7:289 M:211
## H:142
##
##
##
##
The output indicates the absence of any missing values in the dataset. The data is now clean and ready for further analysis. As the dataset has 480 observations, let's display only the first 20 to get a brief idea of the data.
#observing the first few observations
datatable(head(students, 20))
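The absence of missing values can also be checked directly rather than read off the summary. A small sketch on a hypothetical data frame (`toy` and its values are made up so that the check has something to find); the same two calls could be applied to `students`:

```r
# hypothetical data with one missing value in each column
toy <- data.frame(
  raisedhands = c(15, 20, NA),
  class = factor(c("M", NA, "L"))
)

# number of missing values per column
na_counts <- colSums(is.na(toy))

# TRUE if there is at least one missing value anywhere
has_na <- anyNA(toy)

print(na_counts)
print(has_na)
```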
We will start by exploring the categorical variables in detail.
Let us try to find out whether there is any relationship between the gender and the performance of the students.
#exploring the gender variable
table(students$gender)
##
## F M
## 175 305
table(students$gender,students$class)
##
## L M H
## F 24 76 75
## M 103 135 67
round(prop.table(table(students$gender,students$class),1),2)
##
## L M H
## F 0.14 0.43 0.43
## M 0.34 0.44 0.22
Male students make up a considerably larger share of the sample than female students (305 versus 175). 43 % of the female students are in the high class (90 marks or more), whereas only 22 % of the male students reach it. Also, only 14 % of the female students scored less than 70, compared with 34 % of the male students. Thus, in this dataset, the female students have outperformed the male students. This is depicted visually in the graph below.
#noting its effect on the performance of the students
p <- ggplot(students, aes(gender,fill=class))
p +
stat_count()+ ggtitle("Performance by gender") +
#the bar height is the number of students, so the y axis is labelled as a count
labs(x="Gender",y="Count") +
theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
theme(axis.title = element_text(color="#666666", face="bold", size=10))
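Whether the gap between the two genders could be down to chance can be checked with a chi-squared test of independence. This test is not part of the original analysis; the sketch below only re-enters the counts from the gender by class table above, so it runs without the data file:

```r
# counts copied from the gender x class table above
gender_class <- matrix(
  c(24, 76, 75,
    103, 135, 67),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("F", "M"), class = c("L", "M", "H"))
)

# row-wise shares (margin = 1 makes every row sum to 1)
row_share <- round(prop.table(gender_class, 1), 2)

# chi-squared test of independence between gender and class
test_result <- chisq.test(gender_class)

print(row_share)
print(test_result)
```

A small p-value says that gender and class are associated in this sample. It does not say why, and it does not account for the other features.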
Let us check whether any particular nationality is dominant in our sample. A simple bar chart of counts, drawn with the stat_count function in ggplot2, shows this.
#exploring the nationality variable
ggplot(students, aes(x=nation))+
stat_count(color="darkblue", fill="lightblue")+
ggtitle("Distribution of students by nationality")+
theme(axis.text = element_text(angle = 45))
The graph reveals that Jordan and Kuwait are over-represented in our sample compared with the other nationalities. Egypt, Iran, Libya, Morocco, Syria, Tunis, USA and Venezuela have very few observations. The topics pursued by the students of various nationalities can be studied from the graph below.
ggplot(students, aes(x = topic, fill = nation)) + geom_bar() +
labs(x = "Topic", y = "Student Count")+
coord_flip()+
ggtitle("Topic By Nationality")
Chemistry has the least diversity among all the topics. Most of the enrolled students in chemistry are from Jordan. Also, most of the students who have pursued IT are from Kuwait. Topics like French, English and Arabic have the most diversity.
Next, we will study the effect of nationality of the students on their performance.
#noting the effect of nationality on the performance of students
table(students$nation)
##
## Egypt Iran Iraq Jordan KW lebanon
## 9 6 22 172 179 17
## Lybia Morocco Palestine SaudiArabia Syria Tunis
## 6 4 28 11 7 12
## USA venzuela
## 6 1
round(prop.table(table(students$nation,students$class),1),2)
##
## L M H
## Egypt 0.33 0.44 0.22
## Iran 0.33 0.67 0.00
## Iraq 0.00 0.36 0.64
## Jordan 0.22 0.48 0.31
## KW 0.38 0.42 0.20
## lebanon 0.12 0.35 0.53
## Lybia 1.00 0.00 0.00
## Morocco 0.25 0.50 0.25
## Palestine 0.00 0.57 0.43
## SaudiArabia 0.09 0.36 0.55
## Syria 0.29 0.43 0.29
## Tunis 0.33 0.42 0.25
## USA 0.17 0.33 0.50
## venzuela 0.00 0.00 1.00
p <- ggplot(students, aes(nation,fill=class))
p + geom_bar(position = "fill")+ ggtitle("Performance by nationality") +
labs(x="Nationality") +
coord_flip()
It must be noted that the students from Iraq have performed better than their counterparts: 64 % of them are in the high class and none has scored less than 70. Venezuela has only one student in the data, who is in the high class, but a single observation is too little to support any conclusion about that nationality.
#exploring the stageid variable
ggplot(students, aes(x=stageid))+
stat_count(color="darkblue", fill="lightblue")+
ggtitle("Distribution of students across different stageid")
Most of the data available is for the middle school students and high school has very few observations. Next we will try to note if these levels have considerable effect on our class variable.
#noting its effect on the performance
table(students$stageid)
##
## HighSchool lowerlevel MiddleSchool
## 33 199 248
round(prop.table(table(students$stageid,students$class),1),2)
##
## L M H
## HighSchool 0.24 0.42 0.33
## lowerlevel 0.33 0.40 0.28
## MiddleSchool 0.22 0.48 0.31
p <- ggplot(students, aes(stageid,fill=class))
p + stat_count()+ ggtitle("Performance by stageid") +
labs(x="StageID",y="Count") +
theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
theme(axis.title = element_text(face="bold", size=10))
The school level does not appear to affect the performance of the students considerably. The medium class (70-89) is the largest group at every level, holding 40 % to 48 % of the students.
#exploring the topic variable
ggplot(students, aes(x=topic))+
stat_count(color="darkblue", fill="lightblue")+
ggtitle("Count of students enrolled in different topics")
The most popular topics are IT, French, Arabic and Science. Though IT and Science are technical subjects, students seem to be interested in learning languages like French as well. As the countries in the dataset are mostly from the Middle East, Arabic is undoubtedly a popular choice for them.
#noting its effect on the performance
table(students$topic)
##
## Arabic Biology Chemistry English French Geology History
## 59 30 24 45 65 24 19
## IT Math Quran Science Spanish
## 95 21 22 51 25
round(prop.table(table(students$topic,students$class),1),2)
##
## L M H
## Arabic 0.29 0.39 0.32
## Biology 0.13 0.33 0.53
## Chemistry 0.33 0.25 0.42
## English 0.22 0.40 0.38
## French 0.25 0.45 0.31
## Geology 0.00 0.75 0.25
## History 0.16 0.63 0.21
## IT 0.40 0.44 0.16
## Math 0.33 0.38 0.29
## Quran 0.27 0.36 0.36
## Science 0.20 0.49 0.31
## Spanish 0.32 0.48 0.20
p <- ggplot(students, aes(topic,fill=class))
p + stat_count()+ ggtitle("Performance by topic") +
labs(x="Topic",y="Count") +
theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
theme(axis.title = element_text(face="bold", size=10))
Biology has the highest share of high performers (53 %). We can also note that no student has scored less than 70 in Geology.
#exploring the semester variable
ggplot(students, aes(x=semester))+
stat_count(color="darkblue", fill="lightblue")+ggtitle("Count of students enrolled in different semesters")
The students are almost equally represented in the two semesters, i.e. first (F) and second (S). Let us try to observe if students tend to perform better in either of these semesters.
#noting its effect on the performance
table(students$semester)
##
## F S
## 245 235
round(prop.table(table(students$semester,students$class),1),2)
##
## L M H
## F 0.31 0.43 0.25
## S 0.21 0.45 0.34
p <- ggplot(students, aes(semester,fill=class))
p + stat_count()+ ggtitle("Performance by semester") +
#the bar height is the number of students, so the y axis is labelled as a count
labs(x="Semester",y="Count") +
theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
theme(axis.title = element_text(face="bold", size=10))
The medium class (70-89) is the largest group in both semesters. However, the proportion of students scoring more than 89 is higher in the second semester (34 % versus 25 %).
ggplot(students, aes(x = sectionid, fill = topic))+
geom_bar() +
labs(x = "Section ID", y = "Count")+
ggtitle("Distribution of students across the sections")
Section A has the most varied course offerings, while the other sections have fewer topics taught. Section C has only IT and Science courses, which might be the reason for the low enrolment in this section.
#exploring the parents' satisfaction variable
ggplot(students, aes(x=p_satis))+
stat_count(color="darkblue", fill="lightblue")+
ggtitle("Parents' satisfaction")
Most of the parents (292 of 480) are satisfied with the school. Let us see if this variable can be instrumental in predicting the performance of the students.
#noting its effect on the performance
table(students$p_satis)
##
## Bad Good
## 188 292
round(prop.table(table(students$p_satis,students$class),1),2)
##
## L M H
## Bad 0.45 0.43 0.13
## Good 0.15 0.45 0.40
p <- ggplot(students, aes(p_satis,fill=class))
p + stat_count()+ ggtitle("Performance by parents' satisfaction") +
#the bar height is the number of students, so the y axis is labelled as a count
labs(x="Parents' satisfaction",y="Count") +
theme(plot.title = element_text(face="bold", size=16, hjust=0)) +
theme(axis.title = element_text(face="bold", size=10))
We can note that the children of parents who are satisfied with the school perform better on average than the children of unsatisfied parents: 40 % of them are in the high class, against 13 %.
#Plotting p_answer vs p_satis
ggplot(students, aes(x = p_answer, fill = p_satis)) +
geom_bar() +
labs(x = "Survey answered by parents", y = "count")+
ggtitle("Survey answered and parents satisfaction")
It can be seen that the parents who are not satisfied with the school tend to fill in the survey less often.
Let us start exploring the 4 continuous variables to note their distribution and effect on the performance of the students.
We will now check if the continuous variables have any considerable correlation among them.
#Checking the correlation between continuous variables
corrplot(cor(students[,c(10:13)]),method="number")
The variables n_visit and raisedhands show a fairly strong positive correlation. Students who visit the course resources often also tend to raise their hands more often in class than those who do not.
Let us analyze the continuous variables in more detail.
#exploring the continuous variables
p <- plot_ly(students, x = ~class, y = ~raisedhands, type = "box") %>%
layout(boxmode = "group",title= "Performance by raisedhands",
xaxis = list(title = 'class',
zeroline = TRUE),
yaxis = list(title = 'Raised hands'
))
p
q<- plot_ly(students, x = ~class, y = ~n_visit, type = "box") %>%
layout(boxmode = "group",title= "Performance by number of visits",
xaxis = list(title = 'class',
zeroline = TRUE),
yaxis = list(title = 'n_visits'
))
q
r <- plot_ly(students, x = ~class, y = ~n_view, type = "box") %>%
layout(boxmode = "group",title= "Performance by number of views",
xaxis = list(title = 'class',
zeroline = TRUE),
yaxis = list(title = 'n_views'
))
r
s <- plot_ly(students, x = ~class, y = ~discussion, type = "box") %>%
layout(boxmode = "group",title= "Performance by number of discussions",
xaxis = list(title = 'class',
zeroline = TRUE),
yaxis = list(title = 'Discussions'
))
s
The performance of the students seems to be related to the number of times they raised their hands, which can be taken as a measure of their involvement in class. Students who raised their hands more often generally outperformed the others. A few students, however, performed well despite little involvement in class and are shown as outliers in the graph above.
Also, the students who viewed the announcements, participated in discussions and visited the resources more often usually scored better than those who did not.
Now that we know that a student’s performance is associated with the number of times the student raised a hand in class, we will try to figure out whether this number is in turn influenced by other variables in the dataset.
ggplot(students, aes(x = raisedhands, color = gender))+
geom_density()+
ggtitle("Raised hands Vs gender")
The female students can be observed to have raised their hands more often than the male students. Our previous analysis showed that the female students outperformed the male students. This agrees with the idea that the number of raised hands is a potential indicator of the academic performance of the students.
ggplot(students, aes(x = raisedhands, color = topic)) +
geom_density() +
ggtitle("Raised hands Vs topic")
Geology appears to have been the most engaging subject of all. IT, on the other hand, has seen very little student participation.
#Checking the proportion of students belonging to different classes
round(table(students$class)/nrow(students),2)
##
## L M H
## 0.26 0.44 0.30
The largest group of students (44 %) belongs to the medium category and scores between 70 and 89. 30 % of the students have secured more than 89, and 127 students (26 %) were placed in the low category.
Let us now split the dataset into training and testing samples. A seed is set before drawing the training sample so that the split is reproducible. The training sample comprises 80 % of the observations (384 in total) and the test set is composed of the remaining 20 % (96).
#Subsetting the data into train and test samples
set.seed(123)
#row indices of the training sample (80% of the rows)
train_idx <- sample(nrow(students), nrow(students) * 0.8)
train <- students[train_idx, ]
test <- students[-train_idx, ]
class(students$class)
## [1] "factor"
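The split above draws rows at random, so the class shares in the two samples can drift from those in the full data. One alternative, not used in this analysis, is to sample 80 % within each class. A sketch on a stand-in data frame (`toy` is hypothetical; only its class counts of 127, 211 and 142 are taken from the summary above):

```r
set.seed(123)

# stand-in data: one row per student, class counts as in the summary
toy <- data.frame(
  id = 1:480,
  class = factor(rep(c("L", "M", "H"), times = c(127, 211, 142)),
                 levels = c("L", "M", "H"))
)

# draw 80% of the row indices separately within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$class)
train_rows <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, floor(0.8 * length(idx)))
}))

train_toy <- toy[train_rows, ]
test_toy <- toy[-train_rows, ]

# class shares are now nearly identical in both samples
print(round(prop.table(table(train_toy$class)), 2))
print(round(prop.table(table(test_toy$class)), 2))
```

Because each class is rounded down separately, the training sample here has 382 rows rather than 384.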
As our response variable, class, is a factor, we need a classification model. Decision tree and random forest models will be developed to predict the performance of the students.
Let us begin by training a decision tree.
#Building a decision tree
tree_model <- rpart(class ~ ., data = train, method = "class")
fancyRpartPlot(tree_model)
#Predicting on training set
train$prediction <- predict(tree_model,type = "class")
table(train$class, train$prediction, dnn = c("Truth", "Predicted"))
## Predicted
## Truth L M H
## L 95 12 1
## M 10 134 19
## H 0 13 100
The accuracy of the model on the training set comes out to be 85.7 % (329 of 384 students classified correctly), which is quite good. Let us now test the performance of our model on the test data.
#Predicting on test set
test$prediction <- predict(tree_model,test,type = "class")
table(test$class, test$prediction, dnn = c("Truth", "Predicted"))
## Predicted
## Truth L M H
## L 11 8 0
## M 3 36 9
## H 1 5 23
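Accuracy can be read off a confusion matrix as the share of counts that lie on the diagonal. A short sketch that re-enters the two decision tree tables above (the helper name `accuracy` is ours):

```r
# share of observations on the diagonal of a confusion matrix
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

labels <- list(Truth = c("L", "M", "H"), Predicted = c("L", "M", "H"))

# decision tree confusion matrices copied from the output above
tree_train <- matrix(c(95, 12, 1,
                       10, 134, 19,
                       0, 13, 100),
                     nrow = 3, byrow = TRUE, dimnames = labels)
tree_test <- matrix(c(11, 8, 0,
                      3, 36, 9,
                      1, 5, 23),
                    nrow = 3, byrow = TRUE, dimnames = labels)

train_acc <- accuracy(tree_train)   # 329 / 384
test_acc <- accuracy(tree_test)     # 70 / 96

print(round(c(train = train_acc, test = test_acc), 3))
```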
The accuracy comes out to be 72.9 % (70 of 96), which is considerably lower than the accuracy on the training set. This suggests that the model has overfit the training data.
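A common way to reduce overfitting in a single tree is to prune it back using the cross-validated error that rpart records while fitting. This step is not part of the original analysis. The sketch below uses the built-in iris data as a stand-in, so the object names `fit`, `best_cp` and `pruned` are illustrative only:

```r
library(rpart)

set.seed(123)

# fit a classification tree on a stand-in dataset
fit <- rpart(Species ~ ., data = iris, method = "class")

# cptable holds one row per candidate tree size, with its
# complexity parameter (CP) and cross-validated error (xerror)
cp_table <- fit$cptable

# pick the complexity parameter with the lowest cross-validated error
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# cut the tree back to that size
pruned <- prune(fit, cp = best_cp)

print(cp_table)
```

Applied to `tree_model`, the same two steps would give a smaller tree whose training accuracy is lower but usually closer to its test accuracy.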
Next, we will explore the random forest model and see if we can improve our model accuracy.
The data frame holds all the independent variables as columns, so we can build a formula from the column names and pass it to randomForest.
#Making a formula for providing as an argument to the random forest model
varNames <- names(students)
#Excluding the response variable "class"
varNames <- varNames[!varNames %in% c("class")]
# add + sign between explanatory variables
varNames1 <- paste(varNames, collapse = "+")
# Add the "class" variable and convert it to a formula object
rf_input <- as.formula(paste("class", varNames1, sep = " ~ "))
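Base R also offers `reformulate()`, which builds the same kind of formula without pasting strings by hand. A minimal sketch on a shortened, illustrative list of column names (`col_names` stands in for the full set):

```r
# a few of the renamed columns, for illustration
col_names <- c("gender", "raisedhands", "n_visit", "class")

# every column except the response becomes a term
features <- setdiff(col_names, "class")

# builds: class ~ gender + raisedhands + n_visit
rf_formula <- reformulate(features, response = "class")

print(rf_formula)
```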
Now that we have the required training data and the formula for building our random forest model, let's build one using 500 decision trees.
#Building the random forest model
rf_model<-randomForest(rf_input,train,ntree=500,importance=TRUE)
plot(rf_model,main="Error rate")
A total of 500 decision trees are used in building the model, and the votes of all these trees are considered in deciding the final class of each student. The graph obtained suggests that the error rate does not fall considerably after 100 trees.
varImpPlot(rf_model,sort=TRUE,main="Variable Importance")
The features that are most useful in determining the performance of the students are depicted in the graphs above, listed in decreasing order of importance. We can conclude that the number of times a student visits the course content, the number of times the student raises a hand in the classroom and the number of days the student is absent are extremely crucial in determining the final grade. Based on these insights, the variables can be selected for any other predictive modeling technique as well.
#Selecting the variables
#importance() has no sort argument; the rows come back in column order
importance(rf_model)
## L M H MeanDecreaseAccuracy
## gender 5.3967537 2.1643174 7.473829 7.948149
## nation 7.7988923 6.9537419 5.374124 11.758944
## birthplace 5.0077935 5.1432231 5.998259 9.544412
## stageid -1.3331670 3.5293190 2.196716 3.491243
## gradeid 1.9149513 5.9541111 4.051593 7.515748
## sectionid -0.8462757 -1.3912504 2.320246 -0.188291
## topic 3.7970192 9.5222092 8.005472 13.079494
## semester -0.5688386 -2.3392693 -2.069180 -3.222765
## relation 7.3177978 0.8382449 15.603990 13.243749
## raisedhands 26.3740333 2.4617059 22.707164 29.646391
## n_visit 28.0237609 5.4651543 22.839899 32.792444
## n_view 19.0286792 7.5321836 12.539526 22.125791
## discussion 8.6672447 4.0938156 10.601223 12.474780
## p_answer 15.0679052 6.4273333 11.914579 17.746647
## p_satis 8.6629530 2.5764839 9.481245 10.369000
## n_absent 39.7714385 5.7953042 38.951456 48.456443
## MeanDecreaseGini
## gender 4.088496
## nation 11.173286
## birthplace 10.661968
## stageid 2.534392
## gradeid 10.349722
## sectionid 3.079700
## topic 19.754304
## semester 1.797588
## relation 7.276900
## raisedhands 39.433665
## n_visit 43.946514
## n_view 28.217505
## discussion 19.973013
## p_answer 8.923195
## p_satis 5.183095
## n_absent 32.791574
The sectionid and semester variables have a negative mean decrease in accuracy, so they do not seem to help in predicting performance. Hence we will build a model excluding these two variables.
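#Final model
#train also holds the decision tree's "prediction" column by now;
#it is excluded so that the forest does not use it as a feature
final_rf_model <- randomForest(class ~ . - semester - sectionid - prediction, data = train, importance = TRUE,
ntree = 500)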
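The choice of variables to drop can be made explicit instead of being read off the table. The sketch below copies the MeanDecreaseAccuracy column from the output above into a named vector; in a live session the same values could be taken from `importance(rf_model)[, "MeanDecreaseAccuracy"]`. The cut-off of zero is our assumption:

```r
# MeanDecreaseAccuracy values copied from the importance output above
mda <- c(gender = 7.948149, nation = 11.758944, birthplace = 9.544412,
         stageid = 3.491243, gradeid = 7.515748, sectionid = -0.188291,
         topic = 13.079494, semester = -3.222765, relation = 13.243749,
         raisedhands = 29.646391, n_visit = 32.792444, n_view = 22.125791,
         discussion = 12.474780, p_answer = 17.746647, p_satis = 10.369000,
         n_absent = 48.456443)

# variables whose permutation does not reduce accuracy
drop_vars <- names(mda)[mda <= 0]

# variables ranked from most to least important
ranked <- names(sort(mda, decreasing = TRUE))

print(drop_vars)
print(head(ranked, 3))
```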
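Next, we predict the performance of the students using the random forest model. This needs to be carried out on both the training and the test sets. Let us start with the training set and then calculate the confusion matrix. Note that calling predict on a randomForest object without new data returns the out-of-bag predictions, so each training observation is classified only by trees that did not see it. The confusion matrix gives us a count of the misclassified grades.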
#predicting the performance of the students in the training set
train$predicted_class<-predict(final_rf_model)
table(train$class, train$predicted_class, dnn = c("Truth", "Predicted"))
## Predicted
## Truth L M H
## L 98 9 1
## M 13 135 15
## H 0 15 98
The accuracy using the random forest model is 86.2 % (331 of 384). Now we can predict the class for the test sample and calculate the model accuracy for that sample.
#predicting the performance of the students in the testing set
test$predicted_class<-predict(final_rf_model,test)
table(test$class, test$predicted_class, dnn = c("Truth", "Predicted"))
## Predicted
## Truth L M H
## L 12 7 0
## M 3 38 7
## H 0 7 22
The accuracy has dropped to 75 % (72 of 96), but it is still better than that of the decision tree model. Hence we will use the random forest model as our final model to predict the performance of the students.
The exploratory data analysis and the final predictive model built using random forest revealed some interesting findings which are listed below.
Although random forest was used to develop the model, it gave us a test accuracy of only 75 %. It would be advisable to try other classification techniques like logistic regression and neural networks and see if they can further increase the accuracy of the model.
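Before comparing other models, it helps to see where the current one goes wrong, since overall accuracy hides how the errors are spread across the three classes. A sketch that derives per-class recall and precision from the random forest test confusion matrix reported above:

```r
# random forest test confusion matrix copied from the output above
rf_test <- matrix(c(12, 7, 0,
                    3, 38, 7,
                    0, 7, 22),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Truth = c("L", "M", "H"),
                                  Predicted = c("L", "M", "H")))

# recall: share of each true class that was predicted correctly
recall <- diag(rf_test) / rowSums(rf_test)

# precision: share of each predicted class that was correct
precision <- diag(rf_test) / colSums(rf_test)

print(round(rbind(recall, precision), 3))
```

In this table the low class has the weakest recall (12 of 19), and all of its misses are predicted as medium. Those are the students a school would most want to identify.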