For this assignment, I used the “Students’ Academic Performance Data set” from Kaggle. This educational data set was collected from a learning management system (LMS) called Kalboard 360, using a learning activity tracker tool called the Experience API (xAPI). The data set contains 480 student records, of which 305 are male and 175 are female. Its 16 features fall into three major categories: (1) demographic features such as gender and nationality, (2) academic background features such as educational stage, grade level and section, and (3) behavioral features such as raising a hand in class, viewing resources, parents answering surveys, and school satisfaction.
The purpose of this analysis is to determine which factors influence student performance in school. I used ordered logistic regression models to predict student academic achievement. The dependent variable in the models is “grades1”, an ordered categorical variable that runs from the lowest grade marks to the highest. I am most interested in VisitedResources, StudentAbsenceDays, and gender: I want to see which of them affects student grades the most and to examine differences by gender. I believe students who visit resources most often will obtain high grade marks because revisiting the material refreshes their memory of the course content. Students who are serious and want to perform well in a course will keep going back to the resources to learn as much as possible about the course requirements and assignments. I therefore expect the number of times students visit resources to be positively associated with high grade marks. I also suspect that students who are absent under 7 days will perform exceptionally well in school and obtain high grade marks, because they are present for most classes and learn each day’s lesson. Lastly, I predict males will perform better in school than females because in this data set there are more males (305) than females (175).
grades1: grade level of the student (ordered categorical dependent variable)
The students are classified into three numerical intervals based on their total grade/mark:
Low-Level: interval includes values from 0 to 69,
Middle-Level: interval includes values from 70 to 89,
High-Level: interval includes values from 90 to 100.
StudentAbsenceDays: the number of absence days for each student (nominal: Above-7, Under-7)
VisitedResources: how many times the student visits the course content (numeric: 0-100)
gender: whether the student is male or female
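As a compact reference for the models fitted below, the ordered logistic (proportional-odds) model can be written as follows. This is a sketch that assumes the parameterization used by `MASS::polr`, which the `ologit` model is assumed to rely on here; the symbols $\zeta$, $\beta$, $\eta_i$ and $\Lambda$ are notation introduced for illustration and do not appear in the original analysis.

```latex
% eta_i is the linear predictor; zeta_{L|M} < zeta_{M|H} are the two cutpoints
\eta_i = x_i^\top \beta, \qquad
\Lambda(u) = \frac{1}{1 + e^{-u}}

\begin{aligned}
P(Y_i = \mathrm{L} \mid x_i) &= \Lambda(\zeta_{L|M} - \eta_i) \\
P(Y_i = \mathrm{M} \mid x_i) &= \Lambda(\zeta_{M|H} - \eta_i) - \Lambda(\zeta_{L|M} - \eta_i) \\
P(Y_i = \mathrm{H} \mid x_i) &= 1 - \Lambda(\zeta_{M|H} - \eta_i)
\end{aligned}
```

Under this sign convention, a positive coefficient shifts probability toward the higher grade levels, and the cutpoints correspond to the "L|M" and "M|H" intercepts in the model summaries.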
library(readr)
Student_Academic_Data<-read_csv("C:\\Users\\Sangita Roy\\Desktop\\Student_Academic_Data.csv")
head(Student_Academic_Data)
library(dplyr)
# Standardize column names, then convert the variables used in the models
Student_Data1 <- Student_Academic_Data %>%
  rename(Nationality = NationalITy,
         VisitedResources = VisITedResources,
         Grades = Class) %>%
  # grades1 is the ordered outcome: L (low) < M (middle) < H (high)
  mutate(grades1 = factor(Grades, ordered = TRUE, levels = c("L", "M", "H")),
         gender = as.factor(gender),
         StudentAbsenceDays = as.factor(StudentAbsenceDays),
         StageID = as.factor(StageID),
         Relation = as.factor(Relation))
head(Student_Data1)
unique(Student_Data1$grades1)
[1] M L H
Levels: L < M < H
library(ZeligChoice)
z.multi <- zelig(grades1 ~ VisitedResources + StudentAbsenceDays, model = "ologit", data = Student_Data1, cite = FALSE)
summary(z.multi)
Model:
Call:
z5$zelig(formula = grades1 ~ VisitedResources + StudentAbsenceDays,
data = Student_Data1)
Coefficients:
Value Std. Error t value
VisitedResources 0.04638 0.00441 10.515
StudentAbsenceDaysUnder-7 3.05166 0.31367 9.729
Intercepts:
Value Std. Error t value
L|M 2.1378 0.2391 8.9392
M|H 6.4703 0.4541 14.2480
Residual Deviance: 614.2876
AIC: 622.2876
Next step: Use 'setx' method
The first model uses ordered logistic regression to predict student performance from the number of times students visit the course content and from whether they were absent under or above 7 days. The model summary does not report p-values, so I calculated them from the t values with the T score calculator on “Social Science Statistics”. Both predictors in Model1 are significant at p < .05 (the p-value is below .00001 for each).
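The same check can be done directly in R instead of an online calculator. The snippet below is a minimal sketch that assumes a two-sided test and a standard normal reference distribution for the t values; a calculator set to a one-tailed test or to a t distribution will give somewhat different numbers. The t values are copied from the Model1 summary above, and the variable names are illustrative.

```r
# t values copied from the Model1 summary
t_values <- c(VisitedResources = 10.515,
              "StudentAbsenceDaysUnder-7" = 9.729)

# Two-sided p-values under a standard normal approximation (an assumption)
p_values <- 2 * pnorm(abs(t_values), lower.tail = FALSE)

# Show the p-values to three significant digits
print(signif(p_values, 3))

# Flag which predictors are significant at the 5% level
print(p_values < 0.05)
```

The same two lines can be reused for the other models by replacing the t values.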
library(ZeligChoice)
z.multi2 <- zelig(grades1 ~ gender + StudentAbsenceDays, model = "ologit", data = Student_Data1, cite = FALSE)
summary(z.multi2)
Model:
Call:
z5$zelig(formula = grades1 ~ gender + StudentAbsenceDays, data = Student_Data1)
Coefficients:
Value Std. Error t value
genderM -0.6784 0.1938 -3.501
StudentAbsenceDaysUnder-7 3.5987 0.3005 11.977
Intercepts:
Value Std. Error t value
L|M -0.0723 0.2073 -0.3488
M|H 3.3226 0.3186 10.4278
Residual Deviance: 753.6634
AIC: 761.6634
Next step: Use 'setx' method
The second ordered logistic regression model shows the effect of gender and of absence days on student academic performance. Both predictors are significant at p < .05 (the p-value is .000253 for gender and below .00001 for absence days).
library(ZeligChoice)
z.multi3 <- zelig(grades1 ~ VisitedResources + gender + StudentAbsenceDays, model = "ologit", data = Student_Data1, cite = FALSE)
summary(z.multi3)
Model:
Call:
z5$zelig(formula = grades1 ~ VisitedResources + gender + StudentAbsenceDays,
data = Student_Data1)
Coefficients:
Value Std. Error t value
VisitedResources 0.04584 0.004425 10.361
genderM -0.55759 0.216609 -2.574
StudentAbsenceDaysUnder-7 2.99097 0.313767 9.532
Intercepts:
Value Std. Error t value
L|M 1.6964 0.2887 5.8755
M|H 6.0678 0.4735 12.8157
Residual Deviance: 607.6281
AIC: 617.6281
Next step: Use 'setx' method
The third model predicts student academic performance from the number of times students visit resources, gender, and absence days. The negative coefficient for genderM indicates that males are less likely than females to achieve high grades. All three predictors are significant at p < .05 (the p-value is below .00001 for VisitedResources and for StudentAbsenceDays (Under-7), and .005176 for gender (M)).
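Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The following is a minimal sketch using the Model3 coefficients copied from the summary above; the variable names are illustrative, and the interpretation assumes the usual proportional-odds setup.

```r
# Model3 coefficients (log-odds scale), copied from the summary
coefs <- c(VisitedResources = 0.04584,
           genderM = -0.55759,
           "StudentAbsenceDaysUnder-7" = 2.99097)

# Odds ratios: multiplicative change in the odds of a higher grade level
odds_ratios <- exp(coefs)

print(round(odds_ratios, 3))
```

An odds ratio below 1 (genderM) means lower odds of falling in a higher grade level for males than for females, holding the other predictors fixed; an odds ratio above 1 means higher odds.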
Based on the AIC and deviance values, Model3 is the best-fitting of the three models because it has the lowest AIC (617.63) and the lowest residual deviance (607.63). This indicates that Model3 is a better predictor of student academic performance in school than the other two models.
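The AIC values can be reproduced from the residual deviances, since AIC equals the deviance plus twice the number of estimated parameters (slopes plus the two cutpoints). The sketch below uses the values reported in the three summaries; the data frame and column names are illustrative.

```r
# Residual deviance and number of estimated parameters for each model
# (number of slope coefficients + 2 cutpoints)
models <- data.frame(
  model    = c("Model1", "Model2", "Model3"),
  deviance = c(614.2876, 753.6634, 607.6281),
  n_params = c(4, 4, 5)
)

# AIC = deviance + 2 * number of parameters
models$aic <- models$deviance + 2 * models$n_params
print(models)

# The model with the smallest AIC is preferred
best <- models$model[which.min(models$aic)]
print(best)
```

The extra parameter in Model3 is penalized by AIC, and the model still comes out ahead of Model1, which omits gender.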
x.below <- setx(z.multi3, StudentAbsenceDays = "Under-7")
x.above <- setx(z.multi3, StudentAbsenceDays = "Above-7")
s.multi3 <- sim(z.multi3, x = x.below, x1 = x.above)
summary(s.multi3)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
L 0.0391725 0.01183576 0.03759658 0.0223391393 0.06772869
M 0.6608589 0.24736272 0.70792347 0.1676966686 0.97213759
H 0.2999686 0.24060561 0.25514908 0.0008991979 0.77808061
pv
mean sd 50% 2.5% 97.5%
[1,] 2.281 0.5121536 2 1 3
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
L 0.43509246 0.04917877 0.43552148 3.405390e-01 0.5330134
M 0.52527322 0.07256877 0.53042908 3.473244e-01 0.6461057
H 0.03963432 0.05555243 0.01678055 3.727199e-05 0.2073550
pv
mean sd 50% 2.5% 97.5%
[1,] 1.606 0.5754086 2 1 3
fd
mean sd 50% 2.5% 97.5%
L 0.3959200 0.04768416 0.3953971 0.3026185 0.4879053248
M -0.1355857 0.20299326 -0.1612539 -0.4461711 0.2167258008
H -0.2603343 0.19583783 -0.2353921 -0.5969084 -0.0008715503
The output above shows the simulated probabilities of the three grade levels for two counterfactual scenarios: students absent under 7 days and students absent above 7 days. The probability of obtaining a high grade mark is 30% for students absent under 7 days, whereas it drops to 4% for students absent above 7 days. The probability of obtaining a low grade mark is much greater for students absent above 7 days (44%) than for students absent under 7 days (4%). The first difference in the probability of obtaining a high grade between the two scenarios is -0.26. In other words, the probability of achieving a high grade mark is 26 percentage points lower for students absent above 7 days than for students absent under 7 days.
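To see where such probabilities come from, they can be approximated by hand from the Model3 estimates. This is a minimal sketch that assumes the `MASS::polr` sign convention, P(Y <= j) = plogis(cutpoint_j - linear predictor), and a hypothetical male student with 55 resource visits; that value is an assumption chosen for illustration, not taken from the analysis. Point estimates computed this way will not match the simulated means exactly, because `sim` averages over draws of the parameters.

```r
# Model3 estimates copied from the summary
beta <- c(VisitedResources = 0.04584,
          genderM = -0.55759,
          "StudentAbsenceDaysUnder-7" = 2.99097)
zeta <- c("L|M" = 1.6964, "M|H" = 6.0678)

# Probabilities of L, M and H for one covariate profile
predict_probs <- function(visited, male, under7) {
  eta <- beta[["VisitedResources"]] * visited +
    beta[["genderM"]] * male +
    beta[["StudentAbsenceDaysUnder-7"]] * under7
  cum <- plogis(zeta - eta)  # P(Y <= L) and P(Y <= M)
  c(L = cum[[1]], M = cum[[2]] - cum[[1]], H = 1 - cum[[2]])
}

# Hypothetical profile: male student, 55 resource visits (assumed value)
p_under <- predict_probs(visited = 55, male = 1, under7 = 1)
p_above <- predict_probs(visited = 55, male = 1, under7 = 0)

print(round(rbind(under7 = p_under, above7 = p_above), 3))
```

Changing only the absence category while holding the other inputs fixed mirrors what the two `setx` calls do.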
plot(s.multi3)
The simulated probabilities for the grade levels vary very widely, as the graphs show. One possible reason is that there are only 480 observations in this data set.
x.male <- setx(z.multi3, gender = "M")
x.female <- setx(z.multi3, gender = "F")
s.multi4 <- sim(z.multi3, x = x.male, x1 = x.female)
summary(s.multi4)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
L 0.03848613 0.01128416 0.03656384 0.0217086566 0.06438851
M 0.68482162 0.24173246 0.74002732 0.1975581167 0.97477881
H 0.27669225 0.23522404 0.21971781 0.0004277427 0.74964619
pv
mean sd 50% 2.5% 97.5%
[1,] 2.254 0.5155092 2 1 3
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
L 0.02291007 0.007720657 0.0213892 0.0120233568 0.04143212
M 0.61374448 0.281678064 0.6496070 0.1165496809 0.98296703
H 0.36334545 0.279471962 0.3309653 0.0006863242 0.85868700
pv
mean sd 50% 2.5% 97.5%
[1,] 2.336 0.5131933 2 2 3
fd
mean sd 50% 2.5% 97.5%
L -0.01557606 0.007244134 -0.01467321 -0.0323113456 -0.003507497
M -0.07107714 0.059148852 -0.06810749 -0.1892625388 0.011021669
H 0.08665321 0.064012412 0.08422758 0.0001008528 0.215577104
Next, I wanted to compare student performance between males and females. The probability of obtaining high grade marks is 28% for males and 36% for females. The simulated first difference in the probability of obtaining high grade marks between females and males is about 0.09 (0.087). In other words, the probability of obtaining high grades is roughly 9 percentage points higher for females than for males. On the other hand, the probability of obtaining middle-level grade marks is about 7 percentage points lower for females than for males, although the 95% interval for that difference (-0.19 to 0.01) includes zero.
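The first difference reported by `sim` is an absolute difference in probabilities (percentage points), which is not the same as a relative change. The sketch below contrasts the two, using the simulated mean probabilities of a high grade copied from the output above; the variable names are illustrative.

```r
# Simulated mean probabilities of a high grade, copied from the output
p_high_male   <- 0.27669225
p_high_female <- 0.36334545

# Absolute difference, in probability units (multiply by 100 for points)
abs_diff <- p_high_female - p_high_male

# Relative difference: how much larger the female probability is
rel_diff <- p_high_female / p_high_male - 1

cat(sprintf("Absolute difference: %.1f percentage points\n", 100 * abs_diff))
cat(sprintf("Relative difference: %.1f%%\n", 100 * rel_diff))
```

Stating which of the two measures is meant avoids ambiguity when describing one group as "more likely" than another.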
plot(s.multi4)
In conclusion, as predicted, students who are absent under 7 days have a higher probability of achieving high grades in school than students who are absent above 7 days. Contrary to my prediction, however, females have a higher probability of achieving high grades than males. One possible explanation is that, because there are fewer females in this data set, the females who are enrolled are more serious about achieving academic success in order to build their own careers.
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student’s performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.