I was tasked with finding a way to predict students' post test scores from the other data recorded for each student. Inspecting the structure of the data set showed that every column was of character type except the number of students in the class (n_student) and the pre and post test scores.
str(full_scores)
## 'data.frame': 2133 obs. of 11 variables:
## $ school : chr "ANKYI" "ANKYI" "ANKYI" "ANKYI" ...
## $ school_setting : chr "Urban" "Urban" "Urban" "Urban" ...
## $ school_type : chr "Non-public" "Non-public" "Non-public" "Non-public" ...
## $ classroom : chr "6OL" "6OL" "6OL" "6OL" ...
## $ teaching_method: chr "Standard" "Standard" "Standard" "Standard" ...
## $ n_student : num 20 20 20 20 20 20 20 20 20 20 ...
## $ student_id : chr "2FHT3" "3JIVH" "3XOWE" "556O0" ...
## $ gender : chr "Female" "Female" "Male" "Female" ...
## $ lunch : chr "Does not qualify" "Does not qualify" "Does not qualify" "Does not qualify" ...
## $ pretest : num 62 66 64 61 64 66 63 63 64 61 ...
## $ posttest : num 72 79 76 77 76 74 75 72 77 72 ...
On closer inspection, the student_id column holds a distinct value for each student, so it would not help in predicting a post test score. The classroom column identifies the specific classroom each student was in, and the n_student column gives the number of students in that class. Since both columns capture similar information, I decided to use only one of them, n_student, in building the model.
Each character column used in building the model was converted to a factor so that it would be treated as a categorical variable.
full_scores$school <- as.factor(full_scores$school)
full_scores$school_setting <- as.factor(full_scores$school_setting)
full_scores$school_type <- as.factor(full_scores$school_type)
full_scores$teaching_method <- as.factor(full_scores$teaching_method)
full_scores$gender <- as.factor(full_scores$gender)
full_scores$lunch <- as.factor(full_scores$lunch)
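The same conversion can be written more compactly. The following is a minimal sketch, not the code used for this report: `scores` is a small made-up data frame that borrows a few of the report's column names, and `factor_cols` is a hypothetical helper vector.

```r
# Toy stand-in for full_scores (made-up rows, column names taken from the report)
scores <- data.frame(
  school_setting  = c("Urban", "Rural", "Suburban", "Urban"),
  teaching_method = c("Standard", "Experimental", "Standard", "Standard"),
  gender          = c("Female", "Male", "Male", "Female"),
  posttest        = c(72, 79, 76, 77),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the character columns to convert
factor_cols <- c("school_setting", "teaching_method", "gender")

# lapply() builds one factor per column; the result is assigned back in place
scores[factor_cols] <- lapply(scores[factor_cols], as.factor)

str(scores)
```

Listing the columns in one vector keeps the conversion in a single place if more categorical columns are added later.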
A series of calculations and plots was used to assess the relationship between each column and the final post test score. The pre test score was found to have a strong positive relationship with the post test score.
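One way to quantify the relationship between the two scores is the Pearson correlation. The sketch below uses only the ten pre and post test values visible in the str() output above, which all come from one classroom, so the number it prints is purely illustrative and is not the correlation for the full data set.

```r
# First ten pretest/posttest values as shown in the str() output
pretest  <- c(62, 66, 64, 61, 64, 66, 63, 63, 64, 61)
posttest <- c(72, 79, 76, 77, 76, 74, 75, 72, 77, 72)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
r <- cor(pretest, posttest)
r
```

On the full data the equivalent call would be cor(full_scores$pretest, full_scores$posttest).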
The teaching method was also clearly related to the post test score: students taught with the Experimental method had an average score of 72.98, about 9.14 points higher than the 63.85 average of students taught with the Standard method.
aggregate(x=full_scores$posttest, by=list(full_scores$teaching_method), FUN=mean)
## Group.1 x
## 1 Experimental 72.98289
## 2 Standard 63.84705
The school setting also had an impact on the test scores as those in the Suburban schools had a far higher average of 76.04 compared to their counterparts in the Rural and Urban schools with averages of 64.05 and 61.75 respectively.
aggregate(x=full_scores$posttest, by=list(full_scores$school_setting), FUN=mean)
## Group.1 x
## 1 Rural 64.05098
## 2 Suburban 76.03766
## 3 Urban 61.74834
Whether or not a student qualifies for free or reduced lunch was associated with an even larger gap than the factors mentioned above. Students who qualify had an average of 57.48, while those who do not qualify had a much higher average of 74.38, a difference of about 16.90 points.
aggregate(x=full_scores$posttest, by=list(full_scores$lunch), FUN=mean)
## Group.1 x
## 1 Does not qualify 74.37531
## 2 Qualifies for reduced/free lunch 57.47603
Unsurprisingly, the school type also showed a large difference in average scores: students in Public schools averaged 64.02, compared to 75.96 for those in Non-public schools.
aggregate(x=full_scores$posttest, by=list(full_scores$school_type), FUN=mean)
## Group.1 x
## 1 Non-public 75.96189
## 2 Public 64.01643
The average scores of the two genders, however, showed very little difference: males averaged only about 0.2 points higher than females.
aggregate(x=full_scores$posttest, by=list(full_scores$gender), FUN=mean)
## Group.1 x
## 1 Female 67.00473
## 2 Male 67.19777
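The group averages above were each computed with a separate aggregate() call. As a hedged sketch of how the same comparison could be run over several columns at once, the example below loops over a vector of column names. The data frame `scores` and the vector `group_cols` are made up for illustration.

```r
# Toy stand-in for full_scores (made-up rows)
scores <- data.frame(
  teaching_method = c("Standard", "Experimental", "Standard", "Experimental"),
  gender          = c("Female", "Male", "Male", "Female"),
  posttest        = c(60, 70, 64, 74),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the categorical columns to summarise
group_cols <- c("teaching_method", "gender")

# One table of mean post test scores per grouping column
group_means <- lapply(group_cols, function(col) {
  aggregate(x = scores$posttest, by = list(group = scores[[col]]), FUN = mean)
})
names(group_means) <- group_cols

group_means$teaching_method
group_means$gender
```

Each element of `group_means` has the same two-column layout as the aggregate() output shown in this report, with the group label in `group` and the mean in `x`.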
A plot of the test scores against the schools the students attended showed a large gap between the highest and lowest school averages: UKPGS had an average of 91.16 and KZKKE had an average of 47.92, a difference of 43.24 points.
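The per-school averages and the gap between the best and worst school can be computed directly. This is a minimal sketch on made-up data; the school codes "A", "B" and "C" are hypothetical placeholders, not schools from the data set.

```r
# Toy stand-in for full_scores (hypothetical schools and scores)
scores <- data.frame(
  school   = c("A", "A", "B", "B", "C"),
  posttest = c(90, 92, 48, 50, 70),
  stringsAsFactors = FALSE
)

# Mean post test score per school, returned as a named vector
school_means <- tapply(scores$posttest, scores$school, mean)

# Schools with the highest and lowest averages
best  <- names(which.max(school_means))
worst <- names(which.min(school_means))

# Difference between the highest and lowest school average
gap <- max(school_means) - min(school_means)

school_means
c(best = best, worst = worst)
gap
```

The quantity computed here is a difference between two means (a range), which is distinct from the statistical variance of the school averages.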
The number of students in the class had a moderate negative correlation with the post test score (about -0.50), which suggests that students in larger classes tend to score lower.
cor(full_scores$posttest, full_scores$n_student)
## [1] -0.5048864
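cor() returns only the coefficient. If a formal check of the relationship is wanted, cor.test() from base R's stats package also reports a p-value. The sketch below runs it on a handful of made-up class sizes and scores, so its output says nothing about the real data.

```r
# Made-up class sizes and post test scores, for illustration only
n_student <- c(17, 20, 22, 24, 28, 30)
posttest  <- c(88, 75, 70, 66, 55, 58)

# Pearson correlation test: estimate plus a p-value for H0 "correlation is zero"
result <- cor.test(posttest, n_student)

result$estimate  # the correlation coefficient
result$p.value   # probability of a correlation this extreme under H0
```

Because n_student is recorded once per classroom and repeated for every student in it, the rows are not independent, so a p-value computed on the student-level data should be read with caution.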
With these observations complete, it was time to build the model. The n_student column was converted to a factor to ensure that all classrooms with the same number of students were grouped together.
full_scores$n_student <- as.factor(full_scores$n_student)
Given the results above, the only column that may not have a meaningful effect on the post test score is gender. Therefore two models were built, one without gender and one with it.
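Both models are fitted on a training portion of the data and evaluated on the held-out remainder. As a hedged illustration of an 80/20 split that gives the same rows on every run, the sketch below uses base R's sample() with a fixed seed instead of the split function used in this report. The seed value and the toy data frame are assumptions made for the example.

```r
# Fixing the seed makes the random split repeatable (42 is an arbitrary choice)
set.seed(42)

# Toy stand-in for full_scores (made-up values)
scores <- data.frame(pretest = 1:100, posttest = 1:100 + 10)

# Randomly pick 80% of the row indices for training
n         <- nrow(scores)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

training <- scores[train_idx, ]
testing  <- scores[-train_idx, ]  # the remaining 20%

nrow(training)
nrow(testing)
```

Without a fixed seed, each run produces a different split, so error figures such as the RMSE values reported here will vary slightly from run to run.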
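library(caTools)       # provides sample.split()
library(randomForest)  # provides randomForest()
# split_data is a logical vector: TRUE marks training rows, FALSE marks testing rows
split_data <- sample.split(full_scores$posttest, SplitRatio = 0.8)
testing_scores <- subset(full_scores, !split_data)
training_scores <- subset(full_scores, split_data)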
equation_1 <- ("posttest~. -student_id -classroom -gender ")
formula_1 <- as.formula(equation_1)
equation_2 <- ("posttest~. -student_id -classroom")
formula_2 <- as.formula(equation_2)
# nodesize is meant to be 1% of the training rows; the original 0*01*nrow(...) evaluates to 0
model_1 <- randomForest(formula = formula_1, data=training_scores, ntree=500,
nodesize = 0.01*nrow(training_scores), mtry=3)
model_2 <- randomForest(formula = formula_2, data=training_scores, ntree=500,
nodesize = 0.01*nrow(training_scores), mtry=3)
predict_1 <- predict(model_1, testing_scores)
predict_2 <- predict(model_2, testing_scores)
Actual_Scores <- testing_scores$posttest
Comparison_1 <- as.data.frame(Actual_Scores)
Comparison_2 <- as.data.frame(Actual_Scores)
Comparison_1$Predicted_Scores <- predict_1
Comparison_2$Predicted_Scores <- predict_2
Comparison_1$Error <- round((Comparison_1$Actual_Scores -
Comparison_1$Predicted_Scores),2)
Comparison_2$Error <- round((Comparison_2$Actual_Scores -
Comparison_2$Predicted_Scores),2)
rmse_1 <- sqrt(mean(Comparison_1$Error^2))
rmse_2 <- sqrt(mean(Comparison_2$Error^2))
rmse_1
## [1] 2.949075
rmse_2
## [1] 2.903714
The root mean squared error (RMSE) of the model with gender omitted is slightly higher, 2.95, than that of the model with gender included, 2.90. Therefore, even though the gender averages were nearly identical, including gender made the model slightly more accurate on this test split.
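The RMSE is the square root of the mean of the squared differences between actual and predicted scores. A small helper makes the calculation explicit. The function name `rmse` and the four score pairs below are made up for illustration.

```r
# Root mean squared error between two numeric vectors of equal length
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up actual and predicted scores
actual    <- c(72, 79, 76, 77)
predicted <- c(70, 80, 75, 79)

# Errors are 2, -1, 1, -2, so the mean squared error is 10 / 4 = 2.5
rmse(actual, predicted)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is expressed in the same units as the test scores.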
A plot depicting the values of the predicted and actual post test scores using model_2 is shown below:
Two observations were of particular interest. The first is that the students who got a perfect score on the post test were all from the same classroom, P2A.
library(dplyr)  # provides filter() for data frames
filter(full_scores, posttest>99)
## school school_setting school_type classroom teaching_method n_student
## 1 IDGFP Urban Non-public P2A Experimental 17
## 2 IDGFP Urban Non-public P2A Experimental 17
## 3 IDGFP Urban Non-public P2A Experimental 17
## 4 IDGFP Urban Non-public P2A Experimental 17
## 5 IDGFP Urban Non-public P2A Experimental 17
## 6 IDGFP Urban Non-public P2A Experimental 17
## 7 IDGFP Urban Non-public P2A Experimental 17
## 8 IDGFP Urban Non-public P2A Experimental 17
## student_id gender lunch pretest posttest
## 1 BYVSP Male Does not qualify 86 100
## 2 D9SR6 Female Does not qualify 83 100
## 3 K5955 Male Does not qualify 85 100
## 4 P32P9 Male Does not qualify 83 100
## 5 QXTHU Male Does not qualify 93 100
## 6 RG9R4 Male Does not qualify 83 100
## 7 SH2DM Male Does not qualify 88 100
## 8 W4KYQ Female Does not qualify 81 100
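Rather than reading the classroom off the printed rows, the claim can be checked in code. This is a minimal sketch on a made-up data frame that reuses two classroom codes from the report. It uses base R's subset() and unique().

```r
# Toy stand-in for full_scores (made-up rows)
scores <- data.frame(
  classroom = c("P2A", "P2A", "6OL", "P2A"),
  posttest  = c(100, 100, 72, 100),
  stringsAsFactors = FALSE
)

# Keep only the perfect scores, then list the distinct classrooms among them
top            <- subset(scores, posttest > 99)
top_classrooms <- unique(top$classroom)

top_classrooms
length(top_classrooms) == 1  # TRUE when every perfect score comes from one classroom
```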
The second is that every student who scored below 40 attended a Public school and qualified for free or reduced lunch.
filter(full_scores, posttest<40)
## school school_setting school_type classroom teaching_method n_student
## 1 GOOBU Urban Public CXC Standard 24
## 2 GOOBU Urban Public CXC Standard 24
## 3 GOOBU Urban Public HKF Standard 28
## 4 GOOBU Urban Public HKF Standard 28
## 5 GOOBU Urban Public HKF Standard 28
## 6 GOOBU Urban Public HKF Standard 28
## 7 GOOBU Urban Public HKF Standard 28
## 8 GOOBU Urban Public HKF Standard 28
## 9 GOOBU Urban Public HKF Standard 28
## 10 GOOBU Urban Public HKF Standard 28
## 11 KZKKE Rural Public 3D0 Standard 22
## 12 KZKKE Rural Public 3D0 Standard 22
## 13 KZKKE Rural Public 3D0 Standard 22
## 14 KZKKE Rural Public 3D0 Standard 22
## 15 KZKKE Rural Public 3D0 Standard 22
## 16 KZKKE Rural Public 3D0 Standard 22
## 17 KZKKE Rural Public 5JK Standard 24
## 18 KZKKE Rural Public 5JK Standard 24
## 19 KZKKE Rural Public QTU Standard 23
## 20 KZKKE Rural Public QTU Standard 23
## 21 VVTVA Urban Public A93 Experimental 30
## student_id gender lunch pretest posttest
## 1 11I5O Male Qualifies for reduced/free lunch 31 39
## 2 JX5I4 Female Qualifies for reduced/free lunch 33 36
## 3 78IT6 Female Qualifies for reduced/free lunch 32 32
## 4 8BONX Male Qualifies for reduced/free lunch 31 38
## 5 ESWTA Male Qualifies for reduced/free lunch 32 38
## 6 ET5M9 Female Qualifies for reduced/free lunch 23 35
## 7 G7UZ5 Male Qualifies for reduced/free lunch 30 39
## 8 JR3ZP Female Qualifies for reduced/free lunch 30 39
## 9 LTIOX Male Qualifies for reduced/free lunch 26 34
## 10 T7HV4 Male Qualifies for reduced/free lunch 27 39
## 11 EOZV5 Female Qualifies for reduced/free lunch 30 36
## 12 GFVOX Female Qualifies for reduced/free lunch 28 37
## 13 GONLU Female Qualifies for reduced/free lunch 29 39
## 14 LYT0H Female Qualifies for reduced/free lunch 32 39
## 15 RHOL5 Male Qualifies for reduced/free lunch 31 37
## 16 T1N79 Female Qualifies for reduced/free lunch 26 38
## 17 8Q0G9 Female Qualifies for reduced/free lunch 33 39
## 18 HQ18M Female Qualifies for reduced/free lunch 29 39
## 19 CRILM Male Qualifies for reduced/free lunch 32 38
## 20 QM9VS Male Qualifies for reduced/free lunch 26 37
## 21 HY8JN Male Qualifies for reduced/free lunch 29 39
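A statement that holds for every row of a filtered result can be confirmed with all(), and table() gives a quick cross-tabulation. The sketch below is illustrative only: the four rows in `scores` are made up, with category labels copied from the report.

```r
# Toy stand-in for full_scores (made-up rows)
scores <- data.frame(
  school_type = c("Public", "Public", "Non-public", "Public"),
  lunch       = c("Qualifies for reduced/free lunch",
                  "Qualifies for reduced/free lunch",
                  "Does not qualify",
                  "Does not qualify"),
  posttest    = c(36, 39, 72, 55),
  stringsAsFactors = FALSE
)

# Students scoring below 40
low <- subset(scores, posttest < 40)

# TRUE only if the condition holds for every low-scoring student
all_public  <- all(low$school_type == "Public")
all_qualify <- all(low$lunch == "Qualifies for reduced/free lunch")

all_public
all_qualify

# Counts of low scorers by school type and lunch status
table(low$school_type, low$lunch)
```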