I was tasked with finding a method to predict students' post test scores from various other data on each student. Inspecting the structure of the data set showed that every column was of character type except the number of students (n_student) and the pre and post test scores.

str(full_scores)
## 'data.frame':    2133 obs. of  11 variables:
##  $ school         : chr  "ANKYI" "ANKYI" "ANKYI" "ANKYI" ...
##  $ school_setting : chr  "Urban" "Urban" "Urban" "Urban" ...
##  $ school_type    : chr  "Non-public" "Non-public" "Non-public" "Non-public" ...
##  $ classroom      : chr  "6OL" "6OL" "6OL" "6OL" ...
##  $ teaching_method: chr  "Standard" "Standard" "Standard" "Standard" ...
##  $ n_student      : num  20 20 20 20 20 20 20 20 20 20 ...
##  $ student_id     : chr  "2FHT3" "3JIVH" "3XOWE" "556O0" ...
##  $ gender         : chr  "Female" "Female" "Male" "Female" ...
##  $ lunch          : chr  "Does not qualify" "Does not qualify" "Does not qualify" "Does not qualify" ...
##  $ pretest        : num  62 66 64 61 64 66 63 63 64 61 ...
##  $ posttest       : num  72 79 76 77 76 74 75 72 77 72 ...

On closer inspection, the student_id column had a distinct value for each student, so it would not help in predicting a post test score. The classroom column identified the specific classroom each student was in, and the n_student column gave the number of students in that class. Since both columns capture similar information, I decided to use only one of them, n_student, in building the model.

Each character column used in building the model was converted to a factor so that it would be treated as a categorical variable.

full_scores$school <- as.factor(full_scores$school)
full_scores$school_setting <- as.factor(full_scores$school_setting)
full_scores$school_type <- as.factor(full_scores$school_type)
full_scores$teaching_method <- as.factor(full_scores$teaching_method)
full_scores$gender <- as.factor(full_scores$gender)
full_scores$lunch <- as.factor(full_scores$lunch)
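
The six conversions above could also be written as a single step. The following is a minimal sketch, assuming a small made-up data frame `toy_scores` standing in for `full_scores` (its values are illustrative and not taken from the data set):

```r
# Hypothetical stand-in for full_scores; values are made up for illustration.
toy_scores <- data.frame(
  school_setting  = c("Urban", "Rural", "Suburban", "Urban"),
  teaching_method = c("Standard", "Experimental", "Standard", "Standard"),
  posttest        = c(72, 79, 76, 77),
  stringsAsFactors = FALSE
)

# Columns to convert; for the real data this would list the six columns above.
factor_cols <- c("school_setting", "teaching_method")

# lapply() applies as.factor() to each selected column in turn.
toy_scores[factor_cols] <- lapply(toy_scores[factor_cols], as.factor)

str(toy_scores)
```

Listing the column names in one vector keeps the conversion in a single place if more categorical columns are added later.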

A series of calculations and plots was used to assess the relationship between each column and the final post test score. The pre test score was found to have a strong positive relationship with the post test score.
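
The calculation behind the pre test comparison is not shown here. One way it might be checked is with `cor()`, sketched below on made-up scores (the vectors `toy_pretest` and `toy_posttest` are hypothetical and not drawn from the data set):

```r
# Made-up pre and post test scores for six hypothetical students.
toy_pretest  <- c(30, 45, 55, 62, 70, 85)
toy_posttest <- c(38, 52, 66, 72, 81, 100)

# Pearson correlation: values near 1 indicate a strong positive relationship.
toy_cor <- cor(toy_pretest, toy_posttest)
toy_cor

# A scatter plot gives a visual check of the same relationship.
plot(toy_pretest, toy_posttest,
     xlab = "Pre test score", ylab = "Post test score")
```

On the real data the equivalent call would presumably be `cor(full_scores$pretest, full_scores$posttest)`.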

The teaching method was also clearly related to the post test score: students in the Experimental method had an average score of 72.98, which was 9.14 points higher than the 63.85 average of students in the Standard method.

aggregate(x=full_scores$posttest, by=list(full_scores$teaching_method), FUN=mean)
##        Group.1        x
## 1 Experimental 72.98289
## 2     Standard 63.84705

The school setting also had an impact on the test scores as those in the Suburban schools had a far higher average of 76.04 compared to their counterparts in the Rural and Urban schools with averages of 64.05 and 61.75 respectively.

aggregate(x=full_scores$posttest, by=list(full_scores$school_setting), FUN=mean)
##    Group.1        x
## 1    Rural 64.05098
## 2 Suburban 76.03766
## 3    Urban 61.74834

Whether or not a student qualifies for free or reduced lunch was associated with an even larger gap than the factors mentioned above. Students who qualify for free or reduced lunch had an average of 57.48, while those who do not qualify had a much higher average of 74.38, a difference of 16.90 points.

aggregate(x=full_scores$posttest, by=list(full_scores$lunch), FUN=mean)
##                            Group.1        x
## 1                 Does not qualify 74.37531
## 2 Qualifies for reduced/free lunch 57.47603

Unsurprisingly, the school type also showed a large difference in average scores: students in Public schools had an average of 64.02, compared with 75.96 for those in Non-public schools.

aggregate(x=full_scores$posttest, by=list(full_scores$school_type), FUN=mean)
##      Group.1        x
## 1 Non-public 75.96189
## 2     Public 64.01643

The average scores of the two genders, however, showed very little difference: males averaged only about 0.2 points higher than females.

aggregate(x=full_scores$posttest, by=list(full_scores$gender), FUN=mean)
##   Group.1        x
## 1  Female 67.00473
## 2    Male 67.19777

A plot of the test scores against the schools the students attended showed a wide gap between the highest and lowest school averages: UKPGS had an average of 91.16 and KZKKE had an average of 47.92, a difference of 43.24.
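
The code for the per-school averages is not shown. A minimal base-R sketch of how they might be computed, using a made-up data frame (the school names `A`, `B`, `C` and their scores are hypothetical):

```r
# Hypothetical stand-in for full_scores with three schools.
toy_scores <- data.frame(
  school   = c("A", "A", "B", "B", "C", "C"),
  posttest = c(90, 92, 50, 46, 70, 74)
)

# Mean post test score per school.
school_means <- aggregate(posttest ~ school, data = toy_scores, FUN = mean)

# Sort from the highest to the lowest average.
school_means <- school_means[order(-school_means$posttest), ]

# Gap between the highest and lowest school averages.
score_gap <- max(school_means$posttest) - min(school_means$posttest)
score_gap
```

The formula interface to `aggregate()` keeps the original column names, which makes the sorted result easier to read than the `Group.1` and `x` columns produced by the `by=` form.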

The number of students in the class had a moderate negative correlation with the test scores (about -0.50), which suggests that the more students there are in a class, the lower the test scores tend to be.

cor(full_scores$posttest, full_scores$n_student)
## [1] -0.5048864

After these observations, it was time to build the model. The n_student column was converted to a factor to ensure that all classrooms with the same number of students were grouped together.

full_scores$n_student <- as.factor(full_scores$n_student)

Given the results above, the only column that might not have a meaningful effect on the post test score was gender. Therefore two models were built, one with and one without gender taken into consideration.

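library(caTools)       # provides sample.split()
library(randomForest)  # provides randomForest()

# No seed is set, so the split and the results below vary from run to run
split_data <- sample.split(full_scores$posttest, SplitRatio = 0.8)
testing_scores <- subset(full_scores, split_data == FALSE)
training_scores <- subset(full_scores, split_data == TRUE)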

equation_1 <- ("posttest~. -student_id -classroom -gender ")
formula_1 <- as.formula(equation_1)

equation_2 <- ("posttest~. -student_id -classroom")
formula_2 <- as.formula(equation_2)

# nodesize is set to 1% of the number of training rows
model_1 <- randomForest(formula = formula_1, data=training_scores, ntree=500,
                        nodesize = 0.01*nrow(training_scores), mtry=3)

model_2 <- randomForest(formula = formula_2, data=training_scores, ntree=500,
                        nodesize = 0.01*nrow(training_scores), mtry=3)

predict_1 <- predict(model_1, testing_scores)

predict_2 <- predict(model_2, testing_scores)

Actual_Scores <- testing_scores$posttest
Comparison_1 <- as.data.frame(Actual_Scores)

Comparison_2 <- as.data.frame(Actual_Scores)

Comparison_1$Predicted_Scores <- predict_1

Comparison_2$Predicted_Scores <- predict_2

Comparison_1$Error <- round((Comparison_1$Actual_Scores - 
                               Comparison_1$Predicted_Scores),2)

Comparison_2$Error <- round((Comparison_2$Actual_Scores - 
                               Comparison_2$Predicted_Scores),2)
rmse_1 <- sqrt(mean(Comparison_1$Error^2))
rmse_2 <- sqrt(mean(Comparison_2$Error^2))
rmse_1
## [1] 2.949075
rmse_2
## [1] 2.903714

The root mean squared error (RMSE) of the model with gender omitted is higher, 2.95, than that of the model with gender included, 2.90. Therefore, even though the average scores of the genders barely differed, including gender made the model slightly more accurate on this test set.
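
For reference, the RMSE is the square root of the mean of the squared differences between actual and predicted scores. A minimal sketch as a helper function follows; the name `rmse` and the example vectors are made up for illustration:

```r
# Root mean squared error between two numeric vectors of equal length.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Made-up scores: the errors are 0, 0 and -2.
actual_example    <- c(70, 80, 90)
predicted_example <- c(70, 80, 92)

rmse_example <- rmse(actual_example, predicted_example)
rmse_example
```

Computing the RMSE from unrounded errors, as this helper does, avoids the small loss of precision that comes from rounding each error to two decimals first.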

A plot depicting the values of the predicted and actual post test scores using model_2 is shown below:
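
One way such a plot could be drawn in base R is sketched below. It assumes a data frame shaped like `Comparison_2`; the data frame `toy_comparison` and its values are made up for illustration:

```r
# Hypothetical stand-in for Comparison_2; values are made up.
toy_comparison <- data.frame(
  Actual_Scores    = c(72, 79, 76, 77, 60),
  Predicted_Scores = c(74.1, 77.3, 75.8, 75.2, 62.5)
)

# Scatter plot of predicted against actual scores.
plot(toy_comparison$Actual_Scores, toy_comparison$Predicted_Scores,
     xlab = "Actual post test score",
     ylab = "Predicted post test score",
     main = "Predicted vs actual post test scores")

# Dashed reference line y = x: points on it are predicted exactly.
abline(a = 0, b = 1, lty = 2)
```

Points above the dashed line are over-predictions and points below it are under-predictions.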

Two observations were of particular interest. First, the students who got a perfect score on the post test were all from the same classroom, P2A.

library(dplyr)  # provides filter()
filter(full_scores, posttest>99)
##   school school_setting school_type classroom teaching_method n_student
## 1  IDGFP          Urban  Non-public       P2A    Experimental        17
## 2  IDGFP          Urban  Non-public       P2A    Experimental        17
## 3  IDGFP          Urban  Non-public       P2A    Experimental        17
## 4  IDGFP          Urban  Non-public       P2A    Experimental        17
## 5  IDGFP          Urban  Non-public       P2A    Experimental        17
## 6  IDGFP          Urban  Non-public       P2A    Experimental        17
## 7  IDGFP          Urban  Non-public       P2A    Experimental        17
## 8  IDGFP          Urban  Non-public       P2A    Experimental        17
##   student_id gender            lunch pretest posttest
## 1      BYVSP   Male Does not qualify      86      100
## 2      D9SR6 Female Does not qualify      83      100
## 3      K5955   Male Does not qualify      85      100
## 4      P32P9   Male Does not qualify      83      100
## 5      QXTHU   Male Does not qualify      93      100
## 6      RG9R4   Male Does not qualify      83      100
## 7      SH2DM   Male Does not qualify      88      100
## 8      W4KYQ Female Does not qualify      81      100

The other observation of interest was that the students who scored below 40 were all in Public schools and all qualified for free or reduced lunch.

filter(full_scores, posttest<40)
##    school school_setting school_type classroom teaching_method n_student
## 1   GOOBU          Urban      Public       CXC        Standard        24
## 2   GOOBU          Urban      Public       CXC        Standard        24
## 3   GOOBU          Urban      Public       HKF        Standard        28
## 4   GOOBU          Urban      Public       HKF        Standard        28
## 5   GOOBU          Urban      Public       HKF        Standard        28
## 6   GOOBU          Urban      Public       HKF        Standard        28
## 7   GOOBU          Urban      Public       HKF        Standard        28
## 8   GOOBU          Urban      Public       HKF        Standard        28
## 9   GOOBU          Urban      Public       HKF        Standard        28
## 10  GOOBU          Urban      Public       HKF        Standard        28
## 11  KZKKE          Rural      Public       3D0        Standard        22
## 12  KZKKE          Rural      Public       3D0        Standard        22
## 13  KZKKE          Rural      Public       3D0        Standard        22
## 14  KZKKE          Rural      Public       3D0        Standard        22
## 15  KZKKE          Rural      Public       3D0        Standard        22
## 16  KZKKE          Rural      Public       3D0        Standard        22
## 17  KZKKE          Rural      Public       5JK        Standard        24
## 18  KZKKE          Rural      Public       5JK        Standard        24
## 19  KZKKE          Rural      Public       QTU        Standard        23
## 20  KZKKE          Rural      Public       QTU        Standard        23
## 21  VVTVA          Urban      Public       A93    Experimental        30
##    student_id gender                            lunch pretest posttest
## 1       11I5O   Male Qualifies for reduced/free lunch      31       39
## 2       JX5I4 Female Qualifies for reduced/free lunch      33       36
## 3       78IT6 Female Qualifies for reduced/free lunch      32       32
## 4       8BONX   Male Qualifies for reduced/free lunch      31       38
## 5       ESWTA   Male Qualifies for reduced/free lunch      32       38
## 6       ET5M9 Female Qualifies for reduced/free lunch      23       35
## 7       G7UZ5   Male Qualifies for reduced/free lunch      30       39
## 8       JR3ZP Female Qualifies for reduced/free lunch      30       39
## 9       LTIOX   Male Qualifies for reduced/free lunch      26       34
## 10      T7HV4   Male Qualifies for reduced/free lunch      27       39
## 11      EOZV5 Female Qualifies for reduced/free lunch      30       36
## 12      GFVOX Female Qualifies for reduced/free lunch      28       37
## 13      GONLU Female Qualifies for reduced/free lunch      29       39
## 14      LYT0H Female Qualifies for reduced/free lunch      32       39
## 15      RHOL5   Male Qualifies for reduced/free lunch      31       37
## 16      T1N79 Female Qualifies for reduced/free lunch      26       38
## 17      8Q0G9 Female Qualifies for reduced/free lunch      33       39
## 18      HQ18M Female Qualifies for reduced/free lunch      29       39
## 19      CRILM   Male Qualifies for reduced/free lunch      32       38
## 20      QM9VS   Male Qualifies for reduced/free lunch      26       37
## 21      HY8JN   Male Qualifies for reduced/free lunch      29       39