The following report is an analysis of the Performance data set, which covers the study habits, extracurricular activities, test scores, and overall academic evaluation of 10,000 students. This report breaks down and examines the data, and builds models to better predict a student's overall academic performance.
The table contained the following six attributes, which were renamed: Hours Studied (Study.Hrs), Previous Scores (Tests), Extracurricular Activities (Activities), Sleep Hours (Sleep.Hrs), Sample Question Papers Practiced (Practice.Test) and Performance Index (Academics). In addition to these renamings, a Score.Difference attribute was created to capture the difference between Tests and Academics. Lastly, the following categorical variables were created to bin the values (a sketch of the derivations follows the list).
- Study.Quality was derived from Study.Hrs, where "Low" = [1,3], "Med" = [4,6] and "High" = [7,9].
- Test.Grade was derived from Tests, where "A+" = [90,100], "A" = [80,89], "B" = [70,79], "C" = [60,69], "D" = [50,59] and "F" = [0,49]. This represents the typical letters assigned to grades.
- Sleep.Qual was derived from Sleep.Hrs, where "Low" = [4,5], "Med" = [6,7] and "High" = [8,9].
- Test.Prep was derived from Practice.Test, where "Low" = [0,2], "Med" = [3,6] and "High" = [7,9].
- Total.Letter was derived from Academics using the same cutoffs as Test.Grade. The imbalance in these bins is intentional, to focus on students who are academically inclined.
- Activities.1 was created as a binary encoding of Activities ("Yes" = 1, "No" = 0).
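A minimal sketch of these derivations, assuming the renamed data frame is called Performance (the data frame name is an assumption):

```r
# Score.Difference: gap between previous test scores and overall performance.
Performance$Score.Difference <- Performance$Tests - Performance$Academics

# Bin the integer-valued attributes into the categories listed above
# (cut() uses right-closed intervals, which matches the integer bins).
Performance$Study.Quality <- cut(Performance$Study.Hrs, breaks = c(0, 3, 6, 9),
                                 labels = c("Low", "Med", "High"))
Performance$Test.Grade <- cut(Performance$Tests,
                              breaks = c(-1, 49, 59, 69, 79, 89, 100),
                              labels = c("F", "D", "C", "B", "A", "A+"))
Performance$Sleep.Qual <- cut(Performance$Sleep.Hrs, breaks = c(3, 5, 7, 9),
                              labels = c("Low", "Med", "High"))
Performance$Test.Prep <- cut(Performance$Practice.Test, breaks = c(-1, 2, 6, 9),
                             labels = c("Low", "Med", "High"))
Performance$Total.Letter <- cut(Performance$Academics,
                                breaks = c(-1, 49, 59, 69, 79, 89, 100),
                                labels = c("F", "D", "C", "B", "A", "A+"))

# Binary encoding of Activities.
Performance$Activities.1 <- ifelse(Performance$Activities == "Yes", 1, 0)
```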
Study.Hrs | Tests | Activities | Sleep.Hrs | Practice.Test | Academics | Score.Difference | Study.Quality | Test.Grade | Sleep.Qual | Test.Prep | Total.Letter | Activities.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 99 | Yes | 9 | 1 | 91 | 8 | High | A+ | High | Low | A+ | 1 |
4 | 82 | No | 4 | 2 | 65 | 17 | Med | A | Low | Low | C | 0 |
8 | 51 | Yes | 7 | 2 | 45 | 6 | High | D | Med | Low | F | 1 |
5 | 52 | Yes | 5 | 2 | 36 | 16 | Med | D | Low | Low | F | 1 |
7 | 75 | No | 8 | 5 | 66 | 9 | High | B | High | Med | C | 0 |
The following are summary statistics for each of the numerical variables; box plots of each variable were also examined.
Statistic | Study.Hrs | Sleep.Hrs | Practice.Test | Tests | Academics | Score.Difference |
---|---|---|---|---|---|---|
Min. | 1.000 | 4.000 | 0.000 | 40.00 | 10.00 | -5.00 |
1st Qu. | 3.000 | 5.000 | 2.000 | 54.00 | 40.00 | 8.00 |
Median | 5.000 | 7.000 | 5.000 | 69.00 | 55.00 | 14.00 |
Mean | 4.993 | 6.531 | 4.583 | 69.45 | 55.22 | 14.22 |
3rd Qu. | 7.000 | 8.000 | 7.000 | 85.00 | 71.00 | 21.00 |
Max. | 9.000 | 9.000 | 9.000 | 99.00 | 100.00 | 32.00 |
There are no outliers in the data.
There seems to be a near-uniform distribution in all original attributes, with the exception of Academics, which appears normally distributed. Similarly, it is no surprise that Score.Difference resembles a normal distribution as well.
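The summary table and box plots might be produced along these lines (a minimal sketch, reusing the Performance data frame assumed above):

```r
# Summary statistics and box plots for the numeric columns (hypothetical sketch).
num.cols <- c("Study.Hrs", "Sleep.Hrs", "Practice.Test",
              "Tests", "Academics", "Score.Difference")
summary(Performance[, num.cols])

# One box plot per variable on a 2 x 3 grid.
par(mfrow = c(2, 3))
for (v in num.cols) boxplot(Performance[[v]], main = v)
```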
The following is an examination of each categorical variable with regard to the average Academics and Tests scores, as well as the impact of Tests scores on Academics.
For most of the categorical variables there is not a significant impact on the average test scores. However, there is some distinction among the categorical variables with respect to overall academic performance. How long an individual studies influences this score significantly, with a difference of 17 points between a high amount of studying and a low amount. Additionally, a student's average performance score is closely connected with how well they perform on tests.
The following is a correlation heat map showing the correlation between the numeric variables. It bolsters the earlier findings that test scores are significantly correlated with overall academic performance, and that there is some correlation between the amount of time spent studying and overall academic performance. Perhaps the most shocking aspect is the complete lack of correlation between any of the other variables, especially study hours and test scores; it is as if they were randomly assigned to each other.
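A hedged sketch of how such a heat map might be drawn with ggplot2, reusing Performance and num.cols from the earlier sketch:

```r
library(ggplot2)
library(reshape2)

# Pairwise correlations of the numeric columns, melted to long form for ggplot.
cor.mat <- cor(Performance[, num.cols])
ggplot(melt(cor.mat), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Correlation")
```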
The goal of the following linear regression is to determine an equation that will help predict Academics given the other variables. Part of the process required partitioning the data into training and validation sets, representing 60% and 40% of the data respectively.

### Training Data

The following includes the coefficients of the model, a summary of the residuals, and an overall evaluation of accuracy on the training data.
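A minimal sketch of the partition and fit; Performance.3 and train.rows.lm are the names that appear in the Call below, while the seed and the accuracy() call are assumptions:

```r
library(forecast)  # for accuracy()

set.seed(1)  # assumed; any seed reproducing the 60/40 split would do
train.rows.lm <- sample(nrow(Performance.3), 0.6 * nrow(Performance.3))

# Fit on the 60% training rows, using all remaining columns as predictors.
lm.model <- lm(Academics ~ ., data = Performance.3, subset = train.rows.lm)
summary(lm.model)

# Error metrics on the training rows (accuracy() labels its row "Test set").
accuracy(fitted(lm.model), Performance.3$Academics[train.rows.lm])
```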
##
## Call:
## lm(formula = Academics ~ ., data = Performance.3, subset = train.rows.lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6288 -1.3715 -0.0163 1.3607 8.7951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.172388 0.164270 -208.03 <2e-16 ***
## Study.Hrs 2.863300 0.010182 281.20 <2e-16 ***
## Tests 1.018000 0.001518 670.45 <2e-16 ***
## Sleep.Hrs 0.484104 0.015576 31.08 <2e-16 ***
## Practice.Test 0.200591 0.009269 21.64 <2e-16 ***
## Activities.1 0.626546 0.052721 11.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.041 on 5994 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9887
## F-statistic: 1.049e+05 on 5 and 5994 DF, p-value: < 2.2e-16
## ME RMSE MAE MPE MAPE
## Test set -8.710983e-17 2.0402 1.620976 -0.2328367 3.453975
### Findings

As can be seen from the coefficients, the variables with the largest impact on overall performance (Academics) are previous test scores and hours studied. Since Tests has a higher average and variance than Study.Hrs, it makes sense that its coefficient is so much smaller. A surprising find is the magnitude of the intercept: at about -34, its absolute value is a third of the highest Academics score and larger than some of the lowest ones.
The following is a sample of the validation data with its predictions, and a summary of the errors.
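A hedged sketch of how these figures might be produced, reusing lm.model and train.rows.lm from the training sketch:

```r
# Score the 40% hold-out and collect actuals, predictions, and residuals.
valid.df  <- Performance.3[-train.rows.lm, ]
Predicted <- predict(lm.model, newdata = valid.df)
lm.results <- data.frame(Validation.Academics = valid.df$Academics,
                         Predicted = Predicted,
                         residuals = valid.df$Academics - Predicted)
head(lm.results)
summary(lm.results)
accuracy(Predicted, valid.df$Academics)
```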
## Validation.Academics Predicted residuals
## 1 91 91.83682 -0.83681553
## 10 69 69.81926 -0.81925531
## 11 84 84.31141 -0.31141248
## 12 73 72.46184 0.53815746
## 13 27 27.02165 -0.02164551
## 17 67 68.34978 -1.34977753
## Validation.Academics Predicted residuals
## Min. : 11.00 Min. :11.95 Min. :-7.560921
## 1st Qu.: 40.00 1st Qu.:40.02 1st Qu.:-1.312237
## Median : 56.00 Median :55.54 Median :-0.006367
## Mean : 55.37 Mean :55.33 Mean : 0.035996
## 3rd Qu.: 71.00 3rd Qu.:70.79 3rd Qu.: 1.378849
## Max. :100.00 Max. :98.08 Max. : 7.563466
## ME RMSE MAE MPE MAPE
## Test set 0.03599606 2.034308 1.612487 -0.167627 3.455711
There was not much variation in the model's proficiency between the training data and the validation data. With a mean error of less than 0.1 and a mean absolute error of about 1.6, the magnitude of error is minimal. Further, the mean absolute percentage error is virtually the same as on the training data, at about 3.46%. This model is not overfitted to the provided data.
The model's fitness depends on the purpose and the need for accuracy. However, in most circumstances involving academic performance, a MAPE of about 3.5% is acceptable.
For Naive Bayes, the data was subset to the categorical variables, a sample of which is shown below.

Activities | Study.Quality | Test.Grade | Sleep.Qual | Test.Prep | Total.Letter |
---|---|---|---|---|---|
Yes | High | A+ | High | Low | A+ |
No | Med | A | Low | Low | C |
Yes | High | D | Med | Low | F |
Yes | Med | D | Low | Low | F |
No | High | B | High | Med | C |
The following is a breakdown of the training data, showing the a-priori probability of each Total.Letter class and, for each variable, the conditional probabilities given each class.
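A minimal sketch of the fit, assuming the 60% training partition of the categorical subset is named nb.train (a hypothetical name):

```r
library(e1071)

# Naive Bayes with Total.Letter as the class and the remaining categorical
# variables as predictors; printing the model shows the a-priori and
# conditional probability tables below.
nb.model <- naiveBayes(Total.Letter ~ ., data = nb.train)
nb.model
```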
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## A A+ B C D F
## 0.09066667 0.02616667 0.14483333 0.16283333 0.16583333 0.40966667
##
## Conditional probabilities:
## Activities
## Y No Yes
## A 0.4411765 0.5588235
## A+ 0.4713376 0.5286624
## B 0.4982739 0.5017261
## C 0.5148414 0.4851586
## D 0.5216080 0.4783920
## F 0.5093572 0.4906428
##
## Study.Quality
## Y High Low Med
## A 0.59558824 0.02573529 0.37867647
## A+ 0.93630573 0.00000000 0.06369427
## B 0.35212888 0.25546605 0.39240506
## C 0.31525077 0.35516888 0.32958035
## D 0.33065327 0.34773869 0.32160804
## F 0.22823434 0.44344996 0.32831570
##
## Test.Grade
## Y A A+ B C D F
## A 0.294117647 0.704044118 0.001838235 0.000000000 0.000000000 0.000000000
## A+ 0.012738854 0.987261146 0.000000000 0.000000000 0.000000000 0.000000000
## B 0.444188723 0.397008055 0.157652474 0.001150748 0.000000000 0.000000000
## C 0.352098260 0.110542477 0.401228250 0.136131013 0.000000000 0.000000000
## D 0.109547739 0.000000000 0.369849246 0.357788945 0.161809045 0.001005025
## F 0.000000000 0.000000000 0.038242474 0.209113100 0.340113914 0.412530513
##
## Sleep.Qual
## Y High Low Med
## A 0.3639706 0.2812500 0.3547794
## A+ 0.4394904 0.2738854 0.2866242
## B 0.3578826 0.3141542 0.3279632
## C 0.3613101 0.3193449 0.3193449
## D 0.3276382 0.3437186 0.3286432
## F 0.3242474 0.3344182 0.3413344
##
## Test.Prep
## Y High Low Med
## A 0.3235294 0.2610294 0.4154412
## A+ 0.3694268 0.2038217 0.4267516
## B 0.3164557 0.2589183 0.4246260
## C 0.3029683 0.2824974 0.4145343
## D 0.2793970 0.3045226 0.4160804
## F 0.2990236 0.2860049 0.4149715
The following is a sample of the training data with the actual class, the probabilities of belonging to each class, and Naive Bayes' predicted class, followed by a confusion matrix and corresponding metrics for the training data.
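A hedged sketch of how this output might be generated from nb.model:

```r
library(caret)

# Per-class probabilities and hard class predictions on the training data.
nb.prob <- predict(nb.model, nb.train, type = "raw")
nb.pred <- predict(nb.model, nb.train, type = "class")
head(data.frame(Actual = nb.train$Total.Letter, predicted = nb.pred, nb.prob))

# Confusion matrix and per-class metrics.
confusionMatrix(nb.pred, nb.train$Total.Letter)
```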
## Actual predicted A A+ B C
## 1 B B 1.926956e-01 2.417870e-04 0.3965860857 0.3049573497
## 2 B B 1.979301e-01 3.267214e-04 0.4197600678 0.2888050921
## 3 D F 1.586576e-03 5.651412e-04 0.0012711389 0.0011760565
## 4 F F 8.485336e-04 3.594799e-05 0.0013121826 0.1321322630
## 5 F F 3.989386e-05 3.990483e-07 0.0006374570 0.1255488140
## 6 F F 2.144645e-05 1.953818e-07 0.0004255591 0.0007622731
## D F
## 1 0.10314365 0.002375492
## 2 0.09095363 0.002224427
## 3 0.20704164 0.788359444
## 4 0.32296996 0.542701118
## 5 0.31713822 0.556635219
## 6 0.14457024 0.854220289
## Confusion Matrix and Statistics
##
## Reference
## Prediction A A+ B C D F
## A 311 152 92 0 0 0
## A+ 0 0 0 0 0 0
## B 232 5 622 231 2 0
## C 1 0 112 488 372 59
## D 0 0 43 190 192 39
## F 0 0 0 68 429 2360
##
## Overall Statistics
##
## Accuracy : 0.6622
## 95% CI : (0.65, 0.6741)
## No Information Rate : 0.4097
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5368
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity 0.57169 0.00000 0.7158 0.49949 0.19296 0.9601
## Specificity 0.95528 1.00000 0.9084 0.89170 0.94565 0.8597
## Pos Pred Value 0.56036 NaN 0.5696 0.47287 0.41379 0.8260
## Neg Pred Value 0.95721 0.97383 0.9497 0.90157 0.85495 0.9688
## Prevalence 0.09067 0.02617 0.1448 0.16283 0.16583 0.4097
## Detection Rate 0.05183 0.00000 0.1037 0.08133 0.03200 0.3933
## Detection Prevalence 0.09250 0.00000 0.1820 0.17200 0.07733 0.4762
## Balanced Accuracy 0.76348 0.50000 0.8121 0.69559 0.56931 0.9099
Naive Bayes was terrible at predicting A+ students. One reason is their scarcity: they account for only 2.6% of the training data. Another is that A+ sits at the end of the grade spectrum. Aside from the inability to predict A+, most predictions land either in the proper grade or one grade away, with only about 3% predicted more than one letter grade away from the actual grade.
The following is the confusion matrix for the validation data.
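A one-line sketch, assuming the 40% hold-out of the categorical subset is named nb.valid (hypothetical):

```r
# Validation confusion matrix for the Naive Bayes model.
confusionMatrix(predict(nb.model, nb.valid), nb.valid$Total.Letter)
```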
## Confusion Matrix and Statistics
##
## Reference
## Prediction A A+ B C D F
## A 202 108 63 0 0 0
## A+ 0 0 0 0 0 0
## B 161 5 424 134 0 0
## C 1 0 74 345 224 34
## D 1 0 44 123 121 22
## F 0 0 1 57 279 1577
##
## Overall Statistics
##
## Accuracy : 0.6672
## 95% CI : (0.6524, 0.6819)
## No Information Rate : 0.4082
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5433
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity 0.55342 0.00000 0.6997 0.52352 0.19391 0.9657
## Specificity 0.95296 1.00000 0.9116 0.90033 0.94372 0.8576
## Pos Pred Value 0.54155 NaN 0.5856 0.50885 0.38907 0.8239
## Neg Pred Value 0.95506 0.97175 0.9444 0.90548 0.86365 0.9732
## Prevalence 0.09125 0.02825 0.1515 0.16475 0.15600 0.4083
## Detection Rate 0.05050 0.00000 0.1060 0.08625 0.03025 0.3942
## Detection Prevalence 0.09325 0.00000 0.1810 0.16950 0.07775 0.4785
## Balanced Accuracy 0.75319 0.50000 0.8056 0.71192 0.56882 0.9117
Similar to the training data, Naive Bayes is terrible at finding the A+ group, and its overall accuracy is low. One reason for this is the number of predictor variables and how narrow a span most of them cover: there are a total of 324 potential combinations of the predictors (2 × 3 × 6 × 3 × 3 = 324), so the average number of entries per combination is over 30 across all the data and over 12 in the validation data. Despite the low accuracy and sensitivity rates, this model is good at predicting within one letter grade of the actual grade. If the categories had been larger, the accuracy would certainly have gone up; to demonstrate this, the sensitivity for F, which has the largest range, was 97%.
For KNN, the data was subset to the original attributes, with the exception of Academics, which was replaced by Total.Letter, as seen below. The data was partitioned and normalized as part of the preprocessing; a sketch of the tuning run follows the sample.
## Activities Study.Hrs Tests Sleep.Hrs Practice.Test Total.Letter
## 1 Yes 7 99 9 1 A+
## 2 No 4 82 4 2 C
## 3 Yes 8 51 7 2 F
## 4 Yes 5 52 5 2 F
## 5 No 7 75 8 5 C
## 6 No 3 78 9 6 C
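The tuning output below might be produced along these lines; knn.train is a hypothetical name for the 60% training partition, and the seed is an assumption:

```r
library(caret)

set.seed(1)  # assumed
knn.fit <- train(Total.Letter ~ ., data = knn.train, method = "knn",
                 preProcess = c("center", "scale"),           # normalize predictors
                 trControl  = trainControl(method = "repeatedcv",
                                           number = 10, repeats = 3),
                 tuneGrid   = data.frame(k = c(5, 7, 9)))     # candidate k values
knn.fit
```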
## k-Nearest Neighbors
##
## 6000 samples
## 5 predictor
## 6 classes: 'A', 'A+', 'B', 'C', 'D', 'F'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 5400, 5399, 5401, 5399, 5400, 5398, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7910677 0.7206426
## 7 0.7974462 0.7289311
## 9 0.7996113 0.7314587
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
This shows the optimal k = 9 with a cross-validated accuracy of about 80%; however, it is not far ahead of k = 7.
The following are the two confusion matrices, for the training and validation data respectively.
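A hedged sketch of the two matrices, with knn.valid as a hypothetical name for the 40% hold-out:

```r
# Training and validation confusion matrices for the tuned k = 9 model.
confusionMatrix(predict(knn.fit, newdata = knn.train), knn.train$Total.Letter)
confusionMatrix(predict(knn.fit, newdata = knn.valid), knn.valid$Total.Letter)
```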
## Confusion Matrix and Statistics
##
## Reference
## Prediction A A+ B C D F
## A 428 68 60 1 0 0
## A+ 13 96 0 0 0 0
## B 92 0 676 91 1 0
## C 0 0 124 804 115 1
## D 0 0 0 101 775 83
## F 0 0 0 0 110 2361
##
## Overall Statistics
##
## Accuracy : 0.8567
## 95% CI : (0.8475, 0.8654)
## No Information Rate : 0.4075
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8082
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity 0.80300 0.58537 0.7860 0.8064 0.7742 0.9656
## Specificity 0.97640 0.99777 0.9642 0.9520 0.9632 0.9691
## Pos Pred Value 0.76840 0.88073 0.7860 0.7701 0.8081 0.9555
## Neg Pred Value 0.98071 0.98846 0.9642 0.9611 0.9552 0.9762
## Prevalence 0.08883 0.02733 0.1433 0.1662 0.1668 0.4075
## Detection Rate 0.07133 0.01600 0.1127 0.1340 0.1292 0.3935
## Detection Prevalence 0.09283 0.01817 0.1433 0.1740 0.1598 0.4118
## Balanced Accuracy 0.88970 0.79157 0.8751 0.8792 0.8687 0.9674
## Confusion Matrix and Statistics
##
## Reference
## Prediction A A+ B C D F
## A 277 56 68 0 0 0
## A+ 13 50 0 0 0 0
## B 85 0 437 84 0 0
## C 1 0 109 479 98 0
## D 0 0 1 76 436 79
## F 0 0 0 0 84 1567
##
## Overall Statistics
##
## Accuracy : 0.8115
## 95% CI : (0.799, 0.8235)
## No Information Rate : 0.4115
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7476
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity 0.73670 0.47170 0.7106 0.7496 0.7055 0.9520
## Specificity 0.96578 0.99666 0.9501 0.9381 0.9539 0.9643
## Pos Pred Value 0.69077 0.79365 0.7211 0.6972 0.7365 0.9491
## Neg Pred Value 0.97249 0.98578 0.9476 0.9517 0.9466 0.9664
## Prevalence 0.09400 0.02650 0.1537 0.1598 0.1545 0.4115
## Detection Rate 0.06925 0.01250 0.1092 0.1197 0.1090 0.3917
## Detection Prevalence 0.10025 0.01575 0.1515 0.1718 0.1480 0.4128
## Balanced Accuracy 0.85124 0.73418 0.8303 0.8439 0.8297 0.9582
KNN is shown to have a higher accuracy than Naive Bayes. Also, unlike Naive Bayes, it makes accurate predictions of the "A+" class for both the training and validation data; it even risks some false positives, which Naive Bayes never did. There is a noticeable decline in average sensitivity between the two data sets: excluding the ends of the spectrum, "A+" and "F", the decline is about 7%. Just like Naive Bayes, the majority of predictions are within one letter grade of the actual grade; in the training set there are only three instances predicted more than one letter grade away, and the validation set has only two. Broader categories would improve the accuracy further. KNN appears to be the better classifier of the two.
The data has shown there is virtually no correlation between the amount of sleep an individual gets and how well they do on a test or their overall academic score. For an individual looking to do well academically, a good night's sleep has essentially no impact; whether to prioritize it depends on what the individual desires.
Similarly, taking practice tests will not improve an individual's performance on tests or on Academics; an individual should instead dedicate that time to studying notes.

### Studying

Studying is shown to positively impact a student's overall academic performance. However, there seems to be a lack of correlation between studying and test performance.
The best indication of high academic performance is doing well on tests. The average Academics score for someone with an A+ on tests is over 80.
Linear regression is shown to do well at predicting a student's Performance Index, and its MAPE is relatively small. An area of concern is that there may be several relevant factors not included in the data. For organizations looking to recruit the best of the best, it would be wise to use a KNN approach: unlike Naive Bayes, it successfully identified many "A+" students.