Introduction

The following report is an analysis of the Performance data set, which records the study habits, extracurricular activities, test scores, and overall academic evaluations of 10,000 students. The report breaks down and examines the data, and builds models to better predict a student's overall academic performance.

Preprocessing

The table contained the following six attributes, which were renamed: Hours Studied (Study.Hrs), Previous Scores (Tests), Extracurricular Activities (Activities), Sleep Hours (Sleep.Hrs), Sample Question Papers Practiced (Practice.Test) and Performance Index (Academics). In addition to these name changes, a new attribute, Score.Difference, was created to capture the difference between Tests and Academics. Lastly, the following categorical variables were created to bin the values.

Study.Quality was derived from Study.Hrs where “Low”=[1,3], “Med”=[4,6] and “High”=[7,9].

Test.Grade was derived from Tests where “A+”=[90,100], “A”=[80,89], “B”=[70,79], “C”=[60,69], “D”=[50,59] and “F”=[0,49]. These bins represent the letter grades typically assigned to these score ranges.

Sleep.Qual was derived from Sleep.Hrs where “Low”=[4,5], “Med”=[6,7] and “High”=[8,9].

Test.Prep was derived from Practice.Test where “Low”=[0,2], “Med”=[3,6] and “High”=[7,9].

Total.Letter was derived from Academics where “A+”=[90,100], “A”=[80,89], “B”=[70,79], “C”=[60,69], “D”=[50,59] and “F”=[0,49]. The resulting class imbalance is deliberate, as the focus is on identifying those who are academically inclined.

Activities.1 was created as a binary version of Activities (Yes = 1, No = 0). A sketch of how these derivations might be coded is shown below.
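The following is a minimal sketch of how the derived attributes could be constructed with cut() and ifelse(). The original preprocessing code is not shown, so the data frame name Performance and the exact code are assumptions; the breaks follow the intervals listed above.

```r
# Assumed sketch: derive the binned attributes from the renamed data frame.
# "Performance" is an assumed name; breaks follow the intervals listed above.
Performance$Score.Difference <- Performance$Tests - Performance$Academics

Performance$Study.Quality <- cut(Performance$Study.Hrs, breaks = c(0, 3, 6, 9),
                                 labels = c("Low", "Med", "High"))

grade.breaks <- c(-1, 49, 59, 69, 79, 89, 100)
grade.labels <- c("F", "D", "C", "B", "A", "A+")
Performance$Test.Grade   <- cut(Performance$Tests,     breaks = grade.breaks, labels = grade.labels)
Performance$Total.Letter <- cut(Performance$Academics, breaks = grade.breaks, labels = grade.labels)

Performance$Sleep.Qual <- cut(Performance$Sleep.Hrs, breaks = c(3, 5, 7, 9),
                              labels = c("Low", "Med", "High"))
Performance$Test.Prep  <- cut(Performance$Practice.Test, breaks = c(-1, 2, 6, 9),
                              labels = c("Low", "Med", "High"))

Performance$Activities.1 <- ifelse(Performance$Activities == "Yes", 1, 0)
```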

Data sample
| Study.Hrs | Tests | Activities | Sleep.Hrs | Practice.Test | Academics | Score.Difference | Study.Quality | Test.Grade | Sleep.Qual | Test.Prep | Total.Letter | Activities.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 99 | Yes | 9 | 1 | 91 | 8 | High | A+ | High | Low | A+ | 1 |
| 4 | 82 | No | 4 | 2 | 65 | 17 | Med | A | Low | Low | C | 0 |
| 8 | 51 | Yes | 7 | 2 | 45 | 6 | High | D | Med | Low | F | 1 |
| 5 | 52 | Yes | 5 | 2 | 36 | 16 | Med | D | Low | Low | F | 1 |
| 7 | 75 | No | 8 | 5 | 66 | 9 | High | B | High | Med | C | 0 |

Exploratory Analysis

Data Boxplots

The following are the box plots of each of the numerical variables.

Summary
Study.Hrs Sleep.Hrs Practice.Test Tests Academics Score.Difference
Min. :1.000 Min. :4.000 Min. :0.000 Min. :40.00 Min. : 10.00 Min. :-5.00
1st Qu.:3.000 1st Qu.:5.000 1st Qu.:2.000 1st Qu.:54.00 1st Qu.: 40.00 1st Qu.: 8.00
Median :5.000 Median :7.000 Median :5.000 Median :69.00 Median : 55.00 Median :14.00
Mean :4.993 Mean :6.531 Mean :4.583 Mean :69.45 Mean : 55.22 Mean :14.22
3rd Qu.:7.000 3rd Qu.:8.000 3rd Qu.:7.000 3rd Qu.:85.00 3rd Qu.: 71.00 3rd Qu.:21.00
Max. :9.000 Max. :9.000 Max. :9.000 Max. :99.00 Max. :100.00 Max. :32.00

Findings

There are no outliers in the data.

Histograms

Findings

All of the original attributes appear to be nearly uniformly distributed, with the exception of Academics, which appears to follow a normal distribution. It is therefore no surprise that Score.Difference resembles a normal distribution as well.

Categorical Attributes Impact on Academics and Tests

The following is an examination of each categorical variable with respect to the average Academics and Tests scores, along with the impact of Tests scores on Academics.

Findings

For most of the categorical variables there is not a significant impact on average test scores. However, there is some distinction between the categorical variables and overall academic performance. How long an individual studies influences this score significantly, with a difference of 17 points between a high amount of studying and a low amount. Additionally, a student's average performance score is closely connected with how well they perform on tests.

Correlation

The following is a correlation heat map showing the correlation between the numeric variables. It bolsters the earlier findings that test scores are strongly correlated with overall academic performance. There is also some correlation between the amount of time spent studying and overall academic performance. Perhaps the most surprising aspect is the complete lack of correlation between any of the other variables, especially study hours and test scores; it is as if the values were randomly assigned to each other.
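For reference, a minimal sketch of how such a heat map might be generated. The plotting code is not shown in the report, so the use of the corrplot package and the column selection are assumptions.

```r
# Assumed sketch: correlation heat map of the numeric attributes.
library(corrplot)  # assumed plotting package; any heat map tool would work

numeric.cols <- c("Study.Hrs", "Sleep.Hrs", "Practice.Test",
                  "Tests", "Academics", "Score.Difference")
M <- cor(Performance[, numeric.cols])  # pairwise correlations

corrplot(M, method = "color", addCoef.col = "black")  # heat map with coefficients
```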

Predictive Analysis

Linear Regression

The goal of the following linear regression is to determine an equation that will help predict Academics given the other variables. Part of the process required partitioning the data into training and validation sets, representing 60% and 40% of the data respectively.

Training Data

The following includes the coefficients of the model, a summary of the residuals, and an overall evaluation of accuracy on the training data.
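For context, here is a minimal sketch of how the split and fit might be coded. The seed and split code are assumptions; the names Performance.3 and train.rows.lm come from the model call in the output.

```r
# Assumed sketch: 60/40 training/validation split and model fit.
set.seed(1)  # assumed seed for reproducibility
train.rows.lm <- sample(nrow(Performance.3), 0.6 * nrow(Performance.3))

# Regress Academics on all other attributes using only the training rows.
lm.fit <- lm(Academics ~ ., data = Performance.3, subset = train.rows.lm)
summary(lm.fit)
```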

## 
## Call:
## lm(formula = Academics ~ ., data = Performance.3, subset = train.rows.lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6288 -1.3715 -0.0163  1.3607  8.7951 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -34.172388   0.164270 -208.03   <2e-16 ***
## Study.Hrs       2.863300   0.010182  281.20   <2e-16 ***
## Tests           1.018000   0.001518  670.45   <2e-16 ***
## Sleep.Hrs       0.484104   0.015576   31.08   <2e-16 ***
## Practice.Test   0.200591   0.009269   21.64   <2e-16 ***
## Activities.1    0.626546   0.052721   11.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.041 on 5994 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9887 
## F-statistic: 1.049e+05 on 5 and 5994 DF,  p-value: < 2.2e-16
##                     ME   RMSE      MAE        MPE     MAPE
## Test set -8.710983e-17 2.0402 1.620976 -0.2328367 3.453975

Findings

As can be seen from the coefficients, the variables with the largest impact on overall performance (Academics) are previous test scores and hours studied. Because Tests has a higher average value and greater variance than Study.Hrs, it makes sense that its coefficient is much smaller. A surprising find is the magnitude of the intercept: at roughly -34, its absolute value is about a third of the highest Academics score and larger than some of the lowest ones.

Validation Data

The following is a sample of the validation data with the model's predictions and residuals, along with a summary of the errors.
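A minimal sketch of how these predictions and error metrics might be produced; the object names are assumptions, while accuracy() from the forecast package matches the ME/RMSE/MAE/MPE/MAPE table shown.

```r
# Assumed sketch: score the held-out 40% and summarize the errors.
library(forecast)  # accuracy() yields the ME/RMSE/MAE/MPE/MAPE table

valid.df <- Performance.3[-train.rows.lm, ]
Predicted <- predict(lm.fit, newdata = valid.df)

results <- data.frame(Validation.Academics = valid.df$Academics,
                      Predicted = Predicted,
                      residuals = valid.df$Academics - Predicted)
head(results)     # sample of actuals, predictions, and residuals
summary(results)  # five-number summaries shown below
accuracy(Predicted, valid.df$Academics)
```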

##    Validation.Academics Predicted   residuals
## 1                    91  91.83682 -0.83681553
## 10                   69  69.81926 -0.81925531
## 11                   84  84.31141 -0.31141248
## 12                   73  72.46184  0.53815746
## 13                   27  27.02165 -0.02164551
## 17                   67  68.34978 -1.34977753
##  Validation.Academics   Predicted       residuals        
##  Min.   : 11.00       Min.   :11.95   Min.   :-7.560921  
##  1st Qu.: 40.00       1st Qu.:40.02   1st Qu.:-1.312237  
##  Median : 56.00       Median :55.54   Median :-0.006367  
##  Mean   : 55.37       Mean   :55.33   Mean   : 0.035996  
##  3rd Qu.: 71.00       3rd Qu.:70.79   3rd Qu.: 1.378849  
##  Max.   :100.00       Max.   :98.08   Max.   : 7.563466
##                  ME     RMSE      MAE       MPE     MAPE
## Test set 0.03599606 2.034308 1.612487 -0.167627 3.455711

Findings

There was not much variation in the model's proficiency between the training data and the validation data. With a mean error of less than 0.1 and a mean absolute error of about 1.6, it does well; the magnitude of error is minimal. Further, the mean absolute percentage error is virtually the same as on the training data, at about 3.46%. This model is not overfitted to the provided data.
The model's fitness is subjective to the purpose and the need for accuracy. However, in most circumstances involving academic performance a MAPE of about 3.5% is acceptable.

Naive Bayes

As mentioned above, categorical attributes were created from each of the numerical attributes; the data table used for Naive Bayes is shown below. As with the linear regression, the data was partitioned to evaluate the model's effectiveness. This data may be well suited to Naive Bayes, which assumes independence between predictors, given the lack of correlation among them here. A sketch of the fitting code follows the sample.
Naive Bayes Data Sample
| Activities | Study.Quality | Test.Grade | Sleep.Qual | Test.Prep | Total.Letter |
|---|---|---|---|---|---|
| Yes | High | A+ | High | Low | A+ |
| No | Med | A | Low | Low | C |
| Yes | High | D | Med | Low | F |
| Yes | Med | D | Low | Low | F |
| No | High | B | High | Med | C |
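A minimal sketch of the fit; e1071's naiveBayes() matches the call shown in the output below, while the data frame and split names are assumptions.

```r
# Assumed sketch: train Naive Bayes on the categorical attributes.
library(e1071)  # provides naiveBayes(), matching the call in the output

set.seed(1)  # assumed seed
# Performance.nb is an assumed name for the table of categorical attributes.
train.rows.nb <- sample(nrow(Performance.nb), 0.6 * nrow(Performance.nb))

nb.fit <- naiveBayes(Total.Letter ~ ., data = Performance.nb[train.rows.nb, ])
nb.fit  # prints the a-priori and conditional probabilities shown below
```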

Training Data

The following is a breakdown of the training data: the a-priori probability of each Total.Letter class, and the conditional probability of each predictor value given the class.

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##          A         A+          B          C          D          F 
## 0.09066667 0.02616667 0.14483333 0.16283333 0.16583333 0.40966667 
## 
## Conditional probabilities:
##     Activities
## Y           No       Yes
##   A  0.4411765 0.5588235
##   A+ 0.4713376 0.5286624
##   B  0.4982739 0.5017261
##   C  0.5148414 0.4851586
##   D  0.5216080 0.4783920
##   F  0.5093572 0.4906428
## 
##     Study.Quality
## Y          High        Low        Med
##   A  0.59558824 0.02573529 0.37867647
##   A+ 0.93630573 0.00000000 0.06369427
##   B  0.35212888 0.25546605 0.39240506
##   C  0.31525077 0.35516888 0.32958035
##   D  0.33065327 0.34773869 0.32160804
##   F  0.22823434 0.44344996 0.32831570
## 
##     Test.Grade
## Y              A          A+           B           C           D           F
##   A  0.294117647 0.704044118 0.001838235 0.000000000 0.000000000 0.000000000
##   A+ 0.012738854 0.987261146 0.000000000 0.000000000 0.000000000 0.000000000
##   B  0.444188723 0.397008055 0.157652474 0.001150748 0.000000000 0.000000000
##   C  0.352098260 0.110542477 0.401228250 0.136131013 0.000000000 0.000000000
##   D  0.109547739 0.000000000 0.369849246 0.357788945 0.161809045 0.001005025
##   F  0.000000000 0.000000000 0.038242474 0.209113100 0.340113914 0.412530513
## 
##     Sleep.Qual
## Y         High       Low       Med
##   A  0.3639706 0.2812500 0.3547794
##   A+ 0.4394904 0.2738854 0.2866242
##   B  0.3578826 0.3141542 0.3279632
##   C  0.3613101 0.3193449 0.3193449
##   D  0.3276382 0.3437186 0.3286432
##   F  0.3242474 0.3344182 0.3413344
## 
##     Test.Prep
## Y         High       Low       Med
##   A  0.3235294 0.2610294 0.4154412
##   A+ 0.3694268 0.2038217 0.4267516
##   B  0.3164557 0.2589183 0.4246260
##   C  0.3029683 0.2824974 0.4145343
##   D  0.2793970 0.3045226 0.4160804
##   F  0.2990236 0.2860049 0.4149715

The following is a sample of the training data with the actual class, the probability of belonging to each class, and Naive Bayes' predicted class. There is also a confusion matrix and corresponding metrics for the training data.
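A minimal sketch of how these tables might be produced; confusionMatrix() from caret generates the "Confusion Matrix and Statistics" output, and the object names are assumptions.

```r
# Assumed sketch: per-class probabilities, hard predictions, confusion matrix.
library(caret)  # confusionMatrix() produces the output shown below

train.nb <- Performance.nb[train.rows.nb, ]
probs   <- predict(nb.fit, train.nb, type = "raw")    # class probabilities
classes <- predict(nb.fit, train.nb, type = "class")  # predicted classes

head(data.frame(Actual = train.nb$Total.Letter, predicted = classes, probs))
confusionMatrix(classes, train.nb$Total.Letter)
```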

##   Actual predicted            A           A+            B            C
## 1      B         B 1.926956e-01 2.417870e-04 0.3965860857 0.3049573497
## 2      B         B 1.979301e-01 3.267214e-04 0.4197600678 0.2888050921
## 3      D         F 1.586576e-03 5.651412e-04 0.0012711389 0.0011760565
## 4      F         F 8.485336e-04 3.594799e-05 0.0013121826 0.1321322630
## 5      F         F 3.989386e-05 3.990483e-07 0.0006374570 0.1255488140
## 6      F         F 2.144645e-05 1.953818e-07 0.0004255591 0.0007622731
##            D           F
## 1 0.10314365 0.002375492
## 2 0.09095363 0.002224427
## 3 0.20704164 0.788359444
## 4 0.32296996 0.542701118
## 5 0.31713822 0.556635219
## 6 0.14457024 0.854220289
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A   A+    B    C    D    F
##         A   311  152   92    0    0    0
##         A+    0    0    0    0    0    0
##         B   232    5  622  231    2    0
##         C     1    0  112  488  372   59
##         D     0    0   43  190  192   39
##         F     0    0    0   68  429 2360
## 
## Overall Statistics
##                                         
##                Accuracy : 0.6622        
##                  95% CI : (0.65, 0.6741)
##     No Information Rate : 0.4097        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.5368        
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity           0.57169   0.00000   0.7158  0.49949  0.19296   0.9601
## Specificity           0.95528   1.00000   0.9084  0.89170  0.94565   0.8597
## Pos Pred Value        0.56036       NaN   0.5696  0.47287  0.41379   0.8260
## Neg Pred Value        0.95721   0.97383   0.9497  0.90157  0.85495   0.9688
## Prevalence            0.09067   0.02617   0.1448  0.16283  0.16583   0.4097
## Detection Rate        0.05183   0.00000   0.1037  0.08133  0.03200   0.3933
## Detection Prevalence  0.09250   0.00000   0.1820  0.17200  0.07733   0.4762
## Balanced Accuracy     0.76348   0.50000   0.8121  0.69559  0.56931   0.9099

Findings

Naive Bayes was terrible at predicting A+ students. One reason is their scarcity: they account for only 2.6% of the training data. Another is that the class sits at the very end of the grade spectrum. Aside from the inability to predict A+, most predictions land in either the proper grade or one grade away, with only about 3% predicted more than one letter grade from the actual grade, as the sketch below demonstrates.
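As a check on that figure, a small sketch of how the more-than-one-grade rate might be computed from the confusion matrix; the grade ordering and object names are assumptions.

```r
# Assumed sketch: share of predictions more than one letter grade off.
grade.order <- c("F", "D", "C", "B", "A", "A+")  # worst to best

cm <- confusionMatrix(classes, train.nb$Total.Letter)$table
cm <- cm[grade.order, grade.order]  # sort rows and columns by grade

off <- abs(row(cm) - col(cm))  # grade distance for each cell
sum(cm[off > 1]) / sum(cm)     # roughly 0.03 on the training data
```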

Validation Data

The following is the confusion matrix for the validation data.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A   A+    B    C    D    F
##         A   202  108   63    0    0    0
##         A+    0    0    0    0    0    0
##         B   161    5  424  134    0    0
##         C     1    0   74  345  224   34
##         D     1    0   44  123  121   22
##         F     0    0    1   57  279 1577
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6672          
##                  95% CI : (0.6524, 0.6819)
##     No Information Rate : 0.4082          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5433          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity           0.55342   0.00000   0.6997  0.52352  0.19391   0.9657
## Specificity           0.95296   1.00000   0.9116  0.90033  0.94372   0.8576
## Pos Pred Value        0.54155       NaN   0.5856  0.50885  0.38907   0.8239
## Neg Pred Value        0.95506   0.97175   0.9444  0.90548  0.86365   0.9732
## Prevalence            0.09125   0.02825   0.1515  0.16475  0.15600   0.4083
## Detection Rate        0.05050   0.00000   0.1060  0.08625  0.03025   0.3942
## Detection Prevalence  0.09325   0.00000   0.1810  0.16950  0.07775   0.4785
## Balanced Accuracy     0.75319   0.50000   0.8056  0.71192  0.56882   0.9117

Findings

Similar to the training data, Naive Bayes is terrible at finding the A+ group, and its overall accuracy is low. One reason is the small number of predictor variables and how narrow a range most of them span: there are only 324 (2 × 3 × 6 × 3 × 3) potential combinations of the predictor values, so the average number of entries per combination is over 30 across all the data and over 12 in the validation data. Despite the low accuracy and sensitivity rates, this model is good at predicting within one letter grade of the actual grade. If the categories had been broader, the accuracy would certainly have gone up; to demonstrate this, the sensitivity for F, which covers the largest score range, was 97%.

K Nearest Neighbours

For KNN, the data was subset to the original attributes, with the exception of Academics, which was replaced by Total.Letter, as seen below. The data was partitioned and normalized as part of the preprocessing; a sketch of the setup is shown below.
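A minimal sketch of how the caret tuning below might be run; the preprocessing, resampling scheme, and candidate k values are taken from the output, while the data frame and split names are assumptions.

```r
# Assumed sketch: KNN with centering/scaling and repeated 10-fold CV.
library(caret)

set.seed(1)  # assumed seed
# Performance.knn is an assumed name for the subset described above.
train.rows.knn <- sample(nrow(Performance.knn), 0.6 * nrow(Performance.knn))

knn.fit <- train(Total.Letter ~ ., data = Performance.knn[train.rows.knn, ],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10, repeats = 3),
                 tuneGrid = expand.grid(k = c(5, 7, 9)))
knn.fit
```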

##   Activities Study.Hrs Tests Sleep.Hrs Practice.Test Total.Letter
## 1        Yes         7    99         9             1           A+
## 2         No         4    82         4             2            C
## 3        Yes         8    51         7             2            F
## 4        Yes         5    52         5             2            F
## 5         No         7    75         8             5            C
## 6         No         3    78         9             6            C
## k-Nearest Neighbors 
## 
## 6000 samples
##    5 predictor
##    6 classes: 'A', 'A+', 'B', 'C', 'D', 'F' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 5400, 5399, 5401, 5399, 5400, 5398, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7910677  0.7206426
##   7  0.7974462  0.7289311
##   9  0.7996113  0.7314587
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

This shows that the optimal k = 9 achieves an accuracy of about 80%; however, it is only marginally ahead of k = 7.

The following are the two confusion matrices, for the training and validation data respectively.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A   A+    B    C    D    F
##         A   428   68   60    1    0    0
##         A+   13   96    0    0    0    0
##         B    92    0  676   91    1    0
##         C     0    0  124  804  115    1
##         D     0    0    0  101  775   83
##         F     0    0    0    0  110 2361
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8567          
##                  95% CI : (0.8475, 0.8654)
##     No Information Rate : 0.4075          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8082          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity           0.80300   0.58537   0.7860   0.8064   0.7742   0.9656
## Specificity           0.97640   0.99777   0.9642   0.9520   0.9632   0.9691
## Pos Pred Value        0.76840   0.88073   0.7860   0.7701   0.8081   0.9555
## Neg Pred Value        0.98071   0.98846   0.9642   0.9611   0.9552   0.9762
## Prevalence            0.08883   0.02733   0.1433   0.1662   0.1668   0.4075
## Detection Rate        0.07133   0.01600   0.1127   0.1340   0.1292   0.3935
## Detection Prevalence  0.09283   0.01817   0.1433   0.1740   0.1598   0.4118
## Balanced Accuracy     0.88970   0.79157   0.8751   0.8792   0.8687   0.9674
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A   A+    B    C    D    F
##         A   277   56   68    0    0    0
##         A+   13   50    0    0    0    0
##         B    85    0  437   84    0    0
##         C     1    0  109  479   98    0
##         D     0    0    1   76  436   79
##         F     0    0    0    0   84 1567
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8115         
##                  95% CI : (0.799, 0.8235)
##     No Information Rate : 0.4115         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7476         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: A+ Class: B Class: C Class: D Class: F
## Sensitivity           0.73670   0.47170   0.7106   0.7496   0.7055   0.9520
## Specificity           0.96578   0.99666   0.9501   0.9381   0.9539   0.9643
## Pos Pred Value        0.69077   0.79365   0.7211   0.6972   0.7365   0.9491
## Neg Pred Value        0.97249   0.98578   0.9476   0.9517   0.9466   0.9664
## Prevalence            0.09400   0.02650   0.1537   0.1598   0.1545   0.4115
## Detection Rate        0.06925   0.01250   0.1092   0.1197   0.1090   0.3917
## Detection Prevalence  0.10025   0.01575   0.1515   0.1718   0.1480   0.4128
## Balanced Accuracy     0.85124   0.73418   0.8303   0.8439   0.8297   0.9582

Findings

KNN is shown to have a higher accuracy than Naive Bayes. Also, unlike Naive Bayes, it makes accurate predictions of the “A+” class for both the training and validation data, albeit with some false positives. There is a noticeable decline in average sensitivity between the two data sets: excluding the ends of the spectrum, “A+” and “F”, the decline is about 7%. Just like Naive Bayes, the majority of predictions are within one letter grade of the actual grade; the training set has only three instances predicted more than one letter grade away, and the validation set only two. Broader categories would improve the accuracy further. Overall, KNN appears to be better at classifying than Naive Bayes.

Summary of Findings and Conclusions

Sleep

The data has shown there is virtually no correlation between the amount of sleep an individual gets and how well they do on tests or their overall academic score. For an individual looking to do well academically, a good night's sleep has no measurable impact; whether to prioritize it depends on what the individual desires.

Practice Tests

Similarly, taking practice tests will not improve an individual's performance on tests or Academics. An individual should instead dedicate that time toward studying notes.

Studying

Studying is shown to positively impact a student's overall academic performance. However, there appears to be no correlation between studying and test performance.

Test Performance

The best indicator of high academic performance is doing well on tests. The average Academics score for someone with an A+ on tests is over 80.

Predicting Performance Index

Linear regression is shown to do well at predicting a student's Performance Index, with a relatively small MAPE. One area of concern is that several relevant factors may not be included in the data. For organizations looking to recruit the best of the best, it would be wise to use a KNN approach: unlike Naive Bayes, it successfully identified many “A+” students.