NBA Data

The first data set studied here is NBA data. The data was collected from BasketballReference.com and contains NBA team performance statistics along with other miscellaneous information, such as wins, arena attendance, and playoff appearances. To gather enough complete data, the past three complete seasons were taken into account: 2017-18, 2018-19, and 2019-20. The overarching focus is whether team stats are good indicators of being a playoff team. Varying levels of competition each year, “stat-padding”, and the change in style of play in the NBA can all change the impact of statistics in the modern game. Teams are scoring more points than ever, but they are also giving up more points than ever. Therefore, what makes good teams good is not stuffing the stat sheet, but rather competing well relative to other teams. However, this might not always be the case, as some NBA conferences are weaker than others in certain years, making playoff runs easier for some teams than others. This seems like a tricky situation, but one that can be addressed using data science.

Question

Given the data and with the assignment in mind, the question should focus on classification. Therefore, the question proposed for this data set is: are NBA team performance statistics good measurements for classifying whether a team makes the NBA playoffs or not?

Pre-kNN Analysis

After cleaning and preparing the data from the past three seasons, we are left with 90 observations: 30 teams times 3 seasons. Furthermore, we have 36 columns that were determined to be relevant to the question. With 16 teams making the playoffs each year, we can expect to see 48 occurrences of playoff teams and 42 occurrences of non-playoff teams.

## 
##  0  1 
## 42 48

As shown in the table above, our expectation is confirmed. In the table, 0 indicates not making the playoffs, while 1 indicates making the playoffs.


Another part of the kNN process is scaling the data and checking for correlations. After running correlations on all of the numeric variables, several highly correlated variables were found and removed. In addition, attempt variables, such as field goals attempted, were removed, as the percentage of makes and the number of makes were deemed more important. To prepare for kNN, training and test sets must be created. Here, 80% of the rows will be used for training and 20% for testing.

##   TrainingRows TestRows
## 1           73       17

The table above shows the number of rows used for training and testing the data. As we can see, roughly 80% went to training and 20% to testing.
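A minimal sketch of these preparation steps is shown below. The data frame name nba_stats, the label column Playoffs, and the 0.9 correlation cutoff are all assumptions for illustration; this is not the exact code used above.

```r
# Hypothetical sketch of the preparation steps described above.
set.seed(42)

# keep only the numeric team statistics and scale them for kNN
nba_num    <- nba_stats[sapply(nba_stats, is.numeric)]
nba_scaled <- as.data.frame(scale(nba_num))

# flag strongly correlated predictor pairs so one of each pair can be dropped
cor_mat  <- cor(nba_scaled)
high_cor <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)

# 80/20 train/test split
train_idx <- sample(nrow(nba_scaled), floor(0.8 * nrow(nba_scaled)))
train   <- nba_scaled[train_idx, ]
test    <- nba_scaled[-train_idx, ]
train_y <- nba_stats$Playoffs[train_idx]
test_y  <- nba_stats$Playoffs[-train_idx]
```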

kNN Analysis and Evaluations

After creating the kNN model, the confusion matrix can be used to determine different evaluation metrics for the model.
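A sketch of how such a model and confusion matrix might be produced is shown below, assuming the class and caret packages and the hypothetical objects train, test, train_y, and test_y from the split above; the choice of k is only illustrative.

```r
library(class)
library(caret)

# fit kNN on the scaled training data and predict the test rows
knn_pred <- knn(train = train, test = test, cl = factor(train_y),
                k = 5, prob = TRUE)

# confusion matrix with the playoff class (1) treated as the positive class
confusionMatrix(data      = knn_pred,
                reference = factor(test_y),
                positive  = "1",
                mode      = "everything")  # also reports Precision, Recall, F1
```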

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction 0 1
##          0 6 1
##          1 2 8
##                                          
##                Accuracy : 0.8235         
##                  95% CI : (0.5657, 0.962)
##     No Information Rate : 0.5294         
##     P-Value [Acc > NIR] : 0.01212        
##                                          
##                   Kappa : 0.6434         
##                                          
##  Mcnemar's Test P-Value : 1.00000        
##                                          
##             Sensitivity : 0.8889         
##             Specificity : 0.7500         
##          Pos Pred Value : 0.8000         
##          Neg Pred Value : 0.8571         
##               Precision : 0.8000         
##                  Recall : 0.8889         
##                      F1 : 0.8421         
##              Prevalence : 0.5294         
##          Detection Rate : 0.4706         
##    Detection Prevalence : 0.5882         
##       Balanced Accuracy : 0.8194         
##                                          
##        'Positive' Class : 1              
## 

With the information in the confusion matrix above, we can evaluate our model. The accuracy of the kNN model is 0.8235, or 82.35%, which is good. The true positive rate, the rate at which the model correctly classifies playoff teams, is equal to the sensitivity: 0.8889, or 88.89%. The false positive rate, the rate at which the model incorrectly classifies non-playoff teams as playoff teams, is equal to one minus the specificity: 0.25, or 25% in this case. Our TPR is good. Our FPR is alright, but could surely be better. Also seen is the Kappa value of 0.6434, which represents substantial agreement between the predicted and actual classes.
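These rates can be re-derived directly from the confusion matrix counts above as a quick sanity check (positive class = 1, making the playoffs).

```r
# counts taken from the confusion matrix above
TP <- 8; FN <- 1; FP <- 2; TN <- 6

sensitivity <- TP / (TP + FN)                   # 8/9  ~= 0.8889 (TPR)
specificity <- TN / (TN + FP)                   # 6/8   = 0.7500
fpr         <- 1 - specificity                  #         0.2500
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 14/17 ~= 0.8235
```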


More metrics to look at are the ROC curve and the AUC (area under the curve). The ROC curve graphically displays the trade-off between the true positive rate and the false positive rate as the classification threshold is adjusted, and the goal is for the area under the ROC curve to be as large as possible.

## [[1]]
## [1] 0.9652778

As we see in the graph, the ROC curve takes up a large portion of the upper region of the graph. This is confirmed by the AUC value shown above, 0.965, which is considered excellent in terms of evaluation metrics. This means the area under the curve is 96.5% of the total area of the graph.
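One common way to produce this curve and value is sketched below, assuming the pROC package and that knn_prob is a hypothetical vector holding the model's predicted probability of class 1 (making the playoffs) for each test row.

```r
library(pROC)

roc_obj <- roc(response = test_y, predictor = knn_prob)
plot(roc_obj)   # the ROC curve discussed above
auc(roc_obj)    # area under the curve, reported above as ~0.965
```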


Two more metrics we can look at are the log loss value, and the F1 score.

## [1] 0.0767358
## [1] 0.8

The first value seen above is our log loss, which is 0.077. Log loss measures the uncertainty of the predicted classification probabilities, so the goal is to have this value as close to 0 as possible. The value of 0.077 is very good in this situation. The second value, our F1 score, is 0.8. The F1 score is another measure of accuracy, on a scale from 0 to 1, so our value of 0.8 is good.
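A hedged sketch of how these two metrics are commonly computed is shown below; test_y (the 0/1 actuals) and knn_prob (the predicted probability of class 1) are assumed objects, and the 0.5 cutoff is only illustrative.

```r
# binary log loss: clamp probabilities so log() stays finite
eps <- 1e-15
p   <- pmin(pmax(knn_prob, eps), 1 - eps)
log_loss <- -mean(test_y * log(p) + (1 - test_y) * log(1 - p))

# F1 from the hard 0/1 predictions at a 0.5 threshold
knn_class <- as.integer(knn_prob > 0.5)
precision <- sum(knn_class == 1 & test_y == 1) / sum(knn_class == 1)
recall    <- sum(knn_class == 1 & test_y == 1) / sum(test_y == 1)
f1        <- 2 * precision * recall / (precision + recall)
```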


Next, we can look at the confusion matrix table to identify misclassification patterns in the model.

##           Actual
## Prediction 0 1
##          0 6 1
##          1 2 8

From the prediction table of the confusion matrix, we can get a glimpse into how the kNN model classified observations versus their actual values. As seen, there are two misclassifications where the model predicted a team to make the playoffs and they didn’t, and one where it predicted a team to miss the playoffs and they made it. My best guess for these misclassifications would be the varying level of competition from year to year. In some seasons, such as 2017-18, teams with 46 wins out of 82 games didn’t make the playoffs, while in 2019-20, teams with 33 wins were able to make the playoffs. This variability in the level of performance needed to make the playoffs might throw off the classifications in the model. To attempt to improve the model, we can adjust the threshold value to 0.3.
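A minimal sketch of re-classifying the test rows at the lower 0.3 cutoff is shown below; knn_prob (the predicted probability of making the playoffs) is an assumed object, not the exact code used here.

```r
library(caret)

pred_adj <- factor(ifelse(knn_prob > 0.3, 1, 0), levels = c(0, 1))
confusionMatrix(pred_adj,
                factor(test_y, levels = c(0, 1)),
                positive = "1",
                mode = "everything")
```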

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction 0 1
##          0 5 0
##          1 3 9
##                                          
##                Accuracy : 0.8235         
##                  95% CI : (0.5657, 0.962)
##     No Information Rate : 0.5294         
##     P-Value [Acc > NIR] : 0.01212        
##                                          
##                   Kappa : 0.6383         
##                                          
##  Mcnemar's Test P-Value : 0.24821        
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.6250         
##          Pos Pred Value : 0.7500         
##          Neg Pred Value : 1.0000         
##               Precision : 0.7500         
##                  Recall : 1.0000         
##                      F1 : 0.8571         
##              Prevalence : 0.5294         
##          Detection Rate : 0.5294         
##    Detection Prevalence : 0.7059         
##       Balanced Accuracy : 0.8125         
##                                          
##        'Positive' Class : 1              
## 

As seen in the new confusion matrix, adjusting the threshold did very little. The TPR increased, but the FPR also increased. The accuracy did not change at all, and the Kappa changed only very slightly. Overall, adjusting the threshold did not lead to any real improvements.

Recommendations

Here, we were able to build a fairly good kNN model for classifying whether NBA teams make the playoffs. With an accuracy of 82.35% and good values on the other evaluation metrics, the model was able to do its job fairly well, but we can’t say it is perfect. Obviously, all of the metrics discussed in this analysis could be improved upon with better data and better modeling techniques. The biggest thing I would recommend to improve this model would be to simply gather more data. If more NBA seasons were included, the data might be more representative, which could help the model’s accuracy. However, the evolution of the NBA and the shift in play style over a short recent period might be a drawback to studying many seasons. Collecting more variables could also help. Something like which conference or division each team belongs to could have an impact on whether or not they make the playoffs, due to the varying levels of competition discussed earlier. When deploying this model, one must be aware of its imperfections. Our accuracy is only 82.35%, which leaves a lot of room for improvement. Also, the limited data, only 90 observations, can influence whether the model is considered reliable. Therefore, when using this model, one must take into account its nature, as well as the different metrics that have room for improvement.

Heart Data

The next data set to be worked on is a heart disease data set collected from Kaggle.com. The data contains different measurements of physical health, as well as demographics, and whether or not the patient has heart disease. Heart disease is a very prevalent disease in the modern world, so it should be interesting to merge the world of statistics with such a serious issue.

Question

The question proposed for this data set is: are certain heart disease detectors reliable for classifying whether a patient has heart disease or not?

Pre-kNN Analysis

After reading in the data, we see 303 observations with 14 columns, all of which were determined to be of importance to the question.

## 
##   0   1 
## 138 165

The base split we see here is that 138 of the patients don’t have heart disease, which is encoded as 0, while 165 have heart disease, which is encoded as 1 here.

Next, we scale the data, which is a key part of the process for using kNN. We can then correlate the numeric variables to determine if there are any significant correlations to remove. After running the correlations, we didn’t find any significant ones, so we keep all of the variables. To prepare for kNN, training and test sets must be created. Here, 80% of the rows will be used for training and 20% for testing.


##   TrainingRows TestRows
## 1          243       60

The table above shows the number of rows used for training and testing the data. As we can see, roughly 80% went to training, and 20% to testing.
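One additional preparation step, noted later in this analysis, was converting the categorical predictors into dummy variables. A minimal sketch is shown below, assuming the caret package and hypothetical column names (cp, thal, target) from the Kaggle heart-disease set.

```r
library(caret)

# treat the coded categorical columns as factors before expanding them
heart$cp   <- factor(heart$cp)    # chest pain type
heart$thal <- factor(heart$thal)  # thalassemia result

# one dummy column per factor level, leaving the target label out
dv <- dummyVars(~ ., data = heart[, setdiff(names(heart), "target")])
heart_dummies <- as.data.frame(predict(dv, newdata = heart))
```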


kNN Analysis and Evaluations

After creating the kNN model, the confusion matrix can be used to determine different evaluation metrics for the model.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 16  2
##          1 11 31
##                                          
##                Accuracy : 0.7833         
##                  95% CI : (0.658, 0.8793)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 0.0001472      
##                                          
##                   Kappa : 0.5486         
##                                          
##  Mcnemar's Test P-Value : 0.0265003      
##                                          
##             Sensitivity : 0.9394         
##             Specificity : 0.5926         
##          Pos Pred Value : 0.7381         
##          Neg Pred Value : 0.8889         
##               Precision : 0.7381         
##                  Recall : 0.9394         
##                      F1 : 0.8267         
##              Prevalence : 0.5500         
##          Detection Rate : 0.5167         
##    Detection Prevalence : 0.7000         
##       Balanced Accuracy : 0.7660         
##                                          
##        'Positive' Class : 1              
## 

With the information in the confusion matrix above, we can evaluate our model. The accuracy of the kNN model is 0.7833, or 78.33%, which is good, but could be better. The true positive rate, the rate at which the model correctly classifies patients with heart disease, is equal to the sensitivity: 0.9394, or 93.94%. The false positive rate, the rate at which the model incorrectly classifies healthy patients as having heart disease, is equal to one minus the specificity: 0.4074, or 40.74% in this case. Our TPR is very good. Our FPR is quite poor. Also seen is the Kappa value of 0.5486, which represents moderate agreement between the predicted and actual classes.


Again, we can evaluate the ROC curve and the AUC (area under curve).

## [[1]]
## [1] 0.9281706

As we see in the graph, the ROC curve takes up a large portion of the upper region of the graph. This is confirmed by the AUC value shown above, 0.9282, which is considered excellent in terms of evaluation metrics. This means the area under the curve is 92.82% of the total area of the graph.


Two more metrics we can look at are the log loss value, and the F1 score.

## [1] -2.964037
## [1] 0.7111111

The first value seen above is our log loss, reported as -2.96. Log loss measures the uncertainty of the predicted classification probabilities and is non-negative by definition, so the goal is to have it as close to 0 as possible; the negative sign reported here most likely reflects a dropped sign in the calculation. Either way, a magnitude of roughly 2.96 is nowhere near 0, so this is a very poor value. The second value, our F1 score, is 0.71. The F1 score is another measure of accuracy, on a scale from 0 to 1, so our value of 0.71 is alright.


Next, we can look at the confusion matrix table to identify misclassification patterns in the model.

##           Actual
## Prediction  0  1
##          0 16  2
##          1 11 31

After looking at the table of predicted versus actual observations, we see that most of the misclassifications are false positives, where the model predicts a patient has heart disease when they actually don’t. My best guess for this pattern is that several of the variables were categorical and were transformed into dummy variables, which could possibly lead to inaccurate predictions. To account for the high rate of false positives, we can adjust the threshold to 0.7.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 25  6
##          1  2 27
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.7541, 0.9406)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 1.653e-07       
##                                           
##                   Kappa : 0.7342          
##                                           
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.8182          
##             Specificity : 0.9259          
##          Pos Pred Value : 0.9310          
##          Neg Pred Value : 0.8065          
##               Precision : 0.9310          
##                  Recall : 0.8182          
##                      F1 : 0.8710          
##              Prevalence : 0.5500          
##          Detection Rate : 0.4500          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.8721          
##                                           
##        'Positive' Class : 1               
## 

After adjusting the threshold, we see that the model is much improved: the accuracy, FPR, and Kappa value all improve. We see a slight decrease in the TPR; however, as a whole, the adjusted model seems to be much better than the original. We can also note that the adjusted threshold greatly reduced the number of false positives among the classifications.

Recommendations

With all of this information, we were able to build a good model for classifying heart disease patients. With an accuracy rate of 86.7% after adjusting the threshold, and fairly decent values on the other evaluation metrics, I would say that this model is able to do its job fairly well. However, the model could always be improved with some changes to our techniques and data collection. One recommendation would be to greatly expand the sample size. In this data set, we only have 303 observations. Though sufficient for analysis, I feel a greater sample size, such as over 1,000, would make the results more representative, helping the classifier. Another recommendation would be to add more demographic variables to the data. In this set, the only demographic variables are age and sex. Other variables, such as race or even income, might help the classifier take in more information that could play a role in heart disease. When deploying this model, one must be aware of its imperfections. In the case of our heart data model, the log loss value was very poor, and the F1 score was decent at best. If these evaluation metrics are important to someone, this must be taken into account when deploying the model.

References

  1. https://www.basketball-reference.com/leagues/NBA_2020.html
  2. https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when
  3. https://www.kaggle.com/ronitf/heart-disease-uci