Dataset 1: Reviews of Coffee Shops

The first data set contains information from Yelp ratings for coffee shops in Austin, Texas. Each Yelp rating is logged with the name of the coffee shop, the text of the review, the numeric rating, the relative rating, whether the rating is considered high or low, the overall sentiment, and separate sentiment scores for the shop's vibe, its tea offerings, its service, and so on. The question we hope to answer is: after cleaning the data to remove points that create noise, can we build a kNN model that uses some of these rating factors to tell us which coffee shops will be highly rated?

Data Cleaning

We removed the columns with many missing values and the columns that hurt the model's accuracy. After that, we found that the base rate for identifying whether a coffee shop was highly rated (i.e., the accuracy of always predicting the majority class) was 83.3%.

##         1 
## 0.8334813
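
As a quick sketch of how this base rate can be computed (the data frame and column names here are assumptions, not necessarily the ones in our script), it is simply the proportion of the majority class in the cleaned data:

```r
# Share of each class in the cleaned data; the majority-class share is the
# accuracy a model would get by always predicting "highly rated".
# `coffee_clean$high_rated` is an assumed name for the 0/1 target column.
prop.table(table(coffee_clean$high_rated))
```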

Plot k vs. Accuracy to See How Many Neighbors to Use

Based on the plot below, 11 is the number of neighbors that gives the best accuracy.
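
A minimal sketch of the kind of loop behind this plot (object names like `train_x` and `test_y` are assumptions): for each candidate k we fit kNN on the training set, predict on the test set, and record the accuracy.

```r
library(class)  # for knn()

# Try a range of odd k values and record the test-set accuracy for each.
# `train_x`/`test_x` are assumed names for the scaled feature matrices,
# `train_y`/`test_y` for the 0/1 labels.
ks <- seq(1, 21, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

plot(ks, acc, type = "b",
     xlab = "k (number of neighbors)", ylab = "Test accuracy",
     main = "k vs. Accuracy")
```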

Run kNN analysis with 11 Nearest Neighbors
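
A sketch of the final fit with 11 neighbors, under the same assumed object names; keeping `prob = TRUE` lets us recover a vote-share probability for each test point, which we use later for the threshold, Log Loss, and ROC work.

```r
# Final kNN fit with the chosen k = 11; keep vote proportions for later use.
knn_pred <- knn(train = train_x, test = test_x, cl = train_y,
                k = 11, prob = TRUE)

# Convert the winning-class vote share into an estimated P(class = 1).
vote  <- attr(knn_pred, "prob")
prob1 <- ifelse(knn_pred == "1", vote, 1 - vote)
```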

Model Evaluation

Confusion Matrix

This confusion matrix tells us that the Accuracy is 84.8%, Kappa is 28.5%, Sensitivity is 97.3%, Specificity is 24.0%, and the F1 score is 91.4% (printed separately below). These are pretty good statistics, as the Accuracy has gone up from our base rate of about 83%, though the Kappa is pretty low. Kappa measures the agreement between the model's predictions and the actual labels beyond what would be expected by chance, so this is something to keep in mind when analyzing the model. The true positive rate (TPR), also known as Sensitivity, is 97.32%, which means the model correctly identifies about 97% of good coffee shops as good coffee shops. However, the false positive rate (FPR) is 75.97%, which is the percentage of bad coffee shops that the model incorrectly identifies as good coffee shops.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   0   1
##          0  37  20
##          1 117 727
##                                           
##                Accuracy : 0.8479          
##                  95% CI : (0.8228, 0.8708)
##     No Information Rate : 0.8291          
##     P-Value [Acc > NIR] : 0.07051         
##                                           
##                   Kappa : 0.2847          
##                                           
##  Mcnemar's Test P-Value : 2.367e-16       
##                                           
##             Sensitivity : 0.9732          
##             Specificity : 0.2403          
##          Pos Pred Value : 0.8614          
##          Neg Pred Value : 0.6491          
##              Prevalence : 0.8291          
##          Detection Rate : 0.8069          
##    Detection Prevalence : 0.9367          
##       Balanced Accuracy : 0.6067          
##                                           
##        'Positive' Class : 1               
## 
##        F1 
## 0.9138906
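
To make the Kappa figure concrete: Cohen's kappa compares the observed agreement between predictions and actual labels with the agreement expected by chance from the class margins. A quick check from the counts printed above reproduces the value:

```r
# Cohen's kappa recomputed from the confusion matrix above.
cm <- matrix(c(37, 117, 20, 727), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Actual = c("0", "1")))
n   <- sum(cm)
p_o <- sum(diag(cm)) / n                     # observed agreement (= accuracy)
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
(p_o - p_e) / (1 - p_e)                      # ~0.285, matching the output above
```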

Log Loss

We calculate the Log Loss of the model to be 0.55. Log Loss measures the uncertainty of the model's predicted probabilities by comparing them to the true labels, and it penalizes confident predictions in the wrong direction most heavily. For balanced binary problems, 0.693 (the loss of always predicting 0.5) is a commonly accepted baseline for a "good" Log Loss score. That being said, we don't have a balanced data set (about 83% of our data is high-rated coffee shops!), so this is not bad for what we are working with.

## [1] 0.5487878
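
A sketch of the Log Loss calculation under the same assumed names (`prob1` as the predicted probability of the positive class, `test_y` as the true 0/1 labels); the clipping step just keeps log() finite when a vote share is exactly 0 or 1.

```r
# Log Loss: average negative log-probability assigned to the true class.
eps <- 1e-15
p   <- pmin(pmax(prob1, eps), 1 - eps)     # clip away from 0 and 1
y   <- as.numeric(as.character(test_y))    # ensure labels are numeric 0/1
-mean(y * log(p) + (1 - y) * log(1 - p))
```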

AUC

The area under the receiver operating characteristic (ROC) curve below tells us how well the model can distinguish between the two classes, in other words, whether or not a coffee shop is highly rated. The greater the area under the curve, the better our model is at distinguishing between well-rated and poorly-rated coffee shops in Austin. Our AUC prints as 0.78, which is pretty good! Note that we also plot the y = x line to visualize the difference between our model (the colored curve) and what it would be like to just guess randomly.

## [1] 0.7840714
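
One common way to build this curve is the pROC package; a minimal sketch under the same assumed names, with the y = x reference line added for comparison against random guessing:

```r
library(pROC)

# ROC curve from the true labels and predicted probabilities, plus the AUC.
roc_obj <- roc(response = test_y, predictor = prob1)
plot(roc_obj, legacy.axes = TRUE)  # x-axis drawn as 1 - specificity
abline(a = 0, b = 1, lty = 2)      # y = x: what random guessing would look like
auc(roc_obj)
```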

Misclassification Errors

Based on the model, we have a pretty high Sensitivity, or TPR, which means we are correctly identifying a majority of the high-rated coffee shops. However, our FPR and Specificity are pretty bad for this model. Our high FPR means there are a lot of low-rated coffee shops that are being incorrectly identified as good coffee shops. In addition, our specificity of 24% means less than a fourth of low-rated coffee shops are being correctly identified as bad coffee shops. We believe this may be a result of our unbalanced dataset (747 high-rated shops compared to 154 low-rated shops). Therefore, Kappa is a useful metric for our dataset because it takes into account the imbalance in class distribution. Instead of focusing on improving the overall accuracy (which is already pretty good), we will focus on improving the Kappa value. In terms of our overall question, we would rather have more false negatives than false positives because we prefer to have a smaller list of high-rated coffee shops that are actually good! Because of this, we will work on improving the Specificity in addition to the Kappa.

Adjust Threshold

We decided to adjust the threshold of the model to see if we can further improve the Kappa. We have more false positives than false negatives according to our confusion matrix, so we will raise the threshold. A higher threshold reduces how often the model predicts that a coffee shop is highly rated. Adjusting the threshold to 0.6 has reduced the number of false positives (117 down to 104) but increased the number of false negatives (20 up to 35!). The Accuracy and Sensitivity have also gone down, though the Specificity went up from 24% to 32.5%. The Kappa score increased from 0.285 to 0.338, which is a good sign that the model is improving.
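
A sketch of how this cutoff can be applied, again using the assumed `prob1` probabilities and caret's confusionMatrix(); a shop is predicted as highly rated only when its estimated probability exceeds 0.6.

```r
library(caret)

# Predict "highly rated" only when the estimated probability exceeds 0.6,
# then re-evaluate against the actual labels.
pred_06 <- factor(ifelse(prob1 > 0.6, "1", "0"), levels = c("0", "1"))
confusionMatrix(data = pred_06,
                reference = factor(test_y, levels = c("0", "1")),
                positive = "1", mode = "everything")
```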

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   0   1
##          0  50  35
##          1 104 712
##                                           
##                Accuracy : 0.8457          
##                  95% CI : (0.8205, 0.8687)
##     No Information Rate : 0.8291          
##     P-Value [Acc > NIR] : 0.09853         
##                                           
##                   Kappa : 0.3379          
##                                           
##  Mcnemar's Test P-Value : 8.037e-09       
##                                           
##             Sensitivity : 0.9531          
##             Specificity : 0.3247          
##          Pos Pred Value : 0.8725          
##          Neg Pred Value : 0.5882          
##               Precision : 0.8725          
##                  Recall : 0.9531          
##                      F1 : 0.9111          
##              Prevalence : 0.8291          
##          Detection Rate : 0.7902          
##    Detection Prevalence : 0.9057          
##       Balanced Accuracy : 0.6389          
##                                           
##        'Positive' Class : 1               
## 

Dataset 2: Outcomes of Cancer Patients

This second data set contains the overall outcomes of patients diagnosed with Hepatocellular Carcinoma, along with many other data points about each patient's health. The data comes from a university hospital in Portugal. The question we are hoping to answer is as follows: after cleaning the data and removing any data points that create noise, can we create a kNN model that uses some health factors to tell us if someone has a better chance of survival?

Data Cleaning

We removed the columns with many missing values and the columns that hurt the model's accuracy. After that, we found that the base rate for identifying whether a patient would survive was 60.4%.

Plot k vs. Accuracy to See How Many Neighbors to Use

Based on the plot below, 3 is the number of neighbors that gives the best accuracy.

Run kNN analysis with 3 Nearest Neighbors

Model Evaluation

Confusion Matrix

Originally, the baseline Accuracy was around 60%; the model's Accuracy is now up to 71.9%. The Sensitivity/TPR is 75%, which means about three quarters of patients who survived are correctly identified as survivors. The Specificity is 67% and the false positive rate (FPR) is 33%, which is a bit high (about a third of the patients who died are identified as survivors) but certainly not terrible. The F1 score is 76.9%. All of these statistics are fairly good, especially compared to the baseline. The Kappa value is 0.410, which is pretty good given the slightly unbalanced dataset.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0  8  5
##          1  4 15
##                                           
##                Accuracy : 0.7188          
##                  95% CI : (0.5325, 0.8625)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.1814          
##                                           
##                   Kappa : 0.4098          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.7895          
##          Neg Pred Value : 0.6154          
##              Prevalence : 0.6250          
##          Detection Rate : 0.4688          
##    Detection Prevalence : 0.5938          
##       Balanced Accuracy : 0.7083          
##                                           
##        'Positive' Class : 1               
## 
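
The F1 score quoted above is not part of this printout, but it follows directly from the counts in the matrix, since F1 is the harmonic mean of precision (Pos Pred Value) and recall (Sensitivity):

```r
# F1 recomputed from the confusion matrix above.
precision <- 15 / (15 + 4)   # TP / (TP + FP) = 0.7895
recall    <- 15 / (15 + 5)   # TP / (TP + FN) = 0.7500
2 * precision * recall / (precision + recall)   # ~0.769
```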

Log Loss

The Log Loss score we get with this data is 5.8, which is very poor. This means the model assigns very low probabilities to the true class, i.e., it is often confidently wrong about whether a patient will survive. Since this score is so high, we could likely improve our model and hopefully increase the probability it assigns to the actual class.

## [1] 5.795204

AUC

The area under our ROC curve for this model is 0.458, which is not great. An AUC of 0.5 corresponds to random guessing, so this model is no better (in fact, slightly worse) at predicting whether a cancer patient will survive than guessing at random.

## [1] 0.4583333

Misclassification Errors

The model we created does a good job of correctly identifying survivors (Sensitivity/TPR = 75%), but could definitely be improved. We would like to see a decrease in the FPR so we can reduce the number of patients who passed away but are categorized as survivors. There are two main issues with our data: the data set is small (32 observations) and the data we do have is unbalanced (20 survivors compared to 12 patients who died). Therefore, we will again focus on increasing the Kappa value to measure model improvement, since it takes into account the imbalance between the classification groups. Additionally, we will work towards increasing the Sensitivity/TPR and decreasing the FPR to ensure we are correctly classifying as many survivors as possible. We believe it is most important to be confident in predicting true positives so we do not incorrectly give a terminally ill patient false hope of survival.

Adjust Threshold

Looking at the original confusion matrix, false negatives were somewhat more of a problem in absolute terms: 5 of the 20 actual survivors were predicted as deaths (false negatives), while 4 of the 12 actual deaths were predicted as survivors (false positives). So we lowered the threshold to 0.3 to see what this would do to our model. This reduced the number of false negatives by 3, but we now have far more false positives (9 instead of 4). Although our Sensitivity/TPR increased to 90%, the Accuracy, Specificity, and Kappa have all declined significantly, so we don't see this threshold adjustment as helpful for this model.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0  3  2
##          1  9 18
##                                           
##                Accuracy : 0.6562          
##                  95% CI : (0.4681, 0.8143)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.43364         
##                                           
##                   Kappa : 0.1698          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.2500          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.6000          
##               Precision : 0.6667          
##                  Recall : 0.9000          
##                      F1 : 0.7660          
##              Prevalence : 0.6250          
##          Detection Rate : 0.5625          
##    Detection Prevalence : 0.8438          
##       Balanced Accuracy : 0.5750          
##                                           
##        'Positive' Class : 1               
## 

Conclusion

Overall, this lab was a great way to learn more about machine learning evaluation, and specifically which metrics are useful for different questions and models. It was interesting to compare the first and second datasets in terms of model performance and evaluation. Our first analysis was more successful, most likely because its dataset was larger; the second dataset had only 32 observations after data cleaning. For the first dataset, we prioritized increasing the Kappa value and decreasing the false positive rate (FPR), while for the second dataset we prioritized increasing the Kappa value and increasing the Sensitivity/true positive rate (TPR). We also observed the results of raising the threshold for the first dataset and lowering it for the second. Although the second analysis wasn't as successful as we had hoped, we still learned a lot about evaluating machine learning models.

If we had more time, we would have liked to do further analyses, specifically around adjusting the threshold. We focused on one or two specific metrics depending on the question being asked, so it would be interesting to create a couple of different questions for each dataset, determine which metrics matter most for each question, and then find the best model for each.