TODO: Explain how the choice of \(K\) affects the quality of your prediction when using a \(K\) Nearest Neighbors algorithm.
Explanation: The choice of \(K\) in kNN controls how the model generalizes. A small \(K\) (e.g., 1 or 3) makes predictions sensitive to noise in individual training points, leading to overfitting (low bias, high variance). A large \(K\) smooths predictions by averaging over many neighbors, but risks underfitting by washing out local patterns (high bias, low variance). The best \(K\) balances this bias-variance trade-off, and you can usually find it with cross-validation.
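A minimal sketch of finding that balance with caret's cross-validation; the data frame `train_df` and the outcome column `province` are assumed names, not part of the original assignment:

```r
library(caret)

set.seed(42)
# Try odd values of K from 1 to 25; 5-fold CV estimates out-of-sample
# accuracy for each candidate, and caret keeps the K that scores best.
fit_knn <- train(province ~ ., data = train_df,
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = data.frame(k = seq(1, 25, by = 2)))
fit_knn$bestTune   # the K where bias and variance balance out
```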
3. Feature Engineering
Create a version of the year column that is a factor (instead of numeric).
Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description.
Take care to handle upper and lower case characters.
Create 3 new features that represent the interaction between time and the cherry, chocolate and earth indicators (see the sketch after this list).
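A minimal sketch of these four steps with dplyr and stringr, assuming a data frame named `wine` with a numeric `year` column and a character `description` column (hypothetical names):

```r
library(dplyr)
library(stringr)

wine <- wine %>%
  mutate(
    # factor version of year (categorical instead of numeric)
    year_f = factor(year),
    # lower-case the text first so "Cherry" and "cherry" both match
    note_cherry    = as.integer(str_detect(str_to_lower(description), "cherry")),
    note_chocolate = as.integer(str_detect(str_to_lower(description), "chocolate")),
    note_earth     = as.integer(str_detect(str_to_lower(description), "earth")),
    # interactions between time and each tasting-note indicator
    time_cherry    = year * note_cherry,
    time_chocolate = year * note_chocolate,
    time_earth     = year * note_earth
  )
```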
Confusion Matrix and Statistics

                    Reference
Prediction           Burgundy California Casablanca_Valley Marlborough New_York Oregon
  Burgundy                 86         14                 3           6        0     31
  California               68        669                12          18       10    246
  Casablanca_Valley         0          0                 0           0        0      0
  Marlborough               0          0                 0           0        0      0
  New_York                  1          0                 3           1        1      1
  Oregon                   83        108                 8          20       15    269

Overall Statistics

               Accuracy : 0.6127
                 95% CI : (0.5888, 0.6361)
    No Information Rate : 0.4728
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3551

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: Burgundy Class: California Class: Casablanca_Valley Class: Marlborough Class: New_York Class: Oregon
Sensitivity                  0.36134            0.8458                  0.00000             0.0000       0.0384615        0.4918
Specificity                  0.96237            0.5986                  1.00000             1.0000       0.9963570        0.7922
Pos Pred Value               0.61429            0.6540                      NaN                NaN       0.1428571        0.5348
Neg Pred Value               0.90085            0.8123                  0.98446             0.9731       0.9849940        0.7624
Prevalence                   0.14226            0.4728                  0.01554             0.0269       0.0155409        0.3270
Detection Rate               0.05140            0.3999                  0.00000             0.0000       0.0005977        0.1608
Detection Prevalence         0.08368            0.6115                  0.00000             0.0000       0.0041841        0.3007
Balanced Accuracy            0.66186            0.7222                  0.50000             0.5000       0.5174093        0.6420
6. Kappa
How do we determine whether a Kappa value represents a good, bad or some other outcome?
Explanation: Kappa measures how much better a model performs than random chance. It compares the observed agreement \(p_o\) (how often the model's predictions match the actual values) to the expected agreement \(p_e\) (how often they would match by chance, given the class frequencies). Common benchmarks (e.g., Landis and Koch) treat values above 0.6 as substantial agreement, so a Kappa above 0.6 is generally considered good. A very high Kappa (above roughly 0.85 or 0.9) is excellent on held-out data, but if it is computed on the training data it can signal overfitting, meaning the model is too closely tailored to the training set and may not generalize well to new data.
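In symbols, Cohen's Kappa is

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

As a quick check against the output above: plugging in the reported accuracy \(p_o = 0.6127\) and \(\kappa = 0.3551\) and solving for \(p_e\) gives \(p_e \approx 0.40\). In other words, the classifier agrees with the true labels about 61% of the time versus roughly 40% expected by chance alone, which is the gap the Kappa of 0.36 encodes.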
7. Improvement
How can we interpret the confusion matrix, and how can we improve our predictions?
Explanation: The confusion matrix shows that the model classifies California wines well (sensitivity 0.85) but has difficulty separating Burgundy from Oregon, frequently misclassifying one as the other. Casablanca_Valley and Marlborough are never predicted at all, which is typical when a class is rare and its neighborhoods are dominated by majority classes. To improve performance, we could balance the dataset (e.g., by upsampling the rare provinces), incorporate more informative features, tune \(K\) via cross-validation, or switch to a more flexible model such as a Random Forest.
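As one possible direction, a minimal sketch combining caret's built-in upsampling with a Random Forest; as before, `train_df` and the outcome column `province` are assumed names:

```r
library(caret)

set.seed(42)
# sampling = "up" re-balances the classes inside each CV fold, so the rare
# provinces (Casablanca_Valley, Marlborough) are no longer drowned out by
# California and Oregon during training.
ctrl <- trainControl(method = "cv", number = 5, sampling = "up")

# method = "rf" wraps the randomForest package as a stronger baseline model.
fit_rf <- train(province ~ ., data = train_df,
                method = "rf",
                trControl = ctrl)
fit_rf
```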