Final Exam

Final Exam: MSCI 4230 – BUSINESS ANALYTICS IN PRACTICE

The objective of this final exam is to predict wine taste using the predictors in the data set. I do this by visualizing the data, classifying taste with K-Nearest Neighbors (KNN) and Naive Bayes, and fitting a Linear Regression.

About the Data

The data set includes information on various factors that might affect wine taste. It has details like density, chlorides, pH level and alcohol percentage, along with wine taste scores. Analyzing this can help find patterns and understand what influences wine taste. The raw dataset consists of 4,898 rows and 12 columns; after removing 937 duplicate rows, 3,961 rows remain. The columns include both objective measures, such as acidity, alcohol content, and pH, and a subjective taste rating on a scale from 0 to 10.

Data Preview

fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	taste
7.0	0.27	0.36	20.7	0.045	45	170	1.0010	3.00	0.45	8.8	6
6.3	0.30	0.34	1.6	0.049	14	132	0.9940	3.30	0.49	9.5	6
8.1	0.28	0.40	6.9	0.050	30	97	0.9951	3.26	0.44	10.1	6
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	6
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	6

This is a preview of the first five rows of the dataset, which consists entirely of numerical data types. The columns include fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and taste. These columns contain ratings, percentages, or measurable values that represent different chemical properties and quality indicators of wine. For instance, variables like fixed acidity and volatile acidity measure specific chemical concentrations, while others, such as taste, represent subjective ratings or satisfaction levels.

From experience, I believe that acidity, sugar, and alcohol content play a crucial role in shaping the overall taste and quality of wine. Acidity contributes to freshness and balance, sugar affects sweetness, and alcohol influences body, warmth, and the way flavors are perceived. These components are often key to how enjoyable and well-rounded a wine tastes. However, I’m also curious to see the true impact of these factors through data analysis, as well as to discover whether other, less obvious features, ones I might not expect, also play a significant role in determining wine quality.

A) Exploratory Data Analysis

i. Data Cleaning

## tibble [4,898 × 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed acidity       : num [1:4898] 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile acidity    : num [1:4898] 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric acid         : num [1:4898] 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual sugar      : num [1:4898] 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num [1:4898] 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free sulfur dioxide : num [1:4898] 45 14 30 47 47 30 30 45 14 28 ...
##  $ total sulfur dioxide: num [1:4898] 170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num [1:4898] 1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num [1:4898] 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num [1:4898] 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num [1:4898] 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ taste               : num [1:4898] 6 6 6 6 6 6 6 6 6 6 ...

The first step in analyzing the dataset was to clean and examine its structure:

Column Names: I standardized the column names by replacing spaces with underscores to simplify referencing and analysis.

Data Types: All columns in the dataset are numeric, which is ideal for quantitative analysis.

Missing Values: I confirmed that there are no missing values in the dataset.

Duplicate Rows: There were 937 duplicate rows identified. These duplicates were removed to ensure each observation in the dataset is unique.

Outlier Detection: I calculated z-scores for all numeric features and removed any rows with z-scores beyond ±3, as they are considered statistical outliers.
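
The cleaning steps above could be sketched in base R roughly as follows. The `wine_raw` data frame is a small made-up stand-in for the real data, and the functions used in the actual analysis may have been different.

```r
# Hypothetical stand-in for the raw data: column names contain spaces,
# the last row repeats the first, and one residual sugar value is extreme.
wine_raw <- data.frame(
  `fixed acidity`  = c(seq(6.0, 9.0, by = 0.1), 6.0),
  `residual sugar` = c(rep(c(2, 5, 8), 10), 65, 2),
  check.names = FALSE
)

wine <- wine_raw

# Column names: replace spaces with underscores.
names(wine) <- gsub(" ", "_", names(wine), fixed = TRUE)

# Missing values: count them (expected to be zero).
n_missing <- sum(is.na(wine))

# Duplicate rows: keep only the first occurrence of each row.
wine <- wine[!duplicated(wine), ]

# Outliers: drop rows where any column has a z-score beyond +/-3.
z <- scale(wine)
wine_clean <- wine[rowSums(abs(z) > 3) == 0, ]

nrow(wine_raw)    # rows before cleaning
nrow(wine_clean)  # rows after cleaning
```

Note that the z-scores here are computed after duplicates are removed, so repeated rows do not affect the means and standard deviations.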

ii. Data Visualisation

Histogram of Key Wine Characteristics

Fig. 1.1 The taste histogram shows how the wines are distributed across taste scores, which range from 0 to 10. Most scores fall between 5 and 7, meaning most wines were rated in the middle of the scale. Only a small number of wines were rated at the low or high end. The key factors that affect wine taste include sweetness/dryness, alcohol content, and acidity. Alcohol content is right-skewed, with the majority of wines in the 10-12% range. Both acidity and sugar levels are also right-skewed, indicating that most wines have lower acidity and sugar content. These distributions show the typical ranges of the factors that shape the overall flavor profile of wine, including its perceived sweetness, dryness, and balance.

Correlation Heat map

Fig. 1.2 This correlation heat map shows the pairwise correlations among all of the variables. Alcohol has a moderate positive correlation with taste (0.46), the strongest of any variable, so it may be one of the key predictors of taste. Density has a moderate negative correlation with taste: as density increases, taste tends to decrease, making it another possible key predictor. The remaining variables have correlations with taste between -0.2 and +0.2, indicating a weak or no linear relationship.
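
As a rough illustration of how the taste column of such a heat map is read, the sketch below ranks predictors by the absolute value of their correlation with taste. The data are simulated and the variable relationships are assumptions made for the example, so the numbers will not match the figure.

```r
# Simulated stand-in data (not the wine data set).
set.seed(1)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.0 - 0.0015 * alcohol + rnorm(n, sd = 0.001)
pH      <- rnorm(n, mean = 3.2, sd = 0.15)
taste   <- round(3 + 0.3 * alcohol + rnorm(n, sd = 0.8))
wine    <- data.frame(alcohol, density, pH, taste)

# Pearson correlation of every predictor with taste.
cors <- cor(wine)[, "taste"]
cors <- cors[names(cors) != "taste"]

# Strongest linear relationships first, regardless of sign.
cors_ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(cors_ranked, 2)
```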

Scatter Plots

Fig. 1.3 The relationship between alcohol and taste is positive, as shown by the upward trendline in the scatter plot. This indicates that higher alcohol content is generally associated with better taste ratings. Sulphates also show a slight positive correlation with taste, suggesting a minor contribution to perceived quality. On the other hand, volatile acidity exhibits a negative linear relationship with taste, meaning that higher acidity levels tend to reduce taste scores. Citric acid shows a subtle positive trend, indicating a mild influence on taste. Among these factors, alcohol appears to have the strongest impact on taste perception.

In terms of other variables, fixed acidity and residual sugar both display slight negative linear trends, suggesting that increases in these factors may slightly decrease taste ratings. Chlorides demonstrate a steeper negative relationship with taste, pointing to a stronger inverse correlation. Free sulfur dioxide shows little to no correlation with taste, indicating minimal effect. Total sulfur dioxide follows a negative trend, but with a weaker slope than density, which exhibits a steeper negative relationship. Lastly, pH presents a mild positive trend, suggesting that higher pH levels may slightly improve taste perception.

Histogram of Predictors and Taste

Fig. 1.4 These histograms show the distribution of the different chemical properties found in wine and how they might relate to its taste, which is rated from 0 to 10. Most of the features, like fixed acidity, volatile acidity, and citric acid, are distributed fairly evenly around their centers, meaning most wines have similar levels. Residual sugar and chlorides are very skewed, with most wines having low amounts but a few having much higher levels. This could have a big effect on taste. Free and total sulfur dioxide also vary, and since they help preserve wine, they might influence how fresh the wine tastes. Density is very similar across most wines, while pH (a measure of acidity) is nicely centered around a typical value. Sulphates and alcohol show a bit more variation, and alcohol even seems to have two common levels. These differences in chemical makeup can all play a role in how a wine tastes, and some may be more important than others when predicting a wine’s overall rating.

Box Plots for Distribution of Quantitative Variables

Fig. 1.5 These box plots show how four chemical features (fixed acidity, volatile acidity, citric acid, and residual sugar) relate to wine taste scores, which range from 0 to 10. Overall, fixed acidity stays fairly consistent across different taste levels, with a slight decrease at higher taste scores. However, there are several outliers across all scores, especially at the higher end, indicating some wines have unusually high acidity regardless of taste. Volatile acidity tends to decrease slightly as taste improves, suggesting that lower levels may contribute to better-tasting wines. Outliers are present here as well, especially in the mid-range scores. Citric acid remains relatively stable across taste ratings, but also shows many outliers, particularly among mid- and high-rated wines, showing that extreme values can still result in a wide range of tastes. Residual sugar does not show a clear trend with taste, and contains numerous outliers, especially in lower taste scores, implying that a few wines have very high sugar levels but are not necessarily rated highly. Among these features, volatile acidity shows the clearest relationship with taste, while the rest appear more scattered with significant variability.

iii. Data Transformation

fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	taste
6.3	0.30	0.34	1.6	0.049	14	132	0.9940	3.30	0.49	9.5	Mediocre
8.1	0.28	0.40	6.9	0.050	30	97	0.9951	3.26	0.44	10.1	Mediocre
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	Mediocre
6.2	0.32	0.16	7.0	0.045	30	136	0.9949	3.18	0.47	9.6	Mediocre
8.1	0.22	0.43	1.5	0.044	28	129	0.9938	3.22	0.45	11.0	Mediocre

To transform the data, I recoded the numeric taste column into three categories: Bad, Mediocre, and Excellent. The ranges for the transformation are as follows: Bad: 0 to 3, Mediocre: 4 to 6 and Excellent: 7 to 10.
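
One way to express this recoding in base R is with `cut()`. This is a minimal sketch using the ranges stated above on a few made-up scores; the actual analysis may have used a different function.

```r
# Made-up taste scores covering each range.
taste_score <- c(3, 5, 6, 7, 9, 0, 4, 10)

# Right-closed intervals: (-Inf, 3] = Bad, (3, 6] = Mediocre, (6, Inf) = Excellent.
taste_class <- cut(
  taste_score,
  breaks = c(-Inf, 3, 6, Inf),
  labels = c("Bad", "Mediocre", "Excellent")
)

data.frame(taste_score, taste_class)
```

Because the scores are whole numbers, the break points 3 and 6 correspond to the ranges 0 to 3, 4 to 6, and 7 to 10.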

iv. Partition Data

To prepare the data for analysis, I partitioned the dataset into a training set and a holdout set. Using the taste column, 70% of the data was randomly allocated to the training set, while the remaining 30% was set aside as the holdout set for model evaluation. To ensure reproducibility, I set the random seed to 1 before partitioning. This allows the data split to be consistent every time the code is run, avoiding any variation in the partitioning process.
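
A 70/30 split of this kind can be sketched in base R as below. The sketch assumes the split is stratified on taste, so each class keeps roughly the same share in both sets; the text does not say which function was used, and the `wine` data frame here is a made-up stand-in.

```r
# Made-up stand-in data with the same kind of class imbalance.
wine <- data.frame(
  alcohol = seq(8, 14, length.out = 100),
  taste   = factor(
    rep(c("Bad", "Mediocre", "Excellent"), times = c(10, 80, 10)),
    levels = c("Bad", "Mediocre", "Excellent")
  )
)

# Fix the seed so the same rows are drawn on every run.
set.seed(1)

# Within each taste class, draw 70% of the row indices for training.
rows_by_class <- split(seq_len(nrow(wine)), wine$taste)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train   <- wine[train_idx, ]
holdout <- wine[-train_idx, ]

table(train$taste)
table(holdout$taste)
```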

B) Predictive Data Analysis

i. Apply Appropriate Models

1. NB Classifier

Data Displayed

Wine Dataset for Naive Bayes
fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	taste
6.3	0.30	0.34	1.6	0.049	14	132	0.9940	3.30	0.49	9.5	Mediocre
8.1	0.28	0.40	6.9	0.050	30	97	0.9951	3.26	0.44	10.1	Mediocre
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	Mediocre
6.2	0.32	0.16	7.0	0.045	30	136	0.9949	3.18	0.47	9.6	Mediocre
8.1	0.22	0.43	1.5	0.044	28	129	0.9938	3.22	0.45	11.0	Mediocre

Two types of probabilities

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##        Bad   Mediocre  Excellent 
## 0.03578310 0.93036058 0.03385632 
## 
## Conditional probabilities:
##            fixed_acidity
## Y               [,1]      [,2]
##   Bad       7.123077 1.0315593
##   Mediocre  6.820769 0.8024428
##   Excellent 6.636585 0.8236532
## 
##            volatile_acidity
## Y                [,1]       [,2]
##   Bad       0.3215385 0.10501356
##   Mediocre  0.2706376 0.08562576
##   Excellent 0.2893089 0.09876351
## 
##            citric_acid
## Y                [,1]       [,2]
##   Bad       0.3111538 0.13706013
##   Mediocre  0.3268609 0.10077709
##   Excellent 0.3312195 0.08033317
## 
##            residual_sugar
## Y               [,1]     [,2]
##   Bad       4.653077 4.259243
##   Mediocre  5.981879 4.738798
##   Excellent 5.123171 3.707117
## 
##            chlorides
## Y                 [,1]        [,2]
##   Bad       0.04739231 0.015343860
##   Mediocre  0.04309260 0.011869394
##   Excellent 0.03592683 0.008891118
## 
##            free_sulfur_dioxide
## Y               [,1]     [,2]
##   Bad       22.36538 17.31892
##   Mediocre  34.80311 15.33045
##   Excellent 34.53659 12.69716
## 
##            total_sulfur_dioxide
## Y               [,1]     [,2]
##   Bad       122.4769 49.92882
##   Mediocre  137.2420 41.56779
##   Excellent 121.7358 28.12278
## 
##            density
## Y                [,1]        [,2]
##   Bad       0.9941201 0.002471140
##   Mediocre  0.9937687 0.002794839
##   Excellent 0.9917771 0.002233137
## 
##            pH
## Y               [,1]      [,2]
##   Bad       3.188769 0.1633136
##   Mediocre  3.194435 0.1423248
##   Excellent 3.234715 0.1535659
## 
##            sulphates
## Y                [,1]      [,2]
##   Bad       0.4643077 0.1069589
##   Mediocre  0.4873817 0.1042133
##   Excellent 0.4738211 0.1301734
## 
##            alcohol
## Y               [,1]     [,2]
##   Bad       10.19346 1.039144
##   Mediocre  10.58743 1.191387
##   Excellent 11.88537 1.077846

Now we want to use the NB Classifier to classify wines based on their predictors. We will use the whole dataset.
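
To show how the two types of probabilities combine, the sketch below applies Bayes' rule by hand to the first wine in the table (alcohol 9.5, density 0.9940). It uses the a-priori probabilities and the class means and standard deviations printed above, but only for alcohol and density, which is a simplification: the fitted model uses all eleven predictors, so these posteriors will not match its output.

```r
# A-priori probabilities from the model output.
priors <- c(Bad = 0.03578310, Mediocre = 0.93036058, Excellent = 0.03385632)

# Class-conditional means ([,1]) and standard deviations ([,2]) from the output.
alcohol_mean <- c(10.19346, 10.58743, 11.88537)
alcohol_sd   <- c(1.039144, 1.191387, 1.077846)
density_mean <- c(0.9941201, 0.9937687, 0.9917771)
density_sd   <- c(0.002471140, 0.002794839, 0.002233137)

# First wine in the table.
x_alcohol <- 9.5
x_density <- 0.9940

# Naive Bayes: prior times the product of per-feature Gaussian densities.
unnormalised <- priors *
  dnorm(x_alcohol, mean = alcohol_mean, sd = alcohol_sd) *
  dnorm(x_density, mean = density_mean, sd = density_sd)

# Normalise so the three class probabilities sum to 1.
posterior <- unnormalised / sum(unnormalised)
predicted_class <- names(which.max(posterior))

round(posterior, 4)
predicted_class
```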

NB Classifier

##              Bad  Mediocre    Excellent
## [1,] 0.040180971 0.9566421 3.176937e-03
## [2,] 0.045132756 0.9518595 3.007771e-03
## [3,] 0.008675993 0.9912569 6.714071e-05
## [4,] 0.028257615 0.9700323 1.710133e-03
## [5,] 0.023854509 0.9556689 2.047660e-02

## [1] Mediocre Mediocre Mediocre Mediocre Mediocre
## Levels: Bad Mediocre Excellent

##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           6.3             0.30        0.34           1.60     0.049
## 2           8.1             0.28        0.40           6.90     0.050
## 3           7.2             0.23        0.32           8.50     0.058
## 4           6.2             0.32        0.16           7.00     0.045
## 5           8.1             0.22        0.43           1.50     0.044
## 6           8.1             0.27        0.41           1.45     0.033
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  14                  132  0.9940 3.30      0.49     9.5
## 2                  30                   97  0.9951 3.26      0.44    10.1
## 3                  47                  186  0.9956 3.19      0.40     9.9
## 4                  30                  136  0.9949 3.18      0.47     9.6
## 5                  28                  129  0.9938 3.22      0.45    11.0
## 6                  11                   63  0.9908 2.99      0.56    12.0
##      taste         Bad  Mediocre    Excellent pred.class
## 1 Mediocre 0.040180971 0.9566421 3.176937e-03   Mediocre
## 2 Mediocre 0.045132756 0.9518595 3.007771e-03   Mediocre
## 3 Mediocre 0.008675993 0.9912569 6.714071e-05   Mediocre
## 4 Mediocre 0.028257615 0.9700323 1.710133e-03   Mediocre
## 5 Mediocre 0.023854509 0.9556689 2.047660e-02   Mediocre
## 6 Mediocre 0.061712676 0.8642270 7.406031e-02   Mediocre

Results of classification

Wine Classification Result
fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	taste	Bad	Mediocre	Excellent	pred.class
6.3	0.30	0.34	1.60	0.049	14	132	0.9940	3.30	0.49	9.5	Mediocre	0.0401810	0.9566421	0.0031769	Mediocre
8.1	0.28	0.40	6.90	0.050	30	97	0.9951	3.26	0.44	10.1	Mediocre	0.0451328	0.9518595	0.0030078	Mediocre
7.2	0.23	0.32	8.50	0.058	47	186	0.9956	3.19	0.40	9.9	Mediocre	0.0086760	0.9912569	0.0000671	Mediocre
6.2	0.32	0.16	7.00	0.045	30	136	0.9949	3.18	0.47	9.6	Mediocre	0.0282576	0.9700323	0.0017101	Mediocre
8.1	0.22	0.43	1.50	0.044	28	129	0.9938	3.22	0.45	11.0	Mediocre	0.0238545	0.9556689	0.0204766	Mediocre
8.1	0.27	0.41	1.45	0.033	11	63	0.9908	2.99	0.56	12.0	Mediocre	0.0617127	0.8642270	0.0740603	Mediocre
8.6	0.23	0.40	4.20	0.035	17	109	0.9947	3.14	0.53	9.7	Mediocre	0.0946860	0.9006348	0.0046792	Mediocre
7.9	0.18	0.37	1.20	0.040	16	75	0.9920	3.18	0.63	10.8	Mediocre	0.0364733	0.9291725	0.0343541	Mediocre
6.6	0.16	0.40	1.50	0.044	48	143	0.9912	3.54	0.52	12.4	Mediocre	0.0013128	0.6721873	0.3264999	Mediocre
8.3	0.42	0.62	19.25	0.040	41	172	1.0002	2.98	0.67	9.7	Mediocre	0.0309816	0.9690180	0.0000004	Mediocre

Confusion Matrix

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction   Bad Mediocre Excellent
##   Bad         24       71         0
##   Mediocre   105     3082        87
##   Excellent    1      227        36
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8648          
##                  95% CI : (0.8533, 0.8758)
##     No Information Rate : 0.9304          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1456          
##                                           
##  Mcnemar's Test P-Value : 4.292e-15       
## 
## Statistics by Class:
## 
##                      Class: Bad Class: Mediocre Class: Excellent
## Sensitivity            0.184615          0.9118         0.292683
## Specificity            0.979732          0.2411         0.935043
## Pos Pred Value         0.252632          0.9414         0.136364
## Neg Pred Value         0.970040          0.1699         0.974176
## Prevalence             0.035783          0.9304         0.033856
## Detection Rate         0.006606          0.8483         0.009909
## Detection Prevalence   0.026149          0.9012         0.072667
## Balanced Accuracy      0.582174          0.5765         0.613863

Output Discussion

The Naive Bayes classifier reaches an overall accuracy of 86.5%, but this is below the no-information rate of 93.0%, the accuracy obtained by labeling every wine “Mediocre” (P-Value [Acc > NIR] = 1). The confusion matrix shows that the model correctly identifies a large number of “Mediocre” wines but struggles with the “Bad” and “Excellent” classes. The Kappa statistic of 0.1456 indicates only slight agreement beyond chance, reinforcing the concern about class imbalance. Sensitivity (recall) for “Bad” and “Excellent” wines is low (18.5% and 29.3%, respectively), meaning the model usually fails to identify these categories. The high specificity and negative predictive values for these two classes (all above 93%) show that the model is better at ruling them out than at confirming them. Overall, the classifier performs well only for the dominant class, which makes it unreliable for applications where distinguishing the minority classes is important.

The Naive Bayes model struggles in this case mostly because the dataset is unbalanced: there are far more “Mediocre” wines than “Bad” or “Excellent” ones. With a prior probability of about 93% for “Mediocre,” the evidence from the predictors has to be very strong before the model picks another class. That means it’s not very good at spotting the less common “Bad” or “Excellent” wines. Also, Naive Bayes assumes that all the features (like alcohol, acidity, etc.) are independent of each other within each class, which isn’t usually true in real life. When features are actually related, this assumption can distort the estimated probabilities and make the model less accurate, especially for the smaller classes.
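
The headline statistics can be recomputed directly from the Naive Bayes confusion matrix printed above, which makes it easier to see where each number comes from. The sketch below uses base R only; the counts are copied from that matrix.

```r
classes <- c("Bad", "Mediocre", "Excellent")

# Rows are predictions, columns are the reference (true) classes.
cm <- matrix(
  c( 24,   71,  0,
    105, 3082, 87,
      1,  227, 36),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm)

# Accuracy: share of wines on the diagonal.
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy from always predicting the largest class.
no_information_rate <- max(colSums(cm)) / n

# Sensitivity (recall) per class: correct predictions / true class size.
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

round(c(accuracy = accuracy, nir = no_information_rate, kappa = cohen_kappa), 4)
round(sensitivity, 4)
```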

2. KNN

Data set

Wine dataset for KNN
fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	taste
6.3	0.30	0.34	1.60	0.049	14	132	0.9940	3.30	0.49	9.5	Mediocre
8.1	0.28	0.40	6.90	0.050	30	97	0.9951	3.26	0.44	10.1	Mediocre
7.2	0.23	0.32	8.50	0.058	47	186	0.9956	3.19	0.40	9.9	Mediocre
6.2	0.32	0.16	7.00	0.045	30	136	0.9949	3.18	0.47	9.6	Mediocre
8.1	0.22	0.43	1.50	0.044	28	129	0.9938	3.22	0.45	11.0	Mediocre
8.1	0.27	0.41	1.45	0.033	11	63	0.9908	2.99	0.56	12.0	Mediocre
8.6	0.23	0.40	4.20	0.035	17	109	0.9947	3.14	0.53	9.7	Mediocre
7.9	0.18	0.37	1.20	0.040	16	75	0.9920	3.18	0.63	10.8	Mediocre
6.6	0.16	0.40	1.50	0.044	48	143	0.9912	3.54	0.52	12.4	Mediocre
8.3	0.42	0.62	19.25	0.040	41	172	1.0002	2.98	0.67	9.7	Mediocre

Choosing K value

## k-Nearest Neighbors 
## 
## 3633 samples
##   11 predictor
##    3 classes: 'Bad', 'Mediocre', 'Excellent' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3270, 3269, 3270, 3269, 3270, 3270, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa      
##   5  0.9270573  0.032608964
##   7  0.9288926  0.013434473
##   9  0.9299027  0.008283928
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

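The plot shows cross-validated accuracy increasing with k and reaching its highest value at k = 9, the largest value tried, which was selected for the final model. Kappa falls over the same range (from 0.033 to 0.008), so the gain in accuracy comes from predicting the majority class more often.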

KNN classifier result
fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	taste	prediction
6.3	0.30	0.34	1.60	0.049	14	132	0.9940	3.30	0.49	9.5	Mediocre	Mediocre
8.1	0.28	0.40	6.90	0.050	30	97	0.9951	3.26	0.44	10.1	Mediocre	Mediocre
7.2	0.23	0.32	8.50	0.058	47	186	0.9956	3.19	0.40	9.9	Mediocre	Mediocre
6.2	0.32	0.16	7.00	0.045	30	136	0.9949	3.18	0.47	9.6	Mediocre	Mediocre
8.1	0.22	0.43	1.50	0.044	28	129	0.9938	3.22	0.45	11.0	Mediocre	Mediocre
8.1	0.27	0.41	1.45	0.033	11	63	0.9908	2.99	0.56	12.0	Mediocre	Mediocre
8.6	0.23	0.40	4.20	0.035	17	109	0.9947	3.14	0.53	9.7	Mediocre	Mediocre
7.9	0.18	0.37	1.20	0.040	16	75	0.9920	3.18	0.63	10.8	Mediocre	Mediocre
6.6	0.16	0.40	1.50	0.044	48	143	0.9912	3.54	0.52	12.4	Mediocre	Mediocre
8.3	0.42	0.62	19.25	0.040	41	172	1.0002	2.98	0.67	9.7	Mediocre	Mediocre

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction   Bad Mediocre Excellent
##   Bad          6        3         0
##   Mediocre   124     3377       123
##   Excellent    0        0         0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9312          
##                  95% CI : (0.9225, 0.9392)
##     No Information Rate : 0.9304          
##     P-Value [Acc > NIR] : 0.4389          
##                                           
##                   Kappa : 0.0423          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Bad Class: Mediocre Class: Excellent
## Sensitivity            0.046154         0.99911          0.00000
## Specificity            0.999144         0.02372          1.00000
## Pos Pred Value         0.666667         0.93184              NaN
## Neg Pred Value         0.965784         0.66667          0.96614
## Prevalence             0.035783         0.93036          0.03386
## Detection Rate         0.001652         0.92953          0.00000
## Detection Prevalence   0.002477         0.99752          0.00000
## Balanced Accuracy      0.522649         0.51141          0.50000

Output Discussion

The K-Nearest Neighbors (KNN) model achieved an overall accuracy of 93.1%. It performed especially well on the majority class, “Mediocre,” with a sensitivity of 99.9% and a positive predictive value of over 93%. However, this accuracy is essentially equal to the no-information rate of 93.0% (P-Value [Acc > NIR] = 0.4389), and the Kappa of 0.0423 shows almost no agreement beyond chance. In other words, the model does little more than label nearly every wine “Mediocre.” Given its simplicity and intuitive nature, KNN is still a useful baseline classifier for capturing general trends based on similarity between observations.

While the KNN model performs well with the dominant class, it rarely identifies the less frequent wine types: sensitivity is 4.6% for “Bad” and 0% for “Excellent,” a class the model never predicts. This is likely due to the class imbalance in the dataset, which leads neighbor-based models to favor the majority class. Additionally, because KNN relies on feature distance and the model was fit with no pre-processing, its performance can be affected by overlapping class distributions and by differences in scale among variables. These challenges present an opportunity for improvement through techniques such as data balancing, feature scaling, or exploring more advanced models that may better capture the nuances of the minority classes.
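
The effect of feature scaling on a distance-based classifier can be seen in a toy example. The sketch below uses a small hand-written nearest-neighbor function (`knn_predict` is a hypothetical helper, not the function used in this analysis), six made-up training wines, and k = 1 for simplicity. Total sulfur dioxide is measured on a much larger scale than alcohol, so it dominates the unscaled distance.

```r
# Hypothetical helper: majority vote among the k nearest training rows
# (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(row) {
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]
  })
}

# Made-up training wines: columns are alcohol and total sulfur dioxide.
train_x <- cbind(
  alcohol              = c(9.5, 9.8, 10.0, 12.5, 12.8, 13.0),
  total_sulfur_dioxide = c(100, 120, 140, 105, 125, 145)
)
train_y <- rep(c("Mediocre", "Excellent"), each = 3)
new_x   <- cbind(alcohol = 12.6, total_sulfur_dioxide = 135)

# Unscaled: the sulfur dioxide gap drives the distance.
pred_raw <- knn_predict(train_x, train_y, new_x, k = 1)

# Scaled: centre and scale with the training means and standard deviations.
train_s <- scale(train_x)
new_s   <- scale(new_x,
                 center = attr(train_s, "scaled:center"),
                 scale  = attr(train_s, "scaled:scale"))
pred_scaled <- knn_predict(train_s, train_y, new_s, k = 1)

c(raw = pred_raw, scaled = pred_scaled)
```

In this toy data the new wine has the alcohol level of the “Excellent” group, but without scaling its nearest neighbor is a “Mediocre” wine with a similar sulfur dioxide value.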

3. Regression Output

Regression Results: Predicting Taste based on Multiple Variables
Variable	Coefficient	Std_Error	t_Statistic	p_value	Significance
Intercept	0.34	0.01	34.00	< 0.01	***
Fixed Acidity	0.05	0.01	5.30	< 0.01	***
Volatile Acidity	-0.02	0.02	-1.00	0.05
Citric Acid	0.10	0.03	3.20	< 0.01	***
Residual Sugar	-0.03	0.04	-0.80	0.08
Chlorides	-0.20	0.02	-10.00	< 0.01	***
Free Sulfur Dioxide	0.01	0.03	0.50	0.09
Total Sulfur Dioxide	0.03	0.02	1.50	0.15
Density	-0.50	0.05	-10.00	< 0.01	***
pH	0.02	0.02	1.00	0.02
Sulphates	0.10	0.03	3.30	< 0.01	***
Alcohol	2.27	0.11	20.64	< 0.01	***

Model Statistics
Statistic	Value
Observations	3,961
R²	0.21
Adjusted R²	0.21
Residual Std. Error	0.79 (df = 3959)
F Statistic	1,079.49 (p < 0.01)

The stars in the Significance column show how statistically significant each variable is in predicting the outcome. The more stars, the stronger the evidence of a relationship: *** means the variable is significant at the 1% level (p-value < 0.01), ** at the 5% level (p-value < 0.05), and * at the 10% level (p-value < 0.10). No star means there is not enough evidence that the variable affects the outcome. Stars reflect the strength of the statistical evidence, not the size of the effect.
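
The star convention described above amounts to a simple threshold rule. As a minimal sketch, `p_to_stars` below is a hypothetical helper written for illustration and the p-values are example inputs; regression table packages apply the same kind of rule internally.

```r
# Hypothetical helper: map p-values to significance stars.
p_to_stars <- function(p) {
  stars <- c("***", "**", "*", "")
  # findInterval() counts how many thresholds are <= p:
  # 0 -> "***", 1 -> "**", 2 -> "*", 3 -> no star.
  stars[findInterval(p, c(0.01, 0.05, 0.10)) + 1]
}

# Example p-values for illustration.
p_values <- c(0.001, 0.02, 0.08, 0.15)
significance <- p_to_stars(p_values)

data.frame(p_values, significance)
```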

Regression for Taste

Full Model: Predicting Wine Taste
Dependent variable: Taste Score
Variable	Coefficient
Fixed Acidity	0.02
Volatile Acidity	0.03
Citric Acid	0.18
Residual Sugar	0.01
Chlorides	-0.15
Free Sulfur Dioxide	0.003
Total Sulfur Dioxide	-0.001
Density	-20.43
pH Level	0.17
Sulphates	0.19
Alcohol Content	0.01
Constant	21.27
Observations	500
Note:	*p<0.1; **p<0.05; ***p<0.01

Output Discussion

The full model looks at how different wine characteristics affect the taste score. Among all the factors, density has by far the largest coefficient (-20.43), and it is negative, meaning denser wines tend to score lower on taste. Because density varies over a very narrow range (its standard deviation is about 0.002 to 0.003 within each taste class in the Naive Bayes output), this coefficient is not directly comparable in size to the others. On the positive side, citric acid (0.18), sulphates (0.19), and pH level (0.17) have the largest coefficients. Chlorides (-0.15) lower the taste score, while alcohol content, residual sugar, and the sulfur dioxide measures have coefficients close to zero in this table. Overall, the full model suggests that wines with balanced acidity, some citric content, and proper sulphate levels tend to taste better.
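
Raw regression coefficients depend on the units of each predictor, so a large coefficient does not by itself mean a large effect. The sketch below illustrates this on simulated data (the variables and effect sizes are assumptions made for the example, not estimates from the wine data) by fitting the same model on raw and on standardized variables.

```r
# Simulated stand-in data: density varies very little, citric acid much more.
set.seed(1)
n <- 300
density     <- rnorm(n, mean = 0.994, sd = 0.003)
citric_acid <- rnorm(n, mean = 0.33, sd = 0.10)
taste <- 6 - 20 * (density - 0.994) + 2 * (citric_acid - 0.33) + rnorm(n, sd = 0.2)
wine  <- data.frame(taste, density, citric_acid)

# Raw units: coefficient = change in taste per one unit of the predictor.
raw_fit <- lm(taste ~ density + citric_acid, data = wine)

# Standardized: change in taste (in SDs) per one SD of the predictor.
std_fit <- lm(taste ~ density + citric_acid, data = as.data.frame(scale(wine)))

# The two are linked by: standardized = raw * sd(x) / sd(y).
predictors <- c("density", "citric_acid")
implied <- coef(raw_fit)[predictors] * sapply(wine[predictors], sd) / sd(wine$taste)

round(rbind(raw = coef(raw_fit)[predictors],
            standardized = coef(std_fit)[predictors]), 3)
```

Comparing standardized coefficients, or coefficients alongside each predictor's standard deviation, gives a fairer picture of which variables matter most.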

The model statistics reported with the first regression table show that, while several variables are significant, the overall explanatory power is limited, with an R² of only 0.21. This means that 79% of the variation in taste scores is not explained by the model and may reflect sensory factors or more complex interactions. The F-statistic indicates that the model is statistically significant, but further improvements could be made by adding predictors or exploring non-linear relationships between the variables. Despite the modest R², the first table still points to alcohol content, which has the largest coefficient there (2.27), and several acidity measures as influential in determining wine taste.

Conclusion

In conclusion, using R, I was able to analyse the wine data set and the various properties affecting taste. The visualisations were helpful for seeing trends in the data and how the different properties relate to taste.

Using K-Nearest Neighbors (KNN), I explored how well wine taste could be classified from the chemical properties. KNN reached 93.1% accuracy, but almost entirely by predicting the majority class, and its performance depended on selecting the number of neighbors.

Similarly, Naive Bayes helped me predict taste categories under the assumption of feature independence. This simple and efficient method identified some of the “Bad” and “Excellent” wines that KNN missed, although its overall accuracy was lower (86.5%) and it may have oversimplified the relationships between the variables.

Using Linear Regression, I was able to use the Stargazer package to create tables and neatly represent my regression outputs. The model’s modest R² of 0.21 showed that the measured properties explain only part of the variation in taste. Other properties that were not in the data set should also be considered.

Overall, this analysis gave me a better understanding of the different factors that affect taste and pushed the limits of my understanding. Though I had some trouble producing an effective regression analysis, I pushed through the barriers and presented and analysed the data to the best of my ability.

Final Exam

CS

2025-04-14
