Business Understanding

Overview of this Project:

Food analysis is an emerging branch of analytical chemistry that focuses on providing detailed information about the chemical composition of food. The field is not limited to chemical composition, however; it can also provide insight into the processing, quality control, and possible contamination of food. Analysis like this is necessary to guarantee compliance with food and trade laws, which protects both consumer health and product consistency. In particular, with the large focus on mass production of food in factories and processing plants in the United States, there is increased importance in maintaining consistent taste and appearance in products to ensure consumer satisfaction and trust. In addition, in a free market there is the added pressure of competing with other companies that sell similar products in order to remain in business and be profitable.

As stated above, there are many reasons a company would be interested in analyzing data on its product. The product of interest here is wine. According to Grand View Research, the United States wine market size was estimated at $63.69 billion in 2021. There is clearly a sizable market for this product, so it is important that a company collects information on its wine. If a company wants to enter the wine market, it should know what consumers consider to be “quality wine.” If a company is already in the wine industry, it should make sure to produce wine that people want to buy and consume. In either case, the purpose of a company is to be profitable, and that is what the analysis team hopes to help ensure with this analysis. The analysis will provide insights into both red and white wine quality, covering the many components that make up a specific type of wine. Each of these components has its own relationship with the quality of that red or white wine, which matters to the consumers of that wine as well as to the producers who create it and the distributors who sell it.

Problem Statement:

With this in mind, this project will look into the quality of red and white wine. The analysis team will be looking at the chemical characteristics of the wine, which include: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol content. These characteristics will be treated as the drivers of the output variable, quality, recorded for each wine. The company wants to determine which characteristics of the wine have the largest impact on quality. Quality drives the integrity of the wine in the eyes of its consumers and distributors. In order to retain a customer base, it is extremely important to maintain consistency in the quality and taste of the wine. Knowing the most impactful characteristics will allow the analytics team to provide insight and find a focus for creating the best quality wine possible. It will also allow the company to make informed decisions about adjusting the wine that is already produced in order to improve its quality. Insights into how each characteristic relates to quality for a specific type of wine can show which characteristics deserve the most emphasis in order to deliver an overall higher quality wine.

In addition, this project will look into clustering for both the red and white wine. The analytics team wants to use the data to help the store diversify its wine collection without stocking numerous wines that are nearly identical in many aspects. Wine consumers are typically interested in either a white or a red wine, but their palates can vary greatly, ranging from sweet to semi-sweet to dry. With such varied palates, a variable like quality is subjective and is not the best way for the store to arrange its collection, since taste differs for each consumer. Thus, the team wants to be able to tell the store how to logically organize its wine on the shelves, using some of the tested chemical properties to form distinct groups within the red and the white wine. This should make the store arrangement more objective and concrete. It should be noted, however, that the characteristics that best separate the red wines could differ from those that best separate the white wines. In general, these groupings will not only free up store space for other products, but will likely also save the store money when it purchases its next shipment of wine, since there is no apparent advantage in stocking an abundance of wine that is relatively similar. This analysis also has the potential to inform the advertising language used for customers and distributors. Standard advertising language is geared toward palate preferences such as sweet or dry, so this analysis could point to a new, eye-catching marketing strategy that drives more customer engagement in different wine markets.

MECE List:

  1. Determine the characteristics that have the largest impact on the quality of taste between red and white wine. Consider all the variables made available in the study. Once determined, state clearly which characteristic the company needs to focus on the most to improve or continue the high quality taste of that specific wine. Distinguish any differences between characteristics of quality red and white wine. What makes a red wine a high quality red wine along with what makes a white wine a high quality white wine?

  2. Group the wines based on similar characteristics to limit and diversify the supply in the store. The store has limited space and wants distinctly different wines of each type to fill the shelf space. Identify these clusters considering the quality to be the most important indication on shelf position. The store needs the wine organized in order to maximize their sales and increase their profits. Answer the question: How can a store group the wines based upon similar characteristics to best position their wine selection within the store objectively?

General Information on the Data Set:

The problems listed above will be addressed using the Red Wine Quality data and the White Wine Quality data downloaded from the Kaggle website. The approach begins with data cleaning and preparation. First, the team will remove any rows with missing data. Then, the team plans to identify and remove outliers so that extreme values do not distort the analysis. It is important to treat the red and white wine as two separate groups when finding outliers, since the typical values of each characteristic could differ between them. Following outlier removal, the team expects to combine the two tables, creating a binary variable for red and white (i.e. 0 = red, 1 = white) for identification. On initial exploration, the data set appears to be clean and organized, so little data preparation should be needed beyond the steps above.
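
The three preparation steps described above can be sketched on toy data. This is only an illustration under stated assumptions: the tables `red_raw` and `white_raw`, their values, and the helpers `within_iqr_fences()` and `prepare()` are hypothetical, and the 1.5 × IQR rule is one common outlier convention, not necessarily the one the team ends up using.

```r
# Toy stand-ins for the two wine tables (values are made up for illustration).
red_raw <- data.frame(
  alcohol = c(9.4, 9.8, NA, 10.0, 9.5, 30.0),
  quality = c(5, 5, 6, 6, 5, 5)
)
white_raw <- data.frame(
  alcohol = c(9.0, 9.5, 10.1, 9.9, 10.4, 9.6),
  quality = c(6, 6, 6, 7, NA, 5)
)

# Hypothetical helper: TRUE for values inside the 1.5 * IQR fences.
within_iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x >= q[[1]] - fence & x <= q[[2]] + fence
}

# Hypothetical helper applying the three steps to one table.
prepare <- function(df, color_code) {
  df <- df[complete.cases(df), ]             # 1. drop rows with missing data
  df <- df[within_iqr_fences(df$alcohol), ]  # 2. drop outliers within this color only
  df$color <- color_code                     # 3. add the binary indicator
  df
}

# Stack the cleaned tables: 0 = red, 1 = white.
wine_toy <- rbind(prepare(red_raw, 0), prepare(white_raw, 1))
print(wine_toy)
```

Computing the fences inside `prepare()` means each color is judged against its own distribution, which is the reason for handling the two tables separately before combining them.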

For reference, the red wine data set has 12 columns and just under 1,600 rows, and the white wine data set has 12 columns and about 4,900 rows. The columns all appear to be numeric, which should be beneficial for statistical analysis in R. In addition, the columns in both data sets appear to be identical and collected in the same way. Because the columns match, the two tables can be stacked into one once the binary variable has been created, which allows for analysis within one model. A single combined model allows for comparison between red and white wines, while the separate tables still allow a more specific look at each type of wine.

The data set was originally discovered on the website Kaggle.

An alternative to accessing this data can be found here.

The analytics team also wanted to include the necessary citation for the data set, since it is not owned by the people analyzing it. The citation is included below:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Analytic Techniques

In order to best interpret the wine data for the project's purpose, multiple analytic techniques will be used throughout the process. First, the analytics team will discuss the primary techniques for determining the important characteristics for the taste quality of each wine. The best method to address the first problem is to fit a regression model with quality as the target variable and the rest of the characteristics as possible input variables. This method deserves a significant time investment, as it identifies which characteristics are statistically significant predictors of quality. In addition, once the model has been fit, wine quality can be predicted for new, exploratory values of those significant characteristics. For further understanding of the quality rating, the team can also build decision trees for each wine. This will allow the team to see which variables influence quality in a more visual form, as opposed to a strictly numerical one. The team also plans to use methods such as boosting and random forests to rank the variables in terms of their importance. The team expects similar results from the different approaches, but there will likely be some differences. Knowing this, the team will use a training and a testing set so that the models can be compared against each other. This will allow the team to come up with its own ranking based on a combination of all of the models.
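
As a rough sketch of how a regression view and a tree view of variable importance can be put side by side, the example below fits both to made-up data. The data frame `toy`, its coefficients, and the `noise` column are assumptions for illustration only; `variable.importance` is the importance score that `rpart` stores on a fitted tree.

```r
library(rpart)  # ships with standard R installations as a recommended package

set.seed(1)
n <- 300
toy <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.2),
  noise            = rnorm(n)  # a column with no real effect on quality
)
# Made-up relationship: quality rises with alcohol, falls with volatile acidity.
toy$quality <- 3 + 0.4 * toy$alcohol - 1.5 * toy$volatile.acidity +
  rnorm(n, sd = 0.5)

lin_fit  <- lm(quality ~ ., data = toy)
tree_fit <- rpart(quality ~ ., data = toy, method = "anova")

# Regression view: estimate and t statistic for each predictor.
print(summary(lin_fit)$coefficients[-1, c("Estimate", "t value")])

# Tree view: rpart's importance scores, largest first.
tree_rank <- sort(tree_fit$variable.importance, decreasing = TRUE)
print(tree_rank)
```

The two rankings are on different scales, so they are best compared by order rather than by magnitude.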

For the second problem in the analysis team’s list, K-means clustering should suffice in providing a good idea of the clusters. K-means clustering would be evaluated using a training and testing data set (the same ones used in question 1, for consistency). In addition, the analysis team will use hierarchical clustering, which gives an additional view of the clusters in the data. Since some people learn better visually, the team will also include a dendrogram based on the hierarchical clustering to illustrate the different clusters. A cluster dendrogram allows the team and the store to see whether clear and distinct clusters can be made.
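
A minimal sketch of the two clustering approaches on toy data follows. The two made-up groups, the choice of two clusters, and Ward linkage are assumptions for illustration; the sketch also skips the training and testing split mentioned above.

```r
set.seed(42)
# Toy data: two made-up groups of wines differing in sugar and alcohol.
toy <- data.frame(
  residual.sugar = c(rnorm(20, mean = 2, sd = 0.3), rnorm(20, mean = 12, sd = 0.3)),
  alcohol        = c(rnorm(20, mean = 12, sd = 0.3), rnorm(20, mean = 9, sd = 0.3))
)
scaled <- scale(toy)  # put variables on a common scale before computing distances

# K-means with several random starts to avoid a poor local solution.
km <- kmeans(scaled, centers = 2, nstart = 25)

# Hierarchical clustering on Euclidean distances, cut into two groups.
hc <- hclust(dist(scaled), method = "ward.D2")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two solutions; for well-separated data they should agree.
agreement <- table(kmeans = km$cluster, hierarchical = hc_groups)
print(agreement)

plot(hc, labels = FALSE, main = "Toy cluster dendrogram")
```

Scaling matters here because K-means and `dist()` both work on raw distances, so a variable measured on a larger scale would otherwise dominate the grouping.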

Note that this is not an exhaustive list of techniques that the team will use, but a general idea of the techniques the team may use to answer these questions. This is just an initial examination of potential techniques to complete this project.

Data Understanding, Data Preparation, and Exploratory Data Visualization

Detailed Information on the Data Set:

After establishing a clear and detailed set of questions, the analysis team's next step was to become more familiar with the dataset. The original dataset was collected in October of 2009. The dataset includes red and white wine samples from the north of Portugal. In particular, they are red and white variants of the Portuguese “Vinho Verde” wine. The goal of both the original study and the analytics team is to model wine quality based on the physicochemical test data that has been provided. Note that only physicochemical inputs and a single sensory output (quality) are available as variables, due to privacy and logistical issues. However, upon analysis the team and the Vinho Verde producers will have a far better understanding of which chemical components make better-tasting wines of each color according to their customers. In addition, the team will help to provide a wide variety of wine, while also keeping wine production to a minimum.

Original Data Tables:

As discussed above, the dataset only includes two tables, one for the red wine and one for the white wine. All of the variables in both datasets are identical, thus only one data understanding table was constructed. The exact number of columns and rows for each table (type of wine) is included below:

  1. Winequality-red
  • 12 columns x 1,599 rows
  2. Winequality-white
  • 12 columns x 4,898 rows

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

Data Preparation

The first step to getting the data ready for analysis is to read the tables into R.

#import the read excel library
library(readxl)

red <- read_excel("/Users/kamriefoster/Downloads/winequality-red.xlsx")
white <- read_excel("/Users/kamriefoster/Downloads/winequality-white.xlsx")

Next, the analytics team renamed the columns. The spaces in the variable names were replaced with ‘.’ to make them easier to work with in R. Then the team constructed a new binary variable named “color” so that the data from the red and white wines could be distinguished. Note that red wine is represented with a 0, while white wine is represented with a 1. Immediately following this, the two datasets were combined into one large dataset. The code below demonstrates this in action:

#creating new variable to represent the two different wine types
red$color <- c(0)
white$color <- c(1)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#combining the two tables to create one large table 
wine <- rbind(red, white)
wine
## # A tibble: 6,497 × 13
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7.4             0.7         0               1.9     0.076
##  2           7.8             0.88        0               2.6     0.098
##  3           7.8             0.76        0.04            2.3     0.092
##  4          11.2             0.28        0.56            1.9     0.075
##  5           7.4             0.7         0               1.9     0.076
##  6           7.4             0.66        0               1.8     0.075
##  7           7.9             0.6         0.06            1.6     0.069
##  8           7.3             0.65        0               1.2     0.065
##  9           7.8             0.58        0.02            2       0.073
## 10           7.5             0.5         0.36            6.1     0.071
## # … with 6,487 more rows, and 8 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>, color <dbl>
#checking the outputs to make sure the tables joined correctly 
head(wine)
## # A tibble: 6 × 13
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##           <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
## 1           7.4             0.7         0               1.9     0.076
## 2           7.8             0.88        0               2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.7         0               1.9     0.076
## 6           7.4             0.66        0               1.8     0.075
## # … with 8 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>, color <dbl>
tail(wine)
## # A tibble: 6 × 13
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##           <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
## 1           6.5             0.23        0.38            1.3     0.032
## 2           6.2             0.21        0.29            1.6     0.039
## 3           6.6             0.32        0.36            8       0.047
## 4           6.5             0.24        0.19            1.2     0.041
## 5           5.5             0.29        0.3             1.1     0.022
## 6           6               0.21        0.38            0.8     0.02 
## # … with 8 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>, color <dbl>
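
The column renaming mentioned earlier is not part of the code shown. One way it could be done is sketched below on a toy table; the space-separated header names are assumed to match the original files, and `toy` is a hypothetical stand-in for `red` or `white`.

```r
# Toy table with space-separated headers (assumed to mirror the original files).
toy <- data.frame(
  `fixed acidity`  = c(7.4, 7.8),
  `residual sugar` = c(1.9, 2.6),
  check.names = FALSE  # keep the spaces so the renaming step is visible
)

# Replace each space with a period so columns can be accessed as toy$fixed.acidity.
names(toy) <- gsub(" ", ".", names(toy), fixed = TRUE)
print(names(toy))
```

Doing this on both tables before `rbind()` keeps the column names aligned, which `rbind()` requires for data frames.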

Once the data tables had been combined, it was important for the analytics team and the company to establish what each of the variables in the table means and why it is significant. This is done in the section below.

Data Understanding Table



After having a strong understanding of the data and the variables within the dataset, the analysis team was finally ready to do some initial exploration in R. The work done will be highlighted in the next section.

Initial Work in R:

Summary of the Whole Data Set

To start, a summary of the overall dataset was generated and examined. This gave the analytics team an idea of the data they would be working with.

summary(wine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0        1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00      Median :118.0        Median :0.9949  
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7        Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0        3rd Qu.:0.9970  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.30   Median :6.000  
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49   Mean   :5.818  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :9.000  
##      color       
##  Min.   :0.0000  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.7539  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Some observations to consider from the summary above:

  • The residual sugar has some alarming values that flag potential outliers. The minimum value is 0.600, the mean value is 5.443, and the maximum value is 65.800. Once a model has been created, running tests to determine outliers will be necessary.

  • The free sulfur dioxide also has a wide range of values. The minimum value is 1.00, the mean value is 30.53, and the maximum value is 289.00. These could also indicate outliers in the data set and should be looked into in the same way as described above.

  • Similarly, the total sulfur dioxide has a wide range of values that needs to be explored further for outliers. With a minimum value of 6.0, a mean value of 115.7, and a maximum value of 440.0, this is something the team should take a closer look at.

  • With quality as the target variable, it is necessary to explore this variable a little further. Note that the minimum value is 3, the mean value is 5.8, and the maximum value is 9. The analytics team needs to understand the spread of these values overall and for each wine type.

Group by Wine Color and Look at Different Variables

The analytics team then needed to look at the data by the color/type of wine. In particular, it was necessary to look at the red wine data and the white wine data separately. The team grouped the wine by color using the following code:

by_color <- group_by(wine, color)
by_color
## # A tibble: 6,497 × 13
## # Groups:   color [2]
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7.4             0.7         0               1.9     0.076
##  2           7.8             0.88        0               2.6     0.098
##  3           7.8             0.76        0.04            2.3     0.092
##  4          11.2             0.28        0.56            1.9     0.075
##  5           7.4             0.7         0               1.9     0.076
##  6           7.4             0.66        0               1.8     0.075
##  7           7.9             0.6         0.06            1.6     0.069
##  8           7.3             0.65        0               1.2     0.065
##  9           7.8             0.58        0.02            2       0.073
## 10           7.5             0.5         0.36            6.1     0.071
## # … with 6,487 more rows, and 8 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>, color <dbl>

Wine Quality

Since quality is the target variable the group will need to take some time getting familiar with the variable. First, find the mean values for the quality based on the color.

mean_quality <- summarise(by_color,type = mean(quality, na.rm = TRUE))
mean_quality
## # A tibble: 2 × 2
##   color  type
##   <dbl> <dbl>
## 1     0  5.64
## 2     1  5.88

According to this, white wine on average has a slightly higher quality rating than red wine. However, the difference in averages is not large enough to suggest that the company should stop studying either type in order to increase sales.

After looking at the average quality for each wine color, the team wanted to look at the spread of the quality values for both the red and the white wine. This was done using histograms. First the team looked at the spread of the wine overall, then separated it into its two distinct colors. The code and supporting visuals are included below.

hist(wine$quality, xlim = c(2,10), breaks = 6, xlab="Quality", main="Histogram of Overall Wine Quality")

hist(red$quality, xlim = c(2, 10), breaks = 6, xlab="Quality", main="Histogram of Red Wine Quality")

hist(white$quality, xlim = c(2, 10), breaks = 6, xlab="Quality", main="Histogram of White Wine Quality")

The overall spread of the wine quality is nearly identical to the spread of the white wine quality. This is largely because the team has roughly three times as much data for white wine as for red. The majority of white wine lies between the values of 5 and 6, while the red wine lies predominantly between the values of 4 and 6.
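
Because the white table (4,898 rows) is much larger than the red table (1,599 rows), raw counts make the two colors hard to compare. One option, sketched here on made-up values, is to convert the counts to within-color shares with `prop.table()`; the data frame `toy` is hypothetical.

```r
# Made-up quality scores: 4 red wines (color 0) and 6 white wines (color 1).
toy <- data.frame(
  color   = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  quality = c(5, 5, 6, 4, 6, 6, 5, 7, 6, 5)
)

counts <- table(color = toy$color, quality = toy$quality)

# margin = 1 divides each row by its own total, so each color sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row of `shares` can then be read as the distribution of quality within one color, regardless of how many wines of that color were sampled.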

The team also wanted to look at boxplots to gain insight into potential outliers that may need to be removed from the dataset during analysis. The code and graphs can be seen below:

boxplot(wine$quality)

boxplot(red$quality)

boxplot(white$quality)

The box plots show that there may be some outlying values that need to be removed so that they do not distort the analysis.

Residual Sugar

The team was also interested in the spread of residual sugar, so a similar process was followed. First, the average for each wine color was found, as seen in the code below:

mean_sugar <- summarise(by_color,sugar = mean(residual.sugar, na.rm = TRUE))
mean_sugar
## # A tibble: 2 × 2
##   color sugar
##   <dbl> <dbl>
## 1     0  2.54
## 2     1  6.39

According to these results, white wine on average contains much more residual sugar than red wine. This is helpful to know, as the analytics team is considering sugar as an indicator of some degree of “healthiness” of the wine.

Following finding the averages, like before the team was interested in the spread of values for sugar overall and in the two different types of wine. The team once again used histograms to give a visual of these.

hist(wine$residual.sugar, xlab="Residual Sugar", main="Histogram of Residual Sugar")

hist(red$residual.sugar, xlab="Residual Sugar", main="Histogram of Red Wine Residual Sugar")

hist(white$residual.sugar, xlab="Residual Sugar", main="Histogram of White Wine Residual Sugar")

From the histograms we can see that the values of residual sugar are not normally distributed. This could mean that there are outliers present in our data, or that a transformation may be necessary. In general, both wine colors appear to fall between 0 and 25 grams per cubic decimeter (g/dm³). Also note that the white wine appears to have a larger spread of values than the red wine.
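
To illustrate what a transformation can do for a skewed variable, the sketch below compares skewness before and after a log transform. The values are simulated, `skewness()` is a hypothetical helper, and the log is only one candidate transformation; it is not a claim about which transformation the team applies.

```r
set.seed(7)
# Made-up right-skewed values standing in for residual sugar (g/dm^3).
sugar <- rlnorm(500, meanlog = 1.2, sdlog = 0.8)

# Hypothetical helper: sample skewness (mean of the cubed standardized values).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sugar)
skew_log <- skewness(log(sugar))

cat("skewness, raw:", round(skew_raw, 2), "\n")
cat("skewness, log:", round(skew_log, 2), "\n")
```

A skewness near zero after the transform suggests the long right tail came from the scale of the variable rather than from a handful of bad observations.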

Following the histograms, the team decided to look at the boxplots for each of these as well. This was done to check for the outliers that the histograms hinted at.

boxplot(wine$residual.sugar, main="Residual Sugar Boxplot")

boxplot(red$residual.sugar)

boxplot(white$residual.sugar)

Once again, the box plots highlight the potential for outliers within the dataset that may need to be removed. In particular, there is an extremely high value of residual sugar in the white wine dataset. The red wine box plot shows many outliers, but this is largely due to the small spread of values within the red wine residual sugar, so even modest values fall outside the whiskers.

plot(wine$residual.sugar, wine$quality, xlab="Residual Sugar", ylab="Quality")

plot(red$residual.sugar, red$quality, xlab="Residual Sugar", ylab="Quality", main="Red Wine Scatterplot")

plot(white$residual.sugar, white$quality, xlab="Residual Sugar", ylab="Quality", main="White Wine Scatterplot")

Looking at the scatter plots, there does not seem to be a direct correlation between the residual sugar of a wine and its quality. The points do appear to concentrate around the middle quality ratings, though, which is consistent with the quality histograms shown earlier.

Alcohol Content

The last variable that the team is explicitly interested in is the alcohol content. The same process was followed to learn more about this individual variable. First, the averages for both wine colors were found:

mean_alcohol <- summarise(by_color,alcohol.content = mean(alcohol, na.rm = TRUE))
mean_alcohol
## # A tibble: 2 × 2
##   color alcohol.content
##   <dbl>           <dbl>
## 1     0            10.4
## 2     1            10.5

According to the averages, the alcohol content for red and white is nearly identical. This could mean that the red and white wine share a relatively similar spread in terms of their alcohol content. To see if this is true, the team looked at the wine spread overall, followed by the spread of both the red and white wine.

hist(wine$alcohol, xlab="Alcohol Content", main="Overall Wine Alcohol Content")

hist(red$alcohol, xlab="Alcohol Content", main="Red Wine Alcohol Content")

hist(white$alcohol, xlab="Alcohol Content", main="White Wine Alcohol Content")

The histograms show that the alcohol content is spread over the values of 8 to 15, with most values falling between 9 and 10 for both the red and white wine. This indicates that overall there isn’t a large difference between the alcohol contents of these two types of wine.

Once again, to get an idea of outliers in terms of the alcohol content, boxplots were examined:

boxplot(wine$alcohol, main="Overall Alcohol Content")

boxplot(red$alcohol, main="Red Wine Alcohol Content")

boxplot(white$alcohol, main="White Wine Alcohol Content")

Based on the boxplots above, the projected outlying values appear to come from the red wine data set. When it comes to alcohol content the white wine doesn’t seem to have any alarming values.

Following this, the team looked at scatterplots of the alcohol content versus the wine quality. The code and visuals are displayed below:

plot(wine$alcohol, wine$quality, xlab="Alcohol Content", ylab="Quality", main="Alcohol Content versus Quality Scatterplot")

plot(red$alcohol, red$quality, xlab="Alcohol Content", ylab="Quality", main="Red Wine Scatterplot")

plot(white$alcohol, white$quality, xlab="Alcohol Content", ylab="Quality", main="White Wine Scatterplot")

From the graphs there does appear to be a slight positive correlation between the alcohol content and the quality of the wine. It is not a strong correlation, but a positive tendency can be seen: as the amount of alcohol increases, the quality of the wine tends to increase too.
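
A visual impression like this can be backed up with a correlation coefficient. The sketch below uses simulated data, so the numbers it prints say nothing about the real wine tables; the Spearman version is included as an assumption-light option because quality is an ordered integer score.

```r
set.seed(3)
n <- 400
alcohol <- runif(n, 8, 14)
# Made-up integer quality scores (clamped to 3..9) that drift upward with alcohol.
quality <- pmin(9, pmax(3, round(2 + 0.35 * alcohol + rnorm(n, sd = 0.8))))

pearson  <- cor(alcohol, quality)                       # linear association
spearman <- cor(alcohol, quality, method = "spearman")  # rank-based association
print(c(pearson = pearson, spearman = spearman))

# cor.test() adds a p-value for the null hypothesis of zero correlation;
# exact = FALSE avoids the exact-p-value warning that tied scores would trigger.
p_value <- cor.test(alcohol, quality, method = "spearman", exact = FALSE)$p.value
print(p_value)
```

On the real data the same calls would be made with `wine$alcohol` and `wine$quality`, or with the `red` and `white` tables separately.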

Note that there are likely more significant variables that were not discussed in detail in this section. However, all variables will be evaluated and thoroughly examined in the next section.

Modeling and Evaluation

Libraries Needed for Analysis

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(gbm)
## Loaded gbm 2.1.8.1
library(fpc)

  1. library(MASS) - used to generate the Box-Cox plots for potential transformations of the linear regression target variable.
  2. library(rpart) - used to fit the decision trees.
  3. library(rpart.plot) - used to draw the graphs (visuals) for the decision trees.
  4. library(randomForest) - required to run a random forest model in R.
  5. library(gbm) - required to create a boosting model.
  6. library(fpc) - used for the K-means clustering analysis.

Generating Training and Testing Sets

To start the modeling and evaluation phase, the team must first split the data into a training and a testing set using a 90-10 split: 90% of the data will be used in the training set, while the remaining 10% will be used in the testing set. Since the team is more concerned with the relative importance of the variables than with the predictive ability of the models, this split is appropriate. However, the team still needs training and testing sets in order to calculate each model's mean squared error (MSE). The MSE values will allow the analytics team to compare the different models and evaluate which one performs best. Thus, if two of the models contradict each other, the team will be able to say which one is better supported, based on the MSE values calculated for each model.

#creating testing and training sets for the white wine
index <- sample(nrow(white), nrow(white)*0.90)
white_train <- white[index,]
white_test <- white[-index,]

#creating testing and training sets for the red wine
index <- sample(nrow(red), nrow(red)*0.90)
red_train <- red[index,]
red_test <- red[-index,]
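
Because `sample()` draws a different random split on every run, results such as coefficients and MSE values change slightly each time the report is knitted. A minimal sketch of a reproducible 90-10 split with a test MSE follows; the seed value, the toy data, and the helper `mse()` are assumptions for illustration.

```r
set.seed(2022)  # any fixed seed works; 2022 is an arbitrary choice

# Toy stand-in for one of the wine tables.
toy <- data.frame(alcohol = runif(200, 8, 14))
toy$quality <- 2 + 0.35 * toy$alcohol + rnorm(200, sd = 0.7)

# 90% of the row numbers go to training; the rest form the testing set.
index <- sample(nrow(toy), floor(nrow(toy) * 0.90))
toy_train <- toy[index, ]
toy_test  <- toy[-index, ]

fit <- lm(quality ~ alcohol, data = toy_train)

# Test MSE: average squared gap between held-out quality and its prediction.
mse <- function(actual, predicted) mean((actual - predicted)^2)
test_mse <- mse(toy_test$quality, predict(fit, newdata = toy_test))
print(test_mse)
```

Computing the MSE on `toy_test` rather than `toy_train` is what makes it a fair basis for comparing models, since none of them has seen those rows.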

Answering MECE List Question 1

Now that the training and testing sets have been created, the different models can be created to answer the two MECE List questions. The team first focused on MECE List Question 1. It has been displayed below for convenience.

Reminder of MECE Question 1:

Determine the characteristics that have the largest impact on the quality of taste between red and white wine. Consider all the variables made available in the study. Once determined, state clearly which characteristic the company needs to focus on the most to improve or continue the high quality taste of that specific wine. Distinguish any differences between characteristics of quality red and white wine. What makes a red wine a high quality red wine along with what makes a white wine a high quality white wine?

White Wine Simple Linear Regression Model

The first approach the team took to answer this question was to fit a linear regression model for each wine type (one for red and one for white). The regression output shows which variables are statistically significant in relation to the quality of the wine.

white_reg <- lm(quality ~ ., data = white_train)
summary(white_reg)
## 
## Call:
## lm(formula = quality ~ ., data = white_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7705 -0.4927 -0.0416  0.4683  3.1092 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.606e+02  1.976e+01   8.129 5.58e-16 ***
## fixed.acidity         7.388e-02  2.203e-02   3.353 0.000805 ***
## volatile.acidity     -1.896e+00  1.197e-01 -15.840  < 2e-16 ***
## citric.acid          -9.431e-03  1.015e-01  -0.093 0.925958    
## residual.sugar        8.622e-02  7.932e-03  10.869  < 2e-16 ***
## chlorides            -2.519e-02  5.966e-01  -0.042 0.966323    
## free.sulfur.dioxide   3.594e-03  8.923e-04   4.028 5.73e-05 ***
## total.sulfur.dioxide -4.251e-04  4.007e-04  -1.061 0.288768    
## density              -1.609e+02  2.004e+01  -8.029 1.25e-15 ***
## pH                    7.546e-01  1.112e-01   6.783 1.33e-11 ***
## sulphates             6.707e-01  1.064e-01   6.304 3.19e-10 ***
## alcohol               1.800e-01  2.544e-02   7.078 1.70e-12 ***
## color                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.758 on 4396 degrees of freedom
## Multiple R-squared:  0.2811, Adjusted R-squared:  0.2793 
## F-statistic: 156.2 on 11 and 4396 DF,  p-value: < 2.2e-16

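The following variables are removed from the initial white wine model. The first three were not statistically significant, and the coefficient for color could not be estimated (NA) because color takes a single value within the white wine data: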

  • citric.acid
  • chlorides
  • total.sulfur.dioxide
  • color

Now remake the model with the remaining variables to see if they are all considered statistically significant:

white_reg2 <- lm(quality ~ fixed.acidity + volatile.acidity + residual.sugar + free.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)
summary(white_reg2)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     free.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     data = white_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7516 -0.4965 -0.0440  0.4649  3.1973 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.651e+02  1.903e+01   8.676  < 2e-16 ***
## fixed.acidity        7.457e-02  2.157e-02   3.457 0.000552 ***
## volatile.acidity    -1.919e+00  1.152e-01 -16.656  < 2e-16 ***
## residual.sugar       8.747e-02  7.688e-03  11.377  < 2e-16 ***
## free.sulfur.dioxide  3.019e-03  7.127e-04   4.236 2.32e-05 ***
## density             -1.655e+02  1.929e+01  -8.580  < 2e-16 ***
## pH                   7.598e-01  1.091e-01   6.967 3.72e-12 ***
## sulphates            6.619e-01  1.059e-01   6.250 4.49e-10 ***
## alcohol              1.786e-01  2.528e-02   7.064 1.87e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7578 on 4399 degrees of freedom
## Multiple R-squared:  0.2809, Adjusted R-squared:  0.2796 
## F-statistic: 214.8 on 8 and 4399 DF,  p-value: < 2.2e-16

For the linear regression model, the predictive power is low (adjusted R-squared of about 0.28). However, since the team’s goal is not to predict wine quality but to find variable significance, the model is still useful. The model above identified the following variables as statistically significant:

  • fixed.acidity
  • volatile.acidity
  • residual.sugar
  • free.sulfur.dioxide
  • density
  • pH
  • sulphates
  • alcohol

White Wine Unit Normal Scaling

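The analytics team looked more closely at these variables to confirm their significance. To rank them, the team used unit normal scaling, which puts every variable on the same scale (mean 0, standard deviation 1) so that the coefficient estimates can be compared directly. The following is the code to do this: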

white_unit_normal = as.data.frame(apply(white_train, 2, function(x){(x - mean(x))/sd(x)}))
white_reg2_unit_normal <- lm(quality ~ fixed.acidity + volatile.acidity + residual.sugar + free.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_unit_normal)
summary(white_reg2_unit_normal)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     free.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     data = white_unit_normal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2021 -0.5561 -0.0492  0.5207  3.5811 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.131e-15  1.278e-02   0.000 1.000000    
## fixed.acidity        7.069e-02  2.045e-02   3.457 0.000552 ***
## volatile.acidity    -2.194e-01  1.317e-02 -16.656  < 2e-16 ***
## residual.sugar       4.959e-01  4.358e-02  11.377  < 2e-16 ***
## free.sulfur.dioxide  5.796e-02  1.368e-02   4.236 2.32e-05 ***
## density             -5.537e-01  6.454e-02  -8.580  < 2e-16 ***
## pH                   1.288e-01  1.849e-02   6.967 3.72e-12 ***
## sulphates            8.492e-02  1.359e-02   6.250 4.49e-10 ***
## alcohol              2.461e-01  3.483e-02   7.064 1.87e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8488 on 4399 degrees of freedom
## Multiple R-squared:  0.2809, Adjusted R-squared:  0.2796 
## F-statistic: 214.8 on 8 and 4399 DF,  p-value: < 2.2e-16

According to the unit normal scaling above, the following is the ranking of importance of the variables found to be significant in the white wine linear regression model. The most important variables have the largest absolute values of the Estimates displayed above:

  1. density
  2. residual sugar
  3. alcohol
  4. volatile acidity
  5. pH
  6. sulphates
  7. fixed acidity
  8. free sulfur dioxide
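A ranking like the one above can also be produced programmatically instead of by reading the Estimates column. The sketch below uses R's built-in mtcars data as a stand-in for the wine training set and the base function scale() in place of the apply() call; the variable choices are assumptions for illustration only.

```r
# mtcars ships with R and stands in for the wine training set here.
predictors <- c("wt", "hp", "qsec")

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
scaled <- as.data.frame(scale(mtcars[, c("mpg", predictors)]))

fit <- lm(mpg ~ wt + hp + qsec, data = scaled)

# Drop the intercept (numerically zero after scaling) and rank by absolute size.
std_coef <- coef(fit)[-1]
ranking <- sort(abs(std_coef), decreasing = TRUE)
print(ranking)
```

Sorting the absolute standardized coefficients removes the risk of misordering two estimates that are close in size.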

Following completion of the model, the team evaluated how well it performed by doing a model assessment. The assessment was done so that this model can be compared with the others that will be created.

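# Out-of-sample prediction (test error)
# white_reg_pred is a vector that contains the predicted values for the test set.
# (Named white_reg_pred rather than pi to avoid masking R's built-in constant pi.)
white_reg_pred <- predict(white_reg2, white_test)

# MSE - Mean Squared Error
mean((white_reg_pred - white_test$quality)^2)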
## [1] 0.4758651

The mean squared error value will be used to directly compare the models to one another. The mean squared error of the white wine regression with no transformations is about 0.476.
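Since the same MSE calculation is repeated for every model, it can be wrapped in a small function. The sketch below also compares the model against a naive baseline that always predicts the training mean, which gives the MSE a point of reference. The function name mse, the baseline comparison, and the use of the built-in mtcars data are assumptions for illustration and are not part of the team's analysis.

```r
# Hypothetical helper: mean squared error between actual and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Stand-in data and split (mtcars ships with R).
set.seed(1)
index <- sample(nrow(mtcars), floor(0.75 * nrow(mtcars)))
train <- mtcars[index, ]
test  <- mtcars[-index, ]

fit <- lm(mpg ~ wt + hp, data = train)

model_mse    <- mse(test$mpg, predict(fit, newdata = test))
baseline_mse <- mse(test$mpg, mean(train$mpg))   # always predict the training mean

c(model = model_mse, baseline = baseline_mse)
```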

Exploring Potential Transformations for White Wine

Next, a scatter plot of the residuals against the fitted values was created to determine if a transformation may be necessary.

plot(white_reg2$fitted.values, white_reg2$residuals)
abline(h = 0)

The scatterplot shows that a transformation could be beneficial. To check this, the team started with a Box-Cox transformation on the target variable quality and used the Box-Cox plot to determine the best transformation to perform.

boxcox(white_reg2)

Based on this plot, it does not appear that a Box-Cox transformation is what the regression model needs. However, there could be benefits in transforming some of the x-variables that were found to be significant. Any useful transformations should be visible in the array of scatterplots below:

#create the set that has the training set and the significant variables identified 
#by the second simple linear regression model

#white_reg2_data <- cbind(white_train$density, white_train$residual.sugar, white_train$alcohol, white_train$volatile.acidity, white_train$pH, white_train$sulphates, white_train$fixed.acidity, white_train$free.sulfur.dioxide, white_train$quality)

#pairs(white_reg2_data)

pairs(white_train)

However, after looking at each variable plotted against quality, there do not appear to be any obvious transformations that would help the model. Thus, after creating the white wine regression model, the team followed the same process for the red wine.
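The Box-Cox check described above can also be read numerically rather than from the plot alone. The sketch below, which assumes the MASS package is available and uses the built-in mtcars data as a stand-in, extracts the lambda with the highest profile log-likelihood and an approximate 95% interval. If that interval contains 1, the plot is not suggesting a transformation of the response.

```r
library(MASS)

# Stand-in model; the response must be strictly positive for Box-Cox.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# plotit = FALSE returns the lambda grid (x) and log-likelihood (y) without drawing.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood is within
# qchisq(0.95, 1) / 2 of the maximum.
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

lambda_hat
ci
```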

Red Wine Simple Linear Regression Model

red_reg <- lm(quality ~., data = red_train)
summary(red_reg)
## 
## Call:
## lm(formula = quality ~ ., data = red_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.63340 -0.36315 -0.04728  0.44616  2.02541 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          13.7522878 22.3452341   0.615   0.5384    
## fixed.acidity         0.0082698  0.0274937   0.301   0.7636    
## volatile.acidity     -1.0982378  0.1298528  -8.458  < 2e-16 ***
## citric.acid          -0.2122706  0.1577633  -1.346   0.1787    
## residual.sugar        0.0245083  0.0157164   1.559   0.1191    
## chlorides            -1.8399412  0.4542700  -4.050 5.39e-05 ***
## free.sulfur.dioxide   0.0037908  0.0023026   1.646   0.0999 .  
## total.sulfur.dioxide -0.0031936  0.0007695  -4.150 3.52e-05 ***
## density              -9.2554448 22.8104424  -0.406   0.6850    
## pH                   -0.5182250  0.2033107  -2.549   0.0109 *  
## sulphates             0.9123560  0.1200465   7.600 5.34e-14 ***
## alcohol               0.2857361  0.0279826  10.211  < 2e-16 ***
## color                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6485 on 1427 degrees of freedom
## Multiple R-squared:  0.3596, Adjusted R-squared:  0.3547 
## F-statistic: 72.86 on 11 and 1427 DF,  p-value: < 2.2e-16

The following variables are removed from the initial red wine model. The first five were not statistically significant at the 0.05 level, and the coefficient for color could not be estimated (NA) because color takes a single value within the red wine data:

  • fixed.acidity
  • citric.acid
  • residual.sugar
  • free.sulfur.dioxide
  • density
  • color

Now remake the model with the remaining variables to see if they are all considered statistically significant:

red_reg2 <- lm(quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + pH + sulphates + alcohol, data = red_train)
summary(red_reg2)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = red_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60702 -0.35295 -0.05101  0.45634  1.94513 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.2248784  0.4192460  10.077  < 2e-16 ***
## volatile.acidity     -1.0231932  0.1068818  -9.573  < 2e-16 ***
## chlorides            -1.8913225  0.4292911  -4.406 1.13e-05 ***
## total.sulfur.dioxide -0.0022300  0.0005354  -4.165 3.30e-05 ***
## pH                   -0.4277785  0.1220915  -3.504 0.000473 ***
## sulphates             0.8799711  0.1153087   7.631 4.21e-14 ***
## alcohol               0.2932070  0.0178229  16.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6495 on 1432 degrees of freedom
## Multiple R-squared:  0.3554, Adjusted R-squared:  0.3527 
## F-statistic: 131.6 on 6 and 1432 DF,  p-value: < 2.2e-16

For the linear regression model above, the following variables were found to be statistically significant:

  • volatile.acidity
  • chlorides
  • total.sulfur.dioxide
  • pH
  • sulphates
  • alcohol

The analytics team looked more closely at these variables to confirm their significance. The team standardized the variables in order to rank their importance relative to the quality of the wine. The standardization process is shown below:

Red Wine Unit Normal Scaling

red_unit_normal = as.data.frame(apply(red_train, 2, function(x){(x - mean(x))/sd(x)}))
red_reg2_unit_normal <- lm(quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + pH + sulphates + alcohol, data = red_unit_normal)
summary(red_reg2_unit_normal)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = red_unit_normal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2293 -0.4372 -0.0632  0.5653  2.4094 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.164e-15  2.121e-02   0.000 1.000000    
## volatile.acidity     -2.254e-01  2.355e-02  -9.573  < 2e-16 ***
## chlorides            -1.087e-01  2.467e-02  -4.406 1.13e-05 ***
## total.sulfur.dioxide -9.101e-02  2.185e-02  -4.165 3.30e-05 ***
## pH                   -8.225e-02  2.347e-02  -3.504 0.000473 ***
## sulphates             1.866e-01  2.445e-02   7.631 4.21e-14 ***
## alcohol               3.885e-01  2.362e-02  16.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8046 on 1432 degrees of freedom
## Multiple R-squared:  0.3554, Adjusted R-squared:  0.3527 
## F-statistic: 131.6 on 6 and 1432 DF,  p-value: < 2.2e-16

According to the unit normal scaling above, the following is the ranking of importance of the variables found to be significant in the red wine linear regression model. The most important variables have the largest absolute values of the Estimates displayed above:

  1. alcohol
  2. volatile acidity
  3. sulphates
  4. chlorides
  5. total sulfur dioxide
  6. pH

Following completion of the model, the team evaluated how well it performed by doing a model assessment. The assessment was done so that this model can be compared with the others that will be created.

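# Out-of-sample prediction (test error)
# red_reg_pred is a vector that contains the predicted values for the test set.
# (Named red_reg_pred rather than pi to avoid masking R's built-in constant pi.)
red_reg_pred <- predict(red_reg2, red_test)

# MSE - Mean Squared Error
mean((red_reg_pred - red_test$quality)^2)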
## [1] 0.411694

The mean squared error value will be used to directly compare the models to one another. The mean squared error of the red wine regression model with no transformations is about 0.412.

Exploring Potential Transformations for Red Wine

Next, a scatter plot of the residuals against the fitted values was created to determine if a transformation may be necessary.

plot(red_reg2$fitted.values, red_reg2$residuals)
abline(h = 0)

The scatterplot shows that a transformation could be beneficial. To check this, the team started with a Box-Cox transformation on the target variable quality and used the Box-Cox plot to determine the best transformation to perform.

boxcox(red_reg2)

Based on this plot, it does not appear that a Box-Cox transformation is what the regression model needs. However, there could be benefits in transforming some of the x-variables that were found to be significant. Any useful transformations should be visible in the array of scatterplots below:

pairs(red_train)

However, after looking at each variable plotted against quality, there do not appear to be any obvious transformations that would help the model. Thus, a different type of model will now be explored. The analytics team developed a few more ways to determine variable importance.

Simple Decision Trees

Next, regression decision trees were created, starting with the white wine. A single tree does not directly rank the variables in a structured order like the linear regression models above. However, it still gives a general idea of which variables play a role in determining the quality of both types of wine.

White wine decision trees:

white_rpart <- rpart(formula = quality ~ ., data = white_train)
white_rpart
## n= 4408 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 4408 3512.8640 5.876588  
##    2) alcohol< 10.85 2784 1714.1020 5.607040  
##      4) volatile.acidity>=0.2375 1741  882.5526 5.402068  
##        8) volatile.acidity>=0.3025 822  337.2652 5.238443 *
##        9) volatile.acidity< 0.3025 919  503.5952 5.548422 *
##      5) volatile.acidity< 0.2375 1043  636.3068 5.949185 *
##    3) alcohol>=10.85 1624 1249.7320 6.338670  
##      6) free.sulfur.dioxide< 11.5 100  106.1900 5.410000 *
##      7) free.sulfur.dioxide>=11.5 1524 1051.6400 6.399606  
##       14) alcohol< 11.74167 742  495.4407 6.192722 *
##       15) alcohol>=11.74167 782  494.3069 6.595908 *
prp(white_rpart, digits = 4, extra = 1)

From the decision tree it is clear that alcohol content is an important factor in determining quality. An alcohol level of at least 10.85 together with a free sulfur dioxide level of at least 11.5 appears to be the largest contributor to high quality white wine. This branch (node 7) accounts for 1,524 rows of the training data with a mean quality of about 6.40.

white_test_pred_tree = predict(white_rpart, white_test)
#MSE - Mean Squared Error
mean((white_test_pred_tree - white_test$quality)^2)
## [1] 0.4677441

The mean squared error of this model is about 0.468, which is slightly lower than that of the linear regression model (about 0.476). It is also important to notice that the tree’s indication that alcohol and free sulfur dioxide are important agrees with the regression model, where both were statistically significant. Thus, the models support one another’s findings about the significant variables in determining quality.
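Although the printed tree does not rank variables, a fitted rpart object stores a variable.importance component that does. It sums the improvement from each split where a variable is the primary splitter and also credits surrogate splits, so variables that never appear in the printed tree can still receive a score. The sketch below assumes the rpart package is available and uses the built-in mtcars data as a stand-in for the wine training set.

```r
library(rpart)

# Stand-in regression tree; mtcars ships with R.
fit <- rpart(mpg ~ ., data = mtcars)

# Named numeric vector of importance scores, sorted from most to least important.
importance <- sort(fit$variable.importance, decreasing = TRUE)

# Express each score as a percentage of the total for easier reading.
round(100 * importance / sum(importance), 1)
```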

A similar process was then followed for the training set on the red wine.

Red Wine Decision Tree:

red_rpart <- rpart(formula = quality ~ ., data = red_train)
red_rpart
## n= 1439 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1439 937.18970 5.635858  
##    2) alcohol< 10.525 875 380.42970 5.363429  
##      4) sulphates< 0.575 351 112.87750 5.142450 *
##      5) sulphates>=0.575 524 238.93130 5.511450  
##       10) volatile.acidity>=0.335 467 187.13060 5.441113 *
##       11) volatile.acidity< 0.335 57  30.56140 6.087719 *
##    3) alcohol>=10.525 564 391.06910 6.058511  
##      6) sulphates< 0.645 251 170.47810 5.721116  
##       12) volatile.acidity>=0.87 22  19.09091 4.636364 *
##       13) volatile.acidity< 0.87 229 123.01310 5.825328  
##         26) volatile.acidity>=0.385 167  82.17964 5.682635 *
##         27) volatile.acidity< 0.385 62  28.27419 6.209677 *
##      7) sulphates>=0.645 313 169.10540 6.329073  
##       14) alcohol< 11.55 187  88.26738 6.096257 *
##       15) alcohol>=11.55 126  55.65873 6.674603 *
prp(red_rpart, digits = 4, extra = 1)

Similar to white wine, alcohol appears to be one of the most influential variables in determining quality for the red wine, followed by sulphates and volatile acidity. Specifically, red wines with an alcohol content of at least 10.525 and sulphates of at least 0.645 (node 7) have a mean quality of about 6.33, and this branch includes 313 rows of the red wine training data. Within the other branches, lower volatile acidity is associated with higher quality (for example, nodes 11 and 27, with mean quality of about 6.09 and 6.21).

red_test_pred_tree = predict(red_rpart, red_test)
#MSE - Mean Squared Error
mean((red_test_pred_tree - red_test$quality)^2)
## [1] 0.5238108

The mean squared error of this model is about 0.524, which is higher than that of the linear regression model (about 0.412). However, it is important to notice that the tree’s indication that alcohol, sulphates, and volatile acidity are important matches what was found with the regression model. Thus, the models support one another’s findings about the significant variables in determining quality.

After building the simple decision trees, a different model was created. The next model used on the red and white wine data was random forests.

Generating Random Forests

The code to create the random forests for the wine data is shown below. This serves as another model that ranks variable importance relative to the target variable quality.

white_rf <- randomForest(quality~., data = white_train, importance = TRUE)
white_rf
## 
## Call:
##  randomForest(formula = quality ~ ., data = white_train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.3607324
##                     % Var explained: 54.73

After the model has been created, variable importance can be found by using the following code. The results displayed here will rank the importance similar to that of the regression model done earlier. Note that the variables with the highest %IncMSE are the most important variables.

white_rf$importance
##                         %IncMSE IncNodePurity
## fixed.acidity        0.08237745      210.1871
## volatile.acidity     0.21329098      377.7725
## citric.acid          0.09558745      231.3114
## residual.sugar       0.16107224      253.4245
## chlorides            0.12088027      268.8710
## free.sulfur.dioxide  0.15544769      353.1036
## total.sulfur.dioxide 0.11784338      274.3808
## density              0.25340158      380.6179
## pH                   0.10011867      237.9446
## sulphates            0.06415603      200.2912
## alcohol              0.41886399      568.9979
## color                0.00000000        0.0000

This is the ranking of importance that the random forests found for the white wine.

  1. alcohol
  2. density
  3. volatile acidity
  4. residual sugar
  5. free sulfur dioxide
  6. chlorides
  7. total sulfur dioxide
  8. pH
  9. citric acid
  10. fixed acidity
  11. sulphates
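A ranking like the one above can be derived directly from the printed %IncMSE column rather than by eye. The sketch below copies the values from the white wine output shown earlier into a vector (the name inc_mse is hypothetical, and color is omitted because its importance is zero) and sorts them.

```r
# %IncMSE values copied from the white_rf$importance output above.
inc_mse <- c(
  fixed.acidity        = 0.08237745,
  volatile.acidity     = 0.21329098,
  citric.acid          = 0.09558745,
  residual.sugar       = 0.16107224,
  chlorides            = 0.12088027,
  free.sulfur.dioxide  = 0.15544769,
  total.sulfur.dioxide = 0.11784338,
  density              = 0.25340158,
  pH                   = 0.10011867,
  sulphates            = 0.06415603,
  alcohol              = 0.41886399
)

# Order variable names from most to least important.
ranking <- names(sort(inc_mse, decreasing = TRUE))
data.frame(rank = seq_along(ranking), variable = ranking)
```

The same pattern applies to the rel.inf column of a boosting summary.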

Then a model assessment was done by doing a prediction on the testing set.

white_rf_pred <- predict(white_rf, white_test)
# any data set with the same columns can be supplied in place of white_test

mean((white_test$quality - white_rf_pred)^2)
## [1] 0.2815972

The MSE for this model was found to be about 0.282, which is considerably better than the other two models created so far to answer the first MECE list question. Thus, at least with this training and testing set, this ranking appears to be the best one for variable importance.

Now the same process will be followed for the red wine.

Red Wine Random Forest

red_rf <- randomForest(quality~., data = red_train, importance = TRUE)
red_rf
## 
## Call:
##  randomForest(formula = quality ~ ., data = red_train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.3216685
##                     % Var explained: 50.61

After the model has been created, variable importance can be found by using the following code. The results displayed here will rank the importance similar to that of the regression model done earlier. Note that the variables with the highest %IncMSE are the most important variables.

red_rf$importance
##                         %IncMSE IncNodePurity
## fixed.acidity        0.05105115      49.43235
## volatile.acidity     0.12285780     114.65898
## citric.acid          0.07096039      59.10683
## residual.sugar       0.03092518      48.95228
## chlorides            0.05390661      60.41635
## free.sulfur.dioxide  0.04383920      43.90709
## total.sulfur.dioxide 0.08747885      71.03016
## density              0.08998264      68.73128
## pH                   0.04641160      51.89329
## sulphates            0.17567684     128.40042
## alcohol              0.26325625     196.39440
## color                0.00000000       0.00000

This is the ranking of importance that the random forests found for the red wine.

  1. alcohol
  2. sulphates
  3. volatile acidity
  4. density
  5. total sulfur dioxide
  6. citric acid
  7. chlorides
  8. fixed acidity
  9. pH
  10. free sulfur dioxide
  11. residual sugar

Then a model assessment was done by doing a prediction on the testing set.

red_rf_pred <- predict(red_rf, red_test)
# any data set with the same columns can be supplied in place of red_test

mean((red_test$quality - red_rf_pred)^2)
## [1] 0.3432787

The MSE for this model was found to be about 0.343, which is considerably better than the other two models created so far to answer the first MECE list question. Thus, at least with this training and testing set, this ranking appears to be the best one for variable importance.

Following the random forests, the last model the team created for the red and white wine analysis was boosting.

Boosting Models for the Wines

Now it is time to create the boosting model. The following code demonstrates how this is done. First, starting with the white wine.

white_boost <- gbm(formula = quality~., data=white_train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 8)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 12: color has no variation.
summary(white_boost) 

##                                       var   rel.inf
## alcohol                           alcohol 17.801682
## free.sulfur.dioxide   free.sulfur.dioxide 11.198183
## volatile.acidity         volatile.acidity 10.963910
## residual.sugar             residual.sugar  8.764238
## total.sulfur.dioxide total.sulfur.dioxide  8.521944
## density                           density  8.375762
## pH                                     pH  7.727910
## fixed.acidity               fixed.acidity  6.839731
## chlorides                       chlorides  6.815873
## citric.acid                   citric.acid  6.678154
## sulphates                       sulphates  6.312612
## color                               color  0.000000

Based on the boosting model created for the white wine, the rankings of importance of these variables were determined by the rel.inf column. This column stands for relative influence; the higher the value, the more important the variable is in determining the wine quality. The rankings based on these values for white wine are displayed below:

  1. alcohol
  2. free sulfur dioxide
  3. volatile acidity
  4. residual sugar
  5. total sulfur dioxide
  6. density
  7. pH
  8. fixed acidity
  9. chlorides
  10. citric acid
  11. sulphates

Next, the team did a model assessment in order to see how this model compares to the others the team had created. The assessment is shown below:

white_boost_pred_test <- predict(white_boost, white_test, n.trees = 10000)
mean((white_test$quality - white_boost_pred_test)^2)
## [1] 0.3248569

The MSE is still lower than that of the linear regression model and the decision tree; however, it is slightly higher than that of the random forest. The ranking of important variables is similar to what was found in the other models. A boosting model will now be made for the red wine.

red_boost <- gbm(formula = quality~., data=red_train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 8)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 12: color has no variation.
summary(red_boost) 

##                                       var   rel.inf
## alcohol                           alcohol 17.751832
## volatile.acidity         volatile.acidity 12.695594
## sulphates                       sulphates 11.220348
## total.sulfur.dioxide total.sulfur.dioxide  9.340146
## chlorides                       chlorides  8.675430
## density                           density  7.971040
## pH                                     pH  7.627874
## citric.acid                   citric.acid  7.313409
## fixed.acidity               fixed.acidity  6.555281
## residual.sugar             residual.sugar  5.802245
## free.sulfur.dioxide   free.sulfur.dioxide  5.046800
## color                               color  0.000000

Based on the boosting model created for the red wine, the rankings of importance of these variables were found. As a reminder, the rel.inf column was used to determine these rankings. A higher rel.inf means more influence on the target variable quality. These results are presented below:

  1. alcohol
  2. volatile acidity
  3. sulphates
  4. total sulfur dioxide
  5. chlorides
  6. density
  7. pH
  8. citric acid
  9. fixed acidity
  10. residual sugar
  11. free sulfur dioxide
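How closely the boosting ranking above agrees with the random forest ranking can be quantified with a Spearman rank correlation, where a value near 1 means the two models order the variables almost identically. This is an illustrative cross-check and not part of the team's method. The vectors below copy the red wine %IncMSE and rel.inf values from the outputs shown earlier, and the vector names are hypothetical.

```r
# %IncMSE from red_rf$importance (color omitted).
rf_inc_mse <- c(
  fixed.acidity = 0.05105115, volatile.acidity = 0.12285780,
  citric.acid = 0.07096039, residual.sugar = 0.03092518,
  chlorides = 0.05390661, free.sulfur.dioxide = 0.04383920,
  total.sulfur.dioxide = 0.08747885, density = 0.08998264,
  pH = 0.04641160, sulphates = 0.17567684, alcohol = 0.26325625
)

# rel.inf from summary(red_boost) (color omitted).
boost_rel_inf <- c(
  alcohol = 17.751832, volatile.acidity = 12.695594, sulphates = 11.220348,
  total.sulfur.dioxide = 9.340146, chlorides = 8.675430, density = 7.971040,
  pH = 7.627874, citric.acid = 7.313409, fixed.acidity = 6.555281,
  residual.sugar = 5.802245, free.sulfur.dioxide = 5.046800
)

# Align the two vectors by variable name before comparing.
boost_rel_inf <- boost_rel_inf[names(rf_inc_mse)]

rho <- cor(rf_inc_mse, boost_rel_inf, method = "spearman")
rho
```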

Next, the team did a model assessment in order to see how this model compares to the others the team had created. The assessment is shown below:

red_boost_pred_test <- predict(red_boost, red_test, n.trees = 10000)
mean((red_test$quality - red_boost_pred_test)^2)
## [1] 0.3547087

The MSE is still lower than that of the linear regression model and the decision tree; however, it is slightly higher than that of the random forest. The ranking of important variables is again similar to what was found in the other models. After creating four different models, the team was ready to make a final conclusion about the rankings of variable importance for each of the red and white wines. First, for convenience, the team pulled all of the information for each wine type together:

Collection of the Models, their Results, and MSE Values

The results of all of the white wine models for the significance of each variable are presented in the table below:

Based on these results, the team’s final conclusion for the top 4 variables (ranked by significance) for white wine was determined to be:

1. alcohol
2. volatile acidity
3. density
4. residual sugar

These results were picked based on information presented in all of the models that the team created. These results were not merely a copy and paste presentation of variable significance based on one model. The team carefully considered each of these models and the results that they contributed to determine a ranking of their own.

More insight to why certain variables were chosen for certain values are included in the description below:

The team decided that since 3 of the 4 models had alcohol as the most influential variable for quality, it was placed at number 1. Volatile acidity was then ranked second due to its position at third in the two better models and fourth in the linear regression. The next spot in the team’s ranking went to density for being second in the random forest (the best model) and first in the linear regression (the worst model). However, density was sixth in the boosting model and did not show up in the decision tree. The fourth and final spot was given to residual sugar, which was fourth in both of the two best models as well as second in the linear regression. This was the logic behind the ranking presented above.

The results of all of the red wine models for the significance of each variable are presented in the table below:

Based on these results, the team’s final conclusion for the top 4 variables (ranked by significance) for red wine was determined to be:

1. alcohol
2. sulphates
3. volatile acidity
4. chlorides

These results were picked based on information presented in all of the models that the team created. These results were not merely a copy and paste presentation of variable significance based on one model. The team carefully considered each of these models and the results that they contributed to determine a ranking of their own.

More detail on why the analytics team ordered the variables this way is given below:

Once again, alcohol was selected as the most important variable for red wine quality, this time ranking first in all four models. Next, the team selected sulphates as number 2 because it ranked second in the best model and in the decision tree; the other two models placed it third. Volatile acidity follows closely with the opposite pattern to sulphates (third in the best model and the decision tree, and second in boosting and linear regression). The fourth and final spot went to chlorides, which ranked fourth and fifth in two of the models. This was the logic behind the team’s ranking.
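The tie between sulphates and volatile acidity can be made explicit with a small rank aggregation. The sketch below is an illustration rather than the team's actual procedure: it uses only the per-model ranks stated in the paragraph above, leaves out chlorides because its ranks are not fully listed, and the column names (best_model and so on) are made up for this example.

```r
# Ranks as described in the text (1 = most important variable).
# Chlorides is omitted because the text does not give all of its ranks.
ranks <- data.frame(
  variable      = c("volatile.acidity", "sulphates", "alcohol"),
  best_model    = c(3, 2, 1),
  decision_tree = c(3, 2, 1),
  boosting      = c(2, 3, 1),
  linear_reg    = c(2, 3, 1),
  stringsAsFactors = FALSE
)

model_cols <- c("best_model", "decision_tree", "boosting", "linear_reg")

# Average rank across the four models
ranks$mean_rank <- rowMeans(ranks[, model_cols])

# Sort by average rank; break ties using the rank in the best model
ranks <- ranks[order(ranks$mean_rank, ranks$best_model), ]
ranks[, c("variable", "mean_rank")]
```

Sulphates and volatile acidity have the same average rank here, so the ordering between them comes entirely from the tie-break on the best model, which matches the reasoning in the text.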

Further Supporting Details Using Aggregation Methods

In addition to the models, the team took a more direct approach to support its findings. The team filtered the data by quality value (3-9 for white, 3-8 for red) for each wine type. The idea was to aggregate the data by quality value and compare the mean of each variable across quality values. In general, if a variable the team found to be significant is correlated with quality, its mean should shift in a consistent direction as the quality value changes. The team started by grouping the wine accordingly.

First, remove the color column (column 13) from each dataset, as it is not needed for this part of the analysis:

white <- white[,-13]
#white

red<- red[,-13]

The team then created a new distinct data set for each individual value of quality.

white_quality_3 <- filter(white, quality == 3)
#white_quality_3

white_quality_4 <- filter(white, quality == 4)
#white_quality_4

white_quality_5 <- filter(white, quality == 5)
#white_quality_5

white_quality_6 <- filter(white, quality == 6)
#white_quality_6

white_quality_7 <- filter(white, quality == 7)
#white_quality_7

white_quality_8 <- filter(white, quality == 8)
#white_quality_8

white_quality_9 <- filter(white, quality == 9)
#white_quality_9

Starting with the white wines with a quality of 3, the mean of each variable is:

white3 <- colMeans(white_quality_3)
white3
##        fixed.acidity     volatile.acidity          citric.acid 
##             7.600000             0.333250             0.336000 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             6.392500             0.054300            53.325000 
## total.sulfur.dioxide              density                   pH 
##           170.600000             0.994884             3.187500 
##            sulphates              alcohol              quality 
##             0.474500            10.345000             3.000000

Next, the same was done for white wine having a quality value of 4:

white4 <- colMeans(white_quality_4)
white4
##        fixed.acidity     volatile.acidity          citric.acid 
##           7.12944785           0.38122699           0.30423313 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           4.62822086           0.05009816          23.35889571 
## total.sulfur.dioxide              density                   pH 
##         125.27914110           0.99427669           3.18288344 
##            sulphates              alcohol              quality 
##           0.47613497          10.15245399           4.00000000

Next, the same was done for white wine having a quality value of 5:

white5 <- colMeans(white_quality_5)
white5
##        fixed.acidity     volatile.acidity          citric.acid 
##           6.93397392           0.30201098           0.33765271 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           7.33496911           0.05154633          36.43205216 
## total.sulfur.dioxide              density                   pH 
##         150.90459849           0.99526257           3.16883322 
##            sulphates              alcohol              quality 
##           0.48220316           9.80884008           5.00000000

Next, the same was done for white wine having a quality value of 6:

white6 <- colMeans(white_quality_6)
white6
##        fixed.acidity     volatile.acidity          citric.acid 
##           6.83767061           0.26056415           0.33802548 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           6.44160601           0.04521747          35.65059145 
## total.sulfur.dioxide              density                   pH 
##         137.04731574           0.99396135           3.18859873 
##            sulphates              alcohol              quality 
##           0.49110555          10.57537155           6.00000000

Next, the same was done for white wine having a quality value of 7:

white7 <- colMeans(white_quality_7)
white7
##        fixed.acidity     volatile.acidity          citric.acid 
##           6.73471591           0.26276705           0.32562500 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           5.18647727           0.03819091          34.12556818 
## total.sulfur.dioxide              density                   pH 
##         125.11477273           0.99245240           3.21389773 
##            sulphates              alcohol              quality 
##           0.50310227          11.36793561           7.00000000

Next, the same was done for white wine having a quality value of 8:

white8 <- colMeans(white_quality_8)
white8
##        fixed.acidity     volatile.acidity          citric.acid 
##           6.65714286           0.27740000           0.32651429 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           5.67142857           0.03831429          36.72000000 
## total.sulfur.dioxide              density                   pH 
##         126.16571429           0.99223594           3.21868571 
##            sulphates              alcohol              quality 
##           0.48622857          11.63600000           8.00000000

Lastly the same was done for white wine having a quality value of 9:

white9 <- colMeans(white_quality_9)
white9
##        fixed.acidity     volatile.acidity          citric.acid 
##              7.42000              0.29800              0.38600 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##              4.12000              0.02740             33.40000 
## total.sulfur.dioxide              density                   pH 
##            116.00000              0.99146              3.30800 
##            sulphates              alcohol              quality 
##              0.46600             12.18000              9.00000
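The separate filter() and colMeans() steps above could also be collapsed into one grouped summary. The sketch below uses base R's aggregate() on a tiny toy data frame; the toy values are invented purely for illustration and are not taken from the wine data.

```r
# Toy stand-in for the `white` data frame (invented values, illustration only)
toy <- data.frame(
  alcohol = c(9.5, 10.1, 11.0, 11.4, 12.2, 12.6),
  density = c(0.996, 0.995, 0.994, 0.993, 0.992, 0.991),
  quality = c(5, 5, 6, 6, 7, 7)
)

# One row per quality value, one column of means per remaining variable
means_by_quality <- aggregate(. ~ quality, data = toy, FUN = mean)
means_by_quality
```

With the real data, a call of the form aggregate(. ~ quality, data = white, FUN = mean) should give the same per-quality means as the individual colMeans() calls, in a single table.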

The same process was then followed for the red wine, starting by filtering the data so that each dataset contains only one quality value.

red_quality_3 <- filter(red, quality == 3)
#red_quality_3

red_quality_4 <- filter(red, quality == 4)
#red_quality_4

red_quality_5 <- filter(red, quality == 5)
#red_quality_5

red_quality_6 <- filter(red, quality == 6)
#red_quality_6

red_quality_7 <- filter(red, quality == 7)
#red_quality_7

red_quality_8 <- filter(red, quality == 8)
#red_quality_8

Now look at the mean summary for each, starting with the red wines with a quality of 3:

red3 <- colMeans(red_quality_3)
red3
##        fixed.acidity     volatile.acidity          citric.acid 
##             8.360000             0.884500             0.171000 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             2.635000             0.122500            11.000000 
## total.sulfur.dioxide              density                   pH 
##            24.900000             0.997464             3.398000 
##            sulphates              alcohol              quality 
##             0.570000             9.955000             3.000000

Next, the same was done for red wine having a quality value of 4:

red4 <- colMeans(red_quality_4)
red4
##        fixed.acidity     volatile.acidity          citric.acid 
##           7.77924528           0.69396226           0.17415094 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           2.69433962           0.09067925          12.26415094 
## total.sulfur.dioxide              density                   pH 
##          36.24528302           0.99654245           3.38150943 
##            sulphates              alcohol              quality 
##           0.59641509          10.26509434           4.00000000

Next, the same was done for red wine having a quality value of 5:

red5 <- colMeans(red_quality_5)
red5
##        fixed.acidity     volatile.acidity          citric.acid 
##           8.16725404           0.57704112           0.24368576 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           2.52885463           0.09273568          16.98384728 
## total.sulfur.dioxide              density                   pH 
##          56.51395007           0.99710363           3.30494860 
##            sulphates              alcohol              quality 
##           0.62096916           9.89970631           5.00000000

Next, the same was done for red wine having a quality value of 6:

red6 <- colMeans(red_quality_6)
red6
##        fixed.acidity     volatile.acidity          citric.acid 
##           8.34717868           0.49748433           0.27382445 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           2.47719436           0.08495611          15.71159875 
## total.sulfur.dioxide              density                   pH 
##          40.86990596           0.99661506           3.31807210 
##            sulphates              alcohol              quality 
##           0.67532915          10.62951933           6.00000000

Next, the same was done for red wine having a quality value of 7:

red7 <- colMeans(red_quality_7)
red7
##        fixed.acidity     volatile.acidity          citric.acid 
##           8.87236181           0.40391960           0.37517588 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           2.72060302           0.07658794          14.04522613 
## total.sulfur.dioxide              density                   pH 
##          35.02010050           0.99610427           3.29075377 
##            sulphates              alcohol              quality 
##           0.74125628          11.46591290           7.00000000

Lastly, the same was done for red wine having a quality value of 8:

red8 <- colMeans(red_quality_8)
red8
##        fixed.acidity     volatile.acidity          citric.acid 
##           8.56666667           0.42333333           0.39111111 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           2.57777778           0.06844444          13.27777778 
## total.sulfur.dioxide              density                   pH 
##          33.44444444           0.99521222           3.26722222 
##            sulphates              alcohol              quality 
##           0.76777778          12.09444444           8.00000000

Below are two tables including the quality and the average of the variables that were found to be significant for each wine type. Since most of the analysis has the white wine first followed by the red wine, the team will place the information in the same order for uniformity.

Recall that the four most important variables for white wine suggested from the models the team created were alcohol, density, volatile acidity, and residual sugar.

The aggregation summary for the white wine:

The table for white wine adds evidence that the four variables the team found most significant are associated with the quality of the wine. This conclusion is drawn from the following general trends in the data:

  • As alcohol increases, the quality of the white wine generally increases.
  • As density decreases, the quality of the white wine generally increases.
  • As volatile acidity decreases, the quality of the white wine increases up to a point (mean volatile acidity is lowest at quality 6 and 7).
  • As residual sugar decreases, the quality of the white wine tends to increase, although the means fluctuate between adjacent quality values.

Note that these are not strict rules that every wine tested followed. However, these conclusions give the team guidelines for what to expect from each variable. The purpose is to show that, judged by their means at each quality value, the four attributes found to be significant do appear to be generally correlated with quality.

Now the aggregation summary for the red wine:

The table for red wine adds evidence that the four variables the team found most significant are associated with the quality of the wine. This conclusion is drawn from the following general trends in the data:

  • As alcohol increases, the quality of the red wine generally increases.
  • As volatile acidity decreases, the quality of the red wine increases.
  • As sulphates increase, the quality of the red wine also increases.
  • As chlorides decrease, the quality of the red wine increases.

Note once again that these are not strict rules that every wine tested followed. However, these conclusions give the team guidelines for what to expect from each variable. The purpose is to show that, judged by their means at each quality value, the four attributes found to be significant do appear to be generally correlated with quality.

The team wanted to provide further evidence that each variable determined to be significant does contribute to the quality of the wine. The aggregated tables allow anyone, including people without a background in data analysis, to see how the variables relate to quality. The trends show these relationships within the dataset.
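One way to put a number on these trends is a rank correlation between quality and the per-quality means. The sketch below is an illustration, not part of the team's analysis: the means are copied (rounded) from the colMeans() output earlier in this section, and the helper name trend() is made up for this example. Because it correlates only six or seven group means, it ignores how many wines fall in each quality group and should be read as a rough check on direction only.

```r
# Per-quality means, rounded from the colMeans() output shown earlier
white_means <- data.frame(
  quality          = 3:9,
  alcohol          = c(10.345, 10.152, 9.809, 10.575, 11.368, 11.636, 12.180),
  volatile.acidity = c(0.333, 0.381, 0.302, 0.261, 0.263, 0.277, 0.298),
  density          = c(0.994884, 0.994277, 0.995263, 0.993961,
                       0.992452, 0.992236, 0.991460),
  residual.sugar   = c(6.393, 4.628, 7.335, 6.442, 5.186, 5.671, 4.120)
)

red_means <- data.frame(
  quality          = 3:8,
  alcohol          = c(9.955, 10.265, 9.900, 10.630, 11.466, 12.094),
  sulphates        = c(0.5700, 0.5964, 0.6210, 0.6753, 0.7413, 0.7678),
  volatile.acidity = c(0.8845, 0.6940, 0.5770, 0.4975, 0.4039, 0.4233),
  chlorides        = c(0.1225, 0.0907, 0.0927, 0.0850, 0.0766, 0.0684)
)

# Hypothetical helper: Spearman correlation of each column with quality
trend <- function(means) {
  vars <- setdiff(names(means), "quality")
  sapply(vars, function(v) {
    cor(means$quality, means[[v]], method = "spearman")
  })
}

white_trend <- trend(white_means)
red_trend   <- trend(red_means)

round(white_trend, 2)
round(red_trend, 2)
```

A positive value means the variable's mean tends to rise with quality and a negative value means it tends to fall. Values close to zero would suggest the trend in the means is weak.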

Answering MECE List Question 2

With conclusions reached for the first MECE list question, the second question can be addressed. For convenience, question 2 is restated below:

Group the wines based on similar characteristics to limit and diversify the supply in the store. The store has limited space and wants distinctly different wines of each type to fill the shelf space. Identify clusters in order to help determine the shelf position of the wine. The store needs the wine organized in order to maximize their sales and increase their profits. Answer the question: How can a store group the wines based upon similar characteristics to best position their wine selection within the store objectively?

To start, the team decided to filter the data to include only wines with a quality greater than 5. This filtering was done on the basis that the store only wants to sell quality wine to its customers and does not want to waste money on low-quality wine that is unlikely to produce a profit. Thus, the clustering should be performed only on the higher-quality wine. This was the premise for filtering the wine as the code below shows:

#filtering the white wine data 
quality_white <- filter(white, quality > 5)
quality_white
## # A tibble: 3,258 × 12
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7               0.27        0.36           20.7     0.045
##  2           6.3             0.3         0.34            1.6     0.049
##  3           8.1             0.28        0.4             6.9     0.05 
##  4           7.2             0.23        0.32            8.5     0.058
##  5           7.2             0.23        0.32            8.5     0.058
##  6           8.1             0.28        0.4             6.9     0.05 
##  7           6.2             0.32        0.16            7       0.045
##  8           7               0.27        0.36           20.7     0.045
##  9           6.3             0.3         0.34            1.6     0.049
## 10           8.1             0.22        0.43            1.5     0.044
## # … with 3,248 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>
#filtering the red wine data
quality_red <- filter(red, quality > 5)
quality_red
## # A tibble: 855 × 12
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1          11.2            0.28         0.56            1.9     0.075
##  2           7.3            0.65         0               1.2     0.065
##  3           7.8            0.58         0.02            2       0.073
##  4           8.5            0.28         0.56            1.8     0.092
##  5           7.9            0.32         0.51            1.8     0.341
##  6           8.9            0.22         0.48            1.8     0.077
##  7           6.9            0.4          0.14            2.4     0.085
##  8           7.8            0.645        0               2       0.082
##  9           6.9            0.685        0               2.5     0.105
## 10           6.9            0.605        0.12           10.7     0.073
## # … with 845 more rows, and 7 more variables: free.sulfur.dioxide <dbl>,
## #   total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>

Two different clustering methods will be performed on the datasets.

Wine with Quality Greater than 5 Dendrograms

The first method is the more high-level of the two: hierarchical clustering, with a dendrogram as a visual aid.

#hierarchical clustering 

#perform hierarchical clustering on the first twelve columns of data
hc_result_white_quality <- hclust(dist(quality_white[,1:12]))
#plot a dendrogram displaying your results
plot(hc_result_white_quality)
#cut the dendrogram into three clusters
rect.hclust(hc_result_white_quality, k = 3)

The dendrogram shows that the clusters for the white wine dataset with quality over 5 are highly uneven in size. The same is done next for the red wine.

#hierarchical clustering 
#perform hierarchical clustering on the first twelve columns of data
hc_result_red_quality <- hclust(dist(quality_red[,1:12]))
#plot a dendrogram displaying your results
plot(hc_result_red_quality)
#cut the dendrogram into three clusters
rect.hclust(hc_result_red_quality, k = 3)

The same outcome occurs for the red wine. A different approach was then taken with the same filtered datasets, using K-means clustering to see if the team could get distinctly different clusters within each of the red and white wine datasets. The steps to do this are presented below.
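One caveat about the dendrograms above: dist() computes Euclidean distance on the raw values, so columns measured on large scales (here, most likely total.sulfur.dioxide and free.sulfur.dioxide) can dominate the distances. A common option, which is not what the team ran, is to standardize the columns first with scale(). The sketch below uses R's built-in USArrests data as a stand-in for the wine data and uses cutree() to count how many observations fall in each of three clusters.

```r
# Stand-in data: USArrests ships with R; it is NOT the wine data.
x_raw    <- USArrests
x_scaled <- scale(USArrests)   # each column centered to mean 0, scaled to sd 1

# Complete-linkage hierarchical clustering (the default method of hclust)
tree_raw    <- hclust(dist(x_raw))
tree_scaled <- hclust(dist(x_scaled))

# Cut each tree into three clusters and count the members of each cluster
sizes_raw    <- table(cutree(tree_raw, k = 3))
sizes_scaled <- table(cutree(tree_scaled, k = 3))

sizes_raw
sizes_scaled
```

For the wine data, the analogous call would be hclust(dist(scale(quality_white[, 1:12]))). Whether standardizing gives more balanced clusters there would need to be checked on the actual data.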

K-Means Clustering for Wine with Quality Greater than 5

K-means clustering partitions observations into a chosen number of groups so that the observations within each group are as similar as possible. Start by creating 3 clusters for each type of wine, using only wine with a quality greater than 5. The filtered white wine is used first.

#Create 3 distinct clusters.
fit_white <- kmeans(quality_white[, 1:12], 3)
fit_white
## K-means clustering with 3 clusters of sizes 1264, 1309, 685
## 
## Cluster means:
##   fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 1      6.685206        0.2601859   0.3209889       4.215941 0.03906883
## 2      6.800038        0.2637280   0.3369824       6.123491 0.04363866
## 3      7.016788        0.2626204   0.3529343       9.330292 0.04865985
##   free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
## 1            25.73339             96.17484 0.9919169 3.197484 0.4834889
## 2            37.07487            138.85600 0.9935815 3.202819 0.4942857
## 3            49.52628            190.74818 0.9960622 3.186088 0.5130657
##     alcohol  quality
## 1 11.401701 6.431962
## 2 10.800255 6.415584
## 3  9.921703 6.226277
## 
## Clustering vector:
##    [1] 3 2 1 3 3 1 2 3 2 2 2 1 1 1 2 1 1 2 2 3 2 2 2 1 2 1 1 2 2 2 2 2 2 2 3 2 3
##   [38] 2 1 1 2 3 3 2 1 1 2 2 1 2 2 1 3 1 1 1 1 1 3 3 2 2 3 2 2 2 2 2 3 2 1 2 1 3
##   [75] 3 3 2 1 2 2 2 3 3 3 3 2 1 1 1 1 1 1 3 1 1 2 2 1 1 3 3 2 2 2 2 3 3 1 2 1 2
##  [112] 2 3 3 3 3 1 3 2 3 2 2 1 2 1 3 1 2 2 3 3 3 3 3 3 3 3 3 3 2 2 1 3 3 2 1 1 2
##  [149] 2 1 1 1 1 3 2 3 2 2 3 3 3 1 1 1 3 3 3 3 3 2 2 3 3 2 2 1 1 1 3 2 2 3 2 2 1
##  [186] 2 2 3 2 2 2 1 1 2 2 3 2 3 1 1 1 1 3 2 2 2 2 2 2 1 2 2 2 2 2 2 3 3 3 2 2 2
##  [223] 3 2 3 1 1 2 2 3 2 3 2 1 1 1 1 2 1 1 3 2 2 2 1 1 2 3 3 2 1 1 1 1 3 1 1 2 2
##  [260] 2 2 1 1 2 2 3 3 3 2 3 3 3 1 1 3 3 1 2 2 3 2 1 1 2 2 1 2 1 1 3 1 2 1 2 1 2
##  [297] 1 1 1 2 3 1 2 2 1 3 3 1 1 3 2 1 3 2 3 3 3 2 2 3 2 3 2 2 2 1 1 2 2 1 1 1 2
##  [334] 1 1 1 1 3 3 3 3 3 2 2 2 2 3 1 2 3 1 2 3 2 1 2 2 2 3 2 1 1 2 3 1 2 2 3 3 2
##  [371] 1 2 3 1 2 1 2 2 2 2 2 2 2 1 1 2 1 2 2 1 2 1 1 3 3 2 2 2 3 3 1 2 2 2 2 2 3
##  [408] 2 3 3 3 3 2 2 1 1 2 3 3 3 3 3 1 2 2 3 3 1 1 1 2 1 3 2 3 3 1 2 1 1 3 2 2 2
##  [445] 3 2 3 2 2 2 1 2 1 2 3 1 3 2 3 2 2 1 2 1 3 3 3 3 3 3 3 3 1 3 2 2 3 3 2 3 2
##  [482] 1 1 1 2 2 2 2 1 1 3 2 3 3 2 2 1 1 1 2 1 1 1 3 2 1 1 2 3 1 2 3 3 3 1 2 3 3
##  [519] 1 3 2 3 2 2 1 1 1 1 1 1 1 3 3 3 2 2 3 3 1 1 1 3 1 1 2 2 1 2 1 3 2 1 1 2 2
##  [556] 3 1 2 2 2 1 1 1 2 1 2 2 3 1 3 1 3 1 3 1 2 1 1 3 1 1 3 2 1 3 2 2 2 2 1 1 2
##  [593] 2 2 3 2 3 1 3 3 3 3 1 2 2 1 2 1 2 1 1 3 3 1 3 1 2 3 2 1 1 1 1 2 1 1 2 2 1
##  [630] 2 1 1 2 2 2 1 2 3 2 2 3 3 1 3 2 2 1 2 3 2 1 3 1 1 2 3 2 2 3 3 3 2 2 2 2 3
##  [667] 3 3 2 1 1 1 2 1 1 2 2 3 2 1 2 2 1 3 2 3 3 2 2 2 3 2 2 2 3 2 3 3 2 2 1 2 2
##  [704] 2 1 2 1 3 2 1 1 1 1 3 1 1 1 1 3 3 1 1 2 2 2 3 3 1 1 1 2 2 3 2 2 2 2 3 1 1
##  [741] 2 1 1 3 3 3 2 2 2 1 2 1 2 1 2 3 2 2 2 1 1 1 1 1 1 3 2 2 1 2 1 1 1 2 1 3 2
##  [778] 1 1 1 1 1 3 1 1 2 2 1 2 3 3 2 2 3 1 1 2 2 3 1 2 3 3 3 3 2 2 3 3 1 2 3 2 3
##  [815] 2 2 2 2 2 2 2 2 2 2 1 1 1 1 3 1 3 2 2 2 2 3 3 1 1 2 1 2 1 3 3 3 1 1 1 2 3
##  [852] 1 3 3 1 1 3 3 1 3 2 2 2 2 3 3 1 2 2 2 2 2 3 3 3 1 2 2 3 2 2 1 3 1 2 3 3 2
##  [889] 2 2 2 2 2 2 1 1 1 2 1 1 2 3 2 1 1 1 1 1 2 3 3 1 3 3 2 2 1 1 3 1 1 1 2 2 2
##  [926] 2 1 1 2 2 2 1 1 1 1 2 2 2 1 2 2 3 2 2 2 1 2 2 2 2 3 2 2 2 3 1 2 2 2 2 1 2
##  [963] 2 2 2 3 2 1 3 3 1 3 3 3 2 1 2 2 2 2 2 3 2 3 2 2 2 2 2 2 3 1 3 3 2 2 3 2 2
## [1000] 3 3 1 3 2 2 1 1 3 1 1 2 2 2 1 2 2 2 2 2 2 2 3 2 2 2 2 1 3 3 3 3 1 1 1 3 3
## [1037] 2 2 2 1 1 2 3 3 1 1 2 3 1 1 1 2 3 1 1 1 2 2 3 2 3 2 1 2 1 2 2 3 1 1 1 1 1
## [1074] 3 2 2 2 2 2 2 2 2 1 2 3 3 2 3 3 1 3 2 2 1 2 2 1 3 1 2 3 3 3 3 3 3 3 2 3 3
## [1111] 2 2 3 3 3 3 2 2 2 3 2 2 1 2 1 3 3 2 2 2 2 1 3 2 1 3 2 3 3 3 2 2 2 3 3 2 3
## [1148] 2 3 2 1 2 3 2 1 1 1 3 2 3 3 1 2 2 2 2 2 1 2 2 1 2 3 2 2 2 3 2 3 3 2 3 3 3
## [1185] 3 1 1 1 2 2 1 2 2 1 3 3 3 3 3 2 3 2 2 1 1 3 1 1 3 2 1 3 3 1 2 3 3 3 2 3 1
## [1222] 2 2 3 2 2 2 2 3 2 1 3 1 1 1 3 3 3 1 1 3 3 3 3 2 1 1 1 2 2 1 1 3 3 2 1 1 3
## [1259] 3 1 1 3 3 2 2 2 2 3 1 3 3 3 1 3 2 3 1 3 3 2 1 3 3 3 3 3 3 3 2 3 1 3 3 3 3
## [1296] 3 3 3 3 2 3 3 3 3 1 1 1 2 1 1 2 2 2 2 3 2 2 3 2 1 2 3 1 1 1 3 2 2 1 3 1 2
## [1333] 3 2 3 3 1 2 2 2 2 2 2 3 1 2 3 3 3 2 1 1 2 2 2 2 2 2 1 2 3 1 1 1 1 1 1 1 2
## [1370] 1 1 1 3 3 3 3 3 3 3 2 1 2 1 1 1 1 2 1 2 2 1 1 1 3 2 2 2 2 2 2 2 1 2 2 2 2
## [1407] 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 1 3 1 3 2 2 2 2 1 2 2 2 3 1 2 1 1 3 1 3 1 1
## [1444] 1 1 2 3 1 1 2 3 3 1 1 2 2 2 2 2 1 2 2 1 1 2 2 2 1 1 3 3 1 1 2 1 1 3 1 2 3
## [1481] 2 1 1 2 2 2 2 3 3 2 1 3 2 3 1 2 1 1 1 1 2 2 2 2 1 1 1 3 2 1 1 3 3 3 1 1 1
## [1518] 3 3 3 3 3 3 2 2 2 2 1 3 2 2 3 2 2 3 3 2 1 3 1 2 1 2 1 3 2 3 3 3 3 2 2 3 1
## [1555] 3 3 3 3 3 2 3 3 3 1 2 1 3 3 1 2 2 2 2 1 1 3 1 2 1 2 2 2 1 1 2 3 3 2 2 2 2
## [1592] 2 1 1 2 1 3 2 1 3 1 1 3 3 3 3 3 3 1 3 2 2 1 2 3 2 2 1 1 1 2 3 3 1 1 2 2 2
## [1629] 2 2 2 2 2 2 3 1 1 1 1 1 1 3 1 1 2 3 1 2 2 1 2 3 1 1 1 2 2 1 2 2 2 1 2 2 1
## [1666] 2 1 2 2 2 2 1 2 2 1 2 1 2 2 1 2 3 2 2 1 3 3 3 3 3 3 3 2 1 2 1 1 2 2 1 2 3
## [1703] 1 1 2 3 2 1 1 2 3 1 1 1 2 2 3 2 3 1 2 1 3 1 2 2 1 2 1 1 2 1 3 2 2 1 2 2 2
## [1740] 1 1 2 3 3 3 1 3 1 1 1 1 2 1 1 1 2 3 2 1 1 2 1 1 2 1 3 2 3 2 2 1 1 2 2 3 2
## [1777] 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 2 1 1 1 1 1 2 2 1 2 1 3 1 2 1 2 1 1 1 1 1 2
## [1814] 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 3 2 2 2 1 2 2 2 2 1 3 2 1 1 2 1 2 2 1 1
## [1851] 3 2 1 1 1 1 1 1 1 2 2 2 1 2 2 1 3 2 2 2 2 1 1 3 3 1 1 1 1 1 3 3 2 2 1 1 2
## [1888] 1 2 1 1 1 2 2 1 1 2 3 3 3 3 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 2 2 3 2 1 3 1
## [1925] 3 3 3 2 2 1 1 2 3 1 1 1 1 2 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 3 2 2 1 1 1 2 1
## [1962] 2 1 1 2 2 2 3 1 1 2 2 2 1 1 2 3 1 2 2 1 2 2 2 2 2 3 2 1 2 3 1 2 2 2 2 2 2
## [1999] 2 1 2 3 1 1 2 2 1 2 1 2 1 3 2 2 2 1 1 2 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 2 1
## [2036] 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 1 1 2 2 1 2 3 3 2 2 1 1 2 2 2 2 1 1 1 1
## [2073] 1 1 2 2 2 2 2 3 1 1 3 3 3 3 3 3 3 1 3 1 2 2 1 1 2 2 2 2 2 2 3 1 2 2 2 3 1
## [2110] 1 1 1 1 1 1 1 2 1 2 2 1 2 2 1 1 3 1 1 1 1 1 1 1 1 2 1 1 3 2 2 2 1 3 3 3 1
## [2147] 1 2 1 3 3 3 3 2 1 1 2 1 1 1 2 2 2 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 2 2 2 2 3
## [2184] 2 2 2 2 2 3 2 2 1 1 1 1 1 1 2 3 1 2 2 1 1 1 3 2 2 2 3 2 2 3 3 2 1 1 2 2 2
## [2221] 3 2 2 2 1 1 1 1 1 1 2 3 1 2 2 1 1 1 1 2 2 2 1 3 1 1 1 1 1 2 3 1 2 1 3 3 1
## [2258] 2 3 1 1 1 1 1 1 1 3 1 1 2 1 2 2 2 1 2 2 2 2 2 2 1 2 1 2 1 3 3 1 1 1 1 1 3
## [2295] 1 2 2 2 2 2 2 2 1 2 1 2 1 1 2 2 2 2 3 2 3 2 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1
## [2332] 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 2 2 1 2 1 2 3 3 1 2 2 2 1 1 2 2 1
## [2369] 1 2 1 2 2 3 2 1 2 1 3 3 2 2 2 2 2 1 1 2 1 1 2 2 2 2 2 2 1 1 2 3 1 3 1 2 3
## [2406] 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 3 1 3 1 3 2 2 3 2 3 3 2 2 1 1 2 2 2 3 1 3 3
## [2443] 3 3 1 1 3 3 1 2 2 2 1 1 2 1 3 3 2 1 2 2 2 2 2 2 2 2 3 2 2 2 2 2 1 1 2 2 2
## [2480] 2 1 1 1 1 1 1 2 1 3 2 3 2 1 1 1 1 1 1 1 2 2 2 2 3 1 1 2 2 1 1 1 1 1 1 3 1
## [2517] 1 1 1 1 1 2 1 2 1 2 1 1 2 3 3 3 1 1 2 3 3 2 3 1 1 1 2 1 1 2 1 1 2 1 2 1 3
## [2554] 3 1 1 1 1 1 1 1 2 2 2 2 1 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 1 2 2 2 2 1 2
## [2591] 1 1 1 2 2 1 1 3 2 2 2 1 3 2 2 1 3 3 3 2 1 2 1 1 3 2 1 3 1 1 1 1 2 2 2 2 1
## [2628] 2 1 2 1 1 2 2 2 2 3 3 1 3 2 2 1 2 1 2 2 3 2 2 2 2 2 2 2 1 2 1 2 2 1 1 2 1
## [2665] 2 1 2 2 2 1 3 2 1 1 1 2 1 2 1 1 1 1 1 2 3 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1
## [2702] 1 1 1 2 2 2 2 2 3 1 1 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 1 3 1 3 3 3 3 3 3 1
## [2739] 2 2 1 2 3 2 3 2 3 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 2 3 3
## [2776] 1 2 1 3 1 3 2 1 1 1 1 2 1 2 2 2 1 1 2 1 2 1 1 1 2 1 2 1 2 2 1 2 1 1 2 3 1
## [2813] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 2 2 2 1
## [2850] 2 2 2 2 2 2 2 1 2 2 2 1 1 1 1 2 2 1 2 2 3 2 3 3 1 2 2 1 1 2 1 1 3 1 2 2 1
## [2887] 1 3 1 1 2 2 2 3 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 1 1 2 2 1 2 1 1 1 1
## [2924] 1 2 2 1 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 1 1 2 2 1 1 1 2 1 3 1 1 3 3 2 1 1 2
## [2961] 1 1 1 2 1 1 1 3 1 2 3 3 1 1 3 3 3 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 3 2 1 1
## [2998] 1 2 2 1 2 2 1 1 1 1 1 2 1 1 2 1 2 2 2 1 2 2 1 2 3 3 2 1 2 2 1 1 1 1 1 2 2
## [3035] 2 2 1 3 1 1 2 2 2 2 1 2 2 1 3 1 1 1 2 1 3 3 2 2 2 3 2 2 1 3 2 2 1 1 2 1 2
## [3072] 2 2 2 2 1 1 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 2 3 2 2 2 2 3 2 3 1 1 1 2 2 2 1
## [3109] 3 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 2 3 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2 2 2 1
## [3146] 3 3 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 3 3 3 3 3 1 2 2 1 1 2 2 1 2 1 1 2 2 2 2
## [3183] 2 2 1 1 2 2 2 1 1 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 1 1 2 2 1 1 1 3 1 2 3 1
## [3220] 3 2 1 3 3 1 3 3 1 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 2 3 3 2 2 3 2 1 2 1 2 1 1
## [3257] 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 487840.8 491295.2 519314.7
##  (between_SS / total_SS =  74.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

With quality filtered above 5 and all variables present, the clusters for the white wine appear to differ mainly on the following variables:

fixed acidity, residual sugar, sulfur dioxide, and alcohol

However, the mean quality values of the clusters are not distinctly different (roughly 6.2 to 6.4). For the store, organizing the wine by quality is important: it wants the lowest quality wine on the bottom shelf and the highest quality wine on the top shelf. Thus, the team will need to try a different approach to achieve these results. The same process is followed next for the red wine.

Create 3 clusters for the filtered red wine:

#Create 3 distinct clusters.
fit_red <- kmeans(quality_red[, 1:12], 3)
fit_red
## K-means clustering with 3 clusters of sizes 467, 299, 89
## 
## Cluster means:
##   fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 1      8.738116        0.4614668   0.3178801       2.405782 0.08326552
## 2      8.294983        0.4831104   0.2746488       2.513712 0.08259866
## 3      7.689888        0.5105618   0.2902247       3.293820 0.07969663
##   free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
## 1            9.012848             21.58030 0.9964748 3.293383 0.6868094
## 2           21.637124             49.99331 0.9965991 3.339766 0.6986288
## 3           26.735955             96.85393 0.9959785 3.303371 0.7029213
##    alcohol  quality
## 1 10.90635 6.327623
## 2 10.79242 6.197324
## 3 10.79607 6.258427
## 
## Clustering vector:
##   [1] 2 1 1 3 2 2 2 1 2 3 1 1 1 1 1 1 3 2 1 1 1 1 2 3 3 3 1 2 1 1 3 2 2 2 1 2 3
##  [38] 2 1 1 1 2 2 1 2 1 3 1 2 1 1 2 1 2 2 1 3 1 2 1 1 1 1 2 1 2 2 1 2 3 2 3 2 2
##  [75] 1 2 2 2 2 1 1 3 1 1 1 2 1 2 1 1 1 1 2 1 1 2 1 2 1 1 1 2 1 1 2 2 2 2 2 2 1
## [112] 2 2 1 1 1 1 1 1 3 3 2 3 2 3 2 3 2 2 1 1 2 2 3 2 1 1 3 1 1 1 1 1 1 1 1 2 1
## [149] 1 1 3 1 1 1 2 2 2 1 2 1 1 2 2 1 2 1 1 2 1 2 1 1 1 2 2 2 2 3 2 1 2 2 2 2 2
## [186] 1 1 1 1 3 1 1 1 2 2 1 2 2 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1
## [223] 2 1 1 2 2 1 1 1 1 1 1 2 1 2 1 1 3 3 1 2 1 3 2 3 3 1 1 1 1 1 2 2 2 2 1 1 1
## [260] 1 1 3 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 2 1 1 2 2 2 1 1 1 2 1 2 3 2 2 1 3 1
## [297] 2 2 1 3 1 1 1 1 1 1 2 1 1 2 2 2 1 3 1 2 1 1 2 1 2 1 1 3 1 2 1 1 1 1 1 1 1
## [334] 1 1 1 2 1 1 1 3 1 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 1 2 2 3 3 2 2 2 1 2 2 2 3
## [371] 3 1 2 1 1 3 2 1 1 1 1 1 1 2 1 1 2 1 2 1 1 3 3 1 1 2 2 2 2 2 1 2 2 1 1 3 1
## [408] 1 1 2 2 1 1 1 2 2 2 1 2 1 2 2 1 3 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 3 2
## [445] 3 2 2 3 3 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 1 2 1 1 1
## [482] 1 1 1 2 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1 2 1 2
## [519] 1 1 1 1 2 1 2 2 2 1 2 2 2 2 2 1 1 1 3 3 1 2 1 1 1 1 1 1 1 1 1 1 2 3 2 2 1
## [556] 3 1 3 2 2 2 1 1 2 2 3 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 1 3 1 1 1 1 1 1 2
## [593] 1 1 1 2 2 1 2 1 1 2 1 1 3 3 2 1 1 2 1 1 1 1 1 2 2 2 3 3 2 1 1 1 2 2 1 1 2
## [630] 1 2 2 2 1 2 2 2 1 1 1 2 2 2 3 1 3 3 1 1 1 2 2 2 2 1 1 1 1 1 1 3 3 1 2 2 2
## [667] 3 1 1 3 2 1 1 1 2 3 2 2 2 2 1 1 2 2 2 1 2 3 2 2 2 3 1 3 1 2 2 1 2 1 2 1 2
## [704] 1 1 1 2 1 1 2 2 3 2 2 3 2 2 2 1 1 1 1 3 3 1 2 1 1 1 1 1 1 2 2 1 1 1 3 2 1
## [741] 3 1 1 1 1 1 1 1 1 1 2 1 1 2 2 2 1 1 1 2 2 1 1 2 2 1 1 3 3 3 1 3 3 3 1 1 1
## [778] 2 1 1 3 1 2 1 2 2 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 2 1 2 2
## [815] 1 1 1 1 2 1 1 2 2 1 1 1 1 2 1 1 1 2 2 1 2 1 2 3 1 1 1 1 1 2 2 2 1 3 2 1 2
## [852] 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  39792.36  52797.83 115920.28
##  (between_SS / total_SS =  71.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

With quality above 5 for red wine, the clusters show no clear distinctions except in the variables involving sulfur dioxide. Once again, there are no large differences in quality or in the other variables found to be significant, so this clustering does not help the store with its organization.

In addition, the team reconsidered whether quality was the best basis for clustering the wine, since it is a subjective value that could differ from person to person. Also, recall that most of the values in this dataset are considered normal, average values, so it is hard for clustering to draw large distinctions from the data at hand.
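Two practical points apply to the kmeans() calls above. First, kmeans() starts from randomly chosen centers, so cluster numbers and sizes can change from run to run unless a seed is fixed. Second, the choice of three clusters can be checked by comparing the total within-cluster sum of squares over several values of k. The sketch below illustrates both on R's built-in USArrests data as a stand-in for the wine data; the seed value and the range of k are arbitrary choices for this example, not values from the team's analysis.

```r
# Stand-in data: USArrests ships with R; it is NOT the wine data.
x <- scale(USArrests)

# Fixing the seed makes the random starting centers repeatable
set.seed(42)
fit_a <- kmeans(x, centers = 3, nstart = 25)
set.seed(42)
fit_b <- kmeans(x, centers = 3, nstart = 25)
same_clusters <- identical(fit_a$cluster, fit_b$cluster)
same_clusters

# Total within-cluster sum of squares for k = 1 to 6 ("elbow" check)
set.seed(42)
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
wss
```

Plotting wss against k and looking for the point where the curve flattens is one informal way to judge whether three clusters is a reasonable choice.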

Then, the team took a different approach to find appropriate clusters of each wine type. The team decided to create a new dataset only including the variables that were found to be of utmost importance pertaining to the quality.

White Wine Dendrogram Using the Variables Determined in Question 1 of the Analysis

First, starting with the white wine:

white_quality <- white$quality
white_alcohol <- white$alcohol
white_volatile_acidity <- white$volatile.acidity
white_density <- white$density
white_residual_sugar <- white$residual.sugar

important_white <- data.frame(white_quality, white_alcohol, white_volatile_acidity, white_density, white_residual_sugar)

As with the quality dataset above, the team first tried hierarchical clustering to get a general idea of how well this new dataset would cluster.

#hierarchical clustering 
#perform hierarchical clustering on the first five columns of data
hc_result_important_white <- hclust(dist(important_white[,1:5]))
#plot a dendrogram displaying your results
plot(hc_result_important_white)
#cut the dendrogram into three clusters
rect.hclust(hc_result_important_white, k = 3)

As seen above, the white wine splits into more even clusters when only the variables deemed significant in question 1 are considered. With so many data points, however, it is hard to tell from the dendrogram what the clusters were split on.
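When a dendrogram is too dense to read, the cluster labels can be pulled out of the tree and summarized numerically. The following is a minimal sketch, assuming a small simulated data frame (`toy`, with made-up values) in place of `important_white`; it uses `cutree()` to cut the tree at three clusters and `aggregate()` to get per-cluster means.

```r
# Toy stand-in for important_white; the values are simulated, not the wine data.
set.seed(1)
toy <- data.frame(
  quality        = sample(3:9, 60, replace = TRUE),
  alcohol        = runif(60, 8, 14),
  residual_sugar = runif(60, 1, 20)
)

hc     <- hclust(dist(toy))       # complete linkage, the hclust() default
groups <- cutree(hc, k = 3)       # cluster label (1, 2, or 3) for every row

table(groups)                     # number of rows in each cluster

# Mean of every column within each cluster
cluster_means <- aggregate(toy, by = list(cluster = groups), FUN = mean)
cluster_means
```

The same two calls could be applied to `hc_result_important_white` and `important_white` to see which columns differ most between the three branches.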

White Wine K-Means Clustering with the Variables Determined in Question 1

Thus, the team decided to run K-means clustering. This gave the team a better general idea of each cluster and the numerical values it consisted of. Specifically, it allowed the team to present the cluster averages for each column and so find a variable to organize the store by.

fit_important_white <- kmeans(important_white[, 1:5], 3)
fit_important_white
## K-means clustering with 3 clusters of sizes 997, 2369, 1532
## 
## Cluster means:
##   white_quality white_alcohol white_volatile_acidity white_density
## 1      5.723170      9.493865              0.2772467     0.9980640
## 2      5.987336     11.054313              0.2715935     0.9918715
## 3      5.809399     10.343231              0.2891678     0.9947341
##   white_residual_sugar
## 1            14.414142
## 2             2.139046
## 3             7.745986
## 
## Clustering vector:
##    [1] 1 2 3 3 3 3 3 1 2 2 2 2 2 2 1 2 2 2 2 3 2 2 2 2 2 3 2 3 2 2 3 2 2 2 1 2 3
##   [38] 2 1 1 2 2 3 2 2 2 2 2 3 3 2 2 2 2 2 2 1 1 2 2 3 1 2 2 2 2 3 2 2 3 3 2 2 2
##   [75] 2 3 2 2 3 3 2 1 3 3 1 1 1 1 1 1 1 3 2 2 2 1 1 2 3 1 1 3 1 1 1 1 3 1 1 2 3
##  [112] 1 1 3 3 2 2 2 1 1 3 1 3 3 2 2 3 3 2 2 3 3 1 1 3 1 2 3 2 2 2 3 2 2 2 2 2 3
##  [149] 2 2 3 3 2 2 3 1 1 2 2 2 2 1 2 1 1 1 3 2 3 3 2 2 2 2 1 3 2 1 2 3 3 2 1 1 1
##  [186] 1 3 2 2 3 3 1 3 3 2 2 2 3 1 3 1 1 1 1 2 1 3 3 2 2 3 2 3 3 2 3 3 3 1 3 3 3
##  [223] 2 2 2 1 2 3 2 1 3 3 3 1 1 1 1 1 2 1 1 3 2 2 1 2 2 2 2 3 2 1 1 2 2 2 3 3 2
##  [260] 2 3 1 3 1 2 1 3 3 3 3 3 3 1 3 3 3 3 2 2 2 2 2 3 1 1 3 1 1 1 1 1 3 1 3 2 2
##  [297] 3 2 1 2 2 2 2 2 3 3 3 3 2 2 2 3 2 3 3 2 2 3 3 2 2 3 2 3 2 3 1 3 1 2 2 2 2
##  [334] 3 3 2 2 1 2 2 3 2 3 2 2 2 2 1 1 1 2 3 2 3 2 1 1 2 2 2 1 2 3 1 2 2 2 2 2 1
##  [371] 2 2 2 2 2 2 2 2 2 2 2 1 3 2 2 2 2 3 3 1 1 2 3 2 1 2 2 2 1 2 2 1 2 1 3 2 2
##  [408] 2 2 2 1 1 2 2 1 1 2 3 2 2 2 3 3 3 3 3 3 1 2 1 1 2 2 2 3 3 2 2 2 2 2 3 2 1
##  [445] 1 3 2 1 1 2 3 2 2 2 2 1 2 3 2 3 3 1 1 2 1 3 3 2 1 1 1 2 2 2 2 3 2 2 2 2 3
##  [482] 2 1 3 3 2 2 2 2 3 2 2 3 2 2 3 3 2 2 3 1 1 2 2 1 1 2 2 2 2 2 2 3 2 3 2 2 2
##  [519] 2 1 2 2 2 2 2 2 2 2 2 2 3 2 1 1 1 1 1 1 2 2 3 3 2 3 1 2 3 2 2 3 1 2 2 1 2
##  [556] 2 1 2 1 2 3 2 3 2 2 3 1 2 2 3 3 2 1 2 2 2 2 2 2 3 3 2 3 3 2 3 3 2 2 3 1 2
##  [593] 3 2 2 2 3 2 1 2 2 2 2 2 2 2 3 3 1 3 2 2 1 3 3 1 2 2 2 2 1 3 3 2 2 1 3 3 3
##  [630] 2 1 1 2 1 2 2 2 1 3 3 2 1 1 1 1 3 2 1 3 1 1 1 1 1 2 3 2 2 2 2 2 1 3 2 3 2
##  [667] 2 3 2 1 3 2 1 2 3 3 1 2 2 2 2 2 1 2 2 1 3 1 3 2 2 3 3 3 1 1 2 1 1 1 1 2 2
##  [704] 2 2 3 3 2 1 2 1 2 2 3 2 2 3 1 2 2 3 2 2 2 2 2 2 3 3 2 1 3 3 1 3 3 1 3 2 3
##  [741] 2 2 2 1 3 2 3 3 3 3 3 3 1 3 2 2 1 1 2 2 3 3 1 1 3 2 1 2 3 3 3 2 2 1 3 2 2
##  [778] 1 1 1 2 3 1 1 1 3 1 1 3 3 1 2 3 2 2 3 3 1 3 2 1 3 1 1 3 1 1 1 1 1 3 2 2 1
##  [815] 1 2 3 3 2 3 2 3 3 3 3 3 3 2 3 2 2 3 2 2 2 2 2 2 2 3 3 1 1 2 2 2 1 2 1 2 2
##  [852] 1 2 1 2 2 1 1 3 3 2 1 3 2 3 2 2 1 1 2 3 1 2 2 2 2 2 2 2 2 2 2 3 3 1 2 2 2
##  [889] 2 2 3 2 2 2 2 2 3 2 2 3 3 1 1 3 2 2 2 2 3 1 3 3 3 2 2 2 2 2 2 2 2 3 3 1 2
##  [926] 2 2 2 1 1 3 1 1 3 1 3 2 1 3 1 3 1 2 2 3 3 1 2 2 3 2 3 2 2 3 2 2 2 2 1 1 2
##  [963] 1 2 1 2 2 1 2 2 2 2 1 2 2 2 3 3 3 2 2 3 2 2 2 2 3 3 2 3 2 2 1 2 2 2 2 3 1
## [1000] 2 2 3 3 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 3 3 2 2 2 3 2 1 2 3 3 3 3 2
## [1037] 2 3 3 2 2 2 3 1 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 3 2 2 2 1 3 2 3 2 2 1 3
## [1074] 3 1 2 3 2 1 3 2 3 3 1 3 3 2 2 1 2 1 1 2 3 3 2 3 3 3 2 3 2 2 3 3 2 2 2 2 2
## [1111] 2 2 1 2 2 2 2 3 2 1 1 2 2 2 2 1 2 3 2 3 2 2 1 2 2 1 2 3 2 3 2 3 2 2 2 3 2
## [1148] 2 3 3 3 3 2 2 3 2 1 1 2 1 2 1 3 2 2 3 2 2 2 2 2 3 2 2 1 3 1 2 2 2 3 3 2 2
## [1185] 2 1 2 2 2 2 2 3 1 1 3 1 3 1 1 3 2 2 3 2 2 2 1 3 2 2 2 3 2 2 2 3 2 1 3 2 2
## [1222] 3 2 3 2 2 2 3 2 2 1 2 3 3 2 2 2 2 2 2 2 2 3 2 3 1 3 2 2 3 3 2 2 2 2 2 2 3
## [1259] 3 3 3 2 2 2 2 3 2 1 3 1 1 2 3 2 1 2 1 3 2 2 2 2 3 3 2 2 2 2 2 2 2 3 2 2 2
## [1296] 3 3 3 3 2 3 3 3 3 1 2 2 2 2 2 1 2 2 2 3 3 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 3
## [1333] 1 2 2 2 3 3 3 1 2 2 1 1 2 2 3 3 3 2 2 2 2 3 2 2 1 3 2 2 3 3 2 2 2 2 2 2 1
## [1370] 1 2 3 3 3 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 1 1 3 1 1 3 2 3 3
## [1407] 3 2 1 2 3 3 2 3 2 2 2 1 2 2 1 2 2 3 3 2 2 2 2 2 2 2 2 2 2 1 1 2 3 3 1 2 2
## [1444] 3 1 2 2 2 2 3 1 2 1 1 1 1 2 2 2 2 2 3 2 2 3 2 3 2 2 2 1 1 2 3 2 2 3 2 2 2
## [1481] 2 2 2 2 2 2 2 1 1 1 2 2 3 3 3 3 2 2 2 2 3 2 2 2 1 2 3 3 1 2 2 2 3 3 3 2 3
## [1518] 3 3 2 2 2 2 3 1 1 2 3 2 1 1 3 3 3 1 2 2 2 2 3 1 2 2 2 2 3 2 2 2 2 3 2 2 3
## [1555] 3 2 3 3 3 2 2 2 2 2 2 2 1 2 3 3 2 3 3 2 3 2 2 2 1 1 1 3 3 1 3 1 2 3 1 2 2
## [1592] 3 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 3 1 1 2 3 2 2 1 2 3 2 1 3 2 2 3 3 2 3 3 2
## [1629] 2 3 2 2 3 3 1 2 3 2 1 2 3 3 1 1 1 3 3 2 2 2 2 1 2 1 3 2 2 3 3 3 3 3 2 1 1
## [1666] 2 2 3 2 2 3 3 3 3 3 3 3 2 3 3 1 3 1 1 1 1 3 1 2 2 2 1 2 3 2 2 1 2 2 2 2 3
## [1703] 2 3 3 2 2 1 2 1 2 3 2 3 3 2 3 2 2 2 2 2 1 2 3 2 3 1 3 3 2 1 2 2 3 2 2 3 2
## [1740] 2 3 2 1 3 2 1 2 3 2 2 3 2 1 2 2 1 3 3 2 3 3 3 2 3 3 3 3 2 2 2 3 2 3 3 2 1
## [1777] 1 2 2 2 3 2 3 2 3 3 2 2 3 2 2 2 2 1 2 3 2 3 1 3 3 2 1 3 3 1 2 1 1 1 2 2 2
## [1814] 2 2 2 1 2 2 2 2 2 2 3 1 2 1 3 3 3 3 1 3 1 3 1 3 2 3 3 2 2 1 3 2 1 3 1 1 2
## [1851] 2 2 2 2 3 1 2 2 2 1 2 3 2 3 2 3 1 1 3 3 2 3 1 1 3 1 2 2 1 1 1 1 1 1 3 1 2
## [1888] 1 1 2 1 1 1 1 3 3 3 3 3 2 1 3 3 2 3 3 2 2 3 3 3 2 3 2 2 3 3 2 1 1 3 3 3 2
## [1925] 2 2 2 2 3 3 3 2 3 3 2 3 3 2 2 3 1 3 1 1 2 3 1 2 2 3 1 2 3 2 1 1 2 3 1 2 2
## [1962] 2 2 1 3 1 3 3 2 2 2 2 2 1 1 3 3 3 2 1 1 1 1 1 1 1 2 3 3 2 2 1 3 3 1 1 1 1
## [1999] 1 1 2 2 2 3 2 1 1 2 2 3 2 2 2 1 3 3 2 3 2 1 3 2 3 2 1 2 2 3 3 3 2 3 2 1 2
## [2036] 2 2 3 2 2 3 2 3 2 1 2 3 3 3 3 1 1 1 2 2 2 1 2 3 2 1 2 2 2 3 2 2 2 2 2 2 3
## [2073] 3 3 3 2 2 2 2 3 2 3 3 3 2 2 2 2 2 2 3 3 2 1 1 3 3 1 1 3 1 2 1 2 3 3 1 3 1
## [2110] 3 1 1 1 2 3 1 2 2 2 2 3 2 1 3 2 1 1 2 2 2 2 1 1 2 3 3 3 3 3 1 1 1 1 2 3 3
## [2147] 3 1 1 3 2 1 2 2 3 3 2 2 2 2 3 2 2 2 1 2 2 1 1 1 1 1 3 1 1 1 3 3 1 2 3 2 3
## [2184] 3 3 1 2 2 3 1 2 3 3 1 3 2 2 2 3 1 3 2 3 2 2 3 1 2 2 2 2 2 2 2 2 1 2 2 3 2
## [2221] 2 1 3 1 3 1 1 1 1 3 1 3 2 2 3 3 3 1 3 3 3 3 3 1 1 3 1 3 2 3 1 2 3 3 2 2 2
## [2258] 2 3 2 3 2 2 1 3 1 3 3 3 1 2 1 3 3 2 1 3 2 3 2 2 2 1 1 1 1 3 3 1 3 2 2 2 3
## [2295] 3 3 3 3 2 3 2 1 3 1 2 2 2 1 2 2 1 2 2 2 3 3 2 3 1 2 3 2 2 2 3 2 1 2 1 2 1
## [2332] 2 3 2 1 3 2 3 3 3 2 3 2 2 3 3 1 2 2 3 1 1 2 2 2 3 2 2 1 3 3 2 2 1 2 2 3 3
## [2369] 1 3 3 3 2 2 2 2 2 3 1 2 2 3 2 3 2 1 2 2 2 2 2 3 2 2 1 2 1 2 2 2 2 2 3 2 2
## [2406] 3 2 3 1 2 3 1 3 2 3 2 3 1 2 1 3 1 3 2 3 2 2 3 2 3 3 3 3 1 1 3 3 2 3 3 3 2
## [2443] 1 2 1 2 2 3 2 2 3 3 2 2 1 3 3 3 3 1 1 2 2 3 2 1 1 2 2 2 2 3 3 1 2 2 2 2 3
## [2480] 1 2 3 3 3 3 3 3 2 2 1 3 3 1 1 1 2 1 1 1 3 1 3 2 2 2 1 3 3 1 3 3 1 2 2 2 2
## [2517] 2 1 2 3 1 2 3 2 1 3 2 2 2 1 2 3 3 1 2 3 1 3 3 2 1 3 3 1 2 2 3 3 3 1 3 3 1
## [2554] 2 2 2 1 2 2 2 2 3 2 2 2 1 2 2 3 3 2 2 2 3 2 3 3 3 2 2 1 1 1 1 1 1 1 2 1 2
## [2591] 1 3 2 1 2 2 2 2 2 1 2 1 2 2 2 2 2 3 3 2 1 3 2 3 2 1 1 1 2 3 1 3 2 1 2 2 2
## [2628] 2 3 2 2 2 1 1 1 2 1 1 2 2 2 1 3 3 3 3 2 2 1 3 3 2 2 1 3 3 3 2 1 2 3 2 2 2
## [2665] 3 3 2 2 2 3 2 2 1 2 3 3 2 2 2 3 2 1 3 2 2 2 2 3 1 3 2 3 2 2 3 3 1 2 2 2 2
## [2702] 1 1 1 3 3 3 1 3 1 1 2 3 3 3 3 1 3 2 3 2 1 3 2 3 1 2 2 1 2 3 3 2 2 3 1 2 3
## [2739] 1 2 1 1 2 2 2 2 2 1 2 1 2 2 3 2 2 3 1 2 3 3 3 2 2 2 3 3 2 3 2 2 1 2 2 2 3
## [2776] 3 3 2 2 2 2 1 3 3 3 1 3 1 3 3 3 3 3 2 3 3 2 3 2 2 3 3 3 1 3 2 1 3 3 2 2 2
## [2813] 2 2 2 2 2 2 3 2 2 3 3 2 3 2 3 2 2 1 1 3 2 2 2 2 3 3 2 2 2 2 2 2 3 3 2 2 2
## [2850] 3 1 3 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 2 2 3 2 3 2 3 3
## [2887] 2 2 2 3 2 2 3 2 2 2 3 3 2 1 2 2 3 2 2 2 3 2 2 2 3 3 2 2 2 2 2 2 2 3 1 2 3
## [2924] 3 1 3 2 3 3 1 2 2 1 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 2 2 2 2 2 3 2 2 2 3
## [2961] 1 3 1 2 2 2 1 1 2 2 2 2 3 1 3 2 3 1 1 3 2 3 3 2 3 3 3 3 3 2 2 3 3 3 3 2 2
## [2998] 2 2 3 3 3 3 3 2 2 2 2 2 1 2 3 2 1 1 3 1 2 3 3 1 2 2 1 1 2 3 1 3 1 1 3 2 3
## [3035] 3 2 2 1 2 1 1 1 1 2 1 3 3 2 2 3 2 3 3 2 3 2 2 3 3 2 3 3 3 3 1 1 1 2 2 3 3
## [3072] 3 2 2 3 2 2 3 2 2 2 3 2 2 2 3 2 3 2 2 2 1 2 2 3 3 2 2 3 2 2 3 3 2 2 2 2 2
## [3109] 1 2 2 2 2 1 2 2 2 2 2 2 3 2 2 3 1 3 3 2 2 2 2 1 2 2 2 1 2 2 3 2 1 3 2 2 1
## [3146] 1 1 3 1 3 3 1 2 2 2 3 1 3 2 3 2 1 3 2 2 2 3 3 2 3 2 2 3 3 3 1 2 3 2 2 2 3
## [3183] 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 3 3 3 2 3 2 3 3 2 3 2 2 1 3 1 2 2 2 2 2 3 2
## [3220] 2 2 1 2 1 2 2 2 3 3 2 3 3 1 1 2 2 3 2 2 2 2 2 2 2 3 2 3 2 2 2 2 1 3 2 2 1
## [3257] 1 1 1 1 1 2 1 2 2 3 3 2 2 3 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 3 2 2 3
## [3294] 2 2 3 3 3 2 3 2 2 3 2 2 2 2 3 1 3 2 2 2 2 3 2 2 3 2 2 1 2 2 2 2 2 2 2 2 2
## [3331] 3 3 2 3 2 1 1 1 1 2 2 2 3 1 1 3 3 3 1 3 3 2 3 3 1 3 1 1 3 3 2 3 3 2 2 2 2
## [3368] 2 2 3 2 2 3 2 2 3 2 2 2 1 2 3 2 2 3 2 2 1 2 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2
## [3405] 2 2 3 3 3 2 2 3 3 1 2 2 1 2 1 2 1 2 2 1 2 1 1 3 2 1 1 1 2 2 3 2 3 3 1 3 2
## [3442] 2 2 2 1 3 3 3 2 3 3 3 3 3 2 2 2 2 2 1 3 2 2 2 3 2 2 3 2 1 2 3 2 2 2 3 3 2
## [3479] 3 2 2 2 3 3 3 2 2 2 2 3 2 2 1 1 3 2 2 1 2 2 1 3 1 2 2 2 2 2 1 3 3 2 2 1 3
## [3516] 2 2 2 2 3 3 2 3 3 3 3 2 3 3 2 3 3 3 1 3 3 2 2 2 2 2 2 3 2 2 1 3 1 3 2 2 1
## [3553] 1 1 2 3 3 2 1 2 2 2 2 2 2 3 2 2 2 3 2 3 2 2 2 2 1 2 2 3 3 3 2 2 3 2 2 1 1
## [3590] 1 2 2 2 2 3 2 2 1 2 3 1 2 3 3 2 3 3 2 1 1 3 2 2 1 3 2 1 2 1 1 2 1 2 1 3 1
## [3627] 2 3 3 3 3 3 1 3 3 2 2 2 2 2 2 3 3 2 2 3 2 2 3 2 2 1 2 1 3 3 3 1 3 2 2 2 2
## [3664] 2 3 2 1 2 3 2 1 3 2 2 2 2 2 3 1 3 1 3 2 1 3 1 2 1 2 3 2 3 3 3 1 2 2 1 2 2
## [3701] 2 1 1 1 1 1 2 2 1 2 2 1 3 3 2 3 2 2 1 3 2 2 2 3 2 3 2 3 3 2 1 3 2 2 1 2 3
## [3738] 2 2 1 1 1 1 1 1 1 2 3 1 3 1 2 2 2 2 3 2 2 2 3 3 2 3 2 2 1 1 1 1 1 2 3 1 2
## [3775] 1 3 3 3 2 3 1 3 3 3 1 2 1 3 3 3 3 3 1 1 2 1 1 3 2 2 2 2 2 3 2 2 3 3 3 3 1
## [3812] 2 2 2 1 3 3 3 3 3 2 1 3 3 2 2 3 2 3 3 3 2 2 1 3 2 2 2 3 2 2 3 2 2 2 1 3 1
## [3849] 2 3 3 2 2 2 2 2 2 2 3 3 3 1 3 3 2 2 2 3 1 1 3 3 1 1 3 2 1 2 3 3 1 1 2 3 1
## [3886] 2 3 2 1 1 3 2 3 2 3 2 3 1 2 3 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 3 3 2 2 2 1
## [3923] 2 2 2 2 3 2 1 1 2 2 3 3 2 1 3 3 2 3 3 2 2 3 3 3 2 2 3 3 2 1 2 2 1 2 2 2 3
## [3960] 3 1 1 3 2 1 3 3 3 3 2 3 2 2 3 3 2 3 3 2 2 1 2 2 3 3 3 3 1 2 3 1 2 2 3 3 2
## [3997] 3 2 3 3 3 2 1 2 3 2 2 2 1 3 2 2 3 1 1 3 3 2 1 3 3 2 3 1 2 2 1 2 2 3 2 2 1
## [4034] 3 1 3 2 2 2 2 2 3 1 1 1 1 1 1 1 2 1 1 3 1 1 1 1 2 1 2 3 3 2 2 2 3 3 3 3 1
## [4071] 3 2 1 2 3 1 3 2 2 3 3 1 3 2 1 2 2 2 2 2 3 3 2 3 2 3 2 3 3 2 1 3 3 2 2 2 2
## [4108] 1 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 2 2 3 3 3 1 3 2 3 3 1 3 2 2 3 2 3 1 1 3 3
## [4145] 1 1 1 3 1 2 3 2 1 2 3 2 3 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 3 2 1 2 2 2 2 1 1
## [4182] 3 3 2 2 1 2 2 2 3 2 2 3 2 3 2 2 3 1 2 2 2 2 2 3 2 1 2 2 2 3 1 3 2 1 1 1 2
## [4219] 2 1 2 3 2 2 3 3 3 2 1 3 3 2 3 3 2 1 1 2 1 2 3 2 2 2 1 1 2 2 2 2 1 2 1 2 3
## [4256] 3 3 2 2 3 3 2 3 3 2 3 2 2 2 1 3 1 1 1 1 1 2 3 2 1 1 1 1 3 3 3 3 2 2 1 1 1
## [4293] 3 1 3 1 3 3 3 3 3 3 1 2 2 3 1 2 2 2 3 2 2 2 3 3 3 2 3 3 3 1 2 2 2 3 3 3 3
## [4330] 3 3 3 1 1 1 1 1 1 1 2 1 1 2 2 2 2 2 2 1 3 3 2 2 2 3 2 3 2 3 3 2 2 2 1 3 3
## [4367] 2 1 1 3 2 2 2 2 2 2 3 2 3 3 3 2 2 2 3 3 3 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 3
## [4404] 3 3 3 2 2 3 2 3 2 3 2 3 1 3 2 1 1 1 1 2 2 1 3 3 1 2 2 1 2 3 3 2 3 2 3 2 3
## [4441] 1 2 1 3 3 3 2 2 2 1 1 3 2 2 1 1 1 1 2 3 1 2 2 2 1 1 2 3 2 2 2 2 2 2 3 2 2
## [4478] 3 3 2 1 3 2 2 2 2 3 2 2 2 2 2 3 3 2 1 2 3 3 3 2 1 2 3 3 3 2 2 2 2 2 2 2 2
## [4515] 3 2 2 3 2 1 3 2 2 1 3 1 1 3 2 1 1 1 2 3 2 1 1 2 3 3 2 2 3 3 2 2 2 2 1 3 2
## [4552] 2 3 2 2 1 2 1 3 3 2 2 3 3 2 3 3 3 2 2 2 2 2 2 3 2 2 1 2 1 1 2 3 3 2 1 2 3
## [4589] 3 2 3 3 3 1 2 3 2 1 1 2 2 3 1 2 3 2 3 1 2 2 2 2 3 3 3 2 2 2 2 3 2 2 2 3 2
## [4626] 3 3 2 2 2 1 2 3 3 3 3 2 2 2 1 2 2 2 3 2 2 2 2 2 3 3 2 2 2 3 3 3 3 2 2 2 2
## [4663] 3 3 3 3 3 3 3 2 1 2 2 3 2 2 3 3 1 2 2 2 2 2 2 3 2 1 1 1 1 1 2 1 1 3 2 3 2
## [4700] 1 1 3 3 2 3 3 2 1 2 2 2 1 3 2 2 3 2 2 3 3 2 3 3 3 3 1 2 3 3 2 2 3 3 2 2 2
## [4737] 2 3 2 2 2 2 2 2 3 2 3 1 1 1 2 3 2 2 3 2 2 2 2 3 3 1 2 2 3 2 2 2 1 1 1 1 3
## [4774] 2 2 2 2 2 1 2 3 3 1 1 1 2 2 2 1 2 2 1 2 3 3 2 2 2 1 3 2 2 2 3 2 3 3 3 3 2
## [4811] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 3 3 3 2 3 3 2 2 2 2 2 2 1 3 3 2 3 2 2 2 2 2 2
## [4848] 2 2 3 2 3 2 2 2 1 1 3 3 3 3 2 2 2 2 2 2 2 2 3 3 2 1 2 2 2 2 2 2 3 3 1 2 2
## [4885] 3 3 3 2 2 1 2 2 2 2 3 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 10046.299  8326.033  7260.271
##  (between_SS / total_SS =  81.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
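Because `kmeans()` starts from randomly chosen centers, the cluster numbering and sizes can change from run to run. The sketch below shows one way to make a run repeatable and to print only the compact summaries rather than the full clustering vector. It assumes a simulated data frame (`toy`, made-up values) in place of `important_white`; the seed values and `nstart = 25` are illustrative choices, not settings from the original analysis.

```r
# Simulated stand-in for important_white; these numbers are made up.
set.seed(42)
toy <- data.frame(
  alcohol        = c(rnorm(40, 9.5, 0.3), rnorm(40, 11, 0.3), rnorm(40, 10.3, 0.3)),
  residual_sugar = c(rnorm(40, 14, 1),    rnorm(40, 2, 0.5),  rnorm(40, 8, 1))
)

set.seed(123)                                  # fixes the random starting centers
fit <- kmeans(toy, centers = 3, nstart = 25)   # keep the best of 25 random starts

fit$size                                       # rows per cluster
round(fit$centers, 2)                          # cluster means for each variable
round(100 * fit$betweenss / fit$totss, 1)      # the "between_SS / total_SS" percentage
```

The components used here (`size`, `centers`, `betweenss`, `totss`) are the ones listed under "Available components" in the output above.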

After reviewing the clusters, the team found a clear way to organize the store. In the k-means results above, the cluster means for residual sugar differ sharply (about 14.4, 2.1, and 7.7 g/dm^3), while the means for quality, volatile acidity, and density barely change between clusters. The white wine in the store should therefore be organized by the amount of residual sugar present. Although the benefit may not be obvious at first, the team proposed this to the store.

By organizing the wine by residual sugar, the store gets a definite organization for its white wine collection. First and foremost, as stated in the data understanding table, quality is determined from sensory data. Thus, there is no scientifically exact way to quantify quality, since it is subjective. Clustering on objective data that can be measured in a lab or at the winery, however, gives each wine a definite location in the store. Sensory data would work if the store always ordered the same wines, but if the store brings in a new white wine, it cannot place that wine in the correct “quality” cluster until the wine has been reviewed. In other words, because a new wine will not have a definite quality value but will have a definite residual sugar value, the organization of the store will be more consistent. In addition, the store could promote the “healthier” wine, meaning wine with less residual sugar, on the bottom shelf and work its way up. This could be convenient for customers looking to decrease their sugar intake.

For white wine, the following rules would “classify” each wine and determine which shelf it belongs on:

  • Bottom Shelf: white wine with residual sugar at or under 3 g/dm^3
  • Middle Shelf: white wine with residual sugar between 3 and 14 g/dm^3
  • Top Shelf: white wine with residual sugar above 14 g/dm^3
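The shelf rules above can be written as a small lookup. This is a minimal sketch, assuming a hypothetical helper name (`assign_white_shelf`) and made-up example values; it also assumes that a wine with exactly 14 g/dm^3 goes on the middle shelf, which the list does not state explicitly.

```r
# Assign a shelf from residual sugar (g/dm^3) using the cut-offs listed above.
# assign_white_shelf is a hypothetical helper name, not from the original analysis.
assign_white_shelf <- function(residual_sugar) {
  cut(
    residual_sugar,
    breaks = c(-Inf, 3, 14, Inf),          # (-Inf, 3], (3, 14], (14, Inf)
    labels = c("Bottom", "Middle", "Top"),
    right  = TRUE                          # intervals are closed on the right
  )
}

example_sugar <- c(1.2, 3, 7.5, 14, 20.7)  # made-up values for illustration
shelves <- assign_white_shelf(example_sugar)
as.character(shelves)
# "Bottom" "Bottom" "Middle" "Middle" "Top"
```

Applied to the real data, something like `table(assign_white_shelf(white$residual.sugar))` would show how many wines land on each shelf.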

Red Wine Dendrogram Using the Variables Determined in Question 1 of the Analysis

Now the team will look at the most important variables found for the red wine and follow the same process to cluster the red wine. The results will likely differ from those for the white wine. The code used for this analysis is displayed below.

red_quality <- red$quality
red_alcohol <- red$alcohol
red_volatile_acidity <- red$volatile.acidity
red_sulphates <- red$sulphates
red_chlorides <- red$chlorides

important_red <- data.frame(red_quality, red_alcohol, red_volatile_acidity, red_sulphates, red_chlorides)

After creating the dataset, the team ran hierarchical clustering to see visually whether the data would split into 3 clusters more evenly. The code for the red wine is shown below.

#hierarchical clustering 
#perform hierarchical clustering on the first five columns of data
hc_result_important_red <- hclust(dist(important_red[,1:5]))
#plot a dendrogram displaying your results
plot(hc_result_important_red)
#cut the dendrogram into three clusters
rect.hclust(hc_result_important_red, k = 3)

As seen above, the red wine splits into more even clusters when only the variables deemed significant in question 1 are considered. However, with so many data points it is nearly impossible to work out the distinctions from the visual alone, so the team went in a different direction to get more information on the clusters.

Red Wine K-Means Clustering with the Variables Determined in Question 1

Now, to get a better general idea of each of these clusters the team once again ran K-means clustering against the data. This allows the team to present the averages for the clusters in each of the columns.

fit_important_red <- kmeans(important_red[, 1:5], 3)
fit_important_red
## K-means clustering with 3 clusters of sizes 707, 469, 423
## 
## Cluster means:
##   red_quality red_alcohol red_volatile_acidity red_sulphates red_chlorides
## 1    4.899576    9.818175            0.5899646     0.6198444    0.09451061
## 2    6.127932   10.028181            0.4980384     0.6883156    0.08856716
## 3    6.321513   11.871592            0.4569740     0.6887234    0.07447281
## 
## Clustering vector:
##    [1] 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 1 2 1 2 1 2 1 2 2
##   [38] 2 1 1 1 1 2 1 1 3 1 1 1 1 1 2 2 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1
##   [75] 1 1 1 2 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 3 1 1 1 2 2 2 2 1 1 1 1 1 2 1 1
##  [112] 1 1 2 1 2 2 2 2 2 1 2 1 1 1 1 1 1 2 1 1 3 3 2 2 1 1 1 1 1 1 1 3 1 3 1 1 1
##  [149] 2 2 2 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 1 1 2 2 3 1 1 1 2 1 1 1 1 1 1 2
##  [186] 1 1 1 1 1 1 2 1 1 1 1 1 2 3 1 2 1 1 1 2 2 2 1 1 2 3 2 2 1 2 1 1 1 1 1 2 1
##  [223] 1 2 1 2 2 1 2 1 3 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 1 1 2 1 1 1 1
##  [260] 2 1 1 1 1 1 2 1 3 2 2 2 2 1 1 1 2 2 2 3 2 2 2 1 2 1 1 2 2 2 1 2 1 2 2 2 1
##  [297] 1 1 1 1 2 2 1 1 1 2 1 2 2 2 2 2 2 1 1 3 1 2 3 2 3 1 1 2 2 2 3 3 2 1 3 3 2
##  [334] 1 3 3 3 1 3 2 2 3 2 2 2 1 3 3 2 2 2 2 1 3 3 3 1 3 2 2 1 2 1 1 2 3 2 1 1 3
##  [371] 1 2 2 1 2 3 3 3 3 2 2 2 2 2 1 2 2 2 2 2 3 2 1 1 1 3 1 3 3 1 1 2 2 2 1 2 2
##  [408] 2 3 1 2 1 1 3 1 1 3 1 2 1 3 3 1 3 1 3 3 2 1 2 3 1 3 1 2 1 2 3 2 1 2 3 3 3
##  [445] 3 2 1 1 2 2 2 2 2 3 1 3 1 1 3 1 3 1 3 1 2 1 3 3 2 1 1 3 2 1 2 1 1 3 1 2 1
##  [482] 3 1 1 3 1 1 2 2 2 2 3 3 3 3 3 2 1 3 3 2 3 3 3 2 3 2 2 2 3 1 2 2 3 3 1 3 1
##  [519] 3 1 2 1 1 1 1 1 1 3 2 1 2 1 1 3 2 2 1 2 3 1 1 3 1 3 2 1 2 2 2 2 2 2 2 3 1
##  [556] 1 3 1 3 3 3 1 1 2 3 3 2 2 2 3 3 3 1 1 2 2 1 1 1 2 1 1 1 2 2 2 2 1 3 2 1 3
##  [593] 1 1 1 1 2 2 2 2 1 2 1 2 2 2 3 2 2 3 1 1 2 1 2 1 1 2 1 1 1 1 1 3 1 1 1 1 2
##  [630] 1 2 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 3 3 1 1 3 3 1 1 1 2 2 1 2 1 2 2 1 1
##  [667] 2 2 1 2 1 1 1 1 2 1 2 1 1 1 1 2 1 3 1 3 1 1 1 1 1 1 1 1 1 3 2 2 1 2 2 2 2
##  [704] 1 1 1 1 1 3 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 3 1 1 1 1 1 2 1 1
##  [741] 3 1 1 1 1 2 2 1 2 2 1 1 1 1 2 2 3 1 1 1 1 1 2 1 2 2 1 1 2 1 2 1 1 2 2 1 2
##  [778] 2 1 1 2 1 1 1 1 1 1 2 2 1 2 1 2 1 3 1 1 2 2 2 1 1 3 2 2 3 3 3 1 2 1 2 1 1
##  [815] 3 1 2 3 1 1 1 3 1 1 1 1 3 1 3 3 1 3 1 1 1 1 3 3 3 1 3 1 2 1 2 1 1 2 1 1 1
##  [852] 1 1 2 2 3 2 3 2 3 1 3 1 1 1 1 3 3 3 2 3 1 1 3 3 3 1 3 2 1 1 3 3 1 2 1 2 3
##  [889] 3 1 1 1 2 1 2 2 3 2 3 1 3 2 2 2 2 1 1 3 2 3 3 2 3 3 3 3 1 2 3 3 1 3 3 2 1
##  [926] 3 2 1 1 3 1 1 2 1 1 3 3 1 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 1
##  [963] 1 3 3 3 3 1 3 1 2 2 3 1 3 1 1 1 3 1 2 1 3 2 1 3 3 1 1 2 1 1 2 1 1 2 3 3 2
## [1000] 3 3 2 3 3 1 3 3 3 3 1 3 2 1 2 3 3 3 3 3 1 3 3 1 3 2 2 3 3 2 2 3 3 1 2 2 2
## [1037] 3 1 3 3 1 2 3 3 3 3 2 1 2 2 1 1 3 3 2 2 3 1 3 3 2 3 3 3 3 2 3 3 3 1 3 1 2
## [1074] 2 1 2 3 1 1 3 3 3 2 3 2 1 3 3 2 2 3 3 3 3 2 1 2 1 3 1 3 3 3 3 3 3 3 3 1 2
## [1111] 2 3 3 2 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 2 2 1 3 3 3 3 3 3 1 2 2 3 3 2 1 2 2
## [1148] 3 3 3 3 3 1 3 3 1 3 3 2 1 3 2 3 1 1 1 1 3 3 3 2 2 3 2 2 3 1 3 3 3 3 3 2 1
## [1185] 1 3 3 3 1 1 3 1 3 1 2 2 2 2 2 2 2 3 3 1 3 3 3 1 3 3 2 1 2 2 2 3 2 3 2 2 3
## [1222] 3 2 3 2 1 1 1 3 1 3 1 1 1 3 1 2 3 1 1 1 1 3 1 2 1 1 1 3 2 2 1 1 1 1 1 1 3
## [1259] 3 3 1 1 1 1 3 2 2 3 2 3 3 3 1 1 3 2 3 2 2 3 2 2 2 2 1 1 3 3 1 1 1 2 3 1 2
## [1296] 1 1 3 3 1 3 3 3 1 1 1 1 1 1 1 1 3 1 2 2 2 3 3 2 2 1 3 3 3 2 2 2 2 1 2 2 1
## [1333] 2 1 1 3 1 1 1 2 2 2 2 2 1 2 3 1 1 1 1 3 1 1 1 1 1 3 1 2 1 1 2 1 3 1 1 2 2
## [1370] 1 1 3 1 1 1 1 1 3 2 2 2 1 1 1 1 1 1 1 1 1 3 1 1 1 1 2 1 1 2 3 1 1 3 2 2 3
## [1407] 2 2 3 2 3 2 2 1 1 1 1 3 1 1 1 1 3 1 2 2 3 1 1 3 1 2 3 2 2 2 1 1 1 3 3 2 1
## [1444] 3 3 2 1 1 1 3 3 3 3 1 2 2 2 1 1 3 2 1 2 2 1 1 2 1 2 1 1 3 3 1 1 3 1 3 1 1
## [1481] 1 1 1 1 1 1 1 1 3 2 3 3 3 1 2 2 1 3 2 3 1 1 1 3 3 1 2 3 2 1 2 1 2 2 2 2 1
## [1518] 2 1 1 2 1 1 1 2 1 2 2 2 2 3 1 2 1 3 2 2 2 3 1 3 3 2 2 3 2 1 3 1 3 1 1 3 1
## [1555] 3 2 1 3 1 1 1 1 1 1 1 2 3 1 1 3 3 3 1 3 2 3 2 3 3 3 3 3 3 1 3 3 3 2 3 1 3
## [1592] 3 3 2 1 3 3 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 368.1572 230.7635 436.3967
##  (between_SS / total_SS =  65.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
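K-means groups rows by Euclidean distance, so a column with a wider numeric range (here alcohol, in percent) carries more weight than a column with a narrow range (such as chlorides). One way to check how much this affects the grouping is to standardize the columns with `scale()` and compare the two results. The sketch below is only an illustration under stated assumptions: `toy` is simulated with made-up values in place of `important_red`, and the seed and `nstart` settings are arbitrary.

```r
# Simulated stand-in for important_red; the numbers are made up.
set.seed(7)
toy <- data.frame(
  alcohol   = rnorm(150, mean = 10.4, sd = 1.1),
  chlorides = rnorm(150, mean = 0.09, sd = 0.05)
)

toy_scaled <- scale(toy)              # each column: mean 0, standard deviation 1

set.seed(7)
fit_raw    <- kmeans(toy,        centers = 3, nstart = 25)
set.seed(7)
fit_scaled <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Cross-tabulate the two labelings to see how much the grouping changes.
table(raw = fit_raw$cluster, scaled = fit_scaled$cluster)
```

If most rows fall into matching cells of the table, the grouping does not depend much on the units; if they scatter, the unscaled clusters mainly reflect the widest-ranging column.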

After reviewing the clusters, the team found a variable that the red wine could be clustered on. In the k-means results above, the cluster means for alcohol are about 9.8%, 10.0%, and 11.9%, so the red wine in the store should be organized by the amount of alcohol present. Although the benefit may not be obvious at first, the team proposed this to the store.

By organizing the red wine by alcohol content, the store gets a definite organization for its red wine collection. First and foremost, as stated in the data understanding table, quality is determined from sensory data. Thus, there is no scientifically exact way to quantify quality, since it is subjective. Clustering on objective data that can be measured in a lab or at the winery, however, gives each wine a definite location in the store. Sensory data would work if the store always ordered the same wines, but if the store brings in a new red wine, it cannot place that wine in the correct “quality” cluster until the wine has been reviewed. In other words, because a new wine will not have a definite quality value but will have a definite alcohol content, the organization of the store will be more consistent. In addition, individuals cannot reach a consensus on a subjective value, because it is a matter of opinion and varies from person to person. To err on the safe, data-driven side, the team believed that using an objective variable in the dataset would be more beneficial. Based on the differences in mean alcohol content across the three red wine clusters, the team decided this is how the red wine should be organized. Note that this organization also makes it convenient for customers to know the percent alcohol of the wine they plan to buy and consume, which is especially important for safe and smart alcohol consumption.

For red wine, the following rules would “classify” each wine and determine which shelf it belongs on:

  • Bottom Shelf: red wine with an alcohol value at or under 9.5%.
  • Middle Shelf: red wine with an alcohol value between 9.5% and 12%.
  • Top Shelf: red wine with an alcohol value above 12%.

Extent of Analysis

While the analysis done is thorough, much more could be learned if additional data were provided. In addition, if the team had more time, the analysis could have gone into more detail. For example, cross validation could have been performed on the data sets and models, and values such as the number of trees and the potential for more transformations could have been explored. However, within the time and resources available, the analysis team answered the questions as fully as possible and still provided a thorough investigation of both.
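As an illustration of the cross validation mentioned above, the following is a minimal sketch of 5-fold cross validation for a linear model in base R. Everything in it is an assumption made for the example: the data are simulated, the single predictor and the choice of five folds are arbitrary, and the model is not one of the models fitted in this report.

```r
# Simulated data; 'quality' and 'alcohol' here are made-up numbers.
set.seed(2024)
n   <- 200
toy <- data.frame(alcohol = runif(n, 8, 14))
toy$quality <- 2 + 0.35 * toy$alcohol + rnorm(n, sd = 0.7)

k     <- 5
folds <- sample(rep(1:k, length.out = n))     # random fold label for every row

# Fit on k - 1 folds, measure squared error on the held-out fold
fold_mse <- sapply(1:k, function(i) {
  train <- toy[folds != i, ]
  test  <- toy[folds == i, ]
  model <- lm(quality ~ alcohol, data = train)
  mean((test$quality - predict(model, newdata = test))^2)
})

mean(fold_mse)                                # cross-validated estimate of test MSE
```

Averaging the error over all folds gives a steadier estimate than a single training/testing split, at the cost of fitting the model k times.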

Deployment

This section provides a quick summary of the overall project and how it was completed. It also discusses how the results of the analysis can be used by wineries, stores, and/or consumers, along with the limitations of the analysis and how it could be improved. In the simplest terms, it provides a general synopsis of the findings of this entire R Markdown file. It should be a convenient way to find the conclusions of the analysis without all of the details involved.

Report Summary

This section will provide a recap of the 2 different questions that the analysis team was trying to solve as well as the proposed solutions found:

Question 1: Determine the characteristics that have the largest impact on the quality of taste between red and white wine. Consider all the variables made available in the study. Once determined, state clearly which characteristic the company needs to focus on the most to improve or continue the high quality taste of that specific wine. Distinguish any differences between characteristics of quality red and white wine. What makes a red wine a high quality red wine along with what makes a white wine a high quality white wine?

WHITE WINE

  • Based on the results of the different models created by the team (simple linear regression, decision trees, random forests, and boosting), the top 4 variables for white wine, ranked by significance, were determined to be:

1. alcohol
2. volatile acidity
3. density
4. residual sugar

These were determined by looking at the MSE values for all of the models, as well as the results of each model. Thus, the ranking is not copied from any single model but combines the results of all of them. For each selected variable and its position in the list, the team weighed the benefits and drawbacks (frequency of importance, ranking of importance in each model, etc.).

RED WINE

  • Based on the results of the different models created by the team (simple linear regression, decision trees, random forests, and boosting), the top 4 variables for red wine, ranked by significance, were determined to be:

1. alcohol
2. sulphates
3. volatile acidity
4. chlorides

These were determined by looking at the MSE values for all of the models, as well as the results of each model. Thus, the ranking is not copied from any single model but combines the results of all of them. For each selected variable and its position in the list, the team weighed the benefits and drawbacks (frequency of importance, ranking of importance in each model, etc.).

Now to answer the second question.

Question 2: Group the wines based on similar characteristics to limit and diversify the supply in the store. The store has limited space and wants distinctly different wines of each type to fill the shelf space. Identify clusters in order to help determine the shelf position of the wine. The store needs the wine organized in order to maximize their sales and increase their profits. Answer the question: How can a store group the wines based upon similar characteristics to best position their wine selection within the store?

  • Based on the results of the two different analytical techniques, the team came to the consensus that the two wine types should be organized (or clustered) not by quality but by two measurable variables. The main reason is that quality is a subjective variable: it has no definite answer and is judged on an individual basis. A comparable example is rating pain on a scale of 1-10. People do not all feel pain in the same way (they have different pain tolerances), just as they do not all have the same taste buds. Thus the team decided to cluster white wine by the amount of residual sugar and red wine by the percent alcohol present. These are both measurable quantities and allow wine to be placed on shelves more consistently. This was the team’s final decision on shelving for both red and white wines:

    • Bottom Shelf: white wine with residual sugar at or under 3 g/dm^3 and red wine with an alcohol value at or under 9.5%.
    • Middle Shelf: white wine with residual sugar between 3 and 14 g/dm^3 and red wine with an alcohol value between 9.5% and 12%.
    • Top Shelf: white wine with residual sugar above 14 g/dm^3 and red wine with an alcohol value above 12%.

Overview of Techniques, Analysis, and Insights

Next, the techniques that were implemented to draw these conclusions will be discussed.

  • Linear regression helped establish clear conclusions about the importance of each variable to the target variable, quality. It was one of the four models used to rank the variables by importance.

  • Decision trees were used to give a general idea of which variables might be important. They were also one of the four models used to rank the variables by importance. The decision trees gave more insight into the specifics of the model by providing the values/cut-offs that determine its splits.

  • Random forests were the third model considered for determining variable importance. They gave the team an additional way of looking at and splitting up the data to determine this importance.

  • Boosting was the last of the four models used to determine variable importance in relation to quality. Once again, it provided an additional way to organize and test the data when determining the rankings.

  • Aggregating the data was used to understand relationships between the variables, in particular between the variables deemed significant and quality. This form of analysis allowed the analytics team to use quality averages as a means of comparison across the board. It also allowed the team to present direct correlations found within the data that anyone can follow and understand.

  • K-means clustering and hierarchical clustering were used to group wines into distinct groups of similar wines. The team looked to create three general clusters to start organizing the store: a bottom shelf, a middle shelf, and a top shelf. After using the dendrogram from the hierarchical clustering to confirm that each cluster contained a substantial amount of data, the team ran K-means clustering to get the averages for each cluster. These groups ultimately led to the final decisions about which variable each of the red and white wines was clustered on.

  • Scatter plots, histograms, and boxplots were utilized to understand the relationship between the target variable and the other variables in the dataset. These were specifically used in the data exploring stage to develop a better overall understanding of the data and how it was set up and working. The scatter plots allowed the team to see if the variables had any correlation to the quality of the wine, while the histograms gave the team a general overview of how the data was spread and the average.
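The clustering step described in the list above fixed the number of clusters at three to match the three shelves. A common way to check whether a given number of clusters suits the data is to compare the total within-cluster sum of squares across several values of k and look for the point where the curve flattens. The sketch below is an illustration only, using simulated two-column data (`toy`, made-up values) rather than the wine data; the range `1:6` and `nstart = 20` are arbitrary choices.

```r
# Simulated data with three well-separated groups; values are made up.
set.seed(11)
toy <- data.frame(
  x = c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 10)),
  y = c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 0))
)

ks  <- 1:6
# Total within-cluster sum of squares for each candidate number of clusters
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The `tot.withinss` component is one of those listed in the k-means output earlier in this report, so the same loop could be run on `important_white` or `important_red`.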

Then the analysis team reflected on some insights gained throughout the course of the project. Of course, the team learned a lot so it should be noted that this is not an exhaustive list, but merely a few key insights gained. These included:

  • Data is widely available in the world today, and it can provide answers or predictions for questions like never before. However, if the data is not structured in a way that supports the desired answers, it can be extremely difficult to answer even seemingly simple questions. In this project, the data at hand was clean and organized, but it still took a lot of time and effort to restructure it to fit the established MECE list. It is important to note that the time spent manipulating the data upfront was time saved during the actual analysis process in R. Thus, always make sure to work with a clean and organized data set before running any type of analysis.

  • The team also learned that the same set of data can lead to many different conclusions depending on the type of analysis run. Thus, it is important not only to split the data into training and testing sets to check the accuracy of conclusions, but also to consider running multiple analytic techniques to confirm that a conclusion holds across the board. The team also understands that each technique is used for different reasons and within different constraints. Thus, it is important to know and understand exactly what questions are being answered.

  • R is well suited to analyzing quantitative data. It is easy to run many different kinds of models in R on numeric values. R also makes it easy to publish a website that anyone can view and share.
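
The point above about checking conclusions on separate training and testing data can be sketched as follows. This is a minimal Python illustration, not the team's R workflow: the rows, the 70/30 split, the random seed, and the single alcohol cut-off (a one-split "decision stump") are all assumptions made for the example.

```python
import random

# Hypothetical rows of (alcohol %, is_good); NOT values from the project data.
# Wines at index 10 and above are labeled good, with two labels flipped as noise.
flipped = {7, 13}
rows = [(9.0 + 0.2 * i, int((i >= 10) != (i in flipped))) for i in range(20)]


def train_test_split(data, test_fraction=0.3, seed=42):
    """Shuffle a copy of the data and hold out a fraction of it for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(data, cut):
    """Share of rows where 'alcohol above the cut-off' matches the good/bad label."""
    hits = sum((alcohol > cut) == bool(label) for alcohol, label in data)
    return hits / len(data)


def fit_stump(train):
    """Choose the cut-off (midpoint between neighboring values) that fits training rows best."""
    values = sorted({alcohol for alcohol, _ in train})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda cut: accuracy(train, cut))


train, test = train_test_split(rows)
cut_off = fit_stump(train)
train_accuracy = accuracy(train, cut_off)
test_accuracy = accuracy(test, cut_off)

print(f"cut-off={cut_off:.2f}, train={train_accuracy:.2f}, test={test_accuracy:.2f}")
```

A large gap between the training and testing accuracy would suggest that the cut-off fits quirks of the training rows rather than a pattern that generalizes.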

Deployment Specifics

This section looks to make the results the analytics team found actionable. In short, it describes why this analysis is useful for wineries, stores, and consumers. A few next steps with the results would be to:

  • Help wineries produce better wine based upon the significant variables and the values of those variables that, on average, increase quality. In addition, having a quality rating available could give the winery a better idea of what price to sell the wine at to make the largest profit margin.

  • Present findings to stores to help them organize their wine in a new way. This analysis could also help stores decide which sectors of wine to buy from in order to increase the diversity of wine in the store while either decreasing or holding constant the amount of wine they order. The larger the selection, the more likely a customer is to find what they are looking for and spend money at the store.

  • Continue to gather customer feedback on the wines they buy and their quality ratings to use as a marketing strategy. People love feedback from other people, whether that be a Google review or word of mouth. The more that wineries and stores can get people to talk about the wine, the more likely they will be to sell it.

Lastly, the team reflected on the project and developed some suggestions on how to take the analysis completed in this project to the next level if it were to be redone or continued. In general, an expansion on the analysis is the biggest suggestion. For example:

  • Have access to more data, such as wine name, wine brand, wine price, grapes used, storage temperature, etc. The more information presented to the analytics team, the more opportunity there is to draw meaningful conclusions for whomever the analysis is being completed for. If the team had access to who was making which wine, the team could find distinctions and help out a “struggling” winery or give it pointers based on the data.

  • Predict the quality based upon values for the variables pulled from a lab. This would make it possible to answer the question “Will this wine most likely be good or bad?” without sending out samples to be tasted. People also love statistics, so having evidence to support claims would help the business sell more and attract more customers.
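
To make the last suggestion concrete, the sketch below predicts a quality score from a single lab measurement using ordinary least squares. It is a deliberately simple Python stand-in, not one of the four models the team ran: the alcohol and quality values are hypothetical, and the single-predictor straight line is an assumption chosen to keep the example short.

```python
from statistics import mean

# Hypothetical lab readings and tasting scores; NOT values from the project data.
alcohol = [9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5]
quality = [5, 5, 5, 6, 6, 6, 7, 7]


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(alcohol, quality)


def predict_quality(alcohol_percent):
    """Predicted quality score for a wine with the given alcohol reading."""
    return intercept + slope * alcohol_percent


# Score a new, untasted wine from its (hypothetical) lab value.
print(f"predicted quality at 11.8% alcohol: {predict_quality(11.8):.2f}")
```

A fuller version would use all of the lab variables together and would be validated on held-out wines before being trusted in place of a tasting.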