1. Define Phase

The Define phase is the first phase of a DMAIC project within a Six Sigma process. Its purpose is to establish which key output metric (referred to in Six Sigma as Y) is to be improved. The key points of this phase are:

  • The project charter, which defines the problem to be solved, the key project metrics, the goals, etc.
  • The SIPOC (Supplier-Input-Process-Output-Customer) process map
  • Voice of Customer and CTQs, which allow us to determine the project metric (Y)

1.1 Problem Statement

The wine factory produces high-standard wine and exports it to Italy, Spain and Switzerland. The product is a beverage made from fermented grapes and other fruit juices, with a relatively low alcohol content.

The problem to be solved was selected from the Voice of Customer (VOC), using a dataset covering the past two years. Consumers want a high-quality wine, meaning a wine with great taste, flavour and appearance. According to this dataset, low quality ratings have been recorded only for the white and red wine products. The current average quality score is 5.8, with a standard deviation of about 1. The target required to meet the VOC and the factory's commitments is a quality score of 7.

1.2 Voice of Customer and CTQs

A Critical to Quality Tree (also known as a CTQ Tree) is a Six Sigma tool used to identify the needs of the customer and translate that information into measurable product and process requirements. It allows organizations to understand the characteristics of a product or service that most drive quality for customers.

Before initiating any process improvement project, it is important for a business to determine the characteristics of the product or service that are critical to quality as judged by customers. These characteristics are known as CTQs.

Creating a CTQ Tree determines the drivers behind those characteristics and helps companies find ways to meet them. In this case, the CTQ is defined as follows:

Need from the customer: “I, as a customer, want a high quality wine”

So the CTQs defined are:

  • Taste
  • Flavour
  • Appearance

So, the typical physicochemical features such as alcohol, density, citric acid, etc. are the variables (Xs) that influence the CTQs.

These factors affect the taste, appearance and flavour of the final wine and, consequently, customer satisfaction.

1.3 High Level Process (SIPOC)

The SIPOC (Supplier, Input, Process, Output, Customer) is a high-level process map. It helps to identify the process boundaries (start/stop), create a common understanding of the process, place the problem in context and become familiar with the inputs and outputs at each step of the process.

2. Measure Phase

2.1 Data Preparation

## Warning: package 'highcharter' was built under R version 4.0.5
## Warning: package 'qqplotr' was built under R version 4.0.5
## Warning: package 'corrplot' was built under R version 4.0.5
## Warning: package 'multipanelfigure' was built under R version 4.0.5
## Warning: package 'caret' was built under R version 4.0.5
## Warning: package 'e1071' was built under R version 4.0.5
## Warning: package 'qcc' was built under R version 4.0.5
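The setup chunk that loads the packages is not shown in this report. Judging from the warnings above and from the functions called later (for example `grid.arrange`, `%>%`, `%<>%`, `skewness`, `preProcess`), it presumably looks something like the sketch below. The exact package list is an assumption, and `highcharter` appears only in the warnings, not in the code shown.

```r
# Packages inferred from the warnings above and the calls used later (assumed list)
pkgs <- c("ggplot2", "dplyr", "magrittr", "gridExtra", "highcharter", "qqplotr",
          "corrplot", "multipanelfigure", "caret", "e1071", "qcc")

# Check which packages are available before attaching them
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach only the packages that are installed
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```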

Data

setwd("~/R/Github Project/Wine Quality")
data <- read.csv("winequalityN.csv")

2.2 Data Cleaning

2.2.1 Removing the Duplicate Rows

data <- data[!duplicated(data), ]
dim(data)
## [1] 5329   13

After removing the duplicate rows, 5329 rows remain.
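As a small illustration of what `duplicated()` does, the sketch below uses a toy data frame with hypothetical values (not the wine data). Only the second and later occurrences of a row are flagged, so the first copy is kept.

```r
# Toy data frame (hypothetical values) containing one repeated row
toy <- data.frame(type    = c("red", "red", "white", "red"),
                  pH      = c(3.2, 3.2, 3.1, 3.4),
                  quality = c(5, 5, 6, 5))

flags <- duplicated(toy)        # TRUE only for a repeat of an earlier row
deduped <- toy[!flags, ]        # keep the first occurrence of each row
dropped <- nrow(toy) - nrow(deduped)
dropped                         # number of rows removed
```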

2.2.2 Missing Values

colSums(is.na(data))
##                 type        fixed.acidity     volatile.acidity 
##                    0                   10                    8 
##          citric.acid       residual.sugar            chlorides 
##                    3                    2                    2 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                    0                    0                    0 
##                   pH            sulphates              alcohol 
##                    9                    4                    0 
##              quality 
##                    0
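The counts above are per column. The report keeps the incomplete rows and later relies on `use="pairwise.complete.obs"` when computing correlations. As an alternative, the following sketch (toy data with hypothetical values) shows how the number of affected rows could be counted and how complete rows could be selected with `complete.cases()`.

```r
# Toy data frame (hypothetical values) with missing entries in two columns
toy <- data.frame(pH        = c(3.2, NA, 3.1, 3.4),
                  sulphates = c(0.5, 0.6, NA, 0.4),
                  quality   = c(5, 6, 6, 7))

na_per_column <- colSums(is.na(toy))        # missing values per column
incomplete_rows <- sum(!complete.cases(toy)) # rows with at least one NA
toy_complete <- toy[complete.cases(toy), ]   # rows without any NA
```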

2.3 Variables Identification

  • Input variables (based on physicochemical tests):

    1 - fixed acidity (tartaric acid - g / dm^3)

    2 - volatile acidity (acetic acid - g / dm^3)

    3 - citric acid (g / dm^3)

    4 - residual sugar (g / dm^3)

    5 - chlorides (sodium chloride - g / dm^3)

    6 - free sulfur dioxide (mg / dm^3)

    7 - total sulfur dioxide (mg / dm^3)

    8 - density (g / cm^3)

    9 - pH

    10 - sulphates (potassium sulphate - g / dm^3)

    11 - alcohol (% by volume)

  • Output variable (based on sensory data):

    12 - quality (score between 0 and 10)

2.4 Exploratory Data Analysis

2.4.1 Data Summary

summary(data)
##      type           fixed.acidity    volatile.acidity  citric.acid    
##  Length:5329        Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  Class :character   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2400  
##  Mode  :character   Median : 7.000   Median :0.3000   Median :0.3100  
##                     Mean   : 7.217   Mean   :0.3441   Mean   :0.3187  
##                     3rd Qu.: 7.700   3rd Qu.:0.4100   3rd Qu.:0.4000  
##                     Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##                     NA's   :10       NA's   :8        NA's   :3       
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 16.00      1st Qu.: 75.0       
##  Median : 2.700   Median :0.04700   Median : 28.00      Median :116.0       
##  Mean   : 5.054   Mean   :0.05667   Mean   : 30.06      Mean   :114.2       
##  3rd Qu.: 7.500   3rd Qu.:0.06600   3rd Qu.: 41.00      3rd Qu.:154.0       
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##  NA's   :2        NA's   :2                                                 
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9922   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9946   Median :3.210   Median :0.5100   Median :10.40  
##  Mean   :0.9945   Mean   :3.224   Mean   :0.5333   Mean   :10.55  
##  3rd Qu.:0.9968   3rd Qu.:3.330   3rd Qu.:0.6000   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##                   NA's   :9       NA's   :4                       
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.796  
##  3rd Qu.:6.000  
##  Max.   :9.000  
## 

2.4.2 Distribution of Single Variables (Histograms)

par(mfrow=c(2,2))

p1 <- ggplot(data, aes(x=pH, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="pH distribution",  subtitle = "Histogram Chart",
       x = "pH", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1") 

p2 <- ggplot(data, aes(x=free.sulfur.dioxide, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Free SO2 distribution",  subtitle = "Histogram Chart",
       x = "Free SO2", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")

p3 <- ggplot(data, aes(x=fixed.acidity, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Fixed Acidity distribution",  subtitle = "Histogram Chart",
       x = "Fixed Acidity", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")

p4 <- ggplot(data, aes(x=volatile.acidity, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Volatile Acidity distribution",  subtitle = "Histogram Chart",
       x = "Volatile Acidity", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")

p5 <- ggplot(data, aes(x=residual.sugar, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Residual sugar distribution",  subtitle = "Histogram Chart",
       x = "Residual sugar", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")

p6 <- ggplot(data, aes(x=density, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Density distribution",  subtitle = "Histogram Chart",
       x = "Density", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")

p7 <- ggplot(data, aes(x=alcohol, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Alcohol distribution",  subtitle = "Histogram Chart",
       x = "Alcohol", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")

p8 <- ggplot(data, aes(x=chlorides, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Chlorides distribution",  subtitle = "Histogram Chart",
       x = "Chlorides", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")

grid.arrange(p1,p2,p3,p4, ncol = 2)

grid.arrange(p5,p6,p7,p8, ncol = 2)

Looking at the dataset, many distributions are right-skewed (for example, fixed and volatile acidity, residual sugar, chlorides, free and total SO2, and density).

par(mfrow=c(2,2))
p9 <- ggplot(data, aes(x=quality, y=..count.., fill = type)) + 
  geom_histogram(position="dodge") +
     labs(title="Quality distribution",  subtitle = "Histogram Chart",
       x = "Quality", y = "Frequency", fill = "Type") +
  scale_fill_brewer(palette = "Set1")
  
p9

Quality scores range from 3 to 9 (red wine tops out at 8). Most scores are 5 or 6 (medium quality), and very little data is available for scores at the tails.

2.4.3 Boxplots

 par(mfrow=c(1,5), oma = c(1,1,0,0) + 0.1,  mar = c(3,3,1,1) + 0.1)
 
  b1 <- ggplot(data, aes(x=fixed.acidity, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Fixed Acidity", x = "Fixed Acidity", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip() 
  
  b2 <- ggplot(data, aes(x=volatile.acidity, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Volatile Acidity", x = "Volatile Acidity", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip() 
  
  b3 <- ggplot(data, aes(x=citric.acid, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Citric Acid", x = "Citric Acid", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip() 

  b4 <- ggplot(data, aes(x=residual.sugar, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Residual Sugar", x = "Residual Sugar", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b5 <- ggplot(data, aes(x=chlorides, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Chlorides", x = "Chlorides", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b6 <- ggplot(data, aes(x=alcohol, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Alcohol", x = "Alcohol", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b7 <- ggplot(data, aes(x=density, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Density", x = "Density", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b8 <- ggplot(data, aes(x=free.sulfur.dioxide, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Free Sulfur Dioxide", x = "Free Sulfur Dioxide", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b9 <- ggplot(data, aes(x=pH, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="pH", x = "pH", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b10 <- ggplot(data, aes(x=sulphates, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Sulphates", x = "Sulphates", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b11 <- ggplot(data, aes(x=total.sulfur.dioxide, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Total Sulfur Dioxide", x = "Total Sulfur Dioxide", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
  
  b12 <- ggplot(data, aes(x=quality, y=type, fill =type)) + 
  geom_boxplot() +
  labs(title="Quality", x = "Quality", y = "Type", fill = "Type") +
  scale_fill_brewer(palette = "Set1") + coord_flip()
    

figure_b1 <- multi_panel_figure(columns = 3, rows = 1, panel_label_type = "none")

figure_b2 <- multi_panel_figure(columns = 3, rows = 1, panel_label_type = "none")

figure_b3 <- multi_panel_figure(columns = 3, rows = 1, panel_label_type = "none")

figure_b4 <- multi_panel_figure(columns = 3, rows = 1, panel_label_type = "none")

figure_b1 %<>%
  fill_panel(b1, column = 1, row = 1) %<>%
  fill_panel(b2, column = 2, row = 1) %<>%
  fill_panel(b3, column = 3, row = 1) 

figure_b1

figure_b2 %<>%
  fill_panel(b4, column = 1, row = 1) %<>%
  fill_panel(b5, column = 2, row = 1) %<>%
  fill_panel(b6, column = 3, row = 1) 

figure_b2

figure_b3 %<>%
  fill_panel(b7, column = 1, row = 1) %<>%
  fill_panel(b8, column = 2, row = 1) %<>%
  fill_panel(b9, column = 3, row = 1) 

figure_b3

figure_b4 %<>%
  fill_panel(b10, column = 1, row = 1) %<>%
  fill_panel(b11, column = 2, row = 1) %<>%
  fill_panel(b12, column = 3, row = 1) 

figure_b4

Some differences between red wine and white wine can be observed:

  1. Red Wine:
  • Sulphates, free sulfur dioxide, density, chlorides, fixed acidity and volatile acidity have a high number of outliers. If those outliers were removed, the distributions of these variables could be considered symmetric.

  • Alcohol has an irregular shaped distribution but without pronounced outliers.

  • Chlorides are mostly concentrated between 0.05 and 0.1 g/dm^3

  • Quality scores are mostly medium (5, 6), with very few records at both tails (scores of 3 and 8)

  • In most cases the data are not normally distributed. It will be necessary to transform the data to bring them closer to a normal distribution

  2. White Wine:
  • Sulphates, pH, free sulfur dioxide, chlorides, citric acid, fixed acidity and volatile acidity have a high number of outliers. If those outliers were removed, the distributions of these variables could be considered symmetric.

  • Alcohol and density are quite symmetric.

  • Quality scores are mostly medium (5, 6), with very few records at both tails (scores of 3 and 9)

  • In most cases the data are not normally distributed. It will be necessary to transform the data to bring them closer to a normal distribution

2.4.4 Variables Correlation

Correlation measures the extent to which two variables are related to one another. It is a single-number measure of the relationship between two variables. It is important to remember that correlation does not imply a causal relationship.

The value of the correlation coefficient varies between +1 and -1:

  • +1 implies a perfect positive association.
  • -1 implies a perfect negative association.
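As a minimal sketch of what the coefficient measures, the example below computes the Pearson correlation by hand on toy vectors (hypothetical values, not the wine data) and compares it with `cor()`. It also shows the effect of `use = "pairwise.complete.obs"`, the option used in this report: pairs in which either value is missing are dropped before the calculation.

```r
# Toy vectors (hypothetical values); each contains one missing entry
x <- c(9.4, 9.8, 10.5, NA, 12.1, 11.0)   # e.g. an alcohol-like measurement
y <- c(5, 5, 6, 6, 7, NA)                # e.g. a quality-like score

# Keep only the pairs where both values are present
ok <- !is.na(x) & !is.na(y)
xc <- x[ok]
yc <- y[ok]

# Pearson correlation: covariance scaled by both standard deviations
r_manual <- sum((xc - mean(xc)) * (yc - mean(yc))) /
  sqrt(sum((xc - mean(xc))^2) * sum((yc - mean(yc))^2))

# Built-in equivalent, dropping incomplete pairs
r_builtin <- cor(x, y, use = "pairwise.complete.obs")

isTRUE(all.equal(r_manual, r_builtin))
```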
#Red Wine
data_red <- data %>%
  filter (type =="red") %>%
  select(-type)

var_correlation <-cor(data_red, use="pairwise.complete.obs")
corrplot(var_correlation, method="color", 
         addCoef.col = "black", number.cex = 0.5,  tl.cex=0.8, tl.srt=70,tl.col="black" )

The most correlated variables with red wine quality are:

  • volatile-acidity (-0.40)
  • alcohol (0.48)
  • citric acid (0.23)
  • sulphates (0.25)

When the quality score increases, volatile acidity tends to decrease, and when the quality score increases, the alcohol level also tends to increase. These hypotheses need to be tested with an ANOVA test.

Among the other independent variables, some strong correlations are observed.

  • fixed acidity - density (0.67)
  • fixed acidity - citric acid (0.67)
  • fixed acidity - pH (-0.69)

# White Wine
data_white <- data %>%
  filter (type =="white") %>%
  select(-type)

var_correlation <-cor(data_white, use="pairwise.complete.obs")
corrplot(var_correlation, method="color", 
         addCoef.col = "black", number.cex = 0.5,  tl.cex=0.8, tl.srt=70,tl.col="black" )

The most correlated variables with white wine quality are:

  • density (-0.34)
  • alcohol (0.46)
  • chlorides (-0.22)

When the quality score increases, density tends to decrease, and when the quality score increases, the alcohol level also tends to increase. These hypotheses need to be tested with an ANOVA test.

Among the other independent variables, some strong correlations are observed.

  • alcohol - density (-0.76)
  • residual sugar - density (0.82)

2.4.5 Skewness

If the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed. If the skewness is less than -1 or greater than 1, the data are highly skewed.
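The rule of thumb above can be written as a small helper. This is only a sketch: the function name `classify_skewness` is hypothetical, and the example inputs are skewness values reported in this section.

```r
# Classify a skewness value using the thresholds described above
classify_skewness <- function(s) {
  a <- abs(s)
  ifelse(is.na(a), NA_character_,
         ifelse(a <= 0.5, "fairly symmetrical",
                ifelse(a <= 1, "moderately skewed", "highly skewed")))
}

# Values taken from the skewness output in this section
classify_skewness(c(white_quality = 0.1138907,
                    red_alcohol   = 0.857944,
                    red_chlorides = 5.490346))
```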

skewness(data_white$quality)
## [1] 0.1138907
skewness(data_red$quality)
## [1] 0.1919821
skewness(data_white$pH)
## [1] NA
skewness(data_red$pH)
## [1] NA
skewness(data_white$density)
## [1] 1.268285
skewness(data_red$density)
## [1] 0.04467906
skewness(data_white$alcohol)
## [1] 0.4512139
skewness(data_red$alcohol)
## [1] 0.857944
skewness(data_white$free.sulfur.dioxide)
## [1] 1.559729
skewness(data_red$free.sulfur.dioxide)
## [1] 1.223873
skewness(data_white$chlorides)
## [1] NA
skewness(data_red$chlorides)
## [1] 5.490346

Skewness violates the normality assumption. As in these cases, it can also arise when many outliers influence the shape of the distribution.
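Some of the calls above return NA because pH and chlorides contain missing values (see section 2.2.2). The sketch below reproduces the effect in base R with a moment-based skewness helper (`skew_sketch` is a hypothetical name) and toy values. The `skewness()` function in e1071 likewise accepts `na.rm = TRUE`, which would give a numeric result for those columns.

```r
# Moment-based sample skewness: third central moment over sd cubed
skew_sketch <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  m3 <- sum((x - mean(x))^3) / length(x)
  m3 / sd(x)^3
}

# Toy pH-like values (hypothetical) with one missing entry
ph_toy <- c(3.1, 3.2, 3.2, 3.3, NA, 3.9)

with_na <- skew_sketch(ph_toy)                  # NA propagates through mean()
without_na <- skew_sketch(ph_toy, na.rm = TRUE) # missing value dropped first
```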

To fix it, some form of transformation is probably required. A log, square-root or inverse transformation might work. Box-Cox is a reasonable way to suggest a transformation for the predictor variables, so that is one suitable method. The data therefore need to be transformed to bring them closer to a normal distribution.
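To make the idea concrete, the sketch below applies the Box-Cox family by hand to a simulated right-skewed sample. The helper names, the simulated data and the fixed lambda values are assumptions for illustration; in practice lambda is estimated from the data rather than chosen by hand, and the transformation is only defined for strictly positive values.

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda = 0 case
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Simple moment-based skewness used to compare before and after
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(7)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)  # simulated right-skewed sample

skew_raw  <- skew(x)                 # strongly positive
skew_sqrt <- skew(box_cox(x, 0.5))   # square-root-like transform
skew_log  <- skew(box_cox(x, 0))     # log transform
c(raw = skew_raw, lambda_0.5 = skew_sqrt, lambda_0 = skew_log)
```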

2.4.6 Probability Plot

The normal probability plot (Chambers et al., 1983) is a graphical technique for assessing whether or not a data set is approximately normally distributed.

  pp1 <- ggplot(mapping = aes(sample = data_white$alcohol)) +
  stat_qq_point(size = 2,color = "#377EB8")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
  labs(title=" Prob. Plot - Alcohol (White Wine)")

  pp2 <- ggplot(mapping = aes(sample = data_red$alcohol)) +
  stat_qq_point(size = 2,color = "#E41A1C")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
     labs(title=" Prob. Plot - Alcohol (Red Wine)")
  
  pp3 <- ggplot(mapping = aes(sample = data_white$fixed.acidity)) +
  stat_qq_point(size = 2,color = "#377EB8")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
  labs(title=" Prob. Plot - Fixed Acidity (White Wine)")

  pp4 <- ggplot(mapping = aes(sample = data_red$fixed.acidity)) +
  stat_qq_point(size = 2,color = "#E41A1C")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
  labs(title=" Prob. Plot - Fixed Acidity (Red Wine)")

grid.arrange(pp1,pp2,pp3,pp4, ncol = 2)

Points that follow a straight diagonal line indicate normally distributed data. If the points curve away from the line to the left or right, the data are not normally distributed. This confirms what has been seen throughout the study: normality cannot be assumed for these variables, so they need to be transformed.

2.4.7 Data Transformation

# Center, scale, and transform red wine data
preprocess_redwine <- preProcess(data_red[,1:11], c("BoxCox", "center", "scale"), na.remove = TRUE)
new_datared <- data.frame(trans = predict(preprocess_redwine, data_red))
colnames(new_datared)
##  [1] "trans.fixed.acidity"        "trans.volatile.acidity"    
##  [3] "trans.citric.acid"          "trans.residual.sugar"      
##  [5] "trans.chlorides"            "trans.free.sulfur.dioxide" 
##  [7] "trans.total.sulfur.dioxide" "trans.density"             
##  [9] "trans.pH"                   "trans.sulphates"           
## [11] "trans.alcohol"              "trans.quality"
# Center, scale, and transform white wine data
preprocess_whitewine <- preProcess(data_white[,1:11], c("BoxCox", "center", "scale"), na.remove = TRUE)
new_datawhite <- data.frame(trans = predict(preprocess_whitewine, data_white))
colnames(new_datawhite)
##  [1] "trans.fixed.acidity"        "trans.volatile.acidity"    
##  [3] "trans.citric.acid"          "trans.residual.sugar"      
##  [5] "trans.chlorides"            "trans.free.sulfur.dioxide" 
##  [7] "trans.total.sulfur.dioxide" "trans.density"             
##  [9] "trans.pH"                   "trans.sulphates"           
## [11] "trans.alcohol"              "trans.quality"

2.4.8 Remove Outliers

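The extreme outliers are removed to facilitate the analysis and the visualization. Because the variables have been centered and scaled, an observation is treated as extreme when its absolute value exceeds 3.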

new_datawhite <- new_datawhite[!abs(new_datawhite$trans.fixed.acidity) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.volatile.acidity) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.citric.acid) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.residual.sugar) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.chlorides) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.density) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.pH) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.sulphates) > 3,]
new_datawhite <- new_datawhite[!abs(new_datawhite$trans.alcohol) > 3,]

new_datared <- new_datared[!abs(new_datared$trans.fixed.acidity) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.volatile.acidity) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.citric.acid) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.residual.sugar) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.chlorides) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.density) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.pH) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.sulphates) > 3,]
new_datared <- new_datared[!abs(new_datared$trans.alcohol) > 3,]
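The repeated filters above can also be expressed as one helper. The sketch below is an assumption about how that could look, not the code used in this report: `remove_extreme` is a hypothetical name and the data frame holds toy values. One difference is worth noting. Indexing with a logical vector that contains NA, as the lines above do for columns with missing values, produces rows filled with NA. The helper instead keeps rows whose value is missing.

```r
# Drop rows where any of the given columns exceeds k in absolute value.
# Missing values are not counted as extreme, so those rows are kept.
remove_extreme <- function(df, cols, k = 3) {
  extreme <- abs(df[cols]) > k                 # logical matrix, may contain NA
  keep <- rowSums(extreme, na.rm = TRUE) == 0  # TRUE when no column is extreme
  df[keep, , drop = FALSE]
}

# Toy standardized data (hypothetical values)
toy <- data.frame(trans.pH        = c(0.1, 3.5, NA, -0.2),
                  trans.alcohol   = c(1.0, 0.0, 0.5, -4.0),
                  trans.quality   = c(5, 6, 7, 5))

cleaned <- remove_extreme(toy, c("trans.pH", "trans.alcohol"))
cleaned
```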

Checking for normal distribution:

 # All four panels show chlorides, before and after the transformation
 t1 <- ggplot(mapping = aes(sample = data_white$chlorides)) +
  stat_qq_point(size = 2,color = "#377EB8")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
  labs(title=" Before Transformation - Chlorides (White Wine)")

 t2 <- ggplot(mapping = aes(sample = new_datawhite$trans.chlorides)) +
  stat_qq_point(size = 2,color = "#377EB8")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
  labs(title=" After Transformation - Chlorides (White Wine)")

 t3 <- ggplot(mapping = aes(sample = data_red$chlorides)) +
  stat_qq_point(size = 2,color = "#E41A1C")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
  labs(title=" Before Transformation - Chlorides (Red Wine)")

 t4 <- ggplot(mapping = aes(sample = new_datared$trans.chlorides)) +
  stat_qq_point(size = 2,color = "#E41A1C")  +
  stat_qq_line(color="black") +
 xlab("Theoretical Quantiles") + ylab("Sample Quantiles") +
  labs(title=" After Transformation - Chlorides (Red Wine)")
 

  grid.arrange(t1,t2,t3,t4, ncol = 2)  

2.4.9 Correlation with the transformed variables

var_correlation <-cor(new_datawhite, use="pairwise.complete.obs")
corrplot(var_correlation, method="color", 
         addCoef.col = "black", number.cex = 0.5,  tl.cex=0.8, tl.srt=70,tl.col="black" )

The most correlated variables with white wine quality are:

  • density (-0.34)
  • alcohol (0.45)
  • chlorides (-0.31)
  • volatile acidity (-0.18)
var_correlation <-cor(new_datared, use="pairwise.complete.obs")
corrplot(var_correlation, method="color", 
         addCoef.col = "black", number.cex = 0.5,  tl.cex=0.8, tl.srt=70,tl.col="black" )

The most correlated variables with red wine quality are:

  • volatile-acidity (-0.40)
  • alcohol (0.49)
  • sulphates (0.38)
  • citric acid (0.24)

2.4.10 Process Capability

Process Capability Indices (PCI) are among the fundamental quality measurement tools. When applied properly, they evaluate process performance and allow one material or supplier to be compared with others, which helps improve process control.

The capability of a process can only be assessed when the process is under control. So 9 samples of size 20 and 30 respectively are taken to analyze the process capability for red wine and white wine. It is then checked that all of the samples come from a process that is in control.
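The objects `sample_red` and `sample_white` passed to `qcc()` are not defined in the code shown in this report. The sketch below is one assumed way such subgroups could be built: quality scores are drawn at random and arranged in a matrix with one row per subgroup, which is the layout `qcc()` expects for R and X-bar charts. The simulated `quality_red` vector is a stand-in so that the example runs on its own; in the report it would be `data_red$quality`.

```r
set.seed(1)

# Stand-in for data_red$quality (hypothetical values, for illustration only)
quality_red <- sample(3:8, size = 1000, replace = TRUE,
                      prob = c(0.01, 0.04, 0.42, 0.40, 0.12, 0.01))

n_groups <- 9     # number of subgroups (samples)
group_size <- 20  # observations per subgroup

# One row per subgroup, one column per observation within the subgroup
sample_red <- matrix(sample(quality_red, n_groups * group_size),
                     nrow = n_groups, ncol = group_size, byrow = TRUE)
dim(sample_red)
```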

red_range <- qcc(data = sample_red, type = "R", plot = T)

red_average <- qcc(data = sample_red, type = "xbar", plot = T)

white_range <- qcc(data = sample_white, type = "R", plot = T)

white_average <- qcc(data = sample_white, type = "xbar", plot = T)

As we can see, both processes are under control according to the R chart and the X-bar chart, so the process capability can be checked. In quality terms, the specification limits of this wine production process are 3 and 9, and the target is a quality of 7.

lsl=3;usl=9
target=7
process.capability(white_average, spec.limits=c(lsl,usl),target=target)

## 
## Process Capability Analysis
## 
## Call:
## process.capability(object = white_average, spec.limits = c(lsl,     usl), target = target)
## 
## Number of obs = 180          Target = 7
##        Center = 5.694           LSL = 3
##        StdDev = 0.8081          USL = 9
## 
## Capability indices:
## 
##        Value    2.5%   97.5%
## Cp    1.2375  1.1093  1.3655
## Cp_l  1.1115  1.0065  1.2164
## Cp_u  1.3635  1.2382  1.4889
## Cp_k  1.1115  0.9865  1.2365
## Cpm   0.6513  0.5630  0.7394
## 
## Exp<LSL 0.043%    Obs<LSL 0%
## Exp>USL 0%    Obs>USL 0%
process.capability(red_average, spec.limits=c(lsl,usl),target=target)

## 
## Process Capability Analysis
## 
## Call:
## process.capability(object = red_average, spec.limits = c(lsl,     usl), target = target)
## 
## Number of obs = 180          Target = 7
##        Center = 5.694           LSL = 3
##        StdDev = 0.6734          USL = 9
## 
## Capability indices:
## 
##        Value    2.5%   97.5%
## Cp    1.4850  1.3312  1.6386
## Cp_l  1.3338  1.2108  1.4567
## Cp_u  1.6363  1.4883  1.7842
## Cp_k  1.3338  1.1873  1.4802
## Cpm   0.6807  0.5867  0.7746
## 
## Exp<LSL 0%    Obs<LSL 0%
## Exp>USL 0%    Obs>USL 0%

Cp measures the potential capability of a process, defined as the allowable spread divided by the actual spread. By convention, when a process has a Cp value less than 1.0, it is considered potentially incapable of meeting specification requirements. Conversely, when a process Cp is greater than or equal to 1.0, the process has the potential of being capable.

Cpk additionally accounts for how well the process is centred between the specification limits, so it reflects both the location of the process and its consistency around the average performance. A Cpk of less than 1.66 needs a closer look. A Cpk of less than 1.33 needs some action to make it higher, and a Cpk of less than 1.0 means that the process is not capable of meeting its requirements. In this case, these are the values:

  • Red wine has Cp = 1.49 and Cpk = 1.33
  • White wine has Cp = 1.24 and Cpk = 1.11

This means that our process is capable, but some actions are needed to improve it.
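The indices in the white wine output can be reproduced by hand from the reported center, standard deviation, specification limits and target. The sketch below uses the standard definitions of Cp, Cp_l, Cp_u, Cpk and Cpm; small differences in the last digit are expected because the inputs are the rounded values printed in the output.

```r
# Values printed in the white wine capability output
lsl <- 3
usl <- 9
target <- 7
center <- 5.694
std_dev <- 0.8081

cp   <- (usl - lsl) / (6 * std_dev)        # allowable spread over actual spread
cp_l <- (center - lsl) / (3 * std_dev)     # capability against the lower limit
cp_u <- (usl - center) / (3 * std_dev)     # capability against the upper limit
cp_k <- min(cp_l, cp_u)                    # the worse of the two sides
cpm  <- cp / sqrt(1 + ((center - target) / std_dev)^2)  # penalizes distance from target

round(c(Cp = cp, Cp_l = cp_l, Cp_u = cp_u, Cp_k = cp_k, Cpm = cpm), 4)
```

Cpm is well below Cp here because the process center (5.694) lies far from the target of 7, which is consistent with the improvement goal stated in the Define phase.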

3. Analysis Phase

3.1 Hypothesis Tests

  • Hypothesis 1 - Quality is independent of alcohol concentration.

    • Null Hypothesis : Quality is independent of alcohol concentration.

    • Alternative Hypothesis : Quality is not independent of alcohol concentration.

t.test(data$alcohol, data$quality, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  data$alcohol and data$quality
## t = 234.91, df = 10656, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.712860 4.792174
## sample estimates:
## mean of x mean of y 
## 10.548539  5.796022

The p-value is less than 0.05, hence the null hypothesis can be rejected. Strictly, this two-sample t-test compares the mean alcohol concentration with the mean quality score, so it shows that the two means differ. The dependence of quality on alcohol is supported more directly by the positive correlation found in section 2.4.4. Alcohol concentration is therefore retained for the analysis.
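A two-sample t-test compares means, whereas association between two numeric variables is more commonly tested with a correlation test. The following is a minimal sketch of how `cor.test()` could be applied. It uses simulated stand-in data rather than the wine dataset, and the choice of Spearman's rank correlation (quality is an ordinal score) is an assumption made for illustration.

```r
set.seed(42)

# Simulated stand-in data (hypothetical); in the report these would be
# data$alcohol and data$quality
alcohol <- runif(200, min = 8, max = 14)
quality <- pmin(9, pmax(3, round(0.4 * alcohol + rnorm(200, mean = 1.5, sd = 0.8))))

# H0: no monotonic association between alcohol and quality
# exact = FALSE avoids the warning about ties in the ranked scores
assoc <- cor.test(alcohol, quality, method = "spearman", exact = FALSE)

assoc$estimate   # Spearman's rho
assoc$p.value    # a small p-value leads to rejecting H0
```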

  • Hypothesis 2 - Quality is independent of volatile acidity in red wine

    • Null Hypothesis : Quality is independent of volatile acidity in red wine.

    • Alternative Hypothesis : Quality is not independent of volatile acidity in red wine.

t.test(data_red$volatile.acidity, data_red$quality, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  data_red$volatile.acidity and data_red$quality
## t = -222.5, df = 2715, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.138762 -5.048980
## sample estimates:
## mean of x mean of y 
## 0.5293814 5.6232524

The p-value is less than 0.05, hence the null hypothesis can be rejected. Thus, the mean volatile acidity and the mean quality of red wine differ significantly. Together with the negative correlation seen in section 2.4.4, this supports keeping volatile acidity in the analysis.

  • Hypothesis 3 - Quality is independent of density in white wine

    • Null Hypothesis : Quality is independent of density in white wine.

    • Alternative Hypothesis : Quality is not independent of density in white wine.

t.test(data_white$density, data_white$quality, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  data_white$density and data_white$quality
## t = -343.84, df = 7938, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.889087 -4.833657
## sample estimates:
## mean of x mean of y 
##  0.993792  5.855164

The p-value is less than 0.05, hence the null hypothesis can be rejected. Thus, the mean density and the mean quality of white wine differ significantly. Together with the negative correlation seen in section 2.4.4, this supports keeping density in the analysis.

  • Hypothesis 4 - Quality is independent of citric acid in red wine

    • Null Hypothesis : Quality is independent of citric acid in red wine.

    • Alternative Hypothesis : Quality is not independent of citric acid in red wine.

t.test(data_red$citric.acid, data_red$quality, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  data_red$citric.acid and data_red$quality
## t = -232.95, df = 2715, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.395758 -5.305680
## sample estimates:
## mean of x mean of y 
## 0.2725331 5.6232524

The p-value is less than 0.05, hence the null hypothesis can be rejected. Thus, there is a significant difference between the mean citric acid and the mean quality of red wine.

  • Hypothesis 5 - Quality is independent of sulphates in red wine

    • Null Hypothesis : Quality is independent of sulphates in red wine.

    • Alternative Hypothesis : Quality is not independent of sulphates in red wine.

t.test(data_red$sulphates, data_red$quality, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  data_red$sulphates and data_red$quality
## t = -217.44, df = 2714, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.009400 -4.919861
## sample estimates:
## mean of x mean of y 
##  0.658622  5.623252

The p-value is less than 0.05, hence the null hypothesis can be rejected. Thus, there is a significant difference between the mean sulphates and the mean quality of red wine.

  • Hypothesis 6 - There is no significant difference between quality of red wine and quality of white wine

    • Null Hypothesis : There is no significant difference between quality of red wine and quality of white wine

    • Alternative Hypothesis : There is significant difference between quality of red wine and quality of white wine

t.test(data_white$quality, data_red$quality, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  data_white$quality and data_red$quality
## t = 8.4412, df = 5327, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1780516 0.2857711
## sample estimates:
## mean of x mean of y 
##  5.855164  5.623252

We reject the null hypothesis since p < 0.05. White wine had a higher mean quality (5.86) than red wine (5.62). This t-test shows a statistically significant difference in quality between red and white wine.
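The confidence interval in the Hypothesis 6 output can be reproduced from the reported means, t statistic and degrees of freedom, which also shows how small the estimated difference is in absolute terms. This sketch uses only numbers printed in that output; the variable names are chosen here for illustration.

```r
# Values copied from the Hypothesis 6 t-test output.
mean_white <- 5.855164
mean_red   <- 5.623252
t_stat     <- 8.4412
df         <- 5327

# The t statistic is the mean difference divided by its standard error.
mean_diff <- mean_white - mean_red
std_error <- mean_diff / t_stat

# 95% confidence interval: difference +/- critical t value * standard error.
half_width <- qt(0.975, df) * std_error
ci <- c(lower = mean_diff - half_width, upper = mean_diff + half_width)

print(mean_diff)
print(ci)
```

The interval agrees with the reported one (about 0.18 to 0.29), so the difference is clearly non-zero but amounts to roughly a quarter of a quality point.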

3.2 Regression Analysis

Regression analysis identifies the relationship between a continuous input variable (X) and a continuous output variable (Y).

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. Here we use the lm function, which stands for linear model, to build the linear regression models.

  • Y variable: quality of white and red wine
  • Possible X variables: alcohol, density, volatile acidity, citric acid, chlorides, sulphates

Model_red <- lm(trans.quality ~ trans.alcohol + trans.volatile.acidity + trans.citric.acid + trans.sulphates + trans.density, data = new_datared)
summary(Model_red)
## 
## Call:
## lm(formula = trans.quality ~ trans.alcohol + trans.volatile.acidity + 
##     trans.citric.acid + trans.sulphates + trans.density, data = new_datared)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8000 -0.3748 -0.0465  0.4567  2.1038 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.633760   0.018168 310.095  < 2e-16 ***
## trans.alcohol           0.308536   0.023691  13.023  < 2e-16 ***
## trans.volatile.acidity -0.201249   0.024687  -8.152  8.4e-16 ***
## trans.citric.acid      -0.009076   0.026978  -0.336    0.737    
## trans.sulphates         0.186355   0.021889   8.514  < 2e-16 ***
## trans.density          -0.026420   0.026761  -0.987    0.324    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.653 on 1290 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.3622, Adjusted R-squared:  0.3597 
## F-statistic: 146.5 on 5 and 1290 DF,  p-value: < 2.2e-16
Model_white <- lm(trans.quality ~ trans.alcohol + trans.density + trans.chlorides + trans.volatile.acidity, data = new_datawhite)
summary(Model_white)
## 
## Call:
## lm(formula = trans.quality ~ trans.alcohol + trans.density + 
##     trans.chlorides + trans.volatile.acidity, data = new_datawhite)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4693 -0.4385 -0.0447  0.5185  3.0146 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.85805    0.01260 464.843  < 2e-16 ***
## trans.alcohol           0.42919    0.02255  19.036  < 2e-16 ***
## trans.density           0.08586    0.02209   3.887 0.000103 ***
## trans.chlorides        -0.09798    0.01753  -5.589 2.45e-08 ***
## trans.volatile.acidity -0.17380    0.01310 -13.266  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7695 on 3726 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.2466, Adjusted R-squared:  0.2458 
## F-statistic: 304.8 on 4 and 3726 DF,  p-value: < 2.2e-16

These are the models used to explain low and high wine quality, so they can guide how we adjust the process to improve it. Multiple R-squared never decreases when more independent variables are added, whereas adjusted R-squared decreases when an added variable does not help the model. Here the multiple R-squared is about 36% for the red wine model and 25% for the white wine model, so a large share of the variation in quality remains unexplained.
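The adjusted value can be checked by hand with the usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations used and p is the number of predictors. The sketch below plugs in the figures from the two summaries; `adj_r2` is a helper name introduced here, and n is taken as the residual degrees of freedom plus p plus 1.

```r
# Adjusted R-squared from multiple R-squared, sample size n and predictor count p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of predictors + 1 (intercept).
n_red   <- 1290 + 5 + 1
n_white <- 3726 + 4 + 1

adj_red   <- adj_r2(0.3622, n_red, 5)
adj_white <- adj_r2(0.2466, n_white, 4)

print(round(c(red = adj_red, white = adj_white), 4))
```

Both results match the adjusted R-squared values in the summaries to four decimal places (0.3597 and 0.2458). The penalty is tiny here because n is large relative to p.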

Through this analysis, we can see that:

  • Alcohol, density, chlorides and volatile acidity are all statistically significant (p < 0.001) in the white wine model. Quality rises with alcohol and density and falls with chlorides and volatile acidity.

  • Alcohol, volatile acidity and sulphates are statistically significant in the red wine model. Citric acid (p = 0.737) and density (p = 0.324) are not significant at the 0.05 level once the other variables are included.

These functions can be used to help ensure wine quality in our factory.
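As an illustration of how such a function could be applied, the red wine model can be written out from its coefficient table. This is a sketch under assumptions: the inputs must already be on the same transformed scale as the `trans.*` variables (the transformation is not shown in this section), and both the helper `predict_red_quality` and the example batch values are hypothetical.

```r
# Coefficients copied from summary(Model_red).
coef_red <- c(
  intercept        =  5.633760,
  alcohol          =  0.308536,
  volatile.acidity = -0.201249,
  citric.acid      = -0.009076,
  sulphates        =  0.186355,
  density          = -0.026420
)

# x: named numeric vector of predictor values on the transformed scale.
predict_red_quality <- function(x) {
  terms <- names(coef_red)[-1]
  unname(coef_red["intercept"] + sum(coef_red[terms] * x[terms]))
}

# Hypothetical batch: one unit above average alcohol and sulphates,
# one unit below average volatile acidity, average citric acid and density.
batch <- c(alcohol = 1, volatile.acidity = -1, citric.acid = 0,
           sulphates = 1, density = 0)

print(predict_red_quality(batch))
```

Given an R-squared of about 36%, a prediction like this is better read as a rough guide to the expected quality than as a guarantee for an individual batch.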

4. Improve Phase

Once the previous phase has identified the few Xs that affect our Y (the problem), the next step is to generate potential solutions for improving those Xs. Some ways to generate these solutions are:

  • Brainstorming with the team
  • Standardization
  • 5S techniques
  • Visual Management
  • Poka Yoke Techniques
  • Benchmarking

5. Control Phase

In this phase it is necessary to ensure a capable, predictable and stable process.

  • Implement a control system, such as statistical process control, for the variables to be controlled once the new improvements are running, so that any positive or negative changes are detected

  • Standardize the new process

  • Document lessons learned of the project and close it
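For the statistical process control step, one common choice is an individuals control chart with limits derived from the average moving range. The sketch below is only an illustration under assumptions: the simulated `alcohol` readings, the helper `individuals_limits`, and the choice of chart type are not taken from the project.

```r
# Control limits for an individuals chart, using the average moving range.
individuals_limits <- function(x) {
  center    <- mean(x)
  mr_bar    <- mean(abs(diff(x)))  # average moving range of consecutive readings
  sigma_hat <- mr_bar / 1.128      # d2 constant for moving ranges of size 2
  list(center = center,
       lcl = center - 3 * sigma_hat,
       ucl = center + 3 * sigma_hat)
}

# Simulated batch readings (made up); reading 25 is an injected process upset.
set.seed(1)
alcohol <- rnorm(30, mean = 10.4, sd = 0.3)
alcohol[25] <- 13

limits <- individuals_limits(alcohol)

# Readings outside the limits signal a change that should be investigated.
out_of_control <- which(alcohol < limits$lcl | alcohol > limits$ucl)

print(limits)
print(out_of_control)
```

In practice the limits would be computed from a period when the improved process is known to be stable, and new readings would then be compared against them.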