The graph above suggests that there are 2 outliers in total.sulfur.dioxide, a variable recorded for each red wine in the data set. It also appears that no wine is rated higher than 8 for quality.
Both plots above show the count of wines at each level of fixed.acidity. The first plot is a standard qplot histogram. The second uses a smaller binwidth and scale_x_continuous to show a bit more detail. Most fixed.acidity values fall between 6 and 10.
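A rough sketch of how the second, more detailed histogram could be built with ggplot2 is shown here. The data is a synthetic stand-in for the fixed.acidity column, and the binwidth of 0.25 and the axis breaks are assumed values for illustration, not the ones used for the plots above.

```r
library(ggplot2)

set.seed(7)
# Synthetic stand-in for the fixed.acidity column; not the real wine data
wine <- data.frame(fixed.acidity = rgamma(500, shape = 23, rate = 2.8))

# Narrower bins plus explicit axis breaks show more detail than the defaults.
# binwidth = 0.25 and the break spacing are assumed values.
p <- ggplot(wine, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.25) +
  scale_x_continuous(breaks = seq(4, 16, by = 1))

# Type p (or call print(p)) at the console to draw the histogram
```

A smaller binwidth makes narrow peaks visible, at the cost of a noisier outline, so it is worth trying a few values.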
Both plots above show the count of wines at each level of volatile.acidity. My first question is how volatile.acidity and fixed.acidity (another variable in the data set) relate to one another.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The summary and plots above show the count of wines at each level of citric.acid. There appears to be an outlier at 1.0 for citric.acid in the charts above.
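One way to check whether 1.0 really counts as an outlier is the common 1.5 × IQR rule. The sketch below applies it to the quartiles printed in the summary above. The rule is only one convention for flagging outliers, and the variable names are illustrative.

```r
# Quartiles and maximum of citric.acid, copied from the summary output above
q1 <- 0.090
q3 <- 0.420
max_value <- 1.000

# Tukey's rule: points beyond Q3 + 1.5 * IQR are flagged as potential outliers
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

upper_fence              # 0.915
max_value > upper_fence  # TRUE
```

Since the maximum of 1.0 lies above the fence of 0.915, it is flagged as a potential outlier under this rule.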
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The first plot and summary above look at the distribution of density across the wines in the data set. The second plot looks at the distribution of pH. Lastly, the two remaining plots look at the alcohol percentage of each red wine; the last of these groups alcohol into bucket categories. pH and density are by far the most normally distributed of the variables.
There are 1599 observations and 13 variables within the data set (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality).
Other observations:
The quality scale runs from 0 to 10 and the mean rating is 5.6
Max density is 1.0037
The max alcohol is 14.9%
The main feature of interest is quality: I want to find which variables correlate with the quality rating of an individual wine.
I think the alcohol level will contribute to quality through the boldness of the wine. The acidity level will most likely affect the bitterness of the wine, and the citric acid level will affect its freshness.
Yes, I created buckets for the alcohol percentage, using ranges of 0-5, 5-10 and 10-15.
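Buckets like these can be built with base R's cut(). This is a minimal sketch with a handful of illustrative alcohol values, not the real column; the name alcohol_buckets follows the variable used later in this report.

```r
# Illustrative alcohol values (percent by volume), not taken from the data set
alcohol <- c(8.4, 9.5, 10.0, 10.2, 11.1, 14.9)

# cut() builds right-closed intervals by default: (0,5], (5,10], (10,15]
alcohol_buckets <- cut(alcohol, breaks = c(0, 5, 10, 15))

# Count how many values land in each bucket
table(alcohol_buckets)
```

Because the intervals are closed on the right, a value of exactly 10.0 lands in the 5-10 bucket. Note also that the minimum alcohol in the summary above is 8.4, so the 0-5 bucket stays empty for this data.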
The pH and density were approximately normally distributed; most of the others were skewed to the right.
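A quick, informal way to back this up is to compare the mean and median from the summaries printed above: a mean noticeably larger than the median hints at right skew. This is only a heuristic sketch, not a formal skewness test, and the column name rel_gap is made up for illustration.

```r
# Median and mean copied from the summary outputs above
summaries <- data.frame(
  variable = c("citric.acid", "density", "pH", "alcohol"),
  median   = c(0.260, 0.9968, 3.310, 10.20),
  mean     = c(0.271, 0.9967, 3.311, 10.42)
)

# Gap between mean and median, relative to the median.
# Values near zero are consistent with a symmetric distribution;
# clearly positive values hint at a longer right tail.
summaries$rel_gap <- (summaries$mean - summaries$median) / summaries$median

summaries
```

For density and pH the gap is close to zero, while for citric.acid and alcohol the mean sits above the median, which fits the shapes seen in the histograms.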
## X fixed.acidity volatile.acidity
## X 1.000000000 0.00000000 0.724669575
## fixed.acidity -0.268483920 1.00000000 0.000000000
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## citric.acid residual.sugar chlorides
## X 6.744481e-10 2.115297e-01 1.533390e-06
## fixed.acidity 0.000000e+00 4.199465e-06 1.751746e-04
## volatile.acidity 0.000000e+00 9.389168e-01 1.422491e-02
## citric.acid 1.000000e+00 8.083723e-09 2.220446e-16
## residual.sugar 1.435772e-01 1.000000e+00 2.617079e-02
## chlorides 2.038229e-01 5.560954e-02 1.000000e+00
## free.sulfur.dioxide -6.097813e-02 1.870490e-01 5.562147e-03
## total.sulfur.dioxide 3.553302e-02 2.030279e-01 4.740047e-02
## density 3.649472e-01 3.552834e-01 2.006323e-01
## pH -5.419041e-01 -8.565242e-02 -2.650261e-01
## sulphates 3.127700e-01 5.527121e-03 3.712605e-01
## alcohol 1.099032e-01 4.207544e-02 -2.211405e-01
## quality 2.263725e-01 1.373164e-02 -1.289066e-01
## free.sulfur.dioxide total.sulfur.dioxide
## X 2.915917e-04 2.297726e-06
## fixed.acidity 6.335579e-10 5.709033e-06
## volatile.acidity 6.747011e-01 2.213857e-03
## citric.acid 1.473916e-02 1.555454e-01
## residual.sugar 4.685141e-14 2.220446e-16
## chlorides 8.241238e-01 5.809120e-02
## free.sulfur.dioxide 1.000000e+00 0.000000e+00
## total.sulfur.dioxide 6.676665e-01 1.000000e+00
## density -2.194583e-02 7.126948e-02
## pH 7.037750e-02 -6.649456e-02
## sulphates 5.165757e-02 4.294684e-02
## alcohol -6.940835e-02 -2.056539e-01
## quality -5.065606e-02 -1.851003e-01
## density pH sulphates alcohol
## X 0.000000e+00 4.770847e-08 4.992031e-07 0.000000e+00
## fixed.acidity 0.000000e+00 0.000000e+00 1.648681e-13 1.364868e-02
## volatile.acidity 3.787554e-01 0.000000e+00 0.000000e+00 3.330669e-16
## citric.acid 0.000000e+00 0.000000e+00 0.000000e+00 1.059462e-05
## residual.sugar 0.000000e+00 6.065915e-04 8.252134e-01 9.258425e-02
## chlorides 5.551115e-16 0.000000e+00 0.000000e+00 0.000000e+00
## free.sulfur.dioxide 3.804985e-01 4.869975e-03 3.888321e-02 5.492314e-03
## total.sulfur.dioxide 4.354284e-03 7.818341e-03 8.601835e-02 1.110223e-16
## density 1.000000e+00 0.000000e+00 2.418474e-09 0.000000e+00
## pH -3.416993e-01 1.000000e+00 2.109424e-15 1.110223e-16
## sulphates 1.485064e-01 -1.966476e-01 1.000000e+00 1.783053e-04
## alcohol -4.961798e-01 2.056325e-01 9.359475e-02 1.000000e+00
## quality -1.749192e-01 -5.773139e-02 2.513971e-01 4.761663e-01
## quality
## X 7.857465e-03
## fixed.acidity 6.495635e-07
## volatile.acidity 0.000000e+00
## citric.acid 0.000000e+00
## residual.sugar 5.832180e-01
## chlorides 2.313383e-07
## free.sulfur.dioxide 4.283398e-02
## total.sulfur.dioxide 8.615331e-14
## density 1.874945e-12
## pH 2.096278e-02
## sulphates 0.000000e+00
## alcohol 0.000000e+00
## quality 1.000000e+00
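I found a function online (linked below) that allowed me to compare each variable’s relationship to every other variable. The matrix above covers all variables in the Red Wine data: the values below the diagonal are correlation coefficients, and the values above the diagonal appear to be the matching p-values.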
https://stat.ethz.ch/pipermail/r-help/2001-November/016201.html
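To make the layout of the matrix above easier to follow, here is a minimal sketch that produces the same kind of output: correlations below the diagonal and two-sided p-values above it. This is not the linked function; cor_with_pvalues is a hypothetical name, and the demo data is synthetic.

```r
# Sketch: Pearson correlations below the diagonal, p-values above it.
# cor_with_pvalues() is a hypothetical helper, not the function from the link.
cor_with_pvalues <- function(data) {
  r <- cor(data)                 # Pearson correlation matrix
  n <- nrow(data)
  above <- row(r) < col(r)       # positions in the upper triangle

  # t statistic for H0: correlation = 0, with n - 2 degrees of freedom
  t_stat <- r[above] * sqrt((n - 2) / (1 - r[above]^2))

  # Overwrite the upper triangle with two-sided p-values
  r[above] <- 2 * pt(-abs(t_stat), df = n - 2)
  r
}

# Synthetic demo data: y depends on x, z is independent noise
set.seed(1)
x <- rnorm(50)
demo <- data.frame(x = x, y = x + rnorm(50), z = rnorm(50))

cor_with_pvalues(demo)
```

Reading the result works the same way as for the wine matrix: pick a pair of variables, take the correlation from the lower triangle and the p-value from the mirrored cell in the upper triangle.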
This plot represents the correlations among all variables in the Red Wine data. For the bivariate comparisons, I first wanted to dive deeper into the correlation between alcohol and quality.
These plots represent quality in relation to alcohol percentage within the Red Wine data. Overall, the relationship between alcohol level and quality rating looks moderate rather than strong (r of about 0.48 in the correlation matrix above).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
This plot and summary describe fixed.acidity in relation to alcohol percentage within the Red Wine data. Based on the chart above, the alcohol percentage and fixed acidity values generally sit within about 2 units of one another, although the correlation matrix shows almost no linear correlation between them (r of about -0.06).
This plot represents fixed.acidity in relation to citric.acid within the Red Wine data. The graph shows a positive relationship between fixed acidity and citric acid. Citric acid is itself an acid, so this matches what I expected to see in these correlations.
This plot represents fixed.acidity in relation to density within the Red Wine data. There appears to be a positive relationship between fixed acidity and density as well.
This box plot represents pH across the alcohol_buckets variable within the Red Wine data. There seems to be a slight relationship between pH and the alcohol buckets as well.
Referring to quality vs alcohol: there does appear to be a slight relationship between alcohol buckets and quality. Generally, the higher the alcohol the better the quality, since more of the highly rated wines sit in the higher alcohol range.
As pH decreases, citric acid increases. Since a lower pH means a more acidic solution, this is what one would expect, and the data backs it up.
Fixed acidity and density showed a fairly strong relationship.
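Fixed acidity and density (r of about 0.67), although the correlation matrix shows fixed acidity and pH to be marginally stronger in magnitude (r of about -0.68).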
This plot represents density in relation to fixed.acidity within the Red Wine data, with the quality rating shown by color. After seeing the previous graphs, I knew that density increases as fixed acidity increases, and I wanted to see if there was any relationship with quality. Each quality rating is shown as a color in the legend to the right.
This plot represents alcohol percentage in relation to citric.acid within the Red Wine data, with the quality rating shown by color. There doesn’t appear to be a direct relationship between citric acid and alcohol. But there do seem to be more highly rated wines in the 11 to 13 percent alcohol range, as well as where the citric acid level is around 0.5.
This plot represents pH in relation to volatile.acidity within the Red Wine data, with the quality rating shown by color. pH and volatile acidity show only a weak relationship, and the correlation matrix puts it as slightly positive (r of about 0.23) rather than inverse.
There is a relationship between alcohol percentage and the quality rating of the wine. In general, the higher the alcohol, the better the wine. We can also see that higher-rated wines shift toward less volatile acidity.
The most interesting thing to see was the relationship between alcohol level and quality, as well as how citric acid also played a role, most likely through the freshness of the wine.
This plot represents density in relation to fixed.acidity within the Red Wine data, with the quality rating shown by color. It shows that fixed acidity increases as density increases and, loosely, that higher-quality wines tend to be the more acidic ones.
This plot represents the percentage of alcohol by volume against the number of wines at each level within the Red Wine data. The most common alcohol level is around 9.5 percent. With a different data set, this could of course change.
This box plot represents pH across the alcohol buckets within the Red Wine data. For this final plot I grouped alcohol into the buckets I created earlier. There is a rough relationship: as the alcohol level rises, so does the pH level.
Overall I found this project and report to be very interesting and valuable as it relates to using statistics and charts in R. The most interesting thing I found is that most of the red wines fall between about 8 and 10.5 percent alcohol, while the higher quality ratings tend to go to wines with more alcohol. Additionally, I tried to get the boxplot to have different colors based upon the bucket variable that I had created. This was the most difficult syntax for me, and I ended up just making them all one color.
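For future reference, one way to color a ggplot2 boxplot by bucket is to map the fill aesthetic to the bucket factor. The sketch below uses a small synthetic stand-in for the wine data, so the values are illustrative only; the column name alcohol_buckets and the 0-5, 5-10 and 10-15 ranges follow this report.

```r
library(ggplot2)

# Small synthetic stand-in for the wine data; values are illustrative only
set.seed(42)
wine <- data.frame(
  alcohol = runif(60, min = 8.4, max = 14.9),
  pH      = rnorm(60, mean = 3.3, sd = 0.15)
)
wine$alcohol_buckets <- cut(wine$alcohol, breaks = c(0, 5, 10, 15))

# Mapping fill to the bucket factor gives each box its own color
p <- ggplot(wine, aes(x = alcohol_buckets, y = pH, fill = alcohol_buckets)) +
  geom_boxplot() +
  labs(x = "Alcohol bucket (% by volume)", y = "pH", fill = "Alcohol bucket")

# Type p (or call print(p)) at the console to draw the box plot
```

The key point is that fill sits inside aes(), so ggplot2 picks one color per bucket; setting fill outside aes() applies a single color to every box.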
Also, the process of learning R was honestly a lot of fun. I have an analytic mindset and I firmly believe that I will use the tools and concepts I learned in the future. I know SQL, Python, HTML, CSS and JavaScript, and R was the quickest of these for me to learn.
If I were to analyze data like this again, I would love to see the names, prices and regions where the grapes were grown. I would like to see how region and price compare to the overall quality. The names would be a great value add for future personal experimentation. I think this data could be leveraged for additional insights in the future.