红酒数据集由11个数值型变量和1个因变量构成,本次探索性数据分析主要探索这11个变量属性与质量这一因变量之间的关系。
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
红酒的质量评价最多的为5分,其次为6分,再其次是7分。
酒精含量分布如下图
当酒精含量值为9.5的数量最多,中位数为10.2。整体分布为当酒精含量大于9.5时,随着酒精含量的升高,分布的红酒数量减少。当酒精含量小于9.5时,随着酒精含量的升高,分布的红酒数量增多。
乙酸含量分布如下图
挥发性酸的含量具有两个峰值,一个为0.4,一个为0.6左右
柠檬酸含量分布如下图
柠檬酸的含量分布基本比较均匀分布,有三个较高峰,分别在0.01/0.24/0.49
硫酸盐含量分布如下图
分布中硫酸盐含量数量最多的是0.55,原始数据呈现长尾分布。
含糖量分布如下图
分布中含糖量数量最多的是1.9-2.1,原始数据呈现长尾分布。
PH分布如下图
pH值主要分布在2.9-3.6区间,分布最多的值为3.25-3.4之间
盐含量分布如下图
含盐量主要分布在0-0.2区间,含量值最多的是0.05-0.1之间。原始数据呈现长尾分布
密度分布如下图
密度主要分布在0.994-1.000之间,本身数据差异不大。
总二氧化硫含量/自由二氧化硫含量分布如下图
* 统计绘画出所有变量自身的箱线图如上,与上述单个分析结果是相符合的。
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality_f : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality_f
## Min. : 8.40 Min. :3.000 3: 10
## 1st Qu.: 9.50 1st Qu.:5.000 4: 53
## Median :10.20 Median :6.000 5:681
## Mean :10.42 Mean :5.636 6:638
## 3rd Qu.:11.10 3rd Qu.:6.000 7:199
## Max. :14.90 Max. :8.000 8: 18
##
## Call:
## lm(formula = quality ~ alcohol, data = redwines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = redwines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79071 -0.54411 -0.00687 0.47350 2.93148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.56575 0.05791 113.39 <2e-16 ***
## volatile.acidity -1.76144 0.10389 -16.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared: 0.1525, Adjusted R-squared: 0.152
## F-statistic: 287.4 on 1 and 1597 DF, p-value: < 2.2e-16
较低柠檬酸含量的红酒数据基本分布在易挥发性酸含量平均值之上,即具有较低柠檬酸含量的红酒同时较高易挥发性酸含量。反之同理,即具有较高柠檬酸含量的红酒同时较低易挥发性酸含量。
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = redwines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79071 -0.54411 -0.00687 0.47350 2.93148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.56575 0.05791 113.39 <2e-16 ***
## volatile.acidity -1.76144 0.10389 -16.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared: 0.1525, Adjusted R-squared: 0.152
## F-statistic: 287.4 on 1 and 1597 DF, p-value: < 2.2e-16