This is assignment 2 of the Exploring Data Analytics class. The professor asked us to formulate research questions that we are interested in and to find data to answer them.
One day I went wine shopping with my friends to celebrate a friend's birthday. However, I did not know much about wine, and I did not know how to choose a great, affordable bottle. In the end, my friend used an app to choose the wine. I decided to spend some time learning about wine rankings and how to choose a good wine, and to build a basic understanding of wine ingredients.
To learn how the chemical components of wine affect its taste and quality, I obtained two datasets: one from PSU, which includes only red wine, and the Kaggle Wine Dataset, which includes all kinds of wine. I would like to explore each dataset and compare the two to learn about the components of wine. Which ingredients or features matter most for taste?
By comparing the two datasets, I can see how red wine differs from wine in general in each component and chemical property. I can also see how these properties affect overall quality.
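One way such a comparison could be set up is sketched below. It assumes the two files spell the same variable slightly differently (for example `fixed acidity` in one and `fixed.acidity` in the other) and that an index column may exist in only one of them. The helper `compare_means()` and the toy data frames are hypothetical and only illustrate the idea; they are not the assignment's actual analysis.

```r
# Hypothetical helper: harmonize column names, then compare the mean of
# every column that both data frames share.
compare_means <- function(all_wine, red_wine) {
  # make.names() turns "fixed acidity" into "fixed.acidity"
  names(all_wine) <- make.names(names(all_wine))
  names(red_wine) <- make.names(names(red_wine))

  # Columns present in only one data frame (e.g. an index) are dropped
  shared <- intersect(names(all_wine), names(red_wine))

  data.frame(
    variable = shared,
    all_wine = sapply(shared, function(v) mean(all_wine[[v]], na.rm = TRUE)),
    red_wine = sapply(shared, function(v) mean(red_wine[[v]], na.rm = TRUE)),
    row.names = NULL
  )
}

# Toy rows for illustration only (not the full datasets)
toy_all <- data.frame(`fixed acidity` = c(7.0, 6.3), quality = c(6, 6),
                      check.names = FALSE)
toy_red <- data.frame(X1 = 1:2, fixed.acidity = c(7.4, 7.8), quality = c(5, 5))

toy_cmp <- compare_means(toy_all, toy_red)
print(toy_cmp)
```

Comparing means is only a first look; it does not by itself show which component drives quality.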
I listed the following questions about the wine data.
I. Two Wine Data Exploration
1. Are the two datasets well organized?
2. How much data did we obtain?
3. How many variables (rows and columns) are there?
4. How many missing values are there?
5. Are the two datasets reliable?
6. Are the two datasets similar?
7. Do the two datasets contain duplicated data?
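Several of these questions (size, missing values, duplicates) can be answered with a few base R calls. The sketch below wraps them in a hypothetical helper, `explore_dataset()`, and runs it on made-up toy rows so that it is self-contained; it is an assumption about how the checks could be done, not the method used in the assignment.

```r
# Hypothetical helper: basic size and quality checks for one data frame.
explore_dataset <- function(df) {
  list(
    n_rows       = nrow(df),
    n_cols       = ncol(df),
    missing      = colSums(is.na(df)),  # missing values per column
    n_missing    = sum(is.na(df)),      # missing values in total
    n_duplicated = sum(duplicated(df))  # rows identical to an earlier row
  )
}

# Toy data for illustration only (not taken from either wine dataset)
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.4, NA),
  quality = c(5, 5, 5, 6)
)

toy_checks <- explore_dataset(toy)
str(toy_checks)
```

Once a dataset is loaded, the same helper could be called on it directly, for example `explore_dataset(Wine_data)`.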
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
library(readr)
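# Set the working directory to the folder that holds the downloaded files
setwd("~/Downloads/")
# Read the Excel workbook with the wine data
Wine_data <- read_excel("~/Downloads/Wine_data.xlsx")
# Open the data in the viewer (only works in an interactive session)
View(Wine_data)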
# Structure and summary of the Dataframe
str(Wine_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4898 obs. of 12 variables:
## $ fixed acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free sulfur dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total sulfur dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : num 6 6 6 6 6 6 6 6 6 6 ...
summary(Wine_data)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free sulfur dioxide total sulfur dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
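The output that follows belongs to the second dataset, the red wine CSV, but the code that produced it is not shown. A minimal sketch of what that step might look like is given here. It assumes `read_csv()` from readr (suggested by the "Parsed with column specification" message) and the file path reported in the parsing problems; the object name `red_wine` is hypothetical.

```r
library(readr)

# Path as reported in the parsing-problem output; adjust for your machine
red_path <- "~/Desktop/wineQualityReds.csv"

# `red_wine` is a hypothetical name; the original object name is not shown.
# The guard keeps the sketch runnable when the file is not present.
if (file.exists(path.expand(red_path))) {
  red_wine <- read_csv(red_path)
  str(red_wine)             # structure of the data frame
  print(summary(red_wine))  # per-column summary statistics
}
```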
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## fixed.acidity = col_double(),
## volatile.acidity = col_double(),
## citric.acid = col_double(),
## residual.sugar = col_double(),
## chlorides = col_double(),
## free.sulfur.dioxide = col_double(),
## total.sulfur.dioxide = col_integer(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_integer()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col                  expected               actual
##   <int> <chr>                <chr>                  <chr>
## 1  1296 total.sulfur.dioxide no trailing characters .5
## 2  1297 total.sulfur.dioxide no trailing characters .5
## # ... with 1 more variables: file <chr>
## Classes 'tbl_df', 'tbl' and 'data.frame': 1599 obs. of 13 variables:
## $ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: int 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 5 variables:
## ..$ row : int 1296 1297
## ..$ col : chr "total.sulfur.dioxide" "total.sulfur.dioxide"
## ..$ expected: chr "no trailing characters" "no trailing characters"
## ..$ actual : chr ".5" ".5"
## ..$ file : chr "'~/Desktop/wineQualityReds.csv'" "'~/Desktop/wineQualityReds.csv'"
## - attr(*, "spec")=List of 2
## ..$ cols :List of 13
## .. ..$ X1 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ fixed.acidity : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ volatile.acidity : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ citric.acid : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ residual.sugar : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ chlorides : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ free.sulfur.dioxide : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ total.sulfur.dioxide: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ density : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ pH : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ sulphates : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ alcohol : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ quality : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
## X1 fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.43 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## NA's :2
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
##
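The `NA's :2` under `total.sulfur.dioxide` in the summary above matches the two parsing failures at rows 1296 and 1297: the column was read as integer, but those two values end in `.5`, so they were turned into missing values. readr guesses column types from the first rows of a file (1000 by default, controlled by `guess_max`), which would explain why the fractional values were not seen in time. One possible fix, sketched here on made-up rows rather than the real file, is to declare the column as double when reading.

```r
library(readr)

# Tiny illustrative CSV (made-up rows, not the real red wine data)
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("total.sulfur.dioxide,quality",
    "34,5",
    "77.5,6"),
  tmp
)

# Declaring the column as double lets fractional values parse
# instead of being dropped as NA.
demo <- read_csv(
  tmp,
  col_types = cols(
    total.sulfur.dioxide = col_double(),
    quality = col_integer()
  )
)

# Number of missing values in the column after reading
n_na <- sum(is.na(demo$total.sulfur.dioxide))
print(n_na)
```

For the real file, the same `col_types` argument could be passed to the `read_csv()` call that loads `wineQualityReds.csv`; raising `guess_max` is another option.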