Project Proposal

Data Preparation

library(tidyverse)
wine <- read.csv('wineQualityReds.csv')

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Which variables in the dataset are responsible for the quality of the wine.

Cases

What are the cases, and how many are there?

Each case represents one wine. There are 1599 observations in the given data set.

Data collection

Describe the method of data collection.

The data set is created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Data is available online here:
https://www.kaggle.com/piyushgoyal443/red-wine-dataset/downloads/red-wine-dataset.zip/1

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is quality and is numerical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

All the explanatory variablies fixed.acidity, volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol are numerical.

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

names(wine)

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

str(wine)

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Univariate Analysis

Red wine : Quality

summary(wine$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

ggplot(aes(x = quality, color = I('white')), data = wine)+
  geom_bar()+ scale_x_continuous(breaks = seq(3, 8, 1))

Bivariate Analysis

Box Plot: Alcohol vs quality

ggplot(aes(x = as.factor(quality), y = wine$alcohol),
  data = wine)+
  geom_boxplot()+
  scale_x_discrete(breaks = seq(1, 10, 1))+
  scale_y_continuous(breaks = seq(8, 15, 0.5))