Data_3_607_Project2

#Introduction

For my last data set I wanted one with a lot of columns and rows. The one I chose is Melvin Matanos's dataset on wine. It has many columns, and my focus is on relating wine type to quality using quality, quality_label, residual.sugar, alcohol, and wine_type. Some useful analysis can be done with these data.

#Step 1 Overview

library(knitr)
library(stringr)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
wine <- read.csv("https://raw.githubusercontent.com/Wilchau/Data607Project2/main/Data_3.csv")
head(wine)

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 0           7.0             0.17        0.74           12.8      0.05
## 2 1           7.7             0.64        0.21            2.2      0.08
## 3 2           6.8             0.39        0.34            7.4      0.02
## 4 3           6.3             0.28        0.47           11.2      0.04
## 5 4           7.4             0.35        0.20           13.9      0.05
##   free.sulfure.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   24                  126 0.99420 3.26      0.38    12.2
## 2                   32                  133 0.99560 3.27      0.45     9.9
## 3                   38                  133 0.99212 3.18      0.44    12.0
## 4                   61                  183 0.99592 3.12      0.51     9.5
## 5                   63                  229 0.99888 3.11      0.50     8.9
##   quality wine_type quality_label
## 1       8     white          high
## 2       5       red           low
## 3       7     white        medium
## 4       6     white        medium
## 5       6     white        medium

#Step 2 Pull out the necessary variables

Drinking wine is like a work of art: many variables contribute to the taste. First we can use select() to grab the necessary variables: residual.sugar, pH, alcohol, quality, wine_type, and quality_label (plus the row index X). Then I will show the statistical summary and focus on pH versus the other variables.

wine_df <- select(wine, X, residual.sugar, pH, alcohol, quality, wine_type, quality_label)
summary(wine_df)

##        X     residual.sugar       pH           alcohol        quality   
##  Min.   :0   Min.   : 2.2   Min.   :3.110   Min.   : 8.9   Min.   :5.0  
##  1st Qu.:1   1st Qu.: 7.4   1st Qu.:3.120   1st Qu.: 9.5   1st Qu.:6.0  
##  Median :2   Median :11.2   Median :3.180   Median : 9.9   Median :6.0  
##  Mean   :2   Mean   : 9.5   Mean   :3.188   Mean   :10.5   Mean   :6.4  
##  3rd Qu.:3   3rd Qu.:12.8   3rd Qu.:3.260   3rd Qu.:12.0   3rd Qu.:7.0  
##  Max.   :4   Max.   :13.9   Max.   :3.270   Max.   :12.2   Max.   :8.0  
##   wine_type         quality_label     
##  Length:5           Length:5          
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##
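Since the stated focus is wine type and quality, a grouped summary may be a useful complement to summary(). The following is a minimal sketch, not part of the original analysis: it rebuilds the five rows shown in the head(wine) output so it runs without the download, and type_summary and its column names are hypothetical names chosen for illustration.

```r
library(dplyr)

# Five rows copied from the head(wine) output above, so the sketch runs offline.
wine_df <- data.frame(
  residual.sugar = c(12.8, 2.2, 7.4, 11.2, 13.9),
  pH             = c(3.26, 3.27, 3.18, 3.12, 3.11),
  alcohol        = c(12.2, 9.9, 12.0, 9.5, 8.9),
  quality        = c(8, 5, 7, 6, 6),
  wine_type      = c("white", "red", "white", "white", "white"),
  quality_label  = c("high", "low", "medium", "medium", "medium")
)

# Row count and mean of each numeric variable per wine type.
# type_summary is a hypothetical name, not something from the original report.
type_summary <- wine_df %>%
  group_by(wine_type) %>%
  summarise(
    n            = n(),
    mean_quality = mean(quality),
    mean_alcohol = mean(alcohol),
    mean_pH      = mean(pH)
  )

print(type_summary)
```

With a single red wine in these rows, the red group's means are just that one wine's values, so the comparison is only illustrative.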

ggplot(data=wine_df, aes(x=pH, y=quality, group=1)) +
  geom_line()+
  geom_point()

ggplot(data=wine_df, aes(x=pH, y=alcohol, group=1)) +
  geom_line()+
  geom_point()

ggplot(data=wine_df, aes(x=pH, y=residual.sugar, group=1)) +
  geom_line()+
  geom_point()
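The three plots above share the same x-axis, so they could also be drawn as one faceted figure. This is a sketch under assumptions: it uses tidyr's pivot_longer() to reshape the data, the five rows are copied from the head(wine) output, and long_df, measure, and value are hypothetical names introduced here.

```r
library(tidyr)
library(ggplot2)

# Numeric columns copied from the head(wine) output above.
wine_df <- data.frame(
  pH             = c(3.26, 3.27, 3.18, 3.12, 3.11),
  quality        = c(8, 5, 7, 6, 6),
  alcohol        = c(12.2, 9.9, 12.0, 9.5, 8.9),
  residual.sugar = c(12.8, 2.2, 7.4, 11.2, 13.9)
)

# Reshape to long form: one row per (wine, measure) pair.
long_df <- pivot_longer(
  wine_df,
  cols      = c(quality, alcohol, residual.sugar),
  names_to  = "measure",
  values_to = "value"
)

# One panel per measure, each with its own y-axis scale.
p <- ggplot(long_df, aes(x = pH, y = value)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ measure, scales = "free_y", ncol = 1)

print(p)
```

Stacking the panels in one column keeps the pH axis aligned, which makes it easier to compare where each variable rises or falls.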

Based on these plots and on my reading, pH affects the texture of the wine. Around pH 3.15-3.25 there appears to be an optimal range where sugar, quality, and alcohol level peak. Once the pH goes past 3.25, the wine becomes less acidic (more basic), and its quality can decrease as a result. The statistical summary gives a median pH of 3.180, which falls inside this range. With only five wines in the summary, this pattern is tentative.
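One way to put a number on the visual impression is a Pearson correlation between pH and each of the other variables. The sketch below uses base R's cor() on the five rows from the head(wine) output. ph_cor is a hypothetical name, and with so few observations the coefficients should be read as rough indications only.

```r
# Numeric columns copied from the head(wine) output above.
wine_df <- data.frame(
  pH             = c(3.26, 3.27, 3.18, 3.12, 3.11),
  quality        = c(8, 5, 7, 6, 6),
  alcohol        = c(12.2, 9.9, 12.0, 9.5, 8.9),
  residual.sugar = c(12.8, 2.2, 7.4, 11.2, 13.9)
)

# Pearson correlation of pH with each of the other numeric columns.
ph_cor <- sapply(
  wine_df[c("quality", "alcohol", "residual.sugar")],
  function(col) cor(wine_df$pH, col)
)

print(round(ph_cor, 2))
```

A linear correlation cannot capture a peak in the middle of the pH range, so it complements the plots instead of replacing them.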

#Conclusion

Don’t let your wine become basic or else the quality will decrease drastically.


Wilson Chau

2022-10-09