knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ---------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)

You should use RStudio (probably with ggplot2, tidyr, and dplyr) for this.

We will use a dataset from the UC Irvine Machine Learning Repository. It’s just a place to keep cool datasets. You might want to check it out sometime.

Wine quality dataset description: http://archive.ics.uci.edu/ml/datasets/Wine+Quality

12 variables, 1599 rows of Red Wine: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv (There is another file for white wine, but you don’t need it for this.)

Be sure to do these things for the big dataset (replace smithj with your own last name and first initial; keep smithj only if your name is J Smith):

Save the data into your own Y: Drive or GoogleDrive Space, using:

Make a new script file for your homework called smithj-220hw5.R

Make an RMarkdown file called smithj-220hw5.rmd

The final column, “quality,” is a 1-10 variable, where 10 means a very high quality wine (1 is lousy). This “quality” variable will be your “y” response variable for this assignment.

Import the dataset into RStudio using readr or the Import Dataset tool.

wine_url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
download.file(wine_url, destfile = "RedWine.csv")   # save a local copy of the raw file
RedWine <- read.csv(file = wine_url, header = TRUE, sep = ";")   # the file is semicolon-delimited
View(RedWine)
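
Since the assignment also allows readr, here is a minimal sketch of the same kind of import with `readr::read_delim()`. To stay runnable offline it reads a tiny made-up semicolon-delimited file written to a temporary path; the three data rows and the names `tmp` and `sample_wine` are illustrative assumptions, not the real data.

```r
library(readr)

# Hypothetical three-row sample in the same semicolon-delimited layout.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"fixed acidity";"density";"quality"',
  "7.4;0.9978;5",
  "7.8;0.9968;5",
  "11.2;0.998;6"
), tmp)

# delim = ";" tells readr that fields are separated by semicolons, not commas.
sample_wine <- read_delim(tmp, delim = ";")
sample_wine
```

One difference to keep in mind: readr keeps the space in a column name such as `fixed acidity`, whereas `read.csv()` converts it to `fixed.acidity`. The later code that refers to `residual.sugar` therefore assumes the `read.csv()` import.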

Quality vs. Other Variables

(Notice that it uses semicolons instead of commas as the delimiter). Using your improved “pairs” command, look at all the variables. Eek. Since we really only care about quality, let’s just look at that one against the others:

RedWine %>%
  gather(-quality, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = quality)) +
  geom_jitter() +                   # jitter separates the overlapping integer quality scores
  stat_smooth(method = "lm") +      # Might loess work better here?
  facet_wrap(~ var, scales = "free")
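
The reshaping step is what makes the faceted plot possible, so it may help to see it in isolation. The sketch below applies the same `gather()` call to a three-row toy data frame; the values and the names `toy` and `long` are made up for illustration.

```r
library(tidyr)

# Toy data: two predictors and the response (values are made up).
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5),
  pH      = c(3.51, 3.20, 3.16),
  quality = c(5, 5, 6)
)

# Every column except quality is stacked into (var, value) pairs,
# so each row holds one predictor value together with its quality score.
long <- gather(toy, -quality, key = "var", value = "value")
long
```

Each original row becomes one row per predictor, which is why `facet_wrap(~ var)` can then draw one panel per variable.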

Higher sulphates and alcohol appear to go with higher quality ratings. Higher chlorides and volatile acidity appear to go with lower quality ratings.
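
Visual impressions like these can be checked numerically with correlations. The following is only a sketch: `cor_with_response()` is a hypothetical helper written for this example, and the toy data frame is made up so that the expected signs are obvious.

```r
# Hypothetical helper: correlation of every other column with `response`.
cor_with_response <- function(df, response) {
  others <- setdiff(names(df), response)
  sapply(df[others], function(x) cor(x, df[[response]]))
}

# Toy data (made up): x_up rises with y, x_down falls with y.
toy <- data.frame(
  x_up   = c(1, 2, 3, 4, 5),
  x_down = c(5, 4, 3, 2, 1),
  y      = c(2, 4, 6, 8, 10)
)

# Sorted from most negative to most positive correlation.
sort(cor_with_response(toy, "y"))
```

With the wine data loaded, the equivalent call would be `cor_with_response(RedWine, "quality")`.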

Quality vs. Alcohol

Perhaps “alcohol” is the best candidate. Make a scatterplot of the two variables.

RedWine %>%
  ggplot(aes(x = alcohol, y = quality))+
  geom_jitter()+
  stat_smooth(method="lm")

The higher the alcohol content, the higher the quality rating tends to be.

Regression: Quality vs. Density

Make a simple regression predicting quality from density. Spoiler: lm(y~x)

ggplot(data=RedWine, aes(y = `quality`, x = `density`)) +
  geom_jitter()+
  geom_smooth(method="lm")

From this graph, it appears that the denser the wine, the lower its quality rating tends to be.

From the simple display, what is your slope and intercept?

R1 <- lm(`quality` ~ `density`, data = RedWine)
R1
## 
## Call:
## lm(formula = quality ~ density, data = RedWine)
## 
## Coefficients:
## (Intercept)      density  
##       80.24       -74.85
summary(R1)
## 
## Call:
## lm(formula = quality ~ density, data = RedWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7885 -0.6216  0.1554  0.4271  2.5177 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    80.24      10.51   7.636 3.83e-14 ***
## density       -74.85      10.54  -7.100 1.87e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7954 on 1597 degrees of freedom
## Multiple R-squared:  0.0306, Adjusted R-squared:  0.02999 
## F-statistic: 50.41 on 1 and 1597 DF,  p-value: 1.875e-12

Intercept: 80.24; Slope: -74.85
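
Rather than reading the numbers off the printout, the intercept and slope can be pulled out of the fitted model with `coef()`. This is a minimal sketch on made-up data that lies exactly on the line y = 2 + 3x, so the recovered values are easy to check; `toy`, `fit`, and `b` are illustrative names.

```r
# Toy data lying exactly on y = 2 + 3x (made up for illustration).
toy <- data.frame(x = c(0, 1, 2, 3))
toy$y <- 2 + 3 * toy$x

fit <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" and the slope for x.
b <- coef(fit)
b[["(Intercept)"]]
b[["x"]]

# A prediction at a new x value is intercept + slope * x.
predict(fit, newdata = data.frame(x = 10))
```

For the density model, `coef(R1)` would return the same 80.24 and -74.85 shown in the output. Note that the intercept is the predicted quality at a density of 0, which is not a realistic wine, so the slope is the more interpretable number.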

Using “summary,” what about r^2? Which variable is best?

r^2 is 0.0306 and the adjusted r^2 is 0.02999. This means the model explains only about 3% of the variability in quality around its mean. The value is worth knowing because it shows that, although the slope is statistically significant (p = 1.87e-12), density alone predicts quality poorly.
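
To make the meaning of r^2 concrete, it can be recomputed by hand from the residuals and compared with what `summary()` reports. The data below are made up, and the variable names are illustrative; the formulas are the standard definitions of R-squared and adjusted R-squared.

```r
# Toy data with some noise around a line (values are made up).
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 2.9, 4.2, 3.8, 5.5, 5.9)
)
fit <- lm(y ~ x, data = toy)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$y - mean(toy$y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted R^2 penalises the number of predictors p (here p = 1).
n <- nrow(toy)
p <- 1
adj_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

# The hand-computed values should match the ones stored by summary().
c(manual = r2_manual, from_summary = summary(fit)$r.squared)
c(manual = adj_manual, from_summary = summary(fit)$adj.r.squared)
```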

Quality vs. pH and Density

Repeat your model using pH and density as the explanatory variables for quality.

ggplot(data = RedWine, aes(x = pH, y = quality)) +
  geom_jitter(aes(color = density)) +   # color carries the second predictor
  geom_smooth(method = "lm")

R1G <- lm(quality ~ `density` + `pH`, data = RedWine)
R1G
## 
## Call:
## lm(formula = quality ~ density + pH, data = RedWine)
## 
## Coefficients:
## (Intercept)      density           pH  
##    101.9302     -94.2968      -0.6959
summary (R1G)
## 
## Call:
## lm(formula = quality ~ density + pH, data = RedWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8219 -0.6242  0.1169  0.4475  2.4942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101.9302    11.2558   9.056  < 2e-16 ***
## density     -94.2968    11.1301  -8.472  < 2e-16 ***
## pH           -0.6959     0.1361  -5.114 3.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7892 on 1596 degrees of freedom
## Multiple R-squared:  0.04623,    Adjusted R-squared:  0.04503 
## F-statistic: 38.68 on 2 and 1596 DF,  p-value: < 2.2e-16

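pH and density together explain less than 5% of the variation in quality (r^2 = 0.046), so they predict quality only weakly, even though both coefficients are negative and statistically significant.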
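
One way to ask whether adding a second predictor is worth it is to compare the two nested models with `anova()`. The sketch below uses simulated data, not the wine data; `small`, `large`, and `cmp` are illustrative names.

```r
# Simulated data (made up): y depends strongly on x1 and weakly on x2.
set.seed(42)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
toy$y <- 3 - 2 * toy$x1 + 0.5 * toy$x2 + rnorm(40)

small <- lm(y ~ x1, data = toy)        # one predictor
large <- lm(y ~ x1 + x2, data = toy)   # same model plus x2

# F-test of the null hypothesis that the extra coefficient (x2) is zero.
cmp <- anova(small, large)
cmp
```

With a single added predictor, this F-test gives the same p-value as the t-test for that predictor in `summary(large)`. For the wine models the corresponding call would be `anova(R1, R1G)`.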

Explore with a few more promising candidates, using lm and graphs

Quality vs. Sulphates and Density

ggplot(data = RedWine, aes(x = sulphates, y = quality)) +
  geom_jitter(aes(color = density)) +   # color carries the second predictor
  geom_smooth(method = "lm")

R2G <- lm(quality ~ `density` + `sulphates`, data = RedWine)
R2G
## 
## Call:
## lm(formula = quality ~ density + sulphates, data = RedWine)
## 
## Coefficients:
## (Intercept)      density    sulphates  
##      97.314      -92.869        1.351
summary (R2G)
## 
## Call:
## lm(formula = quality ~ density + sulphates, data = RedWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1842 -0.5271  0.0158  0.4699  2.5116 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  97.3136    10.1778   9.561   <2e-16 ***
## density     -92.8690    10.2219  -9.085   <2e-16 ***
## sulphates     1.3513     0.1138  11.873   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7627 on 1596 degrees of freedom
## Multiple R-squared:  0.1093, Adjusted R-squared:  0.1082 
## F-statistic: 97.89 on 2 and 1596 DF,  p-value: < 2.2e-16

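Holding density fixed, higher sulphates are associated with higher quality (coefficient 1.35); holding sulphates fixed, higher density is associated with lower quality (coefficient -92.87). Together they explain about 11% of the variation in quality, the most of any two-variable model tried here.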

Quality vs. Residual Sugar and Density

ggplot(data = RedWine, aes(x = residual.sugar, y = quality)) +
  geom_jitter(aes(color = density)) +   # color carries the second predictor
  geom_smooth(method = "lm")

R3G <- lm(quality ~ `density` + `residual.sugar`, data = RedWine)
R3G
## 
## Call:
## lm(formula = quality ~ density + residual.sugar, data = RedWine)
## 
## Coefficients:
##    (Intercept)         density  residual.sugar  
##       93.27072       -88.04742         0.04974
summary (R3G)
## 
## Call:
## lm(formula = quality ~ density + residual.sugar, data = RedWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7891 -0.6095  0.1334  0.4465  2.5528 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     93.27072   11.19303   8.333  < 2e-16 ***
## density        -88.04742   11.24311  -7.831 8.74e-15 ***
## residual.sugar   0.04974    0.01505   3.305 0.000971 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7929 on 1596 degrees of freedom
## Multiple R-squared:  0.03719,    Adjusted R-squared:  0.03598 
## F-statistic: 30.82 on 2 and 1596 DF,  p-value: 7.36e-14

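Residual sugar adds little: together with density it explains under 4% of the variation in quality (r^2 = 0.037), although its small positive coefficient (0.05) is statistically significant.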

Quality vs. Chlorides and Density

ggplot(data = RedWine, aes(x = chlorides, y = quality)) +
  geom_jitter(aes(color = density)) +   # color carries the second predictor
  geom_smooth(method = "lm")

R4G <- lm(quality ~ `density` + `chlorides`, data = RedWine)
R4G
## 
## Call:
## lm(formula = quality ~ density + chlorides, data = RedWine)
## 
## Coefficients:
## (Intercept)      density    chlorides  
##      72.021      -66.455       -1.677
summary (R4G)
## 
## Call:
## lm(formula = quality ~ density + chlorides, data = RedWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7015 -0.6246  0.1459  0.4246  2.4980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  72.0211    10.6710   6.749 2.07e-11 ***
## density     -66.4546    10.7133  -6.203 7.04e-10 ***
## chlorides    -1.6772     0.4296  -3.904 9.86e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7918 on 1596 degrees of freedom
## Multiple R-squared:  0.03977,    Adjusted R-squared:  0.03856 
## F-statistic: 33.05 on 2 and 1596 DF,  p-value: 8.644e-15

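As chlorides and density increase, the predicted quality decreases (both coefficients are negative), but the model explains only about 4% of the variation in quality.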
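
The four two-variable fits above all follow the same pattern, so they could be compared side by side with a small loop instead of four separate blocks. This is a sketch under assumptions: `compare_candidates()` is a hypothetical helper, and the data are simulated so the example runs without the wine file.

```r
# Hypothetical helper: adjusted R^2 of response ~ base + each candidate.
compare_candidates <- function(df, response, base, candidates) {
  sapply(candidates, function(v) {
    f <- reformulate(c(base, v), response = response)
    summary(lm(f, data = df))$adj.r.squared
  })
}

# Simulated data (made up): y depends on x1 and x2 but not on x3.
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
toy$y <- 1 + toy$x1 + 2 * toy$x2 + rnorm(50, sd = 0.5)

# Candidates ranked from best to worst adjusted R^2.
sort(compare_candidates(toy, "y", "x1", c("x2", "x3")), decreasing = TRUE)
```

With the wine data, the call would look like `compare_candidates(RedWine, "quality", "density", c("pH", "sulphates", "residual.sugar", "chlorides"))`, which should reproduce the adjusted r^2 values printed in the summaries above.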