knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ---------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
You should use RStudio (probably with ggplot, tidyr, and dplyr) for this.
We will use a dataset from the UC Irvine (UCI) Machine Learning Repository. It’s just a place to keep cool datasets. You might want to check it out sometime.
Wine quality dataset description: http://archive.ics.uci.edu/ml/datasets/Wine+Quality
Red wine data (12 variables, 1599 rows): http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
(There is another file for white wine, but you don’t need it for this.)
Be sure to do these things for the big dataset (smithj should be your name and first initial, not smith, unless your name is J Smith):
Save the data into your own Y: Drive or GoogleDrive Space, using:
Make a new script file for your homework called smithj-220hw5.R
Make an RMarkdown file called smithj-220hw5.rmd
The final column, “quality”, is a 1-10 variable, where 10 means a very high quality wine (1 is lousy). This “quality” variable will be your “y” response variable for this assignment.
Import the dataset into RStudio using readr or the Import Dataset tool.
# Save a local copy of the raw file, then read it (the file is semicolon-delimited).
wine_url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
download.file(wine_url, destfile = "RedWine.csv")
RedWine <- read.csv(file = "RedWine.csv", header = TRUE, sep = ";")
View(RedWine)
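To see what the delimiter setting does without touching the network, here is a minimal sketch that parses a few lines of semicolon-delimited text. The three data rows and the choice of columns are made up for illustration (they are not a copy of the real file), and `sample_text` and `wine_sample` are hypothetical names.

```r
# Hypothetical excerpt in the same layout as the wine file:
# a quoted header row and semicolons between fields.
sample_text <- '"fixed acidity";"pH";"quality"
7.4;3.51;5
7.8;3.20;5
11.2;3.16;6'

# sep = ";" is required because read.csv() splits on commas by default.
wine_sample <- read.csv(text = sample_text, header = TRUE, sep = ";")

# read.csv() cleans header names, so "fixed acidity" becomes fixed.acidity.
names(wine_sample)
str(wine_sample)
```

The same name cleaning is why the full data frame has columns such as `residual.sugar` rather than a name with a space in it.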
(Notice that it uses semicolons instead of commas as the delimiter). Using your improved “pairs” command, look at all the variables. Eek. Since we really only care about quality, let’s just look at that one against the others:
RedWine %>%
  gather(-quality, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = quality)) +
  geom_jitter() + # jitter spreads out the overlapping integer quality scores
  stat_smooth(method = "lm") + # Might loess work better here?
  facet_wrap(~ var, scales = "free")
Higher sulphates and alcohol appear to go with higher quality ratings. Higher chlorides and volatile acidity appear to go with lower quality ratings.
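A visual comparison like this can be backed up with a number per variable. The sketch below fits one simple regression per predictor and ranks the predictors by r^2. It runs on simulated data so that it is self-contained: the data frame `toy`, its columns, and the effect sizes are all assumptions made for the example, not values from the wine file.

```r
set.seed(7)
n <- 300

# Simulated stand-in data: two predictors that matter and one that does not.
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.4, sd = 1),
  sulphates = rnorm(n, mean = 0.66, sd = 0.17),
  noise     = rnorm(n)
)
toy$quality <- 1 + 0.4 * toy$alcohol + 1.0 * toy$sulphates + rnorm(n, sd = 0.7)

# Fit quality ~ <predictor> once per column and keep the r^2 of each fit.
predictors <- setdiff(names(toy), "quality")
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "quality"), data = toy)
  summary(fit)$r.squared
})

# Largest r^2 first.
sort(r2, decreasing = TRUE)
```

Swapping `toy` for `RedWine` should give the same kind of ranking for the real predictors, assuming the column names match those produced by `read.csv`.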
Perhaps “alcohol” is the best candidate. Make a scatterplot of the two variables.
RedWine %>%
ggplot(aes(x = alcohol, y = quality))+
geom_jitter()+
stat_smooth(method="lm")
Wines with higher alcohol content tend to receive higher quality ratings.
Make a simple regression predicting quality from density. Spoiler: lm(y~x)
ggplot(data=RedWine, aes(y = `quality`, x = `density`)) +
geom_jitter()+
geom_smooth(method="lm")
From this graph, it appears that denser wines tend to have lower quality ratings.
From the simple display, what is your slope and intercept?
R1 <- lm(`quality` ~ `density`, data = RedWine)
R1
##
## Call:
## lm(formula = quality ~ density, data = RedWine)
##
## Coefficients:
## (Intercept) density
## 80.24 -74.85
summary(R1)
##
## Call:
## lm(formula = quality ~ density, data = RedWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7885 -0.6216 0.1554 0.4271 2.5177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.24 10.51 7.636 3.83e-14 ***
## density -74.85 10.54 -7.100 1.87e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7954 on 1597 degrees of freedom
## Multiple R-squared: 0.0306, Adjusted R-squared: 0.02999
## F-statistic: 50.41 on 1 and 1597 DF, p-value: 1.875e-12
Intercept: 80.24; Slope: -74.85
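To make the two numbers concrete, the sketch below plugs the rounded coefficients reported above into the fitted line, quality = intercept + slope * density. The density value 0.997 is an arbitrary choice for illustration, and `predict_quality` is a hypothetical helper, not part of the assignment.

```r
# Rounded coefficients copied from the lm() output above.
intercept <- 80.24
slope     <- -74.85

# Fitted line: predicted quality for a given density.
predict_quality <- function(density) intercept + slope * density

# Predicted quality at an illustrative density of 0.997.
predict_quality(0.997)

# Change in predicted quality when density rises by 0.001.
predict_quality(0.998) - predict_quality(0.997)
```

Because density varies over a very narrow range, the slope is easier to read per 0.001 units than per whole unit. The intercept is the prediction at density 0, far outside the data, so it has no practical interpretation on its own. With the fitted object available, `coef(R1)` returns the same two coefficients without rounding.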
Using “summary,” what about r^2? Which variable is best?
r^2 is 0.0306 and the adjusted r^2 is 0.02999. This means the model explains only about 3% of the variability in quality around its mean. Density alone is therefore a weak predictor and the fitted line is not very accurate, even though the slope is statistically significant (p = 1.87e-12).
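For reference, here is a minimal sketch of where these two numbers come from: r^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares, and adjusted r^2 rescales it by the degrees of freedom. The data are simulated so the block runs on its own; `x`, `y`, and the effect size are assumptions made for the example.

```r
set.seed(42)

# Simulated data with a weak linear relationship.
x <- runif(200)
y <- 5 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

# r^2 = 1 - SS_residual / SS_total
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted r^2 penalizes the number of predictors p.
n <- length(y)
p <- 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

# Both should agree with what summary() reports.
c(manual = r2_manual, summary = summary(fit)$r.squared)
c(manual = adj_r2_manual, summary = summary(fit)$adj.r.squared)

# Same formula applied to the rounded wine numbers (n = 1599, p = 1);
# the result is close to the reported 0.02999.
reported_check <- 1 - (1 - 0.0306) * (1599 - 1) / (1599 - 1 - 1)
reported_check
```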
Repeat your model using pH and density as the explanatory variables for quality.
# Plot quality against pH and color the points by density, the second predictor.
ggplot(data = RedWine, aes(x = pH, y = quality)) +
  geom_jitter(aes(color = density)) +
  geom_smooth(method = "lm")
R1G <- lm(quality ~ `density` + `pH`, data = RedWine)
R1G
##
## Call:
## lm(formula = quality ~ density + pH, data = RedWine)
##
## Coefficients:
## (Intercept) density pH
## 101.9302 -94.2968 -0.6959
summary (R1G)
##
## Call:
## lm(formula = quality ~ density + pH, data = RedWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8219 -0.6242 0.1169 0.4475 2.4942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.9302 11.2558 9.056 < 2e-16 ***
## density -94.2968 11.1301 -8.472 < 2e-16 ***
## pH -0.6959 0.1361 -5.114 3.53e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7892 on 1596 degrees of freedom
## Multiple R-squared: 0.04623, Adjusted R-squared: 0.04503
## F-statistic: 38.68 on 2 and 1596 DF, p-value: < 2.2e-16
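Both coefficients are negative and statistically significant, but the model explains little: r^2 is only 0.04623, so pH and density together account for under 5% of the variability in quality.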
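One way to check whether a second predictor adds anything is to compare the one-variable and two-variable models directly. The sketch below does this with `anova()` on simulated data. The data frame `d`, its columns `x1` and `x2`, and the effect sizes are assumptions made for the example.

```r
set.seed(3)
n <- 200

# Simulated data: y depends on both x1 and x2.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 - 0.5 * d$x1 - 0.3 * d$x2 + rnorm(n)

small <- lm(y ~ x1, data = d)       # one predictor
big   <- lm(y ~ x1 + x2, data = d)  # adds a second predictor

# F-test for whether the added predictor reduces the residual sum of squares.
cmp <- anova(small, big)
cmp

# With a single added predictor, this p-value matches the t-test p-value
# for that predictor in the larger model.
summary(big)$coefficients["x2", "Pr(>|t|)"]
```

A small p-value says the extra variable helps statistically. It does not say the model predicts well, which is why r^2 is still worth reading alongside it.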
Explore with a few more promising candidates, using lm and graphs
# Plot quality against sulphates and color the points by density.
ggplot(data = RedWine, aes(x = sulphates, y = quality)) +
  geom_jitter(aes(color = density)) +
  geom_smooth(method = "lm")
R2G <- lm(quality ~ `density` + `sulphates`, data = RedWine)
R2G
##
## Call:
## lm(formula = quality ~ density + sulphates, data = RedWine)
##
## Coefficients:
## (Intercept) density sulphates
## 97.314 -92.869 1.351
summary (R2G)
##
## Call:
## lm(formula = quality ~ density + sulphates, data = RedWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1842 -0.5271 0.0158 0.4699 2.5116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.3136 10.1778 9.561 <2e-16 ***
## density -92.8690 10.2219 -9.085 <2e-16 ***
## sulphates 1.3513 0.1138 11.873 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7627 on 1596 degrees of freedom
## Multiple R-squared: 0.1093, Adjusted R-squared: 0.1082
## F-statistic: 97.89 on 2 and 1596 DF, p-value: < 2.2e-16
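Holding the other variable fixed, higher sulphates are associated with higher quality (coefficient 1.351), while higher density is associated with lower quality (coefficient -92.869). With r^2 = 0.1093, this is the strongest of the two-variable models tried here, though it still explains only about 11% of the variability in quality.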
# Plot quality against residual sugar and color the points by density.
ggplot(data = RedWine, aes(x = residual.sugar, y = quality)) +
  geom_jitter(aes(color = density)) +
  geom_smooth(method = "lm")
R3G <- lm(quality ~ `density` + `residual.sugar`, data = RedWine)
R3G
##
## Call:
## lm(formula = quality ~ density + residual.sugar, data = RedWine)
##
## Coefficients:
## (Intercept) density residual.sugar
## 93.27072 -88.04742 0.04974
summary (R3G)
##
## Call:
## lm(formula = quality ~ density + residual.sugar, data = RedWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7891 -0.6095 0.1334 0.4465 2.5528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.27072 11.19303 8.333 < 2e-16 ***
## density -88.04742 11.24311 -7.831 8.74e-15 ***
## residual.sugar 0.04974 0.01505 3.305 0.000971 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7929 on 1596 degrees of freedom
## Multiple R-squared: 0.03719, Adjusted R-squared: 0.03598
## F-statistic: 30.82 on 2 and 1596 DF, p-value: 7.36e-14
Residual sugar has a small positive coefficient (0.04974) and density a negative one, but r^2 is only 0.03719, so together they explain little of the variability in quality.
# Plot quality against chlorides and color the points by density.
ggplot(data = RedWine, aes(x = chlorides, y = quality)) +
  geom_jitter(aes(color = density)) +
  geom_smooth(method = "lm")
R4G <- lm(quality ~ `density` + `chlorides`, data = RedWine)
R4G
##
## Call:
## lm(formula = quality ~ density + chlorides, data = RedWine)
##
## Coefficients:
## (Intercept) density chlorides
## 72.021 -66.455 -1.677
summary (R4G)
##
## Call:
## lm(formula = quality ~ density + chlorides, data = RedWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7015 -0.6246 0.1459 0.4246 2.4980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.0211 10.6710 6.749 2.07e-11 ***
## density -66.4546 10.7133 -6.203 7.04e-10 ***
## chlorides -1.6772 0.4296 -3.904 9.86e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7918 on 1596 degrees of freedom
## Multiple R-squared: 0.03977, Adjusted R-squared: 0.03856
## F-statistic: 33.05 on 2 and 1596 DF, p-value: 8.644e-15
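Both coefficients are negative: as chlorides and density increase, predicted quality decreases. r^2 is only 0.03977, so this model also explains little of the variability in quality.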
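To put the models side by side, the sketch below collects the adjusted r^2 values reported in the summaries above into one table and sorts them. The numbers are copied from the output in this document; the data frame name `fits` is a hypothetical choice.

```r
# Adjusted r^2 values copied from the summary() output above.
fits <- data.frame(
  model  = c("density",
             "density + pH",
             "density + sulphates",
             "density + residual.sugar",
             "density + chlorides"),
  adj_r2 = c(0.02999, 0.04503, 0.1082, 0.03598, 0.03856),
  stringsAsFactors = FALSE
)

# Best-fitting model first.
fits[order(fits$adj_r2, decreasing = TRUE), ]
```

With the fitted objects still in the workspace, `summary(R2G)$adj.r.squared` and its counterparts return the same values without retyping them.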