## installed_and_loaded.packages.
## prettydoc TRUE
## dplyr TRUE
## ggplot2 TRUE
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Wine Data Set
“These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.” UCI Machine Learning Repository
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine_df <- read.csv(url, header = FALSE)
names(wine_df) <- c("cultivars", "Alcohol", "Malic_acid", "Ash", "Alcalinity_of_ash",
"Magnesium", "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins",
"Color_intensity", "Hue", "OD280_OD315_of_dilutedwines", "Proline")
glimpse(wine_df)
## Observations: 178
## Variables: 14
## $ cultivars <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24,...
## $ Malic_acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76...
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45...
## $ Alcalinity_of_ash <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2...
## $ Magnesium <int> 127, 100, 101, 113, 118, 112, 96, ...
## $ Total_phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27...
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39...
## $ Nonflavanoid_phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34...
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97...
## $ Color_intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75...
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05...
## $ OD280_OD315_of_dilutedwines <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85...
## $ Proline <int> 1065, 1050, 1185, 1480, 735, 1450,...
Question
Is the measure of color intensity predictive of the wine’s alcohol content?
Model Summary
wine_model <- lm(Alcohol ~ Color_intensity, wine_df)
summary(wine_model)
##
## Call:
## lm(formula = Alcohol ~ Color_intensity, data = wine_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62083 -0.50404 -0.00891 0.48728 1.80223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.03286 0.12295 97.868 < 2e-16 ***
## Color_intensity 0.19133 0.02211 8.654 3.06e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6819 on 176 degrees of freedom
## Multiple R-squared: 0.2985, Adjusted R-squared: 0.2945
## F-statistic: 74.9 on 1 and 176 DF, p-value: 3.056e-15
intercept <- coef(wine_model)[1]
slope <- coef(wine_model)[2]
Scatter Plot
# Plot the raw data and overlay the fitted least-squares line
a <- ggplot(wine_df, aes(Color_intensity, Alcohol))
a + geom_point() + geom_abline(slope = slope, intercept = intercept)
Model Interpretation
This linear model is expressed as \(\widehat{\text{Alcohol}} = 12.03286 + 0.19133 \times \text{ColorIntensity}\).
For each one-unit increase in color intensity, the model predicts an increase of about 0.19 in alcohol content.
In this model, multiple \(R^2\) is \(0.2985\), which means that the model’s least-squares line accounts for approximately \(30\%\) of the variation in the alcohol content.
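To make the meaning of \(R^2\) concrete, here is a minimal sketch that recomputes it by hand. It uses simulated data as a stand-in for the wine data (the data frame `sim_df` and its generating values are assumptions chosen only to resemble the fitted model), so the numbers will not match the summary above.

```r
# Simulated stand-in for the wine data (an assumption, not the real measurements)
set.seed(1)
sim_df <- data.frame(Color_intensity = runif(178, 1, 13))
sim_df$Alcohol <- 12 + 0.19 * sim_df$Color_intensity + rnorm(178, sd = 0.68)

fit <- lm(Alcohol ~ Color_intensity, data = sim_df)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sim_df$Alcohol - mean(sim_df$Alcohol))^2)
r2_manual <- 1 - ss_res / ss_tot

# With a single predictor, R^2 also equals the squared correlation
r2_cor <- cor(sim_df$Alcohol, sim_df$Color_intensity)^2

r2_lm <- summary(fit)$r.squared
print(c(manual = r2_manual, squared_cor = r2_cor, lm = r2_lm))
```

With `wine_df` loaded, the same three quantities can be computed by swapping in `wine_model` and the real columns.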
The p-values for the intercept and the Color_intensity coefficient are both near zero: if the true coefficients were zero, estimates this far from zero would be very unlikely. Both terms are therefore statistically significant.
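The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. The sketch below does this on simulated stand-in data (`sim_df` is an assumption, not the wine measurements) and also reports a 95% confidence interval for the slope, which is often easier to interpret than the p-value alone.

```r
# Simulated stand-in for the wine data (an assumption, not the real measurements)
set.seed(1)
sim_df <- data.frame(Color_intensity = runif(178, 1, 13))
sim_df$Alcohol <- 12 + 0.19 * sim_df$Color_intensity + rnorm(178, sd = 0.68)

fit <- lm(Alcohol ~ Color_intensity, data = sim_df)
coefs <- summary(fit)$coefficients

est <- coefs["Color_intensity", "Estimate"]
se  <- coefs["Color_intensity", "Std. Error"]

# t statistic for H0: slope = 0, and its two-sided p-value
t_manual <- est / se
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

# 95% confidence interval for the slope
ci <- confint(fit, "Color_intensity", level = 0.95)

print(c(t = t_manual, p = p_manual))
print(ci)
```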
Model Diagnostics
Let’s assess if this linear model is reliable.
Linearity: Do the variables have a linear relationship?
At first glance, the scatter plot appears to show a linear relationship, but a closer look suggests the relationship may be slightly curvilinear.
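One way to probe a suspected curve is to fit a quadratic term and compare the two nested models with an F-test. This is a sketch of that idea, not a step the original analysis performed; the data are simulated with curvature deliberately built in (`sim_df` and its coefficients are assumptions).

```r
# Simulated data with mild curvature built in (an assumption for illustration)
set.seed(1)
sim_df <- data.frame(Color_intensity = runif(178, 1, 13))
sim_df$Alcohol <- 11.5 + 0.45 * sim_df$Color_intensity -
  0.02 * sim_df$Color_intensity^2 + rnorm(178, sd = 0.68)

fit_linear <- lm(Alcohol ~ Color_intensity, data = sim_df)
fit_quad   <- lm(Alcohol ~ Color_intensity + I(Color_intensity^2), data = sim_df)

# F-test of the nested models: does the squared term reduce the residual
# sum of squares by more than chance alone would explain?
comparison <- anova(fit_linear, fit_quad)
print(comparison)
```

A small p-value in the second row would favor the quadratic model. On the wine data, the same comparison could be run by replacing `sim_df` with `wine_df`.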
Nearly normal residuals: Are the model’s residuals distributed normally?
Yes, per the histogram and Q-Q plot, the residuals are approximately normally distributed.
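As a numerical complement to the visual checks, the Shapiro-Wilk test in base R can be applied to the residuals. This is an optional extra rather than part of the original analysis, and the sketch runs on simulated stand-in data (`sim_df` is an assumption).

```r
# Simulated stand-in for the wine data (an assumption, not the real measurements)
set.seed(1)
sim_df <- data.frame(Color_intensity = runif(178, 1, 13))
sim_df$Alcohol <- 12 + 0.19 * sim_df$Color_intensity + rnorm(178, sd = 0.68)

fit <- lm(Alcohol ~ Color_intensity, data = sim_df)

# H0: the residuals come from a normal distribution.
# A large p-value means no evidence against normality.
sw <- shapiro.test(residuals(fit))
print(sw)
```

The test is sensitive to sample size, so it is best read alongside the plots rather than in place of them.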
hist(wine_model$residuals)
qqnorm(wine_model$residuals)
qqline(wine_model$residuals)
Homoscedasticity: Is there constant variability among the residuals?
In the scatter plot of the residuals below, the residuals appear mostly negative at low color intensity, then positive, then negative again at the highest values, a pattern consistent with the slight curvature noted under linearity.
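Constant variance can also be checked numerically. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R: regress the squared residuals on the predictor and compare \(n R^2\) from that auxiliary regression to a chi-squared distribution with one degree of freedom. It is an illustration on simulated stand-in data (`sim_df` is an assumption), not a test the original analysis ran.

```r
# Simulated stand-in for the wine data (an assumption, not the real measurements)
set.seed(1)
sim_df <- data.frame(Color_intensity = runif(178, 1, 13))
sim_df$Alcohol <- 12 + 0.19 * sim_df$Color_intensity + rnorm(178, sd = 0.68)

fit <- lm(Alcohol ~ Color_intensity, data = sim_df)

# Auxiliary regression: do the squared residuals depend on the predictor?
sim_df$sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ Color_intensity, data = sim_df)

# Studentized Breusch-Pagan statistic: n * R^2, chi-squared with 1 df under H0
bp_stat <- nrow(sim_df) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value would indicate that the residual variance changes with color intensity.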
plot(wine_model$residuals ~ wine_df$Color_intensity)
abline(h = 0, lty = 3)
Independent observations: Are the data from a random sample and not from a time series?
We do not have information on how the samples were collected, so this condition cannot be verified.