DATA 605 FUNDAMENTALS OF COMPUTATIONAL MATHEMATICS

Discussion 11: Linear Regression

Kyle Gilde

11/8/2017

##           installed_and_loaded.packages.
## prettydoc                           TRUE
## dplyr                               TRUE
## ggplot2                             TRUE

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Wine Data Set

“These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.” UCI Machine Learning Repository

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"

wine_df <- read.csv(url, header = FALSE)

names(wine_df) <- c("cultivars", "Alcohol", "Malic_acid", "Ash", "Alcalinity_of_ash", 
    "Magnesium", "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins", 
    "Color_intensity", "Hue", "OD280_OD315_of_dilutedwines", "Proline")

glimpse(wine_df)
## Observations: 178
## Variables: 14
## $ cultivars                   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Alcohol                     <dbl> 14.23, 13.20, 13.16, 14.37, 13.24,...
## $ Malic_acid                  <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76...
## $ Ash                         <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45...
## $ Alcalinity_of_ash           <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2...
## $ Magnesium                   <int> 127, 100, 101, 113, 118, 112, 96, ...
## $ Total_phenols               <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27...
## $ Flavanoids                  <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39...
## $ Nonflavanoid_phenols        <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34...
## $ Proanthocyanins             <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97...
## $ Color_intensity             <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75...
## $ Hue                         <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05...
## $ OD280_OD315_of_dilutedwines <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85...
## $ Proline                     <int> 1065, 1050, 1185, 1480, 735, 1450,...

Question

Is the measure of color intensity predictive of the wine’s alcohol content?

Model Summary

wine_model <- lm(Alcohol ~ Color_intensity, wine_df)
summary(wine_model)
## 
## Call:
## lm(formula = Alcohol ~ Color_intensity, data = wine_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62083 -0.50404 -0.00891  0.48728  1.80223 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.03286    0.12295  97.868  < 2e-16 ***
## Color_intensity  0.19133    0.02211   8.654 3.06e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6819 on 176 degrees of freedom
## Multiple R-squared:  0.2985, Adjusted R-squared:  0.2945 
## F-statistic:  74.9 on 1 and 176 DF,  p-value: 3.056e-15
intercept <- coef(wine_model)[1]
slope <- coef(wine_model)[2]

Scatter Plot

# Plot the raw data and overlay the fitted least-squares line
a <- ggplot(wine_df, aes(Color_intensity, Alcohol))
a + geom_point() + geom_abline(slope = slope, intercept = intercept)

Model Interpretation

  • This linear model is expressed as \(\widehat{\text{Alcohol}} = 12.03286 + 0.19133 \times \text{ColorIntensity}\).

  • For each one-unit increase in color intensity, the model predicts an increase of about 0.19 in alcohol content.

  • In this model, multiple \(R^2\) is \(0.2985\), which means that the model’s least-squares line accounts for approximately \(30\%\) of the variation in the alcohol content.

  • The p-values for the Color Intensity coefficient and the Y-intercept are both near zero. Estimates this far from zero would be very unlikely if the true values were zero, so we reject the null hypothesis that either term is zero.
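The quantities in the bullets above (a prediction from the fitted line, \(R^2\), and the uncertainty in the coefficients) can be reproduced by hand. The following is a minimal sketch on simulated stand-in data: `sim_df` and `sim_model` are illustrative names, and the generated values are an assumption, not the wine measurements.

```r
# Simulated stand-in data (an assumption for illustration, not the wine data)
set.seed(605)
sim_df <- data.frame(Color_intensity = runif(100, min = 1, max = 13))
sim_df$Alcohol <- 12 + 0.2 * sim_df$Color_intensity + rnorm(100, sd = 0.7)

sim_model <- lm(Alcohol ~ Color_intensity, data = sim_df)

# Point prediction by hand at Color_intensity = 5: intercept + slope * x
b <- coef(sim_model)
by_hand <- unname(b[1] + b[2] * 5)

# The same prediction via predict()
by_predict <- unname(predict(sim_model, newdata = data.frame(Color_intensity = 5)))

# R-squared by hand: 1 - SS_residual / SS_total
ss_res <- sum(residuals(sim_model)^2)
ss_tot <- sum((sim_df$Alcohol - mean(sim_df$Alcohol))^2)
r2_by_hand <- 1 - ss_res / ss_tot

# 95% confidence intervals for the intercept and slope
ci <- confint(sim_model, level = 0.95)
```

With the objects from this post, `wine_model` and `wine_df` would take the place of `sim_model` and `sim_df`.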

Model Diagnostics

Let’s assess if this linear model is reliable.

Linearity: Do the variables have a linear relationship?

At first glance, the scatter plot of the variables appears to show a linear relationship, but a closer look suggests that the relationship may be slightly curvilinear.
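One way to probe a suspected curve more formally is to add a squared term and compare the two nested models with an F test. This is a sketch of that idea on simulated stand-in data with mild curvature built in; the data and the names `linear_fit` and `quadratic_fit` are assumptions for illustration, not part of the original analysis.

```r
# Simulated stand-in data with mild curvature (an assumption, not the wine data)
set.seed(605)
x <- runif(150, min = 1, max = 13)
y <- 11.5 + 0.45 * x - 0.02 * x^2 + rnorm(150, sd = 0.5)
sim_df <- data.frame(Color_intensity = x, Alcohol = y)

linear_fit    <- lm(Alcohol ~ Color_intensity, data = sim_df)
quadratic_fit <- lm(Alcohol ~ Color_intensity + I(Color_intensity^2), data = sim_df)

# Nested-model F test: a small p-value favors keeping the squared term
comparison <- anova(linear_fit, quadratic_fit)
p_value <- comparison[["Pr(>F)"]][2]

# Adjusted R-squared for each model, for a rough side-by-side comparison
adj_r2 <- c(linear    = summary(linear_fit)$adj.r.squared,
            quadratic = summary(quadratic_fit)$adj.r.squared)
```

Applied to the wine data, the same two `lm()` calls with `data = wine_df` would indicate whether the squared term earns its place.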

Nearly normal residuals: Are the model’s residuals distributed normally?

Yes, per the histogram and Q-Q plot below, the residuals appear to be approximately normally distributed.

hist(wine_model$residuals)

qqnorm(wine_model$residuals)
qqline(wine_model$residuals)
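A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test from base R's `stats` package to the residuals of a model fitted to simulated stand-in data (an assumption for illustration, not the wine data).

```r
# Simulated stand-in data (an assumption, not the wine data)
set.seed(605)
sim_df <- data.frame(x = runif(178, min = 1, max = 13))
sim_df$y <- 12 + 0.2 * sim_df$x + rnorm(178, sd = 0.7)
sim_model <- lm(y ~ x, data = sim_df)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(sim_model))
sw$statistic  # W close to 1 is consistent with normality
sw$p.value    # a small p-value is evidence against normality
```

For the model in this post, the call would be `shapiro.test(residuals(wine_model))`.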

Homoscedasticity: Is there constant variability among the residuals?

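In the residual plot below, the residuals tend to be negative at low color intensity, positive in the middle of the range, and negative again at the high end. This pattern is consistent with the possible curvilinear relationship noted under Linearity, and it casts some doubt on whether the residuals vary constantly around zero.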

plot(wine_model$residuals ~ wine_df$Color_intensity)
abline(h = 0, lty = 3)
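Constant variance can also be checked numerically. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test in base R: it regresses the squared residuals on the model's predictors and compares \(n R^2\) from that auxiliary regression with a chi-squared distribution. The helper `bp_test` is a hypothetical name introduced here, and the data are simulated with variance that grows with the predictor (an assumption for illustration).

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan test in base R
bp_test <- function(model) {
  sq_resid <- residuals(model)^2
  # Predictors from the original fit, with the intercept column dropped
  predictors <- model.matrix(model)[, -1, drop = FALSE]
  aux <- lm(sq_resid ~ predictors)          # auxiliary regression
  n <- length(sq_resid)
  statistic <- n * summary(aux)$r.squared   # LM statistic
  df <- ncol(predictors)
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  c(statistic = statistic, df = df, p_value = p_value)
}

# Simulated stand-in data whose noise grows with x (an assumption)
set.seed(605)
x <- runif(178, min = 1, max = 13)
y <- 12 + 0.2 * x + rnorm(178, sd = 0.1 * x)
result <- bp_test(lm(y ~ x))
```

A small `p_value` is evidence against constant variance. With the objects from this post, the call would be `bp_test(wine_model)`.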

Independent observations: Are the data from a random sample and not from a time series?

We do not have information on how the samples were collected, so this condition cannot be verified from the data alone.
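If the row order of a data set carries meaning, one rough check is the Durbin-Watson statistic, which measures first-order correlation between consecutive residuals. The sketch below computes it by hand; `durbin_watson` is a hypothetical helper name and the data are simulated (an assumption for illustration). It does not replace knowledge of the sampling design.

```r
# Hypothetical helper: Durbin-Watson statistic from a fitted model.
# Values near 2 suggest little first-order autocorrelation in row order;
# values toward 0 or 4 suggest positive or negative autocorrelation.
durbin_watson <- function(model) {
  e <- residuals(model)
  sum(diff(e)^2) / sum(e^2)
}

# Simulated stand-in data with independent errors (an assumption)
set.seed(605)
x <- runif(178, min = 1, max = 13)
y <- 12 + 0.2 * x + rnorm(178, sd = 0.7)
dw <- durbin_watson(lm(y ~ x))
```

For the model in this post, the call would be `durbin_watson(wine_model)`. The result depends on the order of the rows in the file, so it should be read with that ordering in mind.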