sum - average summer temperature
har - harvest rainfall
sep - September temperature
win - winter rainfall
age - age of the vintage in 1992

library(knitr)
wine <- read.csv("http://jamessuleiman.com/mba676/assets/units/unit05/bordeaux.csv", header = TRUE, stringsAsFactors = FALSE)
library(ggvis)
wine %>%
  ggvis(~har, ~price) %>%
  layer_points()
wine %>%
  ggvis(~har, ~price) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "red")
## Guessing formula = price ~ har
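
For comparison, the same scatterplot with a fitted least-squares line and standard-error band can be drawn with ggplot2; this is just an optional alternative sketch (ggplot2 is not otherwise used in this write-up).

library(ggplot2)
# Rough ggplot2 equivalent of the ggvis plot above:
# points plus an lm fit with its standard-error band
ggplot(wine, aes(x = har, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, colour = "red")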

Question 4 Run a regression model to determine whether harvest rain alone is a good predictor of price.

Harvest rain is a significant predictor (p = 0.0195 < 0.05), but with an R-squared of 0.1996 the model still leaves roughly 80% of the variance in price unexplained.

model.1 <- lm(price ~ har, wine)
summary(model.1)
## 
## Call:
## lm(formula = price ~ har, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.223 -12.035  -4.105   6.910  57.497 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.37198    8.29268   5.713 5.98e-06 ***
## har         -0.12814    0.05132  -2.497   0.0195 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.12 on 25 degrees of freedom
## Multiple R-squared:  0.1996, Adjusted R-squared:  0.1676 
## F-statistic: 6.235 on 1 and 25 DF,  p-value: 0.01947
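
The headline numbers quoted above can also be pulled straight out of the fitted model object rather than read off the printed summary; a minimal sketch:

s1 <- summary(model.1)
s1$r.squared                         # multiple R-squared, ~0.1996
s1$coefficients["har", "Pr(>|t|)"]   # p-value for the harvest-rain slope, ~0.0195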

Question 5 Use the explanatory variables for harvest rain, summer temperature, September rainfall, winter rainfall, and age to build a regression model. Are all variables significant (α = 0.05) in explaining price?

# All variables except September rainfall appear to be significant.

model.2 <- lm(price ~ har + sum + sep + win + age, wine)
summary(model.2)
## 
## Call:
## lm(formula = price ~ har + sum + sep + win + age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0454  -7.9644  -0.7287   3.8198  23.8098 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -326.11712   68.80879  -4.739 0.000111 ***
## har           -0.07990    0.03758  -2.126 0.045513 *  
## sum           16.30232    4.66355   3.496 0.002154 ** 
## sep            2.59129    2.24518   1.154 0.261403    
## win            0.05183    0.02011   2.577 0.017568 *  
## age            0.88488    0.29455   3.004 0.006757 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.6 on 21 degrees of freedom
## Multiple R-squared:  0.7526, Adjusted R-squared:  0.6937 
## F-statistic: 12.78 on 5 and 21 DF,  p-value: 8.814e-06
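
As a quick check on that claim, the coefficient table can be filtered programmatically; a minimal sketch using α = 0.05:

coefs <- summary(model.2)$coefficients
# Rows whose p-value clears alpha = 0.05 (everything except sep)
coefs[coefs[, "Pr(>|t|)"] <= 0.05, ]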

Question 6 Type anova(model.1, model.2), where model.1 and model.2 are the variable names for the models you created in question 4 and question 5. anova() is used to compare linear models fit to the same observations where one model has more (or different) explanatory variables. What do you think your results are telling you?

# That the two models are significantly different.
# The best linear fit is found by minimizing the Residual Sum of Squares (RSS).
# Your intro stats course may have called this the Sum of Squares due to Error (SSE).

anova(model.1, model.2)
## Analysis of Variance Table
## 
## Model 1: price ~ har
## Model 2: price ~ har + sum + sep + win + age
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     25 9138.8                                  
## 2     21 2824.8  4      6314 11.735 3.654e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
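
The RSS column in the anova table can be reproduced directly from each model's residuals, which makes the comparison concrete; a quick sketch:

# Residual sum of squares for each model, matching the RSS column above
sum(residuals(model.1)^2)   # ~9138.8
sum(residuals(model.2)^2)   # ~2824.8
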
wine.2 <- read.csv(url("http://jamessuleiman.com/mba676/assets/units/unit05/bordeauxp.csv"))
wine.2
##    yearnew  sum har  sep win age
## 1     1981 17.0 111 18.0 535  11
## 2     1982 17.4 162 18.5 712  10
## 3     1983 17.4 119 17.9 845   9
## 4     1984 16.5 119 16.0 591   8
## 5     1985 16.8  38 18.9 744   7
## 6     1986 16.3 171 17.5 563   6
## 7     1987 17.0 115 18.9 452   5
## 8     1988 17.1  59 16.8 808   4
## 9     1989 18.6  82 18.4 443   3
## 10    1990 18.7  80 19.3 468   2
## 11    1991 17.7 183 20.4 570   1

Question 8 Type predict(model.2, wine.2, interval="confidence"), where model.2 is whatever you called your model in question 5 and wine.2 is the variable that you read the csv file into in question 7. What values did predict() return?

# predicted values for price from the new data set using the model built on the old data set
# (lwr and upr are the 95% confidence lower and upper bounds)
predict(model.2, wine.2, interval="confidence")
##          fit       lwr      upr
## 1  26.260262 14.695065 37.82546
## 2  38.291343 22.251405 54.33128
## 3  46.180781 26.253064 66.10850
## 4  12.535262 -1.309492 26.38002
## 5  38.457597 23.478669 53.43653
## 6   5.786086 -6.803287 18.37546
## 7  18.661583  2.932962 34.39020
## 8  36.891315 14.768376 59.01425
## 9  43.849990 18.217468 69.48251
## 10 48.383075 22.594484 74.17167
## 11 31.103736  8.947153 53.26032
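
Note that interval="confidence" bounds the estimated mean price for vintages with those growing conditions; predict() also accepts interval="prediction", which gives wider bounds for an individual new vintage. A minimal sketch:

# Prediction intervals for individual new vintages are wider than the
# confidence intervals for the mean shown above
predict(model.2, wine.2, interval = "prediction")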