sum - average summer temperature
har - harvest rainfall
sep - September temperature
win - winter rainfall
age - age of the vintage in 1992
library(knitr)
wine <- read.csv("http://jamessuleiman.com/mba676/assets/units/unit05/bordeaux.csv", header = TRUE, stringsAsFactors = FALSE)
library(ggvis)
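# Scatter plot of price against harvest rainfall
wine %>%
  ggvis(~har, ~price) %>%
  layer_points()
# Same plot with a fitted linear regression line and its standard-error band
wine %>%
  ggvis(~har, ~price) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "red")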
## Guessing formula = price ~ har
Question 4 Run a regression model to determine if harvest rain alone is a good predictor for price.
Harvest rain is a statistically significant predictor (p = 0.0195, below 0.05), but with an R-squared of 0.1996 it explains only about 20% of the variance in price, leaving roughly 80% unexplained.
model.1 <- lm(price ~ har, wine)
summary(model.1)
##
## Call:
## lm(formula = price ~ har, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.223 -12.035 -4.105 6.910 57.497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.37198 8.29268 5.713 5.98e-06 ***
## har -0.12814 0.05132 -2.497 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.12 on 25 degrees of freedom
## Multiple R-squared: 0.1996, Adjusted R-squared: 0.1676
## F-statistic: 6.235 on 1 and 25 DF, p-value: 0.01947
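The figures in this summary can be cross-checked by hand. With a single predictor, the t value is the estimate divided by its standard error, and the F-statistic is the square of that t value. A quick check, using the rounded numbers printed above (so the last digit may differ slightly):

```latex
t_{\text{har}} = \frac{-0.12814}{0.05132} \approx -2.497,
\qquad
F = t_{\text{har}}^{2} \approx (-2.497)^{2} \approx 6.235
```

This is also why the p-value for har (0.0195) matches the p-value of the overall F-test (0.01947).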
Question 5 Use the explanatory variables for harvest rain, summer temperature, September temperature, winter rainfall, and age to build a regression model. Are all variables significant (α = 0.05) in explaining price?
# All variables except September temperature (sep, p = 0.261) are significant at the 0.05 level.
model.2 <- lm(price ~ har + sum + sep + win + age, wine)
summary(model.2)
##
## Call:
## lm(formula = price ~ har + sum + sep + win + age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0454 -7.9644 -0.7287 3.8198 23.8098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -326.11712 68.80879 -4.739 0.000111 ***
## har -0.07990 0.03758 -2.126 0.045513 *
## sum 16.30232 4.66355 3.496 0.002154 **
## sep 2.59129 2.24518 1.154 0.261403
## win 0.05183 0.02011 2.577 0.017568 *
## age 0.88488 0.29455 3.004 0.006757 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.6 on 21 degrees of freedom
## Multiple R-squared: 0.7526, Adjusted R-squared: 0.6937
## F-statistic: 12.78 on 5 and 21 DF, p-value: 8.814e-06
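Because R-squared never decreases when predictors are added, adjusted R-squared is the fairer figure for comparing this model with the single-predictor one. As a rough check, assuming n = 27 observations (21 residual degrees of freedom plus 6 estimated coefficients) and p predictors, the printed values follow from the standard formula:

```latex
\bar{R}^{2} = 1 - (1 - R^{2})\,\frac{n - 1}{n - p - 1}

\text{price} \sim \text{har}: \quad
\bar{R}^{2} = 1 - (1 - 0.1996)\,\frac{26}{25} \approx 0.1676

\text{price} \sim \text{har} + \text{sum} + \text{sep} + \text{win} + \text{age}: \quad
\bar{R}^{2} = 1 - (1 - 0.7526)\,\frac{26}{21} \approx 0.6937
```

Even after the penalty for four extra predictors, the larger model explains far more of the variance in price.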
Question 6 Type anova(model.1, model.2), where model.1 and model.2 are the variable names for the models you created in question 4 and question 5. anova() is used to compare linear models fitted to the same observations, where one model adds explanatory variables to the other (nested models). What do you think your results are telling you?
# The two models are significantly different (F = 11.735, p = 3.654e-05):
# adding sum, sep, win, and age reduces the residual sum of squares from 9138.8 to 2824.8.
# The best linear fit is found by minimizing the Residual Sum of Squares (RSS).
# Your intro stats course may have called this the Sum of Squares due to Error (SSE).
anova(model.1, model.2)
## Analysis of Variance Table
##
## Model 1: price ~ har
## Model 2: price ~ har + sum + sep + win + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 9138.8
## 2 21 2824.8 4 6314 11.735 3.654e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
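The F value in this table can be reproduced from the two RSS figures and their residual degrees of freedom. A worked sketch, using the rounded values printed above:

```latex
F = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2) / (df_1 - df_2)}{\mathrm{RSS}_2 / df_2}
  = \frac{(9138.8 - 2824.8) / (25 - 21)}{2824.8 / 21}
  = \frac{6314 / 4}{134.51}
  \approx 11.735
```

The numerator is the average reduction in RSS per added variable, and the denominator is the residual variance of the larger model. Its square root, about 11.6, is the residual standard error reported for model.2.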
wine.2 <- read.csv(url("http://jamessuleiman.com/mba676/assets/units/unit05/bordeauxp.csv"))
wine.2
## yearnew sum har sep win age
## 1 1981 17.0 111 18.0 535 11
## 2 1982 17.4 162 18.5 712 10
## 3 1983 17.4 119 17.9 845 9
## 4 1984 16.5 119 16.0 591 8
## 5 1985 16.8 38 18.9 744 7
## 6 1986 16.3 171 17.5 563 6
## 7 1987 17.0 115 18.9 452 5
## 8 1988 17.1 59 16.8 808 4
## 9 1989 18.6 82 18.4 443 3
## 10 1990 18.7 80 19.3 468 2
## 11 1991 17.7 183 20.4 570 1
Question 8 Type predict(model.2, wine.2, interval="confidence") where model.2 is whatever you called your model in question 5 and wine.2 is the variable that you read the CSV file into in question 7. What values did predict() return?
# Predicted prices for the new data set (wine.2), using the model fitted to the original data set
# (fit is the predicted price; lwr and upr are the lower and upper bounds of the 95% confidence interval for the mean price)
predict(model.2, wine.2, interval="confidence")
## fit lwr upr
## 1 26.260262 14.695065 37.82546
## 2 38.291343 22.251405 54.33128
## 3 46.180781 26.253064 66.10850
## 4 12.535262 -1.309492 26.38002
## 5 38.457597 23.478669 53.43653
## 6 5.786086 -6.803287 18.37546
## 7 18.661583 2.932962 34.39020
## 8 36.891315 14.768376 59.01425
## 9 43.849990 18.217468 69.48251
## 10 48.383075 22.594484 74.17167
## 11 31.103736 8.947153 53.26032
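Each fit value is the model.2 regression equation evaluated on one row of wine.2. As an illustration, plugging the 1981 row (sum = 17.0, har = 111, sep = 18.0, win = 535, age = 11) into the rounded coefficients from the question 5 summary gives approximately the first fitted value:

```latex
\widehat{\text{price}}_{1981}
  = -326.11712 - 0.07990(111) + 16.30232(17.0) + 2.59129(18.0) + 0.05183(535) + 0.88488(11)

  = -326.11712 - 8.86890 + 277.13944 + 46.64322 + 27.72905 + 9.73368
  \approx 26.26
```

The small difference from the printed 26.260262 comes from rounding in the displayed coefficients.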