Bordeaux is a region in France popular for producing wine. While this wine has been produced in much the same way for hundreds of years, there are differences in price and quality from year to year that are sometimes very significant. Bordeaux wines are widely believed to taste better when they are older, so there’s an incentive to store young wines until they are mature. The main problem is that it is hard to determine the quality of the wine when it is so young just by tasting it, since the taste will change so significantly by the time it will actually be consumed. This is why wine tasters and experts are helpful. They taste the wines and then predict which ones will be the best wines later.
On March 4, 1990, the New York Times announced that Princeton economics professor Orley Ashenfelter can predict the quality of Bordeaux wine without tasting a single drop. Ashenfelter used a method called linear regression. The methods predicts an outcome variable or dependent variable. As independent variables, he used age of the wine– so the older wines are more expensive– and weather-related information, specifically the average growing season temperature, the harvest rain, and winter rain.
Lets discuss the model…
Wine <- read.csv("wine.csv")
str(Wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
summary(Wine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
Let’s now create a one-variable linear regression equation using AGST to predict Price.
model1 <- lm(Price ~ AGST, data=Wine)
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = Wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
NOTE: Multiple R-squared will always increase if you add more independent variables. But Adjusted R-squared will decrease if you add an independent variable that doesn’t help the model.
Now, lets compute SSE. Our residuals, or error terms, are stored in the vector model1$residuals.
model1$residuals
## 1 2 3 4 5 6
## 0.04204258 0.82983774 0.21169394 0.15609432 -0.23119140 0.38991701
## 7 8 9 10 11 12
## -0.48959140 0.90318115 0.45372410 0.14887461 -0.23882157 -0.08974238
## 13 14 15 16 17 18
## 0.66185660 -0.05211511 -0.62726647 -0.74714947 0.42113502 -0.03727441
## 19 20 21 22 23 24
## 0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381
## 25
## 0.55866518
SSE <- sum(model1$residuals^2)
SSE
## [1] 5.734875
Now, lets add another variable Harvest Rain
model2 <- lm(Price ~ AGST + HarvestRain, data=Wine)
summary((model2))
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = Wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88321 -0.19600 0.06178 0.15379 0.59722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 1.94e-05 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06
We can observe significant increase in Rsquare and Adjusted R square.. Also the estimate for Harvest rai is negetive. This indicates that the new model is probably better than the previous model. Lets calculate SSE
SSE <- sum(model2$residuals^2)
SSE
## [1] 2.970373
This SSE is better than SSE of model 1.
Now, Lets build a 3rd model with all of our indepnedent variables.
model3 <- lm(Price ~ AGST + WinterRain + HarvestRain + Age + FrancePop, data=Wine)
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain + Age +
## FrancePop, data = Wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48179 -0.24662 -0.00726 0.22012 0.51987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.504e-01 1.019e+01 -0.044 0.965202
## AGST 6.012e-01 1.030e-01 5.836 1.27e-05 ***
## WinterRain 1.043e-03 5.310e-04 1.963 0.064416 .
## HarvestRain -3.958e-03 8.751e-04 -4.523 0.000233 ***
## Age 5.847e-04 7.900e-02 0.007 0.994172
## FrancePop -4.953e-05 1.667e-04 -0.297 0.769578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
## F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
SSE<- sum(model3$residuals^2)
SSE
## [1] 1.732113
As we can see, age and france both are insignificant for our model. Lets remove them and create model4
model4 <- lm(Price ~ AGST + WinterRain + HarvestRain, data=Wine)
summary(model4)
##
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain, data = Wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67472 -0.12958 0.01973 0.20751 0.63846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3016263 2.0366743 -2.112 0.046831 *
## AGST 0.6810242 0.1117011 6.097 4.75e-06 ***
## WinterRain 0.0011765 0.0005920 1.987 0.060097 .
## HarvestRain -0.0039481 0.0009987 -3.953 0.000726 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared: 0.7537, Adjusted R-squared: 0.7185
## F-statistic: 21.42 on 3 and 21 DF, p-value: 1.359e-06
SSE <- sum(model4$residuals^2)
SSE
## [1] 2.500209
Our R square has dropped significantly, also the Adjusted R square. There is increase in SSE, which is not good sign. Also we know as per our knowledge that Age of wine has a very good impact on its price. Let’s add age to model4, by creating a new model
model5 <- lm(Price ~ AGST + WinterRain + HarvestRain + Age, data=Wine)
summary(model5)
##
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain + Age, data = Wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45470 -0.24273 0.00752 0.19773 0.53637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
## AGST 0.6072093 0.0987022 6.152 5.2e-06 ***
## WinterRain 0.0010755 0.0005073 2.120 0.046694 *
## HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
## Age 0.0239308 0.0080969 2.956 0.007819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
## F-statistic: 24.17 on 4 and 20 DF, p-value: 2.036e-07
SSE <- sum(model5$residuals^2)
SSE
## [1] 1.740162
This looks great!! Our R square has incresed, SSE has decreased, We can increased significance of age in model.
This also shows multicollinearity between age and population of france.
We can check corelation between variables using “cor” function.
cor(Wine)
## Year Price WinterRain AGST HarvestRain
## Year 1.00000000 -0.4477679 0.016970024 -0.24691585 0.02800907
## Price -0.44776786 1.0000000 0.136650547 0.65956286 -0.56332190
## WinterRain 0.01697002 0.1366505 1.000000000 -0.32109061 -0.27544085
## AGST -0.24691585 0.6595629 -0.321090611 1.00000000 -0.06449593
## HarvestRain 0.02800907 -0.5633219 -0.275440854 -0.06449593 1.00000000
## Age -1.00000000 0.4477679 -0.016970024 0.24691585 -0.02800907
## FrancePop 0.99448510 -0.4668616 -0.001621627 -0.25916227 0.04126439
## Age FrancePop
## Year -1.00000000 0.994485097
## Price 0.44776786 -0.466861641
## WinterRain -0.01697002 -0.001621627
## AGST 0.24691585 -0.259162274
## HarvestRain -0.02800907 0.041264394
## Age 1.00000000 -0.994485097
## FrancePop -0.99448510 1.000000000
So we know our model does a good job predicting the data it’s seen (i.e. training data). Let’s test our model for predicting prices for the data it has not seen. The new data is test data. The accuracy of the model on the test data is often referred to as out-of-sample accuracy
WineTest <- read.csv("wine_test.csv")
str(WineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110
To make predictions for these two test points, we’ll use the predict function.
PredictTest <- predict(model5, newdata=WineTest)
PredictTest
## 1 2
## 6.768925 6.684910
As we can see, the predicted values are close to actual value. So, our prediction model is actually good.
Lets quantify it by preducting R-squared for PredictTest set.
R-square = 1 - SSE/SST
SSE <- sum((WineTest$Price - PredictTest)^2)
SSE
## [1] 0.06926281
SST <- sum((WineTest$Price - mean(Wine$Price))^2)
SST
## [1] 0.3369269
R_Squared <- 1 - SSE/SST
R_Squared
## [1] 0.7944278
This is pretty good out of sample accuracy of our model!! But, in reality we need more data to be conclusive.
R Square can be negetive, but cant be greater than 1 as SSE/SST>=0 always