Bordeaux is a region in France popular for producing wine. While this wine has been produced in much the same way for hundreds of years, there are differences in price and quality from year to year that are sometimes very significant. Bordeaux wines are widely believed to taste better when they are older, so there’s an incentive to store young wines until they are mature. The main problem is that it is hard to determine the quality of the wine when it is so young just by tasting it, since the taste will change so significantly by the time it will actually be consumed. This is why wine tasters and experts are helpful. They taste the wines and then predict which ones will be the best wines later.

On March 4, 1990, the New York Times announced that Princeton economics professor Orley Ashenfelter can predict the quality of Bordeaux wine without tasting a single drop. Ashenfelter used a method called linear regression. The methods predicts an outcome variable or dependent variable. As independent variables, he used age of the wine– so the older wines are more expensive– and weather-related information, specifically the average growing season temperature, the harvest rain, and winter rain.

Lets discuss the model…

Wine <- read.csv("wine.csv")
str(Wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
summary(Wine)
##       Year          Price         WinterRain         AGST      
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98  
##  1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20  
##  Median :1966   Median :7.121   Median :600.0   Median :16.53  
##  Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51  
##  3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07  
##  Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65  
##   HarvestRain         Age         FrancePop    
##  Min.   : 38.0   Min.   : 5.0   Min.   :43184  
##  1st Qu.: 89.0   1st Qu.:11.0   1st Qu.:46584  
##  Median :130.0   Median :17.0   Median :50255  
##  Mean   :148.6   Mean   :17.2   Mean   :49694  
##  3rd Qu.:187.0   3rd Qu.:23.0   3rd Qu.:52894  
##  Max.   :292.0   Max.   :31.0   Max.   :54602

Let’s now create a one-variable linear regression equation using AGST to predict Price.

model1 <- lm(Price ~ AGST, data=Wine)
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = Wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335

NOTE: Multiple R-squared will always increase if you add more independent variables. But Adjusted R-squared will decrease if you add an independent variable that doesn’t help the model.

Now, lets compute SSE. Our residuals, or error terms, are stored in the vector model1$residuals.

model1$residuals
##           1           2           3           4           5           6 
##  0.04204258  0.82983774  0.21169394  0.15609432 -0.23119140  0.38991701 
##           7           8           9          10          11          12 
## -0.48959140  0.90318115  0.45372410  0.14887461 -0.23882157 -0.08974238 
##          13          14          15          16          17          18 
##  0.66185660 -0.05211511 -0.62726647 -0.74714947  0.42113502 -0.03727441 
##          19          20          21          22          23          24 
##  0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381 
##          25 
##  0.55866518
SSE <- sum(model1$residuals^2)
SSE
## [1] 5.734875

Now, lets add another variable Harvest Rain

model2 <- lm(Price ~ AGST + HarvestRain, data=Wine)
summary((model2))
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = Wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188 0.247585    
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06

We can observe significant increase in Rsquare and Adjusted R square.. Also the estimate for Harvest rai is negetive. This indicates that the new model is probably better than the previous model. Lets calculate SSE

SSE <- sum(model2$residuals^2)
SSE
## [1] 2.970373

This SSE is better than SSE of model 1.

Now, Lets build a 3rd model with all of our indepnedent variables.

model3 <- lm(Price ~ AGST + WinterRain + HarvestRain + Age + FrancePop, data=Wine)
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain + Age + 
##     FrancePop, data = Wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
## AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
## WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
## HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
## Age          5.847e-04  7.900e-02   0.007 0.994172    
## FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06
SSE<- sum(model3$residuals^2)
SSE
## [1] 1.732113

As we can see, age and france both are insignificant for our model. Lets remove them and create model4

model4 <- lm(Price ~ AGST + WinterRain + HarvestRain, data=Wine)
summary(model4)
## 
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain, data = Wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112 0.046831 *  
## AGST         0.6810242  0.1117011   6.097 4.75e-06 ***
## WinterRain   0.0011765  0.0005920   1.987 0.060097 .  
## HarvestRain -0.0039481  0.0009987  -3.953 0.000726 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 1.359e-06
SSE <- sum(model4$residuals^2)
SSE
## [1] 2.500209

Our R square has dropped significantly, also the Adjusted R square. There is increase in SSE, which is not good sign. Also we know as per our knowledge that Age of wine has a very good impact on its price. Let’s add age to model4, by creating a new model

model5 <- lm(Price ~ AGST + WinterRain + HarvestRain + Age, data=Wine)
summary(model5)
## 
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain + Age, data = Wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942 0.066311 .  
## AGST         0.6072093  0.0987022   6.152  5.2e-06 ***
## WinterRain   0.0010755  0.0005073   2.120 0.046694 *  
## HarvestRain -0.0039715  0.0008538  -4.652 0.000154 ***
## Age          0.0239308  0.0080969   2.956 0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 2.036e-07
SSE <- sum(model5$residuals^2)
SSE
## [1] 1.740162

This looks great!! Our R square has incresed, SSE has decreased, We can increased significance of age in model.

This also shows multicollinearity between age and population of france.

We can check corelation between variables using “cor” function.

cor(Wine)
##                    Year      Price   WinterRain        AGST HarvestRain
## Year         1.00000000 -0.4477679  0.016970024 -0.24691585  0.02800907
## Price       -0.44776786  1.0000000  0.136650547  0.65956286 -0.56332190
## WinterRain   0.01697002  0.1366505  1.000000000 -0.32109061 -0.27544085
## AGST        -0.24691585  0.6595629 -0.321090611  1.00000000 -0.06449593
## HarvestRain  0.02800907 -0.5633219 -0.275440854 -0.06449593  1.00000000
## Age         -1.00000000  0.4477679 -0.016970024  0.24691585 -0.02800907
## FrancePop    0.99448510 -0.4668616 -0.001621627 -0.25916227  0.04126439
##                     Age    FrancePop
## Year        -1.00000000  0.994485097
## Price        0.44776786 -0.466861641
## WinterRain  -0.01697002 -0.001621627
## AGST         0.24691585 -0.259162274
## HarvestRain -0.02800907  0.041264394
## Age          1.00000000 -0.994485097
## FrancePop   -0.99448510  1.000000000

So we know our model does a good job predicting the data it’s seen (i.e. training data). Let’s test our model for predicting prices for the data it has not seen. The new data is test data. The accuracy of the model on the test data is often referred to as out-of-sample accuracy

WineTest <- read.csv("wine_test.csv")
str(WineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110

To make predictions for these two test points, we’ll use the predict function.

PredictTest <- predict(model5, newdata=WineTest)
PredictTest
##        1        2 
## 6.768925 6.684910

As we can see, the predicted values are close to actual value. So, our prediction model is actually good.

Lets quantify it by preducting R-squared for PredictTest set.

R-square = 1 - SSE/SST

SSE <- sum((WineTest$Price - PredictTest)^2)
SSE
## [1] 0.06926281
SST <- sum((WineTest$Price - mean(Wine$Price))^2)
SST
## [1] 0.3369269
R_Squared <- 1 - SSE/SST
R_Squared
## [1] 0.7944278

This is pretty good out of sample accuracy of our model!! But, in reality we need more data to be conclusive.
R Square can be negetive, but cant be greater than 1 as SSE/SST>=0 always