Source: Analytics Edge Unit 1

Bordeaux is a region in France popular for producing wine. While this wine has been produced in much the same way for hundreds of years, there are differences in price and quality from year to year that are sometimes very significant. Bordeaux wines are widely believed to taste better when they are older, so there’s an incentive to store young wines until they are mature. The main problem is that it is hard to determine the quality of the wine when it is so young just by tasting it,since the taste will change so significantly by the time it will actually be consumed. This is why wine tasters and experts are helpful. They taste the wines and then predict which ones will be the best wines later. The question we’ll address in this lecture– can analytics model this process better and make stronger predictions?

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics edge/Unit 2")
wine <- read.csv("wine.csv")
str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...

Build the first linear regression model

model1 <- lm(Price ~ AGST, data = wine)
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335
model1$residuals
##           1           2           3           4           5           6 
##  0.04204258  0.82983774  0.21169394  0.15609432 -0.23119140  0.38991701 
##           7           8           9          10          11          12 
## -0.48959140  0.90318115  0.45372410  0.14887461 -0.23882157 -0.08974238 
##          13          14          15          16          17          18 
##  0.66185660 -0.05211511 -0.62726647 -0.74714947  0.42113502 -0.03727441 
##          19          20          21          22          23          24 
##  0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381 
##          25 
##  0.55866518
SSE <- sum(model1$residuals^2)
SSE
## [1] 5.734875

2nd linear regression model

model2 <- lm(Price ~ AGST + HarvestRain, data = wine)
summary(model2)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188 0.247585    
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06
model2$residuals
##            1            2            3            4            5 
##  0.114049616  0.523788530  0.147680820 -0.032339673 -0.058527063 
##            6            7            8            9           10 
##  0.597221745  0.153788612  0.424676073  0.005640893  0.152563009 
##           11           12           13           14           15 
## -0.454426446  0.414425096  0.376732242 -0.200740923  0.018215806 
##           16           17           18           19           20 
## -0.309662756  0.154053312 -0.195997112  0.100432413 -0.883211577 
##           21           22           23           24           25 
## -0.485011835  0.061776460 -0.183631188 -0.531811642  0.090315589
SSE2 <- sum(model2$residuals^2)
SSE2
## [1] 2.970373

Note that in practice, we don’t care about the SSE on the training set.

3rd model

model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data = wine)
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
##     FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
## AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
## HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
## WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
## Age          5.847e-04  7.900e-02   0.007 0.994172    
## FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06
SSE3 <- sum(model3$residuals^2)
SSE3
## [1] 1.732113

4th model

model4 <- lm(Price ~ HarvestRain + WinterRain, data = wine)
summary(model4)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492

Add more variables to the 4th model

model4 <- lm(Price ~ AGST + HarvestRain + WinterRain +  Age, data = wine)
summary(model4)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942 0.066311 .  
## AGST         0.6072093  0.0987022   6.152  5.2e-06 ***
## HarvestRain -0.0039715  0.0008538  -4.652 0.000154 ***
## WinterRain   0.0010755  0.0005073   2.120 0.046694 *  
## Age          0.0239308  0.0080969   2.956 0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 2.036e-07

5th model

model5 <- lm(Price ~ HarvestRain + WinterRain, data = wine)
summary(model5)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492

If we compare 5th model to 3rd model, we will see that after removing FranceProp, Age becomes a more significant variable in the new model, while before it was not at all.

coorelation and multicollinearity

cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine)
##                    Year      Price   WinterRain        AGST HarvestRain
## Year         1.00000000 -0.4477679  0.016970024 -0.24691585  0.02800907
## Price       -0.44776786  1.0000000  0.136650547  0.65956286 -0.56332190
## WinterRain   0.01697002  0.1366505  1.000000000 -0.32109061 -0.27544085
## AGST        -0.24691585  0.6595629 -0.321090611  1.00000000 -0.06449593
## HarvestRain  0.02800907 -0.5633219 -0.275440854 -0.06449593  1.00000000
## Age         -1.00000000  0.4477679 -0.016970024  0.24691585 -0.02800907
## FrancePop    0.99448510 -0.4668616 -0.001621627 -0.25916227  0.04126439
##                     Age    FrancePop
## Year        -1.00000000  0.994485097
## Price        0.44776786 -0.466861641
## WinterRain  -0.01697002 -0.001621627
## AGST         0.24691585 -0.259162274
## HarvestRain -0.02800907  0.041264394
## Age          1.00000000 -0.994485097
## FrancePop   -0.99448510  1.000000000

Remove both age and france population from the model

model5 <- lm(Price ~ AGST + HarvestRain + WinterRain, data = wine)
summary(model5)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112 0.046831 *  
## AGST         0.6810242  0.1117011   6.097 4.75e-06 ***
## HarvestRain -0.0039481  0.0009987  -3.953 0.000726 ***
## WinterRain   0.0011765  0.0005920   1.987 0.060097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 1.359e-06

R squared dropped. Model4 remains to be a better model.

Evaluate the model on test data

wineTest <- read.csv("wine_test.csv")
str(wineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110
predictTest <- predict(model4, newdata = wineTest)
predictTest
##        1        2 
## 6.768925 6.684910
SSE <- sum((wineTest$Price - predictTest)^2)
SST <- sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST
## [1] 0.7944278