The Statistical Sommelier

Bordeaux Wine

  • Large differences in price and quality between years, even though the wine is produced in a similar way each year

  • Meant to be aged, so it is hard to tell whether a wine will be good when it first comes to market

  • Can analytics be used to come up with a different system for judging wine?

Building a Model

  • March 1990 - Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine

  • Ashenfelter used a method called linear regression
    • Predicts an outcome variable, or dependent variable
    • Predicts using a set of independent variables
  • Dependent variable: typical price in 1990-1991 wine auctions (approximates quality)

  • Independent variables:
    • Age - older wines are more expensive
    • Weather
      • Average Growing Season Temperature
      • Harvest Rain
      • Winter Rain

The Data (1952 - 1978)

One-Variable Linear Regression

\[ y = 7.07 \]

\[ y = 0.5\,(\text{AGST}) - 1.25 \]
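
A minimal sketch of how these two candidate lines compare against the data, assuming the wine.csv file that is read in the R section later in this document:

# Read the dataset (the same wine.csv used in the R section below)
wine = read.csv("wine.csv")
# Scatter plot of price against average growing season temperature
plot(wine$AGST, wine$Price, xlab = "AGST", ylab = "Price")
# Flat line y = 7.07 (roughly the mean price)
abline(h = 7.07, lty = 2)
# Sloped candidate line y = 0.5 * AGST - 1.25
abline(a = -1.25, b = 0.5)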

The Regression Model

  • One-variable regression model \[ y^i = \beta_0 + \beta_1 x^i + \eta^i \]

\(y^i\) = dependent variable (wine price) for the ith observation

\(x^i\) = independent variable (temperature) for the ith observation

\(\eta^i\) = error term for the ith observation

\(\beta_0\) = intercept coefficient

\(\beta_1\) = regression coefficient for the independent variable

Selecting the Best Model

R²

  • Compares the best model to a “baseline” model
  • The baseline model does not use any variables
    • Predicts same outcome (price) regardless of the independent variable (temperature)

Interpreting R²

\[ R^2 = 1 - \frac{SSE}{SST} \]

  • R² captures the value added from using a model, where SSE is the model's sum of squared errors and SST is the sum of squared errors of the baseline model
    • R² = 0 means no improvement over the baseline
    • R² = 1 means a perfect predictive model
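
A minimal sketch of this calculation in R, assuming the wine data frame and the one-variable model fitted in the R section below:

# Fit the one-variable model
model = lm(Price ~ AGST, data = wine)
# SSE: sum of squared errors of the model
SSE = sum(model$residuals^2)
# SST: sum of squared errors of the baseline model, which always predicts the mean price
SST = sum((wine$Price - mean(wine$Price))^2)
# R^2 = 1 - SSE/SST; should match summary(model)$r.squared
1 - SSE/SST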

Multivariable Linear Regression

\[ y^i = \beta_0 + \beta_1 x^i_1 + \beta_2 x^i_2 + \cdots + \beta_k x^i_k + \eta^i \]

\(y^i\) = dependent variable (wine price) for the ith observation

\(x^i_j\) = jth independent variable for the ith observation

\(\eta^i\) = error term for the ith observation

\(\beta_0\) = intercept coefficient

\(\beta_j\) = regression coefficient for the jth independent variable

  • Best model coefficients selected to minimize SSE
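
The SSE-minimizing coefficients have a closed-form least-squares solution. A minimal sketch, assuming the wine data frame and the four-variable formula fitted in the R section below, that solves the normal equations directly and compares the result with lm():

# Design matrix: an intercept column plus the chosen independent variables
X = model.matrix(~ AGST + HarvestRain + WinterRain + Age, data = wine)
y = wine$Price
# Least-squares coefficients: solve (X'X) beta = X'y
beta = solve(t(X) %*% X, t(X) %*% y)
beta
# Should match the coefficients reported by lm() for the same formula
coef(lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine))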

Adding Variables

  • Adding more variables can improve the model
  • Diminishing returns as more variables are added

Selecting Variables

  • Not all available variables should be used
    • Each new variable requires more data
    • Causes overfitting: a high R² on the data used to build the model, but poor performance on unseen data (see the sketch below)
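
A minimal sketch of this effect, using a simulated pure-noise variable (an assumption for illustration; it is not part of the wine data):

# Add a column of random noise to the training data
set.seed(1)
wineNoise = wine
wineNoise$Noise = rnorm(nrow(wineNoise))
# In-sample R^2 without and with the noise variable
summary(lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wineNoise))$r.squared
summary(lm(Price ~ AGST + HarvestRain + WinterRain + Age + Noise, data = wineNoise))$r.squared
# The second R^2 can only be equal or higher, even though Noise carries no signal,
# so the extra coefficient tends to hurt performance on unseen data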

Understanding the Model and Coefficients

Correlation

A measure of the linear relationship between variables

  • +1 = perfect positive linear relationship
  • 0 = no linear relationship
  • -1 = perfect negative linear relationship

Examples of Correlation
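
A minimal sketch using simulated data (an assumption for illustration) that produces correlations near +1, 0, and -1 with cor():

set.seed(1)
x = rnorm(100)
# Strong positive linear relationship: correlation near +1
cor(x, 2 * x + rnorm(100, sd = 0.1))
# No linear relationship: correlation near 0
cor(x, rnorm(100))
# Strong negative linear relationship: correlation near -1
cor(x, -3 * x + rnorm(100, sd = 0.1))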

Predictive Ability

  • Our wine model has R² = 0.83

  • Tells us our accuracy on the data that we used to build the model

  • But how well does the model perform on new data?
    • Bordeaux wine buyers profit from being able to predict the quality of a wine years before it matures

Out-of-Sample R²

  • A higher model R² does not mean a higher test-set R²
  • Need more data to be conclusive
  • Out-of-sample R² can be negative!
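
A brief sketch of why a negative value is possible, using the same definition applied to the test set:

\[ R^2_{\text{test}} = 1 - \frac{SSE_{\text{test}}}{SST_{\text{test}}} \]

Here SSE_test sums the squared errors of the model's predictions on the test set, and SST_test sums the squared errors of the baseline prediction (the mean price from the training data) on the test set. If the model's test-set predictions are worse than simply predicting the training mean, then SSE_test > SST_test and the out-of-sample R² is negative.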

The Analytics Edge

  • A linear regression model with only a few variables can predict wine prices fairly well

  • In many cases, outperforms wine experts’ opinions

  • A quantitative approach to a qualitative problem

Linear Regression in R

Read in data

# Load in the dataset
wine = read.csv("wine.csv")
# Output the structure of the data frame
str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
# Output the summary as a table
library(knitr)  # provides kable()
z = summary(wine)
kable(z)
| Year | Price | WinterRain | AGST | HarvestRain | Age | FrancePop |
|------|-------|------------|------|-------------|-----|-----------|
| Min.   :1952 | Min.   :6.205 | Min.   :376.0 | Min.   :14.98 | Min.   : 38.0 | Min.   : 5.0 | Min.   :43184 |
| 1st Qu.:1960 | 1st Qu.:6.519 | 1st Qu.:536.0 | 1st Qu.:16.20 | 1st Qu.: 89.0 | 1st Qu.:11.0 | 1st Qu.:46584 |
| Median :1966 | Median :7.121 | Median :600.0 | Median :16.53 | Median :130.0 | Median :17.0 | Median :50255 |
| Mean   :1966 | Mean   :7.067 | Mean   :605.3 | Mean   :16.51 | Mean   :148.6 | Mean   :17.2 | Mean   :49694 |
| 3rd Qu.:1972 | 3rd Qu.:7.495 | 3rd Qu.:697.0 | 3rd Qu.:17.07 | 3rd Qu.:187.0 | 3rd Qu.:23.0 | 3rd Qu.:52894 |
| Max.   :1978 | Max.   :8.494 | Max.   :830.0 | Max.   :17.65 | Max.   :292.0 | Max.   :31.0 | Max.   :54602 |

Linear Regression

Linear Regression (one variable)

# Linear Regression model
model1 = lm(Price ~ AGST, data=wine)
# Output the summary
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335

Sum of Squared Errors

# Display the model residuals
z = model1$residuals
kable(z)
x
0.0420426
0.8298377
0.2116939
0.1560943
-0.2311914
0.3899170
-0.4895914
0.9031812
0.4537241
0.1488746
-0.2388216
-0.0897424
0.6618566
-0.0521151
-0.6272665
-0.7471495
0.4211350
-0.0372744
0.1068528
-0.7845027
-0.6401759
-0.0550872
-0.6705532
-0.2204038
0.5586652
# Sum of Squared Errors
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875

Linear Regression (two variables)

# Linear Regression model
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
# Output the summary
summary(model2)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188  0.247585    
## AGST         0.60262    0.11128   5.415 0.0000194 ***
## HarvestRain -0.00457    0.00101  -4.525  0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 0.000001347

Sum of Squared Errors

# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 2.970373

Linear Regression (all variables)

# Linear Regression model
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
# Output the summary
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
##     FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept) -0.45039886 10.18888394  -0.044  0.965202    
## AGST         0.60122388  0.10302027   5.836 0.0000127 ***
## HarvestRain -0.00395812  0.00087511  -4.523  0.000233 ***
## WinterRain   0.00104251  0.00053099   1.963  0.064416 .  
## Age          0.00058475  0.07900313   0.007  0.994172    
## FrancePop   -0.00004953  0.00016668  -0.297  0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 0.000001044

Sum of Squared Errors

# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 1.732113

Remove FrancePop

# Linear Regression model
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
# Output the summary
summary(model4)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942  0.066311 .  
## AGST         0.6072093  0.0987022   6.152 0.0000052 ***
## HarvestRain -0.0039715  0.0008538  -4.652  0.000154 ***
## WinterRain   0.0010755  0.0005073   2.120  0.046694 *  
## Age          0.0239308  0.0080969   2.956  0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 0.0000002036

Correlations

# Correlations of the dataset
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine)
##                    Year      Price   WinterRain        AGST HarvestRain         Age    FrancePop
## Year         1.00000000 -0.4477679  0.016970024 -0.24691585  0.02800907 -1.00000000  0.994485097
## Price       -0.44776786  1.0000000  0.136650547  0.65956286 -0.56332190  0.44776786 -0.466861641
## WinterRain   0.01697002  0.1366505  1.000000000 -0.32109061 -0.27544085 -0.01697002 -0.001621627
## AGST        -0.24691585  0.6595629 -0.321090611  1.00000000 -0.06449593  0.24691585 -0.259162274
## HarvestRain  0.02800907 -0.5633219 -0.275440854 -0.06449593  1.00000000 -0.02800907  0.041264394
## Age         -1.00000000  0.4477679 -0.016970024  0.24691585 -0.02800907  1.00000000 -0.994485097
## FrancePop    0.99448510 -0.4668616 -0.001621627 -0.25916227  0.04126439 -0.99448510  1.000000000

Remove Age and FrancePop

Age and FrancePop are almost perfectly correlated (correlation -0.99), so try dropping both and compare the fit.

# Linear Regression model
model5 = lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
# Output the summary
summary(model5)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value   Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112   0.046831 *  
## AGST         0.6810242  0.1117011   6.097 0.00000475 ***
## HarvestRain -0.0039481  0.0009987  -3.953   0.000726 ***
## WinterRain   0.0011765  0.0005920   1.987   0.060097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 0.000001359

Read in test set

# Load in the test set
wineTest = read.csv("wine_test.csv")
# Output the structure of the data frame
str(wineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110

Make test set predictions

# Make predictions using the fourth model on the test set
predictTest = predict(model4, newdata=wineTest)
predictTest
##        1        2 
## 6.768925 6.684910

Compute R-squared

# Compute the out-of-sample R^2 on the test set
SSE = sum((wineTest$Price - predictTest)^2)
# Baseline prediction is the mean price from the training set
SST = sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST
## [1] 0.7944278