The Statistical Sommelier

Bordeaux Wine

  • Large differences in price and quality between years, even though the wine is produced in a similar way each year

  • Meant to be aged, so it is hard to tell whether a wine will be good when it first comes to market

  • Can analytics be used to come up with a different system for judging wine?

Building a Model

  • March 1990 - Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine

  • Ashenfelter used a method called linear regression
    • Predicts an outcome variable, or dependent variable
    • Predicts using a set of independent variables
  • Dependent variable: typical price in 1990-1991 wine auctions (approximates quality)

  • Independent variables:
    • Age - older wines are more expensive
    • Weather
      • Average Growing Season Temperature
      • Harvest Rain
      • Winter Rain

The Data (1952 - 1978)

One-Variable Linear Regression

\[ y = 7.07 \]

\[ y = 0.5\,(\text{AGST}) - 1.25 \]
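
A minimal sketch of how these two candidate lines compare against the data, assuming the wine.csv file that is read in the R section later in this document:

# Read the dataset (the same wine.csv used in the R section below)
wine = read.csv("wine.csv")
# Scatter plot of price against average growing season temperature
plot(wine$AGST, wine$Price, xlab = "AGST", ylab = "Price")
# Flat line y = 7.07 (roughly the mean price)
abline(h = 7.07, lty = 2)
# Sloped candidate line y = 0.5 * AGST - 1.25
abline(a = -1.25, b = 0.5)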

The Regression Model

  • One-variable regression model \[ y^i = \beta_0 + \beta_1 x^i + \eta^i \]

\(y^i\) = dependent variable (wine price) for the ith observation

\(x^i\) = independent variable (temperature) for the ith observation

\(\eta^i\) = error term for the ith observation

\(\beta_0\) = intercept coefficient

\(\beta_1\) = regression coefficient for the independent variable

Selecting the Best Model

R²

  • Compares the best model to a “baseline” model
  • The baseline model does not use any variables
    • Predicts same outcome (price) regardless of the independent variable (temperature)

Interpreting R²

\[ R^2 = 1 - \frac{SSE}{SST} \]

  • R² captures the value added from using a model, where SSE is the model's sum of squared errors and SST is the sum of squared errors of the baseline model
    • R² = 0 means no improvement over the baseline
    • R² = 1 means a perfect predictive model
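
A minimal sketch of this calculation in R, assuming the wine data frame and the one-variable model fitted in the R section below:

# Fit the one-variable model
model = lm(Price ~ AGST, data = wine)
# SSE: sum of squared errors of the model
SSE = sum(model$residuals^2)
# SST: sum of squared errors of the baseline model, which always predicts the mean price
SST = sum((wine$Price - mean(wine$Price))^2)
# R^2 = 1 - SSE/SST; should match summary(model)$r.squared
1 - SSE/SST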

Multivariable Linear Regression

\[ y^i = \beta_0 + \beta_1 x^i_1 + \beta_2 x^i_2 + \cdots + \beta_k x^i_k + \eta^i \]

\(y^i\) = dependent variable (wine price) for the ith observation

\(x^i_j\) = jth independent variable for the ith observation

\(\eta^i\) = error term for the ith observation

\(\beta_0\) = intercept coefficient

\(\beta_j\) = regression coefficient for the jth independent variable

  • Best model coefficients selected to minimize SSE
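
The SSE-minimizing coefficients have a closed-form least-squares solution. A minimal sketch, assuming the wine data frame and the four-variable formula fitted in the R section below, that solves the normal equations directly and compares the result with lm():

# Design matrix: an intercept column plus the chosen independent variables
X = model.matrix(~ AGST + HarvestRain + WinterRain + Age, data = wine)
y = wine$Price
# Least-squares coefficients: solve (X'X) beta = X'y
beta = solve(t(X) %*% X, t(X) %*% y)
beta
# Should match the coefficients reported by lm() for the same formula
coef(lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine))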

Adding Variables

  • Adding more variables can improve the model
  • Diminishing returns as more variables are added

Selecting Variables

  • Not all available variables should be used
    • Each new variable requires more data
    • Causes overfitting: a high R² on the data used to build the model, but poor performance on unseen data (see the sketch below)
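
A minimal sketch of this effect, using a simulated pure-noise variable (an assumption for illustration; it is not part of the wine data):

# Add a column of random noise to the training data
set.seed(1)
wineNoise = wine
wineNoise$Noise = rnorm(nrow(wineNoise))
# In-sample R^2 without and with the noise variable
summary(lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wineNoise))$r.squared
summary(lm(Price ~ AGST + HarvestRain + WinterRain + Age + Noise, data = wineNoise))$r.squared
# The second R^2 can only be equal or higher, even though Noise carries no signal,
# so the extra coefficient tends to hurt performance on unseen data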

Understanding the Model and Coefficients

Correlation

A measure of the linear relationship between variables

  • +1 = perfect positive linear relationship
  • 0 = no linear relationship
  • -1 = perfect negative linear relationship

Examples of Correlation
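
A minimal sketch using simulated data (an assumption for illustration) that produces correlations near +1, 0, and -1 with cor():

set.seed(1)
x = rnorm(100)
# Strong positive linear relationship: correlation near +1
cor(x, 2 * x + rnorm(100, sd = 0.1))
# No linear relationship: correlation near 0
cor(x, rnorm(100))
# Strong negative linear relationship: correlation near -1
cor(x, -3 * x + rnorm(100, sd = 0.1))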

Predictive Ability

  • Our wine model has R² = 0.83

  • Tells us our accuracy on the data that we used to build the model

  • But how well does the model perform on new data?
    • Bordeaux wine buyers profit from being able to predict the quality of a wine years before it matures

Out-of-Sample R²

  • A higher model R² does not mean a higher test-set R²
  • Need more data to be conclusive
  • Out-of-sample R² can be negative!
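
A brief sketch of why a negative value is possible, using the same definition applied to the test set:

\[ R^2_{\text{test}} = 1 - \frac{SSE_{\text{test}}}{SST_{\text{test}}} \]

Here SSE_test sums the squared errors of the model's predictions on the test set, and SST_test sums the squared errors of the baseline prediction (the mean price from the training data) on the test set. If the model's test-set predictions are worse than simply predicting the training mean, then SSE_test > SST_test and the out-of-sample R² is negative.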

The Analytics Edge

  • A linear regression model with only a few variables can predict wine prices fairly well

  • In many cases, outperforms wine experts’ opinions

  • A quantitative approach to a qualitative problem

Linear Regression in R

Read in data

# Load in the dataset
wine = read.csv("wine.csv")
# Output the structure of the data frame
str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
# Output the summary as a table
library(knitr)  # provides kable()
z = summary(wine)
kable(z)
| Year | Price | WinterRain | AGST | HarvestRain | Age | FrancePop |
|------|-------|------------|------|-------------|-----|-----------|
| Min.   :1952 | Min.   :6.205 | Min.   :376.0 | Min.   :14.98 | Min.   : 38.0 | Min.   : 5.0 | Min.   :43184 |
| 1st Qu.:1960 | 1st Qu.:6.519 | 1st Qu.:536.0 | 1st Qu.:16.20 | 1st Qu.: 89.0 | 1st Qu.:11.0 | 1st Qu.:46584 |
| Median :1966 | Median :7.121 | Median :600.0 | Median :16.53 | Median :130.0 | Median :17.0 | Median :50255 |
| Mean   :1966 | Mean   :7.067 | Mean   :605.3 | Mean   :16.51 | Mean   :148.6 | Mean   :17.2 | Mean   :49694 |
| 3rd Qu.:1972 | 3rd Qu.:7.495 | 3rd Qu.:697.0 | 3rd Qu.:17.07 | 3rd Qu.:187.0 | 3rd Qu.:23.0 | 3rd Qu.:52894 |
| Max.   :1978 | Max.   :8.494 | Max.   :830.0 | Max.   :17.65 | Max.   :292.0 | Max.   :31.0 | Max.   :54602 |

Linear Regression

Linear Regression (one variable)

# Linear Regression model
model1 = lm(Price ~ AGST, data=wine)
# Output the summary
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335

Sum of Squared Errors

# Display the model residuals
z = model1$residuals
kable(z)
x
0.0420426
0.8298377
0.2116939
0.1560943
-0.2311914
0.3899170
-0.4895914
0.9031812
0.4537241
0.1488746
-0.2388216
-0.0897424
0.6618566
-0.0521151
-0.6272665
-0.7471495
0.4211350
-0.0372744
0.1068528
-0.7845027
-0.6401759
-0.0550872
-0.6705532
-0.2204038
0.5586652
# Sum of Squared Errors
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875

Linear Regression (two variables)

# Linear Regression model
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
# Output the summary
summary(model2)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188  0.247585    
## AGST         0.60262    0.11128   5.415 0.0000194 ***
## HarvestRain -0.00457    0.00101  -4.525  0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 0.000001347

Sum of Squared Errors

# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 2.970373

Linear Regression (all variables)

# Linear Regression model
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
# Output the summary
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
##     FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept) -0.45039886 10.18888394  -0.044  0.965202    
## AGST         0.60122388  0.10302027   5.836 0.0000127 ***
## HarvestRain -0.00395812  0.00087511  -4.523  0.000233 ***
## WinterRain   0.00104251  0.00053099   1.963  0.064416 .  
## Age          0.00058475  0.07900313   0.007  0.994172    
## FrancePop   -0.00004953  0.00016668  -0.297  0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 0.000001044

Sum of Squared Errors

# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 1.732113

Remove FrancePop

# Linear Regression model
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
# Output the summary
summary(model4)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942  0.066311 .  
## AGST         0.6072093  0.0987022   6.152 0.0000052 ***
## HarvestRain -0.0039715  0.0008538  -4.652  0.000154 ***
## WinterRain   0.0010755  0.0005073   2.120  0.046694 *  
## Age          0.0239308  0.0080969   2.956  0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 0.0000002036

Correlations

# Correlations of the dataset
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine)
##                    Year      Price   WinterRain        AGST HarvestRain         Age    FrancePop
## Year         1.00000000 -0.4477679  0.016970024 -0.24691585  0.02800907 -1.00000000  0.994485097
## Price       -0.44776786  1.0000000  0.136650547  0.65956286 -0.56332190  0.44776786 -0.466861641
## WinterRain   0.01697002  0.1366505  1.000000000 -0.32109061 -0.27544085 -0.01697002 -0.001621627
## AGST        -0.24691585  0.6595629 -0.321090611  1.00000000 -0.06449593  0.24691585 -0.259162274
## HarvestRain  0.02800907 -0.5633219 -0.275440854 -0.06449593  1.00000000 -0.02800907  0.041264394
## Age         -1.00000000  0.4477679 -0.016970024  0.24691585 -0.02800907  1.00000000 -0.994485097
## FrancePop    0.99448510 -0.4668616 -0.001621627 -0.25916227  0.04126439 -0.99448510  1.000000000

Remove Age and FrancePop

Age and FrancePop are almost perfectly correlated (correlation -0.99), so try dropping both and compare the fit.

# Linear Regression model
model5 = lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
# Output the summary
summary(model5)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value   Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112   0.046831 *  
## AGST         0.6810242  0.1117011   6.097 0.00000475 ***
## HarvestRain -0.0039481  0.0009987  -3.953   0.000726 ***
## WinterRain   0.0011765  0.0005920   1.987   0.060097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 0.000001359

Read in test set

# Load in the test set
wineTest = read.csv("wine_test.csv")
# Output the structure of the data frame
str(wineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110

Make test set predictions

# Make predictions using the fourth model on the test set
predictTest = predict(model4, newdata=wineTest)
predictTest
##        1        2 
## 6.768925 6.684910

Compute R-squared

# Compute the out-of-sample R^2 on the test set
SSE = sum((wineTest$Price - predictTest)^2)
# Baseline prediction is the mean price from the training set
SST = sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST
## [1] 0.7944278