Large differences in price and quality between years, although wine is produced in a similar way
Meant to be aged, so hard to tell if wine will be good when it is on the market
Can analytics be used to come up with a different system for judging wine?
March 1990 - Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine
Dependent variable: typical price in 1990-1991 wine auctions (approximates quality)
\[y = 7.07\] \[y = 0.5(AGST) - 1.25\]
yi = dependent variable (wine price) for the ith observation
xi = indepedent variable (temperature) for the ith observation
Etai = error term for the ith observation
Beta_0 = intercept coefficient
Beta_1 = regression coefficient for the independent variable
\[ R^2 = 1 - \frac{SSE}{SST} \] * R2 captures value added from using a model + R2 = 0 means no improvement over baseline + R2 = 1 means a perfect predictive model
\[y^i = \beta_0 +\beta_1 x^i_1 + \beta_2 x^i_2 +... + \beta_k x^i_k + \eta^i \]
yi = dependent variable (wine price) for the ith observation
xi_j = jth indepedent variable (wine price) for the ith observation
Etai = error term for the ith observation
Beta_0 = intercept coefficient
Beta_j = regression coefficient for the jth independent variable
A measure of the linear relationship between variables
+1 = perfect positive linear relationship 0 = no linear relationship -1 = perfect negative linear relationship
Our wine model has a value of R2 = 0.83
Tells us our accuracy on the data that we used to build the model
A linear regression model with only a few variables can predict wine prices fairly well
In many cases, outperforms wine experts’ opinions
A quantitative approach to a qualitative problem
# Load in the dataset
wine = read.csv("wine.csv")
# Output the string
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
# Output the summary
z = summary(wine)
kable(z)| Year | Price | WinterRain | AGST | HarvestRain | Age | FrancePop | |
|---|---|---|---|---|---|---|---|
| Min. :1952 | Min. :6.205 | Min. :376.0 | Min. :14.98 | Min. : 38.0 | Min. : 5.0 | Min. :43184 | |
| 1st Qu.:1960 | 1st Qu.:6.519 | 1st Qu.:536.0 | 1st Qu.:16.20 | 1st Qu.: 89.0 | 1st Qu.:11.0 | 1st Qu.:46584 | |
| Median :1966 | Median :7.121 | Median :600.0 | Median :16.53 | Median :130.0 | Median :17.0 | Median :50255 | |
| Mean :1966 | Mean :7.067 | Mean :605.3 | Mean :16.51 | Mean :148.6 | Mean :17.2 | Mean :49694 | |
| 3rd Qu.:1972 | 3rd Qu.:7.495 | 3rd Qu.:697.0 | 3rd Qu.:17.07 | 3rd Qu.:187.0 | 3rd Qu.:23.0 | 3rd Qu.:52894 | |
| Max. :1978 | Max. :8.494 | Max. :830.0 | Max. :17.65 | Max. :292.0 | Max. :31.0 | Max. :54602 |
# Linear Regression model
model1 = lm(Price ~ AGST, data=wine)
# Output the summary
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335# Sum of Squared Errors
z = model1$residuals
kable(z)| x |
|---|
| 0.0420426 |
| 0.8298377 |
| 0.2116939 |
| 0.1560943 |
| -0.2311914 |
| 0.3899170 |
| -0.4895914 |
| 0.9031812 |
| 0.4537241 |
| 0.1488746 |
| -0.2388216 |
| -0.0897424 |
| 0.6618566 |
| -0.0521151 |
| -0.6272665 |
| -0.7471495 |
| 0.4211350 |
| -0.0372744 |
| 0.1068528 |
| -0.7845027 |
| -0.6401759 |
| -0.0550872 |
| -0.6705532 |
| -0.2204038 |
| 0.5586652 |
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875# Linear Regression model
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
# Output the summary
summary(model2)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88321 -0.19600 0.06178 0.15379 0.59722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 0.0000194 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 0.000001347# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 2.970373# Linear Regression model
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
# Output the summary
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age +
## FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48179 -0.24662 -0.00726 0.22012 0.51987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.45039886 10.18888394 -0.044 0.965202
## AGST 0.60122388 0.10302027 5.836 0.0000127 ***
## HarvestRain -0.00395812 0.00087511 -4.523 0.000233 ***
## WinterRain 0.00104251 0.00053099 1.963 0.064416 .
## Age 0.00058475 0.07900313 0.007 0.994172
## FrancePop -0.00004953 0.00016668 -0.297 0.769578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
## F-statistic: 18.47 on 5 and 19 DF, p-value: 0.000001044# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 1.732113# Linear Regression model
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
# Output the summary
summary(model4)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45470 -0.24273 0.00752 0.19773 0.53637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
## AGST 0.6072093 0.0987022 6.152 0.0000052 ***
## HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
## WinterRain 0.0010755 0.0005073 2.120 0.046694 *
## Age 0.0239308 0.0080969 2.956 0.007819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
## F-statistic: 24.17 on 4 and 20 DF, p-value: 0.0000002036# Correlations of the dataset
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine)
## Year Price WinterRain AGST HarvestRain Age FrancePop
## Year 1.00000000 -0.4477679 0.016970024 -0.24691585 0.02800907 -1.00000000 0.994485097
## Price -0.44776786 1.0000000 0.136650547 0.65956286 -0.56332190 0.44776786 -0.466861641
## WinterRain 0.01697002 0.1366505 1.000000000 -0.32109061 -0.27544085 -0.01697002 -0.001621627
## AGST -0.24691585 0.6595629 -0.321090611 1.00000000 -0.06449593 0.24691585 -0.259162274
## HarvestRain 0.02800907 -0.5633219 -0.275440854 -0.06449593 1.00000000 -0.02800907 0.041264394
## Age -1.00000000 0.4477679 -0.016970024 0.24691585 -0.02800907 1.00000000 -0.994485097
## FrancePop 0.99448510 -0.4668616 -0.001621627 -0.25916227 0.04126439 -0.99448510 1.000000000# Linear Regression model
model5 = lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
# Output the summary
summary(model5)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67472 -0.12958 0.01973 0.20751 0.63846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3016263 2.0366743 -2.112 0.046831 *
## AGST 0.6810242 0.1117011 6.097 0.00000475 ***
## HarvestRain -0.0039481 0.0009987 -3.953 0.000726 ***
## WinterRain 0.0011765 0.0005920 1.987 0.060097 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared: 0.7537, Adjusted R-squared: 0.7185
## F-statistic: 21.42 on 3 and 21 DF, p-value: 0.000001359# Load in the test set
wineTest = read.csv("wine_test.csv")
# Output the string
str(wineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110# Make predictions using the fourth model on the test set
predictTest = predict(model4, newdata=wineTest)
predictTest
## 1 2
## 6.768925 6.684910# Compute the R^2
SSE = sum((wineTest$Price - predictTest)^2)
SST = sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST
## [1] 0.7944278