Bordeaux is a region in France famous for producing wine. While Bordeaux wine has been made in much the same way for hundreds of years, its price and quality can vary dramatically from year to year. Bordeaux wines are widely believed to taste better when they are older, so there is an incentive to store young wines until they mature. The main problem is that it is hard to judge the quality of a wine so young just by tasting it, since the taste will change significantly by the time the wine is actually consumed.
wine = read.csv("wine.csv")
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
summary(wine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
We can see that we have a data frame with 25 observations of seven variables. Year is the year the wine was produced and serves as a unique identifier for each observation. Price is the dependent variable we're trying to predict, and WinterRain, AGST, HarvestRain, Age, and FrancePop are the independent variables we'll use to predict it.
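Before building any models, a quick scatter plot of AGST against Price gives a rough sense of whether a linear fit is plausible (a minimal base-R sketch; the axis labels are our own choices):

# Scatter plot of average growing season temperature against price
plot(wine$AGST, wine$Price,
     xlab = "Average Growing Season Temperature (AGST)",
     ylab = "Price",
     main = "Wine price vs. AGST")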
Let’s now create a one-variable linear regression equation using AGST to predict Price.
model1 = lm(Price ~ AGST, data = wine)
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
**Multiple R-squared will never decrease when you add more independent variables**, but Adjusted R-squared will decrease if you add an independent variable that doesn't help the model.
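As a sanity check, we can reproduce model1's Adjusted R-squared by hand from the standard adjustment formula; the helper variables n, k, and r2 below are introduced just for this check, and the result should match the 0.4105 reported above.

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n = nrow(wine)                   # number of observations (25)
k = 1                            # number of independent variables (just AGST)
r2 = summary(model1)$r.squared   # Multiple R-squared of model1
1 - (1 - r2) * (n - 1) / (n - k - 1)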
model1$residuals
## 1 2 3 4 5 6
## 0.04204258 0.82983774 0.21169394 0.15609432 -0.23119140 0.38991701
## 7 8 9 10 11 12
## -0.48959140 0.90318115 0.45372410 0.14887461 -0.23882157 -0.08974238
## 13 14 15 16 17 18
## 0.66185660 -0.05211511 -0.62726647 -0.74714947 0.42113502 -0.03727441
## 19 20 21 22 23 24
## 0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381
## 25
## 0.55866518
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875
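Multiple R-squared can be recovered from this SSE as well: compute the total sum of squares on the training data and take 1 - SSE/SST, which should come out to roughly the 0.435 reported in the summary.

# Total sum of squares: squared deviations of Price from its training-set mean
SST = sum((wine$Price - mean(wine$Price))^2)
1 - SSE/SST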
Now let’s add another variable to our regression model, HarvestRain.
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88321 -0.19600 0.06178 0.15379 0.59722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 1.94e-05 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06
The R-squared values near the bottom of the output suggest that HarvestRain really helped: both Multiple R-squared and Adjusted R-squared increased substantially compared to model1, which indicates that this new model is probably better than the previous one.
SSE_model2 = sum(model2$residuals^2)
SSE_model2
## [1] 2.970373
Now let’s build a third model, this time with all of our independent variables.
model3 = lm(Price ~ AGST + WinterRain + HarvestRain + FrancePop + Age, data = wine)
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain + FrancePop +
## Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48179 -0.24662 -0.00726 0.22012 0.51987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.504e-01 1.019e+01 -0.044 0.965202
## AGST 6.012e-01 1.030e-01 5.836 1.27e-05 ***
## WinterRain 1.043e-03 5.310e-04 1.963 0.064416 .
## HarvestRain -3.958e-03 8.751e-04 -4.523 0.000233 ***
## FrancePop -4.953e-05 1.667e-04 -0.297 0.769578
## Age 5.847e-04 7.900e-02 0.007 0.994172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
## F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
If we look at the bottom of the output, we can again see that the Multiple R-squared and Adjusted R-squared have both increased.
sse_model3 = sum(model3$residuals^2)
sse_model3
## [1] 1.732113
Printing sse_model3 shows that the sum of squared errors for model3 is about 1.73, even lower than before.
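For a quick side-by-side view, we can collect the three training-set SSE values computed above into one named vector:

# Training-set SSE for each model so far; lower means a closer in-sample fit
c(model1 = SSE, model2 = SSE_model2, model3 = sse_model3)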
Now we will create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables.
model4 = lm(Price ~ HarvestRain+WinterRain, data = wine)
summary(model4)
##
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0933 -0.3222 -0.1012 0.3871 1.1877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.865e+00 6.616e-01 11.888 4.76e-11 ***
## HarvestRain -4.971e-03 1.601e-03 -3.105 0.00516 **
## WinterRain -9.848e-05 9.007e-04 -0.109 0.91392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared: 0.3177, Adjusted R-squared: 0.2557
## F-statistic: 5.122 on 2 and 22 DF, p-value: 0.01492
With only the rain variables, the fit is weak (Multiple R-squared of about 0.32), and WinterRain is not even significant on its own. Returning to the full model, let's now remove FrancePop, since it is not a significant independent variable for predicting wine price, and build a new model.
model5 = lm(Price ~ HarvestRain+WinterRain+AGST+Age, data = wine)
summary(model5)
##
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain + AGST + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45470 -0.24273 0.00752 0.19773 0.53637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
## HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
## WinterRain 0.0010755 0.0005073 2.120 0.046694 *
## AGST 0.6072093 0.0987022 6.152 5.2e-06 ***
## Age 0.0239308 0.0080969 2.956 0.007819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
## F-statistic: 24.17 on 4 and 20 DF, p-value: 2.036e-07
If we look at the stars next to each independent variable in this new model, we can see that something a little strange happened. Age was not significant at all in model3, but now it has two stars, meaning it is very significant in this new model. This is due to something called multicollinearity: Age and FrancePop are highly correlated.
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
We've confirmed that Age and FrancePop are highly correlated (the correlation is about -0.99), so we do have a multicollinearity problem in the model that uses all of the available independent variables. Keep in mind that multicollinearity refers to the situation in which two or more independent variables are highly correlated with each other; a high correlation between an independent variable and the dependent variable is a good thing, since we're trying to predict the dependent variable using the independent variables.
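Rather than checking pairs one at a time, we could also scan the full correlation matrix of the data frame, which works here because every column of wine is numeric (output omitted for brevity):

# Pairwise correlations between all variables in the wine data frame
round(cor(wine), 2)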
To see what happens, let's now remove both Age and FrancePop, since they were both insignificant in model3, and build a new model.
model6 = lm(Price ~ WinterRain+AGST+HarvestRain, data=wine)
summary(model6)
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67472 -0.12958 0.01973 0.20751 0.63846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3016263 2.0366743 -2.112 0.046831 *
## WinterRain 0.0011765 0.0005920 1.987 0.060097 .
## AGST 0.6810242 0.1117011 6.097 4.75e-06 ***
## HarvestRain -0.0039481 0.0009987 -3.953 0.000726 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared: 0.7537, Adjusted R-squared: 0.7185
## F-statistic: 21.42 on 3 and 21 DF, p-value: 1.359e-06
Looking at the Coefficients table in the summary of this new model, we can see that AGST and HarvestRain are very significant and WinterRain is almost significant, so the model looks pretty good. But the Multiple R-squared dropped to about 0.75, while model5, which includes Age, has an R-squared of about 0.83. So if we had removed Age and FrancePop at the same time, we would have missed a significant variable, and the R-squared of our final model would have been lower. Because of multicollinearity, insignificant variables should generally be removed one at a time.
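To make that comparison concrete, we can pull the R-squared and Adjusted R-squared values for the two models straight from their summaries:

# In-sample fit of model5 (keeps Age) versus model6 (drops Age and FrancePop)
summary(model5)$r.squared
summary(model6)$r.squared
summary(model5)$adj.r.squared
summary(model6)$adj.r.squared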
We'll stick with model5 for the rest of this analysis, the model that uses AGST, HarvestRain, WinterRain, and Age as the independent variables. Now let's see how well it predicts the prices of wines it has never seen, using a small test set of two more recent vintages.
wineTest = read.csv("wine_test.csv")
str(wineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110
predictTest = predict(model5, newdata = wineTest)
predictTest
## 1 2
## 6.768925 6.684910
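With only two test wines, it's easy to line the predictions up against the actual prices directly:

# Actual test-set prices next to model5's predictions
data.frame(Year = wineTest$Year, Actual = wineTest$Price, Predicted = predictTest)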
Calculating the test-set SSE:
SSE_Predict = sum((wineTest$Price-predictTest)^2)
Calculating the test-set SST, using the mean Price from the training data as the baseline:
SST_Predict = sum((wineTest$Price-mean(wine$Price))^2)
Calculating the out-of-sample R-squared:
1-SSE_Predict/SST_Predict
## [1] 0.7944278
An out-of-sample R-squared of about 0.79 suggests that the predictions are close to the actual prices. However, with only two observations in the test set, we would need a much larger test set before drawing confident conclusions about the model's predictive accuracy.