PREDICTING THE QUALITY OF WINES USING R

Bordeaux is a region in France popular for producing wine.While this wine has been produced in much the same way for hundreds of years, there are differences in price and quality from year to year that are sometimes very significant. Bordeaux wines are widely believed to taste better when they are older, so there’s an incentive to store young wines until they are mature. The main problem is that it is hard to determine the quality of the wine when it is so young just by tasting it, since the taste will change so significantly by the time it will actually be consumed.

wine = read.csv("wine.csv")
str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
summary(wine)
##       Year          Price         WinterRain         AGST      
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98  
##  1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20  
##  Median :1966   Median :7.121   Median :600.0   Median :16.53  
##  Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51  
##  3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07  
##  Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65  
##   HarvestRain         Age         FrancePop    
##  Min.   : 38.0   Min.   : 5.0   Min.   :43184  
##  1st Qu.: 89.0   1st Qu.:11.0   1st Qu.:46584  
##  Median :130.0   Median :17.0   Median :50255  
##  Mean   :148.6   Mean   :17.2   Mean   :49694  
##  3rd Qu.:187.0   3rd Qu.:23.0   3rd Qu.:52894  
##  Max.   :292.0   Max.   :31.0   Max.   :54602

We can see that we have a data frame with 25 observations of seven different variables. Year gives the year the wine was produced, and it’s just a unique identifier for each observation. Price is the dependent variable we’re trying to predict. And WinterRain, AGST, HarvestRain, Age, and FrancePop are the independent variables we’ll use to predict Price.

Linear Regression using Single variable

Let’s now create a one-variable linear regression equation using AGST to predict Price.

model1 = lm(Price ~ AGST, data = wine)
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335

** Multiple R-squared will always increase if you add more independent variables. ** But Adjusted R-squared will decrease if you add an independent variable that doesn’t help the model.

model1$residuals
##           1           2           3           4           5           6 
##  0.04204258  0.82983774  0.21169394  0.15609432 -0.23119140  0.38991701 
##           7           8           9          10          11          12 
## -0.48959140  0.90318115  0.45372410  0.14887461 -0.23882157 -0.08974238 
##          13          14          15          16          17          18 
##  0.66185660 -0.05211511 -0.62726647 -0.74714947  0.42113502 -0.03727441 
##          19          20          21          22          23          24 
##  0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381 
##          25 
##  0.55866518
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875

Now let’s add another variable to our regression model, HarvestRain.

model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188 0.247585    
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06

The new R-squared value near the bottom of the output suggests that this variable really helped our model. Our Multiple R-squared and Adjusted R-squared both increased significantly compared to the previous model. This indicates that this new model is probably better than the previous model.

SSE_model2 = sum(model2$residuals^2)
SSE_model2
## [1] 2.970373

Now let’s build a third model, this time with all of our independent variables.

model3 = lm(Price ~ AGST+WinterRain+HarvestRain+FrancePop+Age, data= wine)
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + WinterRain + HarvestRain + FrancePop + 
##     Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
## AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
## WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
## HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
## FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
## Age          5.847e-04  7.900e-02   0.007 0.994172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06

If we look at the bottom of the output, we can again see that the Multiple R-squared and Adjusted R-squared have both increased.

sse_model3 = sum(model3$residuals^2)
sse_model3
## [1] 1.732113

And if we type SSE, we can see that the sum of squared errors for model3 is 1.7, even better than before.

Now we will create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables.

model4 = lm(Price ~ HarvestRain+WinterRain, data = wine)
summary(model4)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492

Removing FrancePop variable as it is not significant independent variable to predict wine price and creating a new model.

model5 = lm(Price ~ HarvestRain+WinterRain+AGST+Age, data = wine)
summary(model5)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain + AGST + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942 0.066311 .  
## HarvestRain -0.0039715  0.0008538  -4.652 0.000154 ***
## WinterRain   0.0010755  0.0005073   2.120 0.046694 *  
## AGST         0.6072093  0.0987022   6.152  5.2e-06 ***
## Age          0.0239308  0.0080969   2.956 0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 2.036e-07

If we look at each of our independent variables in the new model, and the stars, we can see that something a little strange happened. Before, Age was not significant at all in our model3. But now, Age has two stars, meaning that it’s very significant in this new model. This is due to something called multicollinearity. Age and FrancePopulation are what we call highly correlated.

Computing Correlation

cor(wine$WinterRain,wine$Price)
## [1] 0.1366505
cor(wine$Age,wine$FrancePop)
## [1] -0.9944851

We’ve confirmed that Age and FrancePopulation are definitely highly correlated. So we do have multicollinearity problems in our model that uses all of the available independent variables. Keep in mind that multicollinearity refers to the situation when two independent variables are highly correlated. A high correlation between an independent variable and the dependent variable is a good thing since we’re trying to predict the dependent variable using the independent variables.

Now we will remove both Age and FrancePopulation, since they were both insignificant in our model and build a new model.

model6 = lm(Price ~ WinterRain+AGST+HarvestRain, data=wine)
summary(model6)
## 
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112 0.046831 *  
## WinterRain   0.0011765  0.0005920   1.987 0.060097 .  
## AGST         0.6810242  0.1117011   6.097 4.75e-06 ***
## HarvestRain -0.0039481  0.0009987  -3.953 0.000726 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 1.359e-06

If we take a look at the summary of this new model and look at the Coefficients table, we can see that AverageGrowingSeasonTemperature and HarvestRain are very significant, and WinterRain is almost significant. So this model looks pretty good, but if we look at our R-squared, we can see that it dropped to 0.75. The model that includes Age has an R-squared of 0.83. So if we had removed Age and FrancePopulation at the same time, we would have missed a significant variable, and the R-squared of our final model would have been lower.

we’ll stick with model5 for the rest of this lecture, the model that uses AGST, HarvestRain, WinterRain, and Age as the independent variables.

Making Predictions

wineTest = read.csv("wine_test.csv")
str(wineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110
predictTest = predict(model5,wineTest)
predictTest
##        1        2 
## 6.768925 6.684910

Calculation SSE

SSE_Predict = sum((wineTest$Price-predictTest)^2)

Calculating SST

SST_Predict = sum((wineTest$Price-mean(wine$Price))^2)

Calculating R squared value

1-SSE_Predict/SST_Predict
## [1] 0.7944278

The value of R Squared shows that the prediction is close to what is expected but to have more accurate prediction we need a large data set.