library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
This ML project predicts the quality, and therefore the price, of wine from the Bordeaux region of France. It is widely known that older wines typically taste better and consequently fetch a higher price. Wine buyers profit from knowing in advance which wines will taste better in the future.
I used multiple linear regression in RStudio, with price as the dependent (outcome) variable and weather and age as the independent variables. I built five models in total and used a test dataset covering 1979-1980 to check how accurate the chosen model was.
Comparing the test data with predictTest shows that model4 comes closest to the actual wine prices of 6.95 for 1979 and 6.5 for 1980.
wine = read.csv("wine.csv")
One of my first steps is to become more familiar with the data. The structure shows seven variables and 25 observations. AGST is the average growing season temperature, Age is the age of the wine, and FrancePop is the population of France.
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
I see from the summary that wine from this region has a minimum price of 6.205 and a maximum price of 8.494. This range is worth keeping in mind for the analysis that follows.
summary(wine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
I run model1, which uses AGST as its only independent variable, and see that the multiple R-squared is 0.435 and the adjusted R-squared is 0.4105. From the residuals I calculate the sum of squared errors, SSE = 5.734875.
model1 = lm(Price ~ AGST, data = wine)
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
model1$residuals
## 1 2 3 4 5 6
## 0.04204258 0.82983774 0.21169394 0.15609432 -0.23119140 0.38991701
## 7 8 9 10 11 12
## -0.48959140 0.90318115 0.45372410 0.14887461 -0.23882157 -0.08974238
## 13 14 15 16 17 18
## 0.66185660 -0.05211511 -0.62726647 -0.74714947 0.42113502 -0.03727441
## 19 20 21 22 23 24
## 0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381
## 25
## 0.55866518
SSE = sum(model1$residuals^2)
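The relationship between SSE, R-squared, and the residual standard error can be checked by hand. The sketch below is only an illustration: it uses R's built-in mtcars data rather than wine.csv, and the names fit, sse, sst, r2_manual, and rse_manual are my own placeholders.

```r
# Illustration on R's built-in mtcars data (an assumption; not the wine data).
fit <- lm(mpg ~ wt, data = mtcars)

# SSE: squared residuals summed over the training rows.
sse <- sum(residuals(fit)^2)

# SST: squared deviations of the outcome from its own mean.
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared is the share of SST that the model explains.
r2_manual <- 1 - sse / sst

# Residual standard error is sqrt(SSE / residual degrees of freedom).
rse_manual <- sqrt(sse / df.residual(fit))

# Both hand-computed values should agree with what summary() reports.
all.equal(r2_manual, summary(fit)$r.squared)
all.equal(rse_manual, summary(fit)$sigma)
```

The same identity gives a quick sanity check on model1: its residual standard error of 0.4993 on 23 degrees of freedom implies an SSE of about 0.4993^2 * 23, or roughly 5.73, which agrees with the value computed from the residuals.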
In models 2-3 I added further variables. model4 uses AGST (average growing season temperature), HarvestRain, WinterRain, and Age as the independent variables.
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
summary(model4)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45470 -0.24273 0.00752 0.19773 0.53637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
## AGST 0.6072093 0.0987022 6.152 5.2e-06 ***
## HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
## WinterRain 0.0010755 0.0005073 2.120 0.046694 *
## Age 0.0239308 0.0080969 2.956 0.007819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
## F-statistic: 24.17 on 4 and 20 DF, p-value: 2.036e-07
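Because multiple R-squared never decreases when a variable is added to a linear model, adjusted R-squared is the fairer number to compare across models of different sizes. A minimal sketch of such a side-by-side comparison follows. It uses the built-in mtcars data, and the three formulas are placeholders of my own, not the wine models from this project.

```r
# Illustration on built-in mtcars; these formulas are hypothetical stand-ins.
formulas <- list(
  one   = mpg ~ wt,
  two   = mpg ~ wt + hp,
  three = mpg ~ wt + hp + qsec
)

# Fit every formula on the same training data.
fits <- lapply(formulas, lm, data = mtcars)

# Collect both R-squared figures in one table for comparison.
comparison <- data.frame(
  model  = names(fits),
  r2     = sapply(fits, function(f) summary(f)$r.squared),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  row.names = NULL
)
print(comparison)
```

In this project the same comparison favours model4: its adjusted R-squared of 0.7943 is well above the 0.4105 reported for model1.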
I use the test data with model4 to see whether it accurately predicts the price of Bordeaux wine for 1979-1980. I begin the prediction process by loading the wine_test.csv data.
wineTest = read.csv("wine_test.csv")
predictTest = predict(model4, newdata = wineTest)
predictTest
## 1 2
## 6.768925 6.684910
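For a linear model, predict() amounts to the intercept plus each coefficient multiplied by the matching value in the new data. The sketch below reproduces that arithmetic by hand. It is an illustration on the built-in mtcars data with made-up new rows, and the names fit, new_rows, manual, and automatic are my own.

```r
# Illustration on built-in mtcars (an assumption; not the wine data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical new observations to score.
new_rows <- data.frame(wt = c(2.5, 3.2), hp = c(100, 150))

# Manual prediction: intercept + coefficient * value for each predictor.
b <- coef(fit)
manual <- b["(Intercept)"] + b["wt"] * new_rows$wt + b["hp"] * new_rows$hp

# The same prediction through predict().
automatic <- predict(fit, newdata = new_rows)

# The two approaches should agree up to floating-point error.
all.equal(unname(manual), unname(automatic))
```

Working through the sum by hand is also a useful way to see which variable moves a given prediction the most.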
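Now I compute the sum of squared errors (SSE) on the test data along with the total sum of squares (SST), measured against the mean price in the training data. The out-of-sample R-squared is then 1 - SSE/SST.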
SSE = sum((wineTest$Price - predictTest)^2)
SST = sum((wineTest$Price - mean(wine$Price))^2)
1-SSE/SST
## [1] 0.7944278
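The out-of-sample calculation can be wrapped in a small helper so it is easy to repeat for other models. This is a sketch under my own assumptions: the function name out_of_sample_r2 and its arguments are hypothetical, and the demonstration uses a random split of the built-in mtcars data rather than the wine files.

```r
# Hypothetical helper: R-squared on held-out data.
# The baseline is the mean of the TRAINING outcome, as in the code above.
out_of_sample_r2 <- function(model, train, test, outcome) {
  pred <- predict(model, newdata = test)
  sse <- sum((test[[outcome]] - pred)^2)
  sst <- sum((test[[outcome]] - mean(train[[outcome]]))^2)
  1 - sse / sst
}

# Demonstration on built-in mtcars with an arbitrary 24/8 split.
set.seed(1)
train_idx <- sample(nrow(mtcars), 24)
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt + hp, data = train)
r2_test <- out_of_sample_r2(fit, train, test, "mpg")
r2_test
```

Unlike the training R-squared, this value can be negative, which happens when the model predicts the test data worse than simply guessing the training mean. With only two test observations, as in this project, the figure should be read with caution.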
Comparing the test data with predictTest shows that model4 comes close to the actual wine prices: it predicts about 6.77 and 6.68 against actual prices of 6.95 for 1979 and 6.5 for 1980.
str(wineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110
predictTest
## 1 2
## 6.768925 6.684910
plot(model1)