Let's start by loading our data set.
wine<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\wine.csv", header = TRUE)
First six rows of the data:
head(wine)
## Year Price WinterRain AGST HarvestRain Age FrancePop
## 1 1952 7.4950 600 17.1167 160 31 43183.57
## 2 1953 8.0393 690 16.7333 80 30 43495.03
## 3 1955 7.6858 502 17.1500 130 28 44217.86
## 4 1957 6.9845 420 16.1333 110 26 45152.25
## 5 1958 6.7772 582 16.4167 187 25 45653.81
## 6 1959 8.0757 485 17.4833 187 24 46128.64
We can look at the structure of our data using the str function and get a statistical summary using the summary function.
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
summary(wine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
We can see that we have a data frame with 25 observations of seven different variables.
Price is the dependent variable we are trying to predict. And WinterRain, AGST, HarvestRain, Age, and FrancePop are the independent variables we will use to predict Price.
Now let's build our first regression model. We will use the lm function, which stands for linear model; this is the function we will use whenever we want to build a linear regression model in R. To start, we predict Price using a single independent variable, AGST (average growing season temperature).
model1 = lm(Price ~ AGST, data=wine)
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
We can see that Multiple R-squared is 0.435; this is the model's R-squared value.
Beside it is a number labeled Adjusted R-squared, which is 0.4105.
Multiple R-squared will always increase if you add more independent variables.
But Adjusted R-squared will decrease if you add an independent variable that doesn't help the model.
This makes Adjusted R-squared a good way to determine whether an additional variable should be included in the model.
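As a quick sanity check (not part of the original lecture code), we can reproduce the Adjusted R-squared from the Multiple R-squared by applying the standard penalty for the number of predictors:
r2 = summary(model1)$r.squared        # Multiple R-squared, about 0.435
n = nrow(wine)                        # 25 observations
k = 1                                 # one independent variable (AGST)
1 - (1 - r2) * (n - 1) / (n - k - 1)  # should match the Adjusted R-squared of about 0.41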
Sum of Squared Errors
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875
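This SSE is directly related to the "Residual standard error" of 0.4993 reported in the summary above: the residual standard error is the square root of SSE divided by the residual degrees of freedom (here 25 observations minus 2 estimated coefficients). A one-line check:
sqrt(SSE / model1$df.residual)   # should reproduce the residual standard error of about 0.499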
Now let’s add another variable to our regression model, HarvestRain.
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88321 -0.19600 0.06178 0.15379 0.59722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 1.94e-05 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06
We can see that this new variable really helped our model: both Multiple R-squared and Adjusted R-squared increased significantly compared to model1, which indicates that this new model is probably better. Let's also compute the sum of squared errors for this new model.
Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 2.970373
We can see that the sum of squared errors for model2 is 2.97, much lower than the 5.73 we obtained for model1.
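If we want a more formal comparison of the two models than simply eyeballing R-squared, one option (not used in the lecture) is an F-test for nested models with the anova function:
anova(model1, model2)   # tests whether adding HarvestRain significantly reduces the SSE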
Now let’s build a third model, this time with all of our independent variables.
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age +
## FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48179 -0.24662 -0.00726 0.22012 0.51987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.504e-01 1.019e+01 -0.044 0.965202
## AGST 6.012e-01 1.030e-01 5.836 1.27e-05 ***
## HarvestRain -3.958e-03 8.751e-04 -4.523 0.000233 ***
## WinterRain 1.043e-03 5.310e-04 1.963 0.064416 .
## Age 5.847e-04 7.900e-02 0.007 0.994172
## FrancePop -4.953e-05 1.667e-04 -0.297 0.769578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
## F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
If we look at the bottom of the output, we can again see that the Multiple R-squared and Adjusted R-squared have both increased.
Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 1.732113
We can see that the sum of squared errors for model3 is 1.73, even better than before.
In the summary output, the Estimate column gives the coefficients for the intercept and for each of the independent variables in our model. If a coefficient is not significantly different from 0, then we should probably remove that variable from the model, since it is not helping to predict the dependent variable.
The standard error column gives a measure of how much the coefficient is likely to vary from the estimate value.
The t value is the estimate divided by the standard error. It will be negative if the estimate is negative and positive if the estimate is positive. The larger the absolute value of the t value, the more likely the coefficient is to be significant. So we want independent variables with a large absolute value in this column.
The last column of numbers, Pr(>|t|), gives a measure of how plausible it is that the coefficient is actually 0, given the data we used to build the model.
The smaller the probability in this column, the less plausible it is that the true coefficient is 0. This number will be large if the absolute value of the t value is small, and it will be small if the absolute value of the t value is large.
We want independent variables with small values in this column. This is a lot of information, but the easiest way in R to determine if a variable is significant is to look at the stars at the end of each row.
The star coding scheme is explained at the bottom of the Coefficients table.
Three stars is the highest level of significance and corresponds to a probability value less than 0.001. Two stars is also very significant and corresponds to a probability between 0.001 and 0.01.
One star is still significant and corresponds to a probability between 0.01 and 0.05. A period, or dot, means that the coefficient is almost significant and corresponds to a probability between 0.05 and 0.10.
Nothing at the end of a row means that the variable is not significant in the model.
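All of this information can also be extracted programmatically. For example, coef(summary(model3)) returns the Coefficients table as a matrix, and we can verify that the t value column is just the estimate divided by its standard error (a minimal sketch, not part of the lecture code):
coefs = coef(summary(model3))                 # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"] / coefs[, "Std. Error"]   # reproduces the "t value" column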
If we look at the stars at the end of each row in this Coefficients table, we can see that Age and FrancePop are both insignificant in our model.
Let's see if we can improve our linear regression model. We built model3 using all of our independent variables to predict the dependent variable, Price, and, as we just saw in its summary output, both Age and FrancePop are insignificant.
Because of this, we should consider removing these variables from our model.
Removing FrancePop
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45470 -0.24273 0.00752 0.19773 0.53637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
## AGST 0.6072093 0.0987022 6.152 5.2e-06 ***
## HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
## WinterRain 0.0010755 0.0005073 2.120 0.046694 *
## Age 0.0239308 0.0080969 2.956 0.007819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
## F-statistic: 24.17 on 4 and 20 DF, p-value: 2.036e-07
We can see that the Multiple R-squared for this model is 0.8286 and the Adjusted R-squared is 0.7943.
Recall that for model3 the Multiple R-squared was 0.8294 and the Adjusted R-squared was 0.7845.
So this model is actually a little stronger than the previous one, because the Adjusted R-squared increased when we removed FrancePop.
If we look at the stars for each of the independent variables in the new model, we can see that something a little strange happened. Before, Age was not significant at all in our model.
But now Age has two stars, meaning that it is very significant in this new model. This is due to something called multicollinearity: Age and FrancePop are highly correlated.
Correlation measures the linear relationship between two variables and is a number between -1 and +1.
Let’s compute the correlation between WinterRain and Price.
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
Correlation between Age and FrancePop:
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
The correlation between all pairs of variables in our data
cor(wine)
## Year Price WinterRain AGST HarvestRain
## Year 1.00000000 -0.4477679 0.016970024 -0.24691585 0.02800907
## Price -0.44776786 1.0000000 0.136650547 0.65956286 -0.56332190
## WinterRain 0.01697002 0.1366505 1.000000000 -0.32109061 -0.27544085
## AGST -0.24691585 0.6595629 -0.321090611 1.00000000 -0.06449593
## HarvestRain 0.02800907 -0.5633219 -0.275440854 -0.06449593 1.00000000
## Age -1.00000000 0.4477679 -0.016970024 0.24691585 -0.02800907
## FrancePop 0.99448510 -0.4668616 -0.001621627 -0.25916227 0.04126439
## Age FrancePop
## Year -1.00000000 0.994485097
## Price 0.44776786 -0.466861641
## WinterRain -0.01697002 -0.001621627
## AGST 0.24691585 -0.259162274
## HarvestRain -0.02800907 0.041264394
## Age 1.00000000 -0.994485097
## FrancePop -0.99448510 1.000000000
As we saw earlier, Age and FrancePop are highly correlated, with a correlation of -0.99.
We can also see this in a scatterplot of Age against FrancePop:
plot(wine$Age, wine$FrancePop, xlab="age", ylab="FrancePop", main="Age Vs FrancePop", col="blue", pch=19)
Multicollinearity refers to the situation in which two or more independent variables are highly correlated with each other.
A high correlation between an independent variable and the dependent variable is a good thing since we’re trying to predict the dependent variable using the independent variables.
Now due to the possibility of multicollinearity, you always want to remove the insignificant variables one at a time.
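One common way to quantify multicollinearity, not covered in the lecture, is the variance inflation factor (VIF): regress each independent variable on the other independent variables and compute 1/(1 - R-squared). A sketch for Age in model3 (if the car package is installed, car::vif(model3) gives the same kind of diagnostic):
ageR2 = summary(lm(Age ~ AGST + HarvestRain + WinterRain + FrancePop, data=wine))$r.squared
1 / (1 - ageR2)   # a very large value confirms Age is nearly redundant given FrancePop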
Let's see what would have happened if we had removed both Age and FrancePop at the same time, since they were both insignificant in our model that used all of the independent variables.
model5 = lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
summary(model5)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67472 -0.12958 0.01973 0.20751 0.63846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3016263 2.0366743 -2.112 0.046831 *
## AGST 0.6810242 0.1117011 6.097 4.75e-06 ***
## HarvestRain -0.0039481 0.0009987 -3.953 0.000726 ***
## WinterRain 0.0011765 0.0005920 1.987 0.060097 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared: 0.7537, Adjusted R-squared: 0.7185
## F-statistic: 21.42 on 3 and 21 DF, p-value: 1.359e-06
If we take a look at the summary of this new model and look at the Coefficients table, we can see that AGST and HarvestRain are very significant, and WinterRain is almost significant. So this model looks pretty good, but the Multiple R-squared has dropped to 0.7537.
The model that includes Age (model4) has an R-squared of 0.8286. So if we had removed Age and FrancePop at the same time, we would have missed a significant variable, and the R-squared of our final model would have been lower.
So why didn't we keep FrancePop instead of Age?
Well, we expect Age to be significant. Older wines are typically more expensive, so Age makes more intuitive sense in our model.
We have two data points that we did not use to build our model in the file “wine_test.csv”.
Let’s load this new data file into R.
wineTest<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\wine_test.csv", header = TRUE)
Let's take a look at the structure of wineTest using the str function.
str(wineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110
We can see that we have two observations and the same seven variables as before.
To make predictions for these two test points, we'll use the predict function.
We'll call the output predictTest, and we'll call predict with model4 and the new data, wineTest.
predictTest = predict(model4, newdata=wineTest)
predictTest
## 1 2
## 6.768925 6.684910
If we take a look at our predictions, we can see that for the first data point we predict 6.7689, and for the second data point we predict 6.6849.
If we look back at our str output, we can see that the actual Price for the first data point is 6.95, and the actual Price for the second data point is 6.5.
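To see the actual prices and our predictions side by side, we can put them in a small data frame:
data.frame(actual = wineTest$Price, predicted = predictTest)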
So it looks like our predictions are pretty good, but we can quantify this by computing the R-squared value for our test set.
Let’s compute the sum of squared errors on our test set.
SSE = sum((wineTest$Price - predictTest)^2)
Now let’s compute the total sum of squares, which we’ll call SST.
SST = sum((wineTest$Price - mean(wine$Price))^2)
Now, to compute the test set R-squared, we can just type 1 - SSE/SST.
1 - SSE/SST
## [1] 0.7944278
So the R-squared on our test set is 0.79. This is a pretty good out-of-sample R-squared. But while we do well on these two test points, keep in mind that our test set is really small.
We should increase the size of our test set to be more confident about the out-of-sample accuracy of our model.
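If we wanted a single accuracy number in the same units as Price, we could also report the root mean squared error on the test set (not computed in the lecture):
sqrt(SSE / nrow(wineTest))   # RMSE of the test-set predictions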