Let's start by loading our data set.

wine <- read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\wine.csv", header = TRUE)

The first few rows of the data:

head(wine)
##   Year  Price WinterRain    AGST HarvestRain Age FrancePop
## 1 1952 7.4950        600 17.1167         160  31  43183.57
## 2 1953 8.0393        690 16.7333          80  30  43495.03
## 3 1955 7.6858        502 17.1500         130  28  44217.86
## 4 1957 6.9845        420 16.1333         110  26  45152.25
## 5 1958 6.7772        582 16.4167         187  25  45653.81
## 6 1959 8.0757        485 17.4833         187  24  46128.64

We can look at the structure of our data using the str function, and get a statistical summary of each variable using the summary function.

str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
summary(wine)
##       Year          Price         WinterRain         AGST      
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98  
##  1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20  
##  Median :1966   Median :7.121   Median :600.0   Median :16.53  
##  Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51  
##  3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07  
##  Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65  
##   HarvestRain         Age         FrancePop    
##  Min.   : 38.0   Min.   : 5.0   Min.   :43184  
##  1st Qu.: 89.0   1st Qu.:11.0   1st Qu.:46584  
##  Median :130.0   Median :17.0   Median :50255  
##  Mean   :148.6   Mean   :17.2   Mean   :49694  
##  3rd Qu.:187.0   3rd Qu.:23.0   3rd Qu.:52894  
##  Max.   :292.0   Max.   :31.0   Max.   :54602

We can see that we have a data frame with 25 observations of seven different variables.

Applying Linear Regression

Price is the dependent variable we are trying to predict; WinterRain, AGST, HarvestRain, Age, and FrancePop are the independent variables we will use to predict it.

Linear Regression (one variable)

We will use the lm function, which stands for linear model. This is the function we use whenever we want to build a linear regression model. In the formula below, Price ~ AGST means 'model Price as a function of AGST'.

model1 = lm(Price ~ AGST, data=wine)
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335

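From the Coefficients table, the fitted equation is Price ≈ -3.4178 + 0.6351 × AGST. As a quick sanity check (an aside, not part of the original analysis), we can compute a fitted value by hand from the estimated coefficients and compare it with the model's own fitted value:

# hand-computed prediction for the first observation
coef(model1)[1] + coef(model1)[2] * wine$AGST[1]
# the model's fitted value for the same observation; the two should match
fitted(model1)[1]
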
We can see that Multiple R-squared is 0.435; this is the R-squared value.
Beside it is the Adjusted R-squared, which is 0.4105.

Multiple R-squared will always increase if you add more independent variables.
But Adjusted R-squared will decrease if you add an independent variable that doesn’t help the model.
This is a good way to determine if an additional variable should even be included in the model.
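
If you want these numbers programmatically rather than reading them off the printout, the object returned by summary stores both:

# extract the two R-squared values from the summary object
summary(model1)$r.squared      # Multiple R-squared
summary(model1)$adj.r.squared  # Adjusted R-squared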

Sum of Squared Errors

SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875
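
The residual standard error reported by summary is just the square root of this SSE divided by the residual degrees of freedom, so we can recover it as a quick check:

# residual standard error = sqrt(SSE / residual degrees of freedom)
sqrt(SSE / df.residual(model1))  # should match the 0.4993 in the summary above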

Linear Regression (two variables)

Now let’s add another variable to our regression model, HarvestRain.

model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188 0.247585    
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06

You can see that this variable really helped our model: both Multiple R-squared and Adjusted R-squared increased significantly compared to the previous model, which indicates that this new model is probably better. Let's now also compute the sum of squared errors for this new model.

Sum of Squared Errors

SSE = sum(model2$residuals^2)
SSE
## [1] 2.970373

We can see that the sum of squared errors for model2 is 2.97, which is much better than the sum of squared errors for model1.
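
We can also compare the two nested models formally with an F-test using base R's anova function; a minimal sketch:

# tests whether adding HarvestRain significantly reduces the residual sum of squares
anova(model1, model2)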

Linear Regression (all variables)

Now let’s build a third model, this time with all of our independent variables.

model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
##     FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
## AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
## HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
## WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
## Age          5.847e-04  7.900e-02   0.007 0.994172    
## FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06

If we look at the bottom of the output, we can again see that the Multiple R-squared and Adjusted R-squared have both increased.

Sum of Squared Errors

SSE = sum(model3$residuals^2)
SSE
## [1] 1.732113

We can see that the sum of squared errors for model3 is 1.73, even better than before.

In the summary output, the Estimate column gives the coefficients for the intercept and for each of the independent variables in our model. If a coefficient is not significantly different from 0, then we should probably remove the variable from our model, since it's not helping to predict the dependent variable.

The Std. Error column gives a measure of how much the coefficient is likely to vary from the estimated value.
The t value is the estimate divided by the standard error. It is negative when the estimate is negative and positive when the estimate is positive. The larger the absolute value of the t value, the more likely the coefficient is to be significant, so we want independent variables with a large absolute value in this column.
The last column, Pr(>|t|), measures how plausible it is that the coefficient is actually 0, given the data we used to build the model. The smaller the probability in this column, the less likely it is that the coefficient is actually 0. This number will be large when the absolute value of the t value is small, and small when the absolute value of the t value is large.
We want independent variables with small values in this column. This is a lot of information, but the easiest way in R to determine if a variable is significant is to look at the stars at the end of each row.

The star coding scheme is explained at the bottom of the Coefficients table.
Three stars is the highest level of significance and corresponds to a probability value less than 0.001. Two stars is also very significant and corresponds to a probability between 0.001 and 0.01.
One star is still significant and corresponds to a probability between 0.01 and 0.05. A period, or dot, means that the coefficient is almost significant and corresponds to a probability between 0.05 and 0.1.
Nothing at the end of a row means that the variable is not significant in the model.

If we look at the stars at the end of each row in this Coefficients table, we can see that Age and FrancePop are both insignificant in our model.
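
As a quick check of the relationships described above, we can reconstruct the t values and p-values directly from the coefficients matrix that summary returns; a small sketch:

# the coefficients matrix holds the four columns of the table
coefs <- summary(model3)$coefficients
# t value is just Estimate divided by Std. Error
coefs[, "Estimate"] / coefs[, "Std. Error"]
# the p-value is the two-sided tail probability of the t statistic
2 * pt(-abs(coefs[, "t value"]), df = df.residual(model3))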

Let's see if we can improve our linear regression model. We built a model, model3, that used all of our independent variables to predict the dependent variable, Price.

In our R console, we can see the summary output of this model. As we just learned, both Age and FrancePop are insignificant in our model.
Because of this, we should consider removing these variables from our model.

Removing FrancePop

model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942 0.066311 .  
## AGST         0.6072093  0.0987022   6.152  5.2e-06 ***
## HarvestRain -0.0039715  0.0008538  -4.652 0.000154 ***
## WinterRain   0.0010755  0.0005073   2.120 0.046694 *  
## Age          0.0239308  0.0080969   2.956 0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 2.036e-07

We can see that the Multiple R-squared for this model is 0.8286 and the Adjusted R-squared is 0.7943.

For model3, the Multiple R-squared was 0.8294 and the Adjusted R-squared was 0.7845.
So this new model is actually stronger than the previous one, because the Adjusted R-squared increased when we removed FrancePop.
If we look at each of the independent variables in the new model, and the stars, we can see that something a little strange happened. Before, Age was not significant at all in our model.
But now, Age has two stars, meaning that it's very significant in this new model. This is due to something called multicollinearity: Age and FrancePop are highly correlated, so the earlier model could not separate their effects, and once one of them is removed the other becomes significant.

Correlation measures the linear relationship between two variables and is a number between -1 and +1.

Let’s compute the correlation between WinterRain and Price.

cor(wine$WinterRain, wine$Price)
## [1] 0.1366505

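As a quick check against the definition, the same number is the covariance of the two variables scaled by their standard deviations:

# Pearson correlation = covariance / (product of standard deviations)
cov(wine$WinterRain, wine$Price) / (sd(wine$WinterRain) * sd(wine$Price))
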
The correlation between Age and FrancePop:

cor(wine$Age, wine$FrancePop)
## [1] -0.9944851

The correlations between all pairs of variables in our data:

cor(wine)
##                    Year      Price   WinterRain        AGST HarvestRain
## Year         1.00000000 -0.4477679  0.016970024 -0.24691585  0.02800907
## Price       -0.44776786  1.0000000  0.136650547  0.65956286 -0.56332190
## WinterRain   0.01697002  0.1366505  1.000000000 -0.32109061 -0.27544085
## AGST        -0.24691585  0.6595629 -0.321090611  1.00000000 -0.06449593
## HarvestRain  0.02800907 -0.5633219 -0.275440854 -0.06449593  1.00000000
## Age         -1.00000000  0.4477679 -0.016970024  0.24691585 -0.02800907
## FrancePop    0.99448510 -0.4668616 -0.001621627 -0.25916227  0.04126439
##                     Age    FrancePop
## Year        -1.00000000  0.994485097
## Price        0.44776786 -0.466861641
## WinterRain  -0.01697002 -0.001621627
## AGST         0.24691585 -0.259162274
## HarvestRain -0.02800907  0.041264394
## Age          1.00000000 -0.994485097
## FrancePop   -0.99448510  1.000000000

As we saw earlier, Age and FrancePop are highly correlated, with a correlation of -0.99.

We can also see this in a graph:

plot(wine$Age, wine$FrancePop, xlab = "Age", ylab = "FrancePop", main = "Age vs FrancePop", col = "blue", pch = 19)

Multicollinearity refers to the situation when two independent variables are highly correlated.
A high correlation between an independent variable and the dependent variable is a good thing, since we're trying to predict the dependent variable using the independent variables; it is high correlation among the independent variables that causes problems.
Due to the possibility of multicollinearity, you should always remove insignificant variables one at a time.
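
As an aside, a common numeric diagnostic for multicollinearity is the variance inflation factor (VIF); a minimal sketch, assuming the car package is installed:

# requires the car package (install.packages("car") if needed)
library(car)
# VIF measures how much each coefficient's variance is inflated by correlations
# among the predictors; values far above 5-10 are usually taken as a warning sign
vif(model3)
# Age and FrancePop should show very large VIFs, reflecting their near-perfect correlation
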
Let's see what would have happened if we had removed both Age and FrancePop at once, since they were both insignificant in our model that used all of the independent variables.

model5 = lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
summary(model5)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112 0.046831 *  
## AGST         0.6810242  0.1117011   6.097 4.75e-06 ***
## HarvestRain -0.0039481  0.0009987  -3.953 0.000726 ***
## WinterRain   0.0011765  0.0005920   1.987 0.060097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 1.359e-06

If we take a look at the summary of this new model and at the Coefficients table, we can see that AGST and HarvestRain are very significant, and WinterRain is almost significant. So this model looks pretty good, but if we look at the Multiple R-squared, we can see that it dropped to 0.75.
The model that includes Age has an R-squared of 0.83. So if we had removed Age and FrancePop at the same time, we would have missed a significant variable, and the R-squared of our final model would have been lower.
So why didn't we keep FrancePop instead of Age?
We expect Age to be significant: older wines are typically more expensive, so Age makes more intuitive sense in our model.

The file "wine_test.csv" contains two data points that we did not use to build our model.
Let’s load this new data file into R.

wineTest <- read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\wine_test.csv", header = TRUE)

Let's take a look at the structure of wineTest using the str function:

str(wineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110

We can see that we have two observations and the same seven variables as before.

To make predictions for these two test points, we’ll use the predict function.

Make test set predictions

We'll call the output predictTest, and we'll use the predict function with model4 and newdata = wineTest.

predictTest = predict(model4, newdata=wineTest)
predictTest
##        1        2 
## 6.768925 6.684910

If we take a look at our predictions, we can see that for the first data point we predict 6.7689, and for the second data point we predict 6.6849.
If we look back at our str output, we can see that the actual Price for the first data point is 6.95, and the actual Price for the second data point is 6.5.
So it looks like our predictions are pretty good, but we can quantify this by computing the R-squared value for our test set.
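
For a quick side-by-side view (just a convenience using the objects defined above), we can put the actual and predicted prices in a data frame:

# actual test-set prices next to the model4 predictions
data.frame(actual = wineTest$Price, predicted = predictTest)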

Let’s compute the sum of squared errors on our test set.

SSE = sum((wineTest$Price - predictTest)^2)

Now let's compute the total sum of squares, which we'll call SST. Note that the baseline here is the average Price over the training set, since the baseline model, like any model, must be built without using the test data.

SST = sum((wineTest$Price - mean(wine$Price))^2)

Now, to compute the test set R-squared, we can just type 1 - SSE/SST.

1 - SSE/SST
## [1] 0.7944278
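
If you do this often, it can help to wrap the computation in a small helper; osR2 is just a hypothetical convenience function, not part of base R:

# out-of-sample R-squared: the baseline is the training-set mean, as above
osR2 <- function(actual, predicted, trainMean) {
  1 - sum((actual - predicted)^2) / sum((actual - trainMean)^2)
}
osR2(wineTest$Price, predictTest, mean(wine$Price))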

So the R-squared on our test set is 0.79. This is a pretty good out-of-sample R-squared. Note that, unlike the R-squared on the training data, an out-of-sample R-squared can even be negative if the model does worse than the baseline. But while we do well on these two test points, keep in mind that our test set is really small.
We should increase the size of our test set to be more confident about the out-of-sample accuracy of our model.