Analysis of a linear regression model using the wine data set, for self-study.
First check the working directory and then set it. It should contain the 'wine' data set.
getwd()
setwd("~/Tutorials/Videos/MITx- 15.071x The Analytics Edge/Unit-2/data")
wine = read.csv("wine.csv")
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
Price is the dependent (target) variable and the rest are independent variables. We need to figure out which independent variables are best suited for our model.
Let's now build a one-variable linear regression model using AGST to predict Price.
model1 = lm(Price ~ AGST, data = wine)
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7845 -0.2388 -0.0373 0.3899 0.9032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.418 2.494 -1.37 0.18371
## AGST 0.635 0.151 4.21 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.499 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.41
## F-statistic: 17.7 on 1 and 23 DF, p-value: 0.000335
First we see the formula, i.e. the variables (dependent and independent) used in our model. After that comes a summary of the residuals, or error terms. Following that is a description of the coefficients of our model. The first row corresponds to the intercept term and the second row corresponds to our independent variable, AGST. The Estimate column gives the estimates of the beta values for our model. So here beta0, the coefficient of the intercept term, is estimated to be -3.418, and beta1, the coefficient of the independent variable, is estimated to be 0.635. Towards the bottom of the output you can see the Multiple R-squared value. Beside it is a number labeled Adjusted R-squared; in this case it is 0.41.
This number adjusts the R-squared value to account for the number of independent variables used relative to the number of data points. Multiple R-squared will always increase if you add more independent variables, but Adjusted R-squared will decrease if you add an independent variable that doesn't help the model. This is a good way to determine whether an additional variable should even be included in the model.
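As a quick sanity check (not part of the original walkthrough), we can pull the coefficients out programmatically and reproduce the Adjusted R-squared from the Multiple R-squared using its textbook formula, where n is the number of observations and k the number of predictors:
coef(model1)                           # beta0 (Intercept) and beta1 (AGST)
r2 = summary(model1)$r.squared         # Multiple R-squared, ~0.435
n = nrow(wine)                         # 25 observations
k = 1                                  # one predictor (AGST)
1 - (1 - r2) * (n - 1) / (n - k - 1)   # ~0.41, matches the Adjusted R-squared above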
Let's compute the sum of squared errors, or SSE. Our residuals, or error terms, are stored in the vector model1$residuals.
model1$residuals
## 1 2 3 4 5 6 7 8
## 0.04204 0.82984 0.21169 0.15609 -0.23119 0.38992 -0.48959 0.90318
## 9 10 11 12 13 14 15 16
## 0.45372 0.14887 -0.23882 -0.08974 0.66186 -0.05212 -0.62727 -0.74715
## 17 18 19 20 21 22 23 24
## 0.42114 -0.03727 0.10685 -0.78450 -0.64018 -0.05509 -0.67055 -0.22040
## 25
## 0.55867
SSE = sum(model1$residuals^2)
SSE
## [1] 5.735
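The SSE also ties directly to the "Residual standard error" reported in the summary: RSE = sqrt(SSE / degrees of freedom). A small check using the values above:
sqrt(SSE / df.residual(model1))   # ~0.499, the residual standard error on 23 degrees of freedom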
Let's add a new independent variable, HarvestRain.
model2 = lm(Price ~ AGST + HarvestRain, data = wine)
summary(model2)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8832 -0.1960 0.0618 0.1538 0.5972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.19 0.24759
## AGST 0.60262 0.11128 5.42 1.9e-05 ***
## HarvestRain -0.00457 0.00101 -4.52 0.00017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.367 on 22 degrees of freedom
## Multiple R-squared: 0.707, Adjusted R-squared: 0.681
## F-statistic: 26.6 on 2 and 22 DF, p-value: 1.35e-06
At the bottom of the output we can see that this new variable really helped our model: both Multiple R-squared and Adjusted R-squared increased significantly compared to the previous model. This indicates that the new model is probably better than the previous one. Now let's compute its SSE.
SSE2 = sum(model2$residuals^2)
SSE2
## [1] 2.97
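This lines up with the reported Multiple R-squared, which can be recomputed from SSE and the total sum of squares SST (a quick check, not in the original notes):
SST = sum((wine$Price - mean(wine$Price))^2)   # total sum of squares
1 - SSE2/SST                                   # ~0.707, the Multiple R-squared of model2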
Now let's build a new model that uses all of the independent variables.
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data = wine)
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age +
## FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4818 -0.2466 -0.0073 0.2201 0.5199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.50e-01 1.02e+01 -0.04 0.96520
## AGST 6.01e-01 1.03e-01 5.84 1.3e-05 ***
## HarvestRain -3.96e-03 8.75e-04 -4.52 0.00023 ***
## WinterRain 1.04e-03 5.31e-04 1.96 0.06442 .
## Age 5.85e-04 7.90e-02 0.01 0.99417
## FrancePop -4.95e-05 1.67e-04 -0.30 0.76958
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.302 on 19 degrees of freedom
## Multiple R-squared: 0.829, Adjusted R-squared: 0.784
## F-statistic: 18.5 on 5 and 19 DF, p-value: 1.04e-06
At the bottom of the output, both Multiple R-squared and Adjusted R-squared increased again. The SSE is also better than in the previous two models.
SSE3 = sum(model3$residuals^2)
SSE3
## [1] 1.732
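For a side-by-side view (a small convenience added here), the three SSE values can be printed together:
c(model1 = SSE, model2 = SSE2, model3 = SSE3)   # SSE shrinks as variables are added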
We can also see the stars, which tell us about the significance of each variable. Looking at them, AGST, HarvestRain, and WinterRain are significant, while Age and FrancePop are not. So let's try removing FrancePop and see how the model behaves.
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
summary(model4)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4547 -0.2427 0.0075 0.1977 0.5364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.429980 1.765898 -1.94 0.06631 .
## AGST 0.607209 0.098702 6.15 5.2e-06 ***
## HarvestRain -0.003972 0.000854 -4.65 0.00015 ***
## WinterRain 0.001076 0.000507 2.12 0.04669 *
## Age 0.023931 0.008097 2.96 0.00782 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.829, Adjusted R-squared: 0.794
## F-statistic: 24.2 on 4 and 20 DF, p-value: 2.04e-07
R-squared and Adjusted R-squared are just as strong; Adjusted R-squared actually increased here. Compared to the previous model, Age is now significant (it was not before). This is due to multicollinearity: Age and FrancePop are, as we say, highly correlated.
Correlation measures the linear relationship between two variables and is a number between -1 and +1. A correlation of +1 means a perfect positive linear relationship. A correlation of -1 means a perfect negative linear relationship. A correlation of 0 means no linear relationship.
When we say that two variables are highly correlated, we mean that the absolute value of their correlation is close to 1. Let's look at some examples.
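For reference (a small sketch, not in the original notes), cor() is just the covariance scaled by the two standard deviations:
with(wine, cov(WinterRain, Price) / (sd(WinterRain) * sd(Price)))
with(wine, cor(WinterRain, Price))   # same value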
plot(wine$WinterRain, wine$Price)
cor(wine$WinterRain, wine$Price)
## [1] 0.1367
Looking at the plot, it's hard to detect any clear linear relationship. The correlation of these two variables (WinterRain and Price) is 0.14, so there is only a slight positive linear relationship between them.
Let's take another example: HarvestRain and AGST.
plot(wine$HarvestRain, wine$AGST)
cor(wine$HarvestRain, wine$AGST)
## [1] -0.0645
So the correlation of these two variables (HarvestRain and AGST) is -0.06, which is even closer to zero than the previous example.
Here is another plot, between Age and the population of France (FrancePop).
plot(wine$Age, wine$FrancePop)
cor(wine$Age,wine$FrancePop)
## [1] -0.9945
This shows a strong negative linear relationship: the correlation is -0.99, so these two variables are highly correlated.
Let's compute the correlations between all pairs of variables.
cor(wine)
## Year Price WinterRain AGST HarvestRain Age
## Year 1.00000 -0.4478 0.016970 -0.2469 0.02801 -1.00000
## Price -0.44777 1.0000 0.136651 0.6596 -0.56332 0.44777
## WinterRain 0.01697 0.1367 1.000000 -0.3211 -0.27544 -0.01697
## AGST -0.24692 0.6596 -0.321091 1.0000 -0.06450 0.24692
## HarvestRain 0.02801 -0.5633 -0.275441 -0.0645 1.00000 -0.02801
## Age -1.00000 0.4478 -0.016970 0.2469 -0.02801 1.00000
## FrancePop 0.99449 -0.4669 -0.001622 -0.2592 0.04126 -0.99449
## FrancePop
## Year 0.994485
## Price -0.466862
## WinterRain -0.001622
## AGST -0.259162
## HarvestRain 0.041264
## Age -0.994485
## FrancePop 1.000000
We've observed that Age and FrancePop are definitely highly correlated, so we do have a multicollinearity problem in the model that uses all of the available independent variables. Keep in mind that multicollinearity refers to the situation when two independent variables are highly correlated. A high correlation between an independent variable and the dependent variable is a good thing, since we are trying to predict the dependent variable using the independent variables.
Because of the possibility of multicollinearity, you always want to remove insignificant variables one at a time. Here, to see what happens, we remove both Age and FrancePop at the same time.
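As an aside (not in the original lecture), the same conclusion can be reached programmatically by scanning the predictor correlation matrix for pairs whose absolute correlation exceeds a rule-of-thumb threshold (the 0.7 cut-off is discussed at the end):
preds = wine[, c("WinterRain", "AGST", "HarvestRain", "Age", "FrancePop")]
cm = cor(preds)
which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)   # flags only the Age/FrancePop pair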
model5 = lm(Price ~ AGST + HarvestRain + WinterRain, data = wine)
summary(model5)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6747 -0.1296 0.0197 0.2075 0.6385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.301626 2.036674 -2.11 0.04683 *
## AGST 0.681024 0.111701 6.10 4.7e-06 ***
## HarvestRain -0.003948 0.000999 -3.95 0.00073 ***
## WinterRain 0.001177 0.000592 1.99 0.06010 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared: 0.754, Adjusted R-squared: 0.719
## F-statistic: 21.4 on 3 and 21 DF, p-value: 1.36e-06
Now all of the independent variables are significant, but the R-squared dropped compared to the earlier model where Age was kept. So if we had removed Age and FrancePop at the same time, we would have missed a significant variable, and the R-squared of our final model would have been lower.
So why didn't we keep FrancePop instead of Age? Well, we expect Age to be significant: older wines are typically more expensive, so Age makes more intuitive sense in our model. Multicollinearity reminds us that the coefficients are only interpretable in the presence of the other variables used. High correlations can even cause coefficients to have an unintuitive sign.
There is no definitive cut-off value for what makes a correlation too high, but typically a correlation greater than 0.7 or less than -0.7 is cause for concern. Looking at the correlations computed earlier, it doesn't look like we have any other highly correlated independent variables. So we will stick with model4 (AGST, HarvestRain, WinterRain, and Age as independent variables).
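Finally, as a usage sketch, model4 can be used to predict the price of a new vintage. The predictor values below are invented purely for illustration, not real data:
newWine = data.frame(AGST = 16.5, HarvestRain = 120, WinterRain = 600, Age = 10)   # hypothetical vintage
predict(model4, newdata = newWine)   # predicted (log) price under model4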