Here, we will study whether the linear model satisfies the regression assumptions. If the model satisfies these assumptions, we consider it a good fit. The assumptions are:
1. Errors should follow a normal distribution.
2. The independent variables should not have multicollinearity.
3. The error variances should be homogeneous, i.e. there should be no heteroscedasticity.
4. Independence of error terms, i.e. no autocorrelation/serial correlation between errors.

We load the data first. In this data, we will see how cigarette sales in each state are affected by variables such as Age, high-school education (HS), Income, Black, Female, and Price.

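The data-loading step is not shown below; a minimal sketch, assuming the dataset is stored locally in a CSV file (the file name cig.csv is hypothetical) with State as a factor column:

# Read the cigarette sales data; stringsAsFactors = TRUE keeps State as a factor
cig <- read.csv("cig.csv", stringsAsFactors = TRUE)
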
head(cig)
##   State  Age   HS Income Black Female Price Sales
## 1    AL 27.0 41.3   2948  26.2   51.7  42.7  89.8
## 2    AK 22.9 66.1   4644   3.0   45.7  41.8 121.3
## 3    Az 26.3 58.1   3665   3.0   50.8  38.5 115.2
## 4    AR 29.1 39.9   2878  18.3   51.5  38.8 100.3
## 5    CA 28.1 62.6   4493   7.0   50.8  39.7 123.0
## 6    co 26.2 63.9   3855   3.0   50.7  31.1 124.8
summary(cig)
##      State         Age             HS           Income         Black      
##  1A     : 1   Min.   :22.9   Min.   :37.8   Min.   :2626   Min.   : 0.20  
##  AK     : 1   1st Qu.:26.4   1st Qu.:48.3   1st Qu.:3271   1st Qu.: 1.25  
##  AL     : 1   Median :27.4   Median :53.3   Median :3751   Median : 5.70  
##  AR     : 1   Mean   :27.5   Mean   :53.1   Mean   :3764   Mean   : 9.87  
##  Az     : 1   3rd Qu.:28.8   3rd Qu.:59.1   3rd Qu.:4116   3rd Qu.:13.55  
##  CA     : 1   Max.   :32.3   Max.   :67.3   Max.   :5079   Max.   :71.10  
##  (Other):45                                                               
##      Female         Price          Sales      
##  Min.   :45.7   Min.   :29.0   Min.   : 65.5  
##  1st Qu.:50.8   1st Qu.:34.7   1st Qu.:105.3  
##  Median :51.1   Median :38.9   Median :119.0  
##  Mean   :51.0   Mean   :38.1   Mean   :121.5  
##  3rd Qu.:51.5   3rd Qu.:41.4   3rd Qu.:124.5  
##  Max.   :53.5   Max.   :45.5   Max.   :265.1  
## 

We do some plotting and check the correlation between the response variable (Sales) and each of the predictor variables.

require(ggplot2)
qplot(Age, Sales, data = cig, geom = c("point", "smooth"), method = "lm")

[Figure: scatter plot of Sales vs. Age with a fitted regression line]
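
Side note: qplot() with a method argument works with the ggplot2 version used here; in current ggplot2 releases qplot() is deprecated, and the equivalent figure can be built with ggplot() directly. A minimal sketch, which applies equally to the other scatter plots below:

# Same plot with the ggplot() interface: points plus a fitted linear trend
ggplot(cig, aes(x = Age, y = Sales)) +
  geom_point() +
  geom_smooth(method = "lm")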

cor.test(cig$Sales, cig$Age)
## 
##  Pearson's product-moment correlation
## 
## data:  cig$Sales and cig$Age
## t = 1.63, df = 49, p-value = 0.1094
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0520  0.4729
## sample estimates:
##    cor 
## 0.2268

We notice a weak positive correlation between Sales and Age (r = 0.23), though it is not statistically significant (p = 0.109).

Correlation between High school education and cigarette sales.

qplot(HS, Sales, data = cig, geom = c("point", "smooth"), method = "lm")

[Figure: scatter plot of Sales vs. HS with a fitted regression line]

cor.test(cig$Sales, cig$HS)
## 
##  Pearson's product-moment correlation
## 
## data:  cig$Sales and cig$HS
## t = 0.4692, df = 49, p-value = 0.641
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2126  0.3363
## sample estimates:
##     cor 
## 0.06688

There is almost no correlation between high-school education and cigarette sales (r = 0.07).

Correlation between Income and Sales.

qplot(Income, Sales, data = cig, geom = c("point", "smooth"), method = "lm")

[Figure: scatter plot of Sales vs. Income with a fitted regression line]

cor.test(cig$Sales, cig$Income)
## 
##  Pearson's product-moment correlation
## 
## data:  cig$Sales and cig$Income
## t = 2.419, df = 49, p-value = 0.01932
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0561 0.5525
## sample estimates:
##    cor 
## 0.3266

Weak positive correlation between Income and Sales (r = 0.33), and this one is statistically significant (p = 0.019).

Correlation between Black and Sales.

qplot(Black, Sales, data = cig, geom = c("point", "smooth"), method = "lm")

[Figure: scatter plot of Sales vs. Black with a fitted regression line]

cor.test(cig$Sales, cig$Black)
## 
##  Pearson's product-moment correlation
## 
## data:  cig$Sales and cig$Black
## t = 1.276, df = 49, p-value = 0.2081
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1013  0.4334
## sample estimates:
##    cor 
## 0.1793

Weak positive correlation between Black and Sales (r = 0.18), not statistically significant (p = 0.208).

Correlation between Female and Sales.

qplot(Female, Sales, data = cig, geom = c("point", "smooth"), method = "lm")

[Figure: scatter plot of Sales vs. Female with a fitted regression line]

cor.test(cig$Sales, cig$Female)
## 
##  Pearson's product-moment correlation
## 
## data:  cig$Sales and cig$Female
## t = 1.036, df = 49, p-value = 0.3052
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1346  0.4056
## sample estimates:
##    cor 
## 0.1464

Weak positive correlation between Female and Sales (r = 0.15), not statistically significant (p = 0.305).

Correlation between Price and Sales.

qplot(Price, Sales, data = cig, geom = c("point", "smooth"), method = "lm")

[Figure: scatter plot of Sales vs. Price with a fitted regression line]

cor.test(cig$Sales, cig$Price)
## 
##  Pearson's product-moment correlation
## 
## data:  cig$Sales and cig$Price
## t = -2.208, df = 49, p-value = 0.03195
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.53228 -0.02752
## sample estimates:
##     cor 
## -0.3008

There is a weak negative correlation between Price and cigarette Sales (r = -0.30, p = 0.032).
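
Rather than running cor.test() for each pair one at a time, the full correlation matrix of the numeric columns can be inspected in a single step; a minimal sketch (column 1, State, is dropped because it is a factor):

# Pairwise Pearson correlations among all numeric variables, rounded for readability
round(cor(cig[, -1]), 2)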

We go ahead and build our regression model in spite of the weak correlations between the response variable and the predictor variables.

cig_reg <- lm(Sales ~ . -State, data = cig)
summary(cig_reg)
## 
## Call:
## lm(formula = Sales ~ . - State, data = cig)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48.58 -12.28  -5.57   6.06 132.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  99.9793   246.2876    0.41   0.6868   
## Age           4.3639     3.2149    1.36   0.1816   
## HS           -0.1123     0.7957   -0.14   0.8884   
## Income        0.0194     0.0101    1.93   0.0606 . 
## Black         0.3148     0.4752    0.66   0.5111   
## Female       -0.8511     5.5654   -0.15   0.8792   
## Price        -3.2901     1.0299   -3.19   0.0026 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.2 on 44 degrees of freedom
## Multiple R-squared:  0.319,  Adjusted R-squared:  0.226 
## F-statistic: 3.44 on 6 and 44 DF,  p-value: 0.00717
par(mfrow = c(2,2))
plot(cig_reg)

[Figure: regression diagnostic plots for cig_reg (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage)]

The only significant predictor is Price (p-value = 0.0026). The residual standard error is 28.2, R-squared is 0.319, and adjusted R-squared is 0.226. The overall model p-value is 0.00717, so the model is statistically significant as a whole. However, we still need to check whether it satisfies the four regression assumptions.
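
As a side note, the 28.2 reported by summary() is the residual standard error, i.e. the square root of the residual sum of squares divided by the residual degrees of freedom; a quick sketch to verify this from the fitted model:

# Residual standard error: sqrt(RSS / residual df); should match the summary() output
sqrt(sum(residuals(cig_reg)^2) / df.residual(cig_reg))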

  1. Errors should follow a normal distribution: In the Residuals vs Fitted plot the points are scattered fairly randomly, and the Normal Q-Q plot indicates the errors are approximately normally distributed. So the model satisfies the first assumption.
  2. The independent variables should not have multicollinearity: To check whether there is any strong relationship among the independent variables, we look at the VIF (Variance Inflation Factor) scores. A common rule of thumb is that a VIF greater than 5 indicates multicollinearity.
require(car)
vif(cig_reg)
##    Age     HS Income  Black Female  Price 
##  2.296  2.544  2.271  2.297  2.412  1.130

None of the predictor variables scores more than 5, so there is no multicollinearity.
  3. The error variances should be homogeneous, i.e. no heteroscedasticity: Looking at the residual plots above, there is no clear pattern in the spread of the residuals, so there is no evidence of heteroscedasticity.
  4. Independence of error terms, i.e. no autocorrelation/serial correlation between errors: For this assumption we run the Durbin-Watson (D-W) test. The statistic ranges from 0 to 4: a value close to 2 means no serial correlation, a value close to 0 indicates positive serial correlation, and a value close to 4 indicates negative serial correlation.

require(lmtest)
dwtest(cig_reg)
## 
##  Durbin-Watson test
## 
## data:  cig_reg
## DW = 1.656, p-value = 0.1124
## alternative hypothesis: true autocorrelation is greater than 0

The Durbin-Watson statistic for the model is 1.656, which is close to 2, and the test's p-value (0.1124) is not significant. So there is no evidence of serial correlation.
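
Assumptions 1 and 3 above were judged only from the residual plots. If a formal check is preferred, a Shapiro-Wilk test on the residuals and a Breusch-Pagan test (bptest() from the lmtest package, already loaded for the Durbin-Watson test) can be added; a minimal sketch:

# Shapiro-Wilk test: null hypothesis is that the residuals are normally distributed (assumption 1)
shapiro.test(residuals(cig_reg))
# Breusch-Pagan test: null hypothesis is constant error variance, i.e. no heteroscedasticity (assumption 3)
bptest(cig_reg)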

Hence, this model satisfies all four regression assumptions.