Here, we will study whether the linear model satisfies the regression assumptions. If the model satisfies these assumptions, we consider it a good fit. The assumptions are:
1. The errors should follow a normal distribution.
2. The independent variables should not have multicollinearity.
3. The error variances should be homogeneous (no heteroscedasticity).
4. The error terms should be independent (no autocorrelation/serial correlation).
We load the data first. In this data, we will see how cigarette sales in each state are affected by variables such as Age, high school education (HS), Income, race (Black), gender (Female) and Price.
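Before the commands below will run, the data has to be read into R. A minimal sketch, assuming the data sits in a local CSV file (the file name is an assumption, not part of the original analysis):
cig <- read.csv("cigarette.csv", stringsAsFactors = TRUE)  # State read as a factor, remaining columns numeric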
head(cig)
## State Age HS Income Black Female Price Sales
## 1 AL 27.0 41.3 2948 26.2 51.7 42.7 89.8
## 2 AK 22.9 66.1 4644 3.0 45.7 41.8 121.3
## 3 Az 26.3 58.1 3665 3.0 50.8 38.5 115.2
## 4 AR 29.1 39.9 2878 18.3 51.5 38.8 100.3
## 5 CA 28.1 62.6 4493 7.0 50.8 39.7 123.0
## 6 co 26.2 63.9 3855 3.0 50.7 31.1 124.8
summary(cig)
## State Age HS Income Black
## 1A : 1 Min. :22.9 Min. :37.8 Min. :2626 Min. : 0.20
## AK : 1 1st Qu.:26.4 1st Qu.:48.3 1st Qu.:3271 1st Qu.: 1.25
## AL : 1 Median :27.4 Median :53.3 Median :3751 Median : 5.70
## AR : 1 Mean :27.5 Mean :53.1 Mean :3764 Mean : 9.87
## Az : 1 3rd Qu.:28.8 3rd Qu.:59.1 3rd Qu.:4116 3rd Qu.:13.55
## CA : 1 Max. :32.3 Max. :67.3 Max. :5079 Max. :71.10
## (Other):45
## Female Price Sales
## Min. :45.7 Min. :29.0 Min. : 65.5
## 1st Qu.:50.8 1st Qu.:34.7 1st Qu.:105.3
## Median :51.1 Median :38.9 Median :119.0
## Mean :51.0 Mean :38.1 Mean :121.5
## 3rd Qu.:51.5 3rd Qu.:41.4 3rd Qu.:124.5
## Max. :53.5 Max. :45.5 Max. :265.1
##
We do some plotting and check the correlation between the response variable and each predictor variable.
require(ggplot2)
qplot(Age, Sales, data = cig, geom = c("point", "smooth"), method = "lm")
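In recent releases of ggplot2, qplot() is deprecated and arguments like method may not be passed through to the smoother as they were in older versions; an equivalent plot written with the ggplot() interface (the same pattern applies to the later qplot() calls) would be:
ggplot(cig, aes(Age, Sales)) +
  geom_point() +
  geom_smooth(method = "lm")  # linear fit with confidence band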
cor.test(cig$Sales, cig$Age)
##
## Pearson's product-moment correlation
##
## data: cig$Sales and cig$Age
## t = 1.63, df = 49, p-value = 0.1094
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0520 0.4729
## sample estimates:
## cor
## 0.2268
We notice a weak positive correlation between Sales and Age (r = 0.23), though it is not statistically significant (p = 0.11).
Correlation between High school education and cigarette sales.
qplot(HS, Sales, data = cig, geom = c("point", "smooth"), method = "lm")
cor.test(cig$Sales, cig$HS)
##
## Pearson's product-moment correlation
##
## data: cig$Sales and cig$HS
## t = 0.4692, df = 49, p-value = 0.641
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2126 0.3363
## sample estimates:
## cor
## 0.06688
There is almost no correlation between high school education (HS) and Sales (r = 0.07, p = 0.64).
Correlation between Income and Sales.
qplot(Income, Sales, data = cig, geom = c("point", "smooth"), method = "lm")
cor.test(cig$Sales, cig$Income)
##
## Pearson's product-moment correlation
##
## data: cig$Sales and cig$Income
## t = 2.419, df = 49, p-value = 0.01932
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0561 0.5525
## sample estimates:
## cor
## 0.3266
There is a weak positive correlation between Income and Sales (r = 0.33), and it is statistically significant (p = 0.02).
Correlation between Black and Sales.
qplot(Black, Sales, data = cig, geom = c("point", "smooth"), method = "lm")
cor.test(cig$Sales, cig$Black)
##
## Pearson's product-moment correlation
##
## data: cig$Sales and cig$Black
## t = 1.276, df = 49, p-value = 0.2081
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1013 0.4334
## sample estimates:
## cor
## 0.1793
There is a weak, non-significant positive correlation between Black and Sales (r = 0.18, p = 0.21).
Correlation between Female and Sales.
qplot(Female, Sales, data = cig, geom = c("point", "smooth"), method = "lm")
cor.test(cig$Sales, cig$Female)
##
## Pearson's product-moment correlation
##
## data: cig$Sales and cig$Female
## t = 1.036, df = 49, p-value = 0.3052
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1346 0.4056
## sample estimates:
## cor
## 0.1464
There is a weak, non-significant positive correlation between Female and Sales (r = 0.15, p = 0.31).
Correlation between Price and Sales.
qplot(Price, Sales, data = cig, geom = c("point", "smooth"), method = "lm")
cor.test(cig$Sales, cig$Price)
##
## Pearson's product-moment correlation
##
## data: cig$Sales and cig$Price
## t = -2.208, df = 49, p-value = 0.03195
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.53228 -0.02752
## sample estimates:
## cor
## -0.3008
There is a weak negative correlation between Price and Sales (r = -0.30), and it is statistically significant (p = 0.03).
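Rather than testing each pair separately, we can also look at all pairwise correlations at once; a quick sketch (the Sales row should reproduce the coefficients from the individual tests above):
round(cor(cig[, -1]), 2)  # drop the State column, correlate the numeric variables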
We go ahead and build our regression model in spite of the weak correlations between the response variable and the predictor variables.
cig_reg <- lm(Sales ~ . -State, data = cig)
summary(cig_reg)
##
## Call:
## lm(formula = Sales ~ . - State, data = cig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.58 -12.28 -5.57 6.06 132.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.9793 246.2876 0.41 0.6868
## Age 4.3639 3.2149 1.36 0.1816
## HS -0.1123 0.7957 -0.14 0.8884
## Income 0.0194 0.0101 1.93 0.0606 .
## Black 0.3148 0.4752 0.66 0.5111
## Female -0.8511 5.5654 -0.15 0.8792
## Price -3.2901 1.0299 -3.19 0.0026 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.2 on 44 degrees of freedom
## Multiple R-squared: 0.319, Adjusted R-squared: 0.226
## F-statistic: 3.44 on 6 and 44 DF, p-value: 0.00717
par(mfrow = c(2,2))
plot(cig_reg)
The only significant predictor is Price (p-value = 0.0026). The residual standard error is 28.2, R-squared is 0.319, and adjusted R-squared is 0.226. The overall model p-value is 0.00717, so the model is statistically significant. However, we still need to check whether it satisfies the four regression assumptions, which we do next, one at a time.
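1. The errors should follow a normal distribution: the Normal Q-Q panel of plot(cig_reg) above is the visual check for this assumption. As a formal complement (a sketch, not part of the original output), a Shapiro-Wilk test can be run on the residuals; a p-value above 0.05 would support normality.
shapiro.test(residuals(cig_reg))  # formal test of normality of the residuals
2. The independent variables should not have multicollinearity: we compute variance inflation factors (VIF) with the car package; a common rule of thumb is that a VIF above 5 signals problematic multicollinearity.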
require(car)
vif(cig_reg)
## Age HS Income Black Female Price
## 2.296 2.544 2.271 2.297 2.412 1.130
None of the predictor variables has a VIF above 5, so there is no multicollinearity.
3. The error variances should be homogeneous, i.e. there should be no heteroscedasticity: looking at the residual plots above, we see no obvious pattern in the residuals, so there is no heteroscedasticity.
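As a formal complement to the visual check (again a sketch, not part of the original output), the Breusch-Pagan test from the lmtest package tests for heteroscedasticity; a p-value above 0.05 would indicate no evidence of non-constant error variance.
require(lmtest)
bptest(cig_reg)  # Breusch-Pagan test for heteroscedasticity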
4. The error terms should be independent, with no autocorrelation/serial correlation between errors: for this assumption we use the Durbin-Watson (D-W) test. The D-W statistic ranges from 0 to 4: a value close to 2 indicates no serial correlation, a value close to 0 indicates positive serial correlation, and a value close to 4 indicates negative serial correlation.
require(lmtest)
dwtest(cig_reg)
##
## Durbin-Watson test
##
## data: cig_reg
## DW = 1.656, p-value = 0.1124
## alternative hypothesis: true autocorrelation is greater than 0
The Durbin-Watson statistic for the model is 1.656, which is close to 2 (and the p-value of 0.1124 is above 0.05), so there is no evidence of serial correlation.
Hence, this model satisfies all the regression assumptions.