Blog 2: Multiple Linear Regression

Load the data

data("USArrests")

USArrests Database
	Murder	Assault	UrbanPop	Rape
Alabama	13.2	236	58	21.2
Alaska	10.0	263	48	44.5
Arizona	8.1	294	80	31.0
Arkansas	8.8	190	50	19.5
California	9.0	276	91	40.6
Colorado	7.9	204	78	38.7

Data Analysis and Vizualisation

str(USArrests)

## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

Let’s use DataExplorer package for some data analysis

plot_intro(USArrests, 
           ggtheme = theme_minimal(),
  title = "Missing Data",
  theme_config = theme(plot.title = element_text(color = "orange")),
  geom_label_args = c(hjust = "inward")
)

plot_histogram(USArrests)

plot_boxplot(USArrests,by = "Murder", ncol = 1L)

USArrests%>%
  hchart(type = "scatter", hcaes(x = UrbanPop, y = Murder))%>% hc_chart(type = "scatter", options3d = list(enabled = TRUE, alpha = 30, beta = 0))%>% 
  hc_title(text = "Scatter Chart Murder vs Urban Population")

USArrests%>%
  hchart(type = "scatter", hcaes(x = Assault, y = Murder))%>% hc_chart(type = "scatter", options3d = list(enabled = TRUE, alpha = 30, beta = 0)) %>% 
  hc_title(text = "Scatter Chart Murder vs Assault")

USArrests%>%
  hchart(type = "scatter", hcaes(x = Rape, y = Murder))%>% hc_chart(type = "scatter", options3d = list(enabled = TRUE, alpha = 30, beta = 0))%>% 
  hc_title(text = "Scatter Chart Murder vs Rape")

pca<-hchart(princomp(USArrests, cor = TRUE))
pca

hchart(cor(USArrests))

Splitting the data set

Split the data into train data set and test data set. We will take 80% of the data as the train data set and the rest will be used as the test data set.

RNGkind(sample.kind = "Rounding")

set.seed(100) 

index_ins <- sample(x = nrow(USArrests) , size = nrow(USArrests)*0.8) 
ins_train <- USArrests[index_ins,]
ins_test <- USArrests[-index_ins, ]
dftrain <- data.frame(ins_train)

dftest <- data.frame(ins_test)

Building Model

Model 1

model_all <- lm(formula = Murder ~ . , data = dftrain)
summary(model_all)

## 
## Call:
## lm(formula = Murder ~ ., data = dftrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9257 -1.6737 -0.4489  1.0895  7.5068 
## 
## Coefficients:
##              Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)  2.075901   1.870953   1.110       0.275    
## Assault      0.043879   0.006923   6.338 0.000000246 ***
## UrbanPop    -0.027691   0.031615  -0.876       0.387    
## Rape         0.008534   0.065006   0.131       0.896    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.572 on 36 degrees of freedom
## Multiple R-squared:  0.6872, Adjusted R-squared:  0.6611 
## F-statistic: 26.36 on 3 and 36 DF,  p-value: 0.000000003368

Based on this model we can see that Adjusted R-squared is 0.66. Adjusted R-squared identifies the percentage of variance in the target field that is explained by the input or inputs.

Model 2

Model2 <-lm(Murder~Assault+UrbanPop,data=USArrests)
summary(Model2)

## 
## Call:
## lm(formula = Murder ~ Assault + UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5530 -1.7093 -0.3677  1.2284  7.5985 
## 
## Coefficients:
##              Estimate Std. Error t value         Pr(>|t|)    
## (Intercept)  3.207153   1.740790   1.842           0.0717 .  
## Assault      0.043910   0.004579   9.590 0.00000000000122 ***
## UrbanPop    -0.044510   0.026363  -1.688           0.0980 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.58 on 47 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6491 
## F-statistic: 46.32 on 2 and 47 DF,  p-value: 0.000000000007704

The first model with complete variables has adjusted R-squared of 0.66, meaning that the model can explain 66% of variance in the target variable. Model2 has a little lower adjusted R-squared which is almost 65%.

Model 3

Model3 <-lm(Murder~Assault+Rape,data=USArrests)
summary(Model3)

## 
## Call:
## lm(formula = Murder ~ Assault + Rape, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8667 -1.7653 -0.3746  1.3030  7.8864 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept) 0.418863   0.976184   0.429        0.670    
## Assault     0.040029   0.006087   6.576 0.0000000359 ***
## Rape        0.025142   0.054157   0.464        0.645    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.651 on 47 degrees of freedom
## Multiple R-squared:  0.6446, Adjusted R-squared:  0.6295 
## F-statistic: 42.63 on 2 and 47 DF,  p-value: 0.0000000000276

The third model has adjusted R-squared of almost 0.63, meaning that the model can explain 63% of variance in the target variable.

For further actions I prefer to use Model 2, which has higher adjusted R-squared than Model 3.

Prediction & Model Performance

Model Performance

The prediction will originally run on the test data set. But we will predict based on the train data set as well to check whether or not our model is over-fitted.

Prediction of train data set

pred <- predict(Model2, newdata = ins_train)
ins_train$prediction <- pred

pred

##         Kansas       Illinois       Nebraska        Arizona       Michigan 
##       5.319106      10.446362       4.926319      12.555841      11.110416 
##       Virginia       Oklahoma        Wyoming      Minnesota    Connecticut 
##       7.252946       6.810844       7.606027       3.430979       4.609941 
##       Missouri           Ohio         Hawaii           Iowa         Nevada 
##       7.907391       5.138062       1.532642       3.129014      10.667112 
##    Mississippi   South Dakota          Idaho North Carolina        Vermont 
##      12.621370       4.980438       6.072782      16.001835       3.890496 
##       Kentucky  Massachusetts          Texas  New Hampshire   Pennsylvania 
##       5.678793       5.966346       8.472215       3.217434       4.656854 
##     California          Maine  West Virginia      Wisconsin       Colorado 
##      11.275846       4.581645       5.027951       2.596689       8.692966 
##        Georgia      Louisiana   North Dakota     New Jersey       New York 
##       9.801524      11.203040       3.224640       6.227403      10.532380 
##        Indiana     Washington        Florida           Utah         Alaska 
##       5.275797       6.324832      14.356149       4.915509      12.618968

Error on the train data set prediction

MAPE(y_pred = ins_train$prediction, y_true = ins_train$Murder)*100

## [1] 39.03189

Prediction of test data set

prediction <- predict(Model2, newdata = ins_test)
ins_test$prediction <- prediction

prediction

##        Alabama       Arkansas       Delaware       Maryland        Montana 
##      10.988294       9.324520      10.452967      13.397937       5.634283 
##     New Mexico         Oregon   Rhode Island South Carolina      Tennessee 
##      12.605756       7.206634       6.975073      13.321527       8.836106

Error on the train data set prediction

MAPE(y_pred = ins_test$prediction, y_true = ins_test$Murder)*100

## [1] 32.78952

Prediction Result

datatable(ins_test)

Model Assumption

Linearity

linearity <- data.frame(residual = Model2$residuals, fitted = Model2$fitted.values)

highchart() %>%
  hc_add_series(linearity, "scatter", hcaes(residual,fitted))#%>% hc_add_series(resact, "line", hcaes(x = residual, y = fitted))

Normality of Residual

Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with histogram or the Shapiro test.

histchart <- hchart(Model2$residuals, name = "VALUE")

histchart

Let’s use Shapiro-Wilk normality test to check for normality distribution

If the test is non-significant (p>. 05) it tells us that the distribution of the sample is not significantly different from a normal distribution. If, however, the test is significant (p < . 05) then the distribution in question is significantly different from a normal distribution.

shapiro.test(Model2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  Model2$residuals
## W = 0.96937, p-value = 0.2182

Based on the test result, is not significantly different from a normal distribution.

No Multicolinearity

Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.

We can confirm multicollinearity using VIF (Variance Inflation Factor): When the VIF value is more than 10, it means that there is multicollinearity.

vif(Model2)

##  Assault UrbanPop 
## 1.071828 1.071828

No Autocorrelation

Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other.

Let’s use Durbin - Watson Test to check for autocorrelation The DW statistic ranges from zero to four, with a value of 2.0 indicating zero autocorrelation. Values below 2.0 mean there is positive autocorrelation and above 2.0 indicates negative autocorrelation.

durbinWatsonTest(Model2)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.1064333      1.769422   0.404
##  Alternative hypothesis: rho != 0

Homoscedasticity of Residual

plot(x = Model2$fitted.values, y = Model2$residuals, col = "blue")
abline(h = 0, col = "red", lty = 6)

Conclusion

The Linear Regression Model generated has Adjusted R-Squared value of 0.6491, meaning that the model can explain 64.91% of variance in the target variable (Murder). Based on the previous evaluation, we can state that the model fulfills the assumptions, such as Linearity, Normality of Residual, and Homoscedasticity of Residual.