data("USArrests")
| Murder | Assault | UrbanPop | Rape | |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
| Colorado | 7.9 | 204 | 78 | 38.7 |
str(USArrests)
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
Let’s use DataExplorer package for some data analysis
plot_intro(USArrests,
ggtheme = theme_minimal(),
title = "Missing Data",
theme_config = theme(plot.title = element_text(color = "orange")),
geom_label_args = c(hjust = "inward")
)
plot_histogram(USArrests)
plot_boxplot(USArrests,by = "Murder", ncol = 1L)
USArrests%>%
hchart(type = "scatter", hcaes(x = UrbanPop, y = Murder))%>% hc_chart(type = "scatter", options3d = list(enabled = TRUE, alpha = 30, beta = 0))%>%
hc_title(text = "Scatter Chart Murder vs Urban Population")
USArrests%>%
hchart(type = "scatter", hcaes(x = Assault, y = Murder))%>% hc_chart(type = "scatter", options3d = list(enabled = TRUE, alpha = 30, beta = 0)) %>%
hc_title(text = "Scatter Chart Murder vs Assault")
USArrests%>%
hchart(type = "scatter", hcaes(x = Rape, y = Murder))%>% hc_chart(type = "scatter", options3d = list(enabled = TRUE, alpha = 30, beta = 0))%>%
hc_title(text = "Scatter Chart Murder vs Rape")
pca<-hchart(princomp(USArrests, cor = TRUE))
pca
hchart(cor(USArrests))
Split the data into train data set and test data set. We will take 80% of the data as the train data set and the rest will be used as the test data set.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index_ins <- sample(x = nrow(USArrests) , size = nrow(USArrests)*0.8)
ins_train <- USArrests[index_ins,]
ins_test <- USArrests[-index_ins, ]
dftrain <- data.frame(ins_train)
dftest <- data.frame(ins_test)
model_all <- lm(formula = Murder ~ . , data = dftrain)
summary(model_all)
##
## Call:
## lm(formula = Murder ~ ., data = dftrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9257 -1.6737 -0.4489 1.0895 7.5068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.075901 1.870953 1.110 0.275
## Assault 0.043879 0.006923 6.338 0.000000246 ***
## UrbanPop -0.027691 0.031615 -0.876 0.387
## Rape 0.008534 0.065006 0.131 0.896
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.572 on 36 degrees of freedom
## Multiple R-squared: 0.6872, Adjusted R-squared: 0.6611
## F-statistic: 26.36 on 3 and 36 DF, p-value: 0.000000003368
Based on this model we can see that Adjusted R-squared is 0.66. Adjusted R-squared identifies the percentage of variance in the target field that is explained by the input or inputs.
Model2 <-lm(Murder~Assault+UrbanPop,data=USArrests)
summary(Model2)
##
## Call:
## lm(formula = Murder ~ Assault + UrbanPop, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5530 -1.7093 -0.3677 1.2284 7.5985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.207153 1.740790 1.842 0.0717 .
## Assault 0.043910 0.004579 9.590 0.00000000000122 ***
## UrbanPop -0.044510 0.026363 -1.688 0.0980 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.58 on 47 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6491
## F-statistic: 46.32 on 2 and 47 DF, p-value: 0.000000000007704
The first model with complete variables has adjusted R-squared of 0.66, meaning that the model can explain 66% of variance in the target variable. Model2 has a little lower adjusted R-squared which is almost 65%.
Model3 <-lm(Murder~Assault+Rape,data=USArrests)
summary(Model3)
##
## Call:
## lm(formula = Murder ~ Assault + Rape, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8667 -1.7653 -0.3746 1.3030 7.8864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.418863 0.976184 0.429 0.670
## Assault 0.040029 0.006087 6.576 0.0000000359 ***
## Rape 0.025142 0.054157 0.464 0.645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.651 on 47 degrees of freedom
## Multiple R-squared: 0.6446, Adjusted R-squared: 0.6295
## F-statistic: 42.63 on 2 and 47 DF, p-value: 0.0000000000276
The third model has adjusted R-squared of almost 0.63, meaning that the model can explain 63% of variance in the target variable.
For further actions I prefer to use Model 2, which has higher adjusted R-squared than Model 3.
Model Performance
The prediction will originally run on the test data set. But we will predict based on the train data set as well to check whether or not our model is over-fitted.
Prediction of train data set
pred <- predict(Model2, newdata = ins_train)
ins_train$prediction <- pred
pred
## Kansas Illinois Nebraska Arizona Michigan
## 5.319106 10.446362 4.926319 12.555841 11.110416
## Virginia Oklahoma Wyoming Minnesota Connecticut
## 7.252946 6.810844 7.606027 3.430979 4.609941
## Missouri Ohio Hawaii Iowa Nevada
## 7.907391 5.138062 1.532642 3.129014 10.667112
## Mississippi South Dakota Idaho North Carolina Vermont
## 12.621370 4.980438 6.072782 16.001835 3.890496
## Kentucky Massachusetts Texas New Hampshire Pennsylvania
## 5.678793 5.966346 8.472215 3.217434 4.656854
## California Maine West Virginia Wisconsin Colorado
## 11.275846 4.581645 5.027951 2.596689 8.692966
## Georgia Louisiana North Dakota New Jersey New York
## 9.801524 11.203040 3.224640 6.227403 10.532380
## Indiana Washington Florida Utah Alaska
## 5.275797 6.324832 14.356149 4.915509 12.618968
Error on the train data set prediction
MAPE(y_pred = ins_train$prediction, y_true = ins_train$Murder)*100
## [1] 39.03189
Prediction of test data set
prediction <- predict(Model2, newdata = ins_test)
ins_test$prediction <- prediction
prediction
## Alabama Arkansas Delaware Maryland Montana
## 10.988294 9.324520 10.452967 13.397937 5.634283
## New Mexico Oregon Rhode Island South Carolina Tennessee
## 12.605756 7.206634 6.975073 13.321527 8.836106
Error on the train data set prediction
MAPE(y_pred = ins_test$prediction, y_true = ins_test$Murder)*100
## [1] 32.78952
datatable(ins_test)
linearity <- data.frame(residual = Model2$residuals, fitted = Model2$fitted.values)
highchart() %>%
hc_add_series(linearity, "scatter", hcaes(residual,fitted))#%>% hc_add_series(resact, "line", hcaes(x = residual, y = fitted))
Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with histogram or the Shapiro test.
histchart <- hchart(Model2$residuals, name = "VALUE")
histchart
Let’s use Shapiro-Wilk normality test to check for normality distribution
If the test is non-significant (p>. 05) it tells us that the distribution of the sample is not significantly different from a normal distribution. If, however, the test is significant (p < . 05) then the distribution in question is significantly different from a normal distribution.
shapiro.test(Model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: Model2$residuals
## W = 0.96937, p-value = 0.2182
Based on the test result, is not significantly different from a normal distribution.
Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.
We can confirm multicollinearity using VIF (Variance Inflation Factor): When the VIF value is more than 10, it means that there is multicollinearity.
vif(Model2)
## Assault UrbanPop
## 1.071828 1.071828
Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other.
Let’s use Durbin - Watson Test to check for autocorrelation The DW statistic ranges from zero to four, with a value of 2.0 indicating zero autocorrelation. Values below 2.0 mean there is positive autocorrelation and above 2.0 indicates negative autocorrelation.
durbinWatsonTest(Model2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1064333 1.769422 0.404
## Alternative hypothesis: rho != 0
plot(x = Model2$fitted.values, y = Model2$residuals, col = "blue")
abline(h = 0, col = "red", lty = 6)
The Linear Regression Model generated has Adjusted R-Squared value of 0.6491, meaning that the model can explain 64.91% of variance in the target variable (Murder). Based on the previous evaluation, we can state that the model fulfills the assumptions, such as Linearity, Normality of Residual, and Homoscedasticity of Residual.