This project was created to fulfill the Learning by Building (LBB) assignment on the Regression Model material in the Machine Learning Specialization Course. The dataset used in this project is the Crime dataset.
## Rows: 47
## Columns: 17
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
## $ M <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134, 128, …
## $ So <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,…
## $ Ed <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 113, 117…
## $ Po1 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 62, 57,…
## $ Po2 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61, 53, …
## $ LF <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595, 624, …
## $ M.F <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 972, 972…
## $ Pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30, 33, 1…
## $ NW <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, 46, 72…
## $ U1 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, 92, 11…
## $ U2 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43, 47, 3…
## $ GDP <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580, 507, …
## $ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172, 206, …
## $ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201, 0.042…
## $ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.6993, 24…
## $ y <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849, …
First of all, we need to rename the columns and drop the X column that used for index only
crime <- crime %>%
select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
glimpse(crime)## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
Here is the description for each columns:
percent_m: percentage of males aged 14-24is_south: whether it is in a Southern state. 1 for Yes,
0 for No.mean_education: mean years of schoolingpolice_exp60: police expenditure in 1960police_exp59: police expenditure in 1959labour_participation: labour force participation
ratem_per1000f: number of males per 1000 femalesstate_pop: state populationnonwhites_per1000: number of non-whites resident per
1000 peopleunemploy_m24: unemployment rate of urban males aged
14-24unemploy_m39: unemployment rate of urban males aged
35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: avg time served in prisonscrime_rate: crime rate in an unspecified categoryAfter that we need to check whether there is missing value or not.
## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
There is no missing values, and continue to check the correlation
between each columns or variables using ggcorr.
Based on the correlation plot, we can see that not all variables have
correlation with crime_rate variable. The variables that have the
highest correlation with crime_rate are police_exp59 and
police_exp60.
In this project, I will split the 70% of data as the data train and 30% as the data test.
I will build two models first using all remaining predictor variables and second using the high correlated predictors.
##
## Call:
## lm(formula = crime_rate ~ ., data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -304.71 -91.82 -31.14 101.84 426.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.033e+04 2.233e+03 -4.628 0.00024 ***
## percent_m 9.668e+00 4.475e+00 2.160 0.04530 *
## is_south 8.101e+00 2.233e+02 0.036 0.97148
## mean_education 2.630e+01 7.698e+00 3.416 0.00329 **
## police_exp60 2.009e+01 1.186e+01 1.694 0.10858
## police_exp59 -1.486e+01 1.354e+01 -1.098 0.28771
## labour_participation -4.659e+00 2.172e+00 -2.145 0.04669 *
## m_per1000f 8.266e+00 2.861e+00 2.889 0.01019 *
## state_pop 1.370e+00 1.792e+00 0.764 0.45521
## nonwhites_per1000 1.333e+00 9.991e-01 1.335 0.19960
## unemploy_m24 -1.198e+01 5.132e+00 -2.334 0.03212 *
## unemploy_m39 1.424e+01 9.869e+00 1.443 0.16724
## gdp 6.436e-01 1.262e+00 0.510 0.61671
## inequality 4.247e+00 2.855e+00 1.488 0.15518
## prob_prison 9.401e+02 3.823e+03 0.246 0.80869
## time_prison 1.478e+01 9.430e+00 1.567 0.13558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 199.9 on 17 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.7766
## F-statistic: 8.414 on 15 and 17 DF, p-value: 3.846e-05
Based on the above summary, we can identify predictor variables with
Pr(>|t|) values greater than 0.05, which include
is_south, police_exp60,
police_exp59, state_pop,
nonwhites_per1000, unemploy_m39,
gdp, inequality, prob_prison, and
time_prison. Pr(>|t|) values greater than 0.05 indicate
that these predictors are statistically insignificant for predicting the
target variable crime_rate.
Consequently, we can simplify the model by removing these predictors with such Pr(>|t|) values and recompute the model to potentially improve the Adjusted R-squared of 0.7766.
The second one is using high correlated predictors.
##
## Call:
## lm(formula = crime_rate ~ police_exp60 + police_exp59, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -624.25 -170.68 48.68 153.57 477.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.02 145.47 0.688 0.497
## police_exp60 21.35 14.37 1.486 0.148
## police_exp59 -12.69 15.32 -0.829 0.414
##
## Residual standard error: 294 on 30 degrees of freedom
## Multiple R-squared: 0.5466, Adjusted R-squared: 0.5164
## F-statistic: 18.08 on 2 and 30 DF, p-value: 7.03e-06
The R-squared of crime_lm approximates 0.5164, which
ideally needs to be improved upon by, for example, adding more predictor
variables. One method for choosing predictor variables that we’ll employ
is the stepwise regression algorithm.
model_none <- lm(crime_rate ~ 1, data = crime_train)
model_forward <- step(object = model_none, #model tanpa prediktor
direction = "forward",
scope = list(upper = model_all), #model dengan semua prediktor
trace = F)
summary(model_forward)##
## Call:
## lm(formula = crime_rate ~ police_exp60 + percent_m + m_per1000f +
## is_south + mean_education + time_prison + inequality, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -363.98 -88.84 11.32 78.90 531.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7901.441 1534.858 -5.148 2.54e-05 ***
## police_exp60 11.153 1.514 7.366 1.02e-07 ***
## percent_m 8.039 3.735 2.153 0.04120 *
## m_per1000f 3.982 1.529 2.605 0.01526 *
## is_south 344.961 137.104 2.516 0.01866 *
## mean_education 16.534 5.409 3.057 0.00527 **
## time_prison 14.378 5.668 2.537 0.01781 *
## inequality 3.039 1.957 1.553 0.13293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.6 on 25 degrees of freedom
## Multiple R-squared: 0.8241, Adjusted R-squared: 0.7749
## F-statistic: 16.74 on 7 and 25 DF, p-value: 5.419e-08
model_backward <- step(object = model_all,
direction = "backward",
trace = F)
summary(model_backward)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## labour_participation + m_per1000f + nonwhites_per1000 + unemploy_m24 +
## unemploy_m39 + inequality + time_prison, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -283.69 -80.50 -16.92 82.84 429.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9643.5294 1570.9487 -6.139 3.53e-06 ***
## percent_m 8.3543 3.9050 2.139 0.043752 *
## mean_education 25.3573 6.1671 4.112 0.000459 ***
## police_exp60 8.3365 1.7646 4.724 0.000103 ***
## labour_participation -3.5144 1.5221 -2.309 0.030727 *
## m_per1000f 7.1285 2.0723 3.440 0.002337 **
## nonwhites_per1000 1.1725 0.6474 1.811 0.083817 .
## unemploy_m24 -10.2895 3.9585 -2.599 0.016368 *
## unemploy_m39 16.0240 8.4283 1.901 0.070453 .
## inequality 4.6137 1.7115 2.696 0.013205 *
## time_prison 15.7576 5.6929 2.768 0.011222 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.1 on 22 degrees of freedom
## Multiple R-squared: 0.8682, Adjusted R-squared: 0.8082
## F-statistic: 14.49 on 10 and 22 DF, p-value: 1.705e-07
model_both <- step(object = model_none,
direction = "both",
scope = list(upper = model_all),
trace = F)
summary(model_both)##
## Call:
## lm(formula = crime_rate ~ police_exp60 + percent_m + m_per1000f +
## is_south + mean_education + time_prison + inequality, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -363.98 -88.84 11.32 78.90 531.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7901.441 1534.858 -5.148 2.54e-05 ***
## police_exp60 11.153 1.514 7.366 1.02e-07 ***
## percent_m 8.039 3.735 2.153 0.04120 *
## m_per1000f 3.982 1.529 2.605 0.01526 *
## is_south 344.961 137.104 2.516 0.01866 *
## mean_education 16.534 5.409 3.057 0.00527 **
## time_prison 14.378 5.668 2.537 0.01781 *
## inequality 3.039 1.957 1.553 0.13293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.6 on 25 degrees of freedom
## Multiple R-squared: 0.8241, Adjusted R-squared: 0.7749
## F-statistic: 16.74 on 7 and 25 DF, p-value: 5.419e-08
After doing step-wise regression method, I compare 3 new models and
the highest adjusted R-Squared is 0.8082 from
model_backward using backward elimination.
The Mean Absolute Percentage Error (MAPE) measures the percentage
difference between the predicted and actual values. In our analysis, we
will compare the MAPE between the first model and the second model to
assess their respective accuracy in predicting the target variable. We
will compare the MAPE between model and
model_backward.
## [1] 0.2135468
## [1] 0.3699559
For the first model MAPE means that by average the difference with the actual number is 21.35% and for the second model the difference with the actual number is 36.99%. We can conclude that the first model is better because it is lower than the first one.
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.96664, p-value = 0.3934
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.97453, p-value = 0.6143
Based on the Shapiro Test, we can conclude that the error is distributed normally for the second model because the P-value is higher than 0.05 which is 0.6143.
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 13.691, df = 2, p-value = 0.001064
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 9.5527, df = 10, p-value = 0.4806
Based on the Breusch-Pagan, we can conclude that there is heteroscesdasticity on the first model because the P-value is lower than 0.05 which is 0.001064. For the second model we conclude that there is homocesdasticity because the P-value is above 0.05, it means the error is constant.
## percent_m mean_education police_exp60
## 2.008588 4.492356 3.075057
## labour_participation m_per1000f nonwhites_per1000
## 3.797952 3.449977 2.662333
## unemploy_m24 unemploy_m39 inequality
## 4.145015 4.330561 3.842859
## time_prison
## 1.739850
Based on the Variance Inflation Factor Test above, we conclude that
multicollinearity is not present in our model_backward
because the VIF values for all variables are below 10.
As a conclusion, the second model model_backward
utilizing the stepwise-regression method outperforms the first model
with only one variable having the highest positive correlation with the
‘crime_rate’ variable. The reasons for the superiority of model_step are
as follows: - Only model_backword exhibit normal error distribution
according to the Shapiro Test. - The Variance Inflation Factor (VIF)
Test indicates that multicollinearity is absent in the
model_backward, as all VIF values for the variables are
below 10.
In addition, the MAPE value of the backword model is worse than the model, which is 36.99%, which means a difference of about 15%.