These are the basic packages that I will use throughout the process:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
crime <- read.csv("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/Regression Model/RM/data_input/crime.csv")
glimpse(crime)## Observations: 47
## Variables: 17
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,...
## $ M <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134, 12...
## $ So <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,...
## $ Ed <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 113, ...
## $ Po1 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 62, ...
## $ Po2 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61, 5...
## $ LF <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595, 62...
## $ M.F <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 972, ...
## $ Pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30, 33...
## $ NW <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, 46,...
## $ U1 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, 92,...
## $ U2 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43, 47...
## $ GDP <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580, 50...
## $ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172, 20...
## $ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201, 0....
## $ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.6993,...
## $ y <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 84...
The dataset was collected in 1960. Since the full description of the dataset wasn’t available, the descriptions for columns were retrieved from the authors of the MASS package. The variables are:
X1: Index M: percentage of males aged 14-24 So: whether it is in a Southern state. 1 for Yes, 0 for No. Ed: mean years of schooling Po1: police expenditure in 1960 Po2: police expenditure in 1959 LF: labour force participation rate M.F: number of males per 1000 females Pop: state population NW: number of non-whites resident per 1000 people U1: unemployment rate of urban males aged 14-24 U2: unemployment rate of urban males aged 35-39 GDP: gross domestic product per head Ineq: income inequality Prob: probability of imprisonment Time: average time served in prisons y: crime rate in an unspecified category
We are going to remove the first column because it only contain index data which will not be needed in our analysis. Additionally, we will also rename the columns for easier analysis and transform is.South column into a factor data type.
# your code here
crime <- crime %>%
select(-X)
colnames(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
# transform & making new column
crime <- crime %>%
mutate(is_south = as.factor(is_south))
crime## percent_m is_south mean_education police_exp60 police_exp59
## 1 151 1 91 58 56
## 2 143 0 113 103 95
## 3 142 1 89 45 44
## 4 136 0 121 149 141
## 5 141 0 121 109 101
## 6 121 0 110 118 115
## 7 127 1 111 82 79
## 8 131 1 109 115 109
## 9 157 1 90 65 62
## 10 140 0 118 71 68
## 11 124 0 105 121 116
## 12 134 0 108 75 71
## 13 128 0 113 67 60
## 14 135 0 117 62 61
## 15 152 1 87 57 53
## 16 142 1 88 81 77
## 17 143 0 110 66 63
## 18 135 1 104 123 115
## 19 130 0 116 128 128
## 20 125 0 108 113 105
## 21 126 0 108 74 67
## 22 157 1 89 47 44
## 23 132 0 96 87 83
## 24 131 0 116 78 73
## 25 130 0 116 63 57
## 26 131 0 121 160 143
## 27 135 0 109 69 71
## 28 152 0 112 82 76
## 29 119 0 107 166 157
## 30 166 1 89 58 54
## 31 140 0 93 55 54
## 32 125 0 109 90 81
## 33 147 1 104 63 64
## 34 126 0 118 97 97
## 35 123 0 102 97 87
## 36 150 0 100 109 98
## 37 177 1 87 58 56
## 38 133 0 104 51 47
## 39 149 1 88 61 54
## 40 145 1 104 82 74
## 41 148 0 122 72 66
## 42 141 0 109 56 54
## 43 162 1 99 75 70
## 44 136 0 121 95 96
## 45 139 1 88 46 41
## 46 126 0 104 106 97
## 47 130 0 121 90 91
## labour_participation m_per1000f state_pop nonwhites_per1000 unemploy_m24
## 1 510 950 33 301 108
## 2 583 1012 13 102 96
## 3 533 969 18 219 94
## 4 577 994 157 80 102
## 5 591 985 18 30 91
## 6 547 964 25 44 84
## 7 519 982 4 139 97
## 8 542 969 50 179 79
## 9 553 955 39 286 81
## 10 632 1029 7 15 100
## 11 580 966 101 106 77
## 12 595 972 47 59 83
## 13 624 972 28 10 77
## 14 595 986 22 46 77
## 15 530 986 30 72 92
## 16 497 956 33 321 116
## 17 537 977 10 6 114
## 18 537 978 31 170 89
## 19 536 934 51 24 78
## 20 567 985 78 94 130
## 21 602 984 34 12 102
## 22 512 962 22 423 97
## 23 564 953 43 92 83
## 24 574 1038 7 36 142
## 25 641 984 14 26 70
## 26 631 1071 3 77 102
## 27 540 965 6 4 80
## 28 571 1018 10 79 103
## 29 521 938 168 89 92
## 30 521 973 46 254 72
## 31 535 1045 6 20 135
## 32 586 964 97 82 105
## 33 560 972 23 95 76
## 34 542 990 18 21 102
## 35 526 948 113 76 124
## 36 531 964 9 24 87
## 37 638 974 24 349 76
## 38 599 1024 7 40 99
## 39 515 953 36 165 86
## 40 560 981 96 126 88
## 41 601 998 9 19 84
## 42 523 968 4 2 107
## 43 522 996 40 208 73
## 44 574 1012 29 36 111
## 45 480 968 19 49 135
## 46 599 989 40 24 78
## 47 623 1049 3 22 113
## unemploy_m39 gdp inequality prob_prison time_prison crime_rate
## 1 41 394 261 0.084602 26.2011 791
## 2 36 557 194 0.029599 25.2999 1635
## 3 33 318 250 0.083401 24.3006 578
## 4 39 673 167 0.015801 29.9012 1969
## 5 20 578 174 0.041399 21.2998 1234
## 6 29 689 126 0.034201 20.9995 682
## 7 38 620 168 0.042100 20.6993 963
## 8 35 472 206 0.040099 24.5988 1555
## 9 28 421 239 0.071697 29.4001 856
## 10 24 526 174 0.044498 19.5994 705
## 11 35 657 170 0.016201 41.6000 1674
## 12 31 580 172 0.031201 34.2984 849
## 13 25 507 206 0.045302 36.2993 511
## 14 27 529 190 0.053200 21.5010 664
## 15 43 405 264 0.069100 22.7008 798
## 16 47 427 247 0.052099 26.0991 946
## 17 35 487 166 0.076299 19.1002 539
## 18 34 631 165 0.119804 18.1996 929
## 19 34 627 135 0.019099 24.9008 750
## 20 58 626 166 0.034801 26.4010 1225
## 21 33 557 195 0.022800 37.5998 742
## 22 34 288 276 0.089502 37.0994 439
## 23 32 513 227 0.030700 25.1989 1216
## 24 42 540 176 0.041598 17.6000 968
## 25 21 486 196 0.069197 21.9003 523
## 26 41 674 152 0.041698 22.1005 1993
## 27 22 564 139 0.036099 28.4999 342
## 28 28 537 215 0.038201 25.8006 1216
## 29 36 637 154 0.023400 36.7009 1043
## 30 26 396 237 0.075298 28.3011 696
## 31 40 453 200 0.041999 21.7998 373
## 32 43 617 163 0.042698 30.9014 754
## 33 24 462 233 0.049499 25.5005 1072
## 34 35 589 166 0.040799 21.6997 923
## 35 50 572 158 0.020700 37.4011 653
## 36 38 559 153 0.006900 44.0004 1272
## 37 28 382 254 0.045198 31.6995 831
## 38 27 425 225 0.053998 16.6999 566
## 39 35 395 251 0.047099 27.3004 826
## 40 31 488 228 0.038801 29.3004 1151
## 41 20 590 144 0.025100 30.0001 880
## 42 37 489 170 0.088904 12.1996 542
## 43 27 496 224 0.054902 31.9989 823
## 44 37 622 162 0.028100 30.0001 1030
## 45 53 457 249 0.056202 32.5996 455
## 46 25 593 171 0.046598 16.6999 508
## 47 40 588 160 0.052802 16.0997 849
## [1] FALSE
We would like to see how social-economic conditions could reflect on the crime rate in a state. Now we are going to inspect the correlation for each variable.
## Warning in ggcorr(data = crime, label = TRUE, label_size = 2, hjust = 0.8): data
## in column(s) 'is_south' are not numeric and were ignored
The correlation matrix above reveals the correlation between all of the numeric variables of the crime dataset (is_south was excluded because of its categorical value). From this matrix, we can see that police_exp60 and police_exp59 have high correlation with one another. This condition is not ideal for a linear regression because one of the assumption when we are building a linear regression model is non-multicollinearity, ie. there is no predictor variable that correlate strongly with one another. Further discussion about multicollinearity is in the last section of this article. Based on this result, I will combine those two variables into one pol_exp column that holds police expenditures data from the last 2 years.
crimex <- crime %>%
mutate(pol_exp = police_exp60+police_exp59) %>%
select(-c(police_exp60,police_exp59))
crimex## percent_m is_south mean_education labour_participation m_per1000f state_pop
## 1 151 1 91 510 950 33
## 2 143 0 113 583 1012 13
## 3 142 1 89 533 969 18
## 4 136 0 121 577 994 157
## 5 141 0 121 591 985 18
## 6 121 0 110 547 964 25
## 7 127 1 111 519 982 4
## 8 131 1 109 542 969 50
## 9 157 1 90 553 955 39
## 10 140 0 118 632 1029 7
## 11 124 0 105 580 966 101
## 12 134 0 108 595 972 47
## 13 128 0 113 624 972 28
## 14 135 0 117 595 986 22
## 15 152 1 87 530 986 30
## 16 142 1 88 497 956 33
## 17 143 0 110 537 977 10
## 18 135 1 104 537 978 31
## 19 130 0 116 536 934 51
## 20 125 0 108 567 985 78
## 21 126 0 108 602 984 34
## 22 157 1 89 512 962 22
## 23 132 0 96 564 953 43
## 24 131 0 116 574 1038 7
## 25 130 0 116 641 984 14
## 26 131 0 121 631 1071 3
## 27 135 0 109 540 965 6
## 28 152 0 112 571 1018 10
## 29 119 0 107 521 938 168
## 30 166 1 89 521 973 46
## 31 140 0 93 535 1045 6
## 32 125 0 109 586 964 97
## 33 147 1 104 560 972 23
## 34 126 0 118 542 990 18
## 35 123 0 102 526 948 113
## 36 150 0 100 531 964 9
## 37 177 1 87 638 974 24
## 38 133 0 104 599 1024 7
## 39 149 1 88 515 953 36
## 40 145 1 104 560 981 96
## 41 148 0 122 601 998 9
## 42 141 0 109 523 968 4
## 43 162 1 99 522 996 40
## 44 136 0 121 574 1012 29
## 45 139 1 88 480 968 19
## 46 126 0 104 599 989 40
## 47 130 0 121 623 1049 3
## nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality prob_prison
## 1 301 108 41 394 261 0.084602
## 2 102 96 36 557 194 0.029599
## 3 219 94 33 318 250 0.083401
## 4 80 102 39 673 167 0.015801
## 5 30 91 20 578 174 0.041399
## 6 44 84 29 689 126 0.034201
## 7 139 97 38 620 168 0.042100
## 8 179 79 35 472 206 0.040099
## 9 286 81 28 421 239 0.071697
## 10 15 100 24 526 174 0.044498
## 11 106 77 35 657 170 0.016201
## 12 59 83 31 580 172 0.031201
## 13 10 77 25 507 206 0.045302
## 14 46 77 27 529 190 0.053200
## 15 72 92 43 405 264 0.069100
## 16 321 116 47 427 247 0.052099
## 17 6 114 35 487 166 0.076299
## 18 170 89 34 631 165 0.119804
## 19 24 78 34 627 135 0.019099
## 20 94 130 58 626 166 0.034801
## 21 12 102 33 557 195 0.022800
## 22 423 97 34 288 276 0.089502
## 23 92 83 32 513 227 0.030700
## 24 36 142 42 540 176 0.041598
## 25 26 70 21 486 196 0.069197
## 26 77 102 41 674 152 0.041698
## 27 4 80 22 564 139 0.036099
## 28 79 103 28 537 215 0.038201
## 29 89 92 36 637 154 0.023400
## 30 254 72 26 396 237 0.075298
## 31 20 135 40 453 200 0.041999
## 32 82 105 43 617 163 0.042698
## 33 95 76 24 462 233 0.049499
## 34 21 102 35 589 166 0.040799
## 35 76 124 50 572 158 0.020700
## 36 24 87 38 559 153 0.006900
## 37 349 76 28 382 254 0.045198
## 38 40 99 27 425 225 0.053998
## 39 165 86 35 395 251 0.047099
## 40 126 88 31 488 228 0.038801
## 41 19 84 20 590 144 0.025100
## 42 2 107 37 489 170 0.088904
## 43 208 73 27 496 224 0.054902
## 44 36 111 37 622 162 0.028100
## 45 49 135 53 457 249 0.056202
## 46 24 78 25 593 171 0.046598
## 47 22 113 40 588 160 0.052802
## time_prison crime_rate pol_exp
## 1 26.2011 791 114
## 2 25.2999 1635 198
## 3 24.3006 578 89
## 4 29.9012 1969 290
## 5 21.2998 1234 210
## 6 20.9995 682 233
## 7 20.6993 963 161
## 8 24.5988 1555 224
## 9 29.4001 856 127
## 10 19.5994 705 139
## 11 41.6000 1674 237
## 12 34.2984 849 146
## 13 36.2993 511 127
## 14 21.5010 664 123
## 15 22.7008 798 110
## 16 26.0991 946 158
## 17 19.1002 539 129
## 18 18.1996 929 238
## 19 24.9008 750 256
## 20 26.4010 1225 218
## 21 37.5998 742 141
## 22 37.0994 439 91
## 23 25.1989 1216 170
## 24 17.6000 968 151
## 25 21.9003 523 120
## 26 22.1005 1993 303
## 27 28.4999 342 140
## 28 25.8006 1216 158
## 29 36.7009 1043 323
## 30 28.3011 696 112
## 31 21.7998 373 109
## 32 30.9014 754 171
## 33 25.5005 1072 127
## 34 21.6997 923 194
## 35 37.4011 653 184
## 36 44.0004 1272 207
## 37 31.6995 831 114
## 38 16.6999 566 98
## 39 27.3004 826 115
## 40 29.3004 1151 156
## 41 30.0001 880 138
## 42 12.1996 542 110
## 43 31.9989 823 145
## 44 30.0001 1030 191
## 45 32.5996 455 87
## 46 16.6999 508 203
## 47 16.0997 849 181
# updated correlation matrix
ggcorr(crimex, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)## Warning in ggcorr(crimex, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'is_south' are not numeric and were ignored
From the updated correlation matrix, it can be seen that the police expenditures have the highest correlation with the crime rate. Meanwhile, the number of non-white residents has no correlation with the crime rate. From the analysis above, I decided to not using nonwhites_per1000 variable as a predictor as it has no correlation for our target variable. The use of nonwhites_per1000 variable might affect the calculation of our model variable significance and promoting incorrect significancies). ___ # Model Building ## Simple Linear Regression
In building simple linear regression, it is important to have predictor variable that has high linear relationship with the target variables. Based on the correlation matrix, I will be using pol_exp as my predictor variable. Prior to model building, I will inspect if there is any outlier from our data.
Based on the boxplot, there is one outlier in our pol_exp data. I will remove this outlier first before building our model because based on the scatterplot, it looks like it has a high influence on our model.
# model building
slm.o <- lm(crime_rate ~ pol_exp, crimex)
slm <- lm(crime_rate ~ pol_exp, crime.clean) # without outlier
# plotting the linear regression model
plot(crimex$pol_exp, crimex$crime_rate)
abline(slm.o$coefficients[1],slm.o$coefficients[2], col = "red")
abline(slm$coefficients[1],slm$coefficients[2], col = "green") # without outlier##
## Call:
## lm(formula = crime_rate ~ pol_exp, data = crimex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -581.07 -151.94 20.33 147.24 580.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 152.0674 128.5331 1.183 0.243
## pol_exp 4.5573 0.7354 6.197 0.000000159 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 287.2 on 45 degrees of freedom
## Multiple R-squared: 0.4605, Adjusted R-squared: 0.4485
## F-statistic: 38.4 on 1 and 45 DF, p-value: 0.0000001591
##
## Call:
## lm(formula = crime_rate ~ pol_exp, data = crime.clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -650.8 -153.4 28.0 163.1 541.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.4783 130.6502 0.348 0.729
## pol_exp 5.2941 0.7679 6.894 0.0000000164 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.8 on 44 degrees of freedom
## Multiple R-squared: 0.5193, Adjusted R-squared: 0.5083
## F-statistic: 47.53 on 1 and 44 DF, p-value: 0.00000001636
From both of the regression model, pol_exp has positive corelation with crime.rate and is significantly affecting crime_rate (3 stars/p-value < 0.05).
If we look at the abline of the plot, we can see that the outliers of our data have a high influence over our model. Therefore, it is safer to remove the outliers from the data. The removal of outliers increase the R-squared of the model ~0.6 points. Based on this analysis, I will use slm as our simple linear model. Additionally, It also has higher R-squared as one of the tools for evaluating a linear regression model.
R-squared value tells us how well our model describes our data. It measures the extent to which the variance in our target variable (crime_rate) can be explained by the predictor variables (pol_exp, etc.). As we increase the number of predictor variables, our model’s R-squared value will also increase as it incorporates new legitimate information as well as the noise introduced by these extra variables. Because of this, it is better to evaluate the adjusted R-squared of our model. Adjusted R-squared does not increase the way regular multiple R-squared does because it is adjusted for the number of predictor variables in our model. It increases only when the new variable actually leads to a better prediction.
slm model can be written as:
crime_rate = 45.4783 + 5.2941(pol_exp)
which also means:
every 1 unit increase of police expenditure in the last 2 years, there will be an increase in crime rate at about 5.2941 times of the police expenditure.
Even so, our simple linear model only have an adjusted R-squared of 0.5083. An improvement can be made by adding more data/adding more variables to our model that may explain our target. On the constraint of only having this dataset, I will try to add more variables and use multiple linear regression for this case.
On building multiple linear regression, I will use automatic feature selection using step-wise regression with backward elimination. I will use crime_rate as our target variable and all the remaining variables as our predictor, except the nonwhites_per1000 variable which has zero correlation with crime_rate.
mlm <- lm(crime_rate ~ percent_m + is_south + mean_education + labour_participation +
m_per1000f + state_pop + unemploy_m24 +
unemploy_m39 + gdp + inequality + prob_prison + time_prison +
pol_exp, crime.clean)
step(mlm, direction = "backward")## Start: AIC=499.41
## crime_rate ~ percent_m + is_south + mean_education + labour_participation +
## m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp +
## inequality + prob_prison + time_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## - state_pop 1 22 1298610 497.42
## - is_south 1 113 1298701 497.42
## - time_prison 1 3277 1301864 497.53
## - labour_participation 1 3529 1302116 497.54
## - gdp 1 8271 1306859 497.71
## - unemploy_m24 1 36151 1334738 498.68
## - m_per1000f 1 42497 1341084 498.90
## <none> 1298587 499.41
## - unemploy_m39 1 88024 1386611 500.43
## - prob_prison 1 117246 1415833 501.39
## - percent_m 1 168873 1467461 503.04
## - mean_education 1 299178 1597766 506.95
## - inequality 1 432350 1730938 510.63
## - pol_exp 1 1067853 2366441 525.02
##
## Step: AIC=497.42
## crime_rate ~ percent_m + is_south + mean_education + labour_participation +
## m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison + time_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## - is_south 1 114 1298724 495.42
## - time_prison 1 3523 1302132 495.54
## - labour_participation 1 3546 1302155 495.54
## - gdp 1 8633 1307242 495.72
## - unemploy_m24 1 36360 1334970 496.69
## - m_per1000f 1 48877 1347487 497.12
## <none> 1298610 497.42
## - unemploy_m39 1 88625 1387234 498.45
## - prob_prison 1 117555 1416165 499.40
## - percent_m 1 168904 1467514 501.04
## - mean_education 1 299444 1598054 504.96
## - inequality 1 464523 1763132 509.48
## - pol_exp 1 1163183 2461793 524.84
##
## Step: AIC=495.42
## crime_rate ~ percent_m + mean_education + labour_participation +
## m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison + time_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## - time_prison 1 3507 1302231 493.54
## - labour_participation 1 4958 1303683 493.59
## - gdp 1 9106 1307830 493.74
## - unemploy_m24 1 41820 1340544 494.88
## - m_per1000f 1 49582 1348307 495.14
## <none> 1298724 495.42
## - unemploy_m39 1 93362 1392087 496.61
## - prob_prison 1 128301 1427026 497.75
## - percent_m 1 180480 1479204 499.41
## - mean_education 1 299371 1598095 502.96
## - inequality 1 597093 1895817 510.82
## - pol_exp 1 1168618 2467342 522.94
##
## Step: AIC=493.54
## crime_rate ~ percent_m + mean_education + labour_participation +
## m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## - labour_participation 1 3805 1306036 491.68
## - gdp 1 10889 1313120 491.93
## - unemploy_m24 1 44275 1346506 493.08
## - m_per1000f 1 46320 1348551 493.15
## <none> 1302231 493.54
## - unemploy_m39 1 102971 1405202 495.04
## - percent_m 1 213507 1515738 498.53
## - prob_prison 1 220749 1522981 498.75
## - mean_education 1 295885 1598116 500.96
## - inequality 1 610210 1912441 509.22
## - pol_exp 1 1172569 2474800 521.08
##
## Step: AIC=491.68
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 +
## unemploy_m39 + gdp + inequality + prob_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## - gdp 1 10398 1316434 490.04
## - unemploy_m24 1 40481 1346517 491.08
## - m_per1000f 1 44751 1350787 491.23
## <none> 1306036 491.68
## - unemploy_m39 1 104836 1410871 493.23
## - prob_prison 1 218502 1524538 496.79
## - percent_m 1 231590 1537626 497.19
## - mean_education 1 300525 1606560 499.20
## - inequality 1 609144 1915180 507.29
## - pol_exp 1 1208349 2514385 519.81
##
## Step: AIC=490.04
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 +
## unemploy_m39 + inequality + prob_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## - unemploy_m24 1 47775 1364208 489.68
## - m_per1000f 1 50927 1367360 489.79
## <none> 1316434 490.04
## - unemploy_m39 1 118175 1434609 492.00
## - percent_m 1 221285 1537719 495.19
## - prob_prison 1 257245 1573678 496.25
## - mean_education 1 327854 1644287 498.27
## - inequality 1 828730 2145163 510.50
## - pol_exp 1 1736089 3052523 526.73
##
## Step: AIC=489.68
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m39 +
## inequality + prob_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## - m_per1000f 1 18044 1382252 488.29
## <none> 1364208 489.68
## - unemploy_m39 1 86771 1450980 490.52
## - percent_m 1 225573 1589781 494.72
## - prob_prison 1 262958 1627167 495.79
## - mean_education 1 297172 1661380 496.75
## - inequality 1 952361 2316569 512.04
## - pol_exp 1 3019469 4383677 541.38
##
## Step: AIC=488.29
## crime_rate ~ percent_m + mean_education + unemploy_m39 + inequality +
## prob_prison + pol_exp
##
## Df Sum of Sq RSS AIC
## <none> 1382252 488.29
## - unemploy_m39 1 115767 1498020 489.99
## - prob_prison 1 257895 1640148 494.16
## - percent_m 1 265460 1647713 494.37
## - mean_education 1 530001 1912253 501.22
## - inequality 1 1020942 2403194 511.73
## - pol_exp 1 3001462 4383714 539.38
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 +
## inequality + prob_prison + pol_exp, data = crime.clean)
##
## Coefficients:
## (Intercept) percent_m mean_education unemploy_m39 inequality
## -4605.850 8.724 16.721 7.064 7.051
## prob_prison pol_exp
## -3863.787 6.796
This step-wise regression method will return the best formula that gives lowest AIC (Akaike Information Criterion), whereas the lower the AIC, the lower the information loss that our model has.
# assigning multiple linear regression model to an object
mlm.step <- lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 +
inequality + prob_prison + pol_exp, data = crime.clean)
summary(mlm.step)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 +
## inequality + prob_prison + pol_exp, data = crime.clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -575.90 -117.56 -1.68 125.80 448.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4605.8496 860.8376 -5.350 0.0000041234915 ***
## percent_m 8.7238 3.1876 2.737 0.009295 **
## mean_education 16.7212 4.3241 3.867 0.000407 ***
## unemploy_m39 7.0635 3.9083 1.807 0.078433 .
## inequality 7.0513 1.3138 5.367 0.0000039104557 ***
## prob_prison -3863.7869 1432.3620 -2.697 0.010267 *
## pol_exp 6.7957 0.7385 9.202 0.0000000000255 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 188.3 on 39 degrees of freedom
## Multiple R-squared: 0.7985, Adjusted R-squared: 0.7676
## F-statistic: 25.77 on 6 and 39 DF, p-value: 0.000000000003888
If we compare between our simple linear regression and multiple linear regression model, we can see that our multiple linear regression gave better adjusted R-squared of 0.7676 from previously 0.5083. Therefore, I will use the mlm.step model for our prediction.
From this model, we can see that probability have negative correlation with crime_rate, while other variable positively correlate with crime_rate. The formula of our final model is:
crime_rate = 8.7238(percent_m) + 16.7212(mean_education) + 7.0635(unemploy_m39) + 7.0513(inequality) - 3863.7869(prob_prison) + 6.7957(pol_exp) - 4605.8496
# make prediction
crimex$cr_pred <- predict(mlm.step, crimex)
# calculating RMSE
sqrt(mean((crimex$cr_pred - crimex$crime_rate)^2))## [1] 194.0982
## [1] 342 1993
Based on this result, our linear regression model still gave a considerably high error to our prediction and therefore still needs improvement. This result is actually quite expected from the look of our scatterplot, whereas our data only vaguely resemble a linear line between our predictor (which has the highest correlation) with our target variable.
We can improve this model by adding more data (47 observations might not be enough), or searching for another predictor variables that may have a better linear correlation with our target variable)
Another machine learning approach to predict the crime rate should also be considered as this linear regression model is not very robust compared to other machine learning algorithms such as Random Forest.
This assumption is already checked prior to model building using correlation matrix in EDA step, and through scatterplot between our predictor and our target variable.
##
## Shapiro-Wilk normality test
##
## data: mlm.step$residuals
## W = 0.96654, p-value = 0.2051
P-value > 0.05; H0 accepted. This result means that the residuals of our model distributed normally, therefore our model will have error around its mean.
Heteroscedasticity
##
## studentized Breusch-Pagan test
##
## data: mlm.step
## BP = 8.3082, df = 6, p-value = 0.2164
P-value > 0.05; H0 accepted. This means that our model residuals do not have a pattern that cannot be captured by our model.
## percent_m mean_education unemploy_m39 inequality prob_prison
## 1.970723 3.035998 1.412245 3.486175 1.343397
## pol_exp
## 1.955926
There are no values equal to or more than 10 so Multicollinearity between variables is not found (predictor variables are mutually independent).
Based on the analysis, our mlm.step model passed all of the assumption evaluation.
The mlm.step model gave an R-squared of 0.7676 and has an RMSE of 194. The predictor variables that will contribute to the mlm.step model include percent_m, mean_education, unemploy_m39, inequality, prob_prison, and pol_exp. We can improve this model by adding more data (47 observations might not be enough), or searching for another predictor variables that may have a better linear correlation with our target variable). You can also optimize the prediction by using different machine learning approach to predict the crime rate. As you know, the linear regression model is not very robust compared to other machine learning algorithms such as Random Forest, although it has great interpretability.
Lastly,
Thank you for reading, happy learning, and keep improving!