Load Packages

These are the basic packages that I will use throughout the process:

library(readr) # to read data
library(dplyr) # to tidy data

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(GGally) # to make correlation matrix

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(lmtest) # for Breusch-Pagan/heteroscedasticity test

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car) # for multicollinearity test

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

crime <- read.csv("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/Regression Model/RM/data_input/crime.csv")
glimpse(crime)

## Observations: 47
## Variables: 17
## $ X    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,...
## $ M    <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134, 12...
## $ So   <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,...
## $ Ed   <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 113, ...
## $ Po1  <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 62, ...
## $ Po2  <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61, 5...
## $ LF   <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595, 62...
## $ M.F  <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 972, ...
## $ Pop  <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30, 33...
## $ NW   <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, 46,...
## $ U1   <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, 92,...
## $ U2   <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43, 47...
## $ GDP  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580, 50...
## $ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172, 20...
## $ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201, 0....
## $ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.6993,...
## $ y    <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 84...

The dataset was collected in 1960. Since the full description of the dataset wasn’t available, the descriptions for columns were retrieved from the authors of the MASS package. The variables are:

X1: Index M: percentage of males aged 14-24 So: whether it is in a Southern state. 1 for Yes, 0 for No. Ed: mean years of schooling Po1: police expenditure in 1960 Po2: police expenditure in 1959 LF: labour force participation rate M.F: number of males per 1000 females Pop: state population NW: number of non-whites resident per 1000 people U1: unemployment rate of urban males aged 14-24 U2: unemployment rate of urban males aged 35-39 GDP: gross domestic product per head Ineq: income inequality Prob: probability of imprisonment Time: average time served in prisons y: crime rate in an unspecified category

We are going to remove the first column because it only contain index data which will not be needed in our analysis. Additionally, we will also rename the columns for easier analysis and transform is.South column into a factor data type.

# your code here
crime <- crime %>% 
   select(-X) 
colnames(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

# transform & making new column
crime <- crime %>%
  mutate(is_south = as.factor(is_south))

crime

##    percent_m is_south mean_education police_exp60 police_exp59
## 1        151        1             91           58           56
## 2        143        0            113          103           95
## 3        142        1             89           45           44
## 4        136        0            121          149          141
## 5        141        0            121          109          101
## 6        121        0            110          118          115
## 7        127        1            111           82           79
## 8        131        1            109          115          109
## 9        157        1             90           65           62
## 10       140        0            118           71           68
## 11       124        0            105          121          116
## 12       134        0            108           75           71
## 13       128        0            113           67           60
## 14       135        0            117           62           61
## 15       152        1             87           57           53
## 16       142        1             88           81           77
## 17       143        0            110           66           63
## 18       135        1            104          123          115
## 19       130        0            116          128          128
## 20       125        0            108          113          105
## 21       126        0            108           74           67
## 22       157        1             89           47           44
## 23       132        0             96           87           83
## 24       131        0            116           78           73
## 25       130        0            116           63           57
## 26       131        0            121          160          143
## 27       135        0            109           69           71
## 28       152        0            112           82           76
## 29       119        0            107          166          157
## 30       166        1             89           58           54
## 31       140        0             93           55           54
## 32       125        0            109           90           81
## 33       147        1            104           63           64
## 34       126        0            118           97           97
## 35       123        0            102           97           87
## 36       150        0            100          109           98
## 37       177        1             87           58           56
## 38       133        0            104           51           47
## 39       149        1             88           61           54
## 40       145        1            104           82           74
## 41       148        0            122           72           66
## 42       141        0            109           56           54
## 43       162        1             99           75           70
## 44       136        0            121           95           96
## 45       139        1             88           46           41
## 46       126        0            104          106           97
## 47       130        0            121           90           91
##    labour_participation m_per1000f state_pop nonwhites_per1000 unemploy_m24
## 1                   510        950        33               301          108
## 2                   583       1012        13               102           96
## 3                   533        969        18               219           94
## 4                   577        994       157                80          102
## 5                   591        985        18                30           91
## 6                   547        964        25                44           84
## 7                   519        982         4               139           97
## 8                   542        969        50               179           79
## 9                   553        955        39               286           81
## 10                  632       1029         7                15          100
## 11                  580        966       101               106           77
## 12                  595        972        47                59           83
## 13                  624        972        28                10           77
## 14                  595        986        22                46           77
## 15                  530        986        30                72           92
## 16                  497        956        33               321          116
## 17                  537        977        10                 6          114
## 18                  537        978        31               170           89
## 19                  536        934        51                24           78
## 20                  567        985        78                94          130
## 21                  602        984        34                12          102
## 22                  512        962        22               423           97
## 23                  564        953        43                92           83
## 24                  574       1038         7                36          142
## 25                  641        984        14                26           70
## 26                  631       1071         3                77          102
## 27                  540        965         6                 4           80
## 28                  571       1018        10                79          103
## 29                  521        938       168                89           92
## 30                  521        973        46               254           72
## 31                  535       1045         6                20          135
## 32                  586        964        97                82          105
## 33                  560        972        23                95           76
## 34                  542        990        18                21          102
## 35                  526        948       113                76          124
## 36                  531        964         9                24           87
## 37                  638        974        24               349           76
## 38                  599       1024         7                40           99
## 39                  515        953        36               165           86
## 40                  560        981        96               126           88
## 41                  601        998         9                19           84
## 42                  523        968         4                 2          107
## 43                  522        996        40               208           73
## 44                  574       1012        29                36          111
## 45                  480        968        19                49          135
## 46                  599        989        40                24           78
## 47                  623       1049         3                22          113
##    unemploy_m39 gdp inequality prob_prison time_prison crime_rate
## 1            41 394        261    0.084602     26.2011        791
## 2            36 557        194    0.029599     25.2999       1635
## 3            33 318        250    0.083401     24.3006        578
## 4            39 673        167    0.015801     29.9012       1969
## 5            20 578        174    0.041399     21.2998       1234
## 6            29 689        126    0.034201     20.9995        682
## 7            38 620        168    0.042100     20.6993        963
## 8            35 472        206    0.040099     24.5988       1555
## 9            28 421        239    0.071697     29.4001        856
## 10           24 526        174    0.044498     19.5994        705
## 11           35 657        170    0.016201     41.6000       1674
## 12           31 580        172    0.031201     34.2984        849
## 13           25 507        206    0.045302     36.2993        511
## 14           27 529        190    0.053200     21.5010        664
## 15           43 405        264    0.069100     22.7008        798
## 16           47 427        247    0.052099     26.0991        946
## 17           35 487        166    0.076299     19.1002        539
## 18           34 631        165    0.119804     18.1996        929
## 19           34 627        135    0.019099     24.9008        750
## 20           58 626        166    0.034801     26.4010       1225
## 21           33 557        195    0.022800     37.5998        742
## 22           34 288        276    0.089502     37.0994        439
## 23           32 513        227    0.030700     25.1989       1216
## 24           42 540        176    0.041598     17.6000        968
## 25           21 486        196    0.069197     21.9003        523
## 26           41 674        152    0.041698     22.1005       1993
## 27           22 564        139    0.036099     28.4999        342
## 28           28 537        215    0.038201     25.8006       1216
## 29           36 637        154    0.023400     36.7009       1043
## 30           26 396        237    0.075298     28.3011        696
## 31           40 453        200    0.041999     21.7998        373
## 32           43 617        163    0.042698     30.9014        754
## 33           24 462        233    0.049499     25.5005       1072
## 34           35 589        166    0.040799     21.6997        923
## 35           50 572        158    0.020700     37.4011        653
## 36           38 559        153    0.006900     44.0004       1272
## 37           28 382        254    0.045198     31.6995        831
## 38           27 425        225    0.053998     16.6999        566
## 39           35 395        251    0.047099     27.3004        826
## 40           31 488        228    0.038801     29.3004       1151
## 41           20 590        144    0.025100     30.0001        880
## 42           37 489        170    0.088904     12.1996        542
## 43           27 496        224    0.054902     31.9989        823
## 44           37 622        162    0.028100     30.0001       1030
## 45           53 457        249    0.056202     32.5996        455
## 46           25 593        171    0.046598     16.6999        508
## 47           40 588        160    0.052802     16.0997        849

Exploratory Data Analysis

# check for missing value
anyNA(crime)

## [1] FALSE

We would like to see how social-economic conditions could reflect on the crime rate in a state. Now we are going to inspect the correlation for each variable.

# your code here
ggcorr(data = crime, label = TRUE, label_size = 2, hjust = 0.8)

## Warning in ggcorr(data = crime, label = TRUE, label_size = 2, hjust = 0.8): data
## in column(s) 'is_south' are not numeric and were ignored

The correlation matrix above reveals the correlation between all of the numeric variables of the crime dataset (is_south was excluded because of its categorical value). From this matrix, we can see that police_exp60 and police_exp59 have high correlation with one another. This condition is not ideal for a linear regression because one of the assumption when we are building a linear regression model is non-multicollinearity, ie. there is no predictor variable that correlate strongly with one another. Further discussion about multicollinearity is in the last section of this article. Based on this result, I will combine those two variables into one pol_exp column that holds police expenditures data from the last 2 years.

crimex <- crime %>%
  mutate(pol_exp = police_exp60+police_exp59) %>% 
  select(-c(police_exp60,police_exp59))

crimex

##    percent_m is_south mean_education labour_participation m_per1000f state_pop
## 1        151        1             91                  510        950        33
## 2        143        0            113                  583       1012        13
## 3        142        1             89                  533        969        18
## 4        136        0            121                  577        994       157
## 5        141        0            121                  591        985        18
## 6        121        0            110                  547        964        25
## 7        127        1            111                  519        982         4
## 8        131        1            109                  542        969        50
## 9        157        1             90                  553        955        39
## 10       140        0            118                  632       1029         7
## 11       124        0            105                  580        966       101
## 12       134        0            108                  595        972        47
## 13       128        0            113                  624        972        28
## 14       135        0            117                  595        986        22
## 15       152        1             87                  530        986        30
## 16       142        1             88                  497        956        33
## 17       143        0            110                  537        977        10
## 18       135        1            104                  537        978        31
## 19       130        0            116                  536        934        51
## 20       125        0            108                  567        985        78
## 21       126        0            108                  602        984        34
## 22       157        1             89                  512        962        22
## 23       132        0             96                  564        953        43
## 24       131        0            116                  574       1038         7
## 25       130        0            116                  641        984        14
## 26       131        0            121                  631       1071         3
## 27       135        0            109                  540        965         6
## 28       152        0            112                  571       1018        10
## 29       119        0            107                  521        938       168
## 30       166        1             89                  521        973        46
## 31       140        0             93                  535       1045         6
## 32       125        0            109                  586        964        97
## 33       147        1            104                  560        972        23
## 34       126        0            118                  542        990        18
## 35       123        0            102                  526        948       113
## 36       150        0            100                  531        964         9
## 37       177        1             87                  638        974        24
## 38       133        0            104                  599       1024         7
## 39       149        1             88                  515        953        36
## 40       145        1            104                  560        981        96
## 41       148        0            122                  601        998         9
## 42       141        0            109                  523        968         4
## 43       162        1             99                  522        996        40
## 44       136        0            121                  574       1012        29
## 45       139        1             88                  480        968        19
## 46       126        0            104                  599        989        40
## 47       130        0            121                  623       1049         3
##    nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality prob_prison
## 1                301          108           41 394        261    0.084602
## 2                102           96           36 557        194    0.029599
## 3                219           94           33 318        250    0.083401
## 4                 80          102           39 673        167    0.015801
## 5                 30           91           20 578        174    0.041399
## 6                 44           84           29 689        126    0.034201
## 7                139           97           38 620        168    0.042100
## 8                179           79           35 472        206    0.040099
## 9                286           81           28 421        239    0.071697
## 10                15          100           24 526        174    0.044498
## 11               106           77           35 657        170    0.016201
## 12                59           83           31 580        172    0.031201
## 13                10           77           25 507        206    0.045302
## 14                46           77           27 529        190    0.053200
## 15                72           92           43 405        264    0.069100
## 16               321          116           47 427        247    0.052099
## 17                 6          114           35 487        166    0.076299
## 18               170           89           34 631        165    0.119804
## 19                24           78           34 627        135    0.019099
## 20                94          130           58 626        166    0.034801
## 21                12          102           33 557        195    0.022800
## 22               423           97           34 288        276    0.089502
## 23                92           83           32 513        227    0.030700
## 24                36          142           42 540        176    0.041598
## 25                26           70           21 486        196    0.069197
## 26                77          102           41 674        152    0.041698
## 27                 4           80           22 564        139    0.036099
## 28                79          103           28 537        215    0.038201
## 29                89           92           36 637        154    0.023400
## 30               254           72           26 396        237    0.075298
## 31                20          135           40 453        200    0.041999
## 32                82          105           43 617        163    0.042698
## 33                95           76           24 462        233    0.049499
## 34                21          102           35 589        166    0.040799
## 35                76          124           50 572        158    0.020700
## 36                24           87           38 559        153    0.006900
## 37               349           76           28 382        254    0.045198
## 38                40           99           27 425        225    0.053998
## 39               165           86           35 395        251    0.047099
## 40               126           88           31 488        228    0.038801
## 41                19           84           20 590        144    0.025100
## 42                 2          107           37 489        170    0.088904
## 43               208           73           27 496        224    0.054902
## 44                36          111           37 622        162    0.028100
## 45                49          135           53 457        249    0.056202
## 46                24           78           25 593        171    0.046598
## 47                22          113           40 588        160    0.052802
##    time_prison crime_rate pol_exp
## 1      26.2011        791     114
## 2      25.2999       1635     198
## 3      24.3006        578      89
## 4      29.9012       1969     290
## 5      21.2998       1234     210
## 6      20.9995        682     233
## 7      20.6993        963     161
## 8      24.5988       1555     224
## 9      29.4001        856     127
## 10     19.5994        705     139
## 11     41.6000       1674     237
## 12     34.2984        849     146
## 13     36.2993        511     127
## 14     21.5010        664     123
## 15     22.7008        798     110
## 16     26.0991        946     158
## 17     19.1002        539     129
## 18     18.1996        929     238
## 19     24.9008        750     256
## 20     26.4010       1225     218
## 21     37.5998        742     141
## 22     37.0994        439      91
## 23     25.1989       1216     170
## 24     17.6000        968     151
## 25     21.9003        523     120
## 26     22.1005       1993     303
## 27     28.4999        342     140
## 28     25.8006       1216     158
## 29     36.7009       1043     323
## 30     28.3011        696     112
## 31     21.7998        373     109
## 32     30.9014        754     171
## 33     25.5005       1072     127
## 34     21.6997        923     194
## 35     37.4011        653     184
## 36     44.0004       1272     207
## 37     31.6995        831     114
## 38     16.6999        566      98
## 39     27.3004        826     115
## 40     29.3004       1151     156
## 41     30.0001        880     138
## 42     12.1996        542     110
## 43     31.9989        823     145
## 44     30.0001       1030     191
## 45     32.5996        455      87
## 46     16.6999        508     203
## 47     16.0997        849     181

# updated correlation matrix
ggcorr(crimex, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

## Warning in ggcorr(crimex, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'is_south' are not numeric and were ignored

From the updated correlation matrix, it can be seen that the police expenditures have the highest correlation with the crime rate. Meanwhile, the number of non-white residents has no correlation with the crime rate. From the analysis above, I decided to not using nonwhites_per1000 variable as a predictor as it has no correlation for our target variable. The use of nonwhites_per1000 variable might affect the calculation of our model variable significance and promoting incorrect significancies). ___ # Model Building ## Simple Linear Regression

In building simple linear regression, it is important to have predictor variable that has high linear relationship with the target variables. Based on the correlation matrix, I will be using pol_exp as my predictor variable. Prior to model building, I will inspect if there is any outlier from our data.

# further EDA - outlier inspection
plot(crimex$pol_exp, crimex$crime_rate)

boxplot(crimex$pol_exp)

Based on the boxplot, there is one outlier in our pol_exp data. I will remove this outlier first before building our model because based on the scatterplot, it looks like it has a high influence on our model.

# remove outliers
crime.clean <- crimex %>% 
  filter(pol_exp<320)
         
boxplot(crime.clean$pol_exp)

plot(crime.clean$pol_exp, crime.clean$crime_rate)

# model building
slm.o <- lm(crime_rate ~ pol_exp, crimex) 
slm <- lm(crime_rate ~ pol_exp, crime.clean) # without outlier


# plotting the linear regression model
plot(crimex$pol_exp, crimex$crime_rate)
abline(slm.o$coefficients[1],slm.o$coefficients[2], col = "red")
abline(slm$coefficients[1],slm$coefficients[2], col = "green") # without outlier

summary(slm.o)

## 
## Call:
## lm(formula = crime_rate ~ pol_exp, data = crimex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -581.07 -151.94   20.33  147.24  580.59 
## 
## Coefficients:
##             Estimate Std. Error t value    Pr(>|t|)    
## (Intercept) 152.0674   128.5331   1.183       0.243    
## pol_exp       4.5573     0.7354   6.197 0.000000159 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 287.2 on 45 degrees of freedom
## Multiple R-squared:  0.4605, Adjusted R-squared:  0.4485 
## F-statistic:  38.4 on 1 and 45 DF,  p-value: 0.0000001591

summary(slm)

## 
## Call:
## lm(formula = crime_rate ~ pol_exp, data = crime.clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -650.8 -153.4   28.0  163.1  541.3 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)  45.4783   130.6502   0.348        0.729    
## pol_exp       5.2941     0.7679   6.894 0.0000000164 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.8 on 44 degrees of freedom
## Multiple R-squared:  0.5193, Adjusted R-squared:  0.5083 
## F-statistic: 47.53 on 1 and 44 DF,  p-value: 0.00000001636

From both of the regression model, pol_exp has positive corelation with crime.rate and is significantly affecting crime_rate (3 stars/p-value < 0.05).

If we look at the abline of the plot, we can see that the outliers of our data have a high influence over our model. Therefore, it is safer to remove the outliers from the data. The removal of outliers increase the R-squared of the model ~0.6 points. Based on this analysis, I will use slm as our simple linear model. Additionally, It also has higher R-squared as one of the tools for evaluating a linear regression model.

R-squared value tells us how well our model describes our data. It measures the extent to which the variance in our target variable (crime_rate) can be explained by the predictor variables (pol_exp, etc.). As we increase the number of predictor variables, our model’s R-squared value will also increase as it incorporates new legitimate information as well as the noise introduced by these extra variables. Because of this, it is better to evaluate the adjusted R-squared of our model. Adjusted R-squared does not increase the way regular multiple R-squared does because it is adjusted for the number of predictor variables in our model. It increases only when the new variable actually leads to a better prediction.

slm model can be written as:

crime_rate = 45.4783 + 5.2941(pol_exp)

which also means:

every 1 unit increase of police expenditure in the last 2 years, there will be an increase in crime rate at about 5.2941 times of the police expenditure.

Even so, our simple linear model only have an adjusted R-squared of 0.5083. An improvement can be made by adding more data/adding more variables to our model that may explain our target. On the constraint of only having this dataset, I will try to add more variables and use multiple linear regression for this case.

Multiple Linear Regression

On building multiple linear regression, I will use automatic feature selection using step-wise regression with backward elimination. I will use crime_rate as our target variable and all the remaining variables as our predictor, except the nonwhites_per1000 variable which has zero correlation with crime_rate.

mlm <- lm(crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
    m_per1000f + state_pop + unemploy_m24 + 
    unemploy_m39 + gdp + inequality + prob_prison + time_prison + 
    pol_exp, crime.clean)
step(mlm, direction = "backward")

## Start:  AIC=499.41
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + 
##     inequality + prob_prison + time_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - state_pop             1        22 1298610 497.42
## - is_south              1       113 1298701 497.42
## - time_prison           1      3277 1301864 497.53
## - labour_participation  1      3529 1302116 497.54
## - gdp                   1      8271 1306859 497.71
## - unemploy_m24          1     36151 1334738 498.68
## - m_per1000f            1     42497 1341084 498.90
## <none>                              1298587 499.41
## - unemploy_m39          1     88024 1386611 500.43
## - prob_prison           1    117246 1415833 501.39
## - percent_m             1    168873 1467461 503.04
## - mean_education        1    299178 1597766 506.95
## - inequality            1    432350 1730938 510.63
## - pol_exp               1   1067853 2366441 525.02
## 
## Step:  AIC=497.42
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - is_south              1       114 1298724 495.42
## - time_prison           1      3523 1302132 495.54
## - labour_participation  1      3546 1302155 495.54
## - gdp                   1      8633 1307242 495.72
## - unemploy_m24          1     36360 1334970 496.69
## - m_per1000f            1     48877 1347487 497.12
## <none>                              1298610 497.42
## - unemploy_m39          1     88625 1387234 498.45
## - prob_prison           1    117555 1416165 499.40
## - percent_m             1    168904 1467514 501.04
## - mean_education        1    299444 1598054 504.96
## - inequality            1    464523 1763132 509.48
## - pol_exp               1   1163183 2461793 524.84
## 
## Step:  AIC=495.42
## crime_rate ~ percent_m + mean_education + labour_participation + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - time_prison           1      3507 1302231 493.54
## - labour_participation  1      4958 1303683 493.59
## - gdp                   1      9106 1307830 493.74
## - unemploy_m24          1     41820 1340544 494.88
## - m_per1000f            1     49582 1348307 495.14
## <none>                              1298724 495.42
## - unemploy_m39          1     93362 1392087 496.61
## - prob_prison           1    128301 1427026 497.75
## - percent_m             1    180480 1479204 499.41
## - mean_education        1    299371 1598095 502.96
## - inequality            1    597093 1895817 510.82
## - pol_exp               1   1168618 2467342 522.94
## 
## Step:  AIC=493.54
## crime_rate ~ percent_m + mean_education + labour_participation + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1      3805 1306036 491.68
## - gdp                   1     10889 1313120 491.93
## - unemploy_m24          1     44275 1346506 493.08
## - m_per1000f            1     46320 1348551 493.15
## <none>                              1302231 493.54
## - unemploy_m39          1    102971 1405202 495.04
## - percent_m             1    213507 1515738 498.53
## - prob_prison           1    220749 1522981 498.75
## - mean_education        1    295885 1598116 500.96
## - inequality            1    610210 1912441 509.22
## - pol_exp               1   1172569 2474800 521.08
## 
## Step:  AIC=491.68
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## - gdp             1     10398 1316434 490.04
## - unemploy_m24    1     40481 1346517 491.08
## - m_per1000f      1     44751 1350787 491.23
## <none>                        1306036 491.68
## - unemploy_m39    1    104836 1410871 493.23
## - prob_prison     1    218502 1524538 496.79
## - percent_m       1    231590 1537626 497.19
## - mean_education  1    300525 1606560 499.20
## - inequality      1    609144 1915180 507.29
## - pol_exp         1   1208349 2514385 519.81
## 
## Step:  AIC=490.04
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + inequality + prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## - unemploy_m24    1     47775 1364208 489.68
## - m_per1000f      1     50927 1367360 489.79
## <none>                        1316434 490.04
## - unemploy_m39    1    118175 1434609 492.00
## - percent_m       1    221285 1537719 495.19
## - prob_prison     1    257245 1573678 496.25
## - mean_education  1    327854 1644287 498.27
## - inequality      1    828730 2145163 510.50
## - pol_exp         1   1736089 3052523 526.73
## 
## Step:  AIC=489.68
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m39 + 
##     inequality + prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## - m_per1000f      1     18044 1382252 488.29
## <none>                        1364208 489.68
## - unemploy_m39    1     86771 1450980 490.52
## - percent_m       1    225573 1589781 494.72
## - prob_prison     1    262958 1627167 495.79
## - mean_education  1    297172 1661380 496.75
## - inequality      1    952361 2316569 512.04
## - pol_exp         1   3019469 4383677 541.38
## 
## Step:  AIC=488.29
## crime_rate ~ percent_m + mean_education + unemploy_m39 + inequality + 
##     prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1382252 488.29
## - unemploy_m39    1    115767 1498020 489.99
## - prob_prison     1    257895 1640148 494.16
## - percent_m       1    265460 1647713 494.37
## - mean_education  1    530001 1912253 501.22
## - inequality      1   1020942 2403194 511.73
## - pol_exp         1   3001462 4383714 539.38

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 + 
##     inequality + prob_prison + pol_exp, data = crime.clean)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education    unemploy_m39      inequality  
##      -4605.850           8.724          16.721           7.064           7.051  
##    prob_prison         pol_exp  
##      -3863.787           6.796

This step-wise regression method will return the best formula that gives lowest AIC (Akaike Information Criterion), whereas the lower the AIC, the lower the information loss that our model has.

# assigning multiple linear regression model to an object
mlm.step <- lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 + 
    inequality + prob_prison + pol_exp, data = crime.clean)

summary(mlm.step)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 + 
##     inequality + prob_prison + pol_exp, data = crime.clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -575.90 -117.56   -1.68  125.80  448.44 
## 
## Coefficients:
##                  Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)    -4605.8496   860.8376  -5.350 0.0000041234915 ***
## percent_m          8.7238     3.1876   2.737        0.009295 ** 
## mean_education    16.7212     4.3241   3.867        0.000407 ***
## unemploy_m39       7.0635     3.9083   1.807        0.078433 .  
## inequality         7.0513     1.3138   5.367 0.0000039104557 ***
## prob_prison    -3863.7869  1432.3620  -2.697        0.010267 *  
## pol_exp            6.7957     0.7385   9.202 0.0000000000255 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 188.3 on 39 degrees of freedom
## Multiple R-squared:  0.7985, Adjusted R-squared:  0.7676 
## F-statistic: 25.77 on 6 and 39 DF,  p-value: 0.000000000003888

If we compare between our simple linear regression and multiple linear regression model, we can see that our multiple linear regression gave better adjusted R-squared of 0.7676 from previously 0.5083. Therefore, I will use the mlm.step model for our prediction.

From this model, we can see that probability have negative correlation with crime_rate, while other variable positively correlate with crime_rate. The formula of our final model is:

crime_rate = 8.7238(percent_m) + 16.7212(mean_education) + 7.0635(unemploy_m39) + 7.0513(inequality) - 3863.7869(prob_prison) + 6.7957(pol_exp) - 4605.8496

Model Prediction

# make prediction
crimex$cr_pred <- predict(mlm.step, crimex)

# calculating RMSE
sqrt(mean((crimex$cr_pred - crimex$crime_rate)^2))

## [1] 194.0982

range(crime$crime_rate)

## [1]  342 1993

Based on this result, our linear regression model still gave a considerably high error to our prediction and therefore still needs improvement. This result is actually quite expected from the look of our scatterplot, whereas our data only vaguely resemble a linear line between our predictor (which has the highest correlation) with our target variable.

We can improve this model by adding more data (47 observations might not be enough), or searching for another predictor variables that may have a better linear correlation with our target variable)

Another machine learning approach to predict the crime rate should also be considered as this linear regression model is not very robust compared to other machine learning algorithms such as Random Forest.

Model Assumption Evaluation

Linearity

This assumption is already checked prior to model building using correlation matrix in EDA step, and through scatterplot between our predictor and our target variable.

Normality

hist(mlm.step$residuals, breaks = 5)

shapiro.test(mlm.step$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  mlm.step$residuals
## W = 0.96654, p-value = 0.2051

P-value > 0.05; H0 accepted. This result means that the residuals of our model distributed normally, therefore our model will have error around its mean.

Heteroscedasticity

plot(crime.clean$crime_rate, mlm.step$residuals)
abline(h = 0, col = "red")

bptest(mlm.step)

## 
##  studentized Breusch-Pagan test
## 
## data:  mlm.step
## BP = 8.3082, df = 6, p-value = 0.2164

P-value > 0.05; H0 accepted. This means that our model residuals do not have a pattern that cannot be captured by our model.

Variance Inflation Factor (Multicollinearity)

vif(mlm.step)

##      percent_m mean_education   unemploy_m39     inequality    prob_prison 
##       1.970723       3.035998       1.412245       3.486175       1.343397 
##        pol_exp 
##       1.955926

There are no values equal to or more than 10 so Multicollinearity between variables is not found (predictor variables are mutually independent).

Based on the analysis, our mlm.step model passed all of the assumption evaluation.

Conclusions

The mlm.step model gave an R-squared of 0.7676 and has an RMSE of 194. The predictor variables that will contribute to the mlm.step model include percent_m, mean_education, unemploy_m39, inequality, prob_prison, and pol_exp. We can improve this model by adding more data (47 observations might not be enough), or searching for another predictor variables that may have a better linear correlation with our target variable). You can also optimize the prediction by using different machine learning approach to predict the crime rate. As you know, the linear regression model is not very robust compared to other machine learning algorithms such as Random Forest, although it has great interpretability.

Lastly,

Thank you for reading, happy learning, and keep improving!

Predicting the Crime Rate using Linear Regression

Ezra Soterion Nugroho

2/10/2020