Swedish crime statistics from 1950 to 2015

This data set contains statistics on reported crimes in Sweden (per 100,000 inhabitants) from 1950 to 2015. It contains the following columns:

crimes.total: total number of reported crimes
crimes.penal.code: total number of reported crimes against the penal code
crimes.person: total number of reported crimes against a person
murder: total number of reported murders
sexual.offenses: total number of reported sexual offences
rape: total number of reported rapes
assault: total number of reported aggravated assaults
stealing.general: total number of reported crimes involving stealing or robbery
robbery: total number of reported armed robberies
burglary: total number of reported burglaries
vehicle.theft: total number of reported vehicle thefts
house.theft: total number of reported theft inside a house
shop.theft: total number of reported theft inside a shop
out.of.vehicle.theft: total number of reported theft from a vehicle
criminal.damage: total number of reported criminal damages
other.penal.crimes: total number of other reported penal-code offences
fraud: total number of reported frauds
narcotics: total number of reported narcotics offences
drunk.driving: total number of reported drunk driving incidents
Year: the year
population: the total estimated population of Sweden at the time

Using this dataset, we will develop a model using linear regression analysis to predict our target, crimes.total.

Libraries

library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

Read Data

df <-  read.csv("reported.csv")
df

Pre-Processing Data

Check Data Type

str(df)
## 'data.frame':    66 obs. of  21 variables:
##  $ Year                : int  1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
##  $ crimes.total        : int  2784 3284 3160 2909 3028 3357 3488 3774 4064 4033 ...
##  $ crimes.penal.code   : int  2306 2754 2608 2689 2791 3101 3215 3520 3791 3733 ...
##  $ crimes.person       : int  120 125 119 119 126 135 133 133 127 125 ...
##  $ murder              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ assault             : int  105 109 104 105 107 118 116 116 113 110 ...
##  $ sexual.offenses     : int  40 45 39 45 41 44 38 36 40 47 ...
##  $ rape                : int  5 6 4 5 5 5 5 5 6 6 ...
##  $ stealing.general    : int  1578 1899 1846 1929 1981 2254 2363 2635 2880 2793 ...
##  $ burglary            : int  295 342 372 361 393 459 470 580 724 715 ...
##  $ house.theft         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ vehicle.theft       : int  NA NA NA NA NA NA NA 245 279 238 ...
##  $ out.of.vehicle.theft: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ shop.theft          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ robbery             : int  3 3 3 4 4 4 6 6 6 6 ...
##  $ fraud               : int  209 310 217 209 236 236 234 254 254 251 ...
##  $ criminal.damage     : int  72 73 82 88 101 111 133 155 167 179 ...
##  $ other.penal.crimes  : int  477 530 553 220 237 255 273 255 273 299 ...
##  $ narcotics           : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ drunk.driving       : int  49 66 78 91 103 125 160 163 166 181 ...
##  $ population          : int  7014000 7073000 7125000 7171000 7213000 7262000 7315000 7364000 7409000 7446000 ...
colSums(is.na(df))
##                 Year         crimes.total    crimes.penal.code 
##                    0                    0                    0 
##        crimes.person               murder              assault 
##                    0                    0                    0 
##      sexual.offenses                 rape     stealing.general 
##                    0                    0                    0 
##             burglary          house.theft        vehicle.theft 
##                    0                   15                    7 
## out.of.vehicle.theft           shop.theft              robbery 
##                   15                   15                    0 
##                fraud      criminal.damage   other.penal.crimes 
##                    0                    0                    0 
##            narcotics        drunk.driving           population 
##                    4                    0                    0

As we can see, several columns contain NA values. Since we don't need those columns in our analysis, we simply drop them.

Data Cleaning

df <-  df %>% 
  select(-c(house.theft, vehicle.theft, out.of.vehicle.theft,shop.theft, narcotics))
  
# str(df)
# anyNA(df)

Analysis

Correlation visualization

ggcorr(data = df,  hjust = 1, layout.exp = 3, label = T)

The plot above shows the correlations between columns, ranging from none to strong. For our analysis, we want predictors that correlate well with crimes.total. We could use every such column as a predictor, or just a subset; let's pick murder, sexual.offenses, burglary, fraud, and drunk.driving. A numeric check of these correlations is sketched below.
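
To complement the plot, we can rank each column's correlation with crimes.total numerically. This is a minimal sketch; it assumes df already holds only the numeric columns kept after cleaning.

# Correlation of every remaining column with crimes.total, strongest first
round(sort(cor(df)[, "crimes.total"], decreasing = TRUE), 2)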

boxplot(df$crimes.total)

boxplot(df$drunk.driving)

Good, it looks like there are no outliers!
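
If we prefer a programmatic check over eyeballing the plots, boxplot.stats() lists the points falling beyond the whiskers. A small sketch of the same 1.5 * IQR rule the boxplot uses:

# Values flagged as outliers by the boxplot rule; empty output means none
boxplot.stats(df$crimes.total)$out
boxplot.stats(df$drunk.driving)$out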

Correlation

cor.test()

To test our predictor, we have to make sure that the target and the predictor have a significant correlation.

  • H0: Correlation Not Significant (correlation = 0)
  • H1: Correlation Significant (correlation != 0)
  • Result we want = H1
cor.test(df$crimes.total, df$drunk.driving)
## 
##  Pearson's product-moment correlation
## 
## data:  df$crimes.total and df$drunk.driving
## t = 8.6101, df = 64, p-value = 2.722e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5962779 0.8278347
## sample estimates:
##       cor 
## 0.7325849

The p-value is far below 0.05, so we reject H0: the correlation is significant. That is the result we want.

Modelling

Now, in the modelling step, we use the lm() function to build a model; its formula contains the target and the predictor.

model_base <-  lm(formula = crimes.total ~ drunk.driving, data = df)
summary(model_base)
## 
## Call:
## lm(formula = crimes.total ~ drunk.driving, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3963.9 -2478.2  -133.2  1464.0  7399.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -444.398   1273.778  -0.349    0.728    
## drunk.driving   46.637      5.417   8.610 2.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2897 on 64 degrees of freedom
## Multiple R-squared:  0.5367, Adjusted R-squared:  0.5294 
## F-statistic: 74.13 on 1 and 64 DF,  p-value: 2.722e-12

One Predictor

plot(x= df$drunk.driving, y = df$crimes.total)
abline(model_base, col="red")

Insight :

  • When drunk.driving increases by one unit, the predicted crimes.total increases by 46.637 points.
  • The R-squared of model_base is 0.5367, which is moderate, not high. A manual prediction using these coefficients is sketched below.
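
To make the coefficients concrete, here is a small sketch of how the fitted line turns a drunk.driving value into a prediction; the input 150 is just an illustrative number:

# Manual prediction: intercept + slope * x
cf <- coef(model_base)
unname(cf[1] + cf[2] * 150)
# equivalent to: predict(model_base, newdata = data.frame(drunk.driving = 150))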

Many Predictors

Let's try using several columns as predictors, as chosen above.

model_multi <-  lm(formula = crimes.total ~ murder + sexual.offenses + burglary + fraud + drunk.driving, data = df)
summary(model_multi)
## 
## Call:
## lm(formula = crimes.total ~ murder + sexual.offenses + burglary + 
##     fraud + drunk.driving, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2419.77  -729.54    68.18   760.96  1683.72 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2628.8657   464.5578  -5.659 4.53e-07 ***
## murder             -9.1045   234.6541  -0.039    0.969    
## sexual.offenses    62.4764     4.0438  15.450  < 2e-16 ***
## burglary            6.3485     0.3701  17.152  < 2e-16 ***
## fraud               1.1914     0.4759   2.503    0.015 *  
## drunk.driving      -3.2047     3.0407  -1.054    0.296    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 933 on 60 degrees of freedom
## Multiple R-squared:  0.955,  Adjusted R-squared:  0.9512 
## F-statistic: 254.4 on 5 and 60 DF,  p-value: < 2.2e-16

Insight :

  • When burglary increases by one unit, the predicted crimes.total increases by 6.3485 points, with the other predictor values held constant.
  • With multiple predictors we look at the Adjusted R-squared; for model_multi it is 0.9512, which is very good and much better than model_base.
  • The stars next to a predictor mean that the predictor is significant, which is what we want when building a model to predict. The stars can also be read programmatically, as sketched below.
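
Instead of reading the stars by eye, we can extract the significant terms from the coefficient table. A small sketch (note the intercept shows up too):

# Rows of model_multi's coefficient table with p-value < 0.05
cf <- summary(model_multi)$coefficients
rownames(cf)[cf[, "Pr(>|t|)"] < 0.05]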

All Predictors

Then we try using all columns as predictors.

model_all <-  lm(formula = crimes.total ~ . , data = df)
summary(model_all)
## 
## Call:
## lm(formula = crimes.total ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5016  -5.9832   0.2118   5.7266  26.4086 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.523e+03  2.773e+03  -0.549 0.585119    
## Year                6.305e-01  1.502e+00   0.420 0.676378    
## crimes.penal.code   9.762e-01  3.932e-02  24.830  < 2e-16 ***
## crimes.person      -7.720e-01  9.094e-01  -0.849 0.399983    
## murder             -3.619e-01  2.832e+00  -0.128 0.898841    
## assault             5.792e-01  8.999e-01   0.644 0.522739    
## sexual.offenses     2.620e-02  3.438e-01   0.076 0.939552    
## rape                3.471e-01  8.220e-01   0.422 0.674666    
## stealing.general    3.128e-02  3.953e-02   0.791 0.432468    
## burglary           -4.881e-02  2.462e-02  -1.982 0.052958 .  
## robbery             1.029e+00  3.120e-01   3.297 0.001803 ** 
## fraud               7.387e-03  4.493e-02   0.164 0.870083    
## criminal.damage     3.485e-02  3.751e-02   0.929 0.357341    
## other.penal.crimes  1.009e+00  9.691e-03 104.095  < 2e-16 ***
## drunk.driving      -1.978e-01  5.427e-02  -3.645 0.000636 ***
## population          4.964e-05  2.942e-05   1.687 0.097796 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.93 on 50 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.474e+05 on 15 and 50 DF,  p-value: < 2.2e-16

Insight :

  • When robbery increases by one unit, the predicted crimes.total increases by 1.029 points, with the other predictor values held constant.
  • The Adjusted R-squared of model_all is 1. That looks perfect, but suspiciously so: crimes.total appears to be (almost) the sum of its component columns, so the full model can reconstruct it nearly exactly. We can sanity-check that below.
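
A hedged sanity check of why the fit is near-perfect, assuming crimes.total is constructed as crimes.penal.code plus other.penal.crimes (the first row of the data, 2306 + 477 ≈ 2784, hints at this):

# How far is crimes.total from the sum of its two top-level components?
summary(df$crimes.total - (df$crimes.penal.code + df$other.penal.crimes))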

R-Squared Comparison

summary(model_base)$r.squared
## [1] 0.5366806
summary(model_multi)$adj.r.squared
## [1] 0.9512103
summary(model_all)$adj.r.squared
## [1] 0.9999933

Insight :

  • What we see here are the R-squared values; we want an R-squared close to 1.
  • A model with an R-squared near 1 fits the data well.
  • model_multi and model_all have very good (adjusted) R-squared values, so we will carry both forward in the analysis.

Predicting

Now we predict on our data, using the models built above, with the predict() function.

pred_crime <-  predict(object = model_base , newdata = df)   
pred_crime_multi <-   predict(object = model_multi , newdata = df) 
pred_crime_all <-  predict(object = model_all , newdata = df) 

RMSE (Root Mean Squared Error)

Before continuing, let's look at the error of the predictions above!

RMSE(y_pred = pred_crime , y_true = df$crimes.total)
## [1] 2853.237
RMSE(y_pred = pred_crime_multi , y_true = df$crimes.total)
## [1] 889.5715
RMSE(y_pred = pred_crime_all , y_true = df$crimes.total)
## [1] 9.51178

Insight :

  • As the name "error" suggests, we want the smallest possible number.
  • If the error is large, the model predicts our data poorly.
  • In this case, the predictions from model_all have the best RMSE, as reproduced by hand below.
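
For reference, RMSE is just the square root of the mean squared error; this minimal sketch reproduces the MLmetrics result for model_multi by hand:

# Should match RMSE(pred_crime_multi, df$crimes.total) above: 889.5715
sqrt(mean((pred_crime_multi - df$crimes.total)^2))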

Step-wise Regression for Feature Selection

Is predicting with model_all the best choice? Not necessarily. There is a procedure called step-wise regression that helps us choose the best model we can build.

Backward

model_backward <-  step(object = model_all,       # the model to start from
                        direction = "backward",   # use backward elimination
                        trace = F)                # hide the step-by-step log
summary(model_backward)
## 
## Call:
## lm(formula = crimes.total ~ crimes.penal.code + crimes.person + 
##     stealing.general + burglary + robbery + criminal.damage + 
##     other.penal.crimes + drunk.driving + population, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.4292  -5.7773   0.2155   5.5346  25.8778 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.448e+02  1.005e+02  -3.430 0.001141 ** 
## crimes.penal.code   9.804e-01  8.416e-03 116.495  < 2e-16 ***
## crimes.person      -1.840e-01  4.235e-02  -4.345 5.92e-05 ***
## stealing.general    2.667e-02  1.153e-02   2.313 0.024404 *  
## burglary           -4.620e-02  1.986e-02  -2.327 0.023629 *  
## robbery             1.063e+00  2.538e-01   4.189 0.000100 ***
## criminal.damage     3.006e-02  1.626e-02   1.849 0.069757 .  
## other.penal.crimes  1.008e+00  8.007e-03 125.912  < 2e-16 ***
## drunk.driving      -1.762e-01  3.910e-02  -4.507 3.40e-05 ***
## population          5.553e-05  1.463e-05   3.796 0.000364 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.41 on 56 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.189e+06 on 9 and 56 DF,  p-value: < 2.2e-16

Insight :

  • With the backward step, we start from model_all and let the function iteratively discard the predictors whose removal loses the least information (it minimizes AIC at every step).
  • The summary above shows the resulting model with the best set of predictors it found. step() can also search in other directions, as sketched below.
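
For comparison, step() can also search in both directions, dropping and re-adding predictors as it goes. A hedged sketch, not part of the original analysis; it may well land on the same model:

# Bidirectional stepwise search starting from the full model
model_both <- step(object = model_all, direction = "both", trace = F)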

Prediction Interval

pred_model_step_interval <-  predict(
  object = model_backward,
  newdata = df,
  interval = "prediction",
  level = 0.95) # 95% confidence level, tolerate 5% error

head(pred_model_step_interval)
##        fit      lwr      upr
## 1 2789.499 2766.181 2812.816
## 2 3287.948 3265.003 3310.893
## 3 3167.341 3144.737 3189.945
## 4 2915.257 2893.352 2937.161
## 5 3031.629 3009.853 3053.404
## 6 3355.431 3333.758 3377.105
ggplot(data = df, aes(x = drunk.driving, y = crimes.total)) +
  geom_point()+
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Linear Regression of Crimes Person by Murder")+
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Confidence interval illustration for crimes.total ~ drunk.driving
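
Note that geom_smooth() draws the confidence band around the regression line, which is narrower than the prediction interval computed above. A hedged sketch of plotting the model_backward prediction band over time:

# Bind the fit/lwr/upr columns onto df and plot the 95% prediction band
df_pi <- cbind(df, as.data.frame(pred_model_step_interval))
ggplot(df_pi, aes(x = Year)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey80") +
  geom_line(aes(y = fit), color = "red") +
  geom_point(aes(y = crimes.total), size = 1) +
  labs(title = "95% prediction interval for crimes.total (model_backward)") +
  theme_minimal()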

Model Evaluation

After making predictions, let’s compare the performance of the following three models:

# r-squared
summary(model_base)$r.squared
## [1] 0.5366806
summary(model_multi)$adj.r.squared
## [1] 0.9512103
summary(model_backward)$adj.r.squared
## [1] 0.9999939
pred_backward <-  predict(object = model_backward , newdata = df)
# RMSE
RMSE(y_pred = pred_crime, y_true =  df$crimes.total)
## [1] 2853.237
RMSE(y_pred = pred_crime_multi, y_true =  df$crimes.total)
## [1] 889.5715
RMSE(y_pred = pred_backward, y_true =  df$crimes.total)
## [1] 9.587221

Conclusion

  • Best model based on R-squared is model_backward
  • Best model based on RMSE is model_backward

Which model should we choose?

  • In this case, we will choose model_backward, because its predictors are largely significant and its error is minimal.

Assumption

We want our predictors to be good at predicting the data, so the model should not violate the classical linear regression assumptions. Let's continue our analysis by checking them.

For comparison, let's examine both model_multi and model_backward.

Normality of Residuals

The linear regression model is expected to produce errors that are normally distributed, so that most of the errors cluster around zero.

Visualize the residual histogram using the hist() function.

hist(model_multi$residuals)

hist(model_backward$residuals)

Insight :

  • Both residual distributions look roughly normal; the values cluster around 0.

Statistic test with shapiro.test()

Shapiro-Wilk hypothesis test:

  • H0: residuals are normally distributed

  • H1: residuals are NOT normally distributed

  • if p-value > 0.05, we fail to reject H0

Result we want : H0

# Shapiro-Wilk test on the residuals
shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.98037, p-value = 0.3795
shapiro.test(model_multi$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_multi$residuals
## W = 0.97387, p-value = 0.1769

Insight :

  • Both residual distributions are consistent with normality; both p-values are > 0.05.
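
A Q-Q plot is another common way to eyeball residual normality; a small sketch complementing the histograms:

# Points hugging the line suggest normally distributed residuals
qqnorm(model_multi$residuals)
qqline(model_multi$residuals, col = "red")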

Homoscedasticity of Residuals

It is expected that the error generated by the model spreads randomly, with constant variance; when visualized, the errors show no pattern. This condition is known as homoscedasticity.

Visualization with scatter plot: fitted.values vs residuals

  • fitted.values are the model's predictions on the training data
  • residuals are the errors
# scatter plot
plot(x = model_multi$fitted.values, y = model_multi$residuals)
abline(h = 0, col = "red") # horizontal line at 0

plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red") # horizontal line at 0

Statistic test with bptest() from lmtest package.

Breusch-Pagan hypothesis test:

  • H0: the error spread is constant (homoscedasticity)
  • H1: the error spread is NOT constant (heteroscedasticity)

Result we want: H0

# Breusch-Pagan test on each model
library(lmtest)
bptest(model_multi)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_multi
## BP = 6.2736, df = 5, p-value = 0.2805
bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 30.059, df = 9, p-value = 0.0004287

Insight :

  • model_multi's error spread is constant (p-value > 0.05): homoscedastic.
  • model_backward's error spread is not constant (p-value < 0.05): heteroscedastic.
  • That means model_backward fails this assumption test.

No Multicollinearity

Multicollinearity is a condition where predictors are strongly correlated with each other. This is undesirable because it indicates redundant predictors in the model; out of several variables with a very strong relationship, we should keep only one. We want the model to be free of multicollinearity.

Test the VIF (Variance Inflation Factor) with the vif() function from the car package:

  • VIF value > 10: multicollinearity occurs in the model
  • VIF value < 10: there is no multicollinearity in the model

Result we want: VIF < 10

# VIF for each model
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model_multi)
##          murder sexual.offenses        burglary           fraud   drunk.driving 
##        2.158373        2.829396        2.133745        2.642805        3.039453
vif(model_backward)
##  crimes.penal.code      crimes.person   stealing.general           burglary 
##         545.360212          93.700149         348.235193          49.344501 
##            robbery    criminal.damage other.penal.crimes      drunk.driving 
##          48.472616          58.110113          19.806063           4.038203 
##         population 
##          66.176908

Insight :

  • model_multi shows no multicollinearity.
  • model_backward does show multicollinearity.
  • We can see that model_multi has VIF < 10 for every predictor, while model_backward has VIF > 10 for every predictor except drunk.driving.
  • That means model_backward fails this assumption test. This is unsurprising: its predictors are near-components of crimes.total and are close to linear combinations of one another, as the sketch below suggests.
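
A hedged sketch of the pairwise correlations behind those extreme VIFs (an illustration, not part of the original workflow):

# Pairwise correlations among three of model_backward's predictors
round(cor(df[, c("crimes.penal.code", "stealing.general", "burglary")]), 2)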

Conclusion

After our analysis, spanning EDA, modelling, prediction, and assumption tests, we conclude that the model to choose is model_multi. Even though its RMSE is larger than model_backward's, it is better than a model that violates the regression assumptions: a model with assumption violations cannot be trusted to predict our data.