Swedish crime statistics from 1950 to 2015
This data set contains statistics on reported crimes in Sweden (by 100.000) from 1950 to 2015. It contains the following columns:
crimes.total: total number of reported crimes
crimes.penal.code: total number of reported crimes against the criminal code
crimes.person: total number of reported crimes against a person
murder: total number of reported murder
sexual.offences: total number of reported sexual offences
rape: total number of reported rapes
assault: total number of reported aggravated assaults
stealing.general: total number of reported crimes involving stealing or robbery
robbery: total number of reported armed robberies
burglary: total number of reported armed burglaries
vehicle.theft: total number of reported vehicle thefts
house.theft: total number of reported theft inside a house
shop.theft: total number of reported theft inside a shop
out.of.vehicle.theft: total number of reported theft from a vehicle
criminal.damage: total number of reported criminal damages
other.penal.crimes: number of other penal crime offenses
fraud: total number of reported frauds
narcotics: total number of reported narcotics abuses
drunk.driving: total number of reported drunk driving incidents
Year: the year
population: the total estimated population of Sweden at the time
Using this dataset, We will develop a model with using linear reggression analysis to predict our target, that is crimes.total.
library(MLmetrics)##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lmtest)## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(GGally)## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)df <- read.csv("reported.csv")
dfstr(df)## 'data.frame': 66 obs. of 21 variables:
## $ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
## $ crimes.total : int 2784 3284 3160 2909 3028 3357 3488 3774 4064 4033 ...
## $ crimes.penal.code : int 2306 2754 2608 2689 2791 3101 3215 3520 3791 3733 ...
## $ crimes.person : int 120 125 119 119 126 135 133 133 127 125 ...
## $ murder : int 1 1 1 1 1 1 1 1 1 1 ...
## $ assault : int 105 109 104 105 107 118 116 116 113 110 ...
## $ sexual.offenses : int 40 45 39 45 41 44 38 36 40 47 ...
## $ rape : int 5 6 4 5 5 5 5 5 6 6 ...
## $ stealing.general : int 1578 1899 1846 1929 1981 2254 2363 2635 2880 2793 ...
## $ burglary : int 295 342 372 361 393 459 470 580 724 715 ...
## $ house.theft : int NA NA NA NA NA NA NA NA NA NA ...
## $ vehicle.theft : int NA NA NA NA NA NA NA 245 279 238 ...
## $ out.of.vehicle.theft: int NA NA NA NA NA NA NA NA NA NA ...
## $ shop.theft : int NA NA NA NA NA NA NA NA NA NA ...
## $ robbery : int 3 3 3 4 4 4 6 6 6 6 ...
## $ fraud : int 209 310 217 209 236 236 234 254 254 251 ...
## $ criminal.damage : int 72 73 82 88 101 111 133 155 167 179 ...
## $ other.penal.crimes : int 477 530 553 220 237 255 273 255 273 299 ...
## $ narcotics : int 0 0 0 0 0 0 0 0 1 1 ...
## $ drunk.driving : int 49 66 78 91 103 125 160 163 166 181 ...
## $ population : int 7014000 7073000 7125000 7171000 7213000 7262000 7315000 7364000 7409000 7446000 ...
colSums(is.na(df))## Year crimes.total crimes.penal.code
## 0 0 0
## crimes.person murder assault
## 0 0 0
## sexual.offenses rape stealing.general
## 0 0 0
## burglary house.theft vehicle.theft
## 0 15 7
## out.of.vehicle.theft shop.theft robbery
## 15 15 0
## fraud criminal.damage other.penal.crimes
## 0 0 0
## narcotics drunk.driving population
## 4 0 0
As we can see, there are many columns that contains NA value. For that, we just delete it, as we don’t need those in our analysis.
df <- df %>%
select(-c(house.theft, vehicle.theft, out.of.vehicle.theft,shop.theft, narcotics))
# str(df)
# anyNA(df)ggcorr(data = df, hjust = 1, layout.exp = 3, label = T)
We can see the correlation above between columns, ranging from no
correlation to strong correlation. For our analysis, we can use some
columns to predict crimes.total, we just see the columns that good
correlation with crimes.total. We can use all those columns as
predictor, or we can use some of them, let’s pick =
murder
, sexual.offenses , burglary +
fraud and drunk.driving.
boxplot(df$crimes.total)boxplot(df$drunk.driving)Good, looks like there is no outlier!
#Correlation
To test our predictor, we have to make sure that the target and the predictor have a significant correlation.
cor.test(df$crimes.total, df$drunk.driving)##
## Pearson's product-moment correlation
##
## data: df$crimes.total and df$drunk.driving
## t = 8.6101, df = 64, p-value = 2.722e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5962779 0.8278347
## sample estimates:
## cor
## 0.7325849
And we have H1, as the result we want. That’s good.
Now in modelling step, we use lm() function to make a model, it’s contain the target, and the predictor.
model_base <- lm(formula = crimes.total ~ drunk.driving, data = df)
summary(model_base)##
## Call:
## lm(formula = crimes.total ~ drunk.driving, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3963.9 -2478.2 -133.2 1464.0 7399.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -444.398 1273.778 -0.349 0.728
## drunk.driving 46.637 5.417 8.610 2.72e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2897 on 64 degrees of freedom
## Multiple R-squared: 0.5367, Adjusted R-squared: 0.5294
## F-statistic: 74.13 on 1 and 64 DF, p-value: 2.722e-12
plot(x= df$drunk.driving, y = df$crimes.total)
abline(model_base, col="red")Insight :
Let’s try use some columns as predictor, as we choose above.
model_multi <- lm(formula = crimes.total ~ murder + sexual.offenses + burglary + fraud + drunk.driving, data = df)
summary(model_multi)##
## Call:
## lm(formula = crimes.total ~ murder + sexual.offenses + burglary +
## fraud + drunk.driving, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2419.77 -729.54 68.18 760.96 1683.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2628.8657 464.5578 -5.659 4.53e-07 ***
## murder -9.1045 234.6541 -0.039 0.969
## sexual.offenses 62.4764 4.0438 15.450 < 2e-16 ***
## burglary 6.3485 0.3701 17.152 < 2e-16 ***
## fraud 1.1914 0.4759 2.503 0.015 *
## drunk.driving -3.2047 3.0407 -1.054 0.296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 933 on 60 degrees of freedom
## Multiple R-squared: 0.955, Adjusted R-squared: 0.9512
## F-statistic: 254.4 on 5 and 60 DF, p-value: < 2.2e-16
Insight :
Then we try use all columns as predictor.
model_all <- lm(formula = crimes.total ~ . , data = df)
summary(model_all)##
## Call:
## lm(formula = crimes.total ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5016 -5.9832 0.2118 5.7266 26.4086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.523e+03 2.773e+03 -0.549 0.585119
## Year 6.305e-01 1.502e+00 0.420 0.676378
## crimes.penal.code 9.762e-01 3.932e-02 24.830 < 2e-16 ***
## crimes.person -7.720e-01 9.094e-01 -0.849 0.399983
## murder -3.619e-01 2.832e+00 -0.128 0.898841
## assault 5.792e-01 8.999e-01 0.644 0.522739
## sexual.offenses 2.620e-02 3.438e-01 0.076 0.939552
## rape 3.471e-01 8.220e-01 0.422 0.674666
## stealing.general 3.128e-02 3.953e-02 0.791 0.432468
## burglary -4.881e-02 2.462e-02 -1.982 0.052958 .
## robbery 1.029e+00 3.120e-01 3.297 0.001803 **
## fraud 7.387e-03 4.493e-02 0.164 0.870083
## criminal.damage 3.485e-02 3.751e-02 0.929 0.357341
## other.penal.crimes 1.009e+00 9.691e-03 104.095 < 2e-16 ***
## drunk.driving -1.978e-01 5.427e-02 -3.645 0.000636 ***
## population 4.964e-05 2.942e-05 1.687 0.097796 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.93 on 50 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.474e+05 on 15 and 50 DF, p-value: < 2.2e-16
Insight :
summary(model_base)$r.squared## [1] 0.5366806
summary(model_multi)$adj.r.squared## [1] 0.9512103
summary(model_all)$adj.r.squared## [1] 0.9999933
Insight :
Now we have to predict our data, using the model we have before, with predict() function.
pred_crime <- predict(object = model_base , newdata = df)
pred_crime_multi <- predict(object = model_multi , newdata = df)
pred_crime_all <- predict(object = model_all , newdata = df) Before continue, let’s see the error that can be generated from predict above!
RMSE(y_pred = pred_crime , y_true = df$crimes.total)## [1] 2853.237
RMSE(y_pred = pred_crime_multi , y_true = df$crimes.total)## [1] 889.5715
RMSE(y_pred = pred_crime_all , y_true = df$crimes.total)## [1] 9.51178
Insight :
Is the predict with model_all is the best choice to predict our model? not necessarily. So there is step called Step-wise Regression, to help us choose what the best model we have.
model_backward <- step(object = model_all #object diisi dengan model
, direction = "backward", #menggunakan backward
trace = F) #memilih untuk tidak melihat prosesnya
summary(model_backward)##
## Call:
## lm(formula = crimes.total ~ crimes.penal.code + crimes.person +
## stealing.general + burglary + robbery + criminal.damage +
## other.penal.crimes + drunk.driving + population, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.4292 -5.7773 0.2155 5.5346 25.8778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.448e+02 1.005e+02 -3.430 0.001141 **
## crimes.penal.code 9.804e-01 8.416e-03 116.495 < 2e-16 ***
## crimes.person -1.840e-01 4.235e-02 -4.345 5.92e-05 ***
## stealing.general 2.667e-02 1.153e-02 2.313 0.024404 *
## burglary -4.620e-02 1.986e-02 -2.327 0.023629 *
## robbery 1.063e+00 2.538e-01 4.189 0.000100 ***
## criminal.damage 3.006e-02 1.626e-02 1.849 0.069757 .
## other.penal.crimes 1.008e+00 8.007e-03 125.912 < 2e-16 ***
## drunk.driving -1.762e-01 3.910e-02 -4.507 3.40e-05 ***
## population 5.553e-05 1.463e-05 3.796 0.000364 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.41 on 56 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.189e+06 on 9 and 56 DF, p-value: < 2.2e-16
Insight :
pred_model_step_interval <- predict(
object = model_backward,
newdata = df,
interval = "prediction",
level = 0.95) # 95% confidence level, tolerate 5% error
head(pred_model_step_interval)## fit lwr upr
## 1 2789.499 2766.181 2812.816
## 2 3287.948 3265.003 3310.893
## 3 3167.341 3144.737 3189.945
## 4 2915.257 2893.352 2937.161
## 5 3031.629 3009.853 3053.404
## 6 3355.431 3333.758 3377.105
ggplot(data = df, aes(x = drunk.driving, y = crimes.total)) +
geom_point()+
geom_smooth(method = "lm", level = 0.95)+
labs(title = "Linear Regression of Crimes Person by Murder")+
theme_minimal()## `geom_smooth()` using formula 'y ~ x'
Confidence interval illustration for
crimes.total ~ drunk.driving
After making predictions, let’s compare the performance of the following three models:
# r-squared
summary(model_base)$r.squared## [1] 0.5366806
summary(model_multi)$adj.r.squared## [1] 0.9512103
summary(model_backward)$adj.r.squared## [1] 0.9999939
pred_backward <- predict(object = model_backward , newdata = df)# RMSE
RMSE(y_pred = pred_crime, y_true = df$crimes.total)## [1] 2853.237
RMSE(y_pred = pred_crime_multi, y_true = df$crimes.total)## [1] 889.5715
RMSE(y_pred = pred_backward, y_true = df$crimes.total)## [1] 9.587221
Conclusion
What model we have to choose?
As we want our predictor is good to predict the data, the best step is we dont want assumption in our predcitor. Then let’s continue our analysis.
As a comparison, let’s compare model_multi, and model_backward.
The linear regression model is expected to produce errors that are normally distributed. That way, more errors cluster around the number zero.
hist() .
functionhist(model_multi$residuals)hist(model_backward$residuals)Insight :
shapiro.test()Shapiro-Wilk hypothesis test:
H0: normal distribution error
H1: NOT normal distribution error
if p-value > 0.05, accept H0
Result we want : H0
# shapiro test dari residual
shapiro.test(model_backward$residuals)##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.98037, p-value = 0.3795
shapiro.test(model_multi$residuals)##
## Shapiro-Wilk normality test
##
## data: model_multi$residuals
## W = 0.97387, p-value = 0.1769
Insight :
It is expected that the error generated by the model spreads randomly or with constant variation. When visualized, the error is not patterned. This condition is also known as homoscedasticity
fitted.values vs
residualsfitted.values adalah nilai hasil prediksi data
trainingresiduals adalah nilai error# scatter plot
plot(x = model_all$fitted.values, y = model_all$residuals)
abline(h = 0, col = "red") # garis horizontal di angka 0plot(x = model_all$fitted.values, y = model_multi$residuals)
abline(h = 0, col = "red") # garis horizontal di angka 0bptest() from lmtest
package.Breusch-Pagan hypothesis test:
Result we want: H0
# bptest dari model
library(lmtest)
bptest(model_multi)##
## studentized Breusch-Pagan test
##
## data: model_multi
## BP = 6.2736, df = 5, p-value = 0.2805
bptest(model_backward)##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 30.059, df = 9, p-value = 0.0004287
Insight :
Multicollinearity is a condition where there is a strong correlation between predictors. This is not desirable because it indicates a redundant predictor in the model, which should be able to choose only one of the variables with a very strong relationship. It is hoped that multicollinearity will not occur
Test the VIF (Variance Inflation Factor) with the vif()
function from the car package:
Result we want: VIF < 10
# vif dari model
library(car)## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model_multi)## murder sexual.offenses burglary fraud drunk.driving
## 2.158373 2.829396 2.133745 2.642805 3.039453
vif(model_backward)## crimes.penal.code crimes.person stealing.general burglary
## 545.360212 93.700149 348.235193 49.344501
## robbery criminal.damage other.penal.crimes drunk.driving
## 48.472616 58.110113 19.806063 4.038203
## population
## 66.176908
Insight :
So after our analysis, ranging from EDA, Modelling, Predciting, and
Assumption test, we can conclude, that in the end, the model we have to
choose is model_multi. Even the RMSE is bigger than the
model_backward, but is better than if our model have
assumption in it. We don’t want assumption in our model, because it’s
bad for predicting our data.