Introduction
What I`ll Do
i will use linear regression model using crime dataset. i want to know the relationship among variables, especially between the crime rate with other variables. and also want to predit the crime rate of city based on the historical data. crime dataset are contain Crime report data in several states in the US in 1960
Business Goal`s
In particular, the Government wants to look at the crime rate in an area:
- Which variable is significant in predicting the crime rate in an area
- Some of these variables both describe crime rates
Based on the data, the government has collected a large data set of various human information in an area.We will monitor to model crime rates with the available independent variables. This will be used by the Government to understand how crime rates are with the independent variables.They can accordingly manipulate the police expenditure, prob_prison ,etc. to meet certain crime rate. The government can use this information to anticipate crime so that can save the people / can reduce the crime rate in an area. Further, the model will be a good way for goverment to understand the crime-rate dynamics on a next year
Data Preparation
First, Load the required library package
# data wrangling
library(tidyverse)
# check asumsi model
library(lmtest)
library (car)
# calculate error
library(MLmetrics)
# correlation
library(GGally)
options(scipen = 999)Load dataset
crime <- read.csv("data_input/crime.csv")The data has 47 rows and 16 colomns. Our target variable is there crime_rate,
Data Description:
percent_m: percentage of males aged 14-24is_south: whether it is in a Southern state. 1 for Yes, 0 for No.mean_education: mean years of schoolingpolice_exp60: police expenditure in 1960police_exp59: police expenditure in 1959labour_participation: labour force participation ratem_per1000f: number of males per 1000 femalesstate_pop: state populationnonwhites_per1000: number of non-whites resident per 1000 peopleunemploy_m24: unemployment rate of urban males aged 14-24unemploy_m39: unemployment rate of urban males aged 35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: avg time served in prisonscrime_rate: crime rate in an unspecified category Produce a linear model
Check Dataset
Before we go further, first we need to make sure that our data is clean and will be useful.
Data Type
df_data_crime <- crime %>%
mutate(is_south = as.logical(is_south))
glimpse(df_data_crime)## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRU~
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~
Check Unique Data
Because variable is_south is logical, we must check number of data
table(df_data_crime$is_south)##
## FALSE TRUE
## 31 16
Based on the table, the total unique data is 47 rows, so there are no empty rows
Check Missing Value
Check it again for all variable
colSums(is.na(df_data_crime))## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
Exploratory Data Analysis
Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables.
Find the Pearson correlation between features.
ggcorr(df_data_crime %>% select(-is_south), label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 4,)The graphic shows that a lot of variable has strong correlation with the crime_rate variables.
Modeling
#model full / used all variable to be predictors
model_full_ptor <- lm(formula = crime_rate~., data = df_data_crime)
model_stepcrime<- step(model_full_ptor,direction = "both",trace = 0)
summary(model_stepcrime)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = df_data_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 0.0000040395 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 0.0000000826 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 0.0000863344 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 0.0000000001159
interpretation:
Adjusted RSquare: 74.44% = The model can explain 74.44% of the variation of the
crime_rate, the rest of the other factors that are not yet modeledPredictor p-value:
m_per1000fis not significant (no *), it means that the number of men / 10 000 women has little effect on the crime rate (even has no effect)unemployment_m24is not significant (no *), meaning that the unemployment rate for males aged 14-24 years has little effect on the crime rate (even does not affect it)
- If
percent_mincreases by 1, it will increase(because the value is positif) the crime rate to 9,332 - If
mean_educationincreases by 1, it will increase (because the value is positif) the crime rate to 18,012 - If
inequalityincreases by 1, it will increase (because the value is positif) the crime rate to 6,133 - If
police_exp60increases by 1, it will increase (because the value is positif) the crime rate to 10,265 - If
unemployment_m39increases by 1, it will increase (because the value is positif) the crime rate to 18,735 - If
prob_prisonincreases by 1, it will reduce(because the value is negatif) crime rate to 3796
Evaluation
Evaluasi Model
crime_test <- read.csv("data_input/crime_test.csv")
crime_test$is_south <- as.logical(crime_test$is_south)pred_crime_test <- Predict(model_stepcrime,newdata = crime_test)Check error From the model
# mean absolut error
MAE(pred_crime_test,crime_test$crime_rate)## [1] 180.7295
# mean absolute percentage error
MAPE(pred_crime_test,crime_test$crime_rate)## [1] 0.2339692
# Root Mean Squared Error
RMSE(pred_crime_test,crime_test$crime_rate)## [1] 215.5166
MAE 180.7295, the mean crime rate prediction missed by 180
MAPE 0.233 = average crime rate prediction missed by 23%
RMSE 215.5166, the average predicted crime rate missed was 215.51
Asumption
- linearity
ensure, whether the predictor (model step crime – without is_south) has a strong relationship
ggcorr(df_data_crime %>%
select(
percent_m,
mean_education,
police_exp60,
m_per1000f,
unemploy_m24,
unemploy_m39,
inequality,
prob_prison,
crime_rate
),label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 4)data.frame(prediksi = model_full_ptor$fitted.values, # prediksi
error = model_full_ptor$residuals) %>%
ggplot(aes(prediksi,error))+
geom_hline(yintercept= 0)+ # garis lurus disumbu y = 0
geom_point()+
geom_smooth()## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# errorbased on ggcor and fitted values, it can be seen that the relationship between the target and the predictors is quite strong, even though there are some predictors that are not very strongly correlated with the target.
- Normality Test The second assumption in linear regression is that the residuals follow normal distribution. We can easily check this by using the Saphiro-Wilk/plot-density/qqPlot normality test
2.1 first, check with density plot
plot(density((model_full_ptor$residuals)))the density plot shows the residual is normal distribution
2.2 Second, we use the shapiro test
# condition accept if : p-value > 0.05
shapiro.test(model_full_ptor$residuals)##
## Shapiro-Wilk normality test
##
## data: model_full_ptor$residuals
## W = 0.9846, p-value = 0.7849
because the p-value> 0.05, the residual / error is normally distributed
2.3 Third, we check the data distribution using qqPlot
qqPlot(model_full_ptor$residuals)## [1] 11 19
The qqPlot also shows that the data is normally distributed, as evidenced by all the points in the blue area (not outside the dotted line) and following the blue line pattern.
Based on the three tests, it can be concluded that the data error is normally distributed
- heterokesdasticity / unequal variance
Heterocedasticity means that the variances of the error terms are non-constant. One can identify non-constant variances in the errors from the presence of a funnel shape in the residual plot, same with the linearity one.
3.1 BP test
# condition accept if : p-value > 0.05
bptest(model_full_ptor)##
## studentized Breusch-Pagan test
##
## data: model_full_ptor
## BP = 18.469, df = 15, p-value = 0.2388
based on the bp test, the results obtained p-value> 0.05, which means that the conditions are met
3.2 check with plot
plot(model_full_ptor$fitted.values, # prediction
model_full_ptor$residuals) # errorfrom the plot also does not form a pattern so that the conditions are met.
- Multikolinearitas
# condition accept if : vif < 10
vif(model_stepcrime)## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
based on the test results, condition accept because there is no predictor value greater than 10
Conclusion
The variables that are useful for describing crime rates are percent_m, mean_education, police_exp60, m_per1000f, unemployment_m24, unemployment_m39, inequality, prob_prison, crime_rate. The R-squared model is high, with 78.88% of the variables explaining the crime rate. The accuracy of the model in predicting the crime rate as measured by RMSE, MAPE and MAE with values of 215.5166, 23% and 180 respectively.Based on the assumption test, the model used can measure the crime rate.
Factors that must be considered by the government in preventing an increase in the crime rate, namely
The unemployment factor aged 35-39 years
The factor of the number of males aged 14-24 years
Factors Average years of schooling
Equality factor
Police Expenditure Factors in 1959
The possible factor of imprisonment