library(GGally) # Visualisasi (korelasi)## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr) # Data Cleansing##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MLmetrics) # Metrics Error## Warning: package 'MLmetrics' was built under R version 4.2.2
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(lmtest) # Uji Asumsi## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car) # Uji Asumsi## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(rsample)## Warning: package 'rsample' was built under R version 4.2.2
crime <- read.csv("crime.csv")
crime <- crime[,-1]
names(crime) <- c("percent_m",
"is_south",
"mean_education",
"police_exp60",
"police_exp59",
"labour_participation",
"m_per1000f",
"state_pop",
"nonwhites_per1000",
"unemploy_m24",
"unemploy_m39",
"gdp",
"inequality",
"prob_prison",
"time_prison",
"crime_rate")
plot(crime$state_pop, crime$inequality)Before making a model, we have to make sure that the variables’ types are already correct.
str(crime)## 'data.frame': 47 obs. of 16 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Data Description:
Then, we will check if any missing values in the dataset
colSums(is.na(crime))## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
The result, we don’t find any missing value in the crime data set
Among all variables within our crime dataset, there is a crime_rate variable that describes the measure of crime rate of the United States in 1960 from different states. We want to see each variable’s correlation with crime_rate variable.
ggcorr(crime, label = T, label_size = 2.5, hjust = 1, layout.exp = 2)On the result of the previous chunk, we can see that not all variables have correlation with crime_rate variable. The variables that have the highest correlation with crime_rate are police_exp60 and police_exp59 with 0.7 correlation.
From the data exploration process and the correlation information of our dataset, we want to build a simple linear model using one of the highly correlated variables: police_exp59.
model_crime <- lm(crime_rate ~ police_exp59, crime)
summary(model_crime)##
## Call:
## lm(formula = crime_rate ~ police_exp59, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -595.58 -156.76 12.29 146.74 593.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 165.164 130.427 1.266 0.212
## police_exp59 9.222 1.537 6.001 3.11e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 291.4 on 45 degrees of freedom
## Multiple R-squared: 0.4445, Adjusted R-squared: 0.4322
## F-statistic: 36.01 on 1 and 45 DF, p-value: 3.114e-07
The R-squared of model_crime approximates 0.4445, which ideally needs to be improved upon by, example, add more predictor variables. We are going to use one of the techniques for selecting predictor variables which is using stepwise regression algorithm.
model_full <- lm(crime_rate ~ ., crime)
model_step <- step(model_full, direction = "backward", trace = 0)
summary(model_step)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
Step-wise regression method will produce an optimum formula based on the lowest value of AIC. After doing the step-wise regression method, compared to the first model that only use the police_exp59 variable, the regression model with percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison adjusted R-squared is 0.7444, higher than the first model which is 0.4322.
crime_test <- read.csv("crime_test.csv")
pred_crime1 <- predict(model_crime, crime_test)
pred_crime2 <- predict(model_step, crime_test)Mean Absolute Percentage Error (MAPE) shows the difference between the prediction and actual number in percentage. We are going to compare the mean absolute percentage error between the first model and the second model.
MAPE(pred_crime1, crime_test$crime_rate)## [1] 0.3046554
MAPE(pred_crime2, crime_test$crime_rate)## [1] 0.2339692
First model MAPE 0.3046554 means that by average the difference with the actual number is 30% and for the second model the difference with the actual number is 23%. We can conclude that the second model is better because it is lower than the first one.
First Model
shapiro.test(model_crime$residuals)##
## Shapiro-Wilk normality test
##
## data: model_crime$residuals
## W = 0.977, p-value = 0.4748
Second Model
shapiro.test(model_step$residuals)##
## Shapiro-Wilk normality test
##
## data: model_step$residuals
## W = 0.98511, p-value = 0.8051
The Shapiro Test above, we can conclude that the error is distributed normally for the first and second model because the P-value is higher than 0.05 which are 0.4748 and 0.8051
bptest(model_crime)##
## studentized Breusch-Pagan test
##
## data: model_crime
## BP = 22.918, df = 1, p-value = 1.69e-06
The Breusch-Pagan Test above, we can conclude that there is heteroscesdasticity on the first model because the P-value is lower than 0.05 which is 0.00000169. For the second model we conclude that there is homocesdasticity because the P-value is above 0.05, it means the error is constant.
For the first model, we cannot use the multicollienarity test because it only has one variable. Therefore we only present the multicollinenarity test for the second model.
vif(model_step)## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
Based on the Variance Inflation Factor Test above, we conclude that multicollinearity is not present in our model because the VIF values for all variables are below 10.
We can conclude that the second model or model_step which use the stepwise-regression method is better than the first model with only one variable that has the highest positive correlation with crime_rate variable. The reasons are:
But, we also find the better point of the first model model_crime which is it has a lower number of adjusted R square than the second one (0.4322 < 0.7444)