As this is a graded task for our Academy students, completion of the task is not optional and counts towards your final score.
Write a regression analysis report applying what you’ve learned in the workshop. Using the dataset provided to you, write up your findings on the socioeconomic variables most highly correlated with crime rates (crime_rate). Explain your recommendations where appropriate. To help you through the exercise, you should ask the following questions of your candidate model:
Students should be awarded the full points if:
1. The model achieves an adjusted R-squared value above the grading threshold of 0.701
2. The residual plot resembles a random scatterplot
First of all, my objective is to write up my findings on the socioeconomic variables most highly correlated with crime rates. Socioeconomics itself is the “social science that studies how economic activity affects and is shaped by social processes.” In general, it analyzes how societies progress, stagnate, or regress because of their local or regional economy, or the global economy. Societies are divided into 3 groups: 1. Social, 2. Cultural and 3. Economic.
This dataset was collected in 1960 and a full description of the dataset wasn’t conveniently available. Sammuel used the description he gathered from the authors of the MASS package. After he renamed the columns to make them easier to read, the variables are:
- percent_m: percentage of males aged 14-24
- is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
- mean_education: mean years of schooling
- police_exp: police expenditure in 1960 and 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-whites resident per 1000 people
- unemploy24_39: unemployment rate of urban males aged 14-24 and aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: avg time served in prisons
- crime_rate: crime rate in an unspecified category
To prepare the data, I take the crime data provided by Algorit.ma, drop the X column, and rename the remaining columns so they are easier to read.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Read the data and drop the X column
crime.dat <- read.csv("crime.csv") %>%
  select(-X)
# Rename the columns to more readable names
names(crime.dat) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
# Combine the 1959 and 1960 police expenditure, and the two unemployment rates
crime.dat$police_exp <- crime.dat$police_exp59 + crime.dat$police_exp60
crime.dat$unemploy24_39 <- crime.dat$unemploy_m24 + crime.dat$unemploy_m39
# Drop the original columns that were combined above
crime.dat <- subset(crime.dat, select = -c(police_exp59, police_exp60, unemploy_m24, unemploy_m39))
# Treat the Southern-state indicator as a categorical variable
crime.dat$is_south <- as.factor(crime.dat$is_south)
str(crime.dat)
## 'data.frame': 47 obs. of 14 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
## $ police_exp : int 114 198 89 290 210 233 161 224 127 139 ...
## $ unemploy24_39 : int 149 132 127 141 111 113 135 114 109 124 ...
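The preparation steps can also be tried without crime.csv. The following is a minimal sketch that assumes the file mirrors the UScrime data frame shipped with the MASS package (the values printed by str() above agree with it). The name crime_chk is hypothetical and used only for this check, and base R subsetting replaces select() so the example does not depend on dplyr.

```r
# Assumption: crime.csv has the same rows and column order as MASS::UScrime.
crime_chk <- MASS::UScrime

# Same readable names as in the report
names(crime_chk) <- c("percent_m", "is_south", "mean_education",
                      "police_exp60", "police_exp59", "labour_participation",
                      "m_per1000f", "state_pop", "nonwhites_per1000",
                      "unemploy_m24", "unemploy_m39", "gdp", "inequality",
                      "prob_prison", "time_prison", "crime_rate")

# Combine the paired columns, then drop the originals
crime_chk$police_exp    <- crime_chk$police_exp59 + crime_chk$police_exp60
crime_chk$unemploy24_39 <- crime_chk$unemploy_m24 + crime_chk$unemploy_m39
drop_cols <- c("police_exp59", "police_exp60", "unemploy_m24", "unemploy_m39")
crime_chk <- crime_chk[, !(names(crime_chk) %in% drop_cols)]

# Southern-state indicator as a factor, as in the report
crime_chk$is_south <- as.factor(crime_chk$is_south)

# Dimensions should match the str() output: 47 observations, 14 variables
dim(crime_chk)
head(crime_chk[, c("police_exp", "unemploy24_39", "crime_rate")])
```

If the assumption holds, the first rows printed by head() should line up with the police_exp, unemploy24_39 and crime_rate values shown by str() above.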
I built the model with backward stepwise selection (step() with direction = "backward") to predict the crime rate given a reasonable set of values for the predictor variables.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
crmodel.base <- lm(crime_rate ~ ., data = crime.dat)
summary(crmodel.base)
##
## Call:
## lm(formula = crime_rate ~ ., data = crime.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -400.63 -121.25 1.87 108.51 489.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.826e+03 1.590e+03 -3.664 0.000864 ***
## percent_m 8.275e+00 4.370e+00 1.894 0.067077 .
## is_south1 8.993e+01 1.497e+02 0.601 0.552082
## mean_education 1.379e+01 6.139e+00 2.246 0.031524 *
## labour_participation 3.886e-01 1.468e+00 0.265 0.792893
## m_per1000f 7.295e-01 2.031e+00 0.359 0.721723
## state_pop -9.503e-01 1.353e+00 -0.703 0.487250
## nonwhites_per1000 9.513e-02 6.593e-01 0.144 0.886142
## gdp 1.227e+00 1.081e+00 1.135 0.264602
## inequality 7.741e+00 2.373e+00 3.262 0.002574 **
## prob_prison -4.040e+03 2.316e+03 -1.744 0.090454 .
## time_prison 1.485e+00 7.059e+00 0.210 0.834669
## police_exp 5.776e+00 1.257e+00 4.597 6.01e-05 ***
## unemploy24_39 1.835e+00 1.943e+00 0.945 0.351754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220.2 on 33 degrees of freedom
## Multiple R-squared: 0.7675, Adjusted R-squared: 0.6759
## F-statistic: 8.381 on 13 and 33 DF, p-value: 4.116e-07
step(crmodel.base, direction="backward")
## Start: AIC=518.45
## crime_rate ~ percent_m + is_south + mean_education + labour_participation +
## m_per1000f + state_pop + nonwhites_per1000 + gdp + inequality +
## prob_prison + time_prison + police_exp + unemploy24_39
##
## Df Sum of Sq RSS AIC
## - nonwhites_per1000 1 1009 1600646 516.48
## - time_prison 1 2145 1601782 516.51
## - labour_participation 1 3396 1603033 516.55
## - m_per1000f 1 6255 1605891 516.63
## - is_south 1 17497 1617133 516.96
## - state_pop 1 23927 1623563 517.15
## - unemploy24_39 1 43247 1642883 517.71
## - gdp 1 62432 1662068 518.25
## <none> 1599636 518.45
## - prob_prison 1 147449 1747085 520.60
## - percent_m 1 173809 1773446 521.30
## - mean_education 1 244509 1844145 523.14
## - inequality 1 515785 2115422 529.59
## - police_exp 1 1024471 2624107 539.71
##
## Step: AIC=516.48
## crime_rate ~ percent_m + is_south + mean_education + labour_participation +
## m_per1000f + state_pop + gdp + inequality + prob_prison +
## time_prison + police_exp + unemploy24_39
##
## Df Sum of Sq RSS AIC
## - time_prison 1 3014 1603660 514.57
## - labour_participation 1 4926 1605572 514.63
## - m_per1000f 1 5514 1606160 514.64
## - state_pop 1 24238 1624883 515.19
## - is_south 1 25581 1626227 515.23
## - unemploy24_39 1 48928 1649574 515.90
## - gdp 1 61999 1662645 516.27
## <none> 1600646 516.48
## - prob_prison 1 151807 1752453 518.74
## - percent_m 1 199479 1800124 520.00
## - mean_education 1 243802 1844448 521.14
## - inequality 1 519500 2120145 527.69
## - police_exp 1 1356048 2956693 543.32
##
## Step: AIC=514.57
## crime_rate ~ percent_m + is_south + mean_education + labour_participation +
## m_per1000f + state_pop + gdp + inequality + prob_prison +
## police_exp + unemploy24_39
##
## Df Sum of Sq RSS AIC
## - m_per1000f 1 3815 1607476 512.68
## - labour_participation 1 5837 1609497 512.74
## - state_pop 1 21514 1625174 513.20
## - is_south 1 25696 1629356 513.32
## - unemploy24_39 1 50242 1653902 514.02
## - gdp 1 65128 1668788 514.44
## <none> 1603660 514.57
## - percent_m 1 227419 1831079 518.80
## - prob_prison 1 232450 1836111 518.93
## - mean_education 1 241857 1845517 519.17
## - inequality 1 522622 2126282 525.83
## - police_exp 1 1358738 2962398 541.41
##
## Step: AIC=512.68
## crime_rate ~ percent_m + is_south + mean_education + labour_participation +
## state_pop + gdp + inequality + prob_prison + police_exp +
## unemploy24_39
##
## Df Sum of Sq RSS AIC
## - labour_participation 1 14595 1622070 511.11
## - is_south 1 25496 1632971 511.42
## - state_pop 1 40425 1647901 511.85
## - gdp 1 68865 1676340 512.65
## <none> 1607476 512.68
## - unemploy24_39 1 106387 1713862 513.69
## - prob_prison 1 230168 1837643 516.97
## - percent_m 1 265485 1872961 517.87
## - mean_education 1 280743 1888218 518.25
## - inequality 1 558268 2165744 524.69
## - police_exp 1 1443206 3050682 540.79
##
## Step: AIC=511.11
## crime_rate ~ percent_m + is_south + mean_education + state_pop +
## gdp + inequality + prob_prison + police_exp + unemploy24_39
##
## Df Sum of Sq RSS AIC
## - is_south 1 14140 1636210 509.51
## - state_pop 1 45007 1667078 510.39
## <none> 1622070 511.11
## - gdp 1 85006 1707076 511.51
## - unemploy24_39 1 91852 1713922 511.69
## - prob_prison 1 227914 1849985 515.29
## - percent_m 1 278219 1900289 516.55
## - mean_education 1 406829 2028899 519.62
## - inequality 1 771466 2393536 527.39
## - police_exp 1 1430499 3052570 538.82
##
## Step: AIC=509.51
## crime_rate ~ percent_m + mean_education + state_pop + gdp + inequality +
## prob_prison + police_exp + unemploy24_39
##
## Df Sum of Sq RSS AIC
## - state_pop 1 45088 1681298 508.79
## <none> 1636210 509.51
## - unemploy24_39 1 85168 1721378 509.90
## - gdp 1 98370 1734580 510.26
## - prob_prison 1 219956 1856166 513.44
## - percent_m 1 325254 1961464 516.04
## - mean_education 1 403684 2039894 517.88
## - inequality 1 1008407 2644617 530.08
## - police_exp 1 1529176 3165386 538.53
##
## Step: AIC=508.79
## crime_rate ~ percent_m + mean_education + gdp + inequality +
## prob_prison + police_exp + unemploy24_39
##
## Df Sum of Sq RSS AIC
## <none> 1681298 508.79
## - unemploy24_39 1 89362 1770660 509.23
## - gdp 1 90985 1772283 509.27
## - prob_prison 1 187011 1868308 511.75
## - percent_m 1 393761 2075059 516.68
## - mean_education 1 516691 2197989 519.39
## - inequality 1 963607 2644905 528.09
## - police_exp 1 1603714 3285012 538.27
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + gdp +
## inequality + prob_prison + police_exp + unemploy24_39, data = crime.dat)
##
## Coefficients:
## (Intercept) percent_m mean_education gdp
## -5559.360 10.455 15.530 1.385
## inequality prob_prison police_exp unemploy24_39
## 8.514 -3384.687 5.522 1.876
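The AIC values in the trace above can be checked by hand. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below plugs in the RSS values printed in the trace. The helper step_aic is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper: the AIC variant that step() reports for lm fits.
# rss = residual sum of squares, n = observations, k = estimated coefficients
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n_obs <- 47

# Full model: 13 slopes + intercept, RSS taken from the "<none>" row at Start
aic_full <- step_aic(rss = 1599636, n = n_obs, k = 14)

# Final model: 7 slopes + intercept, RSS taken from the last "<none>" row
aic_final <- step_aic(rss = 1681298, n = n_obs, k = 8)

round(c(full = aic_full, final = aic_final), 2)
```

The two results should land close to the 518.45 and 508.79 reported in the trace. Small differences are possible because the RSS values there are printed as rounded integers.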
crmodel.backward <- lm(formula = crime_rate ~ percent_m + mean_education + gdp +
                         inequality + prob_prison + police_exp + unemploy24_39,
                       data = crime.dat)
summary(crmodel.backward)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + gdp +
## inequality + prob_prison + police_exp + unemploy24_39, data = crime.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -441.22 -103.95 -9.48 88.75 485.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5559.3597 1093.5825 -5.084 9.62e-06 ***
## percent_m 10.4554 3.4595 3.022 0.00442 **
## mean_education 15.5299 4.4858 3.462 0.00132 **
## gdp 1.3845 0.9530 1.453 0.15429
## inequality 8.5140 1.8008 4.728 2.94e-05 ***
## prob_prison -3384.6875 1625.0821 -2.083 0.04388 *
## police_exp 5.5221 0.9054 6.099 3.77e-07 ***
## unemploy24_39 1.8763 1.3032 1.440 0.15792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207.6 on 39 degrees of freedom
## Multiple R-squared: 0.7557, Adjusted R-squared: 0.7118
## F-statistic: 17.23 on 7 and 39 DF, p-value: 3.76e-10
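To see why the backward model clears the 0.701 threshold while the full model does not, it helps to work the adjusted R-squared out by hand. The block below uses the standard formula with n = 47 observations and p predictors, plugging in the rounded R-squared values printed in the two summaries, so the last digit can differ slightly from R's output.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

% Full model: p = 13, R^2 = 0.7675
\bar{R}^2_{\text{full}} = 1 - (1 - 0.7675)\,\frac{46}{33} \approx 0.676

% Backward model: p = 7, R^2 = 0.7557
\bar{R}^2_{\text{backward}} = 1 - (1 - 0.7557)\,\frac{46}{39} \approx 0.712
```

Dropping six weak predictors lowers R-squared only slightly (0.7675 to 0.7557) but raises the residual degrees of freedom from 33 to 39, and that is what lifts the adjusted value above the threshold.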
# Plot residuals against the observed crime rate, with a reference line at zero
plot(crime.dat$crime_rate, residuals(crmodel.backward), main = "Crime rate Scatterplot", sub = "Using Backward step", cex = 0.5)
abline(h = 0, col = "darksalmon", lwd = 2)
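Residuals plotted against the observed response slope upward by construction: for a least-squares fit with an intercept, their correlation with the response equals sqrt(1 - R-squared). For that reason the "random scatterplot" check is usually read from residuals against fitted values. The sketch below is one way to draw that plot. It assumes crime.csv matches MASS::UScrime and refits the backward model on the raw UScrime column names. The name fit_chk is hypothetical.

```r
# Assumption: crime.csv mirrors MASS::UScrime; I(Po1 + Po2) and I(U1 + U2)
# stand in for the report's police_exp and unemploy24_39 columns.
fit_chk <- lm(y ~ M + Ed + GDP + Ineq + Prob + I(Po1 + Po2) + I(U1 + U2),
              data = MASS::UScrime)

res_chk <- residuals(fit_chk)
fit_val <- fitted(fit_chk)

# Residuals vs fitted values: look for no trend and roughly constant spread
plot(fit_val, res_chk,
     main = "Residuals vs fitted values",
     sub = "Backward-step model",
     xlab = "Fitted crime rate", ylab = "Residual", cex = 0.5)
abline(h = 0, col = "darksalmon", lwd = 2)

# Residuals are uncorrelated with fitted values, but not with the response
cor(fit_val, res_chk)                 # essentially zero
cor(MASS::UScrime$y, res_chk)         # equals sqrt(1 - R-squared)
sqrt(1 - summary(fit_chk)$r.squared)
```

A visible upward drift in the residuals-vs-observed plot is therefore expected and does not by itself indicate a problem with the model. A pattern in the residuals-vs-fitted plot would.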