Crime - Final Project

Max Cooney

April 18 2022

Data preprocessing

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. The steps involved in data preprocessing include:
1. Data Cleaning:

The data can have many irrelevant and missing parts. Data cleaning handles these: it involves dealing with missing data, outliers, and so on. There are various ways of dealing with missing data or outliers. One way is to fill in the missing values manually, with the attribute mean or with the most probable value. The other method is to remove the missing values or outliers.
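
The two strategies for missing values described above can be sketched in base R. The vector `x` below is made up for illustration and is not part of the crime data.

```r
# Hypothetical numeric attribute with two missing values (not from the crime data)
x <- c(12.1, NA, 9.8, 15.3, NA, 11.0)

# Strategy 1: fill missing values with the attribute mean
x_filled <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Strategy 2: drop the missing values
x_dropped <- x[!is.na(x)]

x_filled
x_dropped
```

Filling with the mean keeps every observation but shrinks the variance, while dropping keeps the observed spread at the cost of fewer rows.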

Introduction

Crime has long been attributed to poverty. Areas with high poverty are expected to experience high crime rates. We will run a linear regression to check whether there is a relationship between crime and poverty level.

#libraries
library(tidyverse)
library(MASS)
library(psych)
library(forcats)
library(readxl)
library(corrplot)

df <- read_excel("Crime - Project.xlsx", range = "A1:J51")

str(df)

## tibble [50 x 10] (S3: tbl_df/tbl/data.frame)
##  $ ...1       : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Poverty    : num [1:50] 15.7 8.4 14.7 17.3 13.3 11.4 9.3 10 13.2 14.7 ...
##  $ Infant Mort: num [1:50] 9 6.9 6.4 8.5 5 5.7 6.2 8.3 7.3 8.1 ...
##  $ White      : num [1:50] 71 70.6 86.5 80.8 76.6 ...
##  $ Crime      : num [1:50] 448 661 483 529 523 ...
##  $ Doctors    : num [1:50] 218 228 210 203 269 ...
##  $ Traf Deaths: num [1:50] 1.81 1.63 1.69 1.96 1.21 1.14 0.86 1.23 1.56 1.46 ...
##  $ University : num [1:50] 22 27.3 25.1 18.8 29.6 35.6 35.6 27.5 25.8 27.5 ...
##  $ Unemployed : num [1:50] 5 6.7 5.5 5.1 7.2 4.9 5.7 4.8 6.2 6.2 ...
##  $ Income     : num [1:50] 42666 68460 50958 38815 61021 ...

names(df)<-c("State","Poverty","Infant.Mort","White", "Crime","Doctors",
                   "Traf.Deaths","University","Unemployed","Income" )

anyNA(df)

## [1] FALSE

The str function shows how the data is structured. The data contains 50 observations and 10 variables. All the variables are of numeric data type except the State variable, which is of character data type. We also renamed the variables to remove the spaces in the variable names. There are no missing values in the data.

#summary
describe(df) %>%
  dplyr::select(vars, n, mean, sd, median, min, max, range)

##             vars  n     mean      sd   median      min      max    range
## State*         1 50    25.50   14.58    25.50     1.00    50.00    49.00
## Poverty        2 50    12.73    2.94    12.40     7.60    21.20    13.60
## Infant.Mort    3 50     6.83    1.34     6.85     4.70    10.60     5.90
## White          4 50    81.97   11.97    84.53    29.67    96.41    66.74
## Crime          5 50   407.46  183.58   345.50   118.00   788.30   670.30
## Doctors        6 50   260.28   64.37   249.04   168.83   469.01   300.17
## Traf.Deaths    7 50     1.40    0.39     1.38     0.76     2.45     1.69
## University     8 50    26.94    4.76    26.20    17.10    38.10    21.00
## Unemployed     9 50     5.27    1.25     5.30     3.00     8.40     5.40
## Income        10 50 51985.10 8592.66 50173.00 37790.00 70545.00 32755.00

par(mfrow=c(1,2))
boxplot(df$Crime, main="Crime") #Outcome variable 
boxplot(df$Poverty, main= "Poverty") #Independent Variable

#remove outliers flagged by the boxplot rule
out.l <- boxplot(df$Poverty, plot = FALSE)$out

# %in% handles zero, one, or several flagged values safely
df <- df[!df$Poverty %in% out.l, ]

The describe function gives summary statistics of the data, i.e., mean, median, standard deviation, range, etc. Box plots were used to check for outliers in both the outcome variable and the independent variable. The plots show an outlier in the independent variable, which was removed.
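
The rule that a box plot uses to flag outliers can be sketched on a small made-up vector. The values below are hypothetical and are not taken from the crime data; `boxplot.stats()` is the base R helper that `boxplot()` relies on.

```r
# Hypothetical values with one unusually large observation (not the crime data)
x <- c(9.3, 10.0, 11.4, 12.4, 13.2, 13.3, 14.7, 15.7, 17.3, 30.0)

# boxplot.stats() applies the same rule that boxplot() uses: points more than
# 1.5 times the hinge spread beyond the hinges are flagged as outliers
flagged <- boxplot.stats(x)$out

# Keep only the values that were not flagged; %in% also behaves correctly
# when zero or several values are flagged
x_clean <- x[!x %in% flagged]

flagged
x_clean
```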

# Reorder following the value of another column:
df %>%
  mutate(State = fct_reorder(State, Crime)) %>%
  arrange(desc(Crime))%>%
  head(10)%>%
  ggplot( aes(x=State, y=Crime)) +
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
  coord_flip() +
  xlab("") +
  theme_bw()

The plot shows the top ten states by crime rate. South Carolina leads the list.

Data distribution

par(mfrow=c(2,2))
hist(df$Crime, breaks = 20, xlab = "Crime", main = "Crime")
hist(df$Poverty, breaks = 20, xlab = "Poverty", main = "Poverty")
hist(df$Unemployed, breaks = 20, xlab = "Unemployment", main = "Unemployment")
hist(df$Income, breaks = 20, xlab="Income", main = "Income")

The histograms show the distribution of the data. It is evident from the plots that all four variables are slightly skewed to the right.

Correlation

cor(df$Poverty, df$Crime)

## [1] 0.3457811

df%>%
  ggplot(aes(y= Crime, x= Poverty))+
  geom_point()+
  geom_smooth(se=F, method = "lm")+
  labs(title = "Poverty Vs Crime Scatter Plot")+
  theme_bw()

#remove the state variable (cor() needs numeric columns)
df <- df %>%
  dplyr::select(-State)

corrplot::corrplot(cor(df), method = "number")

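There is a weak positive correlation between poverty and crime. The scatter plot shows that an increase in poverty is associated with an increase in crime. The corrplot shows the correlations among the variables. No pair of independent variables is perfectly correlated (all correlation coefficients are less than 1), which rules out perfect collinearity. Strictly, concluding that there is little or no multicollinearity requires the coefficients to be well below 1 in absolute value, not merely below 1.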
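
Pairwise correlations can miss collinearity that involves several predictors at once, so a variance inflation factor (VIF) is a common complementary check. The sketch below uses simulated data, not the crime data, and computes the VIF by hand in base R; the variable names `x1`, `x2` and `x3` are hypothetical.

```r
set.seed(1)  # simulated data for illustration only, not the crime data
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                 # independent of the others
sim <- data.frame(x1, x2, x3)

# Pairwise correlations: large absolute values signal potential collinearity
round(cor(sim), 2)

# VIF for each predictor: 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the remaining predictors
vif <- sapply(names(sim), function(v) {
  f  <- reformulate(setdiff(names(sim), v), response = v)
  r2 <- summary(lm(f, data = sim))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

A VIF near 1 indicates little shared variance with the other predictors; values above roughly 5 to 10 are often treated as a warning sign.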

Regression model

The first model will be a simple linear regression with crime as the outcome variable and Poverty as the independent variable. Later we will add more variables to the model.

m1<- lm(Crime~Poverty, data = df)
summary(m1)

## 
## Call:
## lm(formula = Crime ~ Poverty, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285.70  -95.82  -34.62   82.19  370.55 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  112.805    120.202   0.938    0.353  
## Poverty       23.650      9.361   2.526    0.015 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 175.2 on 47 degrees of freedom
## Multiple R-squared:  0.1196, Adjusted R-squared:  0.1008 
## F-statistic: 6.383 on 1 and 47 DF,  p-value: 0.01495

The model is significant, with a p-value (0.01495) below 0.05. The R-squared is quite low: the model explains 11.96% of the variance in crime. If poverty were zero, crime per 100,000 people would be 112.805, although this intercept is not statistically significant (p = 0.353). Poverty was a significant predictor of crime, with a p-value of 0.015. A one-unit increase in poverty is associated with an increase of 23.650 in crime per 100,000 people.
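
As a quick check on these numbers, the sketch below recomputes the R-squared from the correlation reported earlier and evaluates the fitted line at one point. The poverty value of 15 is an arbitrary illustrative choice, not a value singled out in the analysis.

```r
# Values copied from the cor() and summary(m1) output
r         <- 0.3457811
intercept <- 112.805
slope     <- 23.650

# In simple linear regression, R-squared equals the squared correlation
r^2

# Fitted line: Crime = intercept + slope * Poverty
# Poverty = 15 is an arbitrary illustrative value
predicted_crime <- intercept + slope * 15
predicted_crime
```

The squared correlation rounds to the 0.1196 shown in the model summary, which confirms that the two outputs are consistent.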

par(mfrow=c(2,2))
plot(m1)

The normal Q-Q plot is used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line. The Q-Q plot suggests that the normality assumption was not met. The Residuals vs Fitted plot was used to check the linearity assumption. A horizontal line without distinct patterns indicates a linear relationship; the assumption was met. The Residuals vs Leverage plot was used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. The homogeneity of residual variance assumption, which the Scale-Location plot addresses, was also met.
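
The visual checks above can be complemented with numeric ones. The sketch below is run on simulated data rather than the crime data, and the choice of the Shapiro-Wilk test and the 4 / n cutoff for Cook's distance are assumptions of this example, not steps taken in the analysis.

```r
set.seed(42)  # simulated data for illustration only, not the crime data
x   <- runif(49, 8, 20)
y   <- 100 + 25 * x + rnorm(49, sd = 150)
fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals: a small p-value suggests non-normality
sw <- shapiro.test(residuals(fit))
sw

# Cook's distance: observations above 4 / n are commonly inspected
# as potentially influential
cd <- cooks.distance(fit)
which(cd > 4 / length(cd))
```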

The Income variable was then added to the model.

m2<- lm(Crime~Poverty+Income, data = df)
summary(m2)

## 
## Call:
## lm(formula = Crime ~ Poverty + Income, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -251.45 -127.53  -30.95   65.69  358.14 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.189e+03  4.388e+02  -2.709 0.009443 ** 
## Poverty      6.352e+01  1.560e+01   4.071 0.000183 ***
## Income       1.532e-02  4.998e-03   3.065 0.003631 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 161.3 on 46 degrees of freedom
## Multiple R-squared:  0.2689, Adjusted R-squared:  0.2371 
## F-statistic:  8.46 on 2 and 46 DF,  p-value: 0.0007436

The model’s R-squared improved to 26.89%. Poverty and Income are both significant predictors of crime, with p-values of 0.000183 and 0.003631 respectively. Holding income constant, a one-unit increase in poverty is associated with an increase of 63.52 in crime per 100,000 people, while, holding poverty constant, a one-unit increase in median household income is associated with an increase of 0.01532.

m3<- lm(Crime~Poverty+Income+Unemployed, data = df)
summary(m3)

## 
## Call:
## lm(formula = Crime ~ Poverty + Income + Unemployed, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -267.90  -98.48  -28.71   76.68  356.64 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.020e+03  4.443e+02  -2.295  0.02643 * 
## Poverty      5.091e+01  1.726e+01   2.950  0.00503 **
## Income       1.180e-02  5.386e-03   2.192  0.03360 * 
## Unemployed   3.309e+01  2.070e+01   1.598  0.11704   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 158.7 on 45 degrees of freedom
## Multiple R-squared:  0.3082, Adjusted R-squared:  0.262 
## F-statistic: 6.682 on 3 and 45 DF,  p-value: 0.0007935

The third model had three independent variables, i.e., poverty, income and unemployment. Unemployment was not a significant predictor of crime, since its p-value (0.11704) was greater than the 0.05 significance level. Holding the other variables constant, a one-unit increase in poverty is associated with an increase of 50.91 in crime per 100,000 people, and a one-unit increase in income with an increase of 0.0118. The estimated coefficient for unemployment is 33.09, but it is not statistically significant. The model’s R-squared increased to 0.3082, which means that the model explains 30.82% of the variance in crime.
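
Because R-squared never decreases when a predictor is added, the adjusted R-squared is the fairer basis for comparing the three models. The sketch below recomputes it from the reported R-squared values; the helper name `adj_r2` is hypothetical, and n = 49 is inferred from the residual degrees of freedom in the summaries.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)
# n = number of observations, p = number of predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 49  # inferred from the summaries (e.g. 47 residual df with 2 coefficients)

adj_r2(0.1196, n, 1)  # m1
adj_r2(0.2689, n, 2)  # m2
adj_r2(0.3082, n, 3)  # m3
```

The results agree with the adjusted values printed in the summaries up to rounding of the inputs, and they still rise from m1 to m3, although by less than the raw R-squared.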

We will use stepwise regression to come up with the best model. Stepwise regression (or stepwise selection) consists of iteratively adding and removing predictors in the predictive model in order to find the subset of variables in the data set that results in the best-performing model, that is, a model that lowers prediction error. The stepAIC function from the MASS package compares the candidate models by their AIC.

# Fit the full model 
full.model <- lm(Crime ~., data = df)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", 
                      trace = FALSE)
summary(step.model)

## 
## Call:
## lm(formula = Crime ~ Poverty + Infant.Mort + Traf.Deaths + Unemployed + 
##     Income, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -255.75 -102.87   10.99   75.14  307.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.591e+03  4.004e+02  -3.975 0.000265 ***
## Poverty      3.293e+01  1.564e+01   2.106 0.041081 *  
## Infant.Mort  5.703e+01  1.942e+01   2.936 0.005316 ** 
## Traf.Deaths  1.539e+02  7.808e+01   1.971 0.055131 .  
## Unemployed   3.675e+01  1.876e+01   1.959 0.056579 .  
## Income       1.522e-02  4.656e-03   3.270 0.002125 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135.1 on 43 degrees of freedom
## Multiple R-squared:  0.5206, Adjusted R-squared:  0.4649 
## F-statistic:  9.34 on 5 and 43 DF,  p-value: 4.439e-06

The best model has Poverty, Infant.Mort, Traf.Deaths, Unemployed and Income as the predictor variables. The R-squared improved to 0.5206, which means that the model explains 52.06% of the variance in crime.
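
To illustrate how an AIC-based search decides which predictors to keep, the sketch below uses simulated data (not the crime data) with one real predictor and one pure-noise predictor. It uses `step()` from the base stats package as a stand-in for `stepAIC()`; the variable names are hypothetical.

```r
set.seed(7)  # simulated data for illustration only, not the crime data
n     <- 100
x1    <- rnorm(n)
noise <- rnorm(n)               # unrelated to the outcome
y     <- 3 + 2 * x1 + rnorm(n)
sim   <- data.frame(y, x1, noise)

small <- lm(y ~ 1, data = sim)
full  <- lm(y ~ x1 + noise, data = sim)

# Lower AIC is better; each extra coefficient adds a penalty of 2
AIC(small, full)

# step() runs an AIC-based search, dropping or re-adding one term at a time
chosen <- step(full, direction = "both", trace = 0)
formula(chosen)
```

A predictor stays in the model only if the improvement in fit outweighs the penalty for the extra coefficient, which is why a stepwise result can retain terms whose individual p-values are above 0.05.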

Let us assume that we have a state with the following attributes: Poverty = 13.3, Infant Mort = 10.6, Traf Deaths = 1.63, Unemployed = 7.2 and Income = 43,733. The predicted crime rate would be:

m5 <- lm(formula = Crime ~ Poverty + Infant.Mort + Traf.Deaths +
           Unemployed + Income, data = df)

pddf <- data.frame(Poverty = 13.3, Infant.Mort = 10.6, Traf.Deaths = 1.63,
                   Unemployed = 7.2, Income = 43733)

y <- predict(m5, newdata = pddf)
y

##        1 
## 632.4686
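
The same prediction can be reproduced by hand from the coefficients of the stepwise model. The sketch below uses the coefficients as rounded in the printed summary, so the result is only expected to be close to the value returned by predict(), not identical.

```r
# Coefficients rounded as printed in the stepwise model summary
b <- c(Intercept = -1591, Poverty = 32.93, Infant.Mort = 57.03,
       Traf.Deaths = 153.9, Unemployed = 36.75, Income = 0.01522)

# Attributes of the hypothetical state, with a leading 1 for the intercept
x <- c(1, 13.3, 10.6, 1.63, 7.2, 43733)

# Linear predictor: sum of coefficient * value
manual <- sum(b * x)
manual
```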

Conclusion

The study shows that there is a positive relationship between crime and poverty level. Areas with high poverty levels tend to have high crime rates.