Chances of police arrest in homicide cases in American cities

Assignment 2

Kushagra Bhatnagar - s3666420

Last updated: 01 June, 2021



# RPubs link information

https://rpubs.com/krbhatnagar/776438



# Introduction
The objective of this exercise is to test a claim made by The Washington Post, a leading American newspaper, that killers are getting away with murder because of disparities in US police homicide arrest rates. The newspaper claims that more than 50 percent of homicide cases end without an arrest. The data it published for fifty American cities are tested here using the statistical techniques outlined in the course, implemented in R.


# Problem Statement
The newspaper claims that more than 50 percent of homicide cases do not result in a police arrest, and it provides data for fifty American cities. The methodology is to calculate, for each city, the proportion of cases that ended without an arrest. The null hypothesis H0 assumes a mean no-arrest proportion of 49 percent; the alternative hypothesis H1 contradicts H0. H0 is rejected if the probability of the observed data under H0 (the p-value) is below the 5 percent significance level.
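
The hypotheses can be written compactly as follows. This restates the two-sided test specified in the hypothesis testing code, with μ denoting the mean city-level no-arrest proportion and α the significance level:

$$
H_0: \mu = 0.49 \qquad H_1: \mu \neq 0.49
$$

$$
\text{reject } H_0 \text{ if } p < \alpha = 0.05
$$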

# Data
The Washington Post is the primary source of data. It collated data on more than 52,000 criminal homicides in 50 of the largest American cities and mapped each homicide by location within its city. Prior to publication, it validated its analysis with local police departments, which gives some confidence in the integrity of the data. "The Post considered a homicide to be closed by arrest when police reported that to be the case" (The Washington Post, 2018). The data is sourced from: https://www.washingtonpost.com/graphics/2018/investigations/unsolved-homicide-database/ (2018).

# Data Cont.
The data are read from a CSV file. R code preprocesses the data by labelling cases with an arrest as 1 and cases without an arrest as 0. The cases are then grouped by city and counted by arrest status (Yes or No). The most important variable is 'No', because it holds the key information for hypothesis testing. These steps prepare the data for descriptive statistics and hypothesis testing.
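
The coding and grouping steps can be illustrated on a toy data frame. This is a minimal base R sketch, not the author's pipeline: the rows in `toy` are hypothetical, `no_arrest_labels` is an illustrative name, and "Closed by arrest" is assumed to be the third disposition value.

```r
# Toy data: hypothetical rows, not taken from the Washington Post file
toy <- data.frame(
  city = c("A", "A", "A", "B", "B"),
  disposition = c("Closed by arrest", "Open/No arrest",
                  "Closed without arrest", "Closed by arrest",
                  "Closed by arrest"),
  stringsAsFactors = FALSE
)

# Dispositions treated as "no arrest"
no_arrest_labels <- c("Closed without arrest", "Open/No arrest")

# 1 = arrest, 0 = no arrest, matching the coding described above
toy$Arrest <- ifelse(toy$disposition %in% no_arrest_labels, 0L, 1L)

# Cases per city by arrest code (columns are named "0" and "1")
counts <- table(toy$city, toy$Arrest)

# Proportion of each city's cases that ended without an arrest
no_arrest_prop <- counts[, "0"] / rowSums(counts)
print(no_arrest_prop)
```

Keeping the numeric codes as column names until the proportion is computed makes it easy to confirm that the numerator really counts the no-arrest cases.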

# Descriptive Statistics and Visualisation

Another very important variable is the derived 'No_arrest_Prop', which holds the proportion of cases in each city with no arrest. The case-level arrest indicator is binary (0 means no arrest, 1 means arrest) and was assigned by the author, so it cannot contain outliers. A histogram is generated to visualise the data.
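
The city-level proportions are continuous, so they can still be screened for unusual values. The sketch below assumes the common 1.5 × IQR rule, which is not part of the original analysis; `flag_outliers` is a hypothetical helper, and `props` holds five proportions copied from the results table purely for illustration.

```r
# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr
}

# Five city proportions copied from the results table (illustration only)
props <- c(Chicago = 0.2641373, Baltimore = 0.3544393, Boston = 0.4951140,
           Dallas = 0.5188258, Richmond = 0.7365967)

print(flag_outliers(props))
```

On the full analysis the same helper could be applied to `df3$No_arrest_Prop`.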

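```r
library(readr)  # read_csv()
library(dplyr)  # %>%, count(), mutate(), summarise()
library(tidyr)  # spread()
df <- read_csv("homicide-data.csv")
# Code each case: "0" = no arrest, "1" = closed by arrest
df$Arrest <- ifelse(df$disposition == "Closed without arrest", "0",
                    ifelse(df$disposition == "Open/No arrest", "0", "1"))
# Labels follow level order: "0" (no arrest) is labelled "Yes", "1" (arrest) is labelled "No"
df$Arrest <- df$Arrest %>% factor(levels = c(0, 1), labels = c("Yes", "No"))
# Count cases per city and label, then spread the labels into columns
df1 <- df %>% count(city, Arrest, sort = TRUE)
df2 <- spread(df1, key = "Arrest", value = "n")
df3 <- mutate(df2, No_arrest_Prop = No / (Yes + No))  # share of cases in the "No" column
knitr::kable(df3)
```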

| city | Yes | No | No_arrest_Prop |
|:---|---:|---:|---:|
| Albuquerque | 146 | 232 | 0.6137566 |
| Atlanta | 373 | 600 | 0.6166495 |
| Baltimore | 1825 | 1002 | 0.3544393 |
| Baton Rouge | 196 | 228 | 0.5377358 |
| Birmingham | 347 | 453 | 0.5662500 |
| Boston | 310 | 304 | 0.4951140 |
| Buffalo | 319 | 202 | 0.3877159 |
| Charlotte | 206 | 481 | 0.7001456 |
| Chicago | 4073 | 1462 | 0.2641373 |
| Cincinnati | 309 | 385 | 0.5547550 |
| Columbus | 575 | 509 | 0.4695572 |
| Dallas | 754 | 813 | 0.5188258 |
| Denver | 169 | 143 | 0.4583333 |
| Detroit | 1482 | 1037 | 0.4116713 |
| Durham | 101 | 175 | 0.6340580 |
| Fort Worth | 255 | 294 | 0.5355191 |
| Fresno | 169 | 318 | 0.6529774 |
| Houston | 1493 | 1449 | 0.4925221 |
| Indianapolis | 594 | 728 | 0.5506808 |
| Jacksonville | 597 | 571 | 0.4888699 |
| Kansas City | 486 | 704 | 0.5915966 |
| Las Vegas | 572 | 809 | 0.5858074 |
| Long Beach | 156 | 222 | 0.5873016 |
| Los Angeles | 1106 | 1151 | 0.5099690 |
| Louisville | 261 | 315 | 0.5468750 |
| Memphis | 483 | 1031 | 0.6809775 |
| Miami | 450 | 294 | 0.3951613 |
| Milwaukee | 403 | 712 | 0.6385650 |
| Minneapolis | 187 | 179 | 0.4890710 |
| Nashville | 278 | 489 | 0.6375489 |
| New Orleans | 930 | 504 | 0.3514644 |
| New York | 243 | 384 | 0.6124402 |
| Oakland | 508 | 439 | 0.4635692 |
| Oklahoma City | 326 | 346 | 0.5148810 |
| Omaha | 169 | 240 | 0.5867971 |
| Philadelphia | 1360 | 1677 | 0.5521897 |
| Phoenix | 504 | 410 | 0.4485777 |
| Pittsburgh | 337 | 294 | 0.4659271 |
| Richmond | 113 | 316 | 0.7365967 |
| Sacramento | 139 | 237 | 0.6303191 |
| San Antonio | 357 | 476 | 0.5714286 |
| San Bernardino | 170 | 105 | 0.3818182 |
| San Diego | 175 | 286 | 0.6203905 |
| San Francisco | 336 | 327 | 0.4932127 |
| Savannah | 115 | 131 | 0.5325203 |
| St. Louis | 905 | 772 | 0.4603459 |
| Stockton | 266 | 178 | 0.4009009 |
| Tampa | 95 | 113 | 0.5432692 |
| Tulsa | 193 | 391 | 0.6695205 |
| Washington | 589 | 756 | 0.5620818 |

# Descriptive Statistics Cont.

```r
# Five-number summary, mean, SD, and missing-value count for the city proportions
df3 %>% summarise(Min = min(No_arrest_Prop, na.rm = TRUE),
                  Q1 = quantile(No_arrest_Prop, probs = .25, na.rm = TRUE),
                  Median = median(No_arrest_Prop, na.rm = TRUE),
                  Q3 = quantile(No_arrest_Prop, probs = .75, na.rm = TRUE),
                  Max = max(No_arrest_Prop, na.rm = TRUE),
                  Mean = mean(No_arrest_Prop, na.rm = TRUE),
                  SD = sd(No_arrest_Prop, na.rm = TRUE),
                  n = n(),
                  Missing = sum(is.na(No_arrest_Prop)))

# Histogram of the city proportions with the hypothesised mean marked in red
df3$No_arrest_Prop %>%
  hist(main = "Histogram for homicide cases with no arrest across 50 US cities",
       xlab = "Proportion of cases with no arrest")
abline(v = 0.49, col = "red", lwd = 2)  # hypothesised mean under H0
```

# Hypothesis Testing

Null hypothesis testing is chosen for the assignment because the newspaper has made a specific claim: that more than 50 percent of cases end without an arrest. The status quo is set contrary to the claim, at a no-arrest proportion of 49 percent; this is the null hypothesis H0. If the test result is consistent with the status quo, the newspaper's claim is not supported. If the alternative hypothesis H1 is supported instead, the newspaper's claim is more likely to be correct. A one-sample t-test is selected because the hypothesised population mean and the sample mean are known while the population standard deviation is unknown. The test is applied to the 50 city-level proportions, which are derived from more than 52,000 homicide cases; with n = 50, the sampling distribution of the mean can be assumed to be approximately normal.

```r
# H0: mu = 0.49
# HA: mu != 0.49
t.test(df3$No_arrest_Prop, mu = 0.49)
```

```
## 
##  One Sample t-test
## 
## data:  df3$No_arrest_Prop
## t = 2.9395, df = 49, p-value = 0.005003
## alternative hypothesis: true mean is not equal to 0.49
## 95 percent confidence interval:
##  0.5030647 0.5595288
## sample estimates:
## mean of x 
## 0.5312968
```
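
As a cross-check, the t statistic and p-value can be reconstructed from the figures printed above. This is a minimal sketch that uses only the reported mean, confidence interval, and degrees of freedom; the variable names are illustrative, and the standard error is backed out of the interval width instead of being computed from the raw data.

```r
# Values copied from the t.test() output above
xbar <- 0.5312968                # sample mean of the city proportions
mu0  <- 0.49                     # hypothesised mean under H0
n    <- 50                       # number of cities (df = 49)
ci   <- c(0.5030647, 0.5595288)  # reported 95 percent confidence interval

# Standard error implied by the interval half-width
se <- (ci[2] - ci[1]) / (2 * qt(0.975, df = n - 1))

# t = (xbar - mu0) / se, with a two-sided p-value from the t distribution
t_stat  <- (xbar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

print(c(t = t_stat, p = p_value))
```

Agreement with the reported t = 2.9395 and p-value = 0.005003 confirms that the printed interval, mean, and test statistic are mutually consistent.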

# Discussion

The p-value of 0.005003 is small, indicating that the null hypothesis H0: μ = 0.49 is unlikely to be true. The decision to reject or fail to reject H0 is made against the 5% significance level of the test. Because 0.005003 is less than 0.05, the null hypothesis is rejected. The sample mean of 0.531 and the 95% confidence interval of 0.503 to 0.560, which lies entirely above 0.50, are consistent with the newspaper's claim that more than 50 percent of cases result in no arrest.

While hypothesis testing is an acceptable form of evidence gathering, further testing is needed to have more confidence in the findings. Careful preprocessing of the data also matters, because it helps in developing the hypothesis, and additional visualisations would further improve understanding of the data.

To conclude, The Washington Post's claim that more than 50 percent of homicide cases do not result in an arrest is endorsed by the author, as substantiated by the hypothesis test.

# References