Kushagra Bhatnagar - s3666420
Last updated: 01 June, 2021
# RPubs link information
https://rpubs.com/krbhatnagar/776438
# Introduction
The objective of this exercise is to test a claim made by a leading American newspaper, The Washington Post, that killers are getting away with murder because of disparities in homicide arrest rates across US police departments. The newspaper claims that in over 50 percent of homicide cases there is no arrest. The data provided by the newspaper for fifty American cities will be tested using the statistical techniques outlined in the course, with the help of the R language.
# Problem Statement
The newspaper's claim is that more than 50 percent of homicide cases do not result in an arrest. The newspaper provides data for fifty American cities. The methodology adopted is to calculate, for each city, the proportion of cases that resulted in no arrest. A no-arrest rate of 49 percent is assumed; this is the null hypothesis H0. The alternative hypothesis H1 contradicts H0. We will reject H0 if the p-value, that is, the probability of observing data at least this extreme when H0 is true, is below the 5 percent significance level.
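Written out in symbols, the hypotheses and the decision rule can be stated as below. This is a sketch under two assumptions: the symbols $\mu$ (the mean no-arrest proportion across cities) and $\alpha$ (the significance level) are notation introduced here, and the two-sided form of H1 is taken from the t-test output reported later in this document ("true mean is not equal to 0.49").

$$
H_0: \mu = 0.49 \qquad H_1: \mu \neq 0.49
$$

$$
\text{Reject } H_0 \text{ if } p\text{-value} < \alpha = 0.05
$$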
# Data
The Washington Post is the primary source of data. They collated data on more than 52,000 criminal homicides in 50 of the largest American cities and mapped each homicide by geography in each city. Prior to publication they validated their analysis with local police departments, so we can have confidence in the integrity of the data. "The Post considered a homicide to be closed by arrest when police reported that to be the case" (The Washington Post, 2018). The data is sourced from: https://www.washingtonpost.com/graphics/2018/investigations/unsolved-homicide-database/ (2018).
# Data Cont.
The data is read from the csv file. With the help of R code, the data is preprocessed by labelling cases with an arrest as 1 and cases without an arrest as 0. The homicide cases are then grouped by the fifty cities and by arrest status, Yes or No. The most important variable is the 'No' variable. It is the backbone of the exercise because it holds the key information for hypothesis testing. All these steps are undertaken to make the data ready for descriptive statistics and hypothesis testing.
# Descriptive Statistics and Visualisation
Another very important variable is the newly created 'No_arrest_Prop'. This variable holds the proportion of cases in a city with no arrest. There were no outliers in the case-level data because the arrest value for each case is binary: 0 means no arrest and 1 means arrest. These values were assigned by the author for every case, so there is no chance of an outlier. We generate a histogram to visualise the data.
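As a quick worked check of how `No_arrest_Prop` is derived, the sketch below recomputes it for three cities from the `Yes` and `No` counts exactly as printed in the table in this section. The data frame name `counts` is only for illustration and is not part of the original analysis.

```r
# Counts copied from the table in this section (three cities only)
counts <- data.frame(
  city = c("Albuquerque", "Atlanta", "Baltimore"),
  Yes  = c(146, 373, 1825),
  No   = c(232, 600, 1002)
)

# Same formula as the analysis uses: No / (Yes + No)
counts$No_arrest_Prop <- counts$No / (counts$Yes + counts$No)
print(counts)
```

The three values should agree with the corresponding rows of the table to the displayed precision.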
```r
library(readr)  # read_csv()
library(dplyr)  # %>%, count(), mutate(), summarise()
library(tidyr)  # spread()
df <- read_csv("homicide-data.csv")
# Code cases closed without arrest or open with no arrest as "0", all other dispositions as "1"
df$Arrest <- ifelse(df$disposition == "Closed without arrest", "0",
                    ifelse(df$disposition == "Open/No arrest", "0", "1"))
# factor() pairs levels with labels in order: "0" is labelled "Yes", "1" is labelled "No"
df$Arrest <- df$Arrest %>% factor(levels = c(0, 1), labels = c("Yes", "No"))
df1 <- df %>% count(city, Arrest, sort = TRUE)        # cases per city and label
df2 <- spread(df1, key = "Arrest", value = "n")       # one row per city
df3 <- mutate(df2, No_arrest_Prop = No / (Yes + No))  # proportion analysed below
knitr::kable(df3)
```

| city | Yes | No | No_arrest_Prop |
|---|---|---|---|
| Albuquerque | 146 | 232 | 0.6137566 |
| Atlanta | 373 | 600 | 0.6166495 |
| Baltimore | 1825 | 1002 | 0.3544393 |
| Baton Rouge | 196 | 228 | 0.5377358 |
| Birmingham | 347 | 453 | 0.5662500 |
| Boston | 310 | 304 | 0.4951140 |
| Buffalo | 319 | 202 | 0.3877159 |
| Charlotte | 206 | 481 | 0.7001456 |
| Chicago | 4073 | 1462 | 0.2641373 |
| Cincinnati | 309 | 385 | 0.5547550 |
| Columbus | 575 | 509 | 0.4695572 |
| Dallas | 754 | 813 | 0.5188258 |
| Denver | 169 | 143 | 0.4583333 |
| Detroit | 1482 | 1037 | 0.4116713 |
| Durham | 101 | 175 | 0.6340580 |
| Fort Worth | 255 | 294 | 0.5355191 |
| Fresno | 169 | 318 | 0.6529774 |
| Houston | 1493 | 1449 | 0.4925221 |
| Indianapolis | 594 | 728 | 0.5506808 |
| Jacksonville | 597 | 571 | 0.4888699 |
| Kansas City | 486 | 704 | 0.5915966 |
| Las Vegas | 572 | 809 | 0.5858074 |
| Long Beach | 156 | 222 | 0.5873016 |
| Los Angeles | 1106 | 1151 | 0.5099690 |
| Louisville | 261 | 315 | 0.5468750 |
| Memphis | 483 | 1031 | 0.6809775 |
| Miami | 450 | 294 | 0.3951613 |
| Milwaukee | 403 | 712 | 0.6385650 |
| Minneapolis | 187 | 179 | 0.4890710 |
| Nashville | 278 | 489 | 0.6375489 |
| New Orleans | 930 | 504 | 0.3514644 |
| New York | 243 | 384 | 0.6124402 |
| Oakland | 508 | 439 | 0.4635692 |
| Oklahoma City | 326 | 346 | 0.5148810 |
| Omaha | 169 | 240 | 0.5867971 |
| Philadelphia | 1360 | 1677 | 0.5521897 |
| Phoenix | 504 | 410 | 0.4485777 |
| Pittsburgh | 337 | 294 | 0.4659271 |
| Richmond | 113 | 316 | 0.7365967 |
| Sacramento | 139 | 237 | 0.6303191 |
| San Antonio | 357 | 476 | 0.5714286 |
| San Bernardino | 170 | 105 | 0.3818182 |
| San Diego | 175 | 286 | 0.6203905 |
| San Francisco | 336 | 327 | 0.4932127 |
| Savannah | 115 | 131 | 0.5325203 |
| St. Louis | 905 | 772 | 0.4603459 |
| Stockton | 266 | 178 | 0.4009009 |
| Tampa | 95 | 113 | 0.5432692 |
| Tulsa | 193 | 391 | 0.6695205 |
| Washington | 589 | 756 | 0.5620818 |

```r
# Five-number summary, mean, SD, count and missing values of the proportion
df3 %>% summarise(Min = min(No_arrest_Prop, na.rm = TRUE),
                  Q1 = quantile(No_arrest_Prop, probs = .25, na.rm = TRUE),
                  Median = median(No_arrest_Prop, na.rm = TRUE),
                  Q3 = quantile(No_arrest_Prop, probs = .75, na.rm = TRUE),
                  Max = max(No_arrest_Prop, na.rm = TRUE),
                  Mean = mean(No_arrest_Prop, na.rm = TRUE),
                  SD = sd(No_arrest_Prop, na.rm = TRUE),
                  n = n(),
                  Missing = sum(is.na(No_arrest_Prop)))

# Histogram of the city proportions with the hypothesised mean marked in red
df3$No_arrest_Prop %>% hist(main = "Histogram for homicide cases with no arrest across 50 US cities",
                            xlab = "Proportion of cases with no arrest")
abline(v = 0.49, col = "red", lwd = 2) # Hypothesised population mean
```

# Hypothesis Testing
Null hypothesis testing is chosen for the assignment. The rationale is that a claim of no arrest in over 50 percent of cases has been made. We set the status quo contrary to the claim: no arrest in 49 percent of cases. This is the null hypothesis H0. If the result of the hypothesis test is consistent with the status quo, the newspaper's original claim would not be supported. However, if the alternative hypothesis H1 is supported, the newspaper's claim is more likely to be correct. A one-sample t-test is selected because the hypothesised population mean and the sample mean are known, while the population standard deviation is unknown. The proportions are derived from over 52,000 homicide cases across 50 cities, so the sample of cities is large enough (n = 50) for the sampling distribution of the mean to be assumed approximately normal.

```
##
## One Sample t-test
##
## data: df3$No_arrest_Prop
## t = 2.9395, df = 49, p-value = 0.005003
## alternative hypothesis: true mean is not equal to 0.49
## 95 percent confidence interval:
## 0.5030647 0.5595288
## sample estimates:
## mean of x
## 0.5312968
```
The p-value of 0.005003 is very small, which indicates that the null hypothesis H0: μ = 0.49 is unlikely to be true. To decide whether to reject or fail to reject H0, we compare the p-value with the 5% significance level of the test. The p-value of 0.005003 is less than 0.05, so the null hypothesis is rejected. The 95 percent confidence interval of 0.503 to 0.560 also lies entirely above 0.50. The alternative hypothesis, which is consistent with the newspaper's claim that over 50 percent of cases result in no arrest, is therefore supported.
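The reported t statistic and p-value can be recomputed by hand from the city proportions. The sketch below is a minimal illustration, not the original chunk: the vector `props` is copied from the `No_arrest_Prop` column of the table, and the names `mu0`, `se`, `t_stat`, `p_value` and `res` are introduced here for the example. It applies the one-sample formula t = (mean − μ0) / (s / √n) with a two-sided p-value, then compares the result with the built-in `t.test()`.

```r
# City-level proportions copied from the No_arrest_Prop column of the table
props <- c(
  0.6137566, 0.6166495, 0.3544393, 0.5377358, 0.5662500,
  0.4951140, 0.3877159, 0.7001456, 0.2641373, 0.5547550,
  0.4695572, 0.5188258, 0.4583333, 0.4116713, 0.6340580,
  0.5355191, 0.6529774, 0.4925221, 0.5506808, 0.4888699,
  0.5915966, 0.5858074, 0.5873016, 0.5099690, 0.5468750,
  0.6809775, 0.3951613, 0.6385650, 0.4890710, 0.6375489,
  0.3514644, 0.6124402, 0.4635692, 0.5148810, 0.5867971,
  0.5521897, 0.4485777, 0.4659271, 0.7365967, 0.6303191,
  0.5714286, 0.3818182, 0.6203905, 0.4932127, 0.5325203,
  0.4603459, 0.4009009, 0.5432692, 0.6695205, 0.5620818
)

mu0 <- 0.49                                   # hypothesised mean under H0
n   <- length(props)                          # number of cities
se  <- sd(props) / sqrt(n)                    # standard error of the mean
t_stat  <- (mean(props) - mu0) / se           # one-sample t statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

# Built-in equivalent, for comparison with the manual calculation
res <- t.test(props, mu = mu0)

print(c(t = t_stat, df = n - 1, p = p_value))
print(res)
```

Because the inputs are the rounded values shown in the table, the results should match the reported output only up to rounding.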
# Discussion
While hypothesis testing is an acceptable form of evidence gathering, further testing is needed to have more confidence in the findings. Good preprocessing of the data also helps considerably, because it supports developing the hypothesis. More visualisation of the data would further enhance understanding of it.
To conclude, The Washington Post's claim that more than 50 percent of homicide cases do not result in an arrest is supported by the hypothesis test.