# RPubs link information

https://rpubs.com/krbhatnagar/776438



# Introduction
The objective of this exercise is to test a claim made by a leading American newspaper, The Washington Post, that killers are getting away with murder because of disparities in US police homicide arrest rates. The newspaper claims that in over 50 percent of homicide cases there is no arrest. The data provided by the newspaper for fifty American cities will be tested in R using the statistical techniques outlined in the course.



# Problem Statement
The newspaper claims that more than 50 percent of homicide cases do not result in an arrest, and it provides data for fifty American cities. The methodology is to calculate, for each city, the proportion of cases that resulted in no arrest. A mean no-arrest rate of 49 percent is assumed. This is the null hypothesis H0. The hypothesis contradicting H0 is the alternative hypothesis H1. We will reject H0 if the p-value, the probability of observing data at least as extreme as ours when H0 is true, is below the 5 percent significance level.
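
Stated formally, and assuming the two-sided form used by the `t.test()` call in this analysis, the hypotheses and the test statistic can be written as follows. Here $\mu$ is the mean no-arrest proportion across cities, $\bar{x}$ and $s$ are the sample mean and standard deviation of the city proportions, and $n$ is the number of cities; these symbols are introduced here for illustration.

$$
H_0: \mu = 0.49 \qquad H_1: \mu \neq 0.49
$$

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \mu_0 = 0.49, \qquad df = n - 1
$$

H0 is rejected when the two-sided p-value for $t$ falls below 0.05.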

# Data
The Washington Post is the primary source of data. It collated data on more than 52,000 criminal homicides in 50 of the largest American cities and mapped each homicide by location within each city. Prior to publication, the newspaper validated its analysis with local police departments, so we can have confidence in the integrity of the data. "The Post considered a homicide to be closed by arrest when police reported that to be the case" (The Washington Post, 2018). The data is sourced from: https://www.washingtonpost.com/graphics/2018/investigations/unsolved-homicide-database/

# Data Cont.
The data is read from the csv file. Using R, the data is preprocessed by labelling cases with an arrest as 1 and cases without an arrest as 0. The homicide cases are then grouped by the fifty cities and split by arrest status, Yes or No. The most important variable is `No`. It is the backbone of the exercise because it holds the key information for hypothesis testing. All these steps make the data ready for descriptive statistics and hypothesis testing.
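
A minimal base-R sketch of this labelling and grouping on a made-up six-case data frame. The `toy` data frame, the city names `A` and `B`, and the `"Closed by arrest"` disposition value are assumptions for illustration and are not read from the data file.

```r
# Hypothetical toy data: six cases in two made-up cities (not the real file)
toy <- data.frame(
  city = c("A", "A", "A", "B", "B", "B"),
  disposition = c("Closed by arrest", "Open/No arrest", "Closed without arrest",
                  "Closed by arrest", "Closed by arrest", "Open/No arrest"),
  stringsAsFactors = FALSE
)

# A case counts as "no arrest" when it is still open or was closed without an arrest
no_arrest_dispositions <- c("Closed without arrest", "Open/No arrest")
toy$no_arrest <- toy$disposition %in% no_arrest_dispositions

# Per-city count of no-arrest cases, total cases, and the resulting proportion
no_arrest_n <- tapply(toy$no_arrest, toy$city, sum)
total_n <- tapply(toy$no_arrest, toy$city, length)
no_arrest_prop <- no_arrest_n / total_n

# City A has 2 of 3 cases without arrest, city B has 1 of 3
print(round(no_arrest_prop, 3))
```

Keeping the indicator as a logical column named `no_arrest` makes the direction of the coding explicit, so the proportion is simply the mean of that column within each city.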

# Descriptive Statistics and Visualisation

Another very important variable is the derived `No_arrest_Prop`, which is the proportion of cases in a city with no arrest. The underlying arrest indicator has no outliers because it is binary: 0 means no arrest and 1 means arrest. The author assigned these values to every case, so no other value can occur. A histogram is generated to visualise the data.
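
The city-level proportions are continuous even though the arrest indicator is binary, so they can be screened separately. One common option is the 1.5 × IQR boxplot rule. The sketch below assumes a hypothetical helper `iqr_outliers()` and a made-up vector `props`; neither is part of the original analysis.

```r
# Hypothetical helper: return the values lying outside the 1.5 * IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x[!is.na(x) & (x < q[[1]] - fence | x > q[[2]] + fence)]
}

# Made-up proportions for illustration; only 0.95 lies beyond the upper fence
props <- c(0.35, 0.45, 0.49, 0.52, 0.55, 0.58, 0.61, 0.95)
iqr_outliers(props)
```

With the real data, `iqr_outliers(df3$No_arrest_Prop)` would apply the same rule to the fifty city proportions.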

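```r
library(readr); library(dplyr); library(tidyr)  # read_csv(); count(), mutate(), %>%; spread()
df <- read_csv("homicide-data.csv")
# Code each case: "0" = no arrest (closed without arrest or still open), "1" = arrest
df$Arrest <- ifelse(df$disposition %in% c("Closed without arrest", "Open/No arrest"), "0", "1")
# Convert to a factor; note that level "0" is labelled "Yes" and level "1" is labelled "No"
df$Arrest <- df$Arrest %>% factor(levels = c(0, 1), labels = c("Yes", "No"))
# Count cases per city and label, then spread the labels into columns
df1 <- df %>% count(city, Arrest, sort = TRUE)
df2 <- spread(df1, key = "Arrest", value = "n")
# Share of each city's cases that fall in the "No" column
df3 <- mutate(df2, No_arrest_Prop = No / (Yes + No))
knitr::kable(df3)
```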

| city | Yes | No | No_arrest_Prop |
|:---|---:|---:|---:|
| Albuquerque | 146 | 232 | 0.6137566 |
| Atlanta | 373 | 600 | 0.6166495 |
| Baltimore | 1825 | 1002 | 0.3544393 |
| Baton Rouge | 196 | 228 | 0.5377358 |
| Birmingham | 347 | 453 | 0.5662500 |
| Boston | 310 | 304 | 0.4951140 |
| Buffalo | 319 | 202 | 0.3877159 |
| Charlotte | 206 | 481 | 0.7001456 |
| Chicago | 4073 | 1462 | 0.2641373 |
| Cincinnati | 309 | 385 | 0.5547550 |
| Columbus | 575 | 509 | 0.4695572 |
| Dallas | 754 | 813 | 0.5188258 |
| Denver | 169 | 143 | 0.4583333 |
| Detroit | 1482 | 1037 | 0.4116713 |
| Durham | 101 | 175 | 0.6340580 |
| Fort Worth | 255 | 294 | 0.5355191 |
| Fresno | 169 | 318 | 0.6529774 |
| Houston | 1493 | 1449 | 0.4925221 |
| Indianapolis | 594 | 728 | 0.5506808 |
| Jacksonville | 597 | 571 | 0.4888699 |
| Kansas City | 486 | 704 | 0.5915966 |
| Las Vegas | 572 | 809 | 0.5858074 |
| Long Beach | 156 | 222 | 0.5873016 |
| Los Angeles | 1106 | 1151 | 0.5099690 |
| Louisville | 261 | 315 | 0.5468750 |
| Memphis | 483 | 1031 | 0.6809775 |
| Miami | 450 | 294 | 0.3951613 |
| Milwaukee | 403 | 712 | 0.6385650 |
| Minneapolis | 187 | 179 | 0.4890710 |
| Nashville | 278 | 489 | 0.6375489 |
| New Orleans | 930 | 504 | 0.3514644 |
| New York | 243 | 384 | 0.6124402 |
| Oakland | 508 | 439 | 0.4635692 |
| Oklahoma City | 326 | 346 | 0.5148810 |
| Omaha | 169 | 240 | 0.5867971 |
| Philadelphia | 1360 | 1677 | 0.5521897 |
| Phoenix | 504 | 410 | 0.4485777 |
| Pittsburgh | 337 | 294 | 0.4659271 |
| Richmond | 113 | 316 | 0.7365967 |
| Sacramento | 139 | 237 | 0.6303191 |
| San Antonio | 357 | 476 | 0.5714286 |
| San Bernardino | 170 | 105 | 0.3818182 |
| San Diego | 175 | 286 | 0.6203905 |
| San Francisco | 336 | 327 | 0.4932127 |
| Savannah | 115 | 131 | 0.5325203 |
| St. Louis | 905 | 772 | 0.4603459 |
| Stockton | 266 | 178 | 0.4009009 |
| Tampa | 95 | 113 | 0.5432692 |
| Tulsa | 193 | 391 | 0.6695205 |
| Washington | 589 | 756 | 0.5620818 |

# Descriptive Statistics Cont.

```r
# Summary statistics for the city-level proportions
df3 %>% summarise(Min = min(No_arrest_Prop, na.rm = TRUE),
                  Q1 = quantile(No_arrest_Prop, probs = .25, na.rm = TRUE),
                  Median = median(No_arrest_Prop, na.rm = TRUE),
                  Q3 = quantile(No_arrest_Prop, probs = .75, na.rm = TRUE),
                  Max = max(No_arrest_Prop, na.rm = TRUE),
                  Mean = mean(No_arrest_Prop, na.rm = TRUE),
                  SD = sd(No_arrest_Prop, na.rm = TRUE),
                  n = n(),
                  Missing = sum(is.na(No_arrest_Prop)))

# Histogram of the proportions, with the mean assumed under H0 marked in red
df3$No_arrest_Prop %>% hist(main = "Histogram for homicide cases with no arrest across 50 US cities",
                            xlab = "Proportion of cases with no arrest")
abline(v = 0.49, col = "red", lwd = 2)  # Hypothesised mean under H0
```

# Hypothesis Testing

Null hypothesis significance testing is chosen for the assignment. The rationale is that the newspaper claims no arrest is made in more than 50 percent of cases. We set the status quo contrary to the claim: no arrest occurs in 49 percent of cases. This is the null hypothesis H0. If the result of the test is close to the status quo, the newspaper's claim would be debunked. However, if the alternative hypothesis H1 holds, the newspaper's claim is more likely to be correct. A one-sample t-test is selected because the hypothesised population mean and the sample mean are known, while the population standard deviation is unknown. The proportions are computed from over 52,000 homicide cases across 50 cities. With a sample of n = 50 cities, which is greater than 30, the sampling distribution of the mean can be assumed to be approximately normal.

```r
# H0: mu = 0.49
# H1: mu != 0.49 (two-sided)
t.test(df3$No_arrest_Prop, mu = 0.49)
```

```
## 
##  One Sample t-test
## 
## data:  df3$No_arrest_Prop
## t = 2.9395, df = 49, p-value = 0.005003
## alternative hypothesis: true mean is not equal to 0.49
## 95 percent confidence interval:
##  0.5030647 0.5595288
## sample estimates:
## mean of x 
## 0.5312968
```
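
The reported statistic can be cross-checked by hand from the printed output. The sketch below copies the mean, confidence interval, and degrees of freedom from the output, recovers the standard error from the interval half-width, and recomputes t and the two-sided p-value. The variable names are illustrative, and the results should agree with the reported t = 2.9395 and p = 0.005003 up to rounding.

```r
# Values copied from the printed t-test output
x_bar <- 0.5312968             # sample mean of the 50 city proportions
ci <- c(0.5030647, 0.5595288)  # 95 percent confidence interval
df_t <- 49                     # degrees of freedom, n - 1
mu0 <- 0.49                    # mean assumed under H0

# The CI half-width equals qt(0.975, df) * se, so the standard error is:
se <- (ci[2] - ci[1]) / 2 / qt(0.975, df_t)

# One-sample t statistic and its two-sided p-value
t_stat <- (x_bar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df_t)
round(c(t = t_stat, p = p_value), 4)
```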

# Discussion

The p-value of 0.005003 is very small. It indicates that the original assumption under H0, μ = 0.49, is likely to be wrong. The decision to reject or fail to reject H0 is made against the 5% significance level of the test. The p-value of 0.005003 is less than 0.05, so the null hypothesis is rejected. The sample mean of 0.531 and the 95 percent confidence interval of 0.503 to 0.560 both lie above 0.50, which supports the newspaper's claim that over 50 percent of cases result in no arrest.

While hypothesis testing is an acceptable form of evidence gathering, further testing is needed to have more confidence in the findings. Careful preprocessing of the data is also valuable because it helps in developing the hypothesis. More visualisations would further enhance understanding of the data.

To conclude, The Washington Post's claim that more than 50 percent of homicide cases do not result in an arrest is endorsed by the author and substantiated by hypothesis testing.

Chances of police arrest in homicide cases in American cities

Assignment 2
