Info

Objective

These homework problem sets are designed to help you understand the material better. Try the problems yourself first, and only then look at the model answers. You can use Generative AI to help, for example with a prompt such as “Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain”. It is also a good idea to give Generative AI an error message together with your code and ask it to help fix the error. But it is pointless to just solve all the questions with ChatGPT because you won’t learn anything.

Your task

Read the instructions and write your solutions to these questions in the space provided. Then check the model answers (the link is at the end of the notebook).

Police Violence and Shootings

In this task, we will find out which race is most often the victim of police murders. The original dataset is here:

https://www.kaggle.com/discussions/general/158339

Data

First, we read the data into R, at the same time cleaning it (the data is on Fedor’s dropbox, but you can download it from the link above instead - it is the same dataset):

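# Packages used throughout this notebook
library(tidyverse)  # read_csv(), dplyr verbs, parse_number(), enframe()
library(janitor)    # clean_names(), remove_constant(), tabyl()
library(lubridate)  # dmy()
library(zoo)        # as.yearmon()

police_kill <- read_csv("../Data/police_killings_MPV.csv") %>%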
  clean_names() %>%
  mutate(victims_age = parse_number(victims_age)) %>%
  rename(date_of_incident = date_of_incident_month_day_year) %>%
  mutate(date_of_incident = dmy(date_of_incident)) %>%
  rename(alleged_weapon = alleged_weapon_source_wa_po_and_review_of_cases_not_included_in_wa_po_database) %>%
  rename(alleged_threat = alleged_threat_level_source_wa_po) %>%
  rename(mental_symptoms = symptoms_of_mental_illness) %>%
  remove_constant() %>%
  select(starts_with("victim"),
         contains("incident"),
         city, state, zipcode, county, cause_of_death,
         criminal_charges,
         alleged_threat,
         alleged_weapon, 
         mental_symptoms) %>%
  mutate(month = as.yearmon(date_of_incident))

head(police_kill)

Question 1

Find the number of police murders per 100K population for each of the races White, Black, and Hispanic. Does there seem to be a bias against any race?

# Numbers are taken from Wikipedia
race_totals <- c(White = 191697647, 
                 Black = 39940338, 
                 Hispanic = 62080044) %>%
  enframe(
    name = "victims_race",
    value = "population")

police_kill %>%
  count(victims_race) %>%
  left_join(race_totals, by = "victims_race") %>%
  # Races without a population figure in race_totals get NA and are dropped
  drop_na() %>%
  mutate(murders_per_100K = n / population * 100000)

ANSWER Per 100K population, there do appear to be more murders of Black people than of White people. But is it just due to random chance?

Question 2

Find a dataset on total population of each state in the USA and on proportions or total numbers of different races in each state. Load the dataset into R and clean it so that it is possible to merge it with police_kill.

For simplicity,

  1. Assume that the USA population remained unchanged between 2013 and 2021, i.e., just load one year of data.

  2. Load the data on the three most prevalent races, i.e., white, black, and hispanic.

ANSWER The first thing to do is to get the right names of races in police_kill:

police_kill %>%
  tabyl(victims_race)

When we load a dataset of race populations into R, we need to use the same race names. A simple Google search shows a few ways to get race populations. There is Wikipedia: you can manually copy the table and paste it into Google Drive or Excel. There are also some downloadable datasets. But the easiest way is to install the package usdata:

library(usdata)
head(pop_race_2019)

It has the data that we need, but we need to reformat it. A quick inspection shows that “White” and “Black” are races but “Hispanic” is a binary marker, i.e., a person of any race can identify as Hispanic or not Hispanic. So in order to merge this with police_kill, we will compute a new variable called victims_race (the same name, so that we can merge it with police_kill) that will be the same as the original race whenever the hispanic marker is "Not Hispanic or Latino" and will be “Hispanic” whenever hispanic == "Hispanic or Latino". We will also drop the rest of the races.

pop_race_2019_alt <- pop_race_2019 %>%
  mutate(
    victims_race = ifelse(hispanic == "Hispanic or Latino",
                          "Hispanic", race)) %>%
  mutate(
    victims_race = ifelse(victims_race == "Black or African American",
                          "Black", victims_race)) %>%
  filter(victims_race %in% c("Hispanic", "Black", "White")) %>%
  group_by(state, state_name, victims_race) %>%
  summarise(population = sum(population), .groups = "drop")

pop_race_2019_alt

Question 3

Find the number of murders in each state by race per 100K population. Save the result into a new data frame, police_kill_counts.

# ANSWER
police_kill_counts <- police_kill %>%
  filter(victims_race %in% c("Hispanic", "Black", "White")) %>%
  count(state, victims_race) %>%
  left_join(pop_race_2019_alt, by = c("state", "victims_race")) %>%
  mutate(murders_per_100K = n / population * 100000)

police_kill_counts

Question 4

Test the hypothesis that the true mean number of murders of Blacks per 100K population across all states is higher than the mean number of murders of Whites per 100K population.

# ANSWER
police_kill_counts %>%
  filter(victims_race %in% c("Black", "White")) %>%
  t.test(murders_per_100K ~ victims_race, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  murders_per_100K by victims_race
## t = 7.5147, df = 53.068, p-value = 6.653e-10
## alternative hypothesis: true difference in means between group Black and group White is not equal to 0
## 95 percent confidence interval:
##  3.092480 5.344256
## sample estimates:
## mean in group Black mean in group White 
##            6.309610            2.091242
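
Question 4 asks about a one-sided alternative (Black higher than White), whereas the output above reports the default two-sided test. A minimal sketch of the one-sided version follows. It runs on made-up rates: `toy_counts` and its values are hypothetical and are not taken from the dataset. `t.test()` orders the groups alphabetically, so `alternative = "greater"` tests whether the Black mean minus the White mean is above zero.

```r
# Hypothetical per-state rates, for illustration only (not real data)
toy_counts <- data.frame(
  victims_race     = rep(c("Black", "White"), each = 5),
  murders_per_100K = c(6.1, 7.4, 5.2, 8.0, 4.9,
                       2.0, 2.5, 1.8, 2.2, 1.9)
)

# Default two-sided Welch test, as in the answer above
two_sided <- t.test(murders_per_100K ~ victims_race, data = toy_counts)

# One-sided test: mean(Black) - mean(White) > 0
# (groups are ordered alphabetically, so "Black" is the first group)
one_sided <- t.test(murders_per_100K ~ victims_race, data = toy_counts,
                    alternative = "greater")

one_sided$p.value
```

When the t statistic is positive, the one-sided p-value is half of the two-sided one, so the conclusion drawn from the output above does not change.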

Question 5

Why won’t the paired \(t\)-test work and how should we fix it?

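ANSWER The paired \(t\)-test is suitable in principle, because each state provides a natural (Black, White) pair of rates. The problem is that some states have no recorded murders of Black people, so those states have no Black row in police_kill_counts and their pairs are incomplete. The fix is to count all pairs (state, race), including those with zero counts.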
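
The fix described above could be sketched with `tidyr::complete()`, which adds the missing (state, race) pairs so that a zero count can be filled in. The tables `toy_counts` and `toy_pop` below are hypothetical stand-ins for the output of `count(state, victims_race)` and for `pop_race_2019_alt`; their values are made up.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical counts, for illustration only (not real data).
# State "C" has no row for Black victims, as count() would produce.
toy_counts <- tribble(
  ~state, ~victims_race, ~n,
  "A",    "Black",       12,
  "A",    "White",       30,
  "B",    "Black",        5,
  "B",    "White",       22,
  "C",    "White",        9,
  "D",    "Black",        3,
  "D",    "White",       15
)

# Hypothetical populations for every (state, race) pair
toy_pop <- tribble(
  ~state, ~victims_race, ~population,
  "A",    "Black",        300000,
  "A",    "White",       2000000,
  "B",    "Black",        150000,
  "B",    "White",       1200000,
  "C",    "Black",         20000,
  "C",    "White",        600000,
  "D",    "Black",         90000,
  "D",    "White",        800000
)

toy_rates <- toy_counts %>%
  # Add the missing (state, race) pairs with a count of zero
  complete(state, victims_race, fill = list(n = 0)) %>%
  left_join(toy_pop, by = c("state", "victims_race")) %>%
  mutate(murders_per_100K = n / population * 100000) %>%
  select(state, victims_race, murders_per_100K) %>%
  # One row per state and one column per race, so the pairs line up
  pivot_wider(names_from = victims_race, values_from = murders_per_100K)

# Paired one-sided test on the per-state differences (Black - White)
paired_test <- t.test(toy_rates$Black, toy_rates$White,
                      paired = TRUE, alternative = "greater")
paired_test
```

Reshaping to one row per state before calling `t.test()` makes the pairing explicit and guards against rows being matched in the wrong order.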

Question 6

Apart from the issue outlined in Question 5, what is another (purely mathematical) issue with designing the statistical test as in Question 4?

ANSWER The issue here is that the mean number of murders per 100K population across all states may not be a meaningful statistic, since it is disproportionately influenced by states with small populations. For instance, the population of California is about 40M and the population of Alaska is less than 1M, but California and Alaska contribute equally to the mean number of murders per 100K population across all states. States with smaller populations will also have a higher variance of murders per 100K population. Informally, every murder in Alaska that could occur due to random chance will have a larger influence on the number of murders in Alaska per 100K population than one murder in California has on California’s rate.
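
A small simulation can illustrate the variance argument. It is a sketch under stated assumptions: counts are modelled as Poisson, both states share the same true rate of 0.5 per 100K (a made-up value), and the populations are rounded figures of roughly the sizes mentioned above.

```r
set.seed(1)

# Assumed common true rate: 0.5 murders per 100K people per year
# (a made-up value, chosen only for illustration)
rate_per_person <- 0.5 / 100000

# Rounded populations (assumptions, roughly California- and Alaska-sized)
pop_large <- 40e6
pop_small <- 0.7e6

# Simulate 10,000 years of counts for each state under the same true rate
n_sim <- 10000
rate_large <- rpois(n_sim, rate_per_person * pop_large) / pop_large * 100000
rate_small <- rpois(n_sim, rate_per_person * pop_small) / pop_small * 100000

# Both rates are centred on 0.5, but the small state is far noisier
c(mean_large = mean(rate_large), sd_large = sd(rate_large),
  mean_small = mean(rate_small), sd_small = sd(rate_small))
```

Under this Poisson model the standard deviation of the rate per 100K scales like one over the square root of the population, so the small state's rate should be roughly sqrt(40 / 0.7), or about 7.6, times noisier.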

Question 7

Suggest a fix to the issue outlined in Question 6.

ANSWER We could compute murders per 100K population across years or months rather than across states.
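
One possible reading of this fix, sketched on made-up numbers: count murders nationwide per month and race, divide by the national population of each race from Question 1, and compare the monthly rates with a paired test. The table `toy_monthly` is hypothetical; with the real data it would come from something like `count(month, victims_race)` on police_kill.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical nationwide monthly counts, for illustration only (not real data)
toy_monthly <- tribble(
  ~month,    ~victims_race, ~n,
  "2020-01", "Black",       25,
  "2020-01", "White",       40,
  "2020-02", "Black",       22,
  "2020-02", "White",       45,
  "2020-03", "Black",       28,
  "2020-03", "White",       38,
  "2020-04", "Black",       20,
  "2020-04", "White",       42
)

# National populations, as used in Question 1
race_totals <- tibble(
  victims_race = c("White", "Black"),
  population   = c(191697647, 39940338)
)

monthly_rates <- toy_monthly %>%
  left_join(race_totals, by = "victims_race") %>%
  mutate(murders_per_100K = n / population * 100000) %>%
  select(month, victims_race, murders_per_100K) %>%
  # One row per month and one column per race, so the pairs line up
  pivot_wider(names_from = victims_race, values_from = murders_per_100K)

# Each month is one observation; every rate uses the full national population
monthly_test <- t.test(monthly_rates$Black, monthly_rates$White,
                       paired = TRUE, alternative = "greater")
monthly_test
```

Because every monthly rate has the whole national population in the denominator, no single small-population unit dominates the mean.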

Model answers:

https://rpubs.com/fduzhin/mh3511_hw_5_answers