This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:
- tidyverse: for data manipulation and plotting
- janitor: for data cleaning
- zoo: for some extra functions for working with time

You only need to install these once.
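For example, you can install all three in one go (run this once, not in every session):

install.packages(c("tidyverse", "janitor", "zoo"))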
Police Violence and Shootings
In this task, we will find out which racial group is murdered by police most often. The original dataset is here:
https://www.kaggle.com/discussions/general/158339
First, we read the data into R, cleaning it as we go (the data is on Fedor's Dropbox, but you can download the same dataset from the link above instead):
# Load the packages used in this notebook
library(tidyverse)
library(janitor)     # clean_names(), remove_constant()
library(zoo)         # as.yearmon()
library(lubridate)   # dmy(); attached by recent tidyverse versions anyway
library(broom)       # tidy()

# Population totals by race; numbers are taken from Wikipedia
race_totals <- c(White = 191697647,
                 Black = 39940338,
                 Hispanic = 62080044) %>%
  enframe(name = "victims_race",
          value = "population")
police_kill <- read_csv("../Data/police_killings_MPV.csv") %>%
  clean_names() %>%                                    # standardize the column names
  mutate(victims_age = parse_number(victims_age)) %>%  # extract the numeric age from text
  rename(date_of_incident = date_of_incident_month_day_year) %>%
  mutate(date_of_incident = dmy(date_of_incident)) %>% # parse day-month-year dates
  rename(alleged_weapon = alleged_weapon_source_wa_po_and_review_of_cases_not_included_in_wa_po_database) %>%
  rename(alleged_threat = alleged_threat_level_source_wa_po) %>%
  rename(mental_symptoms = symptoms_of_mental_illness) %>%
  remove_constant() %>%                                # drop columns holding a single constant value
  select(starts_with("victim"),
         contains("incident"),
         city, state, zipcode, county, cause_of_death,
         criminal_charges,
         alleged_threat,
         alleged_weapon,
         mental_symptoms) %>%
  mutate(month = as.yearmon(date_of_incident))         # zoo's year-month class
head(police_kill)
Find the number of people of each race murdered by the police every month and test the hypothesis that the true monthly mean number of murders of Blacks per 1M population across the whole country is the same as the mean number of murders of Whites per 1M population across the whole country.
Start by creating a data frame of counts of murders by month and victim's race. Make sure that zero counts are included. Then merge with the population data and compute the numbers of victims per 1M population. Finally, apply the \(t\)-test.
# ANSWER HERE
victim_counts <- police_kill %>%
  mutate(victims_race = as_factor(victims_race)) %>%   # a factor lets count() keep zero counts
  count(month, victims_race, .drop = FALSE) %>%
  left_join(race_totals) %>%
  drop_na() %>%
  mutate(victims_per_1M = n / population * 1e+6)
victim_counts %>%
  filter(victims_race %in% c("White", "Black")) %>%
  t.test(victims_per_1M ~ victims_race, .)
##
## Welch Two Sample t-test
##
## data: victims_per_1M by victims_race
## t = 22.083, df = 107.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Black and group White is not equal to 0
## 95 percent confidence interval:
## 0.3302841 0.3954276
## sample estimates:
## mean in group Black mean in group White
## 0.5710130 0.2081571
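If you prefer the result as a data frame rather than printed text, broom's tidy() can also be applied to the t.test() output:

victim_counts %>%
  filter(victims_race %in% c("White", "Black")) %>%
  t.test(victims_per_1M ~ victims_race, .) %>%
  tidy()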
Same as the previous question, but apply the paired \(t\)-test.
# ANSWER HERE
victim_counts %>%
  select(month, victims_race, victims_per_1M) %>%
  pivot_wider(names_from = victims_race, values_from = victims_per_1M) %>%
  mutate(black_minus_white = Black - White) %>%
  t.test(black_minus_white ~ 1, .)
##
## One Sample t-test
##
## data: black_minus_white
## t = 24.492, df = 92, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.3334312 0.3922806
## sample estimates:
## mean of x
## 0.3628559
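As a cross-check (an alternative sketch, not the approach above), t.test() can run the paired test directly on the two wide-format columns; the resulting statistic and p-value match the one-sample test on the differences:

victim_counts %>%
  select(month, victims_race, victims_per_1M) %>%
  pivot_wider(names_from = victims_race, values_from = victims_per_1M) %>%
  with(t.test(Black, White, paired = TRUE))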
Here we will use the function map_df() that applies the same operation to all entries of a vector and outputs a data frame.
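As a quick illustration (square_row() is a made-up helper, not part of the exercise): map_df() calls the function on every element and binds the resulting one-row data frames into one:

# Each call returns a one-row tibble; map_df() row-binds them
square_row <- function(x) tibble(input = x, squared = x^2)
1:3 %>% map_df(square_row)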
In this exercise, we investigate how the size of a confidence interval depends on the level of confidence. We will simulate a sample of 10 random values from a normal distribution with true mean \(\mu = 0.3\) and standard deviation \(\sigma = 1\).
Next, we create a vector of confidence levels between \(0.9\) and \(0.99999\), evenly spaced on a log scale (in terms of one minus the confidence level).
For each confidence level \(\alpha\) in this vector, we apply a \(t\)-test to the simulated data, extract the resulting confidence interval, and store the results in a data frame with three columns: confidence level, lower bound, and upper bound of the interval.
Finally, we plot the lower and upper bounds (in different colors) against the confidence level, using a log scale on the \(x\)-axis.
set.seed(42)
random_data <- 0.3 + rnorm(10)
test_confidence_levels <- 1 - 10^(seq(from = -1, to = -5, by = -0.1))
# ANSWER HERE
get_conf_interval <- function(conf_level) {
  random_data %>%
    t.test(conf.level = conf_level) %>%
    tidy() %>%
    select(starts_with("conf")) %>%     # keep conf.low and conf.high
    mutate(conf_level = conf_level) %>%
    relocate(conf_level)
}
test_confidence_levels %>%
  map_df(get_conf_interval) %>%
  pivot_longer(-conf_level) %>%
  ggplot(aes(x = conf_level, y = value, color = name)) +
  geom_line() +
  scale_x_log10()
We need to test for normality when sample sizes are small. For example, if we are investigating monthly counts of police shooting victims, the number of months may be large enough that normality is not a concern (especially for the \(t\)-test):
n_distinct(victim_counts$month)
## [1] 93
However, let’s visualize the distribution of monthly victim counts per 1 million population across different years using smoothed histograms:
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%   # a yearmon is year + (month - 1)/12, so floor() gives the year
  ggplot(aes(x = victims_per_1M, fill = victims_race)) +
  geom_density(alpha = 0.5) +
  facet_wrap("year")
We observe that the trend of more Black victims than White victims per 1M population seems consistent across years. But for a proper statistical comparison, we need to check whether these numbers approximately follow a normal distribution.
To explore this, we can create a Q–Q plot for one year and one race:
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  filter(year == 2013 & victims_race == "Black") %>%
  ggplot(aes(sample = victims_per_1M)) +
  geom_qq() + geom_qq_line()
The distribution is not perfectly normal, but there are no major deviations either. The Shapiro–Wilk test gives a formal check:
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  filter(year == 2013 & victims_race == "Black") %>%
  pull(victims_per_1M) %>%
  shapiro.test() %>%
  tidy()
Make a panel Q–Q plot of victims_per_1M with different colours representing different races and separate panels for each year.
# ANSWER
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  ggplot(aes(sample = victims_per_1M, color = victims_race)) +
  geom_qq() + geom_qq_line() +
  facet_wrap("year")
Apply the Shapiro–Wilk test to the number of Black victims per month separately for each year. Report the results as a data frame with the following columns: year, statistic, p.value, and method.
# ANSWER
sw_test_for_blacks <- function(y) {
  victim_counts %>%
    mutate(year = floor(as.numeric(month))) %>%
    filter(year == y & victims_race == "Black") %>%
    pull(victims_per_1M) %>%
    shapiro.test() %>%
    tidy() %>%
    mutate(year = y) %>%
    relocate(year)
}
2013:2020 %>% map_df(sw_test_for_blacks)
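An equivalent sketch without the helper function, using dplyr's group_modify() to run the test within each year group (assuming, as above, that the data cover 2013-2020):

victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  filter(victims_race == "Black") %>%
  group_by(year) %>%
  group_modify(~ tidy(shapiro.test(.x$victims_per_1M)))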
Construct Q–Q plots and apply the Shapiro–Wilk test for normality to a dataset of your choice.
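For example, a minimal sketch using R's built-in faithful dataset (Old Faithful eruption durations), just as a starting point:

# Q-Q plot of eruption durations against the normal distribution
faithful %>%
  ggplot(aes(sample = eruptions)) +
  geom_qq() + geom_qq_line()
# Formal normality check on the same variable
faithful %>%
  pull(eruptions) %>%
  shapiro.test() %>%
  tidy()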