This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:
- tidyverse: for data manipulation and plotting
- janitor: for data cleaning
- zoo: for some extra functions for working with time

You only need to install these once.
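For example, you can install all three in one go (run this once, not in every session):

install.packages(c("tidyverse", "janitor", "zoo"))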
Police Violence and Shootings
In this task, we will find out which racial group is murdered by police most often. The original dataset is here:
https://www.kaggle.com/discussions/general/158339
First, we read the data into R, cleaning it as we go (the data is on Fedor's Dropbox, but you can download the same dataset from the link above instead):
# Load the packages used in this notebook
library(tidyverse)
library(janitor)     # clean_names(), remove_constant()
library(zoo)         # as.yearmon()
library(lubridate)   # dmy(); attached by recent tidyverse versions anyway
library(broom)       # tidy()

# Population totals by race; numbers are taken from Wikipedia
race_totals <- c(White = 191697647,
                 Black = 39940338,
                 Hispanic = 62080044) %>%
  enframe(name = "victims_race",
          value = "population")
police_kill <- read_csv("../Data/police_killings_MPV.csv") %>%
  clean_names() %>%                                    # standardize the column names
  mutate(victims_age = parse_number(victims_age)) %>%  # extract the numeric age from text
  rename(date_of_incident = date_of_incident_month_day_year) %>%
  mutate(date_of_incident = dmy(date_of_incident)) %>% # parse day-month-year dates
  rename(alleged_weapon = alleged_weapon_source_wa_po_and_review_of_cases_not_included_in_wa_po_database) %>%
  rename(alleged_threat = alleged_threat_level_source_wa_po) %>%
  rename(mental_symptoms = symptoms_of_mental_illness) %>%
  remove_constant() %>%                                # drop columns holding a single constant value
  select(starts_with("victim"),
         contains("incident"),
         city, state, zipcode, county, cause_of_death,
         criminal_charges,
         alleged_threat,
         alleged_weapon,
         mental_symptoms) %>%
  mutate(month = as.yearmon(date_of_incident))         # zoo's year-month class
head(police_kill)
Find the number of people of each race murdered by the police every month and test the hypothesis that the true monthly mean number of murders of Blacks per 1M population across the whole country is the same as the mean number of murders of Whites per 1M population across the whole country.
Start by creating a data frame of counts of murders by month and victim's race. Make sure that zero counts are included. Then merge with the population data and compute the numbers of victims per 1M population. Finally, apply the \(t\)-test.
# ANSWER HERE
victim_counts <- police_kill %>%
  mutate(victims_race = as_factor(victims_race)) %>%   # a factor lets count() keep zero counts
  count(month, victims_race, .drop = FALSE) %>%
  left_join(race_totals) %>%
  drop_na() %>%
  mutate(victims_per_1M = n / population * 1e+6)
victim_counts %>%
  filter(victims_race %in% c("White", "Black")) %>%
  t.test(victims_per_1M ~ victims_race, .)
##
## Welch Two Sample t-test
##
## data: victims_per_1M by victims_race
## t = 22.083, df = 107.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Black and group White is not equal to 0
## 95 percent confidence interval:
## 0.3302841 0.3954276
## sample estimates:
## mean in group Black mean in group White
## 0.5710130 0.2081571
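If you prefer the result as a data frame rather than printed text, broom's tidy() can also be applied to the t.test() output:

victim_counts %>%
  filter(victims_race %in% c("White", "Black")) %>%
  t.test(victims_per_1M ~ victims_race, .) %>%
  tidy()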
Same as the previous question, but apply the paired \(t\)-test.
# ANSWER HERE
victim_counts %>%
  select(month, victims_race, victims_per_1M) %>%
  pivot_wider(names_from = victims_race, values_from = victims_per_1M) %>%
  mutate(black_minus_white = Black - White) %>%
  t.test(black_minus_white ~ 1, .)
##
## One Sample t-test
##
## data: black_minus_white
## t = 24.492, df = 92, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.3334312 0.3922806
## sample estimates:
## mean of x
## 0.3628559
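As a cross-check (an alternative sketch, not the approach above), t.test() can run the paired test directly on the two wide-format columns; the resulting statistic and p-value match the one-sample test on the differences:

victim_counts %>%
  select(month, victims_race, victims_per_1M) %>%
  pivot_wider(names_from = victims_race, values_from = victims_per_1M) %>%
  with(t.test(Black, White, paired = TRUE))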
Here we will use the function map_df() that applies the same operation to all entries of a vector and outputs a data frame.
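As a quick illustration (square_row() is a made-up helper, not part of the exercise): map_df() calls the function on every element and binds the resulting one-row data frames into one:

# Each call returns a one-row tibble; map_df() row-binds them
square_row <- function(x) tibble(input = x, squared = x^2)
1:3 %>% map_df(square_row)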
In this exercise, we investigate how the size of a confidence interval depends on the level of confidence. We will simulate a sample of 10 random values from a normal distribution with true mean \(\mu = 0.3\) and standard deviation \(\sigma = 1\).
Next, we create a vector of confidence levels between \(0.9\) and \(0.99999\), evenly spaced on a log scale (in terms of one minus the confidence level).
For each confidence level \(\alpha\) in this vector, we apply a \(t\)-test to the simulated data, extract the resulting confidence interval, and store the results in a data frame with three columns: confidence level, lower bound, and upper bound of the interval.
Finally, we plot the lower and upper bounds (in different colors) against the confidence level, using a log scale on the \(x\)-axis.
set.seed(42)
random_data <- 0.3 + rnorm(10)
test_confidence_levels <- 1 - 10^(seq(from = -1, to = -5, by = -0.1))
# ANSWER HERE
get_conf_interval <- function(conf_level) {
  random_data %>%
    t.test(conf.level = conf_level) %>%
    tidy() %>%
    select(starts_with("conf")) %>%     # keep conf.low and conf.high
    mutate(conf_level = conf_level) %>%
    relocate(conf_level)
}
test_confidence_levels %>%
  map_df(get_conf_interval) %>%
  pivot_longer(-conf_level) %>%
  ggplot(aes(x = conf_level, y = value, color = name)) +
  geom_line() +
  scale_x_log10()
We need to test for normality when sample sizes are small. For example, if we are investigating monthly counts of police shooting victims, the number of months may be large enough that normality is not a concern (especially for the \(t\)-test):
n_distinct(victim_counts$month)
## [1] 93
However, let’s visualize the distribution of monthly victim counts per 1 million population across different years using smoothed histograms:
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%   # a yearmon is year + (month - 1)/12, so floor() gives the year
  ggplot(aes(x = victims_per_1M, fill = victims_race)) +
  geom_density(alpha = 0.5) +
  facet_wrap("year")
We observe that the trend of more Black victims than White victims per 1M population seems consistent across years. But for a proper statistical comparison, we need to check whether these numbers approximately follow a normal distribution.
To explore this, we can create a Q–Q plot for one year and one race:
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  filter(year == 2013 & victims_race == "Black") %>%
  ggplot(aes(sample = victims_per_1M)) +
  geom_qq() + geom_qq_line()
The distribution is not perfectly normal, but there are no major deviations either. The Shapiro–Wilk test gives a formal check:
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  filter(year == 2013 & victims_race == "Black") %>%
  pull(victims_per_1M) %>%
  shapiro.test() %>%
  tidy()
Make a panel Q–Q plot of victims_per_1M with different colours representing different races and separate panels for each year.
# ANSWER
victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  ggplot(aes(sample = victims_per_1M, color = victims_race)) +
  geom_qq() + geom_qq_line() +
  facet_wrap("year")
Apply the Shapiro–Wilk test to the number of Black victims per month separately for each year. Report the results as a data frame with the following columns: year, statistic, p.value, and method.
# ANSWER
sw_test_for_blacks <- function(y) {
  victim_counts %>%
    mutate(year = floor(as.numeric(month))) %>%
    filter(year == y & victims_race == "Black") %>%
    pull(victims_per_1M) %>%
    shapiro.test() %>%
    tidy() %>%
    mutate(year = y) %>%
    relocate(year)
}
2013:2020 %>% map_df(sw_test_for_blacks)
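An equivalent sketch without the helper function, using dplyr's group_modify() to run the test within each year group (assuming, as above, that the data cover 2013-2020):

victim_counts %>%
  mutate(year = floor(as.numeric(month))) %>%
  filter(victims_race == "Black") %>%
  group_by(year) %>%
  group_modify(~ tidy(shapiro.test(.x$victims_per_1M)))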
Construct Q–Q plots and apply the Shapiro–Wilk test for normality to a dataset of your choice.
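For example, a minimal sketch using R's built-in faithful dataset (Old Faithful eruption durations), just as a starting point:

# Q-Q plot of eruption durations against the normal distribution
faithful %>%
  ggplot(aes(sample = eruptions)) +
  geom_qq() + geom_qq_line()
# Formal normality check on the same variable
faithful %>%
  pull(eruptions) %>%
  shapiro.test() %>%
  tidy()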