This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:
tidyverse
for data manipulation and plottingtitanic
for practice dataYou only need to install these once. If you already did it during Homework 1, you’re good to go.
# Load required packages
library(tidyverse)
library(e1071)
library(janitor)
The Poisson distribution models the number of independent events occurring within a fixed interval of time or space. In physics, the number of radioactive decay events within a given time frame is a classic example of a Poisson random variable.
In this exercise, we will explore whether there are examples from everyday life that could also be reasonably modeled using a Poisson distribution.
The dataset discoveries
, which contains the number of
major scientific discoveries per year, is available by default in R:
discoveries
## Time Series:
## Start = 1860
## End = 1959
## Frequency = 1
## [1] 5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4 4 0 2 3 7
## [26] 12 3 10 9 2 3 7 7 2 3 3 6 2 4 3 5 2 2 4 0 4 2 5 2 3
## [51] 3 6 5 8 3 6 6 0 5 2 2 2 6 3 4 4 2 2 4 7 5 3 3 0 2
## [76] 2 2 1 3 4 2 2 1 1 1 2 1 4 4 3 2 1 4 1 1 1 0 0 2 0
This dataset is stored as a time series object, not as a data frame. We’ll convert it into a data frame for easier handling and visualization:
df_discoveries <- tibble(
year = time(discoveries) %>% as.integer(),
value = as.numeric(discoveries)
)
# View the first few rows
head(df_discoveries)
Below is the histogram of the number of discoveries per year. Since
the values are discrete, we use geom_bar()
:
df_discoveries %>%
ggplot(aes(x = value)) + geom_bar()
Create a line plot of the number of discoveries over time. Your plot should show the number of discoveries on the \(y\)-axis and the corresponding year on the \(x\)-axis.
# ANSWER
df_discoveries %>%
ggplot(aes(x = year, y = value)) + geom_line() + ggtitle("The number of discoveries")
Generate a Poisson random variable with the same mean as the empirical mean of the number of discoveries. Then compare the distribution of the real data and the simulated data using two types of visualizations:
Overlayed bar plots (geom_bar(position = "dodge")
),
since the variable is discrete.
Overlayed smooth density plots (geom_density()
), for
a visual comparison of the distributions.
Does the distribution of the number of discoveries seem to follow a Poisson distribution?
## ANSWER
mean_disc <- df_discoveries %>% pull(value) %>% mean
df_pois <- rpois(nrow(df_discoveries), lambda = mean_disc) %>%
enframe() %>%
mutate(type = "random")
### Part (a)
df_discoveries %>%
select(value) %>%
mutate(type = "discoveries") %>%
bind_rows(df_pois) %>%
ggplot(aes(x = value, fill = type)) +
geom_bar(position = "dodge")
### Part (b)
df_discoveries %>%
select(value) %>%
mutate(type = "discoveries") %>%
bind_rows(df_pois) %>%
ggplot(aes(x = value, fill = type)) +
geom_density(alpha = 0.8)
ANSWER The Poisson model is kind of decent, but may not be a perfect fit for this dataset. Visually, the empirical distribution of the number of discoveries seems to have higher variance and possibly more skewness than the simulated Poisson distribution. This suggests overdispersion — a common sign that the data may not be well-modeled by a simple Poisson process.
In the next exercise, we will work with a dataset documenting people killed by the police in the United States.
We begin by loading and cleaning the data:
police_kill <- read_csv("../Data/police_killings_MPV.csv") %>%
# The following function from janitor package cleans the variable names:
clean_names() %>%
# Here we drop unnecessary variables:
select(victims_age, victims_gender, victims_race, date_of_incident_month_day_year, state) %>%
# Here we convert the date to the type "Date"
mutate(date_of_incident = dmy(date_of_incident_month_day_year)) %>%
# Here we convert victim's race to factor (technical detail that will be explained later):
mutate(victims_race = as.factor(victims_race)) %>%
# And here we round down to the first of each month so that we can compute monthly counts:
mutate(year_month = floor_date(date_of_incident, unit = "month"))
head(police_kill)
This dataset allows us to explore monthly counts of fatal police shootings, which we can compare to a Poisson distribution.
🔎 Reminder: The Poisson distribution is commonly used to model counts of independent events within a fixed time interval.
For a Poisson-distributed variable:
Mean \(=\lambda\),
Variance \(=\lambda\),
Skewness \(=1/\sqrt{\lambda}\).
For each race, calculate the monthly number of police killings, then compute the following statistics:
Empirical mean
Empirical variance
Empirical skewness
Theoretical variance (assuming a Poisson distribution: equal to the mean)
Theoretical skewness (Poisson skewness: \(1/\sqrt{\lambda}\).
⚠️ Technical Note: When counting the number of killings per month and
race, use .drop = FALSE
in count()
, i.e.,
count(year_month, victims_race, .drop = FALSE)
. This
ensures that months with zero killings for a given race are still
included in the analysis.
For which races, do monthly police killings seem Poisson-distributed?
## ANSWER
police_kill %>%
mutate(victims_race = as.factor(victims_race)) %>%
count(year_month, victims_race, .drop = FALSE) %>%
group_by(victims_race) %>%
summarise(
n_obs = n(),
obs_mean = mean(n),
obs_var = var(n),
obs_skew = skewness(n)
) %>%
mutate(
theor_var = obs_mean,
theor_skew = 1/sqrt(obs_mean)
)
ANSWER From the table, the Poisson model appears to be a reasonable fit for races with generally low monthly counts, i.e., Asian, Native American, and Pacific Islander. However, it is a very poor fit for the unknown race, which exhibits much higher variance and skewness than expected under a Poisson distribution.
Write a function that takes as input one of the races in the police_kill dataset and does the following:
geom_bar(position = "dodge")
for easy comparison.Asian
and Unknown race
.These comparisons illustrate that the Poisson model is a good fit for
races with low variance (e.g., Asian
), and a poor fit when
the variance is much larger than the mean (e.g.,
Unknown race
).
# ANSWER
# Precompute monthly counts by race
murder_counts <- police_kill %>%
mutate(victims_race = as.factor(victims_race)) %>%
count(year_month, victims_race, .drop = FALSE)
# Define the function for plotting empirical vs. Poisson-simulated data
murder_and_random_plot <- function(race) {
df_empirical <- murder_counts %>%
filter(victims_race == race) %>%
select(victims_race, n)
mean_count <- mean(df_empirical$n)
df_random <- tibble(
n = rpois(n = nrow(df_empirical), lambda = mean_count),
victims_race = "Poisson"
)
df_empirical %>%
bind_rows(df_random) %>%
ggplot(aes(x = n, fill = victims_race)) +
geom_bar(position = "dodge")
}
# Run the function for selected races
murder_and_random_plot("Asian")
murder_and_random_plot("Unknown race")