Info

Objective

This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.

The goal is to learn R by doing, not just by listening.

Your Task

  • Attempt each question yourself before checking the answer or asking for help.
  • Use lecture notes and the example code provided.
  • Update your answers after Fedor’s explanations.
  • Feel free to work with your neighbor if you get stuck — but make sure you understand the final answer!

Initial Setup

Before we begin, make sure you’ve installed the required packages:

  • tidyverse for data manipulation and plotting
  • titanic for practice data

You only need to install these once. If you already did it during Homework 1, you’re good to go.

# Load required packages
library(tidyverse)
library(e1071)
library(janitor)

Poisson distribution

The Poisson distribution models the number of independent events occurring within a fixed interval of time or space. In physics, the number of radioactive decay events within a given time frame is a classic example of a Poisson random variable.

In this exercise, we will explore whether there are examples from everyday life that could also be reasonably modeled using a Poisson distribution.

Data (Discoveries)

The dataset discoveries, which contains the number of major scientific discoveries per year, is available by default in R:

discoveries
## Time Series:
## Start = 1860 
## End = 1959 
## Frequency = 1 
##   [1]  5  3  0  2  0  3  2  3  6  1  2  1  2  1  3  3  3  5  2  4  4  0  2  3  7
##  [26] 12  3 10  9  2  3  7  7  2  3  3  6  2  4  3  5  2  2  4  0  4  2  5  2  3
##  [51]  3  6  5  8  3  6  6  0  5  2  2  2  6  3  4  4  2  2  4  7  5  3  3  0  2
##  [76]  2  2  1  3  4  2  2  1  1  1  2  1  4  4  3  2  1  4  1  1  1  0  0  2  0

This dataset is stored as a time series object, not as a data frame. We’ll convert it into a data frame for easier handling and visualization:

df_discoveries <- tibble(
  year = time(discoveries) %>% as.integer(),
  value = as.numeric(discoveries)
)

# View the first few rows
head(df_discoveries)

Below is the histogram of the number of discoveries per year. Since the values are discrete, we use geom_bar():

df_discoveries %>%
  ggplot(aes(x = value)) + geom_bar()

Question 1

Create a line plot of the number of discoveries over time. Your plot should show the number of discoveries on the \(y\)-axis and the corresponding year on the \(x\)-axis.

# ANSWER
df_discoveries %>%
  ggplot(aes(x = year, y = value)) + geom_line() + ggtitle("The number of discoveries")

Question 2

Generate a Poisson random variable with the same mean as the empirical mean of the number of discoveries. Then compare the distribution of the real data and the simulated data using two types of visualizations:

  1. Overlayed bar plots (geom_bar(position = "dodge")), since the variable is discrete.

  2. Overlayed smooth density plots (geom_density()), for a visual comparison of the distributions.

Does the distribution of the number of discoveries seem to follow a Poisson distribution?

## ANSWER
mean_disc <- df_discoveries %>% pull(value) %>% mean
df_pois <- rpois(nrow(df_discoveries), lambda = mean_disc) %>%
  enframe() %>%
  mutate(type = "random")

### Part (a)
df_discoveries %>%
  select(value) %>%
  mutate(type = "discoveries") %>%
  bind_rows(df_pois) %>%
  ggplot(aes(x = value, fill = type)) +
  geom_bar(position = "dodge")

### Part (b)
df_discoveries %>%
  select(value) %>%
  mutate(type = "discoveries") %>%
  bind_rows(df_pois) %>%
  ggplot(aes(x = value, fill = type)) +
  geom_density(alpha = 0.8)

ANSWER The Poisson model is kind of decent, but may not be a perfect fit for this dataset. Visually, the empirical distribution of the number of discoveries seems to have higher variance and possibly more skewness than the simulated Poisson distribution. This suggests overdispersion — a common sign that the data may not be well-modeled by a simple Poisson process.

Data (Police Shootings)

In the next exercise, we will work with a dataset documenting people killed by the police in the United States.

We begin by loading and cleaning the data:

police_kill <- read_csv("../Data/police_killings_MPV.csv") %>%
  # The following function from janitor package cleans the variable names:
  clean_names() %>%
  # Here we drop unnecessary variables:
  select(victims_age, victims_gender, victims_race, date_of_incident_month_day_year, state) %>%
  # Here we convert the date to the type "Date"
  mutate(date_of_incident = dmy(date_of_incident_month_day_year)) %>%
  # Here we convert victim's race to factor (technical detail that will be explained later):
  mutate(victims_race = as.factor(victims_race)) %>%
  # And here we round down to the first of each month so that we can compute monthly counts:
  mutate(year_month = floor_date(date_of_incident, unit = "month"))

head(police_kill)

This dataset allows us to explore monthly counts of fatal police shootings, which we can compare to a Poisson distribution.

🔎 Reminder: The Poisson distribution is commonly used to model counts of independent events within a fixed time interval.

For a Poisson-distributed variable:

  • Mean \(=\lambda\),

  • Variance \(=\lambda\),

  • Skewness \(=1/\sqrt{\lambda}\).

Question 3

For each race, calculate the monthly number of police killings, then compute the following statistics:

  • Empirical mean

  • Empirical variance

  • Empirical skewness

  • Theoretical variance (assuming a Poisson distribution: equal to the mean)

  • Theoretical skewness (Poisson skewness: \(1/\sqrt{\lambda}\).

⚠️ Technical Note: When counting the number of killings per month and race, use .drop = FALSE in count(), i.e., count(year_month, victims_race, .drop = FALSE). This ensures that months with zero killings for a given race are still included in the analysis.

For which races, do monthly police killings seem Poisson-distributed?

## ANSWER
police_kill %>%
  mutate(victims_race = as.factor(victims_race)) %>%
  count(year_month, victims_race, .drop = FALSE) %>%
  group_by(victims_race) %>%
  summarise(
    n_obs = n(),
    obs_mean = mean(n),
    obs_var = var(n),
    obs_skew = skewness(n)
  ) %>%
  mutate(
    theor_var = obs_mean,
    theor_skew = 1/sqrt(obs_mean)
  )

ANSWER From the table, the Poisson model appears to be a reasonable fit for races with generally low monthly counts, i.e., Asian, Native American, and Pacific Islander. However, it is a very poor fit for the unknown race, which exhibits much higher variance and skewness than expected under a Poisson distribution.

Question 4

Write a function that takes as input one of the races in the police_kill dataset and does the following:

  1. Calculates the monthly number of victims of that race.
  2. Generates a Poisson sample of the same length, with mean equal to the empirical mean of monthly victim counts.
  3. Plots the empirical distribution and the Poisson sample as overlaid bar plots for comparison. Note that you will need to use geom_bar(position = "dodge") for easy comparison.
  4. Use your function to visualize and compare results for Asian and Unknown race.

These comparisons illustrate that the Poisson model is a good fit for races with low variance (e.g., Asian), and a poor fit when the variance is much larger than the mean (e.g., Unknown race).

# ANSWER
# Precompute monthly counts by race
murder_counts <- police_kill %>%
  mutate(victims_race = as.factor(victims_race)) %>%
  count(year_month, victims_race, .drop = FALSE)

# Define the function for plotting empirical vs. Poisson-simulated data
murder_and_random_plot <- function(race) {
  df_empirical <- murder_counts %>%
    filter(victims_race == race) %>%
    select(victims_race, n)
  
  mean_count <- mean(df_empirical$n)
  
  df_random <- tibble(
    n = rpois(n = nrow(df_empirical), lambda = mean_count),
    victims_race = "Poisson"
  )
  
  df_empirical %>%
    bind_rows(df_random) %>%
    ggplot(aes(x = n, fill = victims_race)) + 
    geom_bar(position = "dodge")
}

# Run the function for selected races
murder_and_random_plot("Asian")

murder_and_random_plot("Unknown race")

Model Answers:

https://rpubs.com/fduzhin/mh3511_hw_4_answers