This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:

- tidyverse for data manipulation and plotting
- readxl for reading Excel files
- janitor for data cleaning
- zoo for some extra functions for working with time

You only need to install these once.
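If any of these are missing, install them once from the console, for example:
install.packages(c("tidyverse", "readxl", "janitor", "zoo"))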
# Load required packages
library(tidyverse)
library(readxl)
library(janitor)
library(zoo)
Here is how we read a dataset into R, cleaning it at the same time:
police_kill <- read_csv("../Data/police_killings_MPV.csv") %>%
  clean_names() %>%                                    # make column names snake_case
  mutate(victims_age = parse_number(victims_age)) %>%  # ages stored as text -> numeric
  rename(date_of_incident = date_of_incident_month_day_year) %>%
  mutate(date_of_incident = dmy(date_of_incident)) %>% # parse day-month-year dates
  rename(alleged_weapon = alleged_weapon_source_wa_po_and_review_of_cases_not_included_in_wa_po_database) %>%
  rename(alleged_threat = alleged_threat_level_source_wa_po) %>%
  rename(mental_symptoms = symptoms_of_mental_illness) %>%
  remove_constant() %>%                                # drop columns with a single constant value
  select(starts_with("victim"),
         contains("incident"),
         city, state, zipcode, county, cause_of_death,
         criminal_charges,
         alleged_threat,
         alleged_weapon,
         mental_symptoms) %>%
  mutate(month = as.yearmon(date_of_incident))         # year-month column from the zoo package
head(police_kill)
And here is a simple table with the janitor package:
tabyl(police_kill, cause_of_death)
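As an extra illustration (not in the original notes), tabyl can also cross-tabulate two variables, for example cause of death against whether criminal charges were filed:
tabyl(police_kill, cause_of_death, criminal_charges)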
Read the data from fatalities.xlsx into R (available on Dropbox):
You should clean variable names, remove rows containing missing values, and reformat the data to the long format with four columns:

- accident_type: the first column in fatalities.xlsx,
- industry: the names of the rest of the columns in fatalities.xlsx,
- number_of_accidents: the entries in fatalities.xlsx,
- year: the sheet names in fatalities.xlsx.
Then plot the number of accidents by year for the industries construction, manufacturing, and transportation_storage, keeping only rows with accident_type == "Total", with colour representing the industry.
## ANSWER
# Read and reshape the sheet for a single year y
get_one_year_of_accidents <- function(y) {
  read_excel("../Data/fatalities.xlsx", sheet = as.character(y)) %>%
    clean_names() %>%
    drop_na() %>%                         # remove rows containing missing values
    rename(accident_type = accident_type_industry_ssic2020) %>%
    pivot_longer(cols = -accident_type,   # wide -> long: one row per industry
                 names_to = "industry",
                 values_to = "number_of_accidents") %>%
    mutate(year = y)                      # record which sheet the rows came from
}
fat_2019 <- get_one_year_of_accidents(2019)
fat_2020 <- get_one_year_of_accidents(2020)
fat_2021 <- get_one_year_of_accidents(2021)
fat_2022 <- get_one_year_of_accidents(2022)
fat_2023 <- get_one_year_of_accidents(2023)
fat_2024 <- get_one_year_of_accidents(2024)
accident_data <- bind_rows(fat_2019, fat_2020,
                           fat_2021, fat_2022,
                           fat_2023, fat_2024)
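As an aside (not part of the original answer), the six calls above could equivalently be written as a loop over the years with purrr, which is loaded with the tidyverse:
accident_data <- map(2019:2024, get_one_year_of_accidents) %>%
  bind_rows()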
accident_data %>%
  filter(accident_type == "Total") %>%
  filter(industry %in% c("construction",
                         "manufacturing",
                         "transportation_storage")) %>%
  ggplot(aes(x = year, y = number_of_accidents, color = industry)) +
  geom_line()
Read a dataset of your choice into R, clean names, remove empty columns, remove rows containing missing values if necessary, format numbers as numbers, and dates as dates.
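There is no single answer here, but as a minimal sketch of the pattern, assuming a hypothetical file my_data.csv with a text column amount and a day-month-year text column date:
my_data <- read_csv("../Data/my_data.csv") %>%   # hypothetical file name
  clean_names() %>%
  remove_empty("cols") %>%                       # drop columns that are entirely NA
  drop_na() %>%                                  # drop rows containing missing values
  mutate(amount = parse_number(amount),          # hypothetical text column -> numeric
         date = dmy(date))                       # hypothetical date column -> Date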
We will use the data on European stock markets available in base R. Originally, this is a time series:
EuStockMarkets %>% head()
## Time Series:
## Start = c(1991, 130)
## End = c(1991, 135)
## Frequency = 260
## DAX SMI CAC FTSE
## 1991.496 1628.75 1678.1 1772.8 2443.6
## 1991.500 1613.63 1688.5 1750.5 2460.2
## 1991.504 1606.51 1678.6 1718.0 2448.2
## 1991.508 1621.04 1684.1 1708.1 2470.4
## 1991.512 1618.16 1686.6 1723.1 2484.7
## 1991.515 1610.61 1671.6 1714.3 2466.8
We will convert it to a data frame first:
# Extract the numeric time index (in decimal years)
dates <- as.numeric(time(EuStockMarkets))
# Now build a tibble with the time index as the first column
stocks_tbl <- EuStockMarkets %>%
  as_tibble() %>%
  mutate(date = dates) %>%
  relocate(date)
head(stocks_tbl)
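The date column above is a decimal year rather than a true Date; if an actual Date is preferred, it could be converted with lubridate, for example (a sketch; the rest of the notebook keeps the numeric time index):
stocks_tbl %>%
  mutate(date = as_date(date_decimal(date)))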
Write a single chain of pipes to create a new data frame, stock_log_returns, by going through the following steps.

Reshape stocks_tbl to the long format with the columns

- date: the original date in stocks_tbl,
- index: the column names of stocks_tbl,
- price: the entries of stocks_tbl.
For each index, compute log-returns, i.e., \[ \log\frac{\mbox{Price tomorrow}}{\mbox{Price today}} \] Note that the functions lag() and lead() can be useful for this task.
Drop rows containing missing values
As a sanity check, verify that all the indices have the same number of observations, and plot log-returns vs date with colours representing the different indices.
## ANSWER
#
# We will first define a convenience function for computing log-returns:
compute_log_returns <- function(x) {
  log(lead(x)) - log(x)
}
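As a quick check of what this returns, the last element is always NA, which is why drop_na() appears below:
compute_log_returns(c(100, 101, 103))   # two one-step log-returns, then NA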
## Now there are two ways to solve the problem
#
# First Method - reshape, then compute log-returns
# Here, the original price will be preserved
stock_log_returns <- stocks_tbl %>%
  pivot_longer(cols = -date,
               names_to = "index",
               values_to = "price") %>%
  group_by(index) %>%
  mutate(log_returns = compute_log_returns(price)) %>%
  drop_na() %>%
  ungroup()
# Second Method - first, compute log-returns, then reshape:
# Here, the original price will be dropped
stock_log_returns <- stocks_tbl %>%
  mutate(across(-date, compute_log_returns)) %>%
  pivot_longer(cols = -date,
               names_to = "index",
               values_to = "log_returns") %>%
  drop_na()
stock_log_returns %>% count(index)
stock_log_returns %>%
  ggplot(aes(x = date, y = log_returns, color = index)) + geom_line()
Here, we will apply the \(t\)-test to test a few statistical hypotheses.
Test the hypothesis that CAC (the French stock market index) had nonzero returns over the given period against the null hypothesis that it had zero returns.
Apply the two-sample \(t\)-test to test the hypothesis that CAC had higher returns than SMI (the Swiss Market Index) over the given period.
Apply the paired \(t\)-test to test the same hypothesis, that CAC had higher returns than SMI over the given period.
To compare two market indices, such as CAC and SMI, is it more appropriate to use a two-sided or a one-sided \(t\)-test? A two-sample or a paired \(t\)-test?
If you want to test all possible hypotheses of the type "Stock A and Stock B have different return rates" and decide to reject the null hypothesis at the significance level \(\alpha = 0.01\), what should the Bonferroni correction be?
## ANSWER
stock_log_returns %>%
  filter(index == "CAC") %>%
  pull(log_returns) %>%
  t.test()
##
## One Sample t-test
##
## data: .
## t = 1.7083, df = 1858, p-value = 0.08775
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -6.471217e-05 9.388201e-04
## sample estimates:
## mean of x
## 0.000437054
stock_log_returns %>%
  filter(index %in% c("CAC", "SMI")) %>%
  t.test(log_returns ~ index, .)
##
## Welch Two Sample t-test
##
## data: log_returns by index
## t = -1.1406, df = 3606.5, p-value = 0.2541
## alternative hypothesis: true difference in means between group CAC and group SMI is not equal to 0
## 95 percent confidence interval:
## -0.0010354735 0.0002737822
## sample estimates:
## mean in group CAC mean in group SMI
## 0.0004370540 0.0008178997
cac <- stock_log_returns %>% filter(index == "CAC") %>% pull(log_returns)
smi <- stock_log_returns %>% filter(index == "SMI") %>% pull(log_returns)
t.test(cac, smi, paired = TRUE)
##
## Paired t-test
##
## data: cac and smi
## t = -1.8186, df = 1858, p-value = 0.06913
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -7.915590e-04 2.986767e-05
## sample estimates:
## mean difference
## -0.0003808457
In this case, the paired \(t\)-test makes more sense than the two-sample \(t\)-test, since observations are naturally paired by day. As for the first part of the question, a one-sided test matches the directional hypothesis that CAC had higher returns than SMI, while a two-sided test is appropriate if we only ask whether the two indices' returns differ.
Since there are 4 stocks, we have \(\binom{4}{2}=6\) pairs of stocks to compare, and the Bonferroni-corrected significance level should be \[ \alpha_{\mbox{corrected}} = \frac{\alpha}{6} = \frac{0.01}{6} \approx 0.00167. \]
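As an illustration (not part of the original answer), all six pairwise comparisons with a Bonferroni adjustment can be run in a single call; note that pairwise.t.test performs unpaired tests with a pooled SD by default:
pairwise.t.test(stock_log_returns$log_returns,
                stock_log_returns$index,
                p.adjust.method = "bonferroni")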