This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:

- tidyverse for data manipulation and plotting
- readxl for reading Excel files
- janitor for data cleaning
- zoo for some extra functions for working with time

You only need to install these once.
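If any of these are missing, install them once from the console, for example:
install.packages(c("tidyverse", "readxl", "janitor", "zoo"))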
# Load required packages
library(tidyverse)
library(readxl)
library(janitor)
library(zoo)
Here is how we read a dataset into R, cleaning it at the same time:
police_kill <- read_csv("../Data/police_killings_MPV.csv") %>%
  clean_names() %>%                                    # make column names snake_case
  mutate(victims_age = parse_number(victims_age)) %>%  # ages stored as text -> numeric
  rename(date_of_incident = date_of_incident_month_day_year) %>%
  mutate(date_of_incident = dmy(date_of_incident)) %>% # parse day-month-year dates
  rename(alleged_weapon = alleged_weapon_source_wa_po_and_review_of_cases_not_included_in_wa_po_database) %>%
  rename(alleged_threat = alleged_threat_level_source_wa_po) %>%
  rename(mental_symptoms = symptoms_of_mental_illness) %>%
  remove_constant() %>%                                # drop columns with a single constant value
  select(starts_with("victim"),
         contains("incident"),
         city, state, zipcode, county, cause_of_death,
         criminal_charges,
         alleged_threat,
         alleged_weapon,
         mental_symptoms) %>%
  mutate(month = as.yearmon(date_of_incident))         # year-month column from the zoo package
head(police_kill)
And here is a simple table with the janitor package:
tabyl(police_kill, cause_of_death)
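As an extra illustration (not in the original notes), tabyl can also cross-tabulate two variables, for example cause of death against whether criminal charges were filed:
tabyl(police_kill, cause_of_death, criminal_charges)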
Read the data from fatalities.xlsx into R (available on Dropbox):
You should clean variable names, remove rows containing missing values, and reformat the data to the long format with four columns:

- accident_type: the first column in fatalities.xlsx,
- industry: the names of the rest of the columns in fatalities.xlsx,
- number_of_accidents: the entries in fatalities.xlsx,
- year: the sheet names in fatalities.xlsx.
Then plot the number of accidents by year for the industries construction, manufacturing, and transportation_storage, keeping only rows with accident_type == "Total", with colour representing the industry.
## ANSWER
# Read and reshape the sheet for a single year y
get_one_year_of_accidents <- function(y) {
  read_excel("../Data/fatalities.xlsx", sheet = as.character(y)) %>%
    clean_names() %>%
    drop_na() %>%                         # remove rows containing missing values
    rename(accident_type = accident_type_industry_ssic2020) %>%
    pivot_longer(cols = -accident_type,   # wide -> long: one row per industry
                 names_to = "industry",
                 values_to = "number_of_accidents") %>%
    mutate(year = y)                      # record which sheet the rows came from
}
fat_2019 <- get_one_year_of_accidents(2019)
fat_2020 <- get_one_year_of_accidents(2020)
fat_2021 <- get_one_year_of_accidents(2021)
fat_2022 <- get_one_year_of_accidents(2022)
fat_2023 <- get_one_year_of_accidents(2023)
fat_2024 <- get_one_year_of_accidents(2024)
accident_data <- bind_rows(fat_2019, fat_2020,
                           fat_2021, fat_2022,
                           fat_2023, fat_2024)
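As an aside (not part of the original answer), the six calls above could equivalently be written as a loop over the years with purrr, which is loaded with the tidyverse:
accident_data <- map(2019:2024, get_one_year_of_accidents) %>%
  bind_rows()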
accident_data %>%
  filter(accident_type == "Total") %>%
  filter(industry %in% c("construction",
                         "manufacturing",
                         "transportation_storage")) %>%
  ggplot(aes(x = year, y = number_of_accidents, color = industry)) +
  geom_line()
Read a dataset of your choice into R, clean names, remove empty columns, remove rows containing missing values if necessary, format numbers as numbers, and dates as dates.
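There is no single answer here, but as a minimal sketch of the pattern, assuming a hypothetical file my_data.csv with a text column amount and a day-month-year text column date:
my_data <- read_csv("../Data/my_data.csv") %>%   # hypothetical file name
  clean_names() %>%
  remove_empty("cols") %>%                       # drop columns that are entirely NA
  drop_na() %>%                                  # drop rows containing missing values
  mutate(amount = parse_number(amount),          # hypothetical text column -> numeric
         date = dmy(date))                       # hypothetical date column -> Date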
We will use the data on European stock markets available in base R. Originally, this is a time series:
EuStockMarkets %>% head()
## Time Series:
## Start = c(1991, 130)
## End = c(1991, 135)
## Frequency = 260
## DAX SMI CAC FTSE
## 1991.496 1628.75 1678.1 1772.8 2443.6
## 1991.500 1613.63 1688.5 1750.5 2460.2
## 1991.504 1606.51 1678.6 1718.0 2448.2
## 1991.508 1621.04 1684.1 1708.1 2470.4
## 1991.512 1618.16 1686.6 1723.1 2484.7
## 1991.515 1610.61 1671.6 1714.3 2466.8
We will convert it to a data frame first:
# Extract the numeric time index (in decimal years)
dates <- as.numeric(time(EuStockMarkets))
# Now build a tibble with the time index as the first column
stocks_tbl <- EuStockMarkets %>%
  as_tibble() %>%
  mutate(date = dates) %>%
  relocate(date)
head(stocks_tbl)
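The date column above is a decimal year rather than a true Date; if an actual Date is preferred, it could be converted with lubridate, for example (a sketch; the rest of the notebook keeps the numeric time index):
stocks_tbl %>%
  mutate(date = as_date(date_decimal(date)))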
Write a single chain of pipes to create a new data frame, stock_log_returns, by going through the following steps.

Reshape stocks_tbl to the long format with the columns

- date: the original date in stocks_tbl,
- index: the column names of stocks_tbl,
- price: the entries of stocks_tbl.
For each index, compute log-returns, i.e., \[ \log\frac{\mbox{Price tomorrow}}{\mbox{Price today}} \] Note that the functions lag() and lead() can be useful for this task.
Drop rows containing missing values
As a sanity check, verify that all the indices have the same number of observations, and plot log-returns vs date with colours representing the different indices.
## ANSWER
#
# We will first define a convenience function for computing log-returns:
compute_log_returns <- function(x) {
  log(lead(x)) - log(x)
}
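As a quick check of what this returns, the last element is always NA, which is why drop_na() appears below:
compute_log_returns(c(100, 101, 103))   # two one-step log-returns, then NA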
## Now there are two ways to solve the problem
#
# First Method - reshape, then compute log-returns
# Here, the original price will be preserved
stock_log_returns <- stocks_tbl %>%
  pivot_longer(cols = -date,
               names_to = "index",
               values_to = "price") %>%
  group_by(index) %>%
  mutate(log_returns = compute_log_returns(price)) %>%
  drop_na() %>%
  ungroup()
# Second Method - first, compute log-returns, then reshape:
# Here, the original price will be dropped
stock_log_returns <- stocks_tbl %>%
  mutate(across(-date, compute_log_returns)) %>%
  pivot_longer(cols = -date,
               names_to = "index",
               values_to = "log_returns") %>%
  drop_na()
stock_log_returns %>% count(index)
stock_log_returns %>%
  ggplot(aes(x = date, y = log_returns, color = index)) + geom_line()
Here, we will apply the \(t\)-test to test a few statistical hypotheses.
Test the hypothesis that CAC (the French stock market index) had nonzero returns over the given period against the null hypothesis that it had zero returns.
Apply the two-sample \(t\)-test to test the hypothesis that CAC had higher returns than SMI (the Swiss Market Index) over the given period.
Apply the paired \(t\)-test to test the same hypothesis, that CAC had higher returns than SMI over the given period.
To compare two market indices, such as CAC and SMI, is it more appropriate to use a two-sided or a one-sided \(t\)-test? A two-sample or a paired \(t\)-test?
If you want to test all possible hypotheses of the type "Stock A and Stock B have different return rates" and decide to reject the null hypothesis at the significance level \(\alpha = 0.01\), what should the Bonferroni correction be?
## ANSWER
stock_log_returns %>%
  filter(index == "CAC") %>%
  pull(log_returns) %>%
  t.test()
##
## One Sample t-test
##
## data: .
## t = 1.7083, df = 1858, p-value = 0.08775
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -6.471217e-05 9.388201e-04
## sample estimates:
## mean of x
## 0.000437054
stock_log_returns %>%
  filter(index %in% c("CAC", "SMI")) %>%
  t.test(log_returns ~ index, .)
##
## Welch Two Sample t-test
##
## data: log_returns by index
## t = -1.1406, df = 3606.5, p-value = 0.2541
## alternative hypothesis: true difference in means between group CAC and group SMI is not equal to 0
## 95 percent confidence interval:
## -0.0010354735 0.0002737822
## sample estimates:
## mean in group CAC mean in group SMI
## 0.0004370540 0.0008178997
cac <- stock_log_returns %>% filter(index == "CAC") %>% pull(log_returns)
smi <- stock_log_returns %>% filter(index == "SMI") %>% pull(log_returns)
t.test(cac, smi, paired = TRUE)
##
## Paired t-test
##
## data: cac and smi
## t = -1.8186, df = 1858, p-value = 0.06913
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -7.915590e-04 2.986767e-05
## sample estimates:
## mean difference
## -0.0003808457
In this case, the paired \(t\)-test makes more sense than the two-sample \(t\)-test, since observations are naturally paired by day. As for the first part of the question, a one-sided test matches the directional hypothesis that CAC had higher returns than SMI, while a two-sided test is appropriate if we only ask whether the two indices' returns differ.
Since there are 4 stocks, we have \(\binom{4}{2}=6\) pairs of stocks to compare, and the Bonferroni-corrected significance level should be \[ \alpha_{\mbox{corrected}} = \frac{\alpha}{6} = \frac{0.01}{6} \approx 0.00167. \]
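As an illustration (not part of the original answer), all six pairwise comparisons with a Bonferroni adjustment can be run in a single call; note that pairwise.t.test performs unpaired tests with a pooled SD by default:
pairwise.t.test(stock_log_returns$log_returns,
                stock_log_returns$index,
                p.adjust.method = "bonferroni")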