This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:
tidyverse
for data manipulation and plottingjanitor
for data cleaningzoo
for some extra functions for working with timeYou only need to install these once.
We will work with couples_data
here.
Data: survey of American couples “How Couples Meet and Stay Together” 2017
https://data.stanford.edu/hcmst2017
We will work with a subset of the data (the whole dataset has 285 variables). We have the following variables:
couples_data <- read_csv("../Data/couples_data_processed.csv")
couples_data %>% names()
## [1] "married" "age_when_met" "met_online"
## [4] "met_vacation" "work_neighbors" "met_through_family"
## [7] "met_through_friend" "sex_frequency" "race"
## [10] "age" "education" "gender"
## [13] "income" "religious"
Most of them are self-explanatory, but we need to clarify three of them:
sex_frequency
: How often do you have sex with your
partner?Value | Description |
---|---|
-1 | No data (did not answer) |
1 | Once a day or more |
2 | 3 to 6 times a week |
3 | Once or twice a week |
4 | 2 to 3 times a month |
5 | Once a month or less |
7 | Never |
Here is a table:
couples_data %>% tabyl(sex_frequency)
religious
- How often do you attend religious
services?Value | Description |
---|---|
-1 | No data (did not answer) |
1 | More than once a week |
2 | Once a week |
3 | Once or twice a month |
4 | A few times a year |
5 | Once a year or less |
6 | Never |
Here is a table:
couples_data %>% tabyl(religious)
education
- Education LevelNumeric | Label |
---|---|
1 | No formal education |
2 | 1st, 2nd, 3rd, or 4th grade |
3 | 5th or 6th grade |
4 | 7th or 8th grade |
5 | 9th grade |
6 | 10th grade |
7 | 11th grade |
8 | 12th grade NO DIPLOMA |
9 | HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED) |
10 | Some college, no degree |
11 | Associate degree |
12 | Bachelors degree |
13 | Masters degree |
14 | Professional or Doctorate degree |
Here is a table:
couples_data %>% tabyl(education)
income
- Household IncomeNumeric | Label |
---|---|
1 | Less than $5,000 |
2 | $5,000 to $7,499 |
3 | $7,500 to $9,999 |
4 | $10,000 to $12,499 |
5 | $12,500 to $14,999 |
6 | $15,000 to $19,999 |
7 | $20,000 to $24,999 |
8 | $25,000 to $29,999 |
9 | $30,000 to $34,999 |
10 | $35,000 to $39,999 |
11 | $40,000 to $49,999 |
12 | $50,000 to $59,999 |
13 | $60,000 to $74,999 |
14 | $75,000 to $84,999 |
15 | $85,000 to $99,999 |
16 | $100,000 to $124,999 |
17 | $125,000 to $149,999 |
18 | $150,000 to $174,999 |
19 | $175,000 to $199,999 |
20 | $200,000 to $249,999 |
21 | $250,000 or more |
Here is a table:
couples_data %>% tabyl(income)
Plots distributions of
religious
,education
, andincome
as frequency bar charts coloured by met_online
. Do
distributions seem similar or different?
ANSWER HERE
Part (a)
recode_religious_frq <- function(x) {
result <- as.character(x)
result[x == -1] <- "No data (did not answer)"
result[x == 1] <- "More than once a week"
result[x == 2] <- "Once a week"
result[x == 3] <- "Once or twice a month"
result[x == 4] <- "A few times a year"
result[x == 5] <- "Once a year or less"
result[x == 6] <- "Never"
as_factor(result)
}
couples_data %>%
ggplot(aes(x = religious, y = after_stat(prop), fill = met_online)) +
geom_bar(position = "dodge") +
scale_x_continuous(
breaks = -1:6,
labels = recode_religious_frq(-1:6)
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Frequency of attending religious services", y = "Proportion")
Distributions seem different with a much bigger fraction of totally unreligious people among those who met online.
Part (b)
couples_data %>%
ggplot(aes(x = education, y = after_stat(prop), fill = met_online)) +
geom_bar(position = "dodge") +
scale_x_continuous(
breaks = 1:14
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Education level", y = "Proportion")
There seem to be a slightly higher fraction of people with degre among thos who met online, but the difference does not seem large.
Part (c)
couples_data %>%
ggplot(aes(x = income, y = after_stat(prop), fill = met_online)) +
geom_bar(position = "dodge") +
scale_x_continuous(
breaks = 1:21
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Education level", y = "Proportion")
The distributions of income are mostly similar, but there are more people with income 60-75K among those who met online and there are more people with income 100-125K who met offline. So the latter could have a bit higher income.
Apply the U-Test to see how distributions / medians of
religious
,education
, andincome
are different in groups who met online or met offline. Come up with some reasonable interpretation of results.
ANSWER
Part (a)
Distributions seem different:
couples_data %>%
filter(religious %in% 1:5) %>%
wilcox.test(religious ~ met_online, .)
##
## Wilcoxon rank sum test with continuity correction
##
## data: religious by met_online
## W = 139712, p-value = 0.01319
## alternative hypothesis: true location shift is not equal to 0
couples_data %>%
filter(religious %in% 1:5) %>%
group_by(met_online) %>%
summarise(num_obs = n(),
married = mean(married == "yes"),
median_religious = median(religious)) %>%
mutate(label = recode_religious_frq(median_religious))
There is a significant difference between attending religious services of people who met online and people who met offline. It may be due to the fact that religious people are more traditional or it may be they are older and less likely to meet online.
Part (b)
Distributions seem different:
couples_data %>%
wilcox.test(education ~ met_online, .)
##
## Wilcoxon rank sum test with continuity correction
##
## data: education by met_online
## W = 315724, p-value = 0.001198
## alternative hypothesis: true location shift is not equal to 0
couples_data %>%
group_by(met_online) %>%
summarise(num_obs = n(),
married = mean(married == "yes"),
median_education = median(education))
The median education levels for people who met offline and online, respectively, are “Some college, no degree” and “Associate degree”. The difference is not much, but it probably reflects something since the distributions are statistically different. There may be several explanations:
Educated people have hard time finding good partners nearby and are more inclined to online dating.
The sample could be biased towards older people and older people who prefer online dating tend to be more educated.
Part (c)
Distributions aren’t very different:
couples_data %>%
wilcox.test(income ~ met_online, .)
##
## Wilcoxon rank sum test with continuity correction
##
## data: income by met_online
## W = 374036, p-value = 0.1778
## alternative hypothesis: true location shift is not equal to 0
This could be because there is truly no difference in income between people who opt for online dating and those who don’t. Alternatively, there might be a trade-off: people who prefer online dating are younger and more educated, the former drives their income down and the latter up. Or our sample might be biased.
In this task, we will work with the data of victims of police in the
USA, but we will load a clean version of this data from the R package
usdata
:
fatal_police_shootings %>% head()
The codes of races can be found in the R manual
?fatal_police_shootings
:
However, we won’t need them for this task.
The second piece of information we need to work with the data is how states are matched to large regions. This is available in base R:
us_state_region <- data.frame(
state = state.abb,
region = state.region
)
us_state_region
Further, we will need state populations available in
usdata
:
state_stats
We will use the 2010 populations.
Find the number of people murdered by the police every month in every region of the USA and produce a line chart coloured by the region.
## ANSWER
# First, we find region populations
region_pops <- state_stats %>%
select(abbr, pop2010) %>%
rename(state = abbr, population = pop2010) %>%
left_join(us_state_region) %>%
group_by(region) %>%
summarise(population = sum(population)) %>%
drop_na()
region_pops
# ANSWER HERE
# Now, victim counts by region and month
victim_counts <- fatal_police_shootings %>%
left_join(us_state_region) %>%
mutate(month = as.yearmon(date)) %>%
count(month, region, .drop = FALSE) %>%
drop_na() %>%
left_join(region_pops) %>%
mutate(victims_per_1M = n / population * 1e+6)
victim_counts %>%
ggplot(aes(x = month, y = victims_per_1M, color = region)) +
geom_line()
Test the hypothesis that all regions have the same mean monthly
number of victims per 1M population. Use aov()
and
TukeyHSD()
# ANSWER HERE
victim_anova <- victim_counts %>% aov(victims_per_1M ~ region, .)
victim_anova %>% report()
victim_anova %>% tidy()
victim_anova %>% TukeyHSD()
## The ANOVA (formula: victims_per_1M ~ region) suggests that:
##
## - The main effect of region is statistically significant and large (F(3, 312) =
## 282.29, p < .001; Eta2 = 0.73, 95% CI [0.69, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = victims_per_1M ~ region, data = .)
##
## $region
## diff lwr upr p adj
## South-Northeast 0.19666781 0.16891491 0.22442071 0
## North Central-Northeast 0.09328789 0.06553499 0.12104079 0
## West-Northeast 0.29507813 0.26732523 0.32283103 0
## North Central-South -0.10337992 -0.13113283 -0.07562702 0
## West-South 0.09841032 0.07065742 0.12616322 0
## West-North Central 0.20179024 0.17403734 0.22954315 0
Apply oneway.test()
to the same hypothesis as in
Question 4 to find the \(p\)-value.
Interpret the findings.
# ANSWER HERE
victim_counts %>% oneway.test(victims_per_1M ~ region, .,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: victims_per_1M and region
## F = 313.36, num df = 3.00, denom df = 167.39, p-value < 2.2e-16
We see that the numbers of victims of police shootings are different across regions and the difference is statistically significant. But it is not clear why, i.e., whether the West has more criminal activity than the North East, more guns, more violent police etc. We just see that the difference is so large that it can’t be explained by random chance.
Kruskal-Wallis test is a generalization of the U-test for more than 2 groups and is a non-parametric analogue of ANOVA.
Find median values of religious
across different races
in couples_data
and apply the the Kruskal-Wallis test to
see of the difference is statistically significant. Explain your
findings.
## ANSWER
couples_data %>%
filter(religious %in% 1:5) %>%
group_by(race) %>%
summarise(num_obs = n(),
married = mean(married == "yes"),
med = median(religious)) %>%
mutate(label = recode_religious_frq(med))
## Here is the test
couples_data %>%
filter(religious %in% 1:5) %>%
kruskal.test(religious ~ race, . )
##
## Kruskal-Wallis rank sum test
##
## data: religious by race
## Kruskal-Wallis chi-squared = 11.542, df = 4, p-value = 0.02111
There is a noticeable difference that can, nevertheless, be explained by pure chance or by a biased sample. But yeah, different races, may, indeed, have different median levels of religiousness.
Recall that the t-test has certain assumptions that make not applicable in certain cases. But the U-test can be applied all the time. Let us conduct a numeric experiment - generate random data, apply both tests and see which of the two gives a smaller \(p\)-value.
For \(n=10,11,\dots,50\), generate a sample of \(n\) normal variables with \(\mu=0.5\) and \(\sigma^2=1\). Apply the t-test and the U-test to the sample, and plot the \(p\)-value vs \(n\) coloured by the test with log-scale. Which of the two tests yields smaller \(p\)-values?
## Answer
set.seed(42)
two_tests <- function(n) {
random_x <- rnorm(n, mean = 0.5)
df_u <- wilcox.test(random_x) %>% tidy() %>%
select(p.value) %>%
mutate(n = n, test = "t")
df_t <- t.test(random_x) %>% tidy() %>%
select(p.value) %>%
mutate(n = n, test = "u")
df_u %>%
bind_rows(df_t)
}
df_experiment <- 10:50 %>% map_df(two_tests)
df_experiment %>%
ggplot(aes(x = n, y = p.value, colour = test)) +
geom_line() + scale_y_log10()
Apparently, the \(p\)-values are almost the same.