Info

Objective

This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.

The goal is to learn R by doing, not just by listening.

Your Task

  • Attempt each question yourself before checking the answer or asking for help.
  • Use lecture notes and the example code provided.
  • Update your answers after Fedor’s explanations.
  • Feel free to work with your neighbor if you get stuck — but make sure you understand the final answer!

Initial Setup

Before we begin, make sure you’ve installed the required packages:

  • tidyverse for data manipulation and plotting
  • janitor for data cleaning
  • zoo for some extra functions for working with time

You only need to install these once.

U-Test

Couples Data

We will work with couples_data here.

Data: survey of American couples “How Couples Meet and Stay Together” 2017

https://data.stanford.edu/hcmst2017

We will work with a subset of the data (the whole dataset has 285 variables). We have the following variables:

couples_data <- read_csv("../Data/couples_data_processed.csv")
couples_data %>% names()
##  [1] "married"            "age_when_met"       "met_online"        
##  [4] "met_vacation"       "work_neighbors"     "met_through_family"
##  [7] "met_through_friend" "sex_frequency"      "race"              
## [10] "age"                "education"          "gender"            
## [13] "income"             "religious"

Most of them are self-explanatory, but we need to clarify three of them:

  • sex_frequency: How often do you have sex with your partner?
Value Description
-1 No data (did not answer)
1 Once a day or more
2 3 to 6 times a week
3 Once or twice a week
4 2 to 3 times a month
5 Once a month or less
7 Never

Here is a table:

couples_data %>% tabyl(sex_frequency)
  • religious - How often do you attend religious services?
Value Description
-1 No data (did not answer)
1 More than once a week
2 Once a week
3 Once or twice a month
4 A few times a year
5 Once a year or less
6 Never

Here is a table:

couples_data %>% tabyl(religious)
  • education - Education Level
Numeric Label
1 No formal education
2 1st, 2nd, 3rd, or 4th grade
3 5th or 6th grade
4 7th or 8th grade
5 9th grade
6 10th grade
7 11th grade
8 12th grade NO DIPLOMA
9 HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED)
10 Some college, no degree
11 Associate degree
12 Bachelors degree
13 Masters degree
14 Professional or Doctorate degree

Here is a table:

couples_data %>% tabyl(education)
  • income - Household Income
Numeric Label
1 Less than $5,000
2 $5,000 to $7,499
3 $7,500 to $9,999
4 $10,000 to $12,499
5 $12,500 to $14,999
6 $15,000 to $19,999
7 $20,000 to $24,999
8 $25,000 to $29,999
9 $30,000 to $34,999
10 $35,000 to $39,999
11 $40,000 to $49,999
12 $50,000 to $59,999
13 $60,000 to $74,999
14 $75,000 to $84,999
15 $85,000 to $99,999
16 $100,000 to $124,999
17 $125,000 to $149,999
18 $150,000 to $174,999
19 $175,000 to $199,999
20 $200,000 to $249,999
21 $250,000 or more

Here is a table:

couples_data %>% tabyl(income)

Question 1

Plots distributions of

  1. religious,
  2. education, and
  3. income

as frequency bar charts coloured by met_online. Do distributions seem similar or different?

ANSWER HERE

Part (a)

recode_religious_frq <- function(x) {
  result <- as.character(x)
  result[x == -1] <- "No data (did not answer)"
  result[x == 1]  <- "More than once a week"
  result[x == 2]  <- "Once a week"
  result[x == 3]  <- "Once or twice a month"
  result[x == 4]  <- "A few times a year"
  result[x == 5]  <- "Once a year or less"
  result[x == 6]  <- "Never"
  as_factor(result)
}

couples_data %>%
  ggplot(aes(x = religious, y = after_stat(prop), fill = met_online)) +
  geom_bar(position = "dodge") +
  scale_x_continuous(
    breaks = -1:6,
    labels = recode_religious_frq(-1:6)
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Frequency of attending religious services", y = "Proportion")

Distributions seem different with a much bigger fraction of totally unreligious people among those who met online.

Part (b)

couples_data %>%
  ggplot(aes(x = education, y = after_stat(prop), fill = met_online)) +
  geom_bar(position = "dodge") +
  scale_x_continuous(
    breaks = 1:14
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Education level", y = "Proportion")

There seem to be a slightly higher fraction of people with degre among thos who met online, but the difference does not seem large.

Part (c)

couples_data %>%
  ggplot(aes(x = income, y = after_stat(prop), fill = met_online)) +
  geom_bar(position = "dodge") +
  scale_x_continuous(
    breaks = 1:21
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Education level", y = "Proportion")

The distributions of income are mostly similar, but there are more people with income 60-75K among those who met online and there are more people with income 100-125K who met offline. So the latter could have a bit higher income.

Question 2

Apply the U-Test to see how distributions / medians of

  1. religious,
  2. education, and
  3. income

are different in groups who met online or met offline. Come up with some reasonable interpretation of results.

ANSWER

Part (a)

Distributions seem different:

couples_data %>%
  filter(religious %in% 1:5) %>%
  wilcox.test(religious ~ met_online, .)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  religious by met_online
## W = 139712, p-value = 0.01319
## alternative hypothesis: true location shift is not equal to 0
couples_data %>%
  filter(religious %in% 1:5) %>%
  group_by(met_online) %>%
  summarise(num_obs = n(),
            married = mean(married == "yes"),
            median_religious = median(religious)) %>%
  mutate(label = recode_religious_frq(median_religious))

There is a significant difference between attending religious services of people who met online and people who met offline. It may be due to the fact that religious people are more traditional or it may be they are older and less likely to meet online.

Part (b)

Distributions seem different:

couples_data %>%
  wilcox.test(education ~ met_online, .)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  education by met_online
## W = 315724, p-value = 0.001198
## alternative hypothesis: true location shift is not equal to 0
couples_data %>%
  group_by(met_online) %>%
  summarise(num_obs = n(),
            married = mean(married == "yes"),
            median_education = median(education)) 

The median education levels for people who met offline and online, respectively, are “Some college, no degree” and “Associate degree”. The difference is not much, but it probably reflects something since the distributions are statistically different. There may be several explanations:

  • Educated people have hard time finding good partners nearby and are more inclined to online dating.

  • The sample could be biased towards older people and older people who prefer online dating tend to be more educated.

Part (c)

Distributions aren’t very different:

couples_data %>%
  wilcox.test(income ~ met_online, .)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  income by met_online
## W = 374036, p-value = 0.1778
## alternative hypothesis: true location shift is not equal to 0

This could be because there is truly no difference in income between people who opt for online dating and those who don’t. Alternatively, there might be a trade-off: people who prefer online dating are younger and more educated, the former drives their income down and the latter up. Or our sample might be biased.

ANOVA

Data

In this task, we will work with the data of victims of police in the USA, but we will load a clean version of this data from the R package usdata:

fatal_police_shootings %>% head()

The codes of races can be found in the R manual ?fatal_police_shootings:

  • “W” White non-Hispanic;
  • “B” Black non-Hispanic;
  • “A” Asian;
  • “N” Native American;
  • “H” Hispanic;
  • “O” Other
  • None unknown.

However, we won’t need them for this task.

The second piece of information we need to work with the data is how states are matched to large regions. This is available in base R:

us_state_region <- data.frame(
  state = state.abb,
  region = state.region
) 

us_state_region

Further, we will need state populations available in usdata:

state_stats

We will use the 2010 populations.

Question 3

Find the number of people murdered by the police every month in every region of the USA and produce a line chart coloured by the region.

## ANSWER

# First, we find region populations
region_pops <- state_stats %>%
  select(abbr, pop2010) %>%
  rename(state = abbr, population = pop2010) %>%
  left_join(us_state_region) %>%
  group_by(region) %>%
  summarise(population = sum(population)) %>%
  drop_na()

region_pops
# ANSWER HERE
# Now, victim counts by region and month
victim_counts <- fatal_police_shootings %>%
  left_join(us_state_region) %>%
  mutate(month = as.yearmon(date)) %>%
  count(month, region, .drop = FALSE) %>%
  drop_na() %>%
  left_join(region_pops) %>%
  mutate(victims_per_1M = n / population * 1e+6)

victim_counts %>%
  ggplot(aes(x = month, y = victims_per_1M, color = region)) +
  geom_line()

Question 4

Test the hypothesis that all regions have the same mean monthly number of victims per 1M population. Use aov() and TukeyHSD()

# ANSWER HERE
victim_anova <- victim_counts %>% aov(victims_per_1M ~ region, .) 
victim_anova %>% report()
victim_anova %>% tidy()
victim_anova %>% TukeyHSD()
## The ANOVA (formula: victims_per_1M ~ region) suggests that:
## 
##   - The main effect of region is statistically significant and large (F(3, 312) =
## 282.29, p < .001; Eta2 = 0.73, 95% CI [0.69, 1.00])
## 
## Effect sizes were labelled following Field's (2013) recommendations.
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = victims_per_1M ~ region, data = .)
## 
## $region
##                                diff         lwr         upr p adj
## South-Northeast          0.19666781  0.16891491  0.22442071     0
## North Central-Northeast  0.09328789  0.06553499  0.12104079     0
## West-Northeast           0.29507813  0.26732523  0.32283103     0
## North Central-South     -0.10337992 -0.13113283 -0.07562702     0
## West-South               0.09841032  0.07065742  0.12616322     0
## West-North Central       0.20179024  0.17403734  0.22954315     0

Question 5

Apply oneway.test() to the same hypothesis as in Question 4 to find the \(p\)-value. Interpret the findings.

# ANSWER HERE
victim_counts %>% oneway.test(victims_per_1M ~ region, .,
                              var.equal = FALSE) 
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  victims_per_1M and region
## F = 313.36, num df = 3.00, denom df = 167.39, p-value < 2.2e-16

We see that the numbers of victims of police shootings are different across regions and the difference is statistically significant. But it is not clear why, i.e., whether the West has more criminal activity than the North East, more guns, more violent police etc. We just see that the difference is so large that it can’t be explained by random chance.

Kruskal-Wallis Test

Kruskal-Wallis test is a generalization of the U-test for more than 2 groups and is a non-parametric analogue of ANOVA.

Question 6

Find median values of religious across different races in couples_data and apply the the Kruskal-Wallis test to see of the difference is statistically significant. Explain your findings.

## ANSWER

couples_data %>%
  filter(religious %in% 1:5) %>%
  group_by(race) %>%
  summarise(num_obs = n(),
            married = mean(married == "yes"),
            med = median(religious)) %>%
  mutate(label = recode_religious_frq(med))
## Here is the test
couples_data %>%
  filter(religious %in% 1:5) %>%
  kruskal.test(religious ~ race, . ) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  religious by race
## Kruskal-Wallis chi-squared = 11.542, df = 4, p-value = 0.02111

There is a noticeable difference that can, nevertheless, be explained by pure chance or by a biased sample. But yeah, different races, may, indeed, have different median levels of religiousness.

Numeric Experiment: U-test vs t-test

Recall that the t-test has certain assumptions that make not applicable in certain cases. But the U-test can be applied all the time. Let us conduct a numeric experiment - generate random data, apply both tests and see which of the two gives a smaller \(p\)-value.

Question 7

For \(n=10,11,\dots,50\), generate a sample of \(n\) normal variables with \(\mu=0.5\) and \(\sigma^2=1\). Apply the t-test and the U-test to the sample, and plot the \(p\)-value vs \(n\) coloured by the test with log-scale. Which of the two tests yields smaller \(p\)-values?

## Answer

set.seed(42)
two_tests <- function(n) {
  random_x <- rnorm(n, mean = 0.5) 
  df_u <- wilcox.test(random_x) %>% tidy() %>% 
    select(p.value) %>%
    mutate(n = n, test = "t")
  df_t <- t.test(random_x) %>% tidy() %>% 
    select(p.value) %>%
    mutate(n = n, test = "u")
  df_u %>%
    bind_rows(df_t)
}

df_experiment <- 10:50 %>% map_df(two_tests) 

df_experiment %>%
  ggplot(aes(x = n, y = p.value, colour = test)) +
  geom_line() + scale_y_log10()

Apparently, the \(p\)-values are almost the same.

Model Answers: