Info

Objective

These homework problem sets are designed to help you understand material better. You should try doing these problems first and then look at model answers. You can use Generative AI as to help, such as prompt “Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain”. It is also a good idea to feed an error message together with your code to Generative AI and ask it to help with fixing errors. But it is pointless to just solve all questions with ChatGPT because you won’t be learning anything.

Your task

Read instructions and write your solutions to these questions into the space provided. Then check the model answers (the link is in the end of the notebook).

Couples Data

In this task, we will find out people of which race are murdered by police most often. The original dataset is here:

https://www.kaggle.com/discussions/general/158339

Data

We will work with couples_data here.

Data: survey of American couples “How Couples Meet and Stay Together” 2017

https://data.stanford.edu/hcmst2017

We will work with a subset of the data (the whole dataset has 285 variables). We have the following variables:

couples_data <- read_csv("../Data/couples_data_processed.csv")
couples_data %>% names()
##  [1] "married"            "age_when_met"       "met_online"        
##  [4] "met_vacation"       "work_neighbors"     "met_through_family"
##  [7] "met_through_friend" "sex_frequency"      "race"              
## [10] "age"                "education"          "gender"            
## [13] "income"             "religious"

Most of them are self-explanatory, but we need to clarify three of them:

  • sex_frequency: How often do you have sex with your partner?
Value Description
-1 No data (did not answer)
1 Once a day or more
2 3 to 6 times a week
3 Once or twice a week
4 2 to 3 times a month
5 Once a month or less
7 Never

Here is a table:

couples_data %>% tabyl(sex_frequency)
  • religious - How often do you attend religious services?
Value Description
-1 No data (did not answer)
1 More than once a week
2 Once a week
3 Once or twice a month
4 A few times a year
5 Once a year or less
6 Never

Here is a table:

couples_data %>% tabyl(religious)
  • education - Education Level
Numeric Label
1 No formal education
2 1st, 2nd, 3rd, or 4th grade
3 5th or 6th grade
4 7th or 8th grade
5 9th grade
6 10th grade
7 11th grade
8 12th grade NO DIPLOMA
9 HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED)
10 Some college, no degree
11 Associate degree
12 Bachelors degree
13 Masters degree
14 Professional or Doctorate degree

Here is a table:

couples_data %>% tabyl(education)
  • income - Household Income
Numeric Label
1 Less than $5,000
2 $5,000 to $7,499
3 $7,500 to $9,999
4 $10,000 to $12,499
5 $12,500 to $14,999
6 $15,000 to $19,999
7 $20,000 to $24,999
8 $25,000 to $29,999
9 $30,000 to $34,999
10 $35,000 to $39,999
11 $40,000 to $49,999
12 $50,000 to $59,999
13 $60,000 to $74,999
14 $75,000 to $84,999
15 $85,000 to $99,999
16 $100,000 to $124,999
17 $125,000 to $149,999
18 $150,000 to $174,999
19 $175,000 to $199,999
20 $200,000 to $249,999
21 $250,000 or more

Here is a table:

couples_data %>% tabyl(income)

Visualising ordinal data

To visualise ordinal data, we can use bar charts:

couples_data %>% ggplot(aes(x = sex_frequency)) + geom_bar()

However, this plot is not perfect since

  1. Labels on the \(x\)-axis are not informative
  2. Counts rather than relative frequencies are shown

To fix the issue, we will write a helper function recode_sex_frq() that converts numeric values to their respective labels. There are three versions using

  • indexing and subsetting a character vector.
  • case_when() from base R,
  • recode() from tidyverse.
# Method 1: indexing ans as.character()

recode_sex_freq <- function(x) {
  result <- rep("", length(x))
  result[x == -1] <- "No data (did not answer)"
  result[x == 1]  <- "Once a day or more"
  result[x == 2] <- "3 to 6 times a week"
  result[x == 3] <- "Once or twice a week"
  result[x == 4] <- "2 to 3 times a month"
  result[x == 5] <- "Once a month or less"
  result[x == 7] <- "Never"
  result
} 


# Method 2: case_when()

recode_sex_freq_2 <- function(x) {
  case_when(
      x == -1 ~ "No data (did not answer)",
      x == 1  ~ "Once a day or more",
      x == 2  ~ "3 to 6 times a week",
      x == 3  ~ "Once or twice a week",
      x == 4  ~ "2 to 3 times a month",
      x == 5  ~ "Once a month or less",
      x == 7  ~ "Never",
      .default = ""
    ) %>% as_factor()
} 


# Method 3: recode() from tidyverse

recode_sex_freq_3 <- function(x) {
  x %>% recode(
    `-1` = "No data (did not answer)",
    `1`  = "Once a day or more",
    `2`  = "3 to 6 times a week",
    `3`  = "Once or twice a week",
    `4`  = "2 to 3 times a month",
    `5`  = "Once a month or less",
    `7`  = "Never",
    .default = ""
  ) 
}


couples_data %>%
  ggplot(aes(x = sex_frequency)) +
  geom_bar(aes(y = after_stat(prop))) +
  scale_x_continuous(
    breaks = -1:7,
    labels = recode_sex_freq_3(-1:7)
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Sex frequency", y = "Proportion")

Question 1

Write a similar helper function recode_religious_frq() that converts numeric values to their respective labels. Construct a frequency plot of religious coloured according to marital status. Hint you will need to use position = "dodge".

Are married or unmarried people more likely to attend religious serivces?

### ANSWER
# Method 1: indexing ans as.character()
recode_religious_frq <- function(x) {
  result <- as.character(x)
  result[x == -1] <- "No data (did not answer)"
  result[x == 1]  <- "More than once a week"
  result[x == 2]  <- "Once a week"
  result[x == 3]  <- "Once or twice a month"
  result[x == 4]  <- "A few times a year"
  result[x == 5]  <- "Once a year or less"
  result[x == 6]  <- "Never"
  as_factor(result)
}

# Method 2: case_when()
recode_religious_frq_2 <- function(x) {
  case_when(
    x == -1 ~ "No data (did not answer)",
    x == 1  ~ "More than once a week",
    x == 2  ~ "Once a week",
    x == 3  ~ "Once or twice a month",
    x == 4  ~ "A few times a year",
    x == 5  ~ "Once a year or less",
    x == 6  ~ "Never"
  ) %>% as_factor()
}

# Method 3: recode() from tidyverse
recode_religious_frq_3 <- function(x) {
  x %>% recode(
    `-1` = "No data (did not answer)",
    `1`  = "More than once a week",
    `2`  = "Once a week",
    `3`  = "Once or twice a month",
    `4`  = "A few times a year",
    `5`  = "Once a year or less",
    `6`  = "Never"
  ) %>% as_factor()
}

couples_data %>%
  ggplot(aes(x = religious)) +
  geom_bar(aes(y = after_stat(prop), fill = married), position = "dodge") +
  scale_x_continuous(
    breaks = -1:6,
    labels = recode_religious_frq(-1:6)
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Frequency of attending religious services", y = "Proportion")

Chi-squared test for contingency tables

Let us look at the relation between being married and having met online:

couples_data %>% tabyl(married, met_online) 

This is the contingency table of counts, where:

  • Rows represent the levels of the married variable (whether the couple is married).
  • Columns represent the levels of the met_online variable (whether the couple met online).
  • Entries are counts of how many observations fall into each combination of married and met_online.

Specifically,

  • 760 couples are not married and did not meet online.
  • 163 couples are not married and did meet online.
  • 1752 couples are married and did not meet online.
  • 121 couples are married and did meet online.

Looking at percentages of people who met offline and online among married and non-married, we see that distribution is different, i.e., the variables married and met_online seem to be dependent:

couples_data %>% tabyl(married, met_online) %>%
  adorn_percentages()

We will apply the chi-squared test to test the null hypothesis:

  • \(H_0\) — the variables married and met_online are independent.
couples_data %>% tabyl(married, met_online) %>%
  chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 83.762, df = 1, p-value < 2.2e-16

Since the \(p\)-value is extremely small, it seems highly unlikely that married and met_online are independent.

There might be different plausible explanations of this, such as

  • People who opt for online dating are often interested in a fling rather than a serious relationship.

  • Online dating started not so long ago and hence the pool of people who met online is biased towards those who haven’t been long in a relationship yet.

  • The sample of subjects in the study is biased.

Question 2

Test whether religiousness (remove entries with religious == -1) is independent with

  1. Sex frequency (remove entries with sex_frequency == -1 or sex_frequency == 7), i.e., only keep entries with sex_frequency %in% 1:5).

  2. Marital status

  3. Race

In each case, print the contingency table with fractions, apply the chi-squared test, and interpret the results.

ANSWER HERE

Part (a)

Here is the table of values (numbers in each row add up to 1):

couples_data %>% 
  filter(sex_frequency %in% 1:5 & religious %in% 1:6) %>%
  tabyl(religious, sex_frequency) %>%
  adorn_percentages() 

Now we will apply the chi-squared test.

couples_data %>% 
  filter(sex_frequency %in% 1:5 & religious %in% 1:6) %>%
  tabyl(religious, sex_frequency) %>%
  chisq.test()
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 25.62, df = 20, p-value = 0.1787

The chi-squared test does not show a large deviation from independence, which means that we can safely assume that sex frequency and religiousness are independent.

Part (b)

Here is the table of values (numbers in each row add up to 1):

couples_data %>% 
  filter(religious %in% 1:6) %>%
  tabyl(religious, married) %>%
  adorn_percentages() 

Now we will apply the chi-squared test.

couples_data %>% 
  filter(religious %in% 1:6) %>%
  tabyl(religious, married) %>%
  chisq.test()
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 105.93, df = 5, p-value < 2.2e-16

The chi-squared test shows a large deviation from independence, which means that we religiousness and marital status are dependent. A possible interpretation is that religious people tend be more prone to register their relation officially than non-religious people.

Part (c)

Here is the table of values (numbers in each row add up to 1):

couples_data %>% 
  filter(religious %in% 1:6) %>%
  tabyl(race, religious) %>%
  adorn_percentages() 

Now we will apply the chi-squared test.

couples_data %>% 
  filter(religious %in% 1:6) %>%
  tabyl(religious, race) %>%
  chisq.test()
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 59.958, df = 20, p-value = 7.23e-06

The chi-squared test shows a large deviation from independence, which means that we religiousness and race are dependent. A possible interpretation is that different races have different cultures and hence may be more or less religious.

Model answers:

https://rpubs.com/fduzhin/mh3511_hw_6_answers