These homework problem sets are designed to help you understand material better. You should try doing these problems first and then look at model answers. You can use Generative AI as to help, such as prompt “Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain”. It is also a good idea to feed an error message together with your code to Generative AI and ask it to help with fixing errors. But it is pointless to just solve all questions with ChatGPT because you won’t be learning anything.
Read instructions and write your solutions to these questions into the space provided. Then check the model answers (the link is in the end of the notebook).
In this task, we will find out people of which race are murdered by police most often. The original dataset is here:
https://www.kaggle.com/discussions/general/158339
We will work with couples_data
here.
Data: survey of American couples “How Couples Meet and Stay Together” 2017
https://data.stanford.edu/hcmst2017
We will work with a subset of the data (the whole dataset has 285 variables). We have the following variables:
couples_data <- read_csv("../Data/couples_data_processed.csv")
couples_data %>% names()
## [1] "married" "age_when_met" "met_online"
## [4] "met_vacation" "work_neighbors" "met_through_family"
## [7] "met_through_friend" "sex_frequency" "race"
## [10] "age" "education" "gender"
## [13] "income" "religious"
Most of them are self-explanatory, but we need to clarify three of them:
sex_frequency
: How often do you have sex with your
partner?Value | Description |
---|---|
-1 | No data (did not answer) |
1 | Once a day or more |
2 | 3 to 6 times a week |
3 | Once or twice a week |
4 | 2 to 3 times a month |
5 | Once a month or less |
7 | Never |
Here is a table:
couples_data %>% tabyl(sex_frequency)
religious
- How often do you attend religious
services?Value | Description |
---|---|
-1 | No data (did not answer) |
1 | More than once a week |
2 | Once a week |
3 | Once or twice a month |
4 | A few times a year |
5 | Once a year or less |
6 | Never |
Here is a table:
couples_data %>% tabyl(religious)
education
- Education LevelNumeric | Label |
---|---|
1 | No formal education |
2 | 1st, 2nd, 3rd, or 4th grade |
3 | 5th or 6th grade |
4 | 7th or 8th grade |
5 | 9th grade |
6 | 10th grade |
7 | 11th grade |
8 | 12th grade NO DIPLOMA |
9 | HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED) |
10 | Some college, no degree |
11 | Associate degree |
12 | Bachelors degree |
13 | Masters degree |
14 | Professional or Doctorate degree |
Here is a table:
couples_data %>% tabyl(education)
income
- Household IncomeNumeric | Label |
---|---|
1 | Less than $5,000 |
2 | $5,000 to $7,499 |
3 | $7,500 to $9,999 |
4 | $10,000 to $12,499 |
5 | $12,500 to $14,999 |
6 | $15,000 to $19,999 |
7 | $20,000 to $24,999 |
8 | $25,000 to $29,999 |
9 | $30,000 to $34,999 |
10 | $35,000 to $39,999 |
11 | $40,000 to $49,999 |
12 | $50,000 to $59,999 |
13 | $60,000 to $74,999 |
14 | $75,000 to $84,999 |
15 | $85,000 to $99,999 |
16 | $100,000 to $124,999 |
17 | $125,000 to $149,999 |
18 | $150,000 to $174,999 |
19 | $175,000 to $199,999 |
20 | $200,000 to $249,999 |
21 | $250,000 or more |
Here is a table:
couples_data %>% tabyl(income)
To visualise ordinal data, we can use bar charts:
couples_data %>% ggplot(aes(x = sex_frequency)) + geom_bar()
However, this plot is not perfect since
To fix the issue, we will write a helper function
recode_sex_frq()
that converts numeric values to their
respective labels. There are three versions using
case_when()
from base R,recode()
from tidyverse
.# Method 1: indexing ans as.character()
recode_sex_freq <- function(x) {
result <- rep("", length(x))
result[x == -1] <- "No data (did not answer)"
result[x == 1] <- "Once a day or more"
result[x == 2] <- "3 to 6 times a week"
result[x == 3] <- "Once or twice a week"
result[x == 4] <- "2 to 3 times a month"
result[x == 5] <- "Once a month or less"
result[x == 7] <- "Never"
result
}
# Method 2: case_when()
recode_sex_freq_2 <- function(x) {
case_when(
x == -1 ~ "No data (did not answer)",
x == 1 ~ "Once a day or more",
x == 2 ~ "3 to 6 times a week",
x == 3 ~ "Once or twice a week",
x == 4 ~ "2 to 3 times a month",
x == 5 ~ "Once a month or less",
x == 7 ~ "Never",
.default = ""
) %>% as_factor()
}
# Method 3: recode() from tidyverse
recode_sex_freq_3 <- function(x) {
x %>% recode(
`-1` = "No data (did not answer)",
`1` = "Once a day or more",
`2` = "3 to 6 times a week",
`3` = "Once or twice a week",
`4` = "2 to 3 times a month",
`5` = "Once a month or less",
`7` = "Never",
.default = ""
)
}
couples_data %>%
ggplot(aes(x = sex_frequency)) +
geom_bar(aes(y = after_stat(prop))) +
scale_x_continuous(
breaks = -1:7,
labels = recode_sex_freq_3(-1:7)
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Sex frequency", y = "Proportion")
Write a similar helper function recode_religious_frq()
that converts numeric values to their respective labels. Construct a
frequency plot of religious
coloured according to marital
status. Hint you will need to use position = "dodge"
.
Are married or unmarried people more likely to attend religious serivces?
### ANSWER
# Method 1: indexing ans as.character()
recode_religious_frq <- function(x) {
result <- as.character(x)
result[x == -1] <- "No data (did not answer)"
result[x == 1] <- "More than once a week"
result[x == 2] <- "Once a week"
result[x == 3] <- "Once or twice a month"
result[x == 4] <- "A few times a year"
result[x == 5] <- "Once a year or less"
result[x == 6] <- "Never"
as_factor(result)
}
# Method 2: case_when()
recode_religious_frq_2 <- function(x) {
case_when(
x == -1 ~ "No data (did not answer)",
x == 1 ~ "More than once a week",
x == 2 ~ "Once a week",
x == 3 ~ "Once or twice a month",
x == 4 ~ "A few times a year",
x == 5 ~ "Once a year or less",
x == 6 ~ "Never"
) %>% as_factor()
}
# Method 3: recode() from tidyverse
recode_religious_frq_3 <- function(x) {
x %>% recode(
`-1` = "No data (did not answer)",
`1` = "More than once a week",
`2` = "Once a week",
`3` = "Once or twice a month",
`4` = "A few times a year",
`5` = "Once a year or less",
`6` = "Never"
) %>% as_factor()
}
couples_data %>%
ggplot(aes(x = religious)) +
geom_bar(aes(y = after_stat(prop), fill = married), position = "dodge") +
scale_x_continuous(
breaks = -1:6,
labels = recode_religious_frq(-1:6)
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Frequency of attending religious services", y = "Proportion")
Let us look at the relation between being married and having met online:
couples_data %>% tabyl(married, met_online)
This is the contingency table of counts, where:
married
variable
(whether the couple is married).met_online
variable
(whether the couple met online).married
and met_online
.Specifically,
Looking at percentages of people who met offline and online among
married and non-married, we see that distribution is different, i.e.,
the variables married
and met_online
seem to
be dependent:
couples_data %>% tabyl(married, met_online) %>%
adorn_percentages()
We will apply the chi-squared test to test the null hypothesis:
married
and met_online
are independent.couples_data %>% tabyl(married, met_online) %>%
chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 83.762, df = 1, p-value < 2.2e-16
Since the \(p\)-value is extremely
small, it seems highly unlikely that married
and
met_online
are independent.
There might be different plausible explanations of this, such as
People who opt for online dating are often interested in a fling rather than a serious relationship.
Online dating started not so long ago and hence the pool of people who met online is biased towards those who haven’t been long in a relationship yet.
The sample of subjects in the study is biased.
Test whether religiousness (remove entries with
religious == -1
) is independent with
Sex frequency (remove entries with
sex_frequency == -1
or sex_frequency == 7
),
i.e., only keep entries with
sex_frequency %in% 1:5
).
Marital status
Race
In each case, print the contingency table with fractions, apply the chi-squared test, and interpret the results.
ANSWER HERE
Part (a)
Here is the table of values (numbers in each row add up to 1):
couples_data %>%
filter(sex_frequency %in% 1:5 & religious %in% 1:6) %>%
tabyl(religious, sex_frequency) %>%
adorn_percentages()
Now we will apply the chi-squared test.
couples_data %>%
filter(sex_frequency %in% 1:5 & religious %in% 1:6) %>%
tabyl(religious, sex_frequency) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 25.62, df = 20, p-value = 0.1787
The chi-squared test does not show a large deviation from independence, which means that we can safely assume that sex frequency and religiousness are independent.
Part (b)
Here is the table of values (numbers in each row add up to 1):
couples_data %>%
filter(religious %in% 1:6) %>%
tabyl(religious, married) %>%
adorn_percentages()
Now we will apply the chi-squared test.
couples_data %>%
filter(religious %in% 1:6) %>%
tabyl(religious, married) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 105.93, df = 5, p-value < 2.2e-16
The chi-squared test shows a large deviation from independence, which means that we religiousness and marital status are dependent. A possible interpretation is that religious people tend be more prone to register their relation officially than non-religious people.
Part (c)
Here is the table of values (numbers in each row add up to 1):
couples_data %>%
filter(religious %in% 1:6) %>%
tabyl(race, religious) %>%
adorn_percentages()
Now we will apply the chi-squared test.
couples_data %>%
filter(religious %in% 1:6) %>%
tabyl(religious, race) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 59.958, df = 20, p-value = 7.23e-06
The chi-squared test shows a large deviation from independence, which means that we religiousness and race are dependent. A possible interpretation is that different races have different cultures and hence may be more or less religious.