This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:
- tidyverse for data manipulation and plotting
- titanic for practice data

You only need to install these once. If you already did it during Homework 1, you're good to go.
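If any of them are missing, a one-time install from the console is enough. The call below is just a sketch; it also includes e1071, which this notebook loads later for skewness():

# Run once in the console if the packages are not installed yet
install.packages(c("tidyverse", "titanic", "e1071"))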
# Load required packages
library(tidyverse)
library(titanic)
library(e1071)  # provides skewness(), used at the end of this notebook
Note that this notebook loads the data from the folder “../Data”. If you have downloaded the data to a different folder, you will need to adjust the path.
Data: survey of American couples “How Couples Meet and Stay Together” 2017
https://data.stanford.edu/hcmst2017
We will work with a subset of the data (the whole dataset has 285 variables). Note that the link above is given for your reference. We will actually load the data from CSV. Below we load the data and print variable names:
couples_data <- read_csv("../Data/couples_data_processed.csv")
couples_data %>% names()
## [1] "married" "age_when_met" "met_online"
## [4] "met_vacation" "work_neighbors" "met_through_family"
## [7] "met_through_friend" "sex_frequency" "race"
## [10] "age" "education" "gender"
## [13] "income" "religious"
Most of the variables are self-explanatory, but we need to clarify four of them:

sex_frequency: How often do you have sex with your partner?

Value | Description |
---|---|
-1 | No data (did not answer) |
1 | Once a day or more |
2 | 3 to 6 times a week |
3 | Once or twice a week |
4 | 2 to 3 times a month |
5 | Once a month or less |
6 | Never |

religious: How often do you attend religious services?

Value | Description |
---|---|
-1 | No data (did not answer) |
1 | More than once a week |
2 | Once a week |
3 | Once or twice a month |
4 | A few times a year |
5 | Once a year or less |
6 | Never |

education: Education Level

Numeric | Label |
---|---|
1 | No formal education |
2 | 1st, 2nd, 3rd, or 4th grade |
3 | 5th or 6th grade |
4 | 7th or 8th grade |
5 | 9th grade |
6 | 10th grade |
7 | 11th grade |
8 | 12th grade NO DIPLOMA |
9 | HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED) |
10 | Some college, no degree |
11 | Associate degree |
12 | Bachelor's degree |
13 | Master's degree |
14 | Professional or Doctorate degree |

income: Household Income

Numeric | Label |
---|---|
1 | Less than $5,000 |
2 | $5,000 to $7,499 |
3 | $7,500 to $9,999 |
4 | $10,000 to $12,499 |
5 | $12,500 to $14,999 |
6 | $15,000 to $19,999 |
7 | $20,000 to $24,999 |
8 | $25,000 to $29,999 |
9 | $30,000 to $34,999 |
10 | $35,000 to $39,999 |
11 | $40,000 to $49,999 |
12 | $50,000 to $59,999 |
13 | $60,000 to $74,999 |
14 | $75,000 to $84,999 |
15 | $85,000 to $99,999 |
16 | $100,000 to $124,999 |
17 | $125,000 to $149,999 |
18 | $150,000 to $174,999 |
19 | $175,000 to $199,999 |
20 | $200,000 to $249,999 |
21 | $250,000 or more |
Classify each of the variables in couples_data as numeric, categorical, or ordinal. Which numeric variables are measured on a ratio scale and which on an interval scale? Briefly justify your reasoning.
ANSWER: Variables whose R class is character are categorical because their values have no natural order - male/female, yes/no, or five different races. Here are the unique values of these variables:
# We will learn how to do this in later classes, now just look at the output
couples_data %>%
select(where(is.character)) %>%
map(unique)
## $married
## [1] "yes" "no"
##
## $met_online
## [1] "no" "yes"
##
## $met_vacation
## [1] "no" "yes"
##
## $work_neighbors
## [1] "no" "yes"
##
## $met_through_family
## [1] "no" "yes"
##
## $met_through_friend
## [1] "no" "yes"
##
## $race
## [1] "White, Non-Hispanic" "Hispanic" "Black, Non-Hispanic"
## [4] "Other, Non-Hispanic" "2+ Races, Non-Hispanic"
##
## $gender
## [1] "Female" "Male"
Variables age and age_when_met are numeric variables measured on the ratio scale since there is a meaningful zero and someone who is 40 years old is twice as old as someone who is 20 years old. There are no numeric variables measured on the interval scale in this dataset (a classic interval-scale example would be temperature in degrees Celsius, where zero is arbitrary).
Variables sex_frequency, religious, education, and income are ordinal because there is a meaningful order but no meaningful scale. For instance, the difference between income == 20 and income == 19 can be anywhere between 1 and roughly 75,000 USD, while the difference between income == 3 and income == 2 is between 1 and roughly 5,000 USD. This is not even a logarithmic scale.
Write a single chain of pipe operators that:

- creates a copy of the dataset couples_data,
- subsets it by dropping observations with sex_frequency == -1 or religious == -1 (these are people who didn't respond to those questions),
- drops all variables whose name starts with "met",
- creates a new variable time_together that equals the difference between age and age_when_met,
- creates a new variable married_dummy that equals 1 whenever married == "yes" and 0 whenever married == "no",
- reorders the variables so that columns whose name starts with "age" go first, followed by time_together, followed by columns whose name starts with "married".
# ANSWER
clean_couples_data <- couples_data %>%
  filter(sex_frequency > -1 & religious > -1) %>%      # keep only respondents (non-response is coded -1)
  select(-starts_with("met")) %>%                      # drop all "met_*" variables
  mutate(time_together = age - age_when_met,
         married_dummy = 0 + (married == "yes")) %>%   # TRUE/FALSE coerced to 1/0
  relocate(starts_with("age"), time_together, starts_with("married"))
head(clean_couples_data)
Do a similar manual cleaning and transformation for a dataset of your choice.
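As an illustration (not part of the assignment), a similar pipeline for titanic_train from the titanic package might look like the sketch below; the column names Survived, Age, Ticket, and Cabin come from that package:

# A sketch only: analogous cleaning and transformation for titanic_train
titanic_train %>%
  as_tibble() %>%
  filter(!is.na(Age)) %>%                        # drop passengers with missing age
  select(-Ticket, -Cabin) %>%                    # drop identifiers we will not use
  mutate(survived_dummy = 0 + (Survived == 1),   # 0/1 dummy, as with married_dummy above
         child = Age < 18) %>%                   # indicator for minors
  relocate(Age, child, starts_with("Survived"))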
Write a single chain of pipe operators that computes the following statistics for every combination of marital status and race:

- the number of observations,
- the median age at which partners met,
- the median household income,
- the percentage of people with at least a bachelor's degree,
- the percentage of people who met their partner online,
- the percentage of people who have sex at least once a week.
You should give these new variables names that are meaningful but short, so that the whole summary table fits in the RStudio window.
# ANSWER
couples_data %>%
group_by(married, race) %>%
  summarise(
    n_obs = n(),                                                       # number of observations
    med_met_age = median(age_when_met),                                # median age when partners met
    med_income = median(income),                                       # median household income bracket
    perc_bach = round(100 * sum(education >= 12) / n()),               # % with at least a bachelor's degree
    perc_online = round(100 * sum(met_online == "yes") / n()),         # % who met online
    perc_sex = round(100 * sum(sex_frequency %in% c(1, 2, 3)) / n()),  # % having sex at least once a week
    .groups = "drop"
  )
Now write a single chain of pipe operators that creates a copy of couples_data and adds the following new variables to it:

- Time together in years - how long the person has been together with their partner
- Median age and median time together of all people of the same gender as the given person
- Fraction of people who met through family among all people of the same marital status and the same race as the given person.
Here, you can give your new variables long and meaningful names.
# ANSWER
upd_couples_data <- couples_data %>%
  mutate(time_together = age - age_when_met) %>%                   # time together in years
  group_by(gender) %>%
  mutate(median_age_same_gender = median(age),
         median_time_together_same_gender = median(time_together)) %>%
  group_by(married, race) %>%
  mutate(
    fraction_met_family_same_marital_race = sum(met_through_family == "yes") / n()
  ) %>%
  ungroup()   # drop the grouping so the result is a plain tibble
upd_couples_data
For a dataset of your choice, find counts and fractions of groups given by one or two categorical variables.
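For instance, with titanic_train a sketch could be as short as this (Sex and Pclass play the role of the categorical variables):

# A sketch: counts and fractions for every combination of Sex and Pclass
titanic_train %>%
  count(Sex, Pclass, name = "n_obs") %>%
  mutate(fraction = n_obs / sum(n_obs))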
For a dataset of your choice, compute the following summary statistics:
- Mean, median, standard deviation, skewness (numeric variables)
- Number of unique values, number of missing values, fraction of the most prevalent value (character variables)
Note that you will need the following functions and constructions:
- The constructions across(where(is.character), ...) and across(where(is.numeric), ...) to apply functions to character and numeric variables.
- reframe() instead of summarise() for functions that return vectors rather than single numbers.
- n_distinct() - the number of unique values.
- table() - counts of unique values.
- Write your own function for the number of missing values and the fraction of the most prevalent value.
# First Method
# This is for character variables; a similar approach will work for numeric variables
char_summary <- function(x) {
# Given that x is a character vector,
# Computes a vector of three statistics:
# Number of unique values
# Number of missing entries
# Fraction of the most common entry
no_unique <- n_distinct(x)
no_missing <- sum(is.na(x))
tab_x <- table(x)
frac_most_common <- tab_x[which.max(tab_x)] / length(x)
c(no_unique, no_missing, frac_most_common)
}
couples_data %>%
  reframe(across(where(is.character), char_summary)) %>%
mutate(statistic = c("Number of Unique Values",
"Number of Missing Values",
"Fraction of Most Common Value")) %>%
relocate(statistic)
# Second Method
# This is for numeric variables; a similar approach will work for character variables
couples_data %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
group_by(variable) %>%
summarise(
mean = mean(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
    skew = skewness(value, na.rm = TRUE),   # skewness() comes from the e1071 package
.groups = "drop"
)
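For completeness, the same pivot-based approach can also be sketched for the character variables; this mirrors what char_summary() computes above:

# A sketch: the pivot_longer approach applied to the character variables
couples_data %>%
  select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    n_unique = n_distinct(value),
    n_missing = sum(is.na(value)),
    frac_most_common = max(table(value)) / n(),
    .groups = "drop"
  )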