This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:
- tidyverse for data manipulation and plotting
- titanic for practice data

You only need to install these once. If you already did it during Homework 1, you're good to go.
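If any of them are missing, a one-time install from the console is enough. The call below is just a sketch; it also includes e1071, which this notebook loads later for skewness():

# Run once in the console if the packages are not installed yet
install.packages(c("tidyverse", "titanic", "e1071"))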
# Load required packages
library(tidyverse)
library(titanic)
library(e1071)  # provides skewness(), used at the end of this notebook
Note that this notebook loads the data from the folder “../Data”. If you have downloaded the data to a different folder, you will need to adjust the path.
Data: survey of American couples “How Couples Meet and Stay Together” 2017
https://data.stanford.edu/hcmst2017
We will work with a subset of the data (the whole dataset has 285 variables). Note that the link above is given for your reference. We will actually load the data from CSV. Below we load the data and print variable names:
couples_data <- read_csv("../Data/couples_data_processed.csv")
couples_data %>% names()
## [1] "married" "age_when_met" "met_online"
## [4] "met_vacation" "work_neighbors" "met_through_family"
## [7] "met_through_friend" "sex_frequency" "race"
## [10] "age" "education" "gender"
## [13] "income" "religious"
Most of the variables are self-explanatory, but we need to clarify four of them:

sex_frequency: How often do you have sex with your partner?

Value | Description |
---|---|
-1 | No data (did not answer) |
1 | Once a day or more |
2 | 3 to 6 times a week |
3 | Once or twice a week |
4 | 2 to 3 times a month |
5 | Once a month or less |
6 | Never |

religious: How often do you attend religious services?

Value | Description |
---|---|
-1 | No data (did not answer) |
1 | More than once a week |
2 | Once a week |
3 | Once or twice a month |
4 | A few times a year |
5 | Once a year or less |
6 | Never |

education: Education Level

Numeric | Label |
---|---|
1 | No formal education |
2 | 1st, 2nd, 3rd, or 4th grade |
3 | 5th or 6th grade |
4 | 7th or 8th grade |
5 | 9th grade |
6 | 10th grade |
7 | 11th grade |
8 | 12th grade NO DIPLOMA |
9 | HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED) |
10 | Some college, no degree |
11 | Associate degree |
12 | Bachelor's degree |
13 | Master's degree |
14 | Professional or Doctorate degree |

income: Household Income

Numeric | Label |
---|---|
1 | Less than $5,000 |
2 | $5,000 to $7,499 |
3 | $7,500 to $9,999 |
4 | $10,000 to $12,499 |
5 | $12,500 to $14,999 |
6 | $15,000 to $19,999 |
7 | $20,000 to $24,999 |
8 | $25,000 to $29,999 |
9 | $30,000 to $34,999 |
10 | $35,000 to $39,999 |
11 | $40,000 to $49,999 |
12 | $50,000 to $59,999 |
13 | $60,000 to $74,999 |
14 | $75,000 to $84,999 |
15 | $85,000 to $99,999 |
16 | $100,000 to $124,999 |
17 | $125,000 to $149,999 |
18 | $150,000 to $174,999 |
19 | $175,000 to $199,999 |
20 | $200,000 to $249,999 |
21 | $250,000 or more |
Classify each of the variables in couples_data as numeric, categorical, or ordinal. Which numeric variables are measured on a ratio scale and which on an interval scale? Briefly justify your reasoning.
ANSWER: Variables whose R class is character are categorical because their values have no natural order - male/female, yes/no, or five different races. Here are the unique values of these variables:
# We will learn how to do this in later classes, now just look at the output
couples_data %>%
select(where(is.character)) %>%
map(unique)
## $married
## [1] "yes" "no"
##
## $met_online
## [1] "no" "yes"
##
## $met_vacation
## [1] "no" "yes"
##
## $work_neighbors
## [1] "no" "yes"
##
## $met_through_family
## [1] "no" "yes"
##
## $met_through_friend
## [1] "no" "yes"
##
## $race
## [1] "White, Non-Hispanic" "Hispanic" "Black, Non-Hispanic"
## [4] "Other, Non-Hispanic" "2+ Races, Non-Hispanic"
##
## $gender
## [1] "Female" "Male"
Variables age and age_when_met are numeric variables measured on the ratio scale since there is a meaningful zero and someone who is 40 years old is twice as old as someone who is 20 years old. There are no numeric variables measured on the interval scale in this dataset (a classic interval-scale example would be temperature in degrees Celsius, where zero is arbitrary).
Variables sex_frequency, religious, education, and income are ordinal because there is a meaningful order but no meaningful scale. For instance, the difference between income == 20 and income == 19 can be anywhere between 1 and roughly 75,000 USD, while the difference between income == 3 and income == 2 is between 1 and roughly 5,000 USD. This is not even a logarithmic scale.
Write a single chain of pipe operators that:

- creates a copy of the dataset couples_data,
- subsets it by dropping observations with sex_frequency == -1 or religious == -1 (these are people who didn't respond to those questions),
- drops all variables whose name starts with "met",
- creates a new variable time_together that equals the difference between age and age_when_met,
- creates a new variable married_dummy that equals 1 whenever married == "yes" and 0 whenever married == "no",
- reorders the variables so that columns whose name starts with "age" go first, followed by time_together, followed by columns whose name starts with "married".
# ANSWER
clean_couples_data <- couples_data %>%
  filter(sex_frequency > -1 & religious > -1) %>%      # keep only respondents (non-response is coded -1)
  select(-starts_with("met")) %>%                      # drop all "met_*" variables
  mutate(time_together = age - age_when_met,
         married_dummy = 0 + (married == "yes")) %>%   # TRUE/FALSE coerced to 1/0
  relocate(starts_with("age"), time_together, starts_with("married"))
head(clean_couples_data)
Do a similar manual cleaning and transformation for a dataset of your choice.
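As an illustration (not part of the assignment), a similar pipeline for titanic_train from the titanic package might look like the sketch below; the column names Survived, Age, Ticket, and Cabin come from that package:

# A sketch only: analogous cleaning and transformation for titanic_train
titanic_train %>%
  as_tibble() %>%
  filter(!is.na(Age)) %>%                        # drop passengers with missing age
  select(-Ticket, -Cabin) %>%                    # drop identifiers we will not use
  mutate(survived_dummy = 0 + (Survived == 1),   # 0/1 dummy, as with married_dummy above
         child = Age < 18) %>%                   # indicator for minors
  relocate(Age, child, starts_with("Survived"))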
Write a single chain of pipe operators that computes the following statistics for every combination of marital status and race:

- the number of observations,
- the median age at which partners met,
- the median household income,
- the percentage of people with at least a bachelor's degree,
- the percentage of people who met their partner online,
- the percentage of people who have sex at least once a week.
You should give these new variables names that are meaningful but short, so that the whole summary table fits in the RStudio window.
# ANSWER
couples_data %>%
group_by(married, race) %>%
  summarise(
    n_obs = n(),                                                       # number of observations
    med_met_age = median(age_when_met),                                # median age when partners met
    med_income = median(income),                                       # median household income bracket
    perc_bach = round(100 * sum(education >= 12) / n()),               # % with at least a bachelor's degree
    perc_online = round(100 * sum(met_online == "yes") / n()),         # % who met online
    perc_sex = round(100 * sum(sex_frequency %in% c(1, 2, 3)) / n()),  # % having sex at least once a week
    .groups = "drop"
  )
Now write a single chain of pipe operators that creates a copy of couples_data and adds the following new variables to it:

- Time together in years - how long the person has been together with their partner
- Median age and median time together of all people of the same gender as the given person
- Fraction of people who met through family among all people of the same marital status and the same race as the given person.
Here, you can give your new variables long and meaningful names.
# ANSWER
upd_couples_data <- couples_data %>%
  mutate(time_together = age - age_when_met) %>%                   # time together in years
  group_by(gender) %>%
  mutate(median_age_same_gender = median(age),
         median_time_together_same_gender = median(time_together)) %>%
  group_by(married, race) %>%
  mutate(
    fraction_met_family_same_marital_race = sum(met_through_family == "yes") / n()
  ) %>%
  ungroup()   # drop the grouping so the result is a plain tibble
upd_couples_data
For a dataset of your choice, find counts and fractions of groups given by one or two categorical variables.
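For instance, with titanic_train a sketch could be as short as this (Sex and Pclass play the role of the categorical variables):

# A sketch: counts and fractions for every combination of Sex and Pclass
titanic_train %>%
  count(Sex, Pclass, name = "n_obs") %>%
  mutate(fraction = n_obs / sum(n_obs))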
For a dataset of your choice, compute the following summary statistics:
- Mean, median, standard deviation, skewness (numeric variables)
- Number of unique values, number of missing values, fraction of the most prevalent value (character variables)
Note that you will need the following functions and constructions:
- The constructions across(where(is.character), ...) and across(where(is.numeric), ...) to apply functions to character and numeric variables.
- reframe() instead of summarise() for functions that return vectors rather than single numbers.
- n_distinct() - the number of unique values.
- table() - counts of unique values.
- Write your own function for the number of missing values and the fraction of the most prevalent value.
# First Method
# This is for character variables; a similar approach will work for numeric variables
char_summary <- function(x) {
# Given that x is a character vector,
# Computes a vector of three statistics:
# Number of unique values
# Number of missing entries
# Fraction of the most common entry
no_unique <- n_distinct(x)
no_missing <- sum(is.na(x))
tab_x <- table(x)
frac_most_common <- tab_x[which.max(tab_x)] / length(x)
c(no_unique, no_missing, frac_most_common)
}
couples_data %>%
  reframe(across(where(is.character), char_summary)) %>%
mutate(statistic = c("Number of Unique Values",
"Number of Missing Values",
"Fraction of Most Common Value")) %>%
relocate(statistic)
# Second Method
# This is for numeric variables; a similar approach will work for character variables
couples_data %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
group_by(variable) %>%
summarise(
mean = mean(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
    skew = skewness(value, na.rm = TRUE),   # skewness() comes from the e1071 package
.groups = "drop"
)
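For completeness, the same pivot-based approach can also be sketched for the character variables; this mirrors what char_summary() computes above:

# A sketch: the pivot_longer approach applied to the character variables
couples_data %>%
  select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    n_unique = n_distinct(value),
    n_missing = sum(is.na(value)),
    frac_most_common = max(table(value)) / n(),
    .groups = "drop"
  )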