Info

Objective

This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.

The goal is to learn R by doing, not just by listening.

Your Task

  • Attempt each question yourself before checking the answer or asking for help.
  • Use lecture notes and the example code provided.
  • Update your answers after Fedor’s explanations.
  • Feel free to work with your neighbor if you get stuck — but make sure you understand the final answer!

Initial Setup

Before we begin, make sure you’ve installed the required packages:

  • tidyverse for data manipulation and plotting
  • titanic for practice data

You only need to install these once. If you already did it during Homework 1, you’re good to go.
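If you are not sure whether the packages are already installed, a quick check along these lines can help. This is a minimal sketch: the names `required`, `is_installed` and `not_installed` are just for illustration, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Packages this notebook relies on
required <- c("tidyverse", "titanic")

# requireNamespace() returns TRUE if a package can be loaded (without attaching it)
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!is_installed]

if (length(not_installed) > 0) {
  cat("Still to install:", paste(not_installed, collapse = ", "), "\n")
  # Uncomment the next line to install whatever is missing (needs internet access)
  # install.packages(not_installed)
} else {
  cat("All required packages are installed.\n")
}
```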

# Load required packages
library(tidyverse)
library(titanic)

Reading data into R

Part of your learning experience is a mini-project on data analysis in which you apply the skills learned in the course. You will need to find a publicly available dataset, load it into R, generate tables of summary statistics, visualise the data, run statistical analyses (regression, correlation, \(t\)-test, \(U\)-test, ANOVA, etc.), and present your findings in a report (no longer than 4 pages) and a presentation (no more than 4 slides).

Now you have a chance to browse through publicly available datasets. Here are some sources for you:

  1. Kaggle: platform for machine learning competitions and datasets

https://www.kaggle.com/

  2. Links to datasets on Wikipedia:

https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

  3. Government datasets, such as

https://opendata.cityofnewyork.us/

  4. Research publications that often share their data, such as

https://www.nature.com/sdata/

Now you have some time to explore these resources. Read about various datasets, download those that you like, save them into a separate folder (you can name it “Data”), load them into R, and try to compute summary statistics such as means, medians, and standard deviations for some of the variables in your datasets.

Here is an example for your reference:

couples_data <- read_csv("../Data/couples_data_processed.csv")

cat("Here is a glimpse of couples data:\n")
glimpse(couples_data)

cat("\n\n")
cat("Median age of people who met online is",
    median(couples_data$age[couples_data$met_online == "yes"]), "\n")

cat("Median age of people who met offline is",
    median(couples_data$age[couples_data$met_online == "no"]), "\n")
## Here is a glimpse of couples data:
## Rows: 2,796
## Columns: 14
## $ married            <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "no…
## $ age_when_met       <dbl> 21, 36, 23, 25, 23, 15, 15, 29, 24, 25, 31, 27, 24,…
## $ met_online         <chr> "no", "yes", "yes", "no", "no", "no", "no", "no", "…
## $ met_vacation       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ work_neighbors     <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ met_through_family <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "n…
## $ met_through_friend <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "n…
## $ sex_frequency      <dbl> 5, 4, 7, 2, 5, 5, 4, 7, 4, 7, 7, 3, 1, 2, 3, -1, 3,…
## $ race               <chr> "White, Non-Hispanic", "White, Non-Hispanic", "Whit…
## $ age                <dbl> 55, 47, 28, 59, 59, 66, 65, 65, 33, 25, 39, 37, 34,…
## $ education          <dbl> 13, 13, 8, 12, 9, 9, 11, 9, 12, 12, 9, 14, 11, 9, 1…
## $ gender             <chr> "Female", "Male", "Female", "Female", "Male", "Fema…
## $ income             <dbl> 18, 20, 11, 19, 14, 12, 13, 6, 21, 19, 14, 20, 16, …
## $ religious          <dbl> 6, 3, 6, 5, 2, 2, 6, 1, 6, 6, 5, 6, 4, 6, 4, 2, 5, …
## 
## 
## Median age of people who met online is 39 
## Median age of people who met offline is 54
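Means and standard deviations work the same way as the medians above. The sketch below types in, by hand, the first eight values of `age` and `met_online` visible in the `glimpse()` output, so it runs without the CSV file; the numbers it produces describe only those eight rows, not the full dataset. The name `first_rows` is made up for this illustration.

```r
# First eight rows of two columns, copied by hand from the glimpse() output
first_rows <- data.frame(
  age        = c(55, 47, 28, 59, 59, 66, 65, 65),
  met_online = c("no", "yes", "yes", "no", "no", "no", "no", "no")
)

# Overall summary statistics for one numeric column
mean(first_rows$age)
median(first_rows$age)
sd(first_rows$age)

# The same statistics within each group: tapply(values, groups, function)
mean_age_by_group <- tapply(first_rows$age, first_rows$met_online, mean)
sd_age_by_group   <- tapply(first_rows$age, first_rows$met_online, sd)
mean_age_by_group
sd_age_by_group
```

With your own data, replace `first_rows` with the data frame returned by `read_csv()`. If a column contains missing values, add `na.rm = TRUE` to `mean()`, `median()` and `sd()`.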

Here is some code for you (we will cover these functions in more depth in the next class):

## Count:
count(couples_data, sex_frequency)
## Count by two variables:
count(couples_data, met_online, met_through_friend) 

We’ll explain the pipe operator (%>%) in more detail later today, but you can already start experimenting with it.

## Some Summary statistics: 
couples_data %>%
  group_by(religious) %>%
  summarise(
    number_of_records = n(),
    median_age_when_met = median(age_when_met),
    fraction_of_married = sum(married == "yes") / n(),
    fraction_of_online = sum(met_online == "yes") / n()) 
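As a preview of the pipe: `x %>% f(y)` is just another way of writing `f(x, y)`. The sketch below checks this on a tiny hand-typed data frame (the first eight values of `married` and `met_online` from the `glimpse()` output; `first_rows`, `nested` and `piped` are illustrative names), assuming the dplyr package from the tidyverse is installed.

```r
library(dplyr)

# First eight rows of two columns, copied by hand from the glimpse() output
first_rows <- data.frame(
  married    = c("yes", "yes", "no", "yes", "yes", "yes", "yes", "no"),
  met_online = c("no", "yes", "yes", "no", "no", "no", "no", "no")
)

# Ordinary function call: the data frame is the first argument
nested <- count(first_rows, met_online)

# Piped call: the left-hand side is passed in as the first argument
piped <- first_rows %>% count(met_online)

# Both ways of writing the call give the same table
identical(nested, piped)

# Pipes pay off when several steps are chained and read top to bottom.
# mean() of a TRUE/FALSE vector is the fraction of TRUE values,
# i.e. the same quantity as sum(married == "yes") / n()
first_rows %>%
  group_by(met_online) %>%
  summarise(fraction_of_married = mean(married == "yes"))
```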

Time Management Tip

You’ll have around 30–60 minutes to explore datasets and try things out. Aim to:

  • Spend 10–15 minutes browsing for interesting datasets.
  • Use the remaining time to download them, load them into R, and calculate basic statistics.

Questions to Guide You

As you explore your dataset, consider:

  • What is each row? What are the columns?
  • Which variables are numeric or categorical?
  • What questions can you ask using this data?
  • What would a summary table or visualization look like?

Activity

Now it’s your turn:

## LOAD YOUR DATA AND PLAY WITH IT HERE

Optional

If you have time, try loading some data from Google Drive into R and play with it:

## PLAY WITH DATA FROM GOOGLE DRIVE
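One possible approach, sketched under assumptions: for a CSV file shared on Google Drive as "Anyone with the link", a direct-download address can usually be built from the file ID (the long string in the sharing link). The helper `drive_csv_url()` and the ID `"YOUR_FILE_ID"` are hypothetical placeholders, and the download step is commented out because it only works with a real, publicly shared file.

```r
# Hypothetical helper: turn a Google Drive file ID into a direct-download URL
drive_csv_url <- function(file_id) {
  paste0("https://drive.google.com/uc?export=download&id=", file_id)
}

# Placeholder ID: replace it with the ID from your own sharing link
my_url <- drive_csv_url("YOUR_FILE_ID")
my_url

# With a real ID, the file can then be read like any other CSV:
# library(readr)
# my_data <- read_csv(my_url)
```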

If you want to plot a histogram (we will learn it later today):

## Histogram of income (codes 1 to 21, one bar per code):
ggplot(couples_data, aes(x = income)) + geom_histogram(binwidth = 1)

Plotting

Data

We will work with couples_data here.

Data: survey of American couples “How Couples Meet and Stay Together” 2017

https://data.stanford.edu/hcmst2017

We will work with a subset of the data (the whole dataset has 285 variables). We have the following variables:

couples_data %>% names()
##  [1] "married"            "age_when_met"       "met_online"        
##  [4] "met_vacation"       "work_neighbors"     "met_through_family"
##  [7] "met_through_friend" "sex_frequency"      "race"              
## [10] "age"                "education"          "gender"            
## [13] "income"             "religious"

Most of them are self-explanatory, but we need to clarify four of them:

  • sex_frequency - How often do you have sex with your partner?
    Value  Description
    -1     No data (did not answer)
    1      Once a day or more
    2      3 to 6 times a week
    3      Once or twice a week
    4      2 to 3 times a month
    5      Once a month or less
    6      Never
  • religious - How often do you attend religious services?
    Value  Description
    -1     No data (did not answer)
    1      More than once a week
    2      Once a week
    3      Once or twice a month
    4      A few times a year
    5      Once a year or less
    6      Never
  • education - Education Level
    Value  Label
    1      No formal education
    2      1st, 2nd, 3rd, or 4th grade
    3      5th or 6th grade
    4      7th or 8th grade
    5      9th grade
    6      10th grade
    7      11th grade
    8      12th grade NO DIPLOMA
    9      HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED)
    10     Some college, no degree
    11     Associate degree
    12     Bachelors degree
    13     Masters degree
    14     Professional or Doctorate degree
  • income - Household Income
    Value  Label
    1      Less than $5,000
    2      $5,000 to $7,499
    3      $7,500 to $9,999
    4      $10,000 to $12,499
    5      $12,500 to $14,999
    6      $15,000 to $19,999
    7      $20,000 to $24,999
    8      $25,000 to $29,999
    9      $30,000 to $34,999
    10     $35,000 to $39,999
    11     $40,000 to $49,999
    12     $50,000 to $59,999
    13     $60,000 to $74,999
    14     $75,000 to $84,999
    15     $85,000 to $99,999
    16     $100,000 to $124,999
    17     $125,000 to $149,999
    18     $150,000 to $174,999
    19     $175,000 to $199,999
    20     $200,000 to $249,999
    21     $250,000 or more
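Because -1 means "did not answer", it should not be treated as a number when computing statistics. Here is a minimal sketch of one way to handle this in base R, using the first 17 values of `sex_frequency` shown in the `glimpse()` output (typed in by hand, so the results describe only those values; `sex_frequency_clean` is an illustrative name).

```r
# First 17 values of sex_frequency, copied by hand from the glimpse() output
sex_frequency <- c(5, 4, 7, 2, 5, 5, 4, 7, 4, 7, 7, 3, 1, 2, 3, -1, 3)

# Replace the "no data" code with NA so that R knows the value is missing
sex_frequency_clean <- ifelse(sex_frequency == -1, NA, sex_frequency)

# How many answers are missing?
sum(is.na(sex_frequency_clean))

# The raw mean is pulled down by the -1 code; na.rm = TRUE skips missing values
mean(sex_frequency)
mean(sex_frequency_clean, na.rm = TRUE)
```

The same recoding applies to `religious`, which uses -1 in the same way.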

Activity

Create and interpret the following plots.

  1. Bar chart of religious, and bar chart of religious colored by marital status, i.e., the married variable. Try to figure out how to label the \(x\)-axis correctly.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(couples_data, aes(x = religious, fill = married)) + geom_bar() 

We see that the more religious couples are, the more likely they are to be married.
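One way to get readable labels on the \(x\)-axis is to turn the numeric codes into a factor before plotting. The sketch below does this for the first eight rows visible in the `glimpse()` output (typed in by hand, so the picture will not match the full data). The names `first_rows`, `religious_labels` and `religious_label` are illustrative, and `position = "fill"` is an optional extra that shows shares instead of counts. It assumes ggplot2 from the tidyverse is installed.

```r
library(ggplot2)

# First eight rows of two columns, copied by hand from the glimpse() output
first_rows <- data.frame(
  married   = c("yes", "yes", "no", "yes", "yes", "yes", "yes", "no"),
  religious = c(6, 3, 6, 5, 2, 2, 6, 1)
)

# Labels for codes 1 to 6, taken from the table for the religious variable
religious_labels <- c("More than once a week", "Once a week",
                      "Once or twice a month", "A few times a year",
                      "Once a year or less", "Never")

# factor() maps each numeric code to its text label and keeps the order 1..6
first_rows$religious_label <- factor(first_rows$religious,
                                     levels = 1:6,
                                     labels = religious_labels)

p <- ggplot(first_rows, aes(x = religious_label, fill = married)) +
  geom_bar(position = "fill") +      # each bar is scaled to 1, so fill shows shares
  scale_x_discrete(drop = FALSE) +   # keep categories that have no rows in this sample
  labs(x = "Attendance at religious services", y = "Share of couples") +
  coord_flip()                       # horizontal bars leave room for long labels
p
```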

  2. The histogram of time together, i.e., age - age_when_met. Change the colour of the histogram by playing with the arguments fill and color. Google “r color names” to pick your favourite colour.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = age - age_when_met)) + 
  geom_histogram(binwidth = 10, fill = "burlywood", color = "black")

The data cover a wide range of relationship lengths, from 0 to 80 years together, with a median of about 20 years.
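Numbers read off a histogram can be checked directly. Here is a minimal sketch, using the first 13 values of `age` and `age_when_met` from the `glimpse()` output (typed in by hand, so the results describe only those rows). On the full data, replace the two vectors with `couples_data$age` and `couples_data$age_when_met`.

```r
# First 13 values of each column, copied by hand from the glimpse() output
age          <- c(55, 47, 28, 59, 59, 66, 65, 65, 33, 25, 39, 37, 34)
age_when_met <- c(21, 36, 23, 25, 23, 15, 15, 29, 24, 25, 31, 27, 24)

# Years together = current age minus age when the couple met
years_together <- age - age_when_met

range(years_together)    # shortest and longest time together
median(years_together)   # middle value
summary(years_together)  # minimum, quartiles, mean and maximum in one call
```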

  3. Boxplot of time together (as defined in the previous plot), grouped by race.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = race, y = age - age_when_met)) + 
  geom_boxplot(fill = "burlywood", color = "black")

We see that White couples who participated in the study have, on average, spent more time together than Black or Hispanic couples.

  4. Smoothed histograms (density plots) of age_when_met, colored by whether the couple met online (met_online). Make the density plots transparent for easier comparison.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(fill = met_online, x = age_when_met)) + 
  geom_density(alpha = 0.5)

In this study, people who met offline were younger when they met than people who met online. A likely explanation is that couples who have been together for a long time are more likely to have met at a younger age, and offline rather than online.

  5. Scatterplot of age_when_met vs age:
  • Color points by marital status (married)
  • Size points by sex frequency (sex_frequency) so that larger points mean more frequent sex
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = age_when_met, y = age, 
                                color = married, size = -sex_frequency)) + 
  geom_point()

Note that there are more married couples far from the line \(y=x\) in this plot, i.e., people usually wait some time before getting married. We can also check whether points get smaller far from the line \(y=x\), i.e., whether couples tend to have less sex over time, but this is hard to see because the points overlap. We need to adjust the size.
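Here is a possible way to make the points easier to read, sketched on the first eight rows from the `glimpse()` output (typed in by hand; `first_rows` and `answered` are illustrative names). It drops the rows coded -1 so that "no data" is not drawn as a size, makes the points semi-transparent, narrows the range of point sizes, and draws the line \(y = x\) for reference. It assumes ggplot2 from the tidyverse is installed.

```r
library(ggplot2)

# First eight rows of four columns, copied by hand from the glimpse() output
first_rows <- data.frame(
  married       = c("yes", "yes", "no", "yes", "yes", "yes", "yes", "no"),
  age_when_met  = c(21, 36, 23, 25, 23, 15, 15, 29),
  age           = c(55, 47, 28, 59, 59, 66, 65, 65),
  sex_frequency = c(5, 4, 7, 2, 5, 5, 4, 7)
)

# Keep only couples who answered the question (-1 means "no data").
# None of these eight rows is coded -1, but the full data contains such rows.
answered <- subset(first_rows, sex_frequency != -1)

p <- ggplot(answered, aes(x = age_when_met, y = age,
                          color = married, size = -sex_frequency)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # the line y = x
  geom_point(alpha = 0.4) +         # transparency reveals overlapping points
  scale_size(range = c(0.5, 4)) +   # narrower range of point sizes
  labs(size = "-sex_frequency\n(larger = more frequent)")
p
```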

  6. Come up with your own insight that you can get from this data and support it with a plot.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = married, y = -sex_frequency)) + 
  geom_boxplot()

This plot shows that married couples tend to have sex more frequently than unmarried couples, at least according to the data collected for this study.

Model answers: