This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
Before we begin, make sure you’ve installed the required packages:
- tidyverse for data manipulation and plotting
- titanic for practice data
You only need to install these once. If you already did it during Homework 1, you’re good to go.
# Load required packages
library(tidyverse)
library(titanic)
Part of your learning experience is a mini-project on data analysis where you apply the skills learned in the course. You will need to find a publicly available dataset, load it into R, generate tables of summary statistics, visualise the data, run statistical analyses (regression, correlation, \(t\)-test, \(U\)-test, ANOVA, etc.), and present your findings through a report (not longer than 4 pages) and a presentation (not more than 4 slides).
Now you have a chance to browse through publicly available datasets. Here are some sources for you:
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
https://opendata.cityofnewyork.us/
Now you have some time to explore these resources. Read about various datasets, download those that you like, save them into a separate folder that you can name “Data”, load them into R, and try to compute summary statistics such as means, medians, and standard deviations for subsets of your datasets.
Here is an example for your reference:
couples_data <- read_csv("../Data/couples_data_processed.csv")
cat("Here is a glimpse of couples data:\n")
glimpse(couples_data)
cat("\n\n")
cat("Median age of people who met online is",
median(couples_data$age[couples_data$met_online == "yes"]), "\n")
cat("Median age of people who met offline is",
median(couples_data$age[couples_data$met_online == "no"]), "\n")
## Here is a glimpse of couples data:
## Rows: 2,796
## Columns: 14
## $ married <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "no…
## $ age_when_met <dbl> 21, 36, 23, 25, 23, 15, 15, 29, 24, 25, 31, 27, 24,…
## $ met_online <chr> "no", "yes", "yes", "no", "no", "no", "no", "no", "…
## $ met_vacation <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ work_neighbors <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ met_through_family <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "n…
## $ met_through_friend <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "n…
## $ sex_frequency <dbl> 5, 4, 7, 2, 5, 5, 4, 7, 4, 7, 7, 3, 1, 2, 3, -1, 3,…
## $ race <chr> "White, Non-Hispanic", "White, Non-Hispanic", "Whit…
## $ age <dbl> 55, 47, 28, 59, 59, 66, 65, 65, 33, 25, 39, 37, 34,…
## $ education <dbl> 13, 13, 8, 12, 9, 9, 11, 9, 12, 12, 9, 14, 11, 9, 1…
## $ gender <chr> "Female", "Male", "Female", "Female", "Male", "Fema…
## $ income <dbl> 18, 20, 11, 19, 14, 12, 13, 6, 21, 19, 14, 20, 16, …
## $ religious <dbl> 6, 3, 6, 5, 2, 2, 6, 1, 6, 6, 5, 6, 4, 6, 4, 2, 5, …
##
##
## Median age of people who met online is 39
## Median age of people who met offline is 54
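The same subsetting pattern works for means and standard deviations. Below is a minimal sketch on a small made-up data frame (`toy` and its values are invented for illustration and are not taken from the survey). `na.rm = TRUE` is included because datasets you download will often contain missing values.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  age        = c(25, 31, 38, 47, 52, NA),
  met_online = c("yes", "yes", "yes", "no", "no", "no")
)

# Subset the age column by how the couple met
online_age  <- toy$age[toy$met_online == "yes"]
offline_age <- toy$age[toy$met_online == "no"]

cat("Mean age (online):", mean(online_age, na.rm = TRUE), "\n")
cat("Median age (online):", median(online_age, na.rm = TRUE), "\n")
cat("SD of age (online):", sd(online_age, na.rm = TRUE), "\n")

# Without na.rm = TRUE a single NA makes the result NA
cat("Median age (offline), NA kept:", median(offline_age), "\n")
cat("Median age (offline), NA removed:", median(offline_age, na.rm = TRUE), "\n")
```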
Here is some code for you (we will cover it in more depth in the next class):
## Count:
count(couples_data, sex_frequency)
## Count by two variables:
count(couples_data, met_online, met_through_friend)
We’ll explain the pipe operator (%>%) in more detail later today, but you can already start experimenting with it.
## Some Summary statistics:
couples_data %>%
group_by(religious) %>%
summarise(
number_of_records = n(),
median_age_when_met = median(age_when_met),
fraction_of_married = sum(married == "yes") / n(),
fraction_of_online = sum(met_online == "yes") / n())
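The pipe in the block above passes the result on its left as the first argument of the function on its right, so a piped chain is equivalent to nested function calls. Here is a minimal sketch on an invented tibble (`toy` is hypothetical) comparing the two forms. It also uses `mean(married == "yes")`, which gives the same fraction as `sum(married == "yes") / n()`.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  married    = c("yes", "no", "yes", "yes"),
  met_online = c("no", "yes", "yes", "no")
)

# Nested calls: read from the inside out
nested <- summarise(
  group_by(toy, met_online),
  fraction_of_married = mean(married == "yes")
)

# Piped calls: read from top to bottom
piped <- toy %>%
  group_by(met_online) %>%
  summarise(fraction_of_married = mean(married == "yes"))

print(piped)
```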
You’ll have around 30–60 minutes to explore datasets and try things out. Aim to:
- Spend 10–15 minutes browsing for interesting datasets,
- Use the remaining time to download, load into R, and calculate basic statistics.
As you explore your dataset, consider:
- What is each row? What are the columns?
- Which variables are numeric or categorical?
- What questions can you ask using this data?
- What would a summary table or visualization look like?
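A few base R calls can help answer the first two questions. This is only a sketch on an invented data frame (`toy` is hypothetical); with your own data you would pass the object you loaded instead.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  age     = c(55, 47, 28),
  income  = c(18, 20, 11),
  married = c("yes", "yes", "no")
)

dim(toy)   # number of rows and columns
str(toy)   # type and first values of each column

# Split column names by storage type
is_numeric <- sapply(toy, is.numeric)
names(toy)[is_numeric]    # columns stored as numbers
names(toy)[!is_numeric]   # columns stored as text
```

Keep in mind that a column stored as numbers can still be categorical: a numeric code may stand for a bracket or a label rather than a quantity.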
Now it is your turn:
## LOAD YOUR DATA AND PLAY WITH IT HERE
If you have time, try loading some data from Google Drive to R and play with it:
## PLAY WITH DATA FROM GOOGLE DRIVE
If you want to plot a histogram (we will learn it later today):
## Histogram of income:
ggplot(couples_data, aes(x = income)) + geom_histogram()
We will work with couples_data here.
Data: survey of American couples “How Couples Meet and Stay Together” 2017
https://data.stanford.edu/hcmst2017
We will work with a subset of the data (the whole dataset has 285 variables). We have the following variables:
couples_data %>% names()
## [1] "married" "age_when_met" "met_online"
## [4] "met_vacation" "work_neighbors" "met_through_family"
## [7] "met_through_friend" "sex_frequency" "race"
## [10] "age" "education" "gender"
## [13] "income" "religious"
Most of them are self-explanatory, but we need to clarify four of them:
sex_frequency: How often do you have sex with your partner?
Value | Description |
---|---|
-1 | No data (did not answer) |
1 | Once a day or more |
2 | 3 to 6 times a week |
3 | Once or twice a week |
4 | 2 to 3 times a month |
5 | Once a month or less |
6 | Never |
religious - How often do you attend religious services?
Value | Description |
---|---|
-1 | No data (did not answer) |
1 | More than once a week |
2 | Once a week |
3 | Once or twice a month |
4 | A few times a year |
5 | Once a year or less |
6 | Never |
education - Education Level
Numeric | Label |
---|---|
1 | No formal education |
2 | 1st, 2nd, 3rd, or 4th grade |
3 | 5th or 6th grade |
4 | 7th or 8th grade |
5 | 9th grade |
6 | 10th grade |
7 | 11th grade |
8 | 12th grade NO DIPLOMA |
9 | HIGH SCHOOL GRADUATE – high school DIPLOMA or the equivalent (GED) |
10 | Some college, no degree |
11 | Associate degree |
12 | Bachelors degree |
13 | Masters degree |
14 | Professional or Doctorate degree |
income - Household Income
Numeric | Label |
---|---|
1 | Less than $5,000 |
2 | $5,000 to $7,499 |
3 | $7,500 to $9,999 |
4 | $10,000 to $12,499 |
5 | $12,500 to $14,999 |
6 | $15,000 to $19,999 |
7 | $20,000 to $24,999 |
8 | $25,000 to $29,999 |
9 | $30,000 to $34,999 |
10 | $35,000 to $39,999 |
11 | $40,000 to $49,999 |
12 | $50,000 to $59,999 |
13 | $60,000 to $74,999 |
14 | $75,000 to $84,999 |
15 | $85,000 to $99,999 |
16 | $100,000 to $124,999 |
17 | $125,000 to $149,999 |
18 | $150,000 to $174,999 |
19 | $175,000 to $199,999 |
20 | $200,000 to $249,999 |
21 | $250,000 or more |
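Because -1 stands for "no data" in sex_frequency and religious, leaving it in place would distort numeric summaries such as medians. One possible way to handle this, sketched here on invented values (`toy` and `toy_clean` are hypothetical names), is to turn -1 into NA first using `na_if()` from dplyr.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  sex_frequency = c(5, 4, -1, 2),
  religious     = c(6, -1, 3, 5)
)

# Replace the "no data" code with a proper missing value
toy_clean <- toy %>%
  mutate(
    sex_frequency = na_if(sex_frequency, -1),
    religious     = na_if(religious, -1)
  )

median(toy$sex_frequency)                       # treats -1 as a real answer
median(toy_clean$sex_frequency, na.rm = TRUE)   # ignores missing answers
```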
Create and interpret the following plots.
- Bar chart of religious, and bar chart of religious colored by marital status, i.e., the married variable. Try to figure out how to label the \(x\)-axis right.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(couples_data, aes(x = religious, fill = married)) + geom_bar()
We see that the more religious couples are, the more likely they are to be married.
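One possible way to label the \(x\)-axis, sketched on invented data: convert the numeric codes to a factor whose labels come from the religious table above, then name the axis with `labs()`. The data frame `toy`, the column `religious_label`, and the axis titles are hypothetical choices for this sketch.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  religious = c(1, 2, 2, 4, 6, 6, 6),
  married   = c("yes", "yes", "yes", "no", "no", "yes", "no")
)

# Labels for codes 1 to 6, in the order of the table above
attendance_labels <- c(
  "More than once a week", "Once a week", "Once or twice a month",
  "A few times a year", "Once a year or less", "Never"
)

# Codes outside 1-6 (such as -1, "no data") become NA
toy$religious_label <- factor(toy$religious, levels = 1:6,
                              labels = attendance_labels)

p <- ggplot(toy, aes(x = religious_label, fill = married)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +  # keep categories with no observations
  labs(x = "Attendance of religious services", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```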
- Histogram of age - age_when_met (the number of years together). Change the colour of the histogram by playing with the arguments fill and color. Google “r color names” to pick your favourite colour.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = age - age_when_met)) +
geom_histogram(binwidth = 10, fill = "burlywood", color = "black")
The data covers a wide range of relationship lengths, from 0 to about 80 years together, with a median of about 20 years.
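The range and the median can also be computed directly instead of being read off the histogram. A minimal sketch on invented ages (`toy` is hypothetical; with the real data you would use the columns of couples_data):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  age          = c(30, 42, 58, 65, 71),
  age_when_met = c(25, 30, 28, 22, 40)
)

# Years together = current age minus age when the couple met
years_together <- toy$age - toy$age_when_met

range(years_together)    # shortest and longest relationship
median(years_together)   # typical relationship length
```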
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = race, y = age - age_when_met)) +
geom_boxplot(fill = "burlywood", color = "black")
We see that White couples who participated in the study have, on average, spent more time together than Black or Hispanic couples.
- Density plot of age_when_met, colored by whether the couple met online (met_online). Make the density plots transparent for easier comparison.
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(fill = met_online, x = age_when_met)) +
geom_density(alpha = 0.5)
In this study, people who met offline were younger when they met than people who met online. A likely explanation is that couples who have been together for a long time are more likely to have met at a younger age and offline rather than online.
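One way to probe this explanation is to compare the number of years together for the two groups. The sketch below uses an invented tibble (`toy` and `by_channel` are hypothetical names), so its numbers say nothing about the real survey; it only shows the shape of the calculation.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  met_online   = c("no", "no", "no", "yes", "yes"),
  age          = c(66, 59, 55, 28, 47),
  age_when_met = c(15, 25, 21, 23, 36)
)

by_channel <- toy %>%
  mutate(years_together = age - age_when_met) %>%
  group_by(met_online) %>%
  summarise(
    median_age_when_met   = median(age_when_met),
    median_years_together = median(years_together)
  )

print(by_channel)
```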
- Scatter plot of age_when_met vs age:
  - colored by marital status (married)
  - sized by frequency of sex (sex_frequency) so that larger points mean more frequent sex
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = age_when_met, y = age,
color = married, size = -sex_frequency)) +
geom_point()
Note that there are more married couples far from the line \(y=x\) in this plot, i.e., it shows that people usually wait some time to get married. We can also try to see if points get smaller when we look far from the line \(y=x\), i.e., if it is true that couples tend to have less sex with time, but it is hard to see that because points overlap. We need to adjust the size.
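A possible adjustment, sketched on invented data: make the points semi-transparent, shrink the size range, and draw the line \(y=x\) for reference. The sketch also drops the "no data" code, because with size = -sex_frequency a value of -1 maps to the largest size. The data frame `toy` and the specific alpha and range values are assumptions to tune by eye.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  age_when_met  = c(21, 36, 23, 25, 15, 29),
  age           = c(55, 47, 28, 59, 66, 65),
  married       = c("yes", "yes", "no", "yes", "yes", "no"),
  sex_frequency = c(5, 4, 2, 2, 5, -1)
)

# Drop "no data" answers so that -1 is not drawn as the largest point
toy <- toy[toy$sex_frequency != -1, ]

p <- ggplot(toy, aes(x = age_when_met, y = age,
                     color = married, size = -sex_frequency)) +
  geom_point(alpha = 0.4) +                   # transparency reveals overlap
  scale_size_continuous(range = c(0.5, 3)) +  # smaller points overall
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")  # the line y = x

print(p)
```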
## WRITE YOUR CODE FOR PLOTTING HERE
ggplot(data = couples_data, aes(x = married, y = -sex_frequency)) +
geom_boxplot()
This plot shows that married couples tend to have sex more frequently than unmarried couples, at least according to the data collected for this study.