These homework problem sets are designed to help you understand the material better. You should try the problems first and then look at the model answers. You can use Generative AI to help, for example with a prompt such as “Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain”. It is also a good idea to feed an error message together with your code to Generative AI and ask it to help fix the error. But it is pointless to just solve all the questions with ChatGPT, because you won’t learn anything.
Read the instructions and write your solutions to these questions in the space provided. Then check the model answers (the link is at the end of the notebook).
We will again use the Titanic dataset, which you have presumably already installed.
library(tidyverse)  # provides glimpse() and the pipe operator %>%
library(titanic)
Now we will create a copy of the dataset titanic_train so that we won’t damage the original dataset while playing with it. We will also use the glimpse function from tidyverse (it is somewhat like an improved version of str):
df_tit <- titanic_train
glimpse(df_tit)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
We will now practice simple data subsetting and summarising using pipe operator chains (%>%) from tidyverse.
The dataset df_tit is a copy of titanic_train from the titanic package. It should already be loaded.
In each task, your goal is to produce the required output using a single chain of tidyverse functions. You do not need to create any new variables. Note that in each part, the result is going to be a data frame. In parts (b) and (c), you will need to use the function summarise. You won’t need group_by in part (b), but you can do part (c) with or without group_by.
(a) Print the Name, Sex, Age, and survival status (Survived) of passengers 42, 73, and 496.
(b) Compute the fraction of passengers who survived. Your result should be a number between 0 and 1.
(c) Compute the difference between the average ticket fare of passengers who survived and those who did not. The result should be a single number.
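As a rough guide, the three tasks above could be approached as follows. This is only one possible sketch, not the official model answer. It assumes that "passengers 42, 73, and 496" refers to values of PassengerId, and the names part_a, part_b, part_c, survival_fraction, and fare_difference are made up for illustration.

```r
library(dplyr)
library(titanic)

df_tit <- titanic_train

# (a) Keep the three passengers (assumed to be identified by PassengerId),
#     then keep only the requested columns
part_a <- df_tit %>%
  filter(PassengerId %in% c(42, 73, 496)) %>%
  select(Name, Sex, Age, Survived)

# (b) Survived is coded 0/1, so its mean is the fraction of survivors
part_b <- df_tit %>%
  summarise(survival_fraction = mean(Survived))

# (c) Without group_by(): subset Fare inside summarise()
part_c <- df_tit %>%
  summarise(
    fare_difference = mean(Fare[Survived == 1]) - mean(Fare[Survived == 0])
  )

part_a
part_b
part_c
```

Part (c) could also start with group_by(Survived) and compute one mean fare per group before taking the difference; the version here simply avoids grouping.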
Write your own custom R functions:
(a) A function number_of_na whose input is a vector x and whose output is the number of missing entries in x:
number_of_na <- function(x) {
  # Model answer: count the NA entries of x
  sum(is.na(x))
}
# The following should be 177:
number_of_na(df_tit$Age)
## [1] 177
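(b) A function get_mode whose input is a vector x and whose output is the most frequent entry of x (it is called the mode in statistics):
get_mode <- function(x) {
  # Model answer: tabulate x and return the name of the most frequent entry
  x %>% table() %>% which.max() %>% names()
}
# The following should be "8.05" (a character string, because names() returns characters):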
get_mode(df_tit$Fare)
## [1] "8.05"
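The get_mode answer above relies on names(), so the mode comes back as a character string even for numeric input. If the original type matters, a variant along the following lines might help. It is a sketch only: get_mode_typed is a hypothetical name, and ties are broken by first appearance in x rather than by sorted order, so it can disagree with get_mode when several values are equally frequent.

```r
# Hypothetical type-preserving variant of get_mode (base R only)
get_mode_typed <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing entries
  ux <- unique(x)                         # candidate values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]   # value with the highest count
}

fares <- c(7.25, 8.05, 8.05, 13)
mode_fare <- get_mode_typed(fares)        # numeric, not character
mode_port <- get_mode_typed(c("S", "C", "S", NA))
```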
(c) A function conditional_max whose input is a vector x and whose output is the largest entry of x if x is numeric, or the count of the most frequent entry if x is not numeric:
conditional_max <- function(x) {
  # Model answer: is.numeric(x) is a single TRUE/FALSE, so a plain if/else suffices
  if (is.numeric(x)) max(x, na.rm = TRUE) else max(table(x))
}
# The following should be 80:
conditional_max(df_tit$Age)
# The following should be 577:
conditional_max(as.character(df_tit$Sex))
## [1] 80
## [1] 577
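A side note on conditional_max: is.numeric(x) is a single TRUE or FALSE, not one value per entry of x. With such a length-one condition, ifelse() returns a result of length one, which is harmless when both branches are single numbers such as max(...), but silently truncates longer results. The small illustration below uses made-up values.

```r
x <- c(3, 10, 7)

# ifelse() shapes its result like the condition, which has length one here,
# so only the first element of x survives
truncated <- ifelse(is.numeric(x), x, NA)

# A plain if/else returns the chosen branch unchanged
full <- if (is.numeric(x)) x else NA

truncated
full
```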
And now apply each of these functions to every column of the data frame df_tit. You don’t have to apply all three functions in one chain of pipe operators: you can first count the missing entries, then find the mode of every column, and then find the conditional maximum.
# ANSWER
df_tit %>%
  summarise(across(everything(), number_of_na))
df_tit %>%
  summarise(across(everything(), get_mode))
df_tit %>%
  summarise(across(everything(), conditional_max))
Solve all of the following tasks using a single chain of pipe operators. Note that you will need group_by() to compute the sex-specific medians for imputed_age, and it is a good idea to undo the grouping afterwards with ungroup().
(a) Add a new character variable has_survived to df_tit that equals “yes” if Survived == 1 and “no” otherwise.
(b) Add a new variable imputed_age that equals Age whenever Age is defined. If Age is NA, then it should be the median Age of passengers of the same sex, i.e., the median female age for female passengers or the median male age for male passengers.
(c) Add a new variable family_size that takes the value “single” if the passenger has no accompanying family members (i.e., SibSp + Parch == 0), “couple” if the passenger is travelling with one relative, “small” if the number of family members travelling with the passenger is 2 or 3, and “large” if the number of family members travelling with the passenger is 4 or greater.
Calculate survival rates for
(a) Passengers of different sexes (Sex variable)
(b) Passengers from different ports of embarkation (Embarked variable)
(c) Passengers of different combinations of sex and class (Sex and Pclass variables together)
(d) Passengers of different age groups (convert Age to Age_Group that equals “child” if age is 11 or below, “teenager” if age is up to 16, “young adult” if age is up to 23, and “adult” for the rest of the passengers)
The answer for each part should be a single chain of pipe operators.
# ANSWER
# Part (a)
df_tit %>%
  group_by(Sex) %>%
  summarise(survival_rate = mean(Survived))
# Part (b)
df_tit %>%
  group_by(Embarked) %>%
  summarise(survival_rate = mean(Survived))
# Part (c)
df_tit %>%
  group_by(Sex, Pclass) %>%
  summarise(survival_rate = mean(Survived))
# Part (d)
df_tit %>%
  mutate(Age_Group = cut(Age,
                         breaks = c(-Inf, 11, 16, 23, Inf),
                         labels = c("child", "teenager", "young adult", "adult"))) %>%
  group_by(Age_Group) %>%
  summarise(survival_rate = mean(Survived))
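For the earlier task of adding has_survived, imputed_age, and family_size in a single chain, one possible approach is sketched below. It is not the official model answer: the result name df_new is hypothetical, and if_else() and case_when() are just one of several ways to express these conditions in dplyr.

```r
library(dplyr)
library(titanic)

df_tit <- titanic_train

df_new <- df_tit %>%
  # "yes"/"no" label derived from the 0/1 Survived column
  mutate(has_survived = if_else(Survived == 1, "yes", "no")) %>%
  # Group by Sex so that median() is computed separately for each sex
  group_by(Sex) %>%
  mutate(imputed_age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup() %>%
  # Conditions are checked in order, so each passenger gets the first match
  mutate(family_size = case_when(
    SibSp + Parch == 0 ~ "single",
    SibSp + Parch == 1 ~ "couple",
    SibSp + Parch <= 3 ~ "small",
    TRUE               ~ "large"
  ))

df_new %>%
  select(Sex, Age, has_survived, imputed_age, family_size) %>%
  head()
```

Calling ungroup() right after the imputation keeps the later mutate() from silently operating within the Sex groups.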