Info

Objective

These homework problem sets are designed to help you understand the material better. Try the problems yourself first and only then look at the model answers. You can use Generative AI to help, for example with a prompt such as “Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain”. It is also a good idea to give Generative AI an error message together with your code and ask it to help fix the error. But it is pointless to just solve all the questions with ChatGPT, because you won’t learn anything.

Your task

Read the instructions and write your solutions to these questions in the space provided. Then check the model answers (the link is at the end of the notebook).

Questions

We will again use the Titanic dataset, which you have presumably already installed.

library(titanic)
# tidyverse provides glimpse() and the %>% pipe used below
library(tidyverse)

Now we will create a copy of the dataset titanic_train so that we won’t damage the original dataset while playing with it. We will also use the glimpse function from tidyverse (it is somewhat like an improved version of str):

df_tit <- titanic_train
glimpse(df_tit)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Question 1

We will now practice simple data subsetting and summarising using pipe operator chains (%>%) from tidyverse. The dataset df_tit is a copy of titanic_train from the titanic package. It should already be loaded.

In each task, your goal is to produce the required output using a single chain of tidyverse functions. You do not need to create any new variables. Note that in each part, the result is going to be a data frame. In parts (b) and (c), you will need to use the function summarise. You won’t need group_by in part (b), but you can do part (c) with or without group_by.

  (a) Print the Name, Sex, Age, and survival status (Survived) of passengers 42, 73, and 496.

  (b) Compute the fraction of passengers who survived. Your result should be a number between 0 and 1.

  (c) Compute the difference between the average ticket fare of passengers who survived and those who did not. The result should be a single number.
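
The verbs these tasks rely on can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: `toy` and its columns `id`, `flag`, and `value` are hypothetical and are not part of the Titanic data.

```r
library(dplyr)

# Hypothetical toy data (not the Titanic dataset)
toy <- tibble(
  id    = 1:6,
  flag  = c(1L, 0L, 1L, 1L, 0L, 0L),
  value = c(10, 20, 30, 40, 50, 60)
)

# Keep the rows with given ids and only some of the columns
picked <- toy %>%
  filter(id %in% c(2, 5)) %>%
  select(id, value)

# The mean of a 0/1 column is the fraction of ones
frac <- toy %>%
  summarise(fraction = mean(flag))

# Difference of two subgroup means without group_by
diff_plain <- toy %>%
  summarise(difference = mean(value[flag == 1]) - mean(value[flag == 0]))

# The same difference with group_by followed by a second summarise
diff_grouped <- toy %>%
  group_by(flag) %>%
  summarise(avg = mean(value)) %>%
  summarise(difference = avg[flag == 1] - avg[flag == 0])
```

Each result is a data frame, which matches the note above that every part should return a data frame rather than a bare number.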

Question 2

Write your own custom R functions:

  • A function whose input is any vector x and whose output is the number of missing entries in x
number_of_na <- function(x) {
  # is.na() flags the missing entries; sum() counts the TRUE values
  sum(is.na(x))
}

# The following should be 177:
number_of_na(df_tit$Age)
## [1] 177
  • A function whose input is any vector x and whose output is the most frequent entry of x (it is called the mode in statistics):
get_mode <- function(x) {
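  # table() counts each distinct value and which.max() picks the most frequent one;
  # names() returns that value as a character string
  x %>% table() %>% which.max() %>% names()
}

# The following should be "8.05" (a character string, because names() returns text):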
get_mode(df_tit$Fare)
## [1] "8.05"
  • A function whose input is any vector x and whose output is the largest entry of x if x is numeric or the count of the most frequent entry if x is not numeric
conditional_max <- function(x) {
  # Numeric input: largest value, ignoring NAs; otherwise: count of the most frequent entry
  if (is.numeric(x)) max(x, na.rm = TRUE) else max(table(x))
}

# The following should be 80:
conditional_max(df_tit$Age)

# The following should be 577:
conditional_max(as.character(df_tit$Sex))
## [1] 80
## [1] 577

Now apply each of these functions to every column of the data frame df_tit. You don’t have to use all three functions in one chain of pipe operators: you can first count the missing entries, then find the mode of every column, and then find the conditional maximum.

# ANSWER

df_tit %>%
  summarise(across(everything(), number_of_na))

df_tit %>%
  summarise(across(everything(), get_mode))


df_tit %>%
  summarise(across(everything(), conditional_max))
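
To see what `summarise(across(...))` returns, here is a minimal sketch on a made-up data frame. The names `toy` and `count_na` are hypothetical. The last two lines also show why a mode found through `table()` comes back as text.

```r
library(dplyr)

# Hypothetical toy data with one numeric and one character column
toy <- tibble(
  x = c(1, NA, 3, 3),
  y = c("a", "b", "b", NA)
)

count_na <- function(v) sum(is.na(v))

# One row, with one result per input column
na_counts <- toy %>%
  summarise(across(everything(), count_na))

# where() restricts across() to the columns that satisfy a predicate
num_max <- toy %>%
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE)))

# table() drops NAs and stores the values as character names,
# so the mode is a string until it is converted back
mode_chr <- toy$x %>% table() %>% which.max() %>% names()
mode_num <- as.numeric(mode_chr)
```

Converting with `as.numeric()` is only sensible when the input vector was numeric to begin with.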

Question 3

Solve all of the following tasks using a single chain of pipe operators. Note that you will need group_by() for part (b), and it is a good idea to undo the grouping afterwards with ungroup().

  (a) Add a new character variable has_survived to df_tit that equals “yes” if Survived == 1 and “no” otherwise.

  (b) Add a new variable imputed_age that equals Age whenever Age is defined. If Age is NA, then it should be the median Age of passengers of the same sex, i.e., median female age for female passengers or median male age for male passengers.

  (c) Add a new variable family_size that takes the value “single” if the passenger has no accompanying family members (i.e., SibSp + Parch == 0), “couple” if the passenger is travelling with one relative, “small” if the number of family members travelling with the passenger is 2 or 3, and “large” if the number of family members travelling with the passenger is 4 or greater.
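
The three patterns involved here are a two-way recode, a group-wise fill of missing values, and a multi-way recode. They can be sketched in one chain on made-up data. Everything in this block (`toy`, `team`, `score`, `n`, and the new column names) is hypothetical, and this is one possible approach rather than the only one.

```r
library(dplyr)

# Hypothetical toy data (not the Titanic dataset)
toy <- tibble(
  team  = c("red", "red", "red", "blue", "blue"),
  score = c(10, NA, 30, 5, NA),
  n     = c(0, 1, 3, 4, 2)
)

result <- toy %>%
  # Two-way recode with ifelse()
  mutate(has_any = ifelse(n > 0, "yes", "no")) %>%
  # Inside a grouped mutate, median() is computed separately per team
  group_by(team) %>%
  mutate(score_filled = ifelse(is.na(score),
                               median(score, na.rm = TRUE),
                               score)) %>%
  ungroup() %>%
  # case_when() checks the conditions in order and uses the first match
  mutate(size = case_when(
    n == 0 ~ "none",
    n == 1 ~ "one",
    n <= 3 ~ "few",
    TRUE   ~ "many"
  ))
```

Calling `ungroup()` right after the grouped step keeps later verbs in the chain from silently operating per group.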

Question 4

Calculate survival rates for

  (a) Passengers of different sexes (Sex variable)

  (b) Passengers from different ports of embarkation (Embarked variable)

  (c) Passengers of different combinations of sex and class (Sex and Pclass variables together)

  (d) Passengers of different age groups (convert Age to Age_Group that equals “child” if age is 11 or below, “teenager” if age is above 11 and up to 16, “young adult” if age is above 16 and up to 23, and “adult” for the rest of the passengers)

The answer for each part should be a single chain of pipe operators.

# ANSWER
# Part (a)
df_tit %>%
  group_by(Sex) %>%
  summarise(survival_rate = mean(Survived))

# Part (b)
df_tit %>%
  group_by(Embarked) %>%
  summarise(survival_rate = mean(Survived))

# Part (c)
df_tit %>%
  group_by(Sex, Pclass) %>%
  summarise(survival_rate = mean(Survived))

# Part (d)
df_tit %>%
  mutate(Age_Group = cut(Age,
                         breaks = c(-Inf, 11, 16, 23, Inf),
                         labels = c("child", "teenager", "young adult", "adult"))) %>%
  group_by(Age_Group) %>%
  summarise(survival_rate = mean(Survived))
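
To check where the boundary ages land, the same breaks can be tried on a few made-up values. This is a small sketch: the `ages` vector is hypothetical. By default `cut()` builds right-closed intervals, so 11 falls into “child” and 16 into “teenager”, while a missing age stays NA.

```r
# Hypothetical ages chosen to sit on and around the break points
ages <- c(5, 11, 12, 16, 17, 23, 24, NA)

groups <- cut(ages,
              breaks = c(-Inf, 11, 16, 23, Inf),
              labels = c("child", "teenager", "young adult", "adult"))

# Show each age next to the group it was assigned to
data.frame(age = ages, group = as.character(groups))
```

Because missing ages stay NA, grouping by a column built this way produces a separate NA group for passengers whose age is unknown.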

Model answers:

https://rpubs.com/fduzhin/mh3511_hw_3