Info

Objective

These homework problem sets are designed to help you understand material better. You should try doing these problems first and then look at model answers. You can use Generative AI as to help, such as prompt “Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain”. It is also a good idea to feed an error message together with your code to Generative AI and ask it to help with fixing errors. But it is pointless to just solve all questions with ChatGPT because you won’t be learning anything.

Your task

Read instructions and write your solutions to these questions into the space provided. Then check the model answers (the link is in the end of the notebook).

Functions

Here we will practice on defining functions in R. Here is a function whose input is a numeric vector of ages and whose output is a vector of character labels corresponding to ages:

numeric_to_age <- function(x) {
  # First: we create a vector with undefined entries of the same length as x
  result <- rep(NA, length(x)) 
  
  # Now we will enter "child" at positions of result 
  # corresponding to values between 0 and 11
  result[x <= 11] <- "child"
  
  # Now we will enter "teenager" at positions of result 
  # corresponding to values between 11 (not including) and 16
  result[x > 11 & x <= 16] <- "teenager"
  
  # Now we will enter "young adult" at positions of result 
  # corresponding to values between 16 (not including) and 23
  result[x > 16 & x <= 23] <- "young adult"
  
  # Now we will enter "adult" at positions of result 
  # corresponding to values more than 23
  result[x > 23] <- "adult"
  
  # And now we will return the result
  result

}

numeric_to_age(df_titanic$Age[1:10])
##  [1] "young adult" "adult"       "adult"       "adult"       "adult"      
##  [6] NA            "adult"       "child"       "adult"       "teenager"

Exercise 1

Rewrite the function numeric_to_age so that entries of the result where the input is NA are “unknown”.

# ANSWER

numeric_to_age <- function(x) {
  # First: we create a vector with undefined entries of the same length as x
  result <- rep("unknown", length(x)) 
  
  # Now we will enter "child" at positions of result 
  # corresponding to values between 0 and 11
  result[x <= 11] <- "child"
  
  # Now we will enter "teenager" at positions of result 
  # corresponding to values between 11 (not including) and 16
  result[x > 11 & x <= 16] <- "teenager"
  
  # Now we will enter "young adult" at positions of result 
  # corresponding to values between 16 (not including) and 23
  result[x > 16 & x <= 23] <- "young adult"
  
  # Now we will enter "adult" at positions of result 
  # corresponding to values more than 23
  result[x > 23] <- "adult"
  
  # And now we will return the result
  result
}

numeric_to_age(df_titanic$Age[1:10])
table(numeric_to_age(df_titanic$Age))
##  [1] "young adult" "adult"       "adult"       "adult"       "adult"      
##  [6] "unknown"     "adult"       "child"       "adult"       "teenager"   
## 
##       adult       child    teenager     unknown young adult 
##         468          68          32         177         146

Exercise 2

Now re-write the function numeric_to_age so that breaks defining character age labels are optional arguments.

# ANSWER

numeric_to_age <- function(x, 
                           child_ends = 11,
                           teen_ends = 16,
                           young_adult_ends = 23) {
  # First: we create a vector with undefined entries of the same length as x
  result <- rep("unknown", length(x)) 
  result[x <= child_ends] <- "child"
  result[x > child_ends & x <= teen_ends] <- "teenager"
  result[x > teen_ends & x <= young_adult_ends] <- "young adult"
  result[x > young_adult_ends] <- "adult"
  
  # And now we will return the result
  result
}

numeric_to_age(df_titanic$Age[1:10], child_ends = 12)
table(numeric_to_age(df_titanic$Age, 10, 20, 30))
##  [1] "young adult" "adult"       "adult"       "adult"       "adult"      
##  [6] "unknown"     "adult"       "child"       "adult"       "teenager"   
## 
##       adult       child    teenager     unknown young adult 
##         305          64         115         177         230

Built-in function

Note that we are doing a bit of reinventing the wheel. There is a built-in function for this, cut. Here is the right way to solve Exercise 2:

# The "right" way with a built-in function

numeric_to_age <- function(x, 
                           child_ends = 11,
                           teen_ends = 16,
                           young_adult_ends = 23) {
  result <- cut(
    x,
    breaks = c(-Inf, child_ends, teen_ends, young_adult_ends, Inf),
    labels = c("child", "teenager", "young adult", "adult")
  )
  result <- as.character(result)
  result[is.na(x)] <- "unknown"
  result
}

numeric_to_age(df_titanic$Age[1:10], child_ends = 10, teen_ends = 15, young_adult_ends = 30)
table(numeric_to_age(df_titanic$Age))
##  [1] "young adult" "adult"       "young adult" "adult"       "adult"      
##  [6] "unknown"     "adult"       "child"       "young adult" "teenager"   
## 
##       adult       child    teenager     unknown young adult 
##         468          68          32         177         146

Run ?cut in RStudio command line and read the manual.

Exercise 3

Write an R function whose input is a numeric vector x and whose output is the list of positions at which x has a minimum entry. Your function should ignore missing entries.

## ANSWER
which_all_min <- function(x){
  which(x == min(x, na.rm = TRUE))
}

### This should be 804:
which_all_min(df_titanic$Age)

### This should be 180 264 272 278 303 414 467 482 598 634 675 733 807 816 823
which_all_min(df_titanic$Fare)
## [1] 804
##  [1] 180 264 272 278 303 414 467 482 598 634 675 733 807 816 823

Pipe operator

The pipe operator %>% is helpful to make codes more readable. For example,

table(ifelse(df_titanic$Survived, "Yes", "No"))
## 
##  No Yes 
## 549 342

can be re-written with %>% as

df_titanic$Survived %>% 
  ifelse("Yes", "No") %>%
  table 
## .
##  No Yes 
## 549 342

Exercise 4

Rewrite the following expressions with the pipe operator

## Part (a)
mean(log(df_titanic$Age), na.rm = TRUE)

## Part (b)
summary(replace_na(df_titanic$Age, median(df_titanic$Age, na.rm = TRUE)))

## Part (c)
sum(df_titanic$Fare == min(df_titanic$Fare))
## [1] 3.195807
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00 
## [1] 15
# ANSWERS

# Part (a)
df_titanic$Age %>%
  log() %>%
  mean(na.rm = TRUE)

# Part (b)
df_titanic$Age %>% 
  replace_na(median(df_titanic$Age, na.rm = TRUE)) %>%
  summary()

# Part (c)
df_titanic$Fare %>%
  which_all_min() %>%
  length
## [1] 3.195807
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00 
## [1] 15

Reading from Google Drive

Here we read some kcal data from Google Drive

url <- "https://docs.google.com/spreadsheets/d/18Uy6pEHhXmfgVEbZ4MTtO5mgJb1ts-EvyBN5RU7HZ5E/edit?usp=sharing"
library(googlesheets4)
gs4_deauth()
kcal_data <- read_sheet(url)
head(kcal_data)

Note that we need to transform character variables to numeric. Here is the right way to do it (we will go deeper into mutate, across, and starts_with in the next class):

kcal_data <- kcal_data %>%
  mutate(across(-starts_with("Food"), parse_number))

head(kcal_data)

Plotting

Exercise 5

  1. Plot the histogram of Cals_per100grams. What seems to be wrong about the data?
### Answer
ggplot(data = kcal_data, aes(x = Cals_per100grams)) + geom_histogram()

ANSWER The maximum number of calories per 100 grams is 900 (for pure oils and fats). The histogram has some outliers that can’t be true data and need to be fixed.

  1. What is the scatterplot of Cals_per100grams vs KJ_per100grams going to look like? Think about it before producing the plot, then produce the plot and explain what you see and why.
### Answer
ggplot(data = kcal_data, 
       aes(y = Cals_per100grams, x = KJ_per100grams)) + 
  geom_point()

ANSWER We see a straight line because KJ is a linear function of kcal: \[ \mbox{KJ} = 0.239006\times \mbox{kcal} \] c) Produce a box plot of Cals_per100grams vs FoodCategory. Use Google / ChatGPT / R manuals to modify it so that labels do not overlap (you can turn them 45 or 90 degrees).

ggplot(data = kcal_data, 
       aes(y = Cals_per100grams, x = FoodCategory)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Optional: use ChatGPT to help you to arrange this plot in the order of decreasing median of Cals_per100grams. Note that it’ll use functions that we haven’t learned yet - we will cover them in the next class.

kcal_data %>%
  mutate(FoodCategory = fct_reorder(FoodCategory, Cals_per100grams, .fun = median, .desc = TRUE)) %>%
  ggplot(aes(x = FoodCategory, y = Cals_per100grams)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Exercise 6

Write a function with two inputs:

  • Compulsory character variable color_code, a string indicating a colour.

  • Optional numeric variable number_of_bins with default value of 20.

The output should be a histogram of Cals_per100grams filled with that color, black outline, given number of bins, and whith logarithmic scale on the \(x\)-axis. You will need the function scale_x_log10()

### ANSWER
coloured_kcal_histogram <- function(color_code, number_of_bins = 20) {
  kcal_data %>%
    ggplot(aes(x = Cals_per100grams)) + geom_histogram(fill = color_code, 
                                           color = "black", 
                                           bins = number_of_bins) +
    scale_x_log10()
}

coloured_kcal_histogram("burlywood", 10)