These homework problem sets are designed to help you understand material better. You should try doing these problems first and then look at model answers. You can use Generative AI as to help, such as prompt “Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain”. It is also a good idea to feed an error message together with your code to Generative AI and ask it to help with fixing errors. But it is pointless to just solve all questions with ChatGPT because you won’t be learning anything.
Read instructions and write your solutions to these questions into the space provided. Then check the model answers (the link is in the end of the notebook).
Here we will practice on defining functions in R. Here is a function whose input is a numeric vector of ages and whose output is a vector of character labels corresponding to ages:
numeric_to_age <- function(x) {
# First: we create a vector with undefined entries of the same length as x
result <- rep(NA, length(x))
# Now we will enter "child" at positions of result
# corresponding to values between 0 and 11
result[x <= 11] <- "child"
# Now we will enter "teenager" at positions of result
# corresponding to values between 11 (not including) and 16
result[x > 11 & x <= 16] <- "teenager"
# Now we will enter "young adult" at positions of result
# corresponding to values between 16 (not including) and 23
result[x > 16 & x <= 23] <- "young adult"
# Now we will enter "adult" at positions of result
# corresponding to values more than 23
result[x > 23] <- "adult"
# And now we will return the result
result
}
numeric_to_age(df_titanic$Age[1:10])
## [1] "young adult" "adult" "adult" "adult" "adult"
## [6] NA "adult" "child" "adult" "teenager"
Rewrite the function numeric_to_age
so that entries of
the result where the input is NA
are “unknown”.
# ANSWER
numeric_to_age <- function(x) {
# First: we create a vector with undefined entries of the same length as x
result <- rep("unknown", length(x))
# Now we will enter "child" at positions of result
# corresponding to values between 0 and 11
result[x <= 11] <- "child"
# Now we will enter "teenager" at positions of result
# corresponding to values between 11 (not including) and 16
result[x > 11 & x <= 16] <- "teenager"
# Now we will enter "young adult" at positions of result
# corresponding to values between 16 (not including) and 23
result[x > 16 & x <= 23] <- "young adult"
# Now we will enter "adult" at positions of result
# corresponding to values more than 23
result[x > 23] <- "adult"
# And now we will return the result
result
}
numeric_to_age(df_titanic$Age[1:10])
table(numeric_to_age(df_titanic$Age))
## [1] "young adult" "adult" "adult" "adult" "adult"
## [6] "unknown" "adult" "child" "adult" "teenager"
##
## adult child teenager unknown young adult
## 468 68 32 177 146
Now re-write the function numeric_to_age
so that breaks
defining character age labels are optional arguments.
# ANSWER
numeric_to_age <- function(x,
child_ends = 11,
teen_ends = 16,
young_adult_ends = 23) {
# First: we create a vector with undefined entries of the same length as x
result <- rep("unknown", length(x))
result[x <= child_ends] <- "child"
result[x > child_ends & x <= teen_ends] <- "teenager"
result[x > teen_ends & x <= young_adult_ends] <- "young adult"
result[x > young_adult_ends] <- "adult"
# And now we will return the result
result
}
numeric_to_age(df_titanic$Age[1:10], child_ends = 12)
table(numeric_to_age(df_titanic$Age, 10, 20, 30))
## [1] "young adult" "adult" "adult" "adult" "adult"
## [6] "unknown" "adult" "child" "adult" "teenager"
##
## adult child teenager unknown young adult
## 305 64 115 177 230
Note that we are doing a bit of reinventing the wheel. There is a
built-in function for this, cut
. Here is the right way to
solve Exercise 2:
# The "right" way with a built-in function
numeric_to_age <- function(x,
child_ends = 11,
teen_ends = 16,
young_adult_ends = 23) {
result <- cut(
x,
breaks = c(-Inf, child_ends, teen_ends, young_adult_ends, Inf),
labels = c("child", "teenager", "young adult", "adult")
)
result <- as.character(result)
result[is.na(x)] <- "unknown"
result
}
numeric_to_age(df_titanic$Age[1:10], child_ends = 10, teen_ends = 15, young_adult_ends = 30)
table(numeric_to_age(df_titanic$Age))
## [1] "young adult" "adult" "young adult" "adult" "adult"
## [6] "unknown" "adult" "child" "young adult" "teenager"
##
## adult child teenager unknown young adult
## 468 68 32 177 146
Run ?cut
in RStudio command line and read the
manual.
Write an R function whose input is a numeric vector x
and whose output is the list of positions at which x
has a
minimum entry. Your function should ignore missing entries.
## ANSWER
which_all_min <- function(x){
which(x == min(x, na.rm = TRUE))
}
### This should be 804:
which_all_min(df_titanic$Age)
### This should be 180 264 272 278 303 414 467 482 598 634 675 733 807 816 823
which_all_min(df_titanic$Fare)
## [1] 804
## [1] 180 264 272 278 303 414 467 482 598 634 675 733 807 816 823
The pipe operator %>%
is helpful to make codes more
readable. For example,
table(ifelse(df_titanic$Survived, "Yes", "No"))
##
## No Yes
## 549 342
can be re-written with %>%
as
df_titanic$Survived %>%
ifelse("Yes", "No") %>%
table
## .
## No Yes
## 549 342
Rewrite the following expressions with the pipe operator
## Part (a)
mean(log(df_titanic$Age), na.rm = TRUE)
## Part (b)
summary(replace_na(df_titanic$Age, median(df_titanic$Age, na.rm = TRUE)))
## Part (c)
sum(df_titanic$Fare == min(df_titanic$Fare))
## [1] 3.195807
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
## [1] 15
# ANSWERS
# Part (a)
df_titanic$Age %>%
log() %>%
mean(na.rm = TRUE)
# Part (b)
df_titanic$Age %>%
replace_na(median(df_titanic$Age, na.rm = TRUE)) %>%
summary()
# Part (c)
df_titanic$Fare %>%
which_all_min() %>%
length
## [1] 3.195807
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
## [1] 15
Here we read some kcal data from Google Drive
url <- "https://docs.google.com/spreadsheets/d/18Uy6pEHhXmfgVEbZ4MTtO5mgJb1ts-EvyBN5RU7HZ5E/edit?usp=sharing"
library(googlesheets4)
gs4_deauth()
kcal_data <- read_sheet(url)
head(kcal_data)
Note that we need to transform character variables to numeric. Here
is the right way to do it (we will go deeper into mutate
,
across
, and starts_with
in the next
class):
kcal_data <- kcal_data %>%
mutate(across(-starts_with("Food"), parse_number))
head(kcal_data)
Cals_per100grams
. What seems to
be wrong about the data?### Answer
ggplot(data = kcal_data, aes(x = Cals_per100grams)) + geom_histogram()
ANSWER The maximum number of calories per 100 grams is 900 (for pure oils and fats). The histogram has some outliers that can’t be true data and need to be fixed.
Cals_per100grams
vs
KJ_per100grams
going to look like? Think about it before
producing the plot, then produce the plot and explain what you see and
why.### Answer
ggplot(data = kcal_data,
aes(y = Cals_per100grams, x = KJ_per100grams)) +
geom_point()
ANSWER We see a straight line because KJ is a linear
function of kcal: \[
\mbox{KJ} = 0.239006\times \mbox{kcal}
\] c) Produce a box plot of Cals_per100grams
vs
FoodCategory
. Use Google / ChatGPT / R manuals to modify it
so that labels do not overlap (you can turn them 45 or 90 degrees).
ggplot(data = kcal_data,
aes(y = Cals_per100grams, x = FoodCategory)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Optional: use ChatGPT to help you to arrange this plot in
the order of decreasing median of Cals_per100grams
. Note
that it’ll use functions that we haven’t learned yet - we will cover
them in the next class.
kcal_data %>%
mutate(FoodCategory = fct_reorder(FoodCategory, Cals_per100grams, .fun = median, .desc = TRUE)) %>%
ggplot(aes(x = FoodCategory, y = Cals_per100grams)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Write a function with two inputs:
Compulsory character variable color_code
, a string
indicating a colour.
Optional numeric variable number_of_bins
with
default value of 20.
The output should be a histogram of Cals_per100grams
filled with that color, black outline, given number of bins, and whith
logarithmic scale on the \(x\)-axis.
You will need the function scale_x_log10()
### ANSWER
coloured_kcal_histogram <- function(color_code, number_of_bins = 20) {
kcal_data %>%
ggplot(aes(x = Cals_per100grams)) + geom_histogram(fill = color_code,
color = "black",
bins = number_of_bins) +
scale_x_log10()
}
coloured_kcal_histogram("burlywood", 10)