These homework problem sets are designed to help you understand the material better. You should try doing these problems first and then look at the model answers. You can use generative AI to help, for example with a prompt such as “How do I extract rows 4, 19, and 55 from a data frame?”, or use it to explain an error message to you. But it is pointless to just feed all the questions to ChatGPT because you won’t be learning anything.
Create an R script, name it “mh3511_hw_1_YOUR_NAME.R”, and save it to a folder where you keep all your homework. Type your solutions in that R script. Insert comments, such as # Question 1, etc.
You will need to solve a few questions. In each of these questions, you will need to write one line of R code to do a certain computation task. In particular, you shouldn’t use loops, if-else statements, function definitions, etc. (it is very good if you know how to use loops, but that knowledge won’t be needed here).
Let us first load a few numeric vectors into R. Run the following commands (it is better to copy them into your R script):
x <- -5:6
y <- rep(c(1.5, -0.6, 0), 4)
z <- x / y
cat("x = \n")
x
cat("y = \n")
y
cat("z =\n")
z
## x =
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5 6
## y =
## [1] 1.5 -0.6 0.0 1.5 -0.6 0.0 1.5 -0.6 0.0 1.5 -0.6 0.0
## z =
## [1] -3.3333333 6.6666667 -Inf -1.3333333 1.6666667 NaN
## [7] 0.6666667 -3.3333333 Inf 2.6666667 -8.3333333 Inf
Write a single R command that prints the number of infinite entries of a numeric vector. Now change it so that it prints the number of positive infinite entries of a numeric vector. Test both commands on the vector x/y (stored as z): you should get 3 and 2 respectively.
# ANSWER
# Number of infinite entries:
sum(is.infinite(z))
# Number of positive infinite entries (several solutions):
sum(is.infinite(z) & !is.na(z) & z > 0)
sum(z == Inf, na.rm = TRUE)
sum(z[is.infinite(z)] > 0)
## [1] 3
## [1] 2
## [1] 2
## [1] 2
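The solutions above rely on the fact that R distinguishes infinite values, NaN, and NA. As a minimal sketch (the vector `v` is a hypothetical example, not part of the homework data), the following shows which test function flags which kind of entry:

```r
# Hypothetical test vector containing every kind of special value
v <- c(-Inf, -1, 0, NA, NaN, Inf)

is.infinite(v)  # TRUE only for -Inf and Inf
is.nan(v)       # TRUE only for NaN
is.na(v)        # TRUE for both NA and NaN
is.finite(v)    # TRUE only for the ordinary numbers -1 and 0

# Counting entries of each kind by summing the logical vectors
n_infinite <- sum(is.infinite(v))
n_missing  <- sum(is.na(v))
n_finite   <- sum(is.finite(v))
```

Because `is.infinite(NaN)` and `is.infinite(NA)` are both FALSE, `sum(is.infinite(z))` needs no `na.rm` argument.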
Now we will load a few non-numeric vectors into R. Run the following commands:
s <- c("The", "mirror", "of", "73", "37", "is", "the", "12", "th", "prime", "number", ".")
w <- x < s
What should be the output of each of the following R commands? Try to predict it first, then run them in R, and then come up with an explanation.
class(s)
class(w)
class(x + w)
x + w
sum(!is.na(as.numeric(s)))
sum(is.na(as.character(z)))
## [1] "character"
## [1] "logical"
## [1] "integer"
## [1] -4 -3 -2 -1 0 1 2 2 4 5 6 6
## [1] 3
## [1] 0
# ANSWER
# s is a character vector - it was explicitly defined as a character vector
class(s)
# w is the result of a comparison, so it is logical: each entry is TRUE or FALSE.
# To compare x with s, R converts x to character and compares the strings.
class(w)
# When we add an integer vector and a logical vector, the logical vector is converted to integer:
# TRUE becomes 1, FALSE becomes 0, so the result is again integer
class(x + w)
x + w
# When we convert a character vector to numeric, every entry that is not a valid number becomes NA (with a warning)
sum(!is.na(as.numeric(s)))
# When we convert a numeric vector to character, everything becomes a string, even NaN becomes "NaN".
# The exception is NA - it becomes NA, not "NA".
sum(is.na(as.character(z)))
## [1] "character"
## [1] "logical"
## [1] "integer"
## [1] -4 -3 -2 -1 0 1 2 2 4 5 6 6
## [1] 3
## [1] 0
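The coercion rules used in this answer can be tried on smaller inputs. The sketch below uses hypothetical vectors `lgl` and `int` (not part of the homework) to show how the class of a result moves from logical to integer to numeric to character:

```r
# Hypothetical small vectors, used only for illustration
lgl <- c(TRUE, FALSE, TRUE)
int <- 1:3

class_lgl_sum <- class(lgl + lgl)    # logical is promoted to integer for arithmetic
class_int_dbl <- class(int + 0.5)    # integer plus double gives numeric
class_mixed   <- class(c(int, "a"))  # combining with a string gives character

# Comparing a number with a string converts the number to a string first,
# so this compares "2" with "12" as text, not 2 with 12 as numbers.
num_vs_chr <- 2 < "12"
```

This is why the eighth entry of `w` is FALSE even though 2 is less than 12 numerically.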
Write a single R command to calculate each of the following expressions, where \(n\) is any positive integer:
\(\displaystyle \sum_{k=1}^{n}\cos(k^2-k)\)
\(\displaystyle \frac{2}{1!}-\frac{2^3}{3!}+\frac{2^5}{5!}-\cdots + (-1)^{n-1}\frac{2^{2n-1}}{(2n-1)!}\)
The number of divisors of \(n\).
# ANSWER
# First, let us assign a value to n
n <- 10
# (a)
sum(cos((1:n)^2-(1:n)))
# (b)
sum(-(-1)^(1:n) * 2^(2 * (1:n) - 1) / factorial(2 * (1:n) - 1))
# (c)
sum(n %% (1:n) == 0)
## [1] 1.988122
## [1] 0.9092974
## [1] 4
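The value printed for (b) can be sanity-checked: the expression is a partial sum of the Taylor series of \(\sin x\) at \(x = 2\), so for moderately large \(n\) it should agree with `sin(2)`. A minimal sketch of that check (the names `k` and `partial_sum` are introduced here for illustration only):

```r
n <- 10
k <- 1:n

# Term k has sign (-1)^(k + 1): positive for k = 1, negative for k = 2, and so on
partial_sum <- sum((-1)^(k + 1) * 2^(2 * k - 1) / factorial(2 * k - 1))

# all.equal() compares up to a small numerical tolerance
agrees_with_sin <- isTRUE(all.equal(partial_sum, sin(2)))
agrees_with_sin
```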
Write a single R command to calculate
The largest negative entry of a numeric vector
The sum of squares of even entries of a numeric vector
The entry of a numeric vector that is closest to 42
You can test your command on vectors x, y, and z that have already been loaded into R.
# ANSWER
# (a)
max(z[z<0], na.rm = TRUE)
# (b)
sum(x[x %% 2 == 0]^2, na.rm = TRUE)
# (c)
z[which.min(abs(z - 42))]
## [1] -1.333333
## [1] 76
## [1] 6.666667
The command numeric(0) generates an empty numeric vector:
print(numeric(0))
class(numeric(0))
length(numeric(0))
## numeric(0)
## [1] "numeric"
## [1] 0
What do you think will be the output of each of the following R commands? Try to predict it first, then run them in R, and then come up with an explanation.
sum(numeric(0))
prod(numeric(0))
min(numeric(0))
max(numeric(0))
which.min(numeric(0))
# ANSWER
# The sum of elements of an empty set is 0 since 0 is the additive identity (x + 0 = x).
sum(numeric(0))
# The product of an empty set is 1 since 1 is the multiplicative identity (x * 1 = x).
prod(numeric(0))
# The minimum of an empty vector is Inf — so that min(x, Inf) = x.
min(numeric(0)) # Warning: returns Inf and a warning
# The maximum of an empty vector is -Inf — so that max(x, -Inf) = x.
max(numeric(0)) # Warning: returns -Inf and a warning
# which.min returns the index of the minimum — but there are no elements, so it returns integer(0)
which.min(numeric(0))
## [1] 0
## [1] 1
## [1] Inf
## [1] -Inf
## integer(0)
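One way to see why these conventions are sensible: appending an empty vector to another vector should not change its sum, product, minimum, or maximum. The sketch below checks this on a hypothetical vector `a` (not part of the homework data):

```r
a <- c(3, 1, 4)      # hypothetical example vector
e <- numeric(0)      # empty vector

# Appending an empty vector must not change any of these summaries.
sum_ok  <- sum(c(a, e))  == sum(a) + sum(e)      # works because sum(e) is 0
prod_ok <- prod(c(a, e)) == prod(a) * prod(e)    # works because prod(e) is 1

# min() and max() warn on empty input, so the warnings are silenced here.
min_ok <- suppressWarnings(min(c(a, e)) == min(min(a), min(e)))  # min(e) is Inf
max_ok <- suppressWarnings(max(c(a, e)) == max(max(a), max(e)))  # max(e) is -Inf

c(sum_ok, prod_ok, min_ok, max_ok)
```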
You will need to solve a few questions. In each of these questions, you will need to write one line of R code to do a certain computation task. In particular, you shouldn’t use loops, if-else statements, function definitions, etc. (it is very good if you know how to use loops, but that knowledge won’t be needed here).
Let us first load a few matrices into R. Run the following commands (it is better to copy them into your R script):
X <- matrix(rep(-1:1, 4), nrow = 3)
Y <- matrix(rep(-1:1, 4), nrow = 3, byrow = TRUE)
Z <- matrix(6:-1, nrow = 4)
X
Y
Z
dim(X)
dim(Y)
dim(Z)
## [,1] [,2] [,3] [,4]
## [1,] -1 -1 -1 -1
## [2,] 0 0 0 0
## [3,] 1 1 1 1
## [,1] [,2] [,3] [,4]
## [1,] -1 0 1 -1
## [2,] 0 1 -1 0
## [3,] 1 -1 0 1
## [,1] [,2]
## [1,] 6 2
## [2,] 5 1
## [3,] 4 0
## [4,] 3 -1
## [1] 3 4
## [1] 3 4
## [1] 4 2
Which of the following lines of code will produce an error message when run? First, think about it and try to predict what is going to happen. Then copy them into your R script, uncomment, run them one by one, and see whether what happens matches your prediction.
# X + Y
# 42 * X - Y / 73
# X * Y
# X %*% Y
# X + Z
# X %*% Z
# Y < X
# ANSWER
# These are OK — elementwise operations or scalar-matrix ops:
X + Y
42 * X - Y / 73
X * Y
# This fails: %*% is matrix multiplication, and X (3x4) × Y (3x4) is not defined.
# The inner dimensions (4 and 3) must match.
# X %*% Y
# This fails: X (3x4) and Z (4x2) have different shapes — can't add elementwise.
# X + Z
# This works: X (3x4) × Z (4x2) gives a 3x2 matrix.
X %*% Z
# This is OK: elementwise logical comparison, returns a logical matrix.
Y < X
## [,1] [,2] [,3] [,4]
## [1,] -2 -1 0 -2
## [2,] 0 1 -1 0
## [3,] 2 0 1 2
## [,1] [,2] [,3] [,4]
## [1,] -41.9863 -42.00000000 -42.01369863 -41.9863
## [2,] 0.0000 -0.01369863 0.01369863 0.0000
## [3,] 41.9863 42.01369863 42.00000000 41.9863
## [,1] [,2] [,3] [,4]
## [1,] 1 0 -1 1
## [2,] 0 0 0 0
## [3,] 1 -1 0 1
## [,1] [,2]
## [1,] -18 -2
## [2,] 0 0
## [3,] 18 2
## [,1] [,2] [,3] [,4]
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE TRUE FALSE
## [3,] FALSE TRUE TRUE FALSE
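The shape rules behind this answer can be checked in code before running an operation. This is a minimal sketch with hypothetical matrices `A` and `B` (not the homework's X, Y, Z): elementwise operations need identical dimensions, while `%*%` needs the number of columns of the left matrix to equal the number of rows of the right one.

```r
A <- matrix(1:12, nrow = 3)   # hypothetical 3 x 4 matrix
B <- matrix(1:8,  nrow = 4)   # hypothetical 4 x 2 matrix

# Elementwise operations (+, -, *, /, <) require identical shapes
can_add <- identical(dim(A), dim(B))

# Matrix multiplication requires matching inner dimensions
can_multiply <- ncol(A) == nrow(B)

# The product has as many rows as A and as many columns as B
product_dim <- dim(A %*% B)
```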
Write a single R command to calculate
The fraction of nonzero entries in a matrix.
The sum of entries that are bigger than the minimum and smaller than the maximum entry of a matrix. Do not include the min and max entries themselves.
The largest sum of squares of entries in any row of a matrix. For this, you will need the R functions t (transpose of a matrix) and diag (diagonal of a matrix).
You can test your command on matrices \(X\), \(Y\), and \(Z\) that we have already defined in R. For simplicity, you can assume that all entries of your matrix are defined, i.e., it doesn’t have NA entries.
# ANSWER
# (a)
sum(Z != 0) / prod(dim(Z))
length(Z[Z != 0]) / length(Z)
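# (b)
# The logical mask selects the entries strictly between the min and the max.
# Indexing X with the mask and summing gives their sum
# (summing the mask alone would only count these entries).
sum(X[X > min(X) & X < max(X)])
# (c)
max(diag(Y %*% t(Y)))
# Explanation: Y %*% t(Y) gives a matrix of pairwise dot products of the rows.
# The diagonal contains the squared norms of each row.
## [1] 0.875
## [1] 0.875
## [1] 0
## [1] 3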
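As a cross-check for (c), and not as the intended solution (the task asks for t and diag), the same quantity can be computed with the built-in function `rowSums`. The sketch below compares both approaches on a hypothetical matrix `M`:

```r
# Hypothetical 2 x 3 matrix with rows (1, 3, 5) and (2, 4, 6)
M <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

# Approach from the model answer: diagonal of M %*% t(M)
via_diag <- max(diag(M %*% t(M)))

# Alternative: square every entry, then add up each row
via_rowsums <- max(rowSums(M^2))

c(via_diag, via_rowsums)
```

The `rowSums` version avoids building the full matrix of pairwise dot products, which matters for matrices with many rows.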
Given a matrix, write a single R command that generates the following matrices of the same shape as the given matrix:
The zero matrix.
The matrix defined by \(a_{ij}=i\).
The matrix that is the same as the given one above the main diagonal and is zero on the main diagonal and below.
# ANSWER
# (a)
X - X
# or
matrix(0, nrow = nrow(X), ncol = ncol(X))
# (b)
# This is how it can be done with known tools:
matrix(rep(1:dim(X)[1], dim(X)[2]), nrow = dim(X)[1])
# But there is a built-in R function:
row(X)
# (c)
# This is how it can be done with known tools:
X * (matrix(rep(1:dim(X)[1], dim(X)[2]), nrow = dim(X)[1]) <
matrix(rep(1:dim(X)[2], dim(X)[1]), nrow = dim(X)[1], byrow = TRUE))
# Or using built-in R functions:
X * (row(X) < col(X))
# One more way with built-in R functions:
X * upper.tri(X)
## [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 2 2 2 2
## [3,] 3 3 3 3
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 2 2 2 2
## [3,] 3 3 3 3
## [,1] [,2] [,3] [,4]
## [1,] 0 -1 -1 -1
## [2,] 0 0 0 0
## [3,] 0 0 0 1
## [,1] [,2] [,3] [,4]
## [1,] 0 -1 -1 -1
## [2,] 0 0 0 0
## [3,] 0 0 0 1
## [,1] [,2] [,3] [,4]
## [1,] 0 -1 -1 -1
## [2,] 0 0 0 0
## [3,] 0 0 0 1
You can test your command on matrices \(X\) and \(Z\). You can assume that all entries of your matrix are defined, i.e., it doesn’t have NA entries.
Some R packages contain data that you can play with. Run the following R code to install a package with the data of the Titanic passengers:
install.packages("titanic")
Run this command in the R console just once (you don’t need to reinstall the package every time you run your R script).
Then you need to run the following command so that the data from the package becomes available during your R session. It needs to be run just once per session, but you can include it in your R script and run it several times:
library(titanic)
Now we will create a copy of the dataset titanic_train so that we won’t damage the original dataset while playing with it:
df_tit <- titanic_train
class(df_tit)
str(df_tit)
head(df_tit)
## [1] "data.frame"
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Write a single R command that answers each of the following questions about df_tit:
What are the dimensions of the dataset df_tit, i.e., how many observations (rows) and variables (columns) does it have?
What are the variables?
Print the summary of the data that contains the minimum, the 1st quartile, the median, the mean, the 3rd quartile, and the maximum of all numeric variables and an overview of non-numeric variables.
# ANSWER
# (a)
dim(df_tit)
# (b)
names(df_tit)
# (c)
summary(df_tit)
## [1] 891 12
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
The following questions require subsetting the data. You are not required to create new variables for these questions, just print the answer. In each case, you should do it with just a single R command.
Print the name, sex, age, and survival status for passengers 42, 73, and 496.
Find the fraction of passengers who survived.
Find the difference between the average ticket fare of passengers who survived the disaster and those who died.
# ANSWER
# (a)
df_tit[c(42, 73, 496), c("Name", "Sex", "Age", "Survived")]
# (b)
sum(df_tit$Survived) / nrow(df_tit)
# or
mean(df_tit$Survived, na.rm = TRUE)
# (c)
mean(df_tit$Fare[df_tit$Survived == 1]) - mean(df_tit$Fare[df_tit$Survived == 0])
## [1] 0.3838384
## [1] 0.3838384
## [1] 26.27752
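The group comparison in (c) can also be written with the built-in function `tapply`, which computes a summary for each group in one call. The sketch below uses a hypothetical toy data frame `toy` rather than the Titanic data, so the numbers are made up for illustration:

```r
# Hypothetical toy data: two survivors and two non-survivors
toy <- data.frame(Survived = c(0, 1, 1, 0), Fare = c(10, 50, 30, 20))

# Approach from the model answer: subset each group, then subtract the means
gap_by_subsetting <- mean(toy$Fare[toy$Survived == 1]) - mean(toy$Fare[toy$Survived == 0])

# Alternative: tapply() returns one mean per group, named by the group value
group_means <- tapply(toy$Fare, toy$Survived, mean)
gap_by_tapply <- as.numeric(group_means["1"] - group_means["0"])

c(gap_by_subsetting, gap_by_tapply)
```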
In R, you can create a new variable in a data frame using the $ operator:
df_tit$new_var <- 1
This adds a new column called new_var to the data frame df_tit and assigns 1 to every entry in that column.
New variables can be computed in terms of old variables. For example, SibSp is the number of siblings / spouses aboard, Parch is the number of parents / children aboard, so we can introduce a new numeric variable that shows the number of family members aboard and a new logical variable that indicates whether the passenger is travelling with family:
df_tit$number_of_family_members <- df_tit$SibSp + df_tit$Parch
df_tit$has_family <- (df_tit$SibSp + df_tit$Parch) > 0
head(df_tit[ , c("SibSp", "Parch", "number_of_family_members", "has_family")])
A useful function is ifelse (vectorised conditional). The command ifelse(condition, yes, no) applies a condition to each element and returns a result of the same length. For example, the following creates a new character variable that equals “very expensive” for passengers who paid more than 100 dollars for their tickets and “kinda okay” for those who paid 100 dollars or less:
df_tit$expensive_ticket <- ifelse(df_tit$Fare > 100, "very expensive", "kinda okay")
df_tit[215:220 , c("Pclass", "Fare", "expensive_ticket")]
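One detail worth knowing about ifelse is how it treats missing values: where the condition is NA, the result is NA as well, not the "no" value. A minimal sketch on a hypothetical vector `fare` (not taken from the Titanic data):

```r
# Hypothetical fares, including a missing value and a fare of exactly 100
fare <- c(7.25, 512.33, NA, 100)

# The condition fare > 100 is FALSE, TRUE, NA, FALSE
label <- ifelse(fare > 100, "very expensive", "kinda okay")
label
```

The third entry of `label` is NA, and the fare of exactly 100 is labelled “kinda okay” because the comparison is strict.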
To solve each of the following tasks, you can use several R commands or several lines of R code, but not loops. You can use the function ifelse. Note that it is actually possible to solve each of these questions with a single R command, but it may be complicated.
Add a new character variable has_survived to df_tit that equals “yes” if Survived == 1 and “no” otherwise.
Add a new variable imputed_age that equals Age whenever Age is defined. If Age is NA, then it should be the median Age of passengers of the same sex, i.e., median female age for female passengers or median male age for male passengers.
Add a new variable family_size that takes the value “single” if the passenger has no accompanying family members (i.e., SibSp + Parch == 0), “couple” if the passenger is travelling with one relative, “small” if the number of family members travelling with the passenger is 2 or 3, and “large” if the number of family members travelling with the passenger is 4 or greater.
# ANSWER
# (a)
## First method:
df_tit$has_survived <- "yes"
df_tit$has_survived[df_tit$Survived == 0] <- "no"
## Second method:
df_tit$has_survived <- ifelse(df_tit$Survived, "yes", "no")
# (b)
## First method
median_female_age <- median(df_tit$Age[df_tit$Sex == "female"], na.rm = TRUE)
median_male_age <- median(df_tit$Age[df_tit$Sex == "male"], na.rm = TRUE)
df_tit$imputed_age <- df_tit$Age
df_tit$imputed_age[is.na(df_tit$imputed_age) & df_tit$Sex == "female"] <- median_female_age
df_tit$imputed_age[is.na(df_tit$imputed_age) & df_tit$Sex == "male"] <- median_male_age
## Second method - one (very complicated) command
df_tit$imputed_age <- ifelse(is.na(df_tit$Age), 0, df_tit$Age) +
is.na(df_tit$Age) * (df_tit$Sex == "female") * median(df_tit$Age[df_tit$Sex == "female"], na.rm = TRUE) +
is.na(df_tit$Age) * (df_tit$Sex == "male") * median(df_tit$Age[df_tit$Sex == "male"], na.rm = TRUE)
# (c)
## First method
df_tit$family_size <- "single"
df_tit$family_size[df_tit$SibSp + df_tit$Parch >= 1] <- "couple"
df_tit$family_size[df_tit$SibSp + df_tit$Parch >= 2] <- "small"
df_tit$family_size[df_tit$SibSp + df_tit$Parch >= 4] <- "large"
table(df_tit$family_size)
## Second method, using the built-in function cut
df_tit$family_size <- cut(df_tit$SibSp + df_tit$Parch, breaks = c(-0.5, 0.5, 1.5, 3.5, Inf),
labels = c("single", "couple", "small", "large"))
table(df_tit$family_size)
##
## couple large single small
## 161 62 537 131
##
## single couple small large
## 537 161 131 62
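For task (b), a shorter alternative to the two methods above is the built-in function `ave`, which computes a summary within each group and returns a vector as long as the input. This is only a sketch on a hypothetical toy data frame `toy`, not the model answer and not the Titanic data:

```r
# Hypothetical toy data with one missing age in each group
toy <- data.frame(
  Sex = c("female", "female", "female", "male", "male", "male"),
  Age = c(20, NA, 30, 40, NA, 50)
)

# ave() applies FUN within each Sex group and repeats the result for every row
# of that group, so each row gets the median age of its own group.
group_median <- ave(toy$Age, toy$Sex, FUN = function(a) median(a, na.rm = TRUE))

# Keep the observed age where available, otherwise use the group median
toy$imputed_age <- ifelse(is.na(toy$Age), group_median, toy$Age)
toy
```

Unlike the first method, this does not need one line per group, so it keeps working if the grouping variable has more than two values.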