Info

Objective

These homework problem sets are designed to help you understand the material better. Try the problems yourself first, and only then look at the model answers. You may use generative AI for help, for example with a prompt such as “How do I extract rows 4, 19, and 55 from a data frame?”, or to explain an error message to you. But it is pointless to feed all the questions to ChatGPT, because you won’t learn anything.

Your task

Create an R script, name it “mh3511_hw_1_YOUR_NAME.R”, and save it to the folder where you keep all your homework. Type your solutions in that R script. Insert comments, such as # Question 1, to mark where each solution starts.

Part 1: Vectors

You will need to solve a few questions. Each question asks you to write one line of R code that performs a certain computation. In particular, you shouldn’t use loops, if-else statements, or your own functions (it is very good if you know how to use loops, but that knowledge won’t help here).
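
As a rough illustration of the one-line style these questions ask for, here is a small sketch. The vector v is made up for this example and is not part of the homework data. Arithmetic and comparisons in R act on whole vectors at once, which is why a loop is rarely needed:

```r
# Hypothetical example vector, not part of the homework data
v <- c(3, -1, 4, -1, 5)

# Arithmetic and comparisons are applied to every entry at once
v_squared <- v^2           # 9 1 16 1 25
is_positive <- v > 0       # TRUE FALSE TRUE FALSE TRUE

# sum() treats TRUE as 1 and FALSE as 0, so this counts the positive entries
n_positive <- sum(is_positive)            # 3

# Logical indexing keeps only the entries where the condition is TRUE
sum_of_positive <- sum(v[is_positive])    # 12
```

Counting with sum(condition) and selecting with v[condition] are the two patterns that most of the vector questions rely on.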

Let us first load a few numeric vectors into R. Run the following commands (it is better to copy them into your R script):

x <- -5:6
y <- rep(c(1.5, -0.6, 0), 4)
z <- x / y
cat("x = \n")
x
cat("y = \n")
y
cat("z =\n")
z
## x = 
##  [1] -5 -4 -3 -2 -1  0  1  2  3  4  5  6
## y = 
##  [1]  1.5 -0.6  0.0  1.5 -0.6  0.0  1.5 -0.6  0.0  1.5 -0.6  0.0
## z =
##  [1] -3.3333333  6.6666667       -Inf -1.3333333  1.6666667        NaN
##  [7]  0.6666667 -3.3333333        Inf  2.6666667 -8.3333333        Inf

Question 1

Write a single R command that prints the number of infinite entries of a numeric vector. Now change it so that it prints the number of positive infinite entries of a numeric vector. Test both commands on the vector x/y — you should get 3 and 2 respectively.

# ANSWER

# Number of infinite entries:
sum(is.infinite(z))

# Number of positive infinite entries (several solutions):
sum(is.infinite(z) & !is.na(z) & z > 0)
sum(z == Inf, na.rm = TRUE)
sum(z[is.infinite(z)] > 0)
## [1] 3
## [1] 2
## [1] 2
## [1] 2
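
The solutions above depend on how R tells Inf, NaN, and NA apart. The following sketch uses a made-up vector v (an assumption of this example, not part of the homework data) to show what each test function reports:

```r
# Hypothetical vector containing every kind of special value
v <- c(1, Inf, -Inf, NaN, NA)

is.infinite(v)   # TRUE only for Inf and -Inf
is.nan(v)        # TRUE only for NaN
is.na(v)         # TRUE for both NaN and NA
is.finite(v)     # TRUE only for ordinary numbers

n_infinite <- sum(is.infinite(v))          # 2
n_missing  <- sum(is.na(v))                # 2
# v == Inf is NA for the NaN and NA entries, so na.rm = TRUE is needed
n_pos_inf  <- sum(v == Inf, na.rm = TRUE)  # 1
```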

Now we will load a few non-numeric vectors into R. Run the following commands:

s <- c("The", "mirror",  "of", "73", "37", "is", "the", "12", "th", "prime", "number", ".")
w <- x < s

Question 2

What should be the output of each of the following R commands? Try to predict it first, then run them in R, and then come up with an explanation.

class(s)
class(w)
class(x + w)
x + w
sum(!is.na(as.numeric(s)))
sum(is.na(as.character(x)))
## [1] "character"
## [1] "logical"
## [1] "integer"
##  [1] -4 -3 -2 -1  0  1  2  2  4  5  6  6
## [1] 3
## [1] 0
# ANSWER
# s is a character vector - it was explicitly defined as a character vector
class(s)

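# w is the result of a comparison, so it is logical: each entry is TRUE or FALSE.
# Since s is a character vector, x is coerced to character and the entries are compared as strings.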
class(w)

# When we add an integer vector and a logical vector, the logical vector is coerced to integer:
# TRUE becomes 1, FALSE becomes 0
class(x + w)
x + w

# When we convert a character vector to numeric, everything that is not a number becomes NA
sum(!is.na(as.numeric(s)))

# When we convert a numeric vector to character, every entry becomes a string, so the result has no NA entries.
# Even NaN would become "NaN". The only exception is NA: it stays NA rather than becoming "NA".
sum(is.na(as.character(x)))
## [1] "character"
## [1] "logical"
## [1] "integer"
##  [1] -4 -3 -2 -1  0  1  2  2  4  5  6  6
## [1] 3
## [1] 0
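
The coercion rules behind these outputs can be tried out on smaller pieces. The sketch below uses values chosen only for illustration; the variable names are assumptions of this example:

```r
# Comparing a number with a string coerces the number to a string first
num_vs_num <- 2 < 12        # TRUE: numeric comparison
str_vs_str <- "2" < "12"    # FALSE: string comparison, "1" sorts before "2"
num_vs_str <- 2 < "12"      # FALSE: 2 becomes "2" and is compared as a string

# Strings that are not numbers become NA (as.numeric() also emits a warning,
# which is suppressed here to keep the output clean)
converted <- suppressWarnings(as.numeric(c("73", "th", "12")))   # 73 NA 12

# Logical values are coerced to match the other operand
class(1L + TRUE)    # "integer"
class(1.5 + TRUE)   # "numeric"
```

This is why x[8] < s[8], that is 2 < "12", comes out FALSE in w even though 2 is less than 12 as a number.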

Question 3

Write a single R command to calculate each of the following expressions, where \(n\) is any positive integer:

  1. \(\displaystyle \sum_{k=1}^{n}\cos(k^2-k)\)

  2. \(\displaystyle \frac{2}{1!}-\frac{2^3}{3!}+\frac{2^5}{5!}-\cdots + (-1)^{n+1}\frac{2^{2n-1}}{(2n-1)!}\)

  3. The number of divisors of \(n\).

# ANSWER
# First, let us assign a value to n
n <- 10

# (a)
sum(cos((1:n)^2-(1:n)))

# (b)
sum(-(-1)^(1:n) * 2^(2 * (1:n) - 1) / factorial(2 * (1:n) - 1))

# (c)
sum(n %% (1:n) == 0)
## [1] 1.988122
## [1] 0.9092974
## [1] 4
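
The expression in (b) consists of the first \(n\) terms of the Taylor series of \(\sin(2)\), which is why the output is so close to sin(2). The sketch below checks this. It stores 1:n in a helper variable k, which is an addition of this example, so it is no longer a single command:

```r
n <- 10
k <- 1:n   # helper index vector, introduced here only for readability

# Partial sum from (b): the k-th term has sign (-1)^(k + 1)
partial_sum <- sum((-1)^(k + 1) * 2^(2 * k - 1) / factorial(2 * k - 1))

# Distance from the limit of the series
abs(partial_sum - sin(2))

# Divisor count from (c), together with the divisors themselves
n_divisors <- sum(n %% k == 0)   # 4
divisors <- k[n %% k == 0]       # 1 2 5 10
```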

Question 4

Write a single R command to calculate

  1. The largest negative entry of a numeric vector

  2. The sum of squares of even entries of a numeric vector

  3. The entry of a numeric vector that is closest to 42

You can test your command on vectors x, y, and z that have already been loaded to R.

# ANSWER
# (a)
max(z[z<0], na.rm = TRUE)

# (b)
sum(x[x %% 2 == 0]^2, na.rm = TRUE)

# (c)
z[which.min(abs(z - 42))]
## [1] -1.333333
## [1] 76
## [1] 6.666667
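
The na.rm = TRUE in (a) is needed because a comparison with NaN gives NA, and an NA inside a logical index produces an NA entry in the result. The following sketch shows this on a made-up vector v (not part of the homework data) and shows which(), which drops the NA positions, as one possible alternative:

```r
# Hypothetical vector with one undefined entry
v <- c(-3, NaN, 7, -1, 50)

v < 0                          # TRUE NA FALSE TRUE FALSE
neg_raw <- v[v < 0]            # -3 NA -1: the NA in the index yields an NA entry
neg_clean <- v[which(v < 0)]   # -3 -1: which() keeps only the TRUE positions

largest_negative <- max(neg_clean)      # -1

# which.min() skips NA and NaN entries on its own
closest <- v[which.min(abs(v - 42))]    # 50
```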

Question 5

The command numeric(0) generates an empty numeric vector:

print(numeric(0))
class(numeric(0))
length(numeric(0))
## numeric(0)
## [1] "numeric"
## [1] 0

What do you think will be the output of each of the following R commands? Try to predict it first, then run them in R, and then come up with an explanation.

sum(numeric(0))
prod(numeric(0))
min(numeric(0))
max(numeric(0))
which.min(numeric(0))
# ANSWER

# The sum of elements of an empty set is 0 since 0 is the additive identity (x + 0 = x).
sum(numeric(0))

# The product of an empty set is 1 since 1 is the multiplicative identity (x * 1 = x).
prod(numeric(0))

# The minimum of an empty vector is Inf — so that min(x, Inf) = x.
min(numeric(0))  # Warning: returns Inf and a warning

# The maximum of an empty vector is -Inf — so that max(x, -Inf) = x.
max(numeric(0))  # Warning: returns -Inf and a warning

# which.min returns the index of the minimum — but there are no elements, so it returns integer(0)
which.min(numeric(0))
## [1] 0
## [1] 1
## [1] Inf
## [1] -Inf
## integer(0)
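
One way to see why these conventions are convenient: with them, splitting a vector into two pieces never changes the answer, even when one piece is empty. A small sketch, with vectors a and b made up for this example:

```r
a <- c(2, 5, 3)
b <- numeric(0)   # the empty piece

# Combining a with an empty vector changes nothing...
sum_ok  <- sum(c(a, b))  == sum(a) + sum(b)     # 10 == 10 + 0
prod_ok <- prod(c(a, b)) == prod(a) * prod(b)   # 30 == 30 * 1

# ...and the same holds for min, since min(2, Inf) is 2
# (suppressWarnings() hides the warning that min() gives for an empty vector)
min_ok <- min(c(a, b)) == min(min(a), suppressWarnings(min(b)))
```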

Part 2: Matrices

You will need to solve a few questions. Each question asks you to write one line of R code that performs a certain computation. In particular, you shouldn’t use loops, if-else statements, or your own functions (it is very good if you know how to use loops, but that knowledge won’t help here).

Let us first load a few matrices into R. Run the following commands (it is better to copy them into your R script):

X <- matrix(rep(-1:1, 4), nrow = 3)
Y <- matrix(rep(-1:1, 4), nrow = 3, byrow = TRUE)
Z <- matrix(6:-1, nrow = 4)
X
Y
Z
dim(X)  
dim(Y)  
dim(Z)  
##      [,1] [,2] [,3] [,4]
## [1,]   -1   -1   -1   -1
## [2,]    0    0    0    0
## [3,]    1    1    1    1
##      [,1] [,2] [,3] [,4]
## [1,]   -1    0    1   -1
## [2,]    0    1   -1    0
## [3,]    1   -1    0    1
##      [,1] [,2]
## [1,]    6    2
## [2,]    5    1
## [3,]    4    0
## [4,]    3   -1
## [1] 3 4
## [1] 3 4
## [1] 4 2

Question 6

Which of the following lines of code will produce an error message when run? First, think about it and try to predict what is going to happen. Then copy them into your R script, uncomment them, run them one by one, and see whether what happens matches your prediction.

# X + Y
# 42 * X - Y / 73 
# X * Y
# X %*% Y
# X + Z
# X %*% Z
# Y < X 
# ANSWER

# These are OK — elementwise operations or scalar-matrix ops:
X + Y
42 * X - Y / 73 
X * Y

# This fails: %*% is matrix multiplication, and X (3x4) × Y (3x4) is not defined.
# The inner dimensions (4 and 3) must match.
# X %*% Y

# This fails: X (3x4) and Z (4x2) have different shapes — can't add elementwise.
# X + Z

# This works: X (3x4) × Z (4x2) gives a 3x2 matrix.
X %*% Z

# This is OK: elementwise logical comparison, returns a logical matrix.
Y < X 
##      [,1] [,2] [,3] [,4]
## [1,]   -2   -1    0   -2
## [2,]    0    1   -1    0
## [3,]    2    0    1    2
##          [,1]         [,2]         [,3]     [,4]
## [1,] -41.9863 -42.00000000 -42.01369863 -41.9863
## [2,]   0.0000  -0.01369863   0.01369863   0.0000
## [3,]  41.9863  42.01369863  42.00000000  41.9863
##      [,1] [,2] [,3] [,4]
## [1,]    1    0   -1    1
## [2,]    0    0    0    0
## [3,]    1   -1    0    1
##      [,1] [,2]
## [1,]  -18   -2
## [2,]    0    0
## [3,]   18    2
##       [,1]  [,2]  [,3]  [,4]
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE  TRUE FALSE
## [3,] FALSE  TRUE  TRUE FALSE
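
The shape rules from the answer can also be checked in code before running an operation. The sketch below redefines X, Y, and Z so that it runs on its own. The variable names can_XY, can_XZ, and result are assumptions of this example:

```r
X <- matrix(rep(-1:1, 4), nrow = 3)
Y <- matrix(rep(-1:1, 4), nrow = 3, byrow = TRUE)
Z <- matrix(6:-1, nrow = 4)

# A %*% B is defined only when ncol(A) equals nrow(B)
can_XY <- ncol(X) == nrow(Y)   # FALSE: 4 vs 3
can_XZ <- ncol(X) == nrow(Z)   # TRUE:  4 vs 4

# Transposing one factor makes the inner dimensions match
dim(t(X) %*% Y)   # 4 4
dim(X %*% t(Y))   # 3 3

# try() captures the error instead of stopping the script
result <- try(X + Z, silent = TRUE)
inherits(result, "try-error")   # TRUE
```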

Question 7

Write a single R command to calculate

  1. The fraction of nonzero entries in a matrix.

  2. The sum of entries that are bigger than the minimum and smaller than the maximum entry of a matrix. Do not include the min and max entries themselves.

  3. The largest sum of squares of entries in any row of a matrix. For this, you will need R functions t (transpose of a matrix) and diag (diagonal of a matrix)

You can test your command on matrices \(X\), \(Y\), and \(Z\) that we have already defined in R. For simplicity, you can assume that all entries of your matrix are defined, i.e., it doesn’t have NA entries.

# ANSWER

# (a)
sum(Z != 0) / prod(dim(Z))
length(Z[Z != 0]) / length(Z)

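# (b)
# Logical indexing selects the entries strictly between the minimum and the maximum,
# and sum() adds them up. Without the indexing, sum() would only count those entries.
sum(X[X > min(X) & X < max(X)])

# (c)
max(diag(Y %*% t(Y)))
# Explanation: Y %*% t(Y) gives a matrix of pairwise dot products of the rows.
# The diagonal contains the squared norms of each row.
## [1] 0.875
## [1] 0.875
## [1] 0
## [1] 3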
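
For comparison, here is a sketch of two shortcuts that go beyond the hint in the question: mean() applied to a logical matrix gives a fraction directly, and the built-in function rowSums() gives the row sums of squares without forming the full matrix product. Y and Z are redefined so that the block runs on its own:

```r
Y <- matrix(rep(-1:1, 4), nrow = 3, byrow = TRUE)
Z <- matrix(6:-1, nrow = 4)

# (a) The mean of a logical matrix is the fraction of TRUE entries
frac_nonzero <- mean(Z != 0)    # 0.875

# (c) Square every entry, then add up each row
row_ss <- rowSums(Y^2)          # 3 2 3
largest_row_ss <- max(row_ss)   # 3

# Same numbers as the diagonal of Y %*% t(Y)
same <- isTRUE(all.equal(row_ss, diag(Y %*% t(Y))))
```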

Question 8

Given a matrix, write a single R command that generates the following matrices of the same shape as the given matrix:

  1. The zero matrix.

  2. The matrix defined by \(a_{ij}=i\).

  3. The matrix that is the same as the given one above the main diagonal and is zero on the main diagonal and below.

# ANSWER

# (a)
X - X
# or
matrix(0, nrow = nrow(X), ncol = ncol(X))

# (b)
# This is how it can be done with known tools:
matrix(rep(1:dim(X)[1], dim(X)[2]), nrow = dim(X)[1])
# But there is a built-in R function:
row(X)

# (c)
# This is how it can be done with known tools:
X * (matrix(rep(1:dim(X)[1], dim(X)[2]), nrow = dim(X)[1]) < 
       matrix(rep(1:dim(X)[2], dim(X)[1]), nrow = dim(X)[1], byrow = TRUE))
# Or using built-in R functions:
X * (row(X) < col(X))
# One more way with built-in R functions:
X * upper.tri(X)
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    2    2    2    2
## [3,]    3    3    3    3
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    2    2    2    2
## [3,]    3    3    3    3
##      [,1] [,2] [,3] [,4]
## [1,]    0   -1   -1   -1
## [2,]    0    0    0    0
## [3,]    0    0    0    1
##      [,1] [,2] [,3] [,4]
## [1,]    0   -1   -1   -1
## [2,]    0    0    0    0
## [3,]    0    0    0    1
##      [,1] [,2] [,3] [,4]
## [1,]    0   -1   -1   -1
## [2,]    0    0    0    0
## [3,]    0    0    0    1
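
The same commands work on a matrix that is taller than it is wide. Here is a sketch on Z, redefined so that the block runs on its own. The names zero, U, and U2 are assumptions of this example:

```r
Z <- matrix(6:-1, nrow = 4)   # 4 x 2

# (a) Multiplying by 0 also gives a zero matrix of the same shape
#     (this relies on Z having no NA, NaN, or infinite entries)
zero <- 0 * Z

# (b) row(Z) holds the row index of each entry, col(Z) the column index
row(Z)
col(Z)

# (c) Only the entry in row 1, column 2 lies above the main diagonal of a 4 x 2 matrix
U  <- Z * upper.tri(Z)
U2 <- Z * (row(Z) < col(Z))   # same result, written with row() and col()
```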

You can test your command on matrices \(X\) and \(Z\). You can assume that all entries of your matrix are defined, i.e., it doesn’t have NA entries.

Part 3: Data Frames

Some R packages contain data that you can play with. Run the following R code to install a package with the data of the Titanic passengers:

install.packages("titanic")

Run this command in the command line just once (you don’t need to reinstall the package every time you run your R script).

Then you need to run the following command so that the data from the package becomes available during your R session. It needs to be run just once per session, but you can include it in your R script and run it several times without harm:

library(titanic)

Now we will create a copy of the dataset titanic_train so that we won’t damage the original dataset while playing with it:

df_tit <- titanic_train
class(df_tit)
str(df_tit)
head(df_tit)
## [1] "data.frame"
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Question 9

Write a single R command that answers each of the following questions about df_tit:

  1. What are the dimensions of the dataset df_tit, i.e., how many observations (rows) and variables (columns) does it have?

  2. What are the variables?

  3. Print the summary of the data that contains the minimum, the 1st quartile, the median, the mean, the 3rd quartile, and the maximum of all numeric variables and an overview of non-numeric variables.

# ANSWER

# (a)
dim(df_tit)

# (b)
names(df_tit)

# (c)
summary(df_tit)
## [1] 891  12
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"   
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

Question 10

The following questions require subsetting the data. You are not required to create new variables for this question, just print the answer. In each case, you should do it with just a single R command.

  1. Print name, sex, age, and survival status for passengers 42, 73, and 496.

  2. Find the fraction of passengers who survived.

  3. Find the difference between the average ticket fare of passengers who survived the disaster and those who died.

# ANSWER

# (a)
df_tit[c(42, 73, 496), c("Name", "Sex", "Age", "Survived")]

# (b)
sum(df_tit$Survived) / nrow(df_tit)
# or
mean(df_tit$Survived, na.rm = TRUE)

# (c)
mean(df_tit$Fare[df_tit$Survived == 1]) - mean(df_tit$Fare[df_tit$Survived == 0])
## [1] 0.3838384
## [1] 0.3838384
## [1] 26.27752
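
The same subsetting patterns can be tried on a tiny data frame where the answers are easy to check by hand. The data frame toy below is invented for this sketch and has nothing to do with the real passengers. The built-in function tapply() is shown as one possible alternative for computing group means:

```r
# Invented data, only for illustration
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Survived = c(1, 0, 1, 0),
  Fare = c(80, 10, 40, 20)
)

# Rows by position, columns by name
toy[c(1, 3), c("Name", "Fare")]

# The mean of a 0/1 variable is the fraction of ones
frac_survived <- mean(toy$Survived)   # 0.5

# tapply() applies mean() to Fare within each value of Survived;
# the result is a vector named "0" and "1"
fare_by_status <- tapply(toy$Fare, toy$Survived, mean)        # 15 and 60
fare_diff <- fare_by_status["1"] - fare_by_status["0"]        # 45
```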

Question 11

In R, you can create a new variable in a data frame using the $ operator:

df_tit$new_var <- 1

This adds a new column called new_var to the data frame df_tit and assigns 1 to every entry in that column.

New variables can be computed in terms of old variables. For example, SibSp is the number of siblings / spouses aboard, Parch is the number of parents / children aboard, so we can introduce a new numeric variable that shows the number of family members aboard and a new logical variable that indicates whether the passenger is travelling with family:

df_tit$number_of_family_members <- df_tit$SibSp + df_tit$Parch
df_tit$has_family <- (df_tit$SibSp + df_tit$Parch) > 0
head(df_tit[ , c("SibSp", "Parch", "number_of_family_members", "has_family")])

A useful function is ifelse (vectorised conditional). The command ifelse(condition, yes, no) applies a condition to each element and returns a result of the same length. For example, the following creates a new character variable that equals “very expensive” for passengers who paid more than 100 dollars for their tickets and “kinda okay” for those who paid 100 dollars or less:

df_tit$expensive_ticket <- ifelse(df_tit$Fare > 100, "very expensive", "kinda okay")
df_tit[215:220 , c("Pclass", "Fare", "expensive_ticket")]
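
To see how ifelse behaves entry by entry, here is a sketch on short made-up vectors (fare and size are assumptions of this example, not columns of df_tit). It also shows that an NA in the condition gives an NA in the result, and that ifelse calls can be nested when there are more than two outcomes:

```r
# Hypothetical fares, including a missing value and the boundary value 100
fare <- c(7.25, 150, NA, 100)

label <- ifelse(fare > 100, "very expensive", "kinda okay")
# "kinda okay" "very expensive" NA "kinda okay"

# Nested ifelse: the "no" branch is itself another ifelse
size <- c(0, 1, 3, 5)
group <- ifelse(size == 0, "single",
                ifelse(size == 1, "couple", "other"))
# "single" "couple" "other" "other"
```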

To solve each of the following tasks, you can use several R commands or several lines of R code, but not loops. You can use the function ifelse. Note that it is actually possible to solve each of these questions with a single R command, but it may be complicated.

  1. Add a new character variable has_survived to df_tit that equals “yes” if Survived == 1 and “no” otherwise.

  2. Add a new variable imputed_age that equals Age whenever Age is defined. If Age is NA, then it should be the median Age of passengers of the same sex, i.e., median female age for female passengers or median male age for male passengers.

  3. Add a new variable family_size that takes the value “single” if the passenger has no accompanying family members (i.e., SibSp + Parch == 0), “couple” if the passenger is travelling with one relative, “small” if the number of family members travelling with the passenger is 2 or 3, and “large” if the number of family members travelling with the passenger is 4 or greater.

# ANSWER

# (a)
## First method:
df_tit$has_survived <- "yes"
df_tit$has_survived[df_tit$Survived == 0] <- "no"

## Second method:
df_tit$has_survived <- ifelse(df_tit$Survived, "yes", "no")

# (b)
## First method
median_female_age <- median(df_tit$Age[df_tit$Sex == "female"], na.rm = TRUE)
median_male_age <- median(df_tit$Age[df_tit$Sex == "male"], na.rm = TRUE)
df_tit$imputed_age <- df_tit$Age
df_tit$imputed_age[is.na(df_tit$imputed_age) & df_tit$Sex == "female"] <- median_female_age
df_tit$imputed_age[is.na(df_tit$imputed_age) & df_tit$Sex == "male"] <- median_male_age

## Second method - one (very complicated) command
df_tit$imputed_age <- ifelse(is.na(df_tit$Age), 0, df_tit$Age) +
  is.na(df_tit$Age) * (df_tit$Sex == "female") * median(df_tit$Age[df_tit$Sex == "female"], na.rm = TRUE) +
  is.na(df_tit$Age) * (df_tit$Sex == "male") * median(df_tit$Age[df_tit$Sex == "male"], na.rm = TRUE)


# (c)
## First method
df_tit$family_size <- "single"
df_tit$family_size[df_tit$SibSp + df_tit$Parch >= 1] <- "couple"
df_tit$family_size[df_tit$SibSp + df_tit$Parch >= 2] <- "small"
df_tit$family_size[df_tit$SibSp + df_tit$Parch >= 4] <- "large"
table(df_tit$family_size)

## Second method, using the built-in function cut
df_tit$family_size <- cut(df_tit$SibSp + df_tit$Parch, breaks = c(-0.5, 0.5, 1.5, 3.5, Inf),
                          labels = c("single", "couple", "small", "large"))
table(df_tit$family_size)
## 
## couple  large single  small 
##    161     62    537    131 
## 
## single couple  small  large 
##    537    161    131     62
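
For (b), another possible route is the built-in function ave(), which computes a statistic within each group and returns a vector of the same length as the input. The sketch below uses an invented data frame toy, so the result can be checked by hand. It is an alternative to the two methods above, not a replacement for them:

```r
# Invented data, only for illustration
toy <- data.frame(
  Sex = c("female", "female", "female", "male", "male", "male"),
  Age = c(20, NA, 30, 40, 50, NA)
)

# Median age within each sex, repeated for every row of that sex
group_median <- ave(toy$Age, toy$Sex,
                    FUN = function(a) median(a, na.rm = TRUE))
# 25 25 25 45 45 45

# Keep Age where it is defined, otherwise use the group median
toy$imputed_age <- ifelse(is.na(toy$Age), group_median, toy$Age)
# 20 25 30 40 50 45
```

Because ave() handles every group in one call, the same line would keep working if the grouping variable had more than two values.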

Model answers:

https://rpubs.com/fduzhin/mh3511_hw_1