Homework 1 – Due September 8

Name: Jared Ali netID: 230005904 Collaborated with: no one :(

Your homework must be submitted in Word or PDF format, created by calling “Knit Word” or “Knit PDF” from RStudio on your R Markdown document. Submission in other formats may receive a grade of 0. Your responses must be supported by both textual explanations and the code you generate to produce your result. Note that all R code used to produce your results must be shown in your knitted file.

Syntax and class-typing.

For each line of the following code, either explain why they should be erroneous, or explain what tasks the non-erroneous ones perform.

vector1 <- c(5, 12, TRUE, 32)
This line is not erroneous. c() combines its arguments into a single atomic vector, and since all elements of an atomic vector must share one type, the logical TRUE is coerced to the number 1. vector1 is therefore the numeric vector 5, 12, 1, 32.
max(vector1)
This returns the largest element of vector1, which is 32. TRUE was already coerced to 1 when the vector was created, so there is no error.
sort(vector1)
This returns a new vector with the elements in increasing order: 1, 5, 12, 32. vector1 itself is not modified.
sum(vector1)
This returns the sum of the elements as a single number: 5 + 12 + 1 + 32 = 50.
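As a quick check on how c() treats mixed types, the four lines can be run with a class() call added. This is only a sketch; the class() call is an addition that is not part of the assignment.

```r
# Atomic vectors hold a single type, so c() coerces the logical TRUE to the number 1.
vector1 <- c(5, 12, TRUE, 32)
class(vector1)   # "numeric"
max(vector1)     # 32
sort(vector1)    # 1 5 12 32
sum(vector1)     # 50
```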

For each block of the following code, either explain why they should be erroneous, or explain what tasks the non-erroneous ones perform.

vector2 <- c(5,"7",12)
vector2[2] + vector2[3]
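This block is erroneous. Because "7" is a character string, c() coerces every element to character, so vector2 holds "5", "7", and "12". The + operator is not defined for character strings, so the second line fails with a non-numeric argument error. Removing the quotation marks around 7, or converting the elements with as.numeric(), would make the addition work.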

dataframe3 <- data.frame(z1="5",z2=7,z3=12)
dataframe3[1,2] + dataframe3[1,3]
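This block is not erroneous. A data frame can hold columns of different types, so z1 is stored as a character column (a factor in versions of R before 4.0) while z2 and z3 stay numeric. dataframe3[1,2] and dataframe3[1,3] pick out the numbers 7 and 12, and their sum is 19. The "5" in z1 is never used in the addition.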
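The contrast between the two blocks above can be checked directly. The following is a minimal sketch; the class() and as.numeric() calls are additions for illustration and are not part of the assignment.

```r
# A character element forces the whole atomic vector to character.
vector2 <- c(5, "7", 12)
class(vector2)                                    # "character"
as.numeric(vector2[2]) + as.numeric(vector2[3])   # 19 after explicit conversion

# A data frame keeps a separate type for each column.
dataframe3 <- data.frame(z1 = "5", z2 = 7, z3 = 12)
class(dataframe3$z2)                              # "numeric"
dataframe3[1, 2] + dataframe3[1, 3]               # 19
```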

list4 <- list(z1="6", z2=42, z3="49", z4=126)
list4[[2]]+list4[[4]]
list4[2]+list4[4]

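A list can hold elements of different types, so creating list4 is not erroneous. The second line works: double brackets extract the elements themselves, the numbers 42 and 126, and their sum is 168. The third line is erroneous: single brackets return sub-lists of length one, and + is not defined for lists.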
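The difference between double and single brackets can be seen by inspecting what each one returns. This is a sketch only; tryCatch() is used here so the failing line reports its message instead of stopping the script, and the name msg is hypothetical.

```r
list4 <- list(z1 = "6", z2 = 42, z3 = "49", z4 = 126)

class(list4[[2]])          # "numeric": [[ ]] extracts the element itself
class(list4[2])            # "list": [ ] returns a sub-list of length one
list4[[2]] + list4[[4]]    # 168

# Adding two sub-lists is an error; capture the message rather than halting.
msg <- tryCatch(list4[2] + list4[4], error = function(e) conditionMessage(e))
msg
```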

Working with functions and operators.

The colon operator will create a sequence of integers in order. It is a special case of the function seq(). Using the help command ?seq to learn about the function, produce an expression that will give you the sequence of numbers from 1 to 10000 in increments of 369. Produce another that will give you a sequence between 1 and 10000 that is exactly 50 numbers in length (i.e., the first number is 1 and the last number is 10000; and the differences between a pair of consecutive numbers are the same).

seq(1, 10000, by = 369)

##  [1]    1  370  739 1108 1477 1846 2215 2584 2953 3322 3691 4060 4429 4798 5167
## [16] 5536 5905 6274 6643 7012 7381 7750 8119 8488 8857 9226 9595 9964

seq(1, 10000, length.out = 50)

Using by = 200 would give 50 numbers but stop at 9801, so the last number would not be 10000. With length.out = 50, seq() chooses the spacing itself: (10000 - 1) / 49, about 204.06, so the sequence starts at 1, ends at 10000, and has equal gaps.
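The difference between fixing the step and fixing the length can be sketched on a short range first. The names by.step, fixed.n, and fifty are hypothetical and used only for this illustration.

```r
# by fixes the increment; the sequence stops before it would pass the upper limit.
by.step <- seq(1, 10, by = 4)            # 1 5 9

# length.out fixes the number of points; both endpoints are included.
fixed.n <- seq(1, 10, length.out = 4)    # 1 4 7 10

# The same idea applied to the range in the question.
fifty <- seq(1, 10000, length.out = 50)
length(fifty)       # 50
range(fifty)        # 1 10000
diff(fifty)[1]      # the common spacing, 9999 / 49
```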

The function rep() repeats a vector some number of times. Explain the difference between rep(1:3, times=3) and rep(1:3, each=3).

rep(1:3, times=3) gives 1 2 3 1 2 3 1 2 3 (the whole vector is repeated three times, one copy after another)

and rep(1:3, each=3) gives 1 1 1 2 2 2 3 3 3 (each element is repeated three times before moving on to the next element).
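Both calls can be run side by side to confirm the two patterns. The names whole and elementwise are hypothetical, chosen only for this sketch.

```r
whole <- rep(1:3, times = 3)        # repeat the whole vector: 1 2 3 1 2 3 1 2 3
elementwise <- rep(1:3, each = 3)   # repeat each element: 1 1 1 2 2 2 3 3 3

whole
elementwise
# Same elements and same length, only the order differs.
identical(sort(whole), elementwise)   # TRUE
```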

The Binomial distribution.

The binomial distribution \(\mathrm{Bin}(m,p)\) is defined by the number of successes in \(m\) independent trials, each having probability \(p\) of success. Think of flipping a coin \(m\) times, where the coin is weighted to have probability \(p\) of landing on heads.

The R function rbinom() generates random variables with a binomial distribution. E.g.,

rbinom(n=20, size=10, prob=0.5)

produces 20 observations from \(\mathrm{Bin}(10,0.5)\).

The following generates 300 binomials composed of 15 trials each with varying probability of success: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8, storing the results in vectors called bins.draws.0.2, bins.draws.0.3, bins.draws.0.4, bins.draws.0.5, bins.draws.0.6, bins.draws.0.7 and bins.draws.0.8. The means are stored in the vector bin.draws.means.

set.seed(01202023) #for randomization; do not change

bins.draws.0.2 <- rbinom(300, size = 15, prob = 0.2)
bins.draws.0.3 <- rbinom(300, size = 15, prob = 0.3)
bins.draws.0.4 <- rbinom(300, size = 15, prob = 0.4)
bins.draws.0.5 <- rbinom(300, size = 15, prob = 0.5)
bins.draws.0.6 <- rbinom(300, size = 15, prob = 0.6)
bins.draws.0.7 <- rbinom(300, size = 15, prob = 0.7)
bins.draws.0.8 <- rbinom(300, size = 15, prob = 0.8)

bin.draws.means <- c(
  mean(bins.draws.0.2),
  mean(bins.draws.0.3),
  mean(bins.draws.0.4),
  mean(bins.draws.0.5),
  mean(bins.draws.0.6),
  mean(bins.draws.0.7),
  mean(bins.draws.0.8)
)

Create a matrix of dimension 300 x 7, called bin.matrix, whose columns contain the 7 vectors we’ve created, in order of the success probabilities of their underlying binomial distributions (0.2 through 0.8). Hint: use cbind().

bin.matrix <- cbind(bins.draws.0.2, bins.draws.0.3, bins.draws.0.4, bins.draws.0.5, bins.draws.0.6, bins.draws.0.7, bins.draws.0.8)
dim(bin.matrix) # should be 300 7

b. Print the first five rows of bin.matrix. Print the element in the 66th row and 5th column. Compute the largest element in the first column. Compute the largest element in all but the first column.

bin.matrix[1:5, ] # first five rows
bin.matrix[66, 5] # element in the 66th row and 5th column
max(bin.matrix[, 1]) # largest element in the first column
max(bin.matrix[, -1]) # largest element in all but the first column

Calculate the column means of bin.matrix by using just a single function call.

colMeans(bin.matrix) # column means

Compare the means you computed in the last question to those in bin.draws.means, in two ways. First, using ==, and second, using identical(). What do the two ways report? Are the results compatible? Explain.

colMeans(bin.matrix) == bin.draws.means

identical(colMeans(bin.matrix), bin.draws.means)

== compares the two vectors element by element, so it reports one value per column, and each should be TRUE because both sets of means are computed from the same draws. identical() returns a single TRUE or FALSE for the two objects as a whole, attributes included. Since cbind() gives the matrix column names, colMeans() returns a named vector while bin.draws.means has no names, so identical() should report FALSE. The results are compatible: the values agree, and the objects differ only in their names attribute.

e. Take the transpose of bin.matrix and then take row means. Are these the same as what you just computed? Should they be?

rowMeans(t(bin.matrix))

These are the same as the column means computed above, and they should be. Transposing turns the columns of bin.matrix into the rows of t(bin.matrix), so each row mean of the transpose averages the same 300 draws as the corresponding column mean.
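The comparisons in the last two parts can be sketched on a toy matrix small enough to check by hand. The vectors a and b and the other names below are hypothetical stand-ins for the binomial draws, not the homework data.

```r
# Toy stand-ins for two columns of draws.
a <- c(1L, 2L, 6L)
b <- c(4L, 4L, 7L)
m <- cbind(a, b)                      # cbind() names the columns "a" and "b"

col.means <- colMeans(m)              # named vector: a = 3, b = 5
plain.means <- c(mean(a), mean(b))    # same values, no names

col.means == plain.means                    # element-wise comparison: TRUE TRUE
identical(col.means, plain.means)           # FALSE: the names attribute differs
identical(unname(col.means), plain.means)   # TRUE once the names are dropped
identical(rowMeans(t(m)), col.means)        # TRUE: rows of t(m) are the columns of m
```

The element-wise test answers "are the numbers equal?", while identical() answers "are these the same object in every respect?", which is why the two can disagree without contradicting each other.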

Lastly, let’s look at memory usage. The command object.size(x) returns the number of bytes used to store the object x in your current R session. Convert bin.matrix into a list using as.list() and save the result as bin.list. Find the number of bytes used to store bin.matrix and bin.list. How many megabytes (MB) is this, for each object? Which object requires more memory, and why do you think this is the case? Remind yourself: why are lists special compared to vectors, and is this property important for the current purpose (storing the binomial draws)? Hint: look at the help page for object.size to see how to change the units to MB.

bin.list <- as.list(bin.matrix)
object.size(bin.matrix) # bytes
object.size(bin.list) # bytes
format(object.size(bin.matrix), units = "MB")
format(object.size(bin.list), units = "MB")

bin.list takes up more memory than bin.matrix. The matrix stores all of its integers in one contiguous block with a single header. The list stores each draw as a separate R object with its own header, plus a pointer to that object, so the overhead per element is far larger than the 4 bytes the integer itself needs. Lists are special because their elements can have different types; that flexibility is not needed here, since every draw is an integer, so the matrix is the better choice for storing them.
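The size comparison can be tried on a stand-in object with the same shape. This is a sketch under the assumption of a constant-filled 300 x 7 integer matrix; m and m.list are hypothetical names, and the exact byte counts depend on the R build, so none are quoted here.

```r
# Hypothetical stand-in: a 300 x 7 integer matrix filled with a constant.
m <- matrix(rep(7L, 2100), nrow = 300, ncol = 7)
m.list <- as.list(m)                  # one list element per matrix entry

length(m.list)                        # 2100
object.size(m)                        # one block of integers plus a header
object.size(m.list)                   # a pointer and a separate object per entry
format(object.size(m.list), units = "Kb")   # the same size reported in kilobytes
```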

Going big with lists

R’s capacity for data storage and computation is very large compared to what was available 10 years ago. The following code generates 5 million numbers from a \(\mathrm{Bin}(1 \times 10^6, 0.5)\) distribution and stores them in a vector called big.bin.draws.

big.bin.draws <- rbinom(n = 5e6, size = 1e6, prob = 0.5)

Create a new vector, called big.bin.draws.standardized, which is given by taking big.bin.draws, subtracting off its mean, and then dividing by its standard deviation. Calculate the mean and standard deviation of big.bin.draws.standardized. (These should be 0 and 1, respectively, or very close to it; if not, you’ve made a mistake somewhere).

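big.bin.draws.standardized <- (big.bin.draws - mean(big.bin.draws)) / sd(big.bin.draws)
mean(big.bin.draws) # mean of the raw draws

## [1] 499999.7

sd(big.bin.draws) # sd of the raw draws

## [1] 500.1553

mean(big.bin.draws.standardized) # should be very close to 0
sd(big.bin.draws.standardized) # should be very close to 1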
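Where the parentheses go matters in the standardization, because division binds more tightly than subtraction. The following sketch uses a short made-up vector x (not the homework data) to show the correct form next to the mis-parenthesized one; z and wrong are hypothetical names.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)    # made-up values for illustration

z <- (x - mean(x)) / sd(x)        # subtract the mean first, then divide by the sd
wrong <- x - mean(x) / sd(x)      # parsed as x - (mean(x) / sd(x))

mean(z)                           # very close to 0
sd(z)                             # very close to 1
mean(wrong)                       # not close to 0
```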

Convert big.bin.draws into a list using as.list() and save the result as big.bin.draws.list. Check that you indeed have a list by calling class() on the result. Check also that your list has the right length, and that its 1159th element is equal to that of big.bin.draws.

big.bin.draws.list <- as.list(big.bin.draws)
class(big.bin.draws.list)

## [1] "list"

length(big.bin.draws.list)

## [1] 5000000

big.bin.draws.list[[1159]]

## [1] 499202

big.bin.draws[1159]

## [1] 499202

Run the code below, to standardize the binomial draws in the list big.bin.draws.list. Note that lapply() applies the function supplied in the second argument to every element of the list supplied in the first argument, and then returns a list of the function outputs. (We’ll learn much more about the apply() family of functions later in the course.) Did this lapply() command take longer to evaluate than the code you wrote in part a? (It should have; otherwise your previous code could have been improved, so go back and improve it.) Why do you think this is the case?

big.bin.draws.mean = mean(big.bin.draws)
big.bin.draws.sd = sd(big.bin.draws)
standardize = function(x) {
  return((x - big.bin.draws.mean) / big.bin.draws.sd)
}
big.bin.draws.list.standardized.slow = lapply(big.bin.draws.list, standardize)

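The lapply() command did take longer to evaluate. lapply() calls the R function standardize() once for every element of the list, which means 5 million separate function calls, each with its own overhead and each producing a new object. The code in part a is vectorized: the subtraction and division are each carried out once on the whole vector, in compiled code running over a contiguous block of memory.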
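The speed difference can be measured with system.time(). The sketch below uses a smaller sample (1e5 draws instead of 5 million) and an arbitrary seed so that it runs quickly; the names draws, fast, slow, time.vec, and time.list are hypothetical, and the timings will differ from machine to machine.

```r
set.seed(1)   # arbitrary seed, only so the sketch is reproducible
draws <- rbinom(n = 1e5, size = 1e6, prob = 0.5)   # smaller n than the homework
draws.mean <- mean(draws)
draws.sd <- sd(draws)
standardize <- function(x) (x - draws.mean) / draws.sd

# Vectorized: one subtraction and one division over the whole vector.
time.vec <- system.time(fast <- (draws - draws.mean) / draws.sd)

# Element by element: one R function call per list element.
time.list <- system.time(slow <- lapply(as.list(draws), standardize))

time.vec["elapsed"]
time.list["elapsed"]
all.equal(fast, unlist(slow))   # both routes give the same numbers
```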

Find the number of bytes used to store big.bin.draws and big.bin.draws.list. How many megabytes (MB) is this, for each object? Which object requires more memory, and why do you think this is the case? Discuss any additional observations compared to part f of the previous question.

object.size(big.bin.draws)

## 20000048 bytes

object.size(big.bin.draws.list)

## 320000048 bytes

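big.bin.draws uses 20000048 bytes, about 20 MB, and big.bin.draws.list uses 320000048 bytes, about 320 MB (about 19.1 and 305.2 MB if 1 MB is taken as 1024^2 bytes). The list requires 16 times more memory. big.bin.draws is a single integer vector: 4 bytes for each of the 5 million draws in one contiguous block, plus a 48-byte header. The list instead holds a pointer to each element, and every element is a separate length-one integer vector with its own header, which works out to 64 bytes per element here. Compared with part f of the previous question, the cause is the same, but at this scale the overhead amounts to about 300 MB instead of a fraction of a megabyte.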
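The two reported sizes can be reproduced with simple arithmetic. The per-element figures below (a 48-byte header, an 8-byte pointer, and 56 bytes for a length-one integer vector) are assumptions about a 64-bit R build that happen to match the output above; the variable names are hypothetical.

```r
n <- 5e6
header <- 48                            # assumed vector header size in bytes

vec.bytes <- n * 4 + header             # 4-byte integers stored in one block
list.bytes <- n * (8 + 56) + header     # assumed 8-byte pointer + 56-byte element

vec.bytes == 20000048                   # TRUE: matches object.size(big.bin.draws)
list.bytes == 320000048                 # TRUE: matches object.size(big.bin.draws.list)
list.bytes / vec.bytes                  # about 16
c(vec.bytes, list.bytes) / 1e6          # sizes with 1 MB = 10^6 bytes
c(vec.bytes, list.bytes) / 1024^2       # sizes with 1 MB = 1024^2 bytes
```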