Name: Daniel Heller

NSHE ID: 5002428049

Collaborated with:

This lab is to be completed both within and outside class. You’re encouraged collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file on Canvas, by Sunday, September 15, 2019 by 11:59pm.

## For reproducibility --- don't change this!
  set.seed(09102018)

This week’s agenda: manipulating data objects; using built-in functions, doing numerical calculations, and basic plots; reinforcing core probabilistic ideas.

The binomial distribution

The binomial distribution \(\mathrm{Bin}(m,p)\) is defined by the number of successes in \(m\) independent trials, each have probability \(p\) of success. Think of flipping a coin \(m\) times, where the coin is weighted to have probability \(p\) of landing on heads.

The R function rbinom() generates random variables with a binomial distribution. E.g.,

rbinom(n=20, size=10, prob=0.5)

produces 20 observations from \(\mathrm{Bin}(10,0.5)\).

Some simple manipulations

1a. Generate 200 random values from the \(\mathrm{Bin}(10,0.5)\) distribution, and store them in a vector called bin_draws_0.5. Extract and display the first 10 elements. Extract and display all but the first 175 elements.

bin_draws_0.5 <- rbinom(n=200, size=10, prob=0.5)
head(bin_draws_0.5, n=10)

##  [1] 6 6 5 5 7 5 1 3 7 4

tail(bin_draws_0.5, n = 25)

##  [1] 4 5 4 4 7 5 6 6 3 5 4 2 3 6 6 3 4 6 7 6 7 4 4 9 6

1b. Add the first element of bin_draws_0.5 to the seventh. Compare the second element to the fifth, which is larger? A bit more tricky: print the indices of the elements of bin.draws.0.5 that are equal to 5. How many such elements are there? Challenge: theoretically, how many such elements would you expect there to be?

bin_draws_0.5[1] + bin_draws_0.5[7]

## [1] 7

bin_draws_0.5[2] > bin_draws_0.5[5]

## [1] FALSE

Since it returns false the fifth element is larger than the second.

ind.1b <- which(bin_draws_0.5 == 5)
ind.1b

##  [1]   3   4   6  12  17  21  26  33  37  43  45  46  52  54  55  56  58
## [18]  62  63  69  70  71  72  73  76  79  85  90  95  97  99 103 107 108
## [35] 114 119 124 127 133 134 135 137 138 142 143 148 157 160 167 170 174
## [52] 177 181 185

length(ind.1b)

## [1] 54

Theoretically we would expect about a quarter of the elements to be equal to the median number since it is a normal distribution centered around that number.

1c. Find the mean and standard deviation of bin_draws_0.5. Is the mean close what you’d expect? The standard deviation?

mean(bin_draws_0.5)

## [1] 4.925

sd(bin_draws_0.5)

## [1] 1.53334

The mean is close to what I expected since it was around 5, which is to be expected with a binomial with a 10 independent trials. As well the standard deviation is about what was expected with this distribution.

1d. Call summary() on bin_draws_0.5 and describe the result.

summary(bin_draws_0.5)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.925   6.000   9.000

The results of summary are the minimum value seen in the sampling, the first quartile is the split between the lowest 25% of the data from the next 75%. The median is the number in the middle that cuts the data in half, as expected it is 5. The third quartile is the split for the highest 25% from the rest of the data. The maximum value is the highest value that was recieved from the sample. Last the mean is the average value drawn from the sampling.

1e. Find the data type of the elements in bin_draws_0.5 using typeof(). Then convert bin_draws_0.5 to a vector of characters, storing the result as bin_draws_0.5_char, and use typeof() again to verify that you’ve done the conversion correctly. Call summary() on bin_draws_0.5_char. Is the result formatted differently from what you saw above? Why?

typeof(bin_draws_0.5)

## [1] "integer"

bin_draws_0.5_char <- as.character(bin_draws_0.5)
typeof(bin_draws_0.5_char)

## [1] "character"

summary(bin_draws_0.5_char)

##    Length     Class      Mode 
##       200 character character

The result is formatted much differently from what we saw above due to the change from integer to character vectors. This is due to the different handling of characters and integers in R. With integers R takes summary to be the summary statistics within the vector, i.e. the things most data scientists want to know like mean and quartiles. Whereas character summary is more concerned with length and character type within the vector.

Some simple plots

2a. The function plot() is a generic function in R for the visual display of data. The function hist() specifically produces a histogram display. Use hist() to produce a histogram of your random draws from the binomial distribution, stored in bin_draws_0.5.

hist(bin_draws_0.5)

2b. Call tabulate() on bin_draws_0.5. What is being shown? Does it roughly match the histogram you produced in the last question?

tabulate(bin_draws_0.5)

## [1]  4  7 22 42 54 46 16  6  3

What is being shown is the frequency of certain values, from 1-9, received from our sample of binomial(10, 0.5). This roughly matches our histogram above, as the values follow a normal curve of small tails and a large midsection.

2c. Call plot() on bin_draws_0.5 to display your random values from the binomial distribution. Can you guess what the plot() function is doing here?

plot(bin_draws_0.5)

What the plot() function is doing is plotting the values as they are drawn, so it is showing when each point is drawn, as indicated by the index, and not just the frequency or how many times each number was drawn like with hist().

2d. Call plot() with two arguments, the first being 1:200, and the second being bin_draws_0.5. This creates a scatterplot of bin_draws_0.5 (on the y-axis) versus the indices 1 through 200 (on the x-axis). Does this match your plot from the last question?

plot(1:200, bin_draws_0.5)

This plot matches the plot made in the last question, the only difference is that instead of index the x-axis has a label of 1:200, which is makes it easier to understand that it is the repititions that the binomial is going through to achieve the values shown on the graph.

More binomials, more plots

3a. Generate binomial random variables again, but now produce 7 different vectors of length 200, with probability of success equal to: 0.2, 0.3, 0.4, 0.6, 0.7, and 0.8, storing the results in vectors called bin_draws_0.2, bin_draws_0.3, bin_draws_0.4., bin_draws_0.6, bin_draws_0.7 and bin_draws_0.8. For each, compute the mean and standard deviation.

##     mean_vector sd_vector
## 0.2       2.105  1.261719
## 0.3       3.120  1.447785
## 0.4       3.835  1.539358
## 0.5       4.925  1.533340
## 0.6       6.105  1.467906
## 0.7       7.055  1.397045
## 0.8       8.125  1.255891

3b. We’d like to compare the properties of our vectors. Create a vector of length 7, whose entries are the means of the 7 vectors we’ve created, in order according to the success probabilities of their underlying binomial distributions (0.2 through 0.8).

mean_vector

## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125

3c. Using the vectors from the last part, create the following scatterplots. Explain in words, for each, what’s going on.
The 7 means versus the 7 probabilities used to generate the draws.

plot(prob, mean_vector)

What is happening is that with this graph we can see that there is a linearly increasing correlation between the probability and the means of the draws. (i.e as the probabilitiy increases so does the mean)

The standard deviations versus the probabilities.

plot(prob, sd_vector)

In this graph we can see there’s a parabolic relationship between standard deviations and probabilities, so the deviation or dispersion between data points is small around the extreme high or low probabilies (0.2 and 0.8) and becomes larger until it peaks around the middle (0.4 and 0.5).

The standard deviations versus the means.

plot(mean_vector, sd_vector)

As well in this graph there is a parabolic relationship between standard deviation and means. So much like before around the extremes (2 and 8) the deviation between data points is lower than in the middle where it peaks (4 and 5)

Challenge: for each plot, add a curve that corresponds to the relationships you’d expect to see in the theoretical population (i.e., with an infinite amount of draws, rather than just 200 draws).

Working with matrices

4a. Create a matrix of dimension 200 x 7, called bin_matrix, whose columns contain the 7 vectors we’ve created, in order of the success probabilities of their underlying binomial distributions (0.2 through 0.8).

bin_matrix <- matrix(bin_vector, nrow = 200, ncol = 7, byrow = FALSE)

4b. Print the first three rows of bin_matrix. Print the element in the 77th row and 6th column. Compute the largest element in first column. Compute the largest element in all but the first column.

head(bin_matrix, 3)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    3    4    4    6    6    8    8
## [2,]    2    4    3    6    5   10    8
## [3,]    0    1    1    5    6    5    7

bin_matrix[77,6]

## [1] 5

max_col1 = max(bin_matrix[,1])
max_col1

## [1] 6

max_col_rest = max(bin_matrix[,2:7])
max_col_rest

## [1] 10

4c. Calculate the means of each column by calling colMeans() on bin_matrix.

colMeans(bin_matrix)

## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125

4d. Compare the means you computed in the last question to those you computed in Q3b. Do they appear equal, when you print the results out to the console? Try using identical() to check that they are exactly equal. Does it fail? Why? Try using all.equal(). What does it report?

mean_vector

## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125

mean_vector2

## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125

mean_vector are the means calculated in Q3b, mean_vector2 are the means from Q4c. When the results are printed to the console they appear to be equal.

identical(mean_vector2,mean_vector)

## [1] TRUE

It does not fail because the vectors are identical.

all.equal(mean_vector2, mean_vector)

## [1] TRUE

As well the all.equal function reports true since the vectors are equal.

4e. Take the transpose of bin_matrix and then take row means. Are these the same as what you just computed? Should they be?

bin_matrix_t = t(bin_matrix)
rowMeans(bin_matrix_t)

## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125

They are the same and they should be the same as computed above since we are taking the means of the same 7 vectors.

Warm up is over, let’s go big

5a. R’s capacity for data storage and computation is very large compared to what was available 10 years ago. Generate 2 million numbers from \(\mathrm{Bin}(1 \times 10^6, 0.5)\) distribution and store them in a vector called big_bin_draws. Calculate the mean and standard deviation of this vector.

big_bin_draws <- rbinom(n=2000000, size=1*10e6, prob=0.5)
mean(big_bin_draws)

## [1] 5000001

sd(big_bin_draws)

## [1] 1581.232

5b. Create a new vector, called big_bin_draws_standardized, which is given by taking big_bin_draws, subtracting off its mean, and then dividing by its standard deviation. Calculate the mean and standard deviation of big_bin_draws_standardized. (These should be 0 and 1, respectively, or very, very close to it; if not, you’ve made a mistake somewhere).

big_bin_draws_standardized <- (big_bin_draws - mean(big_bin_draws))/sd(big_bin_draws)
mean(big_bin_draws_standardized)

## [1] -1.358527e-13

sd(big_bin_draws_standardized)

## [1] 1

## Time difference of 0.09873319 secs

5c. Plot a histogram of big_bin_draws_standardized. To increase the number of histogram bars, set the breaks argument in the hist() function (e.g., set breaks=100). What does the shape of this histogram appear to be? Is this surprising? What could explain this phenomenon? (Hint: rhymes with “Mental Gimmick Serum” …)

hist(big_bin_draws_standardized, breaks = 100)

The shape is of a normal or bell curve, this is not surprising because the Central Limit Theorem states that when independent random variables such as from a binomial distribution are added, their normalized sum tends toward a normal distribution. I.e. the larger sample size you use the more likely you are to be a normal curve once standardized.

5d. Calculate the proportion of times that an element of big_bin_draws_standardized exceeds 1.644854. Is this close to 0.05?

sum(big_bin_draws_standardized > 1.644854)/length(big_bin_draws_standardized)

## [1] 0.0501575

Yes, the proportion is 0.05.

5e. Either by simulation, or via a built-in R function, compute the probability that a standard normal random variable exceeds 1.644854. Is this close to 0.05? (Hint: for either approach, it would be helpful to look at the help file for the rnorm() function.)

nd <- rnorm(2000000)
sum((nd > 1.644854) / 2000000)

## [1] 0.0501435

This number is close to 0.05.

Now let’s go really big

6a. Let’s push R’s computational engine a little harder. Design an expression to generate 100 million numbers from \(\mathrm{Bin}(10 \times 10^6, 50 \times 10^{-8})\), to be saved in a vector called huge_bin_draws, but do not evaluate this command yet. Then ask your lab partner to name three of Justin Bieber’s songs and simultaneously evaluate your R command that defines huge_bin_draws. Which finished first, R or your partner? (Note: your partner cannot really win this game. Even if he/she finishes first, he/she still loses.)

## Time difference of 9.615378 secs

6b. Calculate the mean and standard deviation of huge_bin_draws. Are they close to what you’d expect? (They should be very close.) Did it take longer to compute these, or to generate huge_bin_draws in the first place?

mean(huge_bin_draws)

## [1] 499.9992

sd(huge_bin_draws)

## [1] 22.36202

## Time difference of 0.917093 secs

This was shorter than my original generation of huge_bin_draws. By alot.

6c. Calculate the median of huge_bin_draws. Did this median calculation take longer than the calculating the mean? Is this surprising?

median(huge_bin_draws)

## [1] 500

## Time difference of 2.009448 secs

It took longer to calculate the median than the mean and standard deviation, this is surprising to me.

6d. Calculate e to the power of the median of the logs of huge_bin_draws, in one line of code. Did this take longer than the median calculation applied to huge_bin_draws directly? Is this surprising?

exp(median(log(huge_bin_draws)))

## [1] 500

## Time difference of 6.502575 secs

This did take longer to calculate than the median calculation, this isn’t too surprising as we are asking the same question in a more complex way.

6e. Plot a histogram of of huge_bin_draws, again with a large setting of the breaks argument (e.g., breaks=100). Describe what you see; is this different from before, when we had 2 million draws? Challenge: is this surprising? What distribution is this?

hist(huge_bin_draws, breaks = 100)

This looks to be a normal distribution, so this is not very surprising when we think about the Central Limit Theorem.

Going big with lists

7a. Convert big_bin_draws into a list using as.list() and save the result as big_bin_draws_list. Check that you indeed have a list by calling class() on the result. Check also that your list has the right length, and that its 1049th element is equal to that of big_bin_draws.

big_bin_draws_list <- as.list(big_bin_draws)
class(big_bin_draws_list)

## [1] "list"

length(big_bin_draws_list)

## [1] 2000000

print(big_bin_draws_list[1049]) == print(big_bin_draws[1049])

## [[1]]
## [1] 5000638
## 
## [1] 5000638

## [1] TRUE

7b. Run the code below, to standardize the binomial draws in the list big_bin_draws_list. Note that lapply() applies the function supplied in the second argument to every element of the list supplied in the first argument, and then returns a list of the function outputs. Did this lapply() command take longer to evaluate than the code you wrote in Q5b? (It should have; otherwise your previous code could have been improved, so go back and improve it.) Why do you think this is the case?

big_bin_draws_mean = mean(big_bin_draws)
big_bin_draws_sd = sd(big_bin_draws)
standardize = function(x) {
  return((x - big_bin_draws_mean) / big_bin_draws_sd)
}
big_bin_draws_list_standardized_slow = lapply(big_bin_draws_list, standardize)

## Time difference of 1.956685 secs

This command took longer, ~1.5 seconds longer, to evaluate than Q5b. This might because it is going through each individual element whereas Q5b treated it as a whole vectore instead of going one by one through the vector.

7c. Run the code below, which again standardizes the binomial draws in the list big_bin_draws_list, using lapply(). Why is it so much slower than the code in the last question? Think about what is happening each time the function is called.

big_bin_draws_mean = mean(big_bin_draws)
big_bin_draws_sd = sd(big_bin_draws)
standardize_slow = function(x) {
  return((x - mean(big_bin_draws)) / sd(big_bin_draws))
}
big_bin_draws_list_standardized_slow = lapply(big_bin_draws_list, standardize)

## Time difference of 1.988959 secs

This code is so much slower because everytime this function is called not only are we applying the function to each element of big_bin_draws but we are also having R calculate the mean of big_bin_draws everytime the function is called.

7d. Lastly, let’s look at memory useage. The command object.size(x) returns the number of bytes used to store the object x in your current R session. Find the number of bytes used to store big_bin_draws and big_bin_draws_list. How many megabytes (MB) is this, for each object? Which object requires more memory, and why do you think this is the case? Remind yourself: why are lists special compared to vectors, and is this property important for the current purpose (storing the binomial draws)?

object.size(big_bin_draws)

## 8000048 bytes

object.size(big_bin_draws_list)

## 128000048 bytes

big_bin_draws_list uses more memory than big_bin_draws because it is a list, which is always larger since lists can have multiple different types of elements in it. Whereas atomic vector’s elements are known to have the same type. So for the current purpose a list is a bit overkill, a vector works fine.

Lab 1: R Basics

Intro to Statistical Computing, STAT 445-645

Wednesday, August 28, 2019