Name: Daniel Heller
NSHE ID: 5002428049
Collaborated with:
This lab is to be completed both within and outside class. You’re encouraged collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file on Canvas, by Sunday, September 15, 2019 by 11:59pm.
## For reproducibility --- don't change this!
set.seed(09102018)
This week’s agenda: manipulating data objects; using built-in functions, doing numerical calculations, and basic plots; reinforcing core probabilistic ideas.
The binomial distribution \(\mathrm{Bin}(m,p)\) is defined by the number of successes in \(m\) independent trials, each have probability \(p\) of success. Think of flipping a coin \(m\) times, where the coin is weighted to have probability \(p\) of landing on heads.
The R function rbinom()
generates random variables with a binomial distribution. E.g.,
rbinom(n=20, size=10, prob=0.5)
produces 20 observations from \(\mathrm{Bin}(10,0.5)\).
bin_draws_0.5
. Extract and display the first 10 elements. Extract and display all but the first 175 elements.bin_draws_0.5 <- rbinom(n=200, size=10, prob=0.5)
head(bin_draws_0.5, n=10)
## [1] 6 6 5 5 7 5 1 3 7 4
tail(bin_draws_0.5, n = 25)
## [1] 4 5 4 4 7 5 6 6 3 5 4 2 3 6 6 3 4 6 7 6 7 4 4 9 6
bin_draws_0.5
to the seventh. Compare the second element to the fifth, which is larger? A bit more tricky: print the indices of the elements of bin.draws.0.5
that are equal to 5. How many such elements are there? Challenge: theoretically, how many such elements would you expect there to be?bin_draws_0.5[1] + bin_draws_0.5[7]
## [1] 7
bin_draws_0.5[2] > bin_draws_0.5[5]
## [1] FALSE
Since it returns false the fifth element is larger than the second.
ind.1b <- which(bin_draws_0.5 == 5)
ind.1b
## [1] 3 4 6 12 17 21 26 33 37 43 45 46 52 54 55 56 58
## [18] 62 63 69 70 71 72 73 76 79 85 90 95 97 99 103 107 108
## [35] 114 119 124 127 133 134 135 137 138 142 143 148 157 160 167 170 174
## [52] 177 181 185
length(ind.1b)
## [1] 54
Theoretically we would expect about a quarter of the elements to be equal to the median number since it is a normal distribution centered around that number.
bin_draws_0.5
. Is the mean close what you’d expect? The standard deviation?mean(bin_draws_0.5)
## [1] 4.925
sd(bin_draws_0.5)
## [1] 1.53334
The mean is close to what I expected since it was around 5, which is to be expected with a binomial with a 10 independent trials. As well the standard deviation is about what was expected with this distribution.
summary()
on bin_draws_0.5
and describe the result.summary(bin_draws_0.5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 4.925 6.000 9.000
The results of summary are the minimum value seen in the sampling, the first quartile is the split between the lowest 25% of the data from the next 75%. The median is the number in the middle that cuts the data in half, as expected it is 5. The third quartile is the split for the highest 25% from the rest of the data. The maximum value is the highest value that was recieved from the sample. Last the mean is the average value drawn from the sampling.
bin_draws_0.5
using typeof()
. Then convert bin_draws_0.5
to a vector of characters, storing the result as bin_draws_0.5_char
, and use typeof()
again to verify that you’ve done the conversion correctly. Call summary()
on bin_draws_0.5_char
. Is the result formatted differently from what you saw above? Why?typeof(bin_draws_0.5)
## [1] "integer"
bin_draws_0.5_char <- as.character(bin_draws_0.5)
typeof(bin_draws_0.5_char)
## [1] "character"
summary(bin_draws_0.5_char)
## Length Class Mode
## 200 character character
The result is formatted much differently from what we saw above due to the change from integer to character vectors. This is due to the different handling of characters and integers in R. With integers R takes summary to be the summary statistics within the vector, i.e. the things most data scientists want to know like mean and quartiles. Whereas character summary is more concerned with length and character type within the vector.
plot()
is a generic function in R for the visual display of data. The function hist()
specifically produces a histogram display. Use hist()
to produce a histogram of your random draws from the binomial distribution, stored in bin_draws_0.5
.hist(bin_draws_0.5)
tabulate()
on bin_draws_0.5
. What is being shown? Does it roughly match the histogram you produced in the last question?tabulate(bin_draws_0.5)
## [1] 4 7 22 42 54 46 16 6 3
What is being shown is the frequency of certain values, from 1-9, received from our sample of binomial(10, 0.5). This roughly matches our histogram above, as the values follow a normal curve of small tails and a large midsection.
plot()
on bin_draws_0.5
to display your random values from the binomial distribution. Can you guess what the plot()
function is doing here?plot(bin_draws_0.5)
What the plot()
function is doing is plotting the values as they are drawn, so it is showing when each point is drawn, as indicated by the index, and not just the frequency or how many times each number was drawn like with hist()
.
plot()
with two arguments, the first being 1:200
, and the second being bin_draws_0.5
. This creates a scatterplot of bin_draws_0.5
(on the y-axis) versus the indices 1 through 200 (on the x-axis). Does this match your plot from the last question?plot(1:200, bin_draws_0.5)
This plot matches the plot made in the last question, the only difference is that instead of index the x-axis has a label of 1:200, which is makes it easier to understand that it is the repititions that the binomial is going through to achieve the values shown on the graph.
bin_draws_0.2
, bin_draws_0.3
, bin_draws_0.4.
, bin_draws_0.6
, bin_draws_0.7
and bin_draws_0.8
. For each, compute the mean and standard deviation.## mean_vector sd_vector
## 0.2 2.105 1.261719
## 0.3 3.120 1.447785
## 0.4 3.835 1.539358
## 0.5 4.925 1.533340
## 0.6 6.105 1.467906
## 0.7 7.055 1.397045
## 0.8 8.125 1.255891
mean_vector
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
3c. Using the vectors from the last part, create the following scatterplots. Explain in words, for each, what’s going on.
The 7 means versus the 7 probabilities used to generate the draws.
plot(prob, mean_vector)
What is happening is that with this graph we can see that there is a linearly increasing correlation between the probability and the means of the draws. (i.e as the probabilitiy increases so does the mean)
plot(prob, sd_vector)
In this graph we can see there’s a parabolic relationship between standard deviations and probabilities, so the deviation or dispersion between data points is small around the extreme high or low probabilies (0.2 and 0.8) and becomes larger until it peaks around the middle (0.4 and 0.5).
plot(mean_vector, sd_vector)
As well in this graph there is a parabolic relationship between standard deviation and means. So much like before around the extremes (2 and 8) the deviation between data points is lower than in the middle where it peaks (4 and 5)
Challenge: for each plot, add a curve that corresponds to the relationships you’d expect to see in the theoretical population (i.e., with an infinite amount of draws, rather than just 200 draws).
bin_matrix
, whose columns contain the 7 vectors we’ve created, in order of the success probabilities of their underlying binomial distributions (0.2 through 0.8).bin_matrix <- matrix(bin_vector, nrow = 200, ncol = 7, byrow = FALSE)
bin_matrix
. Print the element in the 77th row and 6th column. Compute the largest element in first column. Compute the largest element in all but the first column.head(bin_matrix, 3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 3 4 4 6 6 8 8
## [2,] 2 4 3 6 5 10 8
## [3,] 0 1 1 5 6 5 7
bin_matrix[77,6]
## [1] 5
max_col1 = max(bin_matrix[,1])
max_col1
## [1] 6
max_col_rest = max(bin_matrix[,2:7])
max_col_rest
## [1] 10
colMeans()
on bin_matrix
.colMeans(bin_matrix)
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
identical()
to check that they are exactly equal. Does it fail? Why? Try using all.equal()
. What does it report?mean_vector
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
mean_vector2
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
mean_vector are the means calculated in Q3b, mean_vector2 are the means from Q4c. When the results are printed to the console they appear to be equal.
identical(mean_vector2,mean_vector)
## [1] TRUE
It does not fail because the vectors are identical.
all.equal(mean_vector2, mean_vector)
## [1] TRUE
As well the all.equal
function reports true since the vectors are equal.
bin_matrix
and then take row means. Are these the same as what you just computed? Should they be?bin_matrix_t = t(bin_matrix)
rowMeans(bin_matrix_t)
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
They are the same and they should be the same as computed above since we are taking the means of the same 7 vectors.
big_bin_draws
. Calculate the mean and standard deviation of this vector.big_bin_draws <- rbinom(n=2000000, size=1*10e6, prob=0.5)
mean(big_bin_draws)
## [1] 5000001
sd(big_bin_draws)
## [1] 1581.232
big_bin_draws_standardized
, which is given by taking big_bin_draws
, subtracting off its mean, and then dividing by its standard deviation. Calculate the mean and standard deviation of big_bin_draws_standardized
. (These should be 0 and 1, respectively, or very, very close to it; if not, you’ve made a mistake somewhere).big_bin_draws_standardized <- (big_bin_draws - mean(big_bin_draws))/sd(big_bin_draws)
mean(big_bin_draws_standardized)
## [1] -1.358527e-13
sd(big_bin_draws_standardized)
## [1] 1
## Time difference of 0.09873319 secs
big_bin_draws_standardized
. To increase the number of histogram bars, set the breaks
argument in the hist()
function (e.g., set breaks=100
). What does the shape of this histogram appear to be? Is this surprising? What could explain this phenomenon? (Hint: rhymes with “Mental Gimmick Serum” …)hist(big_bin_draws_standardized, breaks = 100)
The shape is of a normal or bell curve, this is not surprising because the Central Limit Theorem states that when independent random variables such as from a binomial distribution are added, their normalized sum tends toward a normal distribution. I.e. the larger sample size you use the more likely you are to be a normal curve once standardized.
big_bin_draws_standardized
exceeds 1.644854. Is this close to 0.05?sum(big_bin_draws_standardized > 1.644854)/length(big_bin_draws_standardized)
## [1] 0.0501575
Yes, the proportion is 0.05.
rnorm()
function.)nd <- rnorm(2000000)
sum((nd > 1.644854) / 2000000)
## [1] 0.0501435
This number is close to 0.05.
huge_bin_draws
, but do not evaluate this command yet. Then ask your lab partner to name three of Justin Bieber’s songs and simultaneously evaluate your R command that defines huge_bin_draws
. Which finished first, R or your partner? (Note: your partner cannot really win this game. Even if he/she finishes first, he/she still loses.)## Time difference of 9.615378 secs
huge_bin_draws
. Are they close to what you’d expect? (They should be very close.) Did it take longer to compute these, or to generate huge_bin_draws
in the first place?mean(huge_bin_draws)
## [1] 499.9992
sd(huge_bin_draws)
## [1] 22.36202
## Time difference of 0.917093 secs
This was shorter than my original generation of huge_bin_draws
. By alot.
huge_bin_draws
. Did this median calculation take longer than the calculating the mean? Is this surprising?median(huge_bin_draws)
## [1] 500
## Time difference of 2.009448 secs
It took longer to calculate the median than the mean and standard deviation, this is surprising to me.
huge_bin_draws
, in one line of code. Did this take longer than the median calculation applied to huge_bin_draws
directly? Is this surprising?exp(median(log(huge_bin_draws)))
## [1] 500
## Time difference of 6.502575 secs
This did take longer to calculate than the median calculation, this isn’t too surprising as we are asking the same question in a more complex way.
huge_bin_draws
, again with a large setting of the breaks
argument (e.g., breaks=100
). Describe what you see; is this different from before, when we had 2 million draws? Challenge: is this surprising? What distribution is this?hist(huge_bin_draws, breaks = 100)
This looks to be a normal distribution, so this is not very surprising when we think about the Central Limit Theorem
.
big_bin_draws
into a list using as.list()
and save the result as big_bin_draws_list
. Check that you indeed have a list by calling class()
on the result. Check also that your list has the right length, and that its 1049th element is equal to that of big_bin_draws
.big_bin_draws_list <- as.list(big_bin_draws)
class(big_bin_draws_list)
## [1] "list"
length(big_bin_draws_list)
## [1] 2000000
print(big_bin_draws_list[1049]) == print(big_bin_draws[1049])
## [[1]]
## [1] 5000638
##
## [1] 5000638
## [1] TRUE
big_bin_draws_list
. Note that lapply()
applies the function supplied in the second argument to every element of the list supplied in the first argument, and then returns a list of the function outputs. Did this lapply()
command take longer to evaluate than the code you wrote in Q5b? (It should have; otherwise your previous code could have been improved, so go back and improve it.) Why do you think this is the case?big_bin_draws_mean = mean(big_bin_draws)
big_bin_draws_sd = sd(big_bin_draws)
standardize = function(x) {
return((x - big_bin_draws_mean) / big_bin_draws_sd)
}
big_bin_draws_list_standardized_slow = lapply(big_bin_draws_list, standardize)
## Time difference of 1.956685 secs
This command took longer, ~1.5 seconds longer, to evaluate than Q5b. This might because it is going through each individual element whereas Q5b treated it as a whole vectore instead of going one by one through the vector.
big_bin_draws_list
, using lapply()
. Why is it so much slower than the code in the last question? Think about what is happening each time the function is called.big_bin_draws_mean = mean(big_bin_draws)
big_bin_draws_sd = sd(big_bin_draws)
standardize_slow = function(x) {
return((x - mean(big_bin_draws)) / sd(big_bin_draws))
}
big_bin_draws_list_standardized_slow = lapply(big_bin_draws_list, standardize)
## Time difference of 1.988959 secs
This code is so much slower because everytime this function is called not only are we applying the function to each element of big_bin_draws
but we are also having R calculate the mean of big_bin_draws
everytime the function is called.
object.size(x)
returns the number of bytes used to store the object x
in your current R session. Find the number of bytes used to store big_bin_draws
and big_bin_draws_list
. How many megabytes (MB) is this, for each object? Which object requires more memory, and why do you think this is the case? Remind yourself: why are lists special compared to vectors, and is this property important for the current purpose (storing the binomial draws)?object.size(big_bin_draws)
## 8000048 bytes
object.size(big_bin_draws_list)
## 128000048 bytes
big_bin_draws_list
uses more memory than big_bin_draws
because it is a list, which is always larger since lists can have multiple different types of elements in it. Whereas atomic vector’s elements are known to have the same type. So for the current purpose a list is a bit overkill, a vector works fine.