Name: Daniel Heller

NSHE ID: 5002428049

Collaborated with:

This lab is to be completed both within and outside class. You’re encouraged to collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas by Sunday, September 15, 2019, at 11:59pm.

## For reproducibility --- don't change this!
set.seed(09102018)

This week’s agenda: manipulating data objects; using built-in functions, doing numerical calculations, and basic plots; reinforcing core probabilistic ideas.

The binomial distribution

The binomial distribution \(\mathrm{Bin}(m,p)\) is the distribution of the number of successes in \(m\) independent trials, each having probability \(p\) of success. Think of flipping a coin \(m\) times, where the coin is weighted to have probability \(p\) of landing on heads.

The R function rbinom() generates random variables with a binomial distribution. E.g.,

rbinom(n=20, size=10, prob=0.5)

produces 20 observations from \(\mathrm{Bin}(10,0.5)\).

Some simple manipulations

bin_draws_0.5 <- rbinom(n=200, size=10, prob=0.5)
head(bin_draws_0.5, n=10)
##  [1] 6 6 5 5 7 5 1 3 7 4
tail(bin_draws_0.5, n = 25)
##  [1] 4 5 4 4 7 5 6 6 3 5 4 2 3 6 6 3 4 6 7 6 7 4 4 9 6
bin_draws_0.5[1] + bin_draws_0.5[7]
## [1] 7
bin_draws_0.5[2] > bin_draws_0.5[5]
## [1] FALSE

Since the comparison returns FALSE, the second element is not larger than the fifth; indeed, from the output above, the fifth element (7) is larger than the second (6).

ind.1b <- which(bin_draws_0.5 == 5)
ind.1b
##  [1]   3   4   6  12  17  21  26  33  37  43  45  46  52  54  55  56  58
## [18]  62  63  69  70  71  72  73  76  79  85  90  95  97  99 103 107 108
## [35] 114 119 124 127 133 134 135 137 138 142 143 148 157 160 167 170 174
## [52] 177 181 185
length(ind.1b)
## [1] 54

Theoretically we would expect about a quarter of the elements to equal 5, since for \(\mathrm{Bin}(10,0.5)\) the probability of exactly 5 successes is \(\binom{10}{5}(0.5)^{10} \approx 0.246\); out of 200 draws that is roughly 49 elements, close to the 54 observed here.
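
A quick check of this claim with dbinom() (an addition, not part of the original lab code):

dbinom(5, size = 10, prob = 0.5)        # exact P(X = 5) for Bin(10, 0.5), about 0.246
200 * dbinom(5, size = 10, prob = 0.5)  # expected number of 5s out of 200 draws, about 49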

mean(bin_draws_0.5)
## [1] 4.925
sd(bin_draws_0.5)
## [1] 1.53334

The mean is close to what I expected, since the theoretical mean of \(\mathrm{Bin}(10,0.5)\) is \(mp = 10 \times 0.5 = 5\). The standard deviation is also about what was expected: the theoretical value is \(\sqrt{mp(1-p)} = \sqrt{2.5} \approx 1.58\).
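
For comparison, the theoretical values can be computed directly (a small added check):

10 * 0.5                     # theoretical mean m*p of Bin(10, 0.5)
sqrt(10 * 0.5 * (1 - 0.5))   # theoretical sd sqrt(m*p*(1-p)), about 1.58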

summary(bin_draws_0.5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.925   6.000   9.000

The results of summary() are: the minimum value seen in the sample; the first quartile, which separates the lowest 25% of the data from the rest; the median, the middle value that cuts the data in half (as expected, it is 5); the mean, the average of the sampled values; the third quartile, which separates the highest 25% of the data from the rest; and the maximum, the largest value received in the sample.

typeof(bin_draws_0.5)
## [1] "integer"
bin_draws_0.5_char <- as.character(bin_draws_0.5)
typeof(bin_draws_0.5_char)
## [1] "character"
summary(bin_draws_0.5_char)
##    Length     Class      Mode 
##       200 character character

The result is formatted much differently from what we saw above because of the change from an integer vector to a character vector. R handles the two types differently in summary(): for numeric (integer) vectors it reports summary statistics such as the mean and quartiles, which are the things most data scientists want to know, whereas for character vectors it only reports the length, class, and mode of the vector.

Some simple plots

hist(bin_draws_0.5)

tabulate(bin_draws_0.5)
## [1]  4  7 22 42 54 46 16  6  3

What is being shown is how many times each value from 1 to 9 was received in our sample from \(\mathrm{Bin}(10,0.5)\). This roughly matches the histogram above: the counts are small in the tails and large in the middle, giving a roughly bell-shaped pattern.
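
To make the comparison with the histogram easier, the counts can be labeled with the values they correspond to (a small addition; by default tabulate() counts occurrences of 1 up through the maximum value):

setNames(tabulate(bin_draws_0.5), seq_len(max(bin_draws_0.5)))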

plot(bin_draws_0.5)

The plot() function plots each value against its index, i.e. its position in the vector, so it shows at which draw each value occurred, rather than just how many times each value was drawn, as hist() does.

plot(1:200, bin_draws_0.5)

This plot matches the plot made in the last question; the only difference is that the x-axis is labeled 1:200 instead of Index, which makes it clearer that the x-axis is the repetition (draw number) at which each value on the y-axis was obtained.

More binomials, more plots

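The chunk that produced the output below is not shown in this knit. A minimal sketch of one way prob, bin_vector, mean_vector, and sd_vector could be built, assuming 200 draws from \(\mathrm{Bin}(10,p)\) for each probability from 0.2 to 0.8 (in the actual lab the p = 0.5 column is presumably the bin_draws_0.5 vector from above):

prob <- seq(0.2, 0.8, by = 0.1)
draws <- lapply(prob, function(p) rbinom(n = 200, size = 10, prob = p))
bin_vector  <- unlist(draws)        # all 7 x 200 draws, column by column (used for bin_matrix later)
mean_vector <- sapply(draws, mean)  # one mean per probability
sd_vector   <- sapply(draws, sd)    # one sd per probability
tab <- cbind(mean_vector, sd_vector)
rownames(tab) <- prob
tab
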
##     mean_vector sd_vector
## 0.2       2.105  1.261719
## 0.3       3.120  1.447785
## 0.4       3.835  1.539358
## 0.5       4.925  1.533340
## 0.6       6.105  1.467906
## 0.7       7.055  1.397045
## 0.8       8.125  1.255891
mean_vector
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
plot(prob, mean_vector)

This graph shows a linearly increasing relationship between the probability and the mean of the draws (as the probability increases, so does the mean), consistent with the theoretical mean \(10p\).

plot(prob, sd_vector)

In this graph we can see a parabola-like relationship between the standard deviations and the probabilities: the spread of the draws is smallest near the extreme probabilities (0.2 and 0.8) and grows toward the middle, peaking near 0.5 (here the sample values peak around 0.4 and 0.5).

plot(mean_vector, sd_vector)

Similarly, in this graph there is a parabola-like relationship between the standard deviation and the mean: near the extreme means (around 2 and 8) the spread is lower than in the middle, where it peaks (around means of 4 and 5).

Challenge: for each plot, add a curve that corresponds to the relationships you’d expect to see in the theoretical population (i.e., with an infinite amount of draws, rather than just 200 draws).
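
One way to attempt the challenge (a sketch): for \(\mathrm{Bin}(10,p)\) the theoretical mean is \(10p\) and the theoretical standard deviation is \(\sqrt{10p(1-p)}\), so the corresponding curves can be overlaid with curve():

plot(prob, mean_vector)
curve(10 * x, add = TRUE)                   # theoretical mean of Bin(10, p)

plot(prob, sd_vector)
curve(sqrt(10 * x * (1 - x)), add = TRUE)   # theoretical sd of Bin(10, p)

plot(mean_vector, sd_vector)
curve(sqrt(x * (10 - x) / 10), add = TRUE)  # theoretical sd as a function of the mean (mean = 10p)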

Working with matrices

bin_matrix <- matrix(bin_vector, nrow = 200, ncol = 7, byrow = FALSE)
head(bin_matrix, 3)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    3    4    4    6    6    8    8
## [2,]    2    4    3    6    5   10    8
## [3,]    0    1    1    5    6    5    7
bin_matrix[77,6]
## [1] 5
max_col1 = max(bin_matrix[,1])
max_col1
## [1] 6
max_col_rest = max(bin_matrix[,2:7])
max_col_rest
## [1] 10
colMeans(bin_matrix)
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
mean_vector
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125
mean_vector2
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125

mean_vector holds the means calculated in Q3b, and mean_vector2 holds the means from Q4c. When printed to the console they appear to be equal.

identical(mean_vector2,mean_vector)
## [1] TRUE

The check does not fail: identical() returns TRUE because the two vectors are exactly equal, element for element.

all.equal(mean_vector2, mean_vector)
## [1] TRUE

Likewise, the all.equal() function reports TRUE, since the vectors are equal (here exactly, not merely up to numerical tolerance).
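
The distinction between the two functions shows up when floating-point round-off is involved; a small illustration (not part of the lab):

identical(0.1 + 0.2, 0.3)   # FALSE: identical() demands exact, bit-for-bit equality
all.equal(0.1 + 0.2, 0.3)   # TRUE: all.equal() allows a small numerical tolerance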

bin_matrix_t = t(bin_matrix)
rowMeans(bin_matrix_t)
## [1] 2.105 3.120 3.835 4.925 6.105 7.055 8.125

They are the same, as they should be: rowMeans() of the transposed matrix computes the means of the same seven sets of 200 draws that colMeans() used above.
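
A direct check of that claim:

all.equal(rowMeans(bin_matrix_t), colMeans(bin_matrix))   # should be TRUE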

Warm up is over, let’s go big

big_bin_draws <- rbinom(n=2000000, size=1*10e6, prob=0.5)
mean(big_bin_draws)
## [1] 5000001
sd(big_bin_draws)
## [1] 1581.232
big_bin_draws_standardized <- (big_bin_draws - mean(big_bin_draws))/sd(big_bin_draws)
mean(big_bin_draws_standardized)
## [1] -1.358527e-13
sd(big_bin_draws_standardized)
## [1] 1
## Time difference of 0.09873319 secs
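
The "Time difference" lines in this knit come from timing the chunks; the code that produced them is not shown. A minimal sketch of the pattern, assuming Sys.time() was used (the difference of two Sys.time() values prints exactly as "Time difference of ... secs"), applied to the standardization step above:

start_time <- Sys.time()
big_bin_draws_standardized <- (big_bin_draws - mean(big_bin_draws)) / sd(big_bin_draws)
end_time <- Sys.time()
end_time - start_time   # a difftime object, printed as "Time difference of ... secs"
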
hist(big_bin_draws_standardized, breaks = 100)

The shape is that of a normal (bell) curve. This is not surprising: each binomial draw is the sum of many independent Bernoulli trials, and the Central Limit Theorem says that such a sum, once standardized, tends toward a standard normal distribution. In other words, the larger the number of trials, the closer the standardized distribution is to a normal curve.
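
To see the agreement with the standard normal directly, its density can be overlaid (an added sketch; probability = TRUE rescales the histogram to the density scale):

hist(big_bin_draws_standardized, breaks = 100, probability = TRUE)
curve(dnorm(x), add = TRUE)   # standard normal density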

sum(big_bin_draws_standardized > 1.644854)/length(big_bin_draws_standardized)
## [1] 0.0501575

Yes, the proportion is approximately 0.05 (about 0.0502 here), which matches the upper-tail probability \(P(Z > 1.645) \approx 0.05\) for a standard normal.
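
For reference, the exact standard normal tail probability:

1 - pnorm(1.644854)   # about 0.05; 1.644854 is the upper 5% point of N(0, 1)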

nd <- rnorm(2000000)
sum(nd > 1.644854) / 2000000
## [1] 0.0501435

This number is close to 0.05.

Now let’s go really big

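The chunk that generated huge_bin_draws (whose running time is reported below) is not shown in this knit. A sketch of the likely pattern, with placeholder parameters chosen only for illustration (the actual n, size, and prob used are not visible here):

start_time <- Sys.time()
huge_bin_draws <- rbinom(n = 1e8, size = 1e6, prob = 5e-4)   # placeholder parameters, not the lab's actual values
end_time <- Sys.time()
end_time - start_time
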
## Time difference of 9.615378 secs
mean(huge_bin_draws)
## [1] 499.9992
sd(huge_bin_draws)
## [1] 22.36202
## Time difference of 0.917093 secs

This was much shorter than the original generation of huge_bin_draws: roughly 0.9 seconds versus 9.6 seconds.

median(huge_bin_draws)
## [1] 500
## Time difference of 2.009448 secs

It took longer to calculate the median than the mean and the standard deviation, which surprised me at first; on reflection, the median requires (at least partially) sorting the vector, while the mean and standard deviation only need simple passes through it.

exp(median(log(huge_bin_draws)))
## [1] 500
## Time difference of 6.502575 secs

This did take longer to calculate than the plain median, which isn't too surprising: we are asking essentially the same question in a more expensive way, since the code first takes log() of every element, finds the median, and then exponentiates, and here (with log() being monotone) that yields the same value of 500.

hist(huge_bin_draws, breaks = 100)

This looks like a normal distribution, which is not very surprising in light of the Central Limit Theorem: each draw of huge_bin_draws is a sum of a very large number of independent Bernoulli trials, so it is approximately normally distributed.

Going big with lists

big_bin_draws_list <- as.list(big_bin_draws)
class(big_bin_draws_list)
## [1] "list"
length(big_bin_draws_list)
## [1] 2000000
print(big_bin_draws_list[1049]) == print(big_bin_draws[1049])
## [[1]]
## [1] 5000638
## 
## [1] 5000638
## [1] TRUE
big_bin_draws_mean = mean(big_bin_draws)
big_bin_draws_sd = sd(big_bin_draws)
standardize = function(x) {
  return((x - big_bin_draws_mean) / big_bin_draws_sd)
}
big_bin_draws_list_standardized_slow = lapply(big_bin_draws_list, standardize)
## Time difference of 1.956685 secs

This command took considerably longer to evaluate than Q5b (nearly 2 seconds versus about 0.1 seconds). This is likely because lapply() calls the function once for each individual element, whereas Q5b operated on the whole vector at once with a single vectorized computation.
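
A compact way to see the vectorized-versus-elementwise difference is to time both approaches with system.time() (an added comparison, not part of the lab's timing code):

system.time((big_bin_draws - big_bin_draws_mean) / big_bin_draws_sd)   # one vectorized pass over the vector
system.time(lapply(big_bin_draws_list, standardize))                   # one R-level function call per element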

big_bin_draws_mean = mean(big_bin_draws)
big_bin_draws_sd = sd(big_bin_draws)
standardize_slow = function(x) {
  return((x - mean(big_bin_draws)) / sd(big_bin_draws))
}
big_bin_draws_list_standardized_slow = lapply(big_bin_draws_list, standardize_slow)
## Time difference of 1.988959 secs

This code is so much slower because every time the function is called, R not only applies the standardization to that element of big_bin_draws_list but also recomputes mean(big_bin_draws) and sd(big_bin_draws) over all two million draws.

object.size(big_bin_draws)
## 8000048 bytes
object.size(big_bin_draws_list)
## 128000048 bytes

big_bin_draws_list uses more memory than big_bin_draws because it is a list: each of its 2,000,000 elements is stored as a separate length-one R vector with its own object header (about 64 bytes per element here), whereas the atomic integer vector stores its elements in one contiguous block (about 4 bytes per element). Lists are useful because they can hold elements of different types, but since every element here has the same type, a list is overkill for this purpose; a vector works fine.
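
The per-element overhead can be seen directly (a small added illustration):

as.numeric(object.size(big_bin_draws)) / length(big_bin_draws)            # about 4 bytes per element (integer vector)
as.numeric(object.size(big_bin_draws_list)) / length(big_bin_draws_list)  # about 64 bytes per element (list of length-one vectors)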