This lab is adapted from labs given to students in similar courses at the College of Staten Island and the University of California, Los Angeles.
Please get started with these basic administrative steps:
.md file
with the same name. Please knit now rather than waiting until submission time. Then you can troubleshoot problems and look at the printout as you work on your code.
We’ll use the tidyverse package for visualization. Everything else is basic R commands. You will also be simulating data using a binomial distribution.
Load the packages by running the following:
library(tidyverse)
Instructions for code: There are R chunks where you should write your code. For some exercises, you might save your answer as a particular variable. For example, we might give you a code chunk that looks like this:
set.seed(4291)
# insert code here save the median of your simulated data as
# 'medx'
And you might complete it like so:
set.seed(4291)
# insert code here save the median of your simulated data as
# 'medx'
x <- rnorm(1000)
medx <- median(x)
medx
## [1] 0.01612433
It is a good idea to put the variable name at the bottom so it prints (assuming it's not a huge object); usually this is already part of the provided code. It also helps you check your work.
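The call to set.seed() in the example chunk is what makes the printed value reproducible from one knit to the next. A minimal sketch of that behavior, using the same seed as the example (the names `first` and `second` are illustrative only):

```r
# Resetting the seed before each draw reproduces the same random numbers.
set.seed(4291)
first <- rnorm(5)

set.seed(4291)
second <- rnorm(5)

identical(first, second)  # TRUE: same seed, same draws
```

If the seed is changed or omitted, the simulated values (and any summary computed from them) will generally differ between runs.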
Of note: Sometimes an exercise will ask for code AND pose a question. If the answer to the question is not an output of the code, you must answer it separately in a non-code text box. For example, the problem might ask you to make a plot and describe its prominent features. You would write the code to make the plot, and also write a sentence or two outside of the code block (plain text) to describe the features of the plot.
Submission: You must submit both the PDF and .Rmd to your submission folder on Google Drive by the due date and time.
Suppose that on a certain day in a certain large class (500 students!) the instructor gives a pop quiz and no one is prepared because no one has been studying.
Suppose also that everyone showed up for class that day.
This means that students taking the test have no choice but to guess their answers randomly and independently because they don’t know what they’re doing.
The quiz is given as a multiple choice test. There are 20 questions and 4 choices for each question.
1a. What represents a success in this scenario? What is the probability of that success?
Your non-coding answer:
# A success in this scenario is guessing a single question correctly. Each question has 4 choices and the student guesses at random, so the probability of success is 1/4 = 0.25.
1b. There are two primary parameters of a binomial distribution, n = # of trials (sample size) and p = probability of a success. Identify each of those parameters for this scenario.
# "n" represents the total number of questions on the quiz, which is 20.
# "p" represents the probability of guessing a question correctly, which is 0.25.
1c. There is sometimes a third parameter, which represents the number of observations or times that an experiment is repeated. Identify that parameter for this scenario.
# This third parameter is the number of students taking the quiz, which is 500.
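As a rough sketch of how these three parameters map onto R's rbinom() arguments (the names `n_questions`, `p_correct`, and `n_students` are illustrative and not part of the lab's provided code). Note that the first argument of rbinom(), which R itself calls `n`, is the number of observations, while the number of trials goes in `size`:

```r
# Illustrative names; the lab itself uses n, p, and obs.
n_questions <- 20     # trials per student (questions on the quiz)
p_correct   <- 1 / 4  # chance of guessing one question correctly
n_students  <- 500    # number of repetitions (students taking the quiz)

set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible
scores <- rbinom(n_students, size = n_questions, prob = p_correct)

length(scores)  # one simulated score per student
range(scores)   # every score lies between 0 and n_questions
```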
2a. Produce a graph and numerical summary (table) of the theoretical binomial distribution for this scenario.
n <- 20
p <- 0.25
success <- 0:n
options(scipen = 999)
table <- data.frame(success, prob = dbinom(success, n, p))
print(table)
## success prob
## 1 0 0.0031712119389339963
## 2 1 0.0211414129262266146
## 3 2 0.0669478075997176625
## 4 3 0.1338956151994353250
## 5 4 0.1896854548658665485
## 6 5 0.2023311518569244072
## 7 6 0.1686092932141036449
## 8 7 0.1124061954760691984
## 9 8 0.0608866892162040069
## 10 9 0.0270607507627574212
## 11 10 0.0099222752796776989
## 12 11 0.0030067500847508173
## 13 12 0.0007516875211877076
## 14 13 0.0001541923120385035
## 15 14 0.0000256987186730839
## 16 15 0.0000034264958230779
## 17 16 0.0000003569266482373
## 18 17 0.0000000279942469206
## 19 18 0.0000000015552359400
## 20 19 0.0000000000545696821
## 21 20 0.0000000000009094947
plot <- table |>
  ggplot(aes(success, prob)) +
  geom_col(fill = "#2E5A88") +
  theme_minimal() +
  labs(x = "Questions Correctly Answered",
       y = "Probability",
       title = "Binomial Distribution of Probability of Getting Answers Right on a 20 Question Quiz")
plot
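A quick sanity check on a table like this one is to confirm that the probabilities sum to 1 and that the mean and variance match the binomial formulas n * p and n * p * (1 - p). The sketch below recomputes the probabilities in base R; the names `total_prob`, `mean_score`, and `var_score` are illustrative:

```r
n <- 20
p <- 0.25
success <- 0:n
prob <- dbinom(success, size = n, prob = p)

total_prob <- sum(prob)                             # should be 1 up to rounding
mean_score <- sum(success * prob)                   # expected value, n * p
var_score  <- sum((success - mean_score)^2 * prob)  # variance, n * p * (1 - p)

all.equal(total_prob, 1)
all.equal(mean_score, n * p)
all.equal(var_score, n * p * (1 - p))
```

The expected score of n * p = 5 lines up with the tallest bar in the plot.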
2b. Suppose that a student must answer at least 10 questions correctly in order to pass this quiz. What is the theoretical probability of guessing at least 10 answers correctly?
Your non-coding answer about reasoning for what R binomial function to use, what target value you will use, and whether you will use a left or right tail.
# pbinom is the appropriate function because we want a cumulative probability: the chance of getting 10 or more answers correct, P(X >= 10). This is a right-tail probability. Because pbinom(q, n, p, lower.tail = FALSE) returns P(X > q), which excludes q itself, the target value passed to the function must be 9 rather than 10.
2c. Use R’s built-in binomial functions to calculate this probability, and then create a color-coded visualization of that probability.
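The tail reasoning can be checked numerically. The sketch below computes P(X >= 10) three equivalent ways and, for contrast, shows what happens when 10 itself is used as the target with lower.tail = FALSE (the variable names are illustrative):

```r
n <- 20
p <- 0.25

# P(X >= 10) computed three equivalent ways
upper_tail <- pbinom(9, size = n, prob = p, lower.tail = FALSE)  # P(X > 9)
complement <- 1 - pbinom(9, size = n, prob = p)                  # 1 - P(X <= 9)
direct_sum <- sum(dbinom(10:n, size = n, prob = p))              # sum of P(X = k), k = 10..20

# For contrast: a target of 10 gives P(X > 10), which drops the P(X = 10) term
strictly_more <- pbinom(10, size = n, prob = p, lower.tail = FALSE)

c(upper_tail, complement, direct_sum, strictly_more)
```

The first three values agree, and the fourth is smaller by exactly dbinom(10, n, p).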
n <- 20
p <- 0.25
z <- 10
# lower.tail = FALSE returns P(X > q), so use z - 1 to include z itself: P(X >= z)
probability <- pbinom(z - 1, n, p, lower.tail = FALSE)
print(probability)
## [1] 0.01386442
plot2 <- table |>
  ggplot(aes(success, prob, fill = success >= z)) +
  geom_col(color = "#2E5A88") +
  scale_fill_manual(values = c("#FFF5EE", "#2E5A88")) +
  theme_minimal() +
  labs(x = "# of Successes",
       y = "Probability",
       title = "Theoretical Binomial Distribution")
plot2
3a. Visualize the simulated data as a binomial distribution. Color code the threshold value for passing.
set.seed(10)
obs <- 500
n <- 20
p <- 0.25
expected <- n*p
binomial <- rbinom(obs, n, p)
binomial <- as.data.frame(binomial)
names(binomial) <- c('x_sim')
plot3 <- binomial |>
  ggplot(aes(x = x_sim,
             y = after_stat(count / sum(count)),
             fill = x_sim == expected)) +
  geom_histogram(binwidth = 0.5, color = "#2E5A88") +
  theme_minimal() +
  scale_fill_manual(values = c("#2E5A88", "#FFF5EE")) +
  labs(x = "# of Successes",
       y = "Proportion",
       title = "500 Samples of b(20, 0.25)")
plot3
3b. Using the simulated data, calculate the proportion of students who passed the exam AND the number who passed the exam. Do all of these calculations as code (pass the value of one response to another).
# Count the students who passed, then reuse that count for both proportions
table2 <- binomial |>
  summarize(num_passed = sum(x_sim >= z),
            prop_passed = num_passed / n(),
            prop_failed = 1 - prop_passed)
table2
3c. Create a color coded visualization of the proportion of students who passed and the proportion who failed.
plot4 <- binomial |>
  ggplot(aes(x = x_sim,
             y = after_stat(count / sum(count)),
             fill = x_sim >= z)) +
  geom_histogram(binwidth = 0.5, color = "#2E5A88") +
  scale_fill_manual(values = c("#2E5A88", "#FFF5EE")) +
  theme_minimal() +
  labs(x = "# of Successes",
       y = "Proportion",
       title = "Proportion of Students Who Passed and Failed")
plot4
3d. How does the empirical proportion of students who passed compare with the theoretical one?
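# The empirical proportion is obtained by simulating the quiz results for 500 students and counting how many of them actually passed. It is an estimate, so it changes with the random seed.
# The theoretical passing probability is P(X >= 10) for a binomial distribution with n = 20 and p = 0.25. This value is fixed by the assumptions of the model. With 500 simulated students the two values should be close but not identical, and the gap should shrink as the number of simulated students grows.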
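One way to see how an empirical proportion relates to the theoretical probability is to repeat the simulation with more students. The sketch below is only an illustration: the seed and the `class_sizes` values are arbitrary assumptions, not part of the lab, and the exact numbers it prints depend on that seed.

```r
set.seed(2024)  # arbitrary seed, not the one used in the lab
n <- 20
p <- 0.25
theoretical <- pbinom(9, n, p, lower.tail = FALSE)  # P(X >= 10)

# Hypothetical numbers of simulated students, from the lab's 500 upward
class_sizes <- c(500, 5000, 50000, 1e6)

# Empirical pass rate for each class size: share of simulated scores >= 10
empirical <- sapply(class_sizes, function(obs) mean(rbinom(obs, n, p) >= 10))

data.frame(class_sizes,
           empirical,
           abs_error = abs(empirical - theoretical))
```

With only 500 students, a pass rate near 1.4% corresponds to roughly 7 students, so a difference of a few students is enough to move the empirical proportion noticeably away from the theoretical one.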