This lab is adapted from labs given to students in similar courses at the College of Staten Island and the University of California, Los Angeles.
Please get started with these basic administrative steps:
.md file with the same name.
Please knit now rather than waiting until submission time. Then you can troubleshoot problems and look at the printout as you work on your code.
We’ll use the tidyverse package for visualization. Everything else is basic R commands. You will also be simulating data using a binomial distribution.
Load the packages by running the following:
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.1
library(ggplot2)
Instructions for code: There are R chunks where you should write your code. For some exercises, you might save your answer as a particular variable. For example, we might give you a code chunk that looks like this:
set.seed(4291)
# insert code here save the median of your simulated data as
# 'medx'
dbinom(16,20,.25)
## [1] 3.569266e-07
And you might complete it like so:
set.seed(4291)
# insert code here save the median of your simulated data as
# 'medx'
x <- rnorm(1000)
medx <- median(x)
medx
## [1] 0.01612433
It is a good idea to put the variable name at the bottom so it prints (assuming it's not a huge object), and usually this should already be part of the provided code. It also helps you check your work.
Of note: Sometimes an exercise will ask for code AND pose a question. If the answer to the question is not an output of the code, you must answer it separately in a non-code text box. For example, the problem might ask you to make a plot and describe its prominent features. You would write the code to make the plot, and also write a sentence or two outside of the code block (in plain text) to describe the features of the plot.
Submission: You must submit both the PDF and .Rmd to your submission folder on Google Drive by the due date and time.
Suppose that on a certain day in a certain large class (500 students!) the instructor gives a pop quiz and no one is prepared because no one has been studying.
Suppose also that everyone showed up for class that day.
This means that students taking the test have no choice but to guess their answers randomly and independently because they don’t know what they’re doing.
The quiz is given as a multiple choice test. There are 20 questions and 4 choices for each question.
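The setup above matches a binomial model: each of the 20 questions is an independent guess with a 1-in-4 chance of being correct. As a sketch (the symbol X, standing for one student's number of correct answers, is introduced here for illustration and is not part of the original lab text), the model, its expected value, and its standard deviation can be written as:

```latex
X \sim \mathrm{Binomial}(n = 20,\ p = 0.25), \qquad
P(X = k) = \binom{20}{k} (0.25)^{k} (0.75)^{20-k}, \quad k = 0, 1, \dots, 20

E[X] = np = 20 \times 0.25 = 5, \qquad
\mathrm{SD}(X) = \sqrt{np(1-p)} = \sqrt{3.75} \approx 1.94
```

Under this model a purely guessing student would average about 5 correct answers, so scores far above that sit in the right tail of the distribution.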
Your non-coding answer:
print("I believe that a 16 out of 20 or an 80% is a success. The probability of that success is 3.569266e-07")
## [1] "I believe that a 16 out of 20 or an 80% is a success. The probability of that success is 3.569266e-07"
print("n = 20 and p = 0.25")
## [1] "n = 20 and p = 0.25"
print("it would be 500 because that is the population size")
## [1] "it would be 500 because that is the population size"
size <- 20
p <- .25
x_vals <- 0:size
probabilities <- dbinom(x_vals,size,p)
binom_dist <- data.frame(Correct = x_vals, Probability = probabilities)
ggplot(binom_dist, aes(x = Correct, y = Probability)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Binomial Distribution of Correct Answers (n = 20, p = 0.25)",
x = "Number of Correct Answers",
y = "Probability") +
theme_minimal()
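As a quick check on the plotted distribution, the sketch below rebuilds the same probability table and confirms two things: that the probabilities sum to 1, and where the peak of the distribution sits. The names `pmf_table` and `most_likely` are illustrative and not part of the lab.

```r
# Rebuild the PMF table for 20 questions with a 0.25 chance per guess
size <- 20
p <- 0.25
pmf_table <- data.frame(Correct = 0:size,
                        Probability = dbinom(0:size, size, p))

# A valid PMF sums to 1 across all possible outcomes
sum(pmf_table$Probability)

# The most likely score is the outcome with the largest probability
most_likely <- pmf_table$Correct[which.max(pmf_table$Probability)]
most_likely
```

The peak should line up with the expected value of 5 correct answers, which is also where the tallest bar appears in the plot.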
Your non-coding answer about reasoning for what R binomial function to use, what target value you will use, and whether you will use a left or right tail.
I will use pbinom with a target value of 10. I will use a right tail because the event of interest is getting at least 10 correct answers, which is the right side of the distribution (X >= 10).
size <- 20
p <- .25
target_value <- 10
prob10 <- 1 - pbinom(target_value - 1,size,p)
x_values <- 0:size
probabilities <- dbinom(x_values, size = size, prob = p)
binom_dist <- data.frame(Correct_Answers = x_values, Probability = probabilities)
binom_dist$Color <- ifelse(binom_dist$Correct_Answers >= target_value, "red", "gray")
ggplot(binom_dist, aes(x = Correct_Answers, y = Probability, fill = Color)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_identity() +
labs(title = "Probability of Getting at Least 10 Correct Answers",
x = "Number of Correct Answers",
y = "Probability") +
theme_minimal()
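The chunk above stores the right-tail probability in `prob10` but never prints it. The sketch below shows three equivalent ways to get P(X >= 10) in base R; `tail_complement`, `tail_upper`, and `tail_sum` are illustrative names, not part of the lab.

```r
size <- 20
p <- 0.25
target_value <- 10

# 1. Complement of the left tail: P(X >= 10) = 1 - P(X <= 9)
tail_complement <- 1 - pbinom(target_value - 1, size, p)

# 2. Upper tail directly: lower.tail = FALSE returns P(X > 9)
tail_upper <- pbinom(target_value - 1, size, p, lower.tail = FALSE)

# 3. Sum the PMF over the outcomes 10, 11, ..., 20
tail_sum <- sum(dbinom(target_value:size, size, p))

c(tail_complement, tail_upper, tail_sum)
```

All three should agree at roughly 0.014. The `target_value - 1` matters: `pbinom(10, size, p)` includes the outcome 10 in the left tail, so `1 - pbinom(10, size, p)` would give P(X > 10) and leave out the students who score exactly 10.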
set.seed(10)
n_students <- 500
n_questions <- 20
p_correct <- 0.25
pass_threshold <- 10
x_sim <- rbinom(n_students, size = n_questions, prob = p_correct)
sim_results <- data.frame(Student_ID = 1:n_students, Correct_Answers = x_sim)
head(sim_results)
ggplot(sim_results, aes(x = Correct_Answers)) +
geom_histogram(binwidth = 1, fill = "gray", color = "black") +
# linewidth replaces the size aesthetic for lines, which was deprecated in ggplot2 3.4.0
geom_vline(xintercept = pass_threshold, color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Simulated Test Results for 500 Students",
x = "Number of Correct Answers",
y = "Number of Students") +
annotate("text", x = pass_threshold + 0.5, y = max(table(sim_results$Correct_Answers)) * 0.8,
label = paste("Pass Threshold:", pass_threshold), color = "red") +
theme_minimal()
pass_threshold <- 10
n_passed <- sum(sim_results$Correct_Answers >= pass_threshold)
prop_passed <- n_passed / nrow(sim_results)
cat("Number of students who passed:", n_passed, "\n")
## Number of students who passed: 10
cat("Proportion of students who passed:", prop_passed, "\n")
## Proportion of students who passed: 0.02
n_failed <- nrow(sim_results) - n_passed
pass_fail_df <- data.frame(
Status = c("Passed", "Failed"),
Count = c(n_passed, n_failed)
)
pass_fail_df$Proportion <- pass_fail_df$Count / nrow(sim_results)
ggplot(pass_fail_df, aes(x = Status, y = Proportion, fill = Status)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("Passed" = "green", "Failed" = "red")) +
labs(title = "Proportion of Students Who Passed and Failed",
x = "Exam Outcome",
y = "Proportion of Students") +
theme_minimal()
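The empirical and theoretical results are very similar. In the simulation, 10 of the 500 students passed (a proportion of 0.02), which is close to the theoretical right-tail probability P(X >= 10). The small difference comes from random sampling variability: the theoretical distribution gives the exact probability, while the simulated proportion changes from one random sample to the next.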
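To make the comparison concrete, the sketch below puts the simulated pass proportion next to the theoretical probability and estimates how much a proportion based on 500 students would typically vary. It reuses the seed and settings from the simulation chunk; `theoretical`, `empirical`, and `std_error` are illustrative names, not part of the lab.

```r
set.seed(10)
n_students <- 500
n_questions <- 20
p_correct <- 0.25
pass_threshold <- 10

# Theoretical probability of passing: P(X >= 10)
theoretical <- pbinom(pass_threshold - 1, n_questions, p_correct,
                      lower.tail = FALSE)

# Empirical proportion from one simulated class of 500 students
x_sim <- rbinom(n_students, size = n_questions, prob = p_correct)
empirical <- mean(x_sim >= pass_threshold)

# Typical sampling variability of a proportion based on 500 students
std_error <- sqrt(theoretical * (1 - theoretical) / n_students)

c(theoretical = theoretical, empirical = empirical, std_error = std_error)
```

With a standard error of roughly 0.005, the simulated proportion of 0.02 reported above is within about one to two standard errors of the theoretical value, which is what ordinary sampling variability would produce.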
EXTRA CREDIT/OPTIONAL (5 points). This is a more challenging problem in terms of logic. Think carefully about how probabilities are represented in the probability mass function (PMF) graph and table.
Suppose that the instructor for this class gives another pop quiz the following week and 80% of students pass it.
#Insert code here
*Add text here*
*Add text here*