This lab is adapted from labs given to students in similar courses at the College of Staten Island and the University of California, Los Angeles.

Getting started

Please get started with these basic administrative steps:

Update the YAML, changing the author name to your name.
Save the file with your last name, using “Save as” and substituting your last name for YOURNAME.
Knit the document. Make sure it compiles without errors. The output will be a markdown (.md) file with the same name.

Please knit now rather than waiting until submission time. Then you can troubleshoot problems and look at the printout as you work on your code.

Packages

We’ll use the tidyverse package for visualization. Everything else uses base R commands. You will also be simulating data using a binomial distribution.

Load the packages by running the following:

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.4.1

library(ggplot2)

Submitting the lab

Instructions for code: There are R chunks where you should write your code. For some exercises, you might save your answer as a particular variable. For example, we might give you a code chunk that looks like this:

set.seed(4291)
# insert code here save the median of your simulated data as 
# 'medx'
dbinom(16,20,.25)

## [1] 3.569266e-07

And you might complete it like so:

set.seed(4291)
# insert code here save the median of your simulated data as 
# 'medx'
x <- rnorm(1000)
medx <- median(x)
medx

## [1] 0.01612433

It is a good idea to put the variable name at the bottom so it prints (assuming it’s not a huge object); usually this will already be part of the provided code. It also helps you check your work.

Of note: Sometimes an exercise will ask for code AND pose a question. If the answer to the question is not an output of the code, you must answer it separately in a non-code text box. For example, the problem might ask you to make a plot and describe its prominent features. You would write the code to make the plot, and also write a sentence or two outside of the code block (plain text) to describe the features of the plot.

Submission: You must submit both the PDF and .Rmd to your submission folder on Google Drive by the due date and time.

Scenario for the Week 3 Lab Exercise

Suppose that on a certain day in a certain large class (500 students!) the instructor gives a pop quiz and no one is prepared because no one has been studying.

Suppose also that everyone showed up for class that day.

This means that students taking the test have no choice but to guess their answers randomly and independently because they don’t know what they’re doing.

The quiz is given as a multiple choice test. There are 20 questions and 4 choices for each question.
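
To make the guessing model concrete, the sketch below simulates one hypothetical student guessing on all 20 questions. The answer key, the seed, and the names `answer_key`, `guesses`, and `n_correct` are illustrative assumptions and are not part of the lab.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible
n_questions <- 20
n_choices <- 4

# Hypothetical answer key: the correct choice (1 to 4) for each question
answer_key <- sample(1:n_choices, n_questions, replace = TRUE)

# One student guesses uniformly at random, independently on every question
guesses <- sample(1:n_choices, n_questions, replace = TRUE)

# Each question is one trial; a match with the key is a correct guess
n_correct <- sum(guesses == answer_key)
n_correct
```

Each comparison `guesses == answer_key` is a single yes/no trial with a 1-in-4 chance of being TRUE, which is why the total number of correct answers can be modeled as binomial.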

Exercises

Review the information in the scenario and answer the following questions.

What represents a success in this scenario? What is the probability of that success?

Your non-coding answer:

print("A success is guessing a single question correctly. With 4 choices per question, the probability of that success is 1/4 = 0.25")

## [1] "A success is guessing a single question correctly. With 4 choices per question, the probability of that success is 1/4 = 0.25"

There are two primary parameters of a binomial distribution, n = # of trials (sample size) and p = probability of a success. Identify each of those parameters for this scenario.

print("n = 20 and p = 0.25")

## [1] "n = 20 and p = 0.25"

There is sometimes a third parameter, which represents the number of observations or times that an experiment is repeated. Identify that parameter for this scenario.

print("It would be 500 because that is the number of students taking the quiz")

## [1] "It would be 500 because that is the number of students taking the quiz"
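
As a rough sketch of how these three parameters map onto R's binomial functions: what the lab calls n (trials) is the `size` argument in R, while the first argument of `rbinom()`, confusingly also named `n`, is the number of observations. The variable names `n_obs`, `mean_score`, `sd_score`, and `scores` are chosen here for illustration.

```r
n_obs <- 500    # number of observations (students in the class)
size <- 20      # number of trials per observation (questions on the quiz)
p <- 0.25       # probability of success on one trial (1 of 4 choices)

# Theoretical mean and standard deviation of one student's score
mean_score <- size * p                 # 5
sd_score <- sqrt(size * p * (1 - p))   # sqrt(3.75), about 1.94

# In rbinom(), the first argument `n` is the number of observations,
# not the number of trials; the number of trials is `size`
scores <- rbinom(n = n_obs, size = size, prob = p)
length(scores)
```

Naming the arguments explicitly, as in `rbinom(n = ..., size = ..., prob = ...)`, helps avoid swapping the number of students with the number of questions.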

1. Produce a graph and numerical summary (table) of the theoretical binomial distribution for this scenario.

size <- 20
p <- .25
x_vals <- 0:size
probabilities <- dbinom(x_vals,size,p)
binom_dist <- data.frame(Correct = x_vals, Probability = probabilities)

ggplot(binom_dist, aes(x = Correct, y = Probability)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Binomial Distribution of Correct Answers (n = 20, p = 0.25)",
       x = "Number of Correct Answers",
       y = "Probability") +
  theme_minimal()
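
The exercise also asks for a numerical summary. One possible way to tabulate the same theoretical distribution in base R is sketched below; the name `binom_table`, the extra cumulative column, and the rounding to four decimals are choices made for this sketch, not requirements of the lab.

```r
size <- 20
p <- 0.25
x_vals <- 0:size

# Table of the probability mass function and its running (cumulative) total
binom_table <- data.frame(
  Correct = x_vals,
  Probability = round(dbinom(x_vals, size, p), 4),
  Cumulative = round(pbinom(x_vals, size, p), 4)
)
print(binom_table, row.names = FALSE)

# Sanity check: the unrounded probabilities sum to 1
sum(dbinom(x_vals, size, p))
```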

Suppose that a student must answer at least 10 questions correctly in order to pass this quiz. What is the theoretical probability of guessing at least 10 answers correctly?

Your non-coding answer about reasoning for what R binomial function to use, what target value you will use, and whether you will use a left or right tail.

I will use pbinom with a target value of 10 and the right tail, because “at least 10” means P(X >= 10). Since pbinom returns the left tail P(X <= q), the right tail is calculated as 1 - pbinom(9, 20, 0.25).
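
The tail reasoning can be checked with a short sketch. It computes the probability of at least 10 correct answers in three ways that should agree: the complement of the left tail, the `lower.tail = FALSE` option of `pbinom()`, and a direct sum of `dbinom()` over 10 through 20. The variable names are illustrative.

```r
size <- 20
p <- 0.25
target_value <- 10

# Three equivalent ways to get P(X >= 10)
right_tail_complement <- 1 - pbinom(target_value - 1, size, p)
right_tail_direct <- pbinom(target_value - 1, size, p, lower.tail = FALSE)
right_tail_sum <- sum(dbinom(target_value:size, size, p))

# Each value is about 0.0139
c(right_tail_complement, right_tail_direct, right_tail_sum)
```

Note the `target_value - 1`: because `pbinom(q, ...)` includes q itself, subtracting `pbinom(10, ...)` would drop the outcome of exactly 10 correct and give P(X >= 11) instead.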

Use R’s built-in binomial functions to calculate this probability, and then create a color-coded visualization of that probability.

size <- 20
p <- .25
target_value <- 10
prob10 <- 1 - pbinom(target_value - 1,size,p)
x_values <- 0:size
probabilities <- dbinom(x_values, size = size, prob = p)
binom_dist <- data.frame(Correct_Answers = x_values, Probability = probabilities)
binom_dist$Color <- ifelse(binom_dist$Correct_Answers >= target_value, "red", "gray")

ggplot(binom_dist, aes(x = Correct_Answers, y = Probability, fill = Color)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_identity() +
  labs(title = "Probability of Getting at Least 10 Correct Answers",
       x = "Number of Correct Answers",
       y = "Probability") +
  theme_minimal()

Produce a dataframe of simulated test results for all 500 students in the class. Use a seed integer of 10 when you do it. Use the variable name ‘x_sim’ to indicate your simulated test results.

Visualize the simulated data as a binomial distribution. Color code the threshold value for passing.

set.seed(10)
n_students <- 500
n_questions <- 20
p_correct <- 0.25
pass_threshold <- 10
x_sim <- rbinom(n_students, size = n_questions, prob = p_correct)
sim_results <- data.frame(Student_ID = 1:n_students, Correct_Answers = x_sim)
head(sim_results)

ggplot(sim_results, aes(x = Correct_Answers)) +
  geom_histogram(binwidth = 1, fill = "gray", color = "black") +
  # `linewidth` replaces the `size` aesthetic for lines, deprecated in ggplot2 3.4.0
  geom_vline(xintercept = pass_threshold, color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Simulated Test Results for 500 Students",
       x = "Number of Correct Answers",
       y = "Number of Students") +
  annotate("text", x = pass_threshold + 0.5, y = max(table(sim_results$Correct_Answers)) * 0.8, 
           label = paste("Pass Threshold:", pass_threshold), color = "red") +
  theme_minimal()

Using the simulated data, calculate the proportion of students who passed the exam AND the number who passed the exam. Do all of these calculations as code (pass the value of one response to another).

pass_threshold <- 10
n_passed <- sum(sim_results$Correct_Answers >= pass_threshold)
prop_passed <- n_passed / nrow(sim_results)
cat("Number of students who passed:", n_passed, "\n")

## Number of students who passed: 10

cat("Proportion of students who passed:", prop_passed, "\n")

## Proportion of students who passed: 0.02

Create a color coded visualization of the proportion of students who passed and the proportion who failed.

n_failed <- nrow(sim_results) - n_passed
pass_fail_df <- data.frame(
  Status = c("Passed", "Failed"),
  Count = c(n_passed, n_failed)
)
pass_fail_df$Proportion <- pass_fail_df$Count / nrow(sim_results)

ggplot(pass_fail_df, aes(x = Status, y = Proportion, fill = Status)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = c("Passed" = "green", "Failed" = "red")) +
  labs(title = "Proportion of Students Who Passed and Failed",
       x = "Exam Outcome",
       y = "Proportion of Students") +
  theme_minimal()

How does the empirical proportion of students who passed compare with the theoretical one?

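The empirical proportion (0.02, or 10 of 500 students) is close to, but slightly higher than, the theoretical probability of passing (about 0.014, or roughly 7 of 500 students). The difference comes from the random variability of a single simulated class. The theoretical value is exact, and the empirical proportion would tend to move closer to it as the number of simulated students grows.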
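
A minimal sketch of this comparison, assuming the same seed and parameters as the simulation above, puts the two proportions side by side and adds a rough measure of how much a simulated proportion is expected to fluctuate. The names `p_pass_theory`, `p_pass_empirical`, `expected_passers`, and `se_proportion` are introduced here for illustration.

```r
set.seed(10)
n_students <- 500
size <- 20
p <- 0.25
pass_threshold <- 10

# Theoretical probability that one guessing student passes
p_pass_theory <- pbinom(pass_threshold - 1, size, p, lower.tail = FALSE)

# Empirical proportion from one simulated class
x_sim <- rbinom(n_students, size = size, prob = p)
p_pass_empirical <- mean(x_sim >= pass_threshold)

# Expected number of passers, and the standard error of the proportion
expected_passers <- n_students * p_pass_theory
se_proportion <- sqrt(p_pass_theory * (1 - p_pass_theory) / n_students)

c(theory = p_pass_theory, empirical = p_pass_empirical,
  expected_passers = expected_passers, se = se_proportion)
```

If the empirical proportion lies within a couple of standard errors of the theoretical one, the gap is consistent with ordinary sampling variability.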

#EXTRA CREDIT/OPTIONAL (5 points). This is a more challenging problem in terms of logic. Think carefully about how probabilities are represented in the probability mass function (PMF) graph and table.

Suppose that the instructor for this class gives another pop quiz the following week and 80% of students pass it.

What is the probability of this outcome, based on your simulated data?

#Insert code here

What is the logic you used to produce this answer? Explain how it relates to the PMF.

*Add text here*

Why is it highly unlikely that students were randomly guessing this time?

*Add text here*

Week 3 Homework - Guessing Answers on a Test

DATA 201 - Johnson

September 19, 2024
