Homework 4

Probability and Inference foundations

Author

Aidan Perkins

Note: Instructions

Modify this Quarto notebook, change eval: false to eval: true above, convert it to PDF (using Typst) or to HTML, and submit. The total is 70 points.

Either HTML or PDF submissions are fine. Please also submit this Quarto file.

{{< downloadthis week-04-student.qmd dname="homework4.qmd" label="Download Quarto file" >}}

Rubric:

  • Each question is worth 10 points
  • Since these are numerical questions with a right answer, full credit will be given for following directions, using appropriate methods and getting the right answer
  • Partial credit may be given for incorrect answers if the reasoning and methodology are sound. The correct answer itself is worth 2 points.
Note: Useful facts
  1. If two events A and B are independent, then P(A and B) = P(A) × P(B).
  2. The law of total probability states that for two events A and B, P(B) = P(B | A)P(A) + P(B | not A)P(not A).
  3. For two events A and B, the conditional probability of B given A is P(B | A) = P(A and B) / P(A).
  4. A 95% confidence interval for a mean is given by \((\bar{x} - 2 s/\sqrt{n}, \bar{x} + 2s/\sqrt{n})\), where \(\bar{x}\) is the sample mean and \(s\) is the sample standard deviation from a sample of size \(n\).

Question 1: An air quality monitoring station reports that the probability of a “high pollution alert” on any given day in July is 0.3. Assuming days are independent, what is the probability that exactly 2 out of the next 5 days have high pollution alerts?

Show your work, and confirm your answer using dbinom(x=2, size=5, prob=0.3)

Answer

As we are assuming days are independent of each other, one day triggering a high pollution alert does not influence the probability that the next day also has one. The number of alert days therefore follows a binomial distribution with the inputs n = 5, k = 2, and p = 0.3.

The written equation is \(\binom{5}{2}\, 0.3^2 (1 - 0.3)^{5-2} = 10 \times 0.09 \times 0.343 = 0.3087\).

dbinom(x=2, size=5, prob=0.3)
[1] 0.3087
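As a quick cross-check, the same value can be computed directly from the binomial formula:

# manual check: C(5,2) * 0.3^2 * 0.7^3
choose(5, 2) * 0.3^2 * 0.7^3
[1] 0.3087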

Question 2: An environmental sensor detects excess ozone (event A) and excess particulate matter (event B). Historical data show:

  • P(A) = 0.25
  • P(B) = 0.40
  • P(A∩B) = 0.12

Compute P(B|A) and interpret the result in context. What does it suggest about the joint occurrence of these pollutants?

Are A and B independent? Provide both a numerical check and an interpretation in terms of environmental processes.

Answer

  1. Compute P(B|A). We want the probability of excess particulate matter (B) occurring, given that excess ozone (A) has occurred: P(B|A) = P(A∩B) / P(A) = 0.12 / 0.25 = 0.48. In other words, among days with excess ozone, 48% also had excess particulate matter. Since this is greater than P(B) = 0.40 alone, there is a positive association between the two pollutants: B is more likely on days with A than on a typical day. The strength of the relationship is 0.48 / 0.40 = 1.20, meaning B is 20% more likely when A occurs.

  2. Are A and B independent? If they were independent, then P(A∩B) would equal P(A) × P(B) = 0.25 × 0.40 = 0.10. But the historical data show P(A∩B) = 0.12, which is greater than 0.10, so A and B are not independent. This makes sense, as excess ozone and excess particulate matter are likely to come from similar sources such as vehicle exhaust.

# P(B|A) = P(A and B) / P(A)
Probability <- 0.12/0.25
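The full numerical check can be sketched in R as well (variable names here are my own):

# given probabilities
p_A <- 0.25
p_B <- 0.40
p_AB <- 0.12

# conditional probability of B given A
p_AB / p_A
[1] 0.48
# what the joint probability would be if A and B were independent
p_A * p_B
[1] 0.1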

Question 3. In a three-region election analysis, probabilities of supporting Candidate Y differ by region:

  • Urban (45% of population): 0.70
  • Suburban (35%): 0.55
  • Rural (20%): 0.40

Find the overall probability that a randomly selected voter supports Candidate Y.

Answer

To find the overall probability we apply the law of total probability, weighting each region's support probability by its population share:

  • Urban: 0.45 × 0.70 = 0.315
  • Suburban: 0.35 × 0.55 = 0.1925
  • Rural: 0.20 × 0.40 = 0.08

Adding these together gives 0.5875, a 58.75% chance that a randomly selected voter supports Candidate Y.

# population proportion by region 
proportion <- c(Urban = 0.45, Suburban = 0.35, Rural = 0.20)
# support probabilities by region
probability <- c(Urban = 0.70, Suburban = 0.55, Rural = 0.40)

support_prob <- sum(proportion * probability)

print(support_prob)
[1] 0.5875

Question 4. Suppose a new water-testing kit detects microplastics with 95% sensitivity (true positive) and 90% specificity (true negative). Environmental assessments show that 10% of water samples in the region contain microplastics. If the test gives a positive result, what is the posterior probability the sample truly contains microplastics?

Answer

Using 1,000 samples as an example: 100 contain microplastics and 900 do not. From the problem statement:

  • Prevalence = 0.10, so 100 of the 1,000 samples contain microplastics
  • Sensitivity = 0.95 (true positive rate): 95 of the 100 contaminated samples test positive
  • Specificity = 0.90 (true negative rate): 810 of the 900 clean samples test negative, leaving 90 false positives

We can use the positive predictive value (PPV) formula to answer this question: PPV = true positives / total positives = 95 / (95 + 90) = 95/185 ≈ 0.51.

Even with a fairly accurate test, a prevalence of only 10% means roughly half of positive results are true positives.
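The same posterior follows from Bayes' rule; here is a short sketch in R (variable names are mine):

prevalence <- 0.10   # P(microplastics)
sensitivity <- 0.95  # P(positive | microplastics)
specificity <- 0.90  # P(negative | no microplastics)

# total probability of a positive test, by the law of total probability
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
# posterior: P(microplastics | positive)
sensitivity * prevalence / p_positive
[1] 0.5135135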


Question 5: A survey of 1000 voters shows a mean approval rating of 35% for the Governor, with a standard deviation of 12%. Construct a 95% confidence interval for the true approval rating. Interpret your results.

Answer

# information we are given 
survey_n <- 1000
approval <- 0.35 # mean approval
SD <- 0.12 # sample standard deviation 

# the reported standard deviation would only matter if each respondent gave a 0-100
# approval score; for a yes/no approval question we use the standard error of a proportion

# z critical value for a 95% confidence interval, ~1.96
z_crit <- qnorm(0.975)
# standard error of a proportion
se_prop <- sqrt(approval * (1 - approval) / survey_n)

upper <- approval + z_crit * se_prop
lower <- approval - z_crit * se_prop

print(lower)
[1] 0.3204377
print(upper)
[1] 0.3795623

Given these results, we can be 95% confident that the true approval rating lies somewhere between roughly 32% and 38%.

Does this mean there is a 95% probability that the true rate falls in this particular interval? No, but we can be confident that if we repeated the same random sampling process many times, about 95% of the resulting intervals would contain the true approval rate. We don't assign a 95% probability to this specific interval.
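A small simulation illustrates this frequency interpretation (a sketch, assuming the true approval rate really is 0.35): the coverage of the interval procedure should come out near 95%.

set.seed(1)
covered <- replicate(10000, {
  p_hat <- rbinom(1, size = 1000, prob = 0.35) / 1000  # one simulated poll of 1000 voters
  se <- sqrt(p_hat * (1 - p_hat) / 1000)
  abs(p_hat - 0.35) <= qnorm(0.975) * se               # does this interval cover the truth?
})
mean(covered)  # roughly 0.95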


Question 6. An environmental agency claims that the average nitrate level in groundwater exceeds the safety standard of 10 mg/L. A random sample of 40 wells gives a mean of 10.6 mg/L with a standard deviation of 1.5 mg/L.

  1. Formulate the null and alternative hypotheses to test this claim.
  2. Construct a simulation experiment to assess the evidence against the null hypothesis and report the p-value from the simulation
  3. Interpret this p-value and its implications for the agency’s claim. The agency has set a significance level of 3%.
  4. Construct a 95% confidence interval for the mean nitrate level.

Note that two people will likely not get the same answers to these questions, due to the random nature of the simulations. However, there is a ballpark of right answers, and you will be graded on your simulation code.

Answer

Null hypothesis (H0): the true mean nitrate level in groundwater is at or below 10 mg/L.
Alternative hypothesis (Ha): the true mean nitrate level in groundwater exceeds 10 mg/L.

sample_size <- 40
sample_mean <- 10.6
SD <- 1.5

# standard error of the mean
SE <- SD / sqrt(sample_size)

# simulate 10,000 sample means under the null hypothesis (true mean = 10)
simulation <- replicate(10000, mean(rnorm(sample_size, mean = 10, sd = SD)))
# one-sided p-value: proportion of simulated means at least as large as the observed 10.6
pvalue <- mean(simulation >= sample_mean)

# 95% CI using the 2-standard-error rule
margin <- 2 * SE
lower1 <- sample_mean - margin
upper1 <- sample_mean + margin

print(lower1)
[1] 10.12566
print(upper1)
[1] 11.07434

6b) The simulated p-value was 0.0062.

6c) Since the p-value falls below the agency's 3% significance level, we reject the null hypothesis and conclude that the evidence supports the claim that nitrate levels exceed 10 mg/L. The confidence interval, which lies entirely above 10 mg/L, further supports this finding.

6d) If we repeatedly took random samples of 40 wells and built a 95% CI from each, about 95% of those intervals would contain the true mean nitrate level. We cannot be sure this particular interval does; the 95% describes the long-run behavior of the procedure.
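As a sanity check on the simulated p-value, a normal approximation gives nearly the same answer:

# z-statistic for the observed mean under the null (mu = 10)
z <- (10.6 - 10) / (1.5 / sqrt(40))
pnorm(z, lower.tail = FALSE)  # about 0.0057, consistent with the simulated 0.0062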


Question 7:

The file dc_sample.rds is provided to you on Canvas. Download it and load it into R with the code

dc_data <- readRDS("data/dc_sample.rds") 

This dataset is a sample of 500 individuals and their race, based on standard Census criteria.

  1. Construct a confidence interval of the proportion of black people in DC using this dataset. (5 points)
library(tidyverse)

path <- "/Users/aidanperkins/Downloads/dc_sample.rds"

# RDS files: use readRDS()
dc_data <- readRDS(path)


dc_data_clean <- dc_data |> filter(race == "Black or African American")
n <- nrow(dc_data)       # total sample size (500)
x <- nrow(dc_data_clean) # number of Black or African American respondents
p_hat <- x / n           # sample proportion

# z critical value for a 95% CI, ~1.96
z_crit <- qnorm(0.975)
se_prop <- sqrt(p_hat * (1 - p_hat) / n)

upper2 <- p_hat + z_crit * se_prop
lower2 <- p_hat - z_crit * se_prop

print(lower2)
[1] 0.4461826
print(upper2)
[1] 0.5338174

You compute a 95% confidence interval for a proportion using the formula

\[ (\hat{p} - 2 \cdot \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \hat{p} + 2 \cdot \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}) \]

where \(\hat{p}\) is the sample proportion and \(n\) is the sample size.

  2. Perform a simulation experiment as follows: (10 points)
  • Create 1000 datasets of size 500 from this dataset by resampling with replacement.
library(infer)
boot1000 <- dc_data |>
  rep_sample_n(size = 500, replace = TRUE, reps = 1000)
  • For each dataset, compute the proportion of black individuals.
boot_props <- boot1000 |>
  group_by(replicate) |>
  summarize(prop_black = mean(race == "Black or African American", na.rm = TRUE))
  • Construct a distribution of these proportions and report the 2.5% and 97.5% percentiles.
library(cowplot)
ggplot(boot_props, aes(x = prop_black))+
  geom_histogram()+
  ylab("Count")+
  xlab("Proportion Black or African American")+
  theme_cowplot()

quantile(boot_props$prop_black, c(0.025, 0.975))
   2.5%   97.5% 
0.44600 0.53005 
  • Compare the confidence interval from part (a) with the distribution from the simulation in part (b). The two are strikingly similar: the formula-based CI from the original dataset is about 44.6% to 53.4%, while the bootstrap percentile interval is about 44.6% to 53.0%.
  3. Provide a table showing how often subjects are repeated in one of the resampled datasets. What I’m looking for is a table like the following, where the header denotes the number of times a subject is repeated.
1 2 3 4 5
191 86 29 10 2
# selected replicate number 5
boot5 <- boot1000 |>
  filter(replicate == 5)

# count how many times each subject (id) was selected
mult_per_subject <- boot5 |>
  count(id, name = "times_selected") 

# counts of subjects appearing exactly 1..5 times
counts <- table(factor(mult_per_subject$times_selected, levels = 1:5))
result_table <- tibble::as_tibble_row(setNames(as.integer(counts), as.character(1:5)))
result_table
# A tibble: 1 × 5
    `1`   `2`   `3`   `4`   `5`
  <int> <int> <int> <int> <int>
1   170    94    35     4     3
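A shorter base-R check nests table(): the inner call counts how many times each id appears, and the outer call tabulates those counts.

table(table(boot5$id))

One caveat on the tibble approach above: hard-coding levels = 1:5 silently drops any subject drawn more than 5 times. The counts shown cover 170 + 94 + 35 + 4 + 3 = 306 subjects and 494 draws, while the resample has 500 draws and 307 unique individuals (reported just below), which suggests one subject was drawn 6 times and fell outside the 1:5 range.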

Also, report on the number of unique individuals included in this resampled dataset (it is not 500)

num_unique <- dplyr::n_distinct(boot5$id)
num_unique
[1] 307
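This is in line with resampling theory: when drawing n items with replacement from n, the expected fraction of distinct subjects is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 63.2% for large n.

# expected number of distinct subjects in a resample of size 500
500 * (1 - (1 - 1/500)^500)  # about 316, so the observed 307 is consistent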

You can create a single resampled dataset from the original dataset with replacement using the following code:

dc_resampled <- dc_data[sample(nrow(dc_data), replace = TRUE), ]

or

dc_resampled <- dc_data |> slice_sample(prop = 1, replace = TRUE)