1. PROBLEM SOLVING PART
Problem 1 (5 pts) - similar to Pr. 5 / page 238 in
text A box contains 30 consecutive balls numbered 1 to 30. If
four numbers are drawn at random, how many ways are there for the
largest number to be 19 and the smallest number to be 9?
Use the Fundamental Principle of Counting!
Solution: 36
YOUR CODE HERE:
#largest number to be 19 and the smallest number to be 9? Mutually Independent Events. 1 - 30 not needed. Exclude 9 and 19 itself. 10 - 18
between9and19 <- 10:18
#Total ways to choose 2 numbered balls within range
totalWays <- choose(length(between9and19), 2)
# Print the result
print(totalWays)
[1] 36
Problem 2 (5 pts) - similar to Pr. 14 / page 239 in
text On a multiple-choice exam with four
possible answers for each of the five questions, what is the probability
that a student would get four or more correct answers just by
guessing?
Hint: Use the fact that \(P(E ∩ F) = P(E) ·
P(F)\) for two independent event (generalized for more than
events) Getting one answer correct is independent of another.
Also
\[P(\text{at least } 4) = P(\text{exactly
} 4) + P(\text{exactly } 5)\]
the last events are mutually exclusive so \(P(A \cup B) = P(A) + P(B)\)
YOUR CODE HERE:
# 4 possible answers
# 5 Questions
# Goal: Probability of 4 or 5 questions correct (each question again 4 possible answers, // each Q having 1/4 or 25% chance to be correct)
#Total Questions
q <- 5
# Each question has 1/2 or 25% chance to be correct
eachQuestionRight <- .25
# 4 correct answers only
FourCorrect <- dbinom(4, size = q, prob = eachQuestionRight)
# 5 correct answers only
FiveCorrect <- dbinom(5, size = q, prob = eachQuestionRight)
# Probability of 4 correct answers plus 5 correct answers
FourORFiveCorrect <- FourCorrect + FiveCorrect
print(FourORFiveCorrect)
[1] 0.015625
# Extrapolate to 200 students
print(FourORFiveCorrect * 200)
[1] 3.125
Out of 200 students who adopt this test taking approach how many are
expected to get at least 4 correct? Hint: Use Binomial
experiment settings to answer this questions.
Solution: 3.125 / 3 of the students would get at least 4
correct answers of the 5 questions on the multiple-choice
exam
Problem 3 (5 pts) (IND-OPT) - similar to Pr. 44 page 243 in
text Consider tossing four fair coins. There
are 16 possible outcomes, e.g HHHH,HHHT,HHTH,HTHH,THHH, ...
possible outcomes. Define X as the random variable “number
of heads showing when four coins are tossed.” Obtain the mean and the
variance of X. Simulate tossing four fair coins 10,000 times. Compute
the simulated mean and variance of X. Are the simulated values within 2%
of the theoretical answers?
Hint: To find the theoretical values use
dbinom (x= , size = , prob = )
Solution: No, neither the theoretical mean and variance are
within are within 2% of the simulated mean and variance
YOUR CODE HERE:
library(MASS)
set.seed(007) #Random seed for the rbinom simulations
# Flipping a single/fair coin (1st input - number of draws, 2nd input - number of coins, 3rd input - probability of coin / e.g. 50% or .5)
rbinom(1, 1, .5)
[1] 1
# Flipping 4 fair coins
rbinom(1, 4, .5)
[1] 2
# Flipping 4 fair coins 10,000 times, set to coinflipsX var (Where X is the number of heads showing when 4 coins are tossed)
coinflipsX <- rbinom(10000, 4, .5)
# Simulated mean of coinflipsX
simulated_mean <- mean(coinflipsX)
# Simulated variance of coinflipsX
simulated_variance <- var(coinflipsX)
# theoretical values dbinom (x= 1, size = 16 possible outcomes, prob = .02 / 2%)
dbinom_mean <- (dbinom (x = 2, size = 16, prob = 0.02))
dbinom_variance <- (dbinom (x = 1, size = 16, prob = 0.02))
# Check if theoretical value of dbinom_mean is less than or equal to .02% of simulated_mean
simulated_versus_theoretical_mean <- abs(simulated_mean - dbinom_mean) / dbinom_mean <= 0.02
# Check if theoretical value of dbinom_variance is less than or equal to .02% of simulated_variance
simulated_versus_theoretical_variance <- abs(simulated_variance - dbinom_variance) / dbinom_variance <= 0.02
print(simulated_mean)
[1] 2.0028
print(simulated_variance)
[1] 1.009493
print(simulated_versus_theoretical_mean)
[1] FALSE
print(simulated_versus_theoretical_variance)
[1] FALSE
Problem 4 (10 pts) - similar to Pr. 11 / page 307 in
text
Traffic volume is an important factor for determining the most
cost-effective method to surface a road. Suppose that the average number
of vehicles passing a certain point on a road is 5 every 60 seconds.
- Find the probability that more than 12 cars will pass the point in a
period of 3 minutes.
- What is the probability that more than 100 cars pass the point in
half an hour?
Hint: Adjust the Poisson parameter \(\lambda = 5\) every 60 seconds to
# per every 3 minutes
(a): 0.8152482 (b): 0.9999941
# Avg # of vehicles passing a certain point on a road is 5 every 60 seconds.
# (a) more than 12 cars will pass the point in a period of 3 minutes (3 sets of 60 seconds, or 3 sets of 5 cars / 15 cars MAX)
# 11 as min non-inclusive (to get 12+)
aProbability <- 1 - ppois(11, 15)
# (b) more than 100 cars pass the point in half an hour (30 sets of 60 seconds, or 30 sets of 5 cars / 150 cars MAX)
# 99 as min non-inclusive (to get100+)
bProbability <- 1 - ppois(99, 150)
list(AProbability = aProbability, bProbability = bProbability)
$AProbability
[1] 0.8152482
$bProbability
[1] 0.9999941
Problem 5 (10 pts) - similar to Pr. 18 / page 308 in
text Suppose the percentage of drinks sold from a vending
machine are 70% and 30% for soft drinks and bottled water,
respectively.
- What is the probability that on a randomly selected day, the first
soft drink is the fourth drink sold?
- Find the probability that exactly 4 out of 10 drinks sold is a soft
drink.
Hint: Let X = number of waters (failures) purchased
before the first soft drink is purchased. Then, \(X \sim Geo(0.70)\).
(a): 0.0189 (b): 0.03675691
YOUR CODE HERE::
# Randomly selected day, the first soft drink is the fourth drink sold?
SDPercent <- 0.70
first_SD_fourth_sold_drink <- (1 - SDPercent)^3 * SDPercent
# Result
print(first_SD_fourth_sold_drink)
[1] 0.0189
- Let
X = number of soft drinks sold. Then, \(X \sim Bin(10, 0.70)\) and \(P(X = 1) = 0\) since one has:
SDPercent <- 0.70
# Exactly 4 out of 10 drinks sold is a soft drink
total_drinks <- 10
target_drinks_SD <- 4
#Exactly 4 of 10 drinks as soft drinks
four_of_ten_SD <- choose(total_drinks, target_drinks_SD) * SDPercent^target_drinks_SD * (1 - SDPercent)^(total_drinks - target_drinks_SD)
print(prob_exactly_4_soft_drinks)
[1] 0.03675691
Problem 6 (10 pts) - Exponential Distribution: Light
Bulbs If the life of a certain type of light bulb has an
exponential distribution with a mean of 10 months, find
- The probability that a randomly selected light bulb lasts between 6
and 15 months.
- The 95th percentile of the distribution. How the probability of the
light bulb to last more than 36 months compare to
0.05?
- The probability that a light bulb that has lasted for 10 months will
last more than 25 months.
(a): 0.3256815 (b): 0.02732372 & it
meets comparison to .05 (c): 0.2231302
YOUR CODE HERE:
expDist_Ten_Months <- 1/10 # Exponential distribution with mean of 10 months
# (a) - randomly selected light bulb lasts between 6 and 15 months (15 mo. pexp - 6 mo. pexp)
Six_to_Fifteen <- pexp(15, rate = expDist_Ten_Months) - pexp(6, rate = expDist_Ten_Months)
print(Six_to_Fifteen)
[1] 0.3256815
# (b) - 95th percentile. Probability of the light bulb to last 36 months+ compare to `0.05`?
Percentile_95th <- qexp(0.95, rate = expDist_Ten_Months)
over_36 <- pexp(36, rate = expDist_Ten_Months, lower.tail = FALSE)
print(over_36)
[1] 0.02732372
comparison_to_.05 <- over_36 < 0.05
print(comparison_to_.05)
[1] TRUE
# (c) - Probability that a light bulb that has lasted for 10 months will last 25 months+
Ten_months_last_25_months_plus <- pexp(15, rate = expDist_Ten_Months, lower.tail = FALSE)
print(Ten_months_last_25_months_plus)
[1] 0.2231302
Problem 7 (10 pts) (IND-OPT) - Pr. 8 page 400 A
population has the following elements: 2, 5, 8, 12, 13 (finite
population).
Enumerate all the samples of size 2 that can be drawn with and
without replacement. (Hint: Use the srs() function from the
PASWR2 package)
Calculate the mean of the population.
Calculate the variance of the population.
Calculate the standard deviation of the population.
Calculate the mean of the sample mean, \(E[X]\).
Calculate the variance of the sampled mean, \(Var(\bar X)\)
Calculate the standard deviation of the sample mean.
Calculate the mean of the sample variance, \(E[S^2]\)
Is the variance of \(\bar X\)
larger when sampling with or without replacement? Explain your
answer
Solution:
YOUR CODE HERE:
#individual optional
Problem 8 (10 pts) - similar to Pr. 10 / page 400 in
text
Use the data frame WHEATUSA2004 from the
PASWR2 package; draw all samples of sizes 4, 5, and 6; and
calculate the mean of the means for the variable acres (wheat surface
area measured in thousands of acres). . What size provides the best
approximation to the population mean? What is the variance of these
means?
Hint: Use the srs() function from the
PASWR2 package to draw simple random samples.
YOUR CODE HERE:
#refrencing sampling distributions exercises line 206
library(PASWR2)
# Sample Size 4:
SRS4 <- srs(WHEATUSA2004$acres, 4)
xbarSRS4 <- apply(SRS4, 1, mean) # MEAN
c(mean(xbarSRS4), var(xbarSRS4)) # MEAN & VARIANCE
[1] 1148.733 645755.872
# Sample Size 5:
SRS5 <- srs(WHEATUSA2004$acres, 5)
xbarSRS5 <- apply(SRS5, 1, mean) # MEAN
c(mean(xbarSRS5), var(xbarSRS5)) # MEAN & VARIANCE
[1] 1148.733 496720.646
# Sample Size 6:
SRS6 <- srs(WHEATUSA2004$acres, 6)
xbarSRS6 <- apply(SRS6, 1, mean) # MEAN
c(mean(xbarSRS6), var(xbarSRS6)) # MEAN & VARIANCE
[1] 1148.733 397374.397
Solution: Sample Size 6 has the smallest variance of
397374.397 (compared to Sample size 4’s 645755.872 and Sample size 5’s
496720.646
Problem 9 (10 pts) - Pr. 16 / page 513 in text
A group of engineers working with physicians in a research hospital
is developing a new device to measure blood glucose levels. Based on
measurements taken from patients in a previous study, the physicians
assert that the new device provides blood glucose levels slightly higher
than those provided by the old device. To corroborate their suspicion,
15 diabetic patients were randomly selected, and their
blood glucose levels were measured with both the new and the old
devices. The measurements, in mg/100 ml, appear in data
frame GLUCOSE from the PASWR2 package:
- Are the samples independent? Why or why not?
- If the blood glucose level is a normally distributed random
variable, compute a 95% confidence interval for the mean differences of
the population.
- Use the results in (b) to decide whether or not the two devices give
the same results.
Solution:
The samples are not independent. The patients each have both
measurements obtained from old device as well as their newer device. The
statistics are dependent on each other
The 95% confidence level interval for the mean differences of the
population is −16.3614 to −12.7586.
YOUR CODE HERE:
GLUCOSE <- GLUCOSE %>% mutate(DIFF = old-new) # create variable DIFF for the difference between old and new level
head(GLUCOSE)
ggplot(data = GLUCOSE, aes(sample = DIFF)) +
stat_qq() +
theme_bw()

CI <- t.test(GLUCOSE$DIFF)$conf # use t-test since the sample size is small < 30
CI
[1] -16.36145 -12.75855
attr(,"conf.level")
[1] 0.95
The 95% confidence interval for the mean differences of the
population is [−16.3614,−12.7586].
- The confidence interval suggests that on average, the new device
reports higher blood glucose levels than the old device.
Problem 10 (10 pts) (IND-OPT) - similar to Pr. 9 / page 511
in text
A large company wants to estimate the proportion of its accounts that
are paid on time.
How large a sample is needed to estimate the true proportion
within 2% with a 95% confidence level?
Suppose 700 out of 800 accounts are paid on time. Construct 95%
confidence intervals for the true proportion of accounts that are paid
on time using an asymptotic confidence interval, an Agresti Coull
confidence interval.
Hint: Use the nsize() and binom.confint()
from the library(binom)
Solution:
YOUR CODE HERE:
library(binom)
library(binom)
#(a)
#(b)
#individual optional
Problem 11 (10 pts) - Pr. 10 / page 511 in text
In a study conducted at Appalachian State
University, students used digital oral thermometers to record
their temperatures each day they came to class. A randomly selected day
of student temperatures is provided in the following table and in the
data frame STATTEMPS. Information is also provided with
regard to subject gender and the hour of the day when the students’
temperatures were measured.
Direction: load the data with data(STATTEMPS) having the
library(PASWR2) loaded.
Hint: Use
t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)
the x ad y values are given with formula
expression formula = temperature ~ gender. Then use the
$ access to obtain the confidence interval:
CI <- t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)$conf
Construct a 95% confidence interval for the true average
temperature difference between females and males. Does the interval
contain the value zero? What does this suggest about gender temperature
differences?
Construct a 95% confidence interval for the true average
temperature difference between students taking their temperatures at 8
a.m. and students taking their temperatures at 9 a.m. Give a reason why
one group appears to have a higher temperature reading.
Solution:
YOUR CODE HERE:
library(PASWR2)
data(STATTEMPS)
#(a) 95% confidence interval for the true average temperature difference between females and males
t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)
Error in t.test.formula(temperature ~ gender, data = STATTEMPS, mu = 0, :
cannot use 'paired' in formula method
A 95% confidence interval for the true average
difference between students taking their temperatures at 8
a.m. and students taking their temperatures at 9 a.m. is
[−2.4965,−0.2564]. Note that this interval does not contain
0, indicating that the there is evidence to suggest
students in the 8 a.m. class have temperatures that are not
as warm as the 9 a.m. class. One possible explanation is
that students roll straight out of bed and into the 8 a.m.
class. Consequently, their temperatures are closer to their sleeping
temperatures which are lower than their waking temperatures.
3. BOOTSTRAP ESTIMATION PART
Review the problem below.
Pr. 30 / page 695 in text
The “Wisconsin Card Sorting Test” is widely used by psychiatrists,
neurologists, and neuropsychologists with patients who have a brain
injury, neurodegenerative disease, or a mental illness such as
schizophrenia. Patients with any sort of frontal lobe lesion generally
do poorly on the test. The data frame WCST and the
following table contain the test scores from a group of 50 patients from
the Virgen del Camino Hospital (Pamplona, Spain).
Use the function eda() from the PASWR2
package to explore the data and decide if normality can be assumed. For
details type in console ?eda
What assumption(s) must be made to compute a 95%
confidence interval for the population mean?
Compute the confidence interval from (b).
Compute a 95% BCa bootstrap confidence interval for the mean test
score.
Should you use the confidence interval reported in (c) or the
confidence interval reported in (d)?
Solution:
- Assuming the variable score has a normal distribution is not
reasonable.
See the results of EDA analysis below:
data(WCST)
WCST %>% pull(score) %>% eda()
Size (n) Missing Minimum 1st Qu Mean Median TrMean 3rd Qu Max Stdev Var SE Mean I.Q.R. Range Kurtosis
50.000 0.000 4.000 8.500 21.480 17.000 19.413 26.000 94.000 18.406 338.785 2.603 17.500 90.000 4.511
Skewness SW p-val
2.033 0.000

In order to construct a 95% confidence interval for
the population mean, one assumes that the values in the variable score
are taken from a normal distribution. Although this is
not a reasonable assumption, the sample size might be sufficiently large
to overcome the skewness in the parent population. Consequently, one
might appeal to the Central Limit Theorem and claim
that the sampling distribution of X is
approximately normal due to the sample size
(50). In this problem, the skewness is quite severe, and
one should not be overly confident in the final interval.
If we assume normality one has:
# CI <- with(data = WCST, t.test(score)$conf) #
# or using piping
CI <- WCST %>% pull(score) %>% t.test() %>% .$conf
CI
[1] 16.24904 26.71096
attr(,"conf.level")
[1] 0.95
- Use the
boot.ci() nonparametric bootstrap CIs functions
to obtain the bootstrap CI.
library(boot) # use boot package
MEAN <- function(data, i){
d <- data[i]
M <- mean(d)
}
set.seed(13)
B <- 10^4 - 1 # number of bootstrap samples
b.obj <- boot(data = WCST$score, statistic = MEAN, R = B) # bootstrap object of class "boot" containing the output of a bootstrap calculation
CIB <- boot.ci(b.obj, conf = 0.95, type = "bca")
CIB
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 9999 bootstrap replicates
CALL :
boot.ci(boot.out = b.obj, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (17.26, 27.86 )
Calculations and Intervals on Original Scale
Problem 12 (10 pts) Set the seed to 23
if your group index is odd number (i.e. group 1,3,5,7),
and the seed to 89 if your group index is even number
(i.e. group 2,4,6,8). Draw a random sample of size
200 from exponential distribution with mean \(\lambda = 4\) (\(rate = \frac{1}{4}\)). Produce all
bootstrap CI with the function boot.ci (use type =
“all”).
How many of them contain the true mean for the sampled
distribution?
YOUR CODE HERE:
library(boot)
set.seed(89)
stat.mean = function(x, indx){
mean( x[indx] )
}
exps200 <- rexp(200,.25)
b.obj <- boot(data = exps200, statistic = stat.mean, 200)
CIB <- boot.ci(b.obj, type = "all")
Warning: bootstrap variances needed for studentized intervals
CIB
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 200 bootstrap replicates
CALL :
boot.ci(boot.out = b.obj, type = "all")
Intervals :
Level Normal Basic
95% ( 3.523, 4.685 ) ( 3.509, 4.718 )
Level Percentile BCa
95% ( 3.486, 4.695 ) ( 3.533, 4.732 )
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
Some percentile intervals may be unstable
Some BCa intervals may be unstable
#bootstrap variances needed for studentized intervals warning
EXTRA CREDIT (10 pts) Set the seed to
23 if your group index is odd number (i.e. group
1,3,5,7), and the seed to 89 if your group index
is even number (i.e. group 2,4,6,8).
Draw a random sample of size 25 from \(Bin(20,\frac{1}{5})\). Produce all
bootstrap CI with the function boot.ci (use type =
“all”).
How many of them contain the true mean for the \(Bin(20,\frac{1}{5})\)?
YOUR CODE HERE:
library(boot)
set.seed(89)
stat.mean = function(x, indx){
mean( x[indx] )
}
exps25 <- rexp(20,1/5)
b.obj <- boot(exps25, statistic = stat.mean, 10000)
CIB <- boot.ci(b.obj, type = "all")
Warning: bootstrap variances needed for studentized intervals
CIB
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates
CALL :
boot.ci(boot.out = b.obj, type = "all")
Intervals :
Level Normal Basic
95% ( 3.177, 7.148 ) ( 3.056, 6.997 )
Level Percentile BCa
95% ( 3.326, 7.267 ) ( 3.540, 7.663 )
Calculations and Intervals on Original Scale
---
title: "STAT270 - Project 2 - Classical Probability and Statistics Problems and Bootstrap Estimation"
author: "Madelyn Figueras"
date: "`r Sys.Date()`"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Directions:

Submit the project either as **a group** or as an **individual work**.

-   Project consist of **three** (3) parts: **1.** Problems Solving, **2.** Brainstorming, and **3.** Bootstrap estimation.

-   For **group** project **submisison**: credit will be given to all students listed as contributors - **all listed problem** must be completed. Maximum score **100** pts + **10** pts (Extra Credit).

-   For **individual submissions**: problems listed with label **(IND-OPT)** (individual-optional) in parenthesis **are not** required and may be omitted. For a maximum score **80** pts + **10 pts** (Extra Credit).

```{r packages, echo = FALSE, warning=FALSE, message = FALSE}
# load the packages for graphing and data wrangling
library(ggplot2)
library(PASWR2)
library(dplyr)
library(lattice)
library(boot)
library(MASS)
```

**Note:** If you `Rmd` file submission knits you will receive total of **(5 points)**

## 1. PROBLEM SOLVING PART

**Problem 1 (5 pts) - similar to Pr. 5 / page 238 in text** A box contains 30 consecutive balls numbered 1 to 30. If four numbers are drawn at random, how many ways are there for the largest number to be 19 and the smallest number to be 9?

Use the Fundamental Principle of Counting!

**Solution: 36**

**YOUR CODE HERE:**

```{r}
#largest number to be 19 and the smallest number to be 9? Mutually Independent Events. 1 - 30 not needed. Exclude 9 and 19 itself. 10 - 18
between9and19 <- 10:18

#Total ways to choose 2 numbered balls within range
totalWays <- choose(length(between9and19), 2) 

# Print the result
print(totalWays)

```

**Problem 2 (5 pts) - similar to Pr. 14 / page 239 in text** On a multiple-choice exam with **four** possible answers for each of the five questions, what is the probability that a student would get four or more correct answers just by guessing?

Hint: Use the fact that $P(E ∩ F) = P(E) · P(F)$ for two independent event (generalized for more than events) Getting one answer correct is independent of another.

Also

$$P(\text{at least } 4) = P(\text{exactly } 4) + P(\text{exactly } 5)$$

the last events are mutually exclusive so $P(A \cup B) = P(A) + P(B)$

**YOUR CODE HERE:**

```{r}

# 4 possible answers
# 5 Questions
# Goal: Probability of 4 or 5 questions correct (each question again 4 possible answers, // each Q having 1/4 or 25% chance to be correct)

#Total Questions
q <- 5

# Each question has 1/2 or 25% chance to be correct
eachQuestionRight <- .25

# 4 correct answers only
FourCorrect <- dbinom(4, size = q, prob = eachQuestionRight)

# 5 correct answers only
FiveCorrect <- dbinom(5, size = q, prob = eachQuestionRight)

# Probability of 4 correct answers plus 5 correct answers
FourORFiveCorrect <- FourCorrect + FiveCorrect
print(FourORFiveCorrect)

# Extrapolate to 200 students

print(FourORFiveCorrect * 200)

```

Out of 200 students who adopt this test taking approach how many are expected to get at least 4 correct? **Hint:** Use Binomial experiment settings to answer this questions.

**Solution: 3.125 / 3 of the students would get at least 4 correct answers of the 5 questions on the multiple-choice exam**

**Problem 3 (5 pts) (IND-OPT) - similar to Pr. 44 page 243 in text** Consider tossing **four** fair coins. There are 16 possible outcomes, e.g `HHHH,HHHT,HHTH,HTHH,THHH, ...` possible outcomes. Define `X` as the random variable "number of heads showing when four coins are tossed." Obtain the mean and the variance of X. Simulate tossing four fair coins 10,000 times. Compute the simulated mean and variance of X. Are the simulated values within 2% of the theoretical answers?

**Hint:** To find the theoretical values use `dbinom (x= , size = , prob =  )`

**Solution: No, neither the theoretical mean and variance are within are within 2% of the simulated mean and variance**

**YOUR CODE HERE:**

```{r}
library(MASS)
set.seed(007) #Random seed for the rbinom simulations

# Flipping a single/fair coin (1st input - number of draws, 2nd input - number of coins, 3rd input - probability of coin / e.g. 50% or .5)
rbinom(1, 1, .5)

# Flipping 4 fair coins
rbinom(1, 4, .5)

# Flipping 4 fair coins 10,000 times, set to coinflipsX var (Where X is the number of heads showing when 4 coins are tossed)
coinflipsX <- rbinom(10000, 4, .5)

# Simulated mean of coinflipsX
simulated_mean <- mean(coinflipsX)

# Simulated variance of coinflipsX
simulated_variance <- var(coinflipsX)

# theoretical values dbinom (x= 1, size = 16 possible outcomes, prob =  .02 / 2%)
dbinom_mean <- (dbinom (x = 2, size = 16, prob = 0.02))
dbinom_variance <- (dbinom (x = 1, size = 16, prob = 0.02))

# Check if theoretical value of dbinom_mean is less than or equal to .02% of simulated_mean
simulated_versus_theoretical_mean <- abs(simulated_mean - dbinom_mean) / dbinom_mean <= 0.02

# Check if theoretical value of dbinom_variance is less than or equal to .02% of simulated_variance
simulated_versus_theoretical_variance <- abs(simulated_variance - dbinom_variance) / dbinom_variance <= 0.02

print(simulated_mean) 
print(simulated_variance)
print(simulated_versus_theoretical_mean) 
print(simulated_versus_theoretical_variance)

```

**Problem 4 (10 pts) - similar to Pr. 11 / page 307 in text**

Traffic volume is an important factor for determining the most cost-effective method to surface a road. Suppose that the average number of vehicles passing a certain point on a road is 5 every 60 seconds.

(a) Find the probability that more than 12 cars will pass the point in a period of 3 minutes.
(b) What is the probability that more than 100 cars pass the point in half an hour?

Hint: Adjust the Poisson parameter $\lambda = 5$ every 60 seconds to `#` per every 3 minutes

**(a): 0.8152482**
**(b): 0.9999941**

```{r pr 11}

# Avg # of vehicles passing a certain point on a road is 5 every 60 seconds. 
# (a) more than 12 cars will pass the point in a period of 3 minutes (3 sets of 60 seconds, or 3 sets of 5 cars / 15 cars MAX)
# 11 as min non-inclusive (to get 12+)
aProbability <- 1 - ppois(11, 15)

# (b) more than 100 cars pass the point in half an hour (30 sets of 60 seconds, or 30 sets of 5 cars / 150 cars MAX)
# 99 as min non-inclusive (to get100+)
bProbability <- 1 - ppois(99, 150)

list(AProbability = aProbability, bProbability = bProbability)

```

**Problem 5 (10 pts) - similar to Pr. 18 / page 308 in text** Suppose the percentage of drinks sold from a vending machine are 70% and 30% for soft drinks and bottled water, respectively.

(a) What is the probability that on a randomly selected day, the first soft drink is the fourth drink sold?
(b) Find the probability that exactly 4 out of 10 drinks sold is a soft drink.

Hint: Let `X` = number of waters (failures) purchased before the first soft drink is purchased. Then, $X \sim Geo(0.70)$.

**(a): 0.0189**
**(b): 0.03675691**

**YOUR CODE HERE:**:

(a) 

```{r pr. 18a}

# Randomly selected day, the first soft drink is the fourth drink sold?
SDPercent <- 0.70 
first_SD_fourth_sold_drink <- (1 - SDPercent)^3 * SDPercent
# Result
print(first_SD_fourth_sold_drink)

```

(b) Let `X` = number of soft drinks sold. Then, $X \sim Bin(10, 0.70)$ and $P(X = 1) = 0$ since one has:

```{r pr. 18b}

SDPercent <- 0.70 

# Exactly 4 out of 10 drinks sold is a soft drink
total_drinks <- 10
target_drinks_SD <- 4

#Exactly 4 of 10 drinks as soft drinks
four_of_ten_SD <- choose(total_drinks, target_drinks_SD) * SDPercent^target_drinks_SD * (1 - SDPercent)^(total_drinks - target_drinks_SD)
print(prob_exactly_4_soft_drinks)

```

**Problem 6 (10 pts) - Exponential Distribution: Light Bulbs** If the life of a certain type of light bulb has an exponential distribution with a mean of 10 months, find

(a) The probability that a randomly selected light bulb lasts between 6 and 15 months.
(b) The 95th percentile of the distribution. How the probability of the light bulb to last more than 36 months compare to `0.05`?\
(c) The probability that a light bulb that has lasted for 10 months will last more than 25 months.

**(a): 0.3256815**
**(b): 0.02732372 & it meets comparison to .05 **
**(c): 0.2231302**

**YOUR CODE HERE:**

```{r Ex. 4.17}

expDist_Ten_Months <- 1/10 # Exponential distribution with mean of 10 months

# (a) - randomly selected light bulb lasts between 6 and 15 months (15 mo. pexp - 6 mo. pexp)
Six_to_Fifteen <- pexp(15, rate = expDist_Ten_Months) - pexp(6, rate = expDist_Ten_Months)
print(Six_to_Fifteen)

# (b) - 95th percentile. Probability of the light bulb to last 36 months+ compare to `0.05`?
Percentile_95th <- qexp(0.95, rate = expDist_Ten_Months)
over_36 <- pexp(36, rate = expDist_Ten_Months, lower.tail = FALSE)
print(over_36)

comparison_to_.05 <- over_36 < 0.05
print(comparison_to_.05)

# (c) - Probability that a light bulb that has lasted for 10 months will last 25 months+
Ten_months_last_25_months_plus <- pexp(15, rate = expDist_Ten_Months, lower.tail = FALSE)
print(Ten_months_last_25_months_plus)

```

**Problem 7 (10 pts) (IND-OPT) - Pr. 8 page 400** A population has the following elements: 2, 5, 8, 12, 13 (**finite population**).

(a) Enumerate all the samples of size 2 that can be drawn with and without replacement. (Hint: Use the `srs()` function from the `PASWR2` package)

(b) Calculate the mean of the population.

(c) Calculate the variance of the population.

(d) Calculate the standard deviation of the population.

(e) Calculate the mean of the sample mean, $E[X]$.

(f) Calculate the variance of the sampled mean, $Var(\bar X)$

(g) Calculate the standard deviation of the sample mean.

(h) Calculate the mean of the sample variance, $E[S^2]$

(i) Is the variance of $\bar X$ larger when sampling with or without replacement? Explain your answer

**Solution:**

**YOUR CODE HERE:**

```{r}

#individual optional

```

**Problem 8 (10 pts) - similar to Pr. 10 / page 400 in text**

Use the data frame `WHEATUSA2004` from the `PASWR2` package; draw all samples of sizes 4, 5, and 6; and calculate the mean of the means for the variable acres (wheat surface area measured in thousands of acres). . What size provides the best approximation to the population mean? What is the variance of these means?

**Hint:** Use the `srs()` function from the `PASWR2` package to draw simple random samples.

**YOUR CODE HERE:**

```{r}

#refrencing sampling distributions exercises line 206
library(PASWR2)

# Sample Size 4:
SRS4 <- srs(WHEATUSA2004$acres, 4)
xbarSRS4 <- apply(SRS4, 1, mean) # MEAN
c(mean(xbarSRS4), var(xbarSRS4)) # MEAN & VARIANCE

# Sample Size 5:
SRS5 <- srs(WHEATUSA2004$acres, 5) 
xbarSRS5 <- apply(SRS5, 1, mean) # MEAN
c(mean(xbarSRS5), var(xbarSRS5)) # MEAN & VARIANCE

# Sample Size 6:
SRS6 <- srs(WHEATUSA2004$acres, 6) 
xbarSRS6 <- apply(SRS6, 1, mean) # MEAN
c(mean(xbarSRS6), var(xbarSRS6)) # MEAN & VARIANCE

```
**Solution: Sample Size 6 has the smallest variance of 397374.397 (compared to Sample size 4's 645755.872 and Sample size 5's 496720.646**

**Problem 9 (10 pts) - Pr. 16 / page 513 in text**

A group of engineers working with physicians in a research hospital is developing a new device to measure blood glucose levels. Based on measurements taken from patients in a previous study, the physicians assert that the new device provides blood glucose levels slightly higher than those provided by the old device. To corroborate their suspicion, `15` diabetic patients were randomly selected, and their blood glucose levels were measured with both the new and the old devices. The measurements, in `mg/100 ml`, appear in data frame `GLUCOSE` from the `PASWR2` package:

(a) Are the samples independent? Why or why not?
(b) If the blood glucose level is a normally distributed random variable, compute a 95% confidence interval for the mean differences of the population.
(c) Use the results in (b) to decide whether or not the two devices give the same results.

**Solution:**

(a) The samples are not independent. The patients each have both measurements obtained from old device as well as their newer device. The statistics are dependent on each other

(b) The 95% confidence level interval for the mean differences of the population is −16.3614 to −12.7586.

**YOUR CODE HERE:**

```{r}
GLUCOSE <- GLUCOSE %>% mutate(DIFF = old-new) # create variable DIFF for the difference between old and new level

head(GLUCOSE)

ggplot(data = GLUCOSE, aes(sample = DIFF)) +
  stat_qq() +
  theme_bw()

CI <- t.test(GLUCOSE$DIFF)$conf # use t-test since the sample size is small < 30
CI
```

The 95% confidence interval for the mean differences of the population is `[−16.3614,−12.7586]`.

(c) The confidence interval suggests that on average, the new device reports higher blood glucose levels than the old device.

**Problem 10 (10 pts) (IND-OPT) - similar to Pr. 9 / page 511 in text**

A large company wants to estimate the proportion of its accounts that are paid on time.

(a) How large a sample is needed to estimate the true proportion within 2% with a 95% confidence level?

(b) Suppose 700 out of 800 accounts are paid on time. Construct 95% confidence intervals for the true proportion of accounts that are paid on time using an asymptotic confidence interval, an Agresti Coull confidence interval.

Hint: Use the `nsize()` and `binom.confint()` from the `library(binom)`

**Solution:**

**YOUR CODE HERE:**

```{r}
library(binom)
library(binom)
#(a) 

#(b)

#individual optional

```

**Problem 11 (10 pts) - Pr. 10 / page 511 in text**

In a study conducted at **Appalachian State University**, students used digital oral thermometers to record their temperatures each day they came to class. A randomly selected day of student temperatures is provided in the following table and in the data frame `STATTEMPS`. Information is also provided with regard to subject gender and the hour of the day when the students' temperatures were measured.

Direction: load the data with `data(STATTEMPS)` having the `library(PASWR2)` loaded.

**Hint:** Use `t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)` the `x` ad `y` values are given with formula expression `formula = temperature ~ gender`. Then use the `$` access to obtain the confidence interval: `CI <- t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)$conf`

(a) Construct a 95% confidence interval for the true average temperature difference between females and males. Does the interval contain the value zero? What does this suggest about gender temperature differences?

(b) Construct a 95% confidence interval for the true average temperature difference between students taking their temperatures at 8 a.m. and students taking their temperatures at 9 a.m. Give a reason why one group appears to have a higher temperature reading.

**Solution:**

**YOUR CODE HERE:**

```{r}
library(PASWR2)
data(STATTEMPS)

#(a) 95% confidence interval for the true average temperature difference between females and males

t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)
CI <- t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)$conf

#(b) 95% confidence interval for the true average temperature difference between students taking their temperatures at 8 a.m. and students taking their temperatures at 9 a.m

t.test(temperature ~ class, data = STATTEMPS, mu = 0, paired = FALSE)
CI <- t.test(temperature ~ class, data = STATTEMPS, mu = 0, paired = FALSE)$conf

```

A `95%` confidence interval for the true average difference between students taking their temperatures at `8` a.m. and students taking their temperatures at `9` a.m. is `[−2.4965,−0.2564]`. Note that this interval does not contain `0`, indicating that the there is evidence to suggest students in the `8` a.m. class have temperatures that are not as warm as the `9` a.m. class. One possible explanation is that students roll straight out of bed and into the `8` a.m. class. Consequently, their temperatures are closer to their sleeping temperatures which are lower than their waking temperatures.

## 2. BRAINSTORMING PART:

### How do we form confidence intervals when normality assumption is not reasonable and we have small sample size so the Central Limit Theorem does not apply?

> Look for the answer with the following :

**Empirical bootstrap confidence interval for the mean.**

For the data contained in a vector `x = c(36,30,37,43,42,43,43,46,40,42)` Estimate the mean `μ` of the underlying distribution and give an `90%` bootstrap confidence interval.

```{r}
x = c(36,30,37,43,42,43,43,46,40,42)
qqnorm(x) # see if the data appear to follow normal distribution
```

> Data is not quite normal as it appears on the Q-Q PLOT!

```{r}

n = length(x)
set.seed(25)  

# sample mean
xbar = mean(x)

cat("data mean = ",xbar,'\n')

nboot = 10000 # number of bootstrap samples 

# Generate 10000 bootstrap samples, i.e. an n x 10000 array of random resamples from x.
tmpdata = sample(x,n*nboot, replace=TRUE) # sample with replacement
bootstrapsample = matrix(tmpdata, nrow=n, ncol=nboot)

# Compute the sample mean xbar for each bootstrap sample
xbarstar = colMeans(bootstrapsample)

# Compute delta* for each bootstrap sample (difference between the original mean and the bootstrap sample mean)
deltastar = xbarstar - xbar

# Find the 0.1 and 0.9 quantiles for deltastar
d = quantile(deltastar,c(0.05,0.95)) # for the 905 CI we need 90% area between the quantiles


# Calculate the 90\% confidence interval for the mean.
ci = xbar - c(d[2],d[1])
ci
```

> We can achieve the same CI with the functions `boot()` and `boot.ci()` from the package `boot`

```{r}

MEAN <- function(data, i){
  d <- data[i]
   M <- mean(d)
  }
boot.obj <- boot(x, statistic = MEAN, 10000)
CI <- boot.ci(boot.obj, conf = 0.90,type = "all") # all type of bootstrap intervals will be returned
CI
```

## 3. BOOTSTRAP ESTIMATION PART

> Review the problem below.

**Pr. 30 / page 695 in text**

The "Wisconsin Card Sorting Test" is widely used by psychiatrists, neurologists, and neuropsychologists with patients who have a brain injury, neurodegenerative disease, or a mental illness such as schizophrenia. Patients with any sort of frontal lobe lesion generally do poorly on the test. The data frame `WCST` and the following table contain the test scores from a group of 50 patients from the Virgen del Camino Hospital (Pamplona, Spain).

(a) Use the function `eda()` from the `PASWR2` package to explore the data and decide if normality can be assumed. For details type in console `?eda`

(b) What assumption(s) must be made to compute a `95%` confidence interval for the population mean?

(c) Compute the confidence interval from (b).

(d) Compute a 95% BCa bootstrap confidence interval for the mean test score.

(e) Should you use the confidence interval reported in (c) or the confidence interval reported in (d)?

**Solution:**

(a) Assuming the variable score has a normal distribution is not reasonable.

See the results of EDA analysis below:

```{r}
data(WCST)

WCST %>% pull(score) %>% eda()
```

(b) In order to construct a `95%` confidence interval for the population mean, one assumes that the values in the variable score are taken from a **normal distribution**. Although this is not a reasonable assumption, the sample size might be sufficiently large to overcome the skewness in the parent population. Consequently, one might appeal to the **Central Limit Theorem** and claim that the sampling distribution of `X` is **approximately normal** due to the sample size `(50)`. In this problem, the skewness is quite severe, and one should not be overly confident in the final interval.

(c) If we assume normality one has:

```{r}
# CI <- with(data = WCST, t.test(score)$conf) # 

# or using piping
CI <- WCST %>% pull(score) %>% t.test() %>% .$conf
CI
```

(d) Use the `boot.ci()` nonparametric bootstrap CIs functions to obtain the bootstrap CI.

```{r}
library(boot) # use boot package
MEAN <- function(data, i){
  d <- data[i]
   M <- mean(d)
  }

set.seed(13)

B <- 10^4 - 1 # number of bootstrap samples

b.obj <- boot(data = WCST$score, statistic = MEAN, R = B) # bootstrap object of class "boot" containing the output of a bootstrap calculation
CIB <- boot.ci(b.obj, conf = 0.95, type = "bca")
CIB
```

**Problem 12 (10 pts)** Set the seed to `23` if your group index is odd number (i.e. **group 1,3,5,7**), and the seed to `89` if your group index is even number (i.e. **group 2,4,6,8**). Draw a random sample of size `200` from exponential distribution with mean $\lambda = 4$ ($rate = \frac{1}{4}$). Produce all bootstrap CI with the function `boot.ci` (use type = "all").

How many of them contain the true mean for the sampled distribution?

**YOUR CODE HERE:**

```{r}
library(boot)
set.seed(89)


stat.mean = function(x, indx){
  mean( x[indx] )
}

exps200 <- rexp(200,.25)

b.obj <- boot(data = exps200, statistic = stat.mean, 200)
CIB <- boot.ci(b.obj, type = "all")
CIB

#bootstrap variances needed for studentized intervals warning

```

**EXTRA CREDIT (10 pts)** Set the seed to `23` if your group index is odd number (i.e. **group 1,3,5,7**), and the seed to `89` if your group index is even number (i.e. **group 2,4,6,8**).

Draw a random sample of size `25` from $Bin(20,\frac{1}{5})$. Produce all bootstrap CI with the function `boot.ci` (use type = "all").

How many of them contain the true mean for the $Bin(20,\frac{1}{5})$?

YOUR CODE HERE:

```{r}
library(boot)
set.seed(89) 

stat.mean = function(x, indx){
  mean( x[indx] )
}
exps25 <- rexp(20,1/5)

b.obj <- boot(exps25, statistic = stat.mean, 10000)
CIB <- boot.ci(b.obj, type = "all")
CIB
```
