Submit the project either as a group or as an individual work.
Project consist of three (3) parts: 1. Problems Solving, 2. Brainstorming, and 3. Bootstrap estimation.
For group project submisison: credit will be given to all students listed as contributors - all listed problem must be completed. Maximum score 100 pts + 10 pts (Extra Credit).
For individual submissions: problems listed with label (IND-OPT) (individual-optional) in parenthesis are not required and may be omitted. For a maximum score 80 pts + 10 pts (Extra Credit).
Loading required package: lattice
Attaching package: ‘PASWR2’
The following object is masked _by_ ‘.GlobalEnv’:
GLUCOSE
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Attaching package: ‘boot’
The following object is masked from ‘package:lattice’:
melanoma
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
Note: If you Rmd file submission knits
you will receive total of (5 points)
Problem 1 (5 pts) - similar to Pr. 5 / page 238 in text A box contains 30 consecutive balls numbered 1 to 30. If four numbers are drawn at random, how many ways are there for the largest number to be 19 and the smallest number to be 9?
Use the Fundamental Principle of Counting!
Solution: 45
YOUR CODE HERE:
choose(10, 2)
[1] 45
Problem 2 (5 pts) - similar to Pr. 14 / page 239 in text On a multiple-choice exam with four possible answers for each of the five questions, what is the probability that a student would get four or more correct answers just by guessing?
Hint: Use the fact that \(P(E ∩ F) = P(E) · P(F)\) for two independent event (generalized for more than events) Getting one answer correct is independent of another.
Also
\[P(\text{at least } 4) = P(\text{exactly } 4) + P(\text{exactly } 5)\]
the last events are mutually exclusive so \(P(A \cup B) = P(A) + P(B)\)
YOUR CODE HERE:
x <- sum(dbinom(x = 4:200, size = 5, prob = 0.25))
x
[1] 0.015625
x <- 200 * x
x
[1] 3.125
Out of 200 students who adopt this test taking approach how many are expected to get at least 4 correct? Hint: Use Binomial experiment settings to answer this questions.
Problem 3 (5 pts) (IND-OPT) - similar to Pr. 44 page 243 in
text Consider tossing four fair coins. There
are 16 possible outcomes, e.g HHHH,HHHT,HHTH,HTHH,THHH, ...
possible outcomes. Define X as the random variable “number
of heads showing when four coins are tossed.” Obtain the mean and the
variance of X. Simulate tossing four fair coins 10,000 times. Compute
the simulated mean and variance of X. Are the simulated values within 2%
of the theoretical answers?
Hint: To find the theoretical values use
dbinom (x= , size = , prob = )
Solution:
YOUR CODE HERE:
library(MASS)
Problem 4 (10 pts) - similar to Pr. 11 / page 307 in text
Traffic volume is an important factor for determining the most cost-effective method to surface a road. Suppose that the average number of vehicles passing a certain point on a road is 5 every 60 seconds.
Hint: Adjust the Poisson parameter \(\lambda = 5\) every 60 seconds to
# per every 3 minutes
Solution: For 12 cars in 3 mins the probability is 73.2389%, for 100 cars in 30 mins it is 99.99909%
YOUR CODE HERE:
1 - ppois(12, 15)
[1] 0.732389
1 - ppois(100, 150)
[1] 0.9999909
Problem 5 (10 pts) - similar to Pr. 18 / page 308 in text Suppose the percentage of drinks sold from a vending machine are 70% and 30% for soft drinks and bottled water, respectively.
Hint: Let X = number of waters (failures) purchased
before the first soft drink is purchased. Then, \(X \sim Geo(0.70)\).
Solution: The probability that on a randomly selected day, the first soft drink is the fourth drink sold is 1.89%. The probability that exactly 4 out of 10 drinks sold is a soft drink is 3.675691%.
YOUR CODE HERE::
p_success <- 0.70
prob_first_soft_drink_fourth <- (1 - p_success)^3 * p_success
# Result
prob_first_soft_drink_fourth
[1] 0.0189
X = number of soft drinks sold. Then, \(X \sim Bin(10, 0.70)\) and \(P(X = 1) = 0\) since one has:n_drinks <- 10
k_soft_drinks <- 4
prob_exactly_4_soft_drinks <- choose(n_drinks, k_soft_drinks) * p_success^k_soft_drinks * (1 - p_success)^(n_drinks - k_soft_drinks)
prob_exactly_4_soft_drinks
[1] 0.03675691
Problem 6 (10 pts) - Exponential Distribution: Light Bulbs If the life of a certain type of light bulb has an exponential distribution with a mean of 10 months, find
0.05?Solution: The probability that a randomly selected light bulb lasts between 6 and 15 months is 32.56815. The 95th percentile is 29.95732. The probability of the light bulb to last more than 36 months compare to 0.05 is 2.732372. The probability that a light bulb that has lasted for 10 months will last more than 25 months is 22.31302%.
YOUR CODE HERE:
pexp(15, 1/10) - pexp(6, 1/10) #a
[1] 0.3256815
qexp(0.95, 1/10) #b
[1] 29.95732
1 - pexp(36, 1/10) #b
[1] 0.02732372
pexp(25, 1/10, lower = FALSE)/pexp(10, 1/10, lower = FALSE) #c
[1] 0.2231302
Problem 7 (10 pts) (IND-OPT) - Pr. 8 page 400 A population has the following elements: 2, 5, 8, 12, 13 (finite population).
Enumerate all the samples of size 2 that can be drawn with and
without replacement. (Hint: Use the srs() function from the
PASWR2 package)
Calculate the mean of the population.
Calculate the variance of the population.
Calculate the standard deviation of the population.
Calculate the mean of the sample mean, \(E[X]\).
Calculate the variance of the sampled mean, \(Var(\bar X)\)
Calculate the standard deviation of the sample mean.
Calculate the mean of the sample variance, \(E[S^2]\)
Is the variance of \(\bar X\) larger when sampling with or without replacement? Explain your answer
Solution:
YOUR CODE HERE:
Problem 8 (10 pts) - similar to Pr. 10 / page 400 in text
Use the data frame WHEATUSA2004 from the
PASWR2 package; draw all samples of sizes 4, 5, and 6; and
calculate the mean of the means for the variable acres (wheat surface
area measured in thousands of acres). . What size provides the best
approximation to the population mean? What is the variance of these
means?
Solution: All three sample sizes had the same mean of 1148.733. The variance of each size acre is 645755.872 for sample size 4, 496720.646 for sample size 5, and 397374.397 for sample size 6
Hint: Use the srs() function from the
PASWR2 package to draw simple random samples.
YOUR CODE HERE:
SRS2 <- srs(WHEATUSA2004$acres, 4)
xbarSRS2 <- apply(SRS2, 1, mean)
c(mean(xbarSRS2), var(xbarSRS2))
[1] 1148.733 645755.872
SRS3 <- srs(WHEATUSA2004$acres, 5)
xbarSRS3 <- apply(SRS3, 1, mean)
c(mean(xbarSRS3), var(xbarSRS3))
[1] 1148.733 496720.646
SRS4 <- srs(WHEATUSA2004$acres, 6)
xbarSRS4 <- apply(SRS4, 1, mean)
c(mean(xbarSRS4), var(xbarSRS4))
[1] 1148.733 397374.397
Problem 9 (10 pts) - Pr. 16 / page 513 in text
A group of engineers working with physicians in a research hospital
is developing a new device to measure blood glucose levels. Based on
measurements taken from patients in a previous study, the physicians
assert that the new device provides blood glucose levels slightly higher
than those provided by the old device. To corroborate their suspicion,
15 diabetic patients were randomly selected, and their
blood glucose levels were measured with both the new and the old
devices. The measurements, in mg/100 ml, appear in data
frame GLUCOSE from the PASWR2 package:
Solution:
These samples are not independent because these samples are both taken at the same time using similar devices.
The 95% confidence interval for the mean differences of
the population is [−16.3614,−12.7586].
Hint: look at the QQ-plot to see if normality is reasonable.
YOUR CODE HERE:
GLUCOSE <- GLUCOSE %>% mutate(DIFF = old-new) # create variable DIFF for the difference between old and new level
head(GLUCOSE)
ggplot(data = GLUCOSE, aes(sample = DIFF)) +
stat_qq() +
theme_bw()
CI <- t.test(GLUCOSE$DIFF)$conf # use t-test since the sample size is small < 30
CI
[1] -16.36145 -12.75855
attr(,"conf.level")
[1] 0.95
Problem 10 (10 pts) (IND-OPT) - similar to Pr. 9 / page 511 in text
A large company wants to estimate the proportion of its accounts that are paid on time.
How large a sample is needed to estimate the true proportion within 2% with a 95% confidence level?
Suppose 700 out of 800 accounts are paid on time. Construct 95% confidence intervals for the true proportion of accounts that are paid on time using an asymptotic confidence interval, an Agresti Coull confidence interval.
Hint: Use the nsize() and binom.confint()
from the library(binom)
Solution:
YOUR CODE HERE:
library(binom)
library(binom)
#(a)
#(b)
Problem 11 (10 pts) - Pr. 10 / page 511 in text
In a study conducted at Appalachian State
University, students used digital oral thermometers to record
their temperatures each day they came to class. A randomly selected day
of student temperatures is provided in the following table and in the
data frame STATTEMPS. Information is also provided with
regard to subject gender and the hour of the day when the students’
temperatures were measured.
Direction: load the data with data(STATTEMPS) having the
library(PASWR2) loaded.
Hint: Use
t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)
the x ad y values are given with formula
expression formula = temperature ~ gender. Then use the
$ access to obtain the confidence interval:
CI <- t.test(temperature ~ gender, data = STATTEMPS, mu = 0, paired = FALSE)$conf
Construct a 95% confidence interval for the true average temperature difference between females and males. Does the interval contain the value zero? What does this suggest about gender temperature differences?
Construct a 95% confidence interval for the true average temperature difference between students taking their temperatures at 8 a.m. and students taking their temperatures at 9 a.m. Give a reason why one group appears to have a higher temperature reading.
Solution: The interval for the first test (-1.054064 - 1.659598) contains 0 and it also implies that there is not a major difference in average temperatures between men and women. The p-value of 0.6458 also helps to prove that the mull hypothesis correct. For part b, a reason that one group have higher temps is due to variety of factors such as sleep, diet, and activity.
YOUR CODE HERE:
A 95% confidence interval for the true average
difference between students taking their temperatures at 8
a.m. and students taking their temperatures at 9 a.m. is
[−2.4965,−0.2564]. Note that this interval does not contain
0, indicating that the there is evidence to suggest
students in the 8 a.m. class have temperatures that are not
as warm as the 9 a.m. class. One possible explanation is
that students roll straight out of bed and into the 8 a.m.
class. Consequently, their temperatures are closer to their sleeping
temperatures which are lower than their waking temperatures.
Look for the answer with the following :
Empirical bootstrap confidence interval for the mean.
For the data contained in a vector
x = c(36,30,37,43,42,43,43,46,40,42) Estimate the mean
μ of the underlying distribution and give an
90% bootstrap confidence interval.
Data is not quite normal as it appears on the Q-Q PLOT!
We can achieve the same CI with the functions
boot()andboot.ci()from the packageboot
Review the problem below.
Pr. 30 / page 695 in text
The “Wisconsin Card Sorting Test” is widely used by psychiatrists,
neurologists, and neuropsychologists with patients who have a brain
injury, neurodegenerative disease, or a mental illness such as
schizophrenia. Patients with any sort of frontal lobe lesion generally
do poorly on the test. The data frame WCST and the
following table contain the test scores from a group of 50 patients from
the Virgen del Camino Hospital (Pamplona, Spain).
Use the function eda() from the PASWR2
package to explore the data and decide if normality can be assumed. For
details type in console ?eda
What assumption(s) must be made to compute a 95%
confidence interval for the population mean?
Compute the confidence interval from (b).
Compute a 95% BCa bootstrap confidence interval for the mean test score.
Should you use the confidence interval reported in (c) or the confidence interval reported in (d)?
Solution:
See the results of EDA analysis below:
In order to construct a 95% confidence interval for
the population mean, one assumes that the values in the variable score
are taken from a normal distribution. Although this is
not a reasonable assumption, the sample size might be sufficiently large
to overcome the skewness in the parent population. Consequently, one
might appeal to the Central Limit Theorem and claim
that the sampling distribution of X is
approximately normal due to the sample size
(50). In this problem, the skewness is quite severe, and
one should not be overly confident in the final interval.
If we assume normality one has:
boot.ci() nonparametric bootstrap CIs functions
to obtain the bootstrap CI.Problem 12 (10 pts) Set the seed to 23
if your group index is odd number (i.e. group 1,3,5,7),
and the seed to 89 if your group index is even number
(i.e. group 2,4,6,8). Draw a random sample of size
200 from exponential distribution with mean \(\lambda = 4\) (\(rate = \frac{1}{4}\)). Produce all
bootstrap CI with the function boot.ci (use type =
“all”).
How many of them contain the true mean for the sampled distribution?
All four of our bootstrap confidence intervals contain the true mean of 4.
YOUR CODE HERE:
set.seed(89)
# Set the sample
sample <- rexp(200, rate = 1/4)
library(boot)
# Find the true mean of the sampled distribution
MEAN <- function(data, i){
d <- data[i]
M <- mean(d)
}
B <- 10^4-1
b.obj <- boot(data = sample, statistic = MEAN, R = B)
CIB <- boot.ci(b.obj, type = "all")
Warning in boot.ci(b.obj, type = "all") :
bootstrap variances needed for studentized intervals
CIB
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 9999 bootstrap replicates
CALL :
boot.ci(boot.out = b.obj, type = "all")
Intervals :
Level Normal Basic
95% ( 3.534, 4.674 ) ( 3.514, 4.646 )
Level Percentile BCa
95% ( 3.557, 4.689 ) ( 3.594, 4.732 )
Calculations and Intervals on Original Scale
EXTRA CREDIT (10 pts) Set the seed to
23 if your group index is odd number (i.e. group
1,3,5,7), and the seed to 89 if your group index
is even number (i.e. group 2,4,6,8).
Draw a random sample of size 25 from \(Bin(20,\frac{1}{5})\). Produce all
bootstrap CI with the function boot.ci (use type =
“all”).
How many of them contain the true mean for the \(Bin(20,\frac{1}{5})\)?
YOUR CODE HERE: All four of our confidence intervals contain the true mean of 4.
set.seed(89) # uncomment by setting the value per your group designation
sample <- rbinom(n=25, size=20, prob = 1/5)
library(boot)
MEAN <- function(data, i){
d <- data[i]
M <- mean(d)
}
B <- 10^4-1
b.obj <- boot(data = sample, statistic = MEAN, R = B)
CIB <- boot.ci(b.obj, type = "all")
Warning in boot.ci(b.obj, type = "all") :
bootstrap variances needed for studentized intervals
CIB
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 9999 bootstrap replicates
CALL :
boot.ci(boot.out = b.obj, type = "all")
Intervals :
Level Normal Basic
95% ( 3.009, 4.509 ) ( 3.000, 4.520 )
Level Percentile BCa
95% ( 3.00, 4.52 ) ( 3.00, 4.48 )
Calculations and Intervals on Original Scale