*Submit your homework to Canvas by the due date and time. Email your instructor if you have extenuating circumstances and need to request an extension.

*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.

*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or manual calculations on your exams, so practice accordingly.

*You must include an explanation and/or intermediate calculations for an exercise to be complete.

*Be sure to submit the HWK3 Autograde Quiz which will give you ~20 of your 40 accuracy points.

*50 points total: 40 points accuracy, and 10 points completion

Sampling Distributions

Exercise 1: A serving of a specific type of yogurt has a sugar content that is well approximated by a Normally distributed random variable \(X\) with mean 13 g and variance: \(1.3^2 g^2\). We can consider each serving as an independent and identical draw from X.

  a. In what percent of servings will the sugar content be above 13.3 g?
pnorm(13.3,13,1.3,lower.tail=FALSE)
## [1] 0.408747

Approximately 40.87% of servings will have a sugar content above 13.3 g.

  b. What is the probability that a randomly chosen serving will have a sugar content between 13.877 and 12.123? What do we call the difference: 13.877-12.123=1.754?
pnorm(13.877,13,1.3)
## [1] 0.7500399
pnorm(12.123,13,1.3)
## [1] 0.2499601

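The probability is 0.7500 - 0.2500 = 0.50, so there is a 50% probability that the sugar content will be between those values. Because 12.123 and 13.877 are the 25th and 75th percentiles, their difference (1.754) is the interquartile range.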

  c. Calculate the probability that in 6 servings, only 1 has a sugar content below 13 g.
dbinom(1,6,0.5)
## [1] 0.09375

There is a 9.375% chance that exactly 1 serving in 6 servings has a sugar content below 13 g.

  d. Describe the sampling distribution for the mean sugar content of 6 servings \(\bar{X}\). (Give shape, mean, and standard deviation or variance, if possible)

The mean of \(\bar{X}\) will also be 13, the standard deviation of \(\bar{X}\) is 1.3/sqrt(6) = 0.53, and the variance is 1.3^2/6 = 0.28. Because the original variable is normally distributed, the sampling distribution of \(\bar{X}\) is also normally distributed.

(std.sample.six=1.3/sqrt(6))
## [1] 0.5307228
1.3^2/6
## [1] 0.2816667
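
The values above follow from the usual rules for the mean of n iid draws. A short derivation, writing \(X_1,\dots,X_6\) for the six servings (notation introduced here only for illustration):

```latex
\[
E[\bar{X}] = \frac{1}{6}\sum_{i=1}^{6} E[X_i] = 13, \qquad
\mathrm{Var}(\bar{X}) = \frac{1}{6^2}\sum_{i=1}^{6} \mathrm{Var}(X_i) = \frac{1.3^2}{6} \approx 0.2817, \qquad
\mathrm{SD}(\bar{X}) = \frac{1.3}{\sqrt{6}} \approx 0.5307
\]
```
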
  e. What is the interquartile range of the sampling distribution for the sample mean \(\bar{X}\) when n=6? Is that value larger or smaller than the IQR implied in part b? Why do the relative sizes of the IQRs make sense?
qnorm(0.75,13,0.5307)-qnorm(0.25,13,0.5307)
## [1] 0.7159034

The IQR of \(\bar{X}\) is 0.716, which is smaller than the original IQR of 1.754. This makes sense because it is the IQR of sample means: averaging pulls extreme values toward the center, so the standard deviation shrinks from 1.3 to 1.3/sqrt(6) = 0.53 and the distribution of \(\bar{X}\) is much narrower.

  f. What is the probability that the mean sugar content in 6 servings is more than 13.3 g ?
pnorm(13.3,13,std.sample.six, lower.tail=FALSE)
## [1] 0.2859461

The probability that the mean sugar content in 6 servings is more than 13.3 g is 28.6%.

  g. Is it more or less likely that the mean sugar content is above 13.3 g in 10 servings or 6 servings (as computed in f)? Can you explain it without actually computing the new probability?

It is less likely with 10 servings than with 6. As n increases, the standard deviation of \(\bar{X}\), which is 1.3/sqrt(n), decreases. The distribution of \(\bar{X}\) therefore becomes narrower around 13, and less of its probability lies above 13.3.
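
Although the question asks for an explanation without computing, the reasoning can be checked numerically. The sketch below is only a sanity check; `prob_mean_above` is a hypothetical helper name, and it assumes the same \(N(13, 1.3^2)\) population as above.

```r
# Upper-tail probability P(Xbar > cutoff) when Xbar ~ N(mu, sigma^2 / n).
# prob_mean_above is a hypothetical helper, not part of the assignment.
prob_mean_above <- function(n, cutoff = 13.3, mu = 13, sigma = 1.3) {
  pnorm(cutoff, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)
}

p_six <- prob_mean_above(6)    # same quantity as the n = 6 calculation
p_ten <- prob_mean_above(10)   # narrower distribution, so a smaller tail
c(n6 = p_six, n10 = p_ten)
```

The n = 10 value should come out below the n = 6 value, matching the argument about the shrinking standard deviation.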

  h. Suppose each large yogurt container of this type contains 10 servings and consider the total sugar content in each container as a sum of 10 iid random draws from \(X \sim N(13, 1.3^2)\). If you were to eat a whole large container of yogurt, above what total sugar content would you consume with 95% probability? Show and briefly explain your calculations.
(std.sample.ten=1.3*sqrt(10))
## [1] 4.110961
(mean.sample.ten=10*13)
## [1] 130
qnorm(0.05,130,std.sample.ten)
## [1] 123.2381

With 95% probability, you would consume more than 123.24 g of sugar by eating the entire container. The total of 10 servings is Normal with mean 10*13 = 130 and standard deviation 1.3*sqrt(10) = 4.11, and qnorm(0.05, ...) returns the 5th percentile of that distribution. Only 5% of containers fall below this cutoff, so 95% of the time eating the entire container will result in consuming more sugar than the calculated value.
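
The same cutoff can be reached by hand with a Z table. A sketch of the manual calculation, writing \(S\) for the total of the 10 servings (a symbol introduced here for illustration) and using \(z_{0.05} \approx -1.645\):

```latex
\[
S = X_1 + \dots + X_{10} \sim N\!\left(10 \cdot 13,\; 10 \cdot 1.3^2\right), \qquad
\sigma_S = 1.3\sqrt{10} \approx 4.111
\]
\[
s_{0.05} = \mu_S + z_{0.05}\,\sigma_S \approx 130 - 1.645 \times 4.111 \approx 123.24
\]
```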

Exercise 2: You will be comparing the sampling distributions for two different estimators of \(\sigma\), the population standard deviation.

When trying to estimate the standard deviation of a population (\(\sigma\)) from a sample we could use:

The graphs below give the sampling distributions produced by these estimators when drawing a sample of size 8 from a normal population with mean \(\mu_x=3\) and standard deviation \(\sigma_X=5\).

  a. What do you notice about the mean of the standard deviations produced using the \(s_1\) estimator compared to the \(s_2\) estimator compared to the true population standard deviation? Why do we prefer to use the \(s_1\) formulation when we have a sample of data and are interested in estimating the population standard deviation? (You should use the resulting histograms to help you answer the question and use the word “bias”.)

The mean of the standard deviations using the \(s_1\) estimator is much closer to the true population standard deviation than the mean using the \(s_2\) estimator. The histograms show that \(s_1\) has far less bias, whereas \(s_2\) is clearly biased because it underestimates the population standard deviation on average. (Strictly, it is \(s_1^2\) that is unbiased for \(\sigma^2\); \(s_1\) itself still runs slightly low for small n.) Upon more research, this is due to the fact that the sample mean is itself an estimate, which costs a degree of freedom, and the n-1 term accounts for this loss.
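
The comparison can be reproduced with a small simulation. This is only a sketch under stated assumptions: it assumes \(s_1\) divides the sum of squared deviations by n-1 (as R's `sd()` does) and \(s_2\) divides by n; the seed and the number of repetitions are arbitrary choices.

```r
# Assumption: s1 uses divisor n - 1 (what sd() computes), s2 uses divisor n.
# Population: Normal with mean 3 and sd 5; each sample has size 8.
set.seed(1)        # arbitrary seed, only for reproducibility
n <- 8
reps <- 20000      # arbitrary number of simulated samples
s1 <- numeric(reps)
s2 <- numeric(reps)
for (i in 1:reps) {
  x <- rnorm(n, mean = 3, sd = 5)
  s1[i] <- sd(x)                              # divisor n - 1
  s2[i] <- sqrt(sum((x - mean(x))^2) / n)     # divisor n
}
c(mean_s1 = mean(s1), mean_s2 = mean(s2), sigma = 5)
```

Both averages should land below 5, with the \(s_2\) average noticeably further away, which is the bias the question asks about.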

Common Random Variables and Combining RV into Estimators

Exercise 3 Exit polling has been a controversial practice in recent elections, since early release of the resulting information appears to affect whether or not those who have not yet voted do so. Suppose that \(90\%\) of all registered Wisconsin voters favor banning the release of information from exit polls in presidential elections until after the polls in Wisconsin close. A random sample of 250 Wisconsin voters is selected (you can assume that the responses of those surveyed are independent). Let X be the count of people in the 250 who favor the ban.

  a. Calculate the probability that exactly 230 people in the sample of 250 favor the ban, that is \(P(X=230)\).
dbinom(230,250,0.9)
## [1] 0.05122197

There is roughly a 5.1% probability that exactly 230 of the 250 people favor the ban.

  b. Calculate the exact probability that 230 or more people in the sample of 250 favor the ban, that is \(P(X \ge 230)\). Hint: use a couple of R functions to help with this calculation.
pi=0.9
n=250
sum(dbinom(230:250,n,pi))
## [1] 0.1718898
1-pbinom(229,250,0.9)
## [1] 0.1718898
  c. What are the expected value (\(\mu_X\)) and standard deviation (\(\sigma_X\)) of X?
(mu_X=pi*n)
## [1] 225
(sigma_X=sqrt(n*pi*(1-pi)))
## [1] 4.743416
  d. We can consider X the sum of 250 iid random draws from the population Y where P(Y=1)=0.90 and P(Y=0)=0.10. That is \(X=Y_1+Y_2+\dots+Y_{250}\). What do we think will be true about the shape of the distribution of X? What theorem are you using?

The central limit theorem tells us the sampling distribution of a sum will be approximately normal for a large n. Typically n should be greater than 30. Our n of 250 is much greater than 30 (and both n*pi = 225 and n*(1-pi) = 25 are at least 10), so we can assume the shape of the distribution of X is approximately normal.

  e. To look at how well a normal curve approximates the distribution of X with n=250, \(\pi=0.90\) run the following code with the mean and sd values computed in (c) substituted for “MEAN_VALUE” and “SD_VALUE”, respectively.
MEAN_VALUE = 225
SD_VALUE = 4.74

plot(x=seq(150,250,1), y=dbinom(150:250, 250, prob=0.90), type='h', ylab="")
curve(dnorm(x, mean=MEAN_VALUE, sd=SD_VALUE), col="darkblue", lwd=2, add=TRUE, yaxt="n")

  f. Calculate the approximate probability that at least 230 people in a sample of 250 favor the ban, that is \(P(X \ge 230)\), assuming a Normal Distribution for X centered at the mean and sd found in c. Compare the value to that found in b. and explain why they are not exactly equal.
pnorm(230,225,4.743,lower.tail=FALSE)
## [1] 0.1458991

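The exact binomial calculation (0.172) is more precise than the normal approximation (0.146). The central limit theorem tells us that since n is large, the distribution of X is approximately normal, and the plot above supports this. However, it is still only an approximation: X is discrete while the normal curve is continuous, so the normal area above 230 leaves out part of the bar at X = 230, and the binomial distribution is slightly left skewed because \(\pi=0.90\) is close to 1. The two values are therefore not exactly equal.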
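
One common way to narrow the gap between the two answers is a continuity correction, which starts the normal area at 229.5 instead of 230. The exercise does not ask for it; the sketch below is only an illustration, and the variable names are not from the assignment.

```r
n <- 250
p <- 0.9                          # named p here to avoid masking R's built-in pi
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

exact     <- pbinom(229, n, p, lower.tail = FALSE)        # exact P(X >= 230)
plain     <- pnorm(230, mu, sigma, lower.tail = FALSE)    # normal, no correction
corrected <- pnorm(229.5, mu, sigma, lower.tail = FALSE)  # continuity correction
c(exact = exact, plain = plain, corrected = corrected)
```

The corrected value should sit much closer to the exact binomial probability than the uncorrected one.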

  g. Consider the sample proportion. Calculate the approximate probability that at least \(\hat{p}=\frac{230}{250}\) favor the ban, that is \(P(\hat{p} \ge 0.92)\), assuming a Normal Distribution of \(\hat{p}\) centered at the appropriate mean and standard deviation. Compare the value to that found in f. and explain the relationship between the values.
var.phat=(0.9*(1-0.9))/250
(std.phat=sqrt(var.phat))
## [1] 0.01897367
pnorm(0.92,0.9,std.phat,lower.tail=FALSE)
## [1] 0.1459203

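\(P(\hat{p} \ge 0.92)\) = 0.146. The values in g and f are essentially equal because both measure the same event: \(\hat{p}=X/250\), so \(\hat{p} \ge 0.92\) exactly when \(X \ge 230\). One calculation uses the count scale and the other uses the proportion scale. The tiny difference in the later decimals comes from rounding the standard deviation to 4.743 in f.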
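
Standardizing both versions shows why they agree: dividing the count by n rescales the mean and the standard deviation by the same factor, so the Z score is unchanged. A short worked check:

```latex
\[
z_{\text{count}} = \frac{230 - 225}{\sqrt{250(0.9)(0.1)}} = \frac{5}{4.7434} \approx 1.054, \qquad
z_{\text{prop}} = \frac{0.92 - 0.90}{\sqrt{0.9(0.1)/250}} = \frac{0.02}{0.018974} \approx 1.054
\]
```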

Exercise 4 Let \(X\) denote the number of flaws in a 1 in. length of copper wire. The probability mass function of \(X\) is given in the table below. It has mean: \(\mu=0.66\) and variance \(\sigma^2 = 0.5244\).

| Number of Flaws in length of wire (X) | Probability |
|---|---|
| x=0 | 0.48 |
| x=1 | 0.39 |
| x=2 | 0.12 |
| x=3 | 0.01 |

vals=c(0,1,2,3)
probs=c(0.48, 0.39, 0.12, 0.01)
(EV_pop=sum(vals*probs))
## [1] 0.66
(Var_pop=sum(probs*(vals-EV_pop)^2)) 
## [1] 0.5244
  a. Is the distribution of \(X\) left skewed, symmetric, or right skewed? How do you know?

It is right skewed because, based on the probability mass function, most of the probability is concentrated at the low values 0 and 1 (0.87 in total), with a thin tail extending to the right toward 3.

  b. In what percent of 1 in. length of copper wire will 1 or more flaws be observed?

P(X >= 1) = 1 - P(X = 0) = 1 - 0.48 = 0.52, so 1 or more flaws will be observed in 52% of 1 in. lengths of copper wire.

A random sample of 45 1 in. lengths of the copper wire is selected for review. Since this is a SRS from a very large population, we can consider the number of flaws on multiple draws from the population \(X_1, X_2, \dots, X_n\) iid to \(X\).

  c. The simulation below selects 45 lengths of copper wire from the population with replacement and computes the sample mean and sample sum. It then repeats this process many times, stores the sample mean and sample sum values in vectors and then creates histograms of those vectors of values. Identify which histogram displays (1) the population \(X\) values, (2) the simulated sampling distribution of the sample mean \(\bar{X}\), (3) the simulated sampling distribution of the sample sum \(S\). Briefly explain how you know.
items=c(rep(0, 48), rep(1, 39), rep(2, 12), rep(3, 1))
manytimes=500000
samp_mean=rep(9, manytimes)
samp_sum=rep(9, manytimes)
for (i in 1:manytimes){
  samp=sample(items, size=45, replace=TRUE)
  samp_mean[i]=mean(samp)
  samp_sum[i]=sum(samp)
}

par(mfrow=c(3,1))
hist(samp_sum, breaks=seq(0, 60, 1), main="Histogram A", xlab="")
hist(items, breaks=seq(-0.5, 3.5, 1), freq=FALSE, main="Histogram B", xlab="")
hist(samp_mean, breaks=seq(0,1.3, 0.01), main="Histogram C", xlab="")

par(mfrow=c(1,1))

Histogram B displays the population X values because X only takes on the discrete values 0, 1, 2, and 3. Histogram C displays the simulated sampling distribution of the sample mean \(\bar{X}\) because the average number of flaws per wire must fall between 0 and 3 and should be centered near 0.66, which matches that x scale. Histogram A displays the simulated sampling distribution of the sample sum S because the mean of that distribution would be 0.66*45 = 29.7 and that is where Histogram A is centered.

  d. Describe the sampling distribution (shape, mean, and standard deviation) of the sample mean number of flaws in 45 1 in. length of copper wire \(\bar{X}=\frac{X_1+X_2+...+X_{45}}{45}\) according to theory. Make sure to name any theorems you are using. (You can compute the mean and sd of one of the vectors constructed above to make sure your theoretical values are close to what you get in the simulation.)
(var_xbar=0.5244/45)
## [1] 0.01165333
(std_xbar=sqrt(var_xbar))
## [1] 0.1079506
mean(samp_mean)
## [1] 0.6596175
sd(samp_mean)
## [1] 0.1079139

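The mean of the sample mean is 0.66, the standard deviation is 0.108, and the distribution is approximately normal according to the central limit theorem, which we can rely on given our large n of 45. The simulated mean (0.6596) and standard deviation (0.1079) agree closely with these theoretical values.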

  e. According to your theoretical distribution in (d), what is the probability that the mean number of flaws in the 45 1 in. lengths of wire reviewed will be 1 or more items?
pnorm(1,0.66,0.10795,lower.tail = FALSE)
## [1] 0.0008174531

The probability is 0.08% that the mean number of flaws in the 45 lengths will be 1 or more.

  f. Explain why the value you found in e. was so much smaller than the value found in b.

The first probability is the chance that an individual wire will have 1 or more flaws. The second probability is the chance that the average number of flaws across all 45 wires is 1 or more, which requires at least 45 flaws in total. This is much less probable because about 48% of wires are expected to have 0 flaws, and each of those would have to be offset by a wire with 2 or 3 flaws, which are rare. This is very roughly similar to flipping a coin 45 times and it landing on heads almost every time.

  g. Consider the total number of flaws in the 45 1 in. lengths of copper wire. Describe the sampling distribution (shape, mean, and standard deviation) of \(Sum=X_1+X_2+...+X_{45}\) according to theory. Make sure to name any theorems you are using. (You can compute the mean and sd of one of the vectors constructed above to make sure your theoretical values are close to what you get in the simulation.)
(mu_sum=0.66*45)
## [1] 29.7
(std_sum=sqrt(0.5244)*sqrt(45))
## [1] 4.857777

The mean of the sample sum is 29.7, the standard deviation is 4.858, and the distribution is approximately normal based on the central limit theorem.

  h. Find an upper bound b such that the total number of flaws in 45 1 in. lengths of copper wire will be less than b with probability 0.95.
qnorm(0.95,29.7,4.857)
## [1] 37.68905

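The upper bound b is about 37.69. Because the total is a whole number of flaws, we can say the total will be less than 38 flaws with probability of roughly 0.95.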
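
Because the total is a whole number, the bound can also be checked against the exact distribution of the sum instead of the normal approximation. The sketch below builds that exact distribution by repeatedly convolving the probability mass function from the table. The variable names are illustrative, and this is a cross-check rather than the method the exercise asks for.

```r
probs <- c(0.48, 0.39, 0.12, 0.01)   # P(X = 0), P(X = 1), P(X = 2), P(X = 3)
pmf <- 1                             # empty sum: P(S = 0) = 1

# Add one wire at a time; after the loop, pmf[s + 1] = P(S = s) for s = 0..135
for (i in 1:45) {
  new_pmf <- numeric(length(pmf) + 3)
  for (k in 0:3) {
    idx <- seq_along(pmf) + k        # positions shifted by k flaws
    new_pmf[idx] <- new_pmf[idx] + pmf * probs[k + 1]
  }
  pmf <- new_pmf
}

totals <- 0:(length(pmf) - 1)
exact_below_38  <- sum(pmf[totals < 38])                       # exact P(S < 38)
normal_below_38 <- pnorm(38, mean = 29.7, sd = sqrt(45 * 0.5244))
c(exact = exact_below_38, normal = normal_below_38)
```

If the exact probability comes out a little below 0.95, the next whole number up would be the safer bound.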

Exercise 5

  a. Identify the Z critical value that should be used to construct a 95% level Z confidence interval.
qnorm(0.975)
## [1] 1.959964
  b. Identify the Z critical value that should be used to construct a 90% level Z confidence interval.
qnorm(0.95)
## [1] 1.644854
  c. Compare the values from (a) and (b). Explain how their relative magnitude corresponds to their related confidence interval size and function.

The Z critical value for the 90% confidence interval is 1.64 and for the 95% confidence interval it is 1.96. Because the margin of error is the critical value times the standard error, a 90% confidence interval has a smaller margin of error and gives a narrower interval, but we are only 90% confident that this interval contains the true value. In comparison, a 95% confidence interval has a larger margin of error, but we are more confident that the true value falls within the interval.
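
The pattern generalizes: a two-sided interval at a given confidence level leaves half of the remaining probability in each tail. A minimal sketch, where `z_crit` and `conf_levels` are hypothetical names chosen for illustration:

```r
# Two-sided Z critical value: put (1 - level) / 2 in each tail.
# z_crit is a hypothetical helper, not part of the assignment.
z_crit <- function(level) qnorm(1 - (1 - level) / 2)

conf_levels <- c(0.90, 0.95, 0.98, 0.99)
crit <- z_crit(conf_levels)
data.frame(level = conf_levels, z = round(crit, 3))
```

The critical value grows with the confidence level, which is why higher confidence always costs a wider interval for the same data.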

Estimating unknown population mean and proportion of success with point and interval estimators

Exercise 6 An automobile club pays for emergency road services (ERS) requested by its members. Upon examining a sample of 2927 ERS calls from the club members, the club finds that 1499 calls related to starting problems, 849 calls involved serious mechanical failures requiring towing, 498 calls involved flat tires or lockouts, and 81 calls were for other reasons.

  a. Construct a \(98\%\) confidence interval by hand for the proportion of all ERS calls from club members that are serious mechanical problems requiring towing services (after checking that necessary assumptions are well met).

Assumptions Check

  1. We have an SRS of iid draws from X.

  2. We have a sample size large enough that \(\bar{X} \approx N\): n*0.29 = 848.83, which is greater than 10, and n(1-0.29) = 2078.17, which is greater than 10.

  3. We either know \(\sigma\), or n is large enough that \(s\) is likely to be very close to \(\sigma\).

Computations

Parameter: proportion of ERS calls that are serious mechanical problems

Sample Size: 2927 Serious mechanical problems: 849 Not serious mechanical problems: 2078

(point_estimate=849/2927)
## [1] 0.2900581
(standard_error_phat=sqrt(point_estimate*(1-point_estimate)/2927))
## [1] 0.008387693
(CI=qnorm(0.99))
## [1] 2.326348
(ME=CI*standard_error_phat)
## [1] 0.01951269
(upper_boundary=ME+point_estimate)
## [1] 0.3095708
(lower_boundary=point_estimate-ME)
## [1] 0.2705454

We are 98% confident that the interval from 0.27 to 0.31 captures the true proportion of all ERS calls from club members that are serious mechanical failures.
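
The step-by-step calculation above can be wrapped in a small reusable function. This is only a sketch of the same large-sample Z interval for a proportion; `prop_z_ci` is a hypothetical helper name.

```r
# Large-sample Z interval for a proportion, mirroring the manual steps.
# prop_z_ci is a hypothetical helper, not part of the assignment.
prop_z_ci <- function(successes, n, level = 0.95) {
  p_hat <- successes / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  z <- qnorm(1 - (1 - level) / 2)      # two-sided critical value
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci_98 <- prop_z_ci(849, 2927, level = 0.98)
ci_98
```

Calling it with the towing counts should reproduce the lower and upper boundaries computed above.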

  b. The current policy rate the automobile club pays is based on the thought that \(20\%\) of services requested will be serious mechanical problems requiring towing. However, the insurance company claims that the auto club has a higher rate of serious mechanical problems requiring towing services. Using your confidence interval in part a, respond to the insurance company's claim.

Our interval supports the insurance company's claim. The entire 98% confidence interval (0.27 to 0.31) lies above 0.20, so we are 98% confident that the true proportion of services requested that are serious mechanical problems requiring towing is higher than 20%.

  c. The club wants to construct a \(95\%\) confidence interval for the proportion of members who want a chocolate fountain at the annual picnic. They want the margin of error to be less than 0.01. How large of a random sample of club members should they contact if they start with the assumption that \(50\%\) are in favor of a chocolate fountain at the picnic?

0.01 = qnorm(0.975) * (sqrt(0.5*0.5/n))

0.01^2 = qnorm(0.975)^2 * 0.5*0.5/n

n = ( qnorm(0.975)^2 * 0.5^2 ) / 0.01^2

(num_term_1 = (qnorm(0.975))^2)
## [1] 3.841459
(num_term_2 = (0.5^2))
## [1] 0.25
(denom_term = 0.01^2)
## [1] 1e-04
(num_term_1 * num_term_2)/ denom_term
## [1] 9603.647

Rounding up to the next whole person, our sample size n would need to be at least 9604 club members.
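
The algebra above can be packaged so that other margins or confidence levels are easy to try. A minimal sketch, where `sample_size_prop` is a hypothetical helper name and the default guess of 0.5 matches the assumption in the question:

```r
# Smallest n with z * sqrt(p(1 - p) / n) <= margin, rounded up.
# sample_size_prop is a hypothetical helper, not part of the assignment.
sample_size_prop <- function(margin, level = 0.95, p_guess = 0.5) {
  z <- qnorm(1 - (1 - level) / 2)
  ceiling(z^2 * p_guess * (1 - p_guess) / margin^2)
}

n_needed <- sample_size_prop(0.01)
n_needed
```

Using `ceiling()` rather than rounding to the nearest integer keeps the margin of error at or below the target.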