Foundations for Statistical Inference - Sampling Distributions

Question 4.4: Heights of adults

nrow(bdims)

## [1] 507

# (a) What is the point estimate for the average height of active individuals? What about the median?
sample_means10 <- rep(NA, 10)

for(i in 1:10){
  samp <- sample(bdims$hgt, 50)
  sample_means10[i] <- mean(samp)
}
print(paste("Point Estimate is", round(mean(sample_means10), 2)))

## [1] "Point Estimate is 170.39"

print(paste("Median is", round(median(sample_means10), 2)))

## [1] "Median is 170.7"

# (b) What is the point estimate for the standard deviation of the heights of active individuals? What about the IQR?
print(paste("Mean: ", round(mean(sample_means10), 2)))

## [1] "Mean:  170.39"

print(paste("Point Estimate for the standard deviation of the heights of active individuals: ", round(sd(sample_means10), 2)))

## [1] "Point Estimate for the standard deviation of the heights of active individuals:  1.51"

print(paste("IQR: ", IQR(sample_means10)))

## [1] "IQR:  1.3015"

# (c) Is a person who is 1m 80cm (180 cm) tall considered unusually tall? And is a person who is 1m 55cm (155cm) considered unusually short? Explain your reasoning.
(180-round(mean(bdims$hgt), 2))/round(sd(bdims$hgt), 2)

## [1] 0.9415515

(155-round(mean(bdims$hgt), 2))/round(sd(bdims$hgt), 2)

## [1] -1.715197

print("No, both heights are within 2 SD so they are not unusually tall or short.")

## [1] "No, both heights are within 2 SD so they are not unusually tall or short."

# (d) The researchers take another random sample of physically active individuals. 
# Would you expect the mean and the standard deviation of this new sample to be the ones given above?
# Explain your reasoning.
print("Not necessarily. Mean and Standard Deviation for the new sample can be different with another sample. Process of choosing the heights in a sample is not same, in general.")

## [1] "Not necessarily. Mean and Standard Deviation for the new sample can be different with another sample. Process of choosing the heights in a sample is not same, in general."

# (e) The sample means obtained are point estimates for the mean height of all active individuals,
# if the sample of individuals is equivalent to a simple random sample. What measure do we
# use to quantify the variability of such an estimate? Compute
# this quantity using the data from the original sample under the condition that the data are a
# simple random sample.
sample_means507 <- rep(NA, 507)

for(i in 1:507){
  samp <- sample(bdims$hgt, 1)
  sample_means507[i] <- mean(samp)
}
round(mean(sample_means507), 2)

## [1] 171.04

round(median(sample_means507), 2)

## [1] 170.2

round(sd(sample_means507), 2)

## [1] 9.38

print(paste("We calculate Standard Error in this scenario which is ", round(sd(sample_means507/sqrt(507)), 2)))

## [1] "We calculate Standard Error in this scenario which is  0.42"

Question 4.14: Thanksgiving Spending

Thanksgiving spending, Part I. The 2009 holiday retail season, which kicked off on November 27, 2009 (the day after Thanksgiving), had been marked by somewhat lower self-reported consumer spending than was seen during the comparable period in 2008. To get an estimate of consumer spending, 436 randomly sampled American adults were surveyed. Daily consumer spending for the six-day period after Thanksgiving, spanning the Black Friday weekend and Cyber Monday, averaged $84.71. A 95% confidence interval based on this sample is ($80.31, $89.11). Determine whether the following statements are true or false, and explain your reasoning.

(a) We are 95% confident that the average spending of these 436 American adults is between $80.31 and $89.11.

FALSE; Confidence interval refers to the population and not for the sample.

(b) This confidence interval is not valid since the distribution of spending in the sample is right skewed.

FALSE; Other conditions are met for Confidence interval and also the distribution is only slightly skewed

(c) 95% of random samples have a sample mean between $80.31 and $89.11.

FALSE; Only this particular sample has the Confidence level interval of thse values. Another sample may yield different values.

(d) We are 95% confident that the average spending of all American adults is between $80.31 and $89.11. True

(e) A 90% confidence interval would be narrower than the 95% confidence interval since we don’t need to be as sure about our estimate. True

(f) In order to decrease the margin of error of a 95% confidence interval to a third of what it is now, we would need to use a sample 3 times larger. False

(g) The margin of error is 4.4. True. 4.4 is the result of 1.96 * SE.

4.24 Gifted children, Part I (p211)

stat	value
n	36
min	21
mean	30.69
sd	4.31
max	39

a) Are conditions for inference satisfied?

Sample size is small, at 36, but above the minimum of 30. The distribution is a very rough normal shape. Maybe we can accept this as meeting the conditions for inference.

b) Suppose you read online that children first count to 10 successfully when they are 32 months old, on average. Perform a hypothesis test to evaluate if these data provide convincing evidence that the average age at which gifted children first count to 10 successfully is less than the general average of 32 months. Use a significance level of 0.10.

Setting up the hypothesis test as follows:

$H_0: \mu_g = 32$ (The gifted children’s average months is equal to 32 (the average for children in general).)

$H_A: \mu_g < 32$ (The gifted children’s average months is less than 32.)

$\alpha = 0.10$

a <- 0.10
xbar <- 30.69
sdX <- 4.31
n <- 36
SEx <- sdX / sqrt(n)
zXbar <- (xbar - 32) / SEx
zXbar

## [1] -1.823666

This is a one-sided hypothesis test as shown below:

pval <- pnorm(zXbar)
pval

## [1] 0.0341013

c) Interpret the p-value in context of the hypothesis test and the data.

The p-value of 0.0341 is much lower than the significance level $\alpha=0.1$. This suggests the gifted months mean of 30.69 is not even close to the 32 month average. Therefore, I conclude to reject the null hypothesis in favor of the alternative. In other words, it is implausible that we would see a mean from our sample as low as we did if there wasn’t a significant different between gifted and non-gifted children.

d) Calculate the 90% confidence interval for the average age at which gifted children first count to 10 successfully.

The following R code computes the 90% confidence interval.

# Determine z score of 0.10
theZ <- abs(qnorm(a))
theZ

## [1] 1.281552

# Compute the confidence interval
lower <- xbar - (theZ * SEx)
upper <- xbar + (theZ * SEx)
ci <- c(lower, upper)
ci

## [1] 29.76942 31.61058

The 90% confidence interval of the gifted children is 29.7694188 - 31.6105812.

e) Do your results from the hyposthesis test and the confidence interval agree? Explain.

The results agree because the range of the confidence interval does not overlap the average of 32 for non-gifted children. If the CI range had overlapped, this would indicate that 32 might be the population mean for the gifted children and would have caused us to fail to reject the null hypothesis.

4.26 Gifted Children, Part II (p212)

a) Perform a hypothesis test to evaluate if these data provide a convincing evidence that the average IQ of mothers of gifted children is different that the average IQ for the population at large, which is 100. Use a significance level of 0.10.

Setting up the hypothesis test as follows:

$H_0: \mu_g = 100$ (The gifted children’s mother’s IQ is equal to the average IQ.)

$H_A: \mu_g > 100$ (The gifted children’s mother’s IQ is greater than the average IQ)

$\alpha = 0.10$

n <- 36
xbar <- 118.2
sdX <- 6.5
SEx <- sdX / sqrt(n)
zXbar <- (xbar - 100) / SEx
zXbar

## [1] 16.8

This is a one-sided, upper tail hypothesis, though the rejection region is so small that it is not visible in the visualization.

Computing the p-value. Since this is an upper tail test, we subtract from 1.

pval <- 1 - pnorm(zXbar)
pval

## [1] 0

The p-value of 0 is much lower than the significance level $\alpha=0.1$. This suggests the IQs of mothers of gifted children mean of 118.2 is not even close to the 32 month average. Therefore, I conclude to reject the null hypothesis in favor of the alternative. In other words, it is implausible that we would see a mean from our sample as high as we did if there wasn’t a significant different between gifted and non-gifted mother’s IQ.

b) Calculate the 90% confidence interval for the average IQ of mothers of gifted children.

The following R code computes the 90% confidence interval.

# Determine z score of 0.10
theZ <- abs(qnorm(a))
theZ

## [1] 1.281552

# Compute the confidence interval
lower <- xbar - (theZ * SEx)
upper <- xbar + (theZ * SEx)
ci <- c(lower, upper)
ci

## [1] 116.8117 119.5883

The 90% confidence interval of the IQ of mother’s of gifted children is 116.8116525 - 119.5883475.

c) Do your results from the hypothesis test and the confidence interval agree? Explain.

Yes, the results agree. The confidence interval for Mother’s IQ for gifted children is well above the 100 average for mother’s of non-gifted children.

4.34 CLT.

Define the term “sampling distribution” of the mean, and describe how the shape, center, and spread of the sampling distribution of the mean change as sample size increases.

The sampling distribution of the mean is the distribution of mean values from repeated samples from a population. The shape is approximately normal, with a center at the population mean. The shape more closely approximates the normal distribution as more samples are taken and included. This also will move the center closer to the population mean. Likewise, the spread of the sampling distribution will narrow around the population mean as more samples are included.

4.40 CFLBs.

A manufacturer of compact fluorescent light bulbs advertises that the distribution
of the lifespans of these light bulbs is nearly normal with a mean of 9,000 hours and a standard
deviation of 1,000 hours.

(a) What is the probability that a randomly chosen light bulb lasts more than 10,500 hours?

bulbmean <- 9000
stdev <- 1000
samplebulb <- 10500
p <- round(1 - pnorm(samplebulb, mean = bulbmean, sd = stdev),4)
p

## [1] 0.0668

(b) Describe the distribution of the mean lifespan of 15 light bulbs.

n <- 15 
bulbs15 <- rnorm(n, mean = bulbmean, sd = stdev)
mu15 <- mean(bulbs15)
sd15 <- sd(bulbs15)


hist(bulbs15, probability = TRUE)
x <- 0:15000
y15 <- dnorm(x = x, mean = mu15, sd = sd15)
y <- dnorm(x = x, mean = bulbmean, sd = stdev)
lines(x = x, y = y15, col = "red")
abline(v=mu15,col="red")
lines(x = x, y = y, col = "blue")
abline(v=bulbmean,col="blue")

(c) What is the probability that the mean lifespan of 15 randomly chosen light bulbs is more than 10,500 hours?

n <- 15
x <- 10500
mu <- 9000
sd <- 1000

SE15 <- sd15/sqrt(n)
p15 <- round((1 - pnorm(x, mean = mu, sd = SE15)) * 100,4)
p15

## [1] 0

The probability that the mean lifespan of 15 randomly chosen light bulbs is more than 10,500 hours is approximately 0

(e) Could you estimate the probabilities from parts (a) and (c) if the lifespans of light bulbs had a skewed distribution? If you have a skewed distribution, we can not estimate the probabilities as one of the assumptions in order to perform these calculations is that there must be a normal distributions.

4.48 Same observation

different sample size. Suppose you conduct a hypothesis test based on a sample where the sample size is n = 50, and arrive at a p-value of 0.08. You then refer back to your notes and discover that you made a careless mistake, the sample size should have been n = 500. Will your p-value increase, decrease, or stay the same? Explain.

As the sample size increases the spread becomes narrower and the sd deviation becomes smaller. A smaller SD results in a larger z-score which decreases the p-value. That is, if the null hypothesis is true, as you increase the sample size, you stregthen the case that the null hypothesis is true (p-value decreases).