Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height, and gender, for 507 physically active individuals.
The sample mean, 171.1 cm, is the point estimate for the average height; the sample median is 170.3 cm.
Given that the distribution is approximately normal, we can use the sample standard deviation as the point estimate for the population standard deviation (p. 173): SD = 9.4.
The point estimate for the IQR is the sample IQR: Q3 - Q1 = 177.8 - 163.8 = 14.
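For reference, here is a sketch of how these point estimates could be computed from the raw data, assuming it is loaded as the `bdims` data frame (as in the openintro package) with heights in the `hgt` column:

# Point estimates computed from the raw data (hypothetical access pattern)
mean(bdims$hgt)                      # sample mean
median(bdims$hgt)                    # sample median
sd(bdims$hgt)                        # sample standard deviation
IQR(bdims$hgt)                       # Q3 - Q1
quantile(bdims$hgt, c(0.25, 0.75))   # Q1 and Q3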
After storing the sample mean and standard deviation in R, we compute the Z-score for the 180 cm scenario.
# Sample mean and standard deviation of height (cm)
meanHeight <- 171.1
sdHeight <- 9.4
# Z-score for a height of 180 cm
x <- 180
zTall <- (x - meanHeight) / sdHeight
# Proportion of the distribution at or below 180 cm
pTall <- pnorm(zTall)
pTall
## [1] 0.8281318
Being 180 cm tall puts one 0.9468085 standard deviations above the mean, with 17.19% of people taller. As such, I would not consider being 180 cm tall particularly unusual, though it is taller than 82.81% of the sample.
# Z-score and percentile for a height of 155 cm
x <- 155
zShort <- (x - meanHeight) / sdHeight
pShort <- pnorm(zShort)
pShort
## [1] 0.0433778
Being 155 cm tall puts one 1.712766 standard deviations below the mean, with 95.66% of people taller. As such, I would consider being 155 cm tall unusual, with just 4.34% of the sample being shorter.
I would not expect identical mean and standard deviation in a new sample unless, by some coincidence, the cases in the second sample were identical to the first. That said, a new sample mean would vary around the population mean according to the sampling distribution, and the sample standard deviation would likewise vary around the population standard deviation.
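As an illustrative sketch (simulated data, not part of the study), we can draw two hypothetical samples from a normal population with the estimated parameters and compare their summary statistics:

# Two hypothetical samples of the same size from N(171.1, 9.4)
set.seed(1)
s1 <- rnorm(507, mean = meanHeight, sd = sdHeight)
s2 <- rnorm(507, mean = meanHeight, sd = sdHeight)
c(mean(s1), mean(s2))  # both close to 171.1, but not identical
c(sd(s1), sd(s2))      # both close to 9.4, but not identical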
We quantify the variability of point estimates through the standard error (SE), which is the standard deviation of the sampling distribution. Using R, we compute the SE for the sample below:
# Standard error of the mean: SD / sqrt(n)
n <- 507
seHeight <- sdHeight / sqrt(n)
seHeight
## [1] 0.4174687
More accurately, I would say we are 95% confident that the average spending of the population is between $80.31 and $89.11. We can be 100% confident that the average spending of the specific individuals in the survey is between the lower and upper bound, since it is computed directly from their responses. The confidence level of the interval is meant to measure the likelihood that the population parameter falls within the range.
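We can also back out the point estimate and margin of error implied by the reported interval:

# Midpoint and half-width of the reported interval
c((80.31 + 89.11) / 2, (89.11 - 80.31) / 2)
## [1] 84.71  4.40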
Although the sample’s distribution is right skewed, this affects neither the near-normal shape of the sampling distribution nor the confidence interval built from it when the sample size is sufficiently large. In our case, n=436, which is well above 100, so we rely on the Central Limit Theorem for a normal sampling distribution.
While this statement might be true in some particular scenario, it is not known to be true as a result of the 95% confidence interval of this data set. Rather, we expect 95% of confidence intervals constructed at this level from random samples to contain the population mean.
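A quick simulation sketch can illustrate this interpretation; the population parameters below are hypothetical stand-ins taken from this exercise (the interval midpoint of 84.71 and the standard deviation of 46.87 recovered further down):

# Simulate coverage of 95% confidence intervals over repeated samples
set.seed(606)
covered <- replicate(10000, {
  smp <- rnorm(436, mean = 84.71, sd = 46.87)
  moe <- 1.96 * sd(smp) / sqrt(436)
  abs(mean(smp) - 84.71) <= moe
})
mean(covered)  # should be close to 0.95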
Yes, exactly.
Correct. This is because the number of standard deviations around the mean that captures the middle 90% of the normal distribution (1.645) is smaller than the number associated with 95% (1.96).
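Both critical values come straight from the normal quantile function:

# Critical z-values for the middle 90% and 95% of the normal distribution
qnorm(0.95)   # 1.644854, bounds the middle 90%
qnorm(0.975)  # 1.959964, bounds the middle 95%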
First, we can recover the standard error by reversing the margin of error computation (MoE = 1.96 * SE, so SE = MoE / 1.96):
# SE recovered from the reported margin of error
n <- 436
SE <- 4.4 / 1.96
SE
## [1] 2.244898
The standard error is the standard deviation divided by the square root of the number of observations, so we can recover the standard deviation as SE * sqrt(n):
# Recover the sample standard deviation
stdev <- SE * sqrt(n)
stdev
## [1] 46.87485
Does tripling the sample size achieve a margin of error of 4.4 / 3 = 1.4666667, assuming the standard deviation stays the same?
# Margin of error with triple the sample size
newN <- n * 3
newSE <- stdev / sqrt(newN)
newMoE <- newSE * 1.96
newMoE
## [1] 2.540341
The new margin of error is 2.5403412, which is not a third of the current margin of error. Because the margin of error scales with 1/sqrt(n), tripling the sample size shrinks it only by a factor of 1/sqrt(3), as shown below.
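We can confirm the scaling factor directly:

# Tripling n shrinks the margin of error by 1/sqrt(3), not 1/3
newMoE / 4.4   # 0.5773503
1 / sqrt(3)    # 0.5773503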
In order to achieve a margin of error of 1.4666667, given a standard deviation of 46.8748456, what sample size would we need (again assuming the same standard deviation)?
# Solve MoE = 1.96 * SD / sqrt(n) for n
desiredMoE <- 4.4 / 3
reqN <- ((1.96 * stdev) / desiredMoE)^2
reqN
## [1] 3924
# Sanity check: margin of error at the required sample size
checkMoE <- 1.96 * (stdev / sqrt(reqN))
checkMoE
## [1] 1.466667
Therefore, in order to achieve a margin of error of 1.4666667, a sample of 3924 would be needed, assuming the current standard deviation stays the same.
Correct. 4.4 is the result of 1.96 * SE.
stat | value |
---|---|
n | 36 |
min | 21 |
mean | 30.69 |
sd | 4.31 |
max | 39 |
The sample size is small, at 36, but above the rule-of-thumb minimum of 30, and the distribution has a very rough normal shape. We can cautiously accept this as meeting the conditions for inference.
Setting up the hypothesis test as follows:
\(H_0: \mu_g = 32\) (The gifted children’s average age in months is equal to 32, the average for children in general.)
\(H_A: \mu_g < 32\) (The gifted children’s average age in months is less than 32.)
\(\alpha = 0.10\)
# One-sample z-test: standardize the sample mean against the null value
a <- 0.10
xbar <- 30.69
sdX <- 4.31
n <- 36
SEx <- sdX / sqrt(n)
zXbar <- (xbar - 32) / SEx
zXbar
## [1] -1.823666
This is a one-sided (lower-tail) hypothesis test, so the p-value is the area below the observed z-score:
# Lower-tail p-value
pval <- pnorm(zXbar)
pval
## [1] 0.0341013
The p-value of 0.0341 is lower than the significance level \(\alpha=0.1\). Therefore, I conclude to reject the null hypothesis in favor of the alternative. In other words, it is implausible that we would see a sample mean as low as 30.69 if there were no real difference between gifted and non-gifted children.
The following R code computes the 90% confidence interval.
# Critical z-value for a 90% confidence interval (5% in each tail)
theZ <- qnorm(1 - a / 2)
theZ
## [1] 1.644854
# Compute the confidence interval
lower <- xbar - (theZ * SEx)
upper <- xbar + (theZ * SEx)
ci <- c(lower, upper)
ci
## [1] 29.50845 31.87155
The 90% confidence interval for the gifted children is 29.5084468 - 31.8715532.
The results agree because the confidence interval does not contain the average of 32 for non-gifted children. If the interval had contained 32, that would have indicated that 32 might be the population mean for the gifted children, and we would have failed to reject the null hypothesis.
Setting up the hypothesis test as follows:
\(H_0: \mu_g = 100\) (The average IQ of mothers of gifted children is equal to the average IQ.)
\(H_A: \mu_g > 100\) (The average IQ of mothers of gifted children is greater than the average IQ.)
\(\alpha = 0.10\)
# One-sample z-test for the mothers' IQ
n <- 36
xbar <- 118.2
sdX <- 6.5
SEx <- sdX / sqrt(n)
zXbar <- (xbar - 100) / SEx
zXbar
## [1] 16.8
This is a one-sided, upper-tail hypothesis test; the observed z-score lies so far into the tail that the corresponding area would not even be visible in a visualization.
Computing the p-value: since this is an upper-tail test, we subtract from 1.
# Upper-tail p-value
pval <- 1 - pnorm(zXbar)
pval
## [1] 0
The p-value of (effectively) 0 is far below the significance level \(\alpha=0.1\). This suggests the sample mean IQ of 118.2 for mothers of gifted children is not even close to the average of 100. Therefore, I conclude to reject the null hypothesis in favor of the alternative. In other words, it is implausible that we would see a sample mean as high as we did if there were no real difference between the IQs of mothers of gifted children and mothers in general.
The following R code computes the 90% confidence interval.
# Critical z-value for a 90% confidence interval (5% in each tail)
theZ <- qnorm(1 - a / 2)
theZ
## [1] 1.644854
# Compute the confidence interval
lower <- xbar - (theZ * SEx)
upper <- xbar + (theZ * SEx)
ci <- c(lower, upper)
ci
## [1] 116.4181 119.9819
The 90% confidence interval for the IQ of mothers of gifted children is 116.4180753 - 119.9819247.
Yes, the results agree. The confidence interval for the IQ of mothers of gifted children is well above the average of 100 for mothers of non-gifted children.
The sampling distribution of the mean is the distribution of mean values from repeated samples of a population. The shape is approximately normal, with a center at the population mean. As the sample size increases, the shape more closely approximates the normal distribution, and the spread of the sampling distribution (the standard error, \(\sigma/\sqrt{n}\)) narrows around the population mean.
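A small simulation sketch, using a hypothetical N(100, 15) population, illustrates these properties:

# Simulate the sampling distribution of the mean from N(100, 15)
set.seed(42)
sampleMeans <- replicate(10000, mean(rnorm(50, mean = 100, sd = 15)))
mean(sampleMeans)  # close to the population mean of 100
sd(sampleMeans)    # close to 15 / sqrt(50) = 2.12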
A manufacturer of compact fluorescent light bulbs advertises that the distribution of the lifespans of these light bulbs is nearly normal with a mean of 9,000 hours and a standard deviation of 1,000 hours.
Basically, we want the probability that a bulb lasts more than 10,500 hours, i.e., the area in the upper tail beyond 10,500.
# First compute the z-score for 10,500 hours
z <- (10500 - 9000) / 1000
z
## [1] 1.5
# Then determine the area under the normal curve at said z score.
# Since we want the area of the upper tail, we'll subtract from 1.
p <- 1 - pnorm(z)
p
## [1] 0.0668072
The probability that a randomly chosen light bulb lasts more than 10,500 hours is 0.0668.
Assuming a random sample of 15 independent light bulbs, the distribution of the mean lifespan would be centered near the population mean (claimed to be 9,000 hours) and have a nearly normal shape, with a standard error of \(1000/\sqrt{15}\approx258.2\) hours.
Assuming the question asks for the probability that each of 15 randomly selected light bulbs has a lifespan of more than 10,500 hours: since the probability of one randomly chosen bulb lasting more than 10,500 hours is 0.0668, the probability for 15 independent bulbs is \(P(\text{all } 15 \text{ bulbs last} > 10{,}500)=0.0668^{15}=2.3569435\times 10^{-18}\approx0\).
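Using the value of p computed above:

# Probability that all 15 independent bulbs exceed 10,500 hours
p^15
## [1] 2.356944e-18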
Another way to look at this: how likely is it that a sample of 15 light bulbs would have a mean lifespan as high as 10,500 hours if the advertised mean is accurate? Stating the hypotheses in terms of the population mean:
\(H_0: \mu = 9{,}000\)
\(H_A: \mu > 9{,}000\)
# Advertised population mean and standard deviation
mu <- 9000
s <- 1000
# Compute standard error of the mean
se <- s / sqrt(15)
se
## [1] 258.1989
# What is the z-score within the sampling distribution of a mean of 10,500
z10k5 <- (10500 - 9000) / se
z10k5
## [1] 5.809475
# Upper-tail probability for this z-score. Because z10k5 is already
# standardized, pnorm() is called with its default mean 0 and sd 1.
pv <- pnorm(z10k5, lower.tail = FALSE)
signif(pv, 4)
## [1] 3.133e-09
The upper-tail area for a z-score of 5.809475 is roughly \(3.1\times10^{-9}\), so small that it is effectively zero.
Visualization is shown below the code segment that produces it:
library(ggplot2)

# Population distribution of individual bulb lifespans
x <- seq(mu - (4 * s), mu + (4 * s), length=100)
hx <- dnorm(x, mean=mu, sd=s)
df <- data.frame(name="Population", x, hx)
# Sampling distribution of the mean for n = 15
smpl <- seq(mu - (4 * se), mu + (4 * se), length=100)
hxSmpl <- dnorm(smpl, mean=mu, sd=se)
df <- rbind(df, data.frame(name="Sampling", x=smpl, hx=hxSmpl))
g1 <- ggplot() +
  geom_line(data=df, aes(x=x, y=hx, color=name)) +
  myTheme +  # custom ggplot theme assumed to be defined earlier
  labs(title="Distribution of CFLBs")
g1
I don’t think my estimate for (a) would be very useful, since it relied on the normal model. Based on information from the OpenIntro text, I understand that there are techniques to deal with skewed distributions. Either way, the sampling distribution would still tend toward a normal shape as the sample size grows, and our existing tools could be used to estimate (c).
Suppose you conduct a hypothesis test based on a sample where the sample size is \(n=50\), and arrive at a p-value of 0.08. You then refer back to your notes and discover that you made a careless mistake, the sample size should have been \(n=500\). Will your p-value increase, decrease, or stay the same? Explain.
First, let’s examine what happens to the standard error:
# Standard errors for n = 50 and n = 500 (illustrative sd of 1)
s <- 1
se1 <- s / sqrt(50)
se2 <- s / sqrt(500)
c(se1, se2)
## [1] 0.14142136 0.04472136
The standard error gets smaller as the sample size increases, in our example from 0.1414214 to 0.0447214. What happens next? The standard error is used in the hypothesis test to determine the z-score of the observed sample mean.
# Z-scores for an illustrative sample mean of 100.3 against a null of 100
xbar <- 100.3
zXbar1 <- (xbar - 100) / se1
zXbar2 <- (xbar - 100) / se2
c(zXbar1, zXbar2)
## [1] 2.121320 6.708204
The z-score gets more extreme as the sample size increases. So what does this do to the p-value?
# Corresponding upper-tail p-values
pval1 <- 1 - pnorm(zXbar1)
pval2 <- 1 - pnorm(zXbar2)
c(pval1, pval2)
## [1] 1.694743e-02 9.851675e-12
The p-value gets smaller as the sample size increases, which gives stronger evidence of a difference, if one exists.
Conclusion: the p-value of 0.08 would decrease to something significantly smaller, as shown through the example.