MSDS Spring 2018

DATA 606 Statistics and Probability for Data Analytics

Jiadi Li

Chapter 4: Foundations for Inference

HW 4: 4.4, 4.14, 4.24, 4.26, 4.34, 4.40, 4.48

4.4 Heights of adults

  1. Mean: 171.1cm; Median: 170.3cm.
  2. Standard deviation: 9.4cm; IQR = Q3 - Q1 = 177.8cm - 163.8cm = 14cm.
  3. 180cm: Z = (180 - 171.1)/9.4 ≈ 0.95, so it is within 1 standard deviation above the mean (171.1cm ~ 180.5cm).
    155cm: Z = (155 - 171.1)/9.4 ≈ -1.71, so it is between -2 ~ -1 standard deviations (152.3cm ~ 161.7cm).
    Both heights are within 2 standard deviations of the mean, so neither should be considered unusual. However, how well the sample generalizes to the population still needs to be evaluated before drawing a confident conclusion.
  4. No. It is possible, but the mean and standard deviation of a new sample would most likely differ from these values: point estimates vary from sample to sample, even if some individuals appear in both samples.
  5. \(SE_{\bar{x}}=\frac{\sigma}{\sqrt n}\approx\frac{s}{\sqrt n}=\frac{9.4}{\sqrt{507}}\)
9.4/sqrt(507)
## [1] 0.4174687
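
Parts 4 and 5 can be checked with a short simulation. The sketch below is only an illustration: it assumes heights follow a normal model with standard deviation 9.4cm, and the population center of 170cm, the seed, and the number of replicates are arbitrary choices, not values taken from the exercise.

```r
# Simulation sketch: how much does the mean of 507 heights vary between samples?
# Assumption (not from the exercise): heights ~ Normal(center, 9.4).
set.seed(606)
n <- 507
s <- 9.4
center <- 170          # hypothetical population mean, chosen for illustration
reps <- 2000           # number of simulated samples

# Each replicate draws a fresh sample of 507 and records its mean
sample_means <- replicate(reps, mean(rnorm(n, mean = center, sd = s)))

se_theory <- s / sqrt(n)       # standard error from the formula
se_sim <- sd(sample_means)     # spread of the simulated sample means

c(theory = se_theory, simulated = se_sim)
```

Every simulated sample has a slightly different mean, which is why a new sample is not expected to reproduce the original estimates, and the spread of those means should land close to the standard error from the formula.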

4.14 Thanksgiving spending, Part I


436 randomly sampled American adults;
Average: $84.71 [95% confidence interval ($80.31,$89.11)]
(a) We are 95% confident that the average spending of these 436 American adults is between $80.31 and $89.11.
False. The sample mean is known exactly: the average spending of these 436 American adults is $84.71. The 95% confidence interval is for the mean of the whole population.
(b) This confidence interval is not valid since the distribution of spending in the sample is right skewed.
False. The skew is not strong enough to invalidate inference on the sample mean, given the large sample size (n = 436).
(c) 95% of random samples have a sample mean between $80.31 and $89.11.
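False. A confidence interval is a statement about the population mean, not about the means of other samples, and this particular range depends on this specific sample.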
(d) We are 95% confident that the average spending of all American adults is between $80.31 and $89.11.
True. This conclusion follows directly from the definition of a confidence interval.
(e) A 90% confidence interval would be narrower than the 95% confidence interval since we don’t need to be as sure about our estimate.
True. A lower confidence level uses a smaller critical value (1.645 instead of 1.96), which gives a narrower interval at the cost of being less sure that it captures the population mean.
(f) In order to decrease the margin of error of a 95% confidence interval to a third of what it is now, we would need to use a sample 3 times larger.
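False. Margin of Error = \(z^\star \times SE = z^\star \times \frac{s}{\sqrt n}\). Because the margin of error is proportional to \(\frac{1}{\sqrt n}\), the sample size needs to be \(3^2 = 9\) times larger to reduce the margin of error to \(\frac{1}{3}\) of its current value.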
(g) The margin of error is 4.4.
True. Margin of Error = sample mean - lower bound of the confidence interval = $84.71 - $80.31 = $4.40.
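
The arithmetic behind parts (e), (f) and (g) can be reproduced from the reported interval alone. This is a minimal sketch; the variable names are illustrative, and it assumes the interval was built with the normal critical value.

```r
# Recover the margin of error and standard error from the reported interval
lower <- 80.31
upper <- 89.11
n <- 436
z95 <- qnorm(0.975)            # about 1.96
z90 <- qnorm(0.95)             # about 1.645

center <- (lower + upper) / 2  # midpoint of the interval, the sample mean
moe95 <- (upper - lower) / 2   # margin of error of the 95% interval
se <- moe95 / z95              # implied standard error

# (e) a 90% interval reuses the same SE with a smaller critical value
ci90 <- center + c(-1, 1) * z90 * se

# (f) margin of error scales with 1/sqrt(n): a third of the margin needs 9n
n_needed <- n * 3^2

c(center = center, moe95 = moe95, n_needed = n_needed)
ci90
```

The midpoint comes out at 84.71 and the margin of error at 4.4, and the 90% interval is narrower than the 95% one because only the critical value changes.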

4.24 Gifted children, Part I

(a) Are conditions for inference satisfied?
    The sample observations are independent: Yes, this is reasonable. The children were randomly sampled, and 36 is certainly less than 10% of the gifted children in a large city. Whether children from one city represent gifted children in general is a question about generalization, not independence.
    The sample size is large (\(n \ge 30\) is a good rule of thumb): Yes. 36 \(\ge\) 30.
    The population distribution is not strongly skewed (this condition can be difficult to evaluate): The sample distribution is slightly skewed to the left and looks bimodal rather than normal, but the skew is not strong.
(b) Suppose you read online that children first count to 10 successfully when they are 32 months old, on average. Perform a hypothesis test to evaluate if these data provide convincing evidence that the average age at which gifted children first count to 10 successfully is less than the general average of 32 months. Use a significance level of 0.10.
    \(H_0\): \(\mu\)=32
    \(H_A\): \(\mu\)<32

\(SE_{\bar{x}}\)=\(\frac{s}{\sqrt n}\)

se <- 4.31/sqrt(36)
se
## [1] 0.7183333

\(Z\)=\(\frac{\bar{x}-\text{null value}}{SE_{\bar{x}}}=\frac{30.69-32}{SE_{\bar{x}}}\)

z <- (30.69-32)/se
pnorm(z)
## [1] 0.0341013

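\(\because\) p-value = 0.034 < 0.10 = \(\alpha\)
\(\therefore\) we reject the null hypothesis: the data provide convincing evidence that gifted children first count to 10 at an average age below 32 months.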
(c) Interpret the p-value in context of the hypothesis test and the data.
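If the null hypothesis were true, that is, if gifted children first counted to 10 at an average age of 32 months, the probability of observing a sample mean of 30.69 months or lower in a random sample of 36 children would be about 0.034. Because this probability is below the significance level of 0.10, the data favor \(H_A\) over \(H_0\).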
(d) Calculate a 90% confidence interval for the average age at which gifted children first count to 10 successfully.

mean <- 30.69
lower <- mean - 1.645 * se
upper <- mean + 1.645 * se
c(lower,upper)
## [1] 29.50834 31.87166
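(e) Do your results from the hypothesis test and the confidence interval agree? Explain.
Yes, the results from the hypothesis test and the confidence interval agree. The test rejects \(H_0\): \(\mu\)=32, and the 90% confidence interval (29.51, 31.87) lies entirely below 32 months, so both point to an average age below 32 months.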
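
The test and the interval use the same ingredients, so they can be bundled into one function. The sketch below is a hypothetical helper written for this exercise (`z_summary` and its arguments are not from the textbook or any package); it assumes the normal model is appropriate and works from summary statistics only.

```r
# Hypothetical helper (not from the textbook or any package):
# one-sample z test and confidence interval from summary statistics.
z_summary <- function(xbar, s, n, null_value, conf_level = 0.90,
                      alternative = c("less", "greater", "two.sided")) {
  alternative <- match.arg(alternative)   # defaults to "less"
  se <- s / sqrt(n)
  z <- (xbar - null_value) / se
  p_value <- switch(alternative,
    less      = pnorm(z),
    greater   = pnorm(z, lower.tail = FALSE),
    two.sided = 2 * pnorm(-abs(z))
  )
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # critical value for the interval
  ci <- xbar + c(-1, 1) * z_star * se
  list(se = se, z = z, p_value = p_value, ci = ci)
}

res <- z_summary(xbar = 30.69, s = 4.31, n = 36, null_value = 32)
res$p_value          # about 0.034, matching the calculation above
res$ci               # roughly 29.51 to 31.87
res$ci[2] < 32       # TRUE: the whole interval lies below the null value
```

The interval differs from the hand calculation only in the later decimals because `qnorm(0.95)` is used in place of the rounded 1.645.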

4.26 Gifted children, Part II


(a) Perform a hypothesis test to evaluate if these data provide convincing evidence that the average IQ of mothers of gifted children is different than the average IQ for the population at large, which is 100. Use a significance level of 0.10.
\(H_0\): \(\mu\)=100
\(H_A\): \(\mu \ne 100\)

\(SE_{\bar{x}}\)=\(\frac{s}{\sqrt n}\)

se <- 6.5/sqrt(36)
se
## [1] 1.083333

\(Z\)=\(\frac{\bar{x}-\text{null value}}{SE_{\bar{x}}}=\frac{118.2-100}{SE_{\bar{x}}}\)

mean <- 118.2
hy_mean <- 100
z <- (mean - hy_mean)/se
2 * (1 - pnorm(z))  # two-sided test: double the upper-tail area
## [1] 0

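\(\because\) p-value \(\approx\) 0 < 0.10 = \(\alpha\)
\(\therefore\) we reject the null hypothesis: the data provide convincing evidence that the average IQ of mothers of gifted children differs from 100.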
(b) Calculate a 90% confidence interval for the average IQ of mothers of gifted children.

lower <- mean - 1.645 * se
upper <- mean + 1.645 * se
c(lower,upper)
## [1] 116.4179 119.9821
(c) Do your results from the hypothesis test and the confidence interval agree? Explain.
Yes, they agree with each other. The hypothesis test rejects \(H_0\): \(\mu\)=100, and the 90% confidence interval (116.42, 119.98) does not contain 100. Because the test is two-sided at \(\alpha\) = 0.10, it corresponds exactly to a 90% confidence interval: the null value is rejected precisely when it falls outside the interval.
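
The printed p-value of 0 is a floating-point artifact rather than a true zero: `pnorm(z)` is stored as exactly 1 for a Z-score this large, so the subtraction loses the tail. A minimal sketch of two ways around it, using only the summary statistics from the exercise (the variable names are illustrative):

```r
# Summary statistics from the exercise
xbar <- 118.2
s <- 6.5
n <- 36
null_value <- 100

z <- (xbar - null_value) / (s / sqrt(n))   # 16.8

# 1 - pnorm(z) rounds to exactly 0 because pnorm(z) is stored as 1
p_naive <- 2 * (1 - pnorm(z))

# Asking pnorm for the upper tail directly avoids the subtraction
p_tail <- 2 * pnorm(z, lower.tail = FALSE)

# The log scale stays usable even for far more extreme Z-scores
log10_p <- (log(2) + pnorm(z, lower.tail = FALSE, log.p = TRUE)) / log(10)

c(naive = p_naive, tail = p_tail, log10_p = log10_p)
```

Either way the conclusion is unchanged: the p-value is far below 0.10, so \(H_0\) is rejected.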

4.34 CLT

Define the term “sampling distribution” of the mean, and describe how the shape, center, and spread of the sampling distribution of the mean change as sample size increases.
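The sampling distribution of the mean is the distribution of the sample means \(\bar{x}\) computed from samples of a fixed size drawn from a certain population.
Shape: with larger n, the sampling distribution of \(\bar{x}\) becomes more nearly normal, so the normal model for \(\bar{x}\) becomes more reasonable and the condition on skew can be relaxed when the sample size is very large. Center: it stays at the population mean \(\mu\) regardless of n. Spread: the standard error \(\frac{\sigma}{\sqrt n}\) shrinks as n increases, so the distribution becomes narrower.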
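
A small simulation makes the three changes visible. The sketch below is illustrative only: the exponential population with mean 1, the sample sizes, the seed, and the `skewness` helper are all assumptions chosen for the demonstration, not part of the exercise.

```r
# Simulation sketch: sampling distribution of the mean for a skewed population.
# Assumption (illustrative only): population is Exponential with mean 1, sd 1.
set.seed(606)
reps <- 5000
sizes <- c(5, 30, 200)

# Simple moment-based skewness, defined here to avoid extra packages
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

summarize_size <- function(n) {
  xbars <- replicate(reps, mean(rexp(n, rate = 1)))
  c(n = n,
    center = mean(xbars),        # stays near the population mean of 1
    spread = sd(xbars),          # shrinks roughly like 1 / sqrt(n)
    theory_se = 1 / sqrt(n),
    skew = skewness(xbars))      # moves toward 0 as the shape becomes normal
}

results <- t(sapply(sizes, summarize_size))
round(results, 3)
```

Each row of `results` corresponds to one sample size: the center column should stay near 1, the spread column should track `theory_se`, and the skew column should move toward 0 as n grows.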

4.40 CFLBs

Given: lifespans are nearly normal with a mean of 9,000 hours and a standard deviation of 1,000 hours.

mean <- 9000
sd <- 1000
(a) What is the probability that a randomly chosen light bulb lasts more than 10,500 hours?
se <- sd
z <- (10500 - mean)/se
1 - pnorm(z)
## [1] 0.0668072
(b) Describe the distribution of the mean lifespan of 15 light bulbs.
Because the population is nearly normal, the mean lifespan of 15 bulbs is also nearly normal, with a mean of 9,000 hours and a standard error of \(\frac{1000}{\sqrt{15}}\approx\) 258.2 hours.
(c) What is the probability that the mean lifespan of 15 randomly chosen light bulbs is more than 10,500 hours?
se <- sd/sqrt(15)
z <- (10500 - mean)/se
1 - pnorm(z)
## [1] 3.133452e-09
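(d) Sketch the two distributions (population and sampling) on the same scale.
x <- seq(4000, 14000, 30)  # grid of lifespans in hours
matplot(x, cbind(dnorm(x, mean, sd), dnorm(x, mean, sd/sqrt(15))), type = "l", xlab = "hours", ylab = "density")  # solid: population, dashed: sampling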


(e) Could you estimate the probabilities from parts (a) and (c) if the lifespans of light bulbs had a skewed distribution?
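Part (a) could not be estimated: the probability for a single bulb depends on the shape of the population distribution, so the normal model does not apply to skewed lifespans. Part (c) could not be estimated reliably either. A special case of the Central Limit Theorem ensures the distribution of sample means will be nearly normal, regardless of sample size, only when the data come from a nearly normal distribution. If this condition is not met, a sample of 15 is too small to assume that the sample mean is nearly normal.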
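
To see how far off the normal calculations could be, they can be compared with one concrete skewed model. The sketch below is purely hypothetical: it assumes lifespans equal 8,000 hours plus an exponential amount with mean 1,000 hours, which keeps the mean at 9,000 and the standard deviation at 1,000 but is strongly right skewed. Nothing in the exercise says the bulbs follow this model.

```r
# Hypothetical skewed model (an assumption, not from the exercise):
# lifespan = 8000 + Exponential(mean 1000), so mean 9000 and sd 1000.
shift <- 8000
expo_mean <- 1000
cutoff <- 10500
n <- 15

# (a) single bulb: normal calculation vs the exact skewed-model probability
p_single_normal <- pnorm(cutoff, mean = 9000, sd = 1000, lower.tail = FALSE)
p_single_skewed <- pexp(cutoff - shift, rate = 1 / expo_mean,
                        lower.tail = FALSE)

# (c) mean of 15 bulbs: the sum of 15 exponentials is Gamma(shape = 15),
# so the exact tail probability is available without simulation
p_mean_normal <- pnorm(cutoff, mean = 9000, sd = 1000 / sqrt(n),
                       lower.tail = FALSE)
p_mean_skewed <- pgamma(n * (cutoff - shift), shape = n, scale = expo_mean,
                        lower.tail = FALSE)

c(single_normal = p_single_normal, single_skewed = p_single_skewed,
  mean_normal = p_mean_normal, mean_skewed = p_mean_skewed)
```

Under this assumed model the normal answers are off for both the single bulb and the mean of 15, with the discrepancy for the mean spanning orders of magnitude, which is the practical reason not to apply the normal model here.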

4.48 Sample observation, different sample size

sample size = 50
p-value = 0.08
sample size should be 500 not 50
Will your p-value increase, decrease, or stay the same? Explain.
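\(SE_{\bar{x}}=\frac{s}{\sqrt n}\)
As the sample size changes from 50 to 500, \(SE\) changes from \(\frac{s}{\sqrt{50}}\) to \(\frac{s}{\sqrt{500}}\), so it shrinks by a factor of \(\sqrt{10}\). With the same sample mean and standard deviation, the Z-score moves farther from 0 by the same factor, and therefore the p-value decreases.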
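
The size of the change can be sketched numerically. The exercise does not say whether the test is one- or two-sided, so the code below assumes a two-sided test under the normal model, with the sample mean and standard deviation unchanged; the variable names are illustrative.

```r
# Assumptions (not stated in the exercise): the test is two-sided and based on
# the normal model; the sample mean and standard deviation stay the same.
p_50 <- 0.08
n_wrong <- 50
n_right <- 500

z_50 <- qnorm(1 - p_50 / 2)                 # |Z| implied by p = 0.08
z_500 <- z_50 * sqrt(n_right / n_wrong)     # SE shrinks by sqrt(10), so |Z| grows
p_500 <- 2 * pnorm(z_500, lower.tail = FALSE)

c(z_50 = z_50, z_500 = z_500, p_500 = p_500)
```

Under these assumptions the corrected p-value is far smaller than 0.08, and the same direction of change holds for a one-sided test.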