MSDS Spring 2018

DATA 606 Statistics and Probability for Data Analytics

Jiadi Li

Chapter 5: Inference for Numerical Data

HW 5: 5.6, 5.14, 5.20, 5.32, 5.48

5.6 Working backwards, Part II.

90% confidence interval: (65, 77)
Distribution: approximately normal, standard deviation unknown
25 observations

n <- 25
ci1 <- 65
ci2 <- 77

Sample mean:

sample_mean <- (ci1 + ci2)/2
sample_mean

## [1] 71

Margin of error:

me <- (ci2 - ci1)/2
me

## [1] 6

Sample standard deviation (solving \(ME = t^*_{24}\cdot s/\sqrt{n}\) for \(s\)):

t <- qt(0.95, 24)  # t* for a 90% CI with df = n - 1 = 24
sd <- me * sqrt(n) / t
sd

## [1] 17.53481
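As a sanity check, the interval can be rebuilt from the recovered quantities. The following is a minimal sketch using only the numbers given in the exercise; the variable names (`x_bar`, `margin`, `s`, `rebuilt`) are chosen here for illustration.

```r
# Values from the exercise: 90% CI (65, 77), n = 25
n <- 25
lower <- 65
upper <- 77

x_bar  <- (lower + upper) / 2      # midpoint of the interval
margin <- (upper - lower) / 2      # half-width of the interval
t_star <- qt(0.95, df = n - 1)     # critical value for a 90% CI
s      <- margin * sqrt(n) / t_star

# Rebuild the interval: x_bar +/- t* * s / sqrt(n)
rebuilt <- x_bar + c(-1, 1) * t_star * s / sqrt(n)
rebuilt
```

If the algebra is right, `rebuilt` reproduces the endpoints 65 and 77.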

5.14 SAT scores.

Standard deviation: 250
Margin of error: no more than 25

90% confidence interval. Sample size?

z1 <- 1.645 #for 90% ci
me <- 25
sd <- 250

n1 <- (z1*sd/me)^2
n1

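## [1] 270.6025

A sample size must be a whole number, and rounding down would let the margin of error exceed 25, so at least 271 students are needed.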

99% confidence interval:
The sample size needs to be larger. A higher confidence level requires a larger critical value \(z^*\) (2.575 instead of 1.645), so keeping the margin of error at 25 requires a larger \(n\).
Minimum sample size for 99% confidence interval:

z2 <- 2.575

n2 <- (z2*sd/me)^2
n2

## [1] 663.0625

Rounding up, at least 664 students are needed.
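The two calculations above can be wrapped in one small helper. This is a sketch, not part of the original solution: the function name `min_sample_size` and its arguments are made up for illustration, and it uses `qnorm()` for the critical value instead of the rounded table values 1.645 and 2.575.

```r
# Smallest n such that z* * sigma / sqrt(n) <= max_me
min_sample_size <- function(conf_level, sigma, max_me) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  ceiling((z_star * sigma / max_me)^2)       # always round up
}

n_90 <- min_sample_size(0.90, sigma = 250, max_me = 25)
n_99 <- min_sample_size(0.99, sigma = 250, max_me = 25)
c(n_90, n_99)
```

With the exact critical values the unrounded results shift slightly, but after rounding up the answers are still 271 and 664.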

5.20 High School and Beyond, Part I

sample size: 200

Is there a clear difference in the average reading and writing scores?
There appears to be a small difference between the average reading and writing scores, but it is not pronounced.
Are the reading and writing scores of each student independent of each other?
No. Both scores come from the same student, so a student's reading score and writing score are paired rather than independent.
Is there an evident difference in the average scores of students in the reading and writing exam?
\(H_0\): there is no difference between the average reading and writing scores. \(\mu_{diff}=0\), where diff = read - write
\(H_A\): there is a difference between the average scores. \(\mu_{diff}\neq 0\)
Conditions:

One-sample or differences from paired data: the observations (or differences) must be independent and nearly normal. For larger sample sizes, we can relax the nearly normal requirement, e.g. slight skew is okay for sample sizes of 15, moderate skew for sample sizes of 30, and strong skew for sample sizes of 60.
For a difference of means when the data are not paired: each sample mean must separately satisfy the one-sample conditions for the t-distribution, and the data in the groups must also be independent.

The observed average difference in scores is \(\bar{x}_{read-write} = -0.545\), and the standard deviation of the differences is 8.887. Do these data provide convincing evidence of a difference between the average scores on the two exams?

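# lower-tail probability of the observed t statistic, df = n - 1 = 199
p <- pt((-0.545 - 0) / (8.887 / sqrt(200)), 200 - 1)
p

## [1] 0.1934182

This is the lower-tail probability only. Because \(H_A\) is two-sided, the p-value is twice that, about 0.387. Since 0.387 > 0.05, we fail to reject \(H_0\): the data do not provide convincing evidence of a difference between the average scores on the two exams.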

What type of error might we have made? Explain what the error means in the context of the application.
A Type 1 Error is rejecting the null hypothesis when \(H_0\) is actually true.
A Type 2 Error is failing to reject the null hypothesis when the alternative is actually true.
Since we failed to reject \(H_0\), the only error we could have made is a Type 2 Error: concluding that there is no difference between the average reading and writing scores when there actually is one.
Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.
Yes. Because we failed to reject \(H_0\) at the 5% significance level, 0 is a plausible value for the average difference, so a 95% confidence interval should include 0.
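The link between the two-sided test and the confidence interval can be checked numerically. The sketch below uses only the summary statistics given in the exercise; the variable names are chosen here for illustration, and the 95% level is an assumption matching the 0.05 significance level used above.

```r
# Summary statistics from the exercise (paired differences, read - write)
n_pairs   <- 200
diff_mean <- -0.545
diff_sd   <- 8.887

se     <- diff_sd / sqrt(n_pairs)   # standard error of the mean difference
t_stat <- diff_mean / se

# Two-sided p-value: probability in both tails
p_two_sided <- 2 * pt(-abs(t_stat), df = n_pairs - 1)

# 95% confidence interval for the mean difference
t_star <- qt(0.975, df = n_pairs - 1)
ci_95  <- diff_mean + c(-1, 1) * t_star * se

p_two_sided
ci_95
```

The interval has a negative lower bound and a positive upper bound, which agrees with failing to reject \(H_0\) at the 5% level.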

5.32 Fuel efficiency of manual and automatic cars, Part I.

Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.
\(H_0\): \(\mu_A=\mu_M\)
\(H_A\): \(\mu_A\neq\mu_M\)

n <- 26

sd_A <- 3.58
sd_M <- 4.51

mean_A <- 16.12
mean_M <- 19.85

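mean_diff <- mean_A - mean_M

# standard error of the difference in sample means
sd_diff <- sqrt((sd_A ^ 2 / n) + (sd_M ^ 2 / n))

# lower-tail probability, using the conservative df = n - 1 = 25
p <- pt((mean_diff - 0)/sd_diff, n - 1)
p

## [1] 0.001441807

This is a one-tail probability; for the two-sided \(H_A\) the p-value is twice that, about 0.0029. Since 0.0029 < 0.05, we reject \(H_0\): the data provide strong evidence that the average city mileage differs between cars with automatic and manual transmissions, with the automatic sample averaging lower (16.12 vs. 19.85 miles/gallon).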
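The test above uses the conservative degrees of freedom \(n - 1 = 25\). As a cross-check, the sketch below swaps in the Welch-Satterthwaite approximation for the degrees of freedom, which is a different technique from the one used in the solution. It relies only on the summary statistics above; the variable names are chosen for illustration.

```r
# Summary statistics from the exercise (26 cars in each group)
n_each <- 26
sd_auto <- 3.58;  mean_auto <- 16.12
sd_man  <- 4.51;  mean_man  <- 19.85

v_auto <- sd_auto^2 / n_each        # variance of each sample mean
v_man  <- sd_man^2  / n_each
se     <- sqrt(v_auto + v_man)      # standard error of the difference
t_stat <- (mean_auto - mean_man) / se

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v_auto + v_man)^2 /
  (v_auto^2 / (n_each - 1) + v_man^2 / (n_each - 1))

# Two-sided p-values under both choices of df
p_conservative <- 2 * pt(-abs(t_stat), df = n_each - 1)
p_welch        <- 2 * pt(-abs(t_stat), df = df_welch)

c(df_welch = df_welch, p_conservative = p_conservative, p_welch = p_welch)
```

Both p-values fall well below 0.05, so the conclusion does not depend on which degrees of freedom are used.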

5.48 Work hours and education.

The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

(a) Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.
\(H_0\): the average number of hours worked is the same across all five groups. \(\mu_1=\mu_2=\mu_3=\mu_4=\mu_5\)
\(H_A\): at least one group's average number of hours worked differs from the others.

Check conditions and describe any assumptions you must make to proceed with the test.

ANOVA requires three conditions: (1) the observations are independent within and across groups; (2) the data within each group are nearly normal; (3) the variability is about equal across groups.
Independence is reasonable if the respondents are a random sample and 1,172 is less than 10% of US residents. Near-normality and roughly equal variability have to be judged from the distributions and the group standard deviations given in the exercise, and we must assume they hold approximately in order to proceed.

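What is the conclusion of the test?
Since the p-value 0.0682 > 0.05, we fail to reject \(H_0\): the data do not provide convincing evidence that the average number of hours worked differs across the five educational attainment groups.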
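For reference, this is how such a p-value comes out of an ANOVA table in R. The sketch below runs on simulated data, NOT the GSS data: the group labels, group sizes, and the mean and standard deviation of the simulated hours are all made up for illustration, so its output says nothing about the exercise itself.

```r
set.seed(606)  # arbitrary seed, only for reproducibility

# Hypothetical data: 5 groups of 40 simulated respondents each
groups <- paste0("group_", 1:5)
sim <- data.frame(
  degree = factor(rep(groups, each = 40), levels = groups),
  hours  = rnorm(5 * 40, mean = 40, sd = 15)  # made-up values
)

fit <- aov(hours ~ degree, data = sim)
anova_table <- summary(fit)[[1]]

f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# The same p-value is the upper tail of the F distribution
p_check <- pf(f_value,
              df1 = anova_table[["Df"]][1],   # groups - 1
              df2 = anova_table[["Df"]][2],   # observations - groups
              lower.tail = FALSE)

c(p_value = p_value, p_check = p_check)
```

The decision rule is the same as above: reject \(H_0\) only when the p-value falls below the chosen significance level.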