Homework 5 - Data 606

5.6

5.6 Working backwards, Part II. A 90% confidence interval for a population mean is (65,77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

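Given: 90% confidence interval (65,77). The population standard deviation is unknown and n = 25 is small, so the critical value is t* with df = 24 rather than z = 1.65.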

=> low = 65
=> up = 77

n = 25, df = n - 1

margin_of_error = (up-low)/2 = standard_error * t*

low<-65
up<-77
n<-25
sample_mean<-round((low+up)/2,2)
a<-paste("The sample mean is:", sample_mean)

margin_of_error<-round((up-low)/2,2)
b<-paste("The margin of error is:", margin_of_error)

df<-25-1
ci<-0.9
two_tailed_ci<-ci + (1-ci)/2
t<-qt(two_tailed_ci,df)
standard_error<-margin_of_error/t
standard_deviation<-standard_error*sqrt(n)
c<-paste("The standard deviation is:", round(standard_deviation,2))

The sample mean is: 71
The margin of error is: 6
The standard deviation is: 17.53
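As a quick check on the three answers above, the interval can be rebuilt from the recovered mean and standard deviation. This is a minimal sketch; the names `t_star`, `s`, and `rebuilt` are illustrative and not part of the original solution.

```r
# Rebuild the 90% confidence interval from the recovered quantities.
n <- 25
sample_mean <- 71
margin_of_error <- 6

# Critical t value for a 90% interval with n - 1 degrees of freedom
t_star <- qt(0.95, df = n - 1)

# Sample standard deviation implied by the margin of error
s <- margin_of_error / t_star * sqrt(n)

# Reconstruct the interval as mean +/- t* x s / sqrt(n)
rebuilt <- sample_mean + c(-1, 1) * t_star * s / sqrt(n)

print(round(s, 2))
print(rebuilt)
```

If the recovered values are right, `rebuilt` should reproduce the original endpoints 65 and 77.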

5.14

5.14 SAT scores. SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

(a) Raina wants to use a 90% confidence interval. How large a sample should she collect?

(b) Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

(c) Calculate the minimum required sample size for Luke.

Given: sd = 250

margin_of_error = standard_error * z* <= 25 (the population sd is known, so z* is used instead of t*)

For Raina, ci = 0.90, so the critical value is z = 1.65

\(ME = z \cdot SE = z \cdot \frac{sd}{\sqrt{n}}\) => \(n = \left(\frac{z \cdot sd}{ME}\right)^2\)

sd<-250
margin_of_error<-25
z<-1.65
n<-(sd * z/margin_of_error)^2
a<-paste("The sample size is:", round(n,2))

The sample size is: 272.25, which must be rounded up, so Raina needs at least 273 students

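For Luke, ci = 0.99, so the critical value is z = 2.58

With the same sd and the same margin of error, n grows with the square of z: \(z_{Luke} > z_{Raina}\) => \(n_{Luke} > n_{Raina}\). Luke needs the larger sample.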

ci = 0.99, so the critical value is z = 2.58

\(ME = z \cdot SE = z \cdot \frac{sd}{\sqrt{n}}\) => \(n = \left(\frac{z \cdot sd}{ME}\right)^2\)

sd<-250
margin_of_error<-25
z<-2.58
n<-(sd * z/margin_of_error)^2
c<-paste("The sample size is:", round(n,2))

The sample size is: 665.64, which must be rounded up, so Luke needs at least 666 students
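The two sample-size calculations can be wrapped in one small function. This is a sketch only: `min_sample_size` is a hypothetical helper, and it uses the unrounded critical value from `qnorm` rather than the rounded z = 1.65 and z = 2.58 used above, so its answers can come out a few units smaller than the hand calculation.

```r
# Hypothetical helper: smallest n such that z * sd / sqrt(n) <= me.
min_sample_size <- function(sd, me, conf) {
  # Two-sided critical value for the requested confidence level
  z <- qnorm(1 - (1 - conf) / 2)
  # Round up, because a fractional observation is not possible
  ceiling((z * sd / me)^2)
}

n_raina <- min_sample_size(sd = 250, me = 25, conf = 0.90)
n_luke  <- min_sample_size(sd = 250, me = 25, conf = 0.99)

print(c(raina = n_raina, luke = n_luke))
```

Either way, the ordering is the same: the higher confidence level requires the larger sample.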

5.20

5.20 High School and Beyond, Part I. The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

(a) Is there a clear difference in the average reading and writing scores?

(b) Are the reading and writing scores of each student independent of each other?

(c) Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

(d) Check the conditions required to complete this test.

(e) The average observed difference in scores is \(\bar{x}_{read-write} = -0.545\), and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

(f) What type of error might we have made? Explain what the error means in the context of the application.

(g) Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

Given: n = 200

There’s no clear difference in the average reading and writing scores. From the image, the histogram is unimodal centered at 0 and the boxplots have similar medians.

The reading and writing scores of each student are not independent of each other: both scores come from the same student, so the data are paired, and a student who scores high (or low) on one exam tends to score similarly on the other.

In the average scores of students in the reading and writing exam:

Null hypothesis with no difference: \(H_0\) : \(\mu_{read-write} = 0\)
Alternative Hypothesis with a difference: \(H_A\) : \(\mu_{read-write} \ne 0\)

Satisfied because of the sample:

is large (n = 200 > 30), with a unimodal distribution of differences
is independent between students: n = 200 is less than 10% of all high school seniors
is random

\(\bar{x}_{read-write} = -0.545\), sd = 8.887

n<-200
sd<-8.887
x<--0.545
se<-sd/sqrt(n)
t<-x/se
df<-n-1
# two-sided alternative, so double the tail area beyond |t|
p_value<-2*pt(-abs(t),df)

The test statistic is: -0.867274
p_value = 0.3868364 > 0.05, so we fail to reject \(H_0\).
There’s not enough evidence of a difference between the average scores on the two exams

Because we failed to reject \(H_0\), the only error we could have made is a Type 2 error
A Type 1 error is rejecting \(H_0\) when \(H_0\) is true, which cannot apply here
A Type 2 error here means that there really is a difference between the average reading and writing scores, but our test failed to detect it

Yes. It will include 0 because we failed to reject the \(H_0\) null hypothesis, so a difference of 0 remains a plausible value
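The answer to (g) can be checked by building the interval directly. The sketch below assumes a 95% confidence level to match a two-sided test at the 0.05 level; the names `x_bar`, `t_star`, and `includes_zero` are illustrative.

```r
# 95% confidence interval for the mean paired difference (read - write).
n <- 200
x_bar <- -0.545   # average observed difference
s <- 8.887        # standard deviation of the differences

se <- s / sqrt(n)
t_star <- qt(0.975, df = n - 1)

ci <- x_bar + c(-1, 1) * t_star * se
includes_zero <- ci[1] < 0 && ci[2] > 0

print(round(ci, 3))
print(includes_zero)
```

The interval straddles 0, which agrees with failing to reject the null hypothesis of no difference.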

5.32

5.32 Fuel efficiency of manual and automatic cars, Part I. Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

Given:

Automatic	Manual
\(n_1=26\)	\(n_2=26\)
\(sd_1=3.58\)	\(sd_2=4.51\)
\(\bar{x}_1=16.12\)	\(\bar{x}_2=19.85\)

n_1<-26
n_2<-n_1
sd_1<-3.58
sd_2<-4.51
# mu_1 and mu_2 hold the two sample means
mu_1<-16.12
mu_2<-19.85
se_1<-sqrt((sd_1)^2/n_1)
se_2<-sqrt((sd_2)^2/n_2)
t<-(mu_1 - mu_2) / sqrt((sd_1)^2/n_1 + (sd_2)^2/n_2)
df<-min(n_1-1,n_2-1)
# two-sided alternative, so double the tail area beyond |t|
p_value<-2*pt(-abs(t),df)

Null hypothesis with no difference: \(H_0\) : \(\mu_1 = \mu_2\)
Alternative Hypothesis with a difference: \(H_A\) : \(\mu_1 \ne \mu_2\)
Conditions:
1- Random: the samples are random
2- Normal: each sample has n = 26 < 30, so the Central Limit Theorem alone is not enough; we use the t distribution and assume mileage is nearly normal in each group, as the problem instructs
3- Independent: n = 26 is less than 10% of all cars of each transmission type manufactured in 2012, and the two samples are independent of each other
p_value = 0.0028836 < 0.05 => reject \(H_0\). Yes, there’s a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage
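The solution above uses the conservative degrees of freedom min(n_1 - 1, n_2 - 1) = 25. As a cross-check, the sketch below swaps in the Welch-Satterthwaite approximation for the degrees of freedom, which is a different (less conservative) choice than the one in the solution. The names `v_1`, `v_2`, `df_welch`, and `p_welch` are illustrative.

```r
# Two-sample t test from summary statistics, Welch-Satterthwaite df.
n_1 <- 26;       n_2 <- 26
sd_1 <- 3.58;    sd_2 <- 4.51
xbar_1 <- 16.12; xbar_2 <- 19.85   # sample means (automatic, manual)

# Per-group variance of the sample mean
v_1 <- sd_1^2 / n_1
v_2 <- sd_2^2 / n_2

t_stat <- (xbar_1 - xbar_2) / sqrt(v_1 + v_2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_1 + v_2)^2 / (v_1^2 / (n_1 - 1) + v_2^2 / (n_2 - 1))

# Two-sided p-value
p_welch <- 2 * pt(-abs(t_stat), df_welch)

print(c(t = t_stat, df = df_welch, p = p_welch))
```

Both choices of degrees of freedom lead to the same decision at the 0.05 level.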

5.48

(a) Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

(b) Check conditions and describe any assumptions you must make to proceed with the test.

(c) Below is part of the output associated with this test. Fill in the empty cells.

(d) What is the conclusion of the test?

Given: n = 1172

Less than HS	HS	Jr Col	Bachelor’s	Graduate	Total
\(\mu_1=38.67\)	\(\mu_2=39.6\)	\(\mu_3=41.39\)	\(\mu_4=42.55\)	\(\mu_5=40.85\)	\(\mu=40.45\)
\(sd_1=15.81\)	\(sd_2=14.97\)	\(sd_3=18.1\)	\(sd_4=13.62\)	\(sd_5=15.51\)	\(sd=15.17\)
\(n_1=121\)	\(n_2=546\)	\(n_3=97\)	\(n_4=253\)	\(n_5=155\)	\(n=1172\)

Null hypothesis: \(H_0\) : \(\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\)
Alternative Hypothesis: \(H_A\) : at least one of the means \(\mu_1, \mu_2, \mu_3, \mu_4, \mu_5\) differs from the others

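Conditions:
1- Random: assuming the respondents were randomly sampled
2- Normal: every group is large (the smallest has n = 97), so the group means are approximately normal by the Central Limit Theorem
3- Independent: assuming that the different groups’ values are independent, within and across groups
4- Equal variance: the group standard deviations (13.62 to 18.1) are reasonably similar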

mu<-c(38.67, 39.6, 41.39, 42.55, 40.85)
sd<-c(15.81, 14.97, 18.1, 13.62, 15.51)
n<-c(121, 546, 97, 253, 155)

total_n<-1172
df_groups<-length(mu) - 1
df_errors<-total_n - length(mu)
df<-df_groups + df_errors

msg<-501.54
ssg<-round(msg * df_groups,2)

sse<-267382
mse<-round(sse/df_errors,2)
# total sum of squares for the Total row of the table
sst<-ssg + sse

f_value<-round(msg/mse,2)

ANOVA	Df	Sum Sq	Mean Sq	F value	Pr(>F)
degree	4	2006.16	501.54	2.19	0.0682
Residuals	1167	267,382	229.12
Total	1171	269388.16

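Failed to reject \(H_0\) because p_value = 0.0682 > 0.05. The data do not provide convincing evidence that the average number of hours worked differs across the five education groups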
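The p-value in the table was supplied with the problem, but it can be reproduced from the F statistic as a check. This sketch recomputes the table entries without intermediate rounding, so the last digits may differ slightly from the rounded values above; `group_n` and `k` are illustrative names.

```r
# Fill in the ANOVA table from the two values provided (MSG and SSE).
group_n <- c(121, 546, 97, 253, 155)
k <- length(group_n)
total_n <- sum(group_n)

df_groups <- k - 1
df_errors <- total_n - k

msg <- 501.54   # mean square between groups (given)
sse <- 267382   # sum of squared errors (given)

ssg <- msg * df_groups
mse <- sse / df_errors
sst <- ssg + sse
f_value <- msg / mse

# Upper-tail area of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_groups, df_errors, lower.tail = FALSE)

print(round(c(F = f_value, p = p_value), 4))
```

The upper-tail option matters here: `pf` returns the lower tail by default, which would give the complement of the p-value.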