DATA606_W5_HW5_Inference_for_Numerical

On your own

## Warning: package 'ggplot2' was built under R version 3.2.5

5.6 Working backwards, Part II. page(257)

A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

90% confidence interval is (65, 77), n = 25, df = 24 \(mean(\bar{x})=\frac{x1+x2}{2}\) = \(\bar{x}=\frac{65+77}{2} =71 \) Margin of error is \(ME=77-71=6\)

Sample standard deviation

t24 <- qt(0.95, df=24)
t24

## [1] 1.710882

To calculate the sample standard deviation \(ME=t*SE\)

n <- 25
ME <-6
df <- 24
p <- 0.9
p_2tails <- p + (1 - p)/2

t_val <- qt(p_2tails, df)

# Since ME = t * SE
SE <- ME / t_val

# Since SE = sd/sqrt(n)
sd <- SE * sqrt(n)
sd

## [1] 17.53481

Answer \(\bar{x}=71, ME =6, s= 17.53481 \)

5.14 SAT scores. (page 259)

SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

Raina wants to use a 90% confidence interval. How large a sample should she collect?

\(SD(\sigma) = 250, ME=25\ , CE= 0.9\)

\(ME = z.SE\ -> SE = \frac{ME}{z}\) Also we know \(ME = z*\frac{sd}{\sqrt{n}}\) => \(n = (\frac{z.sd}{ME})^2 \)

z <- 1.65 # due to 90% Confidence interval
ME <- 25
sd <- 250

n <- ((z * sd) / ME ) ^ 2
n

## [1] 272.25

Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

Luke needs to collect a larger sample to have higher confidence that the sample is representative of the population.

Calculate the minimum required sample size for Luke.

z <- 2.575 # due to 99% Confidence interval
ME <- 25
sd <- 250

n <- ((z * sd) / ME ) ^ 2
n

## [1] 663.0625

The sample size should be 663 students.

5.20 High School and Beyond, Part I. (page 261)

The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the diffrences in scores are shown below.

(a) Is there a clear difference in the average reading and writing scores?

There is no clear difference in the average reading and writing scores. Box plots show some difference, but the mean scores are fairly close to each other. Additionally, it looks like the histogram of the difference may be centered at 0

(b) Are the reading and writing scores of each student independent of each other? - I would say that the scores are independent of each other.

(c) Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

H_0: The difference of average in between reading and writing equal zero. That is: \(\mu_r - \mu_w = 0\)
H_A: The difference of average in between reading and writing does NOT equal zero. That is: \(\mu_r - \mu_w \neq 0\)

(d) Check the conditions required to complete this test.

The conditions required for this test are independence and normality. We already stated above that the observations are independent. Looking at the box plot, we don’t see any skewness or outliers so we can say that the conditions are satisfied.

(e) The average observed difference in scores is \(\bar{x}_r-\bar{x}_w = ??? 0.545\), and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

n <- 200
mean.diff <- -.545
df <- n-1
SD <- 8.887
SE <- SD/sqrt(n)
T <- (mean.diff-0)/SE
pvalue <- pt(T, df)
pvalue

## [1] 0.1934182

t-value here is , 0.19 > .05 so we fail to reject the null hypothesis. we do no have convining evidence of a difference between the average reading and writing exam scores.

(f) What type of error might we have made? Explain what the error means in the context of the application.

Type I error: Incorrectly reject the null hypothesis. Type II error: Incorrectly reject the alternative hypothesis.

(g) Based on the results of this hypothesis test, would you expect a confidence interval for the average di???erence between the reading and writing scores to include 0? Explain your reasoning.

Yes, I would expect a confidence interval for the average difference between reading and writing scores to include 0. When the confidence interval spans 0 for this kind of hypothesis test, it indicates that the difference is not clearly on one side or the other of zero and therefore results is a failure to reject the null hypothesis of no difference.

5.32 Fuel efficiency of manual and automatic cars, Part I. ( page 266 )

Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a di???erence between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

The hypotheses for this test are as follows:

\(H_0: \mu_a - \mu_m = 0\) \(H_a: \mu_a - \mu_m \ne 0\)

Point estimate of the population difference

n <- 26
m_a <- 16.12
sd_a <- 3.58

m_m <- 19.85
sd_m <- 4.51

Difference in sample means

xDiff <- m_a - m_m
xDiff

## [1] -3.73

Standard error of this point OF estimate

seDiff <- sqrt( (sd_a^2 / n) + (sd_m^2 / n) )
seDiff

## [1] 1.12927

The t-statistic and p-value associated with the difference?

T <- (xDiff - 0) / seDiff
T

## [1] -3.30302

pVal <- pt(T, df=n-1)
pVal

## [1] 0.001441807

Since the p-value is less than 0.05, we reject the null hypothesis H0 and conclude that there is strong evidence of a difference in fuel efficiency between manual and automatic transmissions.

5.48 Work hours and education. (PAGE 272)

The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents.47 Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

(a) Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

The difference of ALL averages is equal. \(H_0: \mu_l = \mu_h = \mu_j = \mu_b = \mu_g\)

There is one average that is NOT equal to the other ones \(H_a: \text{ At least one mean is different } \)

(b) Check conditions and describe any assumptions you must make to proceed with the test.

The observations are independent within and across groups: Given the nature of the survey, we will assume independence within and across the groups.
The data within each group are nearly normal: The box plots do not support nearly normal data within each group. Particularly, the Bachelor’s distribution is skewed and has significant outliers. Each group has outliers though other groups appear closer to normally distributed.
The variability across the groups is about equal: There is general similarity of variabily, though the Bachelor’s again stands out as deviating.

(c) Below is part of the output associated with this test. Fill in the empty cells.

ANOVA	Df	Sum Sq	Mean Sq	F value	Pr(>F)
degree	…..	…..	501.54	…..	0.0682
Residuals	…..	267,382	…..
Total	…..	…..

The following R code was used to compute the values:

mu <- c(38.67, 39.6, 41.39, 42.55, 40.85)
sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)
n <- c(121, 546, 97, 253, 155)
data_table <- data.frame (mu, sd, n)
data_table

##      mu    sd   n
## 1 38.67 15.81 121
## 2 39.60 14.97 546
## 3 41.39 18.10  97
## 4 42.55 13.62 253
## 5 40.85 15.51 155

n <- sum(data_table$n)
k <- length(data_table$mu)
n

## [1] 1172

## [1] 5

Finding degrees of freedom

df <- k - 1
dfResidual <- n - k

# Using the qf function on the Pr(>F) to get the F-statistic:

Prf <- 0.0682
F_statistic <- qf( 1 - Prf, df , dfResidual)

# F-statistic = MSG/MSE

MSG <- 501.54
MSE <- MSG / F_statistic

# MSG = 1 / df * SSG

SSG <- df * MSG
SSE <- 267382

# SST = SSG + SSE, and df_Total = df + dfResidual

SST <- SSG + SSE
dft <- df + dfResidual
dft

## [1] 1171

The values have been included in bold in the table below.

ANOVA	Df	Sum Sq	Mean Sq	F value	Pr(>F)
degree	4	2006.16	501.54	2.188984	0.0682
Residuals	1167	267,382	229.12
Total	1171	269388.16

(d) What is the conclusion of the test? Given the p-value = 0.0682 is greater than 0.05, I conclude that there is not a significant difference between the groups and therefore I don’t reject the null hypothesis.

DATA606_W5_HW5_Inference_for_Numerical_Data