hw5

R Markdown

5.6 A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

The sample mean is the midpoint of the interval: (x1 + x2)/2, where the confidence interval is (x1, x2).

The margin of error is half the width of the interval: (x2 - x1)/2.

For the sample standard deviation, we use margin of error = t * SE, where t comes from the qt() function with df = n - 1 = 25 - 1 = 24, and the standard error is SE = s/sqrt(n). Solving for s gives s = ME * sqrt(n) / t.

n <- 25

x1 <- 65
x2 <- 77
# The margin of error is (x2 - x1)/2 for the confidence interval (x1, x2)
# The sample mean is (x1 + x2)/2 for the same interval
# For the sample standard deviation, use ME = t * s/sqrt(n),
# with t = qt(0.95, df = 25 - 1) for a two-sided 90% interval

sample_mean <- (x2 + x1 )/2
sample_mean

## [1] 71

margin_error <- (x2 - x1 )/2
margin_error

## [1] 6

df <- 25-1
t_value <- qt(.95, df)
t_value

## [1] 1.710882

# Since Margin of Error = t * SE 
std_error <- margin_error/t_value

# Since SE = sd/sqrt(n)
# SD = std_error * sqrt (n)
std_deviation <- std_error * sqrt(n)

std_deviation

## [1] 17.53481

Sample mean: 71

Margin of error: 6

Sample standard deviation: 17.53
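
The three steps above can be wrapped in one small helper. This is only a sketch under the same assumptions as the solution (a t-based interval with df = n - 1); the function name ci_to_summary and its argument names are illustrative and not part of the original work.

```r
# Recover the sample mean, margin of error, and sample standard deviation
# from a t-based confidence interval (lower, upper) built from n observations.
# ci_to_summary is a hypothetical helper name used only for illustration.
ci_to_summary <- function(lower, upper, n, conf_level = 0.90) {
  sample_mean  <- (lower + upper) / 2
  margin_error <- (upper - lower) / 2
  # Two-sided interval: (1 - conf_level) / 2 of the area sits in each tail
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  # ME = t* * s / sqrt(n)  =>  s = ME * sqrt(n) / t*
  sample_sd <- margin_error * sqrt(n) / t_star
  list(mean = sample_mean, margin_error = margin_error, sd = sample_sd)
}

res <- ci_to_summary(65, 77, n = 25, conf_level = 0.90)
res$mean          # midpoint of the interval
res$margin_error  # half-width of the interval
res$sd            # agrees with the 17.53481 computed above
```

Writing the confidence level as an argument makes it easy to see how the implied standard deviation would change for a different level with the same interval.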

5.14 SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

| 1 - alpha | z_(alpha/2) |
|-----------|-------------|
| 0.90      | 1.645       |
| 0.95      | 1.960       |
| 0.98      | 2.325       |
| 0.99      | 2.575       |

Raina wants to use a 90% confidence interval. How large a sample should she collect?

Note that z = 1.645 reflects the 90% confidence level; the calculation below rounds it to 1.65.

Here margin of error = z * SE, and the standard error is SE = sd/sqrt(n).

So margin of error = z * sd/sqrt(n). Solving for n gives n = (z * sd / margin of error)^2.

z <- 1.65
margin_of_error <- 25
std_deviation <- 250

n <- ((z * std_deviation) / margin_of_error ) ^ 2

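The computed value is 272.25. A sample size must be a whole number, and rounding down would push the margin of error above 25 points, so Raina should sample at least 273 students.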
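
Because a sample size has to be a whole number, the computed value is always rounded up. A minimal sketch of that step follows; min_sample_size is a hypothetical helper name introduced here for illustration, and it assumes the population standard deviation is known.

```r
# Smallest whole-number sample size that keeps the margin of error at or
# below the target, assuming a known population standard deviation.
# min_sample_size is a hypothetical helper name used only for illustration.
min_sample_size <- function(z, std_deviation, margin_of_error) {
  n_exact <- (z * std_deviation / margin_of_error)^2
  ceiling(n_exact)  # round up: rounding down would exceed the target margin
}

n_raina <- min_sample_size(z = 1.65, std_deviation = 250, margin_of_error = 25)
n_raina
```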

Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

From n = (z * sd / margin of error)^2, the required sample size grows with the square of z when the standard deviation and the margin of error are held fixed. The z value for 99% confidence (2.575) is larger than the z value for 90% (1.645), so Luke's sample must be larger than Raina's, that is, more than about 272 students.
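
This reasoning can be checked numerically without computing either sample size. With the standard deviation and margin of error fixed, the ratio of the two sample sizes depends only on the squared ratio of the critical values. The sketch below takes the critical values from qnorm() rather than the rounded table entries; the variable names are illustrative.

```r
# Critical values for two-sided 90% and 99% confidence intervals
z_90 <- qnorm(1 - 0.10 / 2)
z_99 <- qnorm(1 - 0.01 / 2)

# With sd and margin of error held fixed, n is proportional to z^2,
# so the ratio of required sample sizes is (z_99 / z_90)^2
size_ratio <- (z_99 / z_90)^2
size_ratio  # greater than 1, so the 99% interval needs the larger sample
```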

Calculate the minimum required sample size for Luke.

z <- 2.575
margin_of_error <- 25
std_deviation <- 250

n <- ((z * std_deviation) / margin_of_error ) ^ 2

The computed value is 663.0625, so Luke needs a sample of at least 664 students.

5.20 The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

A. Is there a clear difference in the average reading and writing scores?

Answer: I do not see a clear difference between the average reading and writing scores. The distribution of the differences is fairly normal and centered near zero, though there seems to be a slight skew to the right.

B. Are the reading and writing scores of each student independent of each other?

Answer: Because the 200 students come from a simple random sample, one student's scores are independent of another student's scores. However, the reading and writing scores of the same student are paired, so they are not independent of each other for a given student.

C. Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

Answer: The hypotheses for the difference in the average score of students could be as follows:

H0: mean_(read) - mean_(write) = 0

HA: mean_(read) - mean_(write) ≠ 0

D. Check the conditions required to complete this test.

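Answer: Independence: The reading and writing scores are paired within each student, so the two scores of one student are not independent of each other. The test is therefore run on the differences, and the differences are independent across students because the students come from a simple random sample.

Normal distribution: The histogram of the differences is reasonably symmetric with no extreme outliers, and the sample is large (n = 200), so the normality condition is reasonably satisfied.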

E. The average observed difference in scores is x̄_read - x̄_write = -0.545, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

Answer: H0: The difference of average scores is equal to zero, that is, mean_diff = 0.

HA: The difference of average scores is not equal to zero, that is, mean_diff ≠ 0.

sd_Diff <- 8.887
mu_Dif <- -0.545
n <- 200

SE_Diff <- sd_Diff / sqrt(n)

# Compute the t statistic
t_value <- (mu_Dif - 0) / SE_Diff

df <- n - 1

# Two-sided p-value, because HA is "not equal to 0"
p <- 2 * pt(-abs(t_value), df = df)

The two-sided p-value is about 0.387 > 0.05, so we fail to reject the null hypothesis. We do not have convincing evidence of a difference between the average reading and writing exam scores.
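
Because the alternative hypothesis is two-sided, the p-value has to count both tails of the t distribution. The sketch below packages the paired test from summary statistics; paired_t_from_summary is a hypothetical helper name, not a function from any package.

```r
# Two-sided one-sample t-test on paired differences, from summary statistics.
# paired_t_from_summary is a hypothetical helper name used for illustration.
paired_t_from_summary <- function(mean_diff, sd_diff, n, null_value = 0) {
  se     <- sd_diff / sqrt(n)
  t_stat <- (mean_diff - null_value) / se
  df     <- n - 1
  # Two-sided p-value: double the tail area beyond |t|
  p_value <- 2 * pt(-abs(t_stat), df = df)
  list(t = t_stat, df = df, p_value = p_value)
}

res <- paired_t_from_summary(mean_diff = -0.545, sd_diff = 8.887, n = 200)
res$t
res$p_value > 0.05  # TRUE: fail to reject H0 at the 5% level
```

Using -abs(t_stat) makes the calculation correct whether the observed difference is negative or positive.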

F. What type of error might we have made? Explain what the error means in the context of the application.

Answer: Type I error: rejecting the null hypothesis when it is actually true.

Type II error: failing to reject the null hypothesis when the alternative hypothesis is actually true.

Since we did NOT reject the null hypothesis, we may have made a Type II error. In this context, that would mean there really is a difference between the average reading and writing exam scores, but our sample did not provide enough evidence to detect it.

G. Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

Answer: Yes. Because the two-sided test failed to reject the null hypothesis of zero difference at the 5% significance level, I would expect the corresponding 95% confidence interval for the average difference to include 0.
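
The expectation can be verified by building the interval from the summary statistics given in part E. This is a sketch that assumes a 95% confidence level, matching a 5% significance level; the variable names are illustrative.

```r
sd_diff   <- 8.887
mean_diff <- -0.545
n         <- 200

se_diff <- sd_diff / sqrt(n)
t_star  <- qt(0.975, df = n - 1)  # critical value for a two-sided 95% interval

# Point estimate plus and minus the margin of error
ci <- mean_diff + c(-1, 1) * t_star * se_diff
ci

# The interval contains 0 exactly when its endpoints have opposite signs
ci[1] < 0 && ci[2] > 0
```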

5.32 Fuel efficiency of manual and automatic cars, Part I. Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

Answer: H0: The difference of average city mileage is equal to zero: mean.A - mean.M = 0.

HA: The difference of average city mileage is not equal to zero: mean.A - mean.M ≠ 0.

n <- 26
# Automatic
mu_a <- 16.12
sd_a <- 3.58
# Manual
mu_m <- 19.85
sd_m <- 4.51
# difference in sample means
mu_Diff <- mu_a - mu_m

# standard error of this point estimate
SE_Diff <- ( (sd_a ^ 2 / n) + ( sd_m ^ 2 / n) ) ^ 0.5

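t_val <- (mu_Diff - 0) / SE_Diff
# Conservative degrees of freedom: min(n1, n2) - 1, with both samples of size 26
df <- n - 1
# Two-sided p-value, because HA is "not equal to 0"
pvalue <- 2 * pt(-abs(t_val), df = df)
round(pvalue, 4)

## [1] 0.0029

The p-value, about 0.0029, is less than 0.05, so we reject the null hypothesis. The data support the alternative hypothesis that there is a difference in average city fuel efficiency between cars with manual and automatic transmissions.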
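
Since the alternative hypothesis is two-sided, the p-value should include both tails. The sketch below runs the two-sample test from summary statistics with the conservative df = min(n1, n2) - 1, which equals the n - 1 = 25 used above; two_sample_t_from_summary is a hypothetical helper name.

```r
# Two-sided two-sample t-test from summary statistics, using the
# conservative degrees of freedom min(n1, n2) - 1.
# two_sample_t_from_summary is a hypothetical helper name.
two_sample_t_from_summary <- function(mean1, sd1, n1, mean2, sd2, n2) {
  se     <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  t_stat <- (mean1 - mean2) / se
  df     <- min(n1, n2) - 1
  # Two-sided p-value: double the tail area beyond |t|
  p_value <- 2 * pt(-abs(t_stat), df = df)
  list(t = t_stat, df = df, p_value = p_value)
}

# Automatic (16.12, 3.58) versus manual (19.85, 4.51), 26 cars each
res <- two_sample_t_from_summary(16.12, 3.58, 26, 19.85, 4.51, 26)
res$t
res$p_value < 0.05  # TRUE: reject H0 at the 5% level
```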

5.48 The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

Write hypotheses for evaluating whether the average number of hours worked varies across the five groups

Answer: H0: The average number of hours worked is the same across all five groups.

HA: The average number of hours worked is not the same for all groups; at least one group has a different mean.

Check conditions and describe any assumptions you must make to proceed with the test.

Answer: For ANOVA we must check independence within and across groups, approximate normality within each group, and nearly equal variability across groups. Judging from the box plots, some of the groups are skewed and some are approximately normally distributed.

Below is part of the output associated with this test. Fill in the empty cells.

mu <- c(38.67, 39.6, 41.39, 42.55, 40.85)
sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)
n <- c(121, 546, 97, 253, 155)
data_table <- data.frame (mu, sd, n)

N <- sum(data_table$n)
MU <- length(data_table$mu)

# Finding degrees of freedom
df <- MU - 1
dfResidual <- N - MU

dfResidual

## [1] 1167

# Using the qf function on the Pr(>F) to get the F-statistic:
Prf <- 0.0682
F_statistic <- qf( 1 - Prf, df , dfResidual)
F_statistic

## [1] 2.188931

# F-statistic = MSG/MSE

MSG <- 501.54
MSE <- MSG / F_statistic

# MSG = 1 / df * SSG

SSG <- df * MSG
SSE <- 267382

# SST = SSG + SSE, and df_Total = df + dfResidual

SST <- SSG + SSE
dft <- df + dfResidual

dft

## [1] 1171

ANOVA	Df	Sum Sq	Mean Sq	F value	Pr(>F)
degree	4	2006.16	501.54	2.19	0.0682
Residuals	1167	267382	229.12	NA	NA
Total	1171	269388.16	NA	NA	NA
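
As a cross-check on the filled-in cells, the sketch below rebuilds them from the definitions. It computes MSE directly as SSE / df_residual instead of backing it out of the rounded p-value, then confirms that the resulting F statistic reproduces Pr(>F). The variable names are illustrative.

```r
# Known quantities from the problem output
df_group    <- 4       # number of groups - 1
df_residual <- 1167    # total respondents - number of groups
MSG         <- 501.54
SSE         <- 267382

# Fill in the remaining cells from the definitions
SSG    <- df_group * MSG      # sum of squares between groups
MSE    <- SSE / df_residual   # mean square error
SST    <- SSG + SSE           # total sum of squares
F_stat <- MSG / MSE

# Upper-tail area of the F distribution; should be close to Pr(>F) = 0.0682
p_value <- pf(F_stat, df_group, df_residual, lower.tail = FALSE)
round(c(SSG = SSG, MSE = MSE, SST = SST, F = F_stat, p = p_value), 4)
```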

D. What is the conclusion of the test?

Since the p-value of 0.0682 is greater than 0.05, I do not reject the null hypothesis. The data do not provide convincing evidence that the average number of hours worked differs across the five educational attainment groups.


Rajwant Mishra

March 25, 2019
