Homework 5

5.6 Working backwards, Part II.

A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

n <- 25
CI <- 0.90
lower_bound <- 65
upper_bound <- 77

#find the cumulative probability for the critical value (0.95 for a two-sided 90% CI)
p_2tails <- CI + (1 - CI)/2

#find degree of freedom
df <- n-1

#find margin of error
ME <- (upper_bound-lower_bound)/2
ME

## [1] 6

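#find critical t-value (named t_star so the built-in T is not overwritten)
t_star <- qt(p_2tails, df)

#find sample mean (midpoint of the interval)
sample_mean <- lower_bound + ME

#find sample standard deviation from ME = t_star * s / sqrt(n)
sample_sd <- sqrt(n)*ME/t_star

sample_mean

## [1] 71

sample_sd

## [1] 17.53481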
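
As a sanity check, the interval can be rebuilt from the recovered statistics. This is a minimal sketch, not part of the original solution; the names `check_mean`, `check_sd`, and `rebuilt` are illustrative.

```r
# Rebuild the 90% CI from the recovered sample mean and standard deviation.
n <- 25
conf_level <- 0.90
lower_bound <- 65
upper_bound <- 77

t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)       # critical value, 24 df
check_mean <- (lower_bound + upper_bound) / 2            # midpoint of the interval
check_sd <- sqrt(n) * (upper_bound - lower_bound) / 2 / t_star

# mean -/+ t_star * s / sqrt(n) should reproduce the original bounds
rebuilt <- check_mean + c(-1, 1) * t_star * check_sd / sqrt(n)
rebuilt
```

If the algebra is right, `rebuilt` matches the bounds 65 and 77 given in the problem.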

5.14 SAT Scores

SAT Scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students in this college as part of a class project. They want their margin of error to be no more than 25 points.

Raina wants to use a 90% confidence interval. How large a sample should she collect?

Z * SE <= ME

SE = SD/sqrt(n)

Z * SD/sqrt(n) <= ME

Z * SD <= ME * sqrt(n)

n >= (Z * SD/ME)^2

Z <- 1.65
ME <- 25
sd <- 250

n <- (Z*sd/ME)^2
#round up: n = 272.25, and 272 students would leave the margin of error slightly above 25
ceiling(n)

## [1] 273

The sample should include at least 273 students.

Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

We showed that n >= (Z * SD/ME)^2.

Since the Z-score for a 99% confidence interval is larger than the Z-score for a 90% confidence interval, while SD and ME stay the same, Luke’s sample should be larger than Raina’s.

Calculate the minimum required sample size for Luke.

Z <- 2.576

n <- (Z*sd/ME)^2
#round up to the next whole student
ceiling(n)

## [1] 664

The sample should include at least 664 students.
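
The sample-size rule n >= (Z * SD/ME)^2 can be wrapped in a small function. The sketch below is illustrative only: `min_sample_size` and its argument names are hypothetical, and it takes the critical value from `qnorm()` instead of a rounded table value.

```r
# Hypothetical helper: smallest n such that z_star * pop_sd / sqrt(n) <= me.
min_sample_size <- function(conf_level, pop_sd, me) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  ceiling((z_star * pop_sd / me)^2)          # round up to a whole observation
}

n_raina <- min_sample_size(0.90, pop_sd = 250, me = 25)
n_luke  <- min_sample_size(0.99, pop_sd = 250, me = 25)
c(raina = n_raina, luke = n_luke)
```

Because `qnorm(0.95)` is about 1.645, not the rounded 1.65, this sketch gives a slightly smaller sample for Raina than the hand calculation. Luke's sample is still the larger of the two.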

5.20 High School and Beyond, Part I.

Is there a clear difference in the average reading and writing scores?

There is no clear difference in the average reading and writing scores: the distribution of the differences in scores is approximately normal and centered near zero.

Are the reading and writing scores of each student independent of each other?

The scores of different students are independent of each other. However, the reading and writing scores of the same student are not independent of each other.

Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

Null hypothesis: There is NO difference in the average scores of students in the reading and writing exams. Alternative hypothesis: There is a difference in the average scores of students in the reading and writing exams.

Check the conditions required to complete this test.

First, we assume that the data come from a random sample. Second, we have to check the independence of observations. Each student has both a reading and a writing score, so the data are paired, and the two scores of one student are not independent of each other. What the test requires is that the differences are independent across students, which is reasonable if the sample is random and represents less than 10% of the population. Third, we have to check whether the data are normally distributed. The box plot suggests that the data are reasonably normally distributed. Moreover, no outliers exist.

Because the scores are paired, we run the t-test on the per-student differences instead of treating the reading and writing scores as two independent samples.

The average observed difference in scores is $\bar{x}_{read} - \bar{x}_{write} = -0.545$, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

n <- 200
sd_diff <- 8.887
mean_diff <- -0.545

#find degree of freedom
df <- n-1
#find standard error
SE <- sd_diff/sqrt(n)

#find T-value
T_val <- mean_diff/SE

#find two-sided p-value (the alternative hypothesis is "there is a difference")
p <- 2 * pt(-abs(T_val), df)
p

## [1] 0.3868364

Since the p-value is greater than 0.05, we fail to reject the null hypothesis. There is no convincing evidence of a difference between the average reading and writing exam scores.

What type of error might we have made? Explain what the error means in the context of the application.

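Type I error: rejecting the null hypothesis when it is actually true. Type II error: failing to reject the null hypothesis when it is false.

Since we failed to reject the null hypothesis, we might have made a Type II error: there may be a real difference between the average reading and writing scores that our test did not detect.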

Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

A confidence interval that includes 0 indicates that there is no convincing evidence of a difference in means. Since we failed to reject the null hypothesis, we would expect a confidence interval (at the matching confidence level) for the average difference between the reading and writing scores to include 0.
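
The interval itself can be computed from the summary statistics. The sketch below assumes a 95% confidence level, chosen to match the 0.05 significance level used in the test; the names `se`, `t_star`, and `ci` are illustrative.

```r
# 95% CI for the mean difference (read - write) from the summary statistics.
n <- 200
mean_diff <- -0.545
sd_diff <- 8.887

se <- sd_diff / sqrt(n)                  # standard error of the mean difference
t_star <- qt(0.975, df = n - 1)          # two-sided 95% critical value
ci <- mean_diff + c(-1, 1) * t_star * se
ci
ci[1] < 0 & ci[2] > 0                    # TRUE when the interval contains 0
```

The lower bound is negative and the upper bound is positive, so the interval contains 0, in agreement with the test.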

5.32 Fuel efficiency of manual and automatic cars, Part I.

Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

Let’s state hypotheses.

Null hypothesis: There is NO difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage. Alternative hypothesis: There is a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage.

# sample size of each group
n_aut <- 26
n_man <- 26

# Automatic transmission
mean_aut <- 16.12
sd_aut <- 3.58

# Manual transmission
mean_man <- 19.85
sd_man <- 4.51

# difference in sample means
mean_diff <- mean_aut - mean_man

# standard error of this point estimate (each variance divided by its own sample size)
SE <- sqrt(sd_aut^2 / n_aut + sd_man^2 / n_man)

#find t-value
t_val <- (mean_diff - 0) / SE

#find degrees of freedom (conservative choice: smaller sample size minus 1)
df <- min(n_aut, n_man) - 1

#find two-sided p-value
p <- 2 * pt(-abs(t_val), df = df)
round(p, 3)

## [1] 0.003

Since the p-value is less than 5%, we reject the null hypothesis. There is strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage.
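
A Welch-style calculation from the summary statistics is one alternative way to run this test. The sketch below is illustrative, not the textbook's method: `welch_from_summary` is a hypothetical helper, it assumes 26 cars in each sample, and it uses the Welch-Satterthwaite approximation for the degrees of freedom.

```r
# Hypothetical helper: two-sided Welch t-test from summary statistics.
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  se <- sqrt(v1 + v2)
  t_stat <- (m1 - m2) / se
  # Welch-Satterthwaite approximation to the degrees of freedom
  df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(-abs(t_stat), df_welch)
  list(t = t_stat, df = df_welch, p = p_value)
}

# automatic vs. manual transmission
res <- welch_from_summary(16.12, 3.58, 26, 19.85, 4.51, 26)
res
```

The approximate degrees of freedom fall between 25 and 50, and the p-value stays below 0.05, so the conclusion does not depend on which choice is used.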

5.48 Work hours and education.

The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

Null hypothesis: The average number of hours worked is the same across all five groups. Alternative hypothesis: At least one group average differs from the others.

Check conditions and describe any assumptions you must make to proceed with the test.

The observations are independent within and between groups: I would assume that the observations are independent based on the nature of the data.

The data within each group are nearly normal: every group has outliers, and only some groups look approximately normal, so we have to assume that the departures from normality are not severe.

The variability across the groups is about equal: judging by the standard deviations, I would say that the variance is relatively equal across groups.

Below is part of the output associated with this test. Fill in the empty cells.

group <- c("LessthanHS","HS","JrColl","Bachelors","Graduates")
mean <- c(38.67, 39.6, 41.39, 42.55, 40.85)
sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)
n <- c(121, 546, 97, 253, 155)
data <- data.frame (group,mean, sd, n)
head(data)

##        group  mean    sd   n
## 1 LessthanHS 38.67 15.81 121
## 2         HS 39.60 14.97 546
## 3     JrColl 41.39 18.10  97
## 4  Bachelors 42.55 13.62 253
## 5  Graduates 40.85 15.51 155

n_total <- sum(data$n)
k <- length(data$mean)

#find degrees of freedom
degree_df <- k - 1
degree_residuals <- n_total - k
df_total <- degree_df + degree_residuals

#values given in the output
Pr_F <- 0.0682
Mean_Sq_degree <- 501.54
Sum_Sq_residuals <- 267382

#find Sum Sq
Sum_Sq_degree <- degree_df * Mean_Sq_degree
Sum_sq_total <- Sum_Sq_degree + Sum_Sq_residuals

#find Mean Sq
Mean_Sq_residuals <- Sum_Sq_residuals / degree_residuals

#find F-statistic
F_value <- Mean_Sq_degree / Mean_Sq_residuals

#assemble the ANOVA table (NA marks cells that are empty by design)
anova_table <- data.frame(
  source  = c("degree", "Residuals", "Total"),
  Df      = c(degree_df, degree_residuals, df_total),
  Sum_Sq  = c(Sum_Sq_degree, Sum_Sq_residuals, Sum_sq_total),
  Mean_Sq = round(c(Mean_Sq_degree, Mean_Sq_residuals, NA), 2),
  F_value = round(c(F_value, NA, NA), 2),
  Pr_F    = c(Pr_F, NA, NA)
)
anova_table

##      source   Df    Sum_Sq Mean_Sq F_value   Pr_F
## 1    degree    4   2006.16  501.54    2.19 0.0682
## 2 Residuals 1167 267382.00  229.12      NA     NA
## 3     Total 1171 269388.16      NA      NA     NA

What is the conclusion of the test?

Since the p-value (0.0682) is greater than 0.05, we fail to reject the null hypothesis. The data do not provide convincing evidence that the average number of hours worked differs across the five groups.
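
The given Pr(>F) can be cross-checked against the other table entries. This is a minimal sketch with illustrative variable names; it uses only the values given in the output (Mean Sq for degree, Sum Sq for Residuals) and the group sizes.

```r
# Recompute the F statistic and its p-value from the given table entries.
df_group <- 5 - 1          # k - 1, with k = 5 education groups
df_resid <- 1172 - 5       # n - k, with n = 1,172 respondents
ms_group <- 501.54         # Mean Sq for degree (given)
ss_resid <- 267382         # Sum Sq for Residuals (given)

f_stat <- ms_group / (ss_resid / df_resid)
p_check <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)
c(F = f_stat, p = p_check)
```

The recomputed upper-tail probability should agree with the tabulated 0.0682 up to rounding.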