DATA606_Homework7

Working backwards, Part II. (5.24, p. 203) A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

The sample mean is the midpoint of the confidence interval. For an interval $(x_1, x_2)$, the sample mean is $(x_1 + x_2)/2$.

# sample Mean
n <- 25
x1 <- 65
x2 <- 77

s_mean <- (x1+x2)/2
s_mean

## [1] 71

The margin of error is half the width of the confidence interval. For an interval $(x_1, x_2)$, it is $(x_2 - x_1)/2$.

# Margin of Error
ME <- (x2-x1)/2
ME

## [1] 6

To calculate the sample standard deviation, we use the formula $ME = t^{*} \cdot SE$ with $SE = s/\sqrt{n}$, where the critical value $t^{*}$ comes from the qt() function with $n - 1$ degrees of freedom (df).

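# Sample standard deviation
df <- n - 1              # degrees of freedom
p <- 0.9                 # confidence level
x <- p + (1 - p)/2       # upper quantile for a two-sided interval (0.95)
t_val <- qt(x, df)       # critical value t*
SE <- ME/t_val           # from ME = t* x SE
SD <- SE * sqrt(n)       # from SE = SD/sqrt(n)
SD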

## [1] 17.53481
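As a quick sanity check, the interval can be rebuilt from the values just computed. This is a minimal sketch that recomputes everything from the interval endpoints; the names `check_lower` and `check_upper` are illustrative and not part of the original solution.

```r
# Rebuild the 90% confidence interval from the derived mean and SD
n  <- 25
x1 <- 65
x2 <- 77

s_mean <- (x1 + x2) / 2            # sample mean (midpoint of the interval)
ME     <- (x2 - x1) / 2            # margin of error (half-width)
t_star <- qt(0.95, df = n - 1)     # critical value for 90% confidence
SD     <- ME / t_star * sqrt(n)    # sample standard deviation

# mean +/- t* x SD/sqrt(n) should return the original endpoints
check_lower <- s_mean - t_star * SD / sqrt(n)
check_upper <- s_mean + t_star * SD / sqrt(n)
c(check_lower, check_upper)
```

If the two values match the endpoints 65 and 77, the mean, margin of error, and standard deviation are mutually consistent.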

SAT scores. (7.14, p. 261) SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

Raina wants to use a 90% confidence interval. How large a sample should she collect?

z <- 1.65
ME <- 25
SD <- 250

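# Round up: n must be at least (z*SD/ME)^2 to keep the margin of error within 25
raina_sample <- ceiling(((z * SD)/ME)^2)
raina_sample

## [1] 273

Answer: Raina should collect a sample of at least 273 students.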

Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

Answer: Luke wants a higher confidence level than Raina, so his critical value z* is larger. For a fixed standard deviation and margin of error, the required sample size is $n = (z^{*} \sigma / ME)^2$, which grows with the square of z*. Luke therefore needs a larger sample than Raina.
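The comparison can be illustrated without computing either sample size. The sketch below uses `qnorm()` for the critical values instead of table lookups (an assumption, so the values differ slightly from rounded table entries) and compares the two z* values and the implied ratio of required sample sizes.

```r
# Critical values for two-sided 90% and 99% confidence levels
z_90 <- qnorm(1 - 0.10 / 2)
z_99 <- qnorm(1 - 0.01 / 2)

# With the same SD and margin of error, n is proportional to z*^2,
# so the ratio of sample sizes depends only on the critical values
size_ratio <- (z_99 / z_90)^2

c(z_90 = z_90, z_99 = z_99, size_ratio = size_ratio)
```

A ratio above 1 means the 99% interval requires the larger sample.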

Calculate the minimum required sample size for Luke.

z <- 2.575 # for a 99% confidence level
ME <- 25
SD <- 250

# Round up: n must be at least (z*SD/ME)^2 to keep the margin of error within 25
Luke_sample <- ceiling(((z*SD)/ME)^2)
Luke_sample

## [1] 664

Answer: Luke needs a sample of at least 664 students.

High School and Beyond, Part I. (7.20, p. 266) The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

scores <- c(hsb2$read, hsb2$write)
gp <- c(rep('read', nrow(hsb2)), rep('write', nrow(hsb2)))
par(mar = c(3, 4, 0.5, 0.5), las = 1, mgp = c(2.8, 0.7, 0), 
    cex.axis = 1.1, cex.lab = 1.1)
openintro::dotPlot(scores, gp, vertical = TRUE, ylab = "scores", 
                   at=1:2+0.13, col = COL[1,3], 
                   xlim = c(0.5,2.5), ylim = c(20, 80), 
                   axes = FALSE, cex.lab = 1.25, cex.axis = 1.25)
axis(1, at = c(1,2), labels = c("read","write"), cex.lab = 1.25, cex.axis = 1.25)
axis(2, at = seq(20, 80, 20), cex.axis = 1.25)
boxplot(scores ~ gp, add = TRUE, axes = FALSE, col = NA)

par(mar=c(3.3, 2, 0.5, 0.5), las = 1, mgp = c(2.1, 0.7, 0), 
    cex.lab = 1.25, cex.axis = 1.25)
histPlot(hsb2$read - hsb2$write, col = COL[1], 
         xlab = "Differences in scores (read - write)", ylab = "")

Is there a clear difference in the average reading and writing scores?

Answer: There is no clear difference, although the average reading score appears to be slightly below the average writing score.

Are the reading and writing scores of each student independent of each other?

Answer: No. Each student has both a reading score and a writing score, so the two scores of the same student are paired and dependent. The scores of different students are independent of each other.

Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

Answer:
H₀: The average difference between students' reading and writing scores is 0
H_A: The average difference between students' reading and writing scores is not 0

Check the conditions required to complete this test.

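Answer: Independence - The reading and writing scores are paired because both come from the same student, so the test is run on the differences. The students are a simple random sample, so the differences for different students are independent. Normality - The histogram shows that the differences have a nearly normal distribution.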

The average observed difference in scores is ${ \bar { x } }_{ read-write }=-0.545$, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

Answer:

n <- 200
SD_diff <- 8.887
df <- n-1 # degree of freedom
Mean_diff <- -0.545
SE <- SD_diff/sqrt(n)
t_val <- (Mean_diff)/(SE)
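p_val <- 2 * pt(-abs(t_val), df=df) # two-sided p-value, matching H_A: difference is not 0
round(p_val, 4)

## [1] 0.3868

Because the p-value is larger than 0.05, we fail to reject the null hypothesis. These data do not provide convincing evidence of a difference between the average reading and writing scores.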

What type of error might we have made? Explain what the error means in the context of the application.

Answer: Because we did not reject the null hypothesis, we may have made a Type II error. This is a false negative: there may really be a difference between the average reading and writing scores that our test failed to detect.

Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

Answer: Yes. The test did not reject the null hypothesis that the average difference is 0, so 0 is a plausible value for the difference. I would therefore expect a confidence interval at the corresponding confidence level (95% for a 0.05 significance level) to include 0.
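This expectation can be checked directly from the summary statistics. The following is a minimal sketch of a 95% confidence interval for the mean difference, assuming a 0.05 significance level for the test; the variable names are illustrative.

```r
# 95% confidence interval for the mean of the paired differences
n         <- 200
mean_diff <- -0.545
sd_diff   <- 8.887

se     <- sd_diff / sqrt(n)         # standard error of the mean difference
t_star <- qt(0.975, df = n - 1)     # critical value for 95% confidence
ci     <- mean_diff + c(-1, 1) * t_star * se

ci
includes_zero <- ci[1] < 0 && ci[2] > 0
includes_zero
```

The interval has a negative lower bound and a positive upper bound, so it contains 0, in agreement with the outcome of the hypothesis test.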

Fuel efficiency of manual and automatic cars, Part II. (7.28, p. 276) The table provides summary statistics on highway fuel economy of cars manufactured in 2012. Use these statistics to calculate a 98% confidence interval for the difference between average highway mileage of manual and automatic cars, and interpret this interval in the context of the data.

Answer: 98% Confidence interval

H₀: The difference between the average highway mileage of manual and automatic cars is 0
H_A: The difference between the average highway mileage of manual and automatic cars is not 0

n <- 26
mean_A <- 22.92 # mean of Automatic
mean_M <- 27.88 # mean of manual
sd_A <- 5.29 # standard deviation of Automatic
sd_M <- 5.01 # standard deviation of manual
mean_diff <- mean_M - mean_A # difference means of Manual and Automatic fuel efficiency
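# Standard error of the difference in means
SE <- sqrt((sd_M^2)/n + (sd_A^2)/n)
df <- (n+n)-2

# Critical value for a 98% confidence level (named t_star to avoid masking T, R's alias for TRUE)
t_star <- abs(qt(0.02/2, df=df))
t_star

## [1] 2.403272

# 98% confidence interval: point estimate +/- t* x SE
lower <- mean_diff - t_star * SE
upper <- mean_diff + t_star * SE

round(lower, 3) ; round(upper, 3)

## [1] 1.526

## [1] 8.394

The 98% confidence interval of (1.526, 8.394) does not include zero. We are 98% confident that manual cars average between 1.53 and 8.39 miles per gallon more on the highway than automatic cars. We therefore reject the null hypothesis in favor of the alternative that average fuel efficiency differs between automatic and manual cars.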
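The choice of degrees of freedom affects the width of the interval. As a hedged sketch, the code below compares three common choices: the pooled value n1 + n2 - 2 used above, the Welch-Satterthwaite approximation, and the conservative min(n1, n2) - 1. The helper name `ci_diff` is hypothetical and not part of the original solution.

```r
# Summary statistics from the exercise
n         <- 26
sd_A      <- 5.29                  # automatic
sd_M      <- 5.01                  # manual
mean_diff <- 27.88 - 22.92         # manual minus automatic

v_A <- sd_A^2 / n                  # variance of each sample mean
v_M <- sd_M^2 / n
SE  <- sqrt(v_A + v_M)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_A + v_M)^2 / (v_A^2 / (n - 1) + v_M^2 / (n - 1))

# Hypothetical helper: confidence interval for a given df and level
ci_diff <- function(df, level = 0.98) {
  t_star <- qt(1 - (1 - level) / 2, df = df)
  mean_diff + c(-1, 1) * t_star * SE
}

ci_pooled       <- ci_diff(2 * n - 2)
ci_welch        <- ci_diff(df_welch)
ci_conservative <- ci_diff(n - 1)

rbind(pooled = ci_pooled, welch = ci_welch, conservative = ci_conservative)
```

The conservative choice gives the widest interval. If all three intervals stay above zero, the conclusion does not depend on which degrees of freedom are used.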

Email outreach efforts. (7.34, p. 284) A medical research group is recruiting people to complete short surveys about their medical history. For example, one survey asks for information on a person’s family history in regards to cancer. Another survey asks about what topics were discussed during the person’s last visit to a hospital. So far, as people sign up, they complete an average of just 4 surveys, and the standard deviation of the number of surveys is about 2.2. The research group wants to try a new interface that they think will encourage new enrollees to complete more surveys, where they will randomize each enrollee to either get the new interface or the current interface. How many new enrollees do they need for each interface to detect an effect size of 0.5 surveys per enrollee, if the desired power level is 80%?
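No worked answer is recorded for this exercise. One standard approach, sketched below, requires the effect size to equal the sum of the two relevant z-values times the standard error of the difference, and then solves for n. The sketch assumes a two-sided test at the 0.05 significance level, equal group sizes, and a standard deviation of 2.2 in both groups. The significance level is not stated in the exercise, so treat it as an assumption.

```r
sd_surveys <- 2.2     # standard deviation of surveys completed
effect     <- 0.5     # effect size to detect
power      <- 0.80    # desired power
alpha      <- 0.05    # ASSUMED significance level (two-sided)

z_alpha <- qnorm(1 - alpha / 2)    # rejection cutoff under H0
z_power <- qnorm(power)            # distance needed to reach the desired power

# effect = (z_alpha + z_power) * sqrt(2 * sd^2 / n), solved for n and rounded up
n_per_group <- ceiling(2 * sd_surveys^2 * ((z_alpha + z_power) / effect)^2)
n_per_group
```

Under these assumptions the calculation gives 304 new enrollees for each interface.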

Work hours and education. The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

Answer:
H₀: The mean number of hours worked is the same across all five educational attainment groups ($\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$)
H_A: At least one group's mean number of hours worked differs from the others

Check conditions and describe any assumptions you must make to proceed with the test.

Answer: Assuming the 1,172 respondents were sampled randomly, the observations are independent both within and across groups. The sample is also well under 10% of the US population. To proceed, we must also assume that hours worked are approximately normally distributed within each group and that the variability is roughly equal across groups.

Below is part of the output associated with this test. Fill in the empty cells.

Credit to www.khanacademy.com. They had great explanations on how to calculate the MSG, MSE, and F values.
Another website with great information on how to calculate the MSG, MSE, and F values is: http://oak.ucc.nau.edu/rh232/courses/EPS625/Handouts/One-Way%20ANOVA/Hand%20Calculation%20of%20ANOVA.pdf

             Df    Sum Sq   Mean Sq   F-value   Pr(>F)
degree        4     2,004    501.54    2.1868   0.0684
Residuals  1167   267,374    229.11
Total      1171   269,378
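The relationships between the cells can be sketched in a few lines of R. This is a minimal illustration that takes the number of groups, the total sample size, and the two Sum Sq entries from the table as inputs. The variable names are illustrative, and small differences from the table's entries can arise from rounding of the inputs.

```r
k        <- 5         # number of educational attainment groups
n_total  <- 1172      # total respondents
ss_group <- 2004      # Sum Sq for degree (from the table)
ss_error <- 267374    # Sum Sq for residuals (from the table)

# Degrees of freedom
df_group <- k - 1
df_error <- n_total - k
df_total <- n_total - 1

# Mean squares are sums of squares divided by their degrees of freedom
ms_group <- ss_group / df_group
ms_error <- ss_error / df_error

# F statistic and its upper-tail p-value
f_value <- ms_group / ms_error
p_value <- pf(f_value, df_group, df_error, lower.tail = FALSE)

c(df_group = df_group, df_error = df_error, df_total = df_total,
  ms_group = ms_group, ms_error = ms_error, f_value = f_value, p_value = p_value)
```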

What is the conclusion of the test?

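Answer: At a significance level of 0.05, the p-value of about 0.068 is greater than 0.05, so we fail to reject the null hypothesis. The data do not provide convincing evidence that the average number of hours worked differs across the five educational attainment groups.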

Don Padmaperuma (Geeth)

11/4/2019