Grando 5 Homework

if (Sys.info()["sysname"] == "Windows") {
    setwd("~/Masters/DATA606/Week5/Homework")
} else {
    setwd("~/Documents/Masters/DATA606/Week5/Homework")
}
require(ggplot2)
## Loading required package: ggplot2

5.6 Working backwards, Part II. A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

Answer:

The sample mean is the mid-point in the confidence interval:

(77 - 65)/2 + 65
## [1] 71

The margin of error is half the width of the confidence interval.

(77 - 65)/2
## [1] 6

The sample standard deviation can be determined by rearranging the margin of error formula. Because the population standard deviation is unknown and the sample is small (n = 25), the critical value comes from the t distribution with n - 1 = 24 degrees of freedom:

\[\text{Margin of Error (ME)} = t^{*}_{df} \times SE\\ ME = t^{*}_{24} \times \frac{s}{\sqrt{n}}\]

t_star <- qt(p = 0.95, df = 24)
round(t_star, digits = 3)
## [1] 1.711
round(6 * sqrt(25)/t_star, digits = 2)
## [1] 17.53

\[s = ME \times \sqrt{n} / t^{*}_{24} = 6 \times \sqrt{25} / 1.711 = 17.53\]
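As a round-trip check, the sketch below recovers the sample mean, margin of error, and sample standard deviation from the interval endpoints and then rebuilds the interval from them. It uses a t critical value with 24 degrees of freedom, the usual choice when the population standard deviation is unknown and n is small. The helper name `ci_to_summary` is hypothetical and introduced only for illustration.

```r
# Recover summary statistics from a confidence interval for a mean.
# Assumes a t-based interval (sigma unknown); ci_to_summary is a hypothetical helper.
ci_to_summary <- function(lower, upper, n, conf_level) {
    x_bar <- (lower + upper) / 2                        # midpoint of the interval
    me <- (upper - lower) / 2                           # half the interval width
    t_star <- qt(p = 1 - (1 - conf_level) / 2, df = n - 1)
    s <- me * sqrt(n) / t_star                          # rearranged ME = t* * s / sqrt(n)
    list(mean = x_bar, me = me, t_star = t_star, s = s)
}

res <- ci_to_summary(lower = 65, upper = 77, n = 25, conf_level = 0.90)
round(res$s, digits = 2)

# Rebuild the interval from the recovered values; it should return 65 and 77
rebuilt <- res$mean + c(-1, 1) * res$t_star * res$s / sqrt(25)
rebuilt
```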

5.14 SAT scores. SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

(a) Raina wants to use a 90% confidence interval. How large a sample should she collect?

Answer:

The margin of error is the critical value multiplied by the standard error:

\[\text{Margin of Error (ME)} = Z \times SE\\ ME = Z \times \frac{s}{\sqrt{n}}\]

Rearranging the formula for n gives the required sample size:

\[n = \left( \frac{Z \times s}{ME} \right)^{2}\]

ceiling((1.65 * 250/25)^2)
## [1] 273

\[n = (1.65 \times 250 / 25)^{2} = 272.25, \text{ rounded up to } 273\]

(b) Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

Answer:

From the formula in answer (a), we can see that as the Z value increases (which happens when we move from a 90% to a 99% confidence level), the sample size increases. This makes sense: if the margin of error is to remain the same, then when the numerator (Z * s) gets larger, the denominator (\(\sqrt{n}\)) must also increase. Luke's sample should therefore be larger than Raina's.

(c) Calculate the minimum required sample size for Luke.

Answer:

p_range <- pnorm(q = 2.57, mean = 0, sd = 1) - pnorm(q = -2.57, 
    mean = 0, sd = 1)
p_range
## [1] 0.9898301
ceiling((2.57 * 250/25)^2)
## [1] 661

\[n = (2.57 \times 250 / 25)^{2} = 660.49, \text{ rounded up to } 661\]
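The sample size formula can be wrapped in a small function that takes the critical value from `qnorm` instead of a rounded table value. This is a minimal sketch; the helper name `min_sample_size` is hypothetical and not part of the original solution.

```r
# Minimum sample size for a target margin of error: n = (z * sd / me)^2, rounded up.
# min_sample_size is a hypothetical helper, not part of the original solution.
min_sample_size <- function(conf_level, sd, me) {
    z_star <- qnorm(p = 1 - (1 - conf_level) / 2)  # two-sided critical value
    ceiling((z_star * sd / me)^2)
}

n_raina <- min_sample_size(conf_level = 0.90, sd = 250, me = 25)
n_luke <- min_sample_size(conf_level = 0.99, sd = 250, me = 25)
c(raina = n_raina, luke = n_luke)
```

With unrounded critical values this gives 271 and 664, slightly different from the 273 and 661 obtained with the rounded values 1.65 and 2.57. Either way, Luke needs the larger sample.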

5.20 High School and Beyond, Part I. The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

(a) Is there a clear difference in the average reading and writing scores?

Answer:

No, there does not appear to be a clear difference.

(b) Are the reading and writing scores of each student independent of each other?

Answer:

No, some students may perform better on standardized tests, be more intelligent than average, etc. which means that their reading and writing scores would generally be higher than the average, thus not being independent of one another.

(c) Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

Answer:

\[H_{0}: \mu_{\text{reading score}} - \mu_{\text{writing score}} = 0\\ H_{A}: \mu_{\text{reading score}} - \mu_{\text{writing score}} \neq 0\]

(d) Check the conditions required to complete this test.

Answer:

The conditions required to complete this test are as follows:

  1. Independence of observations: From the description provided, it appears a random sample was taken and we can assume that our results represent less than 10% of the population.

  2. Observations come from a nearly normal distribution: From the boxplots and histograms provided, there do not appear to be any significant outliers and the data appears to be normally distributed. We do not have the individual histograms for reading and writing scores but we can assume they are normal as well.

  3. Paired data: Each student reading score observation has an associated writing score.

(e) The average observed difference in scores is \(\bar{x}_{read - write} = 0.545\), and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

Answer:

Since no significance level is specified, we will use a two-tailed test with a significance level of \(\alpha = 0.05\).

se <- 8.887/sqrt(200)
t_value <- (0.545 - 0)/se
# two-sided p-value: double the upper-tail area
round(2 * pt(q = abs(t_value), df = 199, lower.tail = FALSE), digits = 2)
## [1] 0.39
lb <- -4
ub <- 4
t1 <- qt(p = 0.025, df = 199)
t2 <- qt(p = (1 - 0.025), df = 199)
pick_line1 <- round(t_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = dt, 
    args = list(df = 199)) + stat_function(fun = dt, args = list(df = 199), 
    xlim = c(lb, t1), geom = "area", alpha = 0.5) + stat_function(fun = dt, 
    args = list(df = 199), xlim = c(t2, ub), geom = "area", alpha = 0.5) + 
    geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    geom_text(aes(x = pick_line1, y = 0.25, label = sprintf("T = %s\n", 
        pick_line1)), color = "black", angle = 90)

The data do not provide convincing evidence that there is a difference between the average reading and writing scores; therefore, we fail to reject the null hypothesis that the average reading and writing scores are equal.

(f) What type of error might we have made? Explain what the error means in the context of the application.

Answer:

Since we have failed to reject the null hypothesis, we might have made a Type II error. There may be a difference between the average reading and writing scores of students; however, we might have failed to detect it.
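To get a rough sense of how likely a Type II error is here, one can estimate the power of the test with `power.t.test` from base R's stats package. This is only an illustrative sketch: it assumes the true mean difference equals the observed 0.545 and the true standard deviation of the differences equals 8.887, neither of which is known.

```r
# Approximate power of the paired t-test, ASSUMING the true mean difference
# equals the observed 0.545 and the true sd of the differences equals 8.887.
pw <- power.t.test(n = 200, delta = 0.545, sd = 8.887, sig.level = 0.05,
    type = "paired", alternative = "two.sided")
power_est <- pw$power           # chance of detecting the assumed difference
type2_prob <- 1 - power_est     # chance of a Type II error under the same assumption
round(c(power = power_est, type_II = type2_prob), digits = 2)
```

Under these assumptions the estimated power is low (well below one half), so a real difference of this size could easily go undetected with 200 students.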

(g) Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

Answer:

Yes. Since we failed to reject the null hypothesis that the population mean difference is zero at \(\alpha = 0.05\), the corresponding 95% confidence interval for the average difference would include 0.
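This can be confirmed by computing the interval directly from the summary statistics given in part (e). The sketch below assumes a 95% confidence level, matching the \(\alpha = 0.05\) used in the test; the variable names are ours.

```r
# 95% confidence interval for the mean difference (reading - writing),
# built from the summary statistics given in part (e).
x_diff <- 0.545
s_diff <- 8.887
n <- 200
se <- s_diff / sqrt(n)
t_star <- qt(p = 0.975, df = n - 1)     # two-sided critical value, df = 199
ci <- x_diff + c(-1, 1) * t_star * se
round(ci, digits = 2)

# TRUE when 0 lies strictly inside the interval
includes_zero <- ci[1] < 0 && ci[2] > 0
includes_zero
```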

5.32 Fuel efficiency of manual and automatic cars, Part I. Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

Answer:

First, we calculate the standard error from the two groups:

sa <- 3.58
sm <- 4.51
na <- 26
nm <- 26
se <- sqrt(sa^2/na + sm^2/nm)

Then we calculate the t-value:

xa <- 16.12
xm <- 19.85
t_value <- (xa - xm)/se

Using the conservative rule df = min(n_a - 1, n_m - 1), we take df = 25. Also, since a significance level hasn’t been set, we will use \(\alpha = 0.05\) for a two-sided test.

lb <- -4
ub <- 4
t1 <- qt(p = 0.025, df = 25)
t2 <- qt(p = (1 - 0.025), df = 25)
pick_line1 <- round(t_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = dt, 
    args = list(df = 25)) + stat_function(fun = dt, args = list(df = 25), 
    xlim = c(lb, t1), geom = "area", alpha = 0.5) + stat_function(fun = dt, 
    args = list(df = 25), xlim = c(t2, ub), geom = "area", alpha = 0.5) + 
    geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    geom_text(aes(x = pick_line1, y = 0.25, label = sprintf("T = %s\n", 
        pick_line1)), color = "black", angle = 90)

There is significant evidence to suggest that there is a difference between the two groups; therefore, we reject the null hypothesis that there is no difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage.
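The conclusion can also be backed by an explicit p-value rather than only the shaded rejection regions. The sketch below repeats the summary statistics so it runs on its own and uses the same conservative df = 25; `df_conservative` and `p_value` are names introduced here.

```r
# Two-sample t statistic and two-sided p-value from the summary statistics.
sa <- 3.58; sm <- 4.51       # sample standard deviations (automatic, manual)
na <- 26; nm <- 26           # sample sizes
xa <- 16.12; xm <- 19.85     # sample means, city mpg
se <- sqrt(sa^2 / na + sm^2 / nm)
t_value <- (xa - xm) / se
df_conservative <- min(na, nm) - 1
# double the lower-tail area beyond -|t| for a two-sided test
p_value <- 2 * pt(q = -abs(t_value), df = df_conservative)
round(c(t = t_value, p = p_value), digits = 4)
```

The p-value falls below 0.01, so the difference is significant at \(\alpha = 0.05\).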

5.48 Work hours and education. The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

(a) Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

Answer:

\[H_{0}: \mu_{\text{less than HS}} = \mu_{\text{HS}} = \mu_{\text{Jr Coll}} = \mu_{\text{Bachelors}} = \mu_{\text{Graduate}}\\ H_{A}: \text{At least one mean is different}\]

(b) Check conditions and describe any assumptions you must make to proceed with the test.

Answer:

Independence: All groups are randomly sampled and represent less than 10% of the population. We can assume that the hours worked per week by one person are independent of those of another person in the same group, and that the hours worked in different groups are also independent.

Approximately normal: It appears a few of the groups may have some skew; however, the standard deviations are less than the mean and each group has a large sample size.

Constant variance: The standard deviations are all similar between the groups and there do not appear to be any significant outliers in any one group.

(c) Below is part of the output associated with this test. Fill in the empty cells.

Answer:

df1 <- (5 - 1)
df2 <- 1172 - 5
MSG <- 501.54
f_value <- qf(p = (1 - 0.0682), df1 = 4, df2 = 1167)
MSE <- MSG/f_value
SSG <- MSG * df1
SSE <- 267382
SST <- SSG + SSE

f_table <- data.frame(c(df1, df2, df1 + df2), round(c(SSG, SSE, 
    SST), digits = 2), round(c(MSG, MSE, NA), digits = 2), round(c(f_value, 
    NA, NA), digits = 2), c(0.0682, NA, NA))
names(f_table) <- c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
rownames(f_table) <- c("degree", "residuals", "Total")
f_table
##             Df    Sum Sq Mean Sq F value Pr(>F)
## degree       4   2006.16  501.54    2.19 0.0682
## residuals 1167 267382.00  229.13      NA     NA
## Total     1171 269388.16      NA      NA     NA
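The table can be cross-checked in the forward direction: compute the mean square error from the residual sum of squares, form the F statistic, and confirm that its upper-tail probability matches the given Pr(>F). This is a sketch using only the values already in the table; `f_stat` and `p_value` are names introduced here.

```r
# Forward check of the ANOVA table: MSE from SSE, then F and its p-value.
df1 <- 5 - 1          # groups - 1
df2 <- 1172 - 5       # respondents - groups
MSG <- 501.54
SSE <- 267382
MSE <- SSE / df2
f_stat <- MSG / MSE
p_value <- pf(q = f_stat, df1 = df1, df2 = df2, lower.tail = FALSE)
round(c(MSE = MSE, F = f_stat, p = p_value), digits = 4)
```

The recomputed values agree with the table to rounding, which suggests that backing F out of the p-value with `qf` was consistent.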

(d) What is the conclusion of the test?

Answer:

Given that we have not been provided a significance level, we will use \(\alpha = 0.05\). Since the p-value (0.0682) is greater than 0.05, there is not sufficient evidence to reject the null hypothesis that the average number of hours worked is equal across the levels of educational attainment.

lb <- 0
ub <- 4
f1 <- qf(p = (1 - 0.05), df1 = 4, df2 = 1167)
pick_line1 <- round(f_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = df, 
    args = list(df1 = 4, df2 = 1167)) + stat_function(fun = df, 
    args = list(df1 = 4, df2 = 1167), xlim = c(f1, ub), geom = "area", 
    alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black", 
    alpha = 0.75) + geom_text(aes(x = pick_line1, y = 0.25, label = sprintf("F = %s\n", 
    pick_line1)), color = "black", angle = 90) + labs(x = "F Value")