if (Sys.info()["sysname"] == "Windows") {
setwd("~/Masters/DATA606/Week5/Homework")
} else {
setwd("~/Documents/Masters/DATA606/Week5/Homework")
}
require(ggplot2)
## Loading required package: ggplot2
Answer:
The sample mean is the mid-point in the confidence interval:
(77 - 65)/2 + 65
## [1] 71
The margin of error is half the width of the confidence interval:
(77 - 65)/2
## [1] 6
The sample standard deviation can be determined by rearranging the following formula:
\[\text{Margin of Error (ME)} = Z \times SE \\ ME = Z \times \frac{s}{\sqrt{n}}\]
p_range <- pnorm(q = 1.65, mean = 0, sd = 1) - pnorm(q = -1.65,
mean = 0, sd = 1)
p_range
## [1] 0.9010571
6 * sqrt(25)/1.65
## [1] 18.18182
\[s = \frac{ME \times \sqrt{n}}{Z} = \frac{6 \times \sqrt{25}}{1.65} = 18.18\]
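Because the sample is small (n = 25) and the population standard deviation is unknown, a t critical value with n - 1 = 24 degrees of freedom could arguably be used in place of Z = 1.65. The sketch below is an alternative check under that assumption, not part of the original answer; `t_star` and `s_t` are names introduced here for illustration.

```r
# Alternative check (assumption: t critical value instead of Z = 1.65)
n <- 25
me <- (77 - 65) / 2                  # margin of error from the interval
t_star <- qt(p = 0.95, df = n - 1)   # critical value for a 90% two-sided interval
s_t <- me * sqrt(n) / t_star         # rearranged from ME = t* x s / sqrt(n)
c(t_star = t_star, s = s_t)
```

The t critical value is slightly larger than 1.65, so the implied sample standard deviation comes out somewhat smaller than 18.18.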
Answer:
As indicated in the previous response, the formula for the margin of error is the following:
\[\text{Margin of Error (ME)} = Z \times SE \\ ME = Z \times \frac{s}{\sqrt{n}}\]
By rearranging the formula, we can get the answer:
\[n = \left(\frac{Z \times s}{ME}\right)^{2}\]
ceiling((1.65 * 250/25)^2)
## [1] 273
\[n = \left(\frac{1.65 \times 250}{25}\right)^{2} = 272.25 \;\Rightarrow\; n = 273 \text{ (rounded up)}\]
Answer:
From the formula provided in answer (a), \(n = (Z \times s / ME)^{2}\), we can see that the required sample size increases as the Z value increases, which happens when we move from a 90% to a 99% confidence level. This makes sense: if the margin of error is to remain the same, then when the numerator (\(Z \times s\)) gets larger, the denominator (\(\sqrt{n}\)) must also increase.
Answer:
p_range <- pnorm(q = 2.57, mean = 0, sd = 1) - pnorm(q = -2.57,
mean = 0, sd = 1)
p_range
## [1] 0.9898301
ceiling((2.57 * 250/25)^2)
## [1] 661
\[n = \left(\frac{2.57 \times 250}{25}\right)^{2} = 660.49 \;\Rightarrow\; n = 661 \text{ (rounded up)}\]
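The two sample-size calculations follow the same pattern, so they can be wrapped in a small function. The sketch below is illustrative only: `sample_size` is a hypothetical helper, and it assumes the critical value is taken from `qnorm` rather than from the rounded table values used above.

```r
# Hypothetical helper: smallest n such that z * s / sqrt(n) <= me
sample_size <- function(conf_level, s, me) {
  z <- qnorm(p = 1 - (1 - conf_level) / 2)  # two-sided critical value
  ceiling((z * s / me)^2)                   # always round up to a whole observation
}

n_90 <- sample_size(conf_level = 0.90, s = 250, me = 25)
n_99 <- sample_size(conf_level = 0.99, s = 250, me = 25)
c(n_90 = n_90, n_99 = n_99)
```

Because `qnorm` returns unrounded critical values (about 1.645 and 2.576), the results differ by a few observations from the 273 and 661 obtained with the rounded values 1.65 and 2.57. The conclusion is unchanged: the higher confidence level requires a larger sample.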
Answer:
No, there does not appear to be a clear difference.
Answer:
No. A student who performs well on standardized tests, or who is a stronger student in general, will tend to score above average in both reading and writing, so a student's reading and writing scores are not independent of each other.
Answer:
\[H_0: \mu_{\text{reading score}} - \mu_{\text{writing score}} = 0 \\ H_A: \mu_{\text{reading score}} - \mu_{\text{writing score}} \neq 0\]
Answer:
The conditions required to complete this test are as follows:
Independence of observations: From the description provided, it appears a random sample was taken and we can assume that our results represent less than 10% of the population.
Observations come from a nearly normal distribution: From the boxplots and histograms provided, there do not appear to be any significant outliers and the data appears to be normally distributed. We do not have the individual histograms for reading and writing scores but we can assume they are normal as well.
Paired data: Each student reading score observation has an associated writing score.
Answer:
Since a significance level is not specified, we will use a two-tailed test with \(\alpha = 0.05\).
se <- 8.87/sqrt(200)
t_value <- (0.545 - 0)/se
# Two-sided p-value: double the upper-tail area beyond |t|
round(2 * pt(q = abs(t_value), df = 199, lower.tail = FALSE), digits = 3)
## [1] 0.386
lb <- -4
ub <- 4
t1 <- qt(p = 0.025, df = 199)
t2 <- qt(p = (1 - 0.025), df = 199)
pick_line1 <- round(t_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = dt,
args = list(df = 199)) + stat_function(fun = dt, args = list(df = 199),
xlim = c(lb, t1), geom = "area", alpha = 0.5) + stat_function(fun = dt,
args = list(df = 199), xlim = c(t2, ub), geom = "area", alpha = 0.5) +
geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) +
geom_text(aes(x = pick_line1, y = 0.25, label = sprintf("T = %s\n",
pick_line1)), color = "black", angle = 90)
The data do not provide convincing evidence of a difference between the average reading and writing scores: the test statistic falls outside the rejection region at \(\alpha = 0.05\). Therefore, we fail to reject the null hypothesis that the average reading and writing scores are equal.
Answer:
Since we have failed to reject the null hypothesis, we might have made a Type II error. There may be a difference between the average reading and writing scores of students; however, we might have failed to detect it.
Answer:
Yes. Because we failed to reject the null hypothesis that the mean difference is zero at \(\alpha = 0.05\), we would expect the corresponding 95% confidence interval for the mean difference to include 0.
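This expectation can be checked directly from the summary statistics used in the test (mean difference 0.545, standard deviation of the differences 8.87, n = 200). The sketch below is one way to do it; the variable names are introduced here for illustration.

```r
# 95% confidence interval for the mean reading-minus-writing difference
x_diff <- 0.545                      # sample mean of the differences
s_diff <- 8.87                       # sample sd of the differences
n_diff <- 200                        # number of paired observations
se_diff <- s_diff / sqrt(n_diff)
t_star <- qt(p = 0.975, df = n_diff - 1)
ci <- x_diff + c(-1, 1) * t_star * se_diff
ci
```

The lower bound is negative and the upper bound is positive, so the interval contains 0, in agreement with the decision not to reject the null hypothesis.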
Answer:
First, we calculate the standard error from the two groups:
sa <- 3.58
sm <- 4.51
na <- 26
nm <- 26
se <- sqrt(sa^2/na + sm^2/nm)
Then we calculate the t-value:
xa <- 16.12
xm <- 19.85
t_value <- (xa - xm)/se
For the degrees of freedom we use the conservative estimate \(df = \min(n_a - 1, n_m - 1) = 25\). Also, since a significance level has not been set, we will use \(\alpha = 0.05\) for a two-sided test.
lb <- -4
ub <- 4
t1 <- qt(p = 0.025, df = 25)
t2 <- qt(p = (1 - 0.025), df = 25)
pick_line1 <- round(t_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = dt,
args = list(df = 25)) + stat_function(fun = dt, args = list(df = 25),
xlim = c(lb, t1), geom = "area", alpha = 0.5) + stat_function(fun = dt,
args = list(df = 25), xlim = c(t2, ub), geom = "area", alpha = 0.5) +
geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) +
geom_text(aes(x = pick_line1, y = 0.25, label = sprintf("T = %s\n",
pick_line1)), color = "black", angle = 90)
The t-value (about -3.30) falls in the rejection region, so there is significant evidence of a difference between the two groups. We therefore reject the null hypothesis that cars with manual and automatic transmissions have the same average city mileage.
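The plot shows the decision graphically; the same conclusion can be expressed as a p-value. The following sketch repeats the summary statistics so that it runs on its own, and uses the conservative df = 25 from above; `df_cons` and `p_value` are names introduced here.

```r
# Two-sided p-value for the automatic vs. manual comparison
xa <- 16.12; sa <- 3.58; na <- 26    # automatic: mean, sd, n
xm <- 19.85; sm <- 4.51; nm <- 26    # manual: mean, sd, n
se <- sqrt(sa^2 / na + sm^2 / nm)    # standard error of the difference
t_value <- (xa - xm) / se
df_cons <- min(na, nm) - 1           # conservative degrees of freedom
p_value <- 2 * pt(q = -abs(t_value), df = df_cons)
c(t = t_value, p = p_value)
```

A p-value below 0.05 corresponds to the t-value lying inside the shaded rejection region.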
Answer:
\[H_0: \mu_{\text{less than HS}} = \mu_{\text{HS}} = \mu_{\text{Jr Coll}} = \mu_{\text{Bachelors}} = \mu_{\text{Graduate}} \\ H_A: \text{At least one mean is different}\]
Answer:
Independence: All groups are randomly sampled and represent less than 10% of the population. We can assume that the number of hours worked per week by one person is independent of the hours worked by another person in the same group, and that the hours worked per week in different groups are also independent.
Approximately normal: It appears a few of the groups may have some skew; however, the standard deviations are less than the mean and each group has a large sample size.
Constant variance: The standard deviations are all similar between the groups and there do not appear to be any significant outliers in any one group.
Answer:
df1 <- (5 - 1)
df2 <- 1172 - 5
MSG <- 501.54
f_value <- qf(p = (1 - 0.0682), df1 = df1, df2 = df2)
MSE <- MSG/f_value
SSG <- MSG * df1
SSE <- 267382
SST <- SSG + SSE
f_table <- data.frame(c(df1, df2, df1 + df2), round(c(SSG, SSE,
SST), digits = 2), round(c(MSG, MSE, NA), digits = 2), round(c(f_value,
NA, NA), digits = 2), c(0.0682, NA, NA))
names(f_table) <- c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
rownames(f_table) <- c("degree", "residuals", "Total")
f_table
##             Df    Sum Sq Mean Sq F value Pr(>F)
## degree       4   2006.16  501.54    2.19 0.0682
## residuals 1167 267382.00  229.13      NA     NA
## Total     1171 269388.16      NA      NA     NA
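As a cross-check, the mean square error can also be obtained from the other direction, as SSE divided by the residual degrees of freedom, and the F statistic and p-value recomputed from it. This is a sketch using only the values given above; the `_check` names are introduced here for illustration.

```r
# Cross-check of the ANOVA table starting from SSE and the degrees of freedom
df1 <- 5 - 1
df2 <- 1172 - 5
MSG <- 501.54
SSE <- 267382
MSE_check <- SSE / df2               # mean square error
F_check <- MSG / MSE_check           # F statistic
p_check <- pf(q = F_check, df1 = df1, df2 = df2, lower.tail = FALSE)
round(c(MSE = MSE_check, F = F_check, p = p_check), digits = 4)
```

The values should agree with the table up to rounding, which suggests the reconstructed entries are internally consistent.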
Answer:
Since no significance level was provided, we will use \(\alpha = 0.05\). The p-value (0.0682) is greater than 0.05, so there is not sufficient evidence to reject the null hypothesis that the average number of hours worked is equal across the different levels of educational attainment.
lb <- 0
ub <- 4
f1 <- qf(p = (1 - 0.05), df1 = 4, df2 = 1167)
pick_line1 <- round(f_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = df,
args = list(df1 = 4, df2 = 1167)) + stat_function(fun = df,
args = list(df1 = 4, df2 = 1167), xlim = c(f1, ub), geom = "area",
alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black",
alpha = 0.75) + geom_text(aes(x = pick_line1, y = 0.25, label = sprintf("F = %s\n",
pick_line1)), color = "black", angle = 90) + labs(x = "F Value")