Homework 4

Instructions: Answer all questions and submit the problems in order, making sure that the computer output and discussion are placed together. (Do not put the computer output at the end of the homework.) Raw computer output is not acceptable. Make it clear what parts of the output are relevant and show how they answer the questions posed in the homework. Make sure problem numbers and each part (a, b, c) are clearly labeled. You may work together on the homework, but do not copy any part of a homework. Each student must produce his/her own homework to be handed in. Homework is to be submitted via Carmen.

R code should be inserted into R chunks. Answers/comments should be written as plain text OUTSIDE of R chunks, NOT INSIDE of R chunks prefaced by a hashtag. Inserting brief comments in your R code to describe what your code is doing is fine but detailed comments and answers to questions should be written OUTSIDE of R chunks. Review solutions to homework assignments for examples. Combining regular text with chunks of code is one of the main advantages of using a markdown file.

Refer to the lecture notes for examples of R code. Additional help is provided by the links in the Installation of R and RStudio document. When the assignment is completed, please upload two files, your .Rmd file and the output as a .pdf file. Filenames for each homework assignment will have the following format: lastname_firstname_hw#.Rmd which contains the text and R code, and lastname_firstname_hw#.pdf, which is a copy of the output generated by knitting lastname_firstname_hw#.Rmd. For example, for homework 6, I would submit kelbick_nicole_hw6.Rmd and kelbick_nicole_hw6.pdf.

NOTE: When converting to a Word file first, you may edit page breaks and center plots, tables and/or text but nothing else at this time. The command \newpage will cause a page break in your document.

Wherever possible, please show how each answer was derived.

Non-integer values should go out to at least 3 decimal places.

Unless indicated otherwise, use R as a calculator as well as to obtain probabilities, such as p-values, critical values as well as basic calculations, such as means, standard deviations and the like.

Question 1

In addition to the computer’s calculations of miles per gallon, a car’s owner also recorded the miles per gallon by dividing the miles driven by the number of gallons at each fill up. The owner wants to know if his calculation of mpg and the computer’s calculations are significantly different. Below are the mpg recorded by both the computer and driver.

Obs	1	2	3	4	5	6	7	8	9	10
Computer	41.5	50.7	36.6	37.3	34.2	45.0	48.0	43.2	47.7	42.2
Driver	46.5	57.2	36.0	39.0	37.9	49.5	56.0	45.4	52.6	45.2

Obs	11	12	13	14	15	16	17	18	19	20
Computer	43.2	44.6	48.4	46.4	46.8	39.2	37.3	43.5	44.3	43.3
Driver	47.6	44.7	51.4	47.5	47.9	44.2	39.4	47.2	43.7	39.1

Carry out the appropriate hypothesis test according to steps i-v listed below. For the sake of consistency, look at the difference computer - driver. Wherever possible, answers should be written out to at least 3 decimal places. Check out Pairedttest.R for R code related to the seizure drug study example mentioned in the lecture notes. It is located on the paired t-test lecture page on Carmen.

Clearly state the null ($H_0$) and alternative ($H_a$) hypotheses in terms of the parameter of interest. Null: $H_0: \mu_D = 0$, where $\mu_D = \mu_{computer} - \mu_{driver}$; there is no difference between the mean miles per gallon calculated by the computer and by the driver. Alternative: $H_a: \mu_D \ne 0$; there is a difference between the mean miles per gallon calculated by the computer and by the driver.
Calculate the relevant test statistic.

```{r}
# Paired MPG readings from the same 20 fill-ups
computer <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2, 43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
driver <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2, 47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

# Paired t-test on the differences computer - driver
tTest <- t.test(computer, driver, paired = TRUE)
print(tTest)
```

Calculate the appropriate p-value. Indicate degrees of freedom if appropriate.

p-value = 0.0003386, with $n - 1 = 19$ degrees of freedom.

iv. State your conclusion based on an $\alpha = 10\%$ significance level.

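Since the p-value (0.0003386) is less than $\alpha = 0.10$, we reject the null hypothesis and conclude that the mean MPG calculated by the computer differs significantly from the mean MPG calculated by the driver.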
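
As a cross-check on the test statistic and p-value above, the paired $t$-test can be sketched by hand from the differences. This is only a minimal sketch, not the code from Pairedttest.R; the names `d`, `n`, `se_d`, `t_obs`, and `p_value` are illustrative assumptions.

```r
computer <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
              43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
driver <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2,
            47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

d <- computer - driver                      # paired differences, computer - driver
n <- length(d)
se_d <- sd(d) / sqrt(n)                     # standard error of the mean difference
t_obs <- mean(d) / se_d                     # test statistic under H0: mu_D = 0
p_value <- 2 * pt(-abs(t_obs), df = n - 1)  # two-sided p-value with n - 1 df

print(round(c(t_obs = t_obs, df = n - 1, p_value = p_value), 4))
```

The values printed here should match the `t`, `df`, and `p-value` lines reported by `t.test(computer, driver, paired = TRUE)`.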

Determine the Rejection Region in terms of $t_{obs}$.

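```{r}
# Two-sided test at alpha = 0.10: reject H0 when |t_obs| exceeds the upper critical value
critical <- qt(1 - 0.1 / 2, df = length(computer) - 1)
# Rejection region: t_obs in (-Inf, -critical) or in (critical, Inf)
rejectionRegion <- c(-Inf, -critical, critical, Inf)
print(rejectionRegion)
```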

Calculate a $90\%$ confidence interval for the mean difference $\mu_D = \mu_{computer} - \mu_{driver}$ for this car.

```{r}
meanDifference <- mean(computer - driver)
# qt(0.95, ...) is the positive critical value, so the interval prints as (lower, upper)
moe <- qt(0.95, df = length(computer) - 1) * (sd(computer - driver) / sqrt(length(computer)))
confidenceInterval <- meanDifference + c(-1, 1) * moe
print(confidenceInterval)
```

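
The hand-computed interval can be checked against the interval reported by `t.test()`. The sketch below assumes the same data as above; `ci_hand` and `ci_ttest` are illustrative names. Note that `conf.level = 0.90` must be passed explicitly, because `t.test()` defaults to a 95% interval.

```r
computer <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
              43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
driver <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2,
            47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

d <- computer - driver
n <- length(d)

# Margin of error: upper 5% critical value times the standard error of mean(d)
moe <- qt(0.95, df = n - 1) * sd(d) / sqrt(n)
ci_hand <- mean(d) + c(-1, 1) * moe

# Same interval from t.test(); as.numeric() drops the conf.level attribute
ci_ttest <- as.numeric(t.test(computer, driver, paired = TRUE, conf.level = 0.90)$conf.int)

print(round(ci_hand, 3))
print(round(ci_ttest, 3))
```

If the whole interval lies below zero, that is consistent with rejecting $H_0: \mu_D = 0$ at the $10\%$ level.
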
Is it reasonable to assume the differences are Normally distributed? Use plots, such as histograms and/or quantile plots, to support your conclusion.

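```{r}
# Histogram and Normal quantile plot of the differences computer - driver
hist(computer - driver, main = "differences")
qqnorm(computer - driver)
# qqline() adds a reference line to the plot drawn by qqnorm(), so it must come second
qqline(computer - driver)
```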

Are the differences in MPG between the computer and driver considered statistically significant when using the Sign Test? Follow the same basic steps (i - iv) outlined above. How do the conclusions compare between parts (a) and (e)?

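```{r}
# Sign test: count the positive differences and compare with Binomial(n, 0.5)
signTest <- binom.test(sum(computer > driver), length(computer), p = 0.5, alternative = "two.sided")
print(signTest)
```

Only 3 of the 20 differences are positive, which gives a two-sided p-value of approximately 0.0026. Since the p-value is less than $\alpha = 0.10$, we reject the null hypothesis and conclude that the computer's and the driver's MPG calculations differ. This agrees with the conclusion of the paired $t$-test in part (a).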

NOTE: The p-value for a two-sided Sign Test is a little trickier to compute than for a one-sided Sign Test. We need to account for the fact that the number of positive differences could be either above the expected number $(n/2)$ or below it. Suppose the number of positive differences is $a$. Then $n-a$ is the number of negative differences, and the p-value is $P(X \le \min(a, n-a)) + P(X \ge \max(a, n-a))$, where $X \sim \text{Binomial}(n, 0.5)$.
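
The formula in the note can be sketched directly with `pbinom()`. This is an illustrative sketch under the assumptions stated in the comments; the names `lower_tail`, `upper_tail`, and `p_value` are not from the lecture notes.

```r
computer <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
              43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
driver <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2,
            47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

d <- computer - driver
d <- d[d != 0]   # assumption: ties, if any, are dropped because they carry no sign
n <- length(d)
a <- sum(d > 0)  # number of positive differences

# P(X <= min(a, n - a)) for X ~ Binomial(n, 0.5)
lower_tail <- pbinom(min(a, n - a), size = n, prob = 0.5)
# P(X >= max(a, n - a)) = P(X > max(a, n - a) - 1)
upper_tail <- pbinom(max(a, n - a) - 1, size = n, prob = 0.5, lower.tail = FALSE)

# Cap at 1: when a equals n / 2 the two tails overlap
p_value <- min(1, lower_tail + upper_tail)
print(p_value)
```

For these data the result should agree with the p-value reported by `binom.test(a, n, p = 0.5)`.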

Question 2

Suppose Honda wants to compare gas mileage between the latest Accord and Civic models. Are there differences in MPG between the two models? Computer estimates of MPG were used for each model. Assume the readings were done on brand new cars of each model under similar conditions and that observations within a group as well as between groups are independent.

Obs	1	2	3	4	5	6	7	8	9	10
Accord	41.5	50.7	36.6	37.3	34.2	45.0	48.0	43.2	47.7	42.2
Civic	46.5	57.2	36.0	39.0	37.9	49.5	56.0	45.4	52.6	45.2

Obs	11	12	13	14	15	16	17	18	19	20
Accord	43.2	44.6	48.4	46.4	46.8	39.2	37.3	43.5	44.3	43.3
Civic	47.6	44.7	51.4	47.5	47.9	44.2	39.4	47.2	43.7	39.1

Checking Assumptions: An important assumption regarding the two-sample $t$-tests is that both samples originated from Normally distributed populations. Is this a reasonable assumption for the Accord and Civic data? Look at histograms and/or QQ Normal plots for each group. Check out BP_Example_Handout.R for relevant examples of R code related to the calcium and blood pressure study example described in BP_Example_Handout.pdf. Both files can be downloaded from the Carmen assignment page.

```{r}
accord <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2, 43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
civic <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2, 47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

# Side-by-side histograms
par(mfrow = c(1, 2))
hist(accord, main = "Accord", xlab = "MPG")
hist(civic, main = "Civic", xlab = "MPG")

# Side-by-side Normal quantile plots
par(mfrow = c(1, 2))
qqnorm(accord, main = "Accord, MPG")
qqline(accord)
qqnorm(civic, main = "Civic, MPG")
qqline(civic)
```

Is it reasonable to use the pooled $t$-test with this data? Why or why not?

```{r}
# F test comparing the two sample variances
test <- var.test(civic, accord)
print(test)
```

Conduct a 2-sample pooled $t$-test at level $\alpha$ = 0.10. Follow the steps i-v outlined below. For the sake of consistency, look at the difference Civic - Accord. In addition, when calculating the test statistic, show how you calculated $SE(\bar{Y}_{C}-\bar{Y}_A)$, the standard error of $\bar{Y}_{C}-\bar{Y}_A$. You may use R as a calculator of sorts to help you with calculations, as well as use pt() for calculating probabilities and qt() for obtaining critical values or quantiles. Compare your results to those obtained from t.test() by using var.equal=TRUE as one of the arguments to t.test(). You may find lecture notes as well as help(t.test) to be useful.

Clearly state the null ($H_0$) and alternative ($H_a$) hypotheses in terms of the parameter of interest.

Null: $H_0: \mu_{Civic} - \mu_{Accord} = 0$; there is no difference in the mean gas mileage between the Accord and Civic. Alternative: $H_a: \mu_{Civic} - \mu_{Accord} \ne 0$; there is a difference in the mean gas mileage between the Accord and Civic.

Calculate the relevant test statistic.

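```{r}
nA <- length(accord)
nC <- length(civic)
# Pooled variance estimate, then the pooled standard error of mean(civic) - mean(accord)
sp2 <- ((nA - 1) * var(accord) + (nC - 1) * var(civic)) / (nA + nC - 2)
SE <- sqrt(sp2 * (1 / nA + 1 / nC))
t <- (mean(civic) - mean(accord)) / SE
```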

Calculate the appropriate p-value or Rejection Region. Make sure to indicate degrees of freedom if appropriate.

```{r}
df <- length(accord) + length(civic) - 2
pValue <- 2 * pt(-abs(t), df)
# Upper critical value; reject H0 when |t| > criticalValue
criticalValue <- qt(0.95, df)

print(t)
print(pValue)
print(criticalValue)
```

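State your conclusion based on an $\alpha = 10\%$ significance level.

```{r}
tTest <- t.test(civic, accord, var.equal = TRUE)
print(tTest)
```

Since $|t| \approx 1.676$ falls just short of the critical value $t_{0.95, 38} \approx 1.686$ (equivalently, the p-value is slightly above 0.10), we fail to reject the null hypothesis at the $10\%$ level.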

Calculate a corresponding $90\%$ confidence interval for the difference in population MPG means, $\mu_{Civic} - \mu_{Accord}$. Compare your answer to results obtained from t.test() for a two-sample pooled variance setting.

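```{r}
# 90% CI: (difference in sample means) +/- t_{0.95, df} * pooled SE
confidenceInterval1 <- mean(civic) - mean(accord) + c(-1, 1) * qt(0.95, df = length(civic) + length(accord) - 2) * SE
print(confidenceInterval1)

# conf.level must be set to 0.90; t.test() defaults to a 95% interval
tTest <- t.test(civic, accord, var.equal = TRUE, conf.level = 0.90)
print(tTest$conf.int)
```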

Conduct a 2-sample unpooled t-test at level $\alpha$ = 0.10. Follow the steps i-v outlined below. For the sake of consistency, look at the difference Civic - Accord. In addition, when calculating the test statistic, show how you calculated $SE(\bar{Y}_{C}-\bar{Y}_A)$, the standard error of $\bar{Y}_{C}-\bar{Y}_A$. Use Satterthwaite’s approximation to calculate degrees of freedom. Refer to the lecture notes for the formula as well as BP_Example_Handout.R for examples of R code for similar calculations used with the calcium intake and blood pressure data. You may use R as a calculator of sorts to help you with calculations, as well as use pt() for calculating probabilities and qt() for obtaining critical values or quantiles. Compare your results to those obtained from t.test() by using var.equal=FALSE as one of the arguments to t.test(). You may find BP_Example_Handout.pdf and BP_Example_Handout.R as well as help("t.test") to be helpful.

Clearly state the null ($H_0$) and alternative ($H_a$) hypotheses in terms of the parameter of interest. Null: $H_0: \mu_{Civic} - \mu_{Accord} = 0$; there is no difference in the mean gas mileage between the Accord and Civic. Alternative: $H_a: \mu_{Civic} - \mu_{Accord} \ne 0$; there is a difference in the mean gas mileage between the Accord and Civic.
Calculate the relevant test statistic.

```{r}
# Unpooled standard error: each sample keeps its own variance
SE <- sqrt(var(accord) / length(accord) + var(civic) / length(civic))
t <- (mean(civic) - mean(accord)) / SE
```

Calculate the appropriate p-value or Rejection Region.

```{r}
# Satterthwaite's approximation to the degrees of freedom
vA <- var(accord) / length(accord)
vC <- var(civic) / length(civic)
df <- (vA + vC)^2 / (vA^2 / (length(accord) - 1) + vC^2 / (length(civic) - 1))

pValue <- 2 * pt(-abs(t), df)
# Upper critical value; reject H0 when |t| > criticalValue
criticalValue <- qt(0.95, df)

print(t)
print(df)
print(pValue)
print(criticalValue)
```

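State your conclusion based on an $\alpha = 10\%$ significance level.

```{r}
tTest <- t.test(civic, accord, var.equal = FALSE)
print(tTest)
```

We fail to reject the null hypothesis at the $10\%$ level: $|t| \approx 1.676$ is just below the critical value of about 1.69, so the p-value is slightly above 0.10.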

Calculate a corresponding $90\%$ confidence interval for the difference in population MPG means, $\mu_{Civic} - \mu_{Accord}$. Compare your answer to results obtained from t.test() for a two-sample unpooled variance setting.

```{r}
SE_diff <- sqrt(var(accord) / length(accord) + var(civic) / length(civic))

# Satterthwaite degrees of freedom (not n1 + n2 - 2, which is the pooled value)
vA <- var(accord) / length(accord)
vC <- var(civic) / length(civic)
dfWelch <- (vA + vC)^2 / (vA^2 / (length(accord) - 1) + vC^2 / (length(civic) - 1))
tScore <- qt(0.95, df = dfWelch)
moe <- tScore * SE_diff

confidenceInterval <- c(mean(civic) - mean(accord) - moe, mean(civic) - mean(accord) + moe)
print(confidenceInterval)

tTest <- t.test(civic, accord, var.equal = FALSE, conf.level = 0.90)
print(tTest$conf.int)
```

How did the conclusions from the hypothesis tests conducted in parts (c) and (e) compare? Were there any differences?

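Both tests reached the same conclusion: fail to reject the null hypothesis at $\alpha = 0.10$. The test statistics were identical, and both p-values were slightly above 0.10.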

How did the values of $SE(\bar{Y}_{C}-\bar{Y}_A)$ compare in parts (c) and (e) under the different $t$-test procedures?

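The values of $SE(\bar{Y}_{C}-\bar{Y}_A)$ were the same in parts (c) and (e). The two samples have equal sizes ($n_C = n_A = 20$), and in that case the pooled and unpooled standard errors coincide.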
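
The equality of the two standard errors can be checked numerically. The sketch below is illustrative only; `sp2`, `se_pooled`, and `se_unpooled` are assumed names, and the closing comment gives the algebraic reason for the match.

```r
accord <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
            43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
civic <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2,
           47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

n_a <- length(accord)
n_c <- length(civic)

# Pooled: one common variance estimate, weighted by degrees of freedom
sp2 <- ((n_a - 1) * var(accord) + (n_c - 1) * var(civic)) / (n_a + n_c - 2)
se_pooled <- sqrt(sp2 * (1 / n_a + 1 / n_c))

# Unpooled: each sample keeps its own variance
se_unpooled <- sqrt(var(accord) / n_a + var(civic) / n_c)

# With n_a == n_c == n, sp2 is the plain average of the two variances,
# so sp2 * (2 / n) equals var(accord) / n + var(civic) / n
print(c(pooled = se_pooled, unpooled = se_unpooled))
```

With unequal sample sizes the two standard errors would generally differ.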

How did the degrees of freedom compare in parts (c) and (e) under the different assumptions?

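The degrees of freedom differed. The pooled test in part (c) used $n_C + n_A - 2 = 38$ degrees of freedom, while the unpooled test in part (e) used Satterthwaite's approximation, which gives a smaller, non-integer value (approximately 35.5).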
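
A minimal sketch of the two degrees-of-freedom calculations follows. The helper `welch_df()` is a hypothetical name introduced here for illustration; it is not a base R function and not taken from BP_Example_Handout.R.

```r
accord <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
            43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
civic <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2,
           47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

# Satterthwaite's approximation for the unpooled two-sample t-test
welch_df <- function(x, y) {
  vx <- var(x) / length(x)  # estimated variance of mean(x)
  vy <- var(y) / length(y)  # estimated variance of mean(y)
  (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
}

df_pooled <- length(civic) + length(accord) - 2
df_welch <- welch_df(civic, accord)

print(c(pooled = df_pooled, welch = df_welch))
```

The unpooled value should match the `df` line printed by `t.test(civic, accord, var.equal = FALSE)`.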

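How did the confidence intervals derived in parts (d) and (f) compare? Which one had a larger margin of error?

The confidence interval in (f) was slightly wider and had a larger margin of error. The standard errors are equal, but the smaller degrees of freedom in (f) give a larger critical value.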
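
The two margins of error can be compared side by side. This sketch assumes a $90\%$ confidence level and uses illustrative names (`se`, `moe_pooled`, `moe_welch`); it relies on the pooled and unpooled standard errors being equal here because the sample sizes are equal.

```r
accord <- c(41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
            43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3)
civic <- c(46.5, 57.2, 36.0, 39.0, 37.9, 49.5, 56.0, 45.4, 52.6, 45.2,
           47.6, 44.7, 51.4, 47.5, 47.9, 44.2, 39.4, 47.2, 43.7, 39.1)

n_a <- length(accord)
n_c <- length(civic)
v_a <- var(accord) / n_a
v_c <- var(civic) / n_c

# Standard error of mean(civic) - mean(accord); same value under both procedures here
se <- sqrt(v_a + v_c)

df_pooled <- n_a + n_c - 2
df_welch <- (v_a + v_c)^2 / (v_a^2 / (n_a - 1) + v_c^2 / (n_c - 1))

# Margin of error = upper 5% critical value times the standard error
moe_pooled <- qt(0.95, df_pooled) * se
moe_welch <- qt(0.95, df_welch) * se

print(round(c(pooled = moe_pooled, welch = moe_welch), 4))
```

Because the critical value of the $t$ distribution grows as the degrees of freedom shrink, the unpooled margin is the larger of the two, though only slightly for these data.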