Load some potentially useful packages:

library(ggplot2)

library(dplyr)


Question 1

This is a general question on measuring distances which pertains to the definition of a test statistic:

(a) Suppose that you and a friend are in two different classes. On your respective mid-terms, you each obtained 85%, and the class average in both of your classes was 80%. Could one of you still be considered further above the class average than the other? Briefly explain.


(b) Now suppose that your class’s test scores were more tightly packed around 80%: maybe the standard deviation of your class’s scores was 2.5, whereas the standard deviation of your friend’s class’s scores was 5. Which of you or your friend was further above the class average, and how might you summarize this information?
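
One way to summarize the comparison in (b) is to express each score as a number of standard deviations above the class average (a z-score). The following is a minimal sketch using only the numbers given in the question; the variable names are illustrative and not part of the assignment.

```r
# Score and class average from the question
score <- 85
class_mean <- 80

# Standard deviations of the two classes
sd_you <- 2.5
sd_friend <- 5

# Distance above the class average, in standard deviation units
z_you <- (score - class_mean) / sd_you
z_friend <- (score - class_mean) / sd_friend

c(you = z_you, friend = z_friend)
```

With these numbers, your score sits 2 standard deviations above your class average, while your friend's sits 1 standard deviation above theirs.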



Question 2

Provide a response to the following conceptual questions regarding p-values:

(a) Broadly speaking, does a p-value measure the chance of a hypothesis being true or the chance of data having occurred?

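It measures the chance of data occurring: specifically, the chance of seeing data at least as extreme as what we observed, calculated under the assumption that the null hypothesis is true.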


(b) Why can’t a p-value measure the other quantity that you didn’t choose in (a)?

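A p-value cannot be the chance that a hypothesis is true because, by definition, it is the chance of observing data at least as extreme as ours under the assumption that the null hypothesis is correct. Because the calculation takes the null hypothesis as given, it cannot also tell us how likely that hypothesis is.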


(c) Try to explain in your own words why, in calculating a p-value, we need to assume that the null hypothesis is true.

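We need to assume that the null hypothesis is true because that assumption tells us which values of the test statistic we would expect to see by chance alone. Without it, we would have no reference distribution against which to judge whether the observed data are unusual, and so no basis for rejecting the null hypothesis.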
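
The role of the null hypothesis in a p-value can be illustrated with a small simulation. This is only a sketch under stated assumptions: the sample size (30), the observed test statistic (2.1), and the normal population are all made up for illustration and do not come from any dataset in this assignment.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n <- 30            # hypothetical sample size
observed_t <- 2.1  # hypothetical observed test statistic

# Simulate the test statistic many times in a world where the null
# hypothesis (true mean = 0) holds
null_t <- replicate(5000, {
  x <- rnorm(n, mean = 0, sd = 1)
  mean(x) / (sd(x) / sqrt(n))
})

# Two-sided p-value: share of null-world statistics at least as far
# from 0 as the observed one
p_value <- mean(abs(null_t) >= abs(observed_t))
p_value
```

The p-value here is a statement about how often the null world produces data this extreme; nothing in the calculation measures how likely the null hypothesis itself is.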



Question 3

1. Consider the following output obtained from a made up dataset:



(a) TRUE or FALSE: The value 1.3712 from the above output represents the true population mean value of the response variable, y.

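False

If you answered FALSE, re-write the statement so that it is TRUE.

The value 1.3712 from the above output represents the observed (sample) mean value of the response variable, y, which is an estimate of the true population mean.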

(b) TRUE or FALSE: The value 0.1551 from the above output represents the standard deviation of the y variable.

If you answered FALSE, re-write the statement so that it is TRUE.

False. The value 0.1551 from the above output represents the standard error of the estimated mean of y, i.e., the estimated standard deviation of the sample mean across repeated samples.


(c) Provide an interpretation of the 8.838 value from the above output.

The t value of 8.838 means that the observed mean, 1.3712, is about 8.8 standard errors above 0, the value of the mean under the null hypothesis.


(d) TRUE or FALSE: The null hypothesis associated with the p-value, 1.03e-11, from the above output is: \(H_0\): (true mean value of y) = 1.3712.

False

If you answered FALSE, re-write the statement so that it is TRUE.

The null hypothesis associated with the p-value, 1.03e-11, from the above output is: \(H_0\): (true mean value of y) = 0.


(e) Use the output to calculate what the test statistic would be for the hypothesis test with the null hypothesis: \(H_0\): (true mean value of y) = 1.6.

(1.3712 - 1.6) / 0.1551 = -1.475
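
The calculation in (e) follows the general recipe (estimate - null value) / standard error. The following sketch applies it to the values quoted in this question; the helper name `test_stat` is hypothetical and introduced only for illustration.

```r
# Test statistic: (estimate - null value) / standard error
test_stat <- function(estimate, null_value, std_error) {
  (estimate - null_value) / std_error
}

# Null value of 0, the default test reported in summary output
t_zero <- test_stat(1.3712, 0, 0.1551)

# Null value of 1.6, as in part (e)
t_e <- test_stat(1.3712, 1.6, 0.1551)

round(c(null_0 = t_zero, null_1.6 = t_e), 3)
```

The first value comes out near 8.84 rather than exactly 8.838 because the estimate and standard error quoted in the question are themselves rounded. The second value is negative because the observed mean lies below 1.6.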


(f) Suppose we were testing:

\(H_0\): (true mean value of y) = 1.6 vs. \(H_a\): (true mean value of y) \(\neq\) 1.6

What would the p-value be for this hypothesis test?

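round(2 * pnorm(-1.475), 2)
## [1] 0.14

About 0.14 (two-sided, using a test statistic of (1.3712 - 1.6) / 0.1551 = -1.475 and a normal approximation).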


(g) Based on the p-value you obtained in (f), what would we conclude?

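Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is not enough evidence to conclude that the true mean value of y differs from 1.6.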



Question 4

Let \(\beta\) represent the true value of a model coefficient. Which of the following hypotheses would be pairs that we would never test in our course? You may select MORE than 1 pair.

a. \(~H_0:~ \beta = 0 ~~~vs.~~ H_a:~ \beta \neq 0\)

b. \(~H_0:~ \beta \neq 0 ~~~vs.~~ H_a:~ \beta = 0\)

c. \(~H_0:~ \beta = 5 ~~~vs.~~ H_a:~ \beta \neq 5\)

d. \(~H_0:~ \beta = 0 ~~~vs.~~ H_a:~ \beta > 0\)

e. \(~H_0:~ \beta = 5 ~~~vs.~~ H_a:~ \beta > 8\)

f. \(~H_0:~ \beta < 0 ~~~vs.~~ H_a:~ \beta = 0\)



Question 5

USNews.csv has data on 1167 colleges obtained from an edition of the U.S. News College Rankings. For each college, we have the following variables:

College: Name of school

GradRate: College graduation rate (as a value from 0 to 100)

SFRatio: Student-to-faculty ratio

AdmisRate: Percentage of applicants accepted (as a value from 0 to 100)

FacultyPhD: Percentage of faculty with a PhD (as a value from 0 to 100)

Type: School type (Private or Public)

Region: Location of school: Midwest, NorthEast, South, or West

State: State in which the school is located

MathSAT: Average Math SAT score of entering first-years

VerbalSAT: Average Verbal SAT score of entering first-years

ACT: Average ACT score of entering first-years


Read in the data:

USNews = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/USNews.csv' )


(a) Make the constant only model for GradRate, and print out its summary. What are the hypotheses associated with the p-value on the (Intercept) coefficient?

M12=lm(formula=GradRate~1,data=USNews)
M12
## 
## Call:
## lm(formula = GradRate ~ 1, data = USNews)
## 
## Coefficients:
## (Intercept)  
##       60.53
summary(M12)
## 
## Call:
## lm(formula = GradRate ~ 1, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.529 -12.529  -0.529  13.471  39.471 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  60.5287     0.5522   109.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.86 on 1166 degrees of freedom

The null hypothesis is that the true mean graduation rate is equal to 0, and the alternative hypothesis is that the true mean graduation rate is not equal to 0.

(b) Given the hypotheses you stated in (a), briefly explain why the p-value on the (Intercept) coefficient is so tiny.

Because the estimated mean (60.53) is 109.6 standard errors away from the hypothesized mean of 0. Assuming the null hypothesis is correct, the chance of getting an estimate that far from 0 is incredibly low.


(c) Is the observed mean (or “sample mean”) graduation rate in this dataset greater than 60%?

Yes, the sample mean graduation rate is 60.53%.


(d) Is there enough evidence from this data to conclude that the true mean graduation rate in the population is greater than 60%? To answer this question, state the hypotheses, calculate the test statistic and p-value, and make your conclusion.

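USNews %>% summarize(mean(GradRate))
##   mean(GradRate)
## 1       60.52871
summary(M12)
## 
## Call:
## lm(formula = GradRate ~ 1, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.529 -12.529  -0.529  13.471  39.471 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  60.5287     0.5522   109.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.86 on 1166 degrees of freedom
round((60.5287 - 60) / 0.5522, 2)
## [1] 0.96
round(1 - pnorm((60.5287 - 60) / 0.5522), 2)
## [1] 0.17

The null hypothesis is that the true mean graduation rate is equal to 60%, and the alternative hypothesis is that it is greater than 60%. The t value of 109.6 in the summary output tests against a mean of 0, so it cannot be used here. The test statistic is (60.5287 - 60) / 0.5522 = 0.96, and the one-sided p-value is about 0.17. Since the p-value is greater than 0.05, we fail to reject the null hypothesis: even though the sample mean is above 60%, there is not enough evidence to conclude that the true mean graduation rate is greater than 60%.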


(e) Is there a relationship between GradRate and SFRatio? Fit an appropriate model to answer this question, and fill in the following pieces en route to your ultimate conclusion:

M13=lm(formula=GradRate~SFRatio,data=USNews)
M13
## 
## Call:
## lm(formula = GradRate ~ SFRatio, data = USNews)
## 
## Coefficients:
## (Intercept)      SFRatio  
##      80.272       -1.338
summary(M13)
## 
## Call:
## lm(formula = GradRate ~ SFRatio, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.419 -12.085   0.088  12.084  77.564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.2721     1.6813   47.74   <2e-16 ***
## SFRatio      -1.3381     0.1084  -12.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.75 on 1165 degrees of freedom
## Multiple R-squared:  0.1157, Adjusted R-squared:  0.115 
## F-statistic: 152.5 on 1 and 1165 DF,  p-value: < 2.2e-16

Part 1: State the hypotheses being tested in terms of a model coefficient:

The null hypothesis is that the coefficient for SFRatio is equal to 0; the alternative hypothesis is that the coefficient for SFRatio is not equal to 0.

Part 2: Re-state the hypotheses being tested in terms of a comparison of two models:

The null hypothesis model is GradRate ~ 1 (a horizontal trend line), and the alternative hypothesis model is GradRate ~ 1 + SFRatio (a non-horizontal trend line).

Part 3: State the value of the test statistic for this question:

-12.35

Part 4: State the value of the p-value for this question:

less than 2*10^-16

Part 5: State your conclusion:

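Since the p-value is far below 0.05, we reject the null hypothesis: there is a relationship between SFRatio and GradRate, with higher student-to-faculty ratios associated with lower graduation rates.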
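
The coefficient version and the model-comparison version of this test are two views of the same calculation. The sketch below illustrates this on simulated stand-in data (not the USNews data; the variable names, sample size, and coefficients are assumptions chosen for illustration): the F statistic from comparing the two nested models with `anova()` equals the square of the t statistic on the slope.

```r
set.seed(2)  # arbitrary seed, for reproducibility only

# Simulated stand-in data (not the USNews data)
n <- 200
x <- runif(n, 5, 25)
y <- 80 - 1.3 * x + rnorm(n, sd = 15)
sim <- data.frame(x = x, y = y)

null_model <- lm(y ~ 1, data = sim)      # horizontal line
alt_model  <- lm(y ~ 1 + x, data = sim)  # line with a slope

# t statistic for the slope in the alternative model
t_value <- coef(summary(alt_model))["x", "t value"]

# F statistic from comparing the two nested models
f_value <- anova(null_model, alt_model)$F[2]

c(t_squared = t_value^2, F = f_value)
```

The same pattern is visible in the summary output above: the t value for SFRatio is -12.35, and its square, about 152.5, matches the reported F-statistic.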


(f) Is there a relationship between GradRate and SFRatio after controlling for Type? Fit an appropriate model to answer this question, and fill in the following pieces en route to your ultimate conclusion:

M14=lm(formula=GradRate~SFRatio+Type,data=USNews)
M14
## 
## Call:
## lm(formula = GradRate ~ SFRatio + Type, data = USNews)
## 
## Coefficients:
## (Intercept)      SFRatio   TypePublic  
##      76.882       -0.798      -12.781
summary(M14)
## 
## Call:
## lm(formula = GradRate ~ SFRatio + Type, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.024 -11.629   0.541  12.141  52.100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  76.8817     1.6255  47.298  < 2e-16 ***
## SFRatio      -0.7980     0.1136  -7.026 3.62e-12 ***
## TypePublic  -12.7811     1.1356 -11.255  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.86 on 1164 degrees of freedom
## Multiple R-squared:  0.2025, Adjusted R-squared:  0.2011 
## F-statistic: 147.8 on 2 and 1164 DF,  p-value: < 2.2e-16
confint(M14)
##                  2.5 %      97.5 %
## (Intercept)  73.692546  80.0708959
## SFRatio      -1.020902  -0.5751716
## TypePublic  -15.009076 -10.5530809

Part 1: State the hypotheses being tested in terms of a model coefficient:

In the model that also includes Type, the null hypothesis is that the SFRatio coefficient is equal to 0; the alternative hypothesis is that the SFRatio coefficient is not equal to 0.

Part 2: Re-state the hypotheses being tested in terms of a comparison of two models:

The null hypothesis model is GradRate ~ Type, and the alternative hypothesis model is GradRate ~ SFRatio + Type.

Part 3: State the value of the test statistic for this question:

-7.026

Part 4: State the value of the p-value for this question:

3.62*10^-12

Part 5: State your conclusion:

Because the p-value for SFRatio is very small, we can reject the null hypothesis. Thus we conclude that, even when controlling for Type, there is a relationship between SFRatio and GradRate.


(g) Is there a relationship between GradRate and Type after controlling for SFRatio? Fit an appropriate model to answer this question, and fill in the following pieces en route to your ultimate conclusion:

M15=lm(formula=GradRate~Type+SFRatio,data=USNews)
M15
## 
## Call:
## lm(formula = GradRate ~ Type + SFRatio, data = USNews)
## 
## Coefficients:
## (Intercept)   TypePublic      SFRatio  
##      76.882      -12.781       -0.798
summary(M15)
## 
## Call:
## lm(formula = GradRate ~ Type + SFRatio, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.024 -11.629   0.541  12.141  52.100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  76.8817     1.6255  47.298  < 2e-16 ***
## TypePublic  -12.7811     1.1356 -11.255  < 2e-16 ***
## SFRatio      -0.7980     0.1136  -7.026 3.62e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.86 on 1164 degrees of freedom
## Multiple R-squared:  0.2025, Adjusted R-squared:  0.2011 
## F-statistic: 147.8 on 2 and 1164 DF,  p-value: < 2.2e-16

Part 1: State the hypotheses being tested in terms of a model coefficient:

The null hypothesis is that the TypePublic coefficient is equal to 0; the alternative hypothesis is that the TypePublic coefficient is not equal to 0.

Part 2: Re-state the hypotheses being tested in terms of a comparison of two models:

The null hypothesis model is GradRate ~ SFRatio, and the alternative hypothesis model is GradRate ~ Type + SFRatio.

Part 3: State the value of the test statistic for this question:

-11.255

Part 4: State the value of the p-value for this question:

less than 2*10^-16

2 * pnorm(-11.255) < 2e-16
## [1] TRUE


Part 5: State your conclusion:

Since the p-value is so low, the null hypothesis can be rejected, meaning that there is a relationship between Type and GradRate when controlling for SFRatio.


(h) Compare the summary output of the following three models:

Model 1: GradRate ~ VerbalSAT

Model 2: GradRate ~ MathSAT

Model 3: GradRate ~ MathSAT + VerbalSAT

M16=lm(formula=GradRate~VerbalSAT,data=USNews)
M16
## 
## Call:
## lm(formula = GradRate ~ VerbalSAT, data = USNews)
## 
## Coefficients:
## (Intercept)    VerbalSAT  
##    -26.2801       0.1907
summary(M16)
## 
## Call:
## lm(formula = GradRate ~ VerbalSAT, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.632  -9.441  -0.008   9.515  44.196 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -26.280068   4.504936  -5.834 8.26e-09 ***
## VerbalSAT     0.190688   0.009641  19.778  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.88 on 705 degrees of freedom
##   (460 observations deleted due to missingness)
## Multiple R-squared:  0.3569, Adjusted R-squared:  0.3559 
## F-statistic: 391.2 on 1 and 705 DF,  p-value: < 2.2e-16
M17=lm(formula=GradRate~MathSAT,data=USNews)
M17
## 
## Call:
## lm(formula = GradRate ~ MathSAT, data = USNews)
## 
## Coefficients:
## (Intercept)      MathSAT  
##    -18.2299       0.1574
summary(M17)
## 
## Call:
## lm(formula = GradRate ~ MathSAT, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.897 -10.146  -0.544   9.887  49.992 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.229886   4.419489  -4.125 4.15e-05 ***
## MathSAT       0.157401   0.008583  18.339  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.27 on 705 degrees of freedom
##   (460 observations deleted due to missingness)
## Multiple R-squared:  0.323,  Adjusted R-squared:  0.322 
## F-statistic: 336.3 on 1 and 705 DF,  p-value: < 2.2e-16
M18=lm(formula=GradRate~MathSAT+VerbalSAT,data=USNews)
M18
## 
## Call:
## lm(formula = GradRate ~ MathSAT + VerbalSAT, data = USNews)
## 
## Coefficients:
## (Intercept)      MathSAT    VerbalSAT  
##   -26.88887      0.03306      0.15559
summary(M18)
## 
## Call:
## lm(formula = GradRate ~ MathSAT + VerbalSAT, data = USNews)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.308  -9.251  -0.082   9.198  44.823 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -26.88887    4.51785  -5.952 4.18e-09 ***
## MathSAT       0.03306    0.02145   1.541    0.124    
## VerbalSAT     0.15559    0.02472   6.293 5.46e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.87 on 704 degrees of freedom
##   (460 observations deleted due to missingness)
## Multiple R-squared:  0.359,  Adjusted R-squared:  0.3572 
## F-statistic: 197.2 on 2 and 704 DF,  p-value: < 2.2e-16


Part 1: Would you say that MathSAT is a confounding variable in the relationship between GradRate and VerbalSAT? Briefly explain.

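No, not in a meaningful way. Controlling for MathSAT changes the VerbalSAT coefficient only modestly, from 0.1907 in Model 1 to 0.1556 in Model 3, and it remains clearly different from 0 (p-value 5.46e-10), so our conclusion about VerbalSAT does not change.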

Part 2: Would you say that VerbalSAT is a confounding variable in the relationship between GradRate and MathSAT? Briefly explain.

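Yes. Controlling for VerbalSAT shrinks the MathSAT coefficient from 0.1574 in Model 2 to 0.0331 in Model 3, and its p-value goes from less than 2e-16 to 0.124, so the apparent relationship between MathSAT and GradRate largely disappears once VerbalSAT is accounted for.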

Part 3: Try to explain what you can deduce about the relationship between MathSAT and VerbalSAT. Stated differently, why do you think we observed an association between MathSAT and GradRate in Model 2?

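MathSAT and VerbalSAT must be strongly positively associated: schools with high average Math SAT scores also tend to have high average Verbal SAT scores. In Model 2, MathSAT was therefore largely standing in for VerbalSAT, which is why it appeared to be associated with GradRate when VerbalSAT was left out of the model.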

Part 4: Consider the question: Is MathSAT associated with GradRate once we control for VerbalSAT? Which model is the null hypothesis model for this research question, and which model is the alternative hypothesis model?

The null hypothesis model is Model 1 (GradRate ~ VerbalSAT), in which the coefficient for MathSAT is equal to 0. The alternative hypothesis model is Model 3 (GradRate ~ MathSAT + VerbalSAT), in which the coefficient for MathSAT is not equal to 0.


Part 5: What is the test statistic for the hypothesis test stated in Part 4?

1.541
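
The pattern in Models 1 to 3 can be reproduced with a small simulation. This is a sketch on made-up data, not the USNews data: `x2` plays the role of a variable that truly drives the response, and `x1` is related to the response only through its correlation with `x2`. All names and numbers are assumptions chosen for illustration.

```r
set.seed(3)  # arbitrary seed, for reproducibility only

# Simulated stand-in data: x2 drives y, and x1 is associated with y
# only through its correlation with x2
n <- 500
x2 <- rnorm(n)
x1 <- x2 + rnorm(n, sd = 0.5)
y  <- 2 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

alone      <- lm(y ~ x1, data = sim)       # x1 by itself
controlled <- lm(y ~ x1 + x2, data = sim)  # x1 after controlling for x2

c(correlation   = cor(sim$x1, sim$x2),
  x1_alone      = unname(coef(alone)["x1"]),
  x1_controlled = unname(coef(controlled)["x1"]))
```

In this setup the coefficient on `x1` is large when `x1` is used alone and shrinks toward 0 once `x2` is in the model. For the real data, a direct check would be `cor(USNews$MathSAT, USNews$VerbalSAT, use = "complete.obs")`.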



Question 6

CSHA.csv has data on 509 patients from the Canadian Study of Health and Ageing, a study which examined survival times from onset of Alzheimer’s disease. For each patient, the following variables were recorded:

Education: Number of years of education

AAO: Age at onset of Alzheimer’s (in years)

Sex: A patient’s biological sex recorded at birth in binary fashion: male (M) or female (F)

Survival: Time from onset of Alzheimer’s until death (in days)


Read in the data:

CSHA = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/CSHA.csv' )


(a) Is there enough evidence to conclude that, on average, individuals with Alzheimer’s live longer than 2100 days? Carry out a hypothesis test using \(\alpha\) = 0.05. State the hypotheses, the test statistic, the p-value, and your conclusion.


(b) Assuming all other things stay the same, at what value of the observed mean (or sample mean) would your conclusion to (a) change?


(c) Is there enough evidence to conclude that there is truly a difference in mean survival between a male and a female with Alzheimer’s disease? State the hypotheses, the test statistic, the p-value, and your conclusion.


(d) Is there enough evidence to conclude that there is truly a difference in mean survival between a male and a female who have onset of Alzheimer’s disease at the same age? State the hypotheses, the test statistic, the p-value, and your conclusion.


(e) Based on your answers to (c) and (d), would you say that age at onset of Alzheimer’s is a confounder in the relationship between Survival and Sex?


(f) What specifically can we deduce about the age at onset of Alzheimer’s for male and female patients in this data?


(g) Uncomment and run the following two commands in order to produce summary output from a certain model:

# model = lm( Survival ~ Sex - 1 , data=CSHA )

# summary( model )

Briefly explain what the two p-values (from the output produced) tell us, and why they are, or are not, useful.
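
As background for (g), the following sketch shows what removing the intercept with `- 1` does when the only explanatory variable is categorical. It uses simulated stand-in data (not the CSHA data); the group sizes, means, and standard deviation are assumptions chosen for illustration.

```r
set.seed(4)  # arbitrary seed, for reproducibility only

# Simulated stand-in data (not the CSHA data)
sim <- data.frame(
  Sex = rep(c("F", "M"), each = 50),
  Survival = c(rnorm(50, mean = 2200, sd = 600),
               rnorm(50, mean = 1900, sd = 600))
)

# Removing the intercept gives one coefficient per group
no_intercept <- lm(Survival ~ Sex - 1, data = sim)

# The coefficients coincide with the group means
coef(no_intercept)
tapply(sim$Survival, sim$Sex, mean)
```

Each coefficient in such a model is a group mean, so the default p-value in each row of the summary output tests whether that group's true mean equals 0.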



Question 7

Rivers contain small concentrations of mercury which can accumulate in fish over their lifetimes. Since mercury cannot be excreted from the body, it builds up in the tissues of the fish. The concentration of mercury in fish tissue can be obtained at considerable expense by catching fish and sending samples to a lab for analysis. Directly measuring the mercury concentration in the water is challenging since it is often below detectable limits. A study was conducted in the Wacamaw and Lumber rivers (in North Carolina) to investigate mercury levels in tissues of large mouth bass. At several stations along each river, a group of fish were caught, weighed, and measured. In addition, a filet from each fish was sent to the lab so that the tissue concentration of mercury could be determined. Mercury.csv contains the following information for 171 fish:

River: Lumber or Wacamaw

Station: A station number (0, 1,…, 15)

Length: The fish’s length (in centimeters)

Weight: The fish’s weight (in grams)

Concen: The fish’s mercury concentration (in parts per million; ppm)


Read in the data:

Mercury = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/Mercury.csv')


(a) Consider the following two models:

Model 1: Concen ~ Length + River

Model 2: Concen ~ Length + River + Weight

Is Model 1 nested in Model 2?


(b) If you wanted to perform a hypothesis test between Model 1 and Model 2 from (a), which one would be the null hypothesis and which one would be the alternative hypothesis? Briefly explain your reasoning.


(c) Find the test statistic and p-value for the hypothesis test you laid out in (b), and state your conclusion.


(d) The p-value in each row of a summary output corresponds to a certain hypothesis test. Can you describe in your own words what each of these p-values is testing?



Question 8

Suppose someone presents you with the results of the following hypothesis test:

\(H_0\): \(y = b~\) (or \(y\) \(\sim\) 1) \(~~vs.~~\) \(H_a\): \(y = b + mx~\) (or \(y\) \(\sim\) 1 + \(x\))

From this test (using \(\alpha\) = 0.05), they report a p-value of 0.007.

Which of the following sets of 95% confidence intervals for the coefficients in the \(H_a\) model are possible? Circle ALL possible options:

a. \(~\)For \(~b:~\) \([7.3,11.4]\) \(~~~and~~~\) For \(~m:~\) \([1.6,3.4]\) (CIRCLE)

b. \(~\)For \(~b:~\) \([-4.3,9.1]\) \(~~~and~~~\) For \(~m:~\) \([1.6,3.4]\) (CIRCLE)

c. \(~\)For \(~b:~\) \([7.3,11.4]\) \(~~~and~~~\) For \(~m:~\) \([-1.6,3.4]\)

d. \(~\)For \(~b:~\) \([-4.3,9.1]\) \(~~~and~~~\) For \(~m:~\) \([-1.6,3.4]\)

e. \(~\)For \(~b:~\) \([7.3,11.4]\) \(~~~and~~~\) For \(~m:~\) \([0.0002,0.0009]\) (CIRCLE)

Since the p-value (0.007) is below 0.05, the 95% confidence interval for \(m\) cannot contain 0; the interval for \(b\) is not restricted by this test.



Question 9

(a) Describe how confidence intervals and hypothesis tests are equivalent ways of answering questions of association/relationship, i.e., every time we reject the null hypothesis in a test of association/relationship, what can we say about the corresponding confidence interval on the relevant model coefficient? And every time the p-value is greater than 0.05 (i.e., we don’t reject the null hypothesis, at the 5% level, in a test of association/relationship), what can we say about the corresponding confidence interval on the relevant model coefficient?
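
The equivalence described in (a) can be checked numerically. This is a minimal sketch on simulated data (the sample size and coefficients are made up for illustration): for a coefficient in a linear model, the 95% confidence interval from `confint()` excludes 0 exactly when the p-value in the summary output is below 0.05.

```r
set.seed(5)  # arbitrary seed, for reproducibility only

# Simulated stand-in data
n <- 40
x <- rnorm(n)
y <- 1 + 0.3 * x + rnorm(n)
fit <- lm(y ~ x)

# p-value for the slope and its 95% confidence interval
p_value <- coef(summary(fit))["x", "Pr(>|t|)"]
ci <- confint(fit, "x", level = 0.95)

# The 95% interval excludes 0 exactly when the p-value is below 0.05
excludes_zero <- ci[1, 1] > 0 || ci[1, 2] < 0
c(p_below_05 = p_value < 0.05, ci_excludes_zero = excludes_zero)
```

Whatever the seed, the two logical values agree, because the interval and the test are built from the same estimate, standard error, and t distribution.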


(b) Even though confidence intervals and hypothesis tests are equivalent, confidence intervals are seen as more informative. Can you think of why this might be the case?



Question 10

Carefully read through the following xkcd comic on hypothesis testing. Try to articulate in your own words what message this comic is attempting to communicate (we will discuss this issue more in Topic 11).

The comic is trying to show how easily p-values can be misinterpreted. For example, the comic claims that an alternative hypothesis is true with a p-value greater than 0.05, which would actually mean that there is not enough evidence to reject the null hypothesis.