Load some potentially useful packages:

library(ggplot2)

library(dplyr)


Question 1

Consider the random variable \(X\) = time required for a student to complete an exam. Suppose \(X\) is approximately Normal with a mean of 45 minutes and a standard deviation of 5 minutes.

Carefully explain the difference between pnorm and qnorm. Be as specific as possible.

pnorm takes a value q and returns the probability that the random variable is less than or equal to q (the area under the Normal curve to the left of q). qnorm is its inverse: it takes a probability p and returns the value q that has probability p to its left.
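For example, using the exam-time distribution from this question, the two functions undo one another (approximate values shown as comments):

pnorm(50, mean=45, sd=5)            # probability of finishing within 50 minutes, about 0.841
qnorm(0.8413447, mean=45, sd=5)     # the time with probability 0.8413447 to its left, about 50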


(c) If 50 minutes are allowed for the exam, what proportion of students would be unable to finish it?

1-pnorm(50,mean=45,sd=5)
## [1] 0.1586553

About 16% of students would be unable to finish the exam within 50 minutes.


(d) How much time should be allowed for the exam if we wanted to limit the percentage of students who are unable to complete the exam to only 2.3%?

qnorm(1-0.023,mean=45,sd=5)
## [1] 54.97697

About 55 minutes should be allowed.



Question 2

Suppose we have some data on a response variable (\(y\)) which we might write as follows: \(y_1\), \(y_2\), … \(y_n\). This data comes from some broader population of \(y\) values. Suppose we label the true mean of this population as \(\beta_0\). The coefficient from the constant only model, lm(y ~ 1), might be labeled \(\hat{\beta_0}\) since it represents an estimate of \(\beta_0\).

NOTE: Sometimes, instead of \(\beta_0\), people use \(\mu\) to represent the true mean of a population. And, instead of \(\hat{\beta_0}\), people often use \(\overline{y}\) to represent the sample (or data) mean. In this case, \(\hat{\mu} = \overline{y}\).
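As a quick illustration (with a small made-up response vector), the coefficient from lm(y ~ 1) is exactly the sample mean:

y = c(4, 7, 5, 9, 6)     # made-up data, for illustration only
coef( lm(y ~ 1) )        # (Intercept) = 6.2
mean(y)                  # also 6.2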

(a) We wouldn’t typically know this, but suppose you are told that \(\beta_0=100\). Moreover, suppose we estimate that \(sd(\hat{\beta_0}) = 10\). What is the chance that \(\hat{\beta_0}\) is between 93 and 102?

pnorm(102,mean=100,sd=10)-pnorm(93,mean=100,sd=10)
## [1] 0.3372961


(b) Now, suppose we don’t know the value of \(\beta_0\) (much more realistic!), but we can still estimate that \(sd(\hat{\beta_0}) = 10\). Can we find:

\[P( 93 < \hat{\beta_0} < 102 )\]

If so, find it, otherwise briefly explain why not.


(c) Again, suppose we don’t know the value of \(\beta_0\). Can we find:

\[ P\Bigg( ~~ -1 < ~\Bigg(\frac{\hat{\beta_0}-\beta_0}{sd(\hat{\beta_0})}\Bigg) ~ < 1 ~~\Bigg) \]

If so, find it, otherwise briefly explain why not.


(d) Explain in words what the question in (c) is asking us to find.



Question 3

Here is a 95% confidence interval: 12.3 \(\pm\) 9.8.

(a) (Multiple Choice Question) What is 12.3?

A. margin of error

B. point estimate

C. standard error (i.e. the standard deviation of our estimate)

D. confidence level

E. confidence interval


(b) (Multiple Choice Question) What is 9.8?

A. margin of error

B. point estimate

C. standard error (i.e. the standard deviation of our estimate)

D. confidence level

E. confidence interval


(c) (Multiple Choice Question) What is 95%?

A. margin of error

B. point estimate

C. standard error (i.e. the standard deviation of our estimate)

D. confidence level

E. confidence interval


(d) Approximately, what is the standard error (i.e. the standard deviation of our estimate)?



Question 4

Consider the College.csv data that we have used previously. This dataset contains information on the 4-year graduation rate (GradRate), the admission rate (AdmisRate), and the type of school (Type: Private or Public) for 195 colleges.

Read in the data:

College = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/College.csv')


(a) Fit the constant only model for GradRate and use the confint function to find a 95% confidence interval for the true mean graduation rate.

model=lm(GradRate~1,data=College)
model
## 
## Call:
## lm(formula = GradRate ~ 1, data = College)
## 
## Coefficients:
## (Intercept)  
##      0.5458
confint(model)
##                 2.5 %    97.5 %
## (Intercept) 0.5124118 0.5792805


(b) Uncomment and run the following command:

 College %>% summarize( "2.5th" = quantile(GradRate,0.025) , 
                        "97.5th" = quantile(GradRate,0.975) )
##    2.5th 97.5th
## 1 0.1455   0.88

This yields a 95% coverage interval for GradRate. What do you notice about this interval in comparison to the one from (a)?

The coverage interval is much wider than the confidence interval from (a).
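A quick check of the two widths, using the endpoints reported above:

0.88 - 0.1455                 # coverage interval width, about 0.73
0.5792805 - 0.5124118         # confidence interval width, about 0.067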


(c) Write down an interpretation for the interval from (a), and an interpretation for the interval from (b). What is the difference in the interpretations of these two intervals?

The 95% confidence interval from (a) tells us that we are 95% confident that the true mean graduation rate lies between 0.5124 and 0.5793. The 95% coverage interval from (b) tells us that about 95% of the colleges in the data have graduation rates between 0.1455 and 0.88. The difference is that the coverage interval describes where individual data values fall, while the confidence interval describes where the single unknown quantity of interest (the true mean graduation rate) plausibly lies.


(d) Make a model for GradRate that uses Type, and interpret each of its coefficients. Then find the 95% confidence intervals for the two coefficients from this model.

M1=lm(formula=GradRate~Type,data=College)
M1
## 
## Call:
## lm(formula = GradRate ~ Type, data = College)
## 
## Coefficients:
## (Intercept)   TypePublic  
##      0.7224      -0.3624
confint(M1)
##                  2.5 %     97.5 %
## (Intercept)  0.6923730  0.7524270
## TypePublic  -0.4054197 -0.3193803

The intercept (0.7224) is the estimated mean graduation rate for private schools, the reference level of Type. The TypePublic coefficient (-0.3624) is the estimated difference in mean graduation rate between public and private schools: public schools have, on average, a 4-year graduation rate about 36 percentage points lower.
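One way to check this interpretation (a sketch, using the College data and the dplyr package loaded above) is to compute the group means directly; the Private mean should match the intercept, and the Public mean should match intercept + TypePublic:

College %>%
  group_by(Type) %>%
  summarize( mean_grad = mean(GradRate) )
# Private should be about 0.722 and Public about 0.722 - 0.362 = 0.360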

(e) What do each of these two confidence intervals tell us? Of the two intervals, which one do you think is more important? Why?

The intercept interval tells us that we are 95% confident that the true mean graduation rate for private schools is between 0.6924 and 0.7524. The TypePublic interval tells us that we are 95% confident that the true difference in mean graduation rate between public and private schools is between -0.4054 and -0.3194. The TypePublic interval is more important because it addresses the question we actually care about: whether, and by how much, graduation rates truly differ between the two types of school.


(f) A confidence interval provides a range of plausible values for a coefficient. Is 0 a plausible value for the TypePublic coefficient? Why is 0 such a crucial value to consider?

No, 0 is not a plausible value for the TypePublic coefficient, because its entire 95% confidence interval (-0.4054, -0.3194) lies below 0. Zero is the crucial value to consider because a coefficient of 0 would mean the explanatory variable has no association with the response; if 0 were a plausible value, we could not rule out the possibility that Type is unrelated to GradRate.


(g) Repeat (d)-(f) using AdmisRate in place of Type.

M2=lm(formula=GradRate~AdmisRate,data=College)
M2
## 
## Call:
## lm(formula = GradRate ~ AdmisRate, data = College)
## 
## Coefficients:
## (Intercept)    AdmisRate  
##      0.9343      -0.7233
confint(M2)
##                  2.5 %     97.5 %
## (Intercept)  0.8628828  1.0056930
## AdmisRate   -0.8472908 -0.5993505

The intercept (0.9343) is the predicted graduation rate for a hypothetical college with an admission rate of 0, and the AdmisRate coefficient (-0.7233) is the slope: the predicted change in graduation rate for a one-unit increase in admission rate. The intercept interval tells us that we are 95% confident the true intercept is between 0.863 and 1.006, and the AdmisRate interval tells us that we are 95% confident the true slope is between -0.847 and -0.599. Because the AdmisRate interval does not include 0, there is evidence of a real, negative association between admission rate and graduation rate. The AdmisRate interval is the more important of the two, since it tells us whether the relationship between the two variables is real.
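As an illustration of what the intercept and slope say together, here is a sketch of the predicted graduation rate at a 50% admission rate, using the stored model M2:

predict( M2 , newdata = data.frame(AdmisRate = 0.5) )
# about 0.9343 - 0.7233 * 0.5 = 0.573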


(h) Now fit a model for GradRate that uses AdmisRate and Type. Are both AdmisRate and Type useful explanatory variables in this model? Answer this question by referring to appropriate confidence intervals.

M3=lm(formula=GradRate~AdmisRate+Type,data=College)
M3
## 
## Call:
## lm(formula = GradRate ~ AdmisRate + Type, data = College)
## 
## Coefficients:
## (Intercept)    AdmisRate   TypePublic  
##      0.8714      -0.3504      -0.2820
confint(M3)
##                  2.5 %     97.5 %
## (Intercept)  0.8157953  0.9270166
## AdmisRate   -0.4640264 -0.2368482
## TypePublic  -0.3292846 -0.2346388

Yes, both are useful: the 95% confidence intervals for the AdmisRate coefficient (-0.464, -0.237) and the TypePublic coefficient (-0.329, -0.235) both exclude 0, so each variable is associated with GradRate even after accounting for the other.

(i) Now fit a model for GradRate that uses AdmisRate and Type, and includes interaction. Is it worthwhile to include interaction in the model? Explain how you arrived at your answer.

M4=lm(formula=GradRate~AdmisRate+Type+AdmisRate:Type,data=College)
M4
## 
## Call:
## lm(formula = GradRate ~ AdmisRate + Type + AdmisRate:Type, data = College)
## 
## Coefficients:
##          (Intercept)             AdmisRate            TypePublic  
##               0.8522               -0.3053               -0.2167  
## AdmisRate:TypePublic  
##              -0.1156
confint(M4)
##                           2.5 %      97.5 %
## (Intercept)           0.7844476  0.91995238
## AdmisRate            -0.4508224 -0.15971371
## TypePublic           -0.3564681 -0.07684841
## AdmisRate:TypePublic -0.3484040  0.11725329

Including the interaction does not appear to be worthwhile. The 95% confidence interval for the AdmisRate:TypePublic coefficient (-0.348, 0.117) includes 0, so we do not have evidence that the relationship between AdmisRate and GradRate differs between public and private schools; the simpler model without interaction seems adequate.
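For reference, the interaction coefficient is the difference between the two fitted slopes (worked out here from the estimates above):

-0.3053                  # estimated slope of GradRate on AdmisRate for private schools
-0.3053 + (-0.1156)      # estimated slope for public schools, about -0.421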


(j) In an interaction model, the most important confidence interval is the one on the interaction coefficient. Often, people (even experienced statisticians) read too much into the intervals on the other coefficients. Consider the interval on the TypePublic coefficient in the interaction model. As it turns out, it happens to not include 0, but suppose it did include 0. Would this indicate a lack of usefulness of the Type variable? Briefly explain.

No. Adding an interaction term tends to widen the confidence intervals on the other coefficients, and the TypePublic coefficient also changes meaning in the interaction model: it is the estimated public-private difference only at AdmisRate = 0. So an interval on TypePublic that included 0 in the interaction model would not, by itself, indicate that Type is not useful. To judge the usefulness of Type, we should look at its confidence interval in the model without interaction (and at the interval on the interaction coefficient itself).
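A quick check of this widening, using the models already stored above:

diff( confint(M3)["TypePublic", ] )    # interval width without interaction, about 0.09
diff( confint(M4)["TypePublic", ] )    # interval width with interaction, about 0.28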

Question 5

CSHA.csv has data on 509 patients from the Canadian Study of Health and Ageing, a study which examined survival times from onset of Alzheimer’s disease. For each patient, the following variables were recorded:

Education: Number of years of education

AAO: Age at onset of Alzheimer’s (in years)

Sex: Patient’s sex recorded at birth in binary fashion: male (M) or female (F)

Survival: Time from onset of Alzheimer’s until death (in days)


Read in the data:

CSHA = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/CSHA.csv')


(a) Find, and interpret, a 95% confidence interval for the true mean survival time with Alzheimer’s disease.

M4=lm(formula=Survival~1,data=CSHA)
M4
## 
## Call:
## lm(formula = Survival ~ 1, data = CSHA)
## 
## Coefficients:
## (Intercept)  
##        2299
confint(M4)
##                2.5 %   97.5 %
## (Intercept) 2166.646 2430.646

We are 95% confident that the true mean survival time with Alzheimer's disease is between about 2167 and 2431 days.


(b) Is there enough evidence to conclude that there is truly a difference in mean survival between a male patient and a female patient with Alzheimer’s disease? Answer this question by reporting an appropriate confidence interval and explaining how your conclusion follows from it.

M5=lm(formula=Survival~Sex,data=CSHA)
M5
## 
## Call:
## lm(formula = Survival ~ Sex, data = CSHA)
## 
## Coefficients:
## (Intercept)         SexM  
##      2141.6        231.7
confint(M5)
##                 2.5 %    97.5 %
## (Intercept) 1909.4404 2373.8036
## SexM         -50.3503  513.6861

No. The SexM coefficient estimates the difference in mean survival between male and female patients, and its 95% confidence interval (-50.4, 513.7) includes 0. Since a difference of 0 is plausible, there is not enough evidence to conclude that mean survival truly differs between male and female patients.


(c) Is there enough evidence to conclude that there is truly a difference in mean survival between a male patient and a female patient who have onset of Alzheimer’s disease at the same age? Answer this question by reporting an appropriate confidence interval and explaining how your conclusion follows from it.

M6=lm(formula=Survival~Sex+AAO,data=CSHA)
M6
## 
## Call:
## lm(formula = Survival ~ Sex + AAO, data = CSHA)
## 
## Coefficients:
## (Intercept)         SexM          AAO  
##     10326.5        552.1       -103.0
confint(M6)
##                 2.5 %      97.5 %
## (Intercept) 9248.5437 11404.39749
## SexM         314.3782   789.85858
## AAO         -116.3632   -89.66666

Yes. In this model the SexM coefficient estimates the difference in mean survival between male and female patients who have onset at the same age, and its 95% confidence interval (314.4, 789.9) lies entirely above 0. Since a difference of 0 is not plausible, there is enough evidence to conclude that, at a fixed age of onset, males truly survive longer on average.


(d) Based on your answers to (b) and (c), would you say that age at onset of Alzheimer’s is a confounding variable in the relationship between survival and sex? Briefly explain.

Yes, age at onset does appear to be a confounding variable. Adding AAO changes the SexM coefficient substantially (from about 232 to about 552 days) and changes our conclusion: without AAO the confidence interval for the sex difference includes 0, but after adjusting for AAO it does not. This is consistent with age at onset being related to both sex and survival.


(e) What can we deduce about the age at onset of Alzheimer’s for male patients in comparison to female patients? Briefly explain.

M6=lm(formula=AAO~Sex,data=CSHA)
M6
## 
## Call:
## lm(formula = AAO ~ Sex, data = CSHA)
## 
## Coefficients:
## (Intercept)         SexM  
##      79.453        3.111
confint(M6)
##                 2.5 %    97.5 %
## (Intercept) 78.193300 80.712798
## SexM         1.580572  4.640867

The SexM coefficient here estimates the difference in mean age at onset between male and female patients. Its 95% confidence interval (1.58, 4.64) lies entirely above 0, so we can deduce that male patients truly have onset of Alzheimer's later than female patients, by roughly 1.6 to 4.6 years on average.



Question 6

(a) Would a 99% confidence interval be wider (i.e., longer) or narrower (i.e., shorter) than a 95% confidence interval?

It would be wider, because a higher confidence level requires a larger multiplier on the standard error.


(b) Why do you think a value like 95% is chosen as the standard “level” for a confidence interval instead of, say, 99.9%?

Because a 99.9% confidence interval would usually be so wide that it would not be very informative. A level like 95% is a compromise: it gives high confidence while keeping the interval narrow enough to be useful.
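The Normal multipliers behind these confidence levels illustrate both answers: higher confidence requires a larger multiplier, and hence a wider interval.

qnorm(0.975)     # 95% level, multiplier about 1.96
qnorm(0.995)     # 99% level, multiplier about 2.58
qnorm(0.9995)    # 99.9% level, multiplier about 3.29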



Question 7

Consider the highway miles per gallon data for 147 cars contained in Cars.csv. Assume that these cars represent a random sample of all new cars produced in 2003. The corporate average fuel economy (CAFE) standards required that average fuel economy be more than 27.5 miles per gallon. Is there enough evidence from this data to say that the CAFE standard is being met? Briefly explain.

# Read in the data:
Cars = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/Cars.csv')


# Continue any R work you need here:
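# One possible approach (a sketch only; the highway mpg column is assumed here to
# be called HwyMPG, so check names(Cars) and adjust before running):

# names(Cars)                                   # check the actual column name first
# mpg_model = lm( HwyMPG ~ 1 , data = Cars )    # constant-only model for highway mpg
# confint( mpg_model )
# If the entire 95% interval lies above 27.5, the data support that the CAFE
# standard is being met; if the interval contains 27.5, the evidence is inconclusive.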



Question 8

Rivers contain small concentrations of mercury which can accumulate in fish over their lifetimes. Since mercury cannot be excreted from the body, it builds up in the tissues of the fish. The concentration of mercury in fish tissue can be obtained at considerable expense by catching fish and sending samples to a lab for analysis. Directly measuring the mercury concentration in the water is challenging since it is often below detectable limits. A study was conducted in the Wacamaw and Lumber rivers (in North Carolina) to investigate mercury levels in tissues of large mouth bass. At several stations along each river, a group of fish were caught, weighed, and measured. In addition, a filet from each fish was sent to the lab so that the tissue concentration of mercury could be determined. Mercury.csv contains the following information for 171 fish:

River: Lumber or Wacamaw

Station: A station number (0, 1,…, 15)

Length: The fish’s length (in centimeters)

Weight: The fish’s weight (in grams)

Concen: The fish’s mercury concentration (in parts per million; ppm)


Read in the data:

Mercury = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/Mercury.csv')


(a) A VERY useful command that we will work with more soon is summary, which can be applied to any stored model. For example, uncomment and run the following commands to see the sort of output produced by summary:

 model = lm( Concen ~ 1 , data=Mercury )

 summary(model)
## 
## Call:
## lm(formula = Concen ~ 1, data = Mercury)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0817 -0.5917 -0.2617  0.4083  2.4082 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.19175    0.05825   20.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7617 on 170 degrees of freedom

For now, we won’t discuss all of the output from summary, but here are 2 important pieces:

The Estimate column gives the estimated coefficient(s) of the model.

The Std. Error column gives the standard deviation associated with each estimated coefficient.

Using only the summary output, find a 95% confidence interval for the true mean mercury concentration in fish from these two rivers.

Using estimate ± 2 standard errors: 1.19175 ± 2(0.05825), i.e., approximately (1.07525, 1.30825).
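One way to compute this directly from the stored model, using the estimate ± 2 standard errors rule:

est = coef(summary(model))["(Intercept)", "Estimate"]
se  = coef(summary(model))["(Intercept)", "Std. Error"]
est + c(-2, 2) * se      # approximately (1.075, 1.308)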


(b)

Use the summary function, applied to appropriate models in order to answer the following questions:

Part 1: What is the point estimate for the true mean concentration of fish from the Lumber river?

 M8 = lm( Concen ~ River , data=Mercury )

 summary(M8)
## 
## Call:
## lm(formula = Concen ~ River, data = Mercury)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1664 -0.5681 -0.1764  0.4219  2.4219 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.07808    0.08866  12.160   <2e-16 ***
## RiverWacamaw  0.19835    0.11712   1.694   0.0922 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7575 on 169 degrees of freedom
## Multiple R-squared:  0.01669,    Adjusted R-squared:  0.01087 
## F-statistic: 2.868 on 1 and 169 DF,  p-value: 0.09218

1.07808


Part 2: What is the margin of error associated with the point estimate in Part 1?

Approximately 2 × 0.08866 ≈ 0.18.

Part 3: What is the point estimate for the difference in the true mean concentrations of fish from the Lumber and Wacamaw rivers?

0.19835


Part 4: What is the margin of error associated with the point estimate in Part 3?

Approximately 2 × 0.11712 ≈ 0.23.
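Both margins of error can be computed from the stored model; a sketch using 2 as the approximate multiplier (qt(0.975, df=169) would give the exact one):

2 * coef(summary(M8))[ , "Std. Error"]
# (Intercept) about 0.18 ; RiverWacamaw about 0.23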



(c) Consider the model for mercury concentrations that uses River as the only explanatory variable. Are Wacamaw fish observed to have more mercury in this data set? Is there enough evidence to say that Wacamaw fish truly have more mercury? Justify your answer.

 M9 = lm( Concen ~ River , data=Mercury )

 summary(M9)
## 
## Call:
## lm(formula = Concen ~ River, data = Mercury)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1664 -0.5681 -0.1764  0.4219  2.4219 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.07808    0.08866  12.160   <2e-16 ***
## RiverWacamaw  0.19835    0.11712   1.694   0.0922 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7575 on 169 degrees of freedom
## Multiple R-squared:  0.01669,    Adjusted R-squared:  0.01087 
## F-statistic: 2.868 on 1 and 169 DF,  p-value: 0.09218
 confint(M9)
##                    2.5 %    97.5 %
## (Intercept)   0.90305825 1.2531061
## RiverWacamaw -0.03285077 0.4295435

Wacamaw fish are observed to have more mercury in this data set: the RiverWacamaw estimate says their mean concentration is about 0.198 ppm higher than that of Lumber fish. However, the 95% confidence interval for this difference (-0.033, 0.430) includes 0, so a true difference of 0 is still plausible and there is not enough evidence to say that Wacamaw fish truly have more mercury.



(d) Fit a model for mercury concentrations that uses the weight of the fish as an explanatory variable. Report the coefficients for this model. Would you say that the coefficient on Weight is small?

M10 = lm( Concen ~ Weight , data=Mercury )
M10
## 
## Call:
## lm(formula = Concen ~ Weight, data = Mercury)
## 
## Coefficients:
## (Intercept)       Weight  
##   0.6386813    0.0004818

Numerically, the coefficient on Weight (about 0.00048) looks very small, but that is largely because Weight is measured in grams, so the coefficient is the predicted change in concentration per single gram. Whether it is practically small depends on the predicted change over a realistic range of fish weights.
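A rough way to judge this, using the stored model and the observed weights (recall that Weight is recorded in grams):

coef(M10)["Weight"] * diff( range(Mercury$Weight) )
# predicted change in Concen (in ppm) from the lightest to the heaviest fish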



(e) Find confidence interval estimates for each coefficient in the model from (d). Is the confidence interval for the coefficient on Weight entirely above 0? Why is this an important question to consider? What does the answer to this question tell us about the relationship between Concen and Weight?

M10 = lm( Concen ~ Weight , data=Mercury )
M10
## 
## Call:
## lm(formula = Concen ~ Weight, data = Mercury)
## 
## Coefficients:
## (Intercept)       Weight  
##   0.6386813    0.0004818
confint(M10)
##                    2.5 %       97.5 %
## (Intercept) 0.4800552234 0.7973072814
## Weight      0.0003718145 0.0005918011

Yes, the 95% confidence interval for the Weight coefficient (0.00037, 0.00059) lies entirely above 0. This is important because it means a coefficient of 0 (no association) is not plausible: we have evidence of a real, positive relationship between Weight and Concen, with heavier fish tending to have higher mercury concentrations.



Question 9 (a slight challenge question)

Suppose that the height (in inches) of a randomly selected 25-year-old is a Normal random variable with standard deviation 2.5 inches. Further, suppose that we don’t know the true mean of this distribution but we do know that approximately 12.5% of 25-year-olds are taller than 6 feet. Using this information, find the true mean height of the population of 25-year-olds.



Question 10

A dietitian developed a low-fat diet. Two groups of 100 people are selected and one group is placed on the low-fat diet. The other group is placed on a diet with roughly the same quantity of food but which is not as low in fats. For each person, the amount of weight lost over 3 weeks is recorded in DIETSTUDY.csv. The variables are:

Diet: LOWFAT or REGULAR

WeightLoss: Amount of weight lost (in pounds)


Read in the data:

DIETSTUDY = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/DIETSTUDY.csv')


(a) Uncomment and run the following R command:

# DIETSTUDY %>% summarize( "2.5th" = quantile(WeightLoss,0.025) ,
#                          "97.5th" = quantile(WeightLoss,0.975) )

Briefly interpret the output from this command.


(b) Fit a model for WeightLoss that uses Diet, and obtain confidence intervals for each coefficient in this model. Briefly explain what each of the confidence intervals from this model tells us.



(c) Based on the output from (b), do we have enough evidence to say that there is truly a difference in the average amount of weight lost using the two diets? Briefly explain.



(d) Uncomment and run the following commands:

# model = lm( WeightLoss ~ Diet - 1  , data = DIETSTUDY )

# confint( model )

Explain how you could have answered the question in (c) using this output.