library(ggplot2)
library(dplyr)
pnorm gives the probability that a normally distributed value falls at or below a given value q (the area to the left of q). qnorm is its inverse: given a probability, it returns the value whose left-tail probability equals that amount.
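As a quick sketch (with made-up numbers, not part of the graded answers), the two functions undo each other:
pnorm(50, mean = 45, sd = 5)    # left-tail probability of 50, roughly 0.841
qnorm(0.841, mean = 45, sd = 5) # returns roughly 50, the value whose left-tail probability is 0.841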
1-pnorm(50,mean=45,sd=5)
## [1] 0.1586553
About 16% of students are unable to finish the exam within 50 minutes.
qnorm(1-0.023,mean=45,sd=5)
## [1] 54.97697
About 55 minutes.
The intercept from lm(y ~ 1) might be labeled \(\hat{\beta}_0\) since it represents an estimate of \(\beta_0\).
pnorm(102,mean=100,sd=10)-pnorm(93,mean=100,sd=10)
## [1] 0.3372961
College = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/College.csv')
Use the confint function to find a 95% confidence interval for the true mean graduation rate.
model=lm(GradRate~1,data=College)
model
##
## Call:
## lm(formula = GradRate ~ 1, data = College)
##
## Coefficients:
## (Intercept)
## 0.5458
confint(model)
## 2.5 % 97.5 %
## (Intercept) 0.5124118 0.5792805
College %>% summarize( "2.5th" = quantile(GradRate,0.025) ,
"97.5th" = quantile(GradRate,0.975) )
## 2.5th 97.5th
## 1 0.1455 0.88
Find a 95% coverage interval for GradRate. What do you notice about this interval in comparison to the one from (a)?
The coverage interval is much wider than the confidence interval. The 95% confidence interval from (a) says we are 95% confident that the true mean GradRate lies between 0.5124 and 0.5793. The 95% coverage interval says that 95% of individual colleges have a GradRate between 0.1455 and 0.88. The difference is that the coverage interval describes where the bulk of the data fall, while the confidence interval describes our uncertainty about a single number, the true mean GradRate; because averages vary much less than individual observations, the confidence interval is far narrower.
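A minimal sketch of the contrast (assuming GradRate has no missing values): the confidence interval for the mean is roughly the sample mean plus or minus two standard errors, so it shrinks as the sample size grows, while the coverage interval is built from quantiles of the data and does not.
College %>%
  summarize( lower = mean(GradRate) - 2 * sd(GradRate) / sqrt(n()) ,
             upper = mean(GradRate) + 2 * sd(GradRate) / sqrt(n()) )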
Fit a model for GradRate that uses Type, and interpret each of its coefficients. Then find the 95% confidence intervals for the two coefficients from this model.
M1=lm(formula=GradRate~Type,data=College)
M1
##
## Call:
## lm(formula = GradRate ~ Type, data = College)
##
## Coefficients:
## (Intercept) TypePublic
## 0.7224 -0.3624
confint(M1)
## 2.5 % 97.5 %
## (Intercept) 0.6923730 0.7524270
## TypePublic -0.4054197 -0.3193803
The intercept coefficient (0.7224) is the mean GradRate for private colleges, the reference level of Type. The TypePublic coefficient (-0.3624) is the difference in mean GradRate between public and private colleges: public colleges graduate about 36 percentage points fewer students on average.
The confidence intervals tell us that we are 95% confident the true mean GradRate for private colleges is between 0.6924 and 0.7524, and 95% confident that the true difference in mean GradRate between public and private colleges is between -0.4054 and -0.3194. The interval for TypePublic is the more interesting one because it addresses whether there is a real difference between the two types of college.
No, 0 is not a plausible value for the TypePublic coefficient, since its 95% confidence interval (-0.4054 to -0.3194) lies entirely below 0. Zero is the crucial value to consider because a coefficient of 0 would mean that Type has no association with GradRate; if 0 were a plausible value, we could not conclude that graduation rates differ between public and private colleges.
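A quick way to see the coefficients as group means (a sketch, assuming the two levels of Type are Private and Public): the private mean should match the intercept, and the public mean should equal the intercept plus the TypePublic coefficient.
College %>%
  group_by(Type) %>%
  summarize( meanGradRate = mean(GradRate) )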
Repeat the previous exercise using AdmisRate in place of Type.
M2=lm(formula=GradRate~AdmisRate,data=College)
M2
##
## Call:
## lm(formula = GradRate ~ AdmisRate, data = College)
##
## Coefficients:
## (Intercept) AdmisRate
## 0.9343 -0.7233
confint(M2)
## 2.5 % 97.5 %
## (Intercept) 0.8628828 1.0056930
## AdmisRate -0.8472908 -0.5993505
The intercept coefficient (0.9343) is the predicted GradRate for a college with an admission rate of 0, and the AdmisRate coefficient (-0.7233) is the slope: higher admission rates are associated with lower graduation rates. The confidence interval for the intercept says we are 95% confident the true intercept is between 0.863 and 1.006, and the interval for AdmisRate says we are 95% confident the true slope is between -0.847 and -0.599. Because the AdmisRate interval does not contain 0, the association between AdmisRate and GradRate is statistically significant. The AdmisRate interval is the more useful one because it tells us whether there is a real relationship between the two variables.
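A scatterplot with the fitted line helps visualize this model (a sketch, not required by the prompt):
ggplot(College, aes(x = AdmisRate, y = GradRate)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)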
Fit a model for GradRate that uses AdmisRate and Type. Are both AdmisRate and Type useful explanatory variables in this model? Answer this question by referring to appropriate confidence intervals.
M3=lm(formula=GradRate~AdmisRate+Type,data=College)
M3
##
## Call:
## lm(formula = GradRate ~ AdmisRate + Type, data = College)
##
## Coefficients:
## (Intercept) AdmisRate TypePublic
## 0.8714 -0.3504 -0.2820
confint(M3)
## 2.5 % 97.5 %
## (Intercept) 0.8157953 0.9270166
## AdmisRate -0.4640264 -0.2368482
## TypePublic -0.3292846 -0.2346388
Yes, both are useful: the 95% confidence interval for AdmisRate (-0.464 to -0.237) and the interval for TypePublic (-0.329 to -0.235) each exclude 0, so each explanatory variable has a statistically significant association with GradRate after accounting for the other.
Fit a model for GradRate that uses AdmisRate and Type, and includes interaction. Is it worthwhile to include interaction in the model? Explain how you arrived at your answer.
M4=lm(formula=GradRate~AdmisRate+Type+AdmisRate:Type,data=College)
M4
##
## Call:
## lm(formula = GradRate ~ AdmisRate + Type + AdmisRate:Type, data = College)
##
## Coefficients:
## (Intercept) AdmisRate TypePublic
## 0.8522 -0.3053 -0.2167
## AdmisRate:TypePublic
## -0.1156
confint(M4)
## 2.5 % 97.5 %
## (Intercept) 0.7844476 0.91995238
## AdmisRate -0.4508224 -0.15971371
## TypePublic -0.3564681 -0.07684841
## AdmisRate:TypePublic -0.3484040 0.11725329
Including interaction is probably not worthwhile. The 95% confidence interval for the AdmisRate:TypePublic coefficient (-0.348 to 0.117) contains 0, so 0 is a plausible value for the interaction term; there is no convincing evidence that the relationship between AdmisRate and GradRate differs between public and private colleges.
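One way to double-check this (a sketch, not required by the prompt) is a nested-model F test comparing the model without interaction to the one with it; a large p-value indicates the interaction term adds little.
anova(M3, M4)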
No. Adding an interaction term introduces a predictor (AdmisRate:TypePublic) that is highly correlated with TypePublic, which inflates the standard error and widens the 95% confidence interval for TypePublic. So even if the TypePublic interval in the interaction model did include 0, that alone would not mean Type is useless; the usefulness of Type is better judged from its confidence interval in a model without the interaction.
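To see that widening directly (a sketch), compare the TypePublic interval with and without the interaction term:
confint(M3)["TypePublic", ]   # model without interaction
confint(M4)["TypePublic", ]   # model with interaction: noticeably wider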
CSHA = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/CSHA.csv')
M4=lm(formula=Survival~1,data=CSHA)
M4
##
## Call:
## lm(formula = Survival ~ 1, data = CSHA)
##
## Coefficients:
## (Intercept)
## 2299
confint(M4)
## 2.5 % 97.5 %
## (Intercept) 2166.646 2430.646
The 95% confidence interval means we are 95% confident that the true mean survival time after an Alzheimer's diagnosis is between 2166.6 and 2430.6.
M5=lm(formula=Survival~Sex,data=CSHA)
M5
##
## Call:
## lm(formula = Survival ~ Sex, data = CSHA)
##
## Coefficients:
## (Intercept) SexM
## 2141.6 231.7
confint(M5)
## 2.5 % 97.5 %
## (Intercept) 1909.4404 2373.8036
## SexM -50.3503 513.6861
No. The 95% confidence interval for the SexM coefficient (-50.4 to 513.7) contains 0, so a difference of zero in mean survival time between males and females is plausible; from this model we cannot conclude that survival differs by sex.
M6=lm(formula=Survival~Sex+AAO,data=CSHA)
M6
##
## Call:
## lm(formula = Survival ~ Sex + AAO, data = CSHA)
##
## Coefficients:
## (Intercept) SexM AAO
## 10326.5 552.1 -103.0
confint(M6)
## 2.5 % 97.5 %
## (Intercept) 9248.5437 11404.39749
## SexM 314.3782 789.85858
## AAO -116.3632 -89.66666
Yes. Once age at onset is held constant, the 95% confidence interval for the SexM coefficient (314.4 to 789.9) lies entirely above 0, so we can conclude that, among people with the same age at onset, males have a longer mean survival time than females.
No; adding AAO does not reverse the sign of any coefficient. What does change is the evidence about SexM: its confidence interval goes from containing 0 to lying entirely above 0. The next model helps explain why: males tend to have a later age of onset, and a later age of onset is associated with shorter survival, so leaving AAO out masks the sex difference.
M6=lm(formula=AAO~Sex,data=CSHA)
M6
##
## Call:
## lm(formula = AAO ~ Sex, data = CSHA)
##
## Coefficients:
## (Intercept) SexM
## 79.453 3.111
confint(M6)
## 2.5 % 97.5 %
## (Intercept) 78.193300 80.712798
## SexM 1.580572 4.640867
We can conclude that there is a difference in age of onset between males and females: the 95% confidence interval for the SexM coefficient (1.58 to 4.64) lies entirely above 0, so males are estimated to have a mean age of onset about 3.1 years later than females, and a difference of 0 is not plausible.
It would be wider, because demanding 99.9% confidence requires a larger margin of error (a larger multiplier on the standard error). The interval is about the true mean, not the individual data values, so it does not need to include all the data; it simply has to be wide enough that we are 99.9% confident it captures the true mean.
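A quick sketch of this: asking the same model for a 99.9% interval returns a wider range around the same estimate.
confint(M4, level = 0.999)   # M4 is the Survival ~ 1 model fit above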
# Read in the data:
Cars = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/Cars.csv')
# Continue any R work you need here:
Mercury = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/Mercury.csv')
Another useful function is summary, which can be applied to any stored model. For example, uncomment and run the following commands to see the sort of output produced by summary:
model = lm( Concen ~ 1 , data=Mercury )
summary(model)
##
## Call:
## lm(formula = Concen ~ 1, data = Mercury)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0817 -0.5917 -0.2617 0.4083 2.4082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.19175 0.05825 20.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7617 on 170 degrees of freedom
There is a lot in the output of summary, but here are 2 important pieces: the Estimate column gives the estimated coefficient(s) of the model, and the Std. Error column gives the standard error associated with each estimated coefficient. Using the summary output, find a 95% confidence interval for the true mean mercury concentration in fish from these two rivers.
Using estimate ± 2 × standard error, the interval is 1.19175 ± 2(0.05825), or roughly (1.07525, 1.30825).
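The same arithmetic in R (a one-line sketch):
1.19175 + c(-2, 2) * 0.05825   # approximate 95% confidence interval for the mean concentration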
Use the summary function, applied to appropriate models, in order to answer the following questions:
M8 = lm( Concen ~ River , data=Mercury )
summary(M8)
##
## Call:
## lm(formula = Concen ~ River, data = Mercury)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1664 -0.5681 -0.1764 0.4219 2.4219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.07808 0.08866 12.160 <2e-16 ***
## RiverWacamaw 0.19835 0.11712 1.694 0.0922 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7575 on 169 degrees of freedom
## Multiple R-squared: 0.01669, Adjusted R-squared: 0.01087
## F-statistic: 2.868 on 1 and 169 DF, p-value: 0.09218
1.07808
0.25428
0.19835
0.3359
Fit a model for Concen with River as the only explanatory variable. Are Wacamaw fish observed to have more mercury in this data set? Is there enough evidence to say that Wacamaw fish truly have more mercury? Justify your answer.
M9 = lm( Concen ~ River , data=Mercury )
summary(M9)
##
## Call:
## lm(formula = Concen ~ River, data = Mercury)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1664 -0.5681 -0.1764 0.4219 2.4219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.07808 0.08866 12.160 <2e-16 ***
## RiverWacamaw 0.19835 0.11712 1.694 0.0922 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7575 on 169 degrees of freedom
## Multiple R-squared: 0.01669, Adjusted R-squared: 0.01087
## F-statistic: 2.868 on 1 and 169 DF, p-value: 0.09218
confint(M9)
## 2.5 % 97.5 %
## (Intercept) 0.90305825 1.2531061
## RiverWacamaw -0.03285077 0.4295435
In this data set, Wacamaw fish are observed to have more mercury on average (the estimated difference is 0.198). However, the 95% confidence interval for the RiverWacamaw coefficient (-0.033 to 0.430) contains 0, so a true difference of zero is plausible and there is not enough evidence to say that Wacamaw fish truly have more mercury.
Would you say that the coefficient for Weight is small?
M10 = lm( Concen ~ Weight , data=Mercury )
M10
##
## Call:
## lm(formula = Concen ~ Weight, data = Mercury)
##
## Coefficients:
## (Intercept) Weight
## 0.6386813 0.0004818
Yes, the estimated coefficient for Weight (about 0.00048) is quite small.
Is the confidence interval for the coefficient of Weight entirely above 0? Why is this an important question to consider? What does the answer to this question tell us about the relationship between Concen and Weight?
M10 = lm( Concen ~ Weight , data=Mercury )
M10
##
## Call:
## lm(formula = Concen ~ Weight, data = Mercury)
##
## Coefficients:
## (Intercept) Weight
## 0.6386813 0.0004818
confint(M10)
## 2.5 % 97.5 %
## (Intercept) 0.4800552234 0.7973072814
## Weight 0.0003718145 0.0005918011
Yes, the 95% confidence interval for the Weight coefficient (0.00037 to 0.00059) lies entirely above 0. This matters because it rules out a slope of 0: even though the per-unit effect of Weight is small, there is a statistically significant positive relationship between Weight and Concen, with heavier fish tending to have higher mercury concentrations.
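A scatterplot with the fitted line makes this easier to see (a sketch, not part of the required output):
ggplot(Mercury, aes(x = Weight, y = Concen)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)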
DIETSTUDY = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/DIETSTUDY.csv')
# DIETSTUDY %>% summarize( "2.5th" = quantile(WeightLoss,0.025) ,
# "97.5th" = quantile(WeightLoss,0.975) )
Fit a model for WeightLoss that uses Diet, and obtain confidence intervals for each coefficient in this model. Briefly explain what each of the confidence intervals from this model tells us.
# model = lm( WeightLoss ~ Diet - 1 , data = DIETSTUDY )
# confint( model )