Name:
Dr. Suzanne Rohrback used a novel approach in a series of experiments to examine calcium-binding proteins. The data from one experiment are provided in the Fluorescence dataset in the Stat2Data package. The variable Calcium is the log of the free calcium concentration and ProteinProp is the proportion of protein bound to calcium.
(a) Find the regression line for predicting the proportion of protein bound to calcium from the transformed free calcium concentration.
The fitted regression line is predicted ProteinProp = 2.06586 + 0.17514 * Calcium, using the coefficient estimates from the summary output.
library(Stat2Data)
data("Fluorescence")
head(Fluorescence)  # pass the data frame itself, not its name as a string
fluorescence_model <- lm(ProteinProp ~ Calcium, data = Fluorescence)
summary(fluorescence_model)
Call:
lm(formula = ProteinProp ~ Calcium, data = Fluorescence)
Residuals:
Min 1Q Median 3Q Max
-0.22712 -0.09454 0.00176 0.10410 0.21375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.06586 0.08876 23.27 <2e-16 ***
Calcium 0.17514 0.01107 15.82 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1199 on 49 degrees of freedom
Multiple R-squared: 0.8363, Adjusted R-squared: 0.8329
F-statistic: 250.3 on 1 and 49 DF, p-value: < 2.2e-16
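The coefficients in the summary above can also be pulled out of the fitted model directly, which avoids retyping them. The sketch below rebuilds the same model and then predicts at a Calcium value of -9. That value is hypothetical, chosen only for illustration, and I have not checked that it lies within the observed range.

```r
library(Stat2Data)
data("Fluorescence")
fluorescence_model <- lm(ProteinProp ~ Calcium, data = Fluorescence)

# Named vector holding the intercept and the Calcium slope
b <- coef(fluorescence_model)
b

# Predicted proportion at a hypothetical Calcium value (illustrative only)
new_obs <- data.frame(Calcium = -9)
pred <- predict(fluorescence_model, newdata = new_obs)
pred
```

The prediction is just the intercept plus the slope times the new Calcium value, so it can be checked by hand against the fitted line.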
(b) What is the regression standard error?
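The regression standard error is the residual standard error, 0.1199 on 49 degrees of freedom. (The values 0.01107 and 0.08876 are the standard errors of the slope and intercept estimates, not the regression standard error.)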
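As a check, the regression standard error can be computed by hand as the square root of SSE divided by the residual degrees of freedom (n - 2 for simple linear regression) and compared with what `sigma()` returns. This is a minimal sketch that refits the same model; the names `sse`, `by_hand` and `built_in` are my own.

```r
library(Stat2Data)
data("Fluorescence")
fluorescence_model <- lm(ProteinProp ~ Calcium, data = Fluorescence)

# Regression standard error by hand: sqrt(SSE / (n - 2))
sse <- sum(resid(fluorescence_model)^2)
by_hand <- sqrt(sse / df.residual(fluorescence_model))

# Same quantity extracted directly from the fitted model
built_in <- sigma(fluorescence_model)

c(by_hand = by_hand, built_in = built_in)
```

Both values should match the "Residual standard error" line of the summary output.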
(c) Plot the regression line and all of the points on a scatterplot. Does the regression line appear to provide a good fit?
The regression line does not seem like a good fit; the data appear curved (possibly quadratic) rather than linear.
plot(ProteinProp ~ Calcium, data = Fluorescence)
abline(fluorescence_model)
(d) Analyze the residual plots. Are conditions for the regression model met?
I would say that the conditions for a regression model are not met. Based on the residuals vs. fitted plot, the variance of the residuals does not appear to be constant.
plot(fluorescence_model)
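Calling `plot()` on an `lm` object cycles through several diagnostic plots. To look at one at a time, the `which` argument selects a specific plot. The sketch below refits the same model and draws only the two plots most relevant to the conditions discussed here.

```r
library(Stat2Data)
data("Fluorescence")
fluorescence_model <- lm(ProteinProp ~ Calcium, data = Fluorescence)

# Residuals vs. fitted: look for curvature and changing spread
plot(fluorescence_model, which = 1)

# Normal Q-Q plot of the standardized residuals
plot(fluorescence_model, which = 2)
```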
Researchers were interested in looking for an association between body size and the number of eggs produced by a moth. BodyMass and Eggs are both recorded for 39 moths in the dataset MothEggs in Stat2Data.
(a) Before looking at the data, would you expect the association between body mass and number of eggs to be positive or negative? Explain.
I would expect the association between body mass and the number of eggs to be positive, as a higher body mass might create more space in the mother moth for eggs.
(b) Fit a linear regression model for predicting Eggs from BodyMass. Is the association between the two variables statistically significant? Justify your answer.
Yes, the association is statistically significant at the 0.01 level: the p-value for the BodyMass slope is 0.00492, which is less than 0.01 (marked “**” in the output).
data("MothEggs")
mothmodel <- lm(Eggs ~ BodyMass, data = MothEggs)
summary(mothmodel)
Call:
lm(formula = Eggs ~ BodyMass, data = MothEggs)
Residuals:
Min 1Q Median 3Q Max
-157.586 -17.187 3.162 25.790 67.960
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.38 45.38 0.537 0.59423
BodyMass 79.86 26.69 2.992 0.00492 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 44.75 on 37 degrees of freedom
Multiple R-squared: 0.1948, Adjusted R-squared: 0.173
F-statistic: 8.95 on 1 and 37 DF, p-value: 0.004916
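Rather than reading the p-value off the printed table, it can be extracted from the coefficient matrix and compared to a significance level directly. A minimal sketch, refitting the same model; `coefs` and `p_slope` are names I made up.

```r
library(Stat2Data)
data("MothEggs")
mothmodel <- lm(Eggs ~ BodyMass, data = MothEggs)

# Matrix of estimates, standard errors, t values and p-values
coefs <- coef(summary(mothmodel))

# p-value for the BodyMass slope
p_slope <- coefs["BodyMass", "Pr(>|t|)"]
p_slope

# Is the slope significant at the 0.01 level?
p_slope < 0.01
```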
(c) The conditions for inference are not met, primarily because there is one very unusual observation. Identify this observation and what makes it unusual.
Observation 39 is the unusual one: that moth laid 0 eggs, unlike every other moth in the dataset.
plot(mothmodel)
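Besides reading the diagnostic plots, the unusual observation can be located numerically. The sketch below refits the same model, looks for rows with zero eggs, and finds the observation with the largest standardized residual in absolute value. The names `std_res` and `worst` are mine.

```r
library(Stat2Data)
data("MothEggs")
mothmodel <- lm(Eggs ~ BodyMass, data = MothEggs)

# Row number(s) of any moth that laid zero eggs
which(MothEggs$Eggs == 0)

# Observation with the largest standardized residual in absolute value
std_res <- rstandard(mothmodel)
worst <- which.max(abs(std_res))
MothEggs[worst, ]
```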
(d) Fit the model again after removing this unusual point. Compare the estimated slopes and comment on the difference between the two models.
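The slope of the new model is 79.24, whereas in the old model it was 79.86, so the estimated slope barely changes. The standard error of the slope does fall, from 26.69 to 21.92, so the p-value drops from 0.00492 to 0.000911. The association in the new model therefore has a higher level of statistical significance (0.001) than in the old model (0.01).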
mothmodelno39 <- MothEggs[-39,]
head(mothmodelno39)
NewMothModel <- lm(Eggs ~ BodyMass, data = mothmodelno39)
summary(NewMothModel)
Call:
lm(formula = Eggs ~ BodyMass, data = mothmodelno39)
Residuals:
Min 1Q Median 3Q Max
-115.079 -20.785 -0.846 21.763 63.917
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.56 37.28 0.793 0.433043
BodyMass 79.24 21.92 3.615 0.000911 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 36.75 on 36 degrees of freedom
Multiple R-squared: 0.2664, Adjusted R-squared: 0.246
F-statistic: 13.07 on 1 and 36 DF, p-value: 0.0009108
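To compare the two fits without flipping between summaries, the BodyMass row of each coefficient table can be stacked into one small matrix. This is a rough sketch that refits both models from scratch; `with_39`, `without_39` and `slopes` are names I introduced for the comparison.

```r
library(Stat2Data)
data("MothEggs")

with_39    <- lm(Eggs ~ BodyMass, data = MothEggs)
without_39 <- lm(Eggs ~ BodyMass, data = MothEggs[-39, ])

# Slope row (estimate, SE, t value, p-value) from each fit, stacked
slopes <- rbind(
  with_39    = coef(summary(with_39))["BodyMass", ],
  without_39 = coef(summary(without_39))["BodyMass", ]
)
slopes
```

The "Estimate" column holds the slopes and the "Pr(>|t|)" column holds the p-values, which keeps the two quantities from being confused.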
(e) Do you think we were justified in removing this unusual point from the model? Why or why not?
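Yes. The unusual point was an outlier, and possibly a fluke in our data, since no eggs were laid. Although removing it barely changed the slope, it had a clear effect on the fit: the residual standard error dropped from 44.75 to 36.75 and R-squared rose from 0.1948 to 0.2664.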
(a) In R, sample 100 datapoints from a uniform distribution with min -1 and max 1.
mydata <- runif(100, -1, 1)
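Because `runif()` draws a new random sample on every run, the plots will differ slightly each time. Setting a seed first makes the sample reproducible. The seed value below is arbitrary and hypothetical, not part of the assignment.

```r
set.seed(1)  # hypothetical seed, chosen arbitrarily for reproducibility
mydata <- runif(100, min = -1, max = 1)

# Confirm the sample size and that all values fall between -1 and 1
length(mydata)
range(mydata)
```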
(b) Before generating a normal Q-Q plot, predict what you will see.
Hint: How might the tails of your uniform distribution differ from the tails of a normal distribution?
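The tails of my uniform distribution will differ from those of a normal distribution. The uniform distribution is bounded at -1 and 1 and every value in that range is equally likely, so there is no tendency to cluster around a central value and no thinning tails. I expect the points to follow the line in the middle of the plot but bend away from it at both ends.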
(c) Generate the Q-Q plot for your uniformly sampled data. Comment on where and why it deviates from the Q-Q line.
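The points on the Q-Q plot deviate from the Q-Q line at the extremes. This makes sense: the Q-Q line represents a normal distribution, whose tails extend indefinitely, whereas our data are bounded at -1 and 1 and are spread evenly rather than concentrated around the mean. The most extreme values are therefore less extreme than a normal distribution would predict, so the smallest values should sit above the line and the largest below it.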
qqnorm(mydata)
qqline(mydata)
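One way to see that the bending at the ends comes from the uniform distribution and not from sampling noise is to draw the same plot for a normal sample of the same size. This is only a sketch: the seed and the names `u` and `z` are my own choices.

```r
set.seed(1)  # hypothetical seed, chosen arbitrarily

u <- runif(100, -1, 1)  # bounded sample with no thinning tails
z <- rnorm(100)         # normal reference sample

# Side-by-side Q-Q plots against the normal distribution
op <- par(mfrow = c(1, 2))
qqnorm(u, main = "Uniform sample"); qqline(u)
qqnorm(z, main = "Normal sample");  qqline(z)
par(op)  # restore the previous plotting layout
```

The normal sample should track its line fairly closely throughout, while the uniform sample should flatten out at both ends.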