Plot the average number of flowers produced per plant against the light intensity, distinguishing the two timings. Comment.
data(flowers)
ggplot(data=flowers, aes(Light, Flowers, colour = Timing)) +
geom_point()
As light intensity increases, flower production decreases. It also seems that PFI produces fewer flowers than “before” (24 days before PFI).
Suppose a model with the systematic component Flowers ∼ Light + Timing was needed to model the data. What would such a systematic component imply about the relationship between the variables?
This would imply that we are interested in modeling the mean number of flowers produced based on the light intensity (Light) and when the light treatment was applied (Timing). In other words, our dependent variable is the mean number of flowers produced, where the independent variables (predictors) are Light and Timing.
Suppose a model with the systematic component Flowers ∼ Light + Timing + Light:Timing was needed to model the data. What would such a systematic component imply about the relationship between the variables?
This model would suggest we are still modeling the mean number of flowers produced, where our predictos are Light and Timing, but we are also including the interaction term Light*Timing. An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable. In this case are including the interaction term to check if there is a significant interaction between Light and Timing.
Fit the two linear regression models with the systematic components specified above. Give the regression tables and comment.
flower_reg <- lm(Flowers ~ Light + Timing, data=flowers)
summary(flower_reg)
##
## Call:
## lm(formula = Flowers ~ Light + Timing, data = flowers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.040 -3.930 -1.819 5.587 11.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.399167 3.289525 25.353 < 2e-16 ***
## Light -0.040490 0.005157 -7.851 1.11e-07 ***
## TimingPFI -12.091667 2.642210 -4.576 0.000164 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.472 on 21 degrees of freedom
## Multiple R-squared: 0.7973, Adjusted R-squared: 0.778
## F-statistic: 41.29 on 2 and 21 DF, p-value: 5.279e-08
Our regression model with coefficients can be seen below:
\({\hat{Flowers} = 83.3992 -0.0405*Light -12.0917*Timing}\)
The model suggest per one unit of increase in light intensity there is a -0.0405 decrease in the mean number of flowers produced. It also implies that when Timing is equal to 1 (or PMI) there is -12.0917 decrease in the mean number of flowers produced. When all variables are equal to zero, we expect the mean number of flowers produced to be 83.3992.
Hypotheses
\(H_0: \ \beta_{Light} = 0\) \(H_1:\) \(\beta_{Light}\) \(\ne 0\)
Test Statistic
\(t_0 = -7.851\)
p-value
\(p < 0.001\)
Rejection Region
Reject \(H_0\) if \(p < \alpha\); \(\alpha = 0.05\)
Conclusion and Interpretation
Reject \(H_0\), we see that there is sufficient evidence to suggest that Light is a significant predictor of mean number of flowers produced.
Hypotheses
\(H_0: \ \beta_{Timing} = 0\) \(H_1:\) \(\beta_{Timing}\) \(\ne 0\)
Test Statistic
\(t_0 = -4.576\)
p-value
\(p < 0.001\)
Rejection Region
Reject \(H_0\) if \(p < \alpha\); \(\alpha = 0.05\)
Conclusion and Interpretation
Reject \(H_0\), we see that there is sufficient evidence to suggest that Timing is a significant predictor of mean number of flowers produced.
int_flower_reg <- lm(Flowers ~ Light + Timing + Light:Timing, data=flowers)
summary(int_flower_reg)
##
## Call:
## lm(formula = Flowers ~ Light + Timing + Light:Timing, data = flowers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.919 -3.991 -1.698 5.446 11.664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.116667 4.364515 19.044 2.75e-14 ***
## Light -0.039952 0.007471 -5.347 3.11e-05 ***
## TimingPFI -11.526667 6.172356 -1.867 0.0766 .
## Light:TimingPFI -0.001076 0.010566 -0.102 0.9199
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.63 on 20 degrees of freedom
## Multiple R-squared: 0.7974, Adjusted R-squared: 0.767
## F-statistic: 26.24 on 3 and 20 DF, p-value: 3.898e-07
The regression equation for Model 2 (w/interaction term) is as follows:
\({\hat{Flowers} = 83.1167 -0.0399*Light -11.5267*Timing - 0.0011*Light:Timing}\)
Hypotheses
\(H_0: \ \beta_{Light:Timing} = 0\) \(H_1:\) \(\beta_{Light:Timing}\) \(\ne 0\)
Test Statistic
\(t_0 = -0.102\)
p-value
\(p = 0.9199\)
Rejection Region
Reject \(H_0\) if \(p < \alpha\); \(\alpha = 0.05\)
Conclusion and Interpretation
Fail to Reject \(H_0\), we see that there is not sufficient evidence to suggest that the interaction term is significant, therefore the slopes are the same.
Perform a diagnostic analysis on the fitted linear regression models. Present the residuals plots and give a comment.
resid <- resid(flower_reg)
plot(fitted(flower_reg), resid)
abline(0,0)
qqnorm(resid)
qqline(resid)
plot(density(resid))
Based on the plots above, we can safely assume variances are equal based on the plot of the residuals. We don’t suspect that any normality assumptions are violated, even though qq plot may seem slightly skewed, the density plots show our data is reasonable bell shaped.
resid_int <- resid(int_flower_reg)
plot(fitted(int_flower_reg), resid_int)
abline(0,0)
qqnorm(resid_int)
qqline(resid_int)
plot(density(resid_int))
Similarly for the assesment of the fit diagnostics for Model 1, variance and normality assumptions do not appear to be violated.
Identify any influential observations or outliers.
# studentized deleted residuals
flowers$studres <- studres(flower_reg)
# create flag variable
flowers$studres_flag <- if_else(abs(flowers$studres)>=2, 1, 0)
# count the number with an issue
table(flowers$studres_flag)
##
## 0 1
## 23 1
# filter to the observations we're concerned about
clean <- filter(flowers, studres_flag==1)
# hat diagonal
flowers$hat <- hatvalues(flower_reg)
# create flag variable
# compare to 2p/n; 2(3)/24
flowers$hat_flag <- if_else(abs(flowers$hat)>=0.25, 1, 0)
# count the number with an issue
table(flowers$hat_flag)
##
## 0
## 24
# filter to the observations we're concerned about
clean <- filter(flowers, hat_flag==1)
ols_plot_resid_stud_fit(flower_reg)
# Cook's distance
ols_plot_cooksd_chart(flower_reg)
# DFBETAS and DFFITS
# (and Cook's distance and hat diagonal)
reg_sig.inf <- influence.measures(flower_reg)
Both the Deleted Studentized Residual Vs Predicted Values plot and Cook’s D Chart suggests that observation 2 is a potential outlier we should consider for deletion.
What is the best model? Why?
Full <- lm(Flowers ~ Light + Timing + Light:Timing, data=flowers)
Reduced <- lm(Flowers ~ Light + Timing, data=flowers)
anova(Reduced,Full)
## Analysis of Variance Table
##
## Model 1: Flowers ~ Light + Timing
## Model 2: Flowers ~ Light + Timing + Light:Timing
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 879.64
## 2 20 879.18 1 0.45604 0.0104 0.9199
Although we can see from the summary output and the hypothesis test in Problem 4 that the interaction term was not significant, we can also show we can drop the interaction term through a full and reduced model with Anova. Since p = 0.9199 > \(\alpha\), we see that this interaction term is indeed not significant.
Conclusion
Given our interaction term is not significant, we can conclude that Model 1 is more efficient for our modeling purposes.
Interpret the Final Model
summary(flower_reg)
##
## Call:
## lm(formula = Flowers ~ Light + Timing, data = flowers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.040 -3.930 -1.819 5.587 11.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.399167 3.289525 25.353 < 2e-16 ***
## Light -0.040490 0.005157 -7.851 1.11e-07 ***
## TimingPFI -12.091667 2.642210 -4.576 0.000164 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.472 on 21 degrees of freedom
## Multiple R-squared: 0.7973, Adjusted R-squared: 0.778
## F-statistic: 41.29 on 2 and 21 DF, p-value: 5.279e-08
The best regression model, is Model 1 (no interaction):
\({\hat{Flowers} = 83.3992 -0.0405*Light -12.0917*Timing}\)
As previously stated in Problem 4 for Model 1:
The model suggest per one unit of increase in light intensity there is a -0.0405 decrease in the mean number of flowers produced. It also implies that when Timing is equal to 1 (or PMI) there is -12.0917 decrease in the mean number of flowers produced. When all variables are equal to zero, we expect the mean number of flowers produced to be 83.3992.