STA6257: Homework 1

Problem 1

Plot the average number of flowers produced per plant against the light intensity, distinguishing the two timings. Comment.

data(flowers)
ggplot(data=flowers, aes(Light, Flowers, colour = Timing)) +
  geom_point()

As light intensity increases, flower production decreases. It also seems that PFI produces fewer flowers than “before” (24 days before PFI).

Problem 2

Suppose a model with the systematic component Flowers ∼ Light + Timing was needed to model the data. What would such a systematic component imply about the relationship between the variables?

This would imply that we are interested in modeling the mean number of flowers produced based on the light intensity (Light) and when the light treatment was applied (Timing). In other words, our dependent variable is the mean number of flowers produced, where the independent variables (predictors) are Light and Timing.

Problem 3

Suppose a model with the systematic component Flowers ∼ Light + Timing + Light:Timing was needed to model the data. What would such a systematic component imply about the relationship between the variables?

This model would suggest we are still modeling the mean number of flowers produced, where our predictos are Light and Timing, but we are also including the interaction term Light*Timing. An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable. In this case are including the interaction term to check if there is a significant interaction between Light and Timing.

Problem 4

Fit the two linear regression models with the systematic components specified above. Give the regression tables and comment.

Model 1 W/O Interaction Term

flower_reg <- lm(Flowers ~ Light + Timing, data=flowers)
summary(flower_reg)

## 
## Call:
## lm(formula = Flowers ~ Light + Timing, data = flowers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.040  -3.930  -1.819   5.587  11.866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.399167   3.289525  25.353  < 2e-16 ***
## Light        -0.040490   0.005157  -7.851 1.11e-07 ***
## TimingPFI   -12.091667   2.642210  -4.576 0.000164 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.472 on 21 degrees of freedom
## Multiple R-squared:  0.7973, Adjusted R-squared:  0.778 
## F-statistic: 41.29 on 2 and 21 DF,  p-value: 5.279e-08

Our regression model with coefficients can be seen below:

\({\hat{Flowers} = 83.3992 -0.0405*Light -12.0917*Timing}\)

The model suggest per one unit of increase in light intensity there is a -0.0405 decrease in the mean number of flowers produced. It also implies that when Timing is equal to 1 (or PMI) there is -12.0917 decrease in the mean number of flowers produced. When all variables are equal to zero, we expect the mean number of flowers produced to be 83.3992.

Are Light and Timing Significant Predictors of Flowers Produced?

Hypothesis Test for Light

Hypotheses

\(H_0: \ \beta_{Light} = 0\) \(H_1:\) \(\beta_{Light}\) \(\ne 0\)

Test Statistic

\(t_0 = -7.851\)

p-value

\(p < 0.001\)

Rejection Region

Reject \(H_0\) if \(p < \alpha\); \(\alpha = 0.05\)

Conclusion and Interpretation

Reject \(H_0\), we see that there is sufficient evidence to suggest that Light is a significant predictor of mean number of flowers produced.

Hypothesis Test for Timing

Hypotheses

\(H_0: \ \beta_{Timing} = 0\) \(H_1:\) \(\beta_{Timing}\) \(\ne 0\)

Test Statistic

\(t_0 = -4.576\)

p-value

\(p < 0.001\)

Rejection Region

Reject \(H_0\) if \(p < \alpha\); \(\alpha = 0.05\)

Conclusion and Interpretation

Reject \(H_0\), we see that there is sufficient evidence to suggest that Timing is a significant predictor of mean number of flowers produced.

Model 2 W/ Interaction Term

int_flower_reg <- lm(Flowers ~ Light + Timing + Light:Timing, data=flowers)
summary(int_flower_reg)

## 
## Call:
## lm(formula = Flowers ~ Light + Timing + Light:Timing, data = flowers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.919 -3.991 -1.698  5.446 11.664 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      83.116667   4.364515  19.044 2.75e-14 ***
## Light            -0.039952   0.007471  -5.347 3.11e-05 ***
## TimingPFI       -11.526667   6.172356  -1.867   0.0766 .  
## Light:TimingPFI  -0.001076   0.010566  -0.102   0.9199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.63 on 20 degrees of freedom
## Multiple R-squared:  0.7974, Adjusted R-squared:  0.767 
## F-statistic: 26.24 on 3 and 20 DF,  p-value: 3.898e-07

The regression equation for Model 2 (w/interaction term) is as follows:

\({\hat{Flowers} = 83.1167 -0.0399*Light -11.5267*Timing - 0.0011*Light:Timing}\)

Is the Interaction Term Significant?

Hypotheses

\(H_0: \ \beta_{Light:Timing} = 0\) \(H_1:\) \(\beta_{Light:Timing}\) \(\ne 0\)

Test Statistic

\(t_0 = -0.102\)

p-value

\(p = 0.9199\)

Rejection Region

Reject \(H_0\) if \(p < \alpha\); \(\alpha = 0.05\)

Conclusion and Interpretation

Fail to Reject \(H_0\), we see that there is not sufficient evidence to suggest that the interaction term is significant, therefore the slopes are the same.

Problem 5

Perform a diagnostic analysis on the fitted linear regression models. Present the residuals plots and give a comment.

Fit Diagnostics for Model 1

resid <- resid(flower_reg)
plot(fitted(flower_reg), resid)
abline(0,0)

qqnorm(resid)
qqline(resid)

plot(density(resid))

Based on the plots above, we can safely assume variances are equal based on the plot of the residuals. We don’t suspect that any normality assumptions are violated, even though qq plot may seem slightly skewed, the density plots show our data is reasonable bell shaped.

Fit Diagnostics for Model 2

resid_int <- resid(int_flower_reg)
plot(fitted(int_flower_reg), resid_int)
abline(0,0)

qqnorm(resid_int)
qqline(resid_int)

plot(density(resid_int))

Similarly for the assesment of the fit diagnostics for Model 1, variance and normality assumptions do not appear to be violated.

Problem 6

Identify any influential observations or outliers.

Testing for Outliers

# studentized deleted residuals

flowers$studres <- studres(flower_reg)

# create flag variable
flowers$studres_flag <- if_else(abs(flowers$studres)>=2, 1, 0)

# count the number with an issue
table(flowers$studres_flag)

## 
##  0  1 
## 23  1

# filter to the observations we're concerned about
clean <- filter(flowers, studres_flag==1)

# hat diagonal
flowers$hat <- hatvalues(flower_reg)

# create flag variable
# compare to 2p/n; 2(3)/24
flowers$hat_flag <- if_else(abs(flowers$hat)>=0.25, 1, 0)

# count the number with an issue
table(flowers$hat_flag)

## 
##  0 
## 24

# filter to the observations we're concerned about
clean <- filter(flowers, hat_flag==1)

ols_plot_resid_stud_fit(flower_reg)

# Cook's distance
ols_plot_cooksd_chart(flower_reg)

# DFBETAS and DFFITS
# (and Cook's distance and hat diagonal)
reg_sig.inf <- influence.measures(flower_reg)

Both the Deleted Studentized Residual Vs Predicted Values plot and Cook’s D Chart suggests that observation 2 is a potential outlier we should consider for deletion.

Problem 7

What is the best model? Why?

Anova to drop interaction

Full <- lm(Flowers ~ Light + Timing + Light:Timing, data=flowers)
Reduced <- lm(Flowers ~ Light + Timing, data=flowers)
anova(Reduced,Full)

## Analysis of Variance Table
## 
## Model 1: Flowers ~ Light + Timing
## Model 2: Flowers ~ Light + Timing + Light:Timing
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     21 879.64                           
## 2     20 879.18  1   0.45604 0.0104 0.9199

Although we can see from the summary output and the hypothesis test in Problem 4 that the interaction term was not significant, we can also show we can drop the interaction term through a full and reduced model with Anova. Since p = 0.9199 > \(\alpha\), we see that this interaction term is indeed not significant.

Conclusion

Given our interaction term is not significant, we can conclude that Model 1 is more efficient for our modeling purposes.

Problem 8

Interpret the Final Model

Best Model

summary(flower_reg)

## 
## Call:
## lm(formula = Flowers ~ Light + Timing, data = flowers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.040  -3.930  -1.819   5.587  11.866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.399167   3.289525  25.353  < 2e-16 ***
## Light        -0.040490   0.005157  -7.851 1.11e-07 ***
## TimingPFI   -12.091667   2.642210  -4.576 0.000164 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.472 on 21 degrees of freedom
## Multiple R-squared:  0.7973, Adjusted R-squared:  0.778 
## F-statistic: 41.29 on 2 and 21 DF,  p-value: 5.279e-08

The best regression model, is Model 1 (no interaction):

\({\hat{Flowers} = 83.3992 -0.0405*Light -12.0917*Timing}\)

As previously stated in Problem 4 for Model 1:

STA6257: Homework 1 - Problem A

Mason Lowery

Last updated: 2022-08-05

Problem 1

Problem 2

Problem 3

Problem 4

Model 1 W/O Interaction Term

Are Light and Timing Significant Predictors of Flowers Produced?

Hypothesis Test for Light

Hypothesis Test for Timing

Model 2 W/ Interaction Term

Is the Interaction Term Significant?

Problem 5

Fit Diagnostics for Model 1

Fit Diagnostics for Model 2

Problem 6

Testing for Outliers

Problem 7

Anova to drop interaction

Problem 8

Best Model