This report summarizes the key takeaways from Lessons 9-11 of the Regression Models course.
LESSON 9: Residuals, Diagnostics, and Variation
In this lesson, we focused on identifying outliers in regression models
and understanding their impact on the fit. We were given two figures.
The first shows an outlier that does not affect the fit, and the second
shows an influential outlier that changes the slope and residuals of
the fitted line. We worked with the data containing the influential
outlier. Diagnostic plots, such as the Residuals vs Fitted plot, and
measures of influence, such as Cook’s distance, are used to determine
how much such outliers affect the model. First, we generated a
Residuals vs Fitted plot to check whether the residuals are uncorrelated
with the fit and scattered around zero with no systematic pattern.
Using the function lm(), we created a second model, fitno, that excludes
the outlier, and compared the Residuals vs Fitted plots of the two
models. The pattern disappears once the outlier is removed, confirming
its influence. We also learned about the function dfbeta(), which
calculates the change in the regression coefficients when each sample is
left out, and the function hatvalues(), which measures the influence
(leverage) of each sample as 1 minus the ratio of its residual in the
full model to its residual in a model fit without that sample.
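
The leave-one-out comparisons described above can be sketched in R. This is a minimal illustration on simulated data, not the lesson's own dataset: the data frame dat, the seed, and the outlier placed at x = 8 are assumptions made for the example, while lm(), dfbeta(), hatvalues(), resid(), and predict() are standard R functions.

```r
# Simulated data (an assumption; the lesson's dataset is not reproduced here).
set.seed(42)
n <- 50
clean <- data.frame(x = rnorm(n))
clean$y <- 0.5 * clean$x + rnorm(n, sd = 0.5)

# Row 1 is a deliberately influential outlier: far out in x and off the trend.
dat <- rbind(data.frame(x = 8, y = -6), clean)

fit   <- lm(y ~ x, data = dat)         # model with the outlier
fitno <- lm(y ~ x, data = dat[-1, ])   # model without the outlier

# Change in the coefficients caused by sample 1, computed two ways.
delta_coef <- coef(fit) - coef(fitno)
dfb1 <- dfbeta(fit)[1, ]

# Hat value of sample 1, and the same quantity from the two residuals:
# 1 - (residual in full model) / (residual when the sample is left out).
h1 <- hatvalues(fit)[1]
resno1 <- dat$y[1] - predict(fitno, newdata = dat[1, ])
h1_by_hand <- 1 - resid(fit)[1] / resno1

print(rbind(delta_coef, dfb1))
print(c(hatvalue = unname(h1), by_hand = unname(h1_by_hand)))
```

The two rows of the first printout, and the two entries of the second, should agree up to floating-point error.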
Following this, standardized residuals were introduced. Computed with
the rstandard() function, they divide each residual by an estimate of
its standard deviation, which accounts for differences in residual
variance across individual samples. A QQ plot of the standardized
residuals is used to visualize how far the influential outlier deviates
from the expected normal distribution. Studentized residuals, a refined
version of standardized residuals, are calculated by the rstudent()
function. The key difference is that Studentized residuals estimate the
residual standard deviation from the deviance of a model that excludes
the sample in question, so an outlier cannot inflate the scale against
which it is judged. Finally, we learned about Cook’s distance, which
summarizes how much all fitted values change when a sample is left out
and so measures that sample’s influence on the overall regression
model. Here, we used the cooks.distance() function. Overall, we learned
that excluding an influential outlier can change the fit of the model
noticeably. It is important to assess outliers carefully, as they can
represent either genuine data points or errors, and diagnostic plots in
R can help identify their impact on regression analysis.
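
As a rough check on what rstandard(), rstudent(), and cooks.distance() compute, the sketch below rebuilds each quantity by hand and compares it with the built-in function. The simulated data frame dat and its outlier in row 1 are assumptions made for illustration, and the formulas are the textbook definitions rather than anything taken from the lesson's own code.

```r
# Simulated data with an influential outlier in row 1 (an assumption).
set.seed(42)
n <- 50
clean <- data.frame(x = rnorm(n))
clean$y <- 0.5 * clean$x + rnorm(n, sd = 0.5)
dat <- rbind(data.frame(x = 8, y = -6), clean)

fit   <- lm(y ~ x, data = dat)         # model with the outlier
fitno <- lm(y ~ x, data = dat[-1, ])   # model without the outlier

e <- resid(fit)
h <- hatvalues(fit)
p <- length(coef(fit))                           # number of coefficients
sigma <- sqrt(deviance(fit) / df.residual(fit))  # residual standard deviation

# Standardized residuals: each residual over its estimated standard deviation.
rstd_by_hand <- e / (sigma * sqrt(1 - h))

# Studentized residual of sample 1: same idea, but sigma comes from the
# model that leaves sample 1 out.
sigma_no1 <- sqrt(deviance(fitno) / df.residual(fitno))
rstu1_by_hand <- e[1] / (sigma_no1 * sqrt(1 - h[1]))

# Cook's distance of sample 1: squared change in all fitted values when the
# sample is left out, scaled by p * sigma^2.
dy <- predict(fit, newdata = dat) - predict(fitno, newdata = dat)
cook1_by_hand <- sum(dy^2) / (p * sigma^2)

print(all.equal(rstd_by_hand, rstandard(fit)))
print(all.equal(unname(rstu1_by_hand), unname(rstudent(fit)[1])))
print(all.equal(cook1_by_hand, unname(cooks.distance(fit)[1])))
```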
LESSON 10: Variance Inflation Factors
In this lesson, we investigated variance inflation, which occurs when
adding correlated variables to a regression model increases the standard
errors of the coefficient estimates. The more strongly a regressor is
correlated with the others, the less precise its coefficient estimate
becomes. The lesson was built around the vifSims.R script and began by
introducing two functions, rgp1() and rgp2(), each of which generates
three regressors: x1, x2, and x3. These functions simulate scenarios in
which the regressors are either uncorrelated (rgp1()) or correlated
(rgp2()). The variance inflation factor (VIF) is then introduced to
quantify the inflation: it is the ratio of the variance of a coefficient
estimate to the variance it would have if its regressor were
uncorrelated with the others. To make this concrete, the swiss dataset is
used to illustrate VIF in a practical setting. The dataset contains
fertility rates and various socioeconomic indicators for 47 Swiss
provinces. A linear regression model (mdl) is fit to predict Fertility
from five regressors: Agriculture, Examination, Education, Catholic,
and Infant.Mortality. The output shows that Examination has the highest
VIF, suggesting that it is strongly correlated with other regressors,
particularly Education. The high VIF indicates that the variance of
Examination’s coefficient is inflated compared to a scenario in which
it is uncorrelated with the other regressors.
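
One way to reproduce the VIFs for the swiss model without extra packages is the standard identity VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing regressor j on the remaining regressors. The sketch below is an assumption about how to recompute the values in base R; the name vif_by_hand is made up for the example, and the report does not say which function the lesson itself used.

```r
data(swiss)  # ships with base R (datasets package)

regressors <- setdiff(names(swiss), "Fertility")

# VIF of each regressor: regress it on the other regressors and
# convert the resulting R-squared into 1 / (1 - R^2).
vif_by_hand <- sapply(regressors, function(v) {
  others <- setdiff(regressors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = swiss))$r.squared
  1 / (1 - r2)
})

# Cross-check: the VIFs are also the diagonal of the inverse of the
# correlation matrix of the regressors.
vif_from_cor <- diag(solve(cor(swiss[, regressors])))

print(round(vif_by_hand, 3))
print(names(which.max(vif_by_hand)))
```

The last line prints the regressor with the largest VIF, which should be Examination, as stated in the report.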
Simulations like those in this lesson illustrate how variance inflation affects regression models: the variance of a coefficient estimate increases when its regressor is correlated with others. A key takeaway is that including highly correlated variables inflates standard errors, but omitting a correlated variable that matters can bias the estimates of the remaining coefficients. Thus, it is important to choose variables carefully, using VIF to measure how much inflation each one incurs.
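
The trade-off can be illustrated with a small simulation. This is a sketch in the spirit of the lesson's simulations, not a reproduction of rgp1(), rgp2(), or simbias(): the function sim_x1_coefs(), its parameters rho and beta2, the sample sizes, and the noise level are all assumptions chosen for the example.

```r
set.seed(1)
n    <- 100   # observations per simulated data set
nsim <- 500   # number of simulated data sets

# Hypothetical helper: simulate y = x1 + beta2 * x2 + noise, where x2 has
# correlation rho with x1, and return the estimated x1 coefficient from a
# model that omits x2 and from a model that includes it.
sim_x1_coefs <- function(rho, beta2 = 0) {
  t(replicate(nsim, {
    x1 <- rnorm(n)
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
    y  <- x1 + beta2 * x2 + rnorm(n, sd = 0.3)
    c(omit_x2 = unname(coef(lm(y ~ x1))["x1"]),
      with_x2 = unname(coef(lm(y ~ x1 + x2))["x1"]))
  }))
}

# Variance inflation: x2 is irrelevant (beta2 = 0), so including it only
# costs precision, and the cost grows with its correlation with x1.
sd_uncor <- apply(sim_x1_coefs(rho = 0),   2, sd)
sd_cor   <- apply(sim_x1_coefs(rho = 0.9), 2, sd)

# Omitted-variable bias: x2 matters (beta2 = 1), so leaving it out pushes
# the x1 estimate away from its true value of 1.
mean_biased <- colMeans(sim_x1_coefs(rho = 0.9, beta2 = 1))

print(rbind(sd_uncor, sd_cor))
print(mean_biased)
```

With uncorrelated regressors the two standard deviations should be close. With rho = 0.9, the standard deviation for the model that includes x2 should be clearly larger, while the model that omits a relevant x2 should give an average x1 estimate well above 1.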
LESSON 11: Overfitting and Underfitting
In this lesson, we examined the concepts of overfitting and
underfitting, focusing on how the inclusion or exclusion of variables
affects the estimates of other variables in a model. Adding variables,
even irrelevant ones, generally reduces the residual sum of squares,
which leads to overfitting, while omitting important variables
introduces bias in the estimates. The function simbias() shows the bias
introduced by omitting a correlated regressor. A key takeaway is that a
lower residual sum of squares alone does not show that added variables
improve the model; their significance must be tested. Analysis of
variance (ANOVA) was used to assess the significance of added regressors
while accounting for the loss of degrees of freedom. In the example of
regressing Fertility on Agriculture (fit1), and then adding Examination
and Education (fit3), ANOVA shows that the two added regressors are
jointly significant, with an F-value of 20.968 and a p-value of
4.407e-07. This leads to rejecting the null hypothesis that the
coefficients of both added regressors are zero.
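
The comparison of fit1 and fit3 can be sketched as follows. The model names come from the report; the by-hand F statistic is added here only to show what anova() computes for two nested models, and the variable names rss1, rss3, df_added, f_stat, and p_val are made up for the example.

```r
data(swiss)  # ships with base R (datasets package)

fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)

# F-test for the two added regressors.
tab <- anova(fit1, fit3)
print(tab)

# The same statistic by hand: the drop in residual sum of squares per added
# regressor, relative to the residual variance of the larger model.
rss1 <- deviance(fit1)
rss3 <- deviance(fit3)
df_added <- df.residual(fit1) - df.residual(fit3)
f_stat <- ((rss1 - rss3) / df_added) / (rss3 / df.residual(fit3))
p_val  <- pf(f_stat, df_added, df.residual(fit3), lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

Dividing by df_added and by the residual degrees of freedom of fit3 is how the test accounts for the degrees of freedom spent on the extra regressors.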
Additionally, because the ANOVA F-test assumes approximately normal residuals, the residuals should be tested for normality to support the conclusions of the analysis. Understanding the balance between overfitting and underfitting is essential for building a reliable model.
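
One common way to test residuals for normality in R is the Shapiro-Wilk test. The sketch below applies it to the residuals of fit3; the choice of shapiro.test() is an assumption, since the report does not name the test that was used.

```r
data(swiss)  # ships with base R (datasets package)

fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value means normality is not rejected.
sw <- shapiro.test(resid(fit3))
print(sw)
```

A p-value above the chosen significance level would mean the test gives no evidence against the normality assumption behind the F-test.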