SUMMARY

This report summarizes the key takeaways from Lessons 9-11 of the Regression Models course.

LESSON 9: Residuals, Diagnostics, and Variation

In this lesson, we focused on identifying outliers in regression models and understanding their impact on the fit. We were given two figures: the first shows an outlier that does not affect the fit, and the second shows an outlier that strongly influences the slope and the residuals. We worked with the second figure, the one with the influential outlier. Diagnostic plots, such as the Residuals vs Fitted plot, and measures of influence, such as Cook’s distance, are used to determine how much such an outlier affects the model. First, we generated a Residuals vs Fitted plot; in a well-behaved fit, the residuals should be uncorrelated with the fitted values and show no systematic pattern. We then used lm() to create a second model, fitno, that excludes the outlier, and compared the Residuals vs Fitted plots of the two models: the pattern disappears once the outlier is removed, confirming its influence. We also learned about dfbeta(), which calculates, for each sample, how much the regression coefficients change when that sample is left out, and hatvalues(), which measures a sample’s influence as 1 minus the ratio of its residual in the model that includes it to its residual in the model that excludes it, so values near 1 indicate high influence.
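Both influence measures can be reproduced by hand, which makes their definitions concrete. The sketch below is a minimal illustration under stated assumptions: the data frame dat and the outlier planted in its first row are simulated stand-ins, not the lesson's data. Only lm(), dfbeta(), hatvalues(), and the model name fitno come from the lesson.

```r
# Hypothetical data: 50 unrelated points plus one planted outlier in row 1.
set.seed(42)
dat <- data.frame(x = c(10, rnorm(50)), y = c(10, rnorm(50)))

fit   <- lm(y ~ x, data = dat)         # model including the outlier
fitno <- lm(y ~ x, data = dat[-1, ])   # model excluding the outlier

# dfbeta(): change in the coefficients caused by including a sample.
coef_change <- coef(fit) - coef(fitno)
print(coef_change)
print(dfbeta(fit)[1, ])                # row 1 should match coef_change

# hatvalues(): 1 minus the ratio of the outlier's residual when it is
# included in the fit to its residual when it is left out.
res_in     <- resid(fit)[1]
res_out    <- dat$y[1] - predict(fitno, newdata = dat[1, ])
hat_manual <- 1 - res_in / res_out
print(hat_manual)
print(hatvalues(fit)[1])               # should match hat_manual
```

Comparing plot(fit, which = 1) with plot(fitno, which = 1) on the same data gives the before-and-after Residuals vs Fitted plots described above.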

Next, standardized residuals were introduced. Residuals of individual samples have different variances, and standardized residuals, computed with rstandard(), compensate by scaling each residual by an estimate of its own standard deviation based on its hat value. A QQ plot of the standardized residuals shows how far the influential outlier deviates from what a normal distribution would predict. Studentized residuals, computed with rstudent(), refine this idea. The key difference is that the Studentized residual of a sample uses the deviance of a model that leaves that sample out, so an influential sample does not distort the estimate of its own residual variance. Finally, we learned about Cook’s distance, computed with cooks.distance(), which summarizes a sample’s influence on the overall fit by measuring how much all the fitted values change when that sample is omitted. Overall, we learned that excluding an influential outlier can change the fit noticeably. It is important to assess outliers carefully, since they can be either genuine data points or errors, and the diagnostic tools in R help quantify their impact on a regression analysis.
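The three quantities in this paragraph can also be computed from their definitions and checked against the built-in functions. This is a sketch, assuming a simulated data frame dat with a planted outlier in row 1 (hypothetical, not the lesson's data); the variable names are illustrative.

```r
# Hypothetical data: 50 unrelated points plus one planted outlier in row 1.
set.seed(42)
dat <- data.frame(x = c(10, rnorm(50)), y = c(10, rnorm(50)))

fit   <- lm(y ~ x, data = dat)         # model including the outlier
fitno <- lm(y ~ x, data = dat[-1, ])   # model excluding the outlier
h     <- hatvalues(fit)

# Standardized residuals: each residual divided by its own estimated
# standard deviation, sigma * sqrt(1 - hat value).
sigma       <- sqrt(deviance(fit) / df.residual(fit))
rstd_manual <- resid(fit) / (sigma * sqrt(1 - h))

# Studentized residual of the outlier: same idea, but sigma is estimated
# from the model that leaves the outlier out.
sigma1       <- sqrt(deviance(fitno) / df.residual(fitno))
rstud_manual <- resid(fit)[1] / (sigma1 * sqrt(1 - h[1]))

# Cook's distance of the outlier: squared change in all fitted values,
# normalized by the number of coefficients (2) times sigma^2.
dy           <- predict(fitno, newdata = dat) - predict(fit, newdata = dat)
cook_manual  <- sum(dy^2) / (2 * sigma^2)

print(c(manual = unname(rstd_manual[1]),  builtin = unname(rstandard(fit)[1])))
print(c(manual = unname(rstud_manual),    builtin = unname(rstudent(fit)[1])))
print(c(manual = cook_manual,             builtin = unname(cooks.distance(fit)[1])))
```

Calling qqnorm(rstandard(fit)) followed by qqline(rstandard(fit)) would draw a QQ plot of the standardized residuals for the same model.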

LESSON 10: Variance Inflation Factors

In this lesson, we investigated variance inflation, which occurs when adding correlated regressors to a regression model increases the standard errors of the coefficients. The more strongly the regressors are correlated, the less reliable the coefficient estimates become. The lesson is built around the vifSims.R script and begins by introducing two functions, rgp1() and rgp2(), each of which generates three regressors: x1, x2, and x3. They simulate scenarios in which the regressors are either uncorrelated (rgp1()) or correlated (rgp2()). The variance inflation factor (VIF) is then introduced as a way to quantify variance inflation: it is the ratio of a coefficient’s variance to what that variance would be if its regressor were uncorrelated with the others. The swiss dataset is used to illustrate VIF in a practical setting. It contains fertility rates and various socioeconomic indicators for 47 French-speaking Swiss provinces. A linear regression model (mdl) is fit to predict Fertility from five regressors: Agriculture, Examination, Education, Catholic, and Infant.Mortality. The output shows that Examination has the highest VIF, suggesting that it is strongly correlated with the other regressors, particularly Education. The high VIF indicates that the variance of Examination’s coefficient is inflated compared to a scenario in which it is uncorrelated with the other regressors.
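The VIFs for the swiss model can be computed in base R from the standard identity VIF = 1 / (1 - R^2), where R^2 comes from regressing one regressor on all the others. The sketch below takes that route to stay free of package dependencies; the helper name vif_manual is an assumption made for illustration. The vif() function in the car package is a common ready-made alternative.

```r
data(swiss)
mdl <- lm(Fertility ~ ., data = swiss)   # Fertility on all five regressors

regressors <- setdiff(names(swiss), "Fertility")

# For each regressor, regress it on the remaining ones and convert the
# resulting R^2 into a variance inflation factor.
vif_manual <- sapply(regressors, function(v) {
  others <- setdiff(regressors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = swiss))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 3))
print(names(which.max(vif_manual)))      # regressor with the largest VIF

# Pairwise correlation behind the Examination/Education remark.
print(cor(swiss$Examination, swiss$Education))
```

A VIF of 1 means no inflation. The square root of a VIF gives the inflation of the standard error rather than of the variance.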

Simulations like those in this lesson illustrate how variance inflation affects regression models: the variance of a coefficient increases when its predictor is correlated with others. A key takeaway is that including highly correlated variables inflates standard errors, but simply removing them can bias the estimates of the remaining coefficients. Variables should therefore be included with care, using VIF to measure the inflation and assess its impact.
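The effect can be illustrated without repeated simulation, because the variance of a coefficient is proportional to the matching diagonal entry of the inverse of X'X. The sketch below is not the vifSims.R code; the regressors x_unc and x_cor, the correlation of 0.95, and the helper coef_var() are assumptions chosen for illustration.

```r
set.seed(4321)
n  <- 100
x1 <- rnorm(n)
x_unc <- rnorm(n)                                   # roughly uncorrelated with x1
x_cor <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)    # strongly correlated with x1

# Variance of the x1 coefficient, up to the unknown error variance:
# the (2, 2) entry of (X'X)^{-1}, where column 1 is the intercept.
coef_var <- function(...) {
  X <- cbind(1, ...)
  solve(crossprod(X))[2, 2]
}

base     <- coef_var(x1)                  # x1 alone
infl_unc <- coef_var(x1, x_unc) / base    # inflation from an uncorrelated regressor
infl_cor <- coef_var(x1, x_cor) / base    # inflation from a correlated regressor

print(c(uncorrelated = infl_unc, correlated = infl_cor))

# With two regressors, the inflation equals 1 / (1 - r^2).
print(1 / (1 - cor(x1, x_cor)^2))
```

Adding the nearly uncorrelated regressor leaves the variance of the x1 coefficient almost unchanged, while adding the correlated one inflates it several times over.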

LESSON 11: Overfitting and Underfitting

In this lesson, we examined the concepts of overfitting and underfitting, focusing on how including or excluding variables affects the estimates of the other coefficients in a model. Adding regressors, even irrelevant ones, generally reduces the residual sum of squares, which can lead to overfitting, while omitting important variables introduces bias in the estimates. The function simbias() demonstrates the bias introduced by omitting a correlated regressor. A key takeaway is that adding variables does not necessarily improve the model, so the significance of added regressors should be tested. Analysis of variance (ANOVA) does this while accounting for the degrees of freedom lost by adding them. In the example, Fertility is first regressed on Agriculture (fit1), and then Examination and Education are added (fit3). ANOVA shows that the two added regressors are significant, with an F-value of 20.968 and a p-value of 4.407e-07, so we reject the null hypothesis that they do not improve the model.
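The comparison can be reproduced on the built-in swiss data, and the F statistic can be rebuilt from the two residual sums of squares to show what ANOVA is testing. This is a sketch: fit1 and fit3 follow the models named above, while the other variable names are assumptions made for illustration.

```r
data(swiss)
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)

tab <- anova(fit1, fit3)
print(tab)

# F statistic by hand: the drop in residual sum of squares per added
# regressor, divided by the residual variance of the larger model.
df_added <- df.residual(fit1) - df.residual(fit3)     # 2 added regressors
f_manual <- ((deviance(fit1) - deviance(fit3)) / df_added) /
            (deviance(fit3) / df.residual(fit3))

# p-value: upper tail of the F distribution with those degrees of freedom.
p_manual <- pf(f_manual, df_added, df.residual(fit3), lower.tail = FALSE)

print(c(F = f_manual, p = p_manual))
```

Dividing by the residual degrees of freedom of the larger model is how the test accounts for the degrees of freedom spent on the extra regressors.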

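Additionally, because analysis of variance assumes that the model residuals are approximately normal, the residuals should be tested for normality, for example with the Shapiro-Wilk test (shapiro.test()), before relying on the p-value. Understanding the balance between overfitting and underfitting is essential for creating a reliable model.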
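One quick way to run such a normality check in R is the Shapiro-Wilk test. The sketch below applies it to the residuals of the three-regressor model; the choice of test and the variable name sw are assumptions made for illustration.

```r
data(swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value means normality is not rejected.
sw <- shapiro.test(residuals(fit3))
print(sw)
```

A small p-value here would be a reason to treat the ANOVA result with caution.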