This paper summarizes the key points and R code from Lessons 9 to 11 of the swirl course on Regression Models. These lessons cover Residuals Diagnostics and Variation, Variance Inflation Factors, and Overfitting and Underfitting.
Lesson 9: Residuals Diagnostics and Variation
The lesson begins by introducing outliers and their impact on
regression models. An example in R demonstrates how a single outlier
can substantially alter the fit of a model. The process involves
fitting a linear model (fit <- lm(y ~ x, out2)) and
examining diagnostic plots, such as the residuals vs fitted values plot
(plot(fit, which=1)), to detect patterns that may signal an
outlier. Once the influential outlier (in row 1) is identified, a second
model (fitno) is fit without this data point, which yields a residual
plot with no obvious pattern. The difference in coefficients
between the two models (coef(fit)-coef(fitno)) shows the
outlier’s substantial effect on the slope. The function dfbeta(fit)
reports this change in coefficients for every observation, and
hatvalues(fit) measures leverage, that is, how strongly each point can
pull the fitted line toward itself. In both cases, values of larger
magnitude indicate greater potential influence. Standardized residuals,
computed via rstandard, express each residual in units of its estimated
standard deviation, so they can be compared across points.
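The comparison described above can be sketched as follows. The swirl data frame out2 is not reproduced here, so this is a minimal sketch on simulated data: the data frame dat, the seed, and the planted outlier at (5, -5) in row 1 are all assumptions made for illustration.

```r
# Simulated stand-in for out2 (hypothetical data, not the swirl dataset):
# row 1 is a planted outlier far from the other points in both x and y.
set.seed(42)
n <- 50
x <- c(5, rnorm(n))
y <- c(-5, 0.8 * x[-1] + rnorm(n, sd = 0.5))
dat <- data.frame(x = x, y = y)

fit   <- lm(y ~ x, dat)          # model with the outlier
fitno <- lm(y ~ x, dat[-1, ])    # model without row 1

# Change in intercept and slope caused by the outlier
coef_change <- coef(fit) - coef(fitno)
print(coef_change)

# dfbeta() performs the same leave-one-out comparison for every row;
# its first row corresponds to dropping row 1
print(head(dfbeta(fit), 3))

# Leverage of each point; the planted outlier should have the largest value
hat <- hatvalues(fit)
print(which.max(hat))
```

In this sketch the outlier drags the slope downward, so the x entry of coef_change is negative.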
Further analysis uses a Scale-Location plot
(plot(fit, which=3)) and a QQ plot
(plot(fit, which=2)) to assess the spread of the standardized
residuals and detect deviations from normality. The outlier’s
standardized residual (around -5) and its Studentized residual
(rstudent), which is larger in magnitude because the point itself is
left out of the variance estimate, mark it as far out of line with the
rest of the data.
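To make the two kinds of residual concrete, the sketch below computes them by hand and compares the results with rstandard and rstudent. It again uses simulated data in place of out2; the data frame dat and the planted outlier in row 1 are assumptions for illustration.

```r
# Hypothetical stand-in for out2 with a planted outlier in row 1
set.seed(42)
n <- 50
x <- c(5, rnorm(n))
y <- c(-5, 0.8 * x[-1] + rnorm(n, sd = 0.5))
dat <- data.frame(x = x, y = y)
fit   <- lm(y ~ x, dat)
fitno <- lm(y ~ x, dat[-1, ])

h <- hatvalues(fit)

# Standardized residual: residual divided by its estimated standard deviation,
# using the residual standard error of the full model
sigma_all <- sqrt(deviance(fit) / df.residual(fit))
rstd_by_hand <- resid(fit) / (sigma_all * sqrt(1 - h))

# Studentized residual for row 1: same idea, but the standard error comes
# from the model fit without row 1
sigma_no1 <- sqrt(deviance(fitno) / df.residual(fitno))
rstu1_by_hand <- resid(fit)[1] / (sigma_no1 * sqrt(1 - h[1]))

print(all.equal(rstd_by_hand, rstandard(fit)))
print(all.equal(rstu1_by_hand, rstudent(fit)[1]))
```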
Finally, Cook’s distance is introduced as a measure of how much the
fitted values change when a single point is removed. It is calculated
from the squared differences between the predicted values of the full
and reduced models, scaled by the residual variance and the number of
coefficients. The cooks.distance function computes it for every
observation. Together, these tools highlight the
importance of diagnosing and managing outliers in regression
analysis.
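The Cook’s distance calculation can be sketched by hand and checked against cooks.distance. As before, the data are simulated rather than taken from swirl; dat and the planted outlier in row 1 are assumptions for illustration.

```r
# Hypothetical stand-in for out2 with a planted outlier in row 1
set.seed(42)
n <- 50
x <- c(5, rnorm(n))
y <- c(-5, 0.8 * x[-1] + rnorm(n, sd = 0.5))
dat <- data.frame(x = x, y = y)
fit   <- lm(y ~ x, dat)
fitno <- lm(y ~ x, dat[-1, ])

# Difference in predictions on ALL rows between the reduced and full models
dy <- predict(fitno, dat) - predict(fit, dat)

# Scale by the full model's residual variance and the number of coefficients
sigma_all <- sqrt(deviance(fit) / df.residual(fit))
p <- length(coef(fit))
cook1_by_hand <- sum(dy^2) / (p * sigma_all^2)

# Should agree with the built-in value for row 1
print(all.equal(cook1_by_hand, unname(cooks.distance(fit)[1])))
```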
Lesson 10: Variance Inflation Factors
In this lesson, we explore Variance Inflation Factors (VIF), which quantify how correlation among the regressors affects the precision of our coefficient estimates in regression analysis. When we include variables that are correlated with each other, the variance of the estimates is inflated, which makes them less precise. Our goal is to create a simple model that explains the data well without unnecessary regressors.
We can run simulations to estimate coefficients using rgp1() for models with uncorrelated predictors and rgp2() for models with correlated predictors. We use the lm() function to create a linear model, like this: mdl <- lm(Fertility ~ ., swiss), which regresses Fertility on all the other variables in the swiss dataset. To find out how much the regressors are influencing each other, we calculate the VIF using vif(mdl), where vif() comes from the car package. A VIF is the factor by which the variance of a coefficient estimate is inflated compared with the case where its regressor is uncorrelated with the others. A value of 1 means no inflation, a value greater than 1 means there is some overlap between the variables, and values over 5 or 10 are commonly taken as a sign of strong overlap, which makes the individual coefficients hard to interpret.
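The effect of correlated predictors can be illustrated with a small simulation. This is a minimal sketch, not the swirl helpers rgp1() and rgp2(): the function sim_slope, the correlation of 0.9, and the sample sizes are all assumptions chosen for illustration.

```r
set.seed(123)

# Hypothetical helper: repeatedly simulate y = x1 + noise, fit y ~ x1 + x2,
# and return the estimated coefficient of x1 from each simulation.
# rho is the correlation between x1 and the extra regressor x2.
sim_slope <- function(rho, nsim = 500, n = 100) {
  replicate(nsim, {
    x1 <- rnorm(n)
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
    y  <- x1 + rnorm(n)
    unname(coef(lm(y ~ x1 + x2))["x1"])
  })
}

var_uncor <- var(sim_slope(0))    # x2 unrelated to x1
var_cor   <- var(sim_slope(0.9))  # x2 strongly correlated with x1

# Ratio of the two variances: an empirical variance inflation factor
print(var_cor / var_uncor)
```

In both cases the estimates stay centered near the true coefficient of 1. Only their spread changes, which is the inflation that the VIF measures.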
For example, the VIF for Education is about 2.77, which tells us that the variance of its coefficient estimate is about 2.77 times what it would be if Education were uncorrelated with the other regressors, in part because it is related to another variable called Examination. To see how this affects our model, we can create a second model without the Examination variable using mdl2 <- lm(Fertility ~ . -Examination, swiss). When we recalculate the VIF with vif(mdl2), we find that the VIF for Education drops to about 1.82. This shows that removing a correlated regressor reduces variance inflation, although leaving out a variable that matters can bias the remaining estimates.
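For readers without the car package, the same quantity can be sketched in base R from its definition, 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. The helper vif_by_hand below is a hypothetical name introduced for illustration and is not part of swirl or car.

```r
# Hypothetical helper: VIF of each predictor, computed from the R-squared of
# regressing that predictor on all the other predictors.
vif_by_hand <- function(model_data, response) {
  predictors <- setdiff(names(model_data), response)
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = model_data))$r.squared
    1 / (1 - r2)
  })
}

# All regressors, matching mdl <- lm(Fertility ~ ., swiss)
vif_full <- vif_by_hand(swiss, "Fertility")

# Without Examination, matching mdl2 <- lm(Fertility ~ . -Examination, swiss)
vif_reduced <- vif_by_hand(swiss[, names(swiss) != "Examination"], "Fertility")

print(round(vif_full, 2))
print(round(vif_reduced, 2))
```

The Education entries of vif_full and vif_reduced should correspond to the two values quoted above.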
In summary, understanding VIF helps us see when correlated variables are making our coefficient estimates imprecise. By using it, we can weigh which variables to keep and which to remove, leading to a better understanding of the relationships in our data. This lesson teaches us that choosing the right variables is important for making good predictions.
Lesson 11: Overfitting and Underfitting
This lesson highlights the importance of selecting the right set of variables in a regression model. Underfitting occurs when key variables are omitted, leading to biased coefficient estimates, as shown using the simbias() function. This bias is especially problematic when the excluded variables are correlated with those that remain, because their effect is then wrongly attributed to the included regressors. Overfitting, on the other hand, occurs when too many variables, including irrelevant ones, are included. By adding random regressors to the swiss dataset, the lesson demonstrates that although the residual sum of squares (RSS) decreases, this does not mean that the model has truly improved. Instead, the model may fit the noise in the data, making it less reliable for new datasets.
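The random-regressor experiment can be sketched as follows. The column names noise1 to noise10, the seed, and the choice of ten extra regressors are assumptions made for illustration, not the exact setup used in the lesson.

```r
set.seed(2024)

# Start from the model with all real regressors
noisy <- swiss
rss <- deviance(lm(Fertility ~ ., noisy))

# Add pure-noise columns one at a time and record the RSS after each addition
for (k in 1:10) {
  noisy[[paste0("noise", k)]] <- rnorm(nrow(swiss))
  rss <- c(rss, deviance(lm(Fertility ~ ., noisy)))
}

# RSS never goes up, even though the added columns carry no information
print(round(rss, 1))
```

Because each model contains all the regressors of the previous one, the RSS sequence cannot increase. This is why a lower RSS by itself is weak evidence of a better model.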
To address overfitting, the lesson introduces the use of ANOVA (Analysis of Variance), which helps assess the significance of adding new regressors. By comparing a simpler model with a more complex one that contains it, ANOVA uses an F test to evaluate whether the reduction in the residual sum of squares is meaningful or due to chance. Key functions such as lm(), anova(), and deviance() are used to evaluate model performance, while the shapiro.test() function checks the normality of residuals, which the F test assumes. The lesson emphasizes that a balance must be struck between underfitting and overfitting to build a model that generalizes well without being overly complex or overly simplistic.
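A nested-model comparison of this kind can be sketched on the swiss data. The particular pair of models below (fit1 and fit3) is chosen for illustration, and the F statistic is recomputed by hand from deviance() to show what anova() reports.

```r
# Two nested models: fit3 adds Examination and Education to fit1
fit1 <- lm(Fertility ~ Agriculture, swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, swiss)

# F test for whether the two extra regressors reduce the RSS significantly
tab <- anova(fit1, fit3)
print(tab)

# The same F statistic by hand: drop in RSS per added regressor, divided by
# the residual variance of the larger model
extra_df  <- df.residual(fit1) - df.residual(fit3)
num       <- (deviance(fit1) - deviance(fit3)) / extra_df
den       <- deviance(fit3) / df.residual(fit3)
f_by_hand <- num / den
p_by_hand <- pf(f_by_hand, extra_df, df.residual(fit3), lower.tail = FALSE)

# Normality check on the residuals of the larger model; a small p-value
# would cast doubt on the normality assumption behind the F test
sw <- shapiro.test(resid(fit3))
print(sw)
```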