The lesson introduces the concept of “regression toward the mean,” pioneered by Sir Francis Galton and demonstrated through his study of parent-child height relationships. The concept describes how extreme measurements tend to move closer to the average in subsequent generations: Galton observed that children of tall parents tend to be tall but closer to the average height, while children of short parents tend to be short but also closer to the average height.
Galton’s dataset consists of 928 pairs of parent and child heights, adjusted for measurement errors.
A scatter plot of parent vs. child heights shows that while there is a positive relationship (taller parents have taller children), the heights of children are less extreme than their parents’ heights.
A linear regression line is fitted to the data: the best-fit line, the one with the minimum residual variation. The regression line has a slope less than \(1\), indicating that children’s heights tend to be closer to the mean height than their parents’ heights.
The model is fit using the \(lm()\) function in R, and the summary provides:
• Intercept: 23.94153
• Slope: 0.64629
• Standard Error of the Slope: 0.04114
• Residual Standard Error: 2.239
• R-squared \((R^2)\): 0.2105
The slope indicates that for each 1-inch increase in parents’ height, children’s height increases by approximately 0.65 inches. The standard error measures the precision of the slope estimate, and R-squared \((R^2)\) represents the proportion of variance in children’s heights explained by the parents’ heights.
Thus, the analysis demonstrates the concept: children’s heights regress toward the mean even though they are influenced by their parents’ heights.
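A minimal sketch of this fit in R, assuming the galton data frame described above is available (it ships with the UsingR package):

```r
# Load the Galton parent-child height data (928 pairs, from the UsingR package)
library(UsingR)
data(galton)

# Fit a linear model predicting child height from parent height
fit <- lm(child ~ parent, data = galton)

# Report intercept, slope, standard errors, residual standard error, and R-squared
summary(fit)

# A slope well below 1 is the signature of regression toward the mean
coef(fit)["parent"]
```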
The lesson focuses on understanding residuals, the differences between actual and predicted values in regression analysis, and how they help in determining the best-fit line.
Residuals are the differences between observed values and predicted values from the regression line. They measure the vertical distance from the data points to the line of best fit.
Residuals have two key properties: their mean should be close to zero, and they should be uncorrelated with the predictors. For a well-fitted regression model, a mean of residuals near zero ensures that the residuals are balanced and the model does not systematically overestimate or underestimate the response variable. Residuals should also be uncorrelated with the predictor variables; if they were correlated with the predictors, it would suggest that there is additional information in the predictors that the model is not capturing.
The lesson also covers the least squares method, which minimizes the sum of the squared residuals to find the best-fitting line. This approach ensures that the sum of the squared differences between the observed and predicted values is as small as possible, as the sketch below illustrates.
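As a sketch of what minimizing the sum of squared residuals means in practice, the helper function rss and the use of optim below are illustrative choices, not part of the lesson; the numerical minimum should agree with the lm() estimates.

```r
# Residual sum of squares for a candidate intercept and slope
rss <- function(beta, x, y) sum((y - beta[1] - beta[2] * x)^2)

# Minimize numerically, starting from (0, 0)
ls_fit <- optim(c(0, 0), rss, x = galton$parent, y = galton$child, method = "BFGS")
ls_fit$par

# The closed-form least squares solution from lm() agrees
coef(lm(child ~ parent, data = galton))
```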
To fit the model, we use the R function \(lm()\): for example, fit <- lm(child ~ parent, galton) fits a model predicting children’s heights from parents’ heights. To examine the residuals, we use \(summary(fit)\) to review the residuals and other regression statistics, check that the mean of the residuals is close to zero with mean(fit$residuals), and confirm that the residuals are uncorrelated with the predictor with cov(fit$residuals, galton$parent).
In addition, plugging in the means verifies that the mean of children’s heights \((mch)\) lies on the regression line: it equals the intercept plus the slope times the mean of parents’ heights \((mph)\). A variance analysis then demonstrates that the total variance in the response variable (children’s heights) is the sum of the variance explained by the regression line (the variance of the estimates) and the variance of the residuals; the sketch below verifies both facts.
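These checks can be written out directly; this is a sketch assuming the fit object from lm(child ~ parent, galton) above:

```r
# Mean of the residuals should be essentially zero
mean(fit$residuals)

# Residuals should be uncorrelated with the predictor
cov(fit$residuals, galton$parent)

# The point (mean parent height, mean child height) lies on the fitted line
mph <- mean(galton$parent)
mch <- mean(galton$child)
all.equal(mch, unname(coef(fit)[1] + coef(fit)[2] * mph))

# Variance decomposition: total variance = variance of the estimates + variance of the residuals
var(galton$child)
var(fit$fitted.values) + var(fit$residuals)
```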
Thus, the lesson highlights the significance of residuals in validating regression models. Properly calculated residuals that are uncorrelated with predictors and have a mean close to zero confirm that the regression model is a good fit for the data. The variance decomposition further reinforces the understanding of how well the model explains the variability in the response variable.
This lesson introduces key concepts of linear regression, including:
• Regression Coefficients: We learn to interpret coefficients as the estimated change in the dependent variable for a one-unit change in the independent variable.
• Fitting a Linear Model: We use the \(lm()\) function in R to fit a model, specifying the dependent variable \((Y)\), independent variable \((X)\), and dataset.
• Interpreting Output: We utilize the \(summary()\) function to examine coefficients, standard errors, t-values, p-values, and R-squared \((R^2)\) values to evaluate the significance of predictors.
• Residuals: We understand residuals (the differences between observed and predicted values) and the importance of checking them to validate model assumptions.
• Model Assumptions: We cover key assumptions such as linearity, independence, homoscedasticity, and normality of residuals.
• Plotting & Predictions: We visualize the regression line using \(abline()\) and make predictions with new data using the \(predict()\) function.
• R-squared (\(R^2\)): We understand the R-squared (\(R^2\)) value as a measure of how well the model explains variance in the data.
These concepts are reinforced through interactive exercises in Swirl, focusing on fitting models, interpreting results, and visualizing data.
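As a sketch of the plotting and prediction steps, again using the Galton fit; the new parent heights chosen below are arbitrary values for illustration:

```r
# Scatter plot of the Galton data with the fitted regression line overlaid
plot(galton$parent, galton$child,
     xlab = "Parent height (inches)", ylab = "Child height (inches)")
abline(fit, col = "red", lwd = 2)

# Predict child heights for new parent heights, with confidence intervals
new_parents <- data.frame(parent = c(64, 68, 72))
predict(fit, newdata = new_parents, interval = "confidence")
```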
Residuals are the differences between the observed values and the predicted values from a regression model. They measure how well the model fits the data by showing the variation not explained by the predictor. Residual Variation is the variation in the outcome variable that is not explained by the predictor. This is contrasted with Systematic Variation, which is the variation explained by the regression model.
In a linear regression model, the variance of the random error is estimated from the residuals. Since a linear model with one predictor estimates two parameters (intercept and slope), the degrees of freedom are \(n - 2\). The variance estimate is therefore \(\frac{\text{sum of squared residuals}}{n - 2}\); dividing by \(n - 2\) rather than \(n\) avoids bias.
To find the standard deviation (sigma) of the error term, compute \(\sqrt{\frac{\text{sum of squared residuals}}{n - 2}}\). This is equivalent to the residual standard error provided in the summary of the regression model.
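A sketch of this calculation for the Galton fit, compared against the residual standard error that summary() reports:

```r
n <- nrow(galton)

# Estimate sigma as sqrt(sum of squared residuals / (n - 2))
sigma_hat <- sqrt(sum(fit$residuals^2) / (n - 2))
sigma_hat

# Matches the residual standard error from the model summary
summary(fit)$sigma
```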
Moreover, we have \[\text{Total Variation} = \text{Residual Variation} + \text{Regression Variation}.\]
Total Variation measures the spread of the observed data around its mean; it is computed as the sum of squared differences between the observed values and the mean. Residual Variation is the part of the total variation not explained by the model, computed as the deviance of the regression model. Regression Variation is the part of the total variation explained by the model.
Additionally, \(R^2\) represents the proportion of total variation explained by the regression model. It ranges from \(0\) to \(1\), where \(1\) indicates that the model explains all the variability of the response variable. It is computed as \[1 - \frac{\text{Residual Variation}}{\text{Total Variation}}\] and can also be obtained as the squared correlation coefficient between the observed and predicted values. Lastly, to confirm accuracy, compare the manually computed \(R^2\) with the value provided in the model summary, and verify that the squared correlation between the predictor and the response variable equals \(R^2\).
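These equivalences can be checked directly; a sketch using the Galton fit from above:

```r
# Total variation: squared deviations of the observed values from their mean
total_var <- sum((galton$child - mean(galton$child))^2)

# Residual variation: the deviance (sum of squared residuals) of the fitted model
resid_var <- deviance(fit)

# R-squared as 1 minus the ratio of residual to total variation
1 - resid_var / total_var

# Equivalently, the squared correlation between predictor and response
cor(galton$parent, galton$child)^2

# Both agree with the value reported by summary()
summary(fit)$r.squared
```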
This lesson demonstrates how multivariable regression can be broken down into a series of single-variable regressions. By focusing on one regressor at a time, we can eliminate others, reducing a regression problem with multiple variables to one with fewer variables. The process begins with understanding simple regression, then progresses to more complex scenarios by using the Galton dataset and the trees dataset.
Key steps include:
• Substitution of Regressors: We substitute a constant regressor to demonstrate the elimination of the intercept by subtracting means.
• Gaussian Elimination: This method systematically reduces the number of regressors by replacing variables with the residuals of their regression against a chosen variable.
• Application to Multiple Variables: Through a sequence of steps, we eliminate variables one by one, eventually reducing a multivariable regression to a single-variable regression (see the sketch after this list). This method is computationally equivalent to what modern algorithms do but is shown here in a simplified, step-by-step manner for conceptual clarity.
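A sketch of the elimination idea in R, using the trees dataset; the specific model below (Volume on Girth and Height) is an illustrative choice, not necessarily the exact one used in the lesson:

```r
# The trees dataset ships with base R: Girth, Height, and Volume for 31 cherry trees
data(trees)

# Full multivariable regression of Volume on Girth and Height
full <- lm(Volume ~ Girth + Height, data = trees)
coef(full)["Girth"]

# Eliminate Height by replacing Volume and Girth with their residuals after regressing on Height
eV <- resid(lm(Volume ~ Height, data = trees))
eG <- resid(lm(Girth ~ Height, data = trees))

# The single-variable regression of residuals on residuals recovers the same Girth coefficient
coef(lm(eV ~ eG))["eG"]
```

Repeating this step removes one regressor at a time until only a single-variable regression remains.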
The lesson concludes by highlighting that understanding single-variable regression provides a foundation for tackling more complex regression problems.