Summary of Lessons 1 to 5 of the Regression Models Course in the swirl R Package

This paper summarizes key points and R code from Lessons 1 to 5 of the swirl course on Regression Models, covering the introduction, residuals, least squares estimation, residual variation, and an introduction to multivariable regression.

Lesson 1: Introduction

The lesson begins by installing a new course, Regression Models, in swirl. The commands library(swirl) and swirl() are used to load the swirl package and start a swirl session, respectively. The command install_course("Regression Models") is then introduced, which installs the Regression Models course into swirl, allowing users to learn about regression modeling interactively.

The lesson proceeds with an exploration of data plotting. For instance, the command plot(child ~ parent, galton) generates a scatter plot of children’s heights versus their parents’ heights using the galton dataset, with the ~ symbol indicating the formula where “child” is the dependent variable and “parent” is the independent variable. To improve visibility of overlapping data points, plot(jitter(child,4) ~ parent, galton) is used, adding jitter to the children’s heights.

Next, the lesson covers the creation of a regression line. The command regrline <- lm(child ~ parent, galton) fits a linear model where the child’s height is predicted by the parent’s height, with the result stored in the variable regrline. The regression line is then added to the plot using abline(regrline, lwd=3, col='red'), specifying a line width of 3 and red color.
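
The plotting and fitting steps described above can be put together in a short sketch. The galton data ships with the swirl course, so the data frame below is a simulated stand-in (an assumption made only so that the code runs without swirl); its numbers will not match the real dataset.

```r
# Stand-in for the course's galton data (an assumption, so the sketch runs
# without swirl): simulated parent and child heights in inches.
set.seed(1)
parent <- rnorm(500, mean = 68, sd = 1.8)
child  <- 24 + 0.65 * parent + rnorm(500, sd = 2.2)
galton <- data.frame(child = child, parent = parent)

# Scatter plot; jitter() spreads out child heights that would overlap
plot(jitter(child, 4) ~ parent, galton)

# Fit child height on parent height and store the fitted model
regrline <- lm(child ~ parent, galton)

# Overlay the fitted line on the existing plot
abline(regrline, lwd = 3, col = "red")
```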

The lesson also explains how to interpret regression output using the command summary(regrline), which reports statistics such as estimated coefficients, standard errors, t-values, and p-values. For example, the intercept is the predicted child height when the parent’s height is zero (an extrapolation far outside the observed heights, so it has little practical meaning on its own), and the slope (the parent estimate) indicates that for every 1-inch increase in parent height, the predicted child’s height increases by approximately 0.65 inches. The lesson concludes by introducing the concept of “regression toward the mean,” where children of very tall or very short parents tend to be closer to average height, as evidenced by the slope being less than 1.
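
As a rough illustration of reading the coefficient table and of regression toward the mean, the sketch below uses the same kind of simulated stand-in for galton (an assumption, not the course data). The names coefs, new_parents, and pred are introduced here for illustration only.

```r
# Stand-in for the course's galton data (an assumption): simulated heights
set.seed(1)
parent <- rnorm(500, mean = 68, sd = 1.8)
child  <- 24 + 0.65 * parent + rnorm(500, sd = 2.2)
galton <- data.frame(child = child, parent = parent)

regrline <- lm(child ~ parent, galton)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(regrline)$coefficients
print(coefs)
slope <- coefs["parent", "Estimate"]

# Parents 3 inches below and 3 inches above the average parent height
new_parents <- data.frame(parent = mean(galton$parent) + c(-3, 3))
pred <- predict(regrline, new_parents)

# The parents differ by 6 inches, but the predicted children differ by
# only 6 * slope inches, which is less than 6 whenever the slope is below 1
print(diff(pred))
```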

Lesson 2: Residuals

In this lesson, residuals are defined as the vertical distances between the actual children’s heights and the estimates given by the regression line. More generally, a residual is the difference between a data point and the regression line’s estimate for it. Residuals have mean zero, meaning they are balanced among the data points. They are also uncorrelated with the predictor, which in this lesson is the parents’ height.

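To form the regression line, we used the R function lm. We then checked that the mean of the residuals is close to 0, which it was, and that the correlation between the residuals and the predictor is also close to 0. When the slope and intercept of the regression line are varied, the resulting sum of squared residuals equals, up to rounding error, the sum of two sums of squares: the sum of squared residuals of the least squares line plus the sum of squares of the changes in the estimates. Any other line therefore has a larger sum of squared residuals than the least squares line. Similarly, the variance of the data equals the variance of the estimates plus the variance of the residuals. Since variances are sums of squares and cannot be negative, the variance of the estimates can never exceed the variance of the data.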
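
These residual properties can be checked numerically. The sketch below is one way to do so under stated assumptions: the data are a simulated stand-in for galton, and the helper names est and sqe and the tweak values sl and ic are illustrative choices rather than anything prescribed by the summary.

```r
# Stand-in for the course's galton data (an assumption): simulated heights
set.seed(1)
parent <- rnorm(500, mean = 68, sd = 1.8)
child  <- 24 + 0.65 * parent + rnorm(500, sd = 2.2)
galton <- data.frame(child = child, parent = parent)

fit <- lm(child ~ parent, galton)
res <- resid(fit)

print(mean(res))                # zero up to rounding error
print(cor(res, galton$parent))  # zero up to rounding error

# Estimates and sum of squared errors for an arbitrary line
est <- function(slope, intercept) intercept + slope * galton$parent
sqe <- function(slope, intercept) sum((est(slope, intercept) - galton$child)^2)

ols_ic    <- coef(fit)[["(Intercept)"]]
ols_slope <- coef(fit)[["parent"]]

# Tweak the least squares line by arbitrary amounts (illustrative values)
sl <- 0.1
ic <- -2
lhs <- sqe(ols_slope + sl, ols_ic + ic)    # squared residuals of tweaked line
rhs <- deviance(fit) + sum(est(sl, ic)^2)  # OLS residuals + squared changes
print(all.equal(lhs, rhs))

# Variance of the data = variance of estimates + variance of residuals
print(all.equal(var(galton$child), var(fitted(fit)) + var(res)))
```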

Lesson 3: Least Squares Estimation

This lesson focuses on least squares estimation, emphasizing that the regression line minimizes the sum of squared errors, where each error is the vertical distance between an actual data point and the value predicted by the line. With this technique, known as ordinary least squares (OLS), the fitted line passes through the point of means, that is, the mean parent height paired with the mean child height.

The mathematical foundation of the slope is discussed: the slope equals the correlation between parents’ and children’s heights multiplied by the ratio of the standard deviation of the children’s heights to that of the parents’ heights. Learners are then introduced to the function manipulate(), which allows for interactive adjustment of the slope (beta) of the regression line to observe its impact on the mean squared error (MSE). The optimal slope that minimizes MSE is approximately 0.64, and the minimum MSE achieved is 5.0.
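
The slope formula and the search for the MSE-minimizing slope can be sketched without the interactive manipulate() slider. The code below is a non-interactive substitute that uses optimize() from base R on centered data; the data are a simulated stand-in for galton (an assumption), so the values printed will differ from the 0.64 and 5.0 reported in the lesson.

```r
# Stand-in for the course's galton data (an assumption): simulated heights
set.seed(1)
parent <- rnorm(500, mean = 68, sd = 1.8)
child  <- 24 + 0.65 * parent + rnorm(500, sd = 2.2)
galton <- data.frame(child = child, parent = parent)

fit <- lm(child ~ parent, galton)
slope_lm <- coef(fit)[["parent"]]

# Slope = correlation * sd(outcome) / sd(predictor)
slope_formula <- cor(galton$parent, galton$child) *
  sd(galton$child) / sd(galton$parent)
print(c(lm = slope_lm, formula = slope_formula))

# The fitted line passes through the point of means
at_mean <- predict(fit, data.frame(parent = mean(galton$parent)))
print(c(prediction = unname(at_mean), mean_child = mean(galton$child)))

# Center both variables, then find the slope beta that minimizes the MSE
x <- galton$parent - mean(galton$parent)
y <- galton$child - mean(galton$child)
mse <- function(beta) mean((y - beta * x)^2)
best <- optimize(mse, interval = c(0, 2))
print(best$minimum)    # slope with the smallest MSE
print(best$objective)  # the MSE at that slope
```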

The concept of normalization is introduced, where data is normalized by subtracting the mean and dividing by the standard deviation. The normalized heights of parents and children are represented by the vectors gpa_nor and gch_nor, and the command cor(gpa_nor, gch_nor) computes the correlation between these normalized datasets, yielding a value of 0.4587624, which matches the correlation of the unnormalized data.

Finally, the lesson explains how to generate a regression line using normalized data with the command l_nor <- lm(gch_nor ~ gpa_nor), where the slope equals the correlation of the two datasets. The lesson concludes with a discussion on the impact of swapping the outcome and predictor variables, showing how this affects the slope of the regression line. Three regression lines are displayed: the original line (red) with children as the outcome, a new line (blue) with parents as the outcome and children as the predictor, and a black line where the slope equals the ratio of the standard deviations.
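
The normalization steps and the effect of swapping outcome and predictor can be sketched as follows. The vectors gpa_nor and gch_nor and the model l_nor follow the names used above; the data are a simulated stand-in for galton (an assumption), so the correlation printed will not be 0.4587624. The names fit_child and fit_parent are illustrative.

```r
# Stand-in for the course's galton data (an assumption): simulated heights
set.seed(1)
parent <- rnorm(500, mean = 68, sd = 1.8)
child  <- 24 + 0.65 * parent + rnorm(500, sd = 2.2)
galton <- data.frame(child = child, parent = parent)

# Normalize: subtract the mean, then divide by the standard deviation
gpa_nor <- (galton$parent - mean(galton$parent)) / sd(galton$parent)
gch_nor <- (galton$child - mean(galton$child)) / sd(galton$child)

# Normalizing does not change the correlation
r <- cor(galton$parent, galton$child)
print(c(raw = r, normalized = cor(gpa_nor, gch_nor)))

# With normalized data, the regression slope equals the correlation
l_nor <- lm(gch_nor ~ gpa_nor)
print(coef(l_nor)[["gpa_nor"]])

# Swapping outcome and predictor gives a different slope; the two slopes
# multiply to the squared correlation
fit_child  <- lm(child ~ parent, galton)
fit_parent <- lm(parent ~ child, galton)
print(coef(fit_child)[["parent"]] * coef(fit_parent)[["child"]])
print(r^2)
```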

Lesson 4: Residual Variation

Residuals can be thought of as the outcome with the linear association of the predictor removed. Residual variation is the variation that remains after the linear effect of the predictor is removed. Given a model, the maximum likelihood estimate of the variance of the random error is the average squared residual, that is, the sum of the squared residuals divided by n. However, a linear model with one predictor estimates two parameters (intercept and slope), leaving only n-2 degrees of freedom. To estimate the variance, we therefore use the formula 1/(n-2) * (the sum of the squared residuals). Dividing the sum of the squared residuals by n instead gives a biased estimate.

We found that the square root of the sum of the squared residuals divided by (n-2) is the estimate of the standard deviation of the error, which R reports as the residual standard error. The term R^2 was also introduced; it represents the percentage of total variation explained by the regression model. In a regression with a single predictor, it also equals the squared sample correlation between predictor and outcome.
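
A short sketch of these calculations, again on a simulated stand-in for galton (an assumption), compares the hand-computed quantities with what summary() reports. The names sigma_hat, s_res, s_tot, and r2 are illustrative.

```r
# Stand-in for the course's galton data (an assumption): simulated heights
set.seed(1)
parent <- rnorm(500, mean = 68, sd = 1.8)
child  <- 24 + 0.65 * parent + rnorm(500, sd = 2.2)
galton <- data.frame(child = child, parent = parent)

fit <- lm(child ~ parent, galton)
n <- nrow(galton)

# Sum of squared residuals; deviance(fit) returns the same number
s_res <- sum(resid(fit)^2)

# Error standard deviation estimate using n - 2 degrees of freedom
sigma_hat <- sqrt(s_res / (n - 2))
print(c(by_hand = sigma_hat, from_summary = summary(fit)$sigma))

# Dividing by n instead gives the smaller, biased variance estimate
print(c(divide_by_n = s_res / n, divide_by_n_minus_2 = s_res / (n - 2)))

# R^2 = 1 - residual variation / total variation
s_tot <- sum((galton$child - mean(galton$child))^2)
r2 <- 1 - s_res / s_tot
print(c(by_hand = r2,
        from_summary = summary(fit)$r.squared,
        cor_squared = cor(galton$parent, galton$child)^2))
```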

Lesson 5: Introduction to Multivariable Regression

In Lesson 5, we learned how to handle regression with multiple variables by breaking it down into simpler, single-variable regressions. We started with a basic regression using the lm function in R to look at the relationship between child and parent height in the galton dataset. We saw that the default intercept in this model is really the coefficient of a special regressor that takes the constant value 1 at every sample. To make this explicit, we replaced the default intercept with a custom regressor of ones and used a method similar to Gaussian elimination. Regressing a variable on the column of ones and taking the residual amounts to subtracting that variable’s mean, which removes the intercept; the same idea of keeping only the residuals from regressions against a chosen variable is then used to eliminate other regressors.

We applied this method to the trees dataset, which records timber volume together with height and girth measurements, with volume as the outcome to predict. We added a column of 1’s as a constant regressor and used the regressOneOnOne function to eliminate predictors and see how the regression results changed. This exercise showed us how to reduce a regression with three regressors to one with two, and then to a single-regressor regression. Even though practical algorithms like lm are more efficient, this step-by-step coding in R helped us understand how to simplify multivariable regression into simpler single-variable steps.
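
The elimination procedure can be sketched on R’s built-in trees data (columns Girth, Height, and Volume). The regressOneOnOne function below is a stand-in written for this summary from the description above, not the course’s own source, and the eliminate helper and the names trees2, nogirth, and noheight are likewise assumptions made for illustration.

```r
# Add an explicit column of ones as the constant regressor
trees2 <- cbind(Constant = 1, trees)

# Reference fit: Volume on Constant, Girth, and Height, with the default
# intercept switched off because Constant plays that role
fit <- lm(Volume ~ . - 1, trees2)

# Stand-in helper (an assumption): residual of `other` after regressing it
# on `predictor` alone, with no intercept
regressOneOnOne <- function(predictor, other, dataframe) {
  f <- as.formula(paste0(other, " ~ ", predictor, " - 1"))
  resid(lm(f, dataframe))
}

# Stand-in helper (an assumption): remove `predictor` from every other
# column, returning a data frame of residuals
eliminate <- function(predictor, dataframe) {
  others <- setdiff(names(dataframe), predictor)
  as.data.frame(sapply(others, function(other) {
    regressOneOnOne(predictor, other, dataframe)
  }))
}

# Three regressors -> two -> one
nogirth  <- eliminate("Girth", trees2)
noheight <- eliminate("Height", nogirth)

# The one remaining single-variable regression recovers the coefficient
# of Constant from the full three-regressor fit
reduced <- lm(Volume ~ Constant - 1, noheight)
print(c(reduced = coef(reduced)[["Constant"]],
        full = coef(fit)[["Constant"]]))
```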

Jhamyle B. Pangcoga, Carmelle Ezrah N. Sambaan, & Karla Giselle R. Santos

2024-08-27