SUMMARY

The Regression Models course in the swirl library (run in RStudio) provides an interactive way of learning regression models through illustrations and on-the-spot questions, along with code we are free to explore. After finishing Lessons 1 to 5, these are the key takeaways we gained:

LESSON 1: INTRODUCTION

Lesson 1 introduced us to the basics of regression models through the concept of ‘regression towards the mean’, using the data used by Francis Galton in 1885, the statistician who introduced the term and concepts of regression and correlation. His data consist of 928 parent/child height pairs, with mothers’ and fathers’ heights averaged together and children’s heights kept as is. The children’s heights are the dependent variable and the parents’ heights are the independent variable.

In our exploration, we learned how to plot the data and why the R function jitter matters: many height pairs share the same values, so jittering helps us see the individual points on the plot more clearly. We also used the R function lm (linear model) to generate the regression line of the data. To understand the fit, we used the R function summary, which reports the coefficients (the intercept and slope of the line drawn on the plot), their standard errors, a summary of the residuals, and so on. In simpler terms, the plot suggests that children’s heights depend on their parents’ heights. Moreover, parents who were taller than average had children who were also tall but closer to the average height, and parents who were shorter than average had children who were also shorter than average, but less so than their parents. From one generation to the next, the heights moved closer to the average, or regressed toward the mean.
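The workflow described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data frame `heights` is a synthetic stand-in that we simulate here, not Galton’s actual data, so the numbers it produces will differ from those in the lesson.

```r
# Synthetic stand-in for the Galton data (an assumption, not the real data set)
set.seed(1)
n <- 928
parent <- round(rnorm(n, mean = 68.3, sd = 1.8))   # rounded, so many points overlap
child  <- round(68.1 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2))
heights <- data.frame(parent = parent, child = child)

# jitter() adds a little random noise so overlapping points become visible
plot(jitter(child, 4) ~ parent, data = heights,
     xlab = "parent height", ylab = "child height")

# Fit child height on parent height and draw the regression line
fit <- lm(child ~ parent, data = heights)
abline(fit, lwd = 3, col = "red")

# Coefficients, standard errors and a summary of the residuals
summary(fit)
```

A slope between 0 and 1 is what regression toward the mean looks like in the fitted line: an extra inch of parent height predicts less than an extra inch of child height.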

LESSON 2: RESIDUALS

In this lesson, we learned the role of residuals in regression models. By definition, residuals are the differences between the actual data points and the estimates made by the regression line. The Galton data from the previous lesson was used for the plot, so the data points here were the actual children’s heights. We learned that the regression line is fitted by the least squares method, that is, by choosing the line that minimizes the sum of squared residuals. For such a line, the residuals have a mean of zero, meaning that the prediction errors are balanced across all data points, and they are uncorrelated with the predictor variable, in this case the parents’ heights.

In our exploration, we again used the R function lm to generate the regression line. Using the R functions summary and cov, we verified that the mean of the residuals was close to zero and that their covariance with the predictor variable was nearly zero, as least squares requires. We also saw how tweaking the slope and intercept of the regression line affects the sum of squared errors. By comparing the squared residuals of the original line with those of the tweaked line, we saw that the total grew with every change, confirming that the regression line is the one that minimizes it. Lastly, we learned how the total variance in the data can be divided into the variance explained by the model and the variance of the residuals. The equation var(data) = var(estimate) + var(residuals) was illustrated as well; since variances cannot be negative, it shows that the variance explained by the model can never exceed the total variance in the data. Together, the concepts we learned and the code we used helped us understand the importance of examining residuals to ensure that regression models are specified correctly and fit the data well.
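The residual checks from this lesson can be reproduced with a short sketch. The data are again simulated as a stand-in for the Galton heights (an assumption), and the helper `sq_err` is a name we made up for this illustration.

```r
# Synthetic stand-in for the Galton data (an assumption, not the real data set)
set.seed(1)
n <- 928
parent <- rnorm(n, mean = 68.3, sd = 1.8)
child  <- 68.1 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2)

fit <- lm(child ~ parent)
res <- resid(fit)    # actual child heights minus the line's estimates
est <- fitted(fit)   # the estimates themselves

mean(res)            # zero up to floating-point error
cov(res, parent)     # zero up to floating-point error

# Sum of squared residuals for an arbitrary intercept/slope pair
# (hypothetical helper for this sketch)
sq_err <- function(intercept, slope) {
  sum((child - (intercept + slope * parent))^2)
}
ols <- coef(fit)
sse_ols     <- sq_err(ols[[1]], ols[[2]])
sse_tweaked <- sq_err(ols[[1]] + 0.5, ols[[2]] - 0.05)
sse_tweaked > sse_ols   # moving away from the fitted line increases the total

# var(data) = var(estimate) + var(residuals)
all.equal(var(child), var(est) + var(res))
```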

LESSON 3: LEAST SQUARES ESTIMATION

As in Lesson 2, the Galton data was used here. The slope of the regression line was explained further: it is the correlation between the two sets of heights, scaled by the ratio of their standard deviations (outcome over predictor). This relationship highlights how the predictor (parents’ heights) and the outcome (children’s heights) are connected.
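As a quick check of this relationship, the sketch below compares the slope reported by lm with the correlation times the ratio of standard deviations. The heights are simulated as a stand-in for the Galton data (an assumption), so only the equality matters here, not the specific values.

```r
# Synthetic stand-in for the Galton data (an assumption, not the real data set)
set.seed(1)
n <- 928
parent <- rnorm(n, mean = 68.3, sd = 1.8)
child  <- 68.1 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2)

# Slope as estimated by least squares
slope_lm <- coef(lm(child ~ parent))[["parent"]]

# Slope from the formula: correlation scaled by sd(outcome) / sd(predictor)
slope_formula <- cor(parent, child) * sd(child) / sd(parent)

all.equal(slope_lm, slope_formula)
```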

The goal of this lesson was to demonstrate how changing the slope of the regression line affects the mean squared error (MSE) between actual and predicted values. The package ‘manipulate’ was well suited to this, since it allowed us to play with the slope of the regression line and watch the MSE respond. We discovered that a slope of approximately 0.64 minimized the MSE at about 5.0, demonstrating the importance of finding the optimal slope for accurate predictions. Further, we explored the concept of data normalization, which involves subtracting the mean and dividing by the standard deviation. After normalizing the Galton parent and child data, we used the cor function to compute the correlation between the normalized datasets and found that it is the same as for the unnormalized data. We also generated a regression line from the normalized data with the lm function and observed that its slope equals the correlation of the two datasets. Finally, the lesson discussed the effect of swapping the outcome and predictor variables in the regression model. If we were to use children’s heights to predict their parents’ heights, the slope of the new regression line would be correlation(X, Y) * sd(X)/sd(Y), where X denotes the parents’ heights and Y the children’s heights.
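The three observations in this lesson can be sketched without the interactive slider. Here a simple grid search over candidate slopes stands in for ‘manipulate’, and the heights are simulated rather than taken from the Galton data (both are assumptions), so the minimizing slope and MSE will not be exactly 0.64 and 5.0.

```r
# Synthetic stand-in for the Galton data (an assumption, not the real data set)
set.seed(1)
n <- 928
parent <- rnorm(n, mean = 68.3, sd = 1.8)
child  <- 68.1 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2)

# 1. MSE as a function of the slope, on centered data (line through the origin)
x <- parent - mean(parent)
y <- child - mean(child)
mse <- function(beta) mean((y - beta * x)^2)
betas <- seq(0, 1.2, by = 0.01)
best_beta <- betas[which.min(sapply(betas, mse))]
slope_lm  <- coef(lm(child ~ parent))[["parent"]]
c(grid = best_beta, lm = slope_lm)   # the grid minimum sits next to the lm slope

# 2. Normalizing leaves the correlation unchanged,
#    and the slope of the normalized regression equals that correlation
parent_n <- (parent - mean(parent)) / sd(parent)
child_n  <- (child - mean(child)) / sd(child)
all.equal(cor(parent_n, child_n), cor(parent, child))
all.equal(coef(lm(child_n ~ parent_n))[["parent_n"]], cor(parent, child))

# 3. Swapping outcome and predictor flips the ratio of standard deviations
slope_swapped <- coef(lm(parent ~ child))[["child"]]
all.equal(slope_swapped, cor(parent, child) * sd(parent) / sd(child))
```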

LESSON 4: RESIDUAL VARIATION

In this lesson, we explored residuals further and their use in evaluating how well data points fit a statistical model. Residuals, which are the differences between the observed outcomes and the predictions made by the model, are essential in distinguishing residual variation (the variation remaining after removing the predictor) from systematic variation (the variation explained by the regression model).

One key takeaway here was the method for estimating the variance of the random error in the model using residuals. Since a model with one predictor estimates two parameters (the intercept and the slope), the degrees of freedom are reduced by two, which changes how the average squared residual is calculated. By dividing the sum of squared residuals by n-2 rather than n, we obtained an unbiased estimate of the variance. Using the Galton height data, we applied these concepts by generating a regression line and calculating the standard deviation (sigma) of the error from the residuals. We also learned that total variation equals residual variation plus regression variation. Moreover, we explored the concept of R^2, which represents the percentage of variation in the data explained by the regression model. By subtracting the ratio of residual variation to total variation from 1, we determined R^2 for the Galton data, which turned out to be approximately 0.2105. This value matched the R^2 reported in the summary of the regression model.
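The two calculations above, sigma with n-2 degrees of freedom and R^2 from the variation totals, can be sketched as follows. The heights are simulated as a stand-in for the Galton data (an assumption), so R^2 here will not equal 0.2105; the point is that the hand calculations agree with what summary reports.

```r
# Synthetic stand-in for the Galton data (an assumption, not the real data set)
set.seed(1)
n <- 928
parent <- rnorm(n, mean = 68.3, sd = 1.8)
child  <- 68.1 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2)
fit <- lm(child ~ parent)

# Residual standard deviation: divide by n - 2 because two parameters
# (intercept and slope) were estimated
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - 2))
all.equal(sigma_hat, summary(fit)$sigma)

# Total variation and residual variation
s_tot <- sum((child - mean(child))^2)
s_res <- deviance(fit)            # for lm, the residual sum of squares

# R^2 is the share of total variation explained by the regression
r2 <- 1 - s_res / s_tot
all.equal(r2, summary(fit)$r.squared)
```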

LESSON 5: INTRODUCTION TO MULTIVARIABLE REGRESSION

In this lesson, we delved into the concept of multivariable regression and learned how it can be broken down into a series of single-variable regressions. One takeaway is that our understanding of regression in one variable can be extended to handle regressions in multiple variables.

The lesson first recalled simple linear regression with the Galton height data, which was helpful for us. There we saw that the intercept can be treated as the coefficient of a special regressor that has the same value, 1, at every sample. When the built-in intercept was replaced with this “all-ones” regressor, the intercept and slope coefficients remained the same. Next, we used the file ‘elimination.R’, which implements the technique of picking one predictor and replacing all other variables by the residuals of their regressions against that one. We used the function regressOneOnOne as the first step of this process, which can reduce a regression with N variables to a regression with N-1 variables. The process was applied to the trees dataset, where we demonstrated how the coefficients of the remaining variables are unaffected by the elimination, and how the constant regressor no longer stays constant once it is replaced by its residual. This approach gave us a clear understanding of how multivariable regression works.
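The elimination idea can be illustrated on the built-in trees dataset. This is our own minimal sketch, not the contents of ‘elimination.R’: the helper `resid_on` is a hypothetical name, and we assume, as the lesson describes, that each variable is regressed on the chosen predictor without an intercept and replaced by its residual.

```r
data(trees)   # built-in data set with columns Girth, Height, Volume

# The intercept as the coefficient of an explicit all-ones regressor
trees1 <- data.frame(Constant = 1, trees)
fit_default <- lm(Volume ~ Girth + Height, data = trees)
fit_ones    <- lm(Volume ~ Constant + Girth + Height - 1, data = trees1)
all.equal(unname(coef(fit_default)), unname(coef(fit_ones)))

# Residual of `other` after regressing it on `predictor`, no intercept
# (hypothetical helper written for this sketch)
resid_on <- function(predictor, other, df) {
  resid(lm(df[[other]] ~ df[[predictor]] - 1))
}

# Eliminate Girth: replace every other variable by its residual against Girth
others <- setdiff(names(trees1), "Girth")
trees2 <- as.data.frame(sapply(others, function(v) resid_on("Girth", v, trees1)))

# One regressor fewer, same coefficients for the variables that remain
fit_reduced <- lm(Volume ~ Constant + Height - 1, data = trees2)
all.equal(coef(fit_reduced)[["Height"]],   coef(fit_ones)[["Height"]])
all.equal(coef(fit_reduced)[["Constant"]], coef(fit_ones)[["Constant"]])

# The Constant column is no longer constant after the elimination
sd(trees2$Constant) > 0
```

Repeating the same step on `trees2` would remove another regressor, which is how a regression with N variables reduces to a chain of single-variable regressions.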