The Regression Models course in the SWIRL library (run in RStudio) offers an interactive way of learning regression models: it combines illustrations and in-lesson questions with code that we are free to explore. After finishing Lessons 1 to 5, these are the key takeaways we gained:
LESSON 1: INTRODUCTION
Lesson 1 introduced us to the basics of regression models through the concept of ‘regression towards the mean’, using the data used in 1885 by Francis Galton, the statistician who invented the term and concepts of regression and correlation. His data set contains 928 parent/child height pairs, with the mothers’ and fathers’ heights averaged together and the children’s heights kept as is. The children’s heights are the dependent variable and the parents’ heights are the independent variable.
In our exploration, we learned how to plot the data and why the R function jitter matters: many height pairs coincide, and jittering lets us see the individual points on the plot more clearly. We also used the R function lm (linear model) to generate the regression line for the data. To understand the fit, we used the R function summary, which reports the coefficients (intercept and slope), their standard errors, and a summary of the residuals for the line drawn on the data plot. In simpler terms, the plot suggests that children’s heights depend on their parents’ heights. Moreover, parents who were taller than average had children who were also tall but closer to the average height, and parents who were shorter than average had children who were also shorter than average but less so than their parents. From one generation to the next, the heights moved closer to the average, or regressed toward the mean.
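The workflow above can be sketched in a few lines of R. This is a minimal illustration, not the lesson's own code: the heights are simulated stand-ins for the Galton data (the real data set ships with add-on packages and is not assumed to be installed), and the means, spreads, and slope used to generate them are made-up values.

```r
# Simulated stand-in for the Galton data (NOT the real measurements).
# All numeric settings below are illustrative assumptions.
set.seed(1885)
n <- 928
parent <- round(rnorm(n, mean = 68.3, sd = 1.8))   # whole inches, so many points overlap
child  <- round(68 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2))
heights <- data.frame(parent, child)

# Child height is the outcome, parent height the predictor.
fit <- lm(child ~ parent, data = heights)

# Coefficient table: estimates, standard errors, t values, p values.
print(summary(fit)$coefficients)

# Draw only in an interactive session. jitter() adds a little random noise
# so that points sharing the same rounded heights become visible.
if (interactive()) {
  plot(jitter(child, 4) ~ parent, data = heights)
  abline(fit, lwd = 3, col = "red")
}
```

A fitted slope between 0 and 1 is what regression toward the mean looks like in the coefficient table: a one-inch difference in parents' height predicts less than a one-inch difference in children's height.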
In Lesson 2, we learned the role of residuals in regression models. By definition, residuals are the differences between the actual data points and the estimates made by the regression line. The Galton data from the previous lesson was used for the plot, so the data points here were the actual children’s heights. We learned that the regression line is fitted by the least squares method, that is, by minimizing the sum of squared residuals. For a line fitted this way, the residuals have a mean of zero, meaning the prediction errors are balanced across all data points, and the residuals are uncorrelated with the predictor variable, in this case the parents’ heights.
In our exploration, we again used the R function lm to generate the regression line. Using the R functions summary and cov, we verified that the mean of the residuals was close to zero and that their covariance with the predictor variable was nearly zero, as the least squares properties require. We also saw how tweaking the slope and intercept of the regression line affects the sum of squared errors. By comparing the squared residuals of the original line with those of the tweaked line, we saw that the changes increased the total, confirming that the fitted line is the one that minimizes the sum of squared residuals. Lastly, we learned how the total variance in the data can be divided into the variance explained by the model and the variance of the residuals. The equation var(data) = var(estimate) + var(residuals) was illustrated as well, showing that the variance explained by the model can never exceed the total variance in the data. Together, these concepts and the code we used helped us understand why examining residuals matters for checking that a regression model is specified correctly and fits the data well.
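The residual checks described here can be reproduced with a short sketch. It again relies on simulated heights rather than the real Galton data, and the helper `sq_err` and the size of the tweaks are hypothetical choices made only for illustration.

```r
# Simulated stand-in for the Galton data (illustrative values only).
set.seed(1885)
n <- 928
parent <- rnorm(n, mean = 68.3, sd = 1.8)
child  <- 68 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2)

fit <- lm(child ~ parent)
res <- resid(fit)    # actual minus estimated child heights
est <- fitted(fit)   # estimates made by the regression line

# Least squares residuals: mean zero, zero covariance with the predictor
# (both up to floating point error).
print(mean(res))
print(cov(res, parent))

# Hypothetical helper: sum of squared residuals for any intercept and slope.
sq_err <- function(intercept, slope) {
  sum((child - (intercept + slope * parent))^2)
}
b <- coef(fit)
sse_fit     <- sq_err(b[1], b[2])
sse_tweaked <- sq_err(b[1] + 0.1, b[2] - 0.05)   # arbitrary small tweaks
print(c(fitted = sse_fit, tweaked = sse_tweaked))

# Variance decomposition: var(data) = var(estimate) + var(residuals).
print(all.equal(var(child), var(est) + var(res)))
```

Any other tweak to the intercept or slope would serve equally well; the fitted line is the unique minimizer, so every different line gives a larger sum of squared residuals.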
Lesson 3 also used the Galton data. Here, the slope of the regression line was explained further: it equals the correlation between the two sets of heights, scaled by the ratio of their standard deviations (the standard deviation of the outcome divided by that of the predictor). This relationship shows how the predictor (parents’ heights) and the outcome (children’s heights) are connected.
The goal of this lesson was to demonstrate how changing the slope of the regression line affects the mean squared error (MSE) between actual and predicted values. The ‘manipulate’ package was well suited to this: it let us adjust the slope of the regression line interactively and watch the MSE respond. We discovered that a slope of approximately 0.64 minimized the MSE, at about 5.0, which shows why finding the optimal slope matters for accurate predictions. Further, we explored data normalization, which involves subtracting the mean and dividing by the standard deviation. After normalizing the Galton parent and child data, we used the cor function to compute the correlation between the normalized datasets and found that it is the same as for the unnormalized data. We also generated a regression line for the normalized data with the lm function and observed that the slope of this line equals the correlation of the two datasets. Finally, the lesson discussed the effect of swapping the outcome and predictor variables in the regression model. With X the parents’ heights and Y the children’s heights, using children’s heights to predict their parents’ heights gives a new regression line whose slope is correlation(X, Y) * sd(X)/sd(Y).
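The slope relationships in this lesson can be checked numerically. The sketch below is an assumption-laden stand-in: the heights are simulated rather than Galton's, and base R's `optimize` replaces the interactive slider of the ‘manipulate’ package, so the numbers it prints will not match the 0.64 and 5.0 quoted from the lesson.

```r
# Simulated stand-in for the Galton data (illustrative values only).
set.seed(1885)
n <- 928
parent <- rnorm(n, mean = 68.3, sd = 1.8)   # X, the original predictor
child  <- 68 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2)   # Y, the outcome

r <- cor(parent, child)
slope <- unname(coef(lm(child ~ parent))[2])

# Slope = correlation scaled by sd(outcome) / sd(predictor).
print(all.equal(slope, r * sd(child) / sd(parent)))

# MSE of a line through the origin on centered data, as a function of slope.
pc <- parent - mean(parent)
cc <- child - mean(child)
mse <- function(beta) mean((cc - beta * pc)^2)
best <- optimize(mse, interval = c(0, 2))$minimum   # numeric search over slopes
print(c(lm_slope = slope, mse_minimizer = best))

# Normalizing leaves the correlation unchanged, and the slope becomes r.
pn <- pc / sd(parent)
cn <- cc / sd(child)
slope_norm <- unname(coef(lm(cn ~ pn))[2])
print(all.equal(cor(pn, cn), r))
print(all.equal(slope_norm, r))

# Swapping outcome and predictor: slope = cor(X, Y) * sd(X) / sd(Y).
slope_swapped <- unname(coef(lm(parent ~ child))[2])
print(all.equal(slope_swapped, r * sd(parent) / sd(child)))
```

The numeric search lands on the same slope that lm reports, which is the point the interactive exercise makes by hand.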
In Lesson 4, we explored residuals further and how they are used to evaluate how well data points fit a statistical model. Residuals, the differences between the observed outcomes and the predictions made by the model, are essential for distinguishing residual variation (the variation left over after accounting for the predictor) from systematic variation (the variation explained by the regression model).
One key takeaway here was the method for estimating the variance of the random error in the model from the residuals. Since a model with one predictor requires two parameters (intercept and slope), the degrees of freedom are reduced by two, which changes how the average squared residual is calculated. By dividing the sum of squared residuals by n-2 rather than n, we obtained an unbiased estimate of the variance. Using the Galton height data, we applied these concepts by generating a regression line and calculating the standard deviation (sigma) of the error from the residuals. We also learned that total variation equals residual variation plus regression variation. Moreover, we explored R^2, which represents the percentage of variation in the data explained by the regression model. By subtracting the ratio of residual variation to total variation from 1, we determined R^2 for the Galton data, which turned out to be approximately 0.2105. This value matched the R^2 reported in the summary of the regression model.
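These calculations are short enough to write out. The following sketch uses simulated heights in place of the Galton data, so its R^2 will differ from the 0.2105 reported above; the variable names are illustrative.

```r
# Simulated stand-in for the Galton data (illustrative values only).
set.seed(1885)
n <- 928
parent <- rnorm(n, mean = 68.3, sd = 1.8)
child  <- 68 + 0.65 * (parent - 68.3) + rnorm(n, sd = 2.2)
fit <- lm(child ~ parent)

# Residual variation and the error standard deviation.
# Two parameters were estimated, so divide by n - 2 rather than n.
s_res <- sum(resid(fit)^2)
sigma_hat <- sqrt(s_res / (n - 2))
print(all.equal(sigma_hat, summary(fit)$sigma))

# Total variation = residual variation + regression variation.
s_tot <- sum((child - mean(child))^2)
s_reg <- sum((fitted(fit) - mean(child))^2)
print(all.equal(s_tot, s_res + s_reg))

# R^2 is the share of total variation explained by the model.
r2 <- 1 - s_res / s_tot
print(all.equal(r2, summary(fit)$r.squared))
print(all.equal(r2, cor(parent, child)^2))   # holds for one predictor
```

The last line shows a handy shortcut for simple linear regression: R^2 is the squared correlation between predictor and outcome.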
In Lesson 5, we delved into multivariable regression and learned how it can be broken down into a series of single-variable regressions. One takeaway is that what we already know about regression in one variable extends to regressions in multiple variables.
The lesson first recalled simple linear regression with the Galton height data, which was helpful for us. There we saw that the intercept can be treated as the coefficient of a special regressor that has the same value, 1, at every sample. When the intercept was replaced by an explicit “all-ones” regressor, the intercept and slope coefficients remained the same. Next, we used the file ‘elimination.R’, which implements the technique of picking one predictor and replacing all other variables by the residuals of their regressions against that one. The function regressOneOnOne performs the first step of this process. Each application reduces a regression with N variables to a regression with N-1 variables. The process was applied to the trees dataset, where we saw that the coefficients of the remaining variables are unaffected by the elimination, and that the constant regressor no longer stays constant once it has been replaced by its residual against the eliminated predictor. This approach gave us a clear understanding of how multivariable regression works.
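The elimination idea can be sketched with R's built-in trees dataset. The lesson's ‘elimination.R’ is not reproduced here: the helpers `resid_on` and `eliminate_one` below are hypothetical re-implementations written for this illustration, and may differ from the lesson's regressOneOnOne in detail.

```r
data(trees)            # built-in dataset: Girth, Height, Volume
trees$Constant <- 1    # explicit "all-ones" regressor

# Same coefficients whether the intercept is implicit or an explicit regressor.
fit_plain <- lm(Volume ~ Girth + Height, data = trees)
fit_full  <- lm(Volume ~ Girth + Height + Constant - 1, data = trees)
print(coef(fit_plain))
print(coef(fit_full))

# Hypothetical helper: residual of one variable regressed on a single
# predictor, with no intercept (the constant is just another variable).
resid_on <- function(predictor, other, df) {
  resid(lm(df[[other]] ~ df[[predictor]] - 1))
}

# Hypothetical helper: replace every other variable by its residual
# against the chosen predictor, dropping that predictor.
eliminate_one <- function(predictor, df) {
  others <- setdiff(names(df), predictor)
  out <- lapply(others, function(v) resid_on(predictor, v, df))
  names(out) <- others
  as.data.frame(out)
}

trees2 <- eliminate_one("Girth", trees)   # three variables remain
fit_reduced <- lm(Volume ~ Height + Constant - 1, data = trees2)

# Coefficients of the remaining variables match the full regression.
print(coef(fit_reduced))
print(coef(fit_full)[c("Height", "Constant")])

# The Constant column is now a residual, so it is no longer constant.
print(head(trees2$Constant))
```

Applying `eliminate_one` again to `trees2` would remove one more variable, which is how a regression in N variables reduces step by step to single-variable regressions.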