Linear regression models the relationship between two variables by fitting a linear equation to the data. Before applying regression, however, we must check four assumptions to gauge whether linear regression is appropriate and whether we can confidently draw inferences from it: linearity, independence of observations, normal residuals, and equal variance (the acronym LINE is an easy way to remember them). Additionally, to obtain the best possible estimates, ordinary least squares (OLS) regression is typically used; we will cover this method in more detail later on. Lastly, we will also touch on the correlation coefficient and its significance in linear regression.

First, using a dataset called “headbrain.csv” that we found on Kaggle, let’s plot head size (in \(cm^3\)) against brain weight (in grams) on a scatterplot.
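
As a rough sketch, the scatterplot could be produced in R along the lines below; the cleaned-up column names (head_size, brain_weight, and so on) are assumptions and may need to be adjusted to match the headers in your copy of headbrain.csv.

```r
# Read the Kaggle dataset. The renaming below assumes the file has four
# columns (gender, age range, head size, brain weight); adjust as needed.
headbrain <- read.csv("headbrain.csv")
names(headbrain) <- c("gender", "age_range", "head_size", "brain_weight")

# Scatterplot of head size (cm^3) against brain weight (grams)
plot(headbrain$head_size, headbrain$brain_weight,
     xlab = "Head size (cm^3)", ylab = "Brain weight (g)",
     pch = 19, col = "steelblue")
```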

In the scatterplot generated above, you can immediately see an upward trend, which suggests that a larger head size is associated with a higher brain weight. So, let’s fit a line to the points to see whether the data meets the linearity assumption.
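
A sketch of how the line could be fit and overlaid, reusing the headbrain data frame and the column names assumed above:

```r
# Fit a simple linear regression of brain weight on head size
model <- lm(brain_weight ~ head_size, data = headbrain)

# Redraw the scatterplot and overlay the fitted least-squares line
plot(headbrain$head_size, headbrain$brain_weight,
     xlab = "Head size (cm^3)", ylab = "Brain weight (g)",
     pch = 19, col = "steelblue")
abline(model, col = "red", lwd = 2)
```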

To meet the linearity assumption, the variables in the model of interest must have a linear relationship. The line of best fit in the plot above shows that there is a positive linear relationship. Note, however, that this linear relationship need not imply or involve causation; there simply needs to be a significant association between the two variables at hand.

The second assumption that needs to be met is independence of the observations. All this means is that the value of one observation cannot depend on the value of another. An example of an independent observation would be measuring head size at a single point in time. However, if one were to measure head size repeatedly over time, the observations would become dependent, since the head size at one point in time would influence head sizes at later points. To visualize this, we can plot the predictor variable on the x-axis against the residuals on the y-axis. As a reminder, a residual is the difference between the actual value (the data point on the scatterplot) and the value predicted by the model (the line).
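
One way this residual plot could be drawn in R, using the model object fit above:

```r
# Residuals against the predictor: a patternless scatter around zero
# is what we want to see for the independence assumption
plot(headbrain$head_size, resid(model),
     xlab = "Head size (cm^3)", ylab = "Residuals",
     pch = 19, col = "steelblue")
abline(h = 0, lty = 2)   # horizontal reference line at zero
```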

To meet the assumption, there should be no discernible relationship in this residual plot. In the plot we generated above, the residuals are randomly scattered around the horizontal line at zero, which means that the data meets the independence assumption.

The third assumption for the model is normally distributed residuals. A simple way of checking this assumption is to create a histogram of the residuals. This histogram should look approximately normal, which means the plot should be unimodal and roughly follow a bell-shaped curve.
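
A sketch of this check in R, drawn as a density histogram so a normal curve can be overlaid (the number of bins is an arbitrary choice):

```r
res <- resid(model)

# Density histogram of the residuals: bar areas sum to 1, so a normal
# curve with the residuals' mean and SD can be overlaid directly
hist(res, freq = FALSE, breaks = 20,
     xlab = "Residuals", main = "Histogram of residuals")
curve(dnorm(x, mean = mean(res), sd = sd(res)),
      add = TRUE, col = "red", lwd = 2)
```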

After plotting the residuals on a histogram, we can see that they roughly follow a normal distribution, fulfilling the third assumption. Histograms usually show “frequency” or “count” on the y-axis, but we used a density histogram here instead; because the areas of the bars add up to 1, this lets us properly overlay the normal distribution curve and show that the data meets the normal residuals assumption.

The fourth and final assumption is equal variance of the residuals. Essentially, the spread of the residuals should be roughly constant across all values. This assumption can be checked by examining a scatterplot of the residuals versus the fitted, or estimated, values. As with checking independence, the scatterplot should show no relationship or pattern.
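
This plot could be produced along the following lines, again reusing the fitted model:

```r
# Residuals against fitted values: roughly constant vertical spread
# with no funnel shape suggests equal variance (homoscedasticity)
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals",
     pch = 19, col = "steelblue")
abline(h = 0, lty = 2)
```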

In the graph above, there is no discernible relationship or pattern between the residuals and the fitted values, so the equal variance assumption is met. With this last box checked, the head size vs. brain weight data meets all four assumptions, making linear regression an appropriate model that we can draw inferences from.

Ordinary least squares (OLS) regression is a statistical method that estimates the relationship between a predictor variable and a response variable by minimizing the sum of squared differences between the actual and fitted values of the response. Although this may sound complex, all that is happening behind the scenes is that the line is positioned so that the sum of the squared vertical distances between the data points and the line is as small as possible. The animation below provides a clear visualization of how this works.

The fitted regression line (the fixed green line) is the one obtained from the method of least squares, so it sits where the model has the least total error relative to the points. As the sliding black line strays from the green line, some of the residuals, shown as vertical red lines, get longer, increasing the sum of squares. In other words, OLS regression positions the line so that the sum of the squared lengths of the red lines is as small as possible, giving us the smallest possible sum of squares.
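
To make the “smallest possible sum of squares” idea concrete, here is a small numerical sketch: it compares the residual sum of squares of the OLS fit with that of a line whose slope has been nudged away from the OLS slope (the 10% perturbation is arbitrary and purely illustrative).

```r
# Residual sum of squares for the OLS fit
rss_ols <- sum(resid(model)^2)

# A competing line: same intercept, slope nudged up by 10% (arbitrary)
b0 <- coef(model)[1]
b1 <- coef(model)[2] * 1.1
rss_other <- sum((headbrain$brain_weight - (b0 + b1 * headbrain$head_size))^2)

# The OLS fit always has the smaller (or equal) residual sum of squares
c(OLS = rss_ols, perturbed = rss_other)
```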

Another crucial concept in regression is the correlation coefficient. The correlation coefficient measures the strength and direction of the association between the variables in a regression. This value is usually denoted \(r\) and ranges from -1 to 1, with its sign matching the direction of the slope. Another useful way to gauge the degree of correlation is through the standard deviation line. The standard deviation line passes through the point of averages, whose coordinates are the means of the two variables, and has a slope equal to the standard deviation of \(Y\) divided by the standard deviation of \(X\) (with the same sign as \(r\)). The closer the correlation coefficient is to positive or negative 1, the closer the regression line will be to the standard deviation line. The relative positions of the two lines in the animation below provide a visual aid for discerning the correlation between the variables.

In the simulation above, we modeled a scatterplot in which \(X\) and \(Y\) have a positive relationship, so \(r\) starts at 0 and gradually increases to 1. The blue line represents the linear regression model and the red line denotes the standard deviation line. As \(r\) increases, the blue line’s slope gets steeper, gradually aligning with the red line. A correlation coefficient of 1 signifies a perfectly linear relationship, meaning that all the points lie precisely on the line (though in practice this almost never happens).
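
A quick numerical sketch of this relationship using the head size and brain weight data: in simple linear regression the slope of the regression line equals \(r\) times the SD-line slope, so the two lines coincide only when \(|r| = 1\).

```r
x <- headbrain$head_size
y <- headbrain$brain_weight

r <- cor(x, y)                  # correlation coefficient
sd_slope <- sd(y) / sd(x)       # slope of the standard deviation line
reg_slope <- r * sd(y) / sd(x)  # slope of the regression line

# Both lines pass through the point of averages (mean(x), mean(y));
# the regression slope is the SD-line slope scaled down by r
c(r = r, sd_line = sd_slope, regression = reg_slope)
coef(model)[2]                  # the lm() slope agrees with reg_slope
```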

Finally, it is important to interpret the correlation coefficient accurately. Remember, it is simply a measure of how strongly a pair of variables is related, and it does not imply causation. Another common metric in regression is the coefficient of determination, or \(R^2\). \(R^2\) is just what it appears to be: the correlation coefficient squared. The coefficient of determination measures how much of the variation in the response is explained by the regression model. For instance, an \(R^2\) of 0.9 simply means that \(X\) explains 90% of the variation in \(Y\).
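
As a small check, the squared correlation and the \(R^2\) reported by the fitted model should agree in simple linear regression:

```r
# Coefficient of determination two ways: squared correlation vs. lm() output
r_squared <- cor(headbrain$head_size, headbrain$brain_weight)^2
c(r_squared = r_squared, lm_R2 = summary(model)$r.squared)
```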

We think we should get an Excellent for this project because we reorganized the document so that it flows better and tells a more cohesive story, used code chunk options to make the document look cleaner and more understandable for a first-time reader, and used Plotly for the residual plots.