Introduction to Simple Linear Regression

What is Simple Linear Regression?

Definition

Simple linear regression is a linear regression model involving two variables: one independent variable and one dependent variable. The model fits a straight line that predicts the value of the dependent variable as a function of the independent variable.

Formula and Computation

The model for simple linear regression is \(y = \alpha + \beta x + \epsilon\), where \(\alpha\) is the y-intercept of the line, \(\beta\) is the slope of the line, and \(\epsilon\) is the error term, the part of y that the line does not account for.

The goal of simple linear regression is to find estimates \(\hat \alpha\) and \(\hat \beta\) that give the best-fit prediction of y for a given x, based on existing data points. This is usually done with the least-squares approach, which finds the line that minimizes the sum of squared residuals (a residual is the vertical distance between an actual data point and the value the fitted line predicts at the same x).
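For reference, the least-squares problem has a well-known closed-form solution. Writing \(\bar x\) and \(\bar y\) for the sample means of the n data points \((x_i, y_i)\) (notation introduced here for illustration, not used elsewhere in these slides), the estimates are:

\[
\hat \beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat \alpha = \bar y - \hat \beta \, \bar x
\]

The second formula also shows that the fitted line always passes through the point of means \((\bar x, \bar y)\).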

Why Use Simple Linear Regression?

Applications in Computer Science

Predicting system performance, algorithmic time complexity

Applications in Engineering

Predicting the strength of a material based on its composition

Real-World Application

Predicting house prices based on square footage

Real-World Example

Let’s build on the real-world application discussed in the previous slide. Here, we have an example data set logging the square footage and price of 10 different homes, and a plot of this data looking at price versus square footage on the following slide.

houseData <- data.frame(SqFt = c(1498, 3607, 1024, 2473, 1503, 2664, 
                                 980, 2166, 2893, 3022),
                   Price = c(475000, 995000, 127500, 558990, 375000, 
                             562990, 80000, 489900, 820000, 735000))

Real-World Example Continued

library(ggplot2)  # provides ggplot(), aes(), geom_point() and labs()

ggplot(houseData, aes(x=SqFt, y=Price)) + geom_point() +
  labs(title="House Prices vs. Square Footage",
       x="Square Footage", y="Price ($)")

While it would be hard to guess the price of a house from its square footage just by reading the raw numbers, the plot shows a roughly linear trend, which makes this data a good candidate for simple linear regression.

Plot Linear Regression Line

In R, linear regression can be performed with the lm() function, and a regression line can be added to a plot with the geom_smooth() function from the ggplot2 package.
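The slides do not show the plotting code for this step, so the following is only a sketch of what it presumably looks like: the earlier scatter plot with geom_smooth(method = "lm") added. The variable name regressionPlot is chosen here for illustration, and the data frame is repeated so the snippet runs on its own.

```r
library(ggplot2)

# Same example data as defined earlier in these slides
houseData <- data.frame(SqFt = c(1498, 3607, 1024, 2473, 1503, 2664,
                                 980, 2166, 2893, 3022),
                        Price = c(475000, 995000, 127500, 558990, 375000,
                                  562990, 80000, 489900, 820000, 735000))

# Scatter plot plus a least-squares line fitted by lm();
# by default geom_smooth() also shades a confidence band around the line
regressionPlot <- ggplot(houseData, aes(x = SqFt, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "House Prices vs. Square Footage",
       x = "Square Footage", y = "Price ($)")

print(regressionPlot)
```

Passing se = FALSE to geom_smooth() would hide the shaded band if only the line is wanted.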

`geom_smooth()` using formula = 'y ~ x'

After adding geom_smooth(method='lm'), we see a straight line running through the middle of our data. This line is our linear regression line; it does not pass through every point, but it represents the best-fit estimate of price based on square footage.

Find Linear Regression Function

If we want to see the coefficients of the linear regression function, we can call lm() with a formula of the form y ~ x, which here is Price ~ SqFt.
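A minimal sketch of that call, assuming the houseData data frame from the earlier slide (repeated here so the snippet runs on its own). The name houseModel is chosen here for illustration and does not appear in the original slides.

```r
# Same example data as defined earlier in these slides
houseData <- data.frame(SqFt = c(1498, 3607, 1024, 2473, 1503, 2664,
                                 980, 2166, 2893, 3022),
                        Price = c(475000, 995000, 127500, 558990, 375000,
                                  562990, 80000, 489900, 820000, 735000))

# Fit Price as a linear function of SqFt by ordinary least squares
houseModel <- lm(Price ~ SqFt, data = houseData)

# Printing the model shows the call and the fitted coefficients
print(houseModel)

# coef() returns the same estimates as a named numeric vector
houseCoefs <- coef(houseModel)
print(houseCoefs)
```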

Call:
lm(formula = Price ~ SqFt, data = houseData)

Coefficients:
(Intercept)         SqFt  
  -143117.0        304.7

This gives us the intercept and the slope for square footage, which completes our linear regression equation. Since y = price and x = square footage, the equation is \(y = 304.7x - 143117\). Plugging in a square footage for x returns the best-fit estimate of the price based on the data set we have been given. The estimate is most trustworthy for values of x within the range of the data (980 to 3607 square feet); below roughly 470 square feet, for instance, the equation returns a negative price.
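As an illustration of plugging values into the equation, the sketch below uses predict() on a fitted model and checks the result against the same calculation done by hand from the coefficients. The square footages 1500 and 2000 are hypothetical inputs picked for this example, and the names houseModel, newHomes, predictedPrice and manualPrice are likewise introduced here.

```r
# Same example data as defined earlier in these slides
houseData <- data.frame(SqFt = c(1498, 3607, 1024, 2473, 1503, 2664,
                                 980, 2166, 2893, 3022),
                        Price = c(475000, 995000, 127500, 558990, 375000,
                                  562990, 80000, 489900, 820000, 735000))

houseModel <- lm(Price ~ SqFt, data = houseData)

# Hypothetical new homes, both inside the 980 to 3607 sq ft range of the data
newHomes <- data.frame(SqFt = c(1500, 2000))

# predict() evaluates the fitted line at the new x values
predictedPrice <- predict(houseModel, newdata = newHomes)
print(predictedPrice)

# The same calculation by hand: intercept + slope * x
manualPrice <- coef(houseModel)[["(Intercept)"]] +
  coef(houseModel)[["SqFt"]] * newHomes$SqFt
print(manualPrice)
```

Because predict() uses the unrounded coefficients, its result for 2000 square feet differs slightly from the 466283 obtained with the rounded equation \(304.7 \times 2000 - 143117\).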

Linear Regression Errors Graphed

Because the regression line is a best-fit estimate, it generally does not pass exactly through the data points. If you plug in the square footage (x) of a house in the data set, the predicted price will usually differ from the actual price (y), coming out either too high or too low. In the following graph, we can see this difference between the actual and estimated price (the error, z) graphed alongside each square footage (x) and price (y) pairing.
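The graph itself is not reproduced here. As a rough sketch of how these errors can be computed and visualized, the code below extracts the residuals with resid() and draws each one as a vertical segment between the actual point and the fitted line. This is one possible visualization, not necessarily the one used in the original slides, and the column names Fitted and Residual and the variable names houseModel and residualPlot are introduced here for illustration.

```r
library(ggplot2)

# Same example data as defined earlier in these slides
houseData <- data.frame(SqFt = c(1498, 3607, 1024, 2473, 1503, 2664,
                                 980, 2166, 2893, 3022),
                        Price = c(475000, 995000, 127500, 558990, 375000,
                                  562990, 80000, 489900, 820000, 735000))

houseModel <- lm(Price ~ SqFt, data = houseData)

# Fitted value = price predicted by the line; residual = actual - fitted
houseData$Fitted <- unname(fitted(houseModel))
houseData$Residual <- unname(resid(houseModel))
print(houseData)

# Draw each residual as a vertical segment from the point to the fitted line
residualPlot <- ggplot(houseData, aes(x = SqFt, y = Price)) +
  geom_segment(aes(xend = SqFt, yend = Fitted), colour = "red") +
  geom_line(aes(y = Fitted)) +
  geom_point() +
  labs(title = "Residuals of the Fitted Line",
       x = "Square Footage", y = "Price ($)")

print(residualPlot)
```

A positive residual means the house sold for more than the line predicts, and a negative one means it sold for less. With an intercept in the model, the residuals sum to zero up to floating-point rounding.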

Conclusion

In conclusion, simple linear regression is a useful tool for broadly estimating the value of one variable from another, provided that the existing data suggests a linear relationship between the two. The method has applications in many industries and fields of study, both technical and non-technical. While the estimates will not be perfectly accurate, the fitted line is the straight line that best fits the existing data in the least-squares sense, and its predictions are most reliable within the range of the observed values of the independent variable.