2024-10-20

What is Linear Regression?

Linear regression is a statistical model that can be used to generate predictions, based on an estimate of it’s relationship to other, related, variables. A form of linear regression is used in most businesses today to predict things like stock price, insurance risk, environmental impact, and home prices. It sounds complex but you might have been doing it as early as elementary school when finding the slope of a line based on two points.

The Linear Regression Equation

Here is the Linear Regression Equation: \[Y_i=\beta_0 + \beta_1X_i\] But what do all of those symbols mean? I’ll use stock price as an example to explain

\(Y_i\) is the dependent variable, or the variable you are trying to predict. (e.g. stock price)
\(\beta_0\) is the constant. (e.g. how much a stock is worth at the beginning of a set time frame)
\(\beta_1\) is the slope/coefficient. (e.g. the expected change in a stock’s price based on in increase in time)
\(X_i\) is the independent variable. (e.g. an amount of time)

Using this equation you can predict a stock’s future price!

Time for a Real Example

For this example we will use an Arizona housing prices data set to predict the price of a house based on a few different factors.

Firstly, here is the structure of the data set:

Price Local_area beds baths sqft
229900 Phoenix 2 3 1498
294900 Sahuarita 4 3 1951
683100 Phoenix 4 4 3110
260000 Scottsdale 2 1 759

Visualizing the Data

Lets start off by visualizing some of the data.
We can use a scatter plot to visualize the houses by square footage and price:

There looks to be some sort of correlation between these two

Lets explore a little further by calculating the line of best fit

Line of Best Fit

This is the line that most accurately represents the trend of the data

It can be calculated using this function:

\[Y=(\bar{Y}-\beta_1\bar{X})+\beta_1X\]

Where \(\bar{Y}\) is the mean of all the y values
and \(\bar{X}\) is the mean of all the x values

Regression Line Visualized

ggplot(data, aes(sqft, Price)) + 
    geom_point() + geom_smooth(method='lm',se=FALSE,formula='y ~ x')

We can use this line to estimate the price of some houses!

For example a 3000 sqft house would sell for about $625,000 and a 1000 sqft house would sell for around $250,000

Can we get more complex?

Lets try and use 3 variables instead of 2. We can visualize the relationship between price and the number of bedrooms and bathrooms.

In more advanced linear regression we can calculate the plane of best fit, which is similar to the line of best fit but it’s in 3D.