Using Simple Linear Regression to Predict House Prices

2024-04-11

Introduction

In this project, I explore how simple linear regression can be used to predict house prices from a single predictor variable: the size of the house (in square feet). The method provides a straightforward way to quantify the relationship between house size and price.

The Theory Behind Simple Linear Regression

Simple linear regression models the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable (house size), and the other is considered to be a dependent variable (house price). The linear equation used in simple linear regression is:

\[ \text{Price} = \beta_0 + \beta_1 \times \text{Size} \]

Explanation of the Simple Linear Regression

$\beta_0$ is the intercept,
$\beta_1$ is the slope of the line, indicating the price change per square foot.

Loading the data

To generate the data, we’ll use this code:

set.seed(123): This ensures that we get the same random numbers every time the code runs.

house_size <- round(runif(50, 1000, 5000)): This creates a vector of 50 random house sizes, drawn uniformly between 1000 and 5000 square feet and rounded to whole numbers.

house_price <- 50000 + house_size * 100 + rnorm(50, mean = 0, sd = 50000): This calculates simulated prices for the houses. Each house gets a base price of $50,000, plus $100 per square foot, plus random noise drawn from a normal distribution with mean 0 and standard deviation $50,000.

data <- data.frame(house_size, house_price): Finally, this puts the house sizes and prices together into a data frame called “data”. Each row represents one house, with its size and price.
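Put together, the four steps described above form the following script. The statements are the ones listed above; only the comments and the final head() call are added for illustration.

```r
# Fix the random seed so the simulated data are reproducible
set.seed(123)

# 50 house sizes drawn uniformly between 1000 and 5000 sq ft, rounded
house_size <- round(runif(50, 1000, 5000))

# Price = $50,000 base + $100 per sq ft + normal noise (sd = $50,000)
house_price <- 50000 + house_size * 100 + rnorm(50, mean = 0, sd = 50000)

# Combine both vectors into a data frame, one row per house
data <- data.frame(house_size, house_price)

# Peek at the first few rows (added here for illustration)
head(data)
```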

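Plotly

[Figure: interactive Plotly plot, fitted with geom_smooth() using formula = 'y ~ x']

ggplot2 scatter plot with regression line

[Figure: scatter plot with regression line, fitted with geom_smooth() using formula = 'y ~ x']

ggplot2 histogram of house prices

[Figure: histogram of house prices]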
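One way plots like these could be produced with ggplot2 is sketched below. The axis labels, the number of bins, and the object names scatter_plot and price_hist are illustrative assumptions, not necessarily the settings behind the figures. The data are regenerated so the block runs on its own.

```r
library(ggplot2)

# Recreate the simulated data so this block runs on its own
set.seed(123)
house_size <- round(runif(50, 1000, 5000))
house_price <- 50000 + house_size * 100 + rnorm(50, mean = 0, sd = 50000)
data <- data.frame(house_size, house_price)

# Scatter plot with a least-squares line (method = "lm") and its confidence band
scatter_plot <- ggplot(data, aes(x = house_size, y = house_price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "House size (sq ft)", y = "House price ($)")

# Histogram of house prices; 10 bins is an arbitrary illustrative choice
price_hist <- ggplot(data, aes(x = house_price)) +
  geom_histogram(bins = 10) +
  labs(x = "House price ($)", y = "Count")

print(scatter_plot)
print(price_hist)
```

If the plotly package is installed, plotly::ggplotly(scatter_plot) converts the scatter plot into an interactive version.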

Estimating Coefficients

The coefficients of the linear regression equation, $\beta_0$ and $\beta_1$, represent the intercept and the slope of the line, respectively. They are chosen to minimize the sum of squared differences between the predicted and actual prices (the least-squares criterion). The formula for the slope ($\beta_1$) is:

\[ \beta_1 = \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sum (x_i - \overline{x})^2} \]

And the formula for the intercept ($\beta_0$) is:

\[ \beta_0 = \overline{y} - \beta_1\overline{x} \]

Explaining the Coefficients

$\overline{x}$ is the mean of the independent variable,
$\overline{y}$ is the mean of the dependent variable,
$x_i$ and $y_i$ are individual observations.
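The two formulas can be checked by computing them directly and comparing the result with R's built-in lm(). This is a minimal sketch on the simulated data; the variable names x, y, manual, and builtin are chosen here for illustration.

```r
# Recreate the simulated data so this block runs on its own
set.seed(123)
house_size <- round(runif(50, 1000, 5000))
house_price <- 50000 + house_size * 100 + rnorm(50, mean = 0, sd = 50000)

x <- house_size
y <- house_price

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: mean of y minus slope times mean of x
beta_0 <- mean(y) - beta_1 * mean(x)

# Compare with the estimates from lm()
fit <- lm(house_price ~ house_size)
manual <- c(beta_0, beta_1)
builtin <- unname(coef(fit))
print(manual)
print(builtin)
```

The two printed vectors should agree up to floating-point rounding, since lm() solves the same least-squares problem.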

Interpreting the Model

Once we have estimated the coefficients, we can interpret them to understand the relationship between house size and price.

$\beta_1$ (the slope) tells us how much the predicted price increases for each additional square foot of house size. For example, if $\beta_1 = 100$, each additional square foot adds $100 to the predicted price.
$\beta_0$ (the intercept) gives us the predicted price of a house when the size is zero. Practically, this might not make sense (as houses can’t have zero size), but it fixes the vertical position of the fitted line.

Using these coefficients, we can predict house prices for any given size using the formula:

\[ \text{Predicted Price} = \beta_0 + \beta_1 \times \text{House Size} \]
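As a worked example with the estimates reported in the R output of this post (intercept 38106.191, slope 104.773), a 2,000 square foot house has a predicted price of 38106.191 + 104.773 × 2000 = 247,652.19 dollars. The sketch below does the same calculation in R; the size of 2,000 square feet and the helper name predict_price are illustrative choices.

```r
# Coefficient estimates copied from the model summary in this post
beta_0 <- 38106.191  # intercept
beta_1 <- 104.773    # slope: dollars per additional square foot

# Hypothetical helper that applies the prediction formula
predict_price <- function(size) {
  beta_0 + beta_1 * size
}

# Predicted price for a 2,000 sq ft house
predicted <- predict_price(2000)
print(predicted)
```

Given a fitted lm object (called model here, a hypothetical name), predict(model, newdata = data.frame(house_size = 2000)) gives the same kind of prediction without copying coefficients by hand.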

R Output

The summary of our linear model provides important information, including the coefficients, their significance, and the overall fit of the model. Below is the output of summary() for the fitted model:

## 
## Call:
## lm(formula = house_price ~ house_size, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112787  -27894   -3284   27465  109306 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38106.191  18792.058   2.028   0.0482 *  
## house_size    104.773      5.706  18.362   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47020 on 48 degrees of freedom
## Multiple R-squared:  0.8754, Adjusted R-squared:  0.8728 
## F-statistic: 337.2 on 1 and 48 DF,  p-value: < 2.2e-16
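
A summary like the one above can be produced with a few lines of R. This is a sketch that assumes the simulated data frame from earlier (recreated here so the block is self-contained); the object names model, model_summary, r_squared, and coef_table are naming choices made for this example.

```r
# Recreate the simulated data so this block runs on its own
set.seed(123)
house_size <- round(runif(50, 1000, 5000))
house_price <- 50000 + house_size * 100 + rnorm(50, mean = 0, sd = 50000)
data <- data.frame(house_size, house_price)

# Fit the simple linear regression of price on size
model <- lm(house_price ~ house_size, data = data)

# Coefficient table, residual summary, R-squared, and F-statistic
model_summary <- summary(model)
print(model_summary)

# Individual pieces can be extracted by name
r_squared <- model_summary$r.squared
coef_table <- coef(model_summary)  # estimates, std. errors, t values, p-values
```

In the output above, the slope estimate of about 104.8 is close to the $100 per square foot used to simulate the data, and the R-squared of 0.8754 means that house size accounts for roughly 88% of the variance in price.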