Using Simple Linear Regression to Predict House Prices

2024-04-04



Introduction

In this project, I explore how simple linear regression can be used to predict house prices from a single predictor variable: the size of the house in square feet. The method gives a straightforward way to quantify the relationship between house size and price.

The Theory Behind Simple Linear Regression

Simple linear regression models the relationship between two variables by fitting a straight line to observed data. One variable is the explanatory variable (here, house size) and the other is the dependent variable (house price). The linear equation used in simple linear regression is:

\[ \text{Price} = \beta_0 + \beta_1 \times \text{Size} \]

Where:

- $\beta_0$ is the intercept,
- $\beta_1$ is the slope of the line, indicating the price change per square foot.

Loading the Data
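The later slides use a data frame called `data` with the columns `house_size` and `house_price`. As a stand-in so that the later code can be run end to end, the sketch below simulates 50 hypothetical houses. The seed, size range, intercept, slope, and noise level are all assumptions chosen for illustration; this is not the dataset behind the reported results.

```r
# Hypothetical stand-in data: NOT the dataset used for the reported results.
set.seed(123)   # assumed seed, for reproducibility only
n <- 50         # 50 rows gives the 48 residual degrees of freedom in the model summary

house_size  <- round(runif(n, min = 1000, max = 5000))  # square feet (assumed range)
noise       <- rnorm(n, mean = 0, sd = 50000)           # assumed noise level
house_price <- 50000 + 100 * house_size + noise         # assumed underlying line

data <- data.frame(house_size = house_size, house_price = house_price)

str(data)   # two numeric columns, 50 observations
head(data)
```

With real data, the same two column names would let the plotting and modelling code run unchanged.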

Interactive Plot with Plotly

[Interactive Plotly figure not shown. Console message: `geom_smooth()` using formula = 'y ~ x']

Scatter Plot with Regression Line (ggplot2)

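library(ggplot2)

# Scatter plot of price against size with a fitted least-squares line.
# Constant colors are set outside aes() so they are applied literally
# instead of being mapped to a legend entry.
ggplot(data, aes(x = house_size, y = house_price)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  theme_minimal() +
  labs(title = "House Price vs. Size", x = "Size (Square Feet)", y = "Price ($)")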

Histogram of House Prices (ggplot2)

library(ggplot2)

# Histogram of house prices; binwidth = 50000 makes each bar span $50,000.
# The fill color is a constant, so it is set outside aes().
ggplot(data, aes(x = house_price)) +
  geom_histogram(fill = "skyblue", binwidth = 50000, color = "black") +
  theme_minimal() +
  labs(title = "Distribution of House Prices", x = "Price ($)", y = "Frequency",
       caption = "Histogram showing the distribution of house prices")

Estimating Coefficients

The coefficients of the linear regression equation, $\beta_0$ and $\beta_1$, represent the intercept and the slope of the line, respectively. They are chosen to minimize the sum of squared differences between the predicted and actual values (the least-squares criterion). The formula for the slope ($\beta_1$) is:

\[ \beta_1 = \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sum (x_i - \overline{x})^2} \]

And the formula for the intercept ($\beta_0$) is:

\[ \beta_0 = \overline{y} - \beta_1\overline{x} \]

Where:

- $\overline{x}$ is the mean of the independent variable,
- $\overline{y}$ is the mean of the dependent variable,
- $x_i$ and $y_i$ are individual observations.
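The two formulas above can be checked by hand on a small example. The sketch below uses five made-up observations (hypothetical values, not the project data), computes $\beta_1$ and $\beta_0$ directly from the formulas, and compares the result with R's built-in `lm()`.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(1000, 1500, 2000, 2500, 3000)            # house size in square feet
y <- c(150000, 190000, 245000, 300000, 340000)  # house price in dollars

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
beta1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: places the fitted line through the point (x_bar, y_bar)
beta0 <- y_bar - beta1 * x_bar

print(c(beta0 = beta0, beta1 = beta1))  # beta0 = 49000, beta1 = 98

# The hand-computed values should agree with R's least-squares fit
fit <- lm(y ~ x)
stopifnot(isTRUE(all.equal(c(beta0, beta1), unname(coef(fit)))))
```

For these five points the means are 2000 and 245000, the numerator is 245,000,000 and the denominator is 2,500,000, which gives a slope of 98 and an intercept of 245000 - 98 × 2000 = 49000.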

Interpreting the Model

Once we have estimated the coefficients, we can interpret them to understand the relationship between house size and price.

- $\beta_1$ (the slope) is the change in predicted price for each additional square foot of house size. For example, if $\beta_1 = 100$, each additional square foot raises the predicted price by \$100.
- $\beta_0$ (the intercept) is the predicted price of a house when the size is zero. A house cannot have zero size, so this value has no practical meaning on its own; its role is to set the vertical position of the fitted line.

Using these coefficients, we can predict house prices for any given size using the formula:

\[ \text{Predicted Price} = \beta_0 + \beta_1 \times \text{House Size} \]
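In R, this formula is what `predict()` evaluates for a fitted `lm` object. The sketch below fits a model to five made-up observations (hypothetical values, not the project data) and shows that `predict()` and the formula applied by hand give the same numbers. The data frame names `toy` and `new_houses` are illustrative.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  house_size  = c(1000, 1500, 2000, 2500, 3000),
  house_price = c(150000, 190000, 245000, 300000, 340000)
)
fit <- lm(house_price ~ house_size, data = toy)

# New sizes to predict for; the column name must match the predictor in the formula
new_houses <- data.frame(house_size = c(1800, 2200))

# predict() evaluates beta0 + beta1 * house_size for each new row
pred <- predict(fit, newdata = new_houses)

# The same calculation done directly from the estimated coefficients
b <- coef(fit)
by_hand <- b[["(Intercept)"]] + b[["house_size"]] * new_houses$house_size

print(unname(pred))  # 225400 264600
stopifnot(isTRUE(all.equal(unname(pred), by_hand)))
```

Predictions are most trustworthy for sizes inside the range of the data used to fit the model; extrapolating far beyond it relies on the line continuing to hold.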

R Output

The summary of our linear model provides important information, including the coefficients, their significance, and the overall fit of the model. Below is the R code used to fit the model and its output:

# Fitting a linear model
model <- lm(house_price ~ house_size, data = data)
# Displaying summary of the model
summary(model)

Call:
lm(formula = house_price ~ house_size, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-112787  -27894   -3284   27465  109306 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38106.191  18792.058   2.028   0.0482 *  
house_size    104.773      5.706  18.362   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47020 on 48 degrees of freedom
Multiple R-squared:  0.8754,    Adjusted R-squared:  0.8728 
F-statistic: 337.2 on 1 and 48 DF,  p-value: < 2.2e-16
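As a worked reading of this output, the sketch below plugs the reported estimates into the prediction formula and re-derives two of the printed statistics. The 2,000 square foot house is a hypothetical example; the estimates and standard errors are copied from the coefficient table above.

```r
# Estimates and standard errors copied from the summary output above
b0 <- 38106.191; se_b0 <- 18792.058   # intercept
b1 <- 104.773;   se_b1 <- 5.706       # slope, in dollars per square foot

# Predicted price for a hypothetical 2,000 sq ft house
size <- 2000
predicted_price <- b0 + b1 * size
print(predicted_price)   # 247652.2 (38106.191 + 209546 = 247652.191)

# Each t value is the estimate divided by its standard error
t_b0 <- b0 / se_b0
t_b1 <- b1 / se_b1
print(round(c(t_b0, t_b1), 3))   # 2.028 18.362, matching the "t value" column

# With a single predictor, the F-statistic equals the squared t value of the slope
print(round(t_b1^2, 1))          # 337.2, matching the reported F-statistic
```

The Multiple R-squared of 0.8754 means that house size accounts for about 87.5% of the variance in price in this sample, and the residual standard error of 47020 indicates that actual prices typically differ from the fitted line by roughly \$47,000.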