2024-10-28


Overview

  • Simple Linear Regression
  • US Housing Data

What is a linear regression?

  • Purpose: Models the relationship between two variables by fitting a straight line
  • Interpretation: The slope tells us how much one variable changes for each one-unit increase in another
  • Objective: Find the best-fitting line, the one that minimizes the sum of squared differences between observed and predicted values

Linear Regression Equation

The equation of a simple linear regression line is given by: \[ y = \beta_0 + \beta_1 x + \epsilon \]

  • \(y\): Dependent variable (e.g., Price)
  • \(x\): Independent variable (e.g., Living Space)
  • \(\beta_0\): Intercept of the line
  • \(\beta_1\): Slope of the line (change in \(y\) for a one-unit increase in \(x\))
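  • \(\epsilon\): Error term (the part of \(y\) that the line does not explain; for a fitted line it is estimated by the residual, the difference between the observed and predicted values)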
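To make the equation concrete, here is a minimal sketch of how the intercept and slope can be estimated by least squares. The data are made-up toy values (not the housing dataset), and the names `x`, `y`, `beta0_hat`, and `beta1_hat` are chosen only for illustration.

```r
# Hypothetical toy data, generated only for this illustration
set.seed(1)
x <- runif(50, min = 500, max = 4000)          # stand-in for living space
y <- 50000 + 200 * x + rnorm(50, sd = 40000)   # stand-in for price, with noise

# Closed-form least-squares estimates of the slope and intercept
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# lm() fits the same line, so its coefficients should agree
fit <- lm(y ~ x)
print(c(intercept = beta0_hat, slope = beta1_hat))
print(coef(fit))
```

The slope printed here is read the same way as \(\beta_1\) above: the estimated change in `y` for a one-unit increase in `x`.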

Dataset Summary

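library(dplyr)  # provides %>% and select()
# Keep only the columns used in this analysis, then summarise them
selected_columns <- df %>%
  select(Price, Living.Space, Beds, Baths)
summary(selected_columns)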
##      Price           Living.Space        Beds            Baths       
##  Min.   :    1800   Min.   :    2   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:  265000   1st Qu.: 1200   1st Qu.: 3.000   1st Qu.: 2.000  
##  Median :  399900   Median : 1639   Median : 3.000   Median : 2.000  
##  Mean   :  622777   Mean   : 1902   Mean   : 3.172   Mean   : 2.467  
##  3rd Qu.:  674990   3rd Qu.: 2265   3rd Qu.: 4.000   3rd Qu.: 3.000  
##  Max.   :38000000   Max.   :74340   Max.   :54.000   Max.   :66.000

Price vs. Square Feet - plotly

First Linear Regression

  • The first regression was hard to read because the full dataset has so many rows and a few extreme outliers (Price up to 38,000,000 and Living.Space up to 74,340) stretch the axes
  • Most points crowd into the bottom-left corner of the plot, so it was hard to tell how well the line fits
  • We need to reduce the data to make the plot readable
  • We will randomly choose 100 rows using dplyr to make it more manageable

Randomly Choosing Data

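set.seed(7)  # fix the random seed so the same 100 rows are drawn on every run
sample_df <- sample_n(df, 100)  # 100 random rows, drawn without replacement
selected_columns <- sample_df %>%
  select(Price, Living.Space, Beds, Baths)
summary(selected_columns)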
##      Price          Living.Space       Beds           Baths    
##  Min.   :  97500   Min.   : 396   Min.   : 1.00   Min.   :1.0  
##  1st Qu.: 257400   1st Qu.:1245   1st Qu.: 2.00   1st Qu.:2.0  
##  Median : 369450   Median :1776   Median : 3.00   Median :3.0  
##  Mean   : 605582   Mean   :1990   Mean   : 3.27   Mean   :2.7  
##  3rd Qu.: 591175   3rd Qu.:2416   3rd Qu.: 4.00   3rd Qu.:3.0  
##  Max.   :5550000   Max.   :6076   Max.   :11.00   Max.   :8.0
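The role of `set.seed()` can be seen in a small base R sketch. It uses a hypothetical stand-in data frame (`toy_df`, not the housing data) and draws row indices with `sample()` rather than `sample_n()`, so it illustrates the idea and does not reproduce the exact rows chosen above.

```r
# Hypothetical stand-in for the full data frame
toy_df <- data.frame(
  id    = 1:1000,
  Price = seq(100000, by = 500, length.out = 1000)
)

set.seed(7)
rows_a <- sample(nrow(toy_df), 100)  # 100 distinct row indices
set.seed(7)
rows_b <- sample(nrow(toy_df), 100)  # same seed, so the same indices again

sample_a <- toy_df[rows_a, ]         # the sampled subset
print(identical(rows_a, rows_b))     # the two draws match
print(nrow(sample_a))
```

Resetting the seed before each draw is what makes the sample repeatable. Without it, every run would pick a different set of 100 rows and the summary statistics would change.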

Price vs. Square Feet - ggplot
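A scatter plot with a fitted regression line, like the one this slide refers to, could be drawn roughly as follows. This is a sketch under assumptions: `toy_df` is made-up data standing in for `sample_df`, and the styling choices (`alpha`, `se = FALSE`, the axis labels) are illustrative and not taken from the original plot.

```r
library(ggplot2)

# Hypothetical stand-in for sample_df
set.seed(7)
toy_df <- data.frame(Living.Space = runif(100, min = 400, max = 6000))
toy_df$Price <- 50000 + 250 * toy_df$Living.Space + rnorm(100, sd = 300000)

# Scatter plot of the points plus the least-squares line
p <- ggplot(toy_df, aes(x = Living.Space, y = Price)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Price vs. Square Feet", x = "Living space", y = "Price")

print(p)
```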

Second Linear Regression

  • As we saw in the last slide, the linear regression of Price on Square Feet is much easier to read when we limit the data to 100 rows
  • The fitted line follows the overall trend in the points, so the model appears to be behaving sensibly
  • It is not perfect, but it gives a decent prediction of price from the square footage of a house
  • Many other variables influence price, so it is difficult to get a precise prediction from square footage alone

R-Squared

mod <- lm(Price ~ Living.Space, data = sample_df)
model_summary <- summary(mod)
r_squared <- model_summary$r.squared
  • The last slide judged the quality of the regression by eye, but there is a way to measure it
  • \(R^2\) is a statistical measure that gives the proportion of variance in the dependent variable that is explained by the independent variable
  • It ranges from 0 to 1: a value of 1 means the model explains all of the variance, and 0 means it explains none
  • For this model, the \(R^2\) value is 0.373
  • This means that approximately 37.3% of the variance in price is explained by square footage alone. That is a noticeable share, but it suggests that other factors matter too
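The definition above can be checked by hand: \(R^2\) equals one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up data (`toy_df` and `toy_mod` are hypothetical names, not the housing sample), so its \(R^2\) will not match the 0.373 reported above.

```r
# Hypothetical stand-in for sample_df
set.seed(42)
toy_df <- data.frame(Living.Space = runif(100, min = 400, max = 6000))
toy_df$Price <- 50000 + 250 * toy_df$Living.Space + rnorm(100, sd = 400000)

toy_mod <- lm(Price ~ Living.Space, data = toy_df)

# Variance left unexplained by the line vs. total variance around the mean
ss_res <- sum(residuals(toy_mod)^2)
ss_tot <- sum((toy_df$Price - mean(toy_df$Price))^2)
r2_manual <- 1 - ss_res / ss_tot

print(r2_manual)
print(summary(toy_mod)$r.squared)  # should agree with the manual value
```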

Price vs. Bedrooms, Price vs. Bathrooms, Price vs. Bedrooms and Bathrooms - ggplot

Last Linear Regression

  • The last slide showed a fairly cluttered plot of linear regressions that predict price from the number of bathrooms, the number of bedrooms, and the number of bedrooms and bathrooms combined
  • For just bathrooms, \(R^2\) is 0.319
  • For just bedrooms, \(R^2\) is 0.203
  • For total rooms (bedrooms plus bathrooms), \(R^2\) is 0.293
  • Interestingly, the number of bathrooms is the strongest predictor of the three. However, with an \(R^2\) of 0.319 against 0.373, it is still less predictive than square footage.
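A comparison like this can be set up by fitting one model per predictor and collecting the \(R^2\) values. The sketch below is an assumption about how that might look, not the original analysis: `toy_df` is made-up data, and "total rooms" is taken to mean `Beds + Baths`, written as `I(Beds + Baths)` in the formula.

```r
# Hypothetical stand-in for sample_df
set.seed(7)
n <- 100
toy_df <- data.frame(
  Beds  = sample(1:6, n, replace = TRUE),
  Baths = sample(1:4, n, replace = TRUE)
)
toy_df$Price <- 100000 + 60000 * toy_df$Beds + 120000 * toy_df$Baths +
  rnorm(n, sd = 150000)

# One simple regression per predictor
formulas <- list(
  bathrooms   = Price ~ Baths,
  bedrooms    = Price ~ Beds,
  total_rooms = Price ~ I(Beds + Baths)  # assumed definition of total rooms
)

# Fit each model and extract its R-squared
r2 <- sapply(formulas, function(f) summary(lm(f, data = toy_df))$r.squared)
print(round(r2, 3))
```

Keeping the formulas in a named list makes it easy to add another predictor, such as `Price ~ Living.Space`, and compare all of the values side by side.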