2024-10-28
Overview
- Simple Linear Regression
- US Housing Data
What is a linear regression?
- Purpose: Models the relationship between two variables by fitting a straight line
- Interpretation: The slope tells us how much one variable changes for each one-unit increase in another
- Objective: Find the best-fitting line, i.e. the one that minimizes the sum of squared differences between observed and predicted values
Linear Regression Equation
The equation of a simple linear regression line is given by: \[
y = \beta_0 + \beta_1 x + \epsilon
\]
- \(y\): Dependent variable (e.g., Price)
- \(x\): Independent variable (e.g., Living Space)
- \(\beta_0\): Intercept of the line
- \(\beta_1\): Slope of the line (change in \(y\) for a one-unit increase in \(x\))
- \(\epsilon\): Error term (the part of \(y\) the line does not explain; its estimate for each observation is the residual, the difference between observed and predicted values)
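As a minimal sketch of what the intercept and slope mean in practice, the least-squares estimates can be computed directly and compared with `lm()`. The data here is simulated purely for illustration: the names `living_space` and `price` and every number in the simulation are assumptions, not values from the housing data.

```r
# Simulated data for illustration only (not the US housing data)
set.seed(1)
living_space <- runif(50, min = 500, max = 4000)             # hypothetical square footage
price <- 50000 + 200 * living_space + rnorm(50, sd = 40000)  # hypothetical prices

# Closed-form least-squares estimates for simple linear regression
beta1_hat <- cov(living_space, price) / var(living_space)  # slope
beta0_hat <- mean(price) - beta1_hat * mean(living_space)  # intercept

# lm() fits the same line, so its coefficients should match
fit <- lm(price ~ living_space)
manual <- c(beta0_hat, beta1_hat)
print(manual)
print(unname(coef(fit)))
```

The slope is the covariance of \(x\) and \(y\) divided by the variance of \(x\), and the intercept is chosen so the line passes through the point of means.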
Dataset Summary
# Keep only the four columns used in this presentation
selected_columns <- df %>%
  select(Price, Living.Space, Beds, Baths)
summary(selected_columns)
##      Price           Living.Space        Beds            Baths
##  Min.   :    1800   Min.   :    2   Min.   : 1.000   Min.   : 1.000
##  1st Qu.:  265000   1st Qu.: 1200   1st Qu.: 3.000   1st Qu.: 2.000
##  Median :  399900   Median : 1639   Median : 3.000   Median : 2.000
##  Mean   :  622777   Mean   : 1902   Mean   : 3.172   Mean   : 2.467
##  3rd Qu.:  674990   3rd Qu.: 2265   3rd Qu.: 4.000   3rd Qu.: 3.000
##  Max.   :38000000   Max.   :74340   Max.   :54.000   Max.   :66.000
Price vs. Square Feet - plotly
First Linear Regression
- The first regression was hard to evaluate: the full dataset has a very large number of listings, and a few extreme outliers (for example, a maximum price of 38,000,000 and a maximum living space of 74,340) stretch the axes
- Most points are crowded into the bottom-left corner of the plot, which makes it hard to judge how well the line fits
- To make the plot readable, we work with a smaller subset of the data
- We randomly choose 100 rows using dplyr
Randomly Choosing Data
set.seed(7)  # fix the random seed so the same 100 rows are drawn each time
sample_df <- sample_n(df, 100)
selected_columns <- sample_df %>%
  select(Price, Living.Space, Beds, Baths)
summary(selected_columns)
##      Price          Living.Space       Beds           Baths
##  Min.   :  97500   Min.   : 396   Min.   : 1.00   Min.   :1.0
##  1st Qu.: 257400   1st Qu.:1245   1st Qu.: 2.00   1st Qu.:2.0
##  Median : 369450   Median :1776   Median : 3.00   Median :3.0
##  Mean   : 605582   Mean   :1990   Mean   : 3.27   Mean   :2.7
##  3rd Qu.: 591175   3rd Qu.:2416   3rd Qu.: 4.00   3rd Qu.:3.0
##  Max.   :5550000   Max.   :6076   Max.   :11.00   Max.   :8.0
Price vs. Square Feet - ggplot

Second Linear Regression
- As the previous slide shows, the regression of Price on Square Feet is much easier to read on the 100-row sample
- The fitted line now follows the overall trend in the points
- It is not perfect, but it gives a reasonable rough prediction of price from square footage
- Many other variables influence price, so square footage alone cannot give a precise prediction
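To make "prediction for price given the square footage" concrete, here is a small sketch using `predict()` on a fitted model. It uses a simulated stand-in for the sample (the data frame `toy_df`, the model `toy_mod`, and all simulated numbers are assumptions for illustration, not the actual housing sample).

```r
# Simulated stand-in for the 100-row sample (illustration only)
set.seed(2)
toy_df <- data.frame(Living.Space = runif(100, min = 400, max = 6000))
toy_df$Price <- 100000 + 250 * toy_df$Living.Space + rnorm(100, sd = 150000)

toy_mod <- lm(Price ~ Living.Space, data = toy_df)

# Predicted prices for two hypothetical homes of 1,500 and 2,500 square feet
new_homes <- data.frame(Living.Space = c(1500, 2500))
predicted <- predict(toy_mod, newdata = new_homes)

# The gap between the two predictions is the slope times the 1,000 sq ft difference
slope <- unname(coef(toy_mod)["Living.Space"])
print(predicted)
print(slope * 1000)
```

With the real data, the same call on `mod` with a `Living.Space` column in `newdata` would give the model's price estimate for a house of that size.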
R-Squared
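# Fit Price as a linear function of Living.Space on the 100-row sample
mod <- lm(Price ~ Living.Space, data = sample_df)
model_summary <- summary(mod)
r_squared <- model_summary$r.squared  # proportion of variance explained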
- The previous slides judged the fit by eye, but there is a way to measure it
- \(R^2\) is the proportion of the variance in the dependent variable that is explained by the independent variable
- It ranges from 0 to 1: a value of 1 means the model explains all of the variance, and 0 means it explains none
- For this model, the \(R^2\) value is 0.373
- About 37.3% of the variance in price is explained by square footage alone. That is a noticeable share, but it leaves most of the variance to other factors
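The definition above corresponds to \(R^2 = 1 - SS_{res} / SS_{tot}\), where \(SS_{res}\) is the sum of squared residuals and \(SS_{tot}\) is the sum of squared deviations of \(y\) from its mean. The sketch below checks this by hand against `summary()$r.squared`. The data is simulated for illustration, so `toy_df`, `toy_mod`, and the resulting value are assumptions and will not reproduce 0.373.

```r
# Simulated stand-in for the 100-row sample (illustration only)
set.seed(3)
toy_df <- data.frame(Living.Space = runif(100, min = 400, max = 6000))
toy_df$Price <- 100000 + 250 * toy_df$Living.Space + rnorm(100, sd = 300000)

toy_mod <- lm(Price ~ Living.Space, data = toy_df)

ss_res <- sum(residuals(toy_mod)^2)                   # variation the line leaves unexplained
ss_tot <- sum((toy_df$Price - mean(toy_df$Price))^2)  # total variation in Price
r2_manual <- 1 - ss_res / ss_tot

r2_lm <- summary(toy_mod)$r.squared

# With a single predictor, R^2 also equals the squared correlation of x and y
r2_cor <- cor(toy_df$Living.Space, toy_df$Price)^2

print(c(manual = r2_manual, lm = r2_lm, cor = r2_cor))
```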
Price vs. Bedrooms, Price vs. Bathrooms, Price vs. Bedrooms and Bathrooms - ggplot

Last Linear Regression
- The previous slide showed three regressions in one crowded plot: price against the number of bathrooms, the number of bedrooms, and the two counts combined
- For bathrooms alone, \(R^2\) is 0.319
- For bedrooms alone, \(R^2\) is 0.203
- For bedrooms and bathrooms combined (total rooms), \(R^2\) is 0.293
- Interestingly, the number of bathrooms is the strongest predictor of the three. However, it is still less predictive than square footage (\(R^2\) = 0.373).
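One way such a comparison could be computed is to fit one single-predictor model per variable and collect the \(R^2\) values. This is only a sketch on simulated data: `toy_df`, the `Rooms` column, and the simulated coefficients are assumptions, and defining "total rooms" as `Beds + Baths` is a guess at how the combined predictor was built.

```r
# Simulated data for illustration only (not the US housing sample)
set.seed(4)
n <- 100
toy_df <- data.frame(
  Beds  = sample(1:6, n, replace = TRUE),
  Baths = sample(1:4, n, replace = TRUE)
)
toy_df$Price <- 150000 + 40000 * toy_df$Beds + 90000 * toy_df$Baths +
  rnorm(n, sd = 120000)

# Assumed definition of the combined predictor: bedrooms plus bathrooms
toy_df$Rooms <- toy_df$Beds + toy_df$Baths

# Fit Price ~ <predictor> for each candidate and extract R^2
predictors <- c("Baths", "Beds", "Rooms")
r2_by_predictor <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "Price"), data = toy_df)
  summary(fit)$r.squared
})

print(round(r2_by_predictor, 3))
```

Because each model has a single predictor, the values are directly comparable: the predictor with the largest \(R^2\) explains the most variance in price on its own.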