HW3

2024-10-28

Overview

Simple Linear Regression
US Housing Data

What is a linear regression?

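Purpose: Models the relationship between two variables by fitting a straight line
Interpretation: The slope tells us how much the dependent variable changes, on average, for each one-unit increase in the independent variable
Objective: Find the best-fitting line, meaning the line that minimizes the sum of squared differences between observed and predicted values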

Linear Regression Equation

The equation of a simple linear regression line is given by: \[ y = \beta_0 + \beta_1 x + \epsilon \]

\(y\): Dependent variable (e.g., Price)
\(x\): Independent variable (e.g., Living Space)
\(\beta_0\): Intercept of the line
\(\beta_1\): Slope of the line (change in \(y\) for a one-unit increase in \(x\))
\(\epsilon\): Error term (the part of \(y\) that the line does not explain; for a fitted line it is estimated by the residual, the observed value minus the predicted value)
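
To make the roles of \(\beta_0\) and \(\beta_1\) concrete, here is a minimal sketch that estimates both by hand and checks them against `lm()`. It uses the standard least-squares formulas \(\hat\beta_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\). The data are synthetic and the variable names `living_space` and `price` are placeholders, not the housing data set.

```r
# Synthetic data for illustration only (not the housing data set).
set.seed(1)
living_space <- runif(50, min = 500, max = 4000)             # hypothetical square footage
price <- 50000 + 200 * living_space + rnorm(50, sd = 40000)  # hypothetical prices

# Closed-form least-squares estimates.
beta1_hat <- cov(living_space, price) / var(living_space)  # slope
beta0_hat <- mean(price) - beta1_hat * mean(living_space)  # intercept

# lm() fits the same line, so its coefficients should match.
fit <- lm(price ~ living_space)
print(coef(fit))
print(c(intercept = beta0_hat, slope = beta1_hat))
```

The two printed vectors should agree up to floating-point error, which shows that `lm()` is solving the same least-squares problem the equation describes.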

Dataset Summary

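library(dplyr)  # provides %>% and select()

# df holds the housing data (the loading step is not shown)
selected_columns <- df %>%
  select(Price, Living.Space, Beds, Baths)
summary(selected_columns)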

##      Price           Living.Space        Beds            Baths       
##  Min.   :    1800   Min.   :    2   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:  265000   1st Qu.: 1200   1st Qu.: 3.000   1st Qu.: 2.000  
##  Median :  399900   Median : 1639   Median : 3.000   Median : 2.000  
##  Mean   :  622777   Mean   : 1902   Mean   : 3.172   Mean   : 2.467  
##  3rd Qu.:  674990   3rd Qu.: 2265   3rd Qu.: 4.000   3rd Qu.: 3.000  
##  Max.   :38000000   Max.   :74340   Max.   :54.000   Max.   :66.000

Price vs. Square Feet - plotly

First Linear Regression

The first regression was hard to judge: with this many rows the points crowd into the bottom-left corner of the plot, and a few extreme outliers stretch the axes and pull on the fitted line
The dataset summary shows how extreme they are: Living Space ranges from 2 to 74,340 and Price reaches 38,000,000
There is simply too much data to read the plot, so we need to reduce it
We will randomly choose 100 rows using dplyr to make it more manageable
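
The effect of a single extreme point on the fitted line can be sketched as follows. This is an illustration on synthetic data, not the housing data: the data frames `clean` and `with_outlier` and the extreme listing (70,000 square feet at an ordinary price) are invented for the example.

```r
# Synthetic data: price rises by about 200 per square foot.
set.seed(2)
space <- runif(200, min = 500, max = 4000)
price <- 50000 + 200 * space + rnorm(200, sd = 40000)
clean <- data.frame(space = space, price = price)

# Add one hypothetical extreme listing: huge floor area, ordinary price.
with_outlier <- rbind(clean, data.frame(space = 70000, price = 500000))

# Fit the same model to both data sets and compare the slopes.
slope_clean <- coef(lm(price ~ space, data = clean))[["space"]]
slope_outlier <- coef(lm(price ~ space, data = with_outlier))[["space"]]

print(c(clean = slope_clean, with_outlier = slope_outlier))
```

Because the added point sits far from the rest of the data along the x-axis, it has high leverage, and the slope fitted with it is much flatter than the slope fitted without it.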

Randomly Choosing Data

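set.seed(7)                     # fix the seed so the same rows are drawn on every run
sample_df <- sample_n(df, 100)  # draw 100 rows at random, without replacement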
selected_columns <- sample_df %>%
  select(Price, Living.Space, Beds, Baths)
summary(selected_columns)

##      Price          Living.Space       Beds           Baths    
##  Min.   :  97500   Min.   : 396   Min.   : 1.00   Min.   :1.0  
##  1st Qu.: 257400   1st Qu.:1245   1st Qu.: 2.00   1st Qu.:2.0  
##  Median : 369450   Median :1776   Median : 3.00   Median :3.0  
##  Mean   : 605582   Mean   :1990   Mean   : 3.27   Mean   :2.7  
##  3rd Qu.: 591175   3rd Qu.:2416   3rd Qu.: 4.00   3rd Qu.:3.0  
##  Max.   :5550000   Max.   :6076   Max.   :11.00   Max.   :8.0
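
The role of `set.seed()` in the sampling step can be checked with a small base R sketch. The data frame `homes` below is a synthetic stand-in, and `sample()` on row indices is used here in place of `sample_n()` so the example needs no packages; it is not the code that produced the summary above.

```r
# Hypothetical stand-in for the housing data frame (values are synthetic).
set.seed(42)
homes <- data.frame(
  Price = round(runif(1000, 1e5, 1e6)),
  Living.Space = round(runif(1000, 500, 4000))
)

# Draw 100 row indices without replacement, twice, from the same seed.
set.seed(7)
rows_a <- sample(nrow(homes), 100)
set.seed(7)
rows_b <- sample(nrow(homes), 100)

# Same seed, so the two draws select exactly the same rows.
print(identical(rows_a, rows_b))

sample_a <- homes[rows_a, ]
print(nrow(sample_a))
```

Resetting the seed before each draw is what makes the 100-row sample, and every number computed from it, reproducible.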

Price vs. Square Feet - ggplot

Second Linear Regression

As the previous slide showed, the regression of Price on Square Feet is much easier to read once we limit the data to 100 rows
The fitted line now visibly tracks the general trend in the points
It is not perfect, but it gives a rough prediction of price from the square footage of a house
Many other variables influence price, so square footage alone cannot give a precise prediction

R-Squared

# Regress Price on Living.Space using the 100-row sample
mod <- lm(Price ~ Living.Space, data = sample_df)
model_summary <- summary(mod)
r_squared <- model_summary$r.squared  # proportion of variance explained

The previous slide judged the regression by eye, but there is a way to measure its quality
\(R^2\) is the proportion of variance in the dependent variable that is explained by the independent variable
It ranges from 0 to 1: a value near 1 means the line explains almost all of the variance, and a value near 0 means it explains almost none
For this model, the \(R^2\) value is 0.373
About 37.3% of the variance in price is explained by square footage alone. That is a noticeable share, but the remaining 62.7% points to other factors
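
As a check on the definition, \(R^2\) can be computed by hand as \(1 - SS_{res} / SS_{tot}\), where \(SS_{res}\) is the sum of squared residuals and \(SS_{tot}\) is the total sum of squares around the mean of \(y\). The sketch below does this on synthetic data (the vectors `x` and `y` are placeholders, not the housing sample) and compares the result with the value reported by `summary()`.

```r
# Synthetic data for illustration only.
set.seed(3)
x <- runif(100, min = 500, max = 4000)
y <- 50000 + 200 * x + rnorm(100, sd = 150000)

fit <- lm(y ~ x)

# R-squared from its definition.
ss_res <- sum(residuals(fit)^2)   # variation left unexplained by the line
ss_tot <- sum((y - mean(y))^2)    # total variation in y
r2_manual <- 1 - ss_res / ss_tot

# The value lm() reports, and the squared correlation,
# which equals R-squared in simple linear regression.
r2_lm <- summary(fit)$r.squared
r2_cor <- cor(x, y)^2

print(c(manual = r2_manual, lm = r2_lm, cor_squared = r2_cor))
```

All three numbers should match, which is why \(R^2\) can be read directly as the share of variance explained.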

Price vs. Bedrooms, Price vs. Bathrooms, Price vs. Bedrooms and Bathrooms - ggplot

Last Linear Regression

The previous slide showed a cluttered plot of three regressions predicting price: from the number of bathrooms, from the number of bedrooms, and from bathrooms and bedrooms combined
For just bathrooms \(R^2\) is 0.319
For just bedrooms \(R^2\) is 0.203
For total rooms \(R^2\) is 0.293
Interestingly, the number of bathrooms is the strongest predictor of the three. However, it is still less predictive than square footage (\(R^2\) = 0.373).
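
A comparison like this can be scripted by fitting one simple regression per predictor and collecting the \(R^2\) values. The sketch below is an assumption-heavy illustration: the data frame `homes` is synthetic, the helper `r2_for()` is hypothetical, and `Rooms` is defined as `Beds + Baths`, which is only a guess at how the combined predictor was built. The numbers it prints will not match the ones above.

```r
# Synthetic stand-in for the 100-row sample (not the real housing data).
set.seed(4)
n <- 100
homes <- data.frame(Living.Space = runif(n, min = 500, max = 4000))
homes$Beds  <- pmax(1, round(homes$Living.Space / 700 + rnorm(n)))
homes$Baths <- pmax(1, round(homes$Living.Space / 900 + rnorm(n, sd = 0.7)))
homes$Price <- 50000 + 200 * homes$Living.Space + rnorm(n, sd = 150000)
homes$Rooms <- homes$Beds + homes$Baths  # assumed definition of "total rooms"

# Hypothetical helper: R-squared of Price regressed on a single predictor.
r2_for <- function(predictor, data) {
  fit <- lm(reformulate(predictor, response = "Price"), data = data)
  summary(fit)$r.squared
}

predictors <- c("Living.Space", "Baths", "Beds", "Rooms")
r2_table <- sapply(predictors, r2_for, data = homes)

# Strongest single predictor first.
print(sort(r2_table, decreasing = TRUE))
```

Putting the values in one sorted vector makes the ranking of predictors easy to read without comparing separate plots.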