Sacramento Housing Price Linear Regression

2024-03-21

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

data(Sacramento)
str(Sacramento)

## 'data.frame':    932 obs. of  9 variables:
##  $ city     : Factor w/ 37 levels "ANTELOPE","AUBURN",..: 34 34 34 34 34 34 34 34 29 31 ...
##  $ zip      : Factor w/ 68 levels "z95603","z95608",..: 64 52 44 44 53 65 66 49 24 25 ...
##  $ beds     : int  2 3 2 2 2 3 3 3 2 3 ...
##  $ baths    : num  1 1 1 1 1 1 2 1 2 2 ...
##  $ sqft     : int  836 1167 796 852 797 1122 1104 1177 941 1146 ...
##  $ type     : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ price    : int  59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
##  $ latitude : num  38.6 38.5 38.6 38.6 38.5 ...
##  $ longitude: num  -121 -121 -121 -121 -121 ...

sum(is.na(Sacramento))

## [1] 0

Introduction

For this assignment I wanted to cover a topic that interests me: housing prices and the variables that drive them. I attempt to predict home prices using basic linear regression with a few variables (price, square footage, and zip code), along with a 3D plot of beds, baths, and price.

Building the Linear Regression Model for Price and Sqft

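set.seed(123)
# Hold out 20% of the rows for testing; createDataPartition balances the split on price
trainIndex <- createDataPartition(Sacramento$price, p = .8, list = FALSE, times = 1)
data_train <- Sacramento[trainIndex, ]
data_test <- Sacramento[-trainIndex, ]

# Train linear regression model with price and square footage
model <- train(price ~ sqft, data = data_train, method = "lm")
# Print numbers in fixed rather than scientific notation
options(scipen = 999)

# Make predictions
predictions <- predict(model, newdata = data_test)

# Calculate RMSE
RMSE <- sqrt(mean((predictions - data_test$price)^2))
RMSE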

## [1] 77113.89

plot_data <- data.frame(Actual = data_test$price, Predicted = predictions)
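As a cross-check on the hand-computed RMSE, here is a minimal sketch using caret's own `RMSE()` and `postResample()` helpers. It assumes the same seed and split arguments as the chunk above. It fits with plain `lm()` rather than `train()` to keep the example short, which should give the same final coefficients. The names `split_idx`, `train_df`, `test_df`, `fit`, and `pred` are illustrative and do not appear in the report.

```r
library(caret)

data(Sacramento)

# Recreate the 80/20 split (same seed and arguments as the report)
set.seed(123)
split_idx <- createDataPartition(Sacramento$price, p = 0.8, list = FALSE, times = 1)
train_df <- Sacramento[split_idx, ]
test_df  <- Sacramento[-split_idx, ]

# Assumption: a plain lm() fit stands in for train(..., method = "lm")
fit  <- lm(price ~ sqft, data = train_df)
pred <- predict(fit, newdata = test_df)

# Manual RMSE, computed the same way as in the report
rmse_manual <- sqrt(mean((pred - test_df$price)^2))

# caret equivalents
rmse_caret <- RMSE(pred, test_df$price)
metrics    <- postResample(pred = pred, obs = test_df$price)

print(rmse_manual)
print(metrics)  # named vector with RMSE, Rsquared, and MAE
```

`postResample()` is convenient here because it reports the test-set R-squared and MAE alongside the RMSE in one call.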

Linear Regression Model graph For SQFT and Price

## `geom_smooth()` using formula = 'y ~ x'
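The plotting chunk itself is not shown in this report. The sketch below is one plausible way to draw such a figure, assuming it plots actual against predicted test-set prices with ggplot2 and a `geom_smooth()` layer (the message above suggests `geom_smooth()` was called without an explicit formula). The dashed reference line, the labels, and the helper names `idx`, `fit`, and `test_df` are assumptions added for illustration.

```r
library(caret)    # used here only for the Sacramento data and the split
library(ggplot2)

data(Sacramento)
set.seed(123)
idx     <- createDataPartition(Sacramento$price, p = 0.8, list = FALSE, times = 1)
fit     <- lm(price ~ sqft, data = Sacramento[idx, ])
test_df <- Sacramento[-idx, ]

# Same shape as the plot_data frame built in the report
plot_data <- data.frame(
  Actual    = test_df$price,
  Predicted = unname(predict(fit, newdata = test_df))
)

p <- ggplot(plot_data, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5) +
  # Fitted trend through the points; omitting `formula` triggers the y ~ x message
  geom_smooth(method = "lm", se = FALSE) +
  # Dashed line marks perfect predictions (Predicted == Actual)
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Actual price (USD)", y = "Predicted price (USD)")

print(p)
```

Points far from the dashed line are homes whose price the single-predictor model misses by a wide margin.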

Regression Formula for the Price VS Sqft

The linear regression equation: $\text{Price} = \beta_{\text{sqft}} \times \text{sqft} + \beta_0$

Price: This is the predicted price of the property.

$\beta_{\text{sqft}}$: This is the coefficient for the predictor variable sqft. It represents the change in the predicted price for each additional square foot.

sqft: This is the square footage of the property, which is the predictor variable used to predict the price.

$\beta_0$: This is the intercept term of the regression equation. It represents the predicted price when the square footage is zero.

Regression with Coefficients Price vs Sqft

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -237741  -54886  -12563   37993  598733 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 11638.268   7959.589   1.462               0.144    
## sqft          140.781      4.382  32.127 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85810 on 745 degrees of freedom
## Multiple R-squared:  0.5808, Adjusted R-squared:  0.5802 
## F-statistic:  1032 on 1 and 745 DF,  p-value: < 0.00000000000000022
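The chunk that printed this table is hidden. The `.outcome ~ .` formula in the `Call:` line is how caret labels its internal `lm` fit, so a summary like this can probably be reproduced from the `finalModel` element of the `train` object. The sketch below assumes the same seed and split as earlier; `fit_caret`, `coefs`, and `ci` are illustrative names.

```r
library(caret)

data(Sacramento)
set.seed(123)
idx      <- createDataPartition(Sacramento$price, p = 0.8, list = FALSE, times = 1)
train_df <- Sacramento[idx, ]

# train() resamples internally, then refits lm on the full training set
fit_caret <- train(price ~ sqft, data = train_df, method = "lm")

# Coefficient table, residual summary, and R-squared for the final lm fit
print(summary(fit_caret$finalModel))

# Pull the estimates out as a named numeric vector
coefs <- coef(fit_caret$finalModel)
print(coefs)

# 95% confidence intervals for the intercept and the sqft slope
ci <- confint(fit_caret$finalModel)
print(ci)
```

Working from `coefs` rather than retyping the numbers avoids transcription errors when the estimates are plugged into the regression equation.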

Regressions with Coefficients Continued

Now that we have the coefficients, we can plug them into our equation: $\text{Price} = 140.781 \times \text{sqft} + 11638.268$. The equation shows that each additional square foot increases the predicted price by \$140.78 on average.

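The standard errors of the estimates are 7959.589 for the intercept $\beta_0$ and 4.382 for the sqft coefficient. The sqft coefficient is highly significant (p < 0.001), while the intercept is not statistically distinguishable from zero (p = 0.144). Overall, the model explains about 58% of the variance in price ($R^2 = 0.5808$).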
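As a worked example using the estimates in the coefficient table, consider a hypothetical 1,500 square foot home (the size is chosen only for illustration):

$$
\widehat{\text{Price}} = 11638.268 + 140.781 \times 1500 = 11638.268 + 211171.5 = 222809.768
$$

The standard error of the slope also gives a rough 95% interval for the price per square foot, using the normal approximation with 1.96 standard errors:

$$
140.781 \pm 1.96 \times 4.382 \approx 140.781 \pm 8.589 \quad\Rightarrow\quad (132.19,\ 149.37)
$$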

Building the Linear Regression Model for Price and ZIP

set.seed(123)
trainIndex <- createDataPartition(Sacramento$price, p = .8, list = FALSE, times = 1)
data_train <- Sacramento[trainIndex, ]
data_test <- Sacramento[-trainIndex, ]
# Train linear regression model with price and zip code
model_zip <- train(price ~ zip, data = data_train, method = "lm")

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

options(scipen = 999)

# Make predictions
predictions_zip <- predict(model_zip, newdata = data_test)

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

# Calculate RMSE
RMSE_zip <- sqrt(mean((predictions_zip - data_test$price)^2))
RMSE_zip

## [1] 95259.28

# Create plot data frame
plot_data_zip <- data.frame(Actual = data_test$price, Predicted = predictions_zip)
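The `rank-deficient fit` warnings most likely come from sparse zip codes. By default `train()` refits the model on bootstrap resamples of the training data, and a zip level that is missing from a resample (or from the training set itself) leaves its coefficient inestimable. The sketch below is one way to check how thinly the zip codes are populated, assuming the same seed and split as above. The cutoff of two rows for "sparse" is arbitrary, and all object names are illustrative.

```r
library(caret)

data(Sacramento)
set.seed(123)
idx      <- createDataPartition(Sacramento$price, p = 0.8, list = FALSE, times = 1)
train_df <- Sacramento[idx, ]
test_df  <- Sacramento[-idx, ]

# Training rows per zip level (levels with no rows are still listed, with count 0)
train_counts <- table(train_df$zip)

# Levels with no training rows: their coefficients cannot be estimated at all
empty_levels <- names(train_counts)[train_counts == 0]

# Levels with only one or two rows: likely to drop out of some bootstrap resamples
sparse_levels <- names(train_counts)[train_counts > 0 & train_counts <= 2]

# Test rows whose zip code never appears in the training set
n_test_unseen <- sum(test_df$zip %in% empty_levels)

print(empty_levels)
print(sparse_levels)
print(n_test_unseen)
```

If many levels turn out to be empty or sparse, options include pooling rare zip codes into an "other" level or switching to a coarser location variable such as `city`.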

Regression Formula for the Price VS ZIP

The linear regression formula: $\text{Price} = \beta_0 + \sum_{z} \beta_{z} \times \text{zip}_{z}$

$\beta_{z}$: Because zip is a factor rather than a number, the model fits one coefficient for each zip code other than the reference level. Each coefficient represents the difference in predicted price between that zip code and the reference zip code.

$\text{zip}_{z}$: This is an indicator for the zip code of the property. It equals 1 if the property is in zip code $z$ and 0 otherwise.

$\beta_0$: This is the intercept term of the regression equation. It represents the predicted price for a property in the reference zip code, which under R's default treatment coding is the first factor level (z95603).
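Since `zip` is stored as a factor with 68 levels (see the `str()` output), `lm()` does not multiply a single coefficient by a numeric zip code. It expands the factor into 0/1 indicator columns instead. The sketch below inspects that expansion with `model.matrix()`, assuming R's default treatment contrasts. The name `X` is illustrative.

```r
library(caret)  # provides the Sacramento data

data(Sacramento)

# Build the design matrix that a price ~ zip linear model is fitted on
X <- model.matrix(price ~ zip, data = Sacramento)

# One intercept column plus one indicator column per non-reference zip code
print(dim(X))

# First few column names: "(Intercept)" followed by the zip indicators
print(head(colnames(X), 4))

# The reference level is absorbed into the intercept and has no column of its own
print(levels(Sacramento$zip)[1])
```

Each row of `X` has a 1 in the intercept column and at most one other 1, marking the zip code that the property belongs to.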

Results of the Linear Regression Model For ZIP and Price

## `geom_smooth()` using formula = 'y ~ x'