# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read in data
homes <- read.csv("D:/DataSet/Homes.csv")
# Select price as response variable
y <- homes$price
# Select categorical explanatory variable (in_sf is a 0/1 indicator; assumed here that 1 = inside SF)
x <- homes$in_sf
# Null hypothesis: avg price is same inside and outside SF
# Alternative: avg price differs
# ANOVA Test
model <- aov(y ~ x)
summary(model)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)
## x             1 2.418e+14 2.418e+14   32.26 2.32e-08 ***
## Residuals   490 3.674e+15 7.498e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Pr(>F) is 2.32e-08, which is << 0.05, so we reject the null
# There is strong evidence that average price differs inside vs outside SF
# Group means (assuming in_sf = 1 codes homes inside SF): average price is much lower inside SF
tapply(y, x, mean)
##       0       1
## 2787579 1379719
# For someone looking to buy or sell a home, location in SF vs not in SF is a major factor influencing home price
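With only two groups, the one-way ANOVA above is equivalent to a pooled-variance two-sample t-test: the F statistic is the square of the t statistic and the p-values match. The sketch below checks this on simulated data; the data frame `sim` and its values are made up for illustration and are not the Homes data.

```r
# Simulated stand-in for the homes data; all values here are hypothetical.
set.seed(1)
sim <- data.frame(
  in_sf = rep(c(0, 1), each = 50),
  price = c(rnorm(50, mean = 2.8e6, sd = 1e6),
            rnorm(50, mean = 1.4e6, sd = 1e6))
)

# Make the 0/1 indicator an explicit factor so it is clearly categorical.
sim$in_sf <- factor(sim$in_sf, levels = c(0, 1), labels = c("not_sf", "sf"))

fit_aov <- aov(price ~ in_sf, data = sim)
fit_t   <- t.test(price ~ in_sf, data = sim, var.equal = TRUE)

# Pull the F statistic and its p-value out of the ANOVA table.
aov_table <- summary(fit_aov)[[1]]
f_value   <- aov_table[["F value"]][1]
p_aov     <- aov_table[["Pr(>F)"]][1]

# With two groups, F equals the squared pooled t statistic.
print(all.equal(f_value, unname(fit_t$statistic)^2))
print(all.equal(p_aov, fit_t$p.value, tolerance = 1e-6))
```

Passing `data = sim` and using column names in the formula, rather than free-standing vectors such as `y` and `x`, also makes the output labels easier to read.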
# Select continuous explanatory variable: standardize sqft (scale() centers to mean 0 and rescales to SD 1)
homes$sqft_cent <- scale(homes$sqft)
# Regression on just sqft
model2 <- lm(y ~ homes$sqft_cent)
summary(model2)
##
## Call:
## lm(formula = y ~ homes$sqft_cent)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -7625925  -871757      672   539482 13576504
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      2020696      89063   22.69   <2e-16 ***
## homes$sqft_cent  2020040      89154   22.66   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1976000 on 490 degrees of freedom
## Multiple R-squared: 0.5117, Adjusted R-squared: 0.5107
## F-statistic: 513.4 on 1 and 490 DF, p-value: < 2.2e-16
# R-squared is 0.5117, so sqft explains about 51% of the variance in price
# F-statistic (513.4) vs the intercept-only model is very high, p << 0.05, so we reject the null
# Diagnostic plots look reasonable
plot(model2)
# coef is about 2,020,000, p << 0.05, so positive relationship between sqft and price
# Each SD increase in sqft is associated with a ~$2.02 million higher price
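Because sqft was standardized, the slope is in dollars per standard deviation of sqft, and the intercept is the mean price. Dividing the slope by the SD of sqft gives dollars per square foot. A minimal sketch of that relationship on simulated data follows; `sim`, its column names, and the numbers in it are hypothetical.

```r
# Simulated data; the coefficients used to generate it are made up.
set.seed(2)
sim <- data.frame(sqft = runif(200, min = 500, max = 5000))
sim$price <- 2e5 + 800 * sim$sqft + rnorm(200, sd = 3e5)

# scale() returns a one-column matrix; as.numeric() keeps a plain vector.
sim$sqft_std <- as.numeric(scale(sim$sqft))

fit_raw <- lm(price ~ sqft, data = sim)
fit_std <- lm(price ~ sqft_std, data = sim)

slope_raw <- unname(coef(fit_raw)["sqft"])
slope_std <- unname(coef(fit_std)["sqft_std"])

# Slope per SD = slope per square foot * SD of sqft.
print(all.equal(slope_std, slope_raw * sd(sim$sqft)))

# With a centered predictor, the intercept is the mean response.
print(all.equal(unname(coef(fit_std)["(Intercept)"]), mean(sim$price)))

# Standardizing changes the units of the slope, not the quality of the fit.
print(all.equal(summary(fit_raw)$r.squared, summary(fit_std)$r.squared))
```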
# Adding location (x) to model
model3 <- lm(y ~ homes$sqft_cent + x)
summary(model3)
##
## Call:
## lm(formula = y ~ homes$sqft_cent + x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -7422804  -705393  -151511   689515 11922366
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      3049472     116749   26.12   <2e-16 ***
## homes$sqft_cent  2126171      79085   26.89   <2e-16 ***
## x               -1888648     158645  -11.90   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1741000 on 489 degrees of freedom
## Multiple R-squared: 0.6214, Adjusted R-squared: 0.6198
## F-statistic: 401.3 on 2 and 489 DF, p-value: < 2.2e-16
# R-squared increases from 0.5117 to 0.6214 with location added
# Location has coeff of about -1.89e6, p << 0.05
# Holding sqft fixed, being in SF (assuming in_sf = 1 means inside SF) lowers predicted price by ~$1.89 million compared to outside SF
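Whether adding location improves on the sqft-only model can also be checked with a nested-model F-test via `anova()`. When a single term is added, that F statistic equals the square of the new coefficient's t value. The sketch below uses simulated data; `sim` and the generating coefficients are hypothetical and only loosely mimic the scale of the fits above.

```r
# Simulated data; not the Homes data set.
set.seed(3)
n <- 200
sim <- data.frame(
  sqft_std = rnorm(n),
  in_sf    = rbinom(n, size = 1, prob = 0.5)
)
sim$price <- 3e6 + 2e6 * sim$sqft_std - 1.9e6 * sim$in_sf +
  rnorm(n, sd = 1.5e6)

fit_small <- lm(price ~ sqft_std, data = sim)
fit_big   <- lm(price ~ sqft_std + in_sf, data = sim)

# F-test of the smaller model against the larger one.
cmp <- anova(fit_small, fit_big)
print(cmp)

# For one added term, the F statistic is the squared t value of that term.
t_in_sf <- coef(summary(fit_big))["in_sf", "t value"]
print(all.equal(cmp$F[2], t_in_sf^2))

# R-squared cannot decrease when a predictor is added.
print(summary(fit_big)$r.squared >= summary(fit_small)$r.squared)
```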
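In summary, I found that location in San Francisco and square footage are both strongly associated with home price. Assuming in_sf = 1 marks homes inside SF, homes in SF are significantly less expensive on average than those outside SF in this data set (about $1.38 million versus $2.79 million), and the gap is about $1.89 million after adjusting for square footage. Furthermore, larger homes (in terms of square footage) tend to have higher prices, with each standard deviation increase in square footage associated with an increase of about $2.02 million in predicted price, or about $2.13 million once location is included in the model. The square-footage result aligns with intuition, and both findings can inform recommendations around home buying, selling, and pricing.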