R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

# Load libraries
library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

# Read in data
homes <- read.csv("D:/DataSet/Homes.csv")

# Select price as response variable 
y <- homes$price

# Select categorical explanatory variable  
x <- homes$in_sf

# Null hypothesis: avg price is same inside and outside SF
# Alternative: avg price differs

# ANOVA Test
model <- aov(y ~ x)
summary(model)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## x             1 2.418e+14 2.418e+14   32.26 2.32e-08 ***
## Residuals   490 3.674e+15 7.498e+12                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Pr(>F) is 2.32e-08, which is << 0.05, so we reject the null
# There is strong evidence that average price differs inside vs outside SF
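
With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic equals the squared t statistic. The sketch below illustrates this on simulated data. The variables `group` and `price` and all the numbers in it are hypothetical stand-ins, not the Homes.csv values.

```r
# Simulated stand-in data; NOT the Homes.csv values used in this document
set.seed(1)
group <- rep(c(0, 1), times = c(60, 40))            # hypothetical 0/1 indicator
price <- 2.5e6 - 1.0e6 * group + rnorm(100, sd = 1.5e6)

# One-way ANOVA on the two groups
f_stat <- summary(aov(price ~ factor(group)))[[1]][["F value"]][1]

# Pooled-variance two-sample t-test on the same data
t_stat <- unname(t.test(price ~ factor(group), var.equal = TRUE)$statistic)

# With exactly two groups the two tests agree: F equals t squared
all.equal(f_stat, t_stat^2)
```

If the two groups may have unequal variances, `t.test()` with its default `var.equal = FALSE` runs Welch's test, which does not rely on that assumption.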

# Average price is much lower inside SF (x = 1) than outside (x = 0), assuming in_sf = 1 marks homes in SF:
tapply(y, x, mean)
##       0       1 
## 2787579 1379719
# For someone looking to buy or sell a home, location in SF vs not in SF is a major factor influencing home price
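
A small p-value shows that a difference exists but does not measure its size. One common effect-size summary is the share of total variance attributable to the grouping (eta squared), which can be computed directly from the sums of squares printed by `summary(model)`. The variable names below are illustrative.

```r
# Sums of squares copied from the summary(model) output in this document
ss_between  <- 2.418e14   # "x" row, Sum Sq
ss_residual <- 3.674e15   # "Residuals" row, Sum Sq

# Share of total price variance accounted for by the grouping (eta squared)
eta_sq <- ss_between / (ss_between + ss_residual)
round(eta_sq, 3)
```

On its own, location accounts for roughly 6% of the variance in price, so most of the price variation remains within each group.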

# Standardize the continuous explanatory variable (sqft) to mean 0, SD 1
homes$sqft_cent <- scale(homes$sqft)

# Regression on just sqft 
model2 <- lm(y ~ homes$sqft_cent)
summary(model2)
## 
## Call:
## lm(formula = y ~ homes$sqft_cent)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7625925  -871757      672   539482 13576504 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2020696      89063   22.69   <2e-16 ***
## homes$sqft_cent  2020040      89154   22.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1976000 on 490 degrees of freedom
## Multiple R-squared:  0.5117, Adjusted R-squared:  0.5107 
## F-statistic: 513.4 on 1 and 490 DF,  p-value: < 2.2e-16
# R-squared is 0.512, so sqft explains about 51% of variance in price
# F-statistic vs null model is very high, p << 0.05, so we reject null 

# Diagnostic plots look reasonable
plot(model2)

# coef is 2,020,040, p << 0.05, so positive relationship between sqft and price
# Each SD increase in sqft is associated with about $2.02 million higher price
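
Because the predictor was passed through `scale()`, its slope is expressed per standard deviation rather than per square foot, and the intercept becomes the mean price. The sketch below shows how the two parameterizations relate. It uses simulated data, and `sqft`, `price`, `fit_raw`, and `fit_std` are hypothetical names, not objects from this analysis.

```r
# Simulated stand-in data; NOT the Homes.csv values used in this document
set.seed(2)
sqft  <- runif(200, min = 500, max = 5000)          # hypothetical square footage
price <- 1000 * sqft + rnorm(200, sd = 5e5)         # hypothetical prices

fit_raw <- lm(price ~ sqft)                         # slope in dollars per square foot
sqft_z  <- as.numeric(scale(sqft))                  # mean 0, SD 1
fit_std <- lm(price ~ sqft_z)                       # slope in dollars per SD of sqft

slope_raw <- unname(coef(fit_raw)["sqft"])
slope_std <- unname(coef(fit_std)["sqft_z"])

# The standardized slope is the raw slope multiplied by the SD of sqft
all.equal(slope_std, slope_raw * sd(sqft))

# With a centered predictor, the intercept is the mean of the response
all.equal(unname(coef(fit_std)["(Intercept)"]), mean(price))
```

Dividing a per-SD slope by `sd(sqft)` therefore recovers a dollars-per-square-foot figure, which is often easier to communicate.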

# Adding location (x) to model 
model3 <- lm(y ~ homes$sqft_cent + x)
summary(model3)
## 
## Call:
## lm(formula = y ~ homes$sqft_cent + x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7422804  -705393  -151511   689515 11922366 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3049472     116749   26.12   <2e-16 ***
## homes$sqft_cent  2126171      79085   26.89   <2e-16 ***
## x               -1888648     158645  -11.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1741000 on 489 degrees of freedom
## Multiple R-squared:  0.6214, Adjusted R-squared:  0.6198 
## F-statistic: 401.3 on 2 and 489 DF,  p-value: < 2.2e-16
# R-squared increases from 0.512 to 0.621 with location added
# Location has coeff of about -1.89e6, p << 0.05
# Holding sqft constant, being in SF (x = 1) is associated with ~$1.89 million lower predicted price compared to outside SF
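
Whether the added location term improves on the sqft-only model can also be checked formally with a partial F-test via `anova()`. The sketch below does this on simulated data and fits the models with a `data =` argument, which also lets `predict()` accept new observations. The data frame `sim` and its columns are hypothetical stand-ins, not the Homes.csv data.

```r
# Simulated stand-in data; NOT the Homes.csv values used in this document
set.seed(3)
n <- 150
sim <- data.frame(
  sqft_z = rnorm(n),                                # hypothetical standardized sqft
  in_sf  = rbinom(n, size = 1, prob = 0.5)          # hypothetical 0/1 indicator
)
sim$price <- 3e6 + 2e6 * sim$sqft_z - 1.5e6 * sim$in_sf + rnorm(n, sd = 1.5e6)

# Passing data = keeps coefficient names clean and lets predict() take new data
fit_small <- lm(price ~ sqft_z, data = sim)
fit_big   <- lm(price ~ sqft_z + in_sf, data = sim)

# Partial F-test: does adding in_sf improve on the sqft-only model?
cmp <- anova(fit_small, fit_big)
cmp

# For a single added term, the partial F equals that term's squared t value
t_in_sf <- summary(fit_big)$coefficients["in_sf", "t value"]
all.equal(cmp$F[2], t_in_sf^2)

# Predicted prices for two hypothetical homes of average size
preds <- predict(fit_big, newdata = data.frame(sqft_z = 0, in_sf = c(0, 1)))
preds
```

The difference between the two predictions equals the `in_sf` coefficient, which is the "holding sqft constant" comparison that the coefficient describes.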

In summary, I found that location in San Francisco and square footage are major drivers of home price. Homes in SF (in_sf = 1) are significantly less expensive on average than those outside SF (about $1.38 million versus $2.79 million), and the gap persists after controlling for square footage (about $1.89 million lower predicted price). Furthermore, larger homes tend to have higher prices, with each standard deviation increase in square footage associated with an increase of about $2.02 million in predicted price (about $2.13 million once location is included). The square footage result aligns with intuition, and both findings can inform recommendations around home buying, selling, and pricing.