We selected the “in_sf” column as the binary variable to model. This indicates whether the home is located in San Francisco (1) or not (0). Some key reasons this is an interesting binary variable to model:
It captures where each home is located, which can strongly influence price and other characteristics.
San Francisco real estate is known to be very expensive, so modeling this could reveal insights into how being located in SF vs elsewhere impacts home prices.
There is a reasonable mix of SF/non-SF homes in the data to model this binary split.
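As a quick check of that mix, the class balance can be tabulated once the data is loaded below; a minimal sketch, assuming in_sf is coded 0/1 in the CSV:
# Count homes in vs. outside SF, and the proportion of each
table(homes$in_sf)
prop.table(table(homes$in_sf))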
Split data into train/test sets
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(ggpubr)
library(ggrepel)
# ggplot2 and dplyr are already attached via tidyverse above
# Load data
homes <- read.csv("D:/DataSet/Homes.csv")
# Convert in_sf to a factor for modeling
homes$in_sf <- as.factor(homes$in_sf)
# Split data into train/test
set.seed(123)
train <- sample(1:nrow(homes), size = floor(nrow(homes) * 0.8))
test <- dplyr::setdiff(1:nrow(homes), train)
homes_train <- homes[train,]
homes_test <- homes[test,]
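A quick sanity check on the split, confirming the row counts and that both classes appear in the training set:
# Verify the 80/20 split and class mix
nrow(homes_train)
nrow(homes_test)
table(homes_train$in_sf)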
Build linear regression model on training data to predict price
lm_model <- lm(price ~ in_sf, data = homes_train)
Evaluate model performance on test set by comparing predicted prices to actual prices
# Make predictions on test set
preds <- predict(lm_model, newdata = homes_test)
# Compare predictions to actual values
test_errors <- preds - homes_test$price
# Calculate RMSE
rmse <- sqrt(mean(test_errors^2))
rmse
## [1] 2259247
summary(test_errors)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -16052179    152043    683343    481200   1578321   2582821
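To put the RMSE in context, one option is to compare it against a naive baseline that predicts the mean training price for every test home; a sketch (the baseline value is not reported in the original analysis):
# Baseline: always predict the mean training price
baseline_preds <- rep(mean(homes_train$price), nrow(homes_test))
sqrt(mean((baseline_preds - homes_test$price)^2))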
Interpret model coefficients to determine estimated effect on price of being located in SF vs not
summary(lm_model)
##
## Call:
## lm(formula = price ~ in_sf, data = homes_train)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
##  -2648821  -1222821   -574543    226457  24552179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2947821 216026 13.646 < 2e-16 ***
## in_sf1 -1524278 289388 -5.267 2.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2850000 on 391 degrees of freedom
## Multiple R-squared: 0.06625, Adjusted R-squared: 0.06387
## F-statistic: 27.74 on 1 and 391 DF, p-value: 2.292e-07
The coefficient for in_sf1 is negative (-1,524,278), indicating that homes located in SF in this sample are estimated to sell for about $1.52 million less on average than homes not in SF. Since in_sf is the only predictor, this is simply the difference in group means, not an effect controlling for other factors.
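The uncertainty around this difference can be quantified with a confidence interval; a minimal sketch using base R's confint():
# 95% confidence interval for the SF coefficient
confint(lm_model, "in_sf1", level = 0.95)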
Visualize the model against the data to assess fit
ggplot(homes_train, aes(x = in_sf, y = price)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(title="Linear Model Predicting Home Price Based on SF Location",
x="Located in San Francisco",
y="Sale Price")
## `geom_smooth()` using formula = 'y ~ x'
The plot shows the distribution of prices within each group, though with in_sf as a factor the geom_smooth() layer cannot render a meaningful trend line. Contrary to the common intuition about SF prices, homes in SF in this sample appear less expensive on average than those not in SF, consistent with the negative in_sf1 coefficient.
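Because in_sf is a factor with only two levels, a boxplot may show the group difference more directly than a scatter with a smoother. A sketch:
# Boxplots of price by SF status on the training data
ggplot(homes_train, aes(x = in_sf, y = price)) +
  geom_boxplot() +
  labs(title = "Home Price by SF Location",
       x = "Located in San Francisco",
       y = "Sale Price")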
In summary, we loaded the home price data, converted in_sf to a factor, split the data into training and test sets, trained a linear regression model, evaluated its performance on the test set, interpreted the SF coefficient, and visualized the fit. This provides a basic predictive model for understanding the relationship between a home's location and its sale price.
Here is a logistic regression model using 3 explanatory variables to predict the binary in_sf variable:
# Fit logistic regression model
glm_model <- glm(in_sf ~ bath + sqft + elevation, data = homes_train,
family = binomial())
# Summary of model coefficients
summary(glm_model)
##
## Call:
## glm(formula = in_sf ~ bath + sqft + elevation, family = binomial(),
## data = homes_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7977767 0.3138814 -5.728 1.02e-08 ***
## bath 0.1817289 0.2358171 0.771 0.441
## sqft -0.0000895 0.0002491 -0.359 0.719
## elevation 0.0673266 0.0080990 8.313 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 539.65 on 392 degrees of freedom
## Residual deviance: 354.47 on 389 degrees of freedom
## AIC: 362.47
##
## Number of Fisher Scoring iterations: 6
Interpretation of coefficients:
bath: The positive coefficient suggests that an additional bath is associated with higher log-odds of being located in SF, controlling for other variables, though the effect is not statistically significant (p ≈ 0.44).
sqft: The negative coefficient suggests that larger square footage is associated with lower log-odds of being in SF, controlling for other variables. This fits the intuition that SF homes tend to be smaller, though again the effect is not significant (p ≈ 0.72).
elevation: The positive coefficient indicates that higher elevation is associated with higher log-odds of being in SF, controlling for other variables, consistent with San Francisco's famously hilly terrain (p < 2e-16). The odds-ratio sketch below makes these effects easier to read.
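Log-odds are hard to read directly; exponentiating the coefficients converts them to odds ratios. For example, exp(0.0673) ≈ 1.07, so each one-unit increase in elevation multiplies the odds of being in SF by roughly 1.07, holding bath and sqft constant. A minimal sketch:
# Convert log-odds coefficients to odds ratios
exp(coef(glm_model))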
Standard error and confidence interval for bath coefficient:
# Extract standard error
se <- summary(glm_model)$coefficients[2,2]
# 95% CI
ci_lower <- glm_model$coefficients[2] - 1.96*se
ci_upper <- glm_model$coefficients[2] + 1.96*se
paste0("95% CI for Bath Coefficient: (", round(ci_lower,3), ", ", round(ci_upper,3), ")")
## [1] "95% CI for Bath Coefficient: (-0.28, 0.644)"
This 95% confidence interval spans zero (-0.28 to 0.644), so the data do not provide strong evidence that the number of baths affects the log-odds of being in SF, holding the other variables constant.
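The interval above is a Wald (normal-approximation) interval; confint() on a glm computes profile-likelihood intervals instead, which may differ slightly. A sketch:
# Profile-likelihood 95% CI for the bath coefficient
confint(glm_model, "bath", level = 0.95)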
Overall, this logistic regression model helps reveal how home characteristics like size and elevation relate to the probability of being located in San Francisco vs elsewhere, with elevation emerging as by far the strongest predictor. The coefficients lend intuitive insight into how SF housing differs from surrounding areas.
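Since in_sf is the model's outcome, one way to gauge its predictive value is classification accuracy on the held-out test set; a sketch, with the 0.5 cutoff being an assumption rather than something tuned in the original analysis:
# Predicted probabilities on the test set, thresholded at 0.5
probs <- predict(glm_model, newdata = homes_test, type = "response")
pred_class <- ifelse(probs > 0.5, 1, 0)
# Confusion table and overall accuracy
table(predicted = pred_class, actual = homes_test$in_sf)
mean(pred_class == as.numeric(as.character(homes_test$in_sf)))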
Here is an analysis of using a log transformation on the sqft variable in the logistic regression model:
Scatterplot of sqft vs in_sf before transformation:
ggplot(homes_train, aes(x = sqft, y = in_sf)) +
geom_point() +
labs(title = "Sqft vs SF Location",
x = "Sqft",
y = "In SF")
With a binary outcome the raw scatterplot is hard to read, but larger square footage does appear more common among homes outside SF, consistent with sqft being associated with lower probability of being in SF.
Scatterplot of log(sqft) vs in_sf after log transformation:
ggplot(homes_train, aes(x = log(sqft), y = in_sf)) +
geom_point() +
labs(title = "Log Sqft vs SF Location",
x = "Log Sqft",
y = "In SF")
The log transformation compresses the long right tail of sqft, spreading the points more evenly along the x-axis. However, it is not a drastic change.
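A scatterplot of a factor outcome is hard to judge by eye; an alternative is to plot in_sf as numeric 0/1 and overlay a fitted logistic curve with geom_smooth(). A sketch:
# Recode the factor to 0/1 and overlay a univariate logistic fit
ggplot(homes_train, aes(x = log(sqft), y = as.numeric(as.character(in_sf)))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Log Sqft", y = "P(In SF)")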
Logistic model before sqft transformation:
glm_model <- glm(in_sf ~ bath + sqft + elevation, data = homes_train, family = binomial())
summary(glm_model)
##
## Call:
## glm(formula = in_sf ~ bath + sqft + elevation, family = binomial(),
## data = homes_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7977767 0.3138814 -5.728 1.02e-08 ***
## bath 0.1817289 0.2358171 0.771 0.441
## sqft -0.0000895 0.0002491 -0.359 0.719
## elevation 0.0673266 0.0080990 8.313 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 539.65 on 392 degrees of freedom
## Residual deviance: 354.47 on 389 degrees of freedom
## AIC: 362.47
##
## Number of Fisher Scoring iterations: 6
Logistic model after log(sqft) transformation:
glm_model2 <- glm(in_sf ~ bath + log(sqft) + elevation, data = homes_train, family = binomial())
summary(glm_model2)
##
## Call:
## glm(formula = in_sf ~ bath + log(sqft) + elevation, family = binomial(),
## data = homes_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.343777 2.611862 -2.812 0.00493 **
## bath -0.287237 0.227270 -1.264 0.20628
## log(sqft) 0.887333 0.412673 2.150 0.03154 *
## elevation 0.065845 0.008147 8.082 6.36e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 539.65 on 392 degrees of freedom
## Residual deviance: 349.80 on 389 degrees of freedom
## AIC: 357.8
##
## Number of Fisher Scoring iterations: 6
The log transformation on sqft slightly improves the model fit, reducing the AIC from 362.5 to 357.8. The log(sqft) coefficient is also statistically significant (p ≈ 0.03) where raw sqft was not (p ≈ 0.72), and notably its sign flips from negative to positive.
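Because the two models are not nested (sqft vs. log(sqft)), AIC is the appropriate comparison rather than a likelihood-ratio test; the values can be pulled side by side:
# Side-by-side AIC for the raw-sqft and log-sqft models
AIC(glm_model, glm_model2)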
Overall, the log transformation provides a modest improvement, though perhaps not enough to justify the added complexity given that the raw relationship between sqft and SF location already looks reasonably linear. Elevation remains the dominant predictor in either specification.