Data Dive — GLMs

# Logistic Regression Modeling on New York Housing Dataset

## Introduction

In this analysis, we aim to build a logistic regression model using the New York housing dataset. The goal is to predict a binary variable of interest based on several explanatory variables.

## Selecting the Binary Variable

For this analysis, we derive our binary variable from the "TYPE" column of the New York housing dataset. We'll model whether a property is a house or not.

# Load necessary libraries
library(dplyr)    # For data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)  # For data visualization
library(car)      # For statistical functions

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Load the data
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")

# Check for missing values
sum(is.na(NY_House_Dataset))

## [1] 0

# Remove missing values
NY_House_Dataset <- na.omit(NY_House_Dataset)

# Select an interesting binary column of data
# For example, let's choose the TYPE column which indicates whether the property is a house or not
# Convert the TYPE column to binary: 1 for house, 0 for non-house
NY_House_Dataset$IS_HOUSE <- as.numeric(NY_House_Dataset$TYPE == "House for sale")
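
Before modeling, it can help to check how balanced the new binary column is. The sketch below applies the same encoding to a small hypothetical vector, `type_values`, which stands in for `NY_House_Dataset$TYPE`; the listing types in it are illustrative assumptions, not values read from the dataset.

```r
# Hypothetical stand-in for NY_House_Dataset$TYPE (illustrative values only)
type_values <- c("House for sale", "Condo for sale", "House for sale",
                 "Co-op for sale", "Condo for sale")

# Same encoding as IS_HOUSE: 1 for house, 0 for non-house
is_house <- as.numeric(type_values == "House for sale")

# Counts and proportions of each class
class_counts <- table(is_house)
class_props <- prop.table(class_counts)
print(class_counts)
print(class_props)
```

A heavily imbalanced outcome would be worth knowing about before interpreting accuracy-style metrics.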

# Drop identifier-like columns
# Note: this check does not detect perfect separation. It flags columns in which every
# value is unique (such as row identifiers), which carry no generalizable signal as predictors.
all_values_unique <- function(var) {
  length(unique(var)) == length(var)
}

# Identify columns whose values are all unique, excluding the response and its source column
candidate_columns <- setdiff(names(NY_House_Dataset), c("TYPE", "IS_HOUSE"))
unique_value_flags <- sapply(NY_House_Dataset[candidate_columns], all_values_unique)
vars_to_drop <- names(unique_value_flags)[unique_value_flags]

# Remove those columns (any_of() tolerates an empty vector and replaces the superseded one_of())
NY_House_Dataset <- NY_House_Dataset %>%
  select(-any_of(vars_to_drop))
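
The uniqueness check above flags identifier-like columns. Detecting separation itself means comparing a predictor's values across the two outcome classes. A minimal sketch for a single numeric predictor follows; the helper name `is_completely_separated` and the toy vectors are assumptions made for illustration, and the check does not catch separation produced by combinations of predictors.

```r
# TRUE when the predictor's values for the two classes do not overlap,
# i.e. a single cut point on x classifies y perfectly
is_completely_separated <- function(x, y) {
  x0 <- x[y == 0]
  x1 <- x[y == 1]
  max(x0) < min(x1) || max(x1) < min(x0)
}

y <- c(0, 0, 0, 1, 1, 1)
x_separated <- c(1, 2, 3, 4, 5, 6)    # classes split cleanly between 3 and 4
x_overlapping <- c(1, 4, 2, 3, 5, 6)  # class ranges overlap

is_completely_separated(x_separated, y)    # TRUE
is_completely_separated(x_overlapping, y)  # FALSE
```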

# Sample a subset of the data for model fitting
subset_data <- NY_House_Dataset[sample(nrow(NY_House_Dataset), 1000), ]
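
Because `sample()` draws at random, each run fits the model to a different subset. Setting a seed first makes the draw repeatable. A small sketch on a toy data frame (the seed value 42 and `toy_data` are arbitrary choices for illustration):

```r
toy_data <- data.frame(id = 1:10)

set.seed(42)  # arbitrary seed, chosen only for illustration
first_draw <- toy_data[sample(nrow(toy_data), 5), , drop = FALSE]

set.seed(42)  # resetting the same seed reproduces the same rows
second_draw <- toy_data[sample(nrow(toy_data), 5), , drop = FALSE]

identical(first_draw, second_draw)  # TRUE
```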

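# Build a model with explanatory variables BEDS, BATH, and PROPERTYSQFT
# Note: because IS_HOUSE is numeric, caret treats this as regression and fits a gaussian GLM
# (see the warning and the summary output below)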
library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

logistic_model <- train(IS_HOUSE ~ BEDS + BATH + PROPERTYSQFT, data = subset_data, method = "glm")

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

# Interpret coefficients
summary(logistic_model)

## 
## Call:
## NULL
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.774e-01  2.508e-02   7.072 2.87e-12 ***
## BEDS          2.684e-02  9.069e-03   2.960  0.00315 ** 
## BATH         -1.654e-02  1.311e-02  -1.262  0.20741    
## PROPERTYSQFT -4.816e-06  6.029e-06  -0.799  0.42457    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1682987)
## 
##     Null deviance: 169.34  on 999  degrees of freedom
## Residual deviance: 167.63  on 996  degrees of freedom
## AIC: 1061.9
## 
## Number of Fisher Scoring iterations: 2

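# Explanation:
# - In a logistic regression, each coefficient is the change in the log odds of being a house for a one-unit increase in that explanatory variable, holding the others fixed.
# - The summary above, however, reports a gaussian family: IS_HOUSE is numeric, so train() fit an ordinary linear model (a linear probability model). Here the BEDS coefficient of about 0.027 means each additional bedroom raises the predicted probability of being a house by about 2.7 percentage points.
# - To obtain log-odds coefficients, convert IS_HOUSE to a two-level factor (as the warning suggests) or fit with glm(..., family = binomial).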
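
The summary above reports a gaussian family, so a log-odds reading requires a model fit with a logit link. The sketch below shows one way to do that with base R's `glm()`. It runs on simulated data: the data frame `sim_data`, the seed, and the coefficients used to generate the outcome are all made up for illustration, and only the column names mirror the dataset.

```r
set.seed(1)  # arbitrary seed for the simulated data
n <- 500
sim_data <- data.frame(
  BEDS = rpois(n, 3),
  BATH = rpois(n, 2),
  PROPERTYSQFT = runif(n, 500, 4000)
)

# Simulate the outcome from a known logistic relationship (made-up coefficients)
linear_predictor <- -1 + 0.4 * sim_data$BEDS - 0.2 * sim_data$BATH
sim_data$IS_HOUSE <- rbinom(n, 1, plogis(linear_predictor))

# family = binomial uses the logit link, so coefficients are on the log-odds scale
logit_fit <- glm(IS_HOUSE ~ BEDS + BATH + PROPERTYSQFT,
                 data = sim_data, family = binomial)

summary(logit_fit)

# Exponentiating gives odds ratios per one-unit increase in each predictor
exp(coef(logit_fit))
```

With a fit like this, the summary reports z values and a binomial dispersion parameter rather than the t values and gaussian dispersion seen above.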

# Build a 95% confidence interval for the coefficient of PROPERTYSQFT
# Extract the estimate and its standard error from the fitted model's coefficient table
coef_table <- summary(logistic_model$finalModel)$coefficients
estimate <- coef_table["PROPERTYSQFT", "Estimate"]
std_error <- coef_table["PROPERTYSQFT", "Std. Error"]

# Wald interval: estimate plus or minus about 1.96 standard errors
ci_sqft <- estimate + c(-1, 1) * qnorm(0.975) * std_error

# Translate the meaning of the confidence interval
cat("95% Confidence Interval for the coefficient of PROPERTYSQFT:", ci_sqft, "\n")

cat("The coefficient of PROPERTY_SQFT is significant at the 0.05 level if the confidence interval does not include zero.\n")

## The coefficient of PROPERTY_SQFT is significant at the 0.05 level if the confidence interval does not include zero.

cat("If the coefficient is positive, it indicates that an increase in property square footage increases the log odds of being a house.\n")

## If the coefficient is positive, it indicates that an increase in property square footage increases the log odds of being a house.
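
As a worked check, a Wald interval can be computed by hand from the PROPERTYSQFT row of the summary output above (estimate -4.816e-06, standard error 6.029e-06). The sketch assumes a normal approximation with the 0.975 quantile of about 1.96; the variable names are illustrative.

```r
# Values copied from the PROPERTYSQFT row of the summary output
estimate <- -4.816e-06
std_error <- 6.029e-06

# 95% Wald interval: estimate +/- qnorm(0.975) * standard error
wald_ci <- estimate + c(-1, 1) * qnorm(0.975) * std_error
print(wald_ci)  # roughly -1.66e-05 to 7.00e-06

# The interval spans zero, so the coefficient is not significant at the 0.05 level
includes_zero <- wald_ci[1] < 0 && wald_ci[2] > 0
print(includes_zero)  # TRUE
```

This agrees with the p-value of 0.42457 reported for PROPERTYSQFT in the same summary.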

# Consider a transformation for the PROPERTYSQFT variable
# Let's check the scatterplot of PROPERTYSQFT against IS_HOUSE, with a one-variable logistic curve overlaid
ggplot(NY_House_Dataset, aes(x = PROPERTYSQFT, y = IS_HOUSE)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(title = "Scatterplot of Property Square Footage vs. IS_HOUSE",
       x = "Property Square Footage",
       y = "IS_HOUSE")

## `geom_smooth()` using formula = 'y ~ x'

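# Explanation:
# - The scatterplot helps us visualize the relationship between PROPERTYSQFT and IS_HOUSE; the smooth line is a logistic fit on PROPERTYSQFT alone.
# - If the relationship looks non-linear on the log-odds scale, or the values are strongly skewed, we might consider transforming PROPERTYSQFT (for example, with a log).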
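
One way to judge whether a transformation helps is to fit the model with and without it and compare AIC. The sketch below does this on simulated data in which the outcome truly depends on the log of square footage; the variables `sqft` and `is_house`, the seed, and the generating coefficients are all assumptions for illustration.

```r
set.seed(2)  # arbitrary seed for the simulated data
n <- 500

# Right-skewed size variable, simulated on the log scale
sqft <- exp(rnorm(n, mean = 7.5, sd = 0.6))

# Outcome depends on log(sqft) (made-up coefficients)
is_house <- rbinom(n, 1, plogis(-7.5 + 1.0 * log(sqft)))

# Same predictor, raw versus log-transformed
fit_raw <- glm(is_house ~ sqft, family = binomial)
fit_log <- glm(is_house ~ log(sqft), family = binomial)

# Lower AIC suggests the better-fitting functional form
aic_table <- AIC(fit_raw, fit_log)
print(aic_table)
```

Both models use one slope parameter, so the AIC difference here reflects fit rather than complexity.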

# Further investigation:
# - Explore other potential explanatory variables that could improve the model's predictive power.
# - Assess the goodness-of-fit of the model using techniques like ROC curve analysis.
# - Investigate potential interactions between variables to better understand their combined effects on the response variable.
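
For the ROC analysis mentioned above, the area under the curve (AUC) can be computed in base R from predicted scores and true labels using the rank-sum identity. This is a minimal sketch: the helper name `auc_from_scores` and the four-observation example are assumptions for illustration, and in practice the scores would come from `predict(..., type = "response")` on a fitted logistic model.

```r
# AUC via the rank-sum (Mann-Whitney) identity:
# the probability that a random positive scores higher than a random negative
auc_from_scores <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  ranks <- rank(scores)  # ties receive average ranks
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_from_scores(scores, labels)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect ranking of houses above non-houses.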

Abhinandhan Velagapudi

2024-03-25