# Data Dive — GLMs (Part 2)
## Introduction
# In this analysis, we will continue exploring the New York housing dataset by building and diagnosing a linear (or generalized linear) model.
## Loading the Data
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(car)
## Building the Model
# For this analysis, let's build a linear model to predict the property price based on various explanatory variables.
# Load the data
NY_House_Dataset <- read.csv("NY-House-Dataset.csv") # adjust the path to wherever the CSV is saved
# Check for missing values
missing_values <- colSums(is.na(NY_House_Dataset))
missing_values <- missing_values[missing_values > 0]
missing_values
## named numeric(0)
# No missing values were found, so no imputation is needed here
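# Had any columns contained NAs, a simple median imputation could be sketched
# as below (purely hypothetical for this dataset, since missing_values is empty):
impute_median <- function(df) {
  df %>%
    mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
}
# NY_House_Dataset <- impute_median(NY_House_Dataset)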
# Visualize the distribution of the response variable (PRICE)
ggplot(NY_House_Dataset, aes(x = PRICE)) +
  geom_histogram(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Property Prices",
       x = "Price",
       y = "Frequency") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
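# Prices are heavily right-skewed, so most of the histogram mass collapses into
# the first few bars. A log-scaled x-axis (a quick sketch using ggplot2's
# scale_x_log10(); assumes all prices are positive, and bins = 50 is an arbitrary
# choice) usually shows the shape more clearly:
ggplot(NY_House_Dataset, aes(x = PRICE)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "black") +
  scale_x_log10() +
  labs(title = "Distribution of Property Prices (log10 scale)",
       x = "Price (log10 scale)",
       y = "Frequency") +
  theme_minimal()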

# Select relevant variables for modeling
selected_vars <- c("PRICE", "BEDS", "BATH", "PROPERTYSQFT")
# Sample a subset of the data for model fitting
# (note: without set.seed(), the sampled rows, and hence the numbers below, will vary between runs)
subset_data <- NY_House_Dataset[sample(nrow(NY_House_Dataset), 1000), ]
# Build a linear regression model
linear_model <- lm(PRICE ~ BEDS + BATH + PROPERTYSQFT, data = subset_data)
# Check model summary
summary(linear_model)
##
## Call:
## lm(formula = PRICE ~ BEDS + BATH + PROPERTYSQFT, data = subset_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14979117 -911745 -342532 379976 26857533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -500420.21 166205.42 -3.011 0.00267 **
## BEDS -526142.26 59092.05 -8.904 < 2e-16 ***
## BATH 1420629.84 88329.62 16.083 < 2e-16 ***
## PROPERTYSQFT 345.17 52.86 6.530 1.05e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2875000 on 996 degrees of freedom
## Multiple R-squared: 0.3084, Adjusted R-squared: 0.3063
## F-statistic: 148.1 on 3 and 996 DF, p-value: < 2.2e-16
# Diagnostic plots
par(mfrow = c(2, 2))
plot(linear_model)

# Explanation:
# - The diagnostic plots help us assess the assumptions of the linear model.
# - The plots include:
# 1. Residuals vs. Fitted: assesses the linearity assumption. Residuals should scatter randomly around the horizontal dashed line at 0; a clear pattern or curvature suggests the linearity assumption is violated.
# 2. Normal Q-Q: evaluates the normality assumption by comparing the quantiles of the standardized residuals to those of a theoretical normal distribution. Points should follow the diagonal line closely; systematic departures indicate non-normal residuals.
# 3. Scale-Location (also called spread-location): checks the homoscedasticity assumption, i.e., whether the spread of residuals is constant across fitted values. Points should scatter randomly around a roughly flat smoothed line; a visible trend indicates heteroscedasticity.
# 4. Residuals vs. Leverage: identifies influential observations. It plots standardized residuals against leverage, with Cook's distance contours; points with both high leverage and large residuals can disproportionately affect the fit and warrant further investigation (see the sketch after this list).
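# As a follow-up to point 4, the influential rows can be listed explicitly.
# A minimal sketch using base R's cooks.distance(); the 4/n cutoff is a common
# rule of thumb, not a strict threshold:
cooks_d <- cooks.distance(linear_model)
influential <- which(cooks_d > 4 / nrow(subset_data))
head(subset_data[influential, selected_vars])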
## Highlighting Issues
# Based on the diagnostic plots, we can identify potential issues with the model.
# For example, curvature in the Residuals vs. Fitted plot suggests a violation of the linearity assumption, while departures from the diagonal in the Normal Q-Q plot suggest non-normal residuals.
# Here, the residual summary above (median of about -343K against a maximum of about 26.9M) already hints at strongly right-skewed errors.
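# Given the right-skewed prices and residuals, one common remedy is to model
# log(PRICE) instead of PRICE. A minimal sketch (assumes all prices are positive;
# coefficients then read roughly as proportional effects on price):
log_model <- lm(log(PRICE) ~ BEDS + BATH + PROPERTYSQFT, data = subset_data)
summary(log_model)
par(mfrow = c(2, 2))
plot(log_model)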
# Let's interpret one of the coefficients from the linear model.
# Interpret coefficients
coef_summary <- coef(summary(linear_model))
coef_summary
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -500420.2065 166205.42385 -3.010854 2.670840e-03
## BEDS -526142.2637 59092.05381 -8.903774 2.509334e-18
## BATH 1420629.8368 88329.62282 16.083278 6.442758e-52
## PROPERTYSQFT 345.1744 52.86168 6.529766 1.048122e-10
# Explanation:
# - The coefficients represent the estimated effect of each explanatory variable on the response variable, holding the others constant.
# - For example, the coefficient for BEDS (about -526,142) is the estimated average change in price for a one-unit increase in the number of bedrooms, holding BATH and PROPERTYSQFT constant. The negative sign is counterintuitive and likely reflects correlation between BEDS and the other predictors; this is checked under Further Investigation below.
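# To make this concrete, compare predictions for two hypothetical listings that
# differ only in BEDS (the specific values are chosen purely for illustration):
new_homes <- data.frame(BEDS = c(2, 3), BATH = 2, PROPERTYSQFT = 1500)
predict(linear_model, newdata = new_homes)
# The two predictions differ by exactly the BEDS coefficient, about -526,142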
# Build confidence intervals for coefficients
conf_int <- confint(linear_model)
conf_int
## 2.5 % 97.5 %
## (Intercept) -826573.1924 -174267.2206
## BEDS -642101.4746 -410183.0528
## BATH 1247296.3224 1593963.3512
## PROPERTYSQFT 241.4414 448.9074
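# These intervals are estimate +/- t * SE with 996 residual degrees of freedom;
# reproducing the BEDS interval by hand as a sanity check:
t_crit <- qt(0.975, df = df.residual(linear_model))
coef(linear_model)["BEDS"] + c(-1, 1) * t_crit * coef_summary["BEDS", "Std. Error"]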
# Visualize the relationship between each predictor and the response variable
ggplot(subset_data, aes(x = PROPERTYSQFT, y = PRICE)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Property Square Footage vs. Price",
       x = "Property Square Footage",
       y = "Price") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

## Further Investigation
# Further investigation could involve:
# - Checking for multicollinearity among explanatory variables (a VIF sketch follows below).
# - Trying different transformations of variables to improve model fit (see the log(PRICE) sketch above).
# - Evaluating the predictive performance of the model using techniques like cross-validation (a simple train/test sketch follows below).
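# Multicollinearity sketch: variance inflation factors via vif() from the car
# package loaded above; values well above 5-10 are a common warning sign:
vif(linear_model)

# Predictive-performance sketch: a simple 80/20 train/test split as a stand-in
# for full cross-validation (the seed and split ratio are arbitrary choices):
set.seed(123)
train_idx <- sample(nrow(subset_data), size = 0.8 * nrow(subset_data))
train_set <- subset_data[train_idx, ]
test_set <- subset_data[-train_idx, ]
cv_model <- lm(PRICE ~ BEDS + BATH + PROPERTYSQFT, data = train_set)
preds <- predict(cv_model, newdata = test_set)
sqrt(mean((test_set$PRICE - preds)^2)) # out-of-sample RMSE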
## Conclusion
cat("The linear regression model provides insights into the relationship between property features and prices. Further refinement and exploration can improve model accuracy.")
## The linear regression model provides insights into the relationship between property features and prices. Further refinement and exploration can improve model accuracy.