Data Dive: Regression Diagnostics

# Loading necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyr)

# Load the data from the CSV file
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")

# Let's add more variables to our regression model
# We'll start by including two additional variables: BEDS (number of bedrooms) and BATH (number of bathrooms)

# Define the response variable
response_variable <- "PRICE"

# Define the explanatory variables
explanatory_variables <- c("PROPERTYSQFT", "BEDS", "BATH")

# Build the multiple linear regression model with 3 terms (including the original term)
multiple_linear_model <- lm(PRICE ~ PROPERTYSQFT + BEDS + BATH, data = NY_House_Dataset)

# Evaluate the model fit
summary(multiple_linear_model)

## 
## Call:
## lm(formula = PRICE ~ PROPERTYSQFT + BEDS + BATH, data = NY_House_Dataset)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -64406518   -1449920    -625680     259086 2133207589 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1301012.4   764839.2  -1.701    0.089 .  
## PROPERTYSQFT     1275.1      216.9   5.880 4.38e-09 ***
## BEDS          -417607.9   275076.5  -1.518    0.129    
## BATH           958232.7   381191.2   2.514    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31150000 on 4797 degrees of freedom
## Multiple R-squared:  0.01363,    Adjusted R-squared:  0.01302 
## F-statistic:  22.1 on 3 and 4797 DF,  p-value: 3.291e-14

# Interpret coefficients
# Each coefficient represents the expected change in the response variable (PRICE)
# for a one-unit increase in that explanatory variable, holding all other variables constant.
# For example, the PROPERTYSQFT coefficient of 1275.1 means that each additional square foot of property space
# is associated with a price about $1,275 higher, holding the number of bedrooms and bathrooms constant.
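
# A minimal sketch of this interpretation on simulated data. The data frame `sim_houses`, its columns, and the "true" coefficients are made up for illustration and are not taken from the NY dataset. Raising one predictor by one unit while holding the other fixed changes the prediction by exactly that predictor's coefficient.

```r
set.seed(42)
sim_n <- 200

# Hypothetical predictors and a price generated from known coefficients
sim_houses <- data.frame(
  sqft = runif(sim_n, 500, 3000),
  beds = sample(1:5, sim_n, replace = TRUE)
)
sim_houses$price <- 50000 + 300 * sim_houses$sqft +
  10000 * sim_houses$beds + rnorm(sim_n, sd = 20000)

sim_fit <- lm(price ~ sqft + beds, data = sim_houses)

# Two houses that differ by one square foot and nothing else
base_case <- data.frame(sqft = 1000, beds = 2)
plus_one  <- data.frame(sqft = 1001, beds = 2)
pred_diff <- predict(sim_fit, plus_one) - predict(sim_fit, base_case)

print(coef(sim_fit)["sqft"])  # estimated slope for sqft
print(pred_diff)              # same value: the effect of one extra square foot
print(confint(sim_fit))       # 95% intervals show how precisely each slope is estimated
```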

# Model Expansion
cat("1. Model Expansion:\n")

## 1. Model Expansion:

cat("- We added two more variables, BEDS and BATH, to our regression model to explore their impact on property prices.\n")

## - We added two more variables, BEDS and BATH, to our regression model to explore their impact on property prices.

cat("- These variables were chosen because they are commonly considered important factors affecting property prices.\n")

## - These variables were chosen because they are commonly considered important factors affecting property prices.

cat("- BEDS and BATH are intuitive predictors of property prices, as larger houses (more bedrooms and bathrooms) typically command higher prices.\n")

## - BEDS and BATH are intuitive predictors of property prices, as larger houses (more bedrooms and bathrooms) typically command higher prices.

cat("- However, it's crucial to check for multicollinearity between these variables to ensure that they provide unique information to the model.\n")

## - However, it's crucial to check for multicollinearity between these variables to ensure that they provide unique information to the model.

cat("- Multicollinearity can lead to unstable estimates and inflated standard errors, affecting the interpretation and reliability of the model.\n\n")

## - Multicollinearity can lead to unstable estimates and inflated standard errors, affecting the interpretation and reliability of the model.
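
# One common way to check for multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below does this by hand in base R on simulated, deliberately correlated predictors. All `sim_` names and values are hypothetical. The `car` package's vif() is the usual shortcut, but it is not used here to keep the example dependency-free.

```r
set.seed(1)
sim_n <- 300

# Hypothetical predictors: beds depends on sqft, bath depends on beds
sim_sqft <- rnorm(sim_n, mean = 1500, sd = 400)
sim_beds <- round(sim_sqft / 500 + rnorm(sim_n, sd = 0.5))
sim_bath <- round(sim_beds / 2 + rnorm(sim_n, sd = 0.5))
sim_predictors <- data.frame(sqft = sim_sqft, beds = sim_beds, bath = sim_bath)

# VIF for each predictor: 1 / (1 - R^2) from regressing it on the others
vif_by_hand <- sapply(names(sim_predictors), function(v) {
  others <- names(sim_predictors)[names(sim_predictors) != v]
  r2 <- summary(lm(reformulate(others, response = v),
                   data = sim_predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 2))
```

# A VIF of 1 means a predictor is uncorrelated with the others. Larger values mean more of its variance is shared, and thresholds such as 5 or 10 are conventions rather than hard rules.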

# Model Evaluation
cat("2. Model Evaluation:\n")

## 2. Model Evaluation:

cat("- The summary of the multiple linear regression model provides coefficients for each variable, indicating their impact on property prices.\n")

## - The summary of the multiple linear regression model provides coefficients for each variable, indicating their impact on property prices.

cat("- We can interpret these coefficients to understand how changes in property square footage, number of bedrooms, and number of bathrooms affect property prices.\n")

## - We can interpret these coefficients to understand how changes in property square footage, number of bedrooms, and number of bathrooms affect property prices.

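cat("- Additionally, the F-statistic (22.1 on 3 and 4797 DF, p-value 3.291e-14) assesses the overall significance of the model, and the t-tests in the coefficient table assess each individual variable.\n")

## - Additionally, the F-statistic (22.1 on 3 and 4797 DF, p-value 3.291e-14) assesses the overall significance of the model, and the t-tests in the coefficient table assess each individual variable.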

cat("- The R-squared value indicates the proportion of variance in the response variable explained by the model.\n")

## - The R-squared value indicates the proportion of variance in the response variable explained by the model.

cat("- However, it's essential to consider the adjusted R-squared value when comparing models with different numbers of predictors.\n")

## - However, it's essential to consider the adjusted R-squared value when comparing models with different numbers of predictors.

cat("- In our model, the adjusted R-squared value of 0.013 indicates that only about 1.3% of the variance in property prices is explained by the predictors, so the model fits the data poorly.\n\n")

## - In our model, the adjusted R-squared value of 0.013 indicates that only about 1.3% of the variance in property prices is explained by the predictors, so the model fits the data poorly.
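
# The adjusted R-squared can be recomputed from the figures printed in the model summary above, using adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1). The sketch below plugs in the reported values; the variable names are ours, and the result matches the reported 0.01302 only up to rounding because the printed R-squared is itself rounded.

```r
r_squared <- 0.01363   # Multiple R-squared from the summary output
resid_df  <- 4797      # residual degrees of freedom, n - p - 1
n_pred    <- 3         # PROPERTYSQFT, BEDS, BATH

# Recover the number of observations used in the fit
n_obs <- resid_df + n_pred + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / resid_df

print(n_obs)
print(round(adj_r_squared, 4))  # roughly 0.013, i.e. about 1.3% of the variance
```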

# Diagnostic Plots
cat("3. Diagnostic Plots:\n")

## 3. Diagnostic Plots:

# Residuals vs Fitted Plot
plot1 <- ggplot(data = as.data.frame(multiple_linear_model$residuals), aes(x = fitted(multiple_linear_model), y = multiple_linear_model$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted",
       x = "Fitted values",
       y = "Residuals") +
  theme_minimal()

# Normal Q-Q Plot
plot2 <- ggplot(data = as.data.frame(multiple_linear_model$residuals), aes(sample = multiple_linear_model$residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()

# Scale-Location Plot
# Use standardized residuals (rstandard) so the y-axis matches its label
plot3 <- ggplot(data = as.data.frame(multiple_linear_model$residuals), aes(x = fitted(multiple_linear_model), y = sqrt(abs(rstandard(multiple_linear_model))))) +
  geom_point() +
  geom_smooth() +
  labs(title = "Scale-Location Plot",
       x = "Fitted values",
       y = "Square root of standardized residuals") +
  theme_minimal()
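
# A numeric companion to the Scale-Location plot is a Breusch-Pagan style test. The sketch below implements the studentized (Koenker) version by hand in base R on simulated data whose noise grows with the predictor. The `sim_bp_` names and the data are hypothetical; the `lmtest` package's bptest() is the usual packaged alternative.

```r
set.seed(123)
sim_bp_n <- 500

# Hypothetical data with error spread proportional to x (heteroscedastic)
sim_bp_x <- runif(sim_bp_n, 0, 5)
sim_bp_y <- 1 + 2 * sim_bp_x + rnorm(sim_bp_n, sd = 0.5 * sim_bp_x)
sim_bp_fit <- lm(sim_bp_y ~ sim_bp_x)

# Auxiliary regression: squared residuals on the original predictor
sim_bp_aux <- lm(resid(sim_bp_fit)^2 ~ sim_bp_x)

# Test statistic n * R^2 is approximately chi-squared with
# one degree of freedom per predictor in the auxiliary regression
bp_stat <- sim_bp_n * summary(sim_bp_aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_pval))
```

# A small p-value is evidence against constant residual variance, which should agree with a visible trend in the Scale-Location plot.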

# Get leverage values
leverage_values <- hatvalues(multiple_linear_model)

# Residuals vs Leverage Plot
plot4 <- ggplot(data = as.data.frame(multiple_linear_model$residuals), aes(x = leverage_values, y = rstandard(multiple_linear_model))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Leverage",
       x = "Leverage",
       y = "Standardized Residuals") +
  theme_minimal()

# Arrange plots in a grid
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(plot1, plot2, plot3, plot4, nrow = 2)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
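
# Two of the axis labels above refer to standardized residuals. As a rough sketch on simulated data (`sim_std_data` and `sim_std_fit` are hypothetical names, not part of this analysis), a standardized residual is the raw residual divided by the residual standard error times sqrt(1 - leverage), which is what base R's rstandard() returns. Calling plot() on an lm object draws base R's built-in versions of the same four diagnostics.

```r
set.seed(1)
sim_std_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
sim_std_data$y <- 1 + 2 * sim_std_data$x1 - sim_std_data$x2 + rnorm(100)
sim_std_fit <- lm(y ~ x1 + x2, data = sim_std_data)

# Standardized residual = residual / (sigma_hat * sqrt(1 - h_ii))
sim_h      <- hatvalues(sim_std_fit)
sim_sigma  <- summary(sim_std_fit)$sigma
manual_std <- resid(sim_std_fit) / (sim_sigma * sqrt(1 - sim_h))

# Should agree with the built-in helper
print(all.equal(manual_std, rstandard(sim_std_fit)))

# Built-in diagnostics: 1 = Residuals vs Fitted, 2 = Normal Q-Q,
# 3 = Scale-Location, 5 = Residuals vs Leverage
old_par <- par(mfrow = c(2, 2))
plot(sim_std_fit, which = c(1, 2, 3, 5))
par(old_par)
```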

# Analysis of Diagnostic Plots:
cat("4. Diagnostic Plots Analysis:\n")

## 4. Diagnostic Plots Analysis:

# Residuals vs Fitted Plot
cat("- Residuals vs Fitted plot is used to assess the linearity assumption of the model. It examines whether the residuals are randomly distributed around the horizontal dashed line at 0. If there's a clear pattern or curvature in the plot, it suggests that the linearity assumption may be violated.\n")

## - Residuals vs Fitted plot is used to assess the linearity assumption of the model. It examines whether the residuals are randomly distributed around the horizontal dashed line at 0. If there's a clear pattern or curvature in the plot, it suggests that the linearity assumption may be violated.

# Normal Q-Q Plot
cat("- Normal Q-Q plot is employed to evaluate the normality assumption of the residuals. It compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. Ideally, the points should approximately follow the diagonal line, indicating that the residuals are normally distributed. Departures from the diagonal line, especially in the tails, suggest deviations from normality.\n")

## - Normal Q-Q plot is employed to evaluate the normality assumption of the residuals. It compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. Ideally, the points should approximately follow the diagonal line, indicating that the residuals are normally distributed. Departures from the diagonal line, especially in the tails, suggest deviations from normality.

# Scale-Location Plot
cat("- Scale-Location plot, also known as the spread-location plot, is utilized to check the homoscedasticity assumption of the model. It examines whether the spread of residuals is constant across different levels of fitted values. Points should be randomly scattered around a roughly horizontal smoothed line. A clear upward or downward trend in the line, or a funnel shape in the points, indicates heteroscedasticity, which violates the homoscedasticity assumption.\n")

## - Scale-Location plot, also known as the spread-location plot, is utilized to check the homoscedasticity assumption of the model. It examines whether the spread of residuals is constant across different levels of fitted values. Points should be randomly scattered around a roughly horizontal smoothed line. A clear upward or downward trend in the line, or a funnel shape in the points, indicates heteroscedasticity, which violates the homoscedasticity assumption.

# Residuals vs Leverage plot
cat("- Residuals vs Leverage plot helps identify influential observations. It plots the residuals against the leverage values. Observations with both high leverage and large residuals can significantly influence the regression model. It's essential to examine any points far from the dashed line at 0, particularly those with high leverage, as they may warrant further investigation to understand their impact on the model.\n")

## - Residuals vs Leverage plot helps identify influential observations. It plots the residuals against the leverage values. Observations with both high leverage and large residuals can significantly influence the regression model. It's essential to examine any points far from the dashed line at 0, particularly those with high leverage, as they may warrant further investigation to understand their impact on the model.
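
# Influence can also be screened numerically. The sketch below, on simulated data with one planted outlier, flags observations that exceed two common rule-of-thumb cutoffs: leverage above 2p/n and Cook's distance above 4/n. The `sim_inf_` names, the data, and the cutoffs are illustrative assumptions, not part of the analysis above.

```r
set.seed(7)
sim_inf_n <- 50
sim_inf_x <- rnorm(sim_inf_n)
sim_inf_y <- 2 + 3 * sim_inf_x + rnorm(sim_inf_n)

# Plant one point far from the others in x and far from the trend in y
sim_inf_x[sim_inf_n] <- 8
sim_inf_y[sim_inf_n] <- -10

sim_inf_fit <- lm(sim_inf_y ~ sim_inf_x)

sim_inf_lev   <- hatvalues(sim_inf_fit)        # leverage of each observation
sim_inf_cooks <- cooks.distance(sim_inf_fit)   # combined leverage/residual influence

# Rule-of-thumb cutoffs (conventions, not strict tests)
lev_cutoff  <- 2 * length(coef(sim_inf_fit)) / sim_inf_n
cook_cutoff <- 4 / sim_inf_n

flagged <- which(sim_inf_lev > lev_cutoff & sim_inf_cooks > cook_cutoff)
print(flagged)  # indices of observations worth a closer look
```

# Flagged points are candidates for inspection, not automatic deletion; refitting with and without them shows how much the coefficients depend on them.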

Abhinandhan Velagapudi

2024-03-17