Week 13

In week 10, the following topics were discussed: transformations and link functions. Let's critique them. We used the apartments dataset, which contains apartments in SF and NYC.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Transformations

We begin by building a linear regression model to explore the relationship between the price of apartments and their price per square foot.

url <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv"

apts <- read_delim(url, delim = ',')

## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

model <- lm(price ~ price_per_sqft,
            filter(apts, in_sf == 0))

rsquared <- summary(model)$r.squared

apts |> 
  filter(in_sf == 0) |>
  ggplot(mapping = aes(x = price_per_sqft, 
                       y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Price vs. Price Per Sq. Ft.",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Concerns with the model start with the choice of predictor. Price per square foot is by definition derived from price (price divided by square footage), so the predictor is partly a function of the response, and some of the fitted relationship is built in rather than discovered. The model also omits predictors that plausibly matter, such as neighborhood, number of bedrooms and bathrooms, and building amenities, which can bias the estimated effect. Finally, the data are observational rather than randomized, so unmeasured confounders may be driving part of the observed relationship between price and price per square foot.

R-squared is the only metric used to evaluate the model, which gives a limited view of its performance. Other metrics, such as mean squared error (MSE) or root mean squared error (RMSE), could have been used to assess predictive performance. The code below adds MSE.

# Fit linear regression model
model <- lm(price ~ price_per_sqft, data = filter(apts, in_sf == 0))

# Make predictions
predictions <- predict(model, newdata = filter(apts, in_sf == 0))

# Calculate Mean Squared Error
mse <- mean((apts %>% filter(in_sf == 0) %>% pull(price) - predictions)^2)

# Visualize the data and linear fit
apts %>%
  filter(in_sf == 0) %>%
  ggplot(mapping = aes(x = price_per_sqft, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Price vs. Price Per Sq. Ft.",
    subtitle = paste("Linear Fit MSE =", round(mse, 3))
  ) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

R-squared reports the share of variance in the response that the model explains, and it is unitless. MSE reports the average squared difference between observed and predicted prices, in squared dollars. Its square root, RMSE, is in dollars, so it states directly how far predictions typically fall from the observed prices.
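To make the relationship between these metrics concrete, the sketch below computes MSE, RMSE, mean absolute error, and R-squared from the same residuals. It runs on simulated data rather than the apartments data, and the names `sim` and `fit` as well as the simulated coefficients are illustrative assumptions.

```r
# Simulated stand-in for the NYC subset (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(price_per_sqft = runif(200, 500, 3000))
sim$price <- -3e6 + 3700 * sim$price_per_sqft + rnorm(200, sd = 2e6)

fit <- lm(price ~ price_per_sqft, data = sim)

# residuals() returns observed minus fitted values on the training data
res  <- residuals(fit)
mse  <- mean(res^2)      # squared dollars
rmse <- sqrt(mse)        # dollars
mae  <- mean(abs(res))   # dollars, less sensitive to a few large errors

# R-squared can be recovered from the same residuals: 1 - SSE / SST
sst <- sum((sim$price - mean(sim$price))^2)
r2_from_residuals <- 1 - sum(res^2) / sst

c(rmse = rmse, mae = mae, r2 = r2_from_residuals)
```

Because all of these values come from the training residuals, R-squared and MSE rank models with the same response in the same order. What RMSE adds is an error stated in dollars.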

Power transformation

In the power transformation section, new linear models were fitted after each transformation, but they were never compared with the original model. A few lines of code could have formally compared the models to assess whether the transformed model fits better.

# Original Model
original_model <- lm(price ~ price_per_sqft, data = filter(apts, in_sf == 0))

# Transformed Model
apts <- apts %>%
  mutate(log_price = log(price))

transformed_model <- lm(log_price ~ price_per_sqft, data = filter(apts, in_sf == 0))

# Model Comparison
summary(original_model)

## 
## Call:
## lm(formula = price ~ price_per_sqft, data = filter(apts, in_sf == 
##     0))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4289426 -1480152    19921   894222 17184132 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3213431.7   322193.0  -9.974   <2e-16 ***
## price_per_sqft     3689.5      175.8  20.984   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2221000 on 222 degrees of freedom
## Multiple R-squared:  0.6648, Adjusted R-squared:  0.6633 
## F-statistic: 440.3 on 1 and 222 DF,  p-value: < 2.2e-16

summary(transformed_model)

## 
## Call:
## lm(formula = log_price ~ price_per_sqft, data = filter(apts, 
##     in_sf == 0))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1514 -0.3639 -0.0443  0.3023  1.9612 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.271e+01  7.256e-02  175.14   <2e-16 ***
## price_per_sqft 9.815e-04  3.960e-05   24.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5001 on 222 degrees of freedom
## Multiple R-squared:  0.7346, Adjusted R-squared:  0.7334 
## F-statistic: 614.5 on 1 and 222 DF,  p-value: < 2.2e-16

# AIC Comparison
AIC(original_model)

## [1] 7186.49

AIC(transformed_model)

## [1] 329.266

The new code makes the model comparison explicit by computing AIC for both the original and the transformed model.

Lower AIC indicates a better trade-off between goodness of fit and model complexity, but only when the models are fitted to the same response. Here the original model predicts price in dollars and the transformed model predicts log(price), so the raw values 7186.49 and 329.27 are on different likelihood scales, and most of the gap reflects the change of scale rather than a better fit. For a valid comparison, the transformed model's AIC must be moved back to the price scale by adding 2 * sum(log(price)). The same caution applies to the two R-squared values (0.6648 and 0.7346), which measure explained variance in different quantities.

The original code relied mainly on visual assessment and R-squared. AIC is a widely used criterion that accounts for both fit and complexity, so adding it, with the scale adjustment described here, strengthens the quantitative evaluation.
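One caveat when comparing AIC across a raw and a log-transformed response is that the two likelihoods describe different quantities. A minimal sketch of the usual change-of-variables correction follows. It uses simulated data, and the object names (`fit_raw`, `fit_log`) and the simulated coefficients are assumptions for illustration only.

```r
# Simulated stand-in for the price data (hypothetical values)
set.seed(2)
n <- 224
x <- runif(n, 500, 3000)
price <- exp(12.7 + 0.001 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(price ~ x)
fit_log <- lm(log(price) ~ x)

# AIC(fit_log) is based on the likelihood of log(price), not of price.
# Changing variables back to the price scale subtracts sum(log(price))
# from the log-likelihood, which adds 2 * sum(log(price)) to the AIC.
aic_raw          <- AIC(fit_raw)
aic_log_unadj    <- AIC(fit_log)
aic_log_adjusted <- AIC(fit_log) + 2 * sum(log(price))

c(raw = aic_raw, log_unadjusted = aic_log_unadj, log_adjusted = aic_log_adjusted)
```

Only `aic_raw` and `aic_log_adjusted` are on the same scale. The unadjusted value will look far smaller whenever prices are large, regardless of which model fits better.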

Link Functions

Logistic Regression

In this section, the following was done:

A binary response variable was predicted from an explanatory variable.
The sigmoid function was introduced; it transforms the linear combination of predictors into a probability between 0 and 1.
The sigmoid function was visualized to show how the predicted probability changes with elevation, using coefficients of -5 for the intercept and 0.15 for elevation.
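As a small illustration of the sigmoid described here, the sketch below evaluates it at a few elevations using the quoted coefficients (-5 and 0.15). The elevation values are arbitrary examples chosen for this sketch, and the helper name `sigmoid` is an assumption; base R's `plogis()` computes the same function.

```r
# Sigmoid (inverse logit): maps any real number to a value in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Coefficients quoted in the text
b0 <- -5     # intercept
b1 <- 0.15   # elevation slope

# Arbitrary example elevations (in whatever units the dataset uses)
elevation <- c(0, 20, 40, 60, 100)
p_sf <- sigmoid(b0 + b1 * elevation)

# Elevation at which the predicted probability is exactly 0.5
boundary <- -b0 / b1

data.frame(elevation, p_sf = round(p_sf, 3))
```

With these coefficients the 0.5 threshold sits at an elevation of about 33.3, so apartments above that elevation would be classified as being in SF.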

Critique: The code gives a clear introduction to logistic regression, explaining the model and its coefficients for predicting whether an apartment is in San Francisco from its elevation. The plot of the sigmoid function helps by showing how elevation changes the predicted probability of an apartment being in SF.

However, the analysis falls short on model evaluation: it reports no formal metrics such as accuracy or an ROC curve. It also assumes that the log-odds of being in SF change linearly, and therefore monotonically, with elevation, without checking that assumption. With elevation as the only predictor, multicollinearity is not yet a concern, but it would need checking once other predictors are added, and adding them could improve accuracy. Finally, the reliability of any conclusion depends on the dataset itself, so data quality, representativeness, and sampling bias should be examined before the model is trusted.
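The missing evaluation step could look roughly like the sketch below, which computes accuracy at a 0.5 threshold and the area under the ROC curve using base R only. The data are simulated, and the variable names and the data-generating coefficients are assumptions rather than results from the apartments data.

```r
# Simulated stand-in: elevation and a binary in_sf label (hypothetical values)
set.seed(3)
n <- 400
elevation <- runif(n, 0, 100)
in_sf <- rbinom(n, 1, plogis(-5 + 0.15 * elevation))

fit <- glm(in_sf ~ elevation, family = binomial)
p_hat <- predict(fit, type = "response")

# Accuracy at a 0.5 probability threshold
pred_class <- as.integer(p_hat >= 0.5)
accuracy <- mean(pred_class == in_sf)

# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen positive case scores higher than a randomly chosen negative case
r <- rank(p_hat)
n_pos <- sum(in_sf == 1)
n_neg <- sum(in_sf == 0)
auc <- (sum(r[in_sf == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(accuracy = accuracy, auc = auc)
```

Both numbers here are computed on the training data and are therefore optimistic. A held-out split or cross-validation would give a fairer estimate.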

Poisson Regression

The code introduces Poisson regression as a method for modeling count data. It uses a log link, so the logarithm of the expected response, rather than the response itself, is modeled as a linear function of the predictor. The histogram and scatter plot help show the distribution of prices for apartments in San Francisco and New York City.

However, the subsequent transformation of the square footage variable using both the square root and the logarithm might be overly complex and requires a clear rationale. The Poisson Regression is then applied to predict the price of apartments based on the square root of square footage, but the interpretation of coefficients lacks clarity. The code introduces the concept of percent change but could benefit from a more straightforward explanation.

Additionally, the interpretation assumes that all other variables are held constant, which may not be explicitly stated in the code. The visualization of the Poisson model against the data provides a useful visual aid, but the interpretation of the model’s coefficients could be further simplified for a broader audience.
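The percent-change interpretation can be stated more directly: under a log link, a one-unit increase in the predictor multiplies the expected response by exp(coefficient). The sketch below checks this on simulated counts. The predictor name `sqrt_sqft` mirrors the square-root transformation mentioned in this section, but the data and coefficients are assumptions.

```r
# Simulated count response with a log-linear mean (hypothetical values)
set.seed(4)
n <- 300
sqrt_sqft <- runif(n, 20, 60)
y <- rpois(n, lambda = exp(1 + 0.05 * sqrt_sqft))

fit <- glm(y ~ sqrt_sqft, family = poisson(link = "log"))
b1 <- unname(coef(fit)["sqrt_sqft"])

# A one-unit increase in the predictor multiplies the expected response
# by exp(b1), which is a percent change of 100 * (exp(b1) - 1)
multiplier <- exp(b1)
pct_change <- 100 * (multiplier - 1)

# Check against model predictions at two points one unit apart
mu <- predict(fit, newdata = data.frame(sqrt_sqft = c(30, 31)), type = "response")
ratio <- unname(mu[2] / mu[1])

c(multiplier = multiplier, pct_change = pct_change, predicted_ratio = ratio)
```

The ratio of the two predictions equals `multiplier` wherever the two points are placed, which is what makes the percent-change reading valid across the whole range of the predictor.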

The code is generally informative but may benefit from enhanced clarity in explanations and a more straightforward approach to variable transformations and model interpretation.