Week 10 covered transformations and link functions. This section critiques that material. The examples use the apartments dataset, which contains apartments in SF and NYC.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Transformations

We begin by building a linear regression model to explore the relationship between the price of apartments and their price per square foot.

url <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv"

apts <- read_delim(url, delim = ',')
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model <- lm(price ~ price_per_sqft,
            filter(apts, in_sf == 0))

rsquared <- summary(model)$r.squared

apts |> 
  filter(in_sf == 0) |>
  ggplot(mapping = aes(x = price_per_sqft, 
                       y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Price vs. Price Per Sq. Ft.",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The model has several problems. First, the choice of predictor is circular: price per square foot is by definition price divided by square footage, so the predictor is itself computed from the response. The fit therefore reflects that arithmetic relationship as much as anything that drives price. Second, the model omits potentially influential predictors such as neighborhood, number of bedrooms and bathrooms, and building amenities, which can bias the estimated coefficient. Third, the data are observational rather than randomized, so unmeasured confounders may account for part of the observed relationship between price and price per square foot.
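The circularity can be illustrated with a minimal sketch on simulated data. The data frame `sim` and its values are made up for illustration (only the column names mirror the apartments data), and the sketch assumes price is exactly price per square foot times square footage.

```r
# Simulated listings: values are hypothetical, not the apartments data
set.seed(1)
n <- 200
sim <- data.frame(sqft = runif(n, 400, 3000),
                  price_per_sqft = runif(n, 500, 3000))
sim$price <- sim$price_per_sqft * sim$sqft  # assumed exact identity

# On the log scale the identity is linear:
# log(price) = log(price_per_sqft) + log(sqft)
identity_fit <- lm(log(price) ~ log(price_per_sqft) + log(sqft), data = sim)
round(coef(identity_fit), 6)  # intercept near 0, both slopes near 1

# With price_per_sqft alone, sqft is pushed into the error term
partial_fit <- lm(price ~ price_per_sqft, data = sim)
summary(partial_fit)$r.squared
```

Under that assumption the first fit recovers the identity rather than any market behavior, and the second fit's unexplained variation is simply the omitted square footage.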

R-squared is the only metric used to evaluate the model, which gives a limited view of its performance. Other metrics, such as mean squared error (MSE) or root mean squared error (RMSE), could be used to assess predictive performance. The code below adds MSE.

# Fit linear regression model
model <- lm(price ~ price_per_sqft, data = filter(apts, in_sf == 0))

# Make in-sample predictions (same rows the model was fitted on)
predictions <- predict(model, newdata = filter(apts, in_sf == 0))

# Calculate in-sample Mean Squared Error
mse <- mean((apts %>% filter(in_sf == 0) %>% pull(price) - predictions)^2)

# Visualize the data and linear fit
apts %>%
  filter(in_sf == 0) %>%
  ggplot(mapping = aes(x = price_per_sqft, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Price vs. Price Per Sq. Ft.",
    subtitle = paste("Linear Fit MSE =", round(mse, 3))
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

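R-squared reports the share of variance in the response that the model explains. MSE is the average squared difference between observed and predicted prices, so it expresses prediction error in squared price units; its square root, RMSE, is in the same units as price. Both metrics are computed from the same residuals, so on the training data MSE does not rank models differently from R-squared. Its advantage is that it states the size of the errors directly and extends naturally to held-out data.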
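The link between the two metrics can be checked with a short sketch. It uses simulated data (the data frame `sim` and the coefficients used to generate it are made up) and base R only.

```r
# Simulated data: values are hypothetical, not the apartments data
set.seed(42)
n <- 224
sim <- data.frame(price_per_sqft = runif(n, 500, 4000))
sim$price <- -3e6 + 3700 * sim$price_per_sqft + rnorm(n, sd = 2e6)

fit <- lm(price ~ price_per_sqft, data = sim)
res <- residuals(fit)

mse  <- mean(res^2)               # squared price units
rmse <- sqrt(mse)                 # same units as price
r2   <- summary(fit)$r.squared

# R-squared rebuilt from MSE: 1 - SSE / SST, with SSE = n * MSE
sst <- sum((sim$price - mean(sim$price))^2)
r2_from_mse <- 1 - (n * mse) / sst

c(rmse = rmse, r2 = r2, r2_from_mse = r2_from_mse)
```

The last two entries agree, which shows that in-sample MSE and R-squared carry the same information about fit; RMSE adds the typical error size in price units.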

Power transformation

In the power transformation section, new linear models were fitted after the transformation, but they were not compared with the original model. A few lines of code could formally compare the models to assess whether the transformed model provides a better fit.

# Original Model
original_model <- lm(price ~ price_per_sqft, data = filter(apts, in_sf == 0))

# Transformed Model
apts <- apts %>%
  mutate(log_price = log(price))

transformed_model <- lm(log_price ~ price_per_sqft, data = filter(apts, in_sf == 0))

# Model Comparison
summary(original_model)
## 
## Call:
## lm(formula = price ~ price_per_sqft, data = filter(apts, in_sf == 
##     0))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4289426 -1480152    19921   894222 17184132 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3213431.7   322193.0  -9.974   <2e-16 ***
## price_per_sqft     3689.5      175.8  20.984   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2221000 on 222 degrees of freedom
## Multiple R-squared:  0.6648, Adjusted R-squared:  0.6633 
## F-statistic: 440.3 on 1 and 222 DF,  p-value: < 2.2e-16
summary(transformed_model)
## 
## Call:
## lm(formula = log_price ~ price_per_sqft, data = filter(apts, 
##     in_sf == 0))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1514 -0.3639 -0.0443  0.3023  1.9612 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.271e+01  7.256e-02  175.14   <2e-16 ***
## price_per_sqft 9.815e-04  3.960e-05   24.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5001 on 222 degrees of freedom
## Multiple R-squared:  0.7346, Adjusted R-squared:  0.7334 
## F-statistic: 614.5 on 1 and 222 DF,  p-value: < 2.2e-16
# AIC Comparison
AIC(original_model)
## [1] 7186.49
AIC(transformed_model)
## [1] 329.266

The new code adds an explicit model comparison by computing AIC for both the original and the transformed model, alongside the summary() output.

For AIC, lower values indicate a better trade-off between goodness of fit and model complexity. That comparison is only valid when both models are fitted to the same response, however. Here the original model's response is price and the transformed model's is log(price), so the two likelihoods are on different scales and the raw values (7186.49 and 329.266) cannot be compared directly; much of that gap reflects the change of scale rather than a better fit. To put the log model on the price scale, add the Jacobian term 2 * sum(log(price)) to its AIC before comparing. The same caution applies to R-squared: 0.6648 is the share of variance in price explained, while 0.7346 is the share of variance in log(price) explained.

The original code relied on visual assessment and R-squared for model evaluation. AIC is a widely used comparison metric that accounts for both goodness of fit and model complexity, so adding it, with the scale adjustment, makes the evaluation more quantitative.
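AIC values are only comparable when the models share the same response, so a model for log(price) needs a Jacobian adjustment of 2 * sum(log(price)) before it is compared with a model for price. The following is a minimal sketch on simulated data; the data frame `sim` and the values used to generate it are made up, not taken from the apartments data.

```r
# Simulated data: values are hypothetical, not the apartments data
set.seed(7)
n <- 224
sim <- data.frame(price_per_sqft = runif(n, 500, 4000))
# Prices are generated as log-normal for this illustration
sim$price <- exp(12.7 + 0.001 * sim$price_per_sqft + rnorm(n, sd = 0.5))

fit_raw <- lm(price ~ price_per_sqft, data = sim)
fit_log <- lm(log(price) ~ price_per_sqft, data = sim)

aic_raw <- AIC(fit_raw)
aic_log_unadjusted <- AIC(fit_log)  # likelihood of log(price)

# Change of variables: density of price = density of log(price) / price,
# so the log-likelihood drops by sum(log(price)) and AIC rises by twice that
aic_log_adjusted <- aic_log_unadjusted + 2 * sum(log(sim$price))

c(raw = aic_raw,
  log_unadjusted = aic_log_unadjusted,
  log_adjusted = aic_log_adjusted)
```

The comparison to read is `aic_raw` against `aic_log_adjusted`; the unadjusted value is on a different scale and will look far smaller regardless of fit.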