Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.

First, review the materials from the Lesson on Ethics and Epistemology (week 5?). These include the lecture slides, the lecture video, and the reading, and you can use them as reference materials for this lab. You might also consult the reading for the week associated with your assigned lab, or supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).

For the lab your group has been assigned, consider issues with the models, possible statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code and investigating the data from a different perspective. Take notes on all the issues you see and on possible solutions (even if you would need to request more data or resources to accomplish those solutions).

Share your model critique in this notebook as your data dive submission for the week.

As a start, think about the context of the lab and consider the following:

Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).

If you were unable to attend class, select a notes_*.Rmd file from a previous week (not including weeks 1 or 3), and complete the analysis above. Share your critique below.

Example

For example, in Week 11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:

  • Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
  • Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., \(\log(\text{price})\)) and each of these variables linear? Are the error terms evenly distributed?
  • Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
  • What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
  • Consider different evaluation metrics between models. What is a practical use for these values?

Week 10 Notes Analysis

Analysis: Logistic Regression

In this section, our goal was to predict whether or not an apartment is in SF. To do this, we used in_sf, a binary response variable, with elevation as our explanatory variable. What we want is a monotonic function which outputs a value between 0 and 1 for a given elevation. I think better variables for predicting whether an apartment is in San Francisco would be the following (see the sketch after this list):

  1. Longitude/Latitude: Instead of using elevation alone, we could use more specific location indicators such as latitude and longitude coordinates, because some apartments in SF sit at low elevations that match the elevations of apartments in NYC. Coordinates provide a more precise representation of the apartment’s location, capturing nuances like neighborhood characteristics, proximity to landmarks, or distance from the city center.

  2. Zip Codes: Using zip codes could capture localized variations in housing patterns and demographics. Apartments in certain zip codes or neighborhoods might be more likely to be in San Francisco due to factors like affordability, amenities, or cultural attractions.
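As a rough sketch of these ideas: the apartments data does not include coordinates or zip codes, so apts_geo and the latitude, longitude, and zip columns below are hypothetical placeholders for an augmented dataset. (Coordinates would likely separate the two cities perfectly, which is itself a sign of how strong these predictors are.)

# Hypothetical sketch: assumes an augmented data frame `apts_geo` with
# latitude and longitude columns (not present in the current data).
model_geo <- glm(in_sf ~ latitude + longitude, data = apts_geo,
                 family = binomial(link = 'logit'))
summary(model_geo)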

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
options(scipen = 6)
theme_set(theme_minimal())
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"

apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Here are a few potential improvements to consider:

#1) Add labels and a legend

#2) Adjust axis limits

#3) Improve aesthetics

#4) Add regression lines (see the sketch after the combined plot below)

# Define the data
x <- runif(100, 1, 100)
y <- 2 * sqrt(x) + rnorm(100, sd = 0.5)

# Calculate transformed x values
x_transformed <- sqrt(x)

# Create a data frame for the transformed data
transformed_data <- data.frame(x = x_transformed, y = y)

# Plot 1: Before Transformation
p1 <- ggplot(data = data.frame(x = x, y = y)) + 
  geom_point(mapping = aes(x = x, y = y), color = "blue", alpha = 0.5) +  # Adjust aesthetics
  labs(title = 'Before Transformation', x = "x", y = "y") +  # Add axis labels
  theme_minimal()  # Improve aesthetics

# Plot 2: After Transformation
p2 <- ggplot(data = transformed_data) + 
  geom_point(mapping = aes(x = x, y = y), color = "red", alpha = 0.5) +  # Adjust aesthetics
  labs(title = 'After Transformation', x = "x_p = sqrt(x)", y = "y") +  # Add axis labels
  theme_minimal() +  # Improve aesthetics
  xlim(0, max(x_transformed)) +  # Adjust x-axis limits
  ylim(min(y), max(y))  # Adjust y-axis limits

# Combine plots using patchwork
combined_plots <- p1 + p2

# Display the combined plot
combined_plots
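
Item 4 in the list above (regression lines) is not implemented in the chunk; here is a minimal sketch of what it could look like, reusing the p2 object defined above:

# Sketch: overlay a linear fit on the transformed data; after the sqrt
# transformation, the relationship should be roughly linear.
p2 + geom_smooth(mapping = aes(x = x, y = y),
                 method = 'lm', se = FALSE, color = 'black')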

#Here are a few potential improvements to consider:

#Add a residuals plot: along with the scatter plot of price vs. price_per_sqft, add a subplot to visualize the residuals

#Enhance aesthetics

#Increase interpretability: add annotations, such as the R-squared in the subtitle

library(ggplot2)
library(dplyr)

# Calculate residuals
apts_ny <- apts %>%
  filter(in_sf == 0)

model <- lm(price ~ price_per_sqft, data = apts_ny)
apts_ny <- apts_ny %>%
  mutate(residuals = residuals(model))

# Scatter plot with residuals
scatter_plot <- ggplot(data = apts_ny, aes(x = price_per_sqft, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  labs(title = "Price vs. Price Per Sq. Ft.",
       subtitle = paste("Linear Fit R-Squared =", round(summary(model)$r.squared, 3)),
       x = "Price Per Sq. Ft.",
       y = "Price") +
  theme_minimal()

# Residuals plot
residuals_plot <- ggplot(data = apts_ny, aes(x = price_per_sqft, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals Plot",
       x = "Price Per Sq. Ft.",
       y = "Residuals") +
  theme_minimal()

# Combine both plots
scatter_plot + residuals_plot + plot_layout(guides = 'collect')
## `geom_smooth()` using formula = 'y ~ x'

# Fit the quadratic model
model <- lm(price ~ I(price_per_sqft ^ 2) + price_per_sqft, data = apts_ny)
rsquared <- summary(model)$r.squared

# Create the plot: compare a straight-line fit against the quadratic fit
ggplot(data = apts_ny, aes(x = price_per_sqft, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x,
              aes(color = "Linear Fit"), se = FALSE, linetype = 'dashed') +
  geom_smooth(method = 'lm', formula = y ~ I(x ^ 2) + x,
              aes(color = "Quadratic Fit"), se = FALSE) +
  labs(title = "Price vs. Price Per Sq. Ft.",
       subtitle = paste("Quadratic Fit R-Squared =", round(rsquared, 3)),
       x = "Price Per Sq. Ft.",
       y = "Price") +
  scale_color_manual(values = c("Linear Fit" = "gray", "Quadratic Fit" = "blue")) +
  theme_minimal()

# Up to this point, the lab looks pretty good. Things we have done so far:

#1) Visualization enhancement

#2) Code efficiency

#3) Interpretation clarity

summary(model)
## 
## Call:
## lm(formula = price ~ I(price_per_sqft^2) + price_per_sqft, data = apts_ny)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6546137  -557706  -227157   409955 15109341 
## 
## Coefficients:
##                         Estimate   Std. Error t value   Pr(>|t|)    
## (Intercept)         2218792.5314  535882.0113   4.140 0.00004933 ***
## I(price_per_sqft^2)       1.5127       0.1312  11.528    < 2e-16 ***
## price_per_sqft        -2773.1446     577.6418  -4.801 0.00000291 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1759000 on 221 degrees of freedom
## Multiple R-squared:  0.7907, Adjusted R-squared:  0.7888 
## F-statistic: 417.4 on 2 and 221 DF,  p-value: < 2.2e-16

#The code utilizes gg_diagnose from the lindia package to generate diagnostic plots for a linear regression model.

#Diagnostic plots include the plot of residuals versus fitted values (res_fitted) and the QQ plot (qqplot) of residuals.

#These plots assist in evaluating the assumptions of linearity, homoscedasticity, and normality of residuals in the linear regression model.

#Patterns in the plot of residuals versus fitted values and deviations from the diagonal line in the QQ plot can indicate potential violations of model assumptions.

#The diagnostic plots aid in identifying areas for further model refinement or validation.

plots <- gg_diagnose(model, plot.all = FALSE)
plot_all(plots[c('res_fitted', 'qqplot')], 
         max.per.page = 1)

#The following code snippet utilizes the powerTransform function from the car package.

#It applies the Box-Cox transformation (family = "bcnPower") to the linear regression model (model).

#The purpose is to identify the optimal transformation parameter (lambda) for achieving normality in the response variable distribution. A lambda near zero (here, about 0.044) corresponds to a log transformation, which motivates the log(price) histograms below.

pT <- powerTransform(model, family = "bcnPower")
pT$lambda
## [1] 0.04443713
# Create histograms for price and log(price)
p1 <- ggplot(data = apts) +
  geom_histogram(aes(x = price), color = 'white') +
  labs(x = "Price")

p2 <- ggplot(data = apts) +
  geom_histogram(aes(x = log(price)), color = 'white') +
  labs(x = "Log(Price)")

# Combine the histograms using patchwork
p1 + p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the Poisson regression exercise, we made some changes:

apts |>
  ggplot(mapping = aes(x = price)) +
  geom_histogram(color = 'white') +
  labs(title = "Histogram of Price of SF and NYC Apartments") +
  theme_hc()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

apts |>
  ggplot(mapping = aes(x = sqft, y = log(price))) +
  geom_point(shape = 'O', size = 3) +
  geom_smooth(se = FALSE) +
  labs(title = "Log(Price) vs. Square Footage") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

apts |>
  ggplot(mapping = aes(x = sqrt(sqft), y = log(price))) +
  geom_point(shape = 'O', size = 3) +
  geom_smooth(se = FALSE) +
  labs(title = "Log(price) vs. sqrt(Square Footage)") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#The number of bedrooms (beds) can reflect the size and type of the property. Different types of properties (e.g., apartments, houses) may have different price dynamics even with the same square footage.

#While sqft represents the total area of the property, beds provides another dimension of size or capacity.

#There could be interaction effects between sqft and beds that influence the price.

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"
# Fit a Poisson GLM with log link function including both 'sqft' and 'beds'
model <- glm(price ~ sqft + beds, data = apts, family = poisson(link = "log"))


# Print summary of the model
summary(model)
## 
## Call:
## glm(formula = price ~ sqft + beds, family = poisson(link = "log"), 
##     data = apts)
## 
## Coefficients:
##                   Estimate     Std. Error z value Pr(>|z|)    
## (Intercept) 13.63716143633  0.00006142994  221995   <2e-16 ***
## sqft         0.00051139808  0.00000002713   18850   <2e-16 ***
## beds        -0.03936119885  0.00002984976   -1319   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1026709125  on 491  degrees of freedom
## Residual deviance:  434387986  on 489  degrees of freedom
## AIC: 434395832
## 
## Number of Fisher Scoring iterations: 5

Adding another predictor variable like “beds” to the Poisson regression model alongside “sqft” can help improve the model’s predictive power by capturing additional variability in the response variable “price.” This approach assumes that the number of bedrooms (“beds”) has an independent effect on the price of the apartment, beyond what can be explained by the square footage (“sqft”).

By including “beds” in the model, we aim to account for any variation in price that is associated with the number of bedrooms, beyond what is already explained by the square footage. This can lead to a more accurate and robust model for predicting apartment prices based on these features.
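
To act on the interaction idea raised above, one could fit the sketch below. This is not part of the original lab; it also swaps in a quasi-Poisson family, since price is not a count and the residual deviance above (about 434 million on 489 degrees of freedom) points to severe overdispersion.

# Sketch: let the effect of sqft vary with beds via an interaction term.
# quasipoisson keeps the log link but relaxes the Poisson variance
# assumption, which the very large residual deviance suggests is violated.
model_int <- glm(price ~ sqft * beds, data = apts,
                 family = quasipoisson(link = "log"))
summary(model_int)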

Week 11 Analysis

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"

apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# repeating the transformations
apts <- apts |>
  mutate(price_per_sqft_2 = price_per_sqft ^ 2,
         log_price = log(price),
         sqrt_sqft = sqrt(sqft))

Deviance

Below, we compare the deviance of two logistic regression models; lower deviance indicates a better fit. The elevation model (deviance of about 300.1) fits considerably better than the sqft model (about 671.6).

model1 <- glm(in_sf ~ elevation, data = apts,
             family = binomial(link = 'logit'))

model2 <- glm(in_sf ~ sqft, data = apts,
             family = binomial(link = 'logit'))

paste("Model 1 Deviance", round(model1$deviance, 1))
## [1] "Model 1 Deviance 300.1"
paste("Model 2 Deviance", round(model2$deviance, 1))
## [1] "Model 2 Deviance 671.6"

Coin Flips

#Here are some potential improvements and explanations:

#Enhanced Visualization

#Highlight Observed Data

#Interactive Elements

In this updated version, we increased the line and point sizes for better visibility, highlighted the observed data with a gray dashed line, and improved the legend labels and color palette for clarity. Additionally, we used the minimal theme for a cleaner appearance. These changes enhance the overall readability and interpretability of the plot.

library(ggplot2)

n_flips <- 10
n_observed <- 6

# Generate data frame for likelihood calculation
df_flips <- data.frame(
  flips = rep(seq(0, n_flips), 3),
  probs = c(rep(0.25, n_flips + 1), 
            rep(0.5, n_flips + 1), 
            rep(0.75, n_flips + 1))
)

# Plot likelihood of observing heads
ggplot(data = df_flips,
       mapping = aes(x = flips,
                     y = dbinom(flips, size = n_flips, prob = probs),
                     color = factor(probs))) +
  geom_line(linewidth = 1) +  # Use linewidth instead of size
  geom_point(size = 3) +  # Keep point size as it is
  geom_vline(xintercept = n_observed, 
             color = 'gray', linetype = 'dashed', linewidth = 1) +  # Highlight observed data
  labs(title = "Likelihood for Candidate Coin Probabilities",
       x = "Number of Heads (out of 10 flips)",
       y = "Likelihood",
       color = "Probability of Heads") +  # Update legend title
  scale_x_continuous(breaks = 0:10) +
  scale_y_continuous(limits = c(0, 0.5)) +
  scale_color_manual(values = c("blue", "green", "red"),  # Update color palette
                     labels = c("0.25", "0.50", "0.75")) +  # Update legend labels
  theme_minimal()  # Use minimal theme for cleaner appearance
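
A natural follow-up (not in the original lab) is to search a finer grid of candidate probabilities; the likelihood of the observed 6 heads in 10 flips peaks at the familiar estimate 6/10.

# Sketch: evaluate the likelihood of the observed data over a fine grid;
# the maximizer is the maximum likelihood estimate, 6/10 = 0.6.
probs_grid <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(n_observed, size = n_flips, prob = probs_grid)
probs_grid[which.max(lik)]  # 0.6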

Analysis of Other Sections

Cost Function

Provides a comprehensive explanation of the cost function in GLMs, illustrating its role in maximizing likelihood and minimizing error. Highlights the importance of understanding cost functions for model fitting and evaluation, emphasizing their practical significance in statistical modeling.
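
As a concrete instance (a standard formulation, stated here for reference): in logistic regression, the cost being minimized is the negative log-likelihood, the binomial cross-entropy

\[
J(\beta) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right],
\qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}},
\]

so maximizing the likelihood and minimizing this cost are the same problem.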

Logistic Regression

Offers a succinct overview of logistic regression and its application in binary classification problems. Clarifies the interpretation of model deviance in logistic regression, emphasizing its role in evaluating classification accuracy and model performance.
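
One way to make the deviance concrete (a sketch reusing model1 from the Week 11 section above): for ungrouped binary data, the saturated model’s log-likelihood is zero, so the model deviance is simply minus twice the model’s log-likelihood.

# Sketch: deviance = -2 * log-likelihood for ungrouped binary responses.
-2 * as.numeric(logLik(model1))  # should match model1$deviance (about 300.1)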

Model Comparison

Presents a thorough examination of various model comparison methods, including deviance, AIC, and BIC. Demonstrates the importance of selecting appropriate criteria for evaluating model fit and highlights the complementary nature of different evaluation metrics.
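
For example, reusing model1 and model2 from above, R’s built-in AIC() and BIC() functions make this comparison a one-liner; lower values indicate a better trade-off between fit and complexity.

# Sketch: penalized comparison of the two logistic models fit earlier.
AIC(model1, model2)
BIC(model1, model2)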

ANOVA and Nested Models

Provides clear insights into ANOVA for nested models and its utility in assessing differences in model deviance. Illustrates the practical application of ANOVA in comparing nested GLM models, emphasizing its role in identifying significant improvements in model fit.
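
A minimal sketch of such a test on this data (the expanded model is my own hypothetical choice, not from the lab): add price_per_sqft to the elevation-only model and test whether the reduction in deviance is significant.

# Sketch: likelihood ratio (chi-squared) test for a nested pair of GLMs.
model_base <- glm(in_sf ~ elevation, data = apts, family = binomial)
model_full <- glm(in_sf ~ elevation + price_per_sqft, data = apts,
                  family = binomial)
anova(model_base, model_full, test = "Chisq")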