For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.
First, review the materials from the Lesson on Ethics and Epistemology (week 5?). This includes the lecture slides, the lecture video, and the reading; use these as reference materials for this lab. You may also consult the reading for the week associated with your assigned lab, or supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).
For the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).
Share your model critique in this notebook as your data dive submission for the week.
As a start, think about the context of the lab and consider the following:
- Analytical issues, such as model assumptions
- Statistical improvements: what do we know now that we didn't know then?
- Whether better visualizations could have been used
- Overcoming biases (existing or potential)
- Possible risks or societal implications
- Crucial issues which might not be measurable
Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).

If you were unable to attend class, select a notes_*.Rmd file from a previous week (not including weeks 1 or 3), and complete the analysis above. Share your critique below.
For example, in Week 11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment; the prompts above suggest the kinds of questions you might ask about that analysis.
# Week 10 Lab: Notes Analysis
Analysis: Logistic Regression
In this section our goal was to predict whether or not an apartment is in SF. To do this, we used in_sf, a binary response variable, with elevation as our explanatory variable. What we want is a monotonic function which outputs something like 1 or 0 for a given elevation. I think "better" variables for predicting whether an apartment is in San Francisco with logistic regression would be:

- Longitude/Latitude: Instead of using elevation alone, we should use more specific location indicators such as latitude and longitude coordinates, because some apartments in SF sit at low elevations that match the elevations of apartments in NYC. Coordinates provide a more precise representation of the apartment's location, capturing nuances like neighborhood characteristics, proximity to landmarks, or distance from the city center. (See the sketch after this list.)
- Zip Codes: Using zip codes could capture localized variations in housing patterns and demographics. Apartments in certain zip codes or neighborhoods might be more likely to be in San Francisco due to factors like affordability, amenities, or cultural attractions.
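A minimal sketch of what the first idea could look like as a model formula. Note that latitude and longitude are hypothetical columns here; the apartments.csv file loaded below contains neither, so the call is left commented out rather than runnable:

# Hypothetical sketch: `latitude` and `longitude` are NOT in apartments.csv,
# so this model formula is illustrative only
# model_geo <- glm(in_sf ~ latitude + longitude, data = apts,
#                  family = binomial(link = 'logit'))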
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
options(scipen = 6)
theme_set(theme_minimal())
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"
apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Here are a few potential improvements to consider:
# 1) Add labels and a legend
# 2) Adjust axis limits
# 3) Improve aesthetics
# 4) Add regression lines
# Define the data (with a seed so the simulated example is reproducible)
set.seed(42)
x <- runif(100, 1, 100)
y <- 2 * sqrt(x) + rnorm(100, sd = 0.5)
# Calculate transformed x values
x_transformed <- sqrt(x)
# Create a data frame for the transformed data
transformed_data <- data.frame(x = x_transformed, y = y)
# Plot 1: Before Transformation
p1 <- ggplot(data = data.frame(x = x, y = y)) +
  geom_point(mapping = aes(x = x, y = y), color = "blue", alpha = 0.5) + # Adjust aesthetics
  geom_smooth(mapping = aes(x = x, y = y), method = 'lm', formula = y ~ x,
              se = FALSE, color = 'darkblue') + # Add regression line (note the curvature it misses)
  labs(title = 'Before Transformation', x = "x", y = "y") + # Add axis labels
  theme_minimal() # Improve aesthetics
# Plot 2: After Transformation
p2 <- ggplot(data = transformed_data) +
  geom_point(mapping = aes(x = x, y = y), color = "red", alpha = 0.5) + # Adjust aesthetics
  geom_smooth(mapping = aes(x = x, y = y), method = 'lm', formula = y ~ x,
              se = FALSE, color = 'darkred') + # Add regression line (roughly linear after transformation)
  labs(title = 'After Transformation', x = "x_p = sqrt(x)", y = "y") + # Add axis labels
  theme_minimal() + # Improve aesthetics
  xlim(0, max(x_transformed)) + # Adjust x-axis limits
  ylim(min(y), max(y)) # Adjust y-axis limits
# Combine plots using patchwork
combined_plots <- p1 + p2
# Display the combined plot
combined_plots
# Here are a few potential improvements to consider:
# 1) Add a residuals plot: alongside the scatter plot of price vs.
#    price_per_sqft, add a subplot to visualize the residuals
# 2) Enhance aesthetics
# 3) Increase interpretability: annotate the plot (e.g., report the
#    R-squared in the subtitle)
library(ggplot2)
library(dplyr)
# Calculate residuals
apts_ny <- apts %>%
filter(in_sf == 0)
model <- lm(price ~ price_per_sqft, data = apts_ny)
apts_ny <- apts_ny %>%
mutate(residuals = residuals(model))
# Scatter plot with residuals
scatter_plot <- ggplot(data = apts_ny, aes(x = price_per_sqft, y = price)) +
geom_point() +
geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
labs(title = "Price vs. Price Per Sq. Ft.",
subtitle = paste("Linear Fit R-Squared =", round(summary(model)$r.squared, 3)),
x = "Price Per Sq. Ft.",
y = "Price") +
theme_minimal()
# Residuals plot
residuals_plot <- ggplot(data = apts_ny, aes(x = price_per_sqft, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(title = "Residuals Plot",
x = "Price Per Sq. Ft.",
y = "Residuals") +
theme_minimal()
# Combine both plots
scatter_plot + residuals_plot + plot_layout(guides = 'collect')
## `geom_smooth()` using formula = 'y ~ x'
# Fit the model
model <- lm(price ~ I(price_per_sqft ^ 2) + price_per_sqft, data = apts_ny)
rsquared <- summary(model)$r.squared
# Create the plot: linear and quadratic fits on the original scale, with each
# fit specified explicitly via the formula argument (otherwise both smooths
# would default to the same 'y ~ x' line)
ggplot(data = apts_ny, aes(x = price_per_sqft, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x,
              aes(color = "Linear Fit"), se = FALSE, linetype = 'dashed') +
  geom_smooth(method = 'lm', formula = y ~ x + I(x ^ 2),
              aes(color = "Quadratic Fit"), se = FALSE) +
  labs(title = "Price vs. Price Per Sq. Ft.",
       subtitle = paste("Quadratic Fit R-Squared =", round(rsquared, 3)),
       x = "Price Per Sq. Ft.",
       y = "Price") +
  scale_color_manual(values = c("Linear Fit" = "gray", "Quadratic Fit" = "blue")) +
  theme_minimal()
# Up to this section it looks pretty good. Things we have done in this
# section:
# 1) Visualization enhancement
# 2) Code efficiency
# 3) Interpretation clarity
summary(model)
##
## Call:
## lm(formula = price ~ I(price_per_sqft^2) + price_per_sqft, data = apts_ny)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6546137 -557706 -227157 409955 15109341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2218792.5314 535882.0113 4.140 0.00004933 ***
## I(price_per_sqft^2) 1.5127 0.1312 11.528 < 2e-16 ***
## price_per_sqft -2773.1446 577.6418 -4.801 0.00000291 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1759000 on 221 degrees of freedom
## Multiple R-squared: 0.7907, Adjusted R-squared: 0.7888
## F-statistic: 417.4 on 2 and 221 DF, p-value: < 2.2e-16
#The code utilizes gg_diagnose from the lindia package to generate diagnostic plots for a linear regression model.
#Diagnostic plots include the plot of residuals versus fitted values (res_fitted) and the QQ plot (qqplot) of residuals.
#These plots assist in evaluating the assumptions of linearity, homoscedasticity, and normality of residuals in the linear regression model.
#Patterns in the plot of residuals versus fitted values and deviations from the diagonal line in the QQ plot can indicate potential violations of model assumptions.
#The diagnostic plots aid in identifying areas for further model refinement or validation.
plots <- gg_diagnose(model, plot.all = FALSE)
plot_all(plots[c('res_fitted', 'qqplot')],
max.per.page = 1)
# The following code snippet utilizes the powerTransform function from
# the car package.
# It applies the Box-Cox transformation (family = "bcnPower") to the linear
# regression model (model).
# The purpose is to identify the optimal transformation parameter (lambda)
# for achieving normality in the response variable distribution.
pT <- powerTransform(model, family = "bcnPower")
pT$lambda
## [1] 0.04443713
# The estimated lambda (~0.04) is close to zero, suggesting a log
# transformation of the response. Create histograms for price and log(price)
p1 <- ggplot(data = apts) +
geom_histogram(aes(x = price), color = 'white') +
labs(x = "Price")
p2 <- ggplot(data = apts) +
geom_histogram(aes(x = log(price)), color = 'white') +
labs(x = "Log(Price)")
# Combine the histograms using patchwork
p1 + p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
apts |>
  ggplot(mapping = aes(x = price)) +
  geom_histogram(color = 'white') +
  labs(title = "Histogram of Price of SF and NYC Apartments") +
  theme_hc()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
apts |>
ggplot(mapping = aes(x = sqft, y = log(price))) +
geom_point(shape = 'O', size = 3) +
geom_smooth(se = FALSE) +
labs(title = "Log(Price) vs. Square Footage") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
apts |>
ggplot(mapping = aes(x = sqrt(sqft), y = log(price))) +
geom_point(shape = 'O', size = 3) +
geom_smooth(se = FALSE) +
labs(title = "Log(price) vs. sqrt(Square Footage)") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# The number of bedrooms (beds) can reflect the size and type of the
# property. Different types of properties (e.g., apartments, houses) may
# have different price dynamics even with the same square footage.
# While sqft represents the total area of the property, beds provides
# another dimension of size or capacity.
# There could be interaction effects between sqft and beds that influence
# the price (a quick check is sketched below).
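# A quick sketch of how that interaction could be checked. This reuses the
# Poisson/log specification applied below (an assumption for illustration);
# `sqft * beds` expands to both main effects plus their interaction.
model_int <- glm(price ~ sqft * beds, data = apts, family = poisson(link = "log"))
summary(model_int)$coefficients["sqft:beds", ]  # interaction estimate and z-test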
# Fit a Poisson GLM with log link function including both 'sqft' and 'beds'
model <- glm(price ~ sqft + beds, data = apts, family = poisson(link = "log"))
# Print summary of the model
summary(model)
##
## Call:
## glm(formula = price ~ sqft + beds, family = poisson(link = "log"),
## data = apts)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.63716143633 0.00006142994 221995 <2e-16 ***
## sqft 0.00051139808 0.00000002713 18850 <2e-16 ***
## beds -0.03936119885 0.00002984976 -1319 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1026709125 on 491 degrees of freedom
## Residual deviance: 434387986 on 489 degrees of freedom
## AIC: 434395832
##
## Number of Fisher Scoring iterations: 5
Adding another predictor variable like “beds” to the Poisson regression model alongside “sqft” can help improve the model’s predictive power by capturing additional variability in the response variable “price.” This approach assumes that the number of bedrooms (“beds”) has an independent effect on the price of the apartment, beyond what can be explained by the square footage (“sqft”).
By including “beds” in the model, we aim to account for any variation in price that is associated with the number of bedrooms, beyond what is already explained by the square footage. This can lead to a more accurate and robust model for predicting apartment prices based on these features.
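One way to check that claim is a likelihood ratio test between the nested models with and without beds. A sketch (not part of the original lab), reusing the same Poisson/log specification:

# Sketch: does adding `beds` meaningfully reduce deviance?
model_sqft <- glm(price ~ sqft, data = apts, family = poisson(link = "log"))
model_both <- glm(price ~ sqft + beds, data = apts, family = poisson(link = "log"))
anova(model_sqft, model_both, test = "Chisq")  # likelihood ratio test for `beds`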
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"
apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# repeating the transformations
apts <- apts |>
mutate(price_per_sqft_2 = price_per_sqft ^ 2,
log_price = log(price),
sqrt_sqft = sqrt(sqft))
## Deviance

Below, we compare the residual deviance of two logistic regressions for in_sf, one using elevation and one using sqft as the explanatory variable; lower residual deviance indicates a better fit.
model1 <- glm(in_sf ~ elevation, data = apts,
family = binomial(link = 'logit'))
model2 <- glm(in_sf ~ sqft, data = apts,
family = binomial(link = 'logit'))
paste("Model 1 Deviance", round(model1$deviance, 1))
## [1] "Model 1 Deviance 300.1"
paste("Model 2 Deviance", round(model2$deviance, 1))
## [1] "Model 2 Deviance 671.6"
# Here are some potential improvements and explanations:
# 1) Enhanced visualization
# 2) Highlight observed data
# 3) Interactive elements
In this updated version, we increased the line and point sizes for better visibility, highlighted the observed data with a gray dashed line, and improved the legend labels and color palette for clarity. Additionally, we used the minimal theme for a cleaner appearance. These changes enhance the overall readability and interpretability of the plot.
library(ggplot2)
n_flips <- 10
n_observed <- 6
# Generate data frame for likelihood calculation
df_flips <- data.frame(
flips = rep(seq(0, n_flips), 3),
probs = c(rep(0.25, n_flips + 1),
rep(0.5, n_flips + 1),
rep(0.75, n_flips + 1))
)
# Plot likelihood of observing heads
ggplot(data = df_flips,
mapping = aes(x = flips,
y = dbinom(flips, size = n_flips, prob = probs),
color = factor(probs))) +
geom_line(linewidth = 1) + # Use linewidth instead of size
geom_point(size = 3) + # Keep point size as it is
geom_vline(xintercept = n_observed,
color = 'gray', linetype = 'dashed', linewidth = 1) + # Highlight observed data
labs(title = "Likelihood for Candidate Coin Probabailities",
x = "Number of Flips",
y = "Likelihood",
color = "Probability of Heads") + # Update legend title
scale_x_continuous(breaks = 0:10) +
scale_y_continuous(limits = c(0, 0.5)) +
scale_color_manual(values = c("blue", "green", "red"), # Update color palette
labels = c("0.25", "0.50", "0.75")) + # Update legend labels
theme_minimal() # Use minimal theme for cleaner appearance
## Cost Function
Provides a comprehensive explanation of the cost function in GLMs, illustrating its role in maximizing likelihood and minimizing error. Highlights the importance of understanding cost functions for model fitting and evaluation, emphasizing their practical significance in statistical modeling.
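As a concrete illustration (a sketch reusing model1 from the Deviance section, not part of the lab): for a 0/1 response, the residual deviance of a logistic model is exactly twice the negative log-likelihood, the quantity the fit minimizes.

# Sketch: compute model1's negative log-likelihood (its "cost") by hand
p_hat <- predict(model1, type = "response")
nll <- -sum(apts$in_sf * log(p_hat) + (1 - apts$in_sf) * log(1 - p_hat))
all.equal(2 * nll, model1$deviance)  # deviance = 2 * negative log-likelihood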
## Logistic Regression
Offers a succinct overview of logistic regression and its application in binary classification problems. Clarifies the interpretation of model deviance in logistic regression, emphasizing its role in evaluating classification accuracy and model performance.
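For example, model1 above maps elevation to a probability of being in SF; a quick sketch (the elevation value of 50 is arbitrary):

# Sketch: predicted probability that an apartment at elevation 50 is in SF
predict(model1, newdata = data.frame(elevation = 50), type = "response")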
## Model Comparison
Presents a thorough examination of various model comparison methods, including deviance, AIC, and BIC. Demonstrates the importance of selecting appropriate criteria for evaluating model fit and highlights the complementary nature of different evaluation metrics.
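A sketch applying these criteria side by side to the two logistic models fit in the Deviance section:

# Sketch: deviance, AIC, and BIC for the two models
tibble(model = c("in_sf ~ elevation", "in_sf ~ sqft"),
       deviance = c(model1$deviance, model2$deviance),
       AIC = c(AIC(model1), AIC(model2)),
       BIC = c(BIC(model1), BIC(model2)))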
## ANOVA and Nested Models
Provides clear insights into ANOVA for nested models and its utility in assessing differences in model deviance. Illustrates the practical application of ANOVA in comparing nested GLM models, emphasizing its role in identifying significant improvements in model fit.
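A sketch of that test, using a larger model that adds sqft to model1 (this particular nested pair is an assumption for illustration):

# Sketch: likelihood ratio (ANOVA) test on nested logistic models
model3 <- glm(in_sf ~ elevation + sqft, data = apts,
              family = binomial(link = 'logit'))
anova(model1, model3, test = "Chisq")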