For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.
First, review the materials from the Lesson on Ethics and Epistemology (week 5?). This includes the lecture slides, the lecture video, and the reading; use these as reference materials for this lab. You may also consider the reading for the week associated with your assigned lab, or supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).
For the lab your group has been assigned, consider issues with models, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).
Share your model critique in this notebook as your data dive submission for the week.
As a start, think about the context of the lab and consider the following:
- Analytical issues, such as model assumptions
- Overcoming biases (existing or potential)
- Possible risks or societal implications
- Crucial issues which might not be measurable
Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
url <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv"
apts <- read_delim(url, delim = ',')
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# repeating the transformations
apts <- apts |>
mutate(price_per_sqft_2 = price_per_sqft ^ 2,
log_price = log(price),
sqrt_sqft = sqrt(sqft))
n_flips <- 10      # number of coin flips per trial
n_observed <- 6    # observed number of heads out of the 10 flips

# possible head counts (0 to 10) for each candidate P(Heads): 0.25, 0.5, 0.75
df_flips <- data.frame(
flips = rep(seq(0, n_flips), 3),
probs = c(rep(0.25, n_flips + 1),
rep(0.5, n_flips + 1),
rep(0.75, n_flips + 1))
)
ggplot(data = df_flips,
mapping = aes(x = flips,
y = dbinom(flips, size = n_flips, prob = probs),
color = paste("p = ", probs))) +
geom_line() +
geom_point(size = 2) +
geom_vline(xintercept = n_observed,
color = 'gray', linetype = 'dashed', linewidth = 1) +
labs(title = "Likelihood for Candidate Coin Probabilities",
x = "Number of Heads (out of 10 Flips)",
y = "Likelihood",
color = '') +
scale_x_continuous(breaks = 0:10) +
scale_y_continuous(limits = c(0, 0.5)) +
scale_color_brewer(palette = "Dark2") +
theme_hc()
model1 <- glm(in_sf ~ elevation, data = apts,
family = binomial(link = 'logit'))
model2 <- glm(in_sf ~ sqft, data = apts,
family = binomial(link = 'logit'))
paste("Model 1 Deviance", round(model1$deviance, 1))
## [1] "Model 1 Deviance 300.1"
paste("Model 2 Deviance", round(model2$deviance, 1))
## [1] "Model 2 Deviance 671.6"
paste("Model 1 BIC", round(BIC(model1), 2))
## [1] "Model 1 BIC 312.53"
paste("Model 2 BIC", round(BIC(model2), 2))
## [1] "Model 2 BIC 684.01"
# baseline model
model0 <- glm(price ~ 1, data = apts,
family = poisson(link = 'log'))
model1 <- glm(price ~ sqrt_sqft, data = apts,
family = poisson(link = 'log'))
model2 <- glm(price ~ sqrt_sqft + beds, data = apts,
family = poisson(link = 'log'))
cor(select(apts, year_built, sqft, elevation, beds, bath))
## year_built sqft elevation beds bath
## year_built 1.00000000 -0.0413914 -0.2089692 -0.07223557 0.07087105
## sqft -0.04139140 1.0000000 0.1746025 0.78850393 0.84993575
## elevation -0.20896922 0.1746025 1.0000000 0.33126732 0.11356167
## beds -0.07223557 0.7885039 0.3312673 1.00000000 0.83392423
## bath 0.07087105 0.8499357 0.1135617 0.83392423 1.00000000
model <- glm(price ~ year_built + sqft + elevation + beds + bath,
data = apts,
family = poisson(link = 'log'))
vif(model)
## year_built sqft elevation beds bath
## 1.117824 3.644101 1.244761 10.007863 11.913003
model <- glm(price ~ year_built + sqft + elevation + beds,
data = apts,
family = poisson(link = 'log'))
vif(model)
## year_built sqft elevation beds
## 1.098529 2.922694 1.069701 2.947531
lm_1 <- lm(log_price ~ sqrt_sqft, apts)
lm_2 <- lm(log_price ~ sqrt_sqft + year_built + sqft +
elevation + beds + bath, apts)
paste("R-Squared 1: ", round(summary(lm_1)$r.squared, 3))
## [1] "R-Squared 1: 0.587"
paste("R-Squared 2: ", round(summary(lm_2)$r.squared, 3))
## [1] "R-Squared 2: 0.742"
paste("Adj. R-Squared 1: ", round(summary(lm_1)$adj.r.squared, 3))
## [1] "Adj. R-Squared 1: 0.587"
paste("Adj. R-Squared 2: ", round(summary(lm_2)$adj.r.squared, 3))
## [1] "Adj. R-Squared 2: 0.739"
lm_b <- lm(log_price ~ ., apts)
step(lm_b, direction = "backward")
## Start: AIC=-2555.67
## log_price ~ in_sf + beds + bath + price + year_built + sqft +
## price_per_sqft + elevation + price_per_sqft_2 + sqrt_sqft
##
## Df Sum of Sq RSS AIC
## - elevation 1 0.000 2.610 -2557.7
## - beds 1 0.004 2.614 -2556.9
## <none> 2.610 -2555.7
## - bath 1 0.019 2.629 -2554.1
## - year_built 1 0.019 2.629 -2554.0
## - in_sf 1 0.044 2.654 -2549.4
## - price 1 0.270 2.880 -2509.2
## - sqft 1 3.411 6.021 -2146.4
## - price_per_sqft_2 1 8.098 10.708 -1863.1
## - sqrt_sqft 1 15.117 17.726 -1615.1
## - price_per_sqft 1 32.448 35.058 -1279.6
##
## Step: AIC=-2557.67
## log_price ~ in_sf + beds + bath + price + year_built + sqft +
## price_per_sqft + price_per_sqft_2 + sqrt_sqft
##
## Df Sum of Sq RSS AIC
## - beds 1 0.004 2.614 -2558.9
## <none> 2.610 -2557.7
## - bath 1 0.019 2.629 -2556.1
## - year_built 1 0.020 2.629 -2556.0
## - in_sf 1 0.056 2.665 -2549.3
## - price 1 0.274 2.884 -2510.5
## - sqft 1 3.443 6.053 -2145.8
## - price_per_sqft_2 1 8.291 10.900 -1856.4
## - sqrt_sqft 1 15.147 17.757 -1616.3
## - price_per_sqft 1 33.087 35.696 -1272.7
##
## Step: AIC=-2558.9
## log_price ~ in_sf + bath + price + year_built + sqft + price_per_sqft +
## price_per_sqft_2 + sqrt_sqft
##
## Df Sum of Sq RSS AIC
## <none> 2.614 -2558.9
## - bath 1 0.016 2.629 -2558.0
## - year_built 1 0.018 2.632 -2557.6
## - in_sf 1 0.052 2.666 -2551.2
## - price 1 0.281 2.895 -2510.6
## - sqft 1 3.520 6.133 -2141.3
## - price_per_sqft_2 1 8.661 11.275 -1841.8
## - sqrt_sqft 1 17.072 19.686 -1567.5
## - price_per_sqft 1 35.039 37.653 -1248.5
##
## Call:
## lm(formula = log_price ~ in_sf + bath + price + year_built +
## sqft + price_per_sqft + price_per_sqft_2 + sqrt_sqft, data = apts)
##
## Coefficients:
## (Intercept) in_sf bath price
## 1.043e+01 2.758e-02 1.034e-02 3.217e-08
## year_built sqft price_per_sqft price_per_sqft_2
## -1.685e-04 -5.341e-04 1.454e-03 -2.256e-07
## sqrt_sqft
## 9.129e-02
Strengths: MLE is a powerful method for estimating model parameters, especially in the context of Generalized Linear Models (GLMs). It finds the parameter values that maximize the likelihood of observing the given data under a specified probability distribution. Considerations: As used in GLMs, MLE assumes the observations are independent (given the predictors) and that the chosen distribution family is appropriate, assumptions which may not hold in real-world scenarios. MLE can also be sensitive to outliers, and its results depend on the choice of probability distribution and link function.
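As a minimal sketch of what "maximizing the likelihood" means here, the log-likelihood reported for the logistic model of in_sf on elevation can be rebuilt by hand from the Bernoulli probabilities at the fitted values (the model is refit under a new name, since model1 is reassigned later in the notebook):

# refit the elevation-only logistic model from earlier in the notebook
m_logit <- glm(in_sf ~ elevation, data = apts, family = binomial(link = 'logit'))

logLik(m_logit)                          # log-likelihood maximized by glm()

p_hat <- fitted(m_logit)                 # fitted P(in_sf = 1 | elevation)
sum(dbinom(apts$in_sf, size = 1, prob = p_hat, log = TRUE))   # same value, built by hand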
Strengths: Deviance, AIC, and BIC are useful tools for model comparison. They provide a quantitative basis for selecting among competing models, balancing goodness of fit against model complexity. Considerations: These metrics rest on the same likelihood assumptions as the models themselves, and they only rank the candidate models supplied; they cannot certify that any of them is adequate. AIC penalizes each additional parameter by a constant 2, while BIC's penalty grows with log(n), so with a large number of observations BIC favors noticeably smaller models than AIC and the two criteria can disagree.
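As a small sketch of how these penalties work, AIC and BIC can be recomputed by hand from the log-likelihood; the refit below mirrors Model 1 (in_sf ~ elevation), since the name model1 is reassigned later in the notebook:

m_logit <- glm(in_sf ~ elevation, data = apts, family = binomial(link = 'logit'))

ll <- as.numeric(logLik(m_logit))   # maximized log-likelihood
k  <- length(coef(m_logit))         # number of estimated parameters
n  <- nobs(m_logit)                 # observations used in the fit

-2 * ll + 2 * k          # AIC: flat penalty of 2 per parameter
-2 * ll + log(n) * k     # BIC: penalty grows with log(n), so larger samples favor smaller models

BIC(m_logit)             # should agree with the "Model 1 BIC 312.53" printed above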
Strengths: The discussion on adding explanatory variables and addressing issues like multicollinearity is crucial. Using the VIF (Variance Inflation Factor) to assess multicollinearity adds a practical step to the model-building process; here it flags the strong overlap among beds, bath, and sqft, and dropping bath brings every VIF below 3. Considerations: The adjusted R-squared is a valuable metric for assessing model performance while accounting for the number of variables. However, it only corrects for the count of predictors, not for the search over candidate models, so it can still be optimistic when many specifications are compared.
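For reference, the adjustment behind adjusted R-squared is just a penalty for the number of predictors; a minimal sketch that recomputes it for lm_2 from its ordinary R-squared (standard formula, no new data assumed):

r2 <- summary(lm_2)$r.squared
n  <- nobs(lm_2)                  # observations used in the fit
p  <- length(coef(lm_2)) - 1      # number of predictors, excluding the intercept

1 - (1 - r2) * (n - 1) / (n - p - 1)   # matches summary(lm_2)$adj.r.squared (~0.739)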
Strengths: The stepwise variable selection method is a practical way to reduce the number of variables, and automating the process makes it less dependent on subjective decisions. Considerations: Stepwise selection has well-known limitations: it searches greedily, so it can miss better-performing subsets, and fit statistics computed on the same data that drove the selection are optimistic. It also only chooses among the variables it is given; in the backward-selection run above, the final model for log_price still retains price, price_per_sqft, and price_per_sqft_2, which are transformations of the response itself, so the selected model needs careful scrutiny and validation rather than automatic acceptance.
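A minimal sketch of one way to validate a selected model on held-out data; the 80/20 split and the seed are arbitrary choices, not part of the original lab:

set.seed(590)                                     # arbitrary seed, for reproducibility
idx   <- sample(nrow(apts), size = floor(0.8 * nrow(apts)))
train <- apts[idx, ]
test  <- apts[-idx, ]

# refit the stepwise-selected formula and the simpler lm_2 formula on training data only
fit_step <- lm(log_price ~ in_sf + bath + price + year_built + sqft +
                 price_per_sqft + price_per_sqft_2 + sqrt_sqft, data = train)
fit_lm2  <- lm(log_price ~ sqrt_sqft + year_built + sqft + elevation + beds + bath,
               data = train)

# out-of-sample RMSE; note fit_step still contains price-derived predictors,
# so a low error here reflects leakage rather than genuine predictive skill
sqrt(mean((test$log_price - predict(fit_step, test))^2))
sqrt(mean((test$log_price - predict(fit_lm2,  test))^2))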
Strengths: The discussion of which variable to remove during stepwise selection is insightful; it acknowledges that this choice can shape the final model and any biases it carries. Considerations: Automated variable selection, while convenient, optimizes a fit criterion rather than capturing the true underlying relationships in the data, and it can drop variables that matter for interpretation or keep ones that should be excluded on substantive grounds, so domain knowledge should guide the final choice, as in the sketch below.
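If domain knowledge says a particular variable must stay in the model, step() accepts a scope argument that protects it during backward selection; a sketch where the choice to protect sqft is purely illustrative:

# keep sqft in every candidate model; trace = 0 suppresses the step-by-step log
step(lm_b, direction = "backward",
     scope = list(lower = ~ sqft),
     trace = 0)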
Strengths: The mention of potential model deployment, such as in a press release or at a large company, raises awareness of the real-world implications of the analysis. Considerations: It is crucial to communicate the limitations and assumptions of the model to stakeholders. Model results can have significant consequences if misinterpreted or if users are unaware of the model’s assumptions and potential biases.
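One concrete way to communicate model limitations to stakeholders is to report interval estimates rather than single numbers; a sketch using lm_1, where the 1,000 sqft apartment is a hypothetical example:

new_apt <- data.frame(sqrt_sqft = sqrt(1000))   # hypothetical 1,000 sqft unit

# 95% prediction interval on the log scale, then back-transformed to dollars
pred <- predict(lm_1, newdata = new_apt, interval = "prediction", level = 0.95)
exp(pred)   # note: exp() of the fitted value approximates the median price, not the mean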
Strengths: The inclusion of ethical considerations, such as fairness and avoiding discrimination, demonstrates a responsible approach to data analysis. This is especially important when the model might impact individuals or groups. Considerations: Continuous monitoring, informed consent, and transparency are highlighted as ethical principles. These considerations are essential to ensure the fair and responsible use of the model in different societal contexts.
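As one small example of checking whether model errors differ across groups (here using the in_sf indicator already in the data; any protected or policy-relevant grouping could be substituted):

apts |>
  mutate(resid = log_price - predict(lm_2, apts)) |>
  group_by(in_sf) |>
  summarise(mean_error = mean(resid),      # systematic over/under-prediction by city
            rmse = sqrt(mean(resid^2)),    # typical size of the error by city
            n = n())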
Strengths: The discussion on model interpretability acknowledges the challenges that arise as models become more complex. The consideration of the trade-off between model complexity and interpretability is a crucial aspect of model development. Considerations: While complexity might improve model fit, it can make it harder for users to understand and trust the model. Balancing complexity with interpretability is essential, especially in contexts where the model’s outputs inform critical decisions.
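For the interpretability trade-off, it helps to show how directly the simpler model's coefficients can be read; a sketch using lm_1 and the broom package loaded above (the percent-change reading is the usual approximation for a log-transformed outcome):

tidy(lm_1)   # coefficient table for the single-predictor model

# in a log(price) model, a one-unit increase in sqrt(sqft) multiplies price by
# exp(coefficient); expressed as an approximate percentage change in price:
(exp(coef(lm_1)["sqrt_sqft"]) - 1) * 100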
Strengths: The consideration of potential unintended consequences and the acknowledgment that not all crucial issues are measurable reflects a thoughtful approach to model critique. Considerations: It is important to recognize that models, especially in real-world applications, can have far-reaching consequences that may not be fully anticipated or measurable. Ongoing evaluation and adaptation are necessary.
Strengths: The emphasis on transparency and communication throughout the model development process is a key strength. Clear communication of assumptions, limitations, and potential biases is essential, especially in scenarios where the model might impact decision-making.
Strengths: The iterative nature of model development, with considerations for continuous monitoring and updates, is highlighted. This approach acknowledges that models are not static entities and need to evolve based on changing data and circumstances.
Strengths: The incorporation of ethical considerations, such as fairness and informed consent, aligns with best practices in data science. Recognizing the societal implications of model deployment and actively addressing ethical concerns is a responsible approach.