For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.
First, review the materials from the Lesson on Ethics and Epistemology (week 5?). This includes lecture slides, the lecture video, or the reading. You can use these as reference materials for this lab. You may even consider the reading for the week associated with the lab, or even supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).
For the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).
Share your model critique in this notebook as your data dive submission for the week.
As a start, think about the context of the lab and consider the following:
Analytical issues, such as model assumptions
Statistical improvements; what do we know now that we didn’t know then?
Are there better visualizations which could have been used?
Overcoming biases (existing or potential)
Possible risks or societal implications
Crucial issues which might not be measurable
Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).
For example, in Week 11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. Consider what questions you might ask about a model like that.
# libraries and dataset for model critique of Lab 8
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(car) # needed for Levene's test
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
theme_set(theme_minimal())
# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)
Might help to provide brief explanation of the different building types as context for understanding and interpreting the statistical analyses
Might help to show the actual calculated results for group variance, error variance, and total variance for this dataset (see the sketch after this list)
Might be helpful, for the Exercise, to suggest multiple ways to test the three assumptions for ANOVA (e.g., constant variance could be tested by comparing the standard deviations of the groups or by performing Levene's test, and normality within groups could be demonstrated by plotting and comparing histograms, etc.)
It appears that the building type groups do not have normal distributions, so it might be helpful to point this out (and explain why we will or will not proceed with ANOVA anyway)
Might be useful to consider other variables in the dataset besides Building Type that might be more relevant or interesting for understanding price difference, such as neighborhood, zoning, etc.
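As a rough sketch of the variance-decomposition note above, the group, error, and total sums of squares for sale price could be pulled from a one-way ANOVA fit. The choice of sale_price by bldg_type follows the lab, but the exact breakdown the original lab intended is an assumption.
# Sketch (assumption: sale_price modeled by bldg_type, as in the lab):
# decompose variation in sale price into between-group and within-group parts
aov_fit <- aov(sale_price ~ bldg_type, data = ames)
ss <- summary(aov_fit)[[1]][["Sum Sq"]]

tibble(
  source = c("group (bldg_type)", "error (within groups)", "total"),
  sum_sq = c(ss[1], ss[2], sum(ss))
)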
Ames Data - Assumption of Normality
One of the assumptions for performing ANOVA is that the distribution within each group must be roughly normal.
A histogram of the sale price data can be plotted as a visual check of whether the overall data is distributed normally or not.
# Normality Check: histogram of response variable (sale_price)
ames |>
ggplot(aes(x = sale_price)) +
geom_histogram(aes(y = after_stat(density)), color = 'white') +
geom_density(color = 'red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As seen in the plot above, the sale price data does not appear to follow a normal distribution; instead, it is right-skewed. This is not surprising for price data, where negative values are not possible and the distribution is bounded below by zero.
For data like this, a transformation could be performed to produce a (more) normal distribution. As mentioned previously, this might involve methods such as a log transform or bootstrap resampling.
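As a quick illustration of the transformation idea (a sketch, not part of the original lab), the histogram can be re-plotted on log-transformed sale prices:
# Sketch: histogram and density of log-transformed sale price; a log transform
# often makes right-skewed, zero-bounded price data closer to symmetric
ames |>
  ggplot(aes(x = log(sale_price))) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = 'white') +
  geom_density(color = 'red')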
Histograms can also be plotted for each group within the explanatory variable (building type) to see whether each group follows a normal distribution for sale price.
# Normality Check: histogram of groups within explanatory variable (bldg_type)
ames |> ggplot(aes(x = sale_price)) +
geom_histogram(aes(y = after_stat(density)), color = 'white') +
geom_density(color = 'red') +
facet_wrap(~bldg_type)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histograms for the groups likewise suggest that the within-group distributions are not normal. As above, transformations could be applied to bring the data closer to a normal distribution.
The Shapiro-Wilk test of normality can be utilized to determine whether data has a normal distribution or not. The null-hypothesis of this test is that the population is normally distributed.
Therefore, if the calculated p-value is less than the chosen alpha level (such as 0.05), the null hypothesis is rejected and there is evidence that the data are not normally distributed.
# Shapiro-Wilk test of normality
# null hypothesis is data is normal
# low p-value = reject null hypothesis, data not normal
shapiro.test(ames$sale_price)
##
## Shapiro-Wilk normality test
##
## data: ames$sale_price
## W = 0.87626, p-value < 2.2e-16
The p-value for the Shapiro-Wilk test is very low (highly significant), which means we would reject the null hypothesis that sale price is normally distributed and thus assume that the data is not normal.
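Because the ANOVA assumption concerns the distribution within each group, the same test could also be run per building type. This per-group check is a sketch rather than part of the original lab; note that shapiro.test requires group sizes between 3 and 5,000, which holds for these building type groups.
# Sketch: Shapiro-Wilk test applied within each building type group
ames |>
  group_by(bldg_type) |>
  summarise(
    n = n(),
    shapiro_p = shapiro.test(sale_price)$p.value
  )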
Ames Data - Assumption of Homoscedasticity (Constant Variance)
Another assumption for performing ANOVA is that the variance within each group is roughly equal across all groups.
Levene’s test can be used to assess homogeneity of variance across groups within a variable. The null hypothesis of this test is that the group variances are equal.
Therefore, if the calculated p-value is less than the chosen alpha level (such as 0.05), the null hypothesis is rejected and there is evidence that the groups do not have equal variance.
# Levene's test for homogeneity of variance across groups
# null hypothesis is variance across groups is equal
# low p-value = reject null hypothesis, group variances are not equal
leveneTest(sale_price ~ bldg_type, data = ames)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 15.511 1.46e-12 ***
## 2925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for Levene’s test is very low (highly significant), which means we would reject the null hypothesis that the groups (building types) have equal variance and thus conclude that the groups are not homoscedastic.
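As suggested in the notes above, a simpler complementary check is to compare the group standard deviations directly (a common rule of thumb being that the largest should not be much more than about twice the smallest). This is a sketch, not part of the original lab.
# Sketch: compare the spread of sale_price across building types directly,
# as a rough complement to Levene's test
ames |>
  group_by(bldg_type) |>
  summarise(
    n = n(),
    sd_price = sd(sale_price)
  ) |>
  arrange(desc(sd_price))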
Would be helpful to include a correlation matrix of the variables in the set to help detect multicollinearity (as sketched below)
Might help to present faceted scatter plots of each explanatory variable vs. price (first_flr_sf vs. price, lot_area vs. price) and then show a scatter plot of the two candidate explanatory variables against each other (first_flr_sf vs. lot_area) to see whether they are correlated with each other
Might help to provide an explanation of the options for addressing high-influence points (as identified by high Cook’s D values): Should they be removed from the dataset as outliers? Should they be analyzed separately, or explored further to try to identify why they are so different? Basically, what options exist and which are appropriate?
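To make the first note above concrete, a small correlation matrix of price and a few candidate numeric predictors could flag multicollinearity, and Cook’s distance could identify high-influence rows. Both pieces below are sketches: the particular predictors and the two-predictor model formula are assumptions for illustration, not the variables or model fit in the original lab.
# Sketch: correlation matrix among sale price and candidate numeric predictors
ames |>
  select(sale_price, first_flr_sf, lot_area, year_built) |>
  cor() |>
  round(2)

# Sketch: Cook's distance from a hypothetical two-predictor model (this formula
# is an assumption for illustration, not the model from the original lab)
fit <- lm(sale_price ~ first_flr_sf + lot_area, data = ames)
head(sort(cooks.distance(fit), decreasing = TRUE), 5)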