For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.
First, review the materials from the Lesson on Ethics and Epistemology (week 5?). This includes lecture slides, the lecture video, or the reading. You can use these as reference materials for this lab. You may even consider the reading for the week associated with the lab, or even supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).
For the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).
Share your model critique in this notebook as your data dive submission for the week.
As a start, think about the context of the lab and consider the following:
Analytical issues, such as model assumptions
Statistical improvements; what do we know now that we didn’t know then?
Are there better visualizations which could have been used?
Overcoming biases (existing or potential)
Possible risks or societal implications
Crucial issues which might not be measurable
Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).
For example, in Week 11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. Consider what questions you might ask about a model like that.
# libraries and dataset for model critique of Lab 8
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(car) # needed for Levene's test
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
theme_set(theme_minimal())
# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)
Might help to provide brief explanation of the different building types as context for understanding and interpreting the statistical analyses
Might help to show the actual calculated results for group variance, error variance, and total variance for this dataset (see the sketch after this list)
Might be helpful, for the Exercise, to suggest multiple ways to test the three assumptions for ANOVA (e.g., constant variance could be tested by comparing the standard deviations of the groups or by performing Levene's test, and normality within groups could be demonstrated by plotting and comparing histograms, etc.)
It appears that the building type groups do not have normal distributions, so it might be helpful to point this out (and explain why we will or will not proceed with ANOVA anyway)
Might be useful to consider other variables in the dataset besides Building Type that might be more relevant or interesting for understanding price difference, such as neighborhood, zoning, etc.
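As a rough sketch of the variance-decomposition note above, the group, error, and total sums of squares for sale price could be pulled from a one-way ANOVA fit. The choice of sale_price by bldg_type follows the lab, but the exact breakdown the original lab intended is an assumption.
# Sketch (assumption: sale_price modeled by bldg_type, as in the lab):
# decompose variation in sale price into between-group and within-group parts
aov_fit <- aov(sale_price ~ bldg_type, data = ames)
ss <- summary(aov_fit)[[1]][["Sum Sq"]]

tibble(
  source = c("group (bldg_type)", "error (within groups)", "total"),
  sum_sq = c(ss[1], ss[2], sum(ss))
)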
Ames Data - Assumption of Normality
One of the assumptions for performing ANOVA is that the distribution within each group must be roughly normal.
A histogram of the sale price data can be plotted as a visual check of whether the overall data is distributed normally or not.
# Normality Check: histogram of response variable (sale_price)
ames |>
ggplot(aes(x = sale_price)) +
geom_histogram(aes(y = after_stat(density)), color = 'white') +
geom_density(color = 'red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As seen in the plot above, the sale price data does not appear to follow a normal distribution; instead, it is right-skewed. This is not surprising for price data, where negative values are not possible and the distribution is bounded below by zero.
For data like this, a transformation could be performed to produce a (more) normal distribution. As mentioned previously, this might involve methods such as a log transform or bootstrap resampling.
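As a quick illustration of the transformation idea (a sketch, not part of the original lab), the histogram can be re-plotted on log-transformed sale prices:
# Sketch: histogram and density of log-transformed sale price; a log transform
# often makes right-skewed, zero-bounded price data closer to symmetric
ames |>
  ggplot(aes(x = log(sale_price))) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = 'white') +
  geom_density(color = 'red')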
Histograms can also be plotted for each group within the explanatory variable (building type) to see whether each group follows a normal distribution for sale price.
# Normality Check: histogram of groups within explanatory variable (bldg_type)
ames |> ggplot(aes(x = sale_price)) +
geom_histogram(aes(y = after_stat(density)), color = 'white') +
geom_density(color = 'red') +
facet_wrap(~bldg_type)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histograms for the groups likewise suggest that the within-group distributions are not normal. As above, transformations could be applied to bring the data closer to a normal distribution.
The Shapiro-Wilk test of normality can be utilized to determine whether data has a normal distribution or not. The null-hypothesis of this test is that the population is normally distributed.
Therefore, if the calculated p-value is less than the chosen alpha level (such as 0.05), the null hypothesis is rejected and there is evidence that the data are not normally distributed.
# Shapiro-Wilk test of normality
# null hypothesis is data is normal
# low p-value = reject null hypothesis, data not normal
shapiro.test(ames$sale_price)
##
## Shapiro-Wilk normality test
##
## data: ames$sale_price
## W = 0.87626, p-value < 2.2e-16
The p-value for the Shapiro-Wilk test is very low (highly significant), which means we would reject the null hypothesis that sale price is normally distributed and thus assume that the data is not normal.
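Because the ANOVA assumption concerns the distribution within each group, the same test could also be run per building type. This per-group check is a sketch rather than part of the original lab; note that shapiro.test requires group sizes between 3 and 5,000, which holds for these building type groups.
# Sketch: Shapiro-Wilk test applied within each building type group
ames |>
  group_by(bldg_type) |>
  summarise(
    n = n(),
    shapiro_p = shapiro.test(sale_price)$p.value
  )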
Ames Data - Assumption of Homoscedasticity (Constant Variance)
Another assumption for performing ANOVA is that the variance within each group is roughly equal across all groups.
Levene’s test can be used to assess homogeneity of variance across groups within a variable. The null hypothesis of this test is that the group variances are equal.
Therefore, if the calculated p-value is less than the chosen alpha level (such as 0.05), the null hypothesis is rejected and there is evidence that the groups do not have equal variance.
# Levene's test for homogeneity of variance across groups
# null hypothesis is variance across groups is equal
# low p-value = reject null hypothesis, group variances are not equal
leveneTest(sale_price ~ bldg_type, data = ames)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 15.511 1.46e-12 ***
## 2925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for Levene’s test is very low (highly significant), which means we would reject the null hypothesis that the groups (building types) have equal variance and thus conclude that the groups are not homoscedastic.
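As suggested in the notes above, a simpler complementary check is to compare the group standard deviations directly (a common rule of thumb being that the largest should not be much more than about twice the smallest). This is a sketch, not part of the original lab.
# Sketch: compare the spread of sale_price across building types directly,
# as a rough complement to Levene's test
ames |>
  group_by(bldg_type) |>
  summarise(
    n = n(),
    sd_price = sd(sale_price)
  ) |>
  arrange(desc(sd_price))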
Would be helpful to include a correlation matrix of the variables in the set to help detect multicollinearity (as sketched below)
Might help to present faceted scatter plots of each explanatory variable vs. price (first_flr_sf vs. price, lot_area vs. price) and then show a scatter plot of the two candidate explanatory variables against each other (first_flr_sf vs. lot_area) to see whether they are correlated with each other
Might help to provide an explanation of the options for addressing high-influence points (as identified by high Cook’s D values): Should they be removed from the dataset as outliers? Should they be analyzed separately, or explored further to try to identify why they are so different? Basically, what options exist and which are appropriate?
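To make the first note above concrete, a small correlation matrix of price and a few candidate numeric predictors could flag multicollinearity, and Cook’s distance could identify high-influence rows. Both pieces below are sketches: the particular predictors and the two-predictor model formula are assumptions for illustration, not the variables or model fit in the original lab.
# Sketch: correlation matrix among sale price and candidate numeric predictors
ames |>
  select(sale_price, first_flr_sf, lot_area, year_built) |>
  cor() |>
  round(2)

# Sketch: Cook's distance from a hypothetical two-predictor model (this formula
# is an assumption for illustration, not the model from the original lab)
fit <- lm(sale_price ~ first_flr_sf + lot_area, data = ames)
head(sort(cooks.distance(fit), decreasing = TRUE), 5)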