Critique of Lab 8

# libraries and dataset for model critique of Lab 8
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(car) 
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
theme_set(theme_minimal())

# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)

ANOVA

  • Might help to provide brief explanation of the different building types as context for understanding and interpreting the statistical analyses

  • Might help to show actual calculated results for group variance, error variance, and total variance for this dataset

  • For the Exercise, it might be helpful to suggest multiple ways to check the three assumptions of ANOVA (e.g., constant variance could be checked by comparing the standard deviations of the groups or by performing Levene’s test; normality within groups could be checked by plotting and comparing histograms, etc.)

    • Might help to demonstrate what to do if an assumption is not met. For example, if the distributions are not normal, show or explain how to make the data more normal (e.g., a log transform of the data) or how to use a method that does not assume normality (e.g., a bootstrap or other nonparametric test)

  • It appears that the building type groups do not have normal distributions, so it might be helpful to mention this and explain why we will or will not proceed with ANOVA

  • Might be useful to consider other variables in the dataset besides Building Type that might be more relevant or interesting for understanding price difference, such as neighborhood, zoning, etc.
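As a rough illustration of the second suggestion above, the sketch below computes the group, error, and total sums of squares for sale price by building type, along with the per-group standard deviations. The object names (group_stats, ss_group, ss_error, ss_total, f_stat) are made up for this example and are not taken from Lab 8.

```r
library(AmesHousing)
library(dplyr)

ames <- make_ames() |> rename_with(tolower)

# Grand mean of the response across all homes
grand_mean <- mean(ames$sale_price)

# Size, mean, and standard deviation of sale price for each building type
group_stats <- ames |>
  group_by(bldg_type) |>
  summarise(
    n = n(),
    group_mean = mean(sale_price),
    group_sd = sd(sale_price)
  )

# Group (between-group) sum of squares
ss_group <- sum(group_stats$n * (group_stats$group_mean - grand_mean)^2)

# Error (within-group) sum of squares
ss_error <- ames |>
  group_by(bldg_type) |>
  summarise(ss = sum((sale_price - mean(sale_price))^2)) |>
  pull(ss) |>
  sum()

# Total sum of squares
ss_total <- sum((ames$sale_price - grand_mean)^2)

# Mean squares and the resulting F statistic
k <- nrow(group_stats)
n <- nrow(ames)
ms_group <- ss_group / (k - 1)
ms_error <- ss_error / (n - k)
f_stat <- ms_group / ms_error

group_stats
c(ss_group = ss_group, ss_error = ss_error, ss_total = ss_total, f = f_stat)
```

The group and error sums of squares should add up to the total, and f_stat should match the F value reported by aov(sale_price ~ bldg_type, data = ames).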

Ames Data - Assumption of Normality

One of the assumptions for performing ANOVA is that the distribution within each group must be roughly normal.

A histogram of the sale price data can be plotted as a visual check of whether the overall data is distributed normally or not.

# Normality Check: histogram of response variable (sale_price)
ames |>
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), color = 'white') +
  geom_density(color = 'red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As seen in the plot above, the sale price data does not appear to follow a normal distribution. Instead, it is right-skewed. This is not surprising for price data, where negative values are not possible and the distribution is therefore bounded below by zero.

For data like this, a transformation such as a log transform could be applied to make the distribution more nearly normal. Alternatively, as mentioned previously, a resampling approach such as the bootstrap could be used, since it does not rely on the normality assumption.
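A minimal sketch of the log-transform idea, assuming the natural log is an acceptable choice here. The column log_sale_price and the helper skewness_simple() are hypothetical names introduced only for this illustration; the helper is a simple moment-based skewness, not a function from any package.

```r
library(AmesHousing)
library(dplyr)
library(ggplot2)

ames <- make_ames() |> rename_with(tolower)

# Add a log-transformed copy of the response (hypothetical column name)
ames_log <- ames |> mutate(log_sale_price = log(sale_price))

# Simple moment-based skewness; values near 0 suggest a symmetric distribution
skewness_simple <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness_simple(ames_log$sale_price)
skew_log <- skewness_simple(ames_log$log_sale_price)
c(raw = skew_raw, log = skew_log)

# Same histogram as before, now on the log scale
p_log <- ames_log |>
  ggplot(aes(x = log_sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = 'white') +
  geom_density(color = 'red')
p_log
```

If the transform helps, the skewness on the log scale should be closer to zero than on the raw scale, and the histogram should look more symmetric.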

Histograms can also be plotted for each group within the explanatory variable (building type) to see whether each group follows a normal distribution for sale price.

# Normality Check: histogram of groups within explanatory variable (bldg_type)
ames |>
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), color = 'white') +
  geom_density(color = 'red') +
  facet_wrap(~bldg_type)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms for the individual groups likewise suggest that the distributions within the groups are not normal. As with the overall data, a transformation could be applied to bring each group closer to a normal distribution.

The Shapiro-Wilk test can be used to assess whether data are consistent with a normal distribution. The null hypothesis of this test is that the population is normally distributed.

Therefore, if the calculated p-value is less than the chosen alpha level (commonly 0.05), the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.

# Shapiro-Wilk test of normality
# null hypothesis is data is normal
# low p-value = reject null hypothesis, data not normal

shapiro.test(ames$sale_price)
## 
##  Shapiro-Wilk normality test
## 
## data:  ames$sale_price
## W = 0.87626, p-value < 2.2e-16

The p-value for the Shapiro-Wilk test is very low (p < 2.2e-16), so we reject the null hypothesis that sale price is normally distributed and conclude that the data are not normal.
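The test above is run on the pooled sale prices, whereas the ANOVA assumption concerns the distribution within each group. One possible way to check each building type separately is sketched below; the object name shapiro_by_group is made up for this example. With samples this large, the test can flag even small departures from normality, so the result is best read alongside the histograms.

```r
library(AmesHousing)
library(dplyr)

ames <- make_ames() |> rename_with(tolower)

# Run the Shapiro-Wilk test separately within each building type
shapiro_by_group <- ames |>
  group_by(bldg_type) |>
  summarise(
    n = n(),
    w = unname(shapiro.test(sale_price)$statistic),
    p_value = shapiro.test(sale_price)$p.value
  )

shapiro_by_group
```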

Ames Data - Assumption of Homoscedasticity (Constant Variance)

Another assumption for performing ANOVA is that the variance of data within groups remains consistent for every group.

Levene’s test can be used to assess homogeneity of variance across the groups of a variable. The null hypothesis of this test is that the group variances are equal.

Therefore, if the calculated p-value is less than the chosen alpha level (commonly 0.05), the null hypothesis is rejected and there is evidence that the groups do not have equal variance.

# Levene's test for homogeneity of variance across groups
# null hypothesis is variance across groups is equal
# low p-value = reject null hypothesis, group variances are not equal

leveneTest(sale_price ~ bldg_type, data = ames)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)    
## group    4  15.511 1.46e-12 ***
##       2925                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for Levene’s test is very low (1.46e-12), so we reject the null hypothesis that the groups (building types) have equal variance and conclude that the groups are not homoscedastic.
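When the equal-variance or normality assumption looks doubtful, two alternatives available in base R are Welch's one-way test (which does not assume equal variances) and the Kruskal-Wallis rank-sum test (which does not assume normality). These are not part of Lab 8; the sketch below only shows how they might be run on the same variables, with welch and kw as made-up object names.

```r
library(AmesHousing)
library(dplyr)

ames <- make_ames() |> rename_with(tolower)

# Welch's one-way test: compares group means without assuming equal variances
welch <- oneway.test(sale_price ~ bldg_type, data = ames, var.equal = FALSE)
welch

# Kruskal-Wallis rank-sum test: a rank-based alternative to one-way ANOVA
kw <- kruskal.test(sale_price ~ bldg_type, data = ames)
kw
```

Note that the Kruskal-Wallis test compares the groups' distributions through ranks rather than their means directly, so its conclusion answers a slightly different question than ANOVA does.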

Multicollinearity

  • Would be helpful to include correlation matrix between the variables in the set to help detect multicollinearity

  • Might help to present faceted scatter plots of each explanatory variable vs. price (first_flr_sf vs. price, lot_area vs. price) and then show a scatter plot of the two candidate explanatory variables against each other (first_flr_sf vs. lot_area) to see whether they are correlated with each other
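A minimal sketch of both suggestions, assuming sale_price, first_flr_sf, and lot_area are the variables of interest. The object names cor_matrix and p_scatter are made up for this example.

```r
library(AmesHousing)
library(dplyr)
library(ggplot2)

ames <- make_ames() |> rename_with(tolower)

# Pairwise Pearson correlations between the response and the two predictors
cor_matrix <- ames |>
  select(sale_price, first_flr_sf, lot_area) |>
  cor()

round(cor_matrix, 2)

# Scatter plot of the two candidate explanatory variables against each other
p_scatter <- ames |>
  ggplot(aes(x = lot_area, y = first_flr_sf)) +
  geom_point(alpha = 0.3)
p_scatter
```

A large off-diagonal correlation between the two predictors would be a sign of potential multicollinearity.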

Linear Regression

  • It would help to diagnose the linear regression model (e.g., creating diagnostic plots such as residuals vs. fitted values, etc.) to verify the validity of the model and provide any necessary caveats about application of the model (e.g., model may be less valid for price points above certain values, etc.)
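As one possible illustration, the sketch below fits a model and plots residuals against fitted values. The model formula (sale_price ~ first_flr_sf + lot_area) is an assumption chosen for the example and may differ from the model used in Lab 8; the names model, diagnostics, and p_resid are also made up.

```r
library(AmesHousing)
library(dplyr)
library(ggplot2)

ames <- make_ames() |> rename_with(tolower)

# Hypothetical model for illustration; Lab 8's actual model may differ
model <- lm(sale_price ~ first_flr_sf + lot_area, data = ames)

# Collect fitted values and residuals for plotting
diagnostics <- data.frame(
  fitted = fitted(model),
  residual = resid(model)
)

# Residuals vs. fitted values: look for curvature or a funnel shape
p_resid <- diagnostics |>
  ggplot(aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, color = 'red')
p_resid
```

A funnel shape that widens at higher fitted values would suggest the model is less reliable for more expensive homes, which is the kind of caveat worth stating.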

Error

  • Would be better to display plots separately (or side-by-side) instead of having to switch between them