week-13

# libraries and dataset for model critique of Lab 8
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(car) 
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
theme_set(theme_minimal())

# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)

ANOVA

Ames Data - Assumption of Normality

One of the assumptions for performing ANOVA is that the distribution within each group must be roughly normal.

A histogram of the sale price data can be plotted as a visual check of whether the overall data is distributed normally or not.

# Normality Check: histogram of response variable (sale_price)
ames |>
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), color = 'white') +
  geom_density(color = 'red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As the plot above shows, the sale price data do not appear to follow a normal distribution; the distribution is right-skewed. This is not surprising for price data: negative values are impossible, so the distribution is bounded below by zero.

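For data like this, a transformation could be applied to make the distribution more nearly normal. As mentioned previously, a log transform is one such method. Bootstrap random sampling is an alternative that resamples the data rather than transforming it, and so relies less on the normality assumption.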
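The log transform mentioned above could be tried along the following lines. This is only a sketch: the column name `log_sale_price` and the choice of 30 bins are assumptions made here, not part of the original lab.

```r
library(tidyverse)
library(AmesHousing)

# same preparation as at the top of this document
ames <- make_ames() |> rename_with(tolower)

# add a log-transformed copy of the response
# (log_sale_price is a name chosen for this sketch)
ames_log <- ames |> mutate(log_sale_price = log(sale_price))

# same histogram + density overlay as before, on the log scale
ames_log |>
  ggplot(aes(x = log_sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = 'white') +
  geom_density(color = 'red')
```

If the skew is mostly removed, the histogram on the log scale should look noticeably more symmetric than the one for the raw prices.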

Histograms can also be plotted for each group within the explanatory variable (building type) to see whether each group follows a normal distribution for sale price.

# Normality Check: histogram of groups within explanatory variable (bldg_type)
ames |>
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), color = 'white') +
  geom_density(color = 'red') +
  facet_wrap(~bldg_type)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Again, the histograms suggest that the distributions within the groups are not normal. As with the overall data, a transformation could be applied to bring each group closer to a normal distribution.

The Shapiro-Wilk test of normality can be used to assess whether data are consistent with a normal distribution. The null hypothesis of this test is that the population is normally distributed.

Therefore, if the calculated p-value is less than the chosen alpha level (for example, 0.05), the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.

# Shapiro-Wilk test of normality
# null hypothesis: the data are normally distributed
# low p-value = reject null hypothesis, data are not normal

shapiro.test(ames$sale_price)
## 
##  Shapiro-Wilk normality test
## 
## data:  ames$sale_price
## W = 0.87626, p-value < 2.2e-16

The p-value for the Shapiro-Wilk test is very low (highly significant), which means we reject the null hypothesis that sale price is normally distributed and conclude that the data are not normal. Note that this test was run on the pooled sale prices rather than within each building type.
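Since the ANOVA assumption concerns the distribution within each group, the same test could also be run separately for each building type. The following is a minimal sketch of that idea; the object name `shapiro_by_group` and its column names are assumptions made here.

```r
library(tidyverse)
library(AmesHousing)

# same preparation as at the top of this document
ames <- make_ames() |> rename_with(tolower)

# run the Shapiro-Wilk test once per building type and
# collect the W statistic and p-value in a small table
shapiro_by_group <- ames |>
  group_by(bldg_type) |>
  summarise(
    n       = n(),
    w       = unname(shapiro.test(sale_price)$statistic),
    p_value = shapiro.test(sale_price)$p.value
  )

shapiro_by_group
```

A low p-value in a given row would be evidence against normality for that building type.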

Ames Data - Assumption of Homoscedasticity (Constant Variance)

Another assumption for performing ANOVA is that the variance of the response is roughly the same in every group.

Levene’s test can be used to assess homogeneity of variance across the groups of a variable. The null hypothesis of this test is that the group variances are equal.

Therefore, if the calculated p-value is less than the chosen alpha level (for example, 0.05), the null hypothesis is rejected and there is evidence that the groups do not have equal variance.

# Levene's test for homogeneity of variance across groups
# null hypothesis is variance across groups is equal
# low p-value = reject null hypothesis, group variances are not equal

leveneTest(sale_price ~ bldg_type, data = ames)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)    
## group    4  15.511 1.46e-12 ***
##       2925                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for Levene’s test is very low (highly significant), which means we reject the null hypothesis that the groups (building types) have equal variance and conclude that the groups are not homoscedastic.
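When equal variance is doubtful, one commonly used option is Welch's one-way test, which does not assume equal group variances. It was not part of the original lab; the sketch below only shows how it could be run with the base R function `oneway.test()`, and the object name `welch` is an assumption made here.

```r
library(tidyverse)
library(AmesHousing)

# same preparation as at the top of this document
ames <- make_ames() |> rename_with(tolower)

# Welch's one-way test: compares group means without
# assuming that the group variances are equal
welch <- oneway.test(sale_price ~ bldg_type, data = ames, var.equal = FALSE)

welch
```

This addresses only the variance assumption; the concerns about normality raised above would still apply.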

Multicollinearity

  • It would be helpful to include a correlation matrix of the variables in the set to help detect multicollinearity

  • It might help to present faceted scatter plots of each explanatory variable vs. price (first_flr_sf vs. price, lot_area vs. price) and then show the scatter plot of the two possible explanatory variables against each other (first_flr_sf vs. lot_area) to see that they are correlated with each other
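The two suggestions above could be sketched as follows. The choice of columns follows the variables named in the bullets; the object name `cor_mat` is an assumption made here.

```r
library(tidyverse)
library(AmesHousing)

# same preparation as at the top of this document
ames <- make_ames() |> rename_with(tolower)

# correlation matrix of the response and the two candidate predictors
cor_mat <- ames |>
  select(sale_price, first_flr_sf, lot_area) |>
  cor()

round(cor_mat, 2)

# scatter plot of the two explanatory variables against each other
ames |>
  ggplot(aes(x = lot_area, y = first_flr_sf)) +
  geom_point(alpha = 0.3)
```

A large off-diagonal entry between `first_flr_sf` and `lot_area` would point to possible multicollinearity.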

Linear Regression

  • It would help to diagnose the linear regression model (e.g., with diagnostic plots such as residuals vs. fitted values) to verify the validity of the model and provide any necessary caveats about its application (e.g., the model may be less valid for price points above certain values)
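The diagnostic plots suggested above could be produced with base R's `plot()` method for `lm` objects. The model formula below is an assumption made for illustration, built from the variables named earlier in this critique; it is not necessarily the model used in Lab 8.

```r
library(tidyverse)
library(AmesHousing)

# same preparation as at the top of this document
ames <- make_ames() |> rename_with(tolower)

# hypothetical model for illustration only
fit <- lm(sale_price ~ first_flr_sf + lot_area, data = ames)

# show the four standard diagnostic plots in a 2 x 2 grid
# (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```

A funnel shape in the residuals vs. fitted plot would suggest that the model fits less well at higher price points.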

Error

  • It would be better to display the plots separately (or side by side) so the reader does not have to switch between them