notes8_critique data dive week 13

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

ames <- make_ames()
ames <- ames |> rename_with(tolower)

# we're interested in the house price across different Building Types.
m <- aov(sale_price ~ bldg_type, data = ames)
summary(m)

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## bldg_type      4 6.454e+11 1.614e+11   26.15 <2e-16 ***
## Residuals   2925 1.805e+13 6.170e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: the average house price is equal across all Building Types, is a reasonable for an analysis, but it’s essential to consider potential factors that might influence house prices and introduce biases into the groups like selection bias, geographical or temporal biases.
Assumptions of independence, normality, and homoscedasticity are taken here. If these are not met, it can impact the validity and reliability of the statistical tests, such as ANOVA (Analysis of Variance).

we can do tests to check normality which could have been included in the lab:

Normality of Residuals:

Method: You can check the normality of residuals using graphical methods, such as a normal probability plot (Q-Q plot) or a histogram of residuals.

# Check normality of residuals
qqnorm(residuals(m))
qqline(residuals(m))

The points in the Q-Q plot roughly follow a straight line, the assumption of normality may be met.

Homogeneity of Variances (Constant Variance):

Method: Plot the residuals against the predicted values and look for patterns or changes in spread.

# Check homogeneity of variances
plot(fitted(m), residuals(m))
abline(h = 0, col = "red")

spread of residuals is roughly constant across all predicted values, the assumption of homogeneity of variances may be met.

Levene’s test or Bartlett’s test can be used to formally test the homogeneity of variances assumption.

# Levene's test for homogeneity of variances
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:boot':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

leveneTest(residuals(m), ames$bldg_type)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)    
## group    4  15.511 1.46e-12 ***
##       2925                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is less we reject the null hypothesis. This implies that there is evidence to suggest that the variances of the groups are not equal.

Given the violation of the homogeneity of variance assumption, we should consider alternative methods that are robust to unequal variances like Welch’s ANOVA, which does not assume equal variances.

# Welch's ANOVA
oneway.test(sale_price ~ bldg_type, data = ames)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  sale_price and bldg_type
## F = 89.41, num df = 4.00, denom df = 249.16, p-value < 2.2e-16

The extremely small p-value gives strong evidence against H0. Therefore, we reject H0 and conclude that there are significant differences in the means of sale prices across different building types.

Anova doesn’t explicitly quantify the effect size or provide estimates for each level of bldg_type. Extracting the effect of bldg_type alone on sale price can be complex. Groups within a categorical variable may have different levels of variability. If the variance within certain groups is significantly larger than in others, it might impact the reliability of conclusions drawn from statistical tests.

Post-hoc tests like Tukey’s HSD can be done to check which specific groups differ from each other. These tests provide pairwise comparisons between different levels of the categorical variable.

posthoc_tukey <- TukeyHSD(m)
print(posthoc_tukey)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sale_price ~ bldg_type, data = ames)
## 
## $bldg_type
##                       diff        lwr       upr     p adj
## TwoFmCon-OneFam -59230.332 -86804.779 -31655.88 0.0000000
## Duplex-OneFam   -45003.105 -65995.154 -24011.06 0.0000001
## Twnhs-OneFam    -48877.982 -70651.106 -27104.86 0.0000000
## TwnhsE-OneFam     7499.873  -7205.102  22204.85 0.6328136
## Duplex-TwoFmCon  14227.226 -19877.110  48331.56 0.7859640
## Twnhs-TwoFmCon   10352.350 -24238.235  44942.93 0.9255215
## TwnhsE-TwoFmCon  66730.204  36092.386  97368.02 0.0000000
## Twnhs-Duplex     -3874.876 -33486.100  25736.35 0.9965227
## TwnhsE-Duplex    52502.978  27623.431  77382.53 0.0000001
## TwnhsE-Twnhs     56377.855  30835.836  81919.87 0.0000000

It appears that there are significant differences in mean sale prices between several pairs of building types.

pairwise.t.test(ames$sale_price, ames$bldg_type, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  ames$sale_price and ames$bldg_type 
## 
##          OneFam  TwoFmCon Duplex  Twnhs  
## TwoFmCon 5.1e-08 -        -       -      
## Duplex   5.4e-08 1        -       -      
## Twnhs    1.0e-08 1        1       -      
## TwnhsE   1       3.1e-08  9.3e-08 1.9e-08
## 
## P value adjustment method: bonferroni

The Bonferroni correction is used in the lab is a conservative method that controls the familywise error rate and it reduces the risk of Type I errors. But it can increase the risk of Type II errors (failing to reject a false null hypothesis) when the number of comparisons is large. If a more stringent control over the familywise error rate is necessary, Bonferroni might be preferred. The Benjamini-Hochberg procedure is a method for controlling the false discovery rate (FDR) when conducting multiple hypothesis tests. It’s less conservative than the Bonferroni correction, making it more powerful in certain situations. If the goal is to control the proportion of false discoveries and a less conservative approach is acceptable, the Benjamini-Hochberg procedure may be more suitable.

notes8_critique data dive week 13

parimala

2023-11-14