library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
ames <- make_ames()
ames <- ames |> rename_with(tolower)
# we're interested in the house price across different Building Types.
m <- aov(sale_price ~ bldg_type, data = ames)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## bldg_type 4 6.454e+11 1.614e+11 26.15 <2e-16 ***
## Residuals 2925 1.805e+13 6.170e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: the average house price is equal across all Building Types, is a reasonable for an analysis, but it’s essential to consider potential factors that might influence house prices and introduce biases into the groups like selection bias, geographical or temporal biases.
Assumptions of independence, normality, and homoscedasticity are taken here. If these are not met, it can impact the validity and reliability of the statistical tests, such as ANOVA (Analysis of Variance).
we can do tests to check normality which could have been included in the lab:
Normality of Residuals:
# Check normality of residuals
qqnorm(residuals(m))
qqline(residuals(m))
The points in the Q-Q plot roughly follow a straight line, the assumption of normality may be met.
Homogeneity of Variances (Constant Variance):
# Check homogeneity of variances
plot(fitted(m), residuals(m))
abline(h = 0, col = "red")
spread of residuals is roughly constant across all predicted values, the assumption of homogeneity of variances may be met.
Levene’s test or Bartlett’s test can be used to formally test the homogeneity of variances assumption.
# Levene's test for homogeneity of variances
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
leveneTest(residuals(m), ames$bldg_type)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 15.511 1.46e-12 ***
## 2925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Since the p-value is less we reject the null hypothesis. This implies that there is evidence to suggest that the variances of the groups are not equal.
# Welch's ANOVA
oneway.test(sale_price ~ bldg_type, data = ames)
##
## One-way analysis of means (not assuming equal variances)
##
## data: sale_price and bldg_type
## F = 89.41, num df = 4.00, denom df = 249.16, p-value < 2.2e-16
The extremely small p-value gives strong evidence against H0. Therefore, we reject H0 and conclude that there are significant differences in the means of sale prices across different building types.
Post-hoc tests like Tukey’s HSD can be done to check which specific groups differ from each other. These tests provide pairwise comparisons between different levels of the categorical variable.
posthoc_tukey <- TukeyHSD(m)
print(posthoc_tukey)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sale_price ~ bldg_type, data = ames)
##
## $bldg_type
## diff lwr upr p adj
## TwoFmCon-OneFam -59230.332 -86804.779 -31655.88 0.0000000
## Duplex-OneFam -45003.105 -65995.154 -24011.06 0.0000001
## Twnhs-OneFam -48877.982 -70651.106 -27104.86 0.0000000
## TwnhsE-OneFam 7499.873 -7205.102 22204.85 0.6328136
## Duplex-TwoFmCon 14227.226 -19877.110 48331.56 0.7859640
## Twnhs-TwoFmCon 10352.350 -24238.235 44942.93 0.9255215
## TwnhsE-TwoFmCon 66730.204 36092.386 97368.02 0.0000000
## Twnhs-Duplex -3874.876 -33486.100 25736.35 0.9965227
## TwnhsE-Duplex 52502.978 27623.431 77382.53 0.0000001
## TwnhsE-Twnhs 56377.855 30835.836 81919.87 0.0000000
It appears that there are significant differences in mean sale prices between several pairs of building types.
pairwise.t.test(ames$sale_price, ames$bldg_type, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: ames$sale_price and ames$bldg_type
##
## OneFam TwoFmCon Duplex Twnhs
## TwoFmCon 5.1e-08 - - -
## Duplex 5.4e-08 1 - -
## Twnhs 1.0e-08 1 1 -
## TwnhsE 1 3.1e-08 9.3e-08 1.9e-08
##
## P value adjustment method: bonferroni