This notebook critiques the linear regression notebook from the week-8 module. I discuss the following issues and ways to mitigate their impact on the model:
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(survey)
library(conflicted)
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
suppressWarnings({
# Load required packages
library(dplyr)
library(lubridate)
library(AmesHousing)
library(survey)
})
ames <- make_ames()
ames <- ames |> rename_with(tolower)
head(ames)
## # A tibble: 6 × 81
## ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape
## <fct> <fct> <dbl> <int> <fct> <fct> <fct>
## 1 One_Story_1946_and_New… Resident… 141 31770 Pave No_A… Slightly…
## 2 One_Story_1946_and_New… Resident… 80 11622 Pave No_A… Regular
## 3 One_Story_1946_and_New… Resident… 81 14267 Pave No_A… Slightly…
## 4 One_Story_1946_and_New… Resident… 93 11160 Pave No_A… Regular
## 5 Two_Story_1946_and_New… Resident… 74 13830 Pave No_A… Slightly…
## 6 Two_Story_1946_and_New… Resident… 78 9978 Pave No_A… Slightly…
## # ℹ 74 more variables: land_contour <fct>, utilities <fct>, lot_config <fct>,
## # land_slope <fct>, neighborhood <fct>, condition_1 <fct>, condition_2 <fct>,
## # bldg_type <fct>, house_style <fct>, overall_qual <fct>, overall_cond <fct>,
## # year_built <int>, year_remod_add <int>, roof_style <fct>, roof_matl <fct>,
## # exterior_1st <fct>, exterior_2nd <fct>, mas_vnr_type <fct>,
## # mas_vnr_area <dbl>, exter_qual <fct>, exter_cond <fct>, foundation <fct>,
## # bsmt_qual <fct>, bsmt_cond <fct>, bsmt_exposure <fct>, …
# Independence assumption - inspect the group structure (counts per building type)
summary(ames$bldg_type)
## OneFam TwoFmCon Duplex Twnhs TwnhsE
## 2425 62 109 101 233
# Normality assumption - check the distribution of residuals
model <- aov(sale_price ~ bldg_type, data = ames)
# Store under a distinct name so the residuals() function is not shadowed
model_resid <- residuals(model)
hist(model_resid, main = "Histogram of Residuals", xlab = "Residuals")
# Homoscedasticity assumption - check for constant variance of residuals
plot(model, 1) # Residuals vs. Fitted plot
Both ANOVA and linear regression assume that the residuals (the
differences between observed and predicted values) are normally
distributed. This assumption matters most for smaller sample sizes. If
the residuals are not normally distributed, the confidence intervals and
hypothesis tests may not be valid.
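One common way to inspect this assumption is a normal Q-Q plot of the residuals. The sketch below refits the same one-way model so that it runs on its own; `fit_check` and `res_check` are illustrative names, not objects from the original notebook.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the data the same way as earlier in the notebook
ames <- make_ames() |> rename_with(tolower)

# Same one-way model of sale price on building type
fit_check <- aov(sale_price ~ bldg_type, data = ames)
res_check <- residuals(fit_check)

# Points that bend away from the reference line suggest non-normal residuals
qqnorm(res_check, main = "Normal Q-Q Plot of Residuals")
qqline(res_check, col = "red")
```

A heavy upper tail in this plot would be consistent with the right-skewed sale prices and would motivate a transformation of the response.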
Overcoming Analysis Challenges:
1) Applying transformations to the data, such as log transformations, might help address issues related to normality or heteroscedasticity.
Residual Analysis:
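As a minimal sketch of the transformation idea, the model can be refit with `log(sale_price)` as the response and its residuals compared with those of the raw-scale fit. The names `fit_raw` and `fit_log` are assumptions made for this illustration and do not appear in the original notebook.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the data the same way as earlier in the notebook
ames <- make_ames() |> rename_with(tolower)

# Fit the same model on the raw scale and on the log scale
fit_raw <- lm(sale_price ~ bldg_type, data = ames)
fit_log <- lm(log(sale_price) ~ bldg_type, data = ames)

# Side-by-side residual histograms for a visual comparison
op <- par(mfrow = c(1, 2))
hist(residuals(fit_raw), main = "Raw scale", xlab = "Residuals")
hist(residuals(fit_log), main = "Log scale", xlab = "Residuals")
par(op)

# Residuals vs. fitted on the log scale, to look for remaining fan shapes
plot(fit_log, which = 1)
```

On the log scale the coefficients describe multiplicative differences in price relative to the baseline building type, so they are no longer expressed in dollars.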
table(ames$bldg_type)
##
## OneFam TwoFmCon Duplex Twnhs TwnhsE
## 2425 62 109 101 233
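# Note: weights = ~sale_price weights each sale by its own price, so the
# estimates below are price-weighted rather than simple per-house averages
design <- svydesign(ids = ~1, data = ames, weights = ~sale_price)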
# Fit a model with the survey design
model_survey <- svyglm(sale_price ~ bldg_type, design = design)
summary(model_survey)
##
## Call:
## svyglm(formula = sale_price ~ bldg_type, design = design)
##
## Survey design:
## svydesign(ids = ~1, data = ames, weights = ~sale_price)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 221913 3038 73.038 <2e-16 ***
## bldg_typeTwoFmCon -88759 5516 -16.092 <2e-16 ***
## bldg_typeDuplex -71047 5732 -12.394 <2e-16 ***
## bldg_typeTwnhs -73167 5730 -12.770 <2e-16 ***
## bldg_typeTwnhsE -6916 6216 -1.113 0.266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 9679514764)
##
## Number of Fisher Scoring iterations: 2
The analysis focuses on specific types of houses (e.g., single-family, one-story homes built after 2000). This could introduce bias if these types of houses are not representative of the overall housing market.
Strategies to Overcome Biases: Ensure a more representative sample by employing random sampling techniques.
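A rough sketch of two random sampling approaches with `dplyr` follows. The sample size of 500, the 20% sampling fraction, and the seed are arbitrary choices made for illustration, and `srs` and `stratified` are hypothetical names.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the data the same way as earlier in the notebook
ames <- make_ames() |> rename_with(tolower)

set.seed(123)  # arbitrary seed, used only for reproducibility

# Simple random sample: every sale has the same chance of selection
srs <- ames |> slice_sample(n = 500)

# Stratified sample: draw 20% within each building type so that
# small groups such as TwoFmCon are not lost by chance
stratified <- ames |>
  group_by(bldg_type) |>
  slice_sample(prop = 0.2) |>
  ungroup()

# Check how many sales each building type contributes
count(stratified, bldg_type)
```

Stratifying by `bldg_type` keeps every building type in the sample in proportion to its share of the data, which a simple random sample does not guarantee for the rarer types.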
summary(ames$sale_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129500 160000 180796 213500 755000
Policy Implications: If the analysis informs policies or decisions, there is a risk of unintended consequences if the model is not comprehensive.
Transparent Reporting: Clearly communicate the limitations and assumptions of the analysis to avoid misinterpretation.
Robustness Checks: Perform robustness checks and sensitivity analyses to assess the stability of results under different conditions.
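One possible robustness check is a nonparametric bootstrap of the model coefficients using the `boot` package. This is a sketch under stated assumptions, not the notebook's own method: the helper `coef_fun`, the 500 replicates, and the seed are illustrative choices.

```r
library(AmesHousing)
library(dplyr)
library(boot)

# Keep only the two columns the model needs, to make resampling cheap
ames_small <- make_ames() |>
  rename_with(tolower) |>
  select(sale_price, bldg_type)

# Statistic for boot(): refit the model on the resampled rows
# and return its coefficients
coef_fun <- function(data, idx) {
  coef(lm(sale_price ~ bldg_type, data = data[idx, ]))
}

set.seed(123)  # arbitrary seed, used only for reproducibility
boot_out <- boot(ames_small, statistic = coef_fun, R = 500)

# Percentile interval for the second coefficient (bldg_typeTwoFmCon)
boot.ci(boot_out, type = "perc", index = 2)
```

If the bootstrap interval is much wider than the model-based one, or shifts noticeably when the seed or sample changes, the original estimate should be reported with more caution.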
Subjective Preferences: Individual preferences for certain features of a house (e.g., aesthetics, layout) are subjective and challenging to quantify.
Qualitative Data: Supplement quantitative analysis with qualitative insights or surveys to capture aspects that are not easily quantifiable.
Dynamic Market Factors: Acknowledge that housing markets are dynamic, and certain influential factors may change over time.