Model Critique:

This notebook critiques the linear regression notebook from the week-8 module. It discusses the following issues and ways to mitigate their impact on the model:

  1. Analytical issues, such as model assumptions
  2. Overcoming biases (existing or potential)
  3. Possible risks or societal implications
  4. Crucial issues which might not be measurable

Importing libraries and managing conflicts.

library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.3.2
library(boot)
library(broom)
library(lindia)
library(survey)
## Warning: package 'survey' was built under R version 4.3.2
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:boot':
## 
##     aml
## 
## 
## Attaching package: 'survey'
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(conflicted)
## Warning: package 'conflicted' was built under R version 4.3.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
suppressWarnings({
  # Load required packages
  library(dplyr)
  library(lubridate)
  library(AmesHousing)
  library(survey)
})

Loading the Ames Housing Dataset

ames <- make_ames()
ames <- ames |> rename_with(tolower)

head(ames)
## # A tibble: 6 × 81
##   ms_subclass             ms_zoning lot_frontage lot_area street alley lot_shape
##   <fct>                   <fct>            <dbl>    <int> <fct>  <fct> <fct>    
## 1 One_Story_1946_and_New… Resident…          141    31770 Pave   No_A… Slightly…
## 2 One_Story_1946_and_New… Resident…           80    11622 Pave   No_A… Regular  
## 3 One_Story_1946_and_New… Resident…           81    14267 Pave   No_A… Slightly…
## 4 One_Story_1946_and_New… Resident…           93    11160 Pave   No_A… Regular  
## 5 Two_Story_1946_and_New… Resident…           74    13830 Pave   No_A… Slightly…
## 6 Two_Story_1946_and_New… Resident…           78     9978 Pave   No_A… Slightly…
## # ℹ 74 more variables: land_contour <fct>, utilities <fct>, lot_config <fct>,
## #   land_slope <fct>, neighborhood <fct>, condition_1 <fct>, condition_2 <fct>,
## #   bldg_type <fct>, house_style <fct>, overall_qual <fct>, overall_cond <fct>,
## #   year_built <int>, year_remod_add <int>, roof_style <fct>, roof_matl <fct>,
## #   exterior_1st <fct>, exterior_2nd <fct>, mas_vnr_type <fct>,
## #   mas_vnr_area <dbl>, exter_qual <fct>, exter_cond <fct>, foundation <fct>,
## #   bsmt_qual <fct>, bsmt_cond <fct>, bsmt_exposure <fct>, …

1. Analytical Issues:

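# Independence assumption - inspect the group sizes per building type
# (counts alone cannot confirm independence; that depends on how the data were collected)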
summary(ames$bldg_type)
##   OneFam TwoFmCon   Duplex    Twnhs   TwnhsE 
##     2425       62      109      101      233
# Normality assumption - check the distribution of residuals
model <- aov(sale_price ~ bldg_type, data = ames)
residuals <- residuals(model)
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")

# Homoscedasticity assumption - check for constant variance of residuals
plot(model, 1)  # Residuals vs. Fitted plot

Both ANOVA and linear regression assume that the residuals (the differences between observed and predicted values) are normally distributed. This assumption matters most for small samples. If the residuals are not normally distributed, confidence intervals and hypothesis tests may not be valid.
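As a rough sketch of how these assumptions could be checked more formally, the snippet below draws a normal Q-Q plot of the residuals and runs two tests from base R's stats package: shapiro.test() for normality and fligner.test() for equal variances across building types. The choice of these two tests is an assumption made for illustration; the notebook itself relies only on the histogram and the residuals vs. fitted plot.

```r
library(AmesHousing)

# Rebuild the data so the sketch is self-contained
ames <- make_ames()
names(ames) <- tolower(names(ames))

# Same one-way model as above
model <- aov(sale_price ~ bldg_type, data = ames)
res <- residuals(model)

# Normal Q-Q plot: points far from the line indicate non-normal residuals
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res)

# Shapiro-Wilk test of normality (accepts between 3 and 5000 observations)
sw <- shapiro.test(res)

# Fligner-Killeen test of equal variances across building types
ft <- fligner.test(sale_price ~ bldg_type, data = ames)

print(sw)
print(ft)
```

With almost 3,000 observations, even small departures from normality tend to produce small p-values, so the plots are usually more informative than the tests.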

Overcoming Analysis Challenges:

  1. Transformations: Applying transformations to the data, such as a log transformation, might help address issues related to normality or heteroscedasticity.

  2. Residual Analysis: Plotting residuals against predicted values can help identify patterns or deviations from the assumptions.
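A log transformation and the accompanying residual analysis might look like the sketch below. It fits the same building-type model on the raw and on the log scale and compares the skewness of the residuals. The skewness() helper is hypothetical (written here for illustration, not taken from a package), and using a log of sale_price is an assumption about which transformation to try.

```r
library(AmesHousing)

# Rebuild the data so the sketch is self-contained
ames <- make_ames()
names(ames) <- tolower(names(ames))

# Same predictor, response on the raw and on the log scale
model_raw <- lm(sale_price ~ bldg_type, data = ames)
model_log <- lm(log(sale_price) ~ bldg_type, data = ames)

# Hypothetical helper: sample skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(residuals(model_raw))
skew_log <- skewness(residuals(model_log))
print(c(raw = skew_raw, log = skew_log))

# Residuals vs. fitted values for the log-scale model
plot(fitted(model_log), residuals(model_log),
     xlab = "Fitted values (log scale)", ylab = "Residuals",
     main = "Residuals vs. Fitted, log(sale_price)")
abline(h = 0, lty = 2)
```

Coefficients of the log-scale model describe approximate proportional differences in price between building types rather than dollar differences, so the two models are not interpreted the same way.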

2. Overcoming Biases:

table(ames$bldg_type)
## 
##   OneFam TwoFmCon   Duplex    Twnhs   TwnhsE 
##     2425       62      109      101      233
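# Survey design without clusters (ids = ~1).
# Caution: weights = ~sale_price uses the response itself as the sampling weight,
# so expensive homes count more; genuine sampling weights should be used if available.
design <- svydesign(ids = ~1, data = ames, weights = ~sale_price)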

# Fit a model with the survey design
model_survey <- svyglm(sale_price ~ bldg_type, design = design)
summary(model_survey)
## 
## Call:
## svyglm(formula = sale_price ~ bldg_type, design = design)
## 
## Survey design:
## svydesign(ids = ~1, data = ames, weights = ~sale_price)
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         221913       3038  73.038   <2e-16 ***
## bldg_typeTwoFmCon   -88759       5516 -16.092   <2e-16 ***
## bldg_typeDuplex     -71047       5732 -12.394   <2e-16 ***
## bldg_typeTwnhs      -73167       5730 -12.770   <2e-16 ***
## bldg_typeTwnhsE      -6916       6216  -1.113    0.266    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 9679514764)
## 
## Number of Fisher Scoring iterations: 2
  1. Potential Biases in the Analysis: The analysis focuses on specific types of houses (e.g., single-family, one-story homes built after 2000). This could introduce bias if these houses are not representative of the overall housing market. In the table above, 2,425 of the 2,930 homes are single-family (OneFam), so the other building types contribute comparatively few observations.

  2. Strategies to Overcome Biases: Use random sampling techniques to obtain a more representative sample.
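One way to apply random sampling is sketched below, assuming dplyr's slice_sample() is used. It draws a simple random sample and a proportional stratified sample that keeps roughly the same mix of building types as the full data. The sample size of 500, the 20% fraction, and the seed are arbitrary values chosen for illustration.

```r
library(AmesHousing)
library(dplyr)

ames <- make_ames() |> rename_with(tolower)

set.seed(123)  # arbitrary seed, for reproducibility only

# Simple random sample of 500 homes
srs <- ames |> slice_sample(n = 500)

# Proportional stratified sample: 20% of the homes within each building type
strat <- ames |>
  group_by(bldg_type) |>
  slice_sample(prop = 0.2) |>
  ungroup()

# Compare the building-type mix of each sample with the full data
print(count(ames, bldg_type))
print(count(srs, bldg_type))
print(count(strat, bldg_type))
```

Sampling within the Ames data only controls which recorded sales are analyzed; it cannot correct for homes that were never included in the dataset.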

3. Possible Risks or Societal Implications:

summary(ames$sale_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000
  1. Policy Implications: If the analysis informs policies or decisions, there is a risk of unintended consequences if the model is not comprehensive.

  2. Transparent Reporting: Clearly communicate the limitations and assumptions of the analysis to avoid misinterpretation.

  3. Robustness Checks: Perform robustness checks and sensitivity analyses to assess the stability of results under different conditions.
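A robustness check of this kind could be sketched with a bootstrap from the boot package, which is already loaded in this notebook. The snippet refits the building-type model on resampled data and reports percentile intervals for each coefficient. The coef_fn() helper, the 200 replicates, the seed, and the choice to resample within building types are all assumptions made for illustration.

```r
library(AmesHousing)
library(boot)

# Rebuild the data so the sketch is self-contained
ames <- make_ames()
names(ames) <- tolower(names(ames))

# Hypothetical helper: refit the model on the resampled rows, return coefficients
coef_fn <- function(data, idx) {
  coef(lm(sale_price ~ bldg_type, data = data[idx, ]))
}

set.seed(42)  # arbitrary seed, for reproducibility only

# Resample within each building type so every level appears in every replicate
boot_out <- boot(ames, statistic = coef_fn, R = 200, strata = ames$bldg_type)

# 95% percentile intervals, one column per coefficient
ci <- apply(boot_out$t, 2, quantile, probs = c(0.025, 0.975))
colnames(ci) <- names(boot_out$t0)
print(ci)
```

Wide intervals for the smaller groups would suggest that conclusions about those building types are sensitive to which homes happen to be in the sample.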

4. Crucial Issues Which Might Not Be Measurable:

  1. Subjective Preferences: Individual preferences for certain features of a house (e.g., aesthetics, layout) are subjective and challenging to quantify.

  2. Qualitative Data: Supplement quantitative analysis with qualitative insights or surveys to capture aspects that are not easily quantifiable.

  3. Dynamic Market Factors: Acknowledge that housing markets are dynamic, and certain influential factors may change over time.