Week 13 Data Dive

library(tidyverse)

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(lindia)

## Warning: package 'lindia' was built under R version 4.3.2

Our group for this data dive:

Mansi Sudhirbhai Joshi
Karthik Balasubramanian
Pranav Barathwaj Padmanaban
Shemaa Eltayeb
Magesh Kumar Govind

We have chosen Week 8 notebook that talks about Linear Regression.

In our examination of the Week 8 notebook focused on Linear Regression, our initial focus is on identifying areas for improvement and addressing any analytical issues. Specifically, we aim to scrutinize the assumptions outlined in the notebook.

ASSUMPTIONS

PART -1: We noticed that there could have been some smaller and simpler codes that provide a visualization and validation of the five assumptions given. Below are some assumptions that could have been presented in a more concise way.

ASSUMPTION-2: Homoscedasticity (Constant Variance) There was no code in the notebook explaining how this can be done.

There was no code given to check on it. Hence we can use the below code with respect to ames dataset thats homoscedasticity by plotting the residuals against the fitted values.

# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)
# Fit a linear model
model_homoscedasticity <- lm(sale_price ~ gr_liv_area, data = ames)

# Check homoscedasticity by plotting residuals against fitted values
plot(model_homoscedasticity, which = 3)

ASSUMPTION 3: Independence

# Fit a linear model
model_independence <- lm(sale_price ~ gr_liv_area, data = ames)

# Check independence of errors by plotting residuals against the predictor variable
plot(ames$gr_liv_area, residuals(model_independence), main = "Independence of Errors Check", xlab = "Gr_Liv_Area", ylab = "Residuals")
abline(h = 0, col = "red")

The above graph checks independence of errors by plotting the residuals against the predictor variable.

ASSUMPTION 5: Normality Errors There was no code in the notebook explaining how this can be done.

# Fit a linear model
model_normality <- lm(sale_price ~ gr_liv_area, data = ames)

# Check normality of errors by plotting a quantile-quantile (Q-Q) plot
qqnorm(residuals(model_normality))
qqline(residuals(model_normality), col = "red")

The above graph checks normality of errors by plotting a quantile-quantile (Q-Q) plot of the residuals. If the points fall approximately along the red line, it suggests that the residuals are normally distributed.

Part -2: We notice that with regard to the assumptions of notebook, there aren’t assumptions checks. Conducting assumption checks is crucial when performing ANOVA (Analysis of Variance) or regression analysis. There are several assumptions that need to be validated before interpreting the results of these models. Two common assumptions are normality and homoscedasticity.Even though there are examples to conduct certain checks for each of the five assumptions, there is a need for more of these checks to validate these assumptions.

Below are some possible checks that could have been added to the notebook:

Let’s assume we are examining the impact of Neighborhood on SalePrice as an example. The code will be structured to first fit an ANOVA model, and then perform the tests for normality and homoscedasticity

# Fit an ANOVA model
model <- aov(sale_price ~ neighborhood, data = ames)

Normality Assumption: This assumption tests if the residuals (the differences between the observed and predicted values) are normally distributed. You can perform a variety of tests and visualizations to assess normality, such as:

Histogram: Plotting a histogram of the residuals and visually inspecting it for a bell-shaped curve.

residuals <- model$residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")

- Q-Q (Quantile-Quantile) Plot: Comparing the distribution of the residuals against a normal distribution. If the points lie along a diagonal line, it suggests normality.

qqnorm(residuals)
qqline(residuals, col = "red")

- Shapiro-Wilk Test: A statistical test to formally check normality. The null hypothesis is that the data is normally distributed.

shapiro_result <- shapiro.test(residuals)
print(shapiro_result)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.90176, p-value < 2.2e-16

Homoscedasticity Assumption: This assumption tests if the variance of the residuals is constant across all levels of the independent variable(s). Some methods to assess homoscedasticity include:

Residuals vs. Fitted Values Plot: Plotting residuals against predicted values. If there is no discernible pattern in the plot, homoscedasticity is assumed.

fitted_values <- fitted(model)
plot(fitted_values, residuals, xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")

Scale-Location Plot: A variant of residuals vs. fitted values plot, with square root of standardized residuals on the y-axis to better identify any patterns.

std_residuals <- rstandard(model)
plot(fitted_values, sqrt(abs(std_residuals)), xlab = "Fitted Values", ylab = "Sqrt(Standardized Residuals)", main = "Scale-Location Plot")

Levene’s Test: A statistical test to formally check homogeneity of variances. The null hypothesis is that the variances are equal across groups.

levene_result <- leveneTest(residuals, ames$neighborhood)
print(levene_result)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group   27  25.225 < 2.2e-16 ***
##       2902                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The shapiro.test() and leveneTest() functions return results that can be printed to check for the p-value and assess whether the assumptions hold.

Remember, for the Shapiro-Wilk test, a significant p-value (typically < 0.05) indicates a deviation from normality. For Levene’s test, a significant p-value suggests a violation of homoscedasticity.

SIMPSON’S PARADOX

The notebook for Week 8 doesn’t give a sample implementation of the Simpson’s Paradox. Hence below is a code that can be provided in the notebook that uses the ames dataset.

# Simulating data
set.seed(123)

# Data for Department A
data_A <- data.frame(
  Gender = c("Male", "Female"),
  Applicants = c(100, 100),
  Admitted = c(80, 20)
)

# Data for Department B
data_B <- data.frame(
  Gender = c("Male", "Female"),
  Applicants = c(300, 100),
  Admitted = c(180, 70)
)

# Overall Admission Rates
overall_data <- data.frame(
  Gender = rep(c("Male", "Female"), each = 2),
  Applicants = c(400, 200),
  Admitted = c(260, 90)
)

# Calculate Admission Rates
data_A$Admission_Rate <- data_A$Admitted / data_A$Applicants * 100
data_B$Admission_Rate <- data_B$Admitted / data_B$Applicants * 100
overall_data$Admission_Rate <- overall_data$Admitted / overall_data$Applicants * 100

# Print the data
cat("Data for Department A:\n")

## Data for Department A:

print(data_A)

##   Gender Applicants Admitted Admission_Rate
## 1   Male        100       80             80
## 2 Female        100       20             20

cat("\nData for Department B:\n")

## 
## Data for Department B:

print(data_B)

##   Gender Applicants Admitted Admission_Rate
## 1   Male        300      180             60
## 2 Female        100       70             70

cat("\nOverall Admission Rates:\n")

## 
## Overall Admission Rates:

print(overall_data)

##   Gender Applicants Admitted Admission_Rate
## 1   Male        400      260             65
## 2   Male        200       90             45
## 3 Female        400      260             65
## 4 Female        200       90             45

OVERCOMING BIASES

Suggestion: Multiple Hypothesis Testing (to make more robust and reliable model) — this mitigates potential Type I Error (False Positive) bias.

The code performs multiple pairwise t-tests without adjusting for multiple comparisons adequately. While the Bonferroni correction is applied, other approaches like the Tukey HSD test or false discovery rate control methods could be considered.

Here are examples of performing multiple pairwise comparisons using Tukey’s Honest Significant Difference (HSD) test and controlling false discovery rate using the Benjamini-Hochberg method in R.

Tukey’s Honest Significant Difference (HSD) Test

Tukey’s HSD test is used after an ANOVA to identify which group means are significantly different from each other.

# Assuming you have a data frame 'data' with a response variable 'outcome' and a factor variable 'group'
# Example assuming 'group' is a factor variable with multiple levels

# Performing ANOVA
model <- aov(sale_price ~ ms_zoning, data = ames)

# Applying Tukey HSD test for multiple pairwise comparisons
TukeyHSD(model)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sale_price ~ ms_zoning, data = ames)
## 
## $ms_zoning
##                                                                diff        lwr
## Residential_High_Density-Floating_Village_Residential    -82567.172 -129226.31
## Residential_Low_Density-Floating_Village_Residential     -27703.698  -47088.15
## Residential_Medium_Density-Floating_Village_Residential  -92205.556 -113668.08
## A_agr-Floating_Village_Residential                      -171686.950 -329687.67
## C_all-Floating_Village_Residential                      -139191.910 -187388.51
## I_all-Floating_Village_Residential                      -138674.450 -296675.17
## Residential_Low_Density-Residential_High_Density          54863.474   11914.36
## Residential_Medium_Density-Residential_High_Density       -9638.384  -53564.57
## A_agr-Residential_High_Density                           -89119.778 -251702.36
## C_all-Residential_High_Density                           -56624.738 -118202.19
## I_all-Residential_High_Density                           -56107.278 -218689.86
## Residential_Medium_Density-Residential_Low_Density       -64501.858  -75824.03
## A_agr-Residential_Low_Density                           -143983.252 -300928.40
## C_all-Residential_Low_Density                           -111488.212 -156102.83
## I_all-Residential_Low_Density                           -110970.752 -267915.90
## A_agr-Residential_Medium_Density                         -79481.394 -236696.73
## C_all-Residential_Medium_Density                         -46986.354  -92542.33
## I_all-Residential_Medium_Density                         -46468.894 -203684.23
## C_all-A_agr                                               32495.040 -130535.43
## I_all-A_agr                                               33012.500 -188843.87
## I_all-C_all                                                 517.460 -162513.01
##                                                                upr     p adj
## Residential_High_Density-Floating_Village_Residential   -35908.038 0.0000040
## Residential_Low_Density-Floating_Village_Residential     -8319.250 0.0005091
## Residential_Medium_Density-Floating_Village_Residential -70743.033 0.0000000
## A_agr-Floating_Village_Residential                      -13686.228 0.0230336
## C_all-Floating_Village_Residential                      -90995.306 0.0000000
## I_all-Floating_Village_Residential                       19326.272 0.1294569
## Residential_Low_Density-Residential_High_Density         97812.590 0.0031724
## Residential_Medium_Density-Residential_High_Density      34287.800 0.9951765
## A_agr-Residential_High_Density                           73462.809 0.6711723
## C_all-Residential_High_Density                            4952.712 0.0953033
## I_all-Residential_High_Density                          106475.309 0.9499903
## Residential_Medium_Density-Residential_Low_Density      -53179.683 0.0000000
## A_agr-Residential_Low_Density                            12961.897 0.0968679
## C_all-Residential_Low_Density                           -66873.591 0.0000000
## I_all-Residential_Low_Density                            45974.397 0.3611403
## A_agr-Residential_Medium_Density                         77733.945 0.7500212
## C_all-Residential_Medium_Density                         -1430.373 0.0381110
## I_all-Residential_Medium_Density                        110746.445 0.9766667
## C_all-A_agr                                             195525.514 0.9971625
## I_all-A_agr                                             254868.874 0.9994588
## I_all-C_all                                             163547.934 1.0000000

False Discovery Rate Control using Benjamini-Hochberg method

The Benjamini-Hochberg method controls the False Discovery Rate (FDR) while performing multiple hypothesis testing.

# Assuming you have a list of p-values from multiple tests in 'p_values'

# Apply Benjamini-Hochberg procedure for controlling False Discovery Rate
adjusted_p_values <- p.adjust(p_values, method = "BH")

Replace 'p_values' with your actual list of p-values obtained from multiple hypothesis tests. This method adjusts these p-values to control the FDR.

You can use these methods after conducting your statistical tests to make multiple comparisons more robust and reliable. Adjusting for multiple comparisons helps reduce the likelihood of false positives when testing multiple hypotheses simultaneously.

The existing code in lab-8 describes multiple pairwise t-tests are performed without adequately adjusting for multiple comparisons (using Bonferroni correction or similar approaches), can lead to an increased risk of making Type I errors, also known as false positives.

Bias Mitigation Impact: By using methods like Tukey’s HSD test or controlling FDR, you reduce the probability of erroneously concluding that there is a significant difference or relationship when, in reality, there isn’t. These methods help maintain a balance between identifying true effects and minimizing the risk of falsely detecting effects due to multiple testing. They provide more stringent control over Type I errors and, in the case of FDR control, allow for a higher power to detect true effects among multiple tests.

In summary, the bias mitigated here primarily involves lowering the probability of falsely identifying relationships or differences between groups when conducting multiple comparisons, ultimately enhancing the reliability of statistical inference.

POSSIBLE RISKS OR SOCIETAL IMPACTS

Misinterpretation due to Violation of Assumptions: The code discusses several assumptions related to linear regression, such as independence, normality, and constant variance of errors. If these assumptions are violated or not adequately checked, it can lead to incorrect conclusions. For instance, if the assumption of normality is not met, confidence intervals and hypothesis tests might be inaccurate, leading to flawed decisions in the housing market, affecting investment strategies or policy-making.

Model Overfitting or Underfitting: While fitting the model, there might be a risk of overfitting (capturing noise along with the signal) or underfitting (oversimplifying the relationship). Overfitting can lead to a lack of generalization, making predictions unreliable for new data. Underfitting, on the other hand, might not capture the complexity of relationships, resulting in inadequate predictions.

Ethical Implications of Model Outputs: If the model is used to make decisions that affect people’s lives, such as determining housing prices, loan approvals, or urban planning policies, ethical issues may arise. Biases within the data or model can lead to discrimination or unfair treatment of certain groups or communities.

Risks of Correlation vs. Causation Misinterpretation: Linear regression can suggest correlations between variables, but it does not imply causation. Incorrectly assuming causation based on correlation can lead to mistaken policy interventions or investments.

Algorithmic Bias: Analyzing data using algorithms or statistical models might perpetuate biases present in the data itself, leading to biased predictions or recommendations.

Before implementing or drawing conclusions from the code, it’s crucial to conduct thorough data analysis, validate assumptions, consider broader context, perform sensitivity analyses, and exercise caution while interpreting the results. Moreover, engaging domain experts and stakeholders can provide valuable insights and perspectives. Additionally, ethical considerations and fairness should be prioritized when applying statistical models to real-world scenarios.

CRUCIAL ISSUES WHICH MIGHT NOT BE MEASURABLE

In the context of “Lab 8” code discussing ANOVA, Linear Regression, and Simpson’s Paradox, there are several crucial issues that might not be directly measurable due to certain inherent limitations or complexities within the statistical analysis process. These issues can affect the reliability or generalizability of the conclusions drawn from the analyses. Some of these crucial issues include:

Causality vs. Correlation: Statistical analyses, especially those involving observational data, might establish correlations between variables but not necessarily indicate causality. Determining causality often requires additional evidence, such as controlled experiments, to ascertain a cause-and-effect relationship. For instance, while regression analysis might show a relationship between certain house features and sale prices, it may not definitively prove that these features directly cause changes in prices.

Linear Regression: Provides an introduction to linear regression using Anscombe’s Quartet and further extends it to cover single-variable and multiple-variable regression, interaction terms, multicollinearity, errors, and diagnostics.

Multicollinearity: Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other. This can cause issues in interpreting the significance of individual predictors.

Simpson’s Paradox and Misleading Aggregations: Simpson’s Paradox refers to a phenomenon where trends or associations observed in different groups vanish or even reverse when the groups are aggregated. It is a statistical paradox that might not be immediately apparent and might mislead the interpretation of the overall relationship between variables.

These issues might not be directly measurable due to their nature, complexity, or the limitations of the available data. While statistical techniques aim to mitigate these issues, certain challenges remain, and addressing these challenges often requires a deeper understanding of the data context, subject matter expertise, and careful consideration of potential biases and limitations within the analysis.