library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:boot':
##
## logit
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.2
Our group for this data dive:
We have chosen Week 8 notebook that talks about Linear Regression.
In our examination of the Week 8 notebook focused on Linear Regression, our initial focus is on identifying areas for improvement and addressing any analytical issues. Specifically, we aim to scrutinize the assumptions outlined in the notebook.
PART -1: We noticed that there could have been some smaller and simpler codes that provide a visualization and validation of the five assumptions given. Below are some assumptions that could have been presented in a more concise way.
ASSUMPTION-2: Homoscedasticity (Constant Variance) There was no code in the notebook explaining how this can be done.
There was no code given to check on it. Hence we can use the below code with respect to ames dataset thats homoscedasticity by plotting the residuals against the fitted values.
# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)
# Fit a linear model
model_homoscedasticity <- lm(sale_price ~ gr_liv_area, data = ames)
# Check homoscedasticity by plotting residuals against fitted values
plot(model_homoscedasticity, which = 3)
ASSUMPTION 3: Independence
# Fit a linear model
model_independence <- lm(sale_price ~ gr_liv_area, data = ames)
# Check independence of errors by plotting residuals against the predictor variable
plot(ames$gr_liv_area, residuals(model_independence), main = "Independence of Errors Check", xlab = "Gr_Liv_Area", ylab = "Residuals")
abline(h = 0, col = "red")
The above graph checks independence of errors by plotting the residuals against the predictor variable.
ASSUMPTION 5: Normality Errors There was no code in the notebook explaining how this can be done.
# Fit a linear model
model_normality <- lm(sale_price ~ gr_liv_area, data = ames)
# Check normality of errors by plotting a quantile-quantile (Q-Q) plot
qqnorm(residuals(model_normality))
qqline(residuals(model_normality), col = "red")
The above graph checks normality of errors by plotting a quantile-quantile (Q-Q) plot of the residuals. If the points fall approximately along the red line, it suggests that the residuals are normally distributed.
Part -2: We notice that with regard to the assumptions of notebook, there aren’t assumptions checks. Conducting assumption checks is crucial when performing ANOVA (Analysis of Variance) or regression analysis. There are several assumptions that need to be validated before interpreting the results of these models. Two common assumptions are normality and homoscedasticity.Even though there are examples to conduct certain checks for each of the five assumptions, there is a need for more of these checks to validate these assumptions.
Below are some possible checks that could have been added to the notebook:
Let’s assume we are examining the impact of Neighborhood on SalePrice as an example. The code will be structured to first fit an ANOVA model, and then perform the tests for normality and homoscedasticity
# Fit an ANOVA model
model <- aov(sale_price ~ neighborhood, data = ames)
Normality Assumption: This assumption tests if the residuals (the differences between the observed and predicted values) are normally distributed. You can perform a variety of tests and visualizations to assess normality, such as:
residuals <- model$residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")
- Q-Q (Quantile-Quantile) Plot: Comparing the
distribution of the residuals against a normal distribution. If the
points lie along a diagonal line, it suggests normality.
qqnorm(residuals)
qqline(residuals, col = "red")
- Shapiro-Wilk Test: A statistical test to formally
check normality. The null hypothesis is that the data is normally
distributed.
shapiro_result <- shapiro.test(residuals)
print(shapiro_result)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.90176, p-value < 2.2e-16
Homoscedasticity Assumption: This assumption tests if the variance of the residuals is constant across all levels of the independent variable(s). Some methods to assess homoscedasticity include:
fitted_values <- fitted(model)
plot(fitted_values, residuals, xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
std_residuals <- rstandard(model)
plot(fitted_values, sqrt(abs(std_residuals)), xlab = "Fitted Values", ylab = "Sqrt(Standardized Residuals)", main = "Scale-Location Plot")
levene_result <- leveneTest(residuals, ames$neighborhood)
print(levene_result)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 27 25.225 < 2.2e-16 ***
## 2902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The shapiro.test() and leveneTest() functions return results that can be printed to check for the p-value and assess whether the assumptions hold.
Remember, for the Shapiro-Wilk test, a significant p-value (typically < 0.05) indicates a deviation from normality. For Levene’s test, a significant p-value suggests a violation of homoscedasticity.
SIMPSON’S PARADOX
The notebook for Week 8 doesn’t give a sample implementation of the Simpson’s Paradox. Hence below is a code that can be provided in the notebook that uses the ames dataset.
# Simulating data
set.seed(123)
# Data for Department A
data_A <- data.frame(
Gender = c("Male", "Female"),
Applicants = c(100, 100),
Admitted = c(80, 20)
)
# Data for Department B
data_B <- data.frame(
Gender = c("Male", "Female"),
Applicants = c(300, 100),
Admitted = c(180, 70)
)
# Overall Admission Rates
overall_data <- data.frame(
Gender = rep(c("Male", "Female"), each = 2),
Applicants = c(400, 200),
Admitted = c(260, 90)
)
# Calculate Admission Rates
data_A$Admission_Rate <- data_A$Admitted / data_A$Applicants * 100
data_B$Admission_Rate <- data_B$Admitted / data_B$Applicants * 100
overall_data$Admission_Rate <- overall_data$Admitted / overall_data$Applicants * 100
# Print the data
cat("Data for Department A:\n")
## Data for Department A:
print(data_A)
## Gender Applicants Admitted Admission_Rate
## 1 Male 100 80 80
## 2 Female 100 20 20
cat("\nData for Department B:\n")
##
## Data for Department B:
print(data_B)
## Gender Applicants Admitted Admission_Rate
## 1 Male 300 180 60
## 2 Female 100 70 70
cat("\nOverall Admission Rates:\n")
##
## Overall Admission Rates:
print(overall_data)
## Gender Applicants Admitted Admission_Rate
## 1 Male 400 260 65
## 2 Male 200 90 45
## 3 Female 400 260 65
## 4 Female 200 90 45
Suggestion: Multiple Hypothesis Testing (to make more robust and reliable model) — this mitigates potential Type I Error (False Positive) bias.
The code performs multiple pairwise t-tests without adjusting for multiple comparisons adequately. While the Bonferroni correction is applied, other approaches like the Tukey HSD test or false discovery rate control methods could be considered.
Here are examples of performing multiple pairwise comparisons using Tukey’s Honest Significant Difference (HSD) test and controlling false discovery rate using the Benjamini-Hochberg method in R.
Tukey’s HSD test is used after an ANOVA to identify which group means are significantly different from each other.
# Assuming you have a data frame 'data' with a response variable 'outcome' and a factor variable 'group'
# Example assuming 'group' is a factor variable with multiple levels
# Performing ANOVA
model <- aov(sale_price ~ ms_zoning, data = ames)
# Applying Tukey HSD test for multiple pairwise comparisons
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sale_price ~ ms_zoning, data = ames)
##
## $ms_zoning
## diff lwr
## Residential_High_Density-Floating_Village_Residential -82567.172 -129226.31
## Residential_Low_Density-Floating_Village_Residential -27703.698 -47088.15
## Residential_Medium_Density-Floating_Village_Residential -92205.556 -113668.08
## A_agr-Floating_Village_Residential -171686.950 -329687.67
## C_all-Floating_Village_Residential -139191.910 -187388.51
## I_all-Floating_Village_Residential -138674.450 -296675.17
## Residential_Low_Density-Residential_High_Density 54863.474 11914.36
## Residential_Medium_Density-Residential_High_Density -9638.384 -53564.57
## A_agr-Residential_High_Density -89119.778 -251702.36
## C_all-Residential_High_Density -56624.738 -118202.19
## I_all-Residential_High_Density -56107.278 -218689.86
## Residential_Medium_Density-Residential_Low_Density -64501.858 -75824.03
## A_agr-Residential_Low_Density -143983.252 -300928.40
## C_all-Residential_Low_Density -111488.212 -156102.83
## I_all-Residential_Low_Density -110970.752 -267915.90
## A_agr-Residential_Medium_Density -79481.394 -236696.73
## C_all-Residential_Medium_Density -46986.354 -92542.33
## I_all-Residential_Medium_Density -46468.894 -203684.23
## C_all-A_agr 32495.040 -130535.43
## I_all-A_agr 33012.500 -188843.87
## I_all-C_all 517.460 -162513.01
## upr p adj
## Residential_High_Density-Floating_Village_Residential -35908.038 0.0000040
## Residential_Low_Density-Floating_Village_Residential -8319.250 0.0005091
## Residential_Medium_Density-Floating_Village_Residential -70743.033 0.0000000
## A_agr-Floating_Village_Residential -13686.228 0.0230336
## C_all-Floating_Village_Residential -90995.306 0.0000000
## I_all-Floating_Village_Residential 19326.272 0.1294569
## Residential_Low_Density-Residential_High_Density 97812.590 0.0031724
## Residential_Medium_Density-Residential_High_Density 34287.800 0.9951765
## A_agr-Residential_High_Density 73462.809 0.6711723
## C_all-Residential_High_Density 4952.712 0.0953033
## I_all-Residential_High_Density 106475.309 0.9499903
## Residential_Medium_Density-Residential_Low_Density -53179.683 0.0000000
## A_agr-Residential_Low_Density 12961.897 0.0968679
## C_all-Residential_Low_Density -66873.591 0.0000000
## I_all-Residential_Low_Density 45974.397 0.3611403
## A_agr-Residential_Medium_Density 77733.945 0.7500212
## C_all-Residential_Medium_Density -1430.373 0.0381110
## I_all-Residential_Medium_Density 110746.445 0.9766667
## C_all-A_agr 195525.514 0.9971625
## I_all-A_agr 254868.874 0.9994588
## I_all-C_all 163547.934 1.0000000
The Benjamini-Hochberg method controls the False Discovery Rate (FDR) while performing multiple hypothesis testing.
# Assuming you have a list of p-values from multiple tests in 'p_values'
# Apply Benjamini-Hochberg procedure for controlling False Discovery Rate
adjusted_p_values <- p.adjust(p_values, method = "BH")
Replace 'p_values' with your actual list of p-values
obtained from multiple hypothesis tests. This method adjusts these
p-values to control the FDR.
You can use these methods after conducting your statistical tests to make multiple comparisons more robust and reliable. Adjusting for multiple comparisons helps reduce the likelihood of false positives when testing multiple hypotheses simultaneously.
The existing code in lab-8 describes multiple pairwise t-tests are performed without adequately adjusting for multiple comparisons (using Bonferroni correction or similar approaches), can lead to an increased risk of making Type I errors, also known as false positives.
Bias Mitigation Impact: By using methods like Tukey’s HSD test or controlling FDR, you reduce the probability of erroneously concluding that there is a significant difference or relationship when, in reality, there isn’t. These methods help maintain a balance between identifying true effects and minimizing the risk of falsely detecting effects due to multiple testing. They provide more stringent control over Type I errors and, in the case of FDR control, allow for a higher power to detect true effects among multiple tests.
In summary, the bias mitigated here primarily involves lowering the probability of falsely identifying relationships or differences between groups when conducting multiple comparisons, ultimately enhancing the reliability of statistical inference.
Misinterpretation due to Violation of Assumptions: The code discusses several assumptions related to linear regression, such as independence, normality, and constant variance of errors. If these assumptions are violated or not adequately checked, it can lead to incorrect conclusions. For instance, if the assumption of normality is not met, confidence intervals and hypothesis tests might be inaccurate, leading to flawed decisions in the housing market, affecting investment strategies or policy-making.
Model Overfitting or Underfitting: While fitting the model, there might be a risk of overfitting (capturing noise along with the signal) or underfitting (oversimplifying the relationship). Overfitting can lead to a lack of generalization, making predictions unreliable for new data. Underfitting, on the other hand, might not capture the complexity of relationships, resulting in inadequate predictions.
Ethical Implications of Model Outputs: If the model is used to make decisions that affect people’s lives, such as determining housing prices, loan approvals, or urban planning policies, ethical issues may arise. Biases within the data or model can lead to discrimination or unfair treatment of certain groups or communities.
Risks of Correlation vs. Causation Misinterpretation: Linear regression can suggest correlations between variables, but it does not imply causation. Incorrectly assuming causation based on correlation can lead to mistaken policy interventions or investments.
Algorithmic Bias: Analyzing data using algorithms or statistical models might perpetuate biases present in the data itself, leading to biased predictions or recommendations.
Before implementing or drawing conclusions from the code, it’s crucial to conduct thorough data analysis, validate assumptions, consider broader context, perform sensitivity analyses, and exercise caution while interpreting the results. Moreover, engaging domain experts and stakeholders can provide valuable insights and perspectives. Additionally, ethical considerations and fairness should be prioritized when applying statistical models to real-world scenarios.
In the context of “Lab 8” code discussing ANOVA, Linear Regression, and Simpson’s Paradox, there are several crucial issues that might not be directly measurable due to certain inherent limitations or complexities within the statistical analysis process. These issues can affect the reliability or generalizability of the conclusions drawn from the analyses. Some of these crucial issues include:
Causality vs. Correlation: Statistical analyses, especially those involving observational data, might establish correlations between variables but not necessarily indicate causality. Determining causality often requires additional evidence, such as controlled experiments, to ascertain a cause-and-effect relationship. For instance, while regression analysis might show a relationship between certain house features and sale prices, it may not definitively prove that these features directly cause changes in prices.
Linear Regression: Provides an introduction to linear regression using Anscombe’s Quartet and further extends it to cover single-variable and multiple-variable regression, interaction terms, multicollinearity, errors, and diagnostics.
Multicollinearity: Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other. This can cause issues in interpreting the significance of individual predictors.
Simpson’s Paradox and Misleading Aggregations: Simpson’s Paradox refers to a phenomenon where trends or associations observed in different groups vanish or even reverse when the groups are aggregated. It is a statistical paradox that might not be immediately apparent and might mislead the interpretation of the overall relationship between variables.
These issues might not be directly measurable due to their nature, complexity, or the limitations of the available data. While statistical techniques aim to mitigate these issues, certain challenges remain, and addressing these challenges often requires a deeper understanding of the data context, subject matter expertise, and careful consideration of potential biases and limitations within the analysis.