P-Value: Understanding Its Use and Misuse

2024-11-13

Purpose of the Presentation

Objective: To explain what the p-value is and how it supports data-driven decision making.
Example Problem: We will use the Ames Housing data from Kaggle to set up an example problem.
- Problem: Do housing prices differ between “North Ames” and “Old Town” neighborhoods?
Demonstrate Abuse of P-Values:
- The example problems will examine both proper and improper use of p-values.
  - Highlight how selective data manipulation (p-hacking) can produce misleading results.

What is the P-Value?

Definition: Assuming the null hypothesis ($H_0$) is true, the p-value is the probability of obtaining results at least as extreme as the observed results.
Purpose: Helps gauge the strength of the evidence against $H_0$ and, in turn, the statistical significance of the results.
Formula: \[p = P(\text{Data at least as extreme as observed} \mid H_0 \text{ true})\]
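As a minimal sketch of the definition, the two-sided p-value of a one-sample t-test can be computed by hand and compared with what t.test() reports. The sample x and the null value mu0 are made up for illustration and are not part of the housing example.

```r
set.seed(123)
x   <- rnorm(25, mean = 0.4, sd = 1)  # hypothetical sample, for illustration only
mu0 <- 0                              # null hypothesis: the true mean equals 0

n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# P(|T| >= |t_stat|) under H0, where T follows a t distribution with n - 1 df
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# Built-in one-sample t-test for comparison
p_builtin <- t.test(x, mu = mu0)$p.value
```

The two values should agree, since t.test() uses the same t distribution to turn the test statistic into a tail probability.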

Hypothesis Testing and P-Value

Why Hypothesis Test?:
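- To assess whether the observed data provide enough evidence to reject the null hypothesis.
Decision Rule: The p-value is compared against a chosen significance level, conventionally $\alpha = 0.05$. If the p-value is below $\alpha$, we reject $H_0$; otherwise we fail to reject it.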
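To see what the significance level means in practice, the sketch below simulates many experiments in which the null hypothesis is true by construction and applies the decision rule to each one. The group sizes and the number of simulations are arbitrary choices made for this illustration.

```r
set.seed(1)
alpha  <- 0.05
n_sims <- 2000

# Both groups are drawn from the same distribution, so H0 is true in every run
p_values <- replicate(n_sims, {
  a <- rnorm(30)
  b <- rnorm(30)
  t.test(a, b)$p.value
})

# Apply the decision rule to each simulated experiment
decisions <- ifelse(p_values < alpha, "reject H0", "fail to reject H0")

# Share of runs that reject H0 even though it is true
false_positive_rate <- mean(p_values < alpha)
```

The rejection rate should come out near alpha: with a 0.05 threshold, roughly one test in twenty rejects a true null hypothesis.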

Example: Difference in Housing Prices by Location

Dataset Details
- Variable of Interest: SalePrice (Price of last sale).
- Grouping Variable: Neighborhood (Name of neighborhood).
Hypotheses
- Null Hypothesis ($H_0$): There is no difference in mean sale price between neighborhoods.
- Alternative Hypothesis ($H_1$): The mean sale price is not equal between neighborhoods.
Data Preview
- Neighborhoods: “North Ames” and “Old Town”. These were selected for their sample sizes and characteristics.
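The slides do not show how neighborhood_data was built, so the following is only a sketch of one plausible preparation step. The toy rows stand in for the raw Kaggle file, and the living-area column name GrLivArea is an assumption; the real column may be named differently.

```r
library(dplyr)

# Toy stand-in for the raw Ames data; all values are made up for illustration
raw_housing <- tibble(
  Neighborhood = c("NAmes", "OldTown", "NAmes", "Gilbert"),
  SalePrice    = c(140000, 119900, 152000, 189000),
  GrLivArea    = c(1200, 1400, 1300, 1600)  # assumed column name
)

neighborhood_data <- raw_housing %>%
  # Keep only the two neighborhoods being compared
  filter(Neighborhood %in% c("NAmes", "OldTown")) %>%
  mutate(
    # Replace the dataset codes with readable labels
    Neighborhood = recode(Neighborhood,
                          NAmes = "North Ames", OldTown = "Old Town"),
    # Derived variable used in the price-per-square-foot test
    PricePerSqFt = SalePrice / GrLivArea
  )
```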

Previewing Data

library(dplyr)  # provides %>%, group_by(), summarise(), and n()

neighborhood_summary <- neighborhood_data %>%
  group_by(Neighborhood) %>%
  summarise(
    mean_price = mean(SalePrice, na.rm = TRUE),
    median_price = median(SalePrice, na.rm = TRUE),
    sd_price = sd(SalePrice, na.rm = TRUE),
    sample_size = n()
  )
print(neighborhood_summary)

## # A tibble: 2 × 5
##   Neighborhood mean_price median_price sd_price sample_size
##   <chr>             <dbl>        <int>    <dbl>       <int>
## 1 North Ames      145097.       140000   31883.         443
## 2 Old Town        123992.       119900   44327.         239

Plotting Sale Prices

Box Plot of Price per Square Foot by Neighborhood

Interactive 2D Scatter Plot: Square Footage vs Sale Price

Hypothesis Test with P-Values Part 1

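library(scales)  # provides dollar() for currency formatting

# Welch two-sample t-test (the t.test() default, which does not assume equal variances)
t_test_result <- t.test(SalePrice ~ Neighborhood,
                        data = neighborhood_data)

# Report very small p-values as "< 0.001" so the output stays readable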
if (t_test_result$p.value < 0.001) {
  house_p_value <- "< 0.001"
} else {
  house_p_value <- format(t_test_result$p.value, scientific = FALSE,
                          digits = 4)
}
house_conf_int <- sprintf("%s to %s", dollar(t_test_result$conf.int[1]),
                          dollar(t_test_result$conf.int[2]))

Housing Price by Neighborhood
- P-Value: < 0.001
- 95% Confidence Interval: $14,728.99 to $27,481.93
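The reported result can be roughly reproduced from the summary table alone. The sketch below applies the Welch t-test formulas to the rounded means, standard deviations, and sample sizes printed earlier. It illustrates the arithmetic behind t.test() and is not the author's code.

```r
# Summary statistics copied from the neighborhood_summary output (rounded)
m1 <- 145097; s1 <- 31883; n1 <- 443   # North Ames
m2 <- 123992; s2 <- 44327; n2 <- 239   # Old Town

# Variance of each sample mean and the standard error of the difference
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m1 - m2) / se
df     <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval for the mean difference
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se
```

Because the inputs are rounded, ci should land close to the reported interval rather than match it to the cent.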

Hypothesis Test with P-Values Part 2

# Same Welch t-test, applied to price per square foot
t_test_sqft <- t.test(PricePerSqFt ~ Neighborhood,
                      data = neighborhood_data)

if (t_test_sqft$p.value < 0.001) {
  sqft_p_value <- "< 0.001"
} else {
  sqft_p_value <- format(t_test_sqft$p.value, scientific = FALSE,
                         digits = 4)
}

sqft_conf_int <- sprintf("%s to %s", dollar(t_test_sqft$conf.int[1]),
                         dollar(t_test_sqft$conf.int[2]))

Price/SqFt by Neighborhood
- P-Value: < 0.001
- 95% Confidence Interval: $21.76 to $29.30

Interpreting Example P-Value

P-Value: < 0.001
- Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in mean sale prices between “North Ames” and “Old Town”.
95% Confidence Interval: $14,728.99 to $27,481.93
- This interval is the estimated range for the true difference in mean sale prices between the two neighborhoods.
- Interpretation: We are 95% confident that the true difference in mean sale prices lies between $14,728.99 and $27,481.93. This suggests that “North Ames” homes are, on average, more expensive than “Old Town” homes by an amount in this range.
Key Takeaway:
- The low p-value indicates that a difference this large would be unlikely if the two neighborhoods had the same mean sale price.
- The confidence interval excludes zero, which is consistent with the statistically significant test. Its size, roughly $15,000 to $27,000 against mean prices of about $124,000 and $145,000, suggests the difference also matters in practice.

Example of Intentional Misuse

The p-value is artificially inflated to suggest no relationship between neighborhood and sale price.
- This is done by filtering out the lower-priced houses from Old Town and the higher-priced houses from North Ames.
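The slides do not show the filter that produced p_hacked_data, so the sketch below only illustrates the idea on synthetic data. Prices are simulated as normal draws that match the summary table, which is an assumption and not the real Ames data. The helper trim_and_test() and the grid of trimming fractions are hypothetical.

```r
set.seed(42)

# Synthetic prices matching the reported means, SDs, and sample sizes.
# These are NOT the Ames data and are not constrained to be realistic.
sim <- data.frame(
  Neighborhood = rep(c("North Ames", "Old Town"), times = c(443, 239)),
  SalePrice    = c(rnorm(443, 145097, 31883), rnorm(239, 123992, 44327))
)

# Hypothetical helper: drop the top `frac` of North Ames prices and the
# bottom `frac` of Old Town prices, then rerun the Welch t-test
trim_and_test <- function(df, frac) {
  na_prices <- df$SalePrice[df$Neighborhood == "North Ames"]
  ot_prices <- df$SalePrice[df$Neighborhood == "Old Town"]
  na_keep <- na_prices[na_prices <= quantile(na_prices, 1 - frac)]
  ot_keep <- ot_prices[ot_prices >= quantile(ot_prices, frac)]
  t.test(na_keep, ot_keep)$p.value
}

# Try many trimming fractions and look at the p-value each one produces
fracs    <- seq(0, 0.30, by = 0.01)
p_values <- sapply(fracs, trim_and_test, df = sim)

p_full     <- p_values[1]              # no trimming: the honest test
worst_frac <- fracs[which.max(p_values)]  # the filter a p-hacker would pick
```

Scanning filters until the p-value crosses 0.05 is exactly the practice to avoid: the filter is chosen because of the result it gives, not for any reason tied to the question.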

Misuse Results

# Run t-test on manipulated subset
p_hacked_test <- t.test(SalePrice ~ Neighborhood,
                        data = p_hacked_data)

# Display p-value and confidence interval for the manipulated test
p_hacked_p_value <- ifelse(p_hacked_test$p.value < 0.001,
                           "< 0.001", format(p_hacked_test$p.value,
                                  scientific = FALSE, digits = 4))
p_hacked_conf_int <- sprintf("%s to %s",
                             dollar(p_hacked_test$conf.int[1]),
                             dollar(p_hacked_test$conf.int[2]))

P-Value: 0.4704

95% Confidence Interval: -$17,516.96 to $8,195.99

Comparing Misuse to Proper Use

Proper Use:
- P-Value for Sale Price: < 0.001
- 95% Confidence Interval: $14,728.99 to $27,481.93
- Conclusion: A statistically significant difference in mean sale prices exists between “North Ames” and “Old Town.”
Misuse (P-Hacking):
- P-Value for Sale Price: 0.4704
- 95% Confidence Interval: -$17,516.96 to $8,195.99
- Conclusion: Selective data manipulation falsely shows no significant difference between neighborhoods.

Key Takeaway:

P-hacking demonstrates how results can be skewed to misrepresent reality.
Proper statistical practices ensure reliable and ethical interpretations of data.

Conclusion

P-Value:
- A critical tool for hypothesis testing, but it must be interpreted correctly and ethically.
Proper Use:
- Can demonstrate meaningful differences in sale price and price per square foot between neighborhoods.
Misuse:
- Selective data manipulation (p-hacking) can make a real difference appear insignificant.
Key Message:
- Always evaluate p-values within the context of the complete data, and avoid unethical practices such as filtering data to reach a desired result.

References

Ames Housing Data https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset