2024-11-13

Purpose of the Presentation

  • Objective: To explain what the p-value is and how it supports data-based decision making.

  • Example Problem: We will use the Ames Housing data from Kaggle to set up an example problem.

    • Problem: Do housing prices differ between “North Ames” and “Old Town” neighborhoods?
  • Demonstrate Abuse of P-Values:

    • The example problems will examine both proper and improper use of p-values.
      • Highlight how selective data manipulation (p-hacking) can produce misleading results.

What is the P-Value?

  • Definition: Assuming a null hypothesis (\(H_0\)) is true, the P-Value is the probability of obtaining results as extreme as, or more extreme than, the observed results due to random chance.

  • Purpose: Helps quantify the strength of the evidence against \(H_0\) and decide whether the results are statistically significant.

  • Formula: \[P(\text{Data at least as extreme as observed} \mid H_0 \text{ true})\]
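
The definition can be made concrete with a small permutation test. The sketch below uses simulated data rather than the Ames data; the group sizes, means, and the helper name `perm_p_value` are illustrative assumptions. It shuffles the group labels to mimic \(H_0\) and counts how often a shuffled difference in means is at least as extreme as the observed one.

```r
# Permutation p-value for a difference in means, on simulated data.
set.seed(42)  # fixed seed so the sketch is reproducible

group_a <- rnorm(40, mean = 100, sd = 15)  # hypothetical group A
group_b <- rnorm(40, mean = 108, sd = 15)  # hypothetical group B

perm_p_value <- function(x, y, n_perm = 5000) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  n_x <- length(x)
  # Under H0 the group labels are exchangeable, so reshuffle them repeatedly
  perm_diffs <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # Two-sided: share of shuffles at least as extreme as the observed difference
  mean(abs(perm_diffs) >= abs(observed))
}

p_perm <- perm_p_value(group_a, group_b)
p_perm
```

The returned proportion is a direct estimate of the probability in the formula: the fraction of "null worlds" that produce a result at least as extreme as the one observed.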

Hypothesis Testing and P-Value

  • Why Hypothesis Test?:
    • To assess whether the observed data provide sufficient evidence to reject the null hypothesis.
  • Decision Rule: The p-value is compared against a chosen significance level, conventionally \(\alpha = 0.05\). If the p-value is below \(\alpha\), the null hypothesis is rejected.
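
What the significance level controls can be illustrated by simulation. The sketch below, which uses simulated normal data with hypothetical sample sizes, runs many t-tests in which \(H_0\) is true by construction and records how often the decision rule rejects it anyway.

```r
# Simulate many experiments in which H0 is true (both groups share one mean)
set.seed(1)
n_sims <- 2000
p_values <- replicate(n_sims, {
  x <- rnorm(30)
  y <- rnorm(30)
  t.test(x, y)$p.value
})

alpha <- 0.05
# Share of true-null experiments that the rule "p < alpha" rejects
false_positive_rate <- mean(p_values < alpha)
false_positive_rate  # expected to be close to alpha
```

With \(\alpha = 0.05\), roughly one in twenty true-null experiments is flagged as significant, which is why a single small p-value should not be read as proof.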

Example: Difference in Housing Prices by Location

  • Dataset Details
    • Variable of Interest: SalePrice (Price of last sale).
    • Grouping Variable: Neighborhood (Name of neighborhood).
  • Hypotheses
    • Null Hypothesis (\(H_0\)): There is no difference in mean sale price between neighborhoods.
    • Alternative Hypothesis (\(H_1\)): The mean sale price is not equal between neighborhoods.
  • Data Preview
    • Neighborhoods: “North Ames” and “Old Town”. These were selected for their sample sizes and characteristics.
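
In symbols, writing \(\mu_{\text{NA}}\) and \(\mu_{\text{OT}}\) for the population mean sale prices in North Ames and Old Town (notation introduced here for illustration, not taken from the slides), the two-sided hypotheses are:

\[H_0: \mu_{\text{NA}} = \mu_{\text{OT}} \qquad H_1: \mu_{\text{NA}} \neq \mu_{\text{OT}}\]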

Previewing Data

# Summary statistics of sale price per neighborhood (requires dplyr)
library(dplyr)
neighborhood_summary <- neighborhood_data %>%
  group_by(Neighborhood) %>%
  summarise(
    mean_price = mean(SalePrice, na.rm = TRUE),
    median_price = median(SalePrice, na.rm = TRUE),
    sd_price = sd(SalePrice, na.rm = TRUE),
    sample_size = n()
  )
print(neighborhood_summary)
## # A tibble: 2 × 5
##   Neighborhood mean_price median_price sd_price sample_size
##   <chr>             <dbl>        <int>    <dbl>       <int>
## 1 North Ames      145097.       140000   31883.         443
## 2 Old Town        123992.       119900   44327.         239

Plotting Sale Prices

Box Plot of Price per Square Foot by Neighborhood

Interactive 2D Scatter Plot: Square Footage vs Sale Price

Hypothesis Test with P-Values Part 1

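library(scales)  # provides dollar() for currency formatting

# Two-sample t-test of sale price by neighborhood
# (t.test() defaults to Welch's test, var.equal = FALSE)
t_test_result <- t.test(SalePrice ~ Neighborhood,
                        data = neighborhood_data)

# Report very small p-values as "< 0.001" instead of a long decimal
if (t_test_result$p.value < 0.001) {
  house_p_value <- "< 0.001"
} else {
  house_p_value <- format(t_test_result$p.value, scientific = FALSE,
                          digits = 4)
}
# Format the 95% confidence interval for the difference in means as dollars
house_conf_int <- sprintf("%s to %s", dollar(t_test_result$conf.int[1]),
                          dollar(t_test_result$conf.int[2]))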
  • Housing Price by Neighborhood
    • P-Value: < 0.001
    • 95% Confidence Interval: $14,728.99 to $27,481.93
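
As a rough cross-check, the test can be approximated from the rounded summary statistics in the data preview. R's `t.test()` defaults to Welch's unequal-variance test, and the sketch below assumes that default was used; it recomputes the test statistic, degrees of freedom, and confidence interval by hand. The variable names are illustrative, and because the inputs are rounded the results will only approximately match the reported values.

```r
# Rounded summary statistics copied from the data preview table
mean_na <- 145097; sd_na <- 31883; n_na <- 443   # North Ames
mean_ot <- 123992; sd_ot <- 44327; n_ot <- 239   # Old Town

# Per-group contribution to the variance of the difference in means
var_term_na <- sd_na^2 / n_na
var_term_ot <- sd_ot^2 / n_ot

diff_means <- mean_na - mean_ot
std_error <- sqrt(var_term_na + var_term_ot)
t_stat <- diff_means / std_error

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (var_term_na + var_term_ot)^2 /
  (var_term_na^2 / (n_na - 1) + var_term_ot^2 / (n_ot - 1))

# Two-sided p-value and 95% confidence interval
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
t_crit <- qt(0.975, df = df_welch)
conf_int <- diff_means + c(-1, 1) * t_crit * std_error

round(c(t = t_stat, df = df_welch), 2)
p_value < 0.001
round(conf_int)
```

The hand-computed interval should land close to the reported $14,728.99 to $27,481.93, which is consistent with the slides having used the Welch default.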

Hypothesis Test with P-Values Part 2

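# Same Welch t-test, applied to price per square foot
# (PricePerSqFt is assumed to be a derived column created in a step not shown)
t_test_sqft <- t.test(PricePerSqFt ~ Neighborhood,
                      data = neighborhood_data)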

if (t_test_sqft$p.value < 0.001) {
  sqft_p_value <- "< 0.001"
} else {
  sqft_p_value <- format(t_test_sqft$p.value, scientific = FALSE,
                         digits = 4)
}

sqft_conf_int <- sprintf("%s to %s", dollar(t_test_sqft$conf.int[1]),
                         dollar(t_test_sqft$conf.int[2]))
  • Price/SqFt by Neighborhood
    • P-Value: < 0.001
    • 95% Confidence Interval: $21.76 to $29.30

Interpreting Example P-Value

  • P-Value: < 0.001
    • Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in mean sale prices between “North Ames” and “Old Town”.
  • 95% Confidence Interval: $14,728.99 to $27,481.93
    • This interval represents the estimated range for the true difference in mean sale prices between the neighborhoods.
    • Interpretation: We are 95% confident that the true difference in mean sale prices falls between $14,728.99 and $27,481.93. This suggests that “North Ames” homes are, on average, more expensive than “Old Town” homes by an amount within this range.
  • Key Takeaway:
    • The statistical significance (low p-value) indicates that a difference this large would be very unlikely if the two neighborhoods had the same mean sale price.
    • The confidence interval provides a practical estimate of this difference. Because the interval excludes zero, the difference is statistically significant; because even its lower bound is roughly 10% of the mean sale price in either neighborhood, the difference is also meaningful in practice.
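
Practical significance is a judgment about magnitude, which the p-value alone does not provide. One hedged way to gauge it is to express the difference relative to typical prices and as a standardized effect size. The sketch below uses the rounded summary statistics from the data preview; Cohen's d is a measure introduced here for illustration and is not part of the original analysis.

```r
# Rounded summary statistics copied from the data preview table
mean_na <- 145097; sd_na <- 31883; n_na <- 443   # North Ames
mean_ot <- 123992; sd_ot <- 44327; n_ot <- 239   # Old Town

diff_means <- mean_na - mean_ot

# Difference expressed as a percentage of the Old Town mean price
pct_of_old_town <- 100 * diff_means / mean_ot

# Cohen's d: difference in means divided by the pooled standard deviation
pooled_sd <- sqrt(((n_na - 1) * sd_na^2 + (n_ot - 1) * sd_ot^2) /
                    (n_na + n_ot - 2))
cohens_d <- diff_means / pooled_sd

round(c(pct_of_old_town = pct_of_old_town, cohens_d = cohens_d), 2)
```

Reporting a relative difference or effect size alongside the p-value makes it harder to mistake a statistically detectable but trivial difference for an important one.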

Example of Intentional Misuse

  • The p-value is artificially inflated to suggest that there is no relationship between neighborhood and sale price.
    • This is done by filtering out the lower-priced houses from Old Town and the higher-priced houses from North Ames.
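
The slides do not show the filter that produced the manipulated subset, so the following is only a sketch of the general technique. It uses simulated prices (normal draws with means and standard deviations borrowed from the data preview, which is an assumption about shape) and hypothetical cutoffs `max_north_ames` and `min_old_town`; its numbers will not match the reported results.

```r
set.seed(123)

# Simulated sale prices; normality is an assumption made for illustration
sim_data <- data.frame(
  Neighborhood = rep(c("North Ames", "Old Town"), times = c(443, 239)),
  SalePrice = c(rnorm(443, mean = 145097, sd = 31883),
                rnorm(239, mean = 123992, sd = 44327))
)

# Honest analysis: all simulated observations
full_test <- t.test(SalePrice ~ Neighborhood, data = sim_data)

# Hypothetical cutoffs chosen only to pull the two group means together
max_north_ames <- 170000  # drop the priciest North Ames homes
min_old_town <- 80000     # drop the cheapest Old Town homes

keep <- with(sim_data,
  (Neighborhood == "North Ames" & SalePrice <= max_north_ames) |
  (Neighborhood == "Old Town" & SalePrice >= min_old_town))
sim_hacked <- sim_data[keep, ]

# Same test on the cherry-picked subset
hacked_test <- t.test(SalePrice ~ Neighborhood, data = sim_hacked)

c(full = full_test$p.value, hacked = hacked_test$p.value)
```

Nothing about the test itself changes between the two calls; only the rows fed into it do, which is what makes this kind of manipulation hard to spot from the reported p-value alone.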

Misuse Results

# Run t-test on manipulated subset
p_hacked_test <- t.test(SalePrice ~ Neighborhood,
                        data = p_hacked_data)

# Display p-value and confidence interval for the manipulated test
p_hacked_p_value <- ifelse(p_hacked_test$p.value < 0.001,
                           "< 0.001", format(p_hacked_test$p.value,
                                  scientific = FALSE, digits = 4))
p_hacked_conf_int <- sprintf("%s to %s",
                             dollar(p_hacked_test$conf.int[1]),
                             dollar(p_hacked_test$conf.int[2]))

P-Value: 0.4704

95% Confidence Interval: -$17,516.96 to $8,195.99

Comparing Misuse to Proper Use

  • Proper Use:
    • P-Value for Sale Price: < 0.001
    • 95% Confidence Interval: $14,728.99 to $27,481.93
    • Conclusion: A statistically significant difference in mean sale prices exists between “North Ames” and “Old Town.”
  • Misuse (P-Hacking):
    • P-Value for Sale Price: 0.4704
    • 95% Confidence Interval: -$17,516.96 to $8,195.99
    • Conclusion: Selective data manipulation falsely shows no significant difference between neighborhoods.

Key Takeaway:

  • P-hacking demonstrates how results can be skewed to misrepresent reality.

  • Proper statistical practices ensure reliable and ethical interpretations of data.

Conclusion

  • P-Value:
    • A critical tool for hypothesis testing, but it must be interpreted correctly and ethically.
  • Proper Use:
    • Demonstrates meaningful differences in sale price and price per square foot between neighborhoods.
  • Misuse:
    • Shows the dangers of selective data manipulation (p-hacking), which can hide a real difference.
  • Key Message:
    • Always evaluate p-values within the context of complete data and avoid practices that are unethical.

References