00 - Introduction to Week #7 Data Dive

This week’s data dive focuses on hypothesis testing. Specifically, it tests two separate hypotheses on the bank marketing dataset and builds two visualizations, one for each hypothesis. More information about the dataset can be found here.

Hypothesis #1 will apply the Neyman-Pearson framework, identify the sample size needed to support a defined significance level, power level, and delta value, apply a t-test to determine whether the null hypothesis can be rejected, and provide a visualization as supporting evidence.
Hypothesis #2 will apply Fisher’s significance testing framework, outline its steps, perform a test, and compute an effect size to reach a well-supported conclusion.

Along the way, I will explain what insight has been gathered, its significance, and any further questions that may need to be evaluated.

# Declare libraries
library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.1
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(effsize)

setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")

## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 - Neyman-Pearson Framework to Test Significance in Balance Based on Education

The Null Hypothesis, Visualized

For our first null hypothesis, we are interested in whether average annual balance differs between education groups. Since the purpose of this lab is to focus on A/B testing, we will only compare the secondary and tertiary education groups.

We can specify this hypothesis as follows:

$H_0$ = The average annual balance does not differ between bank clients who have attained up to a secondary level of education and those who attained up to a tertiary level of education.

$H_0: bal_e = bal_t$, or equivalently $|bal_e - bal_t| = 0$

$H_a : bal_e \neq bal_t$

Where $H_0$ represents the null hypothesis, $H_a$ the alternative hypothesis, $bal_e$ the mean balance for bank clients who have attained up to a secondary level of education, and $bal_t$ the mean balance for bank clients who have attained up to a tertiary level of education.

We can construct a rudimentary box and whisker plot to get some surface-level insight.

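For the purposes of this class, and since I don’t want to deal with such extreme values, I will go ahead and remove all outliers using an IQR-based rule, e.g., dropping values more than one IQR below the 25th percentile or above the 75th percentile. (The conventional box plot rule uses 1.5 times the IQR; the code below uses a multiplier of 1, which trims slightly more.)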

# Box and Whisker of Education against Balance
bal_iqr <- IQR(bank_marketing$balance)
bot_bal_whisk = quantile(bank_marketing$balance, 0.25) - bal_iqr
top_bal_whisk = quantile(bank_marketing$balance, 0.75) + bal_iqr

bank_marketing_iqr <- bank_marketing |>
   filter((education == "secondary" | education == "tertiary")
         & balance >= bot_bal_whisk 
         & balance <= top_bal_whisk)

label_data <- bank_marketing_iqr |>
  group_by(education) |>
  summarize(
    med_balance = median(balance),
    mean_balance = mean(balance),
    n = n()
  )

ggplot(bank_marketing_iqr, aes(x = education, y = balance)) +
  geom_boxplot() +
  geom_label(
    data = label_data,
    aes(
      x = education,
      y = -1000,
      label = paste0(
        "Median = ", round(med_balance, 0), "\n",
        "Mean = ", round(mean_balance, 0), "\n",
        "n = ", n
      )
    ),
    vjust = 0,
    size = 4
  ) +
  theme_classic()

This is a bit easier on the eyes than including all of the extreme values that would be present outside the whiskers, and we can see that there doesn’t seem to be as much of a difference between the groups’ IQRs as we may expect, e.g., a median of 302 vs a median of 383, about a 26% difference, and then about a 20% difference in the mean values, which is what we are testing in our hypothesis test.

Potential Error: Perhaps this is because we cut the IQR based on the entire dataset instead of the IQR for each of the groups–which may be different per our null hypothesis. It’s entirely possible that the IQR for bank clients with a tertiary education level is higher and by cutting based on the IQR of the dataset as a whole (which also includes education attainment categories ‘primary’ and ‘unknown,’) we may be making these groups seem closer than they otherwise would be.

Let’s go ahead and adjust accordingly so that the cutoffs are computed from each education group’s own quartiles and IQR.

# Box and Whisker of Education Against Balance Adjusted IQR Within Education Groups
bm_secondary <- bank_marketing |>
  filter(education == "secondary")

bm_tertiary <- bank_marketing |>
  filter(education == "tertiary")

bm_sec_iqr = IQR(bm_secondary$balance)
bm_sec_bot_bal_whisk = quantile(bm_secondary$balance, 0.25) - bm_sec_iqr
bm_sec_top_bal_whisk = quantile(bm_secondary$balance, 0.75) + bm_sec_iqr

bm_tert_iqr = IQR(bm_tertiary$balance)
bm_tert_bot_bal_whisk = quantile(bm_tertiary$balance, 0.25) - bm_tert_iqr
bm_tert_top_bal_whisk = quantile(bm_tertiary$balance, 0.75) + bm_tert_iqr

bm_sec_final <- bm_secondary |>
  filter(balance >= bm_sec_bot_bal_whisk 
         & balance <= bm_sec_top_bal_whisk)

bm_tert_final <- bm_tertiary |>
  filter(balance >= bm_tert_bot_bal_whisk
         & balance <= bm_tert_top_bal_whisk)

bm_final <- bind_rows(bm_sec_final, bm_tert_final)

label_data <- bm_final |>
  group_by(education) |>
  summarize(
    med_balance = median(balance),
    mean_balance = mean(balance),
    n = n()
  )

ggplot(bm_final, aes(x = education, y = balance)) +
  geom_boxplot() +
  geom_label(
    data = label_data,
    aes(
      x = education,
      y = -1000,
      label = paste0(
        "Median = ", round(med_balance, 0), "\n",
        "Mean = ", round(mean_balance, 0), "\n",
        "n = ", n
      )
    ),
    vjust = 0,
    size = 4
  ) +
  theme_classic()

Now we can see a slightly larger difference in the median values and IQRs shown in the box plot, and in the mean values observed within each group. This was a good catch, and it also highlights an interesting facet of outlier detection from the previous lesson.
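The per-group trimming above can also be sketched more compactly with a grouped filter. This is only an illustration under stated assumptions: `within_fences()` is a hypothetical helper (not part of the lab code), `toy` is made-up data standing in for `bank_marketing`, and `k = 1` mirrors the one-IQR multiplier used in this notebook.

```r
library(dplyr)

# Hypothetical helper: TRUE for values within k IQRs of the quartiles
within_fences <- function(x, k = 1) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]                      # the interquartile range
  x >= q[1] - k * spread & x <= q[2] + k * spread
}

# Made-up data standing in for bank_marketing (assumption, not real records)
toy <- data.frame(
  education = rep(c("secondary", "tertiary"), each = 6),
  balance   = c(0, 100, 200, 300, 400, 9000,
                0, 500, 1000, 1500, 2000, 50000)
)

# group_by() makes the fences use each group's own quartiles
toy_trimmed <- toy |>
  group_by(education) |>
  filter(within_fences(balance, k = 1)) |>
  ungroup()

toy_trimmed
```

Because the helper is evaluated once per group, adding a third education level would not require any extra intermediate data frames.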

Bootstrapping Around Null Hypothesis for Differences in Mean

Now let’s go ahead and quantify the difference in means, which is about $252.

ed_bal_diff <- mean(bm_tert_final$balance) - mean(bm_sec_final$balance)
ed_bal_diff

## [1] 252.1785

This number is meaningless without knowing the standard deviation / variation present within the data.
If the balance most clients carry is exceptionally large with the bank – then $252 worth of deviation among the means of each group may be practically $0.
Of course this is not what we observe – as we will discover further.

To explore further, why don’t we copy some of the components from the notebook and look at what a normal distribution of the differences in balances would look like if the null hypothesis were true, e.g., the mean difference was truly $0.

The notebook describes this as follows: the sampling distribution always assumes the conditions of the null hypothesis.

To accomplish this, we will compute a sampling distribution centered around 0 with a standard error represented by the standard deviation of the differences in bootstrapped averages. This helps us provide meaning to that $252 number.

# the same bootstrapping function from lab_06
bootstrap <- function (x, func=mean, n_iter=10^4) {
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)
  
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull the sample (a vector)
    x_sample <- sample(x, size = length(x), replace = TRUE)
    
    # add on this iteration's value to the collection
    func_values <- c(func_values, func(x_sample))
  }
  
  return(func_values)
}

# Bootstrap the averages within each category
avgs_secondary <- bm_sec_final |>
  pluck("balance") |>
  bootstrap(n_iter = 100)

avgs_tertiary <- bm_tert_final |>
  pluck("balance") |>
  bootstrap(n_iter = 100)

# Construct a 'vector' containing all differences
diff_in_avgs = avgs_tertiary - avgs_secondary

# Find Standard Deviation ~ standard error of difference in averages
se_bal_diff <- sd(diff_in_avgs)

ggplot() +
  geom_function(xlim = c(-1000, 1000), 
                fun = function(x) dnorm(x, mean = 0, 
                                        sd = se_bal_diff)) +
  geom_vline(mapping = aes(xintercept = ed_bal_diff,
                           color = paste("observed: ",
                                         round(ed_bal_diff)))) +
  labs(title = "Bootstrapped Sampling Distribution of Balance Differences",
       x = "Difference in Balance Calculated",
       y = "Probability Density",
       color = "") +
  theme_classic()

Here we can see that our observed difference between the two means falls far outside the bell curve / normal distribution we would expect if the null hypothesis were true. This provides visual evidence that our observed difference in means of 252 may be significant.
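As a rough cross-check on the bootstrapped standard error, the same quantity can be approximated analytically. The sketch below plugs in the group variances and sample sizes that are reported in the summary output further down in this notebook; the variable names are my own, and the result should come out at roughly 9.35.

```r
# Summary statistics copied from the trimmed groups reported in this notebook
var_sec  <- 390598; n_sec  <- 20063   # secondary: variance of balance, sample size
var_tert <- 779903; n_tert <- 11480   # tertiary: variance of balance, sample size
obs_diff <- 252.1785                  # observed difference in means

# Analytic standard error of a difference between two independent means
se_analytic <- sqrt(var_sec / n_sec + var_tert / n_tert)

# Number of standard errors between the observed difference and zero
se_ratio <- obs_diff / se_analytic

c(se = se_analytic, ratio = se_ratio)
```

A ratio of about 27 standard errors is another way of seeing why the observed line sits so far from the null curve.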

Applying the Neyman-Pearson Framework

We’ve seen some visual evidence that the null hypothesis should likely be rejected, so let’s go ahead and apply the Neyman-Pearson framework by specifying a significance level, power level, and minimum difference of interest to formally decide whether to reject the null hypothesis.

Note – we’ve already devised a null hypothesis and the distribution of the random variable represented in our null hypothesis.

We’ll set the significance level $\alpha$ (the false positive, or Type I error, rate) to 0.01, since we can see above just how different the means seem to be from an assumed null hypothesis of zero difference in means.
We’ll set the false negative (Type II error) rate $\beta$ to 0.01, which corresponds to a power level of $1 - \beta = 0.99$, again since the difference seems very large. This means that if the true difference in means is $\Delta$, there is only a 1% chance that the test fails to detect it.
We’ll say that a delta $\Delta$ is around 250 given that is the observed difference in means between the two groups.

With our framework outlined, we’ll go ahead and identify the sample size a t-test needs to achieve the desired significance level and power. I will first need to identify the ratio between my sample sizes for the secondary and tertiary education levels to fill in as the ‘kappa’ value.

# Computing Kappa Value for 2 Means T-Test
bm_final |>
  count(education) |>
  summarise(ratio = n[education == "secondary"] / n[education == "tertiary"])

## # A tibble: 1 × 1
##   ratio
##   <dbl>
## 1  1.75

# Computing T-Test with Desired Values
test <- pwrss.t.2means(mu1 = mean(filter(bm_final, education == "secondary")$balance),
                       sd1 = sd(filter(bm_final, education == "secondary")$balance),
                       mu2 = mean(filter(bm_final, education == "tertiary")$balance),
                       sd2 = sd(filter(bm_final, education == "tertiary")$balance),
                       kappa = 1.747,
                       power = .99, alpha = 0.01, 
                       alternative = "not equal")

## +--------------------------------------------------+
## |             SAMPLE SIZE CALCULATION              |
## +--------------------------------------------------+
## 
## Welch's T-Test (Independent Samples)
## 
## ---------------------------------------------------
## Hypotheses
## ---------------------------------------------------
##   H0 (Null Claim) : d - null.d = 0 
##   H1 (Alt. Claim) : d - null.d != 0 
## 
## ---------------------------------------------------
## Results
## ---------------------------------------------------
##   Sample Size            = 667 and 382  <<
##   Type 1 Error (alpha)   = 0.010
##   Type 2 Error (beta)    = 0.010
##   Statistical Power      = 0.99

plot(test)

We would need at least 667 observations in the secondary-education group and 382 observations in the tertiary-education group for the two-sample t-test to have the specified power to detect a true difference in means of 250 at the chosen significance level.

We satisfy both of these requirements per the count below.

# Count based on Education
bm_final |>
  group_by(education) |>
  count()

## # A tibble: 2 × 2
## # Groups:   education [2]
##   education     n
##   <chr>     <int>
## 1 secondary 20063
## 2 tertiary  11480
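As a sanity check on the sample sizes reported by pwrss earlier, here is a minimal sketch of the textbook normal-approximation formula for two means with unequal variances. It is not the exact method pwrss uses (which works with the t distribution), so it should land a few observations below 667 and 382 rather than match them. The inputs are copied from figures reported in this notebook, and the variable names are my own.

```r
# Inputs taken from figures reported in this notebook
alpha    <- 0.01                  # significance level
power    <- 0.99                  # 1 - beta
kappa    <- 1.747                 # n_secondary / n_tertiary
var_sec  <- 390598                # variance of balance, secondary group
var_tert <- 779903                # variance of balance, tertiary group
delta    <- 727.5878 - 475.4093   # difference in means to detect

# With n_sec = kappa * n_tert, Var(diff) = (var_sec / kappa + var_tert) / n_tert.
# Requiring delta / sqrt(Var(diff)) = z_(1 - alpha/2) + z_(power) gives:
z_sum  <- qnorm(1 - alpha / 2) + qnorm(power)
n_tert <- z_sum^2 * (var_sec / kappa + var_tert) / delta^2
n_sec  <- kappa * n_tert

ceiling(c(secondary = n_sec, tertiary = n_tert))
```

Either way, the required sizes are a small fraction of the 20,063 and 11,480 observations actually available.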

Now let’s compute the t-test mathematically. We’ll apply Welch’s t-test to determine whether the null hypothesis can be rejected. We’re applying Welch’s test because:

There are differences in the variance between groups.
We are computing the difference in means between two independent samples.

We’ll specify a two-tailed approach since the balance could go either direction – e.g., positive or negative.

The formula for Welch’s t-test statistic is: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

Where:

$\bar x_1$, $\bar x_2$ are the sample means of group 1 and group 2
$s^2_1$, $s^2_2$ are the sample variances of their respective groups
$n_1$, $n_2$ are the sample sizes

# Calculating the T-Value
bm_final |>
  summarise(
    x_1 = mean(balance[education == "secondary"]),
    x_2 = mean(balance[education == "tertiary"]),
    s_1 = var(balance[education == "secondary"]),
    s_2 = var(balance[education == "tertiary"]),
    n_1 = sum(education == "secondary"),
    n_2 = sum(education == "tertiary"),
    t = (x_1 - x_2) / sqrt((s_1)/n_1 + (s_2)/n_2)
  )

## # A tibble: 1 × 7
##     x_1   x_2     s_1     s_2   n_1   n_2     t
##   <dbl> <dbl>   <dbl>   <dbl> <int> <int> <dbl>
## 1  475.  728. 390598. 779903. 20063 11480 -27.0

Here, the test statistic is $t = -26.97$. For a two sided hypothesis test at the 1% significance level, the critical values are approximately $\pm 2.58$ when the degrees of freedom are large. Therefore, any test statistic between $-2.58$ and $+2.58$ would fail to reject the null hypothesis. Because our observed t-value lies far outside this range, we can reject the null hypothesis. This provides strong evidence that the mean balance differs between clients with secondary education and those with tertiary education.

R’s built-in t.test() confirms this result below.

# Validating the T-Value
t.test(balance ~ education, data = bm_final, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  balance by education
## t = -26.974, df = 18148, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group secondary and group tertiary is not equal to 0
## 95 percent confidence interval:
##  -270.5035 -233.8536
## sample estimates:
## mean in group secondary  mean in group tertiary 
##                475.4093                727.5878
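The degrees of freedom in the output above (df = 18148) come from the Welch-Satterthwaite approximation rather than from n1 + n2 - 2. The sketch below reproduces that figure, along with the critical value and the full two-sided p-value, from the rounded summary statistics reported earlier; the variable names are my own, and small rounding differences from the t.test() output are expected.

```r
# Rounded summary statistics from the manual t calculation above
var_sec  <- 390598; n_sec  <- 20063
var_tert <- 779903; n_tert <- 11480
t_obs    <- -26.974

# Per-group contributions to the variance of the difference in means
a <- var_sec / n_sec
b <- var_tert / n_tert

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (a + b)^2 / (a^2 / (n_sec - 1) + b^2 / (n_tert - 1))

# Two-sided critical value at alpha = 0.01 and the unabbreviated p-value
crit        <- qt(1 - 0.01 / 2, df = df_welch)
p_two_sided <- 2 * pt(-abs(t_obs), df = df_welch)

c(df = df_welch, critical = crit, p = p_two_sided)
```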

We can also see from the t-distribution below where our t-value lies, and that it is well beyond the range of values that would fail to reject the null hypothesis at the chosen significance level. Recall that the t-distribution represents the distribution of the test statistic under the null hypothesis, assuming that the null hypothesis is true.

# Key Values from test above
t_value <- -26.97
df_val  <- 18148
alpha   <- 0.01

# Calculate the critical value based on df
crit <- qt(1 - alpha / 2, df = df_val) # two tailed

# Sets length of distribution from -30 to +30 and presents 2000 evenly spaced values for smoothness
curve_data <- data.frame(
  x = seq(-30, 30, length.out = 2000)
)

# Compute height / density of the t-distribution at each x value
curve_data$y <- dt(curve_data$x, df = df_val)

ggplot(curve_data, aes(x = x, y = y)) +
  geom_line() +
  geom_vline(xintercept = t_value, linetype = "dashed", color = "red") +
  geom_vline(xintercept = c(-crit, crit), linetype = "dotted") +
  annotate("text", x = t_value, y = max(curve_data$y) * 0.9,
           label = paste("t =", t_value), hjust = ifelse(t_value < 0, 1.1, -0.1)) +
  annotate("text", x = -crit, y = max(curve_data$y) * 0.7,
           label = paste("critical =", round(-crit, 2)), hjust = 1.1) +
  annotate("text", x = crit, y = max(curve_data$y) * 0.7,
           label = paste("critical =", round(crit, 2)), hjust = -0.1) +
  labs(
    title = "t Distribution with Observed Test Statistic",
    x = "t",
    y = "Density"
  ) +
  theme_classic()

Statistical significance alone is not enough, though. As mentioned in the class notebook, significance reporting should always be accompanied by reporting on effect size and the size of the sample being tested. Let’s go ahead and compute effect size to determine whether this significant difference is of any practical value.

Computing Effect Size Using Cohen’s D

Effect size expresses the magnitude of the difference detected, i.e., how far the data sit from what the null hypothesis assumes. Per the class notebook there are multiple measures, but Cohen’s d is the simplest: it scales the difference in the two sample means by their standard deviation.

This almost feels like thinking about how a difference of about 250 in means equates to roughly 0.4 of the standard deviation of balance in the secondary group (about 625) and roughly 0.29 in the tertiary group (about 883).

Cohen’s d formalizes this by dividing the difference by a pooled standard deviation, where absolute values less than 0.2 are ‘negligible’ and values greater than 0.8 are ‘large’. I apply the calculation below by (1) computing it manually from the formula and (2) using the statistical package.

# Cohen's D Calculated Manually
bm_final |>
  summarise(
    x_1 = mean(balance[education == "secondary"]),
    x_2 = mean(balance[education == "tertiary"]),
    s_1 = var(balance[education == "secondary"]),
    s_2 = var(balance[education == "tertiary"]),
    n_1 = sum(education == "secondary"),
    n_2 = sum(education == "tertiary"),
    cohens_d = (x_1 - x_2) / sqrt(
      ((n_1 - 1) * s_1 + (n_2 - 1) * s_2) / ((n_1 - 1) + (n_2 - 1))
    )
  )

## # A tibble: 1 × 7
##     x_1   x_2     s_1     s_2   n_1   n_2 cohens_d
##   <dbl> <dbl>   <dbl>   <dbl> <int> <int>    <dbl>
## 1  475.  728. 390598. 779903. 20063 11480   -0.346

# Cohen's D by Using Statistical Package
cohen.d(d = filter(bm_final, education == "secondary") |> pluck("balance"),
        f = filter(bm_final, education == "tertiary") |> pluck("balance"))

## 
## Cohen's d
## 
## d estimate: -0.3456507 (small)
## 95 percent confidence interval:
##      lower      upper 
## -0.3687464 -0.3225551

Here we can see that we get the same result: a Cohen’s d of about -0.35, with a 95% confidence interval from -0.369 to -0.323, or roughly a third of a standard deviation. By the conventional thresholds this is a small effect (between 0.2 and 0.5 in absolute value), which matches the ‘small’ label in the package output.

Concluding on Neyman-Pearson Hypothesis

We can clearly reject the null hypothesis that there is no difference in mean balance between secondary and tertiary educational attainment, based on a difference in means of $252, a t value of -26.97, an effect size of -0.35, and an adequate sample size of over n = 31,000 at our specified alpha of 0.01 and power level of 0.99. Essentially, the average annual balance of the bank client group having obtained a secondary education is about a third of a standard deviation less than that of the group having obtained a higher, tertiary level of education. In effect, those who have more education have more money on hand in their bank accounts, which tracks logically.

02 - Fisher’s Significance Testing Framework

Fisher’s significance testing framework is all about the p-value, e.g., the probability that the test statistic or data observed is as extreme or more extreme than the value currently observed, assuming the null hypothesis is correct. Let’s go ahead and apply Fisher’s significance framework by testing a different hypothesis with our data.

$H_0$ = The proportion of bank clients who subscribe to a term deposit does not differ between bank clients who have obtained up to a secondary level of education as opposed to those who have obtained up to a tertiary level of education.

$H_a$ = The proportion of bank clients who subscribe to a term deposit does differ between bank clients who have obtained up to a secondary level of education as opposed to those who have obtained up to a tertiary level of education.

In this case, we are getting at the purpose of this dataset: to identify what factors may help explain success in our bank marketing campaign based on the observed population. We will be focusing on comparing bank clients who have obtained up to a secondary level of education to those who have obtained up to a tertiary level of education, as done in our prior Neyman-Pearson framework application. We will be looking at the proportion of this population that subscribed to a term deposit, defined by the variable ‘y’ and indicated by a ‘yes’ in the bank marketing dataset.

We can begin to see some preliminary evidence below:

Of the 23,202 bank clients with a secondary education level, 2,450 subscribed to a term deposit for a proportional rate of 10.6%.
Of the 13,301 bank clients with a tertiary education level, 1,996 subscribed to a term deposit for a proportional rate of 15%.

# Obtaining Number of Clients in Each Group 
bank_marketing |>
  filter(education == "secondary" | education == "tertiary") |>
  mutate(success = if_else(y == "yes", 1, 0)) |>
  group_by(education) |>
  summarize(
    successes = sum(success),
    trials = n(),
    prop = successes/trials
  )

## # A tibble: 2 × 4
##   education successes trials  prop
##   <chr>         <dbl>  <int> <dbl>
## 1 secondary      2450  23202 0.106
## 2 tertiary       1996  13301 0.150

Setting Parameters of the Test

To apply Fisher’s test, we will need to do the following:

identify a null and alternative hypothesis (done above);
specify an appropriate p-value threshold for the test;
choose a test statistic to test the hypothesis;
perform the test;
compute effect size; and
interpret the results

For the purposes of this test, since our sample size is so large (n = 36,503 across the two education groups), we’ll specify a significance threshold of 0.01 for the p-value, meaning that if the null hypothesis were true, a test statistic at least as extreme as the one we observed would need to occur less than 1% of the time for us to reject it. We specify this 1% value because our sample size is so large that small deviations may produce statistical significance, which is also why it is important to report on effect size.

This will also be a two-tailed test, meaning that the 0.01 is split across both tails (0.005 in each) and the two-sided p-value is compared against 0.01. We specify two-tailed since the difference in proportions could go in either direction, negative or positive.

We’ll also apply the normal test of equal proportions under the null hypothesis $H_0$ : $|p_1 - p_2| = 0$

Where $p_1$ is our proportion of bank clients who subscribed to a term deposit that obtained up to a secondary level of education
And $p_2$ our proportion of bank clients who subscribed to a term deposit that obtained up to a tertiary level of education.

This test applies the following formula:

\[ z=\frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}(1 - \hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}},\quad \hat{p}=\frac{x_1 + x_2}{n_1 + n_2} \]

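Where z is our test statistic, computed by taking the difference of the two sample proportions and dividing it by its standard error: the square root of the pooled variance term $\hat{p}(1 - \hat{p})$ multiplied by $\frac{1}{n_1} + \frac{1}{n_2}$. Here $x_1$ and $x_2$ are the numbers of subscribers in each group and $n_1$ and $n_2$ are the group sizes.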

We could also technically use a t-test here as well and calculate the mean of the subscribed value and test to determine whether there is a significant difference at a p-value of 0.01 or lower.

We’ll then compute an effect size by applying Cohen’s h, a variation of Cohen’s d designed to give a standardized effect size for proportions.

Applying Fisher’s Significance with the Normal Test of Equal Proportions

Let’s go ahead and write up the code to compute our normal test of equal proportions and then compare it to the official package from R.

# Manually Calculating Normal Test of Equal Proportion
bank_marketing |>
  filter(education == "secondary" | education == "tertiary") |>
  mutate(y_nbr = if_else(y == "yes", 1, 0)) |>
  summarise(
    x_1 = mean(y_nbr[education == "secondary"]),
    x_2 = mean(y_nbr[education == "tertiary"]),
    n_1 = sum(education == "secondary"),
    n_2 = sum(education == "tertiary"),
    p_hat = (x_1*n_1 + x_2*n_2) / (n_1 + n_2),
    p_1hat = sum(education == "secondary" & y_nbr == 1)
       / sum(education == "secondary"),
    p_2hat = sum(education == "tertiary" & y_nbr == 1)
       / sum(education == "tertiary"),
    z = (p_1hat - p_2hat)/sqrt(p_hat*(1-p_hat)*(1/n_1 + 1/n_2))
  )

## # A tibble: 1 × 8
##     x_1   x_2   n_1   n_2 p_hat p_1hat p_2hat     z
##   <dbl> <dbl> <int> <int> <dbl>  <dbl>  <dbl> <dbl>
## 1 0.106 0.150 23202 13301 0.122  0.106  0.150 -12.5

# Using R Package to calculate Normal Test of Equal Proportions
bm_trials <- bank_marketing |>
  filter(education == "secondary" | education == "tertiary") |>
  mutate(success = if_else(y == "yes", 1, 0)) |>
  group_by(education) |>
  summarize(
    successes = sum(success),
    trials = n()
  )

ptst <- prop.test(x = bm_trials$successes,
          n = bm_trials$trials,
          alternative = "two.sided",
          correct = FALSE)

ptst

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  bm_trials$successes out of bm_trials$trials
## X-squared = 156.3, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.05171338 -0.03722574
## sample estimates:
##    prop 1    prop 2 
## 0.1055943 0.1500639

Here we can see our z statistic of -12.5, which when squared gives approximately 156, matching the X-squared value from the prop test. Based on the sampling distribution of z, our p-value is far below the 0.01 threshold we specified above (0.005 in each tail, since it’s two-tailed). Additionally, if we plot our z value on the distribution, as shown below, we can see that it falls well within the ‘blue range’ for rejecting the null hypothesis, i.e., the tails beyond -2.576 and +2.576. Values of z between -2.576 and +2.576 would fail to reject. This indicates that we have evidence to reject the null hypothesis within Fisher’s framework.

# Building a Chart to Visualize the Hypothesis

z_obs <- -12.5

z_alpha <- 2.576

z_df <- data.frame(
  x = seq(-13, 13, by = 0.01)
)
z_df$y <- dnorm(z_df$x, mean = 0, sd = 1)

ggplot(z_df, aes(x = x, y = y)) +
  geom_line() +
  geom_area(
    data = subset(z_df, x <= -z_alpha),
    fill = "lightblue",
    alpha = 0.5
  ) +
  geom_area(
    data = subset(z_df, x >= z_alpha),
    fill = "lightblue",
    alpha = 0.5
  ) +
  geom_vline(xintercept = z_alpha, linetype = "dashed") +
  geom_vline(xintercept = -z_alpha, linetype = "dashed") +
  geom_vline(xintercept = z_obs, color = "red") +
  theme_classic()

Again, we can use this z distribution to calculate the probability of observing a value at least this extreme, assuming the null hypothesis is correct, by taking the area under the density function on or past the red line and multiplying it by 2, since this is a two-tailed test. I did try this, but there were some slight differences in values. This was partially caused by prop.test not displaying the full p-value because it is so small (it only reports p-value < 2.2e-16). There is also the Yates continuity correction, which would change the statistic, but it is switched off in the call above through correct = FALSE, so I am not going to spend too much time looking into it.
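The tail-area calculation described above can be sketched directly from the group counts reported earlier (2,450 of 23,202 and 1,996 of 13,301). This is a minimal illustration with variable names of my own choosing. It also shows that, without a continuity correction, the two-sided normal p-value and the chi-squared p-value on one degree of freedom are the same quantity.

```r
# Successes and trials per education group, from the summary table above
x <- c(secondary = 2450, tertiary = 1996)
n <- c(secondary = 23202, tertiary = 13301)

p_hat  <- x / n              # per-group sample proportions
p_pool <- sum(x) / sum(n)    # pooled proportion under the null

# z statistic for the normal test of equal proportions
z <- (p_hat[["secondary"]] - p_hat[["tertiary"]]) /
  sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Two-sided p-value from the normal tails ...
p_from_z <- 2 * pnorm(-abs(z))
# ... and from the chi-squared distribution with 1 df, using z squared
p_from_chi <- pchisq(z^2, df = 1, lower.tail = FALSE)

c(z = z, z_squared = z^2, p_from_z = p_from_z, p_from_chi = p_from_chi)
```

Using the unrounded z rather than -12.5 removes most of the small discrepancy that appears when the value is typed in by hand.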

Getting Effect Size to Make a Conclusion using Cohen’s H

While Cohen’s d computes effect size as the difference in means divided by the pooled standard deviation, Cohen’s h applies an arcsine transformation to each proportion before taking the difference, which standardizes how pairs of proportions may vary (source). I believe the interpretation guidelines are the same as for Cohen’s d, where about 0.2 is a small effect, about 0.5 is a medium effect, and 0.8 or more is a large effect. It is calculated as:

\[h = 2\left(\arcsin\sqrt{p_1} - \arcsin\sqrt{p_2}\right)\]

Let’s go ahead and compute:

# Calculating Effect Size Manually
bank_marketing |>
  filter(education == "secondary" | education == "tertiary") |>
  mutate(y_nbr = if_else(y == "yes", 1, 0)) |>
  summarise(
    p_1 = sum(education == "secondary" & y_nbr == 1) / sum(education == "secondary"),
    p_2 = sum(education == "tertiary" & y_nbr == 1) / sum(education == "tertiary"),
    h = 2 * (asin(sqrt(p_1)) - asin(sqrt(p_2)))
  )

## # A tibble: 1 × 3
##     p_1   p_2      h
##   <dbl> <dbl>  <dbl>
## 1 0.106 0.150 -0.134

This is rather interesting: the effect size is quite small at -0.13, below the 0.2 guideline for a small effect, despite a difference of 0.044, or about 4.4 percentage points. Thus, in conclusion, we can reject the null hypothesis at our chosen level of significance, but we observe a very small effect size for the difference in proportions.

week-7-data-dives

2026-04-03