library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggrepel)
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
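The startup messages above note that dplyr masks stats::filter() and stats::lag(), and that plotly and pwrss mask further functions. If a masked function is ever needed, the conflicted package mentioned in the tidyverse message can pin a preferred version; a minimal, optional sketch (not part of the original setup):
library(conflicted)
# Always resolve filter() and lag() to the dplyr versions
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::lag)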
data_frame <- read.csv('C:/Users/prera/OneDrive/Desktop/INFO-I590/bank-full2.csv', header = TRUE, sep = ",")
summary(data_frame)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
1 - age
2 - job
3 - marital: marital status
4 - education
5 - default: has credit in default?
6 - balance: average yearly balance, in euros
7 - housing: has housing loan?
8 - loan: has personal loan?
9 - contact: contact communication type
10 - day: last contact day of the month
11 - month: last contact month of year
12 - duration: last contact duration, in seconds
13 - campaign: number of contacts performed during this campaign and for this client
14 - pdays: number of days that passed since the client was last contacted during a previous campaign
15 - previous: number of contacts performed before this campaign and for this client
16 - poutcome: outcome of the previous marketing campaign
17 - y: has the client subscribed to a term deposit?
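The summary above stores every categorical column as character. Optionally converting those columns to factors would make summary() and table() report level counts; a minimal sketch (not part of the original run, which keeps them as character):
# Convert all character columns to factors so summaries show level counts
data_frame <- data_frame |>
  mutate(across(where(is.character), as.factor))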
\[ H_0: \text{Average balance remains the same for clients having a housing loan and clients not having a housing loan.} \]
The significance level α represents the probability of making a Type I error. A smaller α is more stringent and requires stronger evidence to reject the null hypothesis: if the impact of a false positive is high, we should choose a smaller α to minimize that risk, and with a large sample we can afford a smaller α.
Since the dataset is large and a client's housing-loan status and account balance are important to the bank, I set α to 3%, i.e. α = 0.03.
Power is a study's ability to detect a true difference. It is denoted 1 - β, where β is the probability of failing to reject the null hypothesis when it is false; the higher the power, the better the ability to detect effects. I set the power to 0.85.
# Cohen's d for balance: clients without a housing loan vs. clients with one
effect_size <- cohen.d(d = filter(data_frame, housing == 'no') |> pluck("balance"),
                       f = filter(data_frame, housing == 'yes') |> pluck("balance"))
effect_size
##
## Cohen's d
##
## d estimate: 0.1387277 (negligible)
## 95 percent confidence interval:
## lower upper
## 0.1201536 0.1573019
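As a sanity check, Cohen's d can be reproduced by hand from its definition, d = (mean1 - mean2) / s_pooled. A minimal sketch on the same two balance vectors (the bal_no/bal_yes names are mine):
bal_no  <- filter(data_frame, housing == 'no')  |> pluck("balance")
bal_yes <- filter(data_frame, housing == 'yes') |> pluck("balance")
# Pooled standard deviation, weighted by each group's degrees of freedom
s_pooled <- sqrt(((length(bal_no) - 1) * var(bal_no) +
                  (length(bal_yes) - 1) * var(bal_yes)) /
                 (length(bal_no) + length(bal_yes) - 2))
(mean(bal_no) - mean(bal_yes)) / s_pooled  # should be close to 0.1387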
# Power analysis: Cohen's d as the standardized mean difference (sd1 = 1)
test <- pwrss.t.2means(mu1 = 0.1387277,
                       sd1 = 1,
                       kappa = 1,
                       power = .85, alpha = 0.03,
                       alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.85
## n1 = 1070
## n2 = 1070
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 2138
## Non-centrality parameter = 3.209
## Type I error rate = 0.03
## Type II error rate = 0.15
plot(test)
## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'
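The required n can be cross-checked against base R's power routine; note that pwrss masks power.t.test, so the stats:: prefix is needed. This is only a verification sketch:
# Same inputs as pwrss.t.2means above; n is reported per group
stats::power.t.test(delta = 0.1387277, sd = 1,
                    sig.level = 0.03, power = 0.85,
                    type = "two.sample", alternative = "two.sided")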
From the above calculation, the required sample size is 1070 per group. Now summarizing the data in each category:
avg_balances <- data_frame |>
  filter(!is.na(housing)) |>
  group_by(housing) |>
  summarise(avg_balance = mean(balance, na.rm = TRUE),
            max_balance = max(balance, na.rm = TRUE),
            size = n())
avg_balances
## # A tibble: 2 × 4
## housing avg_balance max_balance size
## <chr> <dbl> <int> <int>
## 1 no 1597. 102127 20081
## 2 yes 1175. 58544 25130
Since each category contains far more observations than the required sample size, I can carry out a Neyman-Pearson hypothesis test.
\[ H_A: \text{Average balance does not remain the same for the clients having a housing loan or not having a housing loan.} \]
# avg_balances$avg_balance[1] - not having a housing loan
# avg_balances$avg_balance[2] - having a housing loan
observed_diff <- (avg_balances$avg_balance[1] - avg_balances$avg_balance[2])
paste("Observed Difference: ", observed_diff)
## [1] "Observed Difference: 421.398205790226"
bootstrap <- function(x, func = mean, n_iter = 10^4) {
  func_values <- numeric(n_iter)
  for (i in 1:n_iter) {
    # Resample whole rows of the data frame with replacement
    # (nrow(), not length(): length() of a data frame is its column count)
    x_sample <- sample_n(x, size = nrow(x), replace = TRUE)
    func_values[i] <- func(x_sample)
  }
  return(func_values)
}
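A quick way to sanity-check the helper is to bootstrap a simple statistic first, here the standard error of the overall mean balance (a small illustrative run; the se_mean_balance name and the iteration count are mine):
# Bootstrap standard error of the overall mean balance
se_mean_balance <- sd(bootstrap(data_frame,
                                func = function(df) mean(df$balance),
                                n_iter = 200))
se_mean_balance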
diff_in_avg <- function (x_data) {
avg_balances <- x_data |>
group_by(housing) |>
summarize(avg_balance = mean(balance)) |>
arrange(housing)
#not having a housing loan - having a housing loan
diff <- (avg_balances$avg_balance[1] - avg_balances$avg_balance[2])
return(diff)
}
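Applied to the full data, diff_in_avg() should reproduce the observed difference computed earlier:
diff_in_avg(data_frame)  # expected to match observed_diff, about 421.4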
diffs_in_balance_avgs <- bootstrap(data_frame, diff_in_avg, n_iter = 500)
ggplot() +
geom_function(xlim = c(-500, 500),
fun = function(x) dnorm(x, mean = 0,
sd = sd(diffs_in_balance_avgs))) +
geom_vline(mapping = aes(xintercept = observed_diff,
color = paste("observed: ",
round(observed_diff)))) +
labs(title = "Bootstrapped Sampling Distribution of Balance Differences",
x = "Difference in Balance Calculated",
y = "Probability Density",
color = "") +
theme_minimal()
alpha <- 0.03 # Significance level
critical_value <- qnorm(1 - alpha/2) # Two-tailed test
critical_value
## [1] 2.17009
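Under this normal null model (mean 0, bootstrap SD), the observed difference can be standardized and compared against the critical value; a minimal sketch (z_stat is my name):
# Standardize the observed difference against the bootstrap null SD
z_stat <- observed_diff / sd(diffs_in_balance_avgs)
z_stat > critical_value   # TRUE means reject H0 at alpha = 0.03
2 * pnorm(-abs(z_stat))   # corresponding two-tailed p-value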
delta <- 0.1387277
f_0 <- function(x) dnorm(x, mean = 0)
f_a <- function(x) dnorm(x, mean = delta)
ggplot() +
stat_function(mapping = aes(fill = 'power'),
fun = f_a,
xlim = c(critical_value, 4),
geom = "area") +
stat_function(mapping = aes(fill = 'alpha'),
fun = f_0,
xlim = c(critical_value, 4),
geom = "area") +
geom_function(mapping = aes(color = 'Null Hypothesis'),
xlim = c(-4, 4),
fun = f_0) +
geom_function(mapping = aes(color = 'Alternative Hypothesis'),
xlim = c(-4, 4),
fun = f_a) +
geom_vline(mapping = aes(xintercept = critical_value,
color = "Critical Value")) +
geom_vline(mapping = aes(xintercept = delta,
color = "Delta")) +
geom_vline(mapping = aes(xintercept = 0),
color = 'gray', linetype=2) +
labs(title = "One-Tailed Test Illustration",
x = "Test Statistic",
y = "Probability Density",
color = "",
fill = "") +
scale_x_continuous(breaks = seq(-4, 4, 1)) +
scale_fill_manual(values = c('lightblue', 'pink')) +
scale_color_manual(values = c('darkred', 'darkorange', 'darkblue',
'darkgreen')) +
theme_minimal()
sample_loan_yes <- data_frame$balance[data_frame$housing == 'yes']
sample_loan_no <- data_frame$balance[data_frame$housing == 'no']
# Impute missing values with the mean of the respective samples
sample_loan_yes[is.na(sample_loan_yes)] <- mean(sample_loan_yes, na.rm = TRUE)
sample_loan_no[is.na(sample_loan_no)] <- mean(sample_loan_no, na.rm = TRUE)
t_test_result <- t.test(sample_loan_yes, sample_loan_no)
# Print t-test results
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: sample_loan_yes and sample_loan_no
## t = -14.081, df = 34204, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -480.0551 -362.7413
## sample estimates:
## mean of x mean of y
## 1175.103 1596.501
if (t_test_result$p.value < alpha) {
cat("Reject the null hypothesis. There is a significant difference in average balance between clients having a housing loan and not having a housing loan")
} else {
cat("Fail to reject the null hypothesis. There is no significant difference in average balance between clients having a housing loan and not having a housing loan")
}
## Reject the null hypothesis. There is a significant difference in average balance between clients having a housing loan and not having a housing loan
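One refinement, sketched here as an optional variation rather than what produced the output above: t.test() defaults to a 95% confidence interval, while α in this analysis is 0.03, so the interval can be matched to the chosen level:
# Report a 97% CI to match alpha = 0.03
t.test(sample_loan_yes, sample_loan_no, conf.level = 1 - alpha)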
A Fisher's exact test needs counts of successes and non-successes in a small contingency table to be meaningful; given the nature of my data (a continuous balance) and the null hypothesis, that test would not be appropriate, so I opt for a Chi-Square test instead.
contingency_table <- table(data_frame$balance, data_frame$housing)
chisq.test(contingency_table, correct = TRUE)
## Warning in chisq.test(contingency_table, correct = TRUE): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 9661.8, df = 7167, p-value < 2.2e-16
The p-value is far smaller than the alpha level of 0.03, which provides strong evidence against the null hypothesis.
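The warning arises because the chi-square test above treats every distinct balance as its own category, producing a very sparse table. A hedged alternative sketch: dichotomizing balance at its median (my choice of cut point) yields a 2×2 table on which both chisq.test() and fisher.test() are well behaved:
# 2x2 table: balance above/below the median vs. housing-loan status
balance_split <- table(data_frame$balance > median(data_frame$balance),
                       data_frame$housing)
chisq.test(balance_split, correct = TRUE)
fisher.test(balance_split)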
ggplot(data_frame, aes(x = housing, y = balance)) +
geom_boxplot() +
labs(title = "Having a Housing Loan VS Balance", x = "Having a housing loan", y = "Balance") +
theme_minimal()
p <- avg_balances |>
  ggplot(aes(x = housing, y = avg_balance)) +
  geom_col(fill = "lightpink") +
  theme_minimal()
p
As the graph above shows, there is a clear difference in the average balances of the two groups of clients.
\[ H_0: \text{Average age is the same for clients having a housing loan and clients not having a housing loan.} \]
As before, α represents the probability of making a Type I error, and a smaller α requires stronger evidence to reject the null hypothesis.
Since the dataset is large and a client's age is important to the bank for improving targeted marketing, I set α to 5%, i.e. α = 0.05.
Power, 1 - β, is again the study's ability to detect a true effect; here I set it to 0.80.
# Cohen's d for age: clients without a housing loan vs. clients with one
effect_size <- cohen.d(d = filter(data_frame, housing == 'no') |> pluck("age"),
                       f = filter(data_frame, housing == 'yes') |> pluck("age"))
effect_size
##
## Cohen's d
##
## d estimate: 0.3799486 (small)
## 95 percent confidence interval:
## lower upper
## 0.3612320 0.3986653
test <- pwrss.t.2means(mu1 = 0.3799486,
sd1 = 1,
kappa = 1,
power = .80, alpha = 0.05,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 110
## n2 = 110
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 218
## Non-centrality parameter = 2.818
## Type I error rate = 0.05
## Type II error rate = 0.2
plot(test)
## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'
From the above calculation, the required sample size is 110 per group. Now summarizing the data in each category:
avg_ages <- data_frame |>
  filter(!is.na(housing)) |>
  group_by(housing) |>
  summarise(avg_age = mean(age, na.rm = TRUE), size = n())
avg_ages
## # A tibble: 2 × 3
## housing avg_age size
## <chr> <dbl> <int>
## 1 no 43.1 20081
## 2 yes 39.2 25130
Since each category contains far more observations than the required sample size, I can carry out a Neyman-Pearson hypothesis test.
\[ H_A: \text{Average age is not the same for the clients having a housing loan or not having a housing loan.} \]
observed_diff <- (avg_ages$avg_age[1] - avg_ages$avg_age[2])
paste("Observed Difference: ", observed_diff)
## [1] "Observed Difference: 3.96459497213761"
diff_in_avg <- function (x_data) {
avg_ages <- x_data |>
group_by(housing) |>
summarize(avg_age = mean(age)) |>
arrange(housing)
#not having a housing loan - having a housing loan
diff <- (avg_ages$avg_age[1] - avg_ages$avg_age[2])
return(diff)
}
diffs_in_age_avgs <- bootstrap(data_frame, diff_in_avg, n_iter = 500)
ggplot() +
  # xlim narrowed from the balance plot's c(-500, 500): bootstrapped age
  # differences span only a few years, so a +/-10 window shows the curve
  geom_function(xlim = c(-10, 10),
                fun = function(x) dnorm(x, mean = 0,
                                        sd = sd(diffs_in_age_avgs))) +
geom_vline(mapping = aes(xintercept = observed_diff,
color = paste("observed: ",
round(observed_diff)))) +
labs(title = "Bootstrapped Sampling Distribution of Age Differences",
x = "Difference in Age Calculated",
y = "Probability Density",
color = "") +
theme_minimal()
alpha <- 0.05 # Significance level
critical_value <- qnorm(1 - alpha/2)
critical_value
## [1] 1.959964
delta <- 0.3799486
f_0 <- function(x) dnorm(x, mean = 0)
f_a <- function(x) dnorm(x, mean = delta)
ggplot() +
stat_function(mapping = aes(fill = 'power'),
fun = f_a,
xlim = c(critical_value, 4),
geom = "area") +
stat_function(mapping = aes(fill = 'alpha'),
fun = f_0,
xlim = c(critical_value, 4),
geom = "area") +
geom_function(mapping = aes(color = 'Null Hypothesis'),
xlim = c(-4, 4),
fun = f_0) +
geom_function(mapping = aes(color = 'Alternative Hypothesis'),
xlim = c(-4, 4),
fun = f_a) +
geom_vline(mapping = aes(xintercept = critical_value,
color = "Critical Value")) +
geom_vline(mapping = aes(xintercept = delta,
color = "Delta")) +
geom_vline(mapping = aes(xintercept = 0),
color = 'gray', linetype=2) +
labs(title = "One-Tailed Test Illustration",
x = "Test Statistic",
y = "Probability Density",
color = "",
fill = "") +
scale_x_continuous(breaks = seq(-4, 4, 1)) +
scale_fill_manual(values = c('lightblue', 'pink')) +
scale_color_manual(values = c('darkred', 'darkorange', 'darkblue',
'darkgreen')) +
theme_minimal()
sample_loan_yes <- data_frame$age[data_frame$housing == 'yes']
sample_loan_no <- data_frame$age[data_frame$housing == 'no']
# Impute missing values with the mean of the respective samples
sample_loan_yes[is.na(sample_loan_yes)] <- mean(sample_loan_yes, na.rm = TRUE)
sample_loan_no[is.na(sample_loan_no)] <- mean(sample_loan_no, na.rm = TRUE)
t_test_result <- t.test(sample_loan_no, sample_loan_yes)
# Print t-test results
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: sample_loan_no and sample_loan_yes
## t = 38.853, df = 36001, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.764592 4.164598
## sample estimates:
## mean of x mean of y
## 43.13988 39.17529
if (t_test_result$p.value < alpha) {
cat("Reject the null hypothesis. There is a significant difference in average age between clients having a housing loan and not having a housing loan")
} else {
cat("Fail to reject the null hypothesis. There is no significant difference in average age between clients having a housing loan and not having a housing loan")
}
## Reject the null hypothesis. There is a significant difference in average age between clients having a housing loan and not having a housing loan
As with balance, a Fisher's exact test would need success and non-success counts in a small contingency table and is not appropriate for this data and null hypothesis, so I again opt for a Chi-Square test.
contingency_table <- table(data_frame$age, data_frame$housing)
chisq.test(contingency_table, correct = TRUE)
## Warning in chisq.test(contingency_table, correct = TRUE): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 2688.3, df = 76, p-value < 2.2e-16
The p-value is far smaller than the alpha level of 0.05, which provides strong evidence against the null hypothesis.
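As with balance, the raw-age table is sparse (hence the warning above). Binning age into a few groups first gives a better-conditioned table; the breakpoints below are illustrative assumptions:
# Chi-square on binned ages vs. housing-loan status
age_groups <- cut(data_frame$age,
                  breaks = c(17, 30, 40, 50, 60, 100),
                  labels = c("18-30", "31-40", "41-50", "51-60", "60+"))
chisq.test(table(age_groups, data_frame$housing))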
ggplot(data_frame, aes(x = housing, y = age)) +
geom_boxplot() +
labs(title = "Having a Housing Loan VS Age", x = "Having a housing loan", y = "Age") +
theme_minimal()
p <- avg_ages |>
  ggplot(aes(x = housing, y = avg_age)) +
  geom_col(fill = "lightblue") +
  theme_minimal()
p