Assignment #02

1. Find your data

The selected dataset consists of customers of Santander Bank Spain, with their respective data such as gender, age, and income. This data was extracted from the following source

2. Import Data to R Studio

santander_data <- read.table("./data/test_ver2.csv", header = TRUE, sep = ",", dec = ",")

3. Display your data using the head function

head(santander_data)

##   fecha_dato ncodpers ind_empleado pais_residencia sexo age fecha_alta
## 1 2016-06-28    15889            F              ES    V  56 1995-01-16
## 2 2016-06-28  1170544            N              ES    H  36 2013-08-28
## 3 2016-06-28  1170545            N              ES    V  22 2013-08-28
## 4 2016-06-28  1170547            N              ES    H  22 2013-08-28
## 5 2016-06-28  1170548            N              ES    H  22 2013-08-28
## 6 2016-06-28  1170550            N              ES    V  22 2013-08-28
##   ind_nuevo antiguedad indrel ult_fec_cli_1t indrel_1mes tiprel_1mes indresi
## 1         0        256      1                          1           A       S
## 2         0         34      1                          1           I       S
## 3         0         34      1                          1           A       S
## 4         0         34      1                          1           I       S
## 5         0         34      1                          1           I       S
## 6         0         34      1                          1           I       S
##   indext conyuemp canal_entrada indfall tipodom cod_prov        nomprov
## 1      N        N           KAT       N       1       28         MADRID
## 2      N                    KAT       N       1        3       ALICANTE
## 3      N                    KHE       N       1       15      CORUÑA, A
## 4      N                    KHE       N       1        8      BARCELONA
## 5      N                    KHE       N       1        7 BALEARS, ILLES
## 6      N                    KHE       N       1        8      BARCELONA
##   ind_actividad_cliente       renta           segmento
## 1                     1   326124.90           01 - TOP
## 2                     0          NA  02 - PARTICULARES
## 3                     1          NA 03 - UNIVERSITARIO
## 4                     0   148402.98 03 - UNIVERSITARIO
## 5                     0   106885.80 03 - UNIVERSITARIO
## 6                     0          NA 03 - UNIVERSITARIO

4. Explain your data

Unit of observation: one customer from Santander Bank Spain

Total observations: 929,615 clients

Sample Size: 200 clients

Of all the variables, the following were selected because of their relevance. They are presented with their original names and their English translation in parentheses:

sexo (Gender): Customer’s sex where {V: Varón (Male), H: Hembra (Female)}
age (Age): Customer’s age
renta (Income): Gross income of the household in EUR, yearly

5. Name the source of the data

Data source name: Santander Bank Spain Customers

6. Carry out data manipulation

Installing and activating packages

#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("tidyr")
#install.packages("ggplot2")
library(dplyr)
library(tidyverse)
library(tidyr)
library(ggplot2)

Extracting the relevant columns and then renaming them

santander_data_extract <- santander_data %>%
  select(sexo, age, renta)

santander_data_extract <- santander_data_extract %>%
  rename("Gender" = "sexo",
         "Age" = "age",
         "Income" = "renta")

Fixing the data type of the numeric variables that were read as character

santander_data_extract$Income <- as.numeric(santander_data_extract$Income)

## Warning: NAs introduced by coercion

santander_data_extract$Age <- as.numeric(santander_data_extract$Age)
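
The coercion warning above means that some strings could not be parsed as numbers. The following is a minimal sketch, on made-up values rather than the real column, of how one might count the entries that coercion turns into NA before overwriting the data. The vector `raw_income` is hypothetical.

```r
# Hypothetical character vector standing in for a column read as text
raw_income <- c("326124.90", "148402.98", "n/a", NA, "106885.80")

# Coerce quietly, then compare the number of NAs before and after
income_num <- suppressWarnings(as.numeric(raw_income))
na_before <- sum(is.na(raw_income))
na_after <- sum(is.na(income_num))

cat("NAs introduced by coercion:", na_after - na_before, "\n")
```

A count larger than expected would suggest inspecting the offending strings before dropping them.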

Converting gender to a factor and displaying the variables with their data types

santander_data_extract$GenderFactor <- factor(santander_data_extract$Gender,
                                   levels = c("V","H"),
                                   labels = c("Male", "Female"))
str(santander_data_extract)

## 'data.frame':    929615 obs. of  4 variables:
##  $ Gender      : chr  "V" "H" "V" "H" ...
##  $ Age         : num  56 36 22 22 22 22 51 22 22 22 ...
##  $ Income      : num  326125 NA NA 148403 106886 ...
##  $ GenderFactor: Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 2 2 1 2 ...

Dropping NA Values

santander_data_extract <- santander_data_extract %>%
  drop_na()

Taking a random sample of 200 observations, since this is the maximum acceptable size

data_sample <- sample_n(santander_data_extract, size = 200)
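
No seed is shown for the draw above, so re-running it would probably yield a different sample and different figures. The sketch below uses base R and a made-up data frame (`toy` is hypothetical, and the seed value is arbitrary) to illustrate how fixing the seed makes a random sample repeatable.

```r
# Toy data frame standing in for the full extract (values are made up)
toy <- data.frame(id = 1:1000,
                  Income = seq(10000, by = 150, length.out = 1000))

# Fixing the seed makes the random draw repeatable
set.seed(2024)  # arbitrary seed, chosen only for illustration
first_draw <- toy[sample(nrow(toy), size = 200), ]

set.seed(2024)
second_draw <- toy[sample(nrow(toy), size = 200), ]

# Same seed, same rows
identical(first_draw, second_draw)
```

Calling `set.seed()` immediately before `sample_n()` should have the same effect, since dplyr draws from R's random number generator.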

7. Present the descriptive statistics

The arithmetic mean of income in the sample (overall and by gender) and the income disparity ratio are presented

sample_mean <- mean(data_sample$Income)
data_sample$GenderFactor <- factor(data_sample$Gender,
                                   levels = c("V","H"),
                                   labels = c("Male", "Female"))
  
sample_mean_male <- mean(data_sample[data_sample$GenderFactor == "Male",]$Income)
sample_mean_female <- mean(data_sample[data_sample$GenderFactor == "Female",]$Income)
income_ratio <- round(log(sample_mean_male / sample_mean_female) * 100,2)

cat("The arithmetic mean of the annual income of the sample is", sample_mean, "EUR, where men have an average of",sample_mean_male,"EUR yearly, and women have an average of",sample_mean_female,"EUR yearly. This results in an income disparity of",income_ratio,"log points (100 times the log of the male-to-female ratio of means).")

## The arithmetic mean of the annual income of the sample is 183133.4 EUR, where men have an average of 144209.7 EUR yearly, and women have an average of 230706.8 EUR yearly. This results in an income disparity of -46.99 log points (100 times the log of the male-to-female ratio of means).
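
The disparity figure above is 100 times the natural log of the ratio of means. That quantity approximates a percentage difference only when the ratio is close to 1. As a worked check using the two group means reported above, the sketch below compares it with the ordinary percentage difference relative to the female mean (the names `log_points` and `pct_diff` are introduced here for illustration).

```r
# Group means as reported in the output above
mean_male <- 144209.7
mean_female <- 230706.8

# 100 * log of the ratio: the quantity stored in income_ratio
log_points <- round(log(mean_male / mean_female) * 100, 2)

# Ordinary percentage difference relative to the female mean
pct_diff <- round((mean_male / mean_female - 1) * 100, 2)

cat("Log-ratio measure:", log_points, "\n")      # -46.99
cat("Percentage difference:", pct_diff, "\n")    # -37.49
```

With a gap this large the two measures differ by almost ten points, so it matters which one is reported.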

The standard deviation of income in the sample (overall and by gender) is presented

sample_sd <- sd(data_sample$Income)
sample_sd_male <- sd(data_sample[data_sample$GenderFactor == "Male",]$Income)
sample_sd_female <- sd(data_sample[data_sample$GenderFactor == "Female",]$Income)
max_income <- max(data_sample$Income)

cat("Given the dispersion of income within the sample, where a maximum of",max_income,"EUR was found, together with a sample standard deviation of", sample_sd, "EUR (namely", sample_sd_male, "EUR for men and", sample_sd_female, "EUR for women), a more representative measure of the typical income of Santander customers is needed.")

## Given the dispersion of income within the sample, where a maximum of 9686020 EUR was found, together with a sample standard deviation of 684417.4 EUR (namely 122124.3 EUR for men and 1012401 EUR for women), a more representative measure of the typical income of Santander customers is needed.

The median of the sample for income (grouped by gender) is presented

sample_median <- median(data_sample$Income)
sample_median_male <- median(data_sample[data_sample$GenderFactor == "Male",]$Income)
sample_median_female <- median(data_sample[data_sample$GenderFactor == "Female",]$Income)

cat("For this reason, the median was calculated as a measure of central tendency. The results deviate from the arithmetic mean: the median of the sample is", sample_median, "EUR compared with the arithmetic mean of", sample_mean, "EUR. This means that half of the customers in the sample earn this figure or less, and the other half earn this figure or more.\n\nThe same pattern appears by gender: the median for men is", sample_median_male, "EUR compared with a mean of", sample_mean_male, "EUR, and the median for women is", sample_median_female, "EUR compared with a mean of",sample_mean_female,"EUR. The interpretation is the same: half of each group in the sample earns the corresponding figure or less. The gap arises because the median is robust to outliers, whereas the arithmetic mean is not.")

## For this reason, the median was calculated as a measure of central tendency. The results deviate from the arithmetic mean: the median of the sample is 105155.7 EUR compared with the arithmetic mean of 183133.4 EUR. This means that half of the customers in the sample earn this figure or less, and the other half earn this figure or more.
## 
## The same pattern appears by gender: the median for men is 105875.7 EUR compared with a mean of 144209.7 EUR, and the median for women is 102727.2 EUR compared with a mean of 230706.8 EUR. The interpretation is the same: half of each group in the sample earns the corresponding figure or less. The gap arises because the median is robust to outliers, whereas the arithmetic mean is not.
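
The effect of a single extreme value on the two measures can be illustrated with a small sketch. The incomes below are made up for the example and are not taken from the dataset.

```r
# Made-up incomes in EUR, roughly symmetric around 105000
incomes <- c(90000, 100000, 105000, 110000, 120000)

# Same incomes plus one extreme value
with_outlier <- c(incomes, 9000000)

cat("Without outlier: mean =", mean(incomes),
    ", median =", median(incomes), "\n")
cat("With outlier:    mean =", mean(with_outlier),
    ", median =", median(with_outlier), "\n")
```

The mean jumps from 105000 to 1587500 EUR, while the median only moves from 105000 to 107500 EUR.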

The mean of income in the sample (grouped by age range) is presented

subsample_age_first <- data_sample %>% 
  filter(Age >= 20 & Age < 40)

subsample_age_second <- data_sample %>% 
  filter(Age >= 40 & Age < 60)

subsample_age_third <- data_sample %>% 
  filter(Age >= 60)

subsample_age_first_mean <- round(mean(subsample_age_first$Income))
subsample_age_second_mean <- round(mean(subsample_age_second$Income))
subsample_age_third_mean <- round(mean(subsample_age_third$Income))

cat("Lastly, the arithmetic mean of income was analyzed for three age ranges: customers aged 20 to 39, customers aged 40 to 59, and customers aged 60 and above.\n\nFor the first range, the arithmetic mean is",subsample_age_first_mean,"EUR; for the second range it is",subsample_age_second_mean,"EUR; and for the third range it is", subsample_age_third_mean, "EUR.")

## Lastly, the arithmetic mean of income was analyzed for three age ranges: customers aged 20 to 39, customers aged 40 to 59, and customers aged 60 and above.
## 
## For the first range, the arithmetic mean is 100460 EUR; for the second range it is 271151 EUR; and for the third range it is 212014 EUR.
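
The three filters above can also be expressed as a single grouping variable. The following is a minimal sketch using base R's `cut()` and `tapply()` on made-up ages and incomes. The vectors `age` and `income` and the group labels are hypothetical. It also shows that customers under 20 fall outside all three ranges.

```r
# Made-up ages and incomes (EUR), not taken from the dataset
age <- c(19, 22, 35, 40, 59, 60, 75)
income <- c(20000, 50000, 70000, 120000, 140000, 90000, 110000)

# Left-closed intervals [20,40), [40,60), [60,Inf) match the filters above
age_group <- cut(age,
                 breaks = c(20, 40, 60, Inf),
                 right = FALSE,
                 labels = c("20-39", "40-59", "60+"))

# Ages below 20 get NA and are left out of the group means
table(age_group, useNA = "ifany")
group_means <- tapply(income, age_group, mean)
group_means
```

Tabulating the groups with `useNA = "ifany"` makes it visible how many observations are excluded by the chosen ranges.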

8. Hypothesis testing

Hypothesis Question

Santander Bank is the biggest bank in Spain¹, so we assume that its customers are reasonably representative of the country’s population. Is there evidence of income inequality between male and female Santander Bank customers?

¹ Source: https://www.statista.com/statistics/693883/leading-banks-assets-spain/

Hypothesis setup

For this research question, the null and alternative hypotheses are the following:

\(H_0: \mu_{male} = \mu_{female}\)
\(H_1: \mu_{male} \neq \mu_{female}\)

Or alternatively

\(H_0: \mu_{male} - \mu_{female} = 0\)
\(H_1: \mu_{male} - \mu_{female} \neq 0\)

Moreover, since we want to compare the average income of two different and independent groups, the parametric independent two-sample t-test will be performed (in its pooled-variance form if the equal-variance assumption holds), together with the non-parametric Wilcoxon rank sum test.

Parametric test assumptions

Before performing this test, the assumptions that must be met will be checked

Assumption #1: The research variable (income) is numeric

str(data_sample$Income)

##  num [1:200] 105614 57293 88653 222990 269946 ...

Conclusion: The assumption is met

Assumption #2: The variable must be normally distributed in both populations (male and female)

To check this assumption, a histogram will first be plotted and analyzed

ggplot(data = data_sample, aes(x=Income)) +
  geom_histogram(binwidth = 15000, colour="gray", fill = "blue") +
  facet_wrap(~GenderFactor, ncol=1) +
  ylab("Frequency")

As there are outliers in the sample, a subsample will be created for plotting, restricting income to less than 500,000 EUR yearly

data_sample_subset <- data_sample %>%
  filter(Income < 500000)

ggplot(data = data_sample_subset, aes(x=Income)) +
  geom_histogram(binwidth = 10000, colour="gray", fill = "blue") +
  facet_wrap(~GenderFactor, ncol=1) +
  ylab("Frequency")

Here, it can be seen that the distribution of income within each gender is far from normal, since it is positively (right) skewed. Furthermore, a Shapiro-Wilk test will be conducted to support this statement.
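
A right-skewed distribution typically has a mean above its median and a positive skewness coefficient. The sketch below computes the moment-based sample skewness by hand on made-up values. The vector `x` and the helper `skewness` are hypothetical and defined only for this illustration.

```r
# Made-up values with a long right tail
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

cat("mean =", mean(x), ", median =", median(x), "\n")
cat("skewness =", skewness(x), "\n")
```

Here the mean (4.75) exceeds the median (3) and the skewness is positive, which is the same qualitative pattern as the histograms.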

#install.packages("rstatix")
library(rstatix)

shapiro_result <- data_sample %>%
  group_by(GenderFactor) %>%
  shapiro_test(Income)

shapiro_result

## # A tibble: 2 × 4
##   GenderFactor variable statistic        p
##   <fct>        <chr>        <dbl>    <dbl>
## 1 Male         Income       0.754 2.67e-12
## 2 Female       Income       0.131 1.32e-20

As previously concluded, income is not normally distributed for either men or women. The hypotheses tested here are:

\(H_0: x_i \sim N \quad \forall i \in \{\text{Male}, \text{Female}\}\)
\(H_1: x_i \not\sim N \quad \forall i \in \{\text{Male}, \text{Female}\}\)

Conclusion: the null hypothesis is rejected with \(p<0.001\) in both cases, and thus neither variable is normally distributed. This would normally lead us to use the non-parametric test right away, but for the sake of the exercise, the third assumption will be checked.
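
The same per-group check can be sketched with base R's `shapiro.test()`. The example below uses made-up, deterministic data: one group built from normal quantiles and one built from exponentially spaced values. The vectors `income` and `group` are hypothetical.

```r
# Made-up data: 50 normal quantiles and 50 exponentially spaced values
income <- c(qnorm(ppoints(50), mean = 100, sd = 15),
            exp(seq(0, 5, length.out = 50)))
group <- rep(c("symmetric", "skewed"), each = 50)

# Run the Shapiro-Wilk test separately in each group, keep the p-values
p_values <- sapply(split(income, group),
                   function(x) shapiro.test(x)$p.value)
p_values
```

A small p-value for the skewed group and a large one for the symmetric group would mirror the logic of the conclusion above: normality is rejected only where the data clearly depart from it.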

Assumption #3: The variable has the same variance in both populations (male and female)

Here, Levene's test will be conducted, where the hypotheses tested are:

\(H_0: \sigma^2_{male} = \sigma^2_{female}\)
\(H_1: \sigma^2_{male} \neq \sigma^2_{female}\)

library(car)
levene <- leveneTest(data_sample$Income, group = data_sample$GenderFactor)
levene

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.8998  0.344
##       198

Conclusion: In this case, the null hypothesis cannot be rejected, with a p-value equal to

levene[,3]

## [1] 0.3439995        NA

And thus, there is no evidence against assuming that both populations have the same variance.
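
With `center = median`, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The following is a minimal base R sketch of that idea on made-up data (`y` and `g` are hypothetical). It is meant to illustrate the mechanics, not to replace `leveneTest`.

```r
# Made-up data: group B is visibly more spread out than group A
y <- c(12, 15, 11, 19, 14, 30, 8, 45, 22, 5)
g <- factor(rep(c("A", "B"), each = 5))

# Absolute deviation of each value from its own group median
z <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations gives the Levene F statistic and p-value
fit <- anova(lm(z ~ g))
f_value <- fit[["F value"]][1]
p_value <- fit[["Pr(>F)"]][1]

cat("F =", f_value, ", p =", p_value, "\n")
```

Because the test works on deviations from the median, a few extreme values in one group can inflate the within-group variability of `z` and mask a difference in spread.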

Parametric test execution

Even though it was shown above that a non-parametric test should be used in this case, a parametric test will be conducted anyway for the sake of the exercise

t_test <- t.test(data_sample$Income ~ data_sample$GenderFactor,
       var.equal = TRUE,
       alternative = "two.sided")
t_test

## 
##  Two Sample t-test
## 
## data:  data_sample$Income by data_sample$GenderFactor
## t = -0.8887, df = 198, p-value = 0.3752
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -278434.1  105440.0
## sample estimates:
##   mean in group Male mean in group Female 
##             144209.7             230706.8

Conclusion: The null hypothesis cannot be rejected, with a p-value equal to

t_test$p.value

## [1] 0.3752447

And thus, there is no statistically significant evidence of a difference in mean income between men and women

Disclaimer: An effect size will not be computed for this part, since the results here are invalid due to the violation of the normality assumption.
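
The call above sets `var.equal = TRUE`, which gives the pooled-variance (Student) t-test. Leaving that argument at its default gives the Welch version instead. The sketch below contrasts the two on made-up data (`x` and `y` are hypothetical) with clearly unequal spread.

```r
# Made-up samples: similar centers, very different spread
x <- c(10, 12, 11, 13, 12, 11)
y <- c(5, 20, 9, 30, 2, 25, 14, 18)

student <- t.test(x, y, var.equal = TRUE)  # pooled variance
welch <- t.test(x, y)                      # default: Welch correction

cat(student$method, ": df =", student$parameter, "\n")
cat(welch$method, ": df =", welch$parameter, "\n")
```

The pooled test always uses n1 + n2 - 2 degrees of freedom, while the Welch test adjusts them downward when the group variances differ.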

Non-parametric test execution

For this test, given the nature of the data, a Wilcoxon rank sum test will be conducted. Since the normality assumption was violated, as shown above, this is the valid test. Moreover, the hypotheses change to:

\(H_0: Me_{male} = Me_{female}\)
\(H_1: Me_{male} \neq Me_{female}\)

wilcox <- wilcox.test(data_sample$Income ~ data_sample$GenderFactor,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
wilcox

## 
##  Wilcoxon rank sum test
## 
## data:  data_sample$Income by data_sample$GenderFactor
## W = 5259, p-value = 0.448
## alternative hypothesis: true location shift is not equal to 0

Conclusion: The null hypothesis cannot be rejected, with a p-value equal to

wilcox$p.value

## [1] 0.4479658

And thus, as the difference between the medians of both populations is not statistically significant, there is no evidence of income inequality between men and women at Santander Bank.
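
The W statistic reported by `wilcox.test()` can be reproduced by hand: it is the sum of the ranks of the first group in the pooled sample, minus n1(n1 + 1)/2. The sketch below checks this on made-up data without ties (`x` and `y` are hypothetical).

```r
# Made-up samples without ties
x <- c(1.1, 3.4, 5.6, 7.2)
y <- c(2.0, 2.5, 4.1, 6.3, 8.8)

# Rank the pooled sample, then sum the ranks belonging to x
ranks <- rank(c(x, y))
n1 <- length(x)
w_manual <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Statistic as computed by R
w_r <- unname(wilcox.test(x, y)$statistic)

cat("W by hand =", w_manual, ", W from wilcox.test =", w_r, "\n")
```

Both values equal 9, which is also the number of (x, y) pairs in which the x value is the larger one. This is why the test is described in the output as a test of location shift rather than strictly of medians.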

Moreover, a formal effect size measure will be computed to see how large the difference between the groups is, on a standardized scale, relative to the value in the null hypothesis

#install.packages("effectsize")
library(effectsize)
effectsize <- effectsize(wilcox.test(data_sample$Income ~ data_sample$GenderFactor,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))
effectsize

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.06              | [-0.10, 0.22]

interpret_rank_biserial(abs(effectsize$r_rank_biserial))

## [1] "very small"
## (Rules: funder2019)

As expected, the interpretation of the effect size leads to the conclusion that the difference is very small. This output is consistent with the previous test results, where the null hypothesis couldn’t be rejected. Moreover, the 95% confidence interval of the effect size, [-0.10, 0.22], includes zero, so the difference between the sample estimate and the value assumed under equality in \(H_0\) is negligible in magnitude.
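
For two independent samples, the rank-biserial correlation is commonly defined from the Wilcoxon statistic as r = 2W / (n1 n2) - 1, that is, the share of favorable pairs minus the share of unfavorable pairs. The sketch below applies this formula to made-up data. It assumes this is the definition behind the value reported above, which has not been verified against the `effectsize` source code.

```r
# Made-up samples without ties
x <- c(1.1, 3.4, 5.6, 7.2)
y <- c(2.0, 2.5, 4.1, 6.3, 8.8)

# Wilcoxon rank sum statistic for x versus y
w <- unname(wilcox.test(x, y)$statistic)

# Rank-biserial correlation: ranges from -1 to 1, 0 means no tendency
r_rb <- 2 * w / (length(x) * length(y)) - 1

cat("W =", w, ", rank-biserial r =", r_rb, "\n")
```

Here 9 of the 20 pairs favor `x`, so r = 0.45 - 0.55 = -0.1, a value close to zero like the one reported above.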

Conclusion

Test selection

The appropriate test for this case was the Wilcoxon rank sum test. This choice is justified by the nature of the hypothesis, which aimed to compare the income distributions of two independent groups (genders). Additionally, the assumptions required for the independent t-test were not met, as the normality assumption was violated, making the non-parametric Wilcoxon test more suitable. For the same reason, the effect size was assessed using the rank-biserial correlation, which is appropriate for this non-parametric test.

Research Hypothesis

As demonstrated above, even though the sample medians of men and women differ, the difference is not statistically significant. There is therefore no statistical evidence of income inequality between male and female Santander Bank customers and, to the extent that they represent it, in the country.