Assignment #01

1. Find your data

The data was extracted from the following source: https://www.kaggle.com/competitions/santander-product-recommendation/data?select=train_ver2.csv.zip. This database consists in customers of Santander Bank, with their respective data such as country, gender, age, etc.

2.Import Data to R Studio

santander_data <- read.table("./data/test_ver2.csv", header = TRUE, sep = ",", dec = ",")

3. Display your data using the head function

head(santander_data)

##   fecha_dato ncodpers ind_empleado pais_residencia sexo age fecha_alta
## 1 2016-06-28    15889            F              ES    V  56 1995-01-16
## 2 2016-06-28  1170544            N              ES    H  36 2013-08-28
## 3 2016-06-28  1170545            N              ES    V  22 2013-08-28
## 4 2016-06-28  1170547            N              ES    H  22 2013-08-28
## 5 2016-06-28  1170548            N              ES    H  22 2013-08-28
## 6 2016-06-28  1170550            N              ES    V  22 2013-08-28
##   ind_nuevo antiguedad indrel ult_fec_cli_1t indrel_1mes tiprel_1mes indresi
## 1         0        256      1                          1           A       S
## 2         0         34      1                          1           I       S
## 3         0         34      1                          1           A       S
## 4         0         34      1                          1           I       S
## 5         0         34      1                          1           I       S
## 6         0         34      1                          1           I       S
##   indext conyuemp canal_entrada indfall tipodom cod_prov        nomprov
## 1      N        N           KAT       N       1       28         MADRID
## 2      N                    KAT       N       1        3       ALICANTE
## 3      N                    KHE       N       1       15      CORUÑA, A
## 4      N                    KHE       N       1        8      BARCELONA
## 5      N                    KHE       N       1        7 BALEARS, ILLES
## 6      N                    KHE       N       1        8      BARCELONA
##   ind_actividad_cliente       renta           segmento
## 1                     1   326124.90           01 - TOP
## 2                     0          NA  02 - PARTICULARES
## 3                     1          NA 03 - UNIVERSITARIO
## 4                     0   148402.98 03 - UNIVERSITARIO
## 5                     0   106885.80 03 - UNIVERSITARIO
## 6                     0          NA 03 - UNIVERSITARIO

4. Explain your data

Unit of observation: a Santander Bank customer

Total observations: 929,615 clients

Sample Size: 10,000 clients

These are the relevant variables selected, with their original names, and their english translation:

fecha_alta (Date): Date when they joined the bank in the format YYYY-MMM-DD
antiguedad (Seniority): Customer seniority (in months)
pais_residencia (Country): Initials of customer’s country of residence
nomprov (Province): Province name
sexo (Gender): Customer’s sex where {V: Varón (Male), H: Hembra (Female)}
age (Age): Customer’s age
renta (Income): Gross income of the household in EUR, yearly
segmento (Segment): Customer segment where {01: VIP, 02: Individuals 03: college graduated}

5. Name the source of the data

Name: Santander Bank Customers

6. Carry out data manipulation

#--> Installing Packages
#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)

#--> Get an extract of the relevant columns, and rename them
santander_data_extract <- santander_data %>%
  select(fecha_alta, antiguedad, pais_residencia, nomprov, sexo, age, renta, segmento)

santander_data_extract <- santander_data_extract %>%
  rename("Date" = "fecha_alta",
         "Seniority" = "antiguedad",
         "Country" = "pais_residencia",
         "Province" = "nomprov",
         "Gender" = "sexo",
         "Age" = "age",
         "Income" = "renta",
         "Segment" = "segmento")


#--> Fix the data type of the numerical variables which were characters
santander_data_extract$Date <- as.Date(santander_data_extract$Date)
santander_data_extract$Income <- as.numeric(santander_data_extract$Income)

## Warning: NAs introduced by coercion

santander_data_extract$Age <- as.numeric(santander_data_extract$Age)
santander_data_extract$Seniority <- as.numeric(santander_data_extract$Seniority)

#--> Dropping NA Values
santander_data_extract <- santander_data_extract %>%
  drop_na()

#--> Taking a random sample with a Size 10,000
data_sample <- sample_n(santander_data_extract, size = 10000)

7. Present the descriptive statistics

sample_mean <- mean(data_sample$Income)
data_sample$GenderFactor <- factor(data_sample$Gender,
                                   levels = c("V","H"),
                                   labels = c("Male", "Female"))
  
sample_mean_male <- mean(data_sample[data_sample$GenderFactor == "Male",]$Income)
sample_mean_female <- mean(data_sample[data_sample$GenderFactor == "Female",]$Income)
income_ratio <- round(log(sample_mean_male / sample_mean_female) * 100,2)

cat("The arithmetic mean of the annual income of the sample is", sample_mean, "EUR, where man represent an average of",sample_mean_male,"EUR yearly, and women represent an average of",sample_mean_female,"EUR yearly. This results in an income disparity of",income_ratio,"percentaje on the average.")

## The arithmetic mean of the annual income of the sample is 133511.9 EUR, where man represent an average of 134963.5 EUR yearly, and women represent an average of 131758.5 EUR yearly. This results in an income disparity of 2.4 percentaje on the average.

sample_sd <- sd(data_sample$Income)
sample_sd_male <- sd(data_sample[data_sample$GenderFactor == "Male",]$Income)
sample_sd_female <- sd(data_sample[data_sample$GenderFactor == "Female",]$Income)
max_income <- max(data_sample$Income)

cat("Due to the dispersion of income wihtin the sample where maximums of",max_income,"EUR were found, and toghether with an standard deviation of the sample of", sample_sd, "EUR (divided by", sample_sd_male, "EUR for male, and", sample_sd_female, "EUR for women) a more representative measure of the real income perceived by Santander Customers must be found.")

## Due to the dispersion of income wihtin the sample where maximums of 9567741 EUR were found, and toghether with an standard deviation of the sample of 181142 EUR (divided by 172494 EUR for male, and 191072.1 EUR for women) a more representative measure of the real income perceived by Santander Customers must be found.

sample_median <- median(data_sample$Income)
sample_median_male <- median(data_sample[data_sample$GenderFactor == "Male",]$Income)
sample_median_female <- median(data_sample[data_sample$GenderFactor == "Female",]$Income)

cat("This is why, the central tendency measure of median was calculated. Here, the results deviate from the arithmetic mean, where the median of the sample is", sample_median, "EUR compared with the mean of", sample_mean, "EUR. This meaning the 50% or less customers, earn this number or less. This can be also seen when analyzing by gender, where the median for male was", sample_median_male, "EUR compared with the mean for male of", sample_mean_male, "EUR, and where the median for women is", sample_median_female, "EUR, compared with the average of",sample_mean_female,"EUR too, where the analysis is the same, meaning that 50% or less of the sample (and thus the population), earn these both figures or less. This can be explained since the median is not affected by outliers, meanwhile it does for the arithmetic mean. This can be appreciated in the graphs of part 7.")

## This is why, the central tendency measure of median was calculated. Here, the results deviate from the arithmetic mean, where the median of the sample is 100795.1 EUR compared with the mean of 133511.9 EUR. This meaning the 50% or less customers, earn this number or less. This can be also seen when analyzing by gender, where the median for male was 102520.6 EUR compared with the mean for male of 134963.5 EUR, and where the median for women is 98376 EUR, compared with the average of 131758.5 EUR too, where the analysis is the same, meaning that 50% or less of the sample (and thus the population), earn these both figures or less. This can be explained since the median is not affected by outliers, meanwhile it does for the arithmetic mean. This can be appreciated in the graphs of part 7.

library(dplyr)
sample_provinces <- data_sample %>% 
  group_by(Province) %>%
  summarise(Mean = mean(Income))

cat("A further analysis can be done when analyzing the average income of the provinces. In this case, the top 5 provinces by their mean income are:\n")

## A further analysis can be done when analyzing the average income of the provinces. In this case, the top 5 provinces by their mean income are:

head(sample_provinces[order(-sample_provinces$Mean),],5)

## # A tibble: 5 × 2
##   Province          Mean
##   <chr>            <dbl>
## 1 MADRID         179093.
## 2 BARCELONA      160398.
## 3 GIRONA         150880.
## 4 BALEARS, ILLES 148206.
## 5 MELILLA        146134.

income_ratio_provinces <- log(max(sample_provinces$Mean) / sample_mean) * 100

cat("Where the top 1 province, represents an average income", income_ratio_provinces, "percentage higher than the sample arithmetic mean.")

## Where the top 1 province, represents an average income 29.37148 percentage higher than the sample arithmetic mean.

7. Graph the distribution of variables and explain the results

#First, an histogram graph was built in order to see the distribution of the income for both Male and Female. As noted here, and as previously mentioned, the dispersion of the results is too high, making the graph skewed to the right (positive skew). This is why a filter was applied in order to analyze the distribution in more detail by removing customers with an income higher than 1,000,000EUR")

library(ggplot2)
ggplot(data = data_sample, aes(x=Income)) +
  geom_histogram(binwidth = 15000, colour="gray", fill = "blue") +
  facet_wrap(~GenderFactor, ncol=1) +
  ylab("Frequency")

# Here, even though the tendency stays the same, is clearer for both male and female that the majority of the observations are concentrated towards the left side of the histogram, indicating that lower-income customers are more frequent that the ones with higher. This, restating that the income distribution for both genders is highly right-skewed (or positive-skewed). Lastly, regarding other type of central tendency, this histogram suggests that the mode is at the lower-income range, where the most frequent income values appear, and with just one peak (making it uni-modal). It should be also noted that just as previously mentioned, the fact that there are many outliers, suggests that the average income is likely higher than the median due to their influence.

library(ggplot2)
data_sample_subset <- data_sample %>%
  filter(Income < 1000000)

ggplot(data = data_sample_subset, aes(x=Income)) +
  geom_histogram(binwidth = 10000, colour="gray", fill = "blue") +
  facet_wrap(~GenderFactor, ncol=1) +
  ylab("Frequency")

# Second, in order to see the quartiles, a boxplot was plotted. Here, the same problem was found with the outliers, making the graph not representative to the majority of people. This is why, in the following graph, the outliers were excluded (outliers = FALSE).

ggplot(data_sample, aes(x=GenderFactor, y=Income)) +
  geom_boxplot() +
  xlab("Gender")

q1 <- quantile(data_sample$Income, prob = 0.25)
q2 <- quantile(data_sample$Income, prob = 0.5)
q3 <- quantile(data_sample$Income, prob = 0.75)
min_salary <- min(data_sample$Income)
max_salary <- max(data_sample$Income)

cat("Here, a much more representative graph can be appreciated, where the first, second, and third quartiles together with the minimum and maximums can be appreciated. By looking at this graph, we can affirm that the 25% (Q1) of the people earned", q1, "EUR or less, that 50% (Q2) of the people earned", q2, "EUR or less, and that 75% of the people earned", q3, "EUR or less. All these, with the person that earned the lowest at", min_salary, "EUR, and the one who earned the most at", max_salary, "EUR.")

## Here, a much more representative graph can be appreciated, where the first, second, and third quartiles together with the minimum and maximums can be appreciated. By looking at this graph, we can affirm that the 25% (Q1) of the people earned 67798.09 EUR or less, that 50% (Q2) of the people earned 100795.1 EUR or less, and that 75% of the people earned 157123.6 EUR or less. All these, with the person that earned the lowest at 6611.58 EUR, and the one who earned the most at 9567741 EUR.

ggplot(data_sample, aes(y=Income)) +
  geom_boxplot(outliers = FALSE, colour = "black", fill = "blue")

# Furthermore, by comparing both boxplots stratified by gender, no statistical significant differences were observed.

ggplot(data_sample, aes(x=GenderFactor, y=Income)) +
  geom_boxplot(outliers = FALSE, colour = "black", fill = "blue") +
  xlab("Gender")

# Lastly, the relationship between two variables (income and age) was analyzed using a scatter plot for both male and female (this were conducted separately, since the amount of data is significant, and the join-graph is confusing)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

data_sample_subset_male <- data_sample %>%
  filter(Income < 1000000,
         GenderFactor == "Male")

scatterplot(x = data_sample_subset_male$Age, 
            y = data_sample_subset_male$Income, 
            xlab = "Customer's Age", 
            ylab = "Level of Income", 
            smooth = FALSE)

# Here, a weak positive relationship can be appreciated for both graphs, indicating that age it is probably not a strong predictor of income as people could think. Moreover, the same conclusions regarding the distributions can be made, where the majority of data points appear to be concentrated at lower income levels, with a high variability, and showing some outliers. This, makes us conclude that other factors could influence more than age in determining income.

data_sample_subset_female <- data_sample %>%
  filter(Income < 1000000,
         GenderFactor == "Female")

scatterplot(x = data_sample_subset_female$Age, 
            y = data_sample_subset_female$Income, 
            xlab = "Customer's Age", 
            ylab = "Level of Income", 
            smooth = FALSE)