Let’s load the following libraries for this lecture.
library(tidyverse)
library(openintro)
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
If variation describes the behavior within a single variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables or do a statistical test.
How you do that should again depend on the types of variables involved. Below are the recommended plot types and statistical tests, followed by a generic sketch:
Two categorical (ordinal) variables: A stacked proportion bar plot or a chi-square test
Two numeric variables: scatter plot along with a line graph or computing the correlation coefficient
One numeric, one categorical (ordinal): grouped density plot or a t-test
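The sketch below summarizes these three cases. The names df, cat1, cat2, num1, and num2 are hypothetical placeholders (they are not columns of our data set), and the code assumes the tidyverse is loaded as above.
# Hypothetical data frame `df` with categorical columns cat1, cat2 (cat2 has two levels)
# and numeric columns num1, num2; placeholders only, not part of bank_data
ggplot(df) + geom_bar(aes(x = cat1, fill = cat2), position = "fill")   # two categorical: stacked proportions
chisq.test(table(df$cat1, df$cat2))                                    # two categorical: chi-square test
ggplot(df, aes(num1, num2)) + geom_point() + geom_smooth()             # two numeric: scatter plot + smooth
cor(df$num1, df$num2)                                                  # two numeric: correlation coefficient
ggplot(df) + geom_density(aes(x = num1, fill = cat2), alpha = 0.5)     # numeric vs. categorical: grouped densities
t.test(num1 ~ cat2, data = df)                                         # numeric vs. categorical: t-test (two levels)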
In the data visualization part, we learned to create stacked bar charts to investigate the effect of two categorical variables on a third numeric variable.
However, that does not check the relationship between those two categorical variables themselves. In statistics, we usually want to know whether two categorical variables are dependent or independent of each other. If they are highly dependent, then knowing the category of one variable helps us predict the category of the other with better accuracy.
For example, whether one brings an umbrella or not is highly dependent on whether it rains on a given day. On the other hand, whether one eats bread for breakfast hardly depends on whether it rains or not.
Let’s check whether Gender and Dependent_count have a significant impact on Attrition_Flag. We can do this with a graph or a table:
ggplot(bank_data) +
  geom_bar(aes(x = Gender, fill = Attrition_Flag), position = "fill") +
  labs(y = "proportion")
bank_data %>%
  group_by(Gender) %>%
  summarize(Attrition_rate = prop.table(table(Attrition_Flag))[1])
## # A tibble: 2 × 2
##   Gender Attrition_rate
##   <chr>           <dbl>
## 1 F               0.174
## 2 M               0.146
So we see that the attrition rate is noticeably higher among female customers than male ones. In other words, the data suggest that female customers are more likely to cancel their credit cards.
But please keep in mind that this difference is not necessarily statistically significant (that depends on other factors such as the sample size). Later we will learn how to run statistical tests to draw a more quantitative conclusion.
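As a side note, the prop.table(table(...))[1] idiom above may look cryptic. An equivalent way to compute the same rates, shown here as a sketch, is to average a logical condition, using the fact that attrited customers are exactly those whose Attrition_Flag is not "Existing Customer":
# The mean of a logical vector is the proportion of TRUE values,
# i.e., the share of customers who are not "Existing Customer"
bank_data %>%
  group_by(Gender) %>%
  summarize(Attrition_rate = mean(Attrition_Flag != "Existing Customer"))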
To check the covariation, we use the chisq.test function together with the table function. For example, let’s check whether Attrition_Flag and Gender are dependent or not.
chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(bank_data$Attrition_Flag, bank_data$Gender)
## X-squared = 13.866, df = 1, p-value = 0.0001964
So one would check the p-value here. If it is very small (a common criterion is p < 0.05), we reject the hypothesis of independence and conclude that the two variables are dependent.
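If you want to use the p-value programmatically rather than read it off the printout, the object returned by chisq.test stores it in its p.value component, for example:
# Store the test object and extract the p-value directly
test_result <- chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
test_result$p.value          # about 0.0002, as printed above
test_result$p.value < 0.05   # TRUE, so we treat the two variables as dependent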
Check whether Marital_Status or Dependent_count is correlated with Attrition_Flag or not, using a graph and a statistical test.
Are Marital_Status and Dependent_count correlated with each other? Can you explain the result?
For numeric variables, we simply create a scatter plot and/or a line graph. A correlation analysis is usually performed, too.
Let’s take Customer_Age and Credit_Limit as an example:
ggplot(bank_data, aes(Customer_Age, Credit_Limit)) + geom_point() + geom_smooth()
The points lie on vertical lines because Customer_Age is a discrete variable that only takes integer values. The correlation between the two variables appears quite weak. Still, the smoothed line shows a reasonable trend: the middle-aged group has higher credit limits than young professionals and retired people, which agrees with common sense.
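To back this visual impression up with numbers, one simple check is to compute the average credit limit within each decade of age; grouping by decades is just one illustrative choice.
# Average credit limit by decade of age (decade grouping is an arbitrary but convenient choice)
bank_data %>%
  group_by(age_decade = 10 * (Customer_Age %/% 10)) %>%
  summarize(n = n(), mean_credit_limit = mean(Credit_Limit))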
It is also common to check Pearson’s correlation coefficient between two numeric variables. But note that it only measures linear correlation and cannot capture nonlinear relationships.
Use cor(<data1>, <data2>) to compute the coefficient in R. A value around zero indicates little correlation. A value around \(1\) or \(-1\) indicates strong linear correlation, as shown in the figure below.
cor(bank_data$Customer_Age, bank_data$Credit_Limit)
## [1] 0.002476227
We see a very weak linear correlation here. But as the line graph indicates, there can be some nonlinear correlation between the two variables. For example, we can check whether the credit limit is correlated with the square of customer age.
# Center Customer_Age before squaring so the squared term captures the quadratic pattern
mean_age <- mean(bank_data$Customer_Age)
mean_credit_limit <- mean(bank_data$Credit_Limit)
cor((bank_data$Customer_Age - mean_age)^2, bank_data$Credit_Limit - mean_credit_limit)
## [1] -0.09446443
So there is indeed a quadratic correlation between customer age and credit limit. Please note that to detect such a nonlinear (here, quadratic) pattern, the variable must be centered (its mean subtracted) before it is squared; otherwise the squared term mostly repeats the linear trend. Subtracting the mean from Credit_Limit does not change the result, since the correlation coefficient is invariant to shifts.
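To see why the centering matters, compare with the correlation computed from the raw (uncentered) squared age. Since age is positive and varies over a fairly narrow range, the uncentered square is almost a linear function of age, so this mostly reproduces the weak linear correlation found above:
# Without centering, the squared term mostly reproduces the linear trend
cor(bank_data$Customer_Age^2, bank_data$Credit_Limit)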
For one numeric variable and one categorical variable, we commonly use boxplots to visualize their relationship. However, boxplots are sometimes not informative enough, so you are recommended to compare the density curves across groups instead. Let’s take Attrition_Flag and Customer_Age as an example.
ggplot(bank_data) +
  geom_density(aes(x = Customer_Age, fill = Attrition_Flag), alpha = 0.5)
We see the two distributions nearly overlapping with each other, so at least we can say that the effect of Customer_Age on Attrition_Flag does not appear very strong.
In comparison, we can create the same plot for Attrition_Flag and Total_Trans_Ct. An evidently more pronounced difference is observed.
ggplot(bank_data) +
  geom_density(aes(x = Total_Trans_Ct, fill = Attrition_Flag), alpha = 0.5)
We see that customers with higher transaction counts (over 50 or so) are much less likely to cancel their credit card service.
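We can quantify this impression with a simple split at 50 transactions; the cutoff is read off the plot, so it is only indicative.
# Attrition rate for customers with at most 50 vs. more than 50 transactions
bank_data %>%
  group_by(many_transactions = Total_Trans_Ct > 50) %>%
  summarize(n = n(), Attrition_rate = mean(Attrition_Flag != "Existing Customer"))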
When there are few outliers, we may use the two-sample t-test to tell quantitatively whether age has a significant effect here.
data1 <- bank_data$Customer_Age[bank_data$Attrition_Flag != "Existing Customer"]
data2 <- bank_data$Customer_Age[bank_data$Attrition_Flag == "Existing Customer"]
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 1.8988, df = 2370.8, p-value = 0.05772
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01302059 0.80777731
## sample estimates:
## mean of x mean of y
## 46.65950 46.26212
The first two lines of code divide the customer ages into two groups: ages of attrited customers (data1) and ages of existing customers (data2).
The t.test function performs a two-sample t-test to see whether the mean customer age is the same among existing customers and attrited customers. The p-value is above 0.05, which means we cannot reject the null hypothesis: the difference in means may well be negligible.
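Equivalently, the same test can be written more compactly with the formula interface of t.test, which splits Customer_Age by the two levels of Attrition_Flag automatically:
# Equivalent two-sample t-test using the formula interface
t.test(Customer_Age ~ Attrition_Flag, data = bank_data)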
When there are more than two comparison groups, we may turn to the F-test (one-way ANOVA) using the oneway.test function:
oneway.test(bank_data$Customer_Age ~ bank_data$Education_Level)
##
## One-way analysis of means (not assuming equal variances)
##
## data: bank_data$Customer_Age and bank_data$Education_Level
## F = 2.4125, num df = 6.0, denom df = 2672.9, p-value = 0.02503
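Note that oneway.test also accepts a data argument, and setting var.equal = TRUE gives the classical ANOVA F-test that assumes equal group variances:
# Same test written with the data argument; var.equal = TRUE assumes equal variances
oneway.test(Customer_Age ~ Education_Level, data = bank_data)
oneway.test(Customer_Age ~ Education_Level, data = bank_data, var.equal = TRUE)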
For all the statistical tests introduced in this module, there are assumptions that need to be met. More technical details are covered in the regression class. For your reference, you can learn from the following course on Coursera:
Analyze whether the variable Avg_Utilization_Ratio is correlated with Attrition_Flag or not, using a graph and a statistical test. Explain your result.
We may use the cut function to convert a numeric variable into customized groups. For example, for Customer_Age, we may want to classify all ages into the groups <= 30, 31-40, 41-50, 51-60, and > 60, and then analyze its effect. The following code realizes this:
bank_data <- mutate(bank_data, Age_group = cut(Customer_Age, breaks = c(0, 30, 40, 50, 60, Inf)))
bank_data %>%
  group_by(Age_group) %>%
  summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1])
## # A tibble: 5 × 2
##   Age_group Attrition_ratio
##   <fct>               <dbl>
## 1 (0,30]              0.121
## 2 (30,40]             0.145
## 3 (40,50]             0.167
## 4 (50,60]             0.168
## 5 (60,Inf]            0.143
So we see that customers between 40 and 60 are most likely to churn.
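If the default interval labels such as (0,30] look cryptic, cut also accepts a labels argument; the label strings below are just one possible choice:
# Same breaks as above, but with human-readable labels (label text is an arbitrary choice)
bank_data <- mutate(bank_data,
  Age_group = cut(Customer_Age, breaks = c(0, 30, 40, 50, 60, Inf),
                  labels = c("<=30", "31-40", "41-50", "51-60", ">60")))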
Cut Total_Trans_Ct into a few reasonable categories and study its effect on Attrition_Flag.
An interesting yet challenging part of data exploration is that sometimes there are complicated composite effects between variables. For example, if we only look at the effect of credit limit on churning, it does not appear very significant:
ggplot(bank_data) +
  geom_density(aes(x = Credit_Limit, fill = Attrition_Flag), alpha = 0.5)
ggplot(bank_data) +
  geom_boxplot(aes(x = Credit_Limit, y = Attrition_Flag))
However, if we also put Customer_Age in the picture, we see some composite effect:
ggplot(bank_data) +
  geom_point(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag), alpha = 0.5) +
  geom_smooth(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag))
So it seems that for younger customers (under 55 or so), churning customers tended to have a lower credit limit on average. But the trend reverses for older customers, who were more likely to churn when they had a higher credit limit. Can you try to explain why?
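A rough numeric check of this pattern is to split the data at age 55 (a cutoff read off the plot) and compare the average credit limits of attrited and existing customers within each half:
# Mean credit limit by attrition status, for customers aged 55 and below vs. above 55
bank_data %>%
  group_by(over_55 = Customer_Age > 55, Attrition_Flag) %>%
  summarize(n = n(), mean_credit_limit = mean(Credit_Limit), .groups = "drop")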
In this example, we try to analyze the effect of two numeric variables on a categorical variable. So we use a grouped scatter/smooth plot.
What if we try to analyze the composite effect of two categorical variables? Let’s analyze the composite effect of Gender and Income_Category:
bank_data %>%
  group_by(Income_Category, Gender) %>%
  summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1]) %>%
  arrange(Attrition_ratio)
## # A tibble: 9 × 3
## # Groups:   Income_Category [6]
##   Income_Category Gender Attrition_ratio
##   <chr>           <chr>            <dbl>
## 1 Unknown         M               0.0962
## 2 Less than $40K  M               0.108
## 3 $60K - $80K     M               0.135
## 4 $40K - $60K     M               0.135
## 5 $80K - $120K    M               0.158
## 6 $40K - $60K     F               0.164
## 7 Unknown         F               0.172
## 8 $120K +         M               0.173
## 9 Less than $40K  F               0.177
In this example, we break our data into groups defined by all possible combinations of Gender and Income_Category, and study the effect by listing the attrition ratio for each group in a table.
We can also use facet_wrap to help us visualize the effect:
# Relevel the income category (factor levels will be covered later; not required here)
bank_data$Income_Category <- factor(bank_data$Income_Category,
                                    levels = unique(bank_data$Income_Category)[c(2,4,1,3,5,6)])
ggplot(bank_data) +
  geom_bar(aes(y = Income_Category, fill = Attrition_Flag), position = "fill") +
  facet_wrap(~ Gender, nrow = 2)
Question: What can we learn from the result?
Analyze the composite effect of Education_Level and Marital_Status on Attrition_Flag.
In the last two modules, we learned how to start the process of EDA for a given data set. In practice, this only serves as the first step of data analysis: it provides basic information about the data set, which helps us ask and answer useful questions more efficiently and reasonably. Meaningful analysis usually comes after these basic steps. For example, let’s now think about the following questions:
If we are going to build a model to predict potential churning customers, which variables would you keep in the model, or would you keep them all? How can we incorporate the knowledge extracted from our exploration into the next stage?
If we are asked to propose solutions to improve the retention rate of credit card customers for the bank, what are reasonable directions to explore?
Guided by these questions, go back and review all the results we created, and think about whether any of them provide useful information that helps answer the two questions above.