library(tidyverse)
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
If variation describes the behavior within a single variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.
How you do that again depends on the types of the variables involved. A very brief summary is given below:
In the data visualization part, we learned to create stacked bar charts to investigate the effect of two categorical variables on a third numeric variable.
However, such a chart does not check the relationship between the two categorical variables themselves. In statistics, we usually want to know whether two categorical variables are dependent on or independent of each other. If they are highly dependent, then knowing the category of one variable helps us predict the category of the other with better accuracy.
For example, whether one brings an umbrella is highly dependent on whether it rains that day. On the other hand, whether one eats bread for breakfast hardly depends on whether it rains.
To check the covariation, we use the functions chisq.test and table together. For example, let’s check whether Attrition_Flag and Gender are dependent or not.
chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(bank_data$Attrition_Flag, bank_data$Gender)
## X-squared = 13.866, df = 1, p-value = 0.0001964
Here we check the p-value. The null hypothesis of the test is that the two variables are independent, so a very small p-value (below 0.05 is a common criterion) is evidence that the two variables are dependent.
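The object returned by chisq.test can also be inspected directly. The sketch below is a minimal illustration that rebuilds the 2 x 2 table from the counts printed in the frequency table of this section, so that it runs without the CSV file. The names gender_tab and res are ours and are introduced only for illustration.

```r
# Counts taken from the Attrition_Flag x Gender frequency table in this section.
# gender_tab and res are illustrative names, not part of the original code.
gender_tab <- matrix(
  c(930, 4428, 697, 4072),  # filled column by column: F first, then M
  nrow = 2,
  dimnames = list(
    Attrition_Flag = c("Attrited Customer", "Existing Customer"),
    Gender = c("F", "M")
  )
)

res <- chisq.test(gender_tab)  # Yates' correction is applied by default for 2 x 2 tables

res$statistic  # the X-squared value
res$p.value    # compare against the chosen significance level, e.g. 0.05
res$expected   # counts we would expect if the two variables were independent
```

Comparing res$expected with the observed counts shows which cells drive the departure from independence.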
Analysis: In this example, the p-value is quite small, so we conclude that the two variables are dependent. This can be explained by the frequency table:
table(bank_data$Attrition_Flag, bank_data$Gender)
##
##                        F    M
##   Attrited Customer  930  697
##   Existing Customer 4428 4072
So we see that the female-to-male ratio is noticeably higher among attrited customers (930/697 ≈ 1.33) than among existing customers (4428/4072 ≈ 1.09). In other words, the data set suggests that female customers are more likely to cancel their credit cards.
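The same comparison can be made with proportions rather than raw counts. As a minimal sketch, prop.table with margin = 2 divides each column by its total, giving the share of attrited and existing customers within each gender. The table is rebuilt from the printed counts so that the example runs on its own, and gender_tab and rates are illustrative names.

```r
# Counts copied from the frequency table above; gender_tab is an illustrative name.
gender_tab <- matrix(
  c(930, 4428, 697, 4072),  # filled column by column: F first, then M
  nrow = 2,
  dimnames = list(
    Attrition_Flag = c("Attrited Customer", "Existing Customer"),
    Gender = c("F", "M")
  )
)

# margin = 2 normalises within each column, so every Gender column sums to 1
rates <- prop.table(gender_tab, margin = 2)
round(rates, 3)
```

With the real data, the same result should follow from prop.table(table(bank_data$Attrition_Flag, bank_data$Gender), margin = 2).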
We can also create a grouped bar chart to illustrate this:
ggplot(bank_data) +
  geom_bar(aes(x = Attrition_Flag, fill = Gender), position = "dodge")
But the dependence is less straightforward to see from the graph than from the summary table.
Exercise: Check whether Attrition_Flag is dependent on Marital_Status or not.
For numeric variables, we simply create a scatter plot and/or a line graph. A correlation analysis is usually performed, too.
Let’s take Customer_Age and Credit_Limit as an example:
ggplot(bank_data, aes(Customer_Age, Credit_Limit)) + geom_point() + geom_smooth()
The points lie on vertical lines because Customer_Age is a discrete variable that only takes integer values. The correlation between the two variables seems quite weak. But the smoothed line does show a reasonable trend: the middle-aged group has higher credit limits than young professionals and retired people, which agrees with common sense.
It is also common to check Pearson’s correlation coefficient between two numeric variables. Note, however, that it only measures linear correlation and cannot capture nonlinear relationships.
Use cor(<data1>, <data2>) to compute the coefficient in R. A value around zero indicates little linear correlation. A value around \(1\) or \(-1\) indicates strong linear correlation, as shown in the figure below.
cor(bank_data$Customer_Age, bank_data$Credit_Limit)
## [1] 0.002476227
We see a very weak linear correlation here. But as the smoothed line indicates, there may still be a nonlinear relationship between the two variables.
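To see why a coefficient near zero does not rule out a relationship, consider the following small sketch on made-up data. The vectors x, y_linear and y_quadratic are hypothetical and are not columns of bank_data. One is an exact linear function of x, the other an exact quadratic function of x.

```r
# Made-up illustration data, symmetric around zero
x <- -10:10
y_linear <- 2 * x + 1   # exact linear dependence on x
y_quadratic <- x^2      # exact, but nonlinear, dependence on x

cor(x, y_linear)     # equals 1 up to floating-point rounding
cor(x, y_quadratic)  # essentially 0, although y_quadratic is fully determined by x
```

This is why a plot should accompany the coefficient: the quadratic pattern is obvious in a scatter plot but invisible to Pearson’s correlation.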
For one numeric variable and one categorical variable, we commonly use boxplots to visualize their relationship. Let’s take Attrition_Flag and Customer_Age as an example.
ggplot(bank_data) +
  geom_boxplot(aes(x = Customer_Age, y = Attrition_Flag))
We see little difference here, with the median age slightly higher in the group of attrited customers.
When there are few outliers, we may use the two-sample t-test to tell quantitatively whether age has a significant effect here.
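# Ages of attrited customers (everyone who is not an existing customer)
data1 <- bank_data$Customer_Age[bank_data$Attrition_Flag != "Existing Customer"]
# Ages of existing customers
data2 <- bank_data$Customer_Age[bank_data$Attrition_Flag == "Existing Customer"]
# Welch two-sample t-test for a difference in mean age
t.test(data1, data2)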
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 1.8988, df = 2370.8, p-value = 0.05772
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01302059 0.80777731
## sample estimates:
## mean of x mean of y
## 46.65950 46.26212
The code above performs a two-sample t-test to see whether the mean customer age is the same among existing customers and attrited customers. The p-value is above 0.05, so at the 5% level we cannot reject the null hypothesis that the two means are equal; the observed difference of about 0.4 years is small and could be due to chance.
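t.test also accepts a formula, which avoids splitting the data by hand. The sketch below uses simulated ages in a hypothetical data frame called toy (not the real bank_data), purely to show that the formula interface and the two-vector call give the same Welch test.

```r
set.seed(1)  # simulated values for illustration only, not the real customer ages

toy <- data.frame(
  Attrition_Flag = rep(c("Attrited Customer", "Existing Customer"), times = c(40, 60)),
  Customer_Age = c(rnorm(40, mean = 47, sd = 8), rnorm(60, mean = 46, sd = 8))
)

# Formula interface: numeric response on the left, two-level grouping variable on the right
res_formula <- t.test(Customer_Age ~ Attrition_Flag, data = toy)

# Equivalent call on two vectors, mirroring the data1/data2 approach
res_vectors <- t.test(
  toy$Customer_Age[toy$Attrition_Flag != "Existing Customer"],
  toy$Customer_Age[toy$Attrition_Flag == "Existing Customer"]
)

res_formula$p.value   # same p-value as res_vectors$p.value
res_formula$conf.int  # confidence interval for the difference in means
```

On the real data, the corresponding call would be t.test(Customer_Age ~ Attrition_Flag, data = bank_data).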
Exercise: Check whether Contacts_Count_12_mon has an effect on Attrition_Flag or not.