Load Libraries


library(tidyverse)
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")


Covariation

If variation describes the behavior within a single variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.

How you do that again depends on the types of the variables involved. The sections below give a brief summary of each case.


Two Categorical Variables

In the data visualization part, we learned to create stacked bar charts to investigate the effect of two categorical variables on a third numeric variable.

However, that does not check the relationship between the two categorical variables themselves. In statistics, we usually want to know whether two categorical variables are dependent on or independent of each other. If they are highly dependent, then knowing the category of one variable helps us predict the category of the other more accurately.

For example, whether one brings an umbrella is highly dependent on whether it rains that day. On the other hand, whether one eats bread for breakfast has little to do with whether it rains.

To check the covariation, we use the chisq.test and table functions together: table builds the contingency table of counts, and chisq.test runs a chi-squared test of independence on it. For example, let’s check whether Attrition_Flag and Gender are dependent.

chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(bank_data$Attrition_Flag, bank_data$Gender)
## X-squared = 13.866, df = 1, p-value = 0.0001964

Check the p-value here. The null hypothesis is that the two variables are independent; if the p-value is very small (<0.05 is a common criterion), we reject that hypothesis and conclude that the two variables are dependent.

Analysis: In this example, the p-value is quite small, so the two variables are dependent. This can be explained by the frequency table:
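As a minimal sketch of how to read more out of the test than the printed summary, the block below rebuilds the 2x2 table from the counts reported in this section (so it runs without the CSV file) and extracts the p-value and the expected counts under independence. The object names counts and result are chosen here for illustration only.

```r
# Contingency table rebuilt from the counts reported in this section
counts <- matrix(
  c(930, 4428, 697, 4072),
  nrow = 2,
  dimnames = list(
    Attrition_Flag = c("Attrited Customer", "Existing Customer"),
    Gender = c("F", "M")
  )
)

# For a 2x2 table, chisq.test applies Yates' continuity correction by default
result <- chisq.test(counts)

result$p.value   # the p-value as a plain number
result$expected  # counts we would expect if the two variables were independent
```

Comparing result$expected with the observed counts shows which cells drive the test statistic.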

table(bank_data$Attrition_Flag, bank_data$Gender)
##                    
##                        F    M
##   Attrited Customer  930  697
##   Existing Customer 4428 4072

So we see that the female-to-male ratio is noticeably higher among attrited customers (930/697 ≈ 1.33) than among existing customers (4428/4072 ≈ 1.09). In other words, the data set suggests that female customers are more likely to cancel their credit cards.
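One way to make this comparison explicit is to convert the counts to proportions within each gender. The sketch below uses prop.table on a table rebuilt from the counts above; the names counts and col_props are illustrative.

```r
# Frequency table rebuilt from the counts shown above
counts <- matrix(
  c(930, 4428, 697, 4072),
  nrow = 2,
  dimnames = list(
    Attrition_Flag = c("Attrited Customer", "Existing Customer"),
    Gender = c("F", "M")
  )
)

# margin = 2 divides each column by its total, so each gender sums to 1
col_props <- prop.table(counts, margin = 2)
round(col_props, 3)
```

The first row then gives the attrition rate within each gender: roughly 17% for female customers and 15% for male customers. With the full data set loaded, prop.table(table(bank_data$Attrition_Flag, bank_data$Gender), margin = 2) should give the same figures.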

We can also create a grouped bar chart to illustrate this:

ggplot(bank_data) + 
  geom_bar(aes(x = Attrition_Flag, fill = Gender), position = "dodge")

But it is less straightforward to see the dependence from the graph than from the summary table.
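A possible alternative, sketched here under the assumption that proportions are easier to compare than raw counts, is a bar chart with position = "fill", which rescales each bar to height 1. The data frame gender_counts is rebuilt from the frequency table above and is not part of the original data set.

```r
library(ggplot2)

# Counts copied from the frequency table above
gender_counts <- data.frame(
  Attrition_Flag = rep(c("Attrited Customer", "Existing Customer"), times = 2),
  Gender = rep(c("F", "M"), each = 2),
  n = c(930, 4428, 697, 4072)
)

# position = "fill" stacks the bars and rescales each one to a total of 1,
# so the share of attrited customers can be compared across genders
p <- ggplot(gender_counts) +
  geom_col(aes(x = Gender, y = n, fill = Attrition_Flag), position = "fill") +
  labs(y = "Proportion of customers")
p
```

With bank_data loaded, the same chart can be drawn directly with geom_bar(aes(x = Gender, fill = Attrition_Flag), position = "fill").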


Lab Exercise:

Check whether Attrition_Flag is dependent on Marital_Status or not.


Two Numeric Variables

For numeric variables, we simply create a scatter plot and/or a line graph. A correlation analysis is usually performed, too.

Let’s take Customer_Age and Credit_Limit as an example:

ggplot(bank_data, aes(Customer_Age, Credit_Limit)) + geom_point() + geom_smooth()

The points lie on vertical lines because Customer_Age is a discrete variable that only takes integer values. The correlation between the two variables appears to be quite weak. But the smoothed curve does show a reasonable trend: the middle-aged group has higher credit limits than young professionals and retired people, which agrees with common sense.

Pearson’s Correlation Coefficient

It is also common to check Pearson’s correlation coefficient between two numeric variables. Note, however, that it only measures linear correlation and cannot capture nonlinear relationships.

Use cor(<data1>, <data2>) to compute the coefficient in R. A value around zero indicates little correlation. A value around \(1\) or \(-1\) indicates strong linear correlation, as shown in the figure below.

cor(bank_data$Customer_Age, bank_data$Credit_Limit)
## [1] 0.002476227

We see a very weak linear correlation here. But as the smoothed curve indicates, there can still be a nonlinear relationship between the two variables.
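To illustrate why a coefficient near zero does not rule out a relationship, here is a small sketch on synthetic data (not the bank data). The vectors x, y_linear and y_curved are made up for this example.

```r
# Synthetic data: x is symmetric around zero
x <- -10:10

y_linear <- 2 * x + 1  # exact linear relationship
y_curved <- x^2        # exact, but nonlinear (U-shaped) relationship

cor(x, y_linear)  # a perfect linear relationship gives a coefficient of 1
cor(x, y_curved)  # essentially zero, although y_curved is fully determined by x
```

This is why it helps to look at the scatter plot and the smoothed curve before relying on the coefficient alone.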


One Numeric Variable and One Categorical Variable

For one numeric variable and one categorical variable, we commonly use boxplots to visualize the relationship. Let’s take Attrition_Flag and Customer_Age as an example.

ggplot(bank_data) + 
  geom_boxplot(aes(x = Customer_Age, y = Attrition_Flag))

We see little difference here, with the median customer age slightly higher in the group of attrited customers.

When there are few outliers, we may use the two-sample t-test to check quantitatively whether the difference in age between the two groups is significant.

data1 <- bank_data$Customer_Age[bank_data$Attrition_Flag != "Existing Customer"]
data2 <- bank_data$Customer_Age[bank_data$Attrition_Flag == "Existing Customer"]
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 1.8988, df = 2370.8, p-value = 0.05772
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01302059  0.80777731
## sample estimates:
## mean of x mean of y 
##  46.65950  46.26212

The code above performs a two-sample t-test of whether the mean customer age is the same for attrited customers and existing customers. The p-value is above 0.05, so we cannot reject the null hypothesis of equal means at the 5% level: the data do not provide strong evidence of a difference in mean age. This is consistent with the 95% confidence interval for the difference, which contains 0.
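The same test can also be written with the formula interface of t.test, which avoids building the two vectors by hand. The sketch below uses a small synthetic data frame named toy as a stand-in for bank_data; its values are randomly generated for illustration and do not come from the data set.

```r
# Synthetic stand-in for bank_data (made-up values, illustration only)
set.seed(1)
toy <- data.frame(
  Attrition_Flag = rep(c("Attrited Customer", "Existing Customer"),
                       times = c(40, 160)),
  Customer_Age = round(rnorm(200, mean = 46, sd = 8))
)

# Formula interface: numeric response ~ two-level grouping variable
res_formula <- t.test(Customer_Age ~ Attrition_Flag, data = toy)

# Equivalent two-vector call, in the style used in the text
res_vectors <- t.test(
  toy$Customer_Age[toy$Attrition_Flag != "Existing Customer"],
  toy$Customer_Age[toy$Attrition_Flag == "Existing Customer"]
)

res_formula$p.value   # same p-value as res_vectors$p.value
res_formula$conf.int  # confidence interval for the difference in means
```

On the real data, the corresponding call would be t.test(Customer_Age ~ Attrition_Flag, data = bank_data).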


Lab Exercise:

Analyze whether the variable Contacts_Count_12_mon has an effect on Attrition_Flag or not.