Let’s load the following libraries for this lecture.
library(tidyverse)
library(openintro)
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
If variation describes the behavior within a single variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables or do a statistical test.
How you do that should again depend on the types of variables involved. Below are the recommended plot types and statistical tests, followed by a generic sketch:
Two categorical (ordinal) variables: A stacked proportion bar plot or a chi-square test
Two numeric variables: scatter plot along with a line graph or computing the correlation coefficient
One numeric, one categorical (ordinal): grouped density plot or a t-test
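The sketch below summarizes these three cases. The names df, cat1, cat2, num1, and num2 are hypothetical placeholders (they are not columns of our data set), and the code assumes the tidyverse is loaded as above.
# Hypothetical data frame `df` with categorical columns cat1, cat2 (cat2 has two levels)
# and numeric columns num1, num2; placeholders only, not part of bank_data
ggplot(df) + geom_bar(aes(x = cat1, fill = cat2), position = "fill")   # two categorical: stacked proportions
chisq.test(table(df$cat1, df$cat2))                                    # two categorical: chi-square test
ggplot(df, aes(num1, num2)) + geom_point() + geom_smooth()             # two numeric: scatter plot + smooth
cor(df$num1, df$num2)                                                  # two numeric: correlation coefficient
ggplot(df) + geom_density(aes(x = num1, fill = cat2), alpha = 0.5)     # numeric vs. categorical: grouped densities
t.test(num1 ~ cat2, data = df)                                         # numeric vs. categorical: t-test (two levels)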
In the data visualization part, we learned to create stacked bar charts to investigate the effect of two categorical variables on a third numeric variable.
However, that does not check the relationship between those two categorical variables themselves. In statistics, we usually want to know whether two categorical variables are dependent or independent of each other. If they are highly dependent, then knowing the category of one variable helps us predict the category of the other with better accuracy.
For example, whether one brings an umbrella or not is highly dependent on whether it rains on a given day. On the other hand, whether one eats bread for breakfast hardly depends on whether it rains or not.
Let’s check whether Gender and Dependent_count have a significant impact on Attrition_Flag. We can do this with a graph or a table:
ggplot(bank_data) +
  geom_bar(aes(x = Gender, fill = Attrition_Flag), position = "fill") +
  labs(y = "proportion")
bank_data %>%
  group_by(Gender) %>%
  summarize(Attrition_rate = prop.table(table(Attrition_Flag))[1])
## # A tibble: 2 × 2
##   Gender Attrition_rate
##   <chr>           <dbl>
## 1 F               0.174
## 2 M               0.146
So we see that the attrition rate is noticeably higher among female customers than male ones. In other words, the data suggest that female customers are more likely to cancel their credit cards.
But please keep in mind that this difference is not necessarily statistically significant (that depends on other factors such as the sample size). Later we will learn how to run statistical tests to draw a more quantitative conclusion.
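As a side note, the prop.table(table(...))[1] idiom above may look cryptic. An equivalent way to compute the same rates, shown here as a sketch, is to average a logical condition, using the fact that attrited customers are exactly those whose Attrition_Flag is not "Existing Customer":
# The mean of a logical vector is the proportion of TRUE values,
# i.e., the share of customers who are not "Existing Customer"
bank_data %>%
  group_by(Gender) %>%
  summarize(Attrition_rate = mean(Attrition_Flag != "Existing Customer"))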
To check the covariation, we use the chisq.test function together with the table function. For example, let’s check whether Attrition_Flag and Gender are dependent or not.
chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(bank_data$Attrition_Flag, bank_data$Gender)
## X-squared = 13.866, df = 1, p-value = 0.0001964
So one would check the p-value here. If it is very small (a common criterion is p < 0.05), we reject the hypothesis of independence and conclude that the two variables are dependent.
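If you want to use the p-value programmatically rather than read it off the printout, the object returned by chisq.test stores it in its p.value component, for example:
# Store the test object and extract the p-value directly
test_result <- chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
test_result$p.value          # about 0.0002, as printed above
test_result$p.value < 0.05   # TRUE, so we treat the two variables as dependent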
Check whether Marital_Status or Dependent_count is correlated with Attrition_Flag or not, using a graph and a statistical test.
Are Marital_Status and Dependent_count correlated with each other? Can you explain the result?
For numeric variables, we simply create a scatter plot and/or a line graph. A correlation analysis is usually performed, too.
Let’s take Customer_Age and Credit_Limit as an example:
ggplot(bank_data, aes(Customer_Age, Credit_Limit)) + geom_point() + geom_smooth()
The points lie on vertical lines because Customer_Age is a discrete variable that only takes integer values. The correlation between the two variables appears quite weak. Still, the smoothed line shows a reasonable trend: the middle-aged group has higher credit limits than young professionals and retired people, which agrees with common sense.
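To back this visual impression up with numbers, one simple check is to compute the average credit limit within each decade of age; grouping by decades is just one illustrative choice.
# Average credit limit by decade of age (decade grouping is an arbitrary but convenient choice)
bank_data %>%
  group_by(age_decade = 10 * (Customer_Age %/% 10)) %>%
  summarize(n = n(), mean_credit_limit = mean(Credit_Limit))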
It is also common to check Pearson’s correlation coefficient between two numeric variables. But note that it only measures linear correlation and cannot capture nonlinear relationships.
Use cor(<data1>, <data2>) to compute the coefficient in R. A value around zero indicates little correlation. A value around \(1\) or \(-1\) indicates strong linear correlation, as shown in the figure below.
cor(bank_data$Customer_Age, bank_data$Credit_Limit)
## [1] 0.002476227
We see a very weak linear correlation here. But as the line graph indicates, there can be some nonlinear correlation between the two variables. For example, we can check whether the credit limit is correlated with the square of customer age.
# Center Customer_Age before squaring so the squared term captures the quadratic pattern
mean_age <- mean(bank_data$Customer_Age)
mean_credit_limit <- mean(bank_data$Credit_Limit)
cor((bank_data$Customer_Age - mean_age)^2, bank_data$Credit_Limit - mean_credit_limit)
## [1] -0.09446443
So there is indeed a quadratic correlation between customer age and credit limit. Please note that to detect such a nonlinear (here, quadratic) pattern, the variable must be centered (its mean subtracted) before it is squared; otherwise the squared term mostly repeats the linear trend. Subtracting the mean from Credit_Limit does not change the result, since the correlation coefficient is invariant to shifts.
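To see why the centering matters, compare with the correlation computed from the raw (uncentered) squared age. Since age is positive and varies over a fairly narrow range, the uncentered square is almost a linear function of age, so this mostly reproduces the weak linear correlation found above:
# Without centering, the squared term mostly reproduces the linear trend
cor(bank_data$Customer_Age^2, bank_data$Credit_Limit)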
For one numeric variable and one categorical variable, we commonly use boxplots to visualize their relationship. However, boxplots are sometimes not informative enough, so you are recommended to compare the density curves across groups instead. Let’s take Attrition_Flag and Customer_Age as an example.
ggplot(bank_data) +
  geom_density(aes(x = Customer_Age, fill = Attrition_Flag), alpha = 0.5)
We see the two distributions nearly overlapping with each other, so at least we can say that the effect of Customer_Age on Attrition_Flag does not appear very strong.
In comparison, we can create the same plot for Attrition_Flag and Total_Trans_Ct. An evidently more pronounced difference is observed.
ggplot(bank_data) +
  geom_density(aes(x = Total_Trans_Ct, fill = Attrition_Flag), alpha = 0.5)
We see that customers with higher transaction counts (over 50 or so) are much less likely to cancel their credit card service.
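We can quantify this impression with a simple split at 50 transactions; the cutoff is read off the plot, so it is only indicative.
# Attrition rate for customers with at most 50 vs. more than 50 transactions
bank_data %>%
  group_by(many_transactions = Total_Trans_Ct > 50) %>%
  summarize(n = n(), Attrition_rate = mean(Attrition_Flag != "Existing Customer"))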
When there are few outliers, we may use the two-sample t-test to tell quantitatively whether age has a significant effect here.
data1 <- bank_data$Customer_Age[bank_data$Attrition_Flag != "Existing Customer"]
data2 <- bank_data$Customer_Age[bank_data$Attrition_Flag == "Existing Customer"]
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 1.8988, df = 2370.8, p-value = 0.05772
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01302059 0.80777731
## sample estimates:
## mean of x mean of y
## 46.65950 46.26212
The first two lines of code divide the customer ages into two groups: ages of attrited customers (data1) and ages of existing customers (data2).
The t.test function performs a two-sample t-test to see whether the mean customer age is the same among existing customers and attrited customers. The p-value is above 0.05, which means we cannot reject the null hypothesis: the difference in means may well be negligible.
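Equivalently, the same test can be written more compactly with the formula interface of t.test, which splits Customer_Age by the two levels of Attrition_Flag automatically:
# Equivalent two-sample t-test using the formula interface
t.test(Customer_Age ~ Attrition_Flag, data = bank_data)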
When there are more than two comparison groups, we may turn to the F-test (one-way ANOVA) using the oneway.test function:
oneway.test(bank_data$Customer_Age ~ bank_data$Education_Level)
##
## One-way analysis of means (not assuming equal variances)
##
## data: bank_data$Customer_Age and bank_data$Education_Level
## F = 2.4125, num df = 6.0, denom df = 2672.9, p-value = 0.02503
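Note that oneway.test also accepts a data argument, and setting var.equal = TRUE gives the classical ANOVA F-test that assumes equal group variances:
# Same test written with the data argument; var.equal = TRUE assumes equal variances
oneway.test(Customer_Age ~ Education_Level, data = bank_data)
oneway.test(Customer_Age ~ Education_Level, data = bank_data, var.equal = TRUE)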
For all the statistical tests introduced in this module, there are assumptions that need to be met. More technical details are covered in the regression class. For your reference, you can learn from the following course on Coursera:
Analyze whether the variable Avg_Utilization_Ratio is correlated with Attrition_Flag or not, using a graph and a statistical test. Explain your result.
We may use the cut function to convert a numeric variable into customized groups. For example, for Customer_Age, we may want to classify all ages into the groups <= 30, 31-40, 41-50, 51-60, and > 60, and then analyze its effect. The following code realizes this:
bank_data <- mutate(bank_data, Age_group = cut(Customer_Age, breaks = c(0, 30, 40, 50, 60, Inf)))
bank_data %>%
  group_by(Age_group) %>%
  summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1])
## # A tibble: 5 × 2
##   Age_group Attrition_ratio
##   <fct>               <dbl>
## 1 (0,30]              0.121
## 2 (30,40]             0.145
## 3 (40,50]             0.167
## 4 (50,60]             0.168
## 5 (60,Inf]            0.143
So we see that customers between 40 and 60 are most likely to churn.
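If the default interval labels such as (0,30] look cryptic, cut also accepts a labels argument; the label strings below are just one possible choice:
# Same breaks as above, but with human-readable labels (label text is an arbitrary choice)
bank_data <- mutate(bank_data,
  Age_group = cut(Customer_Age, breaks = c(0, 30, 40, 50, 60, Inf),
                  labels = c("<=30", "31-40", "41-50", "51-60", ">60")))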
Cut Total_Trans_Ct into a few reasonable categories and study its effect on Attrition_Flag.
An interesting yet challenging part of data exploration is that sometimes there are complicated composite effects between variables. For example, if we only look at the effect of credit limit on churning, it does not appear very significant:
ggplot(bank_data) +
  geom_density(aes(x = Credit_Limit, fill = Attrition_Flag), alpha = 0.5)
ggplot(bank_data) +
  geom_boxplot(aes(x = Credit_Limit, y = Attrition_Flag))
However, if we also put Customer_Age in the picture, we see some composite effect:
ggplot(bank_data) +
  geom_point(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag), alpha = 0.5) +
  geom_smooth(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag))
So it seems that for younger customers (under 55 or so), churning customers tended to have a lower credit limit on average. But the trend reverses for older customers, who were more likely to churn when they had a higher credit limit. Can you try to explain why?
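A rough numeric check of this pattern is to split the data at age 55 (a cutoff read off the plot) and compare the average credit limits of attrited and existing customers within each half:
# Mean credit limit by attrition status, for customers aged 55 and below vs. above 55
bank_data %>%
  group_by(over_55 = Customer_Age > 55, Attrition_Flag) %>%
  summarize(n = n(), mean_credit_limit = mean(Credit_Limit), .groups = "drop")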
In this example, we try to analyze the effect of two numeric variables on a categorical variable. So we use a grouped scatter/smooth plot.
What if we try to analyze the composite effect of two categorical variables? Let’s analyze the composite effect of Gender and Income_Category:
bank_data %>%
  group_by(Income_Category, Gender) %>%
  summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1]) %>%
  arrange(Attrition_ratio)
## # A tibble: 9 × 3
## # Groups:   Income_Category [6]
##   Income_Category Gender Attrition_ratio
##   <chr>           <chr>            <dbl>
## 1 Unknown         M               0.0962
## 2 Less than $40K  M               0.108
## 3 $60K - $80K     M               0.135
## 4 $40K - $60K     M               0.135
## 5 $80K - $120K    M               0.158
## 6 $40K - $60K     F               0.164
## 7 Unknown         F               0.172
## 8 $120K +         M               0.173
## 9 Less than $40K  F               0.177
In this example, we break our data into groups defined by all possible combinations of Gender and Income_Category, and study the effect by listing the attrition ratio for each group in a table.
We can also use facet_wrap to help us visualize the effect:
# Relevel the income category (factor levels will be covered later; not required here)
bank_data$Income_Category <- factor(bank_data$Income_Category,
                                    levels = unique(bank_data$Income_Category)[c(2,4,1,3,5,6)])
ggplot(bank_data) +
  geom_bar(aes(y = Income_Category, fill = Attrition_Flag), position = "fill") +
  facet_wrap(~ Gender, nrow = 2)
Question: What can we learn from the result?
Analyze the composite effect of Education_Level and Marital_Status on Attrition_Flag.
In the last two modules, we learned how to start the process of EDA for a given data set. In practice, this only serves as the first step of data analysis: it provides basic information about the data set, which helps us ask and answer useful questions more efficiently and reasonably. Meaningful analysis usually comes after these basic steps. For example, let’s now think about the following questions:
If we are going to build a model to predict potential churning customers, which variables would you keep in the model, or would you keep them all? How can we incorporate the knowledge extracted from our exploration into the next stage?
If we are asked to propose solutions to improve the retention rate of credit card customers for the bank, what are reasonable directions to explore?
Guided by these questions, go back and review all the results we created, and think about whether any of them provide useful information that helps answer the two questions above.