Load Libraries

library(tidyverse)


0. Introduction

In this module, we are going to work through an example study to familiarize ourselves with the EDA process.

EDA stands for “Exploratory Data Analysis”. Literally, it refers to “exploring data” to gain useful insights. We need to explore because we won’t know what is hidden in a data set, maybe treasure, maybe only garbage, unless we inspect it thoroughly.

The process of EDA usually involves a cycle of i) asking questions of specific or general interest, and ii) answering them with data transformation, visualization, and modeling, which typically generates new questions and thus closes the loop. Here we will focus on data transformation and visualization, and leave the modeling part to future classes.

There is no standard approach to exploring data; if there were one, it would be questionable, as the quotes below suggest:

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

So the best way to learn EDA is to practice. We will use a data set of bank customers, which is available from Kaggle (one of the best-known data science sites):

https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers

Please download the data set file in csv format. Let’s first load the data set into R using the read_csv function offered by the readr package in tidyverse.


Put the data set file into the working folder

There are multiple ways to load a data set file into R. Let’s first learn how to find the current working folder on your computer.

getwd()

This command returns the current working folder of RStudio. If you put the downloaded data set file into this folder, we can read it directly without giving any path information; otherwise we need to supply the complete path of the data file.

bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
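If BankChurners.csv has been placed in the current working folder, a relative path is enough; the following line (assuming the file keeps its original name from Kaggle) does the same job:

bank_data <- read_csv("BankChurners.csv")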

Now we can work on the data frame bank_data just as we did in all previous examples. Let’s use this example to walk through a reasonable (but definitely not routine) process for exploring data.

1. Data Overview

The first thing is nearly always to take a glimpse of our data to collect basic information:

glimpse(bank_data)
## Rows: 10,127
## Columns: 20
## $ Attrition_Flag           <chr> "Existing Customer", "Existing Customer", "Ex…
## $ Customer_Age             <dbl> 45, 49, 51, 40, 40, 44, 51, 32, 37, 48, 42, 6…
## $ Gender                   <chr> "M", "F", "M", "F", "M", "M", "M", "M", "M", …
## $ Dependent_count          <dbl> 3, 5, 3, 4, 3, 2, 4, 0, 3, 2, 5, 1, 1, 3, 2, …
## $ Education_Level          <chr> "High School", "Graduate", "Graduate", "High …
## $ Marital_Status           <chr> "Married", "Single", "Married", "Unknown", "M…
## $ Income_Category          <chr> "$60K - $80K", "Less than $40K", "$80K - $120…
## $ Card_Category            <chr> "Blue", "Blue", "Blue", "Blue", "Blue", "Blue…
## $ Months_on_book           <dbl> 39, 44, 36, 34, 21, 36, 46, 27, 36, 36, 31, 5…
## $ Total_Relationship_Count <dbl> 5, 6, 4, 3, 5, 3, 6, 2, 5, 6, 5, 6, 3, 5, 5, …
## $ Months_Inactive_12_mon   <dbl> 1, 1, 1, 4, 1, 1, 1, 2, 2, 3, 3, 2, 6, 1, 2, …
## $ Contacts_Count_12_mon    <dbl> 3, 2, 0, 1, 0, 2, 3, 2, 0, 3, 2, 3, 0, 3, 2, …
## $ Credit_Limit             <dbl> 12691.0, 8256.0, 3418.0, 3313.0, 4716.0, 4010…
## $ Total_Revolving_Bal      <dbl> 777, 864, 0, 2517, 0, 1247, 2264, 1396, 2517,…
## $ Avg_Open_To_Buy          <dbl> 11914.0, 7392.0, 3418.0, 796.0, 4716.0, 2763.…
## $ Total_Amt_Chng_Q4_Q1     <dbl> 1.335, 1.541, 2.594, 1.405, 2.175, 1.376, 1.9…
## $ Total_Trans_Amt          <dbl> 1144, 1291, 1887, 1171, 816, 1088, 1330, 1538…
## $ Total_Trans_Ct           <dbl> 42, 33, 20, 20, 28, 24, 31, 36, 24, 32, 42, 2…
## $ Total_Ct_Chng_Q4_Q1      <dbl> 1.625, 3.714, 2.333, 2.333, 2.500, 0.846, 0.7…
## $ Avg_Utilization_Ratio    <dbl> 0.061, 0.105, 0.000, 0.760, 0.000, 0.311, 0.0…

So we see that there are about 10,000 rows and 20 columns. Note that this data set has no built-in help documentation; to check the meaning of each variable, we have to refer to the original Kaggle link. For this example, I list the meanings for you, but in the future you will need to look them up yourself.

Meaning of Variables

Attrition_Flag: whether the customer canceled the credit card or not
Customer_Age: age of the customer
Gender: gender of the customer
Dependent_count: number of dependents a customer has for tax purposes
Education_Level: highest degree a customer earned
Marital_Status: marital status
Income_Category: categorized annual income
Card_Category: category of the credit card
Months_on_book: length (in months) of a customer’s relationship with the bank
Total_Relationship_Count: total number of bank products held by the customer
Months_Inactive_12_mon: number of months inactive in the last 12 months
Contacts_Count_12_mon: number of contacts with the bank in the last 12 months
Credit_Limit: credit line of a customer
Total_Revolving_Bal: total revolving balance on the credit card
Avg_Open_To_Buy: amount of credit line available (average of last 12 months)
Total_Amt_Chng_Q4_Q1: change in transaction amount (Q4 over Q1 ratio)
Total_Trans_Amt: total transaction amount (last 12 months)
Total_Trans_Ct: total transaction count (last 12 months)
Total_Ct_Chng_Q4_Q1: change in transaction count (Q4 over Q1 ratio)
Avg_Utilization_Ratio: average card utilization ratio

According to the data source on Kaggle, a key question to answer from this data set is: which factors are important for predicting whether a customer is likely to cancel the credit card service?

Therefore, we do have a target variable, Attrition_Flag. We shall keep this question in mind as the EDA process progresses.


2. Data Inspection

After we get the basic information about the data set, we usually need to inspect each variable to get some “feeling” for our data. To do this, we check the distribution of each variable to visualize its variation.


Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. Every variable has its own pattern of variation, which may reveal interesting information.


For categorical (ordinal) variables, create a bar plot or make a frequency table

For example, if we hope to know the distribution of Attrition_Flag, which is a categorical variable, we can do the following:

table(bank_data$Attrition_Flag)
## 
## Attrited Customer Existing Customer 
##              1627              8500
prop.table(table(bank_data$Attrition_Flag))
## 
## Attrited Customer Existing Customer 
##         0.1606596         0.8393404

The table() function creates a frequency table for a categorical variable. Passing that table to the prop.table() function further gives the proportion of each category. Here we see that the variable takes only two values - existing customer and attrited customer. Also, there are many more existing customers than attrited customers. This information is very important for modeling - the data is imbalanced between the two key groups that we hope to study.
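As a tidyverse-style alternative (just a sketch that reproduces the same information), count() and mutate() give the counts and proportions in one data frame:

bank_data %>%
  count(Attrition_Flag) %>%
  mutate(prop = n / sum(n))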

We may also create a bar plot to show the same information:

ggplot(bank_data) + geom_bar(aes(Attrition_Flag))


Lab Exercise:

Create a frequency table and a bar plot for the variables Gender and Card_Category. Summarize your findings.


For numeric variables, create a histogram and a summary

To understand the distribution of a numeric variable, the most typical action is to create a histogram, usually complemented by a numeric summary. Let’s take the variable Credit_Limit as an example:

summary(bank_data$Credit_Limit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1438    2555    4549    8632   11068   34516
ggplot(bank_data) + geom_histogram(aes(Credit_Limit))

Analysis: This histogram provides some useful information. We see a bimodal distribution - one mode around $2,000 and another around the maximum value of $34,516. The origin of the second mode may deserve further investigation; for now we just keep this information in mind.
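A side note on the histogram itself: geom_histogram() defaults to 30 bins and prints a message suggesting you pick a better value, so it is usually worth setting binwidth by hand. The value below ($1,000) is just one reasonable choice, not something dictated by the data:

ggplot(bank_data) + geom_histogram(aes(Credit_Limit), binwidth = 1000)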


Lab Exercise:

Create a histogram and a summary for the variables Customer_Age and Total_Trans_Amt. Summarize your findings.


Lab Exercise:

Find another discrete variable in the bank data set and explore its distribution with a table and a graph.


Check for NA values

An important step is to check whether there are any missing values in the data set. We may simply use

sum(is.na(bank_data))
## [1] 0

which returns the number of NA values in the whole data set. If it is non-zero, we can further check for each column by the following code:

map(bank_data, ~ sum(is.na(.)))
## $Attrition_Flag
## [1] 0
## 
## $Customer_Age
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Dependent_count
## [1] 0
## 
## $Education_Level
## [1] 0
## 
## $Marital_Status
## [1] 0
## 
## $Income_Category
## [1] 0
## 
## $Card_Category
## [1] 0
## 
## $Months_on_book
## [1] 0
## 
## $Total_Relationship_Count
## [1] 0
## 
## $Months_Inactive_12_mon
## [1] 0
## 
## $Contacts_Count_12_mon
## [1] 0
## 
## $Credit_Limit
## [1] 0
## 
## $Total_Revolving_Bal
## [1] 0
## 
## $Avg_Open_To_Buy
## [1] 0
## 
## $Total_Amt_Chng_Q4_Q1
## [1] 0
## 
## $Total_Trans_Amt
## [1] 0
## 
## $Total_Trans_Ct
## [1] 0
## 
## $Total_Ct_Chng_Q4_Q1
## [1] 0
## 
## $Avg_Utilization_Ratio
## [1] 0

If there are any NA values, then we will need to decide later how to handle them.
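If you prefer a compact named vector over the long list above, map_int() (the type-stable variant of map()) returns the same per-column counts in a single line:

map_int(bank_data, ~ sum(is.na(.)))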


Lab Exercise:

Check how many NA values there are in each column of the flights data set.


Check for unusual values (potential outliers)

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. It is very important to be aware of them before our analysis.

Let’s use the diamonds data set as an example. If we create a histogram of the y variable (the width of a diamond in mm), we see something strange.

ggplot(diamonds) + geom_histogram(aes(y))

We see that most data are within 1-10 mm, yet the plot extends to 60 mm. This indicates that there are some unusually large values in y. There are so many observations in the common bins that the rare bins are so short you can’t see them.

To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian():

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

Now we see that there are a few suspicious data points with y near 0, 30, or 60 mm. They do not look correct (how can a diamond be 6 cm wide?), so we will inspect them more closely.

An important side note: we cannot use ylim() to zoom here, because that would remove all bins whose counts are higher than 50!

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  ylim(0, 50)

Check whether unusual values should be removed or replaced with NA

Since diamonds that are 0, 30, or 60 mm wide are unlikely to be real, let’s look at those data in detail using the filter function:

strange_y <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  print()
## # A tibble: 9 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139  0      0    0   
## 2  1.14 Fair      G     VS1      57.5    67  6381  0      0    0   
## 3  2    Premium   H     SI2      58.9    57 12210  8.09  58.9  8.06
## 4  1.56 Ideal     G     VS2      62.2    54 12800  0      0    0   
## 5  1.2  Premium   D     VVS1     62.1    59 15686  0      0    0   
## 6  2.25 Premium   H     SI2      62.8    59 18034  0      0    0   
## 7  0.51 Ideal     E     VS1      61.8    55  2075  5.15  31.8  5.12
## 8  0.71 Good      F     SI2      64.1    60  2130  0      0    0   
## 9  0.71 Good      F     SI2      64.1    60  2130  0      0    0

Now we see that the zero values clearly represent missing size information. The 58.9 mm and 31.8 mm values of y are probably incorrect since they do not match the carat and price data (those diamonds would be far too light and too cheap if the widths were real).

Now that we have a good idea that those unusual values are either missing or incorrect, we will replace them with NA.

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

So we can work on the diamonds2 data set from now on. In practice, one keeps doing this until all unusual values are properly labeled or removed.
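As a quick sanity check (a sketch), we can count how many NA values were introduced into y; given the nine rows filtered above, this should return 9:

sum(is.na(diamonds2$y))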


Covariation

If variation describes the behavior within a single variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.

How you do that again depends on the types of variables involved. The recommended plot types are listed below:

  • Two categorical (ordinal) variables: A stacked proportion bar plot
  • Two numeric variables: scatter plot along with a line graph
  • One numeric, one categorical (ordinal): grouped density plot


Two Categorical Variables

In the data visualization module, we learned to create a stacked bar chart to investigate the effect of two categorical variables on a third numeric variable.

However, that does not check the relationship between those two variables themselves. In statistics, we usually hope to know whether two categorical variables are dependent on or independent of each other. If they are highly dependent, then knowing the category of one variable helps us predict the category of the other with better accuracy.

For example, whether one brings an umbrella is highly dependent on whether it rains that day. On the other hand, whether one eats bread for breakfast hardly depends on whether it rains.

Let’s check whether variables such as Gender and Dependent_count have a significant impact on Attrition_Flag. Taking Gender as an example, we can do this by graph or by table:

ggplot(bank_data) + 
  geom_bar(aes(x = Gender, fill = Attrition_Flag), position = "fill") +
  labs(y = "proportion")

bank_data %>%
  group_by(Gender) %>%
  summarize(Attrition_rate = prop.table(table(Attrition_Flag))[1])
## # A tibble: 2 × 2
##   Gender Attrition_rate
##   <chr>           <dbl>
## 1 F               0.174
## 2 M               0.146

So we see that the attrition rate is noticeably higher among female customers than male ones. In other words, the data set suggests that female customers are more likely to cancel their credit cards.

But please keep in mind that this difference is not necessarily statistically significant (that depends on other factors such as sample size). Later we will learn how to run statistical tests to draw a more quantitative conclusion.
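As a preview of such a test (not required in this module), one option is a chi-squared test of independence on the two-way table of Gender and Attrition_Flag:

chisq.test(table(bank_data$Gender, bank_data$Attrition_Flag))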


Lab Exercise:

Check whether Marital_Status or Dependent_count is correlated with Attrition_Flag or not.


Two numeric variables

For two numeric variables, we simply create a scatter plot and/or a line graph. A correlation analysis is usually performed, too.

Let’s take Customer_Age and Credit_Limit as an example:

ggplot(bank_data, aes(Customer_Age, Credit_Limit)) + geom_point() + geom_smooth()

The points fall on vertical lines because Customer_Age is a discrete variable that only takes integer values. It seems that the correlation between the two variables is quite weak. But we do see a reasonable trend in the smoothed line: middle-aged customers have higher credit limits than young professionals and retired people, which agrees with common sense.
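To attach a number to the weak correlation noted above, we can also compute Pearson’s correlation coefficient (a quick sketch; values near 0 indicate a weak linear relationship):

cor(bank_data$Customer_Age, bank_data$Credit_Limit)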


One numeric variable and one categorical variable

For one numeric variable and one categorical variable, we commonly use boxplots to visualize their relationship. However, boxplots are sometimes not informative enough, so here we recommend comparing the density curves between groups. Let’s take Attrition_Flag and Customer_Age as an example.

ggplot(bank_data) + 
  geom_density(aes(x = Customer_Age, fill = Attrition_Flag), alpha = 0.5)

We see that the two distributions nearly overlap, so at least we can say that the effect of Customer_Age on Attrition_Flag does not appear strong.

In comparison, we can create the same plot for Attrition_Flag and Total_Trans_Ct. An evidently larger difference is observed.

ggplot(bank_data) + 
  geom_density(aes(x = Total_Trans_Ct, fill = Attrition_Flag), alpha = 0.5)

We see that customers with higher transaction counts (over 50 or so) are much less likely to cancel their credit card service.
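For reference, the boxplot approach mentioned at the beginning of this subsection would look like the sketch below for Customer_Age; it conveys medians and quartiles but hides the shape of each distribution:

ggplot(bank_data) + 
  geom_boxplot(aes(x = Customer_Age, y = Attrition_Flag))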


Lab Exercise:

Analyze whether the variable Contacts_Count_12_mon has an effect on Attrition_Flag or not.


Cut a numeric variable into categorical groups

We may use the cut function to convert a numeric variable into customized groups. For example, for Customer_Age, we may want to classify all ages into groups of <= 30, 31-40, 41-50, 51-60, and > 60, and then analyze their effect. The following code realizes this:

bank_data <- mutate(bank_data, Age_group = cut(Customer_Age, breaks = c(0, 30, 40, 50, 60, Inf)))

bank_data %>%
  group_by(Age_group) %>%
  summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1])
## # A tibble: 5 × 2
##   Age_group Attrition_ratio
##   <fct>               <dbl>
## 1 (0,30]              0.121
## 2 (30,40]             0.145
## 3 (40,50]             0.167
## 4 (50,60]             0.168
## 5 (60,Inf]            0.143

So we see that customers between 40 and 60 are most likely to churn.
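If you prefer readable group names over interval notation, cut() also accepts a labels argument. The sketch below only tabulates the labeled groups and does not modify bank_data; the label strings are our own choice:

table(cut(bank_data$Customer_Age, breaks = c(0, 30, 40, 50, 60, Inf),
          labels = c("<=30", "31-40", "41-50", "51-60", ">60")))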


Lab Exercise:

Cut Total_Trans_Ct into a few reasonable categories and study its effect on Attrition_Flag.


Analyze the composite effect of two or more variables

An interesting yet challenging part of data exploration is that there are sometimes complicated composite effects between variables. For example, if we only look at the effect of credit limit on churning, it’s not that strong:

ggplot(bank_data) + 
  geom_density(aes(x = Credit_Limit, fill = Attrition_Flag), alpha = 0.5)

ggplot(bank_data) + 
  geom_boxplot(aes(x = Credit_Limit, y = Attrition_Flag)) 

However, if we also bring Customer_Age into the picture, we see some composite effect:

ggplot(bank_data) + 
  geom_point(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag), alpha = 0.5) + 
  geom_smooth(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag))

So it seems that among younger customers (less than 55), churning customers tended to have a lower credit limit on average. But the trend is the opposite for older customers, who were more likely to churn when they had a higher credit limit. Can you try to explain why?

In this example, we try to analyze the effect of two numeric variables on a categorical variable. So we use a grouped scatter/smooth plot.

What if we try to analyze the effect of two categorical variables?


Example

Let’s analyze the composite effect of Gender and Income_Category.

bank_data %>%
  group_by(Income_Category, Gender) %>%
  summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1]) %>%
  arrange(Attrition_ratio)
## # A tibble: 9 × 3
## # Groups:   Income_Category [6]
##   Income_Category Gender Attrition_ratio
##   <chr>           <chr>            <dbl>
## 1 Unknown         M               0.0962
## 2 Less than $40K  M               0.108 
## 3 $60K - $80K     M               0.135 
## 4 $40K - $60K     M               0.135 
## 5 $80K - $120K    M               0.158 
## 6 $40K - $60K     F               0.164 
## 7 Unknown         F               0.172 
## 8 $120K +         M               0.173 
## 9 Less than $40K  F               0.177

In this example, we break our data into groups defined by all possible combinations of Gender and Income_Category, and study the effect by listing the attrition ratio in a table.
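The same breakdown can also be visualized rather than tabulated, for example (a sketch) by faceting the proportion bar plot by Income_Category:

ggplot(bank_data) + 
  geom_bar(aes(x = Gender, fill = Attrition_Flag), position = "fill") +
  facet_wrap(~ Income_Category) +
  labs(y = "proportion")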


Lab Exercise:

Analyze the composite effect of Education_Level and Marital_Status on Attrition_Flag.


Effect of one categorical variable and one numeric variable

One way to approach this kind of problem is to cut the numeric variable into groups. For example, if we hope to study the composite effect of customer age and gender, we may do the following (using the Age_group variable created above):

bank_data %>%
  group_by(Age_group, Gender) %>%
  summarize(Attrition_rate = prop.table(table(Attrition_Flag))[1]) %>%
  arrange(Attrition_rate)
## # A tibble: 10 × 3
## # Groups:   Age_group [5]
##    Age_group Gender Attrition_rate
##    <fct>     <chr>           <dbl>
##  1 (0,30]    M               0.108
##  2 (0,30]    F               0.135
##  3 (30,40]   M               0.135
##  4 (60,Inf]  F               0.142
##  5 (60,Inf]  M               0.144
##  6 (40,50]   M               0.151
##  7 (50,60]   M               0.152
##  8 (30,40]   F               0.155
##  9 (50,60]   F               0.181
## 10 (40,50]   F               0.182


Lab Exercise:

Analyze the composite effect of Marital_Status and Months_on_book on Attrition_Flag.


Summary

In this module, we learned how to start the process of EDA for a given data set. In practice, these steps only serve as the first stage of EDA: they provide basic information about the data set, which helps us ask and answer useful questions more efficiently and reasonably. The true journey of exploring data begins after these basic steps. For example, let’s now think about the following questions:

  1. If we are going to build a model to predict potential churning customers, which variables would you keep in your model, or would you keep them all? How can we incorporate the knowledge extracted from our exploration into the next stage?

  2. If we are asked to propose solutions to improve the retention rate of the bank’s credit card customers, what are reasonable directions to explore?

Guided by these questions, now go back and review all the results we created. Is there any useful information that can help answer them?