Load Libraries

library(tidyverse)


0. Introduction

In this lecture and the next, we are going to work through an example study to familiarize ourselves with the EDA process - asking questions, answering them, and then generating new questions.

There is no standard approach for exploring data - and if there were one, it would be questionable, as the quotes below suggest:

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

So the best way to learn EDA is to practice. We will use a data set of bank customers, which is available from Kaggle (one of the best-known data science sites):

https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers

Please download the data set file in CSV format. Let's first load the data set into R using the read_csv() function offered by the readr package in tidyverse.


Put the data set file into the working folder

There are multiple ways to load a data set file into R. Let's first learn how to find the current working folder on your computer.

getwd()

This command returns the current working folder of RStudio. Put the downloaded data set file into this folder, and then we can read it directly without providing any path information.

bank_data <- read_csv("BankChurners.csv")

Simply put the file name into the read_csv() function and assign the result to a new name. Then we can work on the data frame bank_data just as we have done for all previous examples.
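
If you prefer not to move the file, you can instead pass a path to read_csv(). A minimal sketch - the path below is a hypothetical example; replace it with wherever you saved the file:

# Alternative: read the file via an explicit path
# (the path here is a made-up example - adjust it to your machine)
bank_data <- read_csv("~/Downloads/BankChurners.csv")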

Take a glimpse of our data.

glimpse(bank_data)
## Rows: 10,127
## Columns: 20
## $ Attrition_Flag           <chr> "Existing Customer", "Existing Customer", "Ex…
## $ Customer_Age             <dbl> 45, 49, 51, 40, 40, 44, 51, 32, 37, 48, 42, 6…
## $ Gender                   <chr> "M", "F", "M", "F", "M", "M", "M", "M", "M", …
## $ Dependent_count          <dbl> 3, 5, 3, 4, 3, 2, 4, 0, 3, 2, 5, 1, 1, 3, 2, …
## $ Education_Level          <chr> "High School", "Graduate", "Graduate", "High …
## $ Marital_Status           <chr> "Married", "Single", "Married", "Unknown", "M…
## $ Income_Category          <chr> "$60K - $80K", "Less than $40K", "$80K - $120…
## $ Card_Category            <chr> "Blue", "Blue", "Blue", "Blue", "Blue", "Blue…
## $ Months_on_book           <dbl> 39, 44, 36, 34, 21, 36, 46, 27, 36, 36, 31, 5…
## $ Total_Relationship_Count <dbl> 5, 6, 4, 3, 5, 3, 6, 2, 5, 6, 5, 6, 3, 5, 5, …
## $ Months_Inactive_12_mon   <dbl> 1, 1, 1, 4, 1, 1, 1, 2, 2, 3, 3, 2, 6, 1, 2, …
## $ Contacts_Count_12_mon    <dbl> 3, 2, 0, 1, 0, 2, 3, 2, 0, 3, 2, 3, 0, 3, 2, …
## $ Credit_Limit             <dbl> 12691.0, 8256.0, 3418.0, 3313.0, 4716.0, 4010…
## $ Total_Revolving_Bal      <dbl> 777, 864, 0, 2517, 0, 1247, 2264, 1396, 2517,…
## $ Avg_Open_To_Buy          <dbl> 11914.0, 7392.0, 3418.0, 796.0, 4716.0, 2763.…
## $ Total_Amt_Chng_Q4_Q1     <dbl> 1.335, 1.541, 2.594, 1.405, 2.175, 1.376, 1.9…
## $ Total_Trans_Amt          <dbl> 1144, 1291, 1887, 1171, 816, 1088, 1330, 1538…
## $ Total_Trans_Ct           <dbl> 42, 33, 20, 20, 28, 24, 31, 36, 24, 32, 42, 2…
## $ Total_Ct_Chng_Q4_Q1      <dbl> 1.625, 3.714, 2.333, 2.333, 2.500, 0.846, 0.7…
## $ Avg_Utilization_Ratio    <dbl> 0.061, 0.105, 0.000, 0.760, 0.000, 0.311, 0.0…

So we see that there are about 10,000 rows and 20 columns. Note that for this data set, there is no built-in help documentation. To check the meaning of each variable, we have to refer to the original Kaggle link. For this example, I list them for you, but in the future you will need to check them yourself.

Meaning of Variables

Variable Name              Meaning
-------------------------  -----------------------------------------------------------
Attrition_Flag             whether the customer canceled the credit card or not
Customer_Age               age of the customer
Gender                     gender of the customer
Dependent_count            number of dependents a customer has for tax purposes
Education_Level            highest degree a customer earned
Marital_Status             marital status
Income_Category            categorized annual income
Card_Category              category of the credit card
Months_on_book             length of a customer's relationship with the bank (months)
Total_Relationship_Count   total number of bank products held by the customer
Months_Inactive_12_mon     number of months inactive in the last 12 months
Contacts_Count_12_mon      number of contacts with the bank in the last 12 months
Credit_Limit               credit line of a customer
Total_Revolving_Bal        total revolving balance on the credit card
Avg_Open_To_Buy            amount of credit line available (average of last 12 months)
Total_Amt_Chng_Q4_Q1       change in transaction amount (Q4 over Q1 ratio)
Total_Trans_Amt            total transaction amount (last 12 months)
Total_Trans_Ct             total transaction count (last 12 months)
Total_Ct_Chng_Q4_Q1        change in transaction count (Q4 over Q1 ratio)
Avg_Utilization_Ratio      average card utilization ratio

The key question to answer from this data set is: which factors are important for predicting whether a customer is likely to cancel the credit card service?

We shall keep this question in mind as the EDA process progresses.


1. Data Inspection

After we get the basic information about the data set, we need to inspect each variable to understand it. To do this, we check the variation within each variable and the covariation between pairs of variables.


Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results.

Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.


For categorical (ordinal) variables, create a bar plot or make a frequency table

For example, suppose we want to know the distribution of Attrition_Flag, which is a categorical variable. We can do the following:

table(bank_data$Attrition_Flag)
## 
## Attrited Customer Existing Customer 
##              1627              8500

The table() function creates a frequency table for a categorical variable. We see that the variable takes only two values - "Existing Customer" and "Attrited Customer" - and that there are many more existing customers than attrited customers. This information is very important for modeling - the data set is imbalanced between the two key groups we hope to study.
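
To quantify the imbalance, we can convert the counts into proportions with prop.table(); a quick sketch:

# Share of each group; about 16% of customers have churned
prop.table(table(bank_data$Attrition_Flag))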

We may also create a bar plot to show the same information:

ggplot(bank_data) + geom_bar(aes(Attrition_Flag))


Lab Exercise:

Create a frequency table and a bar plot for variables Gender and Card_Category.


For numeric variables, create a histogram and a summary

To understand the distribution of a numeric variable, the most typical approach is to create a histogram, usually complemented with a numeric summary. Let's take the variable Credit_Limit as an example:

summary(bank_data$Credit_Limit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1438    2555    4549    8632   11068   34516
ggplot(bank_data) + geom_histogram(aes(Credit_Limit))

Analysis: This histogram provides some useful information. We see that this is a bimodal distribution - one mode around $2,000 and one at the maximum value of $34,516. The origin of the second mode may deserve further investigation; for now we just keep this observation in mind.
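
Note that geom_histogram() defaults to 30 bins, which can hide or exaggerate modes. It is worth re-plotting with an explicit binwidth to check that the bimodal shape is real; for example:

# Re-plot with a $1,000 binwidth to verify the two modes
ggplot(bank_data) + geom_histogram(aes(Credit_Limit), binwidth = 1000)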


Lab Exercise:

Create a histogram and a summary for variables Customer_Age and Total_Trans_Amt.


Check for NA values

An important step is to check whether there are any missing values in the data set. We may simply use

sum(is.na(bank_data))
## [1] 0

which returns the number of NA values in the whole data set. If it is non-zero, we can further check each column with the following code:

map(bank_data, ~ sum(is.na(.)))
## $Attrition_Flag
## [1] 0
## 
## $Customer_Age
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Dependent_count
## [1] 0
## 
## $Education_Level
## [1] 0
## 
## $Marital_Status
## [1] 0
## 
## $Income_Category
## [1] 0
## 
## $Card_Category
## [1] 0
## 
## $Months_on_book
## [1] 0
## 
## $Total_Relationship_Count
## [1] 0
## 
## $Months_Inactive_12_mon
## [1] 0
## 
## $Contacts_Count_12_mon
## [1] 0
## 
## $Credit_Limit
## [1] 0
## 
## $Total_Revolving_Bal
## [1] 0
## 
## $Avg_Open_To_Buy
## [1] 0
## 
## $Total_Amt_Chng_Q4_Q1
## [1] 0
## 
## $Total_Trans_Amt
## [1] 0
## 
## $Total_Trans_Ct
## [1] 0
## 
## $Total_Ct_Chng_Q4_Q1
## [1] 0
## 
## $Avg_Utilization_Ratio
## [1] 0
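
As a side note, the same per-column counts can be produced more compactly with dplyr's across(); a sketch:

# One-row summary: NA count per column
bank_data %>% summarise(across(everything(), ~ sum(is.na(.x))))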

If there are any NA values, we need to decide how we are going to handle them later.
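
Our bank data happens to be complete, but if there were NA values, one simple (and sometimes too aggressive) option is to drop the incomplete rows with tidyr's drop_na(); a sketch:

# Keep only rows with no missing values
# (reasonable only when few rows are affected)
bank_data_complete <- drop_na(bank_data)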


Lab Exercise:

Check how many NA values there are in each column of the flights data set.


Check for unusual values (potential outliers)

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. It is very important to be aware of them before our analysis.

Let's use the diamonds data set as an example. If we create a histogram of the y variable (the width of the diamond in mm), we see something strange.

ggplot(diamonds) + geom_histogram(aes(y))

We see that most of the data lie within 1-10 mm, but the plot extends to 60 mm. This indicates that there are some unusually large values in y. There are so many observations in the common bins that the rare bins are so short you can't see them.

To make the unusual values easy to see, we zoom in to small values of the y-axis with coord_cartesian():

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

Now we see that there are a few suspicious data points with y equal to 0, about 30, or nearly 60 mm. They do not look correct (how can a diamond be 6 cm wide?), so we will inspect them more closely.

An important side note: we cannot use ylim() to zoom, because that would remove all bins taller than 50!

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  ylim(0, 50)

Check whether unusual values should be removed or replaced with NA

Since diamonds that are 0, 30, or 60 mm wide are unlikely to be real, let's look at those rows in detail using the filter() function:

strange_y <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  print()
## # A tibble: 9 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139  0      0    0   
## 2  1.14 Fair      G     VS1      57.5    67  6381  0      0    0   
## 3  2    Premium   H     SI2      58.9    57 12210  8.09  58.9  8.06
## 4  1.56 Ideal     G     VS2      62.2    54 12800  0      0    0   
## 5  1.2  Premium   D     VVS1     62.1    59 15686  0      0    0   
## 6  2.25 Premium   H     SI2      62.8    59 18034  0      0    0   
## 7  0.51 Ideal     E     VS1      61.8    55  2075  5.15  31.8  5.12
## 8  0.71 Good      F     SI2      64.1    60  2130  0      0    0   
## 9  0.71 Good      F     SI2      64.1    60  2130  0      0    0

Now we see that the zero values clearly represent missing size information. The 58.9-mm and 31.8-mm values of y are probably incorrect, since they do not match the carat and price data (the diamonds would be too light and too cheap if those widths were true).

Now we have a good idea that these unusual values are either missing or incorrect, so we will replace them with NA.

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

We can work on the diamonds2 data set from now on. We keep doing this until all unusual values are properly labeled or removed.
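
We can quickly verify the replacement by counting the NA values now present in y; the result should match the 9 suspicious rows found above.

sum(is.na(diamonds2$y))  # expected: 9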


Covariation

If variation describes the behavior within a single variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.

How you do that again depends on the type of variables involved. Below is a very brief summary:

  • Two categorical (ordinal) variables: chi-square independence test
  • Two numeric variables: scatter plot, line graph
  • One numeric, one categorical (ordinal): multiple boxplots, density ridge plot


Two Categorical Variables

In the data visualization part, we learned to create stacked bar charts to investigate the effect of two categorical variables on a third numeric variable.

However, that does not check the relationship between the two categorical variables themselves. In statistics, we usually want to know whether the two categorical variables are dependent on or independent of each other. If they are highly dependent, then knowing the category of one variable helps us predict the category of the other with better accuracy.

For example, whether one brings an umbrella is highly dependent on whether it rains that day. On the other hand, whether one eats bread for breakfast hardly depends on whether it rains.

To check the covariation, we use the chisq.test() and table() functions together. For example, let's check whether Attrition_Flag and Gender are dependent or not.

chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(bank_data$Attrition_Flag, bank_data$Gender)
## X-squared = 13.866, df = 1, p-value = 0.0001964

So we check the p-value here. If it is a very small number (<0.05 is a common criterion), then we conclude the two variables are dependent.
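
If you want to use the p-value programmatically (say, to test many variable pairs in a loop), note that chisq.test() returns a list-like object; a sketch:

test_result <- chisq.test(table(bank_data$Attrition_Flag, bank_data$Gender))
test_result$p.value  # about 0.0002, well below the 0.05 criterion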

Analysis: In this example, the two variables are dependent, with a quite small p-value. This can be explained by the frequency table:

table(bank_data$Attrition_Flag, bank_data$Gender)
##                    
##                        F    M
##   Attrited Customer  930  697
##   Existing Customer 4428 4072

So we see that the female-to-male ratio is significantly higher among attrited customers. In other words, the data suggest that female customers are more likely to cancel their credit cards.
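
We can make this comparison explicit by converting the table into column proportions, i.e. the attrition rate within each gender; a sketch:

# margin = 2 normalizes each column (gender) to sum to 1;
# roughly 17% of female vs 15% of male customers attrited
prop.table(table(bank_data$Attrition_Flag, bank_data$Gender), margin = 2)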

We can also create a grouped bar chart to illustrate this:

ggplot(bank_data) + 
  geom_bar(aes(x = Attrition_Flag, fill = Gender), position = "dodge")

But the dependence is less straightforward to see from the graph than from the summary table.


Lab Exercise:

Check whether Attrition_Flag is dependent on Marital_Status or not.


Two numeric variables

For two numeric variables, we simply create a scatter plot and/or a line graph. A correlation analysis is usually performed, too.

Let’s take Customer_Age and Credit_Limit as an example:

ggplot(bank_data, aes(Customer_Age, Credit_Limit)) + geom_point() + geom_smooth()

The points line up vertically because Customer_Age is a discrete variable that only takes integer values. The correlation between the two variables seems quite weak, but we do see a reasonable trend in the smoothed line: the middle-aged group has higher credit limits than young professionals and retirees, which agrees with common sense.

Pearson’s Correlation Coefficient

It is also common to check Pearson's correlation coefficient between two numeric variables. But please note that it only measures linear correlation and cannot capture nonlinear relationships.

Use cor(<data1>, <data2>) to compute the coefficient in R. A value around zero indicates little correlation; a value around \(1\) or \(-1\) indicates strong linear correlation.
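
For reference, the standard definition of Pearson's correlation coefficient for paired observations \((x_i, y_i)\) is:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]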

cor(bank_data$Customer_Age, bank_data$Credit_Limit)
## [1] 0.002476227

We see a very weak linear correlation here. But as the smoothed line graph indicates, there may be some nonlinear relationship between the two variables.
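
A rank-based alternative that captures monotonic (though still not arbitrary) nonlinear association is Spearman's correlation, which cor() also supports; a sketch. Keep in mind that the hump-shaped trend seen in the plot is not monotonic, so Spearman's coefficient may stay small here as well.

cor(bank_data$Customer_Age, bank_data$Credit_Limit, method = "spearman")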


One numeric variable and one categorical variable

For one numeric variable and one categorical variable, we commonly use boxplots to visualize their relationship. Let's take Attrition_Flag and Customer_Age as an example.

ggplot(bank_data) + 
  geom_boxplot(aes(x = Customer_Age, y = Attrition_Flag))

We see little difference here, with the median customer age slightly higher among attrited customers.
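
The density ridge plot mentioned in the covariation summary above is another option for this variable pair. A minimal sketch, assuming the ggridges package is installed:

library(ggridges)  # assumed installed; not part of tidyverse

ggplot(bank_data, aes(x = Customer_Age, y = Attrition_Flag)) +
  geom_density_ridges()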

When there are few outliers, we may use a two-sample t-test to quantitatively tell whether age has a significant effect here.

data1 <- bank_data$Customer_Age[bank_data$Attrition_Flag != "Existing Customer"]
data2 <- bank_data$Customer_Age[bank_data$Attrition_Flag == "Existing Customer"]
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 1.8988, df = 2370.8, p-value = 0.05772
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01302059  0.80777731
## sample estimates:
## mean of x mean of y 
##  46.65950  46.26212

The code above performs a two-sample t-test to see whether the mean customer age is the same among existing and attrited customers. The p-value is above 0.05, so we cannot reject the null hypothesis; any difference in mean age is likely quite small.


Lab Exercise:

Analyze whether the variable Contacts_Count_12_mon has an effect on Attrition_Flag or not.


Use ggpairs() to create a matrix plot

You may wonder whether we have to analyze each variable or variable pair separately. The answer is no - we can also create a matrix plot to make things easier.

To do this, we use the ggpairs() function in the GGally package.

library(GGally)

bank_data2 <- select(bank_data, Attrition_Flag, Customer_Age, Gender, Credit_Limit)
ggpairs(bank_data2)

So we see that the ggpairs() function automatically selects an appropriate graph for each single variable (on the main diagonal) and for each variable pair. We reduced the number of variables with the select() function to make the graph less crowded.
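
As a variation, ggpairs() also accepts an aesthetic mapping, which colors every panel by a grouping variable; a sketch (it can be slow on large data sets):

ggpairs(bank_data2, mapping = aes(color = Attrition_Flag))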

However, this plot only provides a preliminary overview. One still needs to do detailed manual work to better understand the data set before moving forward.