library(tidyverse)
In this module, we are going to work through an example study to familiarize ourselves with the EDA process.
EDA stands for “Exploratory Data Analysis”. Literally, it refers to “exploring data” to gain useful insights. We need to explore because we won’t know what is hidden in the data set, maybe treasure, maybe only garbage, unless we thoroughly inspect it.
The EDA process usually involves a cycle of i) asking questions out of specific or general interest, and ii) answering them with data transformation, visualization, and modeling, which typically generates new questions and thus closes the loop. Here we will focus on data transformation and visualization, and leave the modeling part to future classes.
There is no standard approach for exploring data. If there were one, it would be questionable, as the quotes below suggest:
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
So the best way to learn it is through practice. We will use a data set of bank customers which is available from Kaggle (one of the most famous data science sites):
https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers
Please download the data set file in CSV format. Let's first load the data set into R using the read_csv function offered by the readr package in tidyverse.
There are multiple ways to load a data set file into R. Let's first learn how to find the current working directory on your computer.
getwd()
This command returns the current working directory of RStudio. Put the downloaded data set file into this folder; we can then read it directly without providing any path information. Otherwise, we need to supply the complete path of the data file.
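For example, a minimal sketch assuming the file keeps its default Kaggle name, BankChurners.csv, and has been placed in the working directory:

# the relative path works because the file sits in the current working directory
bank_data <- read_csv("BankChurners.csv")

Alternatively, we can give read_csv the complete path (the path below reflects my own folder structure):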
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
Now we can work on the data frame bank_data just as we did in all previous examples. Let's use this example to show a reasonable (but definitely not routine) process for exploring data.
The first thing is nearly always to take a glimpse of our data to collect basic information:
glimpse(bank_data)
## Rows: 10,127
## Columns: 20
## $ Attrition_Flag <chr> "Existing Customer", "Existing Customer", "Ex…
## $ Customer_Age <dbl> 45, 49, 51, 40, 40, 44, 51, 32, 37, 48, 42, 6…
## $ Gender <chr> "M", "F", "M", "F", "M", "M", "M", "M", "M", …
## $ Dependent_count <dbl> 3, 5, 3, 4, 3, 2, 4, 0, 3, 2, 5, 1, 1, 3, 2, …
## $ Education_Level <chr> "High School", "Graduate", "Graduate", "High …
## $ Marital_Status <chr> "Married", "Single", "Married", "Unknown", "M…
## $ Income_Category <chr> "$60K - $80K", "Less than $40K", "$80K - $120…
## $ Card_Category <chr> "Blue", "Blue", "Blue", "Blue", "Blue", "Blue…
## $ Months_on_book <dbl> 39, 44, 36, 34, 21, 36, 46, 27, 36, 36, 31, 5…
## $ Total_Relationship_Count <dbl> 5, 6, 4, 3, 5, 3, 6, 2, 5, 6, 5, 6, 3, 5, 5, …
## $ Months_Inactive_12_mon <dbl> 1, 1, 1, 4, 1, 1, 1, 2, 2, 3, 3, 2, 6, 1, 2, …
## $ Contacts_Count_12_mon <dbl> 3, 2, 0, 1, 0, 2, 3, 2, 0, 3, 2, 3, 0, 3, 2, …
## $ Credit_Limit <dbl> 12691.0, 8256.0, 3418.0, 3313.0, 4716.0, 4010…
## $ Total_Revolving_Bal <dbl> 777, 864, 0, 2517, 0, 1247, 2264, 1396, 2517,…
## $ Avg_Open_To_Buy <dbl> 11914.0, 7392.0, 3418.0, 796.0, 4716.0, 2763.…
## $ Total_Amt_Chng_Q4_Q1 <dbl> 1.335, 1.541, 2.594, 1.405, 2.175, 1.376, 1.9…
## $ Total_Trans_Amt <dbl> 1144, 1291, 1887, 1171, 816, 1088, 1330, 1538…
## $ Total_Trans_Ct <dbl> 42, 33, 20, 20, 28, 24, 31, 36, 24, 32, 42, 2…
## $ Total_Ct_Chng_Q4_Q1 <dbl> 1.625, 3.714, 2.333, 2.333, 2.500, 0.846, 0.7…
## $ Avg_Utilization_Ratio <dbl> 0.061, 0.105, 0.000, 0.760, 0.000, 0.311, 0.0…
So we see that there are about 10,000 rows and 20 columns. Note that for this data set, there is no built-in help documentation. To check the meaning of each variable, we have to refer to the original Kaggle link. For this example, I list them for you below, but in the future you will need to check them yourself.
Variable Name | Meaning |
---|---|
Attrition_Flag | Whether the customer canceled the credit card or not |
Customer_Age | Age of the customer |
Gender | Gender |
Dependent_count | Number of dependents a customer has for tax purposes |
Education_Level | Highest degree the customer earned |
Marital_Status | Marital status |
Income_Category | Categorized annual income |
Card_Category | Category of the credit card |
Months_on_book | Length of the customer's relationship with the bank (in months) |
Total_Relationship_Count | Total number of bank products held by the customer |
Months_Inactive_12_mon | Number of months inactive in the last 12 months |
Contacts_Count_12_mon | Number of contacts with the bank in the last 12 months |
Credit_Limit | Credit line of the customer |
Total_Revolving_Bal | Total revolving balance on the credit card |
Avg_Open_To_Buy | Amount of credit line available (average of last 12 months) |
Total_Amt_Chng_Q4_Q1 | Change in transaction amount (Q4 over Q1 ratio) |
Total_Trans_Amt | Total transaction amount (last 12 months) |
Total_Trans_Ct | Total transaction count (last 12 months) |
Total_Ct_Chng_Q4_Q1 | Change in transaction count (Q4 over Q1 ratio) |
Avg_Utilization_Ratio | Average card utilization ratio |
Therefore, we do have a target variable, which is Attrition_Flag: which customers are likely to cancel their credit cards? We shall keep this question in mind as the EDA process progresses.
After we get the basic information about the data set, we usually need to inspect each variable individually to get some "feeling" for our data. To do this, we check the distribution of each variable to visualize its variation.
Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. Every variable has its own pattern of variation, which may reveal interesting information.
For example, suppose we hope to know the distribution of Attrition_Flag, which is a categorical variable. We can do the following:
table(bank_data$Attrition_Flag)
##
## Attrited Customer Existing Customer
## 1627 8500
prop.table(table(bank_data$Attrition_Flag))
##
## Attrited Customer Existing Customer
## 0.1606596 0.8393404
The table() function creates a frequency table for a categorical variable. Passing that table to the prop.table() function gives the proportion of each category. Here we see that there are only two values in the variable: existing customer and attrited customer. Also, there are many more existing customers than attrited customers. This information is very important in modeling: the data is unbalanced in quantity between the two key groups that we hope to study.
We may also create a bar plot to show the same information:
ggplot(bank_data) + geom_bar(aes(Attrition_Flag))
Exercise: Repeat the analysis above (frequency table, proportions, and a bar plot) for Gender and Card_Category. Summarize your findings.

To understand the distribution of a numeric variable, the most typical action is to create a histogram. But it is usually good to complement it with a numeric summary as well. Let's take the variable Credit_Limit as an example:
summary(bank_data$Credit_Limit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1438 2555 4549 8632 11068 34516
ggplot(bank_data) + geom_histogram(aes(Credit_Limit))
Analysis: This histogram provides some useful information. We see that this is a bimodal distribution: one mode around $2,000 and another near the maximum value of $34,516. The origin of the second mode may need further investigation; for now, we just keep this information in mind.
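As an optional quick check (a sketch; the $34,000 cutoff is just a convenient threshold near the cap), we can count how many customers sit at the maximum limit and see whether they churn at a different rate:

bank_data %>%
  filter(Credit_Limit > 34000) %>%   # customers at (or near) the $34,516 cap
  count(Attrition_Flag)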
Exercise: Create a numeric summary and a histogram for Customer_Age and Total_Trans_Amt. Summarize your findings.

Some numeric variables take only a handful of distinct values. For example, the variable Dependent_count is a discrete variable whose values range from 0 to 5:
table(bank_data$Dependent_count)
##
## 0 1 2 3 4 5
## 904 1838 2655 2732 1574 424
unique(bank_data$Dependent_count)
## [1] 3 5 4 2 0 1
The unique function lists all distinct values of a variable. Since Dependent_count takes only six values, it is reasonable to treat it as a categorical variable. Use the as.factor() function to convert it into a categorical variable when creating a bar plot.
ggplot(bank_data) +
geom_bar(aes(x = as.factor(Dependent_count))) +
labs(x = "Number of Dependents")
An important step is to check whether there are any missing (NA) values in the data set. We may simply use
sum(is.na(bank_data))
## [1] 0
which returns the number of NA values in the whole data set. If it is non-zero, we can further check each column with the following code:
map(bank_data, ~ sum(is.na(.)))
## $Attrition_Flag
## [1] 0
##
## $Customer_Age
## [1] 0
##
## $Gender
## [1] 0
##
## $Dependent_count
## [1] 0
##
## $Education_Level
## [1] 0
##
## $Marital_Status
## [1] 0
##
## $Income_Category
## [1] 0
##
## $Card_Category
## [1] 0
##
## $Months_on_book
## [1] 0
##
## $Total_Relationship_Count
## [1] 0
##
## $Months_Inactive_12_mon
## [1] 0
##
## $Contacts_Count_12_mon
## [1] 0
##
## $Credit_Limit
## [1] 0
##
## $Total_Revolving_Bal
## [1] 0
##
## $Avg_Open_To_Buy
## [1] 0
##
## $Total_Amt_Chng_Q4_Q1
## [1] 0
##
## $Total_Trans_Amt
## [1] 0
##
## $Total_Trans_Ct
## [1] 0
##
## $Total_Ct_Chng_Q4_Q1
## [1] 0
##
## $Avg_Utilization_Ratio
## [1] 0
If there are any NA values, we need to decide later how we are going to handle them.
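As a sketch of what that handling could look like (not needed here, since this data set turns out to have no missing values), two common strategies are to drop incomplete rows or to fill in a typical value:

bank_data %>%
  drop_na()           # option 1: remove every row that contains an NA
bank_data %>%         # option 2: replace NAs in a column with, e.g., its median
  mutate(Credit_Limit = replace_na(Credit_Limit, median(Credit_Limit, na.rm = TRUE)))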
Exercise: Check how many NA values there are in each column of the flights data set.

Outliers are observations that are unusual: data points that don't seem to fit the pattern. Sometimes outliers are data entry errors; other times they suggest important new science. It is very important to be aware of them before our analysis.
Let's use the diamonds data set as an example. If we create a histogram of the y variable, the width of the diamond, we see something strange.
ggplot(diamonds) + geom_histogram(aes(y))
We see that most data are within 1-10 mm, but the plot extends to 60 mm. This is an indicator that there are some unusually large values in y. There are so many observations in the common bins that the rare bins are too short to see.
To make it easier to see the unusual values, we need to zoom in to small values of the y-axis with coord_cartesian():
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
Now we see that there are a few suspicious data points with y equal to 0 or around 30 and 60 mm. They do not look correct (how can a diamond be 6 cm long?) and we will inspect them more closely.
An important side note: we cannot use ylim to zoom here, because that would remove all bins that are higher than 50!
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
ylim(0, 50)
Since diamonds that are 0, 30, or 60 mm wide are unlikely to be real, let's look at those data points in detail using the filter function:
strange_y <- diamonds %>%
filter(y < 3 | y > 20) %>%
print()
## # A tibble: 9 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
## 4 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 5 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 6 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 7 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12
## 8 0.71 Good F SI2 64.1 60 2130 0 0 0
## 9 0.71 Good F SI2 64.1 60 2130 0 0 0
Now we see that the zero values obviously represent missing size information. The 58.9-mm and 31.8-mm values of y are probably incorrect, since those sizes do not match the carat and price data (the diamonds would be far too light and too cheap if y were true).
Now that we have a good idea that those unusual values are either missing or incorrect, we will replace them with NA.
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
So we can work on the diamonds2 data set from now on. One should keep doing this until all unusual values are properly labeled or removed.
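For instance, a sketch of applying the same treatment to the other two dimensions (the cutoffs below are illustrative guesses, not carefully chosen values):

diamonds2 <- diamonds2 %>%
  mutate(x = ifelse(x < 3 | x > 20, NA, x),   # unusually small or large lengths
         z = ifelse(z < 2 | z > 20, NA, z))   # unusually small or large depths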
If variation describes the behavior within a single variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.
How you do that again depends on the types of the variables involved; the recommended plot types are illustrated in the examples below.
In the data visualization part, we learned to create stacked bar charts to investigate the effect of two categorical variables on a third numeric variable.
However, that does not check the relationship between those two categorical variables themselves. In statistics, we usually hope to know whether the two categorical variables are dependent on or independent of each other. If they are highly dependent, then knowing the category of one variable helps us predict the category of the other variable with better accuracy.
For example, whether one brings an umbrella or not is highly dependent on whether it rains on a given day. On the other hand, whether one eats bread for breakfast is not very dependent on whether it rains or not.
Let's check whether Gender and Dependent_count have a significant impact on Attrition_Flag. We can do this by graph or by table; here we demonstrate with Gender:
ggplot(bank_data) +
geom_bar(aes(x = Gender, fill = Attrition_Flag), position = "fill") +
labs(y = "proportion")
bank_data %>%
group_by(Gender) %>%
summarize(Attrition_rate = prop.table(table(Attrition_Flag))[1])
## # A tibble: 2 × 2
## Gender Attrition_rate
## <chr> <dbl>
## 1 F 0.174
## 2 M 0.146
So we see that the attrition rate is noticeably higher for female customers than for male ones. In other words, the data set suggests that female customers are more likely to cancel their credit cards.
But please keep in mind that this difference is not necessarily statistically significant (depending on factors such as sample size). Later we will learn how to run statistical tests to draw a more quantitative conclusion.
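As a preview (a sketch only; the details come later), a chi-squared test of independence between Gender and Attrition_Flag could be run like this:

chisq.test(table(bank_data$Gender, bank_data$Attrition_Flag))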
For two numeric variables, we simply create a scatter plot and/or a smoothed line graph. A correlation analysis is usually performed, too.
Let's take Customer_Age and Credit_Limit as an example:
ggplot(bank_data, aes(Customer_Age, Credit_Limit)) + geom_point() + geom_smooth()
The points fall on vertical lines because Customer_Age is a discrete variable and only takes integer values. It seems that the correlation between the two variables is quite weak. But we do see a reasonable trend in the smoothed line: the middle-aged group has higher credit limits than young professionals and retired people, which agrees with common sense.
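To quantify how weak the linear association is, a quick sketch is to compute the correlation coefficient between the two variables:

cor(bank_data$Customer_Age, bank_data$Credit_Limit)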
For one numeric variable and one categorical variable, we commonly use boxplots to visualize their relationship. However, boxplots are sometimes not informative enough, so here you are recommended to compare the density curves between groups. Let's take Attrition_Flag and Customer_Age as an example.
ggplot(bank_data) +
geom_density(aes(x = Customer_Age, fill = Attrition_Flag), alpha = 0.5)
We see that the two distributions nearly overlap with each other, so at least we can say that the effect of Customer_Age on Attrition_Flag is not very strong.
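For comparison, here is a sketch of the boxplot version of the same comparison; it shows even less about the shapes of the two distributions, which is why the density plot is recommended here:

ggplot(bank_data) +
  geom_boxplot(aes(x = Customer_Age, y = Attrition_Flag))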
In comparison, we can create the same plot for Attrition_Flag and Total_Trans_Ct. An evidently more pronounced difference is observed.
ggplot(bank_data) +
geom_density(aes(x = Total_Trans_Ct, fill = Attrition_Flag), alpha = 0.5)
We see that customers with more transaction counts (over 50 or so) are much less likely to cancel their credit card service.
Exercise: Check whether Contacts_Count_12_mon has an effect on Attrition_Flag or not.

We may use the cut function to convert a numeric variable into customized groups. For example, for Customer_Age, we may want to classify all ages into the groups <= 30, 31-40, 41-50, 51-60, and > 60, and then analyze its effect. The following code realizes this:
bank_data <- mutate(bank_data, Age_group = cut(Customer_Age, breaks = c(0, 30, 40, 50, 60, Inf)))
bank_data %>%
group_by(Age_group) %>%
summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1])
## # A tibble: 5 × 2
## Age_group Attrition_ratio
## <fct> <dbl>
## 1 (0,30] 0.121
## 2 (30,40] 0.145
## 3 (40,50] 0.167
## 4 (50,60] 0.168
## 5 (60,Inf] 0.143
So we see that customers between 40 and 60 are most likely to churn.
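If a visual summary is preferred, a sketch of the same result as a bar chart (plotting the grouped attrition ratios with geom_col) could be:

bank_data %>%
  group_by(Age_group) %>%
  summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1]) %>%
  ggplot() +
  geom_col(aes(x = Age_group, y = Attrition_ratio))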
Exercise: Cut Total_Trans_Ct into a few reasonable categories and study its effect on Attrition_Flag.

An interesting yet challenging part of data exploration is that there are sometimes complicated composite effects between variables. For example, if we only look at the effect of credit limit on churning, it does not appear very strong:
ggplot(bank_data) +
geom_density(aes(x = Credit_Limit, fill = Attrition_Flag), alpha = 0.5)
ggplot(bank_data) +
geom_boxplot(aes(x = Credit_Limit, y = Attrition_Flag))
However, if we also bring Customer_Age into the picture, we see a composite effect:
ggplot(bank_data) +
geom_point(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag), alpha = 0.5) +
geom_smooth(aes(x = Customer_Age, y = Credit_Limit, color = Attrition_Flag))
So it seems that for younger customers (under about 55), churning customers tended to have a lower credit limit on average. But the trend reverses for older customers, who were more likely to churn when they had a higher credit limit. Can you try to explain why?
In this example, we try to analyze the effect of two numeric variables on a categorical variable. So we use a grouped scatter/smooth plot.
What if we try to analyze the composite effect of two categorical variables? Let's analyze the composite effect of Gender and Income_Category:
bank_data %>%
group_by(Income_Category, Gender) %>%
summarize(Attrition_ratio = prop.table(table(Attrition_Flag))[1]) %>%
arrange(Attrition_ratio)
## # A tibble: 9 × 3
## # Groups: Income_Category [6]
## Income_Category Gender Attrition_ratio
## <chr> <chr> <dbl>
## 1 Unknown M 0.0962
## 2 Less than $40K M 0.108
## 3 $60K - $80K M 0.135
## 4 $40K - $60K M 0.135
## 5 $80K - $120K M 0.158
## 6 $40K - $60K F 0.164
## 7 Unknown F 0.172
## 8 $120K + M 0.173
## 9 Less than $40K F 0.177
In this example, we break our data into groups defined by all possible combinations of Gender and Income_Category, and study the effect by listing the attrition ratio in a table.
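The same comparison can also be visualized, for example as proportion bar charts faceted by Gender (a sketch):

ggplot(bank_data) +
  geom_bar(aes(x = Income_Category, fill = Attrition_Flag), position = "fill") +
  facet_wrap(~ Gender) +
  labs(y = "proportion")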
Exercise: Study the composite effect of Education_Level and Marital_Status on Attrition_Flag.

What if one of the two variables is numeric? One way to handle this kind of problem is to cut the numeric variable into groups. For example, if we hope to study the composite effect of customer age and gender, we may do the following:
bank_data %>%
group_by(Age_group, Gender) %>%
summarize(Attrition_rate = prop.table(table(Attrition_Flag))[1]) %>%
arrange(Attrition_rate)
## # A tibble: 10 × 3
## # Groups: Age_group [5]
## Age_group Gender Attrition_rate
## <fct> <chr> <dbl>
## 1 (0,30] M 0.108
## 2 (0,30] F 0.135
## 3 (30,40] M 0.135
## 4 (60,Inf] F 0.142
## 5 (60,Inf] M 0.144
## 6 (40,50] M 0.151
## 7 (50,60] M 0.152
## 8 (30,40] F 0.155
## 9 (50,60] F 0.181
## 10 (40,50] F 0.182
Exercise: Study the composite effect of Marital_Status and Months_on_book on Attrition_Flag.

In this module, we learned how to start the process of EDA for a given data set. In practice, these steps only serve as the first stage of EDA, since they provide basic information about the data set, which helps us ask and answer useful questions more efficiently and reasonably. The true journey of exploring data comes after these basic steps. For example, let's now think about the following questions:
If we are going to build a model to predict potential churning customers, which variables should we keep in the model, or should we keep them all? How can we incorporate the knowledge extracted from our exploration into the next stage?
If we are asked to propose solutions to improve the retention rate of the bank's credit card customers, what are reasonable directions to explore?
Guided by these questions, go back and check all the results we created. Is there any information among them that could be helpful?