Let’s load the following libraries for this lecture.
library(tidyverse)
library(openintro)
In this module, we are going to familiarize ourselves with the EDA process.
EDA stands for “Exploratory Data Analysis”. It is a critical initial step in the data analysis process that involves investigating and summarizing the main characteristics of a data set, often with visual methods, before making any assumptions or building predictive models.
The primary goal of EDA is to
understand the underlying structures of data
detect outliers or anomalies
identify important variables
discover patterns and relationships that might not be immediately apparent.
The concept of EDA was introduced by John Tukey in the 1970s as a way to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.
EDA is not confined to any one specific set of techniques; rather, it encompasses a variety of statistical graphics, plots, and information tables.
Summary Statistics: Calculating mean, median, mode, variance, and standard deviation to understand the central tendency, dispersion, and shape of the data distribution.
Data Visualization: Using plots and graphics to visually inspect the data. Common tools include histograms, box plots, scatter plots, and bar charts, which help to reveal trends, patterns, and outliers.
Handling Missing Data: Identifying and treating missing values appropriately, either by imputation or removal, based on their nature and the amount of missing data.
Correlation Analysis: Examining the relationships between variables using correlation coefficients and scatter plots to understand how variables are related to one another.
Dimension Reduction: Applying techniques such as Principal Component Analysis (PCA) to reduce the number of variables under consideration, focusing on those that capture the most information.
Data Cleaning: Identifying errors or inconsistencies in the data, and making corrections or filtering out problematic data points.
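As a quick, minimal illustration of the first two items above (using the built-in mtcars data set, chosen here only for convenience), the corresponding R calls might look like this:
# Summary statistics for one numeric variable (mtcars ships with base R)
mean(mtcars$mpg)      # central tendency
median(mtcars$mpg)
var(mtcars$mpg)       # dispersion
sd(mtcars$mpg)
# A quick visual check of the same variable's distribution
ggplot(mtcars) + geom_histogram(aes(x = mpg), bins = 10)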
The process of EDA usually involves a cycle of
asking questions out of specific or common interest;
answering them with data transformation, visualization, and modeling, which typically generates new questions and thus closes the loop.
Here we will focus on data transformation and visualization, and leave the modeling part to future classes.
There is no standard recipe for exploring data; if one existed, it would itself be questionable, as the following quotes suggest:
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
So the best way to learn it is through practice. We will practice with two data sets in this module.
diamonds data set
Next, let’s study a data set to start our journey in data exploration. We will use the diamonds data set, which is a built-in data set in the ggplot2 package. The data set should be available after you load tidyverse into the R session.
First, let’s do some basic exploration of the data set by doing the following exercise:
Take a glimpse of the diamonds data set.
Plot price and carat. How do you understand this plot?
Plot price and cut, color, clarity, respectively. How do you understand these plots?
stat_summary()
So far our bar plots have all been plotting counts. In many cases we hope to plot other descriptive measures instead. In the diamonds data set, for example, we may hope to plot the mean price for diamonds with various combinations of color/clarity levels. The following code does the job for us:
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = clarity, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") +
labs(title = "Diamond Price by clarity and Color",
x = "Clarity Level",
y = "Mean Price (US dollar)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Basically, stat_summary or any other <STAT_FUNCTION> is an alternative to a <GEOM_FUNCTION>. The difference is that:
- <GEOM_FUNCTION> creates a graph of a particular graph type (point, line, bar plot, etc.), and we need to specify the mapping in it.
- <STAT_FUNCTION> creates a graph that plots a particular statistic (count, proportion, density, or any specified variables and their functions), and we need to specify the graph type in the function by using the geom argument.
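For example, the same mean-price bar chart could also be produced from the geom side by setting its stat argument; the following is just a sketch of that correspondence, not a recommendation of one style over the other:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, y = price, fill = color), stat = "summary", fun = "mean", position = "dodge")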
The graph above suggests information that is not consistent with common sense. Why? Because we didn’t account for the major price driver carat, nor for another important factor, sample size, in this graph. To find the correct trend, we need to look into the data more carefully.
To resolve the effect of carat, one way (certainly not the best way here) to handle this is to create a new ordinal variable named carat_group. We then label each diamond as <= 1 carat, 1-2 carat, 2-3 carat, or > 3 carat based on its carat value.
diamonds <- mutate(diamonds, carat_group = cut(carat, c(0, 1, 2, 3, Inf), c('<= 1 carat', '1-2 carat', '2-3 carat', '> 3 carat')))
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = carat_group, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") +
labs(title = "Diamond Price by Carat and Color",
x = "Carat",
y = "Mean Price (US dollar)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
This graph makes much more sense now: for diamonds of less than 3 carats, the overall trend is that the mean price decreases as the color level downgrades. However, the “> 3 carat” group still looks odd, since the worst color still has the highest mean price.
In this case, we need to speculate why this may happen and investigate whether our speculation is correct or not by graphs. Here a natural speculation is that “> 3 carat” diamonds are very rare so the sample size is too small to reflect the trend. We may verify our speculation with the following graph:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = carat_group, fill = color) , position = "dodge") +
labs(title = "Diamond Data by Carat and Color",
x = "Carat",
y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
The graph above clearly shows that the sample size in the “> 3 carat” group is simply too small, so we are not able to draw reliable conclusions for that group.
It is also helpful to check the mean carat
for each
color level:
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") +
labs(title = "Mean Diamond Weight by Color",
x = "Color Level",
y = "Mean Weight (carat)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
As we see, in this data set diamonds with lower color quality are heavier on average, and that’s why we have to consider this factor when analyzing the relationship between price and color level.
It is also helpful to study diamonds of less than 2 carats to exclude big diamonds, which are rare. In this case, we first need to filter our data using the filter function in the dplyr package.
ggplot(data = filter(diamonds, carat < 2)) +
stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") +
labs(title = "Mean Weight vs Color for '< 2 carat' Diamonds",
x = "Color Level",
y = "Mean Weight (carat)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
The trend changes little compared with the previous graph. Therefore we must take the effect of carat into account when analyzing the effect of color on price.
This also shows that diamond color does impact the price: although diamonds with color “D” and “F” share a similar mean price, “F” diamonds are heavier than “D” diamonds on average.
The example above shows the process of Exploratory Data Analysis (EDA), which involves using visualization and transformation to explore your data in a systematic way. This is really an iterative cycle of the following steps:
Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
During this process, you will find that in many situations we don’t have the data in exactly the right form for what we need. In that case we have to refer to data transformation.
In the mpg data set, which manufacturer has the best average mpg across all its models in the data set?
Do you think that this conclusion is more generally valid? That is, can we say that this manufacturer produces cars that are more fuel-efficient than other manufacturers? Think carefully and try to find the answer from the data set itself, rather than from your own opinion or external research.
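As a possible starting point (a sketch only, and it assumes we measure fuel economy by the hwy column, since mpg records both cty and hwy), the first question can be approached with a grouped summary:
mpg %>%
  group_by(manufacturer) %>%
  summarise(avg_hwy = mean(hwy)) %>%   # average highway mpg per manufacturer
  arrange(desc(avg_hwy))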
We will use a data set of bank customers which is available from Kaggle (one of the most famous data science sites):
https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers
The data set is also uploaded to Canvas. But you should still go to Kaggle for details of each variable.
Please download the data set file in csv format. Let’s load the data
set into R first using the read_csv
function offered by the
readr
package in tidyverse
.
There are multiple ways to load a data set file into R. Let’s first learn how to find the current working folder on your computer.
getwd()
This command returns the current working folder of RStudio. If you put the downloaded data set file into this folder, we can read it directly without providing any path information. Otherwise we need to supply the complete path of the data file.
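For instance, assuming BankChurners.csv sits directly in the working folder returned by getwd(), the relative-path call would simply be:
bank_data <- read_csv("BankChurners.csv")
The call below instead uses a complete path (the author’s own folder layout):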
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
Now we can work on the data frame bank_data just as we did in all previous examples. Let’s use this example to show a reasonable (but definitely not routine) process for exploring data.
The first thing is nearly always to take a glimpse of our data to collect basic information:
glimpse(bank_data)
## Rows: 10,127
## Columns: 20
## $ Attrition_Flag <chr> "Existing Customer", "Existing Customer", "Ex…
## $ Customer_Age <dbl> 45, 49, 51, 40, 40, 44, 51, 32, 37, 48, 42, 6…
## $ Gender <chr> "M", "F", "M", "F", "M", "M", "M", "M", "M", …
## $ Dependent_count <dbl> 3, 5, 3, 4, 3, 2, 4, 0, 3, 2, 5, 1, 1, 3, 2, …
## $ Education_Level <chr> "High School", "Graduate", "Graduate", "High …
## $ Marital_Status <chr> "Married", "Single", "Married", "Unknown", "M…
## $ Income_Category <chr> "$60K - $80K", "Less than $40K", "$80K - $120…
## $ Card_Category <chr> "Blue", "Blue", "Blue", "Blue", "Blue", "Blue…
## $ Months_on_book <dbl> 39, 44, 36, 34, 21, 36, 46, 27, 36, 36, 31, 5…
## $ Total_Relationship_Count <dbl> 5, 6, 4, 3, 5, 3, 6, 2, 5, 6, 5, 6, 3, 5, 5, …
## $ Months_Inactive_12_mon <dbl> 1, 1, 1, 4, 1, 1, 1, 2, 2, 3, 3, 2, 6, 1, 2, …
## $ Contacts_Count_12_mon <dbl> 3, 2, 0, 1, 0, 2, 3, 2, 0, 3, 2, 3, 0, 3, 2, …
## $ Credit_Limit <dbl> 12691.0, 8256.0, 3418.0, 3313.0, 4716.0, 4010…
## $ Total_Revolving_Bal <dbl> 777, 864, 0, 2517, 0, 1247, 2264, 1396, 2517,…
## $ Avg_Open_To_Buy <dbl> 11914.0, 7392.0, 3418.0, 796.0, 4716.0, 2763.…
## $ Total_Amt_Chng_Q4_Q1 <dbl> 1.335, 1.541, 2.594, 1.405, 2.175, 1.376, 1.9…
## $ Total_Trans_Amt <dbl> 1144, 1291, 1887, 1171, 816, 1088, 1330, 1538…
## $ Total_Trans_Ct <dbl> 42, 33, 20, 20, 28, 24, 31, 36, 24, 32, 42, 2…
## $ Total_Ct_Chng_Q4_Q1 <dbl> 1.625, 3.714, 2.333, 2.333, 2.500, 0.846, 0.7…
## $ Avg_Utilization_Ratio <dbl> 0.061, 0.105, 0.000, 0.760, 0.000, 0.311, 0.0…
So we see that there are about 10,000 rows and 20 columns. Note that for this data set, there is no built-in help documentation. To check the meaning of each variable, we have to refer to the original Kaggle link. For this example, I list them for you. But in the future, you need to check it by yourself.
Variable Name | Meaning |
---|---|
Attrition_Flag | whether the customer canceled the credit card or not |
Customer_Age | age of the customer |
Gender | gender |
Dependent_count | number of dependents the customer has for tax purposes |
Education_Level | highest degree the customer earned |
Marital_Status | marital status |
Income_Category | categorized annual income |
Card_Category | category of the credit card |
Months_on_book | length of the customer’s relationship with the bank (months) |
Total_Relationship_Count | total number of bank products held by the customer |
Months_Inactive_12_mon | number of months inactive in the last 12 months |
Contacts_Count_12_mon | number of contacts with the bank in the last 12 months |
Credit_Limit | credit line of the customer |
Total_Revolving_Bal | total revolving balance on the credit card |
Avg_Open_To_Buy | amount of credit line available (average of last 12 months) |
Total_Amt_Chng_Q4_Q1 | change in transaction amount (Q4 over Q1 ratio) |
Total_Trans_Amt | total transaction amount (last 12 months) |
Total_Trans_Ct | total transaction count (last 12 months) |
Total_Ct_Chng_Q4_Q1 | change in transaction count (Q4 over Q1 ratio) |
Avg_Utilization_Ratio | average card utilization ratio |
Following the data source from Kaggle, a key question to answer from this data set is: which factors are important for predicting whether a customer is likely to cancel the credit card service?
Therefore, we do have a target variable, Attrition_Flag. We shall keep this question in mind as the EDA process progresses.
After we get the basic information about the data set, we usually inspect each variable individually to get some “feeling” for our data. To do this, we check the distribution of each variable to visualize its variation.
Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. Every variable has its own pattern of variation, which may reveal interesting information.
For example, suppose we hope to know the distribution of Attrition_Flag, which is a categorical variable. We can do the following:
table(bank_data$Attrition_Flag)
##
## Attrited Customer Existing Customer
## 1627 8500
prop.table(table(bank_data$Attrition_Flag))
##
## Attrited Customer Existing Customer
## 0.1606596 0.8393404
The table() function creates a frequency table for a categorical variable. Passing that table to the prop.table() function gives the proportion of each category. Here we see that the variable takes only two values - Existing Customer and Attrited Customer. Also, there are many more existing customers than attrited customers. This information is very important for modeling - the data is unbalanced in quantity between the two key groups that we hope to study.
We may also create a bar plot to show the same information:
ggplot(bank_data) + geom_bar(aes(Attrition_Flag))
Create a frequency table and a bar plot for the variables Gender and Card_Category. Summarize your findings.
To understand the distribution of a numeric variable, the most
typical action is to create a histogram. But usually it is good to
complement it with a summary as well. Let’s take the variable
Credit_Limit
as an example:
summary(bank_data$Credit_Limit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1438 2555 4549 8632 11068 34516
ggplot(bank_data) + geom_histogram(aes(Credit_Limit))
Analysis: This histogram provides some useful information. We see a bimodal distribution - one mode around $2,000 and another around the maximum value of $34,516. The origin of the second mode may need some investigation if necessary. For now we just keep this information in mind.
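As one quick, purely illustrative follow-up on that second mode (an assumption about what to check, not part of the original analysis), we could count how many customers sit exactly at the maximum credit limit; a large count would suggest a cap on credit lines rather than a natural cluster:
sum(bank_data$Credit_Limit == max(bank_data$Credit_Limit))   # customers at the maximum credit limit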
Create a histogram and a summary for variables
Customer_Age
and Total_Trans_Amt
. Summarize
your findings.
For discrete numeric variables, it is recommended to treat them as categorical variables if the number of possible values is small, or as continuous variables if the number of possible values is large. In many cases you can do both.
For example, the variable Dependent_count is a discrete variable with values ranging from 0 to 5:
table(bank_data$Dependent_count)
##
## 0 1 2 3 4 5
## 904 1838 2655 2732 1574 424
unique(bank_data$Dependent_count)
## [1] 3 5 4 2 0 1
The function unique
lists all unique values of a
variable.
Therefore it’s reasonable to treat Dependent_count as a categorical variable. Use the as.factor() function to convert it into a factor when creating a bar plot.
ggplot(bank_data) +
geom_bar(aes(x = as.factor(Dependent_count))) +
labs(x = "Number of Dependents")
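Alternatively, if we chose to treat Dependent_count as a numeric variable instead, a histogram sketch such as the following would display essentially the same information:
ggplot(bank_data) +
  geom_histogram(aes(x = Dependent_count), binwidth = 1) +
  labs(x = "Number of Dependents")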
Find another discrete random variable from the bank data set and explore its distribution with table and graph.
NA values
An important step is to check whether there are any missing values in the data set. We may simply use
sum(is.na(bank_data))
## [1] 0
which returns the number of NA
values in the whole data
set. If it is non-zero, we can further check for each column by the
following code:
map(bank_data, ~ sum(is.na(.)))
## $Attrition_Flag
## [1] 0
##
## $Customer_Age
## [1] 0
##
## $Gender
## [1] 0
##
## $Dependent_count
## [1] 0
##
## $Education_Level
## [1] 0
##
## $Marital_Status
## [1] 0
##
## $Income_Category
## [1] 0
##
## $Card_Category
## [1] 0
##
## $Months_on_book
## [1] 0
##
## $Total_Relationship_Count
## [1] 0
##
## $Months_Inactive_12_mon
## [1] 0
##
## $Contacts_Count_12_mon
## [1] 0
##
## $Credit_Limit
## [1] 0
##
## $Total_Revolving_Bal
## [1] 0
##
## $Avg_Open_To_Buy
## [1] 0
##
## $Total_Amt_Chng_Q4_Q1
## [1] 0
##
## $Total_Trans_Amt
## [1] 0
##
## $Total_Trans_Ct
## [1] 0
##
## $Total_Ct_Chng_Q4_Q1
## [1] 0
##
## $Avg_Utilization_Ratio
## [1] 0
If there are any NA values, then we need to decide later how we are going to handle them.
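A more compact alternative (a base-R sketch that returns a named vector rather than a list) is:
colSums(is.na(bank_data))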
Check how many NA values there are in each column of the flights data set.
Outliers are observations that are unusual; data points that don’t seem to fit the overall data pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. It is very important to be aware of them before our analysis.
Take the diamonds data set as an example. If we create a histogram of the y variable (the width of the diamond in mm), we see something strange.
ggplot(diamonds) + geom_histogram(aes(y))
We see that most data are within 1-10 mm, but the plot extends to 60 mm. This is an indicator that there are some unusually large values in y. There are so many observations in the common bins that the rare bins are too short to see.
To make it easy to see the unusual values, we need to zoom to small
values of the y-axis with coord_cartesian()
:
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
Now we see that there are a few suspicious data points with y around 0, 30, or 60 mm. They do not look correct (how can a diamond be 6 cm wide?), so we will inspect them more closely.
An important side note: we cannot use ylim to zoom here, because that would remove all bins that are higher than 50!
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
ylim(0, 50)
Since diamonds 0, 30, or 60 mm wide are not likely to be real, let’s look at those data points in detail using the filter function:
strange_y <- diamonds %>%
filter(y < 3 | y > 20) %>%
print()
## # A tibble: 9 × 11
## carat cut color clarity depth table price x y z carat_group
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0 <= 1 carat
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0 1-2 carat
## 3 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06 1-2 carat
## 4 1.56 Ideal G VS2 62.2 54 12800 0 0 0 1-2 carat
## 5 1.2 Premium D VVS1 62.1 59 15686 0 0 0 1-2 carat
## 6 2.25 Premium H SI2 62.8 59 18034 0 0 0 2-3 carat
## 7 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12 <= 1 carat
## 8 0.71 Good F SI2 64.1 60 2130 0 0 0 <= 1 carat
## 9 0.71 Good F SI2 64.1 60 2130 0 0 0 <= 1 carat
Now we see that the zero values clearly represent missing size information. The 58.9 mm and 31.8 mm values in y are probably incorrect, since those sizes do not match the carat and price data (the diamonds would be too light and too cheap if y were true).
Now we have a good idea that those unusual values are either missing
or incorrect, so we will replace them with NA
now.
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
So we can work on the diamonds2 data set from now on. One is supposed to keep doing this until all unusual values are properly labeled or removed.
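As a quick sanity check (a sketch, not part of the original workflow), we can confirm the replacement and re-draw the histogram on the cleaned data:
sum(is.na(diamonds2$y))   # should match the number of suspicious rows flagged above
ggplot(diamonds2) + geom_histogram(aes(x = y), binwidth = 0.5)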
Finish all Lab Exercises.
Explore the loans_full_schema data set in the openintro package by performing EDA.