Load Libraries


Let’s load the following libraries for this lecture.

library(tidyverse)
library(openintro)

Introduction to EDA


In this module, we are going to familiarize ourselves with the EDA process.

EDA stands for “Exploratory Data Analysis”. It is a critical initial step in the data analysis process that involves investigating and summarizing the main characteristics of a data set, often with visual methods, before making any assumptions or building predictive models.


The primary goals of EDA are to

  • understand the underlying structure of the data,

  • detect outliers or anomalies,

  • identify important variables, and

  • discover patterns and relationships that might not be immediately apparent.

History of EDA


The concept of EDA was introduced by John Tukey in the 1970s as a way to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

EDA is not confined to any one specific set of techniques; rather, it encompasses a variety of statistical graphics, plots, and information tables.

Key aspects and techniques of EDA


  • Summary Statistics: Calculating mean, median, mode, variance, and standard deviation to understand the central tendency, dispersion, and shape of the data distribution.

  • Data Visualization: Using plots and graphics to visually inspect the data. Common tools include histograms, box plots, scatter plots, and bar charts, which help to reveal trends, patterns, and outliers.

  • Handling Missing Data: Identifying and treating missing values appropriately, either by imputation or removal, based on their nature and the amount of missing data.

  • Correlation Analysis: Examining the relationships between variables using correlation coefficients and scatter plots to understand how variables are related to one another.

  • Dimension Reduction: Applying techniques such as Principal Component Analysis (PCA) to reduce the number of variables under consideration, focusing on those that capture the most information.

  • Data Cleaning: Identifying errors or inconsistencies in the data, and making corrections or filtering out problematic data points.

Process of EDA


The process of EDA usually involves the cycle of

  1. asking questions out of specific or common interest;

  2. answering them with data transformation, visualization, and modeling, which typically generates new questions and thus closes the loop.

Here we will focus on data transformation and visualization, and leave the modeling part to future classes.


The best way to learn is to do


There is no standard approach for exploring data; any claimed standard routine would itself be questionable, as the quotes below suggest:

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

So the best way to learn EDA is to practice. We will practice with two data sets in this module.

Explore the diamonds data set


Next, let’s study a data set to start our journey in data exploration. We will use diamonds, a built-in data set in the ggplot2 package. It should be available after you load tidyverse into the R session.

First, let’s do some basic exploration of the data set by doing the following exercise:

Lab Exercise for diamonds data set


  1. How many samples are there? How many variables are there? (A hedged starter sketch for items 1-4 appears after this list.)
  2. Understand the meaning and the data type for each variable.
  3. Create a plot to study the relationship between price and carat. How do you understand this plot?
  4. Create plots to study the relationship between price and cut, color, clarity, respectively. How do you understand these plots?
  5. After finishing 3 and 4, do you have more questions raised in your mind?
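
If you are not sure where to begin, here is a minimal starter sketch for items 1-4. It is one possible approach, certainly not the only one; the scatter plot and box plot shown are just typical first choices.

dim(diamonds)        # number of rows (samples) and columns (variables)
glimpse(diamonds)    # name and data type of each variable

# Item 3: relationship between price and carat as a scatter plot
ggplot(diamonds) + geom_point(aes(x = carat, y = price))

# Item 4: e.g., price by cut as a box plot (repeat for color and clarity)
ggplot(diamonds) + geom_boxplot(aes(x = cut, y = price))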

Change the y variable in bar charts with stat_summary()


So far our bar plots have all shown counts. In many cases we want to plot other descriptive measures instead. In the diamonds data set, for example, we may want to plot the mean price for diamonds at each combination of color and clarity levels. The following code does the job for us:

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = clarity, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") + 
  labs(title = "Diamond Price by clarity and Color", 
       x = "Clarity Level", 
       y = "Mean Price (US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Basically, stat_summary(), or any other <STAT_FUNCTION>, is an alternative to a <GEOM_FUNCTION>. The difference is that:

  • a <GEOM_FUNCTION> creates a graph of a particular type (point, line, bar, etc.), and we need to specify the mapping in it;

  • a <STAT_FUNCTION> creates a graph that plots a particular statistic (count, proportion, density, or any specified variables and functions of them), and we need to specify the graph type inside the function using the geom argument.
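
To make the equivalence concrete, here is a sketch of the same mean-price bar chart written with a geom function instead: geom_bar() accepts a stat argument, and the extra fun argument is passed on to the underlying stat (theme and labels omitted for brevity).

# Equivalent to the stat_summary() call above
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = clarity, y = price, fill = color), 
           stat = "summary", fun = "mean", position = "dodge")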

Graph Analysis


The graph above suggests some information that is not consistent with common sense.

  • The clarity level is from I1 (worst) to IF (best), but the mean price seems to decrease as the clarity level increases.
  • Similarly, in most clarity groups, diamonds of the best color “D” do not have the highest mean price; the worst color “J” often has a higher one.

Why? Because this graph ignores the major price driver, carat, as well as another important factor, the sample size. To find the correct trend, we need to look into the data more carefully.

An Example of EDA (exploratory data analysis)


To account for the effect of carat, one way (certainly not the best way here) is to create a new ordinal variable named carat_group, assigning each sample to one of the groups <= 1 carat, 1-2 carat, 2-3 carat, or > 3 carat based on its carat value.

diamonds <- mutate(diamonds, 
                   carat_group = cut(carat, c(0, 1, 2, 3, Inf), 
                                     c('<= 1 carat', '1-2 carat', '2-3 carat', '> 3 carat')))

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = carat_group, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") + 
  labs(title = "Diamond Price by Carat and Color", 
       x = "Carat", 
       y = "Mean Price (US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Graph Analysis


This graph makes much more sense now: for diamonds of less than 3 carats, the overall trend is that the mean price decreases as the color level downgrades. However, the “> 3 carat” group still looks odd, since the worst color still has the highest mean price.

In this case, we need to speculate about why this may happen and use graphs to check whether our speculation is correct. A natural speculation here is that “> 3 carat” diamonds are very rare, so the sample size is too small to reflect the trend. We may verify this with the following graph:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = carat_group, fill = color) , position = "dodge") + 
  labs(title = "Diamond Data by Carat and Color", 
       x = "Carat", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

The graph above clearly shows that the sample size in the “> 3 carat” group is simply too small, so we are not able to draw reliable conclusions for that group.

Exploratory Data Analysis (Cont’d)


It is also helpful to check the mean carat for each color level:

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") + 
  labs(title = "Mean Diamond Weight by Color", 
       x = "Color Level", 
       y = "Mean Weight (carat)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

As we see, in this data set diamonds with lower color quality are heavier on average, and that’s why we have to consider this factor when analyzing the relationship between price and color level.

Exploratory Data Analysis (Cont’d)


It is also helpful to study diamonds of less than 2 carats, excluding the rare big diamonds. In this case, we first need to filter our data using the filter function in the dplyr package.

ggplot(data = filter(diamonds, carat < 2)) + 
  stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") + 
  labs(title = "Mean Weight vs Color for '< 2 carat' Diamonds", 
       x = "Color Level", 
       y = "Mean Weight (carat)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

The trend changes little compared with the previous graph. Therefore we must take into account the effect of carat when analyzing the effect of color on price.

This also shows that diamond color indeed impacts the price: although diamonds with colors “D” and “F” share a similar mean price, “F” diamonds are heavier than “D” diamonds on average.

Exploratory Data Analysis (EDA)

The example above shows the process of Exploratory Data Analysis (EDA), which involves using visualization and transformation to explore your data in a systematic way. This is really an iterative cycle of the following steps:

  1. Generate questions about your data.

  2. Search for answers by visualizing, transforming, and modelling your data.

  3. Use what you learn to refine your questions and/or generate new questions.

During this process, you will find that in many situations the data is not in exactly the right form for what we need. In that case we have to turn to data transformation.

Lab Exercise

  • In the mpg data set, which manufacturer has the best average mpg across all its models? (A hedged starter sketch appears after this list.)

  • Do you think that conclusion is more generally valid? That is, can we say that this manufacturer produces more fuel-efficient cars than the others? Think carefully and try to find the answer from the data set itself, rather than from your own opinion or external research.
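
If you are not sure where to start with the first question, here is one possible sketch using dplyr. It uses highway mileage (hwy); you could instead use city mileage (cty) or combine the two, which is itself a choice worth thinking about.

# Mean highway mpg per manufacturer, sorted from best to worst
mpg %>% 
  group_by(manufacturer) %>% 
  summarise(mean_hwy = mean(hwy)) %>% 
  arrange(desc(mean_hwy))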

Bank customer data set


We will use a data set of bank customers which is available from Kaggle (one of the most famous data science sites):

https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers

The data set is also uploaded to Canvas. But you should still go to Kaggle for details of each variable.

Please download the data set file in csv format. Let’s load the data set into R first using the read_csv function offered by the readr package in tidyverse.

Put the data set file into the working folder


There are multiple ways to load a data set file into R. Let’s first learn how to find the current working folder on your computer.

getwd()

This command returns the current working folder of RStudio. If you put the downloaded data set file into this folder, you can read it directly without providing any path information; otherwise you need to supply the complete path of the data file.
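
For example, if the downloaded BankChurners.csv sits in your working folder (this file name is the one used in the Kaggle download), a relative call is enough:

bank_data <- read_csv("BankChurners.csv")

Otherwise, supply the complete path of the file, as in the chunk below: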

bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")

Now we can work on the data frame bank_data just as we did in all previous examples. Let’s use this example to show a reasonable (but definitely not routine) process for exploring data.

Data Overview


The first thing is nearly always to take a glimpse of our data to collect basic information:

  • The number of samples (rows) and variables (columns)
  • The meaning of each variable
  • The data type for each variable
  • The target variable (if there is any)

glimpse(bank_data)
## Rows: 10,127
## Columns: 20
## $ Attrition_Flag           <chr> "Existing Customer", "Existing Customer", "Ex…
## $ Customer_Age             <dbl> 45, 49, 51, 40, 40, 44, 51, 32, 37, 48, 42, 6…
## $ Gender                   <chr> "M", "F", "M", "F", "M", "M", "M", "M", "M", …
## $ Dependent_count          <dbl> 3, 5, 3, 4, 3, 2, 4, 0, 3, 2, 5, 1, 1, 3, 2, …
## $ Education_Level          <chr> "High School", "Graduate", "Graduate", "High …
## $ Marital_Status           <chr> "Married", "Single", "Married", "Unknown", "M…
## $ Income_Category          <chr> "$60K - $80K", "Less than $40K", "$80K - $120…
## $ Card_Category            <chr> "Blue", "Blue", "Blue", "Blue", "Blue", "Blue…
## $ Months_on_book           <dbl> 39, 44, 36, 34, 21, 36, 46, 27, 36, 36, 31, 5…
## $ Total_Relationship_Count <dbl> 5, 6, 4, 3, 5, 3, 6, 2, 5, 6, 5, 6, 3, 5, 5, …
## $ Months_Inactive_12_mon   <dbl> 1, 1, 1, 4, 1, 1, 1, 2, 2, 3, 3, 2, 6, 1, 2, …
## $ Contacts_Count_12_mon    <dbl> 3, 2, 0, 1, 0, 2, 3, 2, 0, 3, 2, 3, 0, 3, 2, …
## $ Credit_Limit             <dbl> 12691.0, 8256.0, 3418.0, 3313.0, 4716.0, 4010…
## $ Total_Revolving_Bal      <dbl> 777, 864, 0, 2517, 0, 1247, 2264, 1396, 2517,…
## $ Avg_Open_To_Buy          <dbl> 11914.0, 7392.0, 3418.0, 796.0, 4716.0, 2763.…
## $ Total_Amt_Chng_Q4_Q1     <dbl> 1.335, 1.541, 2.594, 1.405, 2.175, 1.376, 1.9…
## $ Total_Trans_Amt          <dbl> 1144, 1291, 1887, 1171, 816, 1088, 1330, 1538…
## $ Total_Trans_Ct           <dbl> 42, 33, 20, 20, 28, 24, 31, 36, 24, 32, 42, 2…
## $ Total_Ct_Chng_Q4_Q1      <dbl> 1.625, 3.714, 2.333, 2.333, 2.500, 0.846, 0.7…
## $ Avg_Utilization_Ratio    <dbl> 0.061, 0.105, 0.000, 0.760, 0.000, 0.311, 0.0…

So we see that there are about 10,000 rows and 20 columns. Note that this data set has no built-in help documentation; to check the meaning of each variable, we have to refer to the original Kaggle page. For this example, I list them for you below, but in the future you will need to check them yourself.

Meaning of Variables


Variable Name: Meaning

Attrition_Flag: whether the customer canceled the credit card or not
Customer_Age: age of the customer
Gender: gender of the customer
Dependent_count: how many dependents a customer has in terms of tax
Education_Level: highest degree a customer earned
Marital_Status: marital status
Income_Category: categorized annual income
Card_Category: the category of the credit card
Months_on_book: the length of a customer’s relationship with the bank
Total_Relationship_Count: total number of bank products held by the customer
Months_Inactive_12_mon: number of months inactive in the last 12 months
Contacts_Count_12_mon: number of contacts with the bank in the last 12 months
Credit_Limit: credit line of a customer
Total_Revolving_Bal: total revolving balance on the credit card
Avg_Open_To_Buy: the amount of credit line available (average of the last 12 months)
Total_Amt_Chng_Q4_Q1: change in transaction amount (Q4 over Q1 ratio)
Total_Trans_Amt: total transaction amount (last 12 months)
Total_Trans_Ct: total transaction count (last 12 months)
Total_Ct_Chng_Q4_Q1: change in transaction count (Q4 over Q1 ratio)
Avg_Utilization_Ratio: average card utilization ratio

Following the data source on Kaggle, a key question to answer from this data set is which factors are important for predicting whether a customer is likely to cancel the credit card service.

Therefore, we do have a target variable: Attrition_Flag. We shall keep this question in mind as the EDA process progresses.

Data Inspection


After we get the basic information about the data set, we usually inspect each variable to understand its distribution and get some “feel” for the data. To do this, we check the distribution of each variable to visualize its variation.

Variation


Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. Every variable has its own pattern of variation, which may reveal interesting information.

For categorical (ordinal) variables, create a bar plot or make a frequency table


For example, suppose we want to know the distribution of Attrition_Flag, which is a categorical variable. We can do the following:

table(bank_data$Attrition_Flag)
## 
## Attrited Customer Existing Customer 
##              1627              8500
prop.table(table(bank_data$Attrition_Flag))
## 
## Attrited Customer Existing Customer 
##         0.1606596         0.8393404

The table() function creates a frequency table for a categorical variable. Passing that table to the prop.table() function gives the proportion of each category. Here we see that the variable takes only two values - existing customer and attrited customer. Also, there are many more existing customers than attrited customers. This information is very important for modeling - the data is unbalanced between the two key groups that we hope to study.
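
If you prefer to stay in the tidyverse, a sketch of an equivalent computation with dplyr puts the counts and proportions into a single data frame:

# Counts and proportions of each Attrition_Flag category in one table
bank_data %>% 
  count(Attrition_Flag) %>% 
  mutate(prop = n / sum(n))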

We may also create a bar plot to show the same information:

ggplot(bank_data) + geom_bar(aes(Attrition_Flag))

Lab Exercise


Create a frequency table and a bar plot for the variables Gender and Card_Category. Summarize your findings.

For numeric variables, create a histogram and a summary


To understand the distribution of a numeric variable, the most typical action is to create a histogram. But usually it is good to complement it with a summary as well. Let’s take the variable Credit_Limit as an example:

summary(bank_data$Credit_Limit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1438    2555    4549    8632   11068   34516
ggplot(bank_data) + geom_histogram(aes(Credit_Limit))

Analysis: This histogram provides some useful information. We see a bimodal distribution - one mode around $2,000 and another near the maximum value of $34,516. The origin of the second mode may need investigation later; for now we just keep this information in mind.
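
If you want a finer (or coarser) view than the default of 30 bins, you can set the bin width explicitly; the value of 1000 below is an arbitrary choice for illustration, not a recommendation:

# Histogram of Credit_Limit with a bin width of $1,000
ggplot(bank_data) + geom_histogram(aes(Credit_Limit), binwidth = 1000)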

Lab Exercise


Create a histogram and a summary for variables Customer_Age and Total_Trans_Amt. Summarize your findings.

Treat discrete variables as categorical variables


For a discrete numeric variable, it is recommended to treat it as a categorical variable if the number of possible values is small, or as a continuous variable if the number of possible values is large. In many cases you can do both.

For example, the variable Dependent_count is a discrete variable taking values from 0 to 5:

table(bank_data$Dependent_count)
## 
##    0    1    2    3    4    5 
##  904 1838 2655 2732 1574  424
unique(bank_data$Dependent_count)
## [1] 3 5 4 2 0 1

The function unique lists all unique values of a variable.

Therefore it’s reasonable to treat Dependent_count as a categorical variable. Use the as.factor() function to convert it when creating a bar plot.

ggplot(bank_data) + 
  geom_bar(aes(x = as.factor(Dependent_count))) + 
  labs(x = "Number of Dependents")

Lab Exercise


Find another discrete variable in the bank data set and explore its distribution with a table and a graph.

Check for NA values


An important step is to check whether there are any missing values in the data set. We may simply use

sum(is.na(bank_data))
## [1] 0

which returns the number of NA values in the whole data set. If it is non-zero, we can further check for each column by the following code:

map(bank_data, ~ sum(is.na(.)))
## $Attrition_Flag
## [1] 0
## 
## $Customer_Age
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Dependent_count
## [1] 0
## 
## $Education_Level
## [1] 0
## 
## $Marital_Status
## [1] 0
## 
## $Income_Category
## [1] 0
## 
## $Card_Category
## [1] 0
## 
## $Months_on_book
## [1] 0
## 
## $Total_Relationship_Count
## [1] 0
## 
## $Months_Inactive_12_mon
## [1] 0
## 
## $Contacts_Count_12_mon
## [1] 0
## 
## $Credit_Limit
## [1] 0
## 
## $Total_Revolving_Bal
## [1] 0
## 
## $Avg_Open_To_Buy
## [1] 0
## 
## $Total_Amt_Chng_Q4_Q1
## [1] 0
## 
## $Total_Trans_Amt
## [1] 0
## 
## $Total_Trans_Ct
## [1] 0
## 
## $Total_Ct_Chng_Q4_Q1
## [1] 0
## 
## $Avg_Utilization_Ratio
## [1] 0

If there are NA values, we will need to decide later how to handle them.
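
A more compact base R alternative that returns the same per-column counts as a named vector:

colSums(is.na(bank_data))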

Lab Exercise


Check how many NA values there are in each column of the flights data set (from the nycflights13 package).

Check for unusual values (potential outliers)


Outliers are observations that are unusual; data points that don’t seem to fit the overall data pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. It is very important to be aware of them before our analysis.

Let’s use the diamonds data set as an example. If we create a histogram of the y variable, the width of the diamond in mm, we see something strange.

ggplot(diamonds) + geom_histogram(aes(x = y))

We see that most data are within 1-10 mm, but the x-axis extends to about 60 mm. This indicates that there are some unusually large values in y. There are so many observations in the common bins that the rare bins are too short to be visible.

To make the unusual values easier to see, we zoom in to small values on the y-axis with coord_cartesian():

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

Now we see that there are a few suspicious data points, with y near 0, 30, or 60 mm. They do not look correct (how can a diamond be 6 cm wide?), so we will inspect them more closely.

An important side note: we cannot use ylim() to zoom here, because that would remove all bins taller than 50!

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  ylim(0, 50)

Check whether unusual values should be removed or replaced with NA


Since diamonds that are 0, 30, or 60 mm wide are not likely to be real, let’s look at those rows in detail using the filter function:

strange_y <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  print()
## # A tibble: 9 × 11
##   carat cut       color clarity depth table price     x     y     z carat_group
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>      
## 1  1    Very Good H     VS2      63.3    53  5139  0      0    0    <= 1 carat 
## 2  1.14 Fair      G     VS1      57.5    67  6381  0      0    0    1-2 carat  
## 3  2    Premium   H     SI2      58.9    57 12210  8.09  58.9  8.06 1-2 carat  
## 4  1.56 Ideal     G     VS2      62.2    54 12800  0      0    0    1-2 carat  
## 5  1.2  Premium   D     VVS1     62.1    59 15686  0      0    0    1-2 carat  
## 6  2.25 Premium   H     SI2      62.8    59 18034  0      0    0    2-3 carat  
## 7  0.51 Ideal     E     VS1      61.8    55  2075  5.15  31.8  5.12 <= 1 carat 
## 8  0.71 Good      F     SI2      64.1    60  2130  0      0    0    <= 1 carat 
## 9  0.71 Good      F     SI2      64.1    60  2130  0      0    0    <= 1 carat

Now we see that the zero values obviously indicate missing size information. The 58.9 mm and 31.8 mm values of y are probably incorrect, since those sizes do not match the carat and price data (the diamonds would be too light and too cheap if the widths were real).

We now have a good idea that those unusual values are either missing or incorrect, so we will replace them with NA.

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

So we can work on the diamonds2 data set from now on. One should keep doing this until all unusual values are properly labeled or removed.


Lab Homework:


  • Finish all Lab Exercises.

  • Explore the loans_full_schema data set in the openintro package by performing EDA:

  1. Ask a question of your interest
  2. Visualize data to answer your question
  3. Try to raise new questions from your plot
  4. Visualize data to answer the new question
  5. Repeat cycle 1-4