Exploratory Data Analysis

This section covers the plots commonly used during exploratory data analysis (EDA).

Plots that display frequency distributions, or the distribution of a numerical variable, focus on the variable being summarized. These graphs make no distinction between independent and dependent variables.

Plots that display the association between two numerical variables show how a dependent variable relates to an independent variable. The independent variable goes on the x-axis and the dependent variable on the y-axis.

Frequency Distributions for categorical variables (bar charts, bubble plot, heatmap)

Frequency distributions display the counts of each category of a categorical variable in a dataset. A frequency distribution helps assess how balanced or uneven the sample sizes are among categories, which matters because unequal group sizes can affect the reliability of comparisons in later steps (such as when calculating averages). A frequency distribution also summarizes a categorical variable by revealing its mode, the category that occurs most often.

The role of a frequency distribution for a categorical variable is loosely analogous to that of central tendency for a numerical variable. The analogy is imperfect, because central tendency covers the mean, median, and mode, whereas a frequency distribution reveals only the mode.

Uneven frequency distributions don’t invalidate the analysis; they tell us how much confidence to place in the averages and comparisons computed for each group.
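As a minimal sketch of the idea above, the counts, relative frequencies, and mode of a categorical variable can be obtained with base R's table(). The names clarity_counts, clarity_share, and clarity_mode are illustrative choices, not part of the original analysis.

```r
library(ggplot2)  # only needed here for the diamonds dataset

# Count how many diamonds fall into each clarity grade
clarity_counts <- table(diamonds$clarity)

# Convert counts to proportions to judge how balanced the groups are
clarity_share <- prop.table(clarity_counts)

# The mode is the category with the largest count
clarity_mode <- names(clarity_counts)[which.max(clarity_counts)]

clarity_counts
round(clarity_share, 3)
clarity_mode
```

In the diamonds data this picks out SI1 as the mode, while I1 is by far the smallest group, so any average computed for I1 rests on far fewer diamonds.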

Here is a table of the graphs I will talk about in this section:

Graph Name	Frequency Distribution Type	Purpose
Bar Chart	Frequency distribution of a single variable	To assess how balanced or uneven the sample sizes are among categories. Unequal group sizes can affect the reliability of comparisons across categories.
Bubble Chart	Joint frequency distribution	To assess how balanced or uneven the sample sizes are across the combinations of two categorical variables, which tells us how reliable the averages calculated within those groups are. Counts are encoded as point size.
Heat Map	Joint frequency distribution	Same purpose as the bubble chart, with counts encoded as fill color instead of point size.

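library(ggplot2)  # provides ggplot() and the diamonds dataset

# Bar chart: number of diamonds in each clarity grade
# The x-axis labels recode the clarity grades as numbers:
# 8 (I1) = Lowest clarity grade
# 0 (IF) = Highest clarity grade

ggplot(data=diamonds) + 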
  geom_bar(aes(x=clarity)) + 
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2"= 6 , "SI1"= 5 , "VS2"= 4,
    "VS1"= 3, "VVS2"= 2, "VVS1"= 1, "IF" = 0 )) +
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500)) + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Number of Diamonds",
       x = 'Diamond Clarity Scale', 
       title = "Bar Chart to display frequency distribution",
       subtitle = "Dataset: Diamonds Dataset",
       caption = "Data Source: ggplot2 package")

# Bubble chart: joint frequency of clarity and color (point size = count)
ggplot(data = diamonds) + 
  geom_count(mapping=aes(x=clarity,y=color)) + 
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2"= 6 , "SI1"= 5 , "VS2"= 4,
    "VS1"= 3, "VVS2"= 2, "VVS1"= 1, "IF" = 0 )) + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(size='Diamond Count',
       y = "Color",
       x = 'clarity grade', 
       title = "Bubble chart to display the joint frequency distributions",
       subtitle = "Dataset: Diamonds Dataset",
       caption = "Data Source: ggplot2 package")

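# Heat map: count the diamonds in each clarity-color combination
library(dplyr)  # provides %>%, group_by(), and count()

clarity_color_counts <- diamonds %>%
  group_by(clarity, color) %>% 
  count()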

ggplot(clarity_color_counts, aes(x=clarity, y=color, fill=n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low="white", high="blue") +
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2"= 6 , "SI1"= 5 , "VS2"= 4,
    "VS1"= 3, "VVS2"= 2, "VVS1"= 1, "IF" = 0 )) +
  labs(fill = "Count", 
       y = "Color",
       x = 'clarity grade', 
       title = "Heat map to display the joint frequency distributions",
       subtitle = "Dataset: Diamonds Dataset",
       caption = "Data Source: ggplot2 package")

Getting the exact counts through a table:

diamonds %>%
  group_by(clarity) %>% 
  count()

## # A tibble: 8 × 2
## # Groups:   clarity [8]
##   clarity     n
##   <ord>   <int>
## 1 I1        741
## 2 SI2      9194
## 3 SI1     13065
## 4 VS2     12258
## 5 VS1      8171
## 6 VVS2     5066
## 7 VVS1     3655
## 8 IF       1790
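
The same approach extends to the joint frequency distribution shown in the bubble chart and heat map. The sketch below passes two variables to dplyr's count() and sorts the result so the sparsest clarity-color combinations appear first; joint_counts is an illustrative name.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# One row per clarity-color combination, with n = number of diamonds
joint_counts <- diamonds %>%
  count(clarity, color) %>%
  arrange(n)  # smallest groups first

# The combinations with the fewest diamonds deserve the most caution
head(joint_counts, 5)
```

Averages computed within the combinations at the top of this listing are based on the fewest observations.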

Distributions for numerical variables (boxplot)

The boxplot highlights the median and the overall shape of a distribution (via quartiles and spread) while also flagging potential outliers. It offers a visual check on whether the data are roughly symmetric or skewed, which in turn makes it easier to judge which measure of central tendency best represents your data.

The summary() function and the boxplot are related but serve different purposes. summary() reports the five-number summary (Min, Q1, Median, Q3, Max) plus the mean. A boxplot draws the quartiles and whiskers, and shows the mean only if you add it yourself.

summary() provides a numeric overview of each column, useful for quick diagnostics.
Boxplots provide a visual overview of numerical distributions, especially when comparing across groups.
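
For the numeric side of this comparison, a short sketch using the price column of diamonds is given below. summary() and fivenum() are base R functions; the variable names are illustrative.

```r
library(ggplot2)  # diamonds dataset

# Numeric overview: Min, Q1, Median, Mean, Q3, Max
price_summary <- summary(diamonds$price)
price_summary

# fivenum() returns Tukey's five-number summary (no mean)
price_fivenum <- fivenum(diamonds$price)
price_fivenum
```

The two functions can report slightly different quartiles because they use different quantile definitions.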

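A boxplot contains:

Minimum: The smallest value in the dataset
Lower Quartile (Q1): The value below which 25% of the data fall; it forms the lower edge of the box
Median (Q2): The middle value in the dataset, drawn as the line inside the box
Upper Quartile (Q3): The value below which 75% of the data fall; it forms the upper edge of the box
Maximum: The largest value in the dataset
Whiskers: Lines extending from the box edges (Q1 and Q3) toward the minimum and maximum. In geom_boxplot(), each whisker stops at the most extreme value that lies within 1.5 × IQR of the box edge
Interquartile range (IQR): Q3 minus Q1, the spread of the middle 50% of a dataset (the length of the box)
Outliers: Points beyond the whiskers, drawn individually
Mean (optional, sometimes plotted as a dot): The arithmetic average.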
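
The components listed above can be computed by hand. The following is a minimal sketch on a hypothetical toy vector, using the 1.5 × IQR rule and R's default quantile() definition; all variable names are illustrative.

```r
# Hypothetical toy data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q1  <- quantile(x, 0.25, names = FALSE)  # lower quartile
q3  <- quantile(x, 0.75, names = FALSE)  # upper quartile
iqr <- q3 - q1                           # same value as IQR(x)

# Fences: values outside these limits are drawn as outlier points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

outliers <- x[x < lower_fence | x > upper_fence]
```

Here the only outlier is 100, so the upper whisker stops at 9 even though the maximum of the data is 100.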

The table below highlights the skew of the distribution based on the mean and median in the box plot. The skew helps you decide which measure of central tendency to trust (median vs. mean).

Skew	Median and Mean positions	Interpretation
Right Skew	Mean > Median	A few large values pull the mean above most of the data, so the median is the more reliable measure of central tendency. Reporting the mean would suggest that a typical value is higher (e.g., more expensive) than most of the data points actually are.
Left Skew	Mean < Median	A few small values pull the mean below most of the data, so the median is the more reliable measure of central tendency. Reporting the mean would suggest that a typical value is lower (e.g., cheaper) than most of the data points actually are.
No Skew / Symmetrical Distribution	Mean ≈ Median ≈ Mode	The mean is a reliable measure of central tendency. The data are well-behaved, with no long tail distorting the average.
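
The comparison in the table can be sketched numerically. The helper skew_direction() below is hypothetical, and its tol argument (the gap between mean and median, as a fraction of the IQR, that still counts as "roughly equal") is an assumed threshold rather than a standard one. It also assumes the IQR is not zero.

```r
library(ggplot2)  # diamonds dataset

# Hypothetical helper: label the skew by comparing mean and median.
# `tol` is an assumed tolerance, expressed as a fraction of the IQR.
skew_direction <- function(x, tol = 0.05) {
  gap <- (mean(x) - median(x)) / IQR(x)
  if (gap > tol) {
    "right"
  } else if (gap < -tol) {
    "left"
  } else {
    "roughly symmetric"
  }
}

skew_direction(diamonds$price)

# Mean and median price within each clarity grade
aggregate(price ~ clarity, data = diamonds,
          FUN = function(p) c(mean = mean(p), median = median(p)))
```

For the overall price column the mean lies well above the median, so the helper reports a right skew and the median is the safer summary.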

# Boxplot of price by clarity grade; the red diamond marks each group's mean
ggplot(diamonds, mapping=aes(x=clarity,y=price)) + 
  geom_boxplot() + 
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
  coord_flip() + 
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2"= 6 , "SI1"= 5 , "VS2"= 4,
    "VS1"= 3, "VVS2"= 2, "VVS1"= 1, "IF" = 0 )) +
  scale_y_continuous(breaks=c(0,2500,5000,10000,15000,20000)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))  +
  labs(y = "Price",
       x = 'Diamond Clarity Scale', 
       title = "Boxplot to display distribution of numerical variables",
       caption = "Dataset: Diamonds Dataset | Data Source: ggplot2 package")

Associations between two numerical variables (scatter plot with regression lines, fitted curves)

Scatter plots with regression lines or fitted curves still fall under Exploratory Data Analysis (EDA), for the following reasons.

The purpose of EDA is to explore patterns, trends, and relationships visually, not necessarily to confirm them statistically.

A regression line or smooth curve (like geom_smooth() in R) in a scatter plot is often used as a visual guide to see the direction and form of a relationship (linear, curved, flat, etc.).

At this stage, you are not claiming statistical significance. You are only saying “it looks like carat and price have a strong positive association” or “this variable doesn’t show much relationship.”
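
A minimal sketch of such an exploratory plot, using the carat and price columns of diamonds, is shown below. The styling choices (transparency, red line, titles) and the name carat_price_plot are illustrative assumptions.

```r
library(ggplot2)

# Scatter plot of price against carat with a straight line as a visual guide
carat_price_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +  # transparency reduces overplotting
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  theme_bw() +
  labs(x = "Carat",
       y = "Price",
       title = "Scatter plot with a fitted line as a visual guide")

carat_price_plot

# The correlation coefficient is a quick numeric companion to the plot
cor(diamonds$carat, diamonds$price)
```

Dropping the method argument makes geom_smooth() fit a smooth curve instead, which helps when the relationship does not look linear.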

It only becomes advanced analytics (or the next stage after EDA) when you:

Fit formal regression models with coefficients.

Test hypotheses.

Check model assumptions, confidence intervals, p-values, etc.

Advanced Analytics (Statistical Modeling / Inferential Analysis)

Purpose: test hypotheses and estimate effects.

Tools: regression modeling, hypothesis tests, ANOVA, confidence intervals.

Note that the term explanatory variable (a type of independent variable) belongs to the context of statistical modeling (e.g., regression). In simple graphs, such as scatter plots with color coding, the variables are not called explanatory variables because we are only visualizing relationships or distributions. In those cases, the variables are simply plotted variables, grouping variables, or aesthetics (in ggplot2 terms). In graphs without modeling, you are not assuming causality or prediction, so you do not call them explanatory or response variables. Instead, you are just showing the structure of the data (e.g., “color-coded by clarity”).

Focus: “Does clarity significantly affect price?” rather than just “Does clarity seem to affect price?”
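
One way to pose the "significantly" version of that question is a one-way ANOVA of price on clarity. The sketch below uses base R's aov(); it is only an illustration and does not check the assumptions of the test. The name clarity_anova is illustrative.

```r
library(ggplot2)  # diamonds dataset

# One-way ANOVA: do mean prices differ across clarity grades?
clarity_anova <- aov(price ~ clarity, data = diamonds)

# F statistic and p-value for the clarity term
summary(clarity_anova)
```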

When you move from plotting scatterplots with trend lines (exploration) to actually fitting a regression model (estimating coefficients, p-values, R², etc.), that’s when you’ve stepped out of pure EDA and into statistical modeling / inferential analysis.
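
For contrast with the exploratory trend line, fitting the model formally might look like the following sketch. It uses base R's lm() on carat and price, is not a complete analysis, and leaves out any check of the model assumptions. The variable names are illustrative.

```r
library(ggplot2)  # diamonds dataset

# Fit a simple linear regression of price on carat
price_model <- lm(price ~ carat, data = diamonds)

# Coefficients, standard errors, p-values, and R-squared
model_summary <- summary(price_model)
model_summary

# 95% confidence intervals for the coefficients
confint(price_model)
```

Reading coefficients, p-values, and confidence intervals from this output is what moves the work from exploration into inference.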

Regression

Predictive / Advanced Modeling (if relevant)

Purpose: make predictions or optimize decisions.

Tools: machine learning models, cross-validation, regularization.

Statistics concepts explanation

Hussain Sarfraz

2025-09-24