In this section I will go over plots that are used during exploratory data analysis.
Plots that display frequency distributions, or the distribution of a numerical variable, focus on the variable being summarized. These graphs make no distinction between independent and dependent variables.
Plots that display the association between two numerical variables do make that distinction: the independent variable goes on the x-axis and the dependent variable goes on the y-axis.
Frequency distributions are used to display the counts of categorical variables in a dataset. A frequency distribution helps assess how balanced or uneven the sample sizes are among categories, which is important because unequal group sizes can affect the reliability of comparisons in later steps (such as when calculating averages). It is important to note that frequency distributions are used to summarize categorical variables by allowing us to see the mode, the category that occurs the most.
The purpose of a frequency distribution is analogous to that of central tendency for numerical variables, although the comparison is not exact: central tendency covers the mean, median, and mode, while a frequency distribution only shows the mode.
An uneven frequency distribution doesn’t invalidate the analysis; it tells you how much confidence to place in the averages and comparisons built on those groups.
Here is a table of the graphs I will talk about in this section:
| Graph Name | Frequency Distribution Type | Purpose |
|---|---|---|
| Bar Chart | frequency distribution of a single variable | To assess how balanced or uneven the sample sizes are among categories. Unequal group sizes can affect the reliability of comparisons between categories. |
| Bubble Chart | joint frequency distribution | To assess how balanced or uneven the sample sizes are across the combinations of two categorical variables, which tells us how reliable the averages calculated within those groups are. |
| Heat map | joint frequency distribution | To assess how balanced or uneven the sample sizes are across the combinations of two categorical variables, which tells us how reliable the averages calculated within those groups are. |
library(ggplot2)  # provides ggplot() and the diamonds dataset
#bar chart
# 8 (I1) = Lowest clarity grade
# 0 (IF) = Highest clarity grade
ggplot(data = diamonds) +
  geom_bar(aes(x = clarity)) +
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2" = 6, "SI1" = 5, "VS2" = 4,
    "VS1" = 3, "VVS2" = 2, "VVS1" = 1, "IF" = 0)) +
  scale_y_continuous(breaks = c(0, 2500, 5000, 7500, 10000, 12500)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Number of Diamonds",
       x = "Diamond Clarity Scale",
       title = "Bar Chart to display frequency distribution",
       subtitle = "Dataset: Diamonds Dataset",
       caption = "Data Source: ggplot2 package")
#bubble graph
# geom_count() sizes each point by the number of diamonds in that clarity/color cell
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = clarity, y = color)) +
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2" = 6, "SI1" = 5, "VS2" = 4,
    "VS1" = 3, "VVS2" = 2, "VVS1" = 1, "IF" = 0)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(size = "Diamond Count",
       y = "Color",
       x = "clarity grade",
       title = "Bubble chart to display the joint frequency distributions",
       subtitle = "Dataset: Diamonds Dataset",
       caption = "Data Source: ggplot2 package")
#heat map
library(dplyr)  # provides %>%, group_by(), and count()
# Count the diamonds in every clarity/color combination
clarity_color_counts <- diamonds %>%
  group_by(clarity, color) %>%
  count()
ggplot(clarity_color_counts, aes(x = clarity, y = color, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2" = 6, "SI1" = 5, "VS2" = 4,
    "VS1" = 3, "VVS2" = 2, "VVS1" = 1, "IF" = 0)) +
  labs(fill = "Count",
       y = "Color",
       x = "clarity grade",
       title = "Heat map to display the joint frequency distributions",
       subtitle = "Dataset: Diamonds Dataset",
       caption = "Data Source: ggplot2 package")
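When the exact joint counts are needed rather than a visual impression, a cross-tabulation gives the same information as the bubble chart and heat map in numeric form. The sketch below is one way to do this with base R's `table()`; the names `joint_counts` and `joint_df` are only illustrative.

```r
library(ggplot2)  # only needed here for the diamonds dataset

# Cross-tabulate clarity (rows) against color (columns)
joint_counts <- table(clarity = diamonds$clarity, color = diamonds$color)
print(joint_counts)

# Long format makes it easy to find the sparsest clarity/color combination
joint_df <- as.data.frame(joint_counts, responseName = "n")
joint_df[which.min(joint_df$n), ]
```

Cells with very few diamonds are the ones where a within-group average deserves the least confidence.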
Getting the exact counts through a table:
diamonds %>%
  group_by(clarity) %>%
  count()
## # A tibble: 8 × 2
## # Groups: clarity [8]
## clarity n
## <ord> <int>
## 1 I1 741
## 2 SI2 9194
## 3 SI1 13065
## 4 VS2 12258
## 5 VS1 8171
## 6 VVS2 5066
## 7 VVS1 3655
## 8 IF 1790
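The counts in this table can be turned into the mode and a rough measure of imbalance. The following is a minimal sketch that re-enters the counts printed above by hand so it runs without any packages; the names `clarity_counts`, `clarity_share`, and `imbalance_ratio` are illustrative.

```r
# Counts copied from the table above
clarity_counts <- c(I1 = 741, SI2 = 9194, SI1 = 13065, VS2 = 12258,
                    VS1 = 8171, VVS2 = 5066, VVS1 = 3655, IF = 1790)

# Mode: the category that occurs the most
mode_category <- names(clarity_counts)[which.max(clarity_counts)]
mode_category  # "SI1"

# Share of the sample in each category
clarity_share <- clarity_counts / sum(clarity_counts)
round(clarity_share, 3)

# Ratio of the largest to the smallest group as a quick imbalance check
imbalance_ratio <- max(clarity_counts) / min(clarity_counts)
imbalance_ratio
```

A large ratio signals that averages for the smallest group (here I1) rest on far fewer diamonds than averages for the largest group.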
The boxplot highlights the median and overall distribution shape (via quartiles and spread) while also flagging potential outliers. It gives a visual way to assess whether the data might be symmetric or skewed. A boxplot makes it easier to judge which measure of central tendency is more representative for your data.
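The summary() function and the boxplot are related: both are built around the five-number summary (Min, Q1, Median, Q3, Max), and summary() additionally reports the mean. They serve different purposes, though: summary() gives exact values, while the boxplot shows the shape of the distribution at a glance.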
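A boxplot contains a box spanning the first quartile (Q1) to the third quartile (Q3), a line inside the box at the median, whiskers that extend to the most extreme values within 1.5 times the interquartile range of the box, and individual points for potential outliers beyond the whiskers.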
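To see how the numbers from summary() relate to the parts a boxplot draws, here is a minimal sketch using base R's `boxplot.stats()`. The `prices` vector is made up for illustration and is not taken from the diamonds dataset.

```r
# Made-up prices for illustration only (not from the diamonds dataset)
prices <- c(300, 400, 500, 600, 700, 18000)

# summary(): exact values, including the mean
price_summary <- summary(prices)
print(price_summary)

# boxplot.stats(): the values a base R boxplot actually draws
box_parts <- boxplot.stats(prices)
box_parts$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
box_parts$out    # values beyond the whiskers, drawn individually as outliers
```

Note that the upper whisker stops at the largest non-outlying value rather than at the maximum, which is why the single extreme price appears in `out` instead.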
The table below highlights the skew of the distribution based on the mean and median in the box plot. The skew helps you decide which measure of central tendency to trust (median vs. mean).
| Skew | Median and Mean positions | Interpretation |
|---|---|---|
| Right Skew | Mean > Median | The mean is inflated relative to most of the data, so the median is more reliable for central tendency. Using the mean might suggest the average is more expensive/higher than most of the data points (when outliers are removed). |
| Left Skew | Mean < Median | The mean is deflated relative to most of the data, so the median is more reliable for central tendency. Using the mean might suggest the average is cheaper/lower than most of the data points (when outliers are removed). |
| No Skew / Symmetrical Distribution | Mean ≈ Median ≈ Mode | The mean is reliable for central tendency. The data is predictable and well-behaved. |
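The comparison in the table can also be checked numerically. The sketch below defines a hypothetical helper, `skew_direction()`, that compares the mean and median and treats small gaps (relative to the interquartile range) as roughly symmetric. The 5% tolerance and the toy vectors are assumptions chosen for illustration, not values from the diamonds dataset.

```r
# Hypothetical helper: classify skew from the gap between mean and median.
# The tolerance (a fraction of the IQR) is an arbitrary illustrative choice.
skew_direction <- function(x, tolerance = 0.05) {
  centre_gap <- mean(x) - median(x)
  threshold <- tolerance * IQR(x)
  if (abs(centre_gap) <= threshold) {
    "roughly symmetric"
  } else if (centre_gap > 0) {
    "right skew"
  } else {
    "left skew"
  }
}

# Toy data for illustration only
skew_direction(c(300, 400, 500, 600, 700, 18000))  # "right skew"
skew_direction(c(1, 18, 19, 20, 21, 22))           # "left skew"
skew_direction(c(1, 2, 3, 4, 5))                   # "roughly symmetric"
```

This is only a rule of thumb; the boxplot remains the better tool for judging the overall shape.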
ggplot(diamonds, mapping = aes(x = clarity, y = price)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
  coord_flip() +
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2" = 6, "SI1" = 5, "VS2" = 4,
    "VS1" = 3, "VVS2" = 2, "VVS1" = 1, "IF" = 0)) +
  scale_y_continuous(breaks = c(0, 2500, 5000, 10000, 15000, 20000)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Price",
       x = "Diamond Clarity Scale",
       title = "Boxplot to display distribution of numerical variables",
       caption = "Dataset: Diamonds Dataset | Data Source: ggplot2 package")
Scatter plots with regression lines or fitted curves can still fall under Exploratory Data Analysis (EDA). Here’s why:
The purpose of EDA is to explore patterns, trends, and relationships visually, not necessarily to confirm them statistically.
A regression line or smooth curve (like geom_smooth in R) in a scatter plot is often used as a visual guide to see the direction and form of a relationship (linear, curved, flat, etc.).
At this stage, you’re not claiming statistical significance; you’re just saying “it looks like carat and price have a strong positive association” or “this variable doesn’t show much relationship.”
It only becomes advanced analytics (or the next stage after EDA) when you fit formal regression models with coefficients, test hypotheses, or check model assumptions, confidence intervals, p-values, etc.
The purpose of that next stage is to test hypotheses and estimate effects. Its tools are regression modeling, hypothesis tests, ANOVA, and confidence intervals. Its focus is “Does clarity significantly affect price?” rather than just “Does clarity seem to affect price?”
When you move from plotting scatterplots with trend lines (exploration) to actually fitting a regression model (estimating coefficients, p-values, R², etc.), that’s when you’ve stepped out of pure EDA and into statistical modeling / inferential analysis.
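The difference between the two stages can be sketched with the carat and price example: the same straight line is a visual guide in one case and a fitted model in the other. This is a minimal illustration, assuming a simple linear fit of price on carat; the object names `trend_plot` and `price_model` are hypothetical.

```r
library(ggplot2)  # provides ggplot() and the diamonds dataset

# Exploration: the trend line is only a visual guide to direction and form
trend_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme_bw() +
  labs(x = "Carat", y = "Price")
print(trend_plot)

# Modeling: the same line, now with estimated coefficients, p-values, and R-squared
price_model <- lm(price ~ carat, data = diamonds)
summary(price_model)
```

Reading the plot stays within EDA; interpreting the coefficient table from `summary()` is where inferential analysis begins.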
Beyond inference, the purpose shifts to making predictions or optimizing decisions, with tools such as machine learning models, cross-validation, and regularization.