Scatterplots are a foundational tool for visualizing the relationship between two continuous variables. The distribution of the points lets us identify correlations, detect outliers, and uncover clusters in the data.
We load the tidyverse, which provides dplyr for data manipulation and ggplot2 for plotting.

    library(tidyverse)

A basic scatterplot maps two continuous variables to the x and y axes. In this example, we examine the relationship between sepal length and sepal width in the iris dataset.

    iris %>%
      ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point() +
      labs(title = "Basic Scatterplot",
           x = "Sepal Length (cm)",
           y = "Sepal Width (cm)")

To make points more distinct, we can use shapes 21 to 25, which support both a color (outline) and a fill (interior).
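Which shapes accept a fill can be checked directly. The sketch below uses a small made-up data frame (`shape_demo` is a hypothetical name, not part of the original example) and draws shapes 21 to 25 with a black outline and a salmon fill:

```r
library(ggplot2)

# Hypothetical helper data: one point per fillable shape code (21 to 25).
shape_demo <- data.frame(x = 1:5, y = 1, shape = 21:25)

p_shapes <- ggplot(shape_demo, aes(x = x, y = y)) +
  # shape is mapped from the data; color and fill are fixed constants
  geom_point(aes(shape = shape), size = 6, color = "black", fill = "salmon") +
  # use the numeric shape codes as-is instead of letting ggplot2 pick shapes
  scale_shape_identity() +
  # label each point with its shape code
  geom_text(aes(label = shape), vjust = 3) +
  labs(title = "Shapes 21 to 25 Support Both Color and Fill")

p_shapes
```

The customized scatterplot that follows uses shape 21, the filled circle.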

    iris %>%
      ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point(size = 3,
                 shape = 21,
                 color = "black",
                 fill = "salmon",
                 alpha = 0.7) +
      labs(title = "Customized Scatterplot: Fill, Color, and Alpha")

Scatterplots can represent more than two variables by mapping additional variables to color and size.

Using color to represent a third, categorical variable shows how different groups cluster.

    iris %>%
      ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
      geom_point(size = 3) +
      scale_color_brewer(palette = "Set1") +
      labs(title = "Clustering by Species")

By mapping size to a third, numerical variable, you create a “bubble chart” that shows three dimensions of the data at once.

    iris %>%
      ggplot(aes(x = Sepal.Length, y = Sepal.Width, size = Petal.Length)) +
      geom_point(alpha = 0.6) +
      labs(title = "Size Encodes Petal Length")

To identify the overall trend, we add a linear regression line using geom_smooth(). Setting method = "lm" fits a straight line, and se = FALSE hides the confidence band around it.
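The line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit of y on x, so its coefficients can be inspected with base R's `lm()`. A minimal sketch (the object names `fit` and `trend` are illustrative, not from the original):

```r
# Fit the model the trend line represents: Sepal.Width as a linear
# function of Sepal.Length, pooled over all species.
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)

# Named vector holding the intercept and the slope of the fitted line.
trend <- coef(fit)
print(trend)
```

For the pooled iris data the slope is slightly negative, which is why the line in the plot that follows tilts downward.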

    iris %>%
      ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", color = "red", se = FALSE) +
      labs(title = "Scatterplot with Linear Trend Line")

When points overlap significantly (overplotting), splitting the graph into one facet per category is a systematic way to improve clarity.
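How much overlap a dataset has can be estimated before plotting. As a rough sketch, assuming that overplotting here means observations sharing identical x and y values, dplyr's `count()` lists the coordinates that occur more than once (`overlaps` and `n_points` are illustrative names):

```r
library(dplyr)

# Count how many observations share each (Sepal.Length, Sepal.Width) pair
# and keep only the pairs that occur more than once.
overlaps <- iris %>%
  count(Sepal.Length, Sepal.Width, name = "n_points") %>%
  filter(n_points > 1) %>%
  arrange(desc(n_points))

head(overlaps)
```

Each returned row marks a position where several flowers are drawn on top of each other. Faceting, shown next, separates overlaps that come from different species, while alpha remains useful for overlaps within a single species.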
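
    iris %>%
      ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
      geom_point() +
      facet_wrap(~Species) +
      theme(legend.position = "none") +
      labs(title = "Faceted by Species")

The final example combines the elements above: a manual color scale, per-species shapes, a bold title, and a statistical overlay. Because color and shape are mapped inside geom_point(), geom_smooth() fits a single trend line to all species pooled.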

    iris %>%
      ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point(aes(color = Species, shape = Species), size = 3, alpha = 0.8) +
      geom_smooth(method = "lm", color = "black", linetype = "dashed", se = FALSE) +
      scale_color_manual(values = c("setosa" = "#A88EF2",
                                    "versicolor" = "#FBB222",
                                    "virginica" = "#4CAF50")) +
      labs(title = "Advanced Scatterplot Analysis",
           subtitle = "Sepal Dimensions with Linear Trends across Species",
           x = "Sepal Length (cm)",
           y = "Sepal Width (cm)",
           caption = "Dataset: Iris | Prepared by Abdullah Al Shamim") +
      theme(legend.position = "bottom",
            plot.title = element_text(face = "bold", size = 14))

| Aesthetic Mapping | Usage | Variable Type |
|---|---|---|
| Color | Grouping or Intensity | Categorical or Continuous |
| Size | Weight or Magnitude | Continuous |
| Shape | Differentiation | Categorical |
| Alpha | Managing Overplotting | Numerical (0-1) |
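
The examples above map color only to a categorical variable. As a sketch of the continuous case listed in the table, color can instead be mapped to `Petal.Length`. The choice of `scale_color_viridis_c()` and the name `p_gradient` are assumptions made for illustration, not something the original examples use:

```r
library(ggplot2)

p_gradient <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                               color = Petal.Length)) +
  # a numeric variable mapped to color yields a gradient, not discrete groups
  geom_point(size = 3, alpha = 0.8) +
  # continuous viridis palette provided by ggplot2
  scale_color_viridis_c() +
  labs(title = "Color Encodes Petal Length",
       x = "Sepal Length (cm)",
       y = "Sepal Width (cm)",
       color = "Petal Length (cm)")

p_gradient
```

With a continuous mapping the legend becomes a color bar, so it conveys intensity or magnitude instead of group membership.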