Code-Through: Data Visualizations in R

1. Overview and Goals

In this code-through, I will demonstrate several common data visualization techniques in R.

The main goals are to:
- Show how different plot types answer different kinds of questions about data
- Compare fuel efficiency across cars using the mpg dataset
- Introduce spatial visualizations (a choropleth map and a Dorling cartogram)

By the end, you should be able to:
- Choose an appropriate plot type for your question
- Apply ggplot2 code to explore and visualize your own data

2. Getting to Know the `mpg` Dataset

We start with the mpg dataset that comes with ggplot2. Each row represents a specific car model, with variables such as:
- displ: engine displacement (liters)
- cty: city miles per gallon
- hwy: highway miles per gallon
- class: type of vehicle (compact, suv, pickup, etc.)

Before visualizing, it is helpful to look at the data directly and understand what variables are available.

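# Load the packages used throughout this code-through
library(ggplot2)  # plotting functions and the mpg dataset
library(dplyr)    # glimpse(), filter(), and the %>% pipe

# Peek at the first few rows
head(mpg)

# Compact overview of variables and types
glimpse(mpg)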

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

From the output above, we can see that the dataset has 234 observations and 11 variables, a mix of numeric and categorical. We will use displ, hwy, and class in the next plots to explore fuel efficiency patterns.
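Before plotting, a quick numeric summary can back up what glimpse() suggests. The following is a minimal sketch; the choice of columns and the name class_counts are only illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Range and quartiles of the two numeric variables used in the next plots
summary(mpg[, c("displ", "hwy")])

# Number of cars in each vehicle class
class_counts <- table(mpg$class)
class_counts
```

The counts show how unevenly the classes are represented, which is worth keeping in mind when comparing groups later.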

3. Scatterplot: Engine Size vs Highway MPG

A scatter plot is useful for visualizing the relationship between two numeric variables. Here, we want to see how engine size (displ) relates to highway fuel efficiency (hwy). Each point in the plot will represent one car model.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    title = "Scatterplot of Engine Displacement vs Highway MPG",
    x = "Engine Displacement (liters)",
    y = "Highway MPG"
  ) +
  theme_minimal()

Assessing relationships:

We can see a clear negative relationship: as engine displacement increases, highway MPG tends to decrease. This is what we would expect—larger engines usually burn more fuel. You can apply this same pattern to any data set where you want to explore the relationship between two numeric variables (for example, study hours vs. exam score, or age vs. income).
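One way to put a number on the trend described above is a correlation coefficient, optionally with a fitted line on the plot. This is a sketch rather than part of the original analysis: it assumes a straight line is a reasonable first approximation, and the names r and p_trend are illustrative.

```r
library(ggplot2)

# Pearson correlation between engine size and highway MPG;
# a negative value matches the downward trend in the scatterplot
r <- cor(mpg$displ, mpg$hwy)
r

# Same scatterplot with a fitted straight line as a visual guide
p_trend <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme_minimal()
p_trend
```

If the points curve away from the line, the linear summary understates the pattern and a different smoother may describe the data better.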

3.1 Adding a Categorical Variable with Color

Sometimes we want to see whether the relationship differs across groups. We can add class as a color aesthetic to see how different types of cars compare.

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    title = "Engine Displacement vs Highway MPG by Vehicle Class",
    x = "Engine Displacement (liters)",
    y = "Highway MPG",
    color = "Vehicle Class"
  ) +
  theme_minimal()

Using color to represent class reveals that certain vehicle types (like SUVs and pickups) tend to have larger engines and lower MPG, while compact and subcompact cars cluster in the more efficient region. In your work, you can map any categorical variable to color to see how groups differ on a scatter plot.
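The group differences visible in the colored scatter plot can also be checked with a small table of class averages. This is a minimal sketch using base R's aggregate(); the name class_means is illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Mean engine size and mean highway MPG for each vehicle class
class_means <- aggregate(cbind(displ, hwy) ~ class, data = mpg, FUN = mean)

# Sort from least to most fuel efficient
class_means[order(class_means$hwy), ]
```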

4. Histogram: Distribution of Highway MPG

We want to understand how highway MPG is distributed across all cars in the data set. A histogram shows how many observations fall into different value ranges of a single numeric variable.

ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20) +
  labs(
    title = "Histogram of Highway MPG",
    x = "Highway MPG",
    y = "Count"
  ) +
  theme_minimal()

The histogram shows where most cars fall in terms of fuel efficiency and whether there are very low or very high MPG values. To adapt this to your data, you can replace hwy with any numeric variable you want to explore.
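The shape of a histogram depends on how the bins are defined, so it can help to try more than one setting. The sketch below sets the bin width directly instead of the number of bins; the width of 2 MPG is an arbitrary choice for illustration, and p_hist is an illustrative name.

```r
library(ggplot2)

# binwidth = 2 groups cars into 2-MPG-wide bins;
# boundary = 0 aligns the bin edges to even numbers
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, boundary = 0) +
  labs(x = "Highway MPG", y = "Count") +
  theme_minimal()
p_hist
```

Narrower bins show more detail but more noise; wider bins smooth the picture at the cost of hiding small clusters.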

5. Boxplot: Comparing Fuel Efficiency Across Vehicle Classes

While the histogram shows the overall distribution of hwy, a box plot lets us compare the distribution across categories. Here we compare highway MPG across different vehicle classes.

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(
    title = "Boxplots of Highway MPG by Vehicle Class",
    x = "Vehicle Class",
    y = "Highway MPG"
  ) +
  theme_minimal()

Each box summarizes the distribution within a class: the line in the middle is the median, the box spans the middle 50% of values (the interquartile range), and points beyond the whiskers are potential outliers. This helps us see which classes tend to have higher or lower fuel efficiency. For example, we can quickly see that compact and subcompact cars generally have higher MPG than SUVs and pickups. You can use this same pattern to compare any numeric variable across groups in your own data.
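To connect the boxes to actual numbers, the medians can be computed directly, and the classes can be ordered by median so the comparison reads left to right. This is a sketch of one possible variation; class_medians and p_box are illustrative names.

```r
library(ggplot2)

# Median highway MPG per class: the value drawn as the line inside each box
class_medians <- tapply(mpg$hwy, mpg$class, median)
sort(class_medians)

# Same boxplot with the classes ordered by their median
p_box <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "Vehicle Class (ordered by median)", y = "Highway MPG") +
  theme_minimal()
p_box
```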

6. Faceting: Small Multiples by Vehicle Class

Color is helpful, but sometimes plots get crowded. Faceting creates small multiples: the same plot repeated in a separate panel for each category. Here we facet the scatter plot by class.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class) +
  labs(
    title = "Engine Displacement vs Highway MPG, Faceted by Vehicle Class",
    x = "Engine Displacement (liters)",
    y = "Highway MPG"
  ) +
  theme_minimal()

Faceting makes it easier to compare patterns between classes without overlap. For example, in some panels we see a narrow range of engine sizes, while others show a wider spread. In your work, you can facet by any categorical variable (such as region, gender, or experimental condition) to compare patterns across subgroups.
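As an illustration of faceting by a different categorical variable, the sketch below uses drv (drive type) from the same dataset and places the panels in a single row. The choice of drv and the name p_facet are assumptions made for this example.

```r
library(ggplot2)

# Facet by drive type (4 = four-wheel, f = front-wheel, r = rear-wheel);
# ncol = 3 puts the three panels side by side
p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv, ncol = 3) +
  labs(x = "Engine Displacement (liters)", y = "Highway MPG") +
  theme_minimal()
p_facet
```

Because all panels share the same axes by default, differences in position between panels reflect real differences in the data.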

7. Choropleth Map: Population by Country

So far we have worked with tabular car data. Next, we switch to spatial data to show how data can be mapped onto geographic regions. We use the rnaturalearth package to load country boundaries and then color each country by its estimated population (pop_est).

# Load world country boundaries as an sf object
# (needs the rnaturalearth and sf packages; scale = "medium" also needs rnaturalearthdata)
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")

# Create a filtered version (remove Antarctica and missing population)
# filter() and %>% come from dplyr
world_small <- world %>%
  filter(!is.na(pop_est), name != "Antarctica")

# Preview the key columns
head(world_small[, c("name", "pop_est")])

ggplot(world_small) +
  geom_sf(aes(fill = pop_est)) +
  scale_fill_viridis_c() +
  labs(
    title = "World Map Colored by Population Estimate",
    fill = "Population"
  ) +
  theme_minimal()

This type of map is called a choropleth. Each country is shaded according to its population; with the default viridis scale used here, brighter (yellow) colors indicate larger values and darker (purple) colors indicate smaller ones. Choropleths are useful when the data are aggregated by region (countries, states, counties, etc.) and we want to see geographic patterns. To apply this to your own data, join regional data (for example, unemployment by state) to a spatial object and map the variable of interest to fill.
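The join step can be sketched on a toy example. Everything here is made up for illustration: the three square "regions", the unit_square() helper, the key column region, and the unemployment values are hypothetical stand-ins for real boundaries and real statistics.

```r
library(sf)
library(dplyr)
library(ggplot2)

# Hypothetical helper: a 1 x 1 square polygon with lower-left corner (x0, y0)
unit_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Toy spatial object: three made-up regions standing in for real boundaries
regions <- st_sf(
  region = c("A", "B", "C"),
  geometry = st_sfc(unit_square(0, 0), unit_square(1, 0), unit_square(2, 0))
)

# Made-up regional table (illustrative values, not real statistics)
rates <- data.frame(
  region = c("A", "B", "C"),
  unemployment = c(3.5, 5.0, 6.5)
)

# Join the table onto the spatial object by the shared key column
regions_joined <- left_join(regions, rates, by = "region")

# Map the joined variable to fill, as in the world map
ggplot(regions_joined) +
  geom_sf(aes(fill = unemployment)) +
  scale_fill_viridis_c() +
  labs(fill = "Unemployment (%)") +
  theme_minimal()
```

With real data the key is usually a region code or name, and it is worth checking for regions that fail to match, since they end up with missing values after the join.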

8. Dorling Cartogram: Emphasizing Population Size

Sometimes geographic shapes distract from the main message. A Dorling cartogram simplifies each region into a circle whose size represents a numeric variable—in this case, population—while keeping countries roughly in their geographic positions. To construct a Dorling cartogram, we first reproject the map to a planar coordinate system and then generate circles sized by pop_est.

# Step 1 — Reproject world map into a planar CRS (Robinson projection)
world_proj <- sf::st_transform(world_small, crs = "ESRI:54030")  # 54030 is an ESRI code, not an EPSG code

# Step 2 — Create Dorling cartogram using the projected map
world_dorling <- cartogram::cartogram_dorling(world_proj, weight = "pop_est")

# Step 3 — Plot Dorling cartogram
ggplot(world_dorling) +
  geom_sf(aes(fill = pop_est)) +
  scale_fill_viridis_c() +
  labs(
    title = "Dorling Cartogram of World Population",
    fill = "Population"
  ) +
  theme_minimal()

In the Dorling cartogram, countries with large populations are represented by much larger circles, making their relative importance immediately visible. Exact shapes and borders are sacrificed to highlight the magnitude of the variable. This can be used whenever we care more about comparing values than preserving exact geography, for example comparing city populations or the size of different markets.
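When reading circle sizes, it helps to know how values translate into radii. The sketch below assumes the usual Dorling convention that circle area, not radius, is proportional to the value; the three values are made up for illustration and are not taken from the population data.

```r
# Made-up values for three regions (illustration only)
value <- c(small = 1, medium = 4, large = 100)

# If circle AREA is proportional to the value, then radius = sqrt(value / pi)
radius <- sqrt(value / pi)

# Radii relative to the smallest circle
relative_radius <- radius / radius[["small"]]
relative_radius
```

Under this assumption a region with 100 times the value gets a circle only 10 times as wide, which is why very large countries do not completely swamp the map.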

9. Conclusion and How to Use This Code-Through

In this code-through, we walked through a sequence of data visualizations, each designed to answer a different type of question:

- Scatterplots showed relationships between two numeric variables (displ vs. hwy).
- Color and faceting helped us compare those relationships across groups (class).
- Histograms allowed us to see the distribution of a single variable (hwy).
- Boxplots compared distributions across categories (highway MPG by vehicle class).
- Choropleth maps displayed spatial variation in a variable (population by country).
- Dorling cartograms emphasized the relative magnitude of values using circles.

The overall story is that choosing the right visualization depends on the question:
- Use scatterplots for relationships.
- Use histograms and boxplots for distributions and group comparisons.
- Use faceting to compare patterns across subgroups.
- Use maps and cartograms when data is tied to places.

Every code chunk in this document can be reused for your own questions by swapping in a different data set and changing the variables inside aes().
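As a sketch of that reuse, here is the scatterplot template applied to mtcars, another dataset that ships with R. The choice of mtcars and of the wt and mpg columns is only an example, and p_reuse is an illustrative name.

```r
library(ggplot2)

# Same scatterplot template, different data set and aes() variables
p_reuse <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Car Weight vs Fuel Efficiency (mtcars)",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()
p_reuse
```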