In this code-through, I explore data visualizations of fuel efficiency using the mpg dataset. I use a histogram, a bar chart, and a scatterplot to understand how different types of vehicles compare in terms of fuel efficiency.

Load Libraries

To complete this code-through, I load the tidyverse library, which includes both ggplot2 and dplyr. Both packages are central to this analysis: ggplot2 draws the plots, and dplyr prepares the data behind them.

The ggplot2 package is one of the most widely applicable visualization tools in R because it uses a simple, consistent structure to implement different visualization techniques. While the base code for each visualization method is straightforward, it can be adjusted for the size of a dataset, the variables being compared, scales, and aesthetic preferences to create more complex graphics.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(mpg)
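
As a minimal sketch of that shared structure, the example below builds one base plot and then adds two different geoms to it. The object names (base_plot, hist_plot, density_plot) and the choice of the cty variable are illustrative assumptions, not part of the analysis that follows.

```r
library(ggplot2)  # also provides the mpg dataset

# One base object holds the data and the aesthetic mapping
base_plot <- ggplot(mpg, aes(x = cty))

# Adding a different geom layer changes the chart type
hist_plot    <- base_plot + geom_histogram(bins = 20)
density_plot <- base_plot + geom_density()

hist_plot
density_plot
```

Only the geom layer differs between the two plots; the data and mapping are written once.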

Data Exploration

While data exploration isn’t necessary, it can help to clarify and conceptualize the data being analyzed.

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
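
Because summary() reports character columns such as class only by length, a quick count can help show how the vehicles are spread across classes. This is a small sketch using dplyr's count(); the name class_counts is an illustrative assumption.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Count the vehicles in each class, largest group first
class_counts <- mpg %>%
  count(class, sort = TRUE)

class_counts
```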

MPG Histogram

Histograms show how the values of a variable are distributed. For the mpg dataset, the histogram shows the distribution of highway fuel efficiency across the 234 vehicles in the data. It shows how often each range of values occurs, whether the distribution is skewed, and where values cluster. These visual indicators help identify outlying values and gaps in the data, and they provide a starting point for building assumptions and hypotheses.

To build this histogram, I use the geom_histogram() function from the ggplot2 package. It automatically divides the hwy values from the mpg dataset into bins and counts the observations in each bin. I then customize the appearance by setting the fill color, the outline color, and the number of bins. It is best practice to provide a specific and concise graph title and descriptive labels for both the x- and y-axes so that readers can understand what the visualization is measuring.

ggplot(mpg, aes(x = hwy)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 20) +
  labs(
    title = "Distribution of Highway MPG",
    x = "Highway MPG",
    y = "Car Count"
  ) +
  theme_minimal()
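
The shape of a histogram depends on how the bins are chosen, so it can be worth trying an alternative. The sketch below swaps bins = 20 for an explicit binwidth of 2 mpg (an arbitrary value chosen for illustration) and uses ggplot2's layer_data() to look at the counts behind the bars. The names hwy_hist and bin_counts are illustrative.

```r
library(ggplot2)

# Same histogram, but each bin now spans 2 mpg instead of splitting into 20 bins
hwy_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(fill = "steelblue", color = "black", binwidth = 2) +
  labs(
    title = "Distribution of Highway MPG",
    x = "Highway MPG",
    y = "Car Count"
  ) +
  theme_minimal()

hwy_hist

# Pull the bin edges and counts that ggplot2 computed for the histogram layer
bin_counts <- layer_data(hwy_hist)[, c("xmin", "xmax", "count")]
bin_counts
```

Every vehicle falls into exactly one bin, so the counts add up to the number of rows in mpg.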

MPG Bar Chart

Bar charts can explore data in a more granular manner by comparing a summary value across categories. Here, the bar chart highlights differences between vehicle classes and shows which classes are the most fuel efficient. For this dataset, that can help analysts and viewers determine which type of car best fits their needs based on size and estimated fuel costs. Bar charts simplify comparisons because differences in bar height are visible immediately. This makes them intuitive for viewers who may not be well versed in data analytics, and a good option for a wider audience.

To build this bar chart, I started by creating a summary table called avg_mpg from the mpg dataset. I grouped the observations by vehicle class using the group_by() function, then used summarize() to create a new variable called mean_hwy, the average highway fuel efficiency for each vehicle class. After preparing that table, I called ggplot() on the avg_mpg summary data. Inside the aes() function, I mapped vehicle class to the x-axis, mean_hwy to the y-axis, and, for aesthetic purposes, fill color to vehicle class. To customize this chart, I hid the legend because the x-axis already labels each class, picked a color palette, and then labeled the axes and titled the graph in a clear and concise manner.

avg_mpg <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy))

ggplot(avg_mpg, aes(x = class, y = mean_hwy, fill = class)) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    title = "Average Highway MPG by Vehicle Class",
    x = "Vehicle Class",
    y = "Average Highway Fuel Efficiency"
  )
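
By default, ggplot2 orders the bars alphabetically by class. If a ranking from least to most efficient is the goal, one option is to sort the x-axis by the summarized value. The sketch below does this with base R's reorder(); the name ranked_plot is illustrative, and this variation is a suggestion rather than part of the original chart.

```r
library(dplyr)
library(ggplot2)

# Average highway mpg per vehicle class (same summary as above)
avg_mpg <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy))

# reorder() sorts the classes by mean_hwy, so bars rise from left to right
ranked_plot <- ggplot(avg_mpg, aes(x = reorder(class, mean_hwy), y = mean_hwy, fill = class)) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    title = "Average Highway MPG by Vehicle Class",
    x = "Vehicle Class",
    y = "Average Highway Fuel Efficiency"
  )

ranked_plot
```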

MPG Scatterplot

Scatterplots depict the relationship between a pair of continuous variables. Here, engine size is compared to highway fuel efficiency, and the plot shows a clear negative trend: as engine size increases, fuel efficiency decreases. Additionally, each data point is colored by vehicle class, which reveals relationships between variables that aren't obvious from looking through the raw data. This scatterplot highlights variation across engine size and vehicle class, identifies data clusters, and allows for multivariate analysis in one easy-to-read illustration.

To build this scatterplot, I called ggplot() using the mpg dataset. In the aes() function, I mapped the displ variable (engine size) to the x-axis and the hwy variable (highway fuel efficiency) to the y-axis. Similar to how I differentiated each bar in the bar chart by vehicle class, I mapped the color of each point to vehicle class as well. To customize this scatterplot, I adjusted the size and transparency of the data points. I then labeled the axes and titled the graph in a clear and concise manner.

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Engine Size vs Highway MPG by Vehicle Class",
    x = "Engine Displacement (Liters)",
    y = "Highway Fuel Efficiency",
    color = "Vehicle Class"
  ) +
  theme_minimal()
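
To put a number on the negative trend, one option is to compute the correlation between the two variables and overlay a straight trend line. The sketch below uses base R's cor() and ggplot2's geom_smooth(); a linear fit is an assumption made for illustration, and the names displ_hwy_cor and trend_plot are illustrative.

```r
library(ggplot2)

# Pearson correlation between engine displacement and highway mpg
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
displ_hwy_cor

# Color is mapped inside geom_point() so the trend line is fit to all points at once
trend_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  labs(
    title = "Engine Size vs Highway MPG by Vehicle Class",
    x = "Engine Displacement (Liters)",
    y = "Highway Fuel Efficiency",
    color = "Vehicle Class"
  ) +
  theme_minimal()

trend_plot
```

A negative correlation and a downward-sloping line both reflect the pattern described above.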

This analysis ultimately showcases the flexibility and customizability of the ggplot2 package. Each of these visualizations can be created with a few lines of code and customized to highlight particular aspects of the data. Visualizations like these transform raw data into clear illustrations of patterns that can inform decision making. For this dataset, that means helping car buyers; on a larger scale, the same approach can inform policy decisions.