For this code-through, I explore data visualizations of fuel efficiency using the mpg dataset. I use a histogram, a bar chart, and a scatterplot to show how different types of vehicles compare in terms of fuel efficiency.
To complete this code-through, I loaded the tidyverse library, which includes both ggplot2 and dplyr. ggplot2 builds the graphics, and dplyr prepares and summarizes the data behind them.
The ggplot2 package is one of the most widely used visualization tools in R because it uses a simple, consistent structure to implement different visualization techniques. While the base code for each chart type is straightforward, it can be adjusted for the size of a dataset, the variables being compared, scales, and aesthetic preferences to create more complex graphics.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(mpg)
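As a rough sketch of the layered structure described above, a ggplot2 graphic starts from a dataset and a set of aesthetic mappings, and geoms are then added on top with `+`. The object names `base_plot` and `layered_plot`, and the choice of cty and hwy as variables, are illustrative assumptions and not part of the original analysis.

```r
library(ggplot2)  # also provides the mpg dataset

# Step 1: data plus aesthetic mappings; this draws empty axes only
base_plot <- ggplot(mpg, aes(x = cty, y = hwy))

# Step 2: add a geom layer to actually draw the observations
layered_plot <- base_plot + geom_point()

layered_plot
```

Every chart later in this code-through follows the same pattern: a `ggplot()` call, one geom, and optional labels and themes.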
While data exploration isn’t strictly necessary, it helps to clarify the structure and contents of the data being analyzed.
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
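Because summary() reports only length and type for character columns such as class, a quick count can fill that gap. The following is a minimal sketch using dplyr's count(); the name `class_counts` is a hypothetical one chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Count the observations in each vehicle class, largest group first
class_counts <- mpg %>%
  count(class, sort = TRUE)

class_counts
```

Knowing how many vehicles fall into each class is useful context for the per-class averages computed later, since a small group can produce a less stable mean.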
Histograms show how the values of a single variable are distributed. For the mpg dataset, the histogram below shows the distribution of highway fuel efficiency (hwy) across all 234 vehicles. It shows how often each range of values occurs, whether the distribution is skewed, and where the values cluster. These visual cues help to identify outlying values and gaps in the data, and they provide a starting point for building assumptions and hypotheses.
To build this histogram, I use the geom_histogram() function from the ggplot2 package. It automatically divides the hwy values into bins and counts the number of observations in each bin. I then customize the appearance by setting the fill color, the outline color, and the number of bins. It is best practice to provide a specific, concise title and descriptive labels for both the x and y axes so that readers can understand what the visualization is measuring.
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 20) +
  labs(
    title = "Distribution of Highway MPG",
    x = "Highway MPG",
    y = "Car Count"
  ) +
  theme_minimal()
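The histogram above fixes the number of bins. An alternative is to fix the width of each bin with the `binwidth` argument, which can make the x-axis easier to read. The sketch below assumes a width of 2 mpg purely for illustration, and `hwy_hist_binwidth` is a hypothetical object name.

```r
library(ggplot2)

# Assumption: a 2-mpg bin width is an illustrative choice, not a recommendation
hwy_hist_binwidth <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(fill = "steelblue", color = "black", binwidth = 2) +
  labs(
    title = "Distribution of Highway MPG (2-mpg bins)",
    x = "Highway MPG",
    y = "Car Count"
  ) +
  theme_minimal()

hwy_hist_binwidth
```

Trying a few different widths is worthwhile, because the apparent shape of a distribution can change with the binning.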
Bar charts explore the data at a more granular level. They highlight differences between categories and show at a glance which vehicle classes are the most fuel efficient. For this dataset, a bar chart can help analysts and viewers determine which class of car best fits their needs based on size and expected fuel costs. Bar charts simplify comparisons by making differences between groups immediately visible. They are also intuitive for viewers who may not be well versed in data analytics, which makes them a good option for a wider audience.
To build this bar chart, I started by creating a summary table called avg_mpg from the mpg dataset. I grouped the observations by vehicle class using group_by(), then used summarize() to create a new variable called mean_hwy, the average highway fuel efficiency for each class. After preparing that table, I passed avg_mpg to ggplot(). Inside aes(), I mapped vehicle class to the x-axis, mean_hwy to the y-axis, and, for aesthetic purposes, fill color to vehicle class. To customize the chart, I hid the legend because the x-axis already identifies each class, picked a color palette, and then labelled the axes and titled the graph in a clear and concise manner.
avg_mpg <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy))

ggplot(avg_mpg, aes(x = class, y = mean_hwy, fill = class)) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    title = "Average Highway MPG by Vehicle Class",
    x = "Vehicle Class",
    y = "Average Highway Fuel Efficiency"
  )
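By default, ggplot2 places character categories on the x-axis in alphabetical order. If a visual ranking is the goal, one option is to reorder the classes by their mean before plotting. This is a minimal sketch using base R's reorder(); the names `avg_mpg_ranked` and `ranked_plot` are hypothetical, and the single fill color is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# Recompute the per-class averages and sort them from highest to lowest
avg_mpg_ranked <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy)) %>%
  arrange(desc(mean_hwy))

# reorder() turns class into a factor ordered by -mean_hwy,
# so the most efficient class is drawn first
ranked_plot <- ggplot(avg_mpg_ranked, aes(x = reorder(class, -mean_hwy), y = mean_hwy)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Average Highway MPG by Vehicle Class (Ranked)",
    x = "Vehicle Class",
    y = "Average Highway Fuel Efficiency"
  )

ranked_plot
```

Sorting the table with arrange() only affects the printed output; the bar order in the plot comes from the reorder() call inside aes().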
## MPG Scatterplot
Scatterplots depict the relationship between a pair of continuous variables. Here, engine size is compared to highway fuel efficiency, and the plot shows a clear negative trend: as engine size increases, fuel efficiency decreases. Additionally, each data point is colored by vehicle class, which reveals relationships between variables that aren’t immediately obvious from the raw data. This scatterplot highlights variation across engine size and vehicle class, identifies clusters, and allows for multivariate analysis in one easy-to-read illustration.
To build this scatterplot, I called ggplot() using the mpg dataset. In the aes() function, I mapped the displ variable (engine size) to the x-axis and hwy (highway fuel efficiency) to the y-axis. Similar to how I differentiated each bar in the bar chart by vehicle class, I mapped point color to vehicle class as well. To customize this scatterplot, I adjusted the size and transparency of each point. I then labelled the axes and titled the graph in a clear and concise manner.
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Engine Size vs Highway MPG by Vehicle Class",
    x = "Engine Displacement (Liters)",
    y = "Highway Fuel Efficiency",
    color = "Vehicle Class"
  ) +
  theme_minimal()
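One way to back up the negative trend with a number and a visual guide is to compute the correlation and overlay a fitted line. The sketch below assumes a straight-line fit (`method = "lm"`) for simplicity, even though the relationship may not be strictly linear; `displ_hwy_cor` and `trend_plot` are hypothetical names.

```r
library(ggplot2)

# Pearson correlation between engine displacement and highway mpg;
# a negative value is consistent with the downward trend in the plot
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
displ_hwy_cor

# Color is mapped inside geom_point() only, so geom_smooth() fits
# one overall line instead of a separate line per vehicle class
trend_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  labs(
    title = "Engine Size vs Highway MPG with Linear Trend",
    x = "Engine Displacement (Liters)",
    y = "Highway Fuel Efficiency",
    color = "Vehicle Class"
  ) +
  theme_minimal()

trend_plot
```

Keeping `color = class` in the top-level aes() instead would produce one trend line per class, which answers a different question about within-class relationships.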
This analysis ultimately showcases the flexibility and customizability of the ggplot2 package. Each of these visualizations can be created with a few lines of code and customized to highlight particular aspects of the data. Visualizations like these transform raw data into clear illustrations of patterns that can inform decision making: here, for car buyers, but on a larger scale they can also inform policy.