Data Visualization
Prerequisites
Before we continue, make sure you have all the software you need:
R: I assume a basic familiarity with R in this session. If you’d like to learn how to use R, read R for Data Science which is designed to get you up and running with R.
RStudio: RStudio is a free and open source integrated development environment (IDE) for R. You can write and run R code in any R environment, but RStudio has some nice features specifically for authoring and debugging your code. You can download RStudio Desktop from https://posit.co/download/rstudio-desktop
R packages: This book uses a bunch of R packages. You can install them all at once by running:
# remove(list = ls())
# Install everything used in this session (required only once):
# install.packages(c("ggplot2", "plotly", "skimr", "visdat", "stargazer",
#                    "tidyverse", "crosstalk", "DT", "gapminder",
#                    "patchwork", "esquisse"))
library("ggplot2") # static graphing
library("plotly") # dynamic graphing / interactive plots
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library("skimr") # EDA
library("visdat") # missing values
library("stargazer") # summary statistics table
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library("tidyverse") # data manipulation
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks plotly::filter(), stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("crosstalk") # For linking interactive widgets
library("DT") # For creating data tables
library("gapminder") # loads dataset
library("patchwork") # arranging charts
Introduction to ggplot2
R has several systems for making graphs, but ggplot2 is one of the most elegant and versatile R packages dedicated to data visualization. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
According to the ggplot2 concept, a plot can be divided into three fundamental parts:
Plot = Data + Aesthetics + Geometry.
Data is a data frame.
Aesthetics indicate the x and y variables. They can also control the color, size or shape of points, the height of bars, etc.
Geometry corresponds to the type of graphic (histogram, box plot, line plot, density plot, dot plot, ...).
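To make the decomposition concrete, here is a minimal sketch using the built-in mtcars data. The object name p_parts and the choice of wt and mpg are illustrative assumptions, not part of the session's own code.

```r
library(ggplot2)

# Data: the built-in mtcars data frame
# Aesthetics: wt on the x-axis, mpg on the y-axis
# Geometry: points
p_parts <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

p_parts

# layer_data() returns the values ggplot2 will actually draw for a layer
head(layer_data(p_parts, 1)[, c("x", "y")])
```

Swapping geom_point() for another geom changes the geometry while the data and aesthetics stay the same.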
Basic syntax and structure of ggplot2
Let's explore a series of videos that highlight key concepts related to data visualization using ggplot2. These videos will help you appreciate the principles behind the grammar of graphics and how ggplot2 can be leveraged to create powerful visual representations of data.
- Appreciating Grammar of Graphics (~3 mins) - This video introduces the foundational concept of the Grammar of Graphics, which is the theoretical framework behind ggplot2.
Grammar in a language refers to the set of rules and conventions that govern how words, phrases, and sentences are formed and structured. It provides guidelines on word order, verb tense, agreement between subjects and verbs, and the proper use of parts of speech such as nouns, adjectives, and adverbs. By following grammatical rules, speakers and writers can communicate their ideas more clearly and consistently, making it easier for others to understand the intended meaning.
A well-known pangram (a sentence that uses every letter of the English alphabet at least once).
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. Data, Aesthetics and Geometries are the three essential grammatical elements of any graphic.
The three essential grammatical elements for any chart.
Other grammatical elements include facets, statistics, coordinates and themes. Faceting can be used to produce separate plots for different subsets of the data, making it easier to compare groups or categories side by side.
A plot may also incorporate statistical transformations, such as smoothing or summarizing data, and specify a particular coordinate system (e.g., Cartesian, polar) that influences how the data is displayed.
Themes allow you to adjust non-data elements of the plot’s appearance, including background color, grid lines, and fonts.
The combination of these independent components (data, aesthetics, geometric objects, facets, statistics, coordinates, and themes) forms the core building blocks of a graphic.
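As a rough illustration of how these components stack, the sketch below adds one call per grammatical element to a single mtcars plot. The object name p_grammar and the particular choices (a linear fit, one facet per cyl value) are assumptions made for the example.

```r
library(ggplot2)

p_grammar <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # data + aesthetics
  geom_point() +                                                      # geometry
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +           # statistic: fitted line
  facet_wrap(~ cyl) +                                                 # facets: one panel per cyl value
  coord_cartesian() +                                                 # coordinates (the default system)
  theme_minimal()                                                     # theme: non-data appearance

p_grammar
```

Each line can be removed or replaced independently, which is what makes the components composable.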
- New York Times example (~6 mins) - The New York Times is renowned for its exemplary data visualizations. In this video, we analyze an example of how ggplot2 can be used to replicate the style and clarity of a New York Times chart.
The gg in `ggplot2` means Grammar of Graphics, a graphic concept which describes plots by using a “grammar”.
The three essential arguments (grammatical elements) for any chart.
Implementing a chart in the ggplot2 package with the ggplot command.
- ggplot2 nuanced example (~5 mins) - This video dives deeper into the nuances of ggplot2.
Data: Motor Trend Car Road Tests data (mtcars)
?datasets
library(help = "datasets")
?mtcars
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
vis_dat(mtcars)
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
stargazer(mtcars,
type = "text")
============================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------
mpg 32 20.091 6.027 10.400 33.900
cyl 32 6.188 1.786 4 8
disp 32 230.722 123.939 71.100 472.000
hp 32 146.688 68.563 52 335
drat 32 3.597 0.535 2.760 4.930
wt 32 3.217 0.978 1.513 5.424
qsec 32 17.849 1.787 14.500 22.900
vs 32 0.438 0.504 0 1
am 32 0.406 0.499 0 1
gear 32 3.688 0.738 3 5
carb 32 2.812 1.615 1 8
--------------------------------------------
Creating simple plots (scatter plots, bar charts, histograms)
Note: the ggplot2 package must be installed and loaded.
# install.packages("ggplot2") ## required only once
# library(ggplot2) ## required in every session
When using the ggplot2 package, we use the ggplot() function.
1. Scatter Plot
Scatter plots are useful for visualizing the relationship between two continuous variables. For example, we can plot mpg (miles per gallon) against hp (horsepower).
ggplot(data = mtcars,
mapping = aes(x = hp,
y = mpg
)
) +
geom_point()
- ggplot2 charts can be saved to an object and called upon later.
chart1 <-
ggplot(data = mtcars,
mapping = aes(x = hp,
y = mpg
)
) +
geom_point()
- ggplot2 charts can be exported as an image as well.
??ggsave
# Save the plot to a PNG file (create the images/ folder first if it does not exist)
ggsave(filename = "images/scatter_plot_mtcars.png",
plot = chart1,
width = 8,
height = 6,
dpi = 300
)
Exporting to Different Formats
You can specify different file extensions in the filename argument to save the plot in various formats like PNG, JPEG, PDF, SVG.
- Use a raster image format such as PNG for web graphics and icons. Raster images, unlike vector images, are composed of a grid of pixels, each assigned a specific color value, and do not use mathematical paths to represent shapes and lines. PNG uses lossless compression, which preserves detail, whereas JPEG compression is lossy. Vector formats such as PDF and SVG can be rescaled without losing quality.
Adjusting Size and Resolution
width: Width of the saved image in inches.
height: Height of the saved image in inches.
dpi: Resolution of the image in dots per inch (only for raster formats like PNG and JPEG).
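The sketch below shows how the file extension and the size arguments interact. It writes to a temporary directory, and the names chart_demo, png_file and pdf_file are assumptions made for this example only.

```r
library(ggplot2)

chart_demo <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Write into a temporary directory so the example leaves no files behind
out_dir  <- tempdir()
png_file <- file.path(out_dir, "scatter_demo.png")
pdf_file <- file.path(out_dir, "scatter_demo.pdf")

# Raster output: 8 in x 6 in at 300 dpi gives a 2400 x 1800 pixel image
ggsave(filename = png_file, plot = chart_demo, width = 8, height = 6, dpi = 300)

# Vector output: the format is inferred from the ".pdf" extension; dpi is not needed
ggsave(filename = pdf_file, plot = chart_demo, width = 8, height = 6)

file.exists(c(png_file, pdf_file))
```

Doubling dpi for the PNG doubles the pixel dimensions without changing the size in inches, whereas the PDF stays sharp at any zoom level.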
2. Histogram
Histograms are used to show the distribution of a single continuous variable by dividing it into bins. We can create a histogram of mpg to see its distribution.
# Histogram: Distribution of mpg
ggplot(data = mtcars,
mapping = aes(x = mpg)
) +
geom_histogram(binwidth = 2)
ggplot(data = mtcars,
mapping = aes(x = mpg)
) +
geom_histogram(binwidth = 2,
fill = "cyan",
color="black"
)
- Density plots are used to visualize the distribution of a continuous variable and estimate its probability density function.
# Density plot: Distribution of miles per gallon (mpg)
ggplot(data = mtcars,
mapping = aes(x = mpg)
) +
geom_density()
- Box plots show the distribution of a continuous variable and highlight the median, the quartiles (Q1 and Q3), and potential outliers (points lying more than 1.5 times the IQR below Q1 or above Q3).
# Box plot: Distribution of miles per gallon (mpg)
ggplot(data = mtcars,
mapping = aes(y = mpg)
) +
geom_boxplot()
summary(mtcars$mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
3. Bar Chart
Bar charts are used to show the frequency of categorical data. We can create a bar chart of the number of cars by the number of cylinders (cyl).
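No code accompanies this description, so here is a minimal sketch of such a bar chart. The object name bar_cyl, the fill color and the axis labels are assumptions chosen for illustration.

```r
library(ggplot2)

# Bar chart: number of cars for each cylinder count.
# factor() makes ggplot2 treat cyl as categorical rather than numeric.
bar_cyl <- ggplot(data = mtcars,
                  mapping = aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(x = "Number of Cylinders", y = "Number of Cars")

bar_cyl

# The same counts as a table, for comparison
table(mtcars$cyl)
```

geom_bar() counts the rows in each category itself, so no y aesthetic is needed.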
Customizing Plots: Titles/labels, themes, annotations, colors and legend
ggplot2 offers extensive options for customizing plots to make them more informative and visually appealing. Key aspects include adding titles, axis labels, annotations, and choosing suitable themes.
1. Adding Titles and Labels
Title: Provides a descriptive title for the plot.
X and Y Labels: Label the axes to indicate what data they represent.
chart1 # print the chart saved at the beginning of the document
# Scatter plot: mpg vs hp with a title and axis labels
ggplot(data = mtcars,
mapping = aes(x = hp,
y = mpg)
) +
geom_point() +
labs(title = "Scatter Plot: mpg vs hp",
x = "Horsepower (hp)",
y = "Miles per Gallon (mpg)"
)
2. Choosing Themes
- Themes: Control the overall appearance of the plot, such as background color, grid lines, and font sizes.
- Built-in Themes: ggplot2 provides several built-in themes such as theme_minimal(), theme_classic(), and theme_light(). Check out the ggplot2 themes.
- Custom Themes: You can even create your own themes or modify existing ones to better suit your presentation needs.
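One way to build a custom theme is to start from a built-in theme and override a few elements with theme(). The sketch below is only an illustration: the name my_theme and the specific overrides are assumptions, not settings used elsewhere in this session.

```r
library(ggplot2)

# Start from a built-in theme and override a few elements
my_theme <- theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank(),   # drop minor grid lines
    legend.position  = "bottom"
  )

themed_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp") +
  my_theme

themed_plot
```

Because my_theme is an ordinary object, it can be added to any other plot to keep a consistent look across charts.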
# Scatter plot: mpg vs hp with the classic theme
ggplot(data = mtcars,
mapping = aes(x = hp,
y = mpg)
) +
geom_point() +
labs(title = "Scatter Plot: mpg vs hp",
x = "Horsepower (hp)",
y = "Miles per Gallon (mpg)"
) +
theme_classic()
3. Adding Annotations (advanced)
- Annotations: Text or markers added to specific locations on the plot to highlight important points or add additional information.
chart1 # print the chart saved at the beginning of the document
# Scatter plot: mpg vs hp with a text annotation
ggplot(data = mtcars,
mapping = aes(x = hp,
y = mpg)
) +
geom_point() +
labs(title = "Scatter Plot: mpg vs hp",
x = "Horsepower (hp)",
y = "Miles per Gallon (mpg)") +
theme_minimal() +
annotate("text", x = 200, y = 30, label = "High HP & MPG", size = 4, vjust = 1)
4. Customizing colors
Lots of colors are predefined in R; run colors() to list their names.
- Points and Lines: You can change the color of points, lines, and other plot elements to enhance visibility or match a specific color scheme.
- Fills: Customize the fill color of areas such as bars or regions in density plots.
- Scales: When creating visualizations, you may want to customize the colors used for different data groups. scale_color_manual(), scale_fill_manual() and the other scale functions allow you to manually define the colors used in your plots, giving you full control over the appearance.
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point() +
scale_color_manual(values = c("firebrick", "slateblue", "green4"))
5. Customizing Legends
Legends in ggplot2 are automatically generated based on mappings in aes(), but you can customize their appearance, placement, and behavior for clarity and better communication.
The first plot will have a basic legend with a single color line labeled "savings". The second plot will use the timeseries glyph for the legend key, visually signaling that the plot represents time series data.
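The chunk that creates base is not shown in this session. A minimal sketch, assuming base is built from the economics data that ships with ggplot2, with the personal savings rate (psavert) plotted over date and a constant color label of "savings" (all three choices are assumptions):

```r
library(ggplot2)

# Assumption: `base` plots the personal savings rate over time.
# Mapping color to the constant "savings" creates a one-entry legend.
base <- ggplot(data = economics,
               mapping = aes(x = date, y = psavert, color = "savings"))

base + geom_line()                           # basic legend with a line key
base + geom_line(key_glyph = "timeseries")   # legend key drawn as a small time series
```

Any plot with a color mapping would serve equally well; the point is only that a legend exists to be customized.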
base + geom_line(key_glyph = "timeseries") # change glyph
base + geom_line(show.legend = FALSE) # hide legend
base + geom_line() +
labs(color = "Savings Rate") +
theme(
legend.position = "bottom", # Options: "top", "bottom", "left", "right", "none"
legend.key.size = unit(1, "cm"),
legend.text = element_text(size = 12),
legend.background = element_rect(fill = "lightgray", color = "black")
)
Types of graphs for Data Visualisation
“One-way charts” refer to visualizations that focus on a single variable at a time. These charts help you understand the distribution, frequency, or basic summary characteristics of that single variable. Examples include:
Bar Charts: Show counts or frequencies of categories in a single categorical variable. One variable discrete.
Histograms: Display the distribution of a single numeric variable by grouping data into bins. One variable continuous.
Box Plots (when used for one variable): Summarize a single numeric variable in terms of its median, quartiles, and potential outliers. One variable continuous.
Two-way charts involve plotting two variables simultaneously, allowing you to observe relationships or comparisons between them. Some common two-way charts include:
Scatter Plots: Plot one numeric variable on the x-axis and another on the y-axis, showing how the two variables relate to each other. Two variables - continuous and/or discrete.
Line Charts: Display how one numeric variable changes in relation to another (often time on the x-axis and a measurement on the y-axis). Two variables: Discrete X, Continuous Y
Grouped Bar Charts: Extend a bar chart to compare multiple categories across a second variable, providing two dimensions of comparison. Two discrete (categorical) variables.
Heatmaps: Use color intensity to represent a numeric value at the intersection of two categorical or continuous variables, effectively showing a relationship across two dimensions. Two variables: Continuous X, Continuous Y
You can also split a one-way chart by another variable to explore patterns.
# Box plot: Distribution of miles per gallon (mpg) by number of cylinders (cyl)
p3 <-
ggplot(data = mtcars,
mapping = aes(x = factor(cyl),
y = mpg)
) +
geom_boxplot( fill = "lightblue",
color = "black") +
labs(title = "Box Plot: Distribution of Miles per Gallon (mpg) by Number of Cylinders",
x = "Number of Cylinders",
y = "Miles per Gallon (mpg)"
) +
theme_minimal()
Arranging charts
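The patchwork code in this section refers to p2, which is not defined in the code shown in this session. A minimal sketch, assuming p2 is the cyan histogram of mpg from earlier (this is an assumption; any ggplot object would work), together with the two basic patchwork operators:

```r
library(ggplot2)
library(patchwork)

# Assumption: p2 is the histogram of mpg shown earlier in this session
p2 <- ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "cyan", color = "black") +
  labs(title = "Histogram: Distribution of mpg")

# patchwork operators: `|` places plots side by side, `/` stacks them
side_by_side <- p2 | p2
stacked      <- p2 / p2

stacked
```

The same plot is reused on both sides here only to keep the block self-contained; in practice the operands would be different charts.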
p2 / p3 # under each other
p2 + p3 + chart1 + plot_layout(ncol = 2) + plot_annotation(title = 'Put your title here',
caption = 'made with `patchwork`')
ggplot2 cheatsheet
Understanding this grammar will enable you to build complex and meaningful visualizations by combining different components in a structured way.
- Newer version of the ggplot2 cheatsheet (maintained by Posit).
Distributions and Correlations are common ggplot2 visualizations.
https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
ggplot2 summary
The “gg” in ggplot stands for grammar of graphics, which describes the fundamental features that underlie all statistical graphics. In brief, the grammar tells us that a graphic maps the data to the aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).
- With ggplot2, you can do more, faster, by learning one system and applying it in many places. It can greatly improve the quality and aesthetics of your graphics, and will make you much more efficient in creating charts.
- ggplot2 allows you to build almost any type of chart. The R graph gallery provides many ggplot2 examples.
Limitations
It’s also important to note what the grammar doesn’t do:
It doesn’t suggest which graphics to use. Many resources may focus on how to produce the plots you want, not on which plot to produce. For more advice on choosing or creating plots to answer the question you’re interested in, you may want to consult the cheatsheet.
It doesn’t describe interactive graphics, only static ones. There is essentially no difference between displaying ggplot2 graphs on a computer screen and printing them on a piece of paper.
plotly
The plotly package in R is an advanced tool for creating interactive and high-quality visualizations. It enhances the visual appeal and user interaction of your graphics, making the data exploration process more insightful. https://plotly.com/r/
The plotly package supports a wide range of charts and plots and works seamlessly with ggplot2 graphics through the ggplotly() function, converting static ggplot2 plots into interactive Plotly graphics.
Fuel Economy Dataset (ggplot2::mpg)
- We use the mpg dataset in the ggplot2 package.
- This dataset contains a subset of the fuel economy data that the EPA makes available on https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car.
It has 6 character variables and 5 numeric variables.
No missing values.
?ggplot2::mpg
df <- as.data.frame(ggplot2::mpg)
skim(df) # check for types of variables
| Name | df |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| manufacturer | 0 | 1 | 4 | 10 | 0 | 15 | 0 |
| model | 0 | 1 | 2 | 22 | 0 | 38 | 0 |
| trans | 0 | 1 | 8 | 10 | 0 | 10 | 0 |
| drv | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
| fl | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
| class | 0 | 1 | 3 | 10 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| displ | 0 | 1 | 3.47 | 1.29 | 1.6 | 2.4 | 3.3 | 4.6 | 7 | ▇▆▆▃▁ |
| year | 0 | 1 | 2003.50 | 4.51 | 1999.0 | 1999.0 | 2003.5 | 2008.0 | 2008 | ▇▁▁▁▇ |
| cyl | 0 | 1 | 5.89 | 1.61 | 4.0 | 4.0 | 6.0 | 8.0 | 8 | ▇▁▇▁▇ |
| cty | 0 | 1 | 16.86 | 4.26 | 9.0 | 14.0 | 17.0 | 19.0 | 35 | ▆▇▃▁▁ |
| hwy | 0 | 1 | 23.44 | 5.95 | 12.0 | 18.0 | 24.0 | 27.0 | 44 | ▅▅▇▁▁ |
vis_dat(df) # check for missing values
stargazer(df,
type="text")
============================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------
displ 234 3.472 1.292 1.600 7.000
year 234 2,003.500 4.510 1,999 2,008
cyl 234 5.889 1.612 4 8
cty 234 16.859 4.256 9 35
hwy 234 23.440 5.955 12 44
--------------------------------------------
ggplot(data = df,
mapping = aes(x = displ,
y = cty)
) +
geom_point()
# Create a basic ggplot with the mpg data (df)
p <- ggplot(data = df,
mapping = aes(x = displ,
y = cty
)
) +
geom_point(color = "blue") + # Scatter plot of cty vs displ
geom_smooth(method = "lm", se = FALSE, col = "red") + # Add a linear fit line
labs(
title = "City MPG vs. Engine Displacement of Cars",
x = "Displacement (in litres)",
y = "Miles per Gallon"
) +
theme_minimal() # Use a minimal theme for a cleaner look
# Convert the ggplot to an interactive plotly plot
ggplotly(p)
`geom_smooth()` using formula = 'y ~ x'
- Linking the plot to a data table can help you locate individual observations during exploratory data analysis (EDA).
# Make the 'mpg' dataset interactive by associating it with a 'key'
# This allows selections in the plot to be linked to specific rows in the data table
m <- highlight_key(mpg)
# Create a ggplot scatter plot of 'displ' vs 'cty' from the interactive 'm' data
p <- ggplot(data = m,
mapping = aes(x = displ, y = cty)) +
geom_point()
# Convert the static ggplot 'p' into an interactive plotly object
# 'highlight()' with "plotly_selected" enables selection of points in the plot
gg <- highlight(
p = ggplotly(p), # Convert ggplot to a plotly object
"plotly_selected" # Enable point selection highlighting
)
# Display the interactive plot and a data table side-by-side
# Selecting points in the plot will highlight corresponding rows in the data table
crosstalk::bscols(
gg, # The interactive plot
DT::datatable(m) # A data table of the same data
)
Setting the `off` event (i.e., 'plotly_deselect') to match the `on` event (i.e., 'plotly_selected'). You can change this default via the `highlight()` function.
The Gapminder Dataset
A dataset containing country-level data on life expectancy, GDP per capita, and population for various years, allowing you to study changes in global development over time.
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
p <- gapminder |> # loads gapminder dataset
filter(year==1977) |> # keeps rows where year is equal to 1977
ggplot( mapping = aes(gdpPercap, lifeExp,
size = pop,
color=continent
)
) +
geom_point() +
theme_bw()
ggplotly(p)
Machine Learning
- Interactive charts can be very useful in machine learning (for identifying clusters).
plot_ly(data = mpg,
x = ~cty, # Map the 'cty' column to the x-axis
y = ~hwy, # Map the 'hwy' column to the y-axis
z = ~cyl # Map the 'cyl' column to the z-axis
) %>%
add_markers(color = ~cyl)
htmlwidgets for R
- HTML widgets work just like R plots except they produce interactive web visualizations.
Cross-sectional data: https://www.htmlwidgets.org/showcase_rbokeh.html ; https://www.htmlwidgets.org/showcase_plotly.html
Networks: https://www.htmlwidgets.org/showcase_networkD3.html
3D scatterplot: https://www.htmlwidgets.org/showcase_threejs.html
Time Series Data https://www.htmlwidgets.org/showcase_dygraphs.html
esquisse
esquisse is an R package that provides an intuitive GUI for creating data visualizations with ggplot2. It lets you explore data in R through a Tableau-like drag-and-drop interface.
Appendix
ggplot
ggplot2: elegant graphics for data analysis. https://ggplot2-book.org/
“The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf.
Be Awesome in ggplot2: A Practical Guide to be Highly Effective - R software and data visualization. https://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization#geom_bin2d-add-heatmap-of-2d-bin-counts
Top 50 ggplot2 Visualizations - The Master List (With Full R Code). https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
Wilkinson, L. The Grammar of Graphics (Statistics and Computing). https://www.amazon.com/The-Grammar-Graphics-Statistics-Computing/dp/0387245448
plotly
- Interactive web-based data visualization with R, plotly, and shiny. https://plotly-r.com/
htmlwidgets
- html widgets https://www.htmlwidgets.org/showcase_plotly.html
The Work of Edward Tufte
SessionInfo
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.0 gapminder_1.0.0 DT_0.33 crosstalk_1.2.1
[5] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[13] tidyverse_2.0.0 stargazer_5.2.3 visdat_0.6.0 skimr_2.1.5
[17] plotly_4.10.4 ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] gtable_0.3.5 bslib_0.8.0 xfun_0.47 htmlwidgets_1.6.4
[5] lattice_0.22-6 tzdb_0.4.0 vctrs_0.6.5 tools_4.4.2
[9] generics_0.1.3 fansi_1.0.6 pkgconfig_2.0.3 Matrix_1.7-1
[13] data.table_1.16.0 lifecycle_1.0.4 compiler_4.4.2 farver_2.1.2
[17] textshaping_0.4.0 munsell_0.5.1 repr_1.1.7 httpuv_1.6.15
[21] sass_0.4.9 htmltools_0.5.8.1 yaml_2.3.10 lazyeval_0.2.2
[25] jquerylib_0.1.4 later_1.3.2 pillar_1.9.0 cachem_1.1.0
[29] mime_0.12 nlme_3.1-166 tidyselect_1.2.1 digest_0.6.37
[33] stringi_1.8.4 labeling_0.4.3 splines_4.4.2 fastmap_1.2.0
[37] grid_4.4.2 colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
[41] base64enc_0.1-3 utf8_1.2.4 withr_3.0.1 promises_1.3.0
[45] scales_1.3.0 timechange_0.3.0 rmarkdown_2.28 httr_1.4.7
[49] ragg_1.3.3 hms_1.1.3 shiny_1.9.1 evaluate_1.0.0
[53] knitr_1.48 viridisLite_0.4.2 mgcv_1.9-1 rlang_1.1.4
[57] Rcpp_1.0.13 xtable_1.8-4 glue_1.8.0 rstudioapi_0.16.0
[61] jsonlite_1.8.9 R6_2.5.1 systemfonts_1.1.0
Layers
One of the key ideas behind ggplot2 is that it allows you to easily iterate, building up a complex plot a layer at a time. Each layer can come from a different dataset and have a different aesthetic mapping, making it possible to create sophisticated plots that display data from multiple sources.
So far, whenever we’ve created a plot with ggplot(), we’ve immediately added on a layer with a geom function. But it’s important to realise that there really are two distinct steps. First, we create a plot with default dataset and aesthetic mappings:
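The code for this first step is not shown here. A minimal sketch, assuming the plot uses the mpg data with displ on the x-axis and hwy on the y-axis (both choices are assumptions made for illustration):

```r
library(ggplot2)

# Assumption: the plot uses mpg with displ on x and hwy on y;
# the original chunk is not shown, so these choices are illustrative.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Printing p draws the axes and background only, because no layer has been added yet
p
```

The object p holds the default dataset and mappings that every later layer inherits unless it overrides them.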
There’s nothing to see yet, so we need to add a layer:
p + geom_point()
geom_point() is a shortcut. Behind the scenes it calls the layer() function to create a new layer:
p + layer(
mapping = NULL,
data = NULL,
geom = "point",
stat = "identity",
position = "identity"
)
This call fully specifies the five components to the layer:
- mapping: A set of aesthetic mappings, specified using the aes() function and combined with the plot defaults. If NULL, uses the default mapping set in ggplot().
- data: A dataset which overrides the default plot dataset. It is usually omitted (set to NULL), in which case the layer will use the default data specified in ggplot().
- geom: The name of the geometric object to use to draw each observation.
Geoms can have additional arguments. All geoms take aesthetics as parameters. If you supply an aesthetic (e.g. colour) as a parameter, it will not be scaled, allowing you to control the appearance of the plot. You can pass params in ... (in which case stat and geom parameters are automatically teased apart), or in a list passed to geom_params (in current ggplot2 releases, the single params argument of layer()).
- stat: The name of the statistical transformation to use. A statistical transformation performs some useful statistical summary, and is key to histograms and smoothers. To keep the data as is, use the “identity” stat.
You only need to set one of stat and geom: every geom has a default stat, and every stat a default geom.
Most stats take additional parameters to specify the details of the statistical transformation. You can supply params either in ... (in which case stat and geom parameters are automatically teased apart), or in a list called stat_params (in current ggplot2 releases, the same params argument of layer()).
- position: The method used to adjust overlapping objects, like jittering, stacking or dodging.
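The pairing of default stats and geoms, and the effect of a position adjustment, can be sketched as follows. The object names and the use of mtcars are assumptions for illustration only.

```r
library(ggplot2)

base_cyl <- ggplot(mtcars, aes(x = factor(cyl)))

# geom_bar() uses stat = "count" by default ...
via_geom <- base_cyl + geom_bar()
# ... and stat_count() uses geom = "bar" by default
via_stat <- base_cyl + stat_count()

# Both layers compute the same per-category counts
layer_data(via_geom, 1)$count
layer_data(via_stat, 1)$count

# Position adjustment: jitter moves overlapping points apart horizontally
jittered <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

jittered
```

Setting height = 0 keeps the mpg values exact, so only the horizontal placement within each cylinder group is perturbed.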
It’s useful to understand the layer() function so you have a better mental model of the layer object. But you’ll rarely use the full layer() call because it’s so verbose. Instead, you’ll use the shortcut geom_ functions: geom_point(mapping, data, ...) is exactly equivalent to layer(mapping, data, geom = "point", ...).
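To check the equivalence informally, the sketch below builds the same layer both ways and compares the values each one will draw. The names p_base, p_short and p_long, and the use of mpg with displ and hwy, are assumptions for this example.

```r
library(ggplot2)

p_base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Shortcut form
p_short <- p_base + geom_point()

# Explicit form: NULL mapping and data fall back to the plot defaults
p_long <- p_base + layer(
  mapping  = NULL,
  data     = NULL,
  geom     = "point",
  stat     = "identity",
  position = "identity"
)

# Compare the x/y values each layer will draw
all.equal(layer_data(p_short, 1)[, c("x", "y")],
          layer_data(p_long, 1)[, c("x", "y")])
```

In everyday code the shortcut is preferable; the explicit form is mainly useful for seeing which pieces a geom_ function fills in.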