Data Visualization

Author

Arvind Sharma

Prerequisites

Before we continue, make sure you have all the software you need:

R: I assume a basic familiarity with R in this session. If you’d like to learn how to use R, read R for Data Science, which is designed to get you up and running with R.
RStudio: RStudio is a free and open source integrated development environment (IDE) for R. You can write and run R code in any R environment, but RStudio has some nice features specifically for authoring and debugging your code. You can download RStudio Desktop from https://posit.co/download/rstudio-desktop
R packages: This session uses several R packages. Uncomment and run the install.packages() lines below for any package you have not installed yet, then load the packages with library():

All non-base packages have been detached.

# remove(list = ls())

#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("skimr")
#install.packages("esquisse")

library("ggplot2")        # static graphing
library("plotly")         # dynamic graphing / interactive plots


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library("skimr")          # EDA
library("visdat")         # missing values
library("stargazer")      # summary statistics table


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

library("tidyverse")      # data manipulation

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks plotly::filter(), stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("crosstalk")        # For linking interactive widgets
library("DT")               # For creating data tables

library("gapminder")        # loads dataset
library("patchwork")        # arranging charts

Introduction to `ggplot2`

R has several systems for making graphs, but ggplot2 is one of the most elegant and versatile R packages dedicated to data visualization.
ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
In the ggplot2 framework, a plot can be divided into three fundamental parts:

Plot = Data + Aesthetics + Geometry.
- Data is a data frame.
- Aesthetics indicate the x and y variables. They can also control the color, size, or shape of points, the height of bars, and so on.
- Geometry corresponds to the type of graphic (histogram, box plot, line plot, density plot, dot plot, and so on).
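
As a minimal sketch of how the three parts map onto code, the example below builds one plot from the built-in mtcars data. The object name p_parts and the choice of wt, mpg, and cyl are illustrative assumptions, not part of the original session.

```r
library(ggplot2)

# Data:       the built-in mtcars data frame
# Aesthetics: weight on x, fuel economy on y, cylinder count as colour
# Geometry:   points (a scatter plot)
p_parts <- ggplot(data    = mtcars,
                  mapping = aes(x      = wt,
                                y      = mpg,
                                colour = factor(cyl))
                  ) +
  geom_point()

p_parts
```

Swapping geom_point() for another geometry changes the chart type while the data and aesthetic mappings stay the same.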

Basic syntax and structure of `ggplot2`

Let’s explore a series of videos that highlight key concepts related to data visualization using ggplot2. These videos will help you appreciate the principles behind the grammar of graphics and how ggplot2 can be leveraged to create powerful visual representations of data.

Appreciating Grammar of Graphics (~3 mins) - This video introduces the foundational concept of the Grammar of Graphics, which is the theoretical framework behind ggplot2.
- Grammar in a language refers to the set of rules and conventions that govern how words, phrases, and sentences are formed and structured. It provides guidelines on word order, verb tense, agreement between subjects and verbs, and the proper use of parts of speech such as nouns, adjectives, and adverbs. By following grammatical rules, speakers and writers can communicate their ideas more clearly and consistently, making it easier for others to understand the intended meaning.
  
  A well-known pangram (sentence that uses every letter of the English alphabet at least once).
- A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. Data, aesthetics, and geometries are the three essential grammatical elements of any graphic.
  
  The three Essential Grammatical Elements for any chart.
  
  Other grammatical elements include facets, statistics, coordinates and themes.
  - Faceting can be used to produce separate plots for different subsets of the data, making it easier to compare groups or categories side by side.
  - A plot may also incorporate statistical transformations, such as smoothing or summarizing data, and specify a particular coordinate system (e.g., Cartesian, polar) that influences how the data is displayed.
  - Themes allow you to adjust non-data elements of the plot’s appearance, including background color, grid lines, and fonts.
- Together, these independent components (data, aesthetics, geometric objects, facets, statistics, coordinates, and themes) form the core building blocks of a graphic.
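
The following sketch puts all seven grammatical elements into a single plot, one per line. The variable choices and the object name p_full are assumptions made for illustration.

```r
library(ggplot2)

p_full <- ggplot(data = mtcars,                       # data
                 mapping = aes(x = wt, y = mpg)) +    # aesthetics
  geom_point() +                                      # geometry
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                           # statistic (linear fit)
  facet_wrap(~ cyl) +                                 # facets: one panel per cylinder count
  coord_cartesian(ylim = c(10, 35)) +                 # coordinates
  theme_minimal()                                     # theme

p_full
```

Each line can be removed or replaced on its own, which is what makes the components independent building blocks.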
New York Times example (~6 mins) - The New York Times is renowned for its exemplary data visualizations. In this video, we analyze an example of how ggplot2 can be used to replicate the style and clarity of a New York Times chart.

The gg in `ggplot2` means Grammar of Graphics, a graphic concept which describes plots by using a “grammar”.

The three essential arguments (grammatical elements) for any chart.

Implementing a chart in ggplot2 package with ggplot command.
ggplot2 nuanced example (~5 mins) - This video dives deeper into the nuances of ggplot2.

Data: Motor Trend Car Road Tests data (`mtcars`)

?datasets
library(help = "datasets")

?mtcars
mtcars

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

vis_dat(mtcars)

glimpse(mtcars)

Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

stargazer(mtcars, 
          type = "text")


============================================
Statistic N   Mean   St. Dev.  Min     Max  
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900 
cyl       32  6.188   1.786     4       8   
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  
drat      32  3.597   0.535   2.760   4.930 
wt        32  3.217   0.978   1.513   5.424 
qsec      32 17.849   1.787   14.500 22.900 
vs        32  0.438   0.504     0       1   
am        32  0.406   0.499     0       1   
gear      32  3.688   0.738     3       5   
carb      32  2.812   1.615     1       8   
--------------------------------------------

Creating simple plots (scatter plots, bar charts, histograms)

Note that you must have installed and loaded the ggplot2 package.

# install.packages("ggplot2") ## required only once
# library(ggplot2)            ## required in every session

When using the ggplot2 package, every plot starts with the ggplot() function.

1. Scatter Plot

Scatter plots are useful for visualizing the relationship between two continuous variables. For example, we can plot mpg (miles per gallon) against hp (horsepower).

ggplot(data    = mtcars, 
       mapping = aes(x = hp,
                     y = mpg
                     )
       ) +
  geom_point()

A ggplot2 chart can be saved as an object and called upon later.

chart1 <-
ggplot(data    = mtcars, 
       mapping = aes(x = hp,
                     y = mpg
                     )
       ) +
  geom_point()

ggplot2 charts can be exported as an image as well.

??ggsave

# Save the plot to a PNG file
ggsave(filename = "images/scatter_plot_mtcars.png", 
       plot     = chart1, 
       width    = 8, 
       height   = 6,
       dpi      = 300
       )

Exporting to Different Formats

You can specify different file extensions in the filename argument to save the plot in various formats like PNG, JPEG, PDF, SVG.

Raster formats (such as PNG and JPEG) store the image as a grid of pixels, each assigned a specific color value. Vector formats (such as PDF and SVG) instead describe shapes and lines as mathematical paths, so they scale to any size without losing quality. PNG uses lossless compression, which makes it suitable for web graphics, icons, and images requiring high quality and detail.

Adjusting Size and Resolution

width: Width of the saved image in inches.
height: Height of the saved image in inches.
dpi: Resolution of the image in dots per inch (only for raster formats like PNG and JPEG).
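
As a hedged sketch of the points above, the example below saves the same chart once as a raster file and once as a vector file. It writes to a temporary directory, and the names p_export, png_file, and pdf_file are illustrative assumptions.

```r
library(ggplot2)

p_export <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# Write to a temporary directory so the example runs anywhere
out_dir  <- tempdir()
png_file <- file.path(out_dir, "scatter_plot_mtcars.png")
pdf_file <- file.path(out_dir, "scatter_plot_mtcars.pdf")

# Raster output: pixel size is width * dpi by height * dpi (here 2400 x 1800)
ggsave(filename = png_file, plot = p_export, width = 8, height = 6, dpi = 300)

# Vector output: the format is chosen from the file extension; dpi is not needed
ggsave(filename = pdf_file, plot = p_export, width = 8, height = 6)

file.exists(c(png_file, pdf_file))
```

Doubling dpi doubles the pixel dimensions of the PNG but leaves the PDF unchanged, since a vector file has no fixed pixel grid.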

2. Histogram

Histograms are used to show the distribution of a single continuous variable by dividing it into bins. We can create a histogram of mpg to see its distribution.

# Histogram: Distribution of mpg

ggplot(data    = mtcars, 
       mapping = aes(x = mpg)
       ) +
  geom_histogram(binwidth = 2)

ggplot(data    = mtcars, 
       mapping = aes(x = mpg)
       ) +
  geom_histogram(binwidth = 2, 
                     fill = "cyan", 
                     color="black"
                 )

Density plots are used to visualize the distribution of a continuous variable and estimate its probability density function.

# Density plot: Distribution of miles per gallon (mpg)

ggplot(data    = mtcars, 
       mapping = aes(x = mpg)
       ) +
  geom_density()

Box plots show the distribution of a continuous variable and highlight the median, the quartiles (Q1 and Q3), and potential outliers (points lying more than 1.5 times the IQR below Q1 or above Q3).

# Box plot: Distribution of miles per gallon (mpg)

ggplot(data    = mtcars, 
       mapping = aes(y = mpg)
       ) +
  geom_boxplot()

summary(mtcars$mpg)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90
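
To make the outlier rule concrete, here is a small worked sketch that computes the fences by hand with quantile(). The variable names are illustrative; base R's boxplot() uses a slightly different quartile definition (hinges), so its result may differ a little.

```r
# Quartiles and interquartile range of mpg
q1  <- quantile(mtcars$mpg, 0.25)
q3  <- quantile(mtcars$mpg, 0.75)
iqr <- q3 - q1                     # same value as IQR(mtcars$mpg)

# Fences: 1.5 times the IQR below Q1 and above Q3
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are drawn as individual points
outliers <- mtcars$mpg[mtcars$mpg < lower_fence | mtcars$mpg > upper_fence]
outliers
```

With these data the only value beyond the fences is 33.9 (the Toyota Corolla), which sits just above the upper fence.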

3. Bar Chart

Bar charts are used to show the frequency of categorical data. We can create a bar chart of the number of cars by the number of cylinders (cyl).

# Bar chart: Number of cars by number of cylinders

ggplot(data    = mtcars, 
       mapping = aes(x = factor(cyl) 
                     )
       ) +
  geom_bar()

Customizing Plots: Titles/labels, themes, annotations, colors and legend

ggplot2 offers extensive options for customizing plots to make them more informative and visually appealing. Key aspects include adding titles, axis labels, annotations, and choosing suitable themes.

1. Adding Titles and Labels

Title: Provides a descriptive title for the plot.
X and Y Labels: Label the axes to indicate what data they represent.

chart1 #print the saved chart at beginning of the document

# Scatter plot: mpg vs hp with a title and axis labels

ggplot(data = mtcars, 
       mapping = aes(x = hp, 
                     y = mpg)
       ) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp",
           x = "Horsepower (hp)",
           y = "Miles per Gallon (mpg)"
       )

2. Choosing Themes

Themes: Control the overall appearance of the plot, such as background color, grid lines, and font sizes.
Built-in Themes: ggplot2 provides several built-in themes such as theme_minimal(), theme_classic(), and theme_light().
- Check out the ggplot2 themes.
Custom Themes: You can even create your own themes or modify existing ones to better suit your presentation needs.

# Scatter plot: mpg vs hp with a built-in theme

ggplot(data = mtcars, 
       mapping = aes(x = hp, 
                     y = mpg)
       ) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp",
           x = "Horsepower (hp)",
           y = "Miles per Gallon (mpg)"
       ) +
  theme_classic()
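
As a sketch of a custom theme, the example below starts from a built-in theme, overrides a few elements with theme(), and stores the result for reuse. The name my_theme and the specific settings are assumptions chosen for illustration.

```r
library(ggplot2)

# A reusable theme: start from theme_minimal() and override a few elements
my_theme <- theme_minimal(base_size = 14) +
  theme(
    plot.title       = element_text(face = "bold"),  # bold plot title
    panel.grid.minor = element_blank(),              # drop minor grid lines
    legend.position  = "bottom"                      # legend under the plot
  )

p_themed <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp") +
  my_theme

p_themed
```

Because the theme is an ordinary object, it can be added to any other chart in the same way.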

3. Adding Annotations (advanced)

Annotations: Text or markers added to specific locations on the plot to highlight important points or add additional information.

chart1 # print the saved chart at beginning of the document

# Scatter plot: mpg vs hp with general annotation
ggplot(data = mtcars, 
       mapping = aes(x = hp, 
                     y = mpg)
       ) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp",
           x = "Horsepower (hp)",
           y = "Miles per Gallon (mpg)") +
  theme_minimal() + 
  annotate("text", x = 200, y = 30, label = "High HP & MPG", size = 4, vjust = 1)

4. Customizing colors

Many colors are predefined in R (run colors() to list their names).

Points and Lines: You can change the color of points, lines, and other plot elements to enhance visibility or match a specific color scheme.
Fills: Customize the fill color of areas such as bars or regions in density plots.
Scales: When creating visualizations, you may want to customize the colors used for different data groups. scale_color_manual(), scale_fill_manual(), and the other scale functions allow you to manually define the colors used in your plots, giving you full control over their appearance.

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(values = c("firebrick", "slateblue", "green4"))
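
A variation worth sketching: passing a named vector to values ties each color to a specific group rather than relying on the order of the factor levels. The names cyl_colors and p_col are illustrative assumptions.

```r
library(ggplot2)

# Names must match the levels of factor(cyl)
cyl_colors <- c("4" = "firebrick", "6" = "slateblue", "8" = "green4")

p_col <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(values = cyl_colors, name = "Cylinders")

# layer_data() returns the computed aesthetics for each point;
# the first row of mtcars (Mazda RX4) has 6 cylinders
head(layer_data(p_col)$colour)
```

The name argument also sets the legend title, which avoids the default "factor(cyl)" label.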

Formally, each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the inverse function, known as the guide: it allows you to convert visual properties back to data. You might find it surprising that axes and legends are the same type of thing, but while they look very different, they have the same purpose: to allow you to read observations from the plot and map them back to their original values. The commonalities between the two are illustrated above.

5. Customizing Legends

Legends in ggplot2 are automatically generated based on mappings in aes(), but you can customize their appearance, placement, and behavior for clarity and better communication.

The first plot will have a basic legend with a single color line labeled "savings".
The second plot will use the timeseries glyph for the legend key, visually signaling that the plot represents time series data.

base <- ggplot(economics, aes(date, psavert, color = "savings"))

base + geom_line()

base + geom_line(key_glyph = "timeseries")  # change glyph

base + geom_line(show.legend = FALSE)       # hide legend

base + geom_line() +
  labs(color = "Savings Rate") +
  theme(
    legend.position = "bottom",       # Options: "top", "bottom", "left", "right", "none"
    legend.key.size = unit(1, "cm"),
    legend.text = element_text(size = 12),
    legend.background = element_rect(fill = "lightgray", color = "black")
  )

Types of graphs for Data Visualisation

“One-way charts” refer to visualizations that focus on a single variable at a time. These charts help you understand the distribution, frequency, or basic summary characteristics of that single variable. Examples include:

Bar Charts: Show counts or frequencies of categories in a single categorical variable. One discrete variable.
Histograms: Display the distribution of a single numeric variable by grouping data into bins. One continuous variable.
Box Plots (when used for one variable): Summarize a single numeric variable in terms of its median, quartiles, and potential outliers. One continuous variable.

# Bar chart: Number of cars by number of cylinders

p2 <-
ggplot(data    = mtcars, 
       mapping = aes(x = factor(cyl) )
       ) +
  geom_bar(fill = "lightblue") +
  labs(title = "Bar Chart: Number of Cars by Cylinders",
           x = "Number of Cylinders",
           y = "Count") +
  theme_minimal()

Two-way charts involve plotting two variables simultaneously, allowing you to observe relationships or comparisons between them. Some common two-way charts include:

Scatter Plots: Plot one numeric variable on the x-axis and another on the y-axis, showing how the two variables relate to each other. Two variables - continuous and/or discrete.
Line Charts: Display how one numeric variable changes in relation to another (often time on the x-axis and a measurement on the y-axis). Two variables: Discrete X, Continuous Y
Grouped Bar Charts: Extend a bar chart to compare multiple categories across a second variable, providing two dimensions of comparison. Two discrete (categorical) variables.
Heatmaps: Use color intensity to represent a numeric value at the intersection of two categorical or continuous variables, effectively showing a relationship across two dimensions. Two variables: continuous X, continuous Y.

You can also split a one-way chart by a second variable to explore patterns across groups.

# Box plot: Distribution of miles per gallon (mpg) by number of cylinders (cyl)

p3 <- 
ggplot(data    = mtcars, 
       mapping = aes(x = factor(cyl),
                     y = mpg)
       ) +
  geom_boxplot( fill = "lightblue", 
               color = "black") +
  labs(title = "Box Plot: Distribution of Miles per Gallon (mpg) by Number of Cylinders",
           x = "Number of Cylinders",
           y = "Miles per Gallon (mpg)"
       ) +
  theme_minimal()
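
Another way to split a one-way chart, sketched here under the assumption that faceting suits the question, is to draw one histogram panel per group. The object name p_facet is illustrative.

```r
library(ggplot2)

# Histogram of mpg, drawn separately for each number of cylinders
p_facet <- ggplot(data    = mtcars,
                  mapping = aes(x = mpg)
                  ) +
  geom_histogram(binwidth = 2,
                 fill     = "lightblue",
                 color    = "black") +
  facet_wrap(~ cyl, ncol = 1) +        # one panel per value of cyl, stacked vertically
  labs(x = "Miles per Gallon (mpg)",
       y = "Count") +
  theme_minimal()

p_facet
```

Stacking the panels in one column keeps the x-axis aligned, which makes the shift in mpg between cylinder groups easier to see.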

Arranging charts

library(patchwork)
?patchwork

p2 + p3  # side by side

p2 / p3  # under each other

p2 + p3 + chart1 +
  plot_layout(ncol = 2) +
  plot_annotation(title   = 'Put your title here',
                  caption = 'made with `patchwork`')
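
The operators can also be nested. The sketch below is self-contained, so it builds three small charts of its own (pa, pb, and pc are illustrative names) and places one wide plot above two side-by-side plots.

```r
library(ggplot2)
library(patchwork)

pa <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
pb <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
pc <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2)

# One wide plot on top, two plots side by side underneath
combined <- pa / (pb | pc) +
  plot_layout(heights = c(2, 1)) +      # top row twice as tall as the bottom row
  plot_annotation(tag_levels = "A")     # label the panels A, B, C

combined
```

The parentheses matter: they group pb and pc into one row before that row is stacked under pa.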

`ggplot2` cheatsheet

Understanding this grammar will enable you to build complex and meaningful visualizations by combining different components in a structured way.

Newer version (maintained by Posit)

Explore types of charts based on variable description.

Be Awesome in ggplot2: A Practical Guide to be Highly Effective - R software and data visualization.

Explore the 8 popular types of visualizations with ggplot2.

Distributions and Correlations are common ggplot2 visualizations.

https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

`ggplot2` summary

The “gg” in ggplot stands for grammar of graphics, which describes the fundamental features that underlie all statistical graphics. In brief, the grammar tells us that a graphic maps the data to the aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).

With ggplot2, you can do more, faster, by learning one system and applying it in many places. It can greatly improve the quality and aesthetics of your graphics, and will make you much more efficient in creating charts.

Explore `ggplot2` charts at R Graph Gallery.

ggplot2 lets you build almost any type of chart. The R Graph Gallery provides many ggplot2 examples.

Limitations

It’s also important to note what the grammar doesn’t do:

It doesn’t suggest which graphics to use. Many resources may focus on how to produce the plots you want, not on which plot to produce. For more advice on choosing or creating plots to answer the question you’re interested in, you may want to consult the cheatsheet.
It doesn’t describe interactive graphics, only static ones. There is essentially no difference between displaying ggplot2 graphs on a computer screen and printing them on a piece of paper.

`plotly`

The plotly package in R is an advanced tool for creating interactive and high-quality visualizations. It enhances the visual appeal and user interaction of your graphics, making the data exploration process more insightful. https://plotly.com/r/

The plotly package supports a wide range of charts and plots and works seamlessly with ggplot2 graphics through the ggplotly() function, converting static ggplot2 plots into interactive Plotly graphics.

Fuel Economy Dataset (`ggplot2::mpg`)

We use the mpg dataset from the ggplot2 package.
- This dataset contains a subset of the fuel economy data that the EPA makes available on https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car.
It has 6 character variables and 5 numeric variables.
It has no missing values.

?ggplot2::mpg

df <- as.data.frame(ggplot2::mpg)
skim(df)              # check for types of variables

Data summary
Name	df
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
character	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
manufacturer	1	4	10	15
model	1	2	22	38
trans	1	8	10	10
drv	1	1	1	3
fl	1	1	1	5
class	1	3	10	7

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
displ	1	3.47	1.29	1.6	2.4	3.3	4.6	7	▇▆▆▃▁
year	1	2003.50	4.51	1999.0	1999.0	2003.5	2008.0	2008	▇▁▁▁▇
cyl	1	5.89	1.61	4.0	4.0	6.0	8.0	8	▇▁▇▁▇
cty	1	16.86	4.26	9.0	14.0	17.0	19.0	35	▆▇▃▁▁
hwy	1	23.44	5.95	12.0	18.0	24.0	27.0	44	▅▅▇▁▁

vis_dat(df)           # check for missing values

stargazer(df, 
          type="text")


============================================
Statistic  N    Mean    St. Dev.  Min   Max 
--------------------------------------------
displ     234   3.472    1.292   1.600 7.000
year      234 2,003.500  4.510   1,999 2,008
cyl       234   5.889    1.612     4     8  
cty       234  16.859    4.256     9    35  
hwy       234  23.440    5.955    12    44  
--------------------------------------------

ggplot(data    = df, 
       mapping = aes(x = displ,
                     y = cty)
       ) +
  geom_point()

# Create a ggplot from the fuel economy data (df)
p <- ggplot(data = df, 
            mapping = aes(x = displ, 
                          y = cty
                          )
            ) +
  geom_point(color = "blue") +                          # Scatter plot of displ vs cty
  geom_smooth(method = "lm", se = FALSE, col = "red") + # Add a linear fit line
  labs(
    title = "City MPG vs. Engine Displacement of Cars",
        x = "Displacement (in litres)",
        y = "City Miles per Gallon"
  ) +
  theme_minimal()  # Use a minimal theme for a cleaner look

# Convert the ggplot to an interactive plotly plot
ggplotly(p)

`geom_smooth()` using formula = 'y ~ x'

Linking the plot to a data table can help you locate individual observations during exploratory data analysis (EDA).

# Make the 'mpg' dataset interactive by associating it with a 'key'
# This allows selections in the plot to be linked to specific rows in the data table
m <- highlight_key(mpg)

# Create a ggplot scatter plot of 'displ' vs 'cty' from the interactive 'm' data
p <- ggplot(data = m,
            mapping = aes(x = displ, y = cty)) + 
  geom_point()


# Convert the static ggplot 'p' into an interactive plotly object
# 'highlight()' with "plotly_selected" enables selection of points in the plot
gg <- highlight(
  p  = ggplotly(p),         # Convert ggplot to a plotly object
  on = "plotly_selected"    # Highlight points chosen with box or lasso selection
)

# Display the interactive plot and a data table side-by-side
# Selecting points in the plot will highlight corresponding rows in the data table
crosstalk::bscols(
  gg,                   # The interactive plot
  DT::datatable(m)       # A data table of the same data
)

Setting the `off` event (i.e., 'plotly_deselect') to match the `on` event (i.e., 'plotly_selected'). You can change this default via the `highlight()` function.

The Gapminder Dataset

A dataset containing country-level data on life expectancy, GDP per capita, and population for various years, allowing you to study changes in global development over time.

gapminder

# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

p <- gapminder |>                                   # loads gapminder dataset     
    filter(year==1977) |>                           # keeps rows where year is equal to 1977
    ggplot( mapping = aes(gdpPercap, lifeExp, 
                          size = pop, 
                          color=continent
                          )
            ) +
    geom_point() +
    theme_bw()

ggplotly(p)
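
One hedged extension of the chart above: mapping an extra text aesthetic and passing tooltip to ggplotly() lets the hover label show the country name. The names gap_1977, p_tip, and w are illustrative, and subset() is used here only to keep the example independent of which filter() is currently masked.

```r
library(ggplot2)
library(plotly)
library(gapminder)

# Keep rows where year is equal to 1977
gap_1977 <- subset(gapminder, year == 1977)

p_tip <- ggplot(data    = gap_1977,
                mapping = aes(x     = gdpPercap,
                              y     = lifeExp,
                              size  = pop,
                              color = continent,
                              text  = country)   # extra aesthetic, used only for the hover label
                ) +
  geom_point() +
  theme_bw()

# Show country, GDP per capita and life expectancy when hovering over a point
w <- ggplotly(p_tip, tooltip = c("text", "x", "y"))
w
```

Hovering over a bubble then identifies the country directly, which is hard to do on a static version of the same chart.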

Machine Learning

Interactive charts can be very useful in machine learning (for identifying clusters).

plot_ly(data = mpg, 
        x = ~cty,     # Map the 'cty' column to the x-axis
        y = ~hwy,     # Map the 'hwy' column to the y-axis
        z = ~cyl      # Map the 'cyl' column to the z-axis
        ) %>%
  add_markers(color = ~cyl)

`htmlwidgets` for R

HTML widgets work just like R plots except they produce interactive web visualizations.

Maps: https://www.htmlwidgets.org/showcase_leaflet.html
Cross-sectional data: https://www.htmlwidgets.org/showcase_rbokeh.html ; https://www.htmlwidgets.org/showcase_plotly.html
Networks: https://www.htmlwidgets.org/showcase_networkD3.html
3D scatterplot: https://www.htmlwidgets.org/showcase_threejs.html
Time series data: https://www.htmlwidgets.org/showcase_dygraphs.html

`esquisse`

esquisse is an R package that provides an intuitive GUI for creating data visualizations with ggplot2. It lets you explore data in R through a Tableau-like drag-and-drop interface.

library(esquisse)
esquisser(mtcars)

https://dreamrs.github.io/esquisse/

Appendix

`ggplot`

ggplot2: elegant graphics for data analysis. https://ggplot2-book.org/
https://r-graph-gallery.com/ggplot2-package.html
https://r4ds.had.co.nz/data-visualisation.html
“The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf.
Be Awesome in ggplot2: A Practical Guide to be Highly Effective - R software and data visualization. https://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization#geom_bin2d-add-heatmap-of-2d-bin-counts
Top 50 ggplot2 Visualizations - The Master List (With Full R Code). https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
https://www.amazon.com/The-Grammar-Graphics-Statistics-Computing/dp/0387245448
https://clauswilke.com/dataviz/index.html

`plotly`

Interactive web-based data visualization with R, plotly, and shiny. https://plotly-r.com/

`htmlwidgets`

html widgets https://www.htmlwidgets.org/showcase_plotly.html

The Work of Edward Tufte

https://www.edwardtufte.com/books/

`SessionInfo`

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.3.0 gapminder_1.0.0 DT_0.33         crosstalk_1.2.1
 [5] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [9] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
[13] tidyverse_2.0.0 stargazer_5.2.3 visdat_0.6.0    skimr_2.1.5    
[17] plotly_4.10.4   ggplot2_3.5.1  

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      bslib_0.8.0       xfun_0.47         htmlwidgets_1.6.4
 [5] lattice_0.22-6    tzdb_0.4.0        vctrs_0.6.5       tools_4.4.2      
 [9] generics_0.1.3    fansi_1.0.6       pkgconfig_2.0.3   Matrix_1.7-1     
[13] data.table_1.16.0 lifecycle_1.0.4   compiler_4.4.2    farver_2.1.2     
[17] textshaping_0.4.0 munsell_0.5.1     repr_1.1.7        httpuv_1.6.15    
[21] sass_0.4.9        htmltools_0.5.8.1 yaml_2.3.10       lazyeval_0.2.2   
[25] jquerylib_0.1.4   later_1.3.2       pillar_1.9.0      cachem_1.1.0     
[29] mime_0.12         nlme_3.1-166      tidyselect_1.2.1  digest_0.6.37    
[33] stringi_1.8.4     labeling_0.4.3    splines_4.4.2     fastmap_1.2.0    
[37] grid_4.4.2        colorspace_2.1-1  cli_3.6.3         magrittr_2.0.3   
[41] base64enc_0.1-3   utf8_1.2.4        withr_3.0.1       promises_1.3.0   
[45] scales_1.3.0      timechange_0.3.0  rmarkdown_2.28    httr_1.4.7       
[49] ragg_1.3.3        hms_1.1.3         shiny_1.9.1       evaluate_1.0.0   
[53] knitr_1.48        viridisLite_0.4.2 mgcv_1.9-1        rlang_1.1.4      
[57] Rcpp_1.0.13       xtable_1.8-4      glue_1.8.0        rstudioapi_0.16.0
[61] jsonlite_1.8.9    R6_2.5.1          systemfonts_1.1.0

Layers

One of the key ideas behind ggplot2 is that it allows you to easily iterate, building up a complex plot a layer at a time. Each layer can come from a different dataset and have a different aesthetic mapping, making it possible to create sophisticated plots that display data from multiple sources.

So far, whenever we’ve created a plot with ggplot(), we’ve immediately added on a layer with a geom function. But it’s important to realise that there really are two distinct steps. First, we create a plot with default dataset and aesthetic mappings:

p <- ggplot(data = mtcars, mapping = aes(x = hp, y = mpg))
p

There’s nothing to see yet, so we need to add a layer:

p + geom_point()

geom_point() is a shortcut. Behind the scenes it calls the layer() function to create a new layer:

p + layer(
  mapping = NULL, 
  data = NULL,
  geom = "point", 
  stat = "identity",
  position = "identity"
)

This call fully specifies the five components to the layer:

mapping: A set of aesthetic mappings, specified using the aes() function and combined with the plot defaults. If NULL, uses the default mapping set in ggplot().
data: A dataset which overrides the default plot dataset. It is usually omitted (set to NULL), in which case the layer will use the default data specified in ggplot().
geom: The name of the geometric object to use to draw each observation.

Geoms can have additional arguments. All geoms take aesthetics as parameters. If you supply an aesthetic (e.g. colour) as a parameter, it will not be scaled, allowing you to control the appearance of the plot. With the geom_ shortcut functions you can pass parameters in ... (in which case stat and geom parameters are automatically teased apart); with layer() itself, pass them as a list to the params argument.
stat: The name of the statistical transformation to use. A statistical transformation performs some useful statistical summary, and is key to histograms and smoothers. To keep the data as is, use the “identity” stat.

You only need to set one of stat and geom: every geom has a default stat, and every stat a default geom.

Most stats take additional parameters to specify the details of the statistical transformation. They are supplied in the same way as geom parameters: in ... for the shortcut functions, or in the params list for layer().
position: The method used to adjust overlapping objects, like jittering, stacking or dodging.

It’s useful to understand the layer() function so you have a better mental model of the layer object. But you’ll rarely use the full layer() call because it’s so verbose. Instead, you’ll use the shortcut geom_ functions: geom_point(mapping, data, ...) is exactly equivalent to layer(mapping, data, geom = "point", ...).
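
To illustrate the point that each layer can come from a different dataset, here is a minimal sketch that overlays group means on the raw points. The summary table cyl_means and the object name p_layers are assumptions introduced for this example.

```r
library(ggplot2)

# Second dataset: mean hp and mean mpg for each number of cylinders
cyl_means <- aggregate(cbind(hp, mpg) ~ cyl, data = mtcars, FUN = mean)

p_layers <- ggplot(data = mtcars, mapping = aes(x = hp, y = mpg)) +
  geom_point(colour = "grey50") +              # layer 1: plot default data (mtcars)
  geom_point(data   = cyl_means,               # layer 2: overrides the default data
             colour = "red",
             size   = 4)

p_layers
```

The second layer inherits the x and y mappings from ggplot() and only swaps the data, so the large red points mark the average car in each cylinder group.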

https://ggplot2-book.org/layers
