Data Viz for Stats Students

Author

Lydia Gibson

For this project, I’ll be comparing the data visualization capabilities of Base R graphics, ggplot2, ggpubr, easystats, and ggstatsplot. The inspiration for this project came from seeing the R Graph Gallery and the ggplot2 extensions gallery.

#install.packages("pacman")
library(pacman)
p_load(tidyverse, ggstatsplot, ggpubr, easystats, vioplot)

For this project, I’ll be using the mpg dataset from the ggplot2 package. This dataset is used extensively in the book ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen, which also served as an inspiration for me to further explore data viz.

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Distributions

Histograms

Base R histograms

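By default, the Base R hist function returns a frequency (count) histogram. You can make a density histogram instead by setting the argument freq to FALSE. Here I’ve changed the number of bins using the argument breaks; note that hist treats a single number passed to breaks as a suggestion, so the number of bins drawn may differ.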

hist(mpg$hwy, breaks = 30)
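As a small sketch of the freq argument, the block below draws the same histogram on the count scale and on the density scale. The mpg data is pulled from ggplot2 explicitly so the block runs on its own, and the object name h is just an illustrative choice.

```r
# Load the data explicitly so this block is self-contained
mpg <- ggplot2::mpg

# Default (freq = TRUE): bar heights are observation counts
hist(mpg$hwy, breaks = 30)

# freq = FALSE: bar heights are densities, so the bar areas sum to 1
h <- hist(mpg$hwy, breaks = 30, freq = FALSE)

# hist() invisibly returns its bins; the total bar area can be checked directly
sum(h$density * diff(h$breaks))
```

The last line should return 1 (up to floating-point error), which is what distinguishes a density histogram from a count histogram.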

ggplot2 histograms

The default number of bins for histograms made in ggplot2 is 30, but that can be changed with the bins or binwidth argument. By default, ggplot2 histograms show observation counts; to show densities instead, map y to after_stat(density).

ggplot(mpg, aes(hwy)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
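The following sketch sets the bin width explicitly, which avoids the message above, and also draws the histogram on the density scale. The bin width of 2 and the object names are arbitrary choices for illustration, and after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Set the bin width directly instead of relying on the default of 30 bins
p_counts <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

# Map y to the computed density so that the bar areas sum to 1
p_density <- ggplot(mpg, aes(hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2)

p_counts
p_density
```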

ggstatsplot Histograms

The ggstatsplot::gghistostats function allows you to see both the count and frequency (proportion) of observations through dual axes on your histograms. By default, gghistostats also plots central tendency measures and provides statistical test results as subtitles. Setting the results.subtitle and centrality.plotting arguments equal to FALSE removes them.

ggstatsplot::gghistostats(mpg, hwy, 
                          results.subtitle = FALSE, 
                          centrality.plotting = FALSE)
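For comparison, here is a minimal sketch that leaves both defaults switched on, so the plot keeps its centrality line and the statistical subtitle. The binwidth value and the name p_default are my own choices for illustration.

```r
library(ggplot2)      # provides the mpg data
library(ggstatsplot)

# Defaults left on: a centrality line plus test results in the subtitle
p_default <- gghistostats(mpg, hwy, binwidth = 2)
p_default
```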

ggpubr Histograms

Unlike ggstatsplot, ggpubr::gghistogram requires you to use quotation marks around the name of your x variable. You can add a density curve to your histogram using the argument add_density = TRUE. You can also include a line of central tendency using the add argument, set to either "mean" or "median".

ggpubr::gghistogram(data = mpg, x = "hwy")
Warning: Using `bins = 30` by default. Pick better value with the argument
`bins`.
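A short sketch of the add and add_density arguments mentioned above. Setting bins explicitly should also avoid the default-bins warning; the value 30 and the name p_hist are illustrative choices only.

```r
library(ggplot2)  # provides the mpg data
library(ggpubr)

# Line at the mean plus an overlaid density curve;
# bins is set explicitly rather than left at its default
p_hist <- gghistogram(mpg, x = "hwy", bins = 30,
                      add = "mean", add_density = TRUE)
p_hist
```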

Density Plots

Base R density plots

The plot function ….

plot(density(mpg$hwy))

ggplot2 density plots

ggplot(mpg, aes(hwy)) + 
  geom_density()

ggpubr density plots

ggpubr::ggdensity(mpg, x = "hwy")

Summary Statistics

Boxplots

Base R boxplots

boxplot(hwy ~ drv, data = mpg)

ggplot2 boxplots

ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot()

ggstatsplot boxplots

ggstatsplot::ggbetweenstats(mpg, x = drv, y = hwy, 
                            plot.type = "box", 
                            results.subtitle = FALSE, 
                            pairwise.comparisons = FALSE, 
                            centrality.plotting = FALSE)

ggpubr boxplots

ggpubr::ggboxplot(data = mpg, x = "drv", y = "hwy")

Violin Plots

Base R violin plots

Using the vioplot package, we are able to create violin plots in base R. Using the horizontal logical argument, you can choose to have either horizontal or vertical plots. You can choose whether to have a one- or two-sided violin by setting the side argument to “left”, “right”, or the default “both”. The plotCentre argument allows you to see the median value as either a point (default) or a line by setting it equal to “points” or “line” respectively.

vioplot::vioplot(hwy ~ drv, data = mpg)
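The arguments described above can be combined as in the sketch below, which assumes an installed vioplot version that supports side and plotCentre. The particular combination of values is only an example.

```r
library(vioplot)

# Load the data explicitly so this block is self-contained
mpg <- ggplot2::mpg

# Horizontal, one-sided violins with the median drawn as a line
vioplot(hwy ~ drv, data = mpg,
        horizontal = TRUE,
        side = "right",
        plotCentre = "line")
```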

ggplot2 violin plots

ggplot(mpg, aes(drv, hwy)) + 
  geom_violin()

ggstatsplot violin plots

ggstatsplot::ggbetweenstats(mpg, x = drv, y = hwy, 
                            plot.type = "violin", 
                            results.subtitle = FALSE, 
                            pairwise.comparisons = FALSE, 
                            centrality.plotting = FALSE)

ggpubr violin plots

ggpubr::ggviolin(data = mpg, x = "drv", y = "hwy")

easystats violin plots

ggplot(mpg, aes(drv, hwy)) + 
  see::geom_violindot()

ggplot(mpg, aes(drv, hwy)) + 
  see::geom_violinhalf()

Jitterplots

ggplot2 jitterplots

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter()

easystats jitterplots

ggplot(mpg, aes(drv, hwy)) + 
  see::geom_jitter2()

Simple Linear Regression

Scatterplot

Base R scatterplot

plot(hwy ~ displ, data = mpg)

ggplot2 scatterplot

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

ggstatsplot scatterplot

ggstatsplot::ggscatterstats(data = mpg, x = displ, y = hwy, 
                            results.subtitle = FALSE, 
                            marginal = FALSE, 
                            conf.level = NULL)
Warning in cbind(predictor, predictor + hwid %o% c(1, -1)): number of rows of
result is not a multiple of vector length (arg 1)
Warning: Computation failed in `stat_smooth()`
Caused by error in `base::data.frame()`:
! arguments imply differing number of rows: 80, 0

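By default, ggstatsplot::ggscatterstats adds marginal plots to scatterplots. Setting the argument marginal equal to FALSE removes them. The warnings above appear to come from conf.level = NULL, which makes the stat_smooth() computation fail, so no regression line is drawn on this plot.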
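To see the default behaviour, the sketch below leaves marginal untouched so the marginal plots are drawn. I am assuming the ggside package, which ggstatsplot relies on for marginal plots, is installed; p_marginal is just an illustrative name.

```r
library(ggplot2)      # provides the mpg data
library(ggstatsplot)

# marginal is left at its default, so marginal plots are drawn on both axes
p_marginal <- ggscatterstats(mpg, x = displ, y = hwy,
                             results.subtitle = FALSE)
p_marginal
```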

easystats scatterplot

ggplot(mpg, aes(x = displ, y = hwy)) + 
  see::geom_point2()

ggpubr scatterplot

ggpubr::ggscatter(mpg, x = "displ", y = "hwy")

You can also make scatter plots with marginal plots using ggpubr::ggscatterhist and setting the parameter margin.plot to either “histogram”, “density”, or “boxplot”.
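A minimal sketch of ggscatterhist with marginal histograms follows. Converting mpg to a plain data frame is a precaution on my part rather than a documented requirement.

```r
library(ggplot2)  # provides the mpg data
library(ggpubr)

# Plain data frame copy of mpg (precautionary; see note above)
mpg_df <- as.data.frame(mpg)

# Scatterplot of hwy against displ with marginal histograms
ggscatterhist(mpg_df, x = "displ", y = "hwy",
              margin.plot = "histogram")
```

Swapping "histogram" for "density" or "boxplot" changes the type of marginal plot.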

Regression lines

Base R regression lines

plot(hwy ~ displ, data = mpg)
abline(lm(hwy ~ displ, data = mpg))

ggplot2 regression lines

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth(method = lm)
`geom_smooth()` using formula = 'y ~ x'

ggstatsplot regression lines

ggstatsplot::ggscatterstats(data = mpg, x = displ, y = hwy, 
                            results.subtitle = FALSE, 
                            marginal = FALSE)

ggpubr regression lines

ggpubr::ggscatter(mpg, x = "displ", y = "hwy", 
                  add = "reg.line",
                  conf.int = TRUE)
`geom_smooth()` using formula = 'y ~ x'