Introduction

Sections

In this lesson, we will focus on how to visualize variables. We will concentrate on visualizing frequencies and proportions, and along the way we will discuss how to visualize the dispersion and central tendency of variables. We will discuss the following:

histograms
bar plots
lollipop plots
pie charts
strip plots
violin plots
box plots

Getting Started

Remember to start by setting your working directory to a new folder, and saving your script/markdown document in this folder. You should also store any data you need in that folder.

Load packages, download and import data

library(rio)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#install.packages("gapminder")
library(gapminder)
#install.packages("lubridate")
library(lubridate)
#install.packages("ggridges")
library(ggridges)

econ <- import("USMacroSWQ.csv")
# info about that data here: https://vincentarelbundock.github.io/Rdatasets/doc/AER/USMacroSWQ.html 
# Note: I have already made the date variable for you.

goldprice <- import("GoldSilver.csv")
# info about that data here: https://vincentarelbundock.github.io/Rdatasets/doc/AER/GoldSilver.html 

juice <- import("FrozenJuice.csv")
# info about the data here: https://vincentarelbundock.github.io/Rdatasets/doc/AER/FrozenJuice.html

leaders <- import("Archigos.dta")
# this is a dataset about leaders from 1875-2004. 
# this data is available here: https://www.rochester.edu/college/faculty/hgoemans/data.htm

First we’re going to start by joining the first three datasets that we loaded so they are stored in a single dataframe in our global environment.

We won’t worry about the leaders dataset until later on in the lab.

Join two (or more!) datasets

We need to make sure that the variables we’d like to “join on” have both the same variable name and same variable class/type.

econ <- econ %>%
  select(-c(tbill, V1, rownames))

juice <- juice %>%
  mutate(date = as.Date(date)) %>%
  select(-c(V1, fdd))

goldprice <- goldprice %>%
  rename(date = rownames) 

gold_econ <- left_join(goldprice, econ, by="date")

gold_econ$date <- as.Date(gold_econ$date)

goldeconjuice <- left_join(gold_econ, juice, by="date")
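
To see why the class of the join key matters, here is a minimal sketch with two made-up data frames (prices and rates are hypothetical and are not part of the lab data). One stores date as text and the other as a Date, so we convert the text version before joining.

```r
library(tidyverse)

# hypothetical toy data: `date` is character in one table and Date in the other
prices <- tibble(date = c("2000-01-03", "2000-01-04"),
                 gold = c(282.1, 281.5))
rates <- tibble(date = as.Date(c("2000-01-03", "2000-01-05")),
                rate = c(5.4, 5.5))

# check the class of the join key in each data frame
class(prices$date)
class(rates$date)

# convert the character key to Date so both keys share a name and a class
prices_fixed <- prices %>%
  mutate(date = as.Date(date))

# left_join() keeps every row of prices_fixed;
# dates with no match in rates get NA for rate
joined <- left_join(prices_fixed, rates, by = "date")
joined
```

Checking class() on the key in each data frame before joining is a quick way to catch this kind of mismatch.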

For juice: ppi is the producer price index for finished goods. It is used to deflate the juice price, which removes the effects of overall price inflation.

You could also chain the left joins together in a single pipe, which is more concise. It would look something like this:

(you may need to play with the dates to get them to join properly)

# wrangle variable names/types
goldprice <- goldprice %>%
  rename(date = rownames) %>%
  mutate(date = as.Date(date))

econ <- econ %>%
  select(-c(tbill, V1, rownames)) %>%
  mutate(date = as.Date(date))

juice <- juice %>%
  mutate(date = as.Date(date)) %>%
  select(-c(V1))

# left join datasets together
econgoldjuice <- goldprice %>%
  left_join(econ, by="date") %>%
  left_join(juice, by="date")

Let’s save our data so we can use it again in another class.

save(goldeconjuice, file="gold-econ-juice-data.RData")
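
When you come back to the data in another session, load() restores the saved object under its original name. Here is a small sketch of the round trip, using a made-up object (toy_data) and a temporary file so that nothing is written to your working directory.

```r
# toy object standing in for goldeconjuice (made-up values)
toy_data <- data.frame(date = as.Date("2000-01-03"), price = 95.2)

# tempfile() is used only for this illustration
path <- tempfile(fileext = ".RData")
save(toy_data, file = path)

rm(toy_data)        # remove the object from the global environment
load(path)          # load() brings it back under the same name
exists("toy_data")  # check that the object is available again
```

With the lab data, the equivalent call would be load("gold-econ-juice-data.RData"), assuming the file sits in your working directory.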

Visualizing frequencies and amounts

Histograms

Let’s discuss how we can visualize frequencies/amounts using geom_bar() and geom_histogram().

Remember in the last lecture I argued that we typically do not want to just report the frequency of a variable (either on its own or in a table). Instead, we often turn to proportions. Well, one way to usefully communicate a variable’s frequency is to visualize or plot it.

One of the most common ways to visualize frequencies is by creating a histogram or a bar plot. Bar plots are effective because humans can easily distinguish differences in length and bar plots communicate visually by displaying bars of different lengths.

We use histograms to look at continuous variables. You can think of a histogram as a specific type of bar plot where the bars are touching. Let’s create a histogram below for juice price (variable called “price” in the goldeconjuice dataframe). This will show us the counts or frequency of the different values of the price variable by “binning” the values.

ggplot(data = goldeconjuice, aes(x=price)) +
  geom_histogram(binwidth = 5) +
  scale_x_continuous(breaks= seq(70, 165, by=10))

## Warning: Removed 8935 rows containing non-finite outside the scale range
## (`stat_bin()`).

In the code above:

binwidth defines how the data is grouped into intervals. A smaller binwidth creates more, narrower bars and shows more detail. A larger binwidth creates fewer, wider bars and shows less detail about the specific values that the variable takes on.
scale_x_continuous(), which is a scales function, controls where tick marks and labels appear along the x-axis. It does not affect how the data is grouped, just how the axis looks.
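
To see the effect of binwidth without needing the juice data, here is a small sketch using the hwy variable from the mpg data that ships with ggplot2. The two binwidth values are arbitrary choices for illustration.

```r
library(ggplot2)

# same variable, two different binwidths
p_narrow <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 1)
p_wide <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5)

# layer_data() returns the computed bins, one row per bar
bins_narrow <- layer_data(p_narrow)
bins_wide <- layer_data(p_wide)

nrow(bins_narrow)  # number of bars with the smaller binwidth
nrow(bins_wide)    # number of bars with the larger binwidth

# every observation is counted exactly once in both versions
sum(bins_narrow$count)
sum(bins_wide$count)
```

Print p_narrow and p_wide to compare the two histograms side by side.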

You should have received a warning that says “Removed 8935 rows containing non-finite outside the scale range”. We did not get rid of the NA values that appear in the price column before we made the plot, so the warning is telling us that these values were not plotted. Ideally, you would remove these NA values from the data before plotting. But in the meantime, you could double check that’s all that was excluded from the plot (and not other non-missing values due to how you restricted the x-axis):

sum(is.na(goldeconjuice$price))

## [1] 8935
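
One way to remove the missing values before plotting is drop_na() from tidyr. The sketch below uses a small made-up data frame (toy) in place of goldeconjuice so that it runs on its own.

```r
library(tidyverse)

# made-up stand-in for goldeconjuice with two missing prices
toy <- tibble(price = c(80, 95, NA, 110, NA, 120))

sum(is.na(toy$price))  # how many missing values before cleaning

# keep only rows where price is not missing
toy_complete <- toy %>%
  drop_na(price)

# plotting the cleaned data produces no "Removed ... rows" warning
ggplot(toy_complete, aes(x = price)) +
  geom_histogram(binwidth = 10)
```

With the lab data, the same idea would be goldeconjuice %>% drop_na(price), stored under a new name so the original data frame is left untouched.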

Now let’s make the same plot but much more presentable:

ggplot(data = goldeconjuice, aes(x=price)) +
  geom_histogram(binwidth = 5,
                 fill = "orange", 
                 colour = "black") +
  scale_x_continuous(breaks= seq(70, 165, by=10)) +
  labs(title="Distribution of the Price of Frozen Orange Juice 1950-2000", 
       x = "Price",
       y = "Frequency")

## Warning: Removed 8935 rows containing non-finite outside the scale range
## (`stat_bin()`).

Hack: If you wanted a specific orange to fill the bars, try googling “colour picker”, dragging it to the exact shade of orange you’d like to fill with and then copy and paste the HEX code in quotes in the fill argument of the geom_histogram() function.
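
As a quick illustration of the HEX code approach, the sketch below fills a histogram with "#FF8C00", which is just one possible orange. It uses the mpg data so it runs without the juice data.

```r
library(ggplot2)

# "#FF8C00" is an example HEX code; paste in whichever code your colour picker gives you
p_hex <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2,
                 fill = "#FF8C00",
                 colour = "black")
p_hex
```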

Bar plots

If we want to visualize a categorical variable, a traditional bar plot tends to be highly effective. Again, bar plots are effective because humans can easily distinguish differences in length and bar plots communicate visually by displaying bars of different lengths.

The mpg data is included in the ggplot2 package. It is fuel economy data from 1999 to 2008 for 38 popular models of cars. Run ?mpg in your console to learn more.

ggplot(data = mpg, aes(x = class, fill = class)) + # here fill car "class" with all different colours 
  geom_bar() +
  ggtitle("Simple plot of car classes in mpg data")

Let’s use contrast to highlight SUVs and pick ups specifically. First, we recode the class variable to a new variable called “vehicle_type” that takes on three categories: suv, pick up, and other.

Then we use this variable to “fill” the bars of our bar plot.

We can use a scales function (scale_fill_manual()) to change the colours of the bar plot if we’d like to further increase the contrast between SUVs, pick ups, and the other vehicle types. When creating a histogram of a continuous variable using geom_histogram(), we will often use a continuous colour gradient to colour the bins of our variable (using scale_fill_gradient()), or we will use scale_fill_manual() to fill based on another variable in our dataset that is categorical.

mpg <- mpg %>%
  mutate(vehicle_type = case_when(class == "suv" ~ "suv", 
                                  class == "pickup" ~ "pickup",
                                   TRUE ~ "other"))

ggplot(data = mpg, aes(x = class, fill = vehicle_type)) +
  geom_bar() +
  scale_fill_manual(values = c("suv" = "darkgreen", 
                               "pickup" = "blue", 
                               "other" = "grey70"), 
                    name = "Vehicle Type") + 
  # name argument allows us to rename the legend
  ggtitle("Car Classes in mpg data") +
  theme(legend.position = "bottom") # move legend from default location on right side to bottom of the plot
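
The continuous gradient option mentioned above for histograms can be sketched as follows. This is one possible approach, not the only one: it maps fill to the bin count with after_stat(count), and the two colours are arbitrary choices.

```r
library(ggplot2)

# fill each bin according to how many observations it contains
p_grad <- ggplot(mpg, aes(x = hwy, fill = after_stat(count))) +
  geom_histogram(binwidth = 2, colour = "black") +
  scale_fill_gradient(low = "lightyellow",
                      high = "darkorange",
                      name = "Frequency")
p_grad
```

Because the fill repeats information already shown by bar height, a gradient like this is mostly decorative, so use it sparingly.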

Bar plot alternative

Lollipop plots

Lollipop plots focus our attention on the end of the bars - the most important aspect for interpreting differences between categories.

gapminder_continents <- gapminder %>%
  filter(year == 2007) %>% # only look at 2007
  count(continent) %>% # count number of countries in each continent
  arrange(desc(n)) %>% 
  mutate(continent = fct_inorder(continent)) # set factor levels in order of appearance (most to fewest countries)

ggplot(gapminder_continents,
 aes(x = continent, y = n,
 color = continent)) +
 geom_pointrange(aes(ymin = 0, ymax = n)) + 
 labs(x = NULL, y = "Number of countries")

Bar plot No-Nos!

General

Like any data visualization, bar plots can be deceptive. Let’s talk briefly about some things to avoid.

In the bar plot above, there are several issues:

It violates the design principle we discussed in lecture, which is that in general, “simple is better” when it comes to our visualizations. There are a lot of distracting background elements in this plot.
The y-axis is not properly declared in the title of the plot or using a y-axis label. This is especially problematic when the plot is meant for a general audience.
The plot compares two unequal time intervals (average annual GDP in a 58-year period compared to average annual GDP in a 7-year period). Is data back to 1950 necessary to begin with? How do these (unequal) time intervals support the story being told with the data? What is the benchmark for GDP growth and how has this changed over time? Is growth slower now than it was in the early postwar period?
What are we missing? The plot implicitly assumes that GDP growth can be entirely attributed to a president. While government policy influences GDP growth, it is not the only factor. We need more information.

When creating a bar plot, the y-axis should always start at zero. In other words, do not truncate the y-axis. The entire length of the bar matters - it is the only way for us to accurately interpret differences in bar lengths. Cutting off parts of the bars distorts or exaggerates differences. For instance, in the plot above, the y-axis starts at zero, but if it started at 1% then we would say the y-axis has been truncated.
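
To get a feel for how much truncation can exaggerate a difference, here is a small sketch with made-up values (these are not the GDP figures from the plot discussed above). It uses coord_cartesian() to zoom the y-axis, which is one way a truncated bar plot can be produced.

```r
library(ggplot2)

# made-up values for illustration only
growth <- data.frame(period = c("Period A", "Period B"),
                     value = c(3.1, 3.4))

# honest version: bars start at zero
p_full <- ggplot(growth, aes(x = period, y = value)) +
  geom_col()

# truncated version: the y-axis starts at 3 instead of 0
p_trunc <- p_full +
  coord_cartesian(ylim = c(3, 3.5))

# ratio of the actual values (roughly 1.1)
true_ratio <- growth$value[2] / growth$value[1]

# ratio of the visible bar lengths once the axis starts at 3 (roughly 4)
visible_ratio <- (growth$value[2] - 3) / (growth$value[1] - 3)

true_ratio
visible_ratio
```

In the truncated plot, the second bar looks about four times as long as the first even though its value is only about 10% larger.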

Summary statistics

We should be careful when using bar plots to show summary statistics, such as the mean or median of our variable.

Remember this plot which shows two different variables with similar means but different standard deviations?

I created this plot using “fake” data that I generated in the following way:

(You don’t need to know how to generate fake data, but I’m showing you this to be transparent. You should know how to create a density plot and understand how group_by() works. HINT: group_by() may be useful for the first data assignment. Feel free to test out the code below on your own.)

# set seed for reproducibility
set.seed(123)

# generate dataset with variable with high standard deviation
high_sd_data <- data.frame(
  group = "High SD",
  value = rnorm(100, mean = 50, sd = 20) # Mean = 50, High SD = 20
)


# generate dataset with variable with low standard deviation
low_sd_data <- data.frame(
  group = "Low SD",
  value = rnorm(100, mean = 50, sd = 5)  # Mean = 50, Low SD = 5
)


# combine datasets
data_combined <- bind_rows(high_sd_data, low_sd_data)

# calculate means for each group, store separately
means <- data_combined %>%
  group_by(group) %>% 
  # behind scenes: put each unique group (high and low sd) into its own dataset
  # you can specify more than one variable to group by 
  summarize(mean_value = mean(value)) 
  # behind scenes: calculate mean for each group (high sd, low sd)

# create a density plot with mean lines
ggplot(data_combined, aes(x = value, fill = group)) +
  geom_density(alpha = 0.5) +
  geom_vline(data = means, 
             aes(xintercept = mean_value, color = group), 
             linetype = "dashed", 
             linewidth = 1) +
  labs(
    title = "Comparison of High and Low Standard Deviation",
    x = "Value",
    y = "Density"
  ) +
  theme_minimal()

If we were to present only the summary statistic (means) in a bar plot, it would look like this:

If we compare this bar plot to the density plot above, we can see that we’ve lost a lot of information! While the means are similar, we know that the way the data is distributed varies drastically for each group. We should be aware of this when using a bar plot to present summary statistics - it may not always be the best approach. It isn’t always bad to use a bar plot to present summary statistics, but you should be aware of how it may hide important information.
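
For reference, a bar plot of the group means can be built along these lines. This is a sketch that regenerates the same fake data with the same seed. The labels and theme are my own choices and may not match the original figure.

```r
library(tidyverse)

# regenerate the fake data with the same seed as above
set.seed(123)
data_combined <- bind_rows(
  data.frame(group = "High SD", value = rnorm(100, mean = 50, sd = 20)),
  data.frame(group = "Low SD", value = rnorm(100, mean = 50, sd = 5))
)

# one row per group: the mean, plus the sd that the bar plot will hide
means <- data_combined %>%
  group_by(group) %>%
  summarize(mean_value = mean(value),
            sd_value = sd(value))
means

# geom_col() draws bars whose heights are the values in the data (here, the means)
ggplot(means, aes(x = group, y = mean_value, fill = group)) +
  geom_col() +
  labs(title = "Mean value by group",
       x = NULL,
       y = "Mean") +
  theme_minimal()
```

The sd_value column makes the lost information explicit: the two bars are almost the same height, while the standard deviations differ by a wide margin.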

Later on, I will show you how we can use a strip plot to present more of our data, or how we can use a box plot to summarize our data.

What about a pie chart? 🥧

Pie charts are a common way to show proportions since they visually display “parts of a whole”.

There is no specific geom in ggplot2 to build a pie chart. Instead, we build a bar plot and use coord_polar() to make it into a pie chart.

I wonder… why is that? Why wouldn’t the creators of ggplot2 create a specific function to build a pie chart (e.g. like geom_pie or something)?

ggplot(mpg, aes(x = "", fill = class)) + # x is set to an empty string here (go ahead and try putting class there; you'll see it's no longer a filled pie chart)
  geom_bar(stat = "count", width = 1, color = "black") +
  coord_polar(theta = "y") + 
  ggtitle("Frequency of Car Classes in mpg Data") +
  #scale_fill_brewer(palette = "Set3") +  # choose a specific colour palette if you'd like; commented out (not affecting plot below)
  theme_void()

Same pie chart as above using different code

In the code below, we produce essentially the same pie chart as above but we explicitly calculate the proportions first. This code might be useful if you have already generated a table of proportions and want to visualize it.

Note how the stat argument is now set to “identity” instead of count and we specify y=prop. Feel free to run this code to get a sense of how it works and prove to yourself that it produces the same output as the code above.

# table of proportions
mpg_prop <- mpg %>%
  count(class) %>%
  mutate(prop = round((n/sum(n)), 2))

ggplot(mpg_prop, aes(x = "", y = prop, fill = class)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar(theta = "y") + # changes to circular plot (pie chart) using y (prop, specified above) to determine the area/angles of pie
  ggtitle("Proportion of Car Classes in mpg Data") +
  theme_void()

When are pie charts problematic?

What’s wrong with the above pie chart?

While we are good at distinguishing between lengths of bars in a bar plot, it is more difficult for us to distinguish between areas with close angles.

For instance, in the plot above, the subcompact, pickup, and mid-size categories all look pretty similar. The audience consuming this visualization is going to be hard pressed to accurately tell you which of these three vehicle types appears more often in the data.

When we look at the bar plot, however, we can easily tell you that there are more mid-size vehicles than pick ups or subcompacts, and between subcompacts and pick ups, there are slightly more subcompacts.
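
We can confirm this reading with the underlying counts. The sketch below tabulates class in the mpg data and redraws the bar plot with the bars ordered by frequency, using fct_infreq() from forcats (the ordering is my addition, not part of the earlier plot).

```r
library(tidyverse)

# frequency of each class, sorted from most to least common
class_counts <- mpg %>%
  count(class, sort = TRUE)
class_counts

# bar plot with bars ordered by frequency
ggplot(mpg, aes(x = fct_infreq(class))) +
  geom_bar() +
  labs(x = "class", y = "Frequency")
```

In the table, midsize (41) is ahead of subcompact (35), which is just ahead of pickup (33): differences that are very hard to judge from pie slices.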

Not all pie charts are bad! 🥧

ggplot(mpg, aes(x = "", fill = vehicle_type)) + 
  geom_bar(stat = "count", width = 1, color = "black") +
  coord_polar(theta = "y") +  # changes to pie chart
  ggtitle("Vehicles in mpg Data") +
  theme_void()

This is a much better pie chart because we have few, distinguishable categories, and the differences between the frequencies of the categories are large.

Visualizations that present additional information

Strip plots

Strip plots show more of our data and allow us to look at how the distribution of a variable varies across different categories. For instance, in our leadership data we might wonder how time in power varies by gender. Or, in another dataset, how GDP varies by continent.

Let’s look at an example using the leaders dataset we loaded above.

# wrangle the leaders data we loaded earlier
# subtracting one Date from another is base R behaviour
# and returns a "difftime" object

df_leaders <- leaders %>%
  drop_na(enddate, startdate) %>%
  mutate(startdate = as.Date(startdate),
         enddate = as.Date(enddate),
         time_pwr = enddate - startdate, # difftime, measured in days
         days_pwr = as.numeric(time_pwr), # changed to numeric for simplicity's sake
         gender = factor(gender))

check <- df_leaders %>%
  filter(days_pwr >= 0 & days_pwr <= 10) 

head(check)

##        obsid                               leadid ccode idacr     leader
## 1 CUB-1934-1 81df0b74-1e42-11e4-b4cd-db5882bf8def    40   CUB      Hevia
## 2 HAI-1879-1 81df6d1b-1e42-11e4-b4cd-db5882bf8def    41   HAI    Herrise
## 3 HAI-1957-2 81e03076-1e42-11e4-b4cd-db5882bf8def    41   HAI    Cantave
## 4 HAI-1957-4 81e03076-1e42-11e4-b4cd-db5882bf8def    41   HAI    Cantave
## 5 HAI-1990-1 81e0921c-1e42-11e4-b4cd-db5882bf8def    41   HAI    Abraham
## 6 DOM-1876-1 81e0c2f2-1e42-11e4-b4cd-db5882bf8def    42   DOM Villanueva
##    startdate    eindate    enddate   eoutdate     entry      exit
## 1 1934-01-15 1934-01-15 1934-01-18 1934-01-18 Irregular Irregular
## 2 1879-07-17 1879-07-17 1879-07-26 1879-07-26 Irregular   Regular
## 3 1957-04-02 1957-04-02 1957-04-06 1957-04-06 Irregular   Regular
## 4 1957-05-20 1957-05-20 1957-05-26 1957-05-26 Irregular Irregular
## 5 1990-03-10 1990-03-10 1990-03-13 1990-03-13   Regular   Regular
## 6 1876-02-24 1876-02-24 1876-03-03 1876-03-03 Irregular Irregular
##                                           exitcode prevtimesinoffice
## 1 Removed in Military Power Struggle Short of Coup                 0
## 2                                          Regular                 0
## 3                                          Regular                 0
## 4     Removed by Military, without Foreign Support                 1
## 5                                          Regular                 0
## 6     Removed by Military, without Foreign Support                 0
##   posttenurefate gender yrborn yrdied   borndate ebirthdate deathdate
## 1             OK      M   1900   1964         NA       <NA>        NA
## 2   Imprisonment      M   -999   -999         NA       <NA>        NA
## 3             OK      M   1910   1967         NA       <NA>        NA
## 4          Exile      M   1910   1967 1910-07-04 1910-07-04        NA
## 5             OK      M   1940   -777         NA       <NA>        NA
## 6          Exile      M   -999   1920         NA       <NA>        NA
##   edeathdate
## 1       <NA>
## 2       <NA>
## 3       <NA>
## 4       <NA>
## 5       <NA>
## 6       <NA>
##                                                                                                                                                                                                                                                                                               dbpediauri
## 1            https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Carlos-5FHevia&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=5-6e0ciWg-SlbSXQhsIrBVVl7sydg_gVQjoic3YQiQE&e=
## 2                                                                                                                                                                                                                                                                                                     NA
## 3                                                                                                                                                                                                                                                                                                     NA
## 4                                                                                                                                                                                                                                                                                                     NA
## 5 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_H-25C3-25A9rard-5FAbraham&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=iZkmgjNtCkklKJJvfoEJIpmIZtJpAQpjcuitCwvTeR4&e=
## 6                                                                                                                                                                                                                                                                                                     NA
##   numentry numexit numexitcode numposttenurefate fties ftcur time_pwr days_pwr
## 1        1       3          16                 0    NA    NA   3 days        3
## 2        1       1           0                 2    NA    NA   9 days        9
## 3        1       1           0                 0    NA    NA   4 days        4
## 4        1       3           6                 1    NA    NA   6 days        6
## 5        0       1           0                 0    NA    NA   3 days        3
## 6        1       3           6                 1    NA    NA   8 days        8

df_leaders2 <- df_leaders %>%
  filter(days_pwr < 10950)

There are 112 leaders who were in power for 0-10 days. This is odd, but it is a feature of the data. We’re going to get rid of anyone who was leader for 30 years or more (10,950 days or more).
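
The date arithmetic behind time_pwr and days_pwr can be checked by hand. The sketch below uses the start and end dates from the first row of the check output above (Hevia, Cuba) and needs only base R.

```r
# start and end dates copied from the first row of the output above
start <- as.Date("1934-01-15")
end <- as.Date("1934-01-18")

# subtracting two Dates gives a difftime object measured in days
time_pwr <- end - start
time_pwr
class(time_pwr)

# as.numeric() drops the units and leaves the number of days
days_pwr <- as.numeric(time_pwr)
days_pwr
```

The result matches the 3 days reported for this leader in the time_pwr and days_pwr columns.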

Let’s use an anti-join to see which leaders served more than 10,950 days.

head(anti_join(df_leaders, df_leaders2))

## Joining with `by = join_by(obsid, leadid, ccode, idacr, leader, startdate,
## eindate, enddate, eoutdate, entry, exit, exitcode, prevtimesinoffice,
## posttenurefate, gender, yrborn, yrdied, borndate, ebirthdate, deathdate,
## edeathdate, dbpediauri, numentry, numexit, numexitcode, numposttenurefate,
## fties, ftcur, time_pwr, days_pwr)`

##        obsid                               leadid ccode idacr         leader
## 1   CUB-1959 81df6d18-1e42-11e4-b4cd-db5882bf8def    40   CUB         Castro
## 2 DOM-1930-2 81e1863f-1e42-11e4-b4cd-db5882bf8def    42   DOM Rafel Trujillo
## 3   MEX-1876 81e39f71-1e42-11e4-b4cd-db5882bf8def    70   MEX           Diaz
## 4   BRA-1840 81fab41b-1e42-11e4-b4cd-db5882bf8def   140   BRA       Pedro II
## 5 PAR-1954-2 81ffa9b9-1e42-11e4-b4cd-db5882bf8def   150   PAR     Stroessner
## 6 SPN-1939-2 8211f99e-1e42-11e4-b4cd-db5882bf8def   230   SPN         Franco
##    startdate    eindate    enddate   eoutdate     entry
## 1 1959-01-02 1959-01-02 2008-02-24 2008-02-24 Irregular
## 2 1930-08-16 1930-08-16 1961-05-30 1961-05-30   Regular
## 3 1876-11-23 1876-11-23 1911-05-25 1911-05-25 Irregular
## 4 1840-07-23 1840-07-23 1889-11-15 1889-11-15   Regular
## 5 1954-07-11 1954-07-11 1989-02-03 1989-02-03 Irregular
## 6 1939-04-01 1939-04-01 1975-10-30 1975-10-30 Irregular
##                        exit                                     exitcode
## 1 Retired Due to Ill Health                                      Regular
## 2                 Irregular Removed by Military, without Foreign Support
## 3                 Irregular                                      Unknown
## 4                 Irregular                                      Unknown
## 5                 Irregular Removed by Military, without Foreign Support
## 6             Natural Death                                      Regular
##   prevtimesinoffice                                            posttenurefate
## 1                 0                                                        OK
## 2                 0                                                     Death
## 3                 0                                                     Exile
## 4                 0                                                     Exile
## 5                 0                                                     Exile
## 6                 0 Missing: Natural Death within Six Months of Losing Office
##   gender yrborn yrdied   borndate ebirthdate  deathdate edeathdate
## 1      M   1927   -777 1927-08-13 1927-08-13         NA       <NA>
## 2      M   1891   1961 1891-10-24 1891-10-24 1961-05-30 1961-05-30
## 3      M   1830   1915 1830-09-15 1830-09-15 1915-07-02 1915-07-02
## 4      M   1825   1891 1825-12-02 1825-12-02 1891-12-05 1891-12-05
## 5      M   1912   2006 1912-11-03 1912-11-03         NA       <NA>
## 6      M   1892   1975 1892-12-04 1892-12-04 1975-11-20 1975-11-20
##                                                                                                                                                                                                                                                                                              dbpediauri
## 1                                                                                                                                                                                                                                                                                                    NA
## 2        https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Rafael-5FTrujillo&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=pQrJNBAuvPHTuw_XYf0hy4QgVrDQQKTdyDToro0iPcc&e=
## 3 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Porfirio-5FD-25C3-25ADaz&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=aimBPbn3rxCKiOG9llJzEUriz0ix8JGH5ok_TrWbRdk&e=
## 4 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Pedro-5FII-5Fof-5FBrazil&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=EUgwYcTAAIGG_KHshdVPvWQYtkJt-TfRZ0naBjtF0dQ&e=
## 5     https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Alfredo-5FStroessner&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=W8pZFUtPoxWgmD3QLuDII9W2oqDbOJ9wVDixDzT8v9c&e=
## 6       https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Francisco-5FFranco&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=h4XbplSQ52mUvj0HoNiaYQ1LJdNyIlFlgRB-LylziWc&e=
##   numentry numexit numexitcode numposttenurefate
## 1        1     2.1           0                 0
## 2        0     3.0           6                 3
## 3        1     3.0        -999                 1
## 4        0     3.0        -999                 1
## 5        1     3.0           6                 1
## 6        1     2.0           0              -777
##                                                                       fties
## 1               Brother of Raul Castro%824ff801-1e42-11e4-b4cd-db5882bf8def
## 2                                                                        NA
## 3                                                                        NA
## 4                                                                        NA
## 5 Relative-in-law of Rodriguez Pedotti%81ffa9ba-1e42-11e4-b4cd-db5882bf8def
## 6                                                                        NA
##   ftcur   time_pwr days_pwr
## 1     0 17950 days    17950
## 2    NA 11245 days    11245
## 3    NA 12600 days    12600
## 4    NA 18012 days    18012
## 5     0 12626 days    12626
## 6    NA 13361 days    13361

For instance, we see that Fidel Castro was the leader of Cuba for over 49 years, from 1959 to 2008 (17,950 days).
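
If anti_join() is new to you, here is what it does on a small scale. The data frame below is made up (the leader names and day counts are hypothetical), but the logic mirrors what we did with df_leaders and df_leaders2.

```r
library(tidyverse)

# made-up data: one leader exceeds the 10,950-day cutoff
all_leaders <- tibble(leader = c("Leader A", "Leader B", "Leader C"),
                      days_pwr = c(100, 12000, 500))

# keep only leaders under the cutoff
kept <- all_leaders %>%
  filter(days_pwr < 10950)

# anti_join() returns the rows of the first data frame
# that have no match in the second: here, the rows we dropped
dropped <- anti_join(all_leaders, kept, by = "leader")
dropped
```

Specifying by = explicitly also avoids the long "Joining with" message that appears when dplyr has to guess the join columns.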

Let’s plot our (filtered) leaders data using a strip plot.

position_jitter() below adds a small amount of random noise to the points’ positions to prevent overplotting, where multiple points overlap. In other words, it just spreads the points out so that we can get a better sense of how many data points there are (and they’re not totally overlapping).

We specify height = 0 so that it doesn’t move the points on the y-axis (up or down), only along the x-axis (side to side). width=0.2 specifies how much to jitter on x-axis. Again, you could customize the plot below by specifying a title, change labels of the y and x-axes, change legend title and labels, specify a minimal theme to remove the grey background, etc.!

ggplot(df_leaders2, aes(x = gender, 
                       y=days_pwr, 
                       color=gender)) + 
     geom_point(size = 0.5, 
                position = position_jitter(height = 0, width=0.2))
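
Here is a sketch of what some of those customizations could look like. It uses the mpg data (highway mileage by drive type) rather than the leaders data so that it runs without the Archigos file. The titles and labels are my own choices.

```r
library(ggplot2)

p_strip <- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) +
  # jitter side to side only, so the y values stay exact
  geom_point(size = 0.5,
             position = position_jitter(height = 0, width = 0.2)) +
  labs(title = "Highway fuel economy by drive type",
       x = "Drive type",
       y = "Highway miles per gallon",
       color = "Drive type") +  # color = renames the legend title
  theme_minimal()               # removes the grey background
p_strip
```

The same labs() and theme_minimal() layers can be added to the leaders plot above with labels suited to that data.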

Violin plots

A violin plot is similar to a strip plot but uses density to define the shape of the distribution. Where the violin is wider, there are more data points.

ggplot(df_leaders2, aes(x = gender, y = days_pwr)) +
  geom_violin(fill = "lightblue", color = "black") +
  ggtitle("Violin Plot: Days in Power by Gender") +
  labs(x = "Gender", y = "Days in Power")

Box plots

A box plot summarizes the distribution of a numeric variable across one or more groups, making it a convenient tool for quickly grasping differences between those groups.

However, a potential drawback of the box plot is that we can’t discern the underlying distribution of individual data points within each group or the total number of observations within each group. If you need to know the shape of the distribution or how the number of observations varies within each group, it can be helpful to overlay jittered points on top of the box plots. I provide an example of how to do this below.

Source: https://leansigmacorporation.com/box-plot-with-minitab/

The box itself represents the interquartile range and tells us how spread out the middle 50% of the data is. A taller box indicates more variability, while a shorter box suggests the data tends to cluster together. The bottom of the box represents the 25th percentile (value below which 25% of the data lies). The top of the box is the 75th percentile (value below which 75% of the data lies).
The line inside of the box is the median (measure of central tendency).
Whiskers show us the range of the data excluding potential outliers.
Individual points beyond the whiskers indicate outliers, which are unusually high or low values in the data.
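
The pieces of a box plot listed above can be computed by hand. The sketch below does this for a small made-up vector, assuming the common 1.5 x IQR rule for flagging outliers, which is also the default in geom_boxplot() (its coef argument is 1.5).

```r
# made-up values with one unusually large observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1 <- unname(quantile(x, 0.25))  # bottom of the box (25th percentile)
med <- median(x)                 # line inside the box
q3 <- unname(quantile(x, 0.75))  # top of the box (75th percentile)
iqr <- q3 - q1                   # height of the box

# points more than 1.5 * IQR beyond the box are drawn as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# whiskers extend to the most extreme values that are not outliers
whisker_low <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

c(q1 = q1, median = med, q3 = q3, iqr = iqr)
c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Here the value 30 lies above the upper fence, so it would be drawn as an individual point and the upper whisker would stop at 9.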

ggplot(df_leaders2,
       # start with our smaller leaders dataset
       aes(x = gender,
           y = days_pwr,
           color = gender)) +
  # create boxplot
  geom_boxplot(width = 0.5) +
  # size controls size of points, alpha controls transparency, jitter points
  geom_point(size = 0.5, alpha = 0.25,
             position = position_jitter(height = 0, width = 0.2)) +
  # change how ticks appear on y-axis
  scale_y_continuous(breaks = seq(0, 11000, by = 1000)) +
  # add labels
  labs(title = "Leadership tenure by sex",
       x = "sex", y = "days in power")

Wrapping up

Important functions/operators:

Various ggplot2 functions and operators:

+ to layer the ggplot2 components
geom_histogram() histograms
geom_bar() bar plots
geom_pointrange() lollipop plot (for our purposes; has other uses as well)
geom_density() density plot
coord_polar() used to create pie chart
geom_point() strip plot (we will also discuss how to use it to create scatterplots in a later lecture)
geom_boxplot() boxplot
geom_violin() violin plot

Class 7: Visualizing a variable

POL3325G Data Science for Politics (February 25, 2025)

Shanaya Vanhooren