In this lesson, we will focus on how to visualize variables. We concentrate on visualizing frequencies and proportions, and also touch on how we can visualize the dispersion and central tendency of variables.
Remember to start by setting your working directory to a new folder, and saving your script/markdown document in this folder. You should also store any data you need in that folder.
library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("gapminder")
library(gapminder)
#install.packages("lubridate")
library(lubridate)
#install.packages("ggridges")
library(ggridges)
econ <- import("USMacroSWQ.csv")
# info about that data here: https://vincentarelbundock.github.io/Rdatasets/doc/AER/USMacroSWQ.html
# Note: I have already made the date variable for you.
goldprice <- import("GoldSilver.csv")
# info about that data here: https://vincentarelbundock.github.io/Rdatasets/doc/AER/GoldSilver.html
juice <- import("FrozenJuice.csv")
# info about the data here: https://vincentarelbundock.github.io/Rdatasets/doc/AER/FrozenJuice.html
leaders <- import("Archigos.dta")
# this is a dataset about leaders from 1875-2004.
# this data is available here: https://www.rochester.edu/college/faculty/hgoemans/data.htm
First we’re going to start by joining the first three datasets that we loaded so they are stored in a single dataframe in our global environment.
We won’t worry about the leaders dataset until later on in the lab.
We need to make sure that the variables we’d like to “join on” have both the same variable name and same variable class/type.
econ <- econ %>%
select(-c(tbill, V1, rownames))
juice <- juice %>%
mutate(date = as.Date(date)) %>%
select(-c(V1, fdd))
goldprice <- goldprice %>%
rename(date = rownames)
gold_econ <- left_join(goldprice, econ, by="date")
gold_econ$date <- as.Date(gold_econ$date)
goldeconjuice <- left_join(gold_econ, juice, by="date")
For juice: ppi is the producer price index for finished goods. It is used to deflate the juice price, which removes the effects of overall price inflation.
You could use a pipe to do both left_joins in one step (more efficient). It would look something like this:
(you may need to play with the dates to get them to join properly)
# wrangle variable names/types
goldprice <- goldprice %>%
rename(date = rownames) %>%
mutate(date = as.Date(date))
econ <- econ %>%
select(-c(tbill, V1, rownames)) %>%
mutate(date = as.Date(date))
juice <- juice %>%
mutate(date = as.Date(date)) %>%
select(-c(V1))
# left join datasets together
econgoldjuice <- goldprice %>%
left_join(econ, by="date") %>%
left_join(juice, by="date")
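Before joining, it can help to confirm that the key column really has the same class in each data frame. The sketch below uses two small made-up data frames (toy_gold and toy_econ are hypothetical stand-ins, not the lab data) to show the check and what happens to rows that have no match.

```r
library(dplyr)

# Hypothetical toy data: the key is stored as character in one table
# and as Date in the other, a common reason a join fails
toy_gold <- data.frame(date = c("2000-01-03", "2000-01-04", "2000-01-05"),
                       gold = c(100, 101, 102),
                       stringsAsFactors = FALSE)
toy_econ <- data.frame(date = as.Date(c("2000-01-03", "2000-01-05")),
                       gdp  = c(5.0, 5.1))

# Compare the classes of the join keys
class(toy_gold$date)  # "character"
class(toy_econ$date)  # "Date"

# Convert the key so both tables agree, then join
toy_gold <- toy_gold %>%
  mutate(date = as.Date(date))

toy_joined <- left_join(toy_gold, toy_econ, by = "date")

# left_join() keeps every row of the left table;
# dates with no match in toy_econ get NA for gdp
toy_joined
```

Counting the NA values in the joined columns is a quick way to see how many rows failed to match.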
Let’s save our data so we can use it again in another class.
save(goldeconjuice, file="gold-econ-juice-data.RData")
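In a later session, the saved object can be brought back with load(), which restores it under the name it had when it was saved. Here is a minimal sketch of the round trip, using a small stand-in data frame and a temporary file (both are assumptions for illustration) so that it does not touch the real data file.

```r
# Stand-in for the real data frame (hypothetical values)
toy_data <- data.frame(date = as.Date("2000-01-03"), price = 100)

# Save to a temporary file, then remove the object from the environment
tmp_file <- tempfile(fileext = ".RData")
save(toy_data, file = tmp_file)
rm(toy_data)

# load() restores the object under its original name
# and invisibly returns the names of what it loaded
loaded_names <- load(tmp_file)
loaded_names
head(toy_data)
```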
Let’s discuss how we can visualize frequencies/amounts using geom_bar() and geom_histogram().
Remember in the last lecture I argued that we typically do not want to just report the frequency of a variable (either on its own or in a table). Instead, we often turn to proportions. Well, one way to usefully communicate a variable’s frequency is to visualize or plot it.
One of the most common ways to visualize frequencies is by creating a histogram or a bar plot. Bar plots are effective because humans can easily distinguish differences in length and bar plots communicate visually by displaying bars of different lengths.
We use histograms to look at continuous variables. You can think of a histogram as a specific type of bar plot where the bars are touching. Let’s create a histogram below for juice price (variable called “price” in the goldeconjuice dataframe). This will show us the counts or frequency of the different values of the price variable by “binning” the values.
ggplot(data = goldeconjuice, aes(x=price)) +
geom_histogram(binwidth = 5) +
scale_x_continuous(breaks= seq(70, 165, by=10))
## Warning: Removed 8935 rows containing non-finite outside the scale range
## (`stat_bin()`).
In the code above:
binwidth defines how the data is grouped into intervals. A smaller binwidth creates more, narrower bars and shows more detail. A larger binwidth creates fewer, wider bars and shows less detail about the specific values that the variable takes on.
scale_x_continuous(), which is a scales function, controls where tick marks and labels appear along the x-axis. It does not affect how the data is grouped, just how the axis looks.
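To see the effect of binwidth concretely, the sketch below builds the same histogram twice on simulated prices (made-up numbers, not the juice data) and counts the bins that ggplot2 creates for each binwidth.

```r
library(ggplot2)

set.seed(1)
# Simulated prices for illustration only
toy_prices <- data.frame(price = runif(500, min = 70, max = 165))

p_narrow <- ggplot(toy_prices, aes(x = price)) +
  geom_histogram(binwidth = 5)
p_wide <- ggplot(toy_prices, aes(x = price)) +
  geom_histogram(binwidth = 20)

# layer_data() returns the data ggplot2 computed for a layer:
# for a histogram, that is one row per bin
n_bins_narrow <- nrow(layer_data(p_narrow))
n_bins_wide   <- nrow(layer_data(p_wide))

n_bins_narrow  # more, narrower bins
n_bins_wide    # fewer, wider bins
```

Both plots contain the same 500 observations. Only the grouping changes.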
You should have received a warning that says “Removed 8935 rows containing non-finite outside the scale range”. We did not get rid of the NA values that appear in the price column before we made the plot, so the warning is telling us that these values were not plotted. Ideally, you would remove these NA values from the data before plotting. But in the meantime, you could double check that’s all that was excluded from the plot (and not other non-missing values due to how you restricted the x-axis):
sum(is.na(goldeconjuice$price))
## [1] 8935
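One way to remove the missing values first is to filter them out before the data reaches ggplot(). Here is a minimal sketch on a made-up data frame (toy_juice is hypothetical). The same pattern would apply to goldeconjuice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data with some missing prices
toy_juice <- data.frame(price = c(80, 95, NA, 110, NA, 120))

# Keep only rows where price is not missing
toy_juice_complete <- toy_juice %>%
  filter(!is.na(price))

# The number of rows dropped should equal the number of NA values
nrow(toy_juice) - nrow(toy_juice_complete)
sum(is.na(toy_juice$price))

# Plotting the filtered data avoids the "Removed ... rows" warning
p_complete <- ggplot(toy_juice_complete, aes(x = price)) +
  geom_histogram(binwidth = 10)
p_complete
```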
Now let’s make the same plot but much more presentable:
ggplot(data = goldeconjuice, aes(x=price)) +
geom_histogram(binwidth = 5,
fill = "orange",
colour = "black") +
scale_x_continuous(breaks= seq(70, 165, by=10)) +
labs(title="Distribution of the Price of Frozen Orange Juice 1950-2000",
x = "Price",
y = "Frequency")
## Warning: Removed 8935 rows containing non-finite outside the scale range
## (`stat_bin()`).
Hack: If you wanted a specific orange to fill the bars, try googling “colour picker”, dragging it to the exact shade of orange you’d like to fill with, and then copying and pasting the HEX code in quotes into the fill argument of the geom_histogram() function.
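For example, here is a sketch with a HEX code in place of the colour name. The code "#FF8C00" is just one orange chosen for illustration, and the built-in mpg data (variable hwy) stands in for the juice data.

```r
library(ggplot2)

# "#FF8C00" is an arbitrary orange chosen for illustration
p_hex <- ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2,
                 fill = "#FF8C00",   # HEX code goes in quotes
                 colour = "black")
p_hex
```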
If we want to visualize a categorical variable, a traditional bar plot tends to be highly effective. Again, bar plots are effective because humans can easily distinguish differences in length and bar plots communicate visually by displaying bars of different lengths.
The mpg data is included in the ggplot2 package. It is fuel economy data from 1999 to 2008 for 38 popular models of cars. Run ?mpg in your console to learn more.
ggplot(data = mpg, aes(x = class, fill = class)) + # here fill car "class" with all different colours
geom_bar() +
ggtitle("Simple plot of car classes in mpg data")
Let’s use contrast to highlight SUVs and pick ups specifically. First, we recode the class variable to a new variable called “vehicle_type” that takes on three categories: suv, pick up, and other.
Then we use this variable to “fill” the bars of our bar plot.
We can use a scales function, scale_fill_manual(), to change the colours of the bar plot if we’d like to further increase the contrast between SUVs, pick ups, and the other vehicle types. When creating a histogram of a continuous variable using geom_histogram(), we will often use a continuous colour gradient to colour the bins of our variable (using scale_fill_gradient()), or we will use scale_fill_manual() to fill based on another variable in our dataset that is categorical.
mpg <- mpg %>%
mutate(vehicle_type = case_when(class == "suv" ~ "suv",
class == "pickup" ~ "pickup",
TRUE ~ "other"))
ggplot(data = mpg, aes(x = class, fill = vehicle_type)) +
geom_bar() +
scale_fill_manual(values = c("suv" = "darkgreen",
"pickup" = "blue",
"other" = "grey70"),
name = "Vehicle Type") +
# name argument allows us to rename the legend
ggtitle("Car Classes in mpg data") +
theme(legend.position = "bottom") # move legend from default location on right side to bottom of the plot
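The gradient option mentioned above is not shown in this lesson's code, so here is a minimal sketch. It assumes the hwy variable from mpg as the continuous variable and maps the bar fill to each bin's count with after_stat(count). The two colours are arbitrary choices.

```r
library(ggplot2)

p_gradient <- ggplot(data = mpg, aes(x = hwy, fill = after_stat(count))) +
  geom_histogram(binwidth = 2, colour = "black") +
  # low and high set the two ends of the gradient; colours chosen arbitrarily
  scale_fill_gradient(low = "lightyellow", high = "darkorange",
                      name = "Count") +
  labs(x = "Highway miles per gallon", y = "Frequency")

p_gradient
```

Taller bars are drawn in a darker shade, so colour here repeats the information already carried by bar height rather than adding a new variable.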
Lollipop plots focus our attention on the ends of the bars, which is the most important aspect for interpreting differences between categories.
gapminder_continents <- gapminder %>%
filter(year == 2007) %>% # only look at 2007
count(continent) %>% # count number of countries in each continent
arrange(desc(n)) %>%
mutate(continent = fct_inorder(continent)) # ordered factor
ggplot(gapminder_continents,
aes(x = continent, y = n,
color = continent)) +
geom_pointrange(aes(ymin = 0, ymax = n)) +
labs(x = NULL, y = "Number of countries")
Like any data visualization, bar plots can be deceptive. Let’s talk briefly about some things to avoid.
In the bar plot above, there are several issues:
It violates the design principle we discussed in lecture, which is that, in general, “simple is better” when it comes to our visualizations. There are a lot of distracting background elements in this plot.
The y-axis is not properly declared in the title of the plot or using a y-axis label. This is especially problematic when the plot is meant for a general audience.
The plot compares two unequal time intervals (average annual GDP in 58 year period compared to average annual GDP in 7 year period). Is data back to 1950 necessary to begin with? How do these (unequal) time intervals support the story being told with the data? What is the benchmark for GDP growth and how has this changed over time? Is growth slower now than it was in the early postwar period?
What are we missing? The plot implicitly assumes that GDP growth can be entirely attributed to a president. While government policy influences GDP growth, it is not the only factor. We need more information.
When creating a bar plot, the y-axis should always start at zero. In other words, do not truncate the y-axis. The entire length of the bar matters - it is the only way for us to accurately interpret differences in bar lengths. Cutting off parts of the bars distorts or exaggerates differences. For instance, in the plot above, the y-axis starts at zero, but if it started at 1% then we would say the y-axis has been truncated.
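A quick calculation shows why truncation distorts. Suppose two bars represent growth of 2% and 3% (made-up numbers). With the axis starting at zero, the second bar is 1.5 times as long as the first. If the axis starts at 1%, only the part of each bar above 1% is drawn:

```r
# Made-up values for illustration
growth <- c(a = 2, b = 3)   # percent

# Ratio of bar lengths when the y-axis starts at zero
ratio_from_zero <- growth[["b"]] / growth[["a"]]

# Ratio of visible bar lengths when the y-axis starts at 1
axis_start <- 1
ratio_truncated <- (growth[["b"]] - axis_start) / (growth[["a"]] - axis_start)

ratio_from_zero   # 1.5
ratio_truncated   # 2
```

The truncated plot makes the second bar look twice as long, even though the underlying value is only 50% larger.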
We should be careful when using bar plots to show summary statistics, such as the mean or median of our variable.
Remember this plot which shows two different variables with similar means but different standard deviations?
I created this plot using “fake” data that I generated in the following way:
(You don’t need to know how to generate fake data, but I’m showing you this to be transparent. You should know how to create a density plot and understand how group_by() works. HINT: group_by() may be useful for the first data assignment. Feel free to test out the code below on your own.)
# set seed for reproducibility
set.seed(123)
# generate dataset with variable with high standard deviation
high_sd_data <- data.frame(
group = "High SD",
value = rnorm(100, mean = 50, sd = 20) # Mean = 50, High SD = 20
)
# generate dataset with variable with low standard deviation
low_sd_data <- data.frame(
group = "Low SD",
value = rnorm(100, mean = 50, sd = 5) # Mean = 50, Low SD = 5
)
# combine datasets
data_combined <- bind_rows(high_sd_data, low_sd_data)
# calculate means for each group, store separately
means <- data_combined %>%
group_by(group) %>%
# behind scenes: put each unique group (high and low sd) into its own dataset
# you can specify more than one variable to group by
summarize(mean_value = mean(value))
# behind scenes: calculate mean for each group (high sd, low sd)
# create a density plot with mean lines
ggplot(data_combined, aes(x = value, fill = group)) +
geom_density(alpha = 0.5) +
geom_vline(data = means,
aes(xintercept = mean_value, color = group),
linetype = "dashed",
linewidth = 1) +
labs(
title = "Comparison of High and Low Standard Deviation",
x = "Value",
y = "Density"
) +
theme_minimal()
If we compare this bar plot to the density plot above, we can see that we’ve lost a lot of information! While the means are similar, we know that the way the data is distributed varies drastically for each group. We should be aware of this when using a bar plot to present summary statistics - it may not always be the best approach. It isn’t always bad to use a bar plot to present summary statistics, but you should be aware of how it may hide important information.
Later on, I will show you how we can use a strip plot to present more of our data, or how we can use a box plot to summarize our data.
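One way to see what a bar plot of means would hide is to compute the standard deviation alongside the mean for each group. The sketch below generates fake data in the same way as above (the name fake_data is my own) and adds sd() to the summarize() call.

```r
library(dplyr)

set.seed(123)
fake_data <- bind_rows(
  data.frame(group = "High SD", value = rnorm(100, mean = 50, sd = 20)),
  data.frame(group = "Low SD",  value = rnorm(100, mean = 50, sd = 5))
)

# One row per group: the mean is what a bar would show,
# the standard deviation is what it would leave out
group_summary <- fake_data %>%
  group_by(group) %>%
  summarize(mean_value = mean(value),
            sd_value   = sd(value))

group_summary
```

The two means come out close to each other while the standard deviations differ by roughly a factor of four, which is exactly the information the bars alone cannot show.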
Pie charts are a common way to show proportions since they visually display “parts of a whole”.
There is no specific geom from ggplot2 to build a pie chart. Instead, we build a bar plot and use coord_polar() to make it into a pie chart.
I wonder… why is that? Why wouldn’t the creators of ggplot2 create a specific function to build a pie chart (e.g. like geom_pie or something)?
ggplot(mpg, aes(x = "", fill = class)) + # we don't specify anything for x here (go ahead and try putting class there; you'll see it's no longer a filled pie chart)
geom_bar(stat = "count", width = 1, color = "black") +
coord_polar(theta = "y") +
ggtitle("Frequency of Car Classes in mpg Data") +
#scale_fill_brewer(palette = "Set3") + # choose a specific colour palette if you'd like; commented out (not affecting plot below)
theme_void()
In the code below, we produce essentially the same pie chart as above but we explicitly calculate the proportions first. This code might be useful if you have already generated a table of proportions and want to visualize it.
Note how the stat argument is now set to “identity” instead of count and we specify y=prop. Feel free to run this code to get a sense of how it works and prove to yourself that it produces the same output as the code above.
# table of proportions
mpg_prop <- mpg %>%
count(class) %>%
mutate(prop = round((n/sum(n)), 2))
ggplot(mpg_prop, aes(x = "", y = prop, fill = class)) +
geom_bar(stat = "identity", width = 1, color = "black") +
coord_polar(theta = "y") + # changes to circular plot (pie chart) using y (prop, specified above) to determine the area/angles of pie
ggtitle("Proportion of Car Classes in mpg Data") +
theme_void()
What’s wrong with the above pie chart?
While we are good at distinguishing between lengths of bars in a bar plot, it is more difficult for us to distinguish between areas with close angles.
For instance, in the plot above, the subcompact, pickup, and mid-size categories all look pretty similar. The audience consuming this visualization is going to be hard pressed to accurately tell you which of these three vehicle types appears more often in the data.
When we look at the bar plot, however, we can easily tell you that there are more mid-size vehicles than pick ups or subcompacts, and between subcompacts and pick ups, there are slightly more subcompacts.
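The bar plot comparison can be made even easier to read by ordering the bars by frequency. Here is a minimal sketch, assuming fct_infreq() from the forcats package (loaded with the tidyverse) to reorder the class variable:

```r
library(ggplot2)
library(forcats)

# fct_infreq() orders the categories from most to least frequent
p_ordered <- ggplot(data = mpg, aes(x = fct_infreq(class))) +
  geom_bar() +
  labs(x = "Class", y = "Frequency",
       title = "Car Classes in mpg Data, Ordered by Frequency")

p_ordered

# The same ordering, shown as category names
class_order <- levels(fct_infreq(mpg$class))
class_order
```

With the bars sorted, the ranking of midsize, subcompact, and pickup can be read straight off the plot.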
Not all pie charts are bad! 🥧
ggplot(mpg, aes(x = "", fill = vehicle_type)) +
geom_bar(stat = "count", width = 1, color = "black") +
coord_polar(theta = "y") + # changes to pie chart
ggtitle("Vehicles in mpg Data") +
theme_void()
This is a much better pie chart because we have few, distinguishable categories, and the differences between the frequencies of the categories are large.
Strip plots show more of our data and allow us to look at how the distribution of a variable varies across different categories. For instance, in our leadership data we might wonder how time in power varies by gender. Or we might ask how GDP varies by continent.
Let’s look at an example using the leaders dataset we loaded above.
# wrangle the leaders data we loaded earlier
# subtracting one Date from another is base R behaviour;
# it returns a "difftime" object
df_leaders <- leaders %>%
drop_na(enddate, startdate) %>%
mutate(startdate = as.Date(startdate),
enddate = as.Date(enddate),
time_pwr = enddate - startdate,
days_pwr = as.numeric(time_pwr),
# time_pwr has class "difftime" (a base R class);
# converted to numeric for simplicity's sake
gender = factor(gender))
check <- df_leaders %>%
filter(days_pwr >= 0 & days_pwr <= 10)
head(check)
## obsid leadid ccode idacr leader
## 1 CUB-1934-1 81df0b74-1e42-11e4-b4cd-db5882bf8def 40 CUB Hevia
## 2 HAI-1879-1 81df6d1b-1e42-11e4-b4cd-db5882bf8def 41 HAI Herrise
## 3 HAI-1957-2 81e03076-1e42-11e4-b4cd-db5882bf8def 41 HAI Cantave
## 4 HAI-1957-4 81e03076-1e42-11e4-b4cd-db5882bf8def 41 HAI Cantave
## 5 HAI-1990-1 81e0921c-1e42-11e4-b4cd-db5882bf8def 41 HAI Abraham
## 6 DOM-1876-1 81e0c2f2-1e42-11e4-b4cd-db5882bf8def 42 DOM Villanueva
## startdate eindate enddate eoutdate entry exit
## 1 1934-01-15 1934-01-15 1934-01-18 1934-01-18 Irregular Irregular
## 2 1879-07-17 1879-07-17 1879-07-26 1879-07-26 Irregular Regular
## 3 1957-04-02 1957-04-02 1957-04-06 1957-04-06 Irregular Regular
## 4 1957-05-20 1957-05-20 1957-05-26 1957-05-26 Irregular Irregular
## 5 1990-03-10 1990-03-10 1990-03-13 1990-03-13 Regular Regular
## 6 1876-02-24 1876-02-24 1876-03-03 1876-03-03 Irregular Irregular
## exitcode prevtimesinoffice
## 1 Removed in Military Power Struggle Short of Coup 0
## 2 Regular 0
## 3 Regular 0
## 4 Removed by Military, without Foreign Support 1
## 5 Regular 0
## 6 Removed by Military, without Foreign Support 0
## posttenurefate gender yrborn yrdied borndate ebirthdate deathdate
## 1 OK M 1900 1964 NA <NA> NA
## 2 Imprisonment M -999 -999 NA <NA> NA
## 3 OK M 1910 1967 NA <NA> NA
## 4 Exile M 1910 1967 1910-07-04 1910-07-04 NA
## 5 OK M 1940 -777 NA <NA> NA
## 6 Exile M -999 1920 NA <NA> NA
## edeathdate
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## dbpediauri
## 1 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Carlos-5FHevia&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=5-6e0ciWg-SlbSXQhsIrBVVl7sydg_gVQjoic3YQiQE&e=
## 2 NA
## 3 NA
## 4 NA
## 5 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_H-25C3-25A9rard-5FAbraham&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=iZkmgjNtCkklKJJvfoEJIpmIZtJpAQpjcuitCwvTeR4&e=
## 6 NA
## numentry numexit numexitcode numposttenurefate fties ftcur time_pwr days_pwr
## 1 1 3 16 0 NA NA 3 days 3
## 2 1 1 0 2 NA NA 9 days 9
## 3 1 1 0 0 NA NA 4 days 4
## 4 1 3 6 1 NA NA 6 days 6
## 5 0 1 0 0 NA NA 3 days 3
## 6 1 3 6 1 NA NA 8 days 8
df_leaders2 <- df_leaders %>%
filter(days_pwr < 10950)
There are 112 leaders who were in power for 0-10 days. This is odd, but it is a feature of the data. We’re going to get rid of anyone who was leader for 30 years or more (10,950 days or more).
Let’s use an anti-join to see which leaders served more than 10,950 days.
head(anti_join(df_leaders, df_leaders2))
## Joining with `by = join_by(obsid, leadid, ccode, idacr, leader, startdate,
## eindate, enddate, eoutdate, entry, exit, exitcode, prevtimesinoffice,
## posttenurefate, gender, yrborn, yrdied, borndate, ebirthdate, deathdate,
## edeathdate, dbpediauri, numentry, numexit, numexitcode, numposttenurefate,
## fties, ftcur, time_pwr, days_pwr)`
## obsid leadid ccode idacr leader
## 1 CUB-1959 81df6d18-1e42-11e4-b4cd-db5882bf8def 40 CUB Castro
## 2 DOM-1930-2 81e1863f-1e42-11e4-b4cd-db5882bf8def 42 DOM Rafel Trujillo
## 3 MEX-1876 81e39f71-1e42-11e4-b4cd-db5882bf8def 70 MEX Diaz
## 4 BRA-1840 81fab41b-1e42-11e4-b4cd-db5882bf8def 140 BRA Pedro II
## 5 PAR-1954-2 81ffa9b9-1e42-11e4-b4cd-db5882bf8def 150 PAR Stroessner
## 6 SPN-1939-2 8211f99e-1e42-11e4-b4cd-db5882bf8def 230 SPN Franco
## startdate eindate enddate eoutdate entry
## 1 1959-01-02 1959-01-02 2008-02-24 2008-02-24 Irregular
## 2 1930-08-16 1930-08-16 1961-05-30 1961-05-30 Regular
## 3 1876-11-23 1876-11-23 1911-05-25 1911-05-25 Irregular
## 4 1840-07-23 1840-07-23 1889-11-15 1889-11-15 Regular
## 5 1954-07-11 1954-07-11 1989-02-03 1989-02-03 Irregular
## 6 1939-04-01 1939-04-01 1975-10-30 1975-10-30 Irregular
## exit exitcode
## 1 Retired Due to Ill Health Regular
## 2 Irregular Removed by Military, without Foreign Support
## 3 Irregular Unknown
## 4 Irregular Unknown
## 5 Irregular Removed by Military, without Foreign Support
## 6 Natural Death Regular
## prevtimesinoffice posttenurefate
## 1 0 OK
## 2 0 Death
## 3 0 Exile
## 4 0 Exile
## 5 0 Exile
## 6 0 Missing: Natural Death within Six Months of Losing Office
## gender yrborn yrdied borndate ebirthdate deathdate edeathdate
## 1 M 1927 -777 1927-08-13 1927-08-13 NA <NA>
## 2 M 1891 1961 1891-10-24 1891-10-24 1961-05-30 1961-05-30
## 3 M 1830 1915 1830-09-15 1830-09-15 1915-07-02 1915-07-02
## 4 M 1825 1891 1825-12-02 1825-12-02 1891-12-05 1891-12-05
## 5 M 1912 2006 1912-11-03 1912-11-03 NA <NA>
## 6 M 1892 1975 1892-12-04 1892-12-04 1975-11-20 1975-11-20
## dbpediauri
## 1 NA
## 2 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Rafael-5FTrujillo&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=pQrJNBAuvPHTuw_XYf0hy4QgVrDQQKTdyDToro0iPcc&e=
## 3 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Porfirio-5FD-25C3-25ADaz&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=aimBPbn3rxCKiOG9llJzEUriz0ix8JGH5ok_TrWbRdk&e=
## 4 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Pedro-5FII-5Fof-5FBrazil&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=EUgwYcTAAIGG_KHshdVPvWQYtkJt-TfRZ0naBjtF0dQ&e=
## 5 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Alfredo-5FStroessner&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=W8pZFUtPoxWgmD3QLuDII9W2oqDbOJ9wVDixDzT8v9c&e=
## 6 https://urldefense.proofpoint.com/v2/url?u=http-3A__dbpedia.org_resource_Francisco-5FFranco&d=BQID-g&c=kbmfwr1Yojg42sGEpaQh5ofMHBeTl9EI2eaqQZhHbOU&r=ZliGVSRwfirWoETOrKCh2RSnoygzVPWEk95Me9L-Kwo&m=EaKyFx9mpkhPdFxsxzvW_fiM3jbYwM3xYLjVbSFoDAg&s=h4XbplSQ52mUvj0HoNiaYQ1LJdNyIlFlgRB-LylziWc&e=
## numentry numexit numexitcode numposttenurefate
## 1 1 2.1 0 0
## 2 0 3.0 6 3
## 3 1 3.0 -999 1
## 4 0 3.0 -999 1
## 5 1 3.0 6 1
## 6 1 2.0 0 -777
## fties
## 1 Brother of Raul Castro%824ff801-1e42-11e4-b4cd-db5882bf8def
## 2 NA
## 3 NA
## 4 NA
## 5 Relative-in-law of Rodriguez Pedotti%81ffa9ba-1e42-11e4-b4cd-db5882bf8def
## 6 NA
## ftcur time_pwr days_pwr
## 1 0 17950 days 17950
## 2 NA 11245 days 11245
## 3 NA 12600 days 12600
## 4 NA 18012 days 18012
## 5 0 12626 days 12626
## 6 NA 13361 days 13361
For instance, we see that Fidel Castro was the leader of Cuba for just over 49 years, from 1959 to 2008 (17,950 days).
position_jitter() below adds a small amount of random noise to the points’ positions to prevent overplotting, where multiple points overlap. In other words, it just spreads the points out so that we can get a better sense of how many data points there are (and they’re not totally overlapping).
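The days_pwr value can be checked by hand with base R date arithmetic. The sketch below uses the start and end dates shown for Castro in the output above.

```r
# Dates taken from the Castro row in the output above
start <- as.Date("1959-01-02")
end   <- as.Date("2008-02-24")

# Subtracting two Date objects returns a difftime object
time_in_power <- end - start
class(time_in_power)   # "difftime"

# Convert to a plain number of days, then to approximate years
days_in_power <- as.numeric(time_in_power, units = "days")
days_in_power              # 17950
days_in_power / 365.25     # a little over 49 years
```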
We specify height = 0 so that it doesn’t move the points on the y-axis (up or down), only along the x-axis (side to side). width=0.2 specifies how much to jitter on x-axis. Again, you could customize the plot below by specifying a title, change labels of the y and x-axes, change legend title and labels, specify a minimal theme to remove the grey background, etc.!
ggplot(df_leaders2, aes(x = gender,
y=days_pwr,
color=gender)) +
geom_point(size = 0.5,
position = position_jitter(height = 0, width=0.2))
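Because the jitter is random, the points land in slightly different places each time the plot is drawn. If a reproducible plot is needed, position_jitter() accepts a seed argument. Here is a minimal sketch on made-up data (toy_leaders is hypothetical, and 42 is an arbitrary seed):

```r
library(ggplot2)

# Hypothetical data: many identical observations in each category
toy_leaders <- data.frame(
  gender   = rep(c("F", "M"), each = 50),
  days_pwr = rep(c(500, 1500), each = 50)
)

# Same seed, so both plots place the points identically
p1 <- ggplot(toy_leaders, aes(x = gender, y = days_pwr)) +
  geom_point(position = position_jitter(height = 0, width = 0.2, seed = 42))
p2 <- ggplot(toy_leaders, aes(x = gender, y = days_pwr)) +
  geom_point(position = position_jitter(height = 0, width = 0.2, seed = 42))

# Jittered x positions match across the two plots
x_match <- isTRUE(all.equal(as.numeric(layer_data(p1)$x),
                            as.numeric(layer_data(p2)$x)))

# height = 0 leaves the y values untouched; only x is moved
y_unchanged <- all(as.numeric(layer_data(p1)$y) == toy_leaders$days_pwr)

x_match
y_unchanged
```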
A violin plot is similar to a strip plot but uses density to define the shape of the distribution. In areas where it is fatter, there are more data points.
ggplot(df_leaders2, aes(x = gender, y = days_pwr)) +
geom_violin(fill = "lightblue", color = "black") +
ggtitle("Violin Plot: Days in Power by Gender") +
labs(x = "Gender", y = "Days in Power")
A box plot summarizes the distribution of a numeric variable across one or more groups, making it a convenient tool for quickly grasping differences between those groups.
However, a potential drawback of the box plot is that we can’t discern the underlying distribution of individual data points within each group or the total number of observations within each group. If you need to know the shape of the distribution or how the number of observations varies within each group, it can be helpful to overlay jittered points on top of the box plots. I provide an example of how to do this below.
Source: https://leansigmacorporation.com/box-plot-with-minitab/
The box itself represents the interquartile range and tells us how spread out the middle 50% of the data is. A taller box indicates more variability, while a shorter box suggests the data tends to cluster together. The bottom of the box represents the 25th percentile (value below which 25% of the data lies). The top of the box is the 75th percentile (value below which 75% of the data lies).
The line inside of the box is the median (measure of central tendency).
Whiskers show us the range of the data excluding potential outliers.
Individual points beyond the whiskers indicate outliers, which are unusually high or low values in the data.
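These pieces can be computed directly. The sketch below uses a small made-up vector and base R to find the quartiles, the interquartile range, and the cut-offs beyond which points are drawn as outliers. It assumes the 1.5 times IQR rule, which is the default in geom_boxplot(): whiskers reach the most extreme values that lie within 1.5 times the IQR of the box.

```r
# Made-up values for illustration; 40 is unusually high
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 40)

q1  <- quantile(x, 0.25, names = FALSE)  # bottom of the box
med <- median(x)                         # line inside the box
q3  <- quantile(x, 0.75, names = FALSE)  # top of the box
iqr <- q3 - q1                           # height of the box

# Cut-offs for outliers under the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme values inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything outside the fences is plotted as an individual point
outliers <- x[x < lower_fence | x > upper_fence]

c(q1 = q1, median = med, q3 = q3, iqr = iqr)
c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Here the value 40 falls above the upper cut-off, so it would appear as a single point beyond the top whisker, which itself stops at 12.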
ggplot(df_leaders2,
# start with our smaller leaders dataset
aes(x = gender,
y = days_pwr,
color = gender)) +
geom_boxplot(width = 0.5) +
# create boxplot
geom_point(size = 0.5, alpha=0.25,
position = position_jitter(height = 0, width=0.2)) +
# size controls size of points, alpha controls transparency, jitter points
scale_y_continuous(breaks= seq(0, 11000, by=1000)) +
# change how ticks appear on y-axis
labs(title = "Leadership tenure by sex",
x="sex", y="days in power") # adds labels
Various ggplot2 functions and operators:
+ : to layer the ggplot2 components
geom_histogram(): histograms
geom_bar(): bar plots
geom_pointrange(): lollipop plot (for our purposes; has other uses as well)
geom_density(): density plot
coord_polar(): used to create pie chart
geom_point(): strip plot (we will also discuss how to use it to create scatterplots in a later lecture)
geom_boxplot(): boxplot
geom_violin(): violin plot