Email: siregarbakti@gmail.com
RPubs: https://rpubs.com/datazerotohero/
Github: https://github.com/datazerotohero/
Data visualization is the technique of translating information from data into a visual context, such as charts, graphs, and maps. Visualization makes both small and big data easier for the human brain to understand, and it makes it easier to detect patterns, trends, and outliers in groups of data.
R is an amazing platform for data analysis, capable of creating almost any type of graph. This book helps you create the most popular visualizations, from quick and dirty plots to publication-ready graphs. Here we will learn how to visualize univariate, bivariate, and multivariate data. We move from the very basic to the advanced, so please follow the instructions step by step.
A univariate plot is usually used to show the distribution of data from a single variable. The variable can be categorical (e.g., gender, race, country, city) or quantitative (e.g., age, weight, inflation, rate).
The distribution of a single categorical variable is typically plotted with a bar-chart, pie-chart, or (less commonly) a treemap.
Here is an example showing category frequencies in the Marriage dataset, which I took from the mosaicData package. We use a bar chart to display the distribution of wedding participants by zodiac sign.
library(ggplot2) # for visualization
#setwd("C:/Users/Bakti/Desktop/") # don't forget set working directory
Marriage <- read.csv("Data/Marriage.csv") # load the data from your PC
ggplot(Marriage, aes(x = zodiacs)) + # plot the distribution of `Zodiacs`
geom_bar(fill = "cornflowerblue",color= "black")+ # you can modify colors
theme_minimal() + # use a minimal theme
labs(x = "Zodiacs", # you can modify labels and title plot
y = "Frequency",
title = "Marriage Participants by Zodiacs")
Categorical Bar Chart 1
Bars can represent percents rather than counts. For bar charts, the code aes(x = zodiacs) is actually a shortcut for aes(x = zodiacs, y = ..count..), where ..count.. is a special variable representing the frequency within each category. You can use this to calculate percentages by specifying the y variable explicitly.
In R, colors can be specified by name (e.g., col = "red") or by a hexadecimal RGB triplet (e.g., col = "#FFCC00"). You can also use other color systems, such as the palettes in the RColorBrewer package. The grDevices package (which you probably already have loaded) also contains a number of palettes; type ?rainbow in your R console to see them. Let's consider the following graph:
library(ggplot2) # for visualization
ggplot(Marriage,
aes(x = zodiacs,
y = ..count.. / sum(..count..))) +
geom_bar(fill = rainbow(12), color= "azure4") +
theme_minimal() + # use a minimal theme
labs(x = "Zodiacs",
y = "Percent",
title = "Marriage Participants in Percent") +
scale_y_continuous(labels = scales::percent) # add % symbols to the y-axis labels
Categorical Bar Chart 2
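The ..count.. notation used above still works, but ggplot2 3.4.0 and later report it as deprecated in favor of after_stat(). The sketch below shows the same percent calculation in the newer notation. It uses the mpg data that ships with ggplot2 instead of the Marriage file, purely so that it runs on its own; the object names p and built are illustrative.

```r
library(ggplot2) # for visualization

# same idea as above: y is the share of observations in each category
p <- ggplot(mpg,
            aes(x = class,
                y = after_stat(count / sum(count)))) +
  geom_bar(fill = "cornflowerblue", color = "black") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "Class",
       y = "Percent",
       title = "Cars by Class in Percent")

# inspect the values ggplot2 computed for the bars
built <- layer_data(p)
sum(built$y) # the shares add up to 1

print(p) # draw the plot
```

If your installed ggplot2 is older than 3.3.0, after_stat() is not available and the ..count.. form is the one to use.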
It is often helpful to sort the bars by frequency. In the code below, the frequencies are calculated explicitly. Then the reorder function is used to sort the categories by the frequency. The option stat="identity" tells the plotting function not to calculate counts, because they are supplied directly.
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
plotdata <- Marriage %>% # load dataset
count(zodiacs) # count of participants in each `zodiacs`
# plot the bars in ascending order
ggplot(plotdata,
aes(x = reorder(zodiacs, n),
y = n)) +
geom_bar(stat = "identity",
fill = rainbow(12),
color= "azure4") +
theme_minimal() + # use a minimal theme
labs(x = "Zodiacs",
y = "Frequency",
title = "Sorting Categories")
Categorical Bar Chart 3
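To see what reorder does before it reaches the plot, it can help to look at the factor levels it returns. The sketch below uses a small made-up data frame (toy, with hypothetical counts) rather than the Marriage data.

```r
# hypothetical counts, for illustration only
toy <- data.frame(zodiacs = c("Aries", "Leo", "Virgo"),
                  n = c(5, 12, 8))

# levels sorted by increasing n: this is the left-to-right bar order
ascending <- levels(reorder(toy$zodiacs, toy$n))
ascending

# negating n reverses the order (largest first)
descending <- levels(reorder(toy$zodiacs, -toy$n))
descending
```

In recent versions of ggplot2, geom_col() is a shorthand for geom_bar(stat = "identity"), so either form can be used when the counts are supplied directly.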
If you want to label each bar with its numerical value, check the following code:
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
plotdata <- Marriage %>%
count(zodiacs) %>%
mutate(pct = n / sum(n),
pctlabel = paste0(round(pct*100), "%"))
# plot the bars as percentages, in descending order with bar labels
ggplot(plotdata,
aes(x = reorder(zodiacs, -pct),
y = pct)) +
geom_bar(stat = "identity",
fill = rainbow(12),
color = "azure4") +
geom_text(aes(label = pctlabel),
vjust = -0.25) +
theme_minimal() + # use a minimal theme
scale_y_continuous(labels = percent) +
labs(x = "Zodiacs",
y = "Percent",
title = "Labeling Bars")
Categorical Bar Chart 4
Sometimes category labels overlap, which is annoying. In that case, you can rotate the axis labels.
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
# plot the bars as percentages,
# in descending order with bar labels
ggplot(plotdata,
aes(x = reorder(zodiacs, -pct),
y = pct)) +
geom_bar(stat = "identity",
fill = rainbow(12),
color = "azure4") +
geom_text(aes(label = pctlabel),
vjust = -0.25) +
scale_y_continuous(labels = percent) +
theme_minimal() + # use a minimal theme
labs(x = "Zodiacs",
y = "Percent",
title = "Overlapping Labels")+
theme(axis.text.x = element_text(angle = 25, hjust = 1))
Categorical Bar Chart 5
Alternatively, you can handle this situation by flipping the x and y axes.
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
# plot the bars as percentages,
# in descending order with bar labels
ggplot(plotdata,
aes(x = reorder(zodiacs, -pct),
y = pct)) +
geom_bar(stat = "identity",
fill = rainbow(12),
color = "azure4") +
geom_text(aes(label = pctlabel),
hjust = -0.10) +
scale_y_continuous(labels = percent) +
theme_minimal() + # use a minimal theme
labs(x = "Zodiacs",
y = "Percent",
title = "Overlapping Labels")+
coord_flip()
Categorical Bar Chart 6
Pie charts are controversial in statistics. If your goal is to compare the frequency of categories, you are better off with bar charts (humans are better at judging the length of bars than the area of pie slices). If your goal is to compare each category with the whole (e.g., what portion of participants are Hispanic compared to all participants), and the number of categories is small, then pie charts may work for you. It takes a bit more code to make an attractive pie chart in R.
Here is an example of a basic ggplot2 pie chart:
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
# Data preparation
plotdata <- Marriage %>%
count(race) %>%
arrange(desc(race)) %>%
mutate(prop = round(n*100/sum(n), 1),
lab.ypos = cumsum(prop) - 0.5*prop)
# Create Pie chart
mycols <- c("#0073C2FF", "#EFC000FF", "#868686FF", "#CD534CFF")
ggplot(plotdata, aes(x = "", y = prop, fill = race)) +
geom_bar(width = 1, stat = "identity", color = "white") +
coord_polar("y", start = 0)+
geom_text(aes(y = lab.ypos, label = prop), color = "white")+
scale_fill_manual(values = mycols) +
theme_void()+
labs(title = "Marriage Participants by Race")
Categorical Pie Chart
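The label positions in lab.ypos are the midpoints of the slices along the stacked axis. A small worked example with made-up percentages (not the Marriage data) shows the arithmetic:

```r
# hypothetical slice sizes in percent, listed in stacking order
prop <- c(40, 30, 20, 10)

# upper edge of each slice minus half its height = slice midpoint
lab.ypos <- cumsum(prop) - 0.5 * prop

data.frame(prop, upper = cumsum(prop), lab.ypos)
```

The first slice spans 0 to 40, so its label sits at 20; the second spans 40 to 70, so its label sits at 55, and so on. The arrange(desc(race)) step in the code above appears to be there so that the row order matches the order in which ggplot2 stacks the bars.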
A donut chart is just a simple pie chart with a hole inside. The only difference from the pie chart code is that we set x = 2 and xlim = c(0.5, 2.5) to create the hole inside the pie chart. Additionally, the width argument in geom_bar() is no longer needed.
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
# create Donut chart
ggplot(plotdata, aes(x = 2, y = prop, fill = race)) +
geom_bar(stat = "identity", color = "white") +
coord_polar(theta = "y", start = 0)+
geom_text(aes(y = lab.ypos, label = prop), color = "white")+
scale_fill_manual(values = mycols) +
theme_void()+
xlim(0.5, 2.5)+
labs(title = "Marriage Participants by Race")
Categorical Donut Chart 1
Now let’s get fancy and add labels, while removing the legend.
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
# add percent label
plotdata$percent <- paste0(plotdata$race, "\n",
round(plotdata$prop), "%")
# create Donut chart in percent
ggplot(plotdata, aes(x = 2, y = prop, fill = race)) +
geom_bar(stat = "identity", color = "white") +
coord_polar(theta = "y", start = 0)+
geom_text(aes(y = lab.ypos, label = percent), color = "white")+
scale_fill_manual(values = mycols) +
theme_void()+
theme(legend.position = "none")+ # remove the legend (the labels now carry the category names)
xlim(0.5, 2.5)+
labs(title = "Marriage Participants by Race")
Categorical Donut Chart 2
An alternative to a pie chart is a treemap. Unlike pie charts, it can handle categorical variables that have many levels.
library(ggplot2) # for visualization
library(treemapify) # for visualization
library(scales) # automatically determining breaks/labels
plotdata <- Marriage %>%
count(officialTitle)
ggplot(plotdata,
aes(fill = officialTitle,
area = n)) +
geom_treemap() +
labs(title = "Marriage Participants by Officiate")
Categorical Tree Map 1
Here is a more useful version with labels.
ggplot(plotdata,
aes(fill = officialTitle,
area = n,
label = officialTitle)) +
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre") +
labs(title = "Marriage Participants by Officiate") +
theme(legend.position = "none")
Categorical Tree Map 2
The distribution of a single quantitative variable is typically plotted with a histogram, kernel density plot, or dot plot.
Using the Marriage data set, let’s plot the ages of the wedding participants.
library(ggplot2) # for visualization
ggplot(Marriage, aes(x = age)) +
geom_histogram(fill = "cornflowerblue",
color = "white",bins = 20) +
theme_minimal() + # use a minimal theme
labs(title="Marriage Participants by age (Basic)",
x = "Age")
Quantitative Histogram
Most participants appear to be in their early 20s, with another group in their 40s and a much smaller group in their late sixties and early seventies. This would be a multimodal distribution. Histogram colors can be modified using two options: fill (the fill color of the bars) and color (the border color around the bars).
Alternatively, instead of the number of bins, you can specify binwidth, the width of the bins represented by the bars.
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
ggplot(Marriage,
aes(x = age,
y= ..count.. / sum(..count..))) +
geom_histogram(fill = "cornflowerblue",
color = "white",
binwidth = 5) +
theme_minimal() + # use a minimal theme
labs(title="Marriage Participants by age (Alternative Bins and bandwidths)",
y = "Percent",
x = "Age") +
scale_y_continuous(labels = percent)
Quantitative Histogram 2
As with bar charts, the y-axis can represent counts or percent of the total.
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
ggplot(Marriage,
aes(x = age,
y= ..count.. / sum(..count..))) +
geom_histogram(fill = "cornflowerblue",
color = "white",
binwidth = 5) +
theme_minimal() + # use a minimal theme
labs(title="Marriage Participants by age (Percent)",
y = "Percent",
x = "Age") +
scale_y_continuous(labels = percent)
Quantitative Histogram 3
An alternative to a histogram is the kernel density plot. Technically, kernel density estimation is a non-parametric method for estimating the probability density function of a continuous random variable. (What??) Basically, we are trying to draw a smoothed histogram, where the area under the curve equals one.
library(ggplot2) # for visualization
ggplot(Marriage, aes(x = age)) +
geom_density(fill = "indianred3") +
theme_minimal() + # use a minimal theme
labs(title = "Marriage Participants by age")
Quantitative Kernel Density Plot
The graph shows the distribution of ages. For example, the proportion of cases between 20 and 40 years old would be represented by the area under the curve between 20 and 40 on the x-axis. As with previous charts, we can also use fill and color to specify the fill and border colors.
The degree of smoothness is controlled by the bandwidth parameter bw. To find the default value for a particular variable, use the bw.nrd0 function. Values that are larger will result in more smoothing, while values that are smaller will produce less smoothing.
bw.nrd0(Marriage$age) # default bandwidth for the age variable
## [1] 5.181946
ggplot(Marriage, aes(x = age)) +
geom_density(fill = "deepskyblue",
bw = 1) +
theme_minimal() + # use a minimal theme
labs(title = "Participants by age",
subtitle = "bandwidth = 1")
Smoothing Parameter Plot
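The effect of bw can also be checked numerically with base R's density() function. The sketch below uses the built-in faithful data as a stand-in for the Marriage ages (an assumption made so the example runs on its own) and compares a bandwidth four times smaller and four times larger than the bw.nrd0 default.

```r
x <- faithful$eruptions # built-in data, stand-in for Marriage$age

bw_default <- bw.nrd0(x) # the default bandwidth for this variable

d_rough  <- density(x, bw = bw_default / 4) # less smoothing
d_smooth <- density(x, bw = bw_default * 4) # more smoothing

# more smoothing flattens the peaks of the curve
c(rough = max(d_rough$y), smooth = max(d_smooth$y))

# the area under a density curve is (approximately) one
area <- sum(d_smooth$y) * diff(d_smooth$x[1:2])
area
```

The area is computed with a simple rectangle rule on the grid that density() returns, so it comes out close to one rather than exactly one.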
Kernel density plots allow you to easily see which scores are most frequent and which are relatively rare. However it can be difficult to explain the meaning of the y-axis to a non-statistician. (But it will make you look really smart at parties!)
Another alternative to the histogram is the dot chart. Again, the quantitative variable is divided into bins, but rather than summary bars, each observation is represented by a dot. By default, the width of a dot corresponds to the bin width, and dots are stacked, with each dot representing one observation. This works best when the number of observations is small (say, fewer than 150). The fill and color options can be used to specify the fill and border color of each dot, respectively.
library(ggplot2) # for visualization
ggplot(Marriage, aes(x = age)) +
geom_dotplot(fill = "gold",
color = "azure4",
binwidth = 2) +
theme_minimal() + # use a minimal theme
labs(title = "Participants by age",
y = "Proportion",
x = "Age")
Dot Chart
There are many more options available; see the geom_dotplot help page (?geom_dotplot) for details and examples.
Bivariate graphs display the relationship between two variables. The type of graph will depend on the measurement level of the variables (categorical or quantitative).
Let's plot the relationship between automobile class and drive type (front-wheel, rear-wheel, or 4-wheel drive) for the automobiles in the Fuel economy dataset.
library(ggplot2) # for visualization
mpg$drv<-factor(mpg$drv,
levels = c("f", "r", "4"),
labels = c("front-wheel", "rear-wheel", "4-wheel"))
# stacked bar chart
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = "fill") +
theme_minimal() + # use a minimal theme
labs(y = "Proportion")
Stacked Bar Chart
Grouped bar charts place bars for the second categorical variable side by side. To create a grouped bar plot, use the position = "dodge" option. By default, zero-count bars are dropped and the remaining bars are made wider; the position_dodge(preserve = "single") option used below keeps all bars the same width.
library(ggplot2) # for visualization
ggplot(mpg, aes(x = class, fill = drv)) +
theme_minimal() + # use a minimal theme
geom_bar(position = position_dodge(preserve = "single"))
Grouped Bar Chart
A segmented bar plot is a stacked bar plot where each bar represents 100 percent. You can create a segmented bar chart using the position = "fill" option. This type of plot is particularly useful if the goal is to compare the percentage of a category in one variable across each level of another variable. For example, the proportion of front-wheel drive cars goes up as you move from compact, to midsize, to minivan.
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
# create a summary dataset (data manipulation)
plotdata <- mpg %>%
group_by(class, drv) %>%
dplyr::summarize(n = n()) %>%
mutate(pct = n/sum(n),
lbl = scales::percent(pct))
# create segmented bar chart
# adding labels to each segment
ggplot(plotdata,
aes(x = factor(class),
y = pct,
fill = factor(drv))) +
geom_bar(stat = "identity",
position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
label = percent)+
geom_text(aes(label = lbl),
size = 3,
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() + # use a minimal theme
labs(y = "Percent",
fill = "Drive Train",
x = "Class",
title = "Automobile Drive by Class")
Segmented Bar Chart
Notes: You can use additional options to improve color and labeling. In the graph above:
* factor modifies the order of the categories for the class variable and both the order and the labels for the drive variable
* scale_y_continuous modifies the y-axis tick mark labels
* labs provides a title and changes the labels for the x and y axes and the legend
* scale_fill_brewer changes the fill color scheme
* theme_minimal removes the grey background and changes the grid color
The other functions are discussed more fully in the chapter on advanced data visualization.
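The summary step above relies on a detail of dplyr grouping: after summarize(), the data are still grouped by class, so sum(n) in mutate() is the total within each class rather than the grand total. The sketch below makes the grouping explicit and checks the result, using the mpg data from ggplot2 (the names pct_by_class and totals are illustrative).

```r
library(dplyr) # for data manipulation

pct_by_class <- ggplot2::mpg %>%
  count(class, drv) %>%        # one row per class/drive combination
  group_by(class) %>%          # percentages are computed within each class
  mutate(pct = n / sum(n)) %>%
  ungroup()

# the percentages in every class add up to 1 (i.e., 100%)
totals <- pct_by_class %>%
  group_by(class) %>%
  summarize(total = sum(pct))
totals
```

Writing the group_by(class) step explicitly is a design choice: it gives the same numbers as the shorter version, but the code no longer depends on which grouping level summarize() happens to drop.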
Mosaic charts can display the relationship between categorical variables using rectangles whose areas represent the proportion of cases for any given combination of levels. The color of the tiles can also indicate the degree of relationship among the variables.
Although mosaic charts can be created with ggplot2 using the ggmosaic package, I recommend using the vcd package instead. Although it won’t create ggplot2 graphs, the package provides a more comprehensive approach to visualizing categorical data.
People are fascinated with the Titanic (or is it with Leo?). In the Titanic disaster, what role did sex and class play in survival? We can visualize the relationship between these three categorical variables using the code below.
## Sex Male Female
## Survived Class
## No 1st 118 4
## 2nd 154 13
## 3rd 422 106
## Crew 670 3
## Yes 1st 62 141
## 2nd 25 93
## 3rd 88 90
## Crew 192 20
## Loading required package: grid
Basic mosaic plot
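The code that produced the table and the basic mosaic plot is not shown above. One way to obtain the same counts is from the Titanic table that ships with base R; this is an assumption about the data source, not necessarily what was used here. The sketch builds a three-way table named tbl and, if the vcd package is installed, draws the basic mosaic plot.

```r
# built-in 4-way table: Class x Sex x Age x Survived
# collapse over Age, keeping Survived, Class, and Sex (in that order)
tbl <- margin.table(Titanic, c(4, 1, 2))

# flat view: rows = Survived and Class, columns = Sex
ftable(tbl)

# draw the basic mosaic plot only if vcd is available
if (requireNamespace("vcd", quietly = TRUE)) {
  vcd::mosaic(tbl, main = "Titanic data")
}
```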
The size of the tile is proportional to the percentage of cases in that combination of levels. Clearly, more passengers perished than survived. Those that perished were primarily 3rd class male passengers and male crew (the largest group).
If we assume that these three variables are independent, we can examine the residuals from the model and shade the tiles to match. In the graph below, dark blue represents more cases than expected given independence. Dark red represents fewer cases than expected if independence holds.
mosaic(tbl,
shade = TRUE,
legend = TRUE,
labeling_args = list(set_varnames = c(Sex = "Gender",
Survived = "Survived",
Class = "Passenger Class")),
set_labels = list(Survived = c("No", "Yes"),
Class = c("1st", "2nd", "3rd", "Crew"),
Sex = c("M", "F")), # same order as in the table above: Male, Female
main = "Titanic data")
Mosaic plot with shading
We can see that if class, gender, and survival are independent, we are seeing many more male crew perishing, and 1st, 2nd and 3rd class females surviving than would be expected. Conversely, far fewer 1st class passengers (both male and female) died than would be expected by chance. Thus the assumption of independence is rejected. (Spoiler alert: Leo doesn’t make it.)
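The shading is based on how far each observed count is from the count expected under independence. The following sketch computes those expected counts and the Pearson residuals by hand, assuming the built-in Titanic table as the data source; it is meant to illustrate the idea, not to reproduce the internals of vcd exactly.

```r
tbl <- margin.table(Titanic, c(4, 1, 2)) # Survived x Class x Sex
N <- sum(tbl)

# marginal proportions of each variable
p_surv  <- margin.table(tbl, 1) / N
p_class <- margin.table(tbl, 2) / N
p_sex   <- margin.table(tbl, 3) / N

# expected count under mutual independence: N * p_i * p_j * p_k
expected <- N * outer(outer(p_surv, p_class), p_sex)
dimnames(expected) <- dimnames(tbl)

# Pearson residuals: positive = more cases than expected
pearson <- (tbl - expected) / sqrt(expected)
round(pearson["No", "Crew", "Male"], 1)
round(pearson["Yes", "1st", "Female"], 1)
round(pearson["No", "1st", "Female"], 1)
```

The first two residuals are positive (more male crew perished and more 1st class women survived than independence predicts), while the third is negative, which matches the reading of the shaded plot.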
The relationship between two quantitative variables is typically displayed using scatterplots and line graphs.
A scatterplot is made to study the relationship between two variables. Thus it is often accompanied by a correlation coefficient, which usually measures the strength of the linear relationship. However, other types of relationship can be detected using scatterplots, and a common task is to fit a model explaining Y as a function of X. Here are a few patterns you can detect with a scatterplot.
library(dplyr) # for data manipulation (%>% and mutate)
library(ggplot2) # data visualization
library(hrbrthemes) # for the `theme_ipsum()` and legend
# Create data
d1 <- data.frame(x=seq(1,100),
y=rnorm(100),
name="No trend")
d2 <- d1 %>%
mutate(y=x*10 + rnorm(100,sd=60)) %>%
mutate(name="Linear relationship")
d3 <- d1 %>%
mutate(y=x^2 + rnorm(100,sd=140)) %>%
mutate(name="Square")
d4 <- data.frame( x=seq(1,10,0.1),
y=sin(seq(1,10,0.1)) +
rnorm(91,sd=0.6)) %>%
mutate(name="Sin")
don <- do.call(rbind, list(d1, d2, d3, d4))
# Plot
don %>%
ggplot(aes(x=x, y=y)) +
geom_point(color="#69b3a2", alpha=0.8) +
theme_ipsum() +
facet_wrap(~name, scales = "free")
relationship scatterplots
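The point that a correlation coefficient only captures the linear part of a relationship can be checked with a short calculation. The vectors below are made up for illustration.

```r
x <- -50:50

# perfectly linear relationship: correlation is 1
r_linear <- cor(x, 10 * x + 3)

# perfectly quadratic (and symmetric) relationship: correlation is 0,
# even though y is completely determined by x
r_square <- cor(x, x^2)

c(linear = r_linear, square = r_square)
```

This is why it is worth looking at the scatterplot itself and not only at the coefficient.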
The simplest display of two quantitative variables is a scatterplot, with each variable represented on an axis. For example, using the Salaries dataset, we can plot experience (yrs.since.phd) vs. academic salary (salary) for college professors.
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
data(Salaries, package="carData")
# enhanced scatter plot
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color="cornflowerblue",
size = 2,
alpha=.8) +
scale_y_continuous(label = scales::dollar,
limits = c(50000, 250000)) +
scale_x_continuous(breaks = seq(0, 60, 10),
limits=c(0, 60)) +
theme_minimal() + # use a minimal theme
labs(x = "Years Since PhD",
y = "",
title = "Experience vs. Salary",
subtitle = "9-month salary for 2008-2009")
Scatterplot 1
Notes: geom_point options can be used to change the
* color - point color
* size - point size
* shape - point shape
* alpha - point transparency. Transparency ranges from 0 (transparent) to 1 (opaque), and is a useful parameter when points overlap.
The functions scale_x_continuous and scale_y_continuous control the scaling on x and y axes respectively. We can use these options and functions to create a more attractive scatterplot.
It is often useful to summarize the relationship displayed in the scatterplot, using a best fit line. Many types of lines are supported, including linear, polynomial, and nonparametric (loess). By default, 95% confidence limits for these lines are displayed.
library(ggplot2) # for visualization
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color= "cornflowerblue") +
geom_smooth(method = "lm", color = "brown1")+
theme_minimal() + # use a minimal theme
labs(x = "Years Since PhD",
y = "",
title = "Experience vs. Salary",
subtitle = "9-month salary for 2008-2009")
Scatterplot Linear
Clearly, the salary increases with experience. However, there seems to be a dip at the right end - professors with significant experience, earning lower salaries. A straight line does not capture this non-linear effect. A line with a bend will fit better here.
A polynomial regression line provides a fit line of the form
\[\begin{equation} \label{eq:1} \hat{y}=\beta_0+\beta_1x+\beta_2x^2+\cdots+\beta_nx^n \end{equation}\]
Typically either a quadratic (one bend) or cubic (two bends) line is used. It is rarely necessary to use a higher-order \(( >3 )\) polynomial. Applying a quadratic fit to the salary data set produces the following result.
library(ggplot2) # for visualization
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color= "cornflowerblue") +
geom_smooth(method = "lm",
formula = y ~ poly(x, 2),
color = "yellow")+
theme_minimal() + # use a minimal theme
labs(x = "Years Since PhD",
y = "",
title = "Experience vs. Salary",
subtitle = "9-month salary for 2008-2009")
Scatterplot Quadratic
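For readers who want to see the coefficients behind such a curve, here is a minimal sketch on made-up, noise-free data where the true coefficients are known. Note that poly(x, 2) uses orthogonal polynomials by default; raw = TRUE is used here so that the coefficients correspond directly to \(\beta_0\), \(\beta_1\), and \(\beta_2\) in the polynomial formula. Both choices give the same fitted curve.

```r
x <- 1:20
y <- 3 + 2 * x - 0.1 * x^2 # known quadratic, no noise

fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE)) # coefficients are beta_0..beta_2
fit_orth <- lm(y ~ poly(x, 2))             # orthogonal basis (the default)

betas <- unname(coef(fit_raw))
betas # recovers 3, 2, and -0.1 up to rounding error

# same fitted values from both parameterizations
all.equal(fitted(fit_raw), fitted(fit_orth))
```

For plotting with geom_smooth, the orthogonal form used in the chapter is fine, since only the fitted curve is drawn.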
Finally, a smoothed nonparametric fit line can often provide a good picture of the relationship. The default in ggplot2 is a loess line, which stands for locally weighted scatterplot smoothing.
library(ggplot2) # for visualization
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color="cornflowerblue",
size = 2,
alpha = 1) +
geom_smooth(size = 1,
color = "green") +
scale_y_continuous(label = scales::dollar,
limits = c(50000, 250000)) +
scale_x_continuous(breaks = seq(0, 60, 10),
limits = c(0, 60)) +
theme_minimal() + # use a minimal theme
labs(x = "Years Since PhD",
y = "",
title = "Experience vs. Salary",
subtitle = "9-month salary for 2008-2009")
Scatterplot Smoothed Nonparametric
When plotting the relationship between a categorical variable and a quantitative variable, a large number of graph types are available. These include bar charts using summary statistics, grouped kernel density plots, side-by-side box plots, side-by-side violin plots, mean/sem plots, ridgeline plots, and Cleveland plots.
In previous sections, bar charts were used to display the number of cases by category for a single variable or for two variables. You can also use bar charts to display other summary statistics (e.g., means or medians) on a quantitative variable for each level of a categorical variable.
For example, the following graph displays the mean salary for a sample of university professors by their academic rank.
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(scales) # automatically determining breaks/labels
data(Salaries, package="carData")
# calculate mean salary for each rank
plotdata <- Salaries %>%
group_by(rank) %>%
dplyr::summarize(mean_salary = mean(salary))
# plot mean salaries in a more attractive fashion
mycols <- c("#CD534CFF", "#EFC000FF", "#0073C2FF")
ggplot(plotdata,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = mean_salary)) +
geom_bar(stat = "identity",
fill = mycols) +
geom_text(aes(label = dollar(mean_salary)),
vjust = -0.25) +
scale_y_continuous(breaks = seq(0, 130000, 20000),
label = dollar) +
theme_minimal() + # use a minimal theme
labs(title = "Mean Salary by Rank",
subtitle = "9-month academic salary for 2008-2009",
x = "",
y = "")
Bar Chart (Summary statistics)
One can compare groups on a numeric variable by superimposing kernel density plots in a single graph. Let's plot the distribution of salaries by rank using kernel density plots.
ggplot(Salaries,
aes(x = salary,
fill = rank)) +
geom_density(alpha = 0.4) +
theme_minimal() +
labs(title = "Salary distribution by rank")
Grouped Kernel Density Plots
The alpha option makes the density plots partially transparent so that we can see what is happening under the overlaps. Alpha values range from 0 (transparent) to 1 (opaque). The graph makes clear that, in general, the salary goes up with rank. However, the salary range for full professors is very wide.
A box plot displays the \(25^{th}\) percentile, median, and \(75^{th}\) percentile of a distribution. The whiskers (vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are plotted as points representing outliers (see the figure below).
Box Plots
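The quantities a box plot draws can be computed directly. The sketch below uses a small made-up vector and the common 1.5 * IQR rule for the whiskers, which is the rule geom_boxplot documents as its default; it also checks the "roughly 99%" figure for a normal distribution.

```r
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 25) # hypothetical data

q <- quantile(x, c(0.25, 0.50, 0.75)) # box: 25th, 50th, 75th percentiles
iqr <- q[["75%"]] - q[["25%"]]

# whiskers reach the most extreme points inside these fences
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# anything beyond the fences is drawn as an outlier point
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# share of a normal distribution that falls inside the fences
z <- qnorm(0.75) + 1.5 * (qnorm(0.75) - qnorm(0.25))
coverage <- 2 * pnorm(z) - 1
coverage
```

In this toy vector only the value 25 lies beyond the upper fence, so it is the only point that would be plotted as an outlier.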
Side-by-side box plots are very useful for comparing groups (i.e., the levels of a categorical variable) on a numerical variable. Let's plot the distribution of salaries by rank using box plots. Notched box plots provide an approximate method for visualizing whether groups differ. Although not a formal test, if the notches of two box plots do not overlap, there is strong evidence (95% confidence) that the medians of the two groups differ.
mycols <- c("#CD534CFF", "#EFC000FF", "#0073C2FF")
ggplot(Salaries, aes(x = rank,
y = salary)) +
geom_boxplot(notch = TRUE,
fill = mycols,
alpha = .7) +
theme_minimal() +
labs(title = "Salary Distribution by rank")
Box Plots
In the example above, all three groups appear to differ. One of the advantages of boxplots is that their widths are not usually meaningful. This allows you to compare the distribution of many groups in a single graph.
Violin plots are similar to kernel density plots but are mirrored and rotated \(90^\circ\). Let's plot the distribution of salaries by rank using violin plots.
ggplot(Salaries,
aes(x = rank,
y = salary)) +
geom_violin(fill = "azure1") +
geom_boxplot(width = .2,
fill = mycols,
outlier.color = "red",
outlier.size = 2) +
theme_minimal() +
labs(title = "Salary distribution by rank")
Violin Plots
A ridgeline plot (also called a joy plot) displays the distribution of a quantitative variable for several groups. They’re similar to kernel density plots with vertical faceting, but take up less room. Ridgeline plots are created with the ggridges package.
Using the Fuel economy dataset, let’s plot the distribution of city driving miles per gallon by car class.
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(ggridges) # to handle overlapping visualization
ggplot(mpg,
aes(x = cty,
y = class,
fill = class)) +
geom_density_ridges(alpha = 0.7) +
theme_ridges() +
labs(title = "City mileage by auto class") +
theme(legend.position = "none")
Ridgeline Plots
I have suppressed the legend here because it’s redundant (the distributions are already labeled on the y-axis). Unsurprisingly, pickup trucks have the poorest mileage, while subcompacts and compact cars tend to achieve the best ratings. However, there is a very wide range of gas mileage scores for these smaller cars.
Note that the possible overlap of distributions is the trade-off for a more compact graph. You can add transparency if the overlap is severe using geom_density_ridges(alpha = n), where n ranges from 0 (transparent) to 1 (opaque). See the package vignette for more details.
A popular method for comparing groups on a numeric variable is the mean plot with error bars. Error bars can represent standard deviations, standard error of the mean, or confidence intervals. In this section, we’ll plot means and standard errors. We can use the same technique to compare salary across rank and sex. (Technically, this is not bivariate since we’re plotting rank, sex, and salary, but it seems to fit here).
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(ggridges) # to handle overlapping visualization
# calculate means, standard deviations,
# standard errors, and 95% confidence
# intervals by rank and sex
plotdata <- Salaries %>%
group_by(rank, sex) %>%
dplyr::summarize(n = n(),
mean = mean(salary),
sd = sd(salary),
se = sd/sqrt(n),
ci = qt(0.975, df = n - 1) * sd / sqrt(n))
# improved means/standard error plot
pd <- position_dodge(0.2)
ggplot(plotdata,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = mean,
group=sex,
color=sex)) +
geom_point(position=pd,
size = 3) +
geom_line(position = pd,
size = 1) +
geom_errorbar(aes(ymin = mean - se,
ymax = mean + se),
width = .1,
position = pd,
size = 1) +
scale_y_continuous(label = scales::dollar) +
scale_color_brewer(palette="Set1") +
theme_minimal() +
labs(title = "Mean salary by rank and sex",
subtitle = "(mean +/- standard error)",
x = "",
y = "",
color = "Gender")
Line Plots
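The se and ci columns computed above follow the usual formulas for a mean. A quick check on a small made-up sample shows that the same interval comes out of base R's t.test():

```r
x <- c(10, 12, 14, 16, 18) # hypothetical salaries (in thousands)

n  <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean
ci <- qt(0.975, df = n - 1) * se # half-width of the 95% interval

by_hand <- c(mean(x) - ci, mean(x) + ci)
by_hand

# t.test() reports the same 95% confidence interval
from_ttest <- as.numeric(t.test(x)$conf.int)
from_ttest
```

Replacing mean - se and mean + se with mean - ci and mean + ci in geom_errorbar would draw 95% confidence intervals instead of standard errors.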
The relationship between a grouping variable and a numeric variable can be displayed with a scatterplot. For example, we can plot the distribution of salaries by rank. These one-dimensional scatterplots are called strip plots. Unfortunately, the overprinting of points makes interpretation difficult. The relationship is easier to see if the points are jittered, that is, if a small random amount is added to the position of each point. It is also easier to compare groups if we use color.
library(ggplot2) # for visualization
library(scales) # scaling infrastructure
ggplot(Salaries,
aes(y = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
x = salary,
color = rank)) +
geom_jitter(alpha = 0.7,
size = 1.5) +
scale_x_continuous(label = dollar) +
labs(title = "Academic Salary by Rank",
subtitle = "9-month salary for 2008-2009",
x = "",
y = "") +
theme_minimal() +
theme(legend.position = "none")
Strip Plots
The option legend.position = "none" is used to suppress the legend (which is not needed here). Jittered plots work well when the number of points is not overly large.
It may be easier to visualize distributions if we add boxplots to the jitter plots. Several options were added to create this plot.
* size = 1 makes the lines thicker
* outlier.color = "black" makes outliers black
* outlier.shape = 1 specifies circles for outliers
* outlier.size = 3 increases the size of the outlier
* alpha = 0.5 makes the points more transparent
* width = .2 decreases the amount of jitter (.4 is the default)
Finally, the \(x\) and \(y\) axes are reversed using the coord_flip function (i.e., the graph is turned on its side).
library(ggplot2) # for visualization
library(scales) # scaling infrastructure
ggplot(Salaries,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = salary,
color = rank)) +
geom_boxplot(size=1,
outlier.shape = 1,
outlier.color = "black",
outlier.size = 3) +
geom_jitter(alpha = 0.5,
width=.2) +
scale_y_continuous(label = dollar) +
labs(title = "Academic Salary by Rank",
subtitle = "9-month salary for 2008-2009",
x = "",
y = "") +
theme_minimal() +
theme(legend.position = "none") +
coord_flip()Combining Jitter and Boxplots 1
Before moving on, it is worth mentioning the geom_boxjitter function provided in the ggpol package. It creates a hybrid boxplot - half boxplot, half scatterplot.
library(ggplot2) # for visualization
library(scales) # scaling infrastructure
library(ggpol) # hybrid boxplot -half scatterplot
ggplot(Salaries,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = salary,
fill=rank)) +
geom_boxjitter(color="black",
jitter.color = "darkgrey",
errorbar.draw = TRUE) +
scale_y_continuous(label = dollar) +
labs(title = "Academic Salary by Rank",
subtitle = "9-month salary for 2008-2009",
x = "",
y = "") +
theme_minimal() +
theme(legend.position = "none")
Figure: Combining Jitter and Boxplots 2
Beeswarm plots (also called violin scatter plots) are similar to jittered scatterplots, in that they display the distribution of a quantitative variable by plotting points in a way that reduces overlap. In addition, they help display the density of the data at each point (in a manner similar to a violin plot). Continuing the previous example:
library(ggplot2) # for visualization
library(scales) # scaling infrastructure
library(ggbeeswarm) # reduces overlap
ggplot(Salaries,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = salary,
color = rank)) +
geom_quasirandom(alpha = 0.7,
size = 1.5) +
scale_y_continuous(label = dollar) +
labs(title = "Academic Salary by Rank",
subtitle = "9-month salary for 2008-2009",
x = "",
y = "") +
theme_minimal() +
theme(legend.position = "none")
Figure: Beeswarm Plots
The plots are created using the geom_quasirandom function. These plots can be easier to read than simple jittered strip plots. To learn more about these plots, see Beeswarm-style plots with ggplot2.
Cleveland plots are useful when you want to compare a numeric statistic for a large number of groups. For example, say that you want to compare the 2007 life expectancy for Asian countries using the gapminder dataset.
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(gapminder) # for dataset `gapminder`
data(gapminder, package="gapminder") # load dataset `gapminder`
# subset Asian countries in 2007
plotdata <- gapminder %>%
filter(continent == "Asia" &
year == 2007)
# Fancy Cleveland plot
ggplot(plotdata,
aes(x=lifeExp,
y=reorder(country, lifeExp))) +
geom_point(color="blue",
size = 2) +
geom_segment(aes(x = 40,
xend = lifeExp,
y = reorder(country, lifeExp),
yend = reorder(country, lifeExp)),
color = "azure3") +
labs (x = "Life Expectancy (years)",
y = "",
title = "Life Expectancy by Country",
subtitle = "GapMinder data for Asia - 2007") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
Figure: Cleveland Plot
Japan clearly has the highest life expectancy, while Afghanistan has the lowest by far. This last plot is also called a lollipop graph.
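The plot above sorts the countries with reorder(country, lifeExp) rather than leaving them in alphabetical order. The following base R sketch shows what that call does to the factor levels. The toy data frame and its values are hypothetical and are not taken from gapminder.

```r
# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  country = factor(c("A", "B", "C", "D")),
  lifeExp = c(70, 82, 44, 75)
)

# reorder() sorts the factor levels by the second argument (ascending)
ordered_country <- reorder(toy$country, toy$lifeExp)

levels(toy$country)       # alphabetical: "A" "B" "C" "D"
levels(ordered_country)   # sorted by lifeExp: "C" "A" "D" "B"
```

On a discrete y-axis, ggplot2 draws the first level at the bottom, so the group with the largest value ends up at the top of the chart.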
Multivariate graphs display the relationships among three or more variables. There are two common methods for accommodating multiple variables: grouping and faceting.
In grouping, the values of the first two variables are mapped to the x and y axes. Then additional variables are mapped to other visual characteristics such as color, shape, size, line type, and transparency. Grouping allows you to plot the data for multiple groups in a single graph. Using the Salaries data set, let’s display the relationship between yrs.since.phd and salary, using color to indicate rank.
library(carData) # for dataset
library(ggplot2) # for visualization
data(Salaries, package="carData")
ggplot(Salaries, aes(x = yrs.since.phd,
y = salary,
color=rank)) +
geom_point() +
theme_minimal() +
labs(title = "Academic salary by rank and years since degree")
Figure: Multivariate Grouping Plot 1
Next, let’s add the gender of the professor, using the shape of the points to indicate sex. We’ll increase the point size and add transparency to make the individual points clearer.
library(carData) # for dataset
library(ggplot2) # for visualization
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary,
color = rank,
shape = sex)) +
geom_point(size = 3, alpha = .6) +
theme_minimal() +
labs(title = "Academic salary by rank, sex, and years since degree")
Figure: Multivariate Grouping Plot 2
We can’t say that this is a great graphic. It is very busy, and it can be difficult to distinguish male from female professors. Faceting (described in the next section) would probably be a better approach.
Notice the difference between specifying a constant value (such as size = 3) and mapping a variable to a visual characteristic (e.g., color = rank). Mappings are always placed within the aes function, while the assignment of a constant value always appears outside of the aes function.
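The distinction between a mapping and a constant can be checked on a small example. This is a sketch on hypothetical toy data (the data frame toy and its columns are made up); it uses layer_data() from ggplot2 to inspect the colors that were actually assigned to the points.

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration
toy <- data.frame(
  x   = 1:6,
  y   = c(2, 4, 3, 6, 5, 7),
  grp = rep(c("a", "b"), times = 3)
)

# Mapping: color depends on the variable grp, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point(size = 3)

# Constant: every point gets the same color, so it goes outside aes()
p_constant <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "steelblue", size = 3)

# layer_data() returns the values ggplot2 computed for each point
mapped_colors   <- unique(layer_data(p_mapped)$colour)
constant_colors <- unique(layer_data(p_constant)$colour)
```

The mapped version yields one color per level of grp and a legend, while the constant version yields a single color and no legend.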
Here is a cleaner example. We’ll graph the relationship between years since Ph.D. and salary using the size of the points to indicate years of service. This is called a bubble plot.
library(carData) # for dataset
library(ggplot2) # for visualization
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary,
color = rank,
size = yrs.service)) +
geom_point(alpha = .8) +
theme_minimal() +
labs(title = "Academic salary by rank, years of service, and years since degree")
Figure: Multivariate Grouping Plot 3
There is obviously a strong positive relationship between years since Ph.D. and years of service. Assistant Professors fall in the 0-11 years since Ph.D. and 0-10 years of service range. Clearly, highly experienced professionals don’t stay at the Assistant Professor level (they are probably promoted or leave the University). We don’t find the same time demarcation between Associate and Full Professors. Bubble plots are described in more detail in a later chapter.
As a final example, let’s look at the yrs.since.phd vs salary and add sex using color and quadratic best fit lines.
library(carData) # for dataset
library(ggplot2) # for visualization
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary,
color = sex)) +
geom_point(alpha = .4,
size = 3) +
geom_smooth(se=FALSE,
method = "lm",
formula = y~poly(x,2),
size = 1.5) +
labs(x = "Years Since Ph.D.",
title = "Academic Salary by Sex and Years Experience",
subtitle = "9-month salary for 2008-2009",
y = "",
color = "Sex") +
scale_y_continuous(label = scales::dollar) +
scale_color_brewer(palette = "Set1") +
theme_minimal()
Figure: Multivariate Grouping Plot 4
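The quadratic best fit lines in the last plot come from method = "lm" with formula = y ~ poly(x, 2). As a rough sketch of what that fit involves, the same formula can be passed to lm() directly. The simulated data and the coefficients used to generate it are hypothetical and are not estimates from the Salaries data.

```r
# Hypothetical simulated data; the coefficients below are made up
set.seed(1)
x <- seq(0, 40, length.out = 60)
y <- 80000 + 3000 * x - 50 * x^2 + rnorm(length(x), sd = 5000)

# same formula that geom_smooth() receives in the plot above
fit <- lm(y ~ poly(x, 2))

# intercept plus two polynomial terms
n_coef <- length(coef(fit))

# fitted curve evaluated on a grid of x values
grid         <- data.frame(x = seq(0, 40, by = 5))
fitted_curve <- predict(fit, newdata = grid)
```

Because color = sex is mapped in the plot, geom_smooth fits one such curve for each sex rather than a single curve for all professors.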
Grouping allows you to plot multiple variables in a single graph, using visual characteristics such as color, shape, and size. In faceting, a graph consists of several separate plots or small multiples, one for each level of a third variable, or combination of variables. It is easiest to understand this with an example.
library(carData) # for dataset
library(ggplot2) # for visualization
ggplot(Salaries, aes(x = salary)) +
geom_histogram(fill = "cornflowerblue",
color = "white") +
facet_wrap(~rank, ncol = 1) +
theme_minimal() +
labs(title = "Salary histograms by rank")
Figure: Multivariate Faceting 1
The facet_wrap function creates a separate graph for each level of rank. The ncol option controls the number of columns. In the next example, two variables are used to define the facets.
Here, the facet_grid function assigns sex to the rows and rank to the columns, creating a matrix of 6 plots in one graph.
library(carData) # for dataset
library(ggplot2) # for visualization
ggplot(Salaries, aes(x = salary / 1000)) +
geom_histogram(color = "white",
fill = "cornflowerblue") +
facet_grid(sex ~ rank) +
theme_minimal() +
labs(title = "Salary histograms by sex and rank",
x = "Salary ($1000)")
Figure: Multivariate Faceting 2
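The rows ~ columns layout of facet_grid can be verified by counting panels. The sketch below uses a hypothetical toy data set (made-up values, with the same two sex levels and three rank levels) and reads the panel identifiers from layer_data().

```r
library(ggplot2)

# Hypothetical toy data: every combination of two factors, values made up
toy <- expand.grid(
  sex  = c("Female", "Male"),
  rank = c("AsstProf", "AssocProf", "Prof"),
  rep  = 1:4
)
toy$value <- seq_len(nrow(toy))

# rows ~ columns: sex defines the rows, rank defines the columns
p <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 5) +
  facet_grid(sex ~ rank)

# one panel per sex-by-rank combination
n_panels <- length(unique(layer_data(p)$PANEL))
```

With two levels of sex and three levels of rank, the grid contains 2 x 3 = 6 panels.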
We can also combine grouping and faceting. Let’s use Mean/SE plots and faceting to compare the salaries of male and female professors, within rank and discipline. We’ll use color to distinguish sex and faceting to create plots for rank by discipline combinations.
library(carData) # for dataset
library(ggplot2) # for visualization
library(dplyr) # data manipulation
# calculate means and standard errors by sex,
# rank and discipline
plotdata <- Salaries %>%
group_by(sex, rank, discipline) %>%
dplyr::summarize(n = n(),
mean = mean(salary),
sd = sd(salary),
se = sd / sqrt(n))
# create better labels for discipline
plotdata$discipline <- factor(plotdata$discipline,
labels = c("Theoretical",
"Applied"))
# create plot
ggplot(plotdata,
aes(x = sex,
y = mean,
color = sex)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean - se,
ymax = mean + se),
width = .1) +
scale_y_continuous(breaks = seq(70000, 140000, 10000),
label = scales::dollar) +
facet_grid(. ~ rank + discipline) +
theme_bw() +
theme(legend.position = "none",
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank()) +
labs(x="",
y="",
title="Nine month academic salaries by gender, discipline, and rank",
subtitle = "(Means and standard errors)") +
scale_color_brewer(palette="Set1")
Figure: Multivariate Faceting 3
The statement facet_grid(. ~ rank + discipline) specifies no row variable (.) and columns defined by the combination of rank and discipline.
The theme_bw() and theme() functions create a black and white theme and eliminate vertical grid lines and minor horizontal grid lines. The scale_color_brewer() function changes the color scheme for the points and error bars.
At first glance, it appears that there might be gender differences in salaries for the associate and full professors in theoretical fields. I say “might” because we haven’t done any formal hypothesis testing yet (ANCOVA in this case). See the Customizing section to learn more about customizing the appearance of a graph.
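The summary step in the Mean/SE example computes the standard error as sd / sqrt(n) within each group. Here is the same arithmetic for a single group in base R. The four salary values are hypothetical and chosen only to keep the numbers easy to follow.

```r
# Hypothetical salaries for one group, made up for illustration
salary <- c(100000, 120000, 140000, 160000)

n  <- length(salary)        # group size
m  <- mean(salary)          # group mean
s  <- sd(salary)            # sample standard deviation
se <- s / sqrt(n)           # standard error of the mean

# limits drawn by geom_errorbar(): one standard error either side of the mean
lower <- m - se
upper <- m + se
```

Larger groups give smaller standard errors for the same spread, which is why the error bars differ in length across panels even when the salaries vary by a similar amount.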
As a final example, we’ll shift to a new dataset and plot the change in life expectancy over time for countries in Asia. The data comes from the gapminder dataset in the gapminder package. Each country appears in its own facet. The theme functions are used to simplify the background color, rotate the x-axis text, and make the font size smaller.
library(gapminder) # for dataset
library(ggplot2) # for visualization
library(dplyr) # data manipulation
# plot life expectancy by year separately
# for each country in Asia
data(gapminder, package = "gapminder")
# Select the Asia data
plotdata <- dplyr::filter(gapminder,
continent == "Asia")
# plot life expectancy by year, for each country
ggplot(plotdata, aes(x=year, y = lifeExp)) +
geom_line(color="grey") +
geom_point(color="blue") +
facet_wrap(~country) +
theme_minimal(base_size = 9) +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
labs(title = "Changes in Life Expectancy",
x = "Year",
y = "Life Expectancy")
Figure: Multivariate Faceting 4