This is an R Markdown document. R Markdown can be used to generate web pages, PDFs, MS Word documents, presentations, dashboards, and systematic reports.
For example, if a report contains a table or a visual that needs updating, you could open the report, edit it by hand, save it, and send it to the client. With R Markdown, you can instead add dynamic fields that update automatically from the data file you save.
This document will showcase the different data visuals reviewed in the data visualizations in R demonstration.
In addition to making documents, R Markdown is great for showcasing code and visuals for educational purposes. We will be using the “diamonds” data set available in the ggplot2 package in R. In addition to ggplot2, we’ll also be using the data manipulation package dplyr and the quick analytics package GGally.
You can use the code listed below to install the required packages on your computer.
load.lib<-c("ggplot2", "dplyr", "GGally")
install.lib<-load.lib[!load.lib %in% rownames(installed.packages())] # Keep only the packages that are not installed yet
for(lib in install.lib) install.packages(lib,dependencies=TRUE)
sapply(load.lib,require,character.only=TRUE) # Load each package by name
Imagine you have a data set and you want to know the frequency (count) of a variable. With R, you can quickly make a plot.
To use ggplot2 we need to specify (1) the data set, (2)
the aesthetic (variable mapping), and (3) the geometry. In the code
chunk below, we told R that we wanted to use the function ggplot to make
a bar plot using the diamonds data set. This produces a count of the X
variable, which in this case is the count of the types of diamond
cuts.
ggplot(data = diamonds, # We name the data set we want to use
aes(x = cut)) + # We select a variable to take counts of
geom_bar() # We want to show the counts with bars so we can compare differences
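The counts that the bars represent can also be inspected as plain numbers. The short sketch below uses base R's table function on the same column; the object name cut_counts is only an illustrative choice.

```r
library(ggplot2)  # provides the diamonds data set

# Count how many diamonds fall into each cut category
cut_counts <- table(diamonds$cut)
cut_counts
```

Each entry in the table should match the height of the corresponding bar in the plot.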
Sometimes bar plots can help us quickly see differences between columns. For example, we can easily see a difference in counts between ‘Fair’ and ‘Ideal’, but what about the difference between ‘Very Good’ and ‘Premium’? Although we know that we have more premium cuts than very good cuts, the magnitude is tough to see.
ggplot(data = diamonds, # Same information as before
aes(x = cut)) +
geom_bar() +
geom_text(stat="count",
aes(label = after_stat(count)),
vjust = -1) + # But we add a label to see the counts
ylim(0, 25000) # And we'll adjust the y-axis so we can see the labels better
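If the goal is only to know how far apart two bars are, the gap can also be computed directly. This is a minimal sketch using dplyr's count function; the names premium_n, very_good_n, and gap_n are illustrative and do not appear elsewhere in this document.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# One row per cut, with the number of diamonds in column n
cut_counts <- diamonds %>% count(cut)

premium_n   <- cut_counts$n[cut_counts$cut == "Premium"]
very_good_n <- cut_counts$n[cut_counts$cut == "Very Good"]

gap_n <- premium_n - very_good_n  # absolute difference in counts
gap_n
gap_n / very_good_n               # difference relative to 'Very Good'
```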
We do not always want counts. Other statistics, such as means, also display well in a bar plot.
For example, using the same data set we can create a bar plot showing the average of one variable by another. In the diamonds data set, we might want to know the average price for each diamond cut.
ggplot(diamonds, aes(cut, # ggplot knows that the first value listed in the aes is X and the second is Y
price)) +
geom_bar(stat = "summary", # Within the geom_bar we are specifying we want summary statistics, specifically a mean
fun = "mean")
As previously demonstrated, sometimes what we want are the labels of the statistics. However, the statistics we can label are limited by the shape of the data file. For example, reviewing the diamonds data set below, we can see that the values are stacked on top of each other, one row per diamond. Importantly, this data structure allows the bar plot to quickly go down a column of data to produce counts. But the mean values of another variable do not physically appear in this data set.
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
As a result, we’ll use dplyr to construct a new data set. We begin by
stating our data frame once, and using our pipes to specify that we want
to group the data on the categorical variable we wish to take means of,
in this case cut. Then, using these pipes, we specify that
we want to summarize the mean of price for each of those groups. We can
then add commands to tell ggplot2 we’ll be using the
avg_price variable we constructed.
diamonds %>%
group_by(cut) %>%
summarise(avg_price = round(mean(price, na.rm = TRUE),2)) %>% # rounding to 2 decimals
ggplot(aes(cut,
avg_price,
label=avg_price)) +
geom_col() + # Change to geom_col to have a Y
geom_text(vjust = -1) + # Add an argument to adjust the vertical position of the label
ylim(0, 5500) # Adjust our y-axis for clearer labels
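The upper y-axis limit above was typed in by hand. As a hedged alternative, the limit can be derived from the summarized data so that it still fits if the data change. The 15% headroom and the names avg_by_cut, y_upper, and p_avg are assumptions made for this sketch.

```r
library(ggplot2)
library(dplyr)

# Same summary as above, saved as its own object
avg_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(avg_price = round(mean(price, na.rm = TRUE), 2))

# Leave roughly 15% of headroom above the tallest bar for the labels
y_upper <- max(avg_by_cut$avg_price) * 1.15

p_avg <- ggplot(avg_by_cut, aes(cut, avg_price, label = avg_price)) +
  geom_col() +
  geom_text(vjust = -1) +
  ylim(0, y_upper)
p_avg
```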
Sometimes we might want the bars and the labels to tell us different things. For example, we might be interested in seeing the mean price for each cut in the bars, but we want to see the counts in the labels to know how many there are.
ggplot(diamonds, aes(cut,
price)) +
geom_bar(stat = "summary", fun = "mean") +
geom_label(inherit.aes = FALSE,
data = . %>% group_by(cut) %>% count(), # Use dplyr to get counts
aes(label = paste0(n, " Cuts"), # Paste the word 'Cuts' to the label
x = cut), y = -0.5)
You can also look at subsets by adding a facet_wrap to
your plot. A facet wrap allows you to view your chosen visual (and
summary statistic), by another variable. By adding another categorical
variable (clarity) as a facet, we can now see the cut
counts for each level of clarity.
ggplot(data = diamonds, aes(x = cut)) +
geom_bar() +
geom_text(stat="count",
aes(label = after_stat(count)),
vjust = -1) +
facet_wrap(.~clarity) # Add a facet wrap on the clarity variable
Similar to a facet wrap is a facet_grid. The core
distinction between the two commands is that a facet grid forms a panel
matrix defined by row and column variables, while a facet wrap makes a
single ribbon of panels and wraps it into rows and columns.
ggplot(data = diamonds, aes(x = cut)) +
geom_bar() +
geom_text(stat="count", aes(label = after_stat(count)), vjust = -1) +
facet_grid(clarity~.) # Add a facet grid
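The matrix layout of a facet grid is easiest to see with two variables. The sketch below places clarity on the rows and color on the columns; the choice of color as the second variable and the object name p_grid are assumptions for illustration.

```r
library(ggplot2)

# Rows are levels of clarity, columns are levels of color
p_grid <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar() +
  facet_grid(clarity ~ color)
p_grid
```

Every row shares a clarity level and every column shares a color level, which makes it possible to read across or down the matrix.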
But be careful when adding multiple variables. Depending on what you select for an output, you might find margins can be truncated to a point where a plot cannot be viewed.
ggplot(data = diamonds, aes(x = cut)) +
geom_bar() +
geom_text(stat="count", aes(label = after_stat(count)), vjust = -1) +
facet_wrap(clarity~color, scales="free_y") # Add a facet wrap with a second variable
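One way to anticipate this problem is to count how many panels a facet would create before drawing it. This is a small sketch using dplyr's distinct function; the name n_panels is illustrative.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Each observed combination of clarity and color becomes one panel
n_panels <- diamonds %>%
  distinct(clarity, color) %>%
  nrow()
n_panels
```

If the number is large, consider faceting on a single variable or filtering the data first.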
Sometimes we want to use a bar plot to convey percentage differences within categorical variables. For example, while we might want to know the count of different cuts, we also might want to know the percentage for each cut.
Similar to the issue we ran into with making a bar plot that has mean labels, we want to add a percentage label, but we do not have the percentage data within our data set. Instead, we need to construct a new data frame for the desired visual.
diamonds_edit <- diamonds %>% # Save a summarized version of the diamonds data set
group_by(cut) %>% # grouping by cut
count(cut, color) %>% # Counting each color within each cut
mutate(pct = n/sum(n), # Calculating a percentage
pct_label = scales::percent(pct))
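Because the percentages are calculated within each cut, they should add up to 1 for every cut. The sketch below rebuilds the same counts and checks this; pct_check and total_pct are illustrative names, and the pct_label column is left out because it is not needed for the check.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

diamonds_edit <- diamonds %>%
  group_by(cut) %>%
  count(cut, color) %>%
  mutate(pct = n / sum(n))

# The result is still grouped by cut, so this sums pct within each cut
pct_check <- diamonds_edit %>%
  summarise(total_pct = sum(pct))
pct_check
```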
We can see that the first data set has a different structure compared to the second.
head(diamonds) # Old
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
head(diamonds_edit) # New
## # A tibble: 6 × 5
## # Groups: cut [1]
## cut color n pct pct_label
## <ord> <ord> <int> <dbl> <chr>
## 1 Fair D 163 0.101 10.12%
## 2 Fair E 224 0.139 13.91%
## 3 Fair F 312 0.194 19.38%
## 4 Fair G 314 0.195 19.50%
## 5 Fair H 303 0.188 18.82%
## 6 Fair I 175 0.109 10.87%
And when we plot the percentages as a stacked bar plot, we can more easily look across the plot for percentage changes across the x-axis.
ggplot(data = diamonds_edit,
aes(x = cut,
y = pct,
fill = color)) + # Add a fill so that every group is assigned a color
geom_col() + # Using geom_col because we supply both the X and Y values
geom_text(aes(label = paste(pct_label, n, sep = "\n")), # Paste the label with a count
lineheight = 0.8, # Adjust the line height to make the label easier to read
position = position_stack(vjust = 0.5)) + # Center each label in its segment so it's easier to read
scale_y_continuous(labels = scales::percent) # Set the axis labels to be a scaled percentage
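When the labels are not needed, a similar stacked percentage plot can be drawn from the original data set without building a new data frame. This is a sketch of an alternative approach, not the method used above: it relies on geom_bar with position = "fill", which rescales each bar to a total of 1. The object name p_fill is illustrative.

```r
library(ggplot2)

p_fill <- ggplot(data = diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill") +                   # Stack the counts and rescale each bar to 1
  scale_y_continuous(labels = scales::percent) +  # Show the axis as percentages
  ylab("pct")
p_fill
```

The trade-off is that the per-segment percentages and counts are not available as columns, so adding text labels is less direct than with the summarized data frame.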
Depending on the trends, you might even want to make the coordinates polar.
ggplot(data = diamonds_edit,
aes(x = cut,
y = pct,
fill = color)) +
geom_col() +
geom_text(aes(label = paste(pct_label, n, sep = "\n")),
lineheight = 0.8,
position = position_stack(vjust = 0.5),
size = 2, # Add label size
color = "white") + # Change label color
scale_y_continuous(labels = scales::percent) +
coord_polar() # Switch to polar coordinates
Although we can get a lot of mileage out of a bar plot, analysts often have different variables which require different data visuals.
For example, bar plots are excellent for showing differences between groups, but potentially less clear when trying to examine change. Additionally, when both variables are continuous, a bar plot may not be the wisest choice.
ggplot(data = diamonds, aes(x = price,
y = carat)) +
geom_col()
In the plot above, it’s difficult to quickly glean insight. The plot
below is more suited for the variables price and
carat because both are continuous.
ggplot(data = diamonds,
aes(x = price, y = carat)) + # Same data set and a similar aes for scatter plots
geom_point() # Change the geometry to a point
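A scatter plot shows the shape of the relationship, and a single number can accompany it. The sketch below computes the Pearson correlation between the two variables with base R's cor function; the name price_carat_cor is illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Pearson correlation between price and carat
price_carat_cor <- cor(diamonds$price, diamonds$carat)
price_carat_cor
```

A value close to 1 indicates that price and carat tend to rise together, which should agree with the upward pattern in the plot.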
Sometimes we want our plots to carry more information, which we can do by adding variables to our geometry aesthetic.
ggplot(data = diamonds,
aes(x = price, y = carat)) + # Same data set and a similar aes for scatter plots
geom_point(aes(color = cut)) # Add color of cut
R also makes it easy to map additional variables to other
dimensions. In the plot below, we map the y variable in
our data set (width) to the size of each point in
our plot.
This means a larger width should have a larger point (although it is difficult to see with this data set).
ggplot(data = diamonds, aes(x = price,
y = carat)) +
geom_point(aes(color = cut,
size = y)) # Adding a variable for size
We can also control how transparent the points are with the alpha
argument, which helps when many points overlap. In the plot below, we
set alpha to 0.1, so each point is mostly transparent.
ggplot(data = diamonds, aes(x = price,
y = carat)) +
geom_point(aes(color = cut,
size = y),
alpha = 0.1) # The lower the alpha, the more transparent
Below we changed the alpha to a larger value, making it less translucent.
ggplot(data = diamonds, aes(x = price,
y = carat)) +
geom_point(aes(color = cut,
size = y),
alpha = 0.8) # The lower the alpha, the more transparent
We can even map alpha to a variable instead of a fixed value. At this point, though, the plot may be too busy. Generally, adding more than 2-3 aesthetic mappings to a plot can be confusing.
ggplot(data = diamonds,
aes(x = price, y = carat,
alpha = x)) + # Just add the alpha to the variable aesthetic mapping
geom_point(aes(color = cut,
size = y))
Because R is an object-oriented language, you can save the parameters
you specify in an object and reference that object while you work. For
example, in the code chunk below, we’re making a density plot by first
saving our data set and aesthetic as an object called
g.
We then add the geometry to the object, specifying the information we want to know.
g <- ggplot(diamonds, aes(cut)) # Save the data set and aesthetic as an object
g + geom_density(aes(fill=factor(cut)), alpha=0.5)
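The same idea works for any plot. As a sketch, the scatter plot of price and carat can be saved once and then reused with different geometry settings; base_plot, points_plain, and points_colored are illustrative names.

```r
library(ggplot2)

# Save the data set and the shared aesthetic mapping once
base_plot <- ggplot(data = diamonds, aes(x = price, y = carat))

# Reuse the same object with different geometry settings
points_plain   <- base_plot + geom_point(alpha = 0.1)
points_colored <- base_plot + geom_point(aes(color = cut))

points_plain
points_colored
```

Saving the shared part once keeps the variants consistent: changing the data set or the axes in base_plot updates every plot built from it the next time the code is run.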
The range of plots one can make using ggplot2 is very wide. Most
plots you can imagine can be created. However, sometimes, as you may
have observed throughout this lesson, there is some coding required.
To facilitate plotting, search for packages that are relevant to your work. You will likely find a variety of packages with helper functions which provide you with the statistics and visuals you need.
For example, below we take a few variables from the diamonds data set and use the ggpairs function from GGally to construct a matrix of plots that compares the distribution of each variable with every other.
diamonds_cut <- diamonds[,c("cut", "price", "color", "clarity")]
ggpairs(diamonds_cut)
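Because ggpairs draws one panel for every pair of columns, it can be slow on the full data set. One hedged option while exploring is to plot a random sample of rows first. The sample size of 1,000, the seed, and the names diamonds_small and pairs_plot are arbitrary choices for this sketch.

```r
library(ggplot2)  # provides the diamonds data set
library(GGally)

set.seed(42)  # arbitrary seed so the sample is reproducible

# Draw 1,000 random rows and keep the same four columns as above
sample_rows <- sample(nrow(diamonds), 1000)
diamonds_small <- diamonds[sample_rows, c("cut", "price", "color", "clarity")]

pairs_plot <- ggpairs(diamonds_small)
pairs_plot
```

Once the layout looks right, the same call can be run on the full set of rows.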
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.