Introduction to markdown

This is an rmarkdown document. Markdown can be used to generate web pages, PDFs, MS Word documents, presentations, dashboards, and systematic reports.

For example, if you have a table or a visual that needs updating, you would normally open the report, edit the document, save it, and send it to the client. With rmarkdown, you can add dynamic fields that update automatically based on the data file you save.
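
As a minimal sketch of what such a dynamic field might compute, the values below are derived from the data each time the code runs. In an R Markdown file the same expressions could be placed in inline code so the sentence updates on every knit. The example uses the diamonds data set from ggplot2 that the rest of this document relies on; the names n_diamonds, avg_price, and summary_sentence are illustrative assumptions.

```r
library(ggplot2)  # provides the diamonds data set

# Values recomputed from the data whenever the document is rendered
n_diamonds <- nrow(diamonds)
avg_price  <- round(mean(diamonds$price), 2)

# A sentence assembled from those values (hypothetical wording)
summary_sentence <- sprintf(
  "The data set contains %d diamonds with an average price of $%.2f.",
  n_diamonds, avg_price
)
print(summary_sentence)
```

If the underlying data change, rerunning the code changes the sentence without any manual editing.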

This document will showcase the different data visuals reviewed in the data visualizations in R demonstration.

Making data visuals

In addition to making documents, rmarkdown is great for showcasing code and visuals for educational purposes. We will be using the “diamonds” data set available in the ggplot2 package in R. In addition to ggplot2, we’ll also be using the data manipulation package dplyr and the quick analytics package GGally.

Setting up

You can use the code listed below to install the required packages on your computer.

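load.lib <- c("ggplot2", "dplyr", "GGally")
install.lib <- load.lib[!load.lib %in% rownames(installed.packages())]   # Packages not yet installed
for (lib in install.lib) install.packages(lib, dependencies = TRUE)      # Install only what is missing
sapply(load.lib, require, character.only = TRUE)                         # Load each package by name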

Making bar plots

Imagine you have a data set and you want to know the frequency (count) of a variable. With R, you can quickly make a plot.

Make a bar plot of variable counts

To use ggplot2 we need to specify (1) the data set, (2) the aesthetic (variable mapping), and (3) the geometry. In the code chunk below, we told R that we wanted to use the function ggplot to make a bar plot using the diamonds data set. This produces a count of the X variable, which in this case is the count of the types of diamond cuts.

ggplot(data = diamonds,               # We name the data set we want to use
       aes(x = cut)) +                # We select a variable to take counts of
  geom_bar()                          # We want to show the counts with bars so we can compare differences
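
To see the numbers behind the bars, a short sketch like the one below tabulates the same counts that geom_bar() computes by default. It assumes dplyr is installed; the name cut_counts is illustrative.

```r
library(ggplot2)  # diamonds data set
library(dplyr)    # count()

# One row per cut with the number of diamonds in that cut
cut_counts <- diamonds %>% count(cut)
print(cut_counts)
```

The n column holds the heights of the bars in the plot above.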

Make a barplot with a count label

Sometimes bar plots can help us quickly see differences between columns. For example, we can easily see a difference in counts between ‘Fair’ and ‘Ideal’, but what about the difference between ‘Very Good’ and ‘Premium’? Although we can tell that we have more premium cuts than very good cuts, the magnitude is tough to see.

ggplot(data = diamonds,                         # Same information as before
       aes(x = cut)) +
  geom_bar() + 
  geom_text(stat="count", 
            aes(label = after_stat(count)), 
            vjust = -1) +                      # But we add a label to see the counts
  ylim(0, 25000)                               # And we'll adjust the y-axis so we can see the labels better

Make a barplot showing the average of one variable by the categories of another

We do not always want counts. Other statistics, such as means, also display well in a bar plot.

For example, using the same data set we can also create a bar plot showing us the average of one variable by another. For instance, in the diamonds data set we might want to know the average price for each diamond cut.

ggplot(diamonds, aes(cut,                     # ggplot knows that the first value listed in the aes is X and the second Y
                     price)) +  
  geom_bar(stat = "summary",                  # Within the geom_bar we are specifying we want summary statistics, specifically a mean
           fun = "mean")      
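
The same plot can also be written, as a sketch, with stat_summary(), which some readers find more explicit about the fact that a summary is being computed. This is an alternative phrasing of the code above rather than a different result; the object name p_mean is an assumption for illustration.

```r
library(ggplot2)

p_mean <- ggplot(diamonds, aes(x = cut, y = price)) +
  stat_summary(fun = "mean",     # summarise price within each cut
               geom = "bar")     # draw the summary as bars
p_mean
```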

Make a barplot showing the average of one variable by the categories of another with labels

As previously demonstrated, sometimes we want labels showing the statistics. However, the statistics we can label are limited by the shape of the data. For example, reviewing the diamonds data set below, we can see that each row is a single diamond, with the values stacked on top of each other. Importantly, this structure allows the bar plot to quickly go down a column of data to produce counts. But the mean values of another variable do not physically appear in this data set.

head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

As a result, we’ll use dplyr to construct a new data set. We begin by stating our data frame once and then use pipes to specify that we want to group the data by the categorical variable we wish to take means over, in this case cut. Then, continuing the pipe, we specify that we want to summarize the mean of price for each of those groups. We can then add commands to tell ggplot2 we’ll be using the avg_price variable we constructed.

diamonds %>% 
  group_by(cut) %>% 
  summarise(avg_price = round(mean(price, na.rm = TRUE),2)) %>%   # rounding to 2 decimals
  ggplot(aes(cut, 
             avg_price, 
             label=avg_price)) +
  geom_col() +                             # Change to geom_col because we supply our own Y values
  geom_text(vjust = -1) +                  # Adjust the vertical position of the label
  ylim(0, 5500)                            # Adjust our y-axis for clearer labels
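
It can help to inspect the intermediate data frame before plotting it. The sketch below runs only the dplyr steps and stores the result under the assumed name price_by_cut.

```r
library(ggplot2)  # diamonds data set
library(dplyr)    # group_by(), summarise()

# One row per cut, with the rounded mean price for that cut
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(avg_price = round(mean(price, na.rm = TRUE), 2))

print(price_by_cut)
```

Unlike the original diamonds data, this data frame physically contains the mean values, which is why they can be used as labels.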

Make a barplot showing the average of one variable with count labels

Sometimes we might want the bars and the labels to tell us different things. For example, we might be interested in seeing the mean price for each cut in the bars, but we want to see the counts in the labels to know how many there are.

ggplot(diamonds, aes(cut, 
                     price)) +                        
  geom_bar(stat = "summary", fun = "mean") +
  geom_label(inherit.aes = FALSE,                       
             data = . %>% group_by(cut) %>% count(),    # Use dplyr to get counts
             aes(label = paste0(n, " Cuts"),            # Paste the word 'Cuts' to the label
                 x = cut), y = -0.5)
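
The dot-and-pipe shorthand in the data argument above builds a function that ggplot2 applies to the plot data for that layer. A sketch of an equivalent, more explicit version computes the counts first and passes them in as a separate data frame; cut_counts and p_labeled are assumed names.

```r
library(ggplot2)
library(dplyr)

# Counts per cut, computed before plotting
cut_counts <- diamonds %>% count(cut)

p_labeled <- ggplot(diamonds, aes(cut, price)) +
  geom_bar(stat = "summary", fun = "mean") +        # bars show the mean price
  geom_label(data = cut_counts,                     # labels come from the counts
             inherit.aes = FALSE,
             aes(x = cut, label = paste0(n, " Cuts")),
             y = -0.5)
p_labeled
```

Precomputing the counts makes it easier to check the label values before they reach the plot.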

Adding a facet to a plot

You can also look at subsets by adding a facet_wrap to your plot. A facet wrap allows you to view your chosen visual (and summary statistic), by another variable. By adding another categorical variable (clarity) as a facet, we can now see the cut counts for each level of clarity.

ggplot(data = diamonds, aes(x = cut)) +
  geom_bar() + 
  geom_text(stat="count", 
            aes(label = after_stat(count)), 
            vjust = -1) +
  facet_wrap(.~clarity)                     # Add a facet wrap on the clarity variable

Adding a facet grid to a plot

Similar to a facet wrap is a facet_grid. The core distinction between the two commands is that a facet grid forms a matrix of panels defined by row and column variables, while a facet wrap makes a ribbon of panels wrapped into two dimensions.

ggplot(data = diamonds, aes(x = cut)) +
  geom_bar() + 
  geom_text(stat="count", aes(label = after_stat(count)), vjust = -1) +
  facet_grid(clarity~.)                    # Add a facet grid

But be careful when adding multiple variables. Depending on what you select for an output, you might find margins can be truncated to a point where a plot cannot be viewed.

ggplot(data = diamonds, aes(x = cut)) +
  geom_bar() + 
  geom_text(stat="count", aes(label = after_stat(count)), vjust = -1) +
  facet_wrap(clarity~color, scales="free_y")           # Add a facet wrap with a second variable
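
When two faceting variables are needed, a facet_grid with one variable on the rows and the other on the columns lays the panels out as a matrix, which may be easier to scan than a wrapped ribbon. A sketch, assuming the same clarity and color variables (p_grid is an illustrative name):

```r
library(ggplot2)

p_grid <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar() +
  facet_grid(clarity ~ color,        # clarity on the rows, color on the columns
             scales = "free_y")      # let each row use its own y-axis range
p_grid
```

With this many panels the output still needs a large plotting area to stay readable.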

Making a stacked bar plot of percentages

Sometimes we want to use a bar plot to convey percentage differences within categorical variables. For example, while we might want to know the count of different cuts, we also might want to know the percentage for each cut.

Similar to the issue we ran into with making a bar plot that has mean labels, we want to add a percentage label, but we do not have the percentage data within our data set. Instead, we need to construct a new data frame for the desired visual.

diamonds_edit <- diamonds %>%                     # Save the result as a new data set
  group_by(cut) %>%                               # Grouping by cut
  count(cut, color) %>%                           # Counting each color within each cut
  mutate(pct = n/sum(n),                          # Calculating each color's share of its cut
         pct_label = scales::percent(pct))
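
A quick sanity check on a data frame like this is that the percentages sum to 1 within each cut. The sketch below rebuilds the data frame so it runs on its own; pct_check is an assumed name.

```r
library(ggplot2)
library(dplyr)

diamonds_edit <- diamonds %>%
  group_by(cut) %>%
  count(cut, color) %>%
  mutate(pct = n / sum(n),
         pct_label = scales::percent(pct))

# The result is still grouped by cut, so this returns one total per cut
pct_check <- diamonds_edit %>%
  summarise(total = sum(pct))
print(pct_check)
```

If any total differed from 1, the grouping would not be what the stacked plot expects.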

We can see that the first data set has a different structure compared to the second.

head(diamonds)       # Old
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
head(diamonds_edit)  # New
## # A tibble: 6 × 5
## # Groups:   cut [1]
##   cut   color     n   pct pct_label
##   <ord> <ord> <int> <dbl> <chr>    
## 1 Fair  D       163 0.101 10.12%   
## 2 Fair  E       224 0.139 13.91%   
## 3 Fair  F       312 0.194 19.38%   
## 4 Fair  G       314 0.195 19.50%   
## 5 Fair  H       303 0.188 18.82%   
## 6 Fair  I       175 0.109 10.87%

And when we plot the percentages as a stacked bar plot, we can more easily look across the plot for percentage changes across the x-axis.

ggplot(data = diamonds_edit, 
       aes(x = cut, 
           y = pct, 
           fill = color)) +                                  # Add a fill so that every group is assigned a color
  geom_col() +                                               # Using geom_col because we supply both X and Y values
  geom_text(aes(label = paste(pct_label, n, sep = "\n")),    # Paste the label with a count
            lineheight = 0.8,                                # Adjust the line height to make the label easier to read
            position = position_stack(vjust = 0.5)) +        # Adjust the vertical position so it's easier to read the label
  scale_y_continuous(labels = scales::percent)               # Set the axis labels to be a scaled percentage

Depending on the trends, you might even want to make the coordinates polar.

ggplot(data = diamonds_edit, 
       aes(x = cut, 
           y = pct, 
           fill = color)) +                                 
  geom_col() +                                           
  geom_text(aes(label = paste(pct_label, n, sep = "\n")),   
            lineheight = 0.8,                            
            position = position_stack(vjust = 0.5),
            size = 2,                                # Add label size
            color = "white") +                       # Change label color
  scale_y_continuous(labels = scales::percent) +
  coord_polar()                                      # Switch to polar coordinates

Make a scatter plot

Although we can get a lot of mileage out of a bar plot, analysts often have different variables which require different data visuals.

For example, bar plots are excellent for showing differences between groups, but potentially less clear when trying to examine change. Additionally, when both variables are continuous, a bar plot may not be the wisest choice.

ggplot(data = diamonds, aes(x = price, 
                            y = carat)) +                            
  geom_col()

In the plot above, it’s difficult to quickly glean insight. The plot below is more suited for the variables price and carat because both are continuous.

ggplot(data = diamonds, 
       aes(x = price, y = carat)) +  # Same data set and a similar aes for scatter plots
  geom_point()                       # Change the geometry to a point

Make a scatter plot with a variable for point color

Sometimes we want our plots to be more dynamic, adding additional information to our geometry aesthetic.

ggplot(data = diamonds, 
       aes(x = price, y = carat)) +  # Same data set and a similar aes for scatter plots
  geom_point(aes(color = cut))       # Add color of cut

Make a scatter plot with a variable for point color and another for point size

R also makes it easy to add additional optional variables to various dimensions. In the plot below, we added the y variable in our data set (width) to represent the size of the point in our plot.

This means a larger width should have a larger point (although it is difficult to see with this data set).

ggplot(data = diamonds, aes(x = price, 
                            y = carat)) +                            
  geom_point(aes(color = cut, 
                 size = y))                   # Adding a variable for size

Make a scatter plot with an adjusted transparency

When many points overlap, individual points are hard to distinguish. In the plot below, we keep the same color and size mappings and set alpha, which controls the transparency of the points, to a low value.

Because alpha is set outside aes(), the same transparency applies to every point.

ggplot(data = diamonds, aes(x = price, 
                            y = carat)) +                            
  geom_point(aes(color = cut, 
                 size = y), 
             alpha = 0.1)             # The lower the alpha, the more transparent

Below we changed the alpha to a larger value, making it less translucent.

ggplot(data = diamonds, aes(x = price, 
                            y = carat)) +                            
  geom_point(aes(color = cut, 
                 size = y), 
             alpha = 0.8)             # The higher the alpha, the more opaque

And we can even map the alpha to a variable in the data set. At this point, though, the plot may be too busy. Generally, adding more than 2-3 aesthetic mappings to a plot can be confusing.

ggplot(data = diamonds, 
       aes(x = price, y = carat, 
           alpha = x)) +               # Just add alpha to the aesthetic mapping
  geom_point(aes(color = cut, 
                 size = y))            

Saving plots

Because R is an object oriented language, you can save the parameters you specify in an object and reference that object while you work. For example, in the code chunk below, we’re making a density plot by first saving our data set and aesthetic as an object called g.

We then add the geometry to the object, specifying the information we want to know.

g <- ggplot(diamonds, aes(cut))                        # Save the data set and aesthetic as an object
g + geom_density(aes(fill=factor(cut)), alpha=0.5)     # Add the geometry to the saved object
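
Because the saved object holds only the data and the aesthetic, it can be reused with different geometries. A sketch, assuming a continuous variable such as price on the x-axis (the names base_plot, p_hist, and p_dens are illustrative):

```r
library(ggplot2)

# Data set and aesthetic saved once
base_plot <- ggplot(diamonds, aes(price))

# The same base combined with two different geometries
p_hist <- base_plot + geom_histogram(bins = 30)
p_dens <- base_plot + geom_density(aes(fill = cut), alpha = 0.5)

p_hist
p_dens
```

Changing the data set or the aesthetic in base_plot then carries through to every plot built from it.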

Statistical plots

The range of plots one can make using ggplot2 is very broad, and most plots you can imagine can be created. However, as you may have observed throughout this lesson, some coding is often required.

To facilitate plotting, search for packages that are relevant to your work. You will likely find a variety of packages with helper functions which provide you with the statistics and visuals you need.

For example, below we take a few variables from the diamonds data set and construct a statistical plot showing us multiple comparisons of distributions.

diamonds_cut <- diamonds[,c("cut", "price", "color", "clarity")]
ggpairs(diamonds_cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
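
ggpairs can take a while on all rows, and the stat_bin messages above appear to come from the histogram panels that pair price with each categorical variable. The sketch below rests on two assumptions: that a random sample of 1,000 rows is representative enough for exploration, and that GGally's wrap() helper is used to pass a bins value to those histogram panels. The names diamonds_small and p_pairs are illustrative.

```r
library(ggplot2)
library(dplyr)
library(GGally)

set.seed(42)                                # make the sample reproducible
diamonds_small <- diamonds %>%
  select(cut, price, color, clarity) %>%
  slice_sample(n = 1000)                    # assumed sample size

p_pairs <- ggpairs(
  diamonds_small,
  lower = list(combo = wrap("facethist", bins = 30))  # set bins explicitly
)
p_pairs
```

The bins value of 30 only reproduces the default; a different value may suit the price distribution better.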