Kable package

In previous weeks, we saw R Markdown in action, where code, commentary, and output can all be created in one place.

In this chapter we will explore a package that facilitates the creation of presentation-worthy tables: “kableExtra”.

Let’s work with the cross-sectional data on the credit history for a sample of applicants for a type of credit card.

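library(AER)        # provides the CreditCard data set
library(kableExtra) # kbl(), kable_styling(), ...; also re-exports the %>% pipe
data(CreditCard)
cardHead <- head(CreditCard)
cardHead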
##   card reports      age income       share expenditure owner selfemp dependents
## 1  yes       0 37.66667 4.5200 0.033269910  124.983300   yes      no          3
## 2  yes       0 33.25000 2.4200 0.005216942    9.854167    no      no          3
## 3  yes       0 33.66667 4.5000 0.004155556   15.000000   yes      no          4
## 4  yes       0 30.50000 2.5400 0.065213780  137.869200    no      no          0
## 5  yes       0 32.16667 9.7867 0.067050590  546.503300   yes      no          2
## 6  yes       0 23.25000 2.5000 0.044438400   91.996670    no      no          0
##   months majorcards active
## 1     54          1     12
## 2     34          1     13
## 3     58          1      5
## 4     25          1      7
## 5     64          1      5
## 6     54          1      1


cardHead %>%
  kbl()
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1


Let’s tweak the appearance of this with the “align” and the “caption” arguments.

The align argument takes a character vector with one letter per column, “l” (left), “c” (center), or “r” (right), specifying how each column should be aligned.

The caption argument gives a caption to the table.


base <- cardHead %>%
  kbl(align = c(rep("c", 7), rep("r", 5)), caption = "kable example with card data")
base
kable example with card data
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1


Much of the configuration of the table is done through one key function: kable_styling().

It accepts “bootstrap_options” for HTML output and “latex_options” for PDF output; the latter requires a local installation of LaTeX, which the “tinytex” package can provide.

Possible values for “bootstrap_options” include ‘basic’, ‘striped’, ‘bordered’, ‘hover’, ‘condensed’, ‘responsive’, and ‘none’.

Possible values for “latex_options” include ‘basic’, ‘striped’, ‘hold_position’, ‘HOLD_position’, ‘scale_down’, and ‘repeat_header’.


base %>%
  kable_styling(bootstrap_options = "striped")
kable example with card data
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1
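
Several bootstrap options can be combined by passing a character vector. The sketch below is only an illustration: it uses the built-in mtcars data instead of the credit card data, and the object name styled, the caption, and the choice of options are assumptions made for the example. The format is set to "html" explicitly because the bootstrap options apply to HTML output.

```r
library(kableExtra)

# Built-in data, so the sketch does not depend on the CreditCard object
styled <- head(mtcars) %>%
  kbl(format = "html", caption = "Several bootstrap options at once") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,  # do not stretch the table across the page
    position = "left"    # align the whole table to the left
  )
styled
```

Setting full_width = FALSE is often useful for narrow tables, which otherwise span the full page width in HTML output.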


Next, we can customize the look and feel of particular rows and columns.

Let’s see an example here, where we make columns 8 to 12 bold and give the last three rows an italic gold font on a blue background.

base %>%
  kable_styling(bootstrap_options = "bordered") %>%
  column_spec(8:12, bold = TRUE) %>%
  row_spec(4:6, italic = TRUE, color = "gold", background = "blue")
kable example with card data
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1


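We can also create groups for our columns with add_header_above(). It takes a named vector: each name is a group label, and each value is the number of columns that group spans. The values should add up to the number of columns in the table (here 4 + 2 + 6 = 12).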


base %>%
  kable_styling(bootstrap_options = "bordered") %>%
  add_header_above(c("Group 1" = 4, "Group 2" = 2, "Group 3" = 6))
kable example with card data
Group 1 (columns 1-4) | Group 2 (columns 5-6) | Group 3 (columns 7-12)
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1


Data Aggregation

In the first stage of our analysis we are going to group our data in the form of a simple frequency table.

First, let’s take a look at the distribution of income in our sample and verify the tabular accuracy using the TAI measure:

options(scipen=999)

limits<- cut(CreditCard$income,seq(0,14,by=2))
tabelka <- freq(limits,type="html")
tabelka
## $`x:`
##                x label Freq Percent Valid Percent Cumulative Percent
##    Valid   (0,2]        236    17.9          17.9               17.9
##            (2,4]        783    59.4          59.4               77.3
##            (4,6]        205    15.5          15.5               92.8
##            (6,8]         63     4.8           4.8               97.6
##           (8,10]         23     1.7           1.7               99.3
##          (10,12]          7     0.5           0.5               99.8
##          (12,14]          2     0.2           0.2              100.0
##            Total       1319   100.0         100.0                   
##  Missing <blank>          0     0.0                                 
##             <NA>          0     0.0                                 
##            Total       1319   100.0

Without ‘kable’ styling it’s quite ugly, right? ;-)

Tabular accuracy

The tabular accuracy index (TAI), described by Jenks and Caspall in 1971, is used to optimize the class division used in cartograms, frequency tables, etc.

The TAI indicator takes values between 0 and 1. It is computed as one minus a ratio: the numerator of the ratio is the sum of the absolute deviations of the values from the mean of the class they were assigned to, and the denominator is the sum of the absolute deviations of the entire data set from its overall mean.

The better the class division reflects the nature of the data, the larger the indicator will be. As the number of classes increases, the indicator tends to take on larger values.
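
To make the definition concrete, here is a minimal base R sketch that computes the index by hand as one minus the ratio of within-class to total absolute deviations. The vector x, the breaks, and all object names are made up for illustration. To my understanding this is the same quantity that jenks.tests() from classInt reports as "Tabular accuracy", but treat that as an assumption rather than a guarantee.

```r
# Toy data and class limits (made up for illustration)
x <- c(1, 2, 3, 10, 11, 12, 30, 31, 32)
breaks <- c(0, 5, 20, 40)

# Assign every value to a class
cls <- cut(x, breaks = breaks)

# Sum of absolute deviations of the values from their own class mean
within_dev <- sum(tapply(x, cls, function(v) sum(abs(v - mean(v)))))

# Sum of absolute deviations of all values from the overall mean
total_dev <- sum(abs(x - mean(x)))

# Tabular accuracy index: close to 1 when the classes are homogeneous
tai <- 1 - within_dev / total_dev
tai
```

Because the three toy classes are tight clusters, the within-class deviations are small relative to the total, and the index comes out close to 1.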

Let’s calculate the TAI index to check the properties of the tabulated data:

library(classInt) # classIntervals() and jenks.tests()
tabelka2 <- classIntervals(CreditCard$income, n=7, style="fixed", fixedBreaks=seq(0,14,by=2))
jenks.tests(tabelka2)
##        # classes  Goodness of fit Tabular accuracy 
##        7.0000000        0.9085328        0.6568085

As we can see - TAI index…

We can use different recipes… (styles):

tabelka3<-classIntervals(CreditCard$income, n=10, style="sd")
plot(tabelka3,pal=c(1:10))

jenks.tests(tabelka3)
##        # classes  Goodness of fit Tabular accuracy 
##        8.0000000        0.9274792        0.6909392

Still, the TAI indicator is not satisfactory. What should we change in the final frequency table design?

hist(CreditCard$income)

Continuous variables

We can calculate the absolute and relative frequencies of a vector x with the function ‘Freq’ from the DescTools package. Continuous (numeric) variables will be cut using the same logic as used by the function hist. Categorical variables will be aggregated by table. The result will contain single and cumulative frequencies for both absolute values and percentages (the latter are reported as proportions between 0 and 1, as the table below shows).

tabela4<-Freq(CreditCard$income,breaks=seq(0,14,by=2),useNA="ifany")

tabela4 %>%
  kable(col.names = c("Incomes in kUSD","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Incomes in kUSD Frequency Percentage % Cumulative frequency Cumulative percentage %
[0,2] 236 0.1789234 236 0.1789234
(2,4] 783 0.5936315 1019 0.7725550
(4,6] 205 0.1554208 1224 0.9279757
(6,8] 63 0.0477635 1287 0.9757392
(8,10] 23 0.0174375 1310 0.9931766
(10,12] 7 0.0053071 1317 0.9984837
(12,14] 2 0.0015163 1319 1.0000000
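
For readers who want to see what such a table consists of, the following base R sketch builds the same five columns with cut(), table(), and cumsum(). It is an illustration only, not the DescTools implementation: the income values are made up, and the object and column names are assumptions chosen for the example.

```r
# Toy incomes (made up for illustration)
income <- c(1.5, 2.4, 2.5, 3.1, 3.9, 4.5, 5.2, 6.8, 9.7, 13.0)

# Bin into classes of width 2; include.lowest = TRUE closes the first
# interval on the left, giving [0,2] instead of (0,2]
bins <- cut(income, breaks = seq(0, 14, by = 2), include.lowest = TRUE)

freq <- as.vector(table(bins))
freq_table <- data.frame(
  level   = levels(bins),
  freq    = freq,                      # absolute frequency
  perc    = freq / sum(freq),          # relative frequency (proportion)
  cumfreq = cumsum(freq),              # cumulative absolute frequency
  cumperc = cumsum(freq) / sum(freq)   # cumulative proportion
)
freq_table
```

The resulting data frame can be passed to kable() or kbl() in the same way as the output of Freq().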

BTW: what about TAI of that table?…

Categorical variables

Now, let’s take a look at the categorical data and make some tabulations. The xtabs function works like table except it can produce tables from frequencies using the formula interface.
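
A small sketch of what "from frequencies" means: when the data are already aggregated, the count column goes on the left-hand side of the formula and xtabs() sums it within each cell. The data frame counts and its numbers are made up for illustration.

```r
# Pre-aggregated counts (made-up numbers), one row per cell of the table
counts <- data.frame(
  card  = c("no", "no", "yes", "yes"),
  owner = c("no", "yes", "no", "yes"),
  n     = c(20, 10, 50, 45)
)

# The left-hand side of the formula holds the counts to be summed
tab <- xtabs(n ~ card + owner, data = counts)
tab
```

With raw, one-row-per-observation data the left-hand side is simply left empty, as in the examples that follow.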

Let’s say we want to see a table showing how many card applications were accepted and how many were not:

xtabs(~ card, data = CreditCard)
## card
##   no  yes 
##  296 1023

We may easily produce cross-tabs as well, for example application status vs. home ownership:

crosstab<-xtabs(~ card + owner, data=CreditCard)
crosstab
##      owner
## card   no yes
##   no  206  90
##   yes 532 491

and transform it into a pretty HTML table with the kbl() function:

crosstab %>% 
  kbl() %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = T) %>%
  column_spec(2, background = "yellow")
no yes
no 206 90
yes 532 491

Data Visualization

We will explore the “ggplot2” package of the tidyverse for data visualization purposes. Every “ggplot2” plot involves the following three mandatory components:

  1. Data
  2. An aesthetic mapping
  3. Geoms (aka objects)

The following components can also optionally be added:

  1. Stats (aka transformations)
  2. Scales
  3. Facets
  4. Coordinate systems
  5. Position adjustments
  6. Themes

Please note that the code in this tutorial was adapted from Chapter 3 of the book “R for Data Science” by Hadley Wickham and Garrett Grolemund.

The full book can be found at: https://r4ds.had.co.nz/

A good cheat sheet for ggplot2 functions can be found at: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Scatterplots

Let’s create an extremely simple scatterplot.

We will use the function ggplot() to do this.

Any ggplot graph starts with this function, followed by one or more further functions, joined with +, that add objects.

The objects on a graph in the case of a scatterplot are points. The function we add to it is geom_point().

These functions rely on aes(), called inside them, to map variables to aesthetics.

The data and aesthetic mapping components can be added to either the ggplot() or geom functions.

ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy))
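
As a small illustration of the point that data and mappings can live in either place, the sketch below declares both in ggplot(), so that geom_point() inherits them. The object name p_global is an assumption made for the example; the plot should look the same as the one above.

```r
library(ggplot2)

# Data and mapping declared once in ggplot(); geom_point() inherits both
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
p_global
```

Putting the mapping in ggplot() is convenient when several geoms share it; putting it in the geom keeps it local to that layer.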


This is one of the most basic graphs that one can make using the ggplot2 framework.

Next, let’s add color.

geom_point() understands the following aesthetics: x, y, alpha, color, fill, group, shape, size, and stroke (see help documentation).

Let’s map the color argument to the variable “class” from mpg.

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

This is not the only way to color objects.

Including the color argument inside of the aes() function can map colors to a choice of variable.

However, we can specify colors manually, by specifying color outside of the aes() function. We will also illustrate the “size” argument.

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, size = class), color = "blue")
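
The difference between setting a color and mapping one is easy to get wrong, so here is a minimal sketch of both. The object names p_set and p_mapped are assumptions made for the example; layer_data() is used only to inspect the colors that ggplot2 actually computed.

```r
library(ggplot2)

# Constant color: set outside aes(), so every point is drawn in blue
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Same string inside aes(): it is treated as a one-level variable, so
# ggplot2 picks a color from its default palette and adds a legend
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

If points come out in an unexpected color with a legend entry reading "blue", the constant has most likely ended up inside aes().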

Barplots

Next, let’s examine other objects that we can plot using ggplot(). We will create a bar chart using the function geom_bar().

ggplot(mpg) +
  geom_bar(aes(x = class))


With the geom_bar() function, we have a great use case for a stat transformation: by default it does not plot the raw data but counts the number of rows in each class.

The following code can be used to convert these counts to proportions:

ggplot(mpg) +
  # after_stat() replaces stat(), which is deprecated in recent ggplot2 versions
  geom_bar(aes(x = class, y = after_stat(prop), group = 1))
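
To check what the stat computes, the sketch below rebuilds the proportion chart and inspects the bar heights with layer_data(). It is written with after_stat(), the current ggplot2 spelling of the helper that refers to computed variables; the object names p_prop and heights are assumptions made for the example.

```r
library(ggplot2)

p_prop <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(prop), group = 1))

# Bar heights as computed by the stat: one per class, summing to 1
heights <- layer_data(p_prop)$y
length(heights)
sum(heights)
```

The group = 1 part matters: without it each bar forms its own group, and every proportion is computed relative to that bar alone, so all bars would have height 1.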

Histograms

Next, let’s create a histogram with the geom_histogram() function.

ggplot(mpg) +
  geom_histogram(aes(x = hwy))

The geom_histogram() function accepts the argument “binwidth”, and has two key arguments for color: fill (this controls the interior color of the bars), and color (this controls their border).

Let’s fill all these in:

ggplot(mpg) +
  geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold")


geom_histogram() provides a great opportunity to modify the scales.

Notice in this example that the axis is automatically broken up by units of 10, and does not begin at 0.

We can modify this with the function scale_x_continuous(), as well as the y-axis with the function scale_y_continuous().

There are three key arguments we will feed these functions: “breaks”, “limits”, and “expand”.

“breaks” defines where the tick marks are placed on the axis.

“limits” defines the beginning and end of the axis, and “expand = c(0,0)” removes the default padding around the limits, so that an axis with a lower limit of 0 starts exactly at 0.

ggplot(mpg) +
  geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold") +
  scale_x_continuous(breaks = seq(0, 45, 5), limits = c(0, 50), expand = c(0,0)) +
  scale_y_continuous(breaks = seq(0, 90, 10), limits = c(0, 90), expand = c(0,0))

Boxplots

Next, we will create boxplots.

p <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class))
p

Facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different.

Read more about facets here.

Notice the use of the fig.height and fig.width chunk options in the R Markdown source of this document; they control the size of the rendered figure.

Key arguments to facet_wrap() are “facets”, “nrow”, and “ncol”.

ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class)) +
  facet_wrap(facets = ~cyl, nrow = 2, ncol = 2)

Coordinates

Other coordinate systems can be applied to graphs created from ggplot2.

One example is coord_polar(), which uses polar coordinates. Most of these are quite rare. Probably the most common one is coord_flip(), which will flip the X and Y axes. Let’s also illustrate the labs() function, which can be used to change labels.

ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl))) +
  labs(title = "Cylinders by Class", fill = "cylinders") +
  coord_flip()


These bars are stacked on top of each other, because the “cyl” variable is mapped to the “fill” aesthetic. There are various position adjustments that can be used. Again, most of these are not very common, but a common one is “position = ‘dodge’”, which will put items side by side.

See this example:

ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl)), position = "dodge") +
  labs(title = "Cylinders by Class", fill = "cylinders") + 
  coord_flip()
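
As one more illustration of a position adjustment, the sketch below uses position = "fill", which stacks the bars and rescales each of them to the same total height of 1, so that the shares of cylinders within each class can be compared. The object name p_fill and the axis label "share" are assumptions made for the example.

```r
library(ggplot2)

p_fill <- ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl)), position = "fill") +
  labs(title = "Cylinders by Class", y = "share", fill = "cylinders") +
  coord_flip()
p_fill
```

Compared with "dodge", this view hides the absolute counts, so it is best used when only the composition of each class matters.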

Themes

Lastly, we can alter the “theme”, or the overall appearance of our plot.

I recommend using the ggThemeAssist package, because this will make this incredibly easy, with an interface that will automatically generate reproducible code.

This can be used by highlighting a ggplot2 object, and navigating to Addins > ggplot Theme Assistant.

We’ll make the following changes: eliminating the panel grid lines, eliminating axis ticks, adding a title called “Boxplot Example”, making it bigger and putting it in bold, and adjusting it to the center.

# p
p + theme(axis.ticks = element_line(linetype = "blank"),
    panel.grid.major = element_line(linetype = "blank"),
    panel.grid.minor = element_line(linetype = "blank"),
    plot.title = element_text(size = 14,
        face = "bold", hjust = 0.5)) +labs(title = "Boxplot Example")
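
The code generated by the add-in removes elements by giving them a blank line type. A hand-written version would more commonly use element_blank() for this; the sketch below is such an alternative, not the add-in's output. It rebuilds the boxplot so that it runs on its own, and the object name p_clean is an assumption made for the example.

```r
library(ggplot2)

# Same boxplot as in the Boxplots section, rebuilt so the sketch is self-contained
p <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class))

p_clean <- p +
  labs(title = "Boxplot Example") +
  theme(
    axis.ticks = element_blank(),        # no axis ticks
    panel.grid.major = element_blank(),  # no major grid lines
    panel.grid.minor = element_blank(),  # no minor grid lines
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  )
p_clean
```

Both versions should render the same plot; element_blank() simply states the intent to draw nothing more directly.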


There are many more examples of things that can be done with ggplot2.

It is an amazingly powerful and flexible package, and it is worth getting acquainted with the cheat sheet.

Exercise 1.

Using the data on credit card applications, please present a frequency table, in a nice kable format, of the applicants’ average monthly credit card expenditures.

Exercise 2.

The data comes from https://flixgem.com/ (dataset version as of March 12, 2021). The data contains information on 9425 movies and series available on Netflix.

Answer with the most appropriate data visualization for the following questions:

  1. What is the distribution of IMDb scores for Polish movies and movie-series?

  2. What is the density function of IMDb scores for Polish movies and movie-series?

  3. What are the most popular languages available on Netflix?

For extra credit:

Extra challenge 1.: Create a chart showing actors starring in the most popular productions.

Extra challenge 2.: For movies and series, create rating charts from the various portals (Hidden Gem, IMDb, Rotten Tomatoes, Metacritic). Hint: it’s a good idea to reshape the data to long format.

Extra challenge 3.: Which film studios produce the most and how has this changed over the years?

