Kable package
In a previous weeks, we saw R Markdown in action, where multiple
things can be created in one location: code, commentary, and output.
In this chapter we will explore package which will facilitate the
creation of presentation-worthy tables: “kableExtra”.
Let’s work with the cross-sectional data on the credit history for a
sample of applicants for a type of credit card.
data(CreditCard)
cardHead <- head(CreditCard)
cardHead
## card reports age income share expenditure owner selfemp dependents
## 1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3
## 2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3
## 3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4
## 4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0
## 5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2
## 6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0
## months majorcards active
## 1 54 1 12
## 2 34 1 13
## 3 58 1 5
## 4 25 1 7
## 5 64 1 5
## 6 54 1 1
cardHead %>%
kbl()
|
card
|
reports
|
age
|
income
|
share
|
expenditure
|
owner
|
selfemp
|
dependents
|
months
|
majorcards
|
active
|
|
yes
|
0
|
37.66667
|
4.5200
|
0.0332699
|
124.983300
|
yes
|
no
|
3
|
54
|
1
|
12
|
|
yes
|
0
|
33.25000
|
2.4200
|
0.0052169
|
9.854167
|
no
|
no
|
3
|
34
|
1
|
13
|
|
yes
|
0
|
33.66667
|
4.5000
|
0.0041556
|
15.000000
|
yes
|
no
|
4
|
58
|
1
|
5
|
|
yes
|
0
|
30.50000
|
2.5400
|
0.0652138
|
137.869200
|
no
|
no
|
0
|
25
|
1
|
7
|
|
yes
|
0
|
32.16667
|
9.7867
|
0.0670506
|
546.503300
|
yes
|
no
|
2
|
64
|
1
|
5
|
|
yes
|
0
|
23.25000
|
2.5000
|
0.0444384
|
91.996670
|
no
|
no
|
0
|
54
|
1
|
1
|
Let’s tweak the appearance of this with the “align” and the “caption”
arguments.
The align argument takes a character vector with letters “l”, “c”, or
“r” - specifying where you want the columns to be aligned.
The caption argument gives a caption to the table.
base <- cardHead %>%
kbl(align = c(rep("c", 7), rep("r", 5)), caption = "kable example with card data")
base
kable example with card data
|
card
|
reports
|
age
|
income
|
share
|
expenditure
|
owner
|
selfemp
|
dependents
|
months
|
majorcards
|
active
|
|
yes
|
0
|
37.66667
|
4.5200
|
0.0332699
|
124.983300
|
yes
|
no
|
3
|
54
|
1
|
12
|
|
yes
|
0
|
33.25000
|
2.4200
|
0.0052169
|
9.854167
|
no
|
no
|
3
|
34
|
1
|
13
|
|
yes
|
0
|
33.66667
|
4.5000
|
0.0041556
|
15.000000
|
yes
|
no
|
4
|
58
|
1
|
5
|
|
yes
|
0
|
30.50000
|
2.5400
|
0.0652138
|
137.869200
|
no
|
no
|
0
|
25
|
1
|
7
|
|
yes
|
0
|
32.16667
|
9.7867
|
0.0670506
|
546.503300
|
yes
|
no
|
2
|
64
|
1
|
5
|
|
yes
|
0
|
23.25000
|
2.5000
|
0.0444384
|
91.996670
|
no
|
no
|
0
|
54
|
1
|
1
|
A key function, where we can enjoy much of the configuration for the
table, is via kable_styling().
We have options “bootstrap_options” or “latex_options”, where the
latter requires the use of the package “tinytex” and a local
installation of LaTeX.
Possible options for “bootstrap_options” include ‘basic’, ‘striped’,
‘bordered’, ‘hover’, ‘condensed’, ‘responsive’, and none.
Possible for “latex_options” include ‘basic’, ‘striped’,
‘hold_position’, ‘HOLD_position’, ‘scale_down’, and ‘repeat_header’.
base %>%
kable_styling(bootstrap_options = "striped")
kable example with card data
|
card
|
reports
|
age
|
income
|
share
|
expenditure
|
owner
|
selfemp
|
dependents
|
months
|
majorcards
|
active
|
|
yes
|
0
|
37.66667
|
4.5200
|
0.0332699
|
124.983300
|
yes
|
no
|
3
|
54
|
1
|
12
|
|
yes
|
0
|
33.25000
|
2.4200
|
0.0052169
|
9.854167
|
no
|
no
|
3
|
34
|
1
|
13
|
|
yes
|
0
|
33.66667
|
4.5000
|
0.0041556
|
15.000000
|
yes
|
no
|
4
|
58
|
1
|
5
|
|
yes
|
0
|
30.50000
|
2.5400
|
0.0652138
|
137.869200
|
no
|
no
|
0
|
25
|
1
|
7
|
|
yes
|
0
|
32.16667
|
9.7867
|
0.0670506
|
546.503300
|
yes
|
no
|
2
|
64
|
1
|
5
|
|
yes
|
0
|
23.25000
|
2.5000
|
0.0444384
|
91.996670
|
no
|
no
|
0
|
54
|
1
|
1
|
Next, we can customize the look and feel of particular rows and
columns.
Let’s see an example here, where we make the last three rows
blue.
base %>%
kable_styling(bootstrap_options = "bordered") %>%
column_spec(8:12, bold = T) %>%
row_spec(4:6, italic = T, color = "gold", background = "blue")
kable example with card data
|
card
|
reports
|
age
|
income
|
share
|
expenditure
|
owner
|
selfemp
|
dependents
|
months
|
majorcards
|
active
|
|
yes
|
0
|
37.66667
|
4.5200
|
0.0332699
|
124.983300
|
yes
|
no
|
3
|
54
|
1
|
12
|
|
yes
|
0
|
33.25000
|
2.4200
|
0.0052169
|
9.854167
|
no
|
no
|
3
|
34
|
1
|
13
|
|
yes
|
0
|
33.66667
|
4.5000
|
0.0041556
|
15.000000
|
yes
|
no
|
4
|
58
|
1
|
5
|
|
yes
|
0
|
30.50000
|
2.5400
|
0.0652138
|
137.869200
|
no
|
no
|
0
|
25
|
1
|
7
|
|
yes
|
0
|
32.16667
|
9.7867
|
0.0670506
|
546.503300
|
yes
|
no
|
2
|
64
|
1
|
5
|
|
yes
|
0
|
23.25000
|
2.5000
|
0.0444384
|
91.996670
|
no
|
no
|
0
|
54
|
1
|
1
|
We can also create groups for our columns.
base %>%
kable_styling(bootstrap_options = "bordered") %>%
add_header_above(c("Group 1" = 4, "Group 2" = 2, "Group 3" = 6))
kable example with card data
|
Group 1
|
Group 2
|
Group 3
|
|
card
|
reports
|
age
|
income
|
share
|
expenditure
|
owner
|
selfemp
|
dependents
|
months
|
majorcards
|
active
|
|
yes
|
0
|
37.66667
|
4.5200
|
0.0332699
|
124.983300
|
yes
|
no
|
3
|
54
|
1
|
12
|
|
yes
|
0
|
33.25000
|
2.4200
|
0.0052169
|
9.854167
|
no
|
no
|
3
|
34
|
1
|
13
|
|
yes
|
0
|
33.66667
|
4.5000
|
0.0041556
|
15.000000
|
yes
|
no
|
4
|
58
|
1
|
5
|
|
yes
|
0
|
30.50000
|
2.5400
|
0.0652138
|
137.869200
|
no
|
no
|
0
|
25
|
1
|
7
|
|
yes
|
0
|
32.16667
|
9.7867
|
0.0670506
|
546.503300
|
yes
|
no
|
2
|
64
|
1
|
5
|
|
yes
|
0
|
23.25000
|
2.5000
|
0.0444384
|
91.996670
|
no
|
no
|
0
|
54
|
1
|
1
|
Data Aggregation
In the first stage of our analysis we are going to group our data in
the form of the simple frequency table.
First, let’s take a look at the distribution of income in our sample
and verify the tabular accuracy using TAI measure:
options(scipen=999)
limits<- cut(CreditCard$income,seq(0,14,by=2))
tabelka <- freq(limits,type="html")
##
|
| | 0%
|
|======================================================================| 100%
tabelka
## $`x:`
## x label Freq Percent Valid Percent Cumulative Percent
## Valid (0,2] 236 17.9 17.9 17.9
## (2,4] 783 59.4 59.4 77.3
## (4,6] 205 15.5 15.5 92.8
## (6,8] 63 4.8 4.8 97.6
## (8,10] 23 1.7 1.7 99.3
## (10,12] 7 0.5 0.5 99.8
## (12,14] 2 0.2 0.2 100.0
## Total 1319 100.0 100.0
## Missing <blank> 0 0.0
## <NA> 0 0.0
## Total 1319 100.0
Without ‘kable’ styling it’s quite ugly right? ;-)
Tabular accuracy
An index of tabular accuracy TAI, described by Jenks and Casspal in
1971 is to optimize the class distribution used in a
cartograms/frequency tables etc.
The TAI indicator takes values in the range (0;1). The numerator of
the expression is the sum of the absolute deviations of the values
classified into classes, and the denominator is the sum of the absolute
deviations of the entire classified set.
The better the class division reflects the nature of the data, the
larger the indicator will be. As the number of classes increases, the
indicator will take on larger values.
Let’s calculate TAI index to check the properties of the tabulated
data:
tabelka2 <- classIntervals(CreditCard$income, n=7, style="fixed", fixedBreaks=seq(0,14,by=2))
jenks.tests(tabelka2)
## # classes Goodness of fit Tabular accuracy
## 7.0000000 0.9085328 0.6568085
As we can see - TAI index…
We can use different recipes… (styles):
tabelka3<-classIntervals(CreditCard$income, n=10, style="sd")
plot(tabelka3,pal=c(1:10))

jenks.tests(tabelka3)
## # classes Goodness of fit Tabular accuracy
## 8.0000000 0.9274792 0.6909392
Still, the TAI indicator is not satisfactory. What should we change
in the final frequency table design?
hist(CreditCard$income)

Continuous variables
We can calculate the absolute and relative frequencies of a vector x
with the function ‘Freq’ from the DescTools packages.
Continuous (numeric) variables will be cut using the same logic as used
by the function hist. Categorical variables will be aggregated by table.
The result will contain single and cumulative frequencies for both,
absolute values and percentages.
tabela4<-Freq(CreditCard$income,breaks=seq(0,14,by=2),useNA="ifany")
tabela4 %>%
kable(col.names = c("Incomes in kUSD","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>%
kable_classic(full_width = F, html_font = "Cambria")
|
Incomes in kUSD
|
Frequency
|
Percentage %
|
Cumulative frequency
|
Cumulative percentage %
|
|
[0,2]
|
236
|
0.1789234
|
236
|
0.1789234
|
|
(2,4]
|
783
|
0.5936315
|
1019
|
0.7725550
|
|
(4,6]
|
205
|
0.1554208
|
1224
|
0.9279757
|
|
(6,8]
|
63
|
0.0477635
|
1287
|
0.9757392
|
|
(8,10]
|
23
|
0.0174375
|
1310
|
0.9931766
|
|
(10,12]
|
7
|
0.0053071
|
1317
|
0.9984837
|
|
(12,14]
|
2
|
0.0015163
|
1319
|
1.0000000
|
BTW: what about TAI of that table?…
Categorical variables
Now, let’s take a look at the categorical data and make some
tabulations. The xtabs function works like table except
it can produce tables from frequencies using the formula interface.
Let’s say we want to see the table with data on how many card
applications was accepted or not:
## card
## no yes
## 296 1023
We may easily produce cross-tabs (status vs. Does the
individual own their home?) as well:
crosstab<-xtabs(~ card + owner, data=CreditCard)
crosstab
## owner
## card no yes
## no 206 90
## yes 532 491
and transform it into pretty html table with the kable function:
crosstab %>%
kbl() %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, background = "yellow")
|
|
no
|
yes
|
|
no
|
206
|
90
|
|
yes
|
532
|
491
|
Data Visualization
We will explore the “ggplot2” package of the
tidyverse for data visualization purposes. The “ggplot2” packages
involve the the following three mandatory components:
- Data
- An aesthetic mapping
- Geoms (aka objects)
The following components can also optionally be added:
- Stats (aka transformations)
- Scales
- Facets
- Coordinate systems
- Position adjustments
- Themes
Please note that code in this tutorial was adapted from Chapters 3 of
the book “R for Data Science” by Hadley Wickham and Garrett
Grolemund.
The full book can be found at: https://r4ds.had.co.nz/
A good cheat sheet for ggplot2 functions can be found at: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Scatterplots
Let’s create an extremely simple scatterplot.
We will use the function ggplot() to do this.
The format of any ggplot graph is this function, followed by another
function to add objects.
The objects on a graph in the case of a scatterplot are points. The
function we add to it is geom_point.
These functions rely on a function on the inside called
aes().
The data and aesthetic mapping components can be added to either the
ggplot() or geom functions.
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy))

This is one of the most basic graphs that one can make using the
ggplot2 framework.
Next, let’s add color.
geom_point() understands the following aesthetics: x, y,
alpha, color, fill, group, shape, size, and stroke (see help
documentation).
Let’s map the color argument to the variable “class” from mpg.
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy, color = class))

This is not the only way to color objects.
Including the color argument inside of the aes()
function can map colors to a choice of variable.
However, we can specify colors manually, by specifying color outside
of the aes() function. We will also illustrate the “size”
argument.
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy, size = class), color = "blue")

Barplots
Lastly, let’s examine other objects that we can plot using
ggplot(). We will create a bar chart using the function
geom_bar().
ggplot(mpg) +
geom_bar(aes(x = class))

With the geom_bar() function, we have a great use-case
for a stat transformation.
The following code can be used to convert these counts to
proportions:
ggplot(mpg) +
geom_bar(aes(x = class, y = stat(prop), group = 1))

Histograms
Next, let’s create a histogram with the geom_histogram()
function.
ggplot(mpg) +
geom_histogram(aes(x = hwy))

The geom_histogram() function accepts the argument
“binwidth”, and has two key arguments for color: fill (this controls the
overall color), and color (this controls the border).
Let’s fill all these in:
ggplot(mpg) +
geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold")

geom_histogram() provides a great example to modify the
scale.
Notice in this example that the axis is automatically broken up by
units of 10, and does not begin at 0.
We can modify this with the function
scale_x_continuous(), as well as the y-axis with the
function scale_y_continuous().
There are three key arguments we will feed this function: “breaks”,
“limits”, and “expand”.
“breaks” will define the breaks on the axis.
“limits” will define the beginning and end of the axis, and the
“expand” argument can be used to start the axes at 0 by using “expand =
c(0,0)”.
ggplot(mpg) +
geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold") +
scale_x_continuous(breaks = seq(0, 45, 5), limits = c(0, 50), expand = c(0,0)) +
scale_y_continuous(breaks = seq(0, 90, 10), limits = c(0, 90), expand = c(0,0))

Boxplots
Next, we will create boxplots.
p <- ggplot(mpg) +
geom_boxplot(aes(x = class, y = cty, fill = class))
p

Facets
Faceting generates small multiples each showing a different subset of
the data. Small multiples are a powerful tool for exploratory data
analysis: you can rapidly compare patterns in different parts of the
data and see whether they are the same or different.
Read more about facets here.
Notice in this document the use of the fig.height and fig.width
options.
Key arguments to facet_wrap() are “facets”, “nrow”, and
“ncol”.
ggplot(mpg) +
geom_boxplot(aes(x = class, y = cty, fill = class)) +
facet_wrap(facets = ~cyl, nrow = 2, ncol = 2)

Coordinates
Other coordinate systems can be applied to graphs created from
ggplot2.
One example is coord_polar(), which uses polar
coordinates. Most of these are quite rare. Probably the most common one
is coord_flip(), which will flip the X and Y axes. Let’s
also illustrate the labs() function, which can be used to
change labels.
ggplot(mpg) +
geom_bar(aes(x = class, fill = factor(cyl))) +
labs(title = "Cylinders by Class", fill = "cylinders") +
coord_flip()

These bars are stacked on top of each of other, due to the “cyl”
variable being mapped to the “fill” argument. There are various position
adjustments that can be used. Again, most of these are not very common,
but a common one is the argument “position = ‘dodge’”, which will put
items side-by-side.
See this example:
ggplot(mpg) +
geom_bar(aes(x = class, fill = factor(cyl)), position = "dodge") +
labs(title = "Cylinders by Class", fill = "cylinders") +
coord_flip()

Themes
Lastly, we can alter the “theme”, or the overall appearance of our
plot.
I recommend using the ggThemeAssist package, because this will make
this incredibly easy, with an interface that will automatically generate
reproducible code.
This can be used by highlighting a ggplot2 object, and navigating to
Addins > ggplot Theme Assistant.
We’ll make the following changes: eliminating the panel grid lines,
eliminating axis ticks, adding a title called “Boxplot Example”, making
it bigger and putting it in bold, and adjusting it to the center.
# p
p + theme(axis.ticks = element_line(linetype = "blank"),
panel.grid.major = element_line(linetype = "blank"),
panel.grid.minor = element_line(linetype = "blank"),
plot.title = element_text(size = 14,
face = "bold", hjust = 0.5)) +labs(title = "Boxplot Example")

There are many more examples of things that can be done with
ggplot2.
It is an amazingly powerful and flexible package, and it is worth
getting acquainted with the cheat sheet.
---
title: "Data Visualization"
author: "Your Name"
output:
  html_document: 
    theme: cerulean
    highlight: textmate
    fontsize: 8pt
    toc: yes
    code_download: yes
    toc_float:
      collapsed: no
    df_print: default
    toc_depth: 5
editor_options: 
  markdown: 
    wrap: 72
---



```{r, include=FALSE}
## Global options
options(qwraps2_markup = "markdown")
library(qwraps2)
library(arsenal)
library(e1071)
library(haven)
library(papeR)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(kableExtra)
library(summarytools)
library(classInt)
library(pastecs)
library(reporttools)
library(desctable)
library(DescTools)
library(frequency)
library(ggpubr)
library(dlookr)
library(tidyverse)
library(knitr)
library(tinytex)
library(AER)
library(ggThemeAssist)
library(SmarterPoland)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```


## Kable package

In a previous weeks, we saw R Markdown in action, where multiple things can be created in one location: code, commentary, and output. 

In this chapter we will explore package which will facilitate the creation of presentation-worthy tables: "kableExtra".

Let's work with the cross-sectional data on the credit history for a sample of applicants for a type of credit card.

```{r}
data(CreditCard)
cardHead <- head(CreditCard)
cardHead
```

<br>

```{r Tibble with "kable"}
cardHead %>%
  kbl()
```

<br>

Let's tweak the appearance of this with the "align" and the "caption" arguments.  

The align argument takes a character vector with letters "l", "c", or "r" - specifying where you want the columns to be aligned.

The caption argument gives a caption to the table.

<br>

```{r Aligned kable}
base <- cardHead %>%
  kbl(align = c(rep("c", 7), rep("r", 5)), caption = "kable example with card data")
base
```

<br>

A key function, where we can enjoy much of the configuration for the table, is via `kable_styling()`.   

We have options "bootstrap_options" or "latex_options", where the latter requires the use of the package "tinytex" and a local installation of LaTeX.  

Possible options for "bootstrap_options" include 'basic', 'striped', 'bordered', 'hover', 'condensed', 'responsive', and none.  

Possible for "latex_options" include 'basic', 'striped', 'hold_position', 'HOLD_position', 'scale_down', and 'repeat_header'.

<br>


```{r Bootstrap - striped}
base %>%
  kable_styling(bootstrap_options = "striped")
```

<br>

Next, we can customize the look and feel of particular rows and columns.   

Let's see an example here, where we make the last three rows blue.

```{r Column and row customizing}
base %>%
  kable_styling(bootstrap_options = "bordered") %>%
  column_spec(8:12, bold = T) %>%
  row_spec(4:6, italic = T, color = "gold", background = "blue")
```

<br>


We can also create groups for our columns.

<br>

```{r}
base %>%
  kable_styling(bootstrap_options = "bordered") %>%
  add_header_above(c("Group 1" = 4, "Group 2" = 2, "Group 3" = 6))
```

<br>


## Data Aggregation

In the first stage of our analysis we are going to group our data in the form of the simple frequency table.

First, let's take a look at the distribution of income in our sample and verify the tabular accuracy using TAI measure:

```{r table, message=FALSE, warning=FALSE, paged.print=FALSE}
options(scipen=999)

limits<- cut(CreditCard$income,seq(0,14,by=2))
tabelka <- freq(limits,type="html")
tabelka
```

Without 'kable' styling it's quite ugly right? ;-)


### Tabular accuracy

An index of tabular accuracy TAI, described by Jenks and Casspal in 1971 is to optimize the class distribution used in a cartograms/frequency tables etc. 

The TAI indicator takes values in the range (0;1). The numerator of the expression is the sum of the absolute deviations of the values classified into classes, and the denominator is the sum of the absolute deviations of the entire classified set. 

The better the class division reflects the nature of the data, the larger the indicator will be. As the number of classes increases, the indicator will take on larger values.

Let's calculate TAI index to check the properties of the tabulated data:

```{r tai, echo=TRUE}
tabelka2 <- classIntervals(CreditCard$income, n=7, style="fixed", fixedBreaks=seq(0,14,by=2))
jenks.tests(tabelka2)
```

As we can see - TAI index...


We can use different recipes... (styles):


``` {r}
tabelka3<-classIntervals(CreditCard$income, n=10, style="sd")
plot(tabelka3,pal=c(1:10))
jenks.tests(tabelka3)
```

Still, the TAI indicator is not satisfactory. What should we change in the final frequency table design?

```{r}
hist(CreditCard$income)
```


### Continuous variables

We can calculate the absolute and relative frequencies of a vector x with the function 'Freq' from the *DescTools* packages. Continuous (numeric) variables will be cut using the same logic as used by the function hist. Categorical variables will be aggregated by table. The result will contain single and cumulative frequencies for both, absolute values and percentages. 

```{r}
tabela4<-Freq(CreditCard$income,breaks=seq(0,14,by=2),useNA="ifany")

tabela4 %>%
  kable(col.names = c("Incomes in kUSD","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
```

BTW: what about TAI of that table?...

### Categorical variables

Now, let's take a look at the categorical data and make some tabulations. The **xtabs** function works like table except it can produce tables from frequencies using the formula interface. 

Let's say we want to see the table with data on how many card applications was accepted or not:

```{r, echo=FALSE}
xtabs(~ card, data=CreditCard)
```

We may easily produce *cross-tabs* (status vs. Does the individual own their home?) as well:

```{r}
crosstab<-xtabs(~ card + owner, data=CreditCard)
crosstab
```

and transform it into pretty html table with the kable function:

```{r}
crosstab %>% 
  kbl() %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = T) %>%
  column_spec(2, background = "yellow")
```


## Data Visualization

We will explore the **"ggplot2"** package of the tidyverse for data visualization purposes. The "ggplot2" packages involve the the following three mandatory components:

1) Data
2) An aesthetic mapping
3) Geoms (aka objects)

The following components can also optionally be added:

4) Stats (aka transformations)
5) Scales
6) Facets
7) Coordinate systems
8) Position adjustments
9) Themes

Please note that code in this tutorial was adapted from Chapters 3 of the book "R for Data Science" by Hadley Wickham and Garrett Grolemund.  

The full book can be found at: https://r4ds.had.co.nz/

A good cheat sheet for ggplot2 functions can be found at: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf


### Scatterplots


Let's create an extremely simple scatterplot.  

We will use the function `ggplot()` to do this. 

The format of any ggplot graph is this function, followed by another function to add objects.

The objects on a graph in the case of a scatterplot are points. The function we add to it is `geom_point`.

These functions rely on a function on the inside called `aes()`.

The data and aesthetic mapping components can be added to either the `ggplot()` or geom functions.  

```{r Graph 1}
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy))
``` 

<br>

This is one of the most basic graphs that one can make using the ggplot2 framework. 

Next, let's add color.

`geom_point()` understands the following aesthetics: x, y, alpha, color, fill, group, shape, size, and stroke (see help documentation).

Let's map the color argument to the variable "class" from mpg.

```{r Graph 2}
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))
```


This is not the only way to color objects.  

Including the color argument inside of the `aes()` function can map colors to a choice of variable. 

However, we can specify colors manually, by specifying color outside of the `aes()` function. We will also illustrate the "size" argument.

```{r Graph 3}
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, size = class), color = "blue")
```

### Barplots

Lastly, let's examine other objects that we can plot using `ggplot()`. We will create a bar chart using the function `geom_bar()`.  

```{r Graph 4}
ggplot(mpg) +
  geom_bar(aes(x = class))
```

<br>

With the `geom_bar()` function, we have a great use-case for a stat transformation.

The following code can be used to convert these counts to proportions:

```{r Graph 5}
ggplot(mpg) +
  geom_bar(aes(x = class, y = stat(prop), group = 1))
```

### Histograms

Next, let's create a histogram with the `geom_histogram()` function.

```{r Graph 6}
ggplot(mpg) +
  geom_histogram(aes(x = hwy))
```

The `geom_histogram()` function accepts the argument "binwidth", and has two key arguments for color: fill (this controls the overall color), and color (this controls the border).  

Let's fill all these in:

```{r Graph 7}
ggplot(mpg) +
  geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold")
```

<br>

`geom_histogram()` provides a great example to modify the scale.

Notice in this example that the axis is automatically broken up by units of 10, and does not begin at 0.

We can modify this with the function `scale_x_continuous()`, as well as the y-axis with the function `scale_y_continuous()`.  

There are three key arguments we will feed this function: "breaks", "limits", and "expand".

"breaks" will define the breaks on the axis. 

"limits" will define the beginning and end of the axis, and the "expand" argument can be used to start the axes at 0 by using "expand = c(0,0)". 

```{r Graph 8}
ggplot(mpg) +
  geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold") +
  scale_x_continuous(breaks = seq(0, 45, 5), limits = c(0, 50), expand = c(0,0)) +
  scale_y_continuous(breaks = seq(0, 90, 10), limits = c(0, 90), expand = c(0,0))
```

### Boxplots

Next, we will create boxplots.

```{r Graph 9}
p <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class))
p
```

### Facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. 

Read more about facets [here](https://ggplot2-book.org/facet.html).

Notice in this document the use of the fig.height and fig.width options.

Key arguments to `facet_wrap()` are "facets", "nrow", and "ncol".

```{r Graph 10, fig.height = 8, fig.width = 12}
ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class)) +
  facet_wrap(facets = ~cyl, nrow = 2, ncol = 2)
```

### Coordinates

Other coordinate systems can be applied to graphs created from ggplot2.  

One example is `coord_polar()`, which uses polar coordinates. Most of these are quite rare. Probably the most common one is `coord_flip()`, which will flip the X and Y axes. Let's also illustrate the `labs()` function, which can be used to change labels.

```{r Graph 11}
ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl))) +
  labs(title = "Cylinders by Class", fill = "cylinders") +
  coord_flip()
```

<br>

These bars are stacked on top of each of other, due to the "cyl" variable being mapped to the "fill" argument. There are various position adjustments that can be used. Again, most of these are not very common, but a common one is the argument "position = 'dodge'", which will put items side-by-side.   

See this example:

```{r Graph 12}
ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl)), position = "dodge") +
  labs(title = "Cylinders by Class", fill = "cylinders") + 
  coord_flip()
```

### Themes

Lastly, we can alter the "theme", or the overall appearance of our plot.  

I recommend using the ggThemeAssist package, because this will make this incredibly easy, with an interface that will automatically generate reproducible code.

This can be used by highlighting a ggplot2 object, and navigating to Addins > ggplot Theme Assistant.

We'll make the following changes: eliminating the panel grid lines, eliminating axis ticks, adding a title called "Boxplot Example", making it bigger and putting it in bold, and adjusting it to the center.

```{r Graph 13}
# p
p + theme(axis.ticks = element_line(linetype = "blank"),
    panel.grid.major = element_line(linetype = "blank"),
    panel.grid.minor = element_line(linetype = "blank"),
    plot.title = element_text(size = 14,
        face = "bold", hjust = 0.5)) +labs(title = "Boxplot Example")
```

<br>

There are many more examples of things that can be done with ggplot2.  

It is an amazingly powerful and flexible package, and it is worth getting acquainted with the cheat sheet.

## Exercise 1. 

Using data on credit card applications' status please present the frequency table with the nice, kable format for average monthly credit card expenditures of applicants.

```{r ex1}

```


## Exercise 2.

The data comes from [https://flixgem.com/](https://flixgem.com/) (dataset version as of March 12, 2021). The data contains information on 9425 movies and series available on Netlix.

```{r ex2, message=FALSE, warning=FALSE, include=FALSE}
download.file("https://raw.githubusercontent.com/kflisikowski/ds/master/netflix-dataset.csv?raw=true", destfile ="dane.csv",mode="wb")
mydata<-read.csv(file="dane.csv",encoding ="UTF-8",header=TRUE,sep = ",")
attach(mydata)
```

Answer with the most appropriate data visualization for the following questions:

1. What is the distribution of Imdb scores for Polish movies and movie-series?

```{r ex2_1}

```

2. What is the density function of Imdb scores for Polish movies and movie-series?

```{r ex2_2}

```

3. What are the most popular languages available on Netflix?

```{r ex2_3}

```

For extra credits:

*Extra challenge 1.*: Create a chart showing actors starring in the most popular productions.

```{r challenge1}

```


*Extra challenge 2.*: For movies and series, create rating charts from the various portals (Hidden Gem, IMDb, Rotten Tomatoes, Metacritic). Hint: it's a good idea to reshape the data to *long* format.

```{r challenge2}

```

*Extra challenge 3.*: Which film studios produce the most and how has this changed over the years?

```{r challenge3}

```


