Pivoting

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.0.5

Wide vs Long Data

Sometimes, we need to transform the shape of our data. The wide form of data is an efficient way to store data - especially time-series data.

However, we often need data in a long for for data analysis, so shape transformations are important. I frequently have both versions of my data when doing analysis depending on if I’m graphing it, making a table for it, or analyzing it with certain statistical methods.

To accomplish this, the first thing we’ll have to do is reshape the data, using the pivot_longer() function from the tidyr package. This function gathers multiple columns and collapses them into new name-value pairs. This transforms data from wide format into a long format, where:

  • The first argument defines the columns to gather
  • The names_to argument is the name of the new column that you are creating which contains the values of the column headings that you are gathering
  • The values_to argument is the name of the new column that will contain the values themselves; you can indicate the name of this column with the values_to argument.

To select all the columns except a specific column, use the - (subtraction) operator. (This process is also referred to as negative indexing.)

Long to wide

library(gapminder)
library(tidyverse)

gapminder # dataset we start with
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
# data comes from gapminder package

A tibble with 1,704 rows and 6 columns. Unit of Analysis is the country-year.

pivot_wider()

Let’s create a “wide” version of gapminder using pivot_wider(). This starts with the data named “gapminder” that is from the gapminder package. select() then selects the four variables mentioned, and then uses pivot_wider() to make new columns from every year that was in the “year” column and puts the values from “lifeExp” into those newly created year columns. Now we will have a lot less rows and a lot more columns in our new tibble named gapminder_wide.

How many rows and columns do we have now? What is our unit of analysis?

gapminder_wide <- gapminder %>% 
  select(country, continent, 
         lifeExp, year) %>% 
  pivot_wider(names_from = year, 
              values_from = lifeExp)

gapminder_wide          # View gapminder_wide 
## # A tibble: 142 x 14
##    country     continent `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987`
##    <fct>       <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Afghanistan Asia        28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8
##  2 Albania     Europe      55.2   59.3   64.8   66.2   67.7   68.9   70.4   72  
##  3 Algeria     Africa      43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8
##  4 Angola      Africa      30.0   32.0   34     36.0   37.9   39.5   39.9   39.9
##  5 Argentina   Americas    62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8
##  6 Australia   Oceania     69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3
##  7 Austria     Europe      66.8   67.5   69.5   70.1   70.6   72.2   73.2   74.9
##  8 Bahrain     Asia        50.9   53.8   56.9   59.9   63.3   65.6   69.1   70.8
##  9 Bangladesh  Asia        37.5   39.3   41.2   43.5   45.3   46.9   50.0   52.8
## 10 Belgium     Europe      68     69.2   70.2   70.9   71.4   72.8   73.9   75.4
## # ... with 132 more rows, and 4 more variables: 1992 <dbl>, 1997 <dbl>,
## #   2002 <dbl>, 2007 <dbl>

Wide to long

gapminder_long <- gapminder_wide %>% 
  pivot_longer(`1952`:`2007`, # backticks needed because the variable name starts with a number which confuses R
               # column names look like number values without them
               names_to = "year", 
               values_to = "lifeExp") %>% 
  mutate(year = as.numeric(year)) 

  gapminder_long
## # A tibble: 1,704 x 4
##    country     continent  year lifeExp
##    <fct>       <fct>     <dbl>   <dbl>
##  1 Afghanistan Asia       1952    28.8
##  2 Afghanistan Asia       1957    30.3
##  3 Afghanistan Asia       1962    32.0
##  4 Afghanistan Asia       1967    34.0
##  5 Afghanistan Asia       1972    36.1
##  6 Afghanistan Asia       1977    38.4
##  7 Afghanistan Asia       1982    39.9
##  8 Afghanistan Asia       1987    40.8
##  9 Afghanistan Asia       1992    41.7
## 10 Afghanistan Asia       1997    41.8
## # ... with 1,694 more rows

Remember to visualize the data first in your head (even better: on a piece of paper). Then think about the code.

# Save your output as a new dataset
_______________=  
# Call the dataset that you want to modify
pivot_wider(_________________,   
            # Indicate the column from where the name(s) of the new column(s) should be taken
            names_from = ____________ , 
            # Indicate the column(s) from where the values of the new column(s) should be taken.
            values_from = ______________)

Tables

Setup

# warning = FALSE and message = FALSE are so that the output that shows up when you load a package isn't included in the final document when you knit it.

#install.packages("kableExtra")

library(tidyverse)
library("kableExtra")
library("stevedata")

data <- gss_wages

Data

stat_info_2000 = 
data %>%
  filter( year >= 2000) %>%
  group_by(year) %>%
  summarize(average_income = mean(realrinc, na.rm = T),
            num_children = mean(childs, na.rm = T))

Simple tables with KABLE

kable(stat_info_2000)
year average_income num_children
2000 22110.13 1.799357
2002 27276.64 1.810507
2004 26019.78 1.822650
2006 23539.08 1.898154
2008 32395.10 1.939109
2010 20696.68 1.885350
2012 27587.48 1.891933
2014 23282.27 1.820949
2016 23772.92 1.852046
2018 24994.19 1.855375
kable(stat_info_2000, 
      "pipe")
year average_income num_children
2000 22110.13 1.799357
2002 27276.64 1.810507
2004 26019.78 1.822650
2006 23539.08 1.898154
2008 32395.10 1.939109
2010 20696.68 1.885350
2012 27587.48 1.891933
2014 23282.27 1.820949
2016 23772.92 1.852046
2018 24994.19 1.855375
kable(stat_info_2000, "simple",
      col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4), 
      format.args = list(big.mark = ",")) 
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

kableExtra

Keep building from the previous table with the function kable_styling.

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ","))%>%
  
  kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

bootstrap_options

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ","))%>%
  
  kable_styling(bootstrap_options = c("striped", "bordered"),
                font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2
stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ","))%>%
  
  kable_styling(bootstrap_options = c("hover", "condensed"),
                font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

Themes

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4), 
      format.args = list(big.mark = ","))%>%
  
  kable_paper(font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2
stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4), 
      format.args = list(big.mark = ","))%>%
  
  kable_material_dark(font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

Style of rows and columns

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ","))%>%
 
   kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F) %>%
  
  row_spec(5, bold = T, color = "red") %>%
  row_spec(2, underline = T) %>%
  row_spec(4, background = "#457b9d") %>%
  column_spec(3, strikeout = T)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

Group rows and columns

Group columns

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4, 2), 
      format.args = list(big.mark = ","))%>%
 
   kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F) %>%
  
  add_header_above(c("Year" = 1, "Year Average" = 2))
Average Income by Year Since 2000
Year
Year Average
Year Income Number of Children
2,000 22,110.13 1.80
2,002 27,276.64 1.81
2,004 26,019.78 1.82
2,006 23,539.08 1.90
2,008 32,395.10 1.94
2,010 20,696.68 1.89
2,012 27,587.48 1.89
2,014 23,282.27 1.82
2,016 23,772.92 1.85
2,018 24,994.19 1.86

Group rows

stat_info_2000 %>% 
   kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  # Round the number of digits
      format.args = list(big.mark = ","))%>%
 
   kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F) %>%
  
  group_rows("2009-2010", 1, 6, label_row_css = "background-color: #666; color: #fff;") %>%
  group_rows("2011-2018", 7, 10, label_row_css = "background-color: #666; color: #fff;")
Average Income by Year Since 2000
Year Average Income Number of Children
2009-2010
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2011-2018
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2
stat_info_2000 %>% 
  mutate(year = as.character(year)) %>%
  kbl(col.names = c("Year[note]", "Average Income", "# Children"), #<<
      align = "ccc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 2, 2),  # Round the # digits
      format.args = list(big.mark = ","))%>%
   kable_styling(font_size = 20,
                html_font = "Cambria", 
                full_width = F) %>%
  
  row_spec(0, bold = T, background = "#e5e5e5") %>%

  group_rows("Before crisis", 1, 5, label_row_css = "background-color: #666; color: #fff;") %>%
  group_rows("After crisis[note]", 6, 10, label_row_css = "background-color: #666; color: #fff;") %>%
  
  add_footnote(c("Only year up to 2000 were included", "We consider the end of crisis after 2009"), notation = "alphabet")
Average Income by Year Since 2000
Yeara Average Income # Children
Before crisis
2000 22,110.13 1.80
2002 27,276.64 1.81
2004 26,019.78 1.82
2006 23,539.08 1.90
2008 32,395.10 1.94
After crisisb
2010 20,696.68 1.89
2012 27,587.48 1.89
2014 23,282.27 1.82
2016 23,772.92 1.85
2018 24,994.19 1.86
a Only year up to 2000 were included
b We consider the end of crisis after 2009

R markdown

Setup

  • Open R Studio

  • Click on File -> New File -> R Markdown.

    • If you have never used R markdown before, it should prompt you to install a set of packages. Please say “yes” and install them.

    • If R markdown is already installed, it should prompt you to name a new file and open a new document similar to a R script.

  • In either cases, if you have never knit a R markdown document into a PDF document, please install tinytex by running both these commands:

install.packages('tinytex')
tinytex::install_tinytex()

Reproducibility

What does reproducibility mean?

It is about setting up all your processes in a way that is repeatable by others and well documented.

Why do we care about reproducibility?

  • Can I take your script and run it again without making any change?

  • Can I slightly change the original data, run the script again and not have any warning messages?

  • Can I take your script, not run it, and still understand what you are doing?

  • Can you take your script in one year and still perfectly understand what you did?

Reproducibility - how?

  1. At the most basic level, it means that you are using a repeatable method for every actions that you are doing - your R script! R script can be saved, annotated, and shared with others so that they can run the same workflow and get to the same output. Avoid any manual step when working with your data.

  2. Document your process. Write comments and annotations at each step of the way. Other people or you after one year should be able to quickly understand what you did and WHY you did it.

  3. The best R script are simple and well annotated. You have to re-check your script after you are done with it to clean it up a bit. In some cases, you might realize that some steps could have been done more efficiently.

A couple of things I have noticed:

  • Avoid reporting numbers manually: if you are setting the hline or vline to be equal to the mean or setting the abline, put the formula in the code for the plot.
  • Avoid merging datasets unless it is necessary. Use pipes instead.
ggplot(election_turnout) +
  geom_density(mapping = aes(x = turnoutho)) + 
  geom_vline(xintercept = 23.5)

ggplot(election_turnout) +
  geom_density(mapping = aes(x = turnoutho)) + 
  geom_vline(xintercept = mean(election_turnout$turnoutho))

4.Version control means keep old versions of your codes and data so that you can track down errors but also go back to old versions of the data. Always keep the original dataset but save intermediate versions and the final one. Use progressive names (e.g., numbers or dates of your edits: project_1107, project_1109). You can also set up a folder with Box that saves versions of your document without changing the name. If have shared data projects, teams, or actually do data analysis in the future, you NEED to use Github. It will save your life when sharing code (and honestly even if sharing an Excel file between team members). It can track line by line additions and deletions. I could do a whole lecture just on Github.

  1. Keep your work on a cloud system rather than your own computer as much as possible. A few consideration:
  • Pay attention to privacy when saving online sensitive data (e.g., data containing personal information)
  • You can use password-protected files or remove personal identifiable information.

Transparency

Transparency means to disclose your thinking process and any discovery that you make from your own data. Report all information that you extracted from the data, including limitations or issues with the data collection.

Some of these information is reported in your final report - the level of detail depends on the purpose of your report.

In most cases, you should create a methodological note at the end of the your report all these details: data used, missing value issues, limitations of your data collection, and so on.

R markdown files

R markdown files are data-driven docs that combine TEXT, CODES, and RESULTS.

They promote transparency, reproducibility, and replicability as they allow others to see your annotations + codes + outputs. The balance of these three elements depends on the purpose of your work.

They are a great way to share your results to non-R users.

R markdown is a set of rules to format these data-driven documents. It supports dozens of output formats, like PDFs, Word files, slideshows, and more.

Type of R markdown documents

Type of R markdown documents

Why should you create R markdown documents?

  • For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.

  • For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).

  • As an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.

knitr::include_graphics("rmarkdown.png")

When your script is ready, you can knit it and produce a complete report containing all text, code, and results.

knitr::include_graphics("knitting.png")

Formatting

There are few conventions on how to write text in R markdown. For instance:

# indicates the main Title

## Subtitle

### Header 3
**Bold text**
*Italic text*

R chunks

You can create a new R chunk by:

  1. using the keyboard shortcut Cmd/Ctrl + Alt + I (recommended option)

OR

  1. by manually typing the chunk delimiters at the beginning and end.
knitr::include_graphics("chunk.png")

Let’s look at the breakdown of a R chunk

{r NAME_OF_THE_CHUNK, OPTION1, OPTION2}

You can give the chunk any name (use something meaningful).

{r project_mean, OPTIONS}

R chunk options

There are several options that you can put in a r chunk. You can see the full list here

eval = F
Show the code but don’t run it

echo = FALSE
Code doesn’t appear in the output file. Results appers. When including an image, you generally want to use this.

message = FALSE
Prevent messages from appearing in the output file

warning = FALSE
Prevents warnings from appearing in the output file

{r project_mean, echo = F, warning = F, message = F}

The setup chunk

The first R chunk in a document is generally called ‘setup’. The name ‘setup’ is used only for the very first chunk where you can set up settings to be applied to the entire document.

{r setup, echo = F}
knitr::opts_chunk$set(echo = FALSE)

This setup, for instance, is great for reports where you might not want to show your code but only the results.

Iinline codes

If needed, you can also set text into your inline codes simply using the format `mean(XXX)`.

This will report your results into the text and save you time when updating your work!