General Announcements

Late Homework: Homeworks 1-4 will each receive a zero if not submitted by Sunday, November 21st at 11:59 PM (midnight).

Homework 5 needs to be submitted on time (Deadline extended to Saturday, Nov. 27th at midnight).

Your final project NEEDS to be submitted on time. There is no flexibility here. If it is submitted late, I will not grade it.

All R Cheatsheets

Cheat Sheet for Data Vis

Monday’s Quiz

knitr::include_graphics("NYTcovid.png") 

1. In a sentence or two, what is the main information that is being communicated by this graph?

Short answer: “The gap in Covid’s death toll between red and blue America has grown faster over the past month than at any previous point”

“In October, 25 out of every 100,000 residents of heavily Trump counties died from Covid, more than three times higher than the rate in heavily Biden counties (7.8 per 100,000). October was the fifth consecutive month that the percentage gap between the death rates in Trump counties and Biden counties widened.”

2. What are the variables being presented in this graph?

Time on the x-axis and deaths (cumulative number per 100,000 people) on the y-axis. Trends are shown for counties categorized as majority Trump voters, majority Biden voters, or closer to a 50-50 split between Trump and Biden voters.
- this involves having the county name, 2020 election voting data of the county, covid cases per day

3. What are things you do like about this graph?

Axis isn’t cluttered, color is used to indicate trends for different political-leaning areas. It has a title and subtitle, original photo from New York Times had a footnote saying that they didn’t have data for Alaska or Washington D.C. Didn’t need a legend because subtitle contained the info that would normally be in a legend.

4. Are there things you would improve?

Graphing things as a cumulative number is always a bit weird. Cumulative values can only increase (or stay flat) over time, and that alone might confuse some people.
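A small sketch of the point about cumulative totals. The daily counts below are made up for illustration and are not the NYT data. Even when the daily values fall, the running total never goes down.

```r
# Hypothetical daily counts (illustration only, not real Covid data)
daily <- c(3, 5, 2, 0, 4)

# Running total: each entry is the sum of all days so far
cumulative <- cumsum(daily)
cumulative

# A cumulative series never decreases, even when the daily values drop
never_decreases <- all(diff(cumulative) >= 0)
never_decreases
```

This is one reason a daily (non-cumulative) view of the same data can be easier to read.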

The New York Times also included this graphic in their Covid update. Notice how they use the same color scheme between the two images and try to visualize the same data in different ways.

knitr::include_graphics("NYTcovid_Daily.png") 

Last Wednesday’s Quiz

categories <- c("Category 1", "Category 2", "Category 3", "Category 4", "Category 5")
A <- c(7, 5, 4, 4, 3)
B <- c(6, 6, 3, 5, 5)
C <- c(5, 3, 1, 1, 3)
D <- c(3, 1, 1, 2, 2)

quizgraphs <- tibble(categories, A, B, C, D) #combines 5 objects into one tibble

quizgraphs # view your tibble
## # A tibble: 5 x 5
##   categories     A     B     C     D
##   <chr>      <dbl> <dbl> <dbl> <dbl>
## 1 Category 1     7     6     5     3
## 2 Category 2     5     6     3     1
## 3 Category 3     4     3     1     1
## 4 Category 4     4     5     1     2
## 5 Category 5     3     5     3     2

Look familiar?

Currently this is in a wide format. We may not quite know what we are doing, but we can guess what to put in the code and see what happens:

quizgraphs %>%  
  ggplot() +
  geom_bar(aes(x = categories)) # puts Categories variable on X-axis

This counts how many times each category shows up in the “categories” variable, which is only once. It’s not a very useful graph, but it is a start.

What if we put one of the letter variables on the x axis? This would be similar to a histogram. It would be more of a histogram if the bars were touching.

quizgraphs %>%  
  ggplot() +
  geom_bar(aes(x = A)) # use variable A on the x-axis

geom_bar has additional arguments that you can use. One of them is the stat = argument.

The default option is “count”. The default behavior is to count the rows for each x value and it doesn’t expect a y-value since it is doing the counting itself.

If you don’t want R to aggregate your data for you and instead want it to use the values provided in the data as the y-value, then use stat = "identity" in the line of code.

If your data is already summarized in some form, you may have to do something like this!
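To make the difference concrete, here is a rough sketch of what the two stat options work with, using column A from the quiz data. The use of table() is just a stand-in for the counting step; it is not how ggplot2 does it internally.

```r
# Column A from the quiz data
A <- c(7, 5, 4, 4, 3)

# Roughly what stat = "count" computes: how many rows share each x value
counts <- table(A)
counts

# What stat = "identity" uses: the supplied y values, unchanged
heights <- A
heights
```

With "count" the bar heights come from tallying rows; with "identity" they come straight from your data.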

However, geom_col() can be used instead if you are providing the x and y values for the graph. The two graphs below look the same but were made with different code:

quizgraphs %>%  
  ggplot() +
  geom_bar(aes(x = categories, y = A), stat = "identity") # use variable A on the y-axis

#graphing the wide version:
quizgraphs %>% 
  ggplot() +
  geom_col(aes(x = categories, y = A)) 

# x axis is each category in Categories
# y axis is values for column A for each Category

Okay, this is closer to what we want. Each bar now shows the value in column A for one category, whereas the earlier geom_bar(aes(x = A)) graph showed how many times each value was present (e.g. the value 4 was in the A column twice; 6 wasn’t there at all). But what if we want to graph more than one of the letters at once?

If I wanted to graph multiple Letter groups and Categories, it would be easier to do so if it were in a long format where only the Categories, letters, and values were the 3 columns.

# Create a tibble named quiz_long from the old tibble named quizgraphs
quiz_long <- quizgraphs %>%
  pivot_longer(c(A:D),               # use column named A through the column named D 
               names_to = "letters",  # send the names of the columns to a new variable that we are naming "letters"
               values_to = "values")  # send the values that were in all of the rows and columns into one new variable named "values"

quiz_long        # 20 by 3 tibble
## # A tibble: 20 x 3
##    categories letters values
##    <chr>      <chr>    <dbl>
##  1 Category 1 A            7
##  2 Category 1 B            6
##  3 Category 1 C            5
##  4 Category 1 D            3
##  5 Category 2 A            5
##  6 Category 2 B            6
##  7 Category 2 C            3
##  8 Category 2 D            1
##  9 Category 3 A            4
## 10 Category 3 B            3
## 11 Category 3 C            1
## 12 Category 3 D            1
## 13 Category 4 A            4
## 14 Category 4 B            5
## 15 Category 4 C            1
## 16 Category 4 D            2
## 17 Category 5 A            3
## 18 Category 5 B            5
## 19 Category 5 C            3
## 20 Category 5 D            2
# Columns are categories, letters, and values
# Rows represent the value for each category-letter combination

Created a long version: Success!
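One way to double-check a pivot is to pivot back and compare. The sketch below rebuilds the quiz data so it runs on its own; the name quiz_wide is made up for this example.

```r
library(tibble)
library(tidyr)

# Rebuild the wide quiz data so this example is self-contained
quizgraphs <- tibble(
  categories = c("Category 1", "Category 2", "Category 3", "Category 4", "Category 5"),
  A = c(7, 5, 4, 4, 3),
  B = c(6, 6, 3, 5, 5),
  C = c(5, 3, 1, 1, 3),
  D = c(3, 1, 1, 2, 2)
)

# Wide to long: 5 categories x 4 letters = 20 rows
quiz_long <- pivot_longer(quizgraphs, A:D,
                          names_to = "letters",
                          values_to = "values")
dim(quiz_long)

# Long back to wide: should recover the original 5 x 5 layout
quiz_wide <- pivot_wider(quiz_long,
                         names_from = letters,
                         values_from = values)
all.equal(as.data.frame(quiz_wide), as.data.frame(quizgraphs))
```

If the round trip gives back the original table, no values were lost or duplicated along the way.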

Okay, so we may not know completely what we are doing next, but we can start by at least guessing and trying things. Below are two relatively basic bar graphs. One has “categories” on the x axis, the other has “letters” on the x axis.

# long version
quiz_long %>% 
  ggplot() +
  geom_bar(aes(x = categories))

quiz_long %>% 
  ggplot() +
  geom_bar(aes(x = letters))

The bar graphs above are only counting how many times each option shows up in the tibble. Maybe we should try the geom_col instead of geom_bar since our data is presummarized.

Column graphs summarize grouped data.
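As a sketch of what "summarize" means here: when several rows share the same x value, geom_col() stacks them, so the total bar height equals the sum for that group. The sums can be checked by hand in base R (the data frame below just retypes quiz_long so the example runs on its own).

```r
# The long quiz data, retyped as a base data frame
quiz_long <- data.frame(
  categories = rep(paste("Category", 1:5), each = 4),
  letters    = rep(c("A", "B", "C", "D"), times = 5),
  values     = c(7, 6, 5, 3,
                 5, 6, 3, 1,
                 4, 3, 1, 1,
                 4, 5, 1, 2,
                 3, 5, 3, 2)
)

# Total bar height when letters is on the x axis
sum_by_letter <- tapply(quiz_long$values, quiz_long$letters, sum)
sum_by_letter

# Total bar height when categories is on the x axis
sum_by_category <- tapply(quiz_long$values, quiz_long$categories, sum)
sum_by_category
```

This is why the y axis on the stacked graphs is best labeled as a sum of values.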

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = letters, y = values))

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values))

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values, fill = categories))

# Using color for this example is not recommended since it does not add any information to the graph. Categories is labeled on the axis and then color is also used to try to label the categories. Again, color is redundant and should not be used. 

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values, fill = letters))

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values, fill = letters)) +
  labs(title = "Letters and Categories", 
       subtitle = "A graph created using geom_col()", 
       fill = "Letters",
       caption = "(Your old quiz data that you tried to hand draw)", y = "Sum of Values" , x = "Categories")

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values, fill = letters)) +
  labs(title = "Letters and Categories", 
       subtitle = "A graph created using geom_col() and coord_flip()", 
       fill = "Letters",
       caption = "(Your old quiz data that you tried to hand draw)", y = "Sum of Values" , x = "Categories") +
  coord_flip()

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values, fill = letters), position = "dodge") +
  labs(title = "Letters and Categories", 
       subtitle = "A graph created using geom_col() and coord_flip()", 
       fill = "Letters",
       caption = "(Your old quiz data that you tried to hand draw)", y = "Sum of Values" , x = "Categories") +
  coord_flip()

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values)) +
  labs(title = "Letters and Categories", 
       subtitle = "A graph created using facet_wrap() and coord_flip()",
       caption = "(Your old quiz data that you tried to hand draw)", y = "Values" , x = "Categories") +
  coord_flip()+
  facet_wrap(~letters)

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = letters, y = values, fill = categories)) 

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = letters, y = values, fill = categories)) +
  labs(title = "Letters and Categories", 
       subtitle = "A graph created using geom_col()", 
       fill = "Categories",
       caption = "(Your old quiz data that you tried to hand draw)", y = "Sum of Values" , x = "Letters") 

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = categories, y = values, fill = letters), position = "dodge") +
  labs(title = "Letters and Categories", 
       subtitle = "A graph created using geom_col() and coord_flip()", 
       fill = "Letters",
       caption = "(Your old quiz data that you tried to hand draw)", y = "Sum of Values" , x = "Categories") +
  coord_flip()

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = letters, y = values)) +
  labs(title = "Letters and Categories", 
       subtitle = "A graph created using facet_wrap()", 
       caption = "(Your old quiz data that you tried to hand draw)", y = "Values" , x = "Letters") +
  facet_wrap(~categories)

Pie graphs….

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = letters, y = values, fill = categories)) +
  labs(title = "Letters and Categories", 
       subtitle = "A  pie graph :( ", 
       caption = "(Your old quiz data that you tried to hand draw)") +
  facet_wrap(~categories) +
  coord_polar("y")

quiz_long %>% 
  ggplot() +
  geom_col(aes(x = letters, y = values, fill = categories)) +
  labs(title = "Letters and Categories", 
       subtitle = "A  pie graph :( ", 
       caption = "(Your old quiz data that you tried to hand draw)") +
  facet_wrap(~letters) +
  coord_polar("y")

quiz_long %>% 
  ggplot() +
  geom_bar(aes(x = "", y = values, fill = categories),
           stat = "identity") + #
  labs(title = "Letters and Categories", 
       subtitle = "A  pie graph :( ", 
       caption = "(Your old quiz data that you tried to hand draw)") +
  facet_wrap(~letters) +
  coord_polar("y", start = 0) +# wraps bar graph in a circle 
  theme_void()

quizgraphs %>% 
  ggplot() +
  geom_bar(aes(x = categories, fill = "letters"))

quizgraphs %>% 
  ggplot() +
  geom_bar(aes(x = categories, color = "letters", size = 2))

If you want to see more examples of gradually changing bar and column graph code, here is another helpful link with examples.

Not quite correct options

I saw a lot of line graphs… No. 

No.

No.

Histograms: Only if you focused on one category or letter at a time and indicated how many times each value showed up.

Scatterplots: No! Good for showing two continuous variables. If you put Letters on one axis and Categories on the other and try to plot the “intersection” with a scatter plot, it doesn’t mean anything.

How much distance is between Category 1 and Category 2? Is it something that can be measured?

What if Letters were replaced with different cities? Putting cities on an axis and trying to make a scatter plot just does not work.

General Reminders and Tips from Past Lectures

Readings and Resources

Finding Data

UIC’s data sources: UIC has a list of data sources of various types, categorized as “Research Tools” and “Policy Documents.” If you don’t remember what your options are, this is a good place to start. It links to HUD, the American Community Survey, USA Gov, CMAP, the National Low Income Housing Coalition, and the Chicago Data Portal, just to name a few. UIC also allows students to access other large data repositories such as ICPSR, Policy Map, and more.

Creating objects and graphs: Expenditures

Proportions: Part / Whole

TaxExpenditure <- 
  tibble(Expenditure.Type = c("Industry & workforce", 
                              "Defense",
                              "Social security & welfare",
                              "Community services & culture",
                              "Health",
                              "Infrastructure, transport & energy",
                              "Education",
                              "General government services"),
         Expenditure.Amount = c(14.843, 21.277, 121.907, 8.044, 
                                59.858, 13.221, 29.870, 96.797))
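Since the heading is "Part / Whole", it may help to see the proportions computed directly. This is a minimal base R sketch using the same amounts; the names amount and share are made up for this example.

```r
# Same expenditure amounts as in the tibble, as a named vector
amount <- c(
  "Industry & workforce"               = 14.843,
  "Defense"                            = 21.277,
  "Social security & welfare"          = 121.907,
  "Community services & culture"       = 8.044,
  "Health"                             = 59.858,
  "Infrastructure, transport & energy" = 13.221,
  "Education"                          = 29.870,
  "General government services"        = 96.797
)

# Proportion = part / whole
share <- amount / sum(amount)

# As percentages, largest first
round(100 * sort(share, decreasing = TRUE), 1)
```

The shares add up to 1, which is what a stacked bar or pie chart of the whole is showing.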

ggplot(data = TaxExpenditure,
       aes(x ="", 
           y = Expenditure.Amount, 
           fill = Expenditure.Type)) +
  geom_bar(width =1, stat = "identity") +
  scale_fill_brewer(palette = "Dark2")

ggplot(data = TaxExpenditure,
       aes(x ="", y = Expenditure.Amount, fill = Expenditure.Type)) +
  geom_bar(width =1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Dark2") +
  theme_void()

ggplot(data = TaxExpenditure,
       aes(x = reorder(Expenditure.Type, Expenditure.Amount), y = Expenditure.Amount, fill = Expenditure.Type)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(breaks = seq(0, 125, by = 25), limits = c(0,125), expand=c(0,0)) +
  scale_x_discrete(labels=function(x) str_wrap(x, width=20)) +
  labs(x="Expenditure type", y="Expenditure ($millions)") +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip() +
  theme(panel.grid.minor.y=element_blank(),
        panel.grid.major.x = element_line(color = "gray"),
        panel.background = element_blank(),
        axis.line = element_line(color="gray", size = 1),
        axis.text=element_text(size=10),
        axis.title=element_text(size=15),
        plot.margin=margin(5,15,5,5),
        legend.position = "none")

Pivoting - Texas Winter Storm Data

ERCOT data on the Texas Winter Storm. Examples in slides on taxes changing, comparison of baselines.

Texas_wide <- read_excel("TexasEnergy_wide.xlsx")
Texas_wide
## # A tibble: 1,368 x 7
##    nat_gas  wind  coal solar hydro nuclear datetime           
##      <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dttm>             
##  1   10908 11961  8327     0    69    4973 2021-02-26 23:06:00
##  2   13560 10401  8505     0    70    4972 2021-02-26 22:06:00
##  3   15438  8834  9037     0    73    4972 2021-02-26 21:06:00
##  4   17189  7276  9413     0    73    4972 2021-02-26 20:06:00
##  5   18438  5802  9582    70    73    4971 2021-02-26 19:06:00
##  6   16959  5911  9599   793    72    4972 2021-02-26 18:06:00
##  7   15990  6524  8651  2133    72    4972 2021-02-26 17:06:00
##  8   15314  7522  7818  2895    73    4972 2021-02-26 16:06:00
##  9   15201  7382  7995  3252    74    4972 2021-02-26 15:06:00
## 10   15490  6600  8455  3577    74    4973 2021-02-26 14:06:00
## # ... with 1,358 more rows
Texas_long <- Texas_wide %>% 
  pivot_longer(!datetime,    # everything but datetime
               names_to = "source",
               values_to = "power")
Texas_long  # success.
## # A tibble: 8,208 x 3
##    datetime            source  power
##    <dttm>              <chr>   <dbl>
##  1 2021-02-26 23:06:00 nat_gas 10908
##  2 2021-02-26 23:06:00 wind    11961
##  3 2021-02-26 23:06:00 coal     8327
##  4 2021-02-26 23:06:00 solar       0
##  5 2021-02-26 23:06:00 hydro      69
##  6 2021-02-26 23:06:00 nuclear  4973
##  7 2021-02-26 22:06:00 nat_gas 13560
##  8 2021-02-26 22:06:00 wind    10401
##  9 2021-02-26 22:06:00 coal     8505
## 10 2021-02-26 22:06:00 solar       0
## # ... with 8,198 more rows
# 8208 X 3 tibble
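A quick way to sanity-check the size of a pivoted table: the long version should have (original rows) x (number of pivoted columns) rows, which here is 1,368 x 6 = 8,208. The sketch below shows the same check on a tiny stand-in table built from two rows of the printed output. The name texas_small and the tz = "UTC" setting are assumptions made for this example.

```r
library(tibble)
library(tidyr)

# Two rows copied from the printed output, trimmed to three sources.
# tz = "UTC" is an assumption for this sketch.
texas_small <- tibble(
  nat_gas  = c(10908, 13560),
  wind     = c(11961, 10401),
  coal     = c(8327, 8505),
  datetime = as.POSIXct(c("2021-02-26 23:06:00", "2021-02-26 22:06:00"), tz = "UTC")
)

# Pivot everything except datetime
texas_small_long <- pivot_longer(texas_small, !datetime,
                                 names_to = "source",
                                 values_to = "power")

# Long rows = wide rows x number of pivoted columns
nrow(texas_small_long) == nrow(texas_small) * (ncol(texas_small) - 1)
```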

Dates in R

Information on the lubridate package. The link also includes the cheatsheet for the package commands.

library(lubridate) # it should be included in tidyverse already, but just in case, you can load it if you want
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Dates and times infographic: https://github.com/rstudio/concept-maps/raw/master/inspirations/datetime-silvia-canelon.png

Texas Energy Data

Information gained from the Texas_wide tibble: 1,368 rows X 7 columns. Energy values are stored as numeric variables (indicated by <dbl> for double) and datetime is stored as …. S3: POSIXct ???

class(Texas_wide$datetime)
## [1] "POSIXct" "POSIXt"

POSIX is a way that dates are stored.
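A little more detail, as a base R sketch: a POSIXct value is stored as a single number, the count of seconds since 1970-01-01 00:00:00 UTC, with a class label on top so that it prints as a date-time. The tz = "UTC" setting below is chosen only for the example.

```r
# A date-time like the ones in the Texas data (time zone chosen for the example)
x <- as.POSIXct("2021-02-26 15:06:00", tz = "UTC")
class(x)

# Underneath, it is just a number of seconds since the start of 1970 (UTC)
as.numeric(x)

# The starting point itself is stored as zero
epoch <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
as.numeric(epoch)
```

Because it is a number underneath, R can sort date-times and subtract them from each other.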

Look again at your time variable. A couple things to notice:

  • It is currently in military time
  • The order of the information in the time stamp is
    • year-month-day hour-minute-second
    • ex. 2021-02-26 15:06:00
Texas_long$date <- as_date(Texas_long$datetime)
Texas_long  # a new column appeared!
## # A tibble: 8,208 x 4
##    datetime            source  power date      
##    <dttm>              <chr>   <dbl> <date>    
##  1 2021-02-26 23:06:00 nat_gas 10908 2021-02-26
##  2 2021-02-26 23:06:00 wind    11961 2021-02-26
##  3 2021-02-26 23:06:00 coal     8327 2021-02-26
##  4 2021-02-26 23:06:00 solar       0 2021-02-26
##  5 2021-02-26 23:06:00 hydro      69 2021-02-26
##  6 2021-02-26 23:06:00 nuclear  4973 2021-02-26
##  7 2021-02-26 22:06:00 nat_gas 13560 2021-02-26
##  8 2021-02-26 22:06:00 wind    10401 2021-02-26
##  9 2021-02-26 22:06:00 coal     8505 2021-02-26
## 10 2021-02-26 22:06:00 solar       0 2021-02-26
## # ... with 8,198 more rows
Texas_long$time <- hms::as.hms(Texas_long$datetime)
## Warning: `as.hms()` was deprecated in hms 0.5.0.
## Please use `as_hms()` instead.
Texas_long  # wait, they don't match
## # A tibble: 8,208 x 5
##    datetime            source  power date       time  
##    <dttm>              <chr>   <dbl> <date>     <time>
##  1 2021-02-26 23:06:00 nat_gas 10908 2021-02-26 17:06 
##  2 2021-02-26 23:06:00 wind    11961 2021-02-26 17:06 
##  3 2021-02-26 23:06:00 coal     8327 2021-02-26 17:06 
##  4 2021-02-26 23:06:00 solar       0 2021-02-26 17:06 
##  5 2021-02-26 23:06:00 hydro      69 2021-02-26 17:06 
##  6 2021-02-26 23:06:00 nuclear  4973 2021-02-26 17:06 
##  7 2021-02-26 22:06:00 nat_gas 13560 2021-02-26 16:06 
##  8 2021-02-26 22:06:00 wind    10401 2021-02-26 16:06 
##  9 2021-02-26 22:06:00 coal     8505 2021-02-26 16:06 
## 10 2021-02-26 22:06:00 solar       0 2021-02-26 16:06 
## # ... with 8,198 more rows
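The six-hour gap between datetime and time is what you would see if the timestamps are stored in UTC and the time was converted to US Central time. That this is the cause here is an assumption, but the arithmetic can be checked with base R:

```r
# First timestamp from the output; storing it as UTC is an assumption
x <- as.POSIXct("2021-02-26 23:06:00", tz = "UTC")

# Same instant, printed in two different time zones
utc_time     <- format(x, "%H:%M", tz = "UTC")
central_time <- format(x, "%H:%M", tz = "America/Chicago")

utc_time
central_time
```

When a time looks shifted by a whole number of hours, the time zone is usually the first thing to check.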

Markdown

If needed, you can also embed R results in your text with inline code, using the format `r mean(XXX)`.

This will report your results into the text and save you time when updating your work!

R Markdown reminders:

End a line with two spaces to start a new paragraph.

italics and italics

bold and bold

superscript2

strikethrough

link

endash: –

ellipsis: …

inline equation: \(A = \pi*r^{2}\)

horizontal rule (or slide break): ***

block quote

  • unordered list
  • item 2
    • sub-item 1
    • sub-item 2
  1. ordered list
  2. item 2
  • sub-item 1
  • sub-item 2
Table Header Second Header
Table Cell Cell 2
Cell 3 Cell 4

Inline code: Two plus two equals `r 2 + 2`

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Tables

Setup

#install.packages("kableExtra")
#install.packages("stevedata")

library(tidyverse)
library("kableExtra")
library("stevedata")

data <- gss_wages

Data

stat_info_2000 <- 
  data %>%
  filter( year >= 2000) %>%
  group_by(year) %>%
  summarize(average_income = mean(realrinc, na.rm = T),
            num_children = mean(childs, na.rm = T))
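Why na.rm = T matters in the summarize() step: a single missing value turns the whole mean into NA unless you ask R to drop it. The income values below are made up for illustration.

```r
# Hypothetical incomes with one missing value
income <- c(20000, NA, 30000)

# Without na.rm, the missing value takes over the result
mean(income)

# With na.rm = TRUE, the mean uses only the non-missing values
mean(income, na.rm = TRUE)
```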

Simple tables with KABLE

kable(stat_info_2000)
year average_income num_children
2000 22110.13 1.799357
2002 27276.64 1.810507
2004 26019.78 1.822650
2006 23539.08 1.898154
2008 32395.10 1.939109
2010 20696.68 1.885350
2012 27587.48 1.891933
2014 23282.27 1.820949
2016 23772.92 1.852046
2018 24994.19 1.855375
kable(stat_info_2000, 
      "pipe")
year average_income num_children
2000 22110.13 1.799357
2002 27276.64 1.810507
2004 26019.78 1.822650
2006 23539.08 1.898154
2008 32395.10 1.939109
2010 20696.68 1.885350
2012 27587.48 1.891933
2014 23282.27 1.820949
2016 23772.92 1.852046
2018 24994.19 1.855375
kable(stat_info_2000, "simple",
      col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4), 
      format.args = list(big.mark = ",")) 
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2
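Two things look odd in that last table: the years print as "2,000" and the number of children is rounded to 2. A likely explanation, sketched in base R below, is that digits = c(0, 4) has only two entries for three columns and gets recycled (this recycling is an assumption about how kable() handles a short digits vector, though it matches the output), while big.mark = "," is applied to every numeric column, including the year.

```r
# Two digits settings for three columns
digits <- c(0, 4)
n_cols <- 3

# Assumption: a too-short digits vector is recycled across the columns
recycled <- rep(digits, length.out = n_cols)
recycled

# The third column would then be rounded to 0 decimal places
round(1.799357, recycled[3])

# big.mark adds a comma to any number of 1,000 or more, including a year
format(2000, big.mark = ",")
```

Giving one digits entry per column and converting year to character before making the table avoids both problems.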

kableExtra

Keep building from the previous table with the function kable_styling.

kable(stat_info_2000, 
      col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ",")) %>%
  kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

bootstrap_options

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ","))%>%
  
  kable_styling(bootstrap_options = c("striped", "bordered"),
                font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2
stat_info_2000 %>% 
  kable(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ","))%>%
  
  kable_styling(bootstrap_options = c("hover", "condensed"),
                font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

Themes

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4), 
      format.args = list(big.mark = ","))%>%
  
  kable_paper(font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2
stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4), 
      format.args = list(big.mark = ","))%>%
  
  kable_material_dark(font_size = 14,
                html_font = "Cambria", 
                full_width = F)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

Style of rows and columns

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  
      format.args = list(big.mark = ","))%>%
 
   kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F) %>%
  
  row_spec(5, bold = T, color = "red") %>%
  row_spec(2, underline = T) %>%
  row_spec(4, background = "#457b9d") %>%
  column_spec(3, strikeout = T)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2

Group rows and columns

Group columns

stat_info_2000 %>% 
  kbl(col.names = c("Year", "Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4, 2), 
      format.args = list(big.mark = ","))%>%
   kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F) %>%
  add_header_above(c("Year" = 1, "Year Average" = 2))
Average Income by Year Since 2000
Year
Year Average
Year Income Number of Children
2,000 22,110.13 1.80
2,002 27,276.64 1.81
2,004 26,019.78 1.82
2,006 23,539.08 1.90
2,008 32,395.10 1.94
2,010 20,696.68 1.89
2,012 27,587.48 1.89
2,014 23,282.27 1.82
2,016 23,772.92 1.85
2,018 24,994.19 1.86

Group rows

stat_info_2000 %>% 
   kbl(col.names = c("Year", "Average Income", "Number of Children"), 
      align = "rc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 4),  # Round the number of digits
      format.args = list(big.mark = ","))%>%
   kable_styling(font_size = 14,
                html_font = "Cambria", 
                full_width = F) # %>%
  # group_rows(group_label = "2009-2010", 1, 6) %>%
  # group_rows(group_label = "2011-2018", 7, 10)
Average Income by Year Since 2000
Year Average Income Number of Children
2,000 22,110.13 2
2,002 27,276.64 2
2,004 26,019.78 2
2,006 23,539.08 2
2,008 32,395.10 2
2,010 20,696.68 2
2,012 27,587.48 2
2,014 23,282.27 2
2,016 23,772.92 2
2,018 24,994.19 2
stat_info_2000 %>% 
  
  mutate(year = as.character(year)) %>%
  
  kbl(col.names = c("Year[note]", "Average Income", "# Children"), #<<
      align = "ccc", 
      caption = "Average Income by Year Since 2000", 
      digits = c(0, 2, 2),  # Round the # digits
      format.args = list(big.mark = ","))%>%
 
   kable_styling(font_size = 20,
                html_font = "Cambria", 
                full_width = F) %>%
  
  row_spec(0, bold = T, background = "#e5e5e5") %>%

 # group_rows("Before crisis", 1, 5) %>%
 # group_rows("After crisis[note]", 6, 10) %>%
  
  add_footnote(c("Only year up to 2000 were included", "We consider the end of crisis after 2009"), notation = "alphabet")
## Warning in add_footnote(., c("Only year up to 2000 were included", "We consider
## the end of crisis after 2009"), : You entered 2 labels but you put 1 [note] in
## your table.
Average Income by Year Since 2000
Yeara Average Income # Children
2000 22,110.13 1.80
2002 27,276.64 1.81
2004 26,019.78 1.82
2006 23,539.08 1.90
2008 32,395.10 1.94
2010 20,696.68 1.89
2012 27,587.48 1.89
2014 23,282.27 1.82
2016 23,772.92 1.85
2018 24,994.19 1.86
a Only year up to 2000 were included
b We consider the end of crisis after 2009

R markdown Setup

  • Open R Studio

  • Click on File -> New File -> R Markdown.

    • If you have never used R markdown before, it should prompt you to install a set of packages. Please say “yes” and install them.

    • If R markdown is already installed, it should prompt you to name a new file and open a new document similar to an R script.

  • In either case, if you have never knit an R markdown document into a PDF document, please install tinytex by running both of these commands:

install.packages('tinytex')
tinytex::install_tinytex()

R markdown files are data-driven docs that combine TEXT, CODES, and RESULTS.

They promote transparency, reproducibility, and replicability as they allow others to see your annotations + codes + outputs. The balance of these three elements depends on the purpose of your work.

They are a great way to share your results to non-R users.

R markdown is a set of rules to format these data-driven documents. It supports dozens of output formats, like PDFs, Word files, slideshows, and more.

Type of R markdown documents

Why should you create R markdown documents?

  • For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.

  • For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).

  • As an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.

knitr::include_graphics("rmarkdown.png")

When your script is ready, you can knit it and produce a complete report containing all text, code, and results.

knitr::include_graphics("knitting.png")

Formatting

There are a few conventions on how to write text in R markdown. For instance:

# indicates the main Title

## Subtitle

### Header 3
**Bold text**
*Italic text*

R chunks

You can create a new R chunk by:

  1. using the keyboard shortcut Cmd/Ctrl + Alt + I (recommended option)

OR

  2. by manually typing the chunk delimiters at the beginning and end.
knitr::include_graphics("chunk.png")

Let’s look at the breakdown of an R chunk

{r NAME_OF_THE_CHUNK, OPTION1, OPTION2}

You can give the chunk any name (use something meaningful).

{r project_mean, OPTIONS}

R chunk options

There are several options that you can put in an R chunk. You can see the full list here

eval = F
Show the code but don’t run it

echo = FALSE
Code doesn’t appear in the output file. Results appear. When including an image, you generally want to use this.

message = FALSE
Prevent messages from appearing in the output file

warning = FALSE
Prevents warnings from appearing in the output file

{r project_mean, echo = F, warning = F, message = F}

R setup

The first R chunk in a document is generally called ‘setup’. The name ‘setup’ is used only for the very first chunk where you can set up settings to be applied to the entire document.

{r setup, echo = F}
knitr::opts_chunk$set(echo = FALSE)

This setup, for instance, is great for reports where you might not want to show your code but only the results.
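The same setup line can carry several of the options listed earlier at once. This is a minimal sketch, assuming you want code, warnings, and messages all hidden across the whole document; pick whichever options fit your report.

```r
library(knitr)

# Apply these chunk options to every chunk that follows in the document
opts_chunk$set(echo = FALSE,     # hide the code
               warning = FALSE,  # hide warnings
               message = FALSE)  # hide messages

# Check what the current default is for one option
opts_chunk$get("echo")
```

Any single chunk can still override these defaults by setting the option in its own header.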

Inline codes

If needed, you can also embed R results in your text with inline code, using the format `r mean(XXX)`.

This will report your results into the text and save you time when updating your work!