SPUR R Lecture - Lecture 2

J. Kavanagh

2022-07-14

Introduction

In this series of slides we will be exploring how to merge datasets and the use of dates in R

load('SPUR.RData')

Examine the data

First examine each dataset, note the number of rows & variables and the class types.

glimpse(judges_appointments)
## Rows: 4,202
## Columns: 15
## $ judge_id                       <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ court_name                     <chr> "U. S. District Court, Southern Distric…
## $ court_type                     <chr> "USDC", "USDC", "USDC", "USDC", "USDC",…
## $ president_name                 <chr> "Barack Obama", "Franklin D. Roosevelt"…
## $ president_party                <chr> "Democratic", "Democratic", "Republican…
## $ nomination_date                <chr> "07/28/2011", "02/03/1936", "01/06/1880…
## $ predecessor_last_name          <chr> "Kaplan", "new", "Ketcham", "McFadden",…
## $ predecessor_first_name         <chr> "Lewis A.", NA, "Winthrop", "Frank H.",…
## $ senate_confirmation_date       <chr> "03/22/2012", "02/12/1936", "01/14/1880…
## $ commission_date                <chr> "03/23/2012", "02/15/1936", "01/14/1880…
## $ chief_judge_begin              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ chief_judge_end                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ retirement_from_active_service <chr> NA, "02/15/1966", NA, "05/31/1996", "02…
## $ termination_date               <chr> NA, "05/28/1971", "02/09/1891", NA, "12…
## $ termination_reason             <chr> NA, "Death", "Appointment to Another Ju…
glimpse(judges_people)
## Rows: 3,532
## Columns: 13
## $ judge_id         <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 2989, 32…
## $ name_first       <chr> "Ronnie", "Matthew", "Marcus", "William", "Harold", "…
## $ name_middle      <chr> NA, "T.", "Wilson", "Marsh", "Arnold", "Waldo", "L.",…
## $ name_last        <chr> "Abrams", "Abruzzo", "Acheson", "Acker", "Ackerman", …
## $ name_suffix      <chr> NA, NA, NA, "Jr.", NA, NA, NA, NA, NA, NA, NA, NA, "J…
## $ birth_date       <int> 1968, 1889, 1828, 1927, 1928, 1926, 1925, 1887, 1921,…
## $ birthplace_city  <chr> "New York", "Brooklyn", "Washington", "Birmingham", "…
## $ birthplace_state <chr> "NY", "NY", "PA", "AL", "NJ", "FL", "NY", "IL", "PA",…
## $ death_date       <int> NA, 1971, 1906, NA, 2009, 1984, NA, 1956, NA, 1916, 1…
## $ death_city       <chr> NA, "Potomac", "Pittsburgh", NA, "West Orange", "Spri…
## $ death_state      <chr> NA, "MD", "PA", NA, "NJ", "IL", NA, NA, NA, "MO", "MS…
## $ gender           <chr> "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
## $ race             <chr> "White", "White", "White", "White", "White", "White",…

Joining Datasets - Row

Using base R it is possible to join datasets either by row or column

# Create data frame
df <- data.frame(a=c(1, 3, 3, 4, 5),
                 b=c(7, 7, 8, 3, 2),
                 c=c(3, 3, 6, 6, 8))

df
##   a b c
## 1 1 7 3
## 2 3 7 3
## 3 3 8 6
## 4 4 3 6
## 5 5 2 8
# Create a second dataframe 

df2 <- c(11, 14, 17)

df2
## [1] 11 14 17
# rbind() will join this datasets together as they are of equal length and stack one atop the other
df_new <- rbind(df, df2)

df_new
##    a  b  c
## 1  1  7  3
## 2  3  7  3
## 3  3  8  6
## 4  4  3  6
## 5  5  2  8
## 6 11 14 17

Joining Datasets - Column

This is an example of how to join dataframes by column

# Create data frame
df <- data.frame(a=c(1, 3, 3, 4, 5),
                 b=c(7, 7, 8, 3, 2),
                 c=c(3, 3, 6, 6, 8))

df
##   a b c
## 1 1 7 3
## 2 3 7 3
## 3 3 8 6
## 4 4 3 6
## 5 5 2 8
# Define vector
df2 <- c(11, 14, 16, 17, 22)
df2
## [1] 11 14 16 17 22
# cbind vector to data frame
df_new <- cbind(df, df2)

df_new
##   a b c df2
## 1 1 7 3  11
## 2 3 7 3  14
## 3 3 8 6  16
## 4 4 3 6  17
## 5 5 2 8  22

Creating a new dataframe after joining

Be sure to create a new dataframe or the join will not be saved to your workspace

judges_unified <- inner_join(judges_appointments, judges_people, by.x='judge_id')
## Joining, by = "judge_id"

Note that the people data has been linked to the appointments, and there are clearly multiple entries, indicating that a number of individuals were appointed to numerous judicial posts.

glimpse(judges_unified)
## Rows: 4,202
## Columns: 27
## $ judge_id                       <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ court_name                     <chr> "U. S. District Court, Southern Distric…
## $ court_type                     <chr> "USDC", "USDC", "USDC", "USDC", "USDC",…
## $ president_name                 <chr> "Barack Obama", "Franklin D. Roosevelt"…
## $ president_party                <chr> "Democratic", "Democratic", "Republican…
## $ nomination_date                <chr> "07/28/2011", "02/03/1936", "01/06/1880…
## $ predecessor_last_name          <chr> "Kaplan", "new", "Ketcham", "McFadden",…
## $ predecessor_first_name         <chr> "Lewis A.", NA, "Winthrop", "Frank H.",…
## $ senate_confirmation_date       <chr> "03/22/2012", "02/12/1936", "01/14/1880…
## $ commission_date                <chr> "03/23/2012", "02/15/1936", "01/14/1880…
## $ chief_judge_begin              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ chief_judge_end                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ retirement_from_active_service <chr> NA, "02/15/1966", NA, "05/31/1996", "02…
## $ termination_date               <chr> NA, "05/28/1971", "02/09/1891", NA, "12…
## $ termination_reason             <chr> NA, "Death", "Appointment to Another Ju…
## $ name_first                     <chr> "Ronnie", "Matthew", "Marcus", "William…
## $ name_middle                    <chr> NA, "T.", "Wilson", "Marsh", "Arnold", …
## $ name_last                      <chr> "Abrams", "Abruzzo", "Acheson", "Acker"…
## $ name_suffix                    <chr> NA, NA, NA, "Jr.", NA, NA, NA, NA, NA, …
## $ birth_date                     <int> 1968, 1889, 1828, 1927, 1928, 1926, 192…
## $ birthplace_city                <chr> "New York", "Brooklyn", "Washington", "…
## $ birthplace_state               <chr> "NY", "NY", "PA", "AL", "NJ", "FL", "NY…
## $ death_date                     <int> NA, 1971, 1906, NA, 2009, 1984, NA, 195…
## $ death_city                     <chr> NA, "Potomac", "Pittsburgh", NA, "West …
## $ death_state                    <chr> NA, "MD", "PA", NA, "NJ", "IL", NA, NA,…
## $ gender                         <chr> "F", "M", "M", "M", "M", "M", "M", "M",…
## $ race                           <chr> "White", "White", "White", "White", "Wh…

Checking Your Join

# This is the smallest dataset containing 3532 rows
judges_people %>% count(judge_id)
## # A tibble: 3,532 × 2
##    judge_id     n
##       <int> <int>
##  1        1     1
##  2        2     1
##  3        3     1
##  4        4     1
##  5        5     1
##  6        6     1
##  7        7     1
##  8        8     1
##  9        9     1
## 10       10     1
## # … with 3,522 more rows
# There are 4,202 appointments, however there are only 3532 individual judges 
judges_appointments %>% count(judge_id)
## # A tibble: 3,532 × 2
##    judge_id     n
##       <int> <int>
##  1        1     1
##  2        2     3
##  3        3     1
##  4        4     1
##  5        5     2
##  6        6     1
##  7        7     1
##  8        8     1
##  9        9     3
## 10       10     3
## # … with 3,522 more rows
judges_unified %>% count(judge_id)
## # A tibble: 3,532 × 2
##    judge_id     n
##       <int> <int>
##  1        1     1
##  2        2     3
##  3        3     1
##  4        4     1
##  5        5     2
##  6        6     1
##  7        7     1
##  8        8     1
##  9        9     3
## 10       10     3
## # … with 3,522 more rows

Dates in R

There are a number of different dates included in the judges_unified dataframe. However, none of these variables are the correct class as shown by the glimpse().

# Date of Nomination
judges_unified$nomination_date %>% head()
## [1] "07/28/2011" "02/03/1936" "01/06/1880" "07/22/1982" "09/28/1979"
## [6] "06/18/1976"
# Date of Confirmation
judges_unified$senate_confirmation_date %>% head()
## [1] "03/22/2012" "02/12/1936" "01/14/1880" "08/18/1982" "10/31/1979"
## [6] "07/02/1976"
# Date of Commission
judges_unified$commission_date %>% head()
## [1] "03/23/2012" "02/15/1936" "01/14/1880" "08/18/1982" "11/02/1979"
## [6] "07/02/1976"
# Date of Termination
judges_unified$termination_date %>% head()
## [1] NA           "05/28/1971" "02/09/1891" NA           "12/02/2009"
## [6] "03/31/1979"

Adjusting the dates

There are a number of different ways to adjust dates, however, as the data is structured we can use the mdy() command from the package ‘lubridate’ to make a relatively simple change.

# Create some sample dates 
begin <- c("May 11, 1996", "September 12, 2001", "July 1, 1988")
end <- c("7/8/97","10/23/02","1/4/91")
class(begin)
## [1] "character"
## [1] "character"
class(end)
## [1] "character"
## [1] "character"
(begin <- mdy(begin))
## [1] "1996-05-11" "2001-09-12" "1988-07-01"
## [1] "1996-05-11" "2001-09-12" "1988-07-01"
(end <- mdy(end))
## [1] "1997-07-08" "2002-10-23" "1991-01-04"
## [1] "1997-07-08" "2002-10-23" "1991-01-04"
class(begin)
## [1] "Date"
## [1] "Date"
class(end)
## [1] "Date"
## [1] "Date"

Creating Dates variables for the Judges dataset

Use the mdy() command and verify the results with the class() command

mdy(judges_unified$nomination_date) -> judges_unified$nomination_date
class(judges_unified$nomination_date)
## [1] "Date"
mdy(judges_unified$senate_confirmation_date) -> judges_unified$senate_confirmation_date
class(judges_unified$senate_confirmation_date)
## [1] "Date"
mdy(judges_unified$commission_date) -> judges_unified$commission_date
class(judges_unified$commission_date)
## [1] "Date"
mdy(judges_unified$termination_date) -> judges_unified$termination_date
class(judges_unified$termination_date)
## [1] "Date"

Creating specific date dataframes

First you need to create a new dataframe that provides the number of nominations per day

# This creates a new variable, however, you will need to rename the column names 
judges_unified %>% count(nomination_date) -> judges_nominations_date
judges_nominations_date
## # A tibble: 2,036 × 2
##    nomination_date     n
##    <date>          <int>
##  1 1789-09-24         13
##  2 1789-09-25          2
##  3 1790-02-08          4
##  4 1790-06-11          1
##  5 1790-07-02          1
##  6 1790-08-02          1
##  7 1790-12-17          2
##  8 1791-03-04          1
##  9 1791-10-31          2
## 10 1792-01-12          1
## # … with 2,026 more rows

Changing Column Names

This is vital as you will create multiple smaller dataframes and need to individualise the column names. This will prevent future errors.

# Rename the columns
colnames(judges_nominations_date) <- c("Date", "Nominations")

# Check your results
judges_nominations_date
## # A tibble: 2,036 × 2
##    Date       Nominations
##    <date>           <int>
##  1 1789-09-24          13
##  2 1789-09-25           2
##  3 1790-02-08           4
##  4 1790-06-11           1
##  5 1790-07-02           1
##  6 1790-08-02           1
##  7 1790-12-17           2
##  8 1791-03-04           1
##  9 1791-10-31           2
## 10 1792-01-12           1
## # … with 2,026 more rows

Using floor_date()

Group nomination dates into years using the floor_date() command from the ‘lubridate’ package. Its fairly intelligent and can reorganise dates into days, months, years etc.

judges_nominations_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Nominations=sum(Nominations)) -> judges_nominations_yearly

judges_nominations_yearly
## # A tibble: 220 × 2
##    year       No_of_Nominations
##    <date>                 <int>
##  1 1789-01-01                15
##  2 1790-01-01                 9
##  3 1791-01-01                 3
##  4 1792-01-01                 1
##  5 1793-01-01                 2
##  6 1794-01-01                 1
##  7 1795-01-01                 2
##  8 1796-01-01                 5
##  9 1797-01-01                 1
## 10 1798-01-01                 2
## # … with 210 more rows

Repeat this process for Commission and Termination Date

Create a new dataframe for judges terminations

# This creates a new variable, however, you will need to rename the column names 
judges_unified %>% count(termination_date) -> judges_terminations_date
judges_terminations_date
## # A tibble: 2,498 × 2
##    termination_date     n
##    <date>           <int>
##  1 1790-05-18           1
##  2 1790-08-16           1
##  3 1790-10-12           1
##  4 1791-03-05           1
##  5 1791-05-09           1
##  6 1792-01-04           1
##  7 1793-01-01           1
##  8 1793-01-16           1
##  9 1794-03-17           1
## 10 1794-06-09           1
## # … with 2,488 more rows

Create a new dataframe for judges commissions

# This creates a new variable, however, you will need to rename the column names 
judges_unified %>% count(commission_date) -> judges_commissions_date
judges_commissions_date
## # A tibble: 2,066 × 2
##    commission_date     n
##    <date>          <int>
##  1 1789-09-26         12
##  2 1789-09-27          1
##  3 1789-09-29          1
##  4 1789-09-30          1
##  5 1790-02-10          4
##  6 1790-06-14          1
##  7 1790-07-03          1
##  8 1790-08-03          1
##  9 1790-12-20          2
## 10 1791-03-04          1
## # … with 2,056 more rows

# Rename the columns
colnames(judges_terminations_date) <- c("Date", "Terminations")

# Check your results
judges_terminations_date
## # A tibble: 2,498 × 2
##    Date       Terminations
##    <date>            <int>
##  1 1790-05-18            1
##  2 1790-08-16            1
##  3 1790-10-12            1
##  4 1791-03-05            1
##  5 1791-05-09            1
##  6 1792-01-04            1
##  7 1793-01-01            1
##  8 1793-01-16            1
##  9 1794-03-17            1
## 10 1794-06-09            1
## # … with 2,488 more rows
# Rename the columns
colnames(judges_commissions_date) <- c("Date", "Commissions")

# Check your results
judges_commissions_date
## # A tibble: 2,066 × 2
##    Date       Commissions
##    <date>           <int>
##  1 1789-09-26          12
##  2 1789-09-27           1
##  3 1789-09-29           1
##  4 1789-09-30           1
##  5 1790-02-10           4
##  6 1790-06-14           1
##  7 1790-07-03           1
##  8 1790-08-03           1
##  9 1790-12-20           2
## 10 1791-03-04           1
## # … with 2,056 more rows

judges_terminations_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Terminations=sum(Terminations)) -> judges_terminations_yearly

judges_terminations_yearly
## # A tibble: 222 × 2
##    year       No_of_Terminations
##    <date>                  <int>
##  1 1790-01-01                  3
##  2 1791-01-01                  2
##  3 1792-01-01                  1
##  4 1793-01-01                  2
##  5 1794-01-01                  2
##  6 1795-01-01                  4
##  7 1796-01-01                  3
##  8 1797-01-01                  1
##  9 1798-01-01                  2
## 10 1799-01-01                  2
## # … with 212 more rows
judges_commissions_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Commissions=sum(Commissions)) -> judges_commissions_yearly

judges_commissions_yearly
## # A tibble: 219 × 2
##    year       No_of_Commissions
##    <date>                 <int>
##  1 1789-01-01                15
##  2 1790-01-01                 9
##  3 1791-01-01                 3
##  4 1792-01-01                 1
##  5 1793-01-01                 1
##  6 1794-01-01                 3
##  7 1795-01-01                 1
##  8 1796-01-01                 4
##  9 1797-01-01                 3
## 10 1798-01-01                 2
## # … with 209 more rows

Explore the new dataframes

Using ggplot we can display the findings

judges_commissions_yearly %>% 
ggplot(aes(x=year, y=No_of_Commissions)) + 
geom_line(size = 0.8)
## Warning: Removed 1 row(s) containing missing values (geom_path).

Advanced ggplot2

Note how with a additional information added to the basic plot we can create a very effective graph

judges_commissions_yearly %>% 
ggplot(aes(x=year, y=No_of_Commissions)) + 
geom_line(size = 0.8) + 
labs(title = "Judicial Commmissions - 1789-2014",
       tag = "Figure 1", 
       x = "Year",
       y = "No.") +
scale_x_date(date_breaks = "50 years", date_labels = "%Y") +
theme_classic() +
theme(axis.text.x = element_text(colour = "darkslategrey", size = 16), 
      axis.text.y = element_text(colour = "darkslategrey", size = 16),
      legend.background = element_rect(fill = "white", size = 4, colour = "white"),
      legend.justification = c(0, 1),
      legend.position = c(0.9, 1),
      text = element_text(family = "Georgia"),
      plot.title = element_text(size = 18, margin = margin(b = 10)),
      plot.subtitle = element_text(size = 12, color = "darkslategrey", margin = margin(b = 25)),
      plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
## Warning: Removed 1 row(s) containing missing values (geom_path).