J. Kavanagh
2022-07-14
In this series of slides we will be exploring how to merge datasets and the use of dates in R
First examine each dataset, note the number of rows & variables and the class types.
## Rows: 4,202
## Columns: 15
## $ judge_id <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ court_name <chr> "U. S. District Court, Southern Distric…
## $ court_type <chr> "USDC", "USDC", "USDC", "USDC", "USDC",…
## $ president_name <chr> "Barack Obama", "Franklin D. Roosevelt"…
## $ president_party <chr> "Democratic", "Democratic", "Republican…
## $ nomination_date <chr> "07/28/2011", "02/03/1936", "01/06/1880…
## $ predecessor_last_name <chr> "Kaplan", "new", "Ketcham", "McFadden",…
## $ predecessor_first_name <chr> "Lewis A.", NA, "Winthrop", "Frank H.",…
## $ senate_confirmation_date <chr> "03/22/2012", "02/12/1936", "01/14/1880…
## $ commission_date <chr> "03/23/2012", "02/15/1936", "01/14/1880…
## $ chief_judge_begin <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ chief_judge_end <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ retirement_from_active_service <chr> NA, "02/15/1966", NA, "05/31/1996", "02…
## $ termination_date <chr> NA, "05/28/1971", "02/09/1891", NA, "12…
## $ termination_reason <chr> NA, "Death", "Appointment to Another Ju…
## Rows: 3,532
## Columns: 13
## $ judge_id <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 2989, 32…
## $ name_first <chr> "Ronnie", "Matthew", "Marcus", "William", "Harold", "…
## $ name_middle <chr> NA, "T.", "Wilson", "Marsh", "Arnold", "Waldo", "L.",…
## $ name_last <chr> "Abrams", "Abruzzo", "Acheson", "Acker", "Ackerman", …
## $ name_suffix <chr> NA, NA, NA, "Jr.", NA, NA, NA, NA, NA, NA, NA, NA, "J…
## $ birth_date <int> 1968, 1889, 1828, 1927, 1928, 1926, 1925, 1887, 1921,…
## $ birthplace_city <chr> "New York", "Brooklyn", "Washington", "Birmingham", "…
## $ birthplace_state <chr> "NY", "NY", "PA", "AL", "NJ", "FL", "NY", "IL", "PA",…
## $ death_date <int> NA, 1971, 1906, NA, 2009, 1984, NA, 1956, NA, 1916, 1…
## $ death_city <chr> NA, "Potomac", "Pittsburgh", NA, "West Orange", "Spri…
## $ death_state <chr> NA, "MD", "PA", NA, "NJ", "IL", NA, NA, NA, "MO", "MS…
## $ gender <chr> "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
## $ race <chr> "White", "White", "White", "White", "White", "White",…
Using base R it is possible to join datasets either by row or column
## a b c
## 1 1 7 3
## 2 3 7 3
## 3 3 8 6
## 4 4 3 6
## 5 5 2 8
## [1] 11 14 17
# rbind() will join this datasets together as they are of equal length and stack one atop the other
df_new <- rbind(df, df2)
df_new## a b c
## 1 1 7 3
## 2 3 7 3
## 3 3 8 6
## 4 4 3 6
## 5 5 2 8
## 6 11 14 17
This is an example of how to join dataframes by column
## a b c
## 1 1 7 3
## 2 3 7 3
## 3 3 8 6
## 4 4 3 6
## 5 5 2 8
## [1] 11 14 16 17 22
## a b c df2
## 1 1 7 3 11
## 2 3 7 3 14
## 3 3 8 6 16
## 4 4 3 6 17
## 5 5 2 8 22
The key advantage to using inner_join() from ‘dplyr’ is that it allows for the linking by specific named variables, in this case the primary key in both judges datasets
## # A tibble: 6 × 1
## judge_id
## <int>
## 1 3419
## 2 1
## 3 2
## 4 3
## 5 4
## 6 5
## # A tibble: 6 × 1
## judge_id
## <int>
## 1 3419
## 2 1
## 3 2
## 4 3
## 5 4
## 6 5
Be sure to create a new dataframe or the join will not be saved to your workspace
## Joining, by = "judge_id"
Note that the people data has been linked to the appointments, and there are clearly multiple entries, indicating that a number of individuals were appointed to numerous judicial posts.
## Rows: 4,202
## Columns: 27
## $ judge_id <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ court_name <chr> "U. S. District Court, Southern Distric…
## $ court_type <chr> "USDC", "USDC", "USDC", "USDC", "USDC",…
## $ president_name <chr> "Barack Obama", "Franklin D. Roosevelt"…
## $ president_party <chr> "Democratic", "Democratic", "Republican…
## $ nomination_date <chr> "07/28/2011", "02/03/1936", "01/06/1880…
## $ predecessor_last_name <chr> "Kaplan", "new", "Ketcham", "McFadden",…
## $ predecessor_first_name <chr> "Lewis A.", NA, "Winthrop", "Frank H.",…
## $ senate_confirmation_date <chr> "03/22/2012", "02/12/1936", "01/14/1880…
## $ commission_date <chr> "03/23/2012", "02/15/1936", "01/14/1880…
## $ chief_judge_begin <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ chief_judge_end <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ retirement_from_active_service <chr> NA, "02/15/1966", NA, "05/31/1996", "02…
## $ termination_date <chr> NA, "05/28/1971", "02/09/1891", NA, "12…
## $ termination_reason <chr> NA, "Death", "Appointment to Another Ju…
## $ name_first <chr> "Ronnie", "Matthew", "Marcus", "William…
## $ name_middle <chr> NA, "T.", "Wilson", "Marsh", "Arnold", …
## $ name_last <chr> "Abrams", "Abruzzo", "Acheson", "Acker"…
## $ name_suffix <chr> NA, NA, NA, "Jr.", NA, NA, NA, NA, NA, …
## $ birth_date <int> 1968, 1889, 1828, 1927, 1928, 1926, 192…
## $ birthplace_city <chr> "New York", "Brooklyn", "Washington", "…
## $ birthplace_state <chr> "NY", "NY", "PA", "AL", "NJ", "FL", "NY…
## $ death_date <int> NA, 1971, 1906, NA, 2009, 1984, NA, 195…
## $ death_city <chr> NA, "Potomac", "Pittsburgh", NA, "West …
## $ death_state <chr> NA, "MD", "PA", NA, "NJ", "IL", NA, NA,…
## $ gender <chr> "F", "M", "M", "M", "M", "M", "M", "M",…
## $ race <chr> "White", "White", "White", "White", "Wh…
## # A tibble: 3,532 × 2
## judge_id n
## <int> <int>
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
## 7 7 1
## 8 8 1
## 9 9 1
## 10 10 1
## # … with 3,522 more rows
# There are 4,202 appointments, however there are only 3532 individual judges
judges_appointments %>% count(judge_id)## # A tibble: 3,532 × 2
## judge_id n
## <int> <int>
## 1 1 1
## 2 2 3
## 3 3 1
## 4 4 1
## 5 5 2
## 6 6 1
## 7 7 1
## 8 8 1
## 9 9 3
## 10 10 3
## # … with 3,522 more rows
## # A tibble: 3,532 × 2
## judge_id n
## <int> <int>
## 1 1 1
## 2 2 3
## 3 3 1
## 4 4 1
## 5 5 2
## 6 6 1
## 7 7 1
## 8 8 1
## 9 9 3
## 10 10 3
## # … with 3,522 more rows
There are a number of different dates included in the judges_unified dataframe. However, none of these variables are the correct class as shown by the glimpse().
## [1] "07/28/2011" "02/03/1936" "01/06/1880" "07/22/1982" "09/28/1979"
## [6] "06/18/1976"
## [1] "03/22/2012" "02/12/1936" "01/14/1880" "08/18/1982" "10/31/1979"
## [6] "07/02/1976"
## [1] "03/23/2012" "02/15/1936" "01/14/1880" "08/18/1982" "11/02/1979"
## [6] "07/02/1976"
## [1] NA "05/28/1971" "02/09/1891" NA "12/02/2009"
## [6] "03/31/1979"
There are a number of different ways to adjust dates, however, as the data is structured we can use the mdy() command from the package ‘lubridate’ to make a relatively simple change.
# Create some sample dates
begin <- c("May 11, 1996", "September 12, 2001", "July 1, 1988")
end <- c("7/8/97","10/23/02","1/4/91")
class(begin)## [1] "character"
## [1] "character"
## [1] "1996-05-11" "2001-09-12" "1988-07-01"
## [1] "1997-07-08" "2002-10-23" "1991-01-04"
## [1] "Date"
## [1] "Date"
Use the mdy() command and verify the results with the class() command
mdy(judges_unified$nomination_date) -> judges_unified$nomination_date
class(judges_unified$nomination_date)## [1] "Date"
mdy(judges_unified$senate_confirmation_date) -> judges_unified$senate_confirmation_date
class(judges_unified$senate_confirmation_date)## [1] "Date"
mdy(judges_unified$commission_date) -> judges_unified$commission_date
class(judges_unified$commission_date)## [1] "Date"
mdy(judges_unified$termination_date) -> judges_unified$termination_date
class(judges_unified$termination_date)## [1] "Date"
First you need to create a new dataframe that provides the number of nominations per day
# This creates a new variable, however, you will need to rename the column names
judges_unified %>% count(nomination_date) -> judges_nominations_date
judges_nominations_date## # A tibble: 2,036 × 2
## nomination_date n
## <date> <int>
## 1 1789-09-24 13
## 2 1789-09-25 2
## 3 1790-02-08 4
## 4 1790-06-11 1
## 5 1790-07-02 1
## 6 1790-08-02 1
## 7 1790-12-17 2
## 8 1791-03-04 1
## 9 1791-10-31 2
## 10 1792-01-12 1
## # … with 2,026 more rows
This is vital as you will create multiple smaller dataframes and need to individualise the column names. This will prevent future errors.
# Rename the columns
colnames(judges_nominations_date) <- c("Date", "Nominations")
# Check your results
judges_nominations_date## # A tibble: 2,036 × 2
## Date Nominations
## <date> <int>
## 1 1789-09-24 13
## 2 1789-09-25 2
## 3 1790-02-08 4
## 4 1790-06-11 1
## 5 1790-07-02 1
## 6 1790-08-02 1
## 7 1790-12-17 2
## 8 1791-03-04 1
## 9 1791-10-31 2
## 10 1792-01-12 1
## # … with 2,026 more rows
Group nomination dates into years using the floor_date() command from the ‘lubridate’ package. Its fairly intelligent and can reorganise dates into days, months, years etc.
judges_nominations_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Nominations=sum(Nominations)) -> judges_nominations_yearly
judges_nominations_yearly## # A tibble: 220 × 2
## year No_of_Nominations
## <date> <int>
## 1 1789-01-01 15
## 2 1790-01-01 9
## 3 1791-01-01 3
## 4 1792-01-01 1
## 5 1793-01-01 2
## 6 1794-01-01 1
## 7 1795-01-01 2
## 8 1796-01-01 5
## 9 1797-01-01 1
## 10 1798-01-01 2
## # … with 210 more rows
Create a new dataframe for judges terminations
# This creates a new variable, however, you will need to rename the column names
judges_unified %>% count(termination_date) -> judges_terminations_date
judges_terminations_date## # A tibble: 2,498 × 2
## termination_date n
## <date> <int>
## 1 1790-05-18 1
## 2 1790-08-16 1
## 3 1790-10-12 1
## 4 1791-03-05 1
## 5 1791-05-09 1
## 6 1792-01-04 1
## 7 1793-01-01 1
## 8 1793-01-16 1
## 9 1794-03-17 1
## 10 1794-06-09 1
## # … with 2,488 more rows
Create a new dataframe for judges commissions
# This creates a new variable, however, you will need to rename the column names
judges_unified %>% count(commission_date) -> judges_commissions_date
judges_commissions_date## # A tibble: 2,066 × 2
## commission_date n
## <date> <int>
## 1 1789-09-26 12
## 2 1789-09-27 1
## 3 1789-09-29 1
## 4 1789-09-30 1
## 5 1790-02-10 4
## 6 1790-06-14 1
## 7 1790-07-03 1
## 8 1790-08-03 1
## 9 1790-12-20 2
## 10 1791-03-04 1
## # … with 2,056 more rows
# Rename the columns
colnames(judges_terminations_date) <- c("Date", "Terminations")
# Check your results
judges_terminations_date## # A tibble: 2,498 × 2
## Date Terminations
## <date> <int>
## 1 1790-05-18 1
## 2 1790-08-16 1
## 3 1790-10-12 1
## 4 1791-03-05 1
## 5 1791-05-09 1
## 6 1792-01-04 1
## 7 1793-01-01 1
## 8 1793-01-16 1
## 9 1794-03-17 1
## 10 1794-06-09 1
## # … with 2,488 more rows
# Rename the columns
colnames(judges_commissions_date) <- c("Date", "Commissions")
# Check your results
judges_commissions_date## # A tibble: 2,066 × 2
## Date Commissions
## <date> <int>
## 1 1789-09-26 12
## 2 1789-09-27 1
## 3 1789-09-29 1
## 4 1789-09-30 1
## 5 1790-02-10 4
## 6 1790-06-14 1
## 7 1790-07-03 1
## 8 1790-08-03 1
## 9 1790-12-20 2
## 10 1791-03-04 1
## # … with 2,056 more rows
judges_terminations_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Terminations=sum(Terminations)) -> judges_terminations_yearly
judges_terminations_yearly## # A tibble: 222 × 2
## year No_of_Terminations
## <date> <int>
## 1 1790-01-01 3
## 2 1791-01-01 2
## 3 1792-01-01 1
## 4 1793-01-01 2
## 5 1794-01-01 2
## 6 1795-01-01 4
## 7 1796-01-01 3
## 8 1797-01-01 1
## 9 1798-01-01 2
## 10 1799-01-01 2
## # … with 212 more rows
judges_commissions_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Commissions=sum(Commissions)) -> judges_commissions_yearly
judges_commissions_yearly## # A tibble: 219 × 2
## year No_of_Commissions
## <date> <int>
## 1 1789-01-01 15
## 2 1790-01-01 9
## 3 1791-01-01 3
## 4 1792-01-01 1
## 5 1793-01-01 1
## 6 1794-01-01 3
## 7 1795-01-01 1
## 8 1796-01-01 4
## 9 1797-01-01 3
## 10 1798-01-01 2
## # … with 209 more rows
Using ggplot we can display the findings
## Warning: Removed 1 row(s) containing missing values (geom_path).
Note how with a additional information added to the basic plot we can create a very effective graph
judges_commissions_yearly %>%
ggplot(aes(x=year, y=No_of_Commissions)) +
geom_line(size = 0.8) +
labs(title = "Judicial Commmissions - 1789-2014",
tag = "Figure 1",
x = "Year",
y = "No.") +
scale_x_date(date_breaks = "50 years", date_labels = "%Y") +
theme_classic() +
theme(axis.text.x = element_text(colour = "darkslategrey", size = 16),
axis.text.y = element_text(colour = "darkslategrey", size = 16),
legend.background = element_rect(fill = "white", size = 4, colour = "white"),
legend.justification = c(0, 1),
legend.position = c(0.9, 1),
text = element_text(family = "Georgia"),
plot.title = element_text(size = 18, margin = margin(b = 10)),
plot.subtitle = element_text(size = 12, color = "darkslategrey", margin = margin(b = 25)),
plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))## Warning: Removed 1 row(s) containing missing values (geom_path).