Autumn School - Day 1

Introduction to R - NUIM - MA in Business Analytics

Dr. J. Kavanagh

2022-09-12

Introduction

This lecture is designed to show you the potential for using R for spatial analysis when applied to historical questions.

You must have the following installed on your computer before we begin.

Open RStudio and type the following in the console

getwd()
## [1] "/Users/jackkavanagh/Dropbox/R_Business"

Set the Working Directory using this command, placing the path to your course folder in quotes between the brackets

    setwd()
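
As a minimal sketch of how getwd() and setwd() fit together, the example below switches to a temporary folder and then back again. The use of tempdir() is an assumption made only so that the code runs on any machine; in class you would supply the path to your own course folder instead.

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# setwd() needs a path; tempdir() stands in for your own folder here
setwd(tempdir())
getwd()

# Go back to the original working directory
setwd(old_wd)
```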

Base R

Since the creation of the ‘Tidyverse’ by Hadley Wickham, there has been a trend to teach solely within the Tidyverse framework, which concentrates on a number of small, interlinked packages in R. This approach, while useful for giving a basic overview of R, prevents users from understanding the full potential of R for analysis and leads to errors that can often be solved quickly using base R commands.

Some Basic Commands in R

The console is where you type in commands and after hitting return you get results. As R was originally designed for statistical purposes, it has fully operational calculator functions as follows:

            + = add
            - = subtract
            * = multiply
            / = divide
1+3
## [1] 4
2*10
## [1] 20
100 - 48
## [1] 52
150 / 25
## [1] 6

R can also perform scientific calculator functions

pi
## [1] 3.141593
3*pi
## [1] 9.424778
sin(pi/2)
## [1] 1

You can assign values and the results of calculations to a named object in R using <-

x <- 3*pi

print(x)
## [1] 9.424778
y <- 1:5

y
## [1] 1 2 3 4 5
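
Once values are stored in a vector, most calculations apply to every element at once. The following is a small sketch of that idea; the name doubled is introduced here purely for illustration.

```r
# A vector of the whole numbers 1 to 5
y <- 1:5

# Arithmetic is applied to each element in turn
doubled <- y * 2
doubled        # 2 4 6 8 10

# Summary functions collapse the vector to a single value
length(y)      # 5
sum(y)         # 15
mean(y)        # 3
```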

R Loops

Loops are essential automated processes that can analyse multiple datasets. They often appear quite off-putting but are simple to understand if broken down into the distinct types that are available.

# Create a new vector 
r_loop <- 1 

r_loop
## [1] 1

Now this loop is going to add 4 to the r_loop vector 15 times.

for(i in 1:15) {              # Head of for-loop
  
  r_loop <- r_loop + 4          # Body of for-loop
  print(r_loop)
}
## [1] 5
## [1] 9
## [1] 13
## [1] 17
## [1] 21
## [1] 25
## [1] 29
## [1] 33
## [1] 37
## [1] 41
## [1] 45
## [1] 49
## [1] 53
## [1] 57
## [1] 61
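
Printing inside a loop shows the values but does not keep them. A possible variation, sketched below, stores each result in a vector so it can be used later. The names results and value are assumptions made for this illustration.

```r
# Pre-allocate a numeric vector with one slot per iteration
results <- numeric(15)
value <- 1

for (i in 1:15) {
  value <- value + 4       # same calculation as before
  results[i] <- value      # store the result instead of printing it
}

results

# The same sequence can be produced without a loop
seq(from = 5, by = 4, length.out = 15)
```

For simple arithmetic like this the seq() version is shorter, but the loop form is easier to extend when each step involves several commands.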

R Loops - Character Vectors

r_loop_2 <- c("Ringo", "John", "Paul", "George", "Linda", 
              "Janice", "Ella", "Sarah", "Barbara")
for(i in r_loop_2) {         # Loop over character vector
print(paste("The name", i, "consists of", nchar(i), "characters."))
}
## [1] "The name Ringo consists of 5 characters."
## [1] "The name John consists of 4 characters."
## [1] "The name Paul consists of 4 characters."
## [1] "The name George consists of 6 characters."
## [1] "The name Linda consists of 5 characters."
## [1] "The name Janice consists of 6 characters."
## [1] "The name Ella consists of 4 characters."
## [1] "The name Sarah consists of 5 characters."
## [1] "The name Barbara consists of 7 characters."
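
Many base R functions, including nchar() and paste(), already work on a whole vector at once, so the same result can be obtained without writing a loop. This is a short sketch using a shortened, hypothetical vector called beatles.

```r
# A shorter character vector for illustration
beatles <- c("Ringo", "John", "Paul", "George")

# nchar() returns one length per name
name_lengths <- nchar(beatles)
name_lengths   # 5 4 4 6

# paste() builds one sentence per name
paste("The name", beatles, "consists of", name_lengths, "characters.")
```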

R Loops - Breaks & Skipping

You can tell your loop either to skip an iteration or to stop altogether, using the very simple keywords next and break.

for(i in 1:5) {                     # for-loop with break
  
  if(i == 4) {
    break
  }
  
  print(paste("Sin é", i))
}
## [1] "Sin é 1"
## [1] "Sin é 2"
## [1] "Sin é 3"
for(i in 1:7) {                     # for-loop with skip
  
  if(i == 4) {
    next
  }
  
  print(paste("Anseo", i))
}
## [1] "Anseo 1"
## [1] "Anseo 2"
## [1] "Anseo 3"
## [1] "Anseo 5"
## [1] "Anseo 6"
## [1] "Anseo 7"
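
The examples so far all use for-loops. Another loop type in base R is the while-loop, which keeps running for as long as a condition is true. The sketch below is an illustration only; the name counter is an assumption.

```r
# Start the counter at 1
counter <- 1

# Keep looping while the condition is TRUE
while (counter <= 3) {
  print(paste("Count", counter))
  counter <- counter + 1   # without this line the loop would never end
}
```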

Load the Libraries

Load the following libraries

library(tidyverse)
library(lubridate)
library(ggthemes)
library(reshape2)

After you’ve loaded the libraries, place the following file in your Working Directory and import it using this command

load('Business_1.RData')

There should be a number of distinct dataframes:

Key Commands to Remember

The %>% pipeline command will be used throughout this lecture to link various command queries

    early_colleges %>% select(established)

The %in% command is used for matching a vector within a dataframe

    early_colleges %>% filter(established %in% c('1795','1797','1802'))

The $ command is used to display the internal components of a dataframe

    early_colleges$

Please run these now in the Console section of RStudio
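
If you want to try the three commands before loading the course data, the sketch below uses a small made-up dataframe. The names toy_colleges and matched, and the values in them, are hypothetical and are not part of the early_colleges dataset.

```r
library(dplyr)

# Hypothetical toy data, not the early_colleges dataset
toy_colleges <- data.frame(
  college     = c("Alpha", "Beta", "Gamma"),
  established = c(1795, 1801, 1802),
  stringsAsFactors = FALSE
)

# %>% passes the dataframe on to the next command
toy_colleges %>% select(established)

# %in% keeps the rows whose value appears in the given vector
matched <- toy_colleges %>% filter(established %in% c(1795, 1802))
matched

# $ extracts a single column as a vector
toy_colleges$college
```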

Explore the Data

The head() and tail() commands are useful for exploring datasets: by default they show the first six and the last six rows respectively. This is particularly useful after importing data, to check that all the information has been read in correctly.

head(early_colleges)
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Columbia               King's College      New … NY           1754 Anglican
tail(early_colleges)
## # A tibble: 6 × 6
##   college             original_name city      state established sponsorship   
##   <chr>               <chr>         <chr>     <chr>       <int> <chr>         
## 1 Beloit              <NA>          Beloit    WI           1846 Congregational
## 2 Bucknell            <NA>          Lewisburg PA           1846 Baptist       
## 3 Grinnell            <NA>          Grinnell  IA           1846 Congregational
## 4 Mount Union         <NA>          Alliance  OH           1846 Methodist     
## 5 Earlham             <NA>          Richmond  IN           1847 Quaker        
## 6 Wisconsin, Univ. of <NA>          Madison   WI           1848 Secular
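
Both commands accept an n argument that sets how many rows are returned. The sketch below uses the built-in mtcars dataset so that it runs without the course data; the names first_ten and last_three are assumptions.

```r
# mtcars ships with R, so no extra files are needed
data(mtcars)

# Ask for a specific number of rows with n =
first_ten  <- head(mtcars, n = 10)
last_three <- tail(mtcars, n = 3)

nrow(first_ten)    # 10
nrow(last_three)   # 3
```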

A bird’s eye view of the dataset

Another way to view an entire dataset is to use the glimpse() command, which displays the overall dataset and the class of each variable

glimpse(early_colleges)
## Rows: 65
## Columns: 6
## $ college       <chr> "Harvard", "William and Mary", "Yale", "Pennsylvania, Un…
## $ original_name <chr> NA, NA, NA, NA, "College of New Jersey", "King's College…
## $ city          <chr> "Cambridge", "Williamsburg", "New Haven", "Philadelphia"…
## $ state         <chr> "MA", "VA", "CT", "PA", "NJ", "NY", "RI", "NJ", "NH", "S…
## $ established   <int> 1636, 1693, 1701, 1740, 1746, 1754, 1765, 1766, 1769, 17…
## $ sponsorship   <chr> "Congregational; after 1805 Unitarian", "Anglican", "Con…

The summary() command provides an overview of the dataset, most beneficial with numerical data

summary(early_colleges)
##    college          original_name          city              state          
##  Length:65          Length:65          Length:65          Length:65         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   established   sponsorship       
##  Min.   :1636   Length:65         
##  1st Qu.:1793   Class :character  
##  Median :1823   Mode  :character  
##  Mean   :1810                     
##  3rd Qu.:1838                     
##  Max.   :1848
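
As the output shows, summary() says little about character columns. One possible complement, sketched here on the built-in mtcars dataset rather than the course data, is to use summary() on a single numeric column and table() to count the values of a categorical one.

```r
data(mtcars)

# summary() on one numeric column gives its spread
summary(mtcars$mpg)

# table() counts how often each value occurs
cyl_counts <- table(mtcars$cyl)
cyl_counts
```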

Selecting and Filtering the Data

The select() command from ‘dplyr’ picks out specific columns from a dataframe. It is very versatile and can be chained to other commands using the %>% pipeline

early_colleges %>% select(college, city, state) %>% head()
## # A tibble: 6 × 3
##   college                city         state
##   <chr>                  <chr>        <chr>
## 1 Harvard                Cambridge    MA   
## 2 William and Mary       Williamsburg VA   
## 3 Yale                   New Haven    CT   
## 4 Pennsylvania, Univ. of Philadelphia PA   
## 5 Princeton              Princeton    NJ   
## 6 Columbia               New York     NY

The filter() command, also from ‘dplyr’, keeps only the rows that meet a condition, which can be numerical or text-based.

This example shows the first six colleges established prior to 1800

early_colleges %>% filter(established < 1800) %>% head()
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Columbia               King's College      New … NY           1754 Anglican

Filtering (cont.)

This example shows the colleges in the state of New York using the == operator

early_colleges %>% filter(state == "NY")
## # A tibble: 6 × 6
##   college               original_name  city        state established sponsorship
##   <chr>                 <chr>          <chr>       <chr>       <int> <chr>      
## 1 Columbia              King's College New York    NY           1754 Anglican   
## 2 Union College         <NA>           Schenectady NY           1795 Presbyteri…
## 3 U.S. Military Academy <NA>           West Point  NY           1802 Secular    
## 4 Colgate               <NA>           Hamilton    NY           1819 Baptist    
## 5 New York Univ.        <NA>           New York    NY           1831 Nondenomin…
## 6 Fordham               <NA>           Fordham     NY           1841 Roman Cath…

Using the != operator displays the colleges in all states other than New York

early_colleges %>% filter(state != "NY") %>% head()
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Brown                  <NA>                Prov… RI           1765 Baptist

The simple logical operators for the filter command are:

    & (and)

    | (or)

    ! (not)
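
The three operators can be sketched on a small made-up dataframe as follows. The dataframe toy and the result names ny_early, ny_or_va and not_ny are hypothetical and only loosely modelled on the course data.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  college     = c("A", "B", "C", "D"),
  state       = c("NY", "MA", "NY", "VA"),
  established = c(1754, 1793, 1831, 1693),
  stringsAsFactors = FALSE
)

# & : both conditions must be true (college A only)
ny_early <- toy %>% filter(state == "NY" & established < 1800)

# | : at least one condition must be true (colleges A, C and D)
ny_or_va <- toy %>% filter(state == "NY" | state == "VA")

# ! : reverses a condition (colleges B and D)
not_ny <- toy %>% filter(!(state == "NY"))
```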

Filtering (cont.)

This example shows the first six colleges in the states of New York, Massachusetts and Virginia:

early_colleges %>% filter(state %in% c( 'NY','VA','MA')) %>% head()
## # A tibble: 6 × 6
##   college          original_name  city           state established sponsorship  
##   <chr>            <chr>          <chr>          <chr>       <int> <chr>        
## 1 Harvard          <NA>           Cambridge      MA           1636 Congregation…
## 2 William and Mary <NA>           Williamsburg   VA           1693 Anglican     
## 3 Columbia         King's College New York       NY           1754 Anglican     
## 4 Hampden-Sydney   <NA>           Hampden-Sydney VA           1775 Presbyterian 
## 5 Williams         <NA>           Williamstown   MA           1793 Congregation…
## 6 Union College    <NA>           Schenectady    NY           1795 Presbyterian…

Note that the %in% command can also be expressed as follows:

# Create a new character list of three states using their abbreviations
three_states <- c("NY", "VA", "MA")

# Filter for this using the %in% command
early_colleges %>% filter(state %in% three_states) %>% head()
## # A tibble: 6 × 6
##   college          original_name  city           state established sponsorship  
##   <chr>            <chr>          <chr>          <chr>       <int> <chr>        
## 1 Harvard          <NA>           Cambridge      MA           1636 Congregation…
## 2 William and Mary <NA>           Williamsburg   VA           1693 Anglican     
## 3 Columbia         King's College New York       NY           1754 Anglican     
## 4 Hampden-Sydney   <NA>           Hampden-Sydney VA           1775 Presbyterian 
## 5 Williams         <NA>           Williamstown   MA           1793 Congregation…
## 6 Union College    <NA>           Schenectady    NY           1795 Presbyterian…

Creating new variables using count() and mutate()

This will be one of the most common ways to explore a dataset: picking out specific variables and counting how often each value occurs

early_colleges %>% select(state) %>% count(state)
## # A tibble: 25 × 2
##    state     n
##    <chr> <int>
##  1 CT        3
##  2 DC        2
##  3 GA        2
##  4 IA        1
##  5 IL        1
##  6 IN        2
##  7 KY        1
##  8 LA        1
##  9 MA        6
## 10 MD        2
## # … with 15 more rows
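
count() also accepts a sort argument, which puts the most frequent values first. A minimal sketch on made-up data follows; toy_states and state_counts are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: one row per college
toy_states <- data.frame(
  state = c("MA", "NY", "MA", "CT", "MA", "NY"),
  stringsAsFactors = FALSE
)

# sort = TRUE orders the result from most to least frequent
state_counts <- toy_states %>% count(state, sort = TRUE)
state_counts
```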

To create a new column, in this case merging two geographic attributes into a single column, use mutate(). The sep= argument sets how the two values are separated. In this example it is a comma with no space, e.g. New York,NY

early_colleges %>% mutate(location=paste(city,state,sep=","))
## # A tibble: 65 × 7
##    college            original_name city  state established sponsorship location
##    <chr>              <chr>         <chr> <chr>       <int> <chr>       <chr>   
##  1 Harvard            <NA>          Camb… MA           1636 Congregati… Cambrid…
##  2 William and Mary   <NA>          Will… VA           1693 Anglican    William…
##  3 Yale               <NA>          New … CT           1701 Congregati… New Hav…
##  4 Pennsylvania, Uni… <NA>          Phil… PA           1740 Nondenomin… Philade…
##  5 Princeton          College of N… Prin… NJ           1746 Presbyteri… Princet…
##  6 Columbia           King's Colle… New … NY           1754 Anglican    New Yor…
##  7 Brown              <NA>          Prov… RI           1765 Baptist     Provide…
##  8 Rutgers            Queen's Coll… New … NJ           1766 Dutch Refo… New Bru…
##  9 Dartmouth          <NA>          Hano… NH           1769 Congregati… Hanover…
## 10 Charleston, Coll.… <NA>          Char… SC           1770 Anglican    Charles…
## # … with 55 more rows

Saving new variables

Although mutate() creates a new column, it will be lost unless you save the result back into the main dataframe. Always point your code back to the original dataframe using the -> command.

Some programmers use the = sign for assignment. This is not recommended in R, as = is also used to name arguments inside function calls, e.g. sep=",".

# Now when you run this code, the number of variables will increase to 7
early_colleges %>% mutate(location=paste(city,state,sep=","))-> early_colleges
early_colleges
## # A tibble: 65 × 7
##    college            original_name city  state established sponsorship location
##    <chr>              <chr>         <chr> <chr>       <int> <chr>       <chr>   
##  1 Harvard            <NA>          Camb… MA           1636 Congregati… Cambrid…
##  2 William and Mary   <NA>          Will… VA           1693 Anglican    William…
##  3 Yale               <NA>          New … CT           1701 Congregati… New Hav…
##  4 Pennsylvania, Uni… <NA>          Phil… PA           1740 Nondenomin… Philade…
##  5 Princeton          College of N… Prin… NJ           1746 Presbyteri… Princet…
##  6 Columbia           King's Colle… New … NY           1754 Anglican    New Yor…
##  7 Brown              <NA>          Prov… RI           1765 Baptist     Provide…
##  8 Rutgers            Queen's Coll… New … NJ           1766 Dutch Refo… New Bru…
##  9 Dartmouth          <NA>          Hano… NH           1769 Congregati… Hanover…
## 10 Charleston, Coll.… <NA>          Char… SC           1770 Anglican    Charles…
## # … with 55 more rows
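
The same save can be written with the more common leftward arrow <-, placing the name of the dataframe first. The sketch below uses a made-up dataframe called toy, and sep = ", " is chosen here so that a space follows the comma.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  city  = c("New York", "Cambridge"),
  state = c("NY", "MA"),
  stringsAsFactors = FALSE
)

# Leftward assignment: the result overwrites toy
toy <- toy %>% mutate(location = paste(city, state, sep = ", "))

toy$location   # "New York, NY" "Cambridge, MA"
```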

Class Exercise

Filter the early_colleges to show the colleges established in the original 13 colonies of the United States of America

Create a new object from the early_colleges dataset showing the largest religious sponsorship of colleges

Displaying Research Findings

A useful package for visualising research findings as charts and graphs is ‘ggplot2’. It is included in the ‘tidyverse’ package and follows the guidelines of the ‘Layered Grammar of Graphics’.

The key layers are:

Brief tutorial for ggplot2

We’re going to go through Prof. Chris Brunsdon’s introductory lecture to using ggplot2 available here

# A number of training datasets are included with R, a common one that is used for tutorials is the 'mtcars' 

data(mtcars)
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Examine the data

First examine each dataset, noting the number of rows and variables and the class of each variable.

glimpse(judges_appointments)
## Rows: 4,202
## Columns: 15
## $ judge_id                       <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ court_name                     <chr> "U. S. District Court, Southern Distric…
## $ court_type                     <chr> "USDC", "USDC", "USDC", "USDC", "USDC",…
## $ president_name                 <chr> "Barack Obama", "Franklin D. Roosevelt"…
## $ president_party                <chr> "Democratic", "Democratic", "Republican…
## $ nomination_date                <chr> "07/28/2011", "02/03/1936", "01/06/1880…
## $ predecessor_last_name          <chr> "Kaplan", "new", "Ketcham", "McFadden",…
## $ predecessor_first_name         <chr> "Lewis A.", NA, "Winthrop", "Frank H.",…
## $ senate_confirmation_date       <chr> "03/22/2012", "02/12/1936", "01/14/1880…
## $ commission_date                <chr> "03/23/2012", "02/15/1936", "01/14/1880…
## $ chief_judge_begin              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ chief_judge_end                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ retirement_from_active_service <chr> NA, "02/15/1966", NA, "05/31/1996", "02…
## $ termination_date               <chr> NA, "05/28/1971", "02/09/1891", NA, "12…
## $ termination_reason             <chr> NA, "Death", "Appointment to Another Ju…
glimpse(judges_people)
## Rows: 3,532
## Columns: 13
## $ judge_id         <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 2989, 32…
## $ name_first       <chr> "Ronnie", "Matthew", "Marcus", "William", "Harold", "…
## $ name_middle      <chr> NA, "T.", "Wilson", "Marsh", "Arnold", "Waldo", "L.",…
## $ name_last        <chr> "Abrams", "Abruzzo", "Acheson", "Acker", "Ackerman", …
## $ name_suffix      <chr> NA, NA, NA, "Jr.", NA, NA, NA, NA, NA, NA, NA, NA, "J…
## $ birth_date       <int> 1968, 1889, 1828, 1927, 1928, 1926, 1925, 1887, 1921,…
## $ birthplace_city  <chr> "New York", "Brooklyn", "Washington", "Birmingham", "…
## $ birthplace_state <chr> "NY", "NY", "PA", "AL", "NJ", "FL", "NY", "IL", "PA",…
## $ death_date       <int> NA, 1971, 1906, NA, 2009, 1984, NA, 1956, NA, 1916, 1…
## $ death_city       <chr> NA, "Potomac", "Pittsburgh", NA, "West Orange", "Spri…
## $ death_state      <chr> NA, "MD", "PA", NA, "NJ", "IL", NA, NA, NA, "MO", "MS…
## $ gender           <chr> "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
## $ race             <chr> "White", "White", "White", "White", "White", "White",…

Joining Datasets - Row

Using base R it is possible to join datasets either by row or column

# Create data frame
df <- data.frame(a=c(1, 3, 3, 4, 5),
                 b=c(7, 7, 8, 3, 2),
                 c=c(3, 3, 6, 6, 8))

df
##   a b c
## 1 1 7 3
## 2 3 7 3
## 3 3 8 6
## 4 4 3 6
## 5 5 2 8
# Create a vector to add as a new row

df2 <- c(11, 14, 17)

df2
## [1] 11 14 17
# rbind() will stack the vector beneath the dataframe, as its length equals the number of columns
df_new <- rbind(df, df2)

df_new
##    a  b  c
## 1  1  7  3
## 2  3  7  3
## 3  3  8  6
## 4  4  3  6
## 5  5  2  8
## 6 11 14 17
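
rbind() can also stack two dataframes, provided they have the same column names. A minimal sketch, in which extra_rows and df_stacked are hypothetical names:

```r
# A small dataframe
df <- data.frame(a = c(1, 3), b = c(7, 7), c = c(3, 3))

# A second dataframe with the same column names
extra_rows <- data.frame(a = c(11, 12), b = c(14, 15), c = c(17, 18))

# Stack the second beneath the first
df_stacked <- rbind(df, extra_rows)

df_stacked
```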

Joining Datasets - Column

This is an example of how to join dataframes by column

# Create data frame
df <- data.frame(a=c(1, 3, 3, 4, 5),
                 b=c(7, 7, 8, 3, 2),
                 c=c(3, 3, 6, 6, 8))

df
##   a b c
## 1 1 7 3
## 2 3 7 3
## 3 3 8 6
## 4 4 3 6
## 5 5 2 8
# Define vector
df2 <- c(11, 14, 16, 17, 22)
df2
## [1] 11 14 16 17 22
# cbind vector to data frame
df_new <- cbind(df, df2)

df_new
##   a b c df2
## 1 1 7 3  11
## 2 3 7 3  14
## 3 3 8 6  16
## 4 4 3 6  17
## 5 5 2 8  22
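
By default the new column takes the name of the vector. To choose a different name, supply it inside cbind(). The sketch below assumes a new column called d; the names new_values and df_named are also hypothetical.

```r
# A small dataframe
df <- data.frame(a = c(1, 3, 3), b = c(7, 7, 8))

# A vector with one value per row
new_values <- c(11, 14, 16)

# Name the new column d while binding
df_named <- cbind(df, d = new_values)

names(df_named)   # "a" "b" "d"
```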

Creating a new dataframe after joining

Be sure to create a new dataframe or the join will not be saved to your workspace

# by = names the shared key column (by.x belongs to base R's merge(), not inner_join())
judges_unified <- inner_join(judges_appointments, judges_people, by = "judge_id")

Note that the people data has been linked to the appointments, and there are clearly multiple entries, indicating that a number of individuals were appointed to numerous judicial posts.

glimpse(judges_unified)
## Rows: 4,202
## Columns: 27
## $ judge_id                       <int> 3419, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ court_name                     <chr> "U. S. District Court, Southern Distric…
## $ court_type                     <chr> "USDC", "USDC", "USDC", "USDC", "USDC",…
## $ president_name                 <chr> "Barack Obama", "Franklin D. Roosevelt"…
## $ president_party                <chr> "Democratic", "Democratic", "Republican…
## $ nomination_date                <chr> "07/28/2011", "02/03/1936", "01/06/1880…
## $ predecessor_last_name          <chr> "Kaplan", "new", "Ketcham", "McFadden",…
## $ predecessor_first_name         <chr> "Lewis A.", NA, "Winthrop", "Frank H.",…
## $ senate_confirmation_date       <chr> "03/22/2012", "02/12/1936", "01/14/1880…
## $ commission_date                <chr> "03/23/2012", "02/15/1936", "01/14/1880…
## $ chief_judge_begin              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ chief_judge_end                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ retirement_from_active_service <chr> NA, "02/15/1966", NA, "05/31/1996", "02…
## $ termination_date               <chr> NA, "05/28/1971", "02/09/1891", NA, "12…
## $ termination_reason             <chr> NA, "Death", "Appointment to Another Ju…
## $ name_first                     <chr> "Ronnie", "Matthew", "Marcus", "William…
## $ name_middle                    <chr> NA, "T.", "Wilson", "Marsh", "Arnold", …
## $ name_last                      <chr> "Abrams", "Abruzzo", "Acheson", "Acker"…
## $ name_suffix                    <chr> NA, NA, NA, "Jr.", NA, NA, NA, NA, NA, …
## $ birth_date                     <int> 1968, 1889, 1828, 1927, 1928, 1926, 192…
## $ birthplace_city                <chr> "New York", "Brooklyn", "Washington", "…
## $ birthplace_state               <chr> "NY", "NY", "PA", "AL", "NJ", "FL", "NY…
## $ death_date                     <int> NA, 1971, 1906, NA, 2009, 1984, NA, 195…
## $ death_city                     <chr> NA, "Potomac", "Pittsburgh", NA, "West …
## $ death_state                    <chr> NA, "MD", "PA", NA, "NJ", "IL", NA, NA,…
## $ gender                         <chr> "F", "M", "M", "M", "M", "M", "M", "M",…
## $ race                           <chr> "White", "White", "White", "White", "Wh…
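
The way one person can match several appointments is easier to see on a small scale. The sketch below uses two made-up dataframes, toy_people and toy_appointments, whose values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per judge
toy_people <- data.frame(
  judge_id  = c(1, 2, 3),
  name_last = c("Abruzzo", "Acheson", "Acker"),
  stringsAsFactors = FALSE
)

# Hypothetical toy data: judge 2 holds three appointments
toy_appointments <- data.frame(
  judge_id   = c(1, 2, 2, 2, 3),
  court_type = c("USDC", "USDC", "USCA", "USCA", "USDC"),
  stringsAsFactors = FALSE
)

# Each appointment row picks up the matching person's details
toy_unified <- inner_join(toy_appointments, toy_people, by = "judge_id")

toy_unified
nrow(toy_unified)   # 5, the same as the number of appointments
```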

Checking Your Join

# This is the smaller dataset, containing 3,532 rows (one per judge)
judges_people %>% count(judge_id)
## # A tibble: 3,532 × 2
##    judge_id     n
##       <int> <int>
##  1        1     1
##  2        2     1
##  3        3     1
##  4        4     1
##  5        5     1
##  6        6     1
##  7        7     1
##  8        8     1
##  9        9     1
## 10       10     1
## # … with 3,522 more rows
# There are 4,202 appointments, however there are only 3532 individual judges 
judges_appointments %>% count(judge_id)
## # A tibble: 3,532 × 2
##    judge_id     n
##       <int> <int>
##  1        1     1
##  2        2     3
##  3        3     1
##  4        4     1
##  5        5     2
##  6        6     1
##  7        7     1
##  8        8     1
##  9        9     3
## 10       10     3
## # … with 3,522 more rows
judges_unified %>% count(judge_id)
## # A tibble: 3,532 × 2
##    judge_id     n
##       <int> <int>
##  1        1     1
##  2        2     3
##  3        3     1
##  4        4     1
##  5        5     2
##  6        6     1
##  7        7     1
##  8        8     1
##  9        9     3
## 10       10     3
## # … with 3,522 more rows
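
Two further checks that may be useful are n_distinct(), which counts unique values, and anti_join(), which returns the rows in the first dataframe that have no match in the second. The sketch below uses made-up data in which judge 4 deliberately has no matching person.

```r
library(dplyr)

# Hypothetical toy data
toy_people       <- data.frame(judge_id = c(1, 2, 3))
toy_appointments <- data.frame(judge_id = c(1, 2, 2, 4))

# How many individual judges appear in the appointments?
n_distinct(toy_appointments$judge_id)   # 3

# Which appointments have no matching person?
unmatched <- anti_join(toy_appointments, toy_people, by = "judge_id")
unmatched
```

An empty result from anti_join() suggests that every row found a match.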

Dates in R

There are a number of different dates included in the judges_unified dataframe. However, none of these variables has the correct class: as glimpse() shows, they are stored as character (<chr>) rather than as dates.

# Date of Nomination
judges_unified$nomination_date %>% head()
## [1] "07/28/2011" "02/03/1936" "01/06/1880" "07/22/1982" "09/28/1979"
## [6] "06/18/1976"
# Date of Confirmation
judges_unified$senate_confirmation_date %>% head()
## [1] "03/22/2012" "02/12/1936" "01/14/1880" "08/18/1982" "10/31/1979"
## [6] "07/02/1976"
# Date of Commission
judges_unified$commission_date %>% head()
## [1] "03/23/2012" "02/15/1936" "01/14/1880" "08/18/1982" "11/02/1979"
## [6] "07/02/1976"
# Date of Termination
judges_unified$termination_date %>% head()
## [1] NA           "05/28/1971" "02/09/1891" NA           "12/02/2009"
## [6] "03/31/1979"

Adjusting the dates

There are a number of different ways to adjust dates. As the data is consistently structured as month/day/year, we can use the mdy() command from the package ‘lubridate’ to make a relatively simple change.

# Create some sample dates 
begin <- c("May 11, 1996", "September 12, 2001", "July 1, 1988")
end <- c("7/8/97","10/23/02","1/4/91")
class(begin)
## [1] "character"
class(end)
## [1] "character"
(begin <- mdy(begin))
## [1] "1996-05-11" "2001-09-12" "1988-07-01"
(end <- mdy(end))
## [1] "1997-07-08" "2002-10-23" "1991-01-04"
class(begin)
## [1] "Date"
class(end)
## [1] "Date"
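
‘lubridate’ provides sibling commands for other orderings, named after the order of day (d), month (m) and year (y). A short sketch, with variable names chosen here for illustration:

```r
library(lubridate)

# The same day written three different ways
us_style    <- mdy("09/12/2022")   # month/day/year
irish_style <- dmy("12/09/2022")   # day/month/year
iso_style   <- ymd("2022-09-12")   # year-month-day

# All three are now Date objects for 12 September 2022
us_style
irish_style
iso_style
```

Choosing the wrong command can silently swap day and month, so it is worth checking a few values by eye after converting.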

Creating Dates variables for the Judges dataset

Use the mdy() command and verify the results with the class() command

mdy(judges_unified$nomination_date) -> judges_unified$nomination_date
class(judges_unified$nomination_date)
## [1] "Date"
mdy(judges_unified$senate_confirmation_date) -> judges_unified$senate_confirmation_date
class(judges_unified$senate_confirmation_date)
## [1] "Date"
mdy(judges_unified$commission_date) -> judges_unified$commission_date
class(judges_unified$commission_date)
## [1] "Date"
mdy(judges_unified$termination_date) -> judges_unified$termination_date
class(judges_unified$termination_date)
## [1] "Date"
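
Converting the columns one at a time works well. An alternative, sketched below on made-up data, is to convert several columns in one step with mutate() and across() from ‘dplyr’. The dataframe toy_judges is hypothetical, and selecting columns with ends_with("_date") is an assumption that suits these column names.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with dates stored as text
toy_judges <- data.frame(
  nomination_date = c("07/28/2011", "02/03/1936"),
  commission_date = c("03/23/2012", "02/15/1936"),
  stringsAsFactors = FALSE
)

# Apply mdy() to every column whose name ends in _date
toy_judges <- toy_judges %>%
  mutate(across(ends_with("_date"), mdy))

# Check the class of each column
sapply(toy_judges, class)
```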

Creating specific date dataframes

First you need to create a new dataframe that provides the number of nominations per day

# This creates a new dataframe; however, you will need to rename the columns
judges_unified %>% count(nomination_date) -> judges_nominations_date
judges_nominations_date
## # A tibble: 2,036 × 2
##    nomination_date     n
##    <date>          <int>
##  1 1789-09-24         13
##  2 1789-09-25          2
##  3 1790-02-08          4
##  4 1790-06-11          1
##  5 1790-07-02          1
##  6 1790-08-02          1
##  7 1790-12-17          2
##  8 1791-03-04          1
##  9 1791-10-31          2
## 10 1792-01-12          1
## # … with 2,026 more rows

Changing Column Names

This is vital as you will create multiple smaller dataframes and need to individualise the column names. This will prevent future errors.

# Rename the columns
colnames(judges_nominations_date) <- c("Date", "Nominations")

# Check your results
judges_nominations_date
## # A tibble: 2,036 × 2
##    Date       Nominations
##    <date>           <int>
##  1 1789-09-24          13
##  2 1789-09-25           2
##  3 1790-02-08           4
##  4 1790-06-11           1
##  5 1790-07-02           1
##  6 1790-08-02           1
##  7 1790-12-17           2
##  8 1791-03-04           1
##  9 1791-10-31           2
## 10 1792-01-12           1
## # … with 2,026 more rows
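
An alternative to colnames() is the rename() command from ‘dplyr’, which names each column explicitly in the form new = old and so does not depend on column order. A minimal sketch on made-up data, where toy_counts is a hypothetical name:

```r
library(dplyr)

# Hypothetical toy data shaped like the output of count()
toy_counts <- data.frame(
  nomination_date = as.Date(c("1789-09-24", "1789-09-25")),
  n               = c(13L, 2L)
)

# rename() takes new_name = old_name
toy_counts <- toy_counts %>%
  rename(Date = nomination_date, Nominations = n)

names(toy_counts)   # "Date" "Nominations"
```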

Using floor_date()

Group nomination dates into years using the floor_date() command from the ‘lubridate’ package. It’s fairly intelligent and can reorganise dates into days, months, years etc.

judges_nominations_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Nominations=sum(Nominations)) -> judges_nominations_yearly

judges_nominations_yearly
## # A tibble: 220 × 2
##    year       No_of_Nominations
##    <date>                 <int>
##  1 1789-01-01                15
##  2 1790-01-01                 9
##  3 1791-01-01                 3
##  4 1792-01-01                 1
##  5 1793-01-01                 2
##  6 1794-01-01                 1
##  7 1795-01-01                 2
##  8 1796-01-01                 5
##  9 1797-01-01                 1
## 10 1798-01-01                 2
## # … with 210 more rows
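
To see what floor_date() does to individual dates, the sketch below rounds three made-up dates down to the start of their month and year. The related year() command, also from ‘lubridate’, returns just the year as a number.

```r
library(lubridate)

# Three sample dates
dates <- as.Date(c("1789-09-24", "1789-09-25", "1790-02-08"))

# Round down to the first day of the month
by_month <- floor_date(dates, "month")
by_month   # "1789-09-01" "1789-09-01" "1790-02-01"

# Round down to the first day of the year
by_year <- floor_date(dates, "year")
by_year    # "1789-01-01" "1789-01-01" "1790-01-01"

# Extract the year as a number
year(dates)   # 1789 1789 1790
```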

Repeat this process for Commission and Termination Date

Create a new dataframe for judges terminations

# This creates a new dataframe; however, you will need to rename the columns
judges_unified %>% count(termination_date) -> judges_terminations_date
judges_terminations_date
## # A tibble: 2,498 × 2
##    termination_date     n
##    <date>           <int>
##  1 1790-05-18           1
##  2 1790-08-16           1
##  3 1790-10-12           1
##  4 1791-03-05           1
##  5 1791-05-09           1
##  6 1792-01-04           1
##  7 1793-01-01           1
##  8 1793-01-16           1
##  9 1794-03-17           1
## 10 1794-06-09           1
## # … with 2,488 more rows

Create a new dataframe for judges commissions

# This creates a new dataframe; however, you will need to rename the columns
judges_unified %>% count(commission_date) -> judges_commissions_date
judges_commissions_date
## # A tibble: 2,066 × 2
##    commission_date     n
##    <date>          <int>
##  1 1789-09-26         12
##  2 1789-09-27          1
##  3 1789-09-29          1
##  4 1789-09-30          1
##  5 1790-02-10          4
##  6 1790-06-14          1
##  7 1790-07-03          1
##  8 1790-08-03          1
##  9 1790-12-20          2
## 10 1791-03-04          1
## # … with 2,056 more rows

# Rename the columns
colnames(judges_terminations_date) <- c("Date", "Terminations")

# Check your results
judges_terminations_date
## # A tibble: 2,498 × 2
##    Date       Terminations
##    <date>            <int>
##  1 1790-05-18            1
##  2 1790-08-16            1
##  3 1790-10-12            1
##  4 1791-03-05            1
##  5 1791-05-09            1
##  6 1792-01-04            1
##  7 1793-01-01            1
##  8 1793-01-16            1
##  9 1794-03-17            1
## 10 1794-06-09            1
## # … with 2,488 more rows

# Rename the columns
colnames(judges_commissions_date) <- c("Date", "Commissions")

# Check your results
judges_commissions_date
## # A tibble: 2,066 × 2
##    Date       Commissions
##    <date>           <int>
##  1 1789-09-26          12
##  2 1789-09-27           1
##  3 1789-09-29           1
##  4 1789-09-30           1
##  5 1790-02-10           4
##  6 1790-06-14           1
##  7 1790-07-03           1
##  8 1790-08-03           1
##  9 1790-12-20           2
## 10 1791-03-04           1
## # … with 2,056 more rows
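
As an alternative to renaming with colnames() afterwards, the counting and renaming can be done in a single pipeline. The sketch below is one possible approach, not the method used in this lecture: it relies on the name argument of count() and on rename() from ‘dplyr’, and it runs on a small made-up tibble (toy_judges) rather than on judges_unified.

```r
library(dplyr)

# Made-up stand-in for judges_unified (assumption: one row per judge)
toy_judges <- tibble(
  commission_date = as.Date(c("1789-09-26", "1789-09-26", "1789-09-27"))
)

toy_commissions <- toy_judges %>%
  count(commission_date, name = "Commissions") %>%  # name the count column directly
  rename(Date = commission_date)                    # new name on the left, old name on the right

toy_commissions
```

Renaming by column name rather than by position means the code keeps working if the column order ever changes.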

judges_terminations_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Terminations=sum(Terminations)) -> judges_terminations_yearly

judges_terminations_yearly
## # A tibble: 222 × 2
##    year       No_of_Terminations
##    <date>                  <int>
##  1 1790-01-01                  3
##  2 1791-01-01                  2
##  3 1792-01-01                  1
##  4 1793-01-01                  2
##  5 1794-01-01                  2
##  6 1795-01-01                  4
##  7 1796-01-01                  3
##  8 1797-01-01                  1
##  9 1798-01-01                  2
## 10 1799-01-01                  2
## # … with 212 more rows

judges_commissions_date %>% group_by(year=floor_date(Date, "year")) %>% summarize(No_of_Commissions=sum(Commissions)) -> judges_commissions_yearly

judges_commissions_yearly
## # A tibble: 219 × 2
##    year       No_of_Commissions
##    <date>                 <int>
##  1 1789-01-01                15
##  2 1790-01-01                 9
##  3 1791-01-01                 3
##  4 1792-01-01                 1
##  5 1793-01-01                 1
##  6 1794-01-01                 3
##  7 1795-01-01                 1
##  8 1796-01-01                 4
##  9 1797-01-01                 3
## 10 1798-01-01                 2
## # … with 209 more rows
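
Because the same count-then-group steps were repeated for nominations, terminations and commissions, they could be wrapped in a small function. The following is a minimal sketch under stated assumptions: count_by_year() and toy_judges are hypothetical names introduced here, and the helper drops missing dates before counting, which the pipelines above do not do.

```r
library(dplyr)
library(lubridate)

# Made-up stand-in for judges_unified, including one missing commission date
toy_judges <- tibble(
  commission_date  = as.Date(c("1789-09-26", "1789-09-27", "1790-02-10", NA)),
  termination_date = as.Date(c("1790-05-18", "1791-03-05", "1791-05-09", "1791-07-01"))
)

# Hypothetical helper: number of events per calendar year for one date column
count_by_year <- function(data, date_col) {
  data %>%
    filter(!is.na({{ date_col }})) %>%   # drop rows with no date (a choice made in this sketch)
    count(year = floor_date({{ date_col }}, "year"), name = "n_events")
}

commissions_yearly  <- count_by_year(toy_judges, commission_date)
terminations_yearly <- count_by_year(toy_judges, termination_date)

commissions_yearly
terminations_yearly
```

The double curly braces pass an unquoted column name through to the ‘dplyr’ verbs, so the same function can be reused for any date column.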

Explore the new dataframes

Using ggplot2, we can display the findings as a line chart.

judges_commissions_yearly %>% 
ggplot(aes(x=year, y=No_of_Commissions)) + 
geom_line(size = 0.8)
## Warning: Removed 1 row(s) containing missing values (geom_path).
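
The warning above most likely means that one row of the yearly table has a missing (NA) year or count, which ggplot2 drops before drawing the line. Assuming that is the cause, the row can be removed explicitly before plotting. The sketch below uses a made-up tibble (toy_yearly) rather than judges_commissions_yearly.

```r
library(dplyr)
library(ggplot2)

# Made-up yearly counts with one missing year (illustrative only)
toy_yearly <- tibble(
  year = as.Date(c("1789-01-01", "1790-01-01", "1791-01-01", NA)),
  No_of_Commissions = c(15L, 9L, 3L, 2L)
)

# Keep only the rows that have a year
toy_complete <- toy_yearly %>% filter(!is.na(year))

# Build the same kind of line chart from the complete rows
p <- ggplot(toy_complete, aes(x = year, y = No_of_Commissions)) +
  geom_line()

p
```

Filtering first makes the decision to exclude undated records visible in the code, rather than leaving it to a warning.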

Advanced ggplot2

Note how adding labels, a date scale, and a theme to the basic plot produces a much more effective graph.

judges_commissions_yearly %>% 
ggplot(aes(x=year, y=No_of_Commissions)) + 
geom_line(size = 0.8) + 
labs(title = "Judicial Commissions - 1789-2014",
       tag = "Figure 1", 
       x = "Year",
       y = "No.") +
scale_x_date(date_breaks = "65 years", date_labels = "%Y") +
theme_classic() +
theme(axis.text.x = element_text(colour = "darkslategrey", size = 16), 
      axis.text.y = element_text(colour = "darkslategrey", size = 16),
      legend.background = element_rect(fill = "white", size = 4, colour = "white"),
      legend.justification = c(0, 1),
      legend.position = c(0.9, 1),
      text = element_text(family = "Georgia"),
      plot.title = element_text(size = 18, margin = margin(b = 10)),
      plot.subtitle = element_text(size = 12, color = "darkslategrey", margin = margin(b = 25)),
      plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
## Warning: Removed 1 row(s) containing missing values (geom_path).