Remember to set your working directory

Recall that you set the working directory to this folder by selecting from the top menu Session>Set Working Directory>Choose Directory. Then select the folder where you are keeping all your datasets for this class. By doing this, we can load the files in this directory without having to refer to the full path for their location, and anything we save will be written to this folder.

Once you set your working directory for where you will access your stored data, load the “tidyverse” and “psych” package, then read in the data.

getwd()
## [1] "C:/Users/rsaidi/Dropbox/Rachel/MontColl/DATA110/Notes"
setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
library(tidyverse)
## -- Attaching packages ---------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(psych)
## Warning: package 'psych' was built under R version 3.6.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
pfizer <- read_csv("pfizer.csv")
## Parsed with column specification:
## cols(
##   org_indiv = col_character(),
##   first_plus = col_character(),
##   first_name = col_character(),
##   last_name = col_character(),
##   city = col_character(),
##   state = col_character(),
##   category = col_character(),
##   cash = col_double(),
##   other = col_double(),
##   total = col_double()
## )
fda <- read_csv("fda.csv")
## Parsed with column specification:
## cols(
##   name_last = col_character(),
##   name_first = col_character(),
##   name_middle = col_character(),
##   issued = col_character(),
##   office = col_character()
## )

Notice that the Environment now contains two objects, of the type tbl_df.

Comment your code

Anything that appears on a line after # will be treated as a comment, and will be ignored when the code is run. Get into the habit of commenting your code: Don’t trust yourself to remember what it does! Data journalism should ideally be fully documented and reproducible.

Update your R packages

Each time you start R, it’s a good idea to click on Update in the Packages panel to update all your installed packages to the latest versions. Installing a package makes it available to you, but to use it in any R session you need to load it. You can do this by checking its box in the Packages tab. However, we will enter the following code into our script, then highlight these lines of code and run them:

Manipulate the pfizer and fda data

Recall that the pfizer data contains information about Pfizer payments to doctors and warning letters sent by food and drug administration

Examine the data

We can View data at any time by clicking on its table icon in the Environment tab in the Grid view.

Alternatively, you can use code we learned in the last unit. Click on the console output in each square that comes up after your chunk of code.

str(pfizer)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 10087 obs. of  10 variables:
##  $ org_indiv : chr  "3-D MEDICAL SERVICES LLC" "AA DOCTORS, INC." "ABBO, LILIAN MARGARITA" "ABBO, LILIAN MARGARITA" ...
##  $ first_plus: chr  "STEVEN BRUCE" "AAKASH MOHAN" "LILIAN MARGARITA" "LILIAN MARGARITA" ...
##  $ first_name: chr  "STEVEN" "AAKASH" "LILIAN" "LILIAN" ...
##  $ last_name : chr  "DEITELZWEIG" "AHUJA" "ABBO" "ABBO" ...
##  $ city      : chr  "NEW ORLEANS" "PASO ROBLES" "MIAMI" "MIAMI" ...
##  $ state     : chr  "LA" "CA" "FL" "FL" ...
##  $ category  : chr  "Professional Advising" "Expert-Led Forums" "Business Related Travel" "Meals" ...
##  $ cash      : num  2625 1000 0 0 1800 ...
##  $ other     : num  0 0 448 119 0 0 47 0 0 396 ...
##  $ total     : num  2625 1000 448 119 1800 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   org_indiv = col_character(),
##   ..   first_plus = col_character(),
##   ..   first_name = col_character(),
##   ..   last_name = col_character(),
##   ..   city = col_character(),
##   ..   state = col_character(),
##   ..   category = col_character(),
##   ..   cash = col_double(),
##   ..   other = col_double(),
##   ..   total = col_double()
##   .. )
head(pfizer)
## # A tibble: 6 x 10
##   org_indiv first_plus first_name last_name city  state category  cash other
##   <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>    <dbl> <dbl>
## 1 3-D MEDI~ STEVEN BR~ STEVEN     DEITELZW~ NEW ~ LA    Profess~  2625     0
## 2 AA DOCTO~ AAKASH MO~ AAKASH     AHUJA     PASO~ CA    Expert-~  1000     0
## 3 ABBO, LI~ LILIAN MA~ LILIAN     ABBO      MIAMI FL    Busines~     0   448
## 4 ABBO, LI~ LILIAN MA~ LILIAN     ABBO      MIAMI FL    Meals        0   119
## 5 ABBO, LI~ LILIAN MA~ LILIAN     ABBO      MIAMI FL    Profess~  1800     0
## 6 ABDULLAH~ ABDULLAH   ABDULLAH   RAFFEE    FLINT MI    Expert-~   750     0
## # ... with 1 more variable: total <dbl>
describe(pfizer)
##             vars     n    mean       sd median trimmed    mad min     max
## org_indiv*     1 10087     NaN       NA     NA     NaN     NA Inf    -Inf
## first_plus*    2  9884     NaN       NA     NA     NaN     NA Inf    -Inf
## first_name*    3  9884     NaN       NA     NA     NaN     NA Inf    -Inf
## last_name*     4 10087     NaN       NA     NA     NaN     NA Inf    -Inf
## city*          5 10087     NaN       NA     NA     NaN     NA Inf    -Inf
## state*         6 10087     NaN       NA     NA     NaN     NA Inf    -Inf
## category*      7 10086     NaN       NA     NA     NaN     NA Inf    -Inf
## cash           8 10086 3241.12 21815.80      0  814.66   0.00   0 1185466
## other          9 10084  266.47   861.06     41  121.95  60.79   0   27681
## total         10 10087 3506.57 21792.20    750 1111.64 947.38   0 1185466
##               range  skew kurtosis     se
## org_indiv*     -Inf    NA       NA     NA
## first_plus*    -Inf    NA       NA     NA
## first_name*    -Inf    NA       NA     NA
## last_name*     -Inf    NA       NA     NA
## city*          -Inf    NA       NA     NA
## state*         -Inf    NA       NA     NA
## category*      -Inf    NA       NA     NA
## cash        1185466 30.20  1280.49 217.23
## other         27681 12.27   234.62   8.57
## total       1185466 30.26  1284.45 216.98

You do not need the “describe” command for the fda data, because there are no continuous variables.

str(fda)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 272 obs. of  5 variables:
##  $ name_last  : chr  "ADELGLASS" "ADKINSON" "ALLEN" "AMSTERDAM" ...
##  $ name_first : chr  "JEFFREY" "N." "MARK" "DANIEL" ...
##  $ name_middle: chr  "M." "FRANKLIN" "S." NA ...
##  $ issued     : chr  "5/25/1999" "4/19/2000" "1/28/2002" "11/17/2004" ...
##  $ office     : chr  "Center for Drug Evaluation and Research" "Center for Biologics Evaluation and Research" "Center for Devices and Radiological Health" "Center for Biologics Evaluation and Research" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name_last = col_character(),
##   ..   name_first = col_character(),
##   ..   name_middle = col_character(),
##   ..   issued = col_character(),
##   ..   office = col_character()
##   .. )
head(fda)
## # A tibble: 6 x 5
##   name_last name_first name_middle issued     office                            
##   <chr>     <chr>      <chr>       <chr>      <chr>                             
## 1 ADELGLASS JEFFREY    M.          5/25/1999  Center for Drug Evaluation and Re~
## 2 ADKINSON  N.         FRANKLIN    4/19/2000  Center for Biologics Evaluation a~
## 3 ALLEN     MARK       S.          1/28/2002  Center for Devices and Radiologic~
## 4 AMSTERDAM DANIEL     <NA>        11/17/2004 Center for Biologics Evaluation a~
## 5 AMSTUTZ   HARLAN     C.          7/19/2004  Center for Devices and Radiologic~
## 6 ANDERSON  C.         JOSEPH      2/25/2000  Center for Devices and Radiologic~

Notice that issued has been recognized as a Date variable. Other common data types include num, for numbers that may contain decimals and POSIXct for full date and time.

To specify an individual column use the name of the data frame and the column name, separated by $. Type this into your script and run:

The output will be the first 10,000 values for that column.

If you need to change the data type for any column, use the following functions:

. as.character converts to a text string

. as.numeric converts to a number

. as.factor converts to a categorical variable

. as.integer converts to an integer

. as.Date converts to a date

. as.POSIXct convets to a full date and time

(Conversions to full dates and times can get complicated, because of timezones. Now add the following code to your script to convert total in the pfizer data to a numeric variable (which would allow it to hold decimal values, if we had any).

pfizer$total <- as.numeric(pfizer$total)  # convert total to numeric variable
str(pfizer)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 10087 obs. of  10 variables:
##  $ org_indiv : chr  "3-D MEDICAL SERVICES LLC" "AA DOCTORS, INC." "ABBO, LILIAN MARGARITA" "ABBO, LILIAN MARGARITA" ...
##  $ first_plus: chr  "STEVEN BRUCE" "AAKASH MOHAN" "LILIAN MARGARITA" "LILIAN MARGARITA" ...
##  $ first_name: chr  "STEVEN" "AAKASH" "LILIAN" "LILIAN" ...
##  $ last_name : chr  "DEITELZWEIG" "AHUJA" "ABBO" "ABBO" ...
##  $ city      : chr  "NEW ORLEANS" "PASO ROBLES" "MIAMI" "MIAMI" ...
##  $ state     : chr  "LA" "CA" "FL" "FL" ...
##  $ category  : chr  "Professional Advising" "Expert-Led Forums" "Business Related Travel" "Meals" ...
##  $ cash      : num  2625 1000 0 0 1800 ...
##  $ other     : num  0 0 448 119 0 0 47 0 0 396 ...
##  $ total     : num  2625 1000 448 119 1800 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   org_indiv = col_character(),
##   ..   first_plus = col_character(),
##   ..   first_name = col_character(),
##   ..   last_name = col_character(),
##   ..   city = col_character(),
##   ..   state = col_character(),
##   ..   category = col_character(),
##   ..   cash = col_double(),
##   ..   other = col_double(),
##   ..   total = col_double()
##   .. )

The summary function will run a quick statistical summary of a data frame, calculating mean, median and quartile values for continuous variables:

summary(pfizer)  # summary of pfizer data
##   org_indiv          first_plus         first_name         last_name        
##  Length:10087       Length:10087       Length:10087       Length:10087      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      city              state             category              cash        
##  Length:10087       Length:10087       Length:10087       Min.   :      0  
##  Class :character   Class :character   Class :character   1st Qu.:      0  
##  Mode  :character   Mode  :character   Mode  :character   Median :      0  
##                                                           Mean   :   3241  
##                                                           3rd Qu.:   2000  
##                                                           Max.   :1185466  
##                                                           NA's   :1        
##      other             total        
##  Min.   :    0.0   Min.   :      0  
##  1st Qu.:    0.0   1st Qu.:    191  
##  Median :   41.0   Median :    750  
##  Mean   :  266.5   Mean   :   3507  
##  3rd Qu.:  262.0   3rd Qu.:   2000  
##  Max.   :27681.0   Max.   :1185466  
##  NA's   :3

Manipulate and analyze data

Now we will use dplyr to manipulate the data, using operations and functions:

. Sort: Largest to smallest, oldest to newest, alphabetical etc.

. Filter: Select a defined subset of the data.

. Summarize/Aggregate: Deriving one value from a series of other values to produce a summary statistic. Examples include: count, sum, mean, median, maximum, minimum etc. Often you’ll group data into categories first, and then aggregate by group.

. Join: Merging entries from two or more datasets based on common field(s), e.g. unique ID number, last name and first name.

. select - Choose which columns to include.

. filter - Filter the data.

. arrange - Sort the data, by size for continuous variables, by date, or alphabetically.

. group_by - Group the data by a categorical variable.

. mutate - Create new column(s) in the data, or change existing column(s).

. rename - Rename column(s).

. bind_rows - Merge two data frames into one, combining data from columns with the same name.

. summarize - Summarize, or aggregate (for each group if following group_by). Often used in conjunction with functions including:

o mean Calculate the mean, or average

o median Calculate the median

o max Find the maximum value

o min Find the minimum value

o sum Add all the values together

o n Count the number of records

There are also various functions to join data, which we will explore below.

Filter and sort data

Filter and sort the data in specific ways. For each of the following examples, copy the code that follows into your script, and view the results. Notice how we create a new objects to hold the processed data.

Find doctors in California paid $10,000 or more by Pfizer to run “Expert-Led Forums.”

# doctors in California who were paid $10,000 or more by Pfizer to run "Expert-Led Forums."
ca_expert_10000 <- pfizer %>%
  filter(state == "CA" & total >= 10000 & category == "Expert-Led Forums")

Notice the use of == to find values that match the specified text, >= for greater than or equal to, and the Boolean operator &.

Now add a sort to the end of the code to list the doctors in descending order by the payments received:

# doctors in California who were paid $10,000 or more by Pfizer to run "Expert-Led Forums."
ca_expert_10000 <- pfizer %>%
  filter(state == "CA" & total >= 10000 & category == "Expert-Led Forums") %>%
  arrange(desc(total))

If you arrange without the desc function, the sort will be from smallest to largest.

# Find doctors in California or New York who were paid $10,000 or more by Pfizer to run "Expert-Led Forums."
ca_ny_expert_10000 <- pfizer %>%
  filter((state == "CA" | state == "NY") & total >= 10000 & category == "Expert-Led Forums") %>%
  arrange(desc(total))

Notice the use of the | Boolean operator, and the brackets around that part of the query. This ensures that this part of the query is run first. See what happens if you exclude them.

# Find doctors in states other than California who were paid $10,000 or more by Pfizer to run "Expert-Led Forums."
not_ca_expert_10000 <- pfizer %>%
  filter(state != "CA" & total >= 10000 & category=="Expert-Led Forums") %>%
  arrange(desc(total))

Notice the use of the != operator to exclude doctors in California.

# Find the 20 doctors across the four largest states (CA, TX, FL, NY) who were paid the most for professional advice.
ca_ny_tx_fl_prof_top20 <- pfizer %>%
  filter((state=="CA" | state == "NY" | state == "TX" | state == "FL") &     category == "Professional Advising") %>%
  arrange(desc(total)) %>%
  head(20)

Notice the use of head, which grabs a defined number of rows from the start of a data frame. Here, it is crucial to run the sort first! See what happens if you change the order of the last two lines.

Filter the data for all payments for running Expert-Led Forums or for Professional Advising, and arrange alphabetically by doctor (last name, then first name)

# Filter the data for all payments for running Expert-Led Forums or for Professional Advising, and arrange alphabetically by doctor (last name, then first name)
expert_advice <- pfizer %>%
  filter(category == "Expert-Led Forums" | category == "Professional Advising") %>%
  arrange(last_name, first_name) %>%
  head(20)
expert_advice
## # A tibble: 20 x 10
##    org_indiv first_plus first_name last_name city  state category  cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>    <dbl> <dbl>
##  1 ABBO, LI~ LILIAN MA~ LILIAN     ABBO      MIAMI FL    Profess~  1800     0
##  2 ABEBE, S~ SHEILA Y   SHEILA     ABEBE     INDI~ IN    Expert-~   825     0
##  3 NEW YORK~ JUDITH ANN JUDITH     ABERG     NEW ~ NY    Profess~  1750     0
##  4 ABOLNIK,~ IGOR Z     IGOR       ABOLNIK   PROVO UT    Expert-~  1750     0
##  5 ABRAKSIA~ SAMIR      SAMIR      ABRAKSIA  BEAC~ OH    Expert-~  2000     0
##  6 ABRAKSIA~ SAMIR      SAMIR      ABRAKSIA  BEAC~ OH    Profess~  2500     0
##  7 ABRAMSON~ STEVEN BA~ STEVEN     ABRAMSON  NEW ~ NY    Profess~  4400     0
##  8 ABUZZAHA~ FARUK S    FARUK      ABUZZAHAB MINN~ MN    Profess~  1750     0
##  9 ABUZZAHA~ MARY JENN~ MARY       ABUZZAHAB SAIN~ MN    Expert-~  1000     0
## 10 ACCACHA,~ SIHAM DON~ SIHAM      ACCACHA   MINE~ NY    Expert-~  1250     0
## 11 ACEVEDO ~ IRIS ARLE~ IRIS       ACEVEDO ~ CAGU~ PR    Expert-~   750     0
## 12 ACKERMAN~ IVAN FOST~ IVAN       ACKERMAN  BRAN~ FL    Expert-~  1250     0
## 13 PAIN MED~ WILLIAM E~ WILLIAM    ACKERMAN  LITT~ AR    Expert-~  1000     0
## 14 ACOSTA, ~ LUIS SILV~ LUIS       ACOSTA    HOUS~ TX    Expert-~  1000     0
## 15 ADAMS, S~ SANDRA GA~ SANDRA     ADAMS     SAN ~ TX    Profess~ 12840     0
## 16 ADDONA, ~ TOMMASO    TOMMASO    ADDONA    NEW ~ NY    Expert-~   750     0
## 17 HEALTH R~ ALEX ASIE~ ALEX       ADJEI     BUFF~ NY    Profess~  2000     0
## 18 ADLER, D~ DAVID ELL~ DAVID      ADLER     PORT~ OR    Profess~    71     0
## 19 ADLER, J~ JEREMY A   JEREMY     ADLER     ENCI~ CA    Expert-~   850     0
## 20 ADMANI, ~ ARIFF      ARIFF      ADMANI    PARA~ NJ    Expert-~  2000     0
## # ... with 1 more variable: total <dbl>

Notice that you can sort by multiple variables, separated by commas. Use pattern matching to filter text.

The following code uses the grepl function to find values containing a particular string of text. This can simplify the code used to filter based on text.

# use pattern matching with grepl to filter text
expert_advice <- pfizer %>%
  filter(grepl("Expert|Professional", category)) %>%
  arrange(last_name, first_name)

not_expert_advice <- pfizer %>%
  filter(!grepl("Expert|Professional", category)) %>%
  arrange(last_name, first_name)

This code differs only by the ! Boolean operator. Notice that it has split the data into two, based on categories of payment.

Append one data frame to another.

The following code uses the bind_rows function to append one data frame to another, here recreating the unfiltered data from the two data frames above.

# merge/append data frames
pfizer2 <- bind_rows(expert_advice, not_expert_advice)

Write data to a CSV file

readr can write data to CSV and other text files.

# write expert_advice data to a csv file
write_csv(expert_advice, "expert_advice.csv", na="")

na="" ensures that any empty cells in the data frame are saved as blanks - R represents null values as NA, so if you don’t include this, any null values will appear as NA in the saved file.

Group and summarize data

Calculate the total payments, by state

# calculate total payments by state
state_sum <- pfizer %>%
  group_by(state) %>%
  summarize(sum = sum(total)) %>%
  arrange(desc(sum))

Notice the use of group_by followed by summarize to group and summarize data, here using the function sum.

Calculate some additional summary statistics, by state

# As above, but for each state also calculate the median payment, and the number of payments
state_summary <- pfizer %>%
  group_by(state) %>%
  summarize(sum = sum(total), median = median(total), count = n()) %>%
  arrange(desc(sum))

Notice the use of multiple summary functions, sum, median, and n. (You don’t specify a variable for n because it is simply counting the number of rows in the data.)

Group and summarize for multiple categories

# as above, but group by state and category
state_category_summary <- pfizer %>%
  group_by(state, category) %>%
  summarize(sum = sum(total), median = median(total), count = n()) %>%
  arrange(state, category)

As for arrange, you can group_by by multiple variables, separated by commas.

Working with dates

Now let’s see how to work with dates, using the FDA warning letters data.

Filter the data for letters sent from the start of 2005 onwards. FDA sent warning letters from the start of 2005 onwards

You will have to fix “issued”" to be read as a date. If you look back at str(fda), it was read in as a chr (character). To coerce it to be a date, use the command,

fda$issued <- as.Date(fda$issued, "%m/%d/%Y")

Check it out by running. Alternatively use the package “lubridate”

#install.packages("lubridate")
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
fda$issued <- ymd(fda$issued)
fda$issued
##   [1] "1999-05-25" "2000-04-19" "2002-01-28" "2004-11-17" "2004-07-19"
##   [6] "2000-02-25" "2000-07-19" "2002-10-30" "2004-01-21" "2000-01-14"
##  [11] "2006-11-08" "2000-11-30" "2002-09-27" "2004-06-01" "2004-06-01"
##  [16] "2000-08-02" "1997-07-30" "1999-04-30" "2004-08-31" "2005-06-07"
##  [21] "1998-11-06" "2008-03-20" "2009-01-14" "2001-07-25" "2006-03-27"
##  [26] "1997-11-21" "2009-04-09" "2000-11-30" "2009-11-24" "2009-03-30"
##  [31] "2003-06-25" "2003-12-11" "2002-06-11" "2001-12-11" "2007-10-26"
##  [36] "2007-03-29" "2009-02-02" "2010-03-17" "2004-06-01" "2002-01-02"
##  [41] "2005-05-16" "2000-03-29" "2004-12-10" "2005-05-26" "2004-11-17"
##  [46] "2006-11-22" "2008-05-28" "2009-03-02" "1999-07-22" "2005-06-13"
##  [51] "2002-06-12" "2003-06-19" "2006-06-16" "2004-07-14" "2002-09-23"
##  [56] "2000-05-24" "2009-11-09" "2001-03-09" "1997-10-07" "2008-03-19"
##  [61] "1997-06-24" "1997-02-03" "2010-04-01" "2000-03-23" "2009-03-02"
##  [66] "2006-04-21" "1998-03-20" "2003-03-17" "2004-05-14" "2005-12-21"
##  [71] "2000-04-07" "1999-11-17" "1998-02-09" "1997-08-14" "2001-12-04"
##  [76] "2004-02-05" "2001-07-27" "2004-06-14" "2006-08-22" "1999-05-19"
##  [81] "2003-10-09" "2009-05-20" "2001-12-21" "2006-03-21" "2002-03-21"
##  [86] "2003-07-25" "2007-01-24" "2003-09-29" "2009-04-20" "2002-10-04"
##  [91] "2008-09-03" "2009-11-24" "2003-07-08" "2003-07-25" "1998-10-14"
##  [96] "2008-05-01" "2006-07-06" "2009-06-26" "2008-04-21" "2005-08-23"
## [101] "2006-03-06" "2006-02-24" "2005-04-11" "2002-02-13" "2001-11-15"
## [106] "2003-04-14" "1997-10-09" "1997-10-10" "2002-01-16" "2002-06-05"
## [111] "2001-07-06" "2009-05-19" "2002-09-12" "2006-07-10" "2008-10-01"
## [116] "2005-06-06" "2004-01-16" "2004-01-16" "2002-09-27" "2008-05-30"
## [121] "2008-07-23" "2000-04-28" "2001-02-22" "2003-06-24" "1998-12-17"
## [126] "2001-04-27" "2004-08-02" "2005-06-10" "2005-10-27" "2009-07-01"
## [131] "2001-02-23" "2007-01-12" "1999-11-24" "2005-06-23" "2004-07-30"
## [136] "2003-11-07" "2004-07-26" "1999-03-01" "2009-09-15" "1997-06-11"
## [141] "2004-12-15" "2005-10-18" "2000-06-21" "2005-05-16" "2001-07-31"
## [146] "1997-10-03" "2006-11-03" "2006-10-12" "2000-09-27" "2000-04-17"
## [151] "2010-03-08" "2007-07-03" "1999-09-21" "2009-06-12" "2008-10-06"
## [156] "2005-05-13" "2000-05-24" "2004-07-28" "2005-01-27" "2005-07-06"
## [161] "2005-07-06" "2004-07-19" "2000-09-20" "1997-01-10" "2009-10-23"
## [166] "2004-07-02" "1998-08-04" "1999-09-28" "2000-08-15" "1999-08-06"
## [171] "2006-10-03" "2002-11-27" "2004-11-22" "2008-05-16" "2005-10-07"
## [176] "2010-01-28" "2008-04-23" "2001-07-06" "2005-02-24" "2005-02-24"
## [181] "1998-10-14" "2007-01-08" "2001-05-23" "2006-08-18" "2005-09-29"
## [186] "2001-08-30" "2009-03-11" "2008-12-01" "2004-07-13" "2003-06-17"
## [191] "2008-06-06" "2004-11-05" "2009-03-03" "1997-10-14" "2008-01-09"
## [196] "1999-10-27" "2000-11-30" "2007-10-26" "2004-03-25" "2008-11-13"
## [201] "2009-05-20" "2004-10-04" "2006-12-04" "2006-12-04" "2001-08-02"
## [206] "2000-07-18" "2005-04-11" "2005-10-03" "2005-10-11" "2001-04-03"
## [211] "2008-11-13" "2008-11-13" "2001-08-30" "2008-07-23" "2006-11-07"
## [216] "2001-02-23" "2008-03-19" "2006-12-20" "2004-05-13" "2002-03-26"
## [221] "2004-05-27" "2004-03-25" "2003-01-23" "1999-10-04" "2004-07-13"
## [226] "2004-09-10" "2003-05-27" "2006-02-20" "2000-07-19" "2006-06-28"
## [231] "2009-01-03" "2006-07-06" "2003-06-18" "2009-01-21" "2010-02-04"
## [236] "2003-03-17" "2002-08-07" "2005-02-15" "2000-06-20" "2006-07-14"
## [241] "2003-10-23" "2005-12-19" "2007-09-04" "2003-03-31" "2000-04-03"
## [246] "2007-07-27" "1997-10-02" "2007-05-30" "1999-04-09" "1996-11-19"
## [251] "1997-11-20" "2009-12-03" "2005-10-14" "2005-03-29" "2007-12-05"
## [256] "2008-09-05" "2008-03-06" "2005-03-16" "2006-02-07" "2000-01-28"
## [261] "1998-07-07" "2005-05-01" "2003-04-08" "2006-09-28" "1997-11-04"
## [266] "2000-06-15" "2009-02-18" "2005-04-11" "2003-07-30" "1999-03-26"
## [271] "2008-08-27" "2009-06-16"
str(fda)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 272 obs. of  5 variables:
##  $ name_last  : chr  "ADELGLASS" "ADKINSON" "ALLEN" "AMSTERDAM" ...
##  $ name_first : chr  "JEFFREY" "N." "MARK" "DANIEL" ...
##  $ name_middle: chr  "M." "FRANKLIN" "S." NA ...
##  $ issued     : Date, format: "1999-05-25" "2000-04-19" ...
##  $ office     : chr  "Center for Drug Evaluation and Research" "Center for Biologics Evaluation and Research" "Center for Devices and Radiological Health" "Center for Biologics Evaluation and Research" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name_last = col_character(),
##   ..   name_first = col_character(),
##   ..   name_middle = col_character(),
##   ..   issued = col_character(),
##   ..   office = col_character()
##   .. )
post2005 <- fda %>%
  filter(issued >= "2005-01-01") %>%
  arrange(issued)

Notice that operators like >= can be used for dates, as well as for numbers.

Count the number of letters issued by year

# count the letters by year
letters_year <- fda %>%
  mutate(year = format(issued, "%Y")) %>%
  group_by(year) %>%
  summarize(letters=n())

This code introduces dplyr’s mutate function to create a new column in the data. The new variable year is the four-digit year "%Y (see here for more on time and date formats in R), extracted from the issued dates using the format function. Then the code groups by year and counts the number of letters for each one.

Add columns giving the number of days and weeks that have elapsed since each letter was sent

# add new columns showing many days and weeks elapsed since each letter was sent
fda <- fda %>%
  mutate(days_elapsed = Sys.Date() - issued,
          weeks_elapsed = difftime(Sys.Date(), issued, units = "weeks"))

Notice in the first line that this code changes the fda data frame, rather than creating a new object. The function Sys.Date returns the current date, and if you subtract another date, it will calculate the difference in days. To calculate date and time differences using other units, use the difftime function.

Notice also that you can mutate multiple columns at one go, separated by commas.

Join data from two data frames

Here is an animation for the different types of joins: https://github.com/gadenbuie/tidyexplain

There are a number of join functions in dplyr to combine data from two data frames. Here are the most useful:

. inner_join() returns values from both tables only where there is a match

. left_join() returns all the values from the first-mentioned table, plus those from the second table that match

. semi_join() filters the first-mentioned table to include only values that have matches in the second table

. anti_join() filters the first-mentioned table to include only values that have no matches in the second table.

To illustrate, these joins will find doctors paid by Pfizer to run expert led forums who had also received a warning letter from the FDA:

# join to identify doctors paid to run Expert-led forums who also received a warning letter

expert_warned_inner <- inner_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
  filter(category=="Expert-Led Forums")

expert_warned_semi <- semi_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
  filter(category=="Expert-Led Forums")

The code in by=c() defines how the join should be made. If instructions on how to join the tables are not supplied, dplyr will look for columns with matching names, and perform the join based on those.

The difference between the two joins above is that the first contains all of the columns from both data frames, while the second gives only columns from the pfizer data frame.

In practice, you may wish to inner_join and then use dplyr’s select function to select the columns that you want to retain, for example:

# as above, but select desired columns from data

expert_warned <- inner_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
  filter(category=="Expert-Led Forums") %>%
  select(first_plus, last_name, city, state, total, issued)

expert_warned <- inner_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
  filter(category=="Expert-Led Forums") %>%
  select(2:5,10,12)

Notice that you can select by columns’ names, or by their positions, where 1 is the first column, 3 is the third, and so on.