setwd("C:/Users/misst/OneDrive/Documents")Data Journalism
Dataset
The data we will use today could be found in this google drive
Load the “tidyverse” then read in the data
library(tidyverse)
setwd("C:/Users/misst/OneDrive/Documents")
pfizer <- read_csv("pfizer.csv")
fda <- read_csv("fda.csv")Notice that the Environment now contains two objects, of the type tbl_df.
Update your R packages
Each time you start R, it’s a good idea to click on Update in the Packages panel to update all your installed packages to the latest versions. Installing a package makes it available to you, but to use it in any R session you need to load it. You can do this by checking its box in the Packages tab. However, we will enter the following code into our script, then highlight these lines of code and run them:
Manipulate the pfizer and fda data
Recall that the pfizer dataset contains information about Pfizer payments to doctors and warning letters sent by food and drug administration
Examine the data
We can View data at any time by clicking on its table icon in the Environment tab in the Grid view.
Alternatively, you can use code we learned in the last unit - head(data). Notice the variable names and types.
head(pfizer)# A tibble: 6 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 3-D MEDICAL … STEVEN BR… STEVEN DEITELZW… NEW … LA Profess… 2625 0
2 AA DOCTORS, … AAKASH MO… AAKASH AHUJA PASO… CA Expert-… 1000 0
3 ABBO, LILIAN… LILIAN MA… LILIAN ABBO MIAMI FL Busines… 0 448
4 ABBO, LILIAN… LILIAN MA… LILIAN ABBO MIAMI FL Meals 0 119
5 ABBO, LILIAN… LILIAN MA… LILIAN ABBO MIAMI FL Profess… 1800 0
6 ABDULLAH RAF… ABDULLAH ABDULLAH RAFFEE FLINT MI Expert-… 750 0
# ℹ 1 more variable: total <dbl>
Now view the fda data
head(fda)# A tibble: 6 × 5
name_last name_first name_middle issued office
<chr> <chr> <chr> <chr> <chr>
1 ADELGLASS JEFFREY M. 5/25/1999 Center for Drug Evaluation and Re…
2 ADKINSON N. FRANKLIN 4/19/2000 Center for Biologics Evaluation a…
3 ALLEN MARK S. 1/28/2002 Center for Devices and Radiologic…
4 AMSTERDAM DANIEL <NA> 11/17/2004 Center for Biologics Evaluation a…
5 AMSTUTZ HARLAN C. 7/19/2004 Center for Devices and Radiologic…
6 ANDERSON C. JOSEPH 2/25/2000 Center for Devices and Radiologic…
Notice that issued has been recognized as a Date variable. Other common data types include num, for numbers that may contain decimals and POSIXct for full date and time.
To specify an individual column use the name of the data frame and the column name, separated by $. Determine the class for the variable “total”
class(fda$issued) [1] "character"
If you need to change the data type for any column, use the following functions:
. as.character converts to a text string
. as.numeric converts to a number
. as.factor converts to a categorical variable
. as.integer converts to an integer
. as.Date converts to a date
. as.POSIXct convets to a full date and time
(Conversions to full dates and times can get complicated, because of timezones.
The summary function will run a quick statistical summary of a data frame, calculating mean, median and quartile values for continuous variables:
summary(pfizer) # summary of pfizer data org_indiv first_plus first_name last_name
Length :10087 Length :10087 Length :10087 Length :10087
N.unique : 4851 N.unique : 4047 N.unique : 1576 N.unique : 3617
N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
Min.nchar: 5 Min.nchar: 2 Min.nchar: 1 Min.nchar: 1
Max.nchar: 89 Max.nchar: 22 Max.nchar: 14 Max.nchar: 19
NAs : 203 NAs : 203
city state category cash
Length :10087 Length :10087 Length :10087 Min. : 0
N.unique : 1318 N.unique : 52 N.unique : 8 1st Qu.: 0
N.blank : 0 N.blank : 0 N.blank : 0 Median : 0
Min.nchar: 1 Min.nchar: 2 Min.nchar: 5 Mean : 3241
Max.nchar: 25 Max.nchar: 2 Max.nchar: 60 3rd Qu.: 2000
NAs : 1 Max. :1185466
NAs :1
other total
Min. : 0.0 Min. : 0
1st Qu.: 0.0 1st Qu.: 191
Median : 41.0 Median : 750
Mean : 266.5 Mean : 3507
3rd Qu.: 262.0 3rd Qu.: 2000
Max. :27681.0 Max. :1185466
NAs :3
Manipulate and analyze data
Now we will use dplyr to manipulate the data, using operations and functions:
. Sort: Largest to smallest, oldest to newest, alphabetical etc.
. select - Choose which columns to include.
. filter - Filter the data.
. arrange - Sort the data, by size for continuous variables, by date, or alphabetically.
. group_by - Group the data by a categorical variable.
. mutate - Create new column(s) in the data, or change existing column(s).
. rename - Rename column(s).
. bind_rows - Merge two data frames into one, combining data from columns with the same name.
Summarize has special additional functions associated with it
. summarize - Summarize, or aggregate (for each group if following group_by). Often used in conjunction with functions including:
mean Calculate the mean, or average
median Calculate the median
max Find the maximum value
min Find the minimum value
sum Add all the values together
n Count the number of records
There are also various functions to join data, which we will explore below.
Explore the pfizer dataset
We will explore this dataset, filtering in many ways, in order to see if there are doctors who have behaved unethically by taking money from Pfizer Pharma and then we will join that dataset with the FDA dataset that reveals doctors who have received warnings for this unethical behavior.
TODO: Filter and sort data
Filter and sort the data in specific ways. For each of the following examples, copy the code that follows into your script, and view the results. Notice how we create a new objects to hold the processed data.
Find doctors in California paid $10,000 or more by Pfizer to run “Expert-Led Forums.”
# doctors in California who were paid $10,000 or more by Pfizer to run "Expert-Led Forums."
ca_ny_expert_10000 <- pfizer %>%
filter((state == "CA" | state == "NY") & total >= 10000 & category == "Expert-Led Forums")
ca_ny_expert_10000# A tibble: 40 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ANDREW M BL… ANDREW M ANDREW BLUMENFE… DEL … CA Expert-… 20500 0
2 BANDARI, DA… DANIEL SH… DANIEL BANDARI LOS … CA Expert-… 21000 0
3 BAROYA, IVA… IVAN STEP… IVAN BAROYA BONI… CA Expert-… 26400 0
4 BECK, CLIFF… CLIFFORD … CLIFFORD BECK TORR… CA Expert-… 10500 0
5 CHOPRA, SHA… SHAGUN SHAGUN CHOPRA SAN … CA Expert-… 17250 0
6 COHEN, LEE … LEE STEVEN LEE COHEN HAST… NY Expert-… 30500 0
7 COLE, EMILY… EMILY ELI… EMILY COLE SAN … CA Expert-… 12000 0
8 CRANE, DANI… DANIEL LI… DANIEL CRANE NEW … NY Expert-… 25500 0
9 EHRESMANN, … GLENN RIC… GLENN EHRESMANN LOS … CA Expert-… 12000 0
10 FONAROW, GR… GREGG CUR… GREGG FONAROW LOS … CA Expert-… 15000 0
# ℹ 30 more rows
# ℹ 1 more variable: total <dbl>
Notice the use of == to find values that match the specified text, >= for greater than or equal to, and the Boolean operator &.
Add a sort to the end of the code to list the doctors in descending order by the payments received:
# doctors in California who were paid $10,000 or more by Pfizer to run "Expert-Led Forums."
ca_expert_10000_desc <- pfizer %>%
filter(state == "CA" & total >= 10000 & category=="Expert-Led Forums") %>%
arrange(desc(total))
ca_expert_10000_desc# A tibble: 31 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 SACKS, GER… GERALD MI… GERALD SACKS SANT… CA Expert-… 146500 0
2 NIDES, MIT… MITCHELL MITCHELL NIDES LOS … CA Expert-… 70500 0
3 POTKIN, ST… STEVEN GA… STEVEN POTKIN ORAN… CA Expert-… 48350 0
4 GINSBERG, … DAVID ALAN DAVID GINSBERG LOS … CA Expert-… 45750 0
5 LOUIE, SAM… SAMUEL SAMUEL LOUIE SACR… CA Expert-… 41250 0
6 INSTITUTE … GURKIPAL GURKIPAL SINGH WOOD… CA Expert-… 40000 0
7 BAROYA, IV… IVAN STEP… IVAN BAROYA BONI… CA Expert-… 26400 0
8 MATTHEW BU… MATTHEW J… MATTHEW BUDOFF MANH… CA Expert-… 24000 0
9 QUANG H NG… QUANG H QUANG NGUYEN LA J… CA Expert-… 22500 0
10 SCHROEDER,… JOHN SPEER JOHN SCHROEDER STAN… CA Expert-… 21500 0
# ℹ 21 more rows
# ℹ 1 more variable: total <dbl>
If you arrange without the desc function, the sort will be from smallest to largest.
Boolean Operators (&, |, ==, >, <, >=, <=, !=)
Notice the use of the | Boolean operator, and the brackets around that part of the query. This ensures that this part of the query is run first. See what happens if you exclude them.
# Find doctors in states other than California who were paid $10,000 or more by Pfizer to run "Expert-Led Forums."
not_ca_expert_10000 <- pfizer %>%
filter(state != "CA" & total >= 10000 & category=="Expert-Led Forums") %>%
arrange(desc(total))
not_ca_expert_10000# A tibble: 133 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 HESS, TODD … TODD MICH… TODD HESS SAIN… MN Expert-… 79000 0
2 MILLER, JOH… JOHN JOSE… JOHN MILLER EXET… NH Expert-… 78000 0
3 WEATHERS, V… VIVIAN JOY VIVIAN WEATHERS APEX NC Expert-… 75400 0
4 ROBERT B NE… ROBERT BU… ROBERT NETT SAN … TX Expert-… 60750 0
5 SOLERA CONS… STEVEN AB… STEVEN KAPLAN CHAP… NY Expert-… 56500 0
6 GRIFFIN, JA… JAMES DALE JAMES GRIFFIN DALL… TX Expert-… 54250 0
7 D BRENT JOY… DAVID BRE… DAVID JOYE SUMM… NC Expert-… 53500 0
8 STUBBLEFIEL… MICHAEL D MICHAEL STUBBLEF… NEW … NY Expert-… 50500 0
9 COHEN, SETH… SETH ALEX… SETH COHEN SEAT… WA Expert-… 44500 0
10 SUSSMAN, DA… DAVID OWEN DAVID SUSSMAN VOOR… NJ Expert-… 42500 0
# ℹ 123 more rows
# ℹ 1 more variable: total <dbl>
Notice the use of the != operator to exclude doctors in California.
# Find the 20 doctors across the four largest states (CA, TX, FL, NY) who were paid the most for professional advice.
ca_ny_tx_fl_prof_top20 <- pfizer %>%
filter((state=="CA" | state == "NY" | state == "TX" | state == "FL") & category == "Professional Advising") %>%
arrange(desc(total)) %>%
head(6)
ca_ny_tx_fl_prof_top20# A tibble: 6 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 BAILES, JOS… JOSEPH SW… JOSEPH BAILES AUST… TX Profess… 105000 0
2 SAWYERS, CH… CHARLES L… CHARLES SAWYERS NEW … NY Profess… 100000 0
3 MALENKA, RO… ROBERT CH… ROBERT MALENKA STAN… CA Profess… 75566 0
4 BEUTLER, BR… BRUCE ALAN BRUCE BEUTLER DALL… TX Profess… 50000 0
5 REGENTS OF … DAVID RAY… DAVID GANDARA IRVI… CA Profess… 38500 0
6 PTACEK, LOU… LOUIS JOHN LOUIS PTACEK SAN … CA Profess… 37588 0
# ℹ 1 more variable: total <dbl>
Notice the use of head, which grabs a defined number of rows from the start of a data frame. Here, it is crucial to run the sort first! See what happens if you change the order of the last two lines.
Filter the data for all payments for running Expert-Led Forums or for Professional Advising, and arrange alphabetically by doctor (last name, then first name)
expert_advice <- pfizer %>%
filter(category == "Expert-Led Forums" | category == "Professional Advising") %>%
arrange(last_name, first_name) %>%
head(20)
expert_advice# A tibble: 20 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ABBO, LILIA… LILIAN MA… LILIAN ABBO MIAMI FL Profess… 1800 0
2 ABEBE, SHEI… SHEILA Y SHEILA ABEBE INDI… IN Expert-… 825 0
3 NEW YORK UN… JUDITH ANN JUDITH ABERG NEW … NY Profess… 1750 0
4 ABOLNIK, IG… IGOR Z IGOR ABOLNIK PROVO UT Expert-… 1750 0
5 ABRAKSIA, S… SAMIR SAMIR ABRAKSIA BEAC… OH Expert-… 2000 0
6 ABRAKSIA, S… SAMIR SAMIR ABRAKSIA BEAC… OH Profess… 2500 0
7 ABRAMSON, S… STEVEN BA… STEVEN ABRAMSON NEW … NY Profess… 4400 0
8 ABUZZAHAB, … FARUK S FARUK ABUZZAHAB MINN… MN Profess… 1750 0
9 ABUZZAHAB, … MARY JENN… MARY ABUZZAHAB SAIN… MN Expert-… 1000 0
10 ACCACHA, SI… SIHAM DON… SIHAM ACCACHA MINE… NY Expert-… 1250 0
11 ACEVEDO MAR… IRIS ARLE… IRIS ACEVEDO … CAGU… PR Expert-… 750 0
12 ACKERMAN, I… IVAN FOST… IVAN ACKERMAN BRAN… FL Expert-… 1250 0
13 PAIN MEDICI… WILLIAM E… WILLIAM ACKERMAN LITT… AR Expert-… 1000 0
14 ACOSTA, LUI… LUIS SILV… LUIS ACOSTA HOUS… TX Expert-… 1000 0
15 ADAMS, SAND… SANDRA GA… SANDRA ADAMS SAN … TX Profess… 12840 0
16 ADDONA, TOM… TOMMASO TOMMASO ADDONA NEW … NY Expert-… 750 0
17 HEALTH RESE… ALEX ASIE… ALEX ADJEI BUFF… NY Profess… 2000 0
18 ADLER, DAVI… DAVID ELL… DAVID ADLER PORT… OR Profess… 71 0
19 ADLER, JERE… JEREMY A JEREMY ADLER ENCI… CA Expert-… 850 0
20 ADMANI, ARI… ARIFF ARIFF ADMANI PARA… NJ Expert-… 2000 0
# ℹ 1 more variable: total <dbl>
Notice that you can sort by multiple variables, separated by commas. Use pattern matching to filter text.
Use the grepl function to find values containing a particular string of text. This can simplify the code used to filter based on text.
# use pattern matching with grepl to filter text
expert_advice <- pfizer %>%
filter(grepl("Expert|Professional", category)) %>%
arrange(last_name, first_name)
expert_advice# A tibble: 4,382 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ABBO, LILIA… LILIAN MA… LILIAN ABBO MIAMI FL Profess… 1800 0
2 ABEBE, SHEI… SHEILA Y SHEILA ABEBE INDI… IN Expert-… 825 0
3 NEW YORK UN… JUDITH ANN JUDITH ABERG NEW … NY Profess… 1750 0
4 ABOLNIK, IG… IGOR Z IGOR ABOLNIK PROVO UT Expert-… 1750 0
5 ABRAKSIA, S… SAMIR SAMIR ABRAKSIA BEAC… OH Expert-… 2000 0
6 ABRAKSIA, S… SAMIR SAMIR ABRAKSIA BEAC… OH Profess… 2500 0
7 ABRAMSON, S… STEVEN BA… STEVEN ABRAMSON NEW … NY Profess… 4400 0
8 ABUZZAHAB, … FARUK S FARUK ABUZZAHAB MINN… MN Profess… 1750 0
9 ABUZZAHAB, … MARY JENN… MARY ABUZZAHAB SAIN… MN Expert-… 1000 0
10 ACCACHA, SI… SIHAM DON… SIHAM ACCACHA MINE… NY Expert-… 1250 0
# ℹ 4,372 more rows
# ℹ 1 more variable: total <dbl>
not_expert_advice <- pfizer %>%
filter(!grepl("Expert|Professional", category)) %>%
arrange(last_name, first_name)This code differs only by the ! Boolean operator. Notice that it has split the data into two, based on categories of payment.
Append one data frame to another.
Use the bind_rows function to append one data frame to another, which recreates unfiltered data from the two data frames above
# merge/append data frames
pfizer2 <- bind_rows(expert_advice, not_expert_advice)
pfizer2# A tibble: 10,087 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ABBO, LILIA… LILIAN MA… LILIAN ABBO MIAMI FL Profess… 1800 0
2 ABEBE, SHEI… SHEILA Y SHEILA ABEBE INDI… IN Expert-… 825 0
3 NEW YORK UN… JUDITH ANN JUDITH ABERG NEW … NY Profess… 1750 0
4 ABOLNIK, IG… IGOR Z IGOR ABOLNIK PROVO UT Expert-… 1750 0
5 ABRAKSIA, S… SAMIR SAMIR ABRAKSIA BEAC… OH Expert-… 2000 0
6 ABRAKSIA, S… SAMIR SAMIR ABRAKSIA BEAC… OH Profess… 2500 0
7 ABRAMSON, S… STEVEN BA… STEVEN ABRAMSON NEW … NY Profess… 4400 0
8 ABUZZAHAB, … FARUK S FARUK ABUZZAHAB MINN… MN Profess… 1750 0
9 ABUZZAHAB, … MARY JENN… MARY ABUZZAHAB SAIN… MN Expert-… 1000 0
10 ACCACHA, SI… SIHAM DON… SIHAM ACCACHA MINE… NY Expert-… 1250 0
# ℹ 10,077 more rows
# ℹ 1 more variable: total <dbl>
Write data to a CSV file
readr can write data to CSV and other text files.
# write expert_advice data to a csv file
write_csv(expert_advice, "expert_advice.csv", na="")na=“” ensures that any empty cells in the data frame are saved as blanks - R represents null values as NA, so if you don’t include this, any null values will appear as NA in the saved file.
Group and summarize data
Calculate the total payments, by state
# calculate total payments by state
state_sum <- pfizer %>%
group_by(state) %>%
summarize(tot_pay = sum(total)) %>%
arrange(desc(tot_pay))
state_sum# A tibble: 52 × 2
state tot_pay
<chr> <dbl>
1 CA 4737807
2 TX 2802196
3 FL 2564047
4 PA 2484505
5 NC 2328435
6 NY 2065042
7 MA 1764771
8 IL 1256825
9 MI 1146285
10 OH 1019450
# ℹ 42 more rows
Notice the use of group_by followed by summarize to group and summarize data, here using the function sum.
Calculate some additional summary statistics, by state
# As above, but for each state also calculate the median payment, and the number of payments
state_summary <- pfizer %>%
group_by(state) %>%
summarize(tot_pay = sum(total), median = median(total), count = n()) %>%
arrange(desc(tot_pay))
state_summary# A tibble: 52 × 4
state tot_pay median count
<chr> <dbl> <dbl> <int>
1 CA 4737807 886. 1172
2 TX 2802196 802 624
3 FL 2564047 669 738
4 PA 2484505 669 533
5 NC 2328435 777 383
6 NY 2065042 750 833
7 MA 1764771 1250 259
8 IL 1256825 661 448
9 MI 1146285 750 342
10 OH 1019450 685 462
# ℹ 42 more rows
Notice the use of multiple summary functions, sum, median, and n. (You don’t specify a variable for n because it is simply counting the number of rows in the data.)
Group and summarize for multiple categories
# as above, but group by state and category
state_category_summary <- pfizer %>%
group_by(state, category) %>%
summarize(tot_pay = sum(total), median = median(total), count = n()) %>%
arrange(state, category)
state_category_summary# A tibble: 319 × 5
# Groups: state [52]
state category tot_pay median count
<chr> <chr> <dbl> <dbl> <int>
1 AK Expert-Led Forums 1750 1.75e3 1
2 AL Business Related Travel 29123 3.31e2 31
3 AL Educational Items 431 6.7 e1 7
4 AL Expert-Led Forums 129850 2 e3 32
5 AL Investigator-Initiated Research 55937 5.59e4 1
6 AL Meals 10217 1.86e2 38
7 AL Pfizer Sponsored Research initiated before July 1… 279485 1.40e5 2
8 AL Pfizer Sponsored Research initiated on or after J… 66096 6.61e4 1
9 AL Professional Advising 110560 2 e3 20
10 AR Business Related Travel 3783 1.41e2 14
# ℹ 309 more rows
As for arrange, you can group_by by multiple variables, separated by commas.
Working with dates
Now let’s see how to work with dates, using the FDA warning letters data.
You will have to fix “issued”” to be read as a date. If you look back at str(fda), it was read in as a chr (character). To coerce it to be a date, use the command,
class(fda$issued)[1] "character"
fda$issued <- as.Date(fda$issued, "%m/%d/%Y")
class(fda$issued)[1] "Date"
post2005 <- fda %>%
filter(issued >= "2005-01-01") %>%
arrange(issued)
post2005# A tibble: 120 × 5
name_last name_first name_middle issued office
<chr> <chr> <chr> <date> <chr>
1 MARCUM JOHN W. 2005-01-27 Center for Devices and Radiologi…
2 TENNANT JERALD L. 2005-02-15 Center for Devices and Radiologi…
3 NICHOLS TRENT <NA> 2005-02-24 Center for Devices and Radiologi…
4 NICHOLS TRENT. <NA> 2005-02-24 Center for Devices and Radiologi…
5 WILLIAM A. GRAY 2005-03-16 Center for Devices and Radiologi…
6 WEIL LOWELL S. 2005-03-29 Center for Devices and Radiologi…
7 GRUBBS GERALD L. 2005-04-11 Center for Devices and Radiologi…
8 ROSENTHAL KENNETH J. 2005-04-11 Center for Devices and Radiologi…
9 YADAV SANJAY S. 2005-04-11 Center for Devices and Radiologi…
10 WINTERS CHARLES J. 2005-05-01 Center for Devices and Radiologi…
# ℹ 110 more rows
Notice that operators like >= can be used for dates, as well as for numbers.
This code introduces dplyr’s mutate function to create a new column in the data. The new variable year is the four-digit year “%Y (see here for more on time and date formats in R), extracted from the issued dates using the format function. Then the code groups by year and counts the number of letters for each one.
Add columns giving the number of days and weeks that have elapsed since each letter was sent
# add new columns showing many days and weeks elapsed since each letter was sent
fda <- fda %>%
mutate(days_elapsed = Sys.Date() - issued,
weeks_elapsed = difftime(Sys.Date(), issued, units = "weeks"))
fda# A tibble: 272 × 7
name_last name_first name_middle issued office days_elapsed weeks_elapsed
<chr> <chr> <chr> <date> <chr> <drtn> <drtn>
1 ADELGLASS JEFFREY M. 1999-05-25 Cente… 9877 days 1411.000 wee…
2 ADKINSON N. FRANKLIN 2000-04-19 Cente… 9547 days 1363.857 wee…
3 ALLEN MARK S. 2002-01-28 Cente… 8898 days 1271.143 wee…
4 AMSTERDAM DANIEL <NA> 2004-11-17 Cente… 7874 days 1124.857 wee…
5 AMSTUTZ HARLAN C. 2004-07-19 Cente… 7995 days 1142.143 wee…
6 ANDERSON C. JOSEPH 2000-02-25 Cente… 9601 days 1371.571 wee…
7 ANDREWS DAVID W. 2000-07-19 Cente… 9456 days 1350.857 wee…
8 AQEL RAED <NA> 2002-10-30 Cente… 8623 days 1231.857 wee…
9 ARROWSMI… PETER N. 2004-01-21 Cente… 8175 days 1167.857 wee…
10 BARR JOHN D. 2000-01-14 Cente… 9643 days 1377.571 wee…
# ℹ 262 more rows
Notice in the first line that this code changes the fda data frame, rather than creating a new object. The function Sys.Date returns the current date, and if you subtract another date, it will calculate the difference in days. To calculate date and time differences using other units, use the difftime function.
Notice also that you can mutate multiple columns at one go, separated by commas.
Join data from two data frames
Here is an animation for the different types of joins: https://github.com/gadenbuie/tidyexplain
There are a number of join functions in dplyr to combine data from two data frames. Here are the most useful:
. inner_join() returns values from both tables only where there is a match
. left_join() returns all the values from the first-mentioned table, plus those from the second table that match
. semi_join() filters the first-mentioned table to include only values that have matches in the second table
. anti_join() filters the first-mentioned table to include only values that have no matches in the second table.
To illustrate, these joins will find doctors paid by Pfizer to run expert led forums who had also received a warning letter from the FDA:
Join to identify doctors paid to run Expert-led forums who also received a warning letter
expert_warned_inner <- inner_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums")
expert_warned_semi <- semi_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums")
expert_warned_inner# A tibble: 4 × 15
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 BUKOWSKI CON… RONALD MA… RONALD BUKOWSKI CLEV… OH Expert-… 22500 0
2 LEVENSON, JE… JEFFREY R… JEFFREY LEVENSON SAIN… FL Expert-… 1500 0
3 THOMAS D GAZ… THOMAS DA… THOMAS GAZDA SCOT… AZ Expert-… 1000 0
4 WARD, DOUGLA… DOUGLAS J… DOUGLAS WARD WASH… DC Expert-… 1500 0
# ℹ 6 more variables: total <dbl>, name_middle <chr>, issued <date>,
# office <chr>, days_elapsed <drtn>, weeks_elapsed <drtn>
expert_warned_semi# A tibble: 4 × 10
org_indiv first_plus first_name last_name city state category cash other
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 BUKOWSKI CON… RONALD MA… RONALD BUKOWSKI CLEV… OH Expert-… 22500 0
2 LEVENSON, JE… JEFFREY R… JEFFREY LEVENSON SAIN… FL Expert-… 1500 0
3 THOMAS D GAZ… THOMAS DA… THOMAS GAZDA SCOT… AZ Expert-… 1000 0
4 WARD, DOUGLA… DOUGLAS J… DOUGLAS WARD WASH… DC Expert-… 1500 0
# ℹ 1 more variable: total <dbl>
The code in by=c() defines how the join should be made. If instructions on how to join the tables are not supplied, dplyr will look for columns with matching names, and perform the join based on those.
The difference between the two joins above is that the first contains all of the columns from both data frames, while the second gives only columns from the pfizer data frame.
In practice, you may wish to inner_join and then use dplyr’s select function to select the columns that you want to retain, for example:
Select desired columns from data
expert_warned <- inner_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums") %>%
select(first_plus, last_name, city, state, total, issued)
expert_warned <- inner_join(pfizer, fda, by=c("first_name" = "name_first", "last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums") %>%
select(2:5,10,12)
expert_warned# A tibble: 4 × 6
first_plus first_name last_name city total issued
<chr> <chr> <chr> <chr> <dbl> <date>
1 RONALD MATHEW RONALD BUKOWSKI CLEVELAND 22500 2009-03-30
2 JEFFREY RONALD JEFFREY LEVENSON SAINT PETERSBURG 1500 2000-09-27
3 THOMAS DAVID THOMAS GAZDA SCOTTSDALE 1000 2009-11-24
4 DOUGLAS JAMES DOUGLAS WARD WASHINGTON 1500 1997-11-20
Notice that you can select by columns’ names, or by their positions, where 1 is the first column, 3 is the third, and so on.
Comment your code
Anything that appears on a line after # will be treated as a comment, and will be ignored when the code is run. Get into the habit of commenting your code: Don’t trust yourself to remember what it does! Data journalism should ideally be fully documented and reproducible.