One of the most fundamental skills for a Data Scientist is Data Preparation (Data Manipulation). To be truly effective, you need to master the essential data manipulations, because a very large proportion of your work will involve getting and cleaning data.

Surveys suggest that data scientists spend about 60% of their time on cleaning and organizing data, with collecting data sets coming second at 19% of their time. In other words, data scientists spend around 80% of their time on preparing and managing data for analysis.

In this section, you will learn how to perform Data Preparation easily using R. We’ll cover the fundamental data manipulation verbs that you will use most frequently in your work.

  • read_csv() Import data (other import functions are also available).
  • str() Inspect the structure of the data.
  • apply() Apply a function over rows or columns, for example to check for and replace missing values.
  • select() Choose which columns to include.
  • filter() Select a defined subset of the data.
  • arrange() Sort the data, by size for continuous variables, by date, or alphabetically.
  • rename() Rename column(s).
  • mutate() Create new column(s) in the data, or change existing column(s).
  • bind_rows() Merge two data frames into one, combining data from columns with the same name.
  • group_by() Group the data by a categorical variable.
  • summarize() Summarize, or aggregate (for each group if following group_by). Often used in conjunction with functions including:
    • mean() Calculate the mean, or average.
    • median() Calculate the median.
    • max() Find the maximum value.
    • min() Find the minimum value.
    • sum() Add all the values together.
    • n() Count the number of records.

I recommend installing the tidyverse packages, because the core of the tidyverse includes the packages that you’re likely to use in everyday data analyses.

install.packages("tidyverse")

We’ll mainly work with two incredibly useful packages developed by Hadley Wickham, chief scientist at RStudio:

  • readr For reading and writing CSV and other text files.
  • dplyr For processing and manipulating data.

1 Import Data

The data we will use for this section are pfizer.csv and fda.csv; please download them and place them on your desktop. Optionally, you can load data into the current R session by selecting Import Dataset>From Text File... in the Environment tab. In this case, however, we will use the read_csv() function from the readr package. Copy the following code into your script and Run:

suppressPackageStartupMessages(library(tidyverse))# silently load the tidyverse
#setwd("C:/Users/Bakti/Desktop/")                 # remember to set your working directory
pfizer <- read_csv("pfizer.csv")                  # load the `pfizer` data
## 
## -- Column specification --------------------------------------------------------
## cols(
##   org_indiv = col_character(),
##   first_plus = col_character(),
##   first_name = col_character(),
##   last_name = col_character(),
##   city = col_character(),
##   state = col_character(),
##   category = col_character(),
##   cash = col_double(),
##   other = col_double(),
##   total = col_double()
## )
fda <- read_csv("fda.csv")                        # load the `fda` data
## 
## -- Column specification --------------------------------------------------------
## cols(
##   name_last = col_character(),
##   name_first = col_character(),
##   name_middle = col_character(),
##   issued = col_date(format = ""),
##   office = col_character()
## )
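
If you already know the column types, you can optionally pass them to read_csv() through the col_types argument, which also silences the column specification message. A minimal sketch for the fda file, assuming the columns shown above:

fda <- read_csv("fda.csv",
                col_types = cols(                 # declare the column types explicitly
                  name_last   = col_character(),
                  name_first  = col_character(),
                  name_middle = col_character(),
                  issued      = col_date(format = ""),
                  office      = col_character()
                ))                                # no column specification message is printed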

2 Data Structures

Note that you’ll need a strong understanding of the basic data types and data structures and how to operate on them. The str() function will tell you more about the columns in your data, including their data type. Copy this code into your script and Run:

str(pfizer)                                       # view structure of data `pfizer`
## tibble [10,087 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ org_indiv : chr [1:10087] "3-D MEDICAL SERVICES LLC" "AA DOCTORS, INC." "ABBO, LILIAN MARGARITA" "ABBO, LILIAN MARGARITA" ...
##  $ first_plus: chr [1:10087] "STEVEN BRUCE" "AAKASH MOHAN" "LILIAN MARGARITA" "LILIAN MARGARITA" ...
##  $ first_name: chr [1:10087] "STEVEN" "AAKASH" "LILIAN" "LILIAN" ...
##  $ last_name : chr [1:10087] "DEITELZWEIG" "AHUJA" "ABBO" "ABBO" ...
##  $ city      : chr [1:10087] "NEW ORLEANS" "PASO ROBLES" "MIAMI" "MIAMI" ...
##  $ state     : chr [1:10087] "LA" "CA" "FL" "FL" ...
##  $ category  : chr [1:10087] "Professional Advising" "Expert-Led Forums" "Business Related Travel" "Meals" ...
##  $ cash      : num [1:10087] 2625 1000 0 0 1800 ...
##  $ other     : num [1:10087] 0 0 448 119 0 0 47 0 0 396 ...
##  $ total     : num [1:10087] 2625 1000 448 119 1800 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   org_indiv = col_character(),
##   ..   first_plus = col_character(),
##   ..   first_name = col_character(),
##   ..   last_name = col_character(),
##   ..   city = col_character(),
##   ..   state = col_character(),
##   ..   category = col_character(),
##   ..   cash = col_double(),
##   ..   other = col_double(),
##   ..   total = col_double()
##   .. )
str(fda)                                          # view structure of data `fda`
## tibble [272 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name_last  : chr [1:272] "ADELGLASS" "ADKINSON" "ALLEN" "AMSTERDAM" ...
##  $ name_first : chr [1:272] "JEFFREY" "N." "MARK" "DANIEL" ...
##  $ name_middle: chr [1:272] "M." "FRANKLIN" "S." NA ...
##  $ issued     : Date[1:272], format: "1999-05-25" "2000-04-19" ...
##  $ office     : chr [1:272] "Center for Drug Evaluation and Research" "Center for Biologics Evaluation and Research" "Center for Devices and Radiological Health" "Center for Biologics Evaluation and Research" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name_last = col_character(),
##   ..   name_first = col_character(),
##   ..   name_middle = col_character(),
##   ..   issued = col_date(format = ""),
##   ..   office = col_character()
##   .. )

It is important to understand these structures because they are the objects you will manipulate on a day-to-day basis in R. If you need to change the data type of any column, use the following functions:

  • as.character() converts to a text string.
  • as.numeric() converts to a number.
  • as.factor() converts to a categorical variable.
  • as.integer() converts to an integer.
  • as.Date() converts to a date.
  • as.POSIXct() converts to a full date and time.

For instance, add the following code to your script to convert the total column in the pfizer data to a numeric variable (which would allow it to hold decimal values if we had any).

pfizer$total <- as.numeric(pfizer$total)          # convert total to numeric variable
str(pfizer$total)                                 # let's check the data structure again
##  num [1:10087] 2625 1000 448 119 1800 ...
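
Here are a few more conversion examples, as a sketch. None of these change the pfizer data; they simply print the converted values in the console.

str(as.factor(pfizer$category))                   # view `category` as a categorical variable (factor)
as.integer("42")                                  # convert a text string to an integer
as.Date("2010-04-12")                             # convert an ISO date string to a Date
as.POSIXct("2010-04-12 14:30:00")                 # convert to a full date and time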

3 Missing Values

Unlike in typical programming exercises, when working with real-life data you may have missing values: measurements that were simply not recorded or stored. R has rather sophisticated mechanisms for dealing with missing values. It distinguishes between the following types:

  • NA: Not Available entries (NA values have a class also, so there are integer NA, character NA, etc.)
  • NaN: Not a number (A NaN value is also NA but the converse is not true)
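
A quick check in the console illustrates the distinction:

is.na(NaN)                                        # TRUE: a NaN value is also NA
is.nan(NA)                                        # FALSE: an NA value is not NaN
0 / 0                                             # arithmetic like this produces NaN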

Find missing values in the columns of the data frame pfizer:

is.na(pfizer)                                     # classic way to check for NAs
sum(is.na(pfizer))                                # count the NAs
apply(is.na(pfizer),2, which)                     # which row indexes contain NAs, per column (data frames only)
which(complete.cases(pfizer))                     # identify the rows with complete (observed) values

A more general mechanism is to remove missing values manually:

clean.vector <- na.omit(pfizer$first_name)        # remove NAs from a vector
clean.df <- na.omit(pfizer)                       # remove rows with NAs from a data frame
apply(is.na(clean.df),2, which)                   # check that no missing values remain
## integer(0)

4 Replace Missing Values

We could also replace missing values with the mean (or the median). A good practice is to first compute the mean of each affected column and store the results in a separate variable; once created, we can use these values to replace the missing ones. Let’s load the data and check which columns have missing values.

PATH <- "https://raw.githubusercontent.com/Bakti-Siregar/dataset/master/Bookdown-Data-Science-for-Beginners/Missing_Values.csv"
titanic <- read.csv(PATH, sep = ",")
list_na <- colnames(titanic)[ apply(titanic, 2, anyNA) ]
list_na
## [1] "Age"  "Fare"

In this case, we do not remove the missing values; instead, we use the apply() function to compute the mean of each column that contains NAs. We need to pass the argument na.rm = TRUE, which is required here because the columns have missing data, and it tells R to ignore the NAs when computing the mean.

average_missing <- apply(titanic[,colnames(titanic) %in% list_na],
                         2,
                         mean,
                         na.rm =  TRUE)
average_missing
##      Age     Fare 
## 30.27259 35.62719

Code Explanation: We pass four arguments to apply().

  • df titanic[,colnames(titanic) %in% list_na] restricts the data frame to the columns named in list_na (i.e. Age and Fare).
  • 2 Apply the function over columns (MARGIN = 2).
  • mean The summary function to compute.
  • na.rm = TRUE Ignore the missing values.

Next, we can replace the NA values. The verb mutate() from the dplyr library is useful for creating new variables. We don’t necessarily want to change the original column, so we can create a new variable without the NAs. mutate() is easy to use: we just choose a variable name and define how to create it. Here is the complete code:

titanic_replace <- titanic %>%
  mutate(age  = ifelse(is.na(Age), average_missing[1], Age),
         fare = ifelse(is.na(Fare), average_missing[2], Fare))
sum(is.na(titanic_replace$Age))
## [1] 86
sum(is.na(titanic_replace$Fare))
## [1] 1
sum(is.na(titanic_replace$age))
## [1] 0
sum(is.na(titanic_replace$fare))
## [1] 0

The original column Age has 86 missing values, while the newly created variable age has them replaced with the mean of the variable. You can try replacing the missing observations with the median as well, as sketched below.
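
As a sketch, the median version only requires swapping the summary function (this reuses the titanic and list_na objects created above; titanic_replace_median is just an illustrative name):

median_missing <- apply(titanic[, colnames(titanic) %in% list_na],
                        2,
                        median,
                        na.rm = TRUE)             # column medians, ignoring NAs
titanic_replace_median <- titanic %>%
  mutate(age  = ifelse(is.na(Age),  median_missing[1], Age),
         fare = ifelse(is.na(Fare), median_missing[2], Fare))
sum(is.na(titanic_replace_median$age))            # should now be 0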

5 Select Data

In this part, you will learn how to select or subset data frame columns by name and position using the select() function from the dplyr package. You will learn how to use the following functions:

  • pull() Extract column values as a vector. The column of interest can be specified either by name or by index.
  • select() Extract one or multiple columns as a data table. It can be also used to remove columns from the data frame.
  • select_if() Select columns based on a particular condition. One can use this function to, for example, select columns if they are numeric.
  • Helper functions: starts_with(), ends_with(), contains(), matches(): Select columns/variables based on their names.
library(tidyverse)                                # load `tidyverse`, which includes `dplyr`
pfizer %>% pull(state) %>% head()                 # extract column values of `state` as a vector
pfizer %>% select(1:3)                            # select columns 1 to 3
pfizer %>% select(1,3)                            # select column 1 and 3 but not 2
pfizer %>% select(state:total)                    # select all columns from `state` to `total`
pfizer %>% select(state,total)                    # select columns by variable names
pfizer %>% select_if(is.numeric)                  # select only numeric columns
pfizer %>% select_if(is.character)                # select only character columns
pfizer %>% select(starts_with("first"))           # select columns whose names start with `first`
pfizer %>% select(ends_with("name"))              # select columns whose names end with `name`
pfizer %>% select(contains("rst"))                # select columns whose names contain `rst`
pfizer %>% select(matches("_"))                   # select columns whose names match a regex (`_`)
pfizer %>% select(-(state:total))                 # remove all columns from `state` to `total`
pfizer %>% select(-state, -total)                 # remove columns `state` and `total`

6 Filter and Sort Data

Now we will filter() and arrange() the data in specific ways. For each of the following examples, copy the code that follows into your script, and view the results. Notice how we create a new object to hold the processed data.

6.1 Example 1

Find doctors in California paid $10,000 or more by Pfizer to run “Professional Advising”!

ca_expert_10000 <- pfizer %>%                     # start from the `pfizer` data
  filter(state == "CA" &                          # filter by `state` (California)
         total >= 10000 &                         # and by `total` greater than or equal to 10000
         category == "Professional Advising")     # and by `category`
ca_expert_10000                                   # print the result
## # A tibble: 35 x 10
##    org_indiv first_plus first_name last_name city  state category  cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>    <dbl> <dbl>
##  1 CUMMINGS~ JEFFREY L~ JEFFREY    CUMMINGS  LOS ~ CA    Profess~ 13500     0
##  2 FONAROW,~ GREGG CUR~ GREGG      FONAROW   LOS ~ CA    Profess~ 15750     0
##  3 GANZ, PE~ PETER      PETER      GANZ      SAN ~ CA    Profess~ 22000     0
##  4 GT AND A~ GEORGE     GEORGE     TRIADAFI~ SAN ~ CA    Profess~ 33705     0
##  5 HANAHAN,~ DOUGLAS    DOUGLAS    HANAHAN   SAN ~ CA    Profess~ 37500     0
##  6 HERBERT ~ HERBERT    HERBERT    CRUZ      FRES~ CA    Profess~ 13250     0
##  7 HU, RONA~ RONA JANE  RONA       HU        PALO~ CA    Profess~ 22500     0
##  8 KAPLOWIT~ NEIL       NEIL       KAPLOWITZ LOS ~ CA    Profess~ 12000     0
##  9 KIZER, K~ KENNETH W~ KENNETH    KIZER     ALIS~ CA    Profess~ 10000     0
## 10 KOO, EDW~ EDWARD HA~ EDWARD     MANG      LA J~ CA    Profess~ 12500     0
## # ... with 25 more rows, and 1 more variable: total <dbl>

6.2 Example 2

Now add a sort to the end of the code to list the doctors in descending order by the payments received!

ca_expert_10000 <- pfizer %>%                     # start from the `pfizer` data
  filter(state == "CA" &                          # filter by `state` (California)
         total >= 10000 &                         # and by `total` greater than or equal to 10000
         category == "Professional Advising")%>%  # and by `category`
  arrange(desc(total))                            # descending order by the payments received
ca_expert_10000                                   # print the result
## # A tibble: 35 x 10
##    org_indiv first_plus first_name last_name city  state category  cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>    <dbl> <dbl>
##  1 MALENKA,~ ROBERT CH~ ROBERT     MALENKA   STAN~ CA    Profess~ 75566     0
##  2 REGENTS ~ DAVID RAY~ DAVID      GANDARA   IRVI~ CA    Profess~ 38500     0
##  3 PTACEK, ~ LOUIS JOHN LOUIS      PTACEK    SAN ~ CA    Profess~ 37588     0
##  4 HANAHAN,~ DOUGLAS    DOUGLAS    HANAHAN   SAN ~ CA    Profess~ 37500     0
##  5 REGENTS ~ JOHN POWER JOHN       KANE      IRVI~ CA    Profess~ 37500     0
##  6 RISCH, S~ SAMUEL CR~ SAMUEL     RISCH     SAN ~ CA    Profess~ 36000     0
##  7 GT AND A~ GEORGE     GEORGE     TRIADAFI~ SAN ~ CA    Profess~ 33705     0
##  8 SANDROCK~ CHRISTIAN~ CHRISTIAN  SANDROCK  SACR~ CA    Profess~ 33000     0
##  9 MATTHEW ~ MATTHEW J~ MATTHEW    BUDOFF    MANH~ CA    Profess~ 31500     0
## 10 OLEFSKY,~ JERROLD M~ JERROLD    OLEFSKY   SOLA~ CA    Profess~ 27746     0
## # ... with 25 more rows, and 1 more variable: total <dbl>
# arrange(total)                                  # ascending order is the R default

6.3 Example 3

Find doctors in California or New York who were paid $10,000 or more by Pfizer to run “Professional Advising”!

Notice that in this case we use the | Boolean operator, and parentheses around that part of the query. This ensures that this part of the query is evaluated first. See what happens if you exclude them (a sketch follows the output below).

ca_ny_expert_10000 <- pfizer %>%                  # start from the `pfizer` data
  filter((state == "CA" | state == "NY")  &       # filter by `state` (California or New York)
         total >= 10000 &                         # and by `total` greater than or equal to 10000
         category == "Professional Advising")%>%  # and by `category`
  arrange(desc(total))                            # descending order by the payments received
ca_ny_expert_10000                                # print the result
## # A tibble: 46 x 10
##    org_indiv first_plus first_name last_name city  state category   cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>     <dbl> <dbl>
##  1 SAWYERS,~ CHARLES L~ CHARLES    SAWYERS   NEW ~ NY    Profess~ 100000     0
##  2 MALENKA,~ ROBERT CH~ ROBERT     MALENKA   STAN~ CA    Profess~  75566     0
##  3 REGENTS ~ DAVID RAY~ DAVID      GANDARA   IRVI~ CA    Profess~  38500     0
##  4 PTACEK, ~ LOUIS JOHN LOUIS      PTACEK    SAN ~ CA    Profess~  37588     0
##  5 HANAHAN,~ DOUGLAS    DOUGLAS    HANAHAN   SAN ~ CA    Profess~  37500     0
##  6 JAVITT, ~ DANIEL CO~ DANIEL     JAVITT    ORAN~ NY    Profess~  37500     0
##  7 REGENTS ~ JOHN POWER JOHN       KANE      IRVI~ CA    Profess~  37500     0
##  8 RISCH, S~ SAMUEL CR~ SAMUEL     RISCH     SAN ~ CA    Profess~  36000     0
##  9 GT AND A~ GEORGE     GEORGE     TRIADAFI~ SAN ~ CA    Profess~  33705     0
## 10 SANDROCK~ CHRISTIAN~ CHRISTIAN  SANDROCK  SACR~ CA    Profess~  33000     0
## # ... with 36 more rows, and 1 more variable: total <dbl>
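
Without the parentheses, & binds more tightly than |, so the condition is read as "all doctors in California, or doctors in New York who also meet the other conditions". A minimal sketch of the unparenthesised version for comparison (not assigned to an object):

pfizer %>%                                        # same filter, parentheses removed
  filter(state == "CA" | state == "NY" &          # read as: CA (any payment), OR
         total >= 10000 &                         # NY with total >= 10000
         category == "Professional Advising")     # and category "Professional Advising"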

6.4 Example 4

Find doctors in states other than California who were paid $10,000 or more by Pfizer to run “Professional Advising”!

not_ca_expert_10000 <- pfizer %>%                 # start from the `pfizer` data
  filter(state != "CA" &                          # filter by `state` other than California
         total >= 10000 &                         # and by `total` greater than or equal to 10000
         category == "Professional Advising")%>%  # and by `category`
  arrange(desc(total))                            # descending order by the payments received
not_ca_expert_10000                               # print the result
## # A tibble: 135 x 10
##    org_indiv first_plus first_name last_name city  state category   cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>     <dbl> <dbl>
##  1 BAILES, ~ JOSEPH SW~ JOSEPH     BAILES    AUST~ TX    Profess~ 105000     0
##  2 SAWYERS,~ CHARLES L~ CHARLES    SAWYERS   NEW ~ NY    Profess~ 100000     0
##  3 MARDER, ~ HAROLD KAY HAROLD     MARDER    PHIL~ PA    Profess~  87610     0
##  4 WILLIAMS~ GRANT      GRANT      WILLIAMS  WAYNE PA    Profess~  85621     0
##  5 THE SCHE~ PHILIP SA~ PHILIP     SCHEIN    BRYN~ PA    Profess~  75609     0
##  6 GORDON, ~ JEFFREY I~ JEFFREY    GORDON    SAIN~ MO    Profess~  75108     0
##  7 KOLLEF, ~ MARIN HRI~ MARIN      KOLLEF    SAIN~ MO    Profess~  72750     0
##  8 JOSEPH, ~ WARREN S   WARREN     JOSEPH    HUNT~ PA    Profess~  71875     0
##  9 SHORR, A~ ANDREW FR~ ANDREW     SHORR     WASH~ DC    Profess~  62000     0
## 10 MAYO FOU~ PETER JAM~ PETER      DYCK      ROCH~ MN    Profess~  61175     0
## # ... with 125 more rows, and 1 more variable: total <dbl>

6.5 Example 5

Find the 20 doctors across the four largest states (CA, TX, FL, NY) who were paid the most for “Expert-Led Forums”!

ca_ny_tx_fl_prof_top20 <- pfizer %>%
  filter((state == "CA" | 
          state == "NY" |
          state == "TX" | 
          state == "FL") &
          category == "Expert-Led Forums") %>%
  arrange(desc(total)) %>%
  head(20)
ca_ny_tx_fl_prof_top20
## # A tibble: 20 x 10
##    org_indiv first_plus first_name last_name city  state category   cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>     <dbl> <dbl>
##  1 SACKS, G~ GERALD MI~ GERALD     SACKS     SANT~ CA    Expert-~ 146500     0
##  2 NIDES, M~ MITCHELL   MITCHELL   NIDES     LOS ~ CA    Expert-~  70500     0
##  3 ROBERT B~ ROBERT BU~ ROBERT     NETT      SAN ~ TX    Expert-~  60750     0
##  4 SOLERA C~ STEVEN AB~ STEVEN     KAPLAN    CHAP~ NY    Expert-~  56500     0
##  5 GRIFFIN,~ JAMES DALE JAMES      GRIFFIN   DALL~ TX    Expert-~  54250     0
##  6 STUBBLEF~ MICHAEL D  MICHAEL    STUBBLEF~ NEW ~ NY    Expert-~  50500     0
##  7 POTKIN, ~ STEVEN GA~ STEVEN     POTKIN    ORAN~ CA    Expert-~  48350     0
##  8 GINSBERG~ DAVID ALAN DAVID      GINSBERG  LOS ~ CA    Expert-~  45750     0
##  9 LOUIE, S~ SAMUEL     SAMUEL     LOUIE     SACR~ CA    Expert-~  41250     0
## 10 INSTITUT~ GURKIPAL   GURKIPAL   SINGH     WOOD~ CA    Expert-~  40000     0
## 11 NEURONLI~ JEFFREY B~ JEFFREY    GELBLUM   MIAMI FL    Expert-~  39200     0
## 12 HENNEKEN~ CHARLES H  CHARLES    HENNEKENS BOCA~ FL    Expert-~  33250     0
## 13 SOBHY EL~ SAAD A     SAAD       SOBHY     SYRA~ NY    Expert-~  31500     0
## 14 COHEN, L~ LEE STEVEN LEE        COHEN     HAST~ NY    Expert-~  30500     0
## 15 BAROYA, ~ IVAN STEP~ IVAN       BAROYA    BONI~ CA    Expert-~  26400     0
## 16 JONES, S~ STEPHEN E~ STEPHEN    JONES     DALL~ TX    Expert-~  26250     0
## 17 CRANE, D~ DANIEL LI~ DANIEL     CRANE     NEW ~ NY    Expert-~  25500     0
## 18 AXONZ INC STANLEY J  STANLEY    KROLCZYK  WESL~ FL    Expert-~  24750     0
## 19 MATTHEW ~ MATTHEW J~ MATTHEW    BUDOFF    MANH~ CA    Expert-~  24000     0
## 20 QUANG H ~ QUANG H    QUANG      NGUYEN    LA J~ CA    Expert-~  22500     0
## # ... with 1 more variable: total <dbl>
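
The chained | conditions on state in Example 5 can be written more compactly with the %in% operator, which tests membership in a vector. This sketch is equivalent to the filter above:

ca_ny_tx_fl_prof_top20 <- pfizer %>%
  filter(state %in% c("CA", "NY", "TX", "FL") &   # membership test instead of chained |
         category == "Expert-Led Forums") %>%
  arrange(desc(total)) %>%
  head(20)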

6.6 Example 6

Filter the data pfizer for all payments for running “Expert-Led Forums” or for “Professional Advising”, and arrange alphabetically by doctor (first name, then last name)!

expert_professional_advice <- pfizer %>%
  filter(category == "Expert-Led Forums" | 
         category == "Professional Advising") %>%
  arrange(first_name,last_name)
expert_professional_advice
## # A tibble: 4,382 x 10
##    org_indiv first_plus first_name last_name city  state category  cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>    <dbl> <dbl>
##  1 THE REGE~ A MARK     A          FENDRICK  ANN ~ MI    Profess~ 12000     0
##  2 AA DOCTO~ AAKASH MO~ AAKASH     AHUJA     PASO~ CA    Expert-~  1000     0
##  3 BLOM, AA~ AARON S    AARON      BLOM      PHIL~ PA    Expert-~  1000     0
##  4 GROTAS, ~ AARON BRE~ AARON      GROTAS    BROO~ NY    Expert-~  3000     0
##  5 MILSTONE~ AARON PAUL AARON      MILSTONE  BREN~ TN    Expert-~  8500     0
##  6 NGUYEN, ~ AARON LINH AARON      NGUYEN    POMO~ CA    Expert-~  1000     0
##  7 SPITZ EN~ AARON      AARON      SPITZ     LAGU~ CA    Expert-~   750     0
##  8 VINIK, A~ AARON I    AARON      VINIK     NORF~ VA    Profess~  1000     0
##  9 WAXMAN, ~ AARON B    AARON      WAXMAN    BOST~ MA    Profess~  2250     0
## 10 SHAH, AA~ AASHIT KA~ AASHIT     SHAH      DETR~ MI    Expert-~  3000     0
## # ... with 4,372 more rows, and 1 more variable: total <dbl>

7 Rename and Mutate

In this part, you will learn how to rename the columns of a data frame in R. Next, you will learn how to compute and add new variables to a data frame in R. You will learn the following R functions from the dplyr R package:

  • rename() Rename the columns of a data frame in R.
  • mutate() Compute and add new variables into a data table. It preserves existing variables.
  • transmute() Compute new columns but drop existing variables.
library(tidyverse)
# Renaming columns of the data `pfizer` with R base functions:
names(pfizer)[names(pfizer) == "org_indiv"] <- "rename1"
names(pfizer)[1] <- "rename2"
names(pfizer)[names(pfizer)==names(pfizer)] <- c("rename3",
                                                 "rename4",
                                                 "first_name",
                                                 "last_name",
                                                 "city",
                                                 "state",
                                                 "category",
                                                 "cash",
                                                 "other",
                                                 "total")
# Renaming columns of the data `pfizer` with `dplyr::rename()`:
pfizer %>% 
  rename(
    org_indiv = rename3 ,
    first_plus = rename4  
    )
## # A tibble: 10,087 x 10
##    org_indiv first_plus first_name last_name city  state category  cash other
##    <chr>     <chr>      <chr>      <chr>     <chr> <chr> <chr>    <dbl> <dbl>
##  1 3-D MEDI~ STEVEN BR~ STEVEN     DEITELZW~ NEW ~ LA    Profess~  2625     0
##  2 AA DOCTO~ AAKASH MO~ AAKASH     AHUJA     PASO~ CA    Expert-~  1000     0
##  3 ABBO, LI~ LILIAN MA~ LILIAN     ABBO      MIAMI FL    Busines~     0   448
##  4 ABBO, LI~ LILIAN MA~ LILIAN     ABBO      MIAMI FL    Meals        0   119
##  5 ABBO, LI~ LILIAN MA~ LILIAN     ABBO      MIAMI FL    Profess~  1800     0
##  6 ABDULLAH~ ABDULLAH   ABDULLAH   RAFFEE    FLINT MI    Expert-~   750     0
##  7 ABEBE, S~ SHEILA Y   SHEILA     ABEBE     INDI~ IN    Educati~     0    47
##  8 ABEBE, S~ SHEILA Y   SHEILA     ABEBE     INDI~ IN    Expert-~   825     0
##  9 ABILENE ~ GALEN CHR~ GALEN      ALBRITTON ABIL~ TX    Profess~  3000     0
## 10 ABOLNIK,~ IGOR Z     IGOR       ABOLNIK   PROVO UT    Busines~     0   396
## # ... with 10,077 more rows, and 1 more variable: total <dbl>
# Add a new column (`year`) while preserving the existing columns of the data `fda`:
letters_year <- fda %>%
  mutate(year = format(issued, "%Y")) %>%
  group_by(year)
letters_year
## # A tibble: 272 x 6
## # Groups:   year [15]
##    name_last  name_first name_middle issued     office                     year 
##    <chr>      <chr>      <chr>       <date>     <chr>                      <chr>
##  1 ADELGLASS  JEFFREY    M.          1999-05-25 Center for Drug Evaluatio~ 1999 
##  2 ADKINSON   N.         FRANKLIN    2000-04-19 Center for Biologics Eval~ 2000 
##  3 ALLEN      MARK       S.          2002-01-28 Center for Devices and Ra~ 2002 
##  4 AMSTERDAM  DANIEL     <NA>        2004-11-17 Center for Biologics Eval~ 2004 
##  5 AMSTUTZ    HARLAN     C.          2004-07-19 Center for Devices and Ra~ 2004 
##  6 ANDERSON   C.         JOSEPH      2000-02-25 Center for Devices and Ra~ 2000 
##  7 ANDREWS    DAVID      W.          2000-07-19 Center for Biologics Eval~ 2000 
##  8 AQEL       RAED       <NA>        2002-10-30 Center for Devices and Ra~ 2002 
##  9 ARROWSMITH PETER      N.          2004-01-21 Center for Devices and Ra~ 2004 
## 10 BARR       JOHN       D.          2000-01-14 Center for Devices and Ra~ 2000 
## # ... with 262 more rows
# Create new columns (`year` and `last_name`), dropping the existing columns of the data `fda`:
fda %>%
  transmute(
    year = format(issued, "%Y"),
    last_name = name_last  
    )
## # A tibble: 272 x 2
##    year  last_name 
##    <chr> <chr>     
##  1 1999  ADELGLASS 
##  2 2000  ADKINSON  
##  3 2002  ALLEN     
##  4 2004  AMSTERDAM 
##  5 2004  AMSTUTZ   
##  6 2000  ANDERSON  
##  7 2000  ANDREWS   
##  8 2002  AQEL      
##  9 2004  ARROWSMITH
## 10 2000  BARR      
## # ... with 262 more rows

8 Join Data

There are also a number of join functions in dplyr to combine data from two data frames. Here are the most useful:

  • inner_join() Returns values from both tables only where there is a match.
  • left_join() Returns all the values from the first-mentioned table, plus those from the second table that match.
  • semi_join() Filters the first-mentioned table to include only values that have matches in the second table.
  • anti_join() Filters the first-mentioned table to include only values that have no matches in the second table.

To illustrate, these joins will find doctors paid by Pfizer to run expert-led forums who had also received a warning letter from the FDA:

expert_warned_inner <- inner_join(pfizer, fda, 
                                  by=c("first_name" = "name_first", 
                                       "last_name" = "name_last")) %>%
                                  filter(category=="Expert-Led Forums")

expert_warned_semi <- semi_join(pfizer, fda, 
                                by=c("first_name" = "name_first", 
                                     "last_name" = "name_last")) %>%
                                filter(category=="Expert-Led Forums")

The code in by=c() defines how the join should be made. If instructions on how to join the tables are not supplied, dplyr will look for columns with matching names, and perform the join based on those. The difference between the two joins above is that the first contains all of the columns from both data frames, while the second gives only columns from the pfizer data frame.

In practice, you may wish to inner_join and then use dplyr’s select function to select the columns that you want to retain, for example:

expert_warned <- inner_join(pfizer, fda, 
                            by=c("first_name" = "name_first", 
                                 "last_name" = "name_last")) %>%
                            filter(category=="Expert-Led Forums") %>%
                            select(last_name, 
                                   city, 
                                   state, 
                                   total, 
                                   issued)

expert_warned <- inner_join(pfizer, fda, 
                            by=c("first_name" = "name_first",
                                 "last_name" = "name_last")) %>%
                            filter(category=="Expert-Led Forums") %>%
                            select(2:5,10,12)
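
For completeness, here is a minimal sketch of the two joins from the list above that were not run: left_join() keeps every row of pfizer and fills the fda columns with NA where there is no match, while anti_join() keeps only the pfizer rows with no matching warning letter. The object names are just for illustration.

all_with_letters <- left_join(pfizer, fda,        # all pfizer rows, fda columns NA if no match
                              by=c("first_name" = "name_first",
                                   "last_name" = "name_last"))

never_warned <- anti_join(pfizer, fda,            # pfizer rows with no matching fda letter
                          by=c("first_name" = "name_first",
                               "last_name" = "name_last"))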

See the dplyr documentation for a more complete reference on managing joins with dplyr.

9 Group and Summarize

This part introduces how to easily compute statistical summaries in R using the dplyr package. You will learn how to:

  • Compute summary statistics for ungrouped data, as well as for data grouped by one or multiple variables. R functions: summarise() and group_by().
  • Summarise multiple variable columns (a sketch follows this list). R functions:
    • summarise_all() Apply summary functions to every column in the data frame.
    • summarise_at() Apply summary functions to specific columns selected with a character vector.
    • summarise_if() Apply summary functions to columns selected with a predicate function that returns TRUE.
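
As a sketch of the multi-column variants, using the pfizer data loaded earlier; summarise_at() takes the columns and the summary function, while summarise_if() takes a predicate and the summary function:

pfizer %>%
  group_by(category) %>%
  summarise_at(vars(cash, other, total), sum)     # sum the named columns per category
pfizer %>%
  group_by(category) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)    # mean of every numeric column per category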

9.1 Example 7

Calculate the total payments in the data pfizer, by state, in descending order!

state_sum <- pfizer %>%
  group_by(state) %>%
  summarize(sum = sum(total)) %>%
  arrange(desc(sum))
## `summarise()` ungrouping output (override with `.groups` argument)
state_sum

9.2 Example 8

Calculate some additional summary statistics of the data pfizer, by state, in descending order!

state_summary <- pfizer %>%
  group_by(state) %>%
  summarize(sum = sum(total), 
        average = mean(total), 
         median = median(total),
            min = min(total),
            max = max(total),
            count = n()) %>%
  arrange(desc(sum))
state_summary
## # A tibble: 52 x 7
##    state     sum average median   min     max count
##    <chr>   <dbl>   <dbl>  <dbl> <dbl>   <dbl> <int>
##  1 CA    4737807   4042.   886.    25  323122  1172
##  2 TX    2802196   4491.   802     25  479314   624
##  3 FL    2564047   3474.   669     25 1185466   738
##  4 PA    2484505   4661.   669     25  672263   533
##  5 NC    2328435   6079.   777      0  857698   383
##  6 NY    2065042   2479.   750     25  100000   833
##  7 MA    1764771   6814.  1250     29  617548   259
##  8 IL    1256825   2805.   661     25  160984   448
##  9 MI    1146285   3352.   750     26  152466   342
## 10 OH    1019450   2207.   685     25  107292   462
## # ... with 42 more rows

9.3 Example 9

Group and summarize the data pfizer by multiple variables (state and category), in ascending order!

state_summary <- pfizer %>%
  group_by(state, category) %>%
  summarize(sum = sum(total), 
        average = mean(total), 
         median = median(total),
            min = min(total),
            max = max(total),
            count = n()) %>%
  arrange(state, category)
state_summary
## # A tibble: 319 x 8
## # Groups:   state [52]
##    state category                          sum average median   min    max count
##    <chr> <chr>                           <dbl>   <dbl>  <dbl> <dbl>  <dbl> <int>
##  1 AK    Expert-Led Forums                1750  1.75e3 1.75e3  1750   1750     1
##  2 AL    Business Related Travel         29123  9.39e2 3.31e2    39   8041    31
##  3 AL    Educational Items                 431  6.16e1 6.70e1    47     78     7
##  4 AL    Expert-Led Forums              129850  4.06e3 2.00e3   550  26250    32
##  5 AL    Investigator-Initiated Resear~  55937  5.59e4 5.59e4 55937  55937     1
##  6 AL    Meals                           10217  2.69e2 1.86e2    26   1385    38
##  7 AL    Pfizer Sponsored Research ini~ 279485  1.40e5 1.40e5 46494 232991     2
##  8 AL    Pfizer Sponsored Research ini~  66096  6.61e4 6.61e4 66096  66096     1
##  9 AL    Professional Advising          110560  5.53e3 2.00e3   702  35000    20
## 10 AR    Business Related Travel          3783  2.70e2 1.41e2    27   1017    14
## # ... with 309 more rows

9.4 Example 10

Filter the data fda for letters sent from the start of 2005 onwards and summarize by year!

year_summary <- fda %>%
  filter(issued >= "2005-01-01") %>%
  arrange(issued) %>%
  mutate(year = format(issued, "%Y")) %>%
  group_by(year) %>%
  summarize(letters=n())
year_summary
## # A tibble: 6 x 2
##   year  letters
##   <chr>   <int>
## 1 2005       31
## 2 2006       25
## 3 2007       11
## 4 2008       23
## 5 2009       25
## 6 2010        5


Beware that reusing variable names inside summarize() may lead to unexpected results. Also, R may print a message such as summarise() ungrouping output (override with .groups argument); this is informational rather than an error, and you can control the grouping of the result explicitly with the .groups argument, as sketched below.
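
A minimal sketch: setting .groups explicitly silences the message and makes the grouping of the result clear.

state_sum <- pfizer %>%
  group_by(state) %>%
  summarize(sum = sum(total), .groups = "drop") %>%  # return an ungrouped result, no message
  arrange(desc(sum))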

10 Split Data

As mentioned in the R Programming chapter, you can write your own function to split data into training and testing sets. Here, we will look at some packages that make this easier. Splitting data is an important task for a data scientist, especially when you apply machine learning to analyze data: we usually need to split the dataset into a train set and a test set. The train set allows the algorithm to learn from the data; to test the performance of our model, we use the test set to return the performance measure. So, let’s look at some packages that you can use to split data:

10.1 dplyr

You can use dplyr for this, which makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating the sets but also for traceability during your project. Add one if your data does not already contain it.

library(dplyr)                                    # load `dplyr` package
data(mtcars)                                      # use data from R environment
set.seed(123)                                     # to ensure we generate the same data
mtcars$id <- 1:nrow(mtcars)                       # add the `id` if doesn't contain already
train<-mtcars %>% dplyr::sample_frac(.75)         # assign the train set
test<-dplyr::anti_join(mtcars, train, by = 'id')  # assign the test set
dim(train)                                        # check dimension of train set
## [1] 24 12
dim(test)                                         # check dimension of test set
## [1]  8 12

Note: this data set is used only as an example; it is too small to be suitable for real machine learning tasks.

10.2 caTools

There are numerous approaches to data partitioning. The caTools package provides the sample.split() function for this (see the sketch after the example below); the more complete createDataPartition() function belongs to the caret package and is covered in the next subsection. The example here simply draws a random 75% of the row indices with base R:

library(caTools)                                  # load `caTools` package
data(mtcars)                                      # use data from R environment
set.seed(123)                                     # to ensure we generate the same data
smp_size <- floor(0.75 * nrow(mtcars))            # 75% of the sample size
train_ind <- sample(seq_len(nrow(mtcars)), 
                    size = smp_size)
train <- mtcars[train_ind, ]                      # assign the train set
test <- mtcars[-train_ind, ]                      # assign the test set
dim(train)                                        # check dimension of train set
## [1] 24 11
dim(test)                                         # check dimension of test set
## [1]  8 11
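
For reference, here is a sketch of the same split using caTools’ own sample.split() function, which returns a logical vector you can use to subset the data:

library(caTools)                                  # load `caTools` package
set.seed(123)                                     # to ensure we generate the same split
split <- sample.split(mtcars$mpg, SplitRatio = 0.75)  # TRUE marks rows for the train set
train <- subset(mtcars, split == TRUE)            # assign the train set
test  <- subset(mtcars, split == FALSE)           # assign the test set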

10.3 caret

Another powerful package that you can use to split a dataset is caret.

library(caret)                                    # load `caret` package
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
data(mtcars)                                      # use data from R environment
set.seed(123)                                     # to ensure we generate the same data
intrain<-createDataPartition(mtcars$mpg,
                             p=0.75,list=FALSE)
train<-mtcars[intrain,]
test<-mtcars[-intrain,]
dim(train)                                        # check dimension of train set
## [1] 25 11
dim(test)                                         # check dimension of test set
## [1]  7 11
