One of the most fundamental skills for a Data Scientist is Data Preparation (Data Manipulation). To be really effective, you need to be fluent in the essential data manipulations, because a very large proportion of your work will involve getting and cleaning data.
Surveys suggest that data scientists spend about 60% of their time cleaning and organizing data, with collecting data sets second at 19%, meaning around 80% of their time goes to preparing and managing data for analysis.
In this section, you will learn how to easily perform Data Preparation using R. We’ll cover some fundamental data manipulation verbs that you will use most frequently in your job.
read_csv() Import data (you can use other import functions).
str() Examine the data structure.
apply() Check for and replace missing values.
select() Choose which columns to include.
filter() Select a defined subset of the data.
arrange() Sort the data, by size for continuous variables, by date, or alphabetically.
rename() Rename column(s).
mutate() Create new column(s) in the data, or change existing column(s).
bind_rows() Merge two data frames into one, combining data from columns with the same name.
group_by() Group the data by a categorical variable.
summarize() Summarize, or aggregate (for each group if following group_by()). Often used in conjunction with functions including:
mean() Calculate the mean, or average.
median() Calculate the median.
max() Find the maximum value.
min() Find the minimum value.
sum() Add all the values together.
n() Count the number of records.

I recommend installing the tidyverse packages, because the core tidyverse includes the packages that you’re likely to use in everyday data analyses.
install.packages("tidyverse")
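To give a quick feel for how the verbs listed above chain together with the pipe operator %>%, here is a minimal sketch using the built-in mtcars data (not one of the data sets used in this section):

library(dplyr) # dplyr is part of the tidyverse

mtcars %>%
  filter(cyl == 4) %>% # keep only four-cylinder cars
  select(mpg, wt, gear) %>% # keep three columns
  arrange(desc(mpg)) %>% # sort by fuel economy, descending
  head(3) # show the top three rows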
We’ll mainly work with two incredibly useful packages developed by Hadley Wickham, chief scientist at RStudio: readr (for importing data) and dplyr (for manipulating it).
The data we will use for this section are pfizer.csv and fda.csv; please download them and place them on your desktop. Optionally, you can load data into the current R session by selecting Import Dataset > From Text File... in the Environment tab. In this case, however, we will use the read_csv() function from the readr package. Copy the following code into your script and Run:
suppressPackageStartupMessages(library(tidyverse)) # silently load the tidyverse
#setwd("C:/Users/Bakti/Desktop/") # remember to set your working directory
pfizer <- read_csv("pfizer.csv") # load the `pfizer` data
fda <- read_csv("fda.csv") # load the `fda` data
## -- Column specification --------------------------------------------------------
## cols(
## org_indiv = col_character(),
## first_plus = col_character(),
## first_name = col_character(),
## last_name = col_character(),
## city = col_character(),
## state = col_character(),
## category = col_character(),
## cash = col_double(),
## other = col_double(),
## total = col_double()
## )
##
## -- Column specification --------------------------------------------------------
## cols(
## name_last = col_character(),
## name_first = col_character(),
## name_middle = col_character(),
## issued = col_date(format = ""),
## office = col_character()
## )
Notice that you’ll need a strong understanding of the basic data types and data structures, and how to operate on them. The str() function will tell you more about the columns in your data, including their data type. Copy this code into your script and Run:
str(pfizer) # check the structure of the `pfizer` data
## tibble [10,087 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ org_indiv : chr [1:10087] "3-D MEDICAL SERVICES LLC" "AA DOCTORS, INC." "ABBO, LILIAN MARGARITA" "ABBO, LILIAN MARGARITA" ...
## $ first_plus: chr [1:10087] "STEVEN BRUCE" "AAKASH MOHAN" "LILIAN MARGARITA" "LILIAN MARGARITA" ...
## $ first_name: chr [1:10087] "STEVEN" "AAKASH" "LILIAN" "LILIAN" ...
## $ last_name : chr [1:10087] "DEITELZWEIG" "AHUJA" "ABBO" "ABBO" ...
## $ city : chr [1:10087] "NEW ORLEANS" "PASO ROBLES" "MIAMI" "MIAMI" ...
## $ state : chr [1:10087] "LA" "CA" "FL" "FL" ...
## $ category : chr [1:10087] "Professional Advising" "Expert-Led Forums" "Business Related Travel" "Meals" ...
## $ cash : num [1:10087] 2625 1000 0 0 1800 ...
## $ other : num [1:10087] 0 0 448 119 0 0 47 0 0 396 ...
## $ total : num [1:10087] 2625 1000 448 119 1800 ...
## - attr(*, "spec")=
## .. cols(
## .. org_indiv = col_character(),
## .. first_plus = col_character(),
## .. first_name = col_character(),
## .. last_name = col_character(),
## .. city = col_character(),
## .. state = col_character(),
## .. category = col_character(),
## .. cash = col_double(),
## .. other = col_double(),
## .. total = col_double()
## .. )
str(fda) # check the structure of the `fda` data
## tibble [272 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name_last : chr [1:272] "ADELGLASS" "ADKINSON" "ALLEN" "AMSTERDAM" ...
## $ name_first : chr [1:272] "JEFFREY" "N." "MARK" "DANIEL" ...
## $ name_middle: chr [1:272] "M." "FRANKLIN" "S." NA ...
## $ issued : Date[1:272], format: "1999-05-25" "2000-04-19" ...
## $ office : chr [1:272] "Center for Drug Evaluation and Research" "Center for Biologics Evaluation and Research" "Center for Devices and Radiological Health" "Center for Biologics Evaluation and Research" ...
## - attr(*, "spec")=
## .. cols(
## .. name_last = col_character(),
## .. name_first = col_character(),
## .. name_middle = col_character(),
## .. issued = col_date(format = ""),
## .. office = col_character()
## .. )
It is very important to understand these structures, because they are the objects you will manipulate on a day-to-day basis in R. If you need to change the data type of any column, use the following functions:
as.character() converts to a text string.
as.numeric() converts to a number.
as.factor() converts to a categorical variable.
as.integer() converts to an integer.
as.Date() converts to a date.
as.POSIXct() converts to a full date and time.

For instance, add the following code to your script to convert the total column in the pfizer data to a numeric variable (which would allow it to hold decimal values if we had any).
pfizer$total <- as.numeric(pfizer$total) # convert total to numeric variable
str(pfizer$total) # let's check the data structure again## num [1:10087] 2625 1000 448 119 1800 ...
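The other conversion functions work the same way. As an illustration (not part of the original exercise), here are a few conversions on columns that exist in these data sets; none are assigned back, so pfizer and fda are left unchanged:

str(as.factor(pfizer$category)) # character -> categorical (factor)
str(as.integer(pfizer$total)) # numeric -> integer (any decimals would be dropped)
str(as.Date("2010-03-31")) # text string -> date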
Unlike typical programming, when working with real-life data, you may have missing values: measurements that were simply not recorded/stored/etc. R has rather sophisticated mechanisms to deal with missing values. It distinguishes between the following types:
NA: Not Available entries (NA values have a class too, so there are integer NAs, character NAs, etc.)
NaN: Not a Number (a NaN value is also NA, but the converse is not true)

Find the missing values in the columns of the pfizer data frame:
is.na(pfizer) # classic way to check for NAs
sum(is.na(pfizer)) # count the NAs
apply(is.na(pfizer), 2, which) # which row indexes contain NAs (data frames only)
which(complete.cases(pfizer)) # identify the fully observed (complete) rows

A more general mechanism is to remove the missing values manually:

clean.vector <- na.omit(pfizer$first_name) # remove NAs from a vector
clean.df <- na.omit(pfizer) # remove rows with NAs from a data frame
apply(is.na(clean.df), 2, which) # confirm that no missing values remain## integer(0)
We could also replace missing values with the mean (or the median). A good practice is to first store the replacement values, here the column means, in a separate variable; once created, we can use them to fill in the missing values. Let’s load the data and check which columns contain missing data.
PATH <- "https://raw.githubusercontent.com/Bakti-Siregar/dataset/master/Bookdown-Data-Science-for-Beginners/Missing_Values.csv"
titanic <- read.csv(PATH, sep = ",")
list_na <- colnames(titanic)[ apply(titanic, 2, anyNA) ]
list_na## [1] "Age" "Fare"
In this case we do not remove the missing values; instead we use apply() to compute the mean of each column that contains NAs. We need to pass the argument na.rm = TRUE: because the columns have missing data, this tells R to ignore the NAs when computing the mean.
average_missing <- apply(titanic[,colnames(titanic) %in% list_na],
2,
mean,
na.rm = TRUE)
average_missing## Age Fare
## 30.27259 35.62719
Code Explanation: We pass four arguments to apply().

titanic[, colnames(titanic) %in% list_na]: the data frame, restricted to the columns named in list_na (i.e. "Age" and "Fare").
2: the margin; apply the function over columns.
mean: the function to compute.
na.rm = TRUE: ignore the missing values.

Next, we can replace the NA values. The verb mutate() from the dplyr package is useful for creating new variables. We don’t necessarily want to change the original columns, so we can create new variables without the NAs. mutate() is easy to use: we just choose a variable name and define how to compute it. Here is the complete code:
titanic_replace <- titanic %>%
mutate(age = ifelse(is.na(Age), average_missing[1], Age),
fare = ifelse(is.na(Fare), average_missing[2], Fare))
sum(is.na(titanic_replace$Age)) # NAs in the original `Age` column## [1] 86
sum(is.na(titanic_replace$Fare)) # NAs in the original `Fare` column## [1] 1
sum(is.na(titanic_replace$age)) # NAs in the new `age` column## [1] 0
sum(is.na(titanic_replace$fare)) # NAs in the new `fare` column## [1] 0
The original column Age has 86 missing values, while the newly created variable age has them replaced with the mean of Age. You can try replacing the missing observations with the median yourself; a sketch follows.
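A minimal sketch of the median version, following the same pattern (the object names median_missing and titanic_impute_median are just suggestions):

median_missing <- apply(titanic[, colnames(titanic) %in% list_na],
                        2,
                        median, # use the median instead of the mean
                        na.rm = TRUE)
titanic_impute_median <- titanic %>%
  mutate(age = ifelse(is.na(Age), median_missing[1], Age),
         fare = ifelse(is.na(Fare), median_missing[2], Fare))
sum(is.na(titanic_impute_median$age)) # should be 0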
In this part, you will learn how to select or subset data frame columns by name and position using the select() function from the dplyr package. You will learn how to use the following functions:
pull() Extract column values as a vector. The column of interest can be specified either by name or by index.
select() Extract one or multiple columns as a data table. It can also be used to remove columns from the data frame.
select_if() Select columns based on a particular condition; for example, select columns only if they are numeric.
starts_with(), ends_with(), contains(), matches() Select columns/variables based on their names.

library(tidyverse) # load `tidyverse`, which includes `dplyr`
pfizer %>% pull(state) %>% head() # extract column values of `state` as a vector
pfizer %>% select(1:3) # select columns 1 to 3
pfizer %>% select(1,3) # select column 1 and 3 but not 2
pfizer %>% select(state:total) # select all columns from `state` to `total`
pfizer %>% select(state,total) # select columns by variable names
pfizer %>% select_if(is.numeric) # select only numeric columns
pfizer %>% select_if(is.character) # select only character columns
pfizer %>% select(starts_with("first")) # select columns whose names start with `first`
pfizer %>% select(ends_with("name")) # select columns whose names end with `name`
pfizer %>% select(contains("rst")) # select columns whose names contain `rst`
pfizer %>% select(matches("_")) # select columns whose names match a regular expression
pfizer %>% select(-(state:total)) # remove all columns from `state` to `total`
pfizer %>% select(-state, -total) # remove columns `state` and `total`

Now we will filter() and arrange() the data in specific ways. For each of the following examples, copy the code that follows into your script, and view the results. Notice how we create a new object to hold the processed data.
Find doctors in California paid $10,000 or more by Pfizer to run “Professional Advising”!
ca_expert_10000 <- pfizer %>% # take the `pfizer` data
  filter(state == "CA" & # filter by `state`
         total >= 10000 & # and by `total` greater than or equal to 10000
         category == "Professional Advising") # and by `category`
ca_expert_10000 # print the result## # A tibble: 35 x 10
## org_indiv first_plus first_name last_name city state category cash other
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 CUMMINGS~ JEFFREY L~ JEFFREY CUMMINGS LOS ~ CA Profess~ 13500 0
## 2 FONAROW,~ GREGG CUR~ GREGG FONAROW LOS ~ CA Profess~ 15750 0
## 3 GANZ, PE~ PETER PETER GANZ SAN ~ CA Profess~ 22000 0
## 4 GT AND A~ GEORGE GEORGE TRIADAFI~ SAN ~ CA Profess~ 33705 0
## 5 HANAHAN,~ DOUGLAS DOUGLAS HANAHAN SAN ~ CA Profess~ 37500 0
## 6 HERBERT ~ HERBERT HERBERT CRUZ FRES~ CA Profess~ 13250 0
## 7 HU, RONA~ RONA JANE RONA HU PALO~ CA Profess~ 22500 0
## 8 KAPLOWIT~ NEIL NEIL KAPLOWITZ LOS ~ CA Profess~ 12000 0
## 9 KIZER, K~ KENNETH W~ KENNETH KIZER ALIS~ CA Profess~ 10000 0
## 10 KOO, EDW~ EDWARD HA~ EDWARD MANG LA J~ CA Profess~ 12500 0
## # ... with 25 more rows, and 1 more variable: total <dbl>
Now add a sort to the end of the code to list the doctors in descending order by the payments received!
ca_expert_10000 <- pfizer %>% # take the `pfizer` data
  filter(state == "CA" & # filter by `state`: California
         total >= 10000 & # and by `total` greater than or equal to 10000
         category == "Professional Advising") %>% # and by `category`
  arrange(desc(total)) # sort in descending order by the payments received
ca_expert_10000 # print the result## # A tibble: 35 x 10
## org_indiv first_plus first_name last_name city state category cash other
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 MALENKA,~ ROBERT CH~ ROBERT MALENKA STAN~ CA Profess~ 75566 0
## 2 REGENTS ~ DAVID RAY~ DAVID GANDARA IRVI~ CA Profess~ 38500 0
## 3 PTACEK, ~ LOUIS JOHN LOUIS PTACEK SAN ~ CA Profess~ 37588 0
## 4 HANAHAN,~ DOUGLAS DOUGLAS HANAHAN SAN ~ CA Profess~ 37500 0
## 5 REGENTS ~ JOHN POWER JOHN KANE IRVI~ CA Profess~ 37500 0
## 6 RISCH, S~ SAMUEL CR~ SAMUEL RISCH SAN ~ CA Profess~ 36000 0
## 7 GT AND A~ GEORGE GEORGE TRIADAFI~ SAN ~ CA Profess~ 33705 0
## 8 SANDROCK~ CHRISTIAN~ CHRISTIAN SANDROCK SACR~ CA Profess~ 33000 0
## 9 MATTHEW ~ MATTHEW J~ MATTHEW BUDOFF MANH~ CA Profess~ 31500 0
## 10 OLEFSKY,~ JERROLD M~ JERROLD OLEFSKY SOLA~ CA Profess~ 27746 0
## # ... with 25 more rows, and 1 more variable: total <dbl>
Find doctors in California or New York who were paid $10,000 or more by Pfizer to run “Professional Advising”!
Notice that in this case we use the | Boolean operator, and the brackets around that part of the query. This ensures that this part of the query is run first. See what happens if you exclude them (a sketch of the unbracketed version follows the output below).
ca_ny_expert_10000 <- pfizer %>% # take the `pfizer` data
  filter((state == "CA" | state == "NY") & # filter by `state`: California or New York
         total >= 10000 & # and by `total` greater than or equal to 10000
         category == "Professional Advising") %>% # and by `category`
  arrange(desc(total)) # sort in descending order by the payments received
ca_ny_expert_10000 # print the result## # A tibble: 46 x 10
## org_indiv first_plus first_name last_name city state category cash other
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 SAWYERS,~ CHARLES L~ CHARLES SAWYERS NEW ~ NY Profess~ 100000 0
## 2 MALENKA,~ ROBERT CH~ ROBERT MALENKA STAN~ CA Profess~ 75566 0
## 3 REGENTS ~ DAVID RAY~ DAVID GANDARA IRVI~ CA Profess~ 38500 0
## 4 PTACEK, ~ LOUIS JOHN LOUIS PTACEK SAN ~ CA Profess~ 37588 0
## 5 HANAHAN,~ DOUGLAS DOUGLAS HANAHAN SAN ~ CA Profess~ 37500 0
## 6 JAVITT, ~ DANIEL CO~ DANIEL JAVITT ORAN~ NY Profess~ 37500 0
## 7 REGENTS ~ JOHN POWER JOHN KANE IRVI~ CA Profess~ 37500 0
## 8 RISCH, S~ SAMUEL CR~ SAMUEL RISCH SAN ~ CA Profess~ 36000 0
## 9 GT AND A~ GEORGE GEORGE TRIADAFI~ SAN ~ CA Profess~ 33705 0
## 10 SANDROCK~ CHRISTIAN~ CHRISTIAN SANDROCK SACR~ CA Profess~ 33000 0
## # ... with 36 more rows, and 1 more variable: total <dbl>
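To see why the brackets matter, here is a sketch (not part of the original exercise) of the same filter without them. Because & binds more tightly than | in R, the condition becomes "in California, or (in New York and paid $10,000 or more for Professional Advising)", so every California payment is returned regardless of amount or category:

no_brackets <- pfizer %>%
  filter(state == "CA" | state == "NY" & # without the brackets, `&` is evaluated before `|`
         total >= 10000 &
         category == "Professional Advising")
nrow(no_brackets) # far more rows than the 46 above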
Find doctors in states other than California who were paid $10,000 or more by Pfizer to run “Professional Advising”!
not_ca_expert_10000 <- pfizer %>% # take the `pfizer` data
  filter(state != "CA" & # filter by `state`: other than California
         total >= 10000 & # and by `total` greater than or equal to 10000
         category == "Professional Advising") %>% # and by `category`
  arrange(desc(total)) # sort in descending order by the payments received
not_ca_expert_10000 # print the result## # A tibble: 135 x 10
## org_indiv first_plus first_name last_name city state category cash other
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 BAILES, ~ JOSEPH SW~ JOSEPH BAILES AUST~ TX Profess~ 105000 0
## 2 SAWYERS,~ CHARLES L~ CHARLES SAWYERS NEW ~ NY Profess~ 100000 0
## 3 MARDER, ~ HAROLD KAY HAROLD MARDER PHIL~ PA Profess~ 87610 0
## 4 WILLIAMS~ GRANT GRANT WILLIAMS WAYNE PA Profess~ 85621 0
## 5 THE SCHE~ PHILIP SA~ PHILIP SCHEIN BRYN~ PA Profess~ 75609 0
## 6 GORDON, ~ JEFFREY I~ JEFFREY GORDON SAIN~ MO Profess~ 75108 0
## 7 KOLLEF, ~ MARIN HRI~ MARIN KOLLEF SAIN~ MO Profess~ 72750 0
## 8 JOSEPH, ~ WARREN S WARREN JOSEPH HUNT~ PA Profess~ 71875 0
## 9 SHORR, A~ ANDREW FR~ ANDREW SHORR WASH~ DC Profess~ 62000 0
## 10 MAYO FOU~ PETER JAM~ PETER DYCK ROCH~ MN Profess~ 61175 0
## # ... with 125 more rows, and 1 more variable: total <dbl>
Find the 20 doctors across the four largest states (CA, TX, FL, NY) who were paid the most for “Expert-Led Forums”!
ca_ny_tx_fl_prof_top20 <- pfizer %>%
filter((state == "CA" |
state == "NY" |
state == "TX" |
state == "FL") &
category == "Expert-Led Forums") %>%
arrange(desc(total)) %>%
head(20)
ca_ny_tx_fl_prof_top20## # A tibble: 20 x 10
## org_indiv first_plus first_name last_name city state category cash other
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 SACKS, G~ GERALD MI~ GERALD SACKS SANT~ CA Expert-~ 146500 0
## 2 NIDES, M~ MITCHELL MITCHELL NIDES LOS ~ CA Expert-~ 70500 0
## 3 ROBERT B~ ROBERT BU~ ROBERT NETT SAN ~ TX Expert-~ 60750 0
## 4 SOLERA C~ STEVEN AB~ STEVEN KAPLAN CHAP~ NY Expert-~ 56500 0
## 5 GRIFFIN,~ JAMES DALE JAMES GRIFFIN DALL~ TX Expert-~ 54250 0
## 6 STUBBLEF~ MICHAEL D MICHAEL STUBBLEF~ NEW ~ NY Expert-~ 50500 0
## 7 POTKIN, ~ STEVEN GA~ STEVEN POTKIN ORAN~ CA Expert-~ 48350 0
## 8 GINSBERG~ DAVID ALAN DAVID GINSBERG LOS ~ CA Expert-~ 45750 0
## 9 LOUIE, S~ SAMUEL SAMUEL LOUIE SACR~ CA Expert-~ 41250 0
## 10 INSTITUT~ GURKIPAL GURKIPAL SINGH WOOD~ CA Expert-~ 40000 0
## 11 NEURONLI~ JEFFREY B~ JEFFREY GELBLUM MIAMI FL Expert-~ 39200 0
## 12 HENNEKEN~ CHARLES H CHARLES HENNEKENS BOCA~ FL Expert-~ 33250 0
## 13 SOBHY EL~ SAAD A SAAD SOBHY SYRA~ NY Expert-~ 31500 0
## 14 COHEN, L~ LEE STEVEN LEE COHEN HAST~ NY Expert-~ 30500 0
## 15 BAROYA, ~ IVAN STEP~ IVAN BAROYA BONI~ CA Expert-~ 26400 0
## 16 JONES, S~ STEPHEN E~ STEPHEN JONES DALL~ TX Expert-~ 26250 0
## 17 CRANE, D~ DANIEL LI~ DANIEL CRANE NEW ~ NY Expert-~ 25500 0
## 18 AXONZ INC STANLEY J STANLEY KROLCZYK WESL~ FL Expert-~ 24750 0
## 19 MATTHEW ~ MATTHEW J~ MATTHEW BUDOFF MANH~ CA Expert-~ 24000 0
## 20 QUANG H ~ QUANG H QUANG NGUYEN LA J~ CA Expert-~ 22500 0
## # ... with 1 more variable: total <dbl>
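As a side note (not in the original), the chain of | conditions for the four states can be written more concisely with the %in% operator; this sketch gives the same result:

ca_ny_tx_fl_prof_top20 <- pfizer %>%
  filter(state %in% c("CA", "NY", "TX", "FL") & # equivalent to the four `|` conditions above
         category == "Expert-Led Forums") %>%
  arrange(desc(total)) %>%
  head(20)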
Filter the data pfizer for all payments for running “Expert-Led Forums” or for “Professional Advising”, and arrange alphabetically by doctor (first name, then last name)
expert_professional_advice <- pfizer %>%
filter(category == "Expert-Led Forums" |
category == "Professional Advising") %>%
arrange(first_name,last_name)
expert_professional_advice## # A tibble: 4,382 x 10
## org_indiv first_plus first_name last_name city state category cash other
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 THE REGE~ A MARK A FENDRICK ANN ~ MI Profess~ 12000 0
## 2 AA DOCTO~ AAKASH MO~ AAKASH AHUJA PASO~ CA Expert-~ 1000 0
## 3 BLOM, AA~ AARON S AARON BLOM PHIL~ PA Expert-~ 1000 0
## 4 GROTAS, ~ AARON BRE~ AARON GROTAS BROO~ NY Expert-~ 3000 0
## 5 MILSTONE~ AARON PAUL AARON MILSTONE BREN~ TN Expert-~ 8500 0
## 6 NGUYEN, ~ AARON LINH AARON NGUYEN POMO~ CA Expert-~ 1000 0
## 7 SPITZ EN~ AARON AARON SPITZ LAGU~ CA Expert-~ 750 0
## 8 VINIK, A~ AARON I AARON VINIK NORF~ VA Profess~ 1000 0
## 9 WAXMAN, ~ AARON B AARON WAXMAN BOST~ MA Profess~ 2250 0
## 10 SHAH, AA~ AASHIT KA~ AASHIT SHAH DETR~ MI Expert-~ 3000 0
## # ... with 4,372 more rows, and 1 more variable: total <dbl>
In this part, you will learn how to rename the columns of a data frame in R. Next, you will learn how to compute and add new variables to a data frame in R. You will learn the following R functions from the dplyr R package:
rename() Rename the columns of a data frame.
mutate() Compute and add new variables into a data table. It preserves existing variables.
transmute() Compute new columns but drop existing variables.

library(tidyverse)
# Renaming columns of the data `pfizer` with R base functions:
names(pfizer)[names(pfizer) == "org_indiv"] <- "rename1"
names(pfizer)[1] <- "rename2"
names(pfizer) <- c("rename3", # rename all of the columns at once
"rename4",
"first_name",
"last_name",
"city",
"state",
"category",
"cash",
"other",
"total")
# Renaming columns of the data `pfizer` with `dplyr::rename()`:
pfizer %>%
rename(
org_indiv = rename3 ,
first_plus = rename4
)## # A tibble: 10,087 x 10
## org_indiv first_plus first_name last_name city state category cash other
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 3-D MEDI~ STEVEN BR~ STEVEN DEITELZW~ NEW ~ LA Profess~ 2625 0
## 2 AA DOCTO~ AAKASH MO~ AAKASH AHUJA PASO~ CA Expert-~ 1000 0
## 3 ABBO, LI~ LILIAN MA~ LILIAN ABBO MIAMI FL Busines~ 0 448
## 4 ABBO, LI~ LILIAN MA~ LILIAN ABBO MIAMI FL Meals 0 119
## 5 ABBO, LI~ LILIAN MA~ LILIAN ABBO MIAMI FL Profess~ 1800 0
## 6 ABDULLAH~ ABDULLAH ABDULLAH RAFFEE FLINT MI Expert-~ 750 0
## 7 ABEBE, S~ SHEILA Y SHEILA ABEBE INDI~ IN Educati~ 0 47
## 8 ABEBE, S~ SHEILA Y SHEILA ABEBE INDI~ IN Expert-~ 825 0
## 9 ABILENE ~ GALEN CHR~ GALEN ALBRITTON ABIL~ TX Profess~ 3000 0
## 10 ABOLNIK,~ IGOR Z IGOR ABOLNIK PROVO UT Busines~ 0 396
## # ... with 10,077 more rows, and 1 more variable: total <dbl>
# Add a new column (`year`) to the `fda` data while preserving the existing columns:
letters_year <- fda %>%
mutate(year = format(issued, "%Y")) %>%
group_by(year)
letters_year## # A tibble: 272 x 6
## # Groups: year [15]
## name_last name_first name_middle issued office year
## <chr> <chr> <chr> <date> <chr> <chr>
## 1 ADELGLASS JEFFREY M. 1999-05-25 Center for Drug Evaluatio~ 1999
## 2 ADKINSON N. FRANKLIN 2000-04-19 Center for Biologics Eval~ 2000
## 3 ALLEN MARK S. 2002-01-28 Center for Devices and Ra~ 2002
## 4 AMSTERDAM DANIEL <NA> 2004-11-17 Center for Biologics Eval~ 2004
## 5 AMSTUTZ HARLAN C. 2004-07-19 Center for Devices and Ra~ 2004
## 6 ANDERSON C. JOSEPH 2000-02-25 Center for Devices and Ra~ 2000
## 7 ANDREWS DAVID W. 2000-07-19 Center for Biologics Eval~ 2000
## 8 AQEL RAED <NA> 2002-10-30 Center for Devices and Ra~ 2002
## 9 ARROWSMITH PETER N. 2004-01-21 Center for Devices and Ra~ 2004
## 10 BARR JOHN D. 2000-01-14 Center for Devices and Ra~ 2000
## # ... with 262 more rows
# Add new columns (`year` and `last_name`) while dropping the existing columns of `fda`:
fda %>%
transmute(
year = format(issued, "%Y"),
last_name = name_last
)## # A tibble: 272 x 2
## year last_name
## <chr> <chr>
## 1 1999 ADELGLASS
## 2 2000 ADKINSON
## 3 2002 ALLEN
## 4 2004 AMSTERDAM
## 5 2004 AMSTUTZ
## 6 2000 ANDERSON
## 7 2000 ANDREWS
## 8 2002 AQEL
## 9 2004 ARROWSMITH
## 10 2000 BARR
## # ... with 262 more rows
There are also a number of join functions in dplyr to combine data from two data frames. Here are the most useful:
inner_join() Returns values from both tables only where there is a match.
left_join() Returns all the values from the first-mentioned table, plus those from the second table that match.
semi_join() Filters the first-mentioned table to include only values that have matches in the second table.
anti_join() Filters the first-mentioned table to include only values that have no matches in the second table.

To illustrate, these joins will find doctors paid by Pfizer to run “Expert-Led Forums” who had also received a warning letter from the FDA:
expert_warned_inner <- inner_join(pfizer, fda,
by=c("first_name" = "name_first",
"last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums")
expert_warned_semi <- semi_join(pfizer, fda,
by=c("first_name" = "name_first",
"last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums")

The code in by=c() defines how the join should be made. If instructions on how to join the tables are not supplied, dplyr will look for columns with matching names and perform the join based on those. The difference between the two joins above is that the first contains all of the columns from both data frames, while the second keeps only the columns from the pfizer data frame.
In practice, you may wish to inner_join and then use dplyr’s select function to select the columns that you want to retain, for example:
expert_warned <- inner_join(pfizer, fda,
by=c("first_name" = "name_first",
"last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums") %>%
select(last_name,
city,
state,
total,
issued)
expert_warned <- inner_join(pfizer, fda,
by=c("first_name" = "name_first",
"last_name" = "name_last")) %>%
filter(category=="Expert-Led Forums") %>%
select(2:5,10,12)

The dplyr documentation on joins is a useful reference for managing joins with dplyr.
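As a further illustration (not part of the original exercise), anti_join() flips the logic of the semi_join() above: it keeps only the Expert-Led Forums payments to doctors who do not appear in the FDA warning-letter data:

expert_not_warned <- anti_join(pfizer, fda,
                               by=c("first_name" = "name_first",
                                    "last_name" = "name_last")) %>%
  filter(category=="Expert-Led Forums")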
This part introduces how to easily compute statistical summaries in R using the dplyr package. You will learn how to use:

summarise() Compute summary statistics, typically after a group_by().
summarise_all() Apply summary functions to every column in the data frame.
summarise_at() Apply summary functions to specific columns selected with a character vector.
summarise_if() Apply summary functions to columns selected with a predicate function that returns TRUE. (A short sketch of summarise_if() appears after the grouped summaries below.)

Calculate the total payments of the pfizer data, by state, in descending order!
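The code for this step is not shown in the original; a minimal sketch that would produce the message below (the object name state_total is just a suggestion):

state_total <- pfizer %>%
  group_by(state) %>% # group the payments by state
  summarize(sum = sum(total)) %>% # total payments per state
  arrange(desc(sum)) # largest first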
## `summarise()` ungrouping output (override with `.groups` argument)
Calculate some additional summary statistics of the pfizer data, by state, in descending order!
state_summary <- pfizer %>%
group_by(state) %>%
summarize(sum = sum(total),
average = mean(total),
median = median(total),
min = min(total),
max = max(total),
count = n()) %>%
arrange(desc(sum))
state_summary## # A tibble: 52 x 7
## state sum average median min max count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 CA 4737807 4042. 886. 25 323122 1172
## 2 TX 2802196 4491. 802 25 479314 624
## 3 FL 2564047 3474. 669 25 1185466 738
## 4 PA 2484505 4661. 669 25 672263 533
## 5 NC 2328435 6079. 777 0 857698 383
## 6 NY 2065042 2479. 750 25 100000 833
## 7 MA 1764771 6814. 1250 29 617548 259
## 8 IL 1256825 2805. 661 25 160984 448
## 9 MI 1146285 3352. 750 26 152466 342
## 10 OH 1019450 2207. 685 25 107292 462
## # ... with 42 more rows
Group and summarize the pfizer data for multiple categories, in ascending order!
state_summary <- pfizer %>%
group_by(state, category) %>%
summarize(sum = sum(total),
average = mean(total),
median = median(total),
min = min(total),
max = max(total),
count = n()) %>%
arrange(state, category)
state_summary## # A tibble: 319 x 8
## # Groups: state [52]
## state category sum average median min max count
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 AK Expert-Led Forums 1750 1.75e3 1.75e3 1750 1750 1
## 2 AL Business Related Travel 29123 9.39e2 3.31e2 39 8041 31
## 3 AL Educational Items 431 6.16e1 6.70e1 47 78 7
## 4 AL Expert-Led Forums 129850 4.06e3 2.00e3 550 26250 32
## 5 AL Investigator-Initiated Resear~ 55937 5.59e4 5.59e4 55937 55937 1
## 6 AL Meals 10217 2.69e2 1.86e2 26 1385 38
## 7 AL Pfizer Sponsored Research ini~ 279485 1.40e5 1.40e5 46494 232991 2
## 8 AL Pfizer Sponsored Research ini~ 66096 6.61e4 6.61e4 66096 66096 1
## 9 AL Professional Advising 110560 5.53e3 2.00e3 702 35000 20
## 10 AR Business Related Travel 3783 2.70e2 1.41e2 27 1017 14
## # ... with 309 more rows
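The scoped variants summarise_all(), summarise_at() and summarise_if() listed earlier are not demonstrated in the original. Here is a minimal sketch of summarise_if(), which applies a summary function to every column that passes a test (in this case, every numeric column of pfizer):

pfizer %>%
  group_by(state) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) # mean of `cash`, `other` and `total` per state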
Filter the data fda for letters sent from the start of 2005 onwards and summarize by year!
year_summary <- fda %>%
filter(issued >= "2005-01-01") %>%
arrange(issued) %>%
mutate(year = format(issued, "%Y")) %>%
group_by(year) %>%
summarize(letters=n())
year_summary## # A tibble: 6 x 2
## year letters
## <chr> <int>
## 1 2005 31
## 2 2006 25
## 3 2007 11
## 4 2008 23
## 5 2009 25
## 6 2010 5
Beware: reusing variable names may lead to unexpected results. Also, R may print a message like summarise() ungrouping output (override with .groups argument); this is informational rather than an error, so don’t worry about it. A sketch of the .groups argument follows.
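A minimal sketch (not in the original) of the .groups argument, which silences the message and controls the grouping of the result:

state_summary <- pfizer %>%
  group_by(state, category) %>%
  summarize(sum = sum(total), .groups = "drop") # drop all grouping; no message is printed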
As I mentioned in the R Programming chapter, you can write your own function to split data into training and testing sets. Here, we will learn how to use some packages to split the data instead. This is an important part of a data scientist’s work, especially when you apply machine learning to analyze data: we usually need to split the dataset into a train set and a test set. The train set allows the algorithm to learn from the data; to test the performance of our model, we then use the test set to return the performance measure. So, let’s look at some packages that you can use to split data:
You can use dplyr for this; it makes the task very simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating the sets but also for traceability during your project. Add one if your data doesn’t already contain it.
library(dplyr) # load `dplyr` package
data(mtcars) # use data from R environment
set.seed(123) # to ensure we generate the same data
mtcars$id <- 1:nrow(mtcars) # add the `id` if doesn't contain already
train <- mtcars %>% dplyr::sample_frac(.75) # assign the train set (75% of the rows)
test <- dplyr::anti_join(mtcars, train, by = 'id') # assign the test set (the remaining rows)
dim(train) # check dimension of train set## [1] 24 12
dim(test) # check dimension of test set## [1] 8 12
Note: this data set is just for the example; it is not really suitable for machine learning tasks because it is too small.
There are numerous approaches to data partitioning. The example below draws a random 75% of the row indices with base R’s sample() function; the caTools package also provides sample.split() for the same purpose (a sketch of it follows this example’s output). For a more complete approach, take a look at the createDataPartition() function in the caret package, used in the next example.
library(caTools) # load `caTools` package
data(mtcars) # use data from R environment
set.seed(123) # to ensure we generate the same data
smp_size <- floor(0.75 * nrow(mtcars)) # 75% of the sample size
train_ind <- sample(seq_len(nrow(mtcars)),
size = smp_size)
train <- mtcars[train_ind, ] # assign the train set
test <- mtcars[-train_ind, ] # assign the test set
dim(train) # check dimension of train set## [1] 24 11
dim(test) # check dimension of test set## [1] 8 11
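For reference, here is a minimal sketch (not in the original) of the same 75/25 split performed with sample.split() from caTools:

library(caTools) # provides sample.split()
data(mtcars)
set.seed(123)
split <- sample.split(mtcars$mpg, SplitRatio = 0.75) # logical vector: TRUE marks the train rows
train <- subset(mtcars, split == TRUE) # assign the train set
test <- subset(mtcars, split == FALSE) # assign the test set
dim(train)
dim(test)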
Another powerful package that you can use to split a dataset is caret.
library(caret) # load `caret` package (the messages below come from attaching it)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
data(mtcars) # use data from R environment
set.seed(123) # to ensure we generate the same data
intrain<-createDataPartition(mtcars$mpg,
p=0.75,list=FALSE)
train<-mtcars[intrain,]
test<-mtcars[-intrain,]
dim(train) # check dimension of train set## [1] 25 11
dim(test) # check dimension of test set## [1] 7 11