What is dplyr?

dplyr is a powerful library for manipulating data frames in r programming providing a consistent set of verbs that help you solve the most common data manipulation problems.

Functionalities

• mutate() adds new variables that are functions of existing variables 
• select() picks variables based on their names. 
• filter() picks cases based on their values. 
• summarise() reduces multiple values down to a single summary. 
• arrange() changes the ordering of the rows.

Introduction

When you’re working with data, you need to figure out what you want to do, describe those tasks in the form of a computer program, and run the program. The dplyr package makes these steps quick and easy:

• By limiting your options, it helps you think through your data manipulation challenges.
• It provides simple "verbs," functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
• It uses efficient backends so you spend less time waiting on the computer.

Getting Started

Dplyr

The first step is to install the dplyr package to load the package into memory

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data

To explore the functionalities of dplyr, we will use the data come from data.gov or download the all project here. This dataset provides information regarding the total approved actual expenses incurred by Montgomery County government employees traveling non-locally (over 75 miles from the County’s Executive Office Building at 101 Monroe St. Rockville, MD) for official business, beginning on or after August 12, 2015. The dataset includes the name of traveling employee; the employee’s home department; travel start and end dates; destination; purpose of travel; and actual total expenses funded by the County. Update Frequency: Monthly

Load data

library(readr)
employee <- read_csv("Employee_Travel_Data__Non-Local_.csv")
## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   Department = col_character(),
##   `Employee  ` = col_character(),
##   `Travel Start Date` = col_date(format = ""),
##   `Travel End Date` = col_date(format = ""),
##   `Destination(s)` = col_character(),
##   `Purpose Of Travel` = col_character(),
##   `Actual Total Expenses` = col_double()
## )

Overvier of data.

head(employee, n=10)
## # A tibble: 10 x 7
##    Department `Employee  ` `Travel Start D… `Travel End Dat… `Destination(s)`
##    <chr>      <chr>        <date>           <date>           <chr>           
##  1 Police     TONY GALLAD… 2017-05-22       2017-05-26       CHARLOTTE, NC ; 
##  2 Police     GREG WOODMAN 2017-05-22       2017-05-26       CHARLOTTE, NC ; 
##  3 Police     THERESA DUR… 2017-05-09       2017-05-13       SAN ANTONIO, TX…
##  4 Board of … Jerry Quars… 2017-06-07       2017-06-08       Ocean City, MD ;
##  5 Fire and … JEFFREY STA… 2017-03-19       2017-03-22       RALEIGH, NC. ;  
##  6 Police     BRIAN KUYKE… 2017-04-23       2017-04-26       GOLDSBORO, NC ; 
##  7 Police     THOMAS MULL… 2017-06-04       2017-06-08       GRAPEVINE, TX ; 
##  8 Police     MATTHEW LYN… 2017-06-04       2017-06-08       CHARLOTTE, NC ; 
##  9 County Ex… Claudia Can… 2015-08-12       2015-08-15       Ocean City, MD ;
## 10 County Ex… Joy Nurmi    2015-08-12       2015-08-15       Ocean City, MD ;
## # … with 2 more variables: `Purpose Of Travel` <chr>, `Actual Total
## #   Expenses` <dbl>

Preparing data for analysis

Based on the data we’ve chosen, we’re just assuming to do a few things before analysing. If you can observe there is same columns (Travel Start Date, Travel End Date, etc) which have composite name, we can rename them. And we will want to split Travel Start Date to allow us more analysis.

Renaming columns with rename()

##Rename column in r
employee <- employee %>% 
  rename(
   travel_start_date = `Travel Start Date`,
   travel_start_end = `Travel End Date`,
   purpose_Of_ravel = `Purpose Of Travel`,
   actual_total_xpenses = `Actual Total Expenses`
  )
head(employee, n=10)
## # A tibble: 10 x 7
##    Department `Employee  ` travel_start_da… travel_start_end `Destination(s)`
##    <chr>      <chr>        <date>           <date>           <chr>           
##  1 Police     TONY GALLAD… 2017-05-22       2017-05-26       CHARLOTTE, NC ; 
##  2 Police     GREG WOODMAN 2017-05-22       2017-05-26       CHARLOTTE, NC ; 
##  3 Police     THERESA DUR… 2017-05-09       2017-05-13       SAN ANTONIO, TX…
##  4 Board of … Jerry Quars… 2017-06-07       2017-06-08       Ocean City, MD ;
##  5 Fire and … JEFFREY STA… 2017-03-19       2017-03-22       RALEIGH, NC. ;  
##  6 Police     BRIAN KUYKE… 2017-04-23       2017-04-26       GOLDSBORO, NC ; 
##  7 Police     THOMAS MULL… 2017-06-04       2017-06-08       GRAPEVINE, TX ; 
##  8 Police     MATTHEW LYN… 2017-06-04       2017-06-08       CHARLOTTE, NC ; 
##  9 County Ex… Claudia Can… 2015-08-12       2015-08-15       Ocean City, MD ;
## 10 County Ex… Joy Nurmi    2015-08-12       2015-08-15       Ocean City, MD ;
## # … with 2 more variables: purpose_Of_ravel <chr>, actual_total_xpenses <dbl>

Split date and use mutate()

But before we split the date, we suppose to add a new library that works with dplyr. He calls Lubridate. How important is it to do that same time that you may need to do your analysis based on the year, month, or day in this case you are assuming to split the date.

## Split Travel Start Date 
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
employee <- employee %>% mutate(year_s_d = year(travel_start_date), 
                month_s_d = month(travel_start_date), 
                day_s_d = day(travel_start_date))
head(employee, n=10)
## # A tibble: 10 x 10
##    Department `Employee  ` travel_start_da… travel_start_end `Destination(s)`
##    <chr>      <chr>        <date>           <date>           <chr>           
##  1 Police     TONY GALLAD… 2017-05-22       2017-05-26       CHARLOTTE, NC ; 
##  2 Police     GREG WOODMAN 2017-05-22       2017-05-26       CHARLOTTE, NC ; 
##  3 Police     THERESA DUR… 2017-05-09       2017-05-13       SAN ANTONIO, TX…
##  4 Board of … Jerry Quars… 2017-06-07       2017-06-08       Ocean City, MD ;
##  5 Fire and … JEFFREY STA… 2017-03-19       2017-03-22       RALEIGH, NC. ;  
##  6 Police     BRIAN KUYKE… 2017-04-23       2017-04-26       GOLDSBORO, NC ; 
##  7 Police     THOMAS MULL… 2017-06-04       2017-06-08       GRAPEVINE, TX ; 
##  8 Police     MATTHEW LYN… 2017-06-04       2017-06-08       CHARLOTTE, NC ; 
##  9 County Ex… Claudia Can… 2015-08-12       2015-08-15       Ocean City, MD ;
## 10 County Ex… Joy Nurmi    2015-08-12       2015-08-15       Ocean City, MD ;
## # … with 5 more variables: purpose_Of_ravel <chr>, actual_total_xpenses <dbl>,
## #   year_s_d <dbl>, month_s_d <dbl>, day_s_d <int>

Select same columns with select()

You often work with large datasets that have many columns, but only a few are really of interest to you. select () allows you to quickly zoom in on a useful subset using operations that usually only work on numeric variable positions. In this case you have only destination, purpose of travel date which we split and actual total expenses. You can look also is possible to rename the column using select() function.

employee <- employee %>%  select(Department, 
                      destination = `Destination(s)`, 
                      purpose_Of_ravel, 
                     year_s_d,
                     month_s_d,
                    day_s_d,
                    actual_total_xpenses
)
head(employee, n=10)
## # A tibble: 10 x 7
##    Department destination purpose_Of_ravel year_s_d month_s_d day_s_d
##    <chr>      <chr>       <chr>               <dbl>     <dbl>   <int>
##  1 Police     CHARLOTTE,… Other                2017         5      22
##  2 Police     CHARLOTTE,… Other                2017         5      22
##  3 Police     SAN ANTONI… Training / Lice…     2017         5       9
##  4 Board of … Ocean City… Conference / Se…     2017         6       7
##  5 Fire and … RALEIGH, N… Conference / Se…     2017         3      19
##  6 Police     GOLDSBORO,… Training / Lice…     2017         4      23
##  7 Police     GRAPEVINE,… Conference / Se…     2017         6       4
##  8 Police     CHARLOTTE,… Conference / Se…     2017         6       4
##  9 County Ex… Ocean City… Conference / Se…     2015         8      12
## 10 County Ex… Ocean City… Conference / Se…     2015         8      12
## # … with 1 more variable: actual_total_xpenses <dbl>

Exploratory data

After preparing our data, we can use same advance functions of dplyr. Exploratory data analysis help us to analyzing and visualizing data to better understand and gain insight into the data.

Which kind of question we can answer according to the data we have:

• Departments which spend a lot of money for traveling
• The currents destination based on mean of money spending
• Make a grouping by year to know the amount max spent
• .etc

We will use :

• summarize() 
• group_by() 
• arrange()

The summerise () function without group_by () makes no sense. It creates summary statistics by group. The dplyr library automatically applies a function to the group you passed in the group_by verb. We use desc() to sort a variable in descending order.

Analysis and plotting

10 Departments which spend a lot of money for traveling

gr_employee_sum <- employee %>% group_by(Department)%>%
  summarize( actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE)) %>%
  arrange(desc(actual_total_xpenses))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_employee_sum, n=10)
## # A tibble: 10 x 2
##    Department                                 actual_total_xpenses
##    <chr>                                                     <dbl>
##  1 Police                                                 1439501.
##  2 Fire and Rescue Service                                 589790.
##  3 Technology Services                                     291995.
##  4 Transportation                                          203351.
##  5 Finance (include Risk Management)                       164420.
##  6 MC Employees Retirement Plans                           123766.
##  7 General Services                                         94760.
##  8 Permitting Services                                      91874.
##  9 Emergency Management and Homeland Security               75868.
## 10 Health and Human Services                                57846.

Plot

library(ggplot2)
ggplot(gr_employee_sum, aes(x=reorder(Department,actual_total_xpenses), y=(actual_total_xpenses))) + geom_bar(stat="Identity", fill="#4AA02C",color="darkred")+
  xlab("Actual Total Expenses") + ylab("Frequency")+
  ggtitle("Actual Total Expenses by Department")+
  theme(axis.text.x=element_text(angle=90, hjust=1))

10 currents destinations based of mean of money spending

gr_destination_mean <- employee %>% group_by(destination)%>%
  summarise( actual_total_xpenses = mean(actual_total_xpenses, na.rm = TRUE)) %>%
  arrange(desc(actual_total_xpenses)) %>%
  rename(
    mean_expenses = `actual_total_xpenses`
  )
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_destination_mean)
## # A tibble: 6 x 2
##   destination                       mean_expenses
##   <chr>                                     <dbl>
## 1 BOSTON, AK ;                              9974.
## 2 ST. CROIX, WY ;                           9891.
## 3 ST CROIX & ST THOMAS,  ;                  9422.
## 4 Cambride, MA ;                            7986.
## 5 TEL AVIV AND JERUSALEM, ISRAEL. ;         5323.
## 6 LANSING, MI. ;                            5200.

Make a grouping by year to know the max amount spent

gr_year_max <- employee %>% group_by(year_s_d)%>%
  summarise( actual_total_xpenses = max(actual_total_xpenses, na.rm = TRUE)) %>%
  arrange(desc(year_s_d)) %>%
  rename(
    max_amount_expenses = `actual_total_xpenses`
)
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_year_max)
## # A tibble: 6 x 2
##   year_s_d max_amount_expenses
##      <dbl>               <dbl>
## 1     2020               1748.
## 2     2019              10387.
## 3     2018               9974.
## 4     2017              10093.
## 5     2016               9728.
## 6     2015               4605.

Plot

ggplot(gr_year_max, aes(x="", y=max_amount_expenses, fill=year_s_d)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  theme_void()

## Use filter to get a subset for a specific year

filter () allows you to subsets of observations based on their values. The first argument is the name of the data block. The second and subsequent arguments are the expressions that filter the data block.

##get only subset for 2015
employee2015 <- employee %>% filter(year_s_d == 2015)
head(employee2015)
## # A tibble: 6 x 7
##   Department destination purpose_Of_ravel year_s_d month_s_d day_s_d
##   <chr>      <chr>       <chr>               <dbl>     <dbl>   <int>
## 1 County Ex… Ocean City… Conference / Se…     2015         8      12
## 2 County Ex… Ocean City… Conference / Se…     2015         8      12
## 3 Technolog… FL, Orland… Conference / Se…     2015        10       4
## 4 Technolog… Ocean City… Training / Lice…     2015        10      28
## 5 Technolog… Ocean City… Training / Lice…     2015        10      28
## 6 Technolog… Ocean City… Training / Lice…     2015        10      28
## # … with 1 more variable: actual_total_xpenses <dbl>

Now can answer some question like

• Money spending by month in specific year
• Money spending by day for specific month in specific year
• Money spending by purpose of travel in specific year
• .etc

Money spending by month in 2015

gr_employee_sum_2015 <- employee2015 %>% group_by(month_s_d)%>%
summarize( actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE)) %>%
arrange(desc(month_s_d))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_employee_sum_2015)
## # A tibble: 6 x 2
##   month_s_d actual_total_xpenses
##       <dbl>                <dbl>
## 1        12               23962.
## 2        11               30499.
## 3        10               73929.
## 4         9               37855.
## 5         8               59394.
## 6         7                4642.

Money spending by day for specific month

employee2015_day <- employee2015 %>%
filter(month_s_d == 7) %>%
group_by(day_s_d) %>%
summarise(actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_employee_sum_2015)
## # A tibble: 6 x 2
##   month_s_d actual_total_xpenses
##       <dbl>                <dbl>
## 1        12               23962.
## 2        11               30499.
## 3        10               73929.
## 4         9               37855.
## 5         8               59394.
## 6         7                4642.

Plot

pie(employee2015_day$actual_total_xpenses)

ggplot(data = employee2015_day, aes(x = "", y = actual_total_xpenses, fill = day_s_d)) + 
  geom_bar(stat = "identity") + 
  coord_polar("y")

Money spending by purpose of travel in specific year

gr_purpose_sum_2015 <- employee2015 %>% group_by(purpose_Of_ravel)%>%
  summarise( actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE)) %>%
  arrange(desc(purpose_Of_ravel))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_purpose_sum_2015, n=10)
## # A tibble: 7 x 2
##   purpose_Of_ravel                     actual_total_xpenses
##   <chr>                                               <dbl>
## 1 Vendor / Product evaluation                         8559.
## 2 Training / Licensing / Certification               24061.
## 3 Other                                               8201.
## 4 Conference / Seminar speaker                       19539.
## 5 Conference / Seminar exhibitor                      9029.
## 6 Conference / Seminar attendee                     213242.
## 7 Business / Economic development                     1476.

Plot

gr_purpose_sum_2015  %>%
ggplot(aes(x = purpose_Of_ravel, y = actual_total_xpenses, fill = purpose_Of_ravel)) +
  geom_bar(stat = "identity") +
  theme_classic() +
  labs(
    x = "Purpose Of Travel",
    y = "Actual Total Epenses",
    title = paste(
      "Purpose Of Travel by actual total expenses 2015"
    )
  )+
  theme(axis.text.x=element_text(angle=45, hjust=1))

# Conclusion

The goal of our tutorial was to explore different aspects of dplyr. Show the power of this library in exploratory data analysis. We haven’t covered all aspects of dplyr but the one we describe shows the gist of this library. Thank you, your contribution will be welcome.