dplyr is a powerful library for manipulating data frames in r programming providing a consistent set of verbs that help you solve the most common data manipulation problems.
• mutate() adds new variables that are functions of existing variables
• select() picks variables based on their names.
• filter() picks cases based on their values.
• summarise() reduces multiple values down to a single summary.
• arrange() changes the ordering of the rows.
When you’re working with data, you need to figure out what you want to do, describe those tasks in the form of a computer program, and run the program. The dplyr package makes these steps quick and easy:
• By limiting your options, it helps you think through your data manipulation challenges.
• It provides simple "verbs," functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
• It uses efficient backends so you spend less time waiting on the computer.
The first step is to install the dplyr package to load the package into memory
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
To explore the functionalities of dplyr, we will use the data come from data.gov or download the all project here. This dataset provides information regarding the total approved actual expenses incurred by Montgomery County government employees traveling non-locally (over 75 miles from the County’s Executive Office Building at 101 Monroe St. Rockville, MD) for official business, beginning on or after August 12, 2015. The dataset includes the name of traveling employee; the employee’s home department; travel start and end dates; destination; purpose of travel; and actual total expenses funded by the County. Update Frequency: Monthly
library(readr)
employee <- read_csv("Employee_Travel_Data__Non-Local_.csv")
##
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## Department = col_character(),
## `Employee ` = col_character(),
## `Travel Start Date` = col_date(format = ""),
## `Travel End Date` = col_date(format = ""),
## `Destination(s)` = col_character(),
## `Purpose Of Travel` = col_character(),
## `Actual Total Expenses` = col_double()
## )
Overvier of data.
head(employee, n=10)
## # A tibble: 10 x 7
## Department `Employee ` `Travel Start D… `Travel End Dat… `Destination(s)`
## <chr> <chr> <date> <date> <chr>
## 1 Police TONY GALLAD… 2017-05-22 2017-05-26 CHARLOTTE, NC ;
## 2 Police GREG WOODMAN 2017-05-22 2017-05-26 CHARLOTTE, NC ;
## 3 Police THERESA DUR… 2017-05-09 2017-05-13 SAN ANTONIO, TX…
## 4 Board of … Jerry Quars… 2017-06-07 2017-06-08 Ocean City, MD ;
## 5 Fire and … JEFFREY STA… 2017-03-19 2017-03-22 RALEIGH, NC. ;
## 6 Police BRIAN KUYKE… 2017-04-23 2017-04-26 GOLDSBORO, NC ;
## 7 Police THOMAS MULL… 2017-06-04 2017-06-08 GRAPEVINE, TX ;
## 8 Police MATTHEW LYN… 2017-06-04 2017-06-08 CHARLOTTE, NC ;
## 9 County Ex… Claudia Can… 2015-08-12 2015-08-15 Ocean City, MD ;
## 10 County Ex… Joy Nurmi 2015-08-12 2015-08-15 Ocean City, MD ;
## # … with 2 more variables: `Purpose Of Travel` <chr>, `Actual Total
## # Expenses` <dbl>
Based on the data we’ve chosen, we’re just assuming to do a few things before analysing. If you can observe there is same columns (Travel Start Date, Travel End Date, etc) which have composite name, we can rename them. And we will want to split Travel Start Date to allow us more analysis.
##Rename column in r
employee <- employee %>%
rename(
travel_start_date = `Travel Start Date`,
travel_start_end = `Travel End Date`,
purpose_Of_ravel = `Purpose Of Travel`,
actual_total_xpenses = `Actual Total Expenses`
)
head(employee, n=10)
## # A tibble: 10 x 7
## Department `Employee ` travel_start_da… travel_start_end `Destination(s)`
## <chr> <chr> <date> <date> <chr>
## 1 Police TONY GALLAD… 2017-05-22 2017-05-26 CHARLOTTE, NC ;
## 2 Police GREG WOODMAN 2017-05-22 2017-05-26 CHARLOTTE, NC ;
## 3 Police THERESA DUR… 2017-05-09 2017-05-13 SAN ANTONIO, TX…
## 4 Board of … Jerry Quars… 2017-06-07 2017-06-08 Ocean City, MD ;
## 5 Fire and … JEFFREY STA… 2017-03-19 2017-03-22 RALEIGH, NC. ;
## 6 Police BRIAN KUYKE… 2017-04-23 2017-04-26 GOLDSBORO, NC ;
## 7 Police THOMAS MULL… 2017-06-04 2017-06-08 GRAPEVINE, TX ;
## 8 Police MATTHEW LYN… 2017-06-04 2017-06-08 CHARLOTTE, NC ;
## 9 County Ex… Claudia Can… 2015-08-12 2015-08-15 Ocean City, MD ;
## 10 County Ex… Joy Nurmi 2015-08-12 2015-08-15 Ocean City, MD ;
## # … with 2 more variables: purpose_Of_ravel <chr>, actual_total_xpenses <dbl>
But before we split the date, we suppose to add a new library that works with dplyr. He calls Lubridate. How important is it to do that same time that you may need to do your analysis based on the year, month, or day in this case you are assuming to split the date.
## Split Travel Start Date
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
employee <- employee %>% mutate(year_s_d = year(travel_start_date),
month_s_d = month(travel_start_date),
day_s_d = day(travel_start_date))
head(employee, n=10)
## # A tibble: 10 x 10
## Department `Employee ` travel_start_da… travel_start_end `Destination(s)`
## <chr> <chr> <date> <date> <chr>
## 1 Police TONY GALLAD… 2017-05-22 2017-05-26 CHARLOTTE, NC ;
## 2 Police GREG WOODMAN 2017-05-22 2017-05-26 CHARLOTTE, NC ;
## 3 Police THERESA DUR… 2017-05-09 2017-05-13 SAN ANTONIO, TX…
## 4 Board of … Jerry Quars… 2017-06-07 2017-06-08 Ocean City, MD ;
## 5 Fire and … JEFFREY STA… 2017-03-19 2017-03-22 RALEIGH, NC. ;
## 6 Police BRIAN KUYKE… 2017-04-23 2017-04-26 GOLDSBORO, NC ;
## 7 Police THOMAS MULL… 2017-06-04 2017-06-08 GRAPEVINE, TX ;
## 8 Police MATTHEW LYN… 2017-06-04 2017-06-08 CHARLOTTE, NC ;
## 9 County Ex… Claudia Can… 2015-08-12 2015-08-15 Ocean City, MD ;
## 10 County Ex… Joy Nurmi 2015-08-12 2015-08-15 Ocean City, MD ;
## # … with 5 more variables: purpose_Of_ravel <chr>, actual_total_xpenses <dbl>,
## # year_s_d <dbl>, month_s_d <dbl>, day_s_d <int>
You often work with large datasets that have many columns, but only a few are really of interest to you. select () allows you to quickly zoom in on a useful subset using operations that usually only work on numeric variable positions. In this case you have only destination, purpose of travel date which we split and actual total expenses. You can look also is possible to rename the column using select() function.
employee <- employee %>% select(Department,
destination = `Destination(s)`,
purpose_Of_ravel,
year_s_d,
month_s_d,
day_s_d,
actual_total_xpenses
)
head(employee, n=10)
## # A tibble: 10 x 7
## Department destination purpose_Of_ravel year_s_d month_s_d day_s_d
## <chr> <chr> <chr> <dbl> <dbl> <int>
## 1 Police CHARLOTTE,… Other 2017 5 22
## 2 Police CHARLOTTE,… Other 2017 5 22
## 3 Police SAN ANTONI… Training / Lice… 2017 5 9
## 4 Board of … Ocean City… Conference / Se… 2017 6 7
## 5 Fire and … RALEIGH, N… Conference / Se… 2017 3 19
## 6 Police GOLDSBORO,… Training / Lice… 2017 4 23
## 7 Police GRAPEVINE,… Conference / Se… 2017 6 4
## 8 Police CHARLOTTE,… Conference / Se… 2017 6 4
## 9 County Ex… Ocean City… Conference / Se… 2015 8 12
## 10 County Ex… Ocean City… Conference / Se… 2015 8 12
## # … with 1 more variable: actual_total_xpenses <dbl>
After preparing our data, we can use same advance functions of dplyr. Exploratory data analysis help us to analyzing and visualizing data to better understand and gain insight into the data.
Which kind of question we can answer according to the data we have:
• Departments which spend a lot of money for traveling
• The currents destination based on mean of money spending
• Make a grouping by year to know the amount max spent
• .etc
We will use :
• summarize()
• group_by()
• arrange()
The summerise () function without group_by () makes no sense. It creates summary statistics by group. The dplyr library automatically applies a function to the group you passed in the group_by verb. We use desc() to sort a variable in descending order.
gr_employee_sum <- employee %>% group_by(Department)%>%
summarize( actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE)) %>%
arrange(desc(actual_total_xpenses))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_employee_sum, n=10)
## # A tibble: 10 x 2
## Department actual_total_xpenses
## <chr> <dbl>
## 1 Police 1439501.
## 2 Fire and Rescue Service 589790.
## 3 Technology Services 291995.
## 4 Transportation 203351.
## 5 Finance (include Risk Management) 164420.
## 6 MC Employees Retirement Plans 123766.
## 7 General Services 94760.
## 8 Permitting Services 91874.
## 9 Emergency Management and Homeland Security 75868.
## 10 Health and Human Services 57846.
Plot
library(ggplot2)
ggplot(gr_employee_sum, aes(x=reorder(Department,actual_total_xpenses), y=(actual_total_xpenses))) + geom_bar(stat="Identity", fill="#4AA02C",color="darkred")+
xlab("Actual Total Expenses") + ylab("Frequency")+
ggtitle("Actual Total Expenses by Department")+
theme(axis.text.x=element_text(angle=90, hjust=1))
gr_destination_mean <- employee %>% group_by(destination)%>%
summarise( actual_total_xpenses = mean(actual_total_xpenses, na.rm = TRUE)) %>%
arrange(desc(actual_total_xpenses)) %>%
rename(
mean_expenses = `actual_total_xpenses`
)
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_destination_mean)
## # A tibble: 6 x 2
## destination mean_expenses
## <chr> <dbl>
## 1 BOSTON, AK ; 9974.
## 2 ST. CROIX, WY ; 9891.
## 3 ST CROIX & ST THOMAS, ; 9422.
## 4 Cambride, MA ; 7986.
## 5 TEL AVIV AND JERUSALEM, ISRAEL. ; 5323.
## 6 LANSING, MI. ; 5200.
gr_year_max <- employee %>% group_by(year_s_d)%>%
summarise( actual_total_xpenses = max(actual_total_xpenses, na.rm = TRUE)) %>%
arrange(desc(year_s_d)) %>%
rename(
max_amount_expenses = `actual_total_xpenses`
)
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_year_max)
## # A tibble: 6 x 2
## year_s_d max_amount_expenses
## <dbl> <dbl>
## 1 2020 1748.
## 2 2019 10387.
## 3 2018 9974.
## 4 2017 10093.
## 5 2016 9728.
## 6 2015 4605.
Plot
ggplot(gr_year_max, aes(x="", y=max_amount_expenses, fill=year_s_d)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +
theme_void()
## Use filter to get a subset for a specific year
filter () allows you to subsets of observations based on their values. The first argument is the name of the data block. The second and subsequent arguments are the expressions that filter the data block.
##get only subset for 2015
employee2015 <- employee %>% filter(year_s_d == 2015)
head(employee2015)
## # A tibble: 6 x 7
## Department destination purpose_Of_ravel year_s_d month_s_d day_s_d
## <chr> <chr> <chr> <dbl> <dbl> <int>
## 1 County Ex… Ocean City… Conference / Se… 2015 8 12
## 2 County Ex… Ocean City… Conference / Se… 2015 8 12
## 3 Technolog… FL, Orland… Conference / Se… 2015 10 4
## 4 Technolog… Ocean City… Training / Lice… 2015 10 28
## 5 Technolog… Ocean City… Training / Lice… 2015 10 28
## 6 Technolog… Ocean City… Training / Lice… 2015 10 28
## # … with 1 more variable: actual_total_xpenses <dbl>
Now can answer some question like
• Money spending by month in specific year
• Money spending by day for specific month in specific year
• Money spending by purpose of travel in specific year
• .etc
gr_employee_sum_2015 <- employee2015 %>% group_by(month_s_d)%>%
summarize( actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE)) %>%
arrange(desc(month_s_d))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_employee_sum_2015)
## # A tibble: 6 x 2
## month_s_d actual_total_xpenses
## <dbl> <dbl>
## 1 12 23962.
## 2 11 30499.
## 3 10 73929.
## 4 9 37855.
## 5 8 59394.
## 6 7 4642.
employee2015_day <- employee2015 %>%
filter(month_s_d == 7) %>%
group_by(day_s_d) %>%
summarise(actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_employee_sum_2015)
## # A tibble: 6 x 2
## month_s_d actual_total_xpenses
## <dbl> <dbl>
## 1 12 23962.
## 2 11 30499.
## 3 10 73929.
## 4 9 37855.
## 5 8 59394.
## 6 7 4642.
Plot
pie(employee2015_day$actual_total_xpenses)
ggplot(data = employee2015_day, aes(x = "", y = actual_total_xpenses, fill = day_s_d)) +
geom_bar(stat = "identity") +
coord_polar("y")
gr_purpose_sum_2015 <- employee2015 %>% group_by(purpose_Of_ravel)%>%
summarise( actual_total_xpenses = sum(actual_total_xpenses, na.rm = TRUE)) %>%
arrange(desc(purpose_Of_ravel))
## `summarise()` ungrouping output (override with `.groups` argument)
head(gr_purpose_sum_2015, n=10)
## # A tibble: 7 x 2
## purpose_Of_ravel actual_total_xpenses
## <chr> <dbl>
## 1 Vendor / Product evaluation 8559.
## 2 Training / Licensing / Certification 24061.
## 3 Other 8201.
## 4 Conference / Seminar speaker 19539.
## 5 Conference / Seminar exhibitor 9029.
## 6 Conference / Seminar attendee 213242.
## 7 Business / Economic development 1476.
Plot
gr_purpose_sum_2015 %>%
ggplot(aes(x = purpose_Of_ravel, y = actual_total_xpenses, fill = purpose_Of_ravel)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(
x = "Purpose Of Travel",
y = "Actual Total Epenses",
title = paste(
"Purpose Of Travel by actual total expenses 2015"
)
)+
theme(axis.text.x=element_text(angle=45, hjust=1))
# Conclusion
The goal of our tutorial was to explore different aspects of dplyr. Show the power of this library in exploratory data analysis. We haven’t covered all aspects of dplyr but the one we describe shows the gist of this library. Thank you, your contribution will be welcome.