Introduction

slides available at: https://dario.shinyapps.io/presentationFall2016slides/

This is an introduction todplyr, a powerful tool for data manipulation. This library offers a more intuitive way of writing code by using the pipe %>% operator (presented in the next section).

Before starting, make sure you can load dplyr:

library(dplyr)

Pipe operator %>%

The pipe operator %>% from the package magrittr enables to “chain” functions together, rather than “nesting” them. You don’t have to use the %>% in dplyr, however it arguably makes your syntax much more readable.

The %>% passes the returned value from one function as first element to the next function. You can think at %>% as a then between two statements. For instance, the statments:

Generate 20 random observations from a normal distribution then sum them up then round the sum to the 1st decimal become:

library(dplyr)

By default %>% passes the returned value to the first position, but you can actually use the . to call the returned value in any positions you want.

For example, the . becomes useful in statements such as:

Create a vector from [1:10] then bind it to a vector from [a:j]

#bind it to the vector a:j

#bind it as 2nd column

#bind it as 2nd AND 3rd columns

Data Manipulation using dplyr

The dplyr package is a powerful tool to manipulate your data. Use the function tbl_df() to convert a dataset into a local data table

# read this csv https://dds.cct.lsu.edu/ddslab/lab_projects/website/pdf"

Note that the object is not (only) a data frame of class data.frame. In fact, dplyr uses a more efficient data structure that is also particularly convenient for printing your dataset

Explore your dataset

The base package offers a number of useful functions to investigate the nature of our dataset (e.g., str). However, str() does not work very well on object of class tbl; to explore objects of this family use the function glimpse() instead.

The base package gives you two more functions to explore your dataset, namely head() and tail(). However, when dealing with large and/or sorted dataset, using head and tail might be misleading. In dplyr you can use the functions sample_n and sample_frac to print random samples of your dataset:

#print a random sample of 5 observations

Be careful: sample from base and sample_n from dplyr are two different functions that perform different tasks. Run ?sample() for more information.

Manipulating data

Dplyr offers a number of verbs that make data manipulation relatively easy.

Select

Use select() to print out specific columns, e.g. year, month and day . Remember that - like any functions in R - select() creates a copy of your object. If you want to save your output, you need to assign it to a variable using the <- or -> operator.

#select year, month, day

#save if to an object (e.g., subsetFlights)

#print a summary of the new object

You can use the - operator to exclude some columns.

#select all columns from nycflights except year,month and day

You can select ranges of columns with the : operator. For example, all the columns between dep_time and arr_delay

#select all columns from dep_time to arr_delay

You can also use select() in combination with some helper functions such as:

  • starts_with(“X”), every name that starts with “X”
  • ends_with(“X”), every name that ends with “X”
  • contains(“X”), every name that contains “X”
  • matches(“X”), every name that matches “X”, where “X” can be a regular expression
  • num_range(“x”, 1:5), the variables named x01, x02, x03, x04 and x05
  • one_of(x), every name that appears in x, which should be a character vector

For example, we can retrieve all columns regarding dealys

Mutate and transmute

Use mutate() to create new columns but preserving the original

#Create a new column named "flightDur" which contains the difference between arrival (arr_time) and departure time (dep_time)

To drop the original columns, use instead transmute()

#Create two new columns flightDur (=arr_time-dep_time) and overallDelay (=dep_delay+arr_delay)

Filter and slice

Use filter() to conditionally retrieve rows. For example, the following code fetches all the flight with dest equal to MIA and ArrTime greater than 1500. You could perform the same task in base using subset()

#filter all flights with dest MIA and arr_time later than 1000

To filter based on line number, use the function slice() (equivalent to positional indexing)

#filter the first 3 lines 

Arrange

arrange() sorts the dataset depending on one or more variable. Use desc() for descending ordering. You can achieve a similar result in base by using the functions sort() or order()

#arrange by carrier (from Z to A) and dep_time (from low to high)

Summarise

summerise() summarises multiple values to a single value.

#Use summarise() to print min and max distance

This function becomes extremly powerful when used in combination with group_by().

#Use summarise() to print min and max distance grouping by dest

Data Retrieval: connection to mysql DB

Dplyr also offers its own streamline sintax for connecting to a database.

my_db <- src_mysql(
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  user = "guest",
  password = "guest"
)
src_tbls(my_db)
my_db %>% tbl("City") %>% head(5)

Moreover, it can be used to compile sql code using join functions such as:

  a. Mutating joins, i.e. `left_join()`, `right_join()`, `inner_join()` and `full_join()`
  
  b. Filtering join, i.e. `semi_join()` and `anti_join()`

For each in table “City”, print out the corresponding country name from table “Country”

Find which language is spoken across the highest number of countries in Europe (count only if the language is officially recognised as a national language).

Follow these two steps:

  1. Join the appropriate tables and save the dataset into an object named ‘topLanguage’
  2. Use n() to count records and rearrange to display the most “popular” languages