Managing Data Frames with Package

#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.1
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Important Verbs in

There are seven (five) key dplyr verbs that you need to learn.

Some of these functions may clash with others, if you receive such an error, use the following, ie. for select function:

dplyr::select()

Example Happiness Levels

Download the dataset from the following link:

https://www.kaggle.com/unsdsn/world-happiness

You can see some basic characteristics of the dataset with the dim() and str() functions.

happiness2019 <- readxl::read_excel("C:/Users/User/Documents/2019.xlsx")

dim(happiness2019)
## [1] 156   9
str(happiness2019)
## tibble [156 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Overall rank                : num [1:156] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Country or region           : chr [1:156] "Finland" "Denmark" "Norway" "Iceland" ...
##  $ Score                       : num [1:156] 7.77 7.6 7.55 7.49 7.49 ...
##  $ GDP per capita              : num [1:156] 1.34 1.38 1.49 1.38 1.4 ...
##  $ Social support              : num [1:156] 1.59 1.57 1.58 1.62 1.52 ...
##  $ Healthy life expectancy     : num [1:156] 0.986 0.996 1.028 1.026 0.999 ...
##  $ Freedom to make life choices: num [1:156] 0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
##  $ Generosity                  : num [1:156] 0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
##  $ Perceptions of corruption   : num [1:156] 0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...

The function

In some cases, you might have several variables, too many maybe, and you may only need to focus on a subset of these variables.

The function can be used to select columns of a data frame that you want to focus on or similarly omitting columns you don’t need.

subset <- select(happiness2019, 1)
head(subset,2)
subset <- select(happiness2019, 1:3)
head(subset,2)
subset <- select(happiness2019, c(1,3,5))
head(subset,2)
subset <- select(happiness2019, -1)
head(subset,2)

You can also select columns based on specific criteria with:

starts_with() = Select columns that start with a character string ends_with() = Select columns that end with a character string contains() = Select columns that contain a character string matches() = Select columns that match a regular expression one_of() = Select columns names that are from a group of names

subset <- select(happiness2019, ends_with("2"))
str(subset)
## tibble [156 x 0] (S3: tbl_df/tbl/data.frame)
##  Named list()
subset <- select(happiness2019, starts_with("G"))
str(subset)
## tibble [156 x 2] (S3: tbl_df/tbl/data.frame)
##  $ GDP per capita: num [1:156] 1.34 1.38 1.49 1.38 1.4 ...
##  $ Generosity    : num [1:156] 0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...

The function

Suppose we wanted to extract the rows of the happiness data frame where the levels of happiness (Score) are greater than 7, we could do:

str(happiness2019)
## tibble [156 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Overall rank                : num [1:156] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Country or region           : chr [1:156] "Finland" "Denmark" "Norway" "Iceland" ...
##  $ Score                       : num [1:156] 7.77 7.6 7.55 7.49 7.49 ...
##  $ GDP per capita              : num [1:156] 1.34 1.38 1.49 1.38 1.4 ...
##  $ Social support              : num [1:156] 1.59 1.57 1.58 1.62 1.52 ...
##  $ Healthy life expectancy     : num [1:156] 0.986 0.996 1.028 1.026 0.999 ...
##  $ Freedom to make life choices: num [1:156] 0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
##  $ Generosity                  : num [1:156] 0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
##  $ Perceptions of corruption   : num [1:156] 0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...
subset <- filter(happiness2019, Score>7)
head(subset)

We can place an arbitrarily complex logical sequence inside of filter(), say filtering observations where Score is greater than 7 and GDP.per.capita >1.5:

subset <- filter(happiness2019, Score > 7 & `GDP per capita` > 1.5)
head(subset)

The function

Here we can order the rows of the data frame by GDP.per.capita, so that the first row is the least developed observation and the last row is the most developed observation. Currently the data is ordered according to Happiness Score.

developed <- arrange(happiness2019, `GDP per capita`)
head(developed)
tail(developed)

The function

Renaming variable names can be difficult in R. For example, you may want to rename GDP.per.capita as gdppercapita, all in small letters and no dots in between.

happiness2019re <- rename(happiness2019, gdppercapita = `GDP per capita`)
head(happiness2019re)

The function

The mutate() function exists to compute transformations of variables in a data frame. Often, you want to create new variables that are derived from existing variables and mutate() provides a clean interface for doing that.

Note: mutate() is order aware. So you can chain multiple mutates in a single call.

For example, you may want to standardize GDP.per.capita or apply a log transformation, or create a column for years:

happiness2019log <- mutate(happiness2019, loggdppercapita = log10(`GDP per capita`))
head(happiness2019log)
happiness2019 <- mutate(happiness2019, year = 2019)
head(happiness2019)
# Boolean, logical and conditional operators all work well with mutate() too.

happiness2019 <-  mutate(happiness2019, happy = ifelse(Score > 5.5, "happy", "nothappy"))
head(happiness2019,2)

The Pipeline Operator

The pipeline operator strings together multiple dplyr functions in a sequence.

third(second(first(x)))

Instead of creating difficult to read and manage functions, we will use the pipeline:

first(x) %>% second %>% third

Or similarly

x %>% first %>% second %>% third

Let us assume we wanted to select gdppercapita column, then filter those countries with less than 1.5, and log transform the values:

select(happiness2019, `GDP per capita`) %>%
  mutate(loggdp = log10(`GDP per capita`)) %>%
    filter(`GDP per capita`>1.5)

This is same as follows

happiness2019 %>%
 select(`GDP per capita`) %>%
  mutate(loggdp = log10(`GDP per capita`)) %>%
    filter(`GDP per capita`>1.5)

The function

The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable.

happiness2019 %>% 
  group_by(happy) %>% 
  summarise(mean_gdp = mean(`GDP per capita`, na.rm = T))
## `summarise()` ungrouping output (override with `.groups` argument)
# count() and distinct(): Number and isolate unique observations.
happiness2019 %>% count(happy)
happiness2019 %>% distinct(happy)

Sometimes you may collect data from different environments and may need to merge the datasets according a particular variable.

One of the main advantages of dplyr is that you can join data frames easily. There are several functions one can use to merge, or join datasets. But the dplyr is the easiest and fastest.

#install.packages("nycflights13")
library(nycflights13)
flights 
planes

This join will take only the matching values based on the specified columns.

inner_join(flights, planes) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model)
## Joining, by = c("year", "tailnum")

What do we mean by left join?

This join will take all of the values from the table we specify as left (e.g., the first one) and match them to records from the table on the right (e.g. the second one). If there isn’t a match in the second table, then it will return NULL for the row in question.

left_join(flights, planes) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model)
## Joining, by = c("year", "tailnum")

If you do not provide the columns to be joined, dplyr will joing the datasets based on the columns that have the same name. At the end of the operator, it will tell us which columns are used to join.

Joining, by = c(“year”, “tailnum”)

Now, what if the joining columns don’t have same name, or the meaning of the same named columns is different, then you need to specify the joining columns with the by = argument:

left_join(flights, planes, by = c("tailnum", "year")) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model)

Basically takes the second data frame’s observations into account and the new joint will take all of its values from the second dataframe.

The full outer join returns all of the records in a new table, whether it matches on either the left or right tables. If the table rows match, then a join will be executed, otherwise it will return NULL in places where a matching row does not exist.

A semi join creates a new table where it will return all rows from the first table where there is a corresponding matching value in second, but instead of the new table combining both the first and second tables, it only contains data from the first table.

Some Extras

  • A very common filter() use case is identifying (or removing) missing data cases.
happiness2019 %>% 
  filter(is.na(Generosity)) %>% head(2)

  • Arranging the data based on a variable (either numeric or categorical)
happiness2019 %>% 
  arrange(Generosity) %>% head(2)

Arranging on a character-based column (i.e. strings) will sort alphabetically. Try this yourself by arranging according to the “Country.or.region” column.

  • We can also arrange items in descending order using arrange(desc()).
happiness2019 %>% 
  arrange(desc(Generosity)) %>% head(2)

Once you learn the dplyr grammar there are a few additional benefits . dplyr can work with other data frame “backends”" such as SQL databases. There is an SQL interface for relational databases via the DBI package . dplyr can be integrated with the data.table package for large fast tables

Exercises

1. Pokemon Dataset

https://www.kaggle.com/rounakbanik/pokemon/data#

Import the dataset and have a quick look at it

Exercise 1

Select the first three columns of the pokemon dataset using their column names.

Exercise 2

Select all the columns of the pokemon dataset except “abilities”.

Exercise 3

Select all columns of the pokemon dataset that start with the character string “against”.

Exercise 4

Filter the rows of the pokemon dataset for against_bug \(>= 1\) and against_dark \(>= 1\).

Exercise 5

Filter the rows of the pokemon dataset for only available data for the against_bug variable (use a version of is.na).

Exercise 6

Print out the name of the pokemon variable without any duplicates.

Exercise 7

Arrange rows by a particular column, such as the against_bug.

Exercise 8

Select three columns from pokemon, arrange the rows by against_bug in a descending order, then arrange the rows by against_dark.

Exercise 9

Create a new column called proportion, which is the ratio of height to weight of the pokemon.

Exercise 10

Compute the average damage that a pokemon can do against bug, apply the mean() function to the column against_bug, and call the summary value “ave_against_bug”. HINT: Use summarize().

Exercise 11

Obtain the same summary statistics for different types (“type1”) of pokemon. HINT: Use group_by().

Exercise 12

Create an appropriate plot that would display the information obtained in Exercise 10.

Exercise 13

Plot the damage a pokemon can do against bug types.

Exercise 14

Plot the damage a pokemon can do against bug types for different types of pokemon.

Exercise 15 - Extra

Write a function called my_range that computes the difference between the 90% and 10% quantiles of a vector, and another function called my_cov that computes the coefficient of variation of a vector (defined as the standard deviation divided by mean).

Use these functions to work out how “variable” each pokemon attacks are.

Exercises from RP Online Ref

Do the exercises in http://r4ds.had.co.nz/transform.html (data transformation using the dplyr verbs) of R4DS.

There are exercises at the end of each major subsection. Do as many of these exercises as you need to feel comfortable with the material - I suggest doing at least the first two of each set of exercises.