knitr::opts_chunk$set(echo = TRUE)
The advent of several “point and click” or “drag and drop” tools have eased data manipulation for analysts. Data retrieval, wrangling and cleansing is becoming a task which anyone can perform. However in some scenarios such tools fail to manipulate advanced or complex analysis without the inclusion of typing in programming lines into custom transfiguration. Tools like R and Python are still the most preferred tools due to their advanced nature of handling statistical analysis and machine learning complexity.
Every Data Scientist or Analyst spends about 80% of his/her time in data wrangling and only about 20% of the time in the actual analysis. Many packages in R programming language are quite extensively used for complex data retrievals. The dplyr package is pretty neat in performing several data munging tasks. In this blog I am aiming to cover the various functions and capabilities of the dplyr and the tidyr packages.
With the recent Star Wars movie released, I want to use the starwars tibble for some analysis here!
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
data(starwars)
A tibble is quite different to a data frame in terms of printing or subsetting and it is simply a nicer way to create a data frame. It never adjusts the names of the variables of changes the input type. Check this article for more details.
Now let’s view the dataset:
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
A tibble also displays just the top 10 rows and displays it on my Rstudio console.A dplyr equivalent for a data frame is the following command which also makes it easier to display the data
dplyr::tbl_df(starwars)
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
If we want to just glimpse the data we can use the following as well
dplyr::glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", …
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", …
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue"…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0,…
## $ gender <chr> "male", NA, NA, "male", "female", "male", "female", NA, "m…
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum…
## $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "The Empir…
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I…
## $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1…
Now one particular format that is commonly seen in dplyr code is the piping symbol which is %>%. It passes the object on the left hand side to the function on the right and makes the code more readable. We will see more of this later.
Now let’s look at the different row-wise filtering options.
dplyr::filter(starwars,mass<=20)
## # A tibble: 3 x 13
## name height mass hair_color skin_color eye_color birth_year gender homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male <NA>
## 2 Wick… 88 20 brown brown brown 8 male Endor
## 3 Ratt… 79 15 none grey, blue unknown NA male Aleen Mi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
And I get what I wanted to see! Yoda’s mass is 17, so the data is included in the results.
If you just want to view a set of rows say from 14 to 15, then we do the following:
dplyr::slice(starwars, 13:14)
## # A tibble: 2 x 13
## name height mass hair_color skin_color eye_color birth_year gender homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Chew… 228 112 brown unknown blue 200 male Kashyyyk
## 2 Han … 180 80 brown fair brown 29 male Corellia
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
And look who we have here! Han Solo and Chewbacca.
Here is a column-wise manipulation:
dplyr::select(starwars, name, species, birth_year)
## # A tibble: 87 x 3
## name species birth_year
## <chr> <chr> <dbl>
## 1 Luke Skywalker Human 19
## 2 C-3PO Droid 112
## 3 R2-D2 Droid 33
## 4 Darth Vader Human 41.9
## 5 Leia Organa Human 19
## 6 Owen Lars Human 52
## 7 Beru Whitesun lars Human 47
## 8 R5-D4 Droid NA
## 9 Biggs Darklighter Human 24
## 10 Obi-Wan Kenobi Human 57
## # … with 77 more rows
We can also do summary and grouping operations. For example if we want to see the average mass by gender, then:
starwars %>%
group_by(gender) %>%
summarise(avg = mean(mass, na.rm=TRUE)) %>%
arrange(avg)
## # A tibble: 5 x 2
## gender avg
## <chr> <dbl>
## 1 <NA> 46.3
## 2 female 54.0
## 3 male 81.0
## 4 none 140
## 5 hermaphrodite 1358
We can also create new variables. Say I want to create a BMI value. For the sake of simplicity let’s ignore gender and set the formula to be weight / height(squared).
starwars %>%
mutate(bmi = mass/(height^2)) %>%
arrange(desc(bmi)) %>%
select(name, species, height, mass, bmi)
## # A tibble: 87 x 5
## name species height mass bmi
## <chr> <chr> <int> <dbl> <dbl>
## 1 Jabba Desilijic Tiure Hutt 175 1358 0.0443
## 2 Dud Bolt Vulptereen 94 45 0.00509
## 3 Yoda Yoda's species 66 17 0.00390
## 4 Owen Lars Human 178 120 0.00379
## 5 IG-88 Droid 200 140 0.0035
## 6 R2-D2 Droid 96 32 0.00347
## 7 Grievous Kaleesh 216 159 0.00341
## 8 R5-D4 Droid 97 32 0.00340
## 9 Jek Tono Porkins Human 180 110 0.00340
## 10 Darth Vader Human 202 136 0.00333
## # … with 77 more rows
We can also reshape some results using tidyr. Here is an example to append species and name.
starwars %>%
unite(name_species, name, species) %>%
select(name_species)
## # A tibble: 87 x 1
## name_species
## <chr>
## 1 Luke Skywalker_Human
## 2 C-3PO_Droid
## 3 R2-D2_Droid
## 4 Darth Vader_Human
## 5 Leia Organa_Human
## 6 Owen Lars_Human
## 7 Beru Whitesun lars_Human
## 8 R5-D4_Droid
## 9 Biggs Darklighter_Human
## 10 Obi-Wan Kenobi_Human
## # … with 77 more rows