Introduction

Excel and R and great tools for Analytics. While the Business teams ususally rely on Excel to solve their analytical needs, Data teams tend to use R or Python for it’s analytical capabilities. Choosing a platform basically comes down to the Problem at hand and the data you’re dealing. Both R and Excel have their set of pros and cons. Certain advantages that R poses over Excel which can’t be ignored are:

  1. Dealing with Large Data Sets
  2. Graphical Capabilities
  3. Statistical analysis support - Packages

In order for any Excel user to get comfortable with R capabilities, he/she needs to understand how to perform some of their primary analytical tasks in R first. In this vignette we will tackle one of the most vital Excel functionality: Pivots. For this we will use a R package called dplyr and simulate the same Pivot functions in R.

Data Set

For this exercise, we would use the starwars dataset which is available within the dplyr package in R. You can install the dplyr package using the following code: install.packages(“dplyr”) After that you can view the starwars dataset within dplyr.

library("dplyr")
names(starwars)         # View column names
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"
starwars                # View starwars dataset
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke~    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>  
##  4 Dart~    202   136 none       white      yellow          41.9 male  
##  5 Leia~    150    49 brown      light      brown           19   female
##  6 Owen~    178   120 brown, gr~ light      blue            52   male  
##  7 Beru~    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg~    183    84 black      light      brown           24   male  
## 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

Pivot functions

[Key Functions]

  1. Values - Sum, count, average, Max and so on.
  2. Rows
  3. Filter
  4. Calculated Field

We will tackle all the Functions using the dplyr package.

1. Values : summarize()

The Values function in a pivot table can be used to add numeric fields and perform mathematic operations on them. The summarize() function allows to do the same thing in R.

Let’s summarize the mass and height variables and find their mean and median values.

library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  starwars %>%                    # Passing the Starwars dataset to next function
  summarize( mean(mass), median(height))
## # A tibble: 1 x 2
##   `mean(mass)` `median(height)`
##          <dbl>            <dbl>
## 1         66.0              178

Some of the other useful operations you can perform with summarize() are:

Spread: sd(), IQR(), mad()

Range: min(), max(), quantile()

Position: first(), last(), nth(),

Count: n(), n_distinct()

Logical: any(), all()

2. Rows : group_by()

Rows is key to the Pivot table. It allows the data to be grouped against a parameter or multiple parameters along which you want to analyze the data set. In dplyr, group_by() is the equivalent funciton that can be used to perform this operaiton.

Since we are familiar with the summarize function. Let’s summarize the same fields but now also group them:

  1. Mean Mass and Median Height Grouped by Species
  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  starwars %>%                    # Passing the Starwars dataset to next function
  group_by(species) %>%          # Grouping by species
  summarize( mean(mass), median(height))
## # A tibble: 38 x 3
##    species   `mean(mass)` `median(height)`
##    <chr>            <dbl>            <dbl>
##  1 0                  9.6              178
##  2 Aleena            15                 79
##  3 Besalisk         102                198
##  4 Cerean            82                198
##  5 Chagrian           0                196
##  6 Clawdite          55                168
##  7 Droid             55.8               97
##  8 Dug               40                112
##  9 Ewok              20                 88
## 10 Geonosian         80                183
## # ... with 28 more rows
  1. Mean Mass and Median Height Grouped by Species and Gender
  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  starwars %>%                    # Passing the Starwars dataset to next function
  group_by(species, gender) %>%          # Grouping by species and gender
  summarize( mean(mass), median(height))
## # A tibble: 43 x 4
## # Groups:   species [38]
##    species  gender `mean(mass)` `median(height)`
##    <chr>    <chr>         <dbl>            <dbl>
##  1 0        female         16                 96
##  2 0        male            0                183
##  3 Aleena   male           15                 79
##  4 Besalisk male          102                198
##  5 Cerean   male           82                198
##  6 Chagrian male            0                196
##  7 Clawdite female         55                168
##  8 Droid    0              46.3               97
##  9 Droid    none           70                100
## 10 Dug      male           40                112
## # ... with 33 more rows

3. Filter : filter()

As you can see the nomenclature within dplyr is simple and intuitive. The filter() function allows you to filter a variable by a value.

Let’s use it in a couple of examples

  1. Mean Mass and Median Height Filtered by Black or Brown hair_color
  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  starwars %>%                    # Passing the Starwars dataset to next function
  filter(hair_color == "black" | hair_color == "brown" )  %>%   # Filtering hair color by black or brown
  summarize( mean(mass), median(height))
## # A tibble: 1 x 2
##   `mean(mass)` `median(height)`
##          <dbl>            <dbl>
## 1         44.6              171
  1. Mean Mass and Median Height filtered by Female Gender and Naboo homeworld
  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  starwars %>%                    # Passing the Starwars dataset to next function
  filter( gender == "female" & homeworld == "Naboo" )  %>%   # Filtering by female gender and naboo homeworld
  summarize( mean(mass), median(height))
## # A tibble: 1 x 2
##   `mean(mass)` `median(height)`
##          <dbl>            <dbl>
## 1           15              165

4. calculated Field : summarize()

‘Insert calculated field’ which is quite often used along with a Pivot table can also be used within a summarize function by writing a new variable formula.

Let’s Group the starwars data by homeworld variable and create a new field to calculate mean height in Feet

  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  starwars %>%                    # Passing the Starwars dataset to next function
  group_by(homeworld) %>%         # Grouping by homeworld
    summarize( height_feet <- mean(height/30.48)) # Converting height from cm to feet and taking mean
## # A tibble: 49 x 2
##    homeworld      `height_feet <- mean(height/30.48)`
##    <chr>                                        <dbl>
##  1 0                                             1.82
##  2 Alderaan                                      5.79
##  3 Aleen Minor                                   2.59
##  4 Bespin                                        5.74
##  5 Bestine IV                                    5.91
##  6 Cato Neimoidia                                6.27
##  7 Cerea                                         6.50
##  8 Champala                                      6.43
##  9 Chandrila                                     4.92
## 10 Concord Dawn                                  6.00
## # ... with 39 more rows

Other Basic Functions

Now let’s also look at some of the other basic functions used in an Excel table that you may perform on the data set before or after the Pivot.

We will cover 2 such functions 1. Sort 2. Instert new Column

1. Sort : arrange()

The sort by funtion in Excel can be performed by using the arrange() function in R. The arrange funciton uses the dataset followed by the column name to be sorted. Let’s sort the starwars data set in increasing order of height and then increasing order of mass.

  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  arrange(starwars, height, mass) # Arrange with ascending height, mass
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Arve~      0     0 brown      fair       brown              0 male  
##  2 Finn       0     0 black      dark       dark               0 male  
##  3 Rey        0     0 brown      light      hazel              0 female
##  4 Poe ~      0     0 brown      light      brown              0 male  
##  5 BB8        0     0 none       none       black              0 none  
##  6 Capt~      0     0 unknown    unknown    unknown            0 female
##  7 Yoda      66    17 white      green      brown            896 male  
##  8 Ratt~     79    15 none       grey, blue unknown            0 male  
##  9 Wick~     88    20 brown      brown      brown              8 male  
## 10 Dud ~     94    45 none       blue, grey yellow             0 male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

In order to arrange the table in descending fashion, we can use the function desc() along with the arrange funciton. Let’s arrange the starwars data set with decreasing order of height.

  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  arrange(starwars, desc(height)) # Arrange in descending order of height
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Yara~    264     0 none       white      yellow           0   male  
##  2 Tarf~    234   136 brown      brown      blue             0   male  
##  3 Lama~    229    88 none       grey       black            0   male  
##  4 Chew~    228   112 brown      unknown    blue           200   male  
##  5 Roos~    224    82 none       grey       orange           0   male  
##  6 Grie~    216   159 none       brown, wh~ green, y~        0   male  
##  7 Taun~    213     0 none       grey       black            0   female
##  8 Rugo~    206     0 none       green      orange           0   male  
##  9 Tion~    206    80 none       grey       black            0   male  
## 10 Dart~    202   136 none       white      yellow          41.9 male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

In order to sort a Pivot table , we can use the arrange function after using the summarize() function. Let’s sort the data from example 2(a) in descending order of mass.

  library("dplyr")
  starwars[is.na(starwars)] <- 0  # Replacing NA values with 0
  pivot <- starwars %>%           # Saving the data in table called pivot
  group_by(species) %>%          # Grouping by species
  summarize( mass = mean(mass), height = median(height))
  arrange(pivot, desc(mass))    
## # A tibble: 38 x 3
##    species       mass height
##    <chr>        <dbl>  <dbl>
##  1 Hutt          1358    175
##  2 Kaleesh        159    216
##  3 Wookiee        124    231
##  4 Trandoshan     113    190
##  5 Besalisk       102    198
##  6 Neimodian       90    191
##  7 Nautolan        87    196
##  8 Mon Calamari    83    180
##  9 Cerean          82    198
## 10 Geonosian       80    183
## # ... with 28 more rows

2. Insert new Column : mutate()

You may want to add new columns to the data set before you use the Pivot functionality. These new columns could just be a function of the existing dataset. We can use the mutate() funciton to perform this task.

Let’s use the mutate funciton to calculate the age in 2019 as a new column in the starwars dataset

library("dplyr")
new_starwars <- mutate(starwars, age= 2019 - birth_year)
new_starwars
## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke~    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 0          gold       yellow         112   0     
##  3 R2-D2     96    32 0          white, bl~ red             33   0     
##  4 Dart~    202   136 none       white      yellow          41.9 male  
##  5 Leia~    150    49 brown      light      brown           19   female
##  6 Owen~    178   120 brown, gr~ light      blue            52   male  
##  7 Beru~    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 0          white, red red              0   0     
##  9 Bigg~    183    84 black      light      brown           24   male  
## 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male  
## # ... with 77 more rows, and 6 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   age <dbl>
names(new_starwars)
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"  "age"

References

The cran project link for the package dplyr(2nd Link) is maintained by Hadley Wickham who is also one of the authors of this R package.

  1. https://r4ds.had.co.nz/
  2. https://cran.r-project.org/web/packages/dplyr/index.html
  3. https://www.quora.com/What-really-differentiates-R-from-Excel