Excel and R and great tools for Analytics. While the Business teams ususally rely on Excel to solve their analytical needs, Data teams tend to use R or Python for it’s analytical capabilities. Choosing a platform basically comes down to the Problem at hand and the data you’re dealing. Both R and Excel have their set of pros and cons. Certain advantages that R poses over Excel which can’t be ignored are:
In order for any Excel user to get comfortable with R capabilities, he/she needs to understand how to perform some of their primary analytical tasks in R first. In this vignette we will tackle one of the most vital Excel functionality: Pivots. For this we will use a R package called dplyr and simulate the same Pivot functions in R.
For this exercise, we would use the starwars dataset which is available within the dplyr package in R. You can install the dplyr package using the following code: install.packages(“dplyr”) After that you can view the starwars dataset within dplyr.
library("dplyr")
names(starwars) # View column names
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "gender" "homeworld" "species"
## [11] "films" "vehicles" "starships"
starwars # View starwars dataset
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke~ 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl~ red 33 <NA>
## 4 Dart~ 202 136 none white yellow 41.9 male
## 5 Leia~ 150 49 brown light brown 19 female
## 6 Owen~ 178 120 brown, gr~ light blue 52 male
## 7 Beru~ 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg~ 183 84 black light brown 24 male
## 10 Obi-~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
[Key Functions]
We will tackle all the Functions using the dplyr package.
The Values function in a pivot table can be used to add numeric fields and perform mathematic operations on them. The summarize() function allows to do the same thing in R.
Let’s summarize the mass and height variables and find their mean and median values.
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
starwars %>% # Passing the Starwars dataset to next function
summarize( mean(mass), median(height))
## # A tibble: 1 x 2
## `mean(mass)` `median(height)`
## <dbl> <dbl>
## 1 66.0 178
Some of the other useful operations you can perform with summarize() are:
Spread: sd(), IQR(), mad()
Range: min(), max(), quantile()
Position: first(), last(), nth(),
Count: n(), n_distinct()
Logical: any(), all()
Rows is key to the Pivot table. It allows the data to be grouped against a parameter or multiple parameters along which you want to analyze the data set. In dplyr, group_by() is the equivalent funciton that can be used to perform this operaiton.
Since we are familiar with the summarize function. Let’s summarize the same fields but now also group them:
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
starwars %>% # Passing the Starwars dataset to next function
group_by(species) %>% # Grouping by species
summarize( mean(mass), median(height))
## # A tibble: 38 x 3
## species `mean(mass)` `median(height)`
## <chr> <dbl> <dbl>
## 1 0 9.6 178
## 2 Aleena 15 79
## 3 Besalisk 102 198
## 4 Cerean 82 198
## 5 Chagrian 0 196
## 6 Clawdite 55 168
## 7 Droid 55.8 97
## 8 Dug 40 112
## 9 Ewok 20 88
## 10 Geonosian 80 183
## # ... with 28 more rows
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
starwars %>% # Passing the Starwars dataset to next function
group_by(species, gender) %>% # Grouping by species and gender
summarize( mean(mass), median(height))
## # A tibble: 43 x 4
## # Groups: species [38]
## species gender `mean(mass)` `median(height)`
## <chr> <chr> <dbl> <dbl>
## 1 0 female 16 96
## 2 0 male 0 183
## 3 Aleena male 15 79
## 4 Besalisk male 102 198
## 5 Cerean male 82 198
## 6 Chagrian male 0 196
## 7 Clawdite female 55 168
## 8 Droid 0 46.3 97
## 9 Droid none 70 100
## 10 Dug male 40 112
## # ... with 33 more rows
As you can see the nomenclature within dplyr is simple and intuitive. The filter() function allows you to filter a variable by a value.
Let’s use it in a couple of examples
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
starwars %>% # Passing the Starwars dataset to next function
filter(hair_color == "black" | hair_color == "brown" ) %>% # Filtering hair color by black or brown
summarize( mean(mass), median(height))
## # A tibble: 1 x 2
## `mean(mass)` `median(height)`
## <dbl> <dbl>
## 1 44.6 171
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
starwars %>% # Passing the Starwars dataset to next function
filter( gender == "female" & homeworld == "Naboo" ) %>% # Filtering by female gender and naboo homeworld
summarize( mean(mass), median(height))
## # A tibble: 1 x 2
## `mean(mass)` `median(height)`
## <dbl> <dbl>
## 1 15 165
‘Insert calculated field’ which is quite often used along with a Pivot table can also be used within a summarize function by writing a new variable formula.
Let’s Group the starwars data by homeworld variable and create a new field to calculate mean height in Feet
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
starwars %>% # Passing the Starwars dataset to next function
group_by(homeworld) %>% # Grouping by homeworld
summarize( height_feet <- mean(height/30.48)) # Converting height from cm to feet and taking mean
## # A tibble: 49 x 2
## homeworld `height_feet <- mean(height/30.48)`
## <chr> <dbl>
## 1 0 1.82
## 2 Alderaan 5.79
## 3 Aleen Minor 2.59
## 4 Bespin 5.74
## 5 Bestine IV 5.91
## 6 Cato Neimoidia 6.27
## 7 Cerea 6.50
## 8 Champala 6.43
## 9 Chandrila 4.92
## 10 Concord Dawn 6.00
## # ... with 39 more rows
Now let’s also look at some of the other basic functions used in an Excel table that you may perform on the data set before or after the Pivot.
We will cover 2 such functions 1. Sort 2. Instert new Column
The sort by funtion in Excel can be performed by using the arrange() function in R. The arrange funciton uses the dataset followed by the column name to be sorted. Let’s sort the starwars data set in increasing order of height and then increasing order of mass.
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
arrange(starwars, height, mass) # Arrange with ascending height, mass
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Arve~ 0 0 brown fair brown 0 male
## 2 Finn 0 0 black dark dark 0 male
## 3 Rey 0 0 brown light hazel 0 female
## 4 Poe ~ 0 0 brown light brown 0 male
## 5 BB8 0 0 none none black 0 none
## 6 Capt~ 0 0 unknown unknown unknown 0 female
## 7 Yoda 66 17 white green brown 896 male
## 8 Ratt~ 79 15 none grey, blue unknown 0 male
## 9 Wick~ 88 20 brown brown brown 8 male
## 10 Dud ~ 94 45 none blue, grey yellow 0 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
In order to arrange the table in descending fashion, we can use the function desc() along with the arrange funciton. Let’s arrange the starwars data set with decreasing order of height.
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
arrange(starwars, desc(height)) # Arrange in descending order of height
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yara~ 264 0 none white yellow 0 male
## 2 Tarf~ 234 136 brown brown blue 0 male
## 3 Lama~ 229 88 none grey black 0 male
## 4 Chew~ 228 112 brown unknown blue 200 male
## 5 Roos~ 224 82 none grey orange 0 male
## 6 Grie~ 216 159 none brown, wh~ green, y~ 0 male
## 7 Taun~ 213 0 none grey black 0 female
## 8 Rugo~ 206 0 none green orange 0 male
## 9 Tion~ 206 80 none grey black 0 male
## 10 Dart~ 202 136 none white yellow 41.9 male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
In order to sort a Pivot table , we can use the arrange function after using the summarize() function. Let’s sort the data from example 2(a) in descending order of mass.
library("dplyr")
starwars[is.na(starwars)] <- 0 # Replacing NA values with 0
pivot <- starwars %>% # Saving the data in table called pivot
group_by(species) %>% # Grouping by species
summarize( mass = mean(mass), height = median(height))
arrange(pivot, desc(mass))
## # A tibble: 38 x 3
## species mass height
## <chr> <dbl> <dbl>
## 1 Hutt 1358 175
## 2 Kaleesh 159 216
## 3 Wookiee 124 231
## 4 Trandoshan 113 190
## 5 Besalisk 102 198
## 6 Neimodian 90 191
## 7 Nautolan 87 196
## 8 Mon Calamari 83 180
## 9 Cerean 82 198
## 10 Geonosian 80 183
## # ... with 28 more rows
You may want to add new columns to the data set before you use the Pivot functionality. These new columns could just be a function of the existing dataset. We can use the mutate() funciton to perform this task.
Let’s use the mutate funciton to calculate the age in 2019 as a new column in the starwars dataset
library("dplyr")
new_starwars <- mutate(starwars, age= 2019 - birth_year)
new_starwars
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke~ 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 0 gold yellow 112 0
## 3 R2-D2 96 32 0 white, bl~ red 33 0
## 4 Dart~ 202 136 none white yellow 41.9 male
## 5 Leia~ 150 49 brown light brown 19 female
## 6 Owen~ 178 120 brown, gr~ light blue 52 male
## 7 Beru~ 165 75 brown light blue 47 female
## 8 R5-D4 97 32 0 white, red red 0 0
## 9 Bigg~ 183 84 black light brown 24 male
## 10 Obi-~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 6 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # age <dbl>
names(new_starwars)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "gender" "homeworld" "species"
## [11] "films" "vehicles" "starships" "age"
The cran project link for the package dplyr(2nd Link) is maintained by Hadley Wickham who is also one of the authors of this R package.