R_dplyr_groupby

#Cleaning the environment

rm(list=ls())

#Importing Dataset

setwd("/Users/nishantaneja/Desktop/Files_for_R")
sales_order = read.csv("sales_order.csv", stringsAsFactors = FALSE)

#Loading dplyr

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Lets see what the group_by() command does in dplyr

#1 Summarising with sum()

1.1 sum() over sales

Grouping by Regions and seeing the Total Sales will help us know which region did better…

Sales_regions = sales_order %>% 
                group_by(Region) %>% 
                summarise(Overall_Sales = sum(Total)) %>% 
                arrange(desc(Overall_Sales))

This tells us Central Region has the highest $ amount of sales followed by East and then West.

1.2 Sum() over Reps

Lets use the same group_by() to see which Rep did the highest sales in absolute numbers…

Sales_rep = sales_order %>% 
            group_by(Rep) %>% 
            summarise(Overall_Sales = sum(Total)) %>% 
            arrange(desc(Overall_Sales))

This shows us that Kivell was the best performer with highest sales.

#2 Summarising with mean()

2.1 mean() over sales

We can also summarise in avg by using mean() instead of sum(), choose what you want to summarise on based on the exploration you want in the dataset…

Sales_rep_avg = sales_order %>% 
            group_by(Rep) %>% 
            summarise(Avg_Sales = mean(Total)) %>% 
            arrange(desc(Avg_Sales))

#3 Summarising with n()

3.1 n() over Region

Lets explore count by using n()…

Sales_region_count = sales_order %>% 
                  group_by(Region) %>%
                  summarise(Region_count = n()) %>% 
                  arrange(desc(Region_count))

This shows Central region had most sales as count function counts the number of sales in each region as we grouped by region in the above code.

3.2 n() over Reps

Sales_rep_count = sales_order %>% 
                  group_by(Rep) %>%
                  summarise(Rep_sales_count = n()) %>% 
                  arrange(desc(Rep_sales_count))

Jones did the most number of sales, and as per our previous exploration, his avg. was lower hence he did not make it to the top of revenue numbers. This is how exploration comes in handy.

#4 Using tally() in place of n()

Anytime you use ‘summarise’ with ‘n()’ function, it can basically be replaced with tally(). Let’s see how that works…

Sales_rep_tally = sales_order %>% 
                  group_by(Rep) %>%
                  tally(sort = TRUE)

Now lets try to explore a bit deeper…

#5 Summarising with n_distinct()

Sales_region = sales_order %>% 
               group_by(Region) %>% 
               summarise(Region_sales_count = n(), Rep_count = n_distinct(Rep))

Here we have recreated the table of sales counts in each region and then to move a step further, we have also added how many reps have made sales in those regions. This gives us more information in one exploration than doing the same code writing multiple number of times.

#6 Grouping without summarising

Grouping can also be useful without summarising…

Base R functions like table() can also be called to view certain outputs and can be leveraged as per need …

sales_order %>% 
  group_by(Region) %>% 
  select(Item) %>% 
  table()

## Adding missing grouping variables: `Region`

##          Item
## Region    Binder Desk Pen Pen Set Pencil
##   Central      8    2   1       4      9
##   East         5    0   3       3      2
##   West         2    1   1       0      2

Thank you for watching. I’ll see you in the next one :)