#Cleaning the environment
rm(list=ls())
#Importing Dataset
setwd("/Users/nishantaneja/Desktop/Files_for_R")
sales_order = read.csv("sales_order.csv", stringsAsFactors = FALSE)
#Loading dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Lets see what the group_by() command does in dplyr
#1 Summarising with sum()
Grouping by Regions and seeing the Total Sales will help us know which region did better…
Sales_regions = sales_order %>%
group_by(Region) %>%
summarise(Overall_Sales = sum(Total)) %>%
arrange(desc(Overall_Sales))
This tells us Central Region has the highest $ amount of sales followed by East and then West.
Lets use the same group_by() to see which Rep did the highest sales in absolute numbers…
Sales_rep = sales_order %>%
group_by(Rep) %>%
summarise(Overall_Sales = sum(Total)) %>%
arrange(desc(Overall_Sales))
This shows us that Kivell was the best performer with highest sales.
#2 Summarising with mean()
We can also summarise in avg by using mean() instead of sum(), choose what you want to summarise on based on the exploration you want in the dataset…
Sales_rep_avg = sales_order %>%
group_by(Rep) %>%
summarise(Avg_Sales = mean(Total)) %>%
arrange(desc(Avg_Sales))
#3 Summarising with n()
Lets explore count by using n()…
Sales_region_count = sales_order %>%
group_by(Region) %>%
summarise(Region_count = n()) %>%
arrange(desc(Region_count))
This shows Central region had most sales as count function counts the number of sales in each region as we grouped by region in the above code.
Sales_rep_count = sales_order %>%
group_by(Rep) %>%
summarise(Rep_sales_count = n()) %>%
arrange(desc(Rep_sales_count))
Jones did the most number of sales, and as per our previous exploration, his avg. was lower hence he did not make it to the top of revenue numbers. This is how exploration comes in handy.
#4 Using tally() in place of n()
Anytime you use ‘summarise’ with ‘n()’ function, it can basically be replaced with tally(). Let’s see how that works…
Sales_rep_tally = sales_order %>%
group_by(Rep) %>%
tally(sort = TRUE)
Now lets try to explore a bit deeper…
#5 Summarising with n_distinct()
Sales_region = sales_order %>%
group_by(Region) %>%
summarise(Region_sales_count = n(), Rep_count = n_distinct(Rep))
Here we have recreated the table of sales counts in each region and then to move a step further, we have also added how many reps have made sales in those regions. This gives us more information in one exploration than doing the same code writing multiple number of times.
#6 Grouping without summarising
Grouping can also be useful without summarising…
Base R functions like table() can also be called to view certain outputs and can be leveraged as per need …
sales_order %>%
group_by(Region) %>%
select(Item) %>%
table()
## Adding missing grouping variables: `Region`
## Item
## Region Binder Desk Pen Pen Set Pencil
## Central 8 2 1 4 9
## East 5 0 3 3 2
## West 2 1 1 0 2
Thank you for watching. I’ll see you in the next one :)