Introduction to dplyr

dplyr is an R package for data manipulation and analysis. It provides many functions that efficiently manipulate data frames; the key ones include mutate(), select(), filter(), summarise(), and arrange(). dplyr also integrates well with other tidyverse packages, including ggplot2 and tidyr, which allows for a seamless flow between data manipulation and visualization. It has become very popular in the R community because it is easy to use and consistent in its syntax and performance.


group_by() Function Introduction

The functions listed above all combine naturally with the group_by() function, which groups data based on one or more variables. Grouping is useful for performing operations on subsets of the data or for summarizing the different groups within the data.

This function has several advantages. First, the syntax is straightforward: the data and the variable(s) to group by are passed as arguments to the function. Second, the object returned by group_by() is a grouped data frame that behaves like an ordinary data frame, so operations can easily be performed on the groups. This lets the user calculate summary statistics or filter groups based on specific conditions, which makes for a more meaningful data analysis.
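The points above can be sketched with a small toy data set. The data frame scores and the 20-point threshold below are hypothetical and are used only for illustration; the dplyr functions themselves (group_by(), group_vars(), filter()) are real.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
scores <- data.frame(
  team  = c("A", "A", "B", "B"),
  score = c(10, 20, 30, 50)
)

# The data comes first, followed by the grouping variable(s)
by_team <- group_by(scores, team)

# The result is still a data frame, with grouping information attached
is.data.frame(by_team)
group_vars(by_team)

# Conditions inside filter() are evaluated per group:
# keep only the teams whose mean score exceeds 20
high_teams <- filter(by_team, mean(score) > 20)
print(high_teams)
```

Because the condition is evaluated within each group, whole teams are kept or dropped rather than individual rows.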

Example 1: Grouping Data & Performing Operations on Grouped Data

First, a small data set is created to demonstrate the different uses of the group_by() function. The data is then grouped by category.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sales_data <- data.frame(
  region = c("North", "South", "North", "South", "North", "South"),
  category = c("Electronics", "Furniture", "Clothing", "Clothing", "Electronics", "Furniture"),
  sales_amount = c(7000, 1500, 2000, 800, 900, 1200)
)
grouped_data <- sales_data %>% group_by(category)


The code above groups the data by category. The group_by() function does not reorder or alter the rows, so the data looks almost the same when printed directly; the only visible difference is a "# Groups:" line in the printed header. To verify that the data has been grouped correctly, an operation can be performed on the grouped data. Here, the total sales amount by category is calculated.
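Besides running a summary, the grouping can also be inspected directly. The following is a minimal sketch that rebuilds the same sales_data so it runs on its own; it uses dplyr's helper functions for reading grouping metadata.

```r
library(dplyr)

sales_data <- data.frame(
  region = c("North", "South", "North", "South", "North", "South"),
  category = c("Electronics", "Furniture", "Clothing", "Clothing", "Electronics", "Furniture"),
  sales_amount = c(7000, 1500, 2000, 800, 900, 1200)
)
grouped_data <- sales_data %>% group_by(category)

# Is the object grouped at all?
is_grouped_df(grouped_data)

# Which variable(s) is it grouped by, and how many groups are there?
group_vars(grouped_data)
n_groups(grouped_data)

# ungroup() removes the grouping when it is no longer needed
plain_data <- ungroup(grouped_data)
is_grouped_df(plain_data)
```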

total_sales_by_category <- grouped_data %>% 
  summarise(total_sales = sum(sales_amount))

print(total_sales_by_category)
## # A tibble: 3 × 2
##   category    total_sales
##   <chr>             <dbl>
## 1 Clothing           2800
## 2 Electronics        7900
## 3 Furniture          2700

The code above gives the total sales for each of the 3 categories. To verify that the data has been grouped correctly, the total sales for each category can be calculated by hand and cross-referenced with the code output above.

Clothing: 2000 + 800 = 2800
Electronics: 7000 + 900 = 7900
Furniture: 1500 + 1200 = 2700

This data has been grouped together correctly for each category since the hand calculations match up with the calculations given by the code.
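The same cross-check can be done in code rather than by hand. As a sketch, the base R function tapply() is used here as an independent second calculation; this is one possible way to check the result, not the only one.

```r
library(dplyr)

sales_data <- data.frame(
  region = c("North", "South", "North", "South", "North", "South"),
  category = c("Electronics", "Furniture", "Clothing", "Clothing", "Electronics", "Furniture"),
  sales_amount = c(7000, 1500, 2000, 800, 900, 1200)
)

# Totals computed with dplyr
dplyr_totals <- sales_data %>%
  group_by(category) %>%
  summarise(total_sales = sum(sales_amount))

# Totals computed independently with base R
base_totals <- tapply(sales_data$sales_amount, sales_data$category, sum)

# Line the base R totals up by category name, then compare
totals_match <- all.equal(
  dplyr_totals$total_sales,
  as.numeric(base_totals[dplyr_totals$category])
)
print(totals_match)
```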


Example 2: Summarizing Grouped Data

The group_by() function also accepts more than one grouping variable, so the data can be summarized for each combination of groups.

average_sales <- sales_data %>%
  group_by(region, category) %>%
  summarise(avg_sales = mean(sales_amount))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
print(average_sales)
## # A tibble: 4 × 3
## # Groups:   region [2]
##   region category    avg_sales
##   <chr>  <chr>           <dbl>
## 1 North  Clothing         2000
## 2 North  Electronics      3950
## 3 South  Clothing          800
## 4 South  Furniture        1350

The code above first groups the data by region and category and then summarizes each group by its average sales. As before, calculating this by hand verifies that the data has been grouped and summarized correctly. Based on the data, there are two regions, North and South, and there are only 2 different categories within each region.

North, Electronics: (7000 + 900) / 2 = 3950
North, Clothing: 2000
South, Furniture: (1500 + 1200) / 2 = 1350
South, Clothing: 800

The hand calculations match the output above.
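The message printed by summarise() in this example mentions the .groups argument. The following is a minimal sketch of what that argument controls, assuming mean() as the summary function and "drop" as the chosen option; the data set is rebuilt so the block runs on its own.

```r
library(dplyr)

sales_data <- data.frame(
  region = c("North", "South", "North", "South", "North", "South"),
  category = c("Electronics", "Furniture", "Clothing", "Clothing", "Electronics", "Furniture"),
  sales_amount = c(7000, 1500, 2000, 800, 900, 1200)
)

# By default, summarise() removes only the last grouping variable,
# so the result stays grouped by region and a message is printed
still_grouped <- sales_data %>%
  group_by(region, category) %>%
  summarise(avg_sales = mean(sales_amount))
group_vars(still_grouped)

# .groups = "drop" returns an ungrouped result and silences the message
ungrouped_result <- sales_data %>%
  group_by(region, category) %>%
  summarise(avg_sales = mean(sales_amount), .groups = "drop")
is_grouped_df(ungrouped_result)
```

Dropping the groups explicitly is a reasonable choice when later steps should treat the summary as one ordinary table.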


Example 3: Chaining Operations Together

Finally, these operations can be chained together with the pipe operator (%>%), so that data sets can be manipulated efficiently in a single sequence of steps.

# Grouping by region, summarizing total sales, and arranging in descending order
summary_data <- sales_data %>% 
  group_by(region) %>% 
  summarise(total_sales = sum(sales_amount)) %>% 
  arrange(desc(total_sales))

print(summary_data)
## # A tibble: 2 × 2
##   region total_sales
##   <chr>        <dbl>
## 1 North         9900
## 2 South         3500

In the code above, the data was first grouped by region (North or South). Then, the data was summarized by the total sales amount for each region. Finally, the result was arranged in descending order of total sales, so that the region with the most sales appears at the top.
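A chain like this can be extended with further steps. As a sketch, the version below adds a mutate() step that computes each region's share of overall sales; the column name share is hypothetical and chosen only for illustration.

```r
library(dplyr)

sales_data <- data.frame(
  region = c("North", "South", "North", "South", "North", "South"),
  category = c("Electronics", "Furniture", "Clothing", "Clothing", "Electronics", "Furniture"),
  sales_amount = c(7000, 1500, 2000, 800, 900, 1200)
)

region_share <- sales_data %>%
  group_by(region) %>%
  summarise(total_sales = sum(sales_amount)) %>%
  # After summarise() there is one row per region and no grouping left,
  # so sum(total_sales) here is the overall total across regions
  mutate(share = total_sales / sum(total_sales)) %>%
  arrange(desc(share))

print(region_share)
```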


Conclusion

Overall, the group_by() function is very useful in data manipulation. It allows complex analyses to be performed on subsets of the data by grouping on one or more variables. This function is essential to the dplyr package, which is widely used in data analysis and manipulation.


Works Cited

This code through references and cites the following sources: