xiaofei_longLin_extension.qmd

Author

xiaofei

Introduction

This extended vignette builds on the original example by applying Tidyverse tools (`dplyr`, `ggplot2`, `lubridate`) to a new dataset: the Superstore sales dataset.

We will:
- Load and inspect structured retail data
- Clean and transform date fields
- Perform aggregations (sales, profit, discount)
- Create multiple visualizations
- Extend analysis with time series and regional insights

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(readr)

Reading in the Data

We now load the Superstore dataset downloaded from Kaggle website.

file_path <- "C:/Users/xmei/Downloads/Superstore.csv"

superstore <- read_csv(file_path, show_col_types = FALSE)

glimpse(superstore)

Rows: 9,994
Columns: 21
$ `Row ID`        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ `Order ID`      <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", …
$ `Order Date`    <chr> "11/8/2016", "11/8/2016", "6/12/2016", "10/11/2015", "…
$ `Ship Date`     <chr> "11/11/2016", "11/11/2016", "6/16/2016", "10/18/2015",…
$ `Ship Mode`     <chr> "Second Class", "Second Class", "Second Class", "Stand…
$ `Customer ID`   <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-20…
$ `Customer Name` <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean…
$ Segment         <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Cons…
$ Country         <chr> "United States", "United States", "United States", "Un…
$ City            <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderd…
$ State           <chr> "Kentucky", "Kentucky", "California", "Florida", "Flor…
$ `Postal Code`   <dbl> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032…
$ Region          <chr> "South", "South", "West", "South", "South", "West", "W…
$ `Product ID`    <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240…
$ Category        <chr> "Furniture", "Furniture", "Office Supplies", "Furnitur…
$ `Sub-Category`  <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", …
$ `Product Name`  <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabri…
$ Sales           <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.860…
$ Quantity        <dbl> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, …
$ Discount        <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, …
$ Profit          <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694,…

Data Cleaning and Transformation

Convert date columns into proper date format and create new fields.

superstore_clean <- superstore %>%
  mutate(
    Order_Date = mdy(`Order Date`),
    Ship_Date  = mdy(`Ship Date`),
    Ship_Delay = as.numeric(Ship_Date - Order_Date),
    Year = year(Order_Date),
    Month = month(Order_Date, label = TRUE)
  )

head(superstore_clean)

# A tibble: 6 × 26
  `Row ID` `Order ID`     `Order Date` `Ship Date` `Ship Mode`    `Customer ID`
     <dbl> <chr>          <chr>        <chr>       <chr>          <chr>        
1        1 CA-2016-152156 11/8/2016    11/11/2016  Second Class   CG-12520     
2        2 CA-2016-152156 11/8/2016    11/11/2016  Second Class   CG-12520     
3        3 CA-2016-138688 6/12/2016    6/16/2016   Second Class   DV-13045     
4        4 US-2015-108966 10/11/2015   10/18/2015  Standard Class SO-20335     
5        5 US-2015-108966 10/11/2015   10/18/2015  Standard Class SO-20335     
6        6 CA-2014-115812 6/9/2014     6/14/2014   Standard Class BH-11710     
# ℹ 20 more variables: `Customer Name` <chr>, Segment <chr>, Country <chr>,
#   City <chr>, State <chr>, `Postal Code` <dbl>, Region <chr>,
#   `Product ID` <chr>, Category <chr>, `Sub-Category` <chr>,
#   `Product Name` <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
#   Profit <dbl>, Order_Date <date>, Ship_Date <date>, Ship_Delay <dbl>,
#   Year <dbl>, Month <ord>

Basic Aggregation: Sales by Category

Group data by product category and calculate total sales and profit.

category_summary <- superstore_clean %>%
  group_by(Category) %>%
  summarize(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Avg_Discount = mean(Discount, na.rm = TRUE)
  )

category_summary

# A tibble: 3 × 4
  Category        Total_Sales Total_Profit Avg_Discount
  <chr>                 <dbl>        <dbl>        <dbl>
1 Furniture           742000.       18451.        0.174
2 Office Supplies     719047.      122491.        0.157
3 Technology          836154.      145455.        0.132

Visualization: Sales by Category

ggplot(category_summary, aes(x = reorder(Category, Total_Sales), y = Total_Sales, fill = Category)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Total_Sales, 0)), hjust = -0.2) +
  coord_flip() +
  labs(
    title = "Total Sales by Category",
    x = "Category",
    y = "Sales"
  ) +
  theme_minimal()

Monthly Sales Trend

We aggregate sales over time to identify trends.

monthly_sales <- superstore_clean %>%
  group_by(Year, Month) %>%
  summarize(
    Monthly_Sales = sum(Sales, na.rm = TRUE)
  ) %>%
  ungroup()

`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.

ggplot(monthly_sales, aes(x = Month, y = Monthly_Sales, group = Year, color = factor(Year))) +
  geom_line() +
  geom_point() +
  labs(
    title = "Monthly Sales Trend by Year",
    x = "Month",
    y = "Sales",
    color = "Year"
  ) +
  theme_minimal()

top_products <- superstore_clean %>%
  group_by(`Product Name`) %>%
  summarize(Total_Sales = sum(Sales, na.rm = TRUE)) %>%
  arrange(desc(Total_Sales)) %>%
  slice_head(n = 10)

top_products

# A tibble: 10 × 2
   `Product Name`                                                    Total_Sales
   <chr>                                                                   <dbl>
 1 "Canon imageCLASS 2200 Advanced Copier"                                61600.
 2 "Fellowes PB500 Electric Punch Plastic Comb Binding Machine with…      27453.
 3 "Cisco TelePresence System EX90 Videoconferencing Unit"                22638.
 4 "HON 5400 Series Task Chairs for Big and Tall"                         21871.
 5 "GBC DocuBind TL300 Electric Binding System"                           19823.
 6 "GBC Ibimaster 500 Manual ProClick Binding System"                     19024.
 7 "Hewlett Packard LaserJet 3310 Copier"                                 18840.
 8 "HP Designjet T520 Inkjet Large Format Printer - 24\" Color"           18375.
 9 "GBC DocuBind P400 Electric Binding System"                            17965.
10 "High Speed Automatic Electric Letter Opener"                          17030.

ggplot(top_products, aes(x = reorder(`Product Name`, Total_Sales), y = Total_Sales)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Products by Sales",
    x = "Product",
    y = "Sales"
  ) +
  theme_minimal()

Conclusion

This extended example demonstrates how Tidyverse enables:

Efficient data cleaning (mutate, lubridate)
aggregation (group_by, summarize)
visualization (ggplot2)

These additions make the analysis more aligned with real-world retail analytics workflows.