1. Introduction

This project analyzes the Online Retail dataset from the UCI Machine Learning Repository.
The goal is to clean, explore, and prepare the data for meaningful marketing insights such as customer behavior, product sales trends, and seasonal patterns.


2. Load Libraries and Raw Data

library(dplyr)
library(lubridate)
library(ggplot2)
library(readr)

# Load the dataset - update the path as needed
retail_raw <- read_csv("~/Desktop/DA Portfolio/online-retail-marketing-analytics/data/raw/data_raw.csv")

# Peek at first few rows
head(retail_raw)
## # A tibble: 6 × 8
##   InvoiceNo StockCode Description      Quantity InvoiceDate UnitPrice CustomerID
##   <chr>     <chr>     <chr>               <dbl> <chr>           <dbl>      <dbl>
## 1 536365    85123A    WHITE HANGING H…        6 12/1/2010 …      2.55      17850
## 2 536365    71053     WHITE METAL LAN…        6 12/1/2010 …      3.39      17850
## 3 536365    84406B    CREAM CUPID HEA…        8 12/1/2010 …      2.75      17850
## 4 536365    84029G    KNITTED UNION F…        6 12/1/2010 …      3.39      17850
## 5 536365    84029E    RED WOOLLY HOTT…        6 12/1/2010 …      3.39      17850
## 6 536365    22752     SET 7 BABUSHKA …        2 12/1/2010 …      7.65      17850
## # ℹ 1 more variable: Country <chr>

2.1 Preliminary Cleaning for Consistency

The raw dataset includes canceled transactions, negative values, and anomalous entries that skew revenue metrics. Before summary or exploratory analysis, we’ll apply key filters to ensure cleaner downstream insights.

retail_clean <- retail_raw %>%
  mutate(
    InvoiceDate = parse_date_time(InvoiceDate, orders = c("dmy HM", "mdy HM")),
    Revenue = Quantity * UnitPrice
  ) %>%
  filter(
    !is.na(InvoiceDate),
    Quantity > 0, Quantity < 1000,
    UnitPrice > 0, Revenue < 10000,
    !grepl("^C", InvoiceNo)
  )

3. Initial Data Overview

summary(retail_raw)
##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##  InvoiceDate          UnitPrice           CustomerID       Country         
##  Length:541909      Min.   :-11062.06   Min.   :12346    Length:541909     
##  Class :character   1st Qu.:     1.25   1st Qu.:13953    Class :character  
##  Mode  :character   Median :     2.08   Median :15152    Mode  :character  
##                     Mean   :     4.61   Mean   :15288                      
##                     3rd Qu.:     4.13   3rd Qu.:16791                      
##                     Max.   : 38970.00   Max.   :18287                      
##                                         NA's   :135080

The initial summary reveals several data quality issues, including missing values in InvoiceDate and CustomerID, negative UnitPrice values, and extreme outliers in Quantity. Addressing these issues is critical to ensure reliable analysis.


4. Data Cleaning and Preparation

retail_clean <- retail_raw %>%
  # Convert InvoiceDate to POSIXct date-time format
  mutate(InvoiceDate = dmy_hm(InvoiceDate)) %>%
  # Filter out rows with missing InvoiceDate (key for time analysis)
  filter(!is.na(InvoiceDate)) %>%
  # Ensure Quantity and UnitPrice are positive and reasonable
  filter(Quantity > 0, UnitPrice > 0, Quantity < 1000) %>%
  # Remove cancelled transactions (InvoiceNo starting with 'C')
  filter(!grepl("^C", InvoiceNo)) %>%
  # Optional: filter to only transactions with CustomerID for customer-level insights
  filter(!is.na(CustomerID))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `InvoiceDate = dmy_hm(InvoiceDate)`.
## Caused by warning:
## !  308950 failed to parse.
# Summary after cleaning
summary(retail_clean)
##   InvoiceNo          StockCode         Description           Quantity     
##  Length:168566      Length:168566      Length:168566      Min.   :  1.00  
##  Class :character   Class :character   Class :character   1st Qu.:  2.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :  6.00  
##                                                           Mean   : 12.05  
##                                                           3rd Qu.: 12.00  
##                                                           Max.   :992.00  
##   InvoiceDate                    UnitPrice          CustomerID   
##  Min.   :2010-01-12 08:26:00   Min.   :   0.040   Min.   :12347  
##  1st Qu.:2011-03-05 12:12:00   1st Qu.:   1.250   1st Qu.:13881  
##  Median :2011-06-09 09:51:00   Median :   1.950   Median :15192  
##  Mean   :2011-05-16 06:55:47   Mean   :   3.148   Mean   :15295  
##  3rd Qu.:2011-09-06 13:08:00   3rd Qu.:   3.750   3rd Qu.:16873  
##  Max.   :2011-12-10 17:19:00   Max.   :8142.750   Max.   :18287  
##    Country         
##  Length:168566     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

5. Feature Engineering: Add Revenue Column

retail_clean <- retail_clean %>%
  mutate(Revenue = Quantity * UnitPrice)

summary(retail_clean$Revenue)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.10     4.68    11.70    21.45    19.80 38970.00

6. Exploratory Data Analysis (EDA)

6.1 Total Revenue Over Time

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
# Aggregate daily revenue
daily_revenue <- retail_clean %>%
  group_by(Date = as.Date(InvoiceDate)) %>%
  summarise(DailyRevenue = sum(Revenue))

# Plot
ggplot(daily_revenue, aes(x = Date, y = DailyRevenue)) +
  geom_line(color = "steelblue") +
  scale_y_continuous(labels = dollar_format(prefix="$")) +
  labs(
    title = "Daily Revenue Over Time",
    x = "Date",
    y = "Revenue (USD)"
  ) +
  theme_minimal()

6.2 Top 10 Best-Selling Products by Revenue

top_products <- retail_clean %>%
  group_by(Description) %>%
  summarise(TotalRevenue = sum(Revenue)) %>%
  arrange(desc(TotalRevenue)) %>%
  slice_head(n = 10)

# Plot
ggplot(top_products, aes(x = reorder(Description, TotalRevenue), y = TotalRevenue)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  scale_y_continuous(labels = dollar_format(prefix="$")) +
  labs(
    title = "Top 10 Best-Selling Products by Revenue",
    x = "Product",
    y = "Total Revenue (USD)"
  ) +
  theme_minimal()

6.3 Data Quality Investigation: Revenue Volatility and Product Anomalies

During exploratory analysis, the daily revenue plot showed increasing volatility starting around January 2011.
I performed additional checks to assess the cause, including examining daily transaction counts and inspecting transactions with negative revenue.

Additionally, the Top 10 Products plot revealed “Manual” as a high-revenue product, likely an artifact of system or data entry errors.
I removed this and other known placeholder descriptions from the analysis to ensure cleaner product insights.

6.3.1 Daily Number of Transactions Over Time

daily_transactions <- retail_clean %>%
  group_by(Date = as.Date(InvoiceDate)) %>%
  summarise(NumTransactions = n_distinct(InvoiceNo))

ggplot(daily_transactions, aes(x = Date, y = NumTransactions)) +
  geom_line(color = "purple") +
  labs(
    title = "Daily Number of Transactions",
    x = "Date",
    y = "Number of Transactions"
  ) +
  theme_minimal()

6.3.2 Check for Negative Revenue Values

# Check if any negative revenue values remain
retail_clean %>% filter(Revenue < 0)
## # A tibble: 0 × 9
## # ℹ 9 variables: InvoiceNo <chr>, StockCode <chr>, Description <chr>,
## #   Quantity <dbl>, InvoiceDate <dttm>, UnitPrice <dbl>, CustomerID <dbl>,
## #   Country <chr>, Revenue <dbl>

6.3.3 Investigate “Manual” and Other Anomalous Product Names

# View rows where Description == "Manual"
retail_clean %>% filter(Description == "Manual")
## # A tibble: 130 × 9
##    InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
##    <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
##  1 536569    M         Manual             1 2010-01-12 15:35:00      1.25
##  2 536569    M         Manual             1 2010-01-12 15:35:00     19.0 
##  3 536981    M         Manual             2 2010-03-12 14:26:00      0.85
##  4 537077    M         Manual            12 2010-05-12 11:59:00      0.42
##  5 537137    M         Manual            36 2010-05-12 12:43:00      0.85
##  6 537140    M         Manual             1 2010-05-12 12:53:00      0.42
##  7 537140    M         Manual             1 2010-05-12 12:53:00      0.85
##  8 537199    M         Manual             1 2010-05-12 14:06:00      0.65
##  9 537208    M         Manual             4 2010-05-12 15:12:00      0.85
## 10 537225    M         Manual             1 2010-05-12 16:41:00      1   
## # ℹ 120 more rows
## # ℹ 3 more variables: CustomerID <dbl>, Country <chr>, Revenue <dbl>

6.3.4 Filter Out Known Anomalous Product Names

# Define list of known bad product descriptions
bad_descriptions <- c("Manual", "POSTAGE", "Discount", "DOTCOM POSTAGE", "CARRIAGE")

# Filter them out of the dataset
retail_clean <- retail_clean %>%
  filter(!Description %in% bad_descriptions)

# Re-run the Top 10 Products plot after cleaning
top_products_clean <- retail_clean %>%
  group_by(Description) %>%
  summarise(TotalRevenue = sum(Revenue)) %>%
  arrange(desc(TotalRevenue)) %>%
  slice_head(n = 10)

# Plot cleaned Top 10
ggplot(top_products_clean, aes(x = reorder(Description, TotalRevenue), y = TotalRevenue)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar_format(prefix="$")) +
  labs(
    title = "Top 10 Best-Selling Products by Revenue (Cleaned)",
    x = "Product",
    y = "Total Revenue (USD)"
  ) +
  theme_minimal()

6.4 Re-Check Key Visuals With Cleaned Data

Now that we’ve removed noisy or anomalous records (such as negative revenue and placeholder products like “Manual”), we revisit the core business visuals to validate our findings using the clean dataset.

6.4.1 Daily Revenue (Cleaned)

revenue_by_day_clean <- retail_clean %>%
  group_by(Date = as.Date(InvoiceDate)) %>%
  summarise(DailyRevenue = sum(Revenue, na.rm = TRUE))

ggplot(revenue_by_day_clean, aes(x = Date, y = DailyRevenue)) +
  geom_line(color = "steelblue") +
  labs(
    title = "Daily Revenue Over Time (Cleaned)",
    x = "Date",
    y = "Revenue (USD)"
  ) +
  scale_y_continuous(labels = scales::dollar_format(prefix="$")) +
  theme_minimal()

6.5 Exporting the Cleaned Dataset

After thorough cleaning and validation, the dataset is exported as a CSV file.
This clean data will serve as the foundation for further analysis or visualization in other tools like Tableau or Power BI.
Exporting ensures reproducibility and collaboration with stakeholders or team members.

# Create a folder for clean data if it doesn't exist
if(!dir.exists("data/clean")) {
  dir.create("data/clean", recursive = TRUE)
}

# Export the cleaned dataset
write_csv(retail_clean, "~/Desktop/DA Portfolio/online-retail-marketing-analytics/data/processed/retail_clean.csv")


7. Conclusion and Next Steps

This marketing analytics project explored the UCI Online Retail dataset using R for data cleaning, transformation, and visualization. Key takeaways from the analysis include:

  • Revenue Trends: Sales remained relatively stable through 2010 but became erratic in early 2011, suggesting either changes in customer behavior, inventory issues, or potential anomalies in data collection.
  • Top Products: Several products consistently drove high revenue, notably decorative home goods like the WHITE HANGING HEART T-LIGHT HOLDER. However, artifacts like “Manual” and “POSTAGE” initially skewed results, highlighting the importance of rigorous data quality checks.
  • Data Cleaning: Parsing datetime formats, filtering out cancellations, correcting pricing issues, and handling placeholder product names were essential to ensuring meaningful insights.
  • Visual Insights: Line plots and bar charts helped detect issues and refine the analysis, while also effectively communicating trends and outliers.

Recommendations

  • Strengthen Data Governance: Introduce validation rules and product naming conventions at the point of sale to reduce inconsistency.
  • Segment Customers: Leverage CustomerID data to perform RFM (Recency, Frequency, Monetary) analysis and identify loyal customers for targeted marketing.
  • Optimize Inventory: Use sales trends to focus stock on best-selling products and minimize overhead from low performers or non-sale SKUs.

Next Steps

  • Interactive Dashboard: Build a Tableau or Power BI dashboard to make insights accessible to stakeholders.
  • Customer Behavior Modeling: Investigate repeat purchasing patterns and customer lifetime value if customer history is available.
  • Time Series Forecasting: Apply forecasting techniques (like ARIMA or Prophet) to predict future sales and plan for peak demand periods.

Final Note

This project demonstrates my ability to work with messy real-world data and transform it into actionable marketing insights. The analysis showcases skills in data wrangling, visualization, and communication — all critical components of a data analyst role.