This project analyzes the Online Retail dataset from the UCI Machine
Learning Repository.
The goal is to clean, explore, and prepare the data for meaningful
marketing insights such as customer behavior, product sales trends, and
seasonal patterns.
library(dplyr)
library(lubridate)
library(ggplot2)
library(readr)
# Load the dataset - update the path as needed
retail_raw <- read_csv("~/Desktop/DA Portfolio/online-retail-marketing-analytics/data/raw/data_raw.csv")
# Peek at first few rows
head(retail_raw)
## # A tibble: 6 × 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 536365 85123A WHITE HANGING H… 6 12/1/2010 … 2.55 17850
## 2 536365 71053 WHITE METAL LAN… 6 12/1/2010 … 3.39 17850
## 3 536365 84406B CREAM CUPID HEA… 8 12/1/2010 … 2.75 17850
## 4 536365 84029G KNITTED UNION F… 6 12/1/2010 … 3.39 17850
## 5 536365 84029E RED WOOLLY HOTT… 6 12/1/2010 … 3.39 17850
## 6 536365 22752 SET 7 BABUSHKA … 2 12/1/2010 … 7.65 17850
## # ℹ 1 more variable: Country <chr>
The raw dataset includes canceled transactions, negative values, and anomalous entries that skew revenue metrics. Before summary or exploratory analysis, we’ll apply key filters to ensure cleaner downstream insights.
retail_clean <- retail_raw %>%
mutate(
InvoiceDate = parse_date_time(InvoiceDate, orders = c("dmy HM", "mdy HM")),
Revenue = Quantity * UnitPrice
) %>%
filter(
!is.na(InvoiceDate),
Quantity > 0, Quantity < 1000,
UnitPrice > 0, Revenue < 10000,
!grepl("^C", InvoiceNo)
)
summary(retail_raw)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 9.55
## 3rd Qu.: 10.00
## Max. : 80995.00
##
## InvoiceDate UnitPrice CustomerID Country
## Length:541909 Min. :-11062.06 Min. :12346 Length:541909
## Class :character 1st Qu.: 1.25 1st Qu.:13953 Class :character
## Mode :character Median : 2.08 Median :15152 Mode :character
## Mean : 4.61 Mean :15288
## 3rd Qu.: 4.13 3rd Qu.:16791
## Max. : 38970.00 Max. :18287
## NA's :135080
The initial summary reveals several data quality issues, including missing values in InvoiceDate and CustomerID, negative UnitPrice values, and extreme outliers in Quantity. Addressing these issues is critical to ensure reliable analysis.
retail_clean <- retail_raw %>%
# Convert InvoiceDate to POSIXct date-time format
mutate(InvoiceDate = dmy_hm(InvoiceDate)) %>%
# Filter out rows with missing InvoiceDate (key for time analysis)
filter(!is.na(InvoiceDate)) %>%
# Ensure Quantity and UnitPrice are positive and reasonable
filter(Quantity > 0, UnitPrice > 0, Quantity < 1000) %>%
# Remove cancelled transactions (InvoiceNo starting with 'C')
filter(!grepl("^C", InvoiceNo)) %>%
# Optional: filter to only transactions with CustomerID for customer-level insights
filter(!is.na(CustomerID))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `InvoiceDate = dmy_hm(InvoiceDate)`.
## Caused by warning:
## ! 308950 failed to parse.
# Summary after cleaning
summary(retail_clean)
## InvoiceNo StockCode Description Quantity
## Length:168566 Length:168566 Length:168566 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 2.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 12.05
## 3rd Qu.: 12.00
## Max. :992.00
## InvoiceDate UnitPrice CustomerID
## Min. :2010-01-12 08:26:00 Min. : 0.040 Min. :12347
## 1st Qu.:2011-03-05 12:12:00 1st Qu.: 1.250 1st Qu.:13881
## Median :2011-06-09 09:51:00 Median : 1.950 Median :15192
## Mean :2011-05-16 06:55:47 Mean : 3.148 Mean :15295
## 3rd Qu.:2011-09-06 13:08:00 3rd Qu.: 3.750 3rd Qu.:16873
## Max. :2011-12-10 17:19:00 Max. :8142.750 Max. :18287
## Country
## Length:168566
## Class :character
## Mode :character
##
##
##
retail_clean <- retail_clean %>%
mutate(Revenue = Quantity * UnitPrice)
summary(retail_clean$Revenue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 4.68 11.70 21.45 19.80 38970.00
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
# Aggregate daily revenue
daily_revenue <- retail_clean %>%
group_by(Date = as.Date(InvoiceDate)) %>%
summarise(DailyRevenue = sum(Revenue))
# Plot
ggplot(daily_revenue, aes(x = Date, y = DailyRevenue)) +
geom_line(color = "steelblue") +
scale_y_continuous(labels = dollar_format(prefix="$")) +
labs(
title = "Daily Revenue Over Time",
x = "Date",
y = "Revenue (USD)"
) +
theme_minimal()
top_products <- retail_clean %>%
group_by(Description) %>%
summarise(TotalRevenue = sum(Revenue)) %>%
arrange(desc(TotalRevenue)) %>%
slice_head(n = 10)
# Plot
ggplot(top_products, aes(x = reorder(Description, TotalRevenue), y = TotalRevenue)) +
geom_col(fill = "darkorange") +
coord_flip() +
scale_y_continuous(labels = dollar_format(prefix="$")) +
labs(
title = "Top 10 Best-Selling Products by Revenue",
x = "Product",
y = "Total Revenue (USD)"
) +
theme_minimal()
During exploratory analysis, the daily revenue plot showed increasing
volatility starting around January 2011.
I performed additional checks to assess the cause, including examining
daily transaction counts and inspecting transactions with negative
revenue.
Additionally, the Top 10 Products plot revealed “Manual” as a
high-revenue product, likely an artifact of system or data entry
errors.
I removed this and other known placeholder descriptions from the
analysis to ensure cleaner product insights.
daily_transactions <- retail_clean %>%
group_by(Date = as.Date(InvoiceDate)) %>%
summarise(NumTransactions = n_distinct(InvoiceNo))
ggplot(daily_transactions, aes(x = Date, y = NumTransactions)) +
geom_line(color = "purple") +
labs(
title = "Daily Number of Transactions",
x = "Date",
y = "Number of Transactions"
) +
theme_minimal()
# Check if any negative revenue values remain
retail_clean %>% filter(Revenue < 0)
## # A tibble: 0 × 9
## # ℹ 9 variables: InvoiceNo <chr>, StockCode <chr>, Description <chr>,
## # Quantity <dbl>, InvoiceDate <dttm>, UnitPrice <dbl>, CustomerID <dbl>,
## # Country <chr>, Revenue <dbl>
# View rows where Description == "Manual"
retail_clean %>% filter(Description == "Manual")
## # A tibble: 130 × 9
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 536569 M Manual 1 2010-01-12 15:35:00 1.25
## 2 536569 M Manual 1 2010-01-12 15:35:00 19.0
## 3 536981 M Manual 2 2010-03-12 14:26:00 0.85
## 4 537077 M Manual 12 2010-05-12 11:59:00 0.42
## 5 537137 M Manual 36 2010-05-12 12:43:00 0.85
## 6 537140 M Manual 1 2010-05-12 12:53:00 0.42
## 7 537140 M Manual 1 2010-05-12 12:53:00 0.85
## 8 537199 M Manual 1 2010-05-12 14:06:00 0.65
## 9 537208 M Manual 4 2010-05-12 15:12:00 0.85
## 10 537225 M Manual 1 2010-05-12 16:41:00 1
## # ℹ 120 more rows
## # ℹ 3 more variables: CustomerID <dbl>, Country <chr>, Revenue <dbl>
# Define list of known bad product descriptions
bad_descriptions <- c("Manual", "POSTAGE", "Discount", "DOTCOM POSTAGE", "CARRIAGE")
# Filter them out of the dataset
retail_clean <- retail_clean %>%
filter(!Description %in% bad_descriptions)
# Re-run the Top 10 Products plot after cleaning
top_products_clean <- retail_clean %>%
group_by(Description) %>%
summarise(TotalRevenue = sum(Revenue)) %>%
arrange(desc(TotalRevenue)) %>%
slice_head(n = 10)
# Plot cleaned Top 10
ggplot(top_products_clean, aes(x = reorder(Description, TotalRevenue), y = TotalRevenue)) +
geom_col(fill = "darkorange") +
coord_flip() +
scale_y_continuous(labels = scales::dollar_format(prefix="$")) +
labs(
title = "Top 10 Best-Selling Products by Revenue (Cleaned)",
x = "Product",
y = "Total Revenue (USD)"
) +
theme_minimal()
Now that we’ve removed noisy or anomalous records (such as negative revenue and placeholder products like “Manual”), we revisit the core business visuals to validate our findings using the clean dataset.
revenue_by_day_clean <- retail_clean %>%
group_by(Date = as.Date(InvoiceDate)) %>%
summarise(DailyRevenue = sum(Revenue, na.rm = TRUE))
ggplot(revenue_by_day_clean, aes(x = Date, y = DailyRevenue)) +
geom_line(color = "steelblue") +
labs(
title = "Daily Revenue Over Time (Cleaned)",
x = "Date",
y = "Revenue (USD)"
) +
scale_y_continuous(labels = scales::dollar_format(prefix="$")) +
theme_minimal()
After thorough cleaning and validation, the dataset is exported as a
CSV file.
This clean data will serve as the foundation for further analysis or
visualization in other tools like Tableau or Power BI.
Exporting ensures reproducibility and collaboration with stakeholders or
team members.
# Create a folder for clean data if it doesn't exist
if(!dir.exists("data/clean")) {
dir.create("data/clean", recursive = TRUE)
}
# Export the cleaned dataset
write_csv(retail_clean, "~/Desktop/DA Portfolio/online-retail-marketing-analytics/data/processed/retail_clean.csv")
This marketing analytics project explored the UCI Online Retail dataset using R for data cleaning, transformation, and visualization. Key takeaways from the analysis include:
This project demonstrates my ability to work with messy real-world data and transform it into actionable marketing insights. The analysis showcases skills in data wrangling, visualization, and communication — all critical components of a data analyst role.