Analyzing Amazon’s 2019 Technology Product Sales Data

Section 0: Project Outline

Objective

Analyze a data set of Amazon sales of technology products in several urban ZIP codes to find interesting stories and insights, presented through interconnected visualizations and a coherent narrative.

Target Audience

Amazon’s technology, sales, logistics and operations professionals, including key decision-makers and operational staff like General Managers, Logistic Staff, Upper Management, and Port & Warehouse employees.

Prompt

Amazon’s Logistics Division strategizes for product imports, aiming to optimize import and transportation costs in alignment with revenue to maximize profits. The majority of imports are transported via the shipping industry due to its cost-effectiveness compared to air transportation. The division is focused on understanding the breakdown of import costs to achieve its objectives.

Data

Amazon’s 2019 Online Product Sales Data for 19 technology products across 10 Zip Codes.

Scope/Analysis

  • Identify high-revenue top performing products that sell well in the market across categories
  • Understand customer behavior to analyse product trends and patterns that are most frequently purchased
  • Understand product populaity across zip codes, may be divide into east coast and west coast for a comparison product analysis
  • Understand shipping costs associated for logistics department to optimise the space in the shipping container
  • Target advertising and promotions: recommend specific Zip Codes and products for increased marketing efforts based on insights.

This project will provide valuable insights to Amazon’s Technology Product sales, logistics, pricing and product divisions with historic data analysis, helping them optimize marketing efforts, manage import and shipping costs, and make informed decisions about future product sales and potential profit margins insights for the most profitable and most frequently purchased products.

Section I: Preparing the Data

Firstly, we need to load the necessary libraries, set the working directory and load the data to get started.

#importing the libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(vtable)
library(scales)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(stringr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggiraph)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(rio)
## 
## Attaching package: 'rio'
## The following object is masked from 'package:plotly':
## 
##     export
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(treemapify)

#Session → Set Working Directory 
setwd("~/Desktop/BUAN 5210/DTC")

#To remove scientific notation form labels
options(scipen=1000000)

#importing the dataset
sales <- import('sales_data.Rdata')
dat_zip <- read.csv("zip_info.csv")

Section II: Exploratory Data Analysis

head(sales)
vtable(sales, lush = TRUE, factor.limit = 9, char.values = TRUE)
sales
Name Class Values Missing Summary
Product character ‘20in Monitor’ ‘27in 4K Gaming Monitor’ ‘27in FHD Monitor’ ‘34in Ultrawide Monitor’ ‘AA Batteries (4-pack)’ ‘AAA Batteries (4-pack)’ ‘Apple Airpods Headphones’ ‘Bose SoundSport Headphones’ ‘Flatscreen TV’ and 10 more 0 nuniq: 19
Quantity character ‘1’ ‘2’ ‘3’ ‘4’ ‘5’ ‘6’ ‘7’ ‘8’ ‘9’ 0 nuniq: 9
PriceEach character ‘109.99’ ‘11.95’ ‘11.99’ ‘14.95’ ‘149.99’ ‘150’ ‘150.0’ ‘1700’ ‘1700.0’ and 14 more 0 nuniq: 23
DateTime POSIXct Time: 2019-01-01 03:07:00 to 2020-01-01 05:13:00 0 median: 2019-07-17 20:40:30, nuniq: 142395
Date Date Time: 2019-01-01 to 2020-01-01 0 median: 2019-07-17, nuniq: 366
ZIP character ‘02215’ ‘04101’ ‘10001’ ‘30301’ ‘73301’ ‘75001’ ‘90001’ ‘94016’ ‘97035’ and 1 more 0 nuniq: 10
State character ‘CA’ ‘GA’ ‘MA’ ‘ME’ ‘NY’ ‘OR’ ‘TX’ ‘WA’ 0 nuniq: 8
City character ‘Atlanta’ ‘Austin’ ‘Boston’ ‘Dallas’ ‘Los Angeles’ ‘New York City’ ‘Portland’ ‘San Francisco’ ‘Seattle’ 0 nuniq: 9

The sales data contains 8 variables:

head(dat_zip)
vtable(dat_zip, lush = TRUE)
dat_zip
Name Class Values Missing Summary
ZIP integer Num: 2215 to 98101 0 mean: 57407.3, sd: 40907.601, nuniq: 10
TotalPopulation integer Num: 12792 to 58975 0 mean: 26051.6, sd: 12885.755, nuniq: 10
MedianHHIncome integer Num: 46309 to 119370 0 mean: 81151, sd: 25319.33, nuniq: 10
PCIncome integer Num: 14814 to 100364 0 mean: 57085.4, sd: 28034.787, nuniq: 10
MedianAge numeric Num: 21.6 to 44.3 0 mean: 34.02, sd: 6.243, nuniq: 10
Race_White integer Num: 9231 to 22921 0 mean: 16140.6, sd: 5012.111, nuniq: 10
Race_Black integer Num: 459 to 5483 0 mean: 2179.9, sd: 1366.307, nuniq: 10
Race_American_Indian integer Num: 148 to 802 0 mean: 426.3, sd: 229.788, nuniq: 10
Race_Asian integer Num: 173 to 10134 0 mean: 3238.9, sd: 3245.489, nuniq: 10
Race_Pacific_Islander integer Num: 0 to 237 0 mean: 71.3, sd: 81.024, nuniq: 9
Race_Other integer Num: 181 to 30491 0 mean: 5003.8, sd: 9745.687, nuniq: 10
Ethnicity_Hispanic integer Num: 609 to 53085 0 mean: 9223, sd: 16630.268, nuniq: 10
Citizens integer Num: 10432 to 24069 0 mean: 17171.8, sd: 4595.578, nuniq: 10

The zipcode data contains 13 variables:

Then, we check for missing values in the data and if present, we analyse and remove them

# Check for missing values
sum(is.na(sales))
## [1] 0
sum(is.na(dat_zip))
## [1] 0

We can see that, there are 0 missing values across all variables in both sales and zip-code data frames.

The Quantity and price per product (PriceEach) are categorized as ‘character’ instead of ‘numeric’. Additionally, we don’t have the total sales column in the data. Lets correct and create it for further analysis.

q = as.numeric(sales$Quantity) #quantity
p = as.numeric(sales$PriceEach) #price

sales <- sales %>%
  mutate(Sale = p * q)

Section III: Data Vizulisations

Addressing the intersection of high sales volume and premium pricing.

Amazon’s Logistics Department aims to strike a balance between products that are both highly sought-after and carry a premium price tag for several reasons:

  • Rapid Depletion: Such products tend to fly off the shelves quickly.
  • Price Surge Potential: During shortages, they can command higher prices, maximizing profits.

Finding the right mix between items that can fetch premium prices and those constituting regular sales is pivotal. Let’s delve into the pricing and revenue breakdown of various product by categories to gain deeper insights.

dtp<- sales %>%
  select(Product, PriceEach, Quantity)%>%
  group_by(Product, PriceEach) %>%
  summarize(Total_revenue= sum(as.numeric(Quantity))*as.numeric(PriceEach))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'Product', 'PriceEach'. You can override
## using the `.groups` argument.
dtp <- dtp %>% 
  distinct()

dtp %>%
  arrange(desc(Total_revenue)) %>%
  kbl() %>%
  kable_styling()
Product PriceEach Total_revenue
Macbook Pro Laptop 1700 8000200.00
iPhone 700 4767700.00
ThinkPad Laptop 999.99 4129958.70
Google Phone 600 3298200.00
27in 4K Gaming Monitor 389.99 2435097.56
34in Ultrawide Monitor 379.99 2355558.01
Apple Airpods Headphones 150 2334300.00
Flatscreen TV 300 1439700.00
Bose SoundSport Headphones 99.99 1345565.43
27in FHD Monitor 149.99 1132424.50
Vareebadd Phone 400 823600.00
20in Monitor 109.99 454148.71
LG Washing Machine 600.0 399600.00
LG Dryer 600.0 387600.00
Lightning Charging Cable 14.95 347094.15
USB-C Charging Cable 11.95 286501.25
Wired Headphones 11.99 246478.43
AA Batteries (4-pack) 3.84 106118.40
AAA Batteries (4-pack) 2.99 92740.83
Macbook Pro Laptop 1700.0 37400.00
iPhone 700.0 26600.00
Google Phone 600.0 21000.00
Apple Airpods Headphones 150.0 14850.00
Flatscreen TV 300.0 6000.00
Vareebadd Phone 400.0 3600.00

Visualization 1: The interactive bar graph with total sales of all the 19 products in 2019 across 10 urban zip codes

sales_1 <- sales %>%
  group_by(Product) %>%
  summarise_at(vars(Sale), list(total_sale = sum))

sales_1 <- sales_1 %>%
  mutate_at(vars(total_sale), funs(./ 10^3))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
sales_1 <- sales_1 %>%
  mutate(tooltip_text = paste0(toupper(Product), "\n", "$", round(total_sale, 2)))

plot1 <- ggplot(sales_1, 
                           aes(x = reorder(Product, total_sale), 
                               y = total_sale,
                               tooltip = tooltip_text, data_id = Product 
                           )) +
  geom_col_interactive(color = "black", fill="#816654", size = 0.5) +  
  theme_minimal() +
  theme(axis.text=element_text(size = 7), panel.grid.major.y = element_blank()) + 
  labs(title = "Total Sales For A Product In 2019",
       subtitle = "Sale Data is Limited to 10 Urban Zip Codes"
  ) +
  ylab("Sales (In Thousand Dollar$)") +
  xlab("Product") +
  coord_flip()

girafe(ggobj = plot1, width_svg = 5.85, height_svg = 3.7)

As observed, laptops, phones, and monitors were the top three products sold across the zip codes analyzed in 2019.
The top 3 products with ‘Premium Price’ and high sales number, therefore, are

  1. Macbook Pro Laptop (Highest Price and Purchase frequency)
  2. iphone (Highest Number of Purchases pertaining to price)
  3. ThinkPad Laptop (Highest Price and Relatively high purchase frequency)

Visualization 2: Top Selling Most Profitable Premium Products

dtp %>%
  arrange(PriceEach)%>%
  filter(str_detect(Product,'Macbook|iPhone|ThinkPad')) %>%
  ggplot(aes(Product, Total_revenue, color= PriceEach))+ geom_bar(stat='identity',
                                                                  show.legend = FALSE)+ coord_flip()+
  xlab('Total Revenue')+
  ylab('Product')+
  theme(
    panel.grid.major.y = element_blank(),
      axis.text.x = element_text() 
  )+
  scale_y_continuous(breaks= scales::pretty_breaks(n=6),
                     labels= label_number(suffix= " K", scale= 1e-6))+
  scale_color_brewer()+
   theme(plot.background=element_rect(fill="#051e3e"),
        panel.background = element_rect(fill="#051e3e"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.line.x.bottom = element_line(color="white"),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold"),
        axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
   labs(title="Top 3 Most Profitable Premium Products",
       subtitle=str_wrap("Best Selling and Most Profitable Premium Products",75),
       caption = str_wrap("Total Revenue = Price of an article * Quantity * Number of transactions",width = 105))

These are the products that fall in a high price point and yet still sell relatively well are called most profitable premium products

In the next section, our analysis focuses on the frequency distribution of products within their respective categories. We achieve this by aggregating similar products from different brands, ensuring a more comprehensive understanding of the market composition.

Visualization 3: Addressing the Product Frequency Distribution

Let us look at the Frequency Distribution of different product by their product categories, such that similar products from different brands are clubbed together. Let’s look at some of the most frequently bought items.

#Looking at the frequency distribution table
dts <- sales %>%
  group_by(Product) %>%
  summarize(ProductFrequency= n()) %>%
  arrange(desc(ProductFrequency))

dts %>%
  kbl() %>%
  kable_styling()
Product ProductFrequency
USB-C Charging Cable 21903
Lightning Charging Cable 21658
AAA Batteries (4-pack) 20641
AA Batteries (4-pack) 20577
Wired Headphones 18882
Apple Airpods Headphones 15549
Bose SoundSport Headphones 13325
27in FHD Monitor 7507
iPhone 6842
27in 4K Gaming Monitor 6230
34in Ultrawide Monitor 6181
Google Phone 5525
Flatscreen TV 4800
Macbook Pro Laptop 4724
ThinkPad Laptop 4128
20in Monitor 4101
Vareebadd Phone 2065
LG Washing Machine 666
LG Dryer 646

From the above analysis of the sales data set, the top 3 products that are most frequently bought are:

  1. USB-C Charging Cable
  2. Lightning Charging Cable
  3. AAA Batteries (4-pack)

Lets look at the distribution of these 3 products in the different Zip codes. This might give us a better idea of the State Wise Demand of these products.

Visualization 5: Addressing the Shipping Charges West Coast vs East Coast

Shipping costs vary significantly between the West Coast and East Coast, with a 40-foot container shipment to the West Coast priced at around $20,000, compared to $14,000 for the East Coast.

While a 40 foot container to the former costs about $20k, similar shipment costs $14k for Eastcoast.

A manager would also be interested in knowing what products are popular in east coast vs west coast. Specifically, these products

  • USB C charging Cable

  • Lightning Cable

  • Macbook Pro Laptop

  • iPhone

Here, the top 2 are the most bought products, while the bottom 2 are the premium products that sell well.

dt_coast <- sales %>%
  mutate(region= case_when(
    as.numeric(ZIP)== 94016 ~ "West Coast",
    as.numeric(ZIP)== 90001 ~ "West Coast", 
    as.numeric(ZIP)== 10001 ~ "East Coast",
    as.numeric(ZIP)== 02215 ~ "East Coast",
    as.numeric(ZIP)== 75001 ~ "East Coast",
    as.numeric(ZIP)== 98101 ~ "West Coast", 
    as.numeric(ZIP)== 97035 ~ "West Coast",
    as.numeric(ZIP)== 73301 ~ "East Coast",
    as.numeric(ZIP)== 04101 ~ "East Coast",
    as.numeric(ZIP)== 30301 ~ "East Coast",
    
  ))

dt_coast_grp <- dt_coast %>%
  filter(str_detect('Macbook Pro Laptop|USB-C Charging Cable|Lightning Charging Cable|iPhone', Product)) %>%
  group_by(region,Product) %>%
  summarise(ArticlesBought= sum(as.numeric(Quantity)))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
dt_coast_grp %>%
  kbl() %>%
  kable_styling()
region Product ArticlesBought
East Coast Lightning Charging Cable 10851
East Coast Macbook Pro Laptop 2211
East Coast USB-C Charging Cable 11188
East Coast iPhone 3153
West Coast Lightning Charging Cable 12366
West Coast Macbook Pro Laptop 2517
West Coast USB-C Charging Cable 12787
West Coast iPhone 3696
ggplot(dt_coast_grp,
       aes(str_wrap(Product,6), ArticlesBought, fill= region))+
  geom_bar(position="dodge", stat="identity")+
  theme(plot.background=element_rect(fill="#025464"),
        panel.background = element_rect(fill="#025464"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        panel.grid= element_blank(),
        axis.line.x.bottom = element_line(color="white"),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold", size= 8),
        axis.title.x = element_text(colour="white",face="bold",size=8, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=8, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
  scale_y_continuous(breaks= scales::pretty_breaks(n=6),
                     )+
   labs(title="East Coast vs West Coast Product Sale",
       subtitle=str_wrap("Comparing Sales of Popular products in 2 US Regions",75),
       caption = str_wrap("States Like Texas which are geographically closer to East coast are included in East Coast since shipment from East Coast to them is more practical from logistic perspective (Proximity to Major Ports on that end)",width = 79))+
  ylab('No. of Products bought')+
  xlab('Products')

The disparity in product demand between the West Coast and East Coast is evident, with the former showing higher purchases across various product categories, including premium product items like iPhones and Macbook Pros.

Given the significant ports on the West Coast such as Long Beach, Portland, San Francisco, Los Angeles, Tacoma, and Seattle, it becomes imperative to devise a solution for optimizing container space.

Considering the dimensions of the items and the container, along with their profitability, a strategy is proposed to maximize profits while efficiently utilizing space. Based on a profit vs. weight analysis, the optimal allocation is suggested as follows:

  • 40% for iPhones, which are both high in profitability and demand.

  • 30% for Macbook Pros, offering a balance between profitability and space efficiency.

  • 15% each for USB-C chargers and Lightning cables, ensuring inclusion of essential accessories while maintaining a balanced distribution.

This approach aims to maximize profits by prioritizing high-profit items while also considering space constraints and the need for a diverse product range.

Visualization 6: Tree map with Container Space Optimization

Products <- c("Macbook Pro", "USB-C Charger", "Lightning Cable", "iPhone")

value <- c(40,30,15,15)

dft <- data.frame(Products, value) 
ggplot(dft, aes(area = value, fill = Products)) +
  geom_treemap()+
  theme(plot.background=element_rect(fill="#3b5998"),
        panel.background = element_rect(fill="#3b5998"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        panel.grid= element_blank(),
        axis.line.x.bottom = element_line(color="white"),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold", size= 8),
        axis.title.x = element_text(colour="white",face="bold",size=8, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=8, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
   labs(title="West Coast Container Optimization for Profit",
       subtitle=str_wrap("Shows size allotment for different products for shipment",75),
       caption = str_wrap("Calculation is based on standard 40-foot shipment container size as well as Apple Packages for Laptops and USB cables",width = 79))+
  scale_color_manual(values= c('#4b3832','#854442','#fff4e6','#3c2f2f'))

Section IV: Reflection/Conclusion:

  1. Sales Analysis by Zip Code (Visualization 1):

    • Provides insight into sales performance across different urban zip codes.

    • Identifies areas with high sales volume and premium pricing, allowing for targeted marketing and inventory management strategies.

  2. Top Selling Premium Products (Visualization 2):

    • Helps identify premium products with high profitability despite their higher price points.

    • Enables Amazon to focus on marketing efforts for these high-margin products and optimize inventory levels.

  3. Product Frequency Distribution (Visualization 3):

    • Analyzes the frequency distribution of products by category, highlighting frequently bought items.

    • Allows Amazon to prioritize stocking and promotion of popular products, ensuring adequate inventory levels to meet demand.

  4. Regional Product Preferences (Visualization 4):

    • Determines the most popular products in each state for frequently bought items.

    • Provides insights into regional preferences, enabling tailored marketing campaigns and inventory allocation strategies.

  5. Shipping Cost Analysis (Visualization 5):

    • Compares shipping costs between the West Coast and East Coast for container shipments.

    • Identifies cost disparities and potential cost-saving opportunities, informing decisions on logistics and distribution strategies.

  6. Container Space Optimization (Visualization 6):

    • Presents solutions for optimizing container space to reduce shipping costs and maximize efficiency.

    • Helps Amazon streamline logistics operations and improve profitability by efficiently utilizing container space, particularly considering major ports on the West Coast.

Thank you for reading!