Analyze a data set of Amazon sales of technology products in several urban ZIP codes to find interesting stories and insights, presented through interconnected visualizations and a coherent narrative.
Amazon’s technology, sales, logistics and operations professionals, including key decision-makers and operational staff like General Managers, Logistic Staff, Upper Management, and Port & Warehouse employees.
Amazon’s Logistics Division strategizes for product imports, aiming to optimize import and transportation costs in alignment with revenue to maximize profits. The majority of imports are transported via the shipping industry due to its cost-effectiveness compared to air transportation. The division is focused on understanding the breakdown of import costs to achieve its objectives.
Amazon’s 2019 Online Product Sales Data for 19 technology products across 10 Zip Codes.
This project will provide valuable insights to Amazon’s Technology Product sales, logistics, pricing and product divisions with historic data analysis, helping them optimize marketing efforts, manage import and shipping costs, and make informed decisions about future product sales and potential profit margins insights for the most profitable and most frequently purchased products.
Firstly, we need to load the necessary libraries, set the working directory and load the data to get started.
#importing the libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(vtable)
library(scales)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(stringr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggiraph)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(rio)
##
## Attaching package: 'rio'
## The following object is masked from 'package:plotly':
##
## export
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(treemapify)
#Session → Set Working Directory
setwd("~/Desktop/BUAN 5210/DTC")
#To remove scientific notation form labels
options(scipen=1000000)
#importing the dataset
sales <- import('sales_data.Rdata')
dat_zip <- read.csv("zip_info.csv")
head(sales)
vtable(sales, lush = TRUE, factor.limit = 9, char.values = TRUE)
| Name | Class | Values | Missing | Summary |
|---|---|---|---|---|
| Product | character | ‘20in Monitor’ ‘27in 4K Gaming Monitor’ ‘27in FHD Monitor’ ‘34in Ultrawide Monitor’ ‘AA Batteries (4-pack)’ ‘AAA Batteries (4-pack)’ ‘Apple Airpods Headphones’ ‘Bose SoundSport Headphones’ ‘Flatscreen TV’ and 10 more | 0 | nuniq: 19 |
| Quantity | character | ‘1’ ‘2’ ‘3’ ‘4’ ‘5’ ‘6’ ‘7’ ‘8’ ‘9’ | 0 | nuniq: 9 |
| PriceEach | character | ‘109.99’ ‘11.95’ ‘11.99’ ‘14.95’ ‘149.99’ ‘150’ ‘150.0’ ‘1700’ ‘1700.0’ and 14 more | 0 | nuniq: 23 |
| DateTime | POSIXct | Time: 2019-01-01 03:07:00 to 2020-01-01 05:13:00 | 0 | median: 2019-07-17 20:40:30, nuniq: 142395 |
| Date | Date | Time: 2019-01-01 to 2020-01-01 | 0 | median: 2019-07-17, nuniq: 366 |
| ZIP | character | ‘02215’ ‘04101’ ‘10001’ ‘30301’ ‘73301’ ‘75001’ ‘90001’ ‘94016’ ‘97035’ and 1 more | 0 | nuniq: 10 |
| State | character | ‘CA’ ‘GA’ ‘MA’ ‘ME’ ‘NY’ ‘OR’ ‘TX’ ‘WA’ | 0 | nuniq: 8 |
| City | character | ‘Atlanta’ ‘Austin’ ‘Boston’ ‘Dallas’ ‘Los Angeles’ ‘New York City’ ‘Portland’ ‘San Francisco’ ‘Seattle’ | 0 | nuniq: 9 |
The sales data contains 8 variables:
Product, the product that has been orderedQuantity, how many of the product was ordered (note
this is a string, you’ll want to fix that!)PriceEach, the price of each item (note this is a
string, you’ll want to fix that!)DateTime and Date, when the order was
placed. DateTime includes both day and time-of-day when the
order was placed, while Date is just the dateZIP, the ZIP code where the order was sent toState and City, the city and state where
the order was sent tohead(dat_zip)
vtable(dat_zip, lush = TRUE)
| Name | Class | Values | Missing | Summary |
|---|---|---|---|---|
| ZIP | integer | Num: 2215 to 98101 | 0 | mean: 57407.3, sd: 40907.601, nuniq: 10 |
| TotalPopulation | integer | Num: 12792 to 58975 | 0 | mean: 26051.6, sd: 12885.755, nuniq: 10 |
| MedianHHIncome | integer | Num: 46309 to 119370 | 0 | mean: 81151, sd: 25319.33, nuniq: 10 |
| PCIncome | integer | Num: 14814 to 100364 | 0 | mean: 57085.4, sd: 28034.787, nuniq: 10 |
| MedianAge | numeric | Num: 21.6 to 44.3 | 0 | mean: 34.02, sd: 6.243, nuniq: 10 |
| Race_White | integer | Num: 9231 to 22921 | 0 | mean: 16140.6, sd: 5012.111, nuniq: 10 |
| Race_Black | integer | Num: 459 to 5483 | 0 | mean: 2179.9, sd: 1366.307, nuniq: 10 |
| Race_American_Indian | integer | Num: 148 to 802 | 0 | mean: 426.3, sd: 229.788, nuniq: 10 |
| Race_Asian | integer | Num: 173 to 10134 | 0 | mean: 3238.9, sd: 3245.489, nuniq: 10 |
| Race_Pacific_Islander | integer | Num: 0 to 237 | 0 | mean: 71.3, sd: 81.024, nuniq: 9 |
| Race_Other | integer | Num: 181 to 30491 | 0 | mean: 5003.8, sd: 9745.687, nuniq: 10 |
| Ethnicity_Hispanic | integer | Num: 609 to 53085 | 0 | mean: 9223, sd: 16630.268, nuniq: 10 |
| Citizens | integer | Num: 10432 to 24069 | 0 | mean: 17171.8, sd: 4595.578, nuniq: 10 |
The zipcode data contains 13 variables:
ZIP, which is a ZIP code we can use to join this data
set with the sales dataTotalPopulation, which is the population in that ZIP
codeMedianHHIncome, which is the median annual household
income in that ZIP. Household income calculates the total income from
everyone in a given household, and then finds the median household
(Income statistics use 2020 ACS instead of 2018)PCIncome, which is the annual per-capita
(i.e;.per-person) income in that ZIP. Per-capita income sums up all the
income earned by everyone in the ZIP, and then divides it by the number
of people in that ZIP (which may include a lot of non-earners, or
children) (Income statistics use 2020 ACS instead of 2018)MedianAge, the median age of people in the ZIP
codeRace_* variables, the number of people of each
broad-category race in that ZIP code. Note that races are not mutually
exclusive. Someone who is, for example, both White and Asian will be
counted once as White and once as AsianEthnicity_Hispanic, which is the number of people who
are Hispanic in the ZIP code. Ethnicity can overlap with any race, so
someone who is, for example, both Hispanic and Black will be counted
once as Hispanic and once as BlackCitizens, which is the number of US citizens living in
the ZIP codeThen, we check for missing values in the data and if present, we analyse and remove them
# Check for missing values
sum(is.na(sales))
## [1] 0
sum(is.na(dat_zip))
## [1] 0
We can see that, there are 0 missing values across all variables in both sales and zip-code data frames.
The Quantity and price per product
(PriceEach) are categorized as ‘character’ instead
of ‘numeric’. Additionally, we don’t have the total sales
column in the data. Lets correct and create it for further analysis.
q = as.numeric(sales$Quantity) #quantity
p = as.numeric(sales$PriceEach) #price
sales <- sales %>%
mutate(Sale = p * q)
sales_1 <- sales %>%
group_by(Product) %>%
summarise_at(vars(Sale), list(total_sale = sum))
sales_1 <- sales_1 %>%
mutate_at(vars(total_sale), funs(./ 10^3))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
sales_1 <- sales_1 %>%
mutate(tooltip_text = paste0(toupper(Product), "\n", "$", round(total_sale, 2)))
plot1 <- ggplot(sales_1,
aes(x = reorder(Product, total_sale),
y = total_sale,
tooltip = tooltip_text, data_id = Product
)) +
geom_col_interactive(color = "black", fill="#816654", size = 0.5) +
theme_minimal() +
theme(axis.text=element_text(size = 7), panel.grid.major.y = element_blank()) +
labs(title = "Total Sales For A Product In 2019",
subtitle = "Sale Data is Limited to 10 Urban Zip Codes"
) +
ylab("Sales (In Thousand Dollar$)") +
xlab("Product") +
coord_flip()
girafe(ggobj = plot1, width_svg = 5.85, height_svg = 3.7)
As observed, laptops, phones, and monitors were the top three
products sold across the zip codes analyzed in 2019.
The top 3 products with ‘Premium Price’ and high sales number,
therefore, are
Let us look at the Frequency Distribution of different product by their product categories, such that similar products from different brands are clubbed together. Let’s look at some of the most frequently bought items.
#Looking at the frequency distribution table
dts <- sales %>%
group_by(Product) %>%
summarize(ProductFrequency= n()) %>%
arrange(desc(ProductFrequency))
dts %>%
kbl() %>%
kable_styling()
| Product | ProductFrequency |
|---|---|
| USB-C Charging Cable | 21903 |
| Lightning Charging Cable | 21658 |
| AAA Batteries (4-pack) | 20641 |
| AA Batteries (4-pack) | 20577 |
| Wired Headphones | 18882 |
| Apple Airpods Headphones | 15549 |
| Bose SoundSport Headphones | 13325 |
| 27in FHD Monitor | 7507 |
| iPhone | 6842 |
| 27in 4K Gaming Monitor | 6230 |
| 34in Ultrawide Monitor | 6181 |
| Google Phone | 5525 |
| Flatscreen TV | 4800 |
| Macbook Pro Laptop | 4724 |
| ThinkPad Laptop | 4128 |
| 20in Monitor | 4101 |
| Vareebadd Phone | 2065 |
| LG Washing Machine | 666 |
| LG Dryer | 646 |
From the above analysis of the sales data set, the top 3 products that are most frequently bought are:
Lets look at the distribution of these 3 products in the different Zip codes. This might give us a better idea of the State Wise Demand of these products.
4a) USB-C Charging Cable
# Purchase frequency distribution of USB-C Charging Cable State Wise
usbc_dat <- sales %>%
filter(str_detect(Product,'USB-C'))
grp_state_usbc <- usbc_dat %>%
group_by(State)%>%
summarize(TotalPurchases= n())
# For USB-C Charging Cable
plt_1 <- grp_state_usbc %>%
arrange(desc(TotalPurchases))%>%
ggplot( aes(x=State, y=TotalPurchases)) +
geom_segment( aes(xend=State, yend=0)) +
geom_point( size=4, color="yellow") +
theme_bw() +
xlab("") +
theme_classic()+
scale_y_continuous(breaks= scales::pretty_breaks(n=10))+
ylab('Number of items sold')+
xlab('US States')+
theme(plot.background=element_rect(fill="#50623a"),
panel.background = element_rect(fill="#50623a"),
plot.margin = unit(c(1,2,1,2),"cm"),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_text(colour="white",face = "bold"),
axis.text.y = element_text(colour="white",face = "bold"),
axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
labs(title="US State that buys USB-C Charger the most",
subtitle=str_wrap("Distribution of Number of Amazon Purchases of USB-C Charging Cable among different US states in 2019",75),
caption = "Data Recorded from January 01, 2019 to January, 01 2020")
plt_1
4b) Lightning Charging Cable
# Purchase frequency distribution of Lightning Charging Cable State Wise
lightc_dat <- sales %>%
filter(str_detect(Product,'Lightning Charging Cable'))
grp_state_lightc <- lightc_dat %>%
group_by(State)%>%
summarize(TotalPurchases= n())
# For Lightning Charging Cable
plt_2 <- grp_state_lightc %>%
arrange(desc(TotalPurchases))%>%
ggplot( aes(x=State, y=TotalPurchases)) +
geom_segment( aes(xend=State, yend=0)) +
geom_point( size=4, color="yellow") +
theme_bw() +
xlab("") +
theme_classic()+
scale_y_continuous(breaks= scales::pretty_breaks(n=10))+
ylab('Number of items sold')+
xlab('US States')+
theme(plot.background=element_rect(fill="#3c0753"),
panel.background = element_rect(fill="#3c0753"),
plot.margin = unit(c(1,2,1,2),"cm"),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_text(colour="white",face = "bold"),
axis.text.y = element_text(colour="white",face = "bold"),
axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
labs(title="US State that buys Lightning Cable the most",
subtitle=str_wrap("Distribution of Number of Amazon Purchases of Lightning Cable among different US states in 2019",75),
caption = "Data Recorded from January 01, 2019 to January, 01 2020")
plt_2
4c) AAA Batteries (4-pack)
# Purchase frequency distribution of AAA Batteries (4-pack) State Wise
aaa_dat <- sales %>%
filter(str_detect(Product,'AAA Batteries (4-pack)'))
grp_state_aaa <- aaa_dat %>%
group_by(State)%>%
summarize(TotalPurchases= n())
# For AAA Batteries (4-pack)
plt_3 <- grp_state_lightc %>%
arrange(desc(TotalPurchases))%>%
ggplot( aes(x=State, y=TotalPurchases)) +
geom_segment( aes(xend=State, yend=0),color='white') +
geom_point( size=4, color="yellow") +
theme_bw() +
xlab("") +
theme_classic()+
scale_y_continuous(breaks= scales::pretty_breaks(n=10))+
ylab('Number of items sold')+
xlab('US States')+
theme(plot.background=element_rect(fill="#3e3232"),
panel.background = element_rect(fill="#3e3232"),
plot.margin = unit(c(1,2,1,2),"cm"),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.line.x.bottom = element_line(color="white"),
axis.text.x = element_text(colour="white",face = "bold"),
axis.text.y = element_text(colour="white",face = "bold"),
axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
labs(title="US State that buys AAA Batteries (4-pack) the most",
subtitle=str_wrap("Distribution of Number of Amazon Purchases of AAA Batteries (4-pack) among different US states in 2019",75),
caption = "Data Recorded from January 01, 2019 to January, 01 2020")
plt_3
We can see that CA California is the most popular state
from the above analysis
Shipping costs vary significantly between the West Coast and East Coast, with a 40-foot container shipment to the West Coast priced at around $20,000, compared to $14,000 for the East Coast.
While a 40 foot container to the former costs about $20k, similar shipment costs $14k for Eastcoast.
A manager would also be interested in knowing what products are popular in east coast vs west coast. Specifically, these products
USB C charging Cable
Lightning Cable
Macbook Pro Laptop
iPhone
Here, the top 2 are the most bought products, while the bottom 2 are the premium products that sell well.
dt_coast <- sales %>%
mutate(region= case_when(
as.numeric(ZIP)== 94016 ~ "West Coast",
as.numeric(ZIP)== 90001 ~ "West Coast",
as.numeric(ZIP)== 10001 ~ "East Coast",
as.numeric(ZIP)== 02215 ~ "East Coast",
as.numeric(ZIP)== 75001 ~ "East Coast",
as.numeric(ZIP)== 98101 ~ "West Coast",
as.numeric(ZIP)== 97035 ~ "West Coast",
as.numeric(ZIP)== 73301 ~ "East Coast",
as.numeric(ZIP)== 04101 ~ "East Coast",
as.numeric(ZIP)== 30301 ~ "East Coast",
))
dt_coast_grp <- dt_coast %>%
filter(str_detect('Macbook Pro Laptop|USB-C Charging Cable|Lightning Charging Cable|iPhone', Product)) %>%
group_by(region,Product) %>%
summarise(ArticlesBought= sum(as.numeric(Quantity)))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
dt_coast_grp %>%
kbl() %>%
kable_styling()
| region | Product | ArticlesBought |
|---|---|---|
| East Coast | Lightning Charging Cable | 10851 |
| East Coast | Macbook Pro Laptop | 2211 |
| East Coast | USB-C Charging Cable | 11188 |
| East Coast | iPhone | 3153 |
| West Coast | Lightning Charging Cable | 12366 |
| West Coast | Macbook Pro Laptop | 2517 |
| West Coast | USB-C Charging Cable | 12787 |
| West Coast | iPhone | 3696 |
ggplot(dt_coast_grp,
aes(str_wrap(Product,6), ArticlesBought, fill= region))+
geom_bar(position="dodge", stat="identity")+
theme(plot.background=element_rect(fill="#025464"),
panel.background = element_rect(fill="#025464"),
plot.margin = unit(c(1,2,1,2),"cm"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
panel.grid= element_blank(),
axis.line.x.bottom = element_line(color="white"),
axis.text.x = element_text(colour="white",face = "bold"),
axis.text.y = element_text(colour="white",face = "bold", size= 8),
axis.title.x = element_text(colour="white",face="bold",size=8, margin=margin(t=25)),
axis.title.y = element_text(colour="white",face="bold",size=8, margin=margin(r=25)),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
scale_y_continuous(breaks= scales::pretty_breaks(n=6),
)+
labs(title="East Coast vs West Coast Product Sale",
subtitle=str_wrap("Comparing Sales of Popular products in 2 US Regions",75),
caption = str_wrap("States Like Texas which are geographically closer to East coast are included in East Coast since shipment from East Coast to them is more practical from logistic perspective (Proximity to Major Ports on that end)",width = 79))+
ylab('No. of Products bought')+
xlab('Products')
The disparity in product demand between the West Coast and East Coast is evident, with the former showing higher purchases across various product categories, including premium product items like iPhones and Macbook Pros.
Given the significant ports on the West Coast such as Long Beach, Portland, San Francisco, Los Angeles, Tacoma, and Seattle, it becomes imperative to devise a solution for optimizing container space.
Considering the dimensions of the items and the container, along with their profitability, a strategy is proposed to maximize profits while efficiently utilizing space. Based on a profit vs. weight analysis, the optimal allocation is suggested as follows:
40% for iPhones, which are both high in profitability and demand.
30% for Macbook Pros, offering a balance between profitability and space efficiency.
15% each for USB-C chargers and Lightning cables, ensuring inclusion of essential accessories while maintaining a balanced distribution.
This approach aims to maximize profits by prioritizing high-profit items while also considering space constraints and the need for a diverse product range.
Products <- c("Macbook Pro", "USB-C Charger", "Lightning Cable", "iPhone")
value <- c(40,30,15,15)
dft <- data.frame(Products, value)
ggplot(dft, aes(area = value, fill = Products)) +
geom_treemap()+
theme(plot.background=element_rect(fill="#3b5998"),
panel.background = element_rect(fill="#3b5998"),
plot.margin = unit(c(1,2,1,2),"cm"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
panel.grid= element_blank(),
axis.line.x.bottom = element_line(color="white"),
axis.text.x = element_text(colour="white",face = "bold"),
axis.text.y = element_text(colour="white",face = "bold", size= 8),
axis.title.x = element_text(colour="white",face="bold",size=8, margin=margin(t=25)),
axis.title.y = element_text(colour="white",face="bold",size=8, margin=margin(r=25)),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
labs(title="West Coast Container Optimization for Profit",
subtitle=str_wrap("Shows size allotment for different products for shipment",75),
caption = str_wrap("Calculation is based on standard 40-foot shipment container size as well as Apple Packages for Laptops and USB cables",width = 79))+
scale_color_manual(values= c('#4b3832','#854442','#fff4e6','#3c2f2f'))
Sales Analysis by Zip Code (Visualization 1):
Provides insight into sales performance across different urban zip codes.
Identifies areas with high sales volume and premium pricing, allowing for targeted marketing and inventory management strategies.
Top Selling Premium Products (Visualization 2):
Helps identify premium products with high profitability despite their higher price points.
Enables Amazon to focus on marketing efforts for these high-margin products and optimize inventory levels.
Product Frequency Distribution (Visualization 3):
Analyzes the frequency distribution of products by category, highlighting frequently bought items.
Allows Amazon to prioritize stocking and promotion of popular products, ensuring adequate inventory levels to meet demand.
Regional Product Preferences (Visualization 4):
Determines the most popular products in each state for frequently bought items.
Provides insights into regional preferences, enabling tailored marketing campaigns and inventory allocation strategies.
Shipping Cost Analysis (Visualization 5):
Compares shipping costs between the West Coast and East Coast for container shipments.
Identifies cost disparities and potential cost-saving opportunities, informing decisions on logistics and distribution strategies.
Container Space Optimization (Visualization 6):
Presents solutions for optimizing container space to reduce shipping costs and maximize efficiency.
Helps Amazon streamline logistics operations and improve profitability by efficiently utilizing container space, particularly considering major ports on the West Coast.
Thank you for reading!