Data Communications: Data Transalation Challenge

Analyzing Amazon’s 2019 Technology Product Sales Data

Section 0: Project Outline

Objective

Analyze a data set of Amazon sales of technology products in several urban ZIP codes to find interesting stories and insights, presented through interconnected visualizations and a coherent narrative.

Target Audience

Amazon’s technology, sales, logistics and operations professionals, including key decision-makers and operational staff like General Managers, Logistic Staff, Upper Management, and Port & Warehouse employees.

Prompt

Amazon’s Logistics Division strategizes for product imports, aiming to optimize import and transportation costs in alignment with revenue to maximize profits. The majority of imports are transported via the shipping industry due to its cost-effectiveness compared to air transportation. The division is focused on understanding the breakdown of import costs to achieve its objectives.

Data

Amazon’s 2019 Online Product Sales Data for 19 technology products across 10 Zip Codes.

Scope/Analysis

Identify high-revenue top performing products that sell well in the market across categories
Understand customer behavior to analyse product trends and patterns that are most frequently purchased
Understand product populaity across zip codes, may be divide into east coast and west coast for a comparison product analysis
Understand shipping costs associated for logistics department to optimise the space in the shipping container
Target advertising and promotions: recommend specific Zip Codes and products for increased marketing efforts based on insights.

This project will provide valuable insights to Amazon’s Technology Product sales, logistics, pricing and product divisions with historic data analysis, helping them optimize marketing efforts, manage import and shipping costs, and make informed decisions about future product sales and potential profit margins insights for the most profitable and most frequently purchased products.

Section I: Preparing the Data

Firstly, we need to load the necessary libraries, set the working directory and load the data to get started.

#importing the libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(vtable)
library(scales)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(stringr)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(ggiraph)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(rio)

## 
## Attaching package: 'rio'

## The following object is masked from 'package:plotly':
## 
##     export

library(hrbrthemes)

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(treemapify)

#Session → Set Working Directory 
setwd("~/Desktop/BUAN 5210/DTC")

#To remove scientific notation form labels
options(scipen=1000000)

#importing the dataset
sales <- import('sales_data.Rdata')
dat_zip <- read.csv("zip_info.csv")

Section II: Exploratory Data Analysis

head(sales)

vtable(sales, lush = TRUE, factor.limit = 9, char.values = TRUE)

sales
Name	Class	Values	Summary
Product	character	‘20in Monitor’ ‘27in 4K Gaming Monitor’ ‘27in FHD Monitor’ ‘34in Ultrawide Monitor’ ‘AA Batteries (4-pack)’ ‘AAA Batteries (4-pack)’ ‘Apple Airpods Headphones’ ‘Bose SoundSport Headphones’ ‘Flatscreen TV’ and 10 more	nuniq: 19
Quantity	character	‘1’ ‘2’ ‘3’ ‘4’ ‘5’ ‘6’ ‘7’ ‘8’ ‘9’	nuniq: 9
PriceEach	character	‘109.99’ ‘11.95’ ‘11.99’ ‘14.95’ ‘149.99’ ‘150’ ‘150.0’ ‘1700’ ‘1700.0’ and 14 more	nuniq: 23
DateTime	POSIXct	Time: 2019-01-01 03:07:00 to 2020-01-01 05:13:00	median: 2019-07-17 20:40:30, nuniq: 142395
Date	Date	Time: 2019-01-01 to 2020-01-01	median: 2019-07-17, nuniq: 366
ZIP	character	‘02215’ ‘04101’ ‘10001’ ‘30301’ ‘73301’ ‘75001’ ‘90001’ ‘94016’ ‘97035’ and 1 more	nuniq: 10
State	character	‘CA’ ‘GA’ ‘MA’ ‘ME’ ‘NY’ ‘OR’ ‘TX’ ‘WA’	nuniq: 8
City	character	‘Atlanta’ ‘Austin’ ‘Boston’ ‘Dallas’ ‘Los Angeles’ ‘New York City’ ‘Portland’ ‘San Francisco’ ‘Seattle’	nuniq: 9

The sales data contains 8 variables:

Product, the product that has been ordered
Quantity, how many of the product was ordered (note this is a string, you’ll want to fix that!)
PriceEach, the price of each item (note this is a string, you’ll want to fix that!)
DateTime and Date, when the order was placed. DateTime includes both day and time-of-day when the order was placed, while Date is just the date
ZIP, the ZIP code where the order was sent to
State and City, the city and state where the order was sent to

head(dat_zip)

vtable(dat_zip, lush = TRUE)

dat_zip
Name	Class	Values	Summary
ZIP	integer	Num: 2215 to 98101	mean: 57407.3, sd: 40907.601, nuniq: 10
TotalPopulation	integer	Num: 12792 to 58975	mean: 26051.6, sd: 12885.755, nuniq: 10
MedianHHIncome	integer	Num: 46309 to 119370	mean: 81151, sd: 25319.33, nuniq: 10
PCIncome	integer	Num: 14814 to 100364	mean: 57085.4, sd: 28034.787, nuniq: 10
MedianAge	numeric	Num: 21.6 to 44.3	mean: 34.02, sd: 6.243, nuniq: 10
Race_White	integer	Num: 9231 to 22921	mean: 16140.6, sd: 5012.111, nuniq: 10
Race_Black	integer	Num: 459 to 5483	mean: 2179.9, sd: 1366.307, nuniq: 10
Race_American_Indian	integer	Num: 148 to 802	mean: 426.3, sd: 229.788, nuniq: 10
Race_Asian	integer	Num: 173 to 10134	mean: 3238.9, sd: 3245.489, nuniq: 10
Race_Pacific_Islander	integer	Num: 0 to 237	mean: 71.3, sd: 81.024, nuniq: 9
Race_Other	integer	Num: 181 to 30491	mean: 5003.8, sd: 9745.687, nuniq: 10
Ethnicity_Hispanic	integer	Num: 609 to 53085	mean: 9223, sd: 16630.268, nuniq: 10
Citizens	integer	Num: 10432 to 24069	mean: 17171.8, sd: 4595.578, nuniq: 10

The zipcode data contains 13 variables:

ZIP, which is a ZIP code we can use to join this data set with the sales data
TotalPopulation, which is the population in that ZIP code
MedianHHIncome, which is the median annual household income in that ZIP. Household income calculates the total income from everyone in a given household, and then finds the median household (Income statistics use 2020 ACS instead of 2018)
PCIncome, which is the annual per-capita (i.e;.per-person) income in that ZIP. Per-capita income sums up all the income earned by everyone in the ZIP, and then divides it by the number of people in that ZIP (which may include a lot of non-earners, or children) (Income statistics use 2020 ACS instead of 2018)
MedianAge, the median age of people in the ZIP code
Race_* variables, the number of people of each broad-category race in that ZIP code. Note that races are not mutually exclusive. Someone who is, for example, both White and Asian will be counted once as White and once as Asian
Ethnicity_Hispanic, which is the number of people who are Hispanic in the ZIP code. Ethnicity can overlap with any race, so someone who is, for example, both Hispanic and Black will be counted once as Hispanic and once as Black
Citizens, which is the number of US citizens living in the ZIP code

Then, we check for missing values in the data and if present, we analyse and remove them

# Check for missing values
sum(is.na(sales))

## [1] 0

sum(is.na(dat_zip))

## [1] 0

We can see that, there are 0 missing values across all variables in both sales and zip-code data frames.

The Quantity and price per product (PriceEach) are categorized as ‘character’ instead of ‘numeric’. Additionally, we don’t have the total sales column in the data. Lets correct and create it for further analysis.

q = as.numeric(sales$Quantity) #quantity
p = as.numeric(sales$PriceEach) #price

sales <- sales %>%
  mutate(Sale = p * q)

Section III: Data Vizulisations

Addressing the intersection of high sales volume and premium pricing.

Amazon’s Logistics Department aims to strike a balance between products that are both highly sought-after and carry a premium price tag for several reasons:

Rapid Depletion: Such products tend to fly off the shelves quickly.
Price Surge Potential: During shortages, they can command higher prices, maximizing profits.

Finding the right mix between items that can fetch premium prices and those constituting regular sales is pivotal. Let’s delve into the pricing and revenue breakdown of various product by categories to gain deeper insights.

dtp<- sales %>%
  select(Product, PriceEach, Quantity)%>%
  group_by(Product, PriceEach) %>%
  summarize(Total_revenue= sum(as.numeric(Quantity))*as.numeric(PriceEach))

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'Product', 'PriceEach'. You can override
## using the `.groups` argument.

dtp <- dtp %>% 
  distinct()

dtp %>%
  arrange(desc(Total_revenue)) %>%
  kbl() %>%
  kable_styling()

Product	PriceEach	Total_revenue
Macbook Pro Laptop	1700	8000200.00
iPhone	700	4767700.00
ThinkPad Laptop	999.99	4129958.70
Google Phone	600	3298200.00
27in 4K Gaming Monitor	389.99	2435097.56
34in Ultrawide Monitor	379.99	2355558.01
Apple Airpods Headphones	150	2334300.00
Flatscreen TV	300	1439700.00
Bose SoundSport Headphones	99.99	1345565.43
27in FHD Monitor	149.99	1132424.50
Vareebadd Phone	400	823600.00
20in Monitor	109.99	454148.71
LG Washing Machine	600.0	399600.00
LG Dryer	600.0	387600.00
Lightning Charging Cable	14.95	347094.15
USB-C Charging Cable	11.95	286501.25
Wired Headphones	11.99	246478.43
AA Batteries (4-pack)	3.84	106118.40
AAA Batteries (4-pack)	2.99	92740.83
Macbook Pro Laptop	1700.0	37400.00
iPhone	700.0	26600.00
Google Phone	600.0	21000.00
Apple Airpods Headphones	150.0	14850.00
Flatscreen TV	300.0	6000.00
Vareebadd Phone	400.0	3600.00

Visualization 1: The interactive bar graph with total sales of all the 19 products in 2019 across 10 urban zip codes

sales_1 <- sales %>%
  group_by(Product) %>%
  summarise_at(vars(Sale), list(total_sale = sum))

sales_1 <- sales_1 %>%
  mutate_at(vars(total_sale), funs(./ 10^3))

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

sales_1 <- sales_1 %>%
  mutate(tooltip_text = paste0(toupper(Product), "\n", "$", round(total_sale, 2)))

plot1 <- ggplot(sales_1, 
                           aes(x = reorder(Product, total_sale), 
                               y = total_sale,
                               tooltip = tooltip_text, data_id = Product 
                           )) +
  geom_col_interactive(color = "black", fill="#816654", size = 0.5) +  
  theme_minimal() +
  theme(axis.text=element_text(size = 7), panel.grid.major.y = element_blank()) + 
  labs(title = "Total Sales For A Product In 2019",
       subtitle = "Sale Data is Limited to 10 Urban Zip Codes"
  ) +
  ylab("Sales (In Thousand Dollar$)") +
  xlab("Product") +
  coord_flip()

girafe(ggobj = plot1, width_svg = 5.85, height_svg = 3.7)

As observed, laptops, phones, and monitors were the top three products sold across the zip codes analyzed in 2019.
The top 3 products with ‘Premium Price’ and high sales number, therefore, are

Macbook Pro Laptop (Highest Price and Purchase frequency)
iphone (Highest Number of Purchases pertaining to price)
ThinkPad Laptop (Highest Price and Relatively high purchase frequency)

Visualization 2: Top Selling Most Profitable Premium Products

dtp %>%
  arrange(PriceEach)%>%
  filter(str_detect(Product,'Macbook|iPhone|ThinkPad')) %>%
  ggplot(aes(Product, Total_revenue, color= PriceEach))+ geom_bar(stat='identity',
                                                                  show.legend = FALSE)+ coord_flip()+
  xlab('Total Revenue')+
  ylab('Product')+
  theme(
    panel.grid.major.y = element_blank(),
      axis.text.x = element_text() 
  )+
  scale_y_continuous(breaks= scales::pretty_breaks(n=6),
                     labels= label_number(suffix= " K", scale= 1e-6))+
  scale_color_brewer()+
   theme(plot.background=element_rect(fill="#051e3e"),
        panel.background = element_rect(fill="#051e3e"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.line.x.bottom = element_line(color="white"),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold"),
        axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
   labs(title="Top 3 Most Profitable Premium Products",
       subtitle=str_wrap("Best Selling and Most Profitable Premium Products",75),
       caption = str_wrap("Total Revenue = Price of an article * Quantity * Number of transactions",width = 105))

These are the products that fall in a high price point and yet still sell relatively well are called most profitable premium products

In the next section, our analysis focuses on the frequency distribution of products within their respective categories. We achieve this by aggregating similar products from different brands, ensuring a more comprehensive understanding of the market composition.

Visualization 3: Addressing the Product Frequency Distribution

Let us look at the Frequency Distribution of different product by their product categories, such that similar products from different brands are clubbed together. Let’s look at some of the most frequently bought items.

#Looking at the frequency distribution table
dts <- sales %>%
  group_by(Product) %>%
  summarize(ProductFrequency= n()) %>%
  arrange(desc(ProductFrequency))

dts %>%
  kbl() %>%
  kable_styling()

Product	ProductFrequency
USB-C Charging Cable	21903
Lightning Charging Cable	21658
AAA Batteries (4-pack)	20641
AA Batteries (4-pack)	20577
Wired Headphones	18882
Apple Airpods Headphones	15549
Bose SoundSport Headphones	13325
27in FHD Monitor	7507
iPhone	6842
27in 4K Gaming Monitor	6230
34in Ultrawide Monitor	6181
Google Phone	5525
Flatscreen TV	4800
Macbook Pro Laptop	4724
ThinkPad Laptop	4128
20in Monitor	4101
Vareebadd Phone	2065
LG Washing Machine	666
LG Dryer	646

From the above analysis of the sales data set, the top 3 products that are most frequently bought are:

USB-C Charging Cable
Lightning Charging Cable
AAA Batteries (4-pack)

Lets look at the distribution of these 3 products in the different Zip codes. This might give us a better idea of the State Wise Demand of these products.

Visualization 4: The most popular product state wise for the above mentioned top 3 products that are most frequently bought irrespective of the brands

4a) USB-C Charging Cable

# Purchase frequency distribution of USB-C Charging Cable State Wise
usbc_dat <- sales %>% 
  filter(str_detect(Product,'USB-C'))

grp_state_usbc <- usbc_dat %>%
  group_by(State)%>%
  summarize(TotalPurchases= n())

# For USB-C Charging Cable
plt_1 <- grp_state_usbc %>%
  arrange(desc(TotalPurchases))%>%
  ggplot( aes(x=State, y=TotalPurchases)) +
    geom_segment( aes(xend=State, yend=0)) +
    geom_point( size=4, color="yellow") +
    theme_bw() +
    xlab("") +
    theme_classic()+
    scale_y_continuous(breaks= scales::pretty_breaks(n=10))+
  ylab('Number of items sold')+
  xlab('US States')+
  theme(plot.background=element_rect(fill="#50623a"),
        panel.background = element_rect(fill="#50623a"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold"),
        axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
   labs(title="US State that buys USB-C Charger the most",
       subtitle=str_wrap("Distribution of Number of Amazon Purchases of USB-C Charging Cable among different US states in 2019",75),
       caption = "Data Recorded from January 01, 2019 to January, 01 2020")

plt_1

4b) Lightning Charging Cable

# Purchase frequency distribution of Lightning Charging Cable State Wise
lightc_dat <- sales %>% 
  filter(str_detect(Product,'Lightning Charging Cable'))

grp_state_lightc <- lightc_dat %>%
  group_by(State)%>%
  summarize(TotalPurchases= n())

# For Lightning Charging Cable
plt_2 <- grp_state_lightc %>%
  arrange(desc(TotalPurchases))%>%
  ggplot( aes(x=State, y=TotalPurchases)) +
    geom_segment( aes(xend=State, yend=0)) +
    geom_point( size=4, color="yellow") +
    theme_bw() +
    xlab("") +
    theme_classic()+
    scale_y_continuous(breaks= scales::pretty_breaks(n=10))+
  ylab('Number of items sold')+
  xlab('US States')+
  theme(plot.background=element_rect(fill="#3c0753"),
        panel.background = element_rect(fill="#3c0753"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold"),
        axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
   labs(title="US State that buys Lightning Cable the most",
       subtitle=str_wrap("Distribution of Number of Amazon Purchases of Lightning Cable among different US states in 2019",75),
       caption = "Data Recorded from January 01, 2019 to January, 01 2020")
plt_2

4c) AAA Batteries (4-pack)

# Purchase frequency distribution of AAA Batteries (4-pack) State Wise
aaa_dat <- sales %>% 
  filter(str_detect(Product,'AAA Batteries (4-pack)'))

grp_state_aaa <- aaa_dat %>%
  group_by(State)%>%
  summarize(TotalPurchases= n())

# For AAA Batteries (4-pack)
plt_3 <- grp_state_lightc %>%
  arrange(desc(TotalPurchases))%>%
  ggplot( aes(x=State, y=TotalPurchases)) +
    geom_segment( aes(xend=State, yend=0),color='white') +
    geom_point( size=4, color="yellow") +
    theme_bw() +
    xlab("") +
    theme_classic()+
    scale_y_continuous(breaks= scales::pretty_breaks(n=10))+
  ylab('Number of items sold')+
  xlab('US States')+
  theme(plot.background=element_rect(fill="#3e3232"),
        panel.background = element_rect(fill="#3e3232"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.line.x.bottom = element_line(color="white"),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold"),
        axis.title.x = element_text(colour="white",face="bold",size=12, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=12, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
   labs(title="US State that buys AAA Batteries (4-pack) the most",
       subtitle=str_wrap("Distribution of Number of Amazon Purchases of AAA Batteries (4-pack) among different US states in 2019",75),
       caption = "Data Recorded from January 01, 2019 to January, 01 2020")
plt_3

We can see that CA California is the most popular state from the above analysis

Visualization 5: Addressing the Shipping Charges West Coast vs East Coast

Shipping costs vary significantly between the West Coast and East Coast, with a 40-foot container shipment to the West Coast priced at around $20,000, compared to $14,000 for the East Coast.

While a 40 foot container to the former costs about $20k, similar shipment costs $14k for Eastcoast.

A manager would also be interested in knowing what products are popular in east coast vs west coast. Specifically, these products

USB C charging Cable
Lightning Cable
Macbook Pro Laptop
iPhone

Here, the top 2 are the most bought products, while the bottom 2 are the premium products that sell well.

dt_coast <- sales %>%
  mutate(region= case_when(
    as.numeric(ZIP)== 94016 ~ "West Coast",
    as.numeric(ZIP)== 90001 ~ "West Coast", 
    as.numeric(ZIP)== 10001 ~ "East Coast",
    as.numeric(ZIP)== 02215 ~ "East Coast",
    as.numeric(ZIP)== 75001 ~ "East Coast",
    as.numeric(ZIP)== 98101 ~ "West Coast", 
    as.numeric(ZIP)== 97035 ~ "West Coast",
    as.numeric(ZIP)== 73301 ~ "East Coast",
    as.numeric(ZIP)== 04101 ~ "East Coast",
    as.numeric(ZIP)== 30301 ~ "East Coast",
    
  ))

dt_coast_grp <- dt_coast %>%
  filter(str_detect('Macbook Pro Laptop|USB-C Charging Cable|Lightning Charging Cable|iPhone', Product)) %>%
  group_by(region,Product) %>%
  summarise(ArticlesBought= sum(as.numeric(Quantity)))

## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.

dt_coast_grp %>%
  kbl() %>%
  kable_styling()

region	Product	ArticlesBought
East Coast	Lightning Charging Cable	10851
East Coast	Macbook Pro Laptop	2211
East Coast	USB-C Charging Cable	11188
East Coast	iPhone	3153
West Coast	Lightning Charging Cable	12366
West Coast	Macbook Pro Laptop	2517
West Coast	USB-C Charging Cable	12787
West Coast	iPhone	3696

ggplot(dt_coast_grp,
       aes(str_wrap(Product,6), ArticlesBought, fill= region))+
  geom_bar(position="dodge", stat="identity")+
  theme(plot.background=element_rect(fill="#025464"),
        panel.background = element_rect(fill="#025464"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        panel.grid= element_blank(),
        axis.line.x.bottom = element_line(color="white"),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold", size= 8),
        axis.title.x = element_text(colour="white",face="bold",size=8, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=8, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
  scale_y_continuous(breaks= scales::pretty_breaks(n=6),
                     )+
   labs(title="East Coast vs West Coast Product Sale",
       subtitle=str_wrap("Comparing Sales of Popular products in 2 US Regions",75),
       caption = str_wrap("States Like Texas which are geographically closer to East coast are included in East Coast since shipment from East Coast to them is more practical from logistic perspective (Proximity to Major Ports on that end)",width = 79))+
  ylab('No. of Products bought')+
  xlab('Products')

The disparity in product demand between the West Coast and East Coast is evident, with the former showing higher purchases across various product categories, including premium product items like iPhones and Macbook Pros.

Given the significant ports on the West Coast such as Long Beach, Portland, San Francisco, Los Angeles, Tacoma, and Seattle, it becomes imperative to devise a solution for optimizing container space.

Considering the dimensions of the items and the container, along with their profitability, a strategy is proposed to maximize profits while efficiently utilizing space. Based on a profit vs. weight analysis, the optimal allocation is suggested as follows:

40% for iPhones, which are both high in profitability and demand.
30% for Macbook Pros, offering a balance between profitability and space efficiency.
15% each for USB-C chargers and Lightning cables, ensuring inclusion of essential accessories while maintaining a balanced distribution.

This approach aims to maximize profits by prioritizing high-profit items while also considering space constraints and the need for a diverse product range.

Visualization 6: Tree map with Container Space Optimization

Products <- c("Macbook Pro", "USB-C Charger", "Lightning Cable", "iPhone")

value <- c(40,30,15,15)

dft <- data.frame(Products, value) 
ggplot(dft, aes(area = value, fill = Products)) +
  geom_treemap()+
  theme(plot.background=element_rect(fill="#3b5998"),
        panel.background = element_rect(fill="#3b5998"),
        plot.margin = unit(c(1,2,1,2),"cm"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        panel.grid= element_blank(),
        axis.line.x.bottom = element_line(color="white"),
        axis.text.x = element_text(colour="white",face = "bold"),
        axis.text.y = element_text(colour="white",face = "bold", size= 8),
        axis.title.x = element_text(colour="white",face="bold",size=8, margin=margin(t=25)),
        axis.title.y = element_text(colour="white",face="bold",size=8, margin=margin(r=25)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title=element_text(size=14, face="bold",colour="white",margin=margin(b=15)),
        plot.subtitle = element_text(size=12, colour="white",margin=margin(b=25)),
        plot.caption=element_text(size=10,colour="white",hjust=0,margin=margin(t=20)))+
   labs(title="West Coast Container Optimization for Profit",
       subtitle=str_wrap("Shows size allotment for different products for shipment",75),
       caption = str_wrap("Calculation is based on standard 40-foot shipment container size as well as Apple Packages for Laptops and USB cables",width = 79))+
  scale_color_manual(values= c('#4b3832','#854442','#fff4e6','#3c2f2f'))

Section IV: Reflection/Conclusion:

Sales Analysis by Zip Code (Visualization 1):
- Provides insight into sales performance across different urban zip codes.
- Identifies areas with high sales volume and premium pricing, allowing for targeted marketing and inventory management strategies.
Top Selling Premium Products (Visualization 2):
- Helps identify premium products with high profitability despite their higher price points.
- Enables Amazon to focus on marketing efforts for these high-margin products and optimize inventory levels.
Product Frequency Distribution (Visualization 3):
- Analyzes the frequency distribution of products by category, highlighting frequently bought items.
- Allows Amazon to prioritize stocking and promotion of popular products, ensuring adequate inventory levels to meet demand.
Regional Product Preferences (Visualization 4):
- Determines the most popular products in each state for frequently bought items.
- Provides insights into regional preferences, enabling tailored marketing campaigns and inventory allocation strategies.
Shipping Cost Analysis (Visualization 5):
- Compares shipping costs between the West Coast and East Coast for container shipments.
- Identifies cost disparities and potential cost-saving opportunities, informing decisions on logistics and distribution strategies.
Container Space Optimization (Visualization 6):
- Presents solutions for optimizing container space to reduce shipping costs and maximize efficiency.
- Helps Amazon streamline logistics operations and improve profitability by efficiently utilizing container space, particularly considering major ports on the West Coast.

Thank you for reading!