For this assignment, you must select at least two variables of interest from your dataset, provide a basic description of the variables, clean and recode as needed, and present descriptive statistics and any visualizations. Your R Markdown document should include:
This is the first project where I am beginning to work with the data I am going to uses for my final project.
I am using a few different data sets for my final project. They are all related to my sister’s job, My sister, Lucy, is an Influencer working with Amazon. She primarily uses social media and tracking links to share deals with her followers. She gets a commission for items sold via her links (as well as some other monetary incentives).
Lucy gets a ton of data from both Amazon and Instagram, but she doesn’t do anything with it, for this homework, I have one main question to answer:
So, I am looking at the relationsip between revenue and ad-spend. (For my final project I have more questions that I’d like to answer for her).
The data for this analysis comes from her “Amazon Influcer” account. It is multiple tabs, so I may end up using a few different pieces (tabs) of the data:
For the final project I will also be looking at her social media data (Instagram and maybe Tik Tok). One thing I am already thinking about is that there won’t be any unique IDs that I can use to clearly tie the Amazon dataset to the Instagram dataset. However, we may be able to use dates to associate social media actions with Amazon sales.
As I noted in the summary of tables, I want to try to combine the Orders and Earnings tables together. At first I thought I could use the ASIN unique identifier, but then I realized that is actually a unique number for the product, not the sale. So, I will work with them separately.
amazon_fee_orders <- read_excel("Documents/GitHub/Luluprime_Amazon.xlsx", sheet = "Fee-Orders", skip = 1)
paged_table(amazon_fee_orders)
amazon_fee_earnings <- read_excel("Documents/GitHub/Luluprime_Amazon.xlsx", sheet = "Fee-Earnings", skip = 1)
paged_table(amazon_fee_earnings)
The good thing about this dataset is that it doesn’t really need to be tidied up (for now). But that means we can dive right in.
Since my question for this homework is specficially around revenue vs. ad cost, I will likely only use the Fee-Earnings table.
I know there is a lot of summary analysis that could be done here, but I am saving a lot of it for my final project (so I will keep it light). Since we are looking specifically at revenue and ad-spend, let’s do some summary tables first:
# table of all the different categories and how many times they occur in the earnings table.
table(select(amazon_fee_earnings, Category))
##
## Amazon Cloud Cam Amazon Fresh Products
## 8 2315
## Amazon Gift Cards AmazonBasics Microwave
## 120 2
## Arts, Crafts & Sewing Audible Audiobooks
## 30 3
## Automotive Baby & Nursery
## 591 1211
## Beauty & Grooming Blu-Ray & DVD
## 14356 56
## Books & Textbooks Business & Industrial Supplies
## 1708 1057
## Camera, Photo & Video CDs & Vinyl
## 589 24
## Cell Phones & Accessories Clothing & Accessories
## 2705 143071
## Computers, Tablets & Components Digital Music
## 603 2
## Digital Products Accessory Digital Video Subscription
## 1 5
## Echo & Alexa Accessories Echo Devices
## 21 46
## Electronic Components & Home Audio Element Smart TV
## 911 7
## Fire Tablets Fire Tablets Accessories
## 45 19
## Fire TV Accessories Fire TV Devices
## 4 21
## Fresh Prepared Furniture
## 43 1972
## Grocery & Gourmet Food Handmade
## 3866 102
## Health & Household Health & Personal Care Appliances
## 5561 1327
## Home Home Entertainment
## 14073 29
## Home Improvement Jewelry
## 2652 9433
## Kindle Books Kindle E-readers
## 27 13
## Kindle E-readers Accessories Kitchen & Dining
## 5 5452
## Luggage Luxury Beauty
## 1740 697
## Magazines Major Appliances
## 8 44
## Musical Instruments Office & School Supplies
## 62 2466
## Other Other Gift Card Brands
## 180 32
## Outdoor Recreation Patio, Lawn & Garden
## 832 1480
## Pet Food & Supplies Power & Hand Tools
## 3248 659
## Prime Pantry Ring Alarms and Smart Lighting
## 4 5
## Ring Video Doorbells and Cameras Shoes, Handbags, Wallets, Sunglasses
## 9 9468
## Softlines Private Label Software
## 5253 1
## Software Download Sports & Fitness
## 5 3691
## Sports Collectibles Tires & Wheels
## 4 5
## Toys & Games Video Game Downloads
## 3578 28
## Video Games Video On Demand: Rent or Buy
## 71 10
## Watches Wine, Spirits & Beer
## 22 13
# proportional table of all the different categories and how many times they occur in the earnings table.
prop.table(table(select(amazon_fee_earnings, Category)))
##
## Amazon Cloud Cam Amazon Fresh Products
## 3.229700e-05 9.345945e-03
## Amazon Gift Cards AmazonBasics Microwave
## 4.844550e-04 8.074251e-06
## Arts, Crafts & Sewing Audible Audiobooks
## 1.211138e-04 1.211138e-05
## Automotive Baby & Nursery
## 2.385941e-03 4.888959e-03
## Beauty & Grooming Blu-Ray & DVD
## 5.795697e-02 2.260790e-04
## Books & Textbooks Business & Industrial Supplies
## 6.895410e-03 4.267242e-03
## Camera, Photo & Video CDs & Vinyl
## 2.377867e-03 9.689101e-05
## Cell Phones & Accessories Clothing & Accessories
## 1.092042e-02 5.775956e-01
## Computers, Tablets & Components Digital Music
## 2.434387e-03 8.074251e-06
## Digital Products Accessory Digital Video Subscription
## 4.037125e-06 2.018563e-05
## Echo & Alexa Accessories Echo Devices
## 8.477963e-05 1.857078e-04
## Electronic Components & Home Audio Element Smart TV
## 3.677821e-03 2.825988e-05
## Fire Tablets Fire Tablets Accessories
## 1.816706e-04 7.670538e-05
## Fire TV Accessories Fire TV Devices
## 1.614850e-05 8.477963e-05
## Fresh Prepared Furniture
## 1.735964e-04 7.961211e-03
## Grocery & Gourmet Food Handmade
## 1.560753e-02 4.117868e-04
## Health & Household Health & Personal Care Appliances
## 2.245045e-02 5.357265e-03
## Home Home Entertainment
## 5.681447e-02 1.170766e-04
## Home Improvement Jewelry
## 1.070646e-02 3.808220e-02
## Kindle Books Kindle E-readers
## 1.090024e-04 5.248263e-05
## Kindle E-readers Accessories Kitchen & Dining
## 2.018563e-05 2.201041e-02
## Luggage Luxury Beauty
## 7.024598e-03 2.813876e-03
## Magazines Major Appliances
## 3.229700e-05 1.776335e-04
## Musical Instruments Office & School Supplies
## 2.503018e-04 9.955551e-03
## Other Other Gift Card Brands
## 7.266826e-04 1.291880e-04
## Outdoor Recreation Patio, Lawn & Garden
## 3.358888e-03 5.974946e-03
## Pet Food & Supplies Power & Hand Tools
## 1.311258e-02 2.660466e-03
## Prime Pantry Ring Alarms and Smart Lighting
## 1.614850e-05 2.018563e-05
## Ring Video Doorbells and Cameras Shoes, Handbags, Wallets, Sunglasses
## 3.633413e-05 3.822350e-02
## Softlines Private Label Software
## 2.120702e-02 4.037125e-06
## Software Download Sports & Fitness
## 2.018563e-05 1.490103e-02
## Sports Collectibles Tires & Wheels
## 1.614850e-05 2.018563e-05
## Toys & Games Video Game Downloads
## 1.444483e-02 1.130395e-04
## Video Games Video On Demand: Rent or Buy
## 2.866359e-04 4.037125e-05
## Watches Wine, Spirits & Beer
## 8.881676e-05 5.248263e-05
Based on these I can see that there is a top category (Clothing and Accessories) that far outperforms other categories in terms of quantity of sales.
# For a clearer graph, I am taking the only the top 20 categories by total items shipped
total_items_shipped <- amazon_fee_earnings %>%
select(Category, `Items Shipped`) %>%
group_by(Category) %>%
summarize(`Total Items Shipped` = sum(`Items Shipped`)) %>%
arrange(desc(`Total Items Shipped`)) %>%
slice(1:20)
# Graphing - arrange and mutate to order the y-axis by most to least
total_items_shipped %>%
arrange(`Total Items Shipped`) %>%
mutate(Category=factor(Category, levels=Category)) %>%
ggplot(aes(y = Category, x = `Total Items Shipped`, label = `Total Items Shipped`)) +
geom_col() +
labs(y = "Category", x = "Total Items Shipped") +
theme(axis.text.y = element_text(angle = 15, hjust = 1))
# Taking out the bar labels as they are a little bit messy
# geom_text(position = position_stack(vjust = 0.7),size = 2.5) +
Looking at just the top 20 categories (by Total Items Shipped) we can see that there aren’t really any other categories that come close to Clothing and Accessories. I then look at the same data by revenue.
# For a clearer graph, I am taking the only the top 20 categories by total revenue
total_revenue <- amazon_fee_earnings %>%
select(Category, `Revenue($)`) %>%
group_by(Category) %>%
summarize(`Total Revenue` = sum(`Revenue($)`)) %>%
arrange(desc(`Total Revenue`)) %>%
slice(1:20)
# Graphing - arrange and mutate to order the y-axis by most to least (by revenue)
total_revenue %>%
arrange(`Total Revenue`) %>%
mutate(Category=factor(Category, levels=Category)) %>%
ggplot(aes(y = Category, x = `Total Revenue`)) +
geom_col() +
scale_x_continuous(labels = scales::dollar) +
labs(y = "Category", x = "Total Revenue") +
theme(axis.text.y = element_text(angle = 15,hjust = 1))
Interestingly, the top 20 categories for Revenue do not perfectly match the top 20 categories for Total Sales. This likely has to do with the average cost of items in each category (keep in mind to look at for final paper).
# For a clearer graph, I am taking the only the top 20 categories by ad spend
total_ads <- amazon_fee_earnings %>%
select(Category, `Ad Fees($)`) %>%
group_by(Category) %>%
summarize(`Total Ad Fees` = sum(`Ad Fees($)`)) %>%
arrange(desc(`Total Ad Fees`)) %>%
slice(1:20)
# Graphing - arrange and mutate to order the y-axis by most to least (by revenue)
total_ads %>%
arrange(`Total Ad Fees`) %>%
mutate(Category=factor(Category, levels=Category)) %>%
ggplot(aes(y = Category, x = `Total Ad Fees`)) +
geom_col() +
scale_x_continuous(labels = scales::dollar) +
labs(y = "Category", x = "Total Ad Fees") +
theme(axis.text.y = element_text(angle = 15,hjust = 1))
Ad spend, however, does seem to closely match revenue by Category. This indicates that Ad Spend might directly lead to more revenue (and thus, spending money on ads is worth it).
I want to take a closer look at the relationship between Revenue and Ad Spend by Category. My hypothesis is that there is a relationship between dollars spent in Ads and Revenue.
# I am using a scatterplot to look at the relationship btwn revenue and ad spend, so no need to limit to the top 20 categories.
total_ads_revenue <- amazon_fee_earnings %>%
select(Category, `Ad Fees($)`, `Revenue($)`) %>%
group_by(Category) %>%
summarize(`Total Ad Fees` = sum(`Ad Fees($)`), `Total Revenue` = sum(`Revenue($)`))
# Scatterplot
total_ads_revenue %>%
ggplot(aes(y = `Total Revenue`, x = `Total Ad Fees`)) +
geom_point() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar)
It looks like there is a clear correlation between $ spent in Ad Fees and Revenue. It might be that this is the way Amazon collects ad fees (based on revenue), and not an indication of Ad Fees leading to Revenue.
It is also difficult to see what is happening at lower left corner of the graph because it is so skewed by Clothing & Accessories, so I am going to try to exclude that category to take a closer look.
# Excluding the Clothing & Accessories category to get a better look at the other categories.
total_ads_revenue %>%
filter(Category != "Clothing & Accessories") %>%
ggplot(aes(y = `Total Revenue`, x = `Total Ad Fees`)) +
geom_point() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar)
Okay, there is definitely a relationship. I am still worried that this is based on the payment structure of Amazon Influencers. I will test this same scatter plot on the total number of items shipped to see if the relationship holds up.
# I am using a scatterplot to look at the relationship btwn revenue and ad spend, so no need to limit to the top 20 categories.
total_ads_items_shipped <- amazon_fee_earnings %>%
select(Category, `Ad Fees($)`, `Items Shipped`) %>%
group_by(Category) %>%
summarize(`Total Ad Fees` = sum(`Ad Fees($)`), `Total Items Shipped` = sum(`Items Shipped`))
# Scatterplot
total_ads_items_shipped %>%
ggplot(aes(y = `Total Items Shipped`, x = `Total Ad Fees`)) +
geom_point() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar)
# Again, I am going to exclude Clothing & Accessories
total_ads_items_shipped %>%
filter(Category != "Clothing & Accessories") %>%
ggplot(aes(y = `Total Items Shipped`, x = `Total Ad Fees`)) +
geom_point() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar)
From my preliminary analysis here, I am concluding that there is a relationship between Total Ad Spend and Revenue/Items Shipped. However, it looks like there are categories where ads might not be as relavent to sales. Next steps (for my final paper) that I might want to consider in the relationship between ads and revenue: 1. Looking at Direct vs. Indirect sales (what do people follow links for vs. what do they buy when they’re already shopping). 2. As I noted above, I want to do some more summary analyses of categories (such as average price per item). It might also be cool to do a deep dive into a specific category (such as Clothing & Accessories). 3. Buyer analyses: when are they buying, how?
```
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
Indirect/direct - If someone follows Lucy’s link to Amazon and buys that item it will count as direct sales, but if they also buy additional items, she also gets commissions for those sales but they are counted as indirect sales.↩︎