Introduction

The data is a record of online purchases made on Amazon India. Using data analysis tools of RStudio We would like to explore the possibility of identifying the profitability, buying habits, and promotional offers on products purchased by the site’s visitors. We will first proceed to clean the dataset and then to discover the dataset features in order to decide on possibilities of data exploration and analysis.

Data Set

This dataset concerns record of online purchases on Amazon India website. It can be downloaded from Kaggle website at:

https://www.kaggle.com/datasets/thedevastator/2019-e-commerce-product-listings-from-amazon-ind

Elips<-read.csv("home_sdf_marketing_sample_for_amazon_in-ecommerce__20191001_20191031__30k_data.csv")
# read in the dataset
rmarkdown::paged_table(Elips) #tabulate the dataset 

We observe that the data set has 30000 rows with 16 columns. These column names are as follows:

names(Elips) #lists column names
##  [1] "index"                 "Uniq.Id"               "Crawl.Timestamp"      
##  [4] "Category"              "Product.Title"         "Product.Description"  
##  [7] "Brand"                 "Pack.Size.Or.Quantity" "Mrp"                  
## [10] "Price"                 "Site.Name"             "Offers"               
## [13] "Combo.Offers"          "Stock.Availibility"    "Product.Asin"         
## [16] "Image.Urls"

From our inspection we can conclude : * elips data contain 30000 of rows and 16 of coloumns * Each of column name : “index”, “Uniq.Id”, “Crawl.Timestamp”, “Category”, “Product.Title”, “Product.Description”, “Brand”,“Pack.Size.Or.Quantity”, “Mrp”, “Price”, “Site.Name”, “Offers”, “Combo.Offers”, “Stock.Availibility”, “Product.Asin”, “Image.Urls”

Library

library(dplyr)
library(magrittr)
library(stringr)
library(ggplot2)
library(tidyr)
library(reshape2)
library(glue)
library(plotly)

Data Cleansing

Check Data type for each column

str(Elips)
## 'data.frame':    30000 obs. of  16 variables:
##  $ index                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Uniq.Id              : chr  "eb49cc038190f6f03c272f79fbbce894" "1657cc30c438affede6a5060d6847363" "41654633cce38c8650690f6dbac01fd3" "08b1bd85c3efc2d7aa556fd79b073382" ...
##  $ Crawl.Timestamp      : chr  "2019-10-30 11:38:11 +0000" "2019-10-31 15:46:54 +0000" "2019-10-30 09:53:23 +0000" "2019-10-29 16:16:52 +0000" ...
##  $ Category             : chr  "Skin Care" "Skin Care" "Skin Care" "Skin Care" ...
##  $ Product.Title        : chr  " Lee posh Lactic Acid 60% Anti ageing Pigmentation Removing Glow Peel " " Branded SLB Works New 1.5mm Titanium 1200 needles Microneedles Meso Derma Roller Acne Scar Skin S B6M6 " " Generic 1 Pc brand snail eye cream remove dark circle eye lifting instant ageless snail cream for eye care ant"| __truncated__ " Generic Anti Snoring Snore Stopper Sleep Apnea Solution Lips Plasters Soft Space Cotton " ...
##  $ Product.Description  : chr  "PROFESSIONAL GRADE Face Peel: this peel stimulates collagen production, reducing the appearance of wrinkles, fi"| __truncated__ "Item name: 1.5mm titanium 1200 needles microneedles meso derma roller acne scar skin s b6m6 1.5mm titanium 1200"| __truncated__ "Use: eye, item type: cream, net wt: 20g, gzzz: ygzwbz, model number: lkwnys, gender: female, certification: gzz"| __truncated__ "Prevent the tongue from dropping backward or block the airway. 2. Help to develop a habit of breathing with nos"| __truncated__ ...
##  $ Brand                : chr  "Lee Posh" "SLB Works" "Generic" "Generic" ...
##  $ Pack.Size.Or.Quantity: chr  "" "" "" "" ...
##  $ Mrp                  : chr  "2000.00" "2040.00" "1824.00" "2185.00" ...
##  $ Price                : chr  "799.00" "2040.00" "1042.00" "1399.00" ...
##  $ Site.Name            : chr  "Amazon In" "Amazon In" "Amazon In" "Amazon In" ...
##  $ Offers               : chr  "60.05%" "0%" "42.87%" "35.97%" ...
##  $ Combo.Offers         : chr  "" "" "" "" ...
##  $ Stock.Availibility   : chr  "YES" "YES" "YES" "YES" ...
##  $ Product.Asin         : chr  "B072BGHNJ1" "B07QDTZYSJ" "B07DCSN8MP" "B07GLW9VQN" ...
##  $ Image.Urls           : chr  "https://images-na.ssl-images-amazon.com/images/I/41l0RjF1XIL._SS40_.jpg|https://images-na.ssl-images-amazon.com"| __truncated__ "https://images-na.ssl-images-amazon.com/images/I/31sQlZQw%2BDL._SS40_.jpg|https://images-na.ssl-images-amazon.c"| __truncated__ "https://images-na.ssl-images-amazon.com/images/I/51bWWBIkeZL._AC_US40_.jpg|https://images-na.ssl-images-amazon."| __truncated__ "https://images-na.ssl-images-amazon.com/images/I/31U9PV2OI5L._SS40_.jpg|https://images-na.ssl-images-amazon.com"| __truncated__ ...

We would like to maintain the “Crawl.Timestamp” column as this may provide useful business information: the times of high volume of online purchases. But we need to reformat the time representation into one that is compatible with the tools of RStudio.

Firstly we need to remove the last 5 characters in Crawl.Timestamp column.

Elips$Crawl.Timestamp= substring(Elips$Crawl.Timestamp,1, nchar(Elips$Crawl.Timestamp)-5)

Now to check for changes made.

str(Elips$Crawl.Timestamp)
##  chr [1:30000] "2019-10-30 11:38:11 " "2019-10-31 15:46:54 " ...

We can confirm that the time format is in “YYY-MM-DD hours:minutes:seconds” format.

Now we convert Crawl.Timestamp string to POSIXct date format

Elips$Crawl.Timestamp <- as.POSIXct(paste(Elips$Crawl.Timestamp),                        
                    format = "%Y-%m-%d %H:%M:%S", tz="UTC")
str(Elips$Crawl.Timestamp)
##  POSIXct[1:30000], format: "2019-10-30 11:38:11" "2019-10-31 15:46:54" "2019-10-30 09:53:23" ...

The above output confirms that we have successfully converted “Crawl.Timestamp” into POSIXct date type of “YYY-MM-DD hours:minutes:seconds” format.

We check for any missing value in the entirety of the dataset.

colSums(is.na(Elips))
##                 index               Uniq.Id       Crawl.Timestamp 
##                     0                     0                     0 
##              Category         Product.Title   Product.Description 
##                     0                     0                     0 
##                 Brand Pack.Size.Or.Quantity                   Mrp 
##                     0                     0                     0 
##                 Price             Site.Name                Offers 
##                     0                     0                     0 
##          Combo.Offers    Stock.Availibility          Product.Asin 
##                     0                     0                     0 
##            Image.Urls 
##                     0

A second check for missing data shows no missing values or NAs.

anyNA(Elips)
## [1] FALSE

The “Combo$Offers” column contains a short narration of promotional offers. We proceed to list this information.

unique(Elips$Combo.Offers) 
##  [1] ""                                                                                                                                                                                                                  
##  [2] " Myoc Neem Face Wash: Buy 2 Neem face Wash 100 ml Each and Get 1 Additional Neem face Wash 100 ml Absolutely Free "                                                                                                
##  [3] " Arochem Kala Bhoot, 2ml Buy 1 Get 1 Free "                                                                                                                                                                        
##  [4] " GUYSOME - (BUY 3 GET 1 FREE) Green Apple Fragrance Intimate Hygiene Wash for Men (4x100ml) "                                                                                                                      
##  [5] " Khadi Rain Forest Soap 75 g (Buy 3 Get 3) "                                                                                                                                                                       
##  [6] " Khadi Rose Soap - 75g (Buy 3 Get 3) "                                                                                                                                                                             
##  [7] " Apis Himalaya Honey, 1kg Each (Buy 1 Get 1 Free) "                                                                                                                                                                
##  [8] " Lever Ayush Natural Fairness Saffron Soap, 100 g each (Buy 4 Get 1) "                                                                                                                                             
##  [9] " Generic Buy 3 Get 1 Gift Aloe Vera Gel+Blackhead Removes Repair After-Sun Acne Anti-Aging Anti-inflammatory Relieving itching Skin Care "                                                                         
## [10] " Pure4Sure Buy 2 Handwash and Get Disposable Toilet Seat Cover Free "                                                                                                                                              
## [11] " Khadi Lemongrass Soap, 75 g (Buy 3 Get 3) "                                                                                                                                                                       
## [12] " Thanaka powder 15g Buy 1 get 1 Free For permanent hair removal Special Offfer "                                                                                                                                   
## [13] " V.R. Enterprises Buy 5 Qty Get 1 Free Premium Quality Loka Amrith Organic Dishwashgel (Wk01Aq01_Neem)5 Combo "                                                                                                    
## [14] " Al-Nuim Chocolate Musk 3 ML Attar Perfume Oil Alcohol Free Natural Buy 1 Get 1 Free "                                                                                                                             
## [15] " Jimmys Gourmet Kitchen Almond Butter Unsweetened with Coffee (Buy 2 get 1 Free) (Stone Grind)(Slow Grind)(Gluten Free)(Vegan)(No Artificial Sugars Added)(No Oil Added)(410gx3)(1230g) "                          
## [16] " RTF Special offer Aloe vera Magnetic cool eye mask buy one get one free!!! "                                                                                                                                      
## [17] " BuyChoice Vegetable Enzymes Moisturizing Essence Cream Oil Control Facial "                                                                                                                                       
## [18] " Apricot Scrub MYOC (from Salvia) Paraben Free : Buy 3 Apricot Scrub 100 ml Each and Get 2 Additional Apricot Scrub 100 ml Absolutely Free "                                                                       
## [19] " Generic Buy 3 get 1 gift SOONPURE Potent Anti Wrinkle Face Mask 3PCS + SOONPURE Peach Blossom Leaf Face Mask Anti Aging Anti Winkles "                                                                            
## [20] " 🌿 Vedansh Organic Beard and Hair Growth Serum With Argan Oil, Almond Oil, Castor Oil, Jojoba Oil, Vitamin E, 30ml | (BUY 2 GET 1 FREE) "                                                                          
## [21] " Urvi Creations Indian Traditional Natural Herbal Henna Mehandi cones set of 2 Buy Get 2 Mehandi Cone Free Mehandi For Wedding Festivals and Beauty "                                                              
## [22] " Jimmys Gourmet Kitchen | Granola Cereals | 40%Fruit & Nut | Buy 1kg get 1 kg Free | Vegan Gluten Free | Maple Syrup "                                                                                             
## [23] " Golden Beauty Alpine Dew Talc, 400g (Buy one get one) "                                                                                                                                                           
## [24] " Buy Superbee Jamun Honey 1 kg Get Free Treatment with Natural Honey and Herbs(प्राकृतिक जड़ी-बूटियाँ एवं शहद द्वारा रोगोपचार) Book "                                                                                     
## [25] " Truefarm Organic Himalayan Pink Salt, Buy 1 Get 1 Free 2kg (Pack of 2) "                                                                                                                                          
## [26] " TRUNISH MILK TURMERIC SOAP (Buy 4 Get 1 Free) "                                                                                                                                                                   
## [27] " Greenberry Organics SPF 40+ Sunscreen Spray Lotion | Goodness of Kiwi Extracts | Sun Shield | Skin Protection | Skin Revival | All Skin Types,Buy 2 Get 1 Free "                                                  
## [28] " QYF Buy1 Get 1Gift Hyaluronic Acid Serum Acne Treatment Moisturizing Anti Wrinkles "                                                                                                                              
## [29] " Soursop Leaves Tea (Buy 2 get 1 Free) "                                                                                                                                                                           
## [30] " Wonderland Roasted & Salted Almonds 200g Box (Buy 2 Get 1 Free) "                                                                                                                                                 
## [31] " Premium Beauty Talc, 300g (Buy 1 Get 1 Free) "                                                                                                                                                                    
## [32] " Mugi Fresh Blue Bell Fabric Conditioner Buy 2 get 1 Free (400ml) "                                                                                                                                                
## [33] " Simgin Herbal Face Pack -2- Buy Orange and Rose Powder and get Multani Mitti Powder absolutely FREE!! (offer valid till stock lasts!) - Pure, Best Quality, 100% Herbal- Total: 300g (Separate packings) -SH-EOP "
## [34] " Fragrance Tree Tea Tree 100% Pure Essential Oil, 10ML (Special offer - Buy 1 Get 1 Free) "                                                                                                                        
## [35] " Wipro Safewash Liquid Detergent for Woolen - 1 kg (Buy 1 Get 1 Free) with Free Santoor Hand Wash - 360ml "                                                                                                        
## [36] " Ziofit Walnut Kernels Extra Light Quarters, 250g (Buy 1 Get 1 Free) "                                                                                                                                             
## [37] " Buy 3 Get 1 Gift Skin Care Volcanic Soil Facial Mask Remove Blackhead Deep cleaning Repair Whitening Moisture Oil-Control Cream "

There are only 37 offers made out of the 30000 products on offer. We check for the ‘category’ for each non null ‘Combo.Offers’,

#Subsetting rows where Elips$Offers is neither NA nor blank 
Subs1<-Elips[!(is.na(Elips$Combo.Offers) | Elips$Combo.Offers==""), ]
head(Subs1)
##      index                          Uniq.Id     Crawl.Timestamp
## 645    644 0b706fb5981ee9e6d57edb6e2d01f2b1 2019-10-31 13:20:57
## 1137  1136 931b55de61d0e0e272175eec5cdb531b 2019-10-31 07:16:07
## 1419  1418 f2a904708c9d35ee66496b86feb4dc58 2019-10-29 21:39:34
## 2764  2763 b3977939a83fe2e6bbed9e8b3e3aa7d2 2019-10-30 15:56:05
## 2815  2814 71773e86d7bd6569bdc084b5e2daabac 2019-10-30 16:14:58
## 2861  2860 942b67ce0d83e3dfe8074661df01b5fb 2019-10-30 11:33:36
##                     Category
## 645                Skin Care
## 1137               Fragrance
## 1419               Skin Care
## 2764           Bath & Shower
## 2815           Bath & Shower
## 2861 Grocery & Gourmet Foods
##                                                                                                           Product.Title
## 645   Myoc Neem Face Wash: Buy 2 Neem face Wash 100 ml Each and Get 1 Additional Neem face Wash 100 ml Absolutely Free 
## 1137                                                                          Arochem Kala Bhoot, 2ml Buy 1 Get 1 Free 
## 1419                        GUYSOME - (BUY 3 GET 1 FREE) Green Apple Fragrance Intimate Hygiene Wash for Men (4x100ml) 
## 2764                                                                         Khadi Rain Forest Soap 75 g (Buy 3 Get 3) 
## 2815                                                                               Khadi Rose Soap - 75g (Buy 3 Get 3) 
## 2861                                                                  Apis Himalaya Honey, 1kg Each (Buy 1 Get 1 Free) 
##                                                                                                                                                                                                                Product.Description
## 645                                                                                                                                                                                                                               
## 1137                                                                                                                                                                                      Arochem Kala Bhoot, 2ml Buy 1 Get 1 Free
## 1419                                                                                                                                                                                                                              
## 2764 Cool and Crisp, this bar soap delivers a clean as invigorating as a dip in a natural spring, you'll enjoy that clean, refreshed feeling all day. It instantly makes your skin feeling cleansed, refreshed and subtly scented.
## 2815                                          Enlighten your mood with the instant sense gratification that the fragrance of Rose endows. Embellish the sentiment in the care of Glycerin. Enjoy your own floral paradise everyday
## 2861                                                                                                                                                 Winner of numerous industry and government of India awards for honey exports.
##             Brand Pack.Size.Or.Quantity     Mrp   Price Site.Name Offers
## 645   SALVIA MYOC                 349 g  196.00  186.00 Amazon In   5.1%
## 1137 Modern Roots                        200.00  150.00 Amazon In  25.0%
## 1419      GUYSOME                       1840.00 1173.00 Amazon In 36.25%
## 2764        Khadi                        120.00  120.00 Amazon In     0%
## 2815        Khadi                        120.00  120.00 Amazon In     0%
## 2861         Apis                  2 Kg  390.00  370.50 Amazon In   5.0%
##                                                                                                            Combo.Offers
## 645   Myoc Neem Face Wash: Buy 2 Neem face Wash 100 ml Each and Get 1 Additional Neem face Wash 100 ml Absolutely Free 
## 1137                                                                          Arochem Kala Bhoot, 2ml Buy 1 Get 1 Free 
## 1419                        GUYSOME - (BUY 3 GET 1 FREE) Green Apple Fragrance Intimate Hygiene Wash for Men (4x100ml) 
## 2764                                                                         Khadi Rain Forest Soap 75 g (Buy 3 Get 3) 
## 2815                                                                               Khadi Rose Soap - 75g (Buy 3 Get 3) 
## 2861                                                                  Apis Himalaya Honey, 1kg Each (Buy 1 Get 1 Free) 
##      Stock.Availibility Product.Asin
## 645                 YES   B07D1P8926
## 1137                YES   B079WK552W
## 1419                YES   B07G2G68CH
## 2764                YES   B07JFC2QKH
## 2815                YES   B07JC4Z1KZ
## 2861                YES   B07LBG26D1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Image.Urls
## 645                   https://images-na.ssl-images-amazon.com/images/I/51Bjj4FP0aL._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/I/41qWpO4ydLL._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/I/519Ic7GJq8L._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/I/51DDLiURRML._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/I/41i4lu70ahL._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/I/51RE%2B%2BMcIWL._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/I/51nGhy%2BwoWL._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif
## 1137                                                                                                                                                                                                                                                                                                                                                                                                                                                                     https://images-na.ssl-images-amazon.com/images/I/41IpnsLv3LL._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif
## 1419 https://images-na.ssl-images-amazon.com/images/I/51Q5K1yw-4L._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/I/41mGFvV%2Bx8L._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/I/511s2fQ2yVL._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/I/51m7a7J6LdL._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/I/51s226yoDlL._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/I/41Oewmh4rWL._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/I/514EOzFub4L._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif
## 2764                                                                                                                                                                                                                                                                                                                                                                                                                                                                        https://images-na.ssl-images-amazon.com/images/I/51kvnTkzeCL._SS40_.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif
## 2815                                                                                                                                                                                                                                                                                                                                                                                                                                                                     https://images-na.ssl-images-amazon.com/images/I/41Br7Gmej2L._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif
## 2861                                                                                                                                                                                                                                                                                                                                                                                        https://images-na.ssl-images-amazon.com/images/I/51AyAtD3euL._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/I/516rCbL%2BLTL._AC_US40_.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif

“Subs1” shows that the categories whereby “Combo.Offers” are not null are also the typical categories of the majority of purchases. This shows that purchases which may have been influenced by promotional offers are insignificant. This is in view of the fact that there are 29,963 (30000 less 37) other online purchases made which have no promotional offers attached. We can therefore delete the Combo.Offers column as it provides no significant information. We can also remove other columns : [1]“Index” ,[2] ” Uniq.Id “, [5]”Product.Title”,[6]“Product.Description”, [8] “Pack.Size.Or.Quantity”, [11]“Site.Name”, [15]“Product.Asin”, [16]“Image.urls”; as these columns do not lend themselves to statistical analysis or are these columns of any business informational value.

Elips <- Elips[,-c(1,2,5,6,8, 11, 13, 15,16)]

We list the remaining columns from the original dataset we have chosen to include in our analysis.

colnames(Elips)
## [1] "Crawl.Timestamp"    "Category"           "Brand"             
## [4] "Mrp"                "Price"              "Offers"            
## [7] "Stock.Availibility"

“Price” and “Mrp” (Manufacturer retail price) are in character format. We need to transform this into numeric data type. Firstly we check number of missing values in the “Mrp” column,

 unique (unlist (lapply (Elips$Mrp, function (x) which (is.na (x)))))
## integer(0)

We proceed to convert “Mrp” and “Price” columns into numeric data types.

Elips$Mrp <- as.numeric(Elips$Mrp) 
## Warning: NAs introduced by coercion
Elips$Price <- as.numeric(Elips$Price)
## Warning: NAs introduced by coercion

As RStudio has warned that NAs or missing values are introduced into the columns “Price” and “Mrp” after the conversion of type. We now have to delete the rows containing these NAs.

str(Elips) #Check that data columns "Mrp" and "Price" have become numeric data types.
## 'data.frame':    30000 obs. of  7 variables:
##  $ Crawl.Timestamp   : POSIXct, format: "2019-10-30 11:38:11" "2019-10-31 15:46:54" ...
##  $ Category          : chr  "Skin Care" "Skin Care" "Skin Care" "Skin Care" ...
##  $ Brand             : chr  "Lee Posh" "SLB Works" "Generic" "Generic" ...
##  $ Mrp               : num  2000 2040 1824 2185 594 ...
##  $ Price             : num  799 2040 1042 1399 570 ...
##  $ Offers            : chr  "60.05%" "0%" "42.87%" "35.97%" ...
##  $ Stock.Availibility: chr  "YES" "YES" "YES" "YES" ...

Missing values introduced after conversion to numeric type confirmed by the following:

anyNA(Elips)
## [1] TRUE

We proceed to delete rows containing “Mrp” or “Price” data columns that are NAs.

Elips<-na.omit(Elips)

We can confirm that there are no longer any columns with NAs.

anyNA(Elips)
## [1] FALSE

We also need to convert ‘Character’ types to ‘Factor’ for ‘Category’, ‘Brand’, ‘Stock.Availibility’, ‘Offers’, ‘Combo.Offers’ columns.

Elips <- Elips %>% mutate_at(c('Category', 'Brand', 'Stock.Availibility', 'Offers'), as.factor)

Check of data types once more.

str(Elips)
## 'data.frame':    29240 obs. of  7 variables:
##  $ Crawl.Timestamp   : POSIXct, format: "2019-10-30 11:38:11" "2019-10-31 15:46:54" ...
##  $ Category          : Factor w/ 6 levels "Bath & Shower",..: 6 6 6 6 4 6 1 1 4 4 ...
##  $ Brand             : Factor w/ 8362 levels "","'SPARSH 4.0",..: 4283 6952 2806 2806 3186 6767 4234 4984 2640 2051 ...
##  $ Mrp               : num  2000 2040 1824 2185 594 ...
##  $ Price             : num  799 2040 1042 1399 570 ...
##  $ Offers            : Factor w/ 4338 levels "0%","0.0%","0.02%",..: 3589 1 2759 2209 2503 1 1007 1 391 1 ...
##  $ Stock.Availibility: Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:760] 103 119 251 450 459 466 583 619 623 742 ...
##   ..- attr(*, "names")= chr [1:760] "103" "119" "251" "450" ...

Data Wrangling

We would like to create new information from the cleaned dataset. Since we have “Price” and “Mrp” we presuppose that the difference between the two column data would equal the price markup by Amazon India.

Create a “Price_Markup” column.

Elips$Price_Markup<-Elips$Price-Elips$Mrp

Check for creation of “Price_Markup” column.

head(Elips)
##       Crawl.Timestamp                Category     Brand  Mrp Price Offers
## 1 2019-10-30 11:38:11               Skin Care  Lee Posh 2000   799 60.05%
## 2 2019-10-31 15:46:54               Skin Care SLB Works 2040  2040     0%
## 3 2019-10-30 09:53:23               Skin Care   Generic 1824  1042 42.87%
## 4 2019-10-29 16:16:52               Skin Care   Generic 2185  1399 35.97%
## 5 2019-10-31 09:32:06 Grocery & Gourmet Foods   Harveys  594   570  4.04%
## 6 2019-10-30 19:14:20               Skin Care    ShiKai 5344  5344     0%
##   Stock.Availibility Price_Markup
## 1                YES        -1201
## 2                YES            0
## 3                YES         -782
## 4                YES         -786
## 5                YES          -24
## 6                YES            0

We note that the “Price_Markup” column are largely in the negative. This is very unlikely as businesses would not want to make a loss. With this reason we suspect that the “Mrp” information is inaccurate.

Pre-Analysis: Data Summary

Compute an overview of information.

summary(Elips)
##  Crawl.Timestamp                                     Category    
##  Min.   :2019-10-28 14:30:19.00   Bath & Shower          : 2075  
##  1st Qu.:2019-10-29 05:17:44.00   Detergents & Dishwash  :  190  
##  Median :2019-10-30 07:28:13.50   Fragrance              : 2088  
##  Mean   :2019-10-30 06:12:23.16   Grocery & Gourmet Foods: 8021  
##  3rd Qu.:2019-10-31 00:56:03.00   Hair Care              : 2407  
##  Max.   :2019-10-31 19:13:19.00   Skin Care              :14459  
##                                                                  
##           Brand            Mrp            Price             Offers     
##  CHOCOCRAFT  : 1465   Min.   :   12   Min.   :   12.0   0%     :12006  
##  Generic     : 1261   1st Qu.:  520   1st Qu.:  385.0   20.0%  :  884  
##  Chocholik   :  698   Median : 1182   Median :  919.5   42.86% :  440  
##  World Beauty:  527   Mean   : 2522   Mean   : 2027.6   66.67% :  280  
##  BOGATCHI    :  420   3rd Qu.: 3554   3rd Qu.: 3028.2   23.09% :  194  
##  Stockout    :  406   Max.   :80000   Max.   :21990.0   20.01% :  189  
##  (Other)     :24463                                     (Other):15247  
##  Stock.Availibility  Price_Markup   
##  NO :    7          Min.   :-75701  
##  YES:29233          1st Qu.:  -397  
##                     Median :   -51  
##                     Mean   :  -494  
##                     3rd Qu.:     0  
##                     Max.   :     0  
## 

Summary:

1 - Data was only collected between 2019-10-28 and 2019-10-31 (3 days).

2 - “Skincare” products is the most frequently bought item, followed by “Grocery and Gourmet Foods”, “Hair Care”, “Bath and Shower” and “Fragrance”

3 - 12300 out of 30000 sales was achieved through zero “Offers” and more than 50% of sales was achieved through some promotional offers.

4 - 280 and 440 of sales was achieved through a promotional 66% and 42% discount of “Price”.

5 - Mean “Price” of products was in the 2000’s range (2023).

6 - Most expensive item bought was priced at 21990.

7 - As “Profit” shows a mean of a loss it is very likely that “Mrp” is quoted for bulk purchases or under-reported.

8 - There is no monopoly of brands for all categories.

9 - 1.6% of products listed on site was not available in stock.

Analysis

Now that we have obtained the cleaned dataset we have identified several business questions which may be sufficiently satisfied by an analysis of this cleaned dataset; namely, profitability,

Pricing

What is the “Price” from each “Brand” and “Category”, and which Category is the highest?

A quick way to display overall data is using a heatmap.

heatmap(xtabs(Price~Category+Brand, Elips), cexRow = 0.8, cexCol = 0.8, scale = "column", Colv = NA, Rowv = NA)

“Skincare” products provided the highest prices as the darkest lines inhabit the “Skincare” segment in the heatmap above.

What is the Mean of Price by Category?

mean_price<-aggregate(x=Elips$Price, by=list(Elips$Category), FUN=mean)

mean_price
##                   Group.1        x
## 1           Bath & Shower 2338.476
## 2   Detergents & Dishwash 1465.887
## 3               Fragrance 2484.621
## 4 Grocery & Gourmet Foods 1056.341
## 5               Hair Care 2232.822
## 6               Skin Care 2428.919
par(mar = c(10, 8, 3, 4), # change the margins
    lwd = 2, # increase the line thickness
    cex.axis = 1, # increase default axis label size
    srt=45
    )

#Plot mean price by category
mean_price.barplot <-barplot(height = mean_price[,2], names.arg = mean_price[,1],
                              main="Mean Price by Category",
                      xaxt="n", yaxt="n",
                      xlab="",
                      ylab="Mean Price",
                      ylim = c(0,3000),
)

end_point = 0.5 + nrow(mean_price) 

#rotate 60 degrees (srt = 60)
text(seq(1, end_point, by = 1), par("usr")[3]-0.45, 
     srt = 55, adj = 1, xpd = TRUE,
     labels = mean_price[,1], cex = 0.95,
     font = 2)


## Draw the y-axis.
axis(side = 2,
     ## Rotate the labels.
     las = 1,
     ## Adjust the label position.
     mgp = c(3, 1, 0.25),
     font = 2,       

     )

“Fragrance” products provided the highest of “Price” with a mean of 2484.

We have suspected that the “Mrp” figures are inaccurate as the amount of markup of prices are negative. To illustrate this fact further we proceed to display the mean of “Price_markup” according to “category”.

mean_price_markup<-aggregate(Price_Markup~Category,Elips,mean)

mean_price_markup
##                  Category Price_Markup
## 1           Bath & Shower    -455.9716
## 2   Detergents & Dishwash    -157.7737
## 3               Fragrance    -430.5970
## 4 Grocery & Gourmet Foods    -114.8013
## 5               Hair Care    -518.0404
## 6               Skin Care    -719.3115
par(mar = c(10, 8, 3, 4), # change the margins
    lwd = 2, # increase the line thickness
    cex.axis = 1, # increase default axis label size
    srt=45
    )

#Plot mean price by category
mean_price.barplot <-barplot(height = mean_price_markup[,2], names.arg = mean_price_markup[,1],
                              main="Mean of Price Markup by Category",
                      xaxt="n", yaxt="n",
                      xlab="",
                      ylab="Mean Price Markup",
                      ylim = c(-1000,100),
)

end_point = 0.5 + nrow(mean_price_markup) 

#rotate 60 degrees (srt = 60)
text(seq(1, end_point, by = 1), par("usr")[3]-0.025, 
     srt = 65, adj = 1, xpd = TRUE,
     labels = mean_price_markup[,1], cex = 0.95,
     font = 2)



## Draw the y-axis.
axis(side = 2,
     ## Rotate the labels.
     las = 1,
     ## Adjust the label position.
     mgp = c(3, 0.75, 0),
     font=2,        # Axis line color

     )

The lowest mean “Price_Markup” is for “SkinCare” and this is about -719, followed by “HairCare” at -518, “Bath & Shower” at -455, and “Fragrance” at -490.

What is the highest ’Price_Markup” by “Category”?

We can answer this quickly with the xtabs() function.

xtabs(Price_Markup~ Category, Elips)
## Category
##           Bath & Shower   Detergents & Dishwash               Fragrance 
##               -946141.1                -29977.0               -899086.5 
## Grocery & Gourmet Foods               Hair Care               Skin Care 
##               -920821.3              -1246923.2             -10400525.3

“Detergents & Dishwash” yielded the least amount of discount and “Skincare” yielded the biggest discount to “Mrp”.

We now attempt to present the distribution of “Price_Markup”. We group “Price_Markup” into ranges : “-10000-(-1001)”, “-1000-(-801)”,“-799-(-501)”, “-499-(-301)”, “-300-(-101)”, “-100-0”,“1-100”.

Elips$Price_Markup_group <- cut(Elips$Price_Markup, breaks = c(-10000, -1000, -800, -500, -300, -100, 0, 100), labels = c( "-10000-(-1001)", "-1000-(-801)","-799-(-501)", "-499-(-301)","-300-(-101)", "-100-0","1-100"), right = TRUE)

and proceed to employ a bar chart to display the distribution.

Elips_Price_Markup_group<-Elips %>% 
                  group_by(Price_Markup_group) %>%
                  summarise(n = n())%>%
                  arrange(desc(n))

Elips_Price_Markup_group <-Elips_Price_Markup_group %>%
  mutate(
    label = glue(
      "{Price_Markup_group}:{n}"
    ) 
  )

plot1 <- ggplot(data = Elips_Price_Markup_group, aes(x = Price_Markup_group,
                              y = reorder(n, Price_Markup_group),
                              text = label)) + 
  geom_col(aes(fill = n)) +
  scale_fill_gradient(low="red", high="black") +
  labs(title = "Distribution of Price Markup",
       x = "Price_Markup Grouping",
       y = "Number of Sales") +
  theme_minimal() +
  theme(legend.position = "none")+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

ggplotly(plot1, tooltip = "text")

Assuming that the “Mrp” data is accurate and that the online business model really is to sell products at much discounted prices then the plot shows that 15736 of the products are sold at discount between 100 to 0 , 4957 of the products are sold at a discount between 300 to 100, 3680 of the products are sold at a discount of 1000 or more, etcetera. Also note that the “Price_Markup” of 143 of the products are NA (Not Available). It is unlikely that Amazon India would want to operate at a loss so it is unlikely that this dataset analysis is able to represent the business model or profitability of Amazon India.

Offers and Promotions

We also would like to investigate the “offers” associated with the purchases made. But first we have to convert the data type of “Offers” column from ‘Character’ to ‘Numeric’ by removing the percentage sign.

Elips$Offers<-gsub("%$","",Elips$Offers)
Elips$Offers<-as.numeric(Elips$Offers)

We group “Offers” into % ranges :“0”, “1-9”, “10-19”, “20-29”, “30-39”,…, “90-99”.

Elips$Offers_group <- cut(Elips$Offers, breaks = c(-1,0,10,20,30,40, 50, 60, 70, 80, 90, 100), labels = c( "0", "1-9","10-19", "20-29","30-39","40-49","50-59", "60-69", "70-79", "80-89", "90-99"), right = TRUE)

We proceed to group the dataset into “Offers_group”, and count the frequencies of each grouping prior to plotting a distribution of ” % Offers”.

Elips_count_Offers_group<-Elips %>% 
                  group_by(Offers_group) %>%
                  summarise(n = n())%>%
                  arrange(desc(n))

Elips_count_Offers_group <-Elips_count_Offers_group %>%
  mutate(
    label = glue(
      "{Offers_group}:{n}"
    ) 
  )

plot1 <- ggplot(data = Elips_count_Offers_group, aes(x = Offers_group,
                              y = reorder(n, Offers_group),
                              text = label)) + 
  geom_col(aes(fill = n)) +
  scale_fill_gradient(low="red", high="black") +
  labs(title = "Distribution of Offers (%)",
       x = "Offers % Grouping",
       y = "Number of Sales") +
  theme_minimal() +
  theme(legend.position = "none")+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

ggplotly(plot1, tooltip = "text")

The above distribution plot shows that a majority of sales (12007) had zero % Offers attached, 2478 of the sales was sold with offers of 1-9%, and 4087 of the sales was sold with offers of 10-19%, etc.

Peak Times of Purchases

We mutate the dataset to show count of purchases grouped by “Category” and grouped by hourly slots between time 2019-10-28 14:30:19.00 to time 2019-10-31 19:13:19.00.

SalesTimeBinned<-Elips %>%
  mutate(hr = lubridate::hour(Crawl.Timestamp)) %>%
  count(hr, Category) %>%
  complete(Category, hr = seq(0, max(hr)), fill = list(n = 0)) %>%
  pivot_wider(names_from = Category, values_from = n)

SalesTimeBinned
## # A tibble: 24 × 7
##       hr `Bath & Shower` `Detergents & Dishwash` Fragr…¹ Groce…² Hair …³ Skin …⁴
##    <int>           <int>                   <int>   <int>   <int>   <int>   <int>
##  1     0              66                       0     190     436      70     993
##  2     1              24                       4     155     268     116     467
##  3     2              88                       7       5     154     119     766
##  4     3              73                      12      32     430      99     741
##  5     4              74                      39      18     525      64     758
##  6     5              59                       1      38     440      18     635
##  7     6              23                       0      35     472      27     338
##  8     7              68                       0      51     254      47     284
##  9     8              41                       0      18     309      62     149
## 10     9              31                       4      66     123     108     270
## # … with 14 more rows, and abbreviated variable names ¹​Fragrance,
## #   ²​`Grocery & Gourmet Foods`, ³​`Hair Care`, ⁴​`Skin Care`

Prepare the data for plotting.

df <- melt(SalesTimeBinned ,  id.vars = 'hr', variable.name = 'series')

ggplot(df, aes(hr, value)) +
  geom_line(aes(colour = series))+
  ggtitle("Count of sales by Hour of Day") +
  xlab("Time (hour)") + ylab("Count of Sales")

Occurrence of online sales for “Skin Care” and “Grocery and Gourmet Foods” products show similar peak times, that is at around 4:00, 16:00 and 00:00 hours. Timing of sales other Category of products show no discernible peak times.

Insights and Recommendations:

1- “Skincare” products tops online sales on Amazon India followed by sales of “Grocery & Gourmet Foods”, “Hair Care”, “Fragrance”, and then “Bath & Shower”.

2- The business model of the Amazon India is to sell products online at discounted prices to Manufacturers Retail Price. We have been unable to compute any measure of profitability reason being “Price” is less than “Mrp” (which only gives negative profits) and three days of data is not sufficient to provide a representative picture of the business.

3 - Accuracy of data for “Mrp” must be comparable to data for “Price” in order to calculate profitability - Accurate “Mrp” must be given for each quantity of item sold.

4- Collection and recording of data on “Mrp” and “Price” can be improved by enforcing a standard format.

5- More than half of all sales made have “‘%’ offers” attached. For example sales made under Offers at 40%-49% are quite popular with such sales numbering 2780.

6 - More data collection on demographics of buyers of “Skin Care” and “Grocery and Gourmet Foods” could provide more useful business information of the online market in India.

7- Peak online sales occur at 00, 400, and 1600 hours in any given 24 hours and this characterizes the buying pattern for “Skin Care” and “Grocery & Gourmet Foods” products.