R Markdown

Last Update: “19 September, 2020”

This is an R Markdown document. The data used for this analysis can be downloaded from the following website: https://www.kaggle.com/jmmvutu/summer-products-and-sales-in-ecommerce-wish. The dataset contains product listings as well as product ratings and sales performance.

The main goal of the Kaggle competition is to find patterns regarding the success of a product. With this goal in mind, I have structured the document as follows:

  • Reading Data
  • EDA - Exploratory Data Analysis
  • Models
  • Predictions

Necessary packages

library(dplyr)
library(ggplot2)
library(gridExtra)
library(glmnet)
library(h2o)
library(stringr)
library(lmtest)
library(fBasics)
library(car)
library(reshape2)
library(Metrics)
library(rcompanion)
library(fitdistrplus)
library(modelr)
library(statmod)

Reading Data

data<-read.csv("summer-products-with-rating-and-performance_2020-08.csv",
               stringsAsFactors=FALSE)

Exploratory Data Analysis

This step is particularly important for the transformation and selection of the variables.

First of all, I’ll take a quick look at what kind of data is present in the dataset.

str(data)
## 'data.frame':    1573 obs. of  43 variables:
##  $ title                       : chr  "2020 Summer Vintage Flamingo Print  Pajamas Set Casual Loose T Shirt Top And Elastic Shorts Women Sleepwear Nig"| __truncated__ "SSHOUSE Summer Casual Sleeveless Soirée Party Soirée sans manches Vêtements de plage sexy Mini robe femme wshC1612242400387A21" "2020 Nouvelle Arrivée Femmes Printemps et Été Plage Porter Longue Mince Cardigan Ouvert Avant Kimono Vert Feuil"| __truncated__ "Hot Summer Cool T-shirt pour les femmes Mode Tops Abeille Lettres imprimées Manches courtes O Neck Coton T-shir"| __truncated__ ...
##  $ title_orig                  : chr  "2020 Summer Vintage Flamingo Print  Pajamas Set Casual Loose T Shirt Top And Elastic Shorts Women Sleepwear Nig"| __truncated__ "Women's Casual Summer Sleeveless Sexy Mini Dress" "2020 New Arrival Women Spring and Summer Beach Wear Long Thin Cardigan Open Front Kimono Green Leaf Printed Chi"| __truncated__ "Hot Summer Cool T Shirt for Women Fashion Tops Bee Printed Letters Short Sleeve O Neck Cotton T-shirts Tops Tee Clothing" ...
##  $ price                       : num  16 8 8 8 2.72 3.92 7 12 11 5.78 ...
##  $ retail_price                : int  14 22 43 8 3 9 6 11 84 22 ...
##  $ currency_buyer              : chr  "EUR" "EUR" "EUR" "EUR" ...
##  $ units_sold                  : int  100 20000 100 5000 100 10 50000 1000 100 5000 ...
##  $ uses_ad_boosts              : int  0 1 0 1 1 0 0 0 1 0 ...
##  $ rating                      : num  3.76 3.45 3.57 4.03 3.1 5 3.84 3.76 3.47 3.6 ...
##  $ rating_count                : int  54 6135 14 579 20 1 6742 286 15 687 ...
##  $ rating_five_count           : int  26 2269 5 295 6 1 3172 120 6 287 ...
##  $ rating_four_count           : int  8 1027 4 119 4 0 1352 56 2 128 ...
##  $ rating_three_count          : int  10 1118 2 87 2 0 971 61 3 92 ...
##  $ rating_two_count            : int  1 644 0 42 2 0 490 18 1 68 ...
##  $ rating_one_count            : int  9 1077 3 36 6 0 757 31 3 112 ...
##  $ badges_count                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ badge_local_product         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ badge_product_quality       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ badge_fast_shipping         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tags                        : chr  "Summer,Fashion,womenunderwearsuit,printedpajamasset,womencasualshort,Women's Fashion,flamingo,loungewearset,Cas"| __truncated__ "Mini,womens dresses,Summer,Patchwork,fashion dress,Dress,Mini dress,Women's Fashion,Women S Clothing,backless,p"| __truncated__ "Summer,cardigan,women beachwear,chiffon,Sexy women,Coat,summercardigan,openfront,short sleeves,Swimsuit,Women's"| __truncated__ "Summer,Shorts,Cotton,Cotton T Shirt,Sleeve,printedletterstop,Clothing,Tops,Necks,short sleeves,Women's Fashion,"| __truncated__ ...
##  $ product_color               : chr  "white" "green" "leopardprint" "black" ...
##  $ product_variation_size_id   : chr  "M" "XS" "XS" "M" ...
##  $ product_variation_inventory : int  50 50 1 50 1 1 50 50 50 50 ...
##  $ shipping_option_name        : chr  "Livraison standard" "Livraison standard" "Livraison standard" "Livraison standard" ...
##  $ shipping_option_price       : int  4 2 3 2 1 1 2 3 2 2 ...
##  $ shipping_is_express         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ countries_shipped_to        : int  34 41 36 41 35 40 31 139 36 33 ...
##  $ inventory_total             : int  50 50 50 50 50 50 50 50 50 50 ...
##  $ has_urgency_banner          : int  1 1 1 NA 1 NA NA NA 1 NA ...
##  $ urgency_text                : chr  "Quantité limitée !" "Quantité limitée !" "Quantité limitée !" "" ...
##  $ origin_country              : chr  "CN" "CN" "CN" "CN" ...
##  $ merchant_title              : chr  "zgrdejia" "SaraHouse" "hxt520" "allenfan" ...
##  $ merchant_name               : chr  "zgrdejia" "sarahouse" "hxt520" "allenfan" ...
##  $ merchant_info_subtitle      : chr  "(568 notes)" "83 % avis positifs (17,752 notes)" "86 % avis positifs (295 notes)" "(23,832 notes)" ...
##  $ merchant_rating_count       : int  568 17752 295 23832 14482 65 10194 342 330 5534 ...
##  $ merchant_rating             : num  4.13 3.9 3.99 4.02 4 ...
##  $ merchant_id                 : chr  "595097d6a26f6e070cb878d1" "56458aa03a698c35c9050988" "5d464a1ffdf7bc44ee933c65" "58cfdefdacb37b556efdff7c" ...
##  $ merchant_has_profile_picture: int  0 0 0 0 0 0 1 0 0 0 ...
##  $ merchant_profile_picture    : chr  "" "" "" "" ...
##  $ product_url                 : chr  "https://www.wish.com/c/5e9ae51d43d6a96e303acdb0" "https://www.wish.com/c/58940d436a0d3d5da4e95a38" "https://www.wish.com/c/5ea10e2c617580260d55310a" "https://www.wish.com/c/5cedf17ad1d44c52c59e4aca" ...
##  $ product_picture             : chr  "https://contestimg.wish.com/api/webimage/5e9ae51d43d6a96e303acdb0-medium.jpg" "https://contestimg.wish.com/api/webimage/58940d436a0d3d5da4e95a38-medium.jpg" "https://contestimg.wish.com/api/webimage/5ea10e2c617580260d55310a-medium.jpg" "https://contestimg.wish.com/api/webimage/5cedf17ad1d44c52c59e4aca-medium.jpg" ...
##  $ product_id                  : chr  "5e9ae51d43d6a96e303acdb0" "58940d436a0d3d5da4e95a38" "5ea10e2c617580260d55310a" "5cedf17ad1d44c52c59e4aca" ...
##  $ theme                       : chr  "summer" "summer" "summer" "summer" ...
##  $ crawl_month                 : chr  "2020-08" "2020-08" "2020-08" "2020-08" ...

It can be seen that the dataset contains many character variables. The main focus of this section will be analyzing the different variables and, where necessary and useful, converting them.

Units Sold

This is our dependent variable. It is fundamental to understand its distribution and whether it may be necessary (or useful) to transform it.

summary(data$units_sold)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     100    1000    4339    5000  100000
hist(data$units_sold)

cat(paste0("It can be seen by the histogram that the distribution of the data is right-skewed. 
As a matter of fact the skewness is: ",skewness(data$units_sold)[1]))
## It can be seen by the histogram that the distribution of the data is right-skewed. 
## As a matter of fact the skewness is: 5.61411708067862

Additionally, if we plot a histogram of the data and add a normal curve with the same mean and standard deviation as units_sold, we can see the right-skewed pattern.

plotNormalHistogram(data$units_sold,main ="units_sold")

Due to the specific distribution of the dependent variable, a linear model should be avoided. There are several reasons why:

  • The outcome variable is not normally distributed;
  • The outcome variable is limited in the values it can take (count data means the predicted values cannot be negative).

The estimation and prediction strategy should be chosen considering the distribution of the dependent variable. That is, if the outcome variable is limited in the values it can take, it is necessary to choose a model whose predicted values fall within the possible range of the outcome. While linear regression is sometimes a good approximation for limited dependent variables, often it is not. Nevertheless, in the following analysis I will consider both MLR and GLM: the idea is to show why the GLM results should be preferred. Additionally, I will use a log transformation of the dependent variable. Transformations of this kind are common in academic practice, although the procedure amounts to forcing the dependent variable towards a normal distribution so that MLR can be used.
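
As a quick check of this point, the skewness can be recomputed after the log transform; a minimal sketch using fBasics::skewness and rcompanion::plotNormalHistogram, both already loaded:

# sketch: skewness of units_sold before vs. after a log transform
skewness(data$units_sold)[1]
skewness(log(data$units_sold))[1]   # min(units_sold) is 1, so log() is safe here
plotNormalHistogram(log(data$units_sold), main = "log(units_sold)")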

Merchant_ID & Product ID

These columns contain the IDs of the different merchants and products. There may be duplicates.

cat(paste0("In the dataset the duplicated number of merchant_id is"),
sum(duplicated(data$merchant_id)*1), 
"out of", length(data$merchant_id), "total merchant_id")
## In the dataset the duplicated number of merchant_id is 615 out of 1573 total merchant_id

In this case, it is possible that there are duplicated merchants (and maybe entire rows). In order to exclude this possibility, I want to check whether there are matches in product_id.

cat(paste0("In the dataset, the duplicated number of product_id is ",sum(duplicated(data$product_id)*1)))
## In the dataset, the duplicated number of product_id is 232

The results show a smaller number of duplicates. It will be necessary to remove those rows from the dataset.

duplicate_id<-(duplicated(data$product_id)*1)
duplicate_id<-which(duplicate_id==1)
data<-data[-duplicate_id,]
duplicate_id<-sum(duplicated(data$product_id)*1) #Check
cat(paste0("Number of duplicated product_id:"),duplicate_id)
## Number of duplicated product_id: 0

Duplicated rows successfully removed.

Another interesting aspect is the possibility that some merchants sell more than one product on the e-commerce site. It could be interesting to analyze this effect on the units_sold variable. Again, from my personal experience, a merchant selling more than one product on the marketplace increases my confidence in that merchant.

more_than_one_article<-as.factor(duplicated(data$merchant_id)*1)
data$more_than_one_article<-more_than_one_article 
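
One caveat: duplicated() flags only the second and later occurrences, so the first listing of a multi-product merchant is coded 0. If the intent is to flag every product whose merchant sells more than one item, a %in%-based flag may match it better; a hedged sketch, kept under a different name since the rest of the analysis uses the variable created above:

# sketch: flag ALL rows of merchants that appear more than once, not just the repeats
multi_merchants <- unique(data$merchant_id[duplicated(data$merchant_id)])
more_than_one_alt <- as.factor((data$merchant_id %in% multi_merchants) * 1)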

Merchant rating

This column contains information about the rating given to the different merchants in the dataset.

The only transformation I will apply to this variable is rounding with round().

data$merchant_rating<-round(data$merchant_rating,digits=1) # rounded to one digit to make comparisons easier
summary(data$merchant_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.300   3.900   4.000   4.041   4.200   5.000
hist(data$merchant_rating)

Finally, I check the correlation between the merchant_rating and units_sold variables.

cor(data$merchant_rating,data$units_sold)
## [1] 0.1097694

The cor() function shows the presence of a low correlation between the variables analyzed.
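
Given how right-skewed units_sold is, a rank-based correlation is less sensitive to extreme values and may be worth a look; a quick sketch:

# sketch: Spearman correlation is robust to the heavy right tail of units_sold
cor(data$merchant_rating, data$units_sold, method = "spearman")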

Rating count

The column contains the number of ratings received by the different merchants. First, I take a look at the data.

head(data$merchant_rating_count,20)
##  [1]    568  17752    295  23832  14482     65  10194    342    330   5534
## [11]   9470   3515    557   5083  55499  39381   6302 139223  22089   3592

Here are the values from summary():

summary(data$merchant_rating_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2366    8740   27874   25222 2174765

Based on my personal experience (with sites such as Zalando and Amazon), I have always considered the number of reviews (in this specific case merchant_rating_count) important when deciding whether or not to buy a specific product from a specific merchant. Taking this into account, I want to test the correlations between the variables merchant_rating_count, units_sold and merchant_rating.

cat(paste0("Correlation between the *merchant_rating_count* and *units_sold* variables is: ",cor(data$merchant_rating_count,data$units_sold)))
## Correlation between the *merchant_rating_count* and *units_sold* variables is: 0.272978836705809
cat(paste0("Correlation between the *merchant_rating_count* and *merchant_rating* variables is: ",cor(data$merchant_rating_count,data$merchant_rating)))
## Correlation between the *merchant_rating_count* and *merchant_rating* variables is: 0.131599096159153

These results should be considered in the models section.

Finally, I make sure to remove the NA values from all the rating count variables, replacing them with 0.

data$rating_five_count[is.na(data$rating_five_count)] <- 0
data$rating_four_count[is.na(data$rating_four_count)] <- 0
data$rating_three_count[is.na(data$rating_three_count)] <- 0
data$rating_two_count[is.na(data$rating_two_count)] <- 0
data$rating_one_count[is.na(data$rating_one_count)] <- 0
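
The five assignments above can also be collapsed into a single step; an equivalent sketch:

# sketch: replace NAs with 0 in all star-level rating count columns at once
rating_cols <- grep("^rating_(one|two|three|four|five)_count$", names(data), value = TRUE)
data[rating_cols] <- lapply(data[rating_cols], function(x) replace(x, is.na(x), 0))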

Tags

The dataset includes a column with the product tags. There are two main problems with this specific column:

  • The tags sometimes contain capital letters;
  • There are multiple tags per product.

In order to manage these problems, I will:

  • remove the # symbol;
  • convert all tags to lowercase.

Afterwards, I will create two new variables:

  • num_tags: the number of tags for each product. The idea is to understand whether there is a correlation between num_tags and units_sold;
  • tag_most_used: a dummy variable indicating the presence of the most used tag in the tags column. The reason is to understand whether a specific tag may increase units_sold.

data$tags = gsub(pattern='#',replacement='',
                                          x=data$tags)
data$tags<-tolower(data$tags)
num_tags<-lengths(strsplit(data$tags," "))
data$num_tags<-num_tags
cat(paste0("The correlation between the num_tags and units_sold variables is:"),
cor(data$num_tags,data$units_sold))
## The correlation between the num_tags and units_sold variables is: 0.03398721
tags<-unlist(strsplit(data$tags,","))
tags<-as.matrix(table(tags))
tags<-ifelse(tags<=300,NA,tags)
tags<-na.omit(tags)
tags<-unique(tags)
detect<-row.names(tags)[which.max(tags)]
cat(paste0("The most used tag is: ",detect ))
## The most used tag is: women's fashion
tag_most_used<-as.factor(str_detect(data$tags, detect)*1)
data$tag_most_used<-tag_most_used
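
One caveat about num_tags: the tags column is comma-separated (as the strsplit on "," above shows), while num_tags is built by splitting on a space, so it counts space-separated tokens rather than tags. A hedged sketch of a comma-based count, kept under a different name since the correlation above was computed with the space split:

# sketch: count tags by splitting on the actual delimiter
num_tags_comma <- lengths(strsplit(data$tags, ","))
# cor(num_tags_comma, data$units_sold) would give the corresponding correlation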

Inventory

There are two variables related to the inventory:

  • product_variation_inventory, described as “Inventory the seller has. Max allowed quantity is 50”;
  • inventory_total, described as “Total inventory for all the product’s variations (size/color variations for instance)”.

The descriptions of the two variables are not very clear. The analysis will help clarify their importance for the model.

hist(data$product_variation_inventory)

hist(data$inventory_total)

cat(paste0("Basically ",(max(table(data$inventory_total))/length(data$inventory_total))," percent of the observations in *inventory_total* are equal to 50."))
## Basically 0.992542878448919 percent of the observations in *inventory_total* are equal to 50.

Additionally, considering that

cat(paste0("The correlation between the variables *inventory_total* and *units_sold* is ", cor(data$units_sold,data$inventory_total)))
## The correlation between the variables *inventory_total* and *units_sold* is 0.00937539843820239

And that

cat(paste0((100*max(table(data$product_variation_inventory))/length(data$product_variation_inventory))," percent of the observations in *product_variation_inventory* are equal to 50 "))
## 59.1349739000746 percent of the observations in *product_variation_inventory* are equal to 50
cat(paste0("and the correlation between the variables *product_variation_inventory* and *units_sold* is ", cor(data$units_sold,data$product_variation_inventory)))
## and the correlation between the variables *product_variation_inventory* and *units_sold* is 0.127573263998713

Considering the results obtained, product_variation_inventory is more meaningful than inventory_total.

Urgency Text

The column urgency_text represents the presence of a banner informing customers about the limited quantity of a product.

table(data$urgency_text)
## 
##                                                Quantité limitée ! 
##                              975                              365 
## Réduction sur les achats en gros 
##                                1

As can be seen in the table, there are 365 observations with the value “Quantité limitée !” (“Limited quantity!”), 975 without a value, and 1 with “Réduction sur les achats en gros” (which literally means “discount on wholesale purchases”). In this case, I will transform the variable into a dummy, which will tell whether the presence (or absence) of the urgency text increases units_sold.

data$urgency_text[which(data$urgency_text == "Réduction sur les achats en gros", arr.ind=TRUE)]<-0 # I considered this as a
data$urgency_text[which(data$urgency_text == "Quantité limitée !", arr.ind=TRUE)]<-1
data$urgency_text[which(data$urgency_text == "", arr.ind=TRUE)]<-0
data$urgency_text<-as.factor(data$urgency_text)
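
To get a first feel for whether the banner matters before modeling, a simple group comparison can be computed; a sketch:

# sketch: median units_sold with (1) and without (0) the urgency text
tapply(data$units_sold, data$urgency_text, median)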

Urgency Banner

Considering the banner variables, has_urgency_banner describes the presence of an urgency banner for the specific product. It is not clear from the dataset description which kind of urgency it refers to. In any case, I will analyze and consider the variable.

str(data$has_urgency_banner)
##  int [1:1341] 1 1 1 NA 1 NA NA NA 1 NA ...

It can be seen that the variable contains only the values 1 and NA. A small adjustment (replacing NA with 0) and a transformation into a factor are necessary.

data$has_urgency_banner[is.na(data$has_urgency_banner)] <- 0
data$has_urgency_banner<-as.factor(data$has_urgency_banner)
table(data$has_urgency_banner)
## 
##   0   1 
## 975 366

Shipping

The next analysis considers the following variables:

  • shipping_is_express, which states whether the shipping option is express (1) or not (0);
  • shipping_option_name, which specifies the name of the shipping option;
  • shipping_option_price, which specifies the price of the shipping.

In the following tables, I show the values of the first two variables.

cat("Table data$shipping_option_name")
## Table data$shipping_option_name
table(data$shipping_option_name)
## 
##   Ekspresowa wysyłka         Envío normal         Envio Padrão 
##                    1                    4                    8 
##   Expediere Standard    Livraison Express   Livraison standard 
##                    4                    3                 1285 
##  Spedizione standard    Standard Shipping  Standardowa wysyłka 
##                    2                   18                    3 
##      Standardversand     Standart Gönderi Стандартная доставка 
##                    3                    2                    2 
##        الشحن القياسي     การส่งสินค้ามาตรฐาน      ការដឹកជញ្ជូនតាមស្តង់ដារ 
##                    3                    2                    1
cat("Table data$shipping_is_express")
## Table data$shipping_is_express
table(data$shipping_is_express)
## 
##    0    1 
## 1337    4

The first table shows the shipping option names in the buyers’ various languages: almost all products use standard shipping, while a few entries (Livraison Express, Ekspresowa wysyłka) are express options. Consistently, the second table shows that only 4 products have express shipping. Given how rare express shipping is, I decided not to use these two variables.

hist(data$shipping_option_price)

The observations contained in the shipping_option_price column could be useful for understanding what drives units_sold.

cat(paste0("The correlation between the *units_sold* and the *shipping_option_price* variables is: ",cor(data$units_sold,data$shipping_option_price)))
## The correlation between the *units_sold* and the *shipping_option_price* variables is: -0.0521409042295423

The variable shipping_option_price shows a low negative correlation with units_sold.

Badges

There are 4 columns related to the badges:

  • badge_local_product, which indicates whether the product is a local product;
  • badge_fast_shipping, awarded to the seller when shipping is considered fast;
  • badge_product_quality, awarded to the seller when customers give good evaluations;
  • badges_count, the total count of the badges given to the product or the seller.

It appears clearly that badges_count is the sum of the individual badges. Given this, I will initially use badges_count as the variable.
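
Assuming these three badge types are the only ones counted (the dataset description does not say), a quick consistency check is possible; a sketch:

# sketch: does badges_count equal the sum of the three individual badge flags?
all(data$badges_count ==
      data$badge_local_product + data$badge_product_quality + data$badge_fast_shipping)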

Color

In this section, I consider the product_color column.

str(data$product_color)
##  chr [1:1341] "white" "green" "leopardprint" "black" "yellow" "navyblue" ...

Since it is a chr column, I have to consider the possibility that it contains capital letters.

data$product_color<-tolower(data$product_color)
data$product_color<-ifelse(data$product_color=="","No_color",data$product_color)
table(data$product_color)
## 
##          applegreen             apricot                army          army green 
##                   1                   2                   1                   2 
##           armygreen               beige               black        black & blue 
##                  24                  14                 267                   2 
##       black & green      black & stripe       black & white      black & yellow 
##                   3                   1                   3                   2 
##          blackwhite                blue         blue & pink               brown 
##                   1                  85                   1                   6 
##      brown & yellow            burgundy               camel          camouflage 
##                   1                   2                   1                   3 
##              claret              coffee           coolblack            coralred 
##                   1                   5                   2                   1 
##            darkblue           darkgreen           denimblue           dustypink 
##                   5                   1                   1                   2 
##              floral    fluorescentgreen                gold                gray 
##                   5                   2                   1                   9 
##        gray & white               green                grey  greysnakeskinprint 
##                   1                  77                  65                   1 
##               ivory              jasper               khaki            lakeblue 
##                   1                   1                  11                   1 
##             leopard        leopardprint         light green           lightblue 
##                   3                   1                   1                   8 
##           lightgray          lightgreen           lightgrey          lightkhaki 
##                   1                   3                   1                   1 
##           lightpink         lightpurple            lightred         lightyellow 
##                   4                   1                   1                   2 
##           mintgreen          multicolor                navy           navy blue 
##                   2                  18                   3                   2 
##            navyblue    navyblue & white            No_color                nude 
##                  25                   1                  40                   1 
##            offblack            offwhite              orange orange & camouflage 
##                   1                   1                  24                   1 
##          orange-red                pink        pink & black         pink & blue 
##                   3                  86                   2                   1 
##         pink & grey        pink & white        prussianblue              purple 
##                   2                   1                   1                  49 
##             rainbow                 red          red & blue                rose 
##                   1                  79                   1                   3 
##            rose red            rosegold             rosered              silver 
##                   1                   1                   7                   2 
##             skyblue                star                 tan              violet 
##                   7                   1                   1                   1 
##       watermelonred               white       white & black       white & green 
##                   1                 209                   2                   7 
##         white & red         whitefloral         whitestripe                wine 
##                   1                   2                   1                   2 
##            wine red             winered    winered & yellow              yellow 
##                   1                  23                   1                  80
tablecol<-as.matrix(table(data$product_color))

An important issue in the dataset, related to color, is the presence of shade variants such as light blue or light red. For ease of analysis, I will merge these into their base color (e.g., light red becomes red). Additionally, whenever an entry contains two major colors at the same time, I will label it “multicolor”.

data$product_color = gsub(pattern='(light-*)|(light)',replacement='',
                                          x=data$product_color)

data$product_color = ifelse(str_detect(data$product_color, '&'),"multicolor",
                                            data$product_color)
data$product_color = replace(data$product_color,grep(pattern='.*blue*.|blue*.|.*blue',data$product_color),"blue")
data$product_color = replace(data$product_color,grep(pattern='.*green*.|green*.|.*green',data$product_color),"green")
data$product_color = replace(data$product_color,grep(pattern='.*red*.|red*.|.*red',data$product_color),"red")
data$product_color = replace(data$product_color,grep(pattern='.*white*.|white*.|.*white',data$product_color),"white")
data$product_color = replace(data$product_color,grep(pattern='.*black*.|black*.|.*black',data$product_color),"black")


data$product_color = gsub(pattern='(white*)|(*white)',replacement='white',
                                          x=data$product_color)
table(data$product_color)
## 
##      apricot         army        beige        black         blue        brown 
##            2            1           14          270          135            6 
##     burgundy        camel   camouflage       coffee    dustypink       floral 
##            2            1            3            5            2            5 
##         gold         gray        ivory       jasper        khaki      leopard 
##            1           10            1            1           12            3 
## leopardprint   multicolor         navy     No_color         nude       orange 
##            1           52            3           40            1           24 
##         pink       purple      rainbow          red         rose     rosegold 
##           90           50            1          298            3            1 
##       silver         star          tan       violet        white         wine 
##            2            1            1            1          214            2 
##       yellow 
##           82

I will select the 10 most used colors in the dataset. In the following table, the most used colors are arranged in decreasing order of their counts in the dataset.

table=as.matrix(table(data$product_color))
head(sort(table[,1],decreasing=TRUE),10)
##        red      black      white       blue       pink     yellow multicolor 
##        298        270        214        135         90         82         52 
##     purple   No_color     orange 
##         50         40         24
rownames=c(rownames(as.matrix(head(sort(table[,1],decreasing=TRUE),10))))
data$product_color = ifelse(data$product_color %in% rownames, data$product_color, 'others')
rownames=c(rownames(as.matrix(table(data$product_color))))

Considering the size of the dataset, I want to understand whether there is a relation between a specific color and units_sold. Following this idea, I will factorize product_color into 11 levels (the 10 most used colors plus the “others” value).

data$product_color = ifelse(data$product_color %in% rownames, data$product_color, 'others')
rownames=c(rownames(as.matrix(table(data$product_color))))
data$product_color<-factor(data$product_color,levels=rownames,labels=c(1:length(rownames)))

Size

In this section, I consider the product_variation_size_id column.

head(data$product_variation_size_id,20)
##  [1] "M"       "XS"      "XS"      "M"       "S"       "Size-XS" "XS"     
##  [8] "M."      "M"       "S"       "XXS"     "XS"      "XS"      "S"      
## [15] "S"       "S"       "L"       "S"       "S"       "S"

The code shows the first 20 rows. The main problem is the capitalization, which should be removed. Additionally, some rows are not clear. It will be necessary to clean the data.

data$product_variation_size_id<-tolower(data$product_variation_size_id)
data$product_variation_size_id<-gsub(pattern='.',replacement='',x=data$product_variation_size_id,fixed=TRUE)
table(data$product_variation_size_id)
## 
##                                                  04-3xl 
##                          13                           1 
##                           1                        1 pc 
##                           2                           1 
##                   1 pc - xl                       10 ml 
##                           1                           3 
##                      100 cm  100 x 100cm(393 x 393inch) 
##                           1                           1 
##                      100pcs                       10pcs 
##                           1                           1 
##                          17                    1m by 3m 
##                           1                           1 
##                         1pc                           2 
##                           1                           1 
##                       20pcs               20pcs-10pairs 
##                           1                           1 
##                          25                        25-s 
##                           1                           1 
##       26(waist 72cm 28inch)                          29 
##                           1                           2 
##                        2pcs                         2xl 
##                           2                           4 
##            3 layered anklet                       30 cm 
##                           1                           1 
##                        32/l                          33 
##                           1                           3 
##                          34                          35 
##                           3                           2 
##                          36                         3xl 
##                           1                           2 
##                           4                   4-5 years 
##                           1                           1 
##                       40 cm                         4xl 
##                           1                           4 
##                           5                      5pairs 
##                           1                           1 
##                         5xl                          60 
##                           1                           1 
##                         6xl                 80 x 200 cm 
##                           1                           1 
##         au plug low quality                           b 
##                           1                           1 
##             baby float boat base & top & matte top coat 
##                           1                           1 
##                   base coat               choose a size 
##                           1                           1 
##                daughter 24m                       eu 35 
##                           1                           2 
##                   eu39(us8)           first  generation 
##                           1                           1 
##      floating chair for kid                         h01 
##                           1                           1 
##                           l                           m 
##                          46                         180 
##                    one size                   pack of 1 
##                           2                           1 
##                     pants-s                       round 
##                           1                           1 
##                           s            s (waist58-62cm) 
##                         574                           1 
##             s diameter 30cm                      s pink 
##                           1                           1 
##                s(bust 88cm)             s(pink & black) 
##                           1                           1 
##                  s/m(child)                   size -xxs 
##                           1                           2 
##                      size m                      size s 
##                           2                           6 
##                     size xs                    size xxs 
##                           3                           2 
##                     size--s                    size-4xl 
##                           1                           1 
##                    size-5xl                      size-l 
##                           1                           1 
##                      size-s                     size-xs 
##                           3                           3 
##                    size-xxs                      size/s 
##                           2                           1 
##                     size4xl                       sizel 
##                           1                           2 
##                      suit-s               us 65 (eu 37) 
##                           2                           1 
##                        us-s                   us55-eu35 
##                           1                           1 
##                       white               women size 36 
##                           1                           1 
##               women size 37                       x   l 
##                           1                           1 
##                          xl                          xs 
##                          15                         297 
##                         xxl                         xxs 
##                          12                          74 
##                        xxxl                        xxxs 
##                           1                           5 
##                       xxxxl                      xxxxxl 
##                           2                           2
data$product_variation_size_id = replace(data$product_variation_size_id,grep(pattern='.*size*.|size*.|.*size',data$product_variation_size_id),"size")

data$product_variation_size_id = gsub(pattern='.+[-]',replacement='',
                                          x=data$product_variation_size_id)
data$product_variation_size_id = ifelse(grepl(pattern='xl',data$product_variation_size_id),
                                            'xl',data$product_variation_size_id)
data$product_variation_size_id = ifelse(grepl(pattern='xs',data$product_variation_size_id),
                                            'xs',data$product_variation_size_id)

data$product_variation_size_id = str_replace(data$product_variation_size_id,' ','')
data$product_variation_size_id = ifelse(data$product_variation_size_id %in% c('s','xs','m','l','xl'),
                                            data$product_variation_size_id,'NotClear')


table(data$product_variation_size_id)
## 
##        l        m NotClear        s       xl       xs 
##       46      180      114      579       46      376

As for the color column, I will factorize the size column.

rownames=c(rownames(as.matrix(table(data$product_variation_size_id))))
data$product_variation_size_id<-factor(data$product_variation_size_id,levels=rownames,labels=c(1:length(rownames)))

Boost

The column uses_ad_boosts indicates whether the seller paid to boost the product (in terms of better placement, highlighting, etc.). The idea is that this variable should be very relevant for units_sold.

table(data$uses_ad_boosts)
## 
##   0   1 
## 757 584

Due to the way the data are recorded in the dataset, I will factorize the variable.

data$uses_ad_boosts<-as.factor(data$uses_ad_boosts)
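
Before modeling, a quick look at whether boosted products actually sell more can help calibrate this expectation; a sketch:

# sketch: median units_sold for non-boosted (0) vs boosted (1) products
aggregate(units_sold ~ uses_ad_boosts, data = data, FUN = median)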

Discount

Another important characteristic of the dataset is the presence of two price columns:

  • price, which contains the price for the product;
  • retail_price, which states the reference price for similar products on the market.

I will create the variable discount to investigate whether the presence of a discount may increase units_sold.

discount=(data$price-data$retail_price)
data$discount=discount
cat(paste0("The correlation between *discount* and *units_sold* is :", cor(discount,data$units_sold)))
## The correlation between *discount* and *units_sold* is :-0.0150171748884622
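
Since an absolute price difference mixes cheap and expensive products, a relative discount may be more informative; a hedged sketch (assuming retail_price is never zero in this dataset):

# sketch: discount as a share of the retail price
discount_pct <- (data$retail_price - data$price) / data$retail_price
cor(discount_pct, data$units_sold)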

Merchant profile picture

In this section, I analyze the merchant_has_profile_picture variable. Starting again from my personal experience, I expect that the presence of a merchant profile picture increases units_sold.

table(data$merchant_has_profile_picture)
## 
##    0    1 
## 1138  203

The table shows that the variable should be considered as a factor.

cat(paste0("Additionally, the importance of the variable is given by the high correlation, which is equal to: ",cor(data$merchant_has_profile_picture,data$units_sold))) 
## Additionally, the importance of the variable is given by the high correlation, which is equal to: 0.14475246443755
data$merchant_has_profile_picture<-as.factor(data$merchant_has_profile_picture)

Merchant info subtitle

This column contains the feedback received by merchants since they became active. The idea is to check whether there is a high correlation between merchant_info_subtitle and units_sold.

head(data$merchant_info_subtitle,5)
## [1] "(568 notes)"                       "83 % avis positifs (17,752 notes)"
## [3] "86 % avis positifs (295 notes)"    "(23,832 notes)"                   
## [5] "85 % avis positifs (14,482 notes)"

Again, starting from my personal experience, I will split the data contained in the merchant_info_subtitle column into two variables:

  • feedback_percent, which will represent the percentage of positive feedback received by the merchant;
  • notes_count, which counts the feedback given to the merchant by the customers.

I expect to find a high correlation between at least one of these two variables and units_sold.

information=data$merchant_info_subtitle
information=colsplit(information,'[()]',names=c("percent","notes_count"))
percent =as.numeric(substr(information$percent, 1, 2))
percent[is.na(percent)] <- 0
notes_count=information$notes_count
notes_count=colsplit(notes_count,'[ ]',names=c("notes_count","nothank"))
notes_count=notes_count$notes_count
notes_count=as.numeric(gsub(",","",notes_count))
notes_count[is.na(notes_count)] <- 0
data$feedback_percent=percent

cat(paste0("The correlation between *feedback_percent* and *units_sold* is ",cor(data$feedback_percent,data$units_sold),"
"))
## The correlation between *feedback_percent* and *units_sold* is 0.0323968360949711
cat(paste0("The correlation between *feedback_notes_count* and *units_sold* is ",cor(notes_count,data$units_sold)))
## The correlation between *feedback_notes_count* and *units_sold* is 0.272410362641396

However, it is important to note that the second variable just created, notes_count, is basically the same as merchant_rating_count. I will verify whether there are discrepancies between the two variables.

difference=as.data.frame(notes_count-data$merchant_rating_count)
difference[difference[,1] != 0,]
## [1] -126370   -6555  -37753  -33675

As can be seen, some values in the merchant_rating_count column do not match those in the description (merchant_info_subtitle). Since merchant_info_subtitle is the merchant’s own description, while the other variable may have been created by the data supplier, I will overwrite merchant_rating_count with the notes_count just created.

data$merchant_rating_count=notes_count

Countries shipped to

The column contains information about the countries to which the merchant can ship the products. This could be an interesting variable for understanding the behavior of units_sold.

hist(data$countries_shipped_to)

cat(paste0("The correlation between the *countries_shipped_to* and *units_sold* is ",cor(data$countries_shipped_to,data$units_sold)))
## The correlation between the *countries_shipped_to* and *units_sold* is -0.0171633674449296

Models

Cross-validation

After finishing the analysis of the main relevant variables contained in the dataset, I will prepare it for the modeling section by splitting the dataset into:

  • a training set, called train;
  • a test set, called test.

Strictly speaking, a single split of the existing data into training and test sets is hold-out validation, a simple form of cross-validation: the model is trained on the training part and validated on the test part. Additionally, I will remove the “chr” variables.

is_char <- sapply(data, is.character) # identify the character columns
datanew <- data[, !is_char]           # and drop them in one step
str(datanew)
## 'data.frame':    1341 obs. of  32 variables:
##  $ price                       : num  16 8 8 8 2.72 3.92 7 12 11 5.78 ...
##  $ retail_price                : int  14 22 43 8 3 9 6 11 84 22 ...
##  $ units_sold                  : int  100 20000 100 5000 100 10 50000 1000 100 5000 ...
##  $ uses_ad_boosts              : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 2 1 ...
##  $ rating                      : num  3.76 3.45 3.57 4.03 3.1 5 3.84 3.76 3.47 3.6 ...
##  $ rating_count                : int  54 6135 14 579 20 1 6742 286 15 687 ...
##  $ rating_five_count           : num  26 2269 5 295 6 ...
##  $ rating_four_count           : num  8 1027 4 119 4 ...
##  $ rating_three_count          : num  10 1118 2 87 2 ...
##  $ rating_two_count            : num  1 644 0 42 2 0 490 18 1 68 ...
##  $ rating_one_count            : num  9 1077 3 36 6 ...
##  $ badges_count                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ badge_local_product         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ badge_product_quality       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ badge_fast_shipping         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ product_color               : Factor w/ 11 levels "1","2","3","4",..: 10 9 6 1 11 2 10 2 1 6 ...
##  $ product_variation_size_id   : Factor w/ 6 levels "1","2","3","4",..: 2 6 6 2 4 3 6 2 2 4 ...
##  $ product_variation_inventory : int  50 50 1 50 1 1 50 50 50 50 ...
##  $ shipping_option_price       : int  4 2 3 2 1 1 2 3 2 2 ...
##  $ shipping_is_express         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ countries_shipped_to        : int  34 41 36 41 35 40 31 139 36 33 ...
##  $ inventory_total             : int  50 50 50 50 50 50 50 50 50 50 ...
##  $ has_urgency_banner          : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 2 1 ...
##  $ urgency_text                : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 2 1 ...
##  $ merchant_rating_count       : num  568 17752 295 23832 14482 ...
##  $ merchant_rating             : num  4.1 3.9 4 4 4 3.5 4.1 3.7 3.8 4 ...
##  $ merchant_has_profile_picture: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ more_than_one_article       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_tags                    : int  2 8 5 7 5 5 16 6 6 7 ...
##  $ tag_most_used               : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ discount                    : num  2 -14 -35 0 -0.28 ...
##  $ feedback_percent            : num  0 83 86 0 85 75 86 0 82 85 ...

From the original 43 variables, I have reduced the dataset to 32 variables.

set.seed(2)
smp_size <- floor(0.9 * nrow(datanew))
train_ind <- sample(seq_len(nrow(datanew)), size = smp_size)
train <- datanew[train_ind, ]
test <- datanew[-train_ind, ]
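
Repeated splits would give a fuller picture than the single hold-out above. Since modelr is already loaded, a k-fold setup could look like the following sketch (not used in the rest of this document):

# sketch: 5-fold cross-validation resamples with modelr
set.seed(2)
folds <- crossv_kfold(datanew, k = 5)
# each row of `folds` pairs a train and a test resample; a model can be fit per fold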

Modelling

I will start the modeling section with a simple multiple linear regression without intercept.

Initially, I will consider all the variables in the train dataset. Step by step, I will remove the less significant variables.

fit_test<-lm(units_sold~ .-1,data=train)
summary(fit_test)
## 
## Call:
## lm(formula = units_sold ~ . - 1, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23522  -1235   -422    682  49110 
## 
## Coefficients: (3 not defined because of singularities)
##                                   Estimate   Std. Error t value  Pr(>|t|)    
## price                            10.317976    60.090334   0.172  0.863697    
## retail_price                     -6.702475     3.891313  -1.722  0.085259 .  
## uses_ad_boosts0                4392.685830  3559.227456   1.234  0.217389    
## uses_ad_boosts1                4949.593203  3553.016330   1.393  0.163865    
## rating                          -97.744302   245.915464  -0.397  0.691093    
## rating_count                      3.787146     3.484963   1.087  0.277390    
## rating_five_count                 2.379432     3.704238   0.642  0.520770    
## rating_four_count               -10.823838     5.140670  -2.106  0.035459 *  
## rating_three_count                8.390024     6.024761   1.393  0.164010    
## rating_two_count                 -1.902789    13.888307  -0.137  0.891049    
## rating_one_count                        NA           NA      NA        NA    
## badges_count                  -3832.155688  1062.235693  -3.608  0.000322 ***
## badge_local_product            4784.377399  1523.376081   3.141  0.001728 ** 
## badge_product_quality          2869.586711  1155.162373   2.484  0.013126 *  
## badge_fast_shipping                     NA           NA      NA        NA    
## product_color2                  534.212116   439.558484   1.215  0.224484    
## product_color3                  179.261701   628.637370   0.285  0.775573    
## product_color4                 1269.066574   711.863784   1.783  0.074890 .  
## product_color5                 3164.434194   869.838433   3.638  0.000287 ***
## product_color6                  826.040838   498.032259   1.659  0.097464 .  
## product_color7                  -21.105957   496.706241  -0.042  0.966114    
## product_color8                 1603.004201   609.392074   2.630  0.008639 ** 
## product_color9                  351.265201   345.375373   1.017  0.309339    
## product_color10                 509.999486   372.439669   1.369  0.171155    
## product_color11                -166.828663   524.817735  -0.318  0.750634    
## product_variation_size_id2     -534.854090   670.559471  -0.798  0.425252    
## product_variation_size_id3    -1642.943597   724.844886  -2.267  0.023597 *  
## product_variation_size_id4    -1130.906600   621.503874  -1.820  0.069072 .  
## product_variation_size_id5    -2369.481002   858.977678  -2.758  0.005898 ** 
## product_variation_size_id6    -1523.728148   653.454817  -2.332  0.019881 *  
## product_variation_inventory       9.970450     6.186839   1.612  0.107329    
## shipping_option_price          -335.508012   235.439138  -1.425  0.154416    
## shipping_is_express            3031.169222  2701.560047   1.122  0.262091    
## countries_shipped_to             -9.667917     5.734196  -1.686  0.092062 .  
## inventory_total                 -69.603151    47.264118  -1.473  0.141118    
## has_urgency_banner1             282.030930  3895.371908   0.072  0.942295    
## urgency_text1                  -670.621994  3899.247482  -0.172  0.863477    
## merchant_rating_count             0.005746     0.001384   4.152 0.0000354 ***
## merchant_rating                 298.248376   582.242931   0.512  0.608580    
## merchant_has_profile_picture1   390.914361   315.992603   1.237  0.216300    
## more_than_one_article1          118.290712   248.750668   0.476  0.634492    
## num_tags                         43.998904    32.530126   1.353  0.176459    
## tag_most_used1                  343.287542   324.974553   1.056  0.291027    
## discount                                NA           NA      NA        NA    
## feedback_percent                 -1.342298     3.254909  -0.412  0.680128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3770 on 1164 degrees of freedom
## Multiple R-squared:  0.8563, Adjusted R-squared:  0.8511 
## F-statistic: 165.1 on 42 and 1164 DF,  p-value: < 0.00000000000000022

Even though the Adjusted R-squared is high, a considerable number of the variables are not statistically significant and have high standard errors. High standard errors reflect a high degree of uncertainty in estimating reliable predictors, so the inclusion of these variables should be tested. In addition, 3 of the variables could not be estimated because of the singularities noted above, i.e. they are highly correlated with other predictors. The three variables are:

  • discount;
  • badge_fast_shipping;
  • rating_one_count.

As a matter of fact,

cat(paste0("The correlation between *retail_price* and *discount* is ",cor(data$retail_price,data$discount)))
## The correlation between *retail_price* and *discount* is -0.991789865212611
cat(paste0("The correlation between *badge_fast_shipping* and *badges_count* is ",cor(data$badge_fast_shipping,data$badges_count)))
## The correlation between *badge_fast_shipping* and *badges_count* is 0.499644870224989
cat(paste0("The correlation between *rating_one_count* and *rating_count* is ",cor(data$rating_one_count,data$rating_count)))
## The correlation between *rating_one_count* and *rating_count* is 0.909674752556129
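
Collinearity can also be quantified with variance inflation factors through car::vif() (car is already loaded). The function requires a model without aliased coefficients, so the three problematic variables are dropped first; a hedged sketch:

# sketch: VIFs (generalized VIFs for factors) on the model without the aliased terms
fit_vif <- lm(units_sold ~ . - discount - badge_fast_shipping - rating_one_count,
              data = train)
vif(fit_vif)
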
fit_test2<-lm(units_sold~ retail_price+rating_four_count+badges_count+badge_local_product+badge_product_quality+product_color+product_variation_size_id+countries_shipped_to+merchant_rating_count-1,data=train)
summary(fit_test2)
## 
## Call:
## lm(formula = units_sold ~ retail_price + rating_four_count + 
##     badges_count + badge_local_product + badge_product_quality + 
##     product_color + product_variation_size_id + countries_shipped_to + 
##     merchant_rating_count - 1, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19194  -1237   -501    617  48032 
## 
## Coefficients:
##                                Estimate   Std. Error t value
## retail_price                 -10.193111     3.795324  -2.686
## rating_four_count             19.593661     0.313062  62.587
## badges_count               -3157.257416  1006.540567  -3.137
## badge_local_product         4228.045758  1520.691114   2.780
## badge_product_quality       2594.887916  1087.808600   2.385
## product_color1              2018.155366   721.244691   2.798
## product_color2              2554.891106   771.062345   3.313
## product_color3              2232.733888   883.554517   2.527
## product_color4              3718.703379   955.305687   3.893
## product_color5              5538.365942  1095.161818   5.057
## product_color6              3086.578914   793.052618   3.892
## product_color7              2279.249052   790.040982   2.885
## product_color8              3977.276405   863.381642   4.607
## product_color9              2641.189424   698.826386   3.779
## product_color10             2567.533223   725.428884   3.539
## product_color11             2049.120046   786.193903   2.606
## product_variation_size_id2  -412.634000   689.239717  -0.599
## product_variation_size_id3 -1467.773236   732.710208  -2.003
## product_variation_size_id4 -1066.947455   636.941901  -1.675
## product_variation_size_id5 -2180.012508   871.222607  -2.502
## product_variation_size_id6 -1486.403470   649.484242  -2.289
## countries_shipped_to         -10.037090     5.745275  -1.747
## merchant_rating_count          0.007465     0.001380   5.409
##                                        Pr(>|t|)    
## retail_price                           0.007339 ** 
## rating_four_count          < 0.0000000000000002 ***
## badges_count                           0.001751 ** 
## badge_local_product                    0.005516 ** 
## badge_product_quality                  0.017216 *  
## product_color1                         0.005223 ** 
## product_color2                         0.000949 ***
## product_color3                         0.011634 *  
## product_color4                         0.000105 ***
## product_color5                     0.0000004929 ***
## product_color6                         0.000105 ***
## product_color7                         0.003985 ** 
## product_color8                     0.0000045356 ***
## product_color9                         0.000165 ***
## product_color10                        0.000417 ***
## product_color11                        0.009265 ** 
## product_variation_size_id2             0.549501    
## product_variation_size_id3             0.045382 *  
## product_variation_size_id4             0.094177 .  
## product_variation_size_id5             0.012475 *  
## product_variation_size_id6             0.022279 *  
## countries_shipped_to                   0.080894 .  
## merchant_rating_count              0.0000000768 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3904 on 1183 degrees of freedom
## Multiple R-squared:  0.8434, Adjusted R-squared:  0.8404 
## F-statistic:   277 on 23 and 1183 DF,  p-value: < 0.00000000000000022
plot(fit_test2$residuals)

pred2<-round(predict(fit_test2,test),digits=0)
percentbias_outofsample2<-percent_bias(test$units_sold,pred2)
percentbias_insample2<-percent_bias(train$units_sold,fit_test2$fitted.values)

cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample2 ))
## The average amount that actual values are greater than predicted as a percentage is -10.615610428764
mape_2out<-Metrics::mape(test$units_sold,pred2)
mape_2in<-Metrics::mape(train$units_sold,fit_test2$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_2in ))
## The Mean Absolute Percent Error is 12.2223769337058
mae_2out<-Metrics::mae(test$units_sold,pred2)
mae_2in<-Metrics::mae(train$units_sold,fit_test2$fitted.values)

cat("
")
cat(paste0("The Mean Absolute Error is ",mae_2in ))
## The Mean Absolute Error is 2066.79499499615

The results obtained with the fit_test2 model suggest keeping only the variables significant at the 10% level. Indeed, it is typical in analysis to initially run the model with all the independent variables and, step by step, remove the non-significant ones.

It can be seen that the MAPE and the Adjusted R^2 decrease, while the % bias and the MAE slightly increase.

It appears clear that it is necessary to further reduce the variables selected to explain units_sold. Once again, I will keep only the significant ones.
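
Instead of removing variables by hand, backward selection can also be automated with the built-in step() function, which drops terms by AIC; a sketch (the manual route is kept below for transparency):

# sketch: automated backward selection by AIC
fit_step <- step(fit_test2, direction = "backward", trace = FALSE)
# summary(fit_step) shows which terms the AIC criterion keeps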

fit_test3<-lm(units_sold~ retail_price+rating_four_count+badges_count+badge_local_product+product_color+merchant_rating_count-1,data=train)
summary(fit_test3)
## 
## Call:
## lm(formula = units_sold ~ retail_price + rating_four_count + 
##     badges_count + badge_local_product + product_color + merchant_rating_count - 
##     1, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20471  -1143   -524    472  47999 
## 
## Coefficients:
##                           Estimate   Std. Error t value             Pr(>|t|)
## retail_price             -9.100322     3.772105  -2.413             0.015993
## rating_four_count        19.716145     0.311295  63.336 < 0.0000000000000002
## badges_count          -1006.067363   413.371335  -2.434             0.015087
## badge_local_product    1386.420058   967.475378   1.433             0.152112
## product_color1          441.005194   285.531663   1.545             0.122732
## product_color2          861.896204   380.898453   2.263             0.023828
## product_color3          512.625830   605.159104   0.847             0.397114
## product_color4         2041.218648   669.676093   3.048             0.002354
## product_color5         3794.660537   867.396248   4.375          0.000013221
## product_color6         1528.169675   453.721860   3.368             0.000781
## product_color7          763.303063   439.725315   1.736             0.082847
## product_color8         2486.075327   578.182673   4.300          0.000018493
## product_color9         1087.381480   257.440143   4.224          0.000025851
## product_color10        1039.123374   306.418652   3.391             0.000719
## product_color11         570.112216   473.682808   1.204             0.228994
## merchant_rating_count     0.007030     0.001363   5.159          0.000000291
##                          
## retail_price          *  
## rating_four_count     ***
## badges_count          *  
## badge_local_product      
## product_color1           
## product_color2        *  
## product_color3           
## product_color4        ** 
## product_color5        ***
## product_color6        ***
## product_color7        .  
## product_color8        ***
## product_color9        ***
## product_color10       ***
## product_color11          
## merchant_rating_count ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3932 on 1190 degrees of freedom
## Multiple R-squared:  0.8402, Adjusted R-squared:  0.838 
## F-statistic: 390.9 on 16 and 1190 DF,  p-value: < 0.00000000000000022
plot(fit_test3$residuals)

pred3<-round(predict(fit_test3,test),digits=0)
percentbias_outofsample3<-percent_bias(test$units_sold,pred3)
percentbias_insample3<-percent_bias(train$units_sold,fit_test3$fitted.values)
cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample3 ))
## The average amount that actual values are greater than predicted as a percentage is -12.3615728566824
mape_3out<-Metrics::mape(test$units_sold,pred3)
mape_3in<-Metrics::mape(train$units_sold,fit_test3$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_3in ))
## The Mean Absolute Percent Error is 12.8540868359851
mae_3out<-Metrics::mae(test$units_sold,pred3)
mae_3in<-Metrics::mae(train$units_sold,fit_test3$fitted.values)

cat("
")
cat(paste0("The Mean Absolute Error is ",mae_3in ))
## The Mean Absolute Error is 2053.04940801906

The results obtained by the fit_test3 model show the same pattern as fit_test2. The main difference is the MAE, which decreases, suggesting that the model is improving its fit to units_sold.

fit_test4<-lm(units_sold~ retail_price+rating_four_count+badges_count+badge_local_product++merchant_rating_count-1,data=train)
summary(fit_test4)
## 
## Call:
## lm(formula = units_sold ~ retail_price + rating_four_count + 
##     badges_count + badge_local_product + merchant_rating_count - 
##     1, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22477   -390    -33    748  47968 
## 
## Coefficients:
##                          Estimate  Std. Error t value             Pr(>|t|)    
## retail_price             4.497042    3.201337   1.405                0.160    
## rating_four_count       20.295474    0.305021  66.538 < 0.0000000000000002 ***
## badges_count          -642.409399  416.432212  -1.543                0.123    
## badge_local_product   1382.975889  989.567419   1.398                0.163    
## merchant_rating_count    0.008171    0.001383   5.908        0.00000000451 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4027 on 1201 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8302 
## F-statistic:  1180 on 5 and 1201 DF,  p-value: < 0.00000000000000022
plot(fit_test4$residuals)

test<-as.data.frame(test)
pred4<-round(predict(fit_test4,test),digits=0)
percentbias_outofsample4<-Metrics::percent_bias(test$units_sold,pred4)
percentbias_insample4<-Metrics::percent_bias(train$units_sold,fit_test4$fitted.values)

cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample4 ))
## The average amount that actual values are greater than predicted as a percentage is -2.04538656740612
mape_4out<-Metrics::mape(test$units_sold,pred4)
mape_4in<-Metrics::mape(train$units_sold,fit_test4$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_4in ))
## The Mean Absolute Percent Error is 2.545888353867
mae_4out<-Metrics::mae(test$units_sold,pred4)
mae_4in<-Metrics::mae(train$units_sold,fit_test4$fitted.values)

cat("
")
cat(paste0("The Mean Absolute Error is ",mae_4in ))
## The Mean Absolute Error is 1874.45645825947

The results obtained by the fit_test4 model show that the reduction in independent variables produces the expected outcome: the % bias, MAPE and MAE have all decreased remarkably. These results suggest that the model fits the data well. Additionally, the joint F-test also indicates that the reduction of regressors should be preferred.
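Such a comparison can be made explicit with a nested-model F-test; in the sketch below, fit_test4 is fit_test3 without the product_color dummies, so anova() tests their joint significance.

# Nested-model F-test: fit_test4 is fit_test3 without the product_color
# dummies, so this tests their joint significance.
anova(fit_test4,fit_test3,test="F")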

For the next model, I will add the interaction between the merchant_rating_count and the rating_four_count variables.

fit_test5<-lm(units_sold~ countries_shipped_to+rating_four_count+merchant_rating_count+(merchant_rating_count*rating_four_count)-1,data=train)
summary(fit_test5)
## 
## Call:
## lm(formula = units_sold ~ countries_shipped_to + rating_four_count + 
##     merchant_rating_count + (merchant_rating_count * rating_four_count) - 
##     1, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27462   -882   -454    581  39895 
## 
## Coefficients:
##                                             Estimate   Std. Error t value
## countries_shipped_to                    20.254565467  2.827640670   7.163
## rating_four_count                       17.871979217  0.355684591  50.247
## merchant_rating_count                   -0.006736144  0.001882804  -3.578
## rating_four_count:merchant_rating_count  0.000023018  0.000002217  10.383
##                                                     Pr(>|t|)    
## countries_shipped_to                        0.00000000000137 ***
## rating_four_count                       < 0.0000000000000002 ***
## merchant_rating_count                                0.00036 ***
## rating_four_count:merchant_rating_count < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3837 on 1202 degrees of freedom
## Multiple R-squared:  0.8463, Adjusted R-squared:  0.8458 
## F-statistic:  1654 on 4 and 1202 DF,  p-value: < 0.00000000000000022
plot(fit_test5$residuals)

pred5<-round(predict(fit_test5,test),digits=0)
percentbias_outofsample5<-Metrics::percent_bias(test$units_sold,pred5)
percentbias_insample5<-Metrics::percent_bias(train$units_sold,fit_test5$fitted.values)

cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample5 ))
## The average amount that actual values are greater than predicted as a percentage is -7.86250303931833
mape_5out<-Metrics::mape(test$units_sold,pred5)
mape_5in<-Metrics::mape(train$units_sold,fit_test5$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_5in ))
## The Mean Absolute Percent Error is 8.21028651827872
mae_5out<-Metrics::mae(test$units_sold,pred5)
mae_5in<-Metrics::mae(train$units_sold,fit_test5$fitted.values)

cat("
")
cat(paste0("The Mean Absolute Error is ",mae_5in ))
## The Mean Absolute Error is 2008.00090814781

The results obtained by the fit_test5 model show an increase in the error measurements. On the other hand, the Adjusted R^2 increases, driven by the reduction of the residual standard error. When considering the joint F-test, however, we reject the null, suggesting that this last reduction of regressors may not be optimal.

In the following table, one can see:

  • Adjusted R squared;
  • AIC;
  • BIC;
  • MAE;
  • MAPE;
  • percent bias

for all the different linear models proposed in this first part of the analysis.

##      Rsquared_adj      AIC      BIC      MAE      MAPE percent_bias
## fit1    0.8510985 23328.32 23547.41 1934.751 10.453758    -5.568873
## fit2    0.8403548 23393.87 23516.15 2066.795 12.222377   -10.615610
## fit3    0.8380096 23404.57 23491.18 2053.049 12.854087   -12.361573
## fit4    0.8301669 23450.68 23481.25 1874.456  2.545888    -2.045387
## fit5    0.8457753 23333.42 23358.90 2008.001  8.210287    -7.862503

As the table and the results discussed so far show, it is difficult to say whether multiple linear regression can properly estimate units_sold. It is important to remember that the core assumption of these models is a linear relationship between the independent and dependent variables, which may not hold here. It is therefore worth trying a different kind of model.

Log-level multiple linear model

In the models above, I have focused on linear relationships between the dependent and independent variables. However, linear relationships are not nearly general enough for all economic applications. In the following models, I will consider how units_sold changes in percentage terms, by applying a log-transformation.

par(mfrow=c(2,1))
plotNormalHistogram(data$units_sold)

plotNormalHistogram(log1p(data$units_sold))

A logarithmic transformation is a convenient means of turning a highly skewed variable into an approximately normal one. The reason why I have used the log1p() function, which calculates log(y+1), is that y is a count for which zero is a possible value, and the transformation avoids taking the logarithm of zero.
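A minimal illustration of why log1p() is the safe choice for count data with zeros:

log(0)     # -Inf: log() breaks down at zero counts
log1p(0)   # 0: log1p() handles them gracefully
log1p(99)  # 4.60517 = log(100)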

cat(paste0("From a skewness of ",skewness(data$units_sold)[1], " to a skewness of ", skewness(log1p(data$units_sold))[1]))
## From a skewness of 5.33105738109372 to a skewness of -0.115870164019399

In the next section, I will fit the same models proposed before, but with log(y+1) as the dependent variable.

fit_log<-lm(log1p(units_sold)~ .-1,data=train)
fit_log2<-lm(log1p(units_sold)~ retail_price+rating_four_count+badges_count+badge_local_product+badge_product_quality+product_color+product_variation_size_id+countries_shipped_to+merchant_rating_count-1,data=train)
fit_log3<-lm(log1p(units_sold)~ retail_price+rating_four_count+badges_count+badge_local_product+product_color+merchant_rating_count-1,data=train)
fit_log4<-lm(log1p(units_sold)~ retail_price+rating_four_count+badges_count+badge_local_product++merchant_rating_count-1,data=train)
fit_log5<-lm(log1p(units_sold)~ countries_shipped_to+rating_four_count+merchant_rating_count+(merchant_rating_count*rating_four_count)-1,data=train)



data_fitlog=data.frame(Rsquared=summary(fit_log)$adj.r.squared,
                       aic=AIC(fit_log),
                       bic=BIC(fit_log),
                       MAE=Metrics::mae(train$units_sold,fit_log$fitted.values),
                       MAPE=Metrics::mape(train$units_sold,fit_log$fitted.values),
                       percent_bias=Metrics::percent_bias(train$units_sold,fit_log$fitted.values))

data_fitlog2=data.frame(Rsquared=summary(fit_log2)$adj.r.squared,
                        aic=AIC(fit_log2),
                        bic=BIC(fit_log2),
                        MAE=Metrics::mae(train$units_sold,fit_log2$fitted.values),
                        MAPE=Metrics::mape(train$units_sold,fit_log2$fitted.values),
                        percent_bias=Metrics::percent_bias(train$units_sold,fit_log2$fitted.values))

data_fitlog3=data.frame(Rsquared=summary(fit_log3)$adj.r.squared,
                        aic=AIC(fit_log3),
                        bic=BIC(fit_log3),
                        MAE=Metrics::mae(train$units_sold,fit_log3$fitted.values),
                        MAPE=Metrics::mape(train$units_sold,fit_log3$fitted.values),
                        percent_bias=Metrics::percent_bias(train$units_sold,fit_log3$fitted.values))

data_fitlog4=data.frame(Rsquared=summary(fit_log4)$adj.r.squared,
                        aic=AIC(fit_log4),
                        bic=BIC(fit_log4),
                        MAE=Metrics::mae(train$units_sold,fit_log4$fitted.values),
                        MAPE=Metrics::mape(train$units_sold,fit_log4$fitted.values),
                        percent_bias=Metrics::percent_bias(train$units_sold,fit_log4$fitted.values))

data_fitlog5=data.frame(Rsquared=summary(fit_log5)$adj.r.squared,
                        aic=AIC(fit_log5),
                        bic=BIC(fit_log5),
                        MAE=Metrics::mae(train$units_sold,fit_log5$fitted.values),
                        MAPE=Metrics::mape(train$units_sold,fit_log5$fitted.values),
                        percent_bias=Metrics::percent_bias(train$units_sold,fit_log5$fitted.values))

fit_log_table=rbind(data_fitlog,data_fitlog2,data_fitlog3,data_fitlog4,data_fitlog5)
fit_log_table=as.matrix(fit_log_table)
row.names(fit_log_table)=c("fit_log","fit_log2","fit_log3","fit_log4","fit_log5")
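As an aside, the five repetitive blocks above could be condensed with a helper function; the sketch below reproduces the same table.

# A sketch: build the same metrics table with a helper function.
fit_metrics<-function(fit,actual){
  data.frame(Rsquared=summary(fit)$adj.r.squared,
             aic=AIC(fit),
             bic=BIC(fit),
             MAE=Metrics::mae(actual,fit$fitted.values),
             MAPE=Metrics::mape(actual,fit$fitted.values),
             percent_bias=Metrics::percent_bias(actual,fit$fitted.values))
}
models<-list(fit_log=fit_log,fit_log2=fit_log2,fit_log3=fit_log3,
             fit_log4=fit_log4,fit_log5=fit_log5)
fit_log_table<-do.call(rbind,lapply(models,fit_metrics,actual=train$units_sold))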

par(mfrow=c(2,2))
plot(fit_log2$residuals)
plot(fit_log3$residuals)
plot(fit_log4$residuals)
plot(fit_log5$residuals)

print(fit_log_table)
##           Rsquared      aic      bic      MAE      MAPE percent_bias
## fit_log  0.9528698 4497.840 4716.928 4484.332 0.9660581    0.9405151
## fit_log2 0.9443693 4679.346 4801.628 4484.340 0.9660388    0.9342031
## fit_log3 0.9402920 4757.764 4844.380 4484.344 0.9663898    0.9318555
## fit_log4 0.5412558 7205.927 7236.497 4487.512 0.9804285    0.9774999
## fit_log5 0.8064755 6164.046 6189.521 4485.463 0.9683758    0.9513616

The table above shows that, among the different models developed in this first phase, it is better to focus on fit_log. In addition, there is a huge difference between the results obtained from these log-models and the other ones.
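One caveat worth flagging: the fitted values of these log-models are on the log1p scale, so the MAE, MAPE and percent_bias columns above compare quantities on different scales. A minimal sketch of back-transforming with expm1() (the inverse of log1p()) before computing the error metrics:

# Back-transform log-scale fitted values to units before computing errors.
fitted_units<-expm1(fit_log$fitted.values)
Metrics::mae(train$units_sold,fitted_units)
Metrics::mape(train$units_sold,fitted_units)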

par(mfrow=c(1,2))
scatter.smooth( rstandard(fit_test) ~ fitted(fit_test), main="No transform",
                ylab="Standardized residuals", xlab="Fitted values", las=1)

scatter.smooth( rstandard(fit_log) ~ fitted(fit_log), main="log1p(Units_sold)",
                ylab="Standardized residuals", xlab="Fitted values", las=1)

As can be seen from the graph, the log1p transformation produces a relationship that is considerably more linear than the untransformed one. However, the Cook’s distance plot and the Normal Q-Q plot below show what keeps the relationship from being properly linear.

par(mfrow=c(1,2))
plot( cooks.distance( fit_log ), type="h", las=1, ylab="Cook's distance D")
qqnorm( rstandard( fit_log), las=1 ); qqline( rstandard( fit_log ), las=1 )

Indeed, several rows in the train dataset have a strong influence on the regression, as the Cook’s distance plot above suggests.
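The 0.02 cutoff used below is chosen by eye; a common rule of thumb (one of several) flags observations with Cook's distance above 4/n instead, as in this sketch:

# Flag influential points with the 4/n rule of thumb.
which(cooks.distance(fit_log)>4/nrow(train))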

train[which(cooks.distance(fit_log) >0.02),]
##      price retail_price units_sold uses_ad_boosts rating rating_count
## 85    7.00            6      50000              1   4.41        17444
## 662   8.00            7        100              0   2.79           14
## 500   5.66            5      10000              1   4.29         2068
## 1467  8.00           84      50000              0   3.60        18463
## 348  11.00           10      20000              0   4.32         3648
## 759  49.00           42        100              0   4.67            6
## 1062  6.00            6      10000              0   4.29         2808
## 93    5.77           48     100000              0   4.10        20744
## 609  11.00           37      20000              0   3.42         7596
##      rating_five_count rating_four_count rating_three_count rating_two_count
## 85               11548              3191               1632              507
## 662                  5                 1                  0                2
## 500               1245               418                247               68
## 1467              7530              3351               3057             1736
## 348               2418               548                327              126
## 759                  4                 2                  0                0
## 1062              1789               459                300              110
## 93               11184              4152               2919             1174
## 609               2679              1403               1317              793
##      rating_one_count badges_count badge_local_product badge_product_quality
## 85                566            1                   0                     1
## 662                 6            2                   1                     0
## 500                90            1                   0                     1
## 1467             2789            1                   0                     0
## 348               229            1                   0                     1
## 759                 0            1                   0                     0
## 1062              150            3                   1                     1
## 93               1315            0                   0                     0
## 609              1404            0                   0                     0
##      badge_fast_shipping product_color product_variation_size_id
## 85                     0             9                         2
## 662                    1             6                         4
## 500                    0             9                         5
## 1467                   1             4                         4
## 348                    0             1                         5
## 759                    1             9                         5
## 1062                   1             4                         4
## 93                     0             1                         6
## 609                    0             1                         4
##      product_variation_inventory shipping_option_price shipping_is_express
## 85                             2                     2                   0
## 662                           50                     7                   1
## 500                           50                     1                   0
## 1467                          50                     3                   0
## 348                            1                     2                   0
## 759                            2                    12                   1
## 1062                          50                     7                   1
## 93                            50                     2                   0
## 609                           50                     3                   0
##      countries_shipped_to inventory_total has_urgency_banner urgency_text
## 85                     35              50                  0            0
## 662                    41              50                  0            0
## 500                   138              50                  0            0
## 1467                   36              50                  1            1
## 348                    39               1                  0            0
## 759                    41               2                  0            0
## 1062                   39              50                  1            1
## 93                     46              50                  0            0
## 609                    39              50                  1            1
##      merchant_rating_count merchant_rating merchant_has_profile_picture
## 85                  126370             4.1                            1
## 662                   7012             4.3                            0
## 500                2174765             4.4                            0
## 1467                 51369             4.0                            1
## 348                   4448             4.3                            0
## 759                  29977             4.2                            0
## 1062                     0             4.3                            0
## 93                  330405             4.1                            1
## 609                  58068             4.3                            1
##      more_than_one_article num_tags tag_most_used discount feedback_percent
## 85                       1        4             1     1.00               89
## 662                      0        7             1     1.00                0
## 500                      0        4             1     0.66                0
## 1467                     1       10             1   -76.00                0
## 348                      0        7             0     1.00               91
## 759                      0        5             1     7.00               90
## 1062                     0        6             1     0.00               92
## 93                       0        7             1   -42.23               86
## 609                      0       12             1   -26.00               91

Considering the table above, the causes of the outliers cannot be identified. In this case, the analyst faces a dilemma: simply discarding an observation is often unwise, since it may be a real, genuine observation for which an alternative model would be more appropriate.
I will evaluate the influence of the outliers by fitting the model to the data without them and checking whether the results are similar.

outliers<-as.matrix(train[which(cooks.distance(fit_log) >0.02),])
outliers<-row.names(outliers)

datawo_out<-train[!(row.names(train) %in% outliers), ]

fit_log_noout<-lm(log1p(units_sold)~ .-1,data=datawo_out)


par(mfrow=c(2,2))
plot( cooks.distance( fit_log ), type="h", las=1, ylab="Cook's distance D",main="Log model (with outliers)")
qqnorm( rstandard( fit_log), las=1 ); qqline( rstandard( fit_log ), las=1 )

plot( cooks.distance( fit_log_noout ), type="h", las=1, ylab="Cook's distance D",main="Log model (outliers removed)")
qqnorm( rstandard( fit_log_noout), las=1 ); qqline( rstandard( fit_log_noout ), las=1 )

From the graphs above, it can be seen that the main (and only) improvement is the reduction of Cook’s distance D. The table below shows that the results obtained without the outliers are not materially better than the ones already discussed, so removing them should not be pursued.

data_fitlog_no_out=data.frame(Rsquared=summary(fit_log_noout)$adj.r.squared,
aic=AIC(fit_log_noout),
bic=BIC(fit_log_noout),
MAE=Metrics::mae(datawo_out$units_sold,fit_log_noout$fitted.values),
MAPE=Metrics::mape(datawo_out$units_sold,fit_log_noout$fitted.values),
percent_bias=Metrics::percent_bias(datawo_out$units_sold,fit_log_noout$fitted.values)
)

check_outliers=rbind(data_fitlog,data_fitlog_no_out)
check_outliers=as.matrix(check_outliers)
row.names(check_outliers)=c("fit_log","fit_log-no outliers")
check_outliers
##                      Rsquared      aic      bic      MAE      MAPE percent_bias
## fit_log             0.9528698 4497.840 4716.928 4484.332 0.9660581    0.9405151
## fit_log-no outliers 0.9561045 4371.976 4585.654 4300.738 0.9661038    0.9410317

Generalized Linear Models

The models discussed so far are based on the assumption of a linear relationship between the dependent and independent variables, and they assume constant variance, which is demonstrably not true for all data. Generalized linear models (GLMs) assume the responses come from a distribution that belongs to a more general family of distributions, and also permit more general systematic components. The following Cullen and Frey graph helps identify which theoretical distribution better fits the dataset.

descdist(data$units_sold,boot=500,discrete = FALSE)

## summary statistics
## ------
## min:  1   max:  100000 
## median:  1000 
## mean:  4820.663 
## estimated sd:  9947.732 
## estimated skewness:  5.343004 
## estimated kurtosis:  43.54482

It can be seen that the Gamma distribution should be used in the GLM models. Indeed, the article Sales History-based Demand Prediction using Generalized Linear Models (Özenboy and Tekir, article link) found sale amounts to be gamma distributed, and the gamma distribution is commonly used to model quantities that take only positive values. However, in my analysis I decided to use a log link function: the canonical link for the Gamma family imposes the restrictions μi > 0 and ηi < 0, which translate into awkward restrictions on β, while the log link guarantees μi = exp(ηi) > 0 for any β.
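Before fitting the GLMs, the Gamma assumption can also be checked directly; the sketch below fits a Gamma distribution to units_sold with fitdistrplus (method of moments, which is convenient since units_sold is strictly positive but heavily skewed).

# A sketch: fit a Gamma distribution to units_sold and inspect the fit.
library(fitdistrplus)
fit_gamma<-fitdist(data$units_sold,"gamma",method="mme")
summary(fit_gamma)
plot(fit_gamma)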

fit_glm<-glm(units_sold~ . ,family=Gamma(link="log"), data=train)
phi.meandev1 <- deviance(fit_glm) / df.residual(fit_glm)
phi.pearson1 <- summary(fit_glm)$dispersion

data_fitglm=data.frame(Mean.deviance=phi.meandev1, Pearson=phi.pearson1,AIC=AIC(fit_glm),BIC=BIC(fit_glm))



fit_glm2<-glm(units_sold~retail_price+rating_four_count+badges_count+badge_local_product+badge_product_quality+product_color+product_variation_size_id+countries_shipped_to+merchant_rating_count-1,family=Gamma(link="log"), data=train)
 phi.meandev2 <- deviance(fit_glm2) / df.residual(fit_glm2)
 phi.pearson2 <- summary(fit_glm2)$dispersion
data_fitglm2=data.frame(Mean.deviance=phi.meandev2, Pearson=phi.pearson2,AIC=AIC(fit_glm2),BIC=BIC(fit_glm2))

fit_glm3<-glm(units_sold~retail_price+rating_four_count+badges_count+badge_local_product+product_color+merchant_rating_count-1,family=Gamma(link="log"),data=train)
 phi.meandev3 <- deviance(fit_glm3) / df.residual(fit_glm3)
 phi.pearson3 <- summary(fit_glm3)$dispersion

 data_fitglm3=data.frame(Mean.deviance=phi.meandev3, Pearson=phi.pearson3,AIC=AIC(fit_glm3),BIC=BIC(fit_glm3))

fit_glm4<-glm(units_sold~retail_price+rating_four_count+badges_count+badge_local_product+merchant_rating_count,family=Gamma(link="log"),data=train,mustart=units_sold)
phi.meandev4 <- deviance(fit_glm4) / df.residual(fit_glm4)
 phi.pearson4 <- summary(fit_glm4)$dispersion
  data_fitglm4=data.frame(Mean.deviance=phi.meandev4, Pearson=phi.pearson4,AIC=AIC(fit_glm4),BIC=BIC(fit_glm4))

fit_glm5<-glm(units_sold~countries_shipped_to+rating_four_count+merchant_rating_count+(merchant_rating_count*rating_four_count),family=Gamma(link="log"),data=train)
 phi.meandev5 <- deviance(fit_glm5) / df.residual(fit_glm5)
 phi.pearson5 <- summary(fit_glm5)$dispersion

  data_fitglm5=data.frame(Mean.deviance=phi.meandev5, Pearson=phi.pearson5,AIC=AIC(fit_glm5),BIC=BIC(fit_glm5))
  
  
datafit_glm=rbind(data_fitglm,data_fitglm2,data_fitglm3,data_fitglm4,data_fitglm5)
datafit_glm=as.matrix(datafit_glm)
row.names(datafit_glm)=c("fit_glm","fit_glm2","fit_glm3","fit_glm4","fit_glm5")
datafit_glm
##          Mean.deviance  Pearson      AIC      BIC
## fit_glm       1.759623 1.518472 20679.03 20898.12
## fit_glm2      1.985726 1.563185 20848.02 20970.30
## fit_glm3      2.043881 1.484676 20887.20 20973.82
## fit_glm4      2.048938 1.519099 20883.85 20919.51
## fit_glm5      1.981689 1.581366 20831.97 20862.54
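As an aside, the Gamma GLMs can also be scored out of sample, as was done for the linear models; a minimal sketch for the full model, predicting on the response scale:

# Out-of-sample predictions from the full Gamma GLM, on the response scale.
pred_glm<-predict(fit_glm,newdata=test,type="response")
Metrics::mae(test$units_sold,pred_glm)
Metrics::mape(test$units_sold,pred_glm)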

The table shows the mean-deviance estimate and the Pearson estimate of the dispersion parameter, together with the AIC and BIC of the GLMs. The full model (fit_glm) should be preferred: its AIC is the lowest, suggesting the best fit. In the next section, I will use the h2o package, an open-source machine learning and deep learning framework that implements almost all common machine learning algorithms.

ML

In this section, I will use the h2o.automl() function (in the h2o package), which automates the process of building a large number of models, with the goal of finding the “best” one.

library(h2o)
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 days 10 hours 
##     H2O cluster timezone:       Europe/Copenhagen 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.1 
##     H2O cluster version age:    5 months and 16 days !!! 
##     H2O cluster name:           H2O_started_from_R_francesconapoli_kbj412 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.57 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.2 (2020-06-22)
trainh20 <- as.h2o(train)
testh20 <- as.h2o(test)
predictors <- colnames(train)[1:13]
predictors<-predictors[-3]
response <- "units_sold"

The best model found below is an XGBoost one (i.e., gradient boosting machines under the XGBoost framework). The GBM algorithm builds regression trees on all the features of the dataset sequentially, in a fully distributed way: each individual tree is built in parallel.
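For reference, a single XGBoost model can also be fitted directly, outside AutoML; a sketch (ntrees=100 and seed=1 are arbitrary illustration values, and the XGBoost extension is listed as available in the cluster info above):

# A sketch: fit one XGBoost model directly in h2o.
xgb<-h2o.xgboost(x=predictors,y=response,training_frame=trainh20,
                 ntrees=100,seed=1)
h2o.r2(xgb)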

utoml <- h2o.automl(x=predictors,y=response,training_frame=trainh20,max_models=15)
best_model <- as.vector(utoml@leaderboard$model_id[1])
summary_model <- h2o.getModel(best_model)
cat(paste0("The R squared of the XGBOOST model is ",h2o.r2(summary_model) ))
## The R squared of the XGBOOST model is 0.984452039784771

An important feature of the GBM models is the Variable Importance Table, a measure of the relative importance of each variable in training the model.

h2o.varimp_plot(summary_model)

The most important variables are the rating-star counts. One of the best-known weaknesses of GBM models is overfitting; however, the use of h2o, and in particular the AutoML function, helps by tuning the model automatically.
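Since AutoML uses 5-fold cross-validation by default, overfitting can be gauged by comparing the training R^2 of the leader with its cross-validated counterpart; a minimal sketch:

# Training vs cross-validated R^2 for the AutoML leader.
h2o.r2(summary_model,train=TRUE,xval=TRUE)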

In the next section, I will carry out an out-of-sample analysis.

Predictions