Last Update: “19 September, 2020”
This is an R Markdown document. The data used for this analysis can be downloaded at the following website: https://www.kaggle.com/jmmvutu/summer-products-and-sales-in-ecommerce-wish. The dataset contains product listings as well as products ratings and sales performance.
The main goal of the Kaggle competition is to find patterns regarding the success of a product. With this goal in mind, I have developed the document with the following structure:
library(dplyr)
library(ggplot2)
library(gridExtra)
library(glmnet)
library(h2o)
library(stringr)
library(lmtest)
library(fBasics)
library(car)
library(reshape2)
library(Metrics)
library(rcompanion)
library(fitdistrplus)
library(modelr)
library(statmod)
data<-read.csv("summer-products-with-rating-and-performance_2020-08.csv",
stringsAsFactors=FALSE)
This step is particularly important for the transformation and selection of the variables.
First of all, I'll take a quick look at what kind of data is present in the dataset.
str(data)
## 'data.frame': 1573 obs. of 43 variables:
## $ title : chr "2020 Summer Vintage Flamingo Print Pajamas Set Casual Loose T Shirt Top And Elastic Shorts Women Sleepwear Nig"| __truncated__ "SSHOUSE Summer Casual Sleeveless Soirée Party Soirée sans manches Vêtements de plage sexy Mini robe femme wshC1612242400387A21" "2020 Nouvelle Arrivée Femmes Printemps et Été Plage Porter Longue Mince Cardigan Ouvert Avant Kimono Vert Feuil"| __truncated__ "Hot Summer Cool T-shirt pour les femmes Mode Tops Abeille Lettres imprimées Manches courtes O Neck Coton T-shir"| __truncated__ ...
## $ title_orig : chr "2020 Summer Vintage Flamingo Print Pajamas Set Casual Loose T Shirt Top And Elastic Shorts Women Sleepwear Nig"| __truncated__ "Women's Casual Summer Sleeveless Sexy Mini Dress" "2020 New Arrival Women Spring and Summer Beach Wear Long Thin Cardigan Open Front Kimono Green Leaf Printed Chi"| __truncated__ "Hot Summer Cool T Shirt for Women Fashion Tops Bee Printed Letters Short Sleeve O Neck Cotton T-shirts Tops Tee Clothing" ...
## $ price : num 16 8 8 8 2.72 3.92 7 12 11 5.78 ...
## $ retail_price : int 14 22 43 8 3 9 6 11 84 22 ...
## $ currency_buyer : chr "EUR" "EUR" "EUR" "EUR" ...
## $ units_sold : int 100 20000 100 5000 100 10 50000 1000 100 5000 ...
## $ uses_ad_boosts : int 0 1 0 1 1 0 0 0 1 0 ...
## $ rating : num 3.76 3.45 3.57 4.03 3.1 5 3.84 3.76 3.47 3.6 ...
## $ rating_count : int 54 6135 14 579 20 1 6742 286 15 687 ...
## $ rating_five_count : int 26 2269 5 295 6 1 3172 120 6 287 ...
## $ rating_four_count : int 8 1027 4 119 4 0 1352 56 2 128 ...
## $ rating_three_count : int 10 1118 2 87 2 0 971 61 3 92 ...
## $ rating_two_count : int 1 644 0 42 2 0 490 18 1 68 ...
## $ rating_one_count : int 9 1077 3 36 6 0 757 31 3 112 ...
## $ badges_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ badge_local_product : int 0 0 0 0 0 0 0 0 0 0 ...
## $ badge_product_quality : int 0 0 0 0 0 0 0 0 0 0 ...
## $ badge_fast_shipping : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tags : chr "Summer,Fashion,womenunderwearsuit,printedpajamasset,womencasualshort,Women's Fashion,flamingo,loungewearset,Cas"| __truncated__ "Mini,womens dresses,Summer,Patchwork,fashion dress,Dress,Mini dress,Women's Fashion,Women S Clothing,backless,p"| __truncated__ "Summer,cardigan,women beachwear,chiffon,Sexy women,Coat,summercardigan,openfront,short sleeves,Swimsuit,Women's"| __truncated__ "Summer,Shorts,Cotton,Cotton T Shirt,Sleeve,printedletterstop,Clothing,Tops,Necks,short sleeves,Women's Fashion,"| __truncated__ ...
## $ product_color : chr "white" "green" "leopardprint" "black" ...
## $ product_variation_size_id : chr "M" "XS" "XS" "M" ...
## $ product_variation_inventory : int 50 50 1 50 1 1 50 50 50 50 ...
## $ shipping_option_name : chr "Livraison standard" "Livraison standard" "Livraison standard" "Livraison standard" ...
## $ shipping_option_price : int 4 2 3 2 1 1 2 3 2 2 ...
## $ shipping_is_express : int 0 0 0 0 0 0 0 0 0 0 ...
## $ countries_shipped_to : int 34 41 36 41 35 40 31 139 36 33 ...
## $ inventory_total : int 50 50 50 50 50 50 50 50 50 50 ...
## $ has_urgency_banner : int 1 1 1 NA 1 NA NA NA 1 NA ...
## $ urgency_text : chr "Quantité limitée !" "Quantité limitée !" "Quantité limitée !" "" ...
## $ origin_country : chr "CN" "CN" "CN" "CN" ...
## $ merchant_title : chr "zgrdejia" "SaraHouse" "hxt520" "allenfan" ...
## $ merchant_name : chr "zgrdejia" "sarahouse" "hxt520" "allenfan" ...
## $ merchant_info_subtitle : chr "(568 notes)" "83 % avis positifs (17,752 notes)" "86 % avis positifs (295 notes)" "(23,832 notes)" ...
## $ merchant_rating_count : int 568 17752 295 23832 14482 65 10194 342 330 5534 ...
## $ merchant_rating : num 4.13 3.9 3.99 4.02 4 ...
## $ merchant_id : chr "595097d6a26f6e070cb878d1" "56458aa03a698c35c9050988" "5d464a1ffdf7bc44ee933c65" "58cfdefdacb37b556efdff7c" ...
## $ merchant_has_profile_picture: int 0 0 0 0 0 0 1 0 0 0 ...
## $ merchant_profile_picture : chr "" "" "" "" ...
## $ product_url : chr "https://www.wish.com/c/5e9ae51d43d6a96e303acdb0" "https://www.wish.com/c/58940d436a0d3d5da4e95a38" "https://www.wish.com/c/5ea10e2c617580260d55310a" "https://www.wish.com/c/5cedf17ad1d44c52c59e4aca" ...
## $ product_picture : chr "https://contestimg.wish.com/api/webimage/5e9ae51d43d6a96e303acdb0-medium.jpg" "https://contestimg.wish.com/api/webimage/58940d436a0d3d5da4e95a38-medium.jpg" "https://contestimg.wish.com/api/webimage/5ea10e2c617580260d55310a-medium.jpg" "https://contestimg.wish.com/api/webimage/5cedf17ad1d44c52c59e4aca-medium.jpg" ...
## $ product_id : chr "5e9ae51d43d6a96e303acdb0" "58940d436a0d3d5da4e95a38" "5ea10e2c617580260d55310a" "5cedf17ad1d44c52c59e4aca" ...
## $ theme : chr "summer" "summer" "summer" "summer" ...
## $ crawl_month : chr "2020-08" "2020-08" "2020-08" "2020-08" ...
It can be seen that the dataset contains a lot of character variables. The main focus of this section will be analyzing the different variables and, where necessary and useful, converting them.
This column (units_sold) represents our dependent variable. It is fundamental to understand its distribution and whether it may be necessary (or useful) to transform it.
summary(data$units_sold)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 100 1000 4339 5000 100000
hist(data$units_sold)
cat(paste0("It can be seen by the histogram that the distribution of the data is right-skewed.
As a matter of fact the skewness is: ",skewness(data$units_sold)[1]))
## It can be seen by the histogram that the distribution of the data is right-skewed.
## As a matter of fact the skewness is: 5.61411708067862
Additionally, if we plot a histogram of the data and overlay a normal curve with the same mean and standard deviation as units_sold, the right-skewed pattern is clearly visible.
plotNormalHistogram(data$units_sold,main ="units_sold")
Due to the specific distribution of the dependent variable, the selection of a linear model should in principle be avoided, for several reasons: the estimation and prediction strategy should be chosen considering the distribution of the dependent variable. That is, if the outcome variable is limited in the values it can take, it is necessary to choose a model whose predicted values fall within the possible range of the outcome. While linear regression is sometimes a good approximation for limited dependent variables, often it is not. Nevertheless, in the following analysis I will consider both the MLR and the GLM, the idea being to show why the GLM results should be preferred. Additionally, I will use a log-transformation of the dependent variable. Transformations of the data are commonly used in academic work; however, this procedure may amount to forcing the dependent variable towards a normal distribution in order to be able to use the MLR.
The column contains the IDs of the different merchants. There may be duplicates.
cat(paste0("In the dataset the duplicated number of merchant_id is"),
sum(duplicated(data$merchant_id)*1),
"out of", length(data$merchant_id), "total merchant_id")
## In the dataset the duplicated number of merchant_id is 615 out of 1573 total merchant_id
In this case, it is possible that there are duplicated merchants (and maybe duplicated rows). In order to exclude this possibility, I want to check whether there are matches in product_id.
cat(paste0("In the dataset, the duplicated number of product_id is ",sum(duplicated(data$product_id)*1)))
## In the dataset, the duplicated number of product_id is 232
The result shows a smaller number of duplicates. It will be necessary to remove those rows from the dataset.
duplicate_id<-(duplicated(data$product_id)*1)
duplicate_id<-which(duplicate_id==1)
data<-data[-duplicate_id,]
duplicate_id<-sum(duplicated(data$product_id)*1) #Check
cat(paste0("Number of duplicated product_id:"),duplicate_id)
## Number of duplicated product_id: 0
Duplicated rows successfully removed.
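As a side note, the three steps above can be collapsed into a single, equivalent idiom:
data <- data[!duplicated(data$product_id), ] # idempotent here: the duplicates are already removed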
Another interesting aspect is the possibility that some merchants sell more than one product on the e-commerce site. It could be interesting to analyze the effect of this on the units_sold variable. Again, from my personal experience, knowing that a merchant sells more than one product on the marketplace increases my confidence in that specific merchant.
more_than_one_article<-as.factor(duplicated(data$merchant_id)*1)
data$more_than_one_article<-more_than_one_article
This column contains information about the rating given to the different merchants in the dataset.
The only transformation I will apply to this variable is rounding with the function round().
data$merchant_rating<-round(data$merchant_rating,digits=1) # rounded to one decimal to make comparisons easier
summary(data$merchant_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 3.900 4.000 4.041 4.200 5.000
hist(data$merchant_rating)
Finally, I check the correlation between the variable merchant_rating and units_sold
cor(data$merchant_rating,data$units_sold)
## [1] 0.1097694
The cor() function shows the presence of a low correlation between the variables analyzed.
The column contains the count of the ratings obtained by the different merchants. First, I take a look at the data.
head(data$merchant_rating_count,20)
## [1] 568 17752 295 23832 14482 65 10194 342 330 5534
## [11] 9470 3515 557 5083 55499 39381 6302 139223 22089 3592
Here are the values from summary():
summary(data$merchant_rating_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2366 8740 27874 25222 2174765
Starting from my personal experience on e-commerce sites (Zalando, Amazon, etc.), I have always considered the number of reviews (in this specific case merchant_rating_count) important for the decision whether or not to buy a specific product from a specific merchant. Taking this into account, I want to test the correlations between the variables merchant_rating_count, units_sold and merchant_rating.
cat(paste0("Correlation between the *merchant_rating_count* and *units_sold* variables is: ",cor(data$merchant_rating_count,data$units_sold)))
## Correlation between the *merchant_rating_count* and *units_sold* variables is: 0.272978836705809
cat(paste0("Correlation between the *merchant_rating_count* and *merchant_rating* variables is: ",cor(data$merchant_rating_count,data$merchant_rating)))
## Correlation between the *merchant_rating_count* and *merchant_rating* variables is: 0.131599096159153
These results should be considered in the models section.
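For compactness, the three pairwise correlations can also be read off a single correlation matrix:
# 3x3 correlation matrix of the variables discussed above
cor_vars <- c("units_sold", "merchant_rating", "merchant_rating_count")
round(cor(data[, cor_vars]), 3)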
Finally, I should make sure to remove the NA values from all the other variables.
data$rating_five_count[is.na(data$rating_five_count)] <- 0
data$rating_four_count[is.na(data$rating_four_count)] <- 0
data$rating_three_count[is.na(data$rating_three_count)] <- 0
data$rating_two_count[is.na(data$rating_two_count)] <- 0
data$rating_one_count[is.na(data$rating_one_count)] <- 0
There are two variables related to the inventory:
- product_variation_inventory, described as "Inventory the seller has. Max allowed quantity is 50";
- inventory_total, described as "Total inventory for all the product's variations (size/color variations for instance)".
The descriptions of the two variables are not very clear. The analysis will help understand their importance for the model.
hist(data$product_variation_inventory)
hist(data$inventory_total)
cat(paste0("Basically ",(max(table(data$inventory_total))/length(data$inventory_total))," percent of the observations in *inventory_total* are equal to 50."))
## Basically 0.992542878448919 percent of the observations in *inventory_total* are equal to 50.
Additionally, considering that
cat(paste0("The correlation between the variables *inventory_total* and *units_sold* is ", cor(data$units_sold,data$inventory_total)))
## The correlation between the variables *inventory_total* and *units_sold* is 0.00937539843820239
And that
cat(paste0(100*(max(table(data$product_variation_inventory))/length(data$product_variation_inventory))," percent of the observations in *product_variation_inventory* are equal to 50 "))
## 59.1349739000746 percent of the observations in *product_variation_inventory* are equal to 50
cat(paste0("and the correlation between the variables *product_variation_inventory* and *units_sold* is ", cor(data$units_sold,data$product_variation_inventory)))
## and the correlation between the variables *product_variation_inventory* and *units_sold* is 0.127573263998713
Considering the results obtained, product_variation_inventory is more meaningful than inventory_total.
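As a side note, the shares reported above can be computed more directly; an equivalent sketch:
# proportion of observations equal to 50 (the modal value in both columns)
mean(data$inventory_total == 50)
mean(data$product_variation_inventory == 50)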
The column urgency_text represents the presence of a banner which informs customers about the limited quantity of a product.
table(data$urgency_text)
##
## Quantité limitée !
## 975 365
## Réduction sur les achats en gros
## 1
As can be seen in the table, 365 observations have the value "Quantité limitée !" ("Limited quantity!"), 975 have no value, and 1 has "Réduction sur les achats en gros" (literally, "discount on wholesale purchases"). I will transform the variable into a dummy indicating the presence (or absence) of an urgency text, in order to see whether it increases units_sold.
data$urgency_text[which(data$urgency_text == "Réduction sur les achats en gros", arr.ind=TRUE)]<-0 # I considered this as a
data$urgency_text[which(data$urgency_text == "Quantité limitée !", arr.ind=TRUE)]<-1
data$urgency_text[which(data$urgency_text == "", arr.ind=TRUE)]<-0
data$urgency_text<-as.factor(data$urgency_text)
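For reference, the same dummy can be created in a single step; an equivalent sketch of the recoding above (to be run in place of it, before the factor conversion):
# data$urgency_text <- factor(as.integer(data$urgency_text == "Quantité limitée !"))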
The next analysis considers the shipping-related variables: shipping_option_name, shipping_is_express and shipping_option_price. In the following tables, I show the values of the first two columns.
cat("Table data$shipping_option_name")
## Table data$shipping_option_name
table(data$shipping_option_name)
##
## Ekspresowa wysyłka Envío normal Envio Padrão
## 1 4 8
## Expediere Standard Livraison Express Livraison standard
## 4 3 1285
## Spedizione standard Standard Shipping Standardowa wysyłka
## 2 18 3
## Standardversand Standart Gönderi Стандартная доставка
## 3 2 2
## الشحن القياسي การส่งสินค้ามาตรฐาน ការដឹកជញ្ជូនតាមស្តង់ដារ
## 3 2 1
cat("Table data$shipping_is_express")
## Table data$shipping_is_express
table(data$shipping_is_express)
##
## 0 1
## 1337 4
The first table shows the shipping option names in the various site languages: nearly all products use standard shipping, with only a few express entries (Livraison Express, Ekspresowa wysyłka). The second table confirms that only 4 products have express shipping. Given how little variation these columns carry, I decided not to use this information.
hist(data$shipping_option_price)
The observations contained in the column shipping_option_price could be interesting in understanding which variables drive the units_sold.
cat(paste0("The correlation between the *units_sold* and the *shipping_option_price* variables is: ",cor(data$units_sold,data$shipping_option_price)))
## The correlation between the *units_sold* and the *shipping_option_price* variables is: -0.0521409042295423
The variable shipping_option_price shows a low negative correlation with units_sold.
There are 4 columns related to the badges: badges_count, badge_local_product, badge_product_quality and badge_fast_shipping. It appears clearly that badges_count is the sum of the individual badge columns. Considering this, I will initially use badges_count as the variable.
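Before relying on badges_count, a cheap sanity check that it really equals the row-wise sum of the individual badge dummies:
# should return TRUE if badges_count is exactly the sum of the three dummies
all(data$badges_count == data$badge_local_product +
      data$badge_product_quality + data$badge_fast_shipping)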
In this section, I consider the product_color column.
str(data$product_color)
## chr [1:1341] "white" "green" "leopardprint" "black" "yellow" "navyblue" ...
Since it is a chr column, I have to account for inconsistent capitalization.
data$product_color<-tolower(data$product_color)
data$product_color<-ifelse(data$product_color=="","No_color",data$product_color)
table(data$product_color)
##
## applegreen apricot army army green
## 1 2 1 2
## armygreen beige black black & blue
## 24 14 267 2
## black & green black & stripe black & white black & yellow
## 3 1 3 2
## blackwhite blue blue & pink brown
## 1 85 1 6
## brown & yellow burgundy camel camouflage
## 1 2 1 3
## claret coffee coolblack coralred
## 1 5 2 1
## darkblue darkgreen denimblue dustypink
## 5 1 1 2
## floral fluorescentgreen gold gray
## 5 2 1 9
## gray & white green grey greysnakeskinprint
## 1 77 65 1
## ivory jasper khaki lakeblue
## 1 1 11 1
## leopard leopardprint light green lightblue
## 3 1 1 8
## lightgray lightgreen lightgrey lightkhaki
## 1 3 1 1
## lightpink lightpurple lightred lightyellow
## 4 1 1 2
## mintgreen multicolor navy navy blue
## 2 18 3 2
## navyblue navyblue & white No_color nude
## 25 1 40 1
## offblack offwhite orange orange & camouflage
## 1 1 24 1
## orange-red pink pink & black pink & blue
## 3 86 2 1
## pink & grey pink & white prussianblue purple
## 2 1 1 49
## rainbow red red & blue rose
## 1 79 1 3
## rose red rosegold rosered silver
## 1 1 7 2
## skyblue star tan violet
## 7 1 1 1
## watermelonred white white & black white & green
## 1 209 2 7
## white & red whitefloral whitestripe wine
## 1 2 1 2
## wine red winered winered & yellow yellow
## 1 23 1 80
tablecol<-as.matrix(table(data$product_color))
An important issue in the dataset, related to the color, is the presence of color variants such as light blue or light red. For ease of analysis, I will merge these into their base color (e.g. light red becomes red). Additionally, whenever an entry lists two major colors at the same time, I will label it "multicolor".
# strip the "light" prefix (e.g. lightblue -> blue)
data$product_color = gsub(pattern='(light-*)|(light)',replacement='',
                          x=data$product_color)
# entries listing two colors (e.g. "black & blue") become "multicolor"
data$product_color = ifelse(str_detect(data$product_color, '&'),"multicolor",
                            data$product_color)
# collapse remaining shades and compounds onto the base color
data$product_color = replace(data$product_color,grep(pattern='.*blue*.|blue*.|.*blue',data$product_color),"blue")
data$product_color = replace(data$product_color,grep(pattern='.*green*.|green*.|.*green',data$product_color),"green")
data$product_color = replace(data$product_color,grep(pattern='.*red*.|red*.|.*red',data$product_color),"red")
data$product_color = replace(data$product_color,grep(pattern='.*white*.|white*.|.*white',data$product_color),"white")
data$product_color = replace(data$product_color,grep(pattern='.*black*.|black*.|.*black',data$product_color),"black")
data$product_color = gsub(pattern='(white*)|(*white)',replacement='white',
                          x=data$product_color)
table(data$product_color)
##
## apricot army beige black blue brown
## 2 1 14 270 135 6
## burgundy camel camouflage coffee dustypink floral
## 2 1 3 5 2 5
## gold gray ivory jasper khaki leopard
## 1 10 1 1 12 3
## leopardprint multicolor navy No_color nude orange
## 1 52 3 40 1 24
## pink purple rainbow red rose rosegold
## 90 50 1 298 3 1
## silver star tan violet white wine
## 2 1 1 1 214 2
## yellow
## 82
I will select the 10 most used colors in the dataset. In the following table, the most used colors are listed in decreasing order of their count in the dataset.
table=as.matrix(table(data$product_color))
head(sort(table[,1],decreasing=TRUE),10)
## red black white blue pink yellow multicolor
## 298 270 214 135 90 82 52
## purple No_color orange
## 50 40 24
rownames=c(rownames(as.matrix(head(sort(table[,1],decreasing=TRUE),10))))
data$product_color = ifelse(data$product_color %in% rownames, data$product_color,'others')
rownames=c(rownames(as.matrix(table(data$product_color))))
Given the size of the dataset, I want to understand whether there is a relation between a specific color and units_sold. To this end, I will factorize product_color into 11 levels (the 10 most used colors plus the "others" value).
data$product_color<-factor(data$product_color,levels=rownames,labels=c(1:length(rownames)))
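Since the factor now carries numeric labels, it is useful to keep a small lookup between the labels and the original color names; a minimal sketch:
# map between the numeric factor labels (1-11) and the color names
color_lookup <- data.frame(label = seq_along(rownames), color = rownames)
color_lookup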
In this section, I consider the product_variation_size_id column.
head(data$product_variation_size_id,20)
## [1] "M" "XS" "XS" "M" "S" "Size-XS" "XS"
## [8] "M." "M" "S" "XXS" "XS" "XS" "S"
## [15] "S" "S" "L" "S" "S" "S"
The code shows the first 20 rows. The main problem is the inconsistent capitalization, which should be removed. Additionally, some rows are not clear; it will be necessary to clean the data.
data$product_variation_size_id<-tolower(data$product_variation_size_id)
data$product_variation_size_id<-gsub(pattern='.',replacement='',x=data$product_variation_size_id,fixed=TRUE)
table(data$product_variation_size_id)
##
## 04-3xl
## 13 1
## 1 1 pc
## 2 1
## 1 pc - xl 10 ml
## 1 3
## 100 cm 100 x 100cm(393 x 393inch)
## 1 1
## 100pcs 10pcs
## 1 1
## 17 1m by 3m
## 1 1
## 1pc 2
## 1 1
## 20pcs 20pcs-10pairs
## 1 1
## 25 25-s
## 1 1
## 26(waist 72cm 28inch) 29
## 1 2
## 2pcs 2xl
## 2 4
## 3 layered anklet 30 cm
## 1 1
## 32/l 33
## 1 3
## 34 35
## 3 2
## 36 3xl
## 1 2
## 4 4-5 years
## 1 1
## 40 cm 4xl
## 1 4
## 5 5pairs
## 1 1
## 5xl 60
## 1 1
## 6xl 80 x 200 cm
## 1 1
## au plug low quality b
## 1 1
## baby float boat base & top & matte top coat
## 1 1
## base coat choose a size
## 1 1
## daughter 24m eu 35
## 1 2
## eu39(us8) first generation
## 1 1
## floating chair for kid h01
## 1 1
## l m
## 46 180
## one size pack of 1
## 2 1
## pants-s round
## 1 1
## s s (waist58-62cm)
## 574 1
## s diameter 30cm s pink
## 1 1
## s(bust 88cm) s(pink & black)
## 1 1
## s/m(child) size -xxs
## 1 2
## size m size s
## 2 6
## size xs size xxs
## 3 2
## size--s size-4xl
## 1 1
## size-5xl size-l
## 1 1
## size-s size-xs
## 3 3
## size-xxs size/s
## 2 1
## size4xl sizel
## 1 2
## suit-s us 65 (eu 37)
## 2 1
## us-s us55-eu35
## 1 1
## white women size 36
## 1 1
## women size 37 x l
## 1 1
## xl xs
## 15 297
## xxl xxs
## 12 74
## xxxl xxxs
## 1 5
## xxxxl xxxxxl
## 2 2
# collapse every entry containing "size" to the single token "size"
data$product_variation_size_id = replace(data$product_variation_size_id,grep(pattern='.*size*.|size*.|.*size',data$product_variation_size_id),"size")
# strip hyphenated prefixes (e.g. "us-s" -> "s")
data$product_variation_size_id = gsub(pattern='.+[-]',replacement='',
                                      x=data$product_variation_size_id)
# normalize the extended sizes to xl / xs
data$product_variation_size_id = ifelse(grepl(pattern='xl',data$product_variation_size_id),
                                        'xl',data$product_variation_size_id)
data$product_variation_size_id = ifelse(grepl(pattern='xs',data$product_variation_size_id),
                                        'xs',data$product_variation_size_id)
# remove the first blank, then keep only the standard sizes; all the rest is "NotClear"
data$product_variation_size_id = str_replace(data$product_variation_size_id,' ','')
data$product_variation_size_id = ifelse(data$product_variation_size_id %in% c('s','xs','m','l','xl'),
                                        data$product_variation_size_id,'NotClear')
table(data$product_variation_size_id)
##
## l m NotClear s xl xs
## 46 180 114 579 46 376
As for the color column, I will factorize the size column.
rownames=c(rownames(as.matrix(table(data$product_variation_size_id))))
data$product_variation_size_id<-factor(data$product_variation_size_id,levels=rownames,labels=c(1:length(rownames)))
The column uses_ad_boosts indicates whether the seller paid to boost the product (better placement, highlighting, etc.). The idea is that this variable should be very relevant for units_sold.
table(data$uses_ad_boosts)
##
## 0 1
## 757 584
Given the way the data are recorded in the dataset, I will factorize the variable.
data$uses_ad_boosts<-as.factor(data$uses_ad_boosts)
Another important characteristic of the dataset is the presence of two price columns: price and retail_price. I will create the variable discount to investigate whether the presence of a discount may increase units_sold.
discount=(data$price-data$retail_price)
data$discount=discount
cat(paste0("The correlation between *discount* and *units_sold* is :", cor(discount,data$units_sold)))
## The correlation between *discount* and *units_sold* is :-0.0150171748884622
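Note that, as defined above, discount is an absolute difference (and is positive when price exceeds retail_price). A scale-free alternative, purely illustrative and assuming retail_price is strictly positive, would express the discount as a fraction of the retail price:
# hypothetical alternative: relative discount as a fraction of the retail price
discount_rel <- (data$retail_price - data$price) / data$retail_price
cor(discount_rel, data$units_sold)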
In this section, I analyze the merchant_has_profile_picture variable. Starting again from my personal experience, I expect to find that the presence of a merchant profile picture increases units_sold.
table(data$merchant_has_profile_picture)
##
## 0 1
## 1138 203
The table shows that the variable should be considered as a factor.
cat(paste0("Additionally, the importance of the variable is given by the high correlation, which is equal to: ",cor(data$merchant_has_profile_picture,data$units_sold)))
## Additionally, the importance of the variable is given by the high correlation, which is equal to: 0.14475246443755
data$merchant_has_profile_picture<-as.factor(data$merchant_has_profile_picture)
This column contains the feedback received by the merchants since they became active. The idea is that there should be a high correlation between the information in merchant_info_subtitle and units_sold.
head(data$merchant_info_subtitle,5)
## [1] "(568 notes)" "83 % avis positifs (17,752 notes)"
## [3] "86 % avis positifs (295 notes)" "(23,832 notes)"
## [5] "85 % avis positifs (14,482 notes)"
Again, starting from my personal experience, I will split the data contained in the merchant_info_subtitle column into two variables: the percentage of positive reviews (percent) and the number of ratings received (notes_count). I expect to find a high correlation between at least one of these two variables and units_sold.
information=data$merchant_info_subtitle
# split on the parentheses: "83 % avis positifs (17,752 notes)" -> percent part / notes part
information=colsplit(information,'[()]',names=c("percent","notes_count"))
# keep the two leading digits of the percentage; missing percentages become 0
percent =as.numeric(substr(information$percent, 1, 2))
percent[is.na(percent)] <- 0
# extract the number of notes, dropping the word "notes" and the thousands separator
notes_count=information$notes_count
notes_count=colsplit(notes_count,'[ ]',names=c("notes_count","nothank"))
notes_count=notes_count$notes_count
notes_count=as.numeric(gsub(",","",notes_count))
notes_count[is.na(notes_count)] <- 0
data$feedback_percent=percent
cat(paste0("The correlation between *feedback_percent* and *units_sold* is ",cor(data$feedback_percent,data$units_sold),"
"))
## The correlation between *feedback_percent* and *units_sold* is 0.0323968360949711
cat(paste0("The correlation between *feedback_notes_count* and *units_sold* is ",cor(notes_count,data$units_sold)))
## The correlation between *feedback_notes_count* and *units_sold* is 0.272410362641396
However, it is important to note that the second variable just created, feedback_notes_count, is basically the same as merchant_rating_count. I will verify whether there are discrepancies between the two variables.
difference=as.data.frame(notes_count-data$merchant_rating_count)
difference[difference[,1] != 0,]
## [1] -126370 -6555 -37753 -33675
As can be seen, there are some values in the merchant_rating_count column which do not match the description (merchant_info_subtitle). Since merchant_info_subtitle is the merchant's own description, while the other variable may have been created by the data supplier, I will overwrite merchant_rating_count with the notes_count just created.
data$merchant_rating_count=notes_count
The column contains information about the countries to which the merchant can ship the products. This could be an interesting variable for understanding the behavior of units_sold.
hist(data$countries_shipped_to)
cat(paste0("The correlation between the *countries_shipped_to* and *units_sold* is ",cor(data$countries_shipped_to,data$units_sold)))
## The correlation between the *countries_shipped_to* and *units_sold* is -0.0171633674449296
After finishing the analysis of the most relevant variables contained in the dataset, I will prepare it for the modeling section and split it into a training set and a test set. This hold-out technique is a sub-sampling approach in which the existing data are split into training and test parts: the model is trained on the training part and validated on the test part. Additionally, I will remove the "chr" variables.
# collect the indices of the character columns and drop them
control<-as.matrix(sapply(data,class))
indexremove=list()
count=0
for (i in 1:length(data)){
  if (control[i]=="character"){
    count=count+1
    indexremove[count]=i
  }
}
indexremove=c(unlist(indexremove))
datanew<-data[-indexremove]
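For reference, the same selection can be written in a single line; an equivalent sketch:
datanew <- data[, sapply(data, class) != "character"] # same result as the loop above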
str(datanew)
## 'data.frame': 1341 obs. of 32 variables:
## $ price : num 16 8 8 8 2.72 3.92 7 12 11 5.78 ...
## $ retail_price : int 14 22 43 8 3 9 6 11 84 22 ...
## $ units_sold : int 100 20000 100 5000 100 10 50000 1000 100 5000 ...
## $ uses_ad_boosts : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 2 1 ...
## $ rating : num 3.76 3.45 3.57 4.03 3.1 5 3.84 3.76 3.47 3.6 ...
## $ rating_count : int 54 6135 14 579 20 1 6742 286 15 687 ...
## $ rating_five_count : num 26 2269 5 295 6 ...
## $ rating_four_count : num 8 1027 4 119 4 ...
## $ rating_three_count : num 10 1118 2 87 2 ...
## $ rating_two_count : num 1 644 0 42 2 0 490 18 1 68 ...
## $ rating_one_count : num 9 1077 3 36 6 ...
## $ badges_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ badge_local_product : int 0 0 0 0 0 0 0 0 0 0 ...
## $ badge_product_quality : int 0 0 0 0 0 0 0 0 0 0 ...
## $ badge_fast_shipping : int 0 0 0 0 0 0 0 0 0 0 ...
## $ product_color : Factor w/ 11 levels "1","2","3","4",..: 10 9 6 1 11 2 10 2 1 6 ...
## $ product_variation_size_id : Factor w/ 6 levels "1","2","3","4",..: 2 6 6 2 4 3 6 2 2 4 ...
## $ product_variation_inventory : int 50 50 1 50 1 1 50 50 50 50 ...
## $ shipping_option_price : int 4 2 3 2 1 1 2 3 2 2 ...
## $ shipping_is_express : int 0 0 0 0 0 0 0 0 0 0 ...
## $ countries_shipped_to : int 34 41 36 41 35 40 31 139 36 33 ...
## $ inventory_total : int 50 50 50 50 50 50 50 50 50 50 ...
## $ has_urgency_banner : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 2 1 ...
## $ urgency_text : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 2 1 ...
## $ merchant_rating_count : num 568 17752 295 23832 14482 ...
## $ merchant_rating : num 4.1 3.9 4 4 4 3.5 4.1 3.7 3.8 4 ...
## $ merchant_has_profile_picture: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ more_than_one_article : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_tags : int 2 8 5 7 5 5 16 6 6 7 ...
## $ tag_most_used : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ discount : num 2 -14 -35 0 -0.28 ...
## $ feedback_percent : num 0 83 86 0 85 75 86 0 82 85 ...
From the original 43 variables, I have reduced the dataset to 32 variables.
set.seed(2)
smp_size <- floor(0.9 * nrow(datanew))
train_ind <- sample(seq_len(nrow(datanew)), size = smp_size)
train <- datanew[train_ind, ]
test <- datanew[-train_ind, ]
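A quick sanity check that the split behaves as intended:
nrow(train); nrow(test) # 1206 and 135 rows
length(intersect(row.names(train), row.names(test))) # 0: no overlap between the two sets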
I will start the modeling section with a simple multiple linear regression without intercept.
Initially, I will consider all the variables in the train dataset. Step by step, I will remove the less significant variables.
fit_test<-lm(units_sold~ .-1,data=train)
summary(fit_test)
##
## Call:
## lm(formula = units_sold ~ . - 1, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23522 -1235 -422 682 49110
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## price 10.317976 60.090334 0.172 0.863697
## retail_price -6.702475 3.891313 -1.722 0.085259 .
## uses_ad_boosts0 4392.685830 3559.227456 1.234 0.217389
## uses_ad_boosts1 4949.593203 3553.016330 1.393 0.163865
## rating -97.744302 245.915464 -0.397 0.691093
## rating_count 3.787146 3.484963 1.087 0.277390
## rating_five_count 2.379432 3.704238 0.642 0.520770
## rating_four_count -10.823838 5.140670 -2.106 0.035459 *
## rating_three_count 8.390024 6.024761 1.393 0.164010
## rating_two_count -1.902789 13.888307 -0.137 0.891049
## rating_one_count NA NA NA NA
## badges_count -3832.155688 1062.235693 -3.608 0.000322 ***
## badge_local_product 4784.377399 1523.376081 3.141 0.001728 **
## badge_product_quality 2869.586711 1155.162373 2.484 0.013126 *
## badge_fast_shipping NA NA NA NA
## product_color2 534.212116 439.558484 1.215 0.224484
## product_color3 179.261701 628.637370 0.285 0.775573
## product_color4 1269.066574 711.863784 1.783 0.074890 .
## product_color5 3164.434194 869.838433 3.638 0.000287 ***
## product_color6 826.040838 498.032259 1.659 0.097464 .
## product_color7 -21.105957 496.706241 -0.042 0.966114
## product_color8 1603.004201 609.392074 2.630 0.008639 **
## product_color9 351.265201 345.375373 1.017 0.309339
## product_color10 509.999486 372.439669 1.369 0.171155
## product_color11 -166.828663 524.817735 -0.318 0.750634
## product_variation_size_id2 -534.854090 670.559471 -0.798 0.425252
## product_variation_size_id3 -1642.943597 724.844886 -2.267 0.023597 *
## product_variation_size_id4 -1130.906600 621.503874 -1.820 0.069072 .
## product_variation_size_id5 -2369.481002 858.977678 -2.758 0.005898 **
## product_variation_size_id6 -1523.728148 653.454817 -2.332 0.019881 *
## product_variation_inventory 9.970450 6.186839 1.612 0.107329
## shipping_option_price -335.508012 235.439138 -1.425 0.154416
## shipping_is_express 3031.169222 2701.560047 1.122 0.262091
## countries_shipped_to -9.667917 5.734196 -1.686 0.092062 .
## inventory_total -69.603151 47.264118 -1.473 0.141118
## has_urgency_banner1 282.030930 3895.371908 0.072 0.942295
## urgency_text1 -670.621994 3899.247482 -0.172 0.863477
## merchant_rating_count 0.005746 0.001384 4.152 0.0000354 ***
## merchant_rating 298.248376 582.242931 0.512 0.608580
## merchant_has_profile_picture1 390.914361 315.992603 1.237 0.216300
## more_than_one_article1 118.290712 248.750668 0.476 0.634492
## num_tags 43.998904 32.530126 1.353 0.176459
## tag_most_used1 343.287542 324.974553 1.056 0.291027
## discount NA NA NA NA
## feedback_percent -1.342298 3.254909 -0.412 0.680128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3770 on 1164 degrees of freedom
## Multiple R-squared: 0.8563, Adjusted R-squared: 0.8511
## F-statistic: 165.1 on 42 and 1164 DF, p-value: < 0.00000000000000022
Even though there is a high Adjusted R-squared, a considerable number of the variables are not statistically significant and have high standard errors. High standard errors reflect a high degree of uncertainty in estimating reliable predictors, so the inclusion of these variables should be tested. In addition, 3 of the variables have not been estimated due to high correlation with other regressors; as the NA rows in the output show, they are rating_one_count, badge_fast_shipping and discount.
As a matter of fact,
cat(paste0("The correlation between *retail_price* and *discount* is ",cor(data$retail_price,data$discount)))
## The correlation between *retail_price* and *discount* is -0.991789865212611
cat(paste0("The correlation between *badge_fast_shipping* and *badges_count* is ",cor(data$badge_fast_shipping,data$badges_count)))
## The correlation between *badge_fast_shipping* and *badges_count* is 0.499644870224989
cat(paste0("The correlation between *rating_one_count* and *rating_count* is ",cor(data$rating_one_count,data$rating_count)))
## The correlation between *rating_one_count* and *rating_count* is 0.909674752556129
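Since the car package is already loaded, variance inflation factors offer a more systematic check of this multicollinearity. A minimal sketch on a model with an intercept (VIFs are not meaningful for the no-intercept fit) and an illustrative subset of regressors:
# VIFs far above 5-10 flag problematic collinearity; rating_count and
# rating_four_count can be expected to stand out, given their overlap
fit_vif <- lm(units_sold ~ retail_price + rating_count + rating_four_count +
                badges_count + merchant_rating_count, data = train)
car::vif(fit_vif)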
fit_test2<-lm(units_sold~ retail_price+rating_four_count+badges_count+badge_local_product+badge_product_quality+product_color+product_variation_size_id+countries_shipped_to+merchant_rating_count-1,data=train)
summary(fit_test2)
##
## Call:
## lm(formula = units_sold ~ retail_price + rating_four_count +
## badges_count + badge_local_product + badge_product_quality +
## product_color + product_variation_size_id + countries_shipped_to +
## merchant_rating_count - 1, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19194 -1237 -501 617 48032
##
## Coefficients:
## Estimate Std. Error t value
## retail_price -10.193111 3.795324 -2.686
## rating_four_count 19.593661 0.313062 62.587
## badges_count -3157.257416 1006.540567 -3.137
## badge_local_product 4228.045758 1520.691114 2.780
## badge_product_quality 2594.887916 1087.808600 2.385
## product_color1 2018.155366 721.244691 2.798
## product_color2 2554.891106 771.062345 3.313
## product_color3 2232.733888 883.554517 2.527
## product_color4 3718.703379 955.305687 3.893
## product_color5 5538.365942 1095.161818 5.057
## product_color6 3086.578914 793.052618 3.892
## product_color7 2279.249052 790.040982 2.885
## product_color8 3977.276405 863.381642 4.607
## product_color9 2641.189424 698.826386 3.779
## product_color10 2567.533223 725.428884 3.539
## product_color11 2049.120046 786.193903 2.606
## product_variation_size_id2 -412.634000 689.239717 -0.599
## product_variation_size_id3 -1467.773236 732.710208 -2.003
## product_variation_size_id4 -1066.947455 636.941901 -1.675
## product_variation_size_id5 -2180.012508 871.222607 -2.502
## product_variation_size_id6 -1486.403470 649.484242 -2.289
## countries_shipped_to -10.037090 5.745275 -1.747
## merchant_rating_count 0.007465 0.001380 5.409
## Pr(>|t|)
## retail_price 0.007339 **
## rating_four_count < 0.0000000000000002 ***
## badges_count 0.001751 **
## badge_local_product 0.005516 **
## badge_product_quality 0.017216 *
## product_color1 0.005223 **
## product_color2 0.000949 ***
## product_color3 0.011634 *
## product_color4 0.000105 ***
## product_color5 0.0000004929 ***
## product_color6 0.000105 ***
## product_color7 0.003985 **
## product_color8 0.0000045356 ***
## product_color9 0.000165 ***
## product_color10 0.000417 ***
## product_color11 0.009265 **
## product_variation_size_id2 0.549501
## product_variation_size_id3 0.045382 *
## product_variation_size_id4 0.094177 .
## product_variation_size_id5 0.012475 *
## product_variation_size_id6 0.022279 *
## countries_shipped_to 0.080894 .
## merchant_rating_count 0.0000000768 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3904 on 1183 degrees of freedom
## Multiple R-squared: 0.8434, Adjusted R-squared: 0.8404
## F-statistic: 277 on 23 and 1183 DF, p-value: < 0.00000000000000022
plot(fit_test2$residuals)
pred2<-round(predict(fit_test2,test),digits=0)
percentbias_outofsample2<-percent_bias(test$units_sold,pred2)
percentbias_insample2<-percent_bias(train$units_sold,fit_test2$fitted.values)
cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample2 ))
## The average amount that actual values are greater than predicted as a percentage is -10.615610428764
mape_2out<-Metrics::mape(test$units_sold,pred2)
mape_2in<-Metrics::mape(train$units_sold,fit_test2$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_2in ))
## The Mean Absolute Percent Error is 12.2223769337058
mae_2out<-Metrics::mae(test$units_sold,pred2)
mae_2in<-Metrics::mae(train$units_sold,fit_test2$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Error is ",mae_2in ))
## The Mean Absolute Error is 2066.79499499615
The results obtained by the fit_test2 model support the decision to keep only the variables significant at the 10% level. As a matter of fact, it is common practice to initially run the model with all the independent variables and, step by step, remove the non-significant ones.
Compared with the full model, the Adjusted R² decreases, while the MAPE, the MAE and the magnitude of the % bias slightly increase (see the comparison table further below).
It appears clear that it is necessary to further reduce the set of variables used to explain units_sold. Once again, I will keep only the significant ones.
fit_test3<-lm(units_sold~ retail_price+rating_four_count+badges_count+badge_local_product+product_color+merchant_rating_count-1,data=train)
summary(fit_test3)
##
## Call:
## lm(formula = units_sold ~ retail_price + rating_four_count +
## badges_count + badge_local_product + product_color + merchant_rating_count -
## 1, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20471 -1143 -524 472 47999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## retail_price -9.100322 3.772105 -2.413 0.015993
## rating_four_count 19.716145 0.311295 63.336 < 0.0000000000000002
## badges_count -1006.067363 413.371335 -2.434 0.015087
## badge_local_product 1386.420058 967.475378 1.433 0.152112
## product_color1 441.005194 285.531663 1.545 0.122732
## product_color2 861.896204 380.898453 2.263 0.023828
## product_color3 512.625830 605.159104 0.847 0.397114
## product_color4 2041.218648 669.676093 3.048 0.002354
## product_color5 3794.660537 867.396248 4.375 0.000013221
## product_color6 1528.169675 453.721860 3.368 0.000781
## product_color7 763.303063 439.725315 1.736 0.082847
## product_color8 2486.075327 578.182673 4.300 0.000018493
## product_color9 1087.381480 257.440143 4.224 0.000025851
## product_color10 1039.123374 306.418652 3.391 0.000719
## product_color11 570.112216 473.682808 1.204 0.228994
## merchant_rating_count 0.007030 0.001363 5.159 0.000000291
##
## retail_price *
## rating_four_count ***
## badges_count *
## badge_local_product
## product_color1
## product_color2 *
## product_color3
## product_color4 **
## product_color5 ***
## product_color6 ***
## product_color7 .
## product_color8 ***
## product_color9 ***
## product_color10 ***
## product_color11
## merchant_rating_count ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3932 on 1190 degrees of freedom
## Multiple R-squared: 0.8402, Adjusted R-squared: 0.838
## F-statistic: 390.9 on 16 and 1190 DF, p-value: < 0.00000000000000022
plot(fit_test3$residuals)
pred3<-round(predict(fit_test3,test),digits=0)
percentbias_outofsample3<-percent_bias(test$units_sold,pred3)
percentbias_insample3<-percent_bias(train$units_sold,fit_test3$fitted.values)
cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample3 ))
## The average amount that actual values are greater than predicted as a percentage is -12.3615728566824
mape_3out<-Metrics::mape(test$units_sold,pred3)
mape_3in<-Metrics::mape(train$units_sold,fit_test3$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_3in ))
## The Mean Absolute Percent Error is 12.8540868359851
mae_3out<-Metrics::mae(test$units_sold,pred3)
mae_3in<-Metrics::mae(train$units_sold,fit_test3$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Error is ",mae_3in ))
## The Mean Absolute Error is 2053.04940801906
The results obtained by the fit_test3 model show the same pattern as fit_test2. The main difference is the MAE, which decreases, suggesting that the model is improving its ability to fit units_sold.
fit_test4<-lm(units_sold~ retail_price+rating_four_count+badges_count+badge_local_product+merchant_rating_count-1,data=train)
summary(fit_test4)
##
## Call:
## lm(formula = units_sold ~ retail_price + rating_four_count +
## badges_count + badge_local_product + merchant_rating_count -
## 1, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22477 -390 -33 748 47968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## retail_price 4.497042 3.201337 1.405 0.160
## rating_four_count 20.295474 0.305021 66.538 < 0.0000000000000002 ***
## badges_count -642.409399 416.432212 -1.543 0.123
## badge_local_product 1382.975889 989.567419 1.398 0.163
## merchant_rating_count 0.008171 0.001383 5.908 0.00000000451 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4027 on 1201 degrees of freedom
## Multiple R-squared: 0.8309, Adjusted R-squared: 0.8302
## F-statistic: 1180 on 5 and 1201 DF, p-value: < 0.00000000000000022
plot(fit_test4$residuals)
test<-as.data.frame(test)
pred4<-round(predict(fit_test4,test),digits=0)
percentbias_outofsample4<-Metrics::percent_bias(test$units_sold,pred4)
percentbias_insample4<-Metrics::percent_bias(train$units_sold,fit_test4$fitted.values)
cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample4 ))
## The average amount that actual values are greater than predicted as a percentage is -2.04538656740612
mape_4out<-Metrics::mape(test$units_sold,pred4)
mape_4in<-Metrics::mape(train$units_sold,fit_test4$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_4in ))
## The Mean Absolute Percent Error is 2.545888353867
mae_4out<-Metrics::mae(test$units_sold,pred4)
mae_4in<-Metrics::mae(train$units_sold,fit_test4$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Error is ",mae_4in ))
## The Mean Absolute Error is 1874.45645825947
The results obtained by the fit_test4 model show that the reduction in independent variables is giving the expected results. As a matter of fact, the % bias, MAPE and MAE have decreased remarkably. These results suggest that the model fits the data well. Additionally, the joint F-test also indicates that the reduction of regressors should be preferred.
For the next model, I will add the interaction between the merchant_rating_count and the rating_four_count variables.
fit_test5<-lm(units_sold~ countries_shipped_to+rating_four_count+merchant_rating_count+(merchant_rating_count*rating_four_count)-1,data=train)
summary(fit_test5)
##
## Call:
## lm(formula = units_sold ~ countries_shipped_to + rating_four_count +
## merchant_rating_count + (merchant_rating_count * rating_four_count) -
## 1, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27462 -882 -454 581 39895
##
## Coefficients:
## Estimate Std. Error t value
## countries_shipped_to 20.254565467 2.827640670 7.163
## rating_four_count 17.871979217 0.355684591 50.247
## merchant_rating_count -0.006736144 0.001882804 -3.578
## rating_four_count:merchant_rating_count 0.000023018 0.000002217 10.383
## Pr(>|t|)
## countries_shipped_to 0.00000000000137 ***
## rating_four_count < 0.0000000000000002 ***
## merchant_rating_count 0.00036 ***
## rating_four_count:merchant_rating_count < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3837 on 1202 degrees of freedom
## Multiple R-squared: 0.8463, Adjusted R-squared: 0.8458
## F-statistic: 1654 on 4 and 1202 DF, p-value: < 0.00000000000000022
plot(fit_test5$residuals)
pred5<-round(predict(fit_test5,test),digits=0)
percentbias_outofsample5<-Metrics::percent_bias(test$units_sold,pred5)
percentbias_insample5<-Metrics::percent_bias(train$units_sold,fit_test5$fitted.values)
cat(paste0("The average amount that actual values are greater than predicted as a percentage is ",percentbias_insample5 ))
## The average amount that actual values are greater than predicted as a percentage is -7.86250303931833
mape_5out<-Metrics::mape(test$units_sold,pred5)
mape_5in<-Metrics::mape(train$units_sold,fit_test5$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Percent Error is ",mape_5in ))
## The Mean Absolute Percent Error is 8.21028651827872
mae_5out<-Metrics::mae(test$units_sold,pred5)
mae_5in<-Metrics::mae(train$units_sold,fit_test5$fitted.values)
cat("
")
cat(paste0("The Mean Absolute Error is ",mae_5in ))
## The Mean Absolute Error is 2008.00090814781
The results obtained by the fit_test5 model show an increase in the in-sample error measurements with respect to fit_test4. However, the Adjusted R² increases, reflecting the reduction of the residual standard error. Considering also that the joint F-test rejects the null, the previous reduction of regressors may not have been optimal.
The following table reports the Adjusted R², AIC, BIC, MAE, MAPE and % bias for all the different linear models proposed in this first part of the analysis.
## Rsquared_adj AIC BIC MAE MAPE percent_bias
## fit1 0.8510985 23328.32 23547.41 1934.751 10.453758 -5.568873
## fit2 0.8403548 23393.87 23516.15 2066.795 12.222377 -10.615610
## fit3 0.8380096 23404.57 23491.18 2053.049 12.854087 -12.361573
## fit4 0.8301669 23450.68 23481.25 1874.456 2.545888 -2.045387
## fit5 0.8457753 23333.42 23358.90 2008.001 8.210287 -7.862503
As can be seen from the table and the results discussed so far, it is difficult to say whether multiple linear regression can properly estimate units_sold. It is important to remember that the main assumption of these models is a linear relationship between the independent and dependent variables, which may not be the case here. It would be interesting to use a different kind of model.
In the models above, I have focused on linear relationships between the dependent and independent variables. However, linear relationships are not nearly general enough for all economic applications. In the following models, I will consider how units_sold changes in percentage terms, applying a log-transformation.
par(mfrow=c(2,1))
plotNormalHistogram(data$units_sold)
plotNormalHistogram(log1p(data$units_sold))
A logarithmic transformation is a convenient means of turning a highly skewed variable into a more normally distributed one. The reason why I have used the log1p() function, which calculates log(y+1), is that y is a count for which zero is a possible value, and the transformation avoids taking the logarithm of zero.
cat(paste0("From a skewness of ",skewness(data$units_sold)[1], " to a skewness of ", skewness(log1p(data$units_sold))[1]))
## From a skewness of 5.33105738109372 to a skewness of -0.115870164019399
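As a quick, dataset-independent sanity check of the transformation itself:
log(0)            # -Inf: the plain logarithm is undefined at zero
log1p(0)          # 0: log(y + 1) handles zero counts
expm1(log1p(250)) # 250: expm1() inverts log1p() exactly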
In the next section, I will produce the same model proposed before but with log(y+1) as a dependent variable.
fit_log<-lm(log1p(units_sold)~ .-1,data=train)
fit_log2<-lm(log1p(units_sold)~ retail_price+rating_four_count+badges_count+badge_local_product+badge_product_quality+product_color+product_variation_size_id+countries_shipped_to+merchant_rating_count-1,data=train)
fit_log3<-lm(log1p(units_sold)~ retail_price+rating_four_count+badges_count+badge_local_product+product_color+merchant_rating_count-1,data=train)
fit_log4<-lm(log1p(units_sold)~ retail_price+rating_four_count+badges_count+badge_local_product+merchant_rating_count-1,data=train)
fit_log5<-lm(log1p(units_sold)~ countries_shipped_to+rating_four_count+merchant_rating_count+(merchant_rating_count*rating_four_count)-1,data=train)
# helper to collect the fit statistics of a log-model
# NB: the fitted values are on the log1p scale while `actual` is on the level
# scale, so MAE, MAPE and percent_bias are not comparable with the level models
# (see the remark below the table)
metrics_log <- function(fit, actual = train$units_sold) {
  data.frame(Rsquared = summary(fit)$adj.r.squared,
             aic = AIC(fit),
             bic = BIC(fit),
             MAE = Metrics::mae(actual, fit$fitted.values),
             MAPE = Metrics::mape(actual, fit$fitted.values),
             percent_bias = Metrics::percent_bias(actual, fit$fitted.values))
}
data_fitlog = metrics_log(fit_log)
data_fitlog2 = metrics_log(fit_log2)
data_fitlog3 = metrics_log(fit_log3)
data_fitlog4 = metrics_log(fit_log4)
data_fitlog5 = metrics_log(fit_log5)
fit_log_table=rbind(data_fitlog,data_fitlog2,data_fitlog3,data_fitlog4,data_fitlog5)
fit_log_table=as.matrix(fit_log_table)
row.names(fit_log_table)=c("fit_log","fit_log2","fit_log3","fit_log4","fit_log5")
par(mfrow=c(2,2))
plot(fit_log2$residuals)
plot(fit_log3$residuals)
plot(fit_log4$residuals)
plot(fit_log5$residuals)
print(fit_log_table)
## Rsquared aic bic MAE MAPE percent_bias
## fit_log 0.9528698 4497.840 4716.928 4484.332 0.9660581 0.9405151
## fit_log2 0.9443693 4679.346 4801.628 4484.340 0.9660388 0.9342031
## fit_log3 0.9402920 4757.764 4844.380 4484.344 0.9663898 0.9318555
## fit_log4 0.5412558 7205.927 7236.497 4487.512 0.9804285 0.9774999
## fit_log5 0.8064755 6164.046 6189.521 4485.463 0.9683758 0.9513616
The table above shows that, among the log-models, fit_log is the one to focus on. Note, however, that the fitted values of these models live on the log1p scale, while MAE, MAPE and % bias are computed against the raw units_sold (and AIC/BIC refer to a transformed dependent variable), so these figures are not directly comparable with those of the level models; this explains the huge gap with respect to the previous table.
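To put the log-models on the same footing as the level models, the fitted values can first be mapped back to the original scale; a sketch for fit_log (not reported in the table above):
# back-transform from the log1p scale before computing level-scale errors
pred_level <- expm1(fit_log$fitted.values)
Metrics::mae(train$units_sold, pred_level)
Metrics::mape(train$units_sold, pred_level)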
par(mfrow=c(1,2))
scatter.smooth( rstandard(fit_test) ~ fitted(fit_test), main="No transform",
ylab="Standardized residuals", xlab="Fitted values", las=1)
scatter.smooth( rstandard(fit_log) ~ fitted(fit_log), main="log1p(Units_sold)",
ylab="Standardized residuals", xlab="Fitted values", las=1)
As can be seen from the graph, the log1p transformation produces a relationship which is considerably more linear than the untransformed one. However, the Cook's distance plot and the normal Q-Q plot show why the relationship is still not perfectly linear.
par(mfrow=c(1,2))
plot( cooks.distance( fit_log ), type="h", las=1, ylab="Cook's distance D")
qqnorm( rstandard( fit_log), las=1 ); qqline( rstandard( fit_log ), las=1 )
Indeed, there are several rows in the train dataset which have a strong influence on the regression, as suggested by the plot above.
train[which(cooks.distance(fit_log) >0.02),]
## price retail_price units_sold uses_ad_boosts rating rating_count
## 85 7.00 6 50000 1 4.41 17444
## 662 8.00 7 100 0 2.79 14
## 500 5.66 5 10000 1 4.29 2068
## 1467 8.00 84 50000 0 3.60 18463
## 348 11.00 10 20000 0 4.32 3648
## 759 49.00 42 100 0 4.67 6
## 1062 6.00 6 10000 0 4.29 2808
## 93 5.77 48 100000 0 4.10 20744
## 609 11.00 37 20000 0 3.42 7596
## rating_five_count rating_four_count rating_three_count rating_two_count
## 85 11548 3191 1632 507
## 662 5 1 0 2
## 500 1245 418 247 68
## 1467 7530 3351 3057 1736
## 348 2418 548 327 126
## 759 4 2 0 0
## 1062 1789 459 300 110
## 93 11184 4152 2919 1174
## 609 2679 1403 1317 793
## rating_one_count badges_count badge_local_product badge_product_quality
## 85 566 1 0 1
## 662 6 2 1 0
## 500 90 1 0 1
## 1467 2789 1 0 0
## 348 229 1 0 1
## 759 0 1 0 0
## 1062 150 3 1 1
## 93 1315 0 0 0
## 609 1404 0 0 0
## badge_fast_shipping product_color product_variation_size_id
## 85 0 9 2
## 662 1 6 4
## 500 0 9 5
## 1467 1 4 4
## 348 0 1 5
## 759 1 9 5
## 1062 1 4 4
## 93 0 1 6
## 609 0 1 4
## product_variation_inventory shipping_option_price shipping_is_express
## 85 2 2 0
## 662 50 7 1
## 500 50 1 0
## 1467 50 3 0
## 348 1 2 0
## 759 2 12 1
## 1062 50 7 1
## 93 50 2 0
## 609 50 3 0
## countries_shipped_to inventory_total has_urgency_banner urgency_text
## 85 35 50 0 0
## 662 41 50 0 0
## 500 138 50 0 0
## 1467 36 50 1 1
## 348 39 1 0 0
## 759 41 2 0 0
## 1062 39 50 1 1
## 93 46 50 0 0
## 609 39 50 1 1
## merchant_rating_count merchant_rating merchant_has_profile_picture
## 85 126370 4.1 1
## 662 7012 4.3 0
## 500 2174765 4.4 0
## 1467 51369 4.0 1
## 348 4448 4.3 0
## 759 29977 4.2 0
## 1062 0 4.3 0
## 93 330405 4.1 1
## 609 58068 4.3 1
## more_than_one_article num_tags tag_most_used discount feedback_percent
## 85 1 4 1 1.00 89
## 662 0 7 1 1.00 0
## 500 0 4 1 0.66 0
## 1467 1 10 1 -76.00 0
## 348 0 7 0 1.00 91
## 759 0 5 1 7.00 90
## 1062 0 6 1 0.00 92
## 93 0 7 1 -42.23 86
## 609 0 12 1 -26.00 91
Considering the table above, the causes of the outliers cannot be identified. In this case, the analyst faces a dilemma: simply discarding an observation is often unwise, since it may be a real, genuine observation for which an alternative model would be appropriate.
I will evaluate the influence of the outliers by fitting the model to the data without them and checking whether the results obtained are similar or not.
outliers<-as.matrix(train[which(cooks.distance(fit_log) >0.02),])
outliers<-row.names(outliers)
datawo_out<-train[!(row.names(train) %in% outliers), ]
fit_log_noout<-lm(log1p(units_sold)~ .-1,data=datawo_out)
par(mfrow=c(2,2))
plot( cooks.distance( fit_log ), type="h", las=1, ylab="Cook's distance D",main="Log model (all data)")
qqnorm( rstandard( fit_log), las=1 ); qqline( rstandard( fit_log ), las=1 )
plot( cooks.distance( fit_log_noout ), type="h", las=1, ylab="Cook's distance D",main="Log model (outliers removed)")
qqnorm( rstandard( fit_log_noout), las=1 ); qqline( rstandard( fit_log_noout ), las=1 )
From the graph above, it can be seen that the major (and only) improvement is the reduction of the Cook's distance D. The table below shows that removing the outliers changes the results only marginally, so the idea of discarding them will not be pursued.
data_fitlog_no_out = metrics_log(fit_log_noout, actual = datawo_out$units_sold)
check_outliers=rbind(data_fitlog,data_fitlog_no_out)
check_outliers=as.matrix(check_outliers)
row.names(check_outliers)=c("fit_log","fit_log-no outliers")
check_outliers
## Rsquared aic bic MAE MAPE percent_bias
## fit_log 0.9528698 4497.840 4716.928 4484.332 0.9660581 0.9405151
## fit_log-no outliers 0.9561045 4371.976 4585.654 4300.738 0.9661038 0.9410317
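As an alternative to deleting influential observations, a robust regression can down-weight them instead. A minimal sketch using MASS::rlm; the choice of MASS and its default M-estimator are my assumptions, not part of the original analysis:
library(MASS)  # assumption: MASS is available; rlm down-weights outliers via M-estimation
fit_robust <- rlm(log1p(units_sold) ~ . - 1, data = train, maxit = 50)
fit_robust$s               # robust estimate of the residual scale
head(sort(fit_robust$w))   # smallest IWLS weights flag the most down-weighted rows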
### Generalized Linear Models
The models discussed so far are based on the assumption of a linear relationship between the dependent and independent variables. They also assume constant variance, which is demonstrably not true for these data. Generalized linear models (GLMs) assume the responses come from a distribution belonging to a more general family, and also permit more general systematic components: formally, g(μi) = ηi = xiᵀβ with Var(yi) = φ·V(μi), where V(μ) = μ² for the Gamma family used below. In the following Cullen and Frey graph, I identify which theoretical distribution better fits the dataset.
descdist(data$units_sold,boot=500,discrete = FALSE)
## summary statistics
## ------
## min: 1 max: 100000
## median: 1000
## mean: 4820.663
## estimated sd: 9947.732
## estimated skewness: 5.343004
## estimated kurtosis: 43.54482
It can be seen that the Gamma distribution is the appropriate choice for the GLM models. As a matter of fact, the article Sales History-based Demand Prediction using Generalized Linear Models (ÖZENBOY and TEKİR, article link) reports that sale amounts are well modelled by a gamma distribution. Indeed, the gamma distribution is used to model physical quantities that take only positive values. However, in my analysis I decided to use a log link function: the canonical inverse link would impose the restrictions μi > 0 and ηi < 0 (with ηi = −1/μi), which in turn restrict β, whereas with the log link ηi = log μi can take any real value.
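A minimal simulated sketch of this point (variable names and parameter values are illustrative, not taken from the dataset): with the log link, any real linear predictor maps to a positive mean, so no constraints on β are needed.
set.seed(1)
x <- runif(200, 0, 2)
mu <- exp(1 + 0.8 * x)                      # mean is positive by construction
y <- rgamma(200, shape = 2, rate = 2 / mu)  # Gamma responses with mean mu
fit_sim <- glm(y ~ x, family = Gamma(link = "log"))
coef(fit_sim)                               # estimates close to the true (1, 0.8)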
fit_glm<-glm(units_sold~ . ,family=Gamma(link="log"), data=train)
phi.meandev1 <- deviance(fit_glm) / df.residual(fit_glm)
phi.pearson1 <- summary(fit_glm)$dispersion
data_fitglm=data.frame(Mean.deviance=phi.meandev1, Pearson=phi.pearson1,AIC=AIC(fit_glm),BIC=BIC(fit_glm))
fit_glm2<-glm(units_sold~retail_price+rating_four_count+badges_count+badge_local_product+badge_product_quality+product_color+product_variation_size_id+countries_shipped_to+merchant_rating_count-1,family=Gamma(link="log"), data=train)
phi.meandev2 <- deviance(fit_glm2) / df.residual(fit_glm2)
phi.pearson2 <- summary(fit_glm2)$dispersion
data_fitglm2=data.frame(Mean.deviance=phi.meandev2, Pearson=phi.pearson2,AIC=AIC(fit_glm2),BIC=BIC(fit_glm2))
fit_glm3<-glm(units_sold~retail_price+rating_four_count+badges_count+badge_local_product+product_color+merchant_rating_count-1,family=Gamma(link="log"),data=train)
phi.meandev3 <- deviance(fit_glm3) / df.residual(fit_glm3)
phi.pearson3 <- summary(fit_glm3)$dispersion
data_fitglm3=data.frame(Mean.deviance=phi.meandev3, Pearson=phi.pearson3,AIC=AIC(fit_glm3),BIC=BIC(fit_glm3))
fit_glm4<-glm(units_sold~retail_price+rating_four_count+badges_count+badge_local_product+merchant_rating_count,family=Gamma(link="log"),data=train,mustart=units_sold)
phi.meandev4 <- deviance(fit_glm4) / df.residual(fit_glm4)
phi.pearson4 <- summary(fit_glm4)$dispersion
data_fitglm4=data.frame(Mean.deviance=phi.meandev4, Pearson=phi.pearson4,AIC=AIC(fit_glm4),BIC=BIC(fit_glm4))
fit_glm5<-glm(units_sold~countries_shipped_to+rating_four_count+merchant_rating_count+(merchant_rating_count*rating_four_count),family=Gamma(link="log"),data=train)
phi.meandev5 <- deviance(fit_glm5) / df.residual(fit_glm5)
phi.pearson5 <- summary(fit_glm5)$dispersion
data_fitglm5=data.frame(Mean.deviance=phi.meandev5, Pearson=phi.pearson5,AIC=AIC(fit_glm5),BIC=BIC(fit_glm5))
datafit_glm=rbind(data_fitglm,data_fitglm2,data_fitglm3,data_fitglm4,data_fitglm5)
datafit_glm=as.matrix(datafit_glm)
row.names(datafit_glm)=c("fit_glm","fit_glm2","fit_glm3","fit_glm4","fit_glm5")
datafit_glm
## Mean.deviance Pearson AIC BIC
## fit_glm 1.759623 1.518472 20679.03 20898.12
## fit_glm2 1.985726 1.563185 20848.02 20970.30
## fit_glm3 2.043881 1.484676 20887.20 20973.82
## fit_glm4 2.048938 1.519099 20883.85 20919.51
## fit_glm5 1.981689 1.581366 20831.97 20862.54
The table shows the mean-deviance estimate and the Pearson estimate of the dispersion, together with the AIC and BIC of the GLMs. The full model (fit_glm) has the lowest AIC, suggesting the best fit among the candidates, and should therefore be the one considered. A by-hand check of the Pearson dispersion estimate is sketched below.
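A minimal check, using only standard GLM relations and quantities already available from fit_glm:
# The Pearson dispersion estimate is the sum of squared Pearson residuals
# divided by the residual degrees of freedom; it matches summary()$dispersion.
phi.byhand <- sum(residuals(fit_glm, type = "pearson")^2) / df.residual(fit_glm)
all.equal(phi.byhand, summary(fit_glm)$dispersion)  # should be TRUE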
In this section, I will use the h2o package, an open-source machine learning platform that implements almost all common machine learning algorithms. In particular, its AutoML function automates the process of building a large number of models, with the goal of finding the “best” model.
library(h2o)
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 10 hours
## H2O cluster timezone: Europe/Copenhagen
## H2O data parsing timezone: UTC
## H2O cluster version: 3.30.0.1
## H2O cluster version age: 5 months and 16 days !!!
## H2O cluster name: H2O_started_from_R_francesconapoli_kbj412
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.57 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.2 (2020-06-22)
trainh20 <- as.h2o(train)
testh20 <- as.h2o(test)
predictors <- colnames(train)[1:13]  # first 13 columns as candidate predictors
predictors <- predictors[-3]         # drop units_sold (column 3), the response
response <- "units_sold"
As shown below, the best model is an XGBoost model (gradient boosting under the XGBoost framework). The GBM algorithm sequentially builds regression trees on all the features of the dataset in a fully distributed way; each individual tree is built in parallel.
automl <- h2o.automl(x=predictors, y=response, training_frame=trainh20, max_models=15)
best_model <- as.vector(automl@leaderboard$model_id[1])
summary_model <- h2o.getModel(best_model)
cat(paste0("The R squared of the XGBOOST model is ",h2o.r2(summary_model) ))
## The R squared of the XGBOOST model is 0.984452039784771
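The full leaderboard can also be inspected directly; a minimal sketch (showing five rows is an arbitrary choice):
# AutoML ranks regression models by mean residual deviance by default.
print(automl@leaderboard, n = 5)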
An important feature of GBM models is the Variable Importance Table, a measure of the relative importance of each variable in training the model.
h2o.varimp_plot(summary_model)
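Besides the plot, the numeric table behind it can be retrieved with h2o.varimp; a minimal sketch:
# Returns the relative, scaled and percentage importance of each variable.
varimp_table <- h2o.varimp(summary_model)
head(varimp_table)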
The most important variables are the rating-star counts. One of the best-known weaknesses of GBM models is overfitting; however, h2o, and in particular the AutoML function, helps by tuning the machine learning model automatically.
In the next section, I will carry out an out-of-sample analysis.