RPubs note: This version may not be the latest, see if Github copy is more recent. The code is there too.
To demonstrate how previous sales data can be used to predict future sales, we’ll apply neural network library in R language to a dataset from UCI machine learning depository - Dresses_Attribute_Sales Data Set
The idea is to potentially apply the same method to improve other areas of sales. Say, we can come up with a set of attributes that describe qualities of a product page in an eCommerce store. Then, if these qualities do affect sales, we should be able to predict possible results of the product page changes.
I don’t really expect this to work, but this is an interesting exercise anyway.
I have to digress to explain why I’m doing this. I’m studying for Data Science specialization on Coursera, because I realized that for my day job it would be really useful to have better understanding of these technologies. After couple of courses, however entertaining they were, I realized, that it’s not quite what I expected. So I went and looked - where else this knowledge can be used?
I found almost right away that machine learning is the closest application there is. And this area is booming - see this video, for example. It looks like right now we are at a point where Moore law (lost by Intel, but maintained by Nvidia) allows to bridge performance gap from some decades-old machine learning theory to very practical applications. Every business will become IT business at its core, they say. And every IT business will become machine learning business at its core. Android’s amazing speech recognition is machine learning, for example.
So it is all very exciting. Question is - can a layperson like me use existing machine learning libraries without having good understanding of how math works inside? Is it available for some higher level usage similar to how WWW in early nineties allowed to use complex network and PC technologies by working at much higher HTML level and create some amazing results? This is an experiment to do that. Sure, R code can look scary. But you can consider most of the things I do in this file to be just some command-line replacement of Excel. Neural network magic is a black box, I use its very basic calls and at very primitive level, without good understanding.
I mean, I have some understanding, but it’s quite superficial. Like this. At a simplest level, artificial neurons are sets of weights for each input. If input is a picture (list of pixel values), then this set of weights would be same resolution as original input, and can even be rendered - it will show then a pattern to which this neuron is sensitive to. It’s hard for me to imagine what would happen with multi-layered network. They say that for animal image recognition networks the first layer collects graphic primitives like edges, next layer - higher level graphic objects, then at some point eyes and noses, then cats and dogs, then breeds and so on. But I don’t understand yet if it’s something that happens automatically. I only can imagine how the first level is rendered as a pattern a neuron recognizes, which happens in this amazing example where 74 lines of python code, while not containing any calls to neural net libraries, do character recognition better than humans do.
Now, getting back to this dress sales dataset. While working with it, I got a realization - there’s a simple intuitive explanation for neural network logic when it decides if a dress will sell or not. Each input (see section on neuralnet() usage) is a style attribute of a dress. Does it have long sleeves? Is it black? Is it sexy? And 170 or so similar questions are answered in the inputs for each dress. Then each neuron finds a weight for each input. This means that each neuron is actually a customer preference pattern. One neuron will have higher preference for red sexy dresses with long sleeves, smaller preference for brief silk dresses, and tolerance for cashmere. Another - different set of weights meaning imitation of a customer with a different taste - different set of preferences. So neurons are like people with different tastes.
When neural net is taught about dresses and if they will sell or not, it will iterate group of most typical and most sales-influential customer tastes.
I can imagine that if there will be two layers, then deeper layer will aggregate these tastes somehow, and so at the second levels neurons will represent some taste groups - for example, different ages, different cultural groups, different regions. This is a theory, I have no idea how close it is to the reality, but perhaps this is what I will be able to do, as I continue with this research. It seems intriguing to deconstruct a taught neural network to see what patterns it has extracted and what are their meanings.
It is highly probable, of course, that a) I misunderstand and imagining things b) this all had been asked and answered long time ago c) there are simple answers I ignore just because of my ignorance. Well, the path is everything, the destination is nothing. Let’s do it.
These are the attributes the dataset in question contains:
This document is also supposed to demonstrate how sales data can be processed in R, so various data processing steps are explained where possible.
# Let's load libraries first
# Use call like install.packages("dplyr"), if you miss any of the libraries.
library(digest) # get checksum for data loaded from file
library(dplyr) # for column selection and handy function piping syntax
library(tidyr) # for making numeric columns from style variables
library(nnet) # this neural net library allows for one layer of hidden neurons
library(neuralnet) # this one allows several layers, but work only with numbers
library(data.table) # for renaming columns list to list
# Update Java to 64bit version if xlsx library returns Java error
# http://javadl.sun.com/webapps/download/AutoDL?BundleId=109708
library(xlsx) # for reading Excel files
First, we change working directory and download the dataset.
#setwd(file.path(normalizePath("~"),"sales-neuralnet")) # Knitr FAQ says
# it's a bad practice, but use this line when running code in console
src<-"http://archive.ics.uci.edu/ml/machine-learning-databases/00289/Dresses_Attribute_Sales.rar"
file <- basename(src)
if(!file.exists(file)) download.file(src,file)
Now you have to unrar Dresses_Attribute_Sales.rar to the current folder! - it’s a manual step, not reproducible by the script
As it seems there’s no easy way to unrar in R, so do this step outside of R. Actually the R-downloaded file is broken somehow, so download it manually too.
We see that there’s a file where for different properties of a dress there’s a recommendation result - does it sell or does it not. As authors explain, “This dataset contain Attributes of dresses and their recommendations according to their sales.” There’s also another spreadsheet where sales data is provided. For now we’ll use Recommendation parameter directly, sales numbers can be used later.
# create platform-independent file path
xls<-file.path("Dresses_Attribute_Sales","Attribute DataSet.xlsx")
# read data from excel file to a data frame
if(!exists("loaded") || digest(loaded) != "c049d992c0c433cb3b0d1cbc7e348f6c")
loaded<-read.xlsx(xls,1) # second argument is tab number
# preserve to skip file loading on console re-runs
data<-loaded
Let’s review our dataset - first its first five lines, then data structure of the dataset in R, then table dimensions.
# Top five rows
head(data)
## Dress_ID Style Price Rating Size Season NeckLine SleeveLength
## 1 1006032852 Sexy Low 4.6 M Summer o-neck sleevless
## 2 1212192089 Casual Low 0.0 L Summer o-neck Petal
## 3 1190380701 vintage High 0.0 L Automn o-neck full
## 4 966005983 Brief Average 4.6 L Spring o-neck full
## 5 876339541 cute Low 4.5 M Summer o-neck butterfly
## 6 1068332458 bohemian Low 0.0 M Summer v-neck sleevless
## waiseline Material FabricType Decoration Pattern.Type
## 1 empire null chiffon ruffles animal
## 2 natural microfiber null ruffles animal
## 3 natural polyster null null print
## 4 natural silk chiffon embroidary print
## 5 natural chiffonfabric chiffon bow dot
## 6 empire null null null print
## Recommendation
## 1 1
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
# Memory object structure
str(data)
## 'data.frame': 500 obs. of 14 variables:
## $ Dress_ID : num 1.01e+09 1.21e+09 1.19e+09 9.66e+08 8.76e+08 ...
## $ Style : Factor w/ 13 levels "bohemian","Brief",..: 11 3 12 2 4 1 3 7 6 1 ...
## $ Price : Factor w/ 7 levels "Average","high",..: 5 5 3 1 5 5 1 1 1 5 ...
## $ Rating : num 4.6 0 0 4.6 4.5 0 0 0 0 0 ...
## $ Size : Factor w/ 7 levels "free","L","M",..: 3 2 2 2 3 3 7 1 1 1 ...
## $ Season : Factor w/ 8 levels "Automn","Autumn",..: 6 6 1 4 6 6 6 1 4 6 ...
## $ NeckLine : Factor w/ 17 levels "backless","boat-neck",..: 7 7 7 7 7 17 7 7 17 17 ...
## $ SleeveLength : Factor w/ 18 levels "butterfly","cap-sleeves",..: 12 8 4 4 1 12 4 9 9 12 ...
## $ waiseline : Factor w/ 5 levels "dropped","empire",..: 2 3 3 3 3 2 4 3 2 3 ...
## $ Material : Factor w/ 24 levels "acrylic","cashmere",..: 14 9 17 20 3 14 4 17 4 15 ...
## $ FabricType : Factor w/ 23 levels "batik","broadcloth",..: 3 12 12 3 3 12 12 2 2 3 ...
## $ Decoration : Factor w/ 25 levels "applique","beading",..: 21 21 14 8 3 14 14 12 2 14 ...
## $ Pattern.Type : Factor w/ 15 levels "animal","character",..: 1 1 12 12 3 12 13 9 13 9 ...
## $ Recommendation: num 1 0 0 1 0 0 0 0 1 1 ...
# Frame dimensions
dim(data)
## [1] 500 14
So we have 500 lines in the table. In the data language it means we have 500 observations - with a given set of dress’ attributes, does it sell or not.
We have two columns that are different from other styles - Dress_ID and Rating. Supposedly the first is a unique id of a dress for which we observe the recommendation based on sales level. Let’s check this assumption. Are Dress_IDs unique for each record?
It shouldn’t be said while presenting results, but as this document is about procedure, it should be noted that this particular section was added later during the analysis when attempt of tidying data showed that it’s not actually unique, and so we need to understand what it means.
length(unique(data$Dress_ID))
## [1] 475
length(data$Dress_ID)
## [1] 500
It looks some dresses have several rows. Let’s look at them.
repeats<-table(data$Dress_ID)
dupes<-data[data$Dress_ID %in% names(repeats[repeats>1]),] %>%
arrange(Dress_ID)
head(dupes)
## Dress_ID Style Price Rating Size Season NeckLine SleeveLength
## 1 549159213 Casual Average 4.5 free Summer o-neck short
## 2 549159213 Casual Average 4.4 free Summer o-neck capsleeves
## 3 560474456 Sexy low 4.3 free Winter v-neck sleevless
## 4 560474456 Sexy low 4.4 free Automn v-neck sleevless
## 5 699738864 Casual low 4.7 S summer o-neck full
## 6 699738864 Casual Low 4.7 S Summer o-neck full
## waiseline Material FabricType Decoration Pattern.Type Recommendation
## 1 natural cotton null null solid 1
## 2 natural cotton null null solid 0
## 3 natural silk worsted lace solid 1
## 4 natural cotton worsted lace solid 0
## 5 natural cotton null null patchwork 0
## 6 natural cotton null null patchwork 0
Apparently, each dress (with the same Dress_ID) can have variations - same dress with different sleeves, from different materials. It doesn’t make sense to make Dress_ID a style parameter (see below), so we’ll just ignore it, and will treat same dress with variations as a different dresses.
# Let's add explicit unique row numbers to represent observations
data$rnumber<-1:nrow(data)
Presumably the rating is average grade given to a dress by store’s customers. We should be wary about this parameter, as can be correlated to the sales too directly. What is worse, for new inventory it won’t be available. So let’s keep an eye on it.
Nnet() is a standard R library, it’s even included in the basic distribution package and doesn’t need to be installed separately. It also great because it allows to analyze data with so called “factor” variables - categorical ones. For example, what material a dress is made from - cotton, silk, etc. Its drawback is that it only allows for a single hidden layer of neurons.
# Make sure data types are good for nnet() - factors and numbers
df<-as.data.frame(
lapply(
dplyr::select(data,-Dress_ID,-rnumber),
as.factor))
df$Rating<-as.numeric(df$Rating)
# select() above needed direct reference to dplyr library because R libraries
# like to overwrite this function very much
It’s time to train the neural network. This code creates an object which basically includes a set of coefficients to be applied to input parameters to predict (calculate) the outcome. This set of coefficients is the substance of the trained neural network.
It takes some time to train a neural network, but once it’s trained, predictions can be calculated very fast and “cheap”. In this case the dataset is small, and network is simple, so it trains also fast.
neurons<-5 # we'll play with this number below
df1<-df # make clean copy which can be mutilated if needed - will be used later
seed<-3 # initialize randomizer for reproducibility
# We'll set seed every time randomizer is in play to reset it after previous use
set.seed(seed)
nn<-nnet(Recommendation ~ ., df1, size=neurons)
## # weights: 771
## initial value 338.247117
## iter 10 value 310.502710
## iter 20 value 195.739258
## iter 30 value 138.190418
## iter 40 value 121.918684
## iter 50 value 118.273514
## iter 60 value 114.846922
## iter 70 value 113.185644
## iter 80 value 110.318486
## iter 90 value 109.172075
## iter 100 value 107.684641
## final value 107.684641
## stopped after 100 iterations
guess<-predict(nn, df1, type = "class")
table(df1$Recommendation,guess)
## guess
## 0 1
## 0 251 38
## 1 8 199
Let’s make nnet() a placeholder function with this result to skip these runs when developing next parts. This code itself should be skipped when producing the final document.
# To skip empty runs while developing, overwrite it
nn.placeholder<-nn
if(exists("nnet")) rm(nnet)
nnet<-function(a,b,size=1) {
print("Console run: skipped following nnet calls")
nn.placeholder
}
Well, most zeroes are classified as zeroes and vice versa. Let’s calculate a percentage of the correct guesses.
# We'll use this later, let's make a reusable function out of it
# Although I'm sure there's a simpler way to do this, some basic ready-made
# R function, which I'm just don't know. Well, I'll learn that later then
qualify<-function(real,guess){
check<-table(real,guess)
good.ones<-check[1,1]+check[2,2]
bad.ones<-check[1,2]+check[2,1]
paste0(as.character(round(100*good.ones/(good.ones+bad.ones))),'%')
}
quality<-qualify(df1$Recommendation,guess)
print(quality)
## [1] "91%"
OK, so 91% doesn’t seem too bad for just 5 neurons. But what if we separate training and test sets?
First, let’s select lines we’ll use for training, and rest will go for testing. Let’s use 80% of data for training purposes.
nr<-dim(df)[1] # number of observations
share<-0.8 # this is our 80% parameter
set.seed(seed)
trainset<-sample.int(nr,round(share*nr))
The trainset variable now has numbers of observations we’ll use for training. Let’s split the dataset in two.
neurons<-5
df1<-df
trainers<-df1[trainset,]
testers<-df1[-trainset,]
set.seed(seed)
nn<-nnet(Recommendation ~ ., trainers, size=neurons)
## # weights: 771
## initial value 273.048410
## iter 10 value 241.482580
## iter 20 value 144.842478
## iter 30 value 95.991723
## iter 40 value 82.573174
## iter 50 value 77.544592
## iter 60 value 73.808236
## iter 70 value 71.405765
## iter 80 value 69.804317
## iter 90 value 68.856340
## iter 100 value 68.259184
## final value 68.259184
## stopped after 100 iterations
Now let’s see what happens to our prediction quality.
guess<-predict(nn, testers, type = "class")
quality<-qualify(testers$Recommendation,guess)
# Record for later comparison
nnets<-c(Neurons=neurons,Rating="Yes",Quality=quality)
nnets
## Neurons Rating Quality
## "5" "Yes" "60%"
nnets.all<-nnets
Well, just 60%. We could have just flipped a coin.
Moreover, it included Rating, which is kind of cheating. Theoretically speaking, there’s a confounding variable - people both rate and buy what they like. Let’s drop the rating.
neurons<-5
df1<-dplyr::select(df,-Rating)
trainers<-df1[trainset,]
testers<-df1[-trainset,]
set.seed(seed)
nn<-nnet(Recommendation ~ ., trainers, size=neurons)
## # weights: 766
## initial value 298.439688
## iter 10 value 188.133627
## iter 20 value 117.675639
## iter 30 value 89.266456
## iter 40 value 80.714295
## iter 50 value 79.602084
## iter 60 value 79.080988
## iter 70 value 78.650563
## iter 80 value 76.371040
## iter 90 value 72.361793
## iter 100 value 72.022875
## final value 72.022875
## stopped after 100 iterations
guess<-predict(nn, testers, type = "class")
quality<-qualify(testers$Recommendation,guess)
# Record for later comparison
nnets<-c(Neurons=neurons,Rating="No",Quality=quality)
nnets
## Neurons Rating Quality
## "5" "No" "61%"
nnets.all<-rbind(nnets.all,nnets)
Surprisingly, it didn’t drop at all. It could - if you play with the randomizer seed, you’ll see. But not by much. Probably the Rating wasn’t affecting sales as strong as we expected.
Let’s increase number of neurons. First, with Rating.
neurons<-30
df1<-df
trainers<-df1[trainset,]
testers<-df1[-trainset,]
set.seed(seed)
# Add a new paramter to nnet() - MaxNWts, otherwise it exceeds default
# value for maximum number of weights and errors w/ "too many (4621) weights"
nn<-nnet(Recommendation ~ ., trainers, size=neurons, MaxNWts=10000)
## # weights: 4621
## initial value 293.796148
## iter 10 value 233.308647
## iter 20 value 174.044699
## iter 30 value 108.898880
## iter 40 value 57.163090
## iter 50 value 13.316643
## iter 60 value 4.838987
## iter 70 value 1.660317
## iter 80 value 0.127794
## iter 90 value 0.007926
## iter 100 value 0.000615
## final value 0.000615
## stopped after 100 iterations
guess<-predict(nn, testers, type = "class")
quality<-qualify(testers$Recommendation,guess)
# Record for later comparison
nnets<-c(Neurons=neurons,Rating="Yes",Quality=quality)
nnets
## Neurons Rating Quality
## "30" "Yes" "57%"
nnets.all<-rbind(nnets.all,nnets)
Now - without the rating.
neurons<-30
df1<-dplyr::select(df,-Rating)
trainers<-df1[trainset,]
testers<-df1[-trainset,]
set.seed(seed)
nn<-nnet(Recommendation ~ ., trainers, size=neurons,MaxNWts=10000)
## # weights: 4591
## initial value 482.279127
## iter 10 value 157.513021
## iter 20 value 12.781400
## iter 30 value 1.647551
## iter 40 value 1.391606
## iter 50 value 1.386979
## iter 60 value 1.386534
## iter 70 value 1.386402
## iter 80 value 1.386370
## iter 90 value 1.386324
## final value 1.386322
## converged
guess<-predict(nn, testers, type = "class")
quality<-qualify(testers$Recommendation,guess)
# Record for later comparison
nnets<-c(Neurons=neurons,Rating="Yes",Quality=quality)
nnets
## Neurons Rating Quality
## "30" "Yes" "47%"
nnets.all<-rbind(nnets.all,nnets)
So rating doesn’t matter here.
Is there a rule of thumb for the number of neurons? Here’s a detailed answer to this question by Nate Kohl at StackOverflow. In short, while every problem should be handled specifically, and trial-and-error always useful, but the rule of thumb is 2/3 of number of inputs plus number of outputs. Question is - if we are dealing with the factor variables, do we count variables or their levels? That’s a question for another day, but for now we’ve seen that crude pumping neurons number doesn’t help that much.
How many levels do we have in total, anyway?
# Count number of unique values in each column, then sum it up
# Let's also use R's trick - while we assign a value to a variable, parenthesis
# do also print the value
(levels.number<-sum(
sapply(
dplyr::select(df,-Rating,-Recommendation),
function(x)
length(unique(x)))))
## [1] 170
# Then 2/3 will be, adding input Rating and output Recommendation
(neurons<-round((levels.number+2)*2/3))
## [1] 115
Just to be safe, let’s do 115 neurons.
df1<-df
trainers<-df1[trainset,]
testers<-df1[-trainset,]
set.seed(seed)
nn<-nnet(Recommendation ~ ., trainers, size=neurons,MaxNWts=100000)
## # weights: 17711
## initial value 814.369720
## iter 10 value 233.274460
## iter 20 value 192.190614
## iter 30 value 137.987112
## iter 40 value 71.908269
## iter 50 value 15.057290
## iter 60 value 1.789717
## iter 70 value 0.077771
## iter 80 value 0.003328
## iter 90 value 0.000466
## iter 100 value 0.000186
## final value 0.000186
## stopped after 100 iterations
guess<-predict(nn, testers, type = "class")
quality<-qualify(testers$Recommendation,guess)
# Record for later comparison
nnets<-c(Neurons=neurons,Rating="Yes",Quality=quality)
nnets
## Neurons Rating Quality
## "115" "Yes" "51%"
nnets.all<-rbind(nnets.all,nnets)
Let’s review our attempts.
print(data.frame(nnets.all), row.names = FALSE)
## Neurons Rating Quality
## 5 Yes 60%
## 5 No 61%
## 30 Yes 57%
## 30 Yes 47%
## 115 Yes 51%
So, it doesn’t work. Some different approach is needed, obviously. Perhaps there should be more layers, not just more neurons? Let’s try to use neuralnet() library now, hoping that it will allow for more complex networks.
R’s neuralnet() requires numeric input, so let’s convert it.
# Convert every column to character to prevent loss of the data when gathering
# square brackets keep data.frame type
data[]<-lapply(data,as.character)
# Repack factors to numeric columns
# I know there are faster ways to do it, but this one is more transparent to me
bins<-data %>% # piped functions follow
# make it narrow, don't touch numeric variables and IDs
gather(catnames,catvalues,-Dress_ID,-Rating,-Recommendation,-rnumber) %>%
# make single column out of them
unite(newfactor,catnames,catvalues,sep=".") %>%
# add a new column - it's "1" for every record
mutate( is = 1) %>%
# create a column from each factor, and where there's no record, add "0"
spread(newfactor, is, fill = 0)
# Now let's make it back numeric, except for ID
bins[]<-lapply(bins,as.numeric)
bins$Dress_ID<-as.factor(bins$Dress_ID)
Now we have 174 columns instead of 15, and they are all numeric, except for ID.
Try full dataset first, just to get a feeling.
# replace minuses in column names to avoid breaking formula in neuralnet call
cnames<-colnames(bins)
bad.cnames<-cnames[grepl("-",cnames,fixed=TRUE)]
## Presumably there's no columns with the same names except for minus sign
#fixed.cnames<-sapply(bad.cnames,function(x){gsub("-","",x,fixed=TRUE)})
#cnames[cnames %in% fixed.cnames]
## I was wrong! It duplicates this way. Let's add something instead of minus.
##And keep this exercise in here just to remember what fun it was
fixed.cnames<-sapply(bad.cnames,function(x){gsub("-",".",x,fixed=TRUE)})
cnames[cnames %in% fixed.cnames]
## character(0)
# setnames() call from data.table works by reference, so no copying
setnames(bins,old=bad.cnames,new=fixed.cnames)
# next line was used to generate list of columns below
# cat(paste0(names(bins)[5:174],sep="+"))
# also make function to avoid copy-pasting it again
bins.nn<-function(df,rep=1,hidden=c(1),threshold=0.1) {
set.seed(seed)
nn.obj<-neuralnet(Recommendation ~ Decoration.applique+ Decoration.beading+ Decoration.bow+ Decoration.button+ Decoration.cascading+ Decoration.crystal+ Decoration.draped+ Decoration.embroidary+ Decoration.feathers+ Decoration.flowers+ Decoration.hollowout+ Decoration.lace+ Decoration.NA+ Decoration.none+ Decoration.null+ Decoration.pearls+ Decoration.plain+ Decoration.pleat+ Decoration.pockets+ Decoration.rivet+ Decoration.ruched+ Decoration.ruffles+ Decoration.sashes+ Decoration.sequined+ Decoration.tassel+ Decoration.Tiered+ FabricType.batik+ FabricType.broadcloth+ FabricType.chiffon+ FabricType.Corduroy+ FabricType.dobby+ FabricType.flannael+ FabricType.flannel+ FabricType.jersey+ FabricType.knitted+ FabricType.knitting+ FabricType.lace+ FabricType.NA+ FabricType.null+ FabricType.organza+ FabricType.other+ FabricType.poplin+ FabricType.satin+ FabricType.sattin+ FabricType.shiffon+ FabricType.terry+ FabricType.tulle+ FabricType.wollen+ FabricType.woolen+ FabricType.worsted+ Material.acrylic+ Material.cashmere+ Material.chiffonfabric+ Material.cotton+ Material.knitting+ Material.lace+ Material.linen+ Material.lycra+ Material.microfiber+ Material.milksilk+ Material.mix+ Material.modal+ Material.model+ Material.NA+ Material.null+ Material.nylon+ Material.other+ Material.polyster+ Material.rayon+ Material.shiffon+ Material.silk+ Material.sill+ Material.spandex+ Material.viscos+ Material.wool+ NeckLine.backless+ NeckLine.boat.neck+ NeckLine.bowneck+ NeckLine.halter+ NeckLine.mandarin.collor+ NeckLine.NA+ NeckLine.NULL+ NeckLine.o.neck+ NeckLine.open+ NeckLine.peterpan.collor+ NeckLine.ruffled+ NeckLine.Scoop+ NeckLine.slash.neck+ NeckLine.sqare.collor+ NeckLine.sweetheart+ NeckLine.Sweetheart+ NeckLine.turndowncollor+ NeckLine.v.neck+ Pattern.Type.animal+ Pattern.Type.character+ Pattern.Type.dot+ Pattern.Type.floral+ Pattern.Type.geometric+ Pattern.Type.leapord+ Pattern.Type.leopard+ Pattern.Type.NA+ Pattern.Type.none+ Pattern.Type.null+ Pattern.Type.patchwork+ Pattern.Type.plaid+ Pattern.Type.print+ Pattern.Type.solid+ Pattern.Type.splice+ Pattern.Type.striped+ Price.Average+ Price.high+ Price.High+ Price.low+ Price.Low+ Price.Medium+ Price.NA+ Price.very.high+ Season.Automn+ Season.Autumn+ Season.NA+ Season.spring+ Season.Spring+ Season.summer+ Season.Summer+ Season.winter+ Season.Winter+ Size.free+ Size.L+ Size.M+ Size.s+ Size.S+ Size.small+ Size.XL+ SleeveLength.butterfly+ SleeveLength.cap.sleeves+ SleeveLength.capsleeves+ SleeveLength.full+ SleeveLength.half+ SleeveLength.halfsleeve+ SleeveLength.NULL+ SleeveLength.Petal+ SleeveLength.short+ SleeveLength.sleeevless+ SleeveLength.sleeveless+ SleeveLength.sleevless+ SleeveLength.sleveless+ SleeveLength.threequarter+ SleeveLength.threequater+ SleeveLength.thressqatar+ SleeveLength.turndowncollor+ SleeveLength.urndowncollor+ Style.bohemian+ Style.Brief+ Style.Casual+ Style.cute+ Style.fashion+ Style.Flare+ Style.Novelty+ Style.OL+ Style.party+ Style.sexy+ Style.Sexy+ Style.vintage+ Style.work+ waiseline.dropped+ waiseline.empire+ waiseline.NA+ waiseline.natural+ waiseline.null+ waiseline.princess,
data=df,
hidden=hidden,
lifesign="full",
lifesign.step=2000,
threshold=threshold,
rep=rep)
return(nn.obj)}
n1<-bins.nn(bins,rep=1,hidden=c(5),threshold=0.02)
## hidden: 5 thresh: 0.02 rep: 1/1 steps: 2000 min thresh: 0.07388653044
## 4000 min thresh: 0.03293984918
## 6000 min thresh: 0.03293984918
## 8000 min thresh: 0.02927448485
## 10000 min thresh: 0.02625585969
## 12000 min thresh: 0.02120308571
## 12110 error: 0.55624 time: 20.3 secs
res1<-neuralnet::compute(n1,bins[,5:174])
## Some fun how I couldn't get it working
## First I got this
### Error in eval(expr, envir, enclos) : object 'NeckLine.boatneck' not found
## Looked like minus sign in column names were treated as part of the formula
## So I made code above to delete minuses from column names
## Then I was getting following:
### Error in neurons[[i]] %*% weights[[i]] : non-conformable arguments
## Checking dimensions
#length(n1$model.list$variables)
#length(bins[1,5:174])
## Do not match, why?
#bins[1,5:174][! names(bins[1,5:174]) %in% n1$model.list$variables]
## SleeveLength.capsleeves.1
## It duplicates!
## I went up to see if fixed.cnames were present in cnames, and they were
## So I replaced minusis with dot
qualify(round(res1$net.result),bins$Recommendation)
## [1] "100%"
So it works fine for the full set. Let’s do testers/trainers again.
# do we have same number of lines?
dim(bins)
## [1] 500 174
dim(df)
## [1] 500 13
# We do
trainers<-bins[trainset,]
testers<-bins[-trainset,]
## I had another interesting hurdle here. To skip wait times on teaching the
## network, I had cache=TRUE instruction on the previous block. Guess what -
## it skipped that code block and didn't replace minuses in column names.
## And so I got errors. Thus I should be careful with using cache when
## code does some essential transformations. So I've split that chunk to two
## and cached only the neuralnet training part.
n5<-bins.nn(trainers,rep=1,hidden=c(5),threshold=0.02)
## hidden: 5 thresh: 0.02 rep: 1/1 steps: 2000 min thresh: 0.07135827152
## 3926 error: 0.52462 time: 5.63 secs
res5<-neuralnet::compute(n5,testers[,5:174])
qualify(round(res5$net.result),testers$Recommendation)
## [1] "38%"
38% with 5 artificial neurons. Worse than random. Let’s try just one.
n2<-bins.nn(trainers,rep=1,hidden=c(1),threshold=0.02)
## hidden: 1 thresh: 0.02 rep: 1/1 steps: 775 error: 18.72795 time: 0.45 secs
res2<-neuralnet::compute(n2,testers[,5:174])
qualify(round(res2$net.result),testers$Recommendation)
## [1] "54%"
Only better - 54%. OK, let’s do three layers with 115 neurons in total and do 10 fitting runs.
n2<-bins.nn(trainers,rep=10,hidden=c(80,25,10),threshold=0.02)
## hidden: 80, 25, 10 thresh: 0.02 rep: 1/10 steps: 147 error: 0.50196 time: 3.23 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 2/10 steps: 122 error: 0.50089 time: 2.69 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 3/10 steps: 119 error: 0.50131 time: 2.59 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 4/10 steps: 114 error: 0.50516 time: 2.52 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 5/10 steps: 112 error: 0.50073 time: 2.44 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 6/10 steps: 129 error: 0.50561 time: 2.84 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 7/10 steps: 119 error: 0.50179 time: 2.63 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 8/10 steps: 109 error: 0.50043 time: 2.37 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 9/10 steps: 140 error: 0.50117 time: 3.1 secs
## hidden: 80, 25, 10 thresh: 0.02 rep: 10/10 steps: 121 error: 0.50216 time: 2.68 secs
res2<-neuralnet::compute(n2,testers[,5:174])
qualify(round(res2$net.result),testers$Recommendation)
## [1] "51%"
51% :(
# Some code for future lookup
model.matrix(~ . + 0,
data=dfactors,
contrasts.arg = lapply(data, contrasts, contrasts=FALSE))
## To do - technical
# - Make nnet call a function - takes seed, neurons, rating usage, returns list
# - Automate install.packages (and discuss reproducibility at forums)
# - Also discuss republishing to the same RPubs document from a different copy
# - Check if nnet() has successfully trained before checking it results
# - Find out why make.row.names isn't recognized by rbind properly
# - Envelope nnet() into a function and auto-calculate MaxNWts there too