Product Classification

Winston Saunders
April 22, 2015
Exploratory Summary

R version 3.2.0 (2015-04-16)

Grabbing the data

The train data set has 61,878 rows and 95 columns (an id, features feat_1 through feat_93, and a target). Here is a sample of a few rows and columns. The target column has 9 classes:

id  feat_1  feat_2  feat_3  feat_4  feat_5  ...  feat_92  feat_93  target
 1       1       0       0       0       0  ...        0        0  Class_1
 2       0       0       0       0       0  ...        0        0  Class_1
 3       0       0       0       0       0  ...        0        0  Class_1
 4       1       0       0       1       6  ...        0        0  Class_1


The number of elements in each class is shown below.

Class_1  Class_2  Class_3  Class_4  Class_5  Class_6  Class_7  Class_8  Class_9
   1929    16122     8004     2691     2739    14135     2839     8464     4955
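
For reference, a minimal sketch of how the data and the class counts above might be obtained (the file name train.csv is an assumption based on the Kaggle download):

## read the Kaggle training file (file name is an assumption)
train_data <- read.csv("train.csv")
table(train_data$target)  # number of elements in each class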

Create data sample and munge

To get some data for inspection, first create a random sample of 4000 rows; this speeds up calculations.

sample_rows <- sample(nrow(train_data), 4000)
train_data <- train_data[sample_rows, ]  # keep only the sampled rows

The sampled train_data has 4000 rows and 95 columns.

Munging (cont.)

Then, for plotting and similar tasks, convert it to a “long” format with each observation in one unique row.

## Get packages
require(plyr); require(ggplot2); require(tidyr)
## munge the data into long format, one feature observation per row
long_train <- gather(train_data, feature, data, feat_1:feat_93)

Here is a sample (the long table has 4000 × 93 = 372,000 rows):

   id  target   feature  data
 9869  Class_2  feat_1      0
29589  Class_5  feat_1      0
47323  Class_7  feat_1      0
47625  Class_7  feat_1      1
16617  Class_2  feat_1      0
41644  Class_6  feat_1      0

First summary

Let's look at the means and standard deviations of the features by class.

## use ddply to get means and standard deviations by class and feature
train_morph <- ddply(long_train, c("target", "feature"), summarize,
                     mean_data = mean(data),
                     sdev_data = sd(data))


## calculate the inverse coefficient of variation, mean/sd
## (labeled z_stat for later use); a small value is added to the
## denominator to avoid division by zero for constant features
train_morph$z_stat <- train_morph$mean_data / (train_morph$sdev_data + 0.00001)

Means and sd for all classes

[plots: feature means and standard deviations for all classes]

There is some variation with class, but a low mean/sd ratio (i.e. low 1/CV) means noisy correlations!
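
As a rough sketch, a faceted plot like this could be produced with ggplot2 along the following lines (the aesthetics are assumptions, not the original plotting code):

## sketch: feature means per class, one panel per class
ggplot(train_morph, aes(x = feature, y = mean_data)) +
    geom_point() +
    facet_wrap(~ target)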

1/CV features for each class

[plot: 1/CV of the features for each class]

This is beginning to look promising: we can see at least some variation between classes and features, though many features are weak.
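
A similar sketch for the 1/CV panels (again, assumed aesthetics):

## sketch: 1/CV (z_stat) of each feature, faceted by class
ggplot(train_morph, aes(x = feature, y = z_stat)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ target)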

Another look

Here the mean and 1/CV within each product class are plotted, color-coded by target class.

[plot: mean vs. 1/CV, color-coded by target class]
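
A sketch of such a scatter (the color mapping is stated above; the rest is an assumption):

## sketch: mean vs. 1/CV for each class/feature pair, colored by class
ggplot(train_morph, aes(x = mean_data, y = z_stat, color = target)) +
    geom_point()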

Another look

If we (arbitrarily) require that 1/CV be above 0.5:

[plot: mean vs. 1/CV for features with 1/CV > 0.5]

This starts to look selective.
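
The same sketch with the cutoff applied:

## sketch: restrict to feature/class pairs with 1/CV above 0.5
ggplot(subset(train_morph, z_stat > 0.5),
       aes(x = mean_data, y = z_stat, color = target)) +
    geom_point()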

Decision Tree: Class_1 vs Class_2

Decision trees also appear to offer a good way to distinguish between classes.
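
A minimal rpart sketch of such a pairwise fit; the helper name pair_tree and the formula are assumptions rather than the original code, and cp = 0.02 matches the pruning used in the tree plots:

require(rpart)

## fit and prune a classification tree for one pair of classes (sketch)
pair_tree <- function(data, class_a, class_b, cp = 0.02) {
    pair <- subset(data, target %in% c(class_a, class_b))
    pair$target <- droplevels(pair$target)  # drop unused class levels
    fit <- rpart(target ~ . - id, data = pair, method = "class")
    prune(fit, cp = cp)
}

tree_12 <- pair_tree(train_data, "Class_1", "Class_2")
plot(tree_12); text(tree_12)  # draw the pruned tree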


Tree Plot (pruned to cp=0.02)

This shows that just a few features (e.g. feat_17, feat_78, and feat_84) can do a pretty good job of discriminating between Class_1 and Class_2.

[tree plot: Class_1 vs Class_2, pruned to cp = 0.02]

Decision Tree: Class_1 vs Class_6

Here's a look at a different pairing (Class_1 and Class_6).
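
Using the same (assumed) pair_tree helper from the earlier slide:

tree_16 <- pair_tree(train_data, "Class_1", "Class_6")
plot(tree_16); text(tree_16)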


Tree Plot (pruned to cp=0.02)

This shows that many more features (e.g. feat_8, feat_78, feat_6, etc.) are needed to discriminate between Class_1 and Class_6, though feat_78 is common to both analyses.

[tree plot: Class_1 vs Class_6, pruned to cp = 0.02]

Next Steps

Examine more pairwise trees.
Understand whether the variable values themselves contribute information.
Run a generic random forest as a benchmark (a sketch follows below).
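
Such a benchmark might look like this (the package choice and ntree value are assumptions):

require(randomForest)
## sketch: generic random forest benchmark on the sampled data
rf_fit <- randomForest(target ~ . - id, data = train_data, ntree = 200)
print(rf_fit)  # out-of-bag error and confusion matrix as a baseline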

Cheers!