Product Classification

Winston Saunders
April 22, 2015
Exploratory Summary

R version 3.2.0 (2015-04-16)

Grabbing the data

The train data set has 61878 rows and 95 columns: an id, 93 features (feat_1 to feat_93), and a target. Here is a sample of a few rows and columns. The target column has 9 classes:

id	feat_1	feat_4	feat_5	target
1	1	0	0	Class_1
2	0	0	0	Class_1
3	0	0	0	Class_1
4	1	1	6	Class_1

The number of elements in each class is shown below.

Class_1	Class_2	Class_3	Class_4	Class_5	Class_6	Class_7	Class_8	Class_9
1929	16122	8004	2691	2739	14135	2839	8464	4955
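
As a quick check on class balance, the counts above can be turned into shares. The sketch below re-enters the counts by hand (class_counts and the other names are illustrative, not from the original analysis); on the real data the same table would presumably come from table(train_data$target).

```r
# Class counts copied from the table above (entered by hand for illustration)
class_counts <- c(Class_1 = 1929, Class_2 = 16122, Class_3 = 8004,
                  Class_4 = 2691, Class_5 = 2739, Class_6 = 14135,
                  Class_7 = 2839, Class_8 = 8464, Class_9 = 4955)

# The counts should add up to the number of rows in the train set
total_rows <- sum(class_counts)

# Share of each class, as a percentage rounded to one decimal place
class_share <- round(100 * class_counts / total_rows, 1)
print(class_share)

# Largest and smallest classes
largest_class  <- names(which.max(class_counts))
smallest_class <- names(which.min(class_counts))
```

The classes are far from balanced: Class_2 and Class_6 together account for roughly half of the rows, which is worth keeping in mind when drawing a small random sample.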

Create data sample and munge

To get some data for inspection, first draw a random sample of 4000 rows. Working on the sample speeds up calculations.

sample_rows <- sample(nrow(train_data), 4000)
train_data <- train_data[sample_rows, ]

The sampled train_data has 4000 rows and 95 columns.
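
A minimal sketch of the sampling step on a stand-in data frame, since the real train_data is not reproduced here. The toy_data frame, its sizes, and the set.seed() call are assumptions added for illustration; fixing the seed just makes the sample repeatable.

```r
# Stand-in for train_data: 1000 rows with an id, two count features, a target
set.seed(42)  # assumption: fixed seed so the sample is repeatable
toy_data <- data.frame(
  id     = 1:1000,
  feat_1 = rpois(1000, 0.4),
  feat_2 = rpois(1000, 1.2),
  target = sample(paste0("Class_", 1:3), 1000, replace = TRUE)
)

# Draw 100 distinct row indices and keep only those rows
sample_rows <- sample(nrow(toy_data), 100)
toy_sample  <- toy_data[sample_rows, ]

dim(toy_sample)            # rows and columns of the sample
table(toy_sample$target)   # class counts within the sample
```

Checking the class counts in the sample is a cheap way to confirm that small classes are still represented.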

Munging (cont.)

Then, for plotting and summarizing, convert the sample to a “long” format with one observation (one id and one feature) per row.

## Get packages
library(plyr); library(ggplot2); library(tidyr)
## munge data into long format
long_train <- gather(train_data, feature, data, feat_1:feat_93)

Here is a sample… (the table has 372000 rows)

id	target	feature	data
9869	Class_2	feat_1	0
29589	Class_5	feat_1	0
47323	Class_7	feat_1	0
47625	Class_7	feat_1	1
16617	Class_2	feat_1	0
41644	Class_6	feat_1	0
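
The row count follows from the reshape: each of the 4000 sampled rows contributes one long row per feature, so 4000 x 93 = 372000. Here is a small base-R sketch of the same wide-to-long idea on a made-up three-row table. The names wide, long, and feature_cols are illustrative, and stack() is swapped in here for tidyr's gather() only to keep the example free of package dependencies.

```r
# Made-up wide table: one row per product, one column per feature
wide <- data.frame(
  id     = 1:3,
  feat_1 = c(1, 0, 0),
  feat_2 = c(0, 2, 0),
  target = c("Class_1", "Class_1", "Class_2"),
  stringsAsFactors = FALSE
)

feature_cols <- grep("^feat_", names(wide), value = TRUE)

# stack() puts the feature columns end to end: values plus source column name
stacked <- stack(wide[feature_cols])

# Repeat id and target once per feature so they line up with the stacked values
long <- data.frame(
  id      = rep(wide$id, times = length(feature_cols)),
  target  = rep(wide$target, times = length(feature_cols)),
  feature = as.character(stacked$ind),
  data    = stacked$values,
  stringsAsFactors = FALSE
)

# Long row count = wide rows times number of features
nrow(long) == nrow(wide) * length(feature_cols)
```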

First summary

Let's look at the means and standard deviations of the features by class.

## use ddply to get means and standard deviations
train_morph <- ddply(long_train, c("target", "feature"), summarize, mean_data = mean(data), sdev_data = sd(data))


## calculate inverse coeff of variation 
## (which I will label as z_stat for later use)
## add a small value to avoid division by zero when sd is 0
train_morph$z_stat<-train_morph$mean_data/(train_morph$sdev_data+0.00001)
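
To make the statistic concrete: for a feature with values 1, 2, 3 the mean is 2 and the sample standard deviation is 1, so 1/CV is about 2. The sketch below, on made-up numbers, also shows why the small constant is there: a feature that is constant within a class has sd = 0, and the ratio would otherwise be a division by zero. The helper name inv_cv is hypothetical.

```r
# Hypothetical helper: mean / (sd + eps), the "z_stat" used above
inv_cv <- function(x, eps = 0.00001) mean(x) / (sd(x) + eps)

varying  <- c(1, 2, 3)   # mean 2, sd 1
constant <- c(2, 2, 2)   # mean 2, sd 0

inv_cv(varying)                  # close to 2
inv_cv(constant)                 # large but finite, thanks to eps
mean(constant) / sd(constant)    # Inf without the guard

# Count data that is mostly zero gives a small 1/CV (a noisy feature)
sparse <- c(0, 0, 0, 0, 1, 0, 0, 6)
inv_cv(sparse)
```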

Means and sd for all classes

[Figure: two plots of feature means and standard deviations for all classes]

There is some variation with class, but the low mean/sd ratio (1/CV) means the correlations are noisy!

1/CV features for each class

[Figure: 1/CV of the features for each class]

This is beginning to look promising. At least some features vary visibly between classes (though others vary only weakly).

Another look

Here the mean and 1/CV of each feature within a product class are plotted, color-coded by target class.

[Figure: mean and 1/CV by feature, color-coded by target class]

Another look (cont.)

If we require that 1/CV be above 0.5 (an arbitrary threshold):

[Figure: features with 1/CV above 0.5]

This starts to look selective.
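
The filter amounts to a one-line subset of the summary table. A sketch on a made-up stand-in for train_morph (the values are invented; only the column names follow the code above):

```r
# Invented stand-in for train_morph; real values would come from ddply above
toy_morph <- data.frame(
  target  = c("Class_1", "Class_1", "Class_2", "Class_2"),
  feature = c("feat_1", "feat_2", "feat_1", "feat_2"),
  z_stat  = c(0.21, 0.74, 0.55, 0.09),
  stringsAsFactors = FALSE
)

threshold <- 0.5   # arbitrary cut-off, as in the text

# Keep only class/feature pairs whose 1/CV exceeds the threshold
selective <- subset(toy_morph, z_stat > threshold)
selective

# How many features survive per class
table(selective$target)
```

Because the threshold is arbitrary, it is worth rerunning the subset with a few different values to see how quickly the surviving feature set shrinks.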

Decision Tree Class1 vs Class2

Decision trees also appear to offer a good way to distinguish between classes.

[Figure: decision tree plot, Class_1 vs Class_2]

Tree Plot (pruned to cp=0.02)

This shows that just a few features (e.g. feat_17, feat_78, and feat_84) can do a pretty good job of discriminating between Class_1 and Class_2.

[Figure: tree pruned to cp=0.02, Class_1 vs Class_2]
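
The code behind the trees is not shown in these slides. As an assumption, the sketch below uses the rpart package, whose prune() function takes a cp argument like the pruning level quoted above, on synthetic two-class count data. The names toy, feat_a, and feat_b are made up, and the original analysis may well have been set up differently.

```r
library(rpart)  # assumption: rpart is the tree package; the slides do not say

set.seed(1)
n <- 200

# Synthetic data: feat_a separates the classes, feat_b is pure noise
toy <- data.frame(
  feat_a = c(rpois(n, 0.5), rpois(n, 4)),
  feat_b = rpois(2 * n, 1),
  target = factor(rep(c("Class_1", "Class_2"), each = n))
)

# Fit a classification tree, then prune at complexity parameter 0.02
fit    <- rpart(target ~ ., data = toy, method = "class")
pruned <- prune(fit, cp = 0.02)

# Which features does the pruned tree actually split on?
used <- setdiff(unique(as.character(pruned$frame$var)), "<leaf>")
used

# Accuracy on the training data (optimistic; for illustration only)
pred     <- predict(pruned, toy, type = "class")
accuracy <- mean(pred == toy$target)
accuracy
```

For a pairwise comparison on the real data, the same calls would be applied to the rows of two classes only, for example the subset where target is Class_1 or Class_2.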

Decision Tree Class1 vs Class6

Here's a look at a different pairing (Class_1 and Class_6).

[Figure: decision tree plot, Class_1 vs Class_6]

Tree Plot (pruned to cp=0.02)

This shows that many more features (e.g. feat_8, feat_78, feat_6, &c.) are needed to discriminate between Class_1 and Class_6, though feat_78 is common to both analyses.

[Figure: tree pruned to cp=0.02, Class_1 vs Class_6]

Next Steps

Examine more pairwise trees.
Understand if the variable values contribute information.
Run generic random forest as benchmark.
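
For the first item, the full set of pairwise comparisons is small enough to enumerate. A sketch using base R's combn(); the variable names are illustrative.

```r
classes <- paste0("Class_", 1:9)

# Every unordered pair of classes, one pair per column
class_pairs <- combn(classes, 2)
ncol(class_pairs)   # number of pairwise trees to examine

# Pairs as readable labels, e.g. for naming plots
pair_labels <- apply(class_pairs, 2, paste, collapse = " vs ")
head(pair_labels, 3)
```

With 9 classes there are choose(9, 2) = 36 pairs, so looping over all of them is feasible even on the full data.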

Cheers!