Winston Saunders
April 22, 2015
Exploratory Summary
R version 3.2.0 (2015-04-16)
The train data set has 61878 rows and 95 columns. Here is a sample of a few rows and columns. The target column takes 9 distinct classes:
id | feat_1 | feat_2 | feat_3 | feat_4 | feat_5 | feat_92 | feat_93 | target |
---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Class_1 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Class_1 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Class_1 |
4 | 1 | 0 | 0 | 1 | 6 | 0 | 0 | Class_1 |
The number of elements in each class is shown below.
Class_1 | Class_2 | Class_3 | Class_4 | Class_5 | Class_6 | Class_7 | Class_8 | Class_9 |
---|---|---|---|---|---|---|---|---|
1929 | 16122 | 8004 | 2691 | 2739 | 14135 | 2839 | 8464 | 4955 |
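Counts like these can be obtained with base R's `table()`. The sketch below uses a small made-up data frame (`toy_train`, a hypothetical stand-in for the real `train_data`) and assumes only that the data has a `target` column as shown above.

```r
# Hypothetical stand-in for train_data: only the target column matters here
toy_train <- data.frame(
  id     = 1:6,
  target = c("Class_1", "Class_2", "Class_2", "Class_6", "Class_2", "Class_6")
)

# Count the rows in each class
class_counts <- table(toy_train$target)
print(class_counts)

# Share of each class, useful for spotting class imbalance
class_share <- prop.table(class_counts)
print(round(class_share, 2))
```

In the real counts above, Class_2 and Class_6 together account for just under half of the rows (30257 of 61878), so the classes are far from balanced.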
To get a manageable data set for inspection, first draw a random sample of 4000 rows. This speeds up the calculations.
sample_rows <- sample(nrow(train_data), 4000)
train_data <- train_data[sample_rows, ]
The sampled train_data has 4000 rows and 95 columns.
Then, for plotting and summarizing, convert it to a “long” format with one feature value per row.
## Get packages
library(plyr); library(ggplot2); library(tidyr)
## munge data into long format
long_train<-gather(train_data, feature, data, feat_1:feat_93)
Here is a sample (the full table has 372000 rows):
id | target | feature | data |
---|---|---|---|
9869 | Class_2 | feat_1 | 0 |
29589 | Class_5 | feat_1 | 0 |
47323 | Class_7 | feat_1 | 0 |
47625 | Class_7 | feat_1 | 1 |
16617 | Class_2 | feat_1 | 0 |
41644 | Class_6 | feat_1 | 0 |
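To make the reshaping concrete, here is a minimal base-R sketch of what the wide-to-long conversion does, using a tiny made-up data frame (`toy_wide`, hypothetical) with two features. It illustrates the idea and is not how `gather()` is implemented. Each input row yields one output row per feature, which is why 4000 rows and 93 features give 372000 rows.

```r
# Hypothetical wide data: one row per product, one column per feature
toy_wide <- data.frame(
  id     = c(1, 2, 3),
  feat_1 = c(1, 0, 0),
  feat_2 = c(0, 0, 6),
  target = c("Class_1", "Class_1", "Class_2")
)
feat_cols <- c("feat_1", "feat_2")

# Stack the feature columns: each (id, feature) pair becomes its own row
toy_long <- data.frame(
  id      = rep(toy_wide$id, times = length(feat_cols)),
  target  = rep(toy_wide$target, times = length(feat_cols)),
  feature = rep(feat_cols, each = nrow(toy_wide)),
  data    = unlist(toy_wide[feat_cols], use.names = FALSE)
)
print(toy_long)
```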
Let's look at the means and standard deviations of the features by class.
## use ddply to get means and standard deviations by class and feature
train_morph <- ddply(long_train, c("target", "feature"), summarize, mean_data = mean(data), sdev_data = sd(data))
## calculate the inverse coefficient of variation (mean/sd)
## (which I will label as z_stat for later use)
## add a small value to the denominator to avoid division by zero
train_morph$z_stat <- train_morph$mean_data / (train_morph$sdev_data + 0.00001)
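As a small worked example of the statistic, the sketch below computes mean/sd for two made-up feature vectors. The helper name `inv_cv` and the sample values are hypothetical; the constant is the same 0.00001 used above. It also shows what the constant guards against: a feature that is always zero within a class has mean 0 and sd 0, and 0/0 is `NaN`.

```r
# Inverse coefficient of variation for one numeric vector
# (inv_cv is a hypothetical helper name, not part of the original analysis)
inv_cv <- function(x, eps = 0.00001) mean(x) / (sd(x) + eps)

sparse_feature <- c(0, 0, 1, 6)   # made-up counts with a large spread
all_zero       <- c(0, 0, 0, 0)   # made-up feature that never occurs

inv_cv(sparse_feature)  # mean 1.75 over an sd of about 2.87, so well below 1
inv_cv(all_zero)        # 0 / (0 + eps) gives 0

# Without the constant, the all-zero feature would produce NaN
mean(all_zero) / sd(all_zero)
```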
There is some variation with class, but a low mean/sd (that is, a low 1/CV) means the correlations are noisy.
This is beginning to look promising. At least some features vary noticeably between classes, though others vary only weakly.
Here the mean and 1/CV of each feature within a product class are plotted, with color indicating the target class.
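The figure itself is not reproduced in this text. A minimal sketch of how a plot of this kind could be drawn with ggplot2 follows. The data frame `toy_morph` is a made-up stand-in for `train_morph`, and the choice of a simple point plot is an assumption, not necessarily the layout of the original figure.

```r
library(ggplot2)

# Hypothetical stand-in for train_morph (all values are made up)
toy_morph <- data.frame(
  target    = rep(c("Class_1", "Class_2"), each = 3),
  feature   = rep(c("feat_1", "feat_2", "feat_3"), times = 2),
  mean_data = c(0.4, 0.1, 1.2, 0.2, 0.9, 0.3),
  z_stat    = c(0.3, 0.1, 0.8, 0.2, 0.6, 0.25)
)

# One point per (class, feature) pair, colored by target class
p <- ggplot(toy_morph, aes(x = feature, y = z_stat, colour = target)) +
  geom_point(size = 3) +
  labs(x = "feature", y = "1/CV (mean/sd)", colour = "target class")
print(p)
```

Replacing `y = z_stat` with `y = mean_data` gives the corresponding plot of the means.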
If we require that 1/CV be above 0.5 (an arbitrary threshold), the result starts to look selective.
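The threshold can be applied with a plain row filter. The sketch below uses a made-up stand-in for `train_morph` (`toy_morph`, hypothetical values) and keeps only the class and feature pairs whose `z_stat` exceeds the cutoff.

```r
# Hypothetical stand-in for train_morph (all values are made up)
toy_morph <- data.frame(
  target  = rep(c("Class_1", "Class_2"), each = 3),
  feature = rep(c("feat_1", "feat_2", "feat_3"), times = 2),
  z_stat  = c(0.3, 0.1, 0.8, 0.2, 0.6, 0.25)
)

threshold <- 0.5  # arbitrary cutoff, as noted in the text

# Keep only the (class, feature) pairs with 1/CV above the cutoff
selected <- toy_morph[toy_morph$z_stat > threshold, ]
print(selected)

# Number of classes in which each surviving feature passes the cutoff
print(table(selected$feature))
```

A feature that passes the cutoff in only one or two classes is a candidate for separating those classes from the rest.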
Decision trees also appear to offer a good way to distinguish between classes.
This shows that just a few features (e.g. feat_17, feat_78, and feat_84) can do a pretty good job of discriminating between Class_1 and Class_2.
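The tree figures are not reproduced in this text, and the text does not say which package produced them. As a sketch, a pairwise tree can be fit with `rpart` (an assumption). The data below are synthetic, built so that one feature separates the two classes and another is pure noise; the feature names are reused only for illustration.

```r
library(rpart)

set.seed(1)
n <- 50  # rows per class in the synthetic data

# Synthetic two-class data: feat_17 is informative, feat_2 is noise
pair <- data.frame(
  feat_17 = c(rpois(n, 0.2), rpois(n, 0.2) + 5),
  feat_2  = rpois(2 * n, 1),
  target  = factor(rep(c("Class_1", "Class_2"), each = n))
)

# Classification tree on the two-class subset
fit <- rpart(target ~ ., data = pair, method = "class")
print(fit)

# Variables the tree relied on, and its accuracy on the training rows
print(fit$variable.importance)
pred <- predict(fit, pair, type = "class")
print(mean(pred == pair$target))
```

With the real data, the same call would be applied to the rows of `train_data` whose `target` is one of the two classes of interest, after dropping the `id` column and calling `droplevels()` on the subset.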
Here's a look at a different pairing (Class_1 and Class_6).
This shows that many more features (e.g. feat_8, feat_78, feat_6, etc.) are needed to discriminate between Class_1 and Class_6, though feat_78 is common to the two analyses.
Examine more pairwise trees.
Understand if the variable values contribute information.
Run generic random forest as benchmark.
Cheers!