k-means clustering and feature selection

Ronen Cohen
Thu Jan 22 17:57:42 2015

2. Problem statement

Help select important features to solve a given supervised learning problem.

This app demonstrates the idea that unsupervised learning is an important preprocessing step for supervised learning. It is often useful to organize your features first, choosing among them based on the X's themselves, and then to use those processed features as input to supervised learning.

This app uses the k-means unsupervised learning model to group related observations into clusters. The clusters offer a convenient way to visually examine relations between variables.

Variables are related if their values systematically correspond to each other for a given set of observations. Most importantly, a new feature can be generated from the clustering results, which could increase the significance of an independent variable and the overall interpretability of the model.

3. How to use

Select the Y-variable, which represents the dependent variable (e.g. mpg). Then repeatedly change the X-variable and the number-of-clusters parameter to evaluate the resulting clusters and detect the variables that systematically correspond to the Y-variable and hence could be significant in explaining the response. Based on the clustering results, a new feature can be generated, which could increase the interpretability of the model.
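The same exploration can be approximated outside the app. The sketch below is not the app's code. It assumes the mtcars data with mpg as the Y-variable, an illustrative set of candidate X-variables, and a hypothetical helper, separation_score, that reports the share of total variance lying between clusters. Both columns are standardized with scale(), which is also an assumption; the app may cluster on raw values. The score is only a rough guide to how compact the clusters are and does not replace looking at the plot.

```r
data("mtcars")
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

y_var  <- "mpg"                           # dependent variable
x_vars <- c("wt", "hp", "disp", "qsec")   # candidate X-variables (illustrative choice)
ks     <- 2:4                             # numbers of clusters to try

# separation_score is a hypothetical helper, not part of the app
separation_score <- function(x_var, k) {
  pair <- scale(mtcars[, c(x_var, y_var)])       # standardize both columns
  km   <- kmeans(pair, centers = k, nstart = 25) # 25 random starts, keep the best
  km$betweenss / km$totss                        # share of variance between clusters
}

# Rows are values of k, columns are X-variables
scores <- sapply(x_vars, function(x) sapply(ks, function(k) separation_score(x, k)))
rownames(scores) <- paste0("k=", ks)
print(round(scores, 2))
```

Comparing the columns for a fixed k gives a rough ranking of the X-variables, which can then be checked visually in the app.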

4. Example

For demo purposes I use the mtcars dataset. I chose to cluster the observations on two variables: mpg, the dependent variable, and wt, the independent variable (feature).

data("mtcars")

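# Keep only the two variables being clustered: weight (wt) and fuel economy (mpg)
df <- mtcars[, c("wt", "mpg")]
# Cluster on df, not on the full mtcars data frame, into 3 groups
km <- kmeans(df, 3)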

[Figure: scatter plot of wt and mpg, with the observations grouped into three k-means clusters]
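A minimal sketch of how a plot of this kind could be drawn in base R. The plotting code is not part of the original, so the colors, symbols, axis orientation, seed, and nstart value below are all assumptions made for illustration.

```r
data("mtcars")
set.seed(42)  # assumption: fixed seed so the cluster assignment is repeatable

df <- mtcars[, c("wt", "mpg")]
km <- kmeans(df, centers = 3, nstart = 25)

# Color each car by its cluster and mark the cluster centers with crosses
plot(df$wt, df$mpg, col = km$cluster, pch = 19,
     xlab = "wt (1000 lbs)", ylab = "mpg",
     main = "k-means clusters of mtcars (k = 3)")
points(km$centers[, "wt"], km$centers[, "mpg"], pch = 4, cex = 2, lwd = 2)
```

Because the variables are not rescaled here, mpg, which has the wider numeric range, has more influence on the cluster boundaries than wt.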

5. Conclusion

Based on the plot, mpg and wt appear to be negatively related: mpg decreases as the car gets heavier. For the most part the clusters are also well separated, which reinforces the importance of the independent variable (wt) in explaining the response (mpg).

Three clusters of observations were identified with the characteristics summarized in the table below.

cluster        wt (1000 lbs)    mpg (miles per gallon)
light cars     0.5 - 2.3        25 - 30+
medium cars    2.3 - 3.4        16 - 25
heavy cars     3.4 - 5+         5 - 16
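A summary of this kind can be computed directly from the clustering result. The sketch below assumes the same two-variable clustering as in the example, with a fixed seed and nstart = 25 added for repeatability. Cluster numbers are arbitrary labels, and the exact ranges it prints may not match the rounded boundaries in the table.

```r
data("mtcars")
set.seed(42)  # assumption: fixed seed for a repeatable assignment

df <- mtcars[, c("wt", "mpg")]
km <- kmeans(df, centers = 3, nstart = 25)

# Minimum and maximum of each variable within each cluster
ranges <- aggregate(df, by = list(cluster = km$cluster),
                    FUN = function(v) c(min = min(v), max = max(v)))
print(ranges)
```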

When trying to predict and interpret the gasoline consumption of cars, the car's weight is an important variable to include in our supervised learning model. To increase the model's interpretability, we can introduce the car's weight cluster as a feature.
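One way this could look in practice is sketched below. It is an illustration rather than the author's model: the weight boundaries are the ones from the cluster table, the column name wt_cluster and the level names are hypothetical, and a plain linear model stands in for the supervised learner. The feature is built from the weight boundaries rather than from the raw k-means labels, so it depends on wt alone and does not carry mpg into the predictor.

```r
data("mtcars")

# Weight boundaries taken from the cluster table (wt is in 1000 lbs)
mtcars$wt_cluster <- cut(mtcars$wt,
                         breaks = c(-Inf, 2.3, 3.4, Inf),
                         labels = c("light", "medium", "heavy"))

print(table(mtcars$wt_cluster))             # number of cars per weight group

# "light" is the reference level; the other coefficients are differences from it
fit <- lm(mpg ~ wt_cluster, data = mtcars)
print(coef(fit))
```

Each coefficient then reads as the average mpg difference between a weight group and the light cars, which is the kind of interpretability gain described above.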