String Clustering

Project for Developing Data Products (Coursera)

Enelen

Text Clustering


Often your data contains textual information that also needs to be analysed. For instance, the same thing may be written in different ways by different people ("color" and "colour"), and you would like all the variants to be treated as the same value.


As an example, recall the dataset from the final project of Reproducible Research. It had a column of storm types that officially should have contained 48 values but, due to data entry errors, spelling mistakes, and other reasons, listed more than 900 unique items.


One solution in such a case is to group similar strings together, just as you group similar points together based on how close they are to each other (as done in the Exploratory Data Analysis class).

How Text Clustering Works

In the case of strings, we need a different kind of distance to measure how close or similar two strings are. Many algorithms exist, such as Levenshtein, Jaro-Winkler, and Hamming, each with its own way of computing the distance between two strings. No single algorithm works well in every case.
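To make the idea of a string distance concrete, here is a minimal sketch using only base R. `adist` (from the `utils` package) computes Levenshtein distance; the `hamming` helper is a hypothetical function written for this illustration and is not taken from the app. Jaro-Winkler is not available in base R, so it is left out here.

```r
# Levenshtein (edit) distance: minimum number of insertions,
# deletions and substitutions needed to turn one string into another.
lev <- adist("color", "colour")
lev[1, 1]  # 1 (a single inserted "u")

# Hamming distance: number of positions at which two strings differ.
# It is only defined for strings of equal length.
# (hypothetical helper, written for this example)
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

ham <- hamming("HAIL 75", "HAIL 80")
ham  # 2 (the last two characters differ)
```

The two measures already disagree on what can be compared: Hamming cannot handle "color" vs "colour" at all, which is one reason no single algorithm suits every dataset.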

Once you have computed the distances using some algorithm, there is the question of "How many clusters?" This again depends on the user, and different answers give different results.
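How the number of clusters changes the grouping can be sketched as follows. This assumes hierarchical clustering (`hclust` plus `cutree`) on Levenshtein distances from `adist`; the clustering method the app actually uses is not stated here, and the sample strings are made up for illustration.

```r
# Illustrative storm-type strings (not the real NOAA column)
types <- c("HAIL", "HAIL 75", "HAIL 80",
           "HEAVY RAIN", "HEAVY RAINS",
           "TORNADO", "TORNADOS")

# Pairwise Levenshtein distances between all strings
d <- adist(types)
rownames(d) <- types

# Hierarchical clustering on the distance matrix
# (average linkage is an assumption, not the app's documented choice)
hc <- hclust(as.dist(d), method = "average")

# Cutting the same tree at different heights gives different groupings
groups_3 <- cutree(hc, k = 3)  # hail, rain and tornado kept apart
groups_2 <- cutree(hc, k = 2)  # two of those groups are forced together

split(types, groups_3)
```

With `k = 3` each storm family gets its own group, while `k = 2` has to merge two unrelated families, which is why the number of clusters is worth experimenting with.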

This app lets you try the different distance methods and numbers of clusters on your dataset, so you can determine which method and how many groups work best.

The App

The app allows users to upload their own file of strings. If no file is uploaded, the app defaults to the NOAA storm dataset used in Reproducible Research to showcase the string clustering mechanism.

The user can select the algorithm to compute distance, and the number of clusters to form, and then view the result one cluster at a time or all the clusters together.
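Viewing one cluster at a time versus all clusters together might look like the sketch below. The column names `Type` and `Cluster` follow the `clustered` object used in this deck, but how the app builds that object is an assumption, as are the sample strings and the use of `hclust`.

```r
# Illustrative strings (not the real NOAA column)
types <- c("HAIL", "HAIL 75", "HAIL 80",
           "TORNADO", "TORNADOS", "TORNADOES")

# Levenshtein distances, then hierarchical clustering (assumed method)
hc <- hclust(as.dist(adist(types)), method = "average")

# One row per string, with the cluster it was assigned to
clustered <- data.frame(Type = types,
                        Cluster = cutree(hc, k = 2),
                        stringsAsFactors = FALSE)

# View a single cluster
one_cluster <- clustered$Type[clustered$Cluster == 1]

# View all clusters together as a list of character vectors
all_clusters <- split(clustered$Type, clustered$Cluster)
```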

Contents from a sample cluster on the default dataset

pander(clustered$Type[clustered$Cluster==3])

HAIL, HAIL 1.75), MARINE MISHAP, HAIL 75, HAIL 80, HAIL 0.75, HAIL 1.00, HAIL 1.75, HAIL 225, HAIL 0.88, HAIL 88, HAIL 175, HAIL 100, HAIL 150, HAIL 075, HAIL 125, HAIL 200, HAIL DAMAGE, HAIL 088, HAIL ALOFT, HAIL 275, HAIL 450, RAIN (HEAVY) and MARINE HAIL

All of these are now considered one group! It is not perfect (MARINE MISHAP and RAIN (HEAVY) slipped in), but it gets the job done.

App interface

[Screenshot of the app interface]