19/08/2020

Text Classifier Algorithm:

The algortihm used by me is a very simple approach and have many things to work on for better accuracy, but it works for simple texts. It is a frequency based algorithm where the algorithm uses the cleaned data (RDS format - milestone report) to match the input string provided by the user, and provides all the matches wether its bi, tri or quad gram and after that the matched phrase is chosen with highest frequency.

Problems with the classifier:

  • It is highly dependent on the data (cleaned RDS data), in my case due to computational limitations I was only able to use 3 percent of the total data, due to which my model gives predictions with common words like ~ Mother’s, Happy, Barack etc
  • If a phrase is not available in the tri or quad grams returned list than the match is made using the bigram list using the last word of the sentence only.

Solution for the problems listed before:

The data problem can be easily solved using high end computer, I was only able to work with 3% data due to vector memory problem, but if more of the data can be used around (20% - 30% of complete data provided in the Capstone) this problem can easily be solved and highe accuracies can be achieved using the same frequency algorithm approach.

How to use the application :

Regardless the computational limitation, my model works well with common phrases and words like : Mother’s, Barack, Happy, United, United States of and much more

The application is straightforward and easy to use, just enter the text in the text box and the prediction will load instantly due to simplicity of the algorithm.

A red error shows up when the text box is empty, while a question mark shows up if there is spelling error or a word is unrecognised.

Future Approach:

As stated in the third slide the model can improved simply by feeding in more portion of data, but another way to work with this problem is to use a neural network.

Although a neural network would be more computationally exhaustive but if a Bi-Directional Recuurent Neural network or a Bag of words model approach is used then we can get even more accurate and smart results of the inputed query.