US English Text Analyzer

jtmoogle @github.com All Rights Reserved
Jan 20, 2016

Prepared as the final report for the Johns Hopkins Data Science Capstone online class.

Data Product: US English Text Analyzer

This Shiny application is a web-based data product that implements a prediction model to predict the next words you are likely to type.

  • The app learns terms/words from the English blogs, news, and Twitter data provided by SwiftKey for the JH Capstone.

  • It interacts with the user in real time, taking the user's typing as input and displaying the predicted next word(s).

    • Efficient Modeling: Markov Chain
    • Cleaned/Tidy datasets as a dictionary: compressed size 250KB
    • Lightweight app, quick response time: under about 0.5 sec

How to use

Users can access the app (https://jtmoogle.shinyapps.io/textAnalyzer) from any browser (e.g. IE, Firefox, Chrome) with internet access.

  • Input:
    1. Type any words of a phrase or sentence. There is no need to press the ENTER key.
    2. Select the number of next-word predictions to display.
  • Output: The app predicts and shows
    1. The word you are possibly typing at the moment
    2. The next word or few words you might type
  • Summary Reports: Users can view a summary of the N-Gram content in several ways: data table, plots, and a word cloud whose colors and fonts represent word frequency.

Methods Taken

Data -> Analyze -> Modeling -> Data Product

  • Data Collection: English data files, over 580MB in size
  • Cleaned data -> Tidy data
    • Removed URLs; replaced digits, punctuation, and control characters (non-English letters)
    • Converted to lower case (better for building the N-Gram dictionary)
    • Removed common/stop/unused words, and stemmed words
  • Explored possible variables for observations relating to words
    • Used an N-Gram API to generate word counts and probabilities for Unigram through N5 Gram.
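The cleaning and N-Gram counting steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names, the tiny stop-word list, and the choice to replace digits and punctuation with spaces are all assumptions, and stemming is left out to keep the sketch short.

```python
import re
from collections import Counter

URL_RE = re.compile(r"(https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # applied after lowercasing

# Tiny illustrative stop-word list (an assumption, not the app's real list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def clean(text):
    """Lowercase, drop URLs, replace digits/punctuation with spaces, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return [w for w in text.split() if w not in STOP_WORDS]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_probabilities(counts):
    """Relative frequency of each n-gram among all n-grams of the same order."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

tokens = clean("Visit http://example.com NOW, the 2 cats and the 3 dogs!")
bigrams = ngram_counts(tokens, 2)
```

Running `ngram_counts` for n = 1 through 5 on the cleaned corpus would give the Unigram to N5 Gram tables described above.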

Model Fitting

  • Normalized word count per corpus to the range [0,1].
    Algorithm: (wordcnt - min(wordcnt)) / (max(wordcnt) - min(wordcnt))
  • Ranked each word with a Preferred score from 1 to 9 based on its normalized value
  • Fitted the models, cross-validated them, and chose the one with the highest accuracy (with 95% CI) and the lowest RMSE and error rate
  • Compared Principal Component Analysis (PCA), linear model (lm), random forest (rf), k-means, and k-nearest neighbors (knn) under cross-validation to find the best prediction model.
    • Removed “near-zero variance” and “highly correlated” variables
    • Applied the maximum text-mining sparsity ratio that the corpus size and hardware capacity/memory allowed
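As a rough illustration of the normalization and ranking steps above, the sketch below applies the min-max formula and then maps the result to a 1 to 9 score. The report does not say how normalized values are mapped to scores, so the equal-width binning, the direction (higher count gives a higher score), and the function names are assumptions.

```python
def min_max_normalize(counts):
    """Scale word counts to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # All counts equal: avoid division by zero.
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]

def preferred_score(normalized, levels=9):
    """Map a value in [0, 1] to an integer score 1..levels.

    Assumption: equal-width bins, with the top value 1.0 placed in the last bin.
    """
    return min(int(normalized * levels) + 1, levels)

word_counts = [1, 5, 9]
normalized = min_max_normalize(word_counts)
scores = [preferred_score(v) for v in normalized]
```

Other binning schemes, such as quantiles, would fit the description equally well; equal-width bins are used here only because they are the simplest.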

Prediction and Future

The prediction is determined by the Preferred rank, which is based on the N-Gram word counts and normalized probabilities.

  • The N1 to N5 Gram files of the best prediction model were compressed into one N-Gram dictionary file smaller than 258KB, which was deployed along with the Shiny application.

The Shiny app's prediction algorithm uses the N-Gram dictionary:

  • Unigrams predict the word currently being typed
  • Bigrams through N5 Grams predict the possible next words
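A minimal sketch of this kind of lookup is shown below. It is not the app's implementation: the dictionary layout, the function names, ranking by raw count instead of the Preferred rank, and the strategy of trying the longest context first and backing off to shorter ones are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_dictionary(tokens, max_n=5):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of following words."""
    table = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            table[tuple(context)][word] += 1
    return table

def complete_current_word(table, prefix, k=3):
    """Unigram lookup: most frequent known words that start with the typed prefix."""
    unigrams = table.get((), Counter())
    matches = [(w, c) for w, c in unigrams.items() if w.startswith(prefix)]
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]

def predict_next_words(table, typed_words, k=3, max_n=5):
    """Try the longest available context first, then back off to shorter ones."""
    for size in range(min(max_n - 1, len(typed_words)), 0, -1):
        followers = table.get(tuple(typed_words[-size:]))
        if followers:
            ranked = sorted(followers.items(), key=lambda wc: (-wc[1], wc[0]))
            return [w for w, _ in ranked[:k]]
    return []

corpus = "i like green tea i like green apples i like tea".split()
table = build_dictionary(corpus)
suggestions = predict_next_words(table, ["i", "like"], k=2)
```

Because each prediction depends only on the last few typed words, this is the Markov-chain view of the text: a context of up to four words selects a distribution over the next word.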

Future work might extend the app to topic modeling, spell checking, and sentiment analysis. It could be applied in areas such as customer surveys, trouble ticketing, and voice response systems.