Coursera Data Science Specialization Capstone

Natural Language Processing

Lynette Garcia

Problem.

  • People spend a lot of time typing on their mobile devices. Although the size and functions of mobile keyboards have improved considerably, typing on a mobile device can still be a serious pain.

  • So the next time a mobile device user writes, the next three words should be predicted.

Data

  • The dataset consists of three files:
    • en_US.blogs.txt (205 MB of blog text)
    • en_US.news.txt (200 MB of news text)
    • en_US.twitter.txt (163 MB of tweets)

Preprocessing.


The data were cleaned before mining the files:

  • Extra whitespace was removed.
  • Numbers and punctuation marks were removed.
  • English stopwords were removed.
  • All text was converted to lowercase.
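
The cleaning steps above can be sketched in a few lines. This is a minimal Python illustration, not the code used in the project (which relied on R packages); the function name `clean_text` and the tiny `STOPWORDS` set are assumptions made for the example, and a real stopword list is much longer.

```python
import re

# Tiny illustrative stopword subset (an assumption for this sketch;
# real English stopword lists contain far more entries).
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to"}


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, drop numbers and punctuation, remove stopwords, squeeze whitespace."""
    text = text.lower()                       # convert to lowercase
    text = re.sub(r"[0-9]+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]|_", " ", text)    # remove punctuation marks
    tokens = [t for t in text.split() if t not in stopwords]  # remove stopwords
    return " ".join(tokens)                   # single spaces: extra whitespace is gone


print(clean_text("The  3 cats, sat on the MAT!"))
```

Lowercasing comes first so that stopwords such as "The" match the lowercase list.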

Sample

A clean corpus containing all the text files was generated, but it was too big for the available hardware, so several samples were drawn. The tm package was used.

In the end, a 5% sample of the corpus was used for prediction.
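
Drawing a 5% sample can be illustrated as follows. This is a hedged Python sketch rather than the project's own code; the helper `sample_lines`, its `seed` parameter and the line-level sampling unit are all assumptions, since the slides do not say how the sample was drawn.

```python
import random


def sample_lines(lines, fraction=0.05, seed=42):
    """Return a reproducible random sample containing `fraction` of the lines."""
    rng = random.Random(seed)           # fixed seed so the sample can be regenerated
    k = round(len(lines) * fraction)    # number of lines to keep
    return rng.sample(lines, k)         # sampling without replacement


corpus = [f"line {i}" for i in range(1000)]   # stand-in for the cleaned corpus
sample = sample_lines(corpus)
print(len(sample))
```

Fixing the seed keeps the sample stable between runs, which makes the downstream n-gram tables reproducible.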

Algorithm.

  • Document-term matrices were built for single words, bigrams, trigrams and quadgrams using the quanteda package.
  • A data frame was constructed for each of the 4-grams, 3-grams and 2-grams.
  • A Kneser-Ney smoothed probability (KNP) was calculated for the 2-grams, 3-grams and 4-grams, and each data frame was sorted in descending order of KNP.
  • This probability was used to predict the next three words.
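
To make the idea concrete, here is a minimal Python sketch of interpolated Kneser-Ney smoothing for bigrams only. It is an illustration under stated assumptions, not the project's implementation: the project also covered 3-grams and 4-grams, and the function names, the fixed discount of 0.75 and the toy sentences are all choices made for this example.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """All consecutive n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_bigram_kn(sentences, discount=0.75):
    """Return (prob, predict_next) built from tokenized sentences."""
    bigram_counts = Counter()
    for tokens in sentences:
        bigram_counts.update(ngrams(tokens, 2))

    context_counts = Counter()     # how often v occurs as a bigram's first word
    followers = defaultdict(set)   # distinct words seen after v
    preceders = defaultdict(set)   # distinct words seen before w
    for (v, w), c in bigram_counts.items():
        context_counts[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigram_counts)   # number of distinct bigram types

    def prob(v, w):
        """Interpolated Kneser-Ney estimate of P(w | v)."""
        p_cont = len(preceders.get(w, ())) / n_types   # continuation probability
        c_v = context_counts[v]
        if c_v == 0:               # unseen context: fall back to continuation prob.
            return p_cont
        discounted = max(bigram_counts[(v, w)] - discount, 0) / c_v
        backoff_weight = discount * len(followers[v]) / c_v
        return discounted + backoff_weight * p_cont

    def predict_next(v, k=3):
        """The k candidate words with the highest P(w | v), best first."""
        return sorted(preceders, key=lambda w: prob(v, w), reverse=True)[:k]

    return prob, predict_next


sentences = [["i", "love", "data"], ["i", "love", "r"], ["i", "like", "data"]]
prob, predict_next = train_bigram_kn(sentences)
print(predict_next("i"))
```

Sorting candidates by the smoothed probability plays the same role as ordering each data frame by KNP: the top rows for a given context are the predictions. Note that "data" can outrank "like" after "i" even though "i data" never occurs, because the continuation term rewards words that follow many different contexts.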

Shiny App

  • A Shiny app was built. The user enters a sentence and chooses how many words to predict.
  • A word cloud shows the frequency of the predicted words.
