Predict the Next Word

Derek Luo
2018.9.1

This is the Coursera capstone project “Predict the Next Word”.

The Shiny app is available at the link below.

https://derekluo.shinyapps.io/Capstone_Project_Predict_Next_Word/

Process of Building the APP - Tokenization for Database

Tokenize the blog, news, and Twitter data into sequences of two, three, and four words.
Exclude stop words and numbers to make prediction more efficient.
Build a database of all the word combinations that appear more than three times.
The final database contains 77,478 rows.

nrow(bidata)

[1] 48991

nrow(tridata)

[1] 23567

nrow(fourdata)

[1] 4920

nrow(bidata) + nrow(tridata) + nrow(fourdata)

[1] 77478
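
The tokenization steps above can be sketched roughly as follows. This is a minimal base R illustration, not the code used to build the app: the helper names `tokenize` and `build_ngrams`, the tiny stop-word list, and the sample text are all assumptions made for the example. Only the column names (`db`, `prediction`, `n`) and the "more than three times" threshold come from the slides. Note that this sketch forms word sequences after stop words are removed, which is a simplification.

```r
# Assumed, deliberately tiny stop-word list (the real list is not shown in the slides).
stop_words <- c("the", "a", "an", "of", "and", "to", "in")

# Hypothetical helper: lowercase, split into words, drop numbers and stop words.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  words <- words[nzchar(words)]
  words <- words[!grepl("[0-9]", words)]   # drop tokens containing digits
  words[!words %in% stop_words]
}

# Hypothetical helper: count word sequences of length `size` and return a table
# with columns db (first size - 1 words), prediction (last word), n (count).
build_ngrams <- function(lines, size, min_count = 3) {
  empty <- data.frame(db = character(0), prediction = character(0),
                      n = integer(0), stringsAsFactors = FALSE)
  pieces <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < size) return(NULL)
    starts <- seq_len(length(w) - size + 1)
    data.frame(
      db = vapply(starts,
                  function(i) paste(w[i:(i + size - 2)], collapse = " "),
                  character(1)),
      prediction = w[starts + size - 1],
      stringsAsFactors = FALSE
    )
  })
  all_pairs <- do.call(rbind, pieces)
  if (is.null(all_pairs)) return(empty)
  counts <- aggregate(n ~ db + prediction,
                      data = cbind(all_pairs, n = 1L), FUN = sum)
  counts <- counts[counts$n > min_count, ]  # keep combinations seen more than three times
  counts[order(counts$db, -counts$n), ]
}

# Made-up sample text, repeated so the counts pass the threshold.
sample_lines <- rep("the consumer financial protection bureau", 4)
bidata_demo   <- build_ngrams(sample_lines, size = 2)
fourdata_demo <- build_ngrams(sample_lines, size = 4)
```

With this sample, `fourdata_demo` holds a single row: `db` is "consumer financial protection", `prediction` is "bureau", and `n` is 4.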

Process of Building the APP - Wrap the function for prediction

Four-word combinations normally appear less often than two- or three-word combinations, but the fourth word is much more likely to follow once the first three words have been typed together.
For example, when we type “consumer financial protection” we normally expect the next word to be “bureau”, but for “protection” alone the prediction is “agency”, and “consumer financial protection bureau” has a lower count than “protection agency”.
The function therefore gives four-word matches more weight than three-word matches, and three-word matches more weight than two-word matches, so that “bureau” comes up as prediction word 1.

bidata %>%
    filter(db == "protection")

# A tibble: 3 x 3
# Groups:   db [1]
  db         prediction     n
  <chr>      <chr>      <int>
1 protection agency       332
2 protection act           92
3 protection district      70

fourdata %>%
    filter(db == "consumer financial protection")

# A tibble: 1 x 3
  db                            prediction     n
  <chr>                         <chr>      <int>
1 consumer financial protection bureau        55
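
The weighting idea on this slide could look something like the sketch below. It is written in base R under stated assumptions: the weights (1, 10, 100), the function name `predict_next`, and the empty `tridata` table are hypothetical and chosen only for illustration. The toy `bidata` and `fourdata` rows reuse the counts shown above.

```r
# Toy tables with the counts shown above; the real tables are much larger.
bidata <- data.frame(db = "protection",
                     prediction = c("agency", "act", "district"),
                     n = c(332L, 92L, 70L), stringsAsFactors = FALSE)
tridata <- data.frame(db = character(0), prediction = character(0),
                      n = integer(0), stringsAsFactors = FALSE)  # left empty here
fourdata <- data.frame(db = "consumer financial protection",
                       prediction = "bureau", n = 55L,
                       stringsAsFactors = FALSE)

# Assumed weights: longer matches count for more. Illustrative values only.
weights <- c(bi = 1, tri = 10, four = 100)

# Hypothetical prediction function: look up the last 3, 2, and 1 typed words,
# weight the counts by match length, and return the highest-scoring words.
predict_next <- function(input, top = 3) {
  words <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
  words <- words[nzchar(words)]

  score_table <- function(tbl, k, weight) {
    if (length(words) < k) return(NULL)
    key <- paste(tail(words, k), collapse = " ")
    hits <- tbl[tbl$db == key, c("prediction", "n")]
    if (nrow(hits) == 0) return(NULL)
    data.frame(prediction = hits$prediction, score = hits$n * weight,
               stringsAsFactors = FALSE)
  }

  candidates <- rbind(
    score_table(fourdata, 3, weights[["four"]]),
    score_table(tridata,  2, weights[["tri"]]),
    score_table(bidata,   1, weights[["bi"]])
  )
  if (is.null(candidates)) return(character(0))

  totals <- aggregate(score ~ prediction, data = candidates, FUN = sum)
  totals <- totals[order(-totals$score), ]
  head(totals$prediction, top)
}

predict_next("consumer financial protection")  # "bureau" "agency" "act"
predict_next("protection")                     # "agency" "act" "district"
```

Without the weights, “bureau” (55) would rank below “agency” (332). With the assumed weight of 100 on the four-word match it scores 5,500 and moves to first place.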

Process of Building the APP - Build an app for interaction

The two pictures show the app and the weighting function explained on the previous page.

[Figure: two screenshots of the app]

At the end of the journey

There are many other ways to do this kind of prediction, such as neural networks, but this is just a start in learning the basics of NLP.

If you think there is something I could do better in this app, please leave a comment on the Coursera page. Thank you!

https://derekluo.shinyapps.io/Capstone_Project_Predict_Next_Word/