Derek Luo
2018.9.1
This is my Coursera capstone project, “Predict the Next Word”.
The Shiny app is available at the link below.
https://derekluo.shinyapps.io/Capstone_Project_Predict_Next_Word/
nrow(bidata)
[1] 48991
nrow(tridata)
[1] 23567
nrow(fourdata)
[1] 4920
nrow(bidata) + nrow(tridata) + nrow(fourdata)
[1] 77478
A combination of four words normally appears less often than a combination of two or three words, but the fourth word is much more likely to follow once its first three words have been typed together.
For example, after typing “consumer financial protection” we would normally expect the next word to be “bureau”, but for “protection” alone the top prediction is “agency”. Because “protection agency” occurs 332 times in the data while “consumer financial protection bureau” occurs only 55 times, ranking by raw counts would put “agency” ahead of “bureau”.
To fix this, the prediction function gives four-word matches more weight than three-word matches, and three-word matches more weight than two-word matches, so that “bureau” comes up as prediction word 1.
bidata %>%
filter(db == "protection")
# A tibble: 3 x 3
# Groups:   db [1]
  db         prediction     n
  <chr>      <chr>      <int>
1 protection agency       332
2 protection act           92
3 protection district      70
fourdata %>%
filter(db == "consumer financial protection")
# A tibble: 1 x 3
  db                            prediction     n
  <chr>                         <chr>      <int>
1 consumer financial protection bureau        55
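The weighting idea described above can be sketched roughly as follows. This is only an illustration, not the code used in the app: the toy tables contain just the rows shown in this document, tridata is left empty, and the weights (1, 10, 100) and the helper names last_words and predict_next are assumptions made up for the example. With the counts above, any four-word weight greater than about 6 (332 / 55) relative to the two-word weight is enough to move “bureau” ahead of “agency”.

```r
library(dplyr)

# Toy tables holding only the rows shown in this document (assumption:
# the real tables have the same three columns: db, prediction, n).
bidata <- tibble(
  db = "protection",
  prediction = c("agency", "act", "district"),
  n = c(332L, 92L, 70L)
)
tridata <- tibble(db = character(), prediction = character(), n = integer())
fourdata <- tibble(
  db = "consumer financial protection",
  prediction = "bureau",
  n = 55L
)

# Hypothetical weights: the longer the matched context, the larger the weight.
weights <- c(bi = 1, tri = 10, four = 100)

# Join the last k words of the typed phrase into one lookup key.
last_words <- function(words, k) paste(tail(words, k), collapse = " ")

# Look up the phrase in all three tables, weight the counts,
# and return the highest-scoring candidate words.
predict_next <- function(phrase, top = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  bind_rows(
    bidata %>%
      filter(db == last_words(words, 1)) %>%
      mutate(score = n * weights[["bi"]]),
    tridata %>%
      filter(db == last_words(words, 2)) %>%
      mutate(score = n * weights[["tri"]]),
    fourdata %>%
      filter(db == last_words(words, 3)) %>%
      mutate(score = n * weights[["four"]])
  ) %>%
    group_by(prediction) %>%
    summarise(score = sum(score), .groups = "drop") %>%
    arrange(desc(score)) %>%
    head(top)
}

predict_next("consumer financial protection")
predict_next("protection")
```

In this sketch the first call ranks “bureau” first (55 x 100 = 5500 against 332 for “agency”), while the second call, which only matches the two-word table, still ranks “agency” first.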
There are many other ways to do this kind of prediction, such as neural networks, but this project is just a start on learning the basics of NLP.
If you think there is something I could do better in this app, please leave a comment on the Coursera page. Thank you!