Chien-Hua Wang
05/14/2019
In this project, our team designed a tool for next-word prediction. In addition, we selected several supporting software packages as our development environment.
The table below summarizes each raw dataset: line, character, and word counts, plus the minimum, mean, and maximum number of words per line (WPL). A code sketch for computing these statistics follows the table.
| FileName | Lines | Chars | Words | WPL_Min | WPL_Mean | WPL_Max |
|---|---|---|---|---|---|---|
| en_US.blogs | 899288 | 206824382 | 37570839 | 0 | 41.75107 | 6726 |
| en_US.news | 77259 | 15639408 | 2651432 | 1 | 34.61779 | 1123 |
| en_US.twitter | 2360148 | 162096241 | 30451170 | 1 | 12.75065 | 47 |
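These statistics can be reproduced with base R alone. The sketch below is a minimal illustration, not the original code: the file paths under `final/en_US/` and the helper name `summarize_file` are assumptions.

```r
# Summarize one raw corpus file: line count, character count, word count,
# and words-per-line (WPL) statistics.
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  wpl   <- sapply(strsplit(lines, "\\s+"), length)   # words per line
  data.frame(
    FileName = basename(path),
    Lines    = length(lines),
    Chars    = sum(nchar(lines)),
    Words    = sum(wpl),
    WPL_Min  = min(wpl),
    WPL_Mean = mean(wpl),
    WPL_Max  = max(wpl)
  )
}

# Assumed locations of the three en_US corpora
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
do.call(rbind, lapply(files, summarize_file))
```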
Next, we evaluated the size of the sampled data relative to the full datasets; a sketch of the sampling step follows the table below.
| File | FileSize | nEntries | TotalCharacters | MaxCharacters |
|---|---|---|---|---|
| blogsData | 255.4 Mb | 899288 | 206824505 | 40833 |
| newsData | 19.8 Mb | 77259 | 15639408 | 5760 |
| twitterData | 319 Mb | 2360148 | 162096241 | 140 |
| blogsSample | 5.1 Mb | 17985 | 4136040 | 4243 |
| newsSample | 0.4 Mb | 1545 | 303785 | 1244 |
| twitterSample | 6.5 Mb | 47202 | 3245129 | 140 |
| allDataSample | 1.2 Mb | 6672 | 785607 | 3985 |
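The sample sizes above can be measured along the following lines. This is a sketch under stated assumptions, not the original sampling code: the seed, the roughly 2% per-file sampling fraction (inferred from the entry counts), and the helper names `sample_lines` and `summarize_sample` are ours.

```r
set.seed(1234)  # assumed seed, for reproducibility only

# blogsData, newsData, twitterData are the character vectors of lines
# read from the raw files in the previous step.
sample_lines <- function(lines, fraction = 0.02) {
  lines[sample(seq_along(lines), size = floor(fraction * length(lines)))]
}

blogsSample   <- sample_lines(blogsData)
newsSample    <- sample_lines(newsData)
twitterSample <- sample_lines(twitterData)
# Combined sample; the 10% fraction here is an assumption.
allDataSample <- sample_lines(c(blogsSample, newsSample, twitterSample),
                              fraction = 0.10)

# Report object size, entry count, and character statistics for one sample
summarize_sample <- function(x) {
  data.frame(FileSize        = format(object.size(x), units = "Mb"),
             nEntries        = length(x),
             TotalCharacters = sum(nchar(x)),
             MaxCharacters   = max(nchar(x)))
}
summarize_sample(blogsSample)
```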
In this tool, we performed natural-language preprocessing before modeling. In addition, we used an N-gram algorithm as our search engine to identify high-frequency words; the table below lists the top results, and a preprocessing sketch follows it.
| Word | Frequency |
|---|---|
| just | 566 |
| like | 444 |
| one | 434 |
| will | 434 |
| can | 414 |
| get | 329 |
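A minimal sketch of preprocessing and unigram counting that could produce a table like the one above. The cleaning rules and the use of the `tm` stop-word list are assumptions (the absence of words such as "the" among the top entries suggests stop words were removed), and `clean_text` is a hypothetical helper.

```r
library(tm)  # assumed source of the English stop-word list

# Basic cleaning: lower-case, keep letters/apostrophes, collapse whitespace
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

tokens <- unlist(strsplit(clean_text(allDataSample), " "))
tokens <- tokens[tokens != "" & !(tokens %in% stopwords("english"))]

# Unigram frequency table; the head corresponds to the table above
freq <- sort(table(tokens), decreasing = TRUE)
head(data.frame(word = names(freq), frequency = as.integer(freq)), 6)
```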
In our Shiny app, we demonstrated these high-frequency words and the next-word predictions built on them.
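As an illustration of how such a Shiny app could wire a simple bigram lookup to a next-word prediction, here is a minimal sketch. It reuses `clean_text` and `allDataSample` from the sketches above; the bigram table, its column names, and the UI layout are assumptions rather than the app's actual implementation.

```r
library(shiny)

# Hypothetical bigram table built from the cleaned sample (stop words kept,
# since they are legitimate next-word candidates).
all_tokens <- unlist(strsplit(clean_text(allDataSample), " "))
all_tokens <- all_tokens[all_tokens != ""]
bigrams <- aggregate(
  count ~ first + second,
  data = data.frame(first  = head(all_tokens, -1),
                    second = tail(all_tokens, -1),
                    count  = 1),
  FUN = sum
)

# Return the most frequent word observed after the last word of `phrase`
predict_next <- function(phrase) {
  last <- tail(unlist(strsplit(clean_text(phrase), " ")), 1)
  hits <- bigrams[bigrams$first == last, ]
  if (length(last) == 0 || nrow(hits) == 0) return("(no prediction)")
  hits$second[which.max(hits$count)]
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText(predict_next(input$phrase))
}
shinyApp(ui, server)
```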