Milestone_Project

Chien-Hua Wang
05/14/2019

Project Goal

In this project, our team designed a tool for next-word prediction. We used the following software as our development environment:

  • R
  • RStudio
  • Shiny (web app)

Dataset Exploration

The table below summarizes each raw dataset: the number of lines, characters, and words, plus the minimum, mean, and maximum words per line (WPL).

       FileName   Lines     Chars    Words WPL_Min WPL_Mean WPL_Max
1   en_US.blogs  899288 206824382 37570839       0 41.75107    6726
2    en_US.news   77259  15639408  2651432       1 34.61779    1123
3 en_US.twitter 2360148 162096241 30451170       1 12.75065      47
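
As an illustration of how such a summary could be computed, here is a minimal base R sketch. The helper `summarise_lines` and the toy input are hypothetical and are not the report's original code; words are counted here simply as runs of non-whitespace characters, which may differ slightly from the counting rule behind the table above.

```r
# Hypothetical helper: summarise a character vector of text lines.
summarise_lines <- function(name, lines) {
  # Words per line: count runs of non-whitespace characters in each line.
  wpl <- lengths(regmatches(lines, gregexpr("\\S+", lines)))
  data.frame(
    FileName = name,
    Lines    = length(lines),
    Chars    = sum(nchar(lines)),
    Words    = sum(wpl),
    WPL_Min  = min(wpl),
    WPL_Mean = mean(wpl),
    WPL_Max  = max(wpl)
  )
}

# Toy input standing in for one raw file (the empty line gives WPL_Min = 0).
toy <- c("the quick brown fox", "hello world", "")
toy_stats <- summarise_lines("toy", toy)
print(toy_stats)
```

For a real file, the lines could be loaded with `readLines()` (for example with `encoding = "UTF-8"` and `skipNul = TRUE`) and passed to the same helper; the exact file paths are not given in this report.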

Memory Exploration for original data and sample data

This step compares the memory footprint of the original data with that of the sampled data, to check that the sample size is manageable.

File          FileSize nEntries TotalCharacters MaxCharacters
blogsData     255.4 Mb   899288       206824505         40833
newsData       19.8 Mb    77259        15639408          5760
twitterData     319 Mb  2360148       162096241           140
blogsSample     5.1 Mb    17985         4136040          4243
newsSample      0.4 Mb     1545          303785          1244
twitterSample   6.5 Mb    47202         3245129           140
allDataSample   1.2 Mb     6672          785607          3985
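
The entry counts above suggest that each per-file sample keeps about 2% of the original entries (for example, 17985 of 899288 blog entries). That fraction is inferred from the table, not stated in the report. Under that assumption, the sampling and the memory table could be sketched in base R as follows; `sample_lines`, `memory_row`, the seed, and the toy data are all hypothetical.

```r
set.seed(2019)  # assumed seed, used only to make this sketch reproducible

# Hypothetical helper: keep a random fraction of the entries.
sample_lines <- function(lines, fraction = 0.02) {
  n <- max(1, floor(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

# Hypothetical helper: one row of the memory table for a character vector.
memory_row <- function(name, lines) {
  data.frame(
    File            = name,
    FileSize        = format(object.size(lines), units = "Mb"),
    nEntries        = length(lines),
    TotalCharacters = sum(nchar(lines)),
    MaxCharacters   = max(nchar(lines))
  )
}

# Toy data standing in for one raw file.
toy_data   <- paste("entry number", seq_len(1000))
toy_sample <- sample_lines(toy_data, fraction = 0.02)

print(rbind(memory_row("toyData", toy_data),
            memory_row("toySample", toy_sample)))
```

Sampling by entry rather than by character keeps whole lines intact, so later tokenization never starts in the middle of a sentence.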

Algorithm and Design

Before modeling, we applied natural language preprocessing to the text. We then used an n-gram model as the search engine for the tool, based on how frequently words occur. The most frequent words in the sample are listed below.

  word frequency
1 just       566
2 like       444
3  one       434
4 will       434
5  can       414
6  get       329
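
To make the idea concrete, the following base R sketch tokenizes a toy corpus, counts word frequencies, and predicts the next word from bigram counts. It is only an illustration under stated assumptions: `tokenize`, `predict_next`, and the toy corpus are hypothetical, and the report's actual preprocessing is not shown. The absence of words such as "the" from the table above suggests stop words were removed, a step this sketch skips.

```r
# Hypothetical helper: lowercase, replace non-letters with spaces, split on whitespace.
tokenize <- function(line) {
  clean  <- gsub("[^a-z' ]", " ", tolower(line))
  tokens <- strsplit(clean, "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Toy corpus standing in for the sampled data.
toy_corpus <- c("I like green tea", "I like green apples",
                "I like tea", "You like green tea!")
tokens_by_line <- lapply(toy_corpus, tokenize)

# Unigram frequencies, most frequent first.
word_freq <- sort(table(unlist(tokens_by_line)), decreasing = TRUE)

# Bigrams are built within each line so that pairs never span two entries.
bigrams <- do.call(rbind, lapply(tokens_by_line, function(tok) {
  if (length(tok) < 2) return(NULL)
  data.frame(first = head(tok, -1), second = tail(tok, -1),
             stringsAsFactors = FALSE)
}))

# Hypothetical helper: return the most frequent follower of `word`,
# or NA if the word was never seen as the first half of a bigram.
predict_next <- function(word, bigrams) {
  followers <- bigrams$second[bigrams$first == tolower(word)]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

print(head(word_freq))
print(predict_next("like", bigrams))
print(predict_next("green", bigrams))
```

A fuller model would typically also keep trigram and higher-order counts and fall back to shorter n-grams when a longer context has not been seen.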

WordCloud Visualization

Our Shiny app displays the high-frequency words as a word cloud.

[Figure: word cloud of the high-frequency words]