Chien-Hua Wang
05/14/2019
In this project, our team designed a tool for next-word prediction. In addition, we selected several supporting software packages as our development environment.
The table below summarizes each raw dataset: line, character, and word counts, plus the minimum, mean, and maximum number of words per line (WPL). A code sketch for computing these statistics follows the table.
| FileName | Lines | Chars | Words | WPL_Min | WPL_Mean | WPL_Max |
|---|---|---|---|---|---|---|
| en_US.blogs | 899288 | 206824382 | 37570839 | 0 | 41.75107 | 6726 |
| en_US.news | 77259 | 15639408 | 2651432 | 1 | 34.61779 | 1123 |
| en_US.twitter | 2360148 | 162096241 | 30451170 | 1 | 12.75065 | 47 |
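These statistics can be reproduced with base R alone. The sketch below is a minimal illustration, not the original code: the file paths under `final/en_US/` and the helper name `summarize_file` are assumptions.

```r
# Summarize one raw corpus file: line count, character count, word count,
# and words-per-line (WPL) statistics.
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  wpl   <- sapply(strsplit(lines, "\\s+"), length)   # words per line
  data.frame(
    FileName = basename(path),
    Lines    = length(lines),
    Chars    = sum(nchar(lines)),
    Words    = sum(wpl),
    WPL_Min  = min(wpl),
    WPL_Mean = mean(wpl),
    WPL_Max  = max(wpl)
  )
}

# Assumed locations of the three en_US corpora
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
do.call(rbind, lapply(files, summarize_file))
```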
Next, we evaluated the size of the sampled data relative to the full datasets; a sketch of the sampling step follows the table below.
| File | FileSize | nEntries | TotalCharacters | MaxCharacters |
|---|---|---|---|---|
| blogsData | 255.4 Mb | 899288 | 206824505 | 40833 |
| newsData | 19.8 Mb | 77259 | 15639408 | 5760 |
| twitterData | 319 Mb | 2360148 | 162096241 | 140 |
| blogsSample | 5.1 Mb | 17985 | 4136040 | 4243 |
| newsSample | 0.4 Mb | 1545 | 303785 | 1244 |
| twitterSample | 6.5 Mb | 47202 | 3245129 | 140 |
| allDataSample | 1.2 Mb | 6672 | 785607 | 3985 |
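The sample sizes above can be measured along the following lines. This is a sketch under stated assumptions, not the original sampling code: the seed, the roughly 2% per-file sampling fraction (inferred from the entry counts), and the helper names `sample_lines` and `summarize_sample` are ours.

```r
set.seed(1234)  # assumed seed, for reproducibility only

# blogsData, newsData, twitterData are the character vectors of lines
# read from the raw files in the previous step.
sample_lines <- function(lines, fraction = 0.02) {
  lines[sample(seq_along(lines), size = floor(fraction * length(lines)))]
}

blogsSample   <- sample_lines(blogsData)
newsSample    <- sample_lines(newsData)
twitterSample <- sample_lines(twitterData)
# Combined sample; the 10% fraction here is an assumption.
allDataSample <- sample_lines(c(blogsSample, newsSample, twitterSample),
                              fraction = 0.10)

# Report object size, entry count, and character statistics for one sample
summarize_sample <- function(x) {
  data.frame(FileSize        = format(object.size(x), units = "Mb"),
             nEntries        = length(x),
             TotalCharacters = sum(nchar(x)),
             MaxCharacters   = max(nchar(x)))
}
summarize_sample(blogsSample)
```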
In this tool, we performed natural-language preprocessing before modeling. In addition, we used an N-gram algorithm as our search engine to identify high-frequency words; the table below lists the top results, and a preprocessing sketch follows it.
| Word | Frequency |
|---|---|
| just | 566 |
| like | 444 |
| one | 434 |
| will | 434 |
| can | 414 |
| get | 329 |
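A minimal sketch of preprocessing and unigram counting that could produce a table like the one above. The cleaning rules and the use of the `tm` stop-word list are assumptions (the absence of words such as "the" among the top entries suggests stop words were removed), and `clean_text` is a hypothetical helper.

```r
library(tm)  # assumed source of the English stop-word list

# Basic cleaning: lower-case, keep letters/apostrophes, collapse whitespace
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

tokens <- unlist(strsplit(clean_text(allDataSample), " "))
tokens <- tokens[tokens != "" & !(tokens %in% stopwords("english"))]

# Unigram frequency table; the head corresponds to the table above
freq <- sort(table(tokens), decreasing = TRUE)
head(data.frame(word = names(freq), frequency = as.integer(freq)), 6)
```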
In our Shiny app, we demonstrated these high-frequency words and the next-word predictions built on them.
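As an illustration of how such a Shiny app could wire a simple bigram lookup to a next-word prediction, here is a minimal sketch. It reuses `clean_text` and `allDataSample` from the sketches above; the bigram table, its column names, and the UI layout are assumptions rather than the app's actual implementation.

```r
library(shiny)

# Hypothetical bigram table built from the cleaned sample (stop words kept,
# since they are legitimate next-word candidates).
all_tokens <- unlist(strsplit(clean_text(allDataSample), " "))
all_tokens <- all_tokens[all_tokens != ""]
bigrams <- aggregate(
  count ~ first + second,
  data = data.frame(first  = head(all_tokens, -1),
                    second = tail(all_tokens, -1),
                    count  = 1),
  FUN = sum
)

# Return the most frequent word observed after the last word of `phrase`
predict_next <- function(phrase) {
  last <- tail(unlist(strsplit(clean_text(phrase), " ")), 1)
  hits <- bigrams[bigrams$first == last, ]
  if (length(last) == 0 || nrow(hits) == 0) return("(no prediction)")
  hits$second[which.max(hits$count)]
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText(predict_next(input$phrase))
}
shinyApp(ui, server)
```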