22 9 2020

The Main Description

The mission of the project is to clean a raw data set and present data on what people read or write in online news, social media and blogs within one sample year, in order to complete a JHU course project.

The Internet is an important source of data nowadays, and effective statistical and IT methods for processing its big data are of high interest.

The Primary Data Description

This project uses raw (uncleaned) data in .txt format, kept in the original languages as they are.
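
As a minimal sketch of what cleaning such raw text might involve (the project's actual cleaning code is not shown here), the hypothetical helper `clean_text()` below lowercases each line, removes URLs, digits and punctuation, and collapses whitespace. It uses base R only, and the sample lines are invented for illustration.

```r
# Hypothetical helper, not the project's actual code: basic text normalisation.
clean_text <- function(lines) {
  x <- tolower(lines)
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)  # drop URLs
  x <- gsub("[^[:alpha:]' ]", " ", x)   # keep letters, apostrophes and spaces
  x <- gsub("[[:space:]]+", " ", x)     # collapse runs of whitespace
  trimws(x)
}

# Invented sample lines standing in for the raw .txt data.
raw <- c("Check THIS out: http://example.com !!!", "It's  2020, right?")
clean_text(raw)
```

Keeping apostrophes is a design choice that preserves contractions such as "it's" as single tokens; a different project might reasonably strip them.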

Presentation slides documenting the data processing methods may be found here: https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html.

The data itself may be found here.

The project is written in R, a free software statistical programming language. Code samples are presented in this text.

Achievements

  • Single Function For R Environment Complex Prediction
  • Function For Shiny
  • Midterm Report
  • Final Project Presentation

Part I: R Environment Prediction

  • A Special Function Is Designed For This Task: getWords()
  • It Suggests The Next Words
  • And Plots A Wordcloud Representing Words + Probabilities
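
The internals of getWords() are not shown in these slides. As a rough illustration of one common way to rank next-word candidates with probabilities, the sketch below uses bigram counts and relative frequencies. The toy corpus and the function name `predict_next()` are assumptions made for this example only.

```r
# Toy corpus standing in for the cleaned training text (illustrative only).
corpus <- c("i love new york", "i love new music", "i love you", "new york is big")

# Build a bigram table: every pair of adjacent words in each sentence.
tokens  <- strsplit(corpus, " ", fixed = TRUE)
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1), stringsAsFactors = FALSE)
}))

# Hypothetical stand-in for getWords(): rank the words seen after the last input word.
predict_next <- function(phrase, n = 3) {
  words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
  last  <- tail(words, 1)
  hits  <- bigrams$second[bigrams$first == last]
  if (length(hits) == 0) {
    return(data.frame(word = character(0), prob = numeric(0)))
  }
  counts <- sort(table(hits), decreasing = TRUE)
  out <- data.frame(word = names(counts),
                    prob = as.numeric(counts) / length(hits),
                    stringsAsFactors = FALSE)
  head(out, n)
}

predict_next("I love new")
```

A word/probability table of this shape could then be handed to a plotting function, for example wordcloud() from the wordcloud package if it is installed, to draw the cloud.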

Part I: R Environment Prediction Output Example

[Figure: Word Prediction]

Part II: The Shiny Application (Local Machine Use)

[Figure: Word Prediction]

Part II: The Shiny Application (Empty Input Processing)

[Figure: Word Prediction]
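
How the application handles empty input is not spelled out here. One plausible approach, sketched below in base R, is to check the input before calling the predictor and return a prompt instead. The names `safe_predict()` and `demo_predictor()` and the message text are hypothetical.

```r
# Hypothetical guard: decide what to show before calling the predictor.
safe_predict <- function(phrase, predictor) {
  if (is.null(phrase) || !nzchar(trimws(phrase))) {
    return("Please type a word or phrase.")
  }
  predictor(phrase)
}

# Stand-in predictor used only for this demonstration.
demo_predictor <- function(phrase) paste("prediction for:", phrase)

safe_predict("   ", demo_predictor)       # whitespace only: prompt is returned
safe_predict("new york", demo_predictor)  # normal input: predictor is called
```

In a Shiny server function, a guard like this would typically wrap the text input value before it reaches the prediction code.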

Timing

  • The Internal R Function Needs Around 2-3 Seconds To Predict The Next Word(s)
  • The Shiny Application Needs Less Than 1 Second To Produce Each Next Word
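
Timings like these can be measured with base R's system.time(). The sketch below times a placeholder function, `slow_lookup()`, which is an assumption standing in for the real prediction call.

```r
# Placeholder workload; replace with the real prediction call.
slow_lookup <- function() {
  Sys.sleep(0.2)  # simulate work
  "york"
}

timing <- system.time(result <- slow_lookup())
timing[["elapsed"]]  # wall-clock seconds for one call
```

Repeating the call several times and averaging the elapsed values would give a steadier estimate than a single run.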

Issues

  • Shiny Server Time Limit Per User
  • Processing Unusual Symbols From Non-English Languages
  • Stopwords: Whether To Use Them For Training The Model
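
Two of these issues can be illustrated with small base R helpers. This is a sketch under assumptions, not the project's code: `strip_non_ascii()`, `drop_stopwords()` and the short stopword list are hypothetical, and the sample strings are invented.

```r
# Remove every character outside the ASCII range from each string.
strip_non_ascii <- function(s) {
  vapply(s, function(z) {
    cp <- utf8ToInt(enc2utf8(z))   # Unicode code points of the string
    intToUtf8(cp[cp < 128L])       # keep ASCII code points only
  }, character(1), USE.NAMES = FALSE)
}

# Drop the words of one phrase that appear in a stopword list.
drop_stopwords <- function(s, stopwords) {
  words <- strsplit(s, "[[:space:]]+")[[1]]
  paste(words[!(words %in% stopwords)], collapse = " ")
}

x <- c("caf\u00e9 au lait", "na\u00efve \u2665 approach")
strip_non_ascii(x)

stopwords_demo <- c("the", "a", "of", "to", "and", "in")  # illustrative list only
drop_stopwords("the end of the story", stopwords_demo)
```

Dropping stopwords makes the model smaller, but for next-word prediction it may also remove exactly the words a user is most likely to type, which is why the choice is listed as an open issue.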

The Final Slide

Thank You Very Much For Your Attention! I Hope You Will Like My Project.