Data science capstone project presentation

Ike
2019-07-01

Introduction

These slides describe:

  • The data used for this application.

  • How to use the accompanying shiny application.

  • The prediction algorithm.

  • Model and application limitations.

Data cleaning, sampling and processing

The data used to build this application consists of sampled fractions of three text files:

1 en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. The application is based on an n-gram word model (n=5).

2 A fraction of each file was read in and cleaned separately.

3 Cleaning consists of converting all text to lowercase and removing punctuation, digits, single-letter words, English stopwords and extra whitespace.

4 A five-gram word model data frame was built for each file, and the three data frames were combined into a single data frame by row binding.

5 A sample of the five-gram word model data frame is shown in the table below.

word1 word2 word3 word4 word5
st louis plant close die
louis plant close die old
plant close die old age
close die old age workers
die old age workers making
old age workers making cars
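
The cleaning and five-gram steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the application's own code: the function names, the regular expression and the tiny stopword list are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; the list actually used
# by the application is not given in these slides.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "at", "on", "for"}


def clean_text(text):
    """Lowercase, drop punctuation and digits, then drop short words and stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    words = text.split()
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]


def five_grams(words, n=5):
    """Return every run of n consecutive words as a tuple (one table row each)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_table(sources):
    """Clean each source separately, then stack the rows (the row-binding step)."""
    rows = []
    for text in sources:
        rows.extend(five_grams(clean_text(text)))
    return rows
```

Each tuple returned by build_table corresponds to one row of the table above, with the five positions playing the roles of word1 to word5.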

Prediction algorithm

  • The prediction algorithm relies on the Markov property, namely that the next word of a phrase depends largely on its last word.

  • The model uses the last word of each phrase to search the model's dictionary for a likely next word.

  • For words not in the dictionary, one of the most frequent words in the dictionary is suggested as a possible next word.
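
One way to picture the lookup and the fallback described above is the sketch below. It is an assumption-laden illustration, not the application's implementation: it counts adjacent word pairs inside the n-gram rows, returns the most common follower of the last word, and falls back to the single most frequent word (the slides only say "one of the most frequent words"). The names build_model and predict_next are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(rows):
    """Count, for every word, which words follow it in the n-gram rows."""
    followers = defaultdict(Counter)   # word -> Counter of next words
    frequency = Counter()              # overall word counts, used for the fallback
    for row in rows:
        frequency.update(row)
        for current, nxt in zip(row, row[1:]):
            followers[current][nxt] += 1
    return followers, frequency


def predict_next(word, followers, frequency):
    """Suggest a next word given only the last word (first-order Markov assumption)."""
    word = word.lower()
    if word in followers:
        # Known word: return its most frequently observed follower.
        return followers[word].most_common(1)[0][0]
    # Unknown word: fall back to the most frequent word in the dictionary.
    return frequency.most_common(1)[0][0]


# Toy rows for illustration only; they are not taken from the real data.
rows = [("new", "york", "city"), ("new", "york", "times"), ("york", "city", "hall")]
followers, frequency = build_model(rows)
```

With these toy rows, a known word such as "york" is answered from its follower counts, while an unseen word is answered from the overall frequency table.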

How to use the shiny application

  • The user is prompted to enter a word or phrase.

  • The entered word, or the last word of the entered phrase, is extracted and used as the word whose next word is to be predicted.

  • Based on the last word of the phrase, a possible next word is suggested.

  • The suggested next word is appended to the entered phrase and output as a new phrase.
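
The input-to-output flow above might look something like this in outline. The sketch is in Python rather than Shiny, and last_word, complete_phrase, toy_lookup and toy_predict are hypothetical names; the toy predictor merely stands in for the real model.

```python
def last_word(phrase):
    """Return the lowercased last word of the phrase, or None if it is empty."""
    words = phrase.lower().split()
    return words[-1] if words else None


def complete_phrase(phrase, predict):
    """Append the predicted next word to the entered phrase."""
    target = last_word(phrase)
    if target is None:
        # Nothing was entered, so there is nothing to predict from.
        return phrase
    return f"{phrase.strip()} {predict(target)}"


# Hypothetical stand-in for the real model: a tiny lookup with a default word.
toy_lookup = {"louis": "plant", "old": "age"}


def toy_predict(word):
    return toy_lookup.get(word, "the")
```

For example, complete_phrase("St Louis", toy_predict) extracts "louis", looks it up, and returns the original phrase with the suggestion appended.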

Model limitations and challenges

1 One may think of language as a dictionary of words. Words not spoken in a language will likely not be found in its dictionary.

2 This application's language dictionary is based on text files of news, blogs and Twitter posts, and the uniqueness of the phrases in these files limits the application's predictive accuracy.

3 The model could be improved by adding language rules and context to word prediction.