Data science capstone project presentation

Ike
2019-07-01

Introduction

These slides describe:

The data used for this application.
How to use the accompanying shiny application.
The prediction algorithm.
Model and application limitations.

Data cleaning, sampling and processing

The data used to build this application consists of sampled fractions of three text files:

1 The files are en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. The application is based on a word n-gram model (n = 5).

2 A fraction of each file was read in and cleaned separately.

3 Cleaning consists of converting all text to lowercase and removing punctuation, digits, single-letter words, English stopwords and extra whitespace.

4 A five-gram word model data frame was built for each file, and the three data frames were combined into a single data frame by row binding.

5 A sample of the five-gram word model data frame is shown in the table below.

word1	word2	word3	word4	word5
st	louis	plant	close	die
louis	plant	close	die	old
plant	close	die	old	age
close	die	old	age	workers
die	old	age	workers	making
old	age	workers	making	cars
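
The cleaning and five-gram steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the code behind the application (which is presumably written in R). The function names `clean_text`, `five_grams` and `build_table` are hypothetical, and the tiny `STOPWORDS` set stands in for whatever full English stopword list was actually used.

```python
import re

# Illustrative stopword set only; the original presumably used a full
# English stopword list (assumption).
STOPWORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "at", "will", "as", "its"}


def clean_text(line):
    """Lowercase, strip punctuation and digits, drop single-letter words and stopwords."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]", " ", line)  # punctuation and digits become spaces
    tokens = line.split()                  # split() also collapses extra whitespace
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]


def five_grams(tokens, n=5):
    """Return every run of n consecutive tokens as a tuple (one table row each)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_table(sources):
    """Build n-grams for each source separately, then stack the rows (like row binding)."""
    rows = []
    for lines in sources:
        for line in lines:
            rows.extend(five_grams(clean_text(line)))
    return rows
```

Each tuple returned by `five_grams` corresponds to one row of the table (word1 to word5). Lines with fewer than five words left after cleaning contribute no rows.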

Prediction algorithm

The prediction algorithm relies on the Markov property, namely that the next word of a phrase depends largely on its last word.
The model uses the last word of each phrase to search the model's dictionary of words for a likely next word.
For words not in the dictionary, one of the most frequent words in the dictionary is suggested as a possible next word.
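
A last-word lookup with a frequency fallback might look something like the sketch below. The slides do not say how the dictionary is derived from the five-gram table, so counting adjacent word pairs within each row is an assumption, as are the names `build_model` and `predict_next`. The fallback here always returns the single most frequent word, whereas the application suggests one of the most frequent words.

```python
from collections import Counter, defaultdict


def build_model(rows):
    """Count, for each word, which words follow it in the n-gram rows.

    Overlapping rows repeat some word pairs, so the counts are only
    relative weights, not true corpus frequencies (assumption).
    """
    followers = defaultdict(Counter)
    frequency = Counter()
    for row in rows:
        frequency.update(row)
        for current, nxt in zip(row, row[1:]):
            followers[current][nxt] += 1
    return followers, frequency


def predict_next(word, followers, frequency):
    """Suggest the most common follower of `word`, or a frequent word if it is unseen."""
    if word in followers:
        return followers[word].most_common(1)[0][0]
    return frequency.most_common(1)[0][0]  # fallback for words not in the dictionary
```

With the sample rows from the table, `predict_next("louis", ...)` would return "plant", since "plant" is the only word observed after "louis".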

How to use the shiny application

The user is prompted to enter a word or phrase.
The entered word, or the last word of the entered phrase, is extracted and used as the word whose next word is to be predicted.
Based on this last word, a possible next word is suggested.
The suggested next word is appended to the entered phrase and output as a new phrase.
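
The input-to-output flow described above could be sketched as below. This is an illustrative Python outline rather than the Shiny code itself. The function `complete_phrase`, the `lookup` table and the fallback word are all hypothetical stand-ins for the application's model.

```python
def complete_phrase(phrase, lookup, fallback):
    """Extract the last word of `phrase`, look up a next word, and append it."""
    words = phrase.lower().split()
    if not words:
        return phrase  # nothing entered, so nothing to predict
    suggestion = lookup.get(words[-1], fallback)
    return f"{phrase.strip()} {suggestion}"


# Hypothetical lookup table mapping a last word to a suggested next word.
lookup = {"old": "age", "making": "cars"}

print(complete_phrase("workers making", lookup, fallback="said"))
```

A single word is treated the same way as a phrase, because the last word of a one-word input is that word.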

Model limitations and challenges

1 One may think of language as a dictionary of words. Words not spoken in a language will likely not be found in its dictionary.

2 This application's language dictionary is based on text files of news, blogs and Twitter posts, and the uniqueness of phrases in these text files limits the application's predictive accuracy.

3 The model could be improved by adding language rules and context to word prediction.