Coursera Data Science Capstone Final Project

Ozge Tugrul Sonmez
02/20/2021

Introduction and Data Preparation

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others.

In order to prepare data for the prediction model:

  • The Blogs, Twitter and News data sets are combined into a single main data source.
  • The character “I” is explicitly replaced with “i” before the lower() function is applied, so that the uppercase letter “I” consistently becomes the lowercase letter “i”.
  • All letters are converted to lowercase. All punctuation, numbers, symbols and URLs are removed.
  • N-grams (unigrams, bigrams, trigrams and quadgrams) are extracted and sorted by term frequency, and the resulting data frames are saved.
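The preparation steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names `clean_text` and `ngram_counts` are hypothetical, the regular expressions are one possible way to strip URLs, numbers, punctuation and symbols, and replacing removed characters with a space is an assumption.

```python
import re
from collections import Counter

def clean_text(text):
    """Apply the cleaning steps listed above to one line of text (hypothetical helper)."""
    text = text.replace("I", "i")                      # explicit "I" -> "i" substitution
    text = text.lower()                                # convert all letters to lowercase
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove URLs before punctuation
    text = re.sub(r"[^a-z\s]", " ", text)              # remove numbers, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()           # collapse and trim white space

def ngram_counts(lines, n):
    """Return (n-gram, frequency) pairs sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common()

# Example: bigram frequencies for two short lines
bigrams = ngram_counts(["The cat sat.", "The cat ran!"], 2)
```

In this sketch the sorted list of pairs plays the role of a saved n-gram data frame; calling `ngram_counts` with n from 1 to 4 would give the unigram through quadgram tables.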

Next Word Prediction Model

  • The input text is cleaned: the character “I” is substituted by “i”, letters are converted to lowercase, white space is trimmed, and numbers and punctuation are removed.
  • The quadgram, trigram and bigram data frames are filtered so that only terms with a frequency higher than 50 are kept.
  • If at least 3 words are entered, quadgram, trigram, bigram and unigram frequencies are used, in that order, to predict the next word.
  • If 2 words are entered, trigram, bigram and unigram frequencies are used, in that order.
  • If only one word is entered, bigram and unigram frequencies are used, in that order.
  • If nothing is entered, unigram frequency alone is used.
  • In every case, the 3 most frequent words are shown on the screen.
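The back-off order described above can be illustrated with a short Python sketch. It is not the application's actual code: `build_tables` and `predict_next` are hypothetical names, the frequency threshold of 50 is left out so that a toy corpus works, and filling any remaining slots from the next lower-order table is an assumption about how the 3 suggestions are completed.

```python
from collections import Counter

def build_tables(sentences, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n (hypothetical helper)."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                tables[n][tuple(words[i:i + n])] += 1
    return tables

def predict_next(text, tables, k=3, max_n=4):
    """Suggest up to k next words, backing off from higher to lower n-gram orders."""
    words = text.lower().split()
    predictions = []
    # Highest usable order is (number of input words + 1), capped at max_n.
    for n in range(min(len(words) + 1, max_n), 0, -1):
        context = tuple(words[len(words) - (n - 1):])  # last n-1 words; () for unigrams
        candidates = Counter()
        for gram, count in tables[n].items():
            if gram[:-1] == context:
                candidates[gram[-1]] += count
        # Assumption: slots not filled at this order are filled from lower orders.
        for word, _ in candidates.most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions

tables = build_tables([
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ran away",
])
suggestions = predict_next("the cat", tables)
```

With two words entered, the sketch consults the trigram table first and only falls back to bigrams and unigrams for the remaining suggestions. With empty input it goes straight to the unigram table.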

Shiny Application


Data Science Capstone Project Links