2024-07-28

FINAL PROJECT PRESENTATION ON THE COURSERA CAPSTONE PROJECT

  • Coursera Data Science Capstone Project - Shiny App for Next Word Prediction Using N-Grams
  • Morobe Geofrey
  • July 28th, 2024

The Shiny app predicts the next word using a basic n-gram model. The prediction can be based on the last 1, 2, 3, 4 or 5 words entered, using unigram, bigram and higher-order n-gram tables.

Summary

English text data from three sources (blogs, Twitter and news) were used in this project. The data were loaded, summarized, preprocessed and tokenized using R text mining packages such as quanteda, DT, dplyr, tm, ggplot2, ngram and tidytext, among others.
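As a rough illustration of the tokenizing and counting step, here is a minimal base R sketch. It is not the project's actual code, which relies on the packages listed above. The helper name `count_ngrams` and the cleaning rules (lowercase, apostrophes dropped, split on non-letters) are assumptions made for this example.

```r
# Hypothetical helper: count the n-grams of order n in a character vector.
# Assumption: tokens are lowercase runs of letters, with apostrophes dropped
# first (so "can't" becomes "cant").
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    text <- gsub("'", "", tolower(line), fixed = TRUE)
    tokens <- strsplit(text, "[^a-z]+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    # Slide a window of width n over the tokens.
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  # Frequency table, most frequent n-gram first.
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input, invented for illustration.
toy_lines <- c("I can't wait to see you", "Can't wait to see you soon")
fivegram_table <- count_ngrams(toy_lines, 5)
```

The result has the same two columns, `ngram` and `freq`, as the data table shown in the app.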

The word prediction model was developed with accuracy and efficiency in mind. It uses backoff: if the 5-gram table has no suggestion for the next word, the model falls back to 4-grams, then 3-grams, then 2-grams. The model also returns a second and a third guess.
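The backoff idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the app's code: `predict_next` is a hypothetical helper, each table is assumed to have `ngram` and `freq` columns, and the sketch assumes that the second and third guesses are filled from lower-order tables when a higher order yields fewer than three candidates.

```r
# Hypothetical helper: return up to k guesses for the next word.
# `tables` is a list of data frames (columns: ngram, freq), ordered from the
# highest order (for example 5-grams) down to bigrams.
predict_next <- function(text, tables, k = 3) {
  tokens <- strsplit(gsub("'", "", tolower(text), fixed = TRUE), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  guesses <- character(0)
  for (tab in tables) {
    if (nrow(tab) == 0) next
    # Order of this table, read off its first entry.
    n <- length(strsplit(tab$ngram[1], " ", fixed = TRUE)[[1]])
    if (length(tokens) < n - 1) next          # not enough context for this order
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits <- tab[startsWith(tab$ngram, paste0(prefix, " ")), , drop = FALSE]
    if (nrow(hits) == 0) next                 # back off to the next lower order
    hits <- hits[order(-hits$freq), , drop = FALSE]
    # The last word of each matching n-gram is a candidate.
    guesses <- unique(c(guesses, sub(".* ", "", hits$ngram)))
    if (length(guesses) >= k) break
  }
  head(guesses, k)
}

# Toy tables; the frequencies are invented for illustration only.
trigrams <- data.frame(ngram = "is not the", freq = 4L,
                       stringsAsFactors = FALSE)
bigrams <- data.frame(ngram = c("not a", "not to", "not the", "is a"),
                      freq = c(10L, 8L, 5L, 3L),
                      stringsAsFactors = FALSE)
guesses <- predict_next("Not", list(trigrams, bigrams))
```

With only one word of context the trigram table is skipped, so the guesses come from the bigram table in order of frequency.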

The Shiny app is deployed on shinyapps.io under the 1982mg account. When a user enters words, it predicts the next word along with a second and a third guess, and it displays plots and data for the n-grams the user selects.

How Does the Shiny App Work in Next Word Prediction?

The user signs in to an RPubs account and enters text in the given box. The predicted next word, second guess and third guess are then displayed. For example, entering the text "Not" generated the prediction below.

Enter Text: Not

Predicted Next Word: a

Second Guess: to

Third Guess: the

N-Gram Plots in Shinyapp

The Shiny app also has an “N-GRAM PLOTS” tab. The user selects which n-gram to view from a drop-down menu and sets the number of terms to display with the slider bar, and the corresponding graph is displayed.
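The selection behind such a plot amounts to sorting a frequency table and keeping the top rows. Here is a minimal sketch in base R; the helper names `top_terms` and `plot_top_terms` are hypothetical, and base graphics stand in for whatever plotting code the app actually uses.

```r
# Hypothetical helper: keep the n_terms most frequent rows of an n-gram table
# (columns: ngram, freq), as the slider value would.
top_terms <- function(tab, n_terms) {
  ordered <- tab[order(-tab$freq), , drop = FALSE]
  head(ordered, n_terms)
}

# Hypothetical helper: horizontal bar chart of the selected terms.
plot_top_terms <- function(tab, n_terms) {
  top <- top_terms(tab, n_terms)
  # Reverse so the most frequent term is drawn at the top.
  barplot(rev(top$freq), names.arg = rev(top$ngram), horiz = TRUE, las = 1)
}

# Three rows copied from the deck's 5-gram data table, deliberately unsorted.
fivegrams <- data.frame(
  ngram = c("in the middle of the", "at the end of the",
            "thanks for the shout out"),
  freq = c(41L, 74L, 32L),
  stringsAsFactors = FALSE
)
top_two <- top_terms(fivegrams, 2)
```

Calling `plot_top_terms(fivegrams, 2)` would draw the two most frequent 5-grams.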

View Data in Shinyapp

In the “VIEW DATA” tab, the user can select which n-gram data to view from a drop-down menu. The R package DT generates a data table from an R data frame, so the user can view the data as shown below.

     ngram                       freq
1    at the end of the           74
2    in the middle of the        41
3    thanks for the shout out    32
4    thank you so much for       32
5    happy mothers day to all    30
6    for the rest of the         28
7    the end of the day          28
8    i cant wait to see          24
9    cant wait to see you        24
10   keep up the good work       22

Documentation in Shinyapp

Finally, the “DOCUMENTATION” tab describes the app, in particular the process used to develop it.

Highlights

  • The model performs with good accuracy. The large number of n-grams extracted and the use of a backoff model to predict the next word helped in building an accurate model.
  • The model is fast: it predicts the next word while the user is typing.
  • R’s ‘ngram’ package is used to tokenize the data.
  • Words from foreign languages are handled by removing non-ASCII characters.
  • Profanity, URLs, hashtags and Twitter handles were removed from the corpora.
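The cleaning steps mentioned in the highlights could look roughly like this in base R. It is a sketch under assumptions rather than the project's code: `clean_text` is a hypothetical helper, the regular expressions are simple approximations, and the profanity list is a placeholder the caller has to supply.

```r
# Hypothetical helper: clean one or more lines of raw text.
# `profanity` is a placeholder word list supplied by the caller.
clean_text <- function(lines, profanity = character(0)) {
  # Drop bytes that cannot be represented in ASCII.
  x <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("[@#]\\w+", " ", x, perl = TRUE)                # handles, hashtags
  x <- gsub("'", "", x, fixed = TRUE)                       # can't -> cant
  x <- gsub("[^a-z ]+", " ", x)                             # punctuation, digits
  if (length(profanity) > 0) {
    # Remove listed words only when they appear as whole words.
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  # Collapse repeated spaces and trim the ends.
  trimws(gsub(" +", " ", x))
}

# Toy input, invented for illustration.
cleaned <- clean_text("Thanks @friend for the shout out! #blessed http://example.com")
```

Dropping apostrophes before tokenizing is consistent with entries such as "i cant wait to see" in the data table.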

Useful Links

  • http://rmarkdown.rstudio.com for the first phase of the project.
  • Shiny Interactive App: https://1982mg.shinyapps.io/ShinyApp/
  • Slide Deck
  • Coursera Data Science Specialization
  • Data Downloaded from: Capstone Dataset