Denis Rosa
2025-03-14
Data Science Capstone Project
Johns Hopkins University
This presentation features the Next Word Predict app including an introduction to the application user interface and details about the text prediction algorithm.
This presentation is part of the Coursera Capstone Project Milestone Report. It features a word prediction app written with Shiny. The app is available at https://deniswsrosa.shinyapps.io/denis_capstone_data_science/
Next Word Predict is a Shiny application designed to anticipate the next word(s) a user might type by leveraging a text prediction algorithm.
The app suggests possible next words in a sentence through an n-gram model, which analyzes sequences of n consecutive words from a given text.
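As a minimal illustration of what "sequences of n consecutive words" means, the sketch below extracts n-grams from a short phrase. It is written in Python purely for illustration (the app itself is built in R), and the function name ngrams and the sample phrase are assumptions, not taken from the app.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample phrase, already lowercased and split into words.
tokens = "thanks for the follow".split()

bigrams = ngrams(tokens, 2)
# [('thanks', 'for'), ('for', 'the'), ('the', 'follow')]

trigrams = ngrams(tokens, 3)
# [('thanks', 'for', 'the'), ('for', 'the', 'follow')]
```

A phrase shorter than n yields an empty list, which is why short inputs can only be matched against the lower-order n-grams.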
The predictive model was built from a large dataset of blogs, news articles, and Twitter posts. N-grams were extracted from this dataset, and their frequencies are used to rank candidate next words.
Different techniques in natural language processing and text mining were explored to optimize both the speed and precision of the predictions.
The predictive text model was developed using a sample of 800,000 lines sourced from a larger dataset of blogs, news articles, and Twitter content.
The data was cleaned with the tm package, alongside various regular expressions applied via the gsub function. During this preprocessing stage, the text was converted to lowercase, and elements such as non-ASCII characters, URLs, email addresses, Twitter handles, hashtags, ordinal numbers, offensive words, punctuation, and extra whitespace were removed. The cleaned text was then segmented into tokens and combined into n-grams.
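The cleaning steps listed above could be sketched roughly as follows. This is a Python illustration of the same idea, not the app's actual tm/gsub code: the regular expressions, the function name clean_text, the placeholder offensive-word list, and the choice to keep apostrophes inside contractions are all assumptions.

```python
import re

# Hypothetical placeholder; the app's real offensive-word list is not given.
OFFENSIVE_WORDS = {"badword"}


def clean_text(line):
    """Lowercase a line and strip the elements named in the preprocessing step."""
    text = line.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)   # URLs
    # Emails must go before handles, or "@mail" would be stripped on its own.
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"[@#]\w+", " ", text)                # Twitter handles, hashtags
    text = re.sub(r"\b\d+(st|nd|rd|th)\b", " ", text)   # ordinal numbers
    text = re.sub(r"[^\x00-\x7f]", " ", text)           # non-ASCII characters
    # Punctuation; apostrophes are kept here (an assumption) so "don't" survives.
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    words = [w for w in text.split() if w not in OFFENSIVE_WORDS]
    return " ".join(words)                              # collapses extra whitespace


raw = "Check https://example.com NOW!! Email me@mail.com or @user #tag the 2nd time"
cleaned = clean_text(raw)
# 'check now email or the time'
```

The order of the substitutions matters: URLs and email addresses are removed while their punctuation is still intact, and generic punctuation stripping comes last.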
When a user inputs text, the algorithm searches for matches by iterating from the longest n-gram (4-gram) down to the shortest (2-gram). The predicted next word is taken from the longest matching n-gram and, among matches of that length, the most frequent one. This is a simple back-off strategy: if no 4-gram matches the last three words typed, the algorithm falls back to 3-grams and then to 2-grams.
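The lookup described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's R implementation: the names build_model and predict_next, the dictionary-of-counters layout, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model


def predict_next(model, text, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for n in range(max_n, 1, -1):        # 4-gram, then 3-gram, then 2-gram
        if len(tokens) < n - 1:
            continue                     # input too short for this order
        context = tuple(tokens[-(n - 1):])
        followers = model.get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None                          # no n-gram of any order matched


# Toy corpus (hypothetical), already cleaned and tokenized.
corpus = [
    "i want to go home".split(),
    "you need to go out".split(),
    "time to eat".split(),
    "time to eat".split(),
    "time to eat".split(),
]
model = build_model(corpus)

predict_next(model, "we want to go")   # 'home' (4-gram match on "want to go")
predict_next(model, "ready to")        # 'eat'  (backs off to the 2-gram "to ...")
```

In the second call no 3-gram starts with "ready to", so the lookup backs off to the single-word context "to", where "eat" (3 occurrences) outranks "go" (2 occurrences).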