Data Science Specialization from Johns Hopkins University
Author
Daniel Morales
Published
June 24, 2025
Introduction
This is the Milestone Report for the Capstone Project of the Coursera / Johns Hopkins University Data Science Specialization. The goal of the Capstone Project is to create a Shiny App with a text box that, much like a smartphone keyboard, uses the given data to suggest three options for what the next typed word might be.
The goal of this Milestone Report is to show that we are able to download, explore, and start to model the data. The data is available to download here, and we will be using the English files listed below:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
We assume the data has already been downloaded, unzipped, and placed in the current R working directory.
Setup
We start by loading the required R packages and reading in the data.
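A minimal sketch of this step; the package list is an assumption based on the functions used later in the report, and the files are assumed to sit in the working directory:

library(tm)         # corpus creation and cleaning
library(tidytext)   # tokenization into n-grams
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(patchwork)  # stacking plots (p1 / p2 / p3)
library(stringi)    # fast word counting

# Read the three English files
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)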
As expected, the maximum number of characters per line in the Twitter dataset is 140, reflecting the character limit in place when the data was collected. Now let us look at some statistics on word counts and words per line.
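A minimal sketch of how these counts can be computed (stri_count_words() from stringi is one option; the list name words_per_line matches the plotting code that follows):

texts <- list(blogs = blogs, news = news, twitter = twitter)

# Words per line for each source; the plots below index this list by position
words_per_line <- lapply(texts, stringi::stri_count_words)

# Line and word summaries per source
data.frame(
  source      = names(texts),
  lines       = sapply(texts, length),
  total_words = sapply(words_per_line, sum),
  mean_wpl    = round(sapply(words_per_line, mean), 1),
  max_wpl     = sapply(words_per_line, max)
)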
Next, we visualize the distribution of words per line in each dataset.
# Histograms of words per line for each source; patchwork stacks the plots
p1 <- ggplot(data.frame(blogs_wpl = words_per_line[[1]]), aes(x = blogs_wpl)) +
  geom_histogram(binwidth = 40, color = "black", fill = "lightblue") +
  labs(title = "US Blogs", x = "Words per Line", y = "Frequency") +
  theme_bw() +
  theme(
    panel.background      = element_rect(fill = "transparent", color = NA),
    plot.background       = element_rect(fill = "transparent", color = NA),
    legend.background     = element_rect(fill = "transparent", color = NA),
    legend.box.background = element_rect(fill = "transparent", color = NA)
  )

p2 <- ggplot(data.frame(news_wpl = words_per_line[[2]]), aes(x = news_wpl)) +
  geom_histogram(binwidth = 20, color = "black", fill = "lightblue") +
  labs(title = "US News", x = "Words per Line", y = "Frequency") +
  theme_bw() +
  theme(
    panel.background      = element_rect(fill = "transparent", color = NA),
    plot.background       = element_rect(fill = "transparent", color = NA),
    legend.background     = element_rect(fill = "transparent", color = NA),
    legend.box.background = element_rect(fill = "transparent", color = NA)
  )

p3 <- ggplot(data.frame(twitter_wpl = words_per_line[[3]]), aes(x = twitter_wpl)) +
  geom_histogram(binwidth = 2, color = "black", fill = "lightblue") +
  labs(title = "US Twitter", x = "Words per Line", y = "Frequency") +
  theme_bw() +
  theme(
    panel.background      = element_rect(fill = "transparent", color = NA),
    plot.background       = element_rect(fill = "transparent", color = NA),
    legend.background     = element_rect(fill = "transparent", color = NA),
    legend.box.background = element_rect(fill = "transparent", color = NA)
  )

p1 / p2 / p3
Preparing the Data
To prepare the data for modeling, we begin by drawing a random sample of 5,000 lines from each of the three text sources—blogs, news articles, and Twitter posts. This sampling step helps reduce computational cost while retaining diversity across different writing styles.
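A minimal sketch of the sampling step, assuming the blogs, news, and twitter vectors from the setup above; the seed and object names are illustrative:

set.seed(2025)          # for reproducibility (seed value is arbitrary)
sample_size <- 5000

sampled_text <- c(
  sample(blogs,   sample_size),
  sample(news,    sample_size),
  sample(twitter, sample_size)
)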
Next, we create a text corpus using the tm package’s VCorpus function, which structures the sampled data for further text processing. We then apply a series of preprocessing steps to clean the text and make it suitable for natural language processing:
Lowercasing all text to ensure consistency (e.g., “The” and “the” are treated the same).
Removing punctuation and numbers, which usually do not contribute meaningful information for word prediction.
Removing stopwords, such as “the”, “is”, and “and”, which are extremely common but add little value to the predictive model.
Stripping excess whitespace introduced by earlier transformations.
Removing profanity, using a predefined list of offensive terms obtained from Carnegie Mellon University’s resource.
These cleaning steps help reduce noise and standardize the text, preparing it for tokenization and n-gram modeling in the next stages.
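A sketch of this cleaning pipeline using tm, assuming the sampled_text vector from the previous step and a locally saved profanity word list (the file name is an assumption); the resulting object, corpus_treated, is the one used in the analysis below:

profanity <- readLines("profanity_list.txt", encoding = "UTF-8", skipNul = TRUE)

corpus_treated <- VCorpus(VectorSource(sampled_text)) |>
  tm_map(content_transformer(tolower)) |>     # lowercase
  tm_map(removePunctuation) |>                # drop punctuation
  tm_map(removeNumbers) |>                    # drop numbers
  tm_map(removeWords, stopwords("en")) |>     # drop common stopwords
  tm_map(removeWords, profanity) |>           # drop profanity
  tm_map(stripWhitespace)                     # collapse extra whitespace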
To illustrate the effect of the cleaning, here are a few sample lines before and after treatment:

Before: I’m not sure if I’ll get this entire treatment since it’s just my back that needs work, but I hope it’s what I get. I kind of can’t imagine a ‘smaller’ version of this.
After: ’m sure ’ll get entire treatment since ’s just back needs work hope ’s get kind can’t imagine ‘smaller’ version

Before: Here’s what I finally did for the frosting…
After: heres finally frosting

Before: PS ~ no news on whether or not Anna is going to the tourney… looks like I’ll have to wait till Friday to see if I’m going to see Swimmer. Also, my friend correctly guessed his name on the first try. Lucky guess…
After: ps news whether anna going tourney… looks like ’ll wait till friday see ’m going see swimmer also friend correctly guessed name first try lucky guess…

Before: The upmarket grocery retailer has matched prices on 1,000 branded lines since September 2010 and is now expanding the offer to 7,000 products.
After: upmarket grocery retailer matched prices branded lines since september now expanding offer products

Before: Bible Doctrines I
After: doctrines
Exploratory Data Analysis
To better understand the structure and most common word patterns in the dataset, we tokenize the cleaned text into unigrams (single words), bigrams (two-word combinations), and trigrams (three-word combinations). This process helps reveal the most frequent terms and phrases, which will be valuable for building the predictive model later.
To ensure clarity and avoid visual clutter, we remove any missing values (NA) that may have resulted from the tokenization process. We also filter out rare combinations and display only the top 20 most frequent entries in each category.
This analysis gives us insight into the common language patterns used across the different text sources, and will serve as a foundation for training our n-gram model for word prediction.
# Convert the cleaned text corpus into a tidy data frame
text_df <- data.frame(text = sapply(corpus_treated, as.character),
                      stringsAsFactors = FALSE)

# Unigrams
unigrams <- text_df |>
  unnest_tokens(output = word, input = text) |>
  filter(!is.na(word)) |>
  count(word, sort = TRUE) |>
  top_n(20, n)

ggplot(unigrams, aes(x = reorder(word, n), y = n)) +
  geom_col(color = "black", fill = "lightblue") +
  coord_flip() +
  labs(title = "Top 20 Unigrams", x = NULL, y = "Frequency") +
  theme_bw()
bigrams <- text_df |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  count(bigram, sort = TRUE) |>
  top_n(20, n)

ggplot(bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(color = "black", fill = "lightblue") +
  coord_flip() +
  labs(title = "Top 20 Bigrams", x = NULL, y = "Frequency") +
  theme_bw()
trigrams <- text_df |>
  unnest_tokens(trigram, text, token = "ngrams", n = 3) |>
  filter(!is.na(trigram)) |>
  count(trigram, sort = TRUE) |>
  top_n(20, n)

ggplot(trigrams, aes(x = reorder(trigram, n), y = n)) +
  geom_col(color = "black", fill = "lightblue") +
  coord_flip() +
  labs(title = "Top 20 Trigrams", x = NULL, y = "Frequency") +
  theme_bw()
Naive Predictor Prototype
As a preliminary step toward building the final word prediction model, a simple, rule-based predictor was implemented using n-gram frequency tables. This naive predictor uses the cleaned and tokenized corpus to estimate the most likely next word based on the user’s most recent one or two words.
The prediction logic follows a basic back-off strategy (a code sketch follows the list):
If the user input ends with two or more words, the model looks up the most frequent trigrams and returns the top three most common continuations.
If only one word is provided, it falls back to bigrams.
If no matching bigrams or trigrams are found, the model defaults to suggesting the most common unigrams overall.
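A minimal sketch of this back-off logic, assuming full (unfiltered) unigram, bigram, and trigram frequency tables built the same way as in the exploratory step (i.e., without the top-20 filter); the function name is illustrative and the input cleaning is simplified (it does not remove stopwords):

library(dplyr)
library(stringr)

predict_next_words <- function(input, n_pred = 3) {
  # Roughly mirror the corpus cleaning: lowercase, keep letters and apostrophes
  words <- input |>
    tolower() |>
    str_replace_all("[^a-z' ]", " ") |>
    str_squish() |>
    str_split(" ") |>
    unlist()
  words <- words[words != ""]

  preds <- character(0)

  # 1) Two or more words: look up trigrams starting with the last two words
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    preds <- trigrams |>
      filter(str_starts(trigram, paste0(prefix, " "))) |>
      arrange(desc(n)) |>
      mutate(next_word = word(trigram, 3)) |>
      pull(next_word)
  }

  # 2) Back off to bigrams starting with the last word
  if (length(preds) < n_pred && length(words) >= 1) {
    prefix <- tail(words, 1)
    preds <- c(
      preds,
      bigrams |>
        filter(str_starts(bigram, paste0(prefix, " "))) |>
        arrange(desc(n)) |>
        mutate(next_word = word(bigram, 2)) |>
        pull(next_word)
    )
  }

  # 3) Default: the most frequent unigrams overall (also covers empty input)
  preds <- c(preds, unigrams |> arrange(desc(n)) |> pull(word))

  head(unique(preds), n_pred)
}

predict_next_words("happy new")   # example call; output depends on the sampled corpus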
This early prototype serves as a proof of concept for the final app and demonstrates that meaningful word predictions can be generated from the n-gram structure of the data, even using a simple frequency-based approach. While it lacks the sophistication of a smoothed probabilistic model, it provides a functional baseline to test the predictive pipeline.
To illustrate the behavior of the initial naive prediction model, we tested the function with a set of common input phrases. The table below shows each input along with the top three word predictions generated by the model.
Note that, although the model can handle an empty input (by falling back to the most common unigrams), one key limitation of the naive approach is its inability to handle unseen inputs: phrases or word combinations that do not appear in the training data. Since the model relies entirely on matching sequences in the n-gram frequency tables, it cannot generate meaningful predictions for novel or rare contexts. This highlights the need for a more robust approach in the final model, incorporating smoothing techniques or fallback strategies that can generalize beyond the observed data.
Next steps
Refine the prediction algorithm using probabilistic methods, such as Stupid Backoff or Kneser-Ney smoothing, to improve accuracy and handling of unseen inputs.
Optimize performance by using efficient data structures (e.g., data.table) and pruning low-frequency n-grams.
Build and deploy the final Shiny app, which will allow users to enter text and receive three suggestions for the next word in real time.
Evaluate the model’s effectiveness and consider user feedback for additional improvements.
These steps will bring the project closer to its final goal: a functional and responsive word prediction app similar to those used in mobile text input systems.