Instructions
The goal of this project is to publish on RPubs a report that explains the exploratory analysis and the goals for the eventual prediction algorithm and application. The document should be concise, explain only the key features of the data identified so far, and briefly summarize the plan for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager. Tables and graphs are presented to illustrate important summaries of the data set. The motivation for this project is therefore to:
Demonstrate that the data was downloaded, imported and loaded successfully.
Create a basic report of summary statistics about the datasets.
Report any interesting findings to date.
Get feedback on plans to create a Shiny prediction algorithm and app.
Libraries to be loaded for Data Analysis
# Install pacman if necessary, then use it to install/load all required packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
pacman,
knitr,
tidyverse,
NLP,
openNLP,
qdapDictionaries,
qdapRegex,
qdapTools,
slam,
tools,
RWeka,
ngram,
stringr,
RColorBrewer,
SnowballC,
wordcloud,
wordcloud2
)
Final Product
The final product of the project will be an algorithm that predicts the next word of a supplied text input, similar to the text-prediction features found on modern smartphones.
Data Set
The HC Corpora SwiftKey dataset is comprised of output from several news, blog and Twitter sites. It contains three files for each of four languages (Russian, Finnish, German and English). This project focuses on the English (en_US) datasets:
- en_US.blogs.txt
- en_US.twitter.txt
- en_US.news.txt
NOTE: these files are referred to as "Blogs", "Twitter" and "News" in the remainder of this report.
Tasks performed on data
- Explore the training dataset;
- Profanity filtering - removal of inappropriate terms;
- Tokenization – identifying appropriate tokens such as words, punctuation and numbers
Importing the Data:
The training datasets for this project were downloaded from the HC Corpora SwiftKey dataset link.
library(dplyr)
library(kableExtra)
# Open read connections to the three English files (assumed to be in the current working directory, as noted below)
blogsC   <- file("./final/en_US/en_US.blogs.txt", "r")
newsC    <- file("./final/en_US/en_US.news.txt", "r")
twitterC <- file("./final/en_US/en_US.twitter.txt", "r")
# Read every line of each file as UTF-8 text
blogs <- readLines(blogsC, n = -1, encoding = "UTF-8")
news <- readLines(newsC, n = -1, encoding = "UTF-8")
twitter <- readLines(twitterC, n = -1, encoding = "UTF-8")
close(blogsC)
close(newsC)
close(twitterC)
# Count characters, lines and words in each file
nCharBlogs <- sum(nchar(blogs))
nCharNews <- sum(nchar(news))
nCharTwitter <- sum(nchar(twitter))
lenBlogs <- length(blogs)
lenNews <- length(news)
lenTwitter <- length(twitter)
nWordsBlogs <- sum(sapply(strsplit(blogs, " "), length))
nWordsNews <- sum(sapply(strsplit(news, " "), length))
nWordsTwitter <- sum(sapply(strsplit(twitter, " "), length))
outputTable <- data.frame(
c("blogs", "news", "twitter"),
c(nCharBlogs, nCharNews, nCharTwitter),
c(lenBlogs, lenNews, lenTwitter),
c(nWordsBlogs, nWordsNews, nWordsTwitter)
)
colnames(outputTable) <- c("FileType", "Characters", "Lines", "Words")
rm(blogs, news, twitter)
SwiftKey (en_US) Dataset Summary
FileType | Characters | Lines | Words |
---|---|---|---|
blogs | 206824505 | 899288 | 37334131 |
news | 15639408 | 77259 | 2643969 |
twitter | 162096031 | 2360148 | 30373543 |
Steps taken:
- Code to download and unzip the file is hidden; it is assumed that the files are available in the current working directory;
- Open the file connections;
- Read the files and close the connections;
- Compute the number of characters;
- Compute the number of lines;
- Compute the number of words;
- Build a Dataset Summary output Table;
- Release object memory;
- Summarize the data analysis.
Exploratory Analysis of Data
Since the full training dataset is very large, a sample of 10,000 lines is extracted from each file and explored for further study.
Sample Data:
FileType | Characters | Lines | Words |
---|---|---|---|
blogsSample | 2277384 | 10000 | 410620 |
newsSample | 2035687 | 10000 | 343929 |
twitterSample | 681544 | 10000 | 127674 |
Steps taken:
- Set the number of lines (10,000) to extract from each file;
- Open the file connection;
- Close the file connection;
- Compute the number of characters;
- Compute the number of lines;
- Compute the number of words;
- Build a Dataset Summary output Table (a sketch of the extraction step follows this list).
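A minimal sketch of this extraction, following the steps above, is shown below; the helper name extractLines and the fixed paths under ./final/en_US/ are illustrative assumptions, and a random sample via sample() could be used instead of taking the first 10,000 lines.

extractLines <- function(path, n = 10000) {
  con <- file(path, "r")                                               # open the file connection
  lines <- readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)   # read the first n lines
  close(con)                                                           # close the file connection
  lines
}
blogsS   <- extractLines("./final/en_US/en_US.blogs.txt")
newsS    <- extractLines("./final/en_US/en_US.news.txt")
twitterS <- extractLines("./final/en_US/en_US.twitter.txt")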
Data Cleansing & Corpora building
The text sample extracted from each of the files is transformed step by step to build the corpora for the predictive model.
Steps taken:
NOTE: Optional step - collate the sample text data and create a reduced raw data file before processing:
beforeData <- c(blogsS, newsS, twitterS)
- Remove retweets from the Twitter data sample;
- Remove @username mentions from the Twitter sample;
- Collate the text data from the different file samples;
- Replace abbreviations so that sentences are not split at incorrect places;
NOTE: Optional step - convert paragraphs to sentences:
endNotations <- c("?", ".", ",", "!", "|", ":", "\n", "\r\n")
sampleData <- sent_detect(
sampleData,
endmarks = endNotations,
rm.bracket = FALSE
)
- Collate Sample Text Data and create a reduced raw data file;
- Create Text Corpus for processing;
- Release object memory;
- Convert the text to lower case;
- Remove URLs and symbols from the corpora;
- Replace email addresses, website addresses and file-system paths with spaces;
- Remove most non-alphabetic characters (the text must be lower-cased first);
- Remove punctuation after trimming leading and trailing white space;
- Remove numbers from the text;
- Build the profane-word list;
- Remove profane words from the corpora;
- Remove the extra white space created by the removals;
- Inspect the sample text data after processing (a sketch of these transformations follows this list).
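A minimal sketch of several of these transformations, using the tm package that the later code calls via tm::, is shown below. The objects sampleData and sampleCorpora follow the names used elsewhere in this report, profanityWords is an assumed name for the profane-word list, and the URL/email substitutions are omitted for brevity.

library(tm)
# Build a corpus from the collated sample text and apply the cleaning steps
sampleCorpora <- VCorpus(VectorSource(sampleData))
sampleCorpora <- tm_map(sampleCorpora, content_transformer(tolower))   # convert to lower case
sampleCorpora <- tm_map(sampleCorpora, removePunctuation)              # remove punctuation
sampleCorpora <- tm_map(sampleCorpora, removeNumbers)                  # remove numbers
sampleCorpora <- tm_map(sampleCorpora, removeWords, profanityWords)    # remove profane words
sampleCorpora <- tm_map(sampleCorpora, stripWhitespace)                # collapse extra white space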
Please find below the first few lines of text data before and after processing in the Corpora:
## [1] "In the years thereafter, most of the Oil fields and platforms were"
## [2] "named after pagan “gods”."
## [3] "We love you Mr. Brown."
## [4] "Chad has been awesome with the kids and holding down the fort while I"
## [5] "work later than usual! The kids have been busy together playing"
## [6] "Skylander on the XBox together, after Kyan cashed in his $$$ from his"
## [7] "piggy bank. He wanted that game so bad and used his gift card from his"
## [8] "birthday he has been saving and the money to get it (he never taps into"
## [9] "that thing either, that is how we know he wanted it so bad). We made"
## [10] "him count all of his money to make sure that he had enough! It was very"
## [11] "cute to watch his reaction when he realized he did! He also does a very"
## [12] "good job of letting Lola feel like she is playing too, by letting her"
## [13] "switch out the characters! She loves it almost as much as him."
## [14] "so anyways, i am going to share some home decor inspiration that i have"
## [15] "been storing in my folder on the puter. i have all these amazing images"
## [16] "stored away ready to come to life when we get our home."
## [17] "With graduation season right around the corner, Nancy has whipped up a"
## [18] "fun set to help you out with not only your graduation cards and gifts,"
## [19] "but any occasion that brings on a change in one's life. I stamped the"
## [20] "images in Memento Tuxedo Black and cut them out with circle"
## [21] "Nestabilities. I embossed the kraft and red cardstock with TE's new"
## [22] "Stars Impressions Plate, which is double sided and gives you 2"
## [23] "fantastic patterns. You can see how to use the Impressions Plates in"
## [24] "this tutorial Taylor created. Just one pass through your die cut"
## [25] "machine using the Embossing Pad Kit is all you need to do - super easy!"
## [1] "in the years thereafter most of the oil fields and platforms were named"
## [2] "after pagan gods"
## [3] "we love you mister brown"
## [4] "chad has been awesome with the kids and holding down the fort while i"
## [5] "work later than usual the kids have been busy together playing"
## [6] "skylander on the xbox together after kyan cashed in his from his piggy"
## [7] "bank he wanted that game so bad and used his gift card from his"
## [8] "birthday he has been saving and the money to get it he never taps into"
## [9] "that thing either that is how we know he wanted it so bad we made him"
## [10] "count all of his money to make sure that he had enough it was very cute"
## [11] "to watch his reaction when he realized he did he also does a very good"
## [12] "job of letting lola feel like she is playing too by letting her switch"
## [13] "out the characters she loves it almost as much as him"
## [14] "so anyways i am going to share some home decor inspiration that i have"
## [15] "been storing in my folder on the puter i have all these amazing images"
## [16] "stored away ready to come to life when we get our home"
## [17] "with graduation season right around the corner nancy has whipped up a"
## [18] "fun set to help you out with not only your graduation cards and gifts"
## [19] "but any occasion that brings on a change in ones life i stamped the"
## [20] "images in memento tuxedo and cut them out with circle nestabilities i"
## [21] "embossed the kraft and red cardstock with tes new stars impressions"
## [22] "plate which is double sided and gives you fantastic patterns you can"
## [23] "see how to use the impressions plates in this tutorial taylor created"
## [24] "just one pass through your cut machine using the embossing pad kit is"
## [25] "all you need to do super easy"
Tokenization of text - Creation of N-grams
Tokenization is performed to obtain the one- (uni-), two- (bi-), three- (tri-) and four-word (tetra-gram) combinations that appear frequently in the text corpus; a sketch of the tokenization step is shown below.
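A minimal sketch of how the bigram table can be built with the loaded RWeka and slam packages follows; the same pattern, with different min/max values, yields the uni-, tri- and tetra-gram tables. The exact call is an assumption, while NG2 and sampleCorpora follow the names used elsewhere in this report.

# Tokenize the corpus into two-word combinations and count them
bigramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
tdm2  <- tm::TermDocumentMatrix(sampleCorpora, control = list(tokenize = bigramTokenizer))
freq2 <- sort(slam::row_sums(tdm2), decreasing = TRUE)   # total count for each bigram
NG2   <- data.frame(ngrams = names(freq2), freq = freq2, prop = freq2 / sum(freq2))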
Inspect first few entries in N-grams generated
NG1
ngrams | freq | prop |
---|---|---|
the | 41835 | 0.0466667 |
“, | 25117 | 0.0280179 |
to | 23813 | 0.0265633 |
and | 22437 | 0.0250284 |
a | 21024 | 0.0234522 |
of | 18805 | 0.0209769 |
in | 14262 | 0.0159092 |
i | 11996 | 0.0133815 |
that | 9492 | 0.0105883 |
is | 9220 | 0.0102849 |
NG2
ngrams | freq | prop |
---|---|---|
of the | 4119 | 0.0045947 |
in the | 3679 | 0.0041039 |
to the | 1993 | 0.0022232 |
on the | 1654 | 0.0018450 |
for the | 1633 | 0.0018216 |
“,” | 1532 | 0.0017089 |
to be | 1448 | 0.0016152 |
“,”the | 1228 | 0.0013698 |
and the | 1206 | 0.0013453 |
at the | 1179 | 0.0013152 |
NG3
ngrams | freq | prop |
---|---|---|
one of the | 300 | 0.0003346 |
a lot of | 278 | 0.0003101 |
the u s | 158 | 0.0001762 |
to be a | 151 | 0.0001684 |
as well as | 145 | 0.0001617 |
the end of | 139 | 0.0001551 |
going to be | 133 | 0.0001484 |
out of the | 127 | 0.0001417 |
i don t | 122 | 0.0001361 |
some of the | 121 | 0.0001350 |
NG4
ngrams | freq | prop |
---|---|---|
the end of the | 75 | 8.37e-05 |
at the end of | 59 | 6.58e-05 |
for the first time | 57 | 6.36e-05 |
the rest of the | 56 | 6.25e-05 |
in the middle of | 51 | 5.69e-05 |
“,”thanks for the | 48 | 5.35e-05 |
one of the most | 47 | 5.24e-05 |
is one of the | 43 | 4.80e-05 |
when it comes to | 40 | 4.46e-05 |
at the same time | 37 | 4.13e-05 |
Make a new sentence:
To generate a new sentence from the n-grams produced above, call
babble(ng = ng2, genlen = 15, seed = 123445)
which returns a randomly formed sentence of 15 words:
## [1] "you francophiles during june at \", \"fax \", \"indeed the uninsured crisis itself is causing "
Visual Inspection of tokenized words
Using the corpus of documents, we now construct a Term Document Matrix (TDM). This object is a simple triplet matrix, a structure that is efficient for storing large sparse matrices, with each term as a row and each document as a column.
- Build the term document matrix with the default single-word tokenizer; words shorter than 3 characters are omitted:
sampleTDM <- tm::TermDocumentMatrix(sampleCorpora, control = list(wordLengths = c(3, Inf)))
# Sum each term's count across all documents and put the totals in a data frame
sampleFreqWords <- data.frame(word = sampleTDM$dimnames$Terms, frequency = slam::row_sums(sampleTDM))
# Reorder the word list in descending order
sampleFreqWords <- plyr::arrange(sampleFreqWords, -frequency)
# Build Most frequent terms
n <- 25L # variable to set top n words
# isolate top n words by decreasing frequency
sampleFreqWords.top <- sampleFreqWords[1:n, ]
# reorder levels so charts plot in order of frequency
sampleFreqWords.top$word <- reorder(sampleFreqWords.top$word, sampleFreqWords.top$frequency)
Term analysis
Frequent Terms
Wordcloud
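A word cloud of the most frequent terms can be drawn directly from the frequency table built above; the sketch below uses the loaded wordcloud and RColorBrewer packages, with size and colour settings that are assumptions.

set.seed(1234)  # word placement is random, so fix the seed for a reproducible layout
wordcloud::wordcloud(words = sampleFreqWords$word,
                     freq = sampleFreqWords$frequency,
                     max.words = 100,                     # show at most 100 terms
                     random.order = FALSE,                # place the most frequent terms centrally
                     colors = RColorBrewer::brewer.pal(8, "Dark2"))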
Next Steps Forward - Prediction Algorithm
Moving forward, the project goal is to develop a natural language prediction algorithm and app. For example, if a user were to type "I want to go to the", the app would suggest the three most likely words to come next.
N-gram Dictionary
While the word analysis performed in this document is helpful for initial exploration, the data analyst will need to construct a dictionary of bigrams, trigrams and four-grams, collectively called n-grams. Bigrams are two-word phrases, trigrams are three-word phrases and four-grams are four-word phrases.
Here is an example of trigrams from the randomly sampled corpus. Recall that punctuation, numbers and profanity were removed during cleaning, so some phrases may look choppy. In the final dictionary, phrases and words of any length will be retained.
Trigrams
Frequent Terms Trigrams
word | frequency |
---|---|
one of the | 330 |
a lot of | 284 |
the u s | 166 |
to be a | 152 |
as well as | 147 |
the end of | 139 |
going to be | 137 |
i don t | 134 |
out of the | 129 |
some of the | 124 |
Wordcloud Trigrams
Predicting from N-grams
Each n-gram will be split, separating the last word from the previous words in the n-gram.
- bigrams will become unigram/unigram pairs
- trigrams will become bigram/unigram pairs
- four-grams will become trigram/unigram pairs
For each pair, the three most frequent occurrences will be stored in the dictionary. Here are the most frequent trigrams beginning with the bigram "cant wait" in the randomly sampled corpus. These ten trigrams would be split into bigram/unigram pairs and stored in the sample dictionary. Dictionaries will be built for the whole data set (a sketch of the splitting step follows the output below).
## word frequency
## 1 cant wait to 47
## 2 cant wait for 16
## 3 cant wait till 7
## 4 cant wait until 3
## 5 cant wait my 2
## 6 cant wait any 1
## 7 cant wait but 1
## 8 cant wait good 1
## 9 cant wait ily 1
## 10 cant wait rt 1
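A minimal sketch of this splitting step is shown below, assuming the n-gram tables (for example NG3) keep the ngrams and freq columns shown earlier; the helper name splitNgrams and the dplyr/stringr approach are illustrative.

library(dplyr)
library(stringr)
# Split each n-gram into its prefix and last word, then keep the three
# most frequent continuations for every prefix
splitNgrams <- function(ngramTable, keep = 3) {
  ngramTable %>%
    mutate(prefix = word(as.character(ngrams), 1, -2),     # everything but the last word
           nextWord = word(as.character(ngrams), -1)) %>%  # the last word only
    group_by(prefix) %>%
    slice_max(freq, n = keep, with_ties = FALSE) %>%
    ungroup()
}
bigramPairs <- splitNgrams(NG3)   # trigrams become bigram/unigram pairs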
Application Logic
After the dictionaries have been established, an app will be developed that allows the user to enter text. The app will suggest the three most likely words to come next in the text, based on these rules:
- If the supplied text is greater than 2 words, take the last three words of the text and search the trigram/unigram pairs.
- If the supplied text is 2 words, take the two words and search the bigram/unigram pairs.
- If the supplied text is 1 word, search for that word in the unigram/unigram pairs.
- Suggest the three most frequent unigrams from the matching n-gram/unigram pairs in case 1, 2 or 3 above (a sketch of this back-off lookup is shown below).
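A minimal sketch of this back-off logic follows; the dictionary objects (unigramPairs, bigramPairs, trigramPairs, each with prefix, nextWord and freq columns) and the helper names mirror the rules above and are assumptions rather than the final implementation.

library(stringr)
# Return up to three suggested next words for the supplied text
predictNext <- function(text, n = 3) {
  words <- str_split(str_to_lower(str_trim(text)), "\\s+")[[1]]
  lookup <- function(pairs, prefix) {
    hits <- pairs[pairs$prefix == prefix, ]
    head(hits$nextWord[order(-hits$freq)], n)
  }
  if (length(words) >= 3) {          # rule 1: last three words vs trigram/unigram pairs
    lookup(trigramPairs, paste(tail(words, 3), collapse = " "))
  } else if (length(words) == 2) {   # rule 2: both words vs bigram/unigram pairs
    lookup(bigramPairs, paste(words, collapse = " "))
  } else {                           # rule 3: the single word vs unigram/unigram pairs
    lookup(unigramPairs, words)
  }
}
# e.g. predictNext("I want to go to the") searches the trigram prefix "go to the"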