Prediction of next word

Ranto Ramananjato

2025-10-25

Spend less time in entry!

To start, we load the required packages and connect to the data. The R code is shown for information only and is not run.

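# Load packages, set working directory and load data (done ahead to save time)
library(readr) # provides read_lines()
library(tm)    # provides VCorpus(), tm_map() and the text transformations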
fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileurl, destfile ="Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
setwd("Week 2/final/en_US/")
blogs <- read_lines("en_US.blogs.txt", skip_empty_rows = TRUE)
news <- read_lines("en_US.news.txt", skip_empty_rows = TRUE)
twitter <- read_lines("en_US.twitter.txt", skip_empty_rows = TRUE)
all_text <- c(blogs, news, twitter)
rm(blogs, news, twitter)

Tidying data

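set.seed(12345) # fixed seed so the sample is reproducible
# keep each line with probability 0.01 (about a 1% sample of all_text)
trainingData <- all_text[rbinom(length(all_text), 1, 0.01) == 1]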
corpusFeeds <- VCorpus(VectorSource(trainingData))
corpusFeeds <- tm_map(corpusFeeds, removePunctuation) # remove punctuation
corpusFeeds <- tm_map(corpusFeeds, content_transformer(tolower)) # convert to lower case
corpusFeeds <- tm_map(corpusFeeds, content_transformer(remove_chars)) # remove internet characters (user-defined helper, not shown)
corpusFeeds <- tm_map(corpusFeeds, removeWords, stopwords("english")) # remove English stop words
corpusFeeds <- tm_map(corpusFeeds, content_transformer(remove_symbols)) # remove symbols (user-defined helper, not shown)
corpusFeeds <- tm_map(corpusFeeds, stripWhitespace) # remove extra spaces
fileurl2 <- "http://www.bannedwordlist.com/lists/swearWords.txt"
download.file(fileurl2, destfile = "badwords.txt")
badwords <- readLines("badwords.txt")
# removeWords expects a character vector of words, so pass badwords directly
corpusFeeds <- tm_map(corpusFeeds, removeWords, badwords) # remove profanity
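
The helpers `remove_chars` and `remove_symbols` used above are not defined in this report. The following is a minimal sketch of what they might look like, assuming that "internet chars" means URLs, e-mail addresses, @mentions and #hashtags, and that "symbols" means anything other than letters, apostrophes and spaces. Both definitions are hypothetical and are not the original author's code.

```r
# Hypothetical helpers: the report does not show the original definitions.

# Replace URLs, e-mail addresses, @mentions and #hashtags with a space.
remove_chars <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE) # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)              # e-mail addresses
  x <- gsub("[@#]\\S+", " ", x, perl = TRUE)               # mentions and hashtags
  x
}

# Replace every character that is not a letter, apostrophe or space.
remove_symbols <- function(x) {
  gsub("[^A-Za-z' ]", " ", x)
}
```

Under these assumptions, `remove_chars` would have to run before `removePunctuation`, because stripping punctuation first turns a URL such as `http://t.co/abc` into an ordinary-looking token that the patterns no longer match.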

Descriptive analysis

I chose two ways to present interesting findings. The first is a word cloud, which shows which words are used most frequently.
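
As an illustration of the word cloud step, here is a minimal sketch that counts word frequencies in base R and draws the cloud only if the `wordcloud` package is installed. The `cleaned` vector is toy data standing in for the cleaned corpus text; it is an assumption, not the report's data.

```r
# Toy input standing in for the cleaned corpus text (an assumption).
cleaned <- c("happy new year", "new year new start", "happy day")

# Split each line on whitespace and count how often each word occurs.
words <- unlist(strsplit(cleaned, "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)

# Draw the cloud only when the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(word_freq), as.integer(word_freq),
                       min.freq = 1, random.order = FALSE)
}
```

With `random.order = FALSE`, the most frequent words are placed first, near the centre of the plot.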

The second is a set of n-gram histograms, which show which combinations of words are used most frequently.
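
The n-gram counts behind such a histogram can be sketched in base R as follows. `make_ngrams` is a hypothetical helper written for this illustration, and `cleaned` is toy data; the report itself may have used a tokenizer package instead.

```r
# Build all n-grams of size n from one line of text (hypothetical helper).
make_ngrams <- function(line, n) {
  tokens <- strsplit(trimws(line), "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy input standing in for the cleaned corpus text (an assumption).
cleaned <- c("happy new year", "new year new start", "happy new year")

# Count bigrams across all lines, most frequent first.
bigrams <- unlist(lapply(cleaned, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Bar chart of the most frequent bigrams.
barplot(head(bigram_freq, 10), las = 2, main = "Top bigrams")
```

Calling the same helper with `n = 3` gives the trigram counts.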

Prediction

Using the bigrams and trigrams shown in the table, a model is built to predict the next word based on the user's input.
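
One simple way to turn bigram and trigram counts into a predictor is a back-off lookup: try the last two words against the trigrams, and fall back to the last word against the bigrams if nothing matches. The sketch below assumes this scheme; the report does not say which method was actually used, and the count vectors and function names here are hypothetical toy values.

```r
# Toy n-gram counts standing in for the tables built from the corpus.
bigram_freq  <- c("new year" = 3, "happy new" = 2, "new start" = 1)
trigram_freq <- c("happy new year" = 2, "new year new" = 1)

# Return the last word of the most frequent n-gram that starts with
# `prefix`, or NA if no n-gram matches.
best_match <- function(prefix, freq) {
  hits <- freq[startsWith(names(freq), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  top <- names(hits)[which.max(hits)]
  sub(".* ", "", top) # keep only the final word
}

# Try the trigram table first, then back off to the bigram table.
predict_next <- function(input) {
  tokens <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  n <- length(tokens)
  guess <- NA_character_
  if (n >= 2) {
    guess <- best_match(paste(tokens[(n - 1):n], collapse = " "), trigram_freq)
  }
  if (is.na(guess) && n >= 1) {
    guess <- best_match(tokens[n], bigram_freq)
  }
  guess
}

predict_next("Happy new") # matched by the trigram "happy new year"
predict_next("a new")     # no trigram match, falls back to the bigrams
```

An input whose last word appears in neither table returns `NA`, which an application could replace with a default such as the most frequent word overall.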