Ranto Ramananjato
2025-10-25
To start, we load the required packages and connect to the data. The R code is shown for information only and is not run.
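# Load the required packages
library(readr) # read_lines()
library(tm)    # VCorpus(), tm_map(), stopwords()
# Set working directory and load data (done ahead to save time)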
fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileurl, destfile ="Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
setwd("Week 2/final/en_US/")
blogs <- read_lines("en_US.blogs.txt", skip_empty_rows = TRUE)
news <- read_lines("en_US.news.txt", skip_empty_rows = TRUE)
twitter <- read_lines("en_US.twitter.txt", skip_empty_rows = TRUE)
all_text <- c(blogs, news, twitter)
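rm(blogs, news, twitter)
# Draw a reproducible sample of about 1% of the lines
# (assumption: linesInFile and fileNLine, not defined above, refer to all_text)
linesInFile <- all_text
fileNLine <- length(linesInFile)
set.seed(12345)
trainingData <- linesInFile[rbinom(fileNLine, 1, 0.01) == 1]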
corpusFeeds <- VCorpus(VectorSource(trainingData))
corpusFeeds <- tm_map(corpusFeeds, removePunctuation) # remove punctuation
corpusFeeds <- tm_map(corpusFeeds, content_transformer(tolower)) # convert to lower case
corpusFeeds <- tm_map(corpusFeeds, content_transformer(remove_chars)) # remove internet chars
corpusFeeds <- tm_map(corpusFeeds, removeWords, stopwords("english")) # remove English stop words
corpusFeeds <- tm_map(corpusFeeds, content_transformer(remove_symbols)) #remove symbols
corpusFeeds <- tm_map(corpusFeeds, stripWhitespace) # remove extra spaces
I chose two ways to present interesting findings. The first is a
word cloud, which shows which words are used most frequently.
The second is a set of n-gram histograms, which show which
combinations of words occur most often.
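Both views rest on frequency counts. The following is a minimal base R sketch of how such counts could be obtained, not the code used for this report. The input `toy_lines` and the helper `make_ngrams` are hypothetical names introduced only for illustration.

```r
# Toy input standing in for the cleaned corpus text (hypothetical data)
toy_lines <- c("the cat sat on the mat", "the cat ate the fish")

# Split each line into lower-case word tokens
tokens <- strsplit(tolower(toy_lines), "\\s+")

# Word frequencies: the counts a word cloud would scale its words by
word_freq <- sort(table(unlist(tokens)), decreasing = TRUE)

# Hypothetical helper: build n-grams within one line,
# so that no n-gram spans two different lines
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Bigram frequencies: the counts an n-gram histogram would display
bigram_freq <- sort(table(unlist(lapply(tokens, make_ngrams, n = 2))),
                    decreasing = TRUE)

print(head(word_freq, 5))
print(head(bigram_freq, 5))
```

Calling `make_ngrams` with `n = 3` would give trigram counts in the same way, and either table could be passed to a plotting function to draw the histogram.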
Using the bigrams and trigrams shown in the table, a model is built to
predict the next word based on the user's input.
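One simple way such a model could work is to look up the last two words of the input among the trigrams and fall back to the last word among the bigrams when nothing matches. The sketch below illustrates that idea only. The count vectors `bigrams` and `trigrams` hold made-up numbers, and `best_match` and `predict_next` are hypothetical helpers, not the model built for this report.

```r
# Hypothetical n-gram counts; real counts would come from the sampled corpus
bigrams  <- c("thank you" = 50, "thank god" = 10, "you are" = 30)
trigrams <- c("i thank you" = 12, "i thank god" = 3)

# Return the word that most often follows `prefix` in a named count vector,
# or NA if no n-gram starts with that prefix
best_match <- function(prefix, counts) {
  key  <- paste0(prefix, " ")
  hits <- counts[startsWith(names(counts), key)]
  if (length(hits) == 0) return(NA_character_)
  top <- names(hits)[which.max(hits)]
  substring(top, nchar(key) + 1)
}

# Try the trigram table first, then back off to the bigram table
predict_next <- function(input) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  n <- length(words)
  guess <- NA_character_
  if (n >= 2) {
    guess <- best_match(paste(words[(n - 1):n], collapse = " "), trigrams)
  }
  if (is.na(guess) && n >= 1) {
    guess <- best_match(words[n], bigrams)
  }
  guess
}

print(predict_next("I thank"))   # matched in the trigram table
print(predict_next("we thank"))  # falls back to the bigram table
```

Backing off to shorter n-grams lets the model still return a guess when the exact two-word context was never seen in the sample. An input with no match in either table returns `NA`.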