This milestone report is part of the Coursera Data Science Capstone Project, whose final goal is to build a SwiftKey-style text prediction model and deploy it as a Shiny app. Here we load the data, clean it, perform an exploratory analysis, and outline the next steps for model development.
# Load Packages & Dataset
library(stringr)
library(tm)
library(ggplot2)
library(wordcloud)
blogs <- readLines("C:/Users/Parul Mittal/OneDrive/Documents/coursera-text-prediction-project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", warn = FALSE)
news <- readLines("C:/Users/Parul Mittal/OneDrive/Documents/coursera-text-prediction-project/Coursera-SwiftKey/final/en_US/en_US.news.txt", warn = FALSE)
twitter <- readLines("C:/Users/Parul Mittal/OneDrive/Documents/coursera-text-prediction-project/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", warn = FALSE)
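A quick check of the raw file sizes helps put the counts below in context; a minimal sketch, assuming the files sit in the same directory used in the calls above:

# Approximate size of each raw file in megabytes
data_dir <- "C:/Users/Parul Mittal/OneDrive/Documents/coursera-text-prediction-project/Coursera-SwiftKey/final/en_US"
round(file.size(file.path(data_dir,
                          c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))) / 1024^2, 1)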
# Summary Statistics
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(str_count(blogs, "\\w+")),
            sum(str_count(news, "\\w+")),
            sum(str_count(twitter, "\\w+")))
)
data_summary
##    Source   Lines
## 1   Blogs  899288
## 2    News 1010206
## 3 Twitter 2360148
# Sampling & Cleaning
set.seed(123)
sample_data <- c(
  sample(blogs, 2000),
  sample(news, 2000),
  sample(twitter, 2000)
)
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Word Frequency Plot
tdm <- TermDocumentMatrix(corpus)
freq <- rowSums(as.matrix(tdm))
freq <- sort(freq, decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq)
ggplot(head(word_freq, 20), aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency")
# Wordcloud Visualization
wordcloud(names(freq), freq, max.words = 100)
Twitter contributes by far the most lines, but its texts are much shorter than the blog and news entries; the sketch below compares line lengths directly.
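A minimal sketch of that comparison, reusing the vectors loaded above:

# Distribution of line lengths (in characters) per source
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) summary(nchar(x)))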
The most frequent terms are common English stopwords, so stopword removal is needed before modeling; a sketch of that step follows.
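A minimal sketch of that step, applied to the corpus built above and using tm's built-in English stopword list:

# Remove common English stopwords, then tidy up the extra whitespace left behind
corpus_clean <- tm_map(corpus, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, stripWhitespace)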
Next steps:

- Build 1-gram, 2-gram, and 3-gram text models using tidytext + tokenizers (a rough sketch follows this list)
- Create a probability-based next-word prediction matrix
- Implement a Shiny app (a minimal skeleton is sketched at the end of this report) with:
  - an input text box
  - output: the top predicted next word
- Improve accuracy using smoothing and backoff algorithms
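As a rough illustration of the n-gram, probability, and backoff steps, the sketch below builds bigram and trigram counts from sample_data with tidytext, converts the counts to maximum-likelihood probabilities, and backs off from trigram to bigram to the overall most frequent word when a context is unseen. The helper name predict_next_word and the simple "stupid backoff"-style lookup (rather than smoothed probabilities) are illustrative assumptions, not the final model.

library(dplyr)
library(tidyr)
library(tidytext)

text_df <- data.frame(text = sample_data, stringsAsFactors = FALSE)

# Bigram and trigram counts from the sampled text
# (unnest_tokens lowercases and strips punctuation by default)
bigrams <- text_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(ngram)) %>%
  count(ngram, sort = TRUE) %>%
  separate(ngram, into = c("w1", "w2"), sep = " ")

trigrams <- text_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(ngram)) %>%
  count(ngram, sort = TRUE) %>%
  separate(ngram, into = c("w1", "w2", "w3"), sep = " ")

# Maximum-likelihood probability of each word given its context
bigram_prob  <- bigrams  %>% group_by(w1)     %>% mutate(prob = n / sum(n)) %>% ungroup()
trigram_prob <- trigrams %>% group_by(w1, w2) %>% mutate(prob = n / sum(n)) %>% ungroup()

# Unigram counts as the last-resort fallback
unigram_top <- text_df %>% unnest_tokens(word, text) %>% count(word, sort = TRUE)

# Back off from trigram to bigram to the single most frequent word
predict_next_word <- function(phrase) {
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  k <- length(words)
  if (k >= 2) {
    hit <- filter(trigram_prob, w1 == words[k - 1], w2 == words[k])
    if (nrow(hit) > 0) return(hit$w3[which.max(hit$prob)])
  }
  if (k >= 1) {
    hit <- filter(bigram_prob, w1 == words[k])
    if (nrow(hit) > 0) return(hit$w2[which.max(hit$prob)])
  }
  unigram_top$word[1]
}

predict_next_word("thanks for the")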
The dataset was loaded and explored successfully. Word frequency patterns are clear and form the basis for building an N-gram language model. The next step is constructing the prediction algorithm and deploying a Shiny application.
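For the Shiny deployment, a minimal skeleton along these lines should be enough; it assumes the predict_next_word() helper sketched above, and the widget labels are placeholders:

library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("text", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$text)) == 0) return("")
    predict_next_word(input$text)
  })
}

shinyApp(ui = ui, server = server)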