---
title: "Exploratory Analysis for Word Prediction"
author: "Your Name"
date: "`r Sys.Date()`"
output: html_document
---
This report demonstrates familiarity with a large text dataset and shows progress toward building a word prediction algorithm. The data comes from the Coursera SwiftKey dataset, which includes US English corpora from blogs, news, and Twitter.
Our objectives in this stage are:
The datasets contain:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

We read these files and calculate the number of lines, the number of words, and the file size in MB for each:
```r
# Read each corpus as UTF-8 text, one element per line
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8")
news    <- readLines("en_US.news.txt",    encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")

# Count whitespace-delimited words. gregexpr() returns -1 for a line with no
# match, so only positive match positions are counted.
count_words <- function(lines) {
  sum(vapply(gregexpr("\\S+", lines), function(m) sum(m > 0L), integer(1)))
}
blogsWords   <- count_words(blogs)
newsWords    <- count_words(news)
twitterWords <- count_words(twitter)

# Line counts, word counts, and file sizes in one table
summary_table <- data.frame(
  filename   = c("blogs", "news", "twitter"),
  numlines   = c(length(blogs), length(news), length(twitter)),
  numwords   = c(blogsWords, newsWords, twitterWords),
  filesizeMB = c(
    file.info("en_US.blogs.txt")$size / 1024 / 1024,
    file.info("en_US.news.txt")$size / 1024 / 1024,
    file.info("en_US.twitter.txt")$size / 1024 / 1024
  )
)

knitr::kable(summary_table)
```
To handle memory limitations, we clean and sample the data:
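```r
set.seed(1234)  # arbitrary seed so the random sample is reproducible

# Convert to UTF-8; sub = "byte" writes any non-convertible byte as its hex code
cleanBlogs <- iconv(blogs, to = "UTF-8", sub = "byte")

# Keep a random 0.1% of the lines
sampleBlogs <- sample(cleanBlogs, round(0.001 * length(cleanBlogs)))

# Repeat for news and twitter...
```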
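The "repeat for news and twitter" step could be wrapped in a small helper so that all three sources are sampled the same way. The following is a minimal sketch, not the code used in this report: `sample_lines` is a hypothetical name, the seed value is arbitrary, and the toy vectors stand in for the real cleaned corpora.

```r
# Hypothetical helper: draw a reproducible random fraction of the lines.
sample_lines <- function(lines, frac = 0.001, seed = 1234) {
  set.seed(seed)                             # fixed seed so reruns match
  n <- max(1L, round(frac * length(lines)))  # keep at least one line
  sample(lines, n)
}

# Toy stand-ins for the cleaned vectors (assumption: the real ones come
# from iconv() as in the chunk above).
toy_sources <- list(
  blogs   = paste("blog line", 1:5000),
  news    = paste("news line", 1:3000),
  twitter = paste("tweet", 1:8000)
)

# Apply the same sampling fraction to every source.
samples <- lapply(toy_sources, sample_lines, frac = 0.001)
lengths(samples)
```

Keeping the fraction as an argument makes it easy to raise the sample size later if memory allows.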
We process the sampled data using tm:
```r
library(tm)

# Build a corpus from the sampled lines, then normalize the text
doc.corpusBlogs <- Corpus(VectorSource(sampleBlogs))
doc.corpusBlogs <- tm_map(doc.corpusBlogs, content_transformer(tolower))  # lower-case
doc.corpusBlogs <- tm_map(doc.corpusBlogs, removePunctuation)             # drop punctuation
doc.corpusBlogs <- tm_map(doc.corpusBlogs, removeNumbers)                 # drop digits
doc.corpusBlogs <- tm_map(doc.corpusBlogs, stripWhitespace)               # collapse extra spaces
```
We compute term-document matrices and visualize the most common words using bar plots and word clouds.
```r
library(wordcloud)
library(RColorBrewer)  # provides brewer.pal()

# Term-document matrix and per-term totals, most frequent first
TDMBlogs <- TermDocumentMatrix(doc.corpusBlogs)
Blogsmatrix <- as.matrix(TDMBlogs)
FreqBlogs <- sort(rowSums(Blogsmatrix), decreasing = TRUE)
FreqDistBlogs <- data.frame(word = names(FreqBlogs), freq = FreqBlogs)

# Bar plot of the ten most frequent terms
barplot(FreqDistBlogs[1:10, ]$freq,
        names.arg = FreqDistBlogs[1:10, ]$word,
        col = "lightblue",
        main = "Top Words in Blogs")

# Word cloud of up to 100 terms
wordcloud(words = names(FreqBlogs),
          freq = FreqBlogs,
          max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```
The same analysis is repeated for the news and Twitter samples.
We plan to build a predictive model using N-grams (unigram, bigram, trigram). Based on user input, we will:
We’ll balance accuracy with performance to ensure responsiveness on the web app.
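As a rough illustration of how an N-gram lookup might behave, the sketch below builds unigram, bigram, and trigram frequency tables from a toy corpus and predicts the next word by trying the longest context first. It is only a sketch under stated assumptions, not the final model: `build_ngrams` and `predict_next` are hypothetical names, the toy sentences are invented, and backing off from trigram to bigram to unigram is one common strategy rather than a choice this report commits to.

```r
# Hypothetical helper: all n-grams of one tokenized sentence, as strings.
build_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy corpus standing in for the cleaned samples (invented sentences).
toy_text <- c("i love new york", "i love new shoes", "i love new york pizza")
tokens   <- strsplit(tolower(toy_text), "\\s+")

# Frequency tables for n = 1, 2, 3, sorted from most to least frequent.
freq <- lapply(1:3, function(n) {
  sort(table(unlist(lapply(tokens, build_ngrams, n = n))), decreasing = TRUE)
})

# Hypothetical predictor: use the last two words if a trigram matches,
# otherwise the last word, otherwise the most frequent single word.
predict_next <- function(input, freq) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in 3:2) {
    if (length(words) < n - 1) next
    context <- paste(utils::tail(words, n - 1), collapse = " ")
    prefix  <- paste0(context, " ")
    hits    <- freq[[n]][startsWith(names(freq[[n]]), prefix)]
    if (length(hits) > 0) {
      # Tables are sorted, so the first hit is the most frequent continuation.
      return(sub(prefix, "", names(hits)[1], fixed = TRUE))
    }
  }
  names(freq[[1]])[1]
}

predict_next("I love new", freq)
predict_next("love", freq)
```

Plain frequency tables keep lookups simple. A real app would likely prune rare N-grams to keep the tables small enough for fast responses.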
This report confirms that the data has been successfully loaded, explored, and cleaned. Preliminary visualizations show the most frequent terms. The next phase will focus on model building and app development.