title: "Exploratory Analysis for Word Prediction" author: "Your Name" date: "r Sys.Date()" output: html_document ----------------------

## Introduction

This project demonstrates familiarity with large text datasets and shows progress toward building a word prediction algorithm. The data comes from the Coursera SwiftKey dataset, which includes US English corpora from blogs, news, and Twitter.

Our objectives in this stage are:

- Load the three corpora and report basic summary statistics (line counts, word counts, file sizes).
- Clean and sample the data to keep memory use manageable.
- Explore the most frequent words in each sample.
- Outline the plan for the prediction algorithm and web app.

## Loading the Data

The datasets contain three US English files:

- `en_US.blogs.txt`
- `en_US.news.txt`
- `en_US.twitter.txt`

We read these files and calculate the number of lines, the number of words, and the file size in megabytes for each:

```r
# skipNul avoids warnings from embedded nul characters in these files
blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Count words as runs of non-whitespace characters
blogsWords   <- sum(sapply(gregexpr("\\S+", blogs), length))
newsWords    <- sum(sapply(gregexpr("\\S+", news), length))
twitterWords <- sum(sapply(gregexpr("\\S+", twitter), length))

summary_table <- data.frame(
  filename   = c("blogs", "news", "twitter"),
  numlines   = c(length(blogs), length(news), length(twitter)),
  numwords   = c(blogsWords, newsWords, twitterWords),
  filesizeMB = c(
    file.info("en_US.blogs.txt")$size / 1024 / 1024,
    file.info("en_US.news.txt")$size / 1024 / 1024,
    file.info("en_US.twitter.txt")$size / 1024 / 1024
  )
)

knitr::kable(summary_table)
```


## Data Cleaning & Sampling

To stay within memory limits, we convert each corpus to valid UTF-8 and work with a 0.1% random sample:

```r
set.seed(1234)  # make the sampling reproducible
cleanBlogs  <- iconv(blogs, to = "UTF-8", sub = "byte")
sampleBlogs <- sample(cleanBlogs, round(0.001 * length(cleanBlogs)))
# Repeat for news and twitter...
```
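
Rather than copying those lines for each corpus, the same steps could be wrapped in a small helper. This is only a sketch; `cleanAndSample` and its `fraction` argument are our own names, not part of the original analysis:

```r
# Hypothetical helper: clean a character vector and draw a random sample.
# `fraction` is the share of lines to keep (0.001 = 0.1%).
cleanAndSample <- function(lines, fraction = 0.001) {
  cleaned <- iconv(lines, to = "UTF-8", sub = "byte")
  sample(cleaned, round(fraction * length(cleaned)))
}

sampleNews    <- cleanAndSample(news)
sampleTwitter <- cleanAndSample(twitter)
```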


## Text Preprocessing

We process the sampled data with the `tm` package, lower-casing the text and stripping punctuation, numbers, and extra whitespace:

```r
library(tm)

doc.corpusBlogs <- Corpus(VectorSource(sampleBlogs))
doc.corpusBlogs <- tm_map(doc.corpusBlogs, content_transformer(tolower))
doc.corpusBlogs <- tm_map(doc.corpusBlogs, removePunctuation)
doc.corpusBlogs <- tm_map(doc.corpusBlogs, removeNumbers)
doc.corpusBlogs <- tm_map(doc.corpusBlogs, stripWhitespace)
```
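
The same pipeline applies to the other two samples. One way to avoid repetition is a small wrapper; this is a sketch, and `preprocessCorpus` is our own name:

```r
# Hypothetical wrapper: run the full tm cleaning pipeline on any text vector
preprocessCorpus <- function(textVec) {
  crp <- Corpus(VectorSource(textVec))
  crp <- tm_map(crp, content_transformer(tolower))
  crp <- tm_map(crp, removePunctuation)
  crp <- tm_map(crp, removeNumbers)
  tm_map(crp, stripWhitespace)
}

doc.corpusNews    <- preprocessCorpus(sampleNews)
doc.corpusTwitter <- preprocessCorpus(sampleTwitter)
```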


## Word Frequency Analysis

We compute term-document matrices and visualize the most common words using bar plots and word clouds.

```r
library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal()

TDMBlogs      <- TermDocumentMatrix(doc.corpusBlogs)
Blogsmatrix   <- as.matrix(TDMBlogs)
FreqBlogs     <- sort(rowSums(Blogsmatrix), decreasing = TRUE)
FreqDistBlogs <- data.frame(word = names(FreqBlogs), freq = FreqBlogs)

barplot(FreqDistBlogs$freq[1:10],
        names.arg = FreqDistBlogs$word[1:10],
        col  = "lightblue",
        main = "Top Words in Blogs")

wordcloud(words = names(FreqBlogs), freq = FreqBlogs,
          max.words = 100, colors = brewer.pal(8, "Dark2"))
```

The same analysis is repeated for the news and Twitter samples, as sketched below.
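
One way to organize that repetition is a helper that returns the frequency table for any corpus. This is a sketch: `topTerms` is our own name, and it assumes `doc.corpusNews` and `doc.corpusTwitter` were preprocessed as above:

```r
# Hypothetical helper: frequency table (word, freq) for a tm corpus
topTerms <- function(corpus) {
  tdm  <- TermDocumentMatrix(corpus)
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq)
}

FreqDistNews    <- topTerms(doc.corpusNews)
FreqDistTwitter <- topTerms(doc.corpusTwitter)
```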


## Future Steps

We plan to build a predictive model using N-grams (unigram, bigram, trigram). Based on user input, we will:

- Match the last words of the input against the trigram and bigram tables.
- Back off to lower-order N-grams when no higher-order match exists.
- Return the most frequent candidate as the predicted next word.

We will balance accuracy against performance to keep the web app responsive.
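
To illustrate the idea, here is a minimal sketch using bigrams only. `predictNext`, the tokenization, and the lookup strategy are our own simplifications, not the final model:

```r
# Tokenize the blog sample into lowercase words (letters and apostrophes)
tokens <- unlist(strsplit(tolower(sampleBlogs), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# Count bigrams by pairing each token with its successor
bigrams      <- paste(head(tokens, -1), tail(tokens, -1))
bigramCounts <- sort(table(bigrams), decreasing = TRUE)

# Predict the next word from the last word of the input, backing off
# to the overall most frequent unigram when no bigram matches.
predictNext <- function(lastWord) {
  hits <- bigramCounts[startsWith(names(bigramCounts), paste0(lastWord, " "))]
  if (length(hits) == 0) {
    return(names(sort(table(tokens), decreasing = TRUE))[1])
  }
  sub("^\\S+ ", "", names(hits)[1])  # drop the first word of the best bigram
}

predictNext("the")
```

A production model would precompute and prune these tables rather than scanning them per query; that is exactly the accuracy-versus-performance trade-off noted above.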


## Conclusion

This report confirms that the data has been successfully loaded, cleaned, and explored. Preliminary visualizations highlight the most frequent terms in each corpus. The next phase will focus on model building and app development.