Milestone Report

Introduction

The goal of this project is to build a predictive text model similar to those used in smartphone keyboards. The model will be trained on large text datasets and will predict the next word from the words that precede it.

The dataset contains text from three sources: blogs, news articles, and Twitter posts.

Data Summary

library(stringi)

blogs <- readLines("en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE)
news <- readLines("en_US.news.txt", encoding="UTF-8", skipNul=TRUE)
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)

blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))

summary_table <- data.frame(
  File=c("Blogs","News","Twitter"),
  Lines=c(length(blogs), length(news), length(twitter)),
  Words=c(blogs_words, news_words, twitter_words)
)

summary_table

##      File   Lines    Words
## 1   Blogs  899288 37546806
## 2    News 1010206 34761151
## 3 Twitter 2360148 30096690

Sampling the Data

set.seed(123)

sample_data <- c(
  sample(blogs,1000),
  sample(news,1000),
  sample(twitter,1000)
)

Histogram of Words Per Line

word_counts <- stri_count_words(sample_data)

hist(word_counts,
     breaks=30,
     main="Histogram of Words Per Line",
     xlab="Words Per Line")

Word Frequency Analysis

library(tm)

## Loading required package: NLP

library(RWeka)

corpus <- VCorpus(VectorSource(sample_data))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

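# NGramTokenizer expects a character vector, so extract the cleaned text first
clean_text <- unlist(lapply(corpus, as.character))

unigram <- NGramTokenizer(clean_text, Weka_control(min=1,max=1))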

freq <- sort(table(unigram), decreasing=TRUE)

barplot(head(freq,10),
        main="Top 10 Most Frequent Words")

Future Plans

The prediction algorithm will use n-gram models to predict the next word based on previous words. Unigrams, bigrams, and trigrams will be generated from the dataset.
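
As a rough illustration of the n-gram idea, the sketch below builds bigram counts from a tiny made-up corpus and looks up the most frequent followers of a word. It is a minimal base R sketch, not the planned implementation: `toy_lines` is invented text, `predict_next` is a hypothetical helper name, and a real model would likely add trigrams and a fallback for unseen words.

```r
# Toy corpus standing in for the cleaned sample (illustrative text only).
toy_lines <- c(
  "i love new york",
  "i love new music",
  "new york is big"
)

# Split each line into lowercase, whitespace-separated tokens.
tokens <- strsplit(tolower(toy_lines), "\\s+")

# Build bigrams within each line as (first, second) word pairs.
bigram_list <- lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
})
bigrams <- do.call(rbind, bigram_list)

# Count how often each pair occurs.
bigram_counts <- aggregate(n ~ first + second,
                           data = transform(bigrams, n = 1L),
                           FUN = sum)

# Hypothetical helper: return up to `k` most frequent followers of `word`,
# breaking ties alphabetically. Unknown words give an empty result.
predict_next <- function(word, counts, k = 3) {
  hits <- counts[counts$first == tolower(word), ]
  hits <- hits[order(-hits$n, hits$second), ]
  as.character(head(hits$second, k))
}

predict_next("new", bigram_counts)
```

In this toy corpus "new" is followed by "york" twice and "music" once, so "york" is ranked first.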

A web application will be developed using Shiny where users can type text and receive predicted next word suggestions.

Interesting Findings

The exploratory analysis of the dataset revealed several interesting patterns.

First, the number of words per line varies across the different sources. Twitter posts generally contain fewer words because of the platform’s character limits, while blog posts tend to have longer and more descriptive sentences.
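
A per-source comparison like this could be computed along the following lines. This is a sketch under assumptions: `toy_sources` holds invented lines in place of separate samples of the blogs, news, and Twitter vectors, and `count_words` is a hypothetical base R helper that only approximates `stri_count_words()` by splitting on whitespace.

```r
# Illustrative lines only; in the report these would be samples drawn
# separately from the blogs, news, and twitter vectors.
toy_sources <- list(
  Blogs   = c("a longer line with several more words in it", "short blog line"),
  News    = c("officials announced the plan on monday"),
  Twitter = c("so tired", "good morning everyone")
)

# Rough word count: number of whitespace-separated tokens per line.
# stri_count_words() uses ICU word boundaries instead, so counts can differ.
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

# One row per source with simple words-per-line statistics.
words_per_line <- data.frame(
  Source = names(toy_sources),
  Mean   = sapply(toy_sources, function(x) mean(count_words(x))),
  Median = sapply(toy_sources, function(x) median(count_words(x))),
  Max    = sapply(toy_sources, function(x) max(count_words(x))),
  row.names = NULL
)

words_per_line
```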

Second, the word frequency analysis shows that common English function words appear most frequently in the dataset. These include articles, conjunctions, and prepositions such as “the”, “and”, and “to”.

Finally, the large size of the dataset suggests that sampling and efficient data processing techniques will be important when building the final prediction model.
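
One way to sample without loading a whole file into memory is to read it in chunks and keep each line with a fixed probability. The sketch below is one possible approach, not the method used in this report: `sample_lines` is a hypothetical helper, the chunk size is arbitrary, and the demonstration runs on a temporary file rather than the real data.

```r
# Hypothetical helper: keep each line with probability `prob`, reading the
# file in chunks so only one chunk plus the kept lines are in memory.
sample_lines <- function(path, prob, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size, encoding = "UTF-8",
                       skipNul = TRUE, warn = FALSE)
    if (length(chunk) == 0) break
    # Independent coin flip per line (Bernoulli sampling).
    keep <- rbinom(length(chunk), size = 1, prob = prob) == 1
    kept[[length(kept) + 1]] <- chunk[keep]
  }
  as.character(unlist(kept))
}

# Demonstration on a temporary file with 1000 made-up lines.
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)

set.seed(123)
demo_sample <- sample_lines(tmp, prob = 0.1, chunk_size = 100L)
length(demo_sample)
```

Unlike `sample(x, 1000)`, this returns a random number of lines (about `prob` times the file length), which is the trade-off for never holding the full file in memory.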