Executive Summary

This report presents an exploratory analysis of the training data for the text prediction application. The objective is to understand the characteristics of the source data and establish a foundation for developing a prediction algorithm and a Shiny application.

Data Loading

library(stringi)   # stri_count_words()
library(ggplot2)   # histograms

# Directory holding the en_US corpus files
data_dir <- "c:/Users/Dell/Documents/GitHub/datasciencecoursera/DataScienceCapstone/final/en_US"

blogs <- readLines(file.path(data_dir, "en_US.blogs.txt"),
                   encoding = "UTF-8", skipNul = TRUE)

news <- readLines(file.path(data_dir, "en_US.news.txt"),
                  encoding = "UTF-8", skipNul = TRUE)

twitter <- readLines(file.path(data_dir, "en_US.twitter.txt"),
                     encoding = "UTF-8", skipNul = TRUE)

The three datasets were successfully loaded.

Basic Summary Statistics

summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  Characters = c(
    sum(nchar(blogs)),
    sum(nchar(news)),
    sum(nchar(twitter))
  )
)

summary_table
##    Source   Lines    Words Characters
## 1   Blogs  899288 37546806  206824505
## 2    News 1010206 34761151  203214543
## 3 Twitter 2360148 30096690  162096241

Character Distribution

# One row per line of text, labelled by source
line_lengths <- data.frame(
  Source = c(
    rep("Blogs", length(blogs)),
    rep("News", length(news)),
    rep("Twitter", length(twitter))
  ),
  Characters = c(nchar(blogs), nchar(news), nchar(twitter))
)

# One histogram per source; the x-axis is limited to 0-500 characters
ggplot(line_lengths, aes(x = Characters)) +
  geom_histogram(bins = 50) +
  facet_wrap(~Source, scales = "free_y") +
  xlim(0, 500) +
  labs(title = "Character Distribution", x = "Characters", y = "Frequency")
## Warning: Removed 131269 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).

These warnings are expected: xlim(0,500) excludes the 131,269 lines longer than 500 characters from the plot.

Word Distribution

# One row per line of text, labelled by source
word_counts <- data.frame(
  Source = c(
    rep("Blogs", length(blogs)),
    rep("News", length(news)),
    rep("Twitter", length(twitter))
  ),
  Words = c(
    stri_count_words(blogs),
    stri_count_words(news),
    stri_count_words(twitter)
  )
)

# One histogram per source; the x-axis is limited to 0-100 words
ggplot(word_counts, aes(x = Words)) +
  geom_histogram(bins = 50) +
  facet_wrap(~Source, scales = "free_y") +
  xlim(0, 100) +
  labs(title = "Word Count Distribution", x = "Words", y = "Frequency")
## Warning: Removed 95703 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).

These warnings are expected: xlim(0,100) excludes the 95,703 lines containing more than 100 words from the plot.

Findings

The datasets differ substantially in size and line length. Twitter contains the most lines (2,360,148) but the shortest ones: about 13 words per line on average, compared with about 42 for blogs and 34 for news. The histograms show that most lines contain relatively little text, while a small proportion are much longer: 131,269 lines exceed 500 characters and 95,703 exceed 100 words.
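
The size differences can be made concrete by dividing the totals in the summary table by the line counts. The sketch below re-enters the reported totals by hand so that it runs without the corpus files; the column names WordsPerLine and CharsPerLine are illustrative additions, not part of the original analysis.

```r
# Totals copied from the summary table in this report
summary_table <- data.frame(
  Source     = c("Blogs", "News", "Twitter"),
  Lines      = c(899288, 1010206, 2360148),
  Words      = c(37546806, 34761151, 30096690),
  Characters = c(206824505, 203214543, 162096241)
)

# Per-line averages, rounded to one decimal place (illustrative column names)
summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 1)
summary_table$CharsPerLine <- round(summary_table$Characters / summary_table$Lines, 1)

summary_table[, c("Source", "WordsPerLine", "CharsPerLine")]
```

Means are sensitive to the long tail visible in the histograms, so medians computed from the per-line counts may be a more robust comparison.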

Prediction Algorithm Plan

The prediction system will use an n-gram language model. Text preprocessing will include lowercasing, punctuation removal, tokenization, and collapsing of redundant whitespace. The algorithm will estimate the most probable next word from how often each word follows the preceding words in the training data.
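
The plan above can be sketched in base R as follows. This is a minimal illustration under stated assumptions rather than the final design: it uses bigrams only (the report does not fix the n-gram order), applies no smoothing or backoff, treats any character outside a-z and the apostrophe as a separator, and assumes the input yields at least one bigram. The function names tokenize, build_bigrams and predict_next are hypothetical.

```r
# Hypothetical helper: lowercase, strip punctuation, collapse whitespace, tokenize.
# Assumption: only ASCII letters and apostrophes are kept inside tokens.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)
  text <- trimws(gsub("\\s+", " ", text))
  if (!nzchar(text)) character(0) else strsplit(text, " ", fixed = TRUE)[[1]]
}

# Count adjacent word pairs within each line (pairs never span two lines)
build_bigrams <- function(lines) {
  bigrams <- unlist(lapply(lines, function(line) {
    tokens <- tokenize(line)
    n <- length(tokens)
    if (n < 2) return(character(0))
    paste(tokens[-n], tokens[-1])
  }))
  counts <- table(bigrams)
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  data.frame(
    first  = vapply(parts, function(p) p[1], character(1)),
    second = vapply(parts, function(p) p[2], character(1)),
    count  = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Return up to k words that most often followed the last word of `text`
predict_next <- function(model, text, k = 3) {
  tokens <- tokenize(text)
  if (length(tokens) == 0) return(character(0))
  last <- tokens[length(tokens)]
  hits <- model[model$first == last, ]
  hits <- hits[order(-hits$count, hits$second), ]
  head(hits$second, k)
}

# Toy corpus for illustration only
corpus <- c("I love data science.", "I love R!", "Data science is fun", "I love data")
model <- build_bigrams(corpus)
predict_next(model, "we love")
```

On the full corpus, the same counting would need a more memory-efficient implementation and a fallback for words that never appear in the training data.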

Shiny Application Plan

The Shiny application will provide a simple user interface where users enter text and receive next-word predictions in real time. The interface will prioritize responsiveness and ease of use.
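
A minimal sketch of such an interface is shown below, assuming the shiny package is installed. The function suggest_words is a hypothetical placeholder that returns fixed words; a real application would call the trained n-gram model instead. The input and output IDs and the 300 ms delay are illustrative choices.

```r
library(shiny)

# Hypothetical placeholder: a real app would query the trained model here
suggest_words <- function(text) {
  if (!nzchar(trimws(text))) return(character(0))
  c("the", "and", "to")
}

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter text:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output, session) {
  # debounce() delays recomputation until typing has paused for 300 ms
  phrase <- debounce(reactive(input$phrase), 300)

  output$prediction <- renderText({
    words <- suggest_words(phrase())
    if (length(words) == 0) {
      "Type a phrase to see suggestions."
    } else {
      paste(words, collapse = " | ")
    }
  })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui, server)
```

Debouncing the input is one way to keep the interface responsive, since the prediction is recomputed only after the user pauses rather than on every keystroke.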

Conclusion

This exploratory analysis confirms that the three training datasets load correctly and characterizes their size and line-length distributions. The next phase covers preprocessing, model construction, and deployment through a Shiny application.