This report presents an exploratory analysis of the training data for the text prediction application. The objective is to understand the characteristics of the source data and establish a foundation for developing a prediction algorithm and a Shiny application.
# Packages used below: stringi for word counts, ggplot2 for the histograms
library(stringi)
library(ggplot2)

# Read each corpus file as UTF-8, skipping embedded NUL characters
blogs <- readLines("c:/Users/Dell/Documents/GitHub/datasciencecoursera/DataScienceCapstone/final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8",
                   skipNul = TRUE)
news <- readLines("c:/Users/Dell/Documents/GitHub/datasciencecoursera/DataScienceCapstone/final/en_US/en_US.news.txt",
                  encoding = "UTF-8",
                  skipNul = TRUE)
twitter <- readLines("c:/Users/Dell/Documents/GitHub/datasciencecoursera/DataScienceCapstone/final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8",
                     skipNul = TRUE)
The three datasets were successfully loaded.
# Line, word, and character totals for each source
summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  Characters = c(
    sum(nchar(blogs)),
    sum(nchar(news)),
    sum(nchar(twitter))
  )
)
summary_table
##    Source   Lines    Words Characters
## 1   Blogs  899288 37546806  206824505
## 2    News 1010206 34761151  203214543
## 3 Twitter 2360148 30096690  162096241
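# One row per line of text: its source and its length in characters
line_lengths <- data.frame(
  Source = c(
    rep("Blogs", length(blogs)),
    rep("News", length(news)),
    rep("Twitter", length(twitter))
  ),
  Characters = c(
    nchar(blogs),
    nchar(news),
    nchar(twitter)
  )
)
# xlim() drops lines longer than 500 characters before binning,
# which is what produces the "Removed ... rows" warnings
ggplot(line_lengths, aes(x = Characters)) +
  geom_histogram(bins = 50) +
  facet_wrap(~Source, scales = "free_y") +
  xlim(0, 500) +
  labs(
    title = "Character Distribution",
    x = "Characters",
    y = "Frequency"
  )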
## Warning: Removed 131269 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
# One row per line of text: its source and its length in words
word_counts <- data.frame(
  Source = c(
    rep("Blogs", length(blogs)),
    rep("News", length(news)),
    rep("Twitter", length(twitter))
  ),
  Words = c(
    stri_count_words(blogs),
    stri_count_words(news),
    stri_count_words(twitter)
  )
)
# xlim() drops lines longer than 100 words before binning,
# which is what produces the "Removed ... rows" warnings
ggplot(word_counts, aes(x = Words)) +
  geom_histogram(bins = 50) +
  facet_wrap(~Source, scales = "free_y") +
  xlim(0, 100) +
  labs(
    title = "Word Count Distribution",
    x = "Words",
    y = "Frequency"
  )
## Warning: Removed 95703 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
The datasets differ substantially in size and writing style. Twitter has the most entries (about 2.36 million lines) but the shortest ones, averaging roughly 13 words per line, compared with about 42 for blogs and 34 for news. Blog and news entries are longer and use more structured language. The histograms show right-skewed distributions: most lines are short, while a small proportion are much longer (131,269 lines exceed the 500-character plot limit and 95,703 exceed the 100-word limit).
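The difference in entry length can be made concrete by dividing the totals in the summary table by the line counts. The sketch below re-enters the reported totals by hand so that it runs without the corpus files; the column names `WordsPerLine` and `CharsPerLine` are illustrative additions, not part of the original analysis.

```r
# Totals copied from the summary table above (no corpus files needed)
summary_table <- data.frame(
  Source     = c("Blogs", "News", "Twitter"),
  Lines      = c(899288, 1010206, 2360148),
  Words      = c(37546806, 34761151, 30096690),
  Characters = c(206824505, 203214543, 162096241)
)

# Average entry length per source, rounded to one decimal place
summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 1)
summary_table$CharsPerLine <- round(summary_table$Characters / summary_table$Lines, 1)

print(summary_table[, c("Source", "WordsPerLine", "CharsPerLine")])
```

Means are sensitive to the long right tail seen in the histograms, so medians computed from the per-line counts would be a reasonable complement.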
The prediction system will use an n-gram language model. Text preprocessing will include lowercasing, punctuation handling, tokenization, and removal of unnecessary whitespace. The algorithm will estimate the most probable next word using observed word sequences.
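As a rough illustration of the plan above, the following base R sketch builds a bigram table from a toy corpus and returns the most frequent follower of the last typed word. The helper names (`tokenize`, `build_bigrams`, `predict_next`) and the toy sentences are hypothetical. The cleaning rule is a simplifying assumption: it keeps only ASCII letters and apostrophes, so it discards accented characters. A full model would likely use higher-order n-grams with some form of backoff or smoothing.

```r
# Hypothetical helper: lowercase, strip punctuation, collapse whitespace, split
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)          # keep letters, apostrophes, spaces
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Hypothetical helper: count adjacent word pairs across all lines
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < 2) return(NULL)           # no pair in a one-word line
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  pairs$n <- 1L
  aggregate(n ~ first + second, data = pairs, FUN = sum)
}

# Hypothetical helper: top-k followers of the last word in the input
predict_next <- function(bigrams, text, k = 3) {
  tok <- tokenize(text)
  if (length(tok) == 0) return(character(0))
  hits <- bigrams[bigrams$first == tok[length(tok)], ]
  head(as.character(hits$second[order(-hits$n)]), k)
}

# Toy corpus standing in for the blogs, news, and Twitter data
toy <- c("I love data science.", "I love R!", "I like data")
bigrams <- build_bigrams(toy)
print(predict_next(bigrams, "Today I", k = 1))
```

Ranking followers by raw count is the maximum-likelihood estimate of the next word given the previous one; words never seen as a first element simply return no prediction in this sketch.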
The Shiny application will provide a simple user interface where users enter text and receive next-word predictions in real time. The interface will prioritize responsiveness and ease of use.
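A minimal sketch of such an interface is shown below, assuming the `shiny` package is installed. The `predict_next` function here is a hypothetical placeholder backed by a made-up lookup list, not by the corpus; in the real application it would call the trained n-gram model. The input and output IDs (`phrase`, `prediction`) are likewise illustrative.

```r
library(shiny)

# Hypothetical stand-in for the model: a tiny hand-written lookup,
# NOT derived from the training data
predict_next <- function(text) {
  lookup <- list(i = c("am", "have", "will"),
                 the = c("first", "same", "best"))
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  hit <- lookup[[words[length(words)]]]
  if (is.null(hit)) character(0) else hit
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter text:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output, session) {
  # Re-evaluates whenever the text box changes, giving live predictions
  output$prediction <- renderText({
    preds <- predict_next(input$phrase)
    if (length(preds) == 0) "No prediction yet" else paste(preds, collapse = ", ")
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Because `renderText` depends on `input$phrase`, the prediction refreshes on each change to the text box, so responsiveness is bounded mainly by how fast the model lookup runs.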
This exploratory analysis confirms that the three training datasets load correctly and characterizes their size and line-length distributions. The next phases are preprocessing, model construction, and deployment through a Shiny application.