This project explores text data from blogs, news, and Twitter.
The goal is to build a next-word prediction model using the Coursera Data Science Capstone dataset.
library(stringi)
library(ggplot2)
library(knitr)
set.seed(123)
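# Read the first n lines of a file (not a random draw).
read_sample <- function(file, n = 10000){
  # Binary mode keeps stray control characters from ending the read early.
  con <- file(file, "rb")
  lines <- readLines(con,
                     n = n,
                     encoding = "UTF-8",
                     skipNul = TRUE)
  close(con)
  return(lines)
}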
blogs <- read_sample("en_US.blogs.txt",10000)
news <- read_sample("en_US.news.txt",10000)
twitter <- read_sample("en_US.twitter.txt",10000)
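Note that `read_sample()` takes the first `n` lines of each file, so the `set.seed(123)` call has no effect on which lines are read. If a random sample is preferred, one possible approach is sketched below. The helper name `read_random_sample` and the temporary demo file are assumptions made for illustration and are not part of the original analysis; the sketch also reads the whole file into memory, which may be slow for the full corpus.

```r
set.seed(123)  # makes the random draw below reproducible

# Hypothetical alternative to read_sample(): read every line, then draw n at random.
read_random_sample <- function(path, n = 10000) {
  con <- file(path, "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  # Sample line indices without replacement and keep them in file order.
  keep <- sort(sample.int(length(lines), min(n, length(lines))))
  lines[keep]
}

# Demo on a temporary file standing in for en_US.blogs.txt.
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:100), tmp)
picked <- read_random_sample(tmp, n = 10)
length(picked)
```

Sorting the sampled indices keeps the selected lines in their original order, which makes the sample easier to compare against the source file.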
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  AvgWords = c(
    mean(stri_count_words(blogs)),
    mean(stri_count_words(news)),
    mean(stri_count_words(twitter))
  )
)
kable(summary_table, digits = 2)
| Dataset | Lines | Words | AvgWords |
|---|---|---|---|
| Blogs | 10000 | 413215 | 41.32 |
| News | 10000 | 349062 | 34.91 |
| Twitter | 10000 | 126736 | 12.67 |
blogs_wc <- stri_count_words(blogs)
ggplot(data.frame(words = blogs_wc),
       aes(x = words)) +
  geom_histogram(
    bins = 40,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Blogs Word Distribution",
    x = "Words",
    y = "Frequency"
  ) +
  theme_minimal()
news_wc <- stri_count_words(news)
ggplot(data.frame(words = news_wc),
       aes(x = words)) +
  geom_histogram(
    bins = 40,
    fill = "darkgreen",
    color = "white"
  ) +
  labs(
    title = "News Word Distribution",
    x = "Words",
    y = "Frequency"
  ) +
  theme_minimal()
twitter_wc <- stri_count_words(twitter)
ggplot(data.frame(words = twitter_wc),
       aes(x = words)) +
  geom_histogram(
    bins = 30,
    fill = "tomato",
    color = "white"
  ) +
  labs(
    title = "Twitter Word Distribution",
    x = "Words",
    y = "Frequency"
  ) +
  theme_minimal()
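The three histograms differ only in their data, title, fill color, and bin count, so they could be produced by one small helper. The following is a minimal sketch of that idea; `plot_word_hist` is a hypothetical name, and the toy counts stand in for the output of `stri_count_words()`.

```r
library(ggplot2)

# Hypothetical helper: histogram of words per line for one named source.
plot_word_hist <- function(word_counts, source_name, fill, bins = 40) {
  ggplot(data.frame(words = word_counts), aes(x = words)) +
    geom_histogram(bins = bins, fill = fill, color = "white") +
    labs(title = paste(source_name, "Word Distribution"),
         x = "Words",
         y = "Frequency") +
    theme_minimal()
}

# Toy word counts used only to keep the example self-contained.
toy_wc <- c(3, 12, 12, 18, 25, 40, 41, 57)
p <- plot_word_hist(toy_wc, "Blogs", "steelblue")
```

With the real data, a call such as `plot_word_hist(twitter_wc, "Twitter", "tomato", bins = 30)` would reproduce the Twitter plot's settings.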
The next stage of the project will include:
- Cleaning the text
- Tokenizing it
- Building n-gram models
- Applying a backoff strategy for prediction
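The steps listed above can be sketched in base R on a toy corpus. This is an illustration under stated assumptions, not the project's final method: the cleaning rule (lowercase, keep only letters, apostrophes, and spaces), the function names `tokenize`, `count_ngrams`, and `predict_next`, and the four-sentence corpus are all hypothetical. The backoff shown is the simplest form, which uses the highest-order n-gram that matches and falls back to shorter contexts otherwise, with no discounting or smoothing.

```r
# Cleaning and tokenization (assumed rules): lowercase, drop everything
# except letters, apostrophes, and spaces, then split on runs of spaces.
tokenize <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]+", " ", line)
  tokens <- strsplit(line, " +")[[1]]
  tokens[nzchar(tokens)]
}

# Count n-grams of order n across a list of token vectors.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Simple backoff: try the longest context first, then shorter ones,
# and finally return the most frequent single word.
predict_next <- function(context, models) {
  tokens <- tokenize(context)
  for (n in rev(seq_along(models))) {
    counts <- models[[n]]
    if (n == 1) {
      return(names(counts)[which.max(counts)])
    }
    if (length(tokens) < n - 1) next
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # keep only the last word of the n-gram
    }
  }
}

# Toy corpus standing in for the sampled blog, news, and Twitter lines.
corpus <- c("I like green tea",
            "I like green apples",
            "I like green tea a lot",
            "you like black tea")
token_list <- lapply(corpus, tokenize)
models <- lapply(1:3, function(n) count_ngrams(token_list, n))

predict_next("I like green", models)      # "tea"  (trigram match)
predict_next("we drink black", models)    # "tea"  (backs off to bigram)
predict_next("something unseen", models)  # "like" (backs off to unigram)
```

Scanning all n-gram names with `startsWith()` is fine for a toy corpus but would be slow on the full data; a keyed lookup table would be a more practical choice there.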
The Shiny application will serve next-word predictions from the n-gram models.
This exploratory analysis characterizes the three text sources by line count, word count, and words per line. In the 10,000-line samples, blog entries average about 41 words, news entries about 35, and tweets about 13. These findings will guide the preprocessing, language modeling, and development of a next-word prediction application using n-gram models.