The goal of this project is to demonstrate familiarity with the
SwiftKey text data and to perform a basic exploratory analysis.
This analysis will guide the development of a next-word prediction
algorithm and a future Shiny application.
The data consists of text from blogs, news articles, and Twitter posts.

## Data Loading
```r
# Set the working directory for all subsequent chunks
knitr::opts_knit$set(
  root.dir = "C:/Users/Siva Karthik/Downloads/Coursera-SwiftKey"
)

library(stringi)
library(readr)

# Read each English corpus line by line
blogs   <- read_lines("en_US.blogs.txt")
news    <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")
```
```
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
```
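The warning above comes from `readr`/`vroom` and is typically triggered by embedded null characters in `en_US.news.txt`. As a hedged alternative (not used in the analysis above), the file could be read with base R's `readLines()` and `skipNul = TRUE`, which simply drops those characters:

```r
# Alternative loading sketch: open the news file in binary mode and drop
# embedded null characters, which avoids the parsing warning from readr.
con  <- file("en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
```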
```r
# Summarise line counts, word counts, and file sizes for each corpus
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  FileSize_MB = c(
    file.info("en_US.blogs.txt")$size / 1024^2,
    file.info("en_US.news.txt")$size / 1024^2,
    file.info("en_US.twitter.txt")$size / 1024^2
  )
)

summary_table
```
```
##   Dataset   Lines    Words FileSize_MB
## 1   Blogs  899288 37546806    200.4242
## 2    News 1010242 34762658    196.2775
## 3 Twitter 2360148 30096649    159.3641
```
```r
# Distribution of line lengths (in characters) for the Twitter corpus
twitter_lengths <- nchar(twitter)

hist(
  twitter_lengths,
  breaks = 50,
  col = "lightblue",
  main = "Distribution of Twitter Line Lengths",
  xlab = "Number of Characters"
)
```
## Observations

- Twitter has the largest number of lines.
- Blogs contain the longest individual lines.
- The data requires cleaning and preprocessing before modeling.

## Future Work

The next step is to clean the text, build n-gram models, and develop a next-word prediction algorithm. The model will be deployed as a Shiny web application. A rough sketch of the planned n-gram step is shown below.
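To illustrate the direction of that next step, the sketch below samples a small fraction of each corpus, applies basic cleaning, and counts bigrams with `stringi`. It is a minimal illustration only; the 1% sample fraction, the cleaning rules, and the `build_ngrams` helper are assumptions for this sketch, not part of the analysis above.

```r
# Minimal sketch (not the final model): sample the corpora, clean the text,
# and count bigrams as a first step toward an n-gram prediction table.
set.seed(123)

sample_text <- c(
  sample(blogs,   floor(length(blogs)   * 0.01)),  # assumed 1% sample fraction
  sample(news,    floor(length(news)    * 0.01)),
  sample(twitter, floor(length(twitter) * 0.01))
)

# Basic cleaning: lower-case and keep only letters and spaces
clean_text <- stri_trans_tolower(sample_text)
clean_text <- stri_replace_all_regex(clean_text, "[^a-z\\s]", " ")
clean_text <- stri_trim_both(stri_replace_all_regex(clean_text, "\\s+", " "))

# Hypothetical helper: count n-grams of a given order across all lines
build_ngrams <- function(lines, n = 2) {
  words  <- stri_split_regex(lines, "\\s+")
  ngrams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

head(build_ngrams(clean_text, n = 2), 10)  # ten most frequent bigrams in the sample
```

A frequency table like this would feed a simple backoff prediction scheme: given the last word typed, the most frequent bigram starting with that word supplies the suggested next word.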