Introduction

The goal of this project is to demonstrate familiarity with the SwiftKey text data and to perform a basic exploratory analysis.
This analysis will guide the development of a next-word prediction algorithm and a future Shiny application.

The data consists of text from blogs, news articles, and Twitter posts.

Data Loading

# Point knitr at the directory that contains the SwiftKey data files
knitr::opts_knit$set(
  root.dir = "C:/Users/Siva Karthik/Downloads/Coursera-SwiftKey"
)

library(stringi)
library(readr)

# Read each corpus into a character vector, one element per line
blogs   <- read_lines("en_US.blogs.txt")
news    <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
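
The warning above points to `problems()` for details. As a quick follow-up, a minimal sketch that asks readr which corpus recorded parsing issues (this check is illustrative; the warning itself does not say which file is affected):

# Inspect any parsing issues recorded by readr for each corpus
problems(blogs)
problems(news)
problems(twitter)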

Basic Summary Statistics

# Line counts, word counts, and on-disk file sizes for each corpus
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  FileSize_MB = c(
    file.info("en_US.blogs.txt")$size / 1024^2,
    file.info("en_US.news.txt")$size / 1024^2,
    file.info("en_US.twitter.txt")$size / 1024^2
  )
)
summary_table
##   Dataset   Lines    Words FileSize_MB
## 1   Blogs  899288 37546806    200.4242
## 2    News 1010242 34762658    196.2775
## 3 Twitter 2360148 30096649    159.3641
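
The totals above hide how the corpora differ line by line. Here is a minimal sketch of two per-line measures, average words per line and the longest line in characters, using the same stringi helpers (the object and column names are illustrative):

# Per-line comparison: average words per line and longest line in characters
per_line <- data.frame(
  Dataset   = c("Blogs", "News", "Twitter"),
  MeanWords = c(mean(stri_count_words(blogs)),
                mean(stri_count_words(news)),
                mean(stri_count_words(twitter))),
  MaxChars  = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
)
per_line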

Distribution of Twitter Line Lengths

# Character counts per tweet
twitter_lengths <- nchar(twitter)
hist(
  twitter_lengths,
  breaks = 50,
  col = "lightblue",
  main = "Distribution of Twitter Line Lengths",
  xlab = "Number of Characters"
)
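
A numeric summary complements the histogram; a small sketch of the quartiles and extremes of the tweet lengths:

# Five-number summary (plus mean) of characters per tweet
summary(twitter_lengths)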

Observations

- Twitter has the largest number of lines.
- Blogs contain the longest individual lines.
- The data requires cleaning and preprocessing.

Future Work

The next step is to clean the text, build n-gram models, and develop a next-word prediction algorithm. This model will be deployed using a Shiny web application.
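
As a preview of the n-gram step, here is a minimal sketch of one possible cleaning and bigram-counting pass over a small random sample, using only base R and stringi. The sample size and cleaning rules are illustrative choices, not the final design:

# Work on a small sample to keep the exploratory pass fast
set.seed(123)
sample_text <- sample(twitter, 10000)

# Basic cleaning: lower-case, keep only letters, apostrophes, and spaces
clean <- stri_trans_tolower(sample_text)
clean <- stri_replace_all_regex(clean, "[^a-z' ]", " ")
clean <- stri_trim_both(stri_replace_all_regex(clean, "\\s+", " "))

# Tokenize on spaces and count adjacent word pairs (bigrams)
tokens  <- stri_split_fixed(clean, " ")
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
head(sort(table(bigrams), decreasing = TRUE), 10)

A production version of this step would more likely rely on a dedicated text package such as quanteda or tokenizers, but the counting idea is the same.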