SwiftKey Exploratory Data Analysis

This page presents an exploratory data analysis of the en_US data set, which consists of three files: twitter, blogs and news. It shows the major features of the data and illustrates important summaries of the data set in the form of tables and plots. It also briefly summarizes the plan for creating the prediction algorithm.

Downloading the data and exploring directory structure and files under root directory “final”

##  [1] ".\\final/de_DE"                   ".\\final/de_DE/de_DE.blogs.txt"  
##  [3] ".\\final/de_DE/de_DE.news.txt"    ".\\final/de_DE/de_DE.twitter.txt"
##  [5] ".\\final/en_US"                   ".\\final/en_US/en_US.blogs.txt"  
##  [7] ".\\final/en_US/en_US.news.txt"    ".\\final/en_US/en_US.twitter.txt"
##  [9] ".\\final/fi_FI"                   ".\\final/fi_FI/fi_FI.blogs.txt"  
## [11] ".\\final/fi_FI/fi_FI.news.txt"    ".\\final/fi_FI/fi_FI.twitter.txt"
## [13] ".\\final/ru_RU"                   ".\\final/ru_RU/ru_RU.blogs.txt"  
## [15] ".\\final/ru_RU/ru_RU.news.txt"    ".\\final/ru_RU/ru_RU.twitter.txt"
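A listing like the one above can be produced with base R's `list.files`. The sketch below is only an illustration: it builds a tiny, hypothetical `final` tree in a temporary folder (two locales with empty files) rather than using the real download, so the paths it prints will differ from the output shown.

```r
# Assumption: a small stand-in for the unzipped data, created in a temp folder.
root <- file.path(tempdir(), "final")
for (locale in c("de_DE", "en_US")) {
  dir.create(file.path(root, locale), recursive = TRUE, showWarnings = FALSE)
  for (src in c("blogs", "news", "twitter")) {
    file.create(file.path(root, locale, paste0(locale, ".", src, ".txt")))
  }
}

# List sub-directories and files together, walking the whole tree.
entries <- list.files(root, recursive = TRUE, full.names = TRUE,
                      include.dirs = TRUE)
print(entries)
```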

Loading the data and exploring number of lines in the files

##                       en_US.twitter en_US.blogs en_US.news
## total_number_of_lines       2360148      899288      77259
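One way to obtain such line counts is sketched below. The helper name `count_lines` and the demo file are hypothetical and not taken from the original analysis. Opening the connection in binary mode (`"rb"`) is a common precaution, since text-mode reads on Windows can stop early at an embedded end-of-file control character; `skipNul = TRUE` drops embedded NUL bytes.

```r
# Hypothetical helper: count the lines of a text file.
count_lines <- function(path) {
  con <- file(path, open = "rb")   # binary mode: read the file to its real end
  on.exit(close(con))
  length(readLines(con, encoding = "UTF-8", skipNul = TRUE))
}

# Demo on a small temporary file (a stand-in for e.g. en_US.twitter.txt).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
n_lines <- count_lines(demo_path)
print(n_lines)
```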

Exploring total number of words in each of the files

##                       en_US.twitter en_US.blogs en_US.news
## total_number_of_words      30513860    38487556    2760230
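The totals above depend on how a "word" is defined, which the page does not state. As a minimal sketch, assuming a word is any run of non-whitespace characters, per-line and total counts could be computed as follows (`count_words` is a hypothetical helper, and the sample lines are made up).

```r
# Hypothetical helper: number of whitespace-separated tokens in each line.
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  # nzchar() guards against empty tokens; an empty line counts as 0 words.
  vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
}

sample_lines <- c("the quick brown fox", "  hello   world ", "")
per_line <- count_words(sample_lines)
total_words <- sum(per_line)
print(per_line)
print(total_words)
```

A dedicated tokenizer would treat punctuation, numbers and emoticons differently, so counts from this sketch need not match the table exactly.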

Exploring summary statistics of the word counts per line

##         en_US.twitter en_US.blogs en_US.news
## Mean        12.928791    42.79781   35.72697
## Std Dev      7.185126    47.80498   24.06795
## Min          1.000000     1.00000    1.00000
## Q1           7.000000     9.00000   20.00000
## Median      12.000000    29.00000   33.00000
## Q3          19.000000    61.00000   47.00000
## Max         62.000000  6851.00000 1521.00000
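The rows of this table correspond to standard base R functions. The sketch below assumes a numeric vector of per-line word counts and uses R's default quantile method, which may or may not be what the original analysis used; `summarise_counts` and the example vectors are hypothetical.

```r
# Hypothetical helper: the seven summary statistics shown in the table.
summarise_counts <- function(counts) {
  q <- quantile(counts, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(Mean = mean(counts), `Std Dev` = sd(counts), Min = min(counts),
    Q1 = q[1], Median = q[2], Q3 = q[3], Max = max(counts))
}

# Made-up word counts for two small "files", one column per file.
demo_counts <- list(file_a = c(1, 2, 3, 4, 5), file_b = c(10, 20, 30))
stats <- sapply(demo_counts, summarise_counts)
print(stats)
```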

Distribution plot of word counts for en_US.twitter using Histogram and Boxplot

Distribution plot of word counts for en_US.blogs using Histogram and Boxplot

Distribution for only those blogs with a length of up to 200 words

Distribution plot of word counts for en_US.news using Histogram and Boxplot

Distribution for only those news items with a length of up to 150 words
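Plots of this kind can be drawn with base graphics. The sketch below uses synthetic, right-skewed word counts (not the real data) and a hypothetical 200-word cutoff to show how restricting the range makes the bulk of the distribution visible.

```r
# Assumption: synthetic word counts with a long right tail, for illustration only.
set.seed(1)
word_counts <- rgeom(1000, prob = 1 / 40) + 1

# Keep only lines of up to 200 words so the long tail does not flatten the plot.
trimmed <- word_counts[word_counts <= 200]

op <- par(mfrow = c(1, 2))   # histogram and boxplot side by side
hist(trimmed, breaks = 50, main = "Word counts (up to 200 words)",
     xlab = "Words per line")
boxplot(trimmed, horizontal = TRUE, main = "Word counts (up to 200 words)",
        xlab = "Words per line")
par(op)
```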

Observations

en_US.twitter: 2360148 tweets containing 30513860 words in total. Most tweets have between 1 and 30 words, and very few exceed 30 words. The longest tweet has 62 words. The distribution peaks between 2 and 12 words. The mean tweet length is 12.93 words and the median is 12 words.
en_US.blogs: 899288 blogs containing 38487556 words in total. Most blogs have between 1 and 200 words, and very few exceed 200 words. The longest blog has 6851 words. The distribution peaks between 2 and 18 words. The mean blog length is 42.8 words and the median is 29 words.
en_US.news: 77259 news items containing 2760230 words in total. Most news items have between 1 and 100 words, and very few exceed 100 words. The longest news item has 1521 words. The distribution peaks between 1 and 50 words. The mean news length is 35.7 words and the median is 33 words.

Plan for creating the prediction algorithm

The next word is predicted from the few words that precede it. This is called an n-gram model.
The provided training data set (en_US) will be used to build the model.
The frequencies of word combinations, and the probabilities derived from them, will be used to refine the model.
The model will handle unseen word combinations by using backoff models.
The model will be tuned for the trade-off between size and runtime.
The model will run in a Shiny app.
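The plan above can be illustrated with a very small sketch. It is not the planned implementation: it uses a made-up four-line corpus, a naive tokenizer, raw counts up to trigrams, and a simple "try the longest context, then a shorter one" backoff without discounting. All function names are hypothetical. For a fixed context, picking the highest count is the same as picking the highest conditional probability.

```r
# Naive tokenizer (assumption): lower-case, keep letters and apostrophes only.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Count all n-grams of order n across a list of token vectors.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Made-up training corpus; model[[n]] holds the n-gram counts for n = 1, 2, 3.
corpus <- c("I love new york", "I love new music",
            "new york is big", "they love new york")
tokens <- lapply(corpus, tokenize)
model <- lapply(1:3, function(n) count_ngrams(tokens, n))

# Predict the next word: try a 2-word context, back off to 1 word,
# then fall back to the most frequent single word.
predict_next <- function(model, context) {
  words <- tokenize(context)
  for (n in c(2, 1)) {
    if (length(words) >= n) {
      prefix <- paste0(paste(tail(words, n), collapse = " "), " ")
      counts <- model[[n + 1]]
      hits <- counts[startsWith(names(counts), prefix)]
      if (length(hits) > 0) {
        best <- names(hits)[which.max(hits)]
        return(tail(strsplit(best, " ")[[1]], 1))
      }
    }
  }
  names(model[[1]])[which.max(model[[1]])]
}

print(predict_next(model, "We love new"))  # trigram context "love new" is seen
print(predict_next(model, "brand new"))    # unseen pair, backs off to "new"
print(predict_next(model, "zebra"))        # unseen word, falls back to unigrams
```

Scanning every n-gram name on each prediction is slow for a real corpus; a lookup table keyed by context would be the natural next step when tuning size and runtime.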

Abhinav

28/04/2021