Data Science Specialization Capstone Project - Exploratory Analysis of Text Files

Author: Uma Balakrishnan

Date: January 13, 2016

Synopsis

This project analyzes three corpora of US English text files downloaded from the given link. The goal of the project is to create a Shiny app that takes a user-entered phrase and predicts what the next word could be. This report summarizes the exploratory data analysis of the three text files (Twitter, blogs, and news).

Load the libraries

library(stringi)
library(graphics)

Download the file from the internet

destfile <- "Coursera-SwiftKey.zip"
if (!file.exists(destfile)) {
  # mode = "wb" keeps the zip archive intact on Windows
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile, mode = "wb")
}

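Extract the files from the destfile and list its contents

# list = TRUE only lists the archive contents, so extract separately (once)
if (!file.exists("final")) unzip(destfile)
unzip(destfile, list = TRUE)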
##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00

Inspect the extracted folders

list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

Select the English-language folder and list the files inside it

list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Now, set the path to each text file for basic analysis

file.twitter <- "final/en_US/en_US.twitter.txt"
file.blogs <- "final/en_US/en_US.blogs.txt"
file.news <- "final/en_US/en_US.news.txt"

Determine the size of each file in megabytes

twitter.size.MB <- file.info(file.twitter)$size / 1024^2
blogs.size.MB <- file.info(file.blogs)$size / 1024^2
news.size.MB <- file.info(file.news)$size / 1024^2

size.files <- c(twitter.size.MB, blogs.size.MB, news.size.MB)
names(size.files) <- c("Size of Twitter in  MB", "Size of Blogs in  MB", "Size of News in  MB")
size.files
## Size of Twitter in  MB   Size of Blogs in  MB    Size of News in  MB 
##               159.3641               200.4242               196.2775

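Read each text file

# Read in binary mode: a text-mode read can stop early at an embedded control
# character on some platforms, which would explain the low news counts reported
# in this document (77,259 lines from a 196 MB file). skipNul drops embedded nulls.
read.text <- function(path) {
  con <- file(path, open = "rb"); on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
twitter <- read.text(file.twitter)
blogs <- read.text(file.blogs)
news <- read.text(file.news)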

Basic statistics of each file (number of lines, number of non-empty lines, number of characters, and number of characters excluding white space)

twitter.stats <- stri_stats_general(twitter)
blogs.stats <- stri_stats_general(blogs)
news.stats <- stri_stats_general(news)

stats.total <- rbind(twitter.stats, blogs.stats, news.stats)
stats.total
##                 Lines LinesNEmpty     Chars CharsNWhite
## twitter.stats 2360148     2360148 162384825   134370864
## blogs.stats    899288      899288 208361438   171926076
## news.stats      77259       77259  15683765    13117038
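
To make the four columns concrete, here is a small sketch on a made-up three-line vector. The strings are illustrative and not taken from the corpus, and the example assumes the stringi package is installed.

```r
library(stringi)

# Made-up lines: one ordinary, one empty, one padded with spaces
toy <- c("to be", "", "  or not  ")

toy.stats <- stri_stats_general(toy)
toy.stats
# Lines       = number of strings in the vector
# LinesNEmpty = strings containing at least one non-whitespace character
# Chars       = total number of characters (Unicode code points)
# CharsNWhite = characters that are not white space
```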

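Count the total number of words in each file. Note that stri_count_boundaries with type = "word" counts every segment between word boundaries, including white space and punctuation, so the totals below overstate the number of words; stri_count_words counts only the words themselves.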

twitter.words <- stri_count_boundaries(twitter, type = "word")
blogs.words <- stri_count_boundaries(blogs, type = "word")
news.words <- stri_count_boundaries(news, type = "word")

words.stat <- c(sum(twitter.words), sum(blogs.words), sum(news.words))
names(words.stat) <- c("No of Words in Twitter", "No of Words in Blogs", "No of Words in News")
words.stat
## No of Words in Twitter   No of Words in Blogs    No of Words in News 
##               65477024               81379967                5760550
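
The difference between counting boundary segments and counting words can be seen on a single short string. This is only an illustrative sketch; the sample string is made up, and it assumes the stringi package is installed.

```r
library(stringi)

# Made-up sample string with punctuation and a space
sample.text <- "Hello, world!"

# Counts every segment between word boundaries (words, spaces, punctuation)
n.segments <- stri_count_boundaries(sample.text, type = "word")

# Counts only the segments that are actual words
n.words <- stri_count_words(sample.text)

c(segments = n.segments, words = n.words)
```

If word totals rather than segment totals are wanted, replacing stri_count_boundaries(x, type = "word") with stri_count_words(x) in the code above would be one way to get them.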

The following figures show histograms of the word counts per line for each file
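
A minimal sketch of how such histograms could be drawn with base graphics is given here. It assumes per-line count vectors like twitter.words, blogs.words and news.words; the toy vectors are hypothetical stand-ins so that the example runs on its own, and the titles and colour are arbitrary choices.

```r
# Hypothetical per-line counts standing in for twitter.words, blogs.words and news.words
words.per.line <- list(
  Twitter = c(5, 12, 8, 20, 3, 15, 9, 11),
  Blogs   = c(40, 7, 120, 33, 60, 18, 75),
  News    = c(25, 31, 18, 44, 29, 36)
)

# Draw the three histograms side by side, then restore the previous layout
old.par <- par(mfrow = c(1, 3))
for (name in names(words.per.line)) {
  hist(words.per.line[[name]],
       main = paste("Words per line:", name),
       xlab = "Words per line",
       col = "steelblue")
}
par(old.par)
```

With the real data, the list entries would simply be the three count vectors computed earlier.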

Further Steps

  1. Begin n-gram analysis by combining samples drawn from all three files (for usage frequency analysis).
  2. Create a prediction model from n-grams (2-gram, 3-gram, …) and match user-entered phrases against them in descending order of frequency. Offer the three highest-frequency matches as candidates for the next word.
  3. Test the predictive models against data not picked as samples in step 1.
  4. Fine-tune the prediction model.
  5. Develop and deploy the Shiny app.
  6. Present the project in R Presenter.
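
Step 2 could be sketched roughly as follows, using bigrams only and base R. This is not the final model: the toy corpus, the tokenizer, and the function names (tokenize, build.bigrams, next.words) are all hypothetical and chosen for illustration.

```r
# Hypothetical toy corpus standing in for lines sampled from the three files
corpus <- c("I love data science",
            "I love R programming",
            "I love data analysis",
            "data science is fun",
            "R programming is fun")

# Lower-case a line and split it into words (letters and apostrophes only)
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Count every (first, second) word pair and sort each group by descending frequency
build.bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(NULL)
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts[order(counts$first, -counts$n, counts$second), ]
}

# Return up to k most frequent followers of the last word in the phrase
next.words <- function(model, phrase, k = 3) {
  w <- tokenize(phrase)
  if (length(w) == 0) return(character(0))
  hits <- model[model$first == w[length(w)], ]
  head(as.character(hits$second), k)
}

model <- build.bigrams(corpus)
next.words(model, "I love")   # "data" "r"
next.words(model, "data")     # "science" "analysis"
```

A fuller model would presumably add 3-grams and higher, with a fallback to shorter n-grams when the longer context has not been seen.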