This project analyses three corpora of US English text downloaded from the link given below. The goal is to create a Shiny app that takes a phrase entered by the user and predicts the next word. This report summarizes the exploratory data analysis of the three text files (Twitter, blogs and news).
Load the required libraries
library(stringi)
library(graphics)
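Download the file from the internet
destfile <- "Coursera-SwiftKey.zip"
# Download only if the archive is not already present
if (!file.exists(destfile)) {
  # mode = "wb" keeps the zip archive intact on Windows
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile, mode = "wb")
}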
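Extract the archive if that has not been done yet, then list the files it contains
# list = TRUE only lists the contents, so extract separately
if (!dir.exists("final")) unzip(destfile)
unzip(destfile, list = TRUE)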
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
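Because only the English corpora are analysed in this report, the listing above can also be used to extract just those three files. This is a minimal sketch, assuming the archive layout shown above; `english_entries` is a hypothetical helper name and is not part of the original analysis.

```r
# Hypothetical helper: pick the en_US text files out of an archive listing.
# `listing` is the data frame returned by unzip(zipfile, list = TRUE).
english_entries <- function(listing) {
  grep("^final/en_US/.+\\.txt$", listing$Name, value = TRUE)
}

destfile <- "Coursera-SwiftKey.zip"
if (file.exists(destfile)) {
  listing <- unzip(destfile, list = TRUE)
  # Extract only the English files; files already on disk are not overwritten.
  unzip(destfile, files = english_entries(listing), overwrite = FALSE)
}
```

Extracting only the needed entries avoids writing the German, Russian and Finnish files to disk.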
List the language folders
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
Select the English-language folder and list the files inside it
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Now choose each text file for basic analysis
file.twitter <- "final/en_US/en_US.twitter.txt"
file.blogs <- "final/en_US/en_US.blogs.txt"
file.news <- "final/en_US/en_US.news.txt"
Determine the size of each file in megabytes
twitter.size.MB <- file.info(file.twitter)$size / 1024^2
blogs.size.MB <- file.info(file.blogs)$size / 1024^2
news.size.MB <- file.info(file.news)$size / 1024^2
size.files <- c(twitter.size.MB, blogs.size.MB, news.size.MB)
names(size.files) <- c("Size of Twitter in MB", "Size of Blogs in MB", "Size of News in MB")
size.files
## Size of Twitter in MB Size of Blogs in MB Size of News in MB
## 159.3641 200.4242 196.2775
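The same sizes can be computed in one vectorised call. This is a small sketch, assuming the three paths defined above; `size_in_mb` is a hypothetical helper. Dividing by 1024^2, as the report does, strictly gives mebibytes (MiB).

```r
# Hypothetical helper: sizes of several files at once, in units of 1024^2 bytes.
size_in_mb <- function(paths) {
  sizes <- file.size(paths) / 1024^2
  names(sizes) <- basename(paths)
  sizes
}

paths <- c("final/en_US/en_US.twitter.txt",
           "final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt")
# Files that are missing are reported as NA rather than raising an error.
size_in_mb(paths)
```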
Read each text file
twitter <- readLines(file.twitter, skipNul = FALSE)
blogs <- readLines(file.blogs, skipNul = FALSE)
news <- readLines(file.news, skipNul = FALSE)
Basic statistics for each file (number of lines, number of non-empty lines, number of characters, and number of characters excluding white space)
twitter.stats <- stri_stats_general(twitter)
blogs.stats <- stri_stats_general(blogs)
news.stats <- stri_stats_general(news)
stats.total <- rbind(twitter.stats, blogs.stats, news.stats)
stats.total
## Lines LinesNEmpty Chars CharsNWhite
## twitter.stats 2360148 2360148 162384825 134370864
## blogs.stats 899288 899288 208361438 171926076
## news.stats 77259 77259 15683765 13117038
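The news figures above look too small for a 196 MB file: about 15.7 million characters were read, against more than 200 million for the similarly sized blogs file. That pattern suggests `readLines` stopped before the end of the news file. A common cause (an assumption here, not verified against this run) is a text-mode read on Windows hitting an embedded end-of-file control character. Below is a minimal sketch of a more defensive read through a binary connection; `read_corpus` is a hypothetical helper, and UTF-8 encoding is assumed.

```r
# Hypothetical helper: read a text file through a binary-mode connection so
# that embedded control characters are not treated as end-of-file.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul = TRUE drops embedded NUL bytes instead of truncating lines;
  # encoding = "UTF-8" only marks the strings as UTF-8 (assumed encoding).
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

if (file.exists("final/en_US/en_US.news.txt")) {
  news <- read_corpus("final/en_US/en_US.news.txt")
  length(news)
}
```

If the line count returned this way is much larger than the one reported above, the statistics for the news file should be recomputed from the full read.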
Count the total number of words in each file
twitter.words <- stri_count_boundaries(twitter, type = "word")
blogs.words <- stri_count_boundaries(blogs, type = "word")
news.words <- stri_count_boundaries(news, type = "word")
words.stat <- c(sum(twitter.words), sum(blogs.words), sum(news.words))
names(words.stat) <- c("No of Words in Twitter", "No of Words in Blogs", "No of Words in News")
words.stat
## No of Words in Twitter No of Words in Blogs No of Words in News
## 65477024 81379967 5760550
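One caveat on the totals above: `stri_count_boundaries(type = "word")` counts every segment between word boundaries, which to my understanding includes runs of white space and punctuation, so these totals are likely higher than the number of actual words. stringi also provides `stri_count_words`, which skips non-word segments. The short sketch below compares the two on made-up sentences.

```r
library(stringi)

sample_text <- c("Hello, world!", "Next word prediction is fun.")

# Counts all boundary-delimited segments, including spaces and punctuation.
segments <- stri_count_boundaries(sample_text, type = "word")
# Counts only the segments that are actual words.
words <- stri_count_words(sample_text)

data.frame(text = sample_text, segments = segments, words = words)
```

Applying `stri_count_words` to `twitter`, `blogs` and `news` in place of `stri_count_boundaries` would give per-line word counts on the same footing.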
The following figures show histograms of the number of words per line in each file