As part of the Capstone project offered by JHU through Coursera, this report puts into practice, on a real-world data set, the skills of getting and cleaning data.
library("dplyr")
library("ngram")
library("tidytext")
library("janeaustenr")
library("qdap")
fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileurl, destfile = "./SwiftKey.zip", method = "curl")
unzip("SwiftKey.zip", exdir = "./SwiftKey")
news <- readLines("./SwiftKey/final/en_US/en_US.news.txt")
blogs <- readLines("./SwiftKey/final/en_US/en_US.blogs.txt")
twitter <- readLines("./SwiftKey/final/en_US/en_US.twitter.txt")
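On some systems readLines() warns about embedded nul characters or an incomplete final line in these files. If that happens, a hedged workaround is to read through a binary connection with skipNul = TRUE, for example:

# read the news file through a binary connection, dropping embedded nuls
con <- file("./SwiftKey/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)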
newlen <- length(news)
blolen <- length(blogs)
twilen <- length(twitter)
The number of lines in each file:
There are 77259 lines in en_US.news.txt
There are 899288 lines in en_US.blogs.txt
There are 2360148 lines in en_US.twitter.txt
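These sentences can be generated directly from the counts computed above, for example:

cat(sprintf("There are %d lines in en_US.news.txt\n", newlen))
cat(sprintf("There are %d lines in en_US.blogs.txt\n", blolen))
cat(sprintf("There are %d lines in en_US.twitter.txt\n", twilen))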
To get the number of words in each file, we can use bash.
# wc -w prints "<count> <filename>"; read then splits that line into the count and the name
wc -w SwiftKey/final/en_US/en_US.news.txt > nwords
read nwords filename < nwords
echo "Number of words in News: $nwords"
wc -w SwiftKey/final/en_US/en_US.blogs.txt > bwords
read bwords filename < bwords
echo "Number of words in Blogs: $bwords"
wc -w SwiftKey/final/en_US/en_US.twitter.txt > twords
read twords filename < twords
echo "Number of words in Twitter: $twords"
## Number of words in News: 34365936
## Number of words in Blogs: 37334117
## Number of words in Twitter: 30373559
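As a rough cross-check without leaving R, the words can also be counted by splitting each line on whitespace. The sketch below is an approximation and may differ slightly from wc -w on edge cases:

# approximate word counts: split each line on runs of whitespace and sum the pieces
news_words    <- sum(lengths(strsplit(trimws(news), "\\s+")))
blogs_words   <- sum(lengths(strsplit(trimws(blogs), "\\s+")))
twitter_words <- sum(lengths(strsplit(trimws(twitter), "\\s+")))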
Mining the text, we can plot the five most common words in each file.
ft <- freq_terms(news, 5)
plot(ft, main = "Most common words in news")
ft <- freq_terms(blogs, 5)
plot(ft, main = "Most common words in blogs")
ft <- freq_terms(twitter, 5)
plot(ft, main = "Most common words in twitter")
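Since tidytext and dplyr are already loaded, the same frequencies could also be computed with a tidy workflow. The sketch below is one such alternative, not the approach used for the plots above; note that stop words are not removed, so common function words will dominate the top of the list:

# tokenize the blog lines into words and count them, most frequent first
blogs_freq <- dplyr::tibble(text = blogs) %>%
    tidytext::unnest_tokens(word, text) %>%
    dplyr::count(word, sort = TRUE)
head(blogs_freq, 5)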