The Coursera week 2 assignment provided a link to download a data file, and students are expected to perform natural language processing on the data. The download contains text files from 3 different sources: blogs, Twitter, and news. The following sections cover downloading the data, processing it, and making a few visual representations.
In this section I answer the following questions from the review criteria:
- Does the link lead to an HTML page describing the exploratory analysis of the training data set?
- Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
- Has the data scientist made basic plots, such as histograms, to illustrate features of the data?
- Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
# Download the data from the URL given in the first week, unless it already exists locally
dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataDIR <- "final"
if (!dir.exists(dataDIR)) {
  dataZipName <- "Coursera-SwiftKey.zip"
  # Only download the archive if it is not already on disk
  if (!file.exists(dataZipName)) {
    download.file(dataURL, dataZipName, method = "auto")
  }
  unzip(dataZipName)
}
# Summary of the downloaded files
list.dirs(path = "./final", full.names = TRUE, recursive = TRUE)
## [1] "./final" "./final/de_DE" "./final/en_US" "./final/fi_FI"
## [5] "./final/ru_RU"
# The 3 English-language text files that were downloaded are listed here
list.files(path = "./final/en_US", full.names = TRUE, recursive = TRUE)
## [1] "./final/en_US/en_US.blogs.txt" "./final/en_US/en_US.news.txt"
## [3] "./final/en_US/en_US.twitter.txt"
# Read the blogs text into memory so that it can be sampled and summarised
con_blogs <- file("final/en_US/en_US.blogs.txt")
en_US.blogs <- readLines(con_blogs)
close(con_blogs)
# Similar observations can be made for the other 2 text files
# con_twitter <- file("final/en_US/en_US.twitter.txt")
# con_news <- file("final/en_US/en_US.news.txt")
Basic summaries of the 3 files are given below in terms of word counts, line counts, and basic data tables. The 3 files explored are in US English and come from 3 different sources: blogs, Twitter, and news.
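Because the same summary steps are repeated for each file, they can also be gathered into one small data table. The sketch below is a minimal illustration under stated assumptions: `summarise_text` is a hypothetical helper that is not part of the original analysis, it approximates words with a base R whitespace split rather than `tokenize_words` (so counts on real data would differ somewhat from the ones reported in this report), and the three toy vectors stand in for the real files.

```r
# Hypothetical helper: summarise a character vector of lines.
# Words are approximated by splitting on whitespace (an assumption;
# the report itself uses tokenizers::tokenize_words).
summarise_text <- function(lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    lines = length(lines),
    words = sum(words_per_line),
    longest_line_words = max(words_per_line)
  )
}

# Toy stand-ins for the three sources (not the real data)
samples <- list(
  blogs   = c("the quick brown fox", "jumps over the lazy dog today"),
  twitter = c("hello world", "so tired", "good morning everyone"),
  news    = c("markets closed higher on friday")
)

# One row per source, bound into a single table
summary_table <- data.frame(
  source = names(samples),
  do.call(rbind, lapply(samples, summarise_text)),
  row.names = NULL
)
print(summary_table)
```

With the real data, each toy vector would be replaced by the corresponding `readLines` result, for example `en_US.blogs`.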
# Summary for blogs
library(tokenizers)
## Warning: package 'tokenizers' was built under R version 3.4.2
con_blogs <- file("final/en_US/en_US.blogs.txt")
en_US.blogs <- readLines(con_blogs)
# Word count (reuses the lines already read above instead of reading the file again)
blogs_text <- paste(en_US.blogs, collapse = "\n")
blogs_word_count <- tokenize_words(blogs_text)
length(blogs_word_count[[1]])
## [1] 38154238
# Line count
blogs_line_count <- NROW(en_US.blogs)
blogs_line_count
## [1] 899288
# Word counts of the 20 longest lines
blogs_line <- tokenize_words(en_US.blogs)
blogs_line_length <- lengths(blogs_line)
blogs_length_sort <- sort(blogs_line_length)
blogs_top_length <- tail(blogs_length_sort, 20)
plot(blogs_top_length, col = "dark green", bg = "dark green", pch = 19,
     main = "Word count of the 20 longest lines in the blogs text",
     ylab = "Word count")
close(con_blogs)
# Summary for twitter
con_twitter <- file("final/en_US/en_US.twitter.txt")
en_US.twitter <- readLines(con_twitter)
## Warning in readLines(con_twitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con_twitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(con_twitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(con_twitter): line 1759032 appears to contain an
## embedded nul
# Word count (reuses the lines already read above, so the file is not read
# a second time and the embedded nul warnings are not repeated)
twitter_text <- paste(en_US.twitter, collapse = "\n")
twitter_word_count <- tokenize_words(twitter_text)
length(twitter_word_count[[1]])
## [1] 30218125
# Line count
twitter_line_count <- NROW(en_US.twitter)
twitter_line_count
## [1] 2360148
# Word counts of the 20 longest lines
twitter_line <- tokenize_words(en_US.twitter)
twitter_line_length <- lengths(twitter_line)
twitter_length_sort <- sort(twitter_line_length)
twitter_top_length <- tail(twitter_length_sort, 20)
plot(twitter_top_length, col = "blue", bg = "blue", pch = 19,
     main = "Word count of the 20 longest lines in the twitter text",
     ylab = "Word count")
close(con_twitter)
# Summary for news
con_news <- file("final/en_US/en_US.news.txt")
en_US.news <- readLines(con_news)
## Warning in readLines(con_news): incomplete final line found on 'final/
## en_US/en_US.news.txt'
# Word count (reuses the lines already read above, so the file is not read
# a second time and the incomplete final line warning is not repeated)
news_text <- paste(en_US.news, collapse = "\n")
news_word_count <- tokenize_words(news_text)
length(news_word_count[[1]])
## [1] 2693898
# Line count
news_line_count <- NROW(en_US.news)
news_line_count
## [1] 77259
# Word counts of the 20 longest lines
news_line <- tokenize_words(en_US.news)
news_line_length <- lengths(news_line)
news_length_sort <- sort(news_line_length)
news_top_length <- tail(news_length_sort, 20)
plot(news_top_length, col = "purple", bg = "purple", pch = 19,
     main = "Word count of the 20 longest lines in the news text",
     ylab = "Word count")
close(con_news)
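The "embedded nul" warnings for the twitter file and the "incomplete final line" warning for the news file suggest that a plain text-mode read may not be seeing these files cleanly. On some platforms a text-mode read can stop early at a control character, in which case the news counts above would be lower than the true values. The sketch below shows one common way to guard against this, assuming a hypothetical helper named `read_all_lines` (not part of the original analysis): open the connection in binary mode and ask `readLines` to skip nuls. It is demonstrated on a small temporary file rather than on the real data, so it makes no claim about what the real counts would be.

```r
# Hypothetical helper: read every line of a text file in binary mode.
read_all_lines <- function(path) {
  con <- file(path, open = "rb")  # binary mode: bytes are read as they are
  on.exit(close(con))             # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Toy file containing an embedded nul and a 0x1A control character
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\nsecond"), as.raw(0x00),
    charToRaw(" line\nthird"), as.raw(0x1a),
    charToRaw(" line\nfourth line\n")),
  tmp
)

toy_lines <- read_all_lines(tmp)
length(toy_lines)
```

If this approach were applied to the three files, the line and word counts should be recomputed, since they may differ from the ones reported here.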
This article shows my preliminary research into natural language processing. In the future I would like to be able to predict the words (up to a combination of 3 words) that would follow when a person is writing text messages electronically.
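As a rough illustration of what prediction from 3-word combinations could look like, the sketch below counts trigrams in a toy corpus and suggests the word that most often follows a two-word prefix. Everything in it is an assumption made for illustration: `make_ngrams` and `predict_next` are hypothetical helpers, the tokenization is a simple lowercase whitespace split, and the toy sentences are invented. The tokenizers package used above also provides `tokenize_ngrams`, which could replace the hand-written splitter.

```r
# Hypothetical helper: all runs of n consecutive words in one piece of text.
# Tokenization is a simple lowercase/whitespace split (an assumption).
make_ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: most frequent word seen after a two-word prefix.
predict_next <- function(trigrams, prefix) {
  hits <- trigrams[startsWith(trigrams, paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  next_words <- sub(".* ", "", hits)  # keep only the last word of each trigram
  names(sort(table(next_words), decreasing = TRUE))[1]
}

# Invented toy corpus, one message per element
toy_text <- c("i want to go home", "i want to sleep", "i want to go out")
trigrams <- unlist(lapply(toy_text, make_ngrams))
predict_next(trigrams, "want to")
```

A real model would need smoothing or a fallback to shorter prefixes for unseen word pairs; here an unseen prefix simply returns `NA`.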