library(stringi)
library(ggplot2)
library(magrittr)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(SnowballC)
library(ngram)
library(corpus)
library(tmap)
library(wordcloud)
## Loading required package: RColorBrewer
library(RWeka)
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
The first step in analyzing any new data set is figuring out what data you actually have and what the standard tools and models used for that type of data are.
Make sure you have downloaded the data from Coursera before starting the exercises. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales: en_US, de_DE, ru_RU and fi_FI.
The data is from a corpus called HC Corpora. See the About the Corpora reading for more details. The files have been language filtered but may still contain some foreign text. In this capstone we will be applying data science in the area of natural language processing. As a first step toward working on this project, you should familiarize yourself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to you.
Dataset: this is the training data that will get you started and will be the basis for most of the capstone. To start, you must download the data from the Coursera site and not from external websites.
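The download step is not shown in the chunks below; here is a minimal sketch, assuming the Coursera-SwiftKey.zip archive URL given in the course materials (adjust the URL and paths if yours differ):
# download and unzip the capstone data once, if not already present
zipURL  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "./data/Coursera-SwiftKey.zip"
if (!dir.exists("./data")) dir.create("./data")
if (!file.exists(zipFile)) {
  download.file(zipURL, destfile = zipFile, mode = "wb")
  unzip(zipFile, exdir = "./data")
}
# note: the archive unpacks into a final/ folder; move the en_US files
# (or adjust the read paths below) so they sit directly under ./data/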
# open the blogs file for reading in binary mode
blogsCon <- file("./data/en_US.blogs.txt", open = "rb")
blogs <- readLines(blogsCon, encoding = "UTF-8", skipNul = TRUE)
close(blogsCon)
# open the news file for reading in binary mode
newsCon <- file("./data/en_US.news.txt", open = "rb")
news <- readLines(newsCon, encoding = "UTF-8", skipNul = TRUE)
close(newsCon)
# open the twitter file for reading in binary mode
twitterCon <- file("./data/en_US.twitter.txt", open = "rb")
twitter <- readLines(twitterCon, encoding = "UTF-8", skipNul = TRUE)
close(twitterCon)
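The summary table further down uses sizeBlogs, sizeNews and sizeTwitter, which are not computed in the chunks shown; a minimal sketch, assuming the files sit under ./data/ as above:
# file sizes in megabytes, used later in the summary table
sizeBlogs   <- file.size("./data/en_US.blogs.txt") / 1024^2
sizeNews    <- file.size("./data/en_US.news.txt") / 1024^2
sizeTwitter <- file.size("./data/en_US.twitter.txt") / 1024^2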
**Counting spaces**
blankspaceBlog <- sum(stri_count(blogs,regex="\\p{Space}"))
blankspaceNews <- sum(stri_count(news,regex="\\p{Space}"))
blankspaceTwitter <- sum(stri_count(twitter,regex="\\p{Space}"))
**Counting punctuation**
puncBlog <- sum(stri_count(blogs,regex="\\p{Punct}"))
puncNews <- sum(stri_count(news,regex="\\p{Punct}"))
puncTwitter <- sum(stri_count(twitter,regex="\\p{Punct}"))
**Counting non-ASCII lines**
nonEnglishBlog <- sum(!stri_enc_isascii(blogs))
nonEnglishNews <- sum(!stri_enc_isascii(news))
nonEnglishTwitter <- sum(!stri_enc_isascii(twitter))
**Counting lines containing digits**
numberBlog <- sum(stri_detect_regex(blogs, "[:digit:]"))
numberNews <- sum(stri_detect_regex(news, "[:digit:]"))
numberTwitter <- sum(stri_detect_regex(twitter, "[:digit:]"))
**Counting words**
# element 4 of stri_stats_latex() is the word count
nwordBlog <- stri_stats_latex(blogs)[4]
nwordNews <- stri_stats_latex(news)[4]
nwordTwitter <- stri_stats_latex(twitter)[4]
CONCLUSIONS:
summaryTabledata <- data.frame("File Name" = c("Blogs","News","Twitter"),
"File Size(MB)" = c(sizeBlogs, sizeNews, sizeTwitter),
"Spaces" = c(blankspaceBlog, blankspaceNews, blankspaceTwitter),
"NON-ASCII" = c(nonEnglishBlog, nonEnglishNews, nonEnglishTwitter),
"Words" = c(nwordBlog, nwordNews, nwordTwitter),
"Punctuation" = c(puncBlog, puncNews, puncTwitter),
"Digits" = c(numberBlog, numberNews, numberTwitter),
"Lines" = c(length(blogs), length(news), length(twitter)) )
summaryTabledata
## File.Name File.Size.MB. Spaces NON.ASCII Words Punctuation Digits
## 1 Blogs 200.4242 36434843 263027 37570839 6536746 228328
## 2 News 196.2775 33362288 135964 34494539 6913088 396003
## 3 Twitter 159.3641 28013435 77431 30451170 7877048 375220
## Lines
## 1 899288
## 2 1010242
## 3 2360148
Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.
set.seed(97531)
sizeFile <- 8000
sampTwitter <- sample(twitter, size = sizeFile, replace = TRUE)
sampBlogs <- sample(blogs, size = sizeFile, replace = TRUE)
sampNews <- sample(news, size = sizeFile, replace = TRUE)
samplingData <- c(sampTwitter, sampBlogs, sampNews)
length(samplingData)
## [1] 24000
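The chunk above draws a fixed-size sample with sample(). A minimal sketch of the "biased coin" approach mentioned above, using rbinom() to keep each line with some small probability and writing the result out so the sample does not have to be recreated each run (the 1% rate and the output file name are illustrative assumptions):
set.seed(97531)
sampleRate <- 0.01  # keep roughly 1% of lines; illustrative choice
keepBlogs   <- rbinom(length(blogs),   size = 1, prob = sampleRate) == 1
keepNews    <- rbinom(length(news),    size = 1, prob = sampleRate) == 1
keepTwitter <- rbinom(length(twitter), size = 1, prob = sampleRate) == 1
sampledLines <- c(blogs[keepBlogs], news[keepNews], twitter[keepTwitter])
# store the sample so it does not have to be recreated every time
writeLines(sampledLines, "./data/en_US.sample.txt")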
A. Charts: unigram barplot and word cloud for the combined sample
B. Word Cloud for Blogs File
C. Word Cloud for News File
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
D. Word Cloud for Twitter File
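The code producing these charts is not shown above. A minimal sketch of how the unigram barplot and word clouds could be built from the sampled data with tm, ggplot2 and wordcloud; the cleaning steps, the top-20 cutoff and the 8-color Set2 palette (which avoids the brewer.pal warning above) are assumptions, not necessarily the exact pipeline used here:
# build and clean a corpus from the sampled lines
cleanCorpus <- VCorpus(VectorSource(samplingData)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)
# unigram frequencies from a term-document matrix
tdm <- TermDocumentMatrix(cleanCorpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
topUnigrams <- data.frame(word = names(freq)[1:20], count = freq[1:20])
# barplot of the 20 most frequent unigrams
ggplot(topUnigrams, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency")
# word cloud for the combined sample (Set2 provides at most 8 colors)
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(8, "Set2"))
The same corpus, subset to each file's lines, would give the per-file word clouds in B-D.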
SUMMARY
REFERENCES