library(stringi)
    library(ggplot2)
    library(magrittr)
    library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
    library(SnowballC)
    library(ngram)
    library(corpus)
    library(tmap)
    library(wordcloud)
## Loading required package: RColorBrewer
    library(RWeka)

1. Understanding the Problem

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

The first step in analyzing any new data set is figuring out:

  1. what data you have
  2. what standard tools and models are used for that type of data

Make sure you have downloaded the data from Coursera before starting the exercises. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales; this report works with the en_US files.

The data is from a corpus called HC Corpora. See the About the Corpora reading for more details. The files have been language filtered but may still contain some foreign text. In this capstone we will be applying data science in the area of natural language processing. As a first step toward working on this project, you should familiarize yourself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to you.

Dataset: This is the training data that gets you started and will be the basis for most of the capstone. You must download the data from the Coursera site, not from external websites.
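The dataset is distributed as a single zip archive. As a hedged sketch (the download URL below is the link provided with the course task and should be verified against the Coursera page), it can be fetched and unpacked as follows:

    # download and unpack the Coursera/SwiftKey dataset (URL assumed from the course task)
    zipURL  <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
    zipFile <- "./data/Coursera-SwiftKey.zip"
    if (!dir.exists("./data")) dir.create("./data")
    if (!file.exists(zipFile)) {
        download.file(zipURL, destfile = zipFile, mode = "wb")
        # junkpaths = TRUE drops the final/<locale>/ folders so the files land directly in ./data
        unzip(zipFile, exdir = "./data", junkpaths = TRUE)
    }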

2. Getting the Data

READING FILES

    # open the files in binary mode and read them line by line
    blogsCon <- file("./data/en_US.blogs.txt", open = "rb")
    blogs <- readLines(blogsCon, encoding = "UTF-8", skipNul = TRUE)
    close(blogsCon)

    newsCon <- file("./data/en_US.news.txt", open = "rb")
    news <- readLines(newsCon, encoding = "UTF-8", skipNul = TRUE)
    close(newsCon)

    twitterCon <- file("./data/en_US.twitter.txt", open = "rb")
    twitter <- readLines(twitterCon, encoding = "UTF-8", skipNul = TRUE)
    close(twitterCon)

    # file sizes in megabytes (used in the summary table in the next section)
    sizeBlogs   <- file.info("./data/en_US.blogs.txt")$size / 1024^2
    sizeNews    <- file.info("./data/en_US.news.txt")$size / 1024^2
    sizeTwitter <- file.info("./data/en_US.twitter.txt")$size / 1024^2

UNDERSTANDING THE DATA

Counting the spaces, punctuation, non-ASCII characters, digits, and words in each file:

**Counting spaces**

    blankspaceBlog <- sum(stri_count(blogs,regex="\\p{Space}"))
    blankspaceNews <- sum(stri_count(news,regex="\\p{Space}"))
    blankspaceTwitter <- sum(stri_count(twitter,regex="\\p{Space}"))

**Counting punctuation**

    puncBlog <- sum(stri_count(blogs,regex="\\p{Punct}"))
    puncNews <- sum(stri_count(news,regex="\\p{Punct}"))
    puncTwitter <- sum(stri_count(twitter,regex="\\p{Punct}"))

**Counting lines with non-ASCII characters**

    nonEnglishBlog <- length(blogs[stri_enc_isascii(unlist(blogs))==FALSE])
    nonEnglishNews <- length(news[stri_enc_isascii(unlist(news))==FALSE])
    nonEnglishTwitter <- length(twitter[stri_enc_isascii(unlist(twitter))==FALSE])

**Counting digits and words**

    # lines that contain at least one digit
    numberBlog <- length(blogs[stri_detect_regex(blogs, "[:digit:]") == TRUE])
    numberNews <- length(news[stri_detect_regex(news, "[:digit:]") == TRUE])
    numberTwitter <- length(twitter[stri_detect_regex(twitter, "[:digit:]") == TRUE])
    # word counts (the "Words" element of stri_stats_latex())
    nwordBlog <- stri_stats_latex(blogs)[4]
    nwordNews <- stri_stats_latex(news)[4]
    nwordTwitter <- stri_stats_latex(twitter)[4]

SUMMARY TABLE:

    summaryTabledata <- data.frame("File Name" = c("Blogs","News","Twitter"),
                                   "File Size(MB)" = c(sizeBlogs, sizeNews, sizeTwitter),
                                   "Spaces" = c(blankspaceBlog, blankspaceNews, blankspaceTwitter),
                                   "NON-ASCII" = c(nonEnglishBlog, nonEnglishNews, nonEnglishTwitter),
                                   "Words" = c(nwordBlog, nwordNews, nwordTwitter),
                                   "Punctuation" = c(puncBlog, puncNews, puncTwitter),
                                   "Digits" = c(numberBlog, numberNews, numberTwitter),
                                   "Lines" = c(length(blogs), length(news), length(twitter))                                               )

    summaryTabledata
##   File.Name File.Size.MB.   Spaces NON.ASCII    Words Punctuation Digits
## 1     Blogs      200.4242 36434843    263027 37570839     6536746 228328
## 2      News      196.2775 33362288    135964 34494539     6913088 396003
## 3   Twitter      159.3641 28013435     77431 30451170     7877048 375220
##     Lines
## 1  899288
## 2 1010242
## 3 2360148

3. Sampling Data

Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.
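As a minimal sketch of the coin-flip approach described above (the 1% keep rate and the output path are illustrative assumptions, not values used in this report):

    # "flip a biased coin" for every line: keep a line with probability 0.01
    keep <- rbinom(length(blogs), size = 1, prob = 0.01) == 1
    blogsSubsample <- blogs[keep]
    # write the sub-sample out so it does not have to be recreated every time
    writeLines(blogsSubsample, "./data/en_US.blogs.sample.txt")

The sample actually used in this report is drawn with sample() instead: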

    # draw 8,000 lines from each source (with replacement) and combine into one sample
    set.seed(97531)
    sizeFile <- 8000
    sampTwitter <- sample(twitter, size = sizeFile, replace = TRUE)
    sampBlogs <- sample(blogs, size = sizeFile, replace = TRUE)
    sampNews <- sample(news, size = sizeFile, replace = TRUE)
    samplingData <- c(sampTwitter, sampBlogs, sampNews)
    length(samplingData)
## [1] 24000

4. Cleaning and Exploring Data

A. Charts: Unigram barplot and word cloud for the combined data (a cleaning and plotting sketch follows this list of charts)

B. Word Cloud for Blogs File

C. Word Cloud for News File


D. Word Cloud for Twitter File
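The code that produced these charts is not reproduced here; the following is a minimal sketch, based on the packages loaded above, of how the combined sample could be cleaned with tm and visualised as a unigram barplot and word cloud (the exact transformations are assumptions, not this report's original code):

    # build and clean a corpus from the combined sample
    corpusSample <- VCorpus(VectorSource(samplingData))
    corpusSample <- tm_map(corpusSample, content_transformer(tolower))
    corpusSample <- tm_map(corpusSample, removePunctuation)
    corpusSample <- tm_map(corpusSample, removeNumbers)
    corpusSample <- tm_map(corpusSample, removeWords, stopwords("english"))
    corpusSample <- tm_map(corpusSample, stripWhitespace)

    # unigram frequencies; slam is installed with tm, and row_sums avoids densifying the matrix
    tdm <- TermDocumentMatrix(corpusSample)
    freq <- sort(slam::row_sums(tdm), decreasing = TRUE)

    # barplot of the 20 most frequent unigrams
    barplot(freq[1:20], las = 2, col = "steelblue", main = "Top 20 unigrams")

    # word cloud; note that the Set2 palette has at most 8 colours
    wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
              colors = brewer.pal(8, "Set2"))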

SUMMARY

  1. The charts above show that each data source has its own set of keywords that carry a particular semantic load.
  2. Analyzing all of the data together gives different results, so the best approach is to analyze each file separately.
  3. Tokenization is the first step in NLP; it breaks sentences and phrases into words or n-grams. Essentially, we split the text into units called tokens, hence the term tokenization (see the sketch after this list).
  4. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Unigrams are single words, bigrams are two-word combinations, and trigrams are three-word combinations.
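A minimal sketch of bigram tokenization with RWeka, as referenced in points 3 and 4 above (applying it to the cleaned corpus from the previous sketch is an assumption about the workflow, not this report's original code):

    # tokenizer that produces bigrams (min = max = 2 words per token)
    bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

    # bigram frequencies from the cleaned sample corpus
    bigramTDM <- TermDocumentMatrix(corpusSample,
                                    control = list(tokenize = bigramTokenizer))
    bigramFreq <- sort(slam::row_sums(bigramTDM), decreasing = TRUE)
    head(bigramFreq, 10)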
