This is a brief summary of the exploratory data analysis (EDA) performed on the SwiftKey dataset. The objective of this EDA is to prepare the data for further statistical analysis (prediction).
The data is provided on the Coursera course site (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and comprises four language directories, each containing text files collected from blogs, news articles, and Twitter. For this report I will use the English-language files (en_US).
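For reproducibility, the archive can be downloaded and unpacked from within R. A minimal sketch (the destination file name is my own choice):

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
              destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip")  # should extract into ./final/<locale>/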
The packages used are tm and quanteda, both recommended by Coursera; quanteda will be used later for the text-prediction work.
data <- file.path(".", "final", "en_US")  # path to the English-language files
library(tm)  # tm for this EDA; quanteda comes in later for prediction
## Loading required package: NLP
# Read files into R
blogs <- readLines(file.path(data, "en_US.blogs.txt"), skipNul = TRUE)
news <- readLines(file.path(data, "en_US.news.txt"), skipNul = TRUE)
## Warning in readLines(file.path(data, "en_US.news.txt"), skipNul = TRUE):
## incomplete final line found on './final/en_US/en_US.news.txt'
twitter <- readLines(file.path(data, "en_US.twitter.txt"), skipNul = TRUE)
Basic summary of the data
# Get to know your data
length(blogs); length(news); length(twitter)
## [1] 899288
## [1] 77259
## [1] 2360148
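Line counts alone understate the differences between the sources, since tweets are much shorter than blog posts or news articles. A rough whitespace-based word count per source is a useful complement; a minimal sketch:

# Approximate word counts (whitespace-delimited tokens)
sum(lengths(strsplit(blogs, "\\s+")))
sum(lengths(strsplit(news, "\\s+")))
sum(lengths(strsplit(twitter, "\\s+")))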
The files are large, so for faster processing I will use a subset of the data.
merged <- paste(blogs[1:4000], news[1:4000], twitter[1:4000])  # combine the sources line-wise into 4,000 documents
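Note that taking the first 4,000 lines of each file may bias the subset toward whatever happens to appear first. A random sample of the same size would be more representative; a sketch (merged_rand is a hypothetical alternative to the object above):

# Hypothetical alternative: sample 4,000 random lines from each source
set.seed(111)
merged_rand <- paste(sample(blogs, 4000), sample(news, 4000), sample(twitter, 4000))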
# create a corpus
data_source <- VectorSource(merged)
corpus <- Corpus(data_source)
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation
## drops documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
corpus <- tm_map(corpus, removeWords, stopwords())
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords()):
## transformation drops documents
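One caveat on the ordering of these transformations: because removePunctuation runs before removeWords, contractions such as "don't" become "dont" and no longer match entries in the stop list. Removing stopwords while apostrophes are still intact avoids this; a sketch of the alternative ordering:

# Alternative ordering: drop stopwords before stripping punctuation
corpus_alt <- Corpus(VectorSource(merged))
corpus_alt <- tm_map(corpus_alt, content_transformer(tolower))
corpus_alt <- tm_map(corpus_alt, removeWords, stopwords())
corpus_alt <- tm_map(corpus_alt, removePunctuation)
corpus_alt <- tm_map(corpus_alt, removeNumbers)
corpus_alt <- tm_map(corpus_alt, stripWhitespace)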
Create a document-term matrix and compute word frequencies, which will be useful in further analysis of the data.
dtm <- as.matrix(DocumentTermMatrix(corpus))
freq <- sort(colSums(dtm), decreasing = TRUE)
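A note on memory: as.matrix() materialises the full document-term matrix as a dense matrix, which is fine for 4,000 documents but would not scale to the complete dataset. The slam package (a tm dependency) can compute column sums directly on the sparse representation; a minimal sketch:

# Same frequencies without converting to a dense matrix
library(slam)
freq_sparse <- sort(col_sums(DocumentTermMatrix(corpus)), decreasing = TRUE)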
Summary plots: a word cloud and a bar plot of the most frequent words.
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(111)  # make the word-cloud layout reproducible
wordcloud(names(freq), freq, max.words = 100)
## Warning in wordcloud(names(freq), freq, max.words = 100): said could not be
## fit on page. It will not be plotted.
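The "could not be fit on page" warning means the most frequent word was scaled beyond the plotting region. Shrinking the scale argument usually resolves it; a sketch (the exact values are a matter of taste):

wordcloud(names(freq), freq, max.words = 100, scale = c(3, 0.5))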
barplot(freq[1:20], names.arg = names(freq[1:20]), cex.names = 1.0, col = "blue", main = "Word frequency", las = 2)
I will use the quanteda package, alongside tm, when formulating my prediction algorithm. I will further employ several text analysis techniques useful in NLP, including:

1. Parsing
2. Stemming
3. Text segmentation
4. Named entity recognition
5. Sentiment analysis
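As a preview of that work, quanteda makes n-gram counting straightforward. A minimal sketch, assuming quanteda is installed (the choice of bigrams is illustrative):

# Tokenise the subset and count the most frequent bigrams
library(quanteda)
toks <- tokens(merged, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
topfeatures(dfm(tokens_ngrams(toks, n = 2)), 10)  # top 10 bigrams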