Introduction

The goal of this project is to work with the data and build a prediction algorithm. This report describes the exploratory analysis and the goals for the eventual app and algorithm. It covers only the major features of the data I have identified and briefly summarizes my plans for creating the prediction algorithm and Shiny app, in a way that is understandable to a non-data-scientist manager.

The motivation for this project is to:

Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you have amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Loading libraries

library(RWeka)
library(RColorBrewer)
library(NLP);library(slam)
library(knitr); library(doParallel)
library(stringi); library(tm); library(ggplot2); library(wordcloud)

Loading the data

setwd("E:/shubby coursera/final/en_US")
blogs <- suppressWarnings(readLines(con <- file("./en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE))
close(con)
twitter <- suppressWarnings(readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE))
close(con)
news <-  suppressWarnings(readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE))
close(con)
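The three reads above repeat the same open, read, close pattern. As a minimal sketch, the pattern could be wrapped in a small helper. The name `read_text_file` is hypothetical and not part of the original analysis. Opening the connection in binary mode ("rb") is an assumption on my part: it is a common precaution against a text-mode read stopping early on Windows when a file contains control characters.

```r
# Hypothetical helper (not in the original report): read a text file
# line by line through a binary-mode connection and always close it.
read_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # runs even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage on a small temporary file standing in for the real corpora
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_text_file(tmp)
length(demo_lines)
```

With such a helper, each data set would be loaded with a single call, for example `read_text_file("./en_US.blogs.txt")`.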

Get basic information

FileName <- c("blogs","news","twitter")
FileSizeMB <- c(200,196,159)
# number of lines read from each file
lines <- sapply(list(blogs,news,twitter),length)
# length in characters of the longest line in each file
maxChars <- c(max(nchar(blogs)),max(nchar(news)),max(nchar(twitter)))

data.frame(FileName,FileSizeMB,lines,maxChars)

  FileName FileSizeMB   lines maxChars
1    blogs        200  899288    40833
2     news        196   77259     5760
3  twitter        159 2360148      140
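A word count is another summary statistic that is often reported alongside these figures. The sketch below is an illustration only, using base R on a toy vector. `summarise_lines` is a hypothetical helper, and splitting on whitespace is a simplifying assumption about what counts as a word.

```r
# Hypothetical helper: line count, longest line and a rough word count
# for a character vector (one element per line).
summarise_lines <- function(x) {
  # split each trimmed line on runs of whitespace; empty lines give 0 words
  words_per_line <- lengths(strsplit(trimws(x), "\\s+"))
  data.frame(lineCount = length(x),
             longestLine = max(nchar(x)),
             wordCount = sum(words_per_line))
}

# Toy input standing in for blogs, news or twitter
toy <- c("the quick brown fox", "jumps", "over the lazy dog")
toy_summary <- summarise_lines(toy)
toy_summary
```

Applied to the real data, the same call would be made once per source, for example `summarise_lines(blogs)`.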

Clean and sample data

blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")

set.seed(519)
sample_data <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
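The sampling step can also be written as a small function that makes the sample size an explicit whole number and fixes the seed in one place. This is a sketch under stated assumptions: `sample_fraction` is a hypothetical name, and rounding down with `floor()` is my choice, not something the original code states.

```r
# Hypothetical helper: draw a reproducible random fraction of a vector.
sample_fraction <- function(x, fraction, seed) {
  set.seed(seed)                             # same seed, same sample
  n <- max(1, floor(length(x) * fraction))   # at least one element
  sample(x, n)
}

# Toy input: 1000 fake lines, 1% sample, seed taken from the report
toy <- paste("line", 1:1000)
s1 <- sample_fraction(toy, 0.01, seed = 519)
s2 <- sample_fraction(toy, 0.01, seed = 519)
length(s1)
identical(s1, s2)
```

Because the seed is reset inside the function, repeated calls with the same arguments return the same lines, which keeps the downstream n-gram counts reproducible.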

Build corpus

corpus <- VCorpus(VectorSource(sample_data))
# wrap base functions in content_transformer() so documents keep their tm class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
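To see what the cleaning steps do to individual lines, the sketch below applies rough base R equivalents to two toy strings. It is an approximation and not the tm implementation: `clean_text` is a hypothetical helper, and the final `trimws()` is an extra step that tm's `stripWhitespace` does not perform.

```r
# Hypothetical base R approximation of the corpus cleaning steps.
clean_text <- function(x) {
  x <- tolower(x)                    # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)                          # extra: trim leading/trailing spaces
}

cleaned <- clean_text(c("Hello, World! 123", "  R is   GREAT...  "))
cleaned
```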

Plot

plotNGram <- function(n) {
  options(mc.cores=1)
  
  # builds n-gram tokenizer and term document matrix
  tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control=list(tokenize=tk))
  
  # sum counts across documents, then keep the (up to) 25 most frequent n-grams
  ngram <- as.matrix(rollup(tdm, 2, na.rm=TRUE, FUN=sum))
  ngram <- data.frame(word=rownames(ngram), freq=ngram[,1])
  ngram <- head(ngram[order(-ngram$freq), ], 25)
  # fix the factor levels so bars are drawn in order of frequency
  ngram$word <- factor(ngram$word, as.character(ngram$word))
  
  # bar chart of n-gram frequencies
  ggplot(ngram, aes(x=word, y=freq)) + ggtitle("Frequency of Words") + 
    geom_bar(stat="identity", fill="#ED9626", color="#855415") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word(s)") + 
    ylab("Frequency")
}
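The function above relies on RWeka and tm to tokenize and count. To show what "counting n-grams" means on a very small input, here is a base R sketch. `count_ngrams` is a hypothetical helper, it assumes the text is already cleaned and whitespace-separated, and it is not a replacement for `NGramTokenizer`.

```r
# Hypothetical helper: count n-grams of size n in a character vector.
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, "\\s+"), function(words) {
    words <- words[nzchar(words)]          # drop empty tokens
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    # paste each window of n consecutive words into one n-gram
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)    # most frequent first
}

toy <- c("the cat sat on the mat", "the cat ate")
bigrams <- count_ngrams(toy, 2)
bigrams
```

In this toy input the bigram "the cat" occurs in both lines and therefore ranks first, which is the same ranking logic the plots apply to the sampled corpus.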

Plot Frequency of Most Common n-grams

plotNGram(1)

plotNGram(2)

plotNGram(3)

plotNGram(4)

Interesting Observations

Common n-grams can be used to identify tokens.
n-grams with similar frequencies are likely to be part of a longer n-gram (e.g. the 3-grams ‘little boy big’ and ‘boy big sword’ are both identified as common, and so is the 4-gram ‘little boy big sword’).
Differences among the 3 data sources include:
- tone
- style
- formality
- number of misspellings
- usage of condensed words

Noise is more prominent in informal sources (e.g. blogs and Twitter), as the data is heavily influenced by personal style, which differs significantly from person to person.
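The observation about overlapping n-grams can be illustrated with a short sketch: every n-gram contains exactly two (n-1)-grams, its first n-1 words and its last n-1 words, so a frequent 4-gram makes both of its 3-grams at least as frequent. `sub_ngrams` is a hypothetical helper written for this illustration only.

```r
# Hypothetical helper: return the two (n-1)-grams contained in an n-gram.
sub_ngrams <- function(ngram) {
  words <- strsplit(ngram, " ", fixed = TRUE)[[1]]
  stopifnot(length(words) >= 2)            # need at least a 2-gram
  k <- length(words) - 1
  c(paste(words[1:k], collapse = " "),        # leading (n-1)-gram
    paste(words[2:(k + 1)], collapse = " "))  # trailing (n-1)-gram
}

parts <- sub_ngrams("little boy big sword")
parts
```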

Milestone Report

soubhagya Laxmi

20 December 2016