Executive summary

This Data Science Capstone project report presents basic exploratory analysis for the Johns Hopkins University course on Coursera. The goal of the project is to work through a natural language processing pipeline, starting from cleaning the raw data and ending with an algorithm that predicts the next word after a user has typed some text, built into a Shiny application.

This particular report covers the exploratory analysis of the data, including handling the data, creating summary statistics and plots, and outlining how I plan to proceed.

Loading files

The code below loads the required libraries and reads the raw data files into variables:

#loading libraries (package startup warnings and masking messages omitted)
library(tm)
library(qdap)
library(RCurl)
library(stringi)
library(stringr)
library(RWeka)
library(ggplot2)
library(slam)
library(quanteda)
library(lattice)
library(scales)
library(DataCombine)
library(NLP)
library(gridBase)
library(magrittr)
library(dplyr)
#loading data
file_tweets<-"en_US.twitter.txt"
file_blogs<-"en_US.blogs.txt"
file_news<-"en_US.news.txt"
# skipNul = TRUE skips embedded NUL characters in the source files
tweets<-readLines(file_tweets, skipNul = TRUE)
blogs<-readLines(file_blogs, skipNul = TRUE)
# the news file is read through a binary connection so that readLines
# does not stop early at a control character in the file
con<-file(file_news, open="rb")
news<-readLines(con, encoding="UTF-8")
close(con)
rm(con)

Summary statistics

The code below reports the file sizes, record lengths, and word counts for each source:

# file sizes
filesizeMB<-data.frame(sizeMB = c(round(file.info(file_tweets)$size / 1024^2,2),
                                  round(file.info(file_blogs)$size / 1024^2,2),
                                  round(file.info(file_news)$size / 1024^2,2)))

# number of records
records<-data.frame(length = c(length(tweets),length(blogs),length(news)))

# lengths of records
longest<-data.frame('Max length' = c(max(str_length(tweets)),max(str_length(blogs)),max(str_length(news))))
shortest<-data.frame('Min length' = c(min(str_length(tweets)),min(str_length(blogs)),min(str_length(news))))
average<-data.frame('Avg length' = c(mean(str_length(tweets)),mean(str_length(blogs)),mean(str_length(news))))

# number of words
wordcounts<-data.frame(words = c(sum(stri_count_words(tweets)),
                                 sum(stri_count_words(blogs)),
                                 sum(stri_count_words(news))))

sources<-data.frame(source = c("tweets","blogs","news"), stringsAsFactors=FALSE)

info_df<-cbind(sources, records, shortest, average, longest, wordcounts, filesizeMB)
info_df
##   source  length Min.length Avg.length Max.length    words sizeMB
## 1 tweets 2360148          2    68.8029        213 30218166 159.36
## 2  blogs  899288          1   231.6960      40835 38154238 200.42
## 3   news 1010242          1   201.1628      11384 34762395 196.28

With over 100 million words in the data in total, we can work with samples of 100,000 lines per source:

# lower-case each source and draw a random sample of 100,000 lines from it
# (calling set.seed() before sampling would make the sample reproducible)
sm_tweets<-sample(stri_trans_tolower(tweets),size=100000,replace=FALSE)
sm_blogs<-sample(stri_trans_tolower(blogs),size=100000,replace=FALSE)
sm_news<-sample(stri_trans_tolower(news),size=100000,replace=FALSE)

#combining samples into a single character vector
allData<-combine(sm_tweets, sm_blogs, sm_news)
wordcounts<-data.frame (words = c(sum(stri_count_words(allData))))
wordcounts
##     words
## 1 8941887
#unigram
dfm.all<-dfm(allData, ngrams = 1, verbose = FALSE, concatenator = " ", stem=FALSE)
barchart(topfeatures(dfm.all, 15)) 

This bar chart shows the frequencies of the 15 most frequently used words in the combined sample.

#The frequency of the words: 
ngram<-as.data.frame(as.matrix(docfreq(dfm.all)))
ngram.sorted<-sort(rowSums(ngram), decreasing=TRUE)
ngram.FreqTable<-data.frame(Words=names(ngram.sorted), Frequency = ngram.sorted)
summary(ngram.FreqTable$Frequency)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00      1.00      1.00     35.08      3.00 166100.00
#Cumulative frequency of the words:
w4<-ngram.FreqTable
w4<-w4 %>% arrange(Frequency)
w4<-mutate(w4, cumsum = cumsum(Frequency) )
w4<-mutate(w4, rn = 1:nrow(w4))
w4<-mutate(w4, cumper = (w4$cumsum / max(w4$cumsum)))

lp<-ggplot(w4, aes(x=rn, y=cumper)) + geom_point()
lp<-lp + ggtitle("Cumulative Frequency of Words") 
lp<-lp + theme(plot.title = element_text(lineheight=.8, face="bold"))
lp<-lp + scale_y_continuous(labels=percent)
#lp<-lp + geom_hline(yintercept=.50)
lp<-lp + labs(x='Word index (sorted by increasing frequency)', y='Cumulative % of Occurrences')
lp

#Frequency calculations
w4.mean<-mean(w4$Frequency)
word.per<-(count(w4 %>% filter(Frequency > w4.mean)) / count(w4) ) * 100
#the mean frequency:
w4.mean
## [1] 35.08139
#only a small percentage of words are above the mean frequency
word.per
##          n
## 1 6.298115

Summary of unigrams:

The combined sample has just over 200k distinct words, occurring more than 7 million times in total. 50% of the occurrences are made up of just 269 words. There are 12902 words above the mean frequency of 35, which amounts to 6.3% of all distinct words in the sample.
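The 269-word figure can be cross-checked from the w4 table built above. A minimal sketch of my own (not part of the original output): because w4 is ordered from rarest to most frequent, the rows whose cumulative share already exceeds 50% are the most frequent words, and counting them approximates how many words cover half of all occurrences.

#approximate number of top words covering 50% of all occurrences
sum(w4$cumper > 0.5)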

Summary of 2-grams:

dfm2.all<-dfm(allData, ngrams = 2, verbose = FALSE, concatenator = " ", stem=FALSE)
barchart(topfeatures(dfm2.all, 15))

There are 42981 2-grams making up 50% of the occurrences. There are 245760 2-grams above the mean frequency of 3.4; they make up 10.1% of the total of 2438405 distinct 2-grams.
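These figures (and the 3-gram figures below) follow the same recipe as the unigram calculations, so they can be collected in a small helper. This is a hedged sketch of my own (freq_summary is not part of the original code); it reuses docfreq() exactly as in the unigram section, so the numbers may differ slightly from counts based on raw term frequencies.

#summary figures for any n-gram dfm, following the unigram approach above
freq_summary <- function(dfm_obj) {
  freq  <- docfreq(dfm_obj)   # frequency per n-gram, as used for the unigrams
  m     <- mean(freq)
  above <- sum(freq > m)
  data.frame(ngrams = length(freq),
             mean.freq = round(m, 1),
             above.mean = above,
             above.mean.pct = round(100 * above / length(freq), 1))
}
freq_summary(dfm2.all)
#the same call works for the 3-gram dfm: freq_summary(dfm3.all)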

Summary of 3-grams:

dfm3.all<-dfm(allData, ngrams = 3, verbose = FALSE, concatenator = " ", stem=FALSE)
barchart(topfeatures(dfm3.all, 15))

There are 1583242 3-grams making up 50% of the occurrences. There are 639781 3-grams above the mean frequency of 1.4; they account for 11.3% of the total of 5662011 distinct 3-grams.

Further plans:

The next steps are to remove profanity from the sample, build 4-grams (and higher-order n-grams if needed), clean the data further and refine the code, then develop the predictive model and finally incorporate it into a Shiny application. A rough sketch of the first two steps follows below.
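As a hedged sketch of how those first two steps might look (not the final implementation): "profanity.txt" is a placeholder for whatever word list is eventually chosen, removeWords() from tm is one possible way to strip those words, and the 4-gram call mirrors the 1/2/3-gram dfm calls above.

#profanity filtering (the word list file name is a placeholder)
profanity<-readLines("profanity.txt", skipNul = TRUE)
allData.clean<-removeWords(allData, profanity)

#4-grams, built the same way as the lower-order dfm objects
dfm4.all<-dfm(allData.clean, ngrams = 4, verbose = FALSE, concatenator = " ", stem=FALSE)
barchart(topfeatures(dfm4.all, 15))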