This project involves taking three large text datasets: one from blogs, one from news articles, and one from Twitter. Because the data is so large, a random sample of 0.5% of each dataset (a sampling fraction of 0.005) was taken for the initial analysis. The analysis was done mainly with the quanteda package in R.

There are six stages reported below.

  1. Reading the data into R.
  2. Sampling, cleaning, and combining the datasets.
  3. Tokenizing the data.
  4. Creating and implementing a profanity filter.
  5. Analysing and plotting the frequency of individual words and of word combinations in the dataset.
  6. Analysing and plotting which words and groups of words are most strongly associated with each other.
## 
## Bellow Packages Successfully Installed:
## 
##     knitr     dplyr      qdap SnowballC        tm   R.utils  devtools 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE 
##      pryr   ggplot2  quanteda   stringr 
##      TRUE      TRUE      TRUE      TRUE

Step 1 - Load the Data, Inspect Its Size, and Take Random Samples

Load the datasets, take a 0.5% random sample of each, and load the samples into data frames. The sizes of the original datasets are:

Blogs original dataset: 210160014 bytes, 261 MB in memory, 899288 lines. News original dataset: 205811889 bytes, 1010242 lines. Twitter original dataset: 167105338 bytes, 2360148 lines.

A 0.5% sample (a sampling fraction of 0.005) should be sufficient for this exploratory analysis, given that the lines are drawn at random.

set.seed(123)
setwd("~/courses/Capstone/final/en_US")
blogs <- readLines("en_US.blogs.txt", ok = TRUE)
#File size of the original blogs dataset in bytes
file.size("en_US.blogs.txt")
## [1] 210160014
#Size of original blogs dataset in terms of MB
object_size(blogs)
## 261 MB
lengthB <- length(blogs)
#Length in terms of lines of original blogs dataset
lengthB
## [1] 899288

I created a function to find the number of words and lines in each data frame (see the appendix for the full code).

WordsLines <- function(dataframe, namess, namess2){
  Words <- as.data.frame(dataframe)      # coerce the text vector into a data frame
  Wc <- wc(Words[, 1])                   # word count of each row of the first column
  Words1 <- as.data.frame(Wc)            # put the word counts into a data frame
  Words1$Wc <- as.numeric(Words1$Wc)     # make sure the counts are numeric
  names(Words1)[1] <- "Words"            # rename the column to "Words"
  Words1 <- sum(Words1, na.rm = TRUE)    # total word count across the whole column
  Lines <- nrow(Words)                   # number of lines (rows) in the data frame
  final <- cbind(Lines, Words1)          # combine the line and word counts into one table
  colnames(final) <- c(namess, namess2)  # rename the columns for the particular dataset
  final                                  # return the table
}

Number of words and lines in the Blogs dataset

WordsLines(blogs, "Blogs Lines", "Blogs Words")
##      Blogs Lines Blogs Words
## [1,]      899288    36825518

Get sample from Blogs dataset

BlogsFinal<-as.data.frame(sample(blogs, size = lengthB*.005))

News

## [1] NA
## [1] 1010242

Number of words and lines in the News dataset

WordsLines(news, "News Lines", "News Words")
##      News Lines News Words
## [1,]    1010242   33482314

Get sample from News dataset

NewsFinal<-as.data.frame(sample(news, size = lengthN*.005))

Twitter

## [1] NA
## [1] 2360148

Number of words and lines in the Twitter dataset

WordsLines(twitter, "Twitter Lines", "Twitter Words")
##      Twitter Lines Twitter Words
## [1,]       2360148      29379682

Get sample from Twitter dataset

TwitterFinal<-as.data.frame(sample(twitter, size = lengthT*.005))

Step 2 - Combine the Datasets

Combine all three samples into one dataset using full_join() from the dplyr package, then check the class of the combined text column to make sure it is "character".
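Since full_join() combines two data frames at a time, the joins are chained on the shared paragraph column (each sample already has a single character column with that name; see the appendix):

DataAll <- full_join(TwitterFinal, NewsFinal, by = "paragraph") %>%
  full_join(BlogsFinal, by = "paragraph")
class(DataAll$paragraph)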

## [1] "character"

Step 3 - Clean the Data

The next step is to clean the data, which was done in several stages. First I replaced all end-of-sentence punctuation (periods, exclamation marks, etc.) with the marker "ootoo". I then removed Twitter-specific characters such as @ and #, converted all text to lower case, collapsed contiguous whitespace, and removed all non-alphabetic characters (numbers, punctuation, etc.). Finally I split the strings on the "ootoo" marker so that each sentence becomes a separate document, and removed the remaining stray characters. A code excerpt and the first ten cleaned documents are shown below.
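The key cleaning steps look like this (the complete sequence, which also removes URLs, digits, retweet markers, and the list artifacts left by strsplit(), is in the appendix):

DataAll$paragraph <- gsub(';|\\.|!|\\?', 'ootoo', DataAll$paragraph)     # mark sentence boundaries
DataAll$paragraph <- gsub('@\\w+', '', DataAll$paragraph)                # drop Twitter handles
DataAll$paragraph <- gsub('[^[:alpha:]]', ' ', DataAll$paragraph)        # keep letters only
DataAll$paragraph <- tolower(DataAll$paragraph)                          # lower-case everything
DataAll$paragraph <- gsub('\\s+', ' ', DataAll$paragraph)                # collapse whitespace
DataAll$paragraph <- strsplit(DataAll$paragraph, 'ootoo', fixed = TRUE)  # split into sentences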

##  [1] "june sounds great late june gotta trip to seattle early june"                             
##  [2] "someone is butthurt about a breakup wow that came out of left field"                      
##  [3] "need a job so badly"                                                                      
##  [4] "thanks i already had dropboxevernote as i cant work without them ill check out the others"
##  [5] "you can officially start drinking now being cinco de mayo and all"                        
##  [6] "taking my fat butt to the gym what was i thinking taking a month off"                     
##  [7] "i cant wait for some good soccer action tomorrod"                                         
##  [8] "me thinks this is momentum"                                                               
##  [9] "iwannagiveashoutoutto all my haters "                                                     
## [10] "love this show glad for the new season "

Step 4 - Create a Profanity Filter

I then created a profanity filter based on a data frame containing 450 profane words. I used two data frames: one holding the 450 profanities (stored as regular-expression patterns) and another holding 450 copies of the replacement string "XXXX". I then wrote a function that replaces every profanity that has a match in the dataset with "XXXX".

## [1] 449   9
## [1] 448   1

Results of Profanity Filter

I created the function ProfanityRemover(), which takes a data frame of profanity patterns and a matching vector of replacements and substitutes every match in the documents, in this case with "XXXX". The function is sketched below, followed by the first few patterns and a before-and-after example.
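ProfanityRemover() is a simple loop over the patterns (also listed in the appendix); each pattern is a word-boundary regular expression such as \\bword\\b, and each match is replaced by the corresponding entry in the replacement vector:

ProfanityRemover <- function(pattern, replacement, x, ...) {
  # substitute each profanity pattern with its replacement, one pattern at a time
  for (i in seq_along(pattern)) {
    x <- gsub(pattern[i], replacement[i], x, ...)
  }
  x
}

DataAllClean <- ProfanityRemover(badwords$V9, replaceXXXX$V1, DataAll$paragraph)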

These are the first few patterns in the profanity dataset that will be filtered out of the main dataset:

head(badwords$V9)
## [1] V8         \\b5h1t\\b \\b5hit\\b \\ba55\\b  \\banal\\b \\banus\\b
## 449 Levels: \\b5h1t\\b \\b5hit\\b \\ba_s_s\\b \\ba55\\b ... V8

An example of how the profanity filter works: first the unfiltered line, then the same line after filtering.

#Unfiltered
DataAll$paragraph[203]
## [1] "brazilian butt lift is serious shit minutes and im fucking drained omg"
#Filtered
DataAllClean[203]
## [1] "brazilian XXXX lift is serious XXXX minutes and im XXXX drained omg"

Step 5 - Tokenization

Create a corpus and tokenize it into n-grams: unigrams, bigrams, trigrams, and 4-grams. The verbose output from building each document-feature matrix is shown below, followed by a summary of the corpus and the 20 most frequent unigrams.
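The dfm() calls for all four n-gram sizes are listed in the appendix; only the ngrams argument (1 through 4, plus a space concatenator for the multi-word grams) changes between them. The bigram call, for example:

TwoNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"),
                 removeNumbers = TRUE, removePunct = TRUE,
                 verbose = TRUE, concatenator = " ", ngrams = 2)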

## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 16,851 documents
##    ... indexing features: 33,172 feature types
##    ... removed 124 features, from 174 supplied (glob) feature types
##    ... created a 16851 x 33048 sparse dfm
##    ... complete. 
## Elapsed time: 1.007 seconds.
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 16,851 documents
##    ... indexing features: 172,484 feature types
##    ... removed 93,793 features, from 174 supplied (glob) feature types
##    ... created a 16851 x 78691 sparse dfm
##    ... complete. 
## Elapsed time: 5.685 seconds.
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 16,851 documents
##    ... indexing features: 249,846 feature types
##    ... removed 206,138 features, from 174 supplied (glob) feature types
##    ... created a 16851 x 43708 sparse dfm
##    ... complete. 
## Elapsed time: 5.899 seconds.
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 16,851 documents
##    ... indexing features: 254,291 feature types
##    ... removed 232,230 features, from 174 supplied (glob) feature types
##    ... created a 16851 x 22061 sparse dfm
##    ... complete. 
## Elapsed time: 6.156 seconds.
## Corpus consisting of 16851 documents, showing 5 documents.
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
##   Text Types Tokens Sentences
##  text1     9     11         1
##  text2    13     13         1
##  text3     5      5         1
##  text4    15     16         1
##  text5    12     12         1
## 
## Source:  /Users/levibrackman/courses/Capstone/* on x86_64 by levibrackman
## Created: Sun May  1 15:18:17 2016
## Notes:
##  said  will  just    im  like   get   one  xxxx   new   can  good  love 
##  1375   997   994   858   805   788   783   697   690   675   649   602 
##   day  time  dont   now  know     u great    go 
##   600   589   587   543   542   507   496   468

Step 6 - N-gram Frequencies

Put the n-grams into data frame format so that they can be graphed. To do this I created a function that puts each document-feature matrix into an ordinary matrix and then sorts the n-grams into a data frame by how frequently they occur in the dataset.
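The helper toplotDF(), reproduced from the appendix, does the conversion:

toplotDF <- function(corp) {
  dataframes <- as.data.frame(as.matrix(docfreq(corp)))     # document frequency of each feature
  DFsorted <- sort(rowSums(dataframes), decreasing = TRUE)  # sort features by frequency
  FreqTable <- data.frame(Words = names(DFsorted), Frequency = DFsorted)
  FreqTable                                                 # return the frequency table
}

OneNgramsDF <- toplotDF(OneNgrams)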

Plot the n-grams

These plots show which n-grams occur most frequently in the dataset. There are four plots, covering unigrams, bigrams, trigrams, and 4-grams.

Step 7 - Relationships Between Words in the Corpus

In this final step we look at which words are most strongly associated with each other. We use the collocations() function for this, which reports (among other measures, such as Pearson's χ²) the likelihood ratio statistic G², computed as G² = 2 · Σ_i Σ_j n_ij · log(n_ij / m_ij), where n_ij is the observed count and m_ij the expected count in cell (i, j) of the word co-occurrence table. I then plot the 30 two-word and the 30 three-word combinations that score highest in the dataset.
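To make the statistic concrete, here is a small, purely illustrative calculation of G² for a single hypothetical 2 x 2 co-occurrence table (the g2() helper and the counts below are made up for this example; the actual analysis uses quanteda's collocations(), as listed in the appendix):

g2 <- function(observed) {
  # expected counts under independence of the rows and columns
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # cells with a zero observed count contribute nothing to the sum
  2 * sum(ifelse(observed > 0, observed * log(observed / expected), 0))
}

observed <- matrix(c(30, 120, 70, 9780), nrow = 2)  # hypothetical counts for one word pair
g2(observed)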

Appendix - Code

#install.packages("quanteda")
#install.packages("pryr")
#install.packages("devtools")
#devtools::install_github("hadley/lineprof")
#install.packages("R.utils")
#install.packages("qdap")
#library(devtools)
#install_github("espanta/lubripack", force = TRUE)
library(lubripack)
lubripack("knitr", "dplyr", "qdap", "SnowballC", "tm", "R.utils", "devtools", "pryr", "ggplot2", "quanteda", "stringr")
set.seed(123)
setwd("~/courses/Capstone/final/en_US")
blogs <- readLines("en_US.blogs.txt", ok = TRUE)
#File size of the original blogs dataset in bytes
file.size("en_US.blogs.txt")
#Size of original blogs dataset in terms of MB
object_size(blogs)
lengthB <- length(blogs)
#Length in terms of lines of original blogs dataset
lengthB

WordsLines <- function(dataframe, namess, namess2){
  Words <- as.data.frame(dataframe)      # coerce the text vector into a data frame
  Wc <- wc(Words[, 1])                   # word count of each row of the first column
  Words1 <- as.data.frame(Wc)            # put the word counts into a data frame
  Words1$Wc <- as.numeric(Words1$Wc)     # make sure the counts are numeric
  names(Words1)[1] <- "Words"            # rename the column to "Words"
  Words1 <- sum(Words1, na.rm = TRUE)    # total word count across the whole column
  Lines <- nrow(Words)                   # number of lines (rows) in the data frame
  final <- cbind(Lines, Words1)          # combine the line and word counts into one table
  colnames(final) <- c(namess, namess2)  # rename the columns for the particular dataset
  final                                  # return the table
}

WordsLines(blogs, "Blogs Lines", "Blogs Words")

BlogsFinal<-as.data.frame(sample(blogs, size = lengthB*.005))
news <- readLines("~/courses/Capstone/final/en_US/en_US.news.txt", ok = TRUE)
#File size of the original news dataset in bytes
file.size("en_US.news.txt")
#Length in terms of lines of original news dataset
lengthN <- length(news)
lengthN
WordsLines(news, "News Lines", "News Words")
NewsFinal<-as.data.frame(sample(news, size = lengthN*.005))
twitter <- readLines("~/courses/Capstone/final/en_US/en_US.twitter.txt", ok = TRUE, skipNul = TRUE)
#File size of the original twitter dataset in bytes
file.size("en_US.twitter.txt")
#Length in terms of lines of original twitter dataset
lengthT <- length(twitter)
lengthT
WordsLines(twitter, "Twitter Lines", "Twitter Words")
TwitterFinal<-as.data.frame(sample(twitter, size = lengthT*.005))

names(TwitterFinal) <- c("paragraph")
names(NewsFinal) <- c("paragraph")
names(BlogsFinal) <- c("paragraph")
TwitterFinal$paragraph<-as.character(TwitterFinal$paragraph)
BlogsFinal$paragraph<-as.character(BlogsFinal$paragraph)
NewsFinal$paragraph<-as.character(NewsFinal$paragraph)
#Chain the joins: full_join() combines two data frames at a time
DataAll <- full_join(TwitterFinal, NewsFinal, by = "paragraph") %>%
  full_join(BlogsFinal, by = "paragraph")
class(DataAll$paragraph)
#Replace all end-of-sentence punctuation with "ootoo", then remove Twitter-type characters such as @, #, RT, etc.
DataAll$paragraph <- gsub(pattern=';|\\.|!|\\?', x=DataAll$paragraph, replacement='ootoo')
DataAll$paragraph <- gsub("&amp", "", DataAll$paragraph)
DataAll$paragraph <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", DataAll$paragraph)
DataAll$paragraph <- gsub("@\\w+", "", DataAll$paragraph)
DataAll$paragraph <- gsub("[[:punct:]]", "", DataAll$paragraph)
DataAll$paragraph <- gsub("[[:digit:]]", "", DataAll$paragraph)
DataAll$paragraph <- gsub("http\\w+", "", DataAll$paragraph)
DataAll$paragraph <- gsub("[ \t]{2,}", "", DataAll$paragraph)
DataAll$paragraph <- gsub("^\\s+|\\s+$", "", DataAll$paragraph) 
# remove all non-alpha text (numbers etc)
DataAll$paragraph <- gsub(pattern="[^[:alpha:]]", x=DataAll$paragraph, replacement = ' ')
# force all characters to lower case
DataAll$paragraph <- tolower(DataAll$paragraph)
# remove contiguous spaces
DataAll$paragraph <- gsub(pattern="\\s+", x=DataAll$paragraph, replacement=' ')
#split the strings on "ootoo" so that each sentence is separate
DataAll$paragraph <- strsplit(x=DataAll$paragraph, split='ootoo',fixed = TRUE)
#Remove other miscellaneous characters left over from the split
DataAll$paragraph <- gsub(pattern="\"", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="\\,", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="^[c]", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="^[(]", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="[)]$", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="\\s+", x=DataAll$paragraph, replacement=' ')
#See the first ten lines of clean data
head(DataAll$paragraph, 10)
#Read in a list of 450 curse words 
badwords <- read.csv("/Users/levibrackman/courses/Capstone/final/en_US/badwords/badwords.csv", header = F)
replaceXXXX <- read.csv("/Users/levibrackman/courses/Capstone/final/en_US/badwords/XXXX.csv", header = F)
dim(badwords)
dim(replaceXXXX)

ProfanityRemover <- function(pattern, replacement, x, ...) {
  # substitute each profanity pattern with its replacement, one pattern at a time
  for (i in seq_along(pattern)) {
    x <- gsub(pattern[i], replacement[i], x, ...)
  }
  x
}
#run the function removing all 450 profanities
DataAllClean <- ProfanityRemover(badwords$V9, replaceXXXX$V1, DataAll$paragraph)
head(badwords$V9)
#Unfiltered
DataAll$paragraph[203]
#Filtered
DataAllClean[203]
#install.packages("stringr", dependencies = TRUE)
DataAllCorpus <- corpus(DataAllClean)
OneNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, ngrams = 1)
#Save to rdata file
#save(OneNgrams, file = "OneNgrams.RData")  #Save to rdata file

TwoNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, verbose = TRUE, concatenator = " ", ngrams = 2)
#Save to rdata file
#save(TwoNgrams, file = "TwoNgrams.RData")

ThreeNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, verbose = TRUE, concatenator = " ", ngrams = 3)
#Save to rdata file
#save(ThreeNgrams, file = "ThreeNgrams.RData")

FourNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, verbose = TRUE, concatenator = " ",  ngrams = 4)
#Save to rdata file
#save(FourNgrams, file = "FourNgrams.RData")
summary(DataAllCorpus, n = 5)

topfeatures(OneNgrams, 20)

toplotDF <- function(corp) {
  dataframes <- as.data.frame(as.matrix(docfreq(corp)))     # document frequency of each feature
  DFsorted <- sort(rowSums(dataframes), decreasing = TRUE)  # sort features by frequency
  FreqTable <- data.frame(Words = names(DFsorted), Frequency = DFsorted)
  FreqTable                                                 # return the frequency table
}

OneNgramsDF <- toplotDF(OneNgrams)
TwoNgramsDF <- toplotDF(TwoNgrams)
ThreeNgramsDF <- toplotDF(ThreeNgrams)
FourNgramsDF <- toplotDF(FourNgrams)

plotsall1 <- function(plots, title1) {
  # bar chart of the 30 most frequent n-grams, in decreasing order of frequency
  P <- ggplot(within(plots[1:30, ], Words <- factor(Words, levels = Words)), aes(Words, Frequency))
  P <- P + geom_bar(stat = "identity", fill = "orange") + ggtitle(title1)
  P <- P + theme(axis.text.x = element_text(angle = 90, hjust = .3))
  P
}

plotsall1(OneNgramsDF, "Frequency of Unigram")
plotsall1(TwoNgramsDF, "Frequency of Bigrams")
plotsall1(ThreeNgramsDF, "Frequency of Trigrams")
plotsall1(FourNgramsDF, "Frequency of 4-Ngrams")

twowords<-collocations(DataAllCorpus, size = 2, method = "all")
twowords$Words <- paste(twowords$word1, twowords$word2)
threewords<-collocations(DataAllCorpus, size = 3)
threewords$Words <- paste(threewords$word1, threewords$word2, threewords$word3)

plotsall2 <- function(plots, title1) {
  # bar chart of the top 30 collocations by G^2
  P <- ggplot(within(plots[1:30, ], Words <- factor(Words, levels = Words)), aes(Words, G2))
  P <- P + geom_bar(stat = "identity", fill = "blue") + ggtitle(title1)
  P <- P + theme(axis.text.x = element_text(angle = 90, hjust = 1))
  P
}

plotsall2(twowords, "30 Most Correlated Two Words")
plotsall2(threewords, "30 Most Correlated Three Words")