This project involves taking three large datasets of text: one from blogs, a second from news articles, and a third from Twitter. Because the data is so large, a random sample of 0.5% (a .005 fraction) of each dataset was taken for the initial analysis. The analysis was done mainly using the quanteda package in R.
There are four stages reported below.
##
## Below Packages Successfully Installed:
##
## knitr dplyr qdap SnowballC tm R.utils devtools
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## pryr ggplot2 quanteda stringr
## TRUE TRUE TRUE TRUE
Load the datasets, take a random sample of 0.5% (a .005 fraction) of each, and load the samples into data frames. The sizes of the original datasets are below:
Blogs original dataset: 210160014 bytes, 261 MB, 899288 lines. News original dataset: 205811889 bytes, 1010242 lines. Twitter original dataset: 167105338 bytes, 2360148 lines.
Each sample is a .005 fraction of its original dataset, which should be sufficient for this exploratory analysis given that it is a random sample.
set.seed(123)
setwd("~/courses/Capstone/final/en_US")
blogs <- readLines("en_US.blogs.txt", ok = TRUE)
#File size of the original blogs dataset in bytes
file.size("en_US.blogs.txt")
## [1] 210160014
#Size of original blogs dataset in terms of MB
object_size(blogs)
## 261 MB
lengthB <- length(blogs)
#Length in terms of lines of original blogs dataset
lengthB
## [1] 899288
I created a function to find the number of words and lines in each dataset (see the appendix for the function).
WordsLines<-function(dataframe, namess, namess2){
Words<-as.data.frame(dataframe)#the input is a character vector, so put it into a data frame
Wc<-wc(Words[,1])#get the word count of each entry (all rows) of the first column
Words1<-as.data.frame(Wc)#put those word counts into a data frame
Words1$Wc<-as.numeric(Words1$Wc)#make sure it is numeric
names(Words1)[1]<-paste("Words")#change the column name to "Words"
Words1<-sum(Words1, na.rm = T)#sum the word counts over the entire column
Lines<-nrow(Words)#find the number of lines (rows) in the entire data frame
final<-cbind(Lines, Words1)#combine the line count and word count into one table
colnames(final) <- c(namess, namess2)#change the column names to fit the particular dataset
final#return the table
}
Number of words and lines in the Blogs dataset
WordsLines(blogs, "Blogs Lines", "Blogs Words")
## Blogs Lines Blogs Words
## [1,] 899288 36825518
Get sample from Blogs dataset
BlogsFinal<-as.data.frame(sample(blogs, size = lengthB*.005))
Load the News dataset and check its file size and length (output below):
## [1] NA
## [1] 1010242
Number of words and lines in the News dataset
WordsLines(news, "News Lines", "News Words")
## News Lines News Words
## [1,] 1010242 33482314
Get sample from News dataset
NewsFinal<-as.data.frame(sample(news, size = lengthN*.005))
Load the Twitter dataset and check its file size and length (output below):
## [1] NA
## [1] 2360148
Number of words and lines in the Twitter dataset
WordsLines(twitter, "Twitter Lines", "Twitter Words")
## Twitter Lines Twitter Words
## [1,] 2360148 29379682
Get sample from Twitter dataset
TwitterFinal<-as.data.frame(sample(twitter, size = lengthT*.005))
Combine all three sampled datasets into one dataset using full_join from the dplyr package. Then check the class of the combined text column to make sure it is “character”.
## [1] "character"
The next step is to clean the data. This was done in several stages. First, I replaced all end-of-sentence punctuation (periods, exclamation marks, question marks, etc.) with the placeholder “ootoo”. I then removed Twitter-specific characters such as @, #, and RT, stripped the remaining punctuation, digits, and URLs, and removed all other non-alpha text. I converted everything to lower case and collapsed contiguous spaces, then split the strings on “ootoo” so that each sentence becomes a separate document, and finally removed the miscellaneous characters left over from the split. A condensed sketch of these steps appears after the sample output; the full code is in the appendix. Below is the cleaned outcome for the first ten documents in the data frame.
## [1] "june sounds great late june gotta trip to seattle early june"
## [2] "someone is butthurt about a breakup wow that came out of left field"
## [3] "need a job so badly"
## [4] "thanks i already had dropboxevernote as i cant work without them ill check out the others"
## [5] "you can officially start drinking now being cinco de mayo and all"
## [6] "taking my fat butt to the gym what was i thinking taking a month off"
## [7] "i cant wait for some good soccer action tomorrod"
## [8] "me thinks this is momentum"
## [9] "iwannagiveashoutoutto all my haters "
## [10] "love this show glad for the new season "
I then created a profanity filter based on a data frame containing roughly 450 profane words stored as regular-expression patterns. I used two data frames: one with the profanities and one with the same number of “XXXX” replacement strings. I then created a function that replaces every profanity that has a match in the dataset with “XXXX”. The dimensions of the two data frames are shown below.
## [1] 449 9
## [1] 448 1
I created the function “ProfanityRemover”, which takes a data frame of profanity patterns and replaces every match in a document with another pattern, in this case “XXXX”. See below for how the filter works.
These are the first few entries in the profanity dataset that will be filtered out of the main dataset:
head(badwords$V9)
## [1] V8 \\b5h1t\\b \\b5hit\\b \\ba55\\b \\banal\\b \\banus\\b
## 449 Levels: \\b5h1t\\b \\b5hit\\b \\ba_s_s\\b \\ba55\\b ... V8
An example of how the profanity filter works: first the unfiltered (but otherwise cleaned) text, then the filtered version.
#Unfiltered
DataAll$paragraph[203]
## [1] "brazilian butt lift is serious shit minutes and im fucking drained omg"
#Filtered
DataAllClean[203]
## [1] "brazilian XXXX lift is serious XXXX minutes and im XXXX drained omg"
Create a corpus from the cleaned, filtered text and tokenize it into n-grams: unigrams, bigrams, trigrams, and 4-grams.
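For reference, here is one of these four calls (the bigram dfm) as it appears in the appendix, using the older quanteda interface this report was built with; the verbose output of the four actual calls follows.
DataAllCorpus <- corpus(DataAllClean)#build a quanteda corpus from the cleaned, filtered text
TwoNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, verbose = TRUE, concatenator = " ", ngrams = 2)
Note that ignoredFeatures, removePunct, and the ngrams argument belong to the pre-1.0 quanteda API; in recent quanteda releases the same matrix is built by tokenizing first (tokens(), tokens_remove(), tokens_ngrams()) and then calling dfm().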
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 16,851 documents
## ... indexing features: 33,172 feature types
## ... removed 124 features, from 174 supplied (glob) feature types
## ... created a 16851 x 33048 sparse dfm
## ... complete.
## Elapsed time: 1.007 seconds.
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 16,851 documents
## ... indexing features: 172,484 feature types
## ... removed 93,793 features, from 174 supplied (glob) feature types
## ... created a 16851 x 78691 sparse dfm
## ... complete.
## Elapsed time: 5.685 seconds.
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 16,851 documents
## ... indexing features: 249,846 feature types
## ... removed 206,138 features, from 174 supplied (glob) feature types
## ... created a 16851 x 43708 sparse dfm
## ... complete.
## Elapsed time: 5.899 seconds.
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 16,851 documents
## ... indexing features: 254,291 feature types
## ... removed 232,230 features, from 174 supplied (glob) feature types
## ... created a 16851 x 22061 sparse dfm
## ... complete.
## Elapsed time: 6.156 seconds.
## Corpus consisting of 16851 documents, showing 5 documents.
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Text Types Tokens Sentences
## text1 9 11 1
## text2 13 13 1
## text3 5 5 1
## text4 15 16 1
## text5 12 12 1
##
## Source: /Users/levibrackman/courses/Capstone/* on x86_64 by levibrackman
## Created: Sun May 1 15:18:17 2016
## Notes:
The 20 most frequent unigrams in the corpus (after removing English stopwords) and their counts:
## said will just im like get one xxxx new can good love
## 1375 997 994 858 805 788 783 697 690 675 649 602
## day time dont now know u great go
## 600 589 587 543 542 507 496 468
Put the n-grams into data frame format so that they can be graphed. To do this I created a function that puts the document frequencies into a matrix and then sorts the n-grams into a data frame by how frequently they are found in the dataset.
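Roughly, the conversion looks like this (a sketch of the toplotDF function from the appendix; docfreq() gives the number of documents each n-gram appears in):
toplotDF <- function(ngram_dfm) {
  counts <- as.data.frame(as.matrix(docfreq(ngram_dfm)))#document frequency of each n-gram
  sorted <- sort(rowSums(counts), decreasing = TRUE)#most frequent n-grams first
  data.frame(Words = names(sorted), Frequency = sorted)#data frame ready for plotting
}
OneNgramsDF <- toplotDF(OneNgrams)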
These plots show which n-grams are found most frequently in the dataset. There are four plots, representing unigrams, bigrams, trigrams, and 4-grams.
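Each plot is produced the same way; below is a sketch of the bar-chart helper used (plotsall1 in the appendix), which takes the top 30 rows of one of these frequency tables:
plotsall1 <- function(freq_table, title1) {
  top30 <- freq_table[1:30, ]
  top30$Words <- factor(top30$Words, levels = top30$Words)#keep the bars in frequency order
  ggplot(top30, aes(Words, Frequency)) +
    geom_bar(stat = "identity", fill = "orange") +
    ggtitle(title1) +
    theme(axis.text.x = element_text(angle = 90, hjust = .3))
}
plotsall1(OneNgramsDF, "Frequency of Unigrams")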
## Step 6 - Relationships between words in the corpus
In this final step we look to see which words are highly correlated with each other. We use the “collocations” function for this, which reports (among other measures, such as Pearson’s χ^2) the likelihood ratio statistic G^2, computed as 2 * ∑_i ∑_j ( n_{ij} * log( n_{ij} / m_{ij} ) ), where n_{ij} are the observed co-occurrence counts and m_{ij} the expected counts under independence. I then plot the top 30 sets of words (sets of two and sets of three) that are most highly correlated with each other in the dataset.
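As a small illustration of this statistic (not part of the original analysis), here is G^2 computed by hand for a hypothetical 2x2 table of observed bigram counts, with the expected counts taken under independence:
# Hypothetical observed counts: word A followed by word B (or not), over 1000 bigrams
n <- matrix(c(30, 70, 20, 880), nrow = 2, byrow = TRUE)
# Expected counts under independence: row total * column total / grand total
m <- outer(rowSums(n), colSums(n)) / sum(n)
G2 <- 2 * sum(n * log(n / m))#2 * sum_i sum_j n_ij * log(n_ij / m_ij)
G2#a large value indicates a strong association between the two words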
#install.packages("quanteda")
#install.packages("pryr")
#install.packages("devtools")
#devtools::install_github("hadley/lineprof")
#install.packages("R.utils")
#install.packages("qdap")
#library(devtools)
#install_github("espanta/lubripack", force = TRUE)
library(lubripack)
lubripack("knitr", "dplyr", "qdap", "SnowballC", "tm", "R.utils", "devtools", "pryr", "ggplot2", "quanteda", "stringr")
set.seed(123)
setwd("~/courses/Capstone/final/en_US")
blogs <- readLines("en_US.blogs.txt", ok = TRUE)
#File size of the original blogs dataset in bytes
file.size("en_US.blogs.txt")
#Size of original blogs dataset in terms of MB
object_size(blogs)
lengthB <- length(blogs)
#Length in terms of lines of original blogs dataset
lengthB
WordsLines<-function(dataframe, namess, namess2){
Words<-as.data.frame(dataframe)#the input is a character vector, so put it into a data frame
Wc<-wc(Words[,1])#get the word count of each entry (all rows) of the first column
Words1<-as.data.frame(Wc)#put those word counts into a data frame
Words1$Wc<-as.numeric(Words1$Wc)#make sure it is numeric
names(Words1)[1]<-paste("Words")#change the column name to "Words"
Words1<-sum(Words1, na.rm = T)#sum the word counts over the entire column
Lines<-nrow(Words)#find the number of lines (rows) in the entire data frame
final<-cbind(Lines, Words1)#combine the line count and word count into one table
colnames(final) <- c(namess, namess2)#change the column names to fit the particular dataset
final#return the table
}
WordsLines(blogs, "Blogs Lines", "Blogs Words")
BlogsFinal<-as.data.frame(sample(blogs, size = lengthB*.005))
news <- readLines("~/courses/Capstone/final/en_US/en_US.news.txt", ok = TRUE)
#File size of the original news dataset in bytes
file.size("en_US.news.txt")
#Length in terms of lines of original news dataset
lengthN <- length(news)
lengthN
WordsLines(news, "News Lines", "News Words")
NewsFinal<-as.data.frame(sample(news, size = lengthN*.005))
twitter <- readLines("~/courses/Capstone/final/en_US/en_US.twitter.txt", ok = TRUE, skipNul = TRUE)
#File size of the original twitter dataset in bytes
file.size("en_US.twitter.txt")
#Length in terms of lines of original twitter dataset
lengthT <- length(twitter)
lengthT
WordsLines(twitter, "Twitter Lines", "Twitter Words")
TwitterFinal<-as.data.frame(sample(twitter, size = lengthT*.005))
names(TwitterFinal) <- c("paragraph")
names(NewsFinal) <- c("paragraph")
names(BlogsFinal) <- c("paragraph")
TwitterFinal$paragraph<-as.character(TwitterFinal$paragraph)
BlogsFinal$paragraph<-as.character(BlogsFinal$paragraph)
NewsFinal$paragraph<-as.character(NewsFinal$paragraph)
DataAll<-full_join(full_join(TwitterFinal, NewsFinal, by="paragraph"), BlogsFinal, by="paragraph")#full_join joins two data frames at a time, so nest the joins to combine all three samples
class(DataAll$paragraph)
#Replace all ends of sentences (; . ! ?) with "ootoo", then remove all Twitter-type characters such as RT, @, # etc.
DataAll$paragraph <- gsub(pattern=';|\\.|!|\\?', x=DataAll$paragraph, replacement='ootoo')
DataAll$paragraph <- gsub("&", "", DataAll$paragraph)
DataAll$paragraph <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", DataAll$paragraph)
DataAll$paragraph <- gsub("@\\w+", "", DataAll$paragraph)
DataAll$paragraph <- gsub("[[:punct:]]", "", DataAll$paragraph)
DataAll$paragraph <- gsub("[[:digit:]]", "", DataAll$paragraph)
DataAll$paragraph <- gsub("http\\w+", "", DataAll$paragraph)
DataAll$paragraph <- gsub("[ \t]{2,}", "", DataAll$paragraph)
DataAll$paragraph <- gsub("^\\s+|\\s+$", "", DataAll$paragraph)
# remove all non-alpha text (numbers etc)
DataAll$paragraph <- gsub(pattern="[^[:alpha:]]", x=DataAll$paragraph, replacement = ' ')
# force all characters to lower case
DataAll$paragraph <- tolower(DataAll$paragraph)
# remove contiguous spaces
DataAll$paragraph <- gsub(pattern="\\s+", x=DataAll$paragraph, replacement=' ')
#take all "ootoo" and split the strings based on that so that all sentances are seperate.
DataAll$paragraph <- strsplit(x=DataAll$paragraph, split='ootoo',fixed = TRUE)
#Remove other misc characters, including the quotes, commas and "c( )" wrapper left when the split list is coerced back to character.
DataAll$paragraph <- gsub(pattern="\"", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="\\,", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="^[c]", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="^[(]", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="[)]$", x=DataAll$paragraph, replacement='')
DataAll$paragraph <- gsub(pattern="\\s+", x=DataAll$paragraph, replacement=' ')
#See the first ten lines of clean data
head(DataAll$paragraph, 10)
#Read in a list of 450 curse words
badwords <- read.csv("/Users/levibrackman/courses/Capstone/final/en_US/badwords/badwords.csv", header = F)
replaceXXXX <- read.csv("/Users/levibrackman/courses/Capstone/final/en_US/badwords/XXXX.csv", header = F)
dim(badwords)
dim(replaceXXXX)
#Function that loops over each profanity pattern and replaces any match with the corresponding replacement string
ProfanityRemover <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern))
x <- gsub(pattern[i], replacement[i], x, ...)#replace matches of pattern i with replacement i
x#return the filtered text
}
#run the function removing all 450 profanities
DataAllClean <- ProfanityRemover(badwords$V9, replaceXXXX$V1, DataAll$paragraph)
head(badwords$V9)
#Unfiltered
DataAll$paragraph[203]
#Filtered
DataAllClean[203]
#install.packages("stringr", dependencies = TRUE)
DataAllCorpus <- corpus(DataAllClean)
OneNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, ngrams = 1)
#Save to rdata file
#save(OneNgrams, file = "OneNgrams.RData") #Save to rdata file
TwoNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, verbose = TRUE, concatenator = " ", ngrams = 2)
#Save to rdata file
#save(TwoNgrams, file = "TwoNgrams.RData")
ThreeNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, verbose = TRUE, concatenator = " ", ngrams = 3)
#Save to rdata file
#save(TwoNgrams, file = "TwoNgrams.RData")
FourNgrams <- dfm(DataAllCorpus, ignoredFeatures = stopwords("english"), removeNumbers = TRUE, removePunct = TRUE, verbose = TRUE, concatenator = " ", ngrams = 4)
#Save to rdata file
#save(FourNgrams, file = "FourNgrams.RData")
summary(DataAllCorpus, n = 5)
topfeatures(OneNgrams, 20)
toplotDF<- function(corp) {
dataframes <- as.data.frame(as.matrix(docfreq(corp)))
DFsorted <- sort(rowSums(dataframes), decreasing=TRUE)
FreqTable <- data.frame(Words=names(DFsorted), Frequency = DFsorted)
}
OneNgramsDF <- toplotDF(OneNgrams)
TwoNgramsDF <- toplotDF(TwoNgrams)
ThreeNgramsDF <- toplotDF(ThreeNgrams)
FourNgramsDF <- toplotDF(FourNgrams)
plotsall1<-function(plots, title1) {
P <- ggplot(within(plots[1:30, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
P <- P + geom_bar(stat="identity", fill="orange") + ggtitle(title1)
P <- P + theme(axis.text.x=element_text(angle=90, hjust=.3))
P
}
plotsall1(OneNgramsDF, "Frequency of Unigram")
plotsall1(TwoNgramsDF, "Frequency of Bigrams")
plotsall1(ThreeNgramsDF, "Frequency of Trigrams")
plotsall1(FourNgramsDF, "Frequency of 4-Ngrams")
twowords<-collocations(DataAllCorpus, size = 2, method = "all")
twowords$Words <- paste(twowords$word1, twowords$word2)
threewords<-collocations(DataAllCorpus, size = 3)
threewords$Words <- paste(threewords$word1, threewords$word2, threewords$word3)
plotsall2<-function(plots, title1) {
P <- ggplot(within(plots[1:30, ], Words <- factor(Words, levels=Words)), aes(Words, G2))
P <- P + geom_bar(stat="identity", fill="blue") + ggtitle(title1)
P <- P + theme(axis.text.x=element_text(angle=90, hjust=1))
P
}
plotsall2(twowords, "30 Most Correlated Two Words")
plotsall2(threewords, "30 Most Correlated Three Words")