This document presents my initial findings on a set of textual data from three sources: Twitter, blogs and news. As the purpose of the project is to create a text prediction tool, my exploratory analysis looks at the characteristics of each data set and at the frequencies of single words, word pairs (2-grams) and word trios (3-grams). I also take a look at coverage and at an appropriate sample size for speed and reliability.
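The analysis below relies on several R packages; a setup chunk along these lines is assumed (tm and ngram for the text handling and word counts, data.table, dplyr and ggplot2 for the summaries and plots):
library(tm)          # corpora and term-document matrices
library(ngram)       # wordcount()
library(data.table)  # summary table
library(dplyr)       # arrange()
library(ggplot2)     # frequency plot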
Data Set Characteristics
What does the data look like?
file.twitter<-file(paste(getwd(),"/final/en_US/en_US.twitter.txt",sep=""))
data.twitter<-readLines(file.twitter)
## Warning in readLines(file.twitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(file.twitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(file.twitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(file.twitter): line 1759032 appears to contain an
## embedded nul
head(data.twitter,10)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [7] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [8] "I'm coo... Jus at work hella tired r u ever in cali"
## [9] "The new sundrop commercial ...hehe love at first sight"
## [10] "we need to reconnect THIS WEEK"
file.news<-file(paste(getwd(),"/final/en_US/en_US.news.txt",sep=""))
data.news<-readLines(file.news)
## Warning in readLines(file.news): incomplete final line found on 'C:/Users/
## LastMile/Documents/1_oth/capstone/final/en_US/en_US.news.txt'
head(data.news,10)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
## [7] "14915 Charlevoix, Detroit"
## [8] "\"Itâ<U+0080><U+0099>s just another in a long line of failed attempts to subsidize Atlantic City,\" said Americans for Prosperity New Jersey Director Steve Lonegan, a conservative who lost to Christie in the 2009 GOP primary. \"The Revel Casino hit the jackpot here at government expense.\""
## [9] "But time and again in the report, Sullivan called on CPS to correct problems to improve employee accountability, saying, for example, that measures to keep employees from submitting fraudulent invoices or to block employees from accessing inappropriate websites were not in place."
## [10] "Â<U+0093>I was just trying to hit it hard someplace,Â<U+0094> said Rizzo, who hit the pitch to the opposite field in left-center. Â<U+0093>IÂ<U+0092>m just up there trying to make good contact.Â<U+0094>"
file.blogs<-file(paste(getwd(),"/final/en_US/en_US.blogs.txt",sep=""))
data.blogs<-readLines(file.blogs)
head(data.blogs,10)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
## [7] "If I were a bear,"
## [8] "Other friends have similar stories, of how they were treated brusquely by Laurelwood staff, and as often as not, the same names keep coming up. About a half-dozen friends of mine refuse to step foot in there ever again because of it. How many others theyâ<U+0080><U+0099>re telling - and keeping away - one can only guess."
## [9] "Although our beloved Cantab canâ<U+0080><U+0099>t claim the international recognition afforded the Station Inn, otherwise these two joints feel like twins separated by nothing more than distance. They share a complete lack of pretense that canâ<U+0080><U+0099>t be imitated or approximated. Their very ordinariness makes them special."
## [10] "Peter Schiff: Hard to tell. It will look pretty bad for most Americans when prices will go way up and they canâ<U+0080><U+0099>t afford to buy stuff. It could also get very bad as far as loss of individual liberty. A lot of people will blame it on capitalism, on freedom, and they will claim we need more government. It could be used as an impetus for more regulation, which would be a disaster, or it could be an impetus to get rid of all the regulation that was causing the problem. But whether we will do the right or the wrong thing here in America, there will be a lot of pain first. We got some serious problems we have to deal with, but we are not dealing with the problems, we only make the problems worse."
The Twitter data contains more variant forms of words (slang, misspellings, inconsistent upper/lowercase combinations) than the news and blog sources, which may make it harder to predict from.
How large are the datasets?
# combine the three sources and compute simple size statistics
data<-list(data.twitter,data.news,data.blogs)
data.size<-sapply(data,object.size)   # size in memory, in bytes
data.rows<-sapply(data,length)        # number of lines (tweets / articles / posts)
data.words<-sapply(data,wordcount)    # total words per source (ngram::wordcount)
statistics.simple<-data.table(data.size,data.rows,data.words,data.words/data.rows)
statistics.simple$source<-c("twitter","news","blogs")
colnames(statistics.simple)<-c("size","rows","word count","avg words","source")
statistics.simple
##         size    rows word count avg words  source
## 1: 316037344 2360148   30373543  12.86934 twitter
## 2:  20111392   77259    2643969  34.22215    news
## 3: 260564320  899288   37334131  41.51521   blogs
Interestingly, the Twitter data source is the largest in memory and has the most rows. However, the blog entries are the longest, averaging about 41 words per entry.
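For readability, the byte counts in the size column can also be expressed in megabytes, for example:
# report object sizes in MB instead of raw bytes
sapply(data,function(x) format(object.size(x),units="MB"))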
Word Frequencies
Looking across all the collected data, we want to analyze the most frequently used words. I will start by taking a 1% sample of the rows in each set and run the analysis on the combined sample.
We also want to strip extra whitespace and punctuation and convert everything to lowercase so that different capitalizations are counted as the same word.
# take samples
set.seed(124)
sample.twitter<-sample(data.twitter,as.integer(data.rows[1]*0.01))
sample.news<-sample(data.news,as.integer(data.rows[2]*0.01))
sample.blogs<-sample(data.blogs,as.integer(data.rows[3]*0.01))
sample.str<-paste(c(sample.twitter,sample.news,sample.blogs), collapse=" ") # combine all sampled lines into a single string for use with tm
sample.corpus<-Corpus(VectorSource(sample.str))
# clean the data
sample.corpus <- tm_map(sample.corpus, tolower)
sample.corpus <- tm_map(sample.corpus, removePunctuation)
sample.corpus <- tm_map(sample.corpus, PlainTextDocument)
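With recent versions of tm, transformations such as tolower are usually wrapped in content_transformer() so the corpus structure is preserved, which makes the PlainTextDocument step unnecessary; an equivalent cleaning chunk might look like this (stripWhitespace is added to match the whitespace removal mentioned above):
# alternative cleaning using content_transformer() (newer tm API)
sample.corpus<-tm_map(sample.corpus,content_transformer(tolower))
sample.corpus<-tm_map(sample.corpus,removePunctuation)
sample.corpus<-tm_map(sample.corpus,stripWhitespace)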
Once the data has been cleaned, it is useful to look at the frequencies of words, bigrams and trigrams across all data sources. This tells us the most common words and phrases for our eventual text prediction engine.
# figure out frequencies
words.tdm<-TermDocumentMatrix(sample.corpus)
# the corpus is a single document, so the non-zero values in $v line up with the terms
words.freq<-data.frame(word=words.tdm$dimnames$Terms,frequency=words.tdm$v)
words.freq<-arrange(words.freq,-frequency)
words.top25<-words.freq[1:25,]
# reorder() sorts the bars by frequency instead of alphabetically
words.graph<-ggplot(data=words.top25)+
  geom_bar(aes(x=reorder(word,frequency),y=frequency),stat='identity')+
  coord_flip()+
  xlab("word")
words.graph
To figure out how many words we need to cover 50% and 90% of all word occurrences, we first look at the number of unique words in the sample. Because the corpus is a single document, every entry of the j vector in the term-document matrix equals 1, so sum(words.tdm$j) returns the number of unique terms: 49,309. Since words.freq is arranged in decreasing order of frequency, we can take a cumulative sum of the frequencies and look up the coverage at the 50% and 90% positions of the ranked vocabulary.
sum(words.tdm$j)
## [1] 49309
c1<-as.integer(sum(words.tdm$j)*0.5)  # position halfway through the ranked vocabulary
c2<-as.integer(sum(words.tdm$j)*0.9)  # position 90% of the way through the ranked vocabulary
c1
## [1] 24654
c2
## [1] 44378
words.freq$cumsum[c1]
## NULL
words.freq$cumsum[c2]
## NULL
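The NULL results above show that the cumsum column was never actually added to words.freq. The intended calculation is to take the cumulative sum of the sorted frequencies and find how many of the most frequent words are needed to reach 50% and 90% of all word occurrences; a sketch (not run here) might be:
# cumulative coverage: how many top-ranked words account for 50% / 90% of all occurrences
words.freq$cumsum<-cumsum(words.freq$frequency)
total.words<-sum(words.freq$frequency)
coverage.50<-which(words.freq$cumsum>=0.5*total.words)[1]
coverage.90<-which(words.freq$cumsum>=0.9*total.words)[1]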
This gives us an initial exploration of the data. I was not able to make progress on bigrams and trigrams in this analysis because the NGramTokenizer kept crashing; I am looking into a solution for the final project (a possible base-R fallback is sketched below).
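As one possible workaround for the NGramTokenizer crashes, bigrams and trigrams can be built in base R by splitting the cleaned text into words and pasting adjacent words together. The ngram.tokenize name below is illustrative, not part of any package:
# simple base-R n-gram tokenizer as a fallback for RWeka's NGramTokenizer
ngram.tokenize<-function(text,n=2){
  words<-unlist(strsplit(text,"\\s+"))
  words<-words[words!=""]
  if(length(words)<n) return(character(0))
  sapply(seq_len(length(words)-n+1),function(i) paste(words[i:(i+n-1)],collapse=" "))
}

# example: top 10 bigrams and trigrams from the cleaned sample string
bigrams<-ngram.tokenize(tolower(sample.str),n=2)
trigrams<-ngram.tokenize(tolower(sample.str),n=3)
head(sort(table(bigrams),decreasing=TRUE),10)
head(sort(table(trigrams),decreasing=TRUE),10)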