Name: SHILPA SWETH
Date: 9/4/2016
The goal of this project is just to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
library(NLP)
library(tm)
library(knitr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(RWeka)
After downloading the zip file and extracting it, the final/en_US folder is set as the working directory.
bloglines<-readLines("en_US.blogs.txt",encoding="UTF-8",skipNul=TRUE)
newslines<-readLines("en_US.news.txt",encoding="UTF-8",skipNul=TRUE)
twitterlines<-readLines("en_US.twitter.txt",encoding="UTF-8",skipNul=TRUE)
Summary of text files
summary(bloglines)
## Length Class Mode
## 899289 character character
summary(newslines)
## Length Class Mode
## 77259 character character
summary(twitterlines)
## Length Class Mode
## 2360149 character character
Samples from the text files
head(bloglines)
## [1] "<U+FEFF>In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
tail(bloglines)
## [1] "The hulking mass of unfinished brick and concrete at 20-13 35th St. is so unsightly, it became a poster child for zoning reform."
## [2] "The 2004 IIFA award ceremony witnessed a contingent of over 450 stars, celebrities, cricketers, industrialists and government leaders over the festive weekend."
## [3] "Plus, I have also been allowing myself not to get <U+0091>stressed<U+0092> over things that have not been done! If the ironing is not done right now, it<U+0092>s not the end of the world! If that phone call is made tomorrow rather than today, then that<U+0092>s OK too! Living in the moment and allowing myself the time to get <U+0091>back to feeling great<U+0092>!"
## [4] "(5) What's the barrier to entry and why is the business sustainable?"
## [5] "In response to an over-whelming number of comments we sat down and created a list of do (s) and don<U+0092>t (s) <U+0096> these recommendations are easy to follow and except for - adding some herbs to your rinse . So let<U+0092>s get begin<U+0085>"
## [6] ""
head(newslines)
## [1] "<U+FEFF>He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
tail(newslines)
## [1] "Rosso Gelato"
## [2] "Former White House candidate Newt Gingrich says Mitt Romney has \"earned the right to represent the Republican Party\" against President Barack Obama and that he'll help Romney's campaign in any way he can."
## [3] "President Obama, with a belated embrace of his commission's recommendation to cut $4 trillion in deficits over the next 12 years, on Wednesday laid out an aggressively centrist re-election platform that confronts Republicans on taxes and his own liberal base on spending in an effort to win back the independents he lost in last year's election."
## [4] "\"Given that there are no further tenders scheduled and the recent ECB rhetoric has been skewed toward this being the last, we believe the market will return to fundamentals and move away from this liquidity euphoria,\" analysts at Royal Bank of Scotland wrote in a research note."
## [5] "Ripley's strong performance is complemented by Asa Somers as Diana's patient, or perhaps depressed, husband, and by Emma Hunton as their neglected teenage daughter, Natalie. A talented musician and student, Natalie is the most sympathetic character in \"Next to Normal,\" a girl trying to raise herself while her father cares for her mother and her mother is out of commission."
## [6] "At the Spice Merchant, a 2-ounce bag of tea leaves capable of producing 1"
head(twitterlines)
## [1] "<U+FEFF>How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
tail(twitterlines)
## [1] "what's good. I see the success you got poppin in yo area."
## [2] "RT : Consumers are visual. They want data at their finger tips. Mobile is the only way to deliver this, 24/7."
## [3] "u welcome"
## [4] "It is #RHONJ time!!"
## [5] "The key to keeping your woman happy= attention, affection, treat her like a queen and sex her like a pornstar!"
## [6] ""
Number of characters in the text files
sum(nchar(bloglines))
## [1] 206824506
sum(nchar(newslines))
## [1] 15639409
sum(nchar(twitterlines))
## [1] 162096249
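The per-file figures above can be gathered into a single table, which is easier for a non-technical reader to scan. The following is a minimal sketch in base R. `summarise_lines` is a hypothetical helper that is not part of the original analysis, the word count is only a rough whitespace-based approximation, and the two short vectors stand in for the real files.

```r
# Hypothetical helper: one-row summary of a character vector of lines.
summarise_lines <- function(lines, name) {
  chars <- nchar(lines)
  # Rough word count: split on runs of whitespace and drop empty tokens.
  words <- sum(vapply(strsplit(trimws(lines), "\\s+"),
                      function(tok) sum(nzchar(tok)), integer(1)))
  data.frame(source = name,
             lines = length(lines),
             characters = sum(chars),
             words = words,
             longest_line = max(chars),
             stringsAsFactors = FALSE)
}

# Toy stand-ins for bloglines and twitterlines (assumed data, for illustration only).
toy_blogs   <- c("We love you Mr. Brown.",
                 "If you have an alternative argument, let's hear it! :)")
toy_twitter <- c("u welcome", "It is #RHONJ time!!")

summary_table <- rbind(summarise_lines(toy_blogs, "blogs"),
                       summarise_lines(toy_twitter, "twitter"))
print(summary_table)
```

With the real data, the same helper could be called once per file (for example `summarise_lines(bloglines, "blogs")`) and the result passed to `knitr::kable()` for display.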
Combining a random 0.1% sample of each of the three sources into one corpus.
bulk <- c(sample(bloglines, length(bloglines) * 0.001),
sample(newslines, length(newslines) * 0.001),
sample(twitterlines, length(twitterlines) * 0.001))
# iconv() is vectorized, so sapply() is not needed; sub="" drops characters that cannot be represented in ASCII.
bulk <- iconv(bulk, "latin1", "ASCII", sub="")
USlines<- VCorpus(VectorSource(bulk))
rm(bloglines)
rm(newslines)
rm(twitterlines)
rm(bulk)
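Because `sample()` draws at random, each run of the code above produces a different corpus and therefore different n-gram counts. A sketch of a reproducible variant follows. The seed value, the helper name `sample_fraction`, and the toy vector are assumptions made for illustration only.

```r
# Hypothetical helper: draw a fixed fraction of lines, but at least one line.
sample_fraction <- function(lines, fraction) {
  n <- max(1L, floor(length(lines) * fraction))
  sample(lines, n)
}

set.seed(1234)  # assumed seed; any fixed value makes the draw repeatable
toy_lines <- paste("line", 1:5000)
first_draw <- sample_fraction(toy_lines, 0.001)

set.seed(1234)  # resetting the seed reproduces the same draw
second_draw <- sample_fraction(toy_lines, 0.001)

identical(first_draw, second_draw)  # TRUE: same seed, same sample
length(first_draw)                  # 0.1% of the 5000 toy lines
```

Fixing the seed once, before the first call to `sample()`, would be enough to make the whole report repeatable from run to run.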
Cleaning the corpus: removing punctuation, numbers, and stop words, and converting to lower case
USlines <- tm_map(USlines, removePunctuation)
USlines <- tm_map(USlines, removeNumbers)
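# tolower is a base R function, not a tm transformation, so it is wrapped in content_transformer()
# to keep each document a PlainTextDocument; no conversion back is needed afterwards.
USlines <- tm_map(USlines, content_transformer(tolower))
USlines <- tm_map(USlines, removeWords, stopwords("english"))
USlines <- tm_map(USlines, stripWhitespace)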
Creating Unigram frequency
OnegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm.onegram = TermDocumentMatrix(USlines,control = list(tokenize = OnegramTokenizer))
freq = sort(rowSums(as.matrix(tdm.onegram)),decreasing = TRUE)
wf = data.frame(word=names(freq), freq=freq)
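Note that `as.matrix()` converts the sparse term-document matrix into a dense one, which can exhaust memory once the sample grows beyond 0.1% of the data. As a cross-check that depends on neither RWeka nor tm, n-gram counts can also be computed directly in base R. The sketch below is an illustrative alternative, not the tokenizer used in this report; `count_ngrams` and the toy sentences are hypothetical.

```r
# Hypothetical helper: count n-grams in already-cleaned text, one document per element.
count_ngrams <- function(docs, n) {
  grams <- unlist(lapply(docs, function(doc) {
    tokens <- strsplit(trimws(doc), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    # Paste each window of n consecutive tokens into one string.
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(word = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(word = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Toy documents (assumed data, for illustration only).
toy_docs <- c("go cubs go", "go cubs go cubs")
unigrams <- count_ngrams(toy_docs, 1)
bigrams  <- count_ngrams(toy_docs, 2)
trigrams <- count_ngrams(toy_docs, 3)
print(bigrams)
```

The result has the same `word` and `freq` columns as `wf`, so it could be passed to the plotting code unchanged.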
Unigram Bar Chart and Wordcloud
p <- ggplot(subset(wf, freq>50), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
wordcloud(names(freq), freq, min.freq=50, scale=c(5, .1), colors=brewer.pal(6, "Set1"))
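In the bar chart above, ggplot2 arranges the words alphabetically along the x-axis, so the most frequent terms are hard to pick out. One option is to keep only the top few terms and order the factor levels by frequency before plotting. The sketch below assumes a data frame with the same `word` and `freq` columns as `wf`; `top_terms` and the toy counts are hypothetical.

```r
# Hypothetical helper: keep the n most frequent terms, ordered for plotting.
top_terms <- function(wf, n = 20) {
  wf <- wf[order(-wf$freq), , drop = FALSE]
  wf <- head(wf, n)
  # Factor levels in descending frequency make ggplot2 draw the bars in that order.
  words <- as.character(wf$word)
  wf$word <- factor(words, levels = words)
  wf
}

# Toy counts (assumed values, for illustration only).
toy_wf <- data.frame(word = c("time", "just", "like", "one"),
                     freq = c(40, 75, 60, 90),
                     stringsAsFactors = FALSE)
top3 <- top_terms(toy_wf, n = 3)
levels(top3$word)  # "one" "just" "like"

# Draw the chart only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_sorted <- ggplot2::ggplot(top3, ggplot2::aes(word, freq)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
  print(p_sorted)
}
```

The same helper could be applied to the bigram and trigram data frames in place of the fixed `freq` thresholds.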
Creating Bigram frequency
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram = TermDocumentMatrix(USlines,control = list(tokenize = BigramTokenizer))
freq = sort(rowSums(as.matrix(tdm.bigram)),decreasing = TRUE)
wf = data.frame(word=names(freq), freq=freq)
Bigram Bar Chart and Wordcloud
p <- ggplot(subset(wf, freq>5), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
wordcloud(names(freq), freq, min.freq=5, scale=c(4, .5), colors=brewer.pal(6, "Set1"))
Creating Trigram frequency
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm.trigram = TermDocumentMatrix(USlines,control = list(tokenize = TrigramTokenizer))
freq = sort(rowSums(as.matrix(tdm.trigram)),decreasing = TRUE)
wf = data.frame(word=names(freq), freq=freq)
Trigram Bar Chart and Wordcloud
p <- ggplot(subset(wf, freq>2), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
wordcloud(names(freq), freq, min.freq=2, scale=c(1.25, 1), colors=brewer.pal(6, "Set1"),max.words = 50)