Data Science Milestone Report

The data used in this project comes from the website: https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html. The site contains data from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language. There are files for separate sources such as newspapers, magazines, (personal and professional) blogs and Twitter updates.

In my exploratory analysis of the Twitter data, I found patterns that can be explored further; in particular, many words are reused throughout the data. My goal for the eventual app and algorithm is to predict the next words of a tweet from the preceding words.
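To make the prediction goal concrete, here is a minimal sketch of one possible approach: counting word pairs (bigrams) and suggesting the words that most often follow a given word. This is an illustration in base R under stated assumptions, not the algorithm the final app will necessarily use; the toy `tweets` vector and the `predict_next` helper are hypothetical.

```r
# Toy corpus standing in for the tweets (hypothetical example text)
tweets <- c("thanks for the follow",
            "thanks for the rt",
            "thanks for sharing")

# Lowercase each tweet and split it into words
tokens <- strsplit(tolower(tweets), "[^a-z']+")

# Build (previous word, next word) pairs within each tweet
pairs <- do.call(rbind, lapply(tokens, function(w) {
  w <- w[w != ""]
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Count how often each pair occurs and keep only the pairs that were seen
bigram_counts <- as.data.frame(table(prev = pairs$prev, nxt = pairs$nxt),
                               stringsAsFactors = FALSE)
bigram_counts <- bigram_counts[bigram_counts$Freq > 0, ]

# Return up to n of the most frequent followers of a word (hypothetical helper)
predict_next <- function(word, n = 3) {
  hits <- bigram_counts[bigram_counts$prev == tolower(word), ]
  hits <- hits[order(-hits$Freq, hits$nxt), ]
  head(hits$nxt, n)
}

predict_next("for")
```

A word that never appears as a first element of a pair returns an empty character vector, so a real app would presumably need a fallback such as the overall most frequent words.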

Twitter Data

#Load stringi for stri_length and stri_stats_general
library(stringi)
#Reading in the file
Twitter <- readLines(con <- file("en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("en_US.twitter.txt",open="r")
#Read 5 lines
readLines(con,5)

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"

close(con)
longTwitter<-stri_length(Twitter)
#Max Length
maxTwitter <- max(longTwitter)
maxTwitter

## [1] 140

#Number of Lines, Number of lines with at least one non-white char, total characters, characters that are not white
stri_stats_general(Twitter)

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806

Blogs Data

#Reading in the file
Blogs <- readLines(con <- file("en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("en_US.blogs.txt",open="r")
#Read 5 lines
readLines(con,5)

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"

close(con)
longBlogs<-stri_length(Blogs)
#Max Length
maxBlog <- max(longBlogs)
maxBlog

## [1] 40833

#Number of Lines, Number of lines with at least one non-white char, total characters, characters that are not white
stri_stats_general(Blogs)

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

News Data

#Reading in the file
News <- readLines(con <- file("en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("en_US.news.txt",open="r")
#Read 5 lines
readLines(con,5)

## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"

close(con)
#Length of each line in characters; note that stri_length(length(News)) would
#instead count the digits in the number of lines
longNews<-stri_length(News)
#Max Length
maxNews <- max(longNews)
maxNews

#Number of Lines, Number of lines with at least one non-white char, total characters, characters that are not white
stri_stats_general(News)

##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698
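The Twitter, Blogs, and News sections repeat the same read-and-summarize steps, so they could be wrapped in one function. The sketch below is one way to do that in base R, using `nchar` in place of the stringi calls; `summarize_file` is a hypothetical helper, not part of the original analysis, and it is demonstrated on a small temporary file rather than the real data.

```r
# Summarize one text file: line count, longest line, total characters
# (summarize_file is a hypothetical helper, not part of the original analysis)
summarize_file <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))  # always release the connection, even on error
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  n_char <- nchar(lines, type = "chars", allowNA = TRUE)
  data.frame(file = basename(path),
             lines = length(lines),
             max_chars = max(n_char, na.rm = TRUE),
             total_chars = sum(n_char, na.rm = TRUE))
}

# Demonstration on a small temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "a somewhat longer second line", "3rd"), demo_path)
demo_summary <- summarize_file(demo_path)
demo_summary
```

Assuming the three data files are in the working directory, something like `do.call(rbind, lapply(c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"), summarize_file))` would give one comparison table for all sources.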

Additional Analysis

#Load the packages used in this section
library(stringr)
library(tm)
library(data.table)
library(wordcloud)

set.seed(1234)
#Draw a random sample of 50,000 tweets
con <- file("en_US.twitter.txt")
sampleTwitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
sampleTwitter <- sample(sampleTwitter, 50000, replace = FALSE, prob = NULL)
#get rid of bad characters
sampleTwitter <- iconv(sampleTwitter, "latin1", "ASCII", sub="")
sampleTwitter<-str_replace_all(sampleTwitter,"[^[:graph:]]", " ")
#Make Corpus
docTwitter <- Corpus(VectorSource(sampleTwitter))
#change to lower case, remove punctuation, strip whitespace,
#remove numbers, remove common english words
docTwitter <- tm_map(docTwitter, content_transformer(tolower))
docTwitter <- tm_map(docTwitter,removePunctuation)
docTwitter <- tm_map(docTwitter,stripWhitespace)
docTwitter <- tm_map(docTwitter,removeNumbers)
docTwitter <- tm_map(docTwitter, removeWords, stopwords("english"))

#remove profanity
profanityfilter <- fread("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", header = FALSE)
names(profanityfilter)<-"ProfanityFilter"
docTwitter <- tm_map(docTwitter, removeWords, profanityfilter$ProfanityFilter)

#Count how often each term occurs across the sample
tdmTwitter <- TermDocumentMatrix(docTwitter)
mTwitter <- as.matrix(tdmTwitter)
x <- sort(rowSums(mTwitter),decreasing=TRUE)
dfTwitter <- data.frame(word = names(x),freq=x)
#Plot the 20 most frequent words
topTwitter <- head(dfTwitter, 20)
barplot(topTwitter$freq, las = 2, names.arg = topTwitter$word,
        col ="darkgreen", main ="Top Twitter Words",
        ylab = "Frequency")

topTwitter300 <- head(dfTwitter, 300)

This is a word cloud of the 300 most common words in the Twitter sample.

wordcloud(words = topTwitter300$word, freq = topTwitter300$freq, min.freq = 1,
          max.words=300, random.order=FALSE, rot.per=0.15, 
          colors=brewer.pal(8, "Dark2"))

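The 20 words at the bottom of the frequency table each occur only once in the sample. Some of these least common words appear to be misspelled or run together (for example, zumbaaaa and zombieoutbreak).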

leastTwitter <- tail(dfTwitter, 20)
leastTwitter

##                          word freq
## zombieoutbreak zombieoutbreak    1
## zonamaco             zonamaco    1
## zoneflex             zoneflex    1
## zoneone               zoneone    1
## zoom                     zoom    1
## zoomed                 zoomed    1
## zooming               zooming    1
## zoomshift           zoomshift    1
## zorro                   zorro    1
## zoubida               zoubida    1
## zouorleans         zouorleans    1
## zro                       zro    1
## zuccotti             zuccotti    1
## zuckerberg         zuckerberg    1
## zuckerman           zuckerman    1
## zumbaaaa             zumbaaaa    1
## zusis                   zusis    1
## zuzu                     zuzu    1
## zygotes               zygotes    1
## zynga                   zynga    1
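One way to quantify this long tail of rare words is to measure what share of distinct words occur only once, and how few of the most frequent words are needed to cover a given share of all word occurrences. The sketch below uses made-up counts in place of `dfTwitter$freq`; the `freq` vector and the `words_needed` helper are hypothetical and only illustrate the calculation.

```r
# Hypothetical word counts; in the report this would be dfTwitter$freq
freq <- c(the = 50, you = 30, love = 10, zoom = 1, zorro = 1, zynga = 1)

# Share of distinct words that occur exactly once
share_singletons <- mean(freq == 1)

# Smallest number of top words whose counts reach a target share
# of all word occurrences (hypothetical helper)
words_needed <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  unname(which(coverage >= target)[1])
}

share_singletons
words_needed(freq, 0.5)
words_needed(freq, 0.9)
```

If a small number of words covers most occurrences in the real data, the prediction model could likely drop the rarest words with little loss, though that would need to be checked against the actual counts.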

Kristy Wedel

August 9, 2017