Project 1 - Jose Luis Lopez Guevara

Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

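Using the “stri_stats_general” and “stri_stats_latex” functions from the stringi package, we can get basic summaries of each file. The three output blocks below correspond to the blogs, news, and Twitter files, in that order.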

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   208361438   171926076
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     163325412             9      43302825      37865888             3 
##        Envirs 
##             0
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15683765    13117038
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##      12502954             0       3114374       2665742             0 
##        Envirs 
##             0
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162384825   134370864
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     125769312          3033      36047904      30578891           963 
##        Envirs 
##             0

From the data above we can see the number of lines, characters, and words in each file. The Twitter file has the most lines (2360148) and the news file has the fewest (77259). Regardless of the number of lines, the blogs file has more words than the others (37865888).
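As a minimal sketch of how such summaries are obtained, the two stringi functions can be called on any character vector. The “toy” vector here is a made-up stand-in for the lines of a file, since the report does not show how the real files were read.

```r
library(stringi)

# Toy stand-in for the lines of one file (hypothetical data)
toy <- c("The quick brown fox.", "Jumps over the lazy dog!")

# Line and character counts
general <- stri_stats_general(toy)

# Word counts and related statistics
latex <- stri_stats_latex(toy)

general["Lines"]
latex["Words"]
```

Both functions return named integer vectors, so individual figures such as the line count can be picked out by name.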

With this done, we want to produce some charts. To handle the data, we first need to convert the character data to a corpus object using the “tm” package. Because the data is too large to process in full, we use a random sample of 3% of the lines from each file.
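To give a sense of scale, the following sketch works out how many lines a 3% sample contains for each file, using the line counts reported in the summaries. The variable names are chosen here for illustration.

```r
# Line counts taken from the stri_stats_general output in this report
line_counts <- c(blogs = 899288, news = 77259, twitter = 2360148)

# Number of lines in a 3% sample, rounded down to a whole line
sample_sizes <- floor(line_counts * 0.03)
sample_sizes
```

This gives roughly 27 thousand blog lines, 2.3 thousand news lines, and 71 thousand Twitter lines.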

## Loading required package: NLP

library(tm)

# Take a random 3% sample of the blog lines
blogsSample <- sample(blogs, floor(length(blogs) * 0.03))
blogsCorpus <- VCorpus(VectorSource(blogsSample))
blogsCorpus <- tm_map(blogsCorpus, stripWhitespace)
# content_transformer() keeps each document a corpus document after tolower,
# so no conversion back with PlainTextDocument is needed
blogsCorpus <- tm_map(blogsCorpus, content_transformer(tolower))
blogsCorpus <- tm_map(blogsCorpus, removePunctuation)
blogsCorpus <- tm_map(blogsCorpus, removeNumbers)
blogsCorpus <- tm_map(blogsCorpus, removeWords, stopwords("english"))

In the lines above we performed some basic cleaning operations. Through the “tm_map” function from the “tm” package, we apply a series of functions to our corpus object “blogsCorpus”, which is built from “blogsSample”, the sample of the original data. “removePunctuation” and “removeNumbers” do what their names say, “stripWhitespace” removes unnecessary spaces, and “tolower” changes the words to lower case. Finally, we remove stop words such as “in”, “the”, and “and”, because these words appear very often and would bias our analysis if we kept them.

We do the same for all three files, which gives the corpus objects “NewsCorpus” and “twitterCorpus” used below.
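Since the same steps are repeated for each file, they could be wrapped in one function. The sketch below is one possible way to do that. The name “make_clean_corpus” and its “fraction” argument are hypothetical. It uses “content_transformer(tolower)”, the wrapper tm provides for plain character functions, and it strips whitespace last so that gaps left by removed words are collapsed too. The input is toy data.

```r
library(tm)

# Hypothetical helper: sample a fraction of the lines and clean them
make_clean_corpus <- function(lines, fraction = 0.03) {
  sampled <- sample(lines, floor(length(lines) * fraction))
  corpus <- VCorpus(VectorSource(sampled))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)
}

# Toy input: 100 made-up lines, sampled at 10%
set.seed(1)
toy_lines <- rep(c("The 3 Cats sat, quietly!", "Dogs and 2 birds."), 50)
toy_corpus <- make_clean_corpus(toy_lines, fraction = 0.10)

length(toy_corpus)
content(toy_corpus[[1]])
```

With the real data, the same function would be called once per file with the default fraction of 0.03.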

Now that the data is clean, we can draw a chart that shows the most frequent words. To do this we use the wordcloud package.

## Loading required package: RColorBrewer
wordcloud(blogsCorpus, max.words = 75, random.order = FALSE, rot.per = .33, colors = colorRampPalette(brewer.pal(9, "Greens"))(33), scale = c(3, .3))

wordcloud(NewsCorpus, max.words = 75, random.order = FALSE, rot.per = .33, colors = colorRampPalette(brewer.pal(9, "Reds"))(33), scale = c(3, .3))

wordcloud(twitterCorpus, max.words = 75, random.order = FALSE, rot.per = .33, colors = colorRampPalette(brewer.pal(9, "Purples"))(33), scale = c(3, .3))

We can see that the most frequent word in the blogs file is “one”, in the news file “said”, and in the Twitter file “just”.
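The “colors” argument in the wordcloud calls above is the least obvious part, so here is a small sketch of what it builds. “brewer.pal” returns the 9 base colors of a palette, and “colorRampPalette” returns a function that interpolates between them. The variable names are chosen here for illustration.

```r
library(RColorBrewer)

# The 9 base colors of the "Greens" palette
base_colors <- brewer.pal(9, "Greens")

# colorRampPalette() returns a function; calling it with 33 gives 33 shades
ramp <- colorRampPalette(base_colors)
palette33 <- ramp(33)

length(base_colors)
length(palette33)
```

The word cloud then draws the words in shades from this 33-step palette according to their frequency.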

Has the data scientist made basic plots, such as histograms to illustrate features of the data?

Next we would like to see the same information as a bar chart. Since the data is a corpus object, we first convert it to a TermDocumentMatrix. The “findFreqTerms” function then lets us specify a minimum and a maximum frequency. Here we keep terms that appear at least 1000 times for the blogs file, at least 100 times for the news file (the smallest of the three), and at least 1000 times for the Twitter file.

blogsTDM <- TermDocumentMatrix(blogsCorpus)
BMinAppear1000times <- findFreqTerms(blogsTDM ,lowfreq=1000)
blogsTDM  <- blogsTDM[BMinAppear1000times, ]
blogsTDM  <- as.matrix(blogsTDM )
blogsTDMfreq  <- rowSums(blogsTDM)
blogsTDMfreq <-sort(blogsTDMfreq, decreasing = TRUE)
barplot(blogsTDMfreq[1:12], col = "green", las = 2, main = "Word Frequency Blogs")

NewsTDM <- TermDocumentMatrix(NewsCorpus)
NMinAppear100times <- findFreqTerms(NewsTDM ,lowfreq=100)
NewsTDM <- NewsTDM[NMinAppear100times, ]
NewsTDM  <- as.matrix(NewsTDM)
NewsTDMfreq  <- rowSums(NewsTDM)
NewsTDMfreq <-sort(NewsTDMfreq, decreasing = TRUE)
barplot(NewsTDMfreq[1:12], col = "red", las = 2, main = "Word Frequency News")

twitterTDM <- TermDocumentMatrix(twitterCorpus)
TMinAppear1000times <- findFreqTerms(twitterTDM, lowfreq = 1000)
twitterTDM <- twitterTDM[TMinAppear1000times, ]
twitterTDM  <- as.matrix(twitterTDM)
twitterTDMfreq  <- rowSums(twitterTDM)
twitterTDMfreq <-sort(twitterTDMfreq, decreasing = TRUE)
barplot(twitterTDMfreq[1:12], col = "purple", las = 2, main = "Word Frequency Twitter")

We can see that the word “one” appears more than 3500 times in the blogs sample, the word “said” more than 500 times in the news sample, and the word “just” more than 4000 times in the Twitter sample.
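The bar chart code above converts each TermDocumentMatrix to a dense matrix with “as.matrix”, which can use a lot of memory on larger samples. As a possible alternative (not what this report used), the “row_sums” function from the slam package, which tm builds on, sums the sparse matrix directly. The documents below are toy data made up for illustration.

```r
library(tm)

# Toy documents (hypothetical data)
docs <- c("apple banana apple", "banana cherry apple", "apple cherry cherry")
corpus <- VCorpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus)

# Terms that appear at least 3 times across all documents
frequent <- findFreqTerms(tdm, lowfreq = 3)

# Sum the counts on the sparse matrix, without calling as.matrix()
freq <- sort(slam::row_sums(tdm[frequent, ]), decreasing = TRUE)
freq
```

The resulting named vector can be passed to “barplot” in the same way as the frequency vectors above.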

Next we would like to see the same charts for combinations of two words. The process is the same, but first we create a function that sets the token size (the number of words extracted in each phrase) to 2. For this we use the “NGramTokenizer” function from the “RWeka” package.

library(RWeka)

# Tokenizer that returns only 2-word phrases
Twowords <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

blogsTDM2words <- TermDocumentMatrix(blogsCorpus,control=list(tokenize=Twowords))
BMinAppear100times <- findFreqTerms(blogsTDM2words,lowfreq=100)
blogsTDM2words  <- blogsTDM2words[BMinAppear100times, ]
blogsTDM2words  <- as.matrix(blogsTDM2words)
blogsTDM2wordsfreq  <- rowSums(blogsTDM2words)
blogsTDM2wordsfreq <-sort(blogsTDM2wordsfreq, decreasing = TRUE)
barplot(blogsTDM2wordsfreq[1:12], col = "green", las = 2, main = " 2 Word Frequency Blogs")

NewsTDM2words <- TermDocumentMatrix(NewsCorpus,control=list(tokenize=Twowords))
NMinAppear10times <- findFreqTerms(NewsTDM2words,lowfreq=10)
NewsTDM2words  <- NewsTDM2words[NMinAppear10times, ]
NewsTDM2words  <- as.matrix(NewsTDM2words)
NewsTDM2wordsfreq  <- rowSums(NewsTDM2words)
NewsTDM2wordsfreq <-sort(NewsTDM2wordsfreq, decreasing = TRUE)
barplot(NewsTDM2wordsfreq[1:12], col = "red", las = 2, main = " 2 Word Frequency News")

twitterTDM2words <- TermDocumentMatrix(twitterCorpus,control=list(tokenize=Twowords))
TMinAppear100times <- findFreqTerms(twitterTDM2words, lowfreq=100)
twitterTDM2words  <- twitterTDM2words[TMinAppear100times, ]
twitterTDM2words  <- as.matrix(twitterTDM2words)
twitterTDM2wordsfreq  <- rowSums(twitterTDM2words)
twitterTDM2wordsfreq <-sort(twitterTDM2wordsfreq, decreasing = TRUE)
barplot(twitterTDM2wordsfreq[1:12], col = "purple", las = 2, main = " 2 Word Frequency Twitter")

Finally, we can see that the combination “right now” appears more than 140 times in the blogs sample, “high school” more than 25 times in the news sample, and “right now” slightly more than 500 times in the Twitter sample.
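To make the two-word tokens concrete, here is a small sketch on toy documents. It swaps in a different tokenizer from the one used in this report: instead of RWeka, which needs Java, it uses the “ngrams” function from the NLP package that tm loads. The name “bigram_tokenizer” and the documents are made up for illustration.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Split a document into all pairs of consecutive words
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# Toy documents (hypothetical data)
docs <- c("right now is good", "right now works")
corpus <- VCorpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Total count of each two-word phrase, most frequent first
bigram_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
bigram_freq
```

Here “right now” occurs in both documents, so it comes out on top, just as it does in the blogs and Twitter charts.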

Future plans

I am not completely clear about the prediction algorithm yet, but we could create one that uses the frequency of words to predict the next word, or something similar, together with an interactive Shiny app that the manager could use easily.
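As a rough sketch of what a frequency-based prediction could look like, the function below picks the most frequent two-word phrase that starts with a given word and returns its second word. This is only one possible approach, not a decided design. The name “predict_next” and the counts are hypothetical toy values, not results from the data.

```r
# Toy two-word phrase counts (hypothetical values for illustration)
bigram_counts <- c("right now" = 5, "right here" = 2, "high school" = 3)

# Return the most frequent word that follows `word`, or NA if none is known
predict_next <- function(word, counts) {
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  matches <- first == word
  if (!any(matches)) {
    return(NA_character_)
  }
  second[matches][which.max(counts[matches])]
}

predict_next("right", bigram_counts)
predict_next("zebra", bigram_counts)
```

In practice the counts would come from the two-word frequency tables built above, and a word that never starts a known phrase would need some fallback.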

Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

I will be glad to discuss this interpretation and follow through on any decisions you make.