Project 1 - Jose Luis Lopez Guevara

Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

Through the “stri_stats_general” and “stri_stats_latex” functions from the stringi package we can obtain the basic summaries of the three files.
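
The summaries below assume that the three files have already been read into the character vectors “blogs”, “news” and “twitter”. One plausible way to load them is sketched here; the file names are an assumption, since they are not given in this report.

blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)    # assumed file name
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)     # assumed file name
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)  # assumed file name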

library(stringi)
stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   208361438   171926076
stri_stats_latex(blogs)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     163325412             9      43302825      37865888             3 
##        Envirs 
##             0
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15683765    13117038
stri_stats_latex(news)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##      12502954             0       3114374       2665742             0 
##        Envirs 
##             0
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162384825   134370864
stri_stats_latex(twitter)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     125769312          3033      36047904      30578891           963 
##        Envirs 
##             0

With the data above we can see important figures such as the line, character and word counts for each file. In particular, the file with the most lines is the Twitter file with 2,360,148, and the smallest is the news file with 77,259. We can also see that, regardless of the number of lines, the blogs file has more words than the others, with 37,865,888.

Once this is done, we want to build some charts. To handle the data we first need to convert our character vectors into corpus objects using the “tm” package, and because the data is so large we will only use a random sample of 3% of the original size of every file.

library(tm)
## Loading required package: NLP
set.seed(3712)

# take a 3% random sample of the blogs file and build a corpus from it
blogsSample <- sample(blogs, round(length(blogs) * 0.03))
blogsCorpus <- VCorpus(VectorSource(blogsSample))
# basic cleaning: extra whitespace, case, punctuation, numbers and stop words
blogsCorpus <- tm_map(blogsCorpus, stripWhitespace)
blogsCorpus <- tm_map(blogsCorpus, content_transformer(tolower))
blogsCorpus <- tm_map(blogsCorpus, removePunctuation)
blogsCorpus <- tm_map(blogsCorpus, removeNumbers)
blogsCorpus <- tm_map(blogsCorpus, removeWords, stopwords("english"))

In the lines above we perform some basic cleaning operations. Through the “tm_map” function from the “tm” package we apply to our corpus object (built from “blogsSample”, the sample of the original data) functions like “removePunctuation” and “removeNumbers”, which do exactly what their names say; we also remove unnecessary spaces with the “stripWhitespace” function and convert the words to lower case with “tolower” (wrapped in “content_transformer” so the result stays a valid corpus). Finally we remove the “stopwords” such as “in”, “the” and “and”, because these words normally appear a lot and would bias our analysis if we kept them.

We do the same cleaning with the other two files, as shown in the sketch below.
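
For reference, this is a sketch of the same pipeline applied to the news and Twitter samples; the object names NewsCorpus and twitterCorpus are the ones used later in this report, and the cleaning steps are assumed to mirror the blogs pipeline above.

newsSample <- sample(news, round(length(news) * 0.03))
NewsCorpus <- VCorpus(VectorSource(newsSample))
NewsCorpus <- tm_map(NewsCorpus, stripWhitespace)
NewsCorpus <- tm_map(NewsCorpus, content_transformer(tolower))
NewsCorpus <- tm_map(NewsCorpus, removePunctuation)
NewsCorpus <- tm_map(NewsCorpus, removeNumbers)
NewsCorpus <- tm_map(NewsCorpus, removeWords, stopwords("english"))

twitterSample <- sample(twitter, round(length(twitter) * 0.03))
twitterCorpus <- VCorpus(VectorSource(twitterSample))
twitterCorpus <- tm_map(twitterCorpus, stripWhitespace)
twitterCorpus <- tm_map(twitterCorpus, content_transformer(tolower))
twitterCorpus <- tm_map(twitterCorpus, removePunctuation)
twitterCorpus <- tm_map(twitterCorpus, removeNumbers)
twitterCorpus <- tm_map(twitterCorpus, removeWords, stopwords("english"))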

Now that the data is clean, we can draw a word cloud that shows the most frequent words in each file. To do this we use the wordcloud package.

library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(blogsCorpus, max.words=75, random.order=F, rot.per=.33, colors=colorRampPalette(brewer.pal(9,"Greens"))(33), scale=c(3, .3))

wordcloud(NewsCorpus, max.words=75, random.order=F, rot.per=.33, colors=colorRampPalette(brewer.pal(9,"Reds"))(33), scale=c(3, .3))

wordcloud(twitterCorpus, max.words=75, random.order=F, rot.per=.33, colors=colorRampPalette(brewer.pal(9,"Purples"))(33), scale=c(3, .3))

We can see that the most frequent word in the blogs file is “one”, in the news file it is “said”, and in the Twitter file it is “just”.

Has the data scientist made basic plots, such as histograms to illustrate features of the data?

Now we would like to see another representation of this information, a bar chart. Since we are working with corpus objects, we first have to convert them to a TermDocumentMatrix, and through the “findFreqTerms” function we can specify a minimum (and optionally a maximum) frequency. Here we require at least 1000 occurrences for the blogs file, 100 for the news file (which is the smallest) and 1000 again for the Twitter file.

# one-word term-document matrix for the blogs sample, keeping terms that appear at least 1000 times
blogsTDM <- TermDocumentMatrix(blogsCorpus)
BMinAppear1000times <- findFreqTerms(blogsTDM, lowfreq = 1000)
blogsTDM <- blogsTDM[BMinAppear1000times, ]
blogsTDM <- as.matrix(blogsTDM)
blogsTDMfreq <- rowSums(blogsTDM)
blogsTDMfreq <- sort(blogsTDMfreq, decreasing = TRUE)
barplot(blogsTDMfreq[1:12], col = "green", las = 2, main = "Word Frequency Blogs")

# same for the news sample, with a lower threshold of 100 occurrences
NewsTDM <- TermDocumentMatrix(NewsCorpus)
NMinAppear100times <- findFreqTerms(NewsTDM, lowfreq = 100)
NewsTDM <- NewsTDM[NMinAppear100times, ]
NewsTDM <- as.matrix(NewsTDM)
NewsTDMfreq <- rowSums(NewsTDM)
NewsTDMfreq <- sort(NewsTDMfreq, decreasing = TRUE)
barplot(NewsTDMfreq[1:12], col = "red", las = 2, main = "Word Frequency News")

# same for the Twitter sample, again with a threshold of 1000 occurrences
twitterTDM <- TermDocumentMatrix(twitterCorpus)
TMinAppear1000times <- findFreqTerms(twitterTDM, lowfreq = 1000)
twitterTDM <- twitterTDM[TMinAppear1000times, ]
twitterTDM <- as.matrix(twitterTDM)
twitterTDMfreq <- rowSums(twitterTDM)
twitterTDMfreq <- sort(twitterTDMfreq, decreasing = TRUE)
barplot(twitterTDMfreq[1:12], col = "purple", las = 2, main = "Word Frequency Twitter")

We can see that the frequency of the word “one” in the blogs file is more than 3500, that of the word “said” in the news file is more than 500, and that of the word “just” in the Twitter file is more than 4000.

Now we would like to see the same charts but for combinations of two words. We follow the same process, but first we create a function that sets the size of the “token” (the number of words we extract from each phrase) to 2; for this we use the “NGramTokenizer” function from the “RWeka” package.

library(RWeka)
Twowords <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
 

# two-word (bigram) term-document matrix for the blogs sample, keeping bigrams that appear at least 100 times
blogsTDM2words <- TermDocumentMatrix(blogsCorpus, control = list(tokenize = Twowords))
BMinAppear100times <- findFreqTerms(blogsTDM2words, lowfreq = 100)
blogsTDM2words <- blogsTDM2words[BMinAppear100times, ]
blogsTDM2words <- as.matrix(blogsTDM2words)
blogsTDM2wordsfreq <- rowSums(blogsTDM2words)
blogsTDM2wordsfreq <- sort(blogsTDM2wordsfreq, decreasing = TRUE)
barplot(blogsTDM2wordsfreq[1:12], col = "green", las = 2, main = "2 Word Frequency Blogs")

# bigram matrix for the news sample, with a threshold of 10 occurrences
NewsTDM2words <- TermDocumentMatrix(NewsCorpus, control = list(tokenize = Twowords))
NMinAppear10times <- findFreqTerms(NewsTDM2words, lowfreq = 10)
NewsTDM2words <- NewsTDM2words[NMinAppear10times, ]
NewsTDM2words <- as.matrix(NewsTDM2words)
NewsTDM2wordsfreq <- rowSums(NewsTDM2words)
NewsTDM2wordsfreq <- sort(NewsTDM2wordsfreq, decreasing = TRUE)
barplot(NewsTDM2wordsfreq[1:12], col = "red", las = 2, main = "2 Word Frequency News")

# bigram matrix for the Twitter sample, with a threshold of 100 occurrences
twitterTDM2words <- TermDocumentMatrix(twitterCorpus, control = list(tokenize = Twowords))
TMinAppear100times <- findFreqTerms(twitterTDM2words, lowfreq = 100)
twitterTDM2words <- twitterTDM2words[TMinAppear100times, ]
twitterTDM2words <- as.matrix(twitterTDM2words)
twitterTDM2wordsfreq <- rowSums(twitterTDM2words)
twitterTDM2wordsfreq <- sort(twitterTDM2wordsfreq, decreasing = TRUE)
barplot(twitterTDM2wordsfreq[1:12], col = "purple", las = 2, main = "2 Word Frequency Twitter")

Finally, we can see that the frequency of the combination “right now” in the blogs file is more than 140, that of “high school” in the news file is more than 25, and that of “right now” in the Twitter file is slightly more than 500.

Future plans

I am not completely clear yet about the prediction algorithm, but we could build one based on word frequencies, predicting the next word from the words that most often follow the current one (a rough sketch of this idea follows), together with an interactive Shiny app that the manager could use easily.
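
As a rough illustration only, the sketch below assumes the sorted bigram frequencies computed above (for example blogsTDM2wordsfreq) and a hypothetical helper called predictNextWord; it simply returns the second word of the most frequent bigram that starts with the word the user typed.

# hypothetical sketch: predict the next word from a named bigram frequency vector
predictNextWord <- function(word, bigramFreq) {
  word <- tolower(word)
  # keep only the bigrams whose first word matches the input word
  hits <- bigramFreq[startsWith(names(bigramFreq), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  # return the second word of the most frequent matching bigram
  strsplit(names(which.max(hits)), " ")[[1]][2]
}

predictNextWord("right", blogsTDM2wordsfreq)  # should suggest "now", given the chart above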

Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

I will be glad to discuss this interpretation and follow through on any decisions you make.