Data Science Capstone Project

Data Exploration

The data set for this project contains three files: a US News file, a US Blogs file, and a Twitter file. Each is a plain text file that needs to be processed and cleaned for this summary (i.e., punctuation and numbers removed, all letters converted to lower case, and stopwords removed). The US Blogs and Twitter files were shortened by sampling lines because of computer capacity and run-time constraints.
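
One plausible way this shortening could be done is to sample a fraction of the lines from each raw file and write the smaller file back out before the corpora are built; the sketch below illustrates this, but the sampling fraction and seed are assumptions rather than the values actually used.

# Illustrative subsampling of the blogs and Twitter files (fraction and seed are assumptions)
set.seed(1234)
blogfull <- readLines("C:/Users/Rick/Desktop/texts/blogs/en_US.blogs.txt")
blogkeep <- sample(blogfull, round(length(blogfull) * 0.25))    # keep about a quarter of the lines
writeLines(blogkeep, "C:/Users/Rick/Desktop/texts/blogs/en_US.blogs.txt")

twitterfull <- readLines("C:/Users/Rick/Desktop/texts/twitter/en_US.twitter.txt")
twitterkeep <- sample(twitterfull, round(length(twitterfull) * 0.25))
writeLines(twitterkeep, "C:/Users/Rick/Desktop/texts/twitter/en_US.twitter.txt")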

The tm package was utilized for this processing:

library(tm)
library(wordcloud)

# Build and clean the news corpus
texts <- Corpus(DirSource("C:/Users/Rick/Desktop/texts/news"))
texts1 <- tm_map(texts, content_transformer(tolower))
texts1 <- tm_map(texts1, removePunctuation)
texts1 <- tm_map(texts1, removeNumbers)
texts1 <- tm_map(texts1, removeWords, stopwords("english"))
texts1 <- tm_map(texts1, stripWhitespace)
textmatrix <- DocumentTermMatrix(texts1)

# Build and clean the blogs corpus
texts2 <- Corpus(DirSource("C:/Users/Rick/Desktop/texts/blogs"))
texts3 <- tm_map(texts2, content_transformer(tolower))
texts3 <- tm_map(texts3, removePunctuation)
texts3 <- tm_map(texts3, removeNumbers)
texts3 <- tm_map(texts3, removeWords, stopwords("english"))
texts3 <- tm_map(texts3, stripWhitespace)
textmatrix2 <- DocumentTermMatrix(texts3)

# Build and clean the Twitter corpus
texts4 <- Corpus(DirSource("C:/Users/Rick/Desktop/texts/twitter"))
texts5 <- tm_map(texts4, content_transformer(tolower))
texts5 <- tm_map(texts5, removePunctuation)
texts5 <- tm_map(texts5, removeNumbers)
texts5 <- tm_map(texts5, removeWords, stopwords("english"))
texts5 <- tm_map(texts5, stripWhitespace)
textmatrix4 <- DocumentTermMatrix(texts5)

After each corpus is created, a word-frequency list is built, sorted from highest to lowest, and converted to a data frame for graphing and summary. A barplot and a word cloud show the most frequently used words in each of the three files (US News, US Blogs, and Twitter, in that order):

# Term frequencies for the news corpus, sorted highest to lowest
freq <- sort(colSums(as.matrix(textmatrix)), decreasing = TRUE)
worddata <- data.frame(word = names(freq), freq = freq)
word <- worddata[1:10,]
barplot(word$freq, col="blue", names.arg = word$word, 
        main="Word Frequency, News", ylab="Quantity")

wordcloud(names(freq), freq, min.freq=2000)

# Term frequencies for the blogs corpus
freq2 <- sort(colSums(as.matrix(textmatrix2)), decreasing = TRUE)
worddata2 <- data.frame(word = names(freq2), freq = freq2)
word2 <- worddata2[1:10,]
barplot(word2$freq, col="red", names.arg = word2$word, 
        main="Word Frequency, Blogs", ylab="Quantity")

wordcloud(names(freq2), freq2, min.freq=5000)

# Term frequencies for the Twitter corpus
freq4 <- sort(colSums(as.matrix(textmatrix4)), decreasing = TRUE)
worddata4 <- data.frame(word = names(freq4), freq = freq4)
word4 <- worddata4[1:10,]
barplot(word4$freq, col="green", names.arg = word4$word, 
        main="Word Frequency, Twitter", ylab="Quantity")

wordcloud(names(freq4), freq4, min.freq=5000)

The number of lines and the number of words in each of the three files are also calculated:

# News file: line count from the raw file, word count from the cleaned term matrix
g <- readLines("C:/Users/Rick/Desktop/texts/news/en_US.news.txt")
news <- length(g)
newsword <- rowSums(as.matrix(textmatrix))

# Blogs file
g2 <- readLines("C:/Users/Rick/Desktop/texts/blogs/en_US.blogs.txt")
news2 <- length(g2)
newsword2 <- rowSums(as.matrix(textmatrix2))

# Twitter file
g4 <- readLines("C:/Users/Rick/Desktop/texts/twitter/en_US.twitter.txt")
news4 <- length(g4)
newsword4 <- rowSums(as.matrix(textmatrix4))

The US News file has 77,259 lines of text and approximately 1,494,691 words after cleaning.

The US Blogs file has 236,298 lines of text and approximately 5,031,905 words after cleaning.

The Twitter file has 461,250 lines of text and approximately 3,210,949 words after cleaning.

Looking at the three barplots and word clouds, you can see that several words appear frequently in all three files. This should help with pattern recognition and possibly allow n-grams to predict the next word in a sentence. N-grams and tokenization of the common words will be used in the next phase of the project, the modeling phase.
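
One simple check of this overlap, using the frequency vectors computed above, is to list the words that rank in the top ten of all three files:

# Words appearing in the top ten of all three frequency lists
Reduce(intersect, list(names(freq)[1:10], names(freq2)[1:10], names(freq4)[1:10]))

As a preview of the modeling phase, the sketch below counts bigrams (two-word sequences) in the cleaned news corpus using only base R. The object names are illustrative, the sketch ignores document boundaries, and a tokenizer package (for example RWeka or quanteda) could be used instead.

# Minimal bigram sketch on the cleaned news corpus (base R; ignores document boundaries)
cleaned <- unlist(lapply(texts1, as.character))         # cleaned documents as character strings
tokens <- unlist(strsplit(cleaned, "\\s+"))             # split on whitespace
tokens <- tokens[tokens != ""]                          # drop empty tokens
bigrams <- paste(head(tokens, -1), tail(tokens, -1))    # pair each word with the next
bigramfreq <- sort(table(bigrams), decreasing = TRUE)   # bigram frequency table
head(bigramfreq, 10)                                    # ten most common bigrams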