This Milestone Report is for the Capstone project of the Coursera Data Science Specialization. Its motivation is to 1) demonstrate that the data has been successfully downloaded and loaded, 2) create a basic report of summary statistics, 3) report any interesting findings discovered so far, and 4) get feedback on plans for creating a prediction algorithm and Shiny app.
First we will read in the Twitter, news, and blog text files and report the number of lines in each.
# Read the Twitter data; skipNul = TRUE skips embedded nul characters
con <- file("en_US.twitter.txt", "r")
twitter <- readLines(con, skipNul = TRUE)
close(con)
length(twitter)
## [1] 2360148
# Read the news data
con <- file("en_US.news.txt", "r")
news <- readLines(con, skipNul = TRUE)
close(con)
length(news)
## [1] 1010242
# Read the blog data
con <- file("en_US.blogs.txt", "r")
blogs <- readLines(con, skipNul = TRUE)
close(con)
length(blogs)
## [1] 899288
Next we will clean up the data to prepare it for analysis. To do this, we’ll 1) make all characters lower case, and 2) remove all numbers, special characters, and punctuation, leaving only letters. We don’t want differences in case to change any predictions, numbers and special characters shouldn’t be predicted either, and removing punctuation makes this stage of the analysis simpler. We might incorporate punctuation later, since commas and periods do change the meaning of sentences, but for now we will exclude it.
# Make text lowercase
twitter <- tolower(twitter)
news <- tolower(news)
blogs <- tolower(blogs)
# Remove numbers from text
twitter_char <- gsub("\\d", "", twitter)
news_char <- gsub("\\d", "", news)
blogs_char <- gsub("\\d", "", blogs)
# Remove all other special characters and punctuation
twitter_char <- gsub("[^[:alnum:] ]", "", twitter_char)
news_char <- gsub("[^[:alnum:] ]", "", news_char)
blogs_char <- gsub("[^[:alnum:] ]", "", blogs_char)
# Replace all instances of one or more spaces with a single space
twitter_char <- gsub("\\s+", " ", twitter_char)
news_char <- gsub("\\s+", " ", news_char)
blogs_char <- gsub("\\s+", " ", blogs_char)
We’ve already seen the number of records in the Twitter, news, and blog text files, and we know they’re large. From experimentation, we know that we’ll have to work with a sample of each file. First we’ll look at the number of characters per record in each file, then take a sample, then explore the sample.
twitter_length <- nchar(twitter_char)
news_length <- nchar(news_char)
blogs_length <- nchar(blogs_char)
summary(twitter_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.00 60.00 64.53 94.00 140.00
summary(news_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 176.0 191.1 256.0 9699.0
summary(blogs_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 44.0 149.0 221.1 317.0 38940.0
# Let's take samples
set.seed(43263)
twitter_sample <- sample(twitter_char, 10000)
news_sample <- sample(news_char, 5000)
blogs_sample <- sample(blogs_char, 5000)
library(wordcloud)
# Word clouds of the most frequent terms in each sample
wordcloud(twitter_sample, max.words = 200, random.order = FALSE)
wordcloud(news_sample, max.words = 150, random.order = FALSE)
wordcloud(blogs_sample, max.words = 100, random.order = FALSE)
Note how different the most common words are across tweets, blogs, and news articles. This implies that prediction models built from different individual sources could give very different results for the same input words, so I will have to decide whether to use just one source or all three. Also note that the word clouds exclude common stop words (the, an, a, I, to, you, for, is, of, etc.), so those aren’t shown here. I haven’t actually removed them from the data yet, but they will be removed before I start modeling; a sketch of that step is shown below.
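This hasn’t been run for this report; it’s a minimal sketch, assuming the tm package is available, of how stop words and a profanity list could be stripped from a sample (the profanity vector here is just a placeholder, not a real word list):
library(tm)
# English stop word list shipped with tm
stop_words <- stopwords("en")
# Placeholder profanity list; in practice this would be read from a published word list
swear_words <- c("swearword1", "swearword2")
# removeWords() works directly on a character vector of documents
twitter_sample_clean <- removeWords(twitter_sample, c(stop_words, swear_words))
# Collapse the extra whitespace left where words were removed
twitter_sample_clean <- gsub("\\s+", " ", twitter_sample_clean)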
Now let’s create histograms of the number of characters and the number of words in each tweet.
# Approximate the word count of each sampled tweet by counting word separators
twitter_words <- sapply(gregexpr("\\W+", twitter_sample), length) + 1
library(ggplot2)
# Histogram of characters per tweet (full data set)
qplot(twitter_length, geom = "histogram", main = "Number of Characters in Tweets Histogram", xlab = "Number of Characters in Tweets")
# Histogram of words per sampled tweet
qplot(twitter_words, geom = "histogram", binwidth = 0.5, main = "Number of Words in Tweets Histogram", xlab = "Number of Words in Tweets")
The next steps are to remove common stop words and swear words, which we don’t want to predict. To build a prediction model, I plan to reuse the sample of the text files explored here and create various n-grams that will be used to predict the next word given the preceding word(s); a rough sketch of the bigram idea follows below. I will also have to decide which text source(s) to use in the model, since tweets, blogs, and news articles can give very different results.
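As a minimal base-R sketch of the n-gram idea (not the final model), bigram counts could be built from the cleaned Twitter sample like this; the most frequent bigram beginning with a given word would then supply the predicted next word:
# Split each sampled tweet into its words
words_per_tweet <- strsplit(twitter_sample, " ", fixed = TRUE)
# Paste each word to the word that follows it to form bigrams
bigrams <- unlist(lapply(words_per_tweet, function(w) {
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])
}))
# The most frequent bigrams drive the next-word predictions
head(sort(table(bigrams), decreasing = TRUE), 10)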