This Milestone Report is for the Capstone project of the Coursera Data Science Specialization. Its motivation is to 1) demonstrate that the data has been successfully downloaded and loaded, 2) create a basic report of summary statistics, 3) report any interesting findings discovered so far, and 4) get feedback on plans for creating a prediction algorithm and Shiny app.
First we will read in the Twitter, news, and blog text files and report the number of lines in each.
# Read the Twitter data; skipNul = TRUE skips embedded nul characters
con <- file("en_US.twitter.txt", "r")
twitter <- readLines(con, skipNul = TRUE)
close(con)
length(twitter)
## [1] 2360148
# Read the news data
con <- file("en_US.news.txt", "r")
news <- readLines(con, skipNul = TRUE)
close(con)
length(news)
## [1] 1010242
# Read the blog data
con <- file("en_US.blogs.txt", "r")
blogs <- readLines(con, skipNul = TRUE)
close(con)
length(blogs)
## [1] 899288
Next we will clean up the data to prepare it for analysis. To do this, we’ll 1) make all characters lower case, and 2) remove all numbers, special characters, and punctuation, leaving only letters. We don’t want differences in case to change any predictions, numbers and special characters shouldn’t be predicted either, and removing punctuation makes this stage of the analysis simpler. We might incorporate punctuation later, since commas and periods do change the meaning of sentences, but for now we will exclude it.
# Make text lowercase
twitter <- tolower(twitter)
news <- tolower(news)
blogs <- tolower(blogs)
# Remove numbers from text
twitter_char <- gsub("\\d", "", twitter)
news_char <- gsub("\\d", "", news)
blogs_char <- gsub("\\d", "", blogs)
# Remove all other special characters and punctuation
twitter_char <- gsub("[^[:alnum:] ]", "", twitter_char)
news_char <- gsub("[^[:alnum:] ]", "", news_char)
blogs_char <- gsub("[^[:alnum:] ]", "", blogs_char)
# Replace all instances of one or more spaces with a single space
twitter_char <- gsub("\\s+", " ", twitter_char)
news_char <- gsub("\\s+", " ", news_char)
blogs_char <- gsub("\\s+", " ", blogs_char)
We’ve already seen the number of records in the Twitter, news, and blog text files, and we know they’re large. From experimentation, we know that we’ll have to work with a sample of each file. First we’ll look at the number of characters per record in each file, then take a sample, then explore the sample.
twitter_length <- nchar(twitter_char)
news_length <- nchar(news_char)
blogs_length <- nchar(blogs_char)
summary(twitter_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.00 60.00 64.53 94.00 140.00
summary(news_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 176.0 191.1 256.0 9699.0
summary(blogs_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 44.0 149.0 221.1 317.0 38940.0
# Let's take samples
set.seed(43263)
twitter_sample <- sample(twitter_char, 10000)
news_sample <- sample(news_char, 5000)
blogs_sample <- sample(blogs_char, 5000)
library(wordcloud)
# Word clouds of the most frequent terms in each sample
wordcloud(twitter_sample, max.words = 200, random.order = FALSE)
wordcloud(news_sample, max.words = 150, random.order = FALSE)
wordcloud(blogs_sample, max.words = 100, random.order = FALSE)
Note how different the most common words are across tweets, blogs, and news articles. This implies that prediction models built from different individual sources could give very different results for the same input words, so I will have to decide whether to use just one source or all three. Also note that the word clouds exclude common stop words (the, an, a, I, to, you, for, is, of, etc.), so those aren’t shown here. I haven’t actually removed them from the data yet, but they will be removed before I start modeling; a sketch of that step is shown below.
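This hasn’t been run for this report; it’s a minimal sketch, assuming the tm package is available, of how stop words and a profanity list could be stripped from a sample (the profanity vector here is just a placeholder, not a real word list):
library(tm)
# English stop word list shipped with tm
stop_words <- stopwords("en")
# Placeholder profanity list; in practice this would be read from a published word list
swear_words <- c("swearword1", "swearword2")
# removeWords() works directly on a character vector of documents
twitter_sample_clean <- removeWords(twitter_sample, c(stop_words, swear_words))
# Collapse the extra whitespace left where words were removed
twitter_sample_clean <- gsub("\\s+", " ", twitter_sample_clean)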
Now let’s create histograms of the number of characters and the number of words in each tweet.
# Approximate the word count of each sampled tweet by counting word separators
twitter_words <- sapply(gregexpr("\\W+", twitter_sample), length) + 1
library(ggplot2)
# Histogram of characters per tweet (full data set)
qplot(twitter_length, geom = "histogram", main = "Number of Characters in Tweets Histogram", xlab = "Number of Characters in Tweets")
# Histogram of words per sampled tweet
qplot(twitter_words, geom = "histogram", binwidth = 0.5, main = "Number of Words in Tweets Histogram", xlab = "Number of Words in Tweets")
The next steps are to remove common stop words and swear words, which we don’t want to predict. To build a prediction model, I plan to reuse the sample of the text files explored here and create various n-grams that will be used to predict the next word given the preceding word(s); a rough sketch of the bigram idea follows below. I will also have to decide which text source(s) to use in the model, since tweets, blogs, and news articles can give very different results.
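As a minimal base-R sketch of the n-gram idea (not the final model), bigram counts could be built from the cleaned Twitter sample like this; the most frequent bigram beginning with a given word would then supply the predicted next word:
# Split each sampled tweet into its words
words_per_tweet <- strsplit(twitter_sample, " ", fixed = TRUE)
# Paste each word to the word that follows it to form bigrams
bigrams <- unlist(lapply(words_per_tweet, function(w) {
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])
}))
# The most frequent bigrams drive the next-word predictions
head(sort(table(bigrams), decreasing = TRUE), 10)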