Milestone Report 1

This is the first milestone report for the capstone project of the Data Science Specialization from Johns Hopkins University.

Understanding the problem

  • We want to predict the next word given the previous 1, 2, or 3 words, using a basic n-gram model.

  • An n-gram model is trained on a corpus (a large body of text).

  • The model counts how often different sequences of n words occur in the corpus.

  • It then estimates the probability of a word given the previous n-1 words.

  • The next word is selected as the one with the highest estimated probability (see the sketch below).
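
As a minimal illustration (a toy sketch with made-up text, not the project data), the whole idea can be reduced to counting how often each continuation follows a given word and picking the most frequent one:

# Toy sketch: predict the word that most often follows "i" (made-up example)
toyTokens <- c("i", "love", "data", "and", "i", "love", "science")
toyBigrams <- paste(head(toyTokens, -1), tail(toyTokens, -1))
candidates <- toyBigrams[startsWith(toyBigrams, "i ")]
sort(table(candidates), decreasing = TRUE)[1]  # "i love" occurs twice, so the prediction is "love"

In practice the counts are normalized into conditional probabilities before the highest-scoring word is returned.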

Data Acquisition

The data provided can be downloaded here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
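
In this report the files are read from a locally saved copy, but the archive could also be fetched and unpacked directly from R (a sketch; the destination path is illustrative):

# Download and unzip the data set once (destination path is illustrative)
zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zipUrl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}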

# Bring in data - Locally Saved
fileBlog <- readLines("en_US.blogs.txt")
fileNews <- readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
fileTwitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain an
## embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain an
## embedded nul
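
The warnings about an incomplete final line and embedded nul characters are harmless for our purposes. If we wanted to avoid them, the files could be read with an explicit encoding and skipNul = TRUE (a sketch; the vectors used below come from the plain readLines calls above):

# Optional: re-read while skipping embedded nuls
fileTwitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
fileNews    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)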

There are now 3 large character vectors.

head(fileBlog,5);head(fileNews,5);head(fileTwitter,5)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"

As we can see, all three data sets contain the same kind of free-text entries, one document per element (the quotation marks are just how R prints character vectors).

Trimming whitespace

#Trim whitespaces
fileBlog <- trimws(fileBlog);
fileNews <- trimws(fileNews);
fileTwitter <- trimws(fileTwitter)

Data Cleaning

Check for NAs

#Check data for NAs
sum(is.na(fileBlog));sum(is.na(fileNews));sum(is.na(fileTwitter))
## [1] 0
## [1] 0
## [1] 0

There were no NA values.

Check for empty strings

#Check data for empty entries 
any(fileBlog =="");any(fileNews == "");any(fileTwitter == "") 
## [1] FALSE
## [1] FALSE
## [1] FALSE

There were no empty strings.

Check for duplicates

#Check for duplicates
any(duplicated(fileBlog));any(duplicated(fileNews));any(duplicated(fileTwitter))
## [1] FALSE
## [1] FALSE
## [1] TRUE

This tells us that fileTwitter contains duplicate entries.
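
If we wanted to, the duplicate tweets could be dropped before modeling (a sketch; this report keeps the full vector for now):

# Optional: remove duplicate tweets
fileTwitter <- fileTwitter[!duplicated(fileTwitter)]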

Exploratory analysis

Since the data are raw strings, we need a way to break them down into individual words (tokens) so we can analyze them. I will start by converting everything to lower case, stripping punctuation, and splitting on whitespace. This way we can do word-level analysis.

# Create tokens  
tmpTokenBlog <- tolower(gsub("[[:punct:]]", "", fileBlog))
tokenBlog <- unlist(strsplit(tmpTokenBlog, "\\s+"))

tmpTokenNews <- tolower(gsub("[[:punct:]]", "", fileNews))
tokenNews <- unlist(strsplit(tmpTokenNews, "\\s+"))

tmpTokenTwitter <- tolower(gsub("[[:punct:]]", "", fileTwitter))
tokenTwitter <- unlist(strsplit(tmpTokenTwitter, "\\s+"))

# Cleanup temps
rm(tmpTokenBlog); rm(tmpTokenNews); rm(tmpTokenTwitter)
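
A quick sanity check on the result could look like this (output omitted):

# Peek at the first tokens and count them
head(tokenBlog, 10)
length(tokenBlog)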

Now that we have our token data, we can create n-grams. (An n-gram is a sequence of n consecutive words.)

This will be useful for understanding the relationships between words in a sequence.

First, create a function that builds n-grams from a vector of tokens.

# Function to generate n-grams
generate_nGrams <- function(tokens, n) {
  nGrams <- vapply(
    seq_len(length(tokens) - n + 1),
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)  # This specifies the return type, ensuring efficiency - Used lapply first and it was too slow
  )
  return(nGrams)
}

This function slides a window of n consecutive words across a vector of tokens, pastes each group together, and returns a character vector of these word groups. The number of words in each group is determined by the n you provide.
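
For example, applied to a short token vector it behaves like this (a small illustration, not project data):

# Example: bi-grams from a four-word token vector
generate_nGrams(c("the", "quick", "brown", "fox"), 2)
# Expected result: "the quick" "quick brown" "brown fox"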

Test the function

We will first call it with n = 1, which simply returns each individual word as its own gram in a large character vector.

singleGramsBlog <- generate_nGrams(tokenBlog,1)
singleGramsNews <- generate_nGrams(tokenNews,1)
singleGramsTwitter <- generate_nGrams(tokenTwitter,1)

This is where it gets resource heavy.

Generate bi-grams and tri-grams.

# Generate biGrams and triGrams
# Since the data sets are huge, I will only take random samples of 30% of the tokens

set.seed(923)
biGramsBlog <- generate_nGrams(sample(tokenBlog, size = length(tokenBlog) * 0.3), 2)
triGramsBlog <- generate_nGrams(sample(tokenBlog, size = length(tokenBlog) * 0.3), 3)

biGramsNews <- generate_nGrams(sample(tokenNews, size = length(tokenNews) * 0.3), 2)
triGramsNews <- generate_nGrams(sample(tokenNews, size = length(tokenNews) * 0.3), 3)

biGramsTwitter <- generate_nGrams(sample(tokenTwitter, size = length(tokenTwitter) * 0.3), 2)
triGramsTwitter <- generate_nGrams(sample(tokenTwitter, size = length(tokenTwitter) * 0.3), 3)
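
One caveat: sample() here draws individual tokens, which shuffles them and breaks the original word order, so these bi-grams and tri-grams do not represent true adjacent-word sequences. An alternative would be to sample whole lines before tokenizing (a sketch of that idea, not what was run above):

# Alternative: sample 30% of the blog lines first, then tokenize, preserving word order
set.seed(923)
sampleBlog   <- sample(fileBlog, size = floor(length(fileBlog) * 0.3))
sampleTokens <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", sampleBlog)), "\\s+"))
biGramsBlogAlt <- generate_nGrams(sampleTokens, 2)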

Now we can look at the data.

# Summaries
summary(singleGramsBlog);summary(singleGramsNews);summary(singleGramsTwitter)
##    Length     Class      Mode 
##  37165320 character character
##    Length     Class      Mode 
##   2631117 character character
##    Length     Class      Mode 
##  29770468 character character

Over 37 million words in the blog data! That's pretty large.

Now we will calculate the frequencies with which these words appear.

# Check frequencies of words 

singleCountBlog <- table(singleGramsBlog)
singleCountBlog <- sort(singleCountBlog, decreasing = TRUE)
head(singleCountBlog, 20)
## singleGramsBlog
##     the     and      to       a      of       i      in    that      is      it 
## 1855622 1085736 1065641  896187  875005  769099  593482  459476  431796  400873 
##     for     you    with     was      on      my    this      as    have      be 
##  362783  296813  286161  277996  274021  270160  257945  223334  218530  208288
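
The same table-and-sort approach extends to the bi-grams and tri-grams, and those counts are what the eventual prediction model needs: the conditional probability of a word given its predecessor is just a ratio of counts. A sketch (it assumes bi-gram counts built from order-preserving text, and "the year" is only an illustrative key):

# Bi-gram counts, sorted by frequency
biCountBlog <- sort(table(biGramsBlog), decreasing = TRUE)

# Maximum-likelihood estimate of P("year" | "the") = count("the year") / count("the")
biCountBlog["the year"] / singleCountBlog["the"]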

Visualize it

library(wordcloud); library(tm)
## Loading required package: RColorBrewer
## Loading required package: NLP
top10Single <- as.data.frame(head(singleCountBlog, 10))

wordcloud(words = top10Single$singleGramsBlog, 
          freq = top10Single$Freq, 
          random.order = FALSE, 
          rot.per = 0.35, 
          colors = brewer.pal(8, "Dark2"))

That is the summary so far!

Next steps

The remaining parts of the project are:

  • Statistical modeling

  • Predictive modeling

  • Creative exploration

  • Creating a data product

  • Creating a short slide deck pitching the product