Overview

This is a preliminary report for the Coursera Capstone project in the Data Science Specialization. The Capstone project entails producing a predictive text product: the user inputs a phrase containing several words, and the response is a selection of possible next words. The response is based on an analysis of phrases occurring in a corpus of texts drawn from three sources: Twitter, blogs and news. This project uses an English-language corpus, though there may be instances of other languages embedded in the text. After input, the text is cleaned (including the removal of obscene and profane words) and broken down into words (or tokens), which can be re-assembled to produce phrases of different lengths. The frequencies of occurrence of these words and phrases are then calculated and used for prediction.
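
To make the idea concrete, here is a minimal sketch of how a next word might be looked up in a small, made-up bigram frequency table. The table bigram_freq and the helper predict_next are illustrative assumptions only, not part of the analysis that follows.

# Minimal sketch of next-word lookup from a made-up bigram frequency table.
# bigram_freq and predict_next are illustrative only, not from this analysis.
suppressWarnings(library(dplyr))

bigram_freq <- tibble(
  word1 = c("happy", "happy", "happy", "good"),
  word2 = c("birthday", "hour", "day", "morning"),
  n     = c(12L, 7L, 5L, 9L)
)

predict_next <- function(w1, freq_table, top_n = 3) {
  freq_table %>%
    filter(word1 == w1) %>%    # keep bigrams starting with the input word
    arrange(desc(n)) %>%       # most frequent continuations first
    head(top_n) %>%
    pull(word2)
}

predict_next("happy", bigram_freq)   # "birthday" "hour" "day"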

suppressWarnings(library(tm))
suppressWarnings(library(stringr))
suppressWarnings(library(dplyr))
suppressWarnings(library(tidyr))
suppressWarnings(library(tidytext))
suppressWarnings(library(magrittr))
suppressWarnings(library(ggplot2))

Assemble data

We read in the data from each file and print summary statistics.

news_text <- readLines('en_US.news.txt')
blogs_text <- readLines('en_US.blogs.txt')
twitter_text <- readLines('en_US.twitter.txt')

Get statistics on the data

##                file file_size number_lines characters characters_per_line
## 1    en_US.news.txt  20111392        77259   15683765           203.00243
## 2   en_US.blogs.txt 260564320       899288  208361438           231.69601
## 3 en_US.twitter.txt 316037344      2360148  162384825            68.80281
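
The code that produced the table above is not shown; below is a sketch of one way these summary statistics could be computed from the objects read in earlier (the helper name file_stats is an illustrative assumption).

# One possible way to compute the summary statistics shown above.
# The helper name file_stats is illustrative, not from the report.
file_stats <- function(file, lines) {
  chars <- sum(nchar(lines))
  data.frame(
    file                = file,
    file_size           = file.size(file),
    number_lines        = length(lines),
    characters          = chars,
    characters_per_line = chars / length(lines)
  )
}

rbind(
  file_stats("en_US.news.txt",    news_text),
  file_stats("en_US.blogs.txt",   blogs_text),
  file_stats("en_US.twitter.txt", twitter_text)
)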

Sampling the data

We take a small random sample of the data to investigate its characteristics.

set.seed(11235)
news_sample <- sample(news_text, 1000)
blogs_sample <- sample(blogs_text, 1000)
twitter_sample <- sample(twitter_text, 1000)
text_sample <- c(news_sample, blogs_sample, twitter_sample)
text_df <- tibble(line = 1:3000, text = text_sample)
text_df %>% unnest_tokens(word, text) %>% anti_join(stop_words) %>% count(word, sort=TRUE)
## # A tibble: 14,048 × 2
##      word     n
##     <chr> <int>
## 1       â   450
## 2    time   170
## 3  people   159
## 4     day   120
## 5    love   100
## 6    life    92
## 7       1    87
## 8       3    86
## 9       2    85
## 10   week    81
## # ... with 14,038 more rows
text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n > 50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()

Data cleaning

We see that the text contains foreign symbols and words, as well as numbers. It also contains hashtags, web addresses beginning with “http”, and so on, all of which we will want to clean.

text_sample[[1]]
## [1] "With their flat terrain and tabula-rasa potential, the two parks, occupying the old site of the Rand Corp. headquarters just west of Santa Monica City Hall, offer little of the romantic, post-industrial drama of the sites where Field Operations has produced its most memorable designs. There is no rusting and useful relic like the abandoned elevated train tracks that form the spine of Corner's most celebrated work, the High Line park on the far west side of Manhattan. There is nothing like the complex history of the Fresh Kills waterfront park on New York's Staten Island, a former landfill that is nearly three times the size of Central Park and became a macabre sorting ground for World Trade Center rubble after the 9/11 attacks."
corpus <- Corpus(VectorSource(text_sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x))
corpus <- corpus %>% tm_map(toEmpty, "[']") %>% tm_map(toEmpty, "#\\w+") %>% tm_map(toSpace, "[[:punct:]]+") %>% tm_map(stripWhitespace)
corpus
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3000
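
The cleaning above does not yet strip the web addresses mentioned earlier. One possible extension, shown only as a sketch, is an extra toEmpty step; it would belong before the punctuation step in the pipeline above, while the URLs are still intact.

# Possible extension: remove web addresses with the same toEmpty transformer.
# In the pipeline above this step belongs before toSpace("[[:punct:]]+"),
# while the URLs are still in one piece.
corpus <- tm_map(corpus, toEmpty, "http\\S+")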

Remove profanities

We read in a file of profane and obscene words, remove these words from the corpus, and carry out further cleaning.

profanities <- readLines("profanities.txt")
corpus <- tm_map(corpus, removeWords, profanities)
corpus
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3000
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
sample_text <- sapply(corpus, identity)
sample_text[[1]]
## [1] "with their flat terrain and tabularasa potential the two parks occupying the old site of the rand corp headquarters just west of santa monica city hall offer little of the romantic postindustrial drama of the sites where field operations has produced its most memorable designs there is no rusting and useful relic like the abandoned elevated train tracks that form the spine of corners most celebrated work the high line park on the far west side of manhattan there is nothing like the complex history of the fresh kills waterfront park on new yorks staten island a former landfill that is nearly three times the size of central park and became a macabre sorting ground for world trade center rubble after the  attacks"
sample_text <- iconv(sample_text, "latin1", "ASCII", sub=" ")
sample_text <- gsub("[^[:alpha:][:space:][:punct:]]", "", sample_text)
sample_df <- tibble(line = 1:3000, text = sample_text)
sample_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n > 50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()
## Joining, by = "word"

# Next, move on to two-word combinations, that is, to bigrams
sample_bigrams <- sample_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams_separated <- sample_bigrams %>% separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>% count(word1, word2, sort = TRUE)
bigram_counts
## Source: local data frame [12,750 x 3]
## Groups: word1 [7,072]
## 
##       word1     word2     n
##       <chr>     <chr> <int>
## 1       san francisco     8
## 2       los   angeles     7
## 3       san     diego     7
## 4        st     louis     7
## 5    health      care     6
## 6      real      life     6
## 7   medical marijuana     5
## 8   nursing      home     5
## 9    social  security     5
## 10 drinking     water     4
## # ... with 12,740 more rows
bigrams_united <- bigrams_filtered %>% unite(bigram, word1, word2, sep = " ")
sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  filter(n > 50) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n)) + geom_col() + xlab(NULL) + coord_flip()

# Now for three-word combinations or trigrams
sample_df %>% unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, !word3 %in% stop_words$word) %>% count(word1, word2, word3, sort = TRUE)
## Source: local data frame [4,846 x 4]
## Groups: word1, word2 [4,763]
## 
##       word1     word2        word3     n
##       <chr>     <chr>        <chr> <int>
## 1       san francisco          ers     4
## 2        de        la         rosa     3
## 3   figured   fashion         week     3
## 4  business      park developments     2
## 5    friday     night       lights     2
## 6       gov     chris     christie     2
## 7    jersey     local         news     2
## 8    kinder      farm         park     2
## 9    landen   meadows neighborhood     2
## 10      los   angeles       county     2
## # ... with 4,836 more rows
sample_df %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)%>% count(trigram, sort=TRUE) %>% filter(n > 10) %>% mutate(trigram = reorder(trigram, n)) %>% ggplot(aes(trigram, n)) + geom_col() + xlab(NULL)+coord_flip()

Next steps

The procedure for quadgrams is similar. The next steps of the project will be to use larger samples and to produce frequency tables of 1-grams, bigrams, trigrams and quadgrams. Based on these frequency tables, we will use Katz’s back-off model in the predictive text application, which allows a Markov chain approach to estimating the probability of the next word.
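
As a rough illustration of the back-off idea, the sketch below falls back from a trigram table to a bigram table when no trigram match is found. It omits the discounting that the full Katz model requires, and it assumes frequency tables with the same column layout as bigram_counts above (the table name trigram_counts and the helper predict_backoff are illustrative assumptions).

# Simplified back-off sketch (no Katz discounting): try the trigram table
# first, then fall back to the bigram table when nothing matches.
# Assumes trigram_counts (word1, word2, word3, n) built like bigram_counts.
# Quadgrams are tokenized the same way, e.g.
#   sample_df %>% unnest_tokens(quadgram, text, token = "ngrams", n = 4)
predict_backoff <- function(w1, w2, trigram_counts, bigram_counts, top_n = 3) {
  hits <- trigram_counts %>%
    filter(word1 == w1, word2 == w2) %>%
    arrange(desc(n))
  if (nrow(hits) > 0) return(head(hits$word3, top_n))
  # Back off: condition on the last word only
  hits <- bigram_counts %>%
    filter(word1 == w2) %>%
    arrange(desc(n))
  head(hits$word2, top_n)
}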

We will also investigate the Good-Turing approach to assigning probabilities to words and phrases that are not observed in the corpus being used.
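
For reference, the Good-Turing adjusted count for an n-gram seen r times is r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct n-grams seen exactly r times, and the probability mass reserved for unseen n-grams is N_1 / N. Below is a minimal sketch of computing these adjusted counts from a frequency column such as bigram_counts$n (the helper name good_turing_counts is an illustrative assumption).

# Minimal Good-Turing sketch: r* = (r + 1) * N_{r+1} / N_r, where N_r is
# the number of distinct n-grams observed exactly r times. In practice the
# N_r values are smoothed, since N_{r+1} is often zero for large r.
good_turing_counts <- function(counts) {
  freq_of_freq <- table(counts)                  # N_r for each observed r
  r  <- as.integer(names(freq_of_freq))
  Nr <- as.numeric(freq_of_freq)
  r_star <- sapply(r, function(k) {
    Nk1 <- Nr[match(k + 1, r)]
    if (is.na(Nk1)) return(NA_real_)             # no n-grams seen k + 1 times
    (k + 1) * Nk1 / Nr[match(k, r)]
  })
  data.frame(r = r, N_r = Nr, r_star = r_star)
}

# Example: adjusted counts for the bigram frequencies computed earlier
good_turing_counts(bigram_counts$n)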

References