This is a preliminary report for the Coursera Capstone project in the Data Science Specialization. The Capstone project entails producing a predictive text product: the user inputs a phrase containing several words, and the response is a selection of possible next words. The response is based on an analysis of phrases occurring in a corpus of texts drawn from three sources: Twitter, blogs and news. This project uses an English-language source, though there may be instances of other languages embedded in the text. After input, the text is cleaned (including the removal of obscene and profane words) and broken down into words (or tokens), which can be re-assembled into phrases of different lengths. The frequencies of occurrence of these words and phrases are then calculated and used for prediction.
suppressWarnings(library(tm))
suppressWarnings(library(stringr))
suppressWarnings(library(dplyr))
suppressWarnings(library(tidyr))
suppressWarnings(library(tidytext))
suppressWarnings(library(magrittr))
suppressWarnings(library(ggplot2))
We read in the data from each file and print summary statistics for the three sources.
news_text <- readLines('en_US.news.txt')
blogs_text <- readLines('en_US.blogs.txt')
twitter_text <- readLines('en_US.twitter.txt')
##                file file_size number_lines characters characters_per_line
## 1    en_US.news.txt  20111392        77259   15683765           203.00243
## 2   en_US.blogs.txt 260564320       899288  208361438           231.69601
## 3 en_US.twitter.txt 316037344      2360148  162384825            68.80281
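The chunk that produced the table above is not echoed in this report; a sketch of what it might look like follows (the column names and the use of file.size(), which reports bytes on disk, are assumptions about that hidden chunk; nchar() counts the characters in each line).
# Sketch of the un-echoed summary-statistics chunk
files <- c("en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt")
texts <- list(news_text, blogs_text, twitter_text)
data.frame(file = files,
           file_size = file.size(files),
           number_lines = sapply(texts, length),
           characters = sapply(texts, function(x) sum(nchar(x))),
           characters_per_line = sapply(texts, function(x) mean(nchar(x))))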
We take a small random sample of the data to investigate its characteristics.
set.seed(11235)
news_sample <- sample(news_text, 1000)
blogs_sample <- sample(blogs_text, 1000)
twitter_sample <- sample(twitter_text, 1000)
text_sample <- c(news_sample, blogs_sample, twitter_sample)
text_df <- data_frame(line = 1:3000, text = text_sample)
text_df %>% unnest_tokens(word, text) %>% anti_join(stop_words) %>% count(word, sort = TRUE)
## # A tibble: 14,048 × 2
## word n
## <chr> <int>
## 1 â 450
## 2 time 170
## 3 people 159
## 4 day 120
## 5 love 100
## 6 life 92
## 7 1 87
## 8 3 86
## 9 2 85
## 10 week 81
## # ... with 14,038 more rows
text_df %>% unnest_tokens(word, text) %>% anti_join(stop_words) %>% count(word, sort = TRUE) %>% filter(n > 50) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()
We see that the text contains foreign symbols and words, as well as numbers. It will also contain hashtags, web addresses beginning with "http", and similar artifacts that we will want to clean.
text_sample[[1]]
## [1] "With their flat terrain and tabula-rasa potential, the two parks, occupying the old site of the Rand Corp. headquarters just west of Santa Monica City Hall, offer little of the romantic, post-industrial drama of the sites where Field Operations has produced its most memorable designs. There is no rusting and useful relic like the abandoned elevated train tracks that form the spine of Corner's most celebrated work, the High Line park on the far west side of Manhattan. There is nothing like the complex history of the Fresh Kills waterfront park on New York's Staten Island, a former landfill that is nearly three times the size of Central Park and became a macabre sorting ground for World Trade Center rubble after the 9/11 attacks."
corpus <- Corpus(VectorSource(text_sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x))
corpus <- corpus %>% tm_map(toEmpty, "[']") %>% tm_map(toEmpty, "#\\w+") %>% tm_map(toSpace, "[[:punct:]]+") %>% tm_map(stripWhitespace)
corpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3000
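The #\w+ pattern above takes care of hashtags, but web addresses are only broken apart by the punctuation step rather than removed outright. If we want to drop them whole, a step like the following could be inserted before the punctuation transform (a sketch using the same toSpace transformer; the regex is an assumption about how addresses appear in the text, and it was not applied when producing the output shown in this report).
# Illustrative URL removal, to be placed before the [[:punct:]]+ step (sketch only; not run here)
corpus <- corpus %>% tm_map(toSpace, "(f|ht)tps?://\\S+")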
We read in a file of profane and obscene words, remove them from the corpus, and then carry out further cleaning.
profanities <- readLines("profanities.txt")
corpus <- corpus %>% tm_map(removeWords, profanities)
corpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3000
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
sample_text <- sapply(corpus, identity)
sample_text[[1]]
## [1] "with their flat terrain and tabularasa potential the two parks occupying the old site of the rand corp headquarters just west of santa monica city hall offer little of the romantic postindustrial drama of the sites where field operations has produced its most memorable designs there is no rusting and useful relic like the abandoned elevated train tracks that form the spine of corners most celebrated work the high line park on the far west side of manhattan there is nothing like the complex history of the fresh kills waterfront park on new yorks staten island a former landfill that is nearly three times the size of central park and became a macabre sorting ground for world trade center rubble after the attacks"
sample_text <- iconv(sample_text, "latin1", "ASCII", sub=" ")
sample_text <- gsub("[^[:alpha:][:space:][:punct:]]", "", sample_text)
sample_df <- data_frame(line = 1:3000, text = sample_text)
sample_df %>% unnest_tokens(word, text) %>% anti_join(stop_words) %>% count(word, sort = TRUE) %>% filter(n > 50) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()
## Joining, by = "word"
# Next, move on to two-word combinations, that is, to bigrams
sample_bigrams <- sample_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams_separated <- sample_bigrams %>% separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>% count(word1, word2, sort = TRUE)
bigram_counts
## Source: local data frame [12,750 x 3]
## Groups: word1 [7,072]
##
## word1 word2 n
## <chr> <chr> <int>
## 1 san francisco 8
## 2 los angeles 7
## 3 san diego 7
## 4 st louis 7
## 5 health care 6
## 6 real life 6
## 7 medical marijuana 5
## 8 nursing home 5
## 9 social security 5
## 10 drinking water 4
## # ... with 12,740 more rows
bigrams_united <- bigrams_filtered %>% unite(bigram, word1, word2, sep = " ")
sample_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% count(bigram, sort = TRUE) %>% filter(n > 50) %>% mutate(bigram = reorder(bigram, n)) %>% ggplot(aes(bigram, n)) + geom_col() + xlab(NULL) + coord_flip()
# Now for three-word combinations or trigrams
sample_df %>% unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, !word3 %in% stop_words$word) %>% count(word1, word2, word3, sort = TRUE)
## Source: local data frame [4,846 x 4]
## Groups: word1, word2 [4,763]
##
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 san francisco ers 4
## 2 de la rosa 3
## 3 figured fashion week 3
## 4 business park developments 2
## 5 friday night lights 2
## 6 gov chris christie 2
## 7 jersey local news 2
## 8 kinder farm park 2
## 9 landen meadows neighborhood 2
## 10 los angeles county 2
## # ... with 4,836 more rows
sample_df %>% unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% count(trigram, sort = TRUE) %>% filter(n > 10) %>% mutate(trigram = reorder(trigram, n)) %>% ggplot(aes(trigram, n)) + geom_col() + xlab(NULL) + coord_flip()
The procedure for quadgrams is similar. The next steps of the project will be to use larger samples and to produce frequency tables of unigrams, bigrams, trigrams and quadgrams. Our predictive text application will apply Katz's back-off model to these frequency tables, which allows a Markov-chain approach to estimating the probability of the next word.
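As an illustration of how those tables would be used, the following sketch implements a simple back-off lookup (not the full Katz model, which also discounts counts and redistributes probability mass). The names trigram_counts and bigram_counts assume the counts from the pipelines above have been saved to objects with columns word1, word2, (word3,) n; for real prediction we would build the tables without removing stop words. Quadgrams are tokenized the same way with token = "ngrams", n = 4.
# Quadgram tokenization follows the same pattern as bigrams and trigrams (sketch)
sample_quadgrams <- sample_df %>% unnest_tokens(quadgram, text, token = "ngrams", n = 4)
# Simple back-off lookup over assumed frequency tables (illustrative only)
predict_next <- function(phrase, trigram_counts, bigram_counts, n_best = 3) {
    words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
    if (length(words) == 2) {
        # Try trigrams conditioned on the last two words first
        hits <- trigram_counts %>% filter(word1 == words[1], word2 == words[2]) %>% arrange(desc(n))
        if (nrow(hits) > 0) return(head(hits$word3, n_best))
    }
    # Back off to bigrams conditioned on the last word only
    hits <- bigram_counts %>% filter(word1 == tail(words, 1)) %>% arrange(desc(n))
    head(hits$word2, n_best)
}
# e.g. predict_next("san", trigram_counts, bigram_counts)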
We will also investigate the Good-Turing approach to assigning probabilities to words and phrases that are not observed in the corpus being used.
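As a preview, the core of the Good-Turing adjustment can be sketched directly from the bigram counts already computed (illustrative only; a production implementation would smooth the frequency-of-frequency counts before using them).
# N_c = number of distinct bigrams observed exactly c times
Nc <- table(bigram_counts$n)
# Good-Turing adjusted count: c* = (c + 1) * N_(c+1) / N_c, when both counts are available
gt_adjust <- function(c) {
    n_c  <- Nc[as.character(c)]
    n_c1 <- Nc[as.character(c + 1)]
    if (is.na(n_c) || is.na(n_c1)) return(c)   # fall back to the raw count
    as.numeric((c + 1) * n_c1 / n_c)
}
# Probability mass reserved for unseen bigrams: N_1 / (total bigram observations)
unseen_mass <- as.numeric(Nc["1"]) / sum(bigram_counts$n)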