Data Science Capstone: Exploratory Analysis

Introduction

Plans for Creating a Prediction Algorithm

The author plans to use an analysis of words and n-grams to generate a prediction algorithm. Prediction will be based on calculations of maximum likelihood estimates for words entered by a user. An n-gram model will be used which looks at the previous words, up to two words, to generate the words which are most likely to appear next. Discounting will be used when unknown words or combinations of words are entered by the user. Most likely, I will use Katz backoff to discount probabilities. The prediction algorithm will run in a Shiny app where users enter text to receive predictions.

The purpose of the project is to build a model that predicts the next word(s) based on user input. The prediction will be based on the calculation of maximum likelihood estimates of words and n-grams generated by a user. The project will work with combinations of up to three words (3-grams).
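
As a rough illustration of the idea (a sketch, not the final implementation), a maximum-likelihood next-word lookup over such an n-gram frequency table could look like the code below. The data frame `freq` and the helper `predict_next` are assumptions for this example; they mirror the frequency tables built later in this report.

library(dplyr)
library(stringr)

# Sketch of MLE-based next-word prediction; `freq` is assumed to hold n-gram
# frequencies with columns `words` (space-separated n-gram) and `counts`
predict_next <- function(freq, context) {
  # use the last two words of the input as the 3-gram context
  # (for simplicity this sketch assumes the input has at least two words)
  ctx <- word(str_squish(tolower(context)), -2, -1)
  candidates <- freq %>%
    filter(str_detect(words, paste0("^", ctx, " "))) %>%
    mutate(prob = counts / sum(counts)) %>%  # maximum likelihood estimate
    arrange(desc(prob))
  # simplified backoff: fall back to the last word if no 3-gram matches
  # (Katz backoff would additionally discount these probabilities)
  if (nrow(candidates) == 0) {
    ctx <- word(str_squish(tolower(context)), -1)
    candidates <- freq %>%
      filter(str_detect(words, paste0("^", ctx, " "))) %>%
      mutate(prob = counts / sum(counts)) %>%
      arrange(desc(prob))
  }
  # return the predicted next words (last token of the top-ranked n-grams)
  word(head(candidates$words, 3), -1)
}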

The challenges to overcome are:

  * Finding the balance between predictive power and performance
  * Dealing with unknown words/combinations of words
  * Selecting the best statistical model

The final result will be available as a web interface (built with RShiny).

First look at the data: number of lines and words

To get a first impression of the data, I've counted the number of lines and words in each of the three source files:
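
The counts below could have been produced along these lines (a sketch; the exact code isn't shown here, and the file layout under final/en_US/ is assumed to match the one used by the sampling function further down):

library(stringr)

count_file <- function(filename) {
  # assumed layout: final/en_US/en_US.blogs.txt etc.
  path <- paste0('final/', substring(filename, 1, 5), '/', filename)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  print(paste0(filename, "|number of lines: ", length(lines)))
  print(paste0(filename, "|number of words: ", sum(str_count(lines, "\\S+"))))
}

for (f in c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) {
  count_file(f)
}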

## [1] "en_US.blogs.txt|number of lines: 899288"
## [1] "en_US.blogs.txt|number of words: 38370723"
## [1] "en_US.news.txt|number of lines: 1010242"
## [1] "en_US.news.txt|number of words: 35783083"
## [1] "en_US.twitter.txt|number of lines: 2360148"
## [1] "en_US.twitter.txt|number of words: 31149374"

One can easily see that we're dealing with big data here. Working with the full corpus isn't efficient and also isn't necessary. Next we're going to pick a sample of each medium. We'll also manipulate the data in a way that serves our purpose.

Getting and cleaning the data

Since the files are quite large, we move on to selecting a sample of each file and cleaning the data by going through the following steps:

  1. Get the full data set
  2. Remove URLs, Twitter @ mentions and punctuation (in that order)
  3. Pick a subset of the data (I’ll pick 3%).
  4. Tokenize the data (single words, 2-grams and 3-grams)
  5. Remove records containing bad language (based on a list of bad words)
  6. Save the sample data
  7. Create a frequency table
  8. Save the frequency table

That’s what the function below does:

library(tm)          # removePunctuation()
library(dplyr)       # pipes and data manipulation
library(stringr)     # str_sub(), str_extract(), str_count()
library(tokenizers)  # tokenize_ngrams()

get_sample <- function (filename, prob) {
  path <- paste0('final/', substring(filename, 1, 5), '/', filename)
  # load Google's list of bad words
  en_bad <- read.table("badlanguage.txt")
  # open connection and read the full file
  con <- file(path, "r")
  set.seed(23)
  full <- readLines(con, encoding = "UTF-8")
  # remove URLs
  full <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", full)
  # remove twitter @ mentions
  full <- gsub("@\\w+", " ", full)
  # remove punctuation
  full <- removePunctuation(full,
                            preserve_intra_word_contractions = TRUE,
                            preserve_intra_word_dashes = TRUE)
  # draw a binomial sample: prob is the probability of keeping each line
  sample <- as.data.frame(list(full[rbinom(length(full), 1, prob) == 1]))
  colnames(sample)[1] <- "words"
  # tokenize the text into single words, 2-grams and 3-grams;
  # stopwords are kept on purpose as they are relevant for this task
  sample <- as.vector(sample[, 1]) %>%
    tokenize_ngrams(n = 3, n_min = 1) %>%
    unlist() %>% list() %>%
    as.data.frame() %>%
    mutate(language = str_sub(filename, 1, 5),
           media = str_extract(filename, 'blogs|news|twitter'))
  colnames(sample)[1] <- "words"
  # delete all lines containing bad words
  sample <- sample %>%
    subset(!grepl(paste(unlist(en_bad), collapse = "|"), words)) %>%
    mutate(type = ifelse(str_count(words, ' ') == 0, 'words',
                         paste0(str_count(words, ' ') + 1, '-grams')))
  # write sample file
  write.table(sample, paste0(gsub('.txt', '', filename, fixed = TRUE),
                             '_sample', str_sub(filename, -4, -1)))
  # get frequencies and ranks per language/media/type and write file
  freq <- sample %>%
    group_by(words, language, media, type) %>%
    summarise(counts = n()) %>%
    ungroup() %>%
    group_by(language, media, type) %>%
    mutate(rank = rank(-counts)) %>%
    ungroup()
  write.table(freq, paste0(gsub('.txt', '', filename, fixed = TRUE),
                           '_freq', str_sub(filename, -4, -1)))
  close(con)
}
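
For reference, with the 3% sample mentioned above, the function would be called roughly like this (the exact calls are not part of the report):

get_sample("en_US.blogs.txt", 0.03)
get_sample("en_US.news.txt", 0.03)
get_sample("en_US.twitter.txt", 0.03)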

Exploratory Analysis

I’ve pulled the data using the function above. Let’s have a look at total counts per type:
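
The summary below could be produced roughly as follows (a sketch, assuming the frequency tables written by get_sample() have been read back and combined into a data frame `freq_all`, including an 'all' rollup for the media column):

library(dplyr)

freq_all %>%
  group_by(language, media, type) %>%
  summarise(total_counts = sum(counts))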

## # A tibble: 12 x 4
## # Groups:   language, media [4]
##    language media   type    total_counts
##    <chr>    <chr>   <chr>          <int>
##  1 en_US    blogs   words        1145275
##  2 en_US    blogs   2-grams      1117998
##  3 en_US    blogs   3-grams      1091045
##  4 en_US    news    words        1048508
##  5 en_US    news    2-grams      1017969
##  6 en_US    news    3-grams       987538
##  7 en_US    twitter words         902986
##  8 en_US    twitter 2-grams       831708
##  9 en_US    twitter 3-grams       760459
## 10 en_US    all     words        3096769
## 11 en_US    all     2-grams      2967675
## 12 en_US    all     3-grams      2839042

Top 20 words by usage

The whole list consists solely of so-called stopwords: words which are not specific at all and don't give any information about the topic of the conversation. That makes total sense.
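
For illustration, a top-20 chart like the one described here could be built from the frequency table roughly as follows (a sketch; `freq_all` is the assumed combined frequency table from above):

library(dplyr)
library(ggplot2)

freq_all %>%
  filter(type == "words", media == "all") %>%
  slice_max(counts, n = 20) %>%
  ggplot(aes(x = reorder(words, counts), y = counts)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "count", title = "Top 20 words by usage")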

Top 20 2-grams by usage

When it comes to the top 20 most used 2-grams, we also see quite common combinations like ‘going to’, ‘I am’ or ‘I don’t’ ranking up there, which is what I expected.

Top 20 3-grams by usage

Finally, looking at the top 20 3-grams, it's safe to say that the data shaping has been quite a success. Very commonly used phrases rank at the top of the list, which matches the expectations perfectly.

How many unique words are needed to cover 50% and 90% of the total words?

We'd need 154 unique words to cover 50% and around 8k unique words to cover 90% of all words used. That's pretty helpful information and will be needed to balance out performance and precision.
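
These coverage numbers can be derived from the cumulative word frequencies, for example along these lines (a sketch, assuming `word_freq` holds the single-word frequencies, i.e. the rows with type == 'words'):

library(dplyr)

coverage <- word_freq %>%
  arrange(desc(counts)) %>%
  mutate(cum_share = cumsum(counts) / sum(counts))

# number of unique words needed to cover 50% and 90% of all word occurrences
min(which(coverage$cum_share >= 0.5))
min(which(coverage$cum_share >= 0.9))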

What’s up next

Next I'm going to pick a statistical model based on the clean data we've got in place now. Before I go there I might take an even closer look at the data and see if further data cleaning steps are necessary. I did some research on how to deal with foreign words. While it is quite tricky to identify individual foreign words, there are some commonly used methods to identify whole sentences written in a foreign language. So I'm considering adding language detection to the script that creates the samples.
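
One possible approach would be a pre-filter at line level, for example with the cld2 package (a sketch; the package choice is an assumption, not a decision yet):

library(cld2)

# keep only lines detected as English before sampling,
# applied to the `full` vector inside get_sample()
lang <- detect_language(full)
full <- full[!is.na(lang) & lang == "en"]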