Milestone report - Exploratory data analysis

Project overview

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey. (from the course description).

We use three English-language files containing text entries from blogs, news, and Twitter.

This document presents an exploratory analysis for these three sources. The seed used for this work is 3399.

Twitter analysis

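con <- file("final/en_US/en_US.twitter.txt", "r")
aux <- readLines(con, n = -1)
close(con)  # release the file connection once the lines are read

l <- length(aux)                    # number of lines in the source
m <- max(nchar(aux), na.rm = TRUE)  # length of the longest line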

The source has 2,360,148 lines, and the longest line has 140 characters.
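As a minimal sketch of the same measurements, the example below writes a small temporary file (hypothetical data, not the corpus) and computes the line count and the longest line. The names `path_demo`, `lines_demo`, `n_lines` and `max_char` are assumptions made for illustration only.

```r
# Write a small temporary file so the example does not depend on the corpus.
path_demo <- tempfile(fileext = ".txt")
writeLines(c("short line", "a somewhat longer line", "mid line"), path_demo)

# When given a path, readLines() opens and closes the file itself.
lines_demo <- readLines(path_demo, encoding = "UTF-8", skipNul = TRUE)

n_lines  <- length(lines_demo)                  # number of lines
max_char <- max(nchar(lines_demo), na.rm = TRUE)  # longest line, in characters

unlink(path_demo)  # remove the temporary file
```

Passing a path rather than a connection avoids having to close the connection by hand.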

sampleSize <- 1000
set.seed(3399)  # seed stated in the project overview, so the sample is reproducible
sample <- sample(aux, sampleSize)

We are going to use a sample size of 1000.
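The effect of fixing the seed can be illustrated with a short sketch on toy data (the vector `lines_demo` is a hypothetical stand-in for the lines of the source file). Resetting the seed before each draw should give the same sample both times.

```r
# Toy stand-in for the lines read from the Twitter file (hypothetical data).
lines_demo <- paste("tweet number", 1:50)

# Fixing the seed makes the random sample repeatable.
set.seed(3399)
first_draw <- sample(lines_demo, 10)

set.seed(3399)
second_draw <- sample(lines_demo, 10)

# Both draws match because the seed was reset before each one.
identical(first_draw, second_draw)
```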

First step

We are going to replace contractions that we find in English.

sample <- gsub("can’t", "cannot",sample)
sample <- gsub("can't", "cannot",sample)

sample <- gsub("won’t", "will not", sample)
sample <- gsub("won't", "will not", sample)

sample <- gsub("’re", " are", sample)
sample <- gsub("'re", " are", sample)

sample <- gsub("’ve", " have", sample)
sample <- gsub("'ve", " have", sample)

sample <- gsub("what’s", "what is", sample)
sample <- gsub("what's", "what is", sample)

sample <- gsub("n’t", " not", sample)
sample <- gsub("n't", " not", sample)

sample <- gsub("’d", " would", sample)
sample <- gsub("'d", " would", sample)

sample <- gsub("’ll", " will", sample)
sample <- gsub("'ll", " will", sample)

sample <- gsub("’m", " am", sample)
sample <- gsub("'m", " am", sample)
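The same idea can be sketched more compactly with a table of patterns applied in a loop. This is only an illustration under assumptions: the helper `expand_contractions` and the vectors `patterns` and `replacements` are hypothetical names, and the table covers just a subset of contractions. The character class `[\u2019']` matches both the typographic and the plain apostrophe, so each rule is written once.

```r
# Patterns and their expansions. Order matters: the specific cases
# ("can't", "won't") must run before the generic "n't" rule.
patterns <- c("can[\u2019']t", "won[\u2019']t", "n[\u2019']t",
              "[\u2019']re", "[\u2019']ve", "[\u2019']ll", "[\u2019']m")
replacements <- c("cannot", "will not", " not",
                  " are", " have", " will", " am")

# Apply every rule in turn to a character vector (hypothetical helper).
expand_contractions <- function(text) {
  for (i in seq_along(patterns)) {
    text <- gsub(patterns[i], replacements[i], text)
  }
  text
}

expanded <- expand_contractions(c("I can't go", "we\u2019re late",
                                  "they won\u2019t come", "I'm here"))
expanded
```

Keeping the rules in one table makes the ordering explicit and avoids duplicating each line for the two apostrophe styles.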

Second step

We tokenize the text entries with MC_tokenizer, which splits the text into tokens and keeps only the words, dropping punctuation and numbers.

tokens <- MC_tokenizer(sample)

Third step

We remove profane words, which we do not want to predict. Three word lists are available:

List 1: Carnegie Mellon University, School of Computer Science
List 2: GitHub repository with words banned by Google
List 3: Google list of banned words

We use the last list (List 3).

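conBadWords <- file("Profanity words/list3.txt", "r")
badwords <- readLines(conBadWords, n = -1)
close(conBadWords)  # release the file connection

# Drop tokens that appear in the banned list (the match is case-sensitive here)
noBadWords <- tokens[!tokens %in% badwords]

# Lowercase the remaining tokens; tolower() is already vectorized
noBadWords <- tolower(noBadWords)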

beforeWordCount <- length(tokens)
afterWordCount <- length(noBadWords)

Before profanity removal, we have 13,029 tokens. After profanity removal, we have 12,989 tokens.
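One possible refinement, sketched below on placeholder data, is to lowercase the tokens before matching so that capitalized variants are also removed. This is an assumption about a useful variant, not the procedure that produced the counts above; `tokens_demo` and `badwords_demo` are hypothetical vectors.

```r
# Hypothetical stand-ins for the tokens and the banned-word list.
tokens_demo   <- c("This", "is", "a", "BadWord", "example", "badword")
badwords_demo <- c("badword", "otherword")

# Lowercase first so that "BadWord" and "badword" are both caught.
tokens_lower <- tolower(tokens_demo)
kept <- tokens_lower[!tokens_lower %in% tolower(badwords_demo)]

removed_count <- length(tokens_demo) - length(kept)  # tokens removed
```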

NOTE: We do not use the tm package for the last steps because, for some reason, some words were not counted (for example, "I").

Fourth step

We are going to explore the frequency of 1-gram, 2-gram and 3-gram structures.
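To make the notion of an n-gram concrete, here is a small base-R sketch that builds n-grams from a token vector and counts them. The helper `make_ngrams` and the vector `tokens_demo` are hypothetical and serve only as an illustration.

```r
# Build all n-grams (as space-separated strings) from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens_demo <- c("i", "went", "to", "the", "gym", "and", "to", "the", "store")

bigrams  <- make_ngrams(tokens_demo, 2)
trigrams <- make_ngrams(tokens_demo, 3)

# Frequency table sorted from most to least frequent.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 3)
```

A vector of k tokens yields k - n + 1 n-grams, so the counts shrink slightly as n grows.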

1-gram Word cloud

cleanDF <- as.data.frame(sort(table(noBadWords),decreasing = TRUE))
names(cleanDF) <- c("word","freq")

wordcloud(words = cleanDF$word, freq = cleanDF$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

1-gram Frequency bar plot

up100 <- cleanDF[cleanDF$freq>100,]

barplot(up100$freq, las = 2, names.arg = up100$word,
        col ="darkolivegreen4", main ="Most frequent words (freq>100)",
        ylab = "Word frequencies")

2-gram Word cloud

list2Gram <- unlist(lapply(ngrams(words(noBadWords), 2), paste, collapse = " "), use.names = FALSE)

cleanDF2Gram <- as.data.frame(sort(table(list2Gram),decreasing = TRUE))
names(cleanDF2Gram) <- c("word","freq")

wordcloud(words = cleanDF2Gram$word, freq = cleanDF2Gram$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

2-gram Frequency bar plot

up1002Gram <- cleanDF2Gram[cleanDF2Gram$freq>10,]

barplot(up1002Gram$freq, las = 2, names.arg = up1002Gram$word,
        col ="darkolivegreen4", main ="Most frequent 2-grams (freq>10)",
        ylab = "2-gram frequencies")

3-gram Word cloud

list3Gram <- unlist(lapply(ngrams(words(noBadWords), 3), paste, collapse = " "), use.names = FALSE)

cleanDF3Gram <- as.data.frame(sort(table(list3Gram),decreasing = TRUE))
names(cleanDF3Gram) <- c("word","freq")

wordcloud(words = cleanDF3Gram$word, freq = cleanDF3Gram$freq, min.freq = 1,
          max.words=20, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

3-gram Frequency bar plot

up1003Gram <- cleanDF3Gram[cleanDF3Gram$freq>2,]

barplot(up1003Gram$freq, las = 2, names.arg = up1003Gram$word,
        col ="darkolivegreen4", main ="Most frequent 3-grams (freq>2)",
        ylab = "3-gram frequencies")

Fifth step

cumsums <-cumsum(cleanDF$freq)
percent50 <- length(noBadWords)*.5

percent90 <- length(noBadWords)*.9

With the words sorted by decreasing frequency, the 95 most frequent words cover 50% of all word instances in the sample.

Likewise, the 2007 most frequent words cover 90% of all word instances in the sample.
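The coverage calculation can be sketched as a small function on toy data. The function `words_for_coverage` and the vector `tokens_demo` are hypothetical names; the function returns the position of the first word at which the running total of frequencies reaches the target share.

```r
# Number of distinct words needed to cover a given share of all word instances.
words_for_coverage <- function(tokens, coverage) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cumulative <- cumsum(as.vector(freq))
  # Index of the first word at which the running total reaches the target.
  which(cumulative >= coverage * length(tokens))[1]
}

# 12 tokens with frequencies 6, 3, 1, 1, 1.
tokens_demo <- c(rep("the", 6), rep("to", 3), "gym", "store", "home")

half_cover   <- words_for_coverage(tokens_demo, 0.5)
ninety_cover <- words_for_coverage(tokens_demo, 0.9)
```

In this toy sample a single word already covers half of the instances, while four words are needed to pass 90%, which mirrors the long tail seen in the real data.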

In the following plot, we can see that the number of words needed grows roughly exponentially with the coverage percentage. This means that a few words have high frequencies and many words have low frequencies.

aa <- seq(5, 95, by = 5)  # coverage percentages to evaluate
# For each percentage, count the most frequent words needed to reach it
bb <- sapply(aa, function(x) {
  percent <- length(noBadWords) * x / 100
  sum(cumsums < percent) + 1
})
plot(x = aa, y = bb, type = "l", ylab = "Words needed", xlab = "Coverage percentage", col = "aquamarine4")

Blog analysis

The following section shows the same analysis for the blog text source. The code is omitted because it is the same as for the Twitter analysis; only the input file changes.

The source has 899,288 lines, and the longest line has 40,833 characters.

We are going to use a sample size of 1000.

First step

We are going to replace contractions that we find in English.

Second step

We tokenize the text entries with MC_tokenizer, which splits the text into tokens and keeps only the words, dropping punctuation and numbers.

Third step

We remove profane words, which we do not want to predict. Three word lists are available:

List 1: Carnegie Mellon University, School of Computer Science
List 2: GitHub repository with words banned by Google
List 3: Google list of banned words

We use the last list (List 3).

Before profanity removal, we have 42,608 tokens. After profanity removal, we have 42,591 tokens.

NOTE: We do not use the tm package for the last steps because, for some reason, some words were not counted (for example, "I").

Fourth step

We are going to explore the frequency of 1-gram, 2-gram and 3-gram structures.

1-gram Word cloud

1-gram Frequency bar plot

2-gram Word cloud

2-gram Frequency bar plot

3-gram Word cloud

3-gram Frequency bar plot

Fifth step

With the words sorted by decreasing frequency, the 96 most frequent words cover 50% of all word instances in the sample.

Likewise, the 3501 most frequent words cover 90% of all word instances in the sample.

News analysis

The following section shows the same analysis for the news text source. The code is omitted because it is the same as for the Twitter analysis; only the input file changes.

The source has 1,010,242 lines, and the longest line has 11,384 characters.

We are going to use a sample size of 1000.

First step

We are going to replace contractions that we find in English.

Second step

We tokenize the text entries with MC_tokenizer, which splits the text into tokens and keeps only the words, dropping punctuation and numbers.

Third step

We remove profane words, which we do not want to predict. Three word lists are available:

List 1: Carnegie Mellon University, School of Computer Science
List 2: GitHub repository with words banned by Google
List 3: Google list of banned words

We use the last list (List 3).

Before profanity removal, we have 34,118 tokens. After profanity removal, we have 34,111 tokens.

NOTE: We do not use the tm package for the last steps because, for some reason, some words were not counted (for example, "I").

Fourth step

We are going to explore the frequency of 1-gram, 2-gram and 3-gram structures.

1-gram Word cloud

1-gram Frequency bar plot

2-gram Word cloud

2-gram Frequency bar plot

3-gram Word cloud

3-gram Frequency bar plot

Fifth step

With the words sorted by decreasing frequency, the 159 most frequent words cover 50% of all word instances in the sample.

Likewise, the 4294 most frequent words cover 90% of all word instances in the sample.

Future work

To build a model that predicts the next word given the previous word or words, we will do the following:

Create a data frame from the n-grams. The last word in each n-gram will be the prediction and the preceding words will be the predictors.
Build a model using one, two or three predictors (the words before the next word).
Test the models with different algorithms, such as linear regression or random forest.
Evaluate these models in terms of:
- Memory needed
- Time needed
- Accuracy
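
As a rough sketch of the first step, the example below builds a 2-gram data frame from toy tokens and predicts the next word by simple frequency lookup. The lookup is a stand-in chosen for brevity and is not one of the algorithms named above; `tokens_demo`, `bigram_df` and `predict_next` are hypothetical names.

```r
tokens_demo <- c("i", "went", "to", "the", "gym", "then", "i", "went",
                 "to", "the", "store", "and", "to", "the", "gym")

# One row per 2-gram: the first word is the predictor, the second the prediction.
n <- length(tokens_demo)
bigram_df <- data.frame(predictor  = tokens_demo[-n],
                        prediction = tokens_demo[-1],
                        stringsAsFactors = FALSE)

# Return up to k of the most frequent words seen after `word`.
predict_next <- function(word, k = 3) {
  candidates <- bigram_df$prediction[bigram_df$predictor == word]
  if (length(candidates) == 0) return(character(0))
  head(names(sort(table(candidates), decreasing = TRUE)), k)
}

after_the <- predict_next("the")
```

The size of such a table can be inspected with `object.size(bigram_df)` and the lookup time with `system.time()`, which relate to the memory and time criteria listed above.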

Gil Huesca

10/4/2020
