This report is part of the Capstone project for the JHU Data Science Specialization (Coursera). The materials for the capstone project were provided by the Johns Hopkins University team in collaboration with SwiftKey. The final goal of the capstone project is to develop a word prediction algorithm and deploy it as a web-based typing app that suggests the next most probable words as text is typed. The purpose of this report is to process the provided text files, perform exploratory analysis of the resulting data, and briefly summarize the observations.
The data is from a corpus called HC Corpora. The data can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The files include collections of Twitter, blog, and news posts in four different languages, including English. I will use the English-language files for building the application.
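For reproducibility, the archive can also be fetched and unpacked directly from R. A minimal sketch, assuming the default working directory (the extracted files may need to be moved next to the analysis scripts):

```r
# one-time download of the corpus archive (it is large, so this may take a while)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file)   # extract the archive into the working directory
}
```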
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
object.size(twitter)/1024^2                      # object size in memory, Mb
length(twitter)                                  # number of lines
sum(sapply(gregexpr("\\W+", twitter), length))   # approximate word count
blogs <- readLines("en_US.blogs.txt", skipNul = TRUE)
object.size(blogs)/1024^2
length(blogs)
sum(sapply(gregexpr("\\W+", blogs), length))
news <- readLines(file("en_US.news.txt", "rb"), skipNul = TRUE)  # binary connection: embedded control characters would otherwise cut the read short
object.size(news)/1024^2
length(news)
sum(sapply(gregexpr("\\W+", news), length))
| Data source | Size in memory | Number of lines | Number of words |
|---|---|---|---|
| twitter | 257.2 Mb | 2360148 | 30513904 |
| blogs | 231.3 Mb | 899288 | 38487556 |
| news | 230.4 Mb | 1010242 | 35866668 |
There are a number of R packages available for text processing. However, since the files are rather large, I chose to do the text preprocessing in Python for performance reasons. I also decided to write my own scripts because I wanted total control over the text-processing pipeline, so that changes can be made quickly and precisely while fine-tuning the final model. The Python script used in this report can be viewed in the Appendix section.
For the purposes of this report I processed the “en_US.twitter.txt” file using the script described above. Tokenizing the entire file and building the n-gram tables takes roughly 5 minutes on a regular desktop computer.
The new files containing n-gram tables were loaded into R and analyzed further.
unigrams <- read.csv("unigrams.csv", stringsAsFactors=FALSE, header=FALSE)  # V1 = unigram, V2 = count
library(ggplot2)
ggplot(data = unigrams, aes(x=V2)) +
geom_histogram(binwidth=5) +
coord_cartesian(xlim = c(0, 100)) +
ggtitle("Relative counts for words with frequencies below 100") +
xlab("Unigrams frequencies")
Out of 285583 identified unigrams, 171584 words occur only once, while 9661 words have frequencies above 100.
subset1 <- subset(unigrams, V2==1)
subset100 <- subset(unigrams, V2 > 100)
dim(unigrams); dim(subset1); dim(subset100)
## [1] 285583 2
## [1] 171584 2
## [1] 9661 2
Visualizing the most frequent words as a word cloud:
library(wordcloud)
wordcloud(subset100$V1, subset100$V2, scale=c(7,0.6), colors=brewer.pal(8, "Dark2"), max.words=50)
bigrams <- read.csv("bigrams.csv", stringsAsFactors=FALSE, header=FALSE)  # V1 = bigram, V2 = count
top20 <- bigrams[order(bigrams$V2, decreasing = TRUE),]   # sort by count, descending
top20 <- top20[1:20,]                                     # keep the 20 most frequent bigrams
ggplot(top20, aes(x=reorder(V1, -V2), y = V2)) +
geom_bar(stat="identity", fill="skyblue") + xlab("") + ylab("Counts") +
ggtitle("The most common bigrams") +
geom_text(aes(label=V2), vjust= -0.3, size=3) +
theme(axis.text.x = element_text(size=16, angle=45 , hjust=1),
axis.text.y = element_text(colour="grey20",size=14),
axis.title.y = element_text(size=14),
legend.text = element_text(size=14))
trigrams <- read.csv("trigrams.csv", stringsAsFactors=FALSE, header=FALSE)  # V1 = trigram, V2 = count
top20t <- trigrams[order(trigrams$V2, decreasing = TRUE),]   # sort by count, descending
top20t <- top20t[1:20,]                                      # keep the 20 most frequent trigrams
ggplot(top20t, aes(x=reorder(V1, -V2), y = V2)) +
geom_bar(stat="identity", fill="lightgreen") + xlab("") + ylab("Counts") +
ggtitle("The most common trigrams") +
geom_text(aes(label=V2), vjust= -0.3, size=3) +
theme(axis.text.x = element_text(size=16, angle=45 , hjust=1),
axis.text.y = element_text(colour="grey20",size=14),
axis.title.y = element_text(size=14),
legend.text = element_text(size=14))
As we can see, the text corpora are quite large. There are many high-frequency n-grams and even more low-frequency ones. The predictive algorithm should react very quickly to the user’s input, and since the app will be deployed on a Shiny server, it should not take up much memory. Hence, I will most probably need to reduce the number of n-grams used in the final language model. The majority of low-frequency n-grams will probably not contribute much to the predictive capability of the algorithm and can hopefully be discarded. On the other hand, I might try to include higher-order n-grams such as 4-grams and 5-grams. The cut-off frequencies will be selected to minimize the memory footprint of the model without significantly reducing the accuracy of the predictions. The prediction algorithm will be based on a Katz back-off or Kneser-Ney smoothing model. I also plan to implement profanity filtering. The model will be tuned for fast running times, so that the app can respond to user input in real time. The final model will be deployed on a Shiny server.
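To make the pruning and back-off idea concrete, here is a minimal R sketch built on the n-gram tables loaded above. The cut-off value of 3 and the simplified back-off scheme (returning the most frequent continuation at each level, without Katz or Kneser-Ney discounting) are placeholder assumptions for illustration, not the final model:

```r
# keep only n-grams seen more than `cutoff` times (the cut-off is a placeholder)
prune <- function(ngrams, cutoff) subset(ngrams, V2 > cutoff)
uni <- prune(unigrams, 3)
bi  <- prune(bigrams,  3)
tri <- prune(trigrams, 3)

predict_next <- function(phrase) {
  words <- tolower(strsplit(phrase, "\\s+")[[1]])
  n <- length(words)
  # 1) try trigrams conditioned on the last two words of the input
  if (n >= 2) {
    prefix <- paste0(words[n - 1], " ", words[n], " ")
    hits <- tri[startsWith(tri$V1, prefix), ]
    if (nrow(hits) > 0)
      return(substring(hits$V1[which.max(hits$V2)], nchar(prefix) + 1))
  }
  # 2) back off to bigrams conditioned on the last word
  prefix <- paste0(words[n], " ")
  hits <- bi[startsWith(bi$V1, prefix), ]
  if (nrow(hits) > 0)
    return(substring(hits$V1[which.max(hits$V2)], nchar(prefix) + 1))
  # 3) final fallback: the single most frequent word overall
  uni$V1[which.max(uni$V2)]
}

predict_next("thanks for the")
```

A scan with startsWith() over the full tables would be too slow for real-time use, so the deployed model will likely rely on pre-indexed prefix tables; choosing those structures and the actual cut-off frequencies is part of the planned fine-tuning.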
####################---- Appendix: Python script ----####################
# dictionary of mapping of contractions to their expansions:
contractions = {"aren't": "are not", "can't": "cannot", "can't've": "cannot have",
"'cause": "because", "could've": "could have", "couldn't": "could not",
"couldn't've": "could not have", "didn't": "did not", "doesn't": "does not",
"don't": "do not", "hadn't": "had not", "hadn't've": "had not have",
"hasn't": "has not", "haven't": "have not", "he'd've": "he would have",
"he'll": "he will", "how'd": "how did", "how'd'y": "how do you",
"how'll": "how will", "how's": "how is", "i'd've": "i would have",
"i'll": "i will", "i'm": "i am", "i've": "i have", "isn't": "is not",
"it's": "it is", "it'd've": "it would have", "it'll": "it will",
"let's": "let us", "ma'am": "madam", "mayn't": "may not",
"might've": "might have", "mightn't": "might not", "mightn't've": "might not have",
"must've": "must have", "mustn't": "must not", "mustn't've": "must not have",
"needn't": "need not", "needn't've": "need not have", "oughtn't": "ought not",
"oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not",
"shan't've": "shall not have", "she'd've": "she would have", "she'll": "she will",
"should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have",
"so've": "so have", "that'd've": "that would have", "there'd've": "there would have",
"they'd've": "they would have", "they'll": "they will", "they're": "they are",
"they've": "they have", "to've": "to have", "wasn't": "was not",
"we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
"we're": "we are", "we've": "we have", "weren't": "were not", "what're": "what are",
"what's": "what is", "what've": "what have", "when've": "when have", "where'd": "where did",
"where's": "where is", "where've": "where have", "who've": "who have", "why've": "why have",
"will've": "will have", "won't": "will not", "won't've": "will not have",
"would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
"y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have",
"y'all're": "you all are", "y'all've": "you all have", "you'd've": "you would have",
"you're": "you are", "you've": "you have", "you'll": "you will"}
import re
from collections import defaultdict
import csv
file_name = "en_US.twitter.txt" # or use a copmlete path if the file is not in current directory.
def process_text(txt_file, new_file):
    """Takes a text file, preprocesses the text and saves the processed data
    into a new txt file line by line, where each line contains a meaningfully
    connected text block.
    Args:
        txt_file (.txt): file name or a full path name for the file.
        new_file (.txt): name of the file where the processed text will be saved.
    Returns:
        None
    """
    with open(new_file, 'w') as f, \
         open(txt_file, 'r', encoding='utf-8', errors='ignore') as in_file:
        for line in in_file:
            line = line.lower()                              # convert to lower case
            phrases_list = re.split(r'[!?.,;:&()]+', line)   # split line at punctuation
            for phrase in phrases_list:
                word_list = []
                for word in phrase.split():                  # split a phrase into words
                    if word in contractions:                 # expand contractions
                        word_list.append(contractions[word])
                    else:
                        word_list.append(word)
                new_phrase = ' '.join(word_list)             # convert list back into a string
                # keep letters and whitespace only, then save each phrase line by line:
                clean_phrase = re.sub(r"[^a-zA-Z\s]+", '', new_phrase).strip()
                if clean_phrase != '':
                    f.write("%s\n" % clean_phrase)
process_text(file_name, "processed.txt")
def make_ngrams(proc_file, n):
    """Finds n-grams and calculates their frequencies.
    Args:
        proc_file (.txt): file containing text preprocessed by process_text().
        n (int): the number of words in an n-gram.
    Returns:
        A frequency dictionary mapping each n-gram to its count.
    """
    dct = defaultdict(int)
    with open(proc_file, 'r') as in_file1:
        for line in in_file1:
            line_list = line.split()
            # slide a window of n words over the line (the +1 ensures the
            # last n-gram of each line is not skipped):
            for i in range(len(line_list) - n + 1):
                ngram = ' '.join(line_list[i:i + n])
                dct[ngram] += 1
    return dct
freq_dict1 = make_ngrams("processed.txt", 1) # unigrams
freq_dict2 = make_ngrams("processed.txt", 2) # bigrams
freq_dict3 = make_ngrams("processed.txt", 3) # trigrams
# write the n-gram tables into .csv files (no header; columns: n-gram, count):
with open('unigrams.csv', 'w', newline='') as f1:
    w1 = csv.writer(f1)
    w1.writerows(freq_dict1.items())
with open('bigrams.csv', 'w', newline='') as f2:
    w2 = csv.writer(f2)
    w2.writerows(freq_dict2.items())
with open('trigrams.csv', 'w', newline='') as f3:
    w3 = csv.writer(f3)
    w3.writerows(freq_dict3.items())