Xing Su
April 26, 2015
- Emoticons, phone numbers, URLs, email addresses, dates, times, and numbers were replaced with `<emoticon>`, `<phone>`, `<url>`, `<email>`, `<date>`, `<time>`, and `<num>` tags, and profanity with a `<profanity>` tag
- Contractions were expanded, e.g. don't, we'll, there's to do not, we will, there is respectively (a rough sketch of this cleaning pass follows this list)
- Initially used the `tau` and `tm` packages to tokenize the data and build n-grams; however, ultimately decided to build from scratch to improve performance
- Used `data.table` (fast retrieval) to store the 4-gram Markov model
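For illustration, here is a minimal sketch of the kind of cleaning pass described above, in base R; the function name `clean_text()` and the specific regular expressions are assumptions for the example, not the project's actual code.

```r
# Sketch: replace special entities with placeholder tags and expand contractions
# (patterns are illustrative only).
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("(http|https)://[^ ]+",                         " <url> ",      x)
  x <- gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+",             " <email> ",    x)
  x <- gsub("\\(?[0-9]{3}\\)?[- .]?[0-9]{3}[- .]?[0-9]{4}", " <phone> ",    x)
  x <- gsub("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}",             " <date> ",     x)
  x <- gsub("[0-9]{1,2}:[0-9]{2}",                          " <time> ",     x)
  x <- gsub("[:;=]-?[)(dp]",                                " <emoticon> ", x)
  x <- gsub("[0-9]+",                                       " <num> ",      x)  # after phone/date/time
  # expand common contractions
  x <- gsub("don't",   "do not",   x, fixed = TRUE)
  x <- gsub("we'll",   "we will",  x, fixed = TRUE)
  x <- gsub("there's", "there is", x, fixed = TRUE)
  x
}

# e.g. clean_text("don't email me at foo@bar.com :)")
# returns roughly "do not email me at <email> <emoticon>" (modulo extra spaces)
```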
- n-gram frequencies were counted with the `table` function
- `<s>` and `</s>` tags were added to each sentence to mark beginning/end
- The counts are stored in `data.table`s with each column corresponding to a word and an `N` column as the frequency of the word/group of words
- Out-of-vocabulary words are replaced with `<unk>` and the model is trained as such
- While a word is being typed, the unigram table is searched for words that start with the letters typed so far, and the result with the highest frequency is returned
- To predict the next word, the last 3 words typed are looked up in the fourgram table, the last 2 words in the trigram table, and the last word in the bigram table, respectively in that order, for the most likely prediction; if nothing is found, the top word in the unigram table is returned (see the sketch after this list)
- Linear interpolation of the four n-gram models was also attempted, with the weights optimized numerically (`nloptr` package), but it resulted in a ~30% increase in computation time and an 8% drop in accuracy

\[ P_{interpolated} = 0.1155\frac{Count(w_{i-3},w_{i-2},w_{i-1},w_i)}{Count(w_{i-3},w_{i-2},w_{i-1})} + 0.2364\frac{Count(w_{i-2},w_{i-1},w_i)}{Count(w_{i-2},w_{i-1})} + 0.3757\frac{Count(w_{i-1},w_i)}{Count(w_{i-1})} + 0.2724\frac{Count(w_i)}{Total~Words} \]
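To make the storage and back-off lookup concrete, here is a minimal sketch assuming the `data.table` layout described above (one column per word plus an `N` frequency column); the function names `build_ngrams()` and `predict_word()` are hypothetical, not the project's actual code.

```r
library(data.table)

# Count n-grams of order n from a list of tokenized sentences; sentences get
# <s>/</s> boundary tags and the counts come from table().
build_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(toks) {
    if (n > 1) toks <- c("<s>", toks, "</s>")          # boundary tags
    if (length(toks) < n) return(character(0))
    sapply(seq_len(length(toks) - n + 1),
           function(i) paste(toks[i:(i + n - 1)], collapse = " "))
  }))
  dt <- as.data.table(table(grams))
  setnames(dt, c("gram", "N"))
  cols <- paste0("w", seq_len(n))
  dt[, (cols) := tstrsplit(gram, " ", fixed = TRUE)]   # one column per word
  dt[, gram := NULL]
  setorder(dt, -N)                                     # most frequent first
  dt[]
}

# Back off from the fourgram table down to the unigram table.
# (A real app would also skip "</s>" as a suggested word.)
predict_word <- function(last3, four, tri, bi, uni) {
  hit <- four[w1 == last3[1] & w2 == last3[2] & w3 == last3[3]]
  if (nrow(hit) > 0) return(hit[1, w4])
  hit <- tri[w1 == last3[2] & w2 == last3[3]]
  if (nrow(hit) > 0) return(hit[1, w3])
  hit <- bi[w1 == last3[3]]
  if (nrow(hit) > 0) return(hit[1, w2])
  uni[1, w1]                                           # top unigram as fallback
}

# e.g. (with sentences a list of character vectors of tokens)
# uni <- build_ngrams(sentences, 1); bi  <- build_ngrams(sentences, 2)
# tri <- build_ngrams(sentences, 3); four <- build_ngrams(sentences, 4)
# predict_word(c("thanks", "for", "the"), four, tri, bi, uni)
```

Because each table is left sorted by `N`, row 1 of any filtered subset is the most frequent continuation, which keeps each lookup a single `data.table` subset.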
- Press Tab or \( \rightarrow \) to autocomplete the word (the prefix lookup behind this is sketched below)
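The Tab/\( \rightarrow \) autocomplete is essentially the unigram prefix lookup mentioned above; a minimal sketch, assuming the `uni` table produced by the `build_ngrams()` sketch earlier (the function name `autocomplete()` is made up for the example):

```r
library(data.table)

# Return the most frequent unigram that starts with the letters typed so far.
autocomplete <- function(prefix, uni) {
  hits <- uni[substr(w1, 1, nchar(prefix)) == prefix]
  if (nrow(hits) == 0) return(prefix)   # no match: leave the word as typed
  hits[which.max(N), w1]
}

# e.g. autocomplete("beca", uni) would typically complete to "because"
```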