Summary

The aim of this project is to apply and evaluate an algorithm that predicts the most probable next word(s) in a text. We use standard corpora of typical texts (blogs, news and tweets) for training and validation.

Obtaining the training set

The training data was downloaded from the Coursera site. The data comes from a corpus called HC Corpora. It contains four separate folders, one for each of the locales en_US, de_DE, ru_RU and fi_FI. Each folder contains separate files for texts extracted from blogs, news and tweets, respectively. In our further work we use only the files for the en_US locale.

Basic data summary

File Name | Number of Lines | Number of Words | Number of Characters
en_US.blogs.txt | 899288 | 37334114 | 210160014
en_US.news.txt | 1010242 | 34365936 | 205811886
en_US.twitter.txt | 2360148 | 30359852 | 167105338
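
Counts of this kind can be reproduced with a few lines of code. The following is a minimal Python sketch, used here for illustration only and not the tooling behind the table above. It assumes that words are whitespace-delimited tokens, that characters are counted after decoding and without line terminators, and the function name `text_stats` is hypothetical. Other conventions (for example byte counts) give somewhat different numbers.

```python
import io

def text_stats(lines):
    """Return (n_lines, n_words, n_chars) for an iterable of text lines."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        line = line.rstrip("\n")          # do not count the line terminator
        n_lines += 1
        n_words += len(line.split())      # assumption: whitespace tokenization
        n_chars += len(line)
    return n_lines, n_words, n_chars

# Usage with an in-memory sample; for a real corpus file, pass
# open("en_US.blogs.txt", encoding="utf-8") instead.
sample = io.StringIO("Hello world\nOne more line here\n")
print(text_stats(sample))
```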

Data pre-processing

The following steps were done to preprocess and clean the data set using regular expressions:
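
As a rough illustration of regex-based cleanup of this kind, here is a minimal Python sketch. The steps it applies (removing control characters, lower-casing, deleting apostrophes, replacing other non-letters with spaces) are assumptions inferred from the rest of this report, such as trigrams like "dont know whi" and "u s attorney"; they are not a confirmed record of the steps actually used. The name `clean_line` is hypothetical.

```python
import re

# Assumed cleanup rules; the real pipeline may differ.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")  # keeps \t and \n
APOSTROPHES = re.compile("['\u2019]")                     # straight and curly
NON_LETTERS = re.compile(r"[^a-z\s]")                     # also drops non-ASCII letters
WHITESPACE = re.compile(r"\s+")

def clean_line(text):
    """Normalize one line of raw text into lower-case, space-separated words."""
    text = CONTROL_CHARS.sub(" ", text)   # control characters can break later transforms
    text = text.lower()
    text = APOSTROPHES.sub("", text)      # "don't" -> "dont"
    text = NON_LETTERS.sub(" ", text)     # "u.s." -> "u s "
    return WHITESPACE.sub(" ", text).strip()

print(clean_line("Don't wait: 3 p.m. in the U.S.!"))
```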

We then used the SRI Language Modeling Toolkit (SRILM) to generate 1-gram counts for each corpus, and computed frequency histograms in R.

We have considered two further processing workflows:

  1. Cut off the vocabulary for each corpus by the most frequent words.
  2. Eliminate standard English stopwords, and then apply stemming.

Vocabulary cut off

We used the SRI Language Modeling Toolkit to generate 1-gram counts for each corpus, sorted them and calculated their cumulative frequency, as illustrated by a log-log plot.

Based on these results, we constructed a separate limited vocabulary for each corpus. The minimum unigram count required for inclusion in the corresponding vocabulary is set so that the vocabulary covers at least 97.5% of the total counts. The last (least frequent) word in each vocabulary is:

Corpus | Least frequent unigram | Count | Unigram frequency | Cumulative unigram frequency
blogs | abacus | 24 | 6.5e-07 | 0.9754281
news | abernathy | 22 | 6.6e-07 | 0.9755589
twitter | aall | 16 | 5.4e-07 | 0.9753012
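
The cut-off rule described above can be sketched as follows. This is a minimal Python illustration, not the SRILM/R workflow used in this report. The function names `coverage_vocabulary` and `apply_vocabulary` and the toy counts are hypothetical, and keeping every word tied at the cut-off count is an assumption.

```python
from collections import Counter

def coverage_vocabulary(counts, coverage=0.975):
    """Return (vocab, min_count): the most frequent words whose counts
    cover at least `coverage` of all tokens, keeping ties at the cut-off."""
    total = sum(counts.values())
    covered = 0
    min_count = 0
    for _, count in counts.most_common():   # descending by count
        if covered >= coverage * total:
            break
        covered += count
        min_count = count                   # count of the last word admitted
    vocab = {w for w, c in counts.items() if c >= min_count}
    return vocab, min_count

def apply_vocabulary(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with the <unk> placeholder."""
    return [t if t in vocab else unk for t in tokens]

# Toy example with a 95% coverage target.
toy = Counter({"the": 90, "cat": 6, "sat": 3, "zyx": 1})
vocab, min_count = coverage_vocabulary(toy, coverage=0.95)
print(sorted(vocab), min_count)
print(apply_vocabulary(["the", "zyx", "cat"], vocab))
```

Words outside the limited vocabulary are what appear as `<unk>` in the cut-off trigram lists.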

Stemming

We used tm, a text mining framework for R, to remove standard English stopwords and then applied stemming with Porter’s stemming algorithm.

Most frequent trigrams

Here are the most frequent trigrams per corpus, for each of the two alternative cleanup workflows described above:

Blogs, cut off | News, cut off | Twitter, cut off | Blogs, stemmed | News, stemmed | Twitter, stemmed
one of the | one of the | thanks for the | new york citi | m p m | happi mother day
a lot of | a lot of | looking forward to | im pretti sure | presid barack obama | cant wait see
of the <unk> | the u s | thank you for | new york time | new york citi | let us know
<unk> <unk> <unk> | of the <unk> | cant wait to | dont get wrong | p m saturday | happi new year
<unk> and <unk> | as well as | i love you | dont know whi | two year ago | look forward see
as well as | <unk> and <unk> | for the follow | let just say | p m sunday | im pretti sure
to be a | part of the | going to be | cant wait see | gov chris christi | cinco de mayo
it was a | the end of | i want to | amazon servic llc | washington d c | feel like im
some of the | according to the | a lot of | coupl week ago | u s attorney | dont even know
out of the | out of the | to be a | im look forward | st loui counti | cant wait get
the end of | some of the | i need to | item c abov | p m friday | cant wait till
be able to | to be a | i have a | two year ago | u s district | thank veri much
a couple of | in the first | im going to | incorpor item c | world war ii | happi valentin day
i want to | at p m | one of the | new york n | p m april | dream come true
in the <unk> | going to be | have a great | york n y | p m monday | cant wait hear
the <unk> of | <unk> <unk> <unk> | to see you | feel like im | new york time | follow follow back
the fact that | the united states | w <unk> com | dont even know | u s depart | come see us
this is a | it was a | i have to | look forward see | will take place | thank follow us
i have to | in the <unk> | i dont know | world war ii | noon p m | just got back
the rest of | the first time | is going to | c abov pp | three year ago | love love love
there is a | be able to | you have a | let us know | around p m | dont know whi
i have a | <unk> in the | to go to | mani year ago | p m may | keep good work
it is a | <unk> <unk> and | let me know | long time ago | four year ago | st patrick day
part of the | said in a | for the rt | veri long time | u s rep | ive ever seen
i have been | end of the | i wish i | coupl year ago | first time sinc | im look forward
going to be | of the year | i cant wait | make feel like | cent per share | just got home
i dont know | for the first | i feel like | spend lot time | p m p | doe anyon know
<unk> <unk> and | most of the | <unk> and <unk> | want make sure | p m thursday | pleas pleas pleas
<unk> in the | the rest of | i think i | dont feel like | p m tuesday | thank everyon came
one of my | to p m | you want to | right now im | two week ago | pleas follow back
the <unk> <unk> | <unk> of the | thanks for following | pleas let know | superior court judg | hope feel better
i had to | <unk> said the | cant wait for | vest interests vest | g protein g | dont feel like
this is the | the <unk> <unk> | of the day | interests vest interests | g carbohydr g | think im go
the <unk> and | in the past | you so much | spend much time | u s senat | make feel like
to the <unk> | more than a | would love to | im sure will | georg w bush | im onli one
i wanted to | percent of the | one of my | everi singl day | high school student | thank follow back
<unk> of the | at the end | happy mothers day | new year eve | five year ago | hope everyon great
in order to | the <unk> of | wait to see | washington d c | u s offici | follow look forward
a bit of | at a m | a great day | two week ago | s district judg | new year eve
end of the | the university of | at the <unk> | level mp cost | g fat g | just want say
if you are | this is a | you for the | amazon co uk | told associ press | ill let know
you want to | at the time | you have to | onc upon time | s district court | hope great day
when i was | the number of | how are you | pleas feel free | fat g satur | come join us
all of the | is one of | in the world | lord jesus christ | want make sure | just let know
it would be | in the second | <unk> <unk> <unk> | happi new year | chief execut offic | thank follow look
<unk> and the | in the s | it was a | love love love | p m wednesday | cant wait go
there is no | a couple of | thanks so much | illinois incorpor item | senior vice presid | right now im
i am not | to the <unk> | i dont think | amazon com amazon | protein g carbohydr | just got done
at the end | is expected to | i will be | remov ani time | st charl counti | happi th birthday
most of the | in new york | what do you | link amazon com | dow jone industri | look forward hear
and the <unk> | in front of | so i can | subject chang remov | counti prosecutor offic | look forward meet
i had a | as part of | do you have | provid subject chang | execut vice presid | ha ha ha
to have a | you have to | had a great | just littl bit | u s sen | cant wait til
the first time | of the season | the end of | content provid subject | u s govern | mother day mom
in the world | in a statement | if you want | come amazon servic | osama bin laden | make feel better
is one of | is going to | i need a | com amazon ca | counti sheriff offic | hope see soon
you have to | director of the | the rest of | co uk amazon | world trade center | go back sleep
im going to | said he was | be able to | chang remov ani | attorney general offic | cake cake cake
back to the | were going to | out of the | certain content appear | st loui area | show last night
i need to | he said he | i miss you | wordpress com particip | point per game | yes yes yes
that i have | according to a | in the <unk> | websit come amazon | said last week | ralph waldo emerson
i decided to | and the <unk> | so much for | uk amazon d | s suprem court | thank much follow
rest of the | it would be | of the <unk> | site earn advertis | jone industri averag | im sure will
there was a | and <unk> <unk> | dont want to | servic llc andor | u s suprem | will follow back
in front of | in the u | i had a | servic llc amazon | per serv calori | happi birthday hope
was going to | because of the | what are you | provid mean site | rock n roll | dont know im
to make a | <unk> and the | do you think | programm design provid | begin p m | good morn everyone
on the <unk> | of the most | wish i could | particip amazon servic | chief financi offic | look like im
and i have | to be the | check it out | mean site earn | p m today | let know can
to do with | he said the | for the <unk> | llc andor amazon | saturday p m | keep great work
to be the | there is a | to see the | llc amazon eu | serv calori g | realli look forward
and it was | there is no | look forward to | fr amazon amazon | hour m p | dont understand whi
is going to | the fact that | want to be | fee advertis link | u s militari | im gonna go
of <unk> and | i dont know | if you are | eu content provid | nation weather servic | thank follow hope
if you have | i dont think | is the best | eu associ programm | martin luther king | sound like great
<unk> is a | at the <unk> | please follow me | es certain content | wall street journal | ive never seen
that i am | rest of the | thank you so | earn advertis fee | state attorney general | today good day
i dont think | it was the | you need to | design provid mean | past two year | think im gonna
one of those | the new york | if you have | d amazon fr | former u s | pleas let know
and i am | in a <unk> | i dont have | content appear websit | start p m | im big fan
i will be | <unk> said he | if you dont | com particip amazon | u s economi | dont want go
i think i | not going to | to the <unk> | ca amazon co | major leagu basebal | will let know
of <unk> <unk> | in addition to | the first time | associ programm design | late last year | just make sure
as much as | president of the | <unk> in the | appear websit come | plain dealer report | st patti day
dont want to | to have a | i hope you | andor amazon eu | said u s | cant wait next
and i was | the <unk> and | in the morning | amazon fr amazon | u s open | just finish mi
of the most | members of the | all the time | amazon eu content | new york stock | never get old
it is the | this is the | i love it | amazon eu associ | los angel time | good morn world
it is not | all of the | we need to | amazon es certain | gov john kasich | follow us twitter
in the middle | the block of | go to the | amazon d amazon | fat mg cholesterol | finish mi run
to go to | member of the | i love the | amazon ca amazon | new york new | happi cinco de
for the <unk> | part of a | to have a | amazon amazon es | chief oper offic | sound like fun
it will be | a chance to | let us know | advertis link amazon | friday p m | follow back please
would like to | as much as | i am so | advertis fee advertis | new york city | thank follow love
at the same | a m to | need to get | realli look forward | said dont know | pleas follow love
for the first | more than million | it would be | will take place | los angel counti | help spread word
that i was | a little bit | to get a | sever year ago | sever year ago | look forward read
with the <unk> | for the <unk> | i dont want | st patrick day | p m march | im just gonna
at the time | to make a | have a good | ive ever seen | calori g fat | happi hump day
it was the | at the same | i know i | spent lot time | counti circuit court | martin luther king
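
A ranking of this kind can be sketched in a few lines of Python, shown for illustration only (the counts in this report were produced with SRILM). The sketch assumes each line has already been cleaned and split into tokens and that trigrams do not cross line boundaries; the function name `top_trigrams` and the toy input are hypothetical.

```python
from collections import Counter

def top_trigrams(sentences, k=10):
    """Return the k most frequent trigrams as (text, count) pairs.

    `sentences` is an iterable of token lists, one list per line.
    """
    counts = Counter()
    for tokens in sentences:
        # zip over three shifted views yields consecutive token triples
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return [(" ".join(tri), c) for tri, c in counts.most_common(k)]

toy = [
    "one of the best".split(),
    "one of the few".split(),
]
print(top_trigrams(toy, k=1))
```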

Unique N-grams

We obtained the following numbers of unique N-grams in the preprocessed corpora:

N-grams | Blogs | News | Twitter
1-grams | 370709 | 307432 | 385284
2-grams | 6278444 | 6210441 | 5078315
3-grams | 18883474 | 17795115 | 13400447
4-grams | 28354363 | 25382185 | 18186423
5-grams | 31722973 | 27675235 | 18799974

After applying the vocabulary cut-offs described above, the numbers of N-grams are as follows:

N-grams | Blogs | News | Twitter
1-grams | 35968 | 39908 | 35403
2-grams | 4910128 | 5027042 | 4004865
3-grams | 17668476 | 16809386 | 12592003
4-grams | 27889455 | 25040570 | 17917071
5-grams | 31618529 | 27594699 | 18741859
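
The effect of a vocabulary cut-off on the number of unique N-grams can be illustrated with a small Python sketch. It is not the SRILM procedure used for the tables above, which may apply further conventions (for example sentence-boundary tags) that change the counts. The function names `ngrams` and `unique_ngram_counts` and the toy data are hypothetical, and N-grams are assumed not to cross line boundaries.

```python
def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

def unique_ngram_counts(sentences, max_n=5, vocab=None, unk="<unk>"):
    """Count distinct n-grams for n = 1..max_n.

    If `vocab` is given, tokens outside it are mapped to `unk` first,
    which merges many rare n-grams into shared <unk> patterns.
    """
    seen = {n: set() for n in range(1, max_n + 1)}
    for tokens in sentences:
        if vocab is not None:
            tokens = [t if t in vocab else unk for t in tokens]
        for n in seen:
            seen[n].update(ngrams(tokens, n))
    return {n: len(s) for n, s in seen.items()}

toy = [["a", "b", "c"], ["a", "b", "d"]]
print(unique_ngram_counts(toy, max_n=3))
print(unique_ngram_counts(toy, max_n=3, vocab={"a", "b"}))
```

In the toy example the rare words "c" and "d" collapse into a single `<unk>` token, so the number of distinct N-grams drops at every order. The drop is largest in relative terms for low orders, which matches the pattern in the tables above.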

Preliminary findings

  1. We have found that removal of control characters is vital for further text preprocessing. Several control characters may cause truncation of corpora when text transforms are applied.

  2. The SRI Language Modeling Toolkit has been found to be more effective and versatile for our purposes than Weka/RWeka.

  3. We have found substantial differences between the Blogs, News and Twitter corpora in their N-gram composition.

Further plans

We are going to evaluate and implement several N-gram based schemes for smoothing and interpolation.
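
As an illustration of what such a scheme looks like, the following Python sketch implements simple linear interpolation of bigram and unigram maximum-likelihood estimates. It is one common textbook scheme, shown under assumptions, and not necessarily one that will be chosen for this project. The function name `interpolated_bigram_model`, the fixed weight `lam` and the toy data are hypothetical.

```python
from collections import Counter

def interpolated_bigram_model(sentences, lam=0.7):
    """Build P(word | prev) = lam * P_bigram + (1 - lam) * P_unigram."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        contexts.update(tokens[:-1])      # tokens that have a successor
    total = sum(unigrams.values())

    def prob(word, prev):
        p_uni = unigrams[word] / total if total else 0.0
        # An unseen context falls back to the unigram term only.
        p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob

prob = interpolated_bigram_model([["a", "b"], ["a", "c"]], lam=0.5)
print(prob("b", "a"))   # bigram and unigram evidence combined
print(prob("a", "zzz")) # unseen context: unigram term only
```

The unigram term keeps the probability non-zero for word pairs that never occur in the training data, which is the main reason to interpolate.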