The file en_US.blogs.txt contains a collection of phrases that, judging by the file name, were probably extracted from English-language blogs. The same reasoning applies to en_US.news.txt, a collection of phrases from English-language news, and to the Twitter file, a collection of short phrases from tweets, also in English.
SwiftKey, a partner of Johns Hopkins University, prepared the dataset for use in the Coursera Capstone Project. The data was collected from publicly available sources by a web crawler, in four languages: English, Russian, German and Finnish. This Capstone uses the English data.
Yes: literary texts or poems, texts with regional vocabulary, and texts from people who use new words such as slang.
Retrieving, cleaning, exploring and processing data.
With informal texts, we can expect to find slang, foreign words, misspellings and new vocabulary created as the language evolves.
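As a rough illustration of these steps, here is a minimal Python sketch. It is not the method used in the Capstone: the function names (read_lines, clean, word_counts), the cleaning rules (lowercasing, dropping URLs, keeping only letters and apostrophes) and the sample sentences are all assumptions chosen for the example.

```python
import re
from collections import Counter


def read_lines(path, limit=None):
    """Retrieving: read up to `limit` lines from a text file (hypothetical helper)."""
    lines = []
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            lines.append(line.rstrip("\n"))
    return lines


def clean(line):
    """Cleaning: lowercase, drop URLs, keep only letters, apostrophes and spaces."""
    line = line.lower()
    line = re.sub(r"https?://\S+", " ", line)
    line = re.sub(r"[^a-z' ]+", " ", line)
    return re.sub(r"\s+", " ", line).strip()


def word_counts(lines):
    """Exploring/processing: count how often each word appears."""
    counts = Counter()
    for line in lines:
        counts.update(clean(line).split())
    return counts


# Made-up sample lines stand in for the real files here.
sample = ["The cat sat.", "the CAT ran!", "Check THIS out: http://t.co/abc #NLP!!"]
print(word_counts(sample).most_common(3))
```

In the real project, the sample list would be replaced by something like read_lines("en_US.blogs.txt", limit=10000), so that exploration starts on a manageable subset of the data.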
Since the 1990s, much Natural-Language Processing research has relied heavily on machine learning. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples. Systems based on machine-learning algorithms have many advantages over hand-produced rules.
(Wikipedia)