Descriptive Statistics and Impressions

Blogs

library(quanteda)

# Read the blog file and draw a 45 percent random sample of its lines
Vector <- c()
Vector <- c(readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = T), Vector)
sample(x = Vector, size = length(Vector) * 0.45) -> Vector
# Tokenise, drop English stopwords in the dfm, and keep each token's document frequency
tokens(Vector, remove_numbers = T, remove_punct = T, remove_symbols = T,
       remove_separators = TRUE, remove_twitter = T, remove_hyphens = T,
       remove_url = T, ngrams = 1, concatenator = " ") -> Vector1
dfm(Vector1, remove = stopwords("english")) -> Vector1
docfreq(Vector1) -> Vector1

Here I have taken a 45 percent random sample of the lines in en_US.blogs.txt (the 0.45 fraction in the sample() call above) and applied the tokenisation and cleaning steps shown there.
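Note that the sample() call above is not seeded, so the exact figures below will differ slightly from run to run. A minimal sketch of making the sampling reproducible (the seed value is arbitrary):

set.seed(1234)  # arbitrary seed so the 45 percent sample can be reproduced
Vector <- sample(x = Vector, size = length(Vector) * 0.45)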

barplot(sort(Vector1, decreasing = T)[1:10])

summary(Vector1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     1.00    39.26     5.00 46495.00

Looking at the barplot, we find that the word "one" is the most common token. Most of the top tokens also appear to be adjectives or verbs rather than nouns. The five-number summary, with a median document frequency of 1, suggests that the bulk of the vocabulary consists of words that appear only once.
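This impression can be checked directly by computing the share of tokens whose document frequency is exactly one. A minimal sketch, assuming Vector1 still holds the docfreq() result for the blog sample:

# Proportion of features that occur in only one document (hapax-like terms)
hapax_share <- mean(Vector1 == 1)
round(hapax_share, 3)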

max(names(Vector1))
## [1] "zzzzzzzzzzzzzzzzzzzzzzzzzzz"

Note that max() on a character vector returns the alphabetically last token, not the longest one; here that token is a long run of "z"s, evidently imitating the sound of sleeping.
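To find the token that actually has the greatest length, something like the following sketch (again assuming Vector1 holds the blog docfreq() vector) could be used:

# Longest token by character count rather than alphabetical order
token_names <- names(Vector1)
token_names[which.max(nchar(token_names))]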

News

# Same pipeline for the news file: read, sample 45 percent, tokenise, drop stopwords
Vector <- c()
Vector <- c(readLines("en_US.news.txt", encoding = "UTF-8", skipNul = T), Vector)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = T):
## incomplete final line found on 'en_US.news.txt'
sample(x = Vector, size = length(Vector) * 0.45) -> Vector
tokens(Vector, remove_numbers = T, remove_punct = T, remove_symbols = T,
       remove_separators = TRUE, remove_twitter = T, remove_hyphens = T,
       remove_url = T, ngrams = 1, concatenator = " ") -> Vector1
dfm(Vector1, remove = stopwords("english")) -> Vector1
docfreq(Vector1) -> Vector1

Here I have taken the same kind of random sample, again about 45 percent of the lines in the news file.

barplot(sort(Vector1, decreasing = T)[1:10])

summary(Vector1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00   11.86    5.00 7770.00

Looking at the barplot, the word "said" is the most common token, which makes sense given that this dataset comes from news text. The five-number summary shows a much lower mean document frequency (11.86 versus 39.26 for the blogs), suggesting that the news sample has a more varied vocabulary than the blog dataset.
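Because Vector1 is overwritten between sections, a direct comparison needs both document-frequency vectors kept under separate names. A minimal sketch, assuming hypothetical blog_df and news_df objects built exactly as above:

# Compare vocabulary size and mean document frequency across the two samples
# blog_df and news_df are assumed to be the docfreq() vectors kept separately
data.frame(
  corpus     = c("blogs", "news"),
  vocab_size = c(length(blog_df), length(news_df)),
  mean_freq  = c(mean(blog_df), mean(news_df))
)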

Twitter

# Same pipeline for the twitter file: read, sample 45 percent, tokenise, drop stopwords
Vector <- c()
Vector <- c(readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = T), Vector)
sample(x = Vector, size = length(Vector) * 0.45) -> Vector
tokens(Vector, remove_numbers = T, remove_punct = T, remove_symbols = T,
       remove_separators = TRUE, remove_twitter = T, remove_hyphens = T,
       remove_url = T, ngrams = 1, concatenator = " ") -> Vector1
dfm(Vector1, remove = stopwords("english")) -> Vector1
docfreq(Vector1) -> Vector1

Here the same 45 percent random sampling is applied to the twitter file.

barplot(sort(Vector1, decreasing = T)[1:10])

summary(Vector1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     1.00    33.65     3.00 66482.00

Here the word "just" appears most often; the same word was the second most common token in the news dataset.

However, the five-number summary, with a median document frequency of 1, again indicates that most tokens appear only once, just as in the blog dataset.
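The overlap between the most frequent words of the different corpora can be checked directly. A minimal sketch, again assuming separate docfreq() vectors (hypothetical news_df and twitter_df) are kept:

# Words that appear in the top 10 of both the news and twitter samples
top_news    <- names(sort(news_df, decreasing = TRUE))[1:10]
top_twitter <- names(sort(twitter_df, decreasing = TRUE))[1:10]
intersect(top_news, top_twitter)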

Approach for prediction from n-grams

The random sampling will amount to about 20 percent of the whole population. The dataset is quite heavy in aggregate, and a random sample drawn after combining all the files should be representative of the whole population. Unigram, bigram, and trigram tokens will then be produced and, starting from the trigram (three-word) level, the next word will be predicted from the last words of the sentence under a Markov assumption. If no matching trigram is found, a back-off procedure will be applied and the word will be searched for in the bigram (two-word) and then the unigram (one-word) tables.
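As a rough illustration of this plan (not the final model), the sketch below builds unigram, bigram, and trigram frequency tables with quanteda and predicts the next word by matching the last two words of a phrase against the trigram table, backing off to the bigram and unigram tables when no match is found. The sample_text object and the predict_next helper are placeholders, not part of the code above:

library(quanteda)

# sample_text is a placeholder character vector drawn from the combined corpora
toks <- tokens(sample_text, remove_numbers = T, remove_punct = T,
               remove_symbols = T, remove_url = T)
uni <- colSums(dfm(toks))
bi  <- colSums(dfm(tokens_ngrams(toks, n = 2, concatenator = " ")))
tri <- colSums(dfm(tokens_ngrams(toks, n = 3, concatenator = " ")))

predict_next <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # Try trigrams whose first two words match, then back off to bigrams, then unigrams
  hit <- tri[grepl(paste0("^", paste(words, collapse = " "), " "), names(tri))]
  if (length(hit) == 0)
    hit <- bi[grepl(paste0("^", tail(words, 1), " "), names(bi))]
  if (length(hit) == 0)
    hit <- uni
  # Return the final word of the highest-frequency matching n-gram
  tail(unlist(strsplit(names(which.max(hit)), " ")), 1)
}

predict_next("thanks for the")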