Blogs
library(quanteda)
Vector <- c()
Vector <- c(readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE), Vector)
Vector <- sample(x = Vector, size = length(Vector) * 0.45)
Vector1 <- tokens(Vector, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE,
                  remove_separators = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE,
                  remove_url = TRUE, ngrams = 1, concatenator = " ")
Vector1 <- dfm(Vector1, remove = stopwords("english"))
Vector1 <- docfreq(Vector1)
Here I have taken a 45 percent random sample of the lines from the en_US.blogs.txt file and applied the necessary cleaning steps (tokenisation; removal of numbers, punctuation, symbols, URLs, and English stop words).
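The sample above is not seeded, so it will differ on every run; a minimal sketch of a reproducible version, assuming the same 45 percent fraction (the seed value itself is arbitrary):
set.seed(1234)  # arbitrary seed, used only so the sample can be reproduced
Vector <- sample(x = Vector, size = length(Vector) * 0.45)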
barplot(sort(Vector1, decreasing = TRUE)[1:10])
summary(Vector1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 39.26 5.00 46495.00
Looking at the barplot we find that "one" is the most common word. Most of the top tokens also appear to be adjectives or verbs rather than nouns. The five-number summary shows that the majority of tokens occur in only a single document (median document frequency of 1).
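One way to make that last observation concrete is to compute the share of tokens with a document frequency of exactly one; a minimal sketch, assuming Vector1 is still the named document-frequency vector produced above:
# Proportion of features that appear in only one document
mean(Vector1 == 1)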
max(names(Vector1))
## [1] "zzzzzzzzzzzzzzzzzzzzzzzzzzz"
Note that max() on a character vector returns the alphabetically last token rather than the longest one; here it is a long run of z's, apparently representing sleep.
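If the token with the greatest number of characters is wanted instead, a minimal sketch along these lines would find it (again assuming Vector1 is the named document-frequency vector):
# Feature name with the most characters
names(Vector1)[which.max(nchar(names(Vector1)))]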
News
Vector <- c()
Vector <- c(readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE), Vector)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = T):
## incomplete final line found on 'en_US.news.txt'
Vector <- sample(x = Vector, size = length(Vector) * 0.45)
Vector1 <- tokens(Vector, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE,
                  remove_separators = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE,
                  remove_url = TRUE, ngrams = 1, concatenator = " ")
Vector1 <- dfm(Vector1, remove = stopwords("english"))
Vector1 <- docfreq(Vector1)
Here I have taken the same kind of random sample, again amounting to 45 percent of the lines in the news file.
barplot(sort(Vector1, decreasing = TRUE)[1:10])
summary(Vector1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 11.86 5.00 7770.00
Looking at the barplot, the word "said" is the most common, which logically makes sense given that this dataset corresponds to news. Looking at the five-number summary, the vocabulary appears more varied than in the blog dataset: the mean and maximum document frequencies are much lower, so the counts are spread across a larger share of distinct words.
Twitter
Vector <- c()
Vector <- c(readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE), Vector)
Vector <- sample(x = Vector, size = length(Vector) * 0.45)
Vector1 <- tokens(Vector, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE,
                  remove_separators = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE,
                  remove_url = TRUE, ngrams = 1, concatenator = " ")
Vector1 <- dfm(Vector1, remove = stopwords("english"))
Vector1 <- docfreq(Vector1)
Here the same 45 percent random sampling is applied to the Twitter data.
barplot(sort(Vector1, decreasing = TRUE)[1:10])
summary(Vector1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 33.65 3.00 66482.00
Here the word "just" appears the most often; the same word appeared in the news dataset as the second most common word.
However, looking at the five-number summary, most words occur in only a single document, just like in the blog dataset.
For the prediction model, the random sample will amount to about 20 percent of the whole population. The dataset is quite heavy in aggregate, and drawing the random sample after combining all three files should be representative of the whole population. Unigram, bigram, and trigram tokens will then be produced. Starting from the trigram (3 words), the next word will be predicted from the last words of the sentence by applying a Markov assumption. If no matching trigram is found, a back-off procedure will be applied and the prediction will fall back to the bigram (2 words) and then the unigram (1 word) tables.
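A minimal sketch of what this prediction step could look like, assuming three named frequency tables tri_freq, bi_freq, and uni_freq built with tokens_ngrams() on the combined sample (these names and the simple greedy back-off rule are assumptions for illustration, not the final implementation):
# Hypothetical frequency tables built from the combined 20 percent sample, e.g.:
# toks     <- tokens(Combined, remove_numbers = TRUE, remove_punct = TRUE)
# tri_freq <- colSums(dfm(tokens_ngrams(toks, n = 3, concatenator = " ")))
# bi_freq  <- colSums(dfm(tokens_ngrams(toks, n = 2, concatenator = " ")))
# uni_freq <- colSums(dfm(tokens_ngrams(toks, n = 1, concatenator = " ")))

predict_next <- function(sentence, tri_freq, bi_freq, uni_freq) {
  words <- tolower(unlist(strsplit(sentence, "\\s+")))
  n <- length(words)
  # Try the trigram table first: match on the last two words of the sentence
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- tri_freq[startsWith(names(tri_freq), paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(sub(".*\\s", "", names(which.max(hits))))  # last word of the best trigram
    }
  }
  # Back off to the bigram table: match on the last word only
  if (n >= 1) {
    hits <- bi_freq[startsWith(names(bi_freq), paste0(words[n], " "))]
    if (length(hits) > 0) {
      return(sub(".*\\s", "", names(which.max(hits))))
    }
  }
  # Final back-off: the single most frequent unigram
  names(which.max(uni_freq))
}
This sketch only returns the single most frequent continuation at each level; a weighted scheme such as stupid back-off or a smoothed model could be substituted later without changing the structure of the frequency tables.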