Descriptive Statistics and Impressions

Blogs

library(quanteda)

# Read the blog file and draw a 45 percent random sample of its lines
Vector <- c()
Vector <- c(readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = T), Vector)
sample(x = Vector, size = length(Vector) * 0.45) -> Vector
# Tokenise, drop English stopwords in the dfm, and keep each token's document frequency
tokens(Vector, remove_numbers = T, remove_punct = T, remove_symbols = T,
       remove_separators = TRUE, remove_twitter = T, remove_hyphens = T,
       remove_url = T, ngrams = 1, concatenator = " ") -> Vector1
dfm(Vector1, remove = stopwords("english")) -> Vector1
docfreq(Vector1) -> Vector1

Here I have taken a 45 percent random sample of the lines in en_US.blogs.txt (the 0.45 fraction in the sample() call above) and applied the tokenisation and cleaning steps shown there.
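Note that the sample() call above is not seeded, so the exact figures below will differ slightly from run to run. A minimal sketch of making the sampling reproducible (the seed value is arbitrary):

set.seed(1234)  # arbitrary seed so the 45 percent sample can be reproduced
Vector <- sample(x = Vector, size = length(Vector) * 0.45)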

barplot(sort(Vector1, decreasing = T)[1:10])

summary(Vector1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     1.00    39.26     5.00 46495.00

Looking at the barplot, we find that the word "one" is the most common token. Most of the top tokens also appear to be adjectives or verbs rather than nouns. The five-number summary, with a median document frequency of 1, suggests that the bulk of the vocabulary consists of words that appear only once.
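This impression can be checked directly by computing the share of tokens whose document frequency is exactly one. A minimal sketch, assuming Vector1 still holds the docfreq() result for the blog sample:

# Proportion of features that occur in only one document (hapax-like terms)
hapax_share <- mean(Vector1 == 1)
round(hapax_share, 3)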

max(names(Vector1))
## [1] "zzzzzzzzzzzzzzzzzzzzzzzzzzz"

Note that max() on a character vector returns the alphabetically last token, not the longest one; here that token is a long run of "z"s, evidently imitating the sound of sleeping.
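To find the token that actually has the greatest length, something like the following sketch (again assuming Vector1 holds the blog docfreq() vector) could be used:

# Longest token by character count rather than alphabetical order
token_names <- names(Vector1)
token_names[which.max(nchar(token_names))]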

News

# Same pipeline for the news file: read, sample 45 percent, tokenise, drop stopwords
Vector <- c()
Vector <- c(readLines("en_US.news.txt", encoding = "UTF-8", skipNul = T), Vector)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = T):
## incomplete final line found on 'en_US.news.txt'
sample(x = Vector, size = length(Vector) * 0.45) -> Vector
tokens(Vector, remove_numbers = T, remove_punct = T, remove_symbols = T,
       remove_separators = TRUE, remove_twitter = T, remove_hyphens = T,
       remove_url = T, ngrams = 1, concatenator = " ") -> Vector1
dfm(Vector1, remove = stopwords("english")) -> Vector1
docfreq(Vector1) -> Vector1

Here I have taken the same kind of random sample, again about 45 percent of the lines in the news file.

barplot(sort(Vector1, decreasing = T)[1:10])

summary(Vector1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00   11.86    5.00 7770.00

Looking at the barplot, the word "said" is the most common token, which makes sense given that this dataset comes from news text. The five-number summary shows a much lower mean document frequency (11.86 versus 39.26 for the blogs), suggesting that the news sample has a more varied vocabulary than the blog dataset.
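Because Vector1 is overwritten between sections, a direct comparison needs both document-frequency vectors kept under separate names. A minimal sketch, assuming hypothetical blog_df and news_df objects built exactly as above:

# Compare vocabulary size and mean document frequency across the two samples
# blog_df and news_df are assumed to be the docfreq() vectors kept separately
data.frame(
  corpus     = c("blogs", "news"),
  vocab_size = c(length(blog_df), length(news_df)),
  mean_freq  = c(mean(blog_df), mean(news_df))
)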

Twitter

# Same pipeline for the twitter file: read, sample 45 percent, tokenise, drop stopwords
Vector <- c()
Vector <- c(readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = T), Vector)
sample(x = Vector, size = length(Vector) * 0.45) -> Vector
tokens(Vector, remove_numbers = T, remove_punct = T, remove_symbols = T,
       remove_separators = TRUE, remove_twitter = T, remove_hyphens = T,
       remove_url = T, ngrams = 1, concatenator = " ") -> Vector1
dfm(Vector1, remove = stopwords("english")) -> Vector1
docfreq(Vector1) -> Vector1

Here the same 45 percent random sampling is applied to the twitter file.

barplot(sort(Vector1, decreasing = T)[1:10])

summary(Vector1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     1.00    33.65     3.00 66482.00

Here the word "just" appears most often; the same word was the second most common token in the news dataset.

However, the five-number summary, with a median document frequency of 1, again indicates that most tokens appear only once, just as in the blog dataset.
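The overlap between the most frequent words of the different corpora can be checked directly. A minimal sketch, again assuming separate docfreq() vectors (hypothetical news_df and twitter_df) are kept:

# Words that appear in the top 10 of both the news and twitter samples
top_news    <- names(sort(news_df, decreasing = TRUE))[1:10]
top_twitter <- names(sort(twitter_df, decreasing = TRUE))[1:10]
intersect(top_news, top_twitter)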

Approach for prediction from n-grams

The random sampling will amount to about 20 percent of the whole population. The dataset is quite heavy in aggregate, and a random sample drawn after combining all the files should be representative of the whole population. Unigram, bigram, and trigram tokens will then be produced and, starting from the trigram (three-word) level, the next word will be predicted from the last words of the sentence under a Markov assumption. If no matching trigram is found, a back-off procedure will be applied and the word will be searched for in the bigram (two-word) and then the unigram (one-word) tables.
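As a rough illustration of this plan (not the final model), the sketch below builds unigram, bigram, and trigram frequency tables with quanteda and predicts the next word by matching the last two words of a phrase against the trigram table, backing off to the bigram and unigram tables when no match is found. The sample_text object and the predict_next helper are placeholders, not part of the code above:

library(quanteda)

# sample_text is a placeholder character vector drawn from the combined corpora
toks <- tokens(sample_text, remove_numbers = T, remove_punct = T,
               remove_symbols = T, remove_url = T)
uni <- colSums(dfm(toks))
bi  <- colSums(dfm(tokens_ngrams(toks, n = 2, concatenator = " ")))
tri <- colSums(dfm(tokens_ngrams(toks, n = 3, concatenator = " ")))

predict_next <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # Try trigrams whose first two words match, then back off to bigrams, then unigrams
  hit <- tri[grepl(paste0("^", paste(words, collapse = " "), " "), names(tri))]
  if (length(hit) == 0)
    hit <- bi[grepl(paste0("^", tail(words, 1), " "), names(bi))]
  if (length(hit) == 0)
    hit <- uni
  # Return the final word of the highest-frequency matching n-gram
  tail(unlist(strsplit(names(which.max(hit)), " ")), 1)
}

predict_next("thanks for the")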