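This report presents exploratory data analysis results for the English portion of the Data Science Capstone corpora.
The pie chart below shows how the lines of text in the corpus are distributed across its three sources (blogs, news, and Twitter), and the table that follows summarizes each source.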
##         LineCount WordCount Unique.Words RepeatedVocabularyPercent AvgWordsinLine RelSentimentScorex1000
## blogs      899288  37334441      1103677                  97.04381       41.51556               9.788120
## news        77259   2643972       197857                  92.51668       34.22219               4.854439
## twitter   2360148  30373792      1290203                  95.75225       12.86944              14.915688
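As a rough illustration of how the count-based columns relate to each other, the sketch below computes them for one source in base R. The helper name `summarize_source` and the whitespace tokenization are assumptions, not the code used for this report, so its word counts will not necessarily match the table. The sentiment column is left out because its definition is not given here. The two ratio formulas, `100 * (1 - Unique.Words / WordCount)` and `WordCount / LineCount`, do agree with the values reported above.

```r
# Hypothetical helper: summarize one source from a character vector of lines.
summarize_source <- function(lines) {
  # Assumption: a "word" is a lowercase token separated by whitespace.
  words <- unlist(strsplit(tolower(lines), "\\s+"))
  words <- words[nzchar(words)]  # drop empty tokens from leading whitespace

  n_words  <- length(words)
  n_unique <- length(unique(words))

  data.frame(
    LineCount                 = length(lines),
    WordCount                 = n_words,
    Unique.Words              = n_unique,
    # Share of word occurrences that repeat an already-seen word
    RepeatedVocabularyPercent = 100 * (1 - n_unique / n_words),
    AvgWordsinLine            = n_words / length(lines)
  )
}

# Toy input, not taken from the corpus
toy_summary <- summarize_source(c("the cat sat", "the dog sat down"))
print(toy_summary)
```

Applying such a helper to each source and binding the rows with `rbind()` would give a table of the same shape as the one above.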
To keep the program's runtime manageable, a sample of 10,000 lines in total was selected, with each source contributing in proportion to its line count in the table above.
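The proportional allocation can be sketched as follows. This is a minimal example under stated assumptions, not the report's actual code. The line counts are copied from the table, while the seed, the largest-remainder rounding, and the helper `sample_lines` are hypothetical choices made for illustration.

```r
set.seed(1)  # hypothetical seed, only to make the sketch reproducible

# Line counts as reported in the summary table
line_counts  <- c(blogs = 899288, news = 77259, twitter = 2360148)
total_sample <- 10000

# Proportional share for each source, before rounding
exact <- total_sample * line_counts / sum(line_counts)

# Largest-remainder rounding (an assumption) so the sizes sum exactly to the total
n_sample  <- floor(exact)
shortfall <- total_sample - sum(n_sample)
if (shortfall > 0) {
  top_up <- order(exact - n_sample, decreasing = TRUE)[seq_len(shortfall)]
  n_sample[top_up] <- n_sample[top_up] + 1
}
print(n_sample)

# Hypothetical helper: draw n lines without replacement from one source
sample_lines <- function(lines, n) lines[sample.int(length(lines), n)]

# Toy usage on a stand-in vector; the real input would be the lines of one source
toy_lines  <- paste("line", 1:100)
toy_sample <- sample_lines(toy_lines, 10)
```

Rounding by largest remainder avoids the off-by-one totals that plain `round()` can produce when the fractional parts do not balance out.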