
Introduction

This report presents exploratory data analysis results for the English (en_US) portion of the Data Science Capstone corpora, which consists of blog, news, and Twitter text.

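The number of lines of text per source is shown as a pie chart; the table below gives those line counts alongside word counts, unique words, the percentage of repeated vocabulary, average words per line, and a relative sentiment score (multiplied by 1000).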

##         LineCount WordCount Unique.Words RepeatedVocabularyPercent AvgWordsinLine RelSentimentScorex1000
## blogs      899288  37334441      1103677                  97.04381       41.51556               9.788120
## news        77259   2643972       197857                  92.51668       34.22219               4.854439
## twitter   2360148  30373792      1290203                  95.75225       12.86944              14.915688
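
The two derived columns can be reproduced from the raw counts. The sketch below assumes that RepeatedVocabularyPercent is 100 × (1 − unique words / total words) and that AvgWordsinLine is total words / lines. These definitions are inferred from the numbers in the table, not taken from the report's code, and the data frame name `stats` is illustrative.

```r
# Counts copied from the summary table above.
stats <- data.frame(
  LineCount    = c(899288, 77259, 2360148),
  WordCount    = c(37334441, 2643972, 30373792),
  Unique.Words = c(1103677, 197857, 1290203),
  row.names    = c("blogs", "news", "twitter")
)

# Assumed definition: share of word tokens that repeat an already-seen word.
stats$RepeatedVocabularyPercent <- 100 * (1 - stats$Unique.Words / stats$WordCount)

# Assumed definition: mean number of words per line.
stats$AvgWordsinLine <- stats$WordCount / stats$LineCount

print(round(stats[, c("RepeatedVocabularyPercent", "AvgWordsinLine")], 5))
```

Under these assumptions the computed values agree with the table to the precision shown.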

Note

To keep the program runtime manageable, a sample of 10,000 lines in total was selected, with each source contributing in proportion to its line count above.
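
One way such a proportional sample could be drawn is sketched below. It is not the report's own code: the helper `sample_lines`, the variable names, and the fixed seed are hypothetical, and simple rounding is assumed for the per-source allocation.

```r
# Line counts taken from the summary table.
line_counts  <- c(blogs = 899288, news = 77259, twitter = 2360148)
total_sample <- 10000

# Allocate the sample to each source in proportion to its share of all lines.
sample_sizes <- round(total_sample * line_counts / sum(line_counts))

# Hypothetical helper: draw n lines at random, without replacement,
# from a character vector such as the result of readLines().
sample_lines <- function(lines, n, seed = 1) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  lines[sample.int(length(lines), n)]
}

print(sample_sizes)
```

Rounding each share independently does not always give exactly the requested total, so it is worth checking `sum(sample_sizes)` before sampling.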

Observations

  • As expected, lines are shortest for tweets, averaging about 13 words per line.
  • News uses the most diverse vocabulary (about 92.5% repeated words), while blogs repeat almost 97% of their words.
  • Tweets show the highest (most positive) sentiment score, while news is the closest to neutral.