Introduction

This assignment presents exploratory data analysis results for the English portion of the Data Science Capstone corpora, which consists of blog, news, and Twitter text.

Shown below is a pie chart of the number of lines of text in the corpus.

##         LineCount WordCount Unique.Words RepeatedVocabularyPercent AvgWordsinLine RelSentimentScorex1000
## blogs      899288  37334441      1103677                  97.04381       41.51556               9.788120
## news        77259   2643972       197857                  92.51668       34.22219               4.854439
## twitter   2360148  30373792      1290203                  95.75225       12.86944              14.915688
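
The two derived columns can be checked from the raw counts. The sketch below assumes that `AvgWordsinLine` is `WordCount / LineCount` and that `RepeatedVocabularyPercent` is `100 * (1 - Unique.Words / WordCount)`. These definitions are not stated in the report; they are inferred because they match the reported figures. The sentiment score is left out because its definition cannot be recovered from the table.

```r
# Counts copied from the summary table.
counts <- data.frame(
  LineCount    = c(899288, 77259, 2360148),
  WordCount    = c(37334441, 2643972, 30373792),
  Unique.Words = c(1103677, 197857, 1290203),
  row.names    = c("blogs", "news", "twitter")
)

# Assumed definitions (inferred from the numbers, not given in the report).
# Average number of words per line of text:
counts$AvgWordsinLine <- counts$WordCount / counts$LineCount
# Share of word occurrences that are not the first use of a distinct word:
counts$RepeatedVocabularyPercent <-
  100 * (1 - counts$Unique.Words / counts$WordCount)

print(round(counts[, c("AvgWordsinLine", "RepeatedVocabularyPercent")], 5))
```

Under these assumptions, a lower `RepeatedVocabularyPercent` means more distinct words per word written, which is the sense in which a source's vocabulary is called more diverse.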

Note

To keep the program runtime manageable, a total of 10,000 lines of text were sampled, split across blogs, news, and Twitter in proportion to the line counts computed above.
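
A minimal sketch of such a proportional split is shown here. The helper `sample_lines()` and its `seed` argument are hypothetical names introduced for illustration, and rounding the per-source sizes is an assumption; the report does not show how the sample was actually drawn.

```r
# Line counts copied from the summary table.
line_counts  <- c(blogs = 899288, news = 77259, twitter = 2360148)
total_sample <- 10000

# Each source's share of the sample matches its share of all lines.
# Rounding may leave the total a line or two off 10000 in general.
sample_sizes <- round(total_sample * line_counts / sum(line_counts))
print(sample_sizes)

# Hypothetical helper: draw up to n lines at random from a character vector.
# Note that set.seed() resets the global random number generator.
sample_lines <- function(lines, n, seed = 1) {
  set.seed(seed)
  lines[sample.int(length(lines), size = min(n, length(lines)))]
}
```

With a character vector of lines for one source (for example, the result of `readLines()` on the blogs file), a call such as `sample_lines(blog_lines, sample_sizes[["blogs"]])` would return that source's portion. With these counts, roughly 27% of the sample comes from blogs, 2% from news, and 71% from Twitter.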

Observations

As expected, lines are shortest for tweets, at about 13 words per line on average.
News uses the most diverse vocabulary (about 92.5% repeated words), while blogs repeat about 97% of their words.
Tweets show the highest (most positive) sentiment score, while news is the most neutral of the three.

Exploring some English Usage
