For this project we analyze three data sets: Blogs, News, and Twitter.
The approach is simple: for each set we produce a histogram and a statistical summary.
Each histogram shows the most frequent words in the set. We removed stop words such as “the”, which carry no information for this analysis.
The statistical summary provides a quick comparison of the three sets by number of lines, number of characters, number of words, and file size.
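As a rough illustration of how such a summary can be computed, the sketch below runs the stringi statistics functions on a toy character vector. The vector `sample_lines` and the data frame `summary_row` are hypothetical names used only for this example; the real corpora would be read from disk first, and the file-size column is left out here. Taking the word count from `stri_stats_latex()` is an assumption about how the report obtained it.

```r
library(stringi)

# Toy stand-in for one corpus (hypothetical data, one element per line).
sample_lines <- c("The quick brown fox", "jumps over", "the lazy dog sleeps")

# Line and character counts (Lines, LinesNEmpty, Chars, CharsNWhite).
general <- stri_stats_general(sample_lines)

# Word count; stri_stats_latex() reports it under the name "Words".
latex <- stri_stats_latex(sample_lines)

# Collect the figures into a single row, mirroring the summary columns.
summary_row <- data.frame(
  Lines       = general[["Lines"]],
  LinesNEmpty = general[["LinesNEmpty"]],
  Chars       = general[["Chars"]],
  CharsNWhite = general[["CharsNWhite"]],
  Words       = latex[["Words"]]
)
print(summary_row)
```

Repeating this for each corpus and stacking the rows with `rbind()` would give one comparison table for all three sets.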
For the statistical summary, we use the stri_stats functions from the stringi package. We use the tm package to remove numbers, which are very common but meaningless for our analysis. We use tidytext to split each set into rows of one word each. Finally, we use dplyr to chain these steps with pipes.
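The word-counting steps described above can be sketched as follows. This is a minimal example on made-up sentences, not the report's actual code: `sample_text` and `word_counts` are hypothetical names, and the use of `anti_join()` with tidytext's `stop_words` table is an assumption about how the common words were removed.

```r
library(dplyr)
library(tidytext)
library(tm)

# Toy stand-in for one corpus (hypothetical data, one row per line).
sample_text <- tibble(text = c(
  "The cat sat on the mat in 2020",
  "A cat and a dog met 3 times",
  "The dog chased the cat"
))

word_counts <- sample_text %>%
  mutate(text = removeNumbers(text)) %>%   # tm: strip digits from each line
  unnest_tokens(word, text) %>%            # tidytext: one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  count(word, sort = TRUE)                 # dplyr: frequency of each remaining word

print(word_counts)
```

The resulting table, sorted by frequency, is the kind of input a word-frequency plot needs; passing `by = "word"` explicitly also avoids the join message that dplyr otherwise prints.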
##       FileName FileSize   Lines LinesNEmpty     Chars CharsNWhite    Words
## 1   blogs.data 255.4 Mb  899288      899288 206824382   170389539 37570839
## 2    news.data  19.8 Mb   77259       77259  15639408    13072698  2651432
## 3 twitter.data   319 Mb 2360148     2360148 162096241   134082806 30451170