
Text Mining with R: Exploratory Data Analysis

For Non-Analysts

For this project we analyzed three data sets: Blogs, News, and Twitter.

The approach is simple: for each data set we produced a histogram and a statistical summary.

Each histogram shows the most frequent words in its data set. We excluded words such as “the”, which are very common but add nothing to this analysis.

The statistical summary gives a quick comparison of the three data sets by number of lines, number of characters, number of words, and file size.

Technical Explanation for Data Scientists.

For the statistical summary, we use the stringi package for its stri_stats functions. We use the tm package to remove numbers, which are very common in the text but carry no meaning for this analysis. We use tidytext to split each data set into one-word tokens, one word per row. Finally, we use dplyr for its pipe operator.
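
The word-counting steps described above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the report: the `sample_lines` vector is a made-up stand-in for lines read from one of the data files, and the use of tidytext's `stop_words` table with `anti_join()` is an assumption based on the "Joining, by = "word"" messages in the output.

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical sample input; the real input would be the lines of a data file.
sample_lines <- c("The 3 cats sat on the mat in 2021.",
                  "Cats and dogs: the cats won 2 times.")

word_counts <- data.frame(text = removeNumbers(sample_lines),  # tm: strip digits
                          stringsAsFactors = FALSE) %>%
  unnest_tokens(word, text) %>%            # tidytext: one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  count(word, sort = TRUE)                 # dplyr: frequency of each word

print(head(word_counts))
```

Passing `by = "word"` explicitly also silences the join message that `anti_join()` otherwise prints.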

Plots
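
A plot of the most frequent words could be drawn along these lines. This is only a sketch assuming ggplot2 was used for the figures; the `top_words` data frame and its counts are invented for illustration and are not results from the data sets.

```r
library(ggplot2)

# Made-up word counts, standing in for the top rows of a real frequency table.
top_words <- data.frame(word = c("time", "people", "day"),
                        n = c(120, 95, 80))

# Horizontal bar chart, with the most frequent word at the top.
p <- ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count", title = "Most frequent words (sample data)")

print(p)
```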


##       FileName FileSize   Lines LinesNEmpty     Chars CharsNWhite    Words
## 1   blogs.data 255.4 Mb  899288      899288 206824382   170389539 37570839
## 2    news.data  19.8 Mb   77259       77259  15639408    13072698  2651432
## 3 twitter.data   319 Mb 2360148     2360148 162096241   134082806 30451170
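
One row of a summary like the one above can be assembled from the stringi statistics functions, as in the hedged sketch below. The `sample_lines` vector is a hypothetical stand-in for a file's contents, and taking the word count from `stri_stats_latex()` is an assumption; the file size column (for example via `file.size()`) is left out because the sketch does not read a file.

```r
library(stringi)

# Hypothetical sample input: three lines, one of them empty.
sample_lines <- c("Hello world, this is a blog post.",
                  "",
                  "Second line here.")

general <- stri_stats_general(sample_lines)  # Lines, LinesNEmpty, Chars, CharsNWhite
latex   <- stri_stats_latex(sample_lines)    # includes a Words count

summary_row <- data.frame(
  Lines       = general[["Lines"]],
  LinesNEmpty = general[["LinesNEmpty"]],
  Chars       = general[["Chars"]],
  CharsNWhite = general[["CharsNWhite"]],
  Words       = latex[["Words"]]
)

print(summary_row)
```

Building one such row per data set and combining them with `rbind()` gives a comparison table of the same shape.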