The objective of this report is to carry out Exploratory Data Analysis on the given data sets. The data sets contain text in the English language. There are three different data sets, namely:
blogs_data_sets <- "en_US.blogs.txt"
news_data_sets <- "en_US.news.txt"
twitter_data_sets <- "en_US.twitter.txt"
To carry out the Exploratory Data Analysis, the characteristics and contents of each data set will be explored, and the data will then be cleaned to reveal the distinguishing characteristics of each set. Characteristics such as the most and least frequent words will be identified by the end of this report.
The following table shows the size of each data set in MB:
## Data Sets File Sizes (MB) %
## 1: Blogs 200.4242 36
## 2: News 196.2775 35
## 3: Twitter 159.3641 29
## 4: =>Total 556.0658 100
The table above shows that the Blogs data set has the largest file size at roughly 200 MB, followed by the News data set at 196 MB and the Twitter data set at 159 MB.
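These sizes can be computed with file.info(); the following is a minimal sketch, assuming the three files sit in the working directory as in the chunk above.

library(data.table)

files <- c(Blogs = blogs_data_sets, News = news_data_sets, Twitter = twitter_data_sets)
sizes_mb <- file.info(files)$size / 1024^2            # bytes to MB
table_sizes <- data.table(`Data Sets` = names(files),
                          `File Sizes (MB)` = sizes_mb,
                          `%` = round(100 * sizes_mb / sum(sizes_mb)))
table_sizes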
The following table shows the number of lines in each data set:
## Data Sets Nos. Of Lines %
## 1: Blogs 899288 21
## 2: News 1010242 24
## 3: Twitter 2360148 55
## 4: =>Total 4269678 100
The table above shows that the Twitter data set has the most lines with 2,360,148, followed by the News data set with 1,010,242 lines and the Blogs data set with 899,288 lines.
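One way these counts could be produced is to read each file with readLines() and take length(); skipNul = TRUE is an assumption, to guard against any embedded NUL characters.

blogs   <- readLines(blogs_data_sets, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(news_data_sets, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_data_sets, encoding = "UTF-8", skipNul = TRUE)
c(Blogs = length(blogs), News = length(news), Twitter = length(twitter))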
To get a glimpse of the data sets, the first line of each is shown below (Blogs, News, and Twitter, respectively).
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [1] "He wasn't home alone, apparently."
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
The following table shows the number of characters in each data set:
## Data Sets Nos. Of Characters %
## 1: Blogs 206824505 36
## 2: News 203223159 36
## 3: Twitter 162096031 28
## 4: =>Total 572143695 100
The table above clearly shows that the Blogs data set has the most characters, followed by the News and Twitter data sets.
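These totals can be obtained by summing nchar() over the lines read earlier:

c(Blogs = sum(nchar(blogs)),
  News = sum(nchar(news)),
  Twitter = sum(nchar(twitter)))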
The following boxplot shows the distribution of the number of characters per line for each data set:
The boxplot above shows that the Blogs data set has the most characters per line, followed by the News and Twitter data sets.
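A sketch of how such a boxplot could be drawn with ggplot2 (attached per sessionInfo() below); the log scale is an assumption, to tame the long-tailed line lengths.

library(ggplot2)

chars_per_line <- data.table(
  dataset = rep(c("Blogs", "News", "Twitter"),
                c(length(blogs), length(news), length(twitter))),
  chars = c(nchar(blogs), nchar(news), nchar(twitter)))
ggplot(chars_per_line, aes(dataset, chars)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(x = "Data Sets", y = "Characters per Line")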
The following table shows the number of words in each data set.
## Data Sets Nos. Of Words %
## 1: Blogs 37334131 37
## 2: News 34372530 34
## 3: Twitter 30373543 30
## 4: =>Total 102080204 100
The table above shows that the Blogs data set has the most words, followed by the News and Twitter data sets.
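One way to obtain these counts is wordcount() from the ngram package (attached per sessionInfo() below), which splits on whitespace; whether this exactly matches the counting used above is an assumption.

library(ngram)

c(Blogs = wordcount(blogs),
  News = wordcount(news),
  Twitter = wordcount(twitter))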
The following table summarizes the characteristics of each data set.
## Data Sets File Sizes (MB) Nos. Of Lines Nos. Of Characters Nos. Of Words
## 1: Blogs 200.4242 899288 206824505 37334131
## 2: News 196.2775 1010242 203223159 34372530
## 3: Twitter 159.3641 2360148 162096031 30373543
## 4: =>Total 556.0658 4269678 572143695 102080204
Based on the above table, the findings for each data set are as follows:
The Blogs data set has the largest file size and the highest numbers of characters and words.
The Twitter data set has the highest number of lines.
To examine the detailed characteristics of each data set in terms of word frequency, a sample of each data set is created as detailed below. For this report, a 5% sample of the lines is taken as sufficient to capture the main characteristics of the data sets.
## Data Sets Sample Size %
## 1: Blogs 44964.4 21
## 2: News 50512.1 24
## 3: Twitter 118007.4 55
## 4: =>Total 213483.9 100
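A reproducible way to draw these 5% samples (the seed value is an assumption):

set.seed(1234)  # assumed seed, for reproducibility
sample_blogs   <- sample(blogs, round(length(blogs) * 0.05))
sample_news    <- sample(news, round(length(news) * 0.05))
sample_twitter <- sample(twitter, round(length(twitter) * 0.05))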
The sample data sets must be cleaned before further analysis is carried out. The following tasks are performed on the sample data sets (a code sketch follows the list):
Removing URLs and anything other than English letters or spaces.
Transforming all text to lower case.
Removing stopwords.
Removing extra whitespace.
Removing sparse items.
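A minimal sketch of how the first four steps could be wired up with the tm package (the function name and the URL regex are assumptions; sparse items are typically removed later with removeSparseTerms() on a document-term matrix, as sketched further below).

library(tm)

clean_corpus <- function(lines) {
  corpus <- VCorpus(VectorSource(lines))
  strip <- content_transformer(function(x, pattern) gsub(pattern, "", x))
  corpus <- tm_map(corpus, strip, "(f|ht)tps?://\\S+")         # remove URLs
  corpus <- tm_map(corpus, strip, "[^a-zA-Z ]")                # keep letters/space; "don't" becomes "dont", as in the sample below
  corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stopwords
  corpus <- tm_map(corpus, stripWhitespace)                    # squeeze whitespace
  corpus
}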
To get a glimpse of the sample data sets, the first line of each is shown below.
## [1] "***As many of you have pointed out, you don't need to spend $20 on ONE diaper. You can buy a dozen prefolds and 3 covers for under $50...and that would be enough diapers to get you through an entire day."
## [1] "Lensing said he acquired the stone in late 2010; he declines to discuss the purchase price. He said that Ragan tried to sell the stone to at least one auction house in Dallas, but \"they didn't want anything to do with it.\" He added that the auction house recommended she contact Lensing because of the range of his Kennedy collectibles."
## [1] "Oh I sure hope not hahaha..."
To get a glimpse of the cleaned sample data sets, the first line of each is shown below.
## [1] " many pointed dont need spend one diaper can buy dozen prefolds covers enough diapers get entire day"
## [1] "lensing said acquired stone late declines discuss purchase price said ragan tried sell stone least one auction house dallas didnt want anything added auction house recommended contact lensing range kennedy collectibles"
## [1] "oh sure hope hahaha"
After the data cleaning process is complete, the most frequent words are identified. The following table shows the 10 most frequent words in each data set (Blogs, News, and Twitter, from left to right).
## Words count Words count Words count
## 1: one 6063 said 12524 just 7588
## 2: will 5573 will 5384 like 6035
## 3: just 4938 one 4046 get 5756
## 4: can 4878 new 3506 love 5283
## 5: like 4856 also 2968 good 4944
## 6: time 4344 can 2926 will 4710
## 7: get 3525 two 2840 dont 4498
## 8: know 3010 year 2818 thanks 4415
## 9: people 2976 just 2611 day 4410
## 10: now 2919 time 2598 can 4372
Based on the above table, the findings are as follows:
The words will, just, and can appear among the frequent words in all three data sets.
The most frequent word in the Blogs data set is one.
The most frequent word in the News data set is said.
The most frequent word in the Twitter data set is just.
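The frequency tables in this report could be produced along these lines: build a document-term matrix from a cleaned corpus, drop sparse terms, and sum the columns. The sparsity threshold and the use of the Blogs sample here are assumptions.

clean_blogs <- clean_corpus(sample_blogs)     # from the cleaning sketch above
dtm <- DocumentTermMatrix(clean_blogs)
dtm <- removeSparseTerms(dtm, 0.999)          # assumed sparsity threshold
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
blogs_doc_features <- data.table(Words = names(freq), count = as.integer(freq))
head(blogs_doc_features, 10)                  # the 10 most frequent words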
The following table shows the 10 least frequent words in each data set (Blogs, News, and Twitter, from left to right).
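# Place the 10 least frequent words from each data set side by side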
table_least <- data.table(blogs_doc_features[order(count)][1:10], news_doc_features[order(count)][1:10], twitter_doc_features[order(count)][1:10])
table_least
## Words count Words count Words count
## 1: onto 234 either 255 saw 605
## 2: third 235 calls 259 mean 607
## 3: sitting 238 biggest 261 hit 609
## 4: awesome 239 effort 265 making 613
## 5: kept 240 paul 265 stay 613
## 6: six 242 chris 266 might 614
## 7: recent 243 media 266 favorite 620
## 8: original 243 star 267 yet 625
## 9: tomorrow 247 single 267 wow 627
## 10: present 247 investigation 268 looks 645
Based on the above table, the findings are as follows:
The least frequent word in the Blogs data set is onto.
The least frequent word in the News data set is either.
The least frequent word in the Twitter data set is saw.
The following plots show the most frequent words in each data set.
The following word clouds are based on the most frequent words in each data set.
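A sketch of how one such word cloud could be generated with the wordcloud package; the word limit and palette are assumptions, and blogs_doc_features is the frequency table built above.

library(wordcloud)
library(RColorBrewer)

wordcloud(words = blogs_doc_features$Words,
          freq = blogs_doc_features$count,
          max.words = 100,
          colors = brewer.pal(8, "Dark2"))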
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 tibble_3.0.1
## [5] tidyverse_1.3.0 tidyr_1.1.0 readr_1.3.1 dtplyr_1.0.1
## [9] wordcloud_2.6 RColorBrewer_1.1-2 ggthemes_4.2.0 ggplot2_3.3.2
## [13] data.table_1.12.8 knitr_1.29 dplyr_1.0.0 ngram_3.0.4
## [17] tm_0.7-7 NLP_0.2-0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.15 slam_0.1-47 haven_2.3.1
## [5] lattice_0.20-41 colorspace_1.4-1 vctrs_0.3.1 generics_0.0.2
## [9] htmltools_0.5.0 blob_1.2.1 rlang_0.4.6 pillar_1.4.4
## [13] glue_1.4.1 withr_2.2.0 DBI_1.1.0 dbplyr_1.4.4
## [17] readxl_1.3.1 modelr_0.1.8 lifecycle_0.2.0 cellranger_1.1.0
## [21] munsell_0.5.0 gtable_0.3.0 rvest_0.3.5 evaluate_0.14
## [25] labeling_0.3 parallel_4.0.0 fansi_0.4.1 broom_0.5.6
## [29] Rcpp_1.0.5 scales_1.1.1 backports_1.1.8 jsonlite_1.7.0
## [33] farver_2.0.3 fs_1.4.2 hms_0.5.3 digest_0.6.25
## [37] stringi_1.4.6 grid_4.0.0 cli_2.0.2 tools_4.0.0
## [41] magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1
## [45] xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 rstudioapi_0.11
## [49] assertthat_0.2.1 rmarkdown_2.3 httr_1.4.1 R6_2.4.1
## [53] nlme_3.1-148 compiler_4.0.0