Data Science Capstone Project:

Milestone Report


Objectives

The objective of this report is to carry out Exploratory Data Analysis (EDA) on a given collection of English-language text data sets. There are three data sets, namely:

# File names of the three English-language data sets
blogs_data_sets   <- "en_US.blogs.txt"
news_data_sets    <- "en_US.news.txt"
twitter_data_sets <- "en_US.twitter.txt"

To carry out the Exploratory Data Analysis, the characteristics and contents of each data set are first explored, and the data are then cleaned to uncover their distinguishing features. Characteristics such as the most frequent words are identified at the end of this report.
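
As a minimal sketch of this first step (the blogs/news/twitter object names are my own), the raw files could be read line by line as follows:

# Read each data set; skipNul guards against embedded NUL characters,
# and warn = FALSE silences incomplete-final-line warnings
blogs   <- readLines(blogs_data_sets,   encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news    <- readLines(news_data_sets,    encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines(twitter_data_sets, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)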


Exploratory Data Analysis

The following table shows the size of each data set in MB:

##    Data Sets File Sizes (MB)   %
## 1:     Blogs        200.4242  36
## 2:      News        196.2775  35
## 3:   Twitter        159.3641  29
## 4:   =>Total        556.0658 100

The above table shows that the Blogs data set has the largest file size at about 200 MB, followed by the News data set at about 196 MB and the Twitter data set at about 159 MB.
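
As a sketch, these sizes could be computed in base R from the file-name variables defined above:

# File sizes in megabytes (bytes / 1024^2)
sizes_mb <- file.size(c(blogs_data_sets, news_data_sets, twitter_data_sets)) / 1024^2
round(sizes_mb, 4)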

The following table shows the number of lines in each data set:

##    Data Sets Nos. Of Lines   %
## 1:     Blogs        899288  21
## 2:      News       1010242  24
## 3:   Twitter       2360148  55
## 4:   =>Total       4269678 100

The above table shows that the Twitter data set has the most lines (2,360,148), followed by the News data set (1,010,242) and the Blogs data set (899,288).
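
Assuming the blogs, news and twitter vectors from the reading sketch above, the line counts are simply the vector lengths:

# readLines() returns one element per line, so length() counts lines
line_counts <- c(Blogs = length(blogs), News = length(news), Twitter = length(twitter))
line_counts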

To get a glimpse of the data sets, the following shows the first line of each one.

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [1] "He wasn't home alone, apparently."
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

The following table shows the number of characters in each data set:

##    Data Sets Nos. Of Characters   %
## 1:     Blogs          206824505  36
## 2:      News          203223159  36
## 3:   Twitter          162096031  28
## 4:   =>Total          572143695 100

The above table shows that the Blogs data set has the most characters, followed by the News and Twitter data sets.
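
A sketch of the character counts, using the same hypothetical vectors:

# nchar() gives per-line character counts; sum them per data set
char_counts <- c(Blogs   = sum(nchar(blogs)),
                 News    = sum(nchar(news)),
                 Twitter = sum(nchar(twitter)))
char_counts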

The following boxplot shows the distribution of characters per line for each data set:

The boxplot shows that the Blogs data set has the most characters per line, followed by the News and Twitter data sets.
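
A sketch of one way to draw such a boxplot in base graphics (the log scale is my choice, to tame the long right tail of line lengths):

# Distribution of characters per line for each data set
boxplot(nchar(blogs), nchar(news), nchar(twitter),
        names = c("Blogs", "News", "Twitter"),
        log = "y", ylab = "Characters per line")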

The following table shows the number of words in each data set:

##    Data Sets Nos. Of Words   %
## 1:     Blogs      37334131  37
## 2:      News      34372530  34
## 3:   Twitter      30373543  30
## 4:   =>Total     102080204 100

The above table shows that the Blogs data set has the most words, followed by the News and Twitter data sets.
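
Word counts can be approximated by counting whitespace-separated tokens, for example with stringr (attached per the session info below); the exact figures depend on the tokenization rule used:

# Count runs of non-whitespace ("words") on every line, then total them
word_counts <- c(Blogs   = sum(stringr::str_count(blogs,   "\\S+")),
                 News    = sum(stringr::str_count(news,    "\\S+")),
                 Twitter = sum(stringr::str_count(twitter, "\\S+")))
word_counts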

The following table summarizes the characteristics of each data set:

##    Data Sets File Sizes (MB) Nos. Of Lines Nos. Of Characters Nos. Of Words
## 1:     Blogs        200.4242        899288          206824505      37334131
## 2:      News        196.2775       1010242          203223159      34372530
## 3:   Twitter        159.3641       2360148          162096031      30373543
## 4:   =>Total        556.0658       4269678          572143695     102080204

Based on the above table, the findings are as follows:

  1. The Blogs data set has the largest file size and the highest numbers of characters and words.

  2. The Twitter data set has the highest number of lines.


Data Sampling

To examine the detailed characteristics of each data set in terms of word frequency, a sample of each data set is created as detailed below. For this report, a 5% sample is taken as sufficient to capture the main characteristics of the data sets.

##    Data Sets Sample Size   %
## 1:     Blogs     44964.4  21
## 2:      News     50512.1  24
## 3:   Twitter    118007.4  55
## 4:   =>Total    213483.9 100
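
A sketch of the sampling step, again assuming the blogs/news/twitter vectors from above (the seed value is arbitrary, chosen only for reproducibility):

set.seed(1234)  # arbitrary seed so the sample is reproducible
sample_frac    <- 0.05
blogs_sample   <- sample(blogs,   round(length(blogs)   * sample_frac))
news_sample    <- sample(news,    round(length(news)    * sample_frac))
twitter_sample <- sample(twitter, round(length(twitter) * sample_frac))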

The sample data sets are cleaned before further analysis is carried out. The following tasks are performed on the samples (a sketch of these steps appears after the list):

  1. Removing URLs and anything other than English letters or spaces.

  2. Transforming all text to lower case.

  3. Removing stopwords.

  4. Removing extra whitespace.

  5. Removing sparse terms.
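
A sketch of steps 1-4 using tm (attached per the session info); step 5 is typically applied later with removeSparseTerms() on the term matrix. The corpus and sample object names are my own:

library(tm)

# Wrap the sampled lines in a corpus and apply the cleaning transformations
corpus <- VCorpus(VectorSource(c(blogs_sample, news_sample, twitter_sample)))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\S+", "", x)))   # drop URLs
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^A-Za-z ]", "", x))) # letters and spaces only
corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
corpus <- tm_map(corpus, removeWords, stopwords("en"))  # drop English stopwords
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated whitespace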

To get a glimpse of the sample data sets, the following shows the first line of each one.

## [1] "***As many of you have pointed out, you don't need to spend $20 on ONE diaper. You can buy a dozen prefolds and 3 covers for under $50...and that would be enough diapers to get you through an entire day."
## [1] "Lensing said he acquired the stone in late 2010; he declines to discuss the purchase price. He said that Ragan tried to sell the stone to at least one auction house in Dallas, but \"they didn't want anything to do with it.\" He added that the auction house recommended she contact Lensing because of the range of his Kennedy collectibles."
## [1] "Oh I sure hope not hahaha..."

For comparison, the following shows the first line of each cleaned sample data set.

## [1] " many pointed dont need spend one diaper can buy dozen prefolds covers enough diapers get entire day"
## [1] "lensing said acquired stone late declines discuss purchase price said ragan tried sell stone least one auction house dallas didnt want anything added auction house recommended contact lensing range kennedy collectibles"
## [1] "oh sure hope hahaha"

After completing the data cleaning process, the most frequent words are identified. The following table shows the 10 most frequent words in each data set (columns, left to right: Blogs, News, Twitter).
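
A sketch of how a per-data-set frequency table such as blogs_doc_features (used in the code further below) could be derived, assuming a cleaned per-data-set corpus named blogs_corpus:

library(tm)
library(data.table)

# Term-document matrix -> total count per term -> frequency data.table
tdm  <- TermDocumentMatrix(blogs_corpus)  # blogs_corpus: hypothetical cleaned corpus
freq <- slam::row_sums(tdm)               # sum each term's counts across documents
blogs_doc_features <- data.table(Words = names(freq), count = as.integer(freq))
blogs_doc_features[order(-count)][1:10]   # 10 most frequent words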

##      Words count Words count  Words count
##  1:    one  6063  said 12524   just  7588
##  2:   will  5573  will  5384   like  6035
##  3:   just  4938   one  4046    get  5756
##  4:    can  4878   new  3506   love  5283
##  5:   like  4856  also  2968   good  4944
##  6:   time  4344   can  2926   will  4710
##  7:    get  3525   two  2840   dont  4498
##  8:   know  3010  year  2818 thanks  4415
##  9: people  2976  just  2611    day  4410
## 10:    now  2919  time  2598    can  4372

Based on the above table, the findings are as follows:

  1. The words will, just and can appear among the frequent words in all three data sets.

  2. The most frequent word in the Blogs data set is one.

  3. The most frequent word in the News data set is said.

  4. The most frequent word in the Twitter data set is just.

The following table shows the 10 least frequent words in each data set (columns, left to right: Blogs, News, Twitter).

# Sort each frequency table ascending by count and keep the 10 rarest words
table_least <- data.table(blogs_doc_features[order(count)][1:10],
                          news_doc_features[order(count)][1:10],
                          twitter_doc_features[order(count)][1:10])
table_least
##        Words count         Words count    Words count
##  1:     onto   234        either   255      saw   605
##  2:    third   235         calls   259     mean   607
##  3:  sitting   238       biggest   261      hit   609
##  4:  awesome   239        effort   265   making   613
##  5:     kept   240          paul   265     stay   613
##  6:      six   242         chris   266    might   614
##  7:   recent   243         media   266 favorite   620
##  8: original   243          star   267      yet   625
##  9: tomorrow   247        single   267      wow   627
## 10:  present   247 investigation   268    looks   645

Based on the above table, the findings are as follows:

  1. The least frequent word shown for the Blogs data set is onto.

  2. The least frequent word shown for the News data set is either.

  3. The least frequent word shown for the Twitter data set is saw.

The following plots show the most frequent words in each data set.
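
A sketch of one such plot with ggplot2, reusing the hypothetical blogs_doc_features table:

library(ggplot2)

# Horizontal bar chart of the 10 most frequent Blogs words
top10 <- blogs_doc_features[order(-count)][1:10]
ggplot(top10, aes(x = reorder(Words, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Blogs: 10 Most Frequent Words")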




The following word clouds are based on the most frequent words in each data set.


Blogs Word Cloud


News Word Cloud


Twitter Word Cloud
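
A sketch of how such a cloud could be drawn with the wordcloud package (the parameters are illustrative, not necessarily those used above):

library(wordcloud)

# Word cloud sized by frequency; the most frequent words are placed centrally
with(blogs_doc_features,
     wordcloud(Words, count, max.words = 100, random.order = FALSE,
               colors = brewer.pal(8, "Dark2")))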


Session info


sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.0      stringr_1.4.0      purrr_0.3.4        tibble_3.0.1      
##  [5] tidyverse_1.3.0    tidyr_1.1.0        readr_1.3.1        dtplyr_1.0.1      
##  [9] wordcloud_2.6      RColorBrewer_1.1-2 ggthemes_4.2.0     ggplot2_3.3.2     
## [13] data.table_1.12.8  knitr_1.29         dplyr_1.0.0        ngram_3.0.4       
## [17] tm_0.7-7           NLP_0.2-0         
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0 xfun_0.15        slam_0.1-47      haven_2.3.1     
##  [5] lattice_0.20-41  colorspace_1.4-1 vctrs_0.3.1      generics_0.0.2  
##  [9] htmltools_0.5.0  blob_1.2.1       rlang_0.4.6      pillar_1.4.4    
## [13] glue_1.4.1       withr_2.2.0      DBI_1.1.0        dbplyr_1.4.4    
## [17] readxl_1.3.1     modelr_0.1.8     lifecycle_0.2.0  cellranger_1.1.0
## [21] munsell_0.5.0    gtable_0.3.0     rvest_0.3.5      evaluate_0.14   
## [25] labeling_0.3     parallel_4.0.0   fansi_0.4.1      broom_0.5.6     
## [29] Rcpp_1.0.5       scales_1.1.1     backports_1.1.8  jsonlite_1.7.0  
## [33] farver_2.0.3     fs_1.4.2         hms_0.5.3        digest_0.6.25   
## [37] stringi_1.4.6    grid_4.0.0       cli_2.0.2        tools_4.0.0     
## [41] magrittr_1.5     crayon_1.3.4     pkgconfig_2.0.3  ellipsis_0.3.1  
## [45] xml2_1.3.2       reprex_0.3.0     lubridate_1.7.9  rstudioapi_0.11 
## [49] assertthat_0.2.1 rmarkdown_2.3    httr_1.4.1       R6_2.4.1        
## [53] nlme_3.1-148     compiler_4.0.0