Data Science Capstone. Milestone Report.

Alessandro Galletto

7/1/2020

1 GETTING AND CLEANING THE DATA

INTRODUCTION

This report is part of the John’s Hopkins Data Science Course Spacialization. The assignment requires to analyse several files with common txets in order to model the usual pattern of texts and develop an algorithm in order to suggest the following word to the user as Swiftjey’s software performs.

Source of the data: Content archived from heliohost.org on September 30, 2016 and retrieved via Wayback Machine on April 24, 2017.

The data is collected by a web archiver that allows us to have the text published. This zip contains several languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian but I’m going to process only the English language. Anyway, all that is deveolpped in this assigment can be applied to other languages.

Reading of the english files:

## [1] TRUE

Summary of the data sets:

source size.MB num.lines
blogs 200.4 899288
news 196.3 1010242
twitter 159.4 2360148

I’m going to use the tm package. Cleaning the data from stopwords, stemming, removing whitespaces

Creating the term-document matrix and inspecting it

## <<DocumentTermMatrix (documents: 1, terms: 4668)>>
## Non-/sparse entries: 4668/0
## Sparsity           : 0%
## Maximal term length: 31
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs can day get just like make one said time will
##    1  52  51  80   84   72   55  71   81   61   83

2 EXPLORATORY DATA ANALYSIS

Zipf’s law is an empirical law formulated using mathematical statistics that refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions. We can see from the plot that the frequency is similar to expected.

## (Intercept)           x 
##   6.3840979  -0.7842895

Most frequent words

Word cloud

Topics model

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
said just said said get will get like
like one like one just get time year
just get just just like can one will
make will time time say said love time
can think get will will one make say
day said know make know love just said

3 MODELING

Identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.

Bi-gram

Tri-gram

Quad-gram

4 NEXT STEPS

This is the initial exploratory data analysis and modeling of the capstone project. The processing time is important because the data -although is not very big- the combinations for bigrams and trigrams make the CPU heavy. Is is important to select the best model and efficient functions for the final model of the capstone. The initial idea for the model is to use the bigram and trigram in order to predict the following word.


5 APPENDIX. ENVIRONMENT USED

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.10             lda_1.4.2           topicmodels_0.2-9  
##  [4] wordcloud_2.6       RColorBrewer_1.1-2  RWeka_0.4-41       
##  [7] SnowballC_0.6.0     tm_0.7-7            NLP_0.2-0          
## [10] GGally_1.4.0        factoextra_1.0.6    plotly_4.9.1       
## [13] knitr_1.26          randomForest_4.6-14 rattle_5.3.0       
## [16] rpart.plot_3.0.8    rpart_4.1-15        caret_6.0-84       
## [19] lattice_0.20-38     rlist_0.4.6.1       ggthemes_4.2.0     
## [22] R.utils_2.9.2       R.oo_1.23.0         R.methodsS3_1.7.1  
## [25] data.table_1.12.6   reshape2_1.4.3      reshape_0.8.8      
## [28] ggcorrplot_0.1.3    ggplot2_3.2.1       dplyr_0.8.3        
## [31] stringr_1.4.0       readr_1.3.1        
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-142       lubridate_1.7.4    httr_1.4.1         tools_3.6.1       
##  [5] backports_1.1.5    R6_2.4.1           lazyeval_0.2.2     colorspace_1.4-1  
##  [9] nnet_7.3-12        withr_2.1.2        tidyselect_0.2.5   compiler_3.6.1    
## [13] xml2_1.2.2         labeling_0.3       slam_0.1-47        scales_1.1.0      
## [17] digest_0.6.23      rmarkdown_1.18     pkgconfig_2.0.3    htmltools_0.4.0   
## [21] highr_0.8          RWekajars_3.9.3-2  htmlwidgets_1.5.1  rlang_0.4.2       
## [25] prettydoc_0.3.1    farver_2.0.1       generics_0.0.2     jsonlite_1.6      
## [29] ModelMetrics_1.2.2 magrittr_1.5       modeltools_0.2-22  Matrix_1.2-18     
## [33] Rcpp_1.0.3         munsell_0.5.0      lifecycle_0.1.0    stringi_1.4.3     
## [37] yaml_2.2.0         MASS_7.3-51.4      plyr_1.8.4         recipes_0.1.7     
## [41] grid_3.6.1         parallel_3.6.1     ggrepel_0.8.1      crayon_1.3.4      
## [45] splines_3.6.1      hms_0.5.2          zeallot_0.1.0      pillar_1.4.2      
## [49] codetools_0.2-16   stats4_3.6.1       glue_1.3.1         evaluate_0.14     
## [53] vctrs_0.2.0        foreach_1.4.7      gtable_0.3.0       purrr_0.3.3       
## [57] tidyr_1.0.0        assertthat_0.2.1   xfun_0.11          gower_0.2.1       
## [61] prodlim_2019.11.13 class_7.3-15       survival_3.1-8     viridisLite_0.3.0 
## [65] timeDate_3043.102  tibble_2.1.3       rJava_0.9-11       iterators_1.0.12  
## [69] lava_1.6.6         ipred_0.9-9
 

A work by Alessandro Galletto

alessandro.galletto@iese.net