The goal of this report is to show that I have become familiar with the data and that I am on track to create a prediction algorithm. It explains my exploratory analysis and my goals for the eventual app and algorithm. I cover only the major features of the data that I have identified and briefly summarize my plans for the prediction algorithm and Shiny app in a way that should be understandable to a non-data-scientist manager. Tables and plots illustrate important summaries of the data set.
After successfully loading the data (Blogs, News, and Twitter), I sampled 3% of it to improve performance, because cleaning and analyzing the full data set later would require a large amount of memory.
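The sampling step could look roughly like the sketch below. It is a minimal base R illustration, not the code used for this report: the function name `sample_lines`, the seed, and the toy input are all assumptions, and in practice the input would come from something like `readLines()` on each file.

```r
# Minimal sketch: keep a random fraction of the lines of a character vector.
# `sample_lines` and the toy input are hypothetical, not the report's own code.
sample_lines <- function(lines, fraction = 0.03, seed = 1234) {
  set.seed(seed)                                   # fixed seed for a reproducible sample
  n_keep <- max(1, round(length(lines) * fraction))
  keep <- sort(sample.int(length(lines), n_keep))  # sort to preserve the original order
  lines[keep]
}

toy_lines <- paste("line", 1:1000)   # stand-in for the lines of one source file
blog_sample <- sample_lines(toy_lines)
length(blog_sample)
```

Sampling whole lines keeps each sentence intact, which matters later when counting which words follow which.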
## Name: BadWords File Size: 4.9 Kb Lines in File: 77 Words in file: 85
## Name: Blogs File Size: 261483 Kb Lines in File: 899288 Words in file: 37546239
## Name: News File Size: 263516.6 Kb Lines in File: 1010242 Words in file: 34762395
## Name: Twitter File Size: 326645.5 Kb Lines in File: 2360148 Words in file: 30093413
## text
## Length:1
## Class :character
## Mode :character
## text
## Length:1
## Class :character
## Mode :character
## text
## Length:1
## Class :character
## Mode :character
## Author
## 1 Blog
## 2 News
## 3 Twitter
## Corpus consisting of 3 documents, showing 3 documents:
##
## Text Types Tokens Sentences Author
## text1 67679 1366974 43815 Blog
## text2 69714 1307340 32216 News
## text3 67860 1322457 42525 Twitter
## Corpus consisting of 3 documents and 1 docvar.
## text1 :
## "c("\"They're ready for us. They're panting for us. And we're..."
##
## text2 :
## "c("Net income attributable to Cummins in the first quarter o..."
##
## text3 :
## "c("We the US will be in Afghanistan well past 2024. Just lik..."
## Keyword-in-context with 5 matches.
## [text1, 377] it up to | love | again, that
## [text1, 1158] incredible. I | love | the video sequences
## [text1, 1973] " While I | love | all 3 of
## [text1, 2818] How to Make | Love | Like a P
## [text1, 3097] . But I | love | the imagery of
Now I will create a DocumentTermMatrix and a TermDocumentMatrix in order to explore the individual terms and term frequencies.
## <<DocumentTermMatrix (documents: 3, terms: 101250)>>
## Non-/sparse entries: 152881/150869
## Sparsity : 50%
## Maximal term length: 273
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can get just like new now one said time will
## 1 2812 2195 2933 2905 1639 1826 3765 1159 2715 3308
## 2 1884 1243 1578 1511 2106 1123 2531 7455 1745 3222
## 3 2640 3309 4552 3612 2157 2467 2395 557 2269 2900
## <<TermDocumentMatrix (terms: 101250, documents: 3)>>
## Non-/sparse entries: 152881/150869
## Sparsity : 50%
## Maximal term length: 273
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1 2 3
## can 2812 1884 2640
## get 2195 1243 3309
## just 2933 1578 4552
## like 2905 1511 3612
## new 1639 2106 2157
## now 1826 1123 2467
## one 3765 2531 2395
## said 1159 7455 557
## time 2715 1745 2269
## will 3308 3222 2900
## word frequency
## will will 9430
## said said 9171
## just just 9063
## one one 8691
## like like 8028
## can can 7336
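To make the term counts above concrete, here is a small base R sketch of how a term-document matrix and a total word frequency table can be built. The three toy sentences and the `tokenize` helper are assumptions for illustration only; the report itself uses DocumentTermMatrix and TermDocumentMatrix objects on the sampled data.

```r
# Toy documents standing in for the Blogs, News and Twitter samples (hypothetical text).
docs <- c(Blog    = "I just love time with one friend",
          News    = "officials said the plan will start on time",
          Twitter = "just like that one time")

# Lowercase, split on anything that is not a letter, drop empty strings.
tokenize <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z]+"))
  tokens[nzchar(tokens)]
}

tokens_by_doc <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens_by_doc)))

# Term-document matrix: one row per term, one column per document.
tdm <- vapply(tokens_by_doc,
              function(tok) as.integer(table(factor(tok, levels = vocab))),
              integer(length(vocab)))
rownames(tdm) <- vocab

# Total frequency of each term across all documents, most frequent first.
word_freq <- sort(rowSums(tdm), decreasing = TRUE)
head(word_freq, 3)
```

Summing each row of the matrix is the same arithmetic that gives, for example, 3308 + 3222 + 2900 = 9430 occurrences of "will" in the table above.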
To visualize the data, I fitted an LDA model with 4 topics to my DocumentTermMatrix. The first plot shows the words associated with each topic. Beta is the probability of a word being generated by a topic: the higher the beta, the more frequently the term appears in that topic.
## A LDA_VEM topic model with 4 topics.
## # A tibble: 405,000 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 –ashley 3.21e-32
## 2 2 –ashley 1.66e-33
## 3 3 –ashley 1.12e- 6
## 4 4 –ashley 2.24e- 6
## 5 1 –central 1.71e- 6
## 6 2 –central 4.77e-46
## 7 3 –central 9.87e-20
## 8 4 –central 2.87e-23
## 9 1 –diane 6.16e-53
## 10 2 –diane 2.25e- 6
## # … with 404,990 more rows
## # A tibble: 48,202 × 3
## document term count
## <chr> <chr> <dbl>
## 1 3 just 4552
## 2 3 like 3612
## 3 3 get 3309
## 4 3 love 3194
## 5 3 good 2999
## 6 3 will 2900
## 7 3 day 2711
## 8 3 thanks 2673
## 9 3 dont 2644
## 10 3 can 2640
## # … with 48,192 more rows
## # A tibble: 12 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 0.000442
## 2 2 1 0.997
## 3 3 1 0.000329
## 4 1 2 0.000153
## 5 2 2 0.0000485
## 6 3 2 0.912
## 7 1 3 0.607
## 8 2 3 0.000147
## 9 3 3 0.0000234
## 10 1 4 0.392
## 11 2 4 0.00289
## 12 3 4 0.0877
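The last table lists gamma, which in an LDA model is the estimated share of each document attributed to each topic. The base R sketch below copies the printed gamma values and picks the dominant topic per document. The variable names are my own, and the mapping of documents 1, 2 and 3 to Blog, News and Twitter is assumed from the order of the Author listing earlier in the report.

```r
# Gamma values copied from the document-topic table above.
gamma <- data.frame(
  document = rep(c("1", "2", "3"), times = 4),
  topic    = rep(1:4, each = 3),
  gamma    = c(0.000442, 0.997,     0.000329,
               0.000153, 0.0000485, 0.912,
               0.607,    0.000147,  0.0000234,
               0.392,    0.00289,   0.0877)
)

# Each document's gamma values should sum to about 1 (up to rounding in the printout).
gamma_totals <- round(tapply(gamma$gamma, gamma$document, sum), 2)
gamma_totals

# For each document, keep the row with the largest gamma (its dominant topic).
dominant <- do.call(rbind, lapply(split(gamma, gamma$document),
                                  function(d) d[which.max(d$gamma), ]))
dominant
```

On these values, document 2 is almost entirely topic 1 and document 3 is mostly topic 2, while document 1 is split between topics 3 and 4.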
The goal of this exercise is to create a product to highlight the prediction algorithm that I have built and to provide an interface that can be accessed by others.
My proposed Shiny app will take a phrase (multiple words) from a text box as input and output a prediction of the next word. I will look into several algorithms before making a final choice for the prediction algorithm used in the final app. For this report I have explored LDA.
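As one illustration of what a next-word predictor can look like, the base R sketch below counts word pairs (bigrams) and returns the most frequent follower of the last word typed. This is only one candidate approach and not the algorithm chosen for the app; the function names `build_bigrams` and `predict_next` and the training sentences are hypothetical.

```r
# Lowercase, split on anything that is not a letter or apostrophe, drop empty strings.
tokenize <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# Count how often each word follows each other word in the training lines.
# Assumes at least one line contains two or more words.
build_bigrams <- function(lines) {
  pairs <- do.call(rbind, lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < 2) return(NULL)
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  }))
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts[order(-counts$n), ]
}

# Return the most frequent follower of the last word in the phrase,
# or NA when the phrase is empty or the last word was never seen.
predict_next <- function(phrase, bigrams) {
  tok <- tokenize(phrase)
  if (length(tok) == 0) return(NA_character_)
  hits <- bigrams[bigrams$first == tok[length(tok)], ]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$second[which.max(hits$n)])
}

training <- c("I love the imagery", "I love the video", "we love new ideas")
bigrams <- build_bigrams(training)
predict_next("They really love", bigrams)
```

With this toy training text the call returns "the", because "the" follows "love" twice and "new" only once. A real app would need longer contexts and a fallback for unseen words.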
I will also prepare a slide deck of no more than 5 slides, created with RStudio Presenter, pitching my algorithm and app.