Introduction

The goal of this report is to show that I have become familiar with the data and that I am on track to create a prediction algorithm. It explains my exploratory analysis and my goals for the eventual app and algorithm. I cover only the major features of the data that I have identified and briefly summarize my plans for the prediction algorithm and Shiny app in a way that should be understandable to a manager who is not a data scientist. Tables and plots illustrate important summaries of the data set.

Data Loading and Sampling

After successfully loading the data (Blogs, News, and Twitter), I decided to sample 3% of it to improve performance, because cleaning and analyzing the full data set later would require a large memory footprint. The summary below describes the files as loaded.

## Name:  BadWords  File Size:  4.9 Kb  Lines in File:  77  Words in file:  85
## Name:  Blogs  File Size:  261483 Kb  Lines in File:  899288  Words in file:  37546239
## Name:  News  File Size:  263516.6 Kb  Lines in File:  1010242  Words in file:  34762395
## Name:  Twitter  File Size:  326645.5 Kb  Lines in File:  2360148  Words in file:  30093413
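
The 3% sampling step described above can be sketched roughly as follows. This is only an illustration in base R, not the exact code behind this report: the helper name `sample_lines`, the seed value, and the toy `blogs` vector (standing in for lines read with `readLines()`) are all assumptions.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction = 0.03, seed = 1234) {
  set.seed(seed)                                    # make the sample reproducible
  n_keep <- max(1, round(length(lines) * fraction)) # always keep at least one line
  keep <- sort(sample.int(length(lines), n_keep))   # sorted so line order is preserved
  lines[keep]
}

# Toy stand-in for one of the text files (made-up content)
blogs <- paste("line", 1:1000)
blogs_sample <- sample_lines(blogs)
length(blogs_sample)
```

Fixing the seed means the same sample is drawn on every run, which keeps later tables comparable between runs.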

Data Cleaning and Exploration

##      text          
##  Length:1          
##  Class :character  
##  Mode  :character
##      text          
##  Length:1          
##  Class :character  
##  Mode  :character
##      text          
##  Length:1          
##  Class :character  
##  Mode  :character

##    Author
## 1    Blog
## 2    News
## 3 Twitter
## Corpus consisting of 3 documents, showing 3 documents:
## 
##   Text Types  Tokens Sentences  Author
##  text1 67679 1366974     43815    Blog
##  text2 69714 1307340     32216    News
##  text3 67860 1322457     42525 Twitter
## Corpus consisting of 3 documents and 1 docvar.
## text1 :
## "c("\"They're ready for us. They're panting for us. And we're..."
## 
## text2 :
## "c("Net income attributable to Cummins in the first quarter o..."
## 
## text3 :
## "c("We the US will be in Afghanistan well past 2024. Just lik..."
## Keyword-in-context with 5 matches.                                                         
##   [text1, 377]      it up to | love | again, that        
##  [text1, 1158] incredible. I | love | the video sequences
##  [text1, 1973]     " While I | love | all 3 of           
##  [text1, 2818]   How to Make | Love | Like a P           
##  [text1, 3097]       . But I | love | the imagery of

Exploring Terms

Now I will create a DocumentTermMatrix and a TermDocumentMatrix in order to explore the individual terms and term frequencies.

## <<DocumentTermMatrix (documents: 3, terms: 101250)>>
## Non-/sparse entries: 152881/150869
## Sparsity           : 50%
## Maximal term length: 273
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs  can  get just like  new  now  one said time will
##    1 2812 2195 2933 2905 1639 1826 3765 1159 2715 3308
##    2 1884 1243 1578 1511 2106 1123 2531 7455 1745 3222
##    3 2640 3309 4552 3612 2157 2467 2395  557 2269 2900
## <<TermDocumentMatrix (terms: 101250, documents: 3)>>
## Non-/sparse entries: 152881/150869
## Sparsity           : 50%
## Maximal term length: 273
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms     1    2    3
##   can  2812 1884 2640
##   get  2195 1243 3309
##   just 2933 1578 4552
##   like 2905 1511 3612
##   new  1639 2106 2157
##   now  1826 1123 2467
##   one  3765 2531 2395
##   said 1159 7455  557
##   time 2715 1745 2269
##   will 3308 3222 2900
##      word frequency
## will will      9430
## said said      9171
## just just      9063
## one   one      8691
## like like      8028
## can   can      7336
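
To make the idea of a term-document matrix and the word frequency table more concrete, here is a small base R sketch. It does not use the text-mining packages behind the output above, and the three toy documents are made up for illustration.

```r
# Toy documents standing in for the Blog, News and Twitter samples (made-up text)
docs <- c(Blog    = "i love the new video and i will watch it again",
          News    = "the company said it will report new income",
          Twitter = "just love it thanks will see you")

# Tokenize: lower-case, then split on anything that is not a letter or apostrophe
tokens <- lapply(docs, function(d) {
  words <- strsplit(tolower(d), "[^a-z']+")[[1]]
  words[nzchar(words)]
})
names(tokens) <- names(docs)

# Term-document matrix: one row per term, one column per document
vocab <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(w) as.integer(table(factor(w, levels = vocab))),
              integer(length(vocab)))
rownames(tdm) <- vocab

# Overall frequency of each term across all documents, most frequent first
freq <- sort(rowSums(tdm), decreasing = TRUE)
head(data.frame(word = names(freq), frequency = freq), 6)
```

Transposing `tdm` with `t(tdm)` gives the document-term orientation, which is the only difference between the two matrices shown above.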

Data Visualization

In order to visualize the data, I fitted an LDA model with 4 topics to my DocumentTermMatrix. The first result shows, for every word and topic, the probability (beta) of that word being associated with that topic. The higher the beta, the more frequently the term appears in the topic.

## A LDA_VEM topic model with 4 topics.
## # A tibble: 405,000 × 3
##    topic term         beta
##    <int> <chr>       <dbl>
##  1     1 –ashley  3.21e-32
##  2     2 –ashley  1.66e-33
##  3     3 –ashley  1.12e- 6
##  4     4 –ashley  2.24e- 6
##  5     1 –central 1.71e- 6
##  6     2 –central 4.77e-46
##  7     3 –central 9.87e-20
##  8     4 –central 2.87e-23
##  9     1 –diane   6.16e-53
## 10     2 –diane   2.25e- 6
## # … with 404,990 more rows
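
As a small illustration of how beta can be read, the sketch below ranks terms within each topic by their beta value. The beta values and the helper name `top_terms` are invented for this example and are not taken from the fitted model.

```r
# Made-up per-topic word probabilities (each row sums to 1); not values from the model
beta <- rbind(
  c(love = 0.40, said = 0.05, just = 0.35, time = 0.20),
  c(love = 0.05, said = 0.60, just = 0.10, time = 0.25)
)
rownames(beta) <- c("topic1", "topic2")

# Hypothetical helper: the n terms with the highest beta in each topic
top_terms <- function(beta, n = 2) {
  result <- lapply(seq_len(nrow(beta)), function(i) {
    names(sort(beta[i, ], decreasing = TRUE))[seq_len(n)]
  })
  names(result) <- rownames(beta)
  result
}

top_terms(beta)
```

In this toy example "love" and "just" characterize the first topic, while "said" and "time" characterize the second.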

## # A tibble: 48,202 × 3
##    document term   count
##    <chr>    <chr>  <dbl>
##  1 3        just    4552
##  2 3        like    3612
##  3 3        get     3309
##  4 3        love    3194
##  5 3        good    2999
##  6 3        will    2900
##  7 3        day     2711
##  8 3        thanks  2673
##  9 3        dont    2644
## 10 3        can     2640
## # … with 48,192 more rows
## # A tibble: 12 × 3
##    document topic     gamma
##    <chr>    <int>     <dbl>
##  1 1            1 0.000442 
##  2 2            1 0.997    
##  3 3            1 0.000329 
##  4 1            2 0.000153 
##  5 2            2 0.0000485
##  6 3            2 0.912    
##  7 1            3 0.607    
##  8 2            3 0.000147 
##  9 3            3 0.0000234
## 10 1            4 0.392    
## 11 2            4 0.00289  
## 12 3            4 0.0877

Final Project Proposal for the Prediction Model

The goal of this exercise is to create a product to highlight the prediction algorithm that I have built and to provide an interface that can be accessed by others.

My proposed Shiny app will take a phrase (multiple words) from a text box as input and output a prediction of the next word. I will look into several algorithms before I make a final choice for the prediction algorithm used in the final app. For this report I have explored LDA.
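
To show what "phrase in, next word out" could look like, here is a minimal sketch of one candidate approach, a bigram lookup. I have not chosen this as the final algorithm; the function names `build_bigrams` and `predict_next` and the training sentences are assumptions made up for illustration.

```r
# Hypothetical helper: split text into lower-case word tokens
tokenize <- function(text) {
  w <- strsplit(tolower(text), "[^a-z']+")[[1]]
  w[nzchar(w)]
}

# Count how often each word is followed by each other word within a sentence
build_bigrams <- function(texts) {
  pairs <- lapply(texts, function(t) {
    w <- tokenize(t)
    if (length(w) < 2) return(NULL)
    data.frame(first = w[-length(w)], second = w[-1], stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
  # Most frequent continuation first within each leading word
  counts[order(counts$first, -counts$n, counts$second), ]
}

# Predict the most frequent follower of the last word in the phrase
predict_next <- function(phrase, bigrams) {
  w <- tokenize(phrase)
  if (length(w) == 0) return(NA_character_)
  hits <- bigrams[bigrams$first == w[length(w)], ]
  if (nrow(hits) == 0) return(NA_character_)   # unseen word: no prediction
  as.character(hits$second[1])
}

# Made-up training sentences
training <- c("I love the video", "I love the imagery",
              "I love all of it", "thanks for the video")
bigrams <- build_bigrams(training)
predict_next("I really love", bigrams)
```

A sketch like this uses only the last word of the phrase and returns `NA` for unseen words, so a usable app would likely need longer n-grams and some fallback strategy.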

I will also prepare a slide deck of no more than 5 slides, created with RStudio Presenter, pitching my algorithm and app.