The three topics selected for the following analysis are Politics, Laptops, and Food. The text for each topic was retrieved from the top 50 websites returned by a Google search.
The text content of each topic is stored in a separate document, and the three are then combined into a single document. Let's start the analysis with the word cloud and the co-occurrence graph (COG).
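For reference, below is a minimal sketch of how such a word cloud can be produced in R, assuming the scraped pages have already been read into a character vector `texts`; the vector name and the preprocessing choices here are illustrative, not the report's actual code.

```r
# Minimal word-cloud sketch using the tm and wordcloud packages.
# `texts` is assumed to hold the scraped text, one element per topic.
library(tm)
library(wordcloud)

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm  <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Plot the 100 most frequent tokens; size reflects corpus frequency.
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
```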
The word cloud of the combined document clearly shows the highest-frequency tokens, so the topics stand out distinctly from the rest of the document content.
The COG complements the word cloud by linking each high-frequency token to its most strongly associated terms, summarising the overall structure of the corpus.
Now, let's examine the lift and eta proportion values, which help estimate the probability of each topic appearing in each document.
Below are the theta values (topic-term probabilities) for the top tokens in the document:
## topic
## phrase 1 2 3
## snowman 5.075749e-10 1.845461e-09 2.920818e-04
## sliders 5.075751e-10 1.845461e-09 1.518814e-03
## freeshipping 2.126554e-09 3.513155e-04 2.933013e-09
## inches_inches 1.776377e-09 2.810529e-04 2.933013e-09
## adblock 1.603402e-09 2.459215e-04 2.933013e-09
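As a hedged illustration, theta values of this shape can be extracted from an LDA model fitted with the `topicmodels` package; the report's own fitting code is not shown, so the call below is an assumption.

```r
# Fit a 3-topic LDA on the document-term matrix (illustrative settings).
library(topicmodels)
lda <- LDA(dtm, k = 3, control = list(seed = 42))

# posterior()$terms is a (topic x term) matrix of P(term | topic);
# transposing gives one row per phrase and one column per topic,
# matching the theta table printed above.
theta <- t(posterior(lda)$terms)
head(theta[order(apply(theta, 1, max), decreasing = TRUE), ])
```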
Below are the omega values (document-topic proportions) for each document:
## topic
## document 1 2 3
## 1 0.0005590992 0.9989189550 0.0005219458
## 2 0.0002920402 0.9991193911 0.0005885687
## 3 0.0001596683 0.0003159075 0.9995244242
## 4 0.5129219500 0.4853479812 0.0017300689
## 5 0.8089190860 0.1887835890 0.0022973250
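Under the same assumed model, the omega values correspond to the fitted document-topic proportions:

```r
# posterior()$topics is a (document x topic) matrix: each row gives the
# estimated topic mix of one document, matching the omega table above.
omega <- posterior(lda)$topics
round(omega, 7)
```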
On further analysis, the lift values for the tokens under each topic are as below. Lift compares a token's probability within a topic to its overall probability in the corpus, lift(w, k) = P(w | topic k) / P(w), so values well above 1 flag tokens that are characteristic of that topic:
## topic
## phrase 1 2 3
## adageunicorns 1.750409e-05 5.183362e+00 1.087037e-04
## entrepreneur 2.786827e-03 5.173378e+00 1.731776e-04
## charles_pierce 8.064532e-06 7.466233e-06 8.618585e+00
## staggers 8.105816e-04 2.850908e-04 8.613571e+00
## smartcooky 7.791184e-06 1.800563e-05 8.618614e+00
## statesman 1.578546e-01 1.249039e-02 7.657820e+00
## foodnavigator 1.441350e+00 2.057864e-02 9.240647e-05
## confectionery 1.445557e+00 5.484517e-03 9.209458e-05
## exhaustive 1.442577e+00 1.615626e-02 6.600552e-05
## ratings_based 4.160594e-06 5.183346e+00 2.159149e-05
## prices_ratings 4.160594e-06 5.183346e+00 2.159149e-05
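A sketch of one common way to compute lift is shown below, using the definition above with P(w) estimated from overall corpus frequencies; the course's own lift function may differ in detail.

```r
# Marginal probability of each term across the whole corpus.
p_w <- colSums(as.matrix(dtm)) / sum(as.matrix(dtm))

# Divide each term's within-topic probability by its marginal probability.
# Rows (terms) of t(posterior(lda)$terms) align with the dtm's columns.
lift <- t(posterior(lda)$terms) / p_w

# Tokens with lift well above 1 in exactly one column characterise a topic.
head(lift[order(apply(lift, 1, max), decreasing = TRUE), ])
```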
The censored lift values below, in which values under a cutoff are set to zero, show the topic to which each token is most strongly assigned.
## topic
## phrase 1 2 3
## snowman 0.000000 0.000000 8.618693
## sliders 0.000000 0.000000 8.618629
## freeshipping 0.000000 5.183274 0.000000
## inches_inches 0.000000 5.183282 0.000000
## adblock 0.000000 5.183288 0.000000
## emi_rs 0.000000 5.182407 0.000000
## mumbai_rajkot 1.446314 0.000000 0.000000
## gurgaon_delhi 1.446314 0.000000 0.000000
## pincode_locations 1.446627 0.000000 0.000000
## product_delivered 1.446627 0.000000 0.000000
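The censoring step can be reproduced with a simple thresholding rule; a cutoff of 1 is assumed here because every retained value in the table above exceeds 1.

```r
# Zero out lift values below the cutoff so that only tokens strongly
# associated with a single topic remain (assumed cutoff = 1).
censor_lift <- function(lift, cutoff = 1) {
  lift[lift < cutoff] <- 0
  lift
}
censored <- censor_lift(lift)
```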
The eta proportion values for each topic are calculated and summarised below. Each row represents a document and each column a topic; the values in a row sum to one, giving the share of each topic within that document. For example, document 3 is dominated by topic 3 (proportion 0.73), with only small shares of the other topics.
## 1 2 3
## 1 0.1780223 0.6251038 0.1968739
## 2 0.1251950 0.7039112 0.1708938
## 3 0.1032872 0.1628087 0.7339040
## 4 0.2713160 0.5080330 0.2206510
## 5 0.4042983 0.3690693 0.2266324
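The exact construction of the eta proportions depends on the course code, which is not shown here; one plausible sketch is to spread each document's token counts over topics via a normalised P(topic | term) and rescale each row to sum to one, as the rows above do.

```r
# Approximate P(topic | term) by normalising each term's row of
# P(term | topic) (ignoring topic priors -- a simplifying assumption).
p_tk <- apply(t(posterior(lda)$terms), 1, function(x) x / sum(x))  # K x V

# Document-level topic mass, then row-normalise so each row sums to one.
eta <- as.matrix(dtm) %*% t(p_tk)
eta <- eta / rowSums(eta)
round(eta, 7)
```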
The latent topic proportions above show that most documents are assigned predominantly to a single topic, indicating that the topics are classified correctly.
QNo.1. Is the topic model able to separate each subject from the other subjects? To what extent is it able to do so?
The topic model is able to separate each subject from the others. However, the Food and Politics documents share some content, so a few words appear in both of those latent topics.
QNo.2. Are there mixed tokens (with high lift in more than one topic)? Are the highest-lift tokens and the document-topic proportions (eta scores) clear and able to identify each topic?
The topic model contains very few mixed tokens. The eta proportion values are clear and are able to identify each topic individually.
QNo.3. What are your learnings from this exercise?
Below are the learnings from this exercise:

1. Even though the three topics were clubbed into a single document, empirical topic modelling successfully identified each topic.
2. The model classified the tokens such that the theta and omega values clearly indicate which topic each token belongs to, which suggests the model is highly effective.
3. The latent topics produced by the model are as expected. If the K value is increased, the token classification would move closer to a perfect model.