The three topics selected for the following analysis are Politics, Laptops, and Food. The text for each topic was retrieved from the top 50 websites returned by a Google search.
The text content of each topic is stored in a separate document, and the three are then combined into a single document. Let's start the analysis with the word cloud and the co-occurrence graph (COG).
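For reference, below is a minimal sketch of how such a word cloud can be produced in R, assuming the scraped pages have already been read into a character vector `texts`; the vector name and the preprocessing choices here are illustrative, not the report's actual code.

```r
# Minimal word-cloud sketch using the tm and wordcloud packages.
# `texts` is assumed to hold the scraped text, one element per topic.
library(tm)
library(wordcloud)

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm  <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Plot the 100 most frequent tokens; size reflects corpus frequency.
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
```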
The word cloud of the combined document clearly shows the highest-frequency tokens, so the topics stand out distinctly from the rest of the document content.
The COG complements the word cloud by linking each high-frequency token to its most strongly associated terms, summarising the overall structure of the corpus.
Now, let's examine the lift and eta proportion values, which help estimate the probability of each topic appearing in each document.
Below are the theta values (topic-term probabilities) for the top tokens in the document:
## topic
## phrase 1 2 3
## snowman 5.075749e-10 1.845461e-09 2.920818e-04
## sliders 5.075751e-10 1.845461e-09 1.518814e-03
## freeshipping 2.126554e-09 3.513155e-04 2.933013e-09
## inches_inches 1.776377e-09 2.810529e-04 2.933013e-09
## adblock 1.603402e-09 2.459215e-04 2.933013e-09
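As a hedged illustration, theta values of this shape can be extracted from an LDA model fitted with the `topicmodels` package; the report's own fitting code is not shown, so the call below is an assumption.

```r
# Fit a 3-topic LDA on the document-term matrix (illustrative settings).
library(topicmodels)
lda <- LDA(dtm, k = 3, control = list(seed = 42))

# posterior()$terms is a (topic x term) matrix of P(term | topic);
# transposing gives one row per phrase and one column per topic,
# matching the theta table printed above.
theta <- t(posterior(lda)$terms)
head(theta[order(apply(theta, 1, max), decreasing = TRUE), ])
```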
Below are the omega values (document-topic proportions) for each document:
## topic
## document 1 2 3
## 1 0.0005590992 0.9989189550 0.0005219458
## 2 0.0002920402 0.9991193911 0.0005885687
## 3 0.0001596683 0.0003159075 0.9995244242
## 4 0.5129219500 0.4853479812 0.0017300689
## 5 0.8089190860 0.1887835890 0.0022973250
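Under the same assumed model, the omega values correspond to the fitted document-topic proportions:

```r
# posterior()$topics is a (document x topic) matrix: each row gives the
# estimated topic mix of one document, matching the omega table above.
omega <- posterior(lda)$topics
round(omega, 7)
```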
On further analysis, the lift values for the tokens under each topic are as below. Lift compares a token's probability within a topic to its overall probability in the corpus, lift(w, k) = P(w | topic k) / P(w), so values well above 1 flag tokens that are characteristic of that topic:
## topic
## phrase 1 2 3
## adageunicorns 1.750409e-05 5.183362e+00 1.087037e-04
## entrepreneur 2.786827e-03 5.173378e+00 1.731776e-04
## charles_pierce 8.064532e-06 7.466233e-06 8.618585e+00
## staggers 8.105816e-04 2.850908e-04 8.613571e+00
## smartcooky 7.791184e-06 1.800563e-05 8.618614e+00
## statesman 1.578546e-01 1.249039e-02 7.657820e+00
## foodnavigator 1.441350e+00 2.057864e-02 9.240647e-05
## confectionery 1.445557e+00 5.484517e-03 9.209458e-05
## exhaustive 1.442577e+00 1.615626e-02 6.600552e-05
## ratings_based 4.160594e-06 5.183346e+00 2.159149e-05
## prices_ratings 4.160594e-06 5.183346e+00 2.159149e-05
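A sketch of one common way to compute lift is shown below, using the definition above with P(w) estimated from overall corpus frequencies; the course's own lift function may differ in detail.

```r
# Marginal probability of each term across the whole corpus.
p_w <- colSums(as.matrix(dtm)) / sum(as.matrix(dtm))

# Divide each term's within-topic probability by its marginal probability.
# Rows (terms) of t(posterior(lda)$terms) align with the dtm's columns.
lift <- t(posterior(lda)$terms) / p_w

# Tokens with lift well above 1 in exactly one column characterise a topic.
head(lift[order(apply(lift, 1, max), decreasing = TRUE), ])
```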
The censored lift values below, in which values under a cutoff are set to zero, show the topic to which each token is most strongly assigned.
## topic
## phrase 1 2 3
## snowman 0.000000 0.000000 8.618693
## sliders 0.000000 0.000000 8.618629
## freeshipping 0.000000 5.183274 0.000000
## inches_inches 0.000000 5.183282 0.000000
## adblock 0.000000 5.183288 0.000000
## emi_rs 0.000000 5.182407 0.000000
## mumbai_rajkot 1.446314 0.000000 0.000000
## gurgaon_delhi 1.446314 0.000000 0.000000
## pincode_locations 1.446627 0.000000 0.000000
## product_delivered 1.446627 0.000000 0.000000
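The censoring step can be reproduced with a simple thresholding rule; a cutoff of 1 is assumed here because every retained value in the table above exceeds 1.

```r
# Zero out lift values below the cutoff so that only tokens strongly
# associated with a single topic remain (assumed cutoff = 1).
censor_lift <- function(lift, cutoff = 1) {
  lift[lift < cutoff] <- 0
  lift
}
censored <- censor_lift(lift)
```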
The eta proportion values for each topic are calculated and summarised below. Each row represents a document and each column a topic; the values in a row sum to one, giving the share of each topic within that document. For example, document 3 is dominated by topic 3 (proportion 0.73), with only small shares of the other topics.
## 1 2 3
## 1 0.1780223 0.6251038 0.1968739
## 2 0.1251950 0.7039112 0.1708938
## 3 0.1032872 0.1628087 0.7339040
## 4 0.2713160 0.5080330 0.2206510
## 5 0.4042983 0.3690693 0.2266324
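The exact construction of the eta proportions depends on the course code, which is not shown here; one plausible sketch is to spread each document's token counts over topics via a normalised P(topic | term) and rescale each row to sum to one, as the rows above do.

```r
# Approximate P(topic | term) by normalising each term's row of
# P(term | topic) (ignoring topic priors -- a simplifying assumption).
p_tk <- apply(t(posterior(lda)$terms), 1, function(x) x / sum(x))  # K x V

# Document-level topic mass, then row-normalise so each row sums to one.
eta <- as.matrix(dtm) %*% t(p_tk)
eta <- eta / rowSums(eta)
round(eta, 7)
```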
The latent topic proportions above show that most documents are assigned predominantly to a single topic, indicating that the topics are classified correctly.
QNo.1. Is the topic model able to separate each subject from the other subjects? To what extent is it able to do so?
The topic model is able to separate each subject from the others. However, the Food and Politics documents share some content, so a few words appear in both of those latent topics.
QNo.2. Are there mixed tokens (with high lift in more than one topic)? Are the highest-lift tokens and the document-topic proportions (eta scores) clear and able to identify each topic?
The topic model contains very few mixed tokens. The eta proportion values are clear and are able to identify each topic individually.
QNo.3. What are your learnings from this exercise?
Below are the learnings from this exercise:

1. Even though the three topics were clubbed into a single document, empirical topic modelling successfully identified each topic.
2. The model classified the tokens such that the theta and omega values clearly indicate which topic each token belongs to, which suggests the model is highly effective.
3. The latent topics produced by the model are as expected. If the K value is increased, the token classification would move closer to a perfect model.