Authors: Duzhin Fedor, Tan Joo Seng, Tan Siew Eng
We have collected raw WhatsApp chats of student teams in four courses: two in business and two in mathematics. For each course, we have a directory containing all chats exported as .txt files and one .xlsx file with anonymised student information. Below are some statistics.
Here we merge all the chats into one large table. Below is the number of messages in each course.
##
## A B C D E
## 15119 4583 4373 1226 463
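A minimal sketch of the merging step, assuming one sub-directory per course containing the exported .txt chats; the directory layout, file names and the timestamp pattern are assumptions, not the authors' actual code:

```r
library(dplyr)
library(stringr)

read_course <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  lines <- unlist(lapply(files, readLines, encoding = "UTF-8"))
  # keep only lines that start a new message,
  # e.g. "12/03/18, 21:45 - Alice: let's meet tomorrow" (illustrative pattern)
  msgs <- lines[str_detect(lines, "^\\d{1,2}/\\d{1,2}/\\d{2,4},")]
  tibble(course = basename(dir), message = msgs)
}

# "data" is a hypothetical root folder with one sub-directory per course
chats <- bind_rows(lapply(list.dirs("data", recursive = FALSE), read_course))
table(chats$course)   # number of messages per course
```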
We will extract all messages from the course with the longest messages. Below is the word cloud of its messages.
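A sketch of the word cloud, assuming the tm and wordcloud packages; course "D" is a guess at the selected course (its 1226 messages match the DTM dimensions reported below):

```r
library(tm)
library(wordcloud)

# course "D" is an assumption; pick whichever course is being analysed
messages <- chats$message[chats$course == "D"]

corpus <- VCorpus(VectorSource(messages))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
```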
We begin by constructing a document-term matrix (DTM). Its rows represent documents (messages in our WhatsApp chat) and its columns represent terms that appear in documents. The entry \(DTM_{d,t}\) counts how many times term \(t\) appeared in document \(d\).
## Dim(DTM) = 1226 1195
## DTM sample:
## Terms
## Docs 2 ai manager see
## 10 1 0 0 1
## 20 0 0 1 0
## 31 0 2 0 0
## 40 0 1 0 0
## 118 0 0 0 0
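A sketch of the DTM construction, reusing the cleaned corpus from the word-cloud step:

```r
library(tm)

DTM <- DocumentTermMatrix(corpus)
dim(DTM)                 # number of documents, number of terms
inspect(DTM[1:5, 1:5])   # a small sample of the matrix
```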
Now we will construct an LDA (latent Dirichlet allocation) model. To do so, we need to specify the number of topics in advance.
## Number of topics = 10
## LDA with 10 topics has been constructed.
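A sketch of the model fit, assuming the topicmodels package; the seed is arbitrary:

```r
library(topicmodels)

# LDA cannot handle documents with no terms, so drop any empty rows first
DTM <- DTM[slam::row_sums(DTM) > 0, ]

k   <- 10
lda <- LDA(DTM, k = k, control = list(seed = 1234))
```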
In an LDA model, every document is represented as a mixture of topics; the table below shows these topic probabilities for a sample of documents.
##
## ======================================================================
## document 1 2 3 4 5 6 7 8 9 10
## ----------------------------------------------------------------------
## 1 10 0.173 0.011 0.011 0.743 0.011 0.011 0.011 0.011 0.011 0.011
## 2 20 0.009 0.009 0.009 0.009 0.009 0.009 0.164 0.009 0.762 0.009
## 3 30 0.042 0.042 0.042 0.042 0.042 0.042 0.042 0.626 0.042 0.042
## 4 100 0.006 0.006 0.006 0.224 0.128 0.006 0.006 0.006 0.610 0.006
## 5 200 0.007 0.007 0.007 0.429 0.132 0.007 0.007 0.267 0.129 0.007
## ----------------------------------------------------------------------
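These probabilities can be read off the fitted model; a sketch:

```r
# per-document topic probabilities (the "gamma" matrix); each row sums to 1
gamma <- posterior(lda)$topics
round(head(gamma, 5), 3)
```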
Every term (word) has a certain frequency in each topic; the table below shows these frequencies for a sample of terms.
##
## =======================================================================
## term 1 2 3 4 5 6 7 8 9 10
## -----------------------------------------------------------------------
## 1 ai 0.047 0 0.00000 0 0.024 0.008 0.017 0 0.050 0.012
## 2 deleted 0 0 0 0.016 0 0 0 0 0 0
## 3 manager 0.001 0 0 0 0 0 0.003 0 0.010 0.003
## 4 think 0.003 0.004 0 0.030 0.003 0 0.00000 0 0.007 0.00000
## -----------------------------------------------------------------------
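The per-topic term frequencies come from the same posterior; a sketch:

```r
# per-topic term probabilities (the "beta" matrix): one row per topic, one column per term
beta <- posterior(lda)$terms
round(beta[, c("ai", "deleted", "manager", "think")], 3)
```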
Now, for each topic, we will choose the 10 most frequent words in that topic:
Note that some frequent words appear in several topics. For example, “ai” is ranked first in topics 1 and 9 and appears in the top 10 of several other topics. This simply means that “ai” is a very common term in our corpus, but its association with any particular topic is not strong enough to distinguish topics.
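One way to extract them is with tidytext; a sketch:

```r
library(tidytext)
library(dplyr)

top_terms <- tidy(lda, matrix = "beta") %>%   # columns: topic, term, beta
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
```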
Now, given a term \(t\), we consider its frequencies with respect to each of the \(n\) topics, i.e., the numbers \[ f_1,f_2,\dots,f_{n} \] Among them, we pick the largest, \(f_{1st}\), and the second largest, \(f_{2nd}\), and compute the statistic \[ \mathrm{LogRatio}(t)=\log\frac{f_{1st}}{f_{2nd}} \] This statistic does not depend on the overall frequency of the term \(t\), but it shows how strongly the term is associated with a single topic.
Here are the top 10 terms in each topic sorted according to LogRatio:
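A sketch of the LogRatio computation from the beta matrix above; terms whose second-largest frequency is zero give an infinite ratio and are dropped here:

```r
log_ratio <- apply(beta, 2, function(f) {
  s <- sort(f, decreasing = TRUE)
  log(s[1] / s[2])
})
log_ratio <- log_ratio[is.finite(log_ratio)]

# group terms by their dominant topic and keep the 10 largest LogRatio values per topic
dominant     <- apply(beta, 2, which.max)[names(log_ratio)]
top_by_ratio <- lapply(split(log_ratio, dominant),
                       function(x) head(sort(x, decreasing = TRUE), 10))
```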
Now we compute the probability of each topic in each document and assign to each document the topic with the highest probability. Below are the numbers of documents assigned to each topic:
##
## 1 2 3 4 5 6 7 8 9 10
## 55 45 51 308 71 26 69 457 91 53
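topicmodels provides a helper for this assignment; a sketch:

```r
doc_topic <- topics(lda)   # most probable topic for each document
table(doc_topic)           # number of documents assigned to each topic
```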
The same topic counts, broken down by team:
##
## 1 2 3 4 5 6 7 8 9 10
## 18T1 9 6 4 45 7 4 14 57 21 5
## 18T10 5 8 3 9 7 1 3 36 3 3
## 18T2 12 19 23 113 27 3 17 62 16 14
## 18T3 5 2 3 28 3 1 2 65 7 2
## 18T4 3 2 0 32 2 6 9 47 0 1
## 18T5 2 2 3 35 5 4 8 56 3 0
## 18T6 3 0 1 10 8 1 4 25 20 4
## 18T7 2 4 4 9 6 2 6 15 10 12
## 18T8 10 0 6 5 2 1 2 40 2 6
## 18T9 4 2 4 22 4 3 4 54 9 6
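A sketch of the cross-tabulation, assuming a hypothetical vector doc_team of team labels (one per document, taken from the anonymised metadata):

```r
# doc_team is hypothetical: a character vector such as "18T1", ..., "18T10",
# aligned with the rows of the DTM
table(doc_team, doc_topic)
```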
A sample of sentiment-carrying words appearing in our data, together with their sentiment scores:
## ability adopt advanced advantage advantages affected agree
## 2 1 1 2 2 -1 1
## allow alone authority benefits best better big
## 1 -2 1 2 3 2 1
## boost capable careful certain challenge chance
## 1 1 2 1 -1 2
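The scores printed above resemble AFINN-style integer valences. A sketch of per-message sentiment scoring under that assumption, using tidytext (the lexicon is downloaded via the textdata package on first use):

```r
library(dplyr)
library(tidytext)

afinn <- get_sentiments("afinn")   # columns: word, value

doc_sent <- tibble(doc = seq_along(messages), text = messages) %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word") %>%   # messages without scored words are dropped
  group_by(doc) %>%
  summarise(sent = sum(value), .groups = "drop")
```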
Summary statistics by topic and by team: the number of documents (N), the mean word count, and the median, mean, standard deviation, maximum and minimum sentiment score:
##
## =========================================================================
## topic N Mean Word Count Med sent Mean sent SD sent Max sent Min sent
## -------------------------------------------------------------------------
## 1 1 55 60.418 1 2.927 5.120 30 -1
## 2 2 45 90.956 2 5.178 9.960 56 -1
## 3 3 51 78.549 2 5.627 9.520 53 -1
## 4 4 308 11.166 0 0.964 1.707 13 -2
## 5 5 71 67.042 3 4.296 5.397 29 -1
## 6 6 26 207.423 4 11.846 18.626 83 -1
## 7 7 69 71.899 2 3.159 4.828 25 -2
## 8 8 457 3.871 0 1.098 1.506 8 -1
## 9 9 91 65.945 2 3.989 5.967 31 -3
## 10 10 53 100.283 3 6.415 8.606 38 0
## -------------------------------------------------------------------------
##
## =========================================================================
## team N Mean Word Count Med sent Mean sent SD sent Max sent Min sent
## -------------------------------------------------------------------------
## 1 18T1 172 13.343 0 0.988 1.867 8 -3
## 2 18T10 78 44.449 0 3.321 6.802 42 0
## 3 18T2 306 17.059 0.500 1.634 2.703 23 -2
## 4 18T3 118 19.932 0 1.576 3.517 31 0
## 5 18T4 102 53.471 0 3.275 9.546 83 -1
## 6 18T5 118 30.559 0 1.441 3.791 25 -2
## 7 18T6 76 57.868 3 4.421 5.565 29 -1
## 8 18T7 70 88.571 2 5.371 9.280 56 -2
## 9 18T8 74 76.568 2 4.824 9.442 53 0
## 10 18T9 112 39.304 2 2.911 4.088 27 0
## -------------------------------------------------------------------------
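A sketch of how such summaries can be produced with dplyr, assuming a hypothetical data frame docs with one row per document and columns topic, team, word_count and sent; grouping by team instead of topic gives the second table:

```r
library(dplyr)

docs %>%
  group_by(topic) %>%
  summarise(N               = n(),
            mean_word_count = mean(word_count),
            med_sent        = median(sent),
            mean_sent       = mean(sent),
            sd_sent         = sd(sent),
            max_sent        = max(sent),
            min_sent        = min(sent),
            .groups         = "drop")
```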