Authors: Duzhin Fedor, Tan Joo Seng, Tan Siew Eng

Preparation and reading data

We have collected raw WhatsApp chats of student teams in four courses, two in business and two in mathematics. For each course, we have a directory containing all chats exported as txt files and one xlsx file with anonymised student information. Below are some statistics.

Data processing

Here we merge all the chats into one large table. Below is the number of messages in each course.

## 
##     A     B     C     D     E 
## 15119  4583  4373  1226   463
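The merging step could be done roughly as follows (a sketch, not the report's exact code; `chat_list` is a hypothetical list of per-chat data frames produced when parsing the exported txt files):

```r
# A sketch of merging all parsed chats into one table, assuming each chat has
# already been read into a data frame with columns such as `course`, `team`,
# `author` and `text`.
library(dplyr)

all_chats <- bind_rows(chat_list)   # stack the per-chat data frames
table(all_chats$course)             # messages per course, as tabulated above
```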

Topic modelling

We will extract all messages from the course with the longest messages. Below is the word cloud of the most frequent terms.
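A word cloud of this kind can be drawn with the `wordcloud` package; the sketch below assumes the selected messages are stored in a character vector `msgs` (a name introduced here for illustration only):

```r
# A sketch of drawing the word cloud with the `wordcloud` package.
library(wordcloud)

# crude tokenisation: lower-case the messages and split on non-word characters
tokens    <- unlist(strsplit(tolower(msgs), "\\W+"))
tokens    <- tokens[nchar(tokens) > 1]              # drop empty and one-letter tokens
word_freq <- sort(table(tokens), decreasing = TRUE)

wordcloud(names(word_freq), as.numeric(word_freq),
          max.words = 100, random.order = FALSE)
```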

We begin by constructing a document-term matrix (DTM). Its rows represent documents (messages in our WhatsApp chat) and its columns represent terms that appear in documents. The entry \(DTM_{d,t}\) counts how many times term \(t\) appeared in document \(d\).

## Dim(DTM) = 1226 1195
## DTM sample:
##      Terms
## Docs  2 ai manager see
##   10  1  0       0   1
##   20  0  0       1   0
##   31  0  2       0   0
##   40  0  1       0   0
##   118 0  0       0   0
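A matrix of this shape can be built with the `tm` package along the following lines (a sketch, under the same assumption that the selected messages sit in a character vector `msgs`):

```r
# A sketch of building the document-term matrix with the `tm` package.
library(tm)

corpus <- VCorpus(VectorSource(msgs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

DTM <- DocumentTermMatrix(corpus)
dim(DTM)                 # number of documents x number of terms
inspect(DTM[1:5, 1:4])   # a small sample of the matrix
```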

Now we will construct an LDA (latent Dirichlet allocation) model. To do this, we need to specify the number of topics in advance.

## Number of topics = 10
## LDA with 10 topics has been constructed.
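For reference, an LDA model with a prescribed number of topics can be fitted with the `topicmodels` package roughly as follows (a sketch; the seed is arbitrary and the report's exact call may differ):

```r
# A sketch of fitting the LDA model with the `topicmodels` package.
library(topicmodels)

k   <- 10                                      # number of topics, fixed in advance
lda <- LDA(DTM, k = k, control = list(seed = 1234))
```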

In an LDA model, every document is represented as a mixture of topics; below are the estimated topic proportions for a few sample documents.

## 
## ======================================================================
##   document   1     2     3     4     5     6     7     8     9    10  
## ----------------------------------------------------------------------
## 1    10    0.173 0.011 0.011 0.743 0.011 0.011 0.011 0.011 0.011 0.011
## 2    20    0.009 0.009 0.009 0.009 0.009 0.009 0.164 0.009 0.762 0.009
## 3    30    0.042 0.042 0.042 0.042 0.042 0.042 0.042 0.626 0.042 0.042
## 4   100    0.006 0.006 0.006 0.224 0.128 0.006 0.006 0.006 0.610 0.006
## 5   200    0.007 0.007 0.007 0.429 0.132 0.007 0.007 0.267 0.129 0.007
## ----------------------------------------------------------------------
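These per-document topic proportions (the "gamma" matrix) can be extracted from a fitted `topicmodels` object like this (a sketch, using the `lda` object from the fitting sketch above):

```r
# A sketch of extracting the document-topic probabilities.
gamma <- posterior(lda)$topics     # rows: documents, columns: topics
round(head(gamma, 5), 3)
```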

Every term (word) has a certain frequency in each topic.

## 
## =======================================================================
##    term     1     2      3      4     5     6      7    8   9     10   
## -----------------------------------------------------------------------
## 1   ai    0.047   0   0.00000   0   0.024 0.008  0.017  0 0.050  0.012 
## 2 deleted   0     0      0    0.016   0     0      0    0   0      0   
## 3 manager 0.001   0      0      0     0     0    0.003  0 0.010  0.003 
## 4  think  0.003 0.004    0    0.030 0.003   0   0.00000 0 0.007 0.00000
## -----------------------------------------------------------------------
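The per-topic term probabilities, and the most probable terms of each topic, can be pulled out of the same fitted object (a sketch):

```r
# A sketch of extracting the term-topic probabilities and the top terms.
beta <- posterior(lda)$terms       # rows: topics, columns: terms
round(beta[, c("ai", "manager", "think")], 3)

terms(lda, 10)                     # the 10 most probable terms in each topic
```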

Now, for each topic, we choose the 10 most frequent words in that topic:

Note that some frequent words appear in several topics. For example, “ai” is in first place in topics 1 and 9 and in the top 10 of several other topics. This just means that “ai” is a very common term in our corpus, but its association with any single topic is not strong enough to distinguish topics.

Now, given a term \(t\), we consider its frequencies with respect to each of the \(n\) topics, i.e., the numbers \[ f_1,f_2,\dots,f_{n}. \] Among them, we pick the largest, \(f_{\mathrm{1st}}\), and the second largest, \(f_{\mathrm{2nd}}\), and compute the statistic \[ \mathrm{LogRatio}(t)=\log\frac{f_{\mathrm{1st}}}{f_{\mathrm{2nd}}}. \] This statistic does not depend on the overall frequency of the term \(t\); rather, it shows how strongly the term is associated with one particular topic.
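Given the term-topic matrix `beta` from the sketch above, LogRatio can be computed for every term as follows (a sketch):

```r
# A sketch of computing LogRatio(t) for every term t from the matrix `beta`
# (rows: topics, columns: terms).
log_ratio <- apply(beta, 2, function(f) {
  f_sorted <- sort(f, decreasing = TRUE)
  log(f_sorted[1] / f_sorted[2])   # largest topic frequency over the second largest
})
head(sort(log_ratio, decreasing = TRUE), 10)
```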

Here are the top 10 terms in each topic, sorted by LogRatio:

Now we compute the probability of each topic in each document and assign to each document the topic with the highest probability. Below is the number of documents assigned to each topic:

## 
##   1   2   3   4   5   6   7   8   9  10 
##  55  45  51 308  71  26  69 457  91  53
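The assignment of each document to its most probable topic, and the tabulation above, can be obtained along these lines (a sketch; `team` stands for a hypothetical vector of team labels aligned with the documents, used for the by-team breakdown shown next):

```r
# A sketch of assigning each document its most probable topic.
doc_topic <- topics(lda)           # most likely topic of every document
table(doc_topic)                   # documents per topic
table(team, doc_topic)             # cross-tabulation of teams against topics
```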

Topic assignments by team:

##        
##           1   2   3   4   5   6   7   8   9  10
##   18T1    9   6   4  45   7   4  14  57  21   5
##   18T10   5   8   3   9   7   1   3  36   3   3
##   18T2   12  19  23 113  27   3  17  62  16  14
##   18T3    5   2   3  28   3   1   2  65   7   2
##   18T4    3   2   0  32   2   6   9  47   0   1
##   18T5    2   2   3  35   5   4   8  56   3   0
##   18T6    3   0   1  10   8   1   4  25  20   4
##   18T7    2   4   4   9   6   2   6  15  10  12
##   18T8   10   0   6   5   2   1   2  40   2   6
##   18T9    4   2   4  22   4   3   4  54   9   6

Sentiment analysis

A sample of sentiment-carrying words appearing in our data, together with their sentiment scores:

##    ability      adopt   advanced  advantage advantages   affected      agree 
##          2          1          1          2          2         -1          1 
##      allow      alone  authority   benefits       best     better        big 
##          1         -2          1          2          3          2          1 
##      boost    capable    careful    certain  challenge     chance 
##          1          1          2          1         -1          2
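The scores shown above look like those of a valence lexicon such as AFINN (an assumption; the report does not name the lexicon). A per-message sentiment score could be computed with `tidytext` roughly as follows, where `all_msgs` is a hypothetical data frame with columns `doc_id` and `text`:

```r
# A sketch of per-message sentiment scoring with an AFINN-style lexicon.
library(dplyr)
library(tidytext)

afinn <- get_sentiments("afinn")   # columns: word, value

sentiment_by_msg <- all_msgs %>%
  unnest_tokens(word, text) %>%    # one row per word
  inner_join(afinn, by = "word") %>%
  group_by(doc_id) %>%
  summarise(sentiment = sum(value))
```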

Summary statistics of message length and sentiment, by topic and by team:

## 
## =========================================================================
##    topic  N  Mean Word Count Med sent Mean sent SD sent Max sent Min sent
## -------------------------------------------------------------------------
## 1    1   55      60.418         1       2.927    5.120     30       -1   
## 2    2   45      90.956         2       5.178    9.960     56       -1   
## 3    3   51      78.549         2       5.627    9.520     53       -1   
## 4    4   308     11.166         0       0.964    1.707     13       -2   
## 5    5   71      67.042         3       4.296    5.397     29       -1   
## 6    6   26      207.423        4      11.846   18.626     83       -1   
## 7    7   69      71.899         2       3.159    4.828     25       -2   
## 8    8   457      3.871         0       1.098    1.506     8        -1   
## 9    9   91      65.945         2       3.989    5.967     31       -3   
## 10  10   53      100.283        3       6.415    8.606     38       0    
## -------------------------------------------------------------------------
## 
## =========================================================================
##    team   N  Mean Word Count Med sent Mean sent SD sent Max sent Min sent
## -------------------------------------------------------------------------
## 1  18T1  172     13.343         0       0.988    1.867     8        -3   
## 2  18T10 78      44.449         0       3.321    6.802     42       0    
## 3  18T2  306     17.059       0.500     1.634    2.703     23       -2   
## 4  18T3  118     19.932         0       1.576    3.517     31       0    
## 5  18T4  102     53.471         0       3.275    9.546     83       -1   
## 6  18T5  118     30.559         0       1.441    3.791     25       -2   
## 7  18T6  76      57.868         3       4.421    5.565     29       -1   
## 8  18T7  70      88.571         2       5.371    9.280     56       -2   
## 9  18T8  74      76.568         2       4.824    9.442     53       0    
## 10 18T9  112     39.304         2       2.911    4.088     27       0    
## -------------------------------------------------------------------------
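Tables of this form can be assembled with `dplyr` (a sketch; `scored` is a hypothetical data frame with one row per message and columns `topic`, `team`, `word_count` and `sentiment`):

```r
# A sketch of the per-topic summary; the per-team table is analogous
# (group_by(team) instead of group_by(topic)).
library(dplyr)

scored %>%
  group_by(topic) %>%
  summarise(
    N               = n(),
    mean_word_count = mean(word_count),
    med_sent        = median(sentiment),
    mean_sent       = mean(sentiment),
    sd_sent         = sd(sentiment),
    max_sent        = max(sentiment),
    min_sent        = min(sentiment)
  )
```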