Synopsis

A basic text analysis (text mining) of public IRC chat conversations.

The Dataset

Consists of chat messages (users’ messages) from the #perl freenode IRC room during the period 01/01/2020 - 02/01/2020.

In this dataset, one message is defined as one line of text sent; it does not need to be a complete sentence.

Pre-processing phase

We remove URLs and numbers from the messages (they can be relevant for many analyses, but ours will not need them) and strip the text of common stop words (see note). Later we will make a further analysis using term frequencies that treats all terms, including stop words.
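A minimal sketch of this step with tidytext, assuming a data frame `messages` with one row per chat line and columns `user` and `text` (the names are illustrative):

```r
library(dplyr)
library(stringr)
library(tidytext)

# `messages`: assumed data frame, one row per chat line, columns `user` and `text`.
tidy_words <- messages %>%
  mutate(text = str_remove_all(text, "https?://\\S+"),    # drop URLs
         text = str_remove_all(text, "\\b[0-9]+\\b")) %>% # drop bare numbers
  unnest_tokens(word, text) %>%                           # one lower-cased token per row
  anti_join(stop_words, by = "word")                      # drop common English stop words
```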

Top 10 users

First, how frequently people chat (the distribution of the total number of messages per user).

The distribution appears to be some form of exponential, perhaps a Poisson distribution, since the number of messages sent relates to a message-arrival process that could be modeled, under some assumptions, as a Poisson stochastic process.

The top 10 users and how many messages they sent.
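Counting this is a one-liner with dplyr; a sketch, reusing the `messages` data frame assumed above:

```r
msg_per_user <- messages %>%
  count(user, name = "total", sort = TRUE)   # total messages per user

top10_users <- msg_per_user %>%
  slice_max(total, n = 10)                   # the 10 most active users
```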


Top 10 and all the Rest

Comparing the top 10 with all the rest in total messages sent.

As we can see, the first 10 users speak as much as all the rest of the room (approximately 300 people).

Global view

A naive and fast analysis using only tokenization and removal of stop words.

Note: using “word” as the tokenization unit and removing the stop words (a canonical tokenization).
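A sketch of the counting behind the table below, reusing the `tidy_words` tokens from the pre-processing step:

```r
word_freq <- tidy_words %>%
  count(word, sort = TRUE) %>%
  mutate(`freq (%)` = 100 * n / sum(n))   # share of all tokens, in percent

head(word_freq, 10)
```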

Word frequency at #perl: top 10 word occurrences

| word      |    n | freq (%)  |
|-----------|-----:|----------:|
| perl      | 1268 | 2.1550699 |
| foo       |  407 | 0.6917298 |
| file      |  380 | 0.6458411 |
| code      |  333 | 0.5659608 |
| grinnz    |  333 | 0.5659608 |
| perlbot   |  280 | 0.4758829 |
| line      |  278 | 0.4724838 |
| time      |  264 | 0.4486896 |
| yeah      |  253 | 0.4299942 |
| pink_mist |  233 | 0.3960026 |

A graphical view of the previous table.


Users’ detailed view.

Counting the words by users. The top 10 users and their most common words.


User citing other users

How often does a user cite (tag) another user (usually in a reply)? Which user is most cited by other users?
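One simple way to count this is to match each token against the set of known nicknames; a sketch, assuming the nicknames come from `msg_per_user` above and are stored in the form `<nick>`:

```r
nicks <- msg_per_user$user %>%
  str_remove_all("[<>]") %>%                  # strip the assumed <nick> brackets
  tolower()                                   # tokens are lower-cased by unnest_tokens

citations <- tidy_words %>%
  filter(word %in% nicks) %>%                 # a token matching some user's nickname
  count(word, name = "total", sort = TRUE) %>%
  rename(cited = word)
```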

User citation totals: how many times each user was cited

| cited      | total |
|------------|------:|
| grinnz     |   333 |
| perlbot    |   280 |
| pink_mist  |   233 |
| rindolf    |   175 |
| simcop     |   155 |
| zln        |   148 |
| gordonfish |   127 |
| daemon     |    98 |
| simple     |    90 |
| botje      |    88 |

1. The user names shown here are as produced by the tokenization process.

User citing packages

Note: This is probably not very useful. A package here means two or more words separated by double colons (“::”), since otherwise the false-positive rate for single-word package names is too high, and checking each message to decide whether a name refers to a package or something else is too much work for the purpose of this simple analysis.
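A sketch of the extraction, using a regular expression that requires at least one “::” separator (names are illustrative; `pkg_by_user` is reused later for clustering):

```r
library(tidyr)

pkg_pattern <- "[A-Za-z_][A-Za-z0-9_]*(::[A-Za-z_][A-Za-z0-9_]*)+"  # Foo::Bar, Foo::Bar::Baz, ...

pkg_mentions <- messages %>%
  mutate(packages = str_extract_all(text, pkg_pattern)) %>%
  unnest(packages)                                        # one row per package mention

pkg_mentions %>% count(packages, sort = TRUE)             # packages overall (table below)
pkg_by_user <- pkg_mentions %>% count(user, packages)     # package counts per user
```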

Top 10 packages cited in the chat

| packages         |  n |
|------------------|---:|
| Foo::Bar         | 53 |
| Time::Piece      | 30 |
| List::Util       | 27 |
| Path::Tiny       | 23 |
| DBIx::Class      | 15 |
| Util::Versioned  | 15 |
| Data::Dumper     | 14 |
| IO::Async        | 14 |
| Math::BigFloat   | 13 |
| Mojo::UserAgent  | 12 |
| Params::Validate | 12 |

How many times a user cited a package.

Package citations by user: how many times a user cited a package (top 10 occurrences)

| packages        | user         |  n |
|-----------------|--------------|---:|
| Foo::Bar        | <Harzilein>  | 20 |
| Foo::Bar        | <Grinnz>     | 17 |
| Util::Versioned | <daemon>     | 15 |
| Time::Piece     | <Grinnz>     | 11 |
| Mojo::UserAgent | <Grinnz>     | 10 |
| Path::Tiny      | <gordonfish> | 10 |
| IO::Async       | <simcop2387> |  9 |
| List::Util      | <pink_mist>  |  9 |
| Foo::Bar        | <daemon>     |  8 |
| Path::Tiny      | <Grinnz>     |  8 |

Packages and the top 10 users


Within the top-10 users group, which packages stand above the average?

Package citations from the top 10 users: how many times the top 10 users cited these packages

| packages        | total |
|-----------------|------:|
| Foo::Bar        |    29 |
| Path::Tiny      |    22 |
| List::Util      |    15 |
| Util::Versioned |    15 |
| IO::Async       |    14 |
| Time::Piece     |    14 |
| Math::BigFloat  |    12 |
| Mojo::UserAgent |    11 |
| Data::Dumper    |    10 |
| Mojo::DOM       |     9 |

Perl native functions

Perl has around 250 builtin functions and keywords. A great deal of the discussion should revolve around those terms, so we again count the occurrences of such names to see who uses them and how often they appear.

Note: Many Perl keywords and builtin functions are also stop words, such as “if”, “or”, “and”, etc., so we exclude those as well. Of course, other builtin functions also have common-English usage, such as “time”, “open” or “write”; such false positives are really hard to detect without further context, which could only be extracted if an n-gram analysis took place.
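A sketch of the counting: filter the already de-stop-worded tokens against a list of builtin names (the `perl_builtins` vector is a hypothetical list, e.g. extracted from perlfunc):

```r
# perl_builtins: hypothetical character vector, e.g. c("print", "map", "grep", "push", ...)
builtin_counts <- tidy_words %>%
  filter(word %in% perl_builtins) %>%
  count(user, word, sort = TRUE)   # who uses which builtin, and how often
```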


Sentiment Analysis

As usual, Wikipedia has a good article explaining sentiment analysis. My intent is to try out sentiment analysis towards the main theme of the chat: the Perl language. But first we will follow a simpler path, a global view of the average sentiment in the community.
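A sketch of the global view using the NRC lexicon, which tags words with emotions such as trust, fear and joy; the per-emotion tables printed below come from output of this kind:

```r
nrc <- get_sentiments("nrc")                # NRC emotion lexicon (fetched via the textdata package)

sentiment_words <- tidy_words %>%
  inner_join(nrc, by = "word") %>%          # attach an emotion/polarity to each token
  count(sentiment, word, sort = TRUE)

# one tibble of word counts per emotion, as printed below
split(select(sentiment_words, word, n), sentiment_words$sentiment)
```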

## $trust
## # A tibble: 312 x 2
##    word        n
##    <chr>   <int>
##  1 good       75
##  2 found      43
##  3 system     40
##  4 kind       32
##  5 pretty     32
##  6 true       31
##  7 real       25
##  8 show       23
##  9 calls      22
## 10 content    22
## # … with 302 more rows
## 
## $fear
## # A tibble: 239 x 2
##    word        n
##    <chr>   <int>
##  1 case       54
##  2 problem    51
##  3 bad        40
##  4 change     35
##  5 missing    34
##  6 default    29
##  7 lines      27
##  8 die        26
##  9 shell      19
## 10 blob       18
## # … with 229 more rows
## 
## $negative
## # A tibble: 534 x 2
##    word        n
##    <chr>   <int>
##  1 hash       60
##  2 case       54
##  3 problem    51
##  4 wrong      51
##  5 error      45
##  6 bad        40
##  7 missing    34
##  8 default    29
##  9 die        26
## 10 wait       24
## # … with 524 more rows
## 
## $sadness
## # A tibble: 223 x 2
##    word        n
##    <chr>   <int>
##  1 case       54
##  2 problem    51
##  3 error      45
##  4 bad        40
##  5 missing    34
##  6 default    29
##  7 die        26
##  8 shell      19
##  9 hate       18
## 10 remove     18
## # … with 213 more rows
## 
## $anger
## # A tibble: 210 x 2
##    word          n
##    <chr>     <int>
##  1 bad          40
##  2 argument     21
##  3 shell        19
##  4 arguments    18
##  5 hate         18
##  6 money        18
##  7 remove       18
##  8 words        18
##  9 broken       17
## 10 daemon       17
## # … with 200 more rows
## 
## $surprise
## # A tibble: 123 x 2
##    word         n
##    <chr>    <int>
##  1 good        75
##  2 guess       51
##  3 variable    43
##  4 expect      19
##  5 shell       19
##  6 hope        18
##  7 money       18
##  8 daemon      17
##  9 tree        16
## 10 break       15
## # … with 113 more rows
## 
## $positive
## # A tibble: 559 x 2
##    word         n
##    <chr>    <int>
##  1 good        75
##  2 question    53
##  3 script      51
##  4 array       47
##  5 found       43
##  6 working     43
##  7 sense       36
##  8 reading     34
##  9 kind        32
## 10 pretty      32
## # … with 549 more rows
## 
## $disgust
## # A tibble: 160 x 2
##    word        n
##    <chr>   <int>
##  1 bad        40
##  2 default    29
##  3 weird      24
##  4 blob       18
##  5 hate       18
##  6 bug        17
##  7 daemon     17
##  8 tree       16
##  9 damn       14
## 10 shit       14
## # … with 150 more rows
## 
## $joy
## # A tibble: 176 x 2
##    word        n
##    <chr>   <int>
##  1 good       75
##  2 found      43
##  3 kind       32
##  4 pretty     32
##  5 true       31
##  6 create     30
##  7 content    22
##  8 fun        22
##  9 love       22
## 10 share      20
## # … with 166 more rows
## 
## $anticipation
## # A tibble: 221 x 2
##    word        n
##    <chr>   <int>
##  1 good       75
##  2 time       72
##  3 install    48
##  4 thought    44
##  5 long       37
##  6 pretty     32
##  7 start      31
##  8 result     26
##  9 wait       24
## 10 calls      22
## # … with 211 more rows

User sentiments using the “bing” lexicon show an interesting result.
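A sketch of the per-user score: join the tokens of the top 10 users against the “bing” lexicon and take positive minus negative counts:

```r
library(tidyr)

user_sentiment <- tidy_words %>%
  semi_join(top10_users, by = "user") %>%              # keep only the top 10 users
  inner_join(get_sentiments("bing"), by = "word") %>%  # word -> positive/negative
  count(user, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)              # net score per user
```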

## # A tibble: 10 x 4
##    user         negative positive sentiment
##    <chr>           <dbl>    <dbl>     <dbl>
##  1 <daemon>           69       58       -11
##  2 <gordonfish>       73       68        -5
##  3 <Grinnz>          172      135       -37
##  4 <perlbot>          54       45        -9
##  5 <pink_mist>        95       76       -19
##  6 <rindolf>          35       46        11
##  7 <simcop2387>       94       80       -14
##  8 <thrig>            54       34       -20
##  9 <xenu>             73       56       -17
## 10 <zln>              69       51       -18

Clustering

The top 10 users are a group ranked by their participation in #perl, but we could improve on this: we can ask more interesting questions for grouping users.

Let’s start with a histogram of the total number of messages sent per user.

## # A tibble: 1 x 2
##   `mean (#messages)` `median (#messages)`
##                <dbl>                <int>
## 1               53.0                   11

As we can see, most users send fewer than 200 messages; to be exact, all but 18 of the users.

Breaking down the length of messages

We can examine the length of each message and look at the distribution for a particular user, or even the distribution over all users. This is not just curiosity: we can assume that longer messages carry more information, so message length gives a more sensible measure of information flow. We show the global distribution of message lengths and the distributions for our usual group (the top 10).
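A sketch of the length measure, taking the number of characters per message:

```r
library(ggplot2)

msg_lengths <- messages %>%
  mutate(length = nchar(text))        # message length in characters

ggplot(msg_lengths, aes(length)) +
  geom_histogram(bins = 30)           # global distribution of message lengths
```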


Packages and groups of users

One interesting question is how we could cluster users by package citation: which packages might each group be most interested in talking about? Intuition says we need to count the packages cited by each user (our previous analysis), define a distance function based on these frequencies, and then group the users.
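A sketch of one way to do this grouping: build a user-by-package count matrix and feed its distances to hierarchical clustering (`pkg_by_user` comes from the package-extraction sketch above; Euclidean distance is just one reasonable choice):

```r
pkg_matrix <- pkg_by_user %>%
  tidyr::pivot_wider(names_from = packages, values_from = n, values_fill = 0) %>%
  tibble::column_to_rownames("user") %>%
  as.matrix()

user_clusters <- hclust(dist(pkg_matrix))   # hierarchical clustering on Euclidean distances
plot(user_clusters)                         # dendrogram: users grouped by package citations
```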

Analysing frequencies

Term frequency refers to the analysis of the terms used in a corpus; the definitions below clarify what these terms mean in our analysis.

  1. Document: the set of all messages from a single user.
  2. Corpus: the collection of all documents (i.e., the entire chat log).
  3. Term: a word occurrence, as produced by the tokenization process.

Term frequency and inverse document frequency

The inverse document frequency (idf) of a term \(t\) is given by \(\mathrm{idf}(t) = \log(d/n_t)\), where:

  1. \(d\) is the number of documents,
  2. \(n_t\) is the number of documents that contain the term \(t\).

Basically, with these quantities we can compare terms across a collection of documents (the corpus) and rank the most important topics without simply removing the stop words, because stop words end up weighted as unimportant due to the high frequency with which they appear across all documents.

Of course, this works best when we have a lot of messages per user, so we will apply it to the top 10 users.
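tidytext provides `bind_tf_idf()` for exactly this (tf_idf = tf × idf); a sketch treating each top-10 user’s messages as one document and keeping stop words this time:

```r
user_words <- messages %>%
  semi_join(top10_users, by = "user") %>%      # restrict to the top 10 users
  unnest_tokens(word, text) %>%                # note: stop words are kept
  count(user, word)

user_tf_idf <- user_words %>%
  bind_tf_idf(word, user, n) %>%               # term = word, document = user, count = n
  arrange(desc(tf_idf))
```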


The graph confirms Zipf’s law. We use a log/log plot to better visualize the relation between rank and term frequency.

As with the distributions shown previously, we can see that term frequency vs. rank is similar across users and seems to obey Zipf’s law, as shown by the approximate linearity of the graph.
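A sketch of the rank/frequency plot, reusing `user_words` from the tf-idf sketch above:

```r
library(ggplot2)

zipf <- user_words %>%
  group_by(user) %>%
  mutate(rank = row_number(desc(n)),           # rank of each term within the user's document
         term_freq = n / sum(n)) %>%
  ungroup()

ggplot(zipf, aes(rank, term_freq, colour = user)) +
  geom_line() +
  scale_x_log10() +                            # on log/log axes Zipf's law appears as a line
  scale_y_log10()
```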

Ngrams

Continuing the simple analysis, we can build bigrams to make sense of words with a bit of context; bigrams are also suitable for more modelling approaches, e.g., Bayesian models. One could even build a model to tell whether our bots can be easily detected or not. And, in fact, as we will see, they can.

Note: when using tf_idf, our bigram analysis excludes users with a low number of messages sent, because they would get a high tf_idf simply by having few documents. (We set the limit to at least 100 messages.)
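A sketch of the bigram tokenization and its tf_idf, with the ≥100-messages filter applied via `msg_per_user` from earlier:

```r
bigrams <- messages %>%
  semi_join(filter(msg_per_user, total >= 100), by = "user") %>%  # drop low-volume users
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(user, bigram)

bigram_tf_idf <- bigrams %>%
  bind_tf_idf(bigram, user, n) %>%
  arrange(desc(tf_idf))                        # rows like the table below
```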

| user        | bigram       |  n |          tf |      idf |     tf_idf |
|-------------|--------------|---:|------------:|---------:|-----------:|
| <perlbot>   | line 1       | 67 | 0.008717148 | 3.433987 | 0.02993457 |
| <perlbot>   | at irc       | 60 | 0.007806401 | 3.433987 | 0.02680708 |
| <perlbot>   | irc line     | 60 | 0.007806401 | 3.433987 | 0.02680708 |
| <perlbot>   | file at      | 48 | 0.006245121 | 3.433987 | 0.02144567 |
| <perlbot>   | pasted a     | 48 | 0.006245121 | 3.433987 | 0.02144567 |
| <Strolls>   | net google   | 14 | 0.005642886 | 2.740840 | 0.01546625 |
| <Strolls>   | google drive | 18 | 0.007255139 | 2.047693 | 0.01485630 |
| <perlbot>   | at https     | 76 | 0.009888108 | 1.488077 | 0.01471427 |
| <perlbot>   | new file     | 48 | 0.006245121 | 2.335375 | 0.01458470 |
| <Jonno_FTW> | perl5 5.30   | 10 | 0.005022602 | 2.740840 | 0.01376615 |

Let’s now focus on the top 10 users (including bots) and check the distribution of bigram frequencies, so we can see in a picture how the bigram frequency distributions differ between humans and bots.

As we can see here, perlbot’s histogram shape differs from those of its DNA-made counterparts. That is because a bot repeats the same responses (usually phrased in ways humans would not use); after all, we can vary our topics and manner of speaking rather more than a Perl program can.

Network of bigrams

We can represent the bigrams as a network by treating the words of each bigram pair \((w_1, w_2)\) as vertices joined by an edge, so the entire set of bigrams forms a graph.

Let’s visualize it for a subset of the corpus (those bigrams that appear at least 30 times).
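A sketch of building and drawing the graph with igraph and ggraph, reusing the per-user `bigrams` counts from above:

```r
library(igraph)
library(ggraph)

bigram_graph <- bigrams %>%
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, wt = n, name = "n") %>%     # total occurrences of each bigram
  filter(n >= 30) %>%                             # keep bigrams appearing at least 30 times
  graph_from_data_frame()                         # vertices = words, edges = bigram pairs

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```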


It seems that perlbot dominates this graph because, as we said, its bigrams occur with high frequency in its own (automated) messages, unlike those of our DNA-made bots (such as Grinnz).

So we can exclude perlbot from the chat and rerun this particular analysis, also lowering the required frequency (at least 10 occurrences). Results are shown below:

Conclusions

This very simple analysis was heavily inspired by the R book Text Mining with R (the tidytext book), which works through examples using books. Computational linguistics is a vast study area of which I have absolutely no knowledge, so any review will be more than welcome. My motivation was to get a view of the Perl community, which I found to be extremely friendly and helpful. Having been a member myself (arthurpbs) for a while, I can say that #perl (on irc.freenode.org) and many other thematic rooms (on irc.perl.org), such as #mojo, are sources of good discussions and advice on the common problems facing a programmer, software analyst or engineer.

Special thanks to Grinnz, who always has a good comment and helpful advice on a range of topics, as the numbers and graphs here show.

Critique ( perlcritic . )

Our analysis disregards some facts: for example, bot users (such as perlbot), code snippets that do occur in chat rooms, the use of emojis, abbreviations, slang, and other prior knowledge. We could improve our analysis considerably by making the tokenization process much better.

A nice idea is to create a Perl module to provide proper tokenization, taking into account all the variables described above.

Another consideration is to use the timestamps of messages, so we could better structure the dialogues between users, as well as get an idea of the statistics of the stochastic process behind the time-series events that constitute a real-time chat.