A basic text analysis (text mining) of public IRC chat conversations.
The dataset consists of the messages sent on the #perl freenode channel during the period 01/01/2020 - 02/01/2020.
Here one message is defined as one line of text sent; it does not need to be a complete sentence.
We remove URLs and numbers from the messages (they can be relevant for many analyses, but ours will not need them) and strip the text of common stop words (see note). Later we will make a further analysis using term frequencies that treats all terms, including stop words.
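A minimal sketch of this cleaning step with stringr/tidytext, assuming the raw log is already in a data frame `messages` with one row per message and columns `user` and `text` (both names are illustrative, not the actual ones used):

```r
library(dplyr)
library(stringr)
library(tidytext)

clean_words <- messages %>%
  mutate(text = str_remove_all(text, "https?://\\S+"),    # drop urls
         text = str_remove_all(text, "\\b[0-9]+\\b")) %>% # drop bare numbers
  unnest_tokens(word, text) %>%                           # one row per word
  anti_join(stop_words, by = "word")                      # remove common stop words
```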
First, how frequently do people chat (what is the distribution of the total number of messages per user)?
The distribution appears to be some form of exponential, maybe a Poisson distribution, since the number of messages sent relates to a message-arrival process that could be modelled, under some assumptions, as a Poisson stochastic process.
The top 10 users and how many messages they sent.
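A sketch of the per-user message counts used here and below (building on the assumed `messages` data frame):

```r
messages_per_user <- messages %>%
  count(user, name = "total", sort = TRUE)   # total messages per user

top10_users <- messages_per_user %>%
  slice_max(total, n = 10)                   # the 10 most active users
```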
Comparing the top 10 with all the rest in total messages sent.
As we can see, the first 10 users speak as much as all the rest of the room combined (approximately 300 people).
A naive and fast analysis using only tokenization and stop-word removal.
Note: tokenization is done at the word level, with stop words removed (a canonical tokenization).
Top 10 word occurrences at #perl (word frequency):

| word | n | freq (%) |
|---|---|---|
| perl | 1268 | 2.1550699 |
| foo | 407 | 0.6917298 |
| file | 380 | 0.6458411 |
| code | 333 | 0.5659608 |
| grinnz | 333 | 0.5659608 |
| perlbot | 280 | 0.4758829 |
| line | 278 | 0.4724838 |
| time | 264 | 0.4486896 |
| yeah | 253 | 0.4299942 |
| pink_mist | 233 | 0.3960026 |
A graphical view of the previous table.
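Something along these lines could produce the frequency table and its bar chart (object names assumed as before):

```r
library(ggplot2)

word_freq <- clean_words %>%
  count(word, sort = TRUE) %>%
  mutate(freq = 100 * n / sum(n))            # frequency as a percentage

word_freq %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "occurrences", title = "Top 10 words at #perl")
```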
Counting words per user: the top 10 users and their most common words.
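A sketch of that per-user word count, restricted to the top 10 users:

```r
words_by_user <- clean_words %>%
  count(user, word, sort = TRUE)

top_words_by_user <- words_by_user %>%
  filter(user %in% top10_users$user) %>%
  group_by(user) %>%
  slice_max(n, n = 10) %>%      # each user's 10 most common words
  ungroup()
```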
How often does a user cite (tag) another user (usually in a reply)? Which user is cited most by other users?
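One possible sketch of this citation count: treat a token as a citation whenever it matches another participant's nickname (this matching rule is my assumption of how such a tally could work):

```r
nicks <- messages %>%
  distinct(user) %>%
  mutate(nick = str_to_lower(str_remove_all(user, "[<>]")))  # "<Grinnz>" -> "grinnz"

cited <- clean_words %>%
  inner_join(nicks, by = c("word" = "nick")) %>%   # keep tokens that are nicknames
  count(word, name = "total", sort = TRUE) %>%
  rename(cited = word)
```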
Total citations per user (each user and how many times it was cited):

| cited | total |
|---|---|
| grinnz | 333 |
| perlbot | 280 |
| pink_mist | 233 |
| rindolf | 175 |
| simcop | 155 |
| zln | 148 |
| gordonfish | 127 |
| daemon | 98 |
| simple | 90 |
| botje | 88 |
Note: the user names above are themselves produced by the tokenization process.
Note: this is probably not very useful. A package here means two or more words separated by the double colon "::"; otherwise the rate of false positives for single-word package names is too high, and checking each message to decide whether a name refers to a package or something else is too much work for the purpose of this simple analysis.
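A sketch of that detection rule as a regular expression (two or more identifiers joined by "::"):

```r
library(tidyr)

pkg_pattern <- "\\b[A-Za-z_][A-Za-z0-9_]*(?:::[A-Za-z_][A-Za-z0-9_]*)+\\b"

packages_cited <- messages %>%
  mutate(packages = str_extract_all(text, pkg_pattern)) %>%  # list of matches per message
  unnest(packages) %>%                                       # one row per package mention
  count(packages, sort = TRUE)
```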
Top 10 packages cited in the chat:

| packages | n |
|---|---|
| Foo::Bar | 53 |
| Time::Piece | 30 |
| List::Util | 27 |
| Path::Tiny | 23 |
| DBIx::Class | 15 |
| Util::Versioned | 15 |
| Data::Dumper | 14 |
| IO::Async | 14 |
| Math::BigFloat | 13 |
| Mojo::UserAgent | 12 |
| Params::Validate | 12 |
Package citations by user (how many times a user cited a package, top 10 occurrences):

| packages | user | n |
|---|---|---|
| Foo::Bar | <Harzilein> | 20 |
| Foo::Bar | <Grinnz> | 17 |
| Util::Versioned | <daemon> | 15 |
| Time::Piece | <Grinnz> | 11 |
| Mojo::UserAgent | <Grinnz> | 10 |
| Path::Tiny | <gordonfish> | 10 |
| IO::Async | <simcop2387> | 9 |
| List::Util | <pink_mist> | 9 |
| Foo::Bar | <daemon> | 8 |
| Path::Tiny | <Grinnz> | 8 |
Package citations from the top 10 users (how many times the top 10 users cited these packages):

| packages | total |
|---|---|
| Foo::Bar | 29 |
| Path::Tiny | 22 |
| List::Util | 15 |
| Util::Versioned | 15 |
| IO::Async | 14 |
| Time::Piece | 14 |
| Math::BigFloat | 12 |
| Mojo::UserAgent | 11 |
| Data::Dumper | 10 |
| Mojo::DOM | 9 |
Perl has around 250 builtin functions and keywords. A great deal of the discussion should revolve around those terms, so we again count occurrences of such names to see who uses them and how often they appear.
Note: many Perl keywords and builtin functions are also stop words, such as "if", "or", "and", etc., so we exclude those as well. Of course, other builtin functions such as "time", "open" or "write" also have common everyday usage; such false positives are really hard to detect without further context, which could only be extracted by an n-gram analysis.
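A sketch of the counting step; `perl_builtins` stands in for the full list of ~250 builtin names (the short vector here is only a placeholder, the real list would have to be supplied, e.g. from perldoc):

```r
perl_builtins <- c("print", "push", "pop", "shift", "splice", "sort",
                   "keys", "values", "map", "grep", "sprintf", "chomp")  # placeholder subset

builtin_usage <- clean_words %>%
  filter(word %in% perl_builtins) %>%
  count(user, word, sort = TRUE)   # who uses which builtin, and how often
```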
Sentiment analysis: as usual, Wikipedia has a good article explaining it. My intent is to try out sentiment analysis towards the main theme of the chat, the Perl language. But first we follow a simpler analysis, a global view of the average sentiment in the community.
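A minimal sketch of how the per-sentiment word lists below could be tallied with the "nrc" lexicon (object names are mine):

```r
nrc <- get_sentiments("nrc")   # needs the textdata package on first use

sentiment_words <- clean_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, word, sort = TRUE)

# one tibble of word counts per sentiment category
split(select(sentiment_words, word, n), sentiment_words$sentiment)
```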
## $trust
## # A tibble: 312 x 2
## word n
## <chr> <int>
## 1 good 75
## 2 found 43
## 3 system 40
## 4 kind 32
## 5 pretty 32
## 6 true 31
## 7 real 25
## 8 show 23
## 9 calls 22
## 10 content 22
## # … with 302 more rows
##
## $fear
## # A tibble: 239 x 2
## word n
## <chr> <int>
## 1 case 54
## 2 problem 51
## 3 bad 40
## 4 change 35
## 5 missing 34
## 6 default 29
## 7 lines 27
## 8 die 26
## 9 shell 19
## 10 blob 18
## # … with 229 more rows
##
## $negative
## # A tibble: 534 x 2
## word n
## <chr> <int>
## 1 hash 60
## 2 case 54
## 3 problem 51
## 4 wrong 51
## 5 error 45
## 6 bad 40
## 7 missing 34
## 8 default 29
## 9 die 26
## 10 wait 24
## # … with 524 more rows
##
## $sadness
## # A tibble: 223 x 2
## word n
## <chr> <int>
## 1 case 54
## 2 problem 51
## 3 error 45
## 4 bad 40
## 5 missing 34
## 6 default 29
## 7 die 26
## 8 shell 19
## 9 hate 18
## 10 remove 18
## # … with 213 more rows
##
## $anger
## # A tibble: 210 x 2
## word n
## <chr> <int>
## 1 bad 40
## 2 argument 21
## 3 shell 19
## 4 arguments 18
## 5 hate 18
## 6 money 18
## 7 remove 18
## 8 words 18
## 9 broken 17
## 10 daemon 17
## # … with 200 more rows
##
## $surprise
## # A tibble: 123 x 2
## word n
## <chr> <int>
## 1 good 75
## 2 guess 51
## 3 variable 43
## 4 expect 19
## 5 shell 19
## 6 hope 18
## 7 money 18
## 8 daemon 17
## 9 tree 16
## 10 break 15
## # … with 113 more rows
##
## $positive
## # A tibble: 559 x 2
## word n
## <chr> <int>
## 1 good 75
## 2 question 53
## 3 script 51
## 4 array 47
## 5 found 43
## 6 working 43
## 7 sense 36
## 8 reading 34
## 9 kind 32
## 10 pretty 32
## # … with 549 more rows
##
## $disgust
## # A tibble: 160 x 2
## word n
## <chr> <int>
## 1 bad 40
## 2 default 29
## 3 weird 24
## 4 blob 18
## 5 hate 18
## 6 bug 17
## 7 daemon 17
## 8 tree 16
## 9 damn 14
## 10 shit 14
## # … with 150 more rows
##
## $joy
## # A tibble: 176 x 2
## word n
## <chr> <int>
## 1 good 75
## 2 found 43
## 3 kind 32
## 4 pretty 32
## 5 true 31
## 6 create 30
## 7 content 22
## 8 fun 22
## 9 love 22
## 10 share 20
## # … with 166 more rows
##
## $anticipation
## # A tibble: 221 x 2
## word n
## <chr> <int>
## 1 good 75
## 2 time 72
## 3 install 48
## 4 thought 44
## 5 long 37
## 6 pretty 32
## 7 start 31
## 8 result 26
## 9 wait 24
## 10 calls 22
## # … with 211 more rows
User sentiment using the "bing" lexicon shows an interesting result.
## # A tibble: 10 x 4
## user negative positive sentiment
## <chr> <dbl> <dbl> <dbl>
## 1 <daemon> 69 58 -11
## 2 <gordonfish> 73 68 -5
## 3 <Grinnz> 172 135 -37
## 4 <perlbot> 54 45 -9
## 5 <pink_mist> 95 76 -19
## 6 <rindolf> 35 46 11
## 7 <simcop2387> 94 80 -14
## 8 <thrig> 54 34 -20
## 9 <xenu> 73 56 -17
## 10 <zln> 69 51 -18
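A sketch of how that per-user tally could be computed (again with assumed object names):

```r
library(tidyr)

user_sentiment <- clean_words %>%
  filter(user %in% top10_users$user) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(user, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)   # net sentiment per user
```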
The top 10 users are a group ranked by their participation in #perl, but we could improve on this and ask more interesting questions for grouping users.
Let us start with a histogram of the total number of messages sent per user.
## # A tibble: 1 x 2
## `mean (#messages)` `median (#messages)`
## <dbl> <int>
## 1 53.0 11
As we can see, most users send fewer than 200 messages; to be exact, all but 18 of them.
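A sketch of the histogram and the summary statistics above, using the per-user counts from before:

```r
messages_per_user %>%
  summarise(`mean (#messages)`   = mean(total),
            `median (#messages)` = median(total))

ggplot(messages_per_user, aes(total)) +
  geom_histogram(bins = 30) +
  labs(x = "messages sent", y = "number of users")
```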
We can examine the length of each message and look at its distribution for a particular user, or even for all users. This is not only curiosity: we can argue that longer messages should carry more information, so message length gives a more sensible measure of information flow. We show the global distribution of message lengths and the distributions for our usual group (the top 10).
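One way to sketch this, measuring length in characters (the choice of characters rather than words is my assumption):

```r
message_lengths <- messages %>%
  mutate(length = nchar(text))

# global distribution of message lengths
ggplot(message_lengths, aes(length)) +
  geom_histogram(bins = 30) +
  labs(x = "message length (characters)", y = "messages")

# same distribution, one panel per top-10 user
message_lengths %>%
  filter(user %in% top10_users$user) %>%
  ggplot(aes(length)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ user, scales = "free_y") +
  labs(x = "message length (characters)", y = "messages")
```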
One interesting question is how we could cluster the users by package citation: which packages is each group most interested in talking about? Intuition says we need to count the packages cited by each user (our previous analysis), set the distance function to be based on these frequencies and then group the users, as sketched below.
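A rough sketch of that idea: build a user-by-package count matrix (reusing `pkg_pattern` from the earlier sketch) and run hierarchical clustering on it. Euclidean distance on raw counts is just one simple choice here.

```r
pkg_by_user <- messages %>%
  mutate(packages = str_extract_all(text, pkg_pattern)) %>%
  unnest(packages) %>%
  count(user, packages)

pkg_matrix <- pkg_by_user %>%
  pivot_wider(names_from = packages, values_from = n, values_fill = 0) %>%
  tibble::column_to_rownames("user")

hc <- hclust(dist(pkg_matrix))    # Euclidean distance on citation counts
plot(hc)                          # dendrogram of users grouped by package usage
```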
Term frequency refers to the analysis of the terms used in a corpus; the definitions below guide what it means in our analysis.
The term frequency (tf) of a term is how often it appears in a document, and the inverse document frequency (idf) is given by \(\mathrm{idf}(t) = \log(d / n_t)\), where \(d\) is the total number of documents in the corpus and \(n_t\) is the number of documents containing the term \(t\). The tf-idf weight of a term is the product \(\mathrm{tf} \times \mathrm{idf}\).
With these quantities we can compare terms across a collection of documents (corpus) and rank the most important topics without simply removing the stop words, because stop words end up weighted as unimportant due to the high frequency with which they appear across all documents.
Of course this works best when we have many messages per user, so we will use it for the top 10 users.
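A sketch of that weighting with tidytext, treating each user's messages as one document and keeping all terms (stop words included):

```r
all_words_by_user <- messages %>%
  unnest_tokens(word, text) %>%    # no stop-word removal this time
  count(user, word)

user_tfidf <- all_words_by_user %>%
  filter(user %in% top10_users$user) %>%
  bind_tf_idf(word, user, n) %>%   # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))
```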
The graph confirms Zipf's law. We use a log/log plot to better visualize the relation between rank and term frequency.
As with the distributions shown previously, we can see that term frequency versus rank behaves similarly across users and seems to obey Zipf's law, shown by the approximate linearity of the plot.
Continuing the simple analysis, we can build bigrams to make sense of words with a bit of context; bigrams are also suitable for further modelling approaches, e.g. Bayesian models. One could even build a model to tell whether our bots can be easily detected or not, and in fact, as we will see, they can.
Note: when using tf_idf, our bigram analysis excludes users with a low number of documents (a low rate of messages sent), because they would get a high tf_idf simply by virtue of having few documents. (We set the limit at a minimum of 100 messages.)
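A sketch of the bigram tf-idf computation behind the table below:

```r
active_users <- messages_per_user %>%
  filter(total >= 100)             # keep users with at least 100 messages

bigram_tfidf <- messages %>%
  filter(user %in% active_users$user) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%       # one-word messages yield NA bigrams
  count(user, bigram) %>%
  bind_tf_idf(bigram, user, n) %>%
  arrange(desc(tf_idf))
```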
| user | bigram | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| <perlbot> | line 1 | 67 | 0.008717148 | 3.433987 | 0.02993457 |
| <perlbot> | at irc | 60 | 0.007806401 | 3.433987 | 0.02680708 |
| <perlbot> | irc line | 60 | 0.007806401 | 3.433987 | 0.02680708 |
| <perlbot> | file at | 48 | 0.006245121 | 3.433987 | 0.02144567 |
| <perlbot> | pasted a | 48 | 0.006245121 | 3.433987 | 0.02144567 |
| <Strolls> | net google | 14 | 0.005642886 | 2.740840 | 0.01546625 |
| <Strolls> | google drive | 18 | 0.007255139 | 2.047693 | 0.01485630 |
| <perlbot> | at https | 76 | 0.009888108 | 1.488077 | 0.01471427 |
| <perlbot> | new file | 48 | 0.006245121 | 2.335375 | 0.01458470 |
| <Jonno_FTW> | perl5 5.30 | 10 | 0.005022602 | 2.740840 | 0.01376615 |
Let us now focus on the top 10 users (including bots) and check the distribution of the bigram frequencies, so we can see in a picture how the bigram frequency distributions differ between humans and bots.
As we can see here, perlbot's histogram shape differs from its DNA-made counterparts: a bot keeps giving the same responses (and ones that are usually not phrased the way humans phrase things); after all, we can vary our topics and manner of speaking a bit more than a Perl program can.
We can represent the bigrams as a network by treating the words of each bigram pair \((w_1, w_2)\) as vertices, with the pair itself forming an edge; the entire set of bigrams then forms a graph.
Let's visualize it for a subset of the corpus (those bigrams that appear at least 30 times).
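A sketch of that graph with igraph/ggraph, reusing the bigram counts from the previous sketch:

```r
library(igraph)
library(ggraph)
library(tidyr)

bigram_graph <- bigram_tfidf %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, wt = n) %>%   # total occurrences of each pair
  filter(n >= 30) %>%               # keep pairs seen at least 30 times
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```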
It seems that perlbot dominates this graph because, as we said, its bigrams have a high frequency within its own messages (automated responses) and a low frequency compared to our DNA-made bots (such as Grinnz).
So we can exclude perlbot from the chat and rerun this particular analysis, also lowering the minimum number of occurrences a bigram must have (at least 10 times). Results are shown below:
This very simple analysis was heavily inspired by the tidytext R book, which shows examples using books. Computational linguistics is a vast field of study of which I have absolutely no knowledge, so any review will be more than welcome. My motivation was to get a view of the Perl community, which I found to be extremely friendly and helpful. Having been a member myself (arthurpbs) for a while, I can say that #perl (on irc.freenode.org) and many other thematic rooms (on irc.perl.org), such as #mojo, are a source of good discussions and advice on the common problems facing a programmer, software analyst or engineer.
Special thanks to Grinnz, who always has a good comment and helpful advice for a range of topics, as the numbers and graphs here show.
Our analysis disregards some facts, for example bot users (such as perlbot), code snippets that do occur in chat rooms, emojis, abbreviations, slang, and much other prior knowledge; we could improve our analysis considerably by making the tokenization process a much better one.
A nice idea is to create a Perl module to help provide proper tokenization, taking into account all the variables described above.
Another consideration is to use the timestamps of messages, so we could better reconstruct the structure of dialogues between users, and also get an idea of the statistics of the stochastic process behind the time series of events that constitutes a real-time chat.