Introduction: why do statistics on text?

The main goal of text analysis is to find regular formal structures in texts.

These techniques are powerful because they let us link what is said with who said it, e.g. the sociodemographic characteristics of the speakers.

For instance, people differ in the vocabulary they use to define happiness, work, etc.

Figure: "Why not have children?"

Packages for the workshop

library(stringr)     # string manipulation (tidyverse style)
library(stringi)     # string manipulation (ICU-based)
library(dplyr)       # data wrangling
library(questionr)   # survey and recoding helpers
library(tm)          # text mining framework
library(stopwords)   # stopword lists
library(quanteda)    # text mining framework
library(wordcloud)   # wordclouds
library(topicmodels) # LDA topic models
library(rainette)    # Reinert (Alceste-style) clustering
library(FactoMineR)  # correspondence analysis (CA)
library(explor)      # interactive exploration of CA/MCA results

Several packages…

There are plenty of packages that could be used for text mining :

  • tm
  • quanteda
  • tidytext
  • corpus
  • koRpus
  • etc.

They are often similar, and rely on the same functions from other packages (wordcloud, for instance).

We will focus on “tm” and “quanteda”, but tidytext is very useful for those who like the tidyverse :D (though tidytext does not implement some functions I find useful)

… and also several software tools

In my research, I often use R to collect and shape my corpus, and also to run some basic statistics. But I often use dedicated software for text mining:

The particularity of these tools is that they use R to do the computations.

Another good way to do text analysis without much R knowledge is the package “R.temis” (Bouchet-Valat & Bastin), which works through R Commander.

But here, the point is precisely to look at R!

Starting point: using “regular” R functions

Using CA to represent lexical associations

A good way to do text analysis is to build your own objects to test hypotheses about your data. For instance, here, I run a correspondence analysis (CA, “AFC” in French) on a table of frequencies crossing a list of categories that I study with the type of preceding word.

AFC<-read.csv("C:/Users/coren/Dropbox/Sociologie/SatRdays/Analyse textuelle/AFC_categories.csv", sep=";",
              header=T, row.names=1)
resul<-CA(AFC)
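
Since explor is loaded above, the CA result can also be browsed interactively. This is optional; it simply opens a Shiny interface on the resul object created above:

explor(resul) # interactive exploration of the correspondence analysis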

Figure: AFC_Categories (CA of the categories by type of preceding word)

Prepare a corpus

Data : a work in progress on French Hip Hop

Texts from a web scraping of the Genius site: all the lyrics of French rap songs between 2010 and 2017, from pages of this type: https://genius.com/Rap-francais-discographie-2017-lyrics

At first, the question was to study featurings between artists: how is the rap world structured, as measured by collaborations on featurings?

Figure: Network_rap (featuring collaboration network between artists)

Now that this structure is visible, I want to cross the analysis with textual variables. The question is: do artists who collaborate perform a similar kind of rap?

So, we load the data. Here, we only keep the songs that include featurings:

FR <- read.csv2("C:/Users/coren/Dropbox/Sociologie/Sociologie du rap/Bddfeat/GENIUSFEAT2.csv", stringsAsFactors = F)

Here, the texts are not “clean”: the column “Texte” is the only interesting one, but it has to be cleaned.

t<-FR$Texte
print(t[100])
## [1] "\n          \n            \n            \n\n[Intro : Rim'K] x2\nOn a connu la galère, les métaux précieux\nTout revient à la mama, la prunelle de mes yeux\n[Refrain : Lartiste] x2\nOn a connu les galères, on a connu les dramas\nL'époque des bandanas quand on était gamins\nSi j'veux réussir, c'est pour la mama\nJ'pense à elle quand tout va mal\n[Couplet 1 : Rim'K]\nHé gringo, qué pasa ?\nJ'suis avec la fafa, la fafa c'est quoi ?\nLa fafa, c'est la famille\nNous on vient des quartiers sensibles\nGros cigare de cuba, p'tite danse du Brésil\nAu mariage, au décès, au baptême, les mêmes qui seront là\nLes cousins nostra\nTu te places sur mon lit pour mes liasses\nÀ la Mahrez, j'leur ai fait le coup du foulard\nFennec dans la brume, foyer dans la fume\nOn en a brûlé des thunes\nJ'ai construit ma vie sur un champ de ruines\nJ'rentrais après la lune\nSi nous fâchés, ça fait rom pom pom\nMoi et ma team on est plein, plein, plein\nTout c'qui faut dans ma planque, planque, planque\nDes big Sam, Sam, Sam\n[Refrain : Lartiste] x2\nOn a connu les galères, on a connu les dramas\nL'époque des bandanas quand on était gamins\nSi j'veux réussir, c'est pour la mama\nJ'pense à elle quand tout va mal\n[Couplet 2 : Lartiste]Le V12 dans l'Veneno\nPas d'marche arrière, pas de créneau\nJ'fais mon malin comme Serge Beynaud\nCe soir encore, on fout le fuego\nJ'ai enfumé la pièce, j'arrive canabissé\nT'inquiète pas p'tit frère, le dîner est bien visser\nJ'traîne qu'avec des hommes, pas de calamités\nPas d'mythomanes qui tté-gra l'amitié\nAllez vamos, monte vite dans le gamos\nY'a trop de Piqué, y'a trop de Ramos\nY'a trop d'alcool, y'a trop de cassos\nBang bang, ça tire dans l'club\nEncore un narvalo qui fait le thug\nLe patron va appeler les keufs\nCes bâtards vont faire fuir les meufs\n[Refrain : Lartiste] x2\nOn a connu les galères, on a connu les dramas\nL'époque des bandanas quand on était gamins\nSi j'veux réussir, c'est pour la mama\nJ'pense à elle quand tout va mal\n[Outro]\nEt je pense à elle, c'est elle qui m'a porté\nLa seule qui m'a soutenu quand je prenais des coups bas\nOui je pense à elle, c'est elle qui m'a porté\nLa seule qui m'a soutenu quand je prenais des coups bas\nDes galères, des dramas\nDes galères, des dramas\nDes dramas, des dramas\nDes dramas, des galères\nDes dramas, si y'avait pas la mama, je n'en serais pas là\n\n\n            \n          \n        "

Clean a corpus

Construct the dataframe

The “Texte” column to decode has this form: “[Rappeur 1 :] Couplet 1 [Rappeur 2 :] Couplet 2”, etc.

We are going to need regular expressions to parse it!

A powerful tool: regular expressions (regex)

Regular expressions manipulate character strings in R (and in other languages too). They are useful for many things beyond text analysis! Metacharacters let one pattern match many strings: for instance, “.” means “any character”. To neutralize a metacharacter in R, we put a double backslash in front of it (“\\.” matches a literal dot).

Main characters of regex :

  • […] : one of the characters between the brackets
  • [^…] : any character except the ones between the brackets
  • [x-y] : any character between x and y
  • [:alnum:] : any letter or digit
  • [0-9] : any digit

Some quantifications :

  • + : 1 or more occurrences of the preceding form
  • * : 0 or more occurrences of the preceding form
  • ? : 0 or 1 occurrence
  • {n} : exactly n occurrences

Also :

  • ^ : marks the beginning of the string
  • $ : marks the end of the string, etc.

We are going to use base R functions:

  • gsub (replace an expression with another one)
  • grepl (does the expression appear? -> TRUE or FALSE)
  • grep (positions where grepl is TRUE)

But also functions of stringi and stringr packages.
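
A quick illustration of these functions on a small made-up vector (not from the corpus), just to fix ideas:

v <- c("Intro : Rim'K", "Couplet 1 : Lartiste", "Outro")
grepl("[0-9]", v)                       # is there a digit? -> FALSE TRUE FALSE
grep("[0-9]", v)                        # positions of the matches -> 2
gsub("couplet[ 0-9]*", "", tolower(v))  # remove "couplet" followed by spaces/digits
str_detect(v, "^Intro")                 # stringr equivalent of grepl, anchored at the start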

Creation of two lists with the needed information

First, we remove what comes before the first “[” (we don’t know whom to assign this text to)

t <- gsub("^[^\\[]*\\[", "",t)
# Literally: from the beginning of the text, anything that is not an opening bracket, as many times as possible, until we find an opening bracket

t <- gsub("^", "\\[",t) # We add back the first opening tag
# Note: we could have done everything at once, like this: gsub("^([^\\[]*)(\\[)","\\2",t)

Now, we collect in one list what is inside the brackets (the names of the artists) and, in another, what lies between the bracketed tags (the texts):

test2 <- stri_match_all_regex(t, "\\[[^\\]]*\\]") # This pattern, literally: an opening bracket, then anything but a closing bracket, up to a closing bracket -> [blabla]
# the "all" variant retrieves every occurrence of this pattern


test <- str_split(t, "\\[[^\\]]*\\]") # Splits the texts each time this pattern is met

Here, we used the str_split function, which cuts strings wherever the pattern matches.

Creation of dataframe

One column for the name of the artist, and one for the text. There are certainly plenty of faster ways than a for loop (see the sketch after this block)…:

test3<-data.frame()
test4 <- data.frame()
for (i in (1:length(test))){ # For each song
  for (j in (1:length(test2[[i]]))) { # For the number of tags in the song
    test4[1,1] <- test2[[i]][j] 
    test4[1,2] <- test[[i]][j+1] # j+1 because str_split leaves an empty first element in test
    test4[1,3] <- i # Keep an identifier for the song
    test3 <- bind_rows(test3, test4)
  }
}

colnames(test3) <- c("Artiste", "Texte", "id")
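
As mentioned above, the same dataframe can be built without growing objects inside a loop. A sketch of one loop-free alternative, assuming test and test2 as constructed above:

# stri_match_all_regex returns one matrix per song: keep its first column (the tags)
balises <- lapply(test2, function(m) m[, 1])
test3_alt <- bind_rows(
  Map(function(tags, parts, id) {
        data.frame(Artiste = tags,
                   Texte   = parts[-1], # drop the empty first element left by str_split
                   id      = id,
                   stringsAsFactors = FALSE)
      },
      balises, test, seq_along(test))
)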

We can show some information about this dataframe:

max(table(test3$id)) # One song with 40 different texts
## [1] 40
min(table(test3$id)) # Songs with only one text: not a featuring
## [1] 1
table(table(test3$id)==1) # 98 songs in that case (out of 2640 = 4%)
## 
## FALSE  TRUE 
##  2542    98
mean(table(test3$id)) # Mean of 6.8 texts per song
## [1] 6.820833

Around 18,000 different texts, but for a lot of them either the text is too short or the artist information is missing.

We remove those texts :

rap <- subset(test3, nchar(test3$Texte)> 200 & nchar(test3$Artiste) > 5) 

6 000 texts are dropped.

Cleaning the corpus

Now that we have structured our corpus, we need to clean it:

Names of the artists

To do so, we use a function :

clean_rappeur <- function (x) {
  x <- tolower(x)
  x <- gsub("\\[", "", x)             # remove opening brackets
  x <- gsub("\\]", "", x)             # remove closing brackets
  x <- gsub(".* :", "", x)            # keep only what follows the "xxx :" label
  x <- gsub("couplet[ 0-9]*", "", x)  # remove structure labels (couplet, refrain, etc.)
  x <- gsub("refrain", "", x)
  x <- gsub(":", "", x)
  x <- gsub("\\?", "", x)
  x <- gsub("intro", "", x)
  x <- gsub("outro", "", x)
  x <- gsub(".*transcri.*", "", x)    # drop "transcription" credit lines entirely
  x <- gsub("-", "", x)
  x <- gsub("–", "", x)
  x <- gsub(",", "", x)
  x <- gsub("—", "", x)
  x <- gsub("x[0-9]", "", x)          # remove repetition marks such as "x2"
  x <- gsub("\\*[0-9]", "", x)
  x <- gsub("\\(.*\\)", "", x)        # remove parenthesised content
  x <- gsub("pont", "", x)
  x <- gsub("^ *", "", x)             # trim leading spaces
  x <- gsub(" *$", "", x)             # trim trailing spaces
  x
}

rap[,1] <- clean_rappeur(rap[,1])

rap <- subset(rap, nchar(rap$Texte)> 200 & nchar(rap$Artiste) > 2)

1,200 fewer texts.

We also need to remove duplicated rows (repeated choruses, for instance):

rap <- unique(rap)

800 fewer.

We can look at the distribution by artist. For our purposes, we only keep the artists with the most texts:

rap$Artiste <- as.factor(rap$Artiste)
table(table(rap$Artiste)) # Very unequal distribution
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1255  517  212   99   61   58   35   31   13   16   22   15   13   21    6 
##   16   17   18   19   20   21   22   24   25   26   27   28   29   30   31 
##   12    9   11    6    6    5    4    4    3    4    4    3    5    4    7 
##   32   33   35   36   37   38   39   40   41   42   44   45   46   47   48 
##    2    1    1    1    1    1    2    1    1    3    1    3    1    2    1 
##   49   51   52   53   57   60   62   64   67   69   75   80   81   85  100 
##    1    2    3    1    1    1    1    2    1    1    1    1    1    2    1 
##  102  114 
##    1    1
sort(table(rap$Artiste), decreasing=T)[1:50]
## 
##        alkpote         alonzo         nekfeu            jul        soprano 
##            114            102            100             85             85 
##      la fouine          booba          t.i.s    maître gims     seth gueko 
##             81             80             75             69             67 
##           niro          sadek     swift guad            a2h        sofiane 
##             64             64             62             60             57 
##   deen burbigo        black m          dadju        sneazzy          furax 
##             53             52             52             52             51 
##        jok'air lucio bukowski          taïro          rohff     youssoupha 
##             51             49             48             47             47 
##     s.pri noir     alpha wann        canardo         lacrim          rim'k 
##             46             45             45             45             44 
##   jeff le nerf         kaaris         scylla       lartiste          niska 
##             42             42             42             41             40 
##     kery james           lefa    hayce lemsi    green money         guizmo 
##             39             39             38             37             36 
##    despo rutti           lino   barack adama          nemir   demi portion 
##             35             33             32             32             31 
##          hamza       mac tyer  rockin' squat          sinik    tito prince 
##             31             31             31             31             31
sum(sort(table(rap[,1]), decreasing = T)[1:10])
## [1] 858
tab<-sort(table(rap[,1]), decreasing = T)[1:20]
names(tab)
##  [1] "alkpote"      "alonzo"       "nekfeu"       "jul"         
##  [5] "soprano"      "la fouine"    "booba"        "t.i.s"       
##  [9] "maître gims"  "seth gueko"   "niro"         "sadek"       
## [13] "swift guad"   "a2h"          "sofiane"      "deen burbigo"
## [17] "black m"      "dadju"        "sneazzy"      "furax"
# We may want to keep only the texts of the rappers who have the most texts
rap2 <- subset(rap, rap$Artiste %in% names(tab))

rap2$Artiste <- as.factor(as.character(rap2$Artiste)) # droplevels() would be more efficient
table(rap2$Artiste)
## 
##          a2h      alkpote       alonzo      black m        booba 
##           60          114          102           52           80 
##        dadju deen burbigo        furax          jul    la fouine 
##           52           53           51           85           81 
##  maître gims       nekfeu         niro        sadek   seth gueko 
##           69          100           64           64           67 
##      sneazzy      sofiane      soprano   swift guad        t.i.s 
##           52           57           85           62           75

1500 texts for 20 artists.

Do text lengths differ by artist?

summary(tapply(nchar(rap2$Texte), rap2$Artiste, mean, na.rm=F))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   443.7   522.5   598.7   623.1   714.7   811.7

The distribution is quite homogeneous.
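
A quick visual check of the same thing (an optional sketch using base graphics):

# Distribution of text lengths (in characters) by artist
boxplot(nchar(rap2$Texte) ~ rap2$Artiste, las = 2, cex.axis = 0.7,
        xlab = "", ylab = "Number of characters")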

Cleaning the texts

We do the same kind of things for the texts :

### remove the \n (line breaks)
rap2$Texte <- gsub("\\\n",". ",rap2$Texte) # We add a period at each line break, which gives an idea of the sentences
rap2$Texte <- gsub("\\n",". ",rap2$Texte)
rap2$Texte[250]
## [1] "J'prends mon sac, maman pleure encore. Elle file à l'hôpital, j'suis d'jà pressé qu'elle en sorte. Je zone dans la ville, j'me jure de péter le score. Je squatte chez une fille qui, maintenant, est dans la drogue. Je ride pendant qu'maman soigne sa tête (putain). J'suis triste mais, quand j'sors, j'suis le roi d'la fête. A2 éclate les dép', \"A2, elle claque, la tape\". Moi, j'm'en fous, j'veux juste que ma mère soit fière. J'ai quitté l'école, et j'ai quitté les blocs. J'fais ma route comme un grand, voilà, j'ai traversé l'époque. Avec les potes, une pensée pour la mama. Putain, qu'elle est forte, elle a quitté le "
# Regular expressions
# When we have a consonant, then an apostrophe, then a consonant: turn it into consonant + "e" + space

rap2$Texte <- gsub("([b-df-hj-np-tv-z]|[B-DF-HJ-NP-TV-Z])'([b-df-hj-np-tv-z]|[B-DF-HJ-NP-TV-Z])", "\\1e \\2", rap2$Texte)
rap2$Texte <- gsub("(qu)'([b-df-hj-np-tv-z])", "\\1e \\2", rap2$Texte) # forms like qu'tu or qu'je are changed too

# Or, more generally, remove all apostrophes by adding an "e" (not very precise, but right in most cases)
rap2$Texte <- str_replace_all(rap2$Texte,"(\\w)'(\\w)", "\\1e \\2") # \\w : any word character

# Another problem: some line breaks were not recognized, so words ended up glued together
# When we have a lowercase letter followed by an uppercase one, insert a period and a space
rap2$Texte <- str_replace_all(rap2$Texte, "([:lower:])([:upper:])", "\\1\\. \\2")


# With the grepl function, we can check whether some issues remain
# For instance, did we remove all the apostrophes?
table(grepl("'", rap2$Texte)) # We still have a lot of them
## 
## FALSE  TRUE 
##  1079   346
rap2$Texte <- gsub("'", " ", rap2$Texte) # We remove them

# But we also have another (typographic) type of apostrophe
table(grepl("’", rap2$Texte)) 
## 
## FALSE  TRUE 
##  1107   318
rap2$Texte <- gsub("’", " ", rap2$Texte) # We removed it
table(grepl("’", rap2$Texte)) # It's ok
## 
## FALSE 
##  1425

There is still a lot to do, but for now, let’s say it is OK!

Analysing the corpus

Now, we can load the “clean” corpus !

# rap2 <- read.csv2("https://raw.githubusercontent.com/satRdays/paris2019/master/corpusrap_clean.csv")
# For me: keeping the version built above

Importing and analysing a corpus with the tm package

# Creation of the "corpus" object, from the vector with the texts :
docs <- VCorpus(VectorSource(rap2$Texte))
print(docs)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1425

But we want to keep our supplementary variables :

# Specific format: DataframeSource expects a doc_id column and a text column first
rap3 <- rap2
rap3$doc_id <- seq_len(nrow(rap3)) # a simple sequence of ids (no loop needed)
rap3 <- rap3[, c("doc_id", "Texte", "Artiste", "id")]
colnames(rap3) <- c("doc_id", "text", "artiste", "Morceau_id")

# We keep the metadata by building the corpus from the dataframe
docs <- Corpus(DataframeSource(rap3)) 
meta(docs[1]) # We have the name of the artist and the id of the song

We can make subsets, for instance to keep only the texts of one artist:

idx <- meta(docs, "artiste") == 'sofiane'
docs_f <- docs[idx] # Here, we only keep the texts of one specific artist: 57 song parts

Now, we have to build a Document-Term Matrix. For that purpose, we first transform our corpus:

# with the tm_map function to transform the corpus
docs <- tm_map(docs, content_transformer(tolower)) # lower case
docs <- tm_map(docs, content_transformer(removeNumbers)) 
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords("fr")) # remove stopwords
docs <- tm_map(docs, content_transformer(stripWhitespace)) 
docs <- tm_map(docs, stemDocument) # stemming (it would be better to lemmatize)
docs <- tm_map(docs, removeWords, c("fair", "fait", "fai")) # remove specific words

# The most important object is the Document Term Matrix
dtm <- DocumentTermMatrix(docs)
dim(dtm)
## [1]  1425 14007

To get better results, we should not use stemming but lemmatization: lemmatizing a corpus is a powerful technique because it does not merely “cut” words, it keeps morpho-syntactic information about them. For instance, TreeTagger can be used to do so.
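
As a rough illustration of the idea (not the TreeTagger workflow itself), here is a minimal dictionary-based sketch; the tiny lemmes lookup table is hypothetical and stands in for the output of a real lemmatizer:

# Hypothetical lookup table; a real one would come from TreeTagger or a lexicon
lemmes <- data.frame(forme = c("connu", "galères", "gamins"),
                     lemme = c("connaître", "galère", "gamin"),
                     stringsAsFactors = FALSE)

lemmatise <- function(x) {
  mots <- unlist(strsplit(x, "\\s+"))                 # split the text into words
  idx  <- match(mots, lemmes$forme)                   # look up each form
  mots[!is.na(idx)] <- lemmes$lemme[idx[!is.na(idx)]] # replace known forms by their lemma
  paste(mots, collapse = " ")
}

# rap2$Texte_lem <- vapply(rap2$Texte, lemmatise, character(1))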

Basic statistics with this object

# frequencies
freq <- sort(colSums(as.matrix(dtm)),decreasing=TRUE)
freq[1:10] # the 10 most frequent words
##  comm    ça  tout quand  plus  tous  trop   vie  veux  faut 
##  1094   964   902   680   673   440   358   353   333   290
# wordcloud
set.seed(700)
wordcloud(names(freq), freq, min.freq=30, max.words = 150, colors=brewer.pal(8, "Dark2"))

# barplot
barplot(freq[1:25], xlab = "term", ylab = "frequency",  col=heat.colors(50))

# We can decide to keep only the most frequent terms
dtm2 <- removeSparseTerms(dtm, 0.97) # Keep the words used in at least 3% of documents
dim(dtm2) # It keeps only 241 words
## [1] 1425  241
# Another way to compute frequencies
freq = colSums(as.matrix(dtm2))
ord = order(-freq) # order by decreasing frequency
freq[(ord)]
##      comm        ça      tout     quand      plus      tous      trop 
##      1094       964       902       680       673       440       358 
##       vie      veux      faut      rien      bien        où       vai 
##       353       333       290       285       280       273       263 
##       dit      ouai   toujour        là       rap     jamai       sai 
##       245       245       239       238       229       218       217 
##      gros      parl      coup      sous      être      vien       voi 
##       213       212       210       209       205       204       202 
##      mond      temp       aim     petit       car       mal      mère 
##       200       196       196       191       186       178       177 
##     prend       met      homm      tête      cœur       mec      sale 
##       177       165       164       164       164       163       163 
##      seul      pute      just      veut      deux      nuit      donc 
##       161       160       159       157       150       148       148 
##      chez     avant       oui     plein       sor       bon     grand 
##       148       148       147       142       142       141       140 
##       peu     depui      pass    putain      autr      voir     rêves 
##       137       136       136       136       135       134       134 
##      jour      peux      quoi    demand      rest       gar      heur 
##       133       132       131       131       130       129       129 
##      vrai      bais       foi    besoin      dire      mort     arriv 
##       128       127       125       123       118       117       116 
##     laiss      merd    devant       dis     amour      alor      soir 
##       116       115       115       113       112       111       111 
##      ceux      parc      entr  pourquoi     encor     jusqu     chaqu 
##       111       109       109       108       108       107       107 
##   personn     rentr       fil      loin    regard     frère       non 
##       107       107       106       104       104       102       102 
##      pens      rime      font       pui       gen       sen       rue 
##       102       102       102       101        98        98        97 
##      yeah    attend      femm      pari      fume      part      vill 
##        97        96        96        96        95        94        92 
##      band      main    connai     aller      yeux      noir      aprè 
##        92        91        91        91        90        89        89 
##      vite    frères       air      sang      sort     avoir   veulent 
##        88        87        86        86        86        86        85 
##      niqu      tire    parler     gross    argent   comment      port 
##        85        84        84        84        84        84        84 
##    quelqu    couill      flow     franc  meilleur      bell       vas 
##        84        84        83        81        81        80        80 
##    appell      tour       mis      poto      dieu      fond      pein 
##        79        79        79        79        78        78        78 
##      fill    prendr     place       bas      très      face      croi 
##        78        78        77        76        76        74        74 
##      pris     chose     équip      donn      pote      estc     compt 
##        74        74        74        72        72        72        72 
##     guerr      roul      fuck   histoir      terr      tant     gueul 
##        71        71        71        70        70        70        69 
##   pendant      peut      déjà      fous     trist       fou     mettr 
##        69        69        68        67        67        67        67 
##       pay       tit tellement     oubli      corp      tape     perdu 
##        66        66        66        66        65        64        64 
##       bat      rapp      game    entend      envi      haut  quartier 
##        64        64        63        63        62        62        62 
##      peur      père      jeun      bonn       dès      vide     salop 
##        62        61        61        61        60        60        60 
##       cul     chatt     écout   mainten      bras      meuf     bouch 
##        59        59        58        58        57        57        56 
##     passé      joue      sait      anné    ennemi      fini      ball 
##        56        56        56        56        56        56        55 
##      pire      shit     drogu     mieux    famill    cherch    passer 
##        55        55        55        55        54        54        53 
##     aucun   rappeur   premier     touch      cash     arrêt     chaud 
##        53        53        53        51        51        51        51 
##     droit      forc   derrièr      fort     march      beau     finir 
##        51        50        50        49        49        49        49 
##      bite  pourtant       ami   inquièt     traîn  beaucoup      larm 
##        49        48        48        47        47        46        45 
##     avanc     croir      sent 
##        44        44        43
# Measure associations with a specific word
# ("bag of words" logic)
findAssocs(dtm2, "femm", corlimit = 0.1)
## $femm
##  bell amour  homm   vie  face  fume 
##  0.18  0.12  0.12  0.11  0.11  0.10
findAssocs(dtm2, "homm", corlimit = 0.1)
## $homm
##    plus    entr   perdu    face     sen    être      où    pein premier 
##    0.21    0.20    0.19    0.19    0.18    0.15    0.14    0.14    0.13 
##    femm    mort    fort    vrai    loin   avant    rest derrièr  regard 
##    0.12    0.12    0.12    0.12    0.12    0.11    0.11    0.11    0.11 
##     aim 
##    0.10
findAssocs(dtm2, "amour", corlimit = 0.1)
## $amour
## regard   déjà    sen   loin   femm   père  quand  encor famill 
##   0.21   0.15   0.15   0.14   0.12   0.11   0.11   0.11   0.11
findAssocs(dtm, "papa", corlimit = 0.4)
## $papa
##     tambola      africa       créer déséquilibr        mama       sursi 
##        0.73        0.52        0.52        0.52        0.52        0.52 
##       violé      meurtr     endurci      oublié     lentill        rohf 
##        0.52        0.49        0.42        0.42        0.42        0.42 
##       laiss 
##        0.40

We can go further with this package, but let’s try another one!

Using quanteda

# Create an object of class "corpus"
tt <- corpus(rap2, text_field = "Texte")
summary(tt)
# Create a corpus with only the texts of one artist
ttt <- corpus_subset(tt, Artiste == "alkpote") # Sub-corpus based on the metadata

Before building a DTM, we can look at the corpus.

kwic function

Keywords in context: very helpful for text mining!

# In which contexts does the term "couilles" appear?
kwic(tt, "couilles", window = 6)
nrow(kwic(tt, "couilles", window = 6)) # 74 occurrences of the term
## [1] 74
# No need to always stay in a "bag of words" approach:
table(kwic(tt, "couilles", window = 1)[4])
## 
##       ,       à      de     des    Deux      en Grosses     les     Les 
##       1       1       1       3       1       5       1      51       2 
##     mes     tes 
##       7       1
prop.table(table(kwic(tt, "couilles", window = 1)[4]))
## 
##          ,          à         de        des       Deux         en 
## 0.01351351 0.01351351 0.01351351 0.04054054 0.01351351 0.06756757 
##    Grosses        les        Les        mes        tes 
## 0.01351351 0.68918919 0.02702703 0.09459459 0.01351351
# The term is very often (about 70%) preceded by the term "les"

table(kwic(tt, pattern = phrase("les couilles"), window = 1)[4]) # battre or casser?
## 
##       ,       .       a      ai    avec     bas     bat     Bat    bats 
##       1       2       2       2       1       1      10       3       9 
##    Bats   coupé coupent    dans    donc  gratte  jamais  MDBats     ont 
##       3       1       2       2       1       2       1       1       1 
##     pas    vide   vider    vrai 
##       3       2       2       1
aa <- kwic(tt, pattern = phrase("les couilles"), window = 6)
View(aa) 

We can also look for recurring expressions (collocations) in the corpus, and use them to redefine the tokens of our corpus:

col <- textstat_collocations(tt, min_count = 20, size = 4)
col
# We could use this afterwards to redefine our tokens (words) in a corpus
# rap_tok <- tokens_compound(tokens(tt), pattern = col)

# Do we have some very long repeated segments? Don't run!
# textstat_collocations(tt, min_count = 2, size = 30)

But the kwic function is also useful for “keyness” analysis.

Keyness analysis

Figure: wordclouds

Question: which words are used more around the term “rap”?

If the distribution of a word were independent of the sub-corpus, we would expect:

Occurrences in corpus 1 / Length of corpus 1 = Occurrences in corpus 2 / Length of corpus 2

Keyness is a chi-squared-type measure of the departure from this expectation.
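
A toy version of this comparison in base R, with made-up counts (not taken from the corpus):

# Say a word appears 50 times in sub-corpus 1 (10 000 tokens)
# and 20 times in sub-corpus 2 (15 000 tokens)
m <- matrix(c(50, 10000 - 50,
              20, 15000 - 20),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("corpus1", "corpus2"), c("word", "other")))
chisq.test(m) # tests the "same rate in both corpora" hypothesis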

To do so, we need to build a Document-Feature Matrix. First thing to do: tokenize the corpus (not mandatory, but preferable).

tk <- quanteda::tokens(tt, what = 'word', remove_numbers = T,
             remove_punct = T,
             remove_symbols = T) # first tokenize the corpus

Then, we build the matrix :

stop <- stopwords("fr", source = "stopwords-iso") # Here, we use a wider list of stopwords

dfm <- dfm(tk, remove = stop, tolower = TRUE, remove_punct = TRUE)
dfm <- dfm_wordstem(dfm, language = "french")
dfm <- dfm(dfm, remove = c("a", "fair", "fais", "fait", "plus", "tout", "quand", "ça", "ye")) # remove specific words
dfm <- dfm_trim(dfm, min_termfreq = 20) # Keep only the terms that appear at least 20 times

We can run basic statistics on the corpus :

nfeat(dfm) # We kept 700 words
## [1] 701
# The 40 most frequent words
topfeatures(dfm, n=40)
##    vi  veux  faut   aim ouais  coup   rap   rêv  sais   jam  gros   fin 
##   353   333   293   274   250   235   229   222   221   219   215   208 
##   mer  pass  vien  mond  vois   sal  temp  frer   mal   pet   têt   met 
##   207   207   207   200   200   200   196   190   190   189   185   181 
##  bais prend   put grand   mec  homm  cœur  part   fum  veut   con   dis 
##   177   177   176   172   164   164   164   163   161   157   154   154 
## laiss   gar   oui  nuit 
##   150   150   150   148
# Useful for a wordcloud
textplot_wordcloud(dfm, max.words = 50, scale = c(3, .5), colors=brewer.pal(6, "Dark2"))

Now, we can try to answer the question: how are the texts of each artist specific?

# We need to group documents by artist
dfm_a <- dfm(dfm, groups = "Artiste") # Group the texts of each rapper
docvars(dfm_a) # 20 grouped documents
# Measure of lexical diversity for each artist (number of different forms / total number of forms)
textstat_lexdiv(dfm_a)
# Measure of keyness: which words are overused by a given artist? (chi-squared logic)
a<-textstat_keyness(dfm_a, target = "t.i.s") 
textplot_keyness(a)

a<-textstat_keyness(dfm_a, target = "jul")
textplot_keyness(a)

a<-textstat_keyness(dfm_a, target = "nekfeu")
textplot_keyness(a)

a<-textstat_keyness(dfm_a, target = "alkpote")
textplot_keyness(a)

Clustering method

In a more systematic way, we can apply clustering methods to group texts based on their lexical similarities.

tstat_dist <- textstat_dist(dfm_a)
clust <- hclust(tstat_dist, method = "ward.D2")
plot(clust, xlab = "Distance", ylab = NULL)
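
To actually get group memberships from the dendrogram, one possible follow-up (the number of clusters is chosen arbitrarily here):

groupes <- cutree(clust, k = 4) # cut the tree into 4 groups of artists
table(groupes)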

Topic modeling

A very fashionable technique, like sentiment analysis.

dtm_l <- convert(dfm, to = "topicmodels")
lda <- LDA(dtm_l, k = 6)
terms(lda, 6)
##      Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "gros"  "coup"  "ouais" "rêv"   "sal"   "vi"   
## [2,] "temp"  "veut"  "mer"   "vien"  "frer"  "aim"  
## [3,] "veux"  "niqu"  "met"   "cœur"  "laiss" "mond" 
## [4,] "put"   "sort"  "veux"  "rentr" "sor"   "vid"  
## [5,] "rap"   "sais"  "faut"  "jam"   "pe"    "heur" 
## [6,] "fum"   "mort"  "âme"   "vi"    "faut"  "vit"
# # How many topics should we keep? (Don't run)
# library(ldatuning)
# result <- FindTopicsNumber(
#   dtm,
#   topics = seq(from = 2, to = 15, by = 1),
#   metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
#   method = "Gibbs",
#   control = list(seed = 77),
#   mc.cores = 2L,
#   verbose = TRUE
# )
# 
# FindTopicsNumber_plot(result)
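
A possible follow-up once k is settled: look at the topic mixture of each document and its most likely topic, using the lda object fitted above:

post <- posterior(lda)
round(head(post$topics), 2) # document-topic probabilities (each row sums to 1)
head(topics(lda))           # most likely topic for each document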

Other tools

Alcest Method

Known through the software “Alceste”, created by Max Reinert, who developed the technique in the 1980s.

Principles :

  • Split the texts into segments of 30-40 words
  • Compute distances between the segments in the Document Term Matrix
  • Measure co-occurrences between words within the same segment
  • Use clustering to group segments based on their proximity
  • Display the over-represented words (chi²) in each cluster to establish “lexical worlds”
  • Link those lexical worlds with supplementary variables (on the speakers, for instance)

An example


To make better use of this method, a new package is being developed by Julien Barnier: rainette. It is still in beta testing, but I’m sure it is going to be awesome in a few weeks!

Here, I also tried the package on another corpus, from the French press:

# On the rap corpus
# First, we need to split the texts into segments
corpus <- split_segments(tt, segment_size = 40)
tk <- quanteda::tokens(corpus, what = 'word', remove_numbers = T,
                       remove_punct = T,
                       remove_symbols = T) # first tokenize the corpus

dtm <- dfm(tk, remove = stop, tolower = TRUE, remove_punct = TRUE)
dtm <- dfm_wordstem(dtm, language = "french")
dtm <- dfm(dtm, remove = c("a", "fair", "fais", "fait", "plus", "tout", "quand", "ça", "ye")) # remove specific words
dtm <- dfm_trim(dtm, min_termfreq = 10)
res <- rainette(dtm, k = 5, min_uc_size = 5, min_members = 10)
##   Computing ucs from segments...
rainette_plot(res, dtm, k = 5, type = "bar", n_terms = 20, free_scales = FALSE,
    measure = "chi2", show_negative = "TRUE", text_size = 10)

# On another corpus
beauf <- read.csv2("C:/Users/coren/Dropbox/Sociologie/Thèse/Quantithèse/Bdd_thèse/quisuisje.csv", stringsAsFactors = F)
bobo <- corpus(beauf$Texte[1:1000])
corpus <- split_segments(bobo, segment_size = 40)
tk <- quanteda::tokens(corpus, what = 'word', remove_numbers = T,
                       remove_punct = T,
                       remove_symbols = T) # first tokenize the corpus

dtm <- dfm(tk, remove = stop, tolower = TRUE, remove_punct = TRUE)
dtm <- dfm_wordstem(dtm, language = "french")
dtm <- dfm(dtm, remove = c("a", "fair", "fais", "fait", "plus", "tout", "quand", "ça", "ye")) # remove specific words
dtm <- dfm_trim(dtm, min_termfreq = 20)
res <- rainette(dtm, k = 8, min_uc_size = 5, min_members = 10)
##   Computing ucs from segments...
rainette_plot(res, dtm, k = 8, type = "bar", n_terms = 20, free_scales = FALSE,
    measure = "chi2", show_negative = "TRUE", text_size = 10)