The main goal of text analysis is to find regular formal structures in texts.
These techniques are powerful because we can link what is said with who said it, for instance through the sociodemographic variables of the speakers.
For instance, there are differences in the vocabulary people use to answer open-ended questions such as "What is happiness?", "What is work?", or "Why not have children?"
library(stringr)     # string manipulation
library(stringi)     # string manipulation (ICU-based)
library(dplyr)       # data wrangling
library(questionr)   # survey data helpers
library(tm)          # text mining framework
library(stopwords)   # stopword lists
library(quanteda)    # quantitative text analysis
library(wordcloud)   # word clouds
library(topicmodels) # LDA topic models
library(rainette)    # Reinert textual clustering
library(FactoMineR)  # correspondence analysis
library(explor)      # interactive exploration of multivariate results
There are plenty of packages that can be used for text mining:
They often overlap and reuse functions from one another (wordcloud, for instance).
We will focus on "tm" and "quanteda", but tidytext is very useful for those who like the tidyverse :D (tidytext does not, however, implement some functions I find useful)
In my research, I often use R to collect and shape my corpus, and also for basic statistics. But I often use dedicated software (Iramuteq, for instance) to do the text mining itself:
The particularity of this software is that it uses R to do the computations.
Another good way to do text analysis without R knowledge is the "R.temis" package (Bouchet-Valat, Bastin), which is built on R Commander.
But here, we will work directly in R!
A good way to do text analysis is to build your own objects to test hypotheses about your data. For instance, here I run a correspondence analysis (CA, "AFC" in French) on a table of frequencies crossing a list of categories that I study with the type of preceding word.
AFC <- read.csv("C:/Users/coren/Dropbox/Sociologie/SatRdays/Analyse textuelle/AFC_categories.csv", sep=";",
                header=T, row.names=1)
resul <- CA(AFC) # correspondence analysis with FactoMineR
(Figure: correspondence analysis of the categories)
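The CSV above is local to my machine; as a reproducible illustration, here is the same kind of CA on a small table of made-up counts (rows = studied categories, columns = types of preceding word; all numbers and labels are hypothetical):
toy <- matrix(c(20, 5, 3,
                4, 18, 6,
                2, 7, 25),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("cat_A", "cat_B", "cat_C"),
                              c("det", "prep", "verb")))
res_toy <- CA(as.data.frame(toy), graph = FALSE)
res_toy$eig # variance explained by each dimension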
The texts come from web scraping of the Genius site: all the lyrics of French rap songs between 2010 and 2017, from pages like this one: https://genius.com/Rap-francais-discographie-2017-lyrics
At first, the question was to study featurings between artists: how is the rap world structured? We measure collaborations through featurings.
(Figure: network of featurings between rappers)
Now that this structure is visible, I want to cross the analysis with textual variables. The question is: do artists who collaborate perform similar rap?
So we load the data. Here, we keep only the songs that contain featurings:
FR <- read.csv2("C:/Users/coren/Dropbox/Sociologie/Sociologie du rap/Bddfeat/GENIUSFEAT2.csv", stringsAsFactors = F)
Here, the texts are not "clean": the column "Texte" is the only interesting one, but it has to be cleaned.
t<-FR$Texte
print(t[100])
## [1] "\n \n \n \n\n[Intro : Rim'K] x2\nOn a connu la galère, les métaux précieux\nTout revient à la mama, la prunelle de mes yeux\n[Refrain : Lartiste] x2\nOn a connu les galères, on a connu les dramas\nL'époque des bandanas quand on était gamins\nSi j'veux réussir, c'est pour la mama\nJ'pense à elle quand tout va mal\n[Couplet 1 : Rim'K]\nHé gringo, qué pasa ?\nJ'suis avec la fafa, la fafa c'est quoi ?\nLa fafa, c'est la famille\nNous on vient des quartiers sensibles\nGros cigare de cuba, p'tite danse du Brésil\nAu mariage, au décès, au baptême, les mêmes qui seront là\nLes cousins nostra\nTu te places sur mon lit pour mes liasses\nÀ la Mahrez, j'leur ai fait le coup du foulard\nFennec dans la brume, foyer dans la fume\nOn en a brûlé des thunes\nJ'ai construit ma vie sur un champ de ruines\nJ'rentrais après la lune\nSi nous fâchés, ça fait rom pom pom\nMoi et ma team on est plein, plein, plein\nTout c'qui faut dans ma planque, planque, planque\nDes big Sam, Sam, Sam\n[Refrain : Lartiste] x2\nOn a connu les galères, on a connu les dramas\nL'époque des bandanas quand on était gamins\nSi j'veux réussir, c'est pour la mama\nJ'pense à elle quand tout va mal\n[Couplet 2 : Lartiste]Le V12 dans l'Veneno\nPas d'marche arrière, pas de créneau\nJ'fais mon malin comme Serge Beynaud\nCe soir encore, on fout le fuego\nJ'ai enfumé la pièce, j'arrive canabissé\nT'inquiète pas p'tit frère, le dîner est bien visser\nJ'traîne qu'avec des hommes, pas de calamités\nPas d'mythomanes qui tté-gra l'amitié\nAllez vamos, monte vite dans le gamos\nY'a trop de Piqué, y'a trop de Ramos\nY'a trop d'alcool, y'a trop de cassos\nBang bang, ça tire dans l'club\nEncore un narvalo qui fait le thug\nLe patron va appeler les keufs\nCes bâtards vont faire fuir les meufs\n[Refrain : Lartiste] x2\nOn a connu les galères, on a connu les dramas\nL'époque des bandanas quand on était gamins\nSi j'veux réussir, c'est pour la mama\nJ'pense à elle quand tout va mal\n[Outro]\nEt je pense à elle, c'est elle qui m'a porté\nLa seule qui m'a soutenu quand je prenais des coups bas\nOui je pense à elle, c'est elle qui m'a porté\nLa seule qui m'a soutenu quand je prenais des coups bas\nDes galères, des dramas\nDes galères, des dramas\nDes dramas, des dramas\nDes dramas, des galères\nDes dramas, si y'avait pas la mama, je n'en serais pas là\n\n\n \n \n "
The structure to decode has this form: "Texte [Rappeur 1 :] Couplet 1 [Rappeur 2 :] Couplet 2", etc.
We are going to need regular expressions to parse it!
Regular expressions manipulate character strings in R (and in other languages too), and are useful for many things beyond text analysis! Metacharacters let us match whole classes of characters: for instance, "." means "any character". To neutralize a metacharacter, we put a double backslash in front of it ("\\." is a literal dot).
The main regex metacharacters: "." (any character), "^" (start of string), "$" (end of string), "[abc]" (one of a, b or c), "[^abc]" (anything but a, b or c), "|" (or).
Some quantifiers: "*" (zero or more times), "+" (one or more), "?" (zero or one), "{n,m}" (between n and m times).
Also: "\\w" (word character), "\\d" (digit), "\\s" (whitespace). A few of these are demonstrated below.
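To make this concrete, a few toy calls (the strings are made up for illustration):
grepl("^a", c("abc", "bac")) # starts with "a" -> TRUE FALSE
gsub("o+", "o", "loooong") # "+" = one or more -> "long"
grepl("\\.", c("3.14", "314")) # escaped dot matches a literal dot -> TRUE FALSE
str_extract("abc123", "[0-9]{3}") # exactly three digits -> "123"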
We are going to use base R functions (gsub(), grepl(), ...), but also functions from the stringi and stringr packages.
First, we remove everything before the first "[" (we don't know to whom this text should be attributed):
t <- gsub("^[^\\[]*\\[", "",t)
# Littéralement : à partir du début du texte, tout ce qui n'est pas un crochet ouvrant, autant de fois que possible, jusqu'à trouver un crochet ouvrant
t <- gsub("^", "\\[",t) # On rajoute la première balise
# Note : on aurait pu tout faire d'un coup, comme ceci : gsub("^([^\\[]*)(\\[)","\\2",t)
Now we extract into one list what is inside the brackets (the artist names) and into another what is between the brackets (the texts):
test2 <- stri_match_all_regex(t, "\\[[^\\]]*\\]") # literally: an opening bracket, then anything but a closing bracket, up to a closing bracket -> [blabla]
# the "all" variant retrieves every occurrence of this pattern
test <- str_split(t, "\\[[^\\]]*\\]") # splits the texts each time this pattern is met
Here we used the str_split() function, which cuts strings wherever it matches an expression.
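A toy illustration of the two steps on a made-up string; note the empty leading element produced by the split:
s <- "[A] hello [B] world"
stri_match_all_regex(s, "\\[[^\\]]*\\]") # tags: "[A]", "[B]"
str_split(s, "\\[[^\\]]*\\]") # texts: "", " hello ", " world"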
We want one column for the artist name and one for the text. There are surely plenty of faster ways than using a for loop (see the sketch after the loop):
test3 <- data.frame()
test4 <- data.frame()
for (i in (1:length(test))){ # for each song
for (j in (1:length(test2[[i]]))) { # for each tag in the song
test4[1,1] <- test2[[i]][j]
test4[1,2] <- test[[i]][j+1] # j+1 because of the empty leading element in test
test4[1,3] <- i # we keep an id for the song
test3 <- bind_rows(test3, test4)
}
}
colnames(test3) <- c("Artiste", "Texte", "id")
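As noted above, here is one possible vectorized sketch that avoids the loop (my alternative, assuming the same test and test2 objects built above):
test3_alt <- bind_rows(Map(function(tags, txts, id) {
  data.frame(Artiste = as.character(tags),
             Texte = txts[-1], # drop the empty leading element
             id = id,
             stringsAsFactors = FALSE)
}, test2, test, seq_along(test)))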
We can show some information about this data frame:
max(table(test3$id)) # One song with 40 different texts
## [1] 40
min(table(test3$id)) # Song with only one text : not a featuring
## [1] 1
table(table(test3$id)==1) # 98 songs in this case (out of 2640, i.e. 4%)
##
## FALSE TRUE
## 2542 98
mean(table(test3$id)) # mean of 6.8 texts per song
## [1] 6.820833
Around 18,000 different texts, but many of them are either too short or missing the artist information.
We remove those texts:
rap <- subset(test3, nchar(test3$Texte)> 200 & nchar(test3$Artiste) > 5)
About 6,000 texts are dropped.
Now that we have structured our corpus, we need to clean it.
To do so, we use a function:
clean_rappeur <- function(x) {
  x <- tolower(x)
  x <- gsub("\\[", "", x) # remove the brackets
  x <- gsub("\\]", "", x)
  x <- gsub(".* :", "", x) # remove everything before the last " :"
  x <- gsub("couplet[ 0-9]*", "", x) # remove structure markers
  x <- gsub("refrain", "", x)
  x <- gsub("intro", "", x)
  x <- gsub("outro", "", x)
  x <- gsub("pont", "", x)
  x <- gsub(".*transcri.*", "", x) # drop transcription credits
  x <- gsub(":", "", x) # remove leftover punctuation
  x <- gsub("\\?", "", x)
  x <- gsub("-", "", x)
  x <- gsub("–", "", x)
  x <- gsub("—", "", x)
  x <- gsub(",", "", x)
  x <- gsub("x[0-9]", "", x) # remove repetition markers (x2, etc.)
  x <- gsub("\\*[0-9]", "", x)
  x <- gsub("\\(.*\\)", "", x) # remove parenthesised asides
  x <- gsub("^ *", "", x) # trim leading and trailing spaces
  x <- gsub(" *$", "", x)
  x
}
rap[,1] <- clean_rappeur(rap[,1])
rap <- subset(rap, nchar(rap$Texte)> 200 & nchar(rap$Artiste) > 2)
1,200 fewer texts.
We also need to remove duplicated lines (repeated choruses, for instance):
rap <- unique(rap)
800 fewer.
We can look at the distribution by artist. For our purposes, we only keep the artists with the most texts:
rap$Artiste <- as.factor(rap$Artiste)
table(table(rap$Artiste)) # very unequal distribution
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1255 517 212 99 61 58 35 31 13 16 22 15 13 21 6
## 16 17 18 19 20 21 22 24 25 26 27 28 29 30 31
## 12 9 11 6 6 5 4 4 3 4 4 3 5 4 7
## 32 33 35 36 37 38 39 40 41 42 44 45 46 47 48
## 2 1 1 1 1 1 2 1 1 3 1 3 1 2 1
## 49 51 52 53 57 60 62 64 67 69 75 80 81 85 100
## 1 2 3 1 1 1 1 2 1 1 1 1 1 2 1
## 102 114
## 1 1
sort(table(rap$Artiste), decreasing=T)[1:50]
##
## alkpote alonzo nekfeu jul soprano
## 114 102 100 85 85
## la fouine booba t.i.s maître gims seth gueko
## 81 80 75 69 67
## niro sadek swift guad a2h sofiane
## 64 64 62 60 57
## deen burbigo black m dadju sneazzy furax
## 53 52 52 52 51
## jok'air lucio bukowski taïro rohff youssoupha
## 51 49 48 47 47
## s.pri noir alpha wann canardo lacrim rim'k
## 46 45 45 45 44
## jeff le nerf kaaris scylla lartiste niska
## 42 42 42 41 40
## kery james lefa hayce lemsi green money guizmo
## 39 39 38 37 36
## despo rutti lino barack adama nemir demi portion
## 35 33 32 32 31
## hamza mac tyer rockin' squat sinik tito prince
## 31 31 31 31 31
sum(sort(table(rap[,1]), decreasing = T)[1:10])
## [1] 858
tab<-sort(table(rap[,1]), decreasing = T)[1:20]
names(tab)
## [1] "alkpote" "alonzo" "nekfeu" "jul"
## [5] "soprano" "la fouine" "booba" "t.i.s"
## [9] "maître gims" "seth gueko" "niro" "sadek"
## [13] "swift guad" "a2h" "sofiane" "deen burbigo"
## [17] "black m" "dadju" "sneazzy" "furax"
# We may want to keep only the texts of the rappers who have the most texts
rap2 <- subset(rap, rap$Artiste %in% names(tab))
rap2$Artiste <- as.factor(as.character(rap2$Artiste)) # it would be more efficient to use droplevels()
table(rap2$Artiste)
##
## a2h alkpote alonzo black m booba
## 60 114 102 52 80
## dadju deen burbigo furax jul la fouine
## 52 53 51 85 81
## maître gims nekfeu niro sadek seth gueko
## 69 100 64 64 67
## sneazzy sofiane soprano swift guad t.i.s
## 52 57 85 62 75
1,500 texts for 20 artists.
Are there differences in text length between artists?
summary(tapply(nchar(rap2$Texte), rap2$Artiste, mean, na.rm=F))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 443.7 522.5 598.7 623.1 714.7 811.7
The distribution is quite homogeneous.
We do the same kind of thing for the texts:
### remove the \n (line breaks)
rap2$Texte <- gsub("\\\n",". ",rap2$Texte) # we add a period at each line break, which gives an idea of the sentences
rap2$Texte <- gsub("\\n",". ",rap2$Texte)
rap2$Texte[250]
## [1] "J'prends mon sac, maman pleure encore. Elle file à l'hôpital, j'suis d'jà pressé qu'elle en sorte. Je zone dans la ville, j'me jure de péter le score. Je squatte chez une fille qui, maintenant, est dans la drogue. Je ride pendant qu'maman soigne sa tête (putain). J'suis triste mais, quand j'sors, j'suis le roi d'la fête. A2 éclate les dép', \"A2, elle claque, la tape\". Moi, j'm'en fous, j'veux juste que ma mère soit fière. J'ai quitté l'école, et j'ai quitté les blocs. J'fais ma route comme un grand, voilà, j'ai traversé l'époque. Avec les potes, une pensée pour la mama. Putain, qu'elle est forte, elle a quitté le "
# Regular expressions
# When we have a consonant, then an apostrophe, then a consonant: turn it into consonant + "e" + space
rap2$Texte <- gsub("([b-df-hj-np-tv-z]|[B-DF-HJ-NP-TV-Z])'([b-df-hj-np-tv-z]|[B-DF-HJ-NP-TV-Z])", "\\1e \\2", rap2$Texte)
rap2$Texte <- gsub("(qu)'([b-df-hj-np-tv-z])", "\\1e \\2", rap2$Texte) # forms like qu'tu or qu'je are also changed
# Or, more generally, remove all apostrophes by adding an "e" (not very precise, but right in most cases)
rap2$Texte <- str_replace_all(rap2$Texte,"(\\w)'(\\w)", "\\1e \\2") # \\w: any word character
# Another issue: line breaks were sometimes not recognized, so some words are glued together
# When we have a lower-case then an upper-case letter: insert a space
rap2$Texte <- str_replace_all(rap2$Texte, "([:lower:])([:upper:])", "\\1\\. \\2")
# With the grepl() function, we can check whether issues remain
# For instance, did we remove all the apostrophes?
table(grepl("'", rap2$Texte)) # we still have a lot of them
##
## FALSE TRUE
## 1079 346
rap2$Texte <- gsub("'", " ", rap2$Texte) # We removed it
# But we also have another (typographic) type of apostrophe
table(grepl("’", rap2$Texte))
##
## FALSE TRUE
## 1107 318
rap2$Texte <- gsub("’", " ", rap2$Texte) # We removed it
table(grepl("’", rap2$Texte)) # It's ok
##
## FALSE
## 1425
There is still a lot to do, but for now let's say it is OK!
Now we can load the "clean" corpus!
# rap2 <- read.csv2("https://raw.githubusercontent.com/satRdays/paris2019/master/corpusrap_clean.csv")
# For me: keeping the old one
# Creation of the "corpus" object, from the vector with the texts :
docs <- VCorpus(VectorSource(rap2$Texte))
print(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1425
But we want to keep our supplementary variables:
# Specific format expected by DataframeSource: first column "doc_id", second "text"
rap3 <- rap2
rap3$V4 <- seq_len(nrow(rap3)) # a document id (vectorized, no loop needed)
rap3 <- rap3[,c(4,2,1,3)]
colnames(rap3) <- c("doc_id", "text", "artiste", "Morceau_id")
# We keep the metadata by using a DataframeSource
docs <- Corpus(DataframeSource(rap3))
meta(docs[1]) # We have the name of the artist and the id for the song
We can make subsets, e.g. to keep only the texts of one artist:
idx <- meta(docs, "artiste") == 'sofiane'
docs_f <- docs[idx] # now we only have the 57 song parts of one specific artist
Now we have to build a Document-Term Matrix. For that purpose, we modify our corpus:
# with the tm_map function to transform the corpus
docs <- tm_map(docs, content_transformer(tolower)) # lower case
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords("fr")) # remove stopwords
docs <- tm_map(docs, content_transformer(stripWhitespace))
docs <- tm_map(docs, stemDocument) # stemming (it would be better to lemmatize)
docs <- tm_map(docs, removeWords, c("fair", "fait", "fai")) # remove specific words
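# We can check the effect of the cleaning on one document:
inspect(docs[100])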
# The most important object is the Document Term Matrix
dtm <- DocumentTermMatrix(docs)
dim(dtm)
## [1] 1425 14007
To get better results, we should not stem but lemmatize: lemmatization of a corpus is a powerful technique, because it does not merely "cut" words, it keeps morpho-syntactic information about them. You can use TreeTagger, for instance, to do so.
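As a sketch of what lemmatization looks like in R, one possible route is the udpipe package (an alternative to TreeTagger; the calls below are illustrative and not run here, since a model has to be downloaded first):
# library(udpipe)
# mod <- udpipe_download_model(language = "french") # downloads a model file once
# ud <- udpipe_load_model(mod$file_model)
# ann <- as.data.frame(udpipe_annotate(ud, x = rap2$Texte[1:5]))
# head(ann[, c("token", "lemma", "upos")]) # lemmas keep morpho-syntactic information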
# frequencies
freq <- sort(colSums(as.matrix(dtm)),decreasing=TRUE)
freq[1:10] # the 10 most frequent words
## comm ça tout quand plus tous trop vie veux faut
## 1094 964 902 680 673 440 358 353 333 290
# wordcloud
set.seed(700)
wordcloud(names(freq), freq, min.freq=30, max.words = 150, colors=brewer.pal(8, "Dark2"))
# barplot
barplot(freq[1:25], xlab = "term", ylab = "frequency", col=heat.colors(50))
# We can decide to keep only the most frequent terms
dtm2 <- removeSparseTerms(dtm, 0.97) # keep only the words used in at least 3% of documents
dim(dtm2) # only 241 words are kept
## [1] 1425 241
# Another way to compute frequencies
freq <- colSums(as.matrix(dtm2))
ord <- order(-freq) # order by decreasing frequency
freq[(ord)]
## comm ça tout quand plus tous trop
## 1094 964 902 680 673 440 358
## vie veux faut rien bien où vai
## 353 333 290 285 280 273 263
## dit ouai toujour là rap jamai sai
## 245 245 239 238 229 218 217
## gros parl coup sous être vien voi
## 213 212 210 209 205 204 202
## mond temp aim petit car mal mère
## 200 196 196 191 186 178 177
## prend met homm tête cœur mec sale
## 177 165 164 164 164 163 163
## seul pute just veut deux nuit donc
## 161 160 159 157 150 148 148
## chez avant oui plein sor bon grand
## 148 148 147 142 142 141 140
## peu depui pass putain autr voir rêves
## 137 136 136 136 135 134 134
## jour peux quoi demand rest gar heur
## 133 132 131 131 130 129 129
## vrai bais foi besoin dire mort arriv
## 128 127 125 123 118 117 116
## laiss merd devant dis amour alor soir
## 116 115 115 113 112 111 111
## ceux parc entr pourquoi encor jusqu chaqu
## 111 109 109 108 108 107 107
## personn rentr fil loin regard frère non
## 107 107 106 104 104 102 102
## pens rime font pui gen sen rue
## 102 102 102 101 98 98 97
## yeah attend femm pari fume part vill
## 97 96 96 96 95 94 92
## band main connai aller yeux noir aprè
## 92 91 91 91 90 89 89
## vite frères air sang sort avoir veulent
## 88 87 86 86 86 86 85
## niqu tire parler gross argent comment port
## 85 84 84 84 84 84 84
## quelqu couill flow franc meilleur bell vas
## 84 84 83 81 81 80 80
## appell tour mis poto dieu fond pein
## 79 79 79 79 78 78 78
## fill prendr place bas très face croi
## 78 78 77 76 76 74 74
## pris chose équip donn pote estc compt
## 74 74 74 72 72 72 72
## guerr roul fuck histoir terr tant gueul
## 71 71 71 70 70 70 69
## pendant peut déjà fous trist fou mettr
## 69 69 68 67 67 67 67
## pay tit tellement oubli corp tape perdu
## 66 66 66 66 65 64 64
## bat rapp game entend envi haut quartier
## 64 64 63 63 62 62 62
## peur père jeun bonn dès vide salop
## 62 61 61 61 60 60 60
## cul chatt écout mainten bras meuf bouch
## 59 59 58 58 57 57 56
## passé joue sait anné ennemi fini ball
## 56 56 56 56 56 56 55
## pire shit drogu mieux famill cherch passer
## 55 55 55 55 54 54 53
## aucun rappeur premier touch cash arrêt chaud
## 53 53 53 51 51 51 51
## droit forc derrièr fort march beau finir
## 51 50 50 49 49 49 49
## bite pourtant ami inquièt traîn beaucoup larm
## 49 48 48 47 47 46 45
## avanc croir sent
## 44 44 43
# Measure association with a specific word
# "Bag of word"
findAssocs(dtm2, "femm", corlimit = 0.1)
## $femm
## bell amour homm vie face fume
## 0.18 0.12 0.12 0.11 0.11 0.10
findAssocs(dtm2, "homm", corlimit = 0.1)
## $homm
## plus entr perdu face sen être où pein premier
## 0.21 0.20 0.19 0.19 0.18 0.15 0.14 0.14 0.13
## femm mort fort vrai loin avant rest derrièr regard
## 0.12 0.12 0.12 0.12 0.12 0.11 0.11 0.11 0.11
## aim
## 0.10
findAssocs(dtm2, "amour", corlimit = 0.1)
## $amour
## regard déjà sen loin femm père quand encor famill
## 0.21 0.15 0.15 0.14 0.12 0.11 0.11 0.11 0.11
findAssocs(dtm, "papa", corlimit = 0.4)
## $papa
## tambola africa créer déséquilibr mama sursi
## 0.73 0.52 0.52 0.52 0.52 0.52
## violé meurtr endurci oublié lentill rohf
## 0.52 0.49 0.42 0.42 0.42 0.42
## laiss
## 0.40
We can go further with this package, but let's try another one!
# Create an object of class "corpus"
tt <- corpus(rap2, text_field = "Texte")
summary(tt)
# Create a corpus with only texts from one artist
ttt <- corpus_subset(tt, Artiste == "alkpote") # sub-corpus based on metadata
Before building a DTM, we can look at the corpus.
Keywords in context (KWIC): helpful for text mining!
# In which contexts does the term "couilles" appear?
kwic(tt, "couilles", window = 6)
nrow(kwic(tt, "couilles", window = 6)) # 74 occurrences of the term
## [1] 74
# No need to always stay in a "bag of words" approach:
table(kwic(tt, "couilles", window = 1)[4])
##
## , à de des Deux en Grosses les Les
## 1 1 1 3 1 5 1 51 2
## mes tes
## 7 1
prop.table(table(kwic(tt, "couilles", window = 1)[4]))
##
## , à de des Deux en
## 0.01351351 0.01351351 0.01351351 0.04054054 0.01351351 0.06756757
## Grosses les Les mes tes
## 0.01351351 0.68918919 0.02702703 0.09459459 0.01351351
# The term is very often (70%) preceded by the term "les"
table(kwic(tt, pattern = phrase("les couilles"), window = 1)[4]) # battre or casser?
##
## , . a ai avec bas bat Bat bats
## 1 2 2 2 1 1 10 3 9
## Bats coupé coupent dans donc gratte jamais MDBats ont
## 3 1 2 2 1 2 1 1 1
## pas vide vider vrai
## 3 2 2 1
aa <- kwic(tt, pattern = phrase("les couilles"), window = 6)
View(aa)
We can also check whether there are specific multi-word expressions in the corpus, and use them to redefine the tokens of our corpus:
col <- textstat_collocations(tt, min_count = 20, size = 4)
col
# We could use it after to redefine our tokens (words) in a corpus
# rap_tok <- tokens_compound(token, pattern = col)
# Are there some very long repeated segments? Don't run!
# textstat_collocations(tt, min_count = 2, size = 30)
The kwic() function is also useful for "keyness" analysis.
Question: which words are used more around the term "rap"?
If the distribution of each word were independent of the corpus, we would have:
occurrences in corpus 1 / length of corpus 1 = occurrences in corpus 2 / length of corpus 2
Keyness is a chi-squared measure of the deviation from this expectation.
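As a toy illustration of this logic (made-up counts, not from our corpus), a 2x2 chi-squared test compares one word's frequency in two corpora; this is what textstat_keyness() computes for every word:
m <- matrix(c(50, 10000 - 50, # corpus 1: occurrences of the word, other tokens
              10, 12000 - 10), # corpus 2: same
            nrow = 2, byrow = TRUE)
chisq.test(m)$statistic # a large value = the word is specific to one corpus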
To do so, we need to build a Document-Feature Matrix. The first thing to do is to tokenize the corpus (not mandatory, but preferable):
tk <- quanteda::tokens(tt, what = 'word', remove_numbers = T,
remove_punct = T,
remove_symbols = T) # first tokenize the corpus
Then, we build the matrix:
stop <- stopwords("fr", source = "stopwords-iso") # Here, we use a wider list of stopwords
dfm <- dfm(tk, remove = stop, tolower = TRUE, remove_punct = TRUE)
dfm <- dfm_wordstem(dfm, language = "french")
dfm <- dfm(dfm, remove = c("a", "fair", "fais", "fait", "plus", "tout", "quand", "ça", "ye")) # remove specific words
dfm <- dfm_trim(dfm, min_termfreq = 20) # keep only the terms that appear at least 20 times
We can run basic statistics on the corpus:
nfeat(dfm) # 701 words are kept
## [1] 701
# the 40 most frequent words
topfeatures(dfm, n=40)
## vi veux faut aim ouais coup rap rêv sais jam gros fin
## 353 333 293 274 250 235 229 222 221 219 215 208
## mer pass vien mond vois sal temp frer mal pet têt met
## 207 207 207 200 200 200 196 190 190 189 185 181
## bais prend put grand mec homm cœur part fum veut con dis
## 177 177 176 172 164 164 164 163 161 157 154 154
## laiss gar oui nuit
## 150 150 150 148
# Useful for a wordcloud
textplot_wordcloud(dfm, max.words = 50, scale = c(3, .5), colors=brewer.pal(6, "Dark2"))
Now we can try to answer the question: how are the texts of each artist specific?
# We need to group documents by artist
dfm_a <- dfm(dfm, groups = "Artiste") # we group together the texts of each rapper
docvars(dfm_a) # 20 different documents, one per artist
# Measure the lexical diversity of each artist (number of distinct forms / total number of forms)
textstat_lexdiv(dfm_a)
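# For comparison, the default measure (the type-token ratio) computed by hand:
ntype(dfm_a) / ntoken(dfm_a) # distinct forms / total forms, per artist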
# Measure of keyness: which words are over-used by an artist? (chi-squared based)
a<-textstat_keyness(dfm_a, target = "t.i.s")
textplot_keyness(a)
a<-textstat_keyness(dfm_a, target = "jul")
textplot_keyness(a)
a<-textstat_keyness(dfm_a, target = "nekfeu")
textplot_keyness(a)
a<-textstat_keyness(dfm_a, target = "alkpote")
textplot_keyness(a)
In a more systematic way, we can apply clustering methods to group texts based on their lexical similarities.
tstat_dist <- textstat_dist(dfm_a)
clust <- hclust(tstat_dist, method = "ward.D2")
plot(clust, xlab = "Distance", ylab = NULL)
Topic modeling is a very fashionable technique, like sentiment analysis.
dtm_l <- convert(dfm, to = "topicmodels")
lda <- LDA(dtm_l, k = 6) # fit an LDA model with 6 topics (arbitrary choice here)
terms(lda, 6)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "gros" "coup" "ouais" "rêv" "sal" "vi"
## [2,] "temp" "veut" "mer" "vien" "frer" "aim"
## [3,] "veux" "niqu" "met" "cœur" "laiss" "mond"
## [4,] "put" "sort" "veux" "rentr" "sor" "vid"
## [5,] "rap" "sais" "faut" "jam" "pe" "heur"
## [6,] "fum" "mort" "âme" "vi" "faut" "vit"
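To see which topics dominate each document, we can also inspect the posterior document-topic proportions (a quick sketch):
gamma <- posterior(lda)$topics # one row per document, one column per topic
round(head(gamma), 2)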
# How many topics should we keep? (Don't run)
# library(ldatuning)
# result <- FindTopicsNumber(
# dtm,
# topics = seq(from = 2, to = 15, by = 1),
# metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
# method = "Gibbs",
# control = list(seed = 77),
# mc.cores = 2L,
# verbose = TRUE
# )
#
# FindTopicsNumber_plot(result)
The method is known through the software "Alceste", created by Max Reinert, who developed the technique in the 1980s.
Principles: split the corpus into small segments, build a binary presence/absence term matrix, then run a descending hierarchical classification that repeatedly splits the segments into the two most lexically contrasted groups (maximizing the chi-squared distance between them).
To make better use of this method, a new package is being developed by Julien Barnier: rainette. It is still in beta testing, but I'm sure it is going to be awesome in a few weeks!
Below, I first try it on the rap corpus, then on another corpus, from the French press.
# On the rap corpus
# First, we need to split texts in segments
corpus <- split_segments(tt, segment_size = 40)
tk <- quanteda::tokens(corpus, what = 'word', remove_numbers = T,
remove_punct = T,
remove_symbols = T) # first tokenize the corpus
dtm <- dfm(tk, remove = stop, tolower = TRUE, remove_punct = TRUE)
dtm <- dfm_wordstem(dtm, language = "french")
dtm <- dfm(dtm, remove = c("a", "fair", "fais", "fait", "plus", "tout", "quand", "ça", "ye")) # remove specific words
dtm <- dfm_trim(dtm, min_termfreq = 10)
res <- rainette(dtm, k = 5, min_uc_size = 5, min_members = 10)
## Computing ucs from segments...
rainette_plot(res, dtm, k = 5, type = "bar", n_terms = 20, free_scales = FALSE,
measure = "chi2", show_negative = "TRUE", text_size = 10)
# On another corpus
beauf <- read.csv2("C:/Users/coren/Dropbox/Sociologie/Thèse/Quantithèse/Bdd_thèse/quisuisje.csv", stringsAsFactors = F)
bobo <- corpus(beauf$Texte[1:1000])
corpus <- split_segments(bobo, segment_size = 40)
tk <- quanteda::tokens(corpus, what = 'word', remove_numbers = T,
remove_punct = T,
remove_symbols = T) # first tokenize the corpus
dtm <- dfm(tk, remove = stop, tolower = TRUE, remove_punct = TRUE)
dtm <- dfm_wordstem(dtm, language = "french")
dtm <- dfm(dtm, remove = c("a", "fair", "fais", "fait", "plus", "tout", "quand", "ça", "ye")) # remove specific words
dtm <- dfm_trim(dtm, min_termfreq = 20)
res <- rainette(dtm, k = 8, min_uc_size = 5, min_members = 10)
## Computing ucs from segments...
rainette_plot(res, dtm, k = 8, type = "bar", n_terms = 20, free_scales = FALSE,
measure = "chi2", show_negative = "TRUE", text_size = 10)