Latent Dirichlet Allocation (LDA) is a generative probabilistic model of a corpus. The idea is to view documents as random mixtures over underlying topics, where each topic is characterized by a distribution over words. The generative process of the documents is the following. For a corpus C = (\(d_1\), ..., \(d_M\)) of M documents, and for each document \(d_i\) = (\(w_1\), ..., \(w_{N_i}\)) of \(N_i\) words:

1. Choose \(N_i \sim\) Poisson(\(\epsilon\)).
2. Choose \(\theta \sim\) Dir(\(\alpha\)).
3. For each of the \(N_i\) words \(w_n\) of \(d_i\):
    a) choose a topic \(z_n \sim\) Multinomial(\(\theta\));
    b) choose a word \(w_n\) from \(p(w_n \mid z_n, \beta)\), a multinomial probability conditioned on the topic \(z_n\).
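As a concrete illustration, here is a minimal R sketch of this generative process for a toy example; the vocabulary, `K`, `alpha`, `beta_mat` and the Poisson rate are arbitrary assumptions, and the Dirichlet draw is obtained by normalizing independent Gamma samples to avoid an extra dependency.

set.seed(1)
vocab <- c("ball", "game", "team", "vote", "law", "court")  #toy vocabulary (assumed)
K <- 2                                                      #number of topics (assumed)
alpha <- rep(0.5, K)                                        #Dirichlet hyperparameter (assumed)
beta_mat <- rbind(c(0.4, 0.3, 0.3, 0.0, 0.0, 0.0),          #topic 1: word probabilities (assumed)
                  c(0.0, 0.0, 0.0, 0.4, 0.3, 0.3))          #topic 2: word probabilities (assumed)
#1) document length N ~ Poisson
N <- rpois(1, lambda = 8)
#2) topic proportions theta ~ Dirichlet(alpha), via normalized Gamma draws
g <- rgamma(K, shape = alpha)
theta <- g / sum(g)
#3) for each word: draw a topic z_n, then a word from the corresponding row of beta
z <- sample(1:K, N, replace = TRUE, prob = theta)
w <- sapply(z, function(k) sample(vocab, 1, prob = beta_mat[k, ]))
w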
The generative scheme above thus involves five parameters, defined as follows: \(\alpha\): a vector of dimension K, the parameter of the Dirichlet distribution generating the topic proportions of a document. An intuitive interpretation of this parameter is given in the example of the previous section.
\(\beta\): a K × V matrix, where K is the number of topics and V the size of the vocabulary of the corpus. Its general term is \[\beta_{ij}=p(w_{j}|z_{i})\] The coefficient \(\beta_{ij}\) is thus the probability of the word \(w_j\) under the topic \(z_i\).
\(\theta\): the proportions of the topics within each document.
N.B.: \(\alpha\) and \(\beta\) are called the hyperparameters of the model, as opposed to \(\theta\) and z, which are latent variables, and w, which is observed. N corresponds to \(N_i\) for a given document d.
The variables \(\theta\) and z are of particular interest in that they make the model hierarchical, attaching variables to each component of the corpus (the whole corpus, the documents, the words). Three distinct levels can indeed be distinguished in the figure. The topic proportions \(\theta\) are drawn from a Dirichlet distribution with density \[ p(\theta \mid \alpha ) = \frac{ \Gamma\!\left(\sum_{i=1}^K \alpha_i\right)}{ \prod_{i=1}^K \Gamma(\alpha_i)}\, \theta_{1}^{\alpha_1-1}\cdots\theta_{K}^{\alpha_K-1} \]
with \(\alpha \in \mathbb{R}^K\), \(\alpha_i > 0\), and \(\Gamma(x)\) the Gamma function. This distribution provides the parameter \(\theta\) of a multinomial distribution, which in LDA corresponds to the topic mixture of a document w. The sum \(\alpha_0 = \alpha_1 + \cdots + \alpha_K\) controls the homogeneity of the \(\theta_i\): when \(\alpha_0\) is large, the \(\theta_i\) are close to one another and homogeneous; when \(\alpha_0\) is small, most of the \(\theta_i\) are close to 0 except for a few. In the extreme cases, all the \(\theta_i\) are equal (\(\alpha_0 \to \infty\)) or all the \(\theta_i\) are zero except one (\(\alpha_0 \to 0\)). One advantage of the Dirichlet distribution is that it is conjugate to the multinomial distribution, that is, if \(z_1, \dots, z_N\) are multinomial variables with parameter \(\theta\), then the variable \(\theta \mid z_1, \dots, z_N\) given by
\[p(\theta \mid z_1, \dots , z_N ) \propto p(z_1,\dots,z_N \mid \theta)\, p(\theta \mid \alpha) \]
also follows a Dirichlet distribution. This will simplify the computations at the inference stage.
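To make the effect of \(\alpha_0\) concrete, here is a small R sketch (separate from the analysis below) that draws \(\theta\) from a Dirichlet by normalizing Gamma samples, once with a large and once with a small symmetric \(\alpha\); the chosen values are illustrative assumptions.

set.seed(42)
#draw one sample from Dirichlet(alpha) by normalizing independent Gamma draws
rdirichlet_one <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }
#large symmetric alpha: the theta_i are homogeneous, close to 1/K
round(rdirichlet_one(rep(50, 5)), 3)
#small symmetric alpha: the mass concentrates on one or two components
round(rdirichlet_one(rep(0.1, 5)), 3)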
Given the parameters \(\alpha\) and \(\beta\), the joint probability of the topic mixture \(\theta\), of the N topics z and of the N words w is given by:
\[p(\theta , z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^N p(z_n \mid \theta)\, p(w_n \mid \beta_{z_n}) \]
The marginal distribution of a document w is then:
\[p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^N \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid \beta_{z_n} )\, d\theta\]
and it suffices to take the product of this quantity over every document w of the corpus to obtain the probability of the corpus.
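Written out explicitly for the whole corpus C, this gives:

\[p(C \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \beta_{z_{dn}}) \right) d\theta_d\]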
#load the libraries used throughout: knitr for table rendering, topicmodels for LDA, tm for text preprocessing
library(knitr)
library(topicmodels)
library(tm)
filenames <- list.files(getwd(),pattern="*.txt" )
#filenames
files <- lapply(filenames,readLines)
## Warning in FUN(X[[i]], ...): ligne finale incomplète trouvée dans 'DM.txt'
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, "" , x))})
docs <- Corpus(VectorSource(files))
docs <-tm_map(docs,content_transformer(tolower))
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ", ")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, "•")
docs <- tm_map(docs, toSpace, "“")
#remove punctuation
docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)
#Good practice to check every now and then
#Stem document
docs <- tm_map(docs,stemDocument)
#fix up 1) differences between US and Australian English 2) general errors
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organiz", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organis", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "andgovern", replacement = "govern")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "inenterpris", replacement = "enterpris")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "team-", replacement = "team")
#define and eliminate all custom stopwords
myStopwords <- c("can", "say","one","way","use",
"also","howev","tell","will",
"much","need","take","tend","even",
"like","particular","rather","said",
"get","well","make","ask","come","end",
"first","two","help","often","may",
"might","see","someth","thing","point",
"post","look","right","now","think","‘ve ",
"‘re ","anoth","put","set","new","good",
"want","sure","kind","larg","yes,","day","etc",
"quit","sinc","attempt","lack","seen","awar",
"littl","ever","moreov","though","found","abl",
"enough","far","earli","away","achiev","draw",
"last","never","brief","bit","entir","brief",
"great","lot")
docs <- tm_map(docs, removeWords, myStopwords)
#Create document-term matrix
dtm <- DocumentTermMatrix(docs)
#convert rownames to filenames
rownames(dtm) <- filenames
#collapse matrix by summing over columns
freq <- colSums(as.matrix(dtm))
#length should be total number of terms
#length(freq)
#create sort order (descending)
ord <- order(freq,decreasing=TRUE)
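#as a quick sanity check on the preprocessing, the most frequent terms can be
#inspected via the objects just created (illustrative, output not shown)
head(freq[ord], 10)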
#ldatuning compares models over a range of topic numbers; Rmpfr provides high-precision arithmetic
library(ldatuning)
library(Rmpfr)
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 4L,
verbose = TRUE
)
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## Deveaud2014... done.
FindTopicsNumber_plot(result)
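#in the plot produced above, ldatuning groups Griffiths2004 and Deveaud2014 as
#metrics to maximize and CaoJuan2009 and Arun2010 as metrics to minimize;
#the raw values behind the plot can also be inspected directly (output not shown)
result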
#set parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
#Number of topics
k <- 5
#Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method="Gibbs", control=list(nstart=nstart, seed=seed, best=best, burnin=burnin, iter=iter, thin=thin))
#topic assigned to each document
ldaOut.topics <- as.matrix(topics(ldaOut))
#top 6 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,6))
kable(ldaOut.terms)
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|
| child | thou | joy | yet | old |
| bring | rom | exeunt | sweet | dream |
| must | thi | friend | hour | shame |
| that | love | letter | fall | thought |
| everi | nurs | sun | samp | three |
| henc | thee | faith | long | law |
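Beyond the six top terms per topic, the full fitted distributions can be extracted; below is a minimal sketch using the posterior() accessor of topicmodels on the ldaOut object defined above (output not shown).

#full posterior distributions of the fitted model
post <- posterior(ldaOut)
dim(post$terms)   #topics x vocabulary: word probabilities per topic
dim(post$topics)  #documents x topics: topic proportions per document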
#per-document topic probabilities (the gamma matrix), with readable column names
topicProbabilities <- as.data.frame(ldaOut@gamma)
colnames(topicProbabilities) <- paste("Topic", 1:k)
docu <- paste0("doc", 1:30)
tab <- cbind(docu, topicProbabilities)
kable(tab)
| docu | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|---|
| doc1 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc2 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc3 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc4 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc5 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc6 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc7 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc8 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc9 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc10 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc11 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc12 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc13 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc14 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc15 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc16 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc17 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc18 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc19 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc20 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc21 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc22 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc23 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc24 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc25 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc26 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc27 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc28 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
| doc29 | 0.5692308 | 0.0769231 | 0.1076923 | 0.1076923 | 0.1384615 |
| doc30 | 0.0810011 | 0.5446985 | 0.1215417 | 0.1398529 | 0.1129058 |
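As a quick way to read this table, the dominant topic of each document can be extracted from the objects above; a minimal sketch:

#dominant topic of each document and their counts
dominantTopic <- apply(topicProbabilities, 1, which.max)
table(dominantTopic)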