1 Introduction
2 Implementation in R
3 Extracting data from the Facebook network
4 Case study: analysis of D. Trump's tweets

1 Introduction

The aim of this chapter is to introduce a few notions of text mining.
Text mining pipelines can be divided into two main stages.
- The first stage, analysis, consists of recognizing words, sentences, their grammatical roles, their relations, and their meaning.
- The second stage, interpretation of the analysis, makes it possible to select one text among others. Example applications include classifying emails as spam (unsolicited mail) or non-spam, running queries in a document search engine, and text summarization, which selects the representative sentences of a text or even rephrases them.
The selection criterion can be of at least two types: novelty and similarity. The novelty criterion consists of discovering relations, in particular implications that were not explicit because they are indirect or follow from two elements far apart in the text. The similarity criterion (similarity to or contradiction with another text, or the answer to a specific question) consists of finding the texts that best match a set of descriptors given in the initial query. Descriptors are, for example, the most frequent nouns and verbs of a text.
In this chapter we analyze a few political speeches from the United States, and we finish with analyses of tweets and the extraction of information from Facebook pages.
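
The similarity criterion described above can be illustrated with a minimal sketch. It assumes that the descriptors are plain word counts and that similarity is measured by the cosine between count vectors; both choices, as well as the toy documents and the names `docs`, `tf` and `cosine`, are illustrative assumptions rather than anything prescribed by this chapter.

```r
# Toy documents (hypothetical, not taken from the speech)
docs <- c(query = "jobs and the economy",
          doc_a = "new jobs will grow the economy",
          doc_b = "we pray for the health of our friend")

# Split each document into lowercase words
tokens <- strsplit(tolower(docs), "[^a-z]+")
names(tokens) <- names(docs)

# Shared vocabulary and one word-count vector per document
vocab <- sort(unique(unlist(tokens)))
tf <- sapply(tokens, function(tk) as.numeric(table(factor(tk, levels = vocab))))
rownames(tf) <- vocab

# Cosine similarity between two count vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim_a <- cosine(tf[, "query"], tf[, "doc_a"])
sim_b <- cosine(tf[, "query"], tf[, "doc_b"])
print(c(doc_a = sim_a, doc_b = sim_b))
```

The document sharing more descriptors with the query (here `doc_a`) gets the higher score, which is the sense in which a text "matches" a query.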

2 Implementation in `R`

2.1 Analysis of a political speech

2.1.1 Importing the text into `R` and creating the corpus

First we install, if needed, and load the required packages: tm (for text mining) and wordcloud.

> library(tm)
Loading required package: NLP
> library(wordcloud)
Loading required package: RColorBrewer

We load the data, which are stored in a .txt file. The text was downloaded from the White House website: http://www.whitehouse.gov/the-press-office/2011/01/25/remarks-president-state-union-address.

> txt=readLines("/Users/dhafermalouche/Documents/Teaching/CoursDataMining_1516/WordCloud/ObamaSpeech2011.txt")
> txt[1:4]
[1] "Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans:"                                                                                                                                                                                                                                   
[2] ""                                                                                                                                                                                                                                                                                                                                    
[3] "      Tonight I want to begin by congratulating the men and women of the 112th Congress, as well as your new Speaker, John Boehner.  (Applause.)  And as we mark this occasion, we’re also mindful of the empty chair in this chamber, and we pray for the health of our colleague -- and our friend -– Gabby Giffords.  (Applause.)"
[4] ""

These data contain Obama's 2011 State of the Union address. The State of the Union address is an annual event in which the President of the United States presents his program for the current year. The speech is delivered in Washington at the Capitol, where both chambers (the House of Representatives and the Senate) are gathered.

For the record, George Washington gave the first State of the Union address on January 8, 1790, in New York City, which was the capital at the time.

Data cleaning. We remove the punctuation.

> txt <- removePunctuation(txt)
> txt[1:5]
[1] "Mr Speaker Mr Vice President members of Congress distinguished guests and fellow Americans"                                                                                                                                                                                                                         
[2] ""                                                                                                                                                                                                                                                                                                                   
[3] "      Tonight I want to begin by congratulating the men and women of the 112th Congress as well as your new Speaker John Boehner  Applause  And as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague  and our friend  Gabby Giffords  Applause"
[4] ""                                                                                                                                                                                                                                                                                                                   
[5] "      Its no secret that those of us here tonight have had our differences over the last two years  The debates have been contentious we have fought fiercely for our beliefs  And thats a good thing  Thats what a robust democracy demands  Thats what helps set us apart as a nation"
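
To see why words such as "that's" end up as "thats", the step can be imitated in base R. The sketch below assumes that removing punctuation amounts to deleting every character of the POSIX `[[:punct:]]` class; `strip_punct` and `sample_lines` are hypothetical names, and the function is an approximation, not tm's own implementation.

```r
# Base R approximation of punctuation removal: delete every character
# that belongs to the POSIX punctuation class
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)

sample_lines <- c("That's a good thing.", "Mr. Speaker, Mr. Vice President:")
cleaned <- strip_punct(sample_lines)
print(cleaned)
```

Because apostrophes are deleted rather than replaced by a space, contractions collapse into a single token, which explains terms such as thats or theyve later in the analysis.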

We remove the numbers, if any. Note that ordinals lose their digits too: 112th becomes th.

> txt <- removeNumbers(txt)
> txt[1:10]
 [1] "Mr Speaker Mr Vice President members of Congress distinguished guests and fellow Americans"                                                                                                                                                                                                                                                                                  
 [2] ""                                                                                                                                                                                                                                                                                                                                                                            
 [3] "      Tonight I want to begin by congratulating the men and women of the th Congress as well as your new Speaker John Boehner  Applause  And as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague  and our friend  Gabby Giffords  Applause"                                                            
 [4] ""                                                                                                                                                                                                                                                                                                                                                                            
 [5] "      Its no secret that those of us here tonight have had our differences over the last two years  The debates have been contentious we have fought fiercely for our beliefs  And thats a good thing  Thats what a robust democracy demands  Thats what helps set us apart as a nation"                                                                                     
 [6] ""                                                                                                                                                                                                                                                                                                                                                                            
 [7] "      But theres a reason the tragedy in Tucson gave us pause Amid all the noise and passion and rancor of our public debate Tucson reminded us that no matter who we are or where we come from each of us is a part of something greater  something more consequential than party or political preference"                                                                  
 [8] ""                                                                                                                                                                                                                                                                                                                                                                            
 [9] "      We are part of the American family  We believe that in a country where every race and faith and point of view can be found we are still bound together as one people that we share common hopes and a common creed that the dreams of a little girl in Tucson are not so different than those of our own children and that they all deserve the chance to be fulfilled"
[10] ""

We remove all empty lines.

> txt <- txt[txt != ""]
> txt[1:10]
 [1] "Mr Speaker Mr Vice President members of Congress distinguished guests and fellow Americans"                                                                                                                                                                                                                                                                                                            
 [2] "      Tonight I want to begin by congratulating the men and women of the th Congress as well as your new Speaker John Boehner  Applause  And as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague  and our friend  Gabby Giffords  Applause"                                                                                      
 [3] "      Its no secret that those of us here tonight have had our differences over the last two years  The debates have been contentious we have fought fiercely for our beliefs  And thats a good thing  Thats what a robust democracy demands  Thats what helps set us apart as a nation"                                                                                                               
 [4] "      But theres a reason the tragedy in Tucson gave us pause Amid all the noise and passion and rancor of our public debate Tucson reminded us that no matter who we are or where we come from each of us is a part of something greater  something more consequential than party or political preference"                                                                                            
 [5] "      We are part of the American family  We believe that in a country where every race and faith and point of view can be found we are still bound together as one people that we share common hopes and a common creed that the dreams of a little girl in Tucson are not so different than those of our own children and that they all deserve the chance to be fulfilled"                          
 [6] "      That too is what sets us apart as a nation  Applause"                                                                                                                                                                                                                                                                                                                                            
 [7] "      Now by itself this simple recognition wont usher in a new era of cooperation  What comes of this moment is up to us  What comes of this moment will be determined not by whether we can sit together tonight but whether we can work together tomorrow  Applause"                                                                                                                                
 [8] "      I believe we can  And I believe we must  Thats what the people who sent us here expect of us  With their votes theyve determined that governing will now be a shared responsibility between parties  New laws will only pass with support from Democrats and Republicans  We will move forward together or not at all  for the challenges we face are bigger than party and bigger than politics"
 [9] "      At stake right now is not who wins the next election  after all we just had an election  At stake is whether new jobs and industries take root in this country or somewhere else  Its whether the hard work and industry of our people is rewarded  Its whether we sustain the leadership that has made America not just a place on a map but the light to the world"                            
[10] "      We are poised for progress  Two years after the worst recession most of us have ever known the stock market has come roaring back  Corporate profits are up  The economy is growing again"
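
Empty strings can be dropped either with logical indexing or with negative indexing through which(). The two forms agree when at least one empty string exists, but not otherwise. The following sketch, on hypothetical vectors, shows the difference.

```r
lines_with_blanks <- c("first paragraph", "", "second paragraph", "")
lines_no_blanks   <- c("first paragraph", "second paragraph")

# Logical indexing keeps the non-empty elements in both cases
keep1 <- lines_with_blanks[lines_with_blanks != ""]
keep2 <- lines_no_blanks[lines_no_blanks != ""]

# Negative indexing with which() returns nothing when there is no match,
# because -integer(0) is integer(0), which selects no element at all
lost <- lines_no_blanks[-which(lines_no_blanks == "")]

print(keep1)
print(keep2)
print(length(lost))
```

For this reason the logical form is the safer habit when the presence of empty lines is not guaranteed.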

We convert everything to lowercase.

> txt <- tolower(txt)
> txt[1:10]
 [1] "mr speaker mr vice president members of congress distinguished guests and fellow americans"                                                                                                                                                                                                                                                                                                            
 [2] "      tonight i want to begin by congratulating the men and women of the th congress as well as your new speaker john boehner  applause  and as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague  and our friend  gabby giffords  applause"                                                                                      
 [3] "      its no secret that those of us here tonight have had our differences over the last two years  the debates have been contentious we have fought fiercely for our beliefs  and thats a good thing  thats what a robust democracy demands  thats what helps set us apart as a nation"                                                                                                               
 [4] "      but theres a reason the tragedy in tucson gave us pause amid all the noise and passion and rancor of our public debate tucson reminded us that no matter who we are or where we come from each of us is a part of something greater  something more consequential than party or political preference"                                                                                            
 [5] "      we are part of the american family  we believe that in a country where every race and faith and point of view can be found we are still bound together as one people that we share common hopes and a common creed that the dreams of a little girl in tucson are not so different than those of our own children and that they all deserve the chance to be fulfilled"                          
 [6] "      that too is what sets us apart as a nation  applause"                                                                                                                                                                                                                                                                                                                                            
 [7] "      now by itself this simple recognition wont usher in a new era of cooperation  what comes of this moment is up to us  what comes of this moment will be determined not by whether we can sit together tonight but whether we can work together tomorrow  applause"                                                                                                                                
 [8] "      i believe we can  and i believe we must  thats what the people who sent us here expect of us  with their votes theyve determined that governing will now be a shared responsibility between parties  new laws will only pass with support from democrats and republicans  we will move forward together or not at all  for the challenges we face are bigger than party and bigger than politics"
 [9] "      at stake right now is not who wins the next election  after all we just had an election  at stake is whether new jobs and industries take root in this country or somewhere else  its whether the hard work and industry of our people is rewarded  its whether we sustain the leadership that has made america not just a place on a map but the light to the world"                            
[10] "      we are poised for progress  two years after the worst recession most of us have ever known the stock market has come roaring back  corporate profits are up  the economy is growing again"

We remove all common words such as the, at… (stopwords).

> txt <- removeWords(txt,stopwords("en"))
> txt[1:10]
 [1] "mr speaker mr vice president members  congress distinguished guests  fellow americans"                                                                                                                                                                                                                
 [2] "      tonight  want  begin  congratulating  men  women   th congress  well   new speaker john boehner  applause     mark  occasion  also mindful   empty chair   chamber   pray   health   colleague    friend  gabby giffords  applause"                                                             
 [3] "        secret    us  tonight    differences   last two years   debates   contentious   fought fiercely   beliefs   thats  good thing  thats   robust democracy demands  thats  helps set us apart   nation"                                                                                          
 [4] "       theres  reason  tragedy  tucson gave us pause amid   noise  passion  rancor   public debate tucson reminded us   matter       come    us   part  something greater  something  consequential  party  political preference"                                                                     
 [5] "        part   american family   believe    country  every race  faith  point  view can  found   still bound together  one people   share common hopes   common creed   dreams   little girl  tucson    different      children     deserve  chance   fulfilled"                                      
 [6] "          sets us apart   nation  applause"                                                                                                                                                                                                                                                           
 [7] "      now    simple recognition wont usher   new era  cooperation   comes   moment    us   comes   moment will  determined   whether  can sit together tonight  whether  can work together tomorrow  applause"                                                                                        
 [8] "       believe  can    believe  must  thats   people  sent us  expect  us    votes theyve determined  governing will now   shared responsibility  parties  new laws will  pass  support  democrats  republicans   will move forward together        challenges  face  bigger  party  bigger  politics"
 [9] "       stake right now    wins  next election     just   election   stake  whether new jobs  industries take root   country  somewhere else   whether  hard work  industry   people  rewarded   whether  sustain  leadership   made america  just  place   map   light   world"                       
[10] "        poised  progress  two years   worst recession   us  ever known  stock market  come roaring back  corporate profits     economy  growing "
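
The output above keeps the blanks that surrounded each removed word. A minimal base R sketch of stopword filtering that also collapses those blanks is given below. The stop list is a tiny hypothetical one (the list returned by stopwords("en") is much longer), and `drop_stopwords` is an illustrative helper, not a tm function.

```r
# A tiny, hypothetical stop list
stop_list <- c("the", "of", "and", "to", "a", "in", "we", "for", "our")

drop_stopwords <- function(x, stops) {
  # Split on runs of whitespace, then drop empty tokens and stopwords
  words <- strsplit(x, "[[:space:]]+")
  vapply(words, function(w) {
    w <- w[w != "" & !(w %in% stops)]
    paste(w, collapse = " ")
  }, character(1))
}

filtered <- drop_stopwords("      we pray for the health of our colleague", stop_list)
print(filtered)
```

Within tm itself, stripWhitespace() can be applied after removeWords() to collapse the runs of blanks.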

We can also specify particular words to remove, for example mr, us and applause.

> txt <- removeWords(txt,c("mr","us","applause"))
> txt[1:10]
 [1] " speaker  vice president members  congress distinguished guests  fellow americans"                                                                                                                                                                                                                
 [2] "      tonight  want  begin  congratulating  men  women   th congress  well   new speaker john boehner       mark  occasion  also mindful   empty chair   chamber   pray   health   colleague    friend  gabby giffords  "                                                                         
 [3] "        secret      tonight    differences   last two years   debates   contentious   fought fiercely   beliefs   thats  good thing  thats   robust democracy demands  thats  helps set  apart   nation"                                                                                          
 [4] "       theres  reason  tragedy  tucson gave  pause amid   noise  passion  rancor   public debate tucson reminded    matter       come       part  something greater  something  consequential  party  political preference"                                                                       
 [5] "        part   american family   believe    country  every race  faith  point  view can  found   still bound together  one people   share common hopes   common creed   dreams   little girl  tucson    different      children     deserve  chance   fulfilled"                                  
 [6] "          sets  apart   nation  "                                                                                                                                                                                                                                                                 
 [7] "      now    simple recognition wont usher   new era  cooperation   comes   moment       comes   moment will  determined   whether  can sit together tonight  whether  can work together tomorrow  "                                                                                              
 [8] "       believe  can    believe  must  thats   people  sent   expect      votes theyve determined  governing will now   shared responsibility  parties  new laws will  pass  support  democrats  republicans   will move forward together        challenges  face  bigger  party  bigger  politics"
 [9] "       stake right now    wins  next election     just   election   stake  whether new jobs  industries take root   country  somewhere else   whether  hard work  industry   people  rewarded   whether  sustain  leadership   made america  just  place   map   light   world"                   
[10] "        poised  progress  two years   worst recession     ever known  stock market  come roaring back  corporate profits     economy  growing "

Finally, we convert the txt object to the Corpus format so that it can be analyzed.

> corpus <- Corpus(VectorSource(txt))
> corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 113

2.1.2 Drawing the word cloud

We convert the corpus into a term-document matrix (we decided to drop words shorter than 3 characters).

> tdm <- TermDocumentMatrix(corpus,control = list(wordLengths=c(3,Inf)))
> tdm
<<TermDocumentMatrix (terms: 1558, documents: 113)>>
Non-/sparse entries: 3314/172740
Sparsity           : 98%
Maximal term length: 16
Weighting          : term frequency (tf)
> dim(tdm)
[1] 1558  113

We can conclude that the text contains:
- 1558 distinct terms;
- 113 paragraphs.

Each row of the matrix tdm corresponds to a term and each column corresponds to a paragraph.
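
The structure of such a matrix can be reproduced by hand on a small example. The sketch below uses three hypothetical paragraphs and base R only; `toy_tdm` is an ordinary integer matrix, not a tm object.

```r
paragraphs <- c("new jobs new industries",
                "jobs take root",
                "economy growing")
tokens <- strsplit(paragraphs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Rows are terms, columns are paragraphs, entries are counts
toy_tdm <- sapply(tokens, function(tk) as.integer(table(factor(tk, levels = vocab))))
dimnames(toy_tdm) <- list(Terms = vocab, Docs = as.character(seq_along(paragraphs)))

print(toy_tdm)
print(dim(toy_tdm))       # number of distinct terms, number of paragraphs
print(rowSums(toy_tdm))   # total frequency of each term
print(sum(toy_tdm == 0))  # number of empty cells, the source of sparsity
```

Most cells are zero even in this tiny case, since each paragraph uses only a small part of the vocabulary; this is what the sparsity figure measures.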

> sum((tdm==0))
[1] 172740
> sum((tdm!=0))
[1] 3314

The most frequent words in the text

> m <- as.matrix(tdm)
> freqWords=rowSums(m)
> freqWords=sort(freqWords,decreasing=TRUE)
> t(freqWords[1:10])
     will can new people jobs now years thats make just
[1,]   58  37  36     31   25  25    25    24   23   21

We decide to remove the word thats from the matrix.

> i=grep('thats',rownames(m))
> m=m[-i,]

Let us look for the word economy in the text and count its occurrences.

> i=grep('economy',rownames(m))
> sum(m[i,])
[1] 7

And the word security?

> i=grep('security',rownames(m))
> sum(m[i,])
[1] 3
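
Note that grep() matches substrings, so a lookup by pattern may pick up more than one term. The sketch below, on a hypothetical vector of term names, contrasts a substring match with two ways of requesting an exact match.

```r
# Hypothetical term names, chosen to show the difference
term_names <- c("economy", "economys", "security", "insecurity")

substring_hits <- grep("economy", term_names, value = TRUE)     # substring match
exact_regex    <- grep("^economy$", term_names, value = TRUE)   # exact match via anchors
exact_compare  <- term_names[term_names == "security"]          # exact match via comparison

print(substring_hits)
print(exact_regex)
print(exact_compare)
```

When the count of one specific term is wanted, an anchored pattern or a direct comparison avoids adding up the frequencies of several related terms.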

Let us now draw the word cloud.

> freqWords=rowSums(m)
> v=sort(freqWords,decreasing=TRUE)
> dt=data.frame(word=names(v),freq=v)
> head(dt)
         word freq
will     will   58
can       can   37
new       new   36
people people   31
jobs     jobs   25
now       now   25
> par(bg="gray")
> wordcloud(dt$word,dt$freq,min.freq = 5,random.order = FALSE)

Finding the most frequent terms

> freq.terms <- findFreqTerms(tdm, lowfreq = 20)
> freq.terms
 [1] "can"    "jobs"   "just"   "make"   "new"    "now"    "people"
 [8] "thats"  "will"   "work"   "years" 
> term.freq <- rowSums(as.matrix(tdm))
> term.freq <- subset(term.freq, term.freq >= 20)
> df <- data.frame(term = names(term.freq), freq = term.freq)
> library(ggplot2)

Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':

    annotate
> ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
+ xlab("Terms") + ylab("Count") + coord_flip() +
+ theme(axis.text=element_text(size=7))

Searching for associations

> findAssocs(tdm, "people", 0.2)
$people
 aspirations       desire     dictator     powerful       stands 
        0.50         0.50         0.50         0.50         0.50 
    supports      tunisia         writ         free          saw 
        0.50         0.50         0.50         0.44         0.34 
  democratic      finally       proved      purpose      america 
        0.32         0.32         0.32         0.32         0.31 
  assistance       danced         dawn       events independence 
        0.31         0.31         0.31         0.31         0.31 
       lined         lost       recent        scene        shown 
        0.31         0.31         0.31         0.31         0.31 
       sudan       summed          war         able        clear 
        0.31         0.31         0.29         0.27         0.27 
      degree         must       around       dreams         laws 
        0.27         0.27         0.23         0.23         0.23 
         man       nearly      protect     security     industry 
        0.23         0.23         0.23         0.23         0.22 
        also         will 
        0.21         0.21 
> findAssocs(tdm, "job", 0.2)
$job
     chances       decent     downtown      finding      limited 
        0.71         0.71         0.71         0.71         0.71 
       maybe         much       nearby    neighbors   occasional 
        0.71         0.71         0.71         0.71         0.71 
    paycheck       pretty        pride     probably    promotion 
        0.71         0.71         0.71         0.71         0.71 
    watching         youd      factory        meant       seeing 
        0.71         0.71         0.49         0.49         0.49 
     showing         hard     benefits  competition        forge 
        0.49         0.45         0.39         0.39         0.39 
      worked         good          act         born        brave 
        0.39         0.36         0.34         0.34         0.34 
    bringing      choices       combat   compromise     deficits 
        0.34         0.34         0.34         0.34         0.34 
   expressed       finish       formed        heads         held 
        0.34         0.34         0.34         0.34         0.34 
      houses   individual     interest         iraq        iraqi 
        0.34         0.34         0.34         0.34         0.34 
        kept      lasting      patrols   principled        sides 
        0.34         0.34         0.34         0.34         0.34 
      always       degree          end         even         kids 
        0.33         0.33         0.33         0.33         0.33 
    remember         time        didnt    civilians         code 
        0.33         0.29         0.24         0.23         0.23 
  commitment        ended expectations    graduates      highest 
        0.23         0.23         0.23         0.23         0.23 
        join         look      members  partnership     prepared 
        0.23         0.23         0.23         0.23         0.23 
  proportion        raise         rein     simplify        taxes 
        0.23         0.23         0.23         0.23         0.23 
       tough     violence      company 
        0.23         0.23         0.21 
> findAssocs(tdm, "jobs", 0.2)
$jobs
        month     agreement      doubling        export     finalized 
         0.60          0.51          0.51          0.51          0.51 
       signed unprecedented          pass       support       exports 
         0.51          0.51          0.47          0.47          0.44 
       create    agreements        dreams    enterprise     factories 
         0.39          0.36          0.36          0.34          0.34 
        labor         least        pursue      recently          soon 
         0.34          0.34          0.34          0.34          0.34 
        train biotechnology       careers      carolina       earning 
         0.34          0.33          0.33          0.33          0.33 
 fastchanging     furniture          hope      industry         kathy 
         0.33          0.33          0.33          0.33          0.33 
      measure      measured        mother         offer           old 
         0.33          0.33          0.33          0.33          0.33 
opportunities       proctor     prospects  revitalizing          shes 
         0.33          0.33          0.33          0.33          0.33 
  surrounding         tells      thriving        todays          town 
         0.33          0.33          0.33          0.33          0.33 
      turning    yardsticks         trade      business        better 
         0.33          0.33          0.31          0.30          0.29 
   businesses         since        abroad      colleges          goal 
         0.29          0.28          0.25          0.25          0.25 
        india      products       quality          sell         never 
         0.25          0.25          0.25          0.25          0.24 
   innovation       america           ago          home         alone 
         0.23          0.22          0.21          0.21          0.20 
breakthroughs   electricity          else       forsyth          gone 
         0.20          0.20          0.20          0.20          0.20 
      inspire     inventors          keep         korea         named 
         0.20          0.20          0.20          0.20          0.20 
        north         owner     paychecks      progress       sputnik 
         0.20          0.20          0.20          0.20          0.20 
         tech          told         wants         woman 
         0.20          0.20          0.20          0.20
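
The association scores above can be read as correlations between rows of the term-document matrix: two terms score high when they tend to occur in the same paragraphs. The sketch below reproduces that idea in base R on a hypothetical count matrix. It assumes Pearson correlation and a lower threshold of 0.2; the matrix `toy` and the resulting values are illustrative only.

```r
# Toy term-document counts (hypothetical): 3 terms, 5 documents
toy <- rbind(jobs   = c(2, 0, 1, 0, 3),
             export = c(1, 0, 1, 0, 2),
             war    = c(0, 1, 0, 2, 0))

# Correlation of "jobs" with every term, computed across documents
assoc <- apply(toy, 1, function(row) cor(toy["jobs", ], row))
assoc <- assoc[names(assoc) != "jobs"]

# Keep the terms at or above the threshold, highest first
strong <- sort(assoc[assoc >= 0.2], decreasing = TRUE)
print(round(strong, 2))
```

A term that appears in a single paragraph can reach a high score with any other term of that paragraph, which is why many rare words share exactly the same value in the lists above.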

The results for job and jobs differ because the two forms are counted as separate terms. We therefore need stemming, which reduces all words to their common root.

> require(RWeka)
Loading required package: RWeka
> require(SnowballC)
Loading required package: SnowballC
> corpus1 <- Corpus(VectorSource(txt))
> tdm1 <- TermDocumentMatrix(corpus1, control=list(stemming=TRUE))
> tdm1 ## Compare with the previous tdm
<<TermDocumentMatrix (terms: 1266, documents: 113)>>
Non-/sparse entries: 3260/139798
Sparsity           : 98%
Maximal term length: 14
Weighting          : term frequency (tf)
> tdm
<<TermDocumentMatrix (terms: 1558, documents: 113)>>
Non-/sparse entries: 3314/172740
Sparsity           : 98%
Maximal term length: 16
Weighting          : term frequency (tf)
> freq.terms1 <- findFreqTerms(tdm1, lowfreq = 20)
> freq.terms1  ## Compare with the previous freq.terms
 [1] "america"  "american" "busi"     "can"      "come"     "govern"  
 [7] "job"      "just"     "make"     "nation"   "need"     "new"     
[13] "now"      "peopl"    "that"     "will"     "work"     "year"    
> freq.terms
 [1] "can"    "jobs"   "just"   "make"   "new"    "now"    "people"
 [8] "thats"  "will"   "work"   "years" 
> term.freq1 <- rowSums(as.matrix(tdm1))
> term.freq1 <- subset(term.freq1, term.freq1 >= 20)
> df1 <- data.frame(term = names(term.freq1), freq = term.freq1)
> library(ggplot2)
> ggplot(df1, aes(x=term, y=freq)) + geom_bar(stat="identity") +
+ xlab("Terms") + ylab("Count") + coord_flip() +
+ theme(axis.text=element_text(size=7))
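
To see what stemming does to individual words, here is a minimal sketch that calls the stemmer of the SnowballC package directly. The word list is an illustrative choice, picked so that the stems can be matched against the frequent terms printed above (for instance "busi", "peopl" and "govern").

```r
library(SnowballC)

# Illustrative words; their stems appear in freq.terms1 above
words <- c("jobs", "people", "business", "years", "government")

# wordStem() reduces each word to its stem
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```

Note that a stem is not necessarily a dictionary word: it is only a shared prefix used to group inflected forms together.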

The word network.

> library(graph)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':

    IQR, mad, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff,
    sort, table, tapply, union, unique, unsplit
> library(Rgraphviz)
Loading required package: grid
> plot(tdm1, term = freq.terms1, corThreshold = 0.1, weighting = T)
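
The plot above links terms whose correlation across documents is at least corThreshold. The following sketch reproduces that idea in base R on a small made-up term-document matrix (the terms and counts are assumptions chosen for illustration, not values taken from the speech).

```r
# Toy term-document matrix: rows are terms, columns are documents
m <- rbind(job    = c(1, 0, 2, 0),
           work   = c(1, 0, 2, 1),
           nation = c(0, 3, 0, 1))
colnames(m) <- paste0("doc", 1:4)

# Correlation between terms, computed over the documents
term_cor <- cor(t(m))

# Keep each pair once (upper triangle) when its correlation reaches the threshold
cor_threshold <- 0.1
idx <- which(term_cor >= cor_threshold & upper.tri(term_cor), arr.ind = TRUE)
edges <- data.frame(from = rownames(term_cor)[idx[, "row"]],
                    to   = colnames(term_cor)[idx[, "col"]],
                    cor  = term_cor[idx])
print(edges)
```

In this toy example only "job" and "work" are connected, because the other two pairs are negatively correlated.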

2.2 Topic search.

Here we use the LDA command from the topicmodels package.
LDA (latent Dirichlet allocation) fits a three-level hierarchical Bayesian model.
Further reading: Blei D.M., Ng A.Y., Jordan M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
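
As a reminder of what the three levels are, the generative model of Blei et al. (2003) can be summarised as follows. The notation is an assumption of this sketch: D documents, N_d words in document d, k topics, with alpha and beta the corpus-level parameters.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), && d = 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(1, \theta_d), && n = 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(1, \beta_{z_{d,n}})
\end{aligned}
```

The corpus level holds alpha and beta, the document level holds the topic proportions theta_d, and the word level holds the topic assignment z_{d,n} and the observed word w_{d,n}.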

> dtm1 <- as.DocumentTermMatrix(tdm1)
> dim(dtm1)
[1]  113 1266
> dtm1[1,]
<<DocumentTermMatrix (documents: 1, terms: 1266)>>
Non-/sparse entries: 9/1257
Sparsity           : 99%
Maximal term length: 14
Weighting          : term frequency (tf)
> sum(dtm1[1,])
[1] 9
> dtm1[10,]
<<DocumentTermMatrix (documents: 1, terms: 1266)>>
Non-/sparse entries: 17/1249
Sparsity           : 99%
Maximal term length: 14
Weighting          : term frequency (tf)
> sum(dtm1[10,])
[1] 17
> raw.sum=apply(dtm1,1,sum)
> dtm1=dtm1[raw.sum!=0,]  ## Remove the rows that contain only zeros
> library(topicmodels)
> lda1 <- LDA(dtm1, k = 4) ## Look for 4 topics in the speech
> term1 <- terms(lda1, 7) ## The 7 most likely terms of each topic.
> term1
     Topic 1    Topic 2 Topic 3   Topic 4  
[1,] "will"     "can"   "peopl"   "will"   
[2,] "year"     "will"  "new"     "that"   
[3,] "can"      "make"  "america" "new"    
[4,] "american" "now"   "want"    "job"    
[5,] "last"     "know"  "job"     "educ"   
[6,] "take"     "get"   "will"    "work"   
[7,] "school"   "need"  "year"    "compani"
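
Besides the top terms, a fitted model also gives the topic proportions of each document. The sketch below illustrates this on four made-up sentences (the sentences, k = 2 and the seed are assumptions for illustration). It relies on posterior() and topics() from topicmodels.

```r
library(tm)
library(topicmodels)

# Four toy documents built around two themes
docs <- c("jobs economy jobs business workers",
          "school education teachers students school",
          "business economy workers jobs companies",
          "students education school teachers college")
dtm_toy <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Fit LDA with 2 topics; the seed makes the run reproducible
fit <- LDA(dtm_toy, k = 2, control = list(seed = 123))

# Per-document topic proportions (each row sums to 1)
doc_topics <- posterior(fit)$topics
print(round(doc_topics, 2))

# Most likely topic for each document
main_topic <- topics(fit)
print(main_topic)
```

With so little text the topics themselves are not meaningful; the point is only to show where the document-level quantities can be read.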

2.3 Sentiment analysis.

Sentiment analysis (sometimes called opinion mining) is the part of text mining that tries to identify the opinions, sentiments and attitudes present in a text or a set of texts.
It is widely used in marketing, for example to analyse comments from internet users, comparisons and reviews written by bloggers, or social networks: a large part of the literature on the subject deals with tweets. It can also be used to gauge public opinion on a topic, to characterise social relations in forums, or to check whether Wikipedia is a neutral medium.
Installing the packages

> require(devtools)
> install_github("okugami79/sentiment140")

Number of text segments classified as negative, neutral and positive.

> library(sentiment)
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson
Loading required package: plyr

Attaching package: 'plyr'
The following object is masked from 'package:graph':

    join
> sentiments <- sentiment(txt)
> table(sentiments$polarity) 

negative  neutral positive 
      12       80       21
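
To make the notion of polarity concrete, here is a deliberately simple lexicon-based scorer in base R. It is not how the sentiment() function above works; the word lists, the function name score_polarity and the example sentences are all hypothetical and serve only as an illustration.

```r
# Hypothetical mini-lexicons (illustration only)
positive_words <- c("good", "great", "progress", "hope")
negative_words <- c("bad", "crisis", "fear", "loss")

# Classify a text by counting positive and negative words
score_polarity <- function(text) {
  cleaned <- tolower(gsub("[^[:alpha:] ]", " ", text))
  words <- strsplit(cleaned, " +")[[1]]
  s <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (s > 0) "positive" else if (s < 0) "negative" else "neutral"
}

texts <- c("We made great progress this year.",
           "The crisis brought fear and loss.",
           "Congress meets on Tuesday.")
polarity <- vapply(texts, score_polarity, character(1), USE.NAMES = FALSE)

# Same kind of summary table as above
print(table(factor(polarity, levels = c("negative", "neutral", "positive"))))
```

Real sentiment classifiers are usually trained models rather than word counts, but the output has the same shape: one label per text, then a frequency table.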

2.4 Comparison between two speeches

We now consider a second speech by Obama, the 2012 State of the Union address. This text was downloaded from the following link: http://www.foxnews.com/politics/2012/01/24/transcript-obamas-2012-state-union/.

We therefore repeat the same work with the other speech, after putting the two speeches into the same data.frame.

> source('/Users/dhafermalouche/Documents/Teaching/CoursDataMining_1516/WordCloud/ObamaSpeechs.R')
> 
> ds <- DataframeSource(tmpText)
> head(ds)
$encoding
[1] ""

$length
[1] 2
> inspect(VCorpus(ds))
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 46076

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 42559
> corp = Corpus(ds)
> corp = tm_map(corp,removePunctuation)
> corp = tm_map(corp, content_transformer(tolower))
> corp = tm_map(corp,removeNumbers)
> corp = tm_map(corp, function(x){removeWords(x,stopwords())})
> corp = tm_map(corp,function(x){removeWords(x,"applause")})
> term.matrix <- TermDocumentMatrix(corp)
> term.matrix <- as.matrix(term.matrix)
> colnames(term.matrix) <- c("SOTU 2011","SOTU 2012")
> 
> comparison.cloud(term.matrix,max.words=300,random.order=FALSE,colors=c("#1F497D","#C0504D"),main="Differences between 2011 and 2012")
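
As far as the wordcloud documentation describes it, a comparison cloud sizes each word by how much its rate in one document exceeds its average rate across documents, and places it on the side of the document where that excess is largest. The sketch below computes these quantities in base R on a made-up matrix (the terms and counts are assumptions, not values from the speeches).

```r
# Toy term matrix: rows are terms, columns are the two speeches
toy.matrix <- rbind(jobs   = c(10, 20),
                    energy = c(15, 5),
                    taxes  = c(5, 5))
colnames(toy.matrix) <- c("SOTU 2011", "SOTU 2012")

# Rate of each term within each speech
rates <- sweep(toy.matrix, 2, colSums(toy.matrix), "/")

# Deviation of each rate from the term's average rate across speeches
dev <- rates - rowMeans(rates)

# Speech in which each term is most over-represented, and by how much
dominant <- colnames(dev)[apply(dev, 1, which.max)]
names(dominant) <- rownames(dev)
max_dev <- apply(dev, 1, max)

print(data.frame(dominant = dominant, max_dev = round(max_dev, 3)))
```

A term used at the same rate in both speeches, such as "taxes" here, has a deviation of zero and would carry no weight in the cloud.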

3 Extracting data from the Facebook network.

3.1 Creating the application on Facebook.

To extract data from Facebook in R, you first need to create an application on Facebook. To do so, go to https://developers.facebook.com

Click on “Apps” and choose “Add a New App”. In the next window choose “Website” and give your application a name.

After clicking on “Create a New App ID”, choose a category for your app in the new window.
You can click on “Skip Quick Start” and go straight to the configuration of your application.

The application is then ready to be used.

3.2 Implementation in R

First we install the required packages.

> install.packages("devtools")
> library(devtools)
> install_github("pablobarbera/Rfacebook", subdir="Rfacebook")

We then pass the application's parameters to the fbOAuth command and store the resulting token in fb_oauth.

> library("Rfacebook")
> library(Rook)
> fb_oauth <- fbOAuth(app_id="#########", app_secret="################",extended_permissions = TRUE) ### These values have been hidden because they are personal information.
>

After pressing Return, an authentication page opens in the browser.

We write an R function that extracts the posts of a page, together with their like counts, from one date to another.

> ExtractData=function(page,dates){
+   n <- length(dates)-1
+   df <- list()
+   for (i in 1:n){
+     cat(as.character(dates[i]), " ")
+     try(df[[i]] <- getPage(page, token=fb_oauth,n=100, since=dates[i], until=dates[i+1]))
+     cat("\n")
+   }
+   df <- do.call(rbind, df)
+   return(df)
+ }
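
The function above queries the page one date window at a time and skips any window whose request fails. The sketch below shows the same pattern in base R without calling Facebook: fetch_stub is a hypothetical stand-in for getPage that returns a fixed row and fails on purpose for one day.

```r
# Hypothetical stand-in for getPage(): one row per window, with a simulated failure
fetch_stub <- function(since, until) {
  if (format(since, "%d") == "03") stop("simulated API error")
  data.frame(since = since, until = until, likes_count = 10L)
}

dates <- seq(as.Date("2015-09-01"), as.Date("2015-09-05"), by = "days")
n <- length(dates) - 1            # number of [since, until] windows
chunks <- vector("list", n)

for (i in seq_len(n)) {
  res <- try(fetch_stub(dates[i], dates[i + 1]), silent = TRUE)
  # Keep the result only when the call succeeded
  if (!inherits(res, "try-error")) chunks[[i]] <- res
}

# Failed windows stay NULL and are ignored when the pieces are stacked
posts <- do.call(rbind, chunks)
print(posts)
```

Wrapping each call in try() means a single failed request costs one window rather than the whole extraction.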

Extracting the data from Trump's page

> page="DonaldTrump"
> dates <- seq(as.Date("2015/09/01"), as.Date("2016/8/15"), by="days")
> trump=ExtractData(page,dates)
> 
> head(trump)

And from Hillary's page

> page="hillaryclinton"
> hillary=ExtractData(page,dates)
> head(hillary)
>

We compare the number of likes and comments of the two candidates. First we write a function that builds the time series.

> TransData=function(data){
+ x=melt(data[,c(4,8,9,10)],id.vars = "created_time")
+ library(plyr)
+ x$variable=mapvalues(x$variable,from = unique(x$variable),to=c("likes","comments","shares"))
+ df=x
+ x=strptime(df$created_time, "%Y-%m-%dT%H:%M:%S")
+ df$created_time=x
+ df$created_time<- as.Date(df$created_time) 
+ Csums=unlist(tapply(df$value,df$variable,cumsum))
+ df$Cumsum=Csums
+ return(df)
+ }
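
In the function above, unlist(tapply(..., cumsum)) yields the cumulative sums grouped by variable, so it matches the rows only because melt() returns them already grouped. A possible alternative, sketched here on made-up values, is ave(), which returns the result in the original row order whatever the ordering.

```r
# Toy long-format data with interleaved groups
toy <- data.frame(variable = c("likes", "shares", "likes", "shares"),
                  value    = c(1, 10, 2, 20))

# Cumulative sum within each group, returned in the original row order
toy$Cumsum <- ave(toy$value, toy$variable, FUN = cumsum)
print(toy)
```

Here the likes accumulate as 1 then 3 and the shares as 10 then 30, each placed on its own row.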

We build a single data set.

> library(reshape2)
> df_trump=TransData(trump)
> df_hillary=TransData(hillary)
> 
> df=rbind.data.frame(df_trump[df_trump$variable=="likes",],df_hillary[df_hillary$variable=="likes",])
> df$candidate=c(rep("Trump",sum((df_trump$variable=="likes"))),
+                rep("Hillary",sum((df_hillary$variable=="likes"))))

A plot comparing the cumulative number of likes.

> library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:Rgraphviz':

    style
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:graphics':

    layout
> 
> f <- list(
+   family = "Courier New, monospace",
+   size = 14,
+   color = "#7f7f7f"
+ )
> x <- list(
+   title = "",
+   titlefont = f
+ )
> y <- list(
+   title = "Cumulative number of likes",
+   titlefont = f
+ )
> 
> plot_ly(data = df, color = ~candidate, x = ~created_time, y = ~Cumsum) %>%
+   layout(xaxis = x, yaxis = y)

3.3 Sentiment analysis of the messages

> sentiments_hl <- sentiment(hillary$message)
> sentiments_tr <- sentiment(trump$message)

> table(sentiments_hl$polarity)

negative  neutral positive 
     196     2140      321 
> table(sentiments_tr$polarity)

negative  neutral positive 
     195     1595      547

How the sentiment scores change over time.

> sentiments_hl$date=hillary$created_time
> sentiments_tr$date=trump$created_time
> 
> x=strptime(sentiments_hl$date, "%Y-%m-%d")
> library(chron)
> y=months(x)
> sentiments_hl$Month=y
> 
> x=strptime(sentiments_tr$date, "%Y-%m-%d")
> y=months(x)
> sentiments_tr$Month=y
> 
> ## Build a score
> sentiments_hl$score <- 0
> sentiments_hl$score[sentiments_hl$polarity == "positive"] <- 1
> sentiments_hl$score[sentiments_hl$polarity == "negative"] <- -1 
> 
> r_hl<-aggregate(score ~ Month, data = sentiments_hl, sum) ## Score per month
> 
> sentiments_tr$score <- 0
> sentiments_tr$score[sentiments_tr$polarity == "positive"] <- 1
> sentiments_tr$score[sentiments_tr$polarity == "negative"] <- -1 
> 
> r_tr<-aggregate(score ~ Month, data = sentiments_tr, sum) ## Score per month
> 
> rr=rbind.data.frame(r_hl,r_tr)
> rr$candidate=c(rep("hillary",nrow(r_hl)),rep("trump",nrow(r_tr)))
> rr$Month=factor(rr$Month,levels=unique(rr$Month)[c(5,4,8,1,9,7,6,2,12,11,10,3)])
> p<-ggplot(rr,aes(x=Month,y=score,col=candidate,fill=candidate))+geom_bar(stat="identity",position="dodge")
> p+coord_flip()
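
The month names above have to be reordered by hand because they sort alphabetically. One possible alternative, sketched here in base R on made-up posts, is to aggregate on a "year-month" key, which sorts chronologically on its own. The dates and polarities are assumptions for illustration.

```r
# Toy posts with a date and a polarity label
posts <- data.frame(
  date = as.Date(c("2015-09-03", "2015-09-20", "2015-12-01",
                   "2016-01-15", "2016-01-16")),
  polarity = c("positive", "negative", "positive", "positive", "neutral")
)

# Map each polarity to a score of -1, 0 or 1
score_map <- c(negative = -1, neutral = 0, positive = 1)
posts$score <- unname(score_map[as.character(posts$polarity)])

# "YYYY-MM" keys sort in chronological order
posts$month <- format(posts$date, "%Y-%m")

# Net score per month
monthly <- aggregate(score ~ month, data = posts, sum)
print(monthly)
```

This key also keeps the same month of two different years apart, which plain month names would merge.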

4 Case study: analysis of D. Trump's tweets.

Text Mining, Extraction and Analysis of Facebook Data

Dhafer Malouche

3rd year ESSAI, 2016-2017
