Objective: analyse a political speech, then compare the speeches of two candidates in the US presidential election. The aim is to measure how often various terms are used and to see whether those frequencies differ between the two candidates.
Methodology: we show how to use R to draw a plot called a word cloud, which is a representation of the words spoken in a speech or contained in the text under study. The size of each word in the cloud is proportional to its frequency in the text.
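To make the link between frequency and size concrete, here is a minimal base-R sketch. The text and the maximum size of 4 are hypothetical values chosen for illustration, and the linear scaling is an assumption: the wordcloud package applies its own scaling rule, which may differ.

```r
# Hypothetical text, used only for illustration
text <- "jobs jobs jobs economy economy people"

# Split into words and count how often each one occurs
words <- strsplit(text, " ", fixed = TRUE)[[1]]
freq <- sort(table(words), decreasing = TRUE)

# Scale each word's display size linearly with its frequency;
# the most frequent word gets the (arbitrary) maximum size 4
size <- 4 * as.numeric(freq) / max(freq)
names(size) <- names(freq)

print(freq)
print(size)
```

A word that occurs three times as often as another is therefore drawn three times larger in this simplified scheme.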
R and creating the corpus. We use the packages tm (for text mining) and wordcloud.
> library(tm)
Loading required package: NLP
> library(wordcloud)
Loading required package: RColorBrewer
> txt=readLines("/Users/dhafermalouche/Documents/Teaching/CoursDataMining_1516/WordCloud/ObamaSpeech2011.txt")
> txt[1:4]
[1] "Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans:"
[2] ""
[3] " Tonight I want to begin by congratulating the men and women of the 112th Congress, as well as your new Speaker, John Boehner. (Applause.) And as we mark this occasion, we’re also mindful of the empty chair in this chamber, and we pray for the health of our colleague -- and our friend -– Gabby Giffords. (Applause.)"
[4] ""
These data contain Obama's 2011 State of the Union address. The State of the Union address is an annual event in which the President of the United States presents his programme for the current year. The speech is delivered in Washington at the Capitol, where the two chambers (the House of Representatives and the Senate) sit together.
For the record, George Washington delivered the first State of the Union address on 8 January 1790 in New York City, which was the capital at the time.
> txt <- removePunctuation(txt)
> txt[1:5]
[1] "Mr Speaker Mr Vice President members of Congress distinguished guests and fellow Americans"
[2] ""
[3] " Tonight I want to begin by congratulating the men and women of the 112th Congress as well as your new Speaker John Boehner Applause And as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague and our friend Gabby Giffords Applause"
[4] ""
[5] " Its no secret that those of us here tonight have had our differences over the last two years The debates have been contentious we have fought fiercely for our beliefs And thats a good thing Thats what a robust democracy demands Thats what helps set us apart as a nation"
> txt <- removeNumbers(txt)
> txt[1:10]
[1] "Mr Speaker Mr Vice President members of Congress distinguished guests and fellow Americans"
[2] ""
[3] " Tonight I want to begin by congratulating the men and women of the th Congress as well as your new Speaker John Boehner Applause And as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague and our friend Gabby Giffords Applause"
[4] ""
[5] " Its no secret that those of us here tonight have had our differences over the last two years The debates have been contentious we have fought fiercely for our beliefs And thats a good thing Thats what a robust democracy demands Thats what helps set us apart as a nation"
[6] ""
[7] " But theres a reason the tragedy in Tucson gave us pause Amid all the noise and passion and rancor of our public debate Tucson reminded us that no matter who we are or where we come from each of us is a part of something greater something more consequential than party or political preference"
[8] ""
[9] " We are part of the American family We believe that in a country where every race and faith and point of view can be found we are still bound together as one people that we share common hopes and a common creed that the dreams of a little girl in Tucson are not so different than those of our own children and that they all deserve the chance to be fulfilled"
[10] ""
> txt <- txt[txt != ""]  # drop the empty lines
> txt[1:10]
[1] "Mr Speaker Mr Vice President members of Congress distinguished guests and fellow Americans"
[2] " Tonight I want to begin by congratulating the men and women of the th Congress as well as your new Speaker John Boehner Applause And as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague and our friend Gabby Giffords Applause"
[3] " Its no secret that those of us here tonight have had our differences over the last two years The debates have been contentious we have fought fiercely for our beliefs And thats a good thing Thats what a robust democracy demands Thats what helps set us apart as a nation"
[4] " But theres a reason the tragedy in Tucson gave us pause Amid all the noise and passion and rancor of our public debate Tucson reminded us that no matter who we are or where we come from each of us is a part of something greater something more consequential than party or political preference"
[5] " We are part of the American family We believe that in a country where every race and faith and point of view can be found we are still bound together as one people that we share common hopes and a common creed that the dreams of a little girl in Tucson are not so different than those of our own children and that they all deserve the chance to be fulfilled"
[6] " That too is what sets us apart as a nation Applause"
[7] " Now by itself this simple recognition wont usher in a new era of cooperation What comes of this moment is up to us What comes of this moment will be determined not by whether we can sit together tonight but whether we can work together tomorrow Applause"
[8] " I believe we can And I believe we must Thats what the people who sent us here expect of us With their votes theyve determined that governing will now be a shared responsibility between parties New laws will only pass with support from Democrats and Republicans We will move forward together or not at all for the challenges we face are bigger than party and bigger than politics"
[9] " At stake right now is not who wins the next election after all we just had an election At stake is whether new jobs and industries take root in this country or somewhere else Its whether the hard work and industry of our people is rewarded Its whether we sustain the leadership that has made America not just a place on a map but the light to the world"
[10] " We are poised for progress Two years after the worst recession most of us have ever known the stock market has come roaring back Corporate profits are up The economy is growing again"
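A note on dropping empty lines: negative indexing through which() is fragile, because it returns an empty vector when nothing matches, whereas a logical index is safe in that case. The small sketch below illustrates this on a hypothetical vector that contains no empty string.

```r
# Hypothetical vector with no empty strings
x <- c("first paragraph", "second paragraph")

# which() returns integer(0) here, and x[-integer(0)] selects nothing
bad <- x[-which(x == "")]

# A logical index keeps every element that is not empty
good <- x[x != ""]

length(bad)
length(good)
```

With the speech used here the two forms give the same result, since the text does contain empty lines.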
> txt <- tolower(txt)
> txt[1:10]
[1] "mr speaker mr vice president members of congress distinguished guests and fellow americans"
[2] " tonight i want to begin by congratulating the men and women of the th congress as well as your new speaker john boehner applause and as we mark this occasion were also mindful of the empty chair in this chamber and we pray for the health of our colleague and our friend gabby giffords applause"
[3] " its no secret that those of us here tonight have had our differences over the last two years the debates have been contentious we have fought fiercely for our beliefs and thats a good thing thats what a robust democracy demands thats what helps set us apart as a nation"
[4] " but theres a reason the tragedy in tucson gave us pause amid all the noise and passion and rancor of our public debate tucson reminded us that no matter who we are or where we come from each of us is a part of something greater something more consequential than party or political preference"
[5] " we are part of the american family we believe that in a country where every race and faith and point of view can be found we are still bound together as one people that we share common hopes and a common creed that the dreams of a little girl in tucson are not so different than those of our own children and that they all deserve the chance to be fulfilled"
[6] " that too is what sets us apart as a nation applause"
[7] " now by itself this simple recognition wont usher in a new era of cooperation what comes of this moment is up to us what comes of this moment will be determined not by whether we can sit together tonight but whether we can work together tomorrow applause"
[8] " i believe we can and i believe we must thats what the people who sent us here expect of us with their votes theyve determined that governing will now be a shared responsibility between parties new laws will only pass with support from democrats and republicans we will move forward together or not at all for the challenges we face are bigger than party and bigger than politics"
[9] " at stake right now is not who wins the next election after all we just had an election at stake is whether new jobs and industries take root in this country or somewhere else its whether the hard work and industry of our people is rewarded its whether we sustain the leadership that has made america not just a place on a map but the light to the world"
[10] " we are poised for progress two years after the worst recession most of us have ever known the stock market has come roaring back corporate profits are up the economy is growing again"
> txt <- removeWords(txt,stopwords("en"))
> txt[1:10]
[1] "mr speaker mr vice president members congress distinguished guests fellow americans"
[2] " tonight want begin congratulating men women th congress well new speaker john boehner applause mark occasion also mindful empty chair chamber pray health colleague friend gabby giffords applause"
[3] " secret us tonight differences last two years debates contentious fought fiercely beliefs thats good thing thats robust democracy demands thats helps set us apart nation"
[4] " theres reason tragedy tucson gave us pause amid noise passion rancor public debate tucson reminded us matter come us part something greater something consequential party political preference"
[5] " part american family believe country every race faith point view can found still bound together one people share common hopes common creed dreams little girl tucson different children deserve chance fulfilled"
[6] " sets us apart nation applause"
[7] " now simple recognition wont usher new era cooperation comes moment us comes moment will determined whether can sit together tonight whether can work together tomorrow applause"
[8] " believe can believe must thats people sent us expect us votes theyve determined governing will now shared responsibility parties new laws will pass support democrats republicans will move forward together challenges face bigger party bigger politics"
[9] " stake right now wins next election just election stake whether new jobs industries take root country somewhere else whether hard work industry people rewarded whether sustain leadership made america just place map light world"
[10] " poised progress two years worst recession us ever known stock market come roaring back corporate profits economy growing "
We also remove the words mr and us.
> txt <- removeWords(txt,c("mr","us"))
> txt[1:10]
[1] " speaker vice president members congress distinguished guests fellow americans"
[2] " tonight want begin congratulating men women th congress well new speaker john boehner applause mark occasion also mindful empty chair chamber pray health colleague friend gabby giffords applause"
[3] " secret tonight differences last two years debates contentious fought fiercely beliefs thats good thing thats robust democracy demands thats helps set apart nation"
[4] " theres reason tragedy tucson gave pause amid noise passion rancor public debate tucson reminded matter come part something greater something consequential party political preference"
[5] " part american family believe country every race faith point view can found still bound together one people share common hopes common creed dreams little girl tucson different children deserve chance fulfilled"
[6] " sets apart nation applause"
[7] " now simple recognition wont usher new era cooperation comes moment comes moment will determined whether can sit together tonight whether can work together tomorrow applause"
[8] " believe can believe must thats people sent expect votes theyve determined governing will now shared responsibility parties new laws will pass support democrats republicans will move forward together challenges face bigger party bigger politics"
[9] " stake right now wins next election just election stake whether new jobs industries take root country somewhere else whether hard work industry people rewarded whether sustain leadership made america just place map light world"
[10] " poised progress two years worst recession ever known stock market come roaring back corporate profits economy growing "
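The output above still contains leading and doubled spaces where words were removed. The following base-R sketch shows the general idea of removing whole words and then tidying the whitespace. It is an approximation written with gsub(), not the code of removeWords() itself, and the input lines are hypothetical; tm also offers stripWhitespace() for the second step.

```r
# Hypothetical lines, used only for illustration
x <- c("mr speaker thank you", "join us tonight", "we must act")

# \\b marks a word boundary, so "us" inside "must" is left untouched
removed <- gsub("\\b(mr|us)\\b", "", x, perl = TRUE)

# Collapse runs of whitespace and trim both ends
cleaned <- trimws(gsub("\\s+", " ", removed))

print(cleaned)
```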
We convert txt to the Corpus format so that it can be analysed.
> corpus <- Corpus(VectorSource(txt))
> tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf)))
> tdm
<<TermDocumentMatrix (terms: 1559, documents: 113)>>
Non-/sparse entries: 3376/172791
Sparsity : 98%
Maximal term length: 16
Weighting : term frequency (tf)
> dim(tdm)
[1] 1559 113
We can conclude that the text contains 1559 distinct words (terms) and 113 paragraphs.
Each row of the matrix tdm corresponds to a word and each column corresponds to a paragraph.
> sum((tdm==0))
[1] 172791
> sum((tdm!=0))
[1] 3376
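The sparsity reported by tm can be checked by hand from the figures above: it is the share of zero cells among all cells of the matrix. The variable names in this short sketch are chosen for illustration only.

```r
# Figures taken from the output above
n_terms <- 1559
n_docs <- 113
nonzero <- 3376
zero <- 172791

# Total number of cells in the term-document matrix
total <- n_terms * n_docs

# Share of cells equal to zero
sparsity <- zero / total

total
round(100 * sparsity)
```

The zero and non-zero counts add up to the total number of cells, and the ratio rounds to the 98% shown in the summary.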
The most frequent words in the text
> m <- as.matrix(tdm)
> freqWords <- rowSums(m)
> freqWords <- sort(freqWords, decreasing = TRUE)
> t(freqWords[1:10])
     applause will can new people jobs now years thats make
[1,]       80   58  37  36     31   25  25    25    24   23
We decide to remove the word applause from the corpus
> i=grep('applause',rownames(m))
> m=m[-i,]
Let us look up the word economy in the text and its frequency
> i=grep('economy',rownames(m))
> sum(m[i,])
[1] 7
And the word security?
> i=grep('security',rownames(m))
> sum(m[i,])
[1] 3
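Note that grep() matches substrings, so the counts above may include other terms that merely contain the pattern. The sketch below contrasts substring and exact matching on a hypothetical vector of terms; whether an exact match is preferable depends on what one wants to count.

```r
# Hypothetical term names, used only for illustration
terms <- c("economy", "economys", "security", "insecurity")

# Substring match: every term containing "economy"
sub_idx <- grep("economy", terms)

# Exact match: only the term equal to "economy"
exact_idx <- which(terms == "economy")

# Anchored pattern: another way to ask for an exact match
anchored_idx <- grep("^security$", terms)

terms[sub_idx]
terms[exact_idx]
terms[anchored_idx]
```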
> freqWords=rowSums(m)
> v <- sort(freqWords, decreasing = TRUE)
> dt=data.frame(word=names(v),freq=v)
> head(dt)
         word freq
will     will   58
can       can   37
new       new   36
people people   31
jobs     jobs   25
now       now   25
> par(bg="gray")
> wordcloud(dt$word, dt$freq, min.freq = 5, random.order = FALSE)
We now consider a second Obama speech, the 2012 State of the Union address. This text was downloaded from the following link: http://www.foxnews.com/politics/2012/01/24/transcript-obamas-2012-state-union/.
We then repeat the same work with the other speech, after putting the two speeches into the same data.frame
> source('/Users/dhafermalouche/Documents/Teaching/CoursDataMining_1516/WordCloud/ObamaSpeechs.R')
>
> ds <- DataframeSource(tmpText)
> inspect(VCorpus(ds))
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 46076
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 42559
> corp = Corpus(ds)
> corp = tm_map(corp,removePunctuation)
> corp = tm_map(corp, content_transformer(tolower))
> corp = tm_map(corp,removeNumbers)
> corp = tm_map(corp, removeWords, stopwords())
> corp = tm_map(corp, removeWords, "applause")
> term.matrix <- TermDocumentMatrix(corp)
> term.matrix <- as.matrix(term.matrix)
> colnames(term.matrix) <- c("SOTU 2011","SOTU 2012")
>
> comparison.cloud(term.matrix,max.words=300,random.order=FALSE,colors=c("#1F497D","#C0504D"),main="Differences between 2011 and 2012")
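To see what such a comparison is based on, here is a base-R sketch of the idea described in the wordcloud documentation: each term's rate in a speech is compared with its average rate across speeches, and the term is attached to the speech where it is most over-represented. The counts are hypothetical, and this is a simplified illustration, not the package's implementation.

```r
# Hypothetical term counts (rows: terms, columns: speeches)
counts <- matrix(c(10, 2,
                    3, 9,
                    5, 5),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("jobs", "tax", "people"),
                                 c("SOTU 2011", "SOTU 2012")))

# Rate of each term within each speech (each column sums to 1)
rates <- sweep(counts, 2, colSums(counts), "/")

# Deviation of each rate from the term's average rate across speeches
deviation <- rates - rowMeans(rates)

# Speech in which each term is most over-represented
dominant <- colnames(rates)[apply(deviation, 1, which.max)]
names(dominant) <- rownames(rates)

print(round(deviation, 3))
print(dominant)
```

Working with rates rather than raw counts keeps the comparison fair when the two speeches differ in length.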
Try repeating the same work on other kinds of political speeches.
Think of a way to represent the word associations within each speech.
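One possible starting point for the last question is to measure how strongly two words tend to occur in the same paragraphs, using the correlation between their rows in a term-document matrix. The matrix below is hypothetical and the sketch uses base R only; tm also provides findAssocs() for this kind of query.

```r
# Hypothetical term counts (rows: terms, columns: paragraphs)
m <- matrix(c(2, 0, 1, 0,
              1, 0, 1, 0,
              0, 3, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("jobs", "economy", "security"),
                            paste0("p", 1:4)))

# Correlation between terms across paragraphs:
# values near 1 mean the two terms tend to appear together
assoc <- cor(t(m))

round(assoc["jobs", ], 2)
```

The resulting correlation matrix could then be drawn as a heat map or as a graph whose edges link strongly associated words.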