Part1 - US Presidential Speeches
mycorpus1 <- corpus_subset(data_corpus_inaugural,Year > 1856 & Year < 1870)
myDfm1 <- dfm(mycorpus1 , remove = stopwords("english"), tolower = TRUE, stem = TRUE,
remove_punct = TRUE, remove_numbers=TRUE)
Construct the frequency graph
americanFreq1 <- data.frame(
list(
document = rownames(myDfm1 [, 'constitut']),
frequency1 = unname(as.matrix(myDfm1 [, 'constitut'])),
frequency2 = unname(as.matrix(myDfm1 [, 'slave'])),
frequency3 = unname(as.matrix(myDfm1 [, 'state'])),
frequency4 = unname(as.matrix(myDfm1 [, 'govern']))
))
ggplot(americanFreq1) +
geom_point(aes(x=document,y=frequency1,colour = "constitut*")) +
geom_point(aes(x=document,y=frequency2,colour = "slave")) +
geom_point(aes(x=document,y=frequency3,colour = "state")) +
geom_point(aes(x=document,y=frequency4,colour = "govern")) +
theme(axis.text.x = element_text(angle=90, hjust=1))

Based on the topfeatures of the document-term matrix, I chose to graph the frequency of [constitut*], [state], [govern]. On the other hand, Based on the history of USA in that time period, I also graphed the frequency of the word - slave.
As the graph above, the frequency of the term [slave] was high especially in Lincoln’s speech in 1861. However, the word [constitut*] was in fact the most frequent term among those words I chose in Lincoln’s speech in 1861. Since the length of speech was different, I constructed the relative frequency of those words as well.
Construct the relative frequency graph
relDfm1 <- dfm_weight(myDfm1 , scheme = c("prop"))*100
relamericanFreq1 <- data.frame(
list(
document = rownames(relDfm1 [, 'constitut']),
frequency1 = unname(as.matrix(relDfm1 [, 'constitut'])),
frequency2 = unname(as.matrix(relDfm1 [, 'slave'])),
frequency3 = unname(as.matrix(relDfm1 [, 'state'])),
frequency4 = unname(as.matrix(relDfm1 [, 'govern']))
))
ggplot(relamericanFreq1) +
geom_point(aes(x=document,y=frequency1,colour = "constitut*")) +
geom_point(aes(x=document,y=frequency2,colour = "slave")) +
geom_point(aes(x=document,y=frequency3,colour = "state")) +
geom_point(aes(x=document,y=frequency4,colour = "govern")) +
theme(axis.text.x = element_text(angle=90, hjust=1))

The relative frequency graph is slightly different with the previous graph. We could see now the term [slave] was highly used in the speech in 1865, which was during the wartime. The word [govern] was the only consistently used word in those speeches.
This graph also showed that the word choices were influenced by the length of speech. The most used words of all corpus were not so frequently used in two short speeches (1865 and 1869).
Construct the wordcloud
textplot_wordcloud(myDfm1 , min_termfreq = 30,
colors=brewer.pal(10,"Set1"),
title.size=1,
comparison = TRUE)

After the graph of frequency, I also tried to graph the wordcloud of those speeches. The words in red were based on the Buchnan’s speech in 1857, which is the title next to those red words on the graph, the following 3 speeches were graphed in blue, green, purple in order.
In this graph, we can see the word [war] was highly used in the 1865 Lincoln’s speech.
Part2 (1) - Irish Party Manifestoes
setwd("~/Desktop/r/polimetrics/IE")
myText2 <- readtext("~/Desktop/r/polimetrics/IE/*.txt",
docvarsfrom = "filenames", dvsep = " ", docvarnames = c("Year", "Party"))
testCorpus2 <- corpus(myText2)
testCorpus2 <- corpus(myText2, docid_field = "doc_id")
docnames(testCorpus2) <- gsub(".txt", "", docnames(testCorpus2 ))
myDfm2 <- dfm(testCorpus2 , remove = stopwords("english"), tolower = TRUE, stem = TRUE,
remove_punct = TRUE, remove_numbers=TRUE)
ws2 <- textmodel_wordscores(myDfm2, c(4.5,13.13,15,6.88,17.63,rep(NA,5)))
Predict
pr_raw2 <- predict(ws2, newdata = myDfm2[c(6,7,8,9,10), ])
pr_lbg2_2 <- predict(ws2, rescaling = "lbg", newdata = myDfm2[c(6,7,8,9,10), ], interval = "confidence")
Graph of raw score
df1 <- as.data.frame(pr_raw2)
colnames(df1) <- c("V1")
df1$party <- myText2$Party[1:5]
df1$title <- rownames(df1)
df1$order <- c(1:5)
df1$title <- with(df1, reorder(title,order))
ggplot(df1,aes(title,V1,color = party)) +
geom_point() +
xlab("Title")+
ylab("Score")

Graph of LBG with reference points
df2 <- as.data.frame(pr_lbg2_2$fit)
colnames(df2) <- c("V1","V2","V3")
df2$g <- "pr"
df2$party <- myText2$Party[1:5]
df2_ref <- data.frame(
value1 = c(4.5,13.13,15,6.88,17.63),
value2 = c(4.5,13.13,15,6.88,17.63),
value3 = c(4.5,13.13,15,6.88,17.63)
)
rownames(df2_ref) <- docnames(testCorpus2)[1:5]
colnames(df2_ref) <- c("V1","V2","V3")
df2_ref$g <- "ref"
df2_ref$party <- myText2$Party[1:5]
df2_new <- data.frame(V1=NA,V2=NA,V3=NA,g=NA,party=NA)
for(i in 1:5){
df2_new <- as.data.frame(rbind(df2_new,df2_ref[i,],df2[i,]))
}
df2_new <- as.data.frame(df2_new[2:11,])
df2_new$title <- rownames(df2_new)
df2_new$order <- c(1:10)
df2_new$title <- with(df2_new, reorder(title,order))
ggplot(df2_new,aes(title,V1,color = party)) +
geom_point() +
geom_errorbar(aes(ymin=V2,ymax=V3)) +
xlab("Title")+
ylab("Score")

Comparing the raw scores and the transformed scores, the LBG scores were more convincing, since the reference scores were skewed and the variation was high. In the transformed graph, the shifts on social position of FF party and DL party were revealed.
PS: After the transformation of raw scores, we could see that their positions on the social scale changed.According to the Wikipedia, between the two elections in 1992 and 1997, the coalitions changed. What I found out was the coalitions did not quite depend on the social position. The importance was the majority of coalition.
Part2 (2) - Reagan v.s. Trump
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980),
remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
Original
ws_pres <- textmodel_wordscores(presDfm, c(1, NA, NA, 0, NA, NA, NA, -1, NA, NA))
pr_all_pres <- predict(ws_pres)
New
ws_pres2 <- textmodel_wordscores(presDfm, c(NA, NA, NA, 0, NA, NA, NA, -1, NA, 1))
pr_all_pres2 <- predict(ws_pres2)
Graph the prediction
pr_all <- as.data.frame(cbind(pr_all_pres,pr_all_pres2))
pr_all$title <- rownames(pr_all)
pr_all$order <- c(10:1)
pr_all$title <- with(pr_all,reorder(title,order))
ggplot(pr_all,aes(y=title,x=pr_all_pres)) +
geom_point(aes(pr_all_pres,color="Reagan as 1")) +
geom_point(aes(pr_all_pres2,color="Trump as 1")) +
xlab("Score") +
ylab("Title")

As the graph above, the position of 1981-Reagan and 2017-Trump changed due to the change of reference text undoubtedly. However, we could also find out the speech scores of all the other presidents were a little bit higher when we assumed 1981-Reagan as 1 in reference scores. The implication is that the context of Trump’s speech was more different from other’s speeches according to the scores.
---
title: "Home Assignment 1"
output: html_notebook
---
###Name: Yen Cheng Hsuan
###Student ID: 1A182901-2
#Package+Setup
```{r, message=FALSE, warning=FALSE}
rm(list=ls(all=TRUE))
setwd("~/Desktop/R/polimetrics")

library(RColorBrewer)
library(readtext)
library(quanteda)
library(ggplot2)
```
#Part1 - US Presidential Speeches
```{r, warning=FALSE}
mycorpus1 <- corpus_subset(data_corpus_inaugural,Year > 1856 & Year < 1870)

myDfm1 <- dfm(mycorpus1 , remove = stopwords("english"), tolower = TRUE, stem = TRUE,
             remove_punct = TRUE, remove_numbers=TRUE)


```

###Construct the frequency graph
```{r, warning=FALSE}
americanFreq1 <- data.frame(
  list(
    document = rownames(myDfm1 [, 'constitut']),
    frequency1 = unname(as.matrix(myDfm1 [, 'constitut'])),
    frequency2 = unname(as.matrix(myDfm1 [, 'slave'])),
    frequency3 = unname(as.matrix(myDfm1 [, 'state'])),
    frequency4 = unname(as.matrix(myDfm1 [, 'govern']))
  ))

ggplot(americanFreq1) +
  geom_point(aes(x=document,y=frequency1,colour = "constitut*")) +
  geom_point(aes(x=document,y=frequency2,colour = "slave")) +
  geom_point(aes(x=document,y=frequency3,colour = "state")) +
  geom_point(aes(x=document,y=frequency4,colour = "govern")) +
  theme(axis.text.x = element_text(angle=90, hjust=1))
```

Based on the topfeatures of the document-term matrix, I chose to graph the frequency of [constitut*], [state], [govern]. On the other hand, Based on the history of USA in that time period, I also graphed the frequency of the word - slave.

As the graph above, the frequency of the term [slave] was high especially in Lincoln's speech in 1861. However, the word [constitut*] was in fact the most frequent term among those words I chose in Lincoln's speech in 1861. Since the length of speech was different, I constructed the relative frequency of those words as well.

###Construct the relative frequency graph
```{r, warning=FALSE}
relDfm1 <- dfm_weight(myDfm1 , scheme = c("prop"))*100
relamericanFreq1 <- data.frame(
  list(
    document = rownames(relDfm1 [, 'constitut']),
    frequency1 = unname(as.matrix(relDfm1 [, 'constitut'])),
    frequency2 = unname(as.matrix(relDfm1 [, 'slave'])),
    frequency3 = unname(as.matrix(relDfm1 [, 'state'])),
    frequency4 = unname(as.matrix(relDfm1 [, 'govern']))
  ))

ggplot(relamericanFreq1) +
  geom_point(aes(x=document,y=frequency1,colour = "constitut*")) +
  geom_point(aes(x=document,y=frequency2,colour = "slave")) +
  geom_point(aes(x=document,y=frequency3,colour = "state")) +
  geom_point(aes(x=document,y=frequency4,colour = "govern")) +
  theme(axis.text.x = element_text(angle=90, hjust=1))
```

The relative frequency graph is slightly different with the previous graph. We could see now the term [slave] was highly used in the speech in 1865, which was during the wartime. The word [govern] was the only consistently used word in those speeches.

This graph also showed that the word choices were influenced by the length of speech. The most used words of all corpus were not so frequently used in two short speeches (1865 and 1869).

###Construct the wordcloud
```{r, warning=FALSE}
textplot_wordcloud(myDfm1 , min_termfreq = 30,
                   colors=brewer.pal(10,"Set1"),
                   title.size=1,
                   comparison = TRUE)
```

After the graph of frequency, I also tried to graph the wordcloud of those speeches. The words in red were based on the Buchnan's speech in 1857, which is the title next to those red words on the graph, the following 3 speeches were graphed in blue, green, purple in order.

In this graph, we can see the word [war] was highly used in the 1865 Lincoln's speech.

#Part2 (1) - Irish Party Manifestoes
```{r, message=FALSE, warning=FALSE}
setwd("~/Desktop/r/polimetrics/IE")
myText2 <- readtext("~/Desktop/r/polimetrics/IE/*.txt",
                   docvarsfrom = "filenames", dvsep = " ", docvarnames = c("Year", "Party"))
testCorpus2 <- corpus(myText2)
testCorpus2 <- corpus(myText2, docid_field = "doc_id")
docnames(testCorpus2) <- gsub(".txt", "", docnames(testCorpus2 ))

myDfm2 <- dfm(testCorpus2 , remove = stopwords("english"), tolower = TRUE, stem = TRUE,
             remove_punct = TRUE, remove_numbers=TRUE)
ws2 <- textmodel_wordscores(myDfm2, c(4.5,13.13,15,6.88,17.63,rep(NA,5)))


```
###Predict
```{r, message=FALSE, warning=FALSE}
pr_raw2 <- predict(ws2, newdata = myDfm2[c(6,7,8,9,10), ])
pr_lbg2_2 <- predict(ws2, rescaling = "lbg", newdata = myDfm2[c(6,7,8,9,10), ], interval = "confidence")
```

###Graph of raw score
```{r, warning=FALSE}

df1 <- as.data.frame(pr_raw2)
colnames(df1) <- c("V1")
df1$party <- myText2$Party[1:5]

df1$title <- rownames(df1)
df1$order <- c(1:5)
df1$title <- with(df1, reorder(title,order))

ggplot(df1,aes(title,V1,color = party)) +
  geom_point() +
  xlab("Title")+
  ylab("Score")
```
###Graph of LBG with reference points
```{r, warning=FALSE}

df2 <- as.data.frame(pr_lbg2_2$fit)
colnames(df2) <- c("V1","V2","V3")
df2$g <- "pr"
df2$party <- myText2$Party[1:5]
df2_ref <- data.frame(
  value1 = c(4.5,13.13,15,6.88,17.63),
  value2 = c(4.5,13.13,15,6.88,17.63),
  value3 = c(4.5,13.13,15,6.88,17.63)
)
rownames(df2_ref) <- docnames(testCorpus2)[1:5]
colnames(df2_ref) <- c("V1","V2","V3")
df2_ref$g <- "ref"
df2_ref$party <- myText2$Party[1:5]
df2_new <- data.frame(V1=NA,V2=NA,V3=NA,g=NA,party=NA)
for(i in 1:5){
  df2_new <- as.data.frame(rbind(df2_new,df2_ref[i,],df2[i,]))
}
df2_new <- as.data.frame(df2_new[2:11,])
df2_new$title <- rownames(df2_new)
df2_new$order <- c(1:10)
df2_new$title <- with(df2_new, reorder(title,order))

ggplot(df2_new,aes(title,V1,color = party)) +
  geom_point() +
  geom_errorbar(aes(ymin=V2,ymax=V3)) +
  xlab("Title")+
  ylab("Score")
```

Comparing the raw scores and the transformed scores, the LBG scores were more convincing, since the reference scores were skewed and the variation was high. In the transformed graph, the shifts on social position of FF party and DL party were revealed.

PS:  After the transformation of raw scores, we could see that their positions on the social scale changed.According to the Wikipedia, between the two elections in 1992 and 1997, the coalitions changed. What I found out was the coalitions did not quite depend on the social position. The importance was the majority of coalition.

#Part2 (2) - Reagan v.s. Trump
```{r}
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), 
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
```
###Original
```{r, warning=FALSE}
ws_pres <- textmodel_wordscores(presDfm, c(1, NA, NA, 0, NA, NA, NA, -1, NA, NA))
pr_all_pres <- predict(ws_pres)
```
###New
```{r, warning=FALSE}
ws_pres2 <- textmodel_wordscores(presDfm, c(NA, NA, NA, 0, NA, NA, NA, -1, NA, 1))
pr_all_pres2 <- predict(ws_pres2)
```
###Graph the prediction
```{r, warning=FALSE}
pr_all <- as.data.frame(cbind(pr_all_pres,pr_all_pres2))
pr_all$title <- rownames(pr_all)
pr_all$order <- c(10:1)
pr_all$title <- with(pr_all,reorder(title,order))
ggplot(pr_all,aes(y=title,x=pr_all_pres)) +
  geom_point(aes(pr_all_pres,color="Reagan as 1")) +
  geom_point(aes(pr_all_pres2,color="Trump as 1")) +
  xlab("Score") +
  ylab("Title")
```

As the graph above, the position of 1981-Reagan and 2017-Trump changed due to the change of reference text undoubtedly. However, we could also find out the speech scores of all the other presidents were a little bit higher when we assumed 1981-Reagan as 1 in reference scores. The implication is that the context of Trump's speech was more different from other's speeches according to the scores.

```{r, message=FALSE, warning=FALSE, include=FALSE}
setwd("~/Desktop/R/polimetrics")
```

