JMSC 6116 Lecture 1: Full-Text Analysis of the 2021 New Year Speeches by Tsai Ing-wen and Xi Jinping

This short article aims to analyze and compare the New Year speeches made by Xi Jinping and Tsai Ing-wen.

First of all, we install the required libraries if they are missing and load them into the R session.

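# Install each package if it is missing, then attach it (require() alone does not attach a freshly installed package)
# Note: from quanteda 3.0 onward, textplot_wordcloud() lives in the separate quanteda.textplots package
if (!require("quanteda")) { install.packages("quanteda"); library("quanteda") }
if (!require("plotly")) { install.packages("plotly"); library("plotly") }
if (!require("jiebaR")) { install.packages("jiebaR"); library("jiebaR") }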

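Then, let’s obtain a copy of Xi Jinping’s speech. The file has already been uploaded to my GitHub in plain-text format. The first five lines of his speech are displayed for checking. Because Chinese text has no spaces between words, the speech is then joined into one string and segmented into words with jiebaR.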

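con <- url("https://raw.githubusercontent.com/fukingwa/JMSC6116_public/master/xi2021c.txt") # Establish a connection via URL
xi <- readLines(con, encoding = "UTF-8")  # Read line by line from the connection into a character vector xi
close(con) # Remember to close the connection after use
head(xi, 5) # Display the first five lines for checking
xi <- paste(xi, collapse = " ") # Join all lines into a single string
engine1 <- worker() # Create a jiebaR segmentation engine with default settings
xi <- paste(segment(xi, engine1), collapse = " ") # Segment into words, then rejoin them with spaces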

Next, we get Tsai’s speech.

con <- url("https://raw.githubusercontent.com/fukingwa/JMSC6116_public/master/tsai2021c.txt")
tsai <- readLines(con, encoding = 'UTF-8')
close(con)
tsai <- paste(tsai,collapse=" ")
tsai <- paste(segment(tsai,engine1),collapse=" ")

Then, we “clean” the texts by removing punctuation and Chinese stopwords (very common words that carry little meaning on their own), and create an R object called dfm_xt, which stands for “Document Feature Matrix for Xi-Tsai”: the two leaders are the rows and the terms they used are the columns. See what it looks like. Each number in the matrix is the frequency of that term (column) in that document (row).

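corp_xt <- corpus(c(xi, tsai)) # Build a corpus with one document per speech
docnames(corp_xt) <- c("xi", "tsai")

toks_xt <- tokens(corp_xt, remove_punct = TRUE) # Tokenize and drop punctuation
toks_xt <- tokens_remove(toks_xt, pattern = stopwords("zh", source = "misc")) # Drop Chinese stopwords
dfm_xt <- dfm(toks_xt) # Count term frequencies per document
dfm_xt # Print the matrix to see what it looks like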
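
To make the idea of a document-feature matrix concrete, here is a minimal base-R sketch on toy data. The two "documents" and the names docs, vocab and toy_dfm are made up for illustration; this is not how quanteda builds its matrix internally, only the same row-by-column counting idea.

```r
# Toy example: two hypothetical, already-segmented "speeches"
docs <- list(
  doc_a = c("new", "year", "hope", "new", "hope", "hope"),
  doc_b = c("new", "year", "peace", "peace")
)

# Collect every distinct term across both documents
vocab <- sort(unique(unlist(docs)))

# Count each vocabulary term per document; documents become rows, terms columns
toy_dfm <- t(sapply(docs, function(tokens) {
  counts <- table(factor(tokens, levels = vocab))
  setNames(as.integer(counts), vocab)
}))

print(toy_dfm)
```

Each cell holds how many times the column's term occurs in the row's document, and a term that a document never uses gets a zero.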

Ok. So far so good. We now compare the highest frequency terms used in each speech.

Num_of_terms_shown <- 5
xi_freqterm <- featfreq(dfm_xt["xi",])
xi_barplot <- data.frame(name=names(xi_freqterm),y=xi_freqterm)
xi_barplot <- xi_barplot[order(xi_barplot$y,decreasing=TRUE),]
xi_barplot$name <- factor(xi_barplot$name, levels = xi_barplot$name)
xi_barplot <- xi_barplot[1:Num_of_terms_shown,]

tsai_freqterm <- featfreq(dfm_xt["tsai",])
tsai_barplot <- data.frame(name=names(tsai_freqterm),y=tsai_freqterm)
tsai_barplot <- tsai_barplot[order(tsai_barplot$y,decreasing=TRUE),]
tsai_barplot$name <- factor(tsai_barplot$name, levels = tsai_barplot$name)
tsai_barplot <- tsai_barplot[1:Num_of_terms_shown,]

p1 <- plot_ly(xi_barplot, x = ~name, y = ~y, type = 'bar', 
              text = ~y, textposition = 'auto', name = "Xi Jinping's Speech",
              marker = list(color = 'red',
                            line = list(color = 'red', width = 1.5)))
p1 <- layout(p1, title = "", xaxis = list(title = ""), yaxis = list(title = ""))

p2 <- plot_ly(tsai_barplot, x = ~name, y = ~y, type = 'bar', 
              text = ~y, textposition = 'auto', name = "Tsai Ing-wen's Speech",
              marker = list(color = 'green',
                            line = list(color = 'green', width = 1.5)))
p2 <- layout(p2, title = "Top 5 Terms Used in Xi Jinping/Tsai Ing-wen's Speech" , xaxis = list(title = ""), yaxis = list(title = ""))

p <- subplot(p1,p2,shareY=T)
p <- layout(p, showlegend = T)
p
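
The data preparation above is written out twice, once per speaker. As a possible tidy-up, the sort-and-truncate step could be wrapped in a small helper. The function top_terms below is hypothetical (it is not part of quanteda or plotly), and the toy frequency vector stands in for the output of featfreq().

```r
# Hypothetical helper: return the n most frequent terms as a data frame.
# The `name` column is a factor ordered by descending frequency,
# so that plotly keeps the bars in that order.
top_terms <- function(freq, n = 5) {
  freq <- sort(freq, decreasing = TRUE)
  freq <- head(freq, n)
  data.frame(
    name = factor(names(freq), levels = names(freq)),
    y = as.numeric(freq),
    row.names = NULL
  )
}

# Toy named frequency vector standing in for featfreq(dfm_xt["xi", ])
toy_freq <- c(hope = 3, new = 2, peace = 0, year = 1, future = 5, people = 4)
top_terms(toy_freq, n = 3)
```

With such a helper, the two data frames could be built as top_terms(featfreq(dfm_xt["xi", ]), Num_of_terms_shown) and the same call with "tsai".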

Next, we create a wordcloud for each speech.

textplot_wordcloud(dfm_xt["xi",], random_order = FALSE, rotation = .25, min_count = 2, color = RColorBrewer::brewer.pal(8, "Dark2"))

textplot_wordcloud(dfm_xt["tsai",], random_order = FALSE, rotation = .25, min_count = 2, color = RColorBrewer::brewer.pal(8, "Dark2"))

Finally, we generate a comparison wordcloud, which compares the relative frequency with which each term was used in the two speeches. The terms in the upper half are those used relatively more often in Xi’s speech, and the terms in the lower half are those used relatively more often in Tsai’s speech; the font size of a term is proportional to how strongly it is over-represented.

textplot_wordcloud(dfm_xt,comparison = TRUE, min_count = 2)
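
The logic behind such a comparison can be sketched in base R. This is a rough illustration on made-up counts, assuming the approach used by the wordcloud package's comparison.cloud(), where each term's within-document rate is compared with its average rate across documents; the exact computation inside textplot_wordcloud() may differ in detail.

```r
# Toy counts: rows are documents, columns are terms (hypothetical data)
counts <- rbind(
  xi   = c(people = 6, development = 3, democracy = 0, epidemic = 1),
  tsai = c(people = 2, development = 1, democracy = 4, epidemic = 3)
)

# Relative frequency of each term within its own document
rel_freq <- counts / rowSums(counts)

# Deviation of each document's rate from the average rate across documents
deviation <- sweep(rel_freq, 2, colMeans(rel_freq))

# Each term is assigned to the document where it is most over-represented
owner <- rownames(deviation)[apply(deviation, 2, which.max)]
names(owner) <- colnames(deviation)
owner
```

Under this assumption, a term that both speakers use at the same rate has a deviation near zero and is drawn small, while a term that only one speaker uses stands out in that speaker's half.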

We now redo the same exercise, but instead of treating each single term as a token, we combine every two consecutive terms in the text into one token, which is called a “bigram”.

toks_xt <- tokens_ngrams(toks_xt, n = 2) # Combine each pair of consecutive tokens into a bigram
dfm_xt <- dfm(toks_xt)
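
As a small illustration of what a bigram is, the following base-R sketch pairs each token with the one that follows it. The token vector is made up, and the underscore separator mirrors the default concatenator that tokens_ngrams() uses.

```r
# Toy token sequence (hypothetical, already segmented)
tokens <- c("new", "year", "brings", "new", "hope")

# Pair every token with its successor: drop the last token on the left side
# and the first token on the right side, then join each pair with "_"
bigrams <- paste(head(tokens, -1), tail(tokens, -1), sep = "_")
bigrams
```

A sequence of five tokens yields four bigrams, here "new_year", "year_brings", "brings_new" and "new_hope".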

OK. Let’s do everything again with the bigrams and see what we get.

Barplot (Bigram)

Num_of_terms_shown <- 5
xi_freqterm <- featfreq(dfm_xt["xi",])
xi_barplot <- data.frame(name=names(xi_freqterm),y=xi_freqterm)
xi_barplot <- xi_barplot[order(xi_barplot$y,decreasing=TRUE),]
xi_barplot$name <- factor(xi_barplot$name, levels = xi_barplot$name)
xi_barplot <- xi_barplot[1:Num_of_terms_shown,]

tsai_freqterm <- featfreq(dfm_xt["tsai",])
tsai_barplot <- data.frame(name=names(tsai_freqterm),y=tsai_freqterm)
tsai_barplot <- tsai_barplot[order(tsai_barplot$y,decreasing=TRUE),]
tsai_barplot$name <- factor(tsai_barplot$name, levels = tsai_barplot$name)
tsai_barplot <- tsai_barplot[1:Num_of_terms_shown,]

p1 <- plot_ly(xi_barplot, x = ~name, y = ~y, type = 'bar', 
              text = ~y, textposition = 'auto', name = "Xi Jinping's Speech",
              marker = list(color = 'red',
                            line = list(color = 'red', width = 1.5)))
p1 <- layout(p1, title = "", xaxis = list(title = ""), yaxis = list(title = ""))

p2 <- plot_ly(tsai_barplot, x = ~name, y = ~y, type = 'bar', 
              text = ~y, textposition = 'auto', name = "Tsai Ing-wen's Speech",
              marker = list(color = 'green',
                            line = list(color = 'green', width = 1.5)))
p2 <- layout(p2, title = "Top 5 Bigrams Used in Xi Jinping/Tsai Ing-wen's Speech" , xaxis = list(title = ""), yaxis = list(title = ""))

p <- subplot(p1,p2,shareY=T)
p <- layout(p, showlegend = T)
p

Wordcloud (Bigram)

textplot_wordcloud(dfm_xt["xi",], random_order = FALSE, rotation = .25, min_count = 2, color = RColorBrewer::brewer.pal(8, "Dark2"))

textplot_wordcloud(dfm_xt["tsai",], random_order = FALSE, rotation = .25, min_count = 2, color = RColorBrewer::brewer.pal(8, "Dark2"))

Comparison wordcloud (Bigram)

textplot_wordcloud(dfm_xt,comparison = TRUE, min_count = 2)


King-wa Fu

January 22, 2021