Unsupervised Learning and Text Mining of Emotion Terms Using R

Informatics Engineering

Universitas Islam Negeri Maulana Malik Ibrahim Malang

Supervisor: Prof. Dr. Suhartono, M.Kom

Unsupervised learning refers to data science approaches that learn without prior knowledge of how the sample data are classified. Wikipedia describes unsupervised learning as "the task of inferring a function to describe hidden structure from 'unlabeled' data (a classification or categorization is not included in the observations)". The overall goal of this post is to evaluate and understand the co-occurrence and/or co-expression of emotion words within individual letters, and to check whether the 40 annual shareholder letters show differential expression profiles or patterns of emotion words. Differential expression of emotion words refers here to quantitative differences in emotion word frequency counts between letters, as well as qualitative differences, where certain emotion words occur in some letters but are absent from others.
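To make the two kinds of differential expression concrete, here is a minimal sketch in base R. The two word vectors and the helper `count_words()` are made up for illustration; they are not taken from the shareholder letters or from the analysis in this post.

```r
# Toy emotion-word tokens for two hypothetical letters (illustrative only)
letter_a <- c("fear", "loss", "loss", "trust", "gain")
letter_b <- c("trust", "gain", "gain", "gain", "joy")

# Hypothetical helper: count how often each vocabulary word occurs in a letter
count_words <- function(tokens, vocab) {
  vapply(vocab, function(w) sum(tokens == w), numeric(1))
}

# Quantitative difference: frequency of every word in each letter
vocab  <- sort(union(letter_a, letter_b))
counts <- rbind(a = count_words(letter_a, vocab),
                b = count_words(letter_b, vocab))
print(counts)

# Qualitative difference: words that occur in one letter but not the other
unique_a <- setdiff(letter_a, letter_b)
unique_b <- setdiff(letter_b, letter_a)
print(unique_a)
print(unique_b)
```

The count matrix captures how often each word is used, while the two `setdiff()` results capture which words appear in only one of the letters.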

Dataset

# Retrieve the letters  
library(pdftools)

library(rvest)       
library(XML)
# Getting & Reading in HTML Letters
urls_77_97 <- paste('http://www.berkshirehathaway.com/letters/', seq(1977, 1997), '.html', sep='')
html_urls <- c(urls_77_97,
               'http://www.berkshirehathaway.com/letters/1998htm.html',
               'http://www.berkshirehathaway.com/letters/1999htm.html',
               'http://www.berkshirehathaway.com/2000ar/2000letter.html',
               'http://www.berkshirehathaway.com/2001ar/2001letter.html')

letters_html <- lapply(html_urls, function(x) read_html(x) %>% html_text())
# Getting & Reading in PDF Letters
urls_03_16 <- paste('http://www.berkshirehathaway.com/letters/', seq(2003, 2016), 'ltr.pdf', sep = '')
pdf_urls <- data.frame('year' = seq(2002, 2016),
                       'link' = c('http://www.berkshirehathaway.com/letters/2002pdf.pdf', urls_03_16))
download_pdfs <- function(x) {
  myfile = paste0(x['year'], '.pdf')
  download.file(url = x['link'], destfile = myfile, mode = 'wb')
  return(myfile)
}
pdfs <- apply(pdf_urls, 1, download_pdfs)
letters_pdf <- lapply(pdfs, function(x) pdf_text(x) %>% paste(collapse=" "))
tmp <- lapply(pdfs, function(x) if(file.exists(x)) file.remove(x)) 
# Combine letters in a data frame
letters <- do.call(rbind, Map(data.frame, year=seq(1977, 2016), text=c(letters_html, letters_pdf)))
letters$text <- as.character(letters$text)

Analysis of emotions terms usage

# Load additional required packages
require(tidyverse)
require(tidytext)
require(gplots)
require(SnowballC)
require(sqldf)
theme_set(theme_bw(12))
# pull emotion words and aggregate by year and emotion terms

emotions <- letters %>%
  unnest_tokens(word, text) %>%                           
  anti_join(stop_words, by = "word") %>%                  
  filter(!grepl('[0-9]', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(!(sentiment == "negative" | sentiment == "positive")) %>%
  group_by(year, sentiment) %>%
  summarize( freq = n()) %>%
  mutate(percent=round(freq/sum(freq)*100)) %>%
  select(-freq) %>%
  spread(sentiment, percent, fill=0) %>%
  ungroup()
# Normalize data 
sd_scale <- function(x) {
     (x - mean(x))/sd(x)
 }
emotions[,c(2:9)] <- apply(emotions[,c(2:9)], 2, sd_scale)
emotions <- as.data.frame(emotions)
rownames(emotions) <- emotions[,1]
emotions3 <- emotions[,-1]
emotions3 <- as.matrix(emotions3)
## Using a heatmap and clustering to visualize and profile emotion terms expression data

heatmap.2(
     emotions3,
     dendrogram = "both",
     scale      = "none",
     trace      = "none",
     key        = TRUE,
     col    = colorRampPalette(c("green", "yellow", "red"))
 )
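The `sd_scale()` step used before the heatmap is a column-wise z-score: each emotion column is centered on its mean and divided by its standard deviation, so that emotions with very different baseline frequencies become comparable. The sketch below, on a small made-up matrix (not data from the letters), suggests that base R's `scale()` gives the same result.

```r
# Same helper as in the analysis code
sd_scale <- function(x) (x - mean(x)) / sd(x)

# Toy percentages for three hypothetical years (illustrative only)
toy <- matrix(c(10, 20, 30,
                 5,  5, 20),
              ncol = 2,
              dimnames = list(c("y1", "y2", "y3"), c("fear", "joy")))

z_manual <- apply(toy, 2, sd_scale)  # column-wise z-scores
z_base   <- scale(toy)               # centers and scales columns by default

# The two approaches agree up to attributes
print(all.equal(z_manual, z_base, check.attributes = FALSE))

# After scaling, each column has mean 0 and standard deviation 1
print(round(colMeans(z_manual), 10))
print(apply(z_manual, 2, sd))
```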

Co-expression profiles of emotion words usage

Based on the expression profiles combined with the vertical dendrogram, there are about four co-expression profiles of emotion terms: i) terms referring to fear and sadness appear to be co-expressed; ii) anger and disgust show similar expression profiles and are therefore co-expressed; iii) terms referring to joy, anticipation, and surprise appear to be expressed similarly; and iv) terms referring to trust show the least co-expression with the other emotions.
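Groupings like these can also be read off programmatically rather than by eye. The sketch below assumes the defaults of `heatmap.2()`, which clusters with `dist()` and `hclust()`, and applies the same two functions to the columns of a small made-up matrix. The numbers are toy values, not results from the letters; with the real data, `emotions3` would take the place of `toy`.

```r
# Toy standardized values: rows are hypothetical years, columns are emotions
base1 <- c(-1.2, -0.5, 0.3, 1.1, 0.4, -0.1)
base2 <- c(0.9, 1.0, -0.8, -1.1, 0.2, -0.2)
toy <- cbind(fear     = base1,
             sadness  = base1 + 0.1,   # tracks fear closely
             joy      = base2,
             surprise = base2 - 0.1)   # tracks joy closely

# Cluster the emotions (columns), so transpose before computing distances
col_tree <- hclust(dist(t(toy)))

# Cut the dendrogram into two groups of co-expressed emotions
groups <- cutree(col_tree, k = 2)
print(split(names(groups), groups))
```

Choosing `k` is a judgment call; for the four profiles described above, `k = 4` would be the natural starting point.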

Examples of word stemmer output

There are several word stemmers in R. One such function, wordStem, in the SnowballC package extracts the stem of each word in a given vector (see the example below).

Before <- c("produce",  "produces", "produced", "producing", "product", "products", "production")
# wordStem() comes from SnowballC; keep the original words next to their stems
wstem <- data.frame(Before = Before, After = wordStem(Before))
wstem

Dispersion Plot

A dispersion plot is a graphical display that shows the approximate locations and density of emotion terms along the length of a text document. The dispersion plots cover the unique emotion words of the shareholder letters in heatmap group-1 (1987, 1989), group-5 (2001, 2008), and group-4 (2012, 2013). For each plot, all words from the listed years are ordered sequentially by letter year, and the presence and approximate location of each unique word is marked by a stripe. Each stripe represents one occurrence of a unique word in the shareholder letters.
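The idea can be sketched in base R with a toy token sequence. The tokens and the list of target terms below are invented for illustration and do not come from the letters.

```r
# Toy token sequence standing in for the words of consecutive letters
tokens <- c("profit", "fear", "market", "loss",
            "growth", "fear", "trust", "loss")

# Hypothetical set of "unique" emotion words to locate
target_terms <- c("fear", "loss")

# 1 where a token is a target term, NA elsewhere (NA draws nothing)
stripes <- ifelse(tokens %in% target_terms, 1, NA)

# Positions of the stripes along the text
hits <- which(!is.na(stripes))
print(hits)

# One vertical stripe per occurrence of a target term
plot(stripes, type = "h", col = "red", ylim = c(0, 1), yaxt = "n",
     xlab = "Length (Word count)", ylab = "",
     main = "Dispersion plot (toy example)")
```

Dense runs of stripes indicate stretches of text where the target terms cluster, while gaps indicate stretches where they are absent.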

Confirmation of emotion words expressed uniquely in heatmap group-1

# Assumes venn_list (unique terms per heatmap group) and emotions_final
# (one row per word, in letter order) already exist; they are not defined here.
group1_U <- as.data.frame(venn_list$'A')
names(group1_U) <- "terms"
uniq1 <- sqldf( "select t1.*, g1.terms
from emotions_final t1
left join
group1_U g1
on t1.word = g1.terms "
)
uniq1a <- !is.na(uniq1$terms)
# One slot per word; length() on a data frame would count columns, not rows
uniqs1 <- rep(NA, nrow(uniq1))
uniqs1[uniq1a] <- 1
plot(uniqs1, main="Dispersion plot of emotions words \n unique to heatmap group 1 ", xlab="Length (Word count)", ylab=" ", col="red", type='h', ylim=c(0,1), yaxt='n')

Confirmation of emotion words expressed uniquely in heatmap group-5

## confirmation of unique emotion words in heatmap group-5
group5_U <- as.data.frame(venn_list$'C')
names(group5_U) <- "terms"
uniq5 <- sqldf( "select t1.*, g5.terms
from emotions_final t1
left join
group5_U g5
on t1.word = g5.terms "
)
uniq5a <- !is.na(uniq5$terms)
# One slot per word; length() on a data frame would count columns, not rows
uniqs5 <- rep(NA, nrow(uniq5))
uniqs5[uniq5a] <- 1

plot(uniqs5, main="Dispersion plot of emotions words \n unique to heatmap group 5 ", xlab="Length (Word count)", ylab=" ", col="red", type='h', ylim=c(0,1), yaxt='n')

Annual Returns on Investment in S&P500 (1977 – 2016)

## You need to first download the raw data before running the code to recreate the graph below.
ggplot(sp500[50:89,], aes(x=year, y=return, colour=return>0)) +
  geom_segment(aes(x=year, xend=year, y=0, yend=return),
               size=1.1, alpha=0.8) +
  geom_point(size=1.0) +
  xlab("Investment Year") +
  ylab("S&P500 Annual Returns") +
  labs(title="Annual Returns on Investment in S&P500", subtitle= "source: http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/histretSP.html") +
  theme(legend.position="none") +
  coord_flip()

Concluding Remarks

R offers several packages and functions for evaluating and analyzing differential expression and co-expression profiles of emotion words in textual data, as well as for visualizing and presenting the results. Several of those functions, techniques, and tools were tried here; hopefully you find the examples useful.

References

https://www.r-bloggers.com/

https://www.r-bloggers.com/2015/12/how-to-learn-r-2/