Analiza teksta

Analiza teksta dobiva na popularnosti zbog sve veće dostupnosti podataka i razvoja user friendly podrške za provedbu takve analize. Analiza tekstualnih podataka je moguća kroz mnoštvo različitih pristupa, a najšire korišten pristup je bag-of-words u kojem je frekvencija riječi polazište za analizu dok se (npr.) pozicija riječi zanemaruje. Postupak analize teksta započinje pripremom teksta (podataka), koja uključuje: uvoz teksta, operacije sa riječima, uređivanje i tokenizacija, izrada matrice pojmova, filtiranje i ponderiranje podataka. Pri tome valja imati na umu da vrsta analize i korištena metoda određuju način na koji je potrebno pripremiti podatke za daljnu analizu te da svaka metoda ima svoje specifičnosti. Nakon pripreme podataka se vrši analiza teksta (podataka) metodama nadziranog strojnog učenja, ne-nadziranog strojnog učenja, statistike na tekstualnim podatcima, analize riječnika. Napredne metode analize podataka uključuju NLP i analizu pozicije riječi i sintakse.

Procedura za analizu teksta.

U ovom ćemo pregledu koristiti ‘tidytext’ pristup za analizu tekstualnih podatka, detaljno opisan u knjizi Text Mining with R. Analizirati ćemo temu COVID-19 pandemije na osnovi članka objavljenih na velikom broju domaćih internetskih portala u zadnjih 30 dana. Eksplorativna analiza teksta koju ćemo provesti uključuje: uvoz podataka pomoću API, čišćenje, uređivanje i prilagodbu podataka, dekriptivnu statistiku na tekstualnim podatcima, analizu sentimenta, analizu frekvencija i tematsku analizu.

Uvoz podataka

Podatci za analizu su preuzeti kroz API i prikuljeni su sa velikog broja domaćih internetskih portala. Članci su identificirani ako sadrže “COVID” bilo gdje u tekstu članka,a članak je objavljen u zadnjih 30 dana.

# POVUCI PODATKE SA API


wh_token("6f9a2e1a-ba97-43a9-af55-918b50c644f4")

wh_news(q = '"COVID" is_first:true language:croatian site_type:news thread.country:HR',
        ts = (Sys.time() - (30 * 24 * 60 * 60))) %>%
  wh_paginate(100) %>% 
  wh_collect(TRUE) -> cro_corona_news

# UREDI PODATKE I POSPREMI

cro_corona_news %>%
  as.data.frame(.) %>%
  select(url, author, published, title, text, thread.site) %>%
  write.csv2(., file = "E:/Luka/Academic/HS/MULTIVARIJARNE METODE/7_TEXT/ccn2.csv")

Nakon što smo povukli članke, izabrali varijable od interesa i lokalno posporemili podatke, u sljedećem koraku učitavamo podatke u radni prostor R i učitavamo druge podatke koji su nam potrebni za analizu. Osim članaka, potrebni su nam leksikoni i stop riječi. Leksikone ćemo preuzeti sa iz FER-ovog repozitorija, a “stop riječi” ćemo napraviti sami.

# UČITAJ ČLANKE

cro_corona_news <- read.csv2("E:/Luka/Academic/HS/MULTIVARIJARNE METODE/7_TEXT/ccn2.csv", stringsAsFactors = F)


# UČITAJ LEKSIKONE

CroSentilex_n <- read.delim("C:/Users/msagovac/Dropbox/Mislav@Luka/crosentilex-negatives.txt",
                                   header = FALSE,
                                   sep = " ",
                                   stringsAsFactors = FALSE,
                                   fileEncoding = "UTF-8")  %>%
                   rename(word = "V1", sentiment = "V2" ) %>%
                   mutate(brija = "NEG")
 
 CroSentilex_p  <- read.delim("C:/Users/msagovac/Dropbox/Mislav@Luka/crosentilex-positives.txt",
                                   header = FALSE,
                                   sep = " ",
                                   stringsAsFactors = FALSE,
                                   fileEncoding = "UTF-8") %>%
                    rename(word = "V1", sentiment = "V2" ) %>%
                    mutate(brija = "POZ")
 
 Crosentilex_sve <- rbind(setDT(CroSentilex_n), setDT(CroSentilex_p))
 
 
 CroSentilex_Gold  <- read.delim2("C:/Users/msagovac/Dropbox/Mislav@Luka/gs-sentiment-annotations.txt",
                                 header = FALSE,
                                 sep = " ",
                                 stringsAsFactors = FALSE) %>%
                    rename(word = "V1", sentiment = "V2" ) 

 Encoding(CroSentilex_Gold$word) <- "UTF-8"
 CroSentilex_Gold[1,1] <- "dati"
 CroSentilex_Gold$sentiment <- str_replace(CroSentilex_Gold$sentiment , "-", "1")
 CroSentilex_Gold$sentiment <- str_replace(CroSentilex_Gold$sentiment , "\\+", "2")
 CroSentilex_Gold$sentiment <- as.numeric(unlist(CroSentilex_Gold$sentiment))
 
# STVORI "STOP RIJEČI"
 
 stopwords_cro <- get_stopwords(language = "hr", source = "stopwords-iso")
my_stop_words <- tibble(
  word = c(
    "jedan",
    "e","prvi", "dva","dvije","drugi",
    "tri","treći","pet","kod",
    "ove","ova",  "ovo","bez", "kod",
    "evo","oko",  "om", "ek",
    "mil","tko","šest", "sedam",
    "osam",   "čim", "zbog",
    "prema", "dok","zato", "koji", 
    "im", "čak","među", "tek",
    "koliko", "tko","kod","poput", 
    "baš", "dakle", "osim", "svih", 
    "svoju", "odnosno", "gdje",
    "kojoj", "ovi", "toga",
     "ubera", "vozača", "hrvatskoj", "usluge", "godine", "više", "taksi", "taxi", "taksija", "taksija", "kaže", "rekao", "19"
  ),
  lexicon = "lux"
)
stop_corpus <- my_stop_words %>%
  bind_rows(stopwords_cro)

Prilagodba podataka

U sljedećem koraku ćemo prilagoditi podatke u tidy format koji je prikladan za analizu. Za to je potrebno provesti tokenizaciju te očistiti riječi od brojeva i nepotrebnih riječi. Na tako uređenim podatcima ćemo napraviti osnovni statistički pregled i provesti jednostavnu statističku analizu ključnih riječi i domena.

# UREDI ČLANKE

newsCOVID <- cro_corona_news %>% 
  as.data.frame() %>%
  select(url, author, published, title, text, thread.site) %>%
  mutate(published = gsub("T.*","",published) ) %>%
  mutate(published = as.Date(published,"%Y-%m-%d")) %>%
  mutate(clanak = 1:n()) %>%
  group_by(thread.site) %>%
  mutate(domenaBr = n()) %>% 
  ungroup()

glimpse(newsCOVID)

## Rows: 7,919
## Columns: 8
## $ url         <chr> "https://podravski.hr/najnovije-nema-novih-zarazenih-u-...
## $ author      <chr> "Ivica Barać", "Iva Međugorac", "Željko Marušić", "Tea ...
## $ published   <date> 2020-04-26, 2020-04-26, 2020-04-26, 2020-04-26, 2020-0...
## $ title       <chr> "NAJNOVIJE! Nema novih zaraženih u našoj županiji, ozdr...
## $ text        <chr> "NAJNOVIJE! Nema novih zaraženih u našoj županiji, ozdr...
## $ thread.site <chr> "podravski.hr", "dnevno.hr", "dnevno.hr", "dnevno.hr", ...
## $ clanak      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ domenaBr    <int> 70, 103, 103, 103, 41, 269, 269, 439, 16, 294, 439, 64,...

# TOKENIZACIJA

newsCOVID %>% 
  unnest_tokens(word, text) -> newsCOVID_token 

glimpse(newsCOVID_token)

## Rows: 4,799,432
## Columns: 8
## $ url         <chr> "https://podravski.hr/najnovije-nema-novih-zarazenih-u-...
## $ author      <chr> "Ivica Barać", "Ivica Barać", "Ivica Barać", "Ivica Bar...
## $ published   <date> 2020-04-26, 2020-04-26, 2020-04-26, 2020-04-26, 2020-0...
## $ title       <chr> "NAJNOVIJE! Nema novih zaraženih u našoj županiji, ozdr...
## $ thread.site <chr> "podravski.hr", "podravski.hr", "podravski.hr", "podrav...
## $ clanak      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ domenaBr    <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,...
## $ word        <chr> "najnovije", "nema", "novih", "zaraženih", "u", "našoj"...

newsCOVID_token %>%
  head(10)

## # A tibble: 10 x 8
##    url          author  published  title       thread.site clanak domenaBr word 
##    <chr>        <chr>   <date>     <chr>       <chr>        <int>    <int> <chr>
##  1 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 najn~
##  2 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 nema 
##  3 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 novih
##  4 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 zara~
##  5 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 u    
##  6 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 našoj
##  7 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 župa~
##  8 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 ozdr~
##  9 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 ukup~
## 10 https://pod~ Ivica ~ 2020-04-26 NAJNOVIJE!~ podravski.~      1       70 14

newsCOVID_token %>% 
  sample_n(.,10)

## # A tibble: 10 x 8
##    url         author  published  title       thread.site  clanak domenaBr word 
##    <chr>       <chr>   <date>     <chr>       <chr>         <int>    <int> <chr>
##  1 http://www~ "Z NET" 2020-04-08 ŠKOLSKA SH~ znet.hr        4612       99 razv~
##  2 https://ww~ ""      2020-04-17 POVJERENIC~ jutarnji.hr    2149      774 prob~
##  3 https://ww~ "Mirja~ 2020-04-15 U Hrvatsko~ icv.hr         2570       99 mjere
##  4 http://met~ ""      2020-04-10 Proglašene~ metro-porta~   3825       88 za   
##  5 https://ww~ "maxpo~ 2020-03-29 Dr. Lidija~ maxportal.hr   7269       41 mjer~
##  6 https://ww~ ""      2020-04-06 VODIČ ZA E~ jutarnji.hr    5215      774 osoba
##  7 https://ww~ "uredn~ 2020-04-10 Pučka prav~ civilnodrus~   3726       23 slož~
##  8 https://ww~ "Bljes~ 2020-04-20 UK će test~ bljesak.info   1415       92 prež~
##  9 https://vl~ ""      2020-04-07 Stožer: 60~ gov.hr         4763       64 juče~
## 10 http://www~ ""      2020-04-18 Globalna p~ glas-slavon~   1722      278 sve

## Ukloni "stop words", brojeve, veznike i pojedinačna slova
newsCOVID_token %>% 
  anti_join(stop_corpus, by = "word") %>%
  mutate(word = gsub("\\d+", NA, word)) %>%
  mutate(word = gsub("^[a-zA-Z]$", NA, word)) %>% 
  drop_na(.)-> newsCOVID_tokenTidy


newsCOVID_tokenTidy %>%
  sample_n(10)

## # A tibble: 10 x 8
##    url        author     published  title      thread.site clanak domenaBr word 
##    <chr>      <chr>      <date>     <chr>      <chr>        <int>    <int> <chr>
##  1 http://ww~ ""         2020-04-13 Veterinar~ tportal.hr    3317      294 skrbe
##  2 https://w~ ""         2020-04-07 Dvaput do~ rtl.hr        5016      267 stva~
##  3 http://ww~ ""         2020-04-24 Osječke u~ osijek031.~    450       47 natj~
##  4 https://n~ ""         2020-04-24 Butković:~ jutarnji.hr    314      774 butk~
##  5 https://w~ ""         2020-04-14 U TIJEKU ~ jutarnji.hr   2755      774 post~
##  6 https://w~ ""         2020-04-11 Hrvatska ~ rtl.hr        3462      267 podr~
##  7 http://ww~ "24sata.h~ 2020-04-11 OŠTEĆENJA~ znet.hr       3569       99 češći
##  8 https://s~ ""         2020-04-13 NAJAVIO J~ jutarnji.hr   3196      774 bres~
##  9 https://n~ "Narod.hr" 2020-03-31 (VIDEO) D~ narod.hr      6767       72 devet
## 10 https://w~ "Dijana M~ 2020-04-02 Sve bliže~ 24sata.hr     6361      429 izol~

# DESKRIPTIVNI PREGLED PODATAKA

## Vremenski raspon
range(newsCOVID_token$published)

## [1] "2020-03-26" "2020-04-26"

## Najčešće riječi
newsCOVID_tokenTidy %>%
  count(word, sort = T) %>%
  head(25)

## # A tibble: 25 x 2
##    word             n
##    <chr>        <int>
##  1 covid        13977
##  2 osoba        13068
##  3 dana          9053
##  4 ljudi         8966
##  5 koronavirusa  8892
##  6 mjere         8375
##  7 mjera         7851
##  8 sada          7632
##  9 zaštite       7544
## 10 osobe         7388
## # ... with 15 more rows

## Vizualizacija najčešćih riječi
newsCOVID_tokenTidy %>%
  count(word, sort = T) %>%
  filter(n > 5000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

## Vizualizacija najčešćih riječi kroz vrijeme
newsCOVID_tokenTidy %>%
   mutate(Datum = floor_date(published, "day")) %>%
   group_by(Datum) %>%
   count(word) %>% 
   mutate(gn = sum(n)) %>%
   filter(word %in%  c("covid", "kriza", "ekonomija", "oporavak")) %>%
   ggplot(., aes(Datum,  n / gn)) + 
   geom_point() +
   ggtitle("Učestalost korištenja riječi u člancima o pandemiji COVID-19") +
   ylab("% ukupnih riječi") +
   geom_smooth() +
   facet_wrap(~ word, scales = "free_y") +
   scale_y_continuous(labels = scales::percent_format())

## Broj domena
newsCOVID_tokenTidy %>% 
  summarise(Domena = n_distinct(thread.site))

## # A tibble: 1 x 1
##   Domena
##    <int>
## 1    160

## Broj članaka po domeni

newsCOVID %>% 
  drop_na(.) %>%
  group_by(thread.site) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>% 
  head(20)

## # A tibble: 20 x 2
##    thread.site              n
##    <chr>                <int>
##  1 slobodnadalmacija.hr   790
##  2 jutarnji.hr            774
##  3 novilist.hr            439
##  4 24sata.hr              429
##  5 net.hr                 383
##  6 tportal.hr             294
##  7 glas-slavonije.hr      278
##  8 dnevnik.hr             269
##  9 rtl.hr                 267
## 10 glasistre.hr           214
## 11 poslovni.hr            212
## 12 dalmacijanews.hr       135
## 13 telegram.hr            135
## 14 lider.media            121
## 15 dalmatinskiportal.hr   117
## 16 ipress.hr              111
## 17 dnevno.hr              103
## 18 hkm.hr                 102
## 19 icv.hr                  99
## 20 znet.hr                 99

## Najveći portali

newsCOVID %>% 
  group_by(thread.site) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>% 
  top_n(6) %>%
  select(thread.site) %>%
  pull() -> topPortali

## Broj članaka po domeni kroz vrijeme

newsCOVID %>% 
   mutate(Datum = floor_date(published, "day")) %>%
   group_by(Datum) %>%
   count(thread.site) %>% 
   mutate(gn = sum(n)) %>%
   ungroup() %>% 
   filter(thread.site %in% topPortali) %>%
   group_by(Datum, thread.site) %>%
   ggplot(., aes(Datum,  n / gn)) + 
   geom_point() +
   ggtitle("Članci o pandemiji COVID-19 na najvažnijim RH portalima") +
   ylab("% ukupno objavljenih članaka") +
   geom_smooth() +
   facet_wrap(~ thread.site, scales = "free_y") +
   scale_y_continuous(labels = scales::percent_format())

Analiza sentimenta

Nakon uređivanja podataka i osnovnog pregleda najvažnijih riječi i dinamike kretanja članaka u vremenu, provesti ćemo analizu sentimenta. Za analizu sentimenta je potrebno preuzeti leksikone sentimenta koji su za hrvatski jezik dostupni kroz FER-ov Croatian Sentiment Lexicon. Analiza sentimenta će biti provedena za cijeli korpus članaka te za domene i uključuje sentiment kroz vrijeme, doprinos riječi sentimentu, ‘wordCloud’ i analizu negativnosti portala.

## Pregled leksikona
CroSentilex_n %>% sample_n(10)

##           word sentiment brija
## 1        zečić   0.22811   NEG
## 2   istegnusti   0.28867   NEG
## 3       srečko   0.12995   NEG
## 4    rodilište   0.47950   NEG
## 5  preokretati   0.44251   NEG
## 6        izvoz   0.36509   NEG
## 7    portoriko   0.30171   NEG
## 8      jakljan   0.32017   NEG
## 9       lampaš   0.15640   NEG
## 10     paradis   0.07237   NEG

CroSentilex_p %>% sample_n(10)

##           word sentiment brija
## 1       garros  0.510340   POZ
## 2        osama  0.451340   POZ
## 3        vanek  0.057987   POZ
## 4      santoro  0.106860   POZ
## 5  dvokrevetan  0.256320   POZ
## 6        milja  0.527150   POZ
## 7       vikend  0.277400   POZ
## 8    krzysztof  0.166890   POZ
## 9       bjeliš  0.380770   POZ
## 10     horwath  0.120390   POZ

Crosentilex_sve %>% sample_n(10)

##            word sentiment brija
## 1    električan   0.37711   NEG
## 2       marulić   0.37298   NEG
## 3        usluga   0.30013   POZ
## 4    sjemenište   0.32842   NEG
## 5      zvukovan   0.56314   POZ
## 6       šetanje   0.12224   POZ
## 7       položen   0.45734   POZ
## 8  prepisivanje   0.22789   POZ
## 9      benedikt   0.56247   POZ
## 10    depozitan   0.11399   NEG

CroSentilex_Gold %>% sample_n(10)

##         word sentiment
## 1     opasan         1
## 2       vino         0
## 3       soba         0
## 4       brak         0
## 5    uvjeren         0
## 6     snimka         0
## 7    ukraden         1
## 8    platiti         0
## 9        lak         0
## 10 vremenski         0

## Kretanje sentimenta kroz vrijeme
vizualiziraj_sentiment <- function(dataset, frq = "day") {

dataset %>%
  inner_join( Crosentilex_sve, by = "word") %>%
  filter(!is.na(word)) %>%
  select(word, brija, published, sentiment) %>% 
  unique() %>%
  spread(. , brija, sentiment) %>%
  mutate(sentiment = POZ - NEG) %>%
  select(word, published, sentiment) %>% 
  group_by(word) %>% 
  mutate(count = n()) %>%
  arrange(desc(count)) %>%
  mutate( score = sentiment*count) %>%
  ungroup() %>%
  group_by(published) %>%
  arrange(desc(published)) -> sm

 
sm %>%
  select(published, score) %>%
  group_by(Datum = floor_date(published, frq)) %>%
  summarise(Dnevni_sent = sum(score, na.rm = TRUE)) %>%
  ggplot(., aes(Datum, Dnevni_sent)) +
  geom_bar(stat = "identity") + 
  ggtitle(paste0("Sentiment kroz vrijeme;frekvencija podataka:", frq)) +
  ylab("SentimentScore") -> gg_sentiment_kroz_vrijeme_qv


gg_sentiment_kroz_vrijeme_qv

}
vizualiziraj_sentiment(newsCOVID_tokenTidy,"day")

## Doprinos sentimentu
doprinos_sentimentu <- function(dataset, no = n) {
dataset %>%
  inner_join(CroSentilex_Gold, by = "word") %>% 
  count(word, sentiment,sort = TRUE) %>% 
  group_by(sentiment) %>%
  top_n(no) %>%
  ungroup() %>%
  mutate(sentiment = case_when(sentiment == 0 ~ "NEUTRALNO",
                                 sentiment == 1 ~ "NEGATIVNO",
                                 sentiment == 2 ~ "POZITIVNO")) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  ggtitle( "Doprinos sentimentu") +
  labs( x = "Riječ", y = "Broj riječi") +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip() -> gg_doprinos_sentimentu
  
 gg_doprinos_sentimentu
 
}
doprinos_sentimentu(newsCOVID_tokenTidy,15)

## WordCloud(vulgaris)
newsCOVID_tokenTidy %>%
  anti_join(CroSentilex_Gold,by="word") %>% 
  count(word) %>% 
  arrange(desc(n)) %>%
  top_n(100) %>%
  with(wordcloud(word, n, max.words = 80))

## ComparisonCloud
newsCOVID_tokenTidy %>%
  inner_join(CroSentilex_Gold,by="word") %>% 
  count(word, sentiment) %>% 
  top_n(200) %>%
  mutate(sentiment = case_when(sentiment == 0 ~ "+/-",
                                 sentiment == 1 ~ "-",
                                 sentiment == 2 ~ "+")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("firebrick3", "deepskyblue3","darkslategray"),
                   max.words = 120)

## Najnegativniji portali

wCount <- newsCOVID_tokenTidy %>% 
  group_by(thread.site) %>%
  summarise(word = n())

CroSentilex_Gold_neg <- CroSentilex_Gold %>% filter(sentiment == 1)
CroSentilex_Gold_poz <- CroSentilex_Gold %>% filter(sentiment == 2)


newsCOVID_tokenTidy %>% 
  semi_join(CroSentilex_Gold_neg, by= "word") %>%
  group_by(thread.site) %>% 
  summarise(negWords = n()) %>%
  left_join(wCount, by = "thread.site") %>%
  mutate(negativnostIndex = (negWords/word)*100) %>%
  arrange(desc(negativnostIndex))

## # A tibble: 150 x 4
##    thread.site       negWords  word negativnostIndex
##    <chr>                <int> <int>            <dbl>
##  1 voljatel.hr              3   130             2.31
##  2 alexandar.org            7   371             1.89
##  3 makora.hr               22  1176             1.87
##  4 hkv.hr                 409 24085             1.70
##  5 drvo-namjestaj.hr       12   829             1.45
##  6 hrportfolio.hr          41  2923             1.40
##  7 vozim.hr                13   966             1.35
##  8 pcchip.hr                4   299             1.34
##  9 klik.hr                 22  1667             1.32
## 10 vecernji.hr            107  8402             1.27
## # ... with 140 more rows

## Najpozitivniji portali
 
newsCOVID_tokenTidy %>% 
  semi_join(CroSentilex_Gold_poz, by= "word") %>%
  group_by(thread.site) %>% 
  summarise(pozWords = n()) %>%
  left_join(wCount, by = "thread.site") %>%
  mutate(pozitivnostIndex = (pozWords/word)*100) %>%
  arrange(desc(pozitivnostIndex))

## # A tibble: 156 x 4
##    thread.site         pozWords  word pozitivnostIndex
##    <chr>                  <int> <int>            <dbl>
##  1 sportnedjeljom.hr         11   198             5.56
##  2 profightstore.hr           8   196             4.08
##  3 pcchip.hr                 12   299             4.01
##  4 ekozadar.hr               14   366             3.83
##  5 soundguardian.com          5   142             3.52
##  6 infozona.hr                4   114             3.51
##  7 gameperspectives.hr       59  1820             3.24
##  8 redea.hr                  62  1969             3.15
##  9 sensa.hr                  55  1765             3.12
## 10 hpb.hr                     6   196             3.06
## # ... with 146 more rows

### Veliki portali 

newsCOVID %>%
  group_by(thread.site) %>%
  count %>%
  arrange(desc(n)) %>%
  head(10) %>%
  pull(thread.site) -> najveceDomene

newsCOVID %>%
  group_by(thread.site) %>%
  count %>%
  arrange(desc(n)) %>%
  head(5) %>%
  pull(thread.site) -> najveceDomene5



newsCOVID_tokenTidy %>% 
  filter(thread.site %in% najveceDomene) %>%
  semi_join(CroSentilex_Gold_neg, by= "word") %>%
  group_by(thread.site) %>% 
  summarise(negWords = n()) %>%
  left_join(wCount, by = "thread.site") %>%
  mutate(negativnostIndex = (negWords/word)*100) %>%
  arrange(desc(negativnostIndex))

## # A tibble: 10 x 4
##    thread.site          negWords   word negativnostIndex
##    <chr>                   <int>  <int>            <dbl>
##  1 24sata.hr                1968 177938            1.11 
##  2 slobodnadalmacija.hr     3450 314073            1.10 
##  3 glasistre.hr              558  53506            1.04 
##  4 jutarnji.hr              3271 315739            1.04 
##  5 glas-slavonije.hr         957  93121            1.03 
##  6 net.hr                   1081 106967            1.01 
##  7 tportal.hr                689  72569            0.949
##  8 novilist.hr              1407 148393            0.948
##  9 rtl.hr                    676  75066            0.901
## 10 dnevnik.hr                897 108804            0.824

newsCOVID_tokenTidy %>% 
  filter(thread.site %in% najveceDomene) %>%
  semi_join(CroSentilex_Gold_poz, by= "word") %>%
  group_by(thread.site) %>% 
  summarise(pozWords = n()) %>%
  left_join(wCount, by = "thread.site") %>%
  mutate(pozitivnostIndex = (pozWords/word)*100) %>%
  arrange(desc(pozitivnostIndex))

## # A tibble: 10 x 4
##    thread.site          pozWords   word pozitivnostIndex
##    <chr>                   <int>  <int>            <dbl>
##  1 24sata.hr                3524 177938             1.98
##  2 glas-slavonije.hr        1835  93121             1.97
##  3 glasistre.hr             1024  53506             1.91
##  4 slobodnadalmacija.hr     6008 314073             1.91
##  5 novilist.hr              2777 148393             1.87
##  6 net.hr                   1973 106967             1.84
##  7 rtl.hr                   1318  75066             1.76
##  8 jutarnji.hr              5541 315739             1.75
##  9 tportal.hr               1216  72569             1.68
## 10 dnevnik.hr               1693 108804             1.56

Analiza frekvencija riječi

Poslije analize sentimenta je korisno analizirati najbitnije riječi, a to je moguće putem IDF (inverse document frequency) metode. IDF metoda omogućuje identifikaciju važnih, a ne nužno čestih riječi u korpusu i može poslužiti za analizu najvažnijih pojmova po domenama.

## Udio riječi po domenama

domenaWords <- newsCOVID %>%
  unnest_tokens(word,text) %>% 
  count(thread.site, word, sort = T)
  
ukupnoWords <- domenaWords %>%
  group_by(thread.site) %>%
  summarise(totWords = sum(n))

domenaWords <- left_join(domenaWords, ukupnoWords)


domenaWords %>% head(15)

## # A tibble: 15 x 4
##    thread.site          word      n totWords
##    <chr>                <chr> <int>    <int>
##  1 jutarnji.hr          je    19475   522108
##  2 icv.hr               je    18476   324085
##  3 jutarnji.hr          u     18094   522108
##  4 slobodnadalmacija.hr je    17630   524303
##  5 slobodnadalmacija.hr i     17545   524303
##  6 slobodnadalmacija.hr u     17010   524303
##  7 jutarnji.hr          i     16185   522108
##  8 slobodnadalmacija.hr se    10150   524303
##  9 24sata.hr            je    10070   294027
## 10 24sata.hr            u      9764   294027
## 11 24sata.hr            i      9689   294027
## 12 icv.hr               i      9578   324085
## 13 jutarnji.hr          se     9482   522108
## 14 icv.hr               u      9313   324085
## 15 jutarnji.hr          na     9174   522108

domenaWords %>% filter(thread.site %in% najveceDomene) %>%
ggplot(., aes(n/totWords, fill = thread.site)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~thread.site, ncol = 2, scales = "free_y")

## Najbitnije riječi po domenma

idf <- domenaWords %>%
  bind_tf_idf(word, thread.site, n)

idf %>% head(10)

## # A tibble: 10 x 7
##    thread.site          word      n totWords     tf     idf   tf_idf
##    <chr>                <chr> <int>    <int>  <dbl>   <dbl>    <dbl>
##  1 jutarnji.hr          je    19475   522108 0.0373 0.00627 0.000234
##  2 icv.hr               je    18476   324085 0.0570 0.00627 0.000357
##  3 jutarnji.hr          u     18094   522108 0.0347 0       0       
##  4 slobodnadalmacija.hr je    17630   524303 0.0336 0.00627 0.000211
##  5 slobodnadalmacija.hr i     17545   524303 0.0335 0       0       
##  6 slobodnadalmacija.hr u     17010   524303 0.0324 0       0       
##  7 jutarnji.hr          i     16185   522108 0.0310 0       0       
##  8 slobodnadalmacija.hr se    10150   524303 0.0194 0.00627 0.000121
##  9 24sata.hr            je    10070   294027 0.0342 0.00627 0.000215
## 10 24sata.hr            u      9764   294027 0.0332 0       0

idf %>% 
  select(-totWords) %>%
  arrange(desc(tf_idf))

## # A tibble: 806,633 x 6
##    thread.site        word          n     tf   idf tf_idf
##    <chr>              <chr>     <int>  <dbl> <dbl>  <dbl>
##  1 infozona.hr        dvk           4 0.0248  5.08 0.126 
##  2 lokalnahrvatska.hr iframes      90 0.0206  5.08 0.104 
##  3 zse.hr             d.d          13 0.0565  1.78 0.101 
##  4 serijala.com       knope         3 0.0180  5.08 0.0912
##  5 serijala.com       leslie        3 0.0180  5.08 0.0912
##  6 lokalnahrvatska.hr browser      90 0.0206  4.38 0.0902
##  7 lokalnahrvatska.hr p           180 0.0412  1.86 0.0764
##  8 zse.hr             ažurirana     4 0.0174  4.38 0.0762
##  9 pcchip.hr          capcom        9 0.0168  4.38 0.0734
## 10 soundguardian.com  novoselić     3 0.0144  5.08 0.0728
## # ... with 806,623 more rows

idf %>%
  filter(thread.site %in% najveceDomene5) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  mutate(thread.site = factor(thread.site, levels = najveceDomene5)) %>%
  group_by(thread.site) %>% 
  top_n(10) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = thread.site)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~thread.site, ncol = 2, scales = "free") +
  coord_flip()

nGrami

Do sada smo analizirali tekst na način da je (jedna) riječ osnova za analizu. Taj način može sakriti bitne nalaze do kojih je moguće doći ukoliko se tekst tokenizira na fraze (dvije ili N riječi). U sljedećemo koraku ćemo tokenizirati tekst na bigrame (dvije riječi) kako bismo proveli frazeološku analizu. Korištenje bigrama omogućava korištenje dodatnih metoda pa ćemo provesti i analizu korelacije među riječima.

newsCOVID_bigram <- newsCOVID %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

newsCOVID_bigram %>% head(10)

## # A tibble: 10 x 8
##    url          author published  title      thread.site clanak domenaBr bigram 
##    <chr>        <chr>  <date>     <chr>      <chr>        <int>    <int> <chr>  
##  1 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 francu~
##  2 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 nobelo~
##  3 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 korona~
##  4 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 je sig~
##  5 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 sigurn~
##  6 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 nastao~
##  7 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 u labo~
##  8 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 labora~
##  9 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 u wuha~
## 10 http://alex~ ""     2020-04-19 Francuski~ alexandar.~   1712        2 wuhanu~

newsCOVID_bigram %>%
  count(bigram, sort = T) %>%
  head(15)

## # A tibble: 15 x 2
##    bigram              n
##    <chr>           <int>
##  1 covid 19        12222
##  2 da je           12058
##  3 da se           11829
##  4 je u             8871
##  5 rekao je         6520
##  6 koji su          6474
##  7 je da            6090
##  8 civilne zaštite  6077
##  9 će se            5765
## 10 što je           5706
## 11 da će            5429
## 12 kako bi          5176
## 13 da su            4953
## 14 u hrvatskoj      4319
## 15 je i             4304

newsCOVID_bigram_sep <- newsCOVID_bigram %>%
  separate(bigram, c("word1","word2"), sep = " ")

newsCOVID_bigram_tidy <- newsCOVID_bigram_sep %>%
  filter(!word1 %in% stop_corpus$word) %>%
  filter(!word2 %in% stop_corpus$word) %>%
  mutate(word1 = gsub("\\d+", NA, word1)) %>%
  mutate(word2 = gsub("\\d+", NA, word2)) %>%
  mutate(word1 = gsub("^[a-zA-Z]$", NA, word1)) %>%
  mutate(word2 = gsub("^[a-zA-Z]$", NA, word2)) %>% 
  drop_na(.)


newsCOVID_bigram_tidy_bigram_counts <- newsCOVID_bigram_tidy %>% 
  count(word1, word2, sort = TRUE)

newsCOVID_bigram_tidy_bigram_counts

## # A tibble: 691,620 x 3
##    word1       word2             n
##    <chr>       <chr>         <int>
##  1 civilne     zaštite        6077
##  2 stožera     civilne        2958
##  3 stožer      civilne        1929
##  4 javno       zdravstvo      1575
##  5 nacionalnog stožera        1404
##  6 bolesti     covid          1269
##  7 zaraze      koronavirusom  1262
##  8 vili        beroš          1255
##  9 ministar    zdravstva      1205
## 10 novih       slučajeva      1136
## # ... with 691,610 more rows

bigrams_united <- newsCOVID_bigram_tidy %>%
  drop_na(.) %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united

## # A tibble: 1,712,053 x 8
##    url          author published  title     thread.site clanak domenaBr bigram  
##    <chr>        <chr>  <date>     <chr>     <chr>        <int>    <int> <chr>   
##  1 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 francus~
##  2 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 nobelov~
##  3 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 sigurno~
##  4 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 wuhanu ~
##  5 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 foto epa
##  6 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 epa efe 
##  7 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 efe kon~
##  8 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 kontrov~
##  9 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 francus~
## 10 http://alex~ ""     2020-04-19 Francusk~ alexandar.~   1712        2 znanstv~
## # ... with 1,712,043 more rows

bigrams_united %>% 
  count(clanak,bigram,sort = T) -> topicBigram

# Najvažniji bigrami po domenama

 bigram_tf_idf <- bigrams_united %>%
  count(thread.site, bigram) %>%
  bind_tf_idf(bigram, thread.site, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf %>%
  filter(thread.site %in% najveceDomene5) %>%
  arrange(desc(tf_idf)) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>% 
  mutate(thread.site = factor(thread.site, levels = najveceDomene5)) %>%
  group_by(thread.site) %>% 
  top_n(10) %>% 
  ungroup() %>%
  ggplot(aes(bigram, tf_idf, fill = thread.site)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~thread.site, ncol = 2, scales = "free") +
  coord_flip()

# Analiza bigramskih fraza

newsCOVID_bigram_tidy %>%
  filter(word1 == "covid") %>%
  count(word1,word2,sort=T)

## # A tibble: 208 x 3
##    word1 word2          n
##    <chr> <chr>      <int>
##  1 covid odjelu       164
##  2 covid infekciju     67
##  3 covid ambulante     64
##  4 covid score         56
##  5 covid odjel         52
##  6 covid ambulanti     42
##  7 covid bolnici       42
##  8 covid odjela        39
##  9 covid pozitivnih    34
## 10 covid pozitivna     20
## # ... with 198 more rows

# Vizualiziraj bigrame

bigram_graph <- newsCOVID_bigram_tidy_bigram_counts %>%
  filter(n>950) %>%
   graph_from_data_frame()

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

# Korelacije riječi ( R crash na T=30)

newsCOVID_tokenTidy %>% 
  filter(published == "2020-04-22") %>%
  pairwise_count(word, domenaBr, sort = T) %>%
  filter_all(any_vars(!is.na(.))) -> pairsWords

newsCOVID_tokenTidy %>% 
  filter(published > "2020-04-20") %>%
  group_by(word) %>%
  filter(n() > 20) %>%
  filter(!is.na(word)) %>%
  pairwise_cor(word,thread.site, sort = T) -> corsWords

corsWords %>%
  filter(item1 == "oporavak")

## # A tibble: 4,567 x 3
##    item1    item2        correlation
##    <chr>    <chr>              <dbl>
##  1 oporavak period             0.666
##  2 oporavak odlučila           0.640
##  3 oporavak siječnju           0.634
##  4 oporavak kafići             0.630
##  5 oporavak linija             0.629
##  6 oporavak who                0.616
##  7 oporavak negativan          0.612
##  8 oporavak brojke             0.607
##  9 oporavak koronavirusu       0.604
## 10 oporavak policijske         0.601
## # ... with 4,557 more rows

corsWords %>%
  filter(item1 %in% c("kriza", "gospodarstvo", "oporavak", "mjere")) %>%
  group_by(item1) %>%
  top_n(8) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()

Tematska analiza

Na kraju provodimo tematsku analizu kao najsloženiji dio do sada provedene analize. Pri tome koristimo LDA(Latent Dirichlet allocation) algoritam kako bismo pronašli najvažnije riječi u algoritamski identificiranim temama. Ovdje je važno primijetiti da prije provedbe LDA modela valja tokeniirane riječi pretvoriti u matricu pojmova (document term matrix) koju ćemo kasnije koristiti kao input za LDA algoritam.

newsCOVID_tokenTidy %>%
  count(clanak, word, sort = TRUE) %>%
  cast_dtm(clanak, word,n) -> dtm

newsCOVID_LDA <- LDA(dtm, k = 3,  control = list(seed = 1234))

newsCOVID_LDA_tidy <- tidy(newsCOVID_LDA, matrix = "beta")
newsCOVID_LDA_tidy

## # A tibble: 381,168 x 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 osoba    0.000245 
##  2     2 osoba    0.0133   
##  3     3 osoba    0.00101  
##  4     1 the      0.000456 
##  5     2 the      0.000128 
##  6     3 the      0.000375 
##  7     1 ukupno   0.000215 
##  8     2 ukupno   0.00633  
##  9     3 ukupno   0.0000243
## 10     1 ministar 0.000407 
## # ... with 381,158 more rows

newsCOVID_terms <- newsCOVID_LDA_tidy %>%
  drop_na(.) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

newsCOVID_terms

## # A tibble: 30 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 covid        0.00405
##  2     1 mjere        0.00295
##  3     1 mjera        0.00263
##  4     1 pandemije    0.00245
##  5     1 vrijeme      0.00232
##  6     1 koronavirusa 0.00214
##  7     1 kuna         0.00209
##  8     1 sada         0.00204
##  9     1 mogu         0.00190
## 10     1 hrvatske     0.00189
## # ... with 20 more rows

newsCOVID_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()

# Bigrami topic

topicBigram %>%
  cast_dtm(clanak, bigram,n) -> dtmB

newsCOVID_LDA <- LDA(dtmB, k = 3,  control = list(seed = 1234))

newsCOVID_LDA_tidy <- tidy(newsCOVID_LDA, matrix = "beta")
newsCOVID_LDA_tidy

newsCOVID_terms <- newsCOVID_LDA_tidy %>%
  drop_na(.) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

newsCOVID_terms


newsCOVID_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() + 
  theme_economist()

ANALIZA TEKSTA

dr.sc.Luka Sikic