Assignment

English League vs. Spanish League on Twitter

The English league ("Liga Inggris") and the Spanish league ("Liga Spanyol") are widely regarded as two of the best football leagues in the world. Their players, managers, and epic matches draw football fans from around the globe. Both leagues are discussed on many social media platforms, Twitter among them. Below are some results from a text mining analysis of tweets about the two leagues.

Extracting Tweets

Retrieve tweets from Twitter

# Load packages
library(rtweet)
library(tidyverse)

# Twitter authentication
create_token(
  app             = "my_twitter_research_app",
  consumer_key    = consumer_key,
  consumer_secret = consumer_secret,
  access_token    = access_token,
  access_secret   = access_secret)

## <Token>
## <oauth_endpoint>
##  request:   https://api.twitter.com/oauth/request_token
##  authorize: https://api.twitter.com/oauth/authenticate
##  access:    https://api.twitter.com/oauth/access_token
## <oauth_app> my_twitter_research_app
##   key:    zXMq2IaQDXtXeTmrCYb0k8ym2
##   secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---

# Retrieve tweets
tweets <- search_tweets("Liga Inggris", n = 10000, tweet_mode="extended")

## Searching for tweets...

## Finished collecting tweets!

tweets <- distinct(tweets, text, .keep_all=TRUE)

# Retrieve tweets
tweetss <- search_tweets("Liga Spanyol", n = 10000, tweet_mode="extended")

## Searching for tweets...

## Finished collecting tweets!

tweetss <- distinct(tweetss, text, .keep_all=TRUE)

Tweets Description

## plot time series of tweets
ts_plot(tweets, "3 hours") +
  theme_minimal() +
  theme(plot.title = ggplot2::element_text(face = "bold")) +
  labs(
    x = NULL, y = NULL,
    title = "Frequency of Liga Inggris Twitter statuses",
    subtitle = "Twitter status (tweet) counts aggregated using three-hour intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )

## plot time series of tweets
ts_plot(tweetss, "3 hours") +
  theme_minimal() +
  theme(plot.title = ggplot2::element_text(face = "bold")) +
  labs(
    x = NULL, y = NULL,
    title = "Frequency of Liga Spanyol Twitter statuses",
    subtitle = "Twitter status (tweet) counts aggregated using three-hour intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )

The charts above show how often the phrases "Liga Inggris" and "Liga Spanyol" were tweeted, covering 3 November to 11 November. Both charts show a marked increase on 4 and 11 November, which coincides with big matches between the giant clubs of each league. In the English league, 4 November 2018 featured Arsenal vs. Man. City and Chelsea vs. Man. United, while the match that drew the most attention on 11 November 2018 was the Manchester derby, Manchester City vs. Manchester United. Besides the derby, another giant club in action that day was Arsenal, who played Wolves. In the Spanish league, Barcelona and Real Madrid also played their respective opponents.

From the two charts, we can conclude that the frequency of both "Liga Inggris" and "Liga Spanyol" tweets rises when the giant clubs of each country are playing.
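The three-hour aggregation behind the charts can be sketched in base R. This is only an illustration of the binning idea, not how `ts_plot` is implemented; the timestamps below are made up and stand in for `tweets$created_at`.

```r
# Hypothetical timestamps standing in for tweets$created_at (not real data)
created_at <- as.POSIXct(
  c("2018-11-04 13:05:00", "2018-11-04 14:40:00",
    "2018-11-04 16:10:00", "2018-11-11 17:30:00"),
  tz = "UTC"
)

# Assign each timestamp to a three-hour bin, then count tweets per bin
bins <- cut(created_at, breaks = "3 hours")
counts <- table(bins)

# Keep only the non-empty bins and list the busiest first
counts <- sort(counts[counts > 0], decreasing = TRUE)
print(counts)
```

A spike on a match day would show up as one or more bins with a much higher count than their neighbours.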

tail(tweets, 5)

## # A tibble: 5 x 88
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 245216~ 10583560~ 2018-11-02 13:52:08 totosudarm~ Emer~ dlvr.~
## 2 185533~ 10583560~ 2018-11-02 13:52:06 sonardi_ro~ Emer~ dlvr.~
## 3 128468~ 10583500~ 2018-11-02 13:28:03 Geol_Goal   FPL ~ dlvr.~
## 4 535516~ 10583475~ 2018-11-02 13:18:04 KnuckleHea~ Goal~ dlvr.~
## 5 585303~ 10583429~ 2018-11-02 12:59:59 muhperi_sa~ @beg~ Twitt~
## # ... with 82 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, hashtags <list>,
## #   symbols <list>, urls_url <list>, urls_t.co <list>,
## #   urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## #   media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## #   ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>,
## #   mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## #   quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## #   quoted_favorite_count <int>, quoted_retweet_count <int>,
## #   quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## #   quoted_followers_count <int>, quoted_friends_count <int>,
## #   quoted_statuses_count <int>, quoted_location <chr>,
## #   quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>,
## #   retweet_created_at <dttm>, retweet_source <chr>,
## #   retweet_favorite_count <int>, retweet_retweet_count <int>,
## #   retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>,
## #   country <chr>, country_code <chr>, geo_coords <list>,
## #   coords_coords <list>, bbox_coords <list>, status_url <chr>,
## #   name <chr>, location <chr>, description <chr>, url <chr>,
## #   protected <lgl>, followers_count <int>, friends_count <int>,
## #   listed_count <int>, statuses_count <int>, favourites_count <int>,
## #   account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## #   profile_expanded_url <chr>, account_lang <chr>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

tail(tweetss, 5)

## # A tibble: 5 x 88
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 276006~ 10583393~ 2018-11-02 12:45:44 aditrimasr~ "#So~ Twitt~
## 2 280128~ 10583339~ 2018-11-02 12:24:06 sekelas_gw  Jadw~ dlvr.~
## 3 120661~ 10583330~ 2018-11-02 12:20:26 indhk       @Nov~ Twitt~
## 4 234015~ 10583289~ 2018-11-02 12:04:08 suaradotcom Jadw~ dlvr.~
## 5 163193~ 10583288~ 2018-11-02 12:04:03 EPras92     "Lov~ Twitt~
## # ... with 82 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, hashtags <list>,
## #   symbols <list>, urls_url <list>, urls_t.co <list>,
## #   urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## #   media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## #   ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>,
## #   mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## #   quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## #   quoted_favorite_count <int>, quoted_retweet_count <int>,
## #   quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## #   quoted_followers_count <int>, quoted_friends_count <int>,
## #   quoted_statuses_count <int>, quoted_location <chr>,
## #   quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>,
## #   retweet_created_at <dttm>, retweet_source <chr>,
## #   retweet_favorite_count <int>, retweet_retweet_count <int>,
## #   retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>,
## #   country <chr>, country_code <chr>, geo_coords <list>,
## #   coords_coords <list>, bbox_coords <list>, status_url <chr>,
## #   name <chr>, location <chr>, description <chr>, url <chr>,
## #   protected <lgl>, followers_count <int>, friends_count <int>,
## #   listed_count <int>, statuses_count <int>, favourites_count <int>,
## #   account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## #   profile_expanded_url <chr>, account_lang <chr>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(ggplot2)

Build corpus

# build a corpus, and specify the source to be character vectors 
myCorpus <- Corpus(VectorSource(tweets$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(myCorpus, content_transformer(tolower)):
## transformation drops documents

# remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))

## Warning in tm_map.SimpleCorpus(myCorpus, content_transformer(removeURL)):
## transformation drops documents

# remove anything other than English letters or space 
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x) 
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))

## Warning in tm_map.SimpleCorpus(myCorpus,
## content_transformer(removeNumPunct)): transformation drops documents

# remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")), "use", "see", "used", "via", "amp", "indihome")
stopwords_id <- read.table('H:/stopwords-id.txt', header = FALSE)
myStopwords <- c(myStopwords, as.matrix(stopwords_id$V1), "hi", "yg")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

## Warning in tm_map.SimpleCorpus(myCorpus, removeWords, myStopwords):
## transformation drops documents

# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)

## Warning in tm_map.SimpleCorpus(myCorpus, stripWhitespace): transformation
## drops documents

# keep a copy for stem completion later
myCorpusCopy <- myCorpus

myCorpuss <- Corpus(VectorSource(tweetss$text))
myCorpuss <- tm_map(myCorpuss, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(myCorpuss, content_transformer(tolower)):
## transformation drops documents

removeURLs <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpuss <- tm_map(myCorpuss, content_transformer(removeURLs))

## Warning in tm_map.SimpleCorpus(myCorpuss, content_transformer(removeURLs)):
## transformation drops documents

removeNumPuncts <- function(x) gsub("[^[:alpha:][:space:]]*", "", x) 
myCorpuss <- tm_map(myCorpuss, content_transformer(removeNumPuncts))

## Warning in tm_map.SimpleCorpus(myCorpuss,
## content_transformer(removeNumPuncts)): transformation drops documents

myStopwordss <- c(setdiff(stopwords('english'), c("r", "big")), "use", "see", "used", "via", "amp", "indihome")
stopwords_ids <- read.table('H:/stopwords-id.txt', header = FALSE)
myStopwordss <- c(myStopwordss, as.matrix(stopwords_ids$V1), "hi", "yg")
myCorpuss <- tm_map(myCorpuss, removeWords, myStopwordss)

## Warning in tm_map.SimpleCorpus(myCorpuss, removeWords, myStopwordss):
## transformation drops documents

myCorpuss <- tm_map(myCorpuss, stripWhitespace)

## Warning in tm_map.SimpleCorpus(myCorpuss, stripWhitespace): transformation
## drops documents

myCorpusCopys <- myCorpuss
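The same cleaning steps are applied to both corpora, so they could be gathered into one function. Below is a minimal base R sketch of that idea; `clean_tweet` is a hypothetical helper, not part of `tm`, and the sample tweet is invented for illustration. The whole-word stopword pattern only approximates what `removeWords` does.

```r
# Hypothetical helper combining the cleaning steps above; base R only
clean_tweet <- function(x, stopwords = character(0)) {
  x <- tolower(x)
  x <- gsub("http[^[:space:]]*", "", x)        # drop URLs
  x <- gsub("[^[:alpha:][:space:]]", "", x)    # keep letters and whitespace only
  if (length(stopwords) > 0) {
    # remove stopwords as whole words
    pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of whitespace
  trimws(x)
}

# Invented example tweet, not taken from the collected data
sample_tweet <- "Hasil Liga Inggris: Arsenal 1-1 Wolves via https://t.co/abc123 yg seru!"
cleaned <- clean_tweet(sample_tweet, stopwords = c("via", "yg"))
print(cleaned)
# [1] "hasil liga inggris arsenal wolves seru"
```

Applying one function to both `tweets$text` and `tweetss$text` would also avoid mismatched variable names between the two copies of the pipeline.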

Frequent Words

Build Term Document Matrix

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

tdms <- TermDocumentMatrix(myCorpuss, control = list(wordLengths = c(1, Inf)))

tdm

## <<TermDocumentMatrix (terms: 4368, documents: 3972)>>
## Non-/sparse entries: 34601/17315095
## Sparsity           : 100%
## Maximal term length: 32
## Weighting          : term frequency (tf)

tdms

## <<TermDocumentMatrix (terms: 1566, documents: 1145)>>
## Non-/sparse entries: 9902/1783168
## Sparsity           : 99%
## Maximal term length: 27
## Weighting          : term frequency (tf)
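The "Sparsity" line in both summaries can be checked by hand, assuming it reports the share of zero cells in the matrix, rounded to a whole percent. The sketch below uses only the figures printed above; `sparsity` is a hypothetical helper name.

```r
# Share of empty cells in a term-document matrix,
# using the counts copied from the printed summaries above
sparsity <- function(terms, documents, non_sparse) {
  1 - non_sparse / (terms * documents)
}

s_inggris <- sparsity(4368, 3972, 34601)   # Liga Inggris
s_spanyol <- sparsity(1566, 1145, 9902)    # Liga Spanyol

round(100 * c(inggris = s_inggris, spanyol = s_spanyol))
```

Both matrices are almost entirely zeros, which is expected for short texts such as tweets: each tweet uses only a handful of the thousands of terms in the vocabulary.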

Top Frequent Terms

freq.terms <- findFreqTerms(tdm, lowfreq = 20)

freq.terms[1:50]

##  [1] "inggris"     "klub"        "liga"        "masuk"       "malam"      
##  [6] "minggu"      "sepakbola"   "siaran"      "bermain"     "hasil"      
## [11] "klasemen"    "pekan"       "primer"      "arsenal"     "liverpool"  
## [16] "null"        "city"        "derby"       "laga"        "manchester" 
## [21] "martial"     "mourinho"    "rashford"    "chelsea"     "rekor"      
## [26] "terulang"    "juara"       "laju"        "menjaga"     "sarri"      
## [31] "hotspur"     "palace"      "poin"        "tottenham"   "ars"        
## [36] "gagal"       "gunners"     "kalahkan"    "menundukkan" "reds"       
## [41] "sulit"       "gol"         "guardiola"   "menang"      "pesta"      
## [46] "puas"        "babak"       "southampton" "unggul"      "aguero"

Top Frequent Terms

freq.termss <- findFreqTerms(tdms, lowfreq = 20)

freq.termss[1:50]

##  [1] "camp"       "nou"        "lionel"     "messi"      "pemain"    
##  [6] "piala"      "comeback"   "lawan"      "solari"     "madrid"    
## [11] "real"       "blancos"    "los"        "liga"       "menang"    
## [16] "santiago"   "barcelona"  "barca"      "rayo"       "detiksport"
## [21] "betis"      "main"       "pulih"      "puas"       "benzema"   
## [26] "gol"        "atletico"   "bilbao"     "dramatis"   "hasil"     
## [31] "spanyol"    "celta"      "vigo"       "vs"         "pekan"     
## [36] "wenger"     "pelatih"    "kalah"      "laga"       "chelsea"   
## [41] "bale"       "vinicius"   "kemenangan" "bermain"    "ramos"     
## [46] "vallecano"  "tim"        "leganes"    "valladolid" "puasa"

term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 200)
df <- data.frame(term = names(term.freq), freq = term.freq)

term.freqs <- rowSums(as.matrix(tdms))
term.freqs <- subset(term.freqs, term.freqs >= 200)
dfs <- data.frame(term = names(term.freqs), freq = term.freqs)

ggplot(df, aes(x=reorder(term,freq), y=freq)) + geom_bar(stat="identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text=element_text(size=7))

ggplot(dfs, aes(x=reorder(term,freq), y=freq)) + geom_bar(stat="identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text=element_text(size=7))

The bar charts above show the words that appear most often alongside "Liga Inggris" and "Liga Spanyol". Apart from the search phrases themselves, the most frequent words are club names. For the English league these include Manchester City and United, Liverpool, Chelsea, and Arsenal. It is no surprise that the names appearing most often belong to the league's giant clubs, most of which currently sit in the top five of this season's table, which makes them lively topics of discussion. The same holds for the Spanish league.

There is, however, a striking difference between the two sets of tweets: tweets about the English league contain a wider variety of words than tweets about the Spanish league (4,368 distinct terms versus 1,566).

This suggests that Twitter users in this sample are more interested in the English league than in the Spanish league; the search also returned more unique tweets for "Liga Inggris" (3,972) than for "Liga Spanyol" (1,145).
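Because the two corpora differ in size, raw vocabulary counts are not directly comparable. The sketch below puts the figures from the two term-document matrices side by side and adds two simple normalisations. The column names are illustrative, and the per-tweet ratio is a rough indicator only, since vocabulary does not grow linearly with the number of tweets.

```r
# Figures copied from the tdm and tdms summaries above
summary_df <- data.frame(
  league = c("Liga Inggris", "Liga Spanyol"),
  tweets = c(3972, 1145),   # documents after removing duplicate texts
  terms  = c(4368, 1566)    # distinct terms in each matrix
)

# Share of all collected unique tweets, and distinct terms per tweet
summary_df$tweet_share     <- summary_df$tweets / sum(summary_df$tweets)
summary_df$terms_per_tweet <- summary_df$terms / summary_df$tweets

print(summary_df)
```

On these numbers the English league accounts for roughly three quarters of the unique tweets, while the Spanish league corpus has slightly more distinct terms per tweet, so the tweet volume supports the interest claim more directly than the vocabulary size does.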

Wordcloud

Build Wordcloud

library(wordcloud)

## Loading required package: RColorBrewer

library(RColorBrewer)

m <- as.matrix(tdm)
# calculate the frequency of words and sort it by frequency 
word.freq <- sort(rowSums(m), decreasing = T)
# colors
pal <- brewer.pal(9, "BuGn")[-(1:4)]

ms <- as.matrix(tdms)
# calculate the frequency of words and sort it by frequency 
word.freqs <- sort(rowSums(ms), decreasing = T)
# colors
pals <- brewer.pal(9, "BuGn")[-(1:4)]

wordcloud(words = names(word.freq), freq = word.freq, min.freq = 50,
    random.order = F, colors = 'red')

wordcloud(words = names(word.freqs), freq = word.freqs, min.freq = 50,
    random.order = F, colors = 'blue')

The figures above visualize the words that most often appear alongside "Liga Inggris" and "Liga Spanyol". Every word shown has a frequency of at least 50. Besides club names, the names of managers also feature in the English league word cloud: Manchester United manager Mourinho and Manchester City manager Guardiola both appear. These two clubs attract particular attention because they compete for the league trophy year after year. In the Spanish league word cloud, the names that appear include the manager Solari and one of Barcelona's best strikers, Messi.

One thing is clear: the English and Spanish leagues have shown their class on the world stage, and they continue to leave an impression on football fans.

Shindy Sari Utami_06211540000005

12 November 2018
