Extracting Tweets

Retrieve tweets from Twitter

# Load packages
library(rtweet)
library(tidyverse)
library(wordcloud)
library(tm)
# Twitter authentication
create_token(
  app             = "my_twitter_research_app",
  consumer_key    = consumer_key,
  consumer_secret = consumer_secret,
  access_token    = access_token,
  access_secret   = access_secret)
## <Token>
## <oauth_endpoint>
##  request:   https://api.twitter.com/oauth/request_token
##  authorize: https://api.twitter.com/oauth/authenticate
##  access:    https://api.twitter.com/oauth/access_token
## <oauth_app> my_twitter_research_app
##   key:    1KdnTxM6HKLJxnC5d1ZiMmKcf
##   secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---
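The chunk above assumes that consumer_key, consumer_secret, access_token, and access_secret have already been defined in the session. A minimal sketch of one way to supply them without hard-coding secrets in the script, assuming the credentials are stored in environment variables (the TWITTER_* variable names are placeholders, not rtweet requirements):

# sketch: read API credentials from environment variables
# (the TWITTER_* names are assumptions, not rtweet requirements)
consumer_key    <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token    <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret   <- Sys.getenv("TWITTER_ACCESS_SECRET")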

Topic: President Donald Trump is a businessman, television personality, politician, and the 45th president of the United States, and he has attracted much controversy. It is therefore worth examining how Twitter users view and characterize the issues surrounding President Donald Trump, using the search term “trump”.

# Retrieve tweets
tweets <- search_tweets("trump", n = 10000, tweet_mode="extended")
## Searching for tweets...
## Finished collecting tweets!
# drop tweets with duplicate text (e.g. repeated retweets)
tweets <- distinct(tweets, text, .keep_all = TRUE)
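Because live search results change constantly, it can be useful to save the collected tweets so that later steps do not depend on re-querying the API. A small optional step (the file name here is an arbitrary choice):

# optional: persist the collected tweets for reproducibility
saveRDS(tweets, "trump_tweets.rds")
# in a later session: tweets <- readRDS("trump_tweets.rds")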

Tweet Description

## plot time series of tweets
ts_plot(tweets, "1 hours") +
  theme_minimal() +
  theme(plot.title = ggplot2::element_text(face = "bold")) +
  labs(
    x = NULL, y = NULL,
    title = "Frequency of trump Twitter statuses from past 9 days",
    subtitle = "Twitter status (tweet) counts aggregated using three-hour intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

# preview the last 20 rows of the collected tweets
tail(tweets, 20)
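For a more readable preview, it can help to select just a few columns; this sketch assumes the classic rtweet column names created_at, screen_name, and text:

# preview only the timestamp, author, and text columns
tweets %>%
  select(created_at, screen_name, text) %>%
  tail(20)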

Text Cleaning

library(tm)

Build Corpus

# build a corpus, and specify the source to be character vectors 
myCorpus <- Corpus(VectorSource(tweets$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# remove anything other than English letters or space 
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x) 
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
# remove stopwords (English defaults plus custom additions)
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")), "jd", "ri", "used", "via", "amp", "presiden")
# add Indonesian stopwords from a local word list (one word per line)
stopwords_id <- read.table('stopwords-id.txt', header = FALSE)
myStopwords <- c(myStopwords, as.character(stopwords_id$V1), "hi", "yg")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# keep a copy for stem completion later
myCorpusCopy <- myCorpus
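Note that the stem completion step itself is not run in this analysis (the term list below still contains unstemmed forms such as “trumps”). For reference, a minimal sketch of the usual tm stemming-plus-completion recipe, using the copy as the completion dictionary (completeStems is a hypothetical helper):

# sketch (not run here): stem the corpus, then complete stems back to
# full words using the unstemmed copy as a dictionary
# (stemDocument requires the SnowballC package)
myCorpusStemmed <- tm_map(myCorpus, stemDocument)
completeStems <- function(x, dictionary) {
  words <- unlist(strsplit(as.character(x), " "))
  words <- stemCompletion(words[words != ""], dictionary = dictionary)
  paste(words, collapse = " ")
}
myCorpusStemmed <- Corpus(VectorSource(
  sapply(myCorpusStemmed, completeStems, dictionary = myCorpusCopy)))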

Frequent Words

Build Term Document Matrix

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
tdm
## <<TermDocumentMatrix (terms: 14966, documents: 3642)>>
## Non-/sparse entries: 60502/54445670
## Sparsity           : 100%
## Maximal term length: 77
## Weighting          : term frequency (tf)
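The term-document matrix can also be queried directly. For example, findAssocs() lists terms whose occurrence is correlated with a given word (the 0.2 threshold below is an arbitrary choice):

# terms associated with "trump" at a correlation of at least 0.2
findAssocs(tdm, "trump", 0.2)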

Top Frequent Terms

freq.terms <- findFreqTerms(tdm, lowfreq = 20)
freq.terms[1:50]
##  [1] "government" "house"      "policy"     "support"    "time"      
##  [6] "trumps"     "white"      "year"       "never"      "rain"      
## [11] "trump"      "ago"        "allies"     "america"    "american"  
## [16] "can"        "europe"     "european"   "imagine"    "news"      
## [21] "soldiers"   "clinton"    "donald"     "hillary"    "just"      
## [26] "base"       "corrupt"    "hate"       "like"       "made"      
## [31] "man"        "matter"     "men"        "old"        "political" 
## [36] "racist"     "voter"      "well"       "dont"       "good"      
## [41] "help"       "hope"       "im"         "let"        "life"      
## [46] "next"       "president"  "something"  "tell"       "use"
# count term frequencies and keep terms appearing at least 150 times
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 150)
df <- data.frame(term = names(term.freq), freq = term.freq)
# bar chart of the most frequent terms
ggplot(df, aes(x = term, y = freq)) + geom_col() +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text = element_text(size = 7))

Interpretation: The bar chart for the keyword “trump” shows that the word appearing most often in the tweets is “trump” itself, followed by “president” and “de”.
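By default ggplot2 orders the bars alphabetically; reordering the terms by frequency can make the chart easier to read, as in this optional variant:

# variant: order bars by frequency instead of alphabetically
ggplot(df, aes(x = reorder(term, freq), y = freq)) + geom_col() +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text = element_text(size = 7))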

Wordcloud

Build Wordcloud

library(wordcloud)
m <- as.matrix(tdm)
# calculate the frequency of words and sort it by frequency 
word.freq <- sort(rowSums(m), decreasing = T)
# use darker shades of the BuGn palette (drop the 5 lightest)
pal <- brewer.pal(9, "BuGn")[-(1:5)]
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 100,
    random.order = FALSE, colors = pal)

The wordcloud shows the words that appear most frequently in users' tweets. The result above shows that “trump” is the most frequent word, followed by “president” and “de”. The other frequent words can be seen in the wordcloud above.
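Since “trump” was the search term itself, it inevitably dominates the cloud. An optional variant that drops it can make the remaining vocabulary easier to see:

# variant: drop the search term so the other words become visible
word.freq2 <- word.freq[names(word.freq) != "trump"]
wordcloud(words = names(word.freq2), freq = word.freq2, min.freq = 100,
    random.order = FALSE, colors = pal)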