We can all agree that the past two years have been among the most eventful for the start-up ecosystem, with many ups and downs in a short time. Breaking news about a start-up can be followed just by watching how social media users react: they voluntarily share opinions, reviews, and even anonymous ‘tea spilling’, and online news portals broadcast these opinions in real time.

Thanks to Twitter and its free access, we will dive into one of the most talked-about unicorn start-ups of late, following the agreement to merge two well-known start-ups in Indonesia: Gojek.


Introduction

Why “Gojek”?

According to Crunchbase data cited by Bisnis.com on Thursday (1/4/2021), Gojek took the acquisition route with Tokopedia, a marketplace platform, for an undisclosed value on March 9, 2021.

As one of the established start-ups in Indonesia, Gojek has a significant influence on the fast-paced mobility of society. Because it is a one-stop application covering transportation and everyday needs, Indonesians often share their conversations and experiences of using Gojek on social media, to the point that the application has become part of people’s lives.

Lately, news of the Gojek and Tokopedia merger has stirred up almost all levels of society, with pros and cons raised on various platforms, especially social media.

Objective

Social media takes the main spotlight here: we look at how people react to “Gojek” in keyword searches on Twitter and analyze how Gojek became a topic of conversation among Twitter users over the past few days.

The sentiment tendency of each Twitter status containing the keyword ‘Gojek’ will also be examined.

Library Preparation

library(ggplot2)
library(ggpubr)

Assign Dataset

The Gojek dataset consists of Twitter search results for the keyword ‘Gojek’, fetched with rtweet on April 4th, 2021. I will publish the instructions and steps for fetching the Twitter data on my RPubs page soon.

gojek<-read.csv("data/gojek.csv")
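One thing worth checking when reading tweet data from a CSV file is that tweet and user IDs are long digit strings, which read.csv may parse as numbers and silently round. The sketch below uses a hypothetical three-column file written to a temporary path (not the real data/gojek.csv) and assumes the IDs are stored as plain digits; it forces the ID columns to character with colClasses.

```r
# Hypothetical toy file standing in for data/gojek.csv
toy <- data.frame(
  status_id = c("1378612345678901234", "1378612345678901235"),
  user_id   = c("9007199254740993", "42"),
  text      = c("naik gojek", "pesan gofood"),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read the ID columns as text so no digits are lost to numeric rounding
tweets <- read.csv(
  path,
  colClasses = c(status_id = "character", user_id = "character"),
  stringsAsFactors = FALSE
)

dim(tweets)                        # observations and variables actually loaded
length(unique(tweets$status_id))   # both IDs stay distinct
```

The same colClasses argument could be passed when reading the real file, provided its ID columns carry these names.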

The fetched data consists of 90 variables and 4300 observations. The main data in this dataset is shown below:

gojek

In this section we use only some of the variables, the ones that are our main concern for this project. The other variables are kept rather than deleted. The data contains the following information:

  • text: the status (tweet) text posted by the Twitter user
  • source: the device / platform the tweet was posted from
  • hashtags: hashtags fetched by the Twitter API from the user’s status
  • favorite_count: total number of likes
  • retweet_count: total number of retweets (quote/unquote)
  • status_id: ID of the tweet status
  • user_id: ID of the user
  • screen_name: Twitter username

Drop Unnecessary Rows by Keyword

This dataset may contain many unnecessary entries (from bot users or even spam). On April 4th, the keywords below were found to be unrelated: they belonged to other trends or had no relationship with the main keyword.

We use grep to identify these keywords in the strings and, by setting invert to TRUE, drop the tweets that contain them.

gojek<-gojek[ grep("renjun|RENJUN|Renjun|misi|hadiah|Militan|Batas|Jl", gojek$text, invert = TRUE) , ]
gojek<-gojek[ grep("bot|Bot", gojek$source, invert = TRUE) , ]
gojek<-gojek[ grep("bot|Bot", gojek$description, invert = TRUE) , ]
gojek<-gojek[ grep("judi|Judi|JUDI", gojek$hashtags, invert = TRUE) , ]


gojek
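As a minimal sketch of the same filtering idea, the example below builds one logical mask per rule with grepl and uses ignore.case = TRUE instead of listing every capitalisation by hand. The toy tweets and the shortened pattern are hypothetical, and ignore.case matches more spellings than the patterns above, so this is an alternative rather than an exact equivalent.

```r
# Hypothetical toy data standing in for the fetched tweets
tweets <- data.frame(
  text = c("Naik Gojek pagi ini", "RENJUN giveaway hadiah", "gojek promo", NA),
  source = c("Twitter for Android", "Cheap Bots, Done Quick!",
             "Twitter Web App", "Twitter for iPhone"),
  stringsAsFactors = FALSE
)

# One logical mask per rule; grepl returns FALSE for missing text
is_spam <- grepl("renjun|hadiah", tweets$text, ignore.case = TRUE)
is_bot  <- grepl("bot", tweets$source, ignore.case = TRUE)

# Keep only rows flagged by neither rule
clean <- tweets[!is_spam & !is_bot, , drop = FALSE]
nrow(clean)
```

Keeping the masks as separate vectors also makes it easy to count how many rows each rule removes, for example with sum(is_spam).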

Data Splitting

After the major cleaning, we divide the dataset into three sections: original (og), retweets (rt), and replies (rp). These subsets will be used later for plotting.

# retweets only
gojek_rt<-gojek[gojek$is_retweet==TRUE,]
# replies only
gojek_rp <- subset(gojek, !is.na(gojek$reply_to_status_id))
# original timeline tweets only (neither a retweet nor a reply)
gojek_og<-subset(gojek, (!status_id %in% gojek_rp$status_id)&(!status_id %in% gojek_rt$status_id))
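A quick sanity check for a split like this is to confirm that the three subsets together account for every row. The sketch below does that on a hypothetical four-row data frame that only borrows the column names used above.

```r
# Hypothetical toy data with the three columns the split relies on
toy <- data.frame(
  status_id = c("1", "2", "3", "4"),
  is_retweet = c(FALSE, TRUE, FALSE, FALSE),
  reply_to_status_id = c(NA, NA, "1", NA),
  stringsAsFactors = FALSE
)

toy_rt <- toy[which(toy$is_retweet), ]                # retweets
toy_rp <- subset(toy, !is.na(reply_to_status_id))     # replies
toy_og <- subset(toy, !status_id %in% c(toy_rt$status_id, toy_rp$status_id))

# A row could be both a retweet and a reply, so subtract any overlap
overlap <- intersect(toy_rt$status_id, toy_rp$status_id)
total <- nrow(toy_og) + nrow(toy_rt) + nrow(toy_rp) - length(overlap)
total == nrow(toy)
```

Using which() for the retweet subset is a small design choice: it skips rows where is_retweet is missing instead of returning rows full of NA.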

Overview

Tweet by users

We get tweet activity counts per user by applying the aggregate function to each of the three subsets.

gojek_users<-aggregate(user_id ~ screen_name,
                         data = gojek_og, 
                         FUN = length)
gojek_rt_users<-aggregate(user_id ~ screen_name,
                         data = gojek_rt, 
                         FUN = length)
gojek_rp_users<-aggregate(user_id ~ screen_name,
                         data = gojek_rp, 
                         FUN = length)

m1<- merge(gojek_users,gojek_rp_users,by='screen_name', all=T)
m1<- merge(m1, gojek_rt_users, by='screen_name', all=T)

m1[is.na(m1)] <- 0

plot1<-head(m1[order(m1$user_id.x, decreasing = T),], 6)
plot2<-head(m1[order(m1$user_id.y, decreasing = T),], 6)
plot3<-head(m1[order(m1$user_id, decreasing = T),], 6)
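The column names user_id.x, user_id.y and user_id come from how merge handles clashing names: the first merge adds the suffixes .x and .y, and the second merge brings in a plain user_id because nothing clashes with it any more. Here is a small sketch on hypothetical toy data that shows the names and then renames them to something more readable (the helper count_by_user and the new names are my own, for illustration only).

```r
# Hypothetical toy subsets: original tweets, replies, retweets
og <- data.frame(screen_name = c("a", "a", "b"), user_id = c("1", "1", "2"))
rp <- data.frame(screen_name = c("a", "c"),      user_id = c("1", "3"))
rt <- data.frame(screen_name = c("b", "b", "c"), user_id = c("2", "2", "3"))

# Illustrative helper: number of rows per screen_name
count_by_user <- function(d) {
  aggregate(user_id ~ screen_name, data = d, FUN = length)
}

m <- merge(count_by_user(og), count_by_user(rp), by = "screen_name", all = TRUE)
m <- merge(m, count_by_user(rt), by = "screen_name", all = TRUE)
merged_names <- names(m)   # screen_name, user_id.x, user_id.y, user_id

m[is.na(m)] <- 0           # users missing from a subset get a count of 0
names(m) <- c("screen_name", "tweets", "replies", "retweets")
m
```

Renaming right after the merge keeps the later plotting code from depending on the automatic suffixes.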

After aggregating the variables needed, we visualize them with ggplot. Here I arrange the three plots side by side in one row with ggarrange, which is packaged in ggpubr, so they are easier to compare.

p1<-ggplot(plot1, aes(user_id.x,reorder(screen_name,user_id.x))) +
  geom_col(alpha=0.5, fill="#6ec0ff") +
  labs(title='Tweets by Users',
       x='',
       y='')+
  theme_minimal()+
  theme(panel.grid.major.x=element_line(colour = "white"),
        panel.grid.minor.y=element_line(colour = "white"))

p2<- ggplot(plot2, aes(user_id.y,reorder(screen_name,user_id.y))) +
  geom_col(alpha=0.5, fill="#ff6ec0") +
  labs(title='Replies by Users',
       x='',
       y='')+
  theme_minimal()+
  theme(panel.grid.major.x=element_line(colour = "white"),
        panel.grid.minor.y=element_line(colour = "white"))
  
p3<- ggplot(plot3, aes(user_id,reorder(screen_name,user_id))) +
  geom_col(alpha=0.5, fill="#c0ff6e") +
  labs(title='Retweet by Users',
       x='',
       y='')+
  theme_minimal()+
  theme(panel.grid.major.x=element_line(colour = "white"),
        panel.grid.minor.y=element_line(colour = "white"))

 arr<-ggarrange(p1, p2, p3,
          ncol = 3)
 
annotate_figure(arr,
               top = text_grob("Twitter User Summary\n", face = "bold", size = 18),
               bottom = text_grob("Source: twitter.com", hjust = 1, x = 1, size = 10)
               )

Based on these plots, on April 4th, 2021, the top 6 users posted an average of about 4 tweets containing the keyword ‘Gojek’. For replies and retweets, the top 6 users averaged about 20 and 10 tweets respectively. The user Abang_Gojek_ appears to have been especially active in replying to and retweeting ‘Gojek’ content that day.
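An average over the top 6 users can be computed directly from a table of counts rather than read off the plot. The sketch below shows the arithmetic on hypothetical counts; the numbers are made up and are not the Gojek figures.

```r
# Hypothetical per-user tweet counts (not the real data)
counts <- data.frame(
  screen_name = c("u1", "u2", "u3", "u4", "u5", "u6", "u7"),
  n = c(12, 7, 5, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Six most active users, then their mean number of tweets
top6 <- head(counts[order(counts$n, decreasing = TRUE), ], 6)
avg_top6 <- mean(top6$n)
avg_top6
```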

Most Favorited Tweets

plot

ggplot(gojek_og, aes(retweet_count,favorite_count, color=favorite_count, size=retweet_count)) + 
  geom_point(alpha=0.3) +
  labs(
    title="Most Favourited Tweet",
    subtitle="Based on Favorite/Retweet count per tweet",
    caption="source: twitter.com",
    x="Retweet",
    y="Favorite"
  )+
  theme_minimal()+
  theme(legend.position = "none")
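The scatter plot shows how favorites and retweets are distributed, but not which tweets sit in the top corner. A minimal sketch for listing them is to sort by favorite_count, shown here on hypothetical toy data that only reuses the column names from the dataset.

```r
# Hypothetical toy data with the columns used in the scatter plot
toy_og <- data.frame(
  screen_name = c("a", "b", "c", "d"),
  favorite_count = c(3, 120, 45, 0),
  retweet_count = c(1, 30, 60, 0),
  stringsAsFactors = FALSE
)

# Sort by favorites first, retweets as the tie-breaker, both descending
ranked <- toy_og[order(toy_og$favorite_count, toy_og$retweet_count,
                       decreasing = TRUE), ]
top3 <- head(ranked, 3)
top3
```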

Tweet Published

Tweets come from many kinds of apps and third-party integrations. This plot shows which apps users posted from the most on April 4th, 2021, using a treemap.

gojek_publish_count<-aggregate(user_id~source,
                               data=gojek,
                               FUN=length)



library(treemap)
treemap(gojek_publish_count, 
        index = "source",
        vSize = "user_id",
        type = "index",
        palette = "Blues",
        
        border.col="white",            
        border.lwds=0,
        fontface.labels=1,
        
        title = "Tweet Source\n"
)

From the treemap, Android users still lead on ‘Gojek’-related topics, followed by Apple users and browser users. Two third-party apps also appear in the top 5 tweet sources: Salesforce - Social Studio and divr.it.
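A ranking like this top 5 can also be read from a sorted table with each source’s share of the total. The sketch below assumes a count table shaped like gojek_publish_count (columns source and user_id); the source names and counts are hypothetical.

```r
# Hypothetical counts per tweet source (not the real figures)
source_count <- data.frame(
  source = c("Twitter for Android", "Twitter for iPhone", "Twitter Web App",
             "Other app A", "Other app B", "Other app C"),
  user_id = c(50, 25, 15, 5, 3, 2),
  stringsAsFactors = FALSE
)

# Share of all tweets per source, in percent
source_count$share_pct <- round(100 * source_count$user_id /
                                  sum(source_count$user_id), 1)

# Five largest sources
top5 <- head(source_count[order(source_count$user_id, decreasing = TRUE), ], 5)
top5
```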


Text Analysis

An analysis of text data isn’t complete without some text / word analysis. From this dataset, I have already built a one-keyword-per-row dataset called gojek_clean, derived from the former dataset. I will post the cleaning, wrangling, and normalization into proper Indonesian words based on the Colloquial Indonesian Lexicon on my RPubs soon.

Getting the dataset

This dataset also goes through some additional cleaning with grep.

# use the cleaned text data
gojek_teks<-read.csv("data/gojek_clean.csv")

# drop filler words; grep matches substrings, so any word containing one of these patterns is removed
gojek_teks<-gojek_teks[grep("c|C|face|joy|tears|of|with|ya|kalo|kayak|banget|gua|gue|gw|gojek|deh|sih|tau", gojek_teks$word, invert = TRUE) , ]

gojek_teks<-as.data.frame(gojek_teks)
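Because grep matches anywhere inside a string, a short pattern such as c or ya also removes longer words that merely contain it. If the intent is to drop only the filler words themselves, an exact match with %in% is one alternative. The sketch below compares the two on a hypothetical word list with an illustrative three-word stopword subset; it is not the filter used for the plots in this post.

```r
# Hypothetical word column and an illustrative subset of the stopwords
words <- data.frame(
  word = c("driver", "cepat", "saya", "ya", "gojek", "bayar"),
  stringsAsFactors = FALSE
)
stopwords <- c("c", "ya", "gojek")

# Substring matching, as grep does: drops every word containing a pattern
pattern <- paste(stopwords, collapse = "|")
by_substring <- words[!grepl(pattern, words$word), , drop = FALSE]

# Exact matching: drops only words identical to a stopword
by_exact <- words[!(tolower(words$word) %in% stopwords), , drop = FALSE]

by_substring$word
by_exact$word
```

The drop = FALSE argument keeps the result a data frame even though it has a single column.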

Most Frequent Keyword

teks_group<-aggregate(gojek_teks,
                      gojek_teks,
                      FUN='length')

colnames(teks_group)<-c("word", "total")

teks_group<-teks_group[order(teks_group$total, decreasing=T),]

plot4<-head(teks_group, 20)
ggplot(plot4, aes(total, reorder(word, total)))+
  geom_col(alpha=0.6, fill="#607F37")+
  labs(title='Most Frequent Keyword',
       subtitle="Main Keyword: GOJEK\n",
       x='',
       y='',
       caption="source: twitter.com")+
  theme_minimal()+
  theme(panel.grid.major.y=element_line(colour = "white"),
        panel.grid.minor.x=element_line(colour = "white"))
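Counting how often each word appears can also be done with table, which may be easier to read than aggregating a data frame by itself. This is a sketch on a hypothetical word vector; the column names word and total simply mirror the ones used above.

```r
# Hypothetical cleaned words, one per element
words <- c("promo", "driver", "promo", "order", "driver", "promo")

# Frequency table converted to a data frame with named columns
freq <- as.data.frame(table(word = words),
                      responseName = "total",
                      stringsAsFactors = FALSE)

# Most frequent words first
freq <- freq[order(freq$total, decreasing = TRUE), ]
head(freq, 20)
```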

Used Hashtag

gojek$hashtags <- as.character(gojek$hashtags)
gojek$hashtags <- gsub("c\\(", "", gojek$hashtags)

library(wordcloud)

set.seed(1234)
wordcloud(gojek$hashtags, min.freq=2, scale=c(3.5, .5), random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Blues"))
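To see the hashtag counts behind a word cloud as a table, the strings can be split into individual tags first. The sketch below assumes multi-hashtag cells are stored as text in the form c("a", "b"), which is what the gsub call above suggests; the sample values are hypothetical.

```r
# Hypothetical hashtag cells: a serialized vector, a single tag, a missing value
raw <- c('c("Gojek", "Tokopedia")', "GoTo", NA, 'c("gojek", "GoTo")')

# Strip the c( ... ) wrapper and the quotes, then split on commas
cleaned <- gsub('c\\(|\\)|"', "", raw)
tags <- unlist(strsplit(cleaned[!is.na(cleaned)], ",\\s*"))

# Case-insensitive counts, most used first
tag_counts <- sort(table(tolower(tags)), decreasing = TRUE)
tag_counts
```

Lower-casing before counting treats Gojek and gojek as the same hashtag, which may or may not be what you want for a given analysis.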

Most retweeted account

set.seed(1234)
wordcloud(gojek_rt$retweet_screen_name, min.freq=5, scale=c(2, .5), random.order=FALSE, rot.per=0.25, 
          colors=brewer.pal(8, "PuBu"))