We can probably all agree that the past two years have been the most eventful period for the start-up ecosystem, with so many ups and downs in such a short time. Breaking news about a start-up can be tracked simply by watching how social media users react: they volunteer opinions, reviews, and even a little anonymous 'tea spilling', while online news portals broadcast those reactions in real time.
Thanks to Twitter and its freely accessible API, we will dive into one of the hottest unicorn start-ups of the moment, fresh from the agreement to merge two well-known Indonesian start-ups: Gojek.
Based on Crunchbase data cited by Bisnis.com on Thursday (1/4/2021), Gojek is known to have taken the acquisition route with Tokopedia, a marketplace platform, at an undisclosed value on March 9, 2021.
Gojek, as one of the most established start-ups in Indonesia, has a significant influence on the fast-paced mobility of society. As a one-stop application for transportation and everyday needs, it leads Indonesians to share conversations and experiences about using Gojek on social media, to the point that the application has become part of people's lives.
Lately, news of the Gojek and Tokopedia merger has become a topic that has stirred up almost all levels of society, with pros and cons raised on various platforms, especially social media.
The role of social media is the main spotlight here: how people react to "Gojek" as a search keyword on Twitter, and how Gojek has featured in Twitter users' conversations over the past few days.
Tweets matching the keyword 'Gojek' will also be examined for their sentiment tendencies.
library(ggplot2)
library(ggpubr)
The Gojek dataset consists of tweets returned by a 'Gojek' keyword search on Twitter, fetched with rtweet on April 4th, 2021. The instructions and steps for fetching the Twitter data will be published on my RPubs page soon.
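For reference, fetching a comparable dataset with rtweet roughly follows the sketch below. The query size, language filter, and output path are illustrative assumptions, not the exact call used for this dataset, and the arguments may differ slightly between rtweet versions.

library(rtweet)

# search recent tweets containing the keyword "gojek";
# n, lang, and the output path are illustrative choices only
gojek_raw <- search_tweets(
  q = "gojek",
  n = 4500,             # request a few thousand statuses
  include_rts = TRUE,   # keep retweets so they can be split out later
  lang = "id",          # assumption: restrict to Indonesian-language tweets
  retryonratelimit = TRUE
)

# save a CSV snapshot so the analysis below can start from a file
write_as_csv(gojek_raw, "data/gojek.csv")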
gojek <- read.csv("data/gojek.csv")
The fetched data consists of 90 variables and 4,300 observations; the main data in this dataset is shown below:
gojek
In this section we will only work with some of the variables, the ones most relevant to this project, although I will not delete the others. Among the variables used below are status_id, user_id, screen_name, text, source, description, hashtags, is_retweet, reply_to_status_id, retweet_count, favorite_count, and retweet_screen_name.
Apparently, this dataset contains a lot of unnecessary entries (from bot accounts or even spam). On April 4th, several keywords turned out to be unrelated to the main keyword, either riding other trends or having no relationship with 'Gojek' at all.
We use grep to identify these keywords in the text and drop the tweets that contain them (by setting invert to TRUE).
gojek <- gojek[grep("renjun|RENJUN|Renjun|misi|hadiah|Militan|Batas|Jl", gojek$text, invert = TRUE), ]
gojek <- gojek[grep("bot|Bot", gojek$source, invert = TRUE), ]
gojek <- gojek[grep("bot|Bot", gojek$description, invert = TRUE), ]
gojek <- gojek[grep("judi|Judi|JUDI", gojek$hashtags, invert = TRUE), ]

gojek
After this major cleaning, we will divide the dataset into three subsets: original tweets (og), retweets (rt), and replies (rp). These subsets will be used later when making the plots.
# retweets only
gojek_rt <- gojek[gojek$is_retweet == TRUE, ]

# replies only
gojek_rp <- subset(gojek, !is.na(gojek$reply_to_status_id))

# original timeline tweets only
gojek_og <- subset(gojek, (!status_id %in% gojek_rp$status_id) & (!status_id %in% gojek_rt$status_id))
Next, we count each user's tweet activity with the aggregate function, applied to each of the three subsets.
gojek_users <- aggregate(user_id ~ screen_name,
                         data = gojek_og,
                         FUN = length)
gojek_rt_users <- aggregate(user_id ~ screen_name,
                            data = gojek_rt,
                            FUN = length)
gojek_rp_users <- aggregate(user_id ~ screen_name,
                            data = gojek_rp,
                            FUN = length)

m1 <- merge(gojek_users, gojek_rp_users, by = 'screen_name', all = T)
m1 <- merge(m1, gojek_rt_users, by = 'screen_name', all = T)

m1[is.na(m1)] <- 0

plot1 <- head(m1[order(m1$user_id.x, decreasing = T), ], 6)
plot2 <- head(m1[order(m1$user_id.y, decreasing = T), ], 6)
plot3 <- head(m1[order(m1$user_id, decreasing = T), ], 6)
After aggregating the variables we need, we visualize them with ggplot. I arrange the three plots side by side in a single row using ggarrange from the ggpubr package, which makes them easier to compare.
p1 <- ggplot(plot1, aes(user_id.x, reorder(screen_name, user_id.x))) +
  geom_col(alpha = 0.5, fill = "#6ec0ff") +
  labs(title = 'Tweets by Users',
       x = '',
       y = '') +
  theme_minimal() +
  theme(panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.y = element_line(colour = "white"))

p2 <- ggplot(plot2, aes(user_id.y, reorder(screen_name, user_id.y))) +
  geom_col(alpha = 0.5, fill = "#ff6ec0") +
  labs(title = 'Replies by Users',
       x = '',
       y = '') +
  theme_minimal() +
  theme(panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.y = element_line(colour = "white"))

p3 <- ggplot(plot3, aes(user_id, reorder(screen_name, user_id))) +
  geom_col(alpha = 0.5, fill = "#c0ff6e") +
  labs(title = 'Retweet by Users',
       x = '',
       y = '') +
  theme_minimal() +
  theme(panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.y = element_line(colour = "white"))

arr <- ggarrange(p1, p2, p3,
                 ncol = 3)

annotate_figure(arr,
                top = text_grob("Twitter User Summary\n", face = "bold", size = 18),
                bottom = text_grob("Source: twitter.com", hjust = 1, x = 1, size = 10))
Based on these plots, on April 4th 2021 the top six users posted an average of about 4 original tweets containing the keyword 'Gojek', while the top six repliers and retweeters averaged roughly 20 replies and 10 retweets each. The username Abang_Gojek_ appears to have been especially active in replying and retweeting about 'Gojek' that day.
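As a rough check on those averages, the per-category means of the top six users can be computed directly from the plot1, plot2, and plot3 data frames built above:

# average activity of the top six users in each category
mean(plot1$user_id.x)  # original tweets
mean(plot2$user_id.y)  # replies
mean(plot3$user_id)    # retweets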
Next, a scatter plot of the most favorited tweets, based on each tweet's favorite and retweet counts:
ggplot(gojek_og, aes(retweet_count, favorite_count, color = favorite_count, size = retweet_count)) +
  geom_point(alpha = 0.3) +
  labs(
    title = "Most Favourited Tweet",
    subtitle = "Based on Favorite/Retweet count per tweet",
    caption = "source: twitter.com",
    x = "Retweet",
    y = "Favorite"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Tweets come from many types of apps and third-party integrations. The treemap below shows which apps users posted from the most on April 4th, 2021.
gojek_publish_count <- aggregate(user_id ~ source,
                                 data = gojek,
                                 FUN = length)
library(treemap)
treemap(gojek_publish_count,
index = "source",
vSize = "user_id",
type = "index",
palette = "Blues",
border.col="white",
border.lwds=0,
fontface.labels=1,
title = "Tweet Source\n"
)
From the treemap, Android users are still leading the 'Gojek'-related conversation, followed by Apple users and browser users. Notably, two third-party apps make the top five tweet sources: Salesforce - Social Studio and divr.it.
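A quick way to list the top five sources behind the treemap, using the gojek_publish_count aggregate defined above:

# five most common tweet sources, ordered by number of tweets
head(gojek_publish_count[order(gojek_publish_count$user_id, decreasing = TRUE), ], 5)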
Analyzing text data would not be complete without some text/word analysis. From this dataset I have already built a word-level dataset called gojek_clean, derived from the original data. The cleaning, wrangling, and normalization of colloquial terms into proper Indonesian words using the Colloquial Indonesian Lexicon will be posted on my RPubs soon.
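The full preprocessing will be published separately, but a minimal sketch of the idea is shown below: tokenize the tweet text into words, then replace slang tokens with their formal forms via a lexicon lookup. The lexicon file name and its slang/formal column names are assumptions for illustration, not the exact files used for gojek_clean.

library(dplyr)
library(tidytext)

# assumption: a lookup table with columns `slang` and `formal`,
# e.g. derived from the Colloquial Indonesian Lexicon project
lexicon <- read.csv("data/colloquial-indonesian-lexicon.csv")

gojek_words <- gojek %>%
  select(status_id, text) %>%
  unnest_tokens(word, text) %>%                     # one row per word
  left_join(lexicon, by = c("word" = "slang")) %>%  # look up colloquial forms
  mutate(word = coalesce(formal, word)) %>%         # keep the formal form when available
  select(status_id, word)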
Eventually, this dataset gets one more cleaning pass, done with grep, to drop emoji-name fragments, colloquial stop words, and the keyword itself:
# use the cleaned text data
gojek_teks <- read.csv("data/gojek_clean.csv")

gojek_teks <- gojek_teks[grep("c|C|face|joy|tears|of|with|ya|kalo|kayak|banget|gua|gue|gw|gojek|deh|sih|tau", gojek_teks$word, invert = TRUE), ]
gojek_teks <- as.data.frame(gojek_teks)

teks_group <- aggregate(gojek_teks,
                        gojek_teks,
                        FUN = 'length')
colnames(teks_group) <- c("word", "total")

teks_group <- teks_group[order(teks_group$total, decreasing = T), ]

plot4 <- head(teks_group, 20)
ggplot(plot4, aes(total, reorder(word, total))) +
  geom_col(alpha = 0.6, fill = "#607F37") +
  labs(title = 'Most Frequent Keyword',
       subtitle = "Main Keyword: GOJEK\n",
       x = '',
       y = '',
       caption = "source: twitter.com") +
  theme_minimal() +
  theme(panel.grid.major.y = element_line(colour = "white"),
        panel.grid.minor.x = element_line(colour = "white"))
# hashtags is a list-column; flatten it to text and strip the "c(" prefix left by as.character()
gojek$hashtags <- as.character(gojek$hashtags)
gojek$hashtags <- gsub("c\\(", "", gojek$hashtags)
library(wordcloud)
set.seed(1234)
wordcloud(gojek$hashtags, min.freq=2, scale=c(3.5, .5), random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Blues"))
set.seed(1234)
wordcloud(gojek_rt$retweet_screen_name, min.freq=5, scale=c(2, .5), random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "PuBu"))