Opinions that can move a business forward are everywhere on the internet, in the form of tweets, reviews, comments and posts. Sentiment analysis is a way to collect these opinions, extract the information embedded in what is usually unstructured text, derive insight from it and, finally, act on it.
In this project, sentiment analysis was performed on Twitter data, analyzing tweets about the Samsung and iPhone brands. Up to ten thousand tweets were requested for each brand and analyzed. In the period sampled there were relatively more tweets about iPhone than about Samsung, which suggests higher brand salience for iPhone.
library(rtweet)
library(httpuv)
library(plyr)
library(reshape)
library(ggplot2)
library(tidyverse)
library(qdapRegex)
library(tm)
library(qdap)
library(RColorBrewer)
library(wordcloud)
library(topicmodels)
library(syuzhet)
library(igraph)
library(gridExtra)
library(tidyr)
auth_setup_default()
samsung_tw <- search_tweets("#samsung", n = 10000, include_rts = FALSE)
iphone_tw <- search_tweets("#iphone", n = 10000, include_rts = FALSE)
Twitter time series analysis determines how often a brand is tweeted about over a period of time. It helps detect changing trends and gauge the level of interest in each brand.
ts_plot(samsung_tw, by = "hours", color= "blue")
# Plot iPhone tweet frequency
ts_plot(iphone_tw, by = "hours", color= "red")
Brand salience is the extent to which a brand is spoken about by potential customers.
Convert tweet data into a time series object
samsung_ts <- ts_data(samsung_tw, by = 'hours')
iphone_ts <- ts_data(iphone_tw, by = 'hours')
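To make the idea of an hourly tweet count concrete, here is a minimal base R sketch of binning timestamps by hour. It is not how rtweet implements `ts_data()`; the `created_at` values are made up for illustration, and hours with no tweets are simply absent from the result.

```r
# Hypothetical tweet timestamps (made-up values), fixed to UTC
created_at <- as.POSIXct(
  c("2022-11-24 10:05:00", "2022-11-24 10:40:00",
    "2022-11-24 11:15:00", "2022-11-24 13:59:00"),
  tz = "UTC"
)

# Label each tweet with the start of the hour it falls in
hour_bin <- format(created_at, "%Y-%m-%d %H:00", tz = "UTC")

# Count tweets per hour and return a two-column data frame (time, n)
hourly <- as.data.frame(table(time = hour_bin),
                        responseName = "n",
                        stringsAsFactors = FALSE)
print(hourly)
```

The resulting `time` and `n` columns mirror the two columns that are renamed in the next step.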
Rename the two columns in the time series object
names(samsung_ts) <- c("time", "samsung_n")
names(iphone_ts) <- c("time", "iphone_n")
Merge the two time series objects and retain “time” column
merged_df <- merge(iphone_ts, samsung_ts, by ="time", all = TRUE)
Stack the tweet frequency columns using the melt() function
melt_df <- melt(merged_df, na.rm = TRUE, id.vars = "time")
ggplot(data = melt_df,
aes(x = time, y = value, col = variable)) +
geom_line(lwd = 0.8)
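The merge-then-stack steps can be traced on a toy example. The sketch below uses made-up counts and base R only; the manual stacking is a stand-in for `melt()`, not the reshape package's implementation.

```r
# Made-up hourly counts for two brands with partly different hours
iphone_ts  <- data.frame(time = c("10:00", "11:00", "12:00"),
                         iphone_n = c(5, 7, 4))
samsung_ts <- data.frame(time = c("11:00", "12:00", "13:00"),
                         samsung_n = c(3, 6, 2))

# Full outer join on time: hours seen for only one brand get NA for the other
merged_df <- merge(iphone_ts, samsung_ts, by = "time", all = TRUE)

# Stack the two count columns into long format (one row per time/brand pair)
long_df <- data.frame(
  time     = rep(merged_df$time, times = 2),
  variable = rep(c("iphone_n", "samsung_n"), each = nrow(merged_df)),
  value    = c(merged_df$iphone_n, merged_df$samsung_n)
)

# Drop the NA rows, as na.rm = TRUE does in the melt() call
long_df <- long_df[!is.na(long_df$value), ]
print(long_df)
```

The long format is what lets ggplot map `variable` to line colour and draw one line per brand.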
It is interesting to see that there are relatively more tweets about iPhone than about Samsung. This indicates higher brand salience for the iPhone brand.
Extract the tweet text into a character vector and remove the URLs
samsung_txt <- samsung_tw$text
iphone_txt <- iphone_tw$text
samsung_txt_url <- rm_twitter_url(samsung_txt)
iphone_txt_url <- rm_twitter_url(iphone_txt)
head(samsung_txt_url)
## [1] "Samsung Galaxy Book 2 GO and Samsung Galaxy Book 2 GO 5G receive the Bluetooth SIG certification. #Samsung #SamsungGalaxyBook2Go"
## [2] "Samsung Galaxy A14 5G visits the NBTC certification and the Indian BIS certification. #Samsung #SamsungGalaxyA145G"
## [3] "My own experience with #iPhone and #Samsung is not perfect at all...I'm waiting for Elon Musk's phone with excitement @elonmusk But will he give us a phone from the future..?"
## [4] "ロー #相互支援 #相互フォロー希望 #相互希 望 #HDYF #TFBJP #ANDROID #相互 #TEAMFOLLOWBACK #ipad #Samsung #DIPROMOSIKAN"
## [5] "Samsung Galaxy M04 With MediaTek Helio G35 SoC Spotted On Google Play Console. #Samsung #SamsungGalaxyM04 #GalaxyM04"
## [6] "ভারত জুড়ে বড় পরিকল্পনা Samsung-এর! IIT এবং ইঞ্জিনিয়ারিং কলেজ থেকে শত-শত ইঞ্জিনিয়ার নিয়োগের ভাবনা #jobs #job #recruitment #samsung #engineer #চাকরি #নিয়োগ #স্যামসাং #ইঞ্জিনিয়ার"
head(iphone_txt_url)
## [1] "My own experience with #iPhone and #Samsung is not perfect at all...I'm waiting for Elon Musk's phone with excitement @elonmusk But will he give us a phone from the future..?"
## [2] "Si esperas un #iPhone 14 Pro de #Apple , tienes mala suerte. Los envíos están retrasados por semanas debido a la actual interrupción en la fábrica clave de Foxconn en Zhengzhou, #China. #Infografía Graphic News"
## [3] "Insólito: violentas protestas en la mayor fábrica de #iPhone en China Empleados de la planta de iPhone, propiedad de #Foxconn, protestan para exigir mejores condiciones de trabajo y de vida. Los videos se publicaron en redes sociales, algo poco frecuente en ese país. /cmw-cc"
## [4] "【docomo】 まだまだiPhone13シリーズの取扱いがあります! お問い合わせお待ちしております ※特価機種等の告知では御座いません。 TEL:03-5831-2866 #テルル #スマホ #ドコモ #のりかえ #MNP #iPhone #iPhone13 #iPhone13mini"
## [5] "Dm now for any hacking services or account recovery services I assure you nothing but the best services #iphone #icloud #document #snap #snapchat #snapchatsupport #instagram #facebook #DeleteSpotify #content #privacy #altcoinseason"
## [6] "Apex Legends Mobile Wins the iPhone Game of The Year 2022 Award #WargXP #eSports #Gaming #News #Memes #ApexLegends #ApexLegendsMobile #EA #Game #Games #Gamers #iPhone #Award #GameOfTheYear"
Remove special characters, punctuation, and numbers
samsung_txt_chrs <- gsub("[^A-Za-z]", " ", samsung_txt_url)
iphone_txt_chrs <- gsub("[^A-Za-z]", " ", iphone_txt_url)
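A quick sketch of what this pattern does to sample strings (made up for illustration). Note that `[^A-Za-z]` also replaces accented letters and, by the same logic, any non-Latin script, so tweets in languages such as Japanese or Bengali are reduced to their embedded Latin words.

```r
# Sample inputs: digits, punctuation, a hashtag and an accented letter
raw <- c("Galaxy A14 5G!", "#iPhone 14 Pro", "caf\u00e9")

# Same substitution as above: everything that is not A-Z or a-z becomes a space
letters_only <- gsub("[^A-Za-z]", " ", raw)

# Collapse the runs of spaces left behind so the result is easier to read
squished <- trimws(gsub("\\s+", " ", letters_only))
print(squished)
```

Model numbers such as "A14" lose their digits, which is why single letters like "s", "g" and "k" show up later as candidates for the custom stop word list.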
Convert text to corpus using the tm library
samsung_corpus <- samsung_txt_chrs %>%
  VectorSource() %>% # VectorSource() treats each tweet in the character vector as one document
  Corpus()           # Corpus() builds a tm corpus from that source
iphone_corpus <- iphone_txt_chrs %>%
  VectorSource() %>% # VectorSource() treats each tweet in the character vector as one document
  Corpus()           # Corpus() builds a tm corpus from that source
Convert the corpus to lowercase so that the same word in different cases is not counted separately
samsung_corpus_lower <- tm_map(samsung_corpus, tolower)
iphone_corpus_lower <- tm_map(iphone_corpus, tolower)
stopwords("english")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
samsung_corpus_stopwd <- tm_map(samsung_corpus_lower, removeWords, stopwords("english"))
iphone_corpus_stopwd <- tm_map(iphone_corpus_lower, removeWords, stopwords("english"))
Let’s remove extra whitespace to create a clean corpus
samsung_corpus_final <- tm_map(samsung_corpus_stopwd, stripWhitespace)
iphone_corpus_final <- tm_map(iphone_corpus_stopwd, stripWhitespace)
Let’s create a vector of custom stop words to remove
iphone_samsung_cusomstop <- c("iphone", "s", "samsung", "k", "t", "g", "now", "can", "will", "just", "also",
                              "even", "still", "m", "one", "z", "like", "best", "get", "co", "china", "de")
Remove custom stop words
samsung_corpus_refined <- tm_map(samsung_corpus_final, removeWords, iphone_samsung_cusomstop)
iphone_corpus_refined <- tm_map(iphone_corpus_final, removeWords, iphone_samsung_cusomstop)
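Stop word removal matches whole words only. The base R sketch below approximates that behaviour with a word-boundary regular expression; `remove_words_sketch()` is a hypothetical helper written for this illustration, not tm's `removeWords()` itself, and the sample strings are made up.

```r
# Hypothetical helper: delete every whole-word occurrence of the given words
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

tweets      <- c("the galaxy is still the best phone", "my thesis on android")
custom_stop <- c("the", "is", "still", "best", "my", "on")

cleaned <- remove_words_sketch(tweets, custom_stop)

# Removal leaves gaps behind, hence the whitespace clean-up afterwards
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

Because of the `\\b` boundaries, "thesis" survives even though it contains both "the" and "is". The same boundaries mean an entry has to match a word exactly, so stray whitespace inside a stop word string would make it ineffective.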
Extract term frequencies for the top 30 words
samsung_count_clean <- freq_terms(samsung_corpus_refined, 30)
iphone_count_clean <- freq_terms(iphone_corpus_refined, 30)
head(samsung_count_clean)
## WORD FREQ
## 1 galaxy 2379
## 2 android 1028
## 3 r 857
## 4 google 789
## 5 gb 781
## 6 apple 753
head(iphone_count_clean)
## WORD FREQ
## 1 apple 2147
## 2 pro 840
## 3 android 829
## 4 ios 673
## 5 foxconn 579
## 6 ipad 573
Keep only the terms that occur more than 400 times
samsung_term300 <- subset(samsung_count_clean, FREQ > 400)
iphone_term300 <- subset(iphone_count_clean, FREQ > 400)
Create bar plot of frequent terms
p1 <- ggplot(samsung_term300, aes(x = reorder(WORD, -FREQ), y = FREQ)) +
  geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Samsung brand popular words")
p2 <- ggplot(iphone_term300, aes(x = reorder(WORD, -FREQ), y = FREQ)) +
  geom_bar(stat = "identity", fill = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("iPhone brand popular words")
# Show the two bar plots side by side
grid.arrange(p1, p2, ncol = 2)
wordcloud(samsung_corpus_refined, min.freq = 150, colors = brewer.pal(6, "Dark2"),
scale = c(3, 0.5), random.order = FALSE)
wordcloud(iphone_corpus_refined, min.freq = 150, colors = brewer.pal(6, "Dark2"),
scale = c(3, 0.5), random.order = FALSE)
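Topic modeling is the task of automatically discovering topics in a large body of text. It helps organize large collections of unstructured text and offers insight into them. Latent Dirichlet allocation (LDA) is applied in this analysis.
The steps in this model are: build a document-term matrix for each brand, drop the documents that have no terms left after cleaning, fit an LDA model with five topics, and extract the top ten terms per topic.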
samsung_dtm <- DocumentTermMatrix(samsung_corpus_refined)
iphone_dtm <- DocumentTermMatrix(iphone_corpus_refined)
inspect(samsung_dtm)
## <<DocumentTermMatrix (documents: 6745, terms: 16013)>>
## Non-/sparse entries: 78731/107928954
## Sparsity : 100%
## Maximal term length: 52
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs amazon android apple bibiana galaxy google nasa pro smartphone ultra
## 1092 0 0 0 0 0 0 0 0 0 0
## 1464 0 0 0 0 0 0 0 0 0 0
## 1970 0 0 0 0 0 0 0 0 0 0
## 2308 0 0 0 0 0 0 0 0 0 0
## 2404 0 0 0 0 0 0 0 0 0 0
## 241 0 0 0 0 0 0 0 0 0 0
## 3623 0 0 0 0 0 0 0 0 0 0
## 4571 0 0 1 0 1 0 0 0 0 0
## 6149 0 0 0 0 0 0 0 0 0 0
## 6216 0 0 0 0 0 0 0 0 0 0
inspect(iphone_dtm)
## <<DocumentTermMatrix (documents: 6837, terms: 13253)>>
## Non-/sparse entries: 56788/90553973
## Sparsity : 100%
## Maximal term length: 37
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs android app apple covid follow foxconn ios ipad pro zhengzhou
## 1159 0 0 0 1 0 1 0 0 0 0
## 1223 0 0 2 0 0 0 0 0 2 0
## 2663 0 0 0 0 0 0 0 0 0 0
## 3595 0 0 0 0 0 0 0 0 0 0
## 3917 0 0 0 0 0 2 0 0 0 0
## 4633 0 0 0 0 0 0 0 0 1 0
## 6749 0 0 1 0 0 0 0 0 0 0
## 681 0 0 1 0 0 1 0 0 0 0
## 748 0 0 0 1 0 0 0 0 0 0
## 871 0 0 1 0 0 0 2 0 0 0
# Total term count per tweet; slam::row_sums() works on the sparse matrix directly
samsung_rowTotal <- slam::row_sums(samsung_dtm)
iphone_rowTotal <- slam::row_sums(iphone_dtm)
samsung_dtm_tweet_new <- samsung_dtm[samsung_rowTotal> 0, ]
iphone_dtm_tweet_new <- iphone_dtm[iphone_rowTotal> 0, ]
samsung_lda5 <- LDA(samsung_dtm_tweet_new, k = 5) #k is the number of topics
iphone_lda5 <- LDA(iphone_dtm_tweet_new, k = 5)
samsung_top10terms <- terms(samsung_lda5, 10)
iphone_top10terms <- terms(iphone_lda5, 10)
head(samsung_top10terms)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "apple" "google" "galaxy" "galaxy" "apple"
## [2,] "android" "bibiana" "oferta" "ultra" "galaxy"
## [3,] "phone" "nasa" "smartphone" "galaxys" "case"
## [4,] "galaxy" "kangmj" "android" "android" "xiaomi"
## [5,] "amazon" "pinterest" "promo" "pro" "pro"
## [6,] "elonmusk" "youtube" "smart" "galaxya" "smartphone"
head(iphone_top10terms)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "foxconn" "android" "les" "apple" "apple"
## [2,] "apple" "follow" "apple" "app" "pro"
## [3,] "factory" "mnp" "user" "ios" "ipad"
## [4,] "covid" "apple" "dynamic" "download" "ios"
## [5,] "zhengzhou" "case" "sur" "free" "sim"
## [6,] "workers" "ver" "portrait" "black" "promax"
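The row filter applied before fitting the models matters because, as far as I know, `LDA()` refuses a matrix in which any document has no terms at all, and cleaning can leave some tweets empty. A toy sketch of that filter on an ordinary base R matrix (made-up counts, not a real document-term matrix):

```r
# Toy document-term counts; document "2" lost all its terms during cleaning
dtm_toy <- matrix(c(1, 0, 2,
                    0, 0, 0,
                    0, 3, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Docs  = c("1", "2", "3"),
                                  Terms = c("apple", "galaxy", "pro")))

# Total term count per document
row_total <- rowSums(dtm_toy)

# Keep only documents with at least one term; drop = FALSE keeps it a matrix
dtm_nonempty <- dtm_toy[row_total > 0, , drop = FALSE]
print(dtm_nonempty)
```

Only documents "1" and "3" remain, which is the same shape of result as the filtered matrices passed to the topic model.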
Topic modeling is essential because it helps a brand identify themes around which to center its promotions or advertisements.
Let’s extract users’ opinions and perceptions from the tweets using sentiment analysis. Sentiment analysis is the process of retrieving information about a consumer’s perception of a brand.
sa.value <- get_nrc_sentiment(samsung_tw$text)
## Warning: `spread_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `spread()` instead.
## ℹ The deprecated feature was likely used in the syuzhet package.
## Please report the issue to the authors.
sa.value[1:5, 1:7]
## anger anticipation disgust fear joy sadness surprise
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 2 0 0 2 0 1
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 1 0
score <- colSums(sa.value[,])
samsung_score_df <- data.frame(score)
samsung_sa.score <- cbind(sentiment = row.names(samsung_score_df),
samsung_score_df, row.names = NULL)
ggplot(data = samsung_sa.score, aes(x = sentiment, y= score, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
pa.value <- get_nrc_sentiment(iphone_tw$text)
pa.value[1:5, 1:7]
## anger anticipation disgust fear joy sadness surprise
## 1 0 2 0 0 2 0 1
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 1 0 0
pscore <- colSums(pa.value[,])
iphone_score_df <- data.frame(pscore)
iphone_pa.score <- cbind(sentiment = row.names(iphone_score_df),
iphone_score_df, row.names = NULL)
ggplot(data = iphone_pa.score, aes(x = sentiment, y= pscore, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
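The bar charts show raw counts per category. To compare categories as shares, the column totals can be divided by their sum. The sketch below uses a made-up score table with only four of the categories; note that these are shares of category hits rather than of tweets, since one word can count toward more than one category.

```r
# Made-up per-tweet scores in the same layout as the sentiment output above
toy_scores <- data.frame(
  anger    = c(0, 1, 0),
  joy      = c(2, 0, 1),
  positive = c(1, 0, 2),
  negative = c(0, 1, 0)
)

# Column totals, as in the analysis above
score <- colSums(toy_scores)

# One row per category, with its count and its share of all category hits
score_df <- data.frame(
  sentiment = names(score),
  score     = as.numeric(score),
  share     = as.numeric(score) / sum(score)
)
print(score_df)
```

Plotting `share` instead of `score` puts the two brands on the same scale even when the number of tweets collected differs.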
Unsurprisingly, the most common words for the iPhone brand included “apple”, “pro”, “android” and “ios”, while the top terms for the Samsung brand included “galaxy”, “android”, “google”, “Bibiana” and “NASA”.
The sentiment analysis shows a high share of positive reactions, anticipation and joy. By contrast, the share of negative reactions, sadness and fear is low.