INTRODUCTION

Informative opinions that can take a business forward are everywhere on the internet, in the form of tweets, reviews, comments, and posts. Sentiment analysis is an excellent way to collect these opinions, extract the (usually unstructured) information embedded within them, derive insights from that information, and finally act on those insights.

In this project, sentiment analysis was performed on Twitter data, analyzing tweets about the Samsung and iPhone brands. Ten thousand tweets on each brand were extracted and analyzed. It is interesting to see that in recent times there have been relatively more tweets on iPhone than on Samsung, which indicates higher brand salience for iPhone.

EXPLORATORY DATA ANALYSIS

Understanding Twitter data

Load the required libraries

library(rtweet)
library(httpuv)
library(plyr)
library(reshape)
library(ggplot2)
library(tidyverse)
library(qdapRegex)
library(tm)
library(qdap)
library(RColorBrewer)
library(wordcloud)
library(topicmodels)
library(syuzhet)
library(igraph)
library(gridExtra)
library(tidyr)

Authenticate with the default setup to gain access to Twitter data

auth_setup_default()

Extract 10,000 tweets on “#samsung”

samsung_tw <- search_tweets("#samsung", n = 10000, include_rts = FALSE)

Extract 10,000 tweets on “#iphone”

iphone_tw <- search_tweets("#iphone", n = 10000, include_rts = FALSE)

Plotting Twitter data over time

Twitter time series analysis determines the frequency of tweets over a period of time. This analysis detects changing trends and helps us understand the level of interest in each brand.

Plot the Samsung tweet frequency graph

ts_plot(samsung_tw, by = "hours", color= "blue")

Plot the iPhone tweet frequency graph

ts_plot(iphone_tw, by = "hours", color= "red")
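
Since ts_plot() returns a ggplot object, standard ggplot2 layers can be added for labels and themes. A minimal sketch (the axis and title text here are illustrative):

# Same hourly frequency plot, with labels and a lighter theme layered on
ts_plot(samsung_tw, by = "hours", color = "blue") +
  labs(x = "Time", y = "Number of tweets",
       title = "Frequency of #samsung tweets, aggregated hourly") +
  theme_minimal()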

BRAND SALIENCE

Brand salience is the extent to which a brand is spoken about by potential customers.

Comparing brand salience

Convert tweet data into a time series object

samsung_ts <- ts_data(samsung_tw, by = 'hours')
iphone_ts <- ts_data(iphone_tw, by = 'hours')

Rename the two columns in the time series object

names(samsung_ts) <- c("time", "samsung_n")
names(iphone_ts) <- c("time", "iphone_n")

Merge the two time series objects and retain “time” column

merged_df <- merge(iphone_ts, samsung_ts, by ="time", all = TRUE)

Stack the tweet frequency columns using the melt() function

melt_df <- melt(merged_df, na.rm = TRUE, id.vars = "time")
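
reshape::melt() has been superseded in the tidyverse; the same long format can be produced with tidyr::pivot_longer(), already loaded via tidyverse. A sketch (melt_df_alt is an illustrative name):

# Equivalent stacking with tidyr; column names chosen to match melt()'s output
melt_df_alt <- pivot_longer(merged_df, cols = -time,
                            names_to = "variable", values_to = "value",
                            values_drop_na = TRUE)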

Plot the frequency of tweets on Samsung and iPhone

ggplot(data = melt_df,
       aes(x = time, y = value, col = variable)) +
  geom_line(lwd = 0.8)

It is interesting to see that there are relatively more tweets on iPhone than on Samsung. This indicates higher brand salience for the iPhone brand.
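
To quantify the gap rather than judging it from the lines alone, the hourly counts in the merged data frame can be totalled per brand. A quick check:

# Total tweets captured per brand over the observation window
colSums(merged_df[, c("samsung_n", "iphone_n")], na.rm = TRUE)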

PROCESSING TWEET TEXT

Extract the tweet text into character vectors and remove URLs

samsung_txt <- samsung_tw$text
iphone_txt <- iphone_tw$text
samsung_txt_url <- rm_twitter_url(samsung_txt)
iphone_txt_url <- rm_twitter_url(iphone_txt)

head(samsung_txt_url)
## [1] "Samsung Galaxy Book 2 GO and Samsung Galaxy Book 2 GO 5G receive the Bluetooth SIG certification. #Samsung #SamsungGalaxyBook2Go"                                              
## [2] "Samsung Galaxy A14 5G visits the NBTC certification and the Indian BIS certification. #Samsung #SamsungGalaxyA145G"                                                            
## [3] "My own experience with #iPhone and #Samsung is not perfect at all...I'm waiting for Elon Musk's phone with excitement @elonmusk But will he give us a phone from the future..?"
## [4] "ロー #相互支援 #相互フォロー希望 #相互希 望 #HDYF #TFBJP #ANDROID #相互 #TEAMFOLLOWBACK #ipad #Samsung #DIPROMOSIKAN"                                                                
## [5] "Samsung Galaxy M04 With MediaTek Helio G35 SoC Spotted On Google Play Console. #Samsung #SamsungGalaxyM04 #GalaxyM04"                                                          
## [6] "ভারত জুড়ে বড় পরিকল্পনা Samsung-এর! IIT এবং ইঞ্জিনিয়ারিং কলেজ থেকে শত-শত ইঞ্জিনিয়ার নিয়োগের ভাবনা #jobs #job #recruitment #samsung #engineer #চাকরি #নিয়োগ #স্যামসাং #ইঞ্জিনিয়ার"
head(iphone_txt_url)
## [1] "My own experience with #iPhone and #Samsung is not perfect at all...I'm waiting for Elon Musk's phone with excitement @elonmusk But will he give us a phone from the future..?"                                                                                                     
## [2] "Si esperas un #iPhone 14 Pro de #Apple , tienes mala suerte. Los envíos están retrasados por semanas debido a la actual interrupción en la fábrica clave de Foxconn en Zhengzhou, #China. #Infografía Graphic News"                                                                 
## [3] "Insólito: violentas protestas en la mayor fábrica de #iPhone en China Empleados de la planta de iPhone, propiedad de #Foxconn, protestan para exigir mejores condiciones de trabajo y de vida. Los videos se publicaron en redes sociales, algo poco frecuente en ese país. /cmw-cc"
## [4] "【docomo】 まだまだiPhone13シリーズの取扱いがあります! お問い合わせお待ちしております ※特価機種等の告知では御座いません。 TEL:03-5831-2866 #テルル #スマホ #ドコモ #のりかえ #MNP #iPhone #iPhone13 #iPhone13mini"                                                                
## [5] "Dm now for any hacking services or account recovery services I assure you nothing but the best services #iphone #icloud #document #snap #snapchat #snapchatsupport #instagram #facebook #DeleteSpotify #content #privacy #altcoinseason"                                            
## [6] "Apex Legends Mobile Wins the iPhone Game of The Year 2022 Award #WargXP #eSports #Gaming #News #Memes #ApexLegends #ApexLegendsMobile #EA #Game #Games #Gamers #iPhone #Award #GameOfTheYear"

Remove special characters, punctuation, and numbers, keeping only Latin letters

samsung_txt_chrs <- gsub("[^A-Za-z]", " ", samsung_txt_url)
iphone_txt_chrs <- gsub("[^A-Za-z]", " ", iphone_txt_url)

Convert the text to a corpus using the tm library

samsung_corpus <- samsung_txt_chrs %>%
                    VectorSource() %>% # VectorSource() converts the tweet text into a vector source
                    Corpus()           # Corpus() converts the vector source into a corpus

iphone_corpus <- iphone_txt_chrs %>%
                    VectorSource() %>% # VectorSource() converts the tweet text into a vector source
                    Corpus()           # Corpus() converts the vector source into a corpus

Convert the text corpus to lowercase so that the same word (e.g. “Apple” and “apple”) is not counted twice

samsung_corpus_lower <- tm_map(samsung_corpus, content_transformer(tolower))
iphone_corpus_lower <- tm_map(iphone_corpus, content_transformer(tolower))

View common stop words

stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"

Remove stop words to focus on the important words. Stop words are commonly used words that carry little meaning on their own.

samsung_corpus_stopwd <- tm_map(samsung_corpus_lower, removeWords, stopwords("english"))
iphone_corpus_stopwd <- tm_map(iphone_corpus_lower, removeWords, stopwords("english"))

Let’s remove extra whitespace to create a clean corpus

samsung_corpus_final <- tm_map(samsung_corpus_stopwd, stripWhitespace)
iphone_corpus_final <- tm_map(iphone_corpus_stopwd, stripWhitespace)

Let’s create a vector of custom stop words to remove

iphone_samsung_customstop <- c("iphone", "s", "samsung", "k", "t", "g", "now", "can", "will", "just", "also",
                               "even", "still", "m", "one", "z", "like", "best", "get", "co", "china", "de")

Remove custom stop words

samsung_corpus_refined <- tm_map(samsung_corpus_final, removeWords, iphone_samsung_customstop)
iphone_corpus_refined <- tm_map(iphone_corpus_final, removeWords, iphone_samsung_customstop)

Extract term frequencies for the top 30 words

samsung_count_clean <- freq_terms(samsung_corpus_refined, 30)
iphone_count_clean <- freq_terms(iphone_corpus_refined, 30)

head(samsung_count_clean)
##   WORD    FREQ
## 1 galaxy  2379
## 2 android 1028
## 3 r        857
## 4 google   789
## 5 gb       781
## 6 apple    753
head(iphone_count_clean)
##   WORD    FREQ
## 1 apple   2147
## 2 pro      840
## 3 android  829
## 4 ios      673
## 5 foxconn  579
## 6 ipad     573
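
freq_terms() comes from qdap; the high-frequency terms can be cross-checked with tm alone via findFreqTerms() on a term-document matrix. A sketch using the same 400-occurrence cut-off applied to the bar plots below (samsung_tdm is an illustrative name):

# Terms appearing at least 400 times in the refined Samsung corpus
samsung_tdm <- TermDocumentMatrix(samsung_corpus_refined)
findFreqTerms(samsung_tdm, lowfreq = 400)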

Visualizing word frequencies

Create subset data frames of terms with frequency greater than 400

samsung_term400 <- subset(samsung_count_clean, FREQ > 400)
iphone_term400 <- subset(iphone_count_clean, FREQ > 400)

Create bar plots of frequent terms

ggplot(samsung_term400, aes(x = reorder(WORD, -FREQ), y = FREQ)) +
  geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Samsung brand popular words") -> p1

ggplot(iphone_term400, aes(x = reorder(WORD, -FREQ), y = FREQ)) +
  geom_bar(stat = "identity", fill = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("iPhone brand popular words") -> p2

grid.arrange(p1, p2, ncol = 2)

Visualize frequent Samsung words with a word cloud

wordcloud(samsung_corpus_refined, min.freq = 150, colors = brewer.pal(6, "Dark2"),
          scale = c(3, 0.5), random.order = FALSE)

Visualize frequent iPhone words with a word cloud

wordcloud(iphone_corpus_refined, min.freq = 150, colors = brewer.pal(6, "Dark2"),
          scale = c(3, 0.5), random.order = FALSE)
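
The two single-brand clouds can also be contrasted in one figure with wordcloud’s comparison.cloud(). A sketch built from the frequency tables extracted above, so it only covers the top 30 terms per brand (brand_freq and brand_m are illustrative names):

# Align the two frequency tables on WORD; terms missing from one brand get a zero count
brand_freq <- merge(as.data.frame(samsung_count_clean),
                    as.data.frame(iphone_count_clean),
                    by = "WORD", all = TRUE, suffixes = c("_samsung", "_iphone"))
brand_m <- as.matrix(brand_freq[, c("FREQ_samsung", "FREQ_iphone")])
brand_m[is.na(brand_m)] <- 0
rownames(brand_m) <- brand_freq$WORD
colnames(brand_m) <- c("samsung", "iphone")

# One cloud, with each term placed under the brand that uses it more
comparison.cloud(brand_m, max.words = 50, colors = c("blue", "red"))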

TOPIC MODELLING OF TWEETS

Topic modelling is the task of automatically discovering topics in a large body of text. It can help us organize large collections of unstructured text and offers insights for understanding them. Latent Dirichlet Allocation (LDA) is applied in this analysis.

Steps in this model

Create a document term matrix

samsung_dtm <- DocumentTermMatrix(samsung_corpus_refined)
iphone_dtm <- DocumentTermMatrix(iphone_corpus_refined)

Inspect the document-term matrices

inspect(samsung_dtm)
## <<DocumentTermMatrix (documents: 6745, terms: 16013)>>
## Non-/sparse entries: 78731/107928954
## Sparsity           : 100%
## Maximal term length: 52
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   amazon android apple bibiana galaxy google nasa pro smartphone ultra
##   1092      0       0     0       0      0      0    0   0          0     0
##   1464      0       0     0       0      0      0    0   0          0     0
##   1970      0       0     0       0      0      0    0   0          0     0
##   2308      0       0     0       0      0      0    0   0          0     0
##   2404      0       0     0       0      0      0    0   0          0     0
##   241       0       0     0       0      0      0    0   0          0     0
##   3623      0       0     0       0      0      0    0   0          0     0
##   4571      0       0     1       0      1      0    0   0          0     0
##   6149      0       0     0       0      0      0    0   0          0     0
##   6216      0       0     0       0      0      0    0   0          0     0
inspect(iphone_dtm)
## <<DocumentTermMatrix (documents: 6837, terms: 13253)>>
## Non-/sparse entries: 56788/90553973
## Sparsity           : 100%
## Maximal term length: 37
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   android app apple covid follow foxconn ios ipad pro zhengzhou
##   1159       0   0     0     1      0       1   0    0   0         0
##   1223       0   0     2     0      0       0   0    0   2         0
##   2663       0   0     0     0      0       0   0    0   0         0
##   3595       0   0     0     0      0       0   0    0   0         0
##   3917       0   0     0     0      0       2   0    0   0         0
##   4633       0   0     0     0      0       0   0    0   1         0
##   6749       0   0     1     0      0       0   0    0   0         0
##   681        0   0     1     0      0       1   0    0   0         0
##   748        0   0     0     1      0       0   0    0   0         0
##   871        0   0     1     0      0       0   2    0   0         0

Find the sum of word counts in each document

samsung_rowTotal <- apply(samsung_dtm, 1, sum)
iphone_rowTotal <- apply(iphone_dtm, 1, sum)

Select rows from the DTM with row totals greater than zero, then fit a five-topic LDA model on each brand’s tweets

samsung_dtm_tweet_new <- samsung_dtm[samsung_rowTotal> 0, ]
iphone_dtm_tweet_new <- iphone_dtm[iphone_rowTotal> 0, ]
samsung_lda5 <- LDA(samsung_dtm_tweet_new, k = 5) #k is the number of topics
iphone_lda5 <- LDA(iphone_dtm_tweet_new, k = 5)
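
LDA starts from a random initialization, so the topics reported below will vary slightly between runs. If reproducibility matters, a seed can be passed through the control list; a sketch with illustrative object names (the results shown here came from unseeded runs):

# Fit reproducible five-topic models (the seed value is arbitrary)
samsung_lda5_seeded <- LDA(samsung_dtm_tweet_new, k = 5, control = list(seed = 1234))
iphone_lda5_seeded <- LDA(iphone_dtm_tweet_new, k = 5, control = list(seed = 1234))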

View the top 10 terms in each topic

samsung_top10terms <- terms(samsung_lda5, 10)
iphone_top10terms <- terms(iphone_lda5, 10)

head(samsung_top10terms)
##      Topic 1    Topic 2     Topic 3      Topic 4   Topic 5     
## [1,] "apple"    "google"    "galaxy"     "galaxy"  "apple"     
## [2,] "android"  "bibiana"   "oferta"     "ultra"   "galaxy"    
## [3,] "phone"    "nasa"      "smartphone" "galaxys" "case"      
## [4,] "galaxy"   "kangmj"    "android"    "android" "xiaomi"    
## [5,] "amazon"   "pinterest" "promo"      "pro"     "pro"       
## [6,] "elonmusk" "youtube"   "smart"      "galaxya" "smartphone"
head(iphone_top10terms)
##      Topic 1     Topic 2   Topic 3    Topic 4    Topic 5 
## [1,] "foxconn"   "android" "les"      "apple"    "apple" 
## [2,] "apple"     "follow"  "apple"    "app"      "pro"   
## [3,] "factory"   "mnp"     "user"     "ios"      "ipad"  
## [4,] "covid"     "apple"   "dynamic"  "download" "ios"   
## [5,] "zhengzhou" "case"    "sur"      "free"     "sim"   
## [6,] "workers"   "ver"     "portrait" "black"    "promax"

Topic modelling is essential for a brand: from the discovered topics, it can determine a theme around which to center its promotions or advertising.

SENTIMENT ANALYSIS OF TWEETS

Let’s extract users’ opinions and perceptions from the tweets using sentiment analysis. Sentiment analysis is the process of retrieving information about a consumer’s perception of a brand.

# Score each Samsung tweet against the NRC emotion lexicon from the syuzhet package
sa.value <- get_nrc_sentiment(samsung_tw$text)
sa.value[1:5, 1:7]
##   anger anticipation disgust fear joy sadness surprise
## 1     0            0       0    0   0       0        0
## 2     0            0       0    0   0       0        0
## 3     0            2       0    0   2       0        1
## 4     0            0       0    0   0       0        0
## 5     0            0       0    0   0       1        0
# Sum each sentiment column across all Samsung tweets
score <- colSums(sa.value[,])
samsung_score_df <- data.frame(score)

# Turn the row names into a sentiment column for plotting
samsung_sa.score <- cbind(sentiment = row.names(samsung_score_df),
                          samsung_score_df, row.names = NULL)

# Plot the overall sentiment scores for the Samsung brand
ggplot(data = samsung_sa.score, aes(x = sentiment, y = score, fill = sentiment)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

pa.value <- get_nrc_sentiment(iphone_tw$text)
pa.value[1:5, 1:7]
##   anger anticipation disgust fear joy sadness surprise
## 1     0            2       0    0   2       0        1
## 2     0            0       0    0   0       0        0
## 3     0            0       0    0   0       0        0
## 4     0            0       0    0   0       0        0
## 5     0            0       0    0   1       0        0
# Repeat the aggregation and plotting for the iPhone brand
pscore <- colSums(pa.value[,])
iphone_score_df <- data.frame(pscore)
iphone_pa.score <- cbind(sentiment = row.names(iphone_score_df),
                         iphone_score_df, row.names = NULL)
ggplot(data = iphone_pa.score, aes(x = sentiment, y = pscore, fill = sentiment)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
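
To compare the two brands directly, the NRC totals computed above can be placed side by side (sentiment_compare is an illustrative name):

# Combine the per-sentiment totals for both brands into one data frame
sentiment_compare <- data.frame(sentiment = names(score),
                                samsung = as.numeric(score),
                                iphone = as.numeric(pscore))
sentiment_compare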

CONCLUSION

It is not surprising that the most common words for the iPhone brand included “apple”, “pro”, “android”, and “ios”, while the top terms for the Samsung brand included “galaxy”, “android”, “google”, “Bibiana”, and “NASA”.

The sentiment analysis shows a high proportion of positive reactions, anticipation, and joy. In contrast, there is a low proportion of negative reactions, sadness, and fear.