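# Load packages
library(tidyverse)  # dplyr, tidyr, stringr, tibble, ggplot2
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments(), bind_tf_idf(), reorder_within()
library(igraph)     # graph_from_data_frame()
library(ggraph)     # ggraph(), geom_edge_link(), geom_node_point(), geom_node_text()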

Executive summary

This report analyzes tweet data that records the emotions users express toward IT brands. From each tweet, we identify which emotions users mainly express about a product or brand and what drives negative or positive emotions, and we visualize the results in several formats.

This study tests two research hypotheses: “People who have a negative evaluation of a particular brand will mainly leave tweets” (that is, tweets about a brand come mostly from users who evaluate it negatively), and “The negative evaluation will mainly be complaints about the functions of a particular brand’s products.”

Data background

The dataset used in the analysis is called “Brands and Product Emotions.” It was obtained from the data.world site and was created on August 30, 2013.

The dataset contains 8,721 tweet texts, the emotion attributed to each tweet, and the product or brand at which that emotion is directed. All tweets contain the tag #SXSW or #sxsw. SXSW (South by Southwest) is a set of film, interactive media, and music festivals and conferences held every spring in Austin, Texas, in the United States. IT brands such as Apple, Google, and Android took part in the conference, and the data reflects users’ tweets about each brand’s products or activities at SXSW. Because the data was created in 2013, SXSW events held after 2013 are not reflected.

The emotion labels in the data consist of four categories: “positive emotion”, “negative emotion”, “I can’t tell”, and “No emotion toward brand or product”.

The “Product or Brand” column has nine categories: “Android,” “Android App,” “Apple,” “Google,” “Other Apple product or service,” “Other Google product or service,” “iPad,” “iPad or iPhone App,” and “iPhone.” Tweets whose product or brand is not clearly identified are left blank in that column.

Data loading, cleaning and preprocessing

Prior to data analysis, a unique code was attached to each tweet. This is to distinguish which tweet the words belong to even if the tweet data is tokenized in units of words.

Because the tweet text was scraped as is, the original data is first cleaned: the text is lowercased, and URLs, @mentions, hashtags, punctuation, and digits are removed.

After that, the data is tokenized by word using the unnest_tokens() function, and stop words are removed using the anti_join() function.

The columns are renamed using mutate() and select().

getwd()

## [1] "/Users/byeonjunhyeog/Documents/TextMining/TextMining/Final_report/Final_Report_20224595"

data <- read.csv("company_data.csv")

data <- data %>%
  rowid_to_column("tweet_code")

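preprocess_tweets <- function(data) {
  # `data` is a character vector of raw tweet text
  data %>%
    str_to_lower() %>%                          # lowercase
    str_remove_all("http\\S+\\s*") %>%          # remove URLs
    str_remove_all("@\\w+") %>%                 # remove @mentions
    str_remove_all("#\\w+") %>%                 # remove hashtags
    str_remove_all("[[:punct:]]") %>%           # remove punctuation
    str_remove_all("[[:digit:]]") %>%           # remove digits
    str_replace_all("[^[:alnum:]\\s]", " ") %>% # replace any remaining symbols with a space
    str_squish()                                # collapse repeated whitespace and trim
}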

tokenized_data <- data %>%
  mutate(tweet_text = sapply(tweet_text, preprocess_tweets)) %>%
  unnest_tokens(word, tweet_text) %>%
  anti_join(stop_words) %>%
  mutate(Product = emotion_in_tweet_is_directed_at, Sentiment = is_there_an_emotion_directed_at_a_brand_or_product) %>%
  select(tweet_code, Product, Sentiment, word)

## Joining with `by = join_by(word)`
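
To make the cleaning step concrete, the sketch below applies the same sequence of substitutions to one made-up tweet. It uses base R `gsub()` instead of stringr so that it runs without any packages. The sample text and the helper name `clean_tweet_demo()` are hypothetical and are not part of the report's data or code.

```r
# Hypothetical helper mirroring the order of steps in preprocess_tweets(),
# written with base R so it runs without extra packages.
clean_tweet_demo <- function(text) {
  text <- tolower(text)                                  # lowercase
  text <- gsub("http\\S+\\s*", "", text, perl = TRUE)    # URLs
  text <- gsub("@\\w+", "", text, perl = TRUE)           # @mentions
  text <- gsub("#\\w+", "", text, perl = TRUE)           # hashtags
  text <- gsub("[[:punct:]]", "", text)                  # punctuation
  text <- gsub("[[:digit:]]", "", text)                  # digits
  text <- gsub("[^[:alnum:][:space:]]", " ", text)       # any remaining symbols
  text <- gsub("\\s+", " ", text, perl = TRUE)           # collapse whitespace
  trimws(text)
}

# Made-up example tweet (not taken from the dataset)
sample_tweet <- "RT @mention Loving the new #iPad 2 at #SXSW! http://bit.ly/abc {link}"
cleaned <- clean_tweet_demo(sample_tweet)
print(cleaned)
```

In this sketch the cleaned text still contains tokens such as `rt` and `link`, which is why such words are filtered out before the TF-IDF step later in the report. It also shows that a product name written only as a hashtag (here `#iPad`) is removed together with the hashtag.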

Combining and modifying sentiment data

This analysis compares products. Therefore, tweets for which the product is unclear (that is, the Product column is blank) are excluded from the dataset.

The ‘bing’ lexicon is applied to classify the sentiment of each word; this is separate from the sentiment labels already present in the data, and the two classifications may differ. The existing labels were assigned by reading the context of the whole tweet, so they suit analysis at the tweet level. The analysis here, however, works with the sentiment of individual words, so it relies on the ‘bing’ classification rather than the existing labels.

product_categories <- c("Android", "Android App", "Apple", "Google", "Other Apple product or service", "Other Google product or service", "iPad", "iPad or iPhone App", "iPhone")

bing_data <- tokenized_data %>%
    inner_join(get_sentiments("bing")) %>%
    filter(Product %in% product_categories) %>%
    select(tweet_code, Product, word, bing_sentiment = sentiment)

## Joining with `by = join_by(word)`
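
The word-level labelling can be illustrated with a toy inner join. The three-row lexicon below is a hypothetical stand-in for the ‘bing’ lexicon, and the tokens are made up. Base R `merge()` is used in place of `inner_join()` so the sketch has no package dependencies.

```r
# Made-up tokens from two hypothetical tweets
tokens <- data.frame(
  tweet_code = c(1, 1, 1, 2, 2),
  word = c("awesome", "free", "app", "crashes", "battery"),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon standing in for get_sentiments("bing")
mini_lexicon <- data.frame(
  word = c("awesome", "free", "crashes"),
  sentiment = c("positive", "positive", "negative"),
  stringsAsFactors = FALSE
)

# merge() with the default all = FALSE behaves like an inner join:
# tokens without a lexicon entry ("app", "battery") are dropped.
labelled <- merge(tokens, mini_lexicon, by = "word")
print(labelled)
```

Words with no lexicon entry carry no weight in the counts that follow, so a tweet whose complaint is expressed through neutral vocabulary contributes nothing to the negative total.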

Number of Positive and Negative Words per Product

To test the first hypothesis, the number of positive and negative words for each product was calculated and visualized as a bar graph.

The graph shows that positive words outnumber negative words in every product category. In the two categories with the largest number of words, ‘iPad’ and ‘Apple’, the difference between positive and negative words was more than 200. The category with the least data was ‘Other Apple product or service’, where 27 of 37 words in total were positive.

Accordingly, the first research hypothesis, “People who have a negative evaluation of a particular brand will mainly leave tweets,” was found to be wrong.

bing_data_count <- bing_data %>%
  group_by(Product, bing_sentiment) %>%
  summarise(tweet_count = n())

## `summarise()` has grouped output by 'Product'. You can override using the
## `.groups` argument.

graph1 <- ggplot(bing_data_count, aes(x = bing_sentiment)) +
  geom_bar(aes(y = tweet_count, fill = bing_sentiment), stat = "identity", position = position_dodge2(width = 0.9)) +
  scale_fill_manual(values = c("negative" = "red", "positive" = "blue"), guide = "legend") +
  labs(title = "Number of Positive and Negative Words per Product",
       x = "Sentiment",
       y = "Number of Words") +
  theme_minimal() +
  theme(legend.title = element_blank()) +
  facet_wrap(~Product, ncol = 3)
ggsave(filename = ("graph1.png"), plot = graph1, width = 8, height = 6)
graph1

Ratio of Positive to Negative Counts for Each Product

To analyze the ratio of positive to negative words for each product more precisely, the counts were also visualized as pie charts. The charts are ordered from the largest to the smallest total word count, to show how the total number of words (the absolute volume of tweets mentioning the product or brand) relates to the ratio of positive and negative words.

The analysis found that the product with the highest percentage of negative words was ‘iPhone’ and the product with the highest percentage of positive words was ‘Android App’. The proportion of negative words tended to be somewhat higher for products with more data, but no convincing correlation was found. In other words, the first hypothesis is clearly wrong.

bing_data_count2 <- bing_data_count %>%
  group_by(Product) %>%
  mutate(total_count = sum(tweet_count)) %>%
  ungroup() %>%
  arrange(desc(total_count))

graph2 <- ggplot(bing_data_count2, aes(x = "", y = tweet_count, fill = bing_sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  facet_wrap(~ reorder(Product, -total_count), scales = "free") + # Reorder facet
  scale_fill_manual(values = c("negative" = "red", "positive" = "blue")) +
  labs(title = "Ratio of Positive to Negative Counts for Each Product",
       fill = "Sentiment") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5),
        strip.text = element_text(size = 5))

ggsave(filename = ("graph2.png"), plot = graph2, width = 8, height = 6)
graph2
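
The ratio shown in each pie is a sentiment's word count divided by the product's total. A minimal base R sketch follows. The ‘Other Apple product or service’ counts (27 positive words out of 37) come from the text above, while the second product and its counts are invented purely for illustration.

```r
counts <- data.frame(
  Product = c("Other Apple product or service", "Other Apple product or service",
              "Example product", "Example product"),   # "Example product" is hypothetical
  bing_sentiment = c("negative", "positive", "negative", "positive"),
  word_count = c(10, 27, 40, 60),                       # 40 and 60 are invented
  stringsAsFactors = FALSE
)

# Share of each sentiment within its product: count / product total
product_total <- ave(counts$word_count, counts$Product, FUN = sum)
counts$share <- counts$word_count / product_total
print(counts)
```

Comparing shares rather than raw counts keeps products with very different amounts of data on the same scale.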

Average Number of Positive/Negative Words per Tweet by Product

The analysis so far has counted positive and negative words per product, but it is also necessary to look at how many positive or negative words appear within a single tweet for each product. For example, if a few tweets about a particular product use sentiment words very heavily, the word counts for that product may not be representative. Therefore, the average number of positive words and of negative words per tweet should be calculated and compared. If the difference is too stark, it suggests that the previous analysis may not be objective.

Using the ‘tweet_code’ column assigned to each tweet at the start, the number of positive and negative words per tweet was compared by product and visualized as a bar graph, which makes absolute values easy to compare.

The analysis found that the difference between the number of positive and negative words per tweet was small in most product groups. For Android App and Android, which showed relatively large differences, the difference was about 0.5 and 0.3 words, respectively. For products with more data, the difference was narrower.

It is therefore inferred that differences in the number of positive and negative words per tweet do not materially affect the analysis results.

word_counts <- bing_data %>%
  group_by(Product, bing_sentiment, tweet_code) %>%
  summarise(word_count = n()) %>%
  ungroup()

## `summarise()` has grouped output by 'Product', 'bing_sentiment'. You can
## override using the `.groups` argument.

avg_word_counts <- word_counts %>%
  group_by(Product, bing_sentiment) %>%
  summarise(avg_word_count = mean(word_count))

## `summarise()` has grouped output by 'Product'. You can override using the
## `.groups` argument.

graph3 <- ggplot(avg_word_counts, aes(x = Product, y = avg_word_count, fill = bing_sentiment)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Number of Positive/Negative Words per Tweet by Product",
       x = "Product",
       y = "Average Number of Words",
       fill = "Sentiment") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave(filename = ("graph3.png"), plot = graph3, width = 8, height = 6)
graph3
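
The two-step aggregation (count sentiment words per tweet, then average those counts per product and sentiment) can be sketched with base R `aggregate()`. The rows below are invented for illustration and are not taken from the dataset.

```r
# Invented sentiment-word rows: one row per matched word
toy <- data.frame(
  Product = "iPhone",
  tweet_code = c(1, 1, 2, 3, 3, 3),
  bing_sentiment = c("negative", "negative", "negative",
                     "positive", "positive", "positive"),
  stringsAsFactors = FALSE
)
toy$word_count <- 1  # each row is one word

# Step 1: number of sentiment words per tweet, product, and sentiment
per_tweet <- aggregate(word_count ~ Product + bing_sentiment + tweet_code,
                       data = toy, FUN = sum)

# Step 2: average of those per-tweet counts by product and sentiment
avg <- aggregate(word_count ~ Product + bing_sentiment,
                 data = per_tweet, FUN = mean)
print(avg)
```

Note that a tweet only enters a group if it contains at least one word of that sentiment, so with this approach each average is at least 1 and describes tweets that already use such words.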

Top 20 Sentiment words

Among the products with a substantial amount of data, the iPad or iPhone App (high percentage of positive words) and the iPhone (high percentage of negative words) were chosen, and the 20 most frequent positive words and the 20 most frequent negative words for each were visualized as bar graphs (bar graphs were used because absolute counts had to be compared). This analysis helps identify which factors drove the frequently occurring positive and negative evaluations.

For the iPad or iPhone App, which was evaluated very positively, the most frequent positive word was “free” (37 times). Considering that the data dates from 2013 or earlier, this was a period when the market for free applications was gradually expanding. The other positive words are abstract expressions and are presumed to include positive evaluations of function. Among negative words, “crashes” (5 times) appeared most frequently, which appears to refer to applications crashing.

For the iPhone, which was evaluated relatively negatively, the most frequent negative words were “sucks” (4 times) and “shit” (4 times), both offensive expressions. Below them were words that seem to point to functional defects in the product, such as “struggle,” “dies,” “dead,” “issue,” and “broken.”

With respect to the second hypothesis, these results suggest that the function of the product has a substantial influence on negative evaluations of a product or brand. Therefore, the hypothesis seems to have some validity.

iPhoneiPadApp_bing <- bing_data %>%
  filter(Product %in% "iPad or iPhone App")

Top10_iPhoneiPadApp <- iPhoneiPadApp_bing %>%
  group_by(bing_sentiment, word) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(bing_sentiment) %>%
  slice_max(count, n = 20, with_ties = FALSE) %>%
  ungroup()

## `summarise()` has grouped output by 'bing_sentiment'. You can override using
## the `.groups` argument.

graph4 <- ggplot(Top10_iPhoneiPadApp, aes(x =  reorder_within(word, count, bing_sentiment),
                  y = count,
                  fill = bing_sentiment)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~bing_sentiment, scales = "free") +
  scale_x_reordered() +
  labs(x = NULL, y = "Top 20 Sentiment words on iPad or iPhone App")

ggsave(filename = ("graph4.png"), plot = graph4, width = 8, height = 6)
graph4

iPhone_bing <- bing_data %>%
  filter(Product %in% "iPhone")

Top10_iPhone <-iPhone_bing %>%
  group_by(bing_sentiment, word) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(bing_sentiment) %>%
  slice_max(count, n = 20, with_ties = FALSE) %>%
  ungroup()

## `summarise()` has grouped output by 'bing_sentiment'. You can override using
## the `.groups` argument.

graph5 <- ggplot(Top10_iPhone, aes(x =  reorder_within(word, count, bing_sentiment),
                  y = count,
                  fill = bing_sentiment)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~bing_sentiment, scales = "free") +
  scale_x_reordered() +
  labs(x = NULL, y = "Top 20 Sentiment words on iPhone")

ggsave(filename = ("graph5.png"), plot = graph5, width = 8, height = 6)
graph5

TF-IDF Values by Sentiment

Apart from the frequently occurring words analyzed earlier, it is also necessary to check which words are distinctive for each sentiment. The aim is to understand which factors, beyond the frequency of emotion-expressing words, have an important influence on the evaluation of a product or brand.

To measure the importance of words in a way that does not depend on raw frequency alone, this analysis used the emotion labels from the original data (assigned by tweet context) and computed TF-IDF with each emotion class (positive, negative, neutral, unknown) treated as a document.

To emphasize the importance of each word visually, a scatter plot was used in which the size of each point reflects its TF-IDF value.

The analysis found several words with high importance for each emotion in addition to the words that appeared in the frequency analysis. Representative examples are ‘delegates’ among negative emotions and ‘begins’ among positive emotions. It can be inferred that the former reflects how a brand’s or company’s handling of personnel can strongly influence negative evaluations, and that the latter reflects how being the ‘start’ or ‘first’ of a specific technology or product in market competition affects positive evaluations. In other words, depending on the context, it is difficult to affirm that the second hypothesis always holds.

filter_words <- data.frame(word = c("link", "rt", "ipad", "google", "apple", "iphone", "android", "google"))
filter_words

##      word
## 1    link
## 2      rt
## 3    ipad
## 4  google
## 5   apple
## 6  iphone
## 7 android
## 8  google

data2 <- tokenized_data %>%
  anti_join(filter_words)

## Joining with `by = join_by(word)`

tf_idf_data2 <- data2 %>%
  count(Sentiment, word) %>%
  bind_tf_idf(term = word,
              document = Sentiment,
              n = n) %>%
  group_by(Sentiment) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE)

graph6 <- ggplot(tf_idf_data2, aes(x = reorder(word, tf_idf), y = tf_idf, color = Sentiment, size = tf_idf)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(1, 5)) +  
  facet_wrap(~ Sentiment, scales = "free") +
  labs(
    title = "TF-IDF Values by Sentiment",
    x = "Word",
    y = "TF-IDF Value",
    color = "Sentiment",
    size = "TF-IDF"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave(filename = ("graph6.png"), plot = graph6, width = 8, height = 6)
graph6
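
For reference, `bind_tf_idf()` computes term frequency as a word's count divided by the total count in its document, and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the word. The sketch below reproduces that arithmetic in base R on invented counts, with four sentiment classes (named here in shortened, hypothetical form) playing the role of documents.

```r
# Invented word counts per sentiment class (one row per class-word pair)
counts <- data.frame(
  Sentiment = c("negative", "negative", "negative", "positive",
                "positive", "neutral", "neutral", "unknown"),
  word = c("app", "crashes", "battery", "app",
           "free", "app", "free", "app"),
  n = c(3, 2, 1, 4, 4, 5, 1, 2),
  stringsAsFactors = FALSE
)

# tf: count of the word / total words in the same document (sentiment class)
counts$tf <- counts$n / ave(counts$n, counts$Sentiment, FUN = sum)

# idf: ln(number of documents / number of documents containing the word)
n_docs <- length(unique(counts$Sentiment))
doc_freq <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
print(counts[order(-counts$tf_idf), ])
```

With only four documents, the IDF can take just four values (log(4/1), log(4/2), log(4/3), and 0), so any word that occurs in all four sentiment classes scores zero however frequent it is. This is worth keeping in mind when reading the ranking.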

Bigram Network for representative products

For contextual analysis, the two products analyzed above were examined with bigram network graphs, which show how words connect to one another.

For the iPhone, which was evaluated relatively negatively, the graph shows that battery consumption was a factor in the negative evaluation. This makes it possible to see what the aforementioned ‘functional defects’ specifically were. This result indicates that the second hypothesis has some validity.

For the iPad or iPhone App, which was evaluated very positively, the connected words form combinations such as video-live-streaming, link-android-marketplace, photos-share, and TV-connected. This result reflects Apple’s business strategy. Early in the application platform market, Apple chose a strategy of increasing product value by amplifying the network effect between app users and providers. Service providers were therefore able to offer a wide range of functions to users through the app store, and users evaluated them positively. As a result, functions such as live video streaming, Android market compatibility, photo sharing, and TV connection compensated for the limitations of the hardware through the ‘excellent compatibility of the app store’. This history is evident in the graph.

bg_data <- data %>% 
  mutate(tweet_text = sapply(tweet_text, preprocess_tweets)) %>%
  unnest_tokens(bigram, tweet_text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  rename(Product = emotion_in_tweet_is_directed_at, Sentiment = is_there_an_emotion_directed_at_a_brand_or_product) %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

iPhone_bg <- bg_data %>% 
  filter(Product == "iPhone") %>% 
  count(word1, word2, sort = TRUE) %>%
  filter(n > 2) %>%
  graph_from_data_frame()


graph7 <- ggraph(iPhone_bg, layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(alpha = 0.5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)+
  labs(title = "Bigram Network for iPhone")

ggsave(filename = ("graph7.png"), plot = graph7, width = 8, height = 6)
graph7

iPhoneiPadApp_bg <- bg_data %>% 
  filter(Product == "iPad or iPhone App") %>% 
  count(word1, word2, sort = TRUE) %>%
  filter(n > 3) %>%
  graph_from_data_frame()


graph8 <- ggraph(iPhoneiPadApp_bg, layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(alpha = 0.5) +
  geom_node_text(aes(label = name), vjust = 0.5, hjust = 0.5)+
  labs(title = "Bigram Network for iPad or iPhone App")

ggsave(filename = ("graph8.png"), plot = graph8, width = 8, height = 6)
graph8
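
The bigram counting behind these networks can be sketched in base R: split each cleaned tweet into words, pair every word with its successor, and keep the pairs that occur often. The two tweets and the helper name `make_bigrams()` are made up for illustration, and the threshold is lowered to fit the tiny sample.

```r
# Made-up, already cleaned tweets (not taken from the dataset)
tweets <- c("iphone battery dies fast", "battery dies again")

# Hypothetical helper: adjacent word pairs within one tweet
make_bigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigrams <- unlist(lapply(tweets, make_bigrams))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

# Keep only pairs seen more than once (the report uses n > 2 and n > 3)
frequent <- bigram_counts[bigram_counts > 1]
print(frequent)
```

Each surviving pair becomes one edge in the network, so the frequency threshold directly controls how dense the graph looks.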

Conclusion

Research Hypothesis:

“People who have a negative evaluation of a particular brand will mainly leave tweets”: Wrong
“The negative evaluation will mainly be complaints about the functions of a particular brand’s products”: Correct (not in all contexts)

This study shows that users’ evaluations of IT products and companies on social network services such as Twitter are substantial whether the emotion is positive or negative. The criteria for evaluation may vary with the characteristics of hardware and software, but for hardware products with a high share of negative evaluation (such as the iPhone), functional defects contribute significantly to that negative evaluation.

ATA Final Report

20224595 Byun-Junhyuk

06/28