Project 2 - STAT 611 - Smith, Nick

1.0 Introduction

1.1 Overview of Project

Through the R programming language, and it’s respective software packages, it is possible to execute text mining across various platforms. Twitter is one of the most widely used social media platforms in the world. By performing text mining functions through R via the Twitter Developer API, a researcher (a data analyst, statistician, data scientist, etc.), can investigate various factors and variables to make conclusions and inference.

Of particular note, by processing tweets in R, a researcher can analyze the “sentiment” of various tweets to infer a consensus on a particular topic. Sentiment is a metric in natural language processing (NLP) that uses machine learning algorithms to analyze and classify the emotional “tone” of text data [1]. The functions used in this document will classify “negative” tones and “positive” tones on a scale of [-1,1], with a value of 0 noting neutral sentiment, positive values relating to positive sentiment, and negative values relating to negative sentiment.

This document will demonstrate software written to infer the sentiment on various tweets relating to the Graphics Processing Units [GPUs] manufactured by AMD and NVIDIA. It will compare the general sentiment at the observational time-frame of when the software is run in order to determine if the general sentiment at a given time is different between tweets mentioning “NVIDIA AND GPU OR NVIDIA” or “AMD AND GPU OR AMD”.

1.2 NVIDIA vs. AMD, GPUs, why care?

Computer hardware in recent years has exploded in popularity [2]. This is due to a variety of factors, whether it is an end-user building a gaming PC, a medical device manufacturer developing a computer for a next-generation medical device, an ambitious investor building a machine to mine bitcoin and cryptocurrency, or an engineer building a computer for computational fluid dynamics; the GPU (Graphic Processing Unit) is an integral component to any build. The GPU is a component that utilizes parallel processing to execute computations alongside a CPU (Central Processing Unit), and is used in a plethora of applications that require large amounts of computational power [3].

For years, NVIDIA dominated the market in terms of GPUs [4], followed by AMD (Advanced Micro Devices). Over time, AMD has grown to control more of the market share as a valid competitor in GPU hardware, with NVIDIA controlling 83%, and AMD controlling 17% [4]. This fierce competition has led some to believe that AMD is a “fan favorite,” for it’s affordability and performance. As a result of COVID-19 pandemic, the semiconductor supply chain has been constricted. There is now a formal shortage, resulting in volatility in pricing, including aftermarket direct-to-buyer products (such as the GPU) [5].

2.0 Statement of the Problem:

Across a given timeframe, does the sentiment between tweets containing “NVIDIA AND GPU OR NVIDIA” or “AMD AND GPU OR AMD” on twitter differ from one another?

Using text mining methods in R, a researcher can “query” the tweets of a given time frame to see how sentiment across twitter varies between tweets. Qualitative analysis can be performed at the discretion of the researcher by looking at visualizations of tweet data and sentiment through various means. Quantitative analysis can be done by interpreting the results of a parametric, or non-parametric, statistical test. Through both means, they can conclude if at the time of the Twitter query, if there is a difference in sentiment between tweets containing “NVIDIA AND GPU OR NVIDIA” or “AMD AND GPU OR AMD”.

NOTE ON ANALYSIS AND THE QUESTION AT HAND

Please note that this is not a “stationary” look at Twitter data. A consequence of Twitter is that the sentiment may shift or vary over time. This software is analyzing a sample of ~2500 tweets in each category (~5000 total) over a given time frame of ~3 days. Over time, expect this sentiment to ebb and flow. The answer in this document refers explicitly to the time when the software is run over (3/23/22-3/26/22).

2.1 Noise Factors

As with any data analysis, the analyst needs to make a differentiation between the signal and the noise. In this document, the data queried on Twitter has noise factors which must be taken into account. This will help bring value the determination of whether or not the results noted in the latter portion of this document have significance.

One noise factor in this analysis is the fact that the accounts presenting these tweet are from a random sample of ~2500 tweets in each factor (NVIDIA or AMD). As a result of the conditions described in section 1.2, many accounts and tweets in this sample are the result of “bot” accounts. These accounts denote the rise and drop of pricing in the market of specific components by either NVIDIA or AMD.

This will contribute to the sentiment, along with news sources and individuals tweeting about these GPUs and their manufacturers. These accounts are intended to notify users of when an GPU of the respective manufacturer is for sale, lowers in price, raises in price, etc. They are included in the analysis, as the sentiment of these “bot” accounts can provide a an idea of the instantaneous pricing of the components from each manufacturer. Generally, if the price is raising and users are unhappy, this will provide meaningful results to the sentiment analysis. For most cases, neutral posts determining if there is a GPU of a particular brand on sale will not contribute to the sentiment score in a positive or negative fashion.

Another factor to note, is that the search terms of interest are strictly those outlined in 2.0. There is a possibility that the nature of the search engine is contributing to the noisiness of the data (e.g. If a user tweets “AMD GPUs and NVIDIA GPUs are terrible!”, or “I LOVE NVIDIA GPUs and HATE AMD GPUs!”) may or may not capture the isolated sentiment of both manufacturers. It is also possible the sentiment algorithm may miscategorize different languages due to translation issues. Implementing additional pre-processing / optimization of this search factor is not in the scope of this document, but should be considered for future iterations of analysis. The pre-processing in this document and it’s software organize data to be analyzed in the steps shown in this software and it’s code.

2.2 Using ggplot2, sentimentAnalysis, and wordcloud software packages for Data Visualization

This document uses ggplot2, sentimentAnalysis, and wordcloud packages for various data visualization graphs/plots.

2.3 Using dplyr to Organize R DataFrames

dplyr, from the tidyverse family of R packages, is used to “pipe” data into tables and manipulate R dataframes.

2.4 Statistical Analysis Tools

In the quantitative/statistical analysis portion of this document, either parametric and non-parametric methods are used to generate a statistical result to be interpreted. It is assumed the reader has some comfort with these topics, so the descriptions below are abridged.

2.4.1 Parametric Methods - Student’s t-test in a Nutshell [6]

The parametric method used by the software outlined in this document is the Student’s t-test. This is done across two independent datasets (the sentiment of NVIDIA GPUs vs. AMD GPUs). If the data is normal per a Shapiro-Wilks test (Ho: “The data are normally distributed”, Ha: “The data are not normally distributed”), it fails to reject the null hypothesis (at alpha = .05), and can use the unpaired two sample t-test to see if the mean sentiment score of NVIDIA GPUs and AMD GPUs is different. This test is done on each sample (NVIDIA and AMD GPUs sentiment scores respectively, over this given timeframe).If the data is not normal, section 2.4.2 describes the non-parametric test used in this document.

In this case, use the unpaired two sample t-test since it is assumed do not know the true population standard deviation of the sentiment score of either population. If it rejects the null hypothesis at alpha = .05 (Ho: “The mean sentiment score of NVIDIA and AMD GPUs respectively are not different”, Ha: “The mean sentiment score of NVIDIA and AMD GPUs respectively are different”), one can determine the two mean sentiment values between each are different.

2.4.2 Non-parametreic Methods - Wilcoxon Signed Ranks Tests in a Nutshell [7]

For non-normal data, one cannot look at the mean difference. To give a measure of center, the median value is investigated. For the two unpaired groups, Ho: the true location shift between NVIDIA GPUs and AMD GPUs is equal to zero and Ha: the true location shift between NVIDIA GPUs and AMD GPUs is not equal to zero. At alpha = .05, if the test rejects the null hypothesis, one can conclude that there is a difference in median sentiment score between NVIDIA GPUs and AMD GPUs.

3.0 Data Collection

3.1 Preparing to Collect Data

To collect and process the data from twitter, first, the software packages below are imported:

# Load software packages

library(rtweet)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(SentimentAnalysis)

## 
## Attaching package: 'SentimentAnalysis'

## The following object is masked from 'package:base':
## 
##     write

library(ggplot2)
library(SnowballC)
library(wordcloud)

## Loading required package: RColorBrewer

# Disable warnings for markdown
options(warn = -1)

Then, the Twitter developer API is integrated in order to perform text mining of the Twitter data, and a token is created. This is done with the code below (this is done with the author’s Twitter data):

# set Twitter API permission = fill in each of these with your own API permission from
# your developer twitter account 

api_key = "KtTB6LCt9Ma9G2UkdXWbf5ZcX"
api_secret_key = "ZJrk7gxDwImrvyt8B76YZpYbQzi1AUziIBtXWNxooHKrVurmDJ"
access_token = "1067618556026781696-A7AE6z352dnCDHwP1vPDa8M52bNA3x"
access_token_secret = "vNDbitFjvzCqwrxNIyC9hPpIDIK58ChOyYasYr7JerzYa"

## authenticate via web browser = fill in the app name created in developer.twitter.com
token = create_token(
  app = "RProj2",
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret
)

3.2 Collecting Data

Through the rtweet software library, a text mining function search_tweets(), creates dataframe objects for each search category. It will query at ~2500 tweets in each category. In some cases, depending on the tweets it receives, it may not retrieve exactly 2500 tweets. This is compensated for in the Data Analysis section, 4.0.

See the code snippet below for the search terms used:

## twitter data - text mining

# search for tweets - create two test groups
# search for twitter statuses containing keyword(s) "NVIDIA AND GPU OR NVIDIA", 
# AND "AMD AND GPU OR AMD". 

NVIDIAgpus_search_df = search_tweets(q = "NVIDIA AND GPU OR NVIDIA", n = 2500)
AMDgpus_search_df = search_tweets(q= "AMD AND GPU OR AMD", n = 2500)

The inclusion of company mentions is also included in order to further the amount of tweets captured in the sample groups. [The truncated dataframes are outlined in the Appendix.]

4.0 Data Analysis

Data analysis is done for each dataframe outlined in section 3.2. Each step of the data analysis, is outlined by the following subsections.

4.1 Determination of User Frequency in Tweet Data Sample

Per the point introduced in section 2.1, dplyr is used in order to differentiate which users are tweeting:

# Users tweeting and frequency in the NVIDIA AND GPU 
NVIDIAgpus_tbl <- NVIDIAgpus_search_df %>%
  count(screen_name, sort = TRUE) %>%
  top_n(10) %>%
  mutate(screen_name = paste0("@", screen_name))

## Selecting by n

print(tibble(NVIDIAgpus_tbl))

## # A tibble: 10 × 2
##    screen_name          n
##    <chr>            <int>
##  1 @fixitfixitfixit   184
##  2 @gpustocks         160
##  3 @Datenheim_GPU     106
##  4 @BotInventory       82
##  5 @BotPCParts         81
##  6 @Gpu4Y              74
##  7 @Xplacasdevideo     74
##  8 @tweet_bot_317      71
##  9 @SnailMonitor       59
## 10 @DataLoverDrops     50

# Users tweeting and frequency in the AMD AND GPU 
AMDgpus_tbl <- AMDgpus_search_df %>%
  count(screen_name, sort = TRUE) %>%
  top_n(10) %>%
  mutate(screen_name = paste0("@", screen_name))

## Selecting by n

print(tibble(AMDgpus_tbl))

## # A tibble: 10 × 2
##    screen_name         n
##    <chr>           <int>
##  1 @fixittrackerca    73
##  2 @AMD_kush          57
##  3 @sufiyan_amd_      47
##  4 @gpustocks         39
##  5 @DataLoverDrops    22
##  6 @Raheel_Amd        19
##  7 @84_amd            18
##  8 @DaveTrouba        18
##  9 @gpu_drops         16
## 10 @gpudrops_ca       16

As suspected, for each sample (“NVIDIA AND GPU OR NVIDIA” and “AMD AND GPU OR AMD”), many of the top accounts are suspected bots (e.g. @BotInventory and @BotPCParts) which report sales and prices of individual components. To reiterate what was noted in section 2.1, many of these accounts can be assumed to not contribute to the sentiment in a meaningful manner. If there are specifics on pricing dropping or raising, it will still provide meaningful contributions to the overall sentiment score of it’s respective group.

4.2 Sentiment Analysis of each group

Using the sentimentAnalysis package, sentiment analysis is done across each tweet. The function uses machine learning algorithms to determine the net sentiment of the tweets, and it’s “components” of negativity, positivity, etc.

## Sentiment Analysis
#conduct sentiment analysis for each of these
NVIDIAgpus_sentiment = analyzeSentiment(NVIDIAgpus_search_df$text)
AMDgpus_sentiment = analyzeSentiment(AMDgpus_search_df$text)

[The truncated sentiment dataframes are outlined in the Appendix]

dplyr is then used to bind the datasets noted in sections 3.0-4.0 as such:

#brings together the results into one dataset
NVIDIAgpus_dataset = bind_cols(NVIDIAgpus_search_df, NVIDIAgpus_sentiment)
AMDgpus_dataset = bind_cols(AMDgpus_search_df, AMDgpus_sentiment)

[The truncated datasets are outlined in the Appendix]

4.3 Data Visualization with ggplot2 and sentimentAnalysis

4.3.1 Distribution Shape and Size, Histogram Plots

The histograms of each sample group are done via ggplot2. For each group, the count is labelled on the y-axis, and the sentiment score on the x-axis. Recall that the sentiment that is in the positive region [>0] corresponds to positive sentiment, and the negative region [<0] corresponds to negative sentiment. Note that sentiment equal to 0 is neutral.

For the NVIDIA AND GPUS group, the following histogram is generated:

## ggplot / data visualizations
# NVIDIA GPUs Plots
# plot a histogram of the NVIDIA GPU sentiments
ggplot(NVIDIAgpus_dataset, aes(x=SentimentGI)) + 
  ggtitle("NVIDIA GPU dataset sentiment") +
  geom_histogram(binwidth = 0.05, color="#FF0000", alpha = 0.5)

The shape of the data for the NVIDIA AND GPUS OR NVIDIA group does not seem to follow a bell-curve-like distribution, with a tail skewing towards positive sentiment. It can be argued that the center is ~0.00, or neutral sentiment. There is a gap before a small spike in negative sentiment. Most data on this histogram is in the positive sentiment range.

For the AMD AND GPUS group, the following histogram is generated:

# AMD GPUs Plots
# plot a histogram of the AMD GPUs sentiments
ggplot(AMDgpus_dataset, aes(x=SentimentGI)) + 
  ggtitle("AMDgpus dataset sentiment") +
  geom_histogram(binwidth = 0.05, color="#0000FF", alpha = 0.5)

In the AMD AND GPUS OR AMD group, it appears to be centered around neutral sentiment. It can be interpreted as a non-normal distribution, even with the large sample size. From a qualitative observation, it appears that there is a slight bias/skew towards positive sentiment as opposed to the NVIDIA histogram. There is a large spread in the negative and positive sentiment range.

To look at both histograms on the same graph, the following code is used:

# Combined Histogram of NVIDIA and AMD GPUs
ggplot()+
  geom_histogram(aes(x=AMDgpus_dataset$SentimentGI, fill="blue"), alpha=0.5) +
  geom_histogram(aes(x=NVIDIAgpus_dataset$SentimentGI, fill = "red"), alpha=0.5)+
  xlab("Sentiment Score")+ylab("Density")+
  ggtitle("Comparison of Tweet Sentiment of\nNVIDIA GPUs and AMD GPUs") +
  theme(plot.title = element_text(hjust=0.5, size = 10)) +
  theme(axis.title = element_text(size=10)) +
  scale_fill_manual(
    values = c("blue","red"),
    name = "Company (NVIDIA or AMD)",
    labels = c("AMD", "NVIDIA")
  )+
  theme(legend.text = element_text(size=10),
        legend.title = element_text(size=10))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The shape of both histograms do not appear to be very similar, though both are primarily centered around zero.. Visually, AMD appears to have more variance in nature in comparison to the NVIDIA histogram. The NVIDIA histogram has less neutral sentiment tweets. The NVIDIA histogram appears to show slightly less variance than that of the AMD histogram. It also should be noted the max positive sentiment for each group hovers around .5-.6, with the negative sentiment limit being asymmetrical. There are multiple outliers in each histogram plot.

4.3.2 Sentiment Analysis Plots via sentimentAnalysis

The sentiment analysis plot of the sentiment response vs. the time frame in days in which the tweet was posted are shown below for each group:

## sentiment plots
# plots NVIDIA GPU sentiment vs. time of post
plotSentimentResponse(NVIDIAgpus_dataset$created_at, 
                      NVIDIAgpus_dataset$SentimentGI, 
                      xlab = "time of twitter post Tweets") + 
                      ggtitle("NVIDIA Sentiment Response vs. 
                              Time of tweets [days]")

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

# plots AMD GPU sentiment  vs. time of post
plotSentimentResponse(AMDgpus_dataset$created_at, AMDgpus_dataset$SentimentGI, 
                      xlab = "time of twitter post Tweets") + 
                      ggtitle("AMD Sentiment Response vs. 
                              Time of tweets [days]")

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

It is clear that there is no inherent linearity in this either of these plots. Perhaps it can be argued there is a more “flat” response of the time of post and AMD Sentiment Response, but there is no obvious linear pattern. Data transformations and further analysis can be done to see if there is any linear relationship or correlation, but from the raw sentiment response for each group, there is no overt linear relationship.

The sentiment analysis plot of # of post retweets for AMD GPUS vs sentiment are shown below.

# plots # of retweets the post received for NVIDIA GPUs vs. NVIDIA sentiment 
plotSentimentResponse(NVIDIAgpus_dataset$SentimentGI, 
                      NVIDIAgpus_dataset$retweet_count,
                      ylab = "# of post retweets for NVIDIA GPUS") + 
                      ggtitle("# post retweers  vs. NVIDIA Sentiment Response")

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

# plots # of retweets the post received for AMD GPUs  vs. AMD sentiment
plotSentimentResponse(AMDgpus_dataset$SentimentGI,AMDgpus_dataset$retweet_count,
                      ylab = "# of post retweets for AMD GPUS") + 
                      ggtitle("# post retweers  vs. AMD Sentiment Response")

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

A more flat response is observed for NVIDIA, with apparently more positive and noisy data across the AMD GPUS group. There appears to be more variance with regards to the response.

4.3.3 Wordcloud Frequency Plots via wordcloud

Each respective group is now analyzed for terms of frequency in their respective “Wordcloud” plots of 200 of the most frequent words. [Note: Emojis and non-English characters are omitted from wordcloud.]

## NVIDIA GPUs create word cloud
#Which words were the most frequent?
wordcloud(NVIDIAgpus_search_df$text, min.freq=30, max.words = 200,  
          scale=c(3.20, .40), random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

## AMD GPUs create word cloud
#Which words were the most frequent?
wordcloud(AMDgpus_search_df$text, min.freq=30, max.words = 200, 
          scale=c(3.20, .40), random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

In both wordclouds, very similar terms are observed. It can be interpreted that the many of mentions on Twitter for both groups are to buy and sell units from both manufacturers. This may be attributed to the inherently volatile market of GPUs outlined in section 1.2. Some of the most frequent terms referred to gaming or sales of their products and GPUs.

4.4 Statistical Analysis and Comparison of Each Group (NVIDIA GPUS and AMD GPUS)

To determine whether or not to use parametric or non-parametric statistical analysis shall be used, a function is employed. This function “compare_two_sentiments()” will only work if the data is pre-processed and cleaned through the aforementioned steps in this document.

The “compare_two_sentiments()” function takes in 4 arguments. The title of the arguments and are defined as such: title1 <“String”>, preProcessed.df1 <data.frame>, title2 <“String”> , preProcessed.df2 <data.frame>. The functions used in this function can compensate for different lengths of each dataframe if the initial text mining yields different numbers of elements in the base dataframe. This is done by limiting to a length of the sample that is equal in each test.

It uses the pre-processed data to evaluate if the data appears normal or not via the Shapiro-Wilks test. If it does not reject the null hypothesis, it will evaluate the mean sentiment response score of both groups per section 2.4.1. If it rejects the null hypothesis outlined in section 2.4.1, it will evaluate the median response score of both groups through the non-parametric test, unpaired Wilcoxon Signed Ranks Test, described in 2.4.2.

The data summary, result of the Shapiro-Wilks test, and the results of the statistical test that is appropriate is stored in a list data type, and returned to the user.

Note: This function can be generalized for any two pre-processed dataframes of sentiment data of any two categories if it is processed in a similar manner as outlined in this document.

# Data Analysis
# comparing two dataframes, pre-processed, for sentiments. 
# Function Arguments: title1 "String", preProcessed.df1, 
# title2 "String", preProcessed.df2    
compare_two_sentiments <- function(sentimenttitle1="", sentimentdf, 
                                   sentimenttitle2="", sentimentdf2){
  
  #data summaries
  summary1 <- summary(sentimentdf$SentimentGI)
  summary1.title <- c("Summary of", sentimenttitle1)
  summary2 <- summary(sentimentdf2$SentimentGI)
  summary2.title <- c("Summary of", sentimenttitle2)
  
  #test for normality
  testfornorm_1 <- shapiro.test(sentimentdf$SentimentGI)
  testfornorm_2 <- shapiro.test(sentimentdf2$SentimentGI)
  
    if(testfornorm_1$p.value || testfornorm_2$p.value >= .05){
    
      print("Data is not normal per Shapiro-Wilks test.")
    
      # Wilcoxon Test
      nonparamTest <- wilcox.test(sentimentdf$SentimentGI[1:2400], 
                                sentimentdf2$SentimentGI[1:2400], paired = FALSE, 
                                alternative = "two.sided")
    
      listnonparam <- list(summary1.title, summary1, testfornorm_1, 
                          summary2.title, summary2, testfornorm_2, nonparamTest)
    
      return(listnonparam)
    
  }
  
    else{
    
      print("Data is normal per Shapiro-Wilks test.")
    
      # Two sample t test
      twosamplettest <- t.test(sentimentdf$SentimentGI[1:2400], 
                               sentimentdf2$SentimentGI[1:2400], 
                             paired = FALSE, alternative = "two.sided")
    
     listparam <- list(summary1.title, summary1, 
                        summary2.title, summary2, twosamplettest)
      return(listparam)
  }
}

compare_two_sentiments("AMD GPUs", AMDgpus_dataset, 
                       "NVIDIA GPUs", NVIDIAgpus_dataset)

## [1] "Data is not normal per Shapiro-Wilks test."

## [[1]]
## [1] "Summary of" "AMD GPUs"  
## 
## [[2]]
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.66667  0.00000  0.00000  0.04013  0.07692  0.57143 
## 
## [[3]]
## 
##  Shapiro-Wilk normality test
## 
## data:  sentimentdf$SentimentGI
## W = 0.84857, p-value < 2.2e-16
## 
## 
## [[4]]
## [1] "Summary of"  "NVIDIA GPUs"
## 
## [[5]]
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.37500  0.00000  0.04545  0.05341  0.10526  0.50000 
## 
## [[6]]
## 
##  Shapiro-Wilk normality test
## 
## data:  sentimentdf2$SentimentGI
## W = 0.92037, p-value < 2.2e-16
## 
## 
## [[7]]
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  sentimentdf$SentimentGI[1:2400] and sentimentdf2$SentimentGI[1:2400]
## W = 2535317, p-value = 5.519e-14
## alternative hypothesis: true location shift is not equal to 0

In this observation period, the data is not normal, so the unpaired Wilcoxon Signed Ranks Test shall be performed. The function output is displayed in the code snippet above. The data summaries appear to be different, with different variances and AMD holding a higher median. Per the unpaired Wilcoxon Signed Ranks Test, there is enough statistical evidence at the 5% level of significance, we reject the null hypothesis noted in section 2.4.2 to conclude that there is a difference in median between sentiment response of tweets containing NVIDIA GPUS and AMD GPUS.

5.0 Conclusions

Qualitatively, it does appear that there are some differences in the sentiment response score between tweets containing “NVIDIA AND GPUS OR NVIDIA” and “AMD AND GPUS OR AMD”. This is shown in the commentary of the various plots.

Statistically speaking, the data are not normal. As a consequence, one can only analyze from the notion of “center” generated through the non-parametric test listed in 2.4.2. This center is the median of both groups in question. The data does show a difference in median sentiment score value at this observational time frame. We can conclude that each group’s sentiment is different.

To answer the question noted in the beginning of this document:

At the given observational time period noted in section 2.0, there are qualitative differences, and statistically speaking; the median sentiment response score between tweets containing “NVIDIA AND GPU OR NVIDIA” and “AMD AND GPU OR AMD” does indeed show a difference.

6.0 Discussion

6.1 Implications of Study

The implications of this analysis is that each data group are not normally distributed. As noted previously, this leaves the analyst to infer upon the median. If the sample size was large enough, perhaps the unpaired two sample t-test would yield a difference in means between the groups. It also only notes this time frame noted in 1.0.

It would be interesting to observe more turbulent times, perhaps upon the release of a new GPU by either NVIDIA or AMD. Periodically querying the Twitter database could show differences over time.

6.2 Improvements of Study

To improve this study, or build upon it, the following measures may be done:

Query a larger tweet range (limited by existing Twitter developer API)
Compare different time ranges
Compare upon release of new product, or announcements
Run the study for only “bot” accounts, “news accounts,” and “individuals to see if there is any difference
Weigh the sentiment response variables based on the frequency of tweets from users
Optimize the search terms as listed in Section 2.0
Create a time series analysis for both datasets
Implement additional machine learning algorithms to analyze sentiment

6.3 Biases

Bias may exist through the opinions of the investigator on either companies. It may also exist through the observational time frame that the tweets were mined. It can be argued that the inclusion of “bot” and “news” accounts can create an inherent bias in the samples collected as well. Optimizing the search methods can help eliminate or minimize some biases. Further parsing of this study done in more specific categories of user can infer upon the sentiment from users of different backgrounds.

7.0 References

[1] Sentiment analysis - NVIDIA Data Science Glossary: https://www.nvidia.com/en-us/glossary/data-science/sentiment-analysis/

[2] Consumer Report on GPUs: https://www.consumerreports.org/computers/why-even-non-gamers-may-want-a-powerful-graphics-card/

[3] Intel - GPU Theory and Applications: https://www.intel.com/content/www/us/en/products/docs/processors/what-is-a-gpu.html

[4] PC GPU Market Share 2009-2021 by Vendor: https://www.statista.com/statistics/754557/worldwide-gpu-shipments-market-share-by-vendor/

[5] US Department of Commerce; Semiconductor Shortage: https://www.commerce.gov/news/blog/2022/01/results-semiconductor-supply-chain-request-information

[6] Parametric test - T-test Statistical test: https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/t-test

[7] Non-Parametric test - Wilcoxon Signed Ranks Tests: https://www.sciencedirect.com/topics/medicine-and-dentistry/wilcoxon-signed-ranks-test

8.0 Appendix

See Raw R Code from “Project 2 - STAT 611 - Smith, Nick.R” Below:

source("Project 2 - STAT 611 - Smith, Nick.R", echo=TRUE)

## 
## > library(rtweet)
## 
## > library(dplyr)
## 
## > library(SentimentAnalysis)
## 
## > library(ggplot2)
## 
## > library(SnowballC)
## 
## > library(wordcloud)
## 
## > options(warn = -1)
## 
## > api_key = "KtTB6LCt9Ma9G2UkdXWbf5ZcX"
## 
## > api_secret_key = "ZJrk7gxDwImrvyt8B76YZpYbQzi1AUziIBtXWNxooHKrVurmDJ"
## 
## > access_token = "1067618556026781696-A7AE6z352dnCDHwP1vPDa8M52bNA3x"
## 
## > access_token_secret = "vNDbitFjvzCqwrxNIyC9hPpIDIK58ChOyYasYr7JerzYa"
## 
## > token = create_token(app = "RProj2", consumer_key = api_key, 
## +     consumer_secret = api_secret_key, access_token = access_token, 
## +     access_sec .... [TRUNCATED] 
## 
## > NVIDIAgpus_search_df = search_tweets(q = "NVIDIA AND GPU OR NVIDIA", 
## +     n = 2500)
## 
## > AMDgpus_search_df = search_tweets(q = "AMD AND GPU OR AMD", 
## +     n = 2500)
## 
## > NVIDIAgpus_tbl <- NVIDIAgpus_search_df %>% count(screen_name, 
## +     sort = TRUE) %>% top_n(10) %>% mutate(screen_name = paste0("@", 
## +     screen_n .... [TRUNCATED]

## Selecting by n

## 
## > tbl_df(NVIDIAgpus_tbl)
## # A tibble: 10 × 2
##    screen_name          n
##    <chr>            <int>
##  1 @fixitfixitfixit   184
##  2 @gpustocks         160
##  3 @Datenheim_GPU     106
##  4 @BotInventory       82
##  5 @BotPCParts         81
##  6 @Gpu4Y              74
##  7 @Xplacasdevideo     74
##  8 @tweet_bot_317      71
##  9 @SnailMonitor       59
## 10 @DataLoverDrops     50
## 
## > AMDgpus_tbl <- AMDgpus_search_df %>% count(screen_name, 
## +     sort = TRUE) %>% top_n(10) %>% mutate(screen_name = paste0("@", 
## +     screen_name))

## Selecting by n

## 
## > tbl_df(AMDgpus_tbl)
## # A tibble: 10 × 2
##    screen_name         n
##    <chr>           <int>
##  1 @fixittrackerca    73
##  2 @AMD_kush          57
##  3 @sufiyan_amd_      47
##  4 @gpustocks         39
##  5 @DataLoverDrops    22
##  6 @Raheel_Amd        19
##  7 @84_amd            18
##  8 @DaveTrouba        18
##  9 @gpu_drops         16
## 10 @gpudrops_ca       16
## 
## > NVIDIAgpus_sentiment = analyzeSentiment(NVIDIAgpus_search_df$text)
## 
## > AMDgpus_sentiment = analyzeSentiment(AMDgpus_search_df$text)
## 
## > NVIDIAgpus_dataset = bind_cols(NVIDIAgpus_search_df, 
## +     NVIDIAgpus_sentiment)
## 
## > AMDgpus_dataset = bind_cols(AMDgpus_search_df, AMDgpus_sentiment)
## 
## > ggplot(NVIDIAgpus_dataset, aes(x = SentimentGI)) + 
## +     ggtitle("NVIDIA GPU dataset sentiment") + geom_histogram(binwidth = 0.05, 
## +     color = " ..." ... [TRUNCATED]

## 
## > ggplot(AMDgpus_dataset, aes(x = SentimentGI)) + ggtitle("AMDgpus dataset sentiment") + 
## +     geom_histogram(binwidth = 0.05, color = "#0000FF", alp .... [TRUNCATED]

## 
## > ggplot() + geom_histogram(aes(x = AMDgpus_dataset$SentimentGI, 
## +     fill = "blue"), alpha = 0.5) + geom_histogram(aes(x = NVIDIAgpus_dataset$Senti .... [TRUNCATED]

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## 
## > plotSentimentResponse(NVIDIAgpus_dataset$created_at, 
## +     NVIDIAgpus_dataset$SentimentGI, xlab = "time of twitter post Tweets") + 
## +     ggtitle(" ..." ... [TRUNCATED]

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

## 
## > plotSentimentResponse(AMDgpus_dataset$created_at, 
## +     AMDgpus_dataset$SentimentGI, xlab = "time of twitter post Tweets") + 
## +     ggtitle("AMD Se ..." ... [TRUNCATED]

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

## 
## > plotSentimentResponse(NVIDIAgpus_dataset$SentimentGI, 
## +     NVIDIAgpus_dataset$retweet_count, ylab = "# of post retweets for NVIDIA GPUS") + 
## +     .... [TRUNCATED]

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

## 
## > plotSentimentResponse(AMDgpus_dataset$SentimentGI, 
## +     AMDgpus_dataset$retweet_count, ylab = "# of post retweets for AMD GPUS") + 
## +     ggtitle( .... [TRUNCATED]

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

## 
## > wordcloud(NVIDIAgpus_search_df$text, min.freq = 20, 
## +     max.words = 200, scale = c(3.2, 0.4), random.order = FALSE, 
## +     rot.per = 0.35, colors .... [TRUNCATED]

## 
## > wordcloud(AMDgpus_search_df$text, min.freq = 20, max.words = 200, 
## +     scale = c(3.2, 0.4), random.order = FALSE, rot.per = 0.35, 
## +     colors =  .... [TRUNCATED]

## 
## > compare_two_sentiments <- function(sentimenttitle1 = "", 
## +     sentimentdf, sentimenttitle2 = "", sentimentdf2) {
## +     summary1 <- summary(sentime .... [TRUNCATED] 
## 
## > compare_two_sentiments("AMD GPUs", AMDgpus_dataset, 
## +     "NVIDIA GPUs", NVIDIAgpus_dataset)
## [1] "Data is normal per Shapiro-Wilks test."
## [[1]]
## [1] "Summary of" "AMD GPUs"  
## 
## [[2]]
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.66667  0.00000  0.00000  0.04016  0.07692  0.57143 
## 
## [[3]]
## [1] "Summary of"  "NVIDIA GPUs"
## 
## [[4]]
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.37500  0.00000  0.04545  0.05341  0.10526  0.50000 
## 
## [[5]]
## 
##  Welch Two Sample t-test
## 
## data:  sentimentdf$SentimentGI[1:2400] and sentimentdf2$SentimentGI[1:2400]
## t = -4.5447, df = 4631.5, p-value = 5.641e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.017051439 -0.006773831
## sample estimates:
##  mean of x  mean of y 
## 0.04091498 0.05282761

Text Mining in R: Twitter Data Sentiment Analysis Across NVIDIA and AMD Graphics Processing Units [GPUs]

STAT 611: R Programming - Nick Smith

2022-03