1. PREPARE

1a. Context

Two of the most widely used programming tools to emerge in the data science field over the past twenty years are Python and R. Their popularity has likely risen because both are free, open source, and relatively simple to learn (at least initially). While each language was developed for a slightly different purpose, both are considered essential to anyone working with large data sets, creating colorful visualizations, or developing complex machine learning algorithms. This analysis uses text mining of Twitter data to decipher what pushes users to choose one language over the other, or to use both.

The graph below shows how these two languages have trended over time, based on the use of their tags since 2008, when Stack Overflow was founded:

1b. Guiding Questions

The following general questions are guiding this text analysis:

  1. Which programming language is used the most?
  2. Are the languages used for different programming projects?
  3. How do tweets with the #Python hashtag differ from those with the #Rstats hashtag?
  4. How does participation in #Python and #Rstats discussions relate to the public sentiment individuals express?
  5. How does public sentiment vary over time?

Our project goal is to gauge public sentiment and popularity around the programming languages by comparing Python and R tweets.

Our specific research questions for this analysis are:

  1. How does popularity for Python compare to that of R?
  2. How does sentiment for Python compare to sentiment for R?
  3. What key words/projects/topics are most used for Python and R?

1c. Setup

For this project, I’ll be installing and loading the same array of packages explored during the Unit 2 Case Study:

# Discovered there was an error in the {wordcloud2} package on CRAN
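# (assumes {remotes} is already installed; if not, run install.packages("remotes"))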
remotes::install_github("lchiffon/wordcloud2")

library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(wordcloud2)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)

Validate Twitter App

This step stores and authenticates the API keys, ensuring the Twitter app from my developer account is active. Note: secret keys are hidden.

# authenticate via web browser
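# app_name, api_key, api_secret_key, access_token, and access_token_secret
# are presumably defined in a hidden setup chunk (see note above)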
token <- create_token(
  app = app_name,
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret)

# check to see if the token is loaded
get_token()
## <Token>
## <oauth_endpoint>
##  request:   https://api.twitter.com/oauth/request_token
##  authorize: https://api.twitter.com/oauth/authenticate
##  access:    https://api.twitter.com/oauth/access_token
## <oauth_app> Educational Text Mining
##   key:    <hidden>
##   secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---

2. WRANGLE

2a. Import Tweets

In this section, the rtweet package and some key functions are used to search for tweets of interest.

Search Tweets

The first step to creating our dataset is to import tweets based on our Python and R search terms. For ease of comparison, I will maintain the Python and R data in separate data frames initially.

python_all_tweets <- search_tweets(q = "#python", n=5000)

rstats_all_tweets <- search_tweets(q = "#rstats", n=5000)

Each query returned a data frame with over 4,500 observations. Unfortunately, many of the rows contained duplicate content due to an abundance of retweeting.
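
A quick sanity check, sketched here under the assumption that this version of rtweet returns an is_retweet column in its results, is to count how many rows are retweets:

# tally retweets in each raw result set
sum(python_all_tweets$is_retweet)
sum(rstats_all_tweets$is_retweet)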

Remove Retweets

python_non_retweets <- search_tweets(q = "#python", 
                                   n=5000, 
                                   include_rts = FALSE)

rstats_non_retweets <- search_tweets(q = "#rstats", 
                                   n=5000, 
                                   include_rts = FALSE)

These queries returned a similarly sized data frame for Python, but a much smaller one for R, implying much less recent activity around R.
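
A direct way to confirm the difference is to compare row counts:

nrow(python_non_retweets)
nrow(rstats_non_retweets)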

Write to Excel

Finally, the Twitter data frames are exported as Excel files for use in later exercises, since live search results change from minute to minute.

write_xlsx(python_non_retweets, "data/python_non_retweets.xlsx")
write_xlsx(rstats_non_retweets, "data/rstats_non_retweets.xlsx")
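
Later sessions can then reload these snapshots rather than re-querying the API; a minimal sketch, assuming the readxl package is available:

library(readxl)
python_non_retweets <- read_xlsx("data/python_non_retweets.xlsx")
rstats_non_retweets <- read_xlsx("data/rstats_non_retweets.xlsx")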

2b. Tidy Text

The tidytext package is used to both “tidy” and tokenize our tweets in order to create our data frame for analysis.

Filter and Reformat Data Frames

For this analysis, we want to filter the data by language and then reformat it into a data frame containing only the information needed to answer the specific research questions. Lastly, we’ll add a column indicating whether each row pertains to Python or R.

python_text <-
  python_non_retweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>%
  mutate(program = "python") %>%
  relocate(program)
rstats_text <-
  rstats_non_retweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>%
  mutate(program = "rstats") %>%
  relocate(program)

Combine Data Frames

tweets <- bind_rows(python_text, rstats_text)

And let’s take a quick look at both the head() and the tail() of this new tweets data frame to make sure it contains both “python” and “rstats” tweets:

head(tweets)
## # A tibble: 6 × 4
##   program screen_name created_at          text                                  
##   <chr>   <chr>       <dttm>              <chr>                                 
## 1 python  OlofPaulson 2022-02-04 20:57:26 "@NFLosophy Hey NFL👋 😉 \nMaybe a ba…
## 2 python  OlofPaulson 2022-02-04 14:11:02 "@paulabartabajo_ Thanks so much for …
## 3 python  OlofPaulson 2022-02-04 15:14:07 "@s1lent_cr0w Hey Crow👋\nMaybe this …
## 4 python  OlofPaulson 2022-02-04 14:25:47 "@Barbara61708255 Thanks for followin…
## 5 python  OlofPaulson 2022-02-04 09:11:19 "💪 Something to think about 💪\n\n#p…
## 6 python  OlofPaulson 2022-02-04 14:38:13 "🐍 TGIF Coding Challenge /Puzzle \nS…
tail(tweets)
## # A tibble: 6 × 4
##   program screen_name    created_at          text                               
##   <chr>   <chr>          <dttm>              <chr>                              
## 1 rstats  FosdemResearch 2022-01-31 16:10:00 "Join @FosdemResearch on Feb 5th a…
## 2 rstats  M_Steinhilber  2022-01-31 16:08:46 "Battling Corona is much easier af…
## 3 rstats  ryanahart      2022-01-31 16:01:02 "#genuary Day 31 - Negative Space\…
## 4 rstats  MajaIlicZg     2022-01-31 15:53:12 "Many thanks for the invitation, i…
## 5 rstats  steffilazerte  2022-01-31 15:50:14 "Looking forward to rOpenSci Cowor…
## 6 rstats  Rami_Krispin   2022-01-31 15:47:29 "R For Beginners! 🚀🚀🚀\n\nIf you…

Tokenize Text

tweet_tokens <- 
  tweets %>%
  unnest_tokens(output = word, 
                input = text, 
                token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

Remove Stop Words

Now let’s remove stop words like “the” and “a” that don’t help us learn much about what people are tweeting about the two programming languages.

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word")

Custom Stop Words

Before wrapping up, let’s take a quick count of the most common words in the tidy_tweets data frame:

count(tidy_tweets, word, sort = T)
## # A tibble: 19,924 × 2
##    word                 n
##    <chr>            <int>
##  1 #python           4330
##  2 #rstats           2759
##  3 100daysofcode     2237
##  4 #javascript       2119
##  5 #datascience      1713
##  6 #machinelearning  1363
##  7 #programming      1330
##  8 #ai               1314
##  9 #coding           1288
## 10 #iot              1221
## # … with 19,914 more rows

Several of the most common terms do not help the analysis: #python, python, and #rstats are redundant with the search terms themselves, and amp is an artifact of HTML-encoded ampersands (&amp;). A filter function is applied to weed out these terms.

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("#python", "#rstats", "python", "amp"))
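
An equivalent pattern, sketched below, stores the extra terms in a small custom stop-word table and reuses anti_join; this scales more gracefully as new noise terms turn up. (custom_stop_words is an illustrative helper, not part of the original analysis.)

# custom table of project-specific noise terms
custom_stop_words <- tibble(word = c("#python", "#rstats", "python", "amp"))

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(custom_stop_words, by = "word")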

2c. Add Sentiment Values

Finally, sentiment lexicons and the inner_join() function are introduced to append sentiment values to the data frame.

Get Sentiments

afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
loughran <- get_sentiments("loughran")

Join Sentiments

The final step in the data wrangling process is to join the tidy_tweets data frame with each of the lexicons.

sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_afinn
## # A tibble: 3,346 × 5
##    program screen_name created_at          word      value
##    <chr>   <chr>       <dttm>              <chr>     <dbl>
##  1 python  OlofPaulson 2022-02-04 20:57:26 easy          1
##  2 python  OlofPaulson 2022-02-04 20:57:26 enjoy         2
##  3 python  OlofPaulson 2022-02-04 14:11:02 ill          -2
##  4 python  OlofPaulson 2022-02-04 14:11:02 free          1
##  5 python  OlofPaulson 2022-02-04 14:11:02 share         1
##  6 python  OlofPaulson 2022-02-04 15:14:07 free          1
##  7 python  OlofPaulson 2022-02-04 15:14:07 enjoy         2
##  8 python  OlofPaulson 2022-02-04 14:25:47 hope          2
##  9 python  OlofPaulson 2022-02-04 14:25:47 free          1
## 10 python  OlofPaulson 2022-02-04 14:38:13 challenge    -1
## # … with 3,336 more rows
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_bing
## # A tibble: 3,412 × 5
##    program screen_name created_at          word    sentiment
##    <chr>   <chr>       <dttm>              <chr>   <chr>    
##  1 python  OlofPaulson 2022-02-04 20:57:26 easy    positive 
##  2 python  OlofPaulson 2022-02-04 20:57:26 enjoy   positive 
##  3 python  OlofPaulson 2022-02-04 14:11:02 free    positive 
##  4 python  OlofPaulson 2022-02-04 15:14:07 master  positive 
##  5 python  OlofPaulson 2022-02-04 15:14:07 free    positive 
##  6 python  OlofPaulson 2022-02-04 15:14:07 enjoy   positive 
##  7 python  OlofPaulson 2022-02-04 14:25:47 free    positive 
##  8 python  OlofPaulson 2022-02-04 14:38:13 pretend negative 
##  9 python  OlofPaulson 2022-02-04 13:55:49 free    positive 
## 10 python  OlofPaulson 2022-02-04 15:10:46 easy    positive 
## # … with 3,402 more rows
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")
sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")

3. EXPLORE

Now that we have our tweets tidied and sentiments joined, we’re ready for a little data exploration. One goal in this phase is to explore questions that drove the original analysis. Topics addressed in Section 3 include:

  1. Analyze. We take a quick look at a few statistical summaries to better understand sentiment counts and proportions for Python versus R.
  2. Visualize. We put together some basic graphical summaries of our sentiment values in order to compare the use of Python and R.

3a. Analyze

Sentiment Counts

Let’s start with bing, our simplest sentiment lexicon, and use the count function to tally how many times “positive” and “negative” occur in the sentiment column of our sentiment_bing data frame:

summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)

Collectively, it looks like our combined dataset has more positive words than negative words.

summary_bing
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 positive   2189
## 2 negative   1223

Since our main goal is to compare positive and negative sentiment between Python and R, let’s use the group_by function to get sentiment summaries for the two programming languages separately:

summary_bing <- sentiment_bing %>% 
  group_by(program) %>% 
  count(sentiment) 
summary_bing
## # A tibble: 4 × 3
## # Groups:   program [2]
##   program sentiment     n
##   <chr>   <chr>     <int>
## 1 python  negative    656
## 2 python  positive   1240
## 3 rstats  negative    567
## 4 rstats  positive    949

Looks like both languages have far more positive words than negative, with Python skewing somewhat more positive.

Compute Sentiment Value

Next, we’ll calculate a single sentiment “score” for each language that can be used for quick comparison, and create a new variable indicating which lexicon we used.

First, let’s untidy our data a little by using the spread function from the tidyr package to transform the sentiment column into separate negative and positive columns containing the n counts for each:

summary_bing <- sentiment_bing %>% 
  group_by(program) %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) 
summary_bing
## # A tibble: 2 × 3
## # Groups:   program [2]
##   program negative positive
##   <chr>      <int>    <int>
## 1 python       656     1240
## 2 rstats       567      949
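
Note that spread has since been superseded in tidyr by pivot_wider; an equivalent version of the reshaping step would be:

summary_bing <- sentiment_bing %>%
  group_by(program) %>%
  count(sentiment, sort = TRUE) %>%
  pivot_wider(names_from = sentiment, values_from = n)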

Finally, we’ll use the mutate function to create two new variables: sentiment and lexicon so we have a single sentiment score and the lexicon from which it was derived:

summary_bing <- sentiment_bing %>% 
  group_by(program) %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(lexicon = "bing") %>%
  relocate(lexicon)
summary_bing
## # A tibble: 2 × 5
## # Groups:   program [2]
##   lexicon program negative positive sentiment
##   <chr>   <chr>      <int>    <int>     <int>
## 1 bing    python       656     1240       584
## 2 bing    rstats       567      949       382

There we go: now we can see that Python scores much more positive than R with the bing lexicon.

Let’s calculate a quick score using the other lexicons now.

summary_afinn <- sentiment_afinn %>% 
  group_by(program) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(lexicon = "AFINN") %>%
  relocate(lexicon)
summary_afinn
## # A tibble: 2 × 3
##   lexicon program sentiment
##   <chr>   <chr>       <dbl>
## 1 AFINN   python        973
## 2 AFINN   rstats       1441

Again, both remain relatively positive; in this case, however, R scored higher than Python.

summary_nrc <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(program) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc")  %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive/negative) %>% 
  relocate(method)
summary_nrc
## # A tibble: 2 × 5
## # Groups:   program [2]
##   method program negative positive sentiment
##   <chr>  <chr>      <int>    <int>     <dbl>
## 1 nrc    python       656     1240      1.89
## 2 nrc    rstats       567      949      1.67

summary_loughran <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(program) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran")  %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive/negative) %>% 
  relocate(method)
summary_loughran
## # A tibble: 2 × 5
## # Groups:   program [2]
##   method   program negative positive sentiment
##   <chr>    <chr>      <int>    <int>     <dbl>
## 1 loughran python       285      427      1.50
## 2 loughran rstats       339      344      1.01

Three of the four lexicon scores suggest that Python is discussed in a more positive manner than R in the tweets we’ve captured, AFINN being the exception. This seems broadly consistent with the popularity trends highlighted in the introduction.

3b. Visualize

Now that we understand the sentiment a little better, we’ll use the ts_plot function to take a very quick look at how the number of tweets compare by programming language:

ts_plot(dplyr::group_by(tweets, program), "days")

Notice that this effectively creates a ggplot time series plot for the tweets; the second argument is ts_plot’s by interval, here set to “days”. The tweets only go back about five days, a window probably truncated by the volume of Python tweets against the 5,000-result cap. That said, R discussions appear much more consistent across that time frame.

Changing the time period to hours gives a much more refined scale for the tweets; the original call is not shown, but it would presumably mirror the one above:
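
ts_plot(dplyr::group_by(tweets, program), "hours")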

To better understand potential topics of interest within these tweets, we will filter the tidied data frames by programming language, and then focus on the top 50 terms:

top_tokens_python <- tidy_tweets %>%
              filter(program == "python") %>% 
              count(word, sort = TRUE) %>%
              top_n(50)
## Selecting by n
top_tokens_python
## # A tibble: 50 × 2
##    word                 n
##    <chr>            <int>
##  1 100daysofcode     1487
##  2 #javascript       1441
##  3 #programming       795
##  4 #datascience       762
##  5 #coding            715
##  6 #machinelearning   690
##  7 #ai                602
##  8 #essay             581
##  9 #iot               539
## 10 pay                515
## # … with 40 more rows

We can then construct a wordcloud to visualize the topical themes for Python:

wordcloud2(top_tokens_python)

Applying this same method to the R data:

top_tokens_rstats <- tidy_tweets %>%
              filter(program == "rstats") %>% 
              count(word, sort = TRUE) %>%
              top_n(50)
## Selecting by n
top_tokens_rstats
## # A tibble: 50 × 2
##    word                 n
##    <chr>            <int>
##  1 #datascience       951
##  2 100daysofcode      750
##  3 #ai                712
##  4 #iot               682
##  5 #javascript        678
##  6 #machinelearning   673
##  7 #analytics         671
##  8 #iiot              630
##  9 #tensorflow        618
## 10 #bigdata           604
## # … with 40 more rows
wordcloud2(top_tokens_rstats)

4. COMMUNICATE

4a. Select

Recall from the questions guiding this research that the focus is on which programming languages are preferred across the data science community and how they are used. Specifically:

  1. How does popularity for Python compare to that of R?
  2. How does sentiment for Python compare to sentiment for R?
  3. What key words/projects/topics are most used for Python and R?

To address questions 1 and 2, I’m going to focus my analyses and data products on the following:

  1. Analyses. For RQ1, I want to replicate as closely as possible the analysis by Rosenberg et al., so I will clean up my analysis and calculate a single sentiment score for each entire tweet using the AFINN lexicon, labeling the tweet positive or negative based on that score. I also want to highlight how, regardless of the lexicon selected, these tweets contain more positive words than negative, so I’ll also polish my previous analyses and calculate the percentages of positive and negative words for each language.
  2. Data Products. I know pie charts are shunned in the world of data viz, but I think one will actually be an effective way to quickly communicate the proportion of positive and negative tweets for each language. And for my analyses with the bing, nrc, and loughran lexicons, I’ll create some 100% stacked bars showing the percentage of positive and negative words among all #Python and #Rstats tweets.

4b. Polish

Programming Language Popularity

ggplot(tweets, aes(x = program, fill = program)) +
  geom_bar(width = .6, show.legend = FALSE) +
  labs(title = "Language Popularity on Twitter",
       subtitle = "#Python vs #Rstats Hashtag Counts Last 5 Days") +
  xlab(label = "Language") +
  ylab(label = "# Tweets")

Hashtag counts show that Python was discussed more often than R.

To polish my analyses and prepare for publication, I first need to rebuild the tweets dataset from python_non_retweets and rstats_non_retweets, selecting both status_id, which is unique to each tweet, and the text column, which contains the actual post:

python_text_clean <-
  python_non_retweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(program = "python") %>%
  relocate(program)
rstats_text_clean <-
  rstats_non_retweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(program = "rstats") %>%
  relocate(program)
tweets_clean <- bind_rows(python_text_clean, rstats_text_clean)
tweets_clean
## # A tibble: 5,953 × 3
##    program status_id           text                                             
##    <chr>   <chr>               <chr>                                            
##  1 python  1489704675117273088 "@NFLosophy Hey NFL👋 😉 \nMaybe a basic #Python…
##  2 python  1489602402181431306 "@paulabartabajo_ Thanks so much for follow 🙏\n…
##  3 python  1489618278347689991 "@s1lent_cr0w Hey Crow👋\nMaybe this will help m…
##  4 python  1489606111405690880 "@Barbara61708255 Thanks for following ❤️\nHope t…
##  5 python  1489526973038743554 "💪 Something to think about 💪\n\n#programming …
##  6 python  1489609242281656328 "🐍 TGIF Coding Challenge /Puzzle \nSave #justin…
##  7 python  1489598572517597192 "@Bitcoinvangeli1 Thank's for the follow ❤️\nHope…
##  8 python  1489617432662757381 "@missbikesalot Hey Rachel👋\nHere’s a #Python 1…
##  9 python  1489604376662626304 "@PalpatinThesis ❤️Thank you for following🙏\nHop…
## 10 python  1489617938797809666 "@anugayeah Hey Cheekoo 👋\nMaybe a basic #Pytho…
## # … with 5,943 more rows

The status_id is important because it enables calculating an overall sentiment score for each tweet, rather than for each word. Before tweet-level sentiment scores can be assigned, however, the tweets must be tidied again and word-level sentiment values attached.

sentiment_afinn_clean <- tweets_clean %>%
  unnest_tokens(output = word, 
                input = text, 
                token = "tweets")  %>% 
  anti_join(stop_words, by = "word") %>%
  filter(!word == "#python" & !word == "#rstats"
         & !word == "python" & !word == "amp") %>%
  inner_join(afinn, by = "word")
sentiment_afinn_clean
## # A tibble: 3,346 × 4
##    program status_id           word      value
##    <chr>   <chr>               <chr>     <dbl>
##  1 python  1489704675117273088 easy          1
##  2 python  1489704675117273088 enjoy         2
##  3 python  1489602402181431306 ill          -2
##  4 python  1489602402181431306 free          1
##  5 python  1489602402181431306 share         1
##  6 python  1489618278347689991 free          1
##  7 python  1489618278347689991 enjoy         2
##  8 python  1489606111405690880 hope          2
##  9 python  1489606111405690880 free          1
## 10 python  1489609242281656328 challenge    -1
## # … with 3,336 more rows

Next, I want to calculate a single score for each tweet. To do that, I’ll use the now-familiar group_by and summarise functions:

afinn_score <- sentiment_afinn_clean %>% 
  group_by(program, status_id) %>% 
  summarise(value = sum(value))
afinn_score
## # A tibble: 2,403 × 3
## # Groups:   program [2]
##    program status_id           value
##    <chr>   <chr>               <dbl>
##  1 python  1489377043406114816     2
##  2 python  1489377458763837445    -1
##  3 python  1489377675919642625    -1
##  4 python  1489377868127813633    -1
##  5 python  1489378233418235904     1
##  6 python  1489378371578449920     2
##  7 python  1489378588457680898    -1
##  8 python  1489379318019067905     4
##  9 python  1489379369520963588    -2
## 10 python  1489379604519337984    -1
## # … with 2,393 more rows

I’ll add a flag for whether each tweet is “positive” or “negative”, using the mutate function to create a new sentiment column.

afinn_sentiment <- afinn_score %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))
afinn_sentiment
## # A tibble: 2,344 × 4
## # Groups:   program [2]
##    program status_id           value sentiment
##    <chr>   <chr>               <dbl> <chr>    
##  1 python  1489377043406114816     2 positive 
##  2 python  1489377458763837445    -1 negative 
##  3 python  1489377675919642625    -1 negative 
##  4 python  1489377868127813633    -1 negative 
##  5 python  1489378233418235904     1 positive 
##  6 python  1489378371578449920     2 positive 
##  7 python  1489378588457680898    -1 negative 
##  8 python  1489379318019067905     4 positive 
##  9 python  1489379369520963588    -2 negative 
## 10 python  1489379604519337984    -1 negative 
## # … with 2,334 more rows

Note that since a tweet sentiment score of 0 is neutral, I used the filter function to remove those tweets from the dataset (59 of the 2,403 scored tweets).

Finally, we’re ready to compute our ratio. We’ll use the group_by function and count the number of positive and negative tweets for each language. Then we’ll use the spread function to separate the counts into columns so we can compute the negative-to-positive ratio.

afinn_ratio <- afinn_sentiment %>% 
  group_by(program) %>% 
  count(sentiment) %>% 
  spread(sentiment, n) %>%
  mutate(ratio = negative/positive)
afinn_ratio
## # A tibble: 2 × 4
## # Groups:   program [2]
##   program negative positive ratio
##   <chr>      <int>    <int> <dbl>
## 1 python       571      851 0.671
## 2 rstats       201      721 0.279

Finally, I’ll visualize these proportions as a pie chart for each language:

afinn_counts <- afinn_sentiment %>%
  group_by(program) %>% 
  count(sentiment) %>%
  filter(program == "python")
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "#Python Tweets",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()

afinn_counts <- afinn_sentiment %>%
  group_by(program) %>% 
  count(sentiment) %>%
  filter(program == "rstats")
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "#Rstats Tweets",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()
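
As a design note, the two pie-chart blocks above differ only in their filter; a single faceted version, sketched here rather than taken from the original analysis, would produce both charts at once:

afinn_sentiment %>%
  group_by(program) %>%
  count(sentiment) %>%
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_bar(width = .6, stat = "identity", position = "fill") +
  coord_polar(theta = "y") +
  facet_wrap(~program) +
  labs(title = "Proportion of Positive & Negative Tweets",
       subtitle = "#Python vs #Rstats") +
  theme_void()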

Python vs R Across Lexicons

summary_afinn2 <- sentiment_afinn %>% 
  group_by(program) %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")
summary_bing2 <- sentiment_bing %>% 
  group_by(program) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")
summary_nrc2 <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(program) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc") 
summary_loughran2 <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(program) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran") 

Next, I’ll combine those four data frames together using the bind_rows function again:

summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2,
                               summary_nrc2,
                               summary_loughran2) %>%
  arrange(method, program) %>%
  relocate(method)
summary_sentiment
## # A tibble: 16 × 4
## # Groups:   program [2]
##    method   program sentiment     n
##    <chr>    <chr>   <chr>     <int>
##  1 AFINN    python  positive   1125
##  2 AFINN    python  negative    908
##  3 AFINN    rstats  positive    960
##  4 AFINN    rstats  negative    353
##  5 bing     python  positive   1240
##  6 bing     python  negative    656
##  7 bing     rstats  positive    949
##  8 bing     rstats  negative    567
##  9 loughran python  positive    427
## 10 loughran python  negative    285
## 11 loughran rstats  positive    344
## 12 loughran rstats  negative    339
## 13 nrc      python  positive   1240
## 14 nrc      python  negative    656
## 15 nrc      rstats  positive    949
## 16 nrc      rstats  negative    567

Then I’ll create a new data frame with the total word counts for each language and each method, and join it to my summary_sentiment data frame:

total_counts <- summary_sentiment %>%
  group_by(method, program) %>%
  summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = c("method", "program")
sentiment_counts
## # A tibble: 16 × 5
## # Groups:   program [2]
##    method   program sentiment     n total
##    <chr>    <chr>   <chr>     <int> <int>
##  1 AFINN    python  positive   1125  2033
##  2 AFINN    python  negative    908  2033
##  3 AFINN    rstats  positive    960  1313
##  4 AFINN    rstats  negative    353  1313
##  5 bing     python  positive   1240  1896
##  6 bing     python  negative    656  1896
##  7 bing     rstats  positive    949  1516
##  8 bing     rstats  negative    567  1516
##  9 loughran python  positive    427   712
## 10 loughran python  negative    285   712
## 11 loughran rstats  positive    344   683
## 12 loughran rstats  negative    339   683
## 13 nrc      python  positive   1240  1896
## 14 nrc      python  negative    656  1896
## 15 nrc      rstats  positive    949  1516
## 16 nrc      rstats  negative    567  1516

Finally, I’ll add a new column that calculates the percentage of positive and negative words for each language:

sentiment_percents <- sentiment_counts %>%
  mutate(percent = n/total * 100)
sentiment_percents
## # A tibble: 16 × 6
## # Groups:   program [2]
##    method   program sentiment     n total percent
##    <chr>    <chr>   <chr>     <int> <int>   <dbl>
##  1 AFINN    python  positive   1125  2033    55.3
##  2 AFINN    python  negative    908  2033    44.7
##  3 AFINN    rstats  positive    960  1313    73.1
##  4 AFINN    rstats  negative    353  1313    26.9
##  5 bing     python  positive   1240  1896    65.4
##  6 bing     python  negative    656  1896    34.6
##  7 bing     rstats  positive    949  1516    62.6
##  8 bing     rstats  negative    567  1516    37.4
##  9 loughran python  positive    427   712    60.0
## 10 loughran python  negative    285   712    40.0
## 11 loughran rstats  positive    344   683    50.4
## 12 loughran rstats  negative    339   683    49.6
## 13 nrc      python  positive   1240  1896    65.4
## 14 nrc      python  negative    656  1896    34.6
## 15 nrc      rstats  positive    949  1516    62.6
## 16 nrc      rstats  negative    567  1516    37.4

Now that I have my sentiment percent summaries for each lexicon, I’m going to create 100% stacked bar charts for each lexicon:

sentiment_percents %>%
  ggplot(aes(x = program, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter", 
       subtitle = "#Python & #Rstats",
       x = "Language", 
       y = "Percentage of Words")

The chart above illustrates that in most cases (3 out of 4 lexicons), #python tweets contain more positive words than #rstats tweets.

4c. Narrate

  1. Purpose. The data science community uses both Python and R as key tools for conducting statistical analysis and producing digital products. This case study focused on determining which language the community prefers, as well as why a specific language gets chosen.

  2. Methods. For this project, I chose to look at how often and in what contexts the languages were discussed on Twitter. The hashtags most often used by their respective communities were chosen as representative of how those communities regarded their particular language choice. From this data, I explored tweet counts, sentiment analysis, and top discussion topics.

  3. Findings. Python is assessed to be the more popular coding language, as it was discussed more often and maintained higher positive sentiment scores across most of the lexicons. Top discussion topics by language included:

    - Python: Coding, JavaScript, AI, IoT, Writing

    - R: IoT, AI, Data, ML, Learning

  4. Discussion. Insights from this case study can guide newcomers to the data science community when deciding where to begin coding. Python will offer a larger community of users, as it is more often applied to general coding problems. Though the R community may be smaller, it focuses its efforts on specific problems in statistical analysis, machine learning, and visualization.

    A main limitation of this study was the size and scope of the dataset. The short time span (five days) may introduce recency bias, and the findings may not apply to other time periods. A much deeper pull of tweets might have shown how the languages have grown (or waned) in popularity over time, offering insight into when users began to identify niches or specific problems where each language shines.
