Two of the most widely used programming tools to emerge in data science over the past twenty years are Python and R. Their popularity likely stems from the fact that both are free, open source, and relatively easy to learn (at least initially). Although each language was developed for a somewhat different purpose, both are considered essential for anyone working with large data sets, creating colorful visualizations, or developing complex machine learning algorithms. This analysis uses text mining on Twitter to decipher what pushes a user to choose one over the other, or to use both.
The graph below shows how these two languages have trended over time, based on the use of their tags since 2008, when Stack Overflow was founded:
Our project goal is to gauge public sentiment and popularity around the programming languages by comparing Python and R tweets. The specific research questions guiding this analysis are:
1. Which language, Python or R, is preferred across the data science community?
2. How is each language used?
For this project, I’ll be installing the same array of packages explored during the Unit 2 Case Study:
# Discovered there was an error in the {wordcloud2} package on CRAN
remotes::install_github("lchiffon/wordcloud2")
library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(wordcloud2)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
This step stores and authenticates the API keys, ensuring the Twitter app from my developer account is active. Note: secret keys are hidden.
# authenticate via web browser
token <- create_token(
  app = app_name,
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret)
# check to see if the token is loaded
get_token()
## <Token>
## <oauth_endpoint>
## request: https://api.twitter.com/oauth/request_token
## authorize: https://api.twitter.com/oauth/authenticate
## access: https://api.twitter.com/oauth/access_token
## <oauth_app> Educational Text Mining
## key: <hidden>
## secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---
In this section, the rtweet package and some key functions are used to search for tweets of interest.
The first step to creating our dataset is to import tweets based on our Python and R search terms. For ease of comparison, I will maintain the Python and R data in separate data frames initially.
python_all_tweets <- search_tweets(q = "#python", n = 5000)
rstats_all_tweets <- search_tweets(q = "#rstats", n = 5000)
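A single search_tweets() call can return fewer tweets than requested once Twitter’s rate limit is hit. As a sketch (not the call used for the data below), rtweet’s retryonratelimit argument waits out the limit and resumes until n is reached:
# optional variant: wait out rate limits to collect the full 5,000 tweets
python_all_tweets <- search_tweets(q = "#python", n = 5000, retryonratelimit = TRUE)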
These queries returned data frames with over 4500 observations each. Unfortunately, many of them contained duplicate data due to an abundance of retweeting.
python_non_retweets <- search_tweets(q = "#python",
                                     n = 5000,
                                     include_rts = FALSE)
rstats_non_retweets <- search_tweets(q = "#rstats",
                                     n = 5000,
                                     include_rts = FALSE)
These queries returned a similarly sized data frame for Python but a much smaller one for R, implying far less recent original (non-retweet) activity for R.
Finally, the Twitter data frames are exported as Excel files for use in later exercises, since live search results change from minute to minute.
write_xlsx(python_non_retweets, "data/python_non_retweets.xlsx")
write_xlsx(rstats_non_retweets, "data/rstats_non_retweets.xlsx")
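Since writexl only writes files, reloading these snapshots in a later session would use a reader such as readxl (an assumed dependency, not loaded above):
# reload the saved snapshots in a later session (assumes readxl is installed)
library(readxl)
python_non_retweets <- read_excel("data/python_non_retweets.xlsx")
rstats_non_retweets <- read_excel("data/rstats_non_retweets.xlsx")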
The tidytext package is used to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
For this analysis, we want to filter the data by language and then reformat it into a frame containing only the information needed to answer the specific research questions. Lastly, we’ll add a column indicating whether each row pertains to Python or R.
python_text <-
python_non_retweets %>%
filter(lang == "en") %>%
select(screen_name, created_at, text) %>%
mutate(program = "python") %>%
relocate(program)
rstats_text <-
rstats_non_retweets %>%
filter(lang == "en") %>%
select(screen_name, created_at, text) %>%
mutate(program = "rstats") %>%
relocate(program)
tweets <- bind_rows(python_text, rstats_text)
And let’s take a quick look at both the head() and the tail() of this new tweets data frame to make sure it contains both “python” and “rstats” tweets:
head(tweets)
## # A tibble: 6 × 4
## program screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 python OlofPaulson 2022-02-04 20:57:26 "@NFLosophy Hey NFL👋 😉 \nMaybe a ba…
## 2 python OlofPaulson 2022-02-04 14:11:02 "@paulabartabajo_ Thanks so much for …
## 3 python OlofPaulson 2022-02-04 15:14:07 "@s1lent_cr0w Hey Crow👋\nMaybe this …
## 4 python OlofPaulson 2022-02-04 14:25:47 "@Barbara61708255 Thanks for followin…
## 5 python OlofPaulson 2022-02-04 09:11:19 "💪 Something to think about 💪\n\n#p…
## 6 python OlofPaulson 2022-02-04 14:38:13 "🐍 TGIF Coding Challenge /Puzzle \nS…
tail(tweets)
## # A tibble: 6 × 4
## program screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 rstats FosdemResearch 2022-01-31 16:10:00 "Join @FosdemResearch on Feb 5th a…
## 2 rstats M_Steinhilber 2022-01-31 16:08:46 "Battling Corona is much easier af…
## 3 rstats ryanahart 2022-01-31 16:01:02 "#genuary Day 31 - Negative Space\…
## 4 rstats MajaIlicZg 2022-01-31 15:53:12 "Many thanks for the invitation, i…
## 5 rstats steffilazerte 2022-01-31 15:50:14 "Looking forward to rOpenSci Cowor…
## 6 rstats Rami_Krispin 2022-01-31 15:47:29 "R For Beginners! 🚀🚀🚀\n\nIf you…
tweet_tokens <-
tweets %>%
unnest_tokens(output = word,
input = text,
token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
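If retained URLs add noise downstream, the tweet tokenizer can drop them; here is a sketch, relying on unnest_tokens() forwarding strip_url to tokenizers::tokenize_tweets():
# optional: drop URLs during tokenization
tweet_tokens <-
  tweets %>%
  unnest_tokens(output = word,
                input = text,
                token = "tweets",
                strip_url = TRUE)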
Now let’s remove stop words like “the” and “a” that don’t help us learn much about what people are tweeting about the two programming languages.
tidy_tweets <-
tweet_tokens %>%
anti_join(stop_words, by = "word")
Before wrapping up, let’s take a quick count of the most common words in the tidy_tweets data frame:
count(tidy_tweets, word, sort = T)
## # A tibble: 19,924 × 2
## word n
## <chr> <int>
## 1 #python 4330
## 2 #rstats 2759
## 3 100daysofcode 2237
## 4 #javascript 2119
## 5 #datascience 1713
## 6 #machinelearning 1363
## 7 #programming 1330
## 8 #ai 1314
## 9 #coding 1288
## 10 #iot 1221
## # … with 19,914 more rows
A couple of the most common terms do not help the analysis, namely the redundant words and hashtags #python, python, and #rstats, along with “amp” (an artifact of HTML-encoded ampersands). A filter is applied to weed out these terms.
tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("#python", "#rstats", "python", "amp"))
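An equivalent pattern (a sketch, not what was run above) keeps these project-specific terms in their own tibble, so future noise words only need to be added in one place:
# alternative: maintain project-specific stop words in a single tibble
custom_stop_words <- tibble(word = c("#python", "#rstats", "python", "amp"))
tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(custom_stop_words, by = "word")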
Finally, sentiment lexicons and the inner_join() function are introduced to append sentiment values to the data frame.
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
loughran <- get_sentiments("loughran")
The final step in the data wrangling process is to join the tidy_tweets with each of the lexicons.
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_afinn
## # A tibble: 3,346 × 5
## program screen_name created_at word value
## <chr> <chr> <dttm> <chr> <dbl>
## 1 python OlofPaulson 2022-02-04 20:57:26 easy 1
## 2 python OlofPaulson 2022-02-04 20:57:26 enjoy 2
## 3 python OlofPaulson 2022-02-04 14:11:02 ill -2
## 4 python OlofPaulson 2022-02-04 14:11:02 free 1
## 5 python OlofPaulson 2022-02-04 14:11:02 share 1
## 6 python OlofPaulson 2022-02-04 15:14:07 free 1
## 7 python OlofPaulson 2022-02-04 15:14:07 enjoy 2
## 8 python OlofPaulson 2022-02-04 14:25:47 hope 2
## 9 python OlofPaulson 2022-02-04 14:25:47 free 1
## 10 python OlofPaulson 2022-02-04 14:38:13 challenge -1
## # … with 3,336 more rows
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_bing
## # A tibble: 3,412 × 5
## program screen_name created_at word sentiment
## <chr> <chr> <dttm> <chr> <chr>
## 1 python OlofPaulson 2022-02-04 20:57:26 easy positive
## 2 python OlofPaulson 2022-02-04 20:57:26 enjoy positive
## 3 python OlofPaulson 2022-02-04 14:11:02 free positive
## 4 python OlofPaulson 2022-02-04 15:14:07 master positive
## 5 python OlofPaulson 2022-02-04 15:14:07 free positive
## 6 python OlofPaulson 2022-02-04 15:14:07 enjoy positive
## 7 python OlofPaulson 2022-02-04 14:25:47 free positive
## 8 python OlofPaulson 2022-02-04 14:38:13 pretend negative
## 9 python OlofPaulson 2022-02-04 13:55:49 free positive
## 10 python OlofPaulson 2022-02-04 15:10:46 easy positive
## # … with 3,402 more rows
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")
sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")
Now that we have our tweets tidied and sentiments joined, we’re ready for a little data exploration. The goal in this phase is to explore the questions that drove the original analysis.
Let’s start with bing, our simplest sentiment lexicon, and use the count function to tally how many times “positive” and “negative” occur in the sentiment column of our sentiment_bing data frame:
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
Collectively, it looks like our combined dataset has more positive words than negative words.
summary_bing
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 positive 2189
## 2 negative 1223
Since our main goal is to compare positive and negative sentiment between Python and R, let’s use the group_by function to get sentiment summaries for the two programming languages separately:
summary_bing <- sentiment_bing %>%
group_by(program) %>%
count(sentiment)
summary_bing
## # A tibble: 4 × 3
## # Groups: program [2]
## program sentiment n
## <chr> <chr> <int>
## 1 python negative 656
## 2 python positive 1240
## 3 rstats negative 567
## 4 rstats positive 949
Looks like both programs have far more positive words than negative, but Python skews much more positive.
Next, let’s calculate a single sentiment “score” for the tweets that can be used for quick comparison, and create a new variable indicating which lexicon we used.
First, let’s untidy our data a little by using the spread function from the tidyr package to transform our sentiment column into separate negative and positive columns containing the n counts for each:
summary_bing <- sentiment_bing %>%
group_by(program) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n)
summary_bing
## # A tibble: 2 × 3
## # Groups: program [2]
## program negative positive
## <chr> <int> <int>
## 1 python 656 1240
## 2 rstats 567 949
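As an aside, spread() has been superseded in current tidyr; pivot_wider() produces the same reshape (a sketch equivalent to the chunk above):
# modern tidyr equivalent of the spread() call
summary_bing <- sentiment_bing %>%
  group_by(program) %>%
  count(sentiment, sort = TRUE) %>%
  pivot_wider(names_from = sentiment, values_from = n)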
Finally, we’ll use the mutate function to create two new variables: sentiment and lexicon so we have a single sentiment score and the lexicon from which it was derived:
summary_bing <- sentiment_bing %>%
group_by(program) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n) %>%
mutate(sentiment = positive - negative) %>%
mutate(lexicon = "bing") %>%
relocate(lexicon)
summary_bing
## # A tibble: 2 × 5
## # Groups: program [2]
## lexicon program negative positive sentiment
## <chr> <chr> <int> <int> <int>
## 1 bing python 656 1240 584
## 2 bing rstats 567 949 382
There we go: now we can see that Python scores much more positively than R with the bing lexicon. Let’s calculate quick scores using the other lexicons now.
summary_afinn <- sentiment_afinn %>%
group_by(program) %>%
summarise(sentiment = sum(value)) %>%
mutate(lexicon = "AFINN") %>%
relocate(lexicon)
summary_afinn
## # A tibble: 2 × 3
## lexicon program sentiment
## <chr> <chr> <dbl>
## 1 AFINN python 973
## 2 AFINN rstats 1441
Again, both remain relatively positive. In this case, however, R scored higher than Python.
summary_nrc <- sentiment_nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(program) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "nrc") %>%
spread(sentiment, n) %>%
mutate(sentiment = positive/negative) %>%
relocate(method)
summary_nrc
## # A tibble: 2 × 5
## # Groups: program [2]
## method program negative positive sentiment
## <chr> <chr> <int> <int> <dbl>
## 1 nrc python 656 1240 1.89
## 2 nrc rstats 567 949 1.67
summary_loughran <- sentiment_loughran %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(program) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "loughran") %>%
spread(sentiment, n) %>%
mutate(sentiment = positive/negative) %>%
relocate(method)
summary_loughran
## # A tibble: 2 × 5
## # Groups: program [2]
## method program negative positive sentiment
## <chr> <chr> <int> <int> <dbl>
## 1 loughran python 285 427 1.50
## 2 loughran rstats 339 344 1.01
Most of the lexicon scores (bing, nrc, and loughran, with AFINN the exception) indicate that Python is discussed more positively than R in the tweets we’ve captured. This seems consistent with the popularity trends highlighted in the introduction.
Now that we understand the sentiment a little better, we’ll use the ts_plot function to take a very quick look at how the number of tweets compares by programming language:
ts_plot(dplyr::group_by(tweets, program), "days")
Notice that this effectively creates a ggplot time-series plot of the tweets. The second argument sets the by = interval, which defaults to “days”. The tweets only go back about five days, probably because the 5,000-tweet cap truncates the window, especially for the high-volume Python hashtag. That said, R appears to have been discussed much more consistently during that time frame.
Changing the interval to hours gives a much more refined scale for the tweets; a sketch of that call:
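# presumably the same call with an hourly interval
ts_plot(dplyr::group_by(tweets, program), "hours")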
To better understand potential topics of interest within these tweets, we will filter the tidied data frame by programming language and then focus on the top 50 terms:
top_tokens_python <- tidy_tweets %>%
filter(program == "python") %>%
count(word, sort = TRUE) %>%
top_n(50)
## Selecting by n
top_tokens_python
## # A tibble: 50 × 2
## word n
## <chr> <int>
## 1 100daysofcode 1487
## 2 #javascript 1441
## 3 #programming 795
## 4 #datascience 762
## 5 #coding 715
## 6 #machinelearning 690
## 7 #ai 602
## 8 #essay 581
## 9 #iot 539
## 10 pay 515
## # … with 40 more rows
We can then construct a wordcloud to visualize the topical themes for Python:
wordcloud2(top_tokens_python)
Applying this same method to the R data:
top_tokens_rstats <- tidy_tweets %>%
filter(program == "rstats") %>%
count(word, sort = TRUE) %>%
top_n(50)
## Selecting by n
top_tokens_rstats
## # A tibble: 50 × 2
## word n
## <chr> <int>
## 1 #datascience 951
## 2 100daysofcode 750
## 3 #ai 712
## 4 #iot 682
## 5 #javascript 678
## 6 #machinelearning 673
## 7 #analytics 671
## 8 #iiot 630
## 9 #tensorflow 618
## 10 #bigdata 604
## # … with 40 more rows
wordcloud2(top_tokens_rstats)
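wordcloud2() also exposes styling parameters; as an optional tweak (not applied above), size scales the fonts and color accepts preset schemes:
# optional styling: smaller fonts with light random colors
wordcloud2(top_tokens_rstats, size = 0.7, color = "random-light")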
Recall from the questions guiding this research that the focus is on which programming language is preferred across the data science community and how each is used. To address questions 1 and 2, I’m going to focus my analyses and data products on the following: hashtag counts for each language, and, using the bing, nrc, and loughran lexicons, 100% stacked bars showing the percentage of positive and negative words among all tweets for Python and R.
First, the hashtag counts:
ggplot(tweets, aes(x = program, fill = program)) +
geom_bar(width = .6, show.legend = FALSE) +
labs(title = "Language Popularity on Twitter",
subtitle = "#Python vs #Rstats Hashtag Counts Last 5 Days") +
xlab(label = "Language") +
ylab(label = "# Tweets")
Hashtag counts show that Python was discussed more often than R.
To polish my analyses and prepare for publication, I first need to rebuild the tweets dataset from python_non_retweets and rstats_non_retweets, selecting both the status_id, which is unique to each tweet, and the text column, which contains the actual post:
python_text_clean <-
python_non_retweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(program = "python") %>%
relocate(program)
rstats_text_clean <-
rstats_non_retweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(program = "rstats") %>%
relocate(program)
tweets_clean <- bind_rows(python_text_clean, rstats_text_clean)
tweets_clean
## # A tibble: 5,953 × 3
## program status_id text
## <chr> <chr> <chr>
## 1 python 1489704675117273088 "@NFLosophy Hey NFL👋 😉 \nMaybe a basic #Python…
## 2 python 1489602402181431306 "@paulabartabajo_ Thanks so much for follow 🙏\n…
## 3 python 1489618278347689991 "@s1lent_cr0w Hey Crow👋\nMaybe this will help m…
## 4 python 1489606111405690880 "@Barbara61708255 Thanks for following ❤️\nHope t…
## 5 python 1489526973038743554 "💪 Something to think about 💪\n\n#programming …
## 6 python 1489609242281656328 "🐍 TGIF Coding Challenge /Puzzle \nSave #justin…
## 7 python 1489598572517597192 "@Bitcoinvangeli1 Thank's for the follow ❤️\nHope…
## 8 python 1489617432662757381 "@missbikesalot Hey Rachel👋\nHere’s a #Python 1…
## 9 python 1489604376662626304 "@PalpatinThesis ❤️Thank you for following🙏\nHop…
## 10 python 1489617938797809666 "@anugayeah Hey Cheekoo 👋\nMaybe a basic #Pytho…
## # … with 5,943 more rows
The status_id is important because it enables calculating an overall sentiment score for each tweet, rather than for each word. Before assigning tweet sentiment scores, however, the tweets must be tidied again so the sentiment values can be attached.
sentiment_afinn_clean <- tweets_clean %>%
unnest_tokens(output = word,
input = text,
token = "tweets") %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "#python" & !word == "#rstats"
& !word == "python" & !word == "amp") %>%
inner_join(afinn, by = "word")
sentiment_afinn_clean
## # A tibble: 3,346 × 4
## program status_id word value
## <chr> <chr> <chr> <dbl>
## 1 python 1489704675117273088 easy 1
## 2 python 1489704675117273088 enjoy 2
## 3 python 1489602402181431306 ill -2
## 4 python 1489602402181431306 free 1
## 5 python 1489602402181431306 share 1
## 6 python 1489618278347689991 free 1
## 7 python 1489618278347689991 enjoy 2
## 8 python 1489606111405690880 hope 2
## 9 python 1489606111405690880 free 1
## 10 python 1489609242281656328 challenge -1
## # … with 3,336 more rows
Next, I want to calculate a single score for each tweet. To do that, I’ll use the by-now-familiar group_by and summarise functions:
afinn_score <- sentiment_afinn_clean %>%
group_by(program, status_id) %>%
summarise(value = sum(value))
afinn_score
## # A tibble: 2,403 × 3
## # Groups: program [2]
## program status_id value
## <chr> <chr> <dbl>
## 1 python 1489377043406114816 2
## 2 python 1489377458763837445 -1
## 3 python 1489377675919642625 -1
## 4 python 1489377868127813633 -1
## 5 python 1489378233418235904 1
## 6 python 1489378371578449920 2
## 7 python 1489378588457680898 -1
## 8 python 1489379318019067905 4
## 9 python 1489379369520963588 -2
## 10 python 1489379604519337984 -1
## # … with 2,393 more rows
I’ll then flag whether each tweet is “positive” or “negative” by using the mutate function to create a new sentiment column.
afinn_sentiment <- afinn_score %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive"))
afinn_sentiment
## # A tibble: 2,344 × 4
## # Groups: program [2]
## program status_id value sentiment
## <chr> <chr> <dbl> <chr>
## 1 python 1489377043406114816 2 positive
## 2 python 1489377458763837445 -1 negative
## 3 python 1489377675919642625 -1 negative
## 4 python 1489377868127813633 -1 negative
## 5 python 1489378233418235904 1 positive
## 6 python 1489378371578449920 2 positive
## 7 python 1489378588457680898 -1 negative
## 8 python 1489379318019067905 4 positive
## 9 python 1489379369520963588 -2 negative
## 10 python 1489379604519337984 -1 negative
## # … with 2,334 more rows
Note that since a tweet sentiment score of 0 is neutral, I used the filter function to remove those tweets from the dataset.
Finally, we’re ready to compute our ratio. We’ll use the group_by function and count the number of tweets for each language that are positive or negative in the sentiment column. Then we’ll use the spread function to separate them into columns so we can perform a quick calculation to compute the ratio.
afinn_ratio <- afinn_sentiment %>%
group_by(program) %>%
count(sentiment) %>%
spread(sentiment, n) %>%
mutate(ratio = negative/positive)
afinn_ratio
## # A tibble: 2 × 4
## # Groups: program [2]
## program negative positive ratio
## <chr> <int> <int> <dbl>
## 1 python 571 851 0.671
## 2 rstats 201 721 0.279
Finally, let’s visualize the proportion of positive and negative tweets for each language:
afinn_counts <- afinn_sentiment %>%
group_by(program) %>%
count(sentiment) %>%
filter(program == "python")
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "#Python Tweets",
subtitle = "Proportion of Positive & Negative Tweets") +
coord_polar(theta = "y") +
theme_void()
afinn_counts <- afinn_sentiment %>%
group_by(program) %>%
count(sentiment) %>%
filter(program == "rstats")
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "#Rstats Tweets",
subtitle = "Proportion of Positive & Negative Tweets") +
coord_polar(theta = "y") +
theme_void()
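The two chunks above differ only in the program filter; as a design alternative, a single faceted plot renders both at once. A sketch using the same afinn_sentiment data:
# alternative: one faceted plot instead of two near-identical chunks
afinn_sentiment %>%
  group_by(program) %>%
  count(sentiment) %>%
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  coord_polar(theta = "y") +
  facet_wrap(~program) +
  labs(title = "Proportion of Positive & Negative Tweets by Language") +
  theme_void()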
To compare all four lexicons side by side, I’ll first rebuild per-lexicon sentiment summaries:
summary_afinn2 <- sentiment_afinn %>%
group_by(program) %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "AFINN")
summary_bing2 <- sentiment_bing %>%
group_by(program) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "bing")
summary_nrc2 <- sentiment_nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(program) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "nrc")
summary_loughran2 <- sentiment_loughran %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(program) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "loughran")
Next, I’ll combine those four data frames using the bind_rows function again:
summary_sentiment <- bind_rows(summary_afinn2,
summary_bing2,
summary_nrc2,
summary_loughran2) %>%
arrange(method, program) %>%
relocate(method)
summary_sentiment
## # A tibble: 16 × 4
## # Groups: program [2]
## method program sentiment n
## <chr> <chr> <chr> <int>
## 1 AFINN python positive 1125
## 2 AFINN python negative 908
## 3 AFINN rstats positive 960
## 4 AFINN rstats negative 353
## 5 bing python positive 1240
## 6 bing python negative 656
## 7 bing rstats positive 949
## 8 bing rstats negative 567
## 9 loughran python positive 427
## 10 loughran python negative 285
## 11 loughran rstats positive 344
## 12 loughran rstats negative 339
## 13 nrc python positive 1240
## 14 nrc python negative 656
## 15 nrc rstats positive 949
## 16 nrc rstats negative 567
Then I’ll create a new data frame with the total word counts for each language and each lexicon, and join it to my summary_sentiment data frame:
total_counts <- summary_sentiment %>%
group_by(method, program) %>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = c("method", "program")
sentiment_counts
## # A tibble: 16 × 5
## # Groups: program [2]
## method program sentiment n total
## <chr> <chr> <chr> <int> <int>
## 1 AFINN python positive 1125 2033
## 2 AFINN python negative 908 2033
## 3 AFINN rstats positive 960 1313
## 4 AFINN rstats negative 353 1313
## 5 bing python positive 1240 1896
## 6 bing python negative 656 1896
## 7 bing rstats positive 949 1516
## 8 bing rstats negative 567 1516
## 9 loughran python positive 427 712
## 10 loughran python negative 285 712
## 11 loughran rstats positive 344 683
## 12 loughran rstats negative 339 683
## 13 nrc python positive 1240 1896
## 14 nrc python negative 656 1896
## 15 nrc rstats positive 949 1516
## 16 nrc rstats negative 567 1516
Finally, I’ll add a new column that calculates the percentage of positive and negative words for each language:
sentiment_percents <- sentiment_counts %>%
mutate(percent = n/total * 100)
sentiment_percents
## # A tibble: 16 × 6
## # Groups: program [2]
## method program sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 AFINN python positive 1125 2033 55.3
## 2 AFINN python negative 908 2033 44.7
## 3 AFINN rstats positive 960 1313 73.1
## 4 AFINN rstats negative 353 1313 26.9
## 5 bing python positive 1240 1896 65.4
## 6 bing python negative 656 1896 34.6
## 7 bing rstats positive 949 1516 62.6
## 8 bing rstats negative 567 1516 37.4
## 9 loughran python positive 427 712 60.0
## 10 loughran python negative 285 712 40.0
## 11 loughran rstats positive 344 683 50.4
## 12 loughran rstats negative 339 683 49.6
## 13 nrc python positive 1240 1896 65.4
## 14 nrc python negative 656 1896 34.6
## 15 nrc rstats positive 949 1516 62.6
## 16 nrc rstats negative 567 1516 37.4
Now that I have my sentiment percent summaries for each lexicon, I’m going to create 100% stacked bar charts for each lexicon:
sentiment_percents %>%
ggplot(aes(x = program, y = percent, fill=sentiment)) +
geom_bar(width = .8, stat = "identity") +
facet_wrap(~method, ncol = 1) +
coord_flip() +
labs(title = "Public Sentiment on Twitter",
subtitle = "#Python & #Rstats",
x = "Language",
y = "Percentage of Words")
The chart above illustrates that in most cases (3 out of 4 lexicons), #python tweets contain a higher share of positive words than #rstats tweets.
Purpose. The data science community uses both Python and R as key tools for statistical analysis and the production of digital products. This case study focused on determining the community’s language of choice, as well as why a given language is chosen.
Methods. For this project, I chose to look at how often and in what contexts the languages were discussed on Twitter. The hashtags most often used by their respective communities were chosen as representative of how those communities regarded their particular language choice. From this data, I explored tweet counts, sentiment analysis, and top discussion topics.
Findings. Python is assessed to be the more popular coding language as it was discussed more often and maintained higher positive sentiment scores across the various lexicons. Top discussion topics by language included:
- Python: Coding, JavaScript, AI, IoT, Writing
- R: IoT, AI, Data, ML, Learning
Discussion. Insights from this case study can guide newcomers to the data science community in deciding where to begin coding. Python has the larger community of users, as it is more often applied to general coding problems. Though the R community may be smaller, its users focus their efforts on specific problems in statistical analysis, machine learning, and visualization.
A main limitation of this study was the size and scope of the dataset. The short time span of the data (5 days) may introduce recency bias, and the findings may not apply to other periods. A much deeper pull of tweets might have shown how the languages have grown (or waned) in popularity over time, offering insight into when users began to identify niches or specific problems where each language shines.