Analyzing #MeToo Tweets

Introduction

The MeToo movement is a movement against sexual harassment and sexual assault. The phrase originated with activist and community organizer Tarana Burke in 2006. After allegations against Harvey Weinstein unfolded publicly, the phrase was popularized and went viral in October 2017, when actor Alyssa Milano shared her experiences with sexual assault and harassment under the hashtag #MeToo. Milano encouraged other victims of sexual harassment to tweet about their experiences in order to demonstrate the widespread prevalence of sexual assault and harassment, especially in the workplace. Researchers [1][2][3][4] have identified over 8.1 million tweets containing the hashtag and have shown that the most prevalent topics of conversation have been the movement and activism, sexual abuse and assault, harassment, and politics. More than 100 occupations have been mentioned at least 100 times [2], and there are many retweets and @ mentions of politicians and other high-profile individuals. Top associated hashtags include #TimesUp, #WithYou, #Resist and other political hashtags. Many researchers have explored the early #MeToo tweets to determine whether the movement is a leading indicator of societal change, and to identify the demographics of individuals using the hashtag in order to see how different communities participate in the discussion. More recently, #MeToo has returned to the news as different individuals (e.g., Louis CK, Charlie Rose) have attempted to re-enter the public eye and resume their careers [6], and news outlets’ focus has shifted to whether there is a significant backlash to the movement [5][7][8].

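A year later, how have the overall tone and content of #MeToo tweets shifted? Comparing a cache of tweets from the early part of the #MeToo movement with more recent tweets lets us ask:

- What words are most commonly associated with #MeToo?
- Is the language more positive or negative?
- Overall, what type of messaging comes through the #MeToo tweets in December 2017 versus December 2018? Has the year since the initial #MeToo outpouring changed the tweets significantly?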

To answer these questions, I analyze two sets of tweets carrying the #MeToo hashtag. The first is a dataset of 390,000 #MeToo tweets dated from November 29th to December 25th, 2017, downloaded from https://data.world/balexturner/390-000-metoo-tweets. The second is the most recent 10,000 tweets with the hashtag #MeToo, which I pulled from the Twitter API using my own account (see Code Appendix).

Data Preparation

2017 sample tweets

#read in the 2017 datafile
metootweets <- read.csv("/Users/meredithpowers/Desktop/metoo.csv", stringsAsFactors=FALSE)
head(metootweets$text)
[1] "American Harem.. #MeToo https://t.co/HjExLJdGuF"                                                                                              
[2] "@johnconyersjr  @alfranken  why have you guys not resigned yet? Liberal hypocrisy! #MeToo"                                                    
[3] "Watched Megan Kelly ask Joe Keery this A.M. if she can \"rub my fingers through your hair\", and refer to his body be https://t.co/Q86wfW7DeJ"
[4] "Women have been talking about this crap the entire time, finally someone listened. #metoo https://t.co/JlK11yhFXc"                            
[5] ".@BetteMidler please speak to this sexual assault by @GeraldoRivera during the interview. #MeToo  https://t.co/1iuafGaOmv"                    
[6] "We can't keep turning a blind eye and pretend this isn't real. #metoo https://t.co/1dLZcftbSs"                                                

2018 sample tweets

#read in the 2018 datafile
tweets2 <- read.csv("/Users/meredithpowers/Desktop/me2.csv", stringsAsFactors=FALSE)
head(tweets2$x)
[1] "RT @Youneedtowork: Why don't feminists &amp; co. not get annoyed about this treatment of a female protestor in France today standing for her ri…"
[2] "@Keith_Murray_ #metoo"                                                                                                                           
[3] "RT @FarrahTomazin: Shame, secrecy and a disparity of power could mean the number of women sexually abused by clergy is four times the figur…"    
[4] "RT @SwampysGhost: #MeToo That's why we have to arrest cabal on other charges. Child rape sentences are PATHETICALLY WEAK!! Sheeple may neve…"    
[5] "Poem from a survivor: TRIGGER WARNING sexual assault\n\n#ngocswny #16daysofactivism #hearmetoo #orangetheworld… https://t.co/nRY0QO5ZWj"         
[6] "RT @bud_cann: Seriously?...a professor and proponent of the #MeToo movement critiqued the story of the Virgin Mary suggesting that she did…"     

Text Analysis

What words were most associated with the MeToo movement on Twitter in late 2017?

Who gets the most @ mentions or retweets with #MeToo?

library(tm)
library(wordcloud)
library(stringi)
library(wesanderson)
library(dplyr) #provides the %>% pipe used below
#clean up the data and create a corpus
#iconv is vectorized; characters with no ASCII equivalent (e.g., emoji) are dropped
metootweets$text <- iconv(metootweets$text, "latin1", "ASCII", sub="")
cloud <- Corpus(VectorSource(metootweets$text))
#strip punctuation, extra whitespace, and digits, then lowercase and drop stopwords
cloud <- cloud %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords('english')) %>%
  tm_map(removeWords, c('amp','metoo'))
#create wordcloud of the 50 most frequent remaining words
wordcloud(cloud, max.words = 50, scale = c(3, 1), colors = brewer.pal(4, "Dark2"), random.color = TRUE, random.order = FALSE)

What about late 2018?

Are the most common words, @ mentions, and retweeted people the same or different?

#clean up the data and create a corpus
#iconv is vectorized; characters with no ASCII equivalent (e.g., emoji) are dropped
tweets2$x <- iconv(tweets2$x, "latin1", "ASCII", sub="")
cloud2 <- Corpus(VectorSource(tweets2$x))
#same cleaning steps as for the 2017 corpus, so the two clouds are comparable
cloud2 <- cloud2 %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords('english')) %>%
  tm_map(removeWords, c('amp','metoo'))
#create wordcloud of the 50 most frequent remaining words
wordcloud(cloud2, max.words = 50, scale = c(3, 1), colors = brewer.pal(4, "Paired"), random.color = TRUE, random.order = FALSE)

One of the more striking differences between 2017 and 2018 is the frequency of retweeting or tweeting @ specific usernames. In late 2017, an overwhelming number of tweets were directed at or about activist Tarana Burke, actor Alyssa Milano, songwriter/activist Lauren Jauregui, Senator Kirsten Gillibrand, and President Donald Trump. The tweets including Burke and Milano were probably part of the initial onslaught of responses; a little digging indicates that Jauregui has been an outspoken feminist activist; and Senator Gillibrand served as the political face of the #MeToo movement for a time. The inclusion of “realdonaldtrump”, “trump”, “trumps”, and “president” among the commonly tweeted words likely stems from the allegations of sexual harassment and assault against Trump. Notably, most of the early individual mentions are absent from the late 2018 text analysis. The conversation seems to have become more general, with words like “colleagues” and “dinners”. The increased use of the word “female” may also indicate a subtle shift in how Twitter users refer to women. It’s interesting that there are no direct mentions of Justice Brett Kavanaugh in the top 50 words, although words like “congress” and “talbertswan” may be related. (Bishop Talbert Swan, aka @talbertswan, was a vocal critic of Trump and of Republican Christians who support Trump for political reasons; Twitter suspended his account in late August 2018 for what it called “hateful conduct” [9].) Other words, like “avoid” and “fake”, point to a potential #MeToo backlash.
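Because the cleaning step above strips punctuation before counting, a handle such as @talbertswan is counted as an ordinary word. One way to tally mentions directly is to extract them before any cleaning. The following is a minimal base-R sketch under stated assumptions: `toy_tweets` and the handles in it are made up for illustration, and the pattern assumes handles consist only of letters, digits, and underscores.

```r
# Made-up tweets for illustration only; not taken from either dataset.
toy_tweets <- c(
  "RT @example_one: thank you @Example_Two #MeToo",
  "@example_two we hear you #metoo",
  "No handles in this one #MeToo"
)

# Pull out every @handle (letters, digits, underscores) from each tweet.
handle_pattern <- "@[A-Za-z0-9_]+"
handles <- unlist(regmatches(toy_tweets, gregexpr(handle_pattern, toy_tweets)))

# Fold case so that @Example_Two and @example_two are counted together.
mention_counts <- sort(table(tolower(handles)), decreasing = TRUE)
print(mention_counts)
```

On the real data, the same extraction could be run on `metootweets$text` or `tweets2$x` before the `iconv` and `tm_map` steps. Note that the handle in a leading "RT @user:" is counted too, so retweets and direct mentions are not distinguished here.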

Sentiment Analysis

To analyze the overall tone and content of the #MeToo tweets, each tweet is split into individual words, one word per row, and the words are joined with a sentiment lexicon. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were done manually through crowdsourcing. (See https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm for more.)
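To make the tokenize-then-join step concrete, here is a small base-R sketch of the same idea on toy data. It is an illustration only: `toy_tweets` and `toy_lexicon` are made up, the lexicon entries are not actual NRC annotations, and `strsplit` plus `merge` stand in for the tidytext and dplyr functions used in the analysis.

```r
# Toy inputs, invented for illustration; the lexicon rows are NOT real NRC entries.
toy_tweets <- data.frame(
  id = 1:2,
  text = c("I am angry and afraid", "We trust survivors"),
  stringsAsFactors = FALSE
)
toy_lexicon <- data.frame(
  word = c("angry", "afraid", "trust", "angry"),
  sentiment = c("anger", "fear", "trust", "negative"),
  stringsAsFactors = FALSE
)

# Tokenize: lowercase, then split on anything that is not a letter or apostrophe.
tokens_list <- strsplit(tolower(toy_tweets$text), "[^a-z']+")
toy_tokens <- data.frame(
  id = rep(toy_tweets$id, lengths(tokens_list)),
  word = unlist(tokens_list),
  stringsAsFactors = FALSE
)
toy_tokens <- toy_tokens[toy_tokens$word != "", ]

# Inner join: words missing from the lexicon drop out, and a word with
# two labels (here "angry") contributes one row per label.
toy_joined <- merge(toy_tokens, toy_lexicon, by = "word")
toy_counts <- table(toy_joined$sentiment)
print(toy_counts)
```

Two properties of the join matter when reading the plots that follow: words absent from the lexicon contribute nothing, and a single word can be counted under several categories at once.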

Late 2017 #MeToo Sentiment

library(tidyverse)
library(tidytext)
library(glue)
library(stringr)
library(viridis)
library(dplyr)
#drop non-ASCII characters, then split each tweet into one word per row
metootweets$text <- iconv(metootweets$text, 'latin1', 'ASCII', sub="")
metoo_sentiment <- metootweets %>%
  unnest_tokens(word, text)
#count the tokens that fall under each NRC category and plot the totals
metoo_sentiment_freq <- metoo_sentiment %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  dplyr::count(sentiment, sort = TRUE) %>%
  mutate(sentiment = reorder(sentiment, n)) %>%
  ggplot(aes(sentiment, n, fill=sentiment)) +
  geom_col(color='white') + #geom_col already plots the values as given, so no stat argument is needed
  theme(axis.text.y=element_blank()) +
  labs(x='Sentiment', y='Frequency') +
  scale_fill_viridis(discrete=TRUE, option = "C") +
  theme(text = element_text(size = 15))
metoo_sentiment_freq

Late 2018 #MeToo Sentiment

#drop non-ASCII characters, then split each tweet into one word per row
tweets2$x <- iconv(tweets2$x, 'latin1', 'ASCII', sub="")
tweets2_sentiment <- tweets2 %>%
  unnest_tokens(word, x)
#count the tokens that fall under each NRC category and plot the totals
tweets2_sentiment_freq <- tweets2_sentiment %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  dplyr::count(sentiment, sort = TRUE) %>%
  mutate(sentiment = reorder(sentiment, n)) %>%
  ggplot(aes(sentiment, n, fill=sentiment)) +
  geom_col(color='white') + #geom_col already plots the values as given, so no stat argument is needed
  theme(axis.text.y=element_blank()) +
  labs(x='Sentiment', y='Frequency') +
  scale_fill_viridis(discrete=TRUE, option = "magma") +
  theme(text = element_text(size = 15))
tweets2_sentiment_freq

The clearest shift in sentiment from 2017 to 2018 seems to be the stark decrease in surprise. While disgust, sadness, anticipation, and trust seem to hold steady in their proportional share of the overall sentiment, fear and anger seem to switch places.
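Since the two samples differ greatly in size (390,000 tweets versus 10,000), raw frequencies are not directly comparable across years; each category's share of its year's total is. The sketch below shows one way to compute and compare those shares in base R. The counts are placeholder numbers chosen for illustration, not results from either dataset, and only three of the ten categories are shown.

```r
# Placeholder counts for illustration only; NOT the values behind the plots above.
counts_2017 <- c(anger = 300, fear = 200, surprise = 100)
counts_2018 <- c(anger = 20, fear = 30, surprise = 5)

# Convert each year's counts into shares of that year's total.
share_2017 <- prop.table(counts_2017)
share_2018 <- prop.table(counts_2018)

# Change in share per category, matched by name rather than by position.
shift <- share_2018[names(share_2017)] - share_2017
print(round(shift, 3))
```

Because each year's shares sum to one, the shifts sum to zero: a gain in one category's share is always offset by losses elsewhere, which is worth keeping in mind when describing one emotion as having "increased".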

Frequency of 2017 Most-Tweeted #MeToo Words by Sentiment

library(ggthemes) #provides theme_calc()
#find the ten most frequent words within each NRC category and plot them by category
metoo_sentiment_freq2 <- metoo_sentiment %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  dplyr::count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~sentiment, scales='free_y', nrow=3) +
  labs(y = NULL, x = NULL) +
  coord_flip() +
  theme_calc() +
  scale_fill_viridis(discrete=TRUE, option = "C") +
  theme(text = element_text(size=10))
metoo_sentiment_freq2

Frequency of 2018 Most-Tweeted #MeToo Words by Sentiment

#find the ten most frequent words within each NRC category and plot them by category
tweets2_sentiment_freq2 <- tweets2_sentiment %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  dplyr::count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~sentiment, scales='free_y', nrow=3) +
  labs(y = NULL, x = NULL) +
  coord_flip() +
  theme_calc() +
  scale_fill_viridis(discrete=TRUE, option = "magma") +
  theme(text = element_text(size=10))
tweets2_sentiment_freq2

The proportion of positively coded words is a little surprising, although it could be explained by the overall tone of #MeToo, which has been one of empowerment for victims of sexual assault. However, the breakdown of words under each sentiment category somewhat undermines the validity of using the sentiment lexicon on this topic.

Discussion

It is possible to study societal shifts in how we discuss sexual assault and harassment using publicly archived or harvested Twitter data; likewise, it is possible to analyze the overall tone and content of tweets using the tidytext ecosystem.

Comparing some of #MeToo’s more common associated words in 2017 versus 2018 suggests that the initial outpouring of personal tweets at high-profile individuals has subsided and that the Twitter conversation about sexual harassment has shifted to more general discussions. The emotional sentiment of the language in 2017 and 2018 is similar, although anger and surprise seem to have receded somewhat, while fear seems to have increased over the past year.

On the other hand, sentiment lexicons have some fundamental drawbacks, and individual judgement is needed to provide context when interpreting results. Words like “sex” and “female” are coded positively in the NRC lexicon (and further exploration of the Bing and AFINN lexicons suggests they have similar classifications), although neither word necessarily has positive connotations in the context of sexual harassment against women. Another example is the word “professor”, which appears in the trust category; some tweets describe sexual assault by professors, which points to an abuse or lack of trust rather than an overall feeling of trust. Furthermore, the words “black” and “white” appear in the negative and positive categories respectively; without knowing why the lexicon classes these words this way, it’s probably best to treat them as neutral in an overall sentiment analysis. In the context of the #MeToo tweets, “black” and “white” could refer to race or to a common phrase like “black and white thinking”, but it is unclear what sense the annotators had in mind in the original NRC crowdsourcing.
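One simple way to act on this caveat is to drop the context-dependent words from the lexicon before joining it to the tweets, so that they count toward no category. The base-R sketch below illustrates the filtering step. `toy_lexicon` is a made-up stand-in: its first four rows follow the codings described above, the last two are hypothetical, and `domain_neutral` is an analyst's choice rather than anything supplied by NRC.

```r
# Toy stand-in for a sentiment lexicon. The first four rows mirror the codings
# discussed in the text; "afraid" and "hope" are hypothetical filler entries.
toy_lexicon <- data.frame(
  word = c("female", "professor", "black", "white", "afraid", "hope"),
  sentiment = c("positive", "trust", "negative", "positive", "fear", "positive"),
  stringsAsFactors = FALSE
)

# Words judged too context-dependent for this topic (an analyst's decision).
domain_neutral <- c("sex", "female", "professor", "black", "white")

# Keep only lexicon rows whose word is not on the neutral list.
filtered_lexicon <- toy_lexicon[!toy_lexicon$word %in% domain_neutral, ]
print(filtered_lexicon)
```

With tidytext, the equivalent would be to filter the output of `get_sentiments("nrc")` on `word` before the `inner_join`. The list of neutral words is a judgement call and should be reported alongside the results.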

Future Directions

Tweets are one way to understand the international conversation about sexual harassment; scholarly journals are another way to study how we view the phenomenon. The following is an initial exploration of when and how sexual harassment is studied.

PLoS Textual Analysis of Sexual Harassment

library(rplos)
library(ggplot2)
plot_throughtime(terms = '"sexual harassment"', limit = 800) + geom_line(size=2, color='#cc0066')

arXiv Textual Analysis of Sexual Harassment

When were papers related to sexual harassment submitted to arXiv?

#retrieve articles
library(aRxiv)
metoo_articles <- arxiv_search(query = '"sexual harassment"', limit = 800)
#clean up dates
library(lubridate)
metoo_articles <- metoo_articles %>%
  mutate(submitted = ymd_hms(submitted), updated = ymd_hms(updated))
#when were these submitted?
xtabs(~ year(submitted), data = metoo_articles)
year(submitted)
2013 2017 2018 
   1    1    4 

What fields were they submitted under?

#what fields were they submitted in?
metoo_articles  %>%
  mutate(field = str_extract(primary_category, "^[a-z,-]+"))  %>%
  mosaic::tally(x = ~field)  %>%
  sort()
cs 
 6 
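The field is taken from the start of each paper's primary category, which on arXiv has the form archive.subject (for example, cs.SI). The base-R sketch below shows that extraction on a few category strings chosen for illustration; they are not the six results above. It assumes the field is the leading run of lowercase letters and hyphens; the pattern used above also admits commas, which this sketch leaves out.

```r
# Example category strings in arXiv's "archive.subject" style, chosen for
# illustration; they are not the categories of the six papers retrieved above.
toy_categories <- c("cs.SI", "cs.CL", "stat.ML", "astro-ph.GA")

# Keep the leading run of lowercase letters and hyphens (the part before the dot).
toy_fields <- regmatches(toy_categories, regexpr("^[a-z-]+", toy_categories))

# Tally papers per field.
field_counts <- table(toy_fields)
print(field_counts)
```

Note that `regmatches` silently drops elements with no match, so a category that does not start with a lowercase letter would disappear from the tally rather than show up as missing.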

Given that there are only six articles, what are their titles?

#since there are only six in total, let's look at each title to determine the specific focus of each
head(metoo_articles$title)
[1] "Online Social Networks: Threats and Solutions"                                                                                                                     
[2] "Forensic Investigation of Social Media and Instant Messaging Services in\n  Firefox OS: Facebook, Twitter, Google+, Telegram, OpenWapp and Line as Case\n  Studies"
[3] "A Quality Type-aware Annotated Corpus and Lexicon for Harassment\n  Research"                                                                                      
[4] "It Takes Two to #MeToo - Using Enclaves to Build Autonomous Trusted\n  Systems"                                                                                    
[5] "SafeCity: Understanding Diverse Forms of Sexual Harassment Personal\n  Stories"                                                                                    
[6] "SATE: Robust and Private Allegation Escrows"                                                                                                                       

While both of these datasets are too small to be particularly valuable, they do point to trends. The PLoS timeline indicates very little interest in sexual harassment as a topic of study prior to 2012; the number of articles is still small in 2018, but it has been increasing steadily. Likewise, the arXiv data point to a potentially open and burgeoning area of research in Computer Science: studying harassment by analyzing textual data.

References

[1] Anderson, M., & Toor, S. (2018, October 11). How social media users have discussed sexual harassment since #MeToo went viral. Retrieved from http://www.pewresearch.org/fact-tank/2018/10/11/how-social-media-users-have-discussed-sexual-harassment-since-metoo-went-viral/

[2] Georgetown University. (2018, September 10). #MeToo Movement Twitter Data Mined by Computer Science Professor. Retrieved from https://www.georgetown.edu/news/metoo-movement-twitter-data-mined-by-computer-science-professor

[3] Kunst, J. R., Bailey, A., Prendergast, C., & Gundersen, A. (2018). Sexism, rape myths and feminist identification explain gender differences in attitudes toward the #metoo social media campaign in two countries. Media Psychology, 1-26.

[4] Manikonda, L., Beigi, G., Liu, H., & Kambhampati, S. (2018). Twitter for Sparking a Movement, Reddit for Sharing the Moment: #metoo through the Lens of Social Media. arXiv preprint arXiv:1803.08022.

[5] Parker, K. (2018, December 6). Women starting to suffer the #MeToo backlash. Retrieved from https://www.postandcourier.com/opinion/commentary/women-starting-to-suffer-the-metoo-backlash/article_fc981576-f97d-11e8-8207-97a4ed0ae3d3.html

[6] Roberts, D. (2018, September 12). What so many men are missing about #MeToo. Retrieved from https://www.vox.com/2018/9/10/17826168/me-too-louis-ck-men-comeback

[7] Tan, G., & Porzecanski, K. (2018, December 3). Wall Street Rule for the #MeToo Era: Avoid Women at All Cost. Retrieved from https://www.bloomberg.com/news/articles/2018-12-03/a-wall-street-rule-for-the-metoo-era-avoid-women-at-all-cost

[8] The Economist. (2018, October 20). Measuring the #MeToo backlash. Retrieved from https://www.economist.com/united-states/2018/10/20/measuring-the-metoo-backlash

[9] Banks, A. (2018, August 28). Black bishop says Twitter suspended him for “hateful conduct.” Retrieved from https://religionnews.com/2018/08/28/black-bishop-and-trump-critic-says-twitter-suspended-account-for-hateful-conduct/
