Can a TV Show change the sentiment around the controversial con-artist, Anna Sorokin?

Hypothesis

Anna Sorokin (aka Anna Delvey) is probably one of the most (in)famous fraudsters of the 21st century, partly due to the highly successful Netflix miniseries Inventing Anna, which aired in February 2022. In case you have missed the recent hustle and bustle around the TV show, let me briefly introduce Anna:

Anna Sorokin (Russian: Анна Сорокина; born January 23, 1991) is a Russian-born German con artist and fraudster. Between 2013 and 2017, Sorokin pretended to be a wealthy German heiress under the name Anna Delvey. In 2017, she was arrested after defrauding or intentionally deceiving major financial institutions, banks, hotels, and acquaintances in the United States for a total of $275,000. In 2019, Sorokin was convicted in a New York state court of attempted grand larceny, larceny in the second degree, and theft of services, and was sentenced to 4 to 12 years in prison. source: Wikipedia

So, as you can see, Anna is a rather controversial character. But let’s get back to my hypothesis now!

Anna’s story was first published in Vanity Fair back on 13 April 2018, but became really famous when Inventing Anna was released on Netflix on 11 February 2022. My hypothesis is that the launch of the miniseries changed the sentiment around Anna from negative to more positive by romanticizing the con-artist villain and sharing her narrative with millions of Netflix subscribers. To check whether her reputation shifted from criminal to hero, I’ll analyze the sentiment of Reddit comments from before and after the show was launched and see how it changed.

Getting the data

To get the data, I used the R package RedditExtractoR, which enables us - among many other exciting things - to find URLs of Reddit threads of interest. There are 2 available search strategies: by keywords and by home page. As my aim was to find threads where Anna was mentioned, I went for the keyword option and used anna delvey and anna sorokin for my queries.

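# Packages providing the functions used below
library(RedditExtractoR)  # find_thread_urls(), get_thread_content()
library(tidyverse)        # readr, dplyr, tidyr, ggplot2
library(lubridate)        # floor_date()
library(tidytext)         # unnest_tokens(), get_sentiments(), bind_tf_idf(), stop_words
library(textstem)         # lemmatize_words()
library(gridExtra)        # grid.arrange()
library(reshape2)         # acast()
library(wordcloud)        # comparison.cloud()

# Note: mypath, col and col2 are assumed to be defined in a setup chunk that is not shown here

# look for keyword "anna sorokin"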
#urls1 <- find_thread_urls(
#keywords = "anna sorokin",
#sort_by = "top",
#subreddit = NA,
#period = "all"
#)

# look for keyword "anna delvey"
#urls2 <- find_thread_urls(
#keywords = "anna delvey",
#sort_by = "top",
#subreddit = NA,
#period = "all"
#)

#urls <- rbind(urls1,urls2)

To get the content of the threads, I used the get_thread_content() function, which gave me (1) a data frame containing metadata describing each thread and (2) a data frame with the comments found in all threads. As I was mainly interested in the sentiment of comments, I continued my analysis using the second data frame.

Taking a closer look at the resulting data frame (df), we can see that it contains 10 variables:
- url: URL of the comment
- author: Reddit user who wrote the comment
- date: date when the comment was written
- timestamp: the exact time when the comment was written
- score: equals the number of upvotes minus the number of downvotes
- upvotes: the number of upvotes
- downvotes: the number of downvotes
- golds: the number of golden awards received
- comment: the text of the comment
- comment_id: the id of the comment within the thread

#Get thread content
#thread_content <- get_thread_content(urls$url)

#Store thread metadata in a separate object
#thread_content_anna <- thread_content$threads

#Store comments in a separate df
#df <- thread_content$comments

#write_csv(df, paste0(mypath,"raw_data.csv"))
df <- read_csv(paste0(mypath,"raw_data.csv"))

As I wanted to compare the sentiment of comments from before and after Inventing Anna was introduced on Netflix, I split the data frame into two parts: (1) before the show was released and (2) after the show was released. I labelled both of them accordingly (added “pre” for comments before the premiere and “post” for comments after the premiere), and re-joined the two data frames.

#min(df$date)
#the earliest comment is from 2013-03-28

#max(df$date)
#the latest comment is from 2022-05-08

# store comments from before Inventing Anna 
before_ia <- df %>% 
  filter(date < "2022-02-11")

#label the dataframe
before_ia <- before_ia %>% mutate(label = "pre")

# store comments after Inventing Anna
after_ia <- df %>% 
  filter(date >= "2022-02-11")

after_ia <- after_ia %>% mutate(label = "post")

# rejoin the two dataframes
anna_df <- rbind(before_ia, after_ia)

# make sure the date column is stored as a Date
anna_df$date <- as.Date(anna_df$date)

Data cleaning

As the first step of cleaning, I converted deleted and removed comments to missing values (there were 347 such rows in total). After dropping rows with missing values, I got rid of columns irrelevant for the analysis and of duplicated rows. Last but not least, I generated unique ids for each comment and narrowed down the comments to the period between 13 April 2018, when the first Vanity Fair article was published, and 8 May 2022, when the data was retrieved.

### Step 1: Get rid of missing values & duplicates

# change deleted/removed comments to missing
anna_df$comment <- gsub("\\[deleted\\]|\\[removed\\]", "", anna_df$comment)

# drop rows where comment was removed/deleted -> we had 347 such rows
anna_df <- anna_df %>% filter(comment != "")

# drop columns that are irrelevant for the analysis
anna_df <- anna_df %>%  select(- c(url, author, timestamp, comment_id))

#check duplicates
#sum(duplicated(anna_df)) # we had 511 duplicated rows

#drop duplicates
anna_df <- anna_df %>% distinct(.keep_all=T)

# We have 10,544 observations left

#add unique id to each comment
anna_df <- anna_df %>% mutate(id = row_number())

# keep only comments from 13 April 2018 onwards, the date the first article was published
anna_df <- anna_df %>% 
  filter(date >= "2018-04-13")

Text preprocessing

In order to prepare the text for further analysis, I cleaned the comments of all the “junk”, such as URLs or user mentions, and set the text to lowercase. To work with the data as a tidy data set, I had to restructure it in the one-token-per-row format with the help of the unnest_tokens() function and used lemmatization to get the base form of the tokens. As the last step of text preprocessing, I filtered out stopwords as well as words that consist of three or fewer characters to keep only the meaningful tokens without very common words.

### Step 2 - Text preprocessing

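# Replace @mentions, URLs, non-word characters, single letters and digits with spaces
# (\\S is used for the mention because [^\\s] is not read as "non-whitespace" by R's default regex engine)
anna_df$comment <- gsub("@\\S+|http\\S+|\\W|\\s+[a-zA-Z]\\s+|\\d+|\\s+", " ", anna_df$comment)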

# Lower the text
anna_df$comment <- tolower(anna_df$comment)

# Unnest tokens
anna_tidy <- unnest_tokens(anna_df, word, comment)

#Lemmatize tokens
anna_tidy$word <- lemmatize_words(anna_tidy$word)

# Remove stopwords and words that are 3 characters long or shorter
data(stop_words)

# Add some extra stopwords and store in dataframe
extra_stopwords <- c("don", "didn", "doesn", "gt","isn", "cc" , "lot")
extra_stopwords_df <- data.frame(word = extra_stopwords, lexicon = "extra_stopwords")
                                
stop_words <-rbind (stop_words, extra_stopwords_df)                                 

# Anti join stopwords, then keep only words longer than 3 characters
anna_tidy <- anna_tidy %>%
  anti_join(stop_words, by = "word") %>% 
  filter(nchar(word) > 3)

Exploratory data analysis

To get familiar with the data, I ran a couple of basic analyses on the comments. First, I checked the distribution of labels: out of the 10,230 comments included, 4,665 were written before and slightly more, 5,565, after the show was released on 11 February 2022. Then I looked at how the comments were distributed over time. Although I first included all dates from 2013 to 2022, I quickly realized that there were very few submissions before 2018, so I decided to exclude them from the visualization. In line with my expectations, there were relatively few comments on Anna Sorokin before the Netflix show; however, we can see a huge rise right before the premiere - probably due to the media campaign that prepared the audience for the upcoming new series. The number of comments dropped after the show was officially out, but remained considerably higher than in the period before.

#distribution of labels
#table(anna_df$label)

#distribution of comments
p1 <- anna_df %>% 
  group_by(month = lubridate::floor_date(date, "month")) %>% 
  summarize(comment_ct = n()) %>% 
  ggplot(aes(month, comment_ct)) +
  geom_col(fill = col) +  # geom_col() already uses stat = "identity"
  labs(title = "Distribution of comments between 2018 and 2022", y = "Nr of comments", x = "Date") +
  geom_vline(xintercept = as.numeric(as.Date("2022-02-11")), linetype = 4, color = "black") +
  scale_x_date(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) +
  geom_label(aes(as.Date("2022-02-11"), 3000), label = "Premiere", show.legend = FALSE)

p1

To dig a little deeper, I calculated the length of the comments and checked its distribution on a histogram. If you take a look at the graph below, you can see that the length of comments (measured in the number of words) had a skewed distribution with a long right tail. While the median comment consisted of 22 words, the average word count was 40.84.

### Calculate some extra features, such as number of words per comment,  etc.

# Number of words per comment
anna_df$word_ct <- sapply(strsplit(anna_df$comment, " "), length)

# Check descriptive stats
#summary(anna_df$word_ct)

# Visualize distribution of word count (where word count < 500)
p2 <- anna_df %>% 
  filter(word_ct < 500) %>% 
  ggplot(aes(word_ct)) +
  geom_histogram(fill=col, bins = 100) +
  labs(title = "Distribution of word count", y = "Count", x = "Word count")

p2
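
One caveat on the word count: the cleaning step replaces punctuation, digits and URLs with spaces, so a cleaned comment can contain runs of consecutive spaces, and splitting on a single space counts the empty strings between them as words. The sketch below uses base R on two made-up comments to show the effect and one possible alternative. The count_words() helper is hypothetical and not part of the original analysis; the figures reported above were computed with the simpler split.

```r
# Two made-up comments as they might look after cleaning
comments <- c("she  scammed   the bank", " free   anna ")

# Splitting on a single space keeps the empty strings between repeated spaces
naive_ct <- sapply(strsplit(comments, " "), length)

# Hypothetical helper: trim, split on runs of whitespace, drop empty pieces
count_words <- function(x) {
  pieces <- strsplit(trimws(x), "\\s+")
  sapply(pieces, function(p) sum(nzchar(p)))
}

robust_ct <- count_words(comments)

# Compare the two counts side by side
data.frame(comment = comments, naive_ct, robust_ct)
```

On real data the two counts would differ most for comments with a lot of punctuation or links, so the shape of the histogram should stay similar while the median and mean shift down somewhat.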

Then I checked the ten most frequent words in the comments from before and after Inventing Anna was released. Interestingly, the name “Anna” didn’t make it into the top 10 before the show was introduced, but ranked first after the premiere. Not surprisingly, “money” was among the most frequent words in both cases, while “Rachel”, the name of one of Anna’s victims, who wrote the first article in Vanity Fair and played a major role in the show as well, appeared after the series was launched.

### Visualize the most frequent words from before/after Inventing Anna

p3 <- anna_tidy %>%
  filter(label == "pre") %>% 
  count(word, sort = TRUE) %>% 
  head(10) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + 
  geom_col(fill = col) + 
  xlab(NULL) + 
  coord_flip() +
  labs (title = "Top 10 most frequent words", subtitle = "Before Inventing Anna")

p4 <- anna_tidy %>%
  filter(label == "post") %>% 
  count(word, sort = TRUE) %>% 
  head(10) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + 
  geom_col(fill = col) + 
  xlab(NULL) + 
  coord_flip() +
  labs (title = "Top 10 most frequent words", subtitle = "After Inventing Anna")

grid.arrange(p3, p4, ncol=2)

Sentiment analysis

Sentiment distribution and ratios

To start my sentiment analysis, I first used the Bing lexicon, which categorizes words in a binary fashion into positive and negative categories, to see how the distribution and ratio of sentiments differed between the two time periods. As can be seen from the graphs below, while there were overall more sentiment words in comments written after the premiere, the ratio of positive sentiment words remained stable at around 35%. In other words, these first results suggest that my hypothesis might not be right.

#using Bing
sentiments_distribution <- anna_tidy %>% 
                           select(label, word) %>%
                           inner_join(get_sentiments("bing"), by = "word")  %>% 
                           ggplot() +
                           geom_bar(aes(sentiment, fill=sentiment), stat="count")  +
                           scale_fill_manual(values = c(col, col2)) +
                           labs(title = 'Sentiment distribution', y = "Count", x = "Sentiment") +
                           theme(legend.position = 'None') +
                           facet_wrap(~label) 

sentiment_ratios <- anna_tidy %>% select(label, word) %>%
                               inner_join(get_sentiments("bing"), by = "word")  %>% 
                               ggplot() +
                               geom_bar(aes(label, fill=sentiment), stat="count", position = "fill")  +
                               labs(title = 'Sentiment ratios', y = "Ratio", x = "Period") +
                               scale_fill_manual(values = c(col, col2)) 
  
grid.arrange(sentiments_distribution, sentiment_ratios, ncol=2)
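
To read the ratio off as a number rather than from the chart, the share of positive words per period can also be tabulated directly. The following is a minimal base R sketch on made-up labels, so the values are for illustration only and are not results from the Reddit data; toy is a hypothetical stand-in for the comments joined with the Bing lexicon.

```r
# Made-up word-level sentiment labels, for illustration only
toy <- data.frame(
  label     = c("pre", "pre", "pre", "pre", "post", "post", "post", "post", "post"),
  sentiment = c("negative", "negative", "positive", "negative",
                "negative", "positive", "negative", "negative", "positive")
)

# Cross-tabulate period by sentiment, then turn the counts into row-wise shares
counts <- table(toy$label, toy$sentiment)
shares <- prop.table(counts, margin = 1)

# Share of positive words in each period
positive_share <- shares[, "positive"]
round(positive_share, 2)
```

With the real data, the same two calls could be applied to the label and sentiment columns of the joined data frame.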

To get a closer look at how the sentiment changed over time, I calculated the proportion of positive/negative sentiment words for each month. As the graph below shows, the positive sentiment ratio didn’t go up after the premiere - in fact, it didn’t even reach the average positive ratio! This means that the data do not support my original hypothesis.

# positive ratio over time
temp <- anna_tidy %>% 
  select(date, label, word) %>%
  inner_join(get_sentiments("bing"), by = "word")  %>% 
  group_by(month = lubridate::floor_date(date, "month"), sentiment) %>% 
  summarize("sentiment_count"=n()) %>% 
  mutate("sentiment_prop"=sentiment_count/sum(sentiment_count)*100) %>% 
  filter(sentiment == "positive") 
  
mean_sentiment_prop_pos <- mean(temp$sentiment_prop)
  
p5 <- anna_tidy %>% 
  select(date, label, word) %>%
  inner_join(get_sentiments("bing"), by = "word")  %>% 
  group_by(month = lubridate::floor_date(date, "month"), sentiment) %>% 
  summarize("sentiment_count"=n()) %>% 
  mutate("sentiment_prop"=sentiment_count/sum(sentiment_count)*100) %>% 
  ggplot(aes(month, sentiment_prop)) +
  geom_bar(aes(fill = sentiment), stat="identity", position = "fill") +
  geom_vline(xintercept = as.numeric(as.Date("2022-02-11")), linetype = 4, color = "black") +
  geom_hline(yintercept = mean_sentiment_prop_pos/100, linetype = 4, color = "black") +
  scale_x_date(date_breaks = "6 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) +
  labs(title = "Sentiment ratios over time", x ="", y= "Ratio") +  # position = "fill" plots shares between 0 and 1
  scale_fill_manual(values = c(col, col2)) +
  geom_label(aes(as.Date("2022-02-11"), 1.00), label = "Premiere", show.legend = FALSE) +
  geom_label(aes(as.Date("2018-07-20"), mean_sentiment_prop_pos/100), label = "Avg positive rate", show.legend = FALSE)

p5

Sentiment contributors

To get a deeper understanding of which words were the biggest contributors to the two sentiments, I created a visualization of the top 10 positive and negative contributors. Looking at the list of negative contributors, we can see words like “scam”, “prison”, “crime”, “steal” and “fake” - all of which describe the factual aspects of Anna’s “career”. However, on the positive side we can find words like “enjoy”, “free”, “amaze” and “fascinate” - all of which fit the heroic narrative of our Anna, who successfully scammed her way through New York society and lived the life of the rich and famous for a little while.

anna_word_counts <- anna_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

p6 <- anna_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL) +
  scale_fill_manual(values = c(col, col2))

p6

Sentiment categories

To understand how different sentiment categories changed before and after Inventing Anna, I used the NRC lexicon to put the words into 10 distinct categories. In line with the outcome of the sentiment analysis with the Bing lexicon, the ratio of positive sentiment remained approximately the same; however, there were some minor changes in other categories. Rather positive categories such as trust, anticipation, and surprise all decreased by 1 percentage point, while rather negative categories like anger, fear, sadness and disgust increased by 1 percentage point. Well, at this point I guess I have to admit that, if anything, the data point in the opposite direction of my hypothesis, although the differences are small!

sentiment_cat<- anna_tidy %>% 
  inner_join(get_sentiments('nrc')) %>% 
  group_by(label, sentiment) %>% 
  summarize("sentiment_count"=n()) %>% 
  mutate("sentiment_prop"=sentiment_count/sum(sentiment_count))
  

p7 <- ggplot(sentiment_cat, aes(x=reorder(sentiment, sentiment_prop), y = sentiment_prop, fill=sentiment)) +
  geom_col(show.legend=F) + coord_flip() +
  geom_text(aes(label= round(sentiment_prop,2), hjust=1.15)) +
  labs(title = "Ratio of sentiment categories", y = "Ratio", x = "Sentiment category") +
  facet_wrap(~label) 

p7

TF-IDF

In the next step of my analysis, I fitted a tf-idf model to the cleaned text to understand which keywords had the highest importance in each period. In the case of comments written after the release of the series, the names of leading characters (such as Vivian the journalist or Kacy, Anna’s personal trainer in the show), a reference to Julia Garner, the actress who played Anna (Ruth Langmore is her character in Ozark), and some series-specific words like cast, storyline and expression appeared in the top 10. Surprisingly, the name “Simon” was also among the most important terms - although there was no actor or actress with this name in Inventing Anna, it probably refers to Simon Leviev, the fake Israeli billionaire from The Tinder Swindler, another Netflix hit released in early 2022.

Interestingly, the list of most important words from before Inventing Anna seems unrelated to Anna Sorokin’s story at first. However, one of Simon’s victims from The Tinder Swindler returned to Amsterdam, and Conrad was in fact the name of one of the businessmen in Inventing Anna, while subscriber probably refers to Netflix subscribers. Nevertheless, further analysis would be needed to properly understand how seemingly unrelated words, such as nude or nudity, which did not play a big role in the show, made it to the list of most important terms.

# TFIDF

anna_words <- anna_tidy %>%
  count(label, word, sort = TRUE) %>%
  ungroup() %>% 
  bind_tf_idf(word, label, n) %>% 
  filter(nchar(word)>3)

p8 <- anna_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(label) %>% 
  top_n(10) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = label)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~label, ncol = 2, scales = "free") +
  coord_flip() +
  scale_fill_manual(values = c(col, col2))

p8
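
A note on reading these results: with only two “documents” (pre and post), a word that occurs in both periods gets an idf of zero, so only words exclusive to one period can reach the top of the list, even if they are rare. That may be part of why some of the top terms look unrelated. The sketch below works through this in base R on made-up counts, assuming the usual definition tf × ln(N / df), which is my understanding of what bind_tf_idf() computes; toy_counts and its values are hypothetical.

```r
# Made-up word counts for the two "documents" (periods)
toy_counts <- data.frame(
  label = c("pre", "pre", "pre", "post", "post", "post"),
  word  = c("money", "scam", "conrad", "money", "scam", "vivian"),
  n     = c(50, 30, 2, 80, 40, 5)
)

n_docs <- length(unique(toy_counts$label))

# Term frequency: share of each word within its own period
toy_counts$tf <- toy_counts$n / ave(toy_counts$n, toy_counts$label, FUN = sum)

# Document frequency: in how many periods does the word occur?
doc_freq <- tapply(toy_counts$label, toy_counts$word, function(x) length(unique(x)))

# Inverse document frequency with the natural log (assumed definition)
toy_counts$idf <- log(n_docs / as.numeric(doc_freq[toy_counts$word]))

# Words shared by both periods end up with a tf-idf of zero
toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf
toy_counts[order(-toy_counts$tf_idf), ]
```

Splitting the comments into more documents, for example one per month or per thread, would give idf more than two possible values and might produce a more informative ranking.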

Bigrams

I also extended the analysis to bigrams to check the most common word pairs in the comments. By doing so, I realized that “tinder swindler” and “elizabeth holmes” were among the most frequent bigrams - both of them referring to other fraudsters. (Elizabeth Holmes might sound familiar from the Theranos scandal or from the book “Bad Blood” written by John Carreyrou.) Besides this, most bigrams refer to main characters or real-life figures of Anna’s story, or to terms that are closely related to Anna’s actions, such as “rich people” or “credit card”.

anna_bigrams <- anna_df %>%
  unnest_tokens(bigram, comment, token = "ngrams", n = 2)

# Before premier
anna_bigrams_before <- anna_bigrams %>% 
                       filter(date < "2022-02-11")

bigrams_separated_bef <- anna_bigrams_before %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered_bef <- bigrams_separated_bef %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  filter(nchar(word1) >3) %>% 
  filter(nchar(word2)>3) 

# new bigram counts:
bigram_counts_bef <- bigrams_filtered_bef %>% 
  count(word1, word2, sort = TRUE)

# After premier
anna_bigrams_after <- anna_bigrams %>% 
                       filter(date >= "2022-02-11")

bigrams_separated_after <- anna_bigrams_after %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered_after <- bigrams_separated_after %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  filter(nchar(word1) >3) %>% 
  filter(nchar(word2)>3) 

# new bigram counts:
bigram_counts_after <- bigrams_filtered_after %>% 
  count(word1, word2, sort = TRUE)

# visualization
p9 <- bigram_counts_bef[-1,] %>% 
      mutate(bigram = paste0(word1," ",word2)) %>% 
      arrange(-n) %>% 
      head(10) %>% 
      ggplot(aes(reorder(bigram, -n),n)) +
      geom_col(fill = col) +
      coord_flip() +
      labs(title = "Top 10 bigrams", subtitle = "Before Inventing Anna", y = "", x = "")

p10 <- bigram_counts_after[-1,] %>% 
      mutate(bigram = paste0(word1," ",word2)) %>% 
      arrange(-n) %>% 
      head(10) %>% 
      ggplot(aes(reorder(bigram, -n),n)) +
      geom_col(fill = col) +
      coord_flip() +
      labs(title = "Top 10 bigrams", subtitle = "After Inventing Anna", y = "", x = "")

grid.arrange(p9, p10, ncol=2)

Wordcloud

Last but not least, I created a wordcloud to illustrate the most common positive and negative words from the comments. (I also tried to create two wordclouds for before and after the premiere, but the words were overlapping, so I ended up with one simple wordcloud with no differentiation between the two periods.) While on the negative side we can see words that showcase a criminal, the positive cloud displays a heroine.

# Wordcloud

# before premier
p11 <- anna_tidy %>%
  filter(label == "pre") %>% 
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c(col, col2),
                   max.words = 100)

Conclusion

While I failed to confirm my hypothesis that the Netflix show improved the sentiment around Anna Sorokin, I would conclude that Anna is a rather controversial character based on the comments of Reddit users, as the ratio of positive and negative sentiments stayed at approximately the average level throughout the period covered, from the release of the first article in Vanity Fair to 8 May 2022, when the comments were retrieved from Reddit.

If you want to find out more about the story and decide whether Anna was a criminal or a hero, I recommend conducting your own empirical research by watching the show. :)