“If you knew anything about telenovelas, you’d know that everything is supposed to be dramatic!”

— Rogelio De la Vega

Image credit: fanart.tv

Inspiration

This project was inspired by the work of Katie Segreti and Harry Anderson (who was himself inspired by Katie’s), which you can find here and here, respectively. The idea is to scrape data about a TV series from the web, analyze it, and produce some beautiful data visualizations with the material. I followed a DataCamp course on web scraping a few weeks ago, so this seemed like the perfect way to put into practice what I had learned. Moreover, as I have also been learning some text mining and sentiment analysis, I decided to add some of that to my own iteration of the project.

Jane the Virgin

The TV series I chose is Jane the Virgin, one of my favorites among the ones I watched in 2020 (yes, lockdown resulted in a lot of Netflix binge-watching - none of which I regret). Jane the Virgin aired on The CW in the US between 2014 and 2019. The show is set in Miami and centers on a family of three Latina women: Jane, her mother Xiomara and her grandmother Alba. It follows the unfolding of their lives after Jane (who is, you guessed it, a virgin) gets accidentally inseminated, during a gynecologist check-up, with the sperm of Rafael Solano, the owner of the hotel where she works. The plot is based on the Venezuelan telenovela “Juana la virgen” and showcases all the main telenovela tropes and drama, so much so that it can be considered a true celebration of the genre, albeit an ironic one.

In Jane you can find it all: love, death, drama, crime, drug lords, evil twins, fame, success and failure - a roller coaster of emotions with bone-rattling twists and a tense cliffhanger at the end of basically every episode.

But Jane is not only about its plot: what is actually most delightful about the show are its characters. Funny, deep and relatable, each and every one of them is modeled on a stereotype but layered with a complex, three-dimensional personality that is never completely “good” or “bad” and, most of all, never trivial. Each of them goes through their own unique journey of self-understanding and growth, so that the show is not only fun and entertaining but also deeply thought-provoking about all things life, relationships, career and family.

With a cast, production team and writers’ room made up mostly of women and Latinx people, Jane the Virgin gives us the chance to see truly diverse representation, and especially an authentic and non-stereotypical portrayal of the Latino community.

Web scraping

I scraped the data from two websites: IMDb and Wikipedia. From the former I got the episode ratings and summaries (which I will later use for text analysis), while from the latter I got the number of US viewers on the air date of each episode. Let’s start by loading all the packages needed for the project.

The following code scrapes the IMDb website and creates a first data frame with the data obtained. I used the html_nodes() function from the rvest package, which takes the CSS selector (or XPath) of the part of the page you want to extract. You can find it directly in your browser by right-clicking on the portion of the web page you are interested in and selecting Inspect.

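# Packages used by the scraping and cleaning code below
library(rvest)      # read_html(), html_nodes(), html_text(), html_attr()
library(dplyr)      # %>%, mutate()
library(tidyr)      # separate()
library(stringr)    # str_replace()
library(readr)      # parse_number()
library(lubridate)  # dmy()

jtv_episodes <- c()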
for(k in 1:5){
  url <- paste0("https://www.imdb.com/title/tt3566726/episodes?season=", k)
  jtv_html <- read_html(url)
  episode_number <- jtv_html %>%
    html_nodes('.image') %>%
    html_text()
  episode_airdate <- jtv_html %>%
    html_nodes('.airdate') %>%
    html_text()
  episode_title <- jtv_html %>%
    html_nodes('.info strong a') %>%
    html_attr("title")
  episode_rating <- jtv_html %>%
    html_nodes('.ipl-rating-star.small .ipl-rating-star__rating') %>%
    html_text()
  episode_votes <- jtv_html %>%
    html_nodes('.ipl-rating-star__total-votes') %>%
    html_text()
  episode_summary <- jtv_html %>%
    html_nodes('.item_description') %>%
    html_text()
  assign(paste("jtv_episodes", k, sep = "_"), cbind(episode_number, episode_airdate, episode_title, episode_rating, episode_votes, episode_summary)) 
}

# Collect episodes for each season in the same df
jtv_episodes <- rbind(jtv_episodes_1, jtv_episodes_2, jtv_episodes_3, jtv_episodes_4, jtv_episodes_5)
jtv_episodes <- as.data.frame(jtv_episodes)

# Clean the data frame (data type, add separate column for season and episode)
jtv_episodes <- jtv_episodes %>%
  mutate(episode_airdate = dmy(episode_airdate),
         episode_rating = as.numeric(as.character(episode_rating)),
         episode_votes = parse_number(as.character(episode_votes)),
         episode_summary = as.character(episode_summary),
         number = row_number())  # running index over all episodes (1 to 100)
jtv_episodes <- jtv_episodes %>%
  separate(col = episode_number, into = c("season", "episode"), sep = ",")

jtv_episodes <- jtv_episodes %>%
  mutate(episode_number = parse_number(episode),
         season = str_replace(season, "S", "Season "))

The next piece of code scrapes the Wikipedia page of each season. Here, rather than a CSS selector, I used XPath notation to refer to the elements of interest. You can get it by right-clicking on the HTML node in the browser inspector, selecting Copy and then Copy XPath. As I am interested in the viewers tables, I then added html_table(fill = TRUE) so that the data is stored directly in table format in my R session. For the web scraping code, I highly recommend following Harry Anderson’s Medium post linked in the introduction.
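As a rough illustration of the XPath approach, here is a minimal sketch that runs offline. The inline HTML is a hypothetical stand-in for a Wikipedia season page, and the table values are made up. The XPath expression is also just an assumption, since the one copied from the browser will look different. Recent rvest versions fill missing cells automatically, so fill = TRUE is left out here.

```r
library(rvest)

# Hypothetical stand-in for a Wikipedia season page (made-up values),
# so that the example does not need an internet connection.
page <- read_html('
  <html><body>
    <table class="wikitable">
      <tr><th>No.</th><th>Title</th><th>Viewers</th></tr>
      <tr><td>1</td><td>Episode A</td><td>1.5</td></tr>
      <tr><td>2</td><td>Episode B</td><td>1.2</td></tr>
    </table>
  </body></html>')

# Select the table with an XPath expression instead of a CSS selector
viewers_tables <- page %>%
  html_nodes(xpath = '//table[@class="wikitable"]') %>%
  html_table()

# html_table() returns a list with one data frame per matched table
viewers <- viewers_tables[[1]]
print(viewers)
```

On the real pages, the same pattern would be repeated once per season and the resulting tables stacked together.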

The next step is to join the two data sets, so as to have a single data frame for further analysis and for building the data visualizations.
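A minimal sketch of such a join, with made-up values. It assumes both tables share a running episode index called number. The viewers_millions column name is hypothetical, and the real key may differ.

```r
library(dplyr)

# Made-up values for illustration only
imdb_data <- data.frame(number = 1:3,
                        episode_rating = c(8.1, 7.9, 8.3))
wiki_data <- data.frame(number = 1:3,
                        viewers_millions = c(1.5, 1.2, 1.1))

# Keep every IMDb row and attach the matching Wikipedia viewers
jtv_joined <- left_join(imdb_data, wiki_data, by = "number")
print(jtv_joined)
```

A left join keeps all scraped episodes even if a viewers figure is missing, which makes gaps easy to spot as NA values.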

Data viz

At this point I am ready to make the first plots: episode rating and viewers per episode. As I really liked the style of Katie Segreti’s plot and had never used the ggthemes package, I thought this could be a good chance to try it out, so I kept a format very similar to hers.
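A plot of this kind could be sketched roughly as follows. The data is made up, and theme_fivethirtyeight() is just one of the themes shipped with ggthemes, picked here as an assumption. It is not necessarily the one used for the original figure.

```r
library(ggplot2)
library(ggthemes)

# Made-up ratings for illustration only
ratings <- data.frame(
  number = 1:6,
  season = rep(c("Season 1", "Season 2"), each = 3),
  episode_rating = c(8.0, 8.2, 8.4, 7.8, 8.0, 8.2)
)

# One point per episode, colored by season, with a ggthemes theme on top
rating_plot <- ggplot(ratings,
                      aes(x = number, y = episode_rating, color = season)) +
  geom_point() +
  geom_line() +
  labs(title = "IMDb rating per episode", x = "Episode", y = "Rating") +
  theme_fivethirtyeight()

print(rating_plot)
```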

From the plot we can see that the rating is pretty consistent throughout the whole show, with each season finale generally rated higher than the other episodes of the same season. We can also notice a decrease in rating in the last season. We can see this better by averaging the episode ratings by season and comparing them.
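The per-season average can be computed with a group_by() and summarise() step. This is a minimal sketch on made-up ratings, with the column names assumed to match the scraped data frame.

```r
library(dplyr)

# Made-up ratings for illustration only
ratings <- data.frame(
  season = rep(c("Season 1", "Season 2"), each = 3),
  episode_rating = c(8.0, 8.2, 8.4, 7.8, 8.0, 8.2)
)

# Average rating and number of episodes per season
season_avg <- ratings %>%
  group_by(season) %>%
  summarise(mean_rating = mean(episode_rating),
            n_episodes = n())

print(season_avg)
```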

It seems that, going from season 1 to 5, there is a small decline in ratings as well as in the number of votes - which, as we are about to see, actually correlates with the viewers on air date, and generally hints at a decrease in audience engagement over time.

The number of viewers on air date is almost halved going from the first to the last season. Did people get bored of the show? Well, I personally find this surprising, as I was hooked from episode one and was unable to put the show down (...but I also recognize that not everyone is an obsessive person). Of course the viewers on air date are only a small portion of the people who actually saw the show, and with Jane now being available on Netflix, it could be that the trend is not fully representative.

Text mining

Another piece of information I scraped from IMDb is the episode synopsis, which gives a brief summary of each episode: the main characters appearing and the main events. I decided to use this data first to get a glimpse of the most frequently mentioned characters (as done by Katie Segreti), and then to run a deeper analysis to understand the main themes recurring through the show as well as their sentiment.

Text analysis is a complex subject which is entirely dependent on the type and quality of your text data (well, like pretty much any analysis) - in this case, the summary is a rather brief and certainly incomplete source, and any conclusion must be taken for what it is.

I started by using the tidytext package, which stores text data in a tidy format, with one word per row, so that it can be analyzed using the various functions of the tidyverse. After obtaining a word variable with unnest_tokens(), I used the filter() function to retain the names of the show’s main characters and counted how often each one occurs, in order to obtain a chart of the most mentioned characters.
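In code, this tokenize-then-filter step might look like the sketch below. The summaries are made up, and the list of names is an assumption about which characters count as "main".

```r
library(dplyr)
library(tidytext)

# Made-up summaries for illustration only
summaries <- data.frame(
  season = c("Season 1", "Season 1", "Season 2"),
  episode_summary = c("Jane tells Rafael the news.",
                      "Rogelio surprises Jane.",
                      "Rafael and Petra argue while Jane writes."),
  stringsAsFactors = FALSE
)

# Assumed list of main characters (lowercase, as unnest_tokens() lowercases)
main_characters <- c("jane", "rafael", "rogelio", "petra",
                     "xiomara", "alba", "michael")

character_mentions <- summaries %>%
  unnest_tokens(word, episode_summary) %>%   # one word per row
  filter(word %in% main_characters) %>%      # keep character names only
  count(word, sort = TRUE)                   # mentions per character

print(character_mentions)
```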

I find it quite surprising that Xo and Alba rank so low in mentions, but that could just be because they appear more in the introspective/mundane scenes of the show, which likely do not get mentioned in a brief summary such as the ones I am working with. The fact that Rogelio’s bar ended up being purple means that not customizing the colors was just the right choice to make (fans of the show will understand this statement).

The next step is to identify the main themes, and how common they are across the different seasons. For the sake of practice, this time I decided to use a different analysis method, which makes use of the tm package. The tm package works with corpora (lists of text elements), and the analysis workflow consists of the following steps:

  • transform the data frame into a corpus
  • clean the corpus
  • transform the corpus into a document-term matrix or DTM (a matrix where each row is a document and each column is a word, and the values are the number of times the word appears in each document)
  • transform the DTM into a regular matrix
  • sum the matrix columns to obtain the frequency of each word (with a term-document matrix, where words are in the rows, you would sum the rows instead)
  • convert the data back to a data frame format for data viz (if needed)
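The steps above can be sketched as follows. The summaries are made up, and the cleaning steps and extra stopword are only an example of what the cleaning could include.

```r
library(tm)

# Made-up summaries for illustration only
summaries <- c("Jane faces a difficult choice about the wedding.",
               "The wedding plans change after a family secret.",
               "A secret from the past threatens the family.")

# 1. Data to corpus
corpus <- VCorpus(VectorSource(summaries))

# 2. Clean the corpus (lowercase, punctuation, stopwords plus a custom word)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "jane"))
corpus <- tm_map(corpus, stripWhitespace)

# 3. Corpus to document-term matrix (rows = documents, columns = words)
dtm <- DocumentTermMatrix(corpus)

# 4. DTM to a regular matrix
dtm_matrix <- as.matrix(dtm)

# 5. Words are in the columns of a DTM, so column sums give word frequencies
word_freq <- sort(colSums(dtm_matrix), decreasing = TRUE)

# 6. Back to a data frame for plotting
word_freq_df <- data.frame(word = names(word_freq),
                           freq = as.numeric(word_freq))
print(head(word_freq_df))
```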

The process is a bit more tedious than the tidy text approach, but it can be useful if you want to apply machine learning models to the data. It is possible to transform tidy text data into a DTM and vice versa, and I recommend chapter 5 of the “Text Mining with R” book for a complete explanation of how this is done.
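As a small taste of that conversion, here is a sketch using cast_dtm() and tidy() from tidytext (with tm installed). The word counts are made up, and the column names document, term and n are chosen only for the example.

```r
library(dplyr)
library(tidytext)

# Made-up word counts in tidy format: one row per document-term pair
word_counts <- data.frame(
  document = c("s1", "s1", "s2"),
  term = c("wedding", "secret", "wedding"),
  n = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Tidy data frame to DocumentTermMatrix
dtm <- word_counts %>%
  cast_dtm(document, term, n)

# ...and back to a tidy data frame (columns: document, term, count)
tidy_again <- tidy(dtm)
print(tidy_again)
```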

As you can see from the code, there is a long list of words that I decided to exclude besides the most common stopwords. They are mostly characters’ or actors’ names, as well as other non-significant words such as “will” or “finally”. Of course this word choice is entirely arbitrary - and arguably correct - but the aim of my analysis was to capture the main themes rather than the overall most frequently used terms, and I felt that it resulted in a fairly representative output (at least from my very biased point of view).

I then decided to use wordclouds to see which themes are common to all five seasons and which ones are not. Wordclouds might not be the most informative way to represent word frequency, but once again, for the sake of practice I wanted to do at least a couple of them.

I thus collected the text of all five seasons and collapsed it into a single data frame, and then applied the functions commonality.cloud() and comparison.cloud() to see the results.
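Both functions come from the wordcloud package and expect a term matrix with one row per word and one column per group. Here is a rough sketch with two made-up "seasons". The way the matrix is built here (one collapsed string per season, passed through tm) is an assumption, not necessarily the exact route taken above.

```r
library(tm)
library(wordcloud)

# Made-up text: one collapsed string per season (two seasons for brevity)
season_text <- c("wedding secret family wedding baby love",
                 "secret family crime twin love love")

# Term-document matrix: rows = words, columns = seasons
corpus <- VCorpus(VectorSource(season_text))
term_matrix <- as.matrix(TermDocumentMatrix(corpus))
colnames(term_matrix) <- c("Season 1", "Season 2")

set.seed(42)  # word placement is random

# Words shared by all seasons
commonality.cloud(term_matrix, max.words = 50)

# Words that distinguish each season from the others
comparison.cloud(term_matrix, max.words = 50)
```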

Sentiment analysis

Jane is all about feelings, so this seemed like the most appropriate type of analysis to run. In a nutshell, sentiment analysis compares the terms in a text with the ones contained in a pre-defined lexicon, and assigns them a sentiment or a score accordingly. For my analysis, I used three different lexicons (or dictionaries):

  • NRC: classifies words with respect to 10 main sentiments
  • Bing: classifies words as negative or positive
  • Afinn: scores words on a scale from -5 (very negative) to +5 (very positive)
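In tidytext, a lexicon is just a data frame that gets joined to the tokenized text. The sketch below uses the Bing lexicon, which ships with tidytext, on made-up summaries. The NRC and Afinn lexicons are requested the same way with get_sentiments(), but they are fetched through the textdata package and need a one-time download.

```r
library(dplyr)
library(tidytext)

# The Bing lexicon: one row per word, labeled "positive" or "negative"
bing <- get_sentiments("bing")

# Made-up summaries for illustration only
summaries <- data.frame(
  season = c("Season 1", "Season 1", "Season 2"),
  episode_summary = c("A happy wedding full of love.",
                      "A murder brings grief.",
                      "Fear and betrayal follow a painful secret."),
  stringsAsFactors = FALSE
)

# Tokenize, keep only words found in the lexicon, count per season
season_sentiment <- summaries %>%
  unnest_tokens(word, episode_summary) %>%
  inner_join(bing, by = "word") %>%
  count(season, sentiment)

print(season_sentiment)
```

Words that are not in the lexicon are simply dropped by the inner join, which is why short summaries can end up with very few scored words.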

I started by analyzing what were the most prevalent sentiments in each season.

With the NRC lexicon, the sentiment distribution across the five seasons looks pretty homogeneous, with a predominance of “positive”, “trust” and (unsurprisingly!) “anticipation”.

If we remove positive and negative from the NRC lexicon, we are left with 8 sentiments, which are part of Plutchik’s wheel of emotions. Another way to represent the results is a radar chart, which can be built in an interactive form with the radarchart package.

With the Bing lexicon, negative sentiments are the most prevalent in every season, with season 1 having the lowest score and season 4 the highest.

With the Afinn dictionary, some seasons come out overall positive and others overall negative, with a trend that is quite similar (with the exception of season 2) to the one obtained with the Bing dictionary.

Besides looking at the average sentiment per season, I also wanted to monitor the score of each episode, and build a detailed timeline of the sequence of emotions throughout the show. To do that, I decided to use only the Bing and Afinn dictionaries, as it was difficult to attribute a score to the sentiments of the NRC one.
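For the Bing lexicon, one simple way to get a per-episode score is to subtract the number of negative words from the number of positive ones. This is a sketch on made-up summaries, and the scoring rule is an assumption, since other definitions are possible.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up summaries for illustration only
summaries <- data.frame(
  number = 1:3,
  episode_summary = c("A happy wedding brings joy and love.",
                      "A murder leads to grief and fear.",
                      "Love wins after a painful betrayal."),
  stringsAsFactors = FALSE
)

episode_sentiment <- summaries %>%
  unnest_tokens(word, episode_summary) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(number, sentiment) %>%
  # one column per sentiment, 0 when an episode has no word of that kind
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)

print(episode_sentiment)
```

Plotting score against number, with a geom_smooth() layer on top, then gives the kind of timeline discussed next.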

Well, the result is very much what I expected from this show: a rollercoaster of emotions! While the Afinn lexicon is built on a wider scale and thus yields a larger sentiment variation, both timelines trend slightly upwards, indicating a happy progression of the show (I am not sure that is the case, but we certainly get a happy ending).

Conclusions

The project was super fun and a great way to practice the latest data skills I have learned. It also gave me the chance to revel in my Jane obsession one more time, and I feel that this is not going to be the last 😃