This exercise involves scraping the episodes for all seasons of the series “Friends”. The files were provided in HTML format. The main objective of the exercise is to scrape, clean and prepare the data to produce one or more datasets for further analysis.
The second objective is to carry out analysis at different levels.
Extracting and cleaning the data was about 70% of this project. To clean and restructure the data, we did some text standardisation, such as case conversion, handling abbreviations, and removing white space and punctuation. We used functions like unnest_tokens() from the tidytext package, one of the many functions available in R for text mining and analysis.
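As a rough illustration of the tokenisation step, here is a minimal sketch. The `dialogue` data frame and its column names are hypothetical stand-ins for the scraped transcripts, not the project's actual data.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the scraped transcript: one row per line of dialogue.
dialogue <- tibble(
  character = c("Rachel", "Ross"),
  text = c("  Hey guys, where is Joey?", "We were on a break!  ")
)

# unnest_tokens() splits each line into one word per row. By default it
# lowercases the words and strips punctuation, which covers much of the
# standardisation described above.
tokens <- dialogue %>%
  mutate(text = trimws(text)) %>%
  unnest_tokens(output = word, input = text)

tokens
```

The result keeps the `character` column, so every word stays linked to the person who said it.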
Since R provides a dictionary of stop words, we used the anti_join() function to remove the words that match the ones in that dictionary. We also carried out further cleaning to remove custom stop words that are not contained in the stop words dictionary.
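The two-stage stop word removal might look something like the sketch below. The sample tokens and the custom stop word list are assumptions made for illustration; the actual custom list used in the project is not shown here.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, one word per row.
tokens <- tibble(
  character = "Rachel",
  word = c("the", "coffee", "yeah", "is", "gonna", "sandwich")
)

# Assumed custom stop words: filler words to drop in addition to the
# stop_words dictionary that ships with tidytext.
custom_stop_words <- tibble(word = c("yeah", "gonna"))

# anti_join() keeps only the rows whose word has no match in the other table.
clean_tokens <- tokens %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(custom_stop_words, by = "word")

clean_tokens
```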
For the analysis, we mostly conducted exploratory data analysis (EDA). We explored the characters, the series as a whole, and then the episodes. Specifically, we considered the most involved characters, the most frequently occurring words in the series, the sentiments in the series, the writer with the most episodes, the transcribers of the episodes, and so on.
At the character level, we looked at the most involved characters across the entire series, and the sentiments associated with their dialogue. We used the Bing dictionary for the sentiment analysis. We also analysed their talking habits with word clouds.
Rachel and Ross are the most involved characters in the series with the most appearances. Let’s dive deeper into the most involved characters to see what they were talking about.
In most of Rachel’s dialogue, she was speaking to Ross. She also says ‘guys’ and ‘Joey’ quite a lot.
In most of Ross’s dialogue, he was also speaking to Rachel. They end up together at some point in the series, so it’s understandable that they interact with each other a lot.
Here we see that the most negative character is Monica and the most positive character is Ross. If we used a different dictionary for the sentiment analysis, we might see a range of other emotions for the characters. For the sake of simplicity, we stick to the Bing sentiment dictionary here. We consider the NRC dictionary later in this analysis.
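A per-character comparison of this kind can be sketched as follows. The sample words are made up for illustration, and the net score (positive minus negative word counts) is one possible way to rank characters, not necessarily the measure used in this analysis.

```r
library(dplyr)
library(tidytext)

# Hypothetical cleaned tokens, one word per row.
tokens <- tibble(
  character = c("Monica", "Monica", "Ross", "Ross", "Ross"),
  word = c("hate", "dirty", "love", "happy", "museum")
)

# Join against the Bing lexicon, then tally positive and negative words
# per character. Words missing from the lexicon are dropped by the join.
sentiment_by_character <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(character) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    net = positive - negative
  ) %>%
  arrange(net)

sentiment_by_character
```

Raw counts favour characters with more dialogue, so dividing by each character's total word count may give a fairer comparison.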
At the series level, we want to see the general frequency of words and sentiments as well. We exclude the names of the main characters to avoid skewing the data. When we first run the analysis, we see that the main character names are the most frequently occurring words in the series. If we remove the character names and run the analysis again, we see a new layer of frequently occurring words. We represent this next layer of analysis in a word cloud.
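The effect of excluding character names can be sketched like this. The tokens are invented for illustration, and the exclusion list is assumed to be the six main characters' first names.

```r
library(dplyr)

# Hypothetical cleaned tokens for the whole series.
tokens <- tibble(
  word = c("ross", "rachel", "ross", "god", "god", "love", "joey", "god")
)

# Assumed exclusion list: first names of the six main characters.
main_characters <- c("ross", "rachel", "joey", "monica", "chandler", "phoebe")

# Drop the names, then count how often each remaining word occurs.
word_freq <- tokens %>%
  filter(!word %in% main_characters) %>%
  count(word, sort = TRUE)

word_freq
```

A frequency table like `word_freq` is the usual input for a word cloud.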
In our earlier analysis, we saw sentiments by character. We also want to understand the general sentiment across the entire series. For this series-level sentiment analysis, we use the NRC dictionary, which is available through the textdata package. We chose this dictionary because it gives us a range of sentiments, not just positive and negative. To achieve this, we join the NRC dictionary to our clean data and count the number of words associated with each sentiment.
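The join-and-count step can be sketched as below. To keep the example self-contained, it uses a small hypothetical table shaped like the NRC lexicon (columns `word` and `sentiment`) rather than the real dictionary. The key difference from Bing is that one word can carry several sentiments.

```r
library(dplyr)

# Hypothetical cleaned tokens.
clean_tokens <- tibble(word = c("love", "afraid", "wedding", "sandwich"))

# Hypothetical rows shaped like the NRC lexicon. In practice,
# tidytext::get_sentiments("nrc") would supply this table; it fetches the
# lexicon through the textdata package on first use.
nrc_like <- tibble(
  word = c("love", "love", "afraid", "afraid", "wedding", "wedding"),
  sentiment = c("joy", "positive", "fear", "negative", "joy", "positive")
)

# Each token can match several sentiment rows, so a single word may
# contribute to more than one sentiment count.
sentiment_counts <- clean_tokens %>%
  inner_join(nrc_like, by = "word") %>%
  count(sentiment, sort = TRUE)

sentiment_counts
```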
Overall, the sentiment in the Friends series is mostly positive.
For further analysis at the episode level, we combined the data frame by title to create a new data frame where each row represents an episode. We saved this as a CSV file and did some manual cleaning in Excel. Here, we want to know which writer has written the most episodes and which transcriber has transcribed the most. Note that some episodes do not have this information. We also want to know the word count for each episode.
For the episodes, we first look at which episodes (the top 10) had the most words in their script. We see that the top one is “The One In Vegas”!
| title | epi_count |
|---|---|
| The One In Vegas | 3398 |
| The one in Barbados | 3198 |
| The One With Ross and Monica’s Cousin | 3039 |
| The One That Could Have Been | 3013 |
| The One With The Girl From Poughkeepsie | 2996 |
| The One With Chandler and Monica’s Wedding | 2793 |
| The Last One | 2788 |
| The One Where Rachel Has A Baby | 2635 |
| The One After Vegas | 2277 |
| The One With The Rumor | 2083 |
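A ranking like the one in the table above could be produced along these lines. This is a sketch that assumes `epi_count` is the number of word tokens per title; the sample rows are placeholders, not real episode data.

```r
library(dplyr)

# Hypothetical tokens with the episode title attached to every word.
tokens <- tibble(
  title = c(rep("Episode A", 3), rep("Episode B", 5), "Episode C"),
  word = c("we", "were", "on",
           "how", "you", "doing", "today", "joey",
           "pivot")
)

# Collapse to one row per episode: the words are pasted back into a
# single script string and counted.
episodes <- tokens %>%
  group_by(title) %>%
  summarise(script = paste(word, collapse = " "), epi_count = n()) %>%
  arrange(desc(epi_count))

# Keep the ten longest episodes (here there are only three).
top_episodes <- head(episodes, 10)
top_episodes
```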
There’d be no Friends without its writers. So, which writer has written the most episodes? Note that some episodes do not have this information.
Based on the available data, Andrew Reich wrote over 15 episodes, which is the highest. However, since some episodes did not have information about the writers, this could change if that data becomes available.
Eric Aasen transcribed over 100 episodes in the Friends series data that we analysed.
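Tallies like the writer and transcriber counts above can be sketched as follows, assuming a per-episode data frame with a `writer` column in which missing credits are stored as `NA`. The names and rows are placeholders. The same pattern would apply to a transcriber column.

```r
library(dplyr)

# Hypothetical per-episode data; NA marks an episode with no writer credit.
episodes <- tibble(
  title = c("Episode A", "Episode B", "Episode C", "Episode D"),
  writer = c("Writer X", "Writer X", NA, "Writer Y")
)

# Drop episodes without a credit, then count episodes per writer.
writer_counts <- episodes %>%
  filter(!is.na(writer)) %>%
  count(writer, sort = TRUE)

writer_counts

# Count the uncredited episodes too, as a reminder of how much the
# ranking could shift if that information became available.
missing_credits <- sum(is.na(episodes$writer))
missing_credits
```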