Introduction

My whole life, I have been a huge fan of the Star Wars saga. Some of my fondest memories involve watching the original and prequel trilogies on DVD as a child. However, there is no doubt that Star Wars has evolved tremendously since its debut in 1977. Since Lucasfilm was bought by Disney in 2012, there never seems to be a break in the Star Wars action. There’s always something new coming out.

I’m interested in how people view the Star Wars content of today as opposed to the content of yesterday. Not only are there more new Star Wars series, but there are also a lot more fans from more recent generations. To gauge how fans feel about Disney’s new Star Wars shows, a great place to look is YouTube comments.

Dataset I’m Using

Kaggle is a great place to find datasets. I found a dataset called Star Wars YouTube Comments Sentiment Dataset that has YouTube comments from a variety of recent Star Wars trailers. This includes comments from the new trilogy movie trailers and trailers for numerous new Star Wars shows (such as Andor, Ahsoka, and Obi-Wan). Each row in this dataset represents one comment, and it includes information such as who left the comment, the video they commented on, the date they commented, and the overall sentiment of that comment (positive, negative, or neutral). Even though the dataset includes the sentiment of each comment, I won’t be relying on it much. Instead, I am interested in the overall sentiment for each type of video.

To get the type of video, I will create a new column in the dataset called video_topic. The topic is obtained by using the grepl() function to search for keywords in each video’s title and assigning a simplified title to that row. For example, one video is called “Obi-Wan Kenobi | Official Trailer | Disney+”. In that row, the video topic would just be “Obi-Wan”. Below is a data dictionary for this dataset.
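As a rough illustration of the keyword approach described above, here is a minimal base R sketch. Only the Obi-Wan title is taken from the text. The other titles, the `topic_patterns` vector, and the `assign_topic()` helper are hypothetical names made up for this example, not the code actually used for the analysis.

```r
# Sample titles: only the Obi-Wan title is quoted from the text;
# the other two are hypothetical.
titles <- c(
  "Obi-Wan Kenobi | Official Trailer | Disney+",
  "Andor | Official Trailer | Disney+",
  "Some unrelated video"
)

# Hypothetical lookup: names are the simplified topics,
# values are the keywords searched for in each title.
topic_patterns <- c("Obi-Wan" = "Obi-Wan", "Andor" = "Andor")

# Return the first topic whose keyword appears in the title, or NA if none match.
assign_topic <- function(title, patterns = topic_patterns) {
  for (topic in names(patterns)) {
    if (grepl(patterns[[topic]], title, ignore.case = TRUE)) {
      return(topic)
    }
  }
  NA_character_
}

# Apply the helper to every title to build the video_topic column.
video_topic <- vapply(titles, assign_topic, character(1), USE.NAMES = FALSE)
print(video_topic)
```

Titles that match none of the keywords come back as NA, which makes it easy to spot videos that still need a pattern.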

Data Dictionary

1. comment_id

  • Description: Unique identifier for each comment
  • Type: Character

2. video_title

  • Description: Title of the video commented on. All videos are trailers
  • Type: Character

3. video_topic

  • Description: Simplified video title containing only the keyword
  • Type: Character

4. author

  • Description: YouTube username of person who wrote the comment
  • Type: Character

5. comment

  • Description: Content within the user’s comment
  • Type: Character

6. date

  • Description: Date the comment was left
  • Type: Date

7. sentiment

  • Description: Overall tone of the comment left. Positive, Negative, or Neutral
  • Type: Character

8. comment_id

  • Description: Row number
  • Type: Numeric

9. comment_length

  • Description: Number of characters in the comment
  • Type: Numeric

YouTube Sentiment Analysis

YouTube comments are notorious for being overly negative. In the context of these Star Wars trailers, is there proof behind that notoriety? Are there certain video topics that seem to be more negative than others?

[Interactive graphs: most-used scorable words in the comments of each video topic; not available in this static document]

The graphs above are just meant to give an idea of which words are used most in the comments of each video topic. One thing to note about comments in the “Eclipse” series: the only scorable word that was used more than once in those comments was “love”, so there wasn’t a lot to analyze there. For each video topic, I am noticing more negatively scored words than positive ones. Some of the words that are counted as “negative” are slightly questionable. For example, in the Episode I comments, the words “funny” and “goofy” are scored as negative.

One thing to always keep in mind with sentiment analysis is to take what you see with a grain of salt. Just because one set of comments is scored more negatively doesn’t mean that the users who commented don’t like it as much as the other video. In the previous graphs I was just exploring the language of the comments, but now I want to look at the comments of all topics next to each other, and compare the percentage of positive words.
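To make the "percentage of positive words" idea concrete, here is a small base R sketch. Everything in it is an assumption for illustration: the toy comments, the tiny hand-made `lexicon` (standing in for a real sentiment lexicon), and the variable names. It is not the code behind the results in this post.

```r
# Toy comments (hypothetical), one row per comment.
comments <- data.frame(
  video_topic = c("Andor", "Andor", "Obi-Wan"),
  comment = c("I love this show", "great but boring pacing", "annoying garbage"),
  stringsAsFactors = FALSE
)

# Toy lexicon (hypothetical) standing in for a real sentiment lexicon.
lexicon <- data.frame(
  word = c("love", "great", "boring", "annoying", "garbage"),
  sentiment = c("positive", "positive", "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Split each comment into lowercase words, keeping track of its topic.
word_list <- strsplit(tolower(comments$comment), "[^a-z']+")
words <- data.frame(
  video_topic = rep(comments$video_topic, lengths(word_list)),
  word = unlist(word_list),
  stringsAsFactors = FALSE
)

# Keep only the words found in the lexicon (the "scorable" words).
scored <- merge(words, lexicon, by = "word")

# Share of scorable words that are positive, per topic, as a percentage.
pct_positive <- sapply(
  split(scored$sentiment == "positive", scored$video_topic),
  mean
) * 100
print(round(pct_positive, 1))
```

With only two or three scorable words per topic, a single word swings the percentage a lot, which is one reason to read these numbers with caution when a topic has few scorable words.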

## # A tibble: 10 × 2
##    video_topic           n
##    <chr>             <int>
##  1 Ahsoka               81
##  2 Andor               182
##  3 Book of Boba Fett    56
##  4 Eclipse              28
##  5 Episode I            79
##  6 Episode III          45
##  7 Episode IX           60
##  8 Episode VII          50
##  9 Obi-Wan              50
## 10 Rogue One            42

The only video topic with more than 50% positive words was Eclipse, and that’s because there were only a handful of scorable words in those comments. Other than Eclipse, the topic with the highest percentage of positively scored words was Andor, which I haven’t seen but have heard is really good. It’s also worth noting that Andor had by far the most scorable words in the sample, mainly because 182 of the 673 comments collected were from YouTube videos about Andor.

The topic with the lowest percentage of positive words was Obi-Wan, which I saw part of and wasn’t a big fan of. The most common negative words in the Obi-Wan comments were “annoying” and “garbage”, which sums up pretty well how YouTube commenters feel about the series.

Emotions Evoked in YouTube Comments

In the last section, words within comments were scored as either positive or negative, which certainly has its uses. Now I want to look at the specific emotions behind the commenters’ words and explore how those emotions differ between the video topics. I will be using the nrc lexicon to perform this analysis.

[Interactive graph: percentage of words evoking each emotion, by video topic; not available in this static document]

Positive and Negative are still the most common categories in the comments for most video topics, but it’s interesting to see how some topics have a higher percentage of words that evoke trust, while others lean toward fear or even anticipation.

Now that I have explored the sentiments toward new Star Wars content in YouTube comments, I will transition into exploring the sentiments of IMDb reviews for the original trilogy of Star Wars movies, also known as “OG” Star Wars.

We saw the “New”, how about the “Old”?

I found the previous dataset about YouTube comments on Star Wars related videos on Kaggle. This new dataset, however, was obtained by scraping the HTML of IMDb’s website. I scraped 17 reviews for each of the 3 original Star Wars movies, so this new dataset has 51 rows. I chose 17 reviews per movie because IMDb reviews are much longer than YouTube comments, and I wanted to keep the total number of characters roughly consistent across the two datasets. Nearly 700 YouTube comments translated to about 100,000 total characters, so I checked that the reviews I scraped would add up to about the same length.

sum(starwars_yt$comment_length)
## [1] 100675
sum(starwars_imdb$review_length)
## [1] 108445
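The two totals above differ by roughly 8%. As a sketch of how length columns like these could be built and compared, here is a base R example using `nchar()`. The toy data frames `toy_yt` and `toy_imdb` and their contents are hypothetical. Only the column names `comment`, `comment_length`, `Review`, and `review_length` come from the datasets described in this post.

```r
# Hypothetical toy data; the real datasets are not reproduced here.
toy_yt <- data.frame(
  comment = c("Great trailer!", "Meh."),
  stringsAsFactors = FALSE
)
toy_imdb <- data.frame(
  Review = "A long, thoughtful review.",
  stringsAsFactors = FALSE
)

# Count the characters in each comment and each review.
toy_yt$comment_length <- nchar(toy_yt$comment)
toy_imdb$review_length <- nchar(toy_imdb$Review)

# Compare the total amount of text in the two toy datasets.
total_yt <- sum(toy_yt$comment_length)
total_imdb <- sum(toy_imdb$review_length)
length_ratio <- total_imdb / total_yt
print(c(youtube = total_yt, imdb = total_imdb))
```

A ratio near 1 means the two datasets contain a similar amount of text, so word counts from each are easier to compare.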

First, I just want to get an idea of the most common words used in the IMDb reviews.

Some of the most common scorable words in the IMDb review dataset include “death”, “dark”, “evil”, and “destroy”. This contrasts with words like “boring”, “bad”, and “love”, which were the most common in the YouTube comments. This highlights the difference in language between IMDb reviews and YouTube comments. IMDb is more formal: where a reviewer would say “You can feel the evil radiate from the screen when the Emperor walks into the scene”, a YouTube commenter would say “The Emperor is a bad guy”. These are purely hypothetical examples, but they reflect the reality of the two platforms.