Load the Netflix Dataset

I’ll first load the Netflix dataset and preview it to understand its structure.

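# Load the packages used below (dplyr for the pipeline, ggplot2 for the plots), then the dataset
library(dplyr)
library(ggplot2)
netflix_data <- read.csv("~/Netflix_dataset.csv")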

# Preview the dataset
head(netflix_data)
##         id                               title  type
## 1 ts300399 Five Came Back: The Reference Films  SHOW
## 2  tm84618                         Taxi Driver MOVIE
## 3 tm127384     Monty Python and the Holy Grail MOVIE
## 4  tm70993                       Life of Brian MOVIE
## 5 tm190788                        The Exorcist MOVIE
## 6  ts22164        Monty Python's Flying Circus  SHOW
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                                                                                                                                                                                                                                                                                            This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2                                                                                                                                                                                                                                A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3                                    King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5                                                                                                                12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6                                                                                                                                                                                                                                                                                             A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime                 genres
## 1         1945             TV-MA      48      ['documentation']
## 2         1976                 R     113     ['crime', 'drama']
## 3         1975                PG      91  ['comedy', 'fantasy']
## 4         1979                 R      94             ['comedy']
## 5         1973                 R     133             ['horror']
## 6         1969             TV-14      30 ['comedy', 'european']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']       1                   NA         NA           0.600
## 2               ['US']      NA tt0075314        8.3     795222          27.612
## 3               ['GB']      NA tt0071853        8.2     530877          18.216
## 4               ['GB']      NA tt0079470        8.0     392419          17.505
## 5               ['US']      NA tt0070047        8.1     391942          95.337
## 6               ['GB']       4 tt0063929        8.8      72895          12.919
##   tmdb_score
## 1         NA
## 2        8.2
## 3        7.8
## 4        7.8
## 5        7.7
## 6        8.3

Building Two Pairs of Numeric Variables:

I’ll focus on:

  1. IMDb votes vs IMDb score per vote: IMDb votes is the explanatory variable, and the newly created column, IMDb score per vote, is the response variable. This pair shows how the adjusted score changes as the number of votes grows. Because the vote count is also the denominator of the response, some negative association is expected by construction.

  2. TMDB popularity vs IMDb score per vote: TMDB popularity is the explanatory variable and IMDb score per vote is again the response. This pair checks whether popularity on one platform (TMDB) lines up with the adjusted rating from another (IMDb).

Let’s create a new column called “IMDb score per vote” by dividing IMDb score by the number of votes, which gives an adjusted score based on popularity.

# Create a new variable: IMDb score per vote
netflix_data <- netflix_data %>%
  mutate(imdb_score_per_vote = imdb_score / imdb_votes)

# Select relevant columns
netflix_data_clean <- netflix_data %>%
  select(imdb_score, imdb_votes, imdb_score_per_vote, runtime, tmdb_popularity) %>%
  filter(!is.na(imdb_score), !is.na(imdb_votes), !is.na(tmdb_popularity))
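
To make the new column concrete, here is a minimal sketch that applies the same division to five of the titles shown in the preview above. The values are copied from that output; the `preview` data frame is only an illustration and not part of the original analysis. The last two lines show why zero-vote rows would need attention if the full dataset contains any (an assumption, not something checked here): dividing by zero yields `Inf` or `NaN`, and the filter above only tests the input columns for `NA`.

```r
# Five titles from the preview output (rows 2-6), values copied by hand
preview <- data.frame(
  title = c("Taxi Driver", "Monty Python and the Holy Grail", "Life of Brian",
            "The Exorcist", "Monty Python's Flying Circus"),
  imdb_score = c(8.3, 8.2, 8.0, 8.1, 8.8),
  imdb_votes = c(795222, 530877, 392419, 391942, 72895)
)

# Same calculation as in the analysis: score divided by vote count
preview$imdb_score_per_vote <- preview$imdb_score / preview$imdb_votes

# Sort by votes to see the ratio shrink as the vote count grows
preview[order(preview$imdb_votes), ]

# Hypothetical zero-vote rows: division gives Inf (or NaN for 0 / 0)
edge_cases <- c(7.5 / 0, 0 / 0)
is.finite(edge_cases)  # FALSE FALSE
```

All five raw scores lie between 8.0 and 8.8, yet the per-vote values differ by more than a factor of ten and follow the vote counts almost entirely.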

Visualizing Relationships and Drawing Conclusions

Pair 1: IMDb Votes vs IMDb Score Per Vote

I’ll first visualize the relationship between IMDb votes and the newly created IMDb score per vote. This helps show whether more popular titles (those with more votes) tend to have higher adjusted IMDb scores.

# Visualization: IMDb Votes vs IMDb Score Per Vote
ggplot(netflix_data_clean, aes(x = imdb_votes, y = imdb_score_per_vote)) +
  geom_point(aes(color = imdb_score_per_vote), alpha = 0.6, size = 3) + 
  scale_color_gradient(low = "lightblue", high = "darkblue") + 
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
  labs(
    title = "Relationship Between IMDb Votes and IMDb Score Per Vote",
    subtitle = "Adjusted IMDb scores based on vote count",
    x = "Number of IMDb Votes",
    y = "IMDb Score Per Vote",
    color = "IMDb Score Per Vote"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "right"
  ) +
  scale_x_log10(labels = scales::comma)  # Log scale for IMDb votes
## `geom_smooth()` using formula = 'y ~ x'

Conclusion:

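The fitted trend line slopes downward and passes from positive to negative values as the vote count increases, so titles with more votes tend to have lower per-vote scores. Part of this decline is built into the metric, since the vote count is its denominator, and the negative fitted values (impossible for a ratio of two positive numbers) suggest that a straight line is only a rough summary of the relationship.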
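
Because the plot applies `scale_x_log10()`, ggplot2 transforms the x values before `geom_smooth()` fits its model, so the trend line is a straight-line fit against log10(votes) and not against the raw vote count. The sketch below makes that kind of fit explicit on simulated data. The `votes`, `score`, and `per_vote` vectors are invented for illustration and are not the Netflix data; the point is only to show how a linear fit to a ratio like this slopes downward and can predict negative values at high vote counts.

```r
set.seed(1)

# Simulated stand-ins: vote counts spread over several orders of magnitude
votes <- 10^runif(200, min = 1, max = 6)
score <- runif(200, min = 4, max = 9)
per_vote <- score / votes          # always positive by construction

# Explicit version of a linear fit on a log10 x-axis
fit <- lm(per_vote ~ log10(votes))
coef(fit)

# Fitted values at a low and a high vote count
predict(fit, newdata = data.frame(votes = c(10, 1e6)))
```

If the fitted value at the high end comes out below zero, that mirrors the behaviour of the line in the plot and is a property of the straight-line model, not of the data.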


Pair 2: TMDB Popularity vs IMDb Score Per Vote

Next, I’ll explore whether higher TMDB popularity correlates with a higher IMDb score per vote. This helps evaluate if content popularity on TMDB aligns with the adjusted IMDb scores.

# Visualization: TMDB Popularity vs IMDb Score Per Vote
ggplot(netflix_data_clean, aes(x = tmdb_popularity, y = imdb_score_per_vote)) +
  geom_point(aes(color = imdb_score_per_vote), alpha = 0.6, size = 3) + 
  scale_color_gradient(low = "lightgreen", high = "darkgreen") + 
  geom_smooth(method = "lm", se = FALSE, color = "blue", linewidth = 1.2) +
  labs(
    title = "Relationship Between TMDB Popularity and IMDb Score Per Vote",
    subtitle = "How TMDB popularity correlates with adjusted IMDb ratings",
    x = "TMDB Popularity",
    y = "IMDb Score Per Vote",
    color = "IMDb Score Per Vote"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "right"
  ) + 
  scale_x_log10(labels = scales::comma)  # Log scale for TMDB popularity
## `geom_smooth()` using formula = 'y ~ x'

Conclusion:

The fitted trend line slopes downward here as well, indicating that higher TMDB popularity doesn’t necessarily translate to higher IMDb scores per vote.


Calculating Correlation Coefficients

Pair 1: IMDb Votes vs IMDb Score Per Vote

# Correlation: IMDb Votes vs IMDb Score Per Vote
cor(netflix_data_clean$imdb_votes, netflix_data_clean$imdb_score_per_vote, use = "complete.obs")
## [1] -0.0708038

Explanation:

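At about -0.07, the correlation is negative but very weak: IMDb votes and IMDb score per vote have almost no linear relationship. Pearson’s coefficient only measures linear association, so a strongly curved relationship can still produce a value near zero. High vote counts don’t translate to high scores per vote, which could also reflect divisive content or varied audience preferences.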
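
Pearson’s correlation, which `cor()` computes by default, captures only linear association. As a hedged illustration of how much that can matter for a ratio like score per vote, the sketch below uses made-up data with a constant score of 7 (an assumption chosen purely to isolate the effect of the denominator) and compares Pearson’s coefficient with Spearman’s rank correlation through the `method` argument of `cor()`.

```r
# Made-up vote counts; the score is held constant at 7 for every title
votes <- c(10, 100, 1000, 10000, 100000, 1000000)
per_vote <- 7 / votes

pearson <- cor(votes, per_vote)                        # linear association
spearman <- cor(votes, per_vote, method = "spearman")  # rank-based association

# Spearman is -1 here because per_vote falls every time votes rises
c(pearson = pearson, spearman = spearman)
```

On the real data the same comparison would be `cor(netflix_data_clean$imdb_votes, netflix_data_clean$imdb_score_per_vote, method = "spearman")`. That call is not run here, so no value is reported.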

Pair 2: TMDB Popularity vs IMDb Score Per Vote

# Correlation: TMDB Popularity vs IMDb Score Per Vote
cor(netflix_data_clean$tmdb_popularity, netflix_data_clean$imdb_score_per_vote, use = "complete.obs")
## [1] -0.06278737

Explanation:

At about -0.06, this correlation is also negative and very weak, so there is no strong linear relationship between TMDB popularity and IMDb score per vote. Popularity on TMDB doesn’t consistently align with high IMDb scores per vote, possibly because of other factors such as recency or marketing.
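
A correlation this close to zero raises the question of how precisely it is estimated. One way to check, sketched here on simulated data (the vectors `x` and `y` are hypothetical stand-ins, not columns from the dataset), is base R’s `cor.test()`, which returns the estimate together with a confidence interval and a p-value.

```r
set.seed(42)

# Simulated data with a weak negative dependence, by construction
x <- rnorm(500)
y <- -0.05 * x + rnorm(500)

ct <- cor.test(x, y)  # Pearson correlation test by default

ct$estimate  # sample correlation, same value as cor(x, y)
ct$conf.int  # 95% confidence interval for the correlation
ct$p.value   # p-value for the null hypothesis of zero correlation
```

For the analysis above, the equivalent call would pass `netflix_data_clean$tmdb_popularity` and `netflix_data_clean$imdb_score_per_vote` as the two arguments.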


Building Confidence Intervals

For IMDb score per vote (the response variable in both pairs) and for TMDB popularity, I’ll calculate 95% confidence intervals for the population mean.

IMDb Score Per Vote Confidence Interval

# Mean and standard error for IMDb score per vote
imdb_per_vote_mean <- mean(netflix_data_clean$imdb_score_per_vote, na.rm = TRUE)
imdb_per_vote_se <- sd(netflix_data_clean$imdb_score_per_vote, na.rm = TRUE) / sqrt(nrow(netflix_data_clean))

# Confidence interval (95%)
imdb_per_vote_ci <- c(imdb_per_vote_mean - 1.96 * imdb_per_vote_se, imdb_per_vote_mean + 1.96 * imdb_per_vote_se)
imdb_per_vote_ci
## [1] 0.02114956 0.02600128

Explanation:

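With 95% confidence, the true mean IMDb score per vote for Netflix content lies between about 0.021 and 0.026. The interval is narrow in absolute terms because the standard error of the mean is small; it describes how precisely the average is estimated and does not by itself show that individual titles have similar values. The mean itself is small because scores of at most 10 are divided by vote counts that can run into the hundreds of thousands, not because audiences rated the content poorly.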
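
The interval above uses the normal critical value 1.96. A small variation, shown here as a sketch, uses the t distribution and counts only the finite observations, so the sample size in the standard error always matches the values that went into `sd()`. The helper name `mean_ci` and the simulated input are assumptions made for this example.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  x <- x[is.finite(x)]                          # drop NA, NaN and Inf
  n <- length(x)
  se <- sd(x) / sqrt(n)                         # standard error of the mean
  crit <- qt(1 - (1 - level) / 2, df = n - 1)   # t critical value
  mean(x) + c(-1, 1) * crit * se
}

set.seed(7)
sample_values <- c(rexp(300, rate = 40), NA)    # simulated, with one missing value

mean_ci(sample_values)
t.test(sample_values)$conf.int                  # base R gives the same interval
```

For large samples the t and normal critical values are nearly identical, so this should differ little from the interval reported above.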


TMDB Popularity Confidence Interval

# Confidence interval for TMDB popularity
tmdb_mean <- mean(netflix_data_clean$tmdb_popularity, na.rm = TRUE)
tmdb_se <- sd(netflix_data_clean$tmdb_popularity, na.rm = TRUE) / sqrt(nrow(netflix_data_clean))

tmdb_ci <- c(tmdb_mean - 1.96 * tmdb_se, tmdb_mean + 1.96 * tmdb_se)
tmdb_ci
## [1] 21.71221 25.60479

Explanation:

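This interval, roughly 21.7 to 25.6, is wider in absolute terms, but TMDB popularity is measured on a much larger scale than score per vote, so the two widths are not directly comparable. Relative to its midpoint, the margin of error is about 8%, compared with about 10% for IMDb score per vote. As with any confidence interval for a mean, it describes uncertainty about the average, not the spread of individual titles. The preview alone shows popularity values from 0.6 to 95.3, so individual titles do vary widely, likely because different types of content differ in appeal.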
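
Since the two intervals are on different scales, one hedged way to compare them is to express each margin of error as a fraction of the interval’s midpoint. The sketch below does this using only the two intervals printed above; the helper `relative_margin` is a name introduced here for illustration.

```r
# Hypothetical helper: half-width of an interval relative to its midpoint
relative_margin <- function(ci) {
  midpoint <- mean(ci)
  half_width <- diff(ci) / 2
  half_width / midpoint
}

# Intervals copied from the output above
imdb_per_vote_ci <- c(0.02114956, 0.02600128)
tmdb_ci <- c(21.71221, 25.60479)

round(100 * relative_margin(imdb_per_vote_ci), 1)  # about 10.3 (percent)
round(100 * relative_margin(tmdb_ci), 1)           # about 8.2 (percent)
```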


Final Insights and Further Questions

In summary:

  • IMDb Votes vs IMDb Score Per Vote: There’s no clear pattern suggesting that more IMDb votes lead to higher adjusted IMDb scores. Some content with many votes may still receive lower scores, possibly due to varied audience preferences or divisive subject matter.
  • TMDB Popularity vs IMDb Score Per Vote: Popularity on TMDB doesn’t strongly correlate with IMDb scores per vote. This might be because TMDB popularity reflects different factors, such as recent releases or marketing efforts, that don’t necessarily align with IMDb ratings.

Further Questions:

  • What other factors influence IMDb and TMDB scores? Could genre or release year play a role?

  • Why do some high-vote IMDb titles have lower scores? Could it be due to divisive content?

This analysis could be extended by exploring non-numeric features such as genre and looking at interactions between multiple variables.
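
As a possible starting point for the genre extension, note that the `genres` column is stored as a string such as `['crime', 'drama']` and would need to be split before grouping. The following is a minimal sketch using the five scored titles from the preview; `parse_genres` is a hypothetical helper, and five rows are far too few to draw conclusions from.

```r
# Five scored titles from the preview output (rows 2-6), values copied by hand
toy <- data.frame(
  imdb_score = c(8.3, 8.2, 8.0, 8.1, 8.8),
  genres = c("['crime', 'drama']", "['comedy', 'fantasy']", "['comedy']",
             "['horror']", "['comedy', 'european']"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip brackets and quotes, then split on commas
parse_genres <- function(x) {
  cleaned <- gsub("\\[|\\]|'", "", x)
  lapply(strsplit(cleaned, ","), trimws)
}

genre_list <- parse_genres(toy$genres)

# One row per (title, genre) pair, so a title counts once for each of its genres
long <- data.frame(
  genre = unlist(genre_list),
  imdb_score = rep(toy$imdb_score, lengths(genre_list))
)

# Mean IMDb score by genre
genre_means <- tapply(long$imdb_score, long$genre, mean)
genre_means
```

The same reshaping could feed a grouped plot or a per-genre correlation once it is applied to the full dataset.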