Load the Netflix Dataset

I’ll first load the Netflix dataset and preview it to understand its structure.

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")

# Preview the dataset
head(netflix_data)
##         id                               title  type
## 1 ts300399 Five Came Back: The Reference Films  SHOW
## 2  tm84618                         Taxi Driver MOVIE
## 3 tm127384     Monty Python and the Holy Grail MOVIE
## 4  tm70993                       Life of Brian MOVIE
## 5 tm190788                        The Exorcist MOVIE
## 6  ts22164        Monty Python's Flying Circus  SHOW
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                                                                                                                                                                                                                                                                                            This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2                                                                                                                                                                                                                                A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3                                    King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5                                                                                                                12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6                                                                                                                                                                                                                                                                                             A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime                 genres
## 1         1945             TV-MA      48      ['documentation']
## 2         1976                 R     113     ['crime', 'drama']
## 3         1975                PG      91  ['comedy', 'fantasy']
## 4         1979                 R      94             ['comedy']
## 5         1973                 R     133             ['horror']
## 6         1969             TV-14      30 ['comedy', 'european']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']       1                   NA         NA           0.600
## 2               ['US']      NA tt0075314        8.3     795222          27.612
## 3               ['GB']      NA tt0071853        8.2     530877          18.216
## 4               ['GB']      NA tt0079470        8.0     392419          17.505
## 5               ['US']      NA tt0070047        8.1     391942          95.337
## 6               ['GB']       4 tt0063929        8.8      72895          12.919
##   tmdb_score
## 1         NA
## 2        8.2
## 3        7.8
## 4        7.8
## 5        7.7
## 6        8.3
# Check the structure of the data
str(netflix_data)
## 'data.frame':    5806 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm127384" "tm70993" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
##  $ release_year        : int  1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
##  $ age_certification   : chr  "TV-MA" "R" "PG" "R" ...
##  $ runtime             : int  48 113 91 94 133 30 102 170 104 110 ...
##  $ genres              : chr  "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
##  $ production_countries: chr  "['US']" "['US']" "['GB']" "['GB']" ...
##  $ seasons             : num  1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0071853" "tt0079470" ...
##  $ imdb_score          : num  NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
##  $ imdb_votes          : num  NA 795222 530877 392419 391942 ...
##  $ tmdb_popularity     : num  0.6 27.6 18.2 17.5 95.3 ...
##  $ tmdb_score          : num  NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...

1. Identify Columns that are Unclear Until You Read the Documentation

Upon inspecting the dataset, I noticed that certain columns, such as imdb_id and age_certification, are unclear without referencing the documentation.

Why Encoded This Way?

The data is probably encoded this way to stay compact and to line up with external standards: imdb_id is IMDb’s own title identifier (for example tt0075314), and age_certification uses rating codes such as R, PG, and TV-MA. Without the documentation, I might treat these as free-form descriptive values or misread them, which would affect the analysis.
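The str() output also shows two encodings that are easy to misread: genres stores a list as a single bracketed string, and a missing imdb_id is stored as an empty string rather than NA. The following is a minimal sketch of how these could be normalized, using three rows copied from the preview above. The helper name parse_list_column is hypothetical, and the sketch assumes no value contains a comma or an apostrophe of its own.

```r
# Three rows copied from the head() output above
sample_rows <- data.frame(
  title = c("Five Came Back: The Reference Films", "Taxi Driver",
            "Monty Python and the Holy Grail"),
  genres = c("['documentation']", "['crime', 'drama']", "['comedy', 'fantasy']"),
  imdb_id = c("", "tt0075314", "tt0071853"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "['a', 'b']" into c("a", "b")
# Assumes values contain no commas or single quotes of their own
parse_list_column <- function(x) {
  stripped <- gsub("\\[|\\]|'", "", x)       # drop brackets and quotes
  lapply(strsplit(stripped, ","), trimws)    # split on commas, trim spaces
}

# Store the parsed genres as a list column
sample_rows$genre_list <- parse_list_column(sample_rows$genres)

# Treat empty identifiers as missing so is.na() reports them
sample_rows$imdb_id[sample_rows$imdb_id == ""] <- NA

# Count how often each individual genre appears
genre_counts <- table(unlist(sample_rows$genre_list))
print(genre_counts)
```

Applied to the full data, the same helper could be pointed at production_countries, which appears to use the same bracketed format.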


2. Element Unclear Even After Reading Documentation

One column that remains unclear even after reading the documentation is tmdb_popularity. While the documentation explains that this refers to popularity on TMDB, it doesn’t specify the units or calculation method behind it.

Is tmdb_popularity based on user ratings, views, or some other metric?

Does a higher value indicate more recent popularity, or is it cumulative over time?
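One way to start probing these two questions is to check how tmdb_popularity ranks against a cumulative signal (imdb_votes) and against age (release_year). The sketch below is only an assumption about what a useful first check looks like; the function name probe_popularity is hypothetical, and the six preview rows used here are far too few to support any conclusion. It is meant to be rerun on the full netflix_data.

```r
# The six rows shown in the head() output above
preview <- data.frame(
  release_year    = c(1945, 1976, 1975, 1979, 1973, 1969),
  imdb_votes      = c(NA, 795222, 530877, 392419, 391942, 72895),
  tmdb_popularity = c(0.600, 27.612, 18.216, 17.505, 95.337, 12.919)
)

# Hypothetical helper: rank correlations between popularity and other columns.
# Spearman is used because popularity looks heavily skewed.
probe_popularity <- function(df) {
  c(
    votes = cor(df$tmdb_popularity, df$imdb_votes,
                method = "spearman", use = "complete.obs"),
    year  = cor(df$tmdb_popularity, df$release_year,
                method = "spearman", use = "complete.obs")
  )
}

# Distribution of the raw values, then the rank correlations
print(summary(preview$tmdb_popularity))
popularity_probe <- probe_popularity(preview)
print(popularity_probe)
```

A strong rank correlation with imdb_votes on the full data would suggest a cumulative measure, while a strong correlation with release_year would suggest a recency effect. Neither result would settle the definition on its own.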


3. Visualization Highlighting the Unclear Element

To explore the unclear nature of tmdb_popularity, I’ll create a scatter plot comparing IMDb scores to TMDB popularity and highlight any odd trends or outliers.

# Load ggplot2 for the static plot and plotly for the interactive version
# install.packages("plotly", repos = "https://cran.rstudio.com/")
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Scatter plot for IMDb Score vs TMDB Popularity using ggplot2
p <- ggplot(netflix_data, aes(x = imdb_score, y = tmdb_popularity)) +
  
  # Points with gradient color based on IMDb score
  geom_point(aes(color = imdb_score, text = paste("IMDb Score:", imdb_score, "<br>TMDB Popularity:", tmdb_popularity)), 
             alpha = 0.7, size = 3) +  
  
  # Set color gradient from blue (low IMDb score) to orange (high IMDb score)
  scale_color_gradient(low = "blue", high = "orange") +  
  
  # Labels for title and axis
  labs(title = "Relationship Between IMDb Score and TMDB Popularity",
       x = "IMDb Score",
       y = "TMDB Popularity",
       color = "IMDb Score") +  
  
  # Add a linear trend line (linewidth replaces size for lines since ggplot2 3.4.0)
  geom_smooth(method = "lm", color = "blue", se = FALSE, linewidth = 1) +
  
  # Annotation text
  annotate("text", x = 9, y = 225, label = "Unclear how popularity is measured", 
           color = "blue", size = 4, hjust = 0, fontface = "italic") +
  
  # Use minimal theme for clean design
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "right"
  )
## Warning in geom_point(aes(color = imdb_score, text = paste("IMDb Score:", :
## Ignoring unknown aesthetics: text
# Convert ggplot to a dynamic plotly plot
ggplotly(p, tooltip = c("text"), width = 1600, height = 800)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 607 rows containing non-finite outside the scale range
## (`stat_smooth()`).

Explanation of Visualization:

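The scatter plot shows IMDb score against TMDB popularity, with a linear trend line. While there is some correlation, it does not explain why certain highly rated titles have low popularity and vice versa. One plausible reason is that the documentation leaves TMDB’s popularity metric undefined, so it may be measuring something other than perceived quality.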
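The warning in the output above reports 607 rows removed before the trend line was fitted, which means the plot describes only part of the dataset. A minimal sketch for counting such rows follows; the helper name count_non_finite is hypothetical, and the six preview rows stand in for the full data.

```r
# Scores and popularity for the six rows shown in the head() output
preview <- data.frame(
  imdb_score      = c(NA, 8.3, 8.2, 8.0, 8.1, 8.8),
  tmdb_popularity = c(0.600, 27.612, 18.216, 17.505, 95.337, 12.919)
)

# Hypothetical helper: count rows where any listed column is NA, NaN, or infinite
count_non_finite <- function(df, cols) {
  all_finite <- Reduce(`&`, lapply(df[cols], is.finite))
  sum(!all_finite)
}

# Only the first preview row lacks an IMDb score
dropped_rows <- count_non_finite(preview, c("imdb_score", "tmdb_popularity"))
print(dropped_rows)
```

Running count_non_finite(netflix_data, c("imdb_score", "tmdb_popularity")) on the full dataset should give a figure comparable to the 607 in the warning, assuming the warning counts the same rows.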

Significant Risk:

Relying on a metric like tmdb_popularity without fully understanding its definition could lead to skewed conclusions about content performance. For example, I might falsely assume that content with a high tmdb_popularity value is more widely watched, when in reality the value could be outdated or region-specific.

Mitigation:

I would either:

Further Investigation:

Are there other columns, like runtime or genres, that behave unexpectedly in the dataset?

Can I infer a more accurate metric for popularity using IMDb votes or ratings alongside TMDB scores?
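As a starting point for the second question, one option is to average the percentile ranks of imdb_votes and tmdb_popularity. This is only a sketch under stated assumptions: the function name composite_popularity is hypothetical, the equal weighting is arbitrary, and nothing in the documentation confirms that either input measures audience size.

```r
# The six rows shown in the head() output
preview <- data.frame(
  title = c("Five Came Back: The Reference Films", "Taxi Driver",
            "Monty Python and the Holy Grail", "Life of Brian",
            "The Exorcist", "Monty Python's Flying Circus"),
  imdb_votes      = c(NA, 795222, 530877, 392419, 391942, 72895),
  tmdb_popularity = c(0.600, 27.612, 18.216, 17.505, 95.337, 12.919)
)

# Hypothetical composite: mean of the percentile ranks of both signals.
# Equal weights are an arbitrary assumption, not a documented method.
composite_popularity <- function(df) {
  # Percentile rank in (0, 1]; missing values stay missing
  pct_rank <- function(x) rank(x, na.last = "keep") / sum(!is.na(x))
  ranks <- cbind(
    votes = pct_rank(df$imdb_votes),
    tmdb  = pct_rank(df$tmdb_popularity)
  )
  # Use whichever signals are available for each row
  rowMeans(ranks, na.rm = TRUE)
}

preview$composite <- composite_popularity(preview)
print(preview[order(-preview$composite), c("title", "composite")])
```

Ranks are used instead of raw values because the two columns sit on very different scales. A row with both inputs missing would get NaN and would need separate handling.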

This process highlights the importance of data documentation, as misinterpreting even one column can lead to incorrect conclusions.