I’ll first load the Netflix dataset and preview it to understand its structure.
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Preview the dataset
head(netflix_data)
## id title type
## 1 ts300399 Five Came Back: The Reference Films SHOW
## 2 tm84618 Taxi Driver MOVIE
## 3 tm127384 Monty Python and the Holy Grail MOVIE
## 4 tm70993 Life of Brian MOVIE
## 5 tm190788 The Exorcist MOVIE
## 6 ts22164 Monty Python's Flying Circus SHOW
## description
## 1 This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## release_year age_certification runtime genres
## 1 1945 TV-MA 48 ['documentation']
## 2 1976 R 113 ['crime', 'drama']
## 3 1975 PG 91 ['comedy', 'fantasy']
## 4 1979 R 94 ['comedy']
## 5 1973 R 133 ['horror']
## 6 1969 TV-14 30 ['comedy', 'european']
## production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity
## 1 ['US'] 1 NA NA 0.600
## 2 ['US'] NA tt0075314 8.3 795222 27.612
## 3 ['GB'] NA tt0071853 8.2 530877 18.216
## 4 ['GB'] NA tt0079470 8.0 392419 17.505
## 5 ['US'] NA tt0070047 8.1 391942 95.337
## 6 ['GB'] 4 tt0063929 8.8 72895 12.919
## tmdb_score
## 1 NA
## 2 8.2
## 3 7.8
## 4 7.8
## 5 7.7
## 6 8.3
# Check the structure of the data
str(netflix_data)
## 'data.frame': 5806 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm127384" "tm70993" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
## $ release_year : int 1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
## $ age_certification : chr "TV-MA" "R" "PG" "R" ...
## $ runtime : int 48 113 91 94 133 30 102 170 104 110 ...
## $ genres : chr "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
## $ production_countries: chr "['US']" "['US']" "['GB']" "['GB']" ...
## $ seasons : num 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0071853" "tt0079470" ...
## $ imdb_score : num NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
## $ imdb_votes : num NA 795222 530877 392419 391942 ...
## $ tmdb_popularity : num 0.6 27.6 18.2 17.5 95.3 ...
## $ tmdb_score : num NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...
Upon inspecting the dataset, I noticed that several columns are hard to interpret without referencing the documentation:
imdb_id: This column holds each title’s unique identifier on IMDb. Without documentation, it would be unclear whether the ID refers to a cast member, an episode, or a title.
age_certification: Values like “TV-MA” or “PG” are content ratings based on age suitability. The column mixes TV ratings (TV-MA, TV-14) with film ratings (R, PG), and rating systems can vary by country.
genres: Each value is a list of genres stored as a single string (e.g., “['comedy', 'drama']”). Without documentation, it’s unclear whether a row holds one genre or several, and the string has to be parsed before genres can be analyzed individually.
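Because genres arrives as a string rather than a real list, it needs parsing before any per-genre analysis. The following is a minimal sketch, assuming every value follows the bracketed, single-quoted pattern seen in the preview; parse_genres is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: turn "['crime', 'drama']" into c("crime", "drama").
# Assumes the bracketed, single-quoted format shown in the preview.
parse_genres <- function(x) {
  # Strip the square brackets and single quotes
  cleaned <- gsub("\\[|\\]|'", "", x)
  # Split on commas, then trim whitespace and drop empty pieces
  parts <- strsplit(cleaned, ",", fixed = TRUE)
  lapply(parts, function(p) {
    p <- trimws(p)
    p[nzchar(p)]
  })
}

# Sample values copied from the preview above
sample_genres <- c("['documentation']", "['crime', 'drama']", "['comedy', 'european']")
genre_list <- parse_genres(sample_genres)

# Number of genres per title
lengths(genre_list)
```

On the full dataset, the same helper could be applied as parse_genres(netflix_data$genres), provided the format assumption holds for every row.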
Why Encoded This Way?
The data might be encoded this way to keep things compact and to rely on standardized identifiers and codes like imdb_id or age_certification. Without reading the documentation, I might assume these are just descriptive values or misinterpret them, affecting the analysis.
One column that remains unclear even after reading the documentation is tmdb_popularity. While the documentation explains that this refers to popularity on TMDB, it doesn’t specify the units or calculation method behind it.
Is tmdb_popularity based on user ratings, views, or some other metric?
Does a higher value indicate more recent popularity, or is it cumulative over time?
To explore the unclear nature of tmdb_popularity, I’ll create a scatter plot comparing IMDb scores to TMDB popularity and highlight any odd trends or outliers.
# Load the plotly library
# install.packages("plotly", repos = "https://cran.rstudio.com/")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Scatter plot for IMDb Score vs TMDB Popularity using ggplot2
p <- ggplot(netflix_data, aes(x = imdb_score, y = tmdb_popularity)) +
  # Points with gradient color based on IMDb score; the `text` aesthetic is
  # not drawn by ggplot2 but supplies the tooltip content for ggplotly()
  geom_point(aes(color = imdb_score,
                 text = paste("IMDb Score:", imdb_score, "<br>TMDB Popularity:", tmdb_popularity)),
             alpha = 0.7, size = 3) +
  # Set color gradient from blue (low IMDb score) to orange (high IMDb score)
  scale_color_gradient(low = "blue", high = "orange") +
  # Labels for title and axis
  labs(title = "Relationship Between IMDb Score and TMDB Popularity",
       x = "IMDb Score",
       y = "TMDB Popularity",
       color = "IMDb Score") +
  # Add linear trend line (linewidth replaces the deprecated size argument for lines)
  geom_smooth(method = "lm", color = "blue", se = FALSE, linewidth = 1) +
  # Annotation text
  annotate("text", x = 9, y = 225, label = "Unclear how popularity is measured",
           color = "blue", size = 4, hjust = 0, fontface = "italic") +
  # Use minimal theme for clean design
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "right"
  )
## Warning in geom_point(aes(color = imdb_score, text = paste("IMDb Score:", :
## Ignoring unknown aesthetics: text
# Convert ggplot to a dynamic plotly plot
ggplotly(p, tooltip = c("text"), width = 1600, height = 800)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 607 rows containing non-finite outside the scale range
## (`stat_smooth()`).
Explanation of Visualization:
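The scatter plot shows IMDb scores against TMDB popularity, with a linear trend line fitted after 607 rows with missing values were dropped. While the two appear somewhat related, it’s unclear why certain high-rated titles have low popularity and vice versa. This might be because the documentation never defines how TMDB calculates popularity, so the metric may capture something other than quality.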
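To put a number on the relationship rather than judging it by eye, a rank correlation can be computed. The sketch below uses only the six rows shown in the preview, so its result says nothing about the full dataset; the choice of Spearman correlation is an assumption made here because it is less sensitive to extreme values than Pearson correlation.

```r
# The six rows from the preview above (row 1 has no IMDb score)
preview <- data.frame(
  imdb_score      = c(NA, 8.3, 8.2, 8.0, 8.1, 8.8),
  tmdb_popularity = c(0.600, 27.612, 18.216, 17.505, 95.337, 12.919)
)

# Keep only rows where both values are present
complete <- preview[complete.cases(preview), ]

# Rank-based correlation, chosen in case popularity is heavily skewed
rho <- cor(complete$imdb_score, complete$tmdb_popularity, method = "spearman")
rho
```

On the full data, the equivalent call would be cor(netflix_data$imdb_score, netflix_data$tmdb_popularity, method = "spearman", use = "complete.obs").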
Significant Risk:
Relying on a metric like tmdb_popularity without fully understanding its definition could lead to skewed conclusions about content performance. For example, I might falsely assume that content with a high tmdb_popularity value is more popular overall, when in reality, the value could be outdated or region-specific.
Mitigation:
I would either:
Investigate further by contacting TMDB or reviewing additional documentation to understand how popularity is calculated.
Exclude or de-emphasize tmdb_popularity in my analysis until its meaning is clearer.
Further Investigation:
Are there other metrics, like runtime or genres, that behave unexpectedly in the dataset?
Can I infer a more accurate metric for popularity using IMDb votes or ratings alongside TMDB scores?
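One possible way to combine votes and ratings is a vote-weighted rating that pulls scores with few votes toward the overall mean. The sketch below is only an illustration of that idea: the formula, the weighted_rating helper, and the min_votes threshold are all assumptions made here, not something defined in the dataset documentation.

```r
# Rows 2-6 from the preview above (row 1 has no IMDb data)
preview <- data.frame(
  title      = c("Taxi Driver", "Monty Python and the Holy Grail", "Life of Brian",
                 "The Exorcist", "Monty Python's Flying Circus"),
  imdb_score = c(8.3, 8.2, 8.0, 8.1, 8.8),
  imdb_votes = c(795222, 530877, 392419, 391942, 72895)
)

# Hypothetical helper: shrink each score toward the mean score,
# trusting titles with more votes more. `min_votes` is an arbitrary
# illustrative threshold, not a documented value.
weighted_rating <- function(score, votes, min_votes = 100000) {
  mean_score <- mean(score, na.rm = TRUE)
  (votes / (votes + min_votes)) * score +
    (min_votes / (votes + min_votes)) * mean_score
}

preview$weighted <- weighted_rating(preview$imdb_score, preview$imdb_votes)

# Titles ordered by the weighted rating, highest first
preview[order(-preview$weighted), c("title", "imdb_score", "imdb_votes", "weighted")]
```

Whether such a measure is actually "more accurate" than tmdb_popularity would still need checking against an independent signal of popularity.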
This process highlights the importance of data documentation, as misinterpreting even one column can lead to incorrect conclusions.