My article comes from Stephen Follows Film Data and Education. Titled “Is a Film’s Length a Sign of Its Quality?”, it begins by discussing whether one can know which movies to avoid based on: 1. Embargoed reviews, 2. Production reviews, 3. Poster chicanery, 4. Interviews, and 5. Sub-90 minute running time. The article focuses on the correlation between running time and quality (IMDB score). The article claims that a sub-90 minute movie is more likely to receive higher ratings. I chose this topic because I feel that movies longer than 90 minutes are better than shorter ones. I wish to explore the validity of this claim and provide context for possible external factors.
My dataset represents 584 Netflix originals. Key variables include: 1. Title, 2. Genre, 3. Premiere, 4. Runtime, 5. IMDB.Score, and 6. Language. To clarify, Runtime is given in minutes, IMDB.Score is the rating taken from the Internet Movie Database, and Premiere is the date on which the movie premiered.
My dataset was retrieved from Kaggle and contains ratings for 584 movies.
netflix <- read.csv("NetflixOriginals.csv") # read in data
library(knitr)
library(kableExtra)
library(ggplot2)
# kableExtra will allow for knitting tables
kable(head(netflix), booktabs = TRUE) %>% kable_styling(font_size = 13)
| Title | Genre | Premiere | Runtime | IMDB.Score | Language |
|---|---|---|---|---|---|
| Enter the Anime | Documentary | August 5, 2019 | 58 | 2.5 | English/Japanese |
| Dark Forces | Thriller | August 21, 2020 | 81 | 2.6 | Spanish |
| The App | Science fiction/Drama | December 26, 2019 | 79 | 2.6 | Italian |
| The Open House | Horror thriller | January 19, 2018 | 94 | 3.2 | English |
| Kaali Khuhi | Mystery | October 30, 2020 | 90 | 3.4 | Hindi |
| Drive | Action | November 1, 2019 | 147 | 3.5 | Hindi |
kable(summary(netflix))%>% kable_styling(font_size = 13)
| Title | Genre | Premiere | Runtime | IMDB.Score | Language |
|---|---|---|---|---|---|
| Length:584 | Length:584 | Length:584 | Min. : 4.00 | Min. :2.500 | Length:584 |
| Class :character | Class :character | Class :character | 1st Qu.: 86.00 | 1st Qu.:5.700 | Class :character |
| Mode :character | Mode :character | Mode :character | Median : 97.00 | Median :6.350 | Mode :character |
| NA | NA | NA | Mean : 93.58 | Mean :6.272 | NA |
| NA | NA | NA | 3rd Qu.:108.00 | 3rd Qu.:7.000 | NA |
| NA | NA | NA | Max. :209.00 | Max. :9.000 | NA |
#Checking the type of each variable
typeof(netflix$Title)
[1] "character"
typeof(netflix$Genre)
[1] "character"
typeof(netflix$Premiere)
[1] "character"
typeof(netflix$Runtime)
[1] "integer"
typeof(netflix$IMDB.Score)
[1] "double"
typeof(netflix$Language)
[1] "character"
The above data types are correct and as expected. It is important to note the categorical variables versus the continuous variables prior to analysis. Specifically:
kable(head(unique(netflix$Title)), col.names = "584 Unique Movie Titles")%>% kable_styling(font_size = 13)
| 584 Unique Movie Titles |
|---|
| Enter the Anime |
| Dark Forces |
| The App |
| The Open House |
| Kaali Khuhi |
| Drive |
kable(head(unique(netflix$Genre)), col.names = "115 Unique Movie Genres")%>% kable_styling(font_size = 13)
| 115 Unique Movie Genres |
|---|
| Documentary |
| Thriller |
| Science fiction/Drama |
| Horror thriller |
| Mystery |
| Action |
kable(head(unique(netflix$Language)), col.names = "38 Unique Language Combinations")%>% kable_styling(font_size = 13)
| 38 Unique Language Combinations |
|---|
| English/Japanese |
| Spanish |
| Italian |
| English |
| Hindi |
| Turkish |
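The counts in the table headers above (584 titles, 115 genres, 38 language combinations) are typed in by hand. A minimal sketch of how they could be computed instead is shown here. The data frame `demo` is a stand-in built only from the six rows printed by `head(netflix)`, so its counts differ from the full dataset; `unique_counts` and `genre_header` are illustrative names, not part of the original analysis.

```r
# Stand-in for `netflix`: only the six rows shown by head(netflix) above.
demo <- data.frame(
  Title    = c("Enter the Anime", "Dark Forces", "The App",
               "The Open House", "Kaali Khuhi", "Drive"),
  Genre    = c("Documentary", "Thriller", "Science fiction/Drama",
               "Horror thriller", "Mystery", "Action"),
  Language = c("English/Japanese", "Spanish", "Italian",
               "English", "Hindi", "Hindi")
)

# Number of distinct values in each categorical column.
unique_counts <- sapply(demo, function(column) length(unique(column)))
print(unique_counts)

# Build a table header from the computed count rather than a typed number.
genre_header <- paste(unique_counts[["Genre"]], "Unique Movie Genres")
print(genre_header)
```

On the full data, the same `sapply()` call could be applied to `netflix[c("Title", "Genre", "Language")]`, and the resulting string passed to `col.names`.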
# Omits missing values #
netflix <- na.omit(netflix)
# Making a scatterplot
library(car)
Loading required package: carData
runtime <- netflix$Runtime
IMDB_score <- netflix$IMDB.Score
car_scatter <- scatterplot(x=runtime,
y=IMDB_score,
xlab = "Runtime",
ylab = "IMDB Score",
main = "Runtime vs. IMDB Score",
ellipse = list(levels=c(.5, .95), robust = TRUE, fill=FALSE),
smooth = FALSE,
regLine = TRUE, #Create a regression line
legend = TRUE,
col = "red")
The marginal boxplots show a median Runtime of 97 minutes and a median IMDB score of about 6.4. There are outliers, particularly in Runtime: movies shorter than about 50 minutes or longer than about 140 minutes are of unusual length. The two ellipses mark the regions expected to contain roughly 50% and 95% of the data, so the inner ellipse outlines where the data are most concentrated. In the context of this data, most movies run about 100 minutes and score around 6.4.
This graph may support the argument presented by Stephen. Recall that the article claims movies less than 90 minutes long receive higher IMDB scores. However, many outliers shorter than 90 minutes exist, which may skew the article’s interpretation. Many factors other than runtime also affect an IMDB score, and Stephen notes that these may be difficult to measure. The analysis below therefore looks further into runtime and into the external factors at play.
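The regression line in the scatterplot can also be summarised numerically. The sketch below computes a Pearson correlation and a simple linear regression of score on runtime. It is illustrative only: the two vectors hold just the twelve rows printed in this report (the six lowest-rated titles and six titles rated 7.1), so the numbers it produces say nothing about the full dataset. The names `runtime_cor`, `fit`, and `slope` are introduced here for illustration.

```r
# Twelve (Runtime, IMDB.Score) pairs printed in the tables of this report.
# This is NOT a representative sample of the 584 movies.
runtime    <- c(58, 81, 79, 94, 90, 147, 36, 30, 47, 110, 92, 80)
imdb_score <- c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 7.1, 7.1, 7.1, 7.1, 7.1, 7.1)

# Pearson correlation: strength and direction of the linear relationship.
runtime_cor <- cor(runtime, imdb_score)

# Simple linear regression; the slope is the change in score per extra minute.
fit   <- lm(imdb_score ~ runtime)
slope <- unname(coef(fit)["runtime"])

cat("correlation:", round(runtime_cor, 3), "\n")
cat("slope (score per minute):", round(slope, 4), "\n")
```

On the full data, `cor.test(netflix$Runtime, netflix$IMDB.Score)` would additionally report a confidence interval and p-value for the correlation.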
An IMDB score above 7.0 is commonly taken to indicate a good movie. If the article’s claim holds, movies shorter than 90 minutes should be well represented among those scoring above 7.0. For further analysis, the data will be subsetted to movies with an IMDB score strictly greater than 7.
newnetflix <- subset(netflix, netflix$IMDB.Score>7) # subset of netflix data
kable(head(newnetflix), booktabs = TRUE) %>% kable_styling(font_size = 13) #head
|  | Title | Genre | Premiere | Runtime | IMDB.Score | Language |
|---|---|---|---|---|---|---|
| 452 | 13th: A Conversation with Oprah Winfrey & Ava DuVernay | Aftershow / Interview | January 26, 2017 | 36 | 7.1 | English |
| 453 | Angela’s Christmas | Animation | November 30, 2018 | 30 | 7.1 | English |
| 454 | Angela’s Christmas Wish | Animation | December 1, 2020 | 47 | 7.1 | English |
| 455 | Beats | Drama | June 19, 2019 | 110 | 7.1 | English |
| 456 | Circus of Books | Documentary | April 22, 2020 | 92 | 7.1 | English |
| 457 | Dance Dreams: Hot Chocolate Nutcracker | Documentary | November 27, 2020 | 80 | 7.1 | English |
kable(summary(newnetflix), booktabs = TRUE) %>% kable_styling(font_size = 13) #summary
| Title | Genre | Premiere | Runtime | IMDB.Score | Language |
|---|---|---|---|---|---|
| Length:133 | Length:133 | Length:133 | Min. : 11.00 | Min. :7.100 | Length:133 |
| Class :character | Class :character | Class :character | 1st Qu.: 79.00 | 1st Qu.:7.200 | Class :character |
| Mode :character | Mode :character | Mode :character | Median : 97.00 | Median :7.300 | Mode :character |
| NA | NA | NA | Mean : 91.89 | Mean :7.471 | NA |
| NA | NA | NA | 3rd Qu.:112.00 | 3rd Qu.:7.600 | NA |
| NA | NA | NA | Max. :209.00 | Max. :9.000 | NA |
#Plotting Runtime with subsetted data#
ggplot(newnetflix) +
geom_boxplot(aes(x = Runtime)) +
labs(title = "Runtime of High Ranking Movies")
The subsetted data show a median Runtime of 97 minutes with a mean of 91.89 minutes. In other words, the typical highly rated movie (IMDB score above 7.0) runs about 97 minutes, the same median runtime as the full dataset.
This provides an answer to Question 1: “Is user satisfaction greater when a movie is less than 90 minutes long?” The answer appears to be no. Highly rated movies in this dataset center on a runtime of 97 minutes, not on runtimes below 90 minutes.
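A more direct check of the 90-minute claim is to compare the share of highly rated movies on each side of the cutoff. The sketch below does this with `tapply()`. It again uses only the twelve rows printed in this report, so the shares are purely illustrative; `length_group` and `share_high` are names introduced for this example.

```r
# Twelve (Runtime, IMDB.Score) pairs printed in the tables of this report.
runtime    <- c(58, 81, 79, 94, 90, 147, 36, 30, 47, 110, 92, 80)
imdb_score <- c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 7.1, 7.1, 7.1, 7.1, 7.1, 7.1)

# Split movies at the article's 90-minute cutoff.
length_group <- ifelse(runtime < 90, "under_90", "90_plus")

# TRUE when a movie clears the "good movie" threshold used in this report.
is_high <- imdb_score > 7

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of highly rated movies within each group.
share_high <- tapply(is_high, length_group, mean)
print(round(share_high, 3))
```

On the full data, the same three steps could be run with `netflix$Runtime` and `netflix$IMDB.Score`; a formal comparison of the two shares would need a test such as `prop.test()`.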
For Question 2, the plots below explore factors that may help explain the 97-minute runtime.
ggplot(netflix, aes(x = Genre, y=IMDB.Score, col = IMDB.Score>7)) +
geom_point(size = 3) +
labs(x = "Genre", y = "IMDB Score", title = "Genre vs. IMDB Score") +
scale_x_discrete(position="top") + #Moving x labels to top for easier viewing
theme(axis.text.x = element_text(angle = 270, vjust = 0.5, hjust=1)) #Rotation of x labels by 270
In the above graph, blue marks an IMDB score greater than 7. The purpose of this visual is to show which genres are most frequently rated above 7 on IMDB. The genre with the most movies rated above 7.0 is Documentary, followed by Drama.
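Reading counts off a scatterplot with over a hundred genre labels is error-prone, so a frequency table can back up the visual. The sketch below counts highly rated movies per genre. It uses only the twelve rows printed in this report, so the ranking it prints does not reflect the full dataset; `high_genres` and `genre_counts` are illustrative names.

```r
# Genre and score for the twelve rows printed in the tables of this report.
genre <- c("Documentary", "Thriller", "Science fiction/Drama",
           "Horror thriller", "Mystery", "Action",
           "Aftershow / Interview", "Animation", "Animation",
           "Drama", "Documentary", "Documentary")
imdb_score <- c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 7.1, 7.1, 7.1, 7.1, 7.1, 7.1)

# Keep the genres of movies rated above 7, then count and rank them.
high_genres  <- genre[imdb_score > 7]
genre_counts <- sort(table(high_genres), decreasing = TRUE)
print(genre_counts)
```

On the full data, `sort(table(newnetflix$Genre), decreasing = TRUE)` would give the same kind of ranking for all 133 highly rated movies.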
ggplot(netflix, aes(x = Runtime)) +
geom_histogram(aes(y = after_stat(density)), fill = "white", colour = "black", bins = 50) + # after_stat() replaces the deprecated ..density.. notation
labs(x = "Runtime (mins)", y = "Density", title = "Review of Movie by Runtime (mins)") +
geom_density(col = "red") +
stat_function(fun = dnorm, n = 10000, col = "blue", args = list(mean = mean(netflix$Runtime), sd = sd(netflix$Runtime))) +
scale_y_continuous(breaks = NULL)
The above histogram overlays a normal distribution curve (dnorm, in blue) and a density curve (in red). Runtime departs from the expected normal distribution: the mean (93.58 minutes) falls below the median (97 minutes), which is consistent with a slight left skew caused by a tail of very short titles (the minimum is 4 minutes). Very long movies are rare as well, which makes sense in the real world, as far fewer 200-minute movies exist.
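Skew can also be quantified rather than judged by eye. The sketch below defines a simple moment-based skewness coefficient (third central moment divided by the cubed standard deviation) and compares the mean with the median. The helper `sample_skewness()` is written for this example and is not from any package; the runtimes are the twelve values printed in this report, so the result is illustrative only.

```r
# Moment-based skewness: negative values indicate a longer left tail,
# positive values a longer right tail, and 0 a symmetric sample.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Runtimes (minutes) of the twelve rows printed in the tables of this report.
runtime <- c(58, 81, 79, 94, 90, 147, 36, 30, 47, 110, 92, 80)

runtime_skew      <- sample_skewness(runtime)
mean_minus_median <- mean(runtime) - median(runtime)

cat("skewness:", round(runtime_skew, 3), "\n")
cat("mean - median:", round(mean_minus_median, 2), "\n")
```

On the full data, `sample_skewness(netflix$Runtime)` would give a single number to set alongside the histogram.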
The typical runtime is worth noting because the article uses 90 minutes as its cutoff. Perhaps the article interpolated its data, or drew on a much larger sample, to arrive at its claim.
library(gridExtra)
p1 <- ggplot(netflix) +
geom_bar(aes(x = IMDB.Score, fill = Language)) +
labs(x = "IMDB Score", y = "Count of Movies", title = "IMDB Score by Language") +
theme(legend.position = "bottom")
p1
The above plot shows the count of movies at each IMDB score, broken down by language. English is the most common language among the highest rated movies (IMDB Score > 7).
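Raw counts favour whichever language has the most titles, so it can help to look at the share of highly rated movies within each language. The sketch below builds a two-way table and converts it to row proportions with `prop.table()`. It uses only the twelve rows printed in this report and is therefore illustrative; `counts` and `shares` are names introduced here.

```r
# Language and score for the twelve rows printed in the tables of this report.
language <- c("English/Japanese", "Spanish", "Italian", "English",
              "Hindi", "Hindi", "English", "English", "English",
              "English", "English", "English")
imdb_score <- c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 7.1, 7.1, 7.1, 7.1, 7.1, 7.1)

# Cross-tabulate language against whether the movie is rated above 7.
is_high <- imdb_score > 7
counts  <- table(language, is_high)

# margin = 1 normalises each row, giving shares within each language.
shares <- prop.table(counts, margin = 1)
print(round(shares, 3))
```

On the full data, `prop.table(table(netflix$Language, netflix$IMDB.Score > 7), margin = 1)` would show whether English titles are rated highly more often, or are simply more numerous.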
In this dataset, highly rated movies center on a runtime of 97 minutes rather than on runtimes below 90 minutes. This contradicts the central claim of the chosen article. Recall the article’s claim that runtime impacts IMDB score. Although runtime may play a role, we have identified other possible confounding variables to take into account, including language and genre.
Stephen Follows Film Data and Education argues solely on the basis of runtime, on the grounds that categorical data cannot be fully analyzed against IMDB score.
It is possible that each of these variables contributes to the rating. Specifically, we observed higher IMDB scores in the Documentary and Drama genres. We also observed higher IMDB scores for movies in English.