Explanation of Article

My article is from Stephen Follows Film Data and Education. Titled "Is a Film’s Length a Sign of Its Quality", it begins by discussing the possibility of knowing which movies to avoid based on: 1. Embargoed reviews, 2. Production reviews, 3. Poster chicanery, 4. Interviews, and 5. Sub-90 minute running time. The article focuses on the correlation between running time and quality (IMDB score). It claims that a sub-90 minute movie is more likely to receive higher ratings. I chose this topic because I feel that movies longer than 90 minutes are better than shorter ones. I wish to test the validity of the article's claim and provide context on external factors that may also be involved.

Explanation of Data

My dataset represents 584 Netflix originals. Key variables include: 1. Title, 2. Genre, 3. Premiere, 4. Runtime, 5. IMDB.Score, and 6. Language. To clarify, Runtime is given in minutes, IMDB.Score is the movie's rating from the Internet Movie Database, and Premiere is the date on which the movie premiered.

My dataset was retrieved from Kaggle and contains ratings for 584 movies.

Question

  • Is user satisfaction greater when a movie is less than 90 minutes long?
  • What possible variables influence the success of a movie?

Initial Exploration of Data

netflix <- read.csv("NetflixOriginals.csv") # read in data
library(knitr) 
library(kableExtra)
library(ggplot2)
# kableExtra will allow for knitting tables 
kable(head(netflix), booktabs = TRUE) %>% kable_styling(font_size = 13)
| Title | Genre | Premiere | Runtime | IMDB.Score | Language |
| Enter the Anime | Documentary | August 5, 2019 | 58 | 2.5 | English/Japanese |
| Dark Forces | Thriller | August 21, 2020 | 81 | 2.6 | Spanish |
| The App | Science fiction/Drama | December 26, 2019 | 79 | 2.6 | Italian |
| The Open House | Horror thriller | January 19, 2018 | 94 | 3.2 | English |
| Kaali Khuhi | Mystery | October 30, 2020 | 90 | 3.4 | Hindi |
| Drive | Action | November 1, 2019 | 147 | 3.5 | Hindi |
kable(summary(netflix))%>% kable_styling(font_size = 13)
| Title | Genre | Premiere | Runtime | IMDB.Score | Language |
| Length:584 | Length:584 | Length:584 | Min. : 4.00 | Min. :2.500 | Length:584 |
| Class :character | Class :character | Class :character | 1st Qu.: 86.00 | 1st Qu.:5.700 | Class :character |
| Mode :character | Mode :character | Mode :character | Median : 97.00 | Median :6.350 | Mode :character |
| NA | NA | NA | Mean : 93.58 | Mean :6.272 | NA |
| NA | NA | NA | 3rd Qu.:108.00 | 3rd Qu.:7.000 | NA |
| NA | NA | NA | Max. :209.00 | Max. :9.000 | NA |

Data Validation

Data Types

#Checking the type of each variable
typeof(netflix$Title)
[1] "character"
typeof(netflix$Genre)
[1] "character"
typeof(netflix$Premiere)
[1] "character"
typeof(netflix$Runtime)
[1] "integer"
typeof(netflix$IMDB.Score)
[1] "double"
typeof(netflix$Language)
[1] "character"

The above data types are correct and as expected. It is important to note the categorical variables versus the continuous variables prior to analysis. Specifically:

  • Continuous: Runtime, IMDB.Score
  • Categorical: Title, Genre, Premiere, Language
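
The six separate typeof() calls can be collapsed into a single sapply() call, and Premiere, although stored as character, can be parsed into a Date if it is needed later. The following is a minimal sketch, not part of the original analysis. It assumes the three rows below (copied from the head() output above) stand in for the full netflix data frame, and it assumes an English locale so that %B matches the month names. The name sample_netflix is introduced here for illustration only.

```r
# Illustrative rows copied from the head() output above (not the full dataset)
sample_netflix <- data.frame(
  Title = c("Enter the Anime", "Dark Forces", "The App"),
  Premiere = c("August 5, 2019", "August 21, 2020", "December 26, 2019"),
  Runtime = c(58L, 81L, 79L),
  IMDB.Score = c(2.5, 2.6, 2.6),
  stringsAsFactors = FALSE
)

# One call reports the storage type of every column
column_types <- sapply(sample_netflix, typeof)
print(column_types)

# Parse Premiere into a Date; strings that do not match the format become NA
# (assumes an English locale for month names)
premiere_date <- as.Date(sample_netflix$Premiere, format = "%B %d, %Y")
print(premiere_date)
```

On the full data, counting sum(is.na(premiere_date)) after parsing would reveal any Premiere values written in a different format.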

Duplicates

kable(head(unique(netflix$Title)), col.names = "584 Unique Movie Titles")%>% kable_styling(font_size = 13)
584 Unique Movie Titles
Enter the Anime
Dark Forces
The App
The Open House
Kaali Khuhi
Drive
kable(head(unique(netflix$Genre)), col.names = "115 Unique Movie Genres")%>% kable_styling(font_size = 13)
115 Unique Movie Genres
Documentary
Thriller
Science fiction/Drama
Horror thriller
Mystery
Action
kable(head(unique(netflix$Language)), col.names = "38 Unique Language Combinations")%>% kable_styling(font_size = 13)
38 Unique Language Combinations
English/Japanese
Spanish
Italian
English
Hindi
Turkish
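
The tables above list the first few unique values, with the counts typed into the column names by hand. A direct duplicate check can be sketched as follows. This is an illustration under assumptions, not the original analysis: the six rows are copied from the head() output above, and sample_netflix is a hypothetical stand-in for the full netflix data frame.

```r
# Illustrative rows copied from the head() output above (not the full dataset)
sample_netflix <- data.frame(
  Title = c("Enter the Anime", "Dark Forces", "The App",
            "The Open House", "Kaali Khuhi", "Drive"),
  Genre = c("Documentary", "Thriller", "Science fiction/Drama",
            "Horror thriller", "Mystery", "Action"),
  Language = c("English/Japanese", "Spanish", "Italian",
               "English", "Hindi", "Hindi"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column, computed rather than typed by hand
n_unique <- sapply(sample_netflix, function(col) length(unique(col)))
print(n_unique)

# Rows whose Title repeats an earlier row; zero rows means no duplicate titles
duplicate_titles <- sample_netflix[duplicated(sample_netflix$Title), ]
print(nrow(duplicate_titles))

# The computed count can feed the table caption instead of a hard-coded number
caption <- paste(n_unique[["Title"]], "Unique Movie Titles")
print(caption)
```

If the number of unique titles equals the number of rows, as the 584 in the summary table suggests for the full data, no title appears twice.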

Missing Values

# Omits missing values #
netflix <- na.omit(netflix)
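
Before dropping rows, it can help to count how many values are actually missing, since na.omit() removes rows silently and does not catch empty strings in character columns. A minimal sketch, assuming the rows below (copied from the head() output above) stand in for the full data; sample_netflix and the other variable names are introduced here for illustration.

```r
# Illustrative rows copied from the head() output above (not the full dataset)
sample_netflix <- data.frame(
  Title = c("Enter the Anime", "Dark Forces", "The App"),
  Runtime = c(58L, 81L, 79L),
  IMDB.Score = c(2.5, 2.6, 2.6),
  Language = c("English/Japanese", "Spanish", "Italian"),
  stringsAsFactors = FALSE
)

# NA count per column
missing_per_column <- colSums(is.na(sample_netflix))
print(missing_per_column)

# Empty or whitespace-only strings per character column (na.omit ignores these)
blank_per_column <- sapply(sample_netflix, function(col) {
  if (is.character(col)) sum(trimws(col) == "", na.rm = TRUE) else 0L
})
print(blank_per_column)

# Report how many rows na.omit() actually removes
cleaned <- na.omit(sample_netflix)
rows_dropped <- nrow(sample_netflix) - nrow(cleaned)
print(rows_dropped)
```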

Plots

Scatterplot 1

# Making a scatterplot
library(car)
Loading required package: carData
runtime <- netflix$Runtime
IMDB_score <- netflix$IMDB.Score

car_scatter <- scatterplot(x=runtime, 
                           y=IMDB_score, 
                           xlab = "Runtime",
                           ylab = "IMDB Score",
                           main = "IMDB Score",
                           ellipse = list(levels=c(.5, .95), robust = TRUE, fill=FALSE),
                           smooth = FALSE,
                           regLine = TRUE, #Create a regression line
                           legend = TRUE,
                           col = "red")

Notes

The marginal boxplots show a median Runtime of 97 minutes and a median IMDB score of about 6.35. There are outliers, particularly in Runtime: movies shorter than roughly 50 minutes or longer than roughly 140 minutes are unusually short or long for this dataset. The two ellipses mark the regions containing about 50% and 95% of the data. The inner ellipse shows where the data are most concentrated: most movies run about 100 minutes and score about 6.4.

This graph may support the argument presented by Stephen. Recall that the article claims movies less than 90 minutes long will receive higher IMDB scores. It is worth noting that many outliers shorter than 90 minutes exist, which may skew the article’s interpretation. Many factors other than runtime also affect an IMDB score, and Stephen notes that these may be difficult to measure. The analysis below looks more closely at runtime and at these external factors.
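
The direction and strength of the relationship in the scatterplot can also be put into numbers with a correlation coefficient and the slope of the regression line. The sketch below is an illustration under assumptions, not the original analysis. The twelve rows are copied from the two head() tables in this report (the six lowest-scoring rows and six rows scoring 7.1), so they are not a representative sample and the resulting numbers say nothing about the full dataset. The name sample_netflix is hypothetical.

```r
# Illustrative, non-representative rows copied from the head() tables in this report
sample_netflix <- data.frame(
  Runtime = c(58, 81, 79, 94, 90, 147, 36, 30, 47, 110, 92, 80),
  IMDB.Score = c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 7.1, 7.1, 7.1, 7.1, 7.1, 7.1)
)

# Pearson correlation between runtime and score (between -1 and 1)
runtime_score_cor <- cor(sample_netflix$Runtime, sample_netflix$IMDB.Score)
print(runtime_score_cor)

# Simple linear regression: the same line that regLine = TRUE draws
fit <- lm(IMDB.Score ~ Runtime, data = sample_netflix)

# Slope: expected change in IMDB score for each additional minute of runtime
slope <- coef(fit)[["Runtime"]]
print(slope)
```

Replacing sample_netflix with the full netflix data frame gives the slope of the line drawn in the scatterplot. A positive slope would mean longer movies tend to score higher, and a negative slope the opposite.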

Boxplot

An IMDB score above 7.0 is treated here as the mark of a good movie. If the article's claim holds, movies shorter than 90 minutes should be well represented among those scoring above 7.0. For further analysis, the data are subsetted to movies with an IMDB score greater than 7.

newnetflix <- subset(netflix, IMDB.Score > 7) # keep only movies scoring above 7
kable(head(newnetflix), booktabs = TRUE) %>% kable_styling(font_size = 13) #head
| | Title | Genre | Premiere | Runtime | IMDB.Score | Language |
| 452 | 13th: A Conversation with Oprah Winfrey & Ava DuVernay | Aftershow / Interview | January 26, 2017 | 36 | 7.1 | English |
| 453 | Angela’s Christmas | Animation | November 30, 2018 | 30 | 7.1 | English |
| 454 | Angela’s Christmas Wish | Animation | December 1, 2020 | 47 | 7.1 | English |
| 455 | Beats | Drama | June 19, 2019 | 110 | 7.1 | English |
| 456 | Circus of Books | Documentary | April 22, 2020 | 92 | 7.1 | English |
| 457 | Dance Dreams: Hot Chocolate Nutcracker | Documentary | November 27, 2020 | 80 | 7.1 | English |
kable(summary(newnetflix), booktabs = TRUE) %>% kable_styling(font_size = 13) #summary
| Title | Genre | Premiere | Runtime | IMDB.Score | Language |
| Length:133 | Length:133 | Length:133 | Min. : 11.00 | Min. :7.100 | Length:133 |
| Class :character | Class :character | Class :character | 1st Qu.: 79.00 | 1st Qu.:7.200 | Class :character |
| Mode :character | Mode :character | Mode :character | Median : 97.00 | Median :7.300 | Mode :character |
| NA | NA | NA | Mean : 91.89 | Mean :7.471 | NA |
| NA | NA | NA | 3rd Qu.:112.00 | 3rd Qu.:7.600 | NA |
| NA | NA | NA | Max. :209.00 | Max. :9.000 | NA |
#Plotting Runtime with subsetted data#
ggplot(newnetflix) +
  geom_boxplot(aes(x = Runtime)) +
  labs(title = "Runtime of High Ranking Movies")

Notes

The subsetted data show a median Runtime of 97 minutes with a mean of 91.89 minutes. In other words, the typical highly rated movie (IMDB score above 7.0) runs about 97 minutes.

This provides an answer to Question 1, “Is user satisfaction greater when a movie is less than 90 minutes long?” The answer appears to be no: the highly rated movies have a median runtime of 97 minutes, the same as the full dataset, so runtimes under 90 minutes do not stand out among the best-rated titles.
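
A more direct way to approach Question 1 is to split the movies at the 90-minute mark and compare scores between the two groups. The sketch below is one possible way to do this, not the method used in this report. The function compare_by_cutoff and the data frame sample_netflix are hypothetical names, and the twelve rows are copied from the head() tables above, so they are not representative of the full dataset.

```r
# Illustrative, non-representative rows copied from the head() tables in this report
sample_netflix <- data.frame(
  Runtime = c(58, 81, 79, 94, 90, 147, 36, 30, 47, 110, 92, 80),
  IMDB.Score = c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 7.1, 7.1, 7.1, 7.1, 7.1, 7.1)
)

# Summarise scores for movies shorter than the cutoff versus the rest
compare_by_cutoff <- function(df, cutoff = 90, threshold = 7) {
  group <- factor(ifelse(df$Runtime < cutoff, "short", "long"),
                  levels = c("short", "long"))
  data.frame(
    group = levels(group),
    n = as.integer(table(group)),
    mean_score = as.numeric(tapply(df$IMDB.Score, group, mean)),
    median_score = as.numeric(tapply(df$IMDB.Score, group, median)),
    # Fraction of movies in each group scoring above the threshold
    share_above_threshold = as.numeric(tapply(df$IMDB.Score > threshold, group, mean))
  )
}

by_length <- compare_by_cutoff(sample_netflix)
print(by_length)
```

Calling compare_by_cutoff(netflix) on the full data would show whether the sub-90 minute group has the higher mean score or the larger share of movies rated above 7.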

For Question 2, the plots below explore other factors that may influence a movie's IMDB score.

Scatterplot 2

ggplot(netflix, aes(x = Genre, y=IMDB.Score, col = IMDB.Score>7)) +
  geom_point(size = 3) +
  labs(x = "Genre", y = "IMDB Score", title = "Genre vs. IMDB Score") +
  scale_x_discrete(position="top") + #Moving x labels to top for easier viewing
  theme(axis.text.x = element_text(angle = 270, vjust = 0.5, hjust=1)) #Rotation of x labels by 270

Notes

In the above graph, blue represents an IMDB score greater than 7. The purpose of this visual is to show which genres most frequently score above 7 on IMDB. The genre with the most movies scoring above 7.0 is Documentary, followed by Drama.
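
Reading counts off a plot with 115 genres is difficult, so the same comparison can be tabulated. The sketch below is an illustration under assumptions: summarise_genres and sample_netflix are hypothetical names, and the twelve rows are copied from the head() tables above rather than drawn from the full dataset. It reports both the number of high-scoring movies per genre and their share of that genre, since a genre with many movies will have many high scorers even if its share is ordinary.

```r
# Illustrative, non-representative rows copied from the head() tables in this report
sample_netflix <- data.frame(
  Genre = c("Documentary", "Thriller", "Science fiction/Drama", "Horror thriller",
            "Mystery", "Action", "Aftershow / Interview", "Animation",
            "Animation", "Drama", "Documentary", "Documentary"),
  IMDB.Score = c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 7.1, 7.1, 7.1, 7.1, 7.1, 7.1),
  stringsAsFactors = FALSE
)

# Count movies per genre and how many of them score above the threshold
summarise_genres <- function(df, threshold = 7) {
  genre <- factor(df$Genre)
  out <- data.frame(
    Genre = levels(genre),
    total = as.integer(table(genre)),
    n_high = as.integer(tapply(df$IMDB.Score > threshold, genre, sum)),
    stringsAsFactors = FALSE
  )
  out$share_high <- out$n_high / out$total
  # Most high-scoring movies first; ties broken alphabetically
  out[order(-out$n_high, out$Genre), ]
}

genre_summary <- summarise_genres(sample_netflix)
print(genre_summary, row.names = FALSE)
```

The same function works for language by passing a data frame whose Genre column holds the Language values.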

Histogram

ggplot(netflix, aes(x = Runtime)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "white", colour = "black", bins = 50) + # after_stat() replaces the deprecated ..density..
  labs(x = "Runtime (mins)", y = "Density", title = "Distribution of Movie Runtime (mins)") +
  geom_density(col = "red") + 
  stat_function(fun = dnorm, n = 10000, col = "blue", args = list(mean = mean(netflix$Runtime), sd = sd(netflix$Runtime))) +
  scale_y_continuous(breaks = NULL)

Notes

The above histogram overlays a normal distribution curve (blue, from dnorm) and a kernel density curve (red). Runtime departs from the normal curve: the mean (93.58 minutes) is below the median (97 minutes), which is consistent with a tail of very short titles pulling the mean down. In real-world terms this makes sense, as very few movies run anywhere near 200 minutes, while several titles in the data run well under an hour.
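
The skew described above can be checked numerically by comparing the mean with the median and by computing a moment-based skewness coefficient. This is a minimal sketch: moment_skewness is a hypothetical helper defined here (base R has no built-in skewness function), and the runtimes are the twelve values shown in the head() tables above, not the full dataset.

```r
# Moment-based skewness: negative means a longer left tail, positive a longer right tail
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative runtimes copied from the head() tables in this report
sample_runtime <- c(58, 81, 79, 94, 90, 147, 36, 30, 47, 110, 92, 80)

# A mean below the median suggests a left tail; above suggests a right tail
mean_minus_median <- mean(sample_runtime) - median(sample_runtime)
print(mean_minus_median)

print(moment_skewness(sample_runtime))
```

Applying moment_skewness(netflix$Runtime) to the full data would indicate the direction of the skew seen in the histogram.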

It is interesting to compare the typical runtime in this dataset (a mean of about 94 minutes and a median of 97) with the 90-minute cutoff the article uses. Perhaps the article interpolated its data, or had a much larger sample from which to draw its claim.

Barplot

library(gridExtra)

p1 <- ggplot(netflix) + 
  geom_bar(aes(x = IMDB.Score, fill = Language)) +
  labs(x = "IMDB Score", y = "Count of Movies", title = "IMDB Score by Language") +
  theme(legend.position = "bottom") 
p1

Notes

The above plot shows the count of movies in each language at each IMDB score. English is the language that appears most often among the highest-ranked movies (IMDB score > 7).

Conclusion

In this dataset, highly rated movies (IMDB score above 7) have a median runtime of 97 minutes rather than less than 90. This contradicts the central claim of the chosen article. Recall the article's claim that runtime impacts IMDB score. Runtime may play a part, but the plots above point to possible confounding variables that should be taken into account, including language and genre.

Stephen Follows Film Data and Education argues at length on the basis of runtime alone, on the grounds that categorical data cannot be fully analyzed against IMDB score.

It is possible that each of these variables contributes to the ranking. Specifically, we observed higher IMDB scores in the Documentary and Drama genres. We also observed higher IMDB scores among English-language movies.