Midterm

Sources

Explanation of Article

My article comes from Stephen Follows' Film Data and Education site. Titled Is a Film’s Length a Sign of Its Quality, it begins by discussing whether one can tell which movies to avoid based on: 1. Embargoed reviews, 2. Production reviews, 3. Poster chicanery, 4. Interviews, and 5. Sub-90 minute running time. The article focuses on the correlation between running time and quality (IMDB score). It claims that a sub-90 minute movie is more likely to receive higher ratings. I chose this topic because I personally feel that movies longer than 90 minutes are better than shorter ones. I wish to explore the validity of the article's claim and provide context on possible external factors.

Explanation of Data

My dataset covers 584 Netflix originals. Key variables include: 1. Title, 2. Genre, 3. Premiere, 4. Runtime, 5. IMDB.Score, and 6. Language. To clarify, Runtime is given in minutes, IMDB.Score is the rating from the Internet Movie Database, and Premiere is the date of the premiere.

The dataset was retrieved from Kaggle and contains ratings for close to 600 movies.

Question

Is user satisfaction greater when a movie is less than 90 minutes long?
What possible variables influence the success of a movie?

Initial Exploration of Data

library(knitr)
library(kableExtra) # kableExtra allows styling of knitted tables and provides the %>% pipe
library(ggplot2)
netflix <- read.csv("NetflixOriginals.csv") # read in data
kable(head(netflix), booktabs = TRUE) %>% kable_styling(font_size = 13)

Title	Genre	Premiere	Runtime	IMDB.Score	Language
Enter the Anime	Documentary	August 5, 2019	58	2.5	English/Japanese
Dark Forces	Thriller	August 21, 2020	81	2.6	Spanish
The App	Science fiction/Drama	December 26, 2019	79	2.6	Italian
The Open House	Horror thriller	January 19, 2018	94	3.2	English
Kaali Khuhi	Mystery	October 30, 2020	90	3.4	Hindi
Drive	Action	November 1, 2019	147	3.5	Hindi

kable(summary(netflix))%>% kable_styling(font_size = 13)

Title	Genre	Premiere	Runtime	IMDB.Score	Language
Length:584	Length:584	Length:584	Min. : 4.00	Min. :2.500	Length:584
Class :character	Class :character	Class :character	1st Qu.: 86.00	1st Qu.:5.700	Class :character
Mode :character	Mode :character	Mode :character	Median : 97.00	Median :6.350	Mode :character
NA	NA	NA	Mean : 93.58	Mean :6.272	NA
NA	NA	NA	3rd Qu.:108.00	3rd Qu.:7.000	NA
NA	NA	NA	Max. :209.00	Max. :9.000	NA

Data Validation

Data Types

#Checking the type of each variable
typeof(netflix$Title)

[1] "character"

typeof(netflix$Genre)

[1] "character"

typeof(netflix$Premiere)

[1] "character"

typeof(netflix$Runtime)

[1] "integer"

typeof(netflix$IMDB.Score)

[1] "double"

typeof(netflix$Language)

[1] "character"

The above data types are correct and as expected. It is important to note the categorical variables versus the continuous variables prior to analysis. Specifically:

Continuous: Runtime, IMDB.Score
Categorical: Title, Genre, Premiere, Language
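
As a more compact alternative to one typeof() call per column, all types can be checked in a single call. The sketch below retypes three of the rows printed by head(netflix) so that it runs on its own; sample_rows is an illustrative name, and on the real data the same calls would take netflix. It also shows one way to parse Premiere, which is a date stored as text, assuming English month names as in the rows above.

```r
# Three rows copied from the head(netflix) table, retyped so the sketch is self-contained.
sample_rows <- data.frame(
  Title = c("Enter the Anime", "Dark Forces", "The App"),
  Premiere = c("August 5, 2019", "August 21, 2020", "December 26, 2019"),
  Runtime = c(58L, 81L, 79L),
  IMDB.Score = c(2.5, 2.6, 2.6),
  stringsAsFactors = FALSE
)

# One call returns the storage type of every column as a named character vector.
col_types <- vapply(sample_rows, typeof, character(1))
print(col_types)

# Assumption: month names are in English and the locale can parse them with %B.
premiere_date <- as.Date(sample_rows$Premiere, format = "%B %d, %Y")
print(premiere_date)
```

Parsing Premiere as a Date would allow it to be treated as a time variable rather than as a category with one level per movie.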

Duplicates

kable(head(unique(netflix$Title)), col.names = "584 Unique Movie Titles")%>% kable_styling(font_size = 13)

584 Unique Movie Titles
Enter the Anime
Dark Forces
The App
The Open House
Kaali Khuhi
Drive

kable(head(unique(netflix$Genre)), col.names = "115 Unique Movie Genres")%>% kable_styling(font_size = 13)

115 Unique Movie Genres
Documentary
Thriller
Science fiction/Drama
Horror thriller
Mystery
Action

kable(head(unique(netflix$Language)), col.names = "38 Unique Language Combinations")%>% kable_styling(font_size = 13)

38 Unique Language Combinations
English/Japanese
Spanish
Italian
English
Hindi
Turkish
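
The counts in the table headers above are typed in by hand. A minimal sketch of how they could be computed, and how repeated titles could be flagged, is shown here. count_unique() is a hypothetical helper that is not part of the original analysis, and the titles are the six printed above rather than the full column.

```r
# Hypothetical helper: number of distinct values in a vector.
count_unique <- function(x) length(unique(x))

# The six titles shown above, retyped so the sketch runs on its own.
titles <- c("Enter the Anime", "Dark Forces", "The App",
            "The Open House", "Kaali Khuhi", "Drive")

n_unique <- count_unique(titles)       # distinct titles
n_repeated <- sum(duplicated(titles))  # entries that repeat an earlier title

# Build the table header from the data instead of hard-coding the count.
header <- paste(n_unique, "Unique Movie Titles")
print(header)
print(n_repeated)
```

On the full data, count_unique(netflix$Title) == nrow(netflix) would confirm that no title appears twice.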

Missing Values

# Omits missing values #
netflix <- na.omit(netflix)
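
Before dropping rows it can help to see how many values are actually missing. Note that na.omit() only removes NA, so empty strings in text columns would pass through unnoticed. The sketch below counts both; count_missing() is a hypothetical helper and toy is made-up illustrative data, not rows from the Netflix file.

```r
# Hypothetical helper: per column, count NA values plus blank strings.
count_missing <- function(df) {
  vapply(df, function(col) {
    blank <- is.character(col) & !is.na(col) & trimws(col) == ""
    sum(is.na(col) | blank)
  }, integer(1))
}

# Made-up rows for illustration only (not taken from the dataset).
toy <- data.frame(
  Title = c("A", "B", "C"),
  Genre = c("Drama", "", NA),
  Runtime = c(90L, NA, 100L),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(toy)
print(missing_counts)
```

Running count_missing(netflix) before na.omit() would show whether any rows are removed at all.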

Plots

Scatterplot 1

# Making a scatterplot
library(car)

Loading required package: carData

runtime <- netflix$Runtime
IMDB_score <- netflix$IMDB.Score

car_scatter <- scatterplot(x=runtime, 
                           y=IMDB_score, 
                           xlab = "Runtime",
                           ylab = "IMDB Score",
                           main = "Runtime vs. IMDB Score",
                           ellipse = list(levels=c(.5, .95), robust = TRUE, fill=FALSE),
                           smooth = FALSE,
                           regLine = TRUE, #Create a regression line
                           legend = TRUE,
                           col = "red")

Notes

From the marginal boxplots, the median Runtime is 97 minutes and the median IMDB score is 6.35 (about 6.4). There are outliers, particularly in Runtime: movies shorter than about 50 minutes or longer than about 140 minutes are of unusual length. The ellipses mark where the data are concentrated: the inner ellipse encloses roughly half of the movies and the outer one roughly 95%. In the context of this data, most movies run close to 100 minutes with an IMDB score near 6.4.
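
The outlier thresholds quoted above can be reproduced by hand with the usual 1.5 x IQR boxplot rule, using the Runtime quartiles from the summary table (86 and 108). This is a minimal sketch; the boxplots drawn by scatterplot() may use slightly different hinge values.

```r
# Runtime quartiles taken from summary(netflix) above.
q1 <- 86
q3 <- 108

iqr <- q3 - q1                  # interquartile range: 22
lower_fence <- q1 - 1.5 * iqr   # 86 - 33 = 53
upper_fence <- q3 + 1.5 * iqr   # 108 + 33 = 141

print(c(lower = lower_fence, upper = upper_fence))
```

Runtimes below 53 or above 141 minutes would be drawn as outliers under this rule, which matches the rough 50 and 140 minute figures.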

This graph may support the argument presented by Stephen. Recall that the article claims movies less than 90 minutes long will receive higher IMDB scores. It is worth noting that many of the outliers run less than 90 minutes, which could skew that interpretation. Many factors other than runtime also affect an IMDB score, and Stephen notes these may be difficult to measure. The analysis below takes a closer look at runtime and at those external factors.
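
One way to put a number on the regression line is the Pearson correlation together with the fitted slope. The sketch below retypes the six rows shown by head(netflix) so that it runs on its own. Those are the six lowest-scoring titles, so the values it prints say nothing about the full dataset. On the real data the same calls would use netflix directly.

```r
# Six rows copied from the head(netflix) table (the lowest-scoring titles).
rows <- data.frame(
  Runtime = c(58, 81, 79, 94, 90, 147),
  IMDB.Score = c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5)
)

# Pearson correlation between runtime and score.
r <- cor(rows$Runtime, rows$IMDB.Score)

# Slope of the least-squares line: change in score per extra minute.
fit <- lm(IMDB.Score ~ Runtime, data = rows)
slope <- coef(fit)[["Runtime"]]

print(r)
print(slope)
```

A correlation close to zero on the full data would suggest that runtime alone explains little of the variation in score.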

Boxplot

IMDB scores above 7.0 are generally taken to indicate a good movie. In the context of the article, movies less than 90 minutes long would be expected to have IMDB scores of 7.0 or higher. For further analysis, the data are subset to movies with an IMDB score greater than 7.

newnetflix <- subset(netflix, IMDB.Score > 7) # keep only movies scoring above 7
kable(head(newnetflix), booktabs = TRUE) %>% kable_styling(font_size = 13) # head

	Title	Genre	Premiere	Runtime	IMDB.Score	Language
452	13th: A Conversation with Oprah Winfrey & Ava DuVernay	Aftershow / Interview	January 26, 2017	36	7.1	English
453	Angela’s Christmas	Animation	November 30, 2018	30	7.1	English
454	Angela’s Christmas Wish	Animation	December 1, 2020	47	7.1	English
455	Beats	Drama	June 19, 2019	110	7.1	English
456	Circus of Books	Documentary	April 22, 2020	92	7.1	English
457	Dance Dreams: Hot Chocolate Nutcracker	Documentary	November 27, 2020	80	7.1	English

kable(summary(newnetflix), booktabs = TRUE) %>% kable_styling(font_size = 13) #summary

Title	Genre	Premiere	Runtime	IMDB.Score	Language
Length:133	Length:133	Length:133	Min. : 11.00	Min. :7.100	Length:133
Class :character	Class :character	Class :character	1st Qu.: 79.00	1st Qu.:7.200	Class :character
Mode :character	Mode :character	Mode :character	Median : 97.00	Median :7.300	Mode :character
NA	NA	NA	Mean : 91.89	Mean :7.471	NA
NA	NA	NA	3rd Qu.:112.00	3rd Qu.:7.600	NA
NA	NA	NA	Max. :209.00	Max. :9.000	NA

#Plotting Runtime with subsetted data#
ggplot(newnetflix) +
  geom_boxplot(aes(x = Runtime)) +
  labs(title = "Runtime of High Ranking Movies")

Notes

The subsetted data show a median Runtime of 97 minutes with a mean of 91.89 minutes. This means that movies rated highly on IMDB (above 7.0) typically run about 97 minutes, the same median as the full dataset.

This does provide an answer to Question 1, “Is user satisfaction greater when a movie is less than 90 minutes long?” The answer is no: the most highly rated movies center on a runtime of about 97 minutes, not on runtimes under 90 minutes.
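
A more direct check of Question 1 would be to split the movies at the 90-minute mark and compare the two groups. compare_by_cutoff() below is a hypothetical helper, not part of the original analysis, and the demo rows are the six printed by head(netflix), used only so the sketch runs on its own.

```r
# Hypothetical helper: summarise scores for movies under vs. at or over a runtime cutoff.
compare_by_cutoff <- function(df, cutoff = 90, good_score = 7) {
  group <- factor(
    ifelse(df$Runtime < cutoff, "under_cutoff", "cutoff_or_longer"),
    levels = c("under_cutoff", "cutoff_or_longer")
  )
  summarise <- function(scores) {
    c(n = length(scores),
      mean_score = mean(scores),
      share_high = mean(scores > good_score))
  }
  # One row per group, one column per summary statistic.
  t(sapply(split(df$IMDB.Score, group), summarise))
}

# Six rows copied from the head(netflix) table (not representative of the full data).
rows <- data.frame(
  Runtime = c(58, 81, 79, 94, 90, 147),
  IMDB.Score = c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5)
)

res <- compare_by_cutoff(rows)
print(res)
```

Calling compare_by_cutoff(netflix) would give the mean score and the share of movies rated above 7 on each side of the 90-minute mark, which addresses the article's claim more directly than the median of the subset.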

For Question 2, the plots below explore factors other than runtime that may influence a movie's IMDB score.

Scatterplot 2

ggplot(netflix, aes(x = Genre, y=IMDB.Score, col = IMDB.Score>7)) +
  geom_point(size = 3) +
  labs(x = "Genre", y = "IMDB Score", title = "Genre vs. IMDB Score") +
  scale_x_discrete(position="top") + #Moving x labels to top for easier viewing
  theme(axis.text.x = element_text(angle = 270, vjust = 0.5, hjust=1)) #Rotation of x labels by 270

Notes

In the above graph, blue represents an IMDB score greater than 7. The purpose of this visual is to gain insight into which genre is most frequently rated above 7 on IMDB. The genre with the most movies rated above 7.0 is Documentary, followed by Drama.
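
The ranking read off the plot could also be tabulated. The sketch below counts genres among highly rated movies using only the six rows printed by head(newnetflix), retyped so it runs on its own; on the real data the input would be newnetflix$Genre.

```r
# Genres of the six rows shown by head(newnetflix) above.
genres <- c("Aftershow / Interview", "Animation", "Animation",
            "Drama", "Documentary", "Documentary")

# Count movies per genre and list the most frequent genres first.
genre_counts <- sort(table(genres), decreasing = TRUE)
print(genre_counts)
```

Because the dataset has 115 genre labels, many of them combinations such as Science fiction/Drama, the counts for closely related genres are spread across several labels.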

Histogram

ggplot(netflix, aes(x = Runtime)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "white", colour = "black", bins = 50) + # after_stat() replaces the older ..density.. notation
  labs(x = "Runtime (mins)", y = "Density", title = "Distribution of Movie Runtime (mins)") +
  geom_density(col = "red") + 
  stat_function(fun = dnorm, n = 10000, col = "blue", args = list(mean = mean(netflix$Runtime), sd = sd(netflix$Runtime))) +
  scale_y_continuous(breaks = NULL)

Notes

The above histogram is overlaid with a normal curve (dnorm, blue) and a density curve (red). Runtime departs from the fitted normal distribution and is skewed slightly left: the mean (93.58) falls below the median (97), pulled down by a tail of very short titles (the minimum is 4 minutes). At the other end, this makes sense in real-world terms, as very few movies run close to 200 minutes.
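
The direction of the skew can be checked numerically as well as by eye. Base R has no built-in skewness function, so skewness() below is a hypothetical helper that computes the moment coefficient of skewness; negative values indicate a longer left tail and positive values a longer right tail. The demo runtimes are the six from head(netflix) and are not representative of the full column.

```r
# Hypothetical helper: moment coefficient of skewness.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Mean minus median from summary(netflix): a negative gap hints at left skew.
mean_median_gap <- 93.58 - 97

# Six runtimes copied from the head(netflix) table.
runtimes <- c(58, 81, 79, 94, 90, 147)
demo_skew <- skewness(runtimes)

print(mean_median_gap)
print(demo_skew)
```

On the full data, skewness(netflix$Runtime) would give the sign and rough size of the asymmetry seen in the histogram.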

It is interesting to compare the typical runtime here with the 90-minute figure used in the article. Perhaps the article interpolated its data, or drew on a much larger sample, to reach its claim.

Barplot

library(gridExtra)

p1 <- ggplot(netflix) + 
  geom_bar(aes(x = IMDB.Score, fill = Language)) +
  labs(x = "IMDB Score", y = "Count of Movies", title = "IMDB Score by Language") +
  theme(legend.position = "bottom") 
p1

Notes

The above plot shows the count of movies in each language by IMDB score. English is the most common language among the highest-rated movies (IMDB Score > 7).
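
Raw counts partly reflect how many movies exist in each language, so the share of highly rated movies within each language may be a fairer comparison. The sketch below combines the twelve rows printed by head(netflix) and head(newnetflix), retyped so it runs on its own; the resulting shares are illustrative only and do not describe the full dataset.

```r
# Twelve rows: six from head(netflix) followed by six from head(newnetflix).
rows <- data.frame(
  Language = c("English/Japanese", "Spanish", "Italian", "English", "Hindi", "Hindi",
               "English", "English", "English", "English", "English", "English"),
  IMDB.Score = c(2.5, 2.6, 2.6, 3.2, 3.4, 3.5,
                 7.1, 7.1, 7.1, 7.1, 7.1, 7.1),
  stringsAsFactors = FALSE
)

# Share of movies scoring above 7 within each language.
share_high <- tapply(rows$IMDB.Score > 7, rows$Language, mean)

# Number of movies per language, for context.
n_movies <- table(rows$Language)

print(share_high)
print(n_movies)
```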

Conclusion

In this dataset, highly rated movies cluster around a runtime of 97 minutes rather than under 90 minutes. This contradicts the central claim of the chosen article. Recall the article's claim that runtime impacts IMDB score. Although runtime may play a role, we have shown that other variables, including language and genre, need to be taken into account.

The Stephen Follows article argues solely on the basis of runtime, on the grounds that categorical data cannot be fully analyzed against IMDB score.

It is possible that each of these variables contributes to the rating. Specifically, we observed higher IMDB scores in the Documentary and Drama genres. We also observed higher IMDB scores among English-language movies.

Midterm

Joe Barrett

2/21/2022
