The numbering of the outline shows the requirements for the Take-Home-Exam. The following code in the Chunks is the solution for the requirements.
library(tidyverse)
library(ggplot2)
library(data.table)
# download data
download.file("https://datasets.imdbws.com/title.basics.tsv.gz", "title.basics.tsv.gz")
download.file("https://datasets.imdbws.com/title.ratings.tsv.gz", "title.ratings.tsv.gz")
download.file("https://datasets.imdbws.com/name.basics.tsv.gz", "name.basics.tsv.gz")
download.file("https://datasets.imdbws.com/title.crew.tsv.gz", "title.crew.tsv.gz")
# data loading
title_basics <- fread("title.basics.tsv.gz", sep = "\t", quote = "", na.strings = "\\N")
ratings <- fread("title.ratings.tsv.gz", sep = "\t", quote = "", na.strings = "\\N")
name_basics <- fread("name.basics.tsv.gz", sep = "\t", quote = "", na.strings = "\\N")
crew <- fread("title.crew.tsv.gz", sep = "\t", quote = "", na.string = "\\N")
len_title_org <- nrow(title_basics)
# filter runtimeMinutes >= 60
title_basics$runtimeMinutes <- as.numeric(title_basics$runtimeMinutes)
title_basics <- title_basics %>% filter(runtimeMinutes >= 60)
len_title_60 <- nrow(title_basics)
percent_diff <- ((len_title_org - len_title_60) / len_title_org) * 100
After filtering out the movies with a runtimeMinutes of less than 60 minutes, we reduced the dataset by 91.7248642%.
Assumption: I assumed that the startYear(because of also endYear contains mainly Nan values) is the year the movies were released. I created a new column called decade to group the movies by decade. I also separated the genres column and grouped the data by decade and genre(after separation df increased by 40%). After i separated the genres column, I filtered out the rows with missing values in the genres and decade columns.
# change startYear to numeric
title_basics$startYear <- as.numeric(title_basics$startYear)
# create decade
title_basics$decade <- (title_basics$startYear %/% 10) * 10
# group by
genre_decade <- title_basics %>%
separate_rows(genres, sep = ",") %>%
filter(!is.na(genres)) %>%
filter(!is.na(decade)) %>%
group_by(decade, genres) %>%
summarise(n = n(), .groups = "drop")
The plot over the decades does not show a strong trend.The genre does not change significantly (here only visual) over the decades. From the following plot, we can only see that more and more movies are produced over the decades.
#| label: Movies per Genre and Decade using bar plot
#| fig-cap: "Movies per Genre and Decade bar plot all"
#| warning: false
ggplot(genre_decade, aes(x = decade, y = n, fill = genres)) +
geom_bar(stat = "identity") +
labs(title = "Number of movies per Genre and Decade",
x = "decade",
y = "count") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Since the bar plot makes it difficult to observe the trends, I used a line plot to show the changes over the decades more clearly. The line plot highlights that genres like Drama, Comedy, and Documentary have maintained consistent trends over the decades.
ggplot(genre_decade, aes(x = decade, y = n, color = genres)) +
geom_line() +
labs(title = "Number of movies per Genre and Decade",
x = "Decade",
y = "count") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The following plot makes it clear that genres like Drama, Comedy, Documentary, Talk-Show, and Romance have maintained consistent trends over the last 130 years.
genre_decade %>%
group_by(genres) %>%
summarise(n = sum(n)) %>%
ggplot(aes(x = reorder(genres, n), y = n, fill = genres)) +
geom_bar(stat = "identity") +
labs(title = "Number of movies per Genre",
x = "genre",
y = "count") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To identify the best-rated genres, I joined the title_basics data set with the ratings data set and calculated the mean average rating for each genre. The plot reveals that genres like History, Documentary, and Music have the highest average ratings, around 7, while genres like Sci-Fi, Horror, and Thriller have the lowest average ratings, around 5.
# join title_basics with ratings
genre_ratings <- title_basics %>%
separate_rows(genres, sep = ",") %>%
filter(!is.na(genres)) %>%
left_join(ratings, by = "tconst") %>%
filter(!is.na(averageRating))
Since a movie can have different genres, the mean value is based on the genres.
#| label: mean of average Rating per genre
#| fig-cap: "Movies by Genre"
#| warning: false
genre_ratings %>%
group_by(genres) %>%
summarise(mean_rating = mean(averageRating)) %>%
ggplot(aes(x = reorder(genres, mean_rating), y = mean_rating, fill= genres)) +
geom_bar(stat = "identity") +
labs(title = "Average rating per Genre",
x = "Genre",
y = "Average Rating") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
crew <- crew %>%
separate_rows(directors, sep = ",") %>%
filter(!is.na(directors))
ratings_crew <- ratings %>%
left_join(crew, by = "tconst") %>%
rename(nconst = directors) %>%
left_join(name_basics, by = "nconst") %>%
filter(!is.na(primaryName)) %>%
group_by(primaryName) %>%
summarise(n = n(), sum_rating = sum(numVotes))
The most significant film director is Christopher Nolan. He has the most number of votes.
ratings_crew %>%
top_n(10, sum_rating) %>%
ggplot(aes(x = reorder(primaryName, sum_rating), y = sum_rating, fill = primaryName)) +
geom_bar(stat = "identity") +
labs(title = "Film directors with the most number of votes",
x = "Directors",
y = "Number of votes") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
In the following code use the 4 datasets to find the movie for which Christopher Nolan received the most votes.
nolan_best <- ratings %>%
left_join(crew, by = "tconst") %>%
rename(nconst = directors) %>%
left_join(name_basics, by = "nconst") %>%
filter(!is.na(primaryName)) %>%
filter(primaryName == "Christopher Nolan") %>%
filter(numVotes == max(numVotes)) %>%
left_join(title_basics, by = "tconst")
nolan_best_numVotes <- nolan_best$numVotes
nolan_best_primaryTitle <- nolan_best$primaryTitle
Christoper Nolan received the most votes for the movie The Dark Knight with 2955094 votes.