The numbering of the outline shows the requirements for the Take-Home-Exam. The following code in the Chunks is the solution for the requirements.

Import Libraries

library(tidyverse)
library(ggplot2)
library(data.table)

1 Read datasets

# download data
download.file("https://datasets.imdbws.com/title.basics.tsv.gz", "title.basics.tsv.gz")

download.file("https://datasets.imdbws.com/title.ratings.tsv.gz", "title.ratings.tsv.gz")

download.file("https://datasets.imdbws.com/name.basics.tsv.gz", "name.basics.tsv.gz")

download.file("https://datasets.imdbws.com/title.crew.tsv.gz", "title.crew.tsv.gz")

Define variables for the datasets

# data loading
title_basics <- fread("title.basics.tsv.gz", sep = "\t", quote = "", na.strings = "\\N")
ratings <- fread("title.ratings.tsv.gz", sep = "\t", quote = "", na.strings = "\\N")
name_basics <- fread("name.basics.tsv.gz", sep = "\t", quote = "", na.strings = "\\N")
crew <- fread("title.crew.tsv.gz", sep = "\t", quote = "", na.string = "\\N")

2 Data Cleaning and dimension reduction

len_title_org <- nrow(title_basics)

# filter runtimeMinutes >= 60
title_basics$runtimeMinutes <- as.numeric(title_basics$runtimeMinutes)
title_basics <- title_basics %>% filter(runtimeMinutes >= 60)

len_title_60 <- nrow(title_basics)
percent_diff <- ((len_title_org - len_title_60) / len_title_org) * 100

After filtering out the movies with a runtimeMinutes of less than 60 minutes, we reduced the dataset by 91.7248642%.

3 Grouping by Decade and Genre

Assumption: I assumed that the startYear(because of also endYear contains mainly Nan values) is the year the movies were released. I created a new column called decade to group the movies by decade. I also separated the genres column and grouped the data by decade and genre(after separation df increased by 40%). After i separated the genres column, I filtered out the rows with missing values in the genres and decade columns.

# change startYear to numeric
title_basics$startYear <- as.numeric(title_basics$startYear) 

# create decade
title_basics$decade <- (title_basics$startYear %/% 10) * 10

# group by 
genre_decade <- title_basics %>% 
  separate_rows(genres, sep = ",") %>%
  filter(!is.na(genres)) %>%
  filter(!is.na(decade)) %>%
  group_by(decade, genres) %>%
  summarise(n = n(), .groups = "drop")

The plot over the decades does not show a strong trend.The genre does not change significantly (here only visual) over the decades. From the following plot, we can only see that more and more movies are produced over the decades.

#| label: Movies per Genre and Decade using bar plot
#| fig-cap: "Movies per Genre and Decade bar plot all"
#| warning: false

ggplot(genre_decade, aes(x = decade, y = n, fill = genres)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of movies per Genre and Decade",
       x = "decade",
       y = "count") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Since the bar plot makes it difficult to observe the trends, I used a line plot to show the changes over the decades more clearly. The line plot highlights that genres like Drama, Comedy, and Documentary have maintained consistent trends over the decades.

ggplot(genre_decade, aes(x = decade, y = n, color = genres)) +
  geom_line() +
  labs(title = "Number of movies per Genre and Decade",
       x = "Decade",
       y = "count") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The following plot makes it clear that genres like Drama, Comedy, Documentary, Talk-Show, and Romance have maintained consistent trends over the last 130 years.

genre_decade %>% 
  group_by(genres) %>% 
  summarise(n = sum(n)) %>% 
  ggplot(aes(x = reorder(genres, n), y = n, fill = genres)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of movies per Genre",
       x = "genre",
       y = "count") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4 Best rated Genres

To identify the best-rated genres, I joined the title_basics data set with the ratings data set and calculated the mean average rating for each genre. The plot reveals that genres like History, Documentary, and Music have the highest average ratings, around 7, while genres like Sci-Fi, Horror, and Thriller have the lowest average ratings, around 5.

# join title_basics with ratings
genre_ratings <- title_basics %>%
  separate_rows(genres, sep = ",") %>%
  filter(!is.na(genres)) %>%
  left_join(ratings, by = "tconst") %>%
  filter(!is.na(averageRating))

Since a movie can have different genres, the mean value is based on the genres.

#| label: mean of average Rating per genre
#| fig-cap: "Movies by Genre"
#| warning: false

genre_ratings %>% 
  group_by(genres) %>% 
  summarise(mean_rating = mean(averageRating)) %>% 
  ggplot(aes(x = reorder(genres, mean_rating), y = mean_rating, fill= genres)) +
  geom_bar(stat = "identity") +
  labs(title = "Average rating per Genre",
       x = "Genre",
       y = "Average Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

5 Who was/is the most significant film director?

crew <- crew %>% 
  separate_rows(directors, sep = ",") %>%
  filter(!is.na(directors))

ratings_crew <- ratings %>% 
  left_join(crew, by = "tconst") %>%
  rename(nconst = directors) %>%
  left_join(name_basics, by = "nconst") %>%
  filter(!is.na(primaryName)) %>%
  group_by(primaryName) %>%
  summarise(n = n(), sum_rating = sum(numVotes))

The most significant film director is Christopher Nolan. He has the most number of votes.

ratings_crew %>% 
  top_n(10, sum_rating) %>% 
  ggplot(aes(x = reorder(primaryName, sum_rating), y = sum_rating, fill = primaryName)) +
  geom_bar(stat = "identity") +
  labs(title = "Film directors with the most number of votes",
       x = "Directors",
       y = "Number of votes") + 
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

6 For which movie did Christopher Nolan receive the most votes?

In the following code use the 4 datasets to find the movie for which Christopher Nolan received the most votes.

nolan_best <- ratings %>% 
  left_join(crew, by = "tconst") %>%
  rename(nconst = directors) %>%
  left_join(name_basics, by = "nconst") %>%
  filter(!is.na(primaryName)) %>%
  filter(primaryName == "Christopher Nolan") %>%
  filter(numVotes == max(numVotes)) %>%
  left_join(title_basics, by = "tconst")
nolan_best_numVotes <- nolan_best$numVotes
nolan_best_primaryTitle <- nolan_best$primaryTitle

Christoper Nolan received the most votes for the movie The Dark Knight with 2955094 votes.