The objective of this project is to analyze popularity of movie genre over time.
I found data on Kaggle for “which movie should I watch today?” which included 4 csv files called movies, FilmDetails, MoreInfo, and PosterPath.csv. The data looks like a relational database table where each file is referenced through the movie ID column.
The goal of this project is to analyze the popularity of different movie genres on user ratings, released years, and other factors to determine which movie to watch.
The data will need to be cleaned up first since there are some observations that do not have anything in there. In Movies.csv the genres will need to be separated into different columns and in FilmDetails.csv the top_billed actors will need to be separated and cleaned up as well.
Results this project seeks to achieve:
Find genre popularity ranking based on ranking Find which genre is gaining popularity over time Find out which actor and director have the highest grossing film
I tried to upload data onto the MySQL database so I can simulate pulling data from the database and working on it with R. It was taking a very long time to upload probably due to the volume of data in each csv file. To simulate I will pull data from Github into dataframe and only MoreInfo will be pulled from MySQL database.
# load data from github
url <-"https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Final_Project/Data/FilmDetails.csv"
FilmDetails<- read.csv(url)
url <- "https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Final_Project/Data/Movies.csv"
Movies <- read.csv(url)
#url <- "https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Final_Project/Data/MoreInfo.csv"
#MoreInfo <- read.csv(url)
Initiate database connection
library(DBI) #Database infrastructure for R
library(RMySQL) #Translating R and MySQL
#create a config.yml file with database information in it
#default:
# datawarehouse:
# driver: 'MySQL()'
# server: 'Name_or_link_for_Database'
# uid: 'Database_Username'
# pwd: 'Database_Password'
# database: 'Name_of_Database'
#Use get() from config library to get database information from config.yml
dbConfig <- config::get("datawarehouse")
#Connecting to Database with config info
mydb <- dbConnect(
MySQL(),
user = dbConfig$uid,
password = dbConfig$pwd,
dbname = dbConfig$database,
host = dbConfig$server
)
#Convert the information from MySQL to R data frame
MoreInfo_db <- tbl(mydb, "moreinfo") #convert source data to table
MoreInfo_df <- collect(MoreInfo_db) #convert table to R data frame
glimpse(MoreInfo_df)
## Rows: 9,718
## Columns: 5
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ runtime <chr> "2h 22 min", "2h 55 min", "3h 22 min", "3h 15 min", "1h 37 min…
## $ budget <chr> "$25,000,000", "$6,000,000", "$13,000,000", "$22,000,000", "$3…
## $ revenue <chr> "$28,341,469", "$245,066,411", "$102,600,000", "$321,365,567",…
## $ film_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
dbDisconnect(mydb) #Disconnect from database
## [1] TRUE
The data obtained were not tidy. There were many blank values and there are many values that has unwanted characters in there like the $ symbol for revenue. In this part I will clean up the data to make it into tidy format.
# FilmDetails
fd_dt <- sapply(FilmDetails, class) %>% print()
## id director top_billed budget_usd revenue_usd
## "integer" "character" "character" "integer" "numeric"
head(FilmDetails)
## id director
## 1 1 Frank Darabont
## 2 2 Francis Ford Coppola
## 3 3 Francis Ford Coppola
## 4 4 Steven Spielberg
## 5 5 Sidney Lumet
## 6 6 Hayao Miyazaki
## top_billed
## 1 Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler, Clancy Brown
## 2 Marlon Brando, Al Pacino, James Caan, Robert Duvall, Richard S. Castellano
## 3 Al Pacino, Robert Duvall, Diane Keaton, Robert De Niro, John Cazale
## 4 Liam Neeson, Ben Kingsley, Ralph Fiennes, Caroline Goodall, Jonathan Sagall
## 5 Martin Balsam, John Fiedler, Lee J. Cobb, E.G. Marshall, Jack Klugman
## 6 Rumi Hiiragi, Miyu Irino, Mari Natsuki, Takashi Naito, Yasuko Sawaguchi
## budget_usd revenue_usd
## 1 25000000 28341469
## 2 6000000 245066411
## 3 13000000 102600000
## 4 22000000 321365567
## 5 397751 4360000
## 6 19000000 274925095
To make FilmDetails tidy: - Separate top_billed column for actors by the deliminator and then pivot back into the dataframe as “actors”
# separate the column by deliminator , and then pivot into a single column called Actors
FilmDetails_clean <- FilmDetails %>%
separate_wider_delim(top_billed,
delim = ',',
names_sep = '_',
too_few = 'align_start'
) %>%
pivot_longer(cols = starts_with("top_billed"),
names_to = 'top_billed',
values_to = 'actor',
values_drop_na = TRUE
) %>%
select(-top_billed)
#Trim the spaces for the actor names
FilmDetails_clean$actor <- str_trim(FilmDetails_clean$actor)
# Movies
m_dt <- sapply(Movies, class) %>% print()
## id title genres language user_score runtime_hour
## "integer" "character" "character" "character" "numeric" "integer"
## runtime_min release_date vote_count
## "integer" "character" "integer"
head(Movies)
## id title genres language user_score
## 1 1 The Shawshank Redemption Drama, Crime en 8.7
## 2 2 The Godfather Drama, Crime en 8.7
## 3 3 The Godfather Part II Drama, Crime en 8.6
## 4 4 Schindler's List Drama, History, War en 8.6
## 5 5 12 Angry Men Drama en 8.5
## 6 6 Spirited Away Animation, Family, Fantasy ja 8.5
## runtime_hour runtime_min release_date vote_count
## 1 2 22 1994-09-23 27070
## 2 2 55 1972-03-14 20563
## 3 3 22 1974-12-20 12403
## 4 3 15 1993-12-15 15810
## 5 1 37 1957-04-10 8611
## 6 2 5 2001-07-20 16462
To make Movies tidy: - Separate genres by the deliminator then pivot back into as “genres” - Convert runtime_hour and runtime_min to “total_length” - Convert release_date into a date object
# Tidy up genres
Movies_clean <- Movies %>%
separate_wider_delim(genres,
delim = ',',
names_sep = '_',
too_few = "align_start"
) %>%
pivot_longer(cols = starts_with("genres"),
names_to = 'genre_rank',
values_to = 'genres',
values_drop_na = TRUE
)
#Trim the spaces for the actor names
Movies_clean$genres <- str_trim(Movies_clean$genres)
# create total_length from runtime_hour and runtime_min
Movies_clean <- Movies_clean %>%
mutate(total_length = runtime_hour * 60 + runtime_min)
# convert release_dte into date object so R can compare
Movies_clean$release_date <- as.Date(Movies_clean$release_date)
# MoreInfo
mi_dt <- sapply(MoreInfo_df, class) %>% print()
## id runtime budget revenue film_id
## "integer" "character" "character" "character" "integer"
head(MoreInfo_df)
## # A tibble: 6 × 5
## id runtime budget revenue film_id
## <int> <chr> <chr> <chr> <int>
## 1 1 2h 22 min $25,000,000 $28,341,469 1
## 2 2 2h 55 min $6,000,000 $245,066,411 2
## 3 3 3h 22 min $13,000,000 $102,600,000 3
## 4 4 3h 15 min $22,000,000 $321,365,567 4
## 5 5 1h 37 min $397,751 $4,360,000 5
## 6 6 2h 5 min $19,000,000 $274,925,095 6
To make MoreInfo tidy: - convert runtime to total_length - convert budget to integer - convert revenue to integer
# convert runtime to total_length by separating hours and minutes
MoreInfo_clean <- MoreInfo_df %>%
mutate(hours = as.numeric(gsub("([0-9]+)h.*","\\1", runtime )),
minutes = as.numeric(gsub(".* ([0-9]+) min.*","\\1", runtime)),
total_length = hours * 60 + minutes
)
# convert budget to integer
MoreInfo_clean$budget <- as.integer(gsub("[$,]","", MoreInfo_clean$budget))
# convert revenue to integer
MoreInfo_clean$revenue <- as.integer(gsub("[$,]","", MoreInfo_clean$revenue))
## Warning: NAs introduced by coercion to integer range
Since our data are in three different tables, I will pull them together into a big table of movie_details.
# movie_details
movie_details <- Movies_clean %>%
left_join(MoreInfo_clean %>% select(id, budget, revenue)) %>%
select(-runtime_hour, -runtime_min)
## Joining with `by = join_by(id)`
movie_details <- movie_details %>%
left_join(FilmDetails_clean %>% select(id, director, actor))
## Joining with `by = join_by(id)`
## Warning in left_join(., FilmDetails_clean %>% select(id, director, actor)): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# highest average score by genre
high_rating <- movie_details %>%
group_by(genres) %>%
summarize(avg_rating = mean(user_score)) %>%
ggplot(aes(x = reorder(genres, avg_rating), y = avg_rating)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Average Movie Rating by Genre",
x = "Genre",
y = "Average Rating") +
theme_minimal() +
coord_flip()
high_rating
# score based on genre
genres_rating <- movie_details %>%
group_by(genres) %>%
ggplot(aes(x = user_score, fill = genres)) +
geom_histogram(bins = 20) +
facet_wrap(~genres)
genres_rating
# Bayesian rating formula (weighted average)
# M is global mean
M <- mean(Movies_clean$user_score)
# C is minimum vote count threshold
C <- 1000
weighted_average <- movie_details %>%
mutate(standardized_rating = (C * M + vote_count * user_score) / (C + vote_count)) %>%
group_by(genres) %>%
summarise(avg_srating = mean(standardized_rating))
ggplot(data = weighted_average, aes(x = reorder(genres, avg_srating), y = avg_srating)) +
geom_bar(stat = "identity", fill = "steelblue")+ coord_flip()
# get the year from release date
genre_trend <- movie_details %>%
mutate(release_year = year(release_date)) %>%
group_by(release_year, genres) %>%
mutate(avg_rate_year = mean(user_score)) %>%
arrange(release_year)
ggplot(genre_trend, aes(x = release_year, y = avg_rate_year, color = genres)) +
geom_line()+
geom_smooth(method = "lm") +
labs(title = "Genre Popularity Over Time", x = "Year", y = "Average Genre Rank") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~genres)
## `geom_smooth()` using formula = 'y ~ x'
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
# number one movie of that genre for that year
movie_by_genre <- genre_trend %>%
group_by(release_year, genres) %>%
arrange(desc(user_score)) %>%
slice(1) %>%
select(title, release_year, genres,user_score, director)
best_rated_movie <- movie_details %>%
group_by(genres) %>%
arrange(desc(user_score)) %>%
slice(1) %>%
select(title, release_date, genres,user_score, director)
knitr::kable(best_rated_movie, position = "center")
| title | release_date | genres | user_score | director |
|---|---|---|---|---|
| The Dark Knight | 2008-07-16 | Action | 8.5 | Christopher Nolan |
| The Lord of the Rings: The Return of the King | 2003-12-17 | Adventure | 8.5 | Peter Jackson |
| Spirited Away | 2001-07-20 | Animation | 8.5 | Hayao Miyazaki |
| Dilwale Dulhania Le Jayenge | 1995-10-20 | Comedy | 8.5 | Aditya Chopra |
| The Shawshank Redemption | 1994-09-23 | Crime | 8.7 | Frank Darabont |
| The Shawshank Redemption | 1994-09-23 | Drama | 8.7 | Frank Darabont |
| Spirited Away | 2001-07-20 | Family | 8.5 | Hayao Miyazaki |
| Spirited Away | 2001-07-20 | Fantasy | 8.5 | Hayao Miyazaki |
| Schindler’s List | 1993-12-15 | History | 8.6 | Steven Spielberg |
| Psycho | 1960-06-22 | Horror | 8.4 | Alfred Hitchcock |
| Whiplash | 2014-10-10 | Music | 8.4 | Damien Chazelle |
| Psycho | 1960-06-22 | Mystery | 8.4 | Alfred Hitchcock |
| Dilwale Dulhania Le Jayenge | 1995-10-20 | Romance | 8.5 | Aditya Chopra |
| The Wild Robot | 2024-09-12 | Science Fiction | 8.5 | Chris Sanders |
| Return | 1975-09-04 | Short | 6.9 | Bill Viola |
| Steven Universe: The Movie | 2019-10-07 | TV Movie | 8.2 | Rebecca Sugar |
| The Dark Knight | 2008-07-16 | Thriller | 8.5 | Christopher Nolan |
| Schindler’s List | 1993-12-15 | War | 8.6 | Steven Spielberg |
| The Good, the Bad and the Ugly | 1966-12-22 | Western | 8.5 | Sergio Leone |
From Kaggle, this data set about “which movie I should see?” has three files tied by the movie id. I originally tried to upload all three files into MySQL database but was unable to do so. In the end only one table was successfully uploaded and pulled from MySQL the other two tables were uploaded onto Github and pulled with the raw address.
There were not too many cleaning I had to do with this data set. Mostly converting the values to the right data type and separate the columns that had multiple values in one cell. When data is in the right data type like dates, then it will become easier to do analysis that shows trend over time. I tried a couple different ways to analyze the data using the clean data from previous step, I created a movie_details data frame for analysis.
First I found the genre popularity ranking based on user_score. Surprisingly the War genres comes in first even though it doesn’t have as many as categories like comedy or drama but the overall quality of movies in that genre generally scored higher than other categories. Then I tried to find the trend for genre over time and it seems that other than the Animation category that is gaining better ratings over the years, comedy, crime, horror, mystery, and thriller has a noticeable declining rating over the years. Lastly, I looked for the best rated movie titles in each category and noticed that some movies appeared in two different genres. This is due to when the genre is split the user_score stayed the same for each observation of the same movie but for different genre. Maybe if each movie is given a rating score evaluated for that category then we would get a more accurate list of best movies in each genre.