DATA607 Final Project

Intruduction

The objective of this project is to analyze popularity of movie genre over time.

I found data on Kaggle for “which movie should I watch today?” which included 4 csv files called movies, FilmDetails, MoreInfo, and PosterPath.csv. The data looks like a relational database table where each file is referenced through the movie ID column.

The goal of this project is to analyze the popularity of different movie genres on user ratings, released years, and other factors to determine which movie to watch.

The data will need to be cleaned up first since there are some observations that do not have anything in there. In Movies.csv the genres will need to be separated into different columns and in FilmDetails.csv the top_billed actors will need to be separated and cleaned up as well.

Results this project seeks to achieve:

Find genre popularity ranking based on ranking Find which genre is gaining popularity over time Find out which actor and director have the highest grossing film

Load Data

I tried to upload data onto the MySQL database so I can simulate pulling data from the database and working on it with R. It was taking a very long time to upload probably due to the volume of data in each csv file. To simulate I will pull data from Github into dataframe and only MoreInfo will be pulled from MySQL database.

# load data from github
url <-"https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Final_Project/Data/FilmDetails.csv"
FilmDetails<- read.csv(url)

url <- "https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Final_Project/Data/Movies.csv"
Movies <- read.csv(url)

#url <- "https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Final_Project/Data/MoreInfo.csv"
#MoreInfo <- read.csv(url)

Initiate database connection

library(DBI)      #Database infrastructure for R
library(RMySQL)   #Translating R and MySQL

#create a config.yml file with database information in it
#default:
#  datawarehouse:
#    driver: 'MySQL()' 
#    server: 'Name_or_link_for_Database'
#    uid: 'Database_Username'
#    pwd: 'Database_Password'
#    database: 'Name_of_Database'

#Use get() from config library to get database information from config.yml
dbConfig <- config::get("datawarehouse")

#Connecting to Database with config info
mydb <- dbConnect(
  MySQL(),
  user = dbConfig$uid,
  password = dbConfig$pwd,
  dbname = dbConfig$database,
  host = dbConfig$server
)

#Convert the information from MySQL to R data frame
MoreInfo_db <- tbl(mydb, "moreinfo") #convert source data to table
MoreInfo_df <- collect(MoreInfo_db) #convert table to R data frame
glimpse(MoreInfo_df)

## Rows: 9,718
## Columns: 5
## $ id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ runtime <chr> "2h 22 min", "2h 55 min", "3h 22 min", "3h 15 min", "1h 37 min…
## $ budget  <chr> "$25,000,000", "$6,000,000", "$13,000,000", "$22,000,000", "$3…
## $ revenue <chr> "$28,341,469", "$245,066,411", "$102,600,000", "$321,365,567",…
## $ film_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…

dbDisconnect(mydb) #Disconnect from database

## [1] TRUE

Clean up data

The data obtained were not tidy. There were many blank values and there are many values that has unwanted characters in there like the $ symbol for revenue. In this part I will clean up the data to make it into tidy format.

# FilmDetails
fd_dt <- sapply(FilmDetails, class) %>% print()

##          id    director  top_billed  budget_usd revenue_usd 
##   "integer" "character" "character"   "integer"   "numeric"

head(FilmDetails)

##   id             director
## 1  1       Frank Darabont
## 2  2 Francis Ford Coppola
## 3  3 Francis Ford Coppola
## 4  4     Steven Spielberg
## 5  5         Sidney Lumet
## 6  6       Hayao Miyazaki
##                                                                    top_billed
## 1       Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler, Clancy Brown
## 2  Marlon Brando, Al Pacino, James Caan, Robert Duvall, Richard S. Castellano
## 3         Al Pacino, Robert Duvall, Diane Keaton, Robert De Niro, John Cazale
## 4 Liam Neeson, Ben Kingsley, Ralph Fiennes, Caroline Goodall, Jonathan Sagall
## 5       Martin Balsam, John Fiedler, Lee J. Cobb, E.G. Marshall, Jack Klugman
## 6     Rumi Hiiragi, Miyu Irino, Mari Natsuki, Takashi Naito, Yasuko Sawaguchi
##   budget_usd revenue_usd
## 1   25000000    28341469
## 2    6000000   245066411
## 3   13000000   102600000
## 4   22000000   321365567
## 5     397751     4360000
## 6   19000000   274925095

To make FilmDetails tidy: - Separate top_billed column for actors by the deliminator and then pivot back into the dataframe as “actors”

# separate the column by deliminator , and then pivot into a single column called Actors
FilmDetails_clean <- FilmDetails %>% 
  separate_wider_delim(top_billed,
                       delim = ',',
                       names_sep = '_',
                       too_few = 'align_start'
                       ) %>% 
  pivot_longer(cols = starts_with("top_billed"),
               names_to = 'top_billed',
               values_to = 'actor',
               values_drop_na = TRUE
               ) %>% 
  select(-top_billed)

#Trim the spaces for the actor names
FilmDetails_clean$actor <- str_trim(FilmDetails_clean$actor)

# Movies
m_dt <- sapply(Movies, class) %>% print()

##           id        title       genres     language   user_score runtime_hour 
##    "integer"  "character"  "character"  "character"    "numeric"    "integer" 
##  runtime_min release_date   vote_count 
##    "integer"  "character"    "integer"

head(Movies)

##   id                    title                     genres language user_score
## 1  1 The Shawshank Redemption               Drama, Crime       en        8.7
## 2  2            The Godfather               Drama, Crime       en        8.7
## 3  3    The Godfather Part II               Drama, Crime       en        8.6
## 4  4         Schindler's List        Drama, History, War       en        8.6
## 5  5             12 Angry Men                      Drama       en        8.5
## 6  6            Spirited Away Animation, Family, Fantasy       ja        8.5
##   runtime_hour runtime_min release_date vote_count
## 1            2          22   1994-09-23      27070
## 2            2          55   1972-03-14      20563
## 3            3          22   1974-12-20      12403
## 4            3          15   1993-12-15      15810
## 5            1          37   1957-04-10       8611
## 6            2           5   2001-07-20      16462

To make Movies tidy: - Separate genres by the deliminator then pivot back into as “genres” - Convert runtime_hour and runtime_min to “total_length” - Convert release_date into a date object

# Tidy up genres
Movies_clean <- Movies %>% 
  separate_wider_delim(genres,
                       delim = ',',
                       names_sep = '_',
                       too_few = "align_start"
                       ) %>% 
  pivot_longer(cols = starts_with("genres"),
               names_to = 'genre_rank',
               values_to = 'genres',
               values_drop_na = TRUE
  )

#Trim the spaces for the actor names
Movies_clean$genres <- str_trim(Movies_clean$genres)

# create total_length from runtime_hour and runtime_min
Movies_clean <- Movies_clean %>% 
  mutate(total_length = runtime_hour * 60 + runtime_min)

# convert release_dte into date object so R can compare
Movies_clean$release_date <- as.Date(Movies_clean$release_date)

# MoreInfo
mi_dt <- sapply(MoreInfo_df, class) %>% print()

##          id     runtime      budget     revenue     film_id 
##   "integer" "character" "character" "character"   "integer"

head(MoreInfo_df)

## # A tibble: 6 × 5
##      id runtime   budget      revenue      film_id
##   <int> <chr>     <chr>       <chr>          <int>
## 1     1 2h 22 min $25,000,000 $28,341,469        1
## 2     2 2h 55 min $6,000,000  $245,066,411       2
## 3     3 3h 22 min $13,000,000 $102,600,000       3
## 4     4 3h 15 min $22,000,000 $321,365,567       4
## 5     5 1h 37 min $397,751    $4,360,000         5
## 6     6 2h 5 min  $19,000,000 $274,925,095       6

To make MoreInfo tidy: - convert runtime to total_length - convert budget to integer - convert revenue to integer

# convert runtime to total_length by separating hours and minutes
MoreInfo_clean <- MoreInfo_df %>% 
  mutate(hours = as.numeric(gsub("([0-9]+)h.*","\\1", runtime )),
         minutes = as.numeric(gsub(".* ([0-9]+) min.*","\\1", runtime)),
         total_length = hours * 60 + minutes
        )

# convert budget to integer
MoreInfo_clean$budget <- as.integer(gsub("[$,]","", MoreInfo_clean$budget))
  
# convert revenue to integer
MoreInfo_clean$revenue <- as.integer(gsub("[$,]","", MoreInfo_clean$revenue))

## Warning: NAs introduced by coercion to integer range

Explore

Since our data are in three different tables, I will pull them together into a big table of movie_details.

# movie_details 
movie_details <- Movies_clean %>% 
  left_join(MoreInfo_clean %>% select(id, budget, revenue)) %>% 
  select(-runtime_hour, -runtime_min)

## Joining with `by = join_by(id)`

movie_details <- movie_details %>% 
  left_join(FilmDetails_clean %>% select(id, director, actor))

## Joining with `by = join_by(id)`

## Warning in left_join(., FilmDetails_clean %>% select(id, director, actor)): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

Analyze

Find genre popularity ranking based on user_score

# highest average score by genre
high_rating <- movie_details %>% 
  group_by(genres) %>% 
  summarize(avg_rating = mean(user_score)) %>% 
  ggplot(aes(x = reorder(genres, avg_rating), y = avg_rating)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  labs(title = "Average Movie Rating by Genre", 
       x = "Genre", 
       y = "Average Rating") +
  theme_minimal() +
  coord_flip()
high_rating

# score based on genre 
genres_rating <- movie_details %>% 
  group_by(genres) %>% 
  ggplot(aes(x = user_score, fill = genres)) + 
  geom_histogram(bins = 20) +
  facet_wrap(~genres)
genres_rating

# Bayesian rating formula (weighted average)
# M is global mean
M <- mean(Movies_clean$user_score)
# C is minimum vote count threshold
C <- 1000

weighted_average <- movie_details %>%
  mutate(standardized_rating = (C * M + vote_count * user_score) / (C + vote_count)) %>% 
  group_by(genres) %>%
  summarise(avg_srating = mean(standardized_rating))
ggplot(data = weighted_average, aes(x = reorder(genres, avg_srating), y = avg_srating)) +
  geom_bar(stat = "identity", fill = "steelblue")+ coord_flip()

Find which genre is gaining popularity over time.

# get the year from release date
genre_trend <- movie_details %>% 
  mutate(release_year = year(release_date)) %>% 
  group_by(release_year, genres) %>% 
  mutate(avg_rate_year = mean(user_score)) %>% 
  arrange(release_year)

ggplot(genre_trend, aes(x = release_year, y = avg_rate_year, color = genres)) +
  geom_line()+
  geom_smooth(method = "lm") +
  labs(title = "Genre Popularity Over Time", x = "Year", y = "Average Genre Rank") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~genres)

## `geom_smooth()` using formula = 'y ~ x'

## Warning in qt((1 - level)/2, df): NaNs produced

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

# number one movie of that genre for that year
movie_by_genre <- genre_trend %>% 
  group_by(release_year, genres) %>% 
  arrange(desc(user_score)) %>% 
  slice(1) %>% 
  select(title, release_year, genres,user_score, director)

Which movie has the best rating in that genre?

best_rated_movie <- movie_details %>% 
  group_by(genres) %>% 
  arrange(desc(user_score)) %>% 
  slice(1) %>% 
  select(title, release_date, genres,user_score, director)

knitr::kable(best_rated_movie, position = "center")

title	release_date	genres	user_score	director
The Dark Knight	2008-07-16	Action	8.5	Christopher Nolan
The Lord of the Rings: The Return of the King	2003-12-17	Adventure	8.5	Peter Jackson
Spirited Away	2001-07-20	Animation	8.5	Hayao Miyazaki
Dilwale Dulhania Le Jayenge	1995-10-20	Comedy	8.5	Aditya Chopra
The Shawshank Redemption	1994-09-23	Crime	8.7	Frank Darabont
The Shawshank Redemption	1994-09-23	Drama	8.7	Frank Darabont
Spirited Away	2001-07-20	Family	8.5	Hayao Miyazaki
Spirited Away	2001-07-20	Fantasy	8.5	Hayao Miyazaki
Schindler’s List	1993-12-15	History	8.6	Steven Spielberg
Psycho	1960-06-22	Horror	8.4	Alfred Hitchcock
Whiplash	2014-10-10	Music	8.4	Damien Chazelle
Psycho	1960-06-22	Mystery	8.4	Alfred Hitchcock
Dilwale Dulhania Le Jayenge	1995-10-20	Romance	8.5	Aditya Chopra
The Wild Robot	2024-09-12	Science Fiction	8.5	Chris Sanders
Return	1975-09-04	Short	6.9	Bill Viola
Steven Universe: The Movie	2019-10-07	TV Movie	8.2	Rebecca Sugar
The Dark Knight	2008-07-16	Thriller	8.5	Christopher Nolan
Schindler’s List	1993-12-15	War	8.6	Steven Spielberg
The Good, the Bad and the Ugly	1966-12-22	Western	8.5	Sergio Leone

Conclusion

From Kaggle, this data set about “which movie I should see?” has three files tied by the movie id. I originally tried to upload all three files into MySQL database but was unable to do so. In the end only one table was successfully uploaded and pulled from MySQL the other two tables were uploaded onto Github and pulled with the raw address.

There were not too many cleaning I had to do with this data set. Mostly converting the values to the right data type and separate the columns that had multiple values in one cell. When data is in the right data type like dates, then it will become easier to do analysis that shows trend over time. I tried a couple different ways to analyze the data using the clean data from previous step, I created a movie_details data frame for analysis.

First I found the genre popularity ranking based on user_score. Surprisingly the War genres comes in first even though it doesn’t have as many as categories like comedy or drama but the overall quality of movies in that genre generally scored higher than other categories. Then I tried to find the trend for genre over time and it seems that other than the Animation category that is gaining better ratings over the years, comedy, crime, horror, mystery, and thriller has a noticeable declining rating over the years. Lastly, I looked for the best rated movie titles in each category and noticed that some movies appeared in two different genres. This is due to when the genre is split the user_score stayed the same for each observation of the same movie but for different genre. Maybe if each movie is given a rating score evaluated for that category then we would get a more accurate list of best movies in each genre.