Data 607 Assignment

Data Set

I select a set of data sets from Kaggle that cointains 4 tables necessary to do this project. The data sets are from both The Movie Database and IMDB. these files are located in this links https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

Content

This dataset consists of the following files:

movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

ratings.csv: It has more than 100,000 ratings from more than 1000’s users.

Cleaning the data is necessary to handle unsuported CSV parameters in Postgress. I Trim the data and work on long character format to Eliminate unwanted symbols and separate needed strings.

# load library needed  in R workspace 
library(stringr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DBI)
library(RODBC)
library(odbc)
library(RPostgres)

data <- read.csv("movies_metadata.csv", header = T)
head(data)

##      id   imdb_id adult   budget
## 1   862 tt0114709 FALSE 30000000
## 2  8844 tt0113497 FALSE 65000000
## 3 15602 tt0113228 FALSE        0
## 4 31357 tt0114885 FALSE 16000000
## 5 11862 tt0113041 FALSE        0
## 6   949 tt0113277 FALSE 60000000
##                                                                                                                     genres
## 1                         [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
## 2                        [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
## 3                                                         [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
## 4                            [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
## 5                                                                                           [{'id': 35, 'name': 'Comedy'}]
## 6 [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
##                               homepage original_language
## 1 http://toystory.disney.com/toy-story                en
## 2                                                     en
## 3                                                     en
## 4                                                     en
## 5                                                     en
## 6                                                     en
##                original_title popularity
## 1                   Toy Story  21.946943
## 2                     Jumanji  17.015539
## 3            Grumpier Old Men    11.7129
## 4           Waiting to Exhale   3.859495
## 5 Father of the Bride Part II   8.387519
## 6                        Heat  17.924927
##                                                                                                                  production_companies
## 1                                                                                      [{'name': 'Pixar Animation Studios', 'id': 3}]
## 2 [{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]
## 3                                                     [{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]
## 4                                                                     [{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]
## 5                                        [{'name': 'Sandollar Productions', 'id': 5842}, {'name': 'Touchstone Pictures', 'id': 9195}]
## 6             [{'name': 'Regency Enterprises', 'id': 508}, {'name': 'Forward Pass', 'id': 675}, {'name': 'Warner Bros.', 'id': 6194}]
##                                         production_countries release_date
## 1 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   10/30/1995
## 2 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   12/15/1995
## 3 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   12/22/1995
## 4 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   12/22/1995
## 5 [{'iso_3166_1': 'US', 'name': 'United States of America'}]    2/10/1995
## 6 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   12/15/1995
##     revenue runtime
## 1 373554033      81
## 2 262797249     104
## 3         0     101
## 4  81452156     127
## 5  76578911     106
## 6 187436818     170
##                                                                    spoken_languages
## 1                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 2 [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]
## 3                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 4                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 5                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 6  [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'es', 'name': 'Español'}]
##     status                       title video vote_average vote_count
## 1 Released                   Toy Story FALSE          7.7       5415
## 2 Released                     Jumanji FALSE          6.9       2413
## 3 Released            Grumpier Old Men FALSE          6.5         92
## 4 Released           Waiting to Exhale FALSE          6.1         34
## 5 Released Father of the Bride Part II FALSE          5.7        173
## 6 Released                        Heat FALSE          7.7       1886

String manipulation needed before creating the database. Take a look at the “” [{‘id’: 16, ‘name’: ‘Animation’}, {‘id’: 35, ‘name’: ‘Comedy’}, {‘id’: 10751, ‘name’: ‘Family’}] “” .

# renaming the variable and 
movie_dataset <- data %>% mutate(genres= str_match(data$genres, "[A-Z][a-z]+"),
                      production_companies = str_match(data$production_companies, "[A-Z][a-z]+ [A-Z][a-z]++"),
                      production_countries = str_extract(data$production_countries, "[A-Z][a-z]+ [A-Z][a-z]++"),
                      spoken_languages = str_match(data$spoken_languages, "[A-Z][a-z]+")
                      ) %>%  distinct()
head(movie_dataset)

##      id   imdb_id adult   budget    genres                             homepage
## 1   862 tt0114709 FALSE 30000000 Animation http://toystory.disney.com/toy-story
## 2  8844 tt0113497 FALSE 65000000 Adventure                                     
## 3 15602 tt0113228 FALSE        0   Romance                                     
## 4 31357 tt0114885 FALSE 16000000    Comedy                                     
## 5 11862 tt0113041 FALSE        0    Comedy                                     
## 6   949 tt0113277 FALSE 60000000    Action                                     
##   original_language              original_title popularity
## 1                en                   Toy Story  21.946943
## 2                en                     Jumanji  17.015539
## 3                en            Grumpier Old Men    11.7129
## 4                en           Waiting to Exhale   3.859495
## 5                en Father of the Bride Part II   8.387519
## 6                en                        Heat  17.924927
##    production_companies production_countries release_date   revenue runtime
## 1       Pixar Animation        United States   10/30/1995 373554033      81
## 2         Star Pictures        United States   12/15/1995 262797249     104
## 3           Warner Bros        United States   12/22/1995         0     101
## 4     Twentieth Century        United States   12/22/1995  81452156     127
## 5 Sandollar Productions        United States    2/10/1995  76578911     106
## 6   Regency Enterprises        United States   12/15/1995 187436818     170
##   spoken_languages   status                       title video vote_average
## 1          English Released                   Toy Story FALSE          7.7
## 2          English Released                     Jumanji FALSE          6.9
## 3          English Released            Grumpier Old Men FALSE          6.5
## 4          English Released           Waiting to Exhale FALSE          6.1
## 5          English Released Father of the Bride Part II FALSE          5.7
## 6          English Released                        Heat FALSE          7.7
##   vote_count
## 1       5415
## 2       2413
## 3         92
## 4         34
## 5        173
## 6       1886

This code is grey-out because it will overide the file in my computer. this is the next step before moving the file to PostGres PGAdmin 4 : write.table(movie_dataset,“movies_metadata_clean.txt”, sep = “, row.names = F)

Let run the needed libraries first

# database library 
library(DBI)
library(RODBC)
library(odbc)
library(RPostgres)
library(ggplot2)
library(hrbrthemes)

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(kableExtra)

## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(gt)

Let connect the database to the PostGres Database. It is important to make sure that the required ODBC driver must be identified and installed in order to access data from any database to another platform.

Best way to add the driver type or copy this in Windows Explore : <C:32.exe>

# Create connection to db
con <- dbConnect(odbc::odbc(), "PostgreSQL30", timeout = 10)
dbListTables(con)

## [1] "film"            "links"           "movie_ratings"   "movies"         
## [5] "movies_2"        "movies_final"    "movies_main"     "movies_metadata"
## [9] "ratings"

Table Creation

The following syntax below is used to create the table in the database: CREATE TABLE public.movies_metadata ( id bigint NOT NULL, imdb_id bigint NOT NULL, adult boolean, budget bigint, genres character, homepage character, original_language character, original_title character, popularity bigint, production_companies character, production_countries character, release_date date, revenue numeric, runtime bigint, spoken_languages character, status character, title character, video boolean, vote_average numeric, vote_count numeric, PRIMARY KEY (id) );

ALTER TABLE IF EXISTS public.movies_metadata OWNER to postgres;

create table movies_2 as select distinct id, imdb_id, genres, original_title , production_companies, release_date, revenue, runtime, spoken_languages, vote_average, vote_count from movies

CREATE TABLE IF NOT EXISTS public.ratings ( user_id character varying COLLATE pg_catalog.”default” NOT NULL, movie_id character varying COLLATE pg_catalog.”default”, ratings numeric(1000,0), “timestamp” bigint, CONSTRAINT ratings_pkey PRIMARY KEY (user_id) )

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.ratings OWNER to postgres;

CREATE TABLE IF NOT EXISTS public.links ( movie_id character varying COLLATE pg_catalog.”default”, imdb_id character varying COLLATE pg_catalog.”default” NOT NULL, tmdb_id character varying COLLATE pg_catalog.”default”, CONSTRAINT links_pkey PRIMARY KEY (imdb_id) )

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.links OWNER to postgres;

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.ratings OWNER to postgres;

REATE TABLE IF NOT EXISTS public.film ( id character varying COLLATE pg_catalog.”default” NOT NULL, imdb_id text COLLATE pg_catalog.”default” NOT NULL, genres character varying COLLATE pg_catalog.”default”, original_title character varying COLLATE pg_catalog.”default”, production_companies character varying COLLATE pg_catalog.”default”, release_date character varying COLLATE pg_catalog.”default”, revenue character varying COLLATE pg_catalog.”default”, status character varying COLLATE pg_catalog.”default”, vote_average character varying COLLATE pg_catalog.”default”, vote_count character varying COLLATE pg_catalog.”default”, CONSTRAINT film_pkey PRIMARY KEY (id), CONSTRAINT film_fkey FOREIGN KEY (id) REFERENCES public.ratings (user_id) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION NOT VALID )

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.film OWNER to postgres;

Analysis

I am going to read the data from my PostgreSQL Connection. I create 3 variables using the three tables i have in my database. It is very important to filter large table so R studio doesn’t crash.

# Loading subsets of all film links and rating 
film <-  dbGetQuery(con, "select * from public.film where revenue >= 100000000")
str(film)

## 'data.frame':    1373 obs. of  10 variables:
##  $ id                  : chr  "10003" "100042" "10020" "10021" ...
##  $ imdb_id             : chr  "120053" "2096672" "101414" "93936" ...
##  $ genres              : chr  "Thriller" "Comedy" "Romance" "Comedy" ...
##  $ original_title      : chr  "The Saint" "Dumb and Dumber To" "Beauty and the Beast" "The Secret of My Success" ...
##  $ production_companies: chr  "Paramount Pictures" "New Line" "Walt Disney" "Universal Pictures" ...
##  $ release_date        : chr  "4/3/1997" "11/12/2014" "11/13/1991" "4/10/1987" ...
##  $ revenue             : num  1.18e+08 1.70e+08 3.77e+08 1.11e+08 1.13e+08 ...
##  $ status              : chr  "Released" "Released" "Released" "Released" ...
##  $ vote_average        : num  5.9 5.4 7.5 6.2 5.8 7.6 5.5 6 7.4 5.8 ...
##  $ vote_count          : num  310 1140 3029 151 865 ...

links <- dbGetQuery(con, "select * from public.links")
str(links)

## 'data.frame':    45843 obs. of  3 variables:
##  $ movie_id: chr  "1" "2" "3" "4" ...
##  $ imdb_id : chr  "114709" "113497" "113228" "114885" ...
##  $ tmdb_id : chr  "862" "8844" "15602" "31357" ...

ratings <- dbGetQuery(con, "select * from public.ratings where ratings >= 5.0")
str(ratings)

## 'data.frame':    5982940 obs. of  4 variables:
##  $ user_id  : chr  "11768" "11768" "11768" "11768" ...
##  $ movie_id : chr  "92210" "102445" "103372" "107406" ...
##  $ ratings  : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ timestamp:integer64 1465358338 1465355384 1465357668 1465355884 1465354970 1465354093 1465354774 1465354126 ...

summary(film)

##       id              imdb_id             genres          original_title    
##  Length:1373        Length:1373        Length:1373        Length:1373       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  production_companies release_date          revenue             status         
##  Length:1373          Length:1373        Min.   :1.000e+08   Length:1373       
##  Class :character     Class :character   1st Qu.:1.362e+08   Class :character  
##  Mode  :character     Mode  :character   Median :1.957e+08   Mode  :character  
##                                          Mean   :2.822e+08                     
##                                          3rd Qu.:3.320e+08                     
##                                          Max.   :2.788e+09                     
##   vote_average     vote_count   
##  Min.   :3.400   Min.   :    2  
##  1st Qu.:5.900   1st Qu.:  639  
##  Median :6.500   Median : 1289  
##  Mean   :6.478   Mean   : 1903  
##  3rd Qu.:7.000   3rd Qu.: 2459  
##  Max.   :9.100   Max.   :14075

summary(ratings)

##    user_id            movie_id            ratings    timestamp         
##  Length:5982940     Length:5982940     Min.   :5   Min.   : 789652009  
##  Class :character   Class :character   1st Qu.:5   1st Qu.: 996807846  
##  Mode  :character   Mode  :character   Median :5   Median :1180239830  
##                                        Mean   :5   Mean   :1188520851  
##                                        3rd Qu.:5   3rd Qu.:1398545662  
##                                        Max.   :5   Max.   :1501828648

summary(links)

##    movie_id           imdb_id            tmdb_id         
##  Length:45843       Length:45843       Length:45843      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

I am going to rearrange the revenue column so i can select the highest revenue. Notice there are other ways to do it if your data is not organized.

# Join all table by their necessary primarykey. 
top_movies <- film %>% inner_join(links) %>% distinct() %>% 
  inner_join(ratings) %>% arrange(desc(revenue)) %>% 
  select(id,imdb_id,genres,original_title,production_companies,release_date,
         revenue,vote_average,tmdb_id,ratings) %>% distinct()

## Joining with `by = join_by(imdb_id)`
## Joining with `by = join_by(movie_id)`

head(top_movies)

##       id imdb_id  genres               original_title  production_companies
## 1  19995  499549  Action                       Avatar        Ingenious Film
## 2 140607 2488496  Action Star Wars: The Force Awakens Truenorth Productions
## 3    597  120338   Drama                      Titanic    Paramount Pictures
## 4  24428  848228 Science                 The Avengers    Paramount Pictures
## 5 135397  369610  Action               Jurassic World     Universal Studios
## 6 168259 2820852  Action                    Furious 7    Universal Pictures
##   release_date    revenue vote_average tmdb_id ratings
## 1   12/10/2009 2787965087          7.2   19995       5
## 2   12/15/2015 2068223624          7.5  140607       5
## 3   11/18/1997 1845034188          7.5     597       5
## 4    4/25/2012 1519557910          7.4   24428       5
## 5     6/9/2015 1513528810          6.5  135397       5
## 6     4/1/2015 1506249360          7.3  168259       5

# 20 highest earning movies
head(top_movies,20) %>% ggplot( aes(x=original_title, y=vote_average) ) +
  geom_bar(stat="identity", fill="#69b3a2") +
  coord_flip() +
  theme_ipsum() +
  theme(
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_blank(),
    legend.position="none"
  ) + 
  xlab("Top 20 Highest Earning Box Office Movies") +   ggtitle("IMDB Average Vote per User")+
  ylab("Average IMDB fan votes")

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

top_movies %>% head(20) %>% ggplot( aes(x=genres, y=revenue)) +
  geom_segment( aes(xend=genres, yend=0)) +
  geom_point( size=4, color="orange") +
  theme_bw() +
  xlab("Top 20 Highest Earning Box Office Movies") +   ggtitle("Highest Earning Box Office movies by Genres")+
  ylab("Highest Earning Movies by Genres")

# Box Office of All Time 
All_time <- head(top_movies,10) %>% select(original_title, genres, revenue, release_date) %>%
  gt() %>%
  tab_header(
    title = "Best Rated Movies of All Time",
    subtitle = "Highest earning Movies of All Time "
  ) 
All_time

original_title	genres	revenue	release_date
Best Rated Movies of All Time
Highest earning Movies of All Time
Avatar	Action	2787965087	12/10/2009
Star Wars: The Force Awakens	Action	2068223624	12/15/2015
Titanic	Drama	1845034188	11/18/1997
The Avengers	Science	1519557910	4/25/2012
Jurassic World	Action	1513528810	6/9/2015
Furious 7	Action	1506249360	4/1/2015
Avengers: Age of Ultron	Action	1405403694	4/22/2015
Harry Potter and the Deathly Hallows: Part 2	Family	1342000000	7/7/2011
Frozen	Animation	1274219009	11/27/2013
Beauty and the Beast	Family	1262886337	3/16/2017

Conclusion

As Movie Fan, i can agree with the relevant information that i discover in the data set. Avatar has always been the highest earning movies of all time. Avengers makes quite some noise when it came out. We notice that action movies are by far the highest earning movies in the industry. There is still a big gap when comparing movie earnings by genres while only one Animation Frozen movies made it to the top 10.

I upload some the files in github. See link below: https://github.com/joewarner89/CUNY-607/tree/main/homeworks/Assignment2

Data 607 Assignment – SQL and R

Warner Alexis

2023-09-16

Data Set

Content

Table Creation

Analysis

Conclusion