0.0.1 Explanatory data analysis

  • still learn to work with date formats in SQL so will start my analysis using R
  • again since , updating a database using the DBI package is still trivial , i have resorted to creating new columns USING R instead ,thus am Calculating the return on investment as the worldwide_gross/production_budget.
dat_new<-dat_new |> 
  select(-1) |>
  mutate(release_date = mdy(release_date)) |> #mdy is the setup of the date variable
  mutate_if(is.character,as.factor) |> 
  mutate(roi = worldwide_gross / production_budget) |>
  mutate(pct_domestic = domestic_gross / worldwide_gross) |>
  mutate(year = year(release_date)) |> 
  mutate(month = month(release_date,label=T)) |> 
  mutate(day = as.factor(wday(release_date,label=T))) |> 
  arrange(desc(release_date)) |>
  filter(worldwide_gross > 0) |>
  filter(!is.na(distributor)) |>
  filter(!is.na(mpaa_rating)) |> 
  select(year,month,day,everything())

0.1 Setup SQL In R

con<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con,dat_new)
copy_to(con,out_new)

0.1.1 look at the first few rows of the data

dbGetQuery(con,'SELECT *
                FROM dat_new 
                LIMIT 5')
  • We have a little problem , release_date now looks weird , hope we will take care of that along the way
  • i have already tried to pull necessary information from the date column using R , Ionformation like year,month and day

0.1.2 Now lets get querying

dbGetQuery(con,'SELECT *
                FROM dat_new 
                LIMIT 5')

Question 1

  • quantitative look at the average revenue by day
  • let’s remove all of 2018. Let’s get anything produced in 1975 and after (>=1975) but before 2018 (<2018).
dbGetQuery(con,'SELECT day,AVG(worldwide_gross) AS avg_rev
                FROM dat_new 
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY day
                ORDER BY avg_rev DESC')

Question 2

  • quantitative look at the average revenue by month
dbGetQuery(con,'SELECT month,AVG(worldwide_gross) AS avg_rev
                FROM dat_new 
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY month
                ORDER BY avg_rev DESC')

It looks like summer months and holiday months at the end of the year fare well.

Question 3

  • quantitative look at the average revenue by year (TOP 10)
dbGetQuery(con,'SELECT year,AVG(worldwide_gross) AS avg_rev
                FROM dat_new 
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY year
                ORDER BY avg_rev DESC
                LIMIT 10')

Question 4

  • quantitative look at the average revenue by GENRE
dbGetQuery(con,'SELECT genre,AVG(worldwide_gross) AS avg_rev
                FROM dat_new 
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY genre
                ORDER BY avg_rev DESC
                ')
  • On average , Adventure has the highest average worldwide gross

Question 5

  • But what about return on investment by genre?
dbGetQuery(con,'SELECT genre,AVG(roi) AS avg_roi
                FROM dat_new 
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY genre
                ORDER BY avg_roi DESC
                ')

aha! , horror movies have the largest return on investment

Question 6

What were those super profitable movies? Looks like they’re mostly horror movies.

dbGetQuery(con,'SELECT movie,roi
                FROM dat_new 
                WHERE year BETWEEN 1975 AND 2017
                ORDER BY roi DESC
                LIMIT 10
                ')

the highest most profitable is a horror movie , i still think over 40% of those in the top 10 are horror movies

What about rating

What about movie ratings? R-rated movies have a lower average revenue but ROI isn’t substantially less.

dbGetQuery(con,'SELECT mpaa_rating ,AVG(worldwide_gross) AS mean_rev,
                                    SUM(worldwide_gross) AS total_rev,
                                    AVG(roi) AS roi,
                                    COUNT(*) AS number
                FROM dat_new 
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY mpaa_rating
                ')

We can see that while G-rated movies have the highest mean revenue, there were relatively few of them produced, and had a lower total revenue. There were more R-rated movies, but PG-13 movies really drove the total revenue worldwide.

0.1.3 Join to IMDB reviews data

  • Separately for each MPAA rating, display the mean IMDB rating and mean number of votes cast
dbGetQuery(con,'SELECT mpaa_rating ,AVG(imdb) AS mean_imdb,
                                    AVG(votes) AS avg_votes
                                   
                FROM dat_new AS d
                INNER JOIN out_new AS r
                ON d.movie=r.movie
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY mpaa_rating')
  • Separately for each genre, display the mean IMDB rating and mean number of votes cast
dbGetQuery(con,'SELECT genre ,AVG(imdb) AS mean_imdb,
                                    AVG(votes) AS avg_votes
                                   
                FROM dat_new AS d
                INNER JOIN out_new AS r
                ON d.movie=r.movie
                WHERE year BETWEEN 1975 AND 2017
                GROUP BY genre')