Synopsis

In this week’s assignment six movie releases from 2017 served as the root of the SQL database cinema created in MySQL Workbench based on the following films:

Data Dictionary

The cinema database is comprised of 17 varibles which are listed below; data provided from Wikipedia, IMDB, Rotten Tomatoes.

    • MID - Movie ID variable
    • Title - Movie Title
    • Descr - Brief description
    • Dir - Director
    • Dist - Movie Studio Disbriutor
    • Budget - Movie Budget
    • BoxOffice - Box Office Profits
    • RD_Month - Release Date (Month)
    • RD_Day - Release Date (Day)
    • RD_Year - Release Date (Year)
    • Rating - Movie Rating (i.e. PG, PG-13)
    • LengthHr - Movie Length (Hour)
    • LengthMin - Movie Length (Minutes)
    • IMDB - IMDB Movie Rating
    • Metascore - Metascore Rating
    • Tomato - Rotten Tomatoes Rating
    • AudScore - Audience Score

MySQL Workbench

Data was exported from the movies table from MySQL Workbench to a csv file. In addition to the export the DESCRIBE (TABLE) statement output was also exported to a csv file.

Load the tidyverse package

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#import the describe csv file
describe <- read_csv("C:\\Users\\lizza\\Documents\\CUNY - Data Analytics\\DATA 607_Revised\\Week 2\\Assignment\\Cinema_Describe.csv")
## Parsed with column specification:
## cols(
##   Field = col_character(),
##   Type = col_character(),
##   Null = col_character(),
##   Key = col_character(),
##   Default = col_character(),
##   Extra = col_character()
## )

By displaying the describe tibble we see that MID is considered a primary key which means it can’t be duplicated; primary key’s play an important role in databases as they allow several tables to link by SQL joins.

#display the describe table
describe
## # A tibble: 17 x 6
##    Field     Type         Null  Key   Default Extra
##    <chr>     <chr>        <chr> <chr> <chr>   <chr>
##  1 MID       int(11)      NO    PRI   NULL    <NA> 
##  2 Title     varchar(255) YES   <NA>  NULL    <NA> 
##  3 Descr     text         YES   <NA>  NULL    <NA> 
##  4 Dir       varchar(255) YES   <NA>  NULL    <NA> 
##  5 Dist      varchar(255) YES   <NA>  NULL    <NA> 
##  6 Budget    int(11)      YES   <NA>  NULL    <NA> 
##  7 BoxOffice int(11)      YES   <NA>  NULL    <NA> 
##  8 RD_Month  int(11)      YES   <NA>  NULL    <NA> 
##  9 RD_Day    int(11)      YES   <NA>  NULL    <NA> 
## 10 RD_Year   int(11)      YES   <NA>  NULL    <NA> 
## 11 Rating    varchar(5)   YES   <NA>  NULL    <NA> 
## 12 LengthHr  int(11)      YES   <NA>  NULL    <NA> 
## 13 LengthMin int(11)      YES   <NA>  NULL    <NA> 
## 14 IMDB      double       YES   <NA>  NULL    <NA> 
## 15 Metascore int(11)      YES   <NA>  NULL    <NA> 
## 16 Tomato    int(11)      YES   <NA>  NULL    <NA> 
## 17 AudScore  int(11)      YES   <NA>  NULL    <NA>

Import the Data

#import the cinema csv
cinema <- read_csv("C:\\Users\\lizza\\Documents\\CUNY - Data Analytics\\DATA 607_Revised\\Week 2\\Assignment\\Cinema.csv")
## Parsed with column specification:
## cols(
##   MID = col_integer(),
##   Title = col_character(),
##   Descr = col_character(),
##   Dir = col_character(),
##   Dist = col_character(),
##   Budget = col_integer(),
##   BoxOffice = col_integer(),
##   RD_Month = col_integer(),
##   RD_Day = col_integer(),
##   RD_Year = col_integer(),
##   Rating = col_character(),
##   LengthHr = col_integer(),
##   LengthMin = col_integer(),
##   IMDB = col_double(),
##   Metascore = col_integer(),
##   Tomato = col_integer(),
##   AudScore = col_integer()
## )
glimpse(cinema)
## Observations: 6
## Variables: 17
## $ MID       <int> 1, 2, 3, 4, 5, 6
## $ Title     <chr> "Star Wars", "Jumanji: Welcome to the Jungle", "Beau...
## $ Descr     <chr> "Rey develops her newly discovered abilites with the...
## $ Dir       <chr> "Rian Johnson", "Jake Kasdan", "Bill Condon", "Patty...
## $ Dist      <chr> "Walt Disney Studios", "Sony Pictures Entertainment"...
## $ Budget    <int> 200000000, 100000000, 160000000, 149000000, 17500000...
## $ BoxOffice <int> 1321000000, 858400000, 1264000000, 821900000, 880200...
## $ RD_Month  <int> 12, 12, 3, 6, 7, 4
## $ RD_Day    <int> 15, 20, 17, 2, 7, 14
## $ RD_Year   <int> 2017, 2017, 2017, 2017, 2017, 2017
## $ Rating    <chr> "PG-13", "PG-13", "PG", "PG-13", "PG-13", "PG-13"
## $ LengthHr  <int> 2, 1, 2, 2, 2, 2
## $ LengthMin <int> 32, 59, 9, 21, 13, 16
## $ IMDB      <dbl> 7.5, 7.2, 7.3, 7.5, 7.5, 6.7
## $ Metascore <int> 85, 58, 65, 76, 73, 56
## $ Tomato    <int> 91, 76, 71, 92, 92, 66
## $ AudScore  <int> 48, 89, 81, 88, 88, 73

View the names of the cinema tibble

names(cinema)
##  [1] "MID"       "Title"     "Descr"     "Dir"       "Dist"     
##  [6] "Budget"    "BoxOffice" "RD_Month"  "RD_Day"    "RD_Year"  
## [11] "Rating"    "LengthHr"  "LengthMin" "IMDB"      "Metascore"
## [16] "Tomato"    "AudScore"

Using the tidyverse rename the following imported variables: MID, Title, Descr, Dir, Dist, AudScore.

#rename variables
cinema_new <- rename(cinema,MovieID=MID,MovieTitle=Title,Description=Descr,Director=Dir,Distributor=Dist,AudienceScore=AudScore)
#display the column names
colnames(cinema_new)
##  [1] "MovieID"       "MovieTitle"    "Description"   "Director"     
##  [5] "Distributor"   "Budget"        "BoxOffice"     "RD_Month"     
##  [9] "RD_Day"        "RD_Year"       "Rating"        "LengthHr"     
## [13] "LengthMin"     "IMDB"          "Metascore"     "Tomato"       
## [17] "AudienceScore"

View the cinema_new tibble

print(cinema_new)
## # A tibble: 6 x 17
##   Movie~ Movie~ Descrip~ Dire~ Dist~ Budget BoxOf~ RD_M~ RD_D~ RD_Y~ Rati~
##    <int> <chr>  <chr>    <chr> <chr>  <int>  <int> <int> <int> <int> <chr>
## 1      1 Star ~ Rey dev~ Rian~ Walt~ 2.00e8 1.32e9    12    15  2017 PG-13
## 2      2 Juman~ Four te~ Jake~ Sony~ 1.00e8 8.58e8    12    20  2017 PG-13
## 3      3 Beaut~ An adap~ Bill~ Walt~ 1.60e8 1.26e9     3    17  2017 PG   
## 4      4 Wonde~ When a ~ Patt~ Warn~ 1.49e8 8.22e8     6     2  2017 PG-13
## 5      5 Spide~ Peter P~ Jon ~ Sony~ 1.75e8 8.80e8     7     7  2017 PG-13
## 6      6 The F~ When a ~ F. G~ Univ~ 2.50e8 1.24e9     4    14  2017 PG-13
## # ... with 6 more variables: LengthHr <int>, LengthMin <int>, IMDB <dbl>,
## #   Metascore <int>, Tomato <int>, AudienceScore <int>

Discover the film with the highest budget

cinema_new %>% arrange(desc(Budget)) %>% select(MovieTitle,Budget)
## # A tibble: 6 x 2
##   MovieTitle                        Budget
##   <chr>                              <int>
## 1 The Fate of the Furious        250000000
## 2 Star Wars                      200000000
## 3 Spider-Man Homecoming          175000000
## 4 Beauty and the Beast           160000000
## 5 Wonder Woman                   149000000
## 6 Jumanji: Welcome to the Jungle 100000000

Based on the findings we can see that Jumanji:Welcome to the Jungle had the lowest budget with 100 million versus The Fate of the Furious with the largest budget of 250 million dollars.

Discover the film that grossed the most at the box office

cinema_new %>% arrange(desc(BoxOffice)) %>% select (MovieTitle,BoxOffice)
## # A tibble: 6 x 2
##   MovieTitle                      BoxOffice
##   <chr>                               <int>
## 1 Star Wars                      1321000000
## 2 Beauty and the Beast           1264000000
## 3 The Fate of the Furious        1236000000
## 4 Spider-Man Homecoming           880200000
## 5 Jumanji: Welcome to the Jungle  858400000
## 6 Wonder Woman                    821900000

Based on the findings we can see that Star Wars:The Last Jedi was the winner of the 6 films by earning a little over 1.3 billion dollars while Wonder Woman earned over 820 million at the lowest slot.

Even though it is clear that Star Wars: The Last Jedi grossed the most at the box office we should look at the profits based on the budget;create a variable entitled Profits

cinema_new %>% mutate(Profits = BoxOffice - Budget) %>% arrange (desc(Profits)) %>% select (MovieTitle,Profits)
## # A tibble: 6 x 2
##   MovieTitle                        Profits
##   <chr>                               <int>
## 1 Star Wars                      1121000000
## 2 Beauty and the Beast           1104000000
## 3 The Fate of the Furious         986000000
## 4 Jumanji: Welcome to the Jungle  758400000
## 5 Spider-Man Homecoming           705200000
## 6 Wonder Woman                    672900000

Based on our findings Star Wars: The Last Jedi garnered the highest profits at the box office.

Next, create a new variable entitled Ratio which divides the Profits variable over the BoxOffice variable.

cinema_new %>% mutate(Profits = BoxOffice - Budget) %>% mutate(Ratio = Profits / BoxOffice) %>% arrange (desc(Ratio)) %>% select(MovieTitle,Ratio)
## # A tibble: 6 x 2
##   MovieTitle                     Ratio
##   <chr>                          <dbl>
## 1 Jumanji: Welcome to the Jungle 0.884
## 2 Beauty and the Beast           0.873
## 3 Star Wars                      0.849
## 4 Wonder Woman                   0.819
## 5 Spider-Man Homecoming          0.801
## 6 The Fate of the Furious        0.798

Based on our findings Jumanji: Welcome to the Jungle was the most successful film out of the six.

Next, based on the LengthHr and LengthMin we will discover the longest movie out of the six by creating a variable entitled TotalTime.

#multiply LengthHr by 60 to convert hours to minutes
cinema_new %>% mutate(TotalTime = (LengthHr * 60) + LengthMin) %>% arrange(desc(TotalTime)) %>% select (MovieTitle,TotalTime)
## # A tibble: 6 x 2
##   MovieTitle                     TotalTime
##   <chr>                              <dbl>
## 1 Star Wars                            152
## 2 Wonder Woman                         141
## 3 The Fate of the Furious              136
## 4 Spider-Man Homecoming                133
## 5 Beauty and the Beast                 129
## 6 Jumanji: Welcome to the Jungle       119

Based on our findings Star Wars: The Last Jedi was the longest film at 152 minutes while Jumanji: Welcome to the Jungle was the shortest with 119 minutes.

Ratings

IMDB

IMDB has become one of the leading sites that provide the public with ratings for television and film. Next, we will summarize the IMDB variable, plot the films and see which one had the highest ranking.

summary(cinema_new$IMDB)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.225   7.400   7.283   7.500   7.500
cinema_new %>% arrange (desc(IMDB)) %>% select (MovieTitle,IMDB)
## # A tibble: 6 x 2
##   MovieTitle                      IMDB
##   <chr>                          <dbl>
## 1 Star Wars                       7.50
## 2 Wonder Woman                    7.50
## 3 Spider-Man Homecoming           7.50
## 4 Beauty and the Beast            7.30
## 5 Jumanji: Welcome to the Jungle  7.20
## 6 The Fate of the Furious         6.70
barplot (cinema_new$IMDB, main = "IMDB Rankings of 2017 Films",xlab="Movie Titles", ylab="IMDB Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))

Based on the findings, Star Wars, Wonder Woman and Spider-Man all received an IMDB score of 7.5 while Fate of the Furious had a score of 6.7.

Metascore

Similar to IMDB, Metascore provides a ranking score for television, film and also video games.

summary(cinema_new$Metascore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   59.75   69.00   68.83   75.25   85.00
cinema_new %>% arrange(desc(Metascore)) %>% select(MovieTitle,Metascore)
## # A tibble: 6 x 2
##   MovieTitle                     Metascore
##   <chr>                              <int>
## 1 Star Wars                             85
## 2 Wonder Woman                          76
## 3 Spider-Man Homecoming                 73
## 4 Beauty and the Beast                  65
## 5 Jumanji: Welcome to the Jungle        58
## 6 The Fate of the Furious               56
barplot (cinema_new$Metascore, main = "Metascore Rankings of 2017 Films",xlab="Movie Titles", ylab="Metascore Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))

Star Wars remains at the top, but this time without Wonder Woman and Spider-Man. Jumanji and Fate of the Furious are at the bottom of the Metascore rankings.

Rotten Tomatoes & Audience Score

One of the newest ranking sites is Rotten Tomatoes which has their own system as well as an AudienceScore which we will also explore.

summary(cinema_new$Tomato)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   66.00   72.25   83.50   81.33   91.75   92.00
cinema_new %>% arrange(desc(Tomato)) %>% select(MovieTitle,Tomato)
## # A tibble: 6 x 2
##   MovieTitle                     Tomato
##   <chr>                           <int>
## 1 Wonder Woman                       92
## 2 Spider-Man Homecoming              92
## 3 Star Wars                          91
## 4 Jumanji: Welcome to the Jungle     76
## 5 Beauty and the Beast               71
## 6 The Fate of the Furious            66
barplot (cinema_new$Tomato, main = "Rotten Tomatoes Rankings of 2017 Films",xlab="Movie Titles", ylab="Rotten Tomatoes Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))

With a narrow lead Wonder Woman and Spider-Man got the edge over Star Wars while it seems the Fast & Furious movies need to pump the brakes on new films.

summary(cinema_new$AudienceScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   75.00   84.50   77.83   88.00   89.00
cinema_new %>% arrange (desc(AudienceScore)) %>% select(MovieTitle,AudienceScore)
## # A tibble: 6 x 2
##   MovieTitle                     AudienceScore
##   <chr>                                  <int>
## 1 Jumanji: Welcome to the Jungle            89
## 2 Wonder Woman                              88
## 3 Spider-Man Homecoming                     88
## 4 Beauty and the Beast                      81
## 5 The Fate of the Furious                   73
## 6 Star Wars                                 48
barplot (cinema_new$AudienceScore, main = "Audience Score (Rotten Tomatoes) of 2017 Films",xlab="Movie Titles", ylab="Audience Score Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))

Here we see the audience ratings, and they were not pleased with Star Wars but welcomed Jumanji, Wonder Woman, and Spider-Man

Conclusion

Based on the data we had to work with its clear there are endless possibilities if granted access to a full database where one could explore movie genres, movie releases by year, or delve deeper in directors and distribuors.