In this week’s assignment six movie releases from 2017 served as the root of the SQL database cinema created in MySQL Workbench based on the following films:
The cinema database is comprised of 17 varibles which are listed below; data provided from Wikipedia, IMDB, Rotten Tomatoes.
Data was exported from the movies table from MySQL Workbench to a csv file. In addition to the export the DESCRIBE (TABLE) statement output was also exported to a csv file.
Load the tidyverse package
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#import the describe csv file
describe <- read_csv("C:\\Users\\lizza\\Documents\\CUNY - Data Analytics\\DATA 607_Revised\\Week 2\\Assignment\\Cinema_Describe.csv")
## Parsed with column specification:
## cols(
## Field = col_character(),
## Type = col_character(),
## Null = col_character(),
## Key = col_character(),
## Default = col_character(),
## Extra = col_character()
## )
By displaying the describe tibble we see that MID is considered a primary key which means it can’t be duplicated; primary key’s play an important role in databases as they allow several tables to link by SQL joins.
#display the describe table
describe
## # A tibble: 17 x 6
## Field Type Null Key Default Extra
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 MID int(11) NO PRI NULL <NA>
## 2 Title varchar(255) YES <NA> NULL <NA>
## 3 Descr text YES <NA> NULL <NA>
## 4 Dir varchar(255) YES <NA> NULL <NA>
## 5 Dist varchar(255) YES <NA> NULL <NA>
## 6 Budget int(11) YES <NA> NULL <NA>
## 7 BoxOffice int(11) YES <NA> NULL <NA>
## 8 RD_Month int(11) YES <NA> NULL <NA>
## 9 RD_Day int(11) YES <NA> NULL <NA>
## 10 RD_Year int(11) YES <NA> NULL <NA>
## 11 Rating varchar(5) YES <NA> NULL <NA>
## 12 LengthHr int(11) YES <NA> NULL <NA>
## 13 LengthMin int(11) YES <NA> NULL <NA>
## 14 IMDB double YES <NA> NULL <NA>
## 15 Metascore int(11) YES <NA> NULL <NA>
## 16 Tomato int(11) YES <NA> NULL <NA>
## 17 AudScore int(11) YES <NA> NULL <NA>
#import the cinema csv
cinema <- read_csv("C:\\Users\\lizza\\Documents\\CUNY - Data Analytics\\DATA 607_Revised\\Week 2\\Assignment\\Cinema.csv")
## Parsed with column specification:
## cols(
## MID = col_integer(),
## Title = col_character(),
## Descr = col_character(),
## Dir = col_character(),
## Dist = col_character(),
## Budget = col_integer(),
## BoxOffice = col_integer(),
## RD_Month = col_integer(),
## RD_Day = col_integer(),
## RD_Year = col_integer(),
## Rating = col_character(),
## LengthHr = col_integer(),
## LengthMin = col_integer(),
## IMDB = col_double(),
## Metascore = col_integer(),
## Tomato = col_integer(),
## AudScore = col_integer()
## )
glimpse(cinema)
## Observations: 6
## Variables: 17
## $ MID <int> 1, 2, 3, 4, 5, 6
## $ Title <chr> "Star Wars", "Jumanji: Welcome to the Jungle", "Beau...
## $ Descr <chr> "Rey develops her newly discovered abilites with the...
## $ Dir <chr> "Rian Johnson", "Jake Kasdan", "Bill Condon", "Patty...
## $ Dist <chr> "Walt Disney Studios", "Sony Pictures Entertainment"...
## $ Budget <int> 200000000, 100000000, 160000000, 149000000, 17500000...
## $ BoxOffice <int> 1321000000, 858400000, 1264000000, 821900000, 880200...
## $ RD_Month <int> 12, 12, 3, 6, 7, 4
## $ RD_Day <int> 15, 20, 17, 2, 7, 14
## $ RD_Year <int> 2017, 2017, 2017, 2017, 2017, 2017
## $ Rating <chr> "PG-13", "PG-13", "PG", "PG-13", "PG-13", "PG-13"
## $ LengthHr <int> 2, 1, 2, 2, 2, 2
## $ LengthMin <int> 32, 59, 9, 21, 13, 16
## $ IMDB <dbl> 7.5, 7.2, 7.3, 7.5, 7.5, 6.7
## $ Metascore <int> 85, 58, 65, 76, 73, 56
## $ Tomato <int> 91, 76, 71, 92, 92, 66
## $ AudScore <int> 48, 89, 81, 88, 88, 73
View the names of the cinema tibble
names(cinema)
## [1] "MID" "Title" "Descr" "Dir" "Dist"
## [6] "Budget" "BoxOffice" "RD_Month" "RD_Day" "RD_Year"
## [11] "Rating" "LengthHr" "LengthMin" "IMDB" "Metascore"
## [16] "Tomato" "AudScore"
Using the tidyverse rename the following imported variables: MID, Title, Descr, Dir, Dist, AudScore.
#rename variables
cinema_new <- rename(cinema,MovieID=MID,MovieTitle=Title,Description=Descr,Director=Dir,Distributor=Dist,AudienceScore=AudScore)
#display the column names
colnames(cinema_new)
## [1] "MovieID" "MovieTitle" "Description" "Director"
## [5] "Distributor" "Budget" "BoxOffice" "RD_Month"
## [9] "RD_Day" "RD_Year" "Rating" "LengthHr"
## [13] "LengthMin" "IMDB" "Metascore" "Tomato"
## [17] "AudienceScore"
View the cinema_new tibble
print(cinema_new)
## # A tibble: 6 x 17
## Movie~ Movie~ Descrip~ Dire~ Dist~ Budget BoxOf~ RD_M~ RD_D~ RD_Y~ Rati~
## <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <chr>
## 1 1 Star ~ Rey dev~ Rian~ Walt~ 2.00e8 1.32e9 12 15 2017 PG-13
## 2 2 Juman~ Four te~ Jake~ Sony~ 1.00e8 8.58e8 12 20 2017 PG-13
## 3 3 Beaut~ An adap~ Bill~ Walt~ 1.60e8 1.26e9 3 17 2017 PG
## 4 4 Wonde~ When a ~ Patt~ Warn~ 1.49e8 8.22e8 6 2 2017 PG-13
## 5 5 Spide~ Peter P~ Jon ~ Sony~ 1.75e8 8.80e8 7 7 2017 PG-13
## 6 6 The F~ When a ~ F. G~ Univ~ 2.50e8 1.24e9 4 14 2017 PG-13
## # ... with 6 more variables: LengthHr <int>, LengthMin <int>, IMDB <dbl>,
## # Metascore <int>, Tomato <int>, AudienceScore <int>
Discover the film with the highest budget
cinema_new %>% arrange(desc(Budget)) %>% select(MovieTitle,Budget)
## # A tibble: 6 x 2
## MovieTitle Budget
## <chr> <int>
## 1 The Fate of the Furious 250000000
## 2 Star Wars 200000000
## 3 Spider-Man Homecoming 175000000
## 4 Beauty and the Beast 160000000
## 5 Wonder Woman 149000000
## 6 Jumanji: Welcome to the Jungle 100000000
Based on the findings we can see that Jumanji:Welcome to the Jungle had the lowest budget with 100 million versus The Fate of the Furious with the largest budget of 250 million dollars.
Discover the film that grossed the most at the box office
cinema_new %>% arrange(desc(BoxOffice)) %>% select (MovieTitle,BoxOffice)
## # A tibble: 6 x 2
## MovieTitle BoxOffice
## <chr> <int>
## 1 Star Wars 1321000000
## 2 Beauty and the Beast 1264000000
## 3 The Fate of the Furious 1236000000
## 4 Spider-Man Homecoming 880200000
## 5 Jumanji: Welcome to the Jungle 858400000
## 6 Wonder Woman 821900000
Based on the findings we can see that Star Wars:The Last Jedi was the winner of the 6 films by earning a little over 1.3 billion dollars while Wonder Woman earned over 820 million at the lowest slot.
Even though it is clear that Star Wars: The Last Jedi grossed the most at the box office we should look at the profits based on the budget;create a variable entitled Profits
cinema_new %>% mutate(Profits = BoxOffice - Budget) %>% arrange (desc(Profits)) %>% select (MovieTitle,Profits)
## # A tibble: 6 x 2
## MovieTitle Profits
## <chr> <int>
## 1 Star Wars 1121000000
## 2 Beauty and the Beast 1104000000
## 3 The Fate of the Furious 986000000
## 4 Jumanji: Welcome to the Jungle 758400000
## 5 Spider-Man Homecoming 705200000
## 6 Wonder Woman 672900000
Based on our findings Star Wars: The Last Jedi garnered the highest profits at the box office.
Next, create a new variable entitled Ratio which divides the Profits variable over the BoxOffice variable.
cinema_new %>% mutate(Profits = BoxOffice - Budget) %>% mutate(Ratio = Profits / BoxOffice) %>% arrange (desc(Ratio)) %>% select(MovieTitle,Ratio)
## # A tibble: 6 x 2
## MovieTitle Ratio
## <chr> <dbl>
## 1 Jumanji: Welcome to the Jungle 0.884
## 2 Beauty and the Beast 0.873
## 3 Star Wars 0.849
## 4 Wonder Woman 0.819
## 5 Spider-Man Homecoming 0.801
## 6 The Fate of the Furious 0.798
Based on our findings Jumanji: Welcome to the Jungle was the most successful film out of the six.
Next, based on the LengthHr and LengthMin we will discover the longest movie out of the six by creating a variable entitled TotalTime.
#multiply LengthHr by 60 to convert hours to minutes
cinema_new %>% mutate(TotalTime = (LengthHr * 60) + LengthMin) %>% arrange(desc(TotalTime)) %>% select (MovieTitle,TotalTime)
## # A tibble: 6 x 2
## MovieTitle TotalTime
## <chr> <dbl>
## 1 Star Wars 152
## 2 Wonder Woman 141
## 3 The Fate of the Furious 136
## 4 Spider-Man Homecoming 133
## 5 Beauty and the Beast 129
## 6 Jumanji: Welcome to the Jungle 119
Based on our findings Star Wars: The Last Jedi was the longest film at 152 minutes while Jumanji: Welcome to the Jungle was the shortest with 119 minutes.
IMDB has become one of the leading sites that provide the public with ratings for television and film. Next, we will summarize the IMDB variable, plot the films and see which one had the highest ranking.
summary(cinema_new$IMDB)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.225 7.400 7.283 7.500 7.500
cinema_new %>% arrange (desc(IMDB)) %>% select (MovieTitle,IMDB)
## # A tibble: 6 x 2
## MovieTitle IMDB
## <chr> <dbl>
## 1 Star Wars 7.50
## 2 Wonder Woman 7.50
## 3 Spider-Man Homecoming 7.50
## 4 Beauty and the Beast 7.30
## 5 Jumanji: Welcome to the Jungle 7.20
## 6 The Fate of the Furious 6.70
barplot (cinema_new$IMDB, main = "IMDB Rankings of 2017 Films",xlab="Movie Titles", ylab="IMDB Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))
Based on the findings, Star Wars, Wonder Woman and Spider-Man all received an IMDB score of 7.5 while Fate of the Furious had a score of 6.7.
Similar to IMDB, Metascore provides a ranking score for television, film and also video games.
summary(cinema_new$Metascore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.00 59.75 69.00 68.83 75.25 85.00
cinema_new %>% arrange(desc(Metascore)) %>% select(MovieTitle,Metascore)
## # A tibble: 6 x 2
## MovieTitle Metascore
## <chr> <int>
## 1 Star Wars 85
## 2 Wonder Woman 76
## 3 Spider-Man Homecoming 73
## 4 Beauty and the Beast 65
## 5 Jumanji: Welcome to the Jungle 58
## 6 The Fate of the Furious 56
barplot (cinema_new$Metascore, main = "Metascore Rankings of 2017 Films",xlab="Movie Titles", ylab="Metascore Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))
Star Wars remains at the top, but this time without Wonder Woman and Spider-Man. Jumanji and Fate of the Furious are at the bottom of the Metascore rankings.
One of the newest ranking sites is Rotten Tomatoes which has their own system as well as an AudienceScore which we will also explore.
summary(cinema_new$Tomato)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 66.00 72.25 83.50 81.33 91.75 92.00
cinema_new %>% arrange(desc(Tomato)) %>% select(MovieTitle,Tomato)
## # A tibble: 6 x 2
## MovieTitle Tomato
## <chr> <int>
## 1 Wonder Woman 92
## 2 Spider-Man Homecoming 92
## 3 Star Wars 91
## 4 Jumanji: Welcome to the Jungle 76
## 5 Beauty and the Beast 71
## 6 The Fate of the Furious 66
barplot (cinema_new$Tomato, main = "Rotten Tomatoes Rankings of 2017 Films",xlab="Movie Titles", ylab="Rotten Tomatoes Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))
With a narrow lead Wonder Woman and Spider-Man got the edge over Star Wars while it seems the Fast & Furious movies need to pump the brakes on new films.
summary(cinema_new$AudienceScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 75.00 84.50 77.83 88.00 89.00
cinema_new %>% arrange (desc(AudienceScore)) %>% select(MovieTitle,AudienceScore)
## # A tibble: 6 x 2
## MovieTitle AudienceScore
## <chr> <int>
## 1 Jumanji: Welcome to the Jungle 89
## 2 Wonder Woman 88
## 3 Spider-Man Homecoming 88
## 4 Beauty and the Beast 81
## 5 The Fate of the Furious 73
## 6 Star Wars 48
barplot (cinema_new$AudienceScore, main = "Audience Score (Rotten Tomatoes) of 2017 Films",xlab="Movie Titles", ylab="Audience Score Rating", col= c("lightcyan","mistyrose","beige","yellow","red","green"), names.arg=c("Star","Jumanji","Beauty","Wonder","Spider-Man","Furious"))
Here we see the audience ratings, and they were not pleased with Star Wars but welcomed Jumanji, Wonder Woman, and Spider-Man
Based on the data we had to work with its clear there are endless possibilities if granted access to a full database where one could explore movie genres, movie releases by year, or delve deeper in directors and distribuors.