Finding a proper metric relating budget and gross in the form of gross profit and gross margin, then cleaning it up for outliers and relating it to a number of other values including but not limited to year user ratings and actor facebook likes. I created the new metrics already in a new, although seperate, excel file for future use.
R Tableau Watson Excel
# Here we are checking if the package is installed
if(!require("tidyverse")){
# If the package is not in the system then it will be install
install.packages("tidyverse", dependencies = TRUE)
# Here we are loading the package
library("tidyverse")
}
## Loading required package: tidyverse
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Here we are checking if the package is installed
if(!require("corrplot")){
# If the package is not in the system then it will be install
install.packages("corrplot", dependencies = TRUE)
# Here we are loading the package
library("corrplot")
}
## Loading required package: corrplot
## corrplot 0.84 loaded
Source: Rotten Tomatoes Year: Retreived 2016 data ranging from 1916 - 2016 Country: Various
Movie data from rottentomatoes.com including metrics such as main actor facebook likes, budget, year of release, gross, and user rating.
mydata = read.csv(file = 'data/rottentomatoes.csv')
head(mydata)
## title
## 1 AvatarÂÂ
## 2 Pirates of the Caribbean: At World's EndÂÂ
## 3 SpectreÂÂ
## 4 The Dark Knight RisesÂÂ
## 5 Star Wars: Episode VII - The Force AwakensÂÂ
## 6 John CarterÂÂ
## genres director actor1
## 1 Action|Adventure|Fantasy|Sci-Fi James Cameron CCH Pounder
## 2 Action|Adventure|Fantasy Gore Verbinski Johnny Depp
## 3 Action|Adventure|Thriller Sam Mendes Christoph Waltz
## 4 Action|Thriller Christopher Nolan Tom Hardy
## 5 Documentary Doug Walker Doug Walker
## 6 Action|Adventure|Sci-Fi Andrew Stanton Daryl Sabara
## actor2 actor3 length budget director_fb_likes
## 1 Joel David Moore Wes Studi 178 237000000 0
## 2 Orlando Bloom Jack Davenport 169 300000000 563
## 3 Rory Kinnear Stephanie Sigman 148 245000000 0
## 4 Christian Bale Joseph Gordon-Levitt 164 250000000 22000
## 5 Rob Walker NA NA 131
## 6 Samantha Morton Polly Walker 132 263700000 475
## actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1 1000 936 855 4834
## 2 40000 5000 1000 48350
## 3 11000 393 161 11700
## 4 27000 23000 23000 106759
## 5 131 12 NA 143
## 6 640 632 530 1873
## fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1 33000 723 3054 886204 7.9 1.78
## 2 0 302 1238 471220 7.1 2.35
## 3 85000 602 994 275868 6.8 2.35
## 4 164000 813 2701 1144337 8.5 2.35
## 5 0 NA NA 8 7.1 NA
## 6 24000 462 738 212204 6.6 2.35
## gross year
## 1 760505847 2009
## 2 309404152 2007
## 3 200074175 2015
## 4 448130642 2012
## 5 NA NA
## 6 73058679 2012
From first glance, exploring and sorting the data in excel, there are clear issues with many of the data values relating to budget and gross, these will eventually be solved anyway by excluding outliers but I believe it is caused by misentry of data and possibly through the use of different currencies. In addition to this there are some weird characters that appear in the title column.
The statistics are for different movies, the title of the movie, genre, director, main actors, length of the movie, budget, facebook likes for director and actor, reviews for the movie, gross, and the year the movie was made.
Cleaned the names of the films to remove weird character just to make it look better. Seperated genres into categories so that during analysis they can be better utilized. Removed a number of movies that had either no listed budget or gross. Removed duplicate entries. Created two columns to calculate gross profit and margin for analysis. Removed incomplete entries.
Done in excel.
data2 = read.csv(file = 'data/rottentomatoes copy 2.csv')
head(data2)
## title genre1 genre2 genre3 genre4
## 1 Paranormal Activity Horror
## 2 Tarnation Biography Documentary
## 3 The Blair Witch Project Horror
## 4 The Brothers McMullen Comedy Drama Romance
## 5 The Texas Chain Saw Massacre Horror Thriller
## 6 El Mariachi Action Crime Drama Romance
## genre5 genre6 genre7 genre8 director actor1
## 1 Oren Peli Micah Sloat
## 2 Jonathan Caouette Greg Ayres
## 3 Daniel Myrick Heather Donahue
## 4 Edward Burns Shari Albert
## 5 Tobe Hooper Gunnar Hansen
## 6 Thriller Robert Rodriguez Carlos Gallardo
## actor2 actor3 length budget director_fb_likes
## 1 Ashley Palmer Amber Armstrong 84 15000 110
## 2 Jonathan Caouette Renee Leblanc 88 218 20
## 3 Joshua Leonard Michael C. Williams 81 60000 19
## 4 Michael McGlone Maxine Bahns 98 25000 0
## 5 Edwin Neal Marilyn Burns 88 83532 365
## 6 Peter Marquardt Consuelo Gómez 81 7000 0
## actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1 189 109 21 330
## 2 58 20 0 78
## 3 170 170 39 399
## 4 138 111 73 388
## 5 383 371 177 1094
## 6 121 20 6 147
## fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1 12000 409 1189 184824 6.3 1.85
## 2 754 72 114 5709 7.2 1.37
## 3 0 360 3400 186786 6.4 1.33
## 4 265 36 36 6375 6.6 1.85
## 5 0 277 826 96410 7.5 1.85
## 6 0 56 130 52055 6.9 1.37
## gross year Gross.Profit Gross.Margin
## 1 107917283 2007 107902283 0.9998610
## 2 592014 2003 591796 0.9996318
## 3 140530114 1999 140470114 0.9995730
## 4 10246600 1995 10221600 0.9975602
## 5 30859000 1974 30775468 0.9972931
## 6 2040920 1992 2033920 0.9965702
summary(data2)
## title genre1 genre2
## [Rec] 2Â : 1 Comedy :991 Drama : 890
## 10 Cloverfield Lane : 1 Action :928 Adventure: 396
## 10 Days in a Madhouse : 1 Drama :661 : 355
## 10 Things I Hate About You : 1 Adventure:361 Crime : 333
## 102 Dalmatians : 1 Crime :252 Romance : 295
## 10th & Wolf : 1 Biography:206 Comedy : 286
## (Other) :3700 (Other) :307 (Other) :1151
## genre3 genre4 genre5 genre6
## :1328 :2571 :3334 :3627
## Thriller: 434 Thriller: 377 Thriller: 116 Sci-Fi : 17
## Romance : 362 Romance : 141 Fantasy : 54 Fantasy : 13
## Drama : 284 Family : 123 Sci-Fi : 43 Thriller: 12
## Sci-Fi : 223 Sci-Fi : 96 Romance : 35 Musical : 9
## Fantasy : 178 Fantasy : 87 Family : 30 Romance : 8
## (Other) : 897 (Other) : 311 (Other) : 94 (Other) : 20
## genre7 genre8 director
## :3686 :3703 Steven Spielberg: 25
## Sci-Fi : 7 Romance : 1 Clint Eastwood : 19
## Romance : 5 Thriller: 2 Woody Allen : 19
## Musical : 2 Martin Scorsese : 16
## Thriller: 2 Ridley Scott : 16
## Fantasy : 1 Renny Harlin : 15
## (Other) : 3 (Other) :3596
## actor1 actor2
## Robert De Niro : 42 Morgan Freeman : 18
## Johnny Depp : 34 Brad Pitt : 13
## Nicolas Cage : 30 Charlize Theron: 13
## Denzel Washington: 29 James Franco : 10
## J.K. Simmons : 29 Meryl Streep : 10
## Bruce Willis : 27 Adam Sandler : 9
## (Other) :3515 (Other) :3633
## actor3 length budget
## Steve Coogan : 8 Min. : 37.0 Min. :2.180e+02
## Robert Duvall : 7 1st Qu.: 96.0 1st Qu.:1.000e+07
## Anne Hathaway : 6 Median :106.0 Median :2.500e+07
## Ben Mendelsohn : 6 Mean :110.1 Mean :4.211e+07
## Bruce McGill : 6 3rd Qu.:120.0 3rd Qu.:5.000e+07
## Clifton Collins Jr.: 6 Max. :330.0 Max. :4.200e+09
## (Other) :3667
## director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 10.0 1st Qu.: 723 1st Qu.: 368.0 1st Qu.: 186.0
## Median : 60.0 Median : 1000 Median : 669.5 Median : 430.0
## Mean : 803.2 Mean : 7589 Mean : 1963.1 Mean : 746.6
## 3rd Qu.: 226.0 3rd Qu.: 12000 3rd Qu.: 971.8 3rd Qu.: 686.5
## Max. :23000.0 Max. :640000 Max. :137000.0 Max. :23000.0
##
## total_cast_likes fb_likes critic_reviews users_reviews
## Min. : 0 Min. : 0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 1838 1st Qu.: 0 1st Qu.: 75.0 1st Qu.: 105.0
## Median : 3940 Median : 205 Median :134.0 Median : 204.0
## Mean : 11262 Mean : 9080 Mean :163.4 Mean : 327.7
## 3rd Qu.: 15986 3rd Qu.: 11000 3rd Qu.:220.0 3rd Qu.: 391.0
## Max. :656730 Max. :349000 Max. :813.0 Max. :5060.0
##
## users_votes score aspect_ratio gross
## Min. : 5 Min. :1.600 Min. : 1.180 Min. : 162
## 1st Qu.: 18324 1st Qu.:5.900 1st Qu.: 1.850 1st Qu.: 7363918
## Median : 52062 Median :6.600 Median : 2.350 Median : 28563552
## Mean : 103580 Mean :6.464 Mean : 2.108 Mean : 51249957
## 3rd Qu.: 124734 3rd Qu.:7.200 3rd Qu.: 2.350 3rd Qu.: 65988681
## Max. :1689764 Max. :9.300 Max. :16.000 Max. :760505847
##
## year Gross.Profit Gross.Margin
## Min. :1920 Min. :-4.200e+09 Min. :-55554.56
## 1st Qu.:1999 1st Qu.:-1.029e+07 1st Qu.: -1.15
## Median :2004 Median : 1.150e+06 Median : 0.08
## Mean :2003 Mean : 9.142e+06 Mean : -52.89
## 3rd Qu.:2010 3rd Qu.: 2.489e+07 3rd Qu.: 0.55
## Max. :2016 Max. : 5.235e+08 Max. : 1.00
##
It’s pretty much as expected no real outstanding things to mention.
Values above in summary. Main ones to mention: Budget Min: 2.180E=e+02 Max:4.200e+09 gross min:162 max: 760505847 gross margin min:-55554.56 (likely an error) max: 1 (not actually 1 more like .999) ### 4C) Create a correlation table ( only numeric data ). Note any significant values
data_corr <- cor( data2[-c(1,2,3,4,5,6,7,8,9,10,11,12,13)] )
data_corr
## length budget director_fb_likes
## length 1.000000000 0.13768474 0.182139769
## budget 0.137684738 1.00000000 0.039614855
## director_fb_likes 0.182139769 0.03961485 1.000000000
## actor1_fb_likes 0.083880145 0.04564294 0.091373154
## actor2_fb_likes 0.128044919 0.08042046 0.115778870
## actor3_fb_likes 0.122211179 0.08924428 0.119317294
## total_cast_likes 0.119043500 0.07186071 0.120242907
## fb_likes 0.218480366 0.10777993 0.164743204
## critic_reviews 0.224459648 0.16541069 0.180745270
## users_reviews 0.343676796 0.14645981 0.224201955
## users_votes 0.336851017 0.14302229 0.305387466
## score 0.369922315 0.04713212 0.194302462
## aspect_ratio 0.151615634 0.07368236 0.039123734
## gross 0.240102786 0.22169409 0.137848599
## year -0.134794527 0.08178870 -0.046533926
## Gross.Profit 0.014833078 -0.80926743 0.045694499
## Gross.Margin 0.004959606 -0.31128728 0.007226178
## actor1_fb_likes actor2_fb_likes actor3_fb_likes
## length 0.08388014 0.12804492 0.12221118
## budget 0.04564294 0.08042046 0.08924428
## director_fb_likes 0.09137315 0.11577887 0.11931729
## actor1_fb_likes 1.00000000 0.38830298 0.24978718
## actor2_fb_likes 0.38830298 1.00000000 0.54404078
## actor3_fb_likes 0.24978718 0.54404078 1.00000000
## total_cast_likes 0.94672268 0.63696771 0.48112133
## fb_likes 0.12674457 0.22802711 0.26299094
## critic_reviews 0.16761908 0.25345595 0.24761045
## users_reviews 0.12543438 0.18768687 0.20359967
## users_votes 0.17933571 0.23915625 0.25971272
## score 0.09356288 0.10143087 0.06876962
## aspect_ratio 0.05836148 0.06625472 0.04973974
## gross 0.14209652 0.23907476 0.27920423
## year 0.09156244 0.11899866 0.11271094
## Gross.Profit 0.04257017 0.06820383 0.08405977
## Gross.Margin 0.01390159 0.01768154 0.01459859
## total_cast_likes fb_likes critic_reviews users_reviews
## length 0.11904350 0.21848037 0.22445965 0.34367680
## budget 0.07186071 0.10777993 0.16541069 0.14645981
## director_fb_likes 0.12024291 0.16474320 0.18074527 0.22420196
## actor1_fb_likes 0.94672268 0.12674457 0.16761908 0.12543438
## actor2_fb_likes 0.63696771 0.22802711 0.25345595 0.18768687
## actor3_fb_likes 0.48112133 0.26299094 0.24761045 0.20359967
## total_cast_likes 1.00000000 0.19902568 0.23636700 0.18055261
## fb_likes 0.19902568 1.00000000 0.70477420 0.37997045
## critic_reviews 0.23636700 0.70477420 1.00000000 0.56361418
## users_reviews 0.18055261 0.37997045 0.56361418 1.00000000
## users_votes 0.24507769 0.52148441 0.59492020 0.78442506
## score 0.10676828 0.28564673 0.34920049 0.32644390
## aspect_ratio 0.07109254 0.10793592 0.17640696 0.09396047
## gross 0.22612367 0.36094107 0.46053379 0.54641946
## year 0.12158228 0.29617913 0.40999238 0.01754673
## Gross.Profit 0.06847202 0.11582476 0.12148692 0.19109444
## Gross.Margin 0.01893418 0.01649712 0.03052406 0.03044470
## users_votes score aspect_ratio gross
## length 0.33685102 0.369922315 0.15161563 0.24010279
## budget 0.14302229 0.047132122 0.07368236 0.22169409
## director_fb_likes 0.30538747 0.194302462 0.03912373 0.13784860
## actor1_fb_likes 0.17933571 0.093562881 0.05836148 0.14209652
## actor2_fb_likes 0.23915625 0.101430872 0.06625472 0.23907476
## actor3_fb_likes 0.25971272 0.068769621 0.04973974 0.27920423
## total_cast_likes 0.24507769 0.106768280 0.07109254 0.22612367
## fb_likes 0.52148441 0.285646728 0.10793592 0.36094107
## critic_reviews 0.59492020 0.349200488 0.17640696 0.46053379
## users_reviews 0.78442506 0.326443899 0.09396047 0.54641946
## users_votes 1.00000000 0.479511154 0.08578553 0.62654562
## score 0.47951115 1.000000000 0.03009787 0.21470357
## aspect_ratio 0.08578553 0.030097872 1.00000000 0.06611943
## gross 0.62654562 0.214703570 0.06611943 1.00000000
## year 0.01989931 -0.124930296 0.21900548 0.04485666
## Gross.Profit 0.24260591 0.084906875 -0.02963701 0.39341289
## Gross.Margin 0.02727867 0.005270862 -0.01918346 0.03798139
## year Gross.Profit Gross.Margin
## length -0.13479453 0.01483308 0.004959606
## budget 0.08178870 -0.80926743 -0.311287278
## director_fb_likes -0.04653393 0.04569450 0.007226178
## actor1_fb_likes 0.09156244 0.04257017 0.013901589
## actor2_fb_likes 0.11899866 0.06820383 0.017681544
## actor3_fb_likes 0.11271094 0.08405977 0.014598587
## total_cast_likes 0.12158228 0.06847202 0.018934182
## fb_likes 0.29617913 0.11582476 0.016497122
## critic_reviews 0.40999238 0.12148692 0.030524058
## users_reviews 0.01754673 0.19109444 0.030444704
## users_votes 0.01989931 0.24260591 0.027278674
## score -0.12493030 0.08490688 0.005270862
## aspect_ratio 0.21900548 -0.02963701 -0.019183462
## gross 0.04485666 0.39341289 0.037981390
## year 1.00000000 -0.05008920 -0.028432945
## Gross.Profit -0.05008920 1.00000000 0.316369931
## Gross.Margin -0.02843294 0.31636993 1.000000000
corrplot(data_corr)
knitr::include_graphics(path = 'data/Max Gross by year.png')
knitr::include_graphics(path = 'data/Max Gross by year.png')
knitr::include_graphics(path = 'data/Gross Margin by Budget.png')
### 5B) Evaluate the efficiency of model. Note R Square and Adjusted R Square, be suspicious of very high R Squares. R-Squared is 0.655 and the p value suggests that the assosciation is significant. ### 5C) Try different models (combination of independent variables). To find a better model if possible.
knitr::include_graphics(path = 'data/Gross Profit by Budget.png')
knitr::include_graphics(path = 'data/Gross Margin by Budget 2.png')
Clearly the lower budget movies are more profitable up to about 125 million dollar budgets where the prediction line shows a negative gross margin.