Task#1: About this Project


1A) Brief description of your project.

Finding a proper metric relating budget and gross in the form of gross profit and gross margin, then cleaning it up for outliers and relating it to a number of other values including but not limited to year user ratings and actor facebook likes. I created the new metrics already in a new, although seperate, excel file for future use.

1B) Resources - Packages/software use on the project (R, Tableau, Watson)

R Tableau Watson Excel

# Here we are checking if the package is installed
if(!require("tidyverse")){
  
  # If the package is not in the system then it will be install
  install.packages("tidyverse", dependencies = TRUE)
  
  # Here we are loading the package
  library("tidyverse")
}
## Loading required package: tidyverse
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Here we are checking if the package is installed
if(!require("corrplot")){
  
  # If the package is not in the system then it will be install
  install.packages("corrplot", dependencies = TRUE)
  
  # Here we are loading the package
  library("corrplot")
}
## Loading required package: corrplot
## corrplot 0.84 loaded

1C) Data Description source, year, country.

Source: Rotten Tomatoes Year: Retreived 2016 data ranging from 1916 - 2016 Country: Various

1D) Potential business cases (hypothesis)

Lower budget movies are more volatile yet overall significantly more profitable.

Task#2: Data Collection - Movies - rottentomatoes.csv


2A) Brief description and key metrics of the project industry (context)

Movie data from rottentomatoes.com including metrics such as main actor facebook likes, budget, year of release, gross, and user rating.

2B) Explore the dataset, note any interesting patterns highlights

mydata = read.csv(file = 'data/rottentomatoes.csv')
head(mydata)
##                                                      title
## 1                                                 Avatar 
## 2               Pirates of the Caribbean: At World's End 
## 3                                                Spectre 
## 4                                  The Dark Knight Rises 
## 5 Star Wars: Episode VII - The Force Awakens             
## 6                                            John Carter 
##                            genres          director          actor1
## 1 Action|Adventure|Fantasy|Sci-Fi     James Cameron     CCH Pounder
## 2        Action|Adventure|Fantasy    Gore Verbinski     Johnny Depp
## 3       Action|Adventure|Thriller        Sam Mendes Christoph Waltz
## 4                 Action|Thriller Christopher Nolan       Tom Hardy
## 5                     Documentary       Doug Walker     Doug Walker
## 6         Action|Adventure|Sci-Fi    Andrew Stanton    Daryl Sabara
##             actor2               actor3 length    budget director_fb_likes
## 1 Joel David Moore            Wes Studi    178 237000000                 0
## 2    Orlando Bloom       Jack Davenport    169 300000000               563
## 3     Rory Kinnear     Stephanie Sigman    148 245000000                 0
## 4   Christian Bale Joseph Gordon-Levitt    164 250000000             22000
## 5       Rob Walker                          NA        NA               131
## 6  Samantha Morton         Polly Walker    132 263700000               475
##   actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1            1000             936             855             4834
## 2           40000            5000            1000            48350
## 3           11000             393             161            11700
## 4           27000           23000           23000           106759
## 5             131              12              NA              143
## 6             640             632             530             1873
##   fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1    33000            723          3054      886204   7.9         1.78
## 2        0            302          1238      471220   7.1         2.35
## 3    85000            602           994      275868   6.8         2.35
## 4   164000            813          2701     1144337   8.5         2.35
## 5        0             NA            NA           8   7.1           NA
## 6    24000            462           738      212204   6.6         2.35
##       gross year
## 1 760505847 2009
## 2 309404152 2007
## 3 200074175 2015
## 4 448130642 2012
## 5        NA   NA
## 6  73058679 2012

From first glance, exploring and sorting the data in excel, there are clear issues with many of the data values relating to budget and gross, these will eventually be solved anyway by excluding outliers but I believe it is caused by misentry of data and possibly through the use of different currencies. In addition to this there are some weird characters that appear in the title column.

2C) Paragraph using descriptive statistics

The statistics are for different movies, the title of the movie, genre, director, main actors, length of the movie, budget, facebook likes for director and actor, reviews for the movie, gross, and the year the movie was made.


Task#3: Data Preparation - Cleaning and preparing the data for analysis


3A) Describe the steps to cleaning and prepare the data for analysis. Make an argument of why those steps are necessary.

Cleaned the names of the films to remove weird character just to make it look better. Seperated genres into categories so that during analysis they can be better utilized. Removed a number of movies that had either no listed budget or gross. Removed duplicate entries. Created two columns to calculate gross profit and margin for analysis. Removed incomplete entries.

3B) Clean the data

Done in excel.

3C) Save a new clean dataset

data2 = read.csv(file = 'data/rottentomatoes copy 2.csv')
head(data2)
##                           title    genre1      genre2  genre3  genre4
## 1          Paranormal Activity     Horror                            
## 2                    Tarnation  Biography Documentary                
## 3      The Blair Witch Project     Horror                            
## 4        The Brothers McMullen     Comedy       Drama Romance        
## 5 The Texas Chain Saw Massacre     Horror    Thriller                
## 6                  El Mariachi     Action       Crime   Drama Romance
##     genre5 genre6 genre7 genre8          director          actor1
## 1                                       Oren Peli     Micah Sloat
## 2                               Jonathan Caouette      Greg Ayres
## 3                                   Daniel Myrick Heather Donahue
## 4                                    Edward Burns    Shari Albert
## 5                                     Tobe Hooper   Gunnar Hansen
## 6 Thriller                       Robert Rodriguez Carlos Gallardo
##              actor2              actor3 length budget director_fb_likes
## 1     Ashley Palmer     Amber Armstrong     84  15000               110
## 2 Jonathan Caouette       Renee Leblanc     88    218                20
## 3    Joshua Leonard Michael C. Williams     81  60000                19
## 4   Michael McGlone        Maxine Bahns     98  25000                 0
## 5        Edwin Neal       Marilyn Burns     88  83532               365
## 6   Peter Marquardt     Consuelo Gómez     81   7000                 0
##   actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1             189             109              21              330
## 2              58              20               0               78
## 3             170             170              39              399
## 4             138             111              73              388
## 5             383             371             177             1094
## 6             121              20               6              147
##   fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1    12000            409          1189      184824   6.3         1.85
## 2      754             72           114        5709   7.2         1.37
## 3        0            360          3400      186786   6.4         1.33
## 4      265             36            36        6375   6.6         1.85
## 5        0            277           826       96410   7.5         1.85
## 6        0             56           130       52055   6.9         1.37
##       gross year Gross.Profit Gross.Margin
## 1 107917283 2007    107902283    0.9998610
## 2    592014 2003       591796    0.9996318
## 3 140530114 1999    140470114    0.9995730
## 4  10246600 1995     10221600    0.9975602
## 5  30859000 1974     30775468    0.9972931
## 6   2040920 1992      2033920    0.9965702
summary(data2)
##                          title            genre1          genre2    
##  [Rec] 2                    :   1   Comedy   :991   Drama    : 890  
##  10 Cloverfield Lane        :   1   Action   :928   Adventure: 396  
##  10 Days in a Madhouse      :   1   Drama    :661            : 355  
##  10 Things I Hate About You :   1   Adventure:361   Crime    : 333  
##  102 Dalmatians             :   1   Crime    :252   Romance  : 295  
##  10th & Wolf                :   1   Biography:206   Comedy   : 286  
##  (Other)                    :3700   (Other)  :307   (Other)  :1151  
##       genre3          genre4          genre5          genre6    
##          :1328           :2571           :3334           :3627  
##  Thriller: 434   Thriller: 377   Thriller: 116   Sci-Fi  :  17  
##  Romance : 362   Romance : 141   Fantasy :  54   Fantasy :  13  
##  Drama   : 284   Family  : 123   Sci-Fi  :  43   Thriller:  12  
##  Sci-Fi  : 223   Sci-Fi  :  96   Romance :  35   Musical :   9  
##  Fantasy : 178   Fantasy :  87   Family  :  30   Romance :   8  
##  (Other) : 897   (Other) : 311   (Other) :  94   (Other) :  20  
##       genre7          genre8                 director   
##          :3686           :3703   Steven Spielberg:  25  
##  Sci-Fi  :   7   Romance :   1   Clint Eastwood  :  19  
##  Romance :   5   Thriller:   2   Woody Allen     :  19  
##  Musical :   2                   Martin Scorsese :  16  
##  Thriller:   2                   Ridley Scott    :  16  
##  Fantasy :   1                   Renny Harlin    :  15  
##  (Other) :   3                   (Other)         :3596  
##                actor1                 actor2    
##  Robert De Niro   :  42   Morgan Freeman :  18  
##  Johnny Depp      :  34   Brad Pitt      :  13  
##  Nicolas Cage     :  30   Charlize Theron:  13  
##  Denzel Washington:  29   James Franco   :  10  
##  J.K. Simmons     :  29   Meryl Streep   :  10  
##  Bruce Willis     :  27   Adam Sandler   :   9  
##  (Other)          :3515   (Other)        :3633  
##                  actor3         length          budget         
##  Steve Coogan       :   8   Min.   : 37.0   Min.   :2.180e+02  
##  Robert Duvall      :   7   1st Qu.: 96.0   1st Qu.:1.000e+07  
##  Anne Hathaway      :   6   Median :106.0   Median :2.500e+07  
##  Ben Mendelsohn     :   6   Mean   :110.1   Mean   :4.211e+07  
##  Bruce McGill       :   6   3rd Qu.:120.0   3rd Qu.:5.000e+07  
##  Clifton Collins Jr.:   6   Max.   :330.0   Max.   :4.200e+09  
##  (Other)            :3667                                      
##  director_fb_likes actor1_fb_likes  actor2_fb_likes    actor3_fb_likes  
##  Min.   :    0.0   Min.   :     0   Min.   :     0.0   Min.   :    0.0  
##  1st Qu.:   10.0   1st Qu.:   723   1st Qu.:   368.0   1st Qu.:  186.0  
##  Median :   60.0   Median :  1000   Median :   669.5   Median :  430.0  
##  Mean   :  803.2   Mean   :  7589   Mean   :  1963.1   Mean   :  746.6  
##  3rd Qu.:  226.0   3rd Qu.: 12000   3rd Qu.:   971.8   3rd Qu.:  686.5  
##  Max.   :23000.0   Max.   :640000   Max.   :137000.0   Max.   :23000.0  
##                                                                         
##  total_cast_likes    fb_likes      critic_reviews  users_reviews   
##  Min.   :     0   Min.   :     0   Min.   :  1.0   Min.   :   1.0  
##  1st Qu.:  1838   1st Qu.:     0   1st Qu.: 75.0   1st Qu.: 105.0  
##  Median :  3940   Median :   205   Median :134.0   Median : 204.0  
##  Mean   : 11262   Mean   :  9080   Mean   :163.4   Mean   : 327.7  
##  3rd Qu.: 15986   3rd Qu.: 11000   3rd Qu.:220.0   3rd Qu.: 391.0  
##  Max.   :656730   Max.   :349000   Max.   :813.0   Max.   :5060.0  
##                                                                    
##   users_votes          score        aspect_ratio        gross          
##  Min.   :      5   Min.   :1.600   Min.   : 1.180   Min.   :      162  
##  1st Qu.:  18324   1st Qu.:5.900   1st Qu.: 1.850   1st Qu.:  7363918  
##  Median :  52062   Median :6.600   Median : 2.350   Median : 28563552  
##  Mean   : 103580   Mean   :6.464   Mean   : 2.108   Mean   : 51249957  
##  3rd Qu.: 124734   3rd Qu.:7.200   3rd Qu.: 2.350   3rd Qu.: 65988681  
##  Max.   :1689764   Max.   :9.300   Max.   :16.000   Max.   :760505847  
##                                                                        
##       year       Gross.Profit         Gross.Margin      
##  Min.   :1920   Min.   :-4.200e+09   Min.   :-55554.56  
##  1st Qu.:1999   1st Qu.:-1.029e+07   1st Qu.:    -1.15  
##  Median :2004   Median : 1.150e+06   Median :     0.08  
##  Mean   :2003   Mean   : 9.142e+06   Mean   :   -52.89  
##  3rd Qu.:2010   3rd Qu.: 2.489e+07   3rd Qu.:     0.55  
##  Max.   :2016   Max.   : 5.235e+08   Max.   :     1.00  
## 

Data Analysis: Descriptive Statistics, Correlations


4A) Basic descriptive statistics of the new clean dataset (write down any observations)

It’s pretty much as expected no real outstanding things to mention.

4B) Using descriptive statistics explore dataset investigate min and max values

Values above in summary. Main ones to mention: Budget Min: 2.180E=e+02 Max:4.200e+09 gross min:162 max: 760505847 gross margin min:-55554.56 (likely an error) max: 1 (not actually 1 more like .999) ### 4C) Create a correlation table ( only numeric data ). Note any significant values

data_corr <- cor( data2[-c(1,2,3,4,5,6,7,8,9,10,11,12,13)] )
data_corr
##                         length      budget director_fb_likes
## length             1.000000000  0.13768474       0.182139769
## budget             0.137684738  1.00000000       0.039614855
## director_fb_likes  0.182139769  0.03961485       1.000000000
## actor1_fb_likes    0.083880145  0.04564294       0.091373154
## actor2_fb_likes    0.128044919  0.08042046       0.115778870
## actor3_fb_likes    0.122211179  0.08924428       0.119317294
## total_cast_likes   0.119043500  0.07186071       0.120242907
## fb_likes           0.218480366  0.10777993       0.164743204
## critic_reviews     0.224459648  0.16541069       0.180745270
## users_reviews      0.343676796  0.14645981       0.224201955
## users_votes        0.336851017  0.14302229       0.305387466
## score              0.369922315  0.04713212       0.194302462
## aspect_ratio       0.151615634  0.07368236       0.039123734
## gross              0.240102786  0.22169409       0.137848599
## year              -0.134794527  0.08178870      -0.046533926
## Gross.Profit       0.014833078 -0.80926743       0.045694499
## Gross.Margin       0.004959606 -0.31128728       0.007226178
##                   actor1_fb_likes actor2_fb_likes actor3_fb_likes
## length                 0.08388014      0.12804492      0.12221118
## budget                 0.04564294      0.08042046      0.08924428
## director_fb_likes      0.09137315      0.11577887      0.11931729
## actor1_fb_likes        1.00000000      0.38830298      0.24978718
## actor2_fb_likes        0.38830298      1.00000000      0.54404078
## actor3_fb_likes        0.24978718      0.54404078      1.00000000
## total_cast_likes       0.94672268      0.63696771      0.48112133
## fb_likes               0.12674457      0.22802711      0.26299094
## critic_reviews         0.16761908      0.25345595      0.24761045
## users_reviews          0.12543438      0.18768687      0.20359967
## users_votes            0.17933571      0.23915625      0.25971272
## score                  0.09356288      0.10143087      0.06876962
## aspect_ratio           0.05836148      0.06625472      0.04973974
## gross                  0.14209652      0.23907476      0.27920423
## year                   0.09156244      0.11899866      0.11271094
## Gross.Profit           0.04257017      0.06820383      0.08405977
## Gross.Margin           0.01390159      0.01768154      0.01459859
##                   total_cast_likes   fb_likes critic_reviews users_reviews
## length                  0.11904350 0.21848037     0.22445965    0.34367680
## budget                  0.07186071 0.10777993     0.16541069    0.14645981
## director_fb_likes       0.12024291 0.16474320     0.18074527    0.22420196
## actor1_fb_likes         0.94672268 0.12674457     0.16761908    0.12543438
## actor2_fb_likes         0.63696771 0.22802711     0.25345595    0.18768687
## actor3_fb_likes         0.48112133 0.26299094     0.24761045    0.20359967
## total_cast_likes        1.00000000 0.19902568     0.23636700    0.18055261
## fb_likes                0.19902568 1.00000000     0.70477420    0.37997045
## critic_reviews          0.23636700 0.70477420     1.00000000    0.56361418
## users_reviews           0.18055261 0.37997045     0.56361418    1.00000000
## users_votes             0.24507769 0.52148441     0.59492020    0.78442506
## score                   0.10676828 0.28564673     0.34920049    0.32644390
## aspect_ratio            0.07109254 0.10793592     0.17640696    0.09396047
## gross                   0.22612367 0.36094107     0.46053379    0.54641946
## year                    0.12158228 0.29617913     0.40999238    0.01754673
## Gross.Profit            0.06847202 0.11582476     0.12148692    0.19109444
## Gross.Margin            0.01893418 0.01649712     0.03052406    0.03044470
##                   users_votes        score aspect_ratio      gross
## length             0.33685102  0.369922315   0.15161563 0.24010279
## budget             0.14302229  0.047132122   0.07368236 0.22169409
## director_fb_likes  0.30538747  0.194302462   0.03912373 0.13784860
## actor1_fb_likes    0.17933571  0.093562881   0.05836148 0.14209652
## actor2_fb_likes    0.23915625  0.101430872   0.06625472 0.23907476
## actor3_fb_likes    0.25971272  0.068769621   0.04973974 0.27920423
## total_cast_likes   0.24507769  0.106768280   0.07109254 0.22612367
## fb_likes           0.52148441  0.285646728   0.10793592 0.36094107
## critic_reviews     0.59492020  0.349200488   0.17640696 0.46053379
## users_reviews      0.78442506  0.326443899   0.09396047 0.54641946
## users_votes        1.00000000  0.479511154   0.08578553 0.62654562
## score              0.47951115  1.000000000   0.03009787 0.21470357
## aspect_ratio       0.08578553  0.030097872   1.00000000 0.06611943
## gross              0.62654562  0.214703570   0.06611943 1.00000000
## year               0.01989931 -0.124930296   0.21900548 0.04485666
## Gross.Profit       0.24260591  0.084906875  -0.02963701 0.39341289
## Gross.Margin       0.02727867  0.005270862  -0.01918346 0.03798139
##                          year Gross.Profit Gross.Margin
## length            -0.13479453   0.01483308  0.004959606
## budget             0.08178870  -0.80926743 -0.311287278
## director_fb_likes -0.04653393   0.04569450  0.007226178
## actor1_fb_likes    0.09156244   0.04257017  0.013901589
## actor2_fb_likes    0.11899866   0.06820383  0.017681544
## actor3_fb_likes    0.11271094   0.08405977  0.014598587
## total_cast_likes   0.12158228   0.06847202  0.018934182
## fb_likes           0.29617913   0.11582476  0.016497122
## critic_reviews     0.40999238   0.12148692  0.030524058
## users_reviews      0.01754673   0.19109444  0.030444704
## users_votes        0.01989931   0.24260591  0.027278674
## score             -0.12493030   0.08490688  0.005270862
## aspect_ratio       0.21900548  -0.02963701 -0.019183462
## gross              0.04485666   0.39341289  0.037981390
## year               1.00000000  -0.05008920 -0.028432945
## Gross.Profit      -0.05008920   1.00000000  0.316369931
## Gross.Margin      -0.02843294   0.31636993  1.000000000
corrplot(data_corr)

4D) Create a correlation table ( only numeric data ). Give a brief explanation/guess of why some variables are correlated

Many of the variables are correlated for obvious relational reasons, the really interesting correlations are occur with gross and the different review metrics. Gross profit is negatively correlated with budget as the budget is included in the calculation of gross profit, and similarly with gross margin.


Visual Analytics: Use Tableau or R to create plots


knitr::include_graphics(path = 'data/Max Gross by year.png')


Predictive Analytics: Create a Predictive Model


knitr::include_graphics(path = 'data/Max Gross by year.png')


Task#5: Visual Analytics - Create a Predictive Model


5A) Based on your hypothesis and observations create a predictive model

knitr::include_graphics(path = 'data/Gross Margin by Budget.png')

### 5B) Evaluate the efficiency of model. Note R Square and Adjusted R Square, be suspicious of very high R Squares. R-Squared is 0.655 and the p value suggests that the assosciation is significant. ### 5C) Try different models (combination of independent variables). To find a better model if possible.

knitr::include_graphics(path = 'data/Gross Profit by Budget.png')

knitr::include_graphics(path = 'data/Gross Margin by Budget 2.png')

5D) Describe your findings. How accurately your model predicts the dependent (target) variable

Clearly the lower budget movies are more profitable up to about 125 million dollar budgets where the prediction line shows a negative gross margin.