How can we tell how good a movie will be before it is released in cinemas? Answering this question is valuable for producers who are planning their next blockbuster project and for TV stations deciding which movies to buy next season. A movie's rating score can serve as a metric of its success. So in this project, I will use IMDB data to find out what makes a movie successful.

The data comes from Kaggle and contains 28 variables for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries. The variables include “duration”, “director_name”, “director_facebook_likes”, “actor_3_name”, “actor_3_facebook_likes”, “genres”, “num_voted_users”, “movie_imdb_link”, “language”, “country”, “budget”, “title_year”, “imdb_score”, “aspect_ratio”, etc.

In this project, we will answer these 3 questions:

1. Who are the top ten directors with the best movie reviews? Who are the top ten actors with the best movie reviews? Which genres tend to achieve the best reviews? Which countries’ movies tend to obtain high scores?

2. What are the relationships among the variables?

3. What are the determining variables of the IMDb movie score, and how do they affect it? (To answer this third question, I develop a predictive model that predicts the IMDb rating score from the given attributes.)

Step 1: Prepare and clean data

Import data

movie<- read.csv(file ="movie_metadata.csv", stringsAsFactors = TRUE)

Before cleaning the data, we first need to understand it.

dim(movie)
## [1] 5043   28
head(movie)
##   color     director_name num_critic_for_reviews duration
## 1 Color     James Cameron                    723      178
## 2 Color    Gore Verbinski                    302      169
## 3 Color        Sam Mendes                    602      148
## 4 Color Christopher Nolan                    813      164
## 5             Doug Walker                     NA       NA
## 6 Color    Andrew Stanton                    462      132
##   director_facebook_likes actor_3_facebook_likes     actor_2_name
## 1                       0                    855 Joel David Moore
## 2                     563                   1000    Orlando Bloom
## 3                       0                    161     Rory Kinnear
## 4                   22000                  23000   Christian Bale
## 5                     131                     NA       Rob Walker
## 6                     475                    530  Samantha Morton
##   actor_1_facebook_likes     gross                          genres
## 1                   1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2                  40000 309404152        Action|Adventure|Fantasy
## 3                  11000 200074175       Action|Adventure|Thriller
## 4                  27000 448130642                 Action|Thriller
## 5                    131        NA                     Documentary
## 6                    640  73058679         Action|Adventure|Sci-Fi
##      actor_1_name                                              movie_title
## 1     CCH Pounder                                                 Avatar聽
## 2     Johnny Depp               Pirates of the Caribbean: At World's End聽
## 3 Christoph Waltz                                                Spectre聽
## 4       Tom Hardy                                  The Dark Knight Rises聽
## 5     Doug Walker Star Wars: Episode VII - The Force Awakens聽            
## 6    Daryl Sabara                                            John Carter聽
##   num_voted_users cast_total_facebook_likes         actor_3_name
## 1          886204                      4834            Wes Studi
## 2          471220                     48350       Jack Davenport
## 3          275868                     11700     Stephanie Sigman
## 4         1144337                    106759 Joseph Gordon-Levitt
## 5               8                       143                     
## 6          212204                      1873         Polly Walker
##   facenumber_in_poster
## 1                    0
## 2                    0
## 3                    1
## 4                    0
## 5                    0
## 6                    1
##                                                      plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 2     goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5                                                                 
## 6               alien|american civil war|male nipple|mars|princess
##                                        movie_imdb_link
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
##   num_user_for_reviews language country content_rating    budget
## 1                 3054  English     USA          PG-13 237000000
## 2                 1238  English     USA          PG-13 300000000
## 3                  994  English      UK          PG-13 245000000
## 4                 2701  English     USA          PG-13 250000000
## 5                   NA                                        NA
## 6                  738  English     USA          PG-13 263700000
##   title_year actor_2_facebook_likes imdb_score aspect_ratio
## 1       2009                    936        7.9         1.78
## 2       2007                   5000        7.1         2.35
## 3       2015                    393        6.8         2.35
## 4       2012                  23000        8.5         2.35
## 5         NA                     12        7.1           NA
## 6       2012                    632        6.6         2.35
##   movie_facebook_likes
## 1                33000
## 2                    0
## 3                85000
## 4               164000
## 5                    0
## 6                24000
str(movie)
## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 923 796 2023 375 602 101 2026 1648 1220 550 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 533 2432 2548 1227 801 2439 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 301 978 351 1965 524 439 784 218 334 32 ...
##  $ movie_title              : Factor w/ 4917 levels "#Horror聽","[Rec] 2聽",..: 397 2729 3278 3706 3331 1960 3288 3458 398 1630 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3439 1390 3131 1764 1 2711 1967 2160 3015 2938 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
colnames(movie)  
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"
summary(movie)
##               color               director_name  num_critic_for_reviews
##                  :  19                   : 104   Min.   :  1.0         
##   Black and White: 209   Steven Spielberg:  26   1st Qu.: 50.0         
##  Color           :4815   Woody Allen     :  22   Median :110.0         
##                          Clint Eastwood  :  20   Mean   :140.2         
##                          Martin Scorsese :  20   3rd Qu.:195.0         
##                          Ridley Scott    :  17   Max.   :813.0         
##                          (Other)         :4834   NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  20   Min.   :     0         Min.   :      162  
##  Charlize Theron:  15   1st Qu.:   614         1st Qu.:  5340988  
##  Brad Pitt      :  14   Median :   988         Median : 25517500  
##                 :  13   Mean   :  6560         Mean   : 48468408  
##  James Franco   :  11   3rd Qu.: 11000         3rd Qu.: 62309438  
##  Meryl Streep   :  11   Max.   :640000         Max.   :760505847  
##  (Other)        :4959   NA's   :7              NA's   :884        
##                   genres                actor_1_name 
##  Drama               : 236   Robert De Niro   :  49  
##  Comedy              : 209   Johnny Depp      :  41  
##  Comedy|Drama        : 191   Nicolas Cage     :  33  
##  Comedy|Drama|Romance: 187   J.K. Simmons     :  31  
##  Comedy|Romance      : 158   Bruce Willis     :  30  
##  Drama|Romance       : 152   Denzel Washington:  30  
##  (Other)             :3910   (Other)          :4829  
##                      movie_title   num_voted_users  
##  Ben-Hur聽                 :   3   Min.   :      5  
##  Halloween聽               :   3   1st Qu.:   8594  
##  Home聽                    :   3   Median :  34359  
##  King Kong聽               :   3   Mean   :  83668  
##  Pan聽                     :   3   3rd Qu.:  96309  
##  The Fast and the Furious聽:   3   Max.   :1689764  
##  (Other)                   :5025                    
##  cast_total_facebook_likes         actor_3_name  facenumber_in_poster
##  Min.   :     0                          :  23   Min.   : 0.000      
##  1st Qu.:  1411            Ben Mendelsohn:   8   1st Qu.: 0.000      
##  Median :  3090            John Heard    :   8   Median : 1.000      
##  Mean   :  9699            Steve Coogan  :   8   Mean   : 1.371      
##  3rd Qu.: 13756            Anne Hathaway :   7   3rd Qu.: 2.000      
##  Max.   :656730            Jon Gries     :   7   Max.   :43.000      
##                            (Other)       :4982   NA's   :13          
##                                                                            plot_keywords 
##                                                                                   : 153  
##  based on novel                                                                   :   4  
##  1940s|child hero|fantasy world|orphan|reference to peter pan                     :   3  
##  alien friendship|alien invasion|australia|flying car|mother daughter relationship:   3  
##  animal name in title|ape abducts a woman|gorilla|island|king kong                :   3  
##  assistant|experiment|frankenstein|medical student|scientist                      :   3  
##  (Other)                                                                          :4874  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1:   3  
##  (Other)                                             :5025  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :4704   USA      :3807   R        :2118  
##  1st Qu.:  65.0       French  :  73   UK       : 448   PG-13    :1461  
##  Median : 156.0       Spanish :  40   France   : 154   PG       : 701  
##  Mean   : 272.8       Hindi   :  28   Canada   : 126            : 303  
##  3rd Qu.: 326.0       Mandarin:  26   Germany  :  97   Not Rated: 116  
##  Max.   :5060.0       German  :  19   Australia:  55   G        : 112  
##  NA's   :21           (Other) : 153   (Other)  : 356   (Other)  : 232  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1916   Min.   :     0         Min.   :1.600  
##  1st Qu.:6.000e+06   1st Qu.:1999   1st Qu.:   281         1st Qu.:5.800  
##  Median :2.000e+07   Median :2005   Median :   595         Median :6.600  
##  Mean   :3.975e+07   Mean   :2002   Mean   :  1652         Mean   :6.442  
##  3rd Qu.:4.500e+07   3rd Qu.:2011   3rd Qu.:   918         3rd Qu.:7.200  
##  Max.   :1.222e+10   Max.   :2016   Max.   :137000         Max.   :9.500  
##  NA's   :492         NA's   :108    NA's   :13                            
##   aspect_ratio   movie_facebook_likes
##  Min.   : 1.18   Min.   :     0      
##  1st Qu.: 1.85   1st Qu.:     0      
##  Median : 2.35   Median :   166      
##  Mean   : 2.22   Mean   :  7526      
##  3rd Qu.: 2.35   3rd Qu.:  3000      
##  Max.   :16.00   Max.   :349000      
##  NA's   :329

Load packages

library(ggplot2) 
## Warning: package 'ggplot2' was built under R version 3.3.2
library(shiny)
## Warning: package 'shiny' was built under R version 3.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
library(dplyr)  
## Warning: package 'dplyr' was built under R version 3.3.3
## Warning: Installed Rcpp (0.12.7) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rpart)
## Warning: package 'rpart' was built under R version 3.3.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 3.3.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.3.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend

Check for missing values

sum(is.na(movie))
## [1] 2059
colSums(is.na(movie)) 
##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                        50                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       104                        23 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         7 
##                     gross                    genres 
##                       884                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         0                         0 
##              actor_3_name      facenumber_in_poster 
##                         0                        13 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                        21                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                       492                       108 
##    actor_2_facebook_likes                imdb_score 
##                        13                         0 
##              aspect_ratio      movie_facebook_likes 
##                       329                         0
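
One caveat about the counts above: `is.na()` only sees true `NA` values, so blank text cells (which `read.csv()` keeps as the empty factor level `""`) are reported as 0 missing. The following is a minimal sketch on a small made-up data frame, not the movie data; the name `toy` and its columns are hypothetical and chosen only for illustration.

```r
# Toy data frame (hypothetical); the empty string mimics a blank cell in the CSV.
toy <- data.frame(
  language = factor(c("English", "", "French")),
  duration = c(178L, NA, 148L)
)

# is.na() counts only the true NA in duration, not the blank language.
na_counts <- colSums(is.na(toy))

# Count blanks separately by comparing each column, as character, with "".
blank_counts <- sapply(
  toy,
  function(col) sum(as.character(col) == "", na.rm = TRUE)
)

print(na_counts)
print(blank_counts)
```

If blanks should be treated as missing, one option is to pass `na.strings = c("", "NA")` to `read.csv()`; whether that is appropriate for this analysis is a judgment call.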

What proportion of the values in our data is missing?

mean(is.na(movie))
## [1] 0.01458174

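Only about 1.5% of all cells are missing, so we simply delete every row that contains a missing value. Note that this removes 1242 of the 5043 rows, because na.omit() drops a row if any of its columns is NA.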

movie<- na.omit(movie)
sum(is.na(movie))
## [1] 0
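
Because `na.omit()` works row by row, the share of rows it removes can be much larger than the share of missing cells. The sketch below illustrates the difference on a small made-up data frame; `toy` and its column values are hypothetical and not taken from the movie data.

```r
# Toy data frame (hypothetical): 4 rows x 3 columns with 2 missing cells.
toy <- data.frame(
  gross  = c(100, NA, 300, 400),
  budget = c(10, 20, NA, 40),
  score  = c(7.9, 7.1, 6.8, 8.5)
)

cell_share <- mean(is.na(toy))            # share of missing cells
row_share  <- mean(!complete.cases(toy))  # share of rows with at least one NA

cleaned <- na.omit(toy)
dropped <- length(attr(cleaned, "na.action"))  # number of rows removed

print(c(cell_share = cell_share, row_share = row_share, dropped = dropped))
```

Comparing `mean(is.na(movie))` with `mean(!complete.cases(movie))` before calling `na.omit()` is one way to see how much data the deletion will cost.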

Now that the data has no missing values, what does it look like?

dim(movie)
## [1] 3801   28
head(movie)
##   color     director_name num_critic_for_reviews duration
## 1 Color     James Cameron                    723      178
## 2 Color    Gore Verbinski                    302      169
## 3 Color        Sam Mendes                    602      148
## 4 Color Christopher Nolan                    813      164
## 6 Color    Andrew Stanton                    462      132
## 7 Color         Sam Raimi                    392      156
##   director_facebook_likes actor_3_facebook_likes     actor_2_name
## 1                       0                    855 Joel David Moore
## 2                     563                   1000    Orlando Bloom
## 3                       0                    161     Rory Kinnear
## 4                   22000                  23000   Christian Bale
## 6                     475                    530  Samantha Morton
## 7                       0                   4000     James Franco
##   actor_1_facebook_likes     gross                          genres
## 1                   1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2                  40000 309404152        Action|Adventure|Fantasy
## 3                  11000 200074175       Action|Adventure|Thriller
## 4                  27000 448130642                 Action|Thriller
## 6                    640  73058679         Action|Adventure|Sci-Fi
## 7                  24000 336530303        Action|Adventure|Romance
##      actor_1_name                                movie_title
## 1     CCH Pounder                                   Avatar聽
## 2     Johnny Depp Pirates of the Caribbean: At World's End聽
## 3 Christoph Waltz                                  Spectre聽
## 4       Tom Hardy                    The Dark Knight Rises聽
## 6    Daryl Sabara                              John Carter聽
## 7    J.K. Simmons                             Spider-Man 3聽
##   num_voted_users cast_total_facebook_likes         actor_3_name
## 1          886204                      4834            Wes Studi
## 2          471220                     48350       Jack Davenport
## 3          275868                     11700     Stephanie Sigman
## 4         1144337                    106759 Joseph Gordon-Levitt
## 6          212204                      1873         Polly Walker
## 7          383056                     46055        Kirsten Dunst
##   facenumber_in_poster
## 1                    0
## 2                    0
## 3                    1
## 4                    0
## 6                    1
## 7                    0
##                                                      plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 2     goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 6               alien|american civil war|male nipple|mars|princess
## 7                        sandman|spider man|symbiote|venom|villain
##                                        movie_imdb_link
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
## 7 http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1
##   num_user_for_reviews language country content_rating    budget
## 1                 3054  English     USA          PG-13 237000000
## 2                 1238  English     USA          PG-13 300000000
## 3                  994  English      UK          PG-13 245000000
## 4                 2701  English     USA          PG-13 250000000
## 6                  738  English     USA          PG-13 263700000
## 7                 1902  English     USA          PG-13 258000000
##   title_year actor_2_facebook_likes imdb_score aspect_ratio
## 1       2009                    936        7.9         1.78
## 2       2007                   5000        7.1         2.35
## 3       2015                    393        6.8         2.35
## 4       2012                  23000        8.5         2.35
## 6       2012                    632        6.6         2.35
## 7       2007                  11000        6.2         2.35
##   movie_facebook_likes
## 1                33000
## 2                    0
## 3                85000
## 4               164000
## 6                24000
## 7                    0
str(movie)
## 'data.frame':    3801 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 923 796 2023 375 101 2026 1648 1220 550 2390 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 462 392 324 635 375 673 ...
##  $ duration                 : int  178 169 148 164 132 156 100 141 153 183 ...
##  $ director_facebook_likes  : int  0 563 0 22000 475 0 15 0 282 0 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 530 4000 284 19000 10000 2000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 533 2548 1227 801 2439 653 1703 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 640 24000 799 26000 25000 15000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 73058679 336530303 200807262 458991599 301956980 330249062 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 126 120 308 126 447 126 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 301 978 351 1965 439 784 218 334 32 738 ...
##  $ movie_title              : Factor w/ 4917 levels "#Horror聽","[Rec] 2聽",..: 397 2729 3278 3706 1960 3288 3458 398 1630 460 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 212204 383056 294810 462669 321795 371639 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 1873 46055 2036 92000 58753 24450 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3439 1390 3131 1764 2711 1967 2160 3015 2938 55 ...
##  $ facenumber_in_poster     : int  0 0 1 0 1 0 1 4 3 0 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 651 4745 29 1142 2005 1564 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 2476 2526 2458 4546 2551 4690 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 738 1902 387 1117 973 3018 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 65 65 65 65 63 65 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 10 10 9 10 9 10 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.64e+08 ...
##  $ title_year               : int  2009 2007 2015 2012 2012 2007 2010 2015 2009 2016 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 632 11000 553 21000 11000 4000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1242] 5 56 85 99 100 178 200 205 207 243 ...
##   .. ..- attr(*, "names")= chr [1:1242] "5" "56" "85" "99" ...
summary(movie)
##               color                director_name  num_critic_for_reviews
##                  :   1   Steven Spielberg :  25   Min.   :  1.0         
##   Black and White: 129   Clint Eastwood   :  19   1st Qu.: 75.0         
##  Color           :3671   Woody Allen      :  19   Median :137.0         
##                          Ridley Scott     :  17   Mean   :165.8         
##                          Martin Scorsese  :  16   3rd Qu.:223.0         
##                          Steven Soderbergh:  16   Max.   :813.0         
##                          (Other)          :3689                         
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   : 37.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 96.0   1st Qu.:   11.0         1st Qu.:  187.0       
##  Median :106.0   Median :   62.0         Median :  433.0       
##  Mean   :110.2   Mean   :  798.6         Mean   :  763.8       
##  3rd Qu.:120.0   3rd Qu.:  234.0         3rd Qu.:  690.0       
##  Max.   :330.0   Max.   :23000.0         Max.   :23000.0       
##                                                                
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  20   Min.   :     0         Min.   :      162  
##  Brad Pitt      :  14   1st Qu.:   736         1st Qu.:  7689458  
##  Charlize Theron:  14   Median :  1000         Median : 29200000  
##  James Franco   :  11   Mean   :  7673         Mean   : 52005759  
##  Jason Flemyng  :  10   3rd Qu.: 13000         3rd Qu.: 66466372  
##  Meryl Streep   :  10   Max.   :640000         Max.   :760505847  
##  (Other)        :3722                                             
##                   genres                actor_1_name 
##  Drama               : 150   Robert De Niro   :  42  
##  Comedy|Drama|Romance: 148   Johnny Depp      :  39  
##  Comedy|Drama        : 142   J.K. Simmons     :  31  
##  Comedy              : 138   Nicolas Cage     :  31  
##  Comedy|Romance      : 131   Denzel Washington:  30  
##  Drama|Romance       : 116   Bruce Willis     :  29  
##  (Other)             :2976   (Other)          :3599  
##                      movie_title   num_voted_users  
##  Halloween聽               :   3   Min.   :      5  
##  Home聽                    :   3   1st Qu.:  18915  
##  King Kong聽               :   3   Median :  53028  
##  Pan聽                     :   3   Mean   : 104677  
##  The Fast and the Furious聽:   3   3rd Qu.: 126916  
##  Victor Frankenstein聽     :   3   Max.   :1689764  
##  (Other)                   :3783                    
##  cast_total_facebook_likes         actor_3_name  facenumber_in_poster
##  Min.   :     0            Steve Coogan  :   8   Min.   : 0.000      
##  1st Qu.:  1865            Anne Hathaway :   7   1st Qu.: 0.000      
##  Median :  3969            Ben Mendelsohn:   7   Median : 1.000      
##  Mean   : 11415            Kirsten Dunst :   7   Mean   : 1.379      
##  3rd Qu.: 16143            Robert Duvall :   7   3rd Qu.: 2.000      
##  Max.   :656730            Bruce McGill  :   6   Max.   :43.000      
##                            (Other)       :3759                       
##                                                                            plot_keywords 
##                                                                                   :  17  
##  1940s|child hero|fantasy world|orphan|reference to peter pan                     :   3  
##  alien friendship|alien invasion|australia|flying car|mother daughter relationship:   3  
##  animal name in title|ape abducts a woman|gorilla|island|king kong                :   3  
##  assistant|experiment|frankenstein|medical student|scientist                      :   3  
##  eighteen wheeler|illegal street racing|truck|trucker|undercover cop              :   3  
##  (Other)                                                                          :3769  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt3332064/?ref_=fn_tt_tt_1:   3  
##  (Other)                                             :3783  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :3626   USA      :3005   R        :1709  
##  1st Qu.: 107.0       French  :  36   UK       : 322   PG-13    :1313  
##  Median : 208.0       Spanish :  23   France   : 103   PG       : 567  
##  Mean   : 333.6       Mandarin:  15   Germany  :  82   G        :  87  
##  3rd Qu.: 397.0       German  :  13   Canada   :  62   Not Rated:  34  
##  Max.   :5060.0       Japanese:  12   Australia:  40            :  29  
##                       (Other) :  76   (Other)  : 187   (Other)  :  62  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1920   Min.   :     0         Min.   :1.600  
##  1st Qu.:1.000e+07   1st Qu.:1999   1st Qu.:   374         1st Qu.:5.900  
##  Median :2.500e+07   Median :2005   Median :   680         Median :6.600  
##  Mean   :4.585e+07   Mean   :2003   Mean   :  2003         Mean   :6.466  
##  3rd Qu.:5.000e+07   3rd Qu.:2010   3rd Qu.:   975         3rd Qu.:7.200  
##  Max.   :1.222e+10   Max.   :2016   Max.   :137000         Max.   :9.300  
##                                                                           
##   aspect_ratio   movie_facebook_likes
##  Min.   : 1.18   Min.   :     0      
##  1st Qu.: 1.85   1st Qu.:     0      
##  Median : 2.35   Median :   224      
##  Mean   : 2.11   Mean   :  9262      
##  3rd Qu.: 2.35   3rd Qu.: 11000      
##  Max.   :16.00   Max.   :349000      
## 

Looking at the variable movie_title, I find that every value ends with a stray trailing character, so I strip the last character from each title

movie$movie_title<-gsub('.$', '', movie$movie_title)
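Note that gsub('.$', '', ...) drops the last character of every title, whatever it is. A more defensive sketch removes the stray character only where it is present. It assumes the stray character is a non-breaking space (U+00A0), which is an assumption about the raw CSV and is not confirmed by the output above. The titles vector is made up for illustration:

```r
# Hypothetical titles: two end with a non-breaking space, one is already clean.
titles <- c("Avatar\u00a0", "Spectre\u00a0", "Clean Title")

# Assumption: the stray trailing character is U+00A0 (non-breaking space).
nbsp <- "\u00a0"

# Drop the last character only when it is the stray one.
has_stray <- endsWith(titles, nbsp)
cleaned <- ifelse(has_stray, substr(titles, 1, nchar(titles) - 1), titles)

# For comparison, the unconditional version also truncates the clean title.
blunt <- gsub(".$", "", titles)
```

With the conditional version, a title that is already clean is left untouched, while the unconditional version shortens it by one character.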

Step 2: Explore Data

I create histograms of duration, num_user_for_reviews, imdb_score, and director_facebook_likes to understand their distributions

options(repr.plot.width=6, repr.plot.height=4) 
g1<-ggplot(movie,aes(x=duration))+geom_histogram(binwidth=5,aes(y=..density..),fill="green4")
g2<-ggplot(movie,aes(x=num_user_for_reviews))+geom_histogram(binwidth=50,aes(y=..density..),fill="red")
g3<-ggplot(movie,aes(x=imdb_score))+geom_histogram(binwidth=1,aes(y=..count..),fill="green4")
g4<-ggplot(movie,aes(x=director_facebook_likes))+geom_histogram(binwidth=5,aes(y=..count..),fill="red")
grid.arrange(g1,g2,g3,g4,nrow=2,ncol=2)

Since our goal is to study the IMDB score, we create a Shiny application to view the imdb_score histogram with different numbers of bins

# Define UI for application that draws a histogram
ui <- shinyUI(fluidPage(
   
   # Application title
   titlePanel("IMDB SCORE DATA"),
   
   # Sidebar with a slider input for number of bins 
   sidebarLayout(
      sidebarPanel(
         sliderInput("bins",
                     "Number of bins:",
                     min = 1,
                     max = 50,
                     value = 30)
      ),
      
      # Show a plot of the generated distribution
      mainPanel(
         plotOutput("distPlot")
      )
   )
))

# Define server logic required to draw a histogram
server <- shinyServer(function(input, output) {
   
   output$distPlot <- renderPlot({
      # generate bins based on input$bins from ui.R
      movie_score    <- movie$imdb_score 
      bins <- seq(min(movie_score), max(movie_score), length.out = input$bins + 1)
      
      # draw the histogram with the specified number of bins
      hist(movie_score, breaks = bins, col = 'darkgray', border = 'white')
   })
})

# Run the application 
shinyApp(ui = ui, server = server)
Shiny applications not supported in static R Markdown documents

From the distribution, there are basically no outliers.

To understand the distribution of the categorical variables, we create a histogram of movie language

ggplot(movie,aes(x=language,fill=language))+
  geom_histogram(stat="count",aes(y=..count../sum(..count..)),binwidth=1)+
  theme(axis.text.x=element_text(angle=90,hjust=0.5,vjust=0),legend.position="none")+
  labs(y="Percent",title="Percentage of Movie Languages")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

English is the main language of the movies, but what does the distribution of non-English movies look like? How many movies are there in each of the other languages?

movie %>% filter(!language == "English") %>% 
  ggplot(aes(x=language,fill=language))+
  geom_histogram(stat="count",aes(y=..count../sum(..count..)),binwidth=1)+
  theme(axis.text.x=element_text(angle=90,hjust=0.5,vjust=0),legend.position="none")+
  labs(y="Percent",title="Percentage of Movie Languages other than English")
## Warning: package 'bindrcpp' was built under R version 3.3.3
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Apart from English, French and Spanish account for the largest numbers of movies
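The shares behind this statement can be checked from the language counts printed in the summary() output above. A small sketch using those printed counts (the "Other" entry lumps all remaining languages together, so it is not a single language):

```r
# Language counts copied from the summary() output above.
lang_counts <- c(English = 3626, French = 36, Spanish = 23, Mandarin = 15,
                 German = 13, Japanese = 12, Other = 76)

# Keep only the non-English entries and convert counts to shares.
non_english <- lang_counts[names(lang_counts) != "English"]
shares <- round(non_english / sum(non_english), 3)

# Largest shares first.
shares_sorted <- sort(shares, decreasing = TRUE)
```

Among the individually listed languages, French makes up about 21% and Spanish about 13% of the non-English titles.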

To understand how many movies each country produces every year, I create a data frame of year and country with the movie count, keeping only country-year pairs with at least 5 movies.

year_movies<-movie %>% 
  group_by(title_year,country) %>% 
  summarize(movie_count=n())%>%
  filter(movie_count>=5)
head(year_movies)
## # A tibble: 6 x 3
## # Groups:   title_year [6]
##   title_year country movie_count
##        <int>  <fctr>       <int>
## 1       1974     USA           6
## 2       1977     USA           5
## 3       1978     USA           8
## 4       1980     USA          13
## 5       1981     USA           9
## 6       1982     USA          16

Plot the number of movies over time, colored by country

options(repr.plot.width=4, repr.plot.height=4)
ggplot(year_movies,aes(x=title_year,y=movie_count,color=country))+
  geom_line(size =2)+xlab("Year")+ylab("Num of Movies")+theme_classic()

The number of USA movies in the data increases sharply in the 1990s and is higher than that of any other country.

Step 3: Answer question 1

Who are the top ten directors with the best movie reviews? Calculate the average score of each director and choose the top ten. Because top_n() keeps ties at the cutoff, 11 directors are returned.

best_director<-movie%>%group_by(director_name)%>%
    summarise(mean= mean(imdb_score))%>%
    top_n(10)%>%
    arrange(desc(mean))
## Selecting by mean
best_director
## # A tibble: 11 x 2
##        director_name     mean
##               <fctr>    <dbl>
##  1   Charles Chaplin 8.600000
##  2         Tony Kaye 8.600000
##  3  Alfred Hitchcock 8.500000
##  4   Damien Chazelle 8.500000
##  5      Majid Majidi 8.500000
##  6        Ron Fricke 8.500000
##  7      Sergio Leone 8.433333
##  8 Christopher Nolan 8.425000
##  9    Asghar Farhadi 8.400000
## 10  Richard Marquand 8.400000
## 11    S.S. Rajamouli 8.400000
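A ranking by plain average gives a director with a single highly rated film the same weight as a director with many films. One possible refinement, sketched here in base R so it runs without extra packages, is to require a minimum number of movies before ranking. The toy data and the threshold of 3 are made up for illustration and are not part of the original analysis:

```r
# Hypothetical data: director B has only one (very highly rated) movie.
toy <- data.frame(
  director_name = c("A", "A", "A", "B", "C", "C", "C"),
  imdb_score    = c(8.0, 8.2, 7.8, 9.0, 7.0, 7.5, 8.0),
  stringsAsFactors = FALSE
)

# Mean score and number of movies per director.
mean_score <- tapply(toy$imdb_score, toy$director_name, mean)
n_movies   <- tapply(toy$imdb_score, toy$director_name, length)

ranked <- data.frame(
  director_name = names(mean_score),
  mean = as.numeric(mean_score),
  n    = as.integer(n_movies),
  stringsAsFactors = FALSE
)

# Assumed threshold: keep directors with at least 3 movies, then sort by mean.
min_movies <- 3
ranked <- ranked[ranked$n >= min_movies, ]
ranked <- ranked[order(-ranked$mean), ]
```

In this toy example director B drops out despite having the highest average, because that average rests on one film.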

Plot the top 10 directors with their average score

ggplot(best_director, aes(x = director_name, y = mean, alpha = mean))+
  geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Best 10 Directors", y = "Average Imdb Score") + 
  ggtitle("Top 10 directors with average score")+coord_flip(ylim=c(8,8.75))

What are the top genres with the best movie reviews? Calculate the average score of each genre combination and choose the top 3. Because of ties, 5 rows are returned.

best_category<-movie%>%
  group_by(genres)%>%
  summarise(mean= mean(imdb_score))%>%
  top_n(3)%>%
  arrange(desc(mean))
## Selecting by mean
best_category
## # A tibble: 5 x 2
##                                     genres  mean
##                                     <fctr> <dbl>
## 1 Adventure|Animation|Drama|Family|Musical   8.5
## 2              Crime|Drama|Fantasy|Mystery   8.5
## 3       Action|Adventure|Drama|Fantasy|War   8.4
## 4              Adventure|Animation|Fantasy   8.4
## 5             Adventure|Drama|Thriller|War   8.4
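Grouping by the genres column treats every pipe-separated combination as its own genre. If the interest is in individual genres, one option is to split the combinations first and average per single genre. This is a base R sketch on made-up data, not the approach used above; with the real data the column would first need as.character() because it was read as a factor:

```r
# Hypothetical movies with pipe-separated genre combinations.
toy <- data.frame(
  genres     = c("Action|Adventure", "Drama", "Action|Drama"),
  imdb_score = c(6.0, 8.0, 7.0),
  stringsAsFactors = FALSE
)

# Split each combination on the literal "|" character.
parts <- strsplit(toy$genres, "|", fixed = TRUE)

# One row per (movie, single genre) pair; the score is repeated per genre.
long <- data.frame(
  genre      = unlist(parts),
  imdb_score = rep(toy$imdb_score, lengths(parts)),
  stringsAsFactors = FALSE
)

# Average score per individual genre, best first.
genre_mean <- sort(tapply(long$imdb_score, long$genre, mean), decreasing = TRUE)
```

Here a movie counts once for each of its genres, so "Action|Drama" contributes to both the Action and the Drama average.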

Plot the top genres with their average score

ggplot(best_category, aes(x = genres, y = mean, alpha = mean))+
  geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top movie genre", y = "Average Imdb Score") + 
  ggtitle("Top movie genre with average score")+coord_flip(ylim=c(8.3,8.6))

Who are the top actors with the best movie reviews? Calculate the average score of each actor and choose the top ten for each of the three actor columns

best_actor<-movie%>%
  group_by(actor_1_name)%>%
  summarise(mean= mean(imdb_score))%>%
  top_n(10)%>%
  arrange(desc(mean))
## Selecting by mean
best_actor
## # A tibble: 14 x 2
##              actor_1_name  mean
##                    <fctr> <dbl>
##  1       Scatman Crothers   8.7
##  2        Takashi Shimura   8.7
##  3         Bunta Sugawara   8.6
##  4       Paulette Goddard   8.6
##  5         Bahare Seddiqi   8.5
##  6 Collin Alfredo St. Dic   8.5
##  7             Emilia Fox   8.5
##  8            Janet Leigh   8.5
##  9           Claude Rains   8.4
## 10        Jürgen Prochnow   8.4
## 11      Mathieu Kassovitz   8.4
## 12          Mhairi Calvey   8.4
## 13        Shahab Hosseini   8.4
## 14       Tamannaah Bhatia   8.4
best_actor2<-movie%>%
  group_by(actor_2_name)%>%
  summarise(mean= mean(imdb_score))%>%
  top_n(10)%>%
  arrange(desc(mean))
## Selecting by mean
best_actor2
## # A tibble: 10 x 2
##         actor_2_name  mean
##               <fctr> <dbl>
##  1    Jeffrey DeMunn   8.9
##  2    Luigi Pistilli   8.9
##  3       Kenny Baker   8.8
##  4      Marcus Chong   8.7
##  5  Michael Berryman   8.7
##  6     Minoru Chiaki   8.7
##  7     Peter Cushing   8.7
##  8         Seu Jorge   8.7
##  9  Ryûnosuke Kamiki   8.6
## 10  Stanley Blystone   8.6
best_actor3<-movie%>%
  group_by(actor_3_name)%>%
  summarise(mean= mean(imdb_score))%>%
  top_n(10)%>%
  arrange(desc(mean))
## Selecting by mean
best_actor3
## # A tibble: 11 x 2
##           actor_3_name  mean
##                 <fctr> <dbl>
##  1    Caroline Goodall  8.90
##  2         Enzo Petito  8.90
##  3         Phil LaMarr  8.90
##  4     Anthony Daniels  8.80
##  5   Eugenie Bondurant  8.80
##  6        Sam Anderson  8.80
##  7          Billy Boyd  8.75
##  8 Alexandre Rodrigues  8.70
##  9       Gloria Foster  8.70
## 10   Kamatari Fujiwara  8.70
## 11     Louise Fletcher  8.70

Plot the top actors with their average score

actor1<-ggplot(best_actor, aes(x = actor_1_name, y = mean, alpha = mean))+
  geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top actor1", y = "Average Imdb Score") + 
  ggtitle("Top actor1 with average score")+coord_flip(ylim=c(8.0,9.0))
actor2<-ggplot(best_actor2, aes(x = actor_2_name, y = mean, alpha = mean))+
  geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top actor2", y = "Average Imdb Score") + 
  ggtitle("Top actor2 with average score")+coord_flip(ylim=c(8.0,9.0))
actor3<-ggplot(best_actor3, aes(x = actor_3_name, y = mean, alpha = mean))+
  geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top actor3", y = "Average Imdb Score") + 
  ggtitle("Top actor3 with average score")+coord_flip(ylim=c(8.0,9.0))
grid.arrange(actor1,actor2,actor3)

Which countries’ movies tend to obtain high scores?

best_moviecountry<-movie%>%
  group_by(country)%>%
  summarise(mean= mean(imdb_score))%>%
  top_n(10)%>%
  arrange(desc(mean))
## Selecting by mean
best_moviecountry
## # A tibble: 10 x 2
##         country     mean
##          <fctr>    <dbl>
##  1 West Germany 8.400000
##  2       Israel 8.000000
##  3       Brazil 7.760000
##  4         Iran 7.725000
##  5    Argentina 7.600000
##  6    Indonesia 7.600000
##  7       Sweden 7.600000
##  8  Netherlands 7.566667
##  9     Colombia 7.500000
## 10  New Zealand 7.481818
moviecountry<-ggplot(best_moviecountry, aes(x = country, y = mean, alpha = mean))+
  geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top 10 movie country", y = "Average Imdb Score") + 
  ggtitle("Top movie country with average score")+coord_flip(ylim=c(7.0,9.0))
moviecountry

Step 4: Answer Question 2

What are the relationships between variables?

Let’s look at correlations. Select the numeric variables

numeric_col <- sapply(movie, is.numeric)
movie_numeric<- movie[,numeric_col]

Create the correlation matrix

Correlation<-cor(movie_numeric)
corrplot(Correlation, method = "color")

Correlation
##                           num_critic_for_reviews    duration
## num_critic_for_reviews                1.00000000  0.22770510
## duration                              0.22770510  1.00000000
## director_facebook_likes               0.17691581  0.17973400
## actor_3_facebook_likes                0.25508576  0.12577123
## actor_1_facebook_likes                0.17019757  0.08471988
## gross                                 0.46853499  0.24474304
## num_voted_users                       0.59498973  0.33803814
## cast_total_facebook_likes             0.24100544  0.12117142
## facenumber_in_poster                 -0.03400866  0.02909973
## num_user_for_reviews                  0.56679514  0.35039147
## budget                                0.10568105  0.06816122
## title_year                            0.41038049 -0.12942203
## actor_2_facebook_likes                0.25583672  0.12945227
## imdb_score                            0.34388077  0.36612369
## aspect_ratio                          0.18064082  0.15311429
## movie_facebook_likes                  0.70396936  0.21493610
##                           director_facebook_likes actor_3_facebook_likes
## num_critic_for_reviews                 0.17691581             0.25508576
## duration                               0.17973400             0.12577123
## director_facebook_likes                1.00000000             0.11824025
## actor_3_facebook_likes                 0.11824025             1.00000000
## actor_1_facebook_likes                 0.09073261             0.25372024
## gross                                  0.13993814             0.30158391
## num_voted_users                        0.30061915             0.26945536
## cast_total_facebook_likes              0.11974122             0.49068631
## facenumber_in_poster                  -0.04761895             0.10501768
## num_user_for_reviews                   0.21831138             0.20732096
## budget                                 0.01855931             0.04047813
## title_year                            -0.04460636             0.11553537
## actor_2_facebook_likes                 0.11690032             0.55418237
## imdb_score                             0.19083814             0.06497354
## aspect_ratio                           0.03787106             0.04712336
## movie_facebook_likes                   0.16273728             0.27251268
##                           actor_1_facebook_likes       gross
## num_critic_for_reviews                0.17019757  0.46853499
## duration                              0.08471988  0.24474304
## director_facebook_likes               0.09073261  0.13993814
## actor_3_facebook_likes                0.25372024  0.30158391
## actor_1_facebook_likes                1.00000000  0.14704475
## gross                                 0.14704475  1.00000000
## num_voted_users                       0.18226526  0.62694784
## cast_total_facebook_likes             0.94492526  0.23868703
## facenumber_in_poster                  0.05757968 -0.03225370
## num_user_for_reviews                  0.12522139  0.54710674
## budget                                0.01708638  0.10038914
## title_year                            0.09374233  0.05236800
## actor_2_facebook_likes                0.39267587  0.25465945
## imdb_score                            0.09313142  0.21212439
## aspect_ratio                          0.05760375  0.06526004
## movie_facebook_likes                  0.13177824  0.36849402
##                           num_voted_users cast_total_facebook_likes
## num_critic_for_reviews         0.59498973                0.24100544
## duration                       0.33803814                0.12117142
## director_facebook_likes        0.30061915                0.11974122
## actor_3_facebook_likes         0.26945536                0.49068631
## actor_1_facebook_likes         0.18226526                0.94492526
## gross                          0.62694784                0.23868703
## num_voted_users                1.00000000                0.25194009
## cast_total_facebook_likes      0.25194009                1.00000000
## facenumber_in_poster          -0.03202642                0.08098495
## num_user_for_reviews           0.77992455                0.18228784
## budget                         0.06682395                0.02942336
## title_year                     0.02193838                0.12401462
## actor_2_facebook_likes         0.24666028                0.64401612
## imdb_score                     0.47791732                0.10625870
## aspect_ratio                   0.08548456                0.06967465
## movie_facebook_likes           0.51869065                0.20706080
##                           facenumber_in_poster num_user_for_reviews
## num_critic_for_reviews             -0.03400866           0.56679514
## duration                            0.02909973           0.35039147
## director_facebook_likes            -0.04761895           0.21831138
## actor_3_facebook_likes              0.10501768           0.20732096
## actor_1_facebook_likes              0.05757968           0.12522139
## gross                              -0.03225370           0.54710674
## num_voted_users                    -0.03202642           0.77992455
## cast_total_facebook_likes           0.08098495           0.18228784
## facenumber_in_poster                1.00000000          -0.07940360
## num_user_for_reviews               -0.07940360           1.00000000
## budget                             -0.02175723           0.07125387
## title_year                          0.06795245           0.01759409
## actor_2_facebook_likes              0.07413806           0.18958182
## imdb_score                         -0.06429247           0.32252237
## aspect_ratio                        0.01662043           0.09855669
## movie_facebook_likes                0.01433235           0.37197029
##                                budget  title_year actor_2_facebook_likes
## num_critic_for_reviews     0.10568105  0.41038049             0.25583672
## duration                   0.06816122 -0.12942203             0.12945227
## director_facebook_likes    0.01855931 -0.04460636             0.11690032
## actor_3_facebook_likes     0.04047813  0.11553537             0.55418237
## actor_1_facebook_likes     0.01708638  0.09374233             0.39267587
## gross                      0.10038914  0.05236800             0.25465945
## num_voted_users            0.06682395  0.02193838             0.24666028
## cast_total_facebook_likes  0.02942336  0.12401462             0.64401612
## facenumber_in_poster      -0.02175723  0.06795245             0.07413806
## num_user_for_reviews       0.07125387  0.01759409             0.18958182
## budget                     1.00000000  0.04629319             0.03621089
## title_year                 0.04629319  1.00000000             0.11973855
## actor_2_facebook_likes     0.03621089  0.11973855             1.00000000
## imdb_score                 0.02904057 -0.12926516             0.10206038
## aspect_ratio               0.02579646  0.21977924             0.06421530
## movie_facebook_likes       0.05303510  0.30283494             0.23363209
##                            imdb_score aspect_ratio movie_facebook_likes
## num_critic_for_reviews     0.34388077   0.18064082           0.70396936
## duration                   0.36612369   0.15311429           0.21493610
## director_facebook_likes    0.19083814   0.03787106           0.16273728
## actor_3_facebook_likes     0.06497354   0.04712336           0.27251268
## actor_1_facebook_likes     0.09313142   0.05760375           0.13177824
## gross                      0.21212439   0.06526004           0.36849402
## num_voted_users            0.47791732   0.08548456           0.51869065
## cast_total_facebook_likes  0.10625870   0.06967465           0.20706080
## facenumber_in_poster      -0.06429247   0.01662043           0.01433235
## num_user_for_reviews       0.32252237   0.09855669           0.37197029
## budget                     0.02904057   0.02579646           0.05303510
## title_year                -0.12926516   0.21977924           0.30283494
## actor_2_facebook_likes     0.10206038   0.06421530           0.23363209
## imdb_score                 1.00000000   0.02845372           0.27947774
## aspect_ratio               0.02845372   1.00000000           0.11031824
## movie_facebook_likes       0.27947774   0.11031824           1.00000000

We notice that some variables are highly positively correlated, such as actor_1_facebook_likes and cast_total_facebook_likes (0.945), and movie_facebook_likes and num_critic_for_reviews (0.704).
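Rather than scanning the printed matrix by eye, the strongest pairs can be listed programmatically. A minimal sketch on a 3 x 3 excerpt of the matrix above (values copied from the output; the same steps should work on the full Correlation object):

```r
# Excerpt of the correlation matrix printed above.
vars <- c("actor_1_facebook_likes", "cast_total_facebook_likes",
          "actor_2_facebook_likes")
m <- matrix(c(1,          0.94492526, 0.39267587,
              0.94492526, 1,          0.64401612,
              0.39267587, 0.64401612, 1),
            nrow = 3, dimnames = list(vars, vars))

# Each pair appears once in the upper triangle (diagonal excluded).
idx <- which(upper.tri(m), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  cor  = m[idx],
  stringsAsFactors = FALSE
)

# Strongest relationships first, regardless of sign.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$cor)), ]
```

The first row of cor_pairs is the actor_1_facebook_likes and cast_total_facebook_likes pair.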

We create scatter plots to understand these relationships

ggplot(movie, aes(x =actor_1_facebook_likes, y =cast_total_facebook_likes))+
  geom_point(size=2) +
  stat_smooth(method = lm, se = F, color = "red")+geom_smooth()+
  labs(title = "actor1 facebook likes Vs. cast_total_facebook_likes", 
       x = "actor1 facebook likes", y = "cast_total_facebook_likes")+ ggtitle(paste("cor:", 0.945)) 
## `geom_smooth()` using method = 'gam'

ggplot(movie, aes(x =movie_facebook_likes, y =num_critic_for_reviews))+
  geom_point(size=2) +
  stat_smooth(method = lm, se = F, color = "red")+geom_smooth()+
  labs(title = "Movie Facebook likes Vs. number of critics for reviews", 
       x = "Movie Facebook likes", y = "Number of critics reviews")+ ggtitle(paste("cor:", 0.704)) 
## `geom_smooth()` using method = 'gam'

Our main goal is predicting the IMDB score, so we need to understand its determining variables.

From the correlation matrix, the variables most correlated with imdb_score are num_voted_users (0.48), duration (0.37), num_critic_for_reviews (0.34), num_user_for_reviews (0.32), movie_facebook_likes (0.28), and gross (0.21). We treat these as the determining variables.
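This selection can be reproduced by sorting the imdb_score column of the correlation matrix by absolute value. A small sketch using the values printed above (the diagonal entry for imdb_score itself is left out):

```r
# Correlations with imdb_score, copied from the matrix printed above.
score_cor <- c(
  num_critic_for_reviews    =  0.34388077,
  duration                  =  0.36612369,
  director_facebook_likes   =  0.19083814,
  actor_3_facebook_likes    =  0.06497354,
  actor_1_facebook_likes    =  0.09313142,
  gross                     =  0.21212439,
  num_voted_users           =  0.47791732,
  cast_total_facebook_likes =  0.10625870,
  facenumber_in_poster      = -0.06429247,
  num_user_for_reviews      =  0.32252237,
  budget                    =  0.02904057,
  title_year                = -0.12926516,
  actor_2_facebook_likes    =  0.10206038,
  aspect_ratio              =  0.02845372,
  movie_facebook_likes      =  0.27947774
)

# Rank by strength of association, ignoring the sign.
ranked <- sort(abs(score_cor), decreasing = TRUE)
top6 <- names(ranked)[1:6]
```

The six strongest entries are exactly the six variables chosen here. Correlation only measures linear association, so this is a screening step, not proof that these variables determine the score.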

Plotting duration against imdb score to understand their relationship. We color the points by the number of user reviews, so we first need to bin that variable into categories

movie$num_user_reviews<-cut(movie$num_user_for_reviews, breaks = c(0,107,208,333,397,5100), labels = c("very few","few","middle","high","very high"))
summary(movie$num_user_reviews)
ggplot(movie, aes(x =duration, y =imdb_score))+
  geom_point(size=2, aes(colour=num_user_reviews)) +
  labs(title = "Duration Vs. IMDB Score and Number of User Reviews", 
       x = "Duration", y = "IMDB Score")
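The breaks passed to cut() above are the minimum bound, the quartiles, and the mean of num_user_for_reviews from the earlier summary. By default cut() uses right-closed intervals, so a value equal to a break falls into the lower bin. A short sketch with made-up review counts shows the assignment:

```r
# Same breaks and labels as in the cut() call above.
breaks <- c(0, 107, 208, 333, 397, 5100)
labels <- c("very few", "few", "middle", "high", "very high")

# Hypothetical review counts, including one that sits exactly on a break.
reviews <- c(50, 107, 108, 300, 350, 5060)

# Intervals are (0,107], (107,208], (208,333], (333,397], (397,5100].
bins <- cut(reviews, breaks = breaks, labels = labels)
as.character(bins)
```

A count of 107 is labeled "very few" and 108 is labeled "few". Values of 0 or above 5100 would become NA with these breaks.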

Plotting num_voted_users against imdb score to understand their relationship:

ggplot(movie, aes(x =num_voted_users, y =imdb_score))+
  geom_point()+
  labs(title = "Voted User number Vs. IMDB Score", 
       x = "voted user number", y = "IMDB Score")

Plotting num_user_for_reviews and imdb score to understand their relationship:

ggplot(movie, aes(x =num_user_for_reviews, y =imdb_score))+
  geom_point()+
  labs(title = "User Review number Vs. IMDB Score", 
       x = "User review number", y = "IMDB Score")

Step 5: Answer Question 3

How do the determining variables affect the score? Build a model to predict the score

To understand how the determining variables affect the score, we first select them

movies_importat_variables = movie[, c("imdb_score",
                                      "num_critic_for_reviews",
                                      "num_user_for_reviews",
                                      "num_voted_users",
                                      "duration",
                                      "movie_facebook_likes",
                                      "gross")]

We split our data into a training set (90%) and a test set (10%)

set.seed(2)
train <- sample(dim(movies_importat_variables)[1],dim(movies_importat_variables)[1]*0.9)
movie_train <- movies_importat_variables[train,]
movie_test <- movies_importat_variables[-train,]
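The same index-based split can be checked on a small made-up data frame. This sketch uses floor() to make the training size an explicit integer, which is a small deviation from the code above, and verifies that the two sets do not overlap:

```r
set.seed(2)

# Hypothetical data: 100 rows with an id and a random score.
toy <- data.frame(id = 1:100, score = round(runif(100, 1, 10), 1))

# Draw 90% of the row indices without replacement for training.
n <- nrow(toy)
train_idx <- sample(n, floor(n * 0.9))

# Negative indexing keeps every row that was not sampled.
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Sanity checks: sizes add up and no row is in both sets.
c(train = nrow(toy_train), test = nrow(toy_test))
length(intersect(toy_train$id, toy_test$id))
```

Fixing the seed makes the split reproducible, so the test error reported later refers to the same held-out rows on every run.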

We fit a linear model

lmfit = lm(imdb_score~.,data=movie_train)
summary(lmfit)
## 
## Call:
## lm(formula = imdb_score ~ ., data = movie_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5361 -0.4889  0.0760  0.6059  2.4958 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.850e+00  7.906e-02  61.348  < 2e-16 ***
## num_critic_for_reviews  1.443e-03  1.950e-04   7.397 1.74e-13 ***
## num_user_for_reviews   -5.321e-04  6.191e-05  -8.596  < 2e-16 ***
## num_voted_users         4.094e-06  1.815e-07  22.559  < 2e-16 ***
## duration                1.174e-02  7.075e-04  16.594  < 2e-16 ***
## movie_facebook_likes   -2.921e-06  1.025e-06  -2.851  0.00439 ** 
## gross                  -2.532e-09  2.798e-10  -9.050  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8755 on 3413 degrees of freedom
## Multiple R-squared:  0.3123, Adjusted R-squared:  0.3111 
## F-statistic: 258.3 on 6 and 3413 DF,  p-value: < 2.2e-16
plot(lmfit)
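The coefficients are small because the predictors are on large scales, so it helps to translate them into realistic changes. A sketch using the rounded estimates from the summary above, holding the other predictors fixed. The step sizes (30 minutes, 100,000 votes, 100 million in gross) are arbitrary choices for illustration:

```r
# Rounded estimates copied from the coefficient table above.
coefs <- c(
  num_critic_for_reviews =  1.443e-03,
  num_user_for_reviews   = -5.321e-04,
  num_voted_users        =  4.094e-06,
  duration               =  1.174e-02,
  movie_facebook_likes   = -2.921e-06,
  gross                  = -2.532e-09
)

# Predicted change in imdb_score for an illustrative change in one predictor.
effect_30_minutes  <- coefs[["duration"]] * 30
effect_100k_votes  <- coefs[["num_voted_users"]] * 1e5
effect_100m_gross  <- coefs[["gross"]] * 1e8

round(c(effect_30_minutes, effect_100k_votes, effect_100m_gross), 3)
```

Under this model, 30 extra minutes of running time is associated with roughly +0.35 points, 100,000 extra votes with roughly +0.41 points, and 100 million extra gross with roughly -0.25 points. Since several predictors are strongly correlated with each other, the signs of individual coefficients should be read with caution.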

Test the model on the test data using mean squared error (MSE)

pred <- predict(lmfit,movie_test)
mean((movie_test$imdb_score-pred)^2)

The MSE of the linear model is around 0.78
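For reference, the mean squared error is the average of the squared differences between actual and predicted scores. A tiny sketch with made-up numbers (the helper name mse is introduced here for illustration and is not used elsewhere in the analysis):

```r
# Hypothetical helper: mean squared error between two numeric vectors.
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Made-up scores: errors are 0.5, 0 and -1, so squared errors are 0.25, 0 and 1.
actual    <- c(7.0, 6.0, 8.0)
predicted <- c(6.5, 6.0, 9.0)

toy_mse <- mse(actual, predicted)

# Taking the square root puts the error back on the score scale.
toy_rmse <- sqrt(toy_mse)
```

Applied to the reported value, an MSE of about 0.78 corresponds to a typical error of about 0.88 rating points.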

Build a regression tree model

set.seed(3)
m.rpart <- rpart(imdb_score~.,data=movie_train)
m.rpart
## n= 3420 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 3420 3803.67300 6.478596  
##    2) num_voted_users< 88566 2261 2348.41200 6.158425  
##      4) duration< 110.5 1554 1637.39400 5.955727  
##        8) gross>=3679220 1122 1146.47000 5.808467  
##         16) num_voted_users< 32267.5 565  604.98740 5.541062  
##           32) num_user_for_reviews>=275 39   40.60974 4.535897 *
##           33) num_user_for_reviews< 275 526  522.05220 5.615589 *
##         17) num_voted_users>=32267.5 557  460.10080 6.079713 *
##        9) gross< 3679220 432  403.39980 6.338194 *
##      5) duration>=110.5 707  506.82890 6.603960 *
##    3) num_voted_users>=88566 1159  771.33820 7.103192  
##      6) num_voted_users< 349779.5 950  532.38460 6.921895  
##       12) gross>=2.73e+07 757  404.31110 6.796565 *
##       13) gross< 2.73e+07 193   69.54497 7.413472 *
##      7) num_voted_users>=349779.5 209   65.79455 7.927273 *
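To make the printed tree easier to read, the splits can be written out by hand as nested conditions. The function below is a manual transcription of the output above, not something produced by rpart. The name predict_from_printed_tree is made up, and the gross threshold 2.73e+07 is the rounded value shown in the printout:

```r
# Hand-written transcription of the printed tree; returns the leaf mean (yval).
predict_from_printed_tree <- function(num_voted_users, duration, gross,
                                      num_user_for_reviews) {
  if (num_voted_users < 88566) {
    if (duration < 110.5) {
      if (gross >= 3679220) {
        if (num_voted_users < 32267.5) {
          if (num_user_for_reviews >= 275) 4.535897 else 5.615589
        } else {
          6.079713
        }
      } else {
        6.338194
      }
    } else {
      6.603960
    }
  } else {
    if (num_voted_users < 349779.5) {
      # Threshold as printed (rounded to 2.73e+07).
      if (gross >= 2.73e7) 6.796565 else 7.413472
    } else {
      7.927273
    }
  }
}

# Hypothetical movies: a heavily voted one and a short, little-voted one.
predict_from_printed_tree(400000, 120, 1e8, 900)
predict_from_printed_tree(20000, 95, 5e6, 100)
```

The first split is on num_voted_users, and any movie with at least 349,779.5 votes lands in the leaf with the highest predicted score, whatever its other attributes.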

Plot the tree

rpart.plot(m.rpart,digits = 3)

Test the model on the test data

p.rpart <- predict(m.rpart,movie_test)
mean((p.rpart-movie_test$imdb_score)^2)
## [1] 0.7755274

The MSE of the tree model is also around 0.78

We build a random forest

set.seed(100)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf <- randomForest(imdb_score~.,data=movie_train,ntree=500)
pred_rf <- predict(rf,movie_test)
mean((pred_rf-movie_test$imdb_score)^2)
## [1] 0.5940703

The MSE of the random forest is around 0.59, so the prediction error drops substantially compared with the linear and tree models
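The size of the improvement can be quantified from the two test errors printed above. A short sketch using those reported values (the linear model is left out because only its approximate MSE is stated):

```r
# Test-set MSE values copied from the output above.
mse_tree <- 0.7755274
mse_rf   <- 0.5940703

# Relative reduction in MSE when moving from the tree to the random forest.
rel_reduction <- (mse_tree - mse_rf) / mse_tree

# Root mean squared error, on the same scale as imdb_score.
rmse <- sqrt(c(tree = mse_tree, random_forest = mse_rf))

round(rel_reduction, 3)
round(rmse, 3)
```

The random forest lowers the test MSE by about 23% relative to the single tree, which corresponds to a typical error of about 0.77 rating points compared with about 0.88.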