How can we tell how good a movie will be before it is released in cinemas? Answering this question matters to producers planning their next blockbuster and to TV stations deciding which movies to buy for the next season. A movie's IMDb score can serve as a metric of its success, so in this project I use IMDb data to find out what makes a movie successful.
The data, from Kaggle, contains 28 variables for 5,043 movies and 4,906 posters (998 MB), spanning 100 years and 66 countries. Variables include "duration", "director_name", "director_facebook_likes", "actor_3_name", "actor_3_facebook_likes", "genres", "num_voted_users", "movie_imdb_link", "language", "country", "budget", "title_year", "imdb_score", "aspect_ratio", etc.
In this project, we will answer these 3 questions:
1 Which directors, genres, actors and countries have the best movie reviews?
2 What are the relationships among the variables?
3 What are the determining variables of the IMDb movie score, and how do they affect it? (To answer this third question, we develop a predictive model that estimates the IMDb rating from the given attributes.)
Step 1: Prepare and clean data
Import data
movie<- read.csv(file ="movie_metadata.csv", stringsAsFactors = TRUE)
Before cleaning the data, we first need to understand it
dim(movie)
## [1] 5043 28
head(movie)
## color director_name num_critic_for_reviews duration
## 1 Color James Cameron 723 178
## 2 Color Gore Verbinski 302 169
## 3 Color Sam Mendes 602 148
## 4 Color Christopher Nolan 813 164
## 5 Doug Walker NA NA
## 6 Color Andrew Stanton 462 132
## director_facebook_likes actor_3_facebook_likes actor_2_name
## 1 0 855 Joel David Moore
## 2 563 1000 Orlando Bloom
## 3 0 161 Rory Kinnear
## 4 22000 23000 Christian Bale
## 5 131 NA Rob Walker
## 6 475 530 Samantha Morton
## actor_1_facebook_likes gross genres
## 1 1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2 40000 309404152 Action|Adventure|Fantasy
## 3 11000 200074175 Action|Adventure|Thriller
## 4 27000 448130642 Action|Thriller
## 5 131 NA Documentary
## 6 640 73058679 Action|Adventure|Sci-Fi
## actor_1_name movie_title
## 1 CCH Pounder Avatar聽
## 2 Johnny Depp Pirates of the Caribbean: At World's End聽
## 3 Christoph Waltz Spectre聽
## 4 Tom Hardy The Dark Knight Rises聽
## 5 Doug Walker Star Wars: Episode VII - The Force Awakens聽
## 6 Daryl Sabara John Carter聽
## num_voted_users cast_total_facebook_likes actor_3_name
## 1 886204 4834 Wes Studi
## 2 471220 48350 Jack Davenport
## 3 275868 11700 Stephanie Sigman
## 4 1144337 106759 Joseph Gordon-Levitt
## 5 8 143
## 6 212204 1873 Polly Walker
## facenumber_in_poster
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## plot_keywords
## 1 avatar|future|marine|native|paraplegic
## 2 goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3 bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5
## 6 alien|american civil war|male nipple|mars|princess
## movie_imdb_link
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
## num_user_for_reviews language country content_rating budget
## 1 3054 English USA PG-13 237000000
## 2 1238 English USA PG-13 300000000
## 3 994 English UK PG-13 245000000
## 4 2701 English USA PG-13 250000000
## 5 NA NA
## 6 738 English USA PG-13 263700000
## title_year actor_2_facebook_likes imdb_score aspect_ratio
## 1 2009 936 7.9 1.78
## 2 2007 5000 7.1 2.35
## 3 2015 393 6.8 2.35
## 4 2012 23000 8.5 2.35
## 5 NA 12 7.1 NA
## 6 2012 632 6.6 2.35
## movie_facebook_likes
## 1 33000
## 2 0
## 3 85000
## 4 164000
## 5 0
## 6 24000
str(movie)
## 'data.frame': 5043 obs. of 28 variables:
## $ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ director_name : Factor w/ 2399 levels "","A. Raven Cruz",..: 923 796 2023 375 602 101 2026 1648 1220 550 ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 533 2432 2548 1227 801 2439 653 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
## $ actor_1_name : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 301 978 351 1965 524 439 784 218 334 32 ...
## $ movie_title : Factor w/ 4917 levels "#Horror聽","[Rec] 2聽",..: 397 2729 3278 3706 3331 1960 3288 3458 398 1630 ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3439 1390 3131 1764 1 2711 1967 2160 3015 2938 ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
colnames(movie)
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
summary(movie)
## color director_name num_critic_for_reviews
## : 19 : 104 Min. : 1.0
## Black and White: 209 Steven Spielberg: 26 1st Qu.: 50.0
## Color :4815 Woody Allen : 22 Median :110.0
## Clint Eastwood : 20 Mean :140.2
## Martin Scorsese : 20 3rd Qu.:195.0
## Ridley Scott : 17 Max. :813.0
## (Other) :4834 NA's :50
## duration director_facebook_likes actor_3_facebook_likes
## Min. : 7.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 93.0 1st Qu.: 7.0 1st Qu.: 133.0
## Median :103.0 Median : 49.0 Median : 371.5
## Mean :107.2 Mean : 686.5 Mean : 645.0
## 3rd Qu.:118.0 3rd Qu.: 194.5 3rd Qu.: 636.0
## Max. :511.0 Max. :23000.0 Max. :23000.0
## NA's :15 NA's :104 NA's :23
## actor_2_name actor_1_facebook_likes gross
## Morgan Freeman : 20 Min. : 0 Min. : 162
## Charlize Theron: 15 1st Qu.: 614 1st Qu.: 5340988
## Brad Pitt : 14 Median : 988 Median : 25517500
## : 13 Mean : 6560 Mean : 48468408
## James Franco : 11 3rd Qu.: 11000 3rd Qu.: 62309438
## Meryl Streep : 11 Max. :640000 Max. :760505847
## (Other) :4959 NA's :7 NA's :884
## genres actor_1_name
## Drama : 236 Robert De Niro : 49
## Comedy : 209 Johnny Depp : 41
## Comedy|Drama : 191 Nicolas Cage : 33
## Comedy|Drama|Romance: 187 J.K. Simmons : 31
## Comedy|Romance : 158 Bruce Willis : 30
## Drama|Romance : 152 Denzel Washington: 30
## (Other) :3910 (Other) :4829
## movie_title num_voted_users
## Ben-Hur聽 : 3 Min. : 5
## Halloween聽 : 3 1st Qu.: 8594
## Home聽 : 3 Median : 34359
## King Kong聽 : 3 Mean : 83668
## Pan聽 : 3 3rd Qu.: 96309
## The Fast and the Furious聽: 3 Max. :1689764
## (Other) :5025
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## Min. : 0 : 23 Min. : 0.000
## 1st Qu.: 1411 Ben Mendelsohn: 8 1st Qu.: 0.000
## Median : 3090 John Heard : 8 Median : 1.000
## Mean : 9699 Steve Coogan : 8 Mean : 1.371
## 3rd Qu.: 13756 Anne Hathaway : 7 3rd Qu.: 2.000
## Max. :656730 Jon Gries : 7 Max. :43.000
## (Other) :4982 NA's :13
## plot_keywords
## : 153
## based on novel : 4
## 1940s|child hero|fantasy world|orphan|reference to peter pan : 3
## alien friendship|alien invasion|australia|flying car|mother daughter relationship: 3
## animal name in title|ape abducts a woman|gorilla|island|king kong : 3
## assistant|experiment|frankenstein|medical student|scientist : 3
## (Other) :4874
## movie_imdb_link
## http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1: 3
## (Other) :5025
## num_user_for_reviews language country content_rating
## Min. : 1.0 English :4704 USA :3807 R :2118
## 1st Qu.: 65.0 French : 73 UK : 448 PG-13 :1461
## Median : 156.0 Spanish : 40 France : 154 PG : 701
## Mean : 272.8 Hindi : 28 Canada : 126 : 303
## 3rd Qu.: 326.0 Mandarin: 26 Germany : 97 Not Rated: 116
## Max. :5060.0 German : 19 Australia: 55 G : 112
## NA's :21 (Other) : 153 (Other) : 356 (Other) : 232
## budget title_year actor_2_facebook_likes imdb_score
## Min. :2.180e+02 Min. :1916 Min. : 0 Min. :1.600
## 1st Qu.:6.000e+06 1st Qu.:1999 1st Qu.: 281 1st Qu.:5.800
## Median :2.000e+07 Median :2005 Median : 595 Median :6.600
## Mean :3.975e+07 Mean :2002 Mean : 1652 Mean :6.442
## 3rd Qu.:4.500e+07 3rd Qu.:2011 3rd Qu.: 918 3rd Qu.:7.200
## Max. :1.222e+10 Max. :2016 Max. :137000 Max. :9.500
## NA's :492 NA's :108 NA's :13
## aspect_ratio movie_facebook_likes
## Min. : 1.18 Min. : 0
## 1st Qu.: 1.85 1st Qu.: 0
## Median : 2.35 Median : 166
## Mean : 2.22 Mean : 7526
## 3rd Qu.: 2.35 3rd Qu.: 3000
## Max. :16.00 Max. :349000
## NA's :329
Load packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(shiny)
## Warning: package 'shiny' was built under R version 3.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## Warning: Installed Rcpp (0.12.7) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rpart)
## Warning: package 'rpart' was built under R version 3.3.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 3.3.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.3.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
Check missing values
sum(is.na(movie))
## [1] 2059
colSums(is.na(movie))
## color director_name
## 0 0
## num_critic_for_reviews duration
## 50 15
## director_facebook_likes actor_3_facebook_likes
## 104 23
## actor_2_name actor_1_facebook_likes
## 0 7
## gross genres
## 884 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 0 13
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 21 0
## country content_rating
## 0 0
## budget title_year
## 492 108
## actor_2_facebook_likes imdb_score
## 13 0
## aspect_ratio movie_facebook_likes
## 329 0
What proportion of the values in our data is missing?
mean(is.na(movie))
## [1] 0.01458174
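Since the proportion of missing values is small (about 1.5% of all cells), we delete the affected rows directly. Note that na.omit() drops a row if any of its fields is missing, so this removes 1,242 of the 5,043 rows.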
movie<- na.omit(movie)
sum(is.na(movie))
## [1] 0
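The gap between cell-level and row-level missingness can be checked before dropping anything. The sketch below uses a small made-up data frame (`toy` and its columns are illustrative assumptions, not part of the movie data) to show that `mean(is.na(...))` measures the share of missing cells, while `complete.cases()` measures the share of rows that `na.omit()` would remove.

```r
# Toy data frame with two missing cells spread over two different rows
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "w"),
  c = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Share of missing cells: 2 of 12
cell_share <- mean(is.na(toy))

# Share of rows with at least one missing cell: 2 of 4
row_share <- mean(!complete.cases(toy))

# na.omit() removes exactly the incomplete rows
kept <- na.omit(toy)

print(c(cell_share = cell_share, row_share = row_share, rows_kept = nrow(kept)))
```

Running the same two lines on `movie` before `na.omit()` would show how many rows the deletion costs.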
Now we have data without missing values. What does it look like?
dim(movie)
## [1] 3801 28
head(movie)
## color director_name num_critic_for_reviews duration
## 1 Color James Cameron 723 178
## 2 Color Gore Verbinski 302 169
## 3 Color Sam Mendes 602 148
## 4 Color Christopher Nolan 813 164
## 6 Color Andrew Stanton 462 132
## 7 Color Sam Raimi 392 156
## director_facebook_likes actor_3_facebook_likes actor_2_name
## 1 0 855 Joel David Moore
## 2 563 1000 Orlando Bloom
## 3 0 161 Rory Kinnear
## 4 22000 23000 Christian Bale
## 6 475 530 Samantha Morton
## 7 0 4000 James Franco
## actor_1_facebook_likes gross genres
## 1 1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2 40000 309404152 Action|Adventure|Fantasy
## 3 11000 200074175 Action|Adventure|Thriller
## 4 27000 448130642 Action|Thriller
## 6 640 73058679 Action|Adventure|Sci-Fi
## 7 24000 336530303 Action|Adventure|Romance
## actor_1_name movie_title
## 1 CCH Pounder Avatar聽
## 2 Johnny Depp Pirates of the Caribbean: At World's End聽
## 3 Christoph Waltz Spectre聽
## 4 Tom Hardy The Dark Knight Rises聽
## 6 Daryl Sabara John Carter聽
## 7 J.K. Simmons Spider-Man 3聽
## num_voted_users cast_total_facebook_likes actor_3_name
## 1 886204 4834 Wes Studi
## 2 471220 48350 Jack Davenport
## 3 275868 11700 Stephanie Sigman
## 4 1144337 106759 Joseph Gordon-Levitt
## 6 212204 1873 Polly Walker
## 7 383056 46055 Kirsten Dunst
## facenumber_in_poster
## 1 0
## 2 0
## 3 1
## 4 0
## 6 1
## 7 0
## plot_keywords
## 1 avatar|future|marine|native|paraplegic
## 2 goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3 bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 6 alien|american civil war|male nipple|mars|princess
## 7 sandman|spider man|symbiote|venom|villain
## movie_imdb_link
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
## 7 http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1
## num_user_for_reviews language country content_rating budget
## 1 3054 English USA PG-13 237000000
## 2 1238 English USA PG-13 300000000
## 3 994 English UK PG-13 245000000
## 4 2701 English USA PG-13 250000000
## 6 738 English USA PG-13 263700000
## 7 1902 English USA PG-13 258000000
## title_year actor_2_facebook_likes imdb_score aspect_ratio
## 1 2009 936 7.9 1.78
## 2 2007 5000 7.1 2.35
## 3 2015 393 6.8 2.35
## 4 2012 23000 8.5 2.35
## 6 2012 632 6.6 2.35
## 7 2007 11000 6.2 2.35
## movie_facebook_likes
## 1 33000
## 2 0
## 3 85000
## 4 164000
## 6 24000
## 7 0
str(movie)
## 'data.frame': 3801 obs. of 28 variables:
## $ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ director_name : Factor w/ 2399 levels "","A. Raven Cruz",..: 923 796 2023 375 101 2026 1648 1220 550 2390 ...
## $ num_critic_for_reviews : int 723 302 602 813 462 392 324 635 375 673 ...
## $ duration : int 178 169 148 164 132 156 100 141 153 183 ...
## $ director_facebook_likes : int 0 563 0 22000 475 0 15 0 282 0 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 530 4000 284 19000 10000 2000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1407 2218 2488 533 2548 1227 801 2439 653 1703 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 640 24000 799 26000 25000 15000 ...
## $ gross : int 760505847 309404152 200074175 448130642 73058679 336530303 200807262 458991599 301956980 330249062 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 126 120 308 126 447 126 ...
## $ actor_1_name : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 301 978 351 1965 439 784 218 334 32 738 ...
## $ movie_title : Factor w/ 4917 levels "#Horror聽","[Rec] 2聽",..: 397 2729 3278 3706 1960 3288 3458 398 1630 460 ...
## $ num_voted_users : int 886204 471220 275868 1144337 212204 383056 294810 462669 321795 371639 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 1873 46055 2036 92000 58753 24450 ...
## $ actor_3_name : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3439 1390 3131 1764 2711 1967 2160 3015 2938 55 ...
## $ facenumber_in_poster : int 0 0 1 0 1 0 1 4 3 0 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 651 4745 29 1142 2005 1564 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 2476 2526 2458 4546 2551 4690 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 738 1902 387 1117 973 3018 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 65 65 65 65 63 65 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 10 10 9 10 9 10 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.64e+08 ...
## $ title_year : int 2009 2007 2015 2012 2012 2007 2010 2015 2009 2016 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 632 11000 553 21000 11000 4000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1242] 5 56 85 99 100 178 200 205 207 243 ...
## .. ..- attr(*, "names")= chr [1:1242] "5" "56" "85" "99" ...
summary(movie)
## color director_name num_critic_for_reviews
## : 1 Steven Spielberg : 25 Min. : 1.0
## Black and White: 129 Clint Eastwood : 19 1st Qu.: 75.0
## Color :3671 Woody Allen : 19 Median :137.0
## Ridley Scott : 17 Mean :165.8
## Martin Scorsese : 16 3rd Qu.:223.0
## Steven Soderbergh: 16 Max. :813.0
## (Other) :3689
## duration director_facebook_likes actor_3_facebook_likes
## Min. : 37.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 96.0 1st Qu.: 11.0 1st Qu.: 187.0
## Median :106.0 Median : 62.0 Median : 433.0
## Mean :110.2 Mean : 798.6 Mean : 763.8
## 3rd Qu.:120.0 3rd Qu.: 234.0 3rd Qu.: 690.0
## Max. :330.0 Max. :23000.0 Max. :23000.0
##
## actor_2_name actor_1_facebook_likes gross
## Morgan Freeman : 20 Min. : 0 Min. : 162
## Brad Pitt : 14 1st Qu.: 736 1st Qu.: 7689458
## Charlize Theron: 14 Median : 1000 Median : 29200000
## James Franco : 11 Mean : 7673 Mean : 52005759
## Jason Flemyng : 10 3rd Qu.: 13000 3rd Qu.: 66466372
## Meryl Streep : 10 Max. :640000 Max. :760505847
## (Other) :3722
## genres actor_1_name
## Drama : 150 Robert De Niro : 42
## Comedy|Drama|Romance: 148 Johnny Depp : 39
## Comedy|Drama : 142 J.K. Simmons : 31
## Comedy : 138 Nicolas Cage : 31
## Comedy|Romance : 131 Denzel Washington: 30
## Drama|Romance : 116 Bruce Willis : 29
## (Other) :2976 (Other) :3599
## movie_title num_voted_users
## Halloween聽 : 3 Min. : 5
## Home聽 : 3 1st Qu.: 18915
## King Kong聽 : 3 Median : 53028
## Pan聽 : 3 Mean : 104677
## The Fast and the Furious聽: 3 3rd Qu.: 126916
## Victor Frankenstein聽 : 3 Max. :1689764
## (Other) :3783
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## Min. : 0 Steve Coogan : 8 Min. : 0.000
## 1st Qu.: 1865 Anne Hathaway : 7 1st Qu.: 0.000
## Median : 3969 Ben Mendelsohn: 7 Median : 1.000
## Mean : 11415 Kirsten Dunst : 7 Mean : 1.379
## 3rd Qu.: 16143 Robert Duvall : 7 3rd Qu.: 2.000
## Max. :656730 Bruce McGill : 6 Max. :43.000
## (Other) :3759
## plot_keywords
## : 17
## 1940s|child hero|fantasy world|orphan|reference to peter pan : 3
## alien friendship|alien invasion|australia|flying car|mother daughter relationship: 3
## animal name in title|ape abducts a woman|gorilla|island|king kong : 3
## assistant|experiment|frankenstein|medical student|scientist : 3
## eighteen wheeler|illegal street racing|truck|trucker|undercover cop : 3
## (Other) :3769
## movie_imdb_link
## http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt3332064/?ref_=fn_tt_tt_1: 3
## (Other) :3783
## num_user_for_reviews language country content_rating
## Min. : 1.0 English :3626 USA :3005 R :1709
## 1st Qu.: 107.0 French : 36 UK : 322 PG-13 :1313
## Median : 208.0 Spanish : 23 France : 103 PG : 567
## Mean : 333.6 Mandarin: 15 Germany : 82 G : 87
## 3rd Qu.: 397.0 German : 13 Canada : 62 Not Rated: 34
## Max. :5060.0 Japanese: 12 Australia: 40 : 29
## (Other) : 76 (Other) : 187 (Other) : 62
## budget title_year actor_2_facebook_likes imdb_score
## Min. :2.180e+02 Min. :1920 Min. : 0 Min. :1.600
## 1st Qu.:1.000e+07 1st Qu.:1999 1st Qu.: 374 1st Qu.:5.900
## Median :2.500e+07 Median :2005 Median : 680 Median :6.600
## Mean :4.585e+07 Mean :2003 Mean : 2003 Mean :6.466
## 3rd Qu.:5.000e+07 3rd Qu.:2010 3rd Qu.: 975 3rd Qu.:7.200
## Max. :1.222e+10 Max. :2016 Max. :137000 Max. :9.300
##
## aspect_ratio movie_facebook_likes
## Min. : 1.18 Min. : 0
## 1st Qu.: 1.85 1st Qu.: 0
## Median : 2.35 Median : 224
## Mean : 2.11 Mean : 9262
## 3rd Qu.: 2.35 3rd Qu.: 11000
## Max. :16.00 Max. :349000
##
Looking at the variable movie_title, I find that every value ends with a garbled character, so I remove the last character of each title.
movie$movie_title<-gsub('.$', '', movie$movie_title)
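The pattern `'.$'` removes the last character of every title, whatever it is. As a hedged alternative, assuming the garbled character is a trailing non-breaking space (U+00A0), which is a guess based on how it renders and not something the data confirms, the sketch below strips only that character and leaves clean titles untouched. The vector `titles` is made up for illustration.

```r
# Illustrative titles: two end with a non-breaking space, one is clean
titles <- c("Avatar\u00a0", "Spectre\u00a0", "Up")

# Approach used in this project: drop the last character unconditionally
drop_last <- gsub(".$", "", titles)

# Alternative sketch: drop only trailing non-breaking spaces
strip_nbsp <- sub("\u00a0+$", "", titles)

print(drop_last)
print(strip_nbsp)
```

The unconditional version shortens the clean title as well, which is harmless here only because every title in the data carries the trailing character.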
Step 2: Explore data
I create several histograms of duration, num_user_for_reviews, imdb_score and director_facebook_likes to understand their distributions
options(repr.plot.width=6, repr.plot.height=4)
g1<-ggplot(movie,aes(x=duration))+geom_histogram(binwidth=5,aes(y=..density..),fill="green4")
g2<-ggplot(movie,aes(x=num_user_for_reviews))+geom_histogram(binwidth=50,aes(y=..density..),fill="red")
g3<-ggplot(movie,aes(x=imdb_score))+geom_histogram(binwidth=1,aes(y=..count..),fill="green4")
g4<-ggplot(movie,aes(x=director_facebook_likes))+geom_histogram(binwidth=5,aes(y=..count..),fill="red")
grid.arrange(g1,g2,g3,g4,nrow=2,ncol=2)
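Two of these histograms plot density and two plot counts, so their y-axes are on different scales. As a small sketch with made-up numbers (the vector `x` and the breaks are illustrative assumptions), base R's `hist()` shows how the two relate: the density of a bin is its count divided by the number of observations times the bin width.

```r
# Illustrative values and fixed bin edges of width 2
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

# Density computed by hand from the counts
bin_width <- diff(h$breaks)
manual_density <- h$counts / (length(x) * bin_width)

# Bar areas of a density histogram sum to one
total_area <- sum(h$density * bin_width)

print(rbind(counts = h$counts, density = h$density))
```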
Since our goal is to study the IMDb score, we create a Shiny application to view the histogram of imdb_score with different numbers of bins
# Define UI for application that draws a histogram
ui <- shinyUI(fluidPage(
  # Application title
  titlePanel("IMDB SCORE DATA"),
  # Sidebar with a slider input for number of bins
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins",
                  "Number of bins:",
                  min = 1,
                  max = 50,
                  value = 30)
    ),
    # Show a plot of the generated distribution
    mainPanel(
      plotOutput("distPlot")
    )
  )
))
# Define server logic required to draw a histogram
server <- shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    # generate bins based on input$bins from ui.R
    movie_score <- movie$imdb_score
    bins <- seq(min(movie_score), max(movie_score), length.out = input$bins + 1)
    # draw the histogram with the specified number of bins
    hist(movie_score, breaks = bins, col = 'darkgray', border = 'white')
  })
})
# Run the application
shinyApp(ui = ui, server = server)
From the distributions, there are basically no outliers.
Next, we look at the distribution of the categorical variables, starting with a histogram of movie languages
ggplot(movie,aes(x=language,fill=language))+
geom_histogram(stat="count",aes(y=..count../sum(..count..)),binwidth=1)+
theme(axis.text.x=element_text(angle=90,hjust=0.5,vjust=0),legend.position="none")+
labs(y="Percent",title="Percentage of Movie Languages")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
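The percentages behind this plot can also be tabulated directly, which gives exact numbers to read alongside the bars. The sketch below is a minimal base R example on a made-up vector (`language` here is illustrative, not the movie column); `prop.table(table(...))` returns each level's share of the total.

```r
# Illustrative language labels
language <- c("English", "English", "English", "French", "Spanish", "English")

# Share of each language, largest first
share <- prop.table(table(language))
share_sorted <- sort(share, decreasing = TRUE)

# Shares among the non-English entries only
non_english <- language[language != "English"]
share_non_english <- prop.table(table(non_english))

print(round(share_sorted, 3))
print(round(share_non_english, 3))
```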
English is the main language of the movies, but how are the non-English movies distributed? How many movies are there in each of the other languages?
movie %>% filter(!language == "English") %>%
ggplot(aes(x=language,fill=language))+
geom_histogram(stat="count",aes(y=..count../sum(..count..)),binwidth=1)+
theme(axis.text.x=element_text(angle=90,hjust=0.5,vjust=0),legend.position="none")+
labs(y="Percent",title="Percentage of Movie Languages other than English")
## Warning: package 'bindrcpp' was built under R version 3.3.3
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Apart from English, French and Spanish movies make up the largest shares.
To understand how many movies each country produces every year, I create a data frame of year and country with their movie counts, keeping only country-year pairs with at least five movies.
year_movies<-movie %>%
group_by(title_year,country) %>%
summarize(movie_count=n())%>%
filter(movie_count>=5)
head(year_movies)
## # A tibble: 6 x 3
## # Groups: title_year [6]
## title_year country movie_count
## <int> <fctr> <int>
## 1 1974 USA 6
## 2 1977 USA 5
## 3 1978 USA 8
## 4 1980 USA 13
## 5 1981 USA 9
## 6 1982 USA 16
Plot the number of movies over time, colored by country
options(repr.plot.width=4, repr.plot.height=4)
ggplot(year_movies,aes(x=title_year,y=movie_count,color=country))+
geom_line(size =2)+xlab("Year")+ylab("Num of Movies")+theme_classic()
The USA's movie output increased sharply in the 1990s and is higher than that of any other country.
Step 3: Answer question 1
Who are the top ten directors with the best movie reviews? Calculate the average score of each director and choose the top ten. (top_n() keeps ties at the cutoff, so the result below has 11 rows.)
best_director<-movie%>%group_by(director_name)%>%
summarise(mean= mean(imdb_score))%>%
top_n(10)%>%
arrange(desc(mean))
## Selecting by mean
best_director
## # A tibble: 11 x 2
## director_name mean
## <fctr> <dbl>
## 1 Charles Chaplin 8.600000
## 2 Tony Kaye 8.600000
## 3 Alfred Hitchcock 8.500000
## 4 Damien Chazelle 8.500000
## 5 Majid Majidi 8.500000
## 6 Ron Fricke 8.500000
## 7 Sergio Leone 8.433333
## 8 Christopher Nolan 8.425000
## 9 Asghar Farhadi 8.400000
## 10 Richard Marquand 8.400000
## 11 S.S. Rajamouli 8.400000
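An average over a single movie ranks the same as an average over many, so a director with one highly rated film can top the list. The sketch below, on made-up data (`toy`, `mean_score`, `n_movies` and the threshold of two movies are all illustrative assumptions), shows both the tie behavior of `top_n()` and one possible variant that also counts movies per director and filters on that count.

```r
library(dplyr)

# Illustrative scores: director A has three movies, the others one each
toy <- data.frame(
  director = c("A", "A", "A", "B", "C", "D"),
  score    = c(8.0, 7.0, 9.0, 8.5, 8.5, 6.0),
  stringsAsFactors = FALSE
)

by_director <- toy %>%
  group_by(director) %>%
  summarise(mean_score = mean(score), n_movies = n())

# top_n() keeps ties: asking for 1 row returns both B and C
top_with_ties <- by_director %>% top_n(1, mean_score)

# Variant: require at least two movies before ranking
reliable <- by_director %>%
  filter(n_movies >= 2) %>%
  arrange(desc(mean_score))

print(top_with_ties)
print(reliable)
```

The minimum movie count is a judgment call; a higher threshold gives steadier averages but excludes more directors.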
Plot the top 10 directors with their average scores
ggplot(best_director, aes(x = director_name, y = mean, alpha = mean))+
geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Best 10 Directors", y = "Average Imdb Score") +
ggtitle("Top 10 directors with average score")+coord_flip(ylim=c(8,8.75))
What are the top genres with the best movie reviews? Calculate the average score of each genre combination and choose the top 3
best_category<-movie%>%
group_by(genres)%>%
summarise(mean= mean(imdb_score))%>%
top_n(3)%>%
arrange(desc(mean))
## Selecting by mean
best_category
## # A tibble: 5 x 2
## genres mean
## <fctr> <dbl>
## 1 Adventure|Animation|Drama|Family|Musical 8.5
## 2 Crime|Drama|Fantasy|Mystery 8.5
## 3 Action|Adventure|Drama|Fantasy|War 8.4
## 4 Adventure|Animation|Fantasy 8.4
## 5 Adventure|Drama|Thriller|War 8.4
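Because `genres` stores pipe-separated combinations, grouping on it ranks combinations such as "Crime|Drama|Fantasy|Mystery", not individual genres. As a hedged sketch on made-up rows (`toy` and its values are illustrative), the combinations can be split with `strsplit()` so that each movie contributes its score to every genre it lists.

```r
# Illustrative rows in the same pipe-separated format as the genres column
toy <- data.frame(
  genres     = c("Action|Drama", "Drama", "Comedy|Drama", "Action"),
  imdb_score = c(7, 8, 6, 5),
  stringsAsFactors = FALSE
)

# Split each combination into its individual genres
parts <- strsplit(as.character(toy$genres), "|", fixed = TRUE)

# One row per (movie, genre) pair, repeating the movie's score
long <- data.frame(
  genre      = unlist(parts),
  imdb_score = rep(toy$imdb_score, lengths(parts)),
  stringsAsFactors = FALSE
)

# Average score per individual genre, highest first
genre_mean <- aggregate(imdb_score ~ genre, data = long, FUN = mean)
genre_mean <- genre_mean[order(-genre_mean$imdb_score), ]

print(genre_mean)
```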
Plot the top 3 genres with their average scores
ggplot(best_category, aes(x = genres, y = mean, alpha = mean))+
geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top movie genre", y = "Average Imdb Score") +
ggtitle("Top movie genre with average score")+coord_flip(ylim=c(8.3,8.6))
Who are the top actors with the best movie reviews? Calculate the average score of each actor and choose the top ten for each of the three actor columns
best_actor<-movie%>%
group_by(actor_1_name)%>%
summarise(mean= mean(imdb_score))%>%
top_n(10)%>%
arrange(desc(mean))
## Selecting by mean
best_actor
## # A tibble: 14 x 2
## actor_1_name mean
## <fctr> <dbl>
## 1 Scatman Crothers 8.7
## 2 Takashi Shimura 8.7
## 3 Bunta Sugawara 8.6
## 4 Paulette Goddard 8.6
## 5 Bahare Seddiqi 8.5
## 6 Collin Alfredo St. Dic 8.5
## 7 Emilia Fox 8.5
## 8 Janet Leigh 8.5
## 9 Claude Rains 8.4
## 10 Jürgen Prochnow 8.4
## 11 Mathieu Kassovitz 8.4
## 12 Mhairi Calvey 8.4
## 13 Shahab Hosseini 8.4
## 14 Tamannaah Bhatia 8.4
best_actor2<-movie%>%
group_by(actor_2_name)%>%
summarise(mean= mean(imdb_score))%>%
top_n(10)%>%
arrange(desc(mean))
## Selecting by mean
best_actor2
## # A tibble: 10 x 2
## actor_2_name mean
## <fctr> <dbl>
## 1 Jeffrey DeMunn 8.9
## 2 Luigi Pistilli 8.9
## 3 Kenny Baker 8.8
## 4 Marcus Chong 8.7
## 5 Michael Berryman 8.7
## 6 Minoru Chiaki 8.7
## 7 Peter Cushing 8.7
## 8 Seu Jorge 8.7
## 9 Ryûnosuke Kamiki 8.6
## 10 Stanley Blystone 8.6
best_actor3<-movie%>%
group_by(actor_3_name)%>%
summarise(mean= mean(imdb_score))%>%
top_n(10)%>%
arrange(desc(mean))
## Selecting by mean
best_actor3
## # A tibble: 11 x 2
## actor_3_name mean
## <fctr> <dbl>
## 1 Caroline Goodall 8.90
## 2 Enzo Petito 8.90
## 3 Phil LaMarr 8.90
## 4 Anthony Daniels 8.80
## 5 Eugenie Bondurant 8.80
## 6 Sam Anderson 8.80
## 7 Billy Boyd 8.75
## 8 Alexandre Rodrigues 8.70
## 9 Gloria Foster 8.70
## 10 Kamatari Fujiwara 8.70
## 11 Louise Fletcher 8.70
Plot the top actors with their average scores
actor1<-ggplot(best_actor, aes(x = actor_1_name, y = mean, alpha = mean))+
geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top actor1", y = "Average Imdb Score") +
ggtitle("Top actor1 with average score")+coord_flip(ylim=c(8.0,9.0))
actor2<-ggplot(best_actor2, aes(x = actor_2_name, y = mean, alpha = mean))+
geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top actor2", y = "Average Imdb Score") +
ggtitle("Top actor2 with average score")+coord_flip(ylim=c(8.0,9.0))
actor3<-ggplot(best_actor3, aes(x = actor_3_name, y = mean, alpha = mean))+
geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top actor3", y = "Average Imdb Score") +
ggtitle("Top actor3 with average score")+coord_flip(ylim=c(8.0,9.0))
grid.arrange(actor1,actor2,actor3)
Which countries' movies tend to obtain high scores?
best_moviecountry<-movie%>%
group_by(country)%>%
summarise(mean= mean(imdb_score))%>%
top_n(10)%>%
arrange(desc(mean))
## Selecting by mean
best_moviecountry
## # A tibble: 10 x 2
## country mean
## <fctr> <dbl>
## 1 West Germany 8.400000
## 2 Israel 8.000000
## 3 Brazil 7.760000
## 4 Iran 7.725000
## 5 Argentina 7.600000
## 6 Indonesia 7.600000
## 7 Sweden 7.600000
## 8 Netherlands 7.566667
## 9 Colombia 7.500000
## 10 New Zealand 7.481818
moviecountry<-ggplot(best_moviecountry, aes(x = country, y = mean, alpha = mean))+
geom_bar(stat = "identity",fill = "slateblue") + labs(x = "Top 10 movie country", y = "Average Imdb Score") +
ggtitle("Top movie country with average score")+coord_flip(ylim=c(7.0,9.0))
moviecountry
Step 4: Answer question 2
What are the relationships between variables?
Let's look at correlations. Select the numeric variables
numeric_col <- sapply(movie, is.numeric)
movie_numeric<- movie[,numeric_col]
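A full correlation matrix is long to read when only one target matters. The sketch below, on a made-up data frame (`toy`, `score`, `votes` and `faces` are illustrative names, not columns of the movie data), selects the numeric columns the same way and then pulls out the target's row of the matrix, sorted from strongest positive to strongest negative correlation.

```r
# Illustrative data: votes rises with score, faces mostly falls
toy <- data.frame(
  score = c(1, 2, 3, 4, 5),
  votes = c(2, 4, 6, 8, 10),
  faces = c(5, 3, 4, 1, 2),
  title = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Keep numeric columns only, as done for the movie data
toy_numeric <- toy[, sapply(toy, is.numeric)]

# Pearson correlation matrix
cm <- cor(toy_numeric)

# Correlation of every other variable with the target, sorted
with_score <- sort(cm["score", colnames(cm) != "score"], decreasing = TRUE)

print(round(with_score, 2))
```

Applied to `movie_numeric` with `"imdb_score"` as the target, the same indexing would list the variables most strongly correlated with the score.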
Create the correlation matrix
Correlation<-cor(movie_numeric)
corrplot(Correlation, method = "color")
Correlation
## num_critic_for_reviews duration
## num_critic_for_reviews 1.00000000 0.22770510
## duration 0.22770510 1.00000000
## director_facebook_likes 0.17691581 0.17973400
## actor_3_facebook_likes 0.25508576 0.12577123
## actor_1_facebook_likes 0.17019757 0.08471988
## gross 0.46853499 0.24474304
## num_voted_users 0.59498973 0.33803814
## cast_total_facebook_likes 0.24100544 0.12117142
## facenumber_in_poster -0.03400866 0.02909973
## num_user_for_reviews 0.56679514 0.35039147
## budget 0.10568105 0.06816122
## title_year 0.41038049 -0.12942203
## actor_2_facebook_likes 0.25583672 0.12945227
## imdb_score 0.34388077 0.36612369
## aspect_ratio 0.18064082 0.15311429
## movie_facebook_likes 0.70396936 0.21493610
## director_facebook_likes actor_3_facebook_likes
## num_critic_for_reviews 0.17691581 0.25508576
## duration 0.17973400 0.12577123
## director_facebook_likes 1.00000000 0.11824025
## actor_3_facebook_likes 0.11824025 1.00000000
## actor_1_facebook_likes 0.09073261 0.25372024
## gross 0.13993814 0.30158391
## num_voted_users 0.30061915 0.26945536
## cast_total_facebook_likes 0.11974122 0.49068631
## facenumber_in_poster -0.04761895 0.10501768
## num_user_for_reviews 0.21831138 0.20732096
## budget 0.01855931 0.04047813
## title_year -0.04460636 0.11553537
## actor_2_facebook_likes 0.11690032 0.55418237
## imdb_score 0.19083814 0.06497354
## aspect_ratio 0.03787106 0.04712336
## movie_facebook_likes 0.16273728 0.27251268
## actor_1_facebook_likes gross
## num_critic_for_reviews 0.17019757 0.46853499
## duration 0.08471988 0.24474304
## director_facebook_likes 0.09073261 0.13993814
## actor_3_facebook_likes 0.25372024 0.30158391
## actor_1_facebook_likes 1.00000000 0.14704475
## gross 0.14704475 1.00000000
## num_voted_users 0.18226526 0.62694784
## cast_total_facebook_likes 0.94492526 0.23868703
## facenumber_in_poster 0.05757968 -0.03225370
## num_user_for_reviews 0.12522139 0.54710674
## budget 0.01708638 0.10038914
## title_year 0.09374233 0.05236800
## actor_2_facebook_likes 0.39267587 0.25465945
## imdb_score 0.09313142 0.21212439
## aspect_ratio 0.05760375 0.06526004
## movie_facebook_likes 0.13177824 0.36849402
## num_voted_users cast_total_facebook_likes
## num_critic_for_reviews 0.59498973 0.24100544
## duration 0.33803814 0.12117142
## director_facebook_likes 0.30061915 0.11974122
## actor_3_facebook_likes 0.26945536 0.49068631
## actor_1_facebook_likes 0.18226526 0.94492526
## gross 0.62694784 0.23868703
## num_voted_users 1.00000000 0.25194009
## cast_total_facebook_likes 0.25194009 1.00000000
## facenumber_in_poster -0.03202642 0.08098495
## num_user_for_reviews 0.77992455 0.18228784
## budget 0.06682395 0.02942336
## title_year 0.02193838 0.12401462
## actor_2_facebook_likes 0.24666028 0.64401612
## imdb_score 0.47791732 0.10625870
## aspect_ratio 0.08548456 0.06967465
## movie_facebook_likes 0.51869065 0.20706080
## facenumber_in_poster num_user_for_reviews
## num_critic_for_reviews -0.03400866 0.56679514
## duration 0.02909973 0.35039147
## director_facebook_likes -0.04761895 0.21831138
## actor_3_facebook_likes 0.10501768 0.20732096
## actor_1_facebook_likes 0.05757968 0.12522139
## gross -0.03225370 0.54710674
## num_voted_users -0.03202642 0.77992455
## cast_total_facebook_likes 0.08098495 0.18228784
## facenumber_in_poster 1.00000000 -0.07940360
## num_user_for_reviews -0.07940360 1.00000000
## budget -0.02175723 0.07125387
## title_year 0.06795245 0.01759409
## actor_2_facebook_likes 0.07413806 0.18958182
## imdb_score -0.06429247 0.32252237
## aspect_ratio 0.01662043 0.09855669
## movie_facebook_likes 0.01433235 0.37197029
## budget title_year actor_2_facebook_likes
## num_critic_for_reviews 0.10568105 0.41038049 0.25583672
## duration 0.06816122 -0.12942203 0.12945227
## director_facebook_likes 0.01855931 -0.04460636 0.11690032
## actor_3_facebook_likes 0.04047813 0.11553537 0.55418237
## actor_1_facebook_likes 0.01708638 0.09374233 0.39267587
## gross 0.10038914 0.05236800 0.25465945
## num_voted_users 0.06682395 0.02193838 0.24666028
## cast_total_facebook_likes 0.02942336 0.12401462 0.64401612
## facenumber_in_poster -0.02175723 0.06795245 0.07413806
## num_user_for_reviews 0.07125387 0.01759409 0.18958182
## budget 1.00000000 0.04629319 0.03621089
## title_year 0.04629319 1.00000000 0.11973855
## actor_2_facebook_likes 0.03621089 0.11973855 1.00000000
## imdb_score 0.02904057 -0.12926516 0.10206038
## aspect_ratio 0.02579646 0.21977924 0.06421530
## movie_facebook_likes 0.05303510 0.30283494 0.23363209
## imdb_score aspect_ratio movie_facebook_likes
## num_critic_for_reviews 0.34388077 0.18064082 0.70396936
## duration 0.36612369 0.15311429 0.21493610
## director_facebook_likes 0.19083814 0.03787106 0.16273728
## actor_3_facebook_likes 0.06497354 0.04712336 0.27251268
## actor_1_facebook_likes 0.09313142 0.05760375 0.13177824
## gross 0.21212439 0.06526004 0.36849402
## num_voted_users 0.47791732 0.08548456 0.51869065
## cast_total_facebook_likes 0.10625870 0.06967465 0.20706080
## facenumber_in_poster -0.06429247 0.01662043 0.01433235
## num_user_for_reviews 0.32252237 0.09855669 0.37197029
## budget 0.02904057 0.02579646 0.05303510
## title_year -0.12926516 0.21977924 0.30283494
## actor_2_facebook_likes 0.10206038 0.06421530 0.23363209
## imdb_score 1.00000000 0.02845372 0.27947774
## aspect_ratio 0.02845372 1.00000000 0.11031824
## movie_facebook_likes 0.27947774 0.11031824 1.00000000
We notice that some pairs of variables are strongly positively correlated, such as actor_1_facebook_likes and cast_total_facebook_likes (0.945), and movie_facebook_likes and num_critic_for_reviews (0.704).
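Strongly correlated pairs can also be pulled out of a correlation matrix programmatically instead of by eye. The sketch below uses toy data and an assumed cutoff of 0.7; in the analysis the input would be movie_numeric. The names `toy`, `cor_mat` and `high_pairs` are hypothetical.

```r
# Toy numeric data (hypothetical); stands in for movie_numeric
set.seed(1)
x <- rnorm(200)
toy <- data.frame(
  a = x,
  b = x + rnorm(200, sd = 0.2),  # built to correlate strongly with a
  c = rnorm(200)                 # independent noise
)

cor_mat <- cor(toy)

# Look at the upper triangle only, so each pair is reported once
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.7, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

Applied to the matrix printed above, a 0.7 cutoff would flag the two pairs already mentioned and also num_voted_users with num_user_for_reviews (0.780).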
We create scatter plots to understand these relationships:
# Red line: linear fit; second curve: default smoother
ggplot(movie, aes(x =actor_1_facebook_likes, y =cast_total_facebook_likes))+
geom_point(size=2) +
stat_smooth(method = lm, se = FALSE, color = "red")+geom_smooth()+
labs(title = "actor1 facebook likes Vs. cast_total_facebook_likes",
x = "actor1 facebook likes", y = "cast_total_facebook_likes")+ ggtitle(paste("cor:", 0.945))
## `geom_smooth()` using method = 'gam'
# Red line: linear fit; second curve: default smoother
ggplot(movie, aes(x =movie_facebook_likes, y =num_critic_for_reviews))+
geom_point(size=2) +
stat_smooth(method = lm, se = FALSE, color = "red")+geom_smooth()+
labs(title = "Movie Facebook likes Vs. number of critics for reviews",
x = "Movie Facebook likes", y = "Number of critics reviews")+ ggtitle(paste("cor:", 0.704))
## `geom_smooth()` using method = 'gam'
Our main goal is to predict the IMDB score, so we need to identify its determining variables.
From the correlation matrix, the variables most correlated with imdb_score are num_voted_users, duration, num_critic_for_reviews, num_user_for_reviews, movie_facebook_likes and gross. We treat these as the determining variables.
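The selection can be reproduced from the imdb_score column of the correlation matrix. The sketch below copies those correlations (rounded to three decimals) and keeps the variables whose absolute correlation exceeds a cutoff. The cutoff of 0.2 is an assumption chosen for illustration; the original analysis does not state one.

```r
# Correlations with imdb_score, copied (rounded) from the matrix above
r_score <- c(
  num_critic_for_reviews    =  0.344,
  duration                  =  0.366,
  director_facebook_likes   =  0.191,
  actor_3_facebook_likes    =  0.065,
  actor_1_facebook_likes    =  0.093,
  gross                     =  0.212,
  num_voted_users           =  0.478,
  cast_total_facebook_likes =  0.106,
  facenumber_in_poster      = -0.064,
  num_user_for_reviews      =  0.323,
  budget                    =  0.029,
  title_year                = -0.129,
  actor_2_facebook_likes    =  0.102,
  aspect_ratio              =  0.028,
  movie_facebook_likes      =  0.279
)

cutoff <- 0.2  # assumed threshold, chosen for illustration

# Sort by absolute correlation and keep the variables above the cutoff
ranked   <- r_score[order(abs(r_score), decreasing = TRUE)]
selected <- names(ranked)[abs(ranked) > cutoff]
selected
```

With this cutoff the result is exactly the six variables listed, ordered from num_voted_users (strongest) to gross (weakest). Even the strongest correlation is below 0.5, so no single variable explains the score well on its own.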
Plotting duration against IMDB score to understand their relationship. We color the points by the number of user reviews, which we first need to categorize:
movie$num_user_reviews<-cut(movie$num_user_for_reviews, breaks = c(0,107,208,333,397,5100), labels = c("very few","few","middle","high","very high"))
summary(movie$num_user_reviews)
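To see how cut() assigns categories with these breaks, here is a small sketch on hypothetical review counts (the `counts` vector is made up for illustration). By default the intervals are open on the left and closed on the right, so 107 falls into "very few" and a count of exactly 0 falls outside the first interval.

```r
breaks <- c(0, 107, 208, 333, 397, 5100)
labels <- c("very few", "few", "middle", "high", "very high")

# Hypothetical review counts, chosen to hit each bin and the edge cases
counts <- c(0, 50, 107, 108, 250, 350, 1000)

binned <- cut(counts, breaks = breaks, labels = labels)
data.frame(counts, binned)

# Count per category, including NA for values outside the breaks
table(binned, useNA = "ifany")
```

If zero counts should be kept, passing include.lowest = TRUE to cut() puts them in the first category instead of NA.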
ggplot(movie, aes(x =duration, y =imdb_score))+
geom_point(size=2, aes(colour=num_user_reviews)) +
labs(title = "Duration Vs. IMDB Score and Number of User Reviews",
x = "Duration", y = "IMDB Score")
Plotting num_voted_users against IMDB score to understand their relationship:
ggplot(movie, aes(x =num_voted_users, y =imdb_score))+
geom_point()+
labs(title = "Number of Voted Users Vs. IMDB Score",
x = "Number of voted users", y = "IMDB Score")
Plotting num_user_for_reviews against IMDB score to understand their relationship:
ggplot(movie, aes(x =num_user_for_reviews, y =imdb_score))+
geom_point()+
labs(title = "Number of User Reviews Vs. IMDB Score",
x = "Number of user reviews", y = "IMDB Score")
Step 5: Answer Question 3
How do the determining variables affect the score? We build models to predict the score.
To understand how the determining variables affect the score, we first select them:
movies_importat_variables = movie[, c("imdb_score",
"num_critic_for_reviews",
"num_user_for_reviews",
"num_voted_users",
"duration",
"movie_facebook_likes",
"gross")]
We split the data into a training set (90%) and a test set (10%):
set.seed(2)
train <- sample(dim(movies_importat_variables)[1],dim(movies_importat_variables)[1]*0.9)
movie_train <- movies_importat_variables[train,]
movie_test <- movies_importat_variables[-train,]
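The split relies on sample() drawing row numbers without replacement and on negative indexing to take the remaining rows. A minimal sketch on a toy data frame (hypothetical names `toy`, `train_idx`) shows the same pattern with two sanity checks:

```r
set.seed(2)

# Toy data frame standing in for movies_importat_variables
toy <- data.frame(id = 1:100, y = rnorm(100))

n         <- nrow(toy)
train_idx <- sample(n, round(n * 0.9))  # row numbers, drawn without replacement

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]          # negative index = all remaining rows

# Sanity checks: sizes add up and no row appears in both sets
nrow(toy_train) + nrow(toy_test) == n
length(intersect(toy_train$id, toy_test$id)) == 0
```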
We fit a linear model:
lmfit = lm(imdb_score~.,data=movie_train)
summary(lmfit)
##
## Call:
## lm(formula = imdb_score ~ ., data = movie_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5361 -0.4889 0.0760 0.6059 2.4958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.850e+00 7.906e-02 61.348 < 2e-16 ***
## num_critic_for_reviews 1.443e-03 1.950e-04 7.397 1.74e-13 ***
## num_user_for_reviews -5.321e-04 6.191e-05 -8.596 < 2e-16 ***
## num_voted_users 4.094e-06 1.815e-07 22.559 < 2e-16 ***
## duration 1.174e-02 7.075e-04 16.594 < 2e-16 ***
## movie_facebook_likes -2.921e-06 1.025e-06 -2.851 0.00439 **
## gross -2.532e-09 2.798e-10 -9.050 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8755 on 3413 degrees of freedom
## Multiple R-squared: 0.3123, Adjusted R-squared: 0.3111
## F-statistic: 258.3 on 6 and 3413 DF, p-value: < 2.2e-16
plot(lmfit)
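The raw coefficients are hard to read because the predictors are on very different scales. The sketch below rescales the coefficients from the summary above to more meaningful increments. The increments themselves (100 reviews, 100,000 votes, 10 minutes, and so on) are assumptions chosen for illustration.

```r
# Coefficients copied from summary(lmfit) above
coefs <- c(
  num_critic_for_reviews =  1.443e-03,
  num_user_for_reviews   = -5.321e-04,
  num_voted_users        =  4.094e-06,
  duration               =  1.174e-02,
  movie_facebook_likes   = -2.921e-06,
  gross                  = -2.532e-09
)

# Assumed increments per variable (illustrative choices)
increments <- c(
  num_critic_for_reviews = 100,  # 100 more critic reviews
  num_user_for_reviews   = 100,  # 100 more user reviews
  num_voted_users        = 1e5,  # 100,000 more votes
  duration               = 10,   # 10 more minutes
  movie_facebook_likes   = 1e4,  # 10,000 more likes
  gross                  = 1e8   # 100 million more in gross
)

# Expected change in imdb_score, holding the other predictors fixed
score_change <- coefs * increments[names(coefs)]
round(score_change, 3)
```

Under this model, 100,000 extra votes go with roughly +0.41 points and 10 extra minutes with roughly +0.12 points, while 100 million more in gross goes with roughly -0.25 points once the other variables are held fixed. The negative signs on gross and num_user_for_reviews may partly reflect their strong correlation with num_voted_users (0.63 and 0.78 in the matrix above) rather than a direct negative effect.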
Test the model on the test data, using the mean squared error (MSE):
pred <- predict(lmfit,movie_test)
mean((movie_test$imdb_score-pred)^2)
The MSE of the linear model is around 0.78.
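Since the same MSE calculation is repeated for every model, it can be wrapped in a small helper. The sketch below defines a hypothetical `mse()` function, checks it on toy scores, and converts the reported MSE into a root mean squared error (RMSE), which is on the same scale as the IMDB score.

```r
# Mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Toy example with hypothetical scores
actual    <- c(7.0, 6.5, 8.0, 5.5)
predicted <- c(6.5, 6.5, 7.0, 6.0)
mse(actual, predicted)   # (0.25 + 0 + 1 + 0.25) / 4 = 0.375

# RMSE for the reported MSE of about 0.78
sqrt(0.78)               # about 0.88 points
```

An MSE of 0.78 therefore corresponds to a typical prediction error of a little under 0.9 points on the 10-point scale.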
Build a regression tree model:
set.seed(3)
m.rpart <- rpart(imdb_score~.,data=movie_train)
m.rpart
## n= 3420
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3420 3803.67300 6.478596
## 2) num_voted_users< 88566 2261 2348.41200 6.158425
## 4) duration< 110.5 1554 1637.39400 5.955727
## 8) gross>=3679220 1122 1146.47000 5.808467
## 16) num_voted_users< 32267.5 565 604.98740 5.541062
## 32) num_user_for_reviews>=275 39 40.60974 4.535897 *
## 33) num_user_for_reviews< 275 526 522.05220 5.615589 *
## 17) num_voted_users>=32267.5 557 460.10080 6.079713 *
## 9) gross< 3679220 432 403.39980 6.338194 *
## 5) duration>=110.5 707 506.82890 6.603960 *
## 3) num_voted_users>=88566 1159 771.33820 7.103192
## 6) num_voted_users< 349779.5 950 532.38460 6.921895
## 12) gross>=2.73e+07 757 404.31110 6.796565 *
## 13) gross< 2.73e+07 193 69.54497 7.413472 *
## 7) num_voted_users>=349779.5 209 65.79455 7.927273 *
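For a regression tree, the deviance column is the sum of squared differences between the scores in a node and the node mean (yval). The sketch below uses numbers copied from the printed tree to work out two things: the training MSE of simply predicting the overall mean, and how much of the root deviance the first split on num_voted_users removes.

```r
# Values copied from the printed tree above
root_dev  <- 3803.673   # node 1 (root) deviance
root_n    <- 3420       # node 1 size
left_dev  <- 2348.412   # node 2: num_voted_users <  88566
right_dev <- 771.3382   # node 3: num_voted_users >= 88566

# Training MSE of always predicting the root mean (6.478596)
baseline_mse <- root_dev / root_n

# Share of the root deviance removed by the first split alone
first_split_gain <- (root_dev - (left_dev + right_dev)) / root_dev

round(c(baseline_mse = baseline_mse, first_split_gain = first_split_gain), 3)
```

The baseline MSE is about 1.11, and the first split alone removes about 18% of the deviance, which is consistent with num_voted_users being the variable most correlated with the score.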
Plot the tree:
rpart.plot(m.rpart,digits = 3)
Test the model:
p.rpart <- predict(m.rpart,movie_test)
mean((p.rpart-movie_test$imdb_score)^2)
## [1] 0.7755274
The MSE of the tree model is also around 0.78.
We build a random forest:
set.seed(100)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf <- randomForest(imdb_score~.,data=movie_train,ntree=500)
pred_rf <- predict(rf,movie_test)
mean((pred_rf-movie_test$imdb_score)^2)
## [1] 0.5940703
The MSE of the random forest is around 0.59, so its predictions are noticeably more accurate than those of the linear model and the single tree.
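To put the three results side by side, the sketch below collects the test-set MSE values reported above (the linear model value is the approximate 0.78 from the text) and derives the RMSE and the relative MSE reduction versus the single tree.

```r
# Test-set MSE reported above (the linear model value is approximate)
mse <- c(linear = 0.78, tree = 0.7755274, random_forest = 0.5940703)

# RMSE: typical prediction error in IMDB score points
rmse <- sqrt(mse)

# Relative reduction in MSE versus the single tree
reduction_vs_tree <- 1 - mse / mse["tree"]

round(rbind(mse, rmse, reduction_vs_tree), 3)
```

The random forest cuts the MSE by roughly 23% relative to the single tree, bringing the RMSE from about 0.88 to about 0.77 points. These figures come from a single 90/10 split, so they would likely shift somewhat with a different seed.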