This project attempts to predict the score of movies on IMDB using primarily the number of user reviews and movie length. The dataset used includes 3891 movies and 30 variables. This could assist people within the movie industry by showing which factors matter most when deciding where to spend in pursuit of success. The dataset was cleaned before it was read into R; the original file included additional rows that contained missing or dirty data and stray characters.
This project uses R and Tableau. The R packages used are tidyverse, corrplot, and ResourceSelection.
Install required packages
# Here we are checking if the package is installed
if(!require("tidyverse")){
# If the package is not on the system, it will be installed
install.packages("tidyverse", dependencies = TRUE)
# Here we are loading the package
library("tidyverse")
}
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## Warning: package 'ggplot2' was built under R version 3.4.4
## -- Conflicts --------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Here we are checking if the package is installed
if(!require("corrplot")){
# If the package is not on the system, it will be installed
install.packages("corrplot", dependencies = TRUE)
# Here we are loading the package
library("corrplot")
}
## Loading required package: corrplot
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
# Here we are checking if the package is installed
if(!require("ResourceSelection")){
# If the package is not on the system, it will be installed
install.packages("ResourceSelection", dependencies = TRUE)
# Here we are loading the package
library("ResourceSelection")
}
## Loading required package: ResourceSelection
## Warning: package 'ResourceSelection' was built under R version 3.4.4
## ResourceSelection 0.3-2 2017-02-28
### 2A Industry
This project uses a cleaned dataset of movies; its contents relate most closely to the entertainment industry. Movie success is often measured by both gross and scores on various websites. Gross provides a more objective measure, as it is purely quantitative and unaffected by perception in the way a score is. However, most variables in this dataset are measured after the gross would be determined. For this reason, we will investigate how to predict score.
originaldata = read.csv(file = "data/rottentomatoes.csv")
head(originaldata)
## ï..title
## 1 AvatarÃÂ
## 2 Pirates of the Caribbean: At World's EndÃÂ
## 3 SpectreÃÂ
## 4 The Dark Knight RisesÃÂ
## 5 Star Wars: Episode VII - The Force AwakensÃÂ
## 6 John CarterÃÂ
## genres director actor1
## 1 Action|Adventure|Fantasy|Sci-Fi James Cameron CCH Pounder
## 2 Action|Adventure|Fantasy Gore Verbinski Johnny Depp
## 3 Action|Adventure|Thriller Sam Mendes Christoph Waltz
## 4 Action|Thriller Christopher Nolan Tom Hardy
## 5 Documentary Doug Walker Doug Walker
## 6 Action|Adventure|Sci-Fi Andrew Stanton Daryl Sabara
## actor2 actor3 length budget director_fb_likes
## 1 Joel David Moore Wes Studi 178 237000000 0
## 2 Orlando Bloom Jack Davenport 169 300000000 563
## 3 Rory Kinnear Stephanie Sigman 148 245000000 0
## 4 Christian Bale Joseph Gordon-Levitt 164 250000000 22000
## 5 Rob Walker NA NA 131
## 6 Samantha Morton Polly Walker 132 263700000 475
## actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1 1000 936 855 4834
## 2 40000 5000 1000 48350
## 3 11000 393 161 11700
## 4 27000 23000 23000 106759
## 5 131 12 NA 143
## 6 640 632 530 1873
## fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1 33000 723 3054 886204 7.9 1.78
## 2 0 302 1238 471220 7.1 2.35
## 3 85000 602 994 275868 6.8 2.35
## 4 164000 813 2701 1144337 8.5 2.35
## 5 0 NA NA 8 7.1 NA
## 6 24000 462 738 212204 6.6 2.35
## gross year
## 1 760505847 2009
## 2 309404152 2007
## 3 200074175 2015
## 4 448130642 2012
## 5 NA NA
## 6 73058679 2012
summary(originaldata)
## ï..title genres
## Ben-HurÃÂ : 3 Drama : 236
## HalloweenÃÂ : 3 Comedy : 209
## HomeÃÂ : 3 Comedy|Drama : 191
## King KongÃÂ : 3 Comedy|Drama|Romance: 187
## PanÃÂ : 3 Comedy|Romance : 158
## The Fast and the FuriousÃÂ : 3 Drama|Romance : 152
## (Other) :5025 (Other) :3910
## director actor1 actor2
## : 104 Robert De Niro : 49 Morgan Freeman : 20
## Steven Spielberg: 26 Johnny Depp : 41 Charlize Theron: 15
## Woody Allen : 22 Nicolas Cage : 33 Brad Pitt : 14
## Clint Eastwood : 20 J.K. Simmons : 31 : 13
## Martin Scorsese : 20 Bruce Willis : 30 James Franco : 11
## Ridley Scott : 17 Denzel Washington: 30 Meryl Streep : 11
## (Other) :4834 (Other) :4829 (Other) :4959
## actor3 length budget
## : 23 Min. : 7.0 Min. :2.180e+02
## Ben Mendelsohn: 8 1st Qu.: 93.0 1st Qu.:6.000e+06
## John Heard : 8 Median :103.0 Median :2.000e+07
## Steve Coogan : 8 Mean :107.2 Mean :3.975e+07
## Anne Hathaway : 7 3rd Qu.:118.0 3rd Qu.:4.500e+07
## Jon Gries : 7 Max. :511.0 Max. :1.222e+10
## (Other) :4982 NA's :15 NA's :492
## director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 7.0 1st Qu.: 614 1st Qu.: 281 1st Qu.: 133.0
## Median : 49.0 Median : 988 Median : 595 Median : 371.5
## Mean : 686.5 Mean : 6560 Mean : 1652 Mean : 645.0
## 3rd Qu.: 194.5 3rd Qu.: 11000 3rd Qu.: 918 3rd Qu.: 636.0
## Max. :23000.0 Max. :640000 Max. :137000 Max. :23000.0
## NA's :104 NA's :7 NA's :13 NA's :23
## total_cast_likes fb_likes critic_reviews users_reviews
## Min. : 0 Min. : 0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 1411 1st Qu.: 0 1st Qu.: 50.0 1st Qu.: 65.0
## Median : 3090 Median : 166 Median :110.0 Median : 156.0
## Mean : 9699 Mean : 7526 Mean :140.2 Mean : 272.8
## 3rd Qu.: 13756 3rd Qu.: 3000 3rd Qu.:195.0 3rd Qu.: 326.0
## Max. :656730 Max. :349000 Max. :813.0 Max. :5060.0
## NA's :50 NA's :21
## users_votes score aspect_ratio gross
## Min. : 5 Min. :1.600 Min. : 1.18 Min. : 162
## 1st Qu.: 8594 1st Qu.:5.800 1st Qu.: 1.85 1st Qu.: 5340988
## Median : 34359 Median :6.600 Median : 2.35 Median : 25517500
## Mean : 83668 Mean :6.442 Mean : 2.22 Mean : 48468408
## 3rd Qu.: 96309 3rd Qu.:7.200 3rd Qu.: 2.35 3rd Qu.: 62309438
## Max. :1689764 Max. :9.500 Max. :16.00 Max. :760505847
## NA's :329 NA's :884
## year
## Min. :1916
## 1st Qu.:1999
## Median :2005
## Mean :2002
## 3rd Qu.:2011
## Max. :2016
## NA's :108
The original dataset has NA values in many columns. The titles contain stray special characters, and multiple genres are packed together into a single pipe-delimited field. In the actor2 column, a blank entry is the fourth most frequent value. Movie length averages 107.2 minutes, with a minimum of 7 minutes and a maximum of 511 minutes. Budget has 492 NA values and gross has 884 NA values. Budget ranges from $218 to about $12.2 billion, and gross ranges from $162 to about $760.5 million. The score ranges from 1.6 to 9.5 with a mean of 6.4.
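The missing-value counts described here come from `summary()`, but a per-column tally can be easier to scan. The following is a minimal sketch on a toy data frame; the `toy` and `actors` values are made up for illustration and are not taken from the dataset.

```r
# Toy stand-in for originaldata; the values are illustrative only.
toy <- data.frame(
  budget = c(237000000, NA, 245000000, NA),
  gross  = c(760505847, 309404152, NA, NA),
  score  = c(7.9, 7.1, 6.8, 7.1)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Blank strings are not NA, so text columns need a separate count.
actors <- c("Orlando Bloom", "", "Rory Kinnear")
blank_count <- sum(actors == "")
```

Counting blanks separately matters for columns such as actor2, where a missing name appears as an empty string rather than NA.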
mydata = read.csv(file = "data/cleaned_movies.csv")
The data was cleaned in Excel prior to reading it into R. The following changes were made:
Missing values
* 224 rows did not include data for either the budget or the gross; these were removed.
* 660 rows did not include data for the gross; these were removed.
* 268 rows did not include data for the budget; these were removed.
* Following this, 3891 of the original 5043 records remained. This is a sufficiently large dataset.
* While imputation could have filled in the missing values, we chose not to impute because it could decrease the accuracy of our prediction.
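The row removal was done in Excel, but the same filtering could be sketched in R. The example below is an illustration under assumed toy data (the `toy` frame and its values are hypothetical), not the procedure actually used.

```r
# Toy stand-in for the raw data; values are illustrative only.
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c(100, NA, 300, NA),
  gross  = c(500, 600, NA, NA)
)

# Keep only rows where both budget and gross are present.
kept <- toy[complete.cases(toy[, c("budget", "gross")]), ]

# Tally how many rows are dropped for each reason.
dropped_both   <- sum(is.na(toy$budget) & is.na(toy$gross))
dropped_gross  <- sum(!is.na(toy$budget) & is.na(toy$gross))
dropped_budget <- sum(is.na(toy$budget) & !is.na(toy$gross))
```

Keeping the three tallies separate makes it easy to check that the dropped rows and the kept rows add up to the original row count.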
Separating genres into multiple columns
Records with multiple values in the genre column were broken up into several columns. We created a column for each genre and filled it with 1 if the movie belonged to that genre or 0 if it did not. This was done with COUNTIF. We did this because we wanted to capture movies that fall into multiple categories, since there is no way to determine the main genre. An "Other" category was used for less common genres that had under 500 occurrences. We used numbers rather than text to enable further analysis.
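For readers who prefer to do this step in R instead of Excel, the indicator columns could be built roughly as follows. This is a sketch under assumptions: `main_genres` is a hypothetical shortlist, and the rule used for `other` (1 if the movie has any genre outside the shortlist) is one plausible reading of the "Other" column, not a confirmed one.

```r
# Toy genre strings in the same pipe-delimited form as the raw data.
genres <- c("Action|Adventure|Fantasy|Sci-Fi", "Action|Thriller", "Documentary")

# Hypothetical list of genres that get their own column.
main_genres <- c("Action", "Adventure", "Thriller")

# Split each string on the literal "|" character.
genre_list <- strsplit(genres, "|", fixed = TRUE)

# One 0/1 column per main genre (rows are movies, columns are genres).
indicators <- sapply(main_genres, function(g) {
  as.integer(sapply(genre_list, function(x) g %in% x))
})

# Assumed rule: flag movies with at least one genre outside the main list.
other <- as.integer(sapply(genre_list, function(x) any(!(x %in% main_genres))))
```

Because a movie can carry several genres, each row may have more than one indicator set to 1.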
Removing extraneous characters in the title
This was done in Excel using the RIGHT and LEFT functions. To remove the character placed at the beginning and end of each title, we deleted the leftmost and rightmost character.
Changing NA to 0 - NA often occurs when a value is blank.
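The same trimming can be expressed in R with `substr()`. In this sketch the `#` character is only a stand-in for the stray character, and `raw_titles` is a hypothetical input.

```r
# Toy titles with one unwanted character at each end.
raw_titles <- c("#Avatar#", "#Spectre#")

# Drop the first and last character of every title.
clean_titles <- substr(raw_titles, 2, nchar(raw_titles) - 1)
print(clean_titles)
```

This removes characters by position, so it assumes every title carries exactly one extra character on each side.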
mydata[is.na(mydata)] <- 0
title = mydata$ï..Title
director = mydata$director
actor1 = mydata$actor1
actor2 = mydata$actor2
actor3 = mydata$actor3
length = mydata$length
budget = mydata$budget
director_likes = mydata$director_fb_likes
actor1_likes = mydata$actor1_fb_likes
actor2_likes = mydata$actor2_fb_likes
actor3_likes = mydata$actor3_fb_likes
total_likes = mydata$total_cast_likes
fb_likes = mydata$fb_likes
creview = mydata$critic_reviews
ureview = mydata$users_reviews
uvotes = mydata$users_votes
rating = mydata$score
aspect_ratio = mydata$aspect_ratio
gross = mydata$gross
year = mydata$year
action = mydata$Action
adventure = mydata$Adventure
drama = mydata$Drama
comedy = mydata$Comedy
crime = mydata$Crime
scifi = mydata$Sci.Fi
romance = mydata$Romance
thriller = mydata$Thriller
other = mydata$Other
scorecat = mydata$ScoreCat
summary(mydata)
## Title director
## Halloween : 3 Steven Spielberg : 25
## Home : 3 Clint Eastwood : 19
## King Kong : 3 Woody Allen : 19
## Pan : 3 Ridley Scott : 17
## The Fast and the Furious: 3 Martin Scorsese : 16
## Victor Frankenstein : 3 Steven Soderbergh: 16
## (Other) :3873 (Other) :3779
## actor1 actor2 actor3
## Robert De Niro : 42 Morgan Freeman : 20 : 10
## Johnny Depp : 39 Brad Pitt : 14 Steve Coogan : 8
## J.K. Simmons : 31 Charlize Theron: 14 Anne Hathaway : 7
## Nicolas Cage : 31 James Franco : 11 Ben Mendelsohn: 7
## Denzel Washington: 30 Jason Flemyng : 10 Kirsten Dunst : 7
## Bruce Willis : 29 Meryl Streep : 10 Robert Duvall : 7
## (Other) :3689 (Other) :3812 (Other) :3845
## length budget director_fb_likes actor1_fb_likes
## Min. : 0.0 Min. :2.180e+02 Min. : 0.0 Min. : 0
## 1st Qu.: 95.0 1st Qu.:1.000e+07 1st Qu.: 10.0 1st Qu.: 721
## Median :106.0 Median :2.400e+07 Median : 58.0 Median : 1000
## Mean :109.9 Mean :4.521e+07 Mean : 781.3 Mean : 7579
## 3rd Qu.:120.0 3rd Qu.:5.000e+07 3rd Qu.: 226.0 3rd Qu.: 12000
## Max. :330.0 Max. :1.222e+10 Max. :23000.0 Max. :640000
##
## actor2_fb_likes actor3_fb_likes total_cast_likes fb_likes
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 362 1st Qu.: 182.0 1st Qu.: 1818 1st Qu.: 0
## Median : 664 Median : 426.0 Median : 3888 Median : 209
## Mean : 1968 Mean : 751.6 Mean : 11263 Mean : 9138
## 3rd Qu.: 971 3rd Qu.: 686.0 3rd Qu.: 16002 3rd Qu.: 11000
## Max. :137000 Max. :23000.0 Max. :656730 Max. :349000
##
## critic_reviews users_reviews users_votes score
## Min. : 0.0 Min. : 1.0 Min. : 5 Min. :1.600
## 1st Qu.: 72.0 1st Qu.: 102.0 1st Qu.: 17314 1st Qu.:5.900
## Median :134.0 Median : 203.0 Median : 50415 Median :6.600
## Mean :163.2 Mean : 327.3 Mean : 102583 Mean :6.464
## 3rd Qu.:221.5 3rd Qu.: 391.0 3rd Qu.: 124204 3rd Qu.:7.200
## Max. :813.0 Max. :5060.0 Max. :1689764 Max. :9.300
##
## aspect_ratio gross year Action
## Min. : 0.000 Min. : 162 Min. :1920 Min. :0.0000
## 1st Qu.: 1.850 1st Qu.: 6836508 1st Qu.:1999 1st Qu.:0.0000
## Median : 2.350 Median : 27979400 Median :2005 Median :0.0000
## Mean : 2.069 Mean : 51054995 Mean :2003 Mean :0.2493
## 3rd Qu.: 2.350 3rd Qu.: 65360661 3rd Qu.:2010 3rd Qu.:0.0000
## Max. :16.000 Max. :760505847 Max. :2016 Max. :1.0000
##
## Adventure Drama Comedy Fantasy
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.000 Median :0.0000 Median :0.0000
## Mean :0.2043 Mean :0.504 Mean :0.3883 Mean :0.1329
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
##
## Crime Sci.Fi Romance Thriller
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.185 Mean :0.1288 Mean :0.2282 Mean :0.2904
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Other ScoreCat
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000
## Mean :0.4731 Mean :0.5865
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
##
The directors who appear most often in the dataset are Steven Spielberg, Clint Eastwood, Woody Allen, and Ridley Scott. The most frequent lead actors (actor1) are Robert De Niro, Johnny Depp, J.K. Simmons, Nicolas Cage, and Denzel Washington. The actors in the actor2 and actor3 columns recur much less frequently. The longest movie is 5.5 hours long, making it a clear outlier compared to the mean of about 1 hour and 50 minutes and the third quartile of 2 hours. The scores range from 1.6 to 9.3, with a mean of 6.46. Judging by the quartiles, the minimum score looks more anomalous than the maximum. The IMDB site estimates movie budgets, but the gross is accurate. The maximum gross is about $760.5 million while the minimum is $162; both of these values appear to be outliers. The mean gross is about $51.1 million. The genre that appears most often in the dataset is Drama, while Sci-Fi, Fantasy, and Crime occur the least.
initial_corr <- cor(mydata[-c(1,2,3,4,5,31)])
initial_corr
## length budget director_fb_likes
## length 1.000000000 0.06922707 0.180575225
## budget 0.069227068 1.00000000 0.019234775
## director_fb_likes 0.180575225 0.01923478 1.000000000
## actor1_fb_likes 0.087749631 0.01761803 0.091673228
## actor2_fb_likes 0.132270101 0.03704083 0.118477486
## actor3_fb_likes 0.128435737 0.04133318 0.119610922
## total_cast_likes 0.124797573 0.03018868 0.121031412
## fb_likes 0.221579676 0.05469570 0.162554735
## critic_reviews 0.240152638 0.10770951 0.179298152
## users_reviews 0.356104345 0.07335186 0.220459113
## users_votes 0.343941065 0.06883371 0.301846810
## score 0.358704617 0.02913468 0.189292612
## aspect_ratio 0.177687712 0.03250297 0.051423336
## gross 0.252911180 0.10217945 0.142157513
## year -0.128050780 0.04498995 -0.045781469
## Action 0.091474252 0.06884208 0.005296011
## Adventure 0.087504778 0.09357300 0.006318655
## Drama 0.270126707 -0.02290978 0.047209467
## Comedy -0.316193622 -0.01213736 -0.083113871
## Fantasy -0.068296866 0.04309397 -0.016627083
## Crime 0.024210011 -0.01567477 0.037315219
## Sci.Fi 0.018402159 0.09172345 0.004183409
## Romance 0.017698834 -0.02637097 -0.043698547
## Thriller 0.019766854 -0.00460247 0.027793163
## Other -0.009388305 0.01681507 0.003521888
## actor1_fb_likes actor2_fb_likes actor3_fb_likes
## length 0.087749631 0.132270101 0.128435737
## budget 0.017618035 0.037040835 0.041333178
## director_fb_likes 0.091673228 0.118477486 0.119610922
## actor1_fb_likes 1.000000000 0.392226269 0.254175313
## actor2_fb_likes 0.392226269 1.000000000 0.554782114
## actor3_fb_likes 0.254175313 0.554782114 1.000000000
## total_cast_likes 0.945235574 0.643067712 0.490486996
## fb_likes 0.131702448 0.233097094 0.272143768
## critic_reviews 0.172684872 0.259129165 0.257320756
## users_reviews 0.127717690 0.193020501 0.210195914
## users_votes 0.183807897 0.249216879 0.271603050
## score 0.091584473 0.101350279 0.064244607
## aspect_ratio 0.067026245 0.078310693 0.061052802
## gross 0.149113958 0.256866044 0.303450929
## year 0.091098996 0.115498559 0.112049142
## Action 0.045328355 0.060604685 0.048010820
## Adventure 0.037057882 0.091593304 0.117069654
## Drama 0.010757325 0.006894064 -0.032116339
## Comedy -0.048174661 -0.057722774 -0.026175930
## Fantasy 0.021011595 0.040312015 0.094413491
## Crime 0.035368690 0.018638417 -0.030306467
## Sci.Fi 0.006762163 0.035761580 0.044609206
## Romance -0.023988427 -0.006810766 0.001863907
## Thriller 0.042764664 0.012567328 -0.010924016
## Other -0.041346760 -0.046131590 -0.047912845
## total_cast_likes fb_likes critic_reviews
## length 0.1247975732 0.221579676 0.24015264
## budget 0.0301886770 0.054695696 0.10770951
## director_fb_likes 0.1210314123 0.162554735 0.17929815
## actor1_fb_likes 0.9452355740 0.131702448 0.17268487
## actor2_fb_likes 0.6430677122 0.233097094 0.25912917
## actor3_fb_likes 0.4904869955 0.272143768 0.25732076
## total_cast_likes 1.0000000000 0.206782864 0.24412489
## fb_likes 0.2067828639 1.000000000 0.70338535
## critic_reviews 0.2441248856 0.703385350 1.00000000
## users_reviews 0.1855268135 0.376884691 0.57370402
## users_votes 0.2540496589 0.522649488 0.60053337
## score 0.1045460273 0.277349180 0.34039074
## aspect_ratio 0.0833722375 0.119114086 0.23315799
## gross 0.2409818265 0.375455430 0.47713084
## year 0.1203101688 0.295540826 0.39219492
## Action 0.0627454083 0.079625885 0.16059289
## Adventure 0.0707745077 0.137136567 0.17721943
## Drama 0.0001569721 0.009561122 -0.04340515
## Comedy -0.0535101031 -0.125448704 -0.21199518
## Fantasy 0.0409847792 0.065010075 0.08212020
## Crime 0.0298154322 -0.035010172 -0.02715438
## Sci.Fi 0.0233153270 0.148927308 0.19880997
## Romance -0.0228995007 -0.057542259 -0.11958310
## Thriller 0.0359886806 0.015325218 0.12179672
## Other -0.0543509990 -0.047235272 -0.02161028
## users_reviews users_votes score aspect_ratio
## length 0.35610434 0.34394106 0.358704617 0.177687712
## budget 0.07335186 0.06883371 0.029134678 0.032502970
## director_fb_likes 0.22045911 0.30184681 0.189292612 0.051423336
## actor1_fb_likes 0.12771769 0.18380790 0.091584473 0.067026245
## actor2_fb_likes 0.19302050 0.24921688 0.101350279 0.078310693
## actor3_fb_likes 0.21019591 0.27160305 0.064244607 0.061052802
## total_cast_likes 0.18552681 0.25404966 0.104546027 0.083372237
## fb_likes 0.37688469 0.52264949 0.277349180 0.119114086
## critic_reviews 0.57370402 0.60053337 0.340390741 0.233157986
## users_reviews 1.00000000 0.78249276 0.320005149 0.141583065
## users_votes 0.78249276 1.00000000 0.473209164 0.125238349
## score 0.32000515 0.47320916 1.000000000 0.035394972
## aspect_ratio 0.14158307 0.12523835 0.035394972 1.000000000
## gross 0.55249863 0.63140421 0.211525303 0.112166379
## year 0.01186838 0.01677337 -0.130236373 0.131593206
## Action 0.18762872 0.16062681 -0.092324467 0.168687515
## Adventure 0.19492852 0.18920448 -0.003251721 0.086566178
## Drama -0.04702432 -0.05189910 0.303009938 0.018752056
## Comedy -0.22554832 -0.14750317 -0.211918772 -0.148697548
## Fantasy 0.08099041 0.08405966 -0.067639719 0.009015242
## Crime -0.04265637 0.01480833 0.037276615 0.063907859
## Sci.Fi 0.20924614 0.16857491 -0.053277098 0.066266991
## Romance -0.08275357 -0.09029231 -0.017451747 -0.049941860
## Thriller 0.08041243 0.04053820 -0.048232742 0.151629449
## Other -0.07131177 -0.09359906 -0.002005325 -0.006982741
## gross year Action Adventure
## length 0.25291118 -0.128050780 0.091474252 0.087504778
## budget 0.10217945 0.044989953 0.068842076 0.093573005
## director_fb_likes 0.14215751 -0.045781469 0.005296011 0.006318655
## actor1_fb_likes 0.14911396 0.091098996 0.045328355 0.037057882
## actor2_fb_likes 0.25686604 0.115498559 0.060604685 0.091593304
## actor3_fb_likes 0.30345093 0.112049142 0.048010820 0.117069654
## total_cast_likes 0.24098183 0.120310169 0.062745408 0.070774508
## fb_likes 0.37545543 0.295540826 0.079625885 0.137136567
## critic_reviews 0.47713084 0.392194922 0.160592894 0.177219426
## users_reviews 0.55249863 0.011868377 0.187628724 0.194928518
## users_votes 0.63140421 0.016773371 0.160626811 0.189204481
## score 0.21152530 -0.130236373 -0.092324467 -0.003251721
## aspect_ratio 0.11216638 0.131593206 0.168687515 0.086566178
## gross 1.00000000 0.046575094 0.215178382 0.354148651
## year 0.04657509 1.000000000 0.013127642 -0.017322577
## Action 0.21517838 0.013127642 1.000000000 0.322401037
## Adventure 0.35414865 -0.017322577 0.322401037 1.000000000
## Drama -0.20257758 0.011063614 -0.261244014 -0.271118834
## Comedy -0.01129171 0.018978275 -0.188551025 -0.036258684
## Fantasy 0.19949813 0.008546656 0.047457811 0.267344385
## Crime -0.08365753 -0.005933677 0.150702528 -0.174166196
## Sci.Fi 0.17893883 0.002699712 0.296399825 0.250516291
## Romance -0.05206301 -0.041723458 -0.191626496 -0.140386439
## Thriller -0.02546523 0.013527309 0.300081094 -0.037741320
## Other -0.02148249 -0.023162885 -0.158194131 0.040662970
## Drama Comedy Fantasy Crime
## length 0.2701267074 -0.31619362 -0.068296866 0.024210011
## budget -0.0229097795 -0.01213736 0.043093965 -0.015674768
## director_fb_likes 0.0472094675 -0.08311387 -0.016627083 0.037315219
## actor1_fb_likes 0.0107573246 -0.04817466 0.021011595 0.035368690
## actor2_fb_likes 0.0068940639 -0.05772277 0.040312015 0.018638417
## actor3_fb_likes -0.0321163391 -0.02617593 0.094413491 -0.030306467
## total_cast_likes 0.0001569721 -0.05351010 0.040984779 0.029815432
## fb_likes 0.0095611217 -0.12544870 0.065010075 -0.035010172
## critic_reviews -0.0434051478 -0.21199518 0.082120197 -0.027154385
## users_reviews -0.0470243170 -0.22554832 0.080990405 -0.042656371
## users_votes -0.0518990984 -0.14750317 0.084059659 0.014808333
## score 0.3030099379 -0.21191877 -0.067639719 0.037276615
## aspect_ratio 0.0187520565 -0.14869755 0.009015242 0.063907859
## gross -0.2025775788 -0.01129171 0.199498129 -0.083657534
## year 0.0110636142 0.01897827 0.008546656 -0.005933677
## Action -0.2612440140 -0.18855102 0.047457811 0.150702528
## Adventure -0.2711188339 -0.03625868 0.267344385 -0.174166196
## Drama 1.0000000000 -0.25789085 -0.205284385 0.051797609
## Comedy -0.2578908541 1.00000000 0.034538837 -0.076857364
## Fantasy -0.2052843846 0.03453884 1.000000000 -0.157280379
## Crime 0.0517976086 -0.07685736 -0.157280379 1.000000000
## Sci.Fi -0.2248278570 -0.10321038 0.016800287 -0.137736752
## Romance 0.1634629119 0.17736647 -0.045082888 -0.129809270
## Thriller -0.0232142769 -0.38428512 -0.105317986 0.355582143
## Other -0.0430691626 -0.15728514 0.129485498 -0.153316094
## Sci.Fi Romance Thriller Other
## length 0.018402159 0.017698834 0.01976685 -0.009388305
## budget 0.091723446 -0.026370969 -0.00460247 0.016815069
## director_fb_likes 0.004183409 -0.043698547 0.02779316 0.003521888
## actor1_fb_likes 0.006762163 -0.023988427 0.04276466 -0.041346760
## actor2_fb_likes 0.035761580 -0.006810766 0.01256733 -0.046131590
## actor3_fb_likes 0.044609206 0.001863907 -0.01092402 -0.047912845
## total_cast_likes 0.023315327 -0.022899501 0.03598868 -0.054350999
## fb_likes 0.148927308 -0.057542259 0.01532522 -0.047235272
## critic_reviews 0.198809968 -0.119583104 0.12179672 -0.021610282
## users_reviews 0.209246135 -0.082753572 0.08041243 -0.071311774
## users_votes 0.168574908 -0.090292309 0.04053820 -0.093599063
## score -0.053277098 -0.017451747 -0.04823274 -0.002005325
## aspect_ratio 0.066266991 -0.049941860 0.15162945 -0.006982741
## gross 0.178938832 -0.052063014 -0.02546523 -0.021482494
## year 0.002699712 -0.041723458 0.01352731 -0.023162885
## Action 0.296399825 -0.191626496 0.30008109 -0.158194131
## Adventure 0.250516291 -0.140386439 -0.03774132 0.040662970
## Drama -0.224827857 0.163462912 -0.02321428 -0.043069163
## Comedy -0.103210377 0.177366474 -0.38428512 -0.157285137
## Fantasy 0.016800287 -0.045082888 -0.10531799 0.129485498
## Crime -0.137736752 -0.129809270 0.35558214 -0.153316094
## Sci.Fi 1.000000000 -0.141400074 0.12086277 -0.024658735
## Romance -0.141400074 1.000000000 -0.21838248 -0.155952895
## Thriller 0.120862765 -0.218382479 1.00000000 -0.027953289
## Other -0.024658735 -0.155952895 -0.02795329 1.000000000
corrplot(cor(mydata[-c(1,2,3,4,5,31)]))
We are removing the variables whose correlation with gross is less than 0.15.
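Instead of listing column indices by hand, a threshold filter of this kind could be applied programmatically. The sketch below uses a made-up `toy` data frame; the column names and values are assumptions for illustration.

```r
# Toy data: `a` tracks gross closely, `b` is uncorrelated with it.
toy <- data.frame(
  gross = c(1, 2, 3, 4, 5, 6, 7, 8),
  a     = c(2, 4, 6, 8, 10, 12, 14, 16),
  b     = c(1, -1, -1, 1, 1, -1, -1, 1)
)

# Correlation of every column with gross.
cor_with_gross <- cor(toy)[, "gross"]

# Keep columns whose correlation with gross is at least 0.15.
keep <- names(cor_with_gross)[cor_with_gross >= 0.15]
reduced <- toy[, keep]
```

Note that a signed threshold also drops strongly negative correlations; comparing `abs(cor_with_gross)` against the threshold would keep them.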
corr <- cor(mydata[-c(1,2,3,4,5, 7, 8, 9, 12, 18, 20, 23, 24, 26, 27, 28, 29, 30,31)])
corr
## length actor2_fb_likes actor3_fb_likes fb_likes
## length 1.00000000 0.13227010 0.12843574 0.22157968
## actor2_fb_likes 0.13227010 1.00000000 0.55478211 0.23309709
## actor3_fb_likes 0.12843574 0.55478211 1.00000000 0.27214377
## fb_likes 0.22157968 0.23309709 0.27214377 1.00000000
## critic_reviews 0.24015264 0.25912917 0.25732076 0.70338535
## users_reviews 0.35610434 0.19302050 0.21019591 0.37688469
## users_votes 0.34394106 0.24921688 0.27160305 0.52264949
## score 0.35870462 0.10135028 0.06424461 0.27734918
## gross 0.25291118 0.25686604 0.30345093 0.37545543
## Action 0.09147425 0.06060469 0.04801082 0.07962589
## Adventure 0.08750478 0.09159330 0.11706965 0.13713657
## Fantasy -0.06829687 0.04031201 0.09441349 0.06501008
## critic_reviews users_reviews users_votes score
## length 0.2401526 0.35610434 0.34394106 0.358704617
## actor2_fb_likes 0.2591292 0.19302050 0.24921688 0.101350279
## actor3_fb_likes 0.2573208 0.21019591 0.27160305 0.064244607
## fb_likes 0.7033853 0.37688469 0.52264949 0.277349180
## critic_reviews 1.0000000 0.57370402 0.60053337 0.340390741
## users_reviews 0.5737040 1.00000000 0.78249276 0.320005149
## users_votes 0.6005334 0.78249276 1.00000000 0.473209164
## score 0.3403907 0.32000515 0.47320916 1.000000000
## gross 0.4771308 0.55249863 0.63140421 0.211525303
## Action 0.1605929 0.18762872 0.16062681 -0.092324467
## Adventure 0.1772194 0.19492852 0.18920448 -0.003251721
## Fantasy 0.0821202 0.08099041 0.08405966 -0.067639719
## gross Action Adventure Fantasy
## length 0.2529112 0.09147425 0.087504778 -0.06829687
## actor2_fb_likes 0.2568660 0.06060469 0.091593304 0.04031201
## actor3_fb_likes 0.3034509 0.04801082 0.117069654 0.09441349
## fb_likes 0.3754554 0.07962589 0.137136567 0.06501008
## critic_reviews 0.4771308 0.16059289 0.177219426 0.08212020
## users_reviews 0.5524986 0.18762872 0.194928518 0.08099041
## users_votes 0.6314042 0.16062681 0.189204481 0.08405966
## score 0.2115253 -0.09232447 -0.003251721 -0.06763972
## gross 1.0000000 0.21517838 0.354148651 0.19949813
## Action 0.2151784 1.00000000 0.322401037 0.04745781
## Adventure 0.3541487 0.32240104 1.000000000 0.26734439
## Fantasy 0.1994981 0.04745781 0.267344385 1.00000000
corrplot(cor(mydata[-c(1,2,3,4,5, 7, 8, 9, 12, 18, 20, 23, 24, 26, 27, 28, 29, 30,31)]))
The highest correlations with gross are critic reviews (0.4771308), user reviews (0.5524986), and user votes (0.6314042). This is likely because gross rises when more people view the movie, which in turn yields more reviews. These variables would not help us predict gross, because the votes and reviews occur after viewing the movie. For this reason, we feel that it is more useful to predict the score of a movie. Many variables in the dataset are recorded after the viewer has seen the film; these cannot be used to predict gross, but they can be used to predict score.
corr2 <- cor(mydata[-c(1,2,3,4,5, 7, 9, 10, 11, 12, 18, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30,31)])
corr2
## length director_fb_likes fb_likes critic_reviews
## length 1.0000000 0.18057523 0.2215797 0.2401526
## director_fb_likes 0.1805752 1.00000000 0.1625547 0.1792982
## fb_likes 0.2215797 0.16255474 1.0000000 0.7033853
## critic_reviews 0.2401526 0.17929815 0.7033853 1.0000000
## users_reviews 0.3561043 0.22045911 0.3768847 0.5737040
## users_votes 0.3439411 0.30184681 0.5226495 0.6005334
## score 0.3587046 0.18929261 0.2773492 0.3403907
## gross 0.2529112 0.14215751 0.3754554 0.4771308
## Comedy -0.3161936 -0.08311387 -0.1254487 -0.2119952
## users_reviews users_votes score gross
## length 0.3561043 0.3439411 0.3587046 0.25291118
## director_fb_likes 0.2204591 0.3018468 0.1892926 0.14215751
## fb_likes 0.3768847 0.5226495 0.2773492 0.37545543
## critic_reviews 0.5737040 0.6005334 0.3403907 0.47713084
## users_reviews 1.0000000 0.7824928 0.3200051 0.55249863
## users_votes 0.7824928 1.0000000 0.4732092 0.63140421
## score 0.3200051 0.4732092 1.0000000 0.21152530
## gross 0.5524986 0.6314042 0.2115253 1.00000000
## Comedy -0.2255483 -0.1475032 -0.2119188 -0.01129171
## Comedy
## length -0.31619362
## director_fb_likes -0.08311387
## fb_likes -0.12544870
## critic_reviews -0.21199518
## users_reviews -0.22554832
## users_votes -0.14750317
## score -0.21191877
## gross -0.01129171
## Comedy 1.00000000
The highest correlations with score are user votes (0.4732092) and length (0.3587046). We believe user votes are correlated with score because, when a movie has few votes, each vote moves the score more; conversely, the more users who vote, the less effect a single bad score has. We believe that length is related to score because it takes longer to tell a good story. One interesting observation is that movie genre is tied more closely to gross than to score. Another is that, among the variables retained here, Comedy is the only one negatively correlated with score.
corrplot(cor(mydata[-c(1,2,3,4,5, 7, 9, 10, 11, 12, 18, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31)]))
knitr::include_graphics('imgs/comedyscore.png')
The count of each rating for comedy movies forms a roughly bell-shaped curve, suggesting the scores are approximately normally distributed.
knitr::include_graphics('imgs/LengthScoreScatter.png')
The scatter plot does not show a strong relationship, despite the correlation of about 0.36 found in the matrix. Score does tend to increase as movie length increases, but the trend is only evident at extreme lengths.
knitr::include_graphics('imgs/ScoreVoteScatter.png')
In the scatter plot, user votes appear to be the variable most correlated with score. As the number of votes increases, the score tends to be higher. That said, movies with few user votes are concentrated across all score ranges. This could be due to the greater influence of each individual vote, as we suggested earlier.
## linear model
```r
linear_model <- lm(score ~ uvotes, data = mydata)
summary(linear_model)
```
##
## Call:
## lm(formula = score ~ uvotes, data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.7699 -0.5204  0.0828  0.6495  2.3022
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.123e+00  1.804e-02   339.3   <2e-16 ***
## uvotes      3.316e-06  9.898e-08    33.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9305 on 3889 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.2237
## F-statistic:  1122 on 1 and 3889 DF,  p-value: < 2.2e-16
The R-squared and adjusted R-squared are low for this model (about 0.22), meaning user votes alone explain only about 22% of the variance in score. This is an indicator that it is not a good fit.
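R-squared is the share of the variance in score that the model explains, and adjusted R-squared penalises it for the number of predictors. As a minimal sketch on simulated data (the `toy` data frame is hypothetical, not the project's data), both values can be pulled out of `summary()` and checked against their definitions:

```r
# Simulated stand-in for mydata; values are hypothetical.
set.seed(1)
toy <- data.frame(uvotes = rexp(300, rate = 1 / 80000))
toy$score <- 6.1 + 3e-06 * toy$uvotes + rnorm(300, sd = 0.9)

fit <- lm(score ~ uvotes, data = toy)
s <- summary(fit)

# R-squared by hand: 1 - residual sum of squares / total sum of squares.
rss <- sum(residuals(fit)^2)
tss <- sum((toy$score - mean(toy$score))^2)
r2_manual <- 1 - rss / tss

# Adjusted R-squared with n observations and p predictors (p = 1 here).
n <- nrow(toy)
p <- 1
adj_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(r2 = s$r.squared, r2_manual = r2_manual,
        adj = s$adj.r.squared, adj_manual = adj_manual))
```

Plugging in the values from the output above (R-squared 0.2239, 3889 residual degrees of freedom, one predictor) gives an adjusted value of 0.2237, which matches the summary.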
```r
mlinear_model <- lm(score ~ uvotes + length, data = mydata)
# Summarise the model that includes length
summary(mlinear_model)
```
Adding length, the next most strongly correlated variable, made the fit a bit better, but it is still not great.
```r
mlinear_model2 <- lm(score ~ uvotes + length + comedy, data = mydata)
summary(mlinear_model2)
```
##
## Call:
## lm(formula = score ~ uvotes + length + comedy, data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9099 -0.5073  0.0980  0.6308  2.3538
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.269e+00  8.048e-02  65.462  < 2e-16 ***
## uvotes       2.752e-06  1.020e-07  26.978  < 2e-16 ***
## length       9.015e-03  7.038e-04  12.808  < 2e-16 ***
## comedy      -2.005e-01  3.122e-02  -6.422  1.5e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8995 on 3887 degrees of freedom
## Multiple R-squared:  0.2752, Adjusted R-squared:  0.2746
## F-statistic: 491.9 on 3 and 3887 DF,  p-value: < 2.2e-16
With comedy added, the R-squared and adjusted R-squared improved only incrementally, to about 0.275. A linear model may not be the best approach, so we will try other models.
```r
# Store the squared and cubed vote counts as columns so lm() finds them in mydata
mydata$uvotes_squared <- mydata$uvotes^2
mydata$uvotes_cubed <- mydata$uvotes^3
quadratic_model <- lm(score ~ uvotes + uvotes_squared, data = mydata)
summary(quadratic_model)
```
##
## Call:
## lm(formula = score ~ uvotes + uvotes_squared, data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.7851 -0.5157  0.0698  0.6400  2.3707
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     6.015e+00  2.028e-02  296.59   <2e-16 ***
## uvotes          5.164e-06  1.926e-07   26.81   <2e-16 ***
## uvotes_squared -2.431e-12  2.186e-13  -11.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9161 on 3888 degrees of freedom
## Multiple R-squared:  0.2479, Adjusted R-squared:  0.2475
## F-statistic: 640.6 on 2 and 3888 DF,  p-value: < 2.2e-16
This model fits slightly better than the linear model on user votes alone (adjusted R-squared 0.2475 versus 0.2237), but it is worse than the linear model with length and comedy (0.2746). This makes sense because our graphs did not show quadratic trends. We will try cubic since that looks more like our graph.
```r
cubic_model <- lm(score ~ uvotes + uvotes_squared + uvotes_cubed, data = mydata)
summary(cubic_model)
```
##
## Call:
## lm(formula = score ~ uvotes + uvotes_squared + uvotes_cubed,
##     data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8091 -0.5119  0.0768  0.6267  2.4145
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     5.966e+00  2.278e-02 261.881  < 2e-16 ***
## uvotes          6.406e-06  3.287e-07  19.485  < 2e-16 ***
## uvotes_squared -6.131e-12  8.240e-13  -7.441 1.22e-13 ***
## uvotes_cubed    2.146e-18  4.608e-19   4.656 3.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9137 on 3887 degrees of freedom
## Multiple R-squared:  0.252, Adjusted R-squared:  0.2515
## F-statistic: 436.6 on 3 and 3887 DF,  p-value: < 2.2e-16
Based on the R-squared and adjusted R-squared, this model is better than the quadratic one (adjusted R-squared 0.2515 versus 0.2475), but it is still worse than the linear model with length and comedy (0.2746).
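One way to make the comparison between the linear, quadratic, and cubic fits explicit is to collect the adjusted R-squared of each model in one place and to run `anova()` on the nested models. The sketch below does this on simulated data, not the project's data. It is an alternative formulation, not the code that produced the output above: `I()` builds the power terms inside the formula, and the hypothetical `votes_100k` column expresses votes in units of 100,000 to keep the powers numerically well behaved.

```r
# Simulated stand-in for mydata; values are hypothetical.
set.seed(7)
toy <- data.frame(votes_100k = rexp(400, rate = 1 / 0.8))
toy$score <- 6 + 0.5 * toy$votes_100k - 0.024 * toy$votes_100k^2 +
  rnorm(400, sd = 0.9)

# Three nested polynomial models of increasing degree.
models <- list(
  linear    = lm(score ~ votes_100k, data = toy),
  quadratic = lm(score ~ votes_100k + I(votes_100k^2), data = toy),
  cubic     = lm(score ~ votes_100k + I(votes_100k^2) + I(votes_100k^3),
                 data = toy)
)

# Adjusted R-squared side by side.
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))

# F-tests: does each added power term improve on the previous model?
print(anova(models$linear, models$quadratic, models$cubic))
```

Plain R-squared can never decrease when a term is added to a nested model, which is why adjusted R-squared or the F-test is the more informative comparison here.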
This model predicts whether a score falls in the good or bad category, where a score equal to or above the average (6.4) is good. A column holding this category was added to the dataset.
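A minimal sketch of how such a category column could be built is shown below. The 0/1 coding and the five example scores are assumptions made for illustration; the report does not show how `scorecat` was actually created.

```r
# Hypothetical scores standing in for mydata$score.
toy <- data.frame(score = c(4.2, 5.9, 6.4, 7.1, 8.3))

# Cutoff taken from the text: the average score of 6.4.
cutoff <- 6.4

# 1 = good (score at or above the cutoff), 0 = bad.
toy$scorecat <- as.integer(toy$score >= cutoff)

print(table(toy$scorecat))
```

A numeric 0/1 column works both as the response in `glm(..., family = binomial())` and as input to goodness-of-fit tests that expect numeric outcomes.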
Below we include all variables that had a strong correlation with score to determine which have the most effect on the model.
director_fb_likes fb_likes critic_reviews users_reviews users_votes score gross Comedy
logit_model_all <- glm(scorecat ~ uvotes+length+comedy+director_likes+ureview+fb_likes+gross, data = mydata, family=binomial())
summary(logit_model_all)
##
## Call:
## glm(formula = scorecat ~ uvotes + length + comedy + director_likes +
## ureview + fb_likes + gross, family = binomial(), data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5963 -0.9845 0.2364 0.9781 2.9194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.569e+00 2.549e-01 -10.076 < 2e-16 ***
## uvotes 1.984e-05 1.174e-06 16.898 < 2e-16 ***
## length 2.366e-02 2.341e-03 10.107 < 2e-16 ***
## comedy -3.749e-01 8.015e-02 -4.678 2.9e-06 ***
## director_likes 7.246e-05 2.061e-05 3.515 0.000439 ***
## ureview -1.707e-03 1.999e-04 -8.542 < 2e-16 ***
## fb_likes -5.316e-09 3.218e-06 -0.002 0.998682
## gross -1.066e-08 1.034e-09 -10.313 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5277.1 on 3890 degrees of freedom
## Residual deviance: 4239.0 on 3883 degrees of freedom
## AIC: 4255
##
## Number of Fisher Scoring iterations: 6
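The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies three estimates from the summary output; the step sizes (10,000 votes, 10 minutes) are chosen only for illustration.

```r
# Estimates copied from the summary output above (log-odds scale).
coefs <- c(uvotes = 1.984e-05, length = 2.366e-02, comedy = -3.749e-01)

# Odds ratio for a one-unit change in each predictor.
odds_ratio <- exp(coefs)

# One vote or one minute is a tiny step, so also show larger, illustrative steps.
step <- c(uvotes = 10000, length = 10, comedy = 1)
odds_ratio_step <- exp(coefs * step)

print(round(odds_ratio_step, 3))
```

Read this way, being a comedy multiplies the odds of a good score by roughly 0.69, ten extra minutes of length multiplies them by about 1.27, and 10,000 extra user votes by about 1.22, holding the other variables fixed. The fb_likes coefficient has a p-value of 0.998682, so it is the one predictor in this model that is not statistically significant.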
hoslem.test(mydata$ScoreCat, fitted(logit_model_all))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: mydata$ScoreCat, fitted(logit_model_all)
## X-squared = 7.2749, df = 8, p-value = 0.5073
This p-value is above 0.05, so the test finds no evidence of poor fit, which we take to mean the model fits the data adequately. We will settle on this model.
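To show what the Hosmer and Lemeshow test measures, the sketch below computes the statistic by hand in base R on simulated data (the variables `x` and `y` are hypothetical). Observations are split into 10 groups by fitted probability, and the observed number of positive outcomes in each group is compared with the number the model expects. This is a sketch of the idea, not a replacement for `hoslem.test()`.

```r
# Simulated binary outcome and one predictor; values are hypothetical.
set.seed(3)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial())
p_hat <- fitted(fit)

# Split observations into g groups by deciles of fitted probability.
g <- 10
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = g + 1))
grp <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

observed <- tapply(y, grp, sum)       # observed positives per group
expected <- tapply(p_hat, grp, sum)   # expected positives per group
n_g <- tapply(y, grp, length)         # group sizes

# Chi-squared statistic over positives and negatives in every group.
hl <- sum((observed - expected)^2 / (expected * (1 - expected / n_g)))
p_value <- pchisq(hl, df = g - 2, lower.tail = FALSE)

print(c(statistic = hl, df = g - 2, p_value = p_value))
```

With 10 groups the statistic has 8 degrees of freedom, which matches the `df = 8` in the output above. A large p-value means the observed and expected counts agree closely enough that the test cannot detect a lack of fit.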