1A About this Project

This project attempts to predict the score of movies on IMDB, primarily from the number of user reviews and the movie length. The dataset used includes 3891 movies and 30 variables. This could help people in the movie industry understand which factors matter most when deciding where to spend in pursuit of success. The dataset was cleaned prior to reading it into this notebook; the original file included additional rows that contained dirty data and random characters.

1B Resources: Packages Used in this Notebook


This project uses R and Tableau. The specific R packages we used are tidyverse, corrplot, and ResourceSelection.

Install required packages

# Check whether the package is installed (require() also loads it if present)
if(!require("tidyverse")){
  
  # If the package is not on the system, install it
  install.packages("tidyverse", dependencies = TRUE)
  
  # Load the newly installed package
  library("tidyverse")
}
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'ggplot2' was built under R version 3.4.4
## -- Conflicts --------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Check whether the package is installed (require() also loads it if present)
if(!require("corrplot")){
  
  # If the package is not on the system, install it
  install.packages("corrplot", dependencies = TRUE)
  
  # Load the newly installed package
  library("corrplot")
}
## Loading required package: corrplot
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
# Check whether the package is installed (require() also loads it if present)
if(!require("ResourceSelection")){
  
  # If the package is not on the system, install it
  install.packages("ResourceSelection", dependencies = TRUE)
  
  # Load the newly installed package
  library("ResourceSelection")
}
## Loading required package: ResourceSelection
## Warning: package 'ResourceSelection' was built under R version 3.4.4
## ResourceSelection 0.3-2   2017-02-28
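
The three install-and-load blocks above repeat the same pattern, so they could be folded into one small helper. The following is only a sketch: `ensure_package` is a hypothetical name introduced here, not something the notebook defines, and it uses `requireNamespace()` for the check instead of `require()`.

```r
# Hypothetical helper (not part of the original notebook):
# install a package only if it is missing, then attach it.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}
```

It could then be called once per package, for example `for (pkg in c("tidyverse", "corrplot", "ResourceSelection")) ensure_package(pkg)`.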

1C Data Description: The dataset includes movies from the years 1920 until 2016. This implies that the dataset was created sometime after 2016. While the set is named for Rotten Tomatoes, IMDB seems to be the actual source, as the scores in the dataset are out of 10, as on IMDB, while on Rotten Tomatoes the score is a percentage. The data is sourced from multiple countries.

1D Business Cases: We believe that score will be most affected by the number of people who review the movie. If confirmed, this would help people in the industry because they would know to encourage more people to review their film, thus improving their score. Additionally, we could potentially determine an optimal length to improve the score. We think that a logistic regression may be appropriate because extremes in length would likely yield lower scores.

2A Industry: This project uses a cleaned dataset of movies; the contents relate most closely to the entertainment industry. Movie success is often measured by both gross and scores on various websites. Gross provides a more objective measure, as it is purely quantitative and unaffected by perception in the way a score is. However, most variables in this dataset are measured after the gross would be determined. For this reason, we will investigate how to predict score.

2B Explore the Dataset:

Read the CSV file into RStudio and explore the dataset (extract variables)

originaldata = read.csv(file = "data/rottentomatoes.csv")
head(originaldata)
##                                                     ï..title
## 1                                                 Avatar 
## 2               Pirates of the Caribbean: At World's End 
## 3                                                Spectre 
## 4                                  The Dark Knight Rises 
## 5 Star Wars: Episode VII - The Force Awakens             
## 6                                            John Carter 
##                            genres          director          actor1
## 1 Action|Adventure|Fantasy|Sci-Fi     James Cameron     CCH Pounder
## 2        Action|Adventure|Fantasy    Gore Verbinski     Johnny Depp
## 3       Action|Adventure|Thriller        Sam Mendes Christoph Waltz
## 4                 Action|Thriller Christopher Nolan       Tom Hardy
## 5                     Documentary       Doug Walker     Doug Walker
## 6         Action|Adventure|Sci-Fi    Andrew Stanton    Daryl Sabara
##             actor2               actor3 length    budget director_fb_likes
## 1 Joel David Moore            Wes Studi    178 237000000                 0
## 2    Orlando Bloom       Jack Davenport    169 300000000               563
## 3     Rory Kinnear     Stephanie Sigman    148 245000000                 0
## 4   Christian Bale Joseph Gordon-Levitt    164 250000000             22000
## 5       Rob Walker                          NA        NA               131
## 6  Samantha Morton         Polly Walker    132 263700000               475
##   actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1            1000             936             855             4834
## 2           40000            5000            1000            48350
## 3           11000             393             161            11700
## 4           27000           23000           23000           106759
## 5             131              12              NA              143
## 6             640             632             530             1873
##   fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1    33000            723          3054      886204   7.9         1.78
## 2        0            302          1238      471220   7.1         2.35
## 3    85000            602           994      275868   6.8         2.35
## 4   164000            813          2701     1144337   8.5         2.35
## 5        0             NA            NA           8   7.1           NA
## 6    24000            462           738      212204   6.6         2.35
##       gross year
## 1 760505847 2009
## 2 309404152 2007
## 3 200074175 2015
## 4 448130642 2012
## 5        NA   NA
## 6  73058679 2012
summary(originaldata)
##                          ï..title                     genres    
##  Ben-Hur                  :   3   Drama               : 236  
##  Halloween                :   3   Comedy              : 209  
##  Home                     :   3   Comedy|Drama        : 191  
##  King Kong                :   3   Comedy|Drama|Romance: 187  
##  Pan                      :   3   Comedy|Romance      : 158  
##  The Fast and the Furious :   3   Drama|Romance       : 152  
##  (Other)                     :5025   (Other)             :3910  
##              director                  actor1                 actor2    
##                  : 104   Robert De Niro   :  49   Morgan Freeman :  20  
##  Steven Spielberg:  26   Johnny Depp      :  41   Charlize Theron:  15  
##  Woody Allen     :  22   Nicolas Cage     :  33   Brad Pitt      :  14  
##  Clint Eastwood  :  20   J.K. Simmons     :  31                  :  13  
##  Martin Scorsese :  20   Bruce Willis     :  30   James Franco   :  11  
##  Ridley Scott    :  17   Denzel Washington:  30   Meryl Streep   :  11  
##  (Other)         :4834   (Other)          :4829   (Other)        :4959  
##             actor3         length          budget         
##                :  23   Min.   :  7.0   Min.   :2.180e+02  
##  Ben Mendelsohn:   8   1st Qu.: 93.0   1st Qu.:6.000e+06  
##  John Heard    :   8   Median :103.0   Median :2.000e+07  
##  Steve Coogan  :   8   Mean   :107.2   Mean   :3.975e+07  
##  Anne Hathaway :   7   3rd Qu.:118.0   3rd Qu.:4.500e+07  
##  Jon Gries     :   7   Max.   :511.0   Max.   :1.222e+10  
##  (Other)       :4982   NA's   :15      NA's   :492        
##  director_fb_likes actor1_fb_likes  actor2_fb_likes  actor3_fb_likes  
##  Min.   :    0.0   Min.   :     0   Min.   :     0   Min.   :    0.0  
##  1st Qu.:    7.0   1st Qu.:   614   1st Qu.:   281   1st Qu.:  133.0  
##  Median :   49.0   Median :   988   Median :   595   Median :  371.5  
##  Mean   :  686.5   Mean   :  6560   Mean   :  1652   Mean   :  645.0  
##  3rd Qu.:  194.5   3rd Qu.: 11000   3rd Qu.:   918   3rd Qu.:  636.0  
##  Max.   :23000.0   Max.   :640000   Max.   :137000   Max.   :23000.0  
##  NA's   :104       NA's   :7        NA's   :13       NA's   :23       
##  total_cast_likes    fb_likes      critic_reviews  users_reviews   
##  Min.   :     0   Min.   :     0   Min.   :  1.0   Min.   :   1.0  
##  1st Qu.:  1411   1st Qu.:     0   1st Qu.: 50.0   1st Qu.:  65.0  
##  Median :  3090   Median :   166   Median :110.0   Median : 156.0  
##  Mean   :  9699   Mean   :  7526   Mean   :140.2   Mean   : 272.8  
##  3rd Qu.: 13756   3rd Qu.:  3000   3rd Qu.:195.0   3rd Qu.: 326.0  
##  Max.   :656730   Max.   :349000   Max.   :813.0   Max.   :5060.0  
##                                    NA's   :50      NA's   :21      
##   users_votes          score        aspect_ratio       gross          
##  Min.   :      5   Min.   :1.600   Min.   : 1.18   Min.   :      162  
##  1st Qu.:   8594   1st Qu.:5.800   1st Qu.: 1.85   1st Qu.:  5340988  
##  Median :  34359   Median :6.600   Median : 2.35   Median : 25517500  
##  Mean   :  83668   Mean   :6.442   Mean   : 2.22   Mean   : 48468408  
##  3rd Qu.:  96309   3rd Qu.:7.200   3rd Qu.: 2.35   3rd Qu.: 62309438  
##  Max.   :1689764   Max.   :9.500   Max.   :16.00   Max.   :760505847  
##                                    NA's   :329     NA's   :884        
##       year     
##  Min.   :1916  
##  1st Qu.:1999  
##  Median :2005  
##  Mean   :2002  
##  3rd Qu.:2011  
##  Max.   :2016  
##  NA's   :108


2C Paragraph describing original dataset with descriptive stats

The original dataset has NA values in most numeric columns. There are special characters in the title; additionally, the genres are grouped together in a single field. In actor2, a blank value is the fourth most frequent entry. The movie length is on average 107.2 minutes, with a minimum of 7 minutes and a maximum of 511 minutes. The budget has 492 NA values and the gross has 884 NA values. They range from $218 to about $12.2 billion and from $162 to about $760.5 million, respectively. The score ranges from 1.6 to 9.5 with a mean of 6.4.
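
The `ï..title` column name in the output above is the usual symptom of a UTF-8 byte-order mark (BOM) at the start of the file being read as part of the first header. The sketch below shows one way to avoid this at read time, assuming the file really does begin with a BOM (not verified here). It builds a tiny stand-in file with made-up contents instead of using the real dataset.

```r
# Build a small temporary CSV that starts with a UTF-8 BOM (bytes EF BB BF).
# The file contents are made up purely for illustration.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("title,score\nAvatar,7.9\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" tells R to drop the BOM while reading,
# so the first column name comes through without stray characters.
bom_safe <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(bom_safe)
```

The same `fileEncoding` argument could be passed to the `read.csv()` calls in this notebook, in which case later references to the first column would use its plain name.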


Patterns paragraph with descriptive stats


3A - 3C Data Preparation: Cleaning and preparing the data for analysis

mydata = read.csv(file = "data/cleaned_movies.csv")

The data was cleaned in Excel prior to reading it into R. The following changes were made:

Missing values

* 224 rows included neither budget nor gross data; these were removed.
* 660 rows did not include gross data; these were removed.
* 268 rows did not include budget data; these were removed.
* Following this, 3891 records remained. This is a sufficiently large dataset.
* While imputation could have filled in the missing values, we chose not to impute because it would decrease the accuracy of our prediction.
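
The row removal above was done in Excel. A rough R equivalent is sketched below on a made-up four-row data frame (`toy` and its values are illustrative, not taken from the dataset); it counts each missingness pattern and then keeps only rows where both budget and gross are present.

```r
# Toy stand-in for the original data; all values are made up.
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c(1e6, NA, 5e6, NA),
  gross  = c(2e6, 3e6, NA, NA)
)

# Count rows by missingness pattern, mirroring the three bullets above
missing_both   <- sum(is.na(toy$budget) & is.na(toy$gross))
missing_gross  <- sum(!is.na(toy$budget) & is.na(toy$gross))
missing_budget <- sum(is.na(toy$budget) & !is.na(toy$gross))

# Keep only rows that have both a budget and a gross value
cleaned <- toy[complete.cases(toy[, c("budget", "gross")]), ]
nrow(cleaned)
```

On the real data, the three counts would be expected to match the 224, 660, and 268 rows listed above if the same rule was applied.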

Separating genres into multiple columns

For genre, records with multiple values in the genres column were broken up into several columns. We created a column for each genre, filled with 1 if the movie belonged to that genre and 0 if it did not. This was done in Excel with COUNTIF. We did this because we wanted to capture movies that fall into multiple categories, since there is no way to determine a movie's main genre. An “Other” category was used for less common genres that had under 500 occurrences. We used numbers rather than text to enable further analysis.
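
The 0/1 genre columns were built in Excel; the sketch below shows how a similar encoding might look in R. The `common` vector is a shortened, hypothetical list of genres, and flagging "Other" whenever a movie has any genre outside that list is an assumption about how the rule was applied.

```r
# Pipe-separated genre strings in the same format as the raw data
genres <- c("Action|Adventure|Fantasy|Sci-Fi", "Comedy|Drama", "Documentary")

# Hypothetical short list of "common" genres, for illustration only
common <- c("Action", "Adventure", "Drama", "Comedy")

# Split each string on the literal "|" character
genre_list <- strsplit(genres, "|", fixed = TRUE)

# One 0/1 column per common genre
indicators <- vapply(
  common,
  function(g) as.integer(vapply(genre_list, function(x) g %in% x, logical(1))),
  integer(length(genres))
)

# "Other" is 1 when a movie has at least one genre outside the common list
other <- as.integer(vapply(genre_list, function(x) any(!(x %in% common)), logical(1)))

encoded <- data.frame(genres = genres, indicators, Other = other)
encoded
```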

Removing extraneous characters in the title

This was done in Excel using the RIGHT and LEFT functions. A stray character had been placed at the beginning and end of each title, so we simply removed the leftmost and rightmost characters.
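
The same trimming can be sketched in R with `substr()`. The `#` characters below are placeholders standing in for whatever stray characters surrounded the real titles, and the sketch assumes exactly one unwanted character at each end.

```r
# Made-up titles with one placeholder character at each end
raw_titles <- c("#Avatar#", "#Spectre#")

# Keep everything from the second character to the second-to-last one
clean_titles <- substr(raw_titles, 2, nchar(raw_titles) - 1)
clean_titles
```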

Changing NA to 0 - NA often occurs when a value is blank.

mydata[is.na(mydata)] <- 0
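
One side effect worth keeping in mind is that zero-filling changes the summary statistics of numeric columns, which may explain the minimums of 0 for length and aspect_ratio in the summary of the cleaned data. The sketch below uses made-up values to compare a mean that skips NA with a mean after zero-filling.

```r
# Made-up movie lengths with one missing value
lengths_with_na <- c(100, 120, NA)

# Mean that skips the missing value
mean_ignoring_na <- mean(lengths_with_na, na.rm = TRUE)

# Same replacement as above, applied to a single vector
lengths_zero_filled <- lengths_with_na
lengths_zero_filled[is.na(lengths_zero_filled)] <- 0

# The zero now counts as a real observation
mean_zero_filled <- mean(lengths_zero_filled)
min_zero_filled  <- min(lengths_zero_filled)
```

Here the mean drops from 110 to about 73.3 and the minimum becomes 0, so results that depend on these columns are best read with that in mind.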

Variable Extraction - Pulled all variables to be safe.

title = mydata$ï..Title
director = mydata$director
actor1 = mydata$actor1
actor2 = mydata$actor2
actor3 = mydata$actor3
length = mydata$length
budget = mydata$budget
director_likes = mydata$director_fb_likes
actor1_likes = mydata$actor1_fb_likes
actor2_likes = mydata$actor2_fb_likes
actor3_likes = mydata$actor3_fb_likes
total_likes = mydata$total_cast_likes
fb_likes = mydata$fb_likes
creview = mydata$critic_reviews
ureview = mydata$users_reviews
uvotes = mydata$users_votes
rating = mydata$score
aspect_ratio = mydata$aspect_ratio
gross = mydata$gross
year = mydata$year
action = mydata$Action
adventure = mydata$Adventure
drama = mydata$Drama
comedy = mydata$Comedy
fantasy = mydata$Fantasy
crime = mydata$Crime
scifi = mydata$Sci.Fi
romance = mydata$Romance
thriller = mydata$Thriller
other = mydata$Other
scorecat = mydata$ScoreCat

4A-D Data Analysis: Descriptive Statistics, Correlations


Basic descriptive statistics of the dataset (write down any observations)

summary(mydata)
##                       Title                   director   
##  Halloween               :   3   Steven Spielberg :  25  
##  Home                    :   3   Clint Eastwood   :  19  
##  King Kong               :   3   Woody Allen      :  19  
##  Pan                     :   3   Ridley Scott     :  17  
##  The Fast and the Furious:   3   Martin Scorsese  :  16  
##  Victor Frankenstein     :   3   Steven Soderbergh:  16  
##  (Other)                 :3873   (Other)          :3779  
##                actor1                 actor2                actor3    
##  Robert De Niro   :  42   Morgan Freeman :  20                 :  10  
##  Johnny Depp      :  39   Brad Pitt      :  14   Steve Coogan  :   8  
##  J.K. Simmons     :  31   Charlize Theron:  14   Anne Hathaway :   7  
##  Nicolas Cage     :  31   James Franco   :  11   Ben Mendelsohn:   7  
##  Denzel Washington:  30   Jason Flemyng  :  10   Kirsten Dunst :   7  
##  Bruce Willis     :  29   Meryl Streep   :  10   Robert Duvall :   7  
##  (Other)          :3689   (Other)        :3812   (Other)       :3845  
##      length          budget          director_fb_likes actor1_fb_likes 
##  Min.   :  0.0   Min.   :2.180e+02   Min.   :    0.0   Min.   :     0  
##  1st Qu.: 95.0   1st Qu.:1.000e+07   1st Qu.:   10.0   1st Qu.:   721  
##  Median :106.0   Median :2.400e+07   Median :   58.0   Median :  1000  
##  Mean   :109.9   Mean   :4.521e+07   Mean   :  781.3   Mean   :  7579  
##  3rd Qu.:120.0   3rd Qu.:5.000e+07   3rd Qu.:  226.0   3rd Qu.: 12000  
##  Max.   :330.0   Max.   :1.222e+10   Max.   :23000.0   Max.   :640000  
##                                                                        
##  actor2_fb_likes  actor3_fb_likes   total_cast_likes    fb_likes     
##  Min.   :     0   Min.   :    0.0   Min.   :     0   Min.   :     0  
##  1st Qu.:   362   1st Qu.:  182.0   1st Qu.:  1818   1st Qu.:     0  
##  Median :   664   Median :  426.0   Median :  3888   Median :   209  
##  Mean   :  1968   Mean   :  751.6   Mean   : 11263   Mean   :  9138  
##  3rd Qu.:   971   3rd Qu.:  686.0   3rd Qu.: 16002   3rd Qu.: 11000  
##  Max.   :137000   Max.   :23000.0   Max.   :656730   Max.   :349000  
##                                                                      
##  critic_reviews  users_reviews     users_votes          score      
##  Min.   :  0.0   Min.   :   1.0   Min.   :      5   Min.   :1.600  
##  1st Qu.: 72.0   1st Qu.: 102.0   1st Qu.:  17314   1st Qu.:5.900  
##  Median :134.0   Median : 203.0   Median :  50415   Median :6.600  
##  Mean   :163.2   Mean   : 327.3   Mean   : 102583   Mean   :6.464  
##  3rd Qu.:221.5   3rd Qu.: 391.0   3rd Qu.: 124204   3rd Qu.:7.200  
##  Max.   :813.0   Max.   :5060.0   Max.   :1689764   Max.   :9.300  
##                                                                    
##   aspect_ratio        gross                year          Action      
##  Min.   : 0.000   Min.   :      162   Min.   :1920   Min.   :0.0000  
##  1st Qu.: 1.850   1st Qu.:  6836508   1st Qu.:1999   1st Qu.:0.0000  
##  Median : 2.350   Median : 27979400   Median :2005   Median :0.0000  
##  Mean   : 2.069   Mean   : 51054995   Mean   :2003   Mean   :0.2493  
##  3rd Qu.: 2.350   3rd Qu.: 65360661   3rd Qu.:2010   3rd Qu.:0.0000  
##  Max.   :16.000   Max.   :760505847   Max.   :2016   Max.   :1.0000  
##                                                                      
##    Adventure          Drama           Comedy          Fantasy      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.000   Median :0.0000   Median :0.0000  
##  Mean   :0.2043   Mean   :0.504   Mean   :0.3883   Mean   :0.1329  
##  3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##                                                                    
##      Crime           Sci.Fi          Romance          Thriller     
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.185   Mean   :0.1288   Mean   :0.2282   Mean   :0.2904  
##  3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                    
##      Other           ScoreCat     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000  
##  Mean   :0.4731   Mean   :0.5865  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000  
## 

The directors who appear most often in the dataset are Steven Spielberg, Clint Eastwood, Woody Allen, and Ridley Scott. The most frequent lead actors (actor1) are Robert De Niro, Johnny Depp, J.K. Simmons, Nicolas Cage, and Denzel Washington. Individual actors in actor2 and actor3 occur much less frequently. The longest movie is 5.5 hours (330 minutes), making it a clear outlier compared to the average of about 1 hour and 50 minutes and the third quartile of 2 hours. The scores range from 1.6 to 9.3, with an average of 6.46. The minimum score appears to be more of an anomaly than the maximum: it lies 4.3 points below the first quartile (5.9), while the maximum lies only 2.1 points above the third quartile (7.2). The IMDB site estimates movie budgets, but the gross is accurate. The maximum gross is about $760.5 million while the minimum is $162. Both of these values appear to be outliers. The average gross is about $51.1 million. The genre that appears most often in the dataset is Drama, while Sci-Fi, Fantasy, and Crime occur the least.
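
The outlier judgment about the scores can be checked with a simple rule of thumb. The sketch below applies Tukey's 1.5 x IQR fences to the score quartiles reported by `summary(mydata)`; this rule is an illustration added here and is not the method used in the original analysis.

```r
# Score quartiles and extremes taken from summary(mydata) above
q1 <- 5.9
q3 <- 7.2
score_min <- 1.6
score_max <- 9.3

# Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(min_is_outlier = score_min < lower_fence,
  max_is_outlier = score_max > upper_fence)
```

Under this rule both extremes fall outside the fences (roughly 3.95 and 9.15), with the minimum much further out than the maximum, which is consistent with the observation above.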

Initial correlation table (numeric data only)

initial_corr <- cor(mydata[-c(1,2,3,4,5,31)])
initial_corr
##                         length      budget director_fb_likes
## length             1.000000000  0.06922707       0.180575225
## budget             0.069227068  1.00000000       0.019234775
## director_fb_likes  0.180575225  0.01923478       1.000000000
## actor1_fb_likes    0.087749631  0.01761803       0.091673228
## actor2_fb_likes    0.132270101  0.03704083       0.118477486
## actor3_fb_likes    0.128435737  0.04133318       0.119610922
## total_cast_likes   0.124797573  0.03018868       0.121031412
## fb_likes           0.221579676  0.05469570       0.162554735
## critic_reviews     0.240152638  0.10770951       0.179298152
## users_reviews      0.356104345  0.07335186       0.220459113
## users_votes        0.343941065  0.06883371       0.301846810
## score              0.358704617  0.02913468       0.189292612
## aspect_ratio       0.177687712  0.03250297       0.051423336
## gross              0.252911180  0.10217945       0.142157513
## year              -0.128050780  0.04498995      -0.045781469
## Action             0.091474252  0.06884208       0.005296011
## Adventure          0.087504778  0.09357300       0.006318655
## Drama              0.270126707 -0.02290978       0.047209467
## Comedy            -0.316193622 -0.01213736      -0.083113871
## Fantasy           -0.068296866  0.04309397      -0.016627083
## Crime              0.024210011 -0.01567477       0.037315219
## Sci.Fi             0.018402159  0.09172345       0.004183409
## Romance            0.017698834 -0.02637097      -0.043698547
## Thriller           0.019766854 -0.00460247       0.027793163
## Other             -0.009388305  0.01681507       0.003521888
##                   actor1_fb_likes actor2_fb_likes actor3_fb_likes
## length                0.087749631     0.132270101     0.128435737
## budget                0.017618035     0.037040835     0.041333178
## director_fb_likes     0.091673228     0.118477486     0.119610922
## actor1_fb_likes       1.000000000     0.392226269     0.254175313
## actor2_fb_likes       0.392226269     1.000000000     0.554782114
## actor3_fb_likes       0.254175313     0.554782114     1.000000000
## total_cast_likes      0.945235574     0.643067712     0.490486996
## fb_likes              0.131702448     0.233097094     0.272143768
## critic_reviews        0.172684872     0.259129165     0.257320756
## users_reviews         0.127717690     0.193020501     0.210195914
## users_votes           0.183807897     0.249216879     0.271603050
## score                 0.091584473     0.101350279     0.064244607
## aspect_ratio          0.067026245     0.078310693     0.061052802
## gross                 0.149113958     0.256866044     0.303450929
## year                  0.091098996     0.115498559     0.112049142
## Action                0.045328355     0.060604685     0.048010820
## Adventure             0.037057882     0.091593304     0.117069654
## Drama                 0.010757325     0.006894064    -0.032116339
## Comedy               -0.048174661    -0.057722774    -0.026175930
## Fantasy               0.021011595     0.040312015     0.094413491
## Crime                 0.035368690     0.018638417    -0.030306467
## Sci.Fi                0.006762163     0.035761580     0.044609206
## Romance              -0.023988427    -0.006810766     0.001863907
## Thriller              0.042764664     0.012567328    -0.010924016
## Other                -0.041346760    -0.046131590    -0.047912845
##                   total_cast_likes     fb_likes critic_reviews
## length                0.1247975732  0.221579676     0.24015264
## budget                0.0301886770  0.054695696     0.10770951
## director_fb_likes     0.1210314123  0.162554735     0.17929815
## actor1_fb_likes       0.9452355740  0.131702448     0.17268487
## actor2_fb_likes       0.6430677122  0.233097094     0.25912917
## actor3_fb_likes       0.4904869955  0.272143768     0.25732076
## total_cast_likes      1.0000000000  0.206782864     0.24412489
## fb_likes              0.2067828639  1.000000000     0.70338535
## critic_reviews        0.2441248856  0.703385350     1.00000000
## users_reviews         0.1855268135  0.376884691     0.57370402
## users_votes           0.2540496589  0.522649488     0.60053337
## score                 0.1045460273  0.277349180     0.34039074
## aspect_ratio          0.0833722375  0.119114086     0.23315799
## gross                 0.2409818265  0.375455430     0.47713084
## year                  0.1203101688  0.295540826     0.39219492
## Action                0.0627454083  0.079625885     0.16059289
## Adventure             0.0707745077  0.137136567     0.17721943
## Drama                 0.0001569721  0.009561122    -0.04340515
## Comedy               -0.0535101031 -0.125448704    -0.21199518
## Fantasy               0.0409847792  0.065010075     0.08212020
## Crime                 0.0298154322 -0.035010172    -0.02715438
## Sci.Fi                0.0233153270  0.148927308     0.19880997
## Romance              -0.0228995007 -0.057542259    -0.11958310
## Thriller              0.0359886806  0.015325218     0.12179672
## Other                -0.0543509990 -0.047235272    -0.02161028
##                   users_reviews users_votes        score aspect_ratio
## length               0.35610434  0.34394106  0.358704617  0.177687712
## budget               0.07335186  0.06883371  0.029134678  0.032502970
## director_fb_likes    0.22045911  0.30184681  0.189292612  0.051423336
## actor1_fb_likes      0.12771769  0.18380790  0.091584473  0.067026245
## actor2_fb_likes      0.19302050  0.24921688  0.101350279  0.078310693
## actor3_fb_likes      0.21019591  0.27160305  0.064244607  0.061052802
## total_cast_likes     0.18552681  0.25404966  0.104546027  0.083372237
## fb_likes             0.37688469  0.52264949  0.277349180  0.119114086
## critic_reviews       0.57370402  0.60053337  0.340390741  0.233157986
## users_reviews        1.00000000  0.78249276  0.320005149  0.141583065
## users_votes          0.78249276  1.00000000  0.473209164  0.125238349
## score                0.32000515  0.47320916  1.000000000  0.035394972
## aspect_ratio         0.14158307  0.12523835  0.035394972  1.000000000
## gross                0.55249863  0.63140421  0.211525303  0.112166379
## year                 0.01186838  0.01677337 -0.130236373  0.131593206
## Action               0.18762872  0.16062681 -0.092324467  0.168687515
## Adventure            0.19492852  0.18920448 -0.003251721  0.086566178
## Drama               -0.04702432 -0.05189910  0.303009938  0.018752056
## Comedy              -0.22554832 -0.14750317 -0.211918772 -0.148697548
## Fantasy              0.08099041  0.08405966 -0.067639719  0.009015242
## Crime               -0.04265637  0.01480833  0.037276615  0.063907859
## Sci.Fi               0.20924614  0.16857491 -0.053277098  0.066266991
## Romance             -0.08275357 -0.09029231 -0.017451747 -0.049941860
## Thriller             0.08041243  0.04053820 -0.048232742  0.151629449
## Other               -0.07131177 -0.09359906 -0.002005325 -0.006982741
##                         gross         year       Action    Adventure
## length             0.25291118 -0.128050780  0.091474252  0.087504778
## budget             0.10217945  0.044989953  0.068842076  0.093573005
## director_fb_likes  0.14215751 -0.045781469  0.005296011  0.006318655
## actor1_fb_likes    0.14911396  0.091098996  0.045328355  0.037057882
## actor2_fb_likes    0.25686604  0.115498559  0.060604685  0.091593304
## actor3_fb_likes    0.30345093  0.112049142  0.048010820  0.117069654
## total_cast_likes   0.24098183  0.120310169  0.062745408  0.070774508
## fb_likes           0.37545543  0.295540826  0.079625885  0.137136567
## critic_reviews     0.47713084  0.392194922  0.160592894  0.177219426
## users_reviews      0.55249863  0.011868377  0.187628724  0.194928518
## users_votes        0.63140421  0.016773371  0.160626811  0.189204481
## score              0.21152530 -0.130236373 -0.092324467 -0.003251721
## aspect_ratio       0.11216638  0.131593206  0.168687515  0.086566178
## gross              1.00000000  0.046575094  0.215178382  0.354148651
## year               0.04657509  1.000000000  0.013127642 -0.017322577
## Action             0.21517838  0.013127642  1.000000000  0.322401037
## Adventure          0.35414865 -0.017322577  0.322401037  1.000000000
## Drama             -0.20257758  0.011063614 -0.261244014 -0.271118834
## Comedy            -0.01129171  0.018978275 -0.188551025 -0.036258684
## Fantasy            0.19949813  0.008546656  0.047457811  0.267344385
## Crime             -0.08365753 -0.005933677  0.150702528 -0.174166196
## Sci.Fi             0.17893883  0.002699712  0.296399825  0.250516291
## Romance           -0.05206301 -0.041723458 -0.191626496 -0.140386439
## Thriller          -0.02546523  0.013527309  0.300081094 -0.037741320
## Other             -0.02148249 -0.023162885 -0.158194131  0.040662970
##                           Drama      Comedy      Fantasy        Crime
## length             0.2701267074 -0.31619362 -0.068296866  0.024210011
## budget            -0.0229097795 -0.01213736  0.043093965 -0.015674768
## director_fb_likes  0.0472094675 -0.08311387 -0.016627083  0.037315219
## actor1_fb_likes    0.0107573246 -0.04817466  0.021011595  0.035368690
## actor2_fb_likes    0.0068940639 -0.05772277  0.040312015  0.018638417
## actor3_fb_likes   -0.0321163391 -0.02617593  0.094413491 -0.030306467
## total_cast_likes   0.0001569721 -0.05351010  0.040984779  0.029815432
## fb_likes           0.0095611217 -0.12544870  0.065010075 -0.035010172
## critic_reviews    -0.0434051478 -0.21199518  0.082120197 -0.027154385
## users_reviews     -0.0470243170 -0.22554832  0.080990405 -0.042656371
## users_votes       -0.0518990984 -0.14750317  0.084059659  0.014808333
## score              0.3030099379 -0.21191877 -0.067639719  0.037276615
## aspect_ratio       0.0187520565 -0.14869755  0.009015242  0.063907859
## gross             -0.2025775788 -0.01129171  0.199498129 -0.083657534
## year               0.0110636142  0.01897827  0.008546656 -0.005933677
## Action            -0.2612440140 -0.18855102  0.047457811  0.150702528
## Adventure         -0.2711188339 -0.03625868  0.267344385 -0.174166196
## Drama              1.0000000000 -0.25789085 -0.205284385  0.051797609
## Comedy            -0.2578908541  1.00000000  0.034538837 -0.076857364
## Fantasy           -0.2052843846  0.03453884  1.000000000 -0.157280379
## Crime              0.0517976086 -0.07685736 -0.157280379  1.000000000
## Sci.Fi            -0.2248278570 -0.10321038  0.016800287 -0.137736752
## Romance            0.1634629119  0.17736647 -0.045082888 -0.129809270
## Thriller          -0.0232142769 -0.38428512 -0.105317986  0.355582143
## Other             -0.0430691626 -0.15728514  0.129485498 -0.153316094
##                         Sci.Fi      Romance    Thriller        Other
## length             0.018402159  0.017698834  0.01976685 -0.009388305
## budget             0.091723446 -0.026370969 -0.00460247  0.016815069
## director_fb_likes  0.004183409 -0.043698547  0.02779316  0.003521888
## actor1_fb_likes    0.006762163 -0.023988427  0.04276466 -0.041346760
## actor2_fb_likes    0.035761580 -0.006810766  0.01256733 -0.046131590
## actor3_fb_likes    0.044609206  0.001863907 -0.01092402 -0.047912845
## total_cast_likes   0.023315327 -0.022899501  0.03598868 -0.054350999
## fb_likes           0.148927308 -0.057542259  0.01532522 -0.047235272
## critic_reviews     0.198809968 -0.119583104  0.12179672 -0.021610282
## users_reviews      0.209246135 -0.082753572  0.08041243 -0.071311774
## users_votes        0.168574908 -0.090292309  0.04053820 -0.093599063
## score             -0.053277098 -0.017451747 -0.04823274 -0.002005325
## aspect_ratio       0.066266991 -0.049941860  0.15162945 -0.006982741
## gross              0.178938832 -0.052063014 -0.02546523 -0.021482494
## year               0.002699712 -0.041723458  0.01352731 -0.023162885
## Action             0.296399825 -0.191626496  0.30008109 -0.158194131
## Adventure          0.250516291 -0.140386439 -0.03774132  0.040662970
## Drama             -0.224827857  0.163462912 -0.02321428 -0.043069163
## Comedy            -0.103210377  0.177366474 -0.38428512 -0.157285137
## Fantasy            0.016800287 -0.045082888 -0.10531799  0.129485498
## Crime             -0.137736752 -0.129809270  0.35558214 -0.153316094
## Sci.Fi             1.000000000 -0.141400074  0.12086277 -0.024658735
## Romance           -0.141400074  1.000000000 -0.21838248 -0.155952895
## Thriller           0.120862765 -0.218382479  1.00000000 -0.027953289
## Other             -0.024658735 -0.155952895 -0.02795329  1.000000000

Initial Correlation Plot (only numeric data)

corrplot(cor(mydata[-c(1,2,3,4,5,31)]))

We remove the variables whose correlation with gross is less than 0.15.
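The column indices passed to `cor()` in this notebook are written out by hand. As a minimal sketch of the same filtering step, the selection can be computed from the correlation matrix itself. The data frame `demo` and the helper `keep_by_correlation` are hypothetical and use made-up values, not the IMDB data. The sketch uses the signed correlation, as the text states; comparing `abs(r)` instead would also keep strongly negative variables.

```r
# Synthetic stand-in for the numeric columns of mydata (values are made up).
set.seed(1)
n <- 200
demo <- data.frame(gross = rnorm(n))
demo$votes  <- demo$gross + rnorm(n, sd = 0.5)      # strongly related to gross
demo$length <- 0.4 * demo$gross + rnorm(n)          # moderately related
demo$noise  <- resid(lm(rnorm(n) ~ demo$gross))     # uncorrelated by construction

# Hypothetical helper: names of the columns whose correlation with `target`
# is at least `threshold`. The target itself has correlation 1 and is kept.
keep_by_correlation <- function(df, target, threshold = 0.15) {
  r <- cor(df)[, target]
  names(r)[r >= threshold]
}

kept <- keep_by_correlation(demo, "gross")
corr_kept <- cor(demo[kept])
corr_kept
```

Selecting columns by name in this way avoids having to recount index positions whenever the dataset gains or loses a column.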

Correlation table (only numeric data)

corr <- cor(mydata[-c(1,2,3,4,5, 7, 8, 9, 12, 18, 20, 23, 24, 26, 27, 28, 29, 30,31)])
corr
##                      length actor2_fb_likes actor3_fb_likes   fb_likes
## length           1.00000000      0.13227010      0.12843574 0.22157968
## actor2_fb_likes  0.13227010      1.00000000      0.55478211 0.23309709
## actor3_fb_likes  0.12843574      0.55478211      1.00000000 0.27214377
## fb_likes         0.22157968      0.23309709      0.27214377 1.00000000
## critic_reviews   0.24015264      0.25912917      0.25732076 0.70338535
## users_reviews    0.35610434      0.19302050      0.21019591 0.37688469
## users_votes      0.34394106      0.24921688      0.27160305 0.52264949
## score            0.35870462      0.10135028      0.06424461 0.27734918
## gross            0.25291118      0.25686604      0.30345093 0.37545543
## Action           0.09147425      0.06060469      0.04801082 0.07962589
## Adventure        0.08750478      0.09159330      0.11706965 0.13713657
## Fantasy         -0.06829687      0.04031201      0.09441349 0.06501008
##                 critic_reviews users_reviews users_votes        score
## length               0.2401526    0.35610434  0.34394106  0.358704617
## actor2_fb_likes      0.2591292    0.19302050  0.24921688  0.101350279
## actor3_fb_likes      0.2573208    0.21019591  0.27160305  0.064244607
## fb_likes             0.7033853    0.37688469  0.52264949  0.277349180
## critic_reviews       1.0000000    0.57370402  0.60053337  0.340390741
## users_reviews        0.5737040    1.00000000  0.78249276  0.320005149
## users_votes          0.6005334    0.78249276  1.00000000  0.473209164
## score                0.3403907    0.32000515  0.47320916  1.000000000
## gross                0.4771308    0.55249863  0.63140421  0.211525303
## Action               0.1605929    0.18762872  0.16062681 -0.092324467
## Adventure            0.1772194    0.19492852  0.18920448 -0.003251721
## Fantasy              0.0821202    0.08099041  0.08405966 -0.067639719
##                     gross      Action    Adventure     Fantasy
## length          0.2529112  0.09147425  0.087504778 -0.06829687
## actor2_fb_likes 0.2568660  0.06060469  0.091593304  0.04031201
## actor3_fb_likes 0.3034509  0.04801082  0.117069654  0.09441349
## fb_likes        0.3754554  0.07962589  0.137136567  0.06501008
## critic_reviews  0.4771308  0.16059289  0.177219426  0.08212020
## users_reviews   0.5524986  0.18762872  0.194928518  0.08099041
## users_votes     0.6314042  0.16062681  0.189204481  0.08405966
## score           0.2115253 -0.09232447 -0.003251721 -0.06763972
## gross           1.0000000  0.21517838  0.354148651  0.19949813
## Action          0.2151784  1.00000000  0.322401037  0.04745781
## Adventure       0.3541487  0.32240104  1.000000000  0.26734439
## Fantasy         0.1994981  0.04745781  0.267344385  1.00000000
corrplot(cor(mydata[-c(1,2,3,4,5, 7, 8, 9, 12, 18, 20, 23, 24, 26, 27, 28, 29, 30,31)]))

The highest correlations with gross are critic reviews (0.477), user reviews (0.552), and user votes (0.631). This is likely because gross rises when more people view the movie, which in turn yields more reviews and votes. These variables would not help us predict gross, because the votes and reviews occur after the movie has been viewed. Due to this revelation, we feel that it would be more useful to predict the score of a movie. The dataset includes a number of variables that occur only after the viewer has seen the film. These cannot be used to predict gross, but they can be used to predict score.

Correlation table Scores

corr2 <- cor(mydata[-c(1,2,3,4,5, 7, 9, 10, 11, 12, 18, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30,31)])
corr2
##                       length director_fb_likes   fb_likes critic_reviews
## length             1.0000000        0.18057523  0.2215797      0.2401526
## director_fb_likes  0.1805752        1.00000000  0.1625547      0.1792982
## fb_likes           0.2215797        0.16255474  1.0000000      0.7033853
## critic_reviews     0.2401526        0.17929815  0.7033853      1.0000000
## users_reviews      0.3561043        0.22045911  0.3768847      0.5737040
## users_votes        0.3439411        0.30184681  0.5226495      0.6005334
## score              0.3587046        0.18929261  0.2773492      0.3403907
## gross              0.2529112        0.14215751  0.3754554      0.4771308
## Comedy            -0.3161936       -0.08311387 -0.1254487     -0.2119952
##                   users_reviews users_votes      score       gross
## length                0.3561043   0.3439411  0.3587046  0.25291118
## director_fb_likes     0.2204591   0.3018468  0.1892926  0.14215751
## fb_likes              0.3768847   0.5226495  0.2773492  0.37545543
## critic_reviews        0.5737040   0.6005334  0.3403907  0.47713084
## users_reviews         1.0000000   0.7824928  0.3200051  0.55249863
## users_votes           0.7824928   1.0000000  0.4732092  0.63140421
## score                 0.3200051   0.4732092  1.0000000  0.21152530
## gross                 0.5524986   0.6314042  0.2115253  1.00000000
## Comedy               -0.2255483  -0.1475032 -0.2119188 -0.01129171
##                        Comedy
## length            -0.31619362
## director_fb_likes -0.08311387
## fb_likes          -0.12544870
## critic_reviews    -0.21199518
## users_reviews     -0.22554832
## users_votes       -0.14750317
## score             -0.21191877
## gross             -0.01129171
## Comedy             1.00000000

The highest correlations with score are user votes (0.473) and length (0.359). We believe user votes are correlated with score because when a movie has few votes, each individual vote can move the score more; the more users who vote, the less effect a single bad rating has. We believe that length is related to score because it takes longer to tell a good story. One interesting observation is that movie genre is tied more closely to gross than to score. Another is that Comedy is the only variable in this table that is negatively correlated with score.

corrplot(cor(mydata[-c(1,2,3,4,5, 7, 9, 10, 11, 12, 18, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31)]))


Visual Analytics: Use Tableau or R to create plots


knitr::include_graphics('imgs/comedyscore.png')

The counts of each rating for comedy movies form a roughly bell-shaped curve, which suggests that the scores are approximately normally distributed.

knitr::include_graphics('imgs/LengthScoreScatter.png')

The scatter plot does not show a strong relationship, despite the correlation found in the matrix. That said, score does tend to increase as movie length increases, although the trend is evident only at extreme lengths.

knitr::include_graphics('imgs/ScoreVoteScatter.png')

User votes appear to be most correlated with score when put into a scatter plot. As the number of votes increases, the score tends to be higher. That said, there is a dense cluster of movies with few user votes across all score ranges. This could be due to the increased influence of each vote, as we previously predicted.


Predictive Analytics: Create a Predictive Model

Linear model

```r
linear_model <- lm(score ~ uvotes, data = mydata)
summary(linear_model)
```
## 
## Call:
## lm(formula = score ~ uvotes, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7699 -0.5204  0.0828  0.6495  2.3022 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.123e+00  1.804e-02   339.3   <2e-16 ***
## uvotes      3.316e-06  9.898e-08    33.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9305 on 3889 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.2237 
## F-statistic:  1122 on 1 and 3889 DF,  p-value: < 2.2e-16

The R-squared (0.2239) and adjusted R-squared (0.2237) are low for this model: user votes alone explain only about 22% of the variance in score. This is an indicator that the model is not a good fit.
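To make the R-squared figure concrete, the sketch below computes it by hand on synthetic data and checks it against the value reported by `summary()`. The data frame `demo` and its coefficients are made up for illustration and are not the IMDB data.

```r
# Synthetic data standing in for mydata (made-up values).
set.seed(42)
n <- 500
votes <- rexp(n, rate = 1 / 50000)
score <- 6 + 3e-06 * votes + rnorm(n, sd = 0.9)
demo <- data.frame(score = score, uvotes = votes)

fit <- lm(score ~ uvotes, data = demo)

# R-squared is the share of the variance in score explained by the model:
# 1 - (residual sum of squares / total sum of squares).
rss <- sum(residuals(fit)^2)
tss <- sum((demo$score - mean(demo$score))^2)
r2_manual <- 1 - rss / tss

# The same quantities as reported by summary().
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
c(manual = r2_manual, reported = r2, adjusted = adj_r2)
```

The adjusted value penalizes each additional predictor, so it is the fairer number to compare across models with different numbers of terms.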
```r
mlinear_model <- lm(score ~ uvotes + length, data = mydata)
summary(mlinear_model)
```
Adding length, the variable with the next strongest correlation with score, made the fit a bit better. It is still not great.
```r
mlinear_model2 <- lm(score ~ uvotes + length + comedy, data = mydata)
summary(mlinear_model2)
```
## 
## Call:
## lm(formula = score ~ uvotes + length + comedy, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9099 -0.5073  0.0980  0.6308  2.3538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.269e+00  8.048e-02  65.462  < 2e-16 ***
## uvotes       2.752e-06  1.020e-07  26.978  < 2e-16 ***
## length       9.015e-03  7.038e-04  12.808  < 2e-16 ***
## comedy      -2.005e-01  3.122e-02  -6.422  1.5e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8995 on 3887 degrees of freedom
## Multiple R-squared:  0.2752, Adjusted R-squared:  0.2746 
## F-statistic: 491.9 on 3 and 3887 DF,  p-value: < 2.2e-16

With comedy added, the R-squared (0.2752) and adjusted R-squared (0.2746) improved only incrementally. A linear model may not be the best approach, so we will try other models.
```r
uvotes_squared <- uvotes^2
uvotes_cubed <- uvotes^3
quadratic_model <- lm(score ~ uvotes + uvotes_squared, data = mydata)
summary(quadratic_model)
```
## 
## Call:
## lm(formula = score ~ uvotes + uvotes_squared, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7851 -0.5157  0.0698  0.6400  2.3707 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.015e+00  2.028e-02  296.59   <2e-16 ***
## uvotes          5.164e-06  1.926e-07   26.81   <2e-16 ***
## uvotes_squared -2.431e-12  2.186e-13  -11.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9161 on 3888 degrees of freedom
## Multiple R-squared:  0.2479, Adjusted R-squared:  0.2475 
## F-statistic: 640.6 on 2 and 3888 DF,  p-value: < 2.2e-16

This model (adjusted R-squared 0.2475) fits slightly better than the linear model on user votes alone (0.2237), but worse than the multiple linear model with length and comedy (0.2746). This makes sense because our graphs did not show quadratic trends. We will try a cubic model since that looks more like our graph.
```r
cubic_model <- lm(score ~ uvotes + uvotes_squared + uvotes_cubed, data = mydata)
summary(cubic_model)
```
## 
## Call:
## lm(formula = score ~ uvotes + uvotes_squared + uvotes_cubed, 
##     data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8091 -0.5119  0.0768  0.6267  2.4145 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.966e+00  2.278e-02 261.881  < 2e-16 ***
## uvotes          6.406e-06  3.287e-07  19.485  < 2e-16 ***
## uvotes_squared -6.131e-12  8.240e-13  -7.441 1.22e-13 ***
## uvotes_cubed    2.146e-18  4.608e-19   4.656 3.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9137 on 3887 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.2515 
## F-statistic: 436.6 on 3 and 3887 DF,  p-value: < 2.2e-16

This model is better than the quadratic model based on the R-squared (0.252) and adjusted R-squared (0.2515), but it is still worse than the multiple linear model with length and comedy (0.2752 and 0.2746).
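The polynomial comparison can also be written without separate squared and cubed helper variables. The sketch below is one way to do it, assuming synthetic data: `demo` and its flattening relationship between votes and score are invented for illustration. It uses `I()` inside the formula and collects the fit statistics of all three models in one table.

```r
# Synthetic data (made-up values, not the IMDB set).
set.seed(7)
n <- 400
uvotes <- rexp(n, rate = 1 / 100000)
# Assumed relationship: score rises quickly at first and then flattens.
score <- 5.5 + 0.4 * log1p(uvotes / 10000) + rnorm(n, sd = 0.8)
demo <- data.frame(score, uvotes)

# I() makes ^ mean arithmetic power inside a formula.
models <- list(
  linear    = lm(score ~ uvotes, data = demo),
  quadratic = lm(score ~ uvotes + I(uvotes^2), data = demo),
  cubic     = lm(score ~ uvotes + I(uvotes^2) + I(uvotes^3), data = demo)
)

# One row per model, so the fits can be compared side by side.
comparison <- data.frame(
  r_squared     = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared)
)
comparison
```

Because each model nests the previous one, plain R-squared can only stay the same or go up as terms are added; the adjusted column is the one to read when judging whether a higher-order term earns its place.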

Predictive Analytics: Creating a Logistic Model Using Good/Bad Score Categories


This model predicts whether a score falls into the good or bad category, where a score equal to or above the average (6.4) is good. A column holding this category was added to the dataset.
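The code that added the category column is not shown in this notebook. A minimal sketch of one way to build it follows; the 0/1 coding and the example scores are assumptions, and only the 6.4 threshold comes from the text.

```r
# Made-up scores; in the notebook these would come from mydata$score.
demo <- data.frame(score = c(4.9, 6.4, 7.8, 5.5, 8.1, 6.3))

# Threshold from the text: the average score, stated as 6.4.
threshold <- 6.4

# Assumed coding: 1 = good (score at or above the threshold), 0 = bad.
demo$scorecat <- ifelse(demo$score >= threshold, 1, 0)

table(demo$scorecat)
```

A 0/1 response like this can be passed directly to `glm()` with `family = binomial()`.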

Below we include all variables that had a strong correlation with score to determine which have the most effect on the model.

director_fb_likes fb_likes critic_reviews users_reviews users_votes score gross Comedy

logit_model_all <- glm(scorecat ~ uvotes+length+comedy+director_likes+ureview+fb_likes+gross, data = mydata, family=binomial())

summary(logit_model_all)
## 
## Call:
## glm(formula = scorecat ~ uvotes + length + comedy + director_likes + 
##     ureview + fb_likes + gross, family = binomial(), data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5963  -0.9845   0.2364   0.9781   2.9194  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.569e+00  2.549e-01 -10.076  < 2e-16 ***
## uvotes          1.984e-05  1.174e-06  16.898  < 2e-16 ***
## length          2.366e-02  2.341e-03  10.107  < 2e-16 ***
## comedy         -3.749e-01  8.015e-02  -4.678  2.9e-06 ***
## director_likes  7.246e-05  2.061e-05   3.515 0.000439 ***
## ureview        -1.707e-03  1.999e-04  -8.542  < 2e-16 ***
## fb_likes       -5.316e-09  3.218e-06  -0.002 0.998682    
## gross          -1.066e-08  1.034e-09 -10.313  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5277.1  on 3890  degrees of freedom
## Residual deviance: 4239.0  on 3883  degrees of freedom
## AIC: 4255
## 
## Number of Fisher Scoring iterations: 6
hoslem.test(mydata$ScoreCat, fitted(logit_model_all))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  mydata$ScoreCat, fitted(logit_model_all)
## X-squared = 7.2749, df = 8, p-value = 0.5073

This p-value (0.5073) is above 0.05, so we fail to reject the null hypothesis that the model fits the data; the test gives no evidence of a poor fit. We will settle on this model.
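To show what the Hosmer and Lemeshow test measures, the sketch below computes the statistic by hand in base R on synthetic data. The helper `hosmer_lemeshow` is hypothetical and is not the `hoslem.test` function called above (presumably from the ResourceSelection package). It assumes the usual ten groups formed from quantiles of the fitted probabilities, which matches the 8 degrees of freedom in the output.

```r
# Synthetic logistic data (made-up values, not the IMDB set).
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial())

# Hypothetical helper: Hosmer-Lemeshow statistic with g groups.
hosmer_lemeshow <- function(y, yhat, g = 10) {
  # Split observations into g groups by quantiles of the fitted probability.
  breaks <- quantile(yhat, probs = seq(0, 1, length.out = g + 1))
  group <- cut(yhat, breaks = breaks, include.lowest = TRUE)
  # Observed and expected counts of events and non-events in each group.
  obs1 <- tapply(y, group, sum)
  exp1 <- tapply(yhat, group, sum)
  obs0 <- tapply(1 - y, group, sum)
  exp0 <- tapply(1 - yhat, group, sum)
  stat <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
  df <- g - 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

hl <- hosmer_lemeshow(y, fitted(fit))
hl$p.value
```

A large p-value means the observed and expected counts agree within each group. It does not measure how well the model separates good from bad scores, so it is worth reading alongside the deviance and AIC reported by `summary()`.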