Setup

Load data


Part 1: Data

This is an observational study that uses a data set of 651 randomly sampled movies produced and released before 2016. Because the movies were randomly sampled, the results can be generalized to the population of such movies. Because no random assignment was used, we can only identify associations between the variables, not causal relationships.


Part 2: Research question

Q: How is the popularity of a movie (IMDb rating) associated with its genre, runtime, and other attributes?

The response (dependent) variable is therefore imdb_rating, and the explanatory (independent) variables are those linked to the genre and to audience participation in rating. This could be of interest to someone who would like to know which attributes of a movie make it popular (i.e. give it a high IMDb rating). It might be useful to anyone who is making a movie and wants it to be popular with the audience.


Part 3: Exploratory data analysis

Below is the list of variables from the dataset that are considered, either as candidate drivers of popularity or as the measure of popularity itself, categorized as numerical or categorical.

Numerical: imdb_rating (response variable), critics_score, audience_score, runtime, imdb_num_votes

Categorical: title_type, genre, mpaa_rating, critics_rating, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box

The other variables are excluded because they are not linked to the popularity of a movie, for example the movie title, studio, release year/month/day, and URLs.

First, we look at the summary statistics and distribution of the response variable (imdb_rating):

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.600   6.493   7.300   9.000

The distribution of the response variable (imdb_rating) is left skewed. The movies in this sample have a median IMDb rating of 6.6 and a mean of 6.49.

Next, we check how the numerical variables are related to each other, in particular whether there is any collinearity between them, using a pairwise plot.
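The same pairwise check can also be done numerically. The following is a minimal Python sketch, not the code behind the report: the helper names `pearson` and `collinear_pairs` are hypothetical, the values in `toy` are made up for illustration rather than taken from the movies data, and the 0.7 cut-off is the threshold quoted in this section.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.7):
    """Return the pairs of column names whose |r| exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Made-up values for illustration only; these are NOT rows from the movies data.
toy = {
    "critics_score": [10, 35, 60, 85, 95],
    "audience_score": [20, 40, 55, 80, 90],
    "runtime": [100, 90, 130, 95, 110],
}

flagged = collinear_pairs(toy)
print(flagged)
```

Any pair returned by `collinear_pairs` is a candidate for dropping one of its two members before fitting the model.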

The pairwise plot shows that audience_score and critics_score are strongly correlated. Including both of them would introduce collinearity into the model, which we wish to avoid. Additionally, as the correlation coefficient between these two variables is high (> 0.7), including both would not significantly improve the predictive power of the model compared to using just one. In this case we use only critics_score, which has the highest correlation value.

Since imdb_num_votes is not strongly correlated with the other variables, we next look at the distribution of this variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     180    4546   15116   57533   58301  893008

As the histogram above shows, imdb_num_votes is strongly right skewed; the mean (57533) is far above the median (15116). Since non-normal distributions such as this one can be problematic when it comes to making inferences about the population, it is sensible to transform the data. We do this with a base-10 log transformation.
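As a rough numerical check of what the transformation does, the quartiles printed in the summary above can be compared on the raw and log10 scales. This is a minimal Python sketch rather than the code behind the report; the quartile-based (Bowley) skewness measure and the helper name `bowley_skewness` are choices made here for illustration, since only the five-number summary is available in the text.

```python
from math import log10


def bowley_skewness(q1, median, q3):
    """Quartile-based skewness: 0 for symmetric quartiles, positive for right skew."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Quartiles of imdb_num_votes copied from the summary output above.
q1, median, q3 = 4546, 15116, 58301

raw_skew = bowley_skewness(q1, median, q3)
# log10 is monotonic, so the quartiles of the transformed variable are
# (up to interpolation) the log10 of the original quartiles.
log_skew = bowley_skewness(log10(q1), log10(median), log10(q3))

print(f"raw scale:   {raw_skew:.2f}")
print(f"log10 scale: {log_skew:.2f}")
```

A value much closer to zero on the log10 scale indicates that the middle half of the data is far more symmetric after the transformation.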

After the log transformation, the distribution of imdb_num_votes is nearly normal and therefore suitable for use in the model. Next, we look at the distribution of the variable runtime.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    39.0    92.0   103.0   105.8   115.8   267.0       1

This distribution is nearly normal, with a weak to moderate right skew. As we have a large sample size (n > 30), this skew should not affect any inferences made about the population.

Finally, we look at the distribution of the last numerical variable, critics_score.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   33.00   61.00   57.69   83.00  100.00

This distribution is treated as approximately normal, with a median score of 61, an arithmetic mean of 57.69, and a maximum score of 100. The mean lying below the median suggests a slight left skew.

For a linear regression model to be valid, the response and explanatory variables need to have a linear dependence. This can be checked using scatter plots as seen below:

## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

In this case, both of these explanatory variables appear to have a linear relationship with the response variable imdb_rating. It is noticeable, however, that the variability of the data is not constant across the range of runtime, which may pose a problem for the model and is discussed further in the modeling section.


Part 4: Modeling

For the model, backward selection with a p-value criterion is used. The p-value criterion is chosen over the adjusted R-squared criterion because it is more time efficient. As the model results will be used as a guideline rather than as strict values, a significance level of alpha = 0.1 is used. At each step the variable with the highest p-value above 0.1 is removed and the model is refitted; a categorical variable is kept as long as at least one of its levels is significant. The full model and summary statistics are given below:

##                                     Estimate  Std. Error      t value
## (Intercept)                     3.1824607384 0.313262672 10.159080618
## title_typeFeature Film         -0.3390950402 0.235746751 -1.438386906
## title_typeTV Movie             -0.6223767341 0.367022600 -1.695744985
## genreAnimation                 -0.1791475957 0.224901260 -0.796561102
## genreArt House & International  0.6990811487 0.188543917  3.707789454
## genreComedy                    -0.0741308018 0.104721448 -0.707885566
## genreDocumentary                0.8120383545 0.247514163  3.280775312
## genreDrama                      0.2984805629 0.091965427  3.245573615
## genreHorror                    -0.0702877613 0.153884207 -0.456757473
## genreMusical & Performing Arts  0.5995306599 0.214687085  2.792579068
## genreMystery & Suspense         0.2084174365 0.115706296  1.801262705
## genreOther                      0.1832643980 0.179240029  1.022452402
## genreScience Fiction & Fantasy -0.3266429124 0.223373871 -1.462314779
## runtime                         0.0045585771 0.001492515  3.054292343
## log10(imdb_num_votes)           0.3720385567 0.041731184  8.915121148
## critics_score                   0.0232982353 0.001035312 22.503598032
## best_pic_nomyes                 0.1651022581 0.163302715  1.011019679
## best_pic_winyes                -0.0950005100 0.285856993 -0.332335791
## best_actor_winyes              -0.0478286596 0.074359836 -0.643205552
## best_actress_winyes            -0.0175266640 0.082548255 -0.212320224
## best_dir_winyes                 0.0008889314 0.108300173  0.008208033
## top200_boxyes                  -0.0901000040 0.170794962 -0.527533149
##                                    Pr(>|t|)
## (Intercept)                    1.494622e-22
## title_typeFeature Film         1.508225e-01
## title_typeTV Movie             9.042983e-02
## genreAnimation                 4.260070e-01
## genreArt House & International 2.275654e-04
## genreComedy                    4.792790e-01
## genreDocumentary               1.092523e-03
## genreDrama                     1.234437e-03
## genreHorror                    6.480033e-01
## genreMusical & Performing Arts 5.388198e-03
## genreMystery & Suspense        7.214105e-02
## genreOther                     3.069605e-01
## genreScience Fiction & Fantasy 1.441552e-01
## runtime                        2.351443e-03
## log10(imdb_num_votes)          5.265369e-18
## critics_score                  1.092377e-82
## best_pic_nomyes                3.123964e-01
## best_pic_winyes                7.397466e-01
## best_actor_winyes              5.203255e-01
## best_actress_winyes            8.319261e-01
## best_dir_winyes                9.934536e-01
## top200_boxyes                  5.980097e-01
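The backward elimination procedure can be sketched as a simple loop. This is a minimal Python sketch under stated assumptions, not the code used for the report: `backward_eliminate` and `stub_fit` are hypothetical names, and `stub_fit` is only a stand-in that returns the (rounded) full-model p-values from the table above instead of refitting the regression after each removal, which a real implementation would have to do.

```python
def backward_eliminate(variables, fit_p_values, alpha=0.1):
    """Drop the least significant variable until every one has p <= alpha.

    fit_p_values(variables) must return {variable: p_value} for a model
    fitted on exactly those variables. For a categorical variable the
    smallest p-value among its levels is used, so it stays in the model
    as long as at least one level is significant.
    """
    kept, dropped = list(variables), []
    while kept:
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda v: p_values[v])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant
        kept.remove(worst)
        dropped.append(worst)
    return kept, dropped


# ASSUMPTION: rounded p-values from the full-model table above, with the
# minimum over levels used for title_type and genre. They are held fixed
# here; real backward selection refits the model at every step.
FULL_MODEL_P = {
    "title_type": 0.0904, "genre": 0.0002, "runtime": 0.0024,
    "log10(imdb_num_votes)": 5.3e-18, "critics_score": 1.1e-82,
    "best_pic_nom": 0.3124, "best_pic_win": 0.7397,
    "best_actor_win": 0.5203, "best_actress_win": 0.8319,
    "best_dir_win": 0.9935, "top200_box": 0.5980,
}


def stub_fit(variables):
    """Stand-in for a regression fit: looks up the fixed p-values."""
    return {v: FULL_MODEL_P[v] for v in variables}


kept, dropped = backward_eliminate(list(FULL_MODEL_P), stub_fit)
print("dropped:", dropped)
print("kept:   ", kept)
```

Passing the fitting step in as a function keeps the selection logic separate from whichever regression routine is actually used.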

Since best_dir_win has the highest p-value (0.99 > 0.1), it is removed from the model and model_2 is computed:

##                                    Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)                     3.182139699 0.310564195 10.2463186 6.875685e-23
## title_typeFeature Film         -0.339034098 0.235442436 -1.4399872 1.503685e-01
## title_typeTV Movie             -0.622347915 0.366713972 -1.6970935 9.017368e-02
## genreAnimation                 -0.179182498 0.224682254 -0.7974929 4.254658e-01
## genreArt House & International  0.699035941 0.188313593  3.7120843 2.237896e-04
## genreComedy                    -0.074133209 0.104637766 -0.7084747 4.789130e-01
## genreDocumentary                0.812039429 0.247317311  3.2833910 1.082515e-03
## genreDrama                      0.298464095 0.091870429  3.2487504 1.220862e-03
## genreHorror                    -0.070277121 0.153756385 -0.4570680 6.477800e-01
## genreMusical & Performing Arts  0.599532908 0.214516197  2.7948142 5.351296e-03
## genreMystery & Suspense         0.208416441 0.115614226  1.8026885 7.191568e-02
## genreOther                      0.183247683 0.179085943  1.0232388 3.065883e-01
## genreScience Fiction & Fantasy -0.326592234 0.223110971 -1.4638107 1.437452e-01
## runtime                         0.004560384 0.001475019  3.0917454 2.077856e-03
## log10(imdb_num_votes)           0.372058750 0.041625478  8.9382456 4.358822e-18
## critics_score                   0.023299097 0.001029161 22.6389186 1.872712e-83
## best_pic_nomyes                 0.165009916 0.162785184  1.0136667 3.111316e-01
## best_pic_winyes                -0.094336409 0.273949819 -0.3443565 7.306933e-01
## best_actor_winyes              -0.047800452 0.074221314 -0.6440259 5.197934e-01
## best_actress_winyes            -0.017525911 0.082482564 -0.2124802 8.318013e-01
## top200_boxyes                  -0.090149287 0.170553660 -0.5285685 5.972913e-01

In model_2, the variable best_actress_win has the highest p-value (0.83 > 0.1), so it is removed and model_3 is computed:

##                                    Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)                     3.186371654 0.309689930 10.2889095 4.683887e-23
## title_typeFeature Film         -0.339235547 0.235262038 -1.4419477 1.498140e-01
## title_typeTV Movie             -0.624300013 0.366320951 -1.7042433 8.882873e-02
## genreAnimation                 -0.181869060 0.224156149 -0.8113499 4.174712e-01
## genreArt House & International  0.697432603 0.188019703  3.7093591 2.261295e-04
## genreComedy                    -0.076273936 0.104072650 -0.7328913 4.638973e-01
## genreDocumentary                0.810945614 0.247076277  3.2821670 1.087052e-03
## genreDrama                      0.296141282 0.091148550  3.2489961 1.219726e-03
## genreHorror                    -0.070956215 0.153606626 -0.4619346 6.442877e-01
## genreMusical & Performing Arts  0.599579702 0.214353458  2.7971543 5.312916e-03
## genreMystery & Suspense         0.205852005 0.114895382  1.7916473 7.366930e-02
## genreOther                      0.181880058 0.178834561  1.0170297 3.095297e-01
## genreScience Fiction & Fantasy -0.326734977 0.222940819 -1.4655682 1.432648e-01
## runtime                         0.004525502 0.001464744  3.0896202 2.092412e-03
## log10(imdb_num_votes)           0.372017188 0.041593463  8.9441264 4.143138e-18
## critics_score                   0.023299887 0.001028374 22.6570115 1.387221e-83
## best_pic_nomyes                 0.161474133 0.161809659  0.9979264 3.186982e-01
## best_pic_winyes                -0.097319598 0.273382403 -0.3559834 7.219722e-01
## best_actor_winyes              -0.048712245 0.074040986 -0.6579092 5.108368e-01
## top200_boxyes                  -0.092154755 0.170163202 -0.5415669 5.883084e-01

In model_3, the variable best_pic_win has the highest p-value (0.72 > 0.1), so it is removed and model_4 is computed:

##                                    Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)                     3.195160082 0.308490647 10.3573969 2.529145e-23
## title_typeFeature Film         -0.339109367 0.235098919 -1.4424114 1.496825e-01
## title_typeTV Movie             -0.625400625 0.366054339 -1.7084912 8.803701e-02
## genreAnimation                 -0.182750829 0.223987307 -0.8158981 4.148663e-01
## genreArt House & International  0.696789017 0.187880866  3.7086747 2.266983e-04
## genreComedy                    -0.077793719 0.103913067 -0.7486423 4.543518e-01
## genreDocumentary                0.809946802 0.246889326  3.2806068 1.092888e-03
## genreDrama                      0.295923492 0.091083403  3.2489288 1.219911e-03
## genreHorror                    -0.071779859 0.153482883 -0.4676734 6.401797e-01
## genreMusical & Performing Arts  0.599636207 0.214205020  2.7993565 5.277013e-03
## genreMystery & Suspense         0.204939533 0.114787274  1.7853855 7.467900e-02
## genreOther                      0.185184969 0.178469794  1.0376264 2.998415e-01
## genreScience Fiction & Fantasy -0.326721266 0.222786492 -1.4665219 1.430043e-01
## runtime                         0.004480785 0.001458337  3.0725299 2.213969e-03
## log10(imdb_num_votes)           0.371142718 0.041492118  8.9448969 4.103110e-18
## critics_score                   0.023292088 0.001027429 22.6702622 1.091610e-83
## best_pic_nomyes                 0.136510387 0.145723229  0.9367785 3.492309e-01
## best_actor_winyes              -0.046938868 0.073822073 -0.6358378 5.251127e-01
## top200_boxyes                  -0.094009940 0.169965646 -0.5531114 5.803830e-01

In model_4, the variable top200_box has the highest p-value (0.58 > 0.1), so it is removed and model_5 is computed:

##                                    Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)                     3.210022363 0.307149386 10.4510135 1.086352e-23
## title_typeFeature Film         -0.338297372 0.234965208 -1.4397764 1.504258e-01
## title_typeTV Movie             -0.624969226 0.365852451 -1.7082549 8.808010e-02
## genreAnimation                 -0.176255508 0.223556400 -0.7884163 4.307489e-01
## genreArt House & International  0.701674653 0.187570034  3.7408675 2.000641e-04
## genreComedy                    -0.073115129 0.103511335 -0.7063490 4.802314e-01
## genreDocumentary                0.815305148 0.246563686  3.3066716 9.976030e-04
## genreDrama                      0.301734535 0.090425762  3.3368205 8.971151e-04
## genreHorror                    -0.066554404 0.153107707 -0.4346901 6.639358e-01
## genreMusical & Performing Arts  0.606989099 0.213674688  2.8407160 4.646202e-03
## genreMystery & Suspense         0.211323644 0.114142773  1.8513975 6.457867e-02
## genreOther                      0.187335768 0.178329424  1.0505040 2.938880e-01
## genreScience Fiction & Fantasy -0.330461279 0.222561545 -1.4848085 1.380931e-01
## runtime                         0.004425647 0.001454127  3.0435077 2.435470e-03
## log10(imdb_num_votes)           0.367817656 0.041031769  8.9642165 3.500311e-18
## critics_score                   0.023246053 0.001023490 22.7125329 5.971072e-84
## best_pic_nomyes                 0.135452000 0.145630632  0.9301065 3.526711e-01
## best_actor_winyes              -0.047615064 0.073771408 -0.6454406 5.188759e-01

In model_5, the variable best_actor_win has the highest p-value (0.52 > 0.1), so it is removed and model_6 is computed:

##                                    Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)                     3.227703410 0.305784351 10.5554892 4.205963e-24
## title_typeFeature Film         -0.339965499 0.234842697 -1.4476307 1.482154e-01
## title_typeTV Movie             -0.620494773 0.365618162 -1.6971115 9.016716e-02
## genreAnimation                 -0.179431236 0.223399229 -0.8031865 4.221685e-01
## genreArt House & International  0.705875885 0.187370655  3.7672702 1.804441e-04
## genreComedy                    -0.073734058 0.103459183 -0.7126874 4.763018e-01
## genreDocumentary                0.814495972 0.246446851  3.3049559 1.003538e-03
## genreDrama                      0.299546985 0.090320574  3.3164867 9.637117e-04
## genreHorror                    -0.063788053 0.152977161 -0.4169776 6.768361e-01
## genreMusical & Performing Arts  0.608601609 0.213561600  2.8497708 4.517398e-03
## genreMystery & Suspense         0.204778265 0.113638993  1.8020070 7.201993e-02
## genreOther                      0.185184645 0.178216095  1.0391017 2.991542e-01
## genreScience Fiction & Fantasy -0.326030835 0.222353130 -1.4662750 1.430700e-01
## runtime                         0.004250499 0.001427924  2.9766980 3.024967e-03
## log10(imdb_num_votes)           0.367107865 0.040998123  8.9542602 3.778956e-18
## critics_score                   0.023252930 0.001022963 22.7309600 4.401390e-84
## best_pic_nomyes                 0.126710913 0.144932727  0.8742740 3.823006e-01

In model_6, the variable best_pic_nom has the highest p-value (0.38 > 0.1), so it is removed and model_7 is computed:

##                                    Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)                     3.175798241 0.299909809 10.5891776 3.079371e-24
## title_typeFeature Film         -0.339600585 0.234798679 -1.4463479 1.485738e-01
## title_typeTV Movie             -0.623283318 0.365536299 -1.7051202 8.866188e-02
## genreAnimation                 -0.176400166 0.223330810 -0.7898604 4.299046e-01
## genreArt House & International  0.709114913 0.187299207  3.7860006 1.676333e-04
## genreComedy                    -0.068994381 0.103297846 -0.6679169 5.044297e-01
## genreDocumentary                0.817701934 0.246373770  3.3189488 9.553171e-04
## genreDrama                      0.304200830 0.090146820  3.3745043 7.845890e-04
## genreHorror                    -0.059308645 0.152862920 -0.3879858 6.981568e-01
## genreMusical & Performing Arts  0.605865047 0.213498972  2.8377891 4.688005e-03
## genreMystery & Suspense         0.206622044 0.113598305  1.8188832 6.940088e-02
## genreOther                      0.198177354 0.177562384  1.1160999 2.648022e-01
## genreScience Fiction & Fantasy -0.325350687 0.222310444 -1.4634971 1.438270e-01
## runtime                         0.004474797 0.001404427  3.1862091 1.512293e-03
## log10(imdb_num_votes)           0.372217088 0.040571939  9.1742495 6.311357e-19
## critics_score                   0.023371161 0.001013796 23.0531124 7.156688e-86

After this iteration, all of the remaining variables appear to be significant predictors of the IMDb rating of a movie (each categorical variable has at least one level that is significant at the 0.1 level). So model_7 is our final model.

Each coefficient in this model represents an increase or decrease in the predicted IMDb rating, holding the other variables constant. If the variable is numerical, the increase or decrease is per unit of that variable; for log10(imdb_num_votes), one unit corresponds to a tenfold increase in the number of votes. If the variable is categorical, the indicators for all levels except the movie's own level are set to 0, so only the coefficient for that level is added to the prediction. A movie at the reference level (the level that does not appear in the table) gets no adjustment.
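To make the interpretation of the coefficients concrete, the following Python sketch assembles a point prediction by hand from the model_7 coefficients (rounded to six decimals). It is an illustration under stated assumptions: the function name `predict_rating` is hypothetical, and the example movie (a 120-minute feature-film drama with 100,000 votes and a critics score of 80) is invented, not taken from the data.

```python
from math import log10

# Coefficients copied from the model_7 table above, rounded to 6 decimals.
INTERCEPT = 3.175798
TITLE_TYPE = {"Feature Film": -0.339601, "TV Movie": -0.623283}
GENRE = {
    "Animation": -0.176400, "Art House & International": 0.709115,
    "Comedy": -0.068994, "Documentary": 0.817702, "Drama": 0.304201,
    "Horror": -0.059309, "Musical & Performing Arts": 0.605865,
    "Mystery & Suspense": 0.206622, "Other": 0.198177,
    "Science Fiction & Fantasy": -0.325351,
}
RUNTIME = 0.004475
LOG10_VOTES = 0.372217
CRITICS_SCORE = 0.023371


def predict_rating(title_type, genre, runtime, imdb_num_votes, critics_score):
    """Point prediction of imdb_rating from the model_7 coefficients.

    A level that is missing from TITLE_TYPE or GENRE is treated as the
    reference level and contributes 0 to the prediction.
    """
    return (
        INTERCEPT
        + TITLE_TYPE.get(title_type, 0.0)
        + GENRE.get(genre, 0.0)
        + RUNTIME * runtime
        + LOG10_VOTES * log10(imdb_num_votes)
        + CRITICS_SCORE * critics_score
    )


# Hypothetical movie, invented for illustration only.
rating = predict_rating("Feature Film", "Drama", 120, 100_000, 80)
print(round(rating, 2))
```

Changing one input at a time shows the size of each effect: for instance, 10 extra critics_score points add about 0.23 to the predicted rating, and ten times as many votes add about 0.37.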

For the model to provide valid results, there are certain conditions that need to be met. These are:

1. Linear relationship between each (numerical) x and y
2. Nearly normal residuals with mean 0
3. Constant variability of residuals
4. Independent residuals

The first condition (linear relationship) has already been verified by the previous scatter plots of the numerical variables. The second condition, nearly normal residuals with mean 0, can be checked using a histogram and/or a Q-Q plot of the residuals:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.09058 -0.32153  0.04408  0.00000  0.40937  1.77615

From the summary statistics of the residuals, the mean can be seen to equal zero. The histogram shows a slight left skew, small enough to be disregarded. The Q-Q plot indicates an almost normal distribution, with some deviation from normality on the left-hand side, which again indicates a slight left skew. As both plots show a nearly normal distribution with mean zero, we can be satisfied that condition 2 has been met.

The variability of the residuals can be checked by plotting the residuals against the fitted values from the model.

In the residuals vs. fitted values plot, we do not see a fan shape. The variability of the residuals appears to stay constant as the fitted (predicted) values change. So the third condition, constant variability of residuals, has been met.

We can check for independent residuals (condition 4) by plotting the residuals against the order of data collection.

The plot of the residuals (y-axis) against the order of data collection (x-axis) does not show any pattern. If there were some non-independent structure, we would see the residuals increasing or decreasing with the order of collection, but we see no such pattern. So our last condition has been met as well.


Part 5: Prediction

The model can now be used to predict the IMDb rating of a movie not present in the sample data. The chosen test movie is Arrival (2016). The information for this movie was found on IMDb, Rotten Tomatoes, and Box Office Mojo.

An 80% prediction interval is used for the model prediction:

##        fit     lwr      upr
## 1 7.332814 6.48593 8.179697

From the model, we are 80% confident that the IMDb rating of the movie Arrival is between 6.5 and 8.2.
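The interval can be unpacked with a little arithmetic. The Python sketch below is illustrative only: it copies `fit`, `lwr` and `upr` from the output above, takes the actual rating of 7.9 from the conclusion of this report, and assumes a normal approximation to the t quantile (reasonable with roughly 650 movies, but an assumption nonetheless) to back out the standard error implied by the interval.

```python
from statistics import NormalDist

# Values copied from the prediction output above.
fit, lwr, upr = 7.332814, 6.48593, 8.179697
# Actual IMDb rating of Arrival as stated in the conclusion of this report.
actual = 7.9

# The interval is symmetric around the point prediction.
half_width = (upr - lwr) / 2

# ASSUMPTION: normal approximation to the t quantile.
# An 80% two-sided interval leaves 10% in each tail.
z = NormalDist().inv_cdf(0.90)
implied_se = half_width / z

print(f"half-width: {half_width:.3f}")
print(f"implied standard error of prediction: {implied_se:.2f}")
print("actual rating inside interval:", lwr <= actual <= upr)
```

The implied standard error is the typical size of the error to expect when predicting the rating of a single movie, which is why the interval is fairly wide.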


Part 6: Conclusion

The actual IMDb rating of the movie Arrival is 7.9. The model predicts a rating of 7.3, with an 80% interval of 6.5 to 8.2, so the actual rating falls inside the interval and the prediction was correct in this case. Overall, we can conclude that the explanatory variables used in this model are significant predictors of imdb_rating and can be used to predict imdb_rating to a certain degree of accuracy.

However, there are a few points to note. First, the actual confidence level of the interval may be lower than stated because of variability in the residuals. Second, the adjusted R-squared value for the model is 0.67, which means approximately 33% of the variability in the data is not explained by the model. With about two thirds of the variability explained, the model has reasonably strong predictive capability. Future research and improvements to the model could increase its predictive capability further.