Predict Movies

Part 1. Data. The data set comprises 651 randomly sampled movies produced and released before 2016. Random sampling matters because a non-random sample would be biased and unreliable to draw conclusions from. We should also leave out variables that are collinear with other predictors, because a parsimonious model is preferred.

We will use the numerical and categorical variables as predictors.
- Numerical variables: runtime, thtr_rel_year, thtr_rel_month, thtr_rel_day, dvd_rel_year, dvd_rel_month, dvd_rel_day, imdb_rating, imdb_num_votes, critics_score, audience_score
- Categorical variables: best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box

I avoid using data that contain NA values because they can decrease my model's accuracy.

Part 2. Research question: Do critics_score and audience_score always move in the same direction? (Do general audiences like us think like critics, who know a lot about movies? Sometimes a movie that wins an award does not make sense to us.)

Part 3. Audience score VS. Critics score

The plot shows a linear association that is visible by eye.

overall = lm(critics_score ~ audience_score, data = movies)
R_squared = summary(overall)$r.squared

R-squared is 0.496. Summary: 49.6% of the variability in critics_score is explained by the model.

R = cor(movies$audience_score,movies$critics_score)

R = 0.7. Summary: R = 0.7 shows a strong positive association between audience_score and critics_score.
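In a simple regression with a single predictor, R-squared equals the square of the correlation coefficient, which is why R of about 0.70 and R-squared of 0.496 agree with each other. A minimal sketch of that check, using simulated data as a stand-in for the movies data set (the names x and y and the simulation settings are assumptions made only for illustration):

```r
# Simulated stand-in data; NOT the movies data set.
set.seed(42)
x <- runif(100, min = 0, max = 100)          # plays the role of audience_score
y <- -4 + 0.99 * x + rnorm(100, sd = 20)     # plays the role of critics_score

fit <- lm(y ~ x)

r <- cor(x, y)                               # correlation coefficient
r_squared <- summary(fit)$r.squared          # R-squared from the fitted model

# With one predictor, the squared correlation equals R-squared.
same <- isTRUE(all.equal(r^2, r_squared))
print(same)
```

The same identity does not hold for the multiple regression models later in the report, where R-squared reflects all predictors together.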

fit <- lm(movies$critics_score ~ movies$audience_score)
summary(fit)

## 
## Call:
## lm(formula = movies$critics_score ~ movies$audience_score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.156 -12.395   3.017  14.508  53.551 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -3.99872    2.56579  -1.558     0.12    
## movies$audience_score  0.98917    0.03914  25.273   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.18 on 649 degrees of freedom
## Multiple R-squared:  0.496,  Adjusted R-squared:  0.4952 
## F-statistic: 638.7 on 1 and 649 DF,  p-value: < 2.2e-16

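intercept = -3.999, slope = 0.989. Summary: critics_score = -3.999 + 0.989 * audience_score. When the audience score is 0, the critics score is expected on average to be -3.999. For each 1-point increase in audience_score, the critics score is expected to be higher on average by 0.989 points.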

Part 4. Modeling. I will predict the audience score from the other variables.

Model selection: I will use backward elimination (a commonly used approach) with p-values to find the significant predictors, because it is easier to apply than adjusted R-squared. A variable counts as a significant predictor when its p-value < 0.05.
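The elimination steps that follow are carried out by hand, one model at a time. As a rough sketch of the same procedure written as a loop, the code below uses simulated data with numeric predictors only; the data frame toy and the names x1, x2, x3 and alpha are hypothetical. It assumes each coefficient row has the same name as its predictor, which is not true for factors such as best_pic_nom (whose row is best_pic_nomyes), so it would need adapting before use on the movies data.

```r
# Simulated stand-in data; only x1 truly affects y.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 + 3 * toy$x1 + rnorm(n)

predictors <- c("x1", "x2", "x3")
alpha <- 0.05                                # significance cutoff

repeat {
  # Refit the model with the predictors that are still in play.
  fit <- lm(reformulate(predictors, response = "y"), data = toy)
  coefs <- summary(fit)$coefficients
  p_values <- setNames(coefs[predictors, "Pr(>|t|)"], predictors)

  # Find the predictor with the highest p-value.
  worst <- names(which.max(p_values))

  # Stop when every predictor is significant, or only one is left.
  if (p_values[[worst]] < alpha || length(predictors) == 1) break

  # Otherwise drop the least significant predictor and refit.
  predictors <- setdiff(predictors, worst)
}

print(predictors)
```

Each pass through the loop corresponds to one of the manual refits in this part of the report.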

full model

full_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+thtr_rel_day+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box,data=movies)
summary(full_model)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month + 
##     thtr_rel_day + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_pic_win + best_actor_win + best_actress_win + 
##     best_dir_win + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.452  -6.283   0.276   5.650  52.427 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.561e+01  7.504e+01   1.008  0.31407    
## thtr_rel_year       -5.562e-02  3.753e-02  -1.482  0.13884    
## thtr_rel_month      -1.710e-01  1.140e-01  -1.500  0.13413    
## thtr_rel_day         1.970e-02  4.509e-02   0.437  0.66231    
## imdb_rating          1.465e+01  5.873e-01  24.942  < 2e-16 ***
## imdb_num_votes       2.854e-06  4.252e-06   0.671  0.50234    
## critics_score        6.943e-02  2.177e-02   3.190  0.00149 ** 
## best_pic_nomyes      5.063e+00  2.632e+00   1.924  0.05481 .  
## best_pic_winyes     -2.926e+00  4.627e+00  -0.632  0.52738    
## best_actor_winyes   -2.159e+00  1.156e+00  -1.867  0.06231 .  
## best_actress_winyes -2.459e+00  1.296e+00  -1.898  0.05811 .  
## best_dir_winyes     -1.960e+00  1.714e+00  -1.143  0.25328    
## top200_boxyes        1.371e+00  2.775e+00   0.494  0.62149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.03 on 638 degrees of freedom
## Multiple R-squared:  0.7585, Adjusted R-squared:  0.754 
## F-statistic:   167 on 12 and 638 DF,  p-value: < 2.2e-16

thtr_rel_day has the highest p-value (0.66231), so drop it.

modify1_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box,data=movies)
summary(modify1_model)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_pic_win + best_actor_win + best_actress_win + best_dir_win + 
##     top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.531  -6.360   0.312   5.547  52.580 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.261e+01  7.468e+01   0.972  0.33131    
## thtr_rel_year       -5.399e-02  3.732e-02  -1.447  0.14846    
## thtr_rel_month      -1.653e-01  1.132e-01  -1.461  0.14456    
## imdb_rating          1.465e+01  5.869e-01  24.956  < 2e-16 ***
## imdb_num_votes       2.915e-06  4.247e-06   0.686  0.49271    
## critics_score        6.955e-02  2.175e-02   3.198  0.00145 ** 
## best_pic_nomyes      5.029e+00  2.629e+00   1.913  0.05617 .  
## best_pic_winyes     -2.863e+00  4.622e+00  -0.619  0.53589    
## best_actor_winyes   -2.152e+00  1.156e+00  -1.862  0.06303 .  
## best_actress_winyes -2.433e+00  1.293e+00  -1.881  0.06041 .  
## best_dir_winyes     -1.977e+00  1.712e+00  -1.155  0.24865    
## top200_boxyes        1.357e+00  2.773e+00   0.489  0.62480    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 639 degrees of freedom
## Multiple R-squared:  0.7585, Adjusted R-squared:  0.7543 
## F-statistic: 182.4 on 11 and 639 DF,  p-value: < 2.2e-16

top200_box has the highest p-value (0.62480), so drop it.

modify2_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win,data=movies)
summary(modify2_model)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_pic_win + best_actor_win + best_actress_win + best_dir_win, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.511  -6.367   0.314   5.585  52.576 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.705e+01  7.408e+01   1.040  0.29869    
## thtr_rel_year       -5.620e-02  3.703e-02  -1.518  0.12954    
## thtr_rel_month      -1.620e-01  1.129e-01  -1.435  0.15177    
## imdb_rating          1.463e+01  5.857e-01  24.982  < 2e-16 ***
## imdb_num_votes       3.511e-06  4.066e-06   0.863  0.38825    
## critics_score        7.020e-02  2.170e-02   3.236  0.00128 ** 
## best_pic_nomyes      4.983e+00  2.626e+00   1.898  0.05814 .  
## best_pic_winyes     -2.874e+00  4.619e+00  -0.622  0.53404    
## best_actor_winyes   -2.139e+00  1.155e+00  -1.853  0.06435 .  
## best_actress_winyes -2.408e+00  1.292e+00  -1.864  0.06274 .  
## best_dir_winyes     -2.007e+00  1.710e+00  -1.174  0.24101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 640 degrees of freedom
## Multiple R-squared:  0.7584, Adjusted R-squared:  0.7546 
## F-statistic: 200.9 on 10 and 640 DF,  p-value: < 2.2e-16

best_pic_win has the highest p-value (0.53404), so drop it.

modify3_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_actor_win+best_actress_win+best_dir_win,data=movies)
summary(modify3_model)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actor_win + best_actress_win + best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.543  -6.364   0.351   5.595  52.563 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.524e+01  7.399e+01   1.017  0.30960    
## thtr_rel_year       -5.532e-02  3.698e-02  -1.496  0.13517    
## thtr_rel_month      -1.590e-01  1.128e-01  -1.411  0.15886    
## imdb_rating          1.464e+01  5.851e-01  25.026  < 2e-16 ***
## imdb_num_votes       3.082e-06  4.006e-06   0.769  0.44191    
## critics_score        7.010e-02  2.169e-02   3.233  0.00129 ** 
## best_pic_nomyes      4.331e+00  2.406e+00   1.800  0.07232 .  
## best_actor_winyes   -2.085e+00  1.151e+00  -1.812  0.07052 .  
## best_actress_winyes -2.449e+00  1.289e+00  -1.899  0.05797 .  
## best_dir_winyes     -2.301e+00  1.643e+00  -1.400  0.16185    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.01 on 641 degrees of freedom
## Multiple R-squared:  0.7582, Adjusted R-squared:  0.7548 
## F-statistic: 223.4 on 9 and 641 DF,  p-value: < 2.2e-16

imdb_num_votes has the highest p-value (0.44191), so drop it.

modify4_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win+best_dir_win,data=movies)
summary(modify4_model)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month + 
##     imdb_rating + critics_score + best_pic_nom + best_actor_win + 
##     best_actress_win + best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.672  -6.483   0.440   5.559  52.648 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         63.03380   72.24659   0.872  0.38327    
## thtr_rel_year       -0.04947    0.03618  -1.367  0.17201    
## thtr_rel_month      -0.15622    0.11266  -1.387  0.16604    
## imdb_rating         14.75172    0.56760  25.990  < 2e-16 ***
## critics_score        0.06879    0.02161   3.183  0.00153 ** 
## best_pic_nomyes      4.80182    2.32661   2.064  0.03943 *  
## best_actor_winyes   -2.07129    1.15017  -1.801  0.07219 .  
## best_actress_winyes -2.39815    1.28713  -1.863  0.06289 .  
## best_dir_winyes     -2.11570    1.62467  -1.302  0.19330    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.01 on 642 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.755 
## F-statistic: 251.4 on 8 and 642 DF,  p-value: < 2.2e-16

best_dir_win has the highest p-value (0.19330), so drop it.

modify5_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win,data=movies)
summary(modify5_model)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month + 
##     imdb_rating + critics_score + best_pic_nom + best_actor_win + 
##     best_actress_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.636  -6.410   0.380   5.554  52.604 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         54.50778   71.98823   0.757  0.44922    
## thtr_rel_year       -0.04511    0.03604  -1.252  0.21116    
## thtr_rel_month      -0.16350    0.11258  -1.452  0.14691    
## imdb_rating         14.72547    0.56755  25.946  < 2e-16 ***
## critics_score        0.06770    0.02161   3.133  0.00181 ** 
## best_pic_nomyes      4.50965    2.31702   1.946  0.05205 .  
## best_actor_winyes   -2.16900    1.14834  -1.889  0.05937 .  
## best_actress_winyes -2.46630    1.28676  -1.917  0.05572 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 643 degrees of freedom
## Multiple R-squared:  0.7574, Adjusted R-squared:  0.7547 
## F-statistic: 286.7 on 7 and 643 DF,  p-value: < 2.2e-16

thtr_rel_year has the highest p-value (0.21116), so drop it.

modify6_model = lm(audience_score ~ thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win,data=movies)
summary(modify6_model)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_month + imdb_rating + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.063  -6.509   0.448   5.746  52.318 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -35.51919    2.93175 -12.115  < 2e-16 ***
## thtr_rel_month       -0.16467    0.11263  -1.462  0.14421    
## imdb_rating          14.68663    0.56695  25.905  < 2e-16 ***
## critics_score         0.07009    0.02153   3.255  0.00119 ** 
## best_pic_nomyes       4.58263    2.31730   1.978  0.04840 *  
## best_actor_winyes    -2.09565    1.14735  -1.827  0.06824 .  
## best_actress_winyes  -2.42987    1.28700  -1.888  0.05947 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 644 degrees of freedom
## Multiple R-squared:  0.7568, Adjusted R-squared:  0.7545 
## F-statistic:   334 on 6 and 644 DF,  p-value: < 2.2e-16

thtr_rel_month has the highest p-value (0.14421), so drop it.

modify7_model = lm(audience_score ~ imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win,data=movies)
summary(modify7_model)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + best_pic_nom + 
##     best_actor_win + best_actress_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.149  -6.636   0.533   5.685  51.847 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -36.37538    2.87521 -12.651  < 2e-16 ***
## imdb_rating          14.64080    0.56658  25.841  < 2e-16 ***
## critics_score         0.07144    0.02153   3.318 0.000957 ***
## best_pic_nomyes       4.09572    2.29527   1.784 0.074826 .  
## best_actor_winyes    -2.20186    1.14606  -1.921 0.055141 .  
## best_actress_winyes  -2.45445    1.28802  -1.906 0.057148 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.03 on 645 degrees of freedom
## Multiple R-squared:  0.756,  Adjusted R-squared:  0.7541 
## F-statistic: 399.6 on 5 and 645 DF,  p-value: < 2.2e-16

best_pic_nom has the highest p-value (0.074826), so drop it.

modify8_model = lm(audience_score ~ imdb_rating+critics_score+best_actor_win+best_actress_win,data=movies)
summary(modify8_model)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + best_actor_win + 
##     best_actress_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.120  -6.682   0.475   5.575  52.182 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -37.05607    2.85460 -12.981  < 2e-16 ***
## imdb_rating          14.73736    0.56494  26.087  < 2e-16 ***
## critics_score         0.07331    0.02154   3.403 0.000708 ***
## best_actor_winyes    -1.92549    1.13746  -1.693 0.090977 .  
## best_actress_winyes  -2.04649    1.26971  -1.612 0.107500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.05 on 646 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7532 
## F-statistic:   497 on 4 and 646 DF,  p-value: < 2.2e-16

best_actress_win has the highest p-value (0.107500), so drop it.

modify9_model = lm(audience_score ~ imdb_rating+critics_score+best_actor_win,data=movies)
summary(modify9_model)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + best_actor_win, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.951  -6.679   0.552   5.663  52.265 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -37.03252    2.85809 -12.957  < 2e-16 ***
## imdb_rating        14.70745    0.56533  26.016  < 2e-16 ***
## critics_score       0.07294    0.02157   3.382 0.000763 ***
## best_actor_winyes  -2.16754    1.12890  -1.920 0.055290 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.06 on 647 degrees of freedom
## Multiple R-squared:  0.7538, Adjusted R-squared:  0.7526 
## F-statistic: 660.2 on 3 and 647 DF,  p-value: < 2.2e-16

best_actor_win has the highest p-value (0.055290), so drop it.

modify10_model = lm(audience_score ~ imdb_rating+critics_score,data=movies)
summary(modify10_model)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.668  -6.758   0.723   5.513  52.438 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -37.03195    2.86401 -12.930  < 2e-16 ***
## imdb_rating    14.65760    0.56590  25.901  < 2e-16 ***
## critics_score   0.07318    0.02161   3.386 0.000753 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.08 on 648 degrees of freedom
## Multiple R-squared:  0.7524, Adjusted R-squared:  0.7516 
## F-statistic: 984.4 on 2 and 648 DF,  p-value: < 2.2e-16

All remaining variables have p-values < 0.05, so this is the final model.
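Adjusted R-squared, the alternative selection criterion, can be recomputed from the multiple R-squared values reported in the summaries as a cross-check. A small sketch, where adjusted_r2 is a hypothetical helper written for this example (not a base R function) and the inputs are the rounded values printed for the full and final models:

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors k.
adjusted_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 651                                       # movies in the sample

full_adj  <- adjusted_r2(0.7585, n, k = 12)    # full model: 12 predictors
final_adj <- adjusted_r2(0.7524, n, k = 2)     # final model: 2 predictors

print(round(c(full = full_adj, final = final_adj), 4))
```

The full model's adjusted R-squared (0.754 in its summary) is slightly higher than the final model's (0.7516), so the p-value approach gives up a little explained variability in exchange for a much simpler model.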

audience_score = -37.03195 + 14.65760 * imdb_rating + 0.07318 * critics_score

Intercept: when critics_score and imdb_rating are both 0, the audience score is expected on average to be -37.03195.

Slope (imdb_rating): All else held constant, for each 1-unit increase in imdb_rating, the model predicts audience_score to be higher on average by 14.65760 points.

Slope (critics_score): All else held constant, for each 1-unit increase in critics_score, the model predicts audience_score to be higher on average by 0.07318 points.

Diagnose

plot(modify10_model$residuals ~ movies$imdb_rating)

plot(modify10_model$residuals ~ movies$critics_score)

In both scatter plots the residuals are randomly scattered around 0, so each predictor has a linear relationship with audience_score.

hist(modify10_model$residuals)

The histogram shows the residuals centered around 0, so the nearly normal residuals condition is treated as satisfied.

Residuals vs. predicted values

plot(modify10_model$residuals ~ modify10_model$fitted)

The residuals form a band of constant width around 0 (no fan shape), so the constant variability condition is satisfied.

plot(modify10_model$residuals)

The residuals show no pattern along the x axis (the order of the observations), so they appear independent and the model can be used.
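The checks above can be gathered into a single panel of plots. The sketch below runs on simulated data so that it does not depend on the movies data set being loaded; the data frame toy, the names x1 and x2, and the coefficients used to simulate y are all hypothetical. It also adds a normal Q-Q plot, which is a common companion to the histogram when judging whether residuals are nearly normal.

```r
# Simulated stand-in data; NOT the movies data set.
set.seed(7)
toy <- data.frame(x1 = runif(150, min = 1, max = 10),
                  x2 = runif(150, min = 0, max = 100))
toy$y <- -37 + 14.7 * toy$x1 + 0.07 * toy$x2 + rnorm(150, sd = 10)

fit <- lm(y ~ x1 + x2, data = toy)
res <- resid(fit)

par(mfrow = c(2, 2))                           # 2 x 2 grid of plots

# Constant variability: look for a band of even width around 0.
plot(res ~ fitted(fit), main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Nearly normal residuals: histogram and normal Q-Q plot.
hist(res, main = "Histogram of residuals")
qqnorm(res)
qqline(res)

# Independence: residuals in data order should show no pattern.
plot(res, main = "Residuals in data order")
abline(h = 0, lty = 2)
```

Applied to the final model, the same calls would take modify10_model in place of fit.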

Part 5. Prediction

Chosen movie: The Sessions (060): imdb_rating = 7.2, critics_score = 93, audience_score = 80

Use our final model to predict: audience_score = -37.03195 + 14.65760 * imdb_rating + 0.07318 * critics_score

audience_score = -37.03195 + (14.65760 * 7.2) + (0.07318 * 93)

The answer is 75.30851 and the real score is 80. Wow, it is almost correct, which is pretty good :)
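The hand calculation can be reproduced in R from the reported coefficients. This is a sketch that hard-codes the rounded estimates printed in the final model's summary instead of using the fitted object; the names coefs, new_movie, predicted and actual are illustrative only.

```r
# Rounded coefficient estimates copied from the final model's summary.
coefs <- c(intercept = -37.03195, imdb_rating = 14.65760, critics_score = 0.07318)

# Values for The Sessions, as listed in the report.
new_movie <- c(imdb_rating = 7.2, critics_score = 93)
actual <- 80

# Prediction: intercept plus each slope times its predictor value.
predicted <- coefs[["intercept"]] + sum(coefs[names(new_movie)] * new_movie)

# Residual: how far the observed score is from the prediction.
residual <- actual - predicted

print(round(c(predicted = predicted, residual = residual), 5))
```

The miss of about 4.7 points is less than half of the model's residual standard error of 10.08. With the fitted model available, predict(modify10_model, newdata = data.frame(imdb_rating = 7.2, critics_score = 93), interval = "prediction") would also return a prediction interval around the estimate.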

Part 6. Conclusion. Not every variable belongs in the model; a predictor should be significantly related to the response you want to predict. Having many variables does not make a model good, and neither does having few. Do not keep variables that fail to raise adjusted R-squared (perhaps because of collinearity, or because they have no relationship with the response), since parsimony favors leaving them out. The data should also be kept free of bias.

Only imdb_rating and critics_score are significant predictors of audience_score.

Shortcoming: the model does not explain all of the variability in the data (R-squared is 0.7524).

Peeraya Pongpanyaporn

November 8, 2016