Part 1. Data
The data set comprises 651 randomly sampled movies produced and released before 2016. We rely on random sampling because a non-random sample would be biased and unreliable. When choosing predictors, we should also leave out variables that are collinear with others, in keeping with parsimony.
We will use the following numerical and categorical variables for prediction.
- Numerical variables: runtime, thtr_rel_year, thtr_rel_month, thtr_rel_day, dvd_rel_year, dvd_rel_month, dvd_rel_day, imdb_rating, imdb_num_votes, critics_score, audience_score
- Categorical variables: best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box
I avoid variables that contain NA values because they can decrease the model's accuracy.
Part 2. Research question
Do critics_score and audience_score always move in the same direction? In other words, do general audiences like us rate movies the same way as critics, who know much more about film? Sometimes the movies that win awards do not make sense to us.
Part 3. Audience score VS. Critics score
The scatter plot shows a linear association that is visible by eye.
overall = lm(critics_score ~ audience_score, data = movies)
R_squared = summary(overall)$r.squared
R-squared is 0.496: 49.6% of the variability in critics_score is explained by the model.
R = cor(movies$audience_score,movies$critics_score)
R = 0.7: this shows a strong positive linear association between audience_score and critics_score.
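As a quick consistency check, in a simple linear regression with one predictor R-squared is the square of the correlation coefficient. The short sketch below recovers R from the reported R-squared; the variable names are only for illustration, and the positive root is taken because the association is positive.

```r
# R-squared reported for the simple regression of critics_score on audience_score
r_squared <- 0.496

# With a single predictor, |R| = sqrt(R-squared); the sign follows the direction
# of the association, which is positive here
r <- sqrt(r_squared)
print(round(r, 3))
```

The result rounds to the R = 0.7 quoted above.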
fit <- lm(movies$critics_score ~ movies$audience_score)
summary(fit)
##
## Call:
## lm(formula = movies$critics_score ~ movies$audience_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.156 -12.395 3.017 14.508 53.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.99872 2.56579 -1.558 0.12
## movies$audience_score 0.98917 0.03914 25.273 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.18 on 649 degrees of freedom
## Multiple R-squared: 0.496, Adjusted R-squared: 0.4952
## F-statistic: 638.7 on 1 and 649 DF, p-value: < 2.2e-16
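intercept = -3.999, slope = 0.989
critics_score = -3.999 + 0.989 * audience_score
Intercept: when the audience score is 0, critics are expected on average to give a score of -3.999. A negative score is not possible, so the intercept only serves to position the line.
Slope: for each 1 point increase in audience_score, critics_score is expected to be higher on average by 0.989 points.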
Part 4. Modeling
I will predict audience_score from the other variables.
The candidate predictors are the numerical and categorical variables listed in Part 1, with audience_score as the response. I avoid the variables that contain NA values, namely runtime, dvd_rel_year, dvd_rel_month and dvd_rel_day, because they can decrease the model's accuracy.
Model selection
I will use backwards elimination (a commonly used method) based on p-values to find the significant predictors, because it is easier to apply than selection by adjusted R-squared. A variable counts as a significant predictor when its p-value is below 0.05.
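The elimination procedure described here can also be automated. The following is a minimal sketch, not the code used in this report: the function name backward_by_pvalue and the simulated data frame sim (with columns x1, x2, x3, y) are hypothetical and exist only so the example runs without the movies data. It relies on drop1() with test = "F", which for a one-degree-of-freedom term gives the same p-value as the t-test in summary().

```r
# Backward elimination by p-value (illustrative sketch; names are hypothetical)
backward_by_pvalue <- function(formula, data, alpha = 0.05) {
  repeat {
    model  <- lm(formula, data = data)
    labels <- attr(terms(model), "term.labels")
    if (length(labels) == 0) break                 # no predictors left to drop
    tests <- drop1(model, test = "F")              # one F-test per term
    pvals <- tests[["Pr(>F)"]][match(labels, rownames(tests))]
    if (max(pvals) < alpha) break                  # every remaining term is significant
    worst <- labels[which.max(pvals)]              # term with the highest p-value
    formula <- update(formula, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Simulated data: only x1 is truly related to y
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 + rnorm(n)

final <- backward_by_pvalue(y ~ x1 + x2 + x3, data = sim)
print(formula(final))
```

Called with the full-model formula and data = movies, the same function would be expected to retrace the manual steps in this section, since each step drops the term with the highest p-value until all are below 0.05.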
full model
full_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+thtr_rel_day+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box,data=movies)
summary(full_model)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month +
## thtr_rel_day + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
## best_dir_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.452 -6.283 0.276 5.650 52.427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.561e+01 7.504e+01 1.008 0.31407
## thtr_rel_year -5.562e-02 3.753e-02 -1.482 0.13884
## thtr_rel_month -1.710e-01 1.140e-01 -1.500 0.13413
## thtr_rel_day 1.970e-02 4.509e-02 0.437 0.66231
## imdb_rating 1.465e+01 5.873e-01 24.942 < 2e-16 ***
## imdb_num_votes 2.854e-06 4.252e-06 0.671 0.50234
## critics_score 6.943e-02 2.177e-02 3.190 0.00149 **
## best_pic_nomyes 5.063e+00 2.632e+00 1.924 0.05481 .
## best_pic_winyes -2.926e+00 4.627e+00 -0.632 0.52738
## best_actor_winyes -2.159e+00 1.156e+00 -1.867 0.06231 .
## best_actress_winyes -2.459e+00 1.296e+00 -1.898 0.05811 .
## best_dir_winyes -1.960e+00 1.714e+00 -1.143 0.25328
## top200_boxyes 1.371e+00 2.775e+00 0.494 0.62149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.03 on 638 degrees of freedom
## Multiple R-squared: 0.7585, Adjusted R-squared: 0.754
## F-statistic: 167 on 12 and 638 DF, p-value: < 2.2e-16
thtr_rel_day has the highest p-value (0.66231), so drop it.
modify1_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win+top200_box,data=movies)
summary(modify1_model)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win +
## top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.531 -6.360 0.312 5.547 52.580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.261e+01 7.468e+01 0.972 0.33131
## thtr_rel_year -5.399e-02 3.732e-02 -1.447 0.14846
## thtr_rel_month -1.653e-01 1.132e-01 -1.461 0.14456
## imdb_rating 1.465e+01 5.869e-01 24.956 < 2e-16 ***
## imdb_num_votes 2.915e-06 4.247e-06 0.686 0.49271
## critics_score 6.955e-02 2.175e-02 3.198 0.00145 **
## best_pic_nomyes 5.029e+00 2.629e+00 1.913 0.05617 .
## best_pic_winyes -2.863e+00 4.622e+00 -0.619 0.53589
## best_actor_winyes -2.152e+00 1.156e+00 -1.862 0.06303 .
## best_actress_winyes -2.433e+00 1.293e+00 -1.881 0.06041 .
## best_dir_winyes -1.977e+00 1.712e+00 -1.155 0.24865
## top200_boxyes 1.357e+00 2.773e+00 0.489 0.62480
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 639 degrees of freedom
## Multiple R-squared: 0.7585, Adjusted R-squared: 0.7543
## F-statistic: 182.4 on 11 and 639 DF, p-value: < 2.2e-16
top200_box has the highest p-value (0.62480), so drop it.
modify2_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+best_dir_win,data=movies)
summary(modify2_model)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.511 -6.367 0.314 5.585 52.576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.705e+01 7.408e+01 1.040 0.29869
## thtr_rel_year -5.620e-02 3.703e-02 -1.518 0.12954
## thtr_rel_month -1.620e-01 1.129e-01 -1.435 0.15177
## imdb_rating 1.463e+01 5.857e-01 24.982 < 2e-16 ***
## imdb_num_votes 3.511e-06 4.066e-06 0.863 0.38825
## critics_score 7.020e-02 2.170e-02 3.236 0.00128 **
## best_pic_nomyes 4.983e+00 2.626e+00 1.898 0.05814 .
## best_pic_winyes -2.874e+00 4.619e+00 -0.622 0.53404
## best_actor_winyes -2.139e+00 1.155e+00 -1.853 0.06435 .
## best_actress_winyes -2.408e+00 1.292e+00 -1.864 0.06274 .
## best_dir_winyes -2.007e+00 1.710e+00 -1.174 0.24101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 640 degrees of freedom
## Multiple R-squared: 0.7584, Adjusted R-squared: 0.7546
## F-statistic: 200.9 on 10 and 640 DF, p-value: < 2.2e-16
best_pic_win has the highest p-value (0.53404), so drop it.
modify3_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+imdb_num_votes+critics_score+best_pic_nom+best_actor_win+best_actress_win+best_dir_win,data=movies)
summary(modify3_model)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.543 -6.364 0.351 5.595 52.563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.524e+01 7.399e+01 1.017 0.30960
## thtr_rel_year -5.532e-02 3.698e-02 -1.496 0.13517
## thtr_rel_month -1.590e-01 1.128e-01 -1.411 0.15886
## imdb_rating 1.464e+01 5.851e-01 25.026 < 2e-16 ***
## imdb_num_votes 3.082e-06 4.006e-06 0.769 0.44191
## critics_score 7.010e-02 2.169e-02 3.233 0.00129 **
## best_pic_nomyes 4.331e+00 2.406e+00 1.800 0.07232 .
## best_actor_winyes -2.085e+00 1.151e+00 -1.812 0.07052 .
## best_actress_winyes -2.449e+00 1.289e+00 -1.899 0.05797 .
## best_dir_winyes -2.301e+00 1.643e+00 -1.400 0.16185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.01 on 641 degrees of freedom
## Multiple R-squared: 0.7582, Adjusted R-squared: 0.7548
## F-statistic: 223.4 on 9 and 641 DF, p-value: < 2.2e-16
imdb_num_votes has the highest p-value (0.44191), so drop it.
modify4_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win+best_dir_win,data=movies)
summary(modify4_model)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month +
## imdb_rating + critics_score + best_pic_nom + best_actor_win +
## best_actress_win + best_dir_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.672 -6.483 0.440 5.559 52.648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.03380 72.24659 0.872 0.38327
## thtr_rel_year -0.04947 0.03618 -1.367 0.17201
## thtr_rel_month -0.15622 0.11266 -1.387 0.16604
## imdb_rating 14.75172 0.56760 25.990 < 2e-16 ***
## critics_score 0.06879 0.02161 3.183 0.00153 **
## best_pic_nomyes 4.80182 2.32661 2.064 0.03943 *
## best_actor_winyes -2.07129 1.15017 -1.801 0.07219 .
## best_actress_winyes -2.39815 1.28713 -1.863 0.06289 .
## best_dir_winyes -2.11570 1.62467 -1.302 0.19330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.01 on 642 degrees of freedom
## Multiple R-squared: 0.758, Adjusted R-squared: 0.755
## F-statistic: 251.4 on 8 and 642 DF, p-value: < 2.2e-16
best_dir_win has the highest p-value (0.19330), so drop it.
modify5_model = lm(audience_score ~ thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win,data=movies)
summary(modify5_model)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month +
## imdb_rating + critics_score + best_pic_nom + best_actor_win +
## best_actress_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.636 -6.410 0.380 5.554 52.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.50778 71.98823 0.757 0.44922
## thtr_rel_year -0.04511 0.03604 -1.252 0.21116
## thtr_rel_month -0.16350 0.11258 -1.452 0.14691
## imdb_rating 14.72547 0.56755 25.946 < 2e-16 ***
## critics_score 0.06770 0.02161 3.133 0.00181 **
## best_pic_nomyes 4.50965 2.31702 1.946 0.05205 .
## best_actor_winyes -2.16900 1.14834 -1.889 0.05937 .
## best_actress_winyes -2.46630 1.28676 -1.917 0.05572 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 643 degrees of freedom
## Multiple R-squared: 0.7574, Adjusted R-squared: 0.7547
## F-statistic: 286.7 on 7 and 643 DF, p-value: < 2.2e-16
thtr_rel_year has the highest p-value (0.21116), so drop it.
modify6_model = lm(audience_score ~ thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win,data=movies)
summary(modify6_model)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_month + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.063 -6.509 0.448 5.746 52.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.51919 2.93175 -12.115 < 2e-16 ***
## thtr_rel_month -0.16467 0.11263 -1.462 0.14421
## imdb_rating 14.68663 0.56695 25.905 < 2e-16 ***
## critics_score 0.07009 0.02153 3.255 0.00119 **
## best_pic_nomyes 4.58263 2.31730 1.978 0.04840 *
## best_actor_winyes -2.09565 1.14735 -1.827 0.06824 .
## best_actress_winyes -2.42987 1.28700 -1.888 0.05947 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 644 degrees of freedom
## Multiple R-squared: 0.7568, Adjusted R-squared: 0.7545
## F-statistic: 334 on 6 and 644 DF, p-value: < 2.2e-16
thtr_rel_month has the highest p-value (0.14421), so drop it.
modify7_model = lm(audience_score ~ imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win,data=movies)
summary(modify7_model)
##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + best_pic_nom +
## best_actor_win + best_actress_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.149 -6.636 0.533 5.685 51.847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.37538 2.87521 -12.651 < 2e-16 ***
## imdb_rating 14.64080 0.56658 25.841 < 2e-16 ***
## critics_score 0.07144 0.02153 3.318 0.000957 ***
## best_pic_nomyes 4.09572 2.29527 1.784 0.074826 .
## best_actor_winyes -2.20186 1.14606 -1.921 0.055141 .
## best_actress_winyes -2.45445 1.28802 -1.906 0.057148 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.03 on 645 degrees of freedom
## Multiple R-squared: 0.756, Adjusted R-squared: 0.7541
## F-statistic: 399.6 on 5 and 645 DF, p-value: < 2.2e-16
best_pic_nom has the highest p-value (0.074826), so drop it.
modify8_model = lm(audience_score ~ imdb_rating+critics_score+best_actor_win+best_actress_win,data=movies)
summary(modify8_model)
##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + best_actor_win +
## best_actress_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.120 -6.682 0.475 5.575 52.182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.05607 2.85460 -12.981 < 2e-16 ***
## imdb_rating 14.73736 0.56494 26.087 < 2e-16 ***
## critics_score 0.07331 0.02154 3.403 0.000708 ***
## best_actor_winyes -1.92549 1.13746 -1.693 0.090977 .
## best_actress_winyes -2.04649 1.26971 -1.612 0.107500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.05 on 646 degrees of freedom
## Multiple R-squared: 0.7548, Adjusted R-squared: 0.7532
## F-statistic: 497 on 4 and 646 DF, p-value: < 2.2e-16
best_actress_win has the highest p-value (0.107500), so drop it.
modify9_model = lm(audience_score ~ imdb_rating+critics_score+best_actor_win,data=movies)
summary(modify9_model)
##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + best_actor_win,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.951 -6.679 0.552 5.663 52.265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.03252 2.85809 -12.957 < 2e-16 ***
## imdb_rating 14.70745 0.56533 26.016 < 2e-16 ***
## critics_score 0.07294 0.02157 3.382 0.000763 ***
## best_actor_winyes -2.16754 1.12890 -1.920 0.055290 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.06 on 647 degrees of freedom
## Multiple R-squared: 0.7538, Adjusted R-squared: 0.7526
## F-statistic: 660.2 on 3 and 647 DF, p-value: < 2.2e-16
best_actor_win has the highest p-value (0.055290), which is still above 0.05, so drop it.
modify10_model = lm(audience_score ~ imdb_rating+critics_score,data=movies)
summary(modify10_model)
##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.668 -6.758 0.723 5.513 52.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.03195 2.86401 -12.930 < 2e-16 ***
## imdb_rating 14.65760 0.56590 25.901 < 2e-16 ***
## critics_score 0.07318 0.02161 3.386 0.000753 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.08 on 648 degrees of freedom
## Multiple R-squared: 0.7524, Adjusted R-squared: 0.7516
## F-statistic: 984.4 on 2 and 648 DF, p-value: < 2.2e-16
All remaining variables have p-values below 0.05, so this is the final model.
audience_score = -37.03195 + 14.65760 * imdb_rating + 0.07318 * critics_score
Intercept: when critics_score and imdb_rating are both 0, the model predicts an audience_score of -37.03195 on average.
Slope (imdb_rating): all else held constant, for each 1 unit increase in imdb_rating the model predicts audience_score to be higher on average by 14.65760 points.
Slope (critics_score): all else held constant, for each 1 unit increase in critics_score the model predicts audience_score to be higher on average by 0.07318 points.
Diagnostics
plot(modify10_model$residuals ~ movies$imdb_rating)
plot(modify10_model$residuals ~ movies$critics_score)
In both scatter plots the residuals are randomly scattered around 0, so the condition of a linear relationship between each numerical predictor and the response is satisfied.
hist(modify10_model$residuals)
The histogram shows the residuals distributed around 0, so the nearly normal residuals condition is treated as satisfied.
residual vs. predicted
plot(modify10_model$residuals ~ modify10_model$fitted)
The residuals form a band of constant width around 0 (no fan shape), so the constant variability condition is satisfied.
plot(modify10_model$residuals)
The residuals plotted in their order in the data set show no pattern, so they appear to be independent. The conditions are met, so this model can be used.
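The checks above can be collected into one panel of plots for any fitted lm object. The sketch below uses simulated data so that it runs without the movies data; sim_fit, x1 and x2 are hypothetical names, and the normal Q-Q plot is an addition that was not part of the original analysis (it is a common companion to the histogram).

```r
# Simulated stand-in for a two-predictor model (names are hypothetical)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
sim_fit <- lm(y ~ x1 + x2)
res <- resid(sim_fit)

par(mfrow = c(2, 2))
# 1. Linearity: residuals vs. a numerical predictor, scattered around 0
plot(x1, res, main = "Residuals vs. x1")
abline(h = 0, lty = 2)
# 2. Nearly normal residuals: points close to the reference line
qqnorm(res)
qqline(res)
# 3. Constant variability: band of constant width around 0
plot(fitted(sim_fit), res, main = "Residuals vs. fitted")
abline(h = 0, lty = 2)
# 4. Independence: no pattern in the order of the observations
plot(res, main = "Residuals in data order")
abline(h = 0, lty = 2)
```

The dashed line at 0 in each scatter plot makes it easier to judge whether the residuals are balanced above and below it.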
Part 5. Prediction
Chosen movie: The Sessions (060): imdb_rating = 7.2, critics_score = 93, audience_score = 80
Use our final model to predict: audience_score = -37.03195 + 14.65760 * imdb_rating + 0.07318 * critics_score
audience_score = -37.03195 + (14.65760 * 7.2) + (0.07318 * 93)
The predicted score is 75.30851 and the real score is 80. The prediction is off by about 4.7 points, so it is almost correct, which is pretty good :)
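The arithmetic for this prediction can be checked directly in R. This is a small sketch that copies the coefficients from the final model output by hand; the variable names are only for illustration.

```r
# Coefficients copied from the final model (modify10_model) output
b0        <- -37.03195   # intercept
b_imdb    <- 14.65760    # slope for imdb_rating
b_critics <- 0.07318     # slope for critics_score

# Values for The Sessions
imdb_rating   <- 7.2
critics_score <- 93
actual        <- 80

predicted <- b0 + b_imdb * imdb_rating + b_critics * critics_score
residual  <- actual - predicted   # observed minus predicted

print(predicted)
print(residual)
```

With the fitted object available, predict(modify10_model, newdata = data.frame(imdb_rating = 7.2, critics_score = 93), interval = "prediction") would give the same point prediction together with a prediction interval, which says more about the uncertainty than a single number does.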
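Part 6. Conclusion
Not every variable is useful in a model: a predictor should have a significant relationship with the response you want to predict. Having many variables does not by itself make a model good, and neither does having few. Do not keep variables that fail to raise adjusted R-squared (perhaps because of collinearity, or because they have no relationship with the response); dropping them keeps the model parsimonious. The data should also be kept free of bias.
Only imdb_rating and critics_score are significant predictors of audience_score.
Shortcoming: the model does not explain all of the variability in the data. Its R-squared is 0.7524, so about 25% of the variability in audience_score remains unexplained.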