Content is from Susan Li: https://susanli2016.github.io/Movie-Time/

If you are running this on your own computer, remember to load these packages (ggplot2, dplyr, Hmisc, and psych):

Also download this dataset to your working directory:

(https://www.dropbox.com/s/sy790xviuxn8psp/movie_metadata.csv?dl=1)

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb3ZpZTwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL3N5NzkweHZpdXhuOHBzcC9tb3ZpZV9tZXRhZGF0YS5jc3Y/ZGw9MVwiLCBzdHJpbmdzQXNGYWN0b3JzID0gRilcbmxpYnJhcnkoZ2dwbG90MilcbmxpYnJhcnkoZHBseXIpXG5saWJyYXJ5KEhtaXNjKVxubGlicmFyeShwc3ljaClcblxuZ2xpbXBzZShtb3ZpZSkiLCJzb2x1dGlvbiI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxuXG5nbGltcHNlKG1vdmllKSJ9
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKSIsInNhbXBsZSI6ImdncGxvdChhZXMoeCA9IG51bV9jcml0aWNfZm9yX3Jldmlld3MpLCBkYXRhID0gbW92aWUpICsgXG4gIGdlb21faGlzdG9ncmFtKGJpbnMgPSAyMCwgY29sb3IgPSAnd2hpdGUnKSArIFxuICBnZ3RpdGxlKCdIaXN0b2dyYW0gb2YgTnVtYmVyIG9mIHJldmlld3MnKVxuc3VtbWFyeShtb3ZpZSRudW1fY3JpdGljX2Zvcl9yZXZpZXdzKSIsInNvbHV0aW9uIjoiZ2dwbG90KGFlcyh4ID0gbnVtX2NyaXRpY19mb3JfcmV2aWV3cyksIGRhdGEgPSBtb3ZpZSkgKyBnZW9tX2hpc3RvZ3JhbShiaW5zID0gMjAsIGNvbG9yID0gJ3doaXRlJykgKyBnZ3RpdGxlKCdIaXN0b2dyYW0gb2YgTnVtYmVyIG9mIHJldmlld3MnKVxuc3VtbWFyeShtb3ZpZSRudW1fY3JpdGljX2Zvcl9yZXZpZXdzKSJ9

The distribution of the number of reviews is right skewed. Among these 5043 movies, the minimum number of reviews was 1 and the maximum was 813. The majority of movies received fewer than 200 reviews.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKSIsInNhbXBsZSI6ImdncGxvdChhZXMoeCA9IGltZGJfc2NvcmUpLCBkYXRhID0gbW92aWUpICsgZ2VvbV9oaXN0b2dyYW0oYmlucyA9IDIwLCBjb2xvciA9ICd3aGl0ZScpICsgZ2d0aXRsZSgnSGlzdG9ncmFtIG9mIFNjb3JlcycpXG5zdW1tYXJ5KG1vdmllJGltZGJfc2NvcmUpIiwic29sdXRpb24iOiJnZ3Bsb3QoYWVzKHggPSBpbWRiX3Njb3JlKSwgZGF0YSA9IG1vdmllKSArIGdlb21faGlzdG9ncmFtKGJpbnMgPSAyMCwgY29sb3IgPSAnd2hpdGUnKSArIGdndGl0bGUoJ0hpc3RvZ3JhbSBvZiBTY29yZXMnKVxuc3VtbWFyeShtb3ZpZSRpbWRiX3Njb3JlKSJ9

The score distribution is left skewed, with a minimum score of 1.60 and a maximum score of 9.50.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKSIsInNhbXBsZSI6ImdncGxvdChhZXMoeCA9IHRpdGxlX3llYXIpLCBkYXRhID0gbW92aWUpICsgZ2VvbV9oaXN0b2dyYW0oY29sb3I9J3doaXRlJykgK1xuICBnZ3RpdGxlKCdIaXN0b2dyYW0gb2YgVGl0bGUgWWVhcicpIiwic29sdXRpb24iOiJnZ3Bsb3QoYWVzKHggPSB0aXRsZV95ZWFyKSwgZGF0YSA9IG1vdmllKSArIGdlb21faGlzdG9ncmFtKGNvbG9yPSd3aGl0ZScpICtcbiAgZ2d0aXRsZSgnSGlzdG9ncmFtIG9mIFRpdGxlIFllYXInKSJ9

Most of the movies in the dataset were produced after 2000.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKSIsInNhbXBsZSI6ImJveHBsb3QoaW1kYl9zY29yZSB+IHRpdGxlX3llYXIsIGRhdGE9bW92aWUsIGNvbD0naW5kaWFucmVkJylcbnRpdGxlKFwiSU1EQiBzY29yZSB2cyBUaXRsZSB5ZWFyXCIpIiwic29sdXRpb24iOiJib3hwbG90KGltZGJfc2NvcmUgfiB0aXRsZV95ZWFyLCBkYXRhPW1vdmllLCBjb2w9J2luZGlhbnJlZCcpXG50aXRsZShcIklNREIgc2NvcmUgdnMgVGl0bGUgeWVhclwiKSJ9

However, the movies with the highest scores were produced in the 1950s, and a significant number of low-scoring movies have come out in recent years.

Which countries produced the most movies, and which countries have the highest scores?

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKSIsInNhbXBsZSI6ImNvdW50cnlfZ3JvdXAgPC0gZ3JvdXBfYnkobW92aWUsIGNvdW50cnkpXG5tb3ZpZV9ieV9jb3VudHJ5IDwtIHN1bW1hcmlzZShjb3VudHJ5X2dyb3VwLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIG1lYW5fc2NvcmUgPSBtZWFuKGltZGJfc2NvcmUpLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIG4gPSBuKCkpIFxuZ2dwbG90KGFlcyh4ID0gY291bnRyeSwgeSA9IG4sIGZpbGwgPSBjb3VudHJ5KSwgZGF0YSA9IG1vdmllX2J5X2NvdW50cnkpICsgZ2VvbV9iYXIoc3RhdCA9ICdpZGVudGl0eScpICsgdGhlbWUobGVnZW5kLnBvc2l0aW9uID0gXCJub25lXCIsIGF4aXMudGV4dD1lbGVtZW50X3RleHQoc2l6ZT02KSkgK1xuICBjb29yZF9mbGlwKCkgKyBnZ3RpdGxlKCdDb3VudHJpZXMgdnMgTnVtYmVyIG9mIG1vdmllJykiLCJzb2x1dGlvbiI6ImNvdW50cnlfZ3JvdXAgPC0gZ3JvdXBfYnkobW92aWUsIGNvdW50cnkpXG5tb3ZpZV9ieV9jb3VudHJ5IDwtIHN1bW1hcmlzZShjb3VudHJ5X2dyb3VwLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIG1lYW5fc2NvcmUgPSBtZWFuKGltZGJfc2NvcmUpLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIG4gPSBuKCkpIFxuZ2dwbG90KGFlcyh4ID0gY291bnRyeSwgeSA9IG4sIGZpbGwgPSBjb3VudHJ5KSwgZGF0YSA9IG1vdmllX2J5X2NvdW50cnkpICsgZ2VvbV9iYXIoc3RhdCA9ICdpZGVudGl0eScpICsgdGhlbWUobGVnZW5kLnBvc2l0aW9uID0gXCJub25lXCIsIGF4aXMudGV4dD1lbGVtZW50X3RleHQoc2l6ZT02KSkgK1xuICBjb29yZF9mbGlwKCkgKyBnZ3RpdGxlKCdDb3VudHJpZXMgdnMgTnVtYmVyIG9mIG1vdmllJykifQ==

The USA produced the largest number of movies.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxuY291bnRyeV9ncm91cCA8LSBncm91cF9ieShtb3ZpZSwgY291bnRyeSlcbm1vdmllX2J5X2NvdW50cnkgPC0gc3VtbWFyaXNlKGNvdW50cnlfZ3JvdXAsXG4gICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbWVhbl9zY29yZSA9IG1lYW4oaW1kYl9zY29yZSksXG4gICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbiA9IG4oKSkgIiwic2FtcGxlIjoiZ2dwbG90KGFlcyh4ID0gY291bnRyeSwgeSA9IG1lYW5fc2NvcmUsIGZpbGwgPSBjb3VudHJ5KSwgZGF0YSA9IG1vdmllX2J5X2NvdW50cnkpICsgZ2VvbV9iYXIoc3RhdCA9ICdpZGVudGl0eScpICsgdGhlbWUobGVnZW5kLnBvc2l0aW9uID0gXCJub25lXCIsIGF4aXMudGV4dD1lbGVtZW50X3RleHQoc2l6ZT03KSkgK1xuICBjb29yZF9mbGlwKCkgKyBnZ3RpdGxlKCdDb3VudHJpZXMgdnMgSU1EQiBTY29yZXMnKSIsInNvbHV0aW9uIjoiZ2dwbG90KGFlcyh4ID0gY291bnRyeSwgeSA9IG1lYW5fc2NvcmUsIGZpbGwgPSBjb3VudHJ5KSwgZGF0YSA9IG1vdmllX2J5X2NvdW50cnkpICsgZ2VvbV9iYXIoc3RhdCA9ICdpZGVudGl0eScpICsgdGhlbWUobGVnZW5kLnBvc2l0aW9uID0gXCJub25lXCIsIGF4aXMudGV4dD1lbGVtZW50X3RleHQoc2l6ZT03KSkgK1xuICBjb29yZF9mbGlwKCkgKyBnZ3RpdGxlKCdDb3VudHJpZXMgdnMgSU1EQiBTY29yZXMnKSJ9

But that does not mean its movies are all of good quality. Kyrgyzstan, Libya and the United Arab Emirates appear to have the highest average scores.
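An average score is easier to interpret when it is reported next to the number of movies behind it, since a country with a single well-rated movie can top such a ranking. The sketch below illustrates that with a toy data frame; the countries "A" and "B" and all scores are made up, and whether this effect explains the countries named above is not checked here.

```r
# Toy data (hypothetical): country "A" has four movies, country "B" has one.
movies <- data.frame(
  country    = c("A", "A", "A", "A", "B"),
  imdb_score = c(6, 7, 5, 6, 8.5)
)

# Mean score and movie count per country; tapply() and table() both order
# their results by the sorted country names, so the two line up.
mean_score <- tapply(movies$imdb_score, movies$country, mean)
n_movies   <- table(movies$country)

by_country <- data.frame(
  country    = names(mean_score),
  mean_score = as.vector(mean_score),
  n          = as.vector(n_movies)
)

# Sort by mean score, highest first, keeping n visible for context.
by_country[order(-by_country$mean_score), ]
```

Here "B" ranks first on a single movie, which is why the count column is worth keeping in view.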

Multiple Linear Regression - Variable Selection

Time to do some serious work. I intend to predict IMDB scores from the other variables using a multiple linear regression model. Because the regression cannot use observations with missing values, I will impute them, replacing each missing value with the column mean or median.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKSIsInNhbXBsZSI6Im1vdmllJGltZGJfc2NvcmUgPC0gYXMubnVtZXJpYyhpbXB1dGUobW92aWUkaW1kYl9zY29yZSwgbWVhbikpXG5tb3ZpZSRudW1fY3JpdGljX2Zvcl9yZXZpZXdzIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJG51bV9jcml0aWNfZm9yX3Jldmlld3MsIG1lYW4pKVxubW92aWUkZHVyYXRpb24gPC0gYXMubnVtZXJpYyhpbXB1dGUobW92aWUkZHVyYXRpb24sIG1lYW4pKVxubW92aWUkZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMgPC0gYXMubnVtZXJpYyhpbXB1dGUobW92aWUkZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIG1lYW4pKVxubW92aWUkYWN0b3JfM19mYWNlYm9va19saWtlcyA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSRhY3Rvcl8zX2ZhY2Vib29rX2xpa2VzLCBtZWFuKSlcbm1vdmllJGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMgPC0gYXMubnVtZXJpYyhpbXB1dGUobW92aWUkYWN0b3JfMV9mYWNlYm9va19saWtlcywgbWVhbikpXG5tb3ZpZSRncm9zcyA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSRncm9zcywgbWVhbikpXG5tb3ZpZSRjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGNhc3RfdG90YWxfZmFjZWJvb2tfbGlrZXMsIG1lYW4pKVxubW92aWUkZmFjZW51bWJlcl9pbl9wb3N0ZXIgPC0gYXMubnVtZXJpYyhpbXB1dGUobW92aWUkZmFjZW51bWJlcl9pbl9wb3N0ZXIsIG1lYW4pKVxubW92aWUkYnVkZ2V0IDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGJ1ZGdldCwgbWVhbikpXG5tb3ZpZSR0aXRsZV95ZWFyIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJHRpdGxlX3llYXIsIG1lZGlhbikpXG5tb3ZpZSRhY3Rvcl8yX2ZhY2Vib29rX2xpa2VzIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGFjdG9yXzJfZmFjZWJvb2tfbGlrZXMsIG1lYW4pKVxubW92aWUkYXNwZWN0X3JhdGlvIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGFzcGVjdF9yYXRpbywgbWVhbikpIiwic29sdXRpb24iOiJtb3ZpZSRpbWRiX3Njb3JlIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGltZGJfc2NvcmUsIG1lYW4pKVxubW92aWUkbnVtX2NyaXRpY19mb3JfcmV2aWV3cyA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSRudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBtZWFuKSlcbm1vdmllJGR1cmF0aW9uIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGR1cmF0aW9uLCBtZWFuKSlcbm1vdmllJGRpcmVjdG9yX2ZhY2Vib29rX2xpa2VzIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGRpcmVjdG9y
X2ZhY2Vib29rX2xpa2VzLCBtZWFuKSlcbm1vdmllJGFjdG9yXzNfZmFjZWJvb2tfbGlrZXMgPC0gYXMubnVtZXJpYyhpbXB1dGUobW92aWUkYWN0b3JfM19mYWNlYm9va19saWtlcywgbWVhbikpXG5tb3ZpZSRhY3Rvcl8xX2ZhY2Vib29rX2xpa2VzIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIG1lYW4pKVxubW92aWUkZ3Jvc3MgPC0gYXMubnVtZXJpYyhpbXB1dGUobW92aWUkZ3Jvc3MsIG1lYW4pKVxubW92aWUkY2FzdF90b3RhbF9mYWNlYm9va19saWtlcyA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSRjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBtZWFuKSlcbm1vdmllJGZhY2VudW1iZXJfaW5fcG9zdGVyIDwtIGFzLm51bWVyaWMoaW1wdXRlKG1vdmllJGZhY2VudW1iZXJfaW5fcG9zdGVyLCBtZWFuKSlcbm1vdmllJGJ1ZGdldCA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSRidWRnZXQsIG1lYW4pKVxubW92aWUkdGl0bGVfeWVhciA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSR0aXRsZV95ZWFyLCBtZWRpYW4pKVxubW92aWUkYWN0b3JfMl9mYWNlYm9va19saWtlcyA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSRhY3Rvcl8yX2ZhY2Vib29rX2xpa2VzLCBtZWFuKSlcbm1vdmllJGFzcGVjdF9yYXRpbyA8LSBhcy5udW1lcmljKGltcHV0ZShtb3ZpZSRhc3BlY3RfcmF0aW8sIG1lYW4pKSJ9
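As a rough base-R illustration of mean or median imputation, the sketch below fills the gaps in a small toy vector. The helper name `impute_with` and the values are assumptions made for this example; the post itself relies on the Hmisc package for this step.

```r
# Toy vector standing in for a numeric column such as movie$duration
# (values are made up for illustration).
duration <- c(90, 120, NA, 100, NA, 150)

# Hypothetical helper: replace NAs with a summary statistic
# computed from the observed (non-missing) values.
impute_with <- function(x, stat = mean) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

duration_mean   <- impute_with(duration, mean)    # NAs become 115
duration_median <- impute_with(duration, median)  # NAs become 110

anyNA(duration_mean)  # FALSE
```

The median is the less outlier-sensitive choice, which may be why a different statistic is sometimes picked per column.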

Now I have got rid of all the ‘NA’s, and I picked the following variables as potential predictors of the IMDB score.

  • num_critic_for_reviews
  • duration
  • director_facebook_likes
  • actor_1_facebook_likes
  • gross
  • cast_total_facebook_likes
  • facenumber_in_poster
  • budget
  • movie_facebook_likes

Select a subset of numeric variables for regression modelling.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKSIsInNhbXBsZSI6Im1vdmllX3N1YiA8LSBzdWJzZXQobW92aWUsIHNlbGVjdCA9IGMobnVtX2NyaXRpY19mb3JfcmV2aWV3cywgZHVyYXRpb24sIGRpcmVjdG9yX2ZhY2Vib29rX2xpa2VzLCBhY3Rvcl8xX2ZhY2Vib29rX2xpa2VzLCBncm9zcywgY2FzdF90b3RhbF9mYWNlYm9va19saWtlcywgZmFjZW51bWJlcl9pbl9wb3N0ZXIsIGJ1ZGdldCwgbW92aWVfZmFjZWJvb2tfbGlrZXMsIGltZGJfc2NvcmUpKVxucGFpcnMucGFuZWxzKG1vdmllX3N1YiwgY29sPSdyZWQnKSIsInNvbHV0aW9uIjoibW92aWVfc3ViIDwtIHN1YnNldChtb3ZpZSwgc2VsZWN0ID0gYyhudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGdyb3NzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgYnVkZ2V0LCBtb3ZpZV9mYWNlYm9va19saWtlcywgaW1kYl9zY29yZSkpXG5wYWlycy5wYW5lbHMobW92aWVfc3ViLCBjb2w9J3JlZCcpIn0=

Construct the model

Split the data into training and test sets.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxubW92aWVfc3ViIDwtIHN1YnNldChtb3ZpZSwgc2VsZWN0ID0gYyhudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGdyb3NzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgYnVkZ2V0LCBtb3ZpZV9mYWNlYm9va19saWtlcywgaW1kYl9zY29yZSkpIiwic2FtcGxlIjoic2V0LnNlZWQoMjAxNylcbnRyYWluX3NpemUgPC0gMC44IFxudHJhaW5faW5kZXggPC0gc2FtcGxlLmludChsZW5ndGgobW92aWVfc3ViJGltZGJfc2NvcmUpLCBsZW5ndGgobW92aWVfc3ViJGltZGJfc2NvcmUpICogdHJhaW5fc2l6ZSlcbnRyYWluX3NhbXBsZSA8LSBtb3ZpZV9zdWJbdHJhaW5faW5kZXgsXVxudGVzdF9zYW1wbGUgPC0gbW92aWVfc3ViWy10cmFpbl9pbmRleCxdIiwic29sdXRpb24iOiJzZXQuc2VlZCgyMDE3KVxudHJhaW5fc2l6ZSA8LSAwLjggXG50cmFpbl9pbmRleCA8LSBzYW1wbGUuaW50KGxlbmd0aChtb3ZpZV9zdWIkaW1kYl9zY29yZSksIGxlbmd0aChtb3ZpZV9zdWIkaW1kYl9zY29yZSkgKiB0cmFpbl9zaXplKVxudHJhaW5fc2FtcGxlIDwtIG1vdmllX3N1Ylt0cmFpbl9pbmRleCxdXG50ZXN0X3NhbXBsZSA8LSBtb3ZpZV9zdWJbLXRyYWluX2luZGV4LF0ifQ==
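A minimal sketch of an 80/20 random split is shown below. It uses the built-in `mtcars` data purely as a stand-in for the movie subset, so the row counts are those of `mtcars`, not of the movie data.

```r
# mtcars (32 rows) is a built-in dataset used here as a stand-in.
set.seed(2017)
n <- nrow(mtcars)
train_size <- 0.8

# floor() makes the integer sample size explicit (32 * 0.8 = 25.6 -> 25).
train_index <- sample.int(n, floor(n * train_size))

train_sample <- mtcars[train_index, ]
test_sample  <- mtcars[-train_index, ]  # every row not drawn for training

c(nrow(train_sample), nrow(test_sample))  # 25 and 7
```

Negative indexing keeps the two sets disjoint, and fixing the seed makes the split reproducible.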

Fit the model

I will be using stepwise variable selection by backward elimination, so I start with all candidate variables and eliminate one at a time.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxubW92aWVfc3ViIDwtIHN1YnNldChtb3ZpZSwgc2VsZWN0ID0gYyhudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGdyb3NzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgYnVkZ2V0LCBtb3ZpZV9mYWNlYm9va19saWtlcywgaW1kYl9zY29yZSkpXG5cbnNldC5zZWVkKDIwMTcpXG50cmFpbl9zaXplIDwtIDAuOCBcbnRyYWluX2luZGV4IDwtIHNhbXBsZS5pbnQobGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSwgbGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSAqIHRyYWluX3NpemUpXG50cmFpbl9zYW1wbGUgPC0gbW92aWVfc3ViW3RyYWluX2luZGV4LF1cbnRlc3Rfc2FtcGxlIDwtIG1vdmllX3N1YlstdHJhaW5faW5kZXgsXSIsInNhbXBsZSI6ImZpdCA8LSBsbShpbWRiX3Njb3JlIH4gbnVtX2NyaXRpY19mb3JfcmV2aWV3cyArIGR1cmF0aW9uICsgICAgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMgKyBhY3Rvcl8xX2ZhY2Vib29rX2xpa2VzICsgZ3Jvc3MgKyBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzICsgZmFjZW51bWJlcl9pbl9wb3N0ZXIgKyBidWRnZXQgKyBtb3ZpZV9mYWNlYm9va19saWtlcywgZGF0YT10cmFpbl9zYW1wbGUpXG5zdW1tYXJ5KGZpdCkgIiwic29sdXRpb24iOiJmaXQgPC0gbG0oaW1kYl9zY29yZSB+IG51bV9jcml0aWNfZm9yX3Jldmlld3MgKyBkdXJhdGlvbiArICAgIGRpcmVjdG9yX2ZhY2Vib29rX2xpa2VzICsgYWN0b3JfMV9mYWNlYm9va19saWtlcyArIGdyb3NzICsgY2FzdF90b3RhbF9mYWNlYm9va19saWtlcyArIGZhY2VudW1iZXJfaW5fcG9zdGVyICsgYnVkZ2V0ICsgbW92aWVfZmFjZWJvb2tfbGlrZXMsIGRhdGE9dHJhaW5fc2FtcGxlKVxuc3VtbWFyeShmaXQpICJ9

I am going to eliminate the variables that add little value, gross and budget, one at a time, refitting the model after each removal.

This is the final summary:
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxubW92aWVfc3ViIDwtIHN1YnNldChtb3ZpZSwgc2VsZWN0ID0gYyhudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGdyb3NzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgYnVkZ2V0LCBtb3ZpZV9mYWNlYm9va19saWtlcywgaW1kYl9zY29yZSkpXG5cbnNldC5zZWVkKDIwMTcpXG50cmFpbl9zaXplIDwtIDAuOCBcbnRyYWluX2luZGV4IDwtIHNhbXBsZS5pbnQobGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSwgbGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSAqIHRyYWluX3NpemUpXG50cmFpbl9zYW1wbGUgPC0gbW92aWVfc3ViW3RyYWluX2luZGV4LF1cbnRlc3Rfc2FtcGxlIDwtIG1vdmllX3N1YlstdHJhaW5faW5kZXgsXSIsInNhbXBsZSI6ImZpdCA8LSBsbShpbWRiX3Njb3JlIH4gbnVtX2NyaXRpY19mb3JfcmV2aWV3cyArIGR1cmF0aW9uICsgYnVkZ2V0ICsgICBkaXJlY3Rvcl9mYWNlYm9va19saWtlcyArIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMgKyBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzICsgZmFjZW51bWJlcl9pbl9wb3N0ZXIgKyBtb3ZpZV9mYWNlYm9va19saWtlcywgZGF0YT10cmFpbl9zYW1wbGUpXG5zdW1tYXJ5KGZpdCkgXG5maXQgPC0gbG0oaW1kYl9zY29yZSB+IG51bV9jcml0aWNfZm9yX3Jldmlld3MgKyBkdXJhdGlvbiArICAgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMgKyBhY3Rvcl8xX2ZhY2Vib29rX2xpa2VzICsgY2FzdF90b3RhbF9mYWNlYm9va19saWtlcyArIGZhY2VudW1iZXJfaW5fcG9zdGVyICsgbW92aWVfZmFjZWJvb2tfbGlrZXMsIGRhdGE9dHJhaW5fc2FtcGxlKVxuc3VtbWFyeShmaXQpICIsInNvbHV0aW9uIjoiZml0IDwtIGxtKGltZGJfc2NvcmUgfiBudW1fY3JpdGljX2Zvcl9yZXZpZXdzICsgZHVyYXRpb24gKyBidWRnZXQgKyAgIGRpcmVjdG9yX2ZhY2Vib29rX2xpa2VzICsgYWN0b3JfMV9mYWNlYm9va19saWtlcyArIGNhc3RfdG90YWxfZmFjZWJvb2tfbGlrZXMgKyBmYWNlbnVtYmVyX2luX3Bvc3RlciArIG1vdmllX2ZhY2Vib29rX2xpa2VzLCBkYXRhPXRyYWluX3NhbXBsZSlcbnN1bW1hcnkoZml0KSBcbmZpdCA8LSBsbShpbWRiX3Njb3JlIH4gbnVtX2NyaXRpY19mb3JfcmV2aWV3cyArIGR1cmF0aW9uICsgICBkaXJlY3Rvcl9mYWNlYm9va19saWtlcyArIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMgKyBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzICsgZmFjZW51
bWJlcl9pbl9wb3N0ZXIgKyBtb3ZpZV9mYWNlYm9va19saWtlcywgZGF0YT10cmFpbl9zYW1wbGUpXG5zdW1tYXJ5KGZpdCkgIn0=
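The post removes variables by hand after reading each summary. One way to automate the same idea is sketched below: refit, drop the predictor with the largest p-value, and stop once every remaining p-value is at or below a threshold. The function name `backward_eliminate`, the `alpha = 0.05` cut-off, and the `mtcars` stand-in data are all assumptions of this sketch, and it only handles numeric predictors (one coefficient per variable).

```r
# Backward elimination by p-value (illustrative sketch, numeric predictors only).
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)

    # Coefficient table without the intercept row.
    tab <- coef(summary(fit))
    tab <- tab[rownames(tab) != "(Intercept)", , drop = FALSE]
    p_values <- setNames(tab[, "Pr(>|t|)"], rownames(tab))

    # Stop when everything left is significant or only one predictor remains.
    if (max(p_values) <= alpha || length(predictors) == 1) {
      return(fit)
    }

    # Otherwise drop the least significant predictor and refit.
    worst <- names(p_values)[which.max(p_values)]
    predictors <- setdiff(predictors, worst)
  }
}

fit <- backward_eliminate(mtcars, "mpg", c("wt", "hp", "qsec", "drat"))
formula(fit)
```

Base R also ships `step()`, which performs stepwise selection by AIC rather than by p-value; that is a different criterion from the one used in this post.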

From the fitted model, I find that the model is significant, since the p-value is very small. The “cast_total_facebook_likes” and “facenumber_in_poster” variables have negative weights. The model has a multiple R-squared of 0.143, meaning that around 14.3% of the variability in IMDB scores is explained by the model.
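To make the "share of variability explained" reading concrete, R-squared can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on the built-in `mtcars` data, which is only a stand-in for the movie data.

```r
# Illustrative fit on built-in data (not the movie model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

ss_res <- sum(residuals(fit)^2)                    # unexplained variation
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation
r_squared <- 1 - ss_res / ss_tot

# Matches the value reported by summary().
all.equal(r_squared, summary(fit)$r.squared)  # TRUE

# For a least-squares fit with an intercept, the squared correlation between
# fitted and actual values on the training data is the same quantity.
all.equal(cor(fitted(fit), mtcars$mpg)^2, r_squared)  # TRUE
```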

Let me make a few plots of the model I arrived at.

Across the IMDB scores of all movies in the dataset, the fit is not perfectly linear; it shows a small degree of nonlinearity.

This chart shows how the residuals compare against their theoretical values under the model. I can see there is a bit of a problem here, because some of the observations do not fall neatly on the line.

This chart shows the distribution of residuals around the linear model in relation to the IMDB scores of all movies in my data. The higher the score, the fewer the movies; most movies fall in the low or medium score range.

This chart identifies all extreme values, but I don’t see any extreme value that has a huge impact on my model.
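The four charts described above resemble the standard diagnostic plots that base R draws for an `lm` object. Assuming that is how they were produced, a minimal sketch on the built-in `mtcars` data looks like this; the `4 / n` cut-off for Cook's distance is a common rule of thumb, not something taken from the post.

```r
# Illustrative fit on built-in data (not the movie model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Draw the four default diagnostic plots in a 2 x 2 grid:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous plotting layout

# Numeric check for influential observations via Cook's distance.
cooks <- cooks.distance(fit)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]
influential
```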

At this point, I think this model is as good as I can get. Let’s evaluate it.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxubW92aWVfc3ViIDwtIHN1YnNldChtb3ZpZSwgc2VsZWN0ID0gYyhudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGdyb3NzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgYnVkZ2V0LCBtb3ZpZV9mYWNlYm9va19saWtlcywgaW1kYl9zY29yZSkpXG5cbnNldC5zZWVkKDIwMTcpXG50cmFpbl9zaXplIDwtIDAuOCBcbnRyYWluX2luZGV4IDwtIHNhbXBsZS5pbnQobGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSwgbGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSAqIHRyYWluX3NpemUpXG50cmFpbl9zYW1wbGUgPC0gbW92aWVfc3ViW3RyYWluX2luZGV4LF1cbnRlc3Rfc2FtcGxlIDwtIG1vdmllX3N1YlstdHJhaW5faW5kZXgsXSIsInNhbXBsZSI6InRyYWluX3NhbXBsZSRwcmVkX3Njb3JlIDwtIHByZWRpY3QoZml0LCBuZXdkYXRhID0gc3Vic2V0KHRyYWluX3NhbXBsZSwgc2VsZWN0PWMoaW1kYl9zY29yZSwgbnVtX2NyaXRpY19mb3JfcmV2aWV3cywgZHVyYXRpb24sIGRpcmVjdG9yX2ZhY2Vib29rX2xpa2VzLCBhY3Rvcl8xX2ZhY2Vib29rX2xpa2VzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgbW92aWVfZmFjZWJvb2tfbGlrZXMpKSlcbnRlc3Rfc2FtcGxlJHByZWRfc2NvcmUgPC0gcHJlZGljdChmaXQsIG5ld2RhdGEgPSBzdWJzZXQodGVzdF9zYW1wbGUsIHNlbGVjdD1jKGltZGJfc2NvcmUsIG51bV9jcml0aWNfZm9yX3Jldmlld3MsIGR1cmF0aW9uLCBkaXJlY3Rvcl9mYWNlYm9va19saWtlcywgYWN0b3JfMV9mYWNlYm9va19saWtlcywgY2FzdF90b3RhbF9mYWNlYm9va19saWtlcywgZmFjZW51bWJlcl9pbl9wb3N0ZXIsIG1vdmllX2ZhY2Vib29rX2xpa2VzKSkpIiwic29sdXRpb24iOiJ0cmFpbl9zYW1wbGUkcHJlZF9zY29yZSA8LSBwcmVkaWN0KGZpdCwgbmV3ZGF0YSA9IHN1YnNldCh0cmFpbl9zYW1wbGUsIHNlbGVjdD1jKGltZGJfc2NvcmUsIG51bV9jcml0aWNfZm9yX3Jldmlld3MsIGR1cmF0aW9uLCBkaXJlY3Rvcl9mYWNlYm9va19saWtlcywgYWN0b3JfMV9mYWNlYm9va19saWtlcywgY2FzdF90b3RhbF9mYWNlYm9va19saWtlcywgZmFjZW51bWJlcl9pbl9wb3N0ZXIsIG1vdmllX2ZhY2Vib29rX2xpa2VzKSkpXG50ZXN0X3NhbXBsZSRwcmVkX3Njb3JlIDwtIHByZWRpY3QoZml0LCBuZXdkYXRhID0gc3Vic2V0KHRlc3Rfc2FtcGxlLCBzZWxlY3Q9Yyhp
bWRiX3Njb3JlLCBudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGNhc3RfdG90YWxfZmFjZWJvb2tfbGlrZXMsIGZhY2VudW1iZXJfaW5fcG9zdGVyLCBtb3ZpZV9mYWNlYm9va19saWtlcykpKSJ9

The theoretical model performance is defined here as R-squared.

Check how good the model is on the training set.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxubW92aWVfc3ViIDwtIHN1YnNldChtb3ZpZSwgc2VsZWN0ID0gYyhudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGdyb3NzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgYnVkZ2V0LCBtb3ZpZV9mYWNlYm9va19saWtlcywgaW1kYl9zY29yZSkpXG5cbnNldC5zZWVkKDIwMTcpXG50cmFpbl9zaXplIDwtIDAuOCBcbnRyYWluX2luZGV4IDwtIHNhbXBsZS5pbnQobGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSwgbGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSAqIHRyYWluX3NpemUpXG50cmFpbl9zYW1wbGUgPC0gbW92aWVfc3ViW3RyYWluX2luZGV4LF1cbnRlc3Rfc2FtcGxlIDwtIG1vdmllX3N1YlstdHJhaW5faW5kZXgsXSIsInNhbXBsZSI6InRyYWluX2NvcnIgPC0gcm91bmQoY29yKHRyYWluX3NhbXBsZSRwcmVkX3Njb3JlLCB0cmFpbl9zYW1wbGUkaW1kYl9zY29yZSksIDIpXG50cmFpbl9ybXNlIDwtIHJvdW5kKHNxcnQobWVhbigodHJhaW5fc2FtcGxlJHByZWRfc2NvcmUgLSB0cmFpbl9zYW1wbGUkaW1kYl9zY29yZSleMikpKVxudHJhaW5fbWFlIDwtIHJvdW5kKG1lYW4oYWJzKHRyYWluX3NhbXBsZSRwcmVkX3Njb3JlIC0gdHJhaW5fc2FtcGxlJGltZGJfc2NvcmUpKSlcbmModHJhaW5fY29ycl4yLCB0cmFpbl9ybXNlLCB0cmFpbl9tYWUpIiwic29sdXRpb24iOiJ0cmFpbl9jb3JyIDwtIHJvdW5kKGNvcih0cmFpbl9zYW1wbGUkcHJlZF9zY29yZSwgdHJhaW5fc2FtcGxlJGltZGJfc2NvcmUpLCAyKVxudHJhaW5fcm1zZSA8LSByb3VuZChzcXJ0KG1lYW4oKHRyYWluX3NhbXBsZSRwcmVkX3Njb3JlIC0gdHJhaW5fc2FtcGxlJGltZGJfc2NvcmUpXjIpKSlcbnRyYWluX21hZSA8LSByb3VuZChtZWFuKGFicyh0cmFpbl9zYW1wbGUkcHJlZF9zY29yZSAtIHRyYWluX3NhbXBsZSRpbWRiX3Njb3JlKSkpXG5jKHRyYWluX2NvcnJeMiwgdHJhaW5fcm1zZSwgdHJhaW5fbWFlKSJ9

The squared correlation between predicted and actual scores on the training set is 14.44%, which is very close to the theoretical R-squared for the model; this is good news. However, on the observations the model has already seen, my estimates are off by about 1 point on average.
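The error measures used here can be written as two small helpers. The sketch below uses made-up predicted and actual scores; the names `rmse` and `mae` are chosen for this example. It also shows why the number of digits kept matters: rounding to a whole number, which the "1 point" figure above suggests, hides the difference between the two measures.

```r
# Root mean squared error and mean absolute error (illustrative helpers).
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
mae  <- function(pred, actual) mean(abs(pred - actual))

# Made-up scores for illustration only.
pred   <- c(6.0, 7.0, 5.0, 8.0)
actual <- c(7.0, 7.0, 4.0, 6.0)

rmse(pred, actual)            # sqrt(1.5), about 1.22
mae(pred, actual)             # 1

round(rmse(pred, actual))     # 1    (whole-number rounding hides the gap)
round(rmse(pred, actual), 2)  # 1.22 (two decimals keep it visible)
```

RMSE penalises large misses more heavily than MAE, so reporting both at two decimals gives a better sense of how the errors are spread.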

Check how good the model is on the test set.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im1vdmllPC1yZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3Mvc3k3OTB4dml1eG44cHNwL21vdmllX21ldGFkYXRhLmNzdj9kbD0xXCIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoSG1pc2MpXG5saWJyYXJ5KHBzeWNoKVxubW92aWVfc3ViIDwtIHN1YnNldChtb3ZpZSwgc2VsZWN0ID0gYyhudW1fY3JpdGljX2Zvcl9yZXZpZXdzLCBkdXJhdGlvbiwgZGlyZWN0b3JfZmFjZWJvb2tfbGlrZXMsIGFjdG9yXzFfZmFjZWJvb2tfbGlrZXMsIGdyb3NzLCBjYXN0X3RvdGFsX2ZhY2Vib29rX2xpa2VzLCBmYWNlbnVtYmVyX2luX3Bvc3RlciwgYnVkZ2V0LCBtb3ZpZV9mYWNlYm9va19saWtlcywgaW1kYl9zY29yZSkpXG5cbnNldC5zZWVkKDIwMTcpXG50cmFpbl9zaXplIDwtIDAuOCBcbnRyYWluX2luZGV4IDwtIHNhbXBsZS5pbnQobGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSwgbGVuZ3RoKG1vdmllX3N1YiRpbWRiX3Njb3JlKSAqIHRyYWluX3NpemUpXG50cmFpbl9zYW1wbGUgPC0gbW92aWVfc3ViW3RyYWluX2luZGV4LF1cbnRlc3Rfc2FtcGxlIDwtIG1vdmllX3N1YlstdHJhaW5faW5kZXgsXSIsInNhbXBsZSI6InRlc3RfY29yciA8LSByb3VuZChjb3IodGVzdF9zYW1wbGUkcHJlZF9zY29yZSwgdGVzdF9zYW1wbGUkaW1kYl9zY29yZSksIDIpXG50ZXN0X3Jtc2UgPC0gcm91bmQoc3FydChtZWFuKCh0ZXN0X3NhbXBsZSRwcmVkX3Njb3JlIC0gdGVzdF9zYW1wbGUkaW1kYl9zY29yZSleMikpKVxudGVzdF9tYWUgPC0gcm91bmQobWVhbihhYnModGVzdF9zYW1wbGUkcHJlZF9zY29yZSAtIHRlc3Rfc2FtcGxlJGltZGJfc2NvcmUpKSlcbmModGVzdF9jb3JyXjIsIHRlc3Rfcm1zZSwgdGVzdF9tYWUpIiwic29sdXRpb24iOiJ0ZXN0X2NvcnIgPC0gcm91bmQoY29yKHRlc3Rfc2FtcGxlJHByZWRfc2NvcmUsIHRlc3Rfc2FtcGxlJGltZGJfc2NvcmUpLCAyKVxudGVzdF9ybXNlIDwtIHJvdW5kKHNxcnQobWVhbigodGVzdF9zYW1wbGUkcHJlZF9zY29yZSAtIHRlc3Rfc2FtcGxlJGltZGJfc2NvcmUpXjIpKSlcbnRlc3RfbWFlIDwtIHJvdW5kKG1lYW4oYWJzKHRlc3Rfc2FtcGxlJHByZWRfc2NvcmUgLSB0ZXN0X3NhbXBsZSRpbWRiX3Njb3JlKSkpXG5jKHRlc3RfY29ycl4yLCB0ZXN0X3Jtc2UsIHRlc3RfbWFlKSJ9

This result is not bad: the results on the test set are not far from those on the training set.

Conclusion

The most important factor affecting a movie's score is its duration: the longer the movie, the higher the score tends to be. The number of critic reviews is also important: the more reviews a movie receives, the higher the score tends to be. The number of faces in the poster has a negative effect on the score: the more faces in a movie poster, the lower the score tends to be.

The End

I hope movies will be the same after I have learned how to analyze movie data. Apprécier le film!

Source code that created this post can be found here. I am happy to hear any feedback and questions.