Developing Data Products

October 1, 2019

Overview

The movies dataset consisted of 651 randomly sampled movies produced and released between 1970 and 2014. The dataset was generated from the Rotten Tomatoes and IMDb databases.

Data Manipulation

For this analysis, five new variables must be created from four of the 32 existing variables. In all, sixteen explanatory variables are required for the modeling. The five new variables are:

feature_film created from the variable title_type
drama created from the variable genre
mpaa_rating_R created from the variable mpaa_rating
oscar_season created from the variable thtr_rel_month
summer_season created from the variable thtr_rel_month
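
The derivation above can be sketched as follows. This is a minimal illustration, not the original analysis code. The category labels ("Feature Film", "Drama", "R") and the month ranges for the Oscar and summer seasons are assumptions, because the slides do not state the exact rules. The example record is invented and is not taken from the movies dataset.

```python
# Assumed rules: these month ranges and category labels are NOT given in the
# slides; adjust them to match the actual definitions used in the analysis.
OSCAR_MONTHS = {10, 11, 12}   # assumption: October to December
SUMMER_MONTHS = {5, 6, 7, 8}  # assumption: May to August


def yes_no(flag):
    """Encode a boolean as the 'yes'/'no' levels used in the plots."""
    return "yes" if flag else "no"


def add_derived_variables(movie):
    """Return a copy of one movie record with the five derived variables added."""
    out = dict(movie)
    out["feature_film"] = yes_no(movie["title_type"] == "Feature Film")
    out["drama"] = yes_no(movie["genre"] == "Drama")
    out["mpaa_rating_R"] = yes_no(movie["mpaa_rating"] == "R")
    out["oscar_season"] = yes_no(movie["thtr_rel_month"] in OSCAR_MONTHS)
    out["summer_season"] = yes_no(movie["thtr_rel_month"] in SUMMER_MONTHS)
    return out


# Invented record, for illustration only.
example = {
    "title_type": "Feature Film",
    "genre": "Drama",
    "mpaa_rating": "PG-13",
    "thtr_rel_month": 11,
}
derived = add_derived_variables(example)
```

Note that both oscar_season and summer_season are derived from thtr_rel_month, which is why five new variables come from only four existing ones.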

Exploratory data analysis

Of the 32 variables in the movies dataset, this analysis requires only 17, including the response variable. All 17 are listed below, starting with the response variable.

audience_score
feature_film
drama
mpaa_rating_R
oscar_season
summer_season

runtime
thtr_rel_year
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box
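
Restricting each record to the variables used in the analysis could look like the sketch below. The function name and the dictionary-per-movie representation are hypothetical and chosen only for illustration.

```python
# The response variable followed by the sixteen explanatory variables.
MODEL_VARIABLES = [
    "audience_score",
    "feature_film",
    "drama",
    "mpaa_rating_R",
    "oscar_season",
    "summer_season",
    "runtime",
    "thtr_rel_year",
    "imdb_rating",
    "imdb_num_votes",
    "critics_score",
    "best_pic_nom",
    "best_pic_win",
    "best_actor_win",
    "best_actress_win",
    "best_dir_win",
    "top200_box",
]


def select_model_variables(record):
    """Keep only the modeling variables; fail loudly if any are missing."""
    missing = [name for name in MODEL_VARIABLES if name not in record]
    if missing:
        raise KeyError(f"record is missing variables: {missing}")
    return {name: record[name] for name in MODEL_VARIABLES}
```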

Graphical visualization

Discussion

A close examination of both sets of boxplots showed that the plots of audience_score versus oscar_season, summer_season, drama, mpaa_rating_R, best_actor_win, best_actress_win, and best_dir_win reveal no distinct trend in the response variable between the 'yes' and 'no' levels of these explanatory variables.

The plot of audience_score versus feature_film is more informative: the lowest value in the 'no' level, which is an outlier, corresponds to the lowest value in the 'yes' level, which is not an outlier.
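
The same low score can be an outlier in one group and not in the other because a boxplot judges each point against the spread of its own group. The sketch below assumes the common 1.5 × IQR fence convention, which the slides do not confirm. The two score lists are invented to illustrate the effect and are not values from the movies dataset.

```python
from statistics import quantiles


def low_outliers(scores):
    """Return the scores below the lower fence Q1 - 1.5 * IQR."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    lower_fence = q1 - 1.5 * (q3 - q1)
    return [s for s in scores if s < lower_fence]


# Invented data: both groups share the same minimum score of 11.
no_group = [11, 70, 75, 80, 82, 85, 88, 90]    # tightly clustered, one low value
yes_group = [11, 30, 45, 55, 65, 75, 85, 95]   # widely spread

no_outliers = low_outliers(no_group)    # the tight group flags 11
yes_outliers = low_outliers(yes_group)  # the wide group flags nothing
```

With a narrow interquartile range the lower fence sits close to the bulk of the data, so an isolated low score falls outside it. With a wide range the fence moves far enough down to include that same score.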

The plots of audience_score versus best_pic_nom, best_pic_win, and top200_box each showed high 'yes' levels