Data Preparation

#Loading the dataframe
df <- read.csv('https://raw.githubusercontent.com/davidblumenstiel/data/master/kaglemovies/tmdb_5000_movies.csv')

#Trimming out the variables we aren't interested in
movies <- df[,c(1,9,13,14,19)]

#Taking out observations which have a budget of 1000 or less (gets rid of some suspect observations)
movies <- movies[movies$budget > 1000,]
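The positional indices above are easy to misread, so the same two steps can also be written with column names. The sketch below is only an illustration: it assumes the kept columns are named budget, popularity, revenue, runtime, and vote_average (the names used in the summaries later in this document), and it runs on a small made-up data frame called toy_df rather than on the real dataset.

```r
# Hypothetical stand-in for df; the values are invented for illustration only
toy_df <- data.frame(
  budget       = c(500, 2000000, 0, 15000000),
  title        = c("A", "B", "C", "D"),
  popularity   = c(1.2, 10.5, 0.3, 40.1),
  revenue      = c(0, 5000000, 0, 60000000),
  runtime      = c(90, 110, NA, 125),
  vote_average = c(5.1, 6.4, 0, 7.2),
  stringsAsFactors = FALSE
)

# Select the variables of interest by name instead of by position
keep <- c("budget", "popularity", "revenue", "runtime", "vote_average")
toy_movies <- toy_df[, keep]

# Keep only observations with a budget over 1000
toy_movies <- toy_movies[toy_movies$budget > 1000, ]

print(toy_movies)
```

Selecting by name keeps working if the column order of the source file changes, which positional indexing does not.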

Research question

I want to determine what relationship the budget, popularity, runtime, and average rating of a movie have with its revenue.

Cases

The cases are individual, previously released movies (the dataset was last updated in September 2017) which had a budget over $1000; there are 3734 cases.

Data collection

The authors of the dataset collected the data from https://www.themoviedb.org using their API. I obtained the dataset from https://www.kaggle.com/tmdb/tmdb-movie-metadata/discussion. I also cleaned it by removing the variables that I will not be looking at, and by removing observations with budgets of $1000 or less, which I think may be unrealistic.

Type of study

This is an observational study.

Data Source

Data was collected by The Movie Database (TMDb) https://www.themoviedb.org, and presented on Kaggle https://www.kaggle.com/tmdb/tmdb-movie-metadata.

Dependent Variable

The response variable is revenue, and it is numerical.

Independent Variable

The explanatory variables are budget, popularity, runtime, and average rating.
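With one numerical response and four explanatory variables, one possible approach is a multiple linear regression. This document does not state which method will be used, so the sketch below is an assumption, and it is fitted to simulated data (the sim data frame and its coefficients are invented) rather than to the movies dataset.

```r
# Simulated data for illustration only; not drawn from the TMDb dataset
set.seed(1)
n <- 50
sim <- data.frame(
  budget       = runif(n, 1e6, 1e8),
  popularity   = runif(n, 0, 100),
  runtime      = runif(n, 80, 150),
  vote_average = runif(n, 3, 9)
)

# Invented relationship so the model has something to estimate
sim$revenue <- 2 * sim$budget + 1e5 * sim$popularity + rnorm(n, sd = 1e7)

# Regress revenue on the four explanatory variables
fit <- lm(revenue ~ budget + popularity + runtime + vote_average, data = sim)

# One coefficient per explanatory variable, plus the intercept
print(coef(fit))
```

On the real data the same formula could be used with data = movies; calling summary(fit) would then give standard errors and p-values for each coefficient.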

Relevant summary statistics

#Revenue
summary(movies$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 6.743e+06 3.945e+07 1.048e+08 1.226e+08 2.788e+09
hist(movies$revenue)

#Budget
summary(movies$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      7000   8550000  23300000  37360290  50000000 380000000
hist(movies$budget)

#Popularity
summary(movies$popularity)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0016   8.1852  17.7095  26.1751  34.0656 875.5813
hist(movies$popularity)

#Runtime
summary(movies$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    95.0   106.0   109.4   120.0   338.0       1
hist(movies$runtime)

#Average Rating
summary(movies$vote_average)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.700   6.300   6.235   6.900   8.500
hist(movies$vote_average)
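
The summaries above show a minimum of 0 for revenue, runtime, and average rating, and one missing runtime. Such values may be placeholders rather than real measurements. The sketch below shows one way to count them per variable; it uses a small invented data frame (toy) so that it runs on its own.

```r
# Hypothetical values for illustration only
toy <- data.frame(
  revenue      = c(0, 5e6, 6e7, 0),
  runtime      = c(0, 110, NA, 125),
  vote_average = c(0, 6.4, 7.2, 5.9)
)

# Number of exact zeros in each column, ignoring missing values
zero_counts <- colSums(toy == 0, na.rm = TRUE)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

print(zero_counts)
print(na_counts)
```

Running the same two calls on movies would show how many observations are affected before deciding whether to exclude them.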