#Loading the dataframe
df <- read.csv('https://raw.githubusercontent.com/davidblumenstiel/data/master/kaglemovies/tmdb_5000_movies.csv')
#Keeping only the variables of interest: budget, popularity, revenue, runtime, and vote_average
movies <- df[,c(1,9,13,14,19)]
#Removing observations with a budget of 1000 or less (gets rid of some suspect observations)
movies <- movies[movies$budget > 1000,]
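The two cleaning steps above can also be written with column names instead of column positions, which keeps working if the column order of the source file changes. This is a minimal sketch rather than the code used for the analysis: `toy_raw` and its values are invented for illustration, and `which()` is used only so that any `NA` comparison would drop the row instead of producing an all-`NA` row.

```r
# Toy stand-in for the raw data; these values are invented for illustration only
toy_raw <- data.frame(
  budget       = c(500, 1000, 7000, 25000000),
  title        = c("a", "b", "c", "d"),
  popularity   = c(1.2, 3.4, 5.6, 7.8),
  revenue      = c(0, 2000, 90000, 60000000),
  runtime      = c(90, 100, NA, 120),
  vote_average = c(5.1, 6.0, 6.3, 7.2)
)

# Select the variables by name rather than by position
keep <- c("budget", "popularity", "revenue", "runtime", "vote_average")
toy_movies <- toy_raw[, keep]

# Keep rows with a budget strictly greater than 1000;
# which() discards rows where the comparison is NA
toy_movies <- toy_movies[which(toy_movies$budget > 1000), ]

# Number of cases left after filtering
nrow(toy_movies)
```

Because the comparison is strict, a budget of exactly 1000 is removed along with anything lower.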
I want to determine how a movie's budget, popularity, runtime, and average rating relate to its revenue.
The cases are individual, previously released movies (dataset last updated September 2017) with a budget over $1000; there are 3734 cases.
The authors of the dataset collected the data from https://www.themoviedb.org using its API. I obtained the dataset from https://www.kaggle.com/tmdb/tmdb-movie-metadata/discussion. I then cleaned it by dropping the variables I will not be examining and by removing observations with budgets of $1000 or less, which I suspect are unrealistic.
This is an observational study.
The data were collected by The Movie Database (TMDb), https://www.themoviedb.org, and published on Kaggle at https://www.kaggle.com/tmdb/tmdb-movie-metadata.
The response variable is revenue, and it is numerical.
The explanatory variables are budget, popularity, runtime, and average rating.
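One possible way to examine these relationships is a multiple linear regression of the response on the four explanatory variables. The sketch below is an assumption about how the analysis might proceed, not a method this proposal commits to. It runs on simulated data (`sim`, generated here and unrelated to the TMDb values) so that it is self-contained; the same formula could be applied to `movies`.

```r
# Simulated stand-in data so the example runs on its own (not the TMDb data)
set.seed(1)
n <- 200
sim <- data.frame(
  budget       = runif(n, 1e6, 1e8),
  popularity   = runif(n, 0, 100),
  runtime      = runif(n, 80, 180),
  vote_average = runif(n, 3, 9)
)
# Revenue is generated from budget plus noise purely for illustration
sim$revenue <- 2 * sim$budget + rnorm(n, sd = 1e7)

# Multiple linear regression: revenue as the response,
# the other four variables as predictors
fit <- lm(revenue ~ budget + popularity + runtime + vote_average, data = sim)
summary(fit)
```

With the real data, the call would be the same with `data = movies`; `lm()` drops rows containing `NA` by default, so the one movie with a missing runtime would be excluded from the fit.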
#Revenue
summary(movies$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.000e+00 6.743e+06 3.945e+07 1.048e+08 1.226e+08 2.788e+09
hist(movies$revenue)
#Budget
summary(movies$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##      7000   8550000  23300000  37360290  50000000 380000000
hist(movies$budget)
#Popularity
summary(movies$popularity)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   0.0016   8.1852  17.7095  26.1751  34.0656 875.5813
hist(movies$popularity)
#Runtime
summary(movies$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0    95.0   106.0   109.4   120.0   338.0       1
hist(movies$runtime)
#Average Rating
summary(movies$vote_average)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   5.700   6.300   6.235   6.900   8.500
hist(movies$vote_average)
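The summaries above show a minimum of 0 for revenue, runtime, and average rating, plus one missing runtime, so it may be worth counting how many such values each variable has before modelling. The following is a minimal sketch: `count_suspect` is a hypothetical helper name, and `toy_summary` holds invented values so the example runs on its own.

```r
# Hypothetical helper: count missing and zero values in each numeric column
count_suspect <- function(d) {
  data.frame(
    variable = names(d),
    n_na     = vapply(d, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE),
    n_zero   = vapply(d, function(x) sum(x == 0, na.rm = TRUE), integer(1), USE.NAMES = FALSE),
    row.names = NULL
  )
}

# Toy data, invented for illustration only
toy_summary <- data.frame(
  revenue      = c(0, 100, 250),
  runtime      = c(95, NA, 0),
  vote_average = c(6.3, 0, 7.1)
)

# One row per variable with its NA count and zero count
suspect <- count_suspect(toy_summary)
suspect
```

On the real data the call would be `count_suspect(movies)`. Whether zeros in revenue or runtime are true values or placeholders for missing data is a judgment call the counts alone cannot settle.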