1 Introduction

The purpose of this analysis is to see whether the movie industry is dying following the rise of the new king of entertainment, Netflix.

The data contains information such as the revenue of each movie over three decades, which lets us analyse the current condition of the movie industry.

2 Data Preprocessing

2.1 Set up Libraries

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

2.2 Read Data

movies <- read.csv("movies.csv")

names(movies)
##  [1] "budget"   "company"  "country"  "director" "genre"    "gross"   
##  [7] "name"     "rating"   "released" "runtime"  "score"    "star"    
## [13] "votes"    "writer"   "year"
head(movies)

2.3 Data structure

str(movies)
## 'data.frame':    6820 obs. of  15 variables:
##  $ budget  : num  8000000 6000000 15000000 18500000 9000000 6000000 25000000 6000000 9000000 15000000 ...
##  $ company : chr  "Columbia Pictures Corporation" "Paramount Pictures" "Paramount Pictures" "Twentieth Century Fox Film Corporation" ...
##  $ country : chr  "USA" "USA" "USA" "USA" ...
##  $ director: chr  "Rob Reiner" "John Hughes" "Tony Scott" "James Cameron" ...
##  $ genre   : chr  "Adventure" "Comedy" "Action" "Action" ...
##  $ gross   : num  5.23e+07 7.01e+07 1.80e+08 8.52e+07 1.86e+07 ...
##  $ name    : chr  "Stand by Me" "Ferris Bueller's Day Off" "Top Gun" "Aliens" ...
##  $ rating  : chr  "R" "PG-13" "PG" "R" ...
##  $ released: chr  "1986-08-22" "1986-06-11" "1986-05-16" "1986-07-18" ...
##  $ runtime : int  89 103 110 137 90 120 101 120 96 96 ...
##  $ score   : num  8.1 7.8 6.9 8.4 6.9 8.1 7.4 7.8 6.8 7.5 ...
##  $ star    : chr  "Wil Wheaton" "Matthew Broderick" "Tom Cruise" "Sigourney Weaver" ...
##  $ votes   : int  299174 264740 236909 540152 36636 317585 102879 146768 60565 129698 ...
##  $ writer  : chr  "Stephen King" "John Hughes" "Jim Cash" "James Cameron" ...
##  $ year    : int  1986 1986 1986 1986 1986 1986 1986 1986 1986 1986 ...

2.4 Changing column type

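From the structure above we found that the character variables, apart from released, hold categorical content, so we convert these columns to factors. We also convert year, which is stored as an integer, to a factor so that it can be used as a category. The released column holds dates and is handled separately.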

movies$company <- as.factor(movies$company)
movies$country <- as.factor(movies$country)
movies$director <- as.factor(movies$director)
movies$genre <- as.factor(movies$genre)
movies$name <- as.factor(movies$name)
movies$rating <- as.factor(movies$rating)
movies$star <- as.factor(movies$star)
movies$writer <- as.factor(movies$writer)
movies$year <- as.factor(movies$year)

str(movies)
## 'data.frame':    6820 obs. of  15 variables:
##  $ budget  : num  8000000 6000000 15000000 18500000 9000000 6000000 25000000 6000000 9000000 15000000 ...
##  $ company : Factor w/ 2179 levels "\"DIA\" Productions GmbH & Co. KG",..: 665 1683 1683 2068 2124 1159 1161 763 1683 1936 ...
##  $ country : Factor w/ 57 levels "Argentina","Aruba",..: 56 56 56 56 56 54 54 56 56 56 ...
##  $ director: Factor w/ 2759 levels "A.R. Murugadoss",..: 2200 1304 2652 1073 2129 1955 1214 590 1011 559 ...
##  $ genre   : Factor w/ 17 levels "Action","Adventure",..: 2 5 1 1 2 7 2 7 5 7 ...
##  $ gross   : num  5.23e+07 7.01e+07 1.80e+08 8.52e+07 1.86e+07 ...
##  $ name    : Factor w/ 6731 levels "'71","'night, Mother",..: 4671 1829 6210 299 1880 3911 2890 781 3969 5312 ...
##  $ rating  : Factor w/ 13 levels "B","B15","G",..: 9 8 7 9 7 9 7 9 8 9 ...
##  $ released: chr  "1986-08-22" "1986-06-11" "1986-05-16" "1986-07-18" ...
##  $ runtime : int  89 103 110 137 90 120 101 120 96 96 ...
##  $ score   : num  8.1 7.8 6.9 8.4 6.9 8.1 7.4 7.8 6.8 7.5 ...
##  $ star    : Factor w/ 2504 levels "'Weird Al' Yankovic",..: 2458 1609 2349 2197 1146 372 529 928 1734 1043 ...
##  $ votes   : int  299174 264740 236909 540152 36636 317585 102879 146768 60565 129698 ...
##  $ writer  : Factor w/ 4199 levels "'Weird Al' Yankovic",..: 3728 1980 1861 1637 2559 2996 981 901 1980 1342 ...
##  $ year    : Factor w/ 31 levels "1986","1987",..: 1 1 1 1 1 1 1 1 1 1 ...
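The same kind of conversion can be written more compactly by applying as.factor() to a vector of column names. The block below is only a minimal sketch on a small hypothetical data frame (`toy` and its values are made up for illustration, not taken from movies.csv).

```r
# Hypothetical toy data standing in for a few columns of `movies`
toy <- data.frame(
  genre  = c("Action", "Comedy", "Action"),
  rating = c("R", "PG", "R"),
  year   = c(1986L, 1986L, 1987L),
  stringsAsFactors = FALSE
)

# Columns we want to treat as categorical
cat_cols <- c("genre", "rating", "year")

# Convert every listed column to a factor in one step
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```

Listing the columns explicitly keeps columns such as released out of the conversion.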

From the data we also found that one variable, released, stores the release date of each movie as character. We convert this column to the Date type using the ymd() function from the lubridate package.

movies <- movies %>% 
       mutate(released = ymd(released))
## Warning: Problem with `mutate()` input `released`.
## i  60 failed to parse.
## i Input `released` is `ymd(released)`.
## Warning: 60 failed to parse.
movies
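The warning above reports that 60 values failed to parse, and these become NA in the released column. One way to see which raw values are affected is to keep the original strings and compare them with the parsed result. The following is a minimal sketch on a hypothetical vector (`released_raw` and its values are assumptions for illustration, not the actual failing values in the data).

```r
library(lubridate)

# Hypothetical raw values; the last one is not a valid date
released_raw <- c("1986-08-22", "1986-06-11", "not available")

# quiet = TRUE suppresses the "failed to parse" warning
parsed <- ymd(released_raw, quiet = TRUE)

# Raw values that were present but could not be parsed
failed <- released_raw[is.na(parsed) & !is.na(released_raw)]
failed
```

Inspecting the failed values before dropping them shows whether they could be repaired instead of removed.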

2.5 Finding NA values

Before looking for NA values, we first remove rows that contain zero values. Our main concern is the budget column, because we will analyse this variable and we believe that in reality no movie is produced with a zero budget. Note that the filter below is applied to every numeric column, so a row with a zero in any numeric column is removed.

movies <- movies %>% 
  filter_if(is.numeric, all_vars((.) != 0))


movies

After eliminating the zero values, our data has only 4,638 rows, down from the original 6,820.
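To illustrate how filter_if() with all_vars() behaves, the sketch below compares it with a filter on budget alone. The data frame `toy` and its values are hypothetical and only serve to show the difference in the rows that are kept.

```r
library(dplyr)

# Hypothetical toy data: one zero budget and one zero gross
toy <- data.frame(
  name   = c("a", "b", "c"),
  budget = c(100, 0, 50),
  gross  = c(10, 20, 0)
)

# Keeps a row only if ALL numeric columns are non-zero
all_numeric <- toy %>%
  filter_if(is.numeric, all_vars(. != 0))

# Keeps a row if budget is non-zero, whatever the other columns hold
budget_only <- toy %>%
  filter(budget != 0)

nrow(all_numeric)
nrow(budget_only)
```

The first filter is stricter, so it can remove more rows than a filter on budget alone would.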

Now we look for NA values in our data.

colSums(is.na(movies))
##   budget  company  country director    genre    gross     name   rating 
##        0        0        0        0        0        0        0        0 
## released  runtime    score     star    votes   writer     year 
##        8        0        0        0        0        0        0

From the summary above we found that the released column contains eight NA values, so we remove these rows using the drop_na() function from tidyr.

movies <- movies %>% 
  drop_na()

movies

Our data frame now has 4,630 rows, which means we have successfully removed the NA values and can continue to the next step.
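Called without arguments, drop_na() removes a row when any column is NA. It also accepts column names, which limits the check to those columns. The sketch below shows both forms on a hypothetical data frame (`toy` and its values are assumptions for illustration).

```r
library(tidyr)

# Hypothetical toy data with one NA in `released` and one NA in `score`
toy <- data.frame(
  name     = c("a", "b", "c"),
  released = as.Date(c("1986-08-22", NA, "1990-01-01")),
  score    = c(8.1, 7.0, NA)
)

# Number of NA values per column
na_counts <- colSums(is.na(toy))
na_counts

# Drop rows with an NA in any column
dropped_any <- drop_na(toy)

# Drop rows only when `released` is NA
dropped_released <- drop_na(toy, released)
```

In this report only released contains NA values, so both forms would remove the same eight rows.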

2.6 Adding a new column

We would like to analyse the budget and revenue of the movie industry, so we classify each movie into one of three categories based on its budget: low, medium and high.

Before classifying, we first summarise the budget column to see the range of the data.

options(scipen=999)
summary(movies$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      6000  10000000  23000000  36196728  46000000 300000000
movies$budgetclass[movies$budget >= 6000 & movies$budget < 10000000] <- "low budget"

movies$budgetclass[movies$budget >= 10000000 & movies$budget <= 36196728] <- "medium budget"

movies$budgetclass[movies$budget > 36196728 & movies$budget <= 300000000] <- "high budget"

movies$budgetclass <- as.factor(movies$budgetclass)

levels(movies$budgetclass)
## [1] "high budget"   "low budget"    "medium budget"
movies <- movies %>% 
  drop_na()

colSums(is.na(movies))
##      budget     company     country    director       genre       gross 
##           0           0           0           0           0           0 
##        name      rating    released     runtime       score        star 
##           0           0           0           0           0           0 
##       votes      writer        year budgetclass 
##           0           0           0           0

3 Plotting

In this section our goal is to explore how revenue relates to budget for the five most-voted genres.

3.1 Genre vs Votes

First we create a plot showing which genres receive the most votes.

library(dplyr)

# Total votes per genre, sorted from most to least voted
test <- aggregate(votes ~ genre, data = movies, FUN = sum)
test <- test[order(test$votes, decreasing = TRUE), ]

test
# Single stacked bar of votes per genre. The default discrete fill palette is
# used because the 17 genres exceed the size of the ColorBrewer palettes.
p <- ggplot(test, aes(x = "", y = votes, fill = genre)) +
  geom_bar(width = 1, stat = "identity") +
  theme(plot.title = element_text(hjust = 0.5, size = 20),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.border = element_blank()) +
  guides(fill = guide_legend(reverse = FALSE))

# Polar coordinates turn the stacked bar into a pie chart
pie <- p + coord_polar("y", start = 0) +
  labs(y = "", x = "votes",
       title = "Movies Genre",
       subtitle = "Relation between movie votes and genre",
       caption = "Source: https://www.kaggle.com/danielgrijalvas/movies")

pie

From the chart above we can see that the five genres with the most votes are:

  • Action
  • Comedy
  • Drama
  • Crime
  • Adventure

In the next step we keep only these genres for our final plot.

agg <- movies[movies$genre %in% c("Action", "Comedy", "Drama", "Crime", "Adventure"), ]

agg
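Instead of typing the genre names by hand, the top five can be taken directly from the sorted vote totals. The sketch below uses a hypothetical table `votes_by_genre` with made-up vote counts, standing in for the aggregated `test` data frame.

```r
# Hypothetical vote totals per genre (values are made up for illustration)
votes_by_genre <- data.frame(
  genre = c("Action", "Comedy", "Drama", "Crime", "Adventure", "Horror", "Animation"),
  votes = c(700, 600, 500, 300, 250, 100, 200)
)

# Sort from most to least voted and keep the first five genre names
sorted <- votes_by_genre[order(votes_by_genre$votes, decreasing = TRUE), ]
top5 <- head(as.character(sorted$genre), 5)

top5

# The names can then be used to subset a data frame, for example:
# agg <- movies[movies$genre %in% top5, ]
```

This keeps the selection in step with the data if the vote totals change.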
final_plot <- ggplot(data = agg, aes(x = gross, y = genre)) +
  geom_col(aes(col = genre)) +
  facet_wrap(~budgetclass) +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
  labs(title = "Relation Between Budget and Revenue",
       subtitle = "Top Five Genres",
       x = "Revenue",
       y = "Genre")

final_plot
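Because geom_col() stacks the gross of every movie, each bar in the plot shows the total revenue of one genre within one budget class. The numbers behind the bars can be tabulated with aggregate(). The following is a minimal sketch on a hypothetical data frame (`toy` and its values are made up, standing in for `agg`).

```r
# Hypothetical toy data standing in for `agg`
toy <- data.frame(
  genre       = c("Action", "Action", "Comedy", "Comedy", "Comedy"),
  budgetclass = c("high budget", "high budget",
                  "medium budget", "medium budget", "low budget"),
  gross       = c(100, 50, 30, 20, 5)
)

# Total gross per genre and budget class, matching the bar lengths in the plot
totals <- aggregate(gross ~ genre + budgetclass, data = toy, FUN = sum)

totals
```

Reading the totals as a table makes it easier to compare budget classes within a genre than reading bar lengths across panels.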

4 Conclusion

From the final plot we can draw several insights:

  1. If we want to make a movie with high revenue, we generally have to spend a high budget on it. However, in the comedy genre the highest revenue comes not from the high budget class but from the medium budget class. This anomaly deserves further analysis, which might give us some insight into how it happens. It may also be good news for comedy production houses: earning revenue from a comedy movie does not always require spending a lot of money on production.

  2. Action movies still lead in revenue compared with the other genres.

  3. For one of the genres, we see no significant difference in revenue between the high, medium and low budget classes. This suggests that in this genre higher revenue may depend less on the budget than on a good storyline or other aspects of filmmaking technique.