This project is an exploratory analysis of the Movies_Dataset: we first clean the dataset and then explore relationships between selected variables.


Setup

Load packages

library(tidyverse)
## ── Attaching packages ───────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load data

mov <- read.csv("Movies_Dataset.csv")
dim(mov)
## [1] 608  18

The dataset contains 608 rows and 18 columns.

head(mov)
##   Day.of.Week                Director  Genre       Movie.Title Release.Date
## 1      Friday               Brad Bird action      Tomorrowland   22/05/2015
## 2      Friday             Scott Waugh action    Need for Speed   14/03/2014
## 3      Friday          Patrick Hughes action The Expendables 3   15/08/2014
## 4      Friday Phil Lord, Chris Miller comedy    21 Jump Street   16/03/2012
## 5      Friday         Roland Emmerich action  White House Down   28/06/2013
## 6      Friday              David Ayer action              Fury   17/10/2014
##                Studio Adjusted.Gross...mill. Budget...mill. Gross...mill.
## 1 Buena Vista Studios                  202.1            170         202.1
## 2 Buena Vista Studios                  204.2             66         203.3
## 3           Lionsgate                  207.1            100         206.2
## 4                Sony                  208.8             42         201.6
## 5                Sony                  209.7            150         205.4
## 6                Sony                  212.8             80         211.8
##   IMDb.Rating MovieLens.Rating Overseas...mill. Overseas. Profit...mill.
## 1         6.7             3.26            111.9      55.4           32.1
## 2         6.6             2.97            159.7      78.6          137.3
## 3         6.1             2.93            166.9      80.9          106.2
## 4         7.2             3.62             63.1      31.3          159.6
## 5         8.0             3.65            132.3      64.4           55.4
## 6         5.8             2.85              126      59.5          131.8
##   Profit. Runtime..min. US...mill. Gross...US
## 1    18.9           130       90.2       44.6
## 2   208.0           132       43.6       21.4
## 3   106.2           126       39.3       19.1
## 4   380.0           109      138.4       68.7
## 5    36.9           131       73.1       35.6
## 6   164.8           134       85.8       40.5

The column names look messy. Let’s go ahead and change them.

mov <- mov %>%
  rename(
    Budget.Millions = Budget...mill.,
    Gross.Revenue.Millions = Gross...mill.,
    Audience.Rating = IMDb.Rating,
    Runtime = Runtime..min.,
    Profit.Millions = Profit...mill.,
    Profit.Percentage = Profit.,
    Overseas.Revenue = Overseas...mill.,
    Overseas.Revenue.Percentage = Overseas.,
    Gross.US.Revenue = US...mill.,
    US.Revenue.Percentage = Gross...US
  )
colnames(mov)
##  [1] "Day.of.Week"                 "Director"                   
##  [3] "Genre"                       "Movie.Title"                
##  [5] "Release.Date"                "Studio"                     
##  [7] "Adjusted.Gross...mill."      "Budget.Millions"            
##  [9] "Gross.Revenue.Millions"      "Audience.Rating"            
## [11] "MovieLens.Rating"            "Overseas.Revenue"           
## [13] "Overseas.Revenue.Percentage" "Profit.Millions"            
## [15] "Profit.Percentage"           "Runtime"                    
## [17] "Gross.US.Revenue"            "US.Revenue.Percentage"

Let’s look at the structure of the dataset.

str(mov)
## 'data.frame':    608 obs. of  18 variables:
##  $ Day.of.Week                : Factor w/ 6 levels "Friday","Saturday",..: 1 1 1 1 1 1 4 1 1 1 ...
##  $ Director                   : Factor w/ 337 levels "Aaron Blaise, Robert A. Walker",..: 31 297 233 256 287 76 276 71 108 126 ...
##  $ Genre                      : Factor w/ 15 levels "action","adventure",..: 1 1 1 5 1 1 2 1 1 10 ...
##  $ Movie.Title                : Factor w/ 608 levels "10,000 B.C.",..: 557 314 466 6 592 161 233 378 128 331 ...
##  $ Release.Date               : Factor w/ 534 levels "1/05/2009","1/05/2015",..: 273 86 121 134 384 159 347 16 28 257 ...
##  $ Studio                     : Factor w/ 36 levels "Art House Studios",..: 2 2 11 25 25 25 2 31 31 20 ...
##  $ Adjusted.Gross...mill.     : Factor w/ 585 levels "1,003","1,020",..: 50 51 52 53 54 55 56 57 58 59 ...
##  $ Budget.Millions            : num  170 66 100 42 150 80 50 85 70 5 ...
##  $ Gross.Revenue.Millions     : Factor w/ 561 levels "1,004.60","1,017",..: 30 33 43 27 40 59 63 49 72 45 ...
##  $ Audience.Rating            : num  6.7 6.6 6.1 7.2 8 5.8 6 6.8 6.3 5.9 ...
##  $ MovieLens.Rating           : num  3.26 2.97 2.93 3.62 3.65 2.85 3.16 3.45 2.92 2.9 ...
##  $ Overseas.Revenue           : Factor w/ 551 levels "1,160.60","1,528.10",..: 32 151 172 490 82 66 528 523 150 11 ...
##  $ Overseas.Revenue.Percentage: num  55.4 78.6 80.9 31.3 64.4 59.5 39.9 39.3 73.9 49.8 ...
##  $ Profit.Millions            : Factor w/ 566 levels "1,015.40","1,025.90",..: 366 47 13 94 494 39 100 28 69 189 ...
##  $ Profit.Percentage          : num  18.9 208 106.2 380 36.9 ...
##  $ Runtime                    : int  130 132 126 109 131 134 125 115 92 84 ...
##  $ Gross.US.Revenue           : num  90.2 43.6 39.3 138.4 73.1 ...
##  $ US.Revenue.Percentage      : num  44.6 21.4 19.1 68.7 35.6 40.5 60.1 60.7 26.1 50.2 ...

Notice that some columns (e.g. Gross.Revenue.Millions) are read as factors (categorical variables) because some of their values contain a comma as a thousands separator (e.g. "1,004.60"). Let’s convert them into numeric variables.

mov$Gross.Revenue.Millions <- as.numeric(as.character(mov$Gross.Revenue.Millions))
## Warning: NAs introduced by coercion
mov$Overseas.Revenue <- as.numeric(as.character(mov$Overseas.Revenue))
## Warning: NAs introduced by coercion
mov$Profit.Millions <- as.numeric(as.character(mov$Profit.Millions))
## Warning: NAs introduced by coercion
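
The warnings appear because strings such as "1,004.60" are not valid numbers for as.numeric(), so those entries become NA. As an alternative to the conversion above (not what this analysis does), here is a minimal sketch that strips the thousands separator before converting. The toy vector `raw` is made up for illustration.

```r
# Toy factor imitating the comma-formatted revenue columns (illustrative values).
raw <- factor(c("202.1", "1,004.60", "987.5", "1,017"))

# Direct coercion: strings containing a comma cannot be parsed and become NA.
naive <- suppressWarnings(as.numeric(as.character(raw)))

# Removing the thousands separator first keeps every value.
cleaned <- as.numeric(gsub(",", "", as.character(raw), fixed = TRUE))

print(naive)    # two of the four entries are NA
print(cleaned)  # all four entries are numeric
```

On the real data the same idea would read `as.numeric(gsub(",", "", as.character(mov$Gross.Revenue.Millions), fixed = TRUE))`. With that variant the comma-formatted values would be kept instead of turning into missing values.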

str(mov)
## 'data.frame':    608 obs. of  18 variables:
##  $ Day.of.Week                : Factor w/ 6 levels "Friday","Saturday",..: 1 1 1 1 1 1 4 1 1 1 ...
##  $ Director                   : Factor w/ 337 levels "Aaron Blaise, Robert A. Walker",..: 31 297 233 256 287 76 276 71 108 126 ...
##  $ Genre                      : Factor w/ 15 levels "action","adventure",..: 1 1 1 5 1 1 2 1 1 10 ...
##  $ Movie.Title                : Factor w/ 608 levels "10,000 B.C.",..: 557 314 466 6 592 161 233 378 128 331 ...
##  $ Release.Date               : Factor w/ 534 levels "1/05/2009","1/05/2015",..: 273 86 121 134 384 159 347 16 28 257 ...
##  $ Studio                     : Factor w/ 36 levels "Art House Studios",..: 2 2 11 25 25 25 2 31 31 20 ...
##  $ Adjusted.Gross...mill.     : Factor w/ 585 levels "1,003","1,020",..: 50 51 52 53 54 55 56 57 58 59 ...
##  $ Budget.Millions            : num  170 66 100 42 150 80 50 85 70 5 ...
##  $ Gross.Revenue.Millions     : num  202 203 206 202 205 ...
##  $ Audience.Rating            : num  6.7 6.6 6.1 7.2 8 5.8 6 6.8 6.3 5.9 ...
##  $ MovieLens.Rating           : num  3.26 2.97 2.93 3.62 3.65 2.85 3.16 3.45 2.92 2.9 ...
##  $ Overseas.Revenue           : num  111.9 159.7 166.9 63.1 132.3 ...
##  $ Overseas.Revenue.Percentage: num  55.4 78.6 80.9 31.3 64.4 59.5 39.9 39.3 73.9 49.8 ...
##  $ Profit.Millions            : num  32.1 137.3 106.2 159.6 55.4 ...
##  $ Profit.Percentage          : num  18.9 208 106.2 380 36.9 ...
##  $ Runtime                    : int  130 132 126 109 131 134 125 115 92 84 ...
##  $ Gross.US.Revenue           : num  90.2 43.6 39.3 138.4 73.1 ...
##  $ US.Revenue.Percentage      : num  44.6 21.4 19.1 68.7 35.6 40.5 60.1 60.7 26.1 50.2 ...

Look out for missing values:

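The coercion warnings above tell us that the columns Gross.Revenue.Millions, Overseas.Revenue and Profit.Millions now contain missing values: entries written with a thousands separator (for example "1,004.60") cannot be parsed by as.numeric() and become NA. We can handle this situation in two ways.

  1. Simply delete the rows containing missing values, or
  2. Impute the missing values with the respective column mean or median.

I will go with the second option here and use the column median.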

mov <- mov %>%
  mutate(Gross.Revenue.Millions = replace(Gross.Revenue.Millions,
                                  is.na(Gross.Revenue.Millions),
                                  median(Gross.Revenue.Millions, na.rm = TRUE)))

mov <- mov %>%
  mutate(Overseas.Revenue = replace(Overseas.Revenue,
                                  is.na(Overseas.Revenue),
                                  median(Overseas.Revenue, na.rm = TRUE)))

mov <- mov %>%
  mutate(Profit.Millions = replace(Profit.Millions,
                                  is.na(Profit.Millions),
                                  median(Profit.Millions, na.rm = TRUE)))
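
The three imputation steps above repeat the same pattern, so they could also be written once as a small helper. This is a base R sketch under assumptions: `impute_median` is a hypothetical helper name and `toy` is a made-up data frame that only borrows the column names.

```r
# Hypothetical helper: replace NAs in a numeric vector with its median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Made-up data frame with one missing value per column.
toy <- data.frame(
  Gross.Revenue.Millions = c(202.1, NA, 211.8),
  Overseas.Revenue       = c(NA, 159.7, 126.0),
  Profit.Millions        = c(32.1, 137.3, NA)
)

# Apply the helper to every column that needs imputation.
cols <- c("Gross.Revenue.Millions", "Overseas.Revenue", "Profit.Millions")
toy[cols] <- lapply(toy[cols], impute_median)
print(toy)
```

Keeping the column names in one vector makes it easy to add or remove columns later without copying the mutate() call.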

Let’s check again to make sure there are no more missing values.

summary(mov)
##     Day.of.Week              Director         Genre    
##  Friday   :448   Steven Spielberg: 19   action   :236  
##  Saturday :  3   Robert Zemeckis :  9   animation: 97  
##  Sunday   :  1   Michael Bay     :  8   comedy   : 91  
##  Thursday : 27   Peter Jackson   :  7   drama    : 52  
##  Tuesday  : 10   Ridley Scott    :  7   adventure: 50  
##  Wednesday:119   Tim Burton      :  7   sci-fi   : 16  
##                  (Other)         :551   (Other)  : 66  
##                 Movie.Title      Release.Date                 Studio   
##  10,000 B.C.          :  1   25/12/2008:  4   Buena Vista Studios: 93  
##  101 Dalmatians       :  1   1/07/2009 :  3   WB                 : 93  
##  101 Dalmatians (1996):  1   16/12/2011:  3   Fox                : 85  
##  2 Fast 2 Furious     :  1   19/11/1999:  3   Universal          : 79  
##  2012                 :  1   1/05/2009 :  2   Sony               : 65  
##  21 Jump Street       :  1   10/06/2005:  2   Paramount Pictures : 62  
##  (Other)              :602   (Other)   :591   (Other)            :131  
##  Adjusted.Gross...mill. Budget.Millions  Gross.Revenue.Millions Audience.Rating
##  296    :  3            Min.   :  0.60   Min.   :200.3          Min.   :3.600  
##  231    :  2            1st Qu.: 45.00   1st Qu.:246.6          1st Qu.:6.375  
##  269.4  :  2            Median : 80.00   Median :320.4          Median :6.900  
##  274    :  2            Mean   : 92.47   Mean   :378.6          Mean   :6.924  
##  280    :  2            3rd Qu.:130.00   3rd Qu.:444.4          3rd Qu.:7.600  
##  294.3  :  2            Max.   :300.00   Max.   :987.5          Max.   :9.200  
##  (Other):595                                                                   
##  MovieLens.Rating Overseas.Revenue Overseas.Revenue.Percentage Profit.Millions
##  Min.   :1.490    Min.   : 46.9    Min.   : 17.2               Min.   : 19.9  
##  1st Qu.:3.038    1st Qu.:135.5    1st Qu.: 49.9               1st Qu.:180.7  
##  Median :3.365    Median :189.0    Median : 58.2               Median :245.2  
##  Mean   :3.340    Mean   :239.5    Mean   : 57.7               Mean   :302.5  
##  3rd Qu.:3.672    3rd Qu.:281.9    3rd Qu.: 66.3               3rd Qu.:366.3  
##  Max.   :4.500    Max.   :960.5    Max.   :100.0               Max.   :966.2  
##                                                                               
##  Profit.Percentage    Runtime      Gross.US.Revenue US.Revenue.Percentage
##  Min.   :    7.7   Min.   : 30.0   Min.   :  0.0    Min.   : 0.0         
##  1st Qu.:  201.8   1st Qu.:100.0   1st Qu.:107.0    1st Qu.:33.7         
##  Median :  338.6   Median :116.0   Median :141.7    Median :41.8         
##  Mean   :  719.3   Mean   :117.8   Mean   :167.1    Mean   :42.3         
##  3rd Qu.:  650.1   3rd Qu.:130.2   3rd Qu.:202.1    3rd Qu.:50.1         
##  Max.   :41333.3   Max.   :238.0   Max.   :760.5    Max.   :82.8         
## 
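
summary() prints an "NA's" row only for columns that still have missing values, so its absence above is the confirmation. A more direct check is to count the NAs per column. The sketch below uses a made-up data frame `toy`, not the movie data.

```r
# Made-up data frame with one missing value.
toy <- data.frame(revenue = c(202.1, NA, 211.8),
                  rating  = c(6.7, 6.6, 5.8))

# Number of missing values in each column, before imputation.
na_before <- colSums(is.na(toy))
print(na_before)

# Fill the gap with the column median, then count again.
toy$revenue[is.na(toy$revenue)] <- median(toy$revenue, na.rm = TRUE)
na_after <- colSums(is.na(toy))
print(na_after)
```

For the real data, `colSums(is.na(mov))` would give the same kind of per-column count, and `anyNA(mov)` a single TRUE/FALSE answer.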

A cool insight:

ggplot(data=mov, aes(x=Day.of.Week, fill= Day.of.Week)) + geom_bar(color="Black")

Most movies (448 of 608) were released on a Friday, and none were released on a Monday. Perhaps Monday is not a good day for business.
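
The bar chart can be backed up with a plain count table. The sketch below rebuilds the Day.of.Week column from the counts printed by summary(mov) above and adds Monday as an explicit factor level so that its zero count shows up. The `week` ordering is my own choice for illustration.

```r
# Release-day counts as printed by summary(mov).
day_counts <- c(Friday = 448, Saturday = 3, Sunday = 1,
                Thursday = 27, Tuesday = 10, Wednesday = 119)

# Rebuild one entry per movie, then declare all seven weekdays as levels.
week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
days <- factor(rep(names(day_counts), times = day_counts), levels = week)

counts <- table(days)                          # Monday appears with a count of 0
shares <- round(100 * prop.table(counts), 1)   # percentage of all releases
print(counts)
print(shares)
```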


Filtering:

We are only interested in a few specific genres and studios. Let’s create filters for them.

filt1 <- (mov$Genre == "action") | (mov$Genre == "adventure") | (mov$Genre == "animation") | (mov$Genre == "comedy") | (mov$Genre == "drama")

filt2 <- (mov$Studio == "Buena Vista Studios") | (mov$Studio == "WB") | (mov$Studio == "Fox") | (mov$Studio == "Universal") | (mov$Studio == "Sony") | (mov$Studio == "Paramount Pictures")

mov2 <- mov[filt1 & filt2,]
head(mov2)
##   Day.of.Week                Director     Genre      Movie.Title Release.Date
## 1      Friday               Brad Bird    action     Tomorrowland   22/05/2015
## 2      Friday             Scott Waugh    action   Need for Speed   14/03/2014
## 4      Friday Phil Lord, Chris Miller    comedy   21 Jump Street   16/03/2012
## 5      Friday         Roland Emmerich    action White House Down   28/06/2013
## 6      Friday              David Ayer    action             Fury   17/10/2014
## 7    Thursday            Rob Marshall adventure   Into the Woods   25/12/2014
##                Studio Adjusted.Gross...mill. Budget.Millions
## 1 Buena Vista Studios                  202.1             170
## 2 Buena Vista Studios                  204.2              66
## 4                Sony                  208.8              42
## 5                Sony                  209.7             150
## 6                Sony                  212.8              80
## 7 Buena Vista Studios                  213.9              50
##   Gross.Revenue.Millions Audience.Rating MovieLens.Rating Overseas.Revenue
## 1                  202.1             6.7             3.26            111.9
## 2                  203.3             6.6             2.97            159.7
## 4                  201.6             7.2             3.62             63.1
## 5                  205.4             8.0             3.65            132.3
## 6                  211.8             5.8             2.85            126.0
## 7                  212.9             6.0             3.16             84.9
##   Overseas.Revenue.Percentage Profit.Millions Profit.Percentage Runtime
## 1                        55.4            32.1              18.9     130
## 2                        78.6           137.3             208.0     132
## 4                        31.3           159.6             380.0     109
## 5                        64.4            55.4              36.9     131
## 6                        59.5           131.8             164.8     134
## 7                        39.9           162.9             325.8     125
##   Gross.US.Revenue US.Revenue.Percentage
## 1             90.2                  44.6
## 2             43.6                  21.4
## 4            138.4                  68.7
## 5             73.1                  35.6
## 6             85.8                  40.5
## 7            128.0                  60.1
dim(mov2)
## [1] 423  18
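
The two filters above can also be written more compactly with %in%, which avoids the long chains of `|`. This is a sketch on a made-up data frame `toy`; the "horror" genre value is hypothetical and only there to show a row being dropped.

```r
# Made-up rows; "horror" is a hypothetical genre used only for illustration.
toy <- data.frame(
  Genre  = factor(c("action", "horror", "comedy", "drama", "action")),
  Studio = factor(c("Sony", "Fox", "Lionsgate", "WB", "Universal"))
)

genres  <- c("action", "adventure", "animation", "comedy", "drama")
studios <- c("Buena Vista Studios", "WB", "Fox",
             "Universal", "Sony", "Paramount Pictures")

# Keep rows whose genre AND studio are both in the lists of interest.
keep <- toy$Genre %in% genres & toy$Studio %in% studios

# droplevels() removes factor levels that no longer occur after filtering.
toy2 <- droplevels(toy[keep, ])
print(toy2)
```

Unlike `==`, %in% returns FALSE rather than NA for missing values, which makes no difference here because Genre and Studio have no NAs.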

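The new data frame has 423 rows. Next we drop the columns we will not use: Day.of.Week, Director, Movie.Title, Release.Date and Adjusted.Gross...mill. (columns 1, 2, 4, 5 and 7).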

mov2 <- mov2[, -c(1:2, 4:5, 7)]
head(mov2)
##       Genre              Studio Budget.Millions Gross.Revenue.Millions
## 1    action Buena Vista Studios             170                  202.1
## 2    action Buena Vista Studios              66                  203.3
## 4    comedy                Sony              42                  201.6
## 5    action                Sony             150                  205.4
## 6    action                Sony              80                  211.8
## 7 adventure Buena Vista Studios              50                  212.9
##   Audience.Rating MovieLens.Rating Overseas.Revenue Overseas.Revenue.Percentage
## 1             6.7             3.26            111.9                        55.4
## 2             6.6             2.97            159.7                        78.6
## 4             7.2             3.62             63.1                        31.3
## 5             8.0             3.65            132.3                        64.4
## 6             5.8             2.85            126.0                        59.5
## 7             6.0             3.16             84.9                        39.9
##   Profit.Millions Profit.Percentage Runtime Gross.US.Revenue
## 1            32.1              18.9     130             90.2
## 2           137.3             208.0     132             43.6
## 4           159.6             380.0     109            138.4
## 5            55.4              36.9     131             73.1
## 6           131.8             164.8     134             85.8
## 7           162.9             325.8     125            128.0
##   US.Revenue.Percentage
## 1                  44.6
## 2                  21.4
## 4                  68.7
## 5                  35.6
## 6                  40.5
## 7                  60.1
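
Dropping columns by position works, but it silently removes the wrong columns if the column order ever changes. A sketch of the same step by name, on a one-row data frame `toy` built from the first row shown earlier (only a subset of the columns is included):

```r
# One-row data frame with a subset of the original columns.
toy <- data.frame(
  Day.of.Week = "Friday", Director = "Brad Bird", Genre = "action",
  Movie.Title = "Tomorrowland", Release.Date = "22/05/2015",
  Studio = "Buena Vista Studios", Adjusted.Gross...mill. = "202.1",
  Budget.Millions = 170
)

# Columns to remove, referenced by name instead of position.
drop_cols <- c("Day.of.Week", "Director", "Movie.Title",
               "Release.Date", "Adjusted.Gross...mill.")

toy2 <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy2))
```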

Number of movies by genre:

ggplot(data=mov2, aes(x=Genre, fill=Genre)) + geom_bar(color="Black")

Action movies clearly outnumber every other genre. Next, let’s look at how the movies are distributed by budget.


Movie distribution by budget:

p <- ggplot(data=mov2, aes(x= Budget.Millions))
q <- p+ geom_histogram(binwidth = 10, aes(fill=Genre), color='Black')
q + xlab("Budget Millions")+
  ylab("Number of Movies")+
  ggtitle('Movie Budget Distribution')+
  theme(axis.title.x = element_text(colour="DarkRed", size=16),
        axis.title.y = element_text(colour="DarkBlue", size=16),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        legend.title = element_text(size=16),
        legend.text = element_text(size=12),
        legend.position = c(1,1),
        legend.justification = c(1,1),
        plot.title = element_text(hjust=0.5, colour="DarkGreen",
                                  size=17))

Most movie budgets are below $150 million. A handful of movies exceed $200 million in budget, and these are mostly action movies.
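
Statements read off a histogram can be checked with a couple of one-liners. The sketch below uses only the six budgets shown by head(mov2) above, so the numbers it prints describe those six rows, not the full data set.

```r
# Budgets (in millions) of the six rows printed by head(mov2).
budget <- c(170, 66, 42, 150, 80, 50)

# Share of movies with a budget of at most $150 million.
share_within_150 <- mean(budget <= 150)

# Number of movies with a budget above $200 million.
n_over_200 <- sum(budget > 200)

print(share_within_150)
print(n_over_200)
```

On the full data the same checks would be `mean(mov2$Budget.Millions <= 150)` and `table(mov2$Genre[mov2$Budget.Millions > 200])`, the latter showing which genres the most expensive movies belong to.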


Overseas revenue by genre using boxplots:

r <- ggplot(data=mov2, aes(x=Genre, y=Overseas.Revenue))
s <- r + 
  geom_jitter(aes( colour=Studio)) + 
  geom_boxplot(alpha = 0.7, outlier.colour = NA) +
  xlab("Genre") + 
  ylab("Overseas Revenue") + 
  ggtitle("Overseas Revenue by Genre") + 
  theme(
    axis.title.x = element_text(colour="DarkRed", size=15),
    axis.title.y = element_text(colour="DarkBlue", size=15),
    axis.text.x = element_text(size=14),
    axis.text.y = element_text(size=14),  
    plot.title = element_text(hjust=0.5, colour="DarkGreen",
                                  size=17),
    
    legend.title = element_text(size=14),
    legend.text = element_text(size=10)
  )
s

We see that:

  • Adventure movies have the largest variability in overseas revenue. They also have the highest median.
  • Comedy movies perform more consistently than any other genre, with a median of about $130 million.
  • Action and animation movies have outliers (in this case, movies that overperformed) and share about the same median.
  • Drama movies have almost the same median as comedy movies.
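
The medians and spreads read off the boxplot can also be computed directly. The sketch below does this with tapply() on the six rows shown by head(mov2) above, so its output describes only those rows and will not match the plot for the full data.

```r
# Genre and overseas revenue (in millions) of the six rows printed by head(mov2).
toy <- data.frame(
  Genre = c("action", "action", "comedy", "action", "action", "adventure"),
  Overseas.Revenue = c(111.9, 159.7, 63.1, 132.3, 126.0, 84.9)
)

# Median and interquartile range of overseas revenue per genre.
med    <- tapply(toy$Overseas.Revenue, toy$Genre, median)
spread <- tapply(toy$Overseas.Revenue, toy$Genre, IQR)

print(med)
print(spread)
```

Replacing `toy` with `mov2` gives the per-genre numbers behind the boxplot: the median is the line inside each box and the IQR is the height of the box.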