This project is an exploratory analysis of the Movies_Dataset: we first clean the dataset and then explore relationships between selected variables.
library(tidyverse)
## ── Attaching packages ───────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
mov <- read.csv("Movies_Dataset.csv")
dim(mov)
## [1] 608 18
The dataset contains 608 rows and 18 columns.
head(mov)
## Day.of.Week Director Genre Movie.Title Release.Date
## 1 Friday Brad Bird action Tomorrowland 22/05/2015
## 2 Friday Scott Waugh action Need for Speed 14/03/2014
## 3 Friday Patrick Hughes action The Expendables 3 15/08/2014
## 4 Friday Phil Lord, Chris Miller comedy 21 Jump Street 16/03/2012
## 5 Friday Roland Emmerich action White House Down 28/06/2013
## 6 Friday David Ayer action Fury 17/10/2014
## Studio Adjusted.Gross...mill. Budget...mill. Gross...mill.
## 1 Buena Vista Studios 202.1 170 202.1
## 2 Buena Vista Studios 204.2 66 203.3
## 3 Lionsgate 207.1 100 206.2
## 4 Sony 208.8 42 201.6
## 5 Sony 209.7 150 205.4
## 6 Sony 212.8 80 211.8
## IMDb.Rating MovieLens.Rating Overseas...mill. Overseas. Profit...mill.
## 1 6.7 3.26 111.9 55.4 32.1
## 2 6.6 2.97 159.7 78.6 137.3
## 3 6.1 2.93 166.9 80.9 106.2
## 4 7.2 3.62 63.1 31.3 159.6
## 5 8.0 3.65 132.3 64.4 55.4
## 6 5.8 2.85 126 59.5 131.8
## Profit. Runtime..min. US...mill. Gross...US
## 1 18.9 130 90.2 44.6
## 2 208.0 132 43.6 21.4
## 3 106.2 126 39.3 19.1
## 4 380.0 109 138.4 68.7
## 5 36.9 131 73.1 35.6
## 6 164.8 134 85.8 40.5
The column names look messy. Let’s go ahead and change them.
mov <- mov %>%
  rename(
    Budget.Millions             = Budget...mill.,
    Gross.Revenue.Millions      = Gross...mill.,
    Audience.Rating             = IMDb.Rating,
    Runtime                     = Runtime..min.,
    Profit.Millions             = Profit...mill.,
    Profit.Percentage           = Profit.,
    Overseas.Revenue            = Overseas...mill.,
    Overseas.Revenue.Percentage = Overseas.,
    Gross.US.Revenue            = US...mill.,
    US.Revenue.Percentage       = Gross...US
  )
colnames(mov)
## [1] "Day.of.Week" "Director"
## [3] "Genre" "Movie.Title"
## [5] "Release.Date" "Studio"
## [7] "Adjusted.Gross...mill." "Budget.Millions"
## [9] "Gross.Revenue.Millions" "Audience.Rating"
## [11] "MovieLens.Rating" "Overseas.Revenue"
## [13] "Overseas.Revenue.Percentage" "Profit.Millions"
## [15] "Profit.Percentage" "Runtime"
## [17] "Gross.US.Revenue" "US.Revenue.Percentage"
Let’s look at the structure of the dataset.
str(mov)
## 'data.frame': 608 obs. of 18 variables:
## $ Day.of.Week : Factor w/ 6 levels "Friday","Saturday",..: 1 1 1 1 1 1 4 1 1 1 ...
## $ Director : Factor w/ 337 levels "Aaron Blaise, Robert A. Walker",..: 31 297 233 256 287 76 276 71 108 126 ...
## $ Genre : Factor w/ 15 levels "action","adventure",..: 1 1 1 5 1 1 2 1 1 10 ...
## $ Movie.Title : Factor w/ 608 levels "10,000 B.C.",..: 557 314 466 6 592 161 233 378 128 331 ...
## $ Release.Date : Factor w/ 534 levels "1/05/2009","1/05/2015",..: 273 86 121 134 384 159 347 16 28 257 ...
## $ Studio : Factor w/ 36 levels "Art House Studios",..: 2 2 11 25 25 25 2 31 31 20 ...
## $ Adjusted.Gross...mill. : Factor w/ 585 levels "1,003","1,020",..: 50 51 52 53 54 55 56 57 58 59 ...
## $ Budget.Millions : num 170 66 100 42 150 80 50 85 70 5 ...
## $ Gross.Revenue.Millions : Factor w/ 561 levels "1,004.60","1,017",..: 30 33 43 27 40 59 63 49 72 45 ...
## $ Audience.Rating : num 6.7 6.6 6.1 7.2 8 5.8 6 6.8 6.3 5.9 ...
## $ MovieLens.Rating : num 3.26 2.97 2.93 3.62 3.65 2.85 3.16 3.45 2.92 2.9 ...
## $ Overseas.Revenue : Factor w/ 551 levels "1,160.60","1,528.10",..: 32 151 172 490 82 66 528 523 150 11 ...
## $ Overseas.Revenue.Percentage: num 55.4 78.6 80.9 31.3 64.4 59.5 39.9 39.3 73.9 49.8 ...
## $ Profit.Millions : Factor w/ 566 levels "1,015.40","1,025.90",..: 366 47 13 94 494 39 100 28 69 189 ...
## $ Profit.Percentage : num 18.9 208 106.2 380 36.9 ...
## $ Runtime : int 130 132 126 109 131 134 125 115 92 84 ...
## $ Gross.US.Revenue : num 90.2 43.6 39.3 138.4 73.1 ...
## $ US.Revenue.Percentage : num 44.6 21.4 19.1 68.7 35.6 40.5 60.1 60.7 26.1 50.2 ...
Notice that some columns (e.g. Gross.Revenue.Millions) are recognized as categorical variables (factors) because some of their values use a comma as a thousands separator. Let’s convert them into numeric variables.
mov$Gross.Revenue.Millions <- as.numeric(as.character(mov$Gross.Revenue.Millions))
## Warning: NAs introduced by coercion
mov$Overseas.Revenue <- as.numeric(as.character(mov$Overseas.Revenue ))
## Warning: NAs introduced by coercion
mov$Profit.Millions <- as.numeric(as.character(mov$Profit.Millions))
## Warning: NAs introduced by coercion
str(mov)
## 'data.frame': 608 obs. of 18 variables:
## $ Day.of.Week : Factor w/ 6 levels "Friday","Saturday",..: 1 1 1 1 1 1 4 1 1 1 ...
## $ Director : Factor w/ 337 levels "Aaron Blaise, Robert A. Walker",..: 31 297 233 256 287 76 276 71 108 126 ...
## $ Genre : Factor w/ 15 levels "action","adventure",..: 1 1 1 5 1 1 2 1 1 10 ...
## $ Movie.Title : Factor w/ 608 levels "10,000 B.C.",..: 557 314 466 6 592 161 233 378 128 331 ...
## $ Release.Date : Factor w/ 534 levels "1/05/2009","1/05/2015",..: 273 86 121 134 384 159 347 16 28 257 ...
## $ Studio : Factor w/ 36 levels "Art House Studios",..: 2 2 11 25 25 25 2 31 31 20 ...
## $ Adjusted.Gross...mill. : Factor w/ 585 levels "1,003","1,020",..: 50 51 52 53 54 55 56 57 58 59 ...
## $ Budget.Millions : num 170 66 100 42 150 80 50 85 70 5 ...
## $ Gross.Revenue.Millions : num 202 203 206 202 205 ...
## $ Audience.Rating : num 6.7 6.6 6.1 7.2 8 5.8 6 6.8 6.3 5.9 ...
## $ MovieLens.Rating : num 3.26 2.97 2.93 3.62 3.65 2.85 3.16 3.45 2.92 2.9 ...
## $ Overseas.Revenue : num 111.9 159.7 166.9 63.1 132.3 ...
## $ Overseas.Revenue.Percentage: num 55.4 78.6 80.9 31.3 64.4 59.5 39.9 39.3 73.9 49.8 ...
## $ Profit.Millions : num 32.1 137.3 106.2 159.6 55.4 ...
## $ Profit.Percentage : num 18.9 208 106.2 380 36.9 ...
## $ Runtime : int 130 132 126 109 131 134 125 115 92 84 ...
## $ Gross.US.Revenue : num 90.2 43.6 39.3 138.4 73.1 ...
## $ US.Revenue.Percentage : num 44.6 21.4 19.1 68.7 35.6 40.5 60.1 60.7 26.1 50.2 ...
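The coercion warnings above appear because as.numeric() cannot parse strings such as "1,004.60". As a minimal sketch of an alternative, assuming the commas are purely thousands separators, the commas can be stripped before converting. The vector `x` and the helper `to_number` below are hypothetical and for illustration only; the rest of this analysis keeps the conversion shown above.

```r
# Toy factor mimicking a revenue column read with thousands separators
# (illustrative values only, not rows from Movies_Dataset.csv).
x <- factor(c("202.1", "1,004.60", "987.5"))

# Hypothetical helper: remove every comma, then convert the strings to numbers.
to_number <- function(v) {
  as.numeric(gsub(",", "", as.character(v), fixed = TRUE))
}

x_num <- to_number(x)
print(x_num)
print(anyNA(x_num))  # checks whether any value failed to parse
```

With this approach the large values are kept rather than turned into NA, so no imputation would be needed for them.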
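We see that the columns Gross.Revenue.Millions, Overseas.Revenue and Profit.Millions now contain missing values: as.numeric() could not parse the entries that contain a comma (for example "1,004.60"), so they were coerced to NA. We can handle this situation in two ways.
I am going to stick with the second choice here, which is to replace each missing value with the median of its column.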
mov <- mov %>%
  mutate(Gross.Revenue.Millions = replace(Gross.Revenue.Millions,
                                          is.na(Gross.Revenue.Millions),
                                          median(Gross.Revenue.Millions, na.rm = TRUE)))
mov <- mov %>%
  mutate(Overseas.Revenue = replace(Overseas.Revenue,
                                    is.na(Overseas.Revenue),
                                    median(Overseas.Revenue, na.rm = TRUE)))
mov <- mov %>%
  mutate(Profit.Millions = replace(Profit.Millions,
                                   is.na(Profit.Millions),
                                   median(Profit.Millions, na.rm = TRUE)))
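The three mutate() calls above repeat the same pattern. One possible way to express it once, sketched in base R, is a small helper applied to several columns. The helper name `impute_median` and the data frame `toy` are assumptions made for this example and are not part of the original analysis.

```r
# Hypothetical helper: replace the NAs in a numeric vector with the
# median of its observed (non-missing) values.
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Toy data frame, used only for illustration.
toy <- data.frame(a = c(1, NA, 3, 10), b = c(NA, 2, 4, NA))

# Apply the helper to the chosen columns in one step.
cols <- c("a", "b")
toy[cols] <- lapply(toy[cols], impute_median)
print(toy)
```

On the real data, `cols` would hold the three converted column names.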
Let’s check back to make sure there are no more missing values.
summary(mov)
## Day.of.Week Director Genre
## Friday :448 Steven Spielberg: 19 action :236
## Saturday : 3 Robert Zemeckis : 9 animation: 97
## Sunday : 1 Michael Bay : 8 comedy : 91
## Thursday : 27 Peter Jackson : 7 drama : 52
## Tuesday : 10 Ridley Scott : 7 adventure: 50
## Wednesday:119 Tim Burton : 7 sci-fi : 16
## (Other) :551 (Other) : 66
## Movie.Title Release.Date Studio
## 10,000 B.C. : 1 25/12/2008: 4 Buena Vista Studios: 93
## 101 Dalmatians : 1 1/07/2009 : 3 WB : 93
## 101 Dalmatians (1996): 1 16/12/2011: 3 Fox : 85
## 2 Fast 2 Furious : 1 19/11/1999: 3 Universal : 79
## 2012 : 1 1/05/2009 : 2 Sony : 65
## 21 Jump Street : 1 10/06/2005: 2 Paramount Pictures : 62
## (Other) :602 (Other) :591 (Other) :131
## Adjusted.Gross...mill. Budget.Millions Gross.Revenue.Millions Audience.Rating
## 296 : 3 Min. : 0.60 Min. :200.3 Min. :3.600
## 231 : 2 1st Qu.: 45.00 1st Qu.:246.6 1st Qu.:6.375
## 269.4 : 2 Median : 80.00 Median :320.4 Median :6.900
## 274 : 2 Mean : 92.47 Mean :378.6 Mean :6.924
## 280 : 2 3rd Qu.:130.00 3rd Qu.:444.4 3rd Qu.:7.600
## 294.3 : 2 Max. :300.00 Max. :987.5 Max. :9.200
## (Other):595
## MovieLens.Rating Overseas.Revenue Overseas.Revenue.Percentage Profit.Millions
## Min. :1.490 Min. : 46.9 Min. : 17.2 Min. : 19.9
## 1st Qu.:3.038 1st Qu.:135.5 1st Qu.: 49.9 1st Qu.:180.7
## Median :3.365 Median :189.0 Median : 58.2 Median :245.2
## Mean :3.340 Mean :239.5 Mean : 57.7 Mean :302.5
## 3rd Qu.:3.672 3rd Qu.:281.9 3rd Qu.: 66.3 3rd Qu.:366.3
## Max. :4.500 Max. :960.5 Max. :100.0 Max. :966.2
##
## Profit.Percentage Runtime Gross.US.Revenue US.Revenue.Percentage
## Min. : 7.7 Min. : 30.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 201.8 1st Qu.:100.0 1st Qu.:107.0 1st Qu.:33.7
## Median : 338.6 Median :116.0 Median :141.7 Median :41.8
## Mean : 719.3 Mean :117.8 Mean :167.1 Mean :42.3
## 3rd Qu.: 650.1 3rd Qu.:130.2 3rd Qu.:202.1 3rd Qu.:50.1
## Max. :41333.3 Max. :238.0 Max. :760.5 Max. :82.8
##
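summary() only prints an NA's row for columns that contain missing values, so the absence of such a row above suggests the imputation worked. A more direct check is to count the NAs per column. This is a minimal sketch on a made-up data frame `toy`; on the real data the same call would be applied to mov.

```r
# Toy data frame with one missing value in each column (illustrative only).
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", NA),
                  stringsAsFactors = FALSE)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```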
ggplot(data=mov, aes(x=Day.of.Week, fill= Day.of.Week)) + geom_bar(color="Black")
We see that no movies in the dataset were released on a Monday. Perhaps Monday is not good for business.
We are only interested in a few specific genres and studios. Let’s create filters for them.
filt1 <- mov$Genre %in% c("action", "adventure", "animation", "comedy", "drama")
filt2 <- mov$Studio %in% c("Buena Vista Studios", "WB", "Fox", "Universal", "Sony", "Paramount Pictures")
mov2 <- mov[filt1 & filt2, ]
head(mov2)
## Day.of.Week Director Genre Movie.Title Release.Date
## 1 Friday Brad Bird action Tomorrowland 22/05/2015
## 2 Friday Scott Waugh action Need for Speed 14/03/2014
## 4 Friday Phil Lord, Chris Miller comedy 21 Jump Street 16/03/2012
## 5 Friday Roland Emmerich action White House Down 28/06/2013
## 6 Friday David Ayer action Fury 17/10/2014
## 7 Thursday Rob Marshall adventure Into the Woods 25/12/2014
## Studio Adjusted.Gross...mill. Budget.Millions
## 1 Buena Vista Studios 202.1 170
## 2 Buena Vista Studios 204.2 66
## 4 Sony 208.8 42
## 5 Sony 209.7 150
## 6 Sony 212.8 80
## 7 Buena Vista Studios 213.9 50
## Gross.Revenue.Millions Audience.Rating MovieLens.Rating Overseas.Revenue
## 1 202.1 6.7 3.26 111.9
## 2 203.3 6.6 2.97 159.7
## 4 201.6 7.2 3.62 63.1
## 5 205.4 8.0 3.65 132.3
## 6 211.8 5.8 2.85 126.0
## 7 212.9 6.0 3.16 84.9
## Overseas.Revenue.Percentage Profit.Millions Profit.Percentage Runtime
## 1 55.4 32.1 18.9 130
## 2 78.6 137.3 208.0 132
## 4 31.3 159.6 380.0 109
## 5 64.4 55.4 36.9 131
## 6 59.5 131.8 164.8 134
## 7 39.9 162.9 325.8 125
## Gross.US.Revenue US.Revenue.Percentage
## 1 90.2 44.6
## 2 43.6 21.4
## 4 138.4 68.7
## 5 73.1 35.6
## 6 85.8 40.5
## 7 128.0 60.1
dim(mov2)
## [1] 423 18
The new dataframe has 423 rows. We will get rid of some columns below.
mov2 <- mov2[, -c(1:2, 4:5, 7)]
head(mov2)
## Genre Studio Budget.Millions Gross.Revenue.Millions
## 1 action Buena Vista Studios 170 202.1
## 2 action Buena Vista Studios 66 203.3
## 4 comedy Sony 42 201.6
## 5 action Sony 150 205.4
## 6 action Sony 80 211.8
## 7 adventure Buena Vista Studios 50 212.9
## Audience.Rating MovieLens.Rating Overseas.Revenue Overseas.Revenue.Percentage
## 1 6.7 3.26 111.9 55.4
## 2 6.6 2.97 159.7 78.6
## 4 7.2 3.62 63.1 31.3
## 5 8.0 3.65 132.3 64.4
## 6 5.8 2.85 126.0 59.5
## 7 6.0 3.16 84.9 39.9
## Profit.Millions Profit.Percentage Runtime Gross.US.Revenue
## 1 32.1 18.9 130 90.2
## 2 137.3 208.0 132 43.6
## 4 159.6 380.0 109 138.4
## 5 55.4 36.9 131 73.1
## 6 131.8 164.8 134 85.8
## 7 162.9 325.8 125 128.0
## US.Revenue.Percentage
## 1 44.6
## 2 21.4
## 4 68.7
## 5 35.6
## 6 40.5
## 7 60.1
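Dropping columns by position, as above, depends on the column order staying the same. The indices 1, 2, 4, 5 and 7 correspond to Day.of.Week, Director, Movie.Title, Release.Date and Adjusted.Gross...mill. A minimal sketch of dropping by name instead, on a small made-up data frame `toy` (the variable names `drop_cols` and `toy2` are also just for this example):

```r
# Toy data frame with a few of the dataset's column names (one illustrative row).
toy <- data.frame(Day.of.Week = "Friday",
                  Director = "Brad Bird",
                  Genre = "action",
                  Budget.Millions = 170,
                  stringsAsFactors = FALSE)

# Names of the columns to drop; this keeps working if the column order changes.
drop_cols <- c("Day.of.Week", "Director")

# Keep every column whose name is not in drop_cols.
toy2 <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy2))
```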
ggplot(data=mov2, aes(x=Genre, fill=Genre)) + geom_bar(color="Black")
Action movies clearly outnumber the other genres. Let’s check out the number of movies by budget.
p <- ggplot(data = mov2, aes(x = Budget.Millions))
q <- p + geom_histogram(binwidth = 10, aes(fill = Genre), color = "Black")
q + xlab("Budget Millions") +
  ylab("Number of Movies") +
  ggtitle("Movie Budget Distribution") +
  theme(axis.title.x = element_text(colour = "DarkRed", size = 16),
        axis.title.y = element_text(colour = "DarkBlue", size = 16),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 12),
        legend.position = c(1, 1),
        legend.justification = c(1, 1),
        plot.title = element_text(hjust = 0.5, colour = "DarkGreen",
                                  size = 17))
Most of the movie budgets are within $150 million. There are a handful of movies whose budgets exceed $200 million, and they are mostly action movies.
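Reading such figures off a histogram is approximate. As a rough sketch, the same two statements can be checked numerically. The vectors `budget` and `genre` below are invented for illustration; on the real data they would be mov2$Budget.Millions and mov2$Genre.

```r
# Hypothetical budgets (in millions) and genres, for illustration only.
budget <- c(42, 66, 80, 150, 170, 250)
genre  <- c("comedy", "action", "action", "action", "action", "action")

# Share of movies with a budget of at most $150 million:
# the mean of a logical vector is the proportion of TRUE values.
share_within_150 <- mean(budget <= 150)

# Genre breakdown of the movies whose budget exceeds $200 million.
over_200 <- table(genre[budget > 200])

print(share_within_150)
print(over_200)
```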
r <- ggplot(data = mov2, aes(x = Genre, y = Overseas.Revenue))
s <- r +
  geom_jitter(aes(colour = Studio)) +
  geom_boxplot(alpha = 0.7, outlier.colour = NA) +
  xlab("Genre") +
  ylab("Overseas Revenue") +
  ggtitle("Overseas Revenue by Genre") +
  theme(
    axis.title.x = element_text(colour = "DarkRed", size = 15),
    axis.title.y = element_text(colour = "DarkBlue", size = 15),
    axis.text.x = element_text(size = 14),
    axis.text.y = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, colour = "DarkGreen",
                              size = 17),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 10)
  )
s
We see that,