#required packages
library(readr)
library(foreign)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(validate)
##
## Attaching package: 'validate'
## The following object is masked from 'package:dplyr':
##
## expr
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:validate':
##
## expr
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:validate':
##
## label, label<-
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(stringr)
library(ggplot2)
library(editrules)
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following object is masked from 'package:validate':
##
## compare
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
##
## Attaching package: 'editrules'
## The following objects are masked from 'package:igraph':
##
## blocks, normalize
## The following objects are masked from 'package:tidyr':
##
## contains, separate
## The following object is masked from 'package:dplyr':
##
## contains
library(outliers)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
In this analysis, three datasets were used . The datasets are “The movies Dataset”, " Movie Information Dataset" and“IMDB Movies Extensive Dataset”. The datasets were read by using read_csv(). To understand the data structure and data types, glimpse() function was used. Prior to mutating join, columns were renamed and required variables were selected by using select(). Two movies datasets were merged by imdbID and Title using left_join(). Followed by rating dataset was merged by imdbID using inner_join(). After imputing missing values, the obvious errors were assessed by using editset(). The NetIncome variable was created using mutate(). The outliers were detected by using boxplots and were capped. The capping was not succesful as outliers were still observed. The distribution of NetIncome was determined by using reciprocal transformation.
The Movies Dataset
The Movies Dataset was taken from https://www.kaggle.com/rounakbanik/the-movies-dataset?select=links.csv, containing 45,000 movies released on or before July 2017. The dataset has 26 million rating from 270,000 users for 45,000 movies. Ratings are on a scale of 1-5 that are obtained from the official GroupLens Website.
Variables
-IMDB_ID: The identifier of the movie in IMDB -Title: The title of the movie -Popularity: The popularity of the movie based on the voting -Revenue: Income from movie -Budget: Movie production cost
Movie Information Dataset
The Movie Information Dataset was taken from https://www.kaggle.com/beyjin/movies-1990-to-2017?select=Movie_Movies.csv, consists of 178687 observations and 18 variables. However , only 4 features were used
Variables
-IMDB_ID : The identifier of the movie in IMDB -Title: The title of the movie -Type: The type of the production -Country: The production Country
IMDB Movies Extensive Dataset
The IMDB Dataset was taken from https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+ratings.csv. The IMDB is the most popular movie website and it combines movie plot description, rating, reviews, release dates and many more aspects. There were four datasets provided by IMDB Dataset but only IMDB rating dataset was used for this analysis. The IMDB rating dataset includes 85,855 rating details from demographic perspective.
Variables
IMDB_Title_ID: The identifier of the movie in IMDB Weighted Average Vote:Total weighted average rating Total Vote: The totle votes received Mean Vote: The total mean vote *Median Vote: The total median vote
##Reading and Understanding Data The three datasets used in this analysis were imported using readar function, read_CSV().
#Importing the Movies Dataset
movie <- read_csv("movies_metadata.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## adult = col_logical(),
## budget = col_double(),
## id = col_double(),
## popularity = col_double(),
## release_date = col_date(format = ""),
## revenue = col_double(),
## runtime = col_double(),
## video = col_logical(),
## vote_average = col_double(),
## vote_count = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 19 parsing failures.
## row col expected actual file
## 19730 NA 24 columns 10 columns 'movies_metadata.csv'
## 19731 adult 1/0/T/F/TRUE/FALSE - Written by Ørnås 'movies_metadata.csv'
## 19731 budget a double /ff9qCepilowshEtG2GYWwzt2bs4.jpg 'movies_metadata.csv'
## 19731 id no trailing characters -08-20 'movies_metadata.csv'
## 19731 release_date date like 1 'movies_metadata.csv'
## ..... ............ ...................... ................................ .....................
## See problems(...) for more details.
head(movie)
## # A tibble: 6 x 24
## adult belongs_to_coll~ budget genres homepage id imdb_id original_langua~
## <lgl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 FALSE {'id': 10194, '~ 3.00e7 [{'id~ http://~ 862 tt0114~ en
## 2 FALSE <NA> 6.50e7 [{'id~ <NA> 8844 tt0113~ en
## 3 FALSE {'id': 119050, ~ 0. [{'id~ <NA> 15602 tt0113~ en
## 4 FALSE <NA> 1.60e7 [{'id~ <NA> 31357 tt0114~ en
## 5 FALSE {'id': 96871, '~ 0. [{'id~ <NA> 11862 tt0113~ en
## 6 FALSE <NA> 6.00e7 [{'id~ <NA> 949 tt0113~ en
## # ... with 16 more variables: original_title <chr>, overview <chr>,
## # popularity <dbl>, poster_path <chr>, production_companies <chr>,
## # production_countries <chr>, release_date <date>, revenue <dbl>,
## # runtime <dbl>, spoken_languages <chr>, status <chr>, tagline <chr>,
## # title <chr>, video <lgl>, vote_average <dbl>, vote_count <dbl>
#Importing Movie Information Dataset
movie2<- read_csv("Movie_Movies.csv")
## Parsed with column specification:
## cols(
## Awards = col_character(),
## Country = col_character(),
## DVD = col_character(),
## Director = col_character(),
## Language = col_character(),
## Plot = col_character(),
## Poster = col_character(),
## Production = col_character(),
## Rated = col_character(),
## Released = col_character(),
## Runtime = col_character(),
## Title = col_character(),
## Type = col_character(),
## Website = col_character(),
## Year = col_double(),
## imdbID = col_character(),
## imdbRating = col_double(),
## imdbVotes = col_number()
## )
## Warning: 4 parsing failures.
## row col expected actual file
## 174977 Year no trailing characters –2005 'Movie_Movies.csv'
## 175430 Year no trailing characters –2011 'Movie_Movies.csv'
## 176138 Year no trailing characters –2013 'Movie_Movies.csv'
## 176359 Year no trailing characters –2016 'Movie_Movies.csv'
head(movie2)
## # A tibble: 6 x 18
## Awards Country DVD Director Language Plot Poster Production Rated Released
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> USA <NA> Rose Cu~ English Rach~ <NA> <NA> <NA> 26 Apr ~
## 2 <NA> USA <NA> James B~ <NA> The ~ <NA> <NA> <NA> <NA>
## 3 <NA> USA <NA> Dimitri~ <NA> <NA> <NA> <NA> <NA> 27 Mar ~
## 4 <NA> USA <NA> Julia H~ English A Gi~ <NA> <NA> <NA> 27 May ~
## 5 <NA> Sri La~ <NA> Udara S~ Sinhale~ <NA> <NA> <NA> <NA> 20 Mar ~
## 6 <NA> USA <NA> Malini ~ English When~ <NA> <NA> <NA> <NA>
## # ... with 8 more variables: Runtime <chr>, Title <chr>, Type <chr>,
## # Website <chr>, Year <dbl>, imdbID <chr>, imdbRating <dbl>, imdbVotes <dbl>
#Importing IMDB Rating Dataset
IMDB <-read_csv("IMDb ratings.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## imdb_title_id = col_character()
## )
## See spec(...) for full column specifications.
head(IMDB)
## # A tibble: 6 x 49
## imdb_title_id weighted_averag~ total_votes mean_vote median_vote votes_10
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 tt0000009 5.9 154 5.9 6 12
## 2 tt0000574 6.1 589 6.3 6 57
## 3 tt0001892 5.8 188 6 6 6
## 4 tt0002101 5.2 446 5.3 5 15
## 5 tt0002130 7 2237 6.9 7 210
## 6 tt0002199 5.7 484 5.8 6 33
## # ... with 43 more variables: votes_9 <dbl>, votes_8 <dbl>, votes_7 <dbl>,
## # votes_6 <dbl>, votes_5 <dbl>, votes_4 <dbl>, votes_3 <dbl>, votes_2 <dbl>,
## # votes_1 <dbl>, allgenders_0age_avg_vote <dbl>, allgenders_0age_votes <dbl>,
## # allgenders_18age_avg_vote <dbl>, allgenders_18age_votes <dbl>,
## # allgenders_30age_avg_vote <dbl>, allgenders_30age_votes <dbl>,
## # allgenders_45age_avg_vote <dbl>, allgenders_45age_votes <dbl>,
## # males_allages_avg_vote <dbl>, males_allages_votes <dbl>,
## # males_0age_avg_vote <dbl>, males_0age_votes <dbl>,
## # males_18age_avg_vote <dbl>, males_18age_votes <dbl>,
## # males_30age_avg_vote <dbl>, males_30age_votes <dbl>,
## # males_45age_avg_vote <dbl>, males_45age_votes <dbl>,
## # females_allages_avg_vote <dbl>, females_allages_votes <dbl>,
## # females_0age_avg_vote <dbl>, females_0age_votes <dbl>,
## # females_18age_avg_vote <dbl>, females_18age_votes <dbl>,
## # females_30age_avg_vote <dbl>, females_30age_votes <dbl>,
## # females_45age_avg_vote <dbl>, females_45age_votes <dbl>,
## # top1000_voters_rating <dbl>, top1000_voters_votes <dbl>,
## # us_voters_rating <dbl>, us_voters_votes <dbl>, non_us_voters_rating <dbl>,
## # non_us_voters_votes <dbl>
glimpse() function is used for checking and understading the datasets.The all three datasets contains multiple data types such as characters and numerics.The required columns are renamed for mutating joins. The three datasets are filtered to show only required variables. As the stringAsFactor argument default (read_csv), the variables are imported as character.The variable type conversion are carried out.
For movie dataset: -45,466 observations and 24 variables -The column “imdb_id” was renamed to “imdbID” -The column “title” was renamed to “Title” -Subsetting variables:imdbID,Title,popularity,budget,revenue
For movie2 dataset:
-178,687 observations and 18 variables -Subsetting variables:imdbID,Title,Type,Country
For IMDB dataset:
-85,855 observations and 49 variables -Subsetting variables:imdbID,weighted_average_vote, total_vote,mean_vote,median_vote
The movie and movie2 are left joined by imdbID and Title. Followed by IMDB dataset is inner joined by imdbID. After mutating joins, the variable“imdbID” is converted to factor as well as “weighted_average_vote”.
glimpse(movie)
## Observations: 45,466
## Variables: 24
## $ adult <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collection'...
## $ budget <dbl> 30000000, 65000000, 0, 16000000, 0, 60000000,...
## $ genres <chr> "[{'id': 16, 'name': 'Animation'}, {'id': 35,...
## $ homepage <chr> "http://toystory.disney.com/toy-story", NA, N...
## $ id <dbl> 862, 8844, 15602, 31357, 11862, 949, 11860, 4...
## $ imdb_id <chr> "tt0114709", "tt0113497", "tt0113228", "tt011...
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en...
## $ original_title <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "...
## $ overview <chr> "Led by Woody, Andy's toys live happily in hi...
## $ popularity <dbl> 21.946943, 17.015539, 11.712900, 3.859495, 8....
## $ poster_path <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vzmL6fP...
## $ production_companies <chr> "[{'name': 'Pixar Animation Studios', 'id': 3...
## $ production_countries <chr> "[{'iso_3166_1': 'US', 'name': 'United States...
## $ release_date <date> 1995-10-30, 1995-12-15, 1995-12-22, 1995-12-...
## $ revenue <dbl> 373554033, 262797249, 0, 81452156, 76578911, ...
## $ runtime <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 106, 13...
## $ spoken_languages <chr> "[{'iso_639_1': 'en', 'name': 'English'}]", "...
## $ status <chr> "Released", "Released", "Released", "Released...
## $ tagline <chr> NA, "Roll the dice and unleash the excitement...
## $ title <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "...
## $ video <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ vote_average <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, 5.5, ...
## $ vote_count <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, 174, ...
glimpse(movie2)
## Observations: 178,687
## Variables: 18
## $ Awards <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Country <chr> "USA", "USA", "USA", "USA", "Sri Lanka", "USA", "UK", "U...
## $ DVD <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Director <chr> "Rose Cummings", "James Byrne", "Dimitri Buchowetzki", "...
## $ Language <chr> "English", NA, NA, "English", "Sinhalese", "English", "E...
## $ Plot <chr> "Rachel constantly hears her baby cry from the bedroom, ...
## $ Poster <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://ima...
## $ Production <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Rated <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Released <chr> "26 Apr 2012", NA, "27 Mar 1926", "27 May 2013", "20 Mar...
## $ Runtime <chr> "20 min", "9 min", "50 min", "2 min", "23 min", "42 min"...
## $ Title <chr> "Baby's Breath", "Winter Trees", "The Crown of Lies", "A...
## $ Type <chr> "movie", "movie", "movie", "movie", "movie", "movie", "m...
## $ Website <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Year <dbl> 2012, 2008, 1926, 2013, 2014, 2014, 2009, 1915, 1966, 20...
## $ imdbID <chr> "tt2268369", "tt1560760", "tt0016750", "tt3405286", "tt3...
## $ imdbRating <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 6.9, NA, NA, NA, NA,...
## $ imdbVotes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 35, NA, NA, NA, NA, ...
glimpse(IMDB)
## Observations: 85,855
## Variables: 49
## $ imdb_title_id <chr> "tt0000009", "tt0000574", "tt0001892", "t...
## $ weighted_average_vote <dbl> 5.9, 6.1, 5.8, 5.2, 7.0, 5.7, 6.8, 6.2, 6...
## $ total_votes <dbl> 154, 589, 188, 446, 2237, 484, 753, 273, ...
## $ mean_vote <dbl> 5.9, 6.3, 6.0, 5.3, 6.9, 5.8, 6.8, 6.2, 7...
## $ median_vote <dbl> 6, 6, 6, 5, 7, 6, 7, 6, 7, 6, 7, 7, 7, 7,...
## $ votes_10 <dbl> 12, 57, 6, 15, 210, 33, 80, 15, 53, 9, 17...
## $ votes_9 <dbl> 4, 18, 6, 8, 225, 15, 65, 8, 7, 6, 23, 14...
## $ votes_8 <dbl> 10, 58, 17, 16, 436, 48, 105, 30, 23, 22,...
## $ votes_7 <dbl> 43, 137, 44, 62, 641, 80, 209, 74, 35, 34...
## $ votes_6 <dbl> 28, 139, 52, 98, 344, 123, 142, 66, 30, 5...
## $ votes_5 <dbl> 28, 103, 32, 117, 169, 77, 80, 40, 24, 40...
## $ votes_4 <dbl> 9, 28, 16, 63, 66, 36, 27, 22, 12, 23, 11...
## $ votes_3 <dbl> 1, 20, 5, 26, 39, 20, 13, 6, 5, 8, 9, 19,...
## $ votes_2 <dbl> 5, 13, 6, 25, 20, 18, 4, 2, 3, 4, 5, 11, ...
## $ votes_1 <dbl> 14, 16, 4, 16, 87, 34, 28, 10, 6, 26, 14,...
## $ allgenders_0age_avg_vote <dbl> 7.2, 6.0, NA, NA, 7.5, NA, 3.0, 6.0, NA, ...
## $ allgenders_0age_votes <dbl> 4, 1, NA, NA, 4, NA, 1, 1, NA, NA, NA, 3,...
## $ allgenders_18age_avg_vote <dbl> 6.0, 6.1, 5.5, 5.3, 7.0, 5.6, 6.4, 5.9, 7...
## $ allgenders_18age_votes <dbl> 38, 114, 25, 23, 402, 35, 59, 34, 28, 11,...
## $ allgenders_30age_avg_vote <dbl> 5.7, 6.0, 5.8, 5.0, 7.0, 5.6, 6.7, 6.2, 6...
## $ allgenders_30age_votes <dbl> 50, 239, 72, 111, 895, 177, 287, 99, 104,...
## $ allgenders_45age_avg_vote <dbl> 6.6, 6.3, 6.2, 5.3, 7.1, 5.8, 7.0, 6.3, 6...
## $ allgenders_45age_votes <dbl> 35, 115, 62, 193, 482, 168, 276, 69, 39, ...
## $ males_allages_avg_vote <dbl> 6.2, 6.1, 5.9, 5.1, 7.0, 5.7, 6.6, 6.2, 6...
## $ males_allages_votes <dbl> 97, 425, 146, 299, 1607, 354, 530, 191, 1...
## $ males_0age_avg_vote <dbl> 7, 6, NA, NA, 8, NA, 3, 6, NA, NA, NA, 8,...
## $ males_0age_votes <dbl> 1, 1, NA, NA, 2, NA, 1, 1, NA, NA, NA, 3,...
## $ males_18age_avg_vote <dbl> 5.9, 6.2, 5.5, 5.2, 7.0, 5.6, 6.4, 5.8, 7...
## $ males_18age_votes <dbl> 24, 102, 21, 20, 346, 32, 41, 29, 24, 11,...
## $ males_30age_avg_vote <dbl> 5.6, 6.0, 5.9, 4.9, 7.0, 5.7, 6.6, 6.2, 6...
## $ males_30age_votes <dbl> 36, 210, 67, 96, 804, 159, 240, 85, 92, 8...
## $ males_45age_avg_vote <dbl> 6.7, 6.2, 6.2, 5.2, 7.0, 5.8, 6.8, 6.3, 6...
## $ males_45age_votes <dbl> 31, 100, 55, 171, 396, 151, 238, 64, 33, ...
## $ females_allages_avg_vote <dbl> 6.0, 6.2, 5.7, 5.9, 7.2, 5.8, 7.4, 6.3, 7...
## $ females_allages_votes <dbl> 35, 50, 15, 39, 215, 31, 93, 21, 19, 15, ...
## $ females_0age_avg_vote <dbl> 7.3, NA, NA, NA, 7.0, NA, NA, NA, NA, NA,...
## $ females_0age_votes <dbl> 3, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA,...
## $ females_18age_avg_vote <dbl> 5.9, 5.9, 5.8, 5.7, 7.0, 6.3, 6.5, 6.5, 8...
## $ females_18age_votes <dbl> 14, 12, 4, 3, 52, 3, 16, 5, 3, NA, 5, 32,...
## $ females_30age_avg_vote <dbl> 5.7, 6.2, 5.8, 5.5, 7.3, 5.1, 7.2, 6.3, 6...
## $ females_30age_votes <dbl> 13, 23, 4, 14, 82, 13, 40, 11, 10, 6, 13,...
## $ females_45age_avg_vote <dbl> 4.5, 6.6, 6.8, 6.1, 7.4, 6.5, 8.2, 6.2, 6...
## $ females_45age_votes <dbl> 4, 14, 7, 21, 77, 15, 34, 5, 6, 9, 15, 97...
## $ top1000_voters_rating <dbl> 5.7, 6.4, 5.4, 4.9, 6.9, 5.5, 6.2, 5.7, 5...
## $ top1000_voters_votes <dbl> 34, 66, 32, 57, 139, 67, 88, 41, 25, 37, ...
## $ us_voters_rating <dbl> 6.4, 6.0, 6.2, 5.5, 7.0, 5.9, 6.6, 5.8, 6...
## $ us_voters_votes <dbl> 51, 96, 31, 207, 488, 173, 139, 40, 25, 1...
## $ non_us_voters_rating <dbl> 6.0, 6.2, 5.9, 4.7, 7.0, 5.6, 6.8, 6.2, 6...
## $ non_us_voters_votes <dbl> 70, 331, 123, 105, 1166, 181, 455, 154, 1...
# Renaming variable
movie<- rename(movie, imdbID="imdb_id",Title="title")
IMDB <-rename(IMDB, imdbID="imdb_title_id")
#Selecting required columns from movie data
newmovie<-movie %>% select(imdbID,Title,popularity,budget,revenue)
head(newmovie)
## # A tibble: 6 x 5
## imdbID Title popularity budget revenue
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 tt0114709 Toy Story 21.9 30000000 373554033
## 2 tt0113497 Jumanji 17.0 65000000 262797249
## 3 tt0113228 Grumpier Old Men 11.7 0 0
## 4 tt0114885 Waiting to Exhale 3.86 16000000 81452156
## 5 tt0113041 Father of the Bride Part II 8.39 0 76578911
## 6 tt0113277 Heat 17.9 60000000 187436818
#Selecting required columns from movie2 data
newmovie2<-movie2 %>% select(imdbID,Title,Type,Country)
head(newmovie2)
## # A tibble: 6 x 4
## imdbID Title Type Country
## <chr> <chr> <chr> <chr>
## 1 tt2268369 Baby's Breath movie USA
## 2 tt1560760 Winter Trees movie USA
## 3 tt0016750 The Crown of Lies movie USA
## 4 tt3405286 A Gift movie USA
## 5 tt3816698 Journey movie Sri Lanka
## 6 tt3645730 In His Own Home movie USA
#Selecting required columns from IMDB data
newIMDB <- IMDB[,c(1:5)]
head(newIMDB)
## # A tibble: 6 x 5
## imdbID weighted_average_vote total_votes mean_vote median_vote
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 tt0000009 5.9 154 5.9 6
## 2 tt0000574 6.1 589 6.3 6
## 3 tt0001892 5.8 188 6 6
## 4 tt0002101 5.2 446 5.3 5
## 5 tt0002130 7 2237 6.9 7
## 6 tt0002199 5.7 484 5.8 6
#Joining movie2 data to movie data
combined <- newmovie %>% left_join(newmovie2)
## Joining, by = c("imdbID", "Title")
head(combined)
## # A tibble: 6 x 7
## imdbID Title popularity budget revenue Type Country
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 tt0114709 Toy Story 21.9 30000000 3.74e8 movie USA
## 2 tt0113497 Jumanji 17.0 65000000 2.63e8 movie USA
## 3 tt0113228 Grumpier Old Men 11.7 0 0. movie USA
## 4 tt0114885 Waiting to Exhale 3.86 16000000 8.15e7 movie USA
## 5 tt0113041 Father of the Bride Part~ 8.39 0 7.66e7 movie USA
## 6 tt0113277 Heat 17.9 60000000 1.87e8 movie USA
#Adding movie rate data by using inner join-> retain only rows in both sets
movieDF <- combined %>% inner_join(newIMDB,by="imdbID")
head(movieDF)
## # A tibble: 6 x 11
## imdbID Title popularity budget revenue Type Country weighted_averag~
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 tt011~ Toy ~ 21.9 3.00e7 3.74e8 movie USA 8.3
## 2 tt011~ Juma~ 17.0 6.50e7 2.63e8 movie USA 7
## 3 tt011~ Grum~ 11.7 0. 0. movie USA 6.7
## 4 tt011~ Wait~ 3.86 1.60e7 8.15e7 movie USA 5.9
## 5 tt011~ Fath~ 8.39 0. 7.66e7 movie USA 6.1
## 6 tt011~ Heat 17.9 6.00e7 1.87e8 movie USA 8.2
## # ... with 3 more variables: total_votes <dbl>, mean_vote <dbl>,
## # median_vote <dbl>
#Reordering columns
movieDF <- movieDF[,c(1,2,6,7,4,5,3,8,9,10,11)]
head(movieDF)
## # A tibble: 6 x 11
## imdbID Title Type Country budget revenue popularity weighted_averag~
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 tt011~ Toy ~ movie USA 3.00e7 3.74e8 21.9 8.3
## 2 tt011~ Juma~ movie USA 6.50e7 2.63e8 17.0 7
## 3 tt011~ Grum~ movie USA 0. 0. 11.7 6.7
## 4 tt011~ Wait~ movie USA 1.60e7 8.15e7 3.86 5.9
## 5 tt011~ Fath~ movie USA 0. 7.66e7 8.39 6.1
## 6 tt011~ Heat movie USA 6.00e7 1.87e8 17.9 8.2
## # ... with 3 more variables: total_votes <dbl>, mean_vote <dbl>,
## # median_vote <dbl>
# Converting variable type
movieDF$imdbID <-as.factor(movieDF$imdbID)
movieDF$weighted_average_vote <-as.factor(movieDF$weighted_average_vote)
str(movieDF)
## Classes 'tbl_df', 'tbl' and 'data.frame': 34098 obs. of 11 variables:
## $ imdbID : Factor w/ 34073 levels "tt0000574","tt0002101",..: 14660 14481 14434 14681 14394 14446 14607 14264 14636 14426 ...
## $ Title : chr "Toy Story" "Jumanji" "Grumpier Old Men" "Waiting to Exhale" ...
## $ Type : chr "movie" "movie" "movie" "movie" ...
## $ Country : chr "USA" "USA" "USA" "USA" ...
## $ budget : num 3.0e+07 6.5e+07 0.0 1.6e+07 0.0 6.0e+07 5.8e+07 0.0 3.5e+07 5.8e+07 ...
## $ revenue : num 3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
## $ popularity : num 21.95 17.02 11.71 3.86 8.39 ...
## $ weighted_average_vote: Factor w/ 82 levels "1.1","1.3","1.4",..: 72 59 56 48 50 71 52 44 47 61 ...
## $ total_votes : num 864461 299083 23742 9353 33587 ...
## $ mean_vote : num 8.3 7.1 6.8 6.2 6.2 8.2 6.5 5.9 5.9 7.3 ...
## $ median_vote : num 8 7 7 6 6 8 7 6 6 7 ...
#rearranging weighted average vote in descending order
movieDF %>% arrange(desc(weighted_average_vote))
## # A tibble: 34,098 x 11
## imdbID Title Type Country budget revenue popularity weighted_averag~
## <fct> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct>
## 1 tt011~ The ~ movie USA 2.50e7 2.83e7 51.6 9.3
## 2 tt025~ The ~ movie Turkey 0. 0. 0.869 9.3
## 3 tt006~ The ~ <NA> <NA> 6.00e6 2.45e8 41.1 9.2
## 4 tt017~ Yest~ <NA> <NA> 0. 0. 0.000844 9.1
## 5 tt531~ Nats~ <NA> <NA> 0. 0. 0.251 9.1
## 6 tt007~ The ~ <NA> <NA> 1.30e7 4.75e7 36.6 9
## 7 tt008~ The ~ <NA> <NA> 0. 0. 1.98 9
## 8 tt046~ The ~ movie USA, UK 1.85e8 1.00e9 123. 9
## 9 tt025~ Haba~ <NA> <NA> 0. 0. 1.34 9
## 10 tt035~ Sand~ <NA> <NA> 0. 0. 1.13 9
## # ... with 34,088 more rows, and 3 more variables: total_votes <dbl>,
## # mean_vote <dbl>, median_vote <dbl>
According to Hadley Wickham and Grolemund, the three principles of tidy data are:
Each variable must have its own column. Each observation must have its own row. Each value must have its own cell.
The movie datset has the column “release_date” where day,month and year are recorded in one column. However, the separation is not carried out as the release_date columns is not required. The separate() function could be used when multiple variables are stored in one column. The required dataframe “movieDF” is already tidy.
#Showing the example of untidy data
head(movie$release_date)
## [1] "1995-10-30" "1995-12-15" "1995-12-22" "1995-12-22" "1995-02-10"
## [6] "1995-12-15"
All variables are scanned for missing values, special values and obvious errors. The missing values are represented by NA which stands for not available.To identify the missing values, is.na() function is used and also colSums() is used for the total missing values in the each columns. It is observed that variable “Type” consists of 24783 NAs and “Country” contains 24783 NAs. The common method of handling missing values are omitting the records or replacing with mean/median/mode value of the column. However, those methods for handling missing value of variable Type and Country are not applicable as number of missing value are quite large. The omitting missing values may cause bias analysis. The missing values for Type and Country are handled by replacing with “Unknown”. The missing values for title is not handled.As per recommendations, if there very small size of missing data (upto 5%),then can be left out with missing features. The missing value of revenue is replaced with 0. It is because those missing values in revenue can not be replaced by mean/mode/median. The missing value of popularity is replaced with mean of the column. is.special() function is used to check if there is any infinite and NaN values in this data. To check obvious errors, editrules package is utilised.The rule is created to define the restriction on the mean_vote and median_vote variables. These values are considered as the minimum value is 1 and the maximum value is 10. The checking is done by whether values are less than 0 or greater than 10. After creating rules, violatedEdits() is used.No special values found as well as obvious errors.
# Checking number of NA's per column
sum(is.na(movieDF))
## [1] 49572
colSums(is.na(movieDF))
## imdbID Title Type
## 0 2 24783
## Country budget revenue
## 24783 0 2
## popularity weighted_average_vote total_votes
## 2 0 0
## mean_vote median_vote
## 0 0
#Imputing NA value with "Unknown"
movieDF$Type[is.na(movieDF$Type)] <-"Unknown"
movieDF$Country[is.na(movieDF$Country)] <-"Unknown"
#Rechecking NA's in Type and Country columns
colSums(is.na(movieDF))
## imdbID Title Type
## 0 2 0
## Country budget revenue
## 0 0 2
## popularity weighted_average_vote total_votes
## 2 0 0
## mean_vote median_vote
## 0 0
#Replcing NA of popularity with mean of the column
movieDF$popularity[is.na(movieDF$popularity)] <-mean(movieDF$popularity,na.rm = TRUE)
#Replcing NA of revenue with 0
movieDF$revenue[is.na(movieDF$revenue)] <-0
#Rechecking NA's in revenue and popularity columns
colSums(is.na(movieDF))
## imdbID Title Type
## 0 2 0
## Country budget revenue
## 0 0 0
## popularity weighted_average_vote total_votes
## 0 0 0
## mean_vote median_vote
## 0 0
# Checking each numerical columns for special values
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(movieDF[,c(5,6,7,9,10,11)],function(y)sum(is.specialorNA(y)))
## budget revenue popularity total_votes mean_vote median_vote
## 0 0 0 0 0 0
#Scanning for obvious errors
#Defining the restrictions
(Rule1<-editset(c("mean_vote >= 0","mean_vote <=10",
"median_vote>=0", "median_vote<=10")))
##
## Edit set:
## num1 : 0 <= mean_vote
## num2 : mean_vote <= 10
## num3 : 0 <= median_vote
## num4 : median_vote <= 10
#Checking the errors
colSums((violatedEdits(Rule1, movieDF)))
## num1 num2 num3 num4
## 0 0 0 0
The mutation function from the tidyr package is used to produce a new variable.Firstly, Title, popularity, weighted_average_vote and revenue and budget are selected from the data and stored in new data frame. The new variable “NetIncome” is created.The new varaible “NetIncome” takes revenue value and substracted by budget.Then the NetIncome is ordered as descending. The Title, popularity and weighted_average_vote variables are selected to see the net income in relation to poularity and average vote of each movies. The movie “Avatar” is observed as highest net income.
# Select specific variables and creating new variable by using mutate()
movieDF2 <- movieDF %>% select(Title,popularity,weighted_average_vote,revenue,budget)%>%
mutate(NetIncome= revenue-budget) %>%
arrange(desc(NetIncome))
movieDF2
## # A tibble: 34,098 x 6
## Title popularity weighted_average_~ revenue budget NetIncome
## <chr> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 Avatar 185. 7.8 2.79e9 2.37e8 2.55e9
## 2 Star Wars: The Forc~ 31.6 7.9 2.07e9 2.45e8 1.82e9
## 3 Titanic 26.9 7.8 1.85e9 2.00e8 1.65e9
## 4 Jurassic World 32.8 7 1.51e9 1.50e8 1.36e9
## 5 Furious 7 27.3 7.1 1.51e9 1.90e8 1.32e9
## 6 The Avengers 89.9 8 1.52e9 2.20e8 1.30e9
## 7 Harry Potter and th~ 25.0 8.1 1.34e9 1.25e8 1.22e9
## 8 Avengers: Age of Ul~ 37.4 7.3 1.41e9 2.80e8 1.13e9
## 9 Frozen 24.2 7.4 1.27e9 1.50e8 1.12e9
## 10 Beauty and the Beast 287. 7.1 1.26e9 1.60e8 1.10e9
## # ... with 34,088 more rows
str(movieDF2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 34098 obs. of 6 variables:
## $ Title : chr "Avatar" "Star Wars: The Force Awakens" "Titanic" "Jurassic World" ...
## $ popularity : num 185.1 31.6 26.9 32.8 27.3 ...
## $ weighted_average_vote: Factor w/ 82 levels "1.1","1.3","1.4",..: 67 68 67 59 60 69 70 62 63 60 ...
## $ revenue : num 2.79e+09 2.07e+09 1.85e+09 1.51e+09 1.51e+09 ...
## $ budget : num 2.37e+08 2.45e+08 2.00e+08 1.50e+08 1.90e+08 2.20e+08 1.25e+08 2.80e+08 1.50e+08 1.60e+08 ...
## $ NetIncome : num 2.55e+09 1.82e+09 1.65e+09 1.36e+09 1.32e+09 ...
In statistics, an outlier is defined as an observation which stands far away from the most of the other observations. The outlier is often result of data entry errors, measurement errors, experimental errors, intentional erros and data processing errors. The numberic variables such as popularity, revenue, budget and NetIncome are considered for outlier assessment.To detect outliers, boxplot is used. Refer to the boxplots below, several outliers are observed in each variables. The Z-score is then calculated.
# This is the R chunk for the Scan II, boxplot of numeric variables
par(mfrow=c(2,2))
movieDF2$popularity %>% boxplot(main = "Boxplot of popularity")
movieDF2$revenue %>% boxplot(main = "Boxplot of revenue")
movieDF2$budget %>% boxplot( main = "Boxplot of budget")
movieDF2$NetIncome%>% boxplot( main = "Boxplot of NetIncome")
#Histrogram of each variables
par(mfrow=c(2,2))
hist(movieDF2$popularity)
hist(movieDF2$revenue)
hist(movieDF2$budget)
hist(movieDF2$NetIncome)
z.scores_popularity <- movieDF2$popularity %>% scores(type = "z")
z.scores_popularity %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5096 -0.4319 -0.3001 0.0000 0.2041 80.8661
z.scores_revenue <- movieDF2$revenue %>% scores(type = "z")
z.scores_revenue %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2011 -0.2011 -0.2011 0.0000 -0.2011 37.5305
z.scores_budget<- movieDF2$budget %>% scores(type = "z")
z.scores_budget %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2776 -0.2776 -0.2776 0.0000 -0.2776 18.8452
z.scores_netincome <- movieDF2$NetIncome %>% scores(type = "z")
z.scores_netincome %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.9193 -0.1557 -0.1557 0.0000 -0.1557 42.3867
#Finding total number of outliers according to the z-score
length(which( abs(z.scores_popularity) >3 ))
## [1] 177
length(which( abs(z.scores_revenue) >3 ))
## [1] 542
length(which( abs(z.scores_budget) >3 ))
## [1] 745
length(which( abs(z.scores_netincome) >3 ))
## [1] 487
Capping is used for imputing the outliers. The Capping involves replacing the outliers with the nearest neighbors that are not outliers. The function is defined to cap the values outside the limits. This method is applied to popularity, revenue, budget and NetIncome variables. The outlier is still observed eventhough capping method is implemented.
#Capping function to deal with outliers
#Define function to cap the values that are outside the limits
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5 * IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5 * IQR(x) ] <- quantiles[4]
x
}
#Capping
movieDF2$popularity <- movieDF2$popularity %>% cap()
movieDF2$revenue <- movieDF2$revenue %>% cap()
movieDF2$budget <-movieDF2$budget %>% cap()
movieDF2$NetIncome <- movieDF2$NetIncome %>% cap()
#Visualisation after capping
boxplot(movieDF2$popularity, main = "Capped Popularity")
boxplot(movieDF2$revenue, main = "Capped Revenue")
boxplot(movieDF2$budget, main = "Capped Budget")
boxplot(movieDF2$NetIncome, main = "Capped NetIncome")
The data transformation is often requisite of variables to further proceed with statistical analysis. In this analysis,the NetIncome variable is considered . Refer to graph below, it is observed that the histogram of Netincome is right -skewed. To reduce the right-skewness, the logarithms, reciprocal, Boxcox transformations are applied.
# Checking the disbribution of NetIncome
hist(movieDF2$NetIncome, main ="Histrogram of NetIncome")
par(mfrow=c(3,2))
#Trying different transformations
hist(log10(movieDF2$NetIncome), main= "log10", xlab = "NetIncome")
## Warning in hist(log10(movieDF2$NetIncome), main = "log10", xlab = "NetIncome"):
## NaNs produced
hist(log(movieDF2$NetIncome), main= "log", xlab = "NetIncome")
## Warning in log(movieDF2$NetIncome): NaNs produced
hist(sqrt(movieDF2$NetIncome), main= "sqrt", xlab = "NetIncome")
## Warning in sqrt(movieDF2$NetIncome): NaNs produced
hist(BoxCox(movieDF2$NetIncome, lambda = "auto"), main= "Boxcox", xlab = "NetIncome")
hist((movieDF2$NetIncome)^(-1), main= "reciprocal", xlab = "NetIncome")
The most effective at increasing normality appears to be reciprocal transformation. The log10 and log resulted left skewness. The square root transformation was not effective. The Boxcox transformation reduced skewness but it did not completely improve the symmetry of the Netincome distribution.