Assignment 2

Required packages

#required packages

library(readr)
library(foreign)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(validate)

## 
## Attaching package: 'validate'

## The following object is masked from 'package:dplyr':
## 
##     expr

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:validate':
## 
##     expr

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:validate':
## 
##     label, label<-

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(stringr)
library(ggplot2)
library(editrules)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:validate':
## 
##     compare

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following objects are masked from 'package:tidyr':
## 
##     contains, separate

## The following object is masked from 'package:dplyr':
## 
##     contains

library(outliers)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

In this analysis, three datasets were used . The datasets are “The movies Dataset”, " Movie Information Dataset" and“IMDB Movies Extensive Dataset”. The datasets were read by using read_csv(). To understand the data structure and data types, glimpse() function was used. Prior to mutating join, columns were renamed and required variables were selected by using select(). Two movies datasets were merged by imdbID and Title using left_join(). Followed by rating dataset was merged by imdbID using inner_join(). After imputing missing values, the obvious errors were assessed by using editset(). The NetIncome variable was created using mutate(). The outliers were detected by using boxplots and were capped. The capping was not succesful as outliers were still observed. The distribution of NetIncome was determined by using reciprocal transformation.

Data

The Movies Dataset

The Movies Dataset was taken from https://www.kaggle.com/rounakbanik/the-movies-dataset?select=links.csv, containing 45,000 movies released on or before July 2017. The dataset has 26 million rating from 270,000 users for 45,000 movies. Ratings are on a scale of 1-5 that are obtained from the official GroupLens Website.

Variables

-IMDB_ID: The identifier of the movie in IMDB -Title: The title of the movie -Popularity: The popularity of the movie based on the voting -Revenue: Income from movie -Budget: Movie production cost

Movie Information Dataset

The Movie Information Dataset was taken from https://www.kaggle.com/beyjin/movies-1990-to-2017?select=Movie_Movies.csv, consists of 178687 observations and 18 variables. However , only 4 features were used

Variables

-IMDB_ID : The identifier of the movie in IMDB -Title: The title of the movie -Type: The type of the production -Country: The production Country

IMDB Movies Extensive Dataset

The IMDB Dataset was taken from https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+ratings.csv. The IMDB is the most popular movie website and it combines movie plot description, rating, reviews, release dates and many more aspects. There were four datasets provided by IMDB Dataset but only IMDB rating dataset was used for this analysis. The IMDB rating dataset includes 85,855 rating details from demographic perspective.

Variables

IMDB_Title_ID: The identifier of the movie in IMDB Weighted Average Vote:Total weighted average rating Total Vote: The totle votes received Mean Vote: The total mean vote *Median Vote: The total median vote

##Reading and Understanding Data The three datasets used in this analysis were imported using readar function, read_CSV().

#Importing the Movies Dataset
movie <- read_csv("movies_metadata.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   adult = col_logical(),
##   budget = col_double(),
##   id = col_double(),
##   popularity = col_double(),
##   release_date = col_date(format = ""),
##   revenue = col_double(),
##   runtime = col_double(),
##   video = col_logical(),
##   vote_average = col_double(),
##   vote_count = col_double()
## )

## See spec(...) for full column specifications.

## Warning: 19 parsing failures.
##   row          col               expected                           actual                  file
## 19730 NA           24 columns             10 columns                       'movies_metadata.csv'
## 19731 adult        1/0/T/F/TRUE/FALSE     - Written by Ã˜rnÃ¥s             'movies_metadata.csv'
## 19731 budget       a double               /ff9qCepilowshEtG2GYWwzt2bs4.jpg 'movies_metadata.csv'
## 19731 id           no trailing characters -08-20                           'movies_metadata.csv'
## 19731 release_date date like              1                                'movies_metadata.csv'
## ..... ............ ...................... ................................ .....................
## See problems(...) for more details.

head(movie)

## # A tibble: 6 x 24
##   adult belongs_to_coll~ budget genres homepage    id imdb_id original_langua~
##   <lgl> <chr>             <dbl> <chr>  <chr>    <dbl> <chr>   <chr>           
## 1 FALSE {'id': 10194, '~ 3.00e7 [{'id~ http://~   862 tt0114~ en              
## 2 FALSE <NA>             6.50e7 [{'id~ <NA>      8844 tt0113~ en              
## 3 FALSE {'id': 119050, ~ 0.     [{'id~ <NA>     15602 tt0113~ en              
## 4 FALSE <NA>             1.60e7 [{'id~ <NA>     31357 tt0114~ en              
## 5 FALSE {'id': 96871, '~ 0.     [{'id~ <NA>     11862 tt0113~ en              
## 6 FALSE <NA>             6.00e7 [{'id~ <NA>       949 tt0113~ en              
## # ... with 16 more variables: original_title <chr>, overview <chr>,
## #   popularity <dbl>, poster_path <chr>, production_companies <chr>,
## #   production_countries <chr>, release_date <date>, revenue <dbl>,
## #   runtime <dbl>, spoken_languages <chr>, status <chr>, tagline <chr>,
## #   title <chr>, video <lgl>, vote_average <dbl>, vote_count <dbl>

#Importing Movie Information Dataset
movie2<- read_csv("Movie_Movies.csv")

## Parsed with column specification:
## cols(
##   Awards = col_character(),
##   Country = col_character(),
##   DVD = col_character(),
##   Director = col_character(),
##   Language = col_character(),
##   Plot = col_character(),
##   Poster = col_character(),
##   Production = col_character(),
##   Rated = col_character(),
##   Released = col_character(),
##   Runtime = col_character(),
##   Title = col_character(),
##   Type = col_character(),
##   Website = col_character(),
##   Year = col_double(),
##   imdbID = col_character(),
##   imdbRating = col_double(),
##   imdbVotes = col_number()
## )

## Warning: 4 parsing failures.
##    row  col               expected  actual               file
## 174977 Year no trailing characters â€“2005 'Movie_Movies.csv'
## 175430 Year no trailing characters â€“2011 'Movie_Movies.csv'
## 176138 Year no trailing characters â€“2013 'Movie_Movies.csv'
## 176359 Year no trailing characters â€“2016 'Movie_Movies.csv'

head(movie2)

## # A tibble: 6 x 18
##   Awards Country DVD   Director Language Plot  Poster Production Rated Released
##   <chr>  <chr>   <chr> <chr>    <chr>    <chr> <chr>  <chr>      <chr> <chr>   
## 1 <NA>   USA     <NA>  Rose Cu~ English  Rach~ <NA>   <NA>       <NA>  26 Apr ~
## 2 <NA>   USA     <NA>  James B~ <NA>     The ~ <NA>   <NA>       <NA>  <NA>    
## 3 <NA>   USA     <NA>  Dimitri~ <NA>     <NA>  <NA>   <NA>       <NA>  27 Mar ~
## 4 <NA>   USA     <NA>  Julia H~ English  A Gi~ <NA>   <NA>       <NA>  27 May ~
## 5 <NA>   Sri La~ <NA>  Udara S~ Sinhale~ <NA>  <NA>   <NA>       <NA>  20 Mar ~
## 6 <NA>   USA     <NA>  Malini ~ English  When~ <NA>   <NA>       <NA>  <NA>    
## # ... with 8 more variables: Runtime <chr>, Title <chr>, Type <chr>,
## #   Website <chr>, Year <dbl>, imdbID <chr>, imdbRating <dbl>, imdbVotes <dbl>

#Importing IMDB Rating Dataset
IMDB <-read_csv("IMDb ratings.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   imdb_title_id = col_character()
## )

## See spec(...) for full column specifications.

head(IMDB)

## # A tibble: 6 x 49
##   imdb_title_id weighted_averag~ total_votes mean_vote median_vote votes_10
##   <chr>                    <dbl>       <dbl>     <dbl>       <dbl>    <dbl>
## 1 tt0000009                  5.9         154       5.9           6       12
## 2 tt0000574                  6.1         589       6.3           6       57
## 3 tt0001892                  5.8         188       6             6        6
## 4 tt0002101                  5.2         446       5.3           5       15
## 5 tt0002130                  7          2237       6.9           7      210
## 6 tt0002199                  5.7         484       5.8           6       33
## # ... with 43 more variables: votes_9 <dbl>, votes_8 <dbl>, votes_7 <dbl>,
## #   votes_6 <dbl>, votes_5 <dbl>, votes_4 <dbl>, votes_3 <dbl>, votes_2 <dbl>,
## #   votes_1 <dbl>, allgenders_0age_avg_vote <dbl>, allgenders_0age_votes <dbl>,
## #   allgenders_18age_avg_vote <dbl>, allgenders_18age_votes <dbl>,
## #   allgenders_30age_avg_vote <dbl>, allgenders_30age_votes <dbl>,
## #   allgenders_45age_avg_vote <dbl>, allgenders_45age_votes <dbl>,
## #   males_allages_avg_vote <dbl>, males_allages_votes <dbl>,
## #   males_0age_avg_vote <dbl>, males_0age_votes <dbl>,
## #   males_18age_avg_vote <dbl>, males_18age_votes <dbl>,
## #   males_30age_avg_vote <dbl>, males_30age_votes <dbl>,
## #   males_45age_avg_vote <dbl>, males_45age_votes <dbl>,
## #   females_allages_avg_vote <dbl>, females_allages_votes <dbl>,
## #   females_0age_avg_vote <dbl>, females_0age_votes <dbl>,
## #   females_18age_avg_vote <dbl>, females_18age_votes <dbl>,
## #   females_30age_avg_vote <dbl>, females_30age_votes <dbl>,
## #   females_45age_avg_vote <dbl>, females_45age_votes <dbl>,
## #   top1000_voters_rating <dbl>, top1000_voters_votes <dbl>,
## #   us_voters_rating <dbl>, us_voters_votes <dbl>, non_us_voters_rating <dbl>,
## #   non_us_voters_votes <dbl>

Understand

glimpse() function is used for checking and understading the datasets.The all three datasets contains multiple data types such as characters and numerics.The required columns are renamed for mutating joins. The three datasets are filtered to show only required variables. As the stringAsFactor argument default (read_csv), the variables are imported as character.The variable type conversion are carried out.

For movie dataset: -45,466 observations and 24 variables -The column “imdb_id” was renamed to “imdbID” -The column “title” was renamed to “Title” -Subsetting variables:imdbID,Title,popularity,budget,revenue

For movie2 dataset:
-178,687 observations and 18 variables -Subsetting variables:imdbID,Title,Type,Country

For IMDB dataset:
-85,855 observations and 49 variables -Subsetting variables:imdbID,weighted_average_vote, total_vote,mean_vote,median_vote

The movie and movie2 are left joined by imdbID and Title. Followed by IMDB dataset is inner joined by imdbID. After mutating joins, the variable“imdbID” is converted to factor as well as “weighted_average_vote”.

glimpse(movie)

## Observations: 45,466
## Variables: 24
## $ adult                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collection'...
## $ budget                <dbl> 30000000, 65000000, 0, 16000000, 0, 60000000,...
## $ genres                <chr> "[{'id': 16, 'name': 'Animation'}, {'id': 35,...
## $ homepage              <chr> "http://toystory.disney.com/toy-story", NA, N...
## $ id                    <dbl> 862, 8844, 15602, 31357, 11862, 949, 11860, 4...
## $ imdb_id               <chr> "tt0114709", "tt0113497", "tt0113228", "tt011...
## $ original_language     <chr> "en", "en", "en", "en", "en", "en", "en", "en...
## $ original_title        <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "...
## $ overview              <chr> "Led by Woody, Andy's toys live happily in hi...
## $ popularity            <dbl> 21.946943, 17.015539, 11.712900, 3.859495, 8....
## $ poster_path           <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vzmL6fP...
## $ production_companies  <chr> "[{'name': 'Pixar Animation Studios', 'id': 3...
## $ production_countries  <chr> "[{'iso_3166_1': 'US', 'name': 'United States...
## $ release_date          <date> 1995-10-30, 1995-12-15, 1995-12-22, 1995-12-...
## $ revenue               <dbl> 373554033, 262797249, 0, 81452156, 76578911, ...
## $ runtime               <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 106, 13...
## $ spoken_languages      <chr> "[{'iso_639_1': 'en', 'name': 'English'}]", "...
## $ status                <chr> "Released", "Released", "Released", "Released...
## $ tagline               <chr> NA, "Roll the dice and unleash the excitement...
## $ title                 <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "...
## $ video                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ vote_average          <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, 5.5, ...
## $ vote_count            <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, 174, ...

glimpse(movie2)

## Observations: 178,687
## Variables: 18
## $ Awards     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Country    <chr> "USA", "USA", "USA", "USA", "Sri Lanka", "USA", "UK", "U...
## $ DVD        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Director   <chr> "Rose Cummings", "James Byrne", "Dimitri Buchowetzki", "...
## $ Language   <chr> "English", NA, NA, "English", "Sinhalese", "English", "E...
## $ Plot       <chr> "Rachel constantly hears her baby cry from the bedroom, ...
## $ Poster     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://ima...
## $ Production <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Rated      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Released   <chr> "26 Apr 2012", NA, "27 Mar 1926", "27 May 2013", "20 Mar...
## $ Runtime    <chr> "20 min", "9 min", "50 min", "2 min", "23 min", "42 min"...
## $ Title      <chr> "Baby's Breath", "Winter Trees", "The Crown of Lies", "A...
## $ Type       <chr> "movie", "movie", "movie", "movie", "movie", "movie", "m...
## $ Website    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Year       <dbl> 2012, 2008, 1926, 2013, 2014, 2014, 2009, 1915, 1966, 20...
## $ imdbID     <chr> "tt2268369", "tt1560760", "tt0016750", "tt3405286", "tt3...
## $ imdbRating <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 6.9, NA, NA, NA, NA,...
## $ imdbVotes  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 35, NA, NA, NA, NA, ...

glimpse(IMDB)

## Observations: 85,855
## Variables: 49
## $ imdb_title_id             <chr> "tt0000009", "tt0000574", "tt0001892", "t...
## $ weighted_average_vote     <dbl> 5.9, 6.1, 5.8, 5.2, 7.0, 5.7, 6.8, 6.2, 6...
## $ total_votes               <dbl> 154, 589, 188, 446, 2237, 484, 753, 273, ...
## $ mean_vote                 <dbl> 5.9, 6.3, 6.0, 5.3, 6.9, 5.8, 6.8, 6.2, 7...
## $ median_vote               <dbl> 6, 6, 6, 5, 7, 6, 7, 6, 7, 6, 7, 7, 7, 7,...
## $ votes_10                  <dbl> 12, 57, 6, 15, 210, 33, 80, 15, 53, 9, 17...
## $ votes_9                   <dbl> 4, 18, 6, 8, 225, 15, 65, 8, 7, 6, 23, 14...
## $ votes_8                   <dbl> 10, 58, 17, 16, 436, 48, 105, 30, 23, 22,...
## $ votes_7                   <dbl> 43, 137, 44, 62, 641, 80, 209, 74, 35, 34...
## $ votes_6                   <dbl> 28, 139, 52, 98, 344, 123, 142, 66, 30, 5...
## $ votes_5                   <dbl> 28, 103, 32, 117, 169, 77, 80, 40, 24, 40...
## $ votes_4                   <dbl> 9, 28, 16, 63, 66, 36, 27, 22, 12, 23, 11...
## $ votes_3                   <dbl> 1, 20, 5, 26, 39, 20, 13, 6, 5, 8, 9, 19,...
## $ votes_2                   <dbl> 5, 13, 6, 25, 20, 18, 4, 2, 3, 4, 5, 11, ...
## $ votes_1                   <dbl> 14, 16, 4, 16, 87, 34, 28, 10, 6, 26, 14,...
## $ allgenders_0age_avg_vote  <dbl> 7.2, 6.0, NA, NA, 7.5, NA, 3.0, 6.0, NA, ...
## $ allgenders_0age_votes     <dbl> 4, 1, NA, NA, 4, NA, 1, 1, NA, NA, NA, 3,...
## $ allgenders_18age_avg_vote <dbl> 6.0, 6.1, 5.5, 5.3, 7.0, 5.6, 6.4, 5.9, 7...
## $ allgenders_18age_votes    <dbl> 38, 114, 25, 23, 402, 35, 59, 34, 28, 11,...
## $ allgenders_30age_avg_vote <dbl> 5.7, 6.0, 5.8, 5.0, 7.0, 5.6, 6.7, 6.2, 6...
## $ allgenders_30age_votes    <dbl> 50, 239, 72, 111, 895, 177, 287, 99, 104,...
## $ allgenders_45age_avg_vote <dbl> 6.6, 6.3, 6.2, 5.3, 7.1, 5.8, 7.0, 6.3, 6...
## $ allgenders_45age_votes    <dbl> 35, 115, 62, 193, 482, 168, 276, 69, 39, ...
## $ males_allages_avg_vote    <dbl> 6.2, 6.1, 5.9, 5.1, 7.0, 5.7, 6.6, 6.2, 6...
## $ males_allages_votes       <dbl> 97, 425, 146, 299, 1607, 354, 530, 191, 1...
## $ males_0age_avg_vote       <dbl> 7, 6, NA, NA, 8, NA, 3, 6, NA, NA, NA, 8,...
## $ males_0age_votes          <dbl> 1, 1, NA, NA, 2, NA, 1, 1, NA, NA, NA, 3,...
## $ males_18age_avg_vote      <dbl> 5.9, 6.2, 5.5, 5.2, 7.0, 5.6, 6.4, 5.8, 7...
## $ males_18age_votes         <dbl> 24, 102, 21, 20, 346, 32, 41, 29, 24, 11,...
## $ males_30age_avg_vote      <dbl> 5.6, 6.0, 5.9, 4.9, 7.0, 5.7, 6.6, 6.2, 6...
## $ males_30age_votes         <dbl> 36, 210, 67, 96, 804, 159, 240, 85, 92, 8...
## $ males_45age_avg_vote      <dbl> 6.7, 6.2, 6.2, 5.2, 7.0, 5.8, 6.8, 6.3, 6...
## $ males_45age_votes         <dbl> 31, 100, 55, 171, 396, 151, 238, 64, 33, ...
## $ females_allages_avg_vote  <dbl> 6.0, 6.2, 5.7, 5.9, 7.2, 5.8, 7.4, 6.3, 7...
## $ females_allages_votes     <dbl> 35, 50, 15, 39, 215, 31, 93, 21, 19, 15, ...
## $ females_0age_avg_vote     <dbl> 7.3, NA, NA, NA, 7.0, NA, NA, NA, NA, NA,...
## $ females_0age_votes        <dbl> 3, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA,...
## $ females_18age_avg_vote    <dbl> 5.9, 5.9, 5.8, 5.7, 7.0, 6.3, 6.5, 6.5, 8...
## $ females_18age_votes       <dbl> 14, 12, 4, 3, 52, 3, 16, 5, 3, NA, 5, 32,...
## $ females_30age_avg_vote    <dbl> 5.7, 6.2, 5.8, 5.5, 7.3, 5.1, 7.2, 6.3, 6...
## $ females_30age_votes       <dbl> 13, 23, 4, 14, 82, 13, 40, 11, 10, 6, 13,...
## $ females_45age_avg_vote    <dbl> 4.5, 6.6, 6.8, 6.1, 7.4, 6.5, 8.2, 6.2, 6...
## $ females_45age_votes       <dbl> 4, 14, 7, 21, 77, 15, 34, 5, 6, 9, 15, 97...
## $ top1000_voters_rating     <dbl> 5.7, 6.4, 5.4, 4.9, 6.9, 5.5, 6.2, 5.7, 5...
## $ top1000_voters_votes      <dbl> 34, 66, 32, 57, 139, 67, 88, 41, 25, 37, ...
## $ us_voters_rating          <dbl> 6.4, 6.0, 6.2, 5.5, 7.0, 5.9, 6.6, 5.8, 6...
## $ us_voters_votes           <dbl> 51, 96, 31, 207, 488, 173, 139, 40, 25, 1...
## $ non_us_voters_rating      <dbl> 6.0, 6.2, 5.9, 4.7, 7.0, 5.6, 6.8, 6.2, 6...
## $ non_us_voters_votes       <dbl> 70, 331, 123, 105, 1166, 181, 455, 154, 1...

# Renaming variable 
movie<- rename(movie, imdbID="imdb_id",Title="title")
IMDB <-rename(IMDB, imdbID="imdb_title_id")

#Selecting required columns from movie data
newmovie<-movie %>% select(imdbID,Title,popularity,budget,revenue)
head(newmovie)

## # A tibble: 6 x 5
##   imdbID    Title                       popularity   budget   revenue
##   <chr>     <chr>                            <dbl>    <dbl>     <dbl>
## 1 tt0114709 Toy Story                        21.9  30000000 373554033
## 2 tt0113497 Jumanji                          17.0  65000000 262797249
## 3 tt0113228 Grumpier Old Men                 11.7         0         0
## 4 tt0114885 Waiting to Exhale                 3.86 16000000  81452156
## 5 tt0113041 Father of the Bride Part II       8.39        0  76578911
## 6 tt0113277 Heat                             17.9  60000000 187436818

#Selecting required columns from movie2 data
newmovie2<-movie2 %>% select(imdbID,Title,Type,Country)
head(newmovie2)

## # A tibble: 6 x 4
##   imdbID    Title             Type  Country  
##   <chr>     <chr>             <chr> <chr>    
## 1 tt2268369 Baby's Breath     movie USA      
## 2 tt1560760 Winter Trees      movie USA      
## 3 tt0016750 The Crown of Lies movie USA      
## 4 tt3405286 A Gift            movie USA      
## 5 tt3816698 Journey           movie Sri Lanka
## 6 tt3645730 In His Own Home   movie USA

#Selecting required columns from IMDB data
newIMDB <- IMDB[,c(1:5)]
head(newIMDB)

## # A tibble: 6 x 5
##   imdbID    weighted_average_vote total_votes mean_vote median_vote
##   <chr>                     <dbl>       <dbl>     <dbl>       <dbl>
## 1 tt0000009                   5.9         154       5.9           6
## 2 tt0000574                   6.1         589       6.3           6
## 3 tt0001892                   5.8         188       6             6
## 4 tt0002101                   5.2         446       5.3           5
## 5 tt0002130                   7          2237       6.9           7
## 6 tt0002199                   5.7         484       5.8           6

#Joining movie2 data to movie data
combined <- newmovie %>% left_join(newmovie2)

## Joining, by = c("imdbID", "Title")

head(combined)

## # A tibble: 6 x 7
##   imdbID    Title                     popularity   budget  revenue Type  Country
##   <chr>     <chr>                          <dbl>    <dbl>    <dbl> <chr> <chr>  
## 1 tt0114709 Toy Story                      21.9  30000000   3.74e8 movie USA    
## 2 tt0113497 Jumanji                        17.0  65000000   2.63e8 movie USA    
## 3 tt0113228 Grumpier Old Men               11.7         0   0.     movie USA    
## 4 tt0114885 Waiting to Exhale               3.86 16000000   8.15e7 movie USA    
## 5 tt0113041 Father of the Bride Part~       8.39        0   7.66e7 movie USA    
## 6 tt0113277 Heat                           17.9  60000000   1.87e8 movie USA

#Adding movie rate data by using inner join-> retain only rows in both sets
movieDF <- combined %>% inner_join(newIMDB,by="imdbID")
head(movieDF)

## # A tibble: 6 x 11
##   imdbID Title popularity budget revenue Type  Country weighted_averag~
##   <chr>  <chr>      <dbl>  <dbl>   <dbl> <chr> <chr>              <dbl>
## 1 tt011~ Toy ~      21.9  3.00e7  3.74e8 movie USA                  8.3
## 2 tt011~ Juma~      17.0  6.50e7  2.63e8 movie USA                  7  
## 3 tt011~ Grum~      11.7  0.      0.     movie USA                  6.7
## 4 tt011~ Wait~       3.86 1.60e7  8.15e7 movie USA                  5.9
## 5 tt011~ Fath~       8.39 0.      7.66e7 movie USA                  6.1
## 6 tt011~ Heat       17.9  6.00e7  1.87e8 movie USA                  8.2
## # ... with 3 more variables: total_votes <dbl>, mean_vote <dbl>,
## #   median_vote <dbl>

#Reordering columns
movieDF <- movieDF[,c(1,2,6,7,4,5,3,8,9,10,11)]
head(movieDF)

## # A tibble: 6 x 11
##   imdbID Title Type  Country budget revenue popularity weighted_averag~
##   <chr>  <chr> <chr> <chr>    <dbl>   <dbl>      <dbl>            <dbl>
## 1 tt011~ Toy ~ movie USA     3.00e7  3.74e8      21.9               8.3
## 2 tt011~ Juma~ movie USA     6.50e7  2.63e8      17.0               7  
## 3 tt011~ Grum~ movie USA     0.      0.          11.7               6.7
## 4 tt011~ Wait~ movie USA     1.60e7  8.15e7       3.86              5.9
## 5 tt011~ Fath~ movie USA     0.      7.66e7       8.39              6.1
## 6 tt011~ Heat  movie USA     6.00e7  1.87e8      17.9               8.2
## # ... with 3 more variables: total_votes <dbl>, mean_vote <dbl>,
## #   median_vote <dbl>

# Converting variable type
movieDF$imdbID <-as.factor(movieDF$imdbID)
movieDF$weighted_average_vote <-as.factor(movieDF$weighted_average_vote)
str(movieDF)

## Classes 'tbl_df', 'tbl' and 'data.frame':    34098 obs. of  11 variables:
##  $ imdbID               : Factor w/ 34073 levels "tt0000574","tt0002101",..: 14660 14481 14434 14681 14394 14446 14607 14264 14636 14426 ...
##  $ Title                : chr  "Toy Story" "Jumanji" "Grumpier Old Men" "Waiting to Exhale" ...
##  $ Type                 : chr  "movie" "movie" "movie" "movie" ...
##  $ Country              : chr  "USA" "USA" "USA" "USA" ...
##  $ budget               : num  3.0e+07 6.5e+07 0.0 1.6e+07 0.0 6.0e+07 5.8e+07 0.0 3.5e+07 5.8e+07 ...
##  $ revenue              : num  3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
##  $ popularity           : num  21.95 17.02 11.71 3.86 8.39 ...
##  $ weighted_average_vote: Factor w/ 82 levels "1.1","1.3","1.4",..: 72 59 56 48 50 71 52 44 47 61 ...
##  $ total_votes          : num  864461 299083 23742 9353 33587 ...
##  $ mean_vote            : num  8.3 7.1 6.8 6.2 6.2 8.2 6.5 5.9 5.9 7.3 ...
##  $ median_vote          : num  8 7 7 6 6 8 7 6 6 7 ...

#rearranging weighted average vote in descending order
movieDF %>% arrange(desc(weighted_average_vote))

## # A tibble: 34,098 x 11
##    imdbID Title Type  Country budget revenue popularity weighted_averag~
##    <fct>  <chr> <chr> <chr>    <dbl>   <dbl>      <dbl> <fct>           
##  1 tt011~ The ~ movie USA     2.50e7  2.83e7  51.6      9.3             
##  2 tt025~ The ~ movie Turkey  0.      0.       0.869    9.3             
##  3 tt006~ The ~ <NA>  <NA>    6.00e6  2.45e8  41.1      9.2             
##  4 tt017~ Yest~ <NA>  <NA>    0.      0.       0.000844 9.1             
##  5 tt531~ Nats~ <NA>  <NA>    0.      0.       0.251    9.1             
##  6 tt007~ The ~ <NA>  <NA>    1.30e7  4.75e7  36.6      9               
##  7 tt008~ The ~ <NA>  <NA>    0.      0.       1.98     9               
##  8 tt046~ The ~ movie USA, UK 1.85e8  1.00e9 123.       9               
##  9 tt025~ Haba~ <NA>  <NA>    0.      0.       1.34     9               
## 10 tt035~ Sand~ <NA>  <NA>    0.      0.       1.13     9               
## # ... with 34,088 more rows, and 3 more variables: total_votes <dbl>,
## #   mean_vote <dbl>, median_vote <dbl>

Tidy & Manipulate Data I

According to Hadley Wickham and Grolemund, the three principles of tidy data are:

Each variable must have its own column. Each observation must have its own row. Each value must have its own cell.

The movie datset has the column “release_date” where day,month and year are recorded in one column. However, the separation is not carried out as the release_date columns is not required. The separate() function could be used when multiple variables are stored in one column. The required dataframe “movieDF” is already tidy.

#Showing the example of untidy data
head(movie$release_date)

## [1] "1995-10-30" "1995-12-15" "1995-12-22" "1995-12-22" "1995-02-10"
## [6] "1995-12-15"

Scan I

All variables are scanned for missing values, special values and obvious errors. The missing values are represented by NA which stands for not available.To identify the missing values, is.na() function is used and also colSums() is used for the total missing values in the each columns. It is observed that variable “Type” consists of 24783 NAs and “Country” contains 24783 NAs. The common method of handling missing values are omitting the records or replacing with mean/median/mode value of the column. However, those methods for handling missing value of variable Type and Country are not applicable as number of missing value are quite large. The omitting missing values may cause bias analysis. The missing values for Type and Country are handled by replacing with “Unknown”. The missing values for title is not handled.As per recommendations, if there very small size of missing data (upto 5%),then can be left out with missing features. The missing value of revenue is replaced with 0. It is because those missing values in revenue can not be replaced by mean/mode/median. The missing value of popularity is replaced with mean of the column. is.special() function is used to check if there is any infinite and NaN values in this data. To check obvious errors, editrules package is utilised.The rule is created to define the restriction on the mean_vote and median_vote variables. These values are considered as the minimum value is 1 and the maximum value is 10. The checking is done by whether values are less than 0 or greater than 10. After creating rules, violatedEdits() is used.No special values found as well as obvious errors.

# Checking number of NA's per column
sum(is.na(movieDF))

## [1] 49572

colSums(is.na(movieDF))

##                imdbID                 Title                  Type 
##                     0                     2                 24783 
##               Country                budget               revenue 
##                 24783                     0                     2 
##            popularity weighted_average_vote           total_votes 
##                     2                     0                     0 
##             mean_vote           median_vote 
##                     0                     0

#Imputing NA value with "Unknown"
movieDF$Type[is.na(movieDF$Type)] <-"Unknown"
movieDF$Country[is.na(movieDF$Country)] <-"Unknown"

#Rechecking NA's in Type and Country columns
colSums(is.na(movieDF))

##                imdbID                 Title                  Type 
##                     0                     2                     0 
##               Country                budget               revenue 
##                     0                     0                     2 
##            popularity weighted_average_vote           total_votes 
##                     2                     0                     0 
##             mean_vote           median_vote 
##                     0                     0

#Replcing NA of popularity with mean of the column
movieDF$popularity[is.na(movieDF$popularity)] <-mean(movieDF$popularity,na.rm = TRUE)

#Replcing NA of revenue with 0
movieDF$revenue[is.na(movieDF$revenue)] <-0

#Rechecking NA's in revenue and popularity columns
colSums(is.na(movieDF))

##                imdbID                 Title                  Type 
##                     0                     2                     0 
##               Country                budget               revenue 
##                     0                     0                     0 
##            popularity weighted_average_vote           total_votes 
##                     0                     0                     0 
##             mean_vote           median_vote 
##                     0                     0

# Checking each numerical columns for special values 

is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(movieDF[,c(5,6,7,9,10,11)],function(y)sum(is.specialorNA(y)))

##      budget     revenue  popularity total_votes   mean_vote median_vote 
##           0           0           0           0           0           0

#Scanning for obvious errors
#Defining the restrictions

(Rule1<-editset(c("mean_vote >= 0","mean_vote <=10",
                  "median_vote>=0", "median_vote<=10")))

## 
## Edit set:
## num1 : 0 <= mean_vote
## num2 : mean_vote <= 10
## num3 : 0 <= median_vote
## num4 : median_vote <= 10

#Checking the errors
colSums((violatedEdits(Rule1, movieDF)))

## num1 num2 num3 num4 
##    0    0    0    0

Tidy & Manipulate Data II

add a new variable "NetIncome’

The mutation function from the tidyr package is used to produce a new variable.Firstly, Title, popularity, weighted_average_vote and revenue and budget are selected from the data and stored in new data frame. The new variable “NetIncome” is created.The new varaible “NetIncome” takes revenue value and substracted by budget.Then the NetIncome is ordered as descending. The Title, popularity and weighted_average_vote variables are selected to see the net income in relation to poularity and average vote of each movies. The movie “Avatar” is observed as highest net income.

# Select specific variables and creating new variable by using mutate()
movieDF2 <- movieDF %>%  select(Title,popularity,weighted_average_vote,revenue,budget)%>% 
mutate(NetIncome= revenue-budget) %>%
arrange(desc(NetIncome))
movieDF2

## # A tibble: 34,098 x 6
##    Title                popularity weighted_average_~  revenue  budget NetIncome
##    <chr>                     <dbl> <fct>                 <dbl>   <dbl>     <dbl>
##  1 Avatar                    185.  7.8                  2.79e9  2.37e8    2.55e9
##  2 Star Wars: The Forc~       31.6 7.9                  2.07e9  2.45e8    1.82e9
##  3 Titanic                    26.9 7.8                  1.85e9  2.00e8    1.65e9
##  4 Jurassic World             32.8 7                    1.51e9  1.50e8    1.36e9
##  5 Furious 7                  27.3 7.1                  1.51e9  1.90e8    1.32e9
##  6 The Avengers               89.9 8                    1.52e9  2.20e8    1.30e9
##  7 Harry Potter and th~       25.0 8.1                  1.34e9  1.25e8    1.22e9
##  8 Avengers: Age of Ul~       37.4 7.3                  1.41e9  2.80e8    1.13e9
##  9 Frozen                     24.2 7.4                  1.27e9  1.50e8    1.12e9
## 10 Beauty and the Beast      287.  7.1                  1.26e9  1.60e8    1.10e9
## # ... with 34,088 more rows

str(movieDF2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    34098 obs. of  6 variables:
##  $ Title                : chr  "Avatar" "Star Wars: The Force Awakens" "Titanic" "Jurassic World" ...
##  $ popularity           : num  185.1 31.6 26.9 32.8 27.3 ...
##  $ weighted_average_vote: Factor w/ 82 levels "1.1","1.3","1.4",..: 67 68 67 59 60 69 70 62 63 60 ...
##  $ revenue              : num  2.79e+09 2.07e+09 1.85e+09 1.51e+09 1.51e+09 ...
##  $ budget               : num  2.37e+08 2.45e+08 2.00e+08 1.50e+08 1.90e+08 2.20e+08 1.25e+08 2.80e+08 1.50e+08 1.60e+08 ...
##  $ NetIncome            : num  2.55e+09 1.82e+09 1.65e+09 1.36e+09 1.32e+09 ...

Scan II

In statistics, an outlier is defined as an observation which stands far away from the most of the other observations. The outlier is often result of data entry errors, measurement errors, experimental errors, intentional erros and data processing errors. The numberic variables such as popularity, revenue, budget and NetIncome are considered for outlier assessment.To detect outliers, boxplot is used. Refer to the boxplots below, several outliers are observed in each variables. The Z-score is then calculated.

# This is the R chunk for the Scan II, boxplot of numeric variables 
par(mfrow=c(2,2))
movieDF2$popularity %>% boxplot(main = "Boxplot of popularity")
movieDF2$revenue %>% boxplot(main = "Boxplot of revenue")
movieDF2$budget %>% boxplot( main = "Boxplot of budget")
movieDF2$NetIncome%>% boxplot( main = "Boxplot of NetIncome")

#Histrogram of each variables
par(mfrow=c(2,2))
hist(movieDF2$popularity)
hist(movieDF2$revenue)
hist(movieDF2$budget)
hist(movieDF2$NetIncome)

z.scores_popularity <- movieDF2$popularity %>% scores(type = "z")
z.scores_popularity %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5096 -0.4319 -0.3001  0.0000  0.2041 80.8661

z.scores_revenue <- movieDF2$revenue %>%  scores(type = "z")
z.scores_revenue %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2011 -0.2011 -0.2011  0.0000 -0.2011 37.5305

z.scores_budget<- movieDF2$budget %>%  scores(type = "z")
z.scores_budget %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2776 -0.2776 -0.2776  0.0000 -0.2776 18.8452

z.scores_netincome <- movieDF2$NetIncome %>%  scores(type = "z")
z.scores_netincome %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.9193 -0.1557 -0.1557  0.0000 -0.1557 42.3867

#Finding total number of outliers according to the z-score

length(which( abs(z.scores_popularity) >3 ))

## [1] 177

length(which( abs(z.scores_revenue) >3 ))

## [1] 542

length(which( abs(z.scores_budget) >3 ))

## [1] 745

length(which( abs(z.scores_netincome) >3 ))

## [1] 487

Capping is used for imputing the outliers. The Capping involves replacing the outliers with the nearest neighbors that are not outliers. The function is defined to cap the values outside the limits. This method is applied to popularity, revenue, budget and NetIncome variables. The outlier is still observed eventhough capping method is implemented.

#Capping function to deal with outliers
#Define function to cap the values that are outside the limits
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5 * IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5 * IQR(x) ] <- quantiles[4]
    x
}

#Capping 
movieDF2$popularity <- movieDF2$popularity %>% cap()
movieDF2$revenue <- movieDF2$revenue %>% cap()
movieDF2$budget <-movieDF2$budget %>% cap()
movieDF2$NetIncome <- movieDF2$NetIncome %>% cap()


#Visualisation after capping
boxplot(movieDF2$popularity, main = "Capped Popularity")

boxplot(movieDF2$revenue, main = "Capped Revenue")

boxplot(movieDF2$budget, main = "Capped Budget")

boxplot(movieDF2$NetIncome, main = "Capped NetIncome")

Transform

The data transformation is often requisite of variables to further proceed with statistical analysis. In this analysis,the NetIncome variable is considered . Refer to graph below, it is observed that the histogram of Netincome is right -skewed. To reduce the right-skewness, the logarithms, reciprocal, Boxcox transformations are applied.

# Checking the disbribution of NetIncome
hist(movieDF2$NetIncome, main ="Histrogram of NetIncome")

par(mfrow=c(3,2))
#Trying different transformations
hist(log10(movieDF2$NetIncome), main= "log10", xlab = "NetIncome")

## Warning in hist(log10(movieDF2$NetIncome), main = "log10", xlab = "NetIncome"):
## NaNs produced

hist(log(movieDF2$NetIncome), main= "log", xlab = "NetIncome")

## Warning in log(movieDF2$NetIncome): NaNs produced

hist(sqrt(movieDF2$NetIncome), main= "sqrt", xlab = "NetIncome")

## Warning in sqrt(movieDF2$NetIncome): NaNs produced

hist(BoxCox(movieDF2$NetIncome, lambda = "auto"), main= "Boxcox", xlab = "NetIncome")
hist((movieDF2$NetIncome)^(-1), main= "reciprocal", xlab = "NetIncome")

The most effective at increasing normality appears to be reciprocal transformation. The log10 and log resulted left skewness. The square root transformation was not effective. The Boxcox transformation reduced skewness but it did not completely improve the symmetry of the Netincome distribution.

Assignment 2

Seungyeon Lee (s3366440)

26/10/2020

Required packages

Executive Summary

Data

Understand

Tidy & Manipulate Data I

Scan I

Tidy & Manipulate Data II

add a new variable "NetIncome’

Scan II

Transform