Required packages
library(readr)
library(tidyr)
library(dplyr)
library(outliers)
library(lubridate)
library(stringr)
library(knitr)
library(MVN)
library(forecast)
Executive Summary
In this assignment, four data sets will be taken from an open data source and inputted into RStudio. Each of them will be described in details in regard with their source and variables. At the beginning, they will be combined into two data sets, and the data types of all variables within these two data sets will be checked and converted into correct types if necassary. Then after tidying up them according to Hadley Wickham’s “Tidy Data” principles, they will be combined again into a final-form data set for the rest of this assignment. Straight after this, the new single data set will be scanned to seek for the existence of any missing values, special values or inconsistencies, which will in turn be solved by appropriate approaches. Hereafter, some new variables created based on calculation over existing variables will be added to the right of the data set. Finally, each numeric variable within this data set will be scanned again to seek for the existence of outliers, which will also be handled by proper methodologies. Transformation will then be performed on certain variables in the last part. All these steps metioned above are considered as major tasks in data preprocessing, after which further procedures in data analysis, such as exploring and modelling, can be implemented on this data set.
Data
The following four data sets are acquired from Kaggle (https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset). They are orginally scraped from publicly available website - International Movie Database a.k.a. IMDb (https://www.imdb.com) and uploaded by Kaggle user.
movie <- read_csv("IMDb movies.csv")
people <- read_csv("IMDb names.csv")
rate <- read_csv("IMDb ratings.csv")
principal <- read_csv("IMDb title_principals.csv")
These four data sets are in turn inputted into R using the “read_csv” function from “readr” package as data frames, each of them has been assigned a new name as well.
Among them, the “movie” data frame includes 81,273 movies with 22 variables such as movie description, average rating, number of votes, genre, etc.
The “people” data frame includes 175,719 cast members with 20 variables such as birth details, death details, height, spouses, children, etc.
The “rate” data frame includes 81,273 rating details from demographic perspective with 49 variables.
Lastly, the “principal” data frame includes 377,848 cast members’ roles in movies with 6 variables such as IMDb title id, IMDb name id, order of importance in the movie, role, and characters played.
Since these four data frames contain nearly 100 variables (coloumns) in total, they will be in turn subsetted into smaller data frames so that only those variables (columns) useful for this assignment will be kept.
movie <- movie %>% select(imdb_title_id, title, date_published:duration, production_company, budget, worlwide_gross_income)
head(movie)
The “select” function is applied to subset “movie” data frame, leaving 8 variables as shown above. Their descriptions are as follows:
*imdb_title_id: identifier of each movie title assigned by IMDb website.
*title: title of each movie.
*date_published: date of release of each movie.
*genre: genres of each movie, a movie can have at most three genres according to IMDb’s rule.
*duration: length of each movie, recorded in minutes.
*production_company: production company (or studio) of each movie.
*budget: budget of each movie, mostly recorded in American Dollars (USD). Some are recorded in other currencies which are no longer available nowadays, especially those movices produced in late 19th/early 20th centuries.
*worlwide_gross_income: World-wide gross income of each movie, mostly recorded in American Dollars (USD). Some are recorded in other currencies which are no longer available nowadays, especially those movices produced in late 19th/early 20th centuries.
people <- people %>% select(imdb_name_id, name)
head(people)
The “select” function is applied to subset “people” data frame, leaving 2 variables as shown above. Their descriptions are as follows:
*imdb_name_id: identifier of each cast member assigned by IMDb website.
*name: full name of each cast member.
rate <- rate %>% select(imdb_title_id, total_votes, mean_vote)
head(rate)
The “select” function is applied to subset “rate” data frame, leaving 3 variables as shown above. Their descriptions are as follows:
*imdb_title_id: identifier of each movie title assigned by IMDb website.
*total_votes: total number of votes of each movie received by IMDb.
*mean_vote: average rating (on a 0-10 scale) determined by all the votes for each movie.
principal <- principal %>% select(imdb_title_id:category)
head(principal)
The “select” function is applied to subset “principal” data frame, leaving 4 variables as shown above. Their descriptions are as follows:
*imdb_title_id: identifier of each movie title assigned by IMDb website.
*ordering: Order of importance of each cast member in the movie. Notice that at most 10 most important cast members of each movie are recorded in this data frame.
*imdb_name_id: identifier of each cast member assigned by IMDb website.
*category: Category of job (role) done by the cast member.
The data frames “movie” and “rate” will be combined into a new data frame called “movie_rate”:
movie_rate <- movie %>% left_join(rate, by = "imdb_title_id")
head(movie_rate)
The function “left_join” is applied to conduct such combination so that all the original information in “movie” can be preserved, whereas only ratings that have a match in “movie” will be added to the right. Since “movie” and “rate” are connected via variable “imdb_title_id”, it is chosen as the key variable in “by” argument.
The “principal” data frame seems to be understandable for computers, but they will certainly raise confusion among human beings. To make it more informative for people, movie titles and cast members’ full names should be added into this data frame. Hence, the data frames “movie”, “principal” and “people” will be combined into a new data frame called “movie_principal_people”:
movie_principal_people <- principal %>% left_join(movie %>% select(imdb_title_id, title), by = "imdb_title_id") %>%
left_join(people %>% select(imdb_name_id, name), by = "imdb_name_id") %>% select(imdb_title_id, title, ordering, imdb_name_id, name, category)
head(movie_principal_people)
Notice that joining three data frames requires two steps of combination. Again, function “left_join” is applied throughout these two combinations so that all the original information in “principal” can be preserved, whereas only movie titles and cast members’ names that have a match in “principal” will be added to the right. Since “principal” and “movie” are connected via variable “imdb_title_id”, “principal” and “people” are connected via variable “imdb_name_id”, these two variables are chosen as the keys for the “by” argument in each step of combinations. To make the new data frame “movie_principal_people” easier to read, its columns are arranged in such way that movie title is straight after “imdb_title_id”, and cast members’ name is straight after “imdb_name_id”.
“movie_rate” and “movie_principal_people” will be joined together later after tidying up.
Understand
str(movie_rate)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 81273 obs. of 10 variables:
$ imdb_title_id : chr "tt0000574" "tt0001892" "tt0002101" "tt0002130" ...
$ title : chr "The Story of the Kelly Gang" "Den sorte drøm" "Cleopatra" "L'Inferno" ...
$ date_published : chr "1906-12-26" "1911-08-19" "1912-11-13" "1911-03-06" ...
$ genre : chr "Biography, Crime, Drama" "Drama" "Drama, History" "Adventure, Drama, Fantasy" ...
$ duration : num 70 53 100 68 60 85 120 120 55 121 ...
$ production_company : chr "J. and N. Tait" "Fotorama" "Helen Gardner Picture Players" "Milano Film" ...
$ budget : chr "$ 2250" NA "$ 45000" NA ...
$ worlwide_gross_income: chr NA NA NA NA ...
$ total_votes : num 537 171 420 2019 438 ...
$ mean_vote : num 6.3 6.1 5.2 6.9 5.8 6.8 6.2 7.1 5.4 6.6 ...
By applying “str” function, 10 variables are found in data frame “movie_rate”. Data types of each column are automatically “guessed” by this function, as shown above.
*Column “imdb_title_id” is of correct data type, because identifier of each movie is a string of characters and numbers.
*Column “title” is of correct data type, because movie titles mainly consist of characters, sometimes numbers are involved as well.
*Column “date_published” should be of date type instead of character. However, for some movies, only the number of year is recorded on IMDb, so this column cannot be directly converted into date type by “as.Date” function. Such conversion will be conducted later in the part “Scan I” after these input inconsistencies are solved.
*Column “genre” is of correct data type under current condition, where each movie can have more than one genre. However, if in some other cases where one movie can only have one genre, then this column can be considered as factor type as well.
*Column “duration” is of correct data type, since lengths of movies are expressed as integers.
*Column “production_company” is of correct data type, since names of the production companies (or studios) consist of characters.
*Column “budget” is of correct data type, because budget of each movie is recorded in number values together with corresponding currencies labelled in front of the number.
*Column “worlwide_gross_income” is of correct data type, since gross income is recorded in number values together with corresponding currencies labelled in front of the number.
*Column “total_votes” is of correct data type, because total number of votes of each movie is denoted by integers.
*Column “mean_vote” is of correct data type, since the average rating of each movie is expressed as a floating number over a scale of 0-10 with 2 significant numbers, where 0 indicates the movie is least favoured and 10 indicates the movie is most favoured.
str(movie_principal_people)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 377848 obs. of 6 variables:
$ imdb_title_id: chr "tt0000574" "tt0000574" "tt0000574" "tt0000574" ...
$ title : chr "The Story of the Kelly Gang" "The Story of the Kelly Gang" "The Story of the Kelly Gang" "The Story of the Kelly Gang" ...
$ ordering : num 1 2 3 4 5 6 7 8 9 10 ...
$ imdb_name_id : chr "nm0846887" "nm0846894" "nm3002376" "nm0170118" ...
$ name : chr "Elizabeth Tait" "John Tait" "Norman Campbell" "Bella Cola" ...
$ category : chr "actress" "actor" "actor" "actress" ...
movie_principal_people$ordering <- factor(movie_principal_people$ordering, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), ordered=TRUE)
movie_principal_people$category <- as.factor(movie_principal_people$category)
By applying “str” function, 6 variables are found in data frame “movie_principal_people”. Data types of each column are automatically “guessed” by this function, as shown above.
*Both “imdb_title_id” and “title” are of correct data types, as discussed before.
*Column “ordering” should be of factor type, because each movie can have at most 10 most important cast members, so values under this column can only be one of the 10 integers (from 1 to 10). Moreover, these 10 factors are ordinal and they should be ordered as 1<2<3<4<5<6>7<8<9<10, where 1 means the most important cast member and 10 means the least important cast member.
*Column “imdb_name_id” is of correct data type, since identifier of each cast member is a string of characters and numbers.
*Column “name” is of correct data type, because each cast member’s full name consists of only characters.
*Column “category” should be of factor type, since there are only 12 distinct values under this column: “actor”, “actress”, “archive_footage”, “archive_sound”, “cinematographer”, “composer”, “director”, “editor”, “producer”, “production_designer”, “self” and “writer”. Notice that there is no order between these 12 factors.
str(movie_principal_people)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 377848 obs. of 6 variables:
$ imdb_title_id: chr "tt0000574" "tt0000574" "tt0000574" "tt0000574" ...
$ title : chr "The Story of the Kelly Gang" "The Story of the Kelly Gang" "The Story of the Kelly Gang" "The Story of the Kelly Gang" ...
$ ordering : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 2 3 4 5 6 7 8 9 10 ...
$ imdb_name_id : chr "nm0846887" "nm0846894" "nm3002376" "nm0170118" ...
$ name : chr "Elizabeth Tait" "John Tait" "Norman Campbell" "Bella Cola" ...
$ category : Factor w/ 12 levels "actor","actress",..: 2 1 1 2 7 9 9 9 6 5 ...
Now every column in “movie_principal_people” is of correct data types.
Tidy & Manipulate Data I
Although the data frame “movie_principal_people” seems to be tidy on its own, it is considered as untidy when people need to join it with “movie_rate”. Suppose that we want to figure out the 3 most important cast members of each movie, combining “movie_rate” and “movie_principal_people” directly will result in information of the same movie being repeated several times in different rows, namely information redundancy. Hence, before joining these two data frames, “movie_principal_people” should first be “spreaded” into wide format:
movie_principal_people <- movie_principal_people %>% unite(name_category, "name", "category", sep = " (")
movie_principal_people$name_category <- paste0(movie_principal_people$name_category, ")")
In order to employ the “spread” function correctly, the two columns, “name” and “category”, should be combined together to prevent any errors. This is done by the assistance from “unite” function, and the new column is named as “name_category”. The separator being used here is " (“, so that after pasting a”)" to each row of the new column “name_category”, the role of each cast member will be nicely demostrated inside a bracket next to their full names, as shown below.
head(movie_principal_people)
From now on, the “spread” function can be smoothly operated over the data frame “movie_principal_people”.
important_cast_3 <- movie_principal_people %>% filter(ordering <= 3) %>% select(imdb_title_id, ordering, name_category) %>% spread(key = ordering, value = name_category)
colnames(important_cast_3) <- c("imdb_title_id", "first", "second", "third")
Since only the 3 most important cast members are going to be added to “movie_rate” data frame, “filter” function has been applied to select the rows with ordering equals to either 1, 2 or 3 from “movie_principal_people”. Also, since titles of each movie already exist in the “movie_rate” data frame, only the columns “imdb_title_id”, “ordering” and “name_category” are chosen by “select” function. To “spread” the data frame, the key column should be “ordering” as it contains variable names, and the value column is “name_category” as it contains values forming multiple variables. The new data frame after spreading is temporarily called “important_cast_3”, and its columns are renamed as “imdb_title_id”, “first”, “second” and “third”. The first few rows are shown below:
head(important_cast_3)
Now that each movie only appears in one row, it can be joined to “movie_rate”.
movie_rate_3principals <- movie_rate %>% left_join(important_cast_3, by = "imdb_title_id")
movie_rate_principals <- movie_rate_3principals %>% unite(principal_casts, "first", "second", "third", sep = ", ", na.rm = TRUE)
“movie_rate” and “important_cast_3” are combined together through their common column, “imdb_title_id”. Here function “left_join” is applied so that all the original information from “movie_rate” can be reserved. To make the whole data frame more compendious, we compress the three columns containing the 3 most important cast members into a single column called “principal_casts”, by using “unite” function and skipping all the NA’s (NA’s exist since some movies have less then 3 most important cast members). The new data frame is named as “movie_rate_principals”, and its first few rows are shown below:
head(movie_rate_principals)
movie_rate_principals$principal_casts[movie_rate_principals$principal_casts == ""] <- "Unknown"
It is observed that some rows under the column “principal_casts” don’t contain any information. This is because not all movies are recorded in the data frame “important_cast_3” (which originally came from the data frame “principal”) in regard with information about their most important cast members. To make this column more straightforward (and less confusing), replace all the empty values by “Unknown”. The final form of data frame “movie_rate_principals” is shown below:
head(movie_rate_principals)
Notice that at this stage, all the combinations of data frames have been completed. “movie_rate_principals” will be used throughout the rest of this assignment.
Scan I
In this part, the presence of any missing value, special value and input inconsistency will be investigated throughout data frame “movie_rate_principals”.
colSums(is.na(movie_rate_principals))
imdb_title_id title date_published genre duration
0 0 0 0 0
production_company budget worlwide_gross_income total_votes mean_vote
4325 58469 51381 0 0
principal_casts
0
By checking number of NA’s of each column in “movie_rate_principals”, it is found that only “production_company”, “budget” and “worlwide_gross_income” contain missing values. There are several reasons that the rows having missing values should not be excluded from the data frame. First of all, all three columns mentioned previously possess considerably large propotion of NA’s (way larger than 5% - the recommended proportion for excluding missing data), especially for “budget” and “worlwide_gross_income” in which the number of missing values is more than half of number of observations (rows). Removing these rows will certainly result in losing too much significant information.
On the other hand, it is rather rational that these columns containing such large number of missing values. In the U.S., movie production has been industrialised & standardised for nearly a century, thus most movies originated in America are produced by specific production companies. However, in some other countries, movie production industry has only been established for a very short time, it might be relatively difficult for websites such as IMDb to trace back on the information about those production companies. This is why the column “production_company” possesses so many missing values.
For the other two columns, “budget” & “worlwide_gross_income”, the reason is pretty similar as well. Immature movie production industry makes it troublesome for people to record these values. In other scenarios, some production companies might be unwilling to publish their budget and gross income of certain movies, since a lot of “trading secrets” are involved in such information.
Nevertheless, missing values in “production_company” can be handled by replacing the NA’s with “Unknown”, so that people can understand that they are not recorded:
movie_rate_principals$production_company[is.na(movie_rate_principals$production_company)] <- "Unknown"
Function “is.na” is used to locate the positions of NA’s.
At this stage, missing values in “budget” and “worlwide_gross_income” are still left there, these two columns have too many NA’s so substituting column mean/median will be unmeaningful. Moreover, there also exist some input inconsistancies among them which will be discussed later.
sapply(movie_rate_principals, function(x) sum(is.infinite(x)))
imdb_title_id title date_published genre duration
0 0 0 0 0
production_company budget worlwide_gross_income total_votes mean_vote
0 0 0 0 0
principal_casts
0
sapply(movie_rate_principals, function(x) sum(is.nan(x)))
imdb_title_id title date_published genre duration
0 0 0 0 0
production_company budget worlwide_gross_income total_votes mean_vote
0 0 0 0 0
principal_casts
0
By applying “is.infinite” and “is.nan” functions, no special value is found throughout data frame “movie_rate_principals”.
However, there still exist some input inconsistancies in several columns. In “date_published”, most values are recorded in th form “YYYY-MM-DD”, whereas some other rows only contain the value of year. This situation is solved as shown below:
movie_rate_principals <- movie_rate_principals %>% separate(date_published, into = c("year_published", "month_published", "day_published"), sep = "-")
movie_rate_principals$month_published[is.na(movie_rate_principals$month_published)] <- 01
movie_rate_principals$day_published[is.na(movie_rate_principals$day_published)] <- 01
movie_rate_principals <- movie_rate_principals %>% unite(date_published, "year_published", "month_published", "day_published", sep = "-")
movie_rate_principals$date_published <- as.Date(movie_rate_principals$date_published)
For the first step, “date_published” is temporarily seperated into three columns, each contains the value of year, month and day respectively. By doing so, those rows that miss the value of month and day can be located by “is.na” function. To make it easier to convert column “date_published” into date data type, a not-so-justifiable assumption has been made on those movies (rows) only contatining year value, that is they are all released on the 1st of January. I chose to do so because in the succeeding parts of this assignment, only the value of year will be utilised for analysis, and it will be easier to extract year value from a date-type column by using “lubridate” package. However, in other scenarios where values of month and day are also required for analysis, such assumption should never be made.
To implememnt the above assumption, replace NA’s in month and day columns by “01”, then combine the temporary columns “year_published”, “month_published” and “day_published” back into “date_published” by using “unite” function. After this, “date_published” can be converted to date type through “as.Date” function. Now that every vairable in data frame “movie_rate_principals” is of correct data types, as shown below:
str(movie_rate_principals)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 81273 obs. of 11 variables:
$ imdb_title_id : chr "tt0000574" "tt0001892" "tt0002101" "tt0002130" ...
$ title : chr "The Story of the Kelly Gang" "Den sorte drøm" "Cleopatra" "L'Inferno" ...
$ date_published : Date, format: "1906-12-26" "1911-08-19" "1912-11-13" "1911-03-06" ...
$ genre : chr "Biography, Crime, Drama" "Drama" "Drama, History" "Adventure, Drama, Fantasy" ...
$ duration : num 70 53 100 68 60 85 120 120 55 121 ...
$ production_company : chr "J. and N. Tait" "Fotorama" "Helen Gardner Picture Players" "Milano Film" ...
$ budget : chr "$ 2250" NA "$ 45000" NA ...
$ worlwide_gross_income: chr NA NA NA NA ...
$ total_votes : num 537 171 420 2019 438 ...
$ mean_vote : num 6.3 6.1 5.2 6.9 5.8 6.8 6.2 7.1 5.4 6.6 ...
$ principal_casts : chr "Elizabeth Tait (actress), John Tait (actor), Norman Campbell (actor)" "Unknown" "Unknown" "Salvatore Papa (actor), Arturo Pirovano (actor), Giuseppe de Liguoro (actor)" ...
Inconsistancies also exist in the columns “budget” and “worlwide_gross_income”, where the strings consist of currency labels followed by the actual numeric values. Among them, most currencies are denoted by three capital letters, with the exception that U.S. Dollars is denoted by the symbol “$”. Such inconsistancy is handled as follows:
movie_rate_principals <- movie_rate_principals %>% separate(budget, into = c("budget_unit", "budget_value"), sep = " ")
movie_rate_principals <- movie_rate_principals %>% separate(worlwide_gross_income, into = c("gross_income_unit", "gross_income_value"), sep = " ")
movie_rate_principals$budget_unit[movie_rate_principals$budget_unit == "$"] <- "USD"
movie_rate_principals$gross_income_unit[movie_rate_principals$gross_income_unit == "$"] <- "USD"
Notice that there is always a space (" “) between currency labels and numeric values. Based on this special format created by IMDb, we can separate currency labels and numeric values into two columns simply by using”separate" function. During this step, currency columns are named as "%_unit“, and numeric columns are named as”%_value“. Now that all the”$" symbols in currency columns can be replaced by a more formal format “USD”, just like all the other currencies. Another benefit of separating currency labels and numeric values is that we can now perform mathematical calculations over budget and gross income for each movie. The current data types of columns in the data frame “movie_rate_principals” is shown below:
str(movie_rate_principals)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 81273 obs. of 13 variables:
$ imdb_title_id : chr "tt0000574" "tt0001892" "tt0002101" "tt0002130" ...
$ title : chr "The Story of the Kelly Gang" "Den sorte drøm" "Cleopatra" "L'Inferno" ...
$ date_published : Date, format: "1906-12-26" "1911-08-19" "1912-11-13" "1911-03-06" ...
$ genre : chr "Biography, Crime, Drama" "Drama" "Drama, History" "Adventure, Drama, Fantasy" ...
$ duration : num 70 53 100 68 60 85 120 120 55 121 ...
$ production_company: chr "J. and N. Tait" "Fotorama" "Helen Gardner Picture Players" "Milano Film" ...
$ budget_unit : chr "USD" NA "USD" NA ...
$ budget_value : chr "2250" NA "45000" NA ...
$ gross_income_unit : chr NA NA NA NA ...
$ gross_income_value: chr NA NA NA NA ...
$ total_votes : num 537 171 420 2019 438 ...
$ mean_vote : num 6.3 6.1 5.2 6.9 5.8 6.8 6.2 7.1 5.4 6.6 ...
$ principal_casts : chr "Elizabeth Tait (actress), John Tait (actor), Norman Campbell (actor)" "Unknown" "Unknown" "Salvatore Papa (actor), Arturo Pirovano (actor), Giuseppe de Liguoro (actor)" ...
Tidy & Manipulate Data II
In real life, people usually reference to the box office of a movie to determine its comprehensive performance. Although the actual quality of a movie is not always related to its box office, it worths the effort to investigate on how much profit a movie can produce. In this part, profit and profit rate of each movie will be calculated and added to the right of data frame “movie_rate_principals”, as follows:
movie_rate_principals$budget_value <- as.numeric(movie_rate_principals$budget_value)
movie_rate_principals$gross_income_value <- as.numeric(movie_rate_principals$gross_income_value)
movie_rate_principals_profit <- movie_rate_principals %>% filter(budget_unit == gross_income_unit) %>%
mutate(profit = gross_income_value - budget_value, profit_rate = profit/budget_value)
Before any calculation, the two columns “budget_value” and “gross_income_value” shoud first be converted into numeric types. Besides, it is only meaningful to calculate profit and profit rate when a movie’s budget and gross income are recorded in the same currency. The “filter” function is applied to filter out those rows that qualify such condition. Profit is calculated by budget subtracted from gross income, and profit rate equals to quotient of profit over budget. Notice that a negative profit/rate indicates the movie has lost money, while a positive profit/rate indicates the movie has actually made profit. These two new variables can be added to the right of data frame “movie_rate_principals” by using “mutate” function. The new data frame after mutation is renamed as “movie_rate_principals_profit”, and its first few rows are shown below:
head(movie_rate_principals_profit)
Scan II
In this part, each column from data frame “movie_rate_principals_profit” which contains numeric values will be checked whether outliers exist or not, with the assistance of Box Plot (a.k.a. Tukey’s method of outlier detection), Histogram and their corresponding z-scores (only when the values are approximately normally distributed, which is implied by a nearly symmetric histogram).
par(mfrow=c(1,2))
movie_rate_principals_profit$duration %>% boxplot(main="Box Plot of Movies' Duration", ylab = "Duration (in minutes)")
hist(movie_rate_principals_profit$duration, main = "Histogram of Movies' Duration")

According to Box Plot, there exist a considerable amount of outliers under column “duration”, all of them are located above the maximum calculated by Tukey’s Method. This is because, although most movies are maintained between 90-120 minutes for the favour of audience’s patience, there is no rule to limit a movie’s length and some directors may have a lot more content to be presented in their movies than others, which is pretty reasonable. On the other hand, as the shortest movie last more than 50 minutes, no outlier is found under the minimum. Since the histogram is dramatically skewed to the right, z-score test is not appropriate here.
par(mfrow=c(1,2))
movie_rate_principals_profit$budget_value %>% boxplot(main="Box Plot of Movies' Budget", ylab = "Budget")
hist(movie_rate_principals_profit$budget_value, main = "Histogram of Movies' Budget")

According to Box Plot, there exist a considerable amount of outliers under column “budget_value”, all of them are located above the maximum calculated by Tukey’s Method. This is because no rules are made upon maximum amount of budget a production company can put in their movies. In fact, it is pretty common that Sci-Fi movies cost way more budget than other genres. Besides, due to economic inflation, money value nowadays is way different than in the past, which will also result in budget value keeping increasing as time goes by. On the other hand, as budget must be a positive value, a default lower bound is set above zero, thus no outlier is found under the minimum. Since the histogram is completely unsymmetric, z-score test is not appropriate here.
par(mfrow=c(1,2))
movie_rate_principals_profit$gross_income_value %>% boxplot(main="Box Plot of Movies' Gross Income", ylab = "Gross Income")
hist(movie_rate_principals_profit$gross_income_value, main = "Histogram of Movies' Gross Income")

According to Box Plot, there exist a considerable amount of outliers under column “gross_income_value”, all of them are located above the maximum calculated by Tukey’s Method. Again, this is because no upper bound is predefined on how much gross income a movie can make. If a particular movie is outstandingly favoured by the audience, then its gross income will upsurprisingly reach a new peak. Besides, due to economic inflation, money value nowadays is way different than in the past, which will also result in gross income value keeping increasing as time goes by. On the other hand, as gross income must be a positive value, a default lower bound is set above zero, thus no outlier is found under the minimum. Since the histogram is nowhere close to be symmetric, z-score test is not appropriate here.
par(mfrow=c(1,2))
movie_rate_principals_profit$total_votes %>% boxplot(main="Box Plot of Movies' Number of Votes", ylab = "Number of Votes")
hist(movie_rate_principals_profit$total_votes, main = "Histogram of Movies' Number of Votes")

According to Box Plot, there exist a considerable amount of outliers under column “total_votes”, all of them are located above the maximum calculated by Tukey’s Method. This is because there is no upper bound to prevent audience from voting. If a particular movie is outstandingly favoured, or detested, by the audience, then it will be undoubtfully voted way more times than other movies. On the other hand, as number of votes cannot be negative, a default lower bound is set at zero, thus no outlier is found under the minimum. Since the histogram is dramatically skewed to the right, z-score test is not appropriate here.
par(mfrow=c(1,2))
movie_rate_principals_profit$mean_vote %>% boxplot(main="Box Plot of Movies' Average Rating", ylab = "Average Rating")
hist(movie_rate_principals_profit$mean_vote, main = "Histogram of Movies' Average Rating")

z.scores <- movie_rate_principals_profit$mean_vote %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.98290 -0.60703 0.07366 0.00000 0.68142 2.89367
which( abs(z.scores) >3 )
[1] 328 606 670 730 1024 1035 1038 1855 2161 2214 2297 2477 2901 2975 3046 3253 3467 3628 3747 3899 4026 4095 4131 4208 4331
[26] 4722 4852 4898 5029 5164 5191 5224 5501 5741 5747 5769 5785 5818 5976 6068 6143 6160 6168 6229 6357 6382 6559 6580 6645 6703
[51] 7000 7255 7468 7572 7582 7638 7662 7667 7688 7739 7773 7825 7904 7911 8148 8204 8216 8229 8286 8299 8355 8388 8617 8658 8666
length(which( abs(z.scores) >3 ))
[1] 75
According to Box Plot, there exist a considerable amount of outliers under column “mean_vote”, most of them are located under the minimum calculated by Tukey’s Method, while some are above the maximum as well. This turns out to be very different from previous cases. The reason behind is that rating of a movie is set over the scale of 0-10, i.e. there exist a predifined lower bound at 0 and upper bound at 10. While most movies receive average ratings between 4 and 9, outliers under the minimum are apparently more than those above the maxium, because it is reasonably harder for a movie to gain audience’s praise than disapproval. It is observed that the histogram is slightly skewed to the left, but still pretty close to be symmetric. Thus z-score test can be performed in this case. Based on the result, 75 outliers are identified by z-score test.
par(mfrow=c(1,2))
movie_rate_principals_profit$profit %>% boxplot(main="Box Plot of Movies' Profit", ylab = "Profit")
hist(movie_rate_principals_profit$profit, main = "Histogram of Movies' Profit")

According to Box Plot, there exist a considerable amount of outliers under column “profit”, most of them are located above the maximum calculated by Tukey’s Method, while some are below the minimum as well. Notice that there is neither predefined upper bound on how much profit a movie can make, nor predefined lower bound on how much money a movie can lose. Outliers above the maximum are apparently more than those under the minimum, because as a well-known recreational product, it is reasonably common that a movie is able to make certain amount of profit, unless it is extremely unpopular or with extremly low quality. Since the histogram is completely unsymmetric, z-score test is not appropriate here.
par(mfrow=c(1,2))
movie_rate_principals_profit$profit_rate %>% boxplot(main="Box Plot of Movies' Profit Rate", ylab = "Profit Rate")
hist(movie_rate_principals_profit$profit_rate, main = "Histogram of Movies' Profit Rate")

z.scores2 <- movie_rate_principals_profit$profit_rate %>% scores(type = "z")
z.scores2 %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.01447 -0.01447 -0.01445 0.00000 -0.01441 84.50993
which( abs(z.scores2) >3 )
[1] 618 7045
length(which( abs(z.scores2) >3 ))
[1] 2
According to Box Plot, there only exist a small number of outliers under column “profit_rate”, a few of them are located above the maximum calculated by Tukey’s Method and others are below the minimum. Similarly to “profit”, there is neither predefined upper bound nor predefined lower bound on possible profit rate. Except those values in upper tail, the histogram turns out to be nearly symmetric, thus the z-score test can be performed in this case. Based on its result, only 2 outliers are found.
In summary, outliers in columns “duration”, “budget_value”, “gross_income_value”, “total_votes”, “mean_vote” and “profit” should not be excluded, since their numbers are considerably large compared to the number of observations, i.e. they have taken up a large proportion of samples. They shoud neither be handled by imputation or capping, since none of these outliers mentioned above is caused by data entry/processing errors. Instead, these outliers can be “eliminated” by transformation so that their original values are “maintained” (which means people can achieve their original values by simply inversing the transformation) and at the same time, further statistical analysis can be applied over these variables. This will be discussed in the next part “Transform”.
However, outliers in the column “profit_rate” might be excluded upon demand, especially those 2 found by z-score test as they are relatively small in numbers compared to the total number of oberservations (which is 8908). Such elimination can be down as follows:
profit_rate_clean <- movie_rate_principals_profit$profit_rate[-which(abs(z.scores2) >3)]
Aside from univariate outliers, bivariate outliers may also exist in the data frame “movie_rate_principals_profit”. Suppose that people are interested in the relationship between budget and gross income of movies released between 2000 and 2010, then scatter plot and the Mahalanobis distance can be used to detect outliers based on budget gross income, as shown below:
par(mfrow=c(1,2))
gross_income_budget <- movie_rate_principals_profit %>% filter(year(date_published)<=2010, year(date_published)>=2000) %>% select(gross_income_value, budget_value)
gross_income_budget %>% plot(gross_income_value ~ budget_value, data = ., ylab = "Gross Income", xlab = "Budget", main = "Gross Income VS Budget")
results <- mvn(data = gross_income_budget, multivariateOutlierMethod = "quan", showOutliers = TRUE)

Before plotting, movies released between 2000 and 2010 should first be filtered out by applying “filter” function. Next, the columns “gross_income_value” and “budget_value” which are useful for plotting can be extracted and saved into a new data frame called “gross_income_budget”. To get the scatter plot, “plot” function is applied where y-axis represents gross income and x-axis represents budget. To acquire Mahalanobis distance, “mvn” function from the MVN package is applied and the method being used here is chi-square distribution critical value, as mentioned in the argument by “quan”.
In the scatter plot, it is observed that for most movies, small gross income is associated with small budget and large gross income is associated with large budget. However, there do exist some outliers located in bottom-right and top-right corners. In Mahalanobis distance, it is found that outliers have taken up nearly half of the number of observations. Thus they should not be excluded. Besides, they shoud neither be handled by imputation or capping, since none of them is caused by data entry/processing errors.
Transform
As mentioned in previous part, most of the variables in data frame “movie_rate_principals_profit” have their distributions skewed to the right, which is not preferred by statistical inference. In order to achieve a symmetric (normal) distribution, we can seek assistance from transformation. In the following, the coulmn “budget_value” will be used as an exmample:
par(mfrow=c(1,2))
hist(log(movie_rate_principals_profit$budget_value), main = "Histogram of log(budget value)")
bc_budget <- BoxCox(movie_rate_principals_profit$budget_value, lambda = "auto")
hist(bc_budget, main = "Histogram of Box-Cox Transformed Budget Value")

Since all values under the column “budget_value” are positive, logarithm transformation can be applied to reduce the right skewness. This can be simply implemented by using “log” function. Another approach to achieve a symmetric (normal) distribution is the BoxCox Transformation. This is done by using “BoxCox” function from the “forecast” package. As shown above, both transformations have behaved pretty well in turning the distribution into a nearly normal one. Notice that logarithm transformation can only be applied when all the values are positive. If there exist any zero or negative values (such as in “profit” and “profit_rate”), then people should turn to reciprocal (or BoxCox) transformation to reduce right skewness.
Across all the numeric variables in data frame “movie_rate_principals_profit”, it is also observed that they are set on completely different ranges to each other. Such condition might have serious impact on statistical anlysis or machine learning prediction accuracy. In order to reduce such impact, all these numeric variables should be normalised into similar scales, which can be done by z-score standardisation as follows:
z_movie_rate_principals_profit <- movie_rate_principals_profit %>% select(duration, budget_value, gross_income_value, total_votes, mean_vote, profit, profit_rate) %>% scale(center = TRUE, scale = TRUE)
attributes(z_movie_rate_principals_profit)
$dim
[1] 8908 7
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "duration" "budget_value" "gross_income_value" "total_votes" "mean_vote" "profit"
[7] "profit_rate"
$`scaled:center`
duration budget_value gross_income_value total_votes mean_vote profit
1.062924e+02 2.470251e+07 6.569612e+07 6.893791e+04 6.424248e+00 4.099361e+07
profit_rate
6.594215e+02
$`scaled:scale`
duration budget_value gross_income_value total_votes mean_vote profit
1.891083e+01 3.564445e+07 1.529234e+08 1.416507e+05 1.028367e+00 1.282126e+08
profit_rate
4.563320e+04
Firsly, all 7 numeric variables should be extracted from data frame “movie_rate_principals_profit”. Then z-score standardisation can be implemented by the “scale” fucntion with both arguments “center” and “scale” setting as TRUE.
---
title: "MATH2349 Data Wrangling Semester 1, 2020"
author: "Weipu Xue S3848653"
subtitle: Assignment 2
output:
  html_notebook: default
---
## Required packages 
```{r, message = FALSE, error=FALSE, warning=FALSE}
library(readr)
library(tidyr)
library(dplyr)
library(outliers)
library(lubridate)
library(stringr)
library(knitr)
library(MVN)
library(forecast)
```

## Executive Summary 
In this assignment, four data sets will be taken from an open data source and inputted into RStudio. Each of them will be described in details in regard with their source and variables. At the beginning, they will be combined into two data sets, and the data types of all variables within these two data sets will be checked and converted into correct types if necassary. Then after tidying up them according to Hadley Wickham’s “Tidy Data” principles, they will be combined again into a final-form data set for the rest of this assignment. Straight after this, the new single data set will be scanned to seek for the existence of any missing values, special values or inconsistencies, which will in turn be solved by appropriate approaches. Hereafter, some new variables created based on calculation over existing variables will be added to the right of the data set. Finally, each numeric variable within this data set will be scanned again to seek for the existence of outliers, which will also be handled by proper methodologies. Transformation will then be performed on certain variables in the last part. All these steps metioned above are considered as major tasks in data preprocessing, after which further procedures in data analysis, such as exploring and modelling, can be implemented on this data set.


## Data 
The following four data sets are acquired from Kaggle (https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset). They are orginally scraped from publicly available website - International Movie Database a.k.a. IMDb (https://www.imdb.com) and uploaded by Kaggle user. 
```{r, message = FALSE, error=FALSE, warning=FALSE}
movie <- read_csv("IMDb movies.csv")
people <- read_csv("IMDb names.csv")
rate <- read_csv("IMDb ratings.csv")
principal <- read_csv("IMDb title_principals.csv")
```
These four data sets are in turn inputted into R using the "read_csv" function from "readr" package as data frames, each of them has been assigned a new name as well. 

Among them, the "movie" data frame includes 81,273 movies with 22 variables such as movie description, average rating, number of votes, genre, etc. 

The "people" data frame includes 175,719 cast members with 20 variables such as birth details, death details, height, spouses, children, etc. 

The "rate" data frame includes 81,273 rating details from demographic perspective with 49 variables.

Lastly, the "principal" data frame includes 377,848 cast members' roles in movies with 6 variables such as IMDb title id, IMDb name id, order of importance in the movie, role, and characters played.

Since these four data frames contain nearly 100 variables (coloumns) in total, they will be in turn subsetted into smaller data frames so that only those variables (columns) useful for this assignment will be kept.
```{r}
movie <- movie %>% select(imdb_title_id, title, date_published:duration, production_company, budget, worlwide_gross_income)
head(movie)
```
The "select" function is applied to subset "movie" data frame, leaving 8 variables as shown above. Their descriptions are as follows:

*imdb_title_id: identifier of each movie title assigned by IMDb website.

*title: title of each movie.

*date_published: date of release of each movie.

*genre: genres of each movie, a movie can have at most three genres according to IMDb's rule.

*duration: length of each movie, recorded in minutes.

*production_company: production company (or studio) of each movie.

*budget: budget of each movie, mostly recorded in American Dollars (USD). Some are recorded in other currencies which are no longer available nowadays, especially those movices produced in late 19th/early 20th centuries.

*worlwide_gross_income: World-wide gross income of each movie, mostly recorded in American Dollars (USD). Some are recorded in other currencies which are no longer available nowadays, especially those movices produced in late 19th/early 20th centuries.

```{r}
people <- people %>% select(imdb_name_id, name)
head(people)
```
The "select" function is applied to subset "people" data frame, leaving 2 variables as shown above. Their descriptions are as follows:

*imdb_name_id: identifier of each cast member assigned by IMDb website.

*name: full name of each cast member.

```{r}
rate <- rate %>% select(imdb_title_id, total_votes, mean_vote)
head(rate)
```
The "select" function is applied to subset "rate" data frame, leaving 3 variables as shown above. Their descriptions are as follows:

*imdb_title_id: identifier of each movie title assigned by IMDb website.

*total_votes: total number of votes of each movie received by IMDb.

*mean_vote: average rating (on a 0-10 scale) determined by all the votes for each movie.

```{r}
principal <- principal %>% select(imdb_title_id:category)
head(principal)
```
The "select" function is applied to subset "principal" data frame, leaving 4 variables as shown above. Their descriptions are as follows:

*imdb_title_id: identifier of each movie title assigned by IMDb website.

*ordering: Order of importance of each cast member in the movie. Notice that at most 10 most important cast members of each movie are recorded in this data frame.

*imdb_name_id: identifier of each cast member assigned by IMDb website.

*category: Category of job (role) done by the cast member.

The data frames "movie" and "rate" will be combined into a new data frame called "movie_rate":
```{r}
movie_rate <- movie %>% left_join(rate, by = "imdb_title_id")
head(movie_rate)
```
The function "left_join" is applied to conduct such combination so that all the original information in "movie" can be preserved, whereas only ratings that have a match in "movie" will be added to the right. Since "movie" and "rate" are connected via variable "imdb_title_id", it is chosen as the key variable in "by" argument.

The "principal" data frame seems to be understandable for computers, but they will certainly raise confusion among human beings. To make it more informative for people, movie titles and cast members' full names should be added into this data frame. Hence, the data frames "movie", "principal" and "people" will be combined into a new data frame called "movie_principal_people":
```{r}
movie_principal_people <- principal %>% left_join(movie %>% select(imdb_title_id, title), by = "imdb_title_id") %>% 
  left_join(people %>% select(imdb_name_id, name), by = "imdb_name_id") %>% select(imdb_title_id, title, ordering, imdb_name_id, name, category)
head(movie_principal_people)
```
Notice that joining three data frames requires two steps of combination. Again, function "left_join" is applied throughout these two combinations so that all the original information in "principal" can be preserved, whereas only movie titles and cast members' names that have a match in "principal" will be added to the right. Since "principal" and "movie" are connected via variable "imdb_title_id", "principal" and "people" are connected via variable "imdb_name_id", these two variables are chosen as the keys for the "by" argument in each step of combinations. To make the new data frame "movie_principal_people" easier to read, its columns are arranged in such way that movie title is straight after "imdb_title_id", and cast members' name is straight after "imdb_name_id".

"movie_rate" and "movie_principal_people" will be joined together later after tidying up.


## Understand 
```{r}
str(movie_rate)
```
By applying "str" function, 10 variables are found in data frame "movie_rate". Data types of each column are automatically "guessed" by this function, as shown above. 

*Column "imdb_title_id" is of correct data type, because identifier of each movie is a string of characters and numbers.

*Column "title" is of correct data type, because movie titles mainly consist of characters, sometimes numbers are involved as well.

*Column "date_published" should be of date type instead of character. However, for some movies, only the number of year is recorded on IMDb, so this column cannot be directly converted into date type by "as.Date" function. Such conversion will be conducted later in the part "Scan I" after these input inconsistencies are solved. 

*Column "genre" is of correct data type under current condition, where each movie can have more than one genre. However, if in some other cases where one movie can only have one genre, then this column can be considered as factor type as well.

*Column "duration" is of correct data type, since lengths of movies are expressed as integers.

*Column "production_company" is of correct data type, since names of the production companies (or studios) consist of characters.

*Column "budget" is of correct data type, because budget of each movie is recorded in number values together with corresponding currencies labelled in front of the number. 

*Column "worlwide_gross_income" is of correct data type, since gross income is recorded in number values together with corresponding currencies labelled in front of the number.

*Column "total_votes" is of correct data type, because total number of votes of each movie is denoted by integers.

*Column "mean_vote" is of correct data type, since the average rating of each movie is expressed as a floating number over a scale of 0-10 with 2 significant numbers, where 0 indicates the movie is least favoured and 10 indicates the movie is most favoured.

```{r}
str(movie_principal_people)
movie_principal_people$ordering <- factor(movie_principal_people$ordering, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), ordered=TRUE)
movie_principal_people$category <- as.factor(movie_principal_people$category)
```
By applying "str" function, 6 variables are found in data frame "movie_principal_people". Data types of each column are automatically "guessed" by this function, as shown above.

*Both "imdb_title_id" and "title" are of correct data types, as discussed before.

*Column "ordering" should be of factor type, because each movie can have at most 10 most important cast members, so values under this column can only be one of the 10 integers (from 1 to 10). Moreover, these 10 factors are ordinal and they should be ordered as 1<2<3<4<5<6>7<8<9<10, where 1 means the most important cast member and 10 means the least important cast member.

*Column "imdb_name_id" is of correct data type, since identifier of each cast member is a string of characters and numbers.

*Column "name" is of correct data type, because each cast member's full name consists of only characters.

*Column "category" should be of factor type, since there are only 12 distinct values under this column: "actor", "actress", "archive_footage", "archive_sound", "cinematographer", "composer", "director", "editor", "producer", "production_designer", "self" and "writer". Notice that there is no order between these 12 factors.
```{r}
str(movie_principal_people)
```
Now every column in "movie_principal_people" is of correct data types.


##	Tidy & Manipulate Data I 
Although the data frame "movie_principal_people" seems to be tidy on its own, it is considered as untidy when people need to join it with "movie_rate". Suppose that we want to figure out the 3 most important cast members of each movie, combining "movie_rate" and "movie_principal_people" directly will result in information of the same movie being repeated several times in different rows, namely information redundancy. Hence, before joining these two data frames, "movie_principal_people" should first be "spreaded" into wide format:
```{r}
movie_principal_people <- movie_principal_people %>% unite(name_category, "name", "category", sep = " (")
movie_principal_people$name_category <- paste0(movie_principal_people$name_category, ")")
```
In order to employ the "spread" function correctly, the two columns, "name" and "category", should be combined together to prevent any errors. This is done by the assistance from "unite" function, and the new column is named as "name_category". The separator being used here is " (", so that after pasting a ")" to each row of the new column "name_category", the role of each cast member will be nicely demostrated inside a bracket next to their full names, as shown below.
```{r}
head(movie_principal_people)
```
From now on, the "spread" function can be smoothly operated over the data frame "movie_principal_people".
```{r}
important_cast_3 <- movie_principal_people %>% filter(ordering <= 3) %>% select(imdb_title_id, ordering, name_category) %>% spread(key = ordering, value = name_category)
colnames(important_cast_3) <- c("imdb_title_id", "first", "second", "third")
```
Since only the 3 most important cast members are going to be added to "movie_rate" data frame, "filter" function has been applied to select the rows with ordering equals to either 1, 2 or 3 from "movie_principal_people". Also, since titles of each movie already exist in the "movie_rate" data frame, only the columns "imdb_title_id", "ordering" and "name_category" are chosen by "select" function. To "spread" the data frame, the key column should be "ordering" as it contains variable names, and the value column is "name_category" as it contains values forming multiple variables. The new data frame after spreading is temporarily called "important_cast_3", and its columns are renamed as "imdb_title_id", "first", "second" and "third". The first few rows are shown below:
```{r}
head(important_cast_3)
```
Now that each movie only appears in one row, it can be joined to "movie_rate".
```{r}
movie_rate_3principals <- movie_rate %>% left_join(important_cast_3, by = "imdb_title_id")
movie_rate_principals <- movie_rate_3principals %>% unite(principal_casts, "first", "second", "third", sep = ", ", na.rm = TRUE)
```
"movie_rate" and "important_cast_3" are combined together through their common column, "imdb_title_id". Here function "left_join" is applied so that all the original information from "movie_rate" can be reserved. To make the whole data frame more compendious, we compress the three columns containing the 3 most important cast members into a single column called "principal_casts", by using "unite" function and skipping all the NA's (NA's exist since some movies have less then 3 most important cast members). The new data frame is named as "movie_rate_principals", and its first few rows are shown below:
```{r}
head(movie_rate_principals)
movie_rate_principals$principal_casts[movie_rate_principals$principal_casts == ""] <- "Unknown"
```
It is observed that some rows under the column "principal_casts" don't contain any information. This is because not all movies are recorded in the data frame "important_cast_3" (which originally came from the data frame "principal") in regard with information about their most important cast members. To make this column more straightforward (and less confusing), replace all the empty values by "Unknown". The final form of data frame "movie_rate_principals" is shown below:
```{r}
head(movie_rate_principals)
```
Notice that at this stage, all the combinations of data frames have been completed. "movie_rate_principals" will be used throughout the rest of this assignment.


##	Scan I 
In this part, the presence of any missing value, special value and input inconsistency will be investigated throughout data frame "movie_rate_principals".
```{r}
colSums(is.na(movie_rate_principals))
```
By checking number of NA's of each column in "movie_rate_principals", it is found that only "production_company", "budget" and "worlwide_gross_income" contain missing values. There are several reasons that the rows having missing values should not be excluded from the data frame. First of all, all three columns mentioned previously possess considerably large propotion of NA's (way larger than 5% - the recommended proportion for excluding missing data), especially for "budget" and "worlwide_gross_income" in which the number of missing values is more than half of number of observations (rows). Removing these rows will certainly result in losing too much significant information. 

On the other hand, it is rather rational that these columns containing such large number of missing values. In the U.S., movie production has been industrialised & standardised for nearly a century, thus most movies originated in America are produced by specific production companies. However, in some other countries, movie production industry has only been established for a very short time, it might be relatively difficult for websites such as IMDb to trace back on the information about those production companies. This is why the column "production_company" possesses so many missing values.

For the other two columns, "budget" & "worlwide_gross_income", the reason is pretty similar as well. Immature movie production industry makes it troublesome for people to record these values. In other scenarios, some production companies might be unwilling to publish their budget and gross income of certain movies, since a lot of "trading secrets" are involved in such information.

Nevertheless, missing values in "production_company" can be handled by replacing the NA's with "Unknown", so that people can understand that they are not recorded:
```{r}
movie_rate_principals$production_company[is.na(movie_rate_principals$production_company)] <- "Unknown"
```
Function "is.na" is used to locate the positions of NA's.

At this stage, missing values in "budget" and "worlwide_gross_income" are still left there, these two columns have too many NA's so substituting column mean/median will be unmeaningful. Moreover, there also exist some input inconsistancies among them which will be discussed later.

```{r}
sapply(movie_rate_principals, function(x) sum(is.infinite(x)))
sapply(movie_rate_principals, function(x) sum(is.nan(x)))
```
By applying "is.infinite" and "is.nan" functions, no special value is found throughout data frame "movie_rate_principals".

However, there still exist some input inconsistancies in several columns. In "date_published", most values are recorded in th form "YYYY-MM-DD", whereas some other rows only contain the value of year. This situation is solved as shown below:
```{r, message = FALSE, error=FALSE, warning=FALSE}
movie_rate_principals <- movie_rate_principals %>% separate(date_published, into = c("year_published", "month_published", "day_published"), sep = "-")
movie_rate_principals$month_published[is.na(movie_rate_principals$month_published)] <- 01
movie_rate_principals$day_published[is.na(movie_rate_principals$day_published)] <- 01
movie_rate_principals <- movie_rate_principals %>% unite(date_published, "year_published", "month_published", "day_published", sep = "-")
movie_rate_principals$date_published <- as.Date(movie_rate_principals$date_published)
```
For the first step, "date_published" is temporarily seperated into three columns, each contains the value of year, month and day respectively. By doing so, those rows that miss the value of month and day can be located by "is.na" function. To make it easier to convert column "date_published" into date data type, a not-so-justifiable assumption has been made on those movies (rows) only contatining year value, that is they are all released on the 1st of January. I chose to do so because in the succeeding parts of this assignment, only the value of year will be utilised for analysis, and it will be easier to extract year value from a date-type column by using "lubridate" package. However, in other scenarios where values of month and day are also required for analysis, such assumption should never be made.

To implememnt the above assumption, replace NA's in month and day columns by "01", then combine the temporary columns "year_published", "month_published" and "day_published" back into "date_published" by using "unite" function. After this, "date_published" can be converted to date type through "as.Date" function. Now that every vairable in data frame "movie_rate_principals" is of correct data types, as shown below:
```{r}
str(movie_rate_principals)
```

Inconsistancies also exist in the columns "budget" and "worlwide_gross_income", where the strings consist of currency labels followed by the actual numeric values. Among them, most currencies are denoted by three capital letters, with the exception that U.S. Dollars is denoted by the symbol "$". Such inconsistancy is handled as follows: 
```{r}
movie_rate_principals <- movie_rate_principals %>% separate(budget, into = c("budget_unit", "budget_value"), sep = " ")
movie_rate_principals <- movie_rate_principals %>% separate(worlwide_gross_income, into = c("gross_income_unit", "gross_income_value"), sep = " ")
movie_rate_principals$budget_unit[movie_rate_principals$budget_unit == "$"] <- "USD"
movie_rate_principals$gross_income_unit[movie_rate_principals$gross_income_unit == "$"] <- "USD"
```
Notice that there is always a space (" ") between currency labels and numeric values. Based on this special format created by IMDb, we can separate currency labels and numeric values into two columns simply by using "separate" function. During this step, currency columns are named as "%_unit", and numeric columns are named as "%_value". Now that all the "$" symbols in currency columns can be replaced by a more formal format "USD", just like all the other currencies. Another benefit of separating currency labels and numeric values is that we can now perform mathematical calculations over budget and gross income for each movie. The current data types of columns in the data frame "movie_rate_principals" is shown below:
```{r}
str(movie_rate_principals)
```


##	Tidy & Manipulate Data II 
In real life, people usually reference to the box office of a movie to determine its comprehensive performance. Although the actual quality of a movie is not always related to its box office, it worths the effort to investigate on how much profit a movie can produce. In this part, profit and profit rate of each movie will be calculated and added to the right of data frame "movie_rate_principals", as follows:
```{r}
movie_rate_principals$budget_value <- as.numeric(movie_rate_principals$budget_value)
movie_rate_principals$gross_income_value <- as.numeric(movie_rate_principals$gross_income_value)
movie_rate_principals_profit <- movie_rate_principals %>% filter(budget_unit == gross_income_unit) %>% 
  mutate(profit = gross_income_value - budget_value, profit_rate = profit/budget_value)
```
Before any calculation, the two columns "budget_value" and "gross_income_value" shoud first be converted into numeric types. Besides, it is only meaningful to calculate profit and profit rate when a movie's budget and gross income are recorded in the same currency. The "filter" function is applied to filter out those rows that qualify such condition. Profit is calculated by budget subtracted from gross income, and profit rate equals to quotient of profit over budget. Notice that a negative profit/rate indicates the movie has lost money, while a positive profit/rate indicates the movie has actually made profit. These two new variables can be added to the right of data frame "movie_rate_principals" by using "mutate" function. The new data frame after mutation is renamed as "movie_rate_principals_profit", and its first few rows are shown below: 
```{r}
head(movie_rate_principals_profit)
```


##	Scan II
In this part, each column from data frame "movie_rate_principals_profit" which contains numeric values will be checked whether outliers exist or not, with the assistance of Box Plot (a.k.a. Tukey’s method of outlier detection), Histogram and their corresponding z-scores (only when the values are approximately normally distributed, which is implied by a nearly symmetric histogram).
```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
movie_rate_principals_profit$duration %>% boxplot(main="Box Plot of Movies' Duration", ylab = "Duration (in minutes)")
hist(movie_rate_principals_profit$duration, main = "Histogram of Movies' Duration")
```
According to Box Plot, there exist a considerable amount of outliers under column "duration", all of them are located above the maximum calculated by Tukey's Method. This is because, although most movies are maintained between 90-120 minutes for the favour of audience's patience, there is no rule to limit a movie's length and some directors may have a lot more content to be presented in their movies than others, which is pretty reasonable. On the other hand, as the shortest movie last more than 50 minutes, no outlier is found under the minimum. Since the histogram is dramatically skewed to the right, z-score test is not appropriate here.

```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
movie_rate_principals_profit$budget_value %>% boxplot(main="Box Plot of Movies' Budget", ylab = "Budget")
hist(movie_rate_principals_profit$budget_value, main = "Histogram of Movies' Budget")
```
According to Box Plot, there exist a considerable amount of outliers under column "budget_value", all of them are located above the maximum calculated by Tukey's Method. This is because no rules are made upon maximum amount of budget a production company can put in their movies. In fact, it is pretty common that Sci-Fi movies cost way more budget than other genres. Besides, due to economic inflation, money value nowadays is way different than in the past, which will also result in budget value keeping increasing as time goes by. On the other hand, as budget must be a positive value, a default lower bound is set above zero, thus no outlier is found under the minimum. Since the histogram is completely unsymmetric, z-score test is not appropriate here.

```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
movie_rate_principals_profit$gross_income_value %>% boxplot(main="Box Plot of Movies' Gross Income", ylab = "Gross Income")
hist(movie_rate_principals_profit$gross_income_value, main = "Histogram of Movies' Gross Income")
```
According to Box Plot, there exist a considerable amount of outliers under column "gross_income_value", all of them are located above the maximum calculated by Tukey's Method. Again, this is because no upper bound is predefined on how much gross income a movie can make. If a particular movie is outstandingly favoured by the audience, then its gross income will upsurprisingly reach a new peak. Besides, due to economic inflation, money value nowadays is way different than in the past, which will also result in gross income value keeping increasing as time goes by. On the other hand, as gross income must be a positive value, a default lower bound is set above zero, thus no outlier is found under the minimum. Since the histogram is nowhere close to be symmetric, z-score test is not appropriate here.

```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
movie_rate_principals_profit$total_votes %>% boxplot(main="Box Plot of Movies' Number of Votes", ylab = "Number of Votes")
hist(movie_rate_principals_profit$total_votes, main = "Histogram of Movies' Number of Votes")
```
According to Box Plot, there exist a considerable amount of outliers under column "total_votes", all of them are located above the maximum calculated by Tukey's Method. This is because there is no upper bound to prevent audience from voting. If a particular movie is outstandingly favoured, or detested, by the audience, then it will be undoubtfully voted way more times than other movies. On the other hand, as number of votes cannot be negative, a default lower bound is set at zero, thus no outlier is found under the minimum. Since the histogram is dramatically skewed to the right, z-score test is not appropriate here.

```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
movie_rate_principals_profit$mean_vote %>% boxplot(main="Box Plot of Movies' Average Rating", ylab = "Average Rating")
hist(movie_rate_principals_profit$mean_vote, main = "Histogram of Movies' Average Rating")
z.scores <- movie_rate_principals_profit$mean_vote %>%  scores(type = "z")
z.scores %>% summary()
which( abs(z.scores) >3 )
length(which( abs(z.scores) >3 ))
```
According to Box Plot, there exist a considerable amount of outliers under column "mean_vote", most of them are located under the minimum calculated by Tukey's Method, while some are above the maximum as well. This turns out to be very different from previous cases. The reason behind is that rating of a movie is set over the scale of 0-10, i.e. there exist a predifined lower bound at 0 and upper bound at 10. While most movies receive average ratings between 4 and 9, outliers under the minimum are apparently more than those above the maxium, because it is reasonably harder for a movie to gain audience's praise than disapproval. It is observed that the histogram is slightly skewed to the left, but still pretty close to be symmetric. Thus z-score test can be performed in this case. Based on the result, 75 outliers are identified by z-score test.

```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
movie_rate_principals_profit$profit %>% boxplot(main="Box Plot of Movies' Profit", ylab = "Profit")
hist(movie_rate_principals_profit$profit, main = "Histogram of Movies' Profit")
```
According to Box Plot, there exist a considerable amount of outliers under column "profit", most of them are located above the maximum calculated by Tukey's Method, while some are below the minimum as well. Notice that there is neither predefined upper bound on how much profit a movie can make, nor predefined lower bound on how much money a movie can lose. Outliers above the maximum are apparently more than those under the minimum, because as a well-known recreational product, it is reasonably common that a movie is able to make certain amount of profit, unless it is extremely unpopular or with extremly low quality. Since the histogram is completely unsymmetric, z-score test is not appropriate here.

```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
movie_rate_principals_profit$profit_rate %>% boxplot(main="Box Plot of Movies' Profit Rate", ylab = "Profit Rate")
hist(movie_rate_principals_profit$profit_rate, main = "Histogram of Movies' Profit Rate")
z.scores2 <- movie_rate_principals_profit$profit_rate %>%  scores(type = "z")
z.scores2 %>% summary()
which( abs(z.scores2) >3 )
length(which( abs(z.scores2) >3 ))
```
According to Box Plot, there only exist a small number of outliers under column "profit_rate", a few of them are located above the maximum calculated by Tukey's Method and others are below the minimum. Similarly to "profit", there is neither predefined upper bound nor predefined lower bound on possible profit rate. Except those values in upper tail, the histogram turns out to be nearly symmetric, thus the z-score test can be performed in this case. Based on its result, only 2 outliers are found.

In summary, outliers in columns "duration", "budget_value",  "gross_income_value", "total_votes", "mean_vote" and "profit" should not be excluded, since their numbers are considerably large compared to the number of observations, i.e. they have taken up a large proportion of samples. They shoud neither be handled by imputation or capping, since none of these outliers mentioned above is caused by data entry/processing errors. Instead, these outliers can be "eliminated" by transformation so that their original values are "maintained" (which means people can achieve their original values by simply inversing the transformation) and at the same time, further statistical analysis can be applied over these variables. This will be discussed in the next part "Transform".

However, outliers in the column "profit_rate" might be excluded upon demand, especially those 2 found by z-score test as they are relatively small in numbers compared to the total number of oberservations (which is 8908). Such elimination can be down as follows:
```{r}
profit_rate_clean <- movie_rate_principals_profit$profit_rate[-which(abs(z.scores2) >3)]
```

Aside from univariate outliers, bivariate outliers may also exist in the data frame "movie_rate_principals_profit". Suppose that people are interested in the relationship between budget and gross income of movies released between 2000 and 2010, then scatter plot and the Mahalanobis distance can be used to detect outliers based on budget gross income, as shown below:
```{r, message = FALSE, error=FALSE, warning=FALSE}
par(mfrow=c(1,2))
gross_income_budget <- movie_rate_principals_profit %>% filter(year(date_published)<=2010, year(date_published)>=2000) %>% select(gross_income_value, budget_value)
gross_income_budget %>% plot(gross_income_value ~ budget_value, data = ., ylab = "Gross Income", xlab = "Budget", main = "Gross Income VS Budget")
results <- mvn(data = gross_income_budget, multivariateOutlierMethod = "quan", showOutliers = TRUE)
```
Before plotting, movies released between 2000 and 2010 should first be filtered out by applying "filter" function. Next, the columns "gross_income_value" and  "budget_value" which are useful for plotting can be extracted and saved into a new data frame called "gross_income_budget". To get the scatter plot, "plot" function is applied where y-axis represents gross income and x-axis represents budget. To acquire Mahalanobis distance, "mvn" function from the MVN package is applied and the method being used here is chi-square distribution critical value, as mentioned in the argument by "quan".

In the scatter plot, it is observed that for most movies, small gross income is associated with small budget and large gross income is associated with large budget. However, there do exist some outliers located in bottom-right and top-right corners. In Mahalanobis distance, it is found that outliers have taken up nearly half of the number of observations. Thus they should not be excluded. Besides, they shoud neither be handled by imputation or capping, since none of them is caused by data entry/processing errors.


##	Transform 
As mentioned in previous part, most of the variables in data frame "movie_rate_principals_profit" have their distributions skewed to the right, which is not preferred by statistical inference. In order to achieve a symmetric (normal) distribution, we can seek assistance from transformation. In the following, the coulmn "budget_value" will be used as an exmample:
```{r}
par(mfrow=c(1,2))
hist(log(movie_rate_principals_profit$budget_value), main = "Histogram of log(budget value)")
bc_budget <- BoxCox(movie_rate_principals_profit$budget_value, lambda = "auto")
hist(bc_budget, main = "Histogram of Box-Cox Transformed Budget Value")
```
Since all values under the column "budget_value" are positive, logarithm transformation can be applied to reduce the right skewness. This can be simply implemented by using "log" function. Another approach to achieve a symmetric (normal) distribution is the BoxCox Transformation. This is done by using "BoxCox" function from the "forecast" package. As shown above, both transformations have behaved pretty well in turning the distribution into a nearly normal one. Notice that logarithm transformation can only be applied when all the values are positive. If there exist any zero or negative values (such as in "profit" and "profit_rate"), then people should turn to reciprocal (or BoxCox) transformation to reduce right skewness. 

Across all the numeric variables in data frame "movie_rate_principals_profit", it is also observed that they are set on completely different ranges to each other. Such condition might have serious impact on statistical anlysis or machine learning prediction accuracy. In order to reduce such impact, all these numeric variables should be normalised into similar scales, which can be done by z-score standardisation as follows:
```{r, message = FALSE, error=FALSE, warning=FALSE}
z_movie_rate_principals_profit <- movie_rate_principals_profit %>% select(duration, budget_value, gross_income_value, total_votes, mean_vote, profit, profit_rate) %>% scale(center = TRUE, scale = TRUE)
attributes(z_movie_rate_principals_profit)
```
Firsly, all 7 numeric variables should be extracted from data frame "movie_rate_principals_profit". Then z-score standardisation can be implemented by the "scale" fucntion with both arguments "center" and "scale" setting as TRUE.
<br>
<br>
