Executive Summary

This assignment demonstrates the key steps in data pro-processing as studied in this subject. Data processing techniques are applied to Movie data from the IMDB data set is used to demonstrate these steps. Firstly, the required packages are loaded into and then the data is read into R as data frames from CSV files based on the local computer. This data is then checked and with data pre processing methods and then selected features across a number tables are merged to apply further pre-processing and reading the data for further analysis. Outliers are identified and addressed using Tukeys univariate methods. Following this, a range of transformations are applied to the selected columns in the new data set to further improve and stabilise the data for further analysis.

Required packages

The following section loads the required packages for this assignment.

# Load the required packages.
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(editrules)
library(deductive)
library(validate)
library(forecast)
# Clear any environmnet variables. 
rm(list = ls())

Data

The data sets used in the this assignment is an extract from the IMDB movie data base was sourced from the Relational Data Set Repository website (https://relational.fit.cvut.cz/dataset/IMDb). It consists of a number of tables that form a relational database. As had some issues connecting directly to the database from R (slowness etc) for the purposes of obtaining the data for my assignment, I connected to the source repository with a SQL client (MySQL Workbench) and extracted the tables as CSV files to my local machine and then imported them into R. The files are mentioned in the section below. Whilst the dataset is generally of good quality and the required checks for data quality and consistency was still applied.

Read the data

The following codes reads in the initial dataset from CSV files.

# Read in the IBMDB dataset tables as CSV files to Dataframes
actors_df <- data.frame(read.csv("actors.csv", sep = ';'))
directors_df <- data.frame(read.csv("directors.csv", sep = ';'))
movies_df <- data.frame(read.csv("movies.csv", sep = ';'))
movies2actors_df <- data.frame(read.csv("movies2actors.csv", sep = ';'))
movies2directors_df <- data.frame(read.csv("movies2directors.csv", sep = ';'))
ratings_df <- data.frame(read.csv("ratings.csv", sep = ';'))
runningtimes_df <- data.frame(read.csv("runningtimes.csv", sep = ';'))
movies_genres_df <- data.frame(read.csv("movie_genres.csv", sep = ';'))
# Create a list of dataframes consisting of the tables read to data frames in the above code.
IMDB_Dataframes <- list(actors_df, directors_df, movies_df, movies2actors_df, movies2directors_df, ratings_df,runningtimes_df)

Check the data.

This section performs some rudimentary checks of initial data set to ensure that it they are readable and useable by R. A for loop is used here to be efficient with code and addressing repetitive task with respected to variables that are repeatedly used.

# Loop through data sets and display head obserrvations in each. 
for (data_set in IMDB_Dataframes){
  print(head(data_set))
}

The following section of code checks that each table has been read in as a data frame.

# Check if individual data sets are of class data frame. 
for (data_set_name in IMDB_Dataframes){
  #print("Start of  =========================", data_set_name)
  print(class(data_set_name))
}
[1] "data.frame"
[1] "data.frame"
[1] "data.frame"
[1] "data.frame"
[1] "data.frame"
[1] "data.frame"
[1] "data.frame"

Understand

This section summarises the types of variables and data structures, checks the attributes in the data set and applies data type conversions as required. As I am creating a new, merged data set, only the variables in this new data set are subject type conversion, cleaning etc.

# Loop through data sets and check the structure, indicating the variable names and types. 
for (data_set in IMDB_Dataframes){
  print(str(data_set))
}
'data.frame':   98690 obs. of  3 variables:
 $ actorid: int  4 16 28 566 580 636 769 797 802 808 ...
 $ name   : Factor w/ 98686 levels "'Bear'Boyd, Steven",..: 3 1 2 6 7 9 18 19 20 21 ...
 $ sex    : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
NULL
'data.frame':   2201 obs. of  5 variables:
 $ directorid: int  67 92 284 708 746 836 947 1151 1756 1836 ...
 $ name      : Factor w/ 2201 levels "Aalmuhammed, Jefri",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ rate      : num  7 4.3 6.8 6.8 26.2 3 6 6.1 4.1 5 ...
 $ gross     : num  4.30e+04 1.28e+07 0.00 3.30e+04 2.13e+08 ...
 $ num       : int  1 1 1 1 4 1 1 1 1 1 ...
NULL
'data.frame':   3832 obs. of  3 variables:
 $ movieid: int  1672052 1672111 1672580 1672716 1672946 1673647 1673658 1673848 1674388 1674737 ...
 $ title  : Factor w/ 3832 levels "¡Three Amigos! (1986)",..: 3 4 6 5 2 7 8 9 10 11 ...
 $ year   : int  1980 1988 1997 1986 2014 1961 1996 1999 1996 1957 ...
NULL
'data.frame':   138349 obs. of  4 variables:
 $ movieid     : int  1672052 1672052 1672052 1672052 1672052 1672052 1672052 1672052 1672052 1672052 ...
 $ actorid     : int  88796 88848 121878 134348 198682 254523 291369 344412 417547 501765 ...
 $ as_character: Factor w/ 123599 levels "","(also archive footage)  [Babe' Hanson]  <5>",..: 36628 34625 73808 70087 73796 73114 25299 7897 29587 73710 ...
 $ leading     : int  20 7 21 41 4 13 22 23 6 11 ...
NULL
'data.frame':   4141 obs. of  3 variables:
 $ movieid   : int  1672052 1672111 1672580 1672716 1672946 1673647 1673658 1673848 1674388 1674737 ...
 $ directorid: int  22397 54934 297253 188926 188940 302682 155385 133605 115990 164962 ...
 $ genre     : Factor w/ 24 levels "","Action","Adult",..: 10 2 7 10 2 10 7 7 4 10 ...
NULL
'data.frame':   3773 obs. of  4 variables:
 $ movieid     : int  1672052 1672111 1672580 1672716 1673647 1673658 1673848 1674388 1674737 1677011 ...
 $ rank        : num  7.8 5.3 4.4 7 6.5 3.9 7 5.5 8.9 6.4 ...
 $ votes       : int  8111 32183 1894 1255 128 20 137748 47380 245366 12576 ...
 $ distribution: Factor w/ 369 levels "","....211..5",..: 100 222 230 97 171 349 136 222 85 151 ...
NULL
'data.frame':   5086 obs. of  4 variables:
 $ movieid : int  1672052 1672111 1672580 1672716 1673647 1673647 1673658 1673848 1674388 1674737 ...
 $ time    : Factor w/ 836 levels "(None):111","(None):236",..: 12 13 18 177 47 496 179 178 8 177 ...
 $ addition: Factor w/ 219 levels "","(10th Anniversary edition)",..: 1 1 1 1 1 89 1 1 1 1 ...
 $ time1   : int  107 108 113 96 140 125 98 97 103 96 ...
NULL

Create new merged data set

The IMDB custom data set (IMDB_Custom) is created from merging a number of data frames, collating the relevant information for this assignment. The common variable between these merged dataset is column movieid. This operation is a right join and is not prone to resulting in missing values and preserves existing information. The data frames that are amalgamated for my analysis is movies2driectors, movies, ratings and running times. Once this final set was constructed, unwanted variables and duplicates were removed to form the final data set IMDB_Custom for pre-processing.

The following code renames the movie_id variable to be consistent with the other tables to support merger and analysis with other tables.

## use dplyer to align the movie id variable name to support merging with other tables. 
#Renaming columns with dplyr::rename()
movies_genres_df <- movies_genres_df %>%  rename(movieid = movie_id)

The following code mergers a number movies_genres_df and movies_df to the movies data frame.

# Join movies and directors
moviesNdirectors <- merge(movies2directors_df, movies_df,by="movieid")
# Now add more data that will be useful for dataset.
moviesNdirectorsNratings <- merge(moviesNdirectors, ratings_df,by="movieid")
# add data from running times
moviesNdirectorsNratingsNTimes <- merge(moviesNdirectorsNratings, runningtimes_df,by="movieid")
# Rename to a customised table.
IMDB_Custom <-moviesNdirectorsNratingsNTimes
# Remove unwanted columns. 
IMDB_Custom$time <- NULL
IMDB_Custom$directorid <- NULL
IMDB_Custom$addition <- NULL
# Inspect the final data set. 
head(IMDB_Custom)

Inspect and Understand

This section inspects the data frame and variables using R functions. The following code chunk checks the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. The str() function is used which provides the detailed structure within the data set. It also provides details with regards to the number of variables and observations. Data types are also changed to support the pre-processing and analysis.

# Check the dimensions fo the DF
dim(IMDB_Custom)
[1] 5544    8
# list the column names and change where appropriate.
str(IMDB_Custom)
'data.frame':   5544 obs. of  8 variables:
 $ movieid     : int  1672052 1672111 1672580 1672716 1673647 1673647 1673658 1673848 1674388 1674737 ...
 $ genre       : Factor w/ 24 levels "","Action","Adult",..: 10 2 7 10 10 10 7 7 4 10 ...
 $ title       : Factor w/ 3832 levels "¡Three Amigos! (1986)",..: 3 4 6 5 7 7 8 9 10 11 ...
 $ year        : int  1980 1988 1997 1986 1961 1961 1996 1999 1996 1957 ...
 $ rank        : num  7.8 5.3 4.4 7 6.5 6.5 3.9 7 5.5 8.9 ...
 $ votes       : int  8111 32183 1894 1255 128 128 20 137748 47380 245366 ...
 $ distribution: Factor w/ 369 levels "","....211..5",..: 100 222 230 97 171 171 349 136 222 85 ...
 $ time1       : int  107 108 113 96 140 125 98 97 103 96 ...

The following section of code isnspect and Correct varibales

#Rename columns with dplyr::rename() to provide some context to the dataset. 
# Rename the variable names to meaningful ones
# rank to movie_ranking
IMDB_Custom<- IMDB_Custom%>%rename(movie_ranking=rank)
# votes to movie_votes
IMDB_Custom<- IMDB_Custom%>%rename(movie_votes=votes)
# name to time1 to running time. 
IMDB_Custom<- IMDB_Custom%>%rename(running_time=time1)
# rate to director_rating
# The following variable types were corrected and changed to support the analysis. 
# Chnage the rank variable from nuemrical to ordinal
IMDB_Custom$movie_ranking <- as.numeric(IMDB_Custom$movie_ranking)
# Change title to string
IMDB_Custom$title <-as.character(IMDB_Custom$title)
# Change distributio from factor to ordinal
IMDB_Custom$distribution<-as.ordered(IMDB_Custom$distribution)

Recheck the structure and confirm the changes have been applied.

str(IMDB_Custom)
'data.frame':   5544 obs. of  8 variables:
 $ movieid      : int  1672052 1672111 1672580 1672716 1673647 1673647 1673658 1673848 1674388 1674737 ...
 $ genre        : Factor w/ 24 levels "","Action","Adult",..: 10 2 7 10 10 10 7 7 4 10 ...
 $ title        : chr  "'Breaker' Morant (1980)" "'Crocodile' Dundee II (1988)" "'Til There Was You (1997)" "'night, Mother (1986)" ...
 $ year         : int  1980 1988 1997 1986 1961 1961 1996 1999 1996 1957 ...
 $ movie_ranking: num  7.8 5.3 4.4 7 6.5 6.5 3.9 7 5.5 8.9 ...
 $ movie_votes  : int  8111 32183 1894 1255 128 128 20 137748 47380 245366 ...
 $ distribution : Ord.factor w/ 352 levels ""<"....211..5"<..: 90 211 219 87 160 160 332 126 211 75 ...
 $ running_time : int  107 108 113 96 140 125 98 97 103 96 ...

As the initial mergers may have resulted in some duplicate values, remove these values and finalise the data set.

# Finally remove any duplicate rows
IMDB_Custom <- distinct(IMDB_Custom)
dim((IMDB_Custom))
[1] 4916    8

Tidy & Manipulate Data I

The data set I have chosen and the merger of relevant tables resulting in the final table, IMDB_Custom already conforms the tidy data principle and no reshaping of my data into a tidy format is required.

Tidy & Manipulate Data II

The following code uses the mutate function Create/mutate the running time variable to add the column running_time_seconds.

# Create a new variable for running time in seconds b
IMDB_Custom <- IMDB_Custom %>% mutate(running_time_seconds = running_time*60)
head(IMDB_Custom)

Scan I

The following sections of code scan the data for missing values (NaNs and NAs) and any inconsistencies and obvious errors.

# 
sum(is.na(IMDB_Custom$movieid))
[1] 0
sum(is.na(IMDB_Custom$genre))
[1] 0
sum(is.na(IMDB_Custom$title))
[1] 0
sum(is.na(IMDB_Custom$year))
[1] 0
sum(is.na(IMDB_Custom$movie_ranking))
[1] 0
sum(is.na(IMDB_Custom$movie_votes))
[1] 0
sum(is.na(IMDB_Custom$distribution))      
[1] 0
sum(is.na(IMDB_Custom$running_time))
[1] 0
sum(is.nan(IMDB_Custom$movieid))
[1] 0
sum(is.nan(IMDB_Custom$genre))
[1] 0
sum(is.nan(IMDB_Custom$title))
[1] 0
sum(is.nan(IMDB_Custom$year))
[1] 0
sum(is.nan(IMDB_Custom$movie_ranking))
[1] 0
sum(is.nan(IMDB_Custom$movie_votes))
[1] 0
sum(is.nan(IMDB_Custom$distribution))
[1] 0
sum(is.nan(IMDB_Custom$running_time))
[1] 0

Scan II

The following sections of code scan and address outliers and uses Tukey’s univariate method of outlier detection is used to detect outliers in the box plots.

Display Box Plots - Pre Outlier resolution

First a box plot is created to provide any visualisation of outliers. As seen in the boxplots below there are a number of outliers for every variable and this may be addressed and improved with Tukeys method.

# create a collection of boxplots of the selected variables showing the outliers. 
par(mfrow=c(2,2))
boxplot(IMDB_Custom$year, main="\nMovie Year", ylab="Year", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="\nRunning Time", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$movie_votes, main="Movie Running Votes", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="Movie Running Ranking", ylab="Minutes", col="blue", pars=list(outcol="red"))
mtext("Box Plots for Variables pre Transformation", side = 3, line = -1,outer = TRUE)

Check outliers using z-score and respective adjustments

Use Tukeys method for each of the variables to detect and address the outliers. ### Movie Year Variable

# Check for outliers using the z-score
z.scores <-  IMDB_Custom$year %>% scores(type = "z")
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.0320 -0.2399  0.4644  0.0000  0.6811  1.3853 

There are 86 values that can be considered outliers

# Check the number of values that fall outside the range. .
length (which( abs(z.scores) >3 ))
[1] 86
# Replace the outliers with median values
IMDB_Custom$year[which( abs(z.scores) >3 )] <- median(IMDB_Custom$year)

The values outside the range have now been reduced to 74

# Recheck the z scores after the imputations to confirm the outlier have been addressed. 
z.scores <-  IMDB_Custom$year %>% scores(type = "z")
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.3330 -0.2768  0.4423  0.0000  0.6820  1.4610 
length (which( abs(z.scores) >3 ))
[1] 74

Movie Running Time Variable

# Check for outliers using the z-score
z.scores <-  IMDB_Custom$running_time %>% scores(type = "z")
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.1359 -0.5594 -0.1530  0.0000  0.3753  9.1132 

There are 109 values that can be considered outliers.

# Check the number of values that fall outside the range. There are 109 values that can be considered outliers. 
length (which( abs(z.scores) >3 ))
[1] 109
# Replace the outliers with median values
IMDB_Custom$running_time[which( abs(z.scores) >3 )] <- median(IMDB_Custom$running_time)

The values outside the range have now been reduced to 71.

# Recheck the z scores after the imputations to confirm the outlier have been addressed. The values outside the range have now been reduced to 71.
z.scores <-  IMDB_Custom$running_time %>% scores(type = "z")
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.4406 -0.6590 -0.1241  0.0000  0.5178  3.9947 
length (which( abs(z.scores) >3 ))
[1] 71

Movie Votes Variable

# Check for outliers using the z-score
z.scores <-  IMDB_Custom$movie_votes %>% scores(type = "z")
z.scores %>% summary()
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.44796 -0.41869 -0.33531  0.00000 -0.03218 14.98511 

There are 99 values that can be considered outliers.

# Check the number of values that fall outside the range. There are 99 values that can be considered outliers.
length (which( abs(z.scores) >3 ))
[1] 99
# Replace the outliers with median values
IMDB_Custom$movie_votes[which( abs(z.scores) >3 )] <- median(IMDB_Custom$movie_votes)

The values outside the range have now been reduced to 152

# Recheck the z scores after the imputations to confirm the outlier have been addressed. There are 86 values that can be considered outliers. The values outside the range have now been reduced to 152
z.scores <-  IMDB_Custom$movie_votes %>% scores(type = "z")
z.scores %>% summary()
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.61902 -0.56426 -0.40830  0.00000  0.07881  5.78189 
length (which( abs(z.scores) >3 ))
[1] 152

Movie ranking

# Check for outliers using the z-score
z.scores <-  IMDB_Custom$movie_ranking %>% scores(type = "z")
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.0239 -0.5589  0.1340  0.0000  0.7404  2.4729 

There are 40 values that can be considered outliers.

# Check the number of values that fall outside the range. There are 40 values that can be considered outliers. 
length (which( abs(z.scores) >3 ))
[1] 40
# Replace the outliers with median values
IMDB_Custom$movie_ranking[which( abs(z.scores) >3 )] <- median(IMDB_Custom$movie_ranking)

The values outside the range have now been reduced to 13.

# Recheck the z scores after the imputations to confirm the outlier have been addressed. The values outside the range have now been reduced to 13.
z.scores <-  IMDB_Custom$movie_ranking %>% scores(type = "z")
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.1610 -0.6161  0.1110  0.0000  0.7473  2.5650 
length (which( abs(z.scores) >3 ))
[1] 13
Display Box Plots - Post Oulier resolution

The section below displays the variables after Tukeys univariate outlier detection and resolution methods have been applied. As can be seen there are considerable improvements from the initial set of plots.

par(mfrow=c(2,2))
boxplot(IMDB_Custom$year, main="\nMovie Year", ylab="Year", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="\nRunning Time", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$movie_votes, main="Movie Running Votes", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="Movie Running Ranking", ylab="Minutes", col="blue", pars=list(outcol="red"))
# mtext("My 'Collection of Box Plots for Variables", side = 3, line = -21, outer = TRUE)
mtext("Box Plots for Variables Post Outlier Resolution", side = 3, line = -1,outer = TRUE)

Transform

This section applies transformation to the selected variables by visualy analysing hte skew of the distribution and applying a relevant transoframtion. Please note I have applied a range of transformation to view the effefcts but have specified wich ones are relevant to based on the skew. The plots below display respective histograms of the variables of interest.

# Plot histograms on untransformed variables. 
par(mfrow=c(2,2))
hist(IMDB_Custom$year,main="Movie Year", xlab="Year", col="blue")
hist(IMDB_Custom$running_time,main="\nMovie Running Time", xlab="Running Time", col="blue")
hist(IMDB_Custom$movie_votes, main="\nMovie Votes",xlab="Movie Votes", col="blue")
hist(IMDB_Custom$movie_ranking,main="Movie Ranking",xlab="Movie Ranking", col="blue")
mtext("Histograms for Variables Pre Transformation", side = 3, line = -1,outer = TRUE)

Movie Year

Data for movie year time is strongley skewed to the left. Whilst we apply all transforamion to se the effect the most applicable transformation to left skewed data is square root and reciprocal transformation, other transformations including Box Cox with a lambda value of auto is applied. In this case the transformations had little effect in producing a symmetrical distribution.

# Movie Year Transformations - Left Skewed
# Square Transforamton
year_sq1_trn <- IMDB_Custom$year^2
# Box Cox
year_BC_trn <- BoxCox(IMDB_Custom$year, lambda = "auto")
#log transformation
year_log_trn <- log10(IMDB_Custom$year)
year_ln_trn <- log(IMDB_Custom$year)
# Reciprocal Transformation
year_recip_trn <- 1/IMDB_Custom$year
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(year_sq1_trn,main="\nSquare Root", xlab="Year", col="blue")
hist(year_BC_trn,main="\nBox Cox", xlab="Year", col="blue")
hist(year_log_trn,main="\nLog 10", xlab="Year", col="blue")
hist(year_ln_trn,main="\nNatural Log", xlab="Year", col="blue")
hist(year_recip_trn,main="\nReciprocal Log", xlab="Year", col="blue")
mtext("Histograms for Year Variables Post Transformation", side = 3, line = -1,outer = TRUE)

Movie Running Time

Data for movie running times apper to be reasonably symmetrical, lets see if it can be improved. Again, all transforamion are applied to see the effect. Again none of the transformations applied had any strong impact on improving the distrubution.

# Movie Year Transformations - Left Skewed
# Square Transforamton
running_time_sq1_trn <- IMDB_Custom$running_time^2
# Box Cox
running_time_BC_trn <- BoxCox(IMDB_Custom$running_time, lambda = "auto")
#log transformation
running_time_log_trn <- log10(IMDB_Custom$running_time)
running_time_ln_trn <- log(IMDB_Custom$running_time)
# Reciprocal Transformation
running_time_recip_trn <- 1/IMDB_Custom$running_time
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(running_time_sq1_trn,main="\nSquare Root", xlab="Running Time", col="blue")
hist(running_time_BC_trn,main="\nBox Cox", xlab="Running Time", col="blue")
hist(running_time_log_trn,main="\nLog 10", xlab="Running Time", col="blue")
hist(running_time_ln_trn,main="\nNatural Log", xlab="Running Time", col="blue")
hist(running_time_recip_trn,main="\nReciprocal", xlab="Running Time", col="blue")
mtext("Histograms for Running Times Variables Post Transformation", side = 3, line = -1,outer = TRUE)

Movie Votes

Data for movie votes is very strongly skewed to the right. Whilst I have applied all transformation to see the effect the most applicable transformation to right skewed data is log, natural log and square root transformation. Here the both the log transformations applied appear to have a reasonable impact on improving the symmetry of distribution.

# Movie Votes Transformations - Left Skewed
# Square Transforamton
movie_votes_sq1_trn <- IMDB_Custom$movie_votes^2
# Box Cox
movie_votes_BC_trn <- BoxCox(IMDB_Custom$movie_votes, lambda = "auto")
#log transformation
movie_votes_log_trn <- log10(IMDB_Custom$movie_votes)
movie_votes_ln_trn <- log(IMDB_Custom$movie_votes)
# Reciprocal Transformation
movie_votes_recip_trn <- 1/IMDB_Custom$movie_votes
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(movie_votes_sq1_trn,main="\nSquare Root", xlab="Movie Votes", col="blue")
hist(movie_votes_BC_trn,main="\nBox Cox", xlab="Movie Votes", col="blue")
hist(movie_votes_log_trn,main="\nLog 10", xlab="Movie Votes", col="blue")
hist(movie_votes_ln_trn,main="\nNatural Log", xlab="Movie Votes", col="blue")
hist(movie_votes_recip_trn,main="\nReciprocal", xlab="Movie Votes", col="blue")
mtext("Histograms for Movie Votess Variables Post Transformation", side = 3, line = -1,outer = TRUE)

Movie Ranking

Data for movie running times apper to be reasonably symmetrical, lets see if it can be improved. Again, all transforamion are applied to see the effect. As per the plots below Square root and Box Cox transformations appear to have a moderate impact on improving the symmetry of the distrubution.

# Movie Votes Transformations - Left Skewed
# Square Transforamton
movie_ranking_sq1_trn <- IMDB_Custom$movie_ranking^2
# Box Cox
movie_ranking_BC_trn <- BoxCox(IMDB_Custom$movie_ranking, lambda = "auto")
#log transformation
movie_ranking_log_trn <- log10(IMDB_Custom$movie_ranking)
movie_ranking_ln_trn <- log(IMDB_Custom$movie_ranking)
# Reciprocal Transformation
movie_ranking_recip_trn <- 1/IMDB_Custom$movie_ranking
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(movie_ranking_sq1_trn,main="\nSquare Root", xlab="Movie Votes", col="blue")
hist(movie_ranking_BC_trn,main="\nBox Cox", xlab="Movie Votes", col="blue")
hist(movie_ranking_log_trn,main="\nLog 10", xlab="Movie Votes", col="blue")
hist(movie_ranking_ln_trn,main="\nNatural Log", xlab="Movie Votes", col="blue")
hist(movie_ranking_recip_trn,main="\nReciprocal", xlab="Movie Votes", col="blue")
mtext("Histograms for Movie Votess Variables Post Transformation", side = 3, line = -1,outer = TRUE)

---
title: "MATH2349 Semester 1, 2019: Pre-processing of IMDB Data"
author: "KEVIN JOHNSON (INDIVIDUAL ASSIGNMENT), S3207343"
subtitle: Assignment 3
output:
  html_notebook: default
---




## Executive Summary 

This assignment demonstrates the key steps in data pro-processing as studied in this subject. Data processing techniques are applied to Movie data from the IMDB data set is used to demonstrate these steps. Firstly, the required packages are loaded into and then the data is read into R as data frames from CSV files based on the local computer. This data is then checked and with data pre processing methods and then selected features across a number tables are merged to apply further pre-processing and reading the data for further analysis. Outliers are identified and addressed using Tukeys univariate methods. Following this, a range of transformations are applied to the selected columns in the new data set to further improve and stabilise the data for further analysis. 

## Required packages 

The following section loads the required packages for this assignment.

```{r}
# Load the required packages.
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(editrules)
library(deductive)
library(validate)
library(forecast)
```

```{r}
# Clear any environmnet variables. 
rm(list = ls())
```


## Data 


The data sets used in the this assignment is an extract from the IMDB movie data base was sourced from the Relational Data Set Repository website (https://relational.fit.cvut.cz/dataset/IMDb). It consists of a number of tables that form a relational database. As had some issues connecting directly to the database from R (slowness etc) for the purposes of obtaining the data for my assignment, I connected to the source repository with a SQL client (MySQL Workbench) and extracted the tables as CSV files to my local machine and then imported them into R. The files are mentioned in the section below. Whilst the dataset is generally of good quality and the required checks for data quality and consistency was still applied. 

### Read the data
The following codes reads in the initial dataset from CSV files.
```{r}
# Read in the IBMDB dataset tables as CSV files to Dataframes
actors_df <- data.frame(read.csv("actors.csv", sep = ';'))
directors_df <- data.frame(read.csv("directors.csv", sep = ';'))
movies_df <- data.frame(read.csv("movies.csv", sep = ';'))
movies2actors_df <- data.frame(read.csv("movies2actors.csv", sep = ';'))
movies2directors_df <- data.frame(read.csv("movies2directors.csv", sep = ';'))
ratings_df <- data.frame(read.csv("ratings.csv", sep = ';'))
runningtimes_df <- data.frame(read.csv("runningtimes.csv", sep = ';'))
movies_genres_df <- data.frame(read.csv("movie_genres.csv", sep = ';'))
```

```{r}
# Create a list of dataframes consisting of the tables read to data frames in the above code.
IMDB_Dataframes <- list(actors_df, directors_df, movies_df, movies2actors_df, movies2directors_df, ratings_df,runningtimes_df)
```

### Check the data. 
This section performs some rudimentary checks of initial data set to ensure that it they are readable and useable by R. A for loop is used here to be efficient with code and addressing repetitive task with respected to variables that are repeatedly used.

```{r}
# Loop through data sets and display head obserrvations in each. 
for (data_set in IMDB_Dataframes){
  print(head(data_set))
}
```

The following section of code checks that each table has been read in as a data frame. 
```{r}
# Check if individual data sets are of class data frame. 
for (data_set_name in IMDB_Dataframes){
  #print("Start of  =========================", data_set_name)
  print(class(data_set_name))
}
```

## Understand 
This section summarises the types of variables and data structures, checks the attributes in the data set and applies data type conversions as required. As I am creating a new, merged data set, only the variables in this new data set are subject type conversion, cleaning etc. 

```{r}
# Loop through data sets and check the structure, indicating the variable names and types. 
for (data_set in IMDB_Dataframes){
  print(str(data_set))
}
```
### Create new merged data set
The IMDB custom data set  (IMDB_Custom) is created from merging a number of data frames, collating the relevant information for this assignment. The common variable between these merged dataset is column movieid. This operation is a right join and is not prone to resulting in missing values and preserves existing information. The data frames that are amalgamated for my analysis is movies2driectors, movies, ratings and running times. Once this final set was constructed, unwanted variables and duplicates were removed to form the final data set IMDB_Custom for pre-processing.

The following code renames the movie_id variable to be consistent with the other tables to support merger and analysis with other tables.
```{r}
## use dplyer to align the movie id variable name to support merging with other tables. 
#Renaming columns with dplyr::rename()
movies_genres_df <- movies_genres_df %>%  rename(movieid = movie_id)
```

The following code mergers a number movies_genres_df and movies_df to the movies data frame.
```{r}
# Join movies and directors
moviesNdirectors <- merge(movies2directors_df, movies_df,by="movieid")
# Now add more data that will be useful for dataset.
moviesNdirectorsNratings <- merge(moviesNdirectors, ratings_df,by="movieid")
# add data from running times
moviesNdirectorsNratingsNTimes <- merge(moviesNdirectorsNratings, runningtimes_df,by="movieid")
# Rename to a customised table.
IMDB_Custom <-moviesNdirectorsNratingsNTimes
```

```{r}
# Remove unwanted columns. 
IMDB_Custom$time <- NULL
IMDB_Custom$directorid <- NULL
IMDB_Custom$addition <- NULL
```

```{r}
# Inspect the final data set. 
head(IMDB_Custom)
```

## Inspect and Understand
This section inspects the data frame and variables using R functions. The following code chunk checks the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. The str() function is used which provides the detailed structure within the data set. It also provides details with regards to the number of variables and observations. Data types are also changed to support the pre-processing and analysis. 


```{r}
# Check the dimensions fo the DF
dim(IMDB_Custom)
```

```{r}
# list the column names and change where appropriate.
str(IMDB_Custom)
```

The following section of code isnspect and Correct varibales
```{r}
#Rename columns with dplyr::rename() to provide some context to the dataset. 
# Rename the variable names to meaningful ones
# rank to movie_ranking
IMDB_Custom<- IMDB_Custom%>%rename(movie_ranking=rank)
# votes to movie_votes
IMDB_Custom<- IMDB_Custom%>%rename(movie_votes=votes)
# name to time1 to running time. 
IMDB_Custom<- IMDB_Custom%>%rename(running_time=time1)
# rate to director_rating
```

```{r}
# The following variable types were corrected and changed to support the analysis. 

# Chnage the rank variable from nuemrical to ordinal
IMDB_Custom$movie_ranking <- as.numeric(IMDB_Custom$movie_ranking)
# Change title to string
IMDB_Custom$title <-as.character(IMDB_Custom$title)
# Change distributio from factor to ordinal
IMDB_Custom$distribution<-as.ordered(IMDB_Custom$distribution)
```


Recheck the structure and confirm the changes have been applied. 
```{r}
str(IMDB_Custom)
```

As the initial mergers may have resulted in some duplicate values, remove these values and finalise the data set.  

```{r}
# Finally remove any duplicate rows
IMDB_Custom <- distinct(IMDB_Custom)
```

```{r}
dim((IMDB_Custom))
```

##	Tidy & Manipulate Data I 
The data set I have chosen and the merger of relevant tables resulting in the final table, IMDB_Custom already conforms the tidy data principle and no reshaping of my data into a tidy format is required. 

##	Tidy & Manipulate Data II 
The following code uses the mutate function Create/mutate the running time variable to add the column running_time_seconds. 

```{r}
# Create a new variable for running time in seconds b
IMDB_Custom <- IMDB_Custom %>% mutate(running_time_seconds = running_time*60)
head(IMDB_Custom)
```

##	Scan I 

The following sections of code scan the data for missing values (NaNs and NAs) and any inconsistencies and obvious errors. 
```{r}
# 
sum(is.na(IMDB_Custom$movieid))
sum(is.na(IMDB_Custom$genre))
sum(is.na(IMDB_Custom$title))
sum(is.na(IMDB_Custom$year))
sum(is.na(IMDB_Custom$movie_ranking))
sum(is.na(IMDB_Custom$movie_votes))
sum(is.na(IMDB_Custom$distribution))      
sum(is.na(IMDB_Custom$running_time))
```

```{r}
sum(is.nan(IMDB_Custom$movieid))
sum(is.nan(IMDB_Custom$genre))
sum(is.nan(IMDB_Custom$title))
sum(is.nan(IMDB_Custom$year))
sum(is.nan(IMDB_Custom$movie_ranking))
sum(is.nan(IMDB_Custom$movie_votes))
sum(is.nan(IMDB_Custom$distribution))
sum(is.nan(IMDB_Custom$running_time))
```

##	Scan II
The following sections of code scan and address outliers and uses Tukey’s univariate method of outlier detection is used to detect outliers in the box plots. 

### Display Box Plots - Pre Outlier resolution
First a box plot is created to provide any visualisation of outliers. As seen in the boxplots below there are a number of outliers for every variable and this may be addressed and improved with Tukeys method.  

```{r}
# create a collection of boxplots of the selected variables showing the outliers. 
par(mfrow=c(2,2))
boxplot(IMDB_Custom$year, main="\nMovie Year", ylab="Year", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="\nRunning Time", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$movie_votes, main="Movie Running Votes", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="Movie Running Ranking", ylab="Minutes", col="blue", pars=list(outcol="red"))
mtext("Box Plots for Variables pre Transformation", side = 3, line = -1,outer = TRUE)
```

### Check outliers using z-score and respective adjustments
Use Tukeys method for each of the variables to detect and address the outliers. 
### Movie Year Variable
```{r}
# Check for outliers using the z-score
z.scores <-  IMDB_Custom$year %>% scores(type = "z")
z.scores %>% summary()
```
There are 86 values that can be considered outliers
```{r}
# Check the number of values that fall outside the range. .
length (which( abs(z.scores) >3 ))
```
```{r}
# Replace the outliers with median values
IMDB_Custom$year[which( abs(z.scores) >3 )] <- median(IMDB_Custom$year)
```
The values outside the range have now been reduced to 74
```{r}
# Recheck the z scores after the imputations to confirm the outlier have been addressed. 
z.scores <-  IMDB_Custom$year %>% scores(type = "z")
z.scores %>% summary()
length (which( abs(z.scores) >3 ))
```

### Movie Running Time  Variable
```{r}
# Check for outliers using the z-score
z.scores <-  IMDB_Custom$running_time %>% scores(type = "z")
z.scores %>% summary()
```
There are 109 values that can be considered outliers. 
```{r}
# Check the number of values that fall outside the range. There are 109 values that can be considered outliers. 
length (which( abs(z.scores) >3 ))
```
```{r}
# Replace the outliers with median values
IMDB_Custom$running_time[which( abs(z.scores) >3 )] <- median(IMDB_Custom$running_time)
```
The values outside the range have now been reduced to 71.
```{r}
# Recheck the z scores after the imputations to confirm the outlier have been addressed. The values outside the range have now been reduced to 71.
z.scores <-  IMDB_Custom$running_time %>% scores(type = "z")
z.scores %>% summary()
length (which( abs(z.scores) >3 ))
```

#### Movie Votes  Variable
```{r}
# Check for outliers using the z-score
z.scores <-  IMDB_Custom$movie_votes %>% scores(type = "z")
z.scores %>% summary()
```
There are 99 values that can be considered outliers.
```{r}
# Check the number of values that fall outside the range. There are 99 values that can be considered outliers.
length (which( abs(z.scores) >3 ))
```

```{r}
# Replace the outliers with median values
IMDB_Custom$movie_votes[which( abs(z.scores) >3 )] <- median(IMDB_Custom$movie_votes)
```
The values outside the range have now been reduced to 152
```{r}
# Recheck the z scores after the imputations to confirm the outlier have been addressed. There are 86 values that can be considered outliers. The values outside the range have now been reduced to 152
z.scores <-  IMDB_Custom$movie_votes %>% scores(type = "z")
z.scores %>% summary()
length (which( abs(z.scores) >3 ))
```

# Movie ranking
```{r}
# Check for outliers using the z-score
z.scores <-  IMDB_Custom$movie_ranking %>% scores(type = "z")
z.scores %>% summary()
```
There are 40 values that can be considered outliers.
```{r}
# Check the number of values that fall outside the range. There are 40 values that can be considered outliers. 
length (which( abs(z.scores) >3 ))
```

```{r}
# Replace the outliers with median values
IMDB_Custom$movie_ranking[which( abs(z.scores) >3 )] <- median(IMDB_Custom$movie_ranking)
```
The values outside the range have now been reduced to 13.
```{r}
# Recheck the z scores after the imputations to confirm the outlier have been addressed. The values outside the range have now been reduced to 13.
z.scores <-  IMDB_Custom$movie_ranking %>% scores(type = "z")
z.scores %>% summary()
length (which( abs(z.scores) >3 ))
```


##### Display Box Plots - Post Oulier resolution
The section below displays the variables after Tukeys univariate outlier detection and resolution methods have been applied. As can be seen there are considerable improvements from the initial set of plots. 

```{r}
par(mfrow=c(2,2))
boxplot(IMDB_Custom$year, main="\nMovie Year", ylab="Year", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="\nRunning Time", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$movie_votes, main="Movie Running Votes", ylab="Minutes", col="blue", pars=list(outcol="red"))
boxplot(IMDB_Custom$running_time, main="Movie Running Ranking", ylab="Minutes", col="blue", pars=list(outcol="red"))
# mtext("My 'Collection of Box Plots for Variables", side = 3, line = -21, outer = TRUE)
mtext("Box Plots for Variables Post Outlier Resolution", side = 3, line = -1,outer = TRUE)
```
##	Transform 

This section applies transformation to the selected variables by visualy analysing hte skew of the distribution and applying a relevant transoframtion. Please note I have applied a range of transformation to view the effefcts but have specified wich ones are relevant to based on the skew.  The plots below display respective histograms of the variables of interest. 

```{r}
# Plot histograms on untransformed variables. 
par(mfrow=c(2,2))
hist(IMDB_Custom$year,main="Movie Year", xlab="Year", col="blue")
hist(IMDB_Custom$running_time,main="\nMovie Running Time", xlab="Running Time", col="blue")
hist(IMDB_Custom$movie_votes, main="\nMovie Votes",xlab="Movie Votes", col="blue")
hist(IMDB_Custom$movie_ranking,main="Movie Ranking",xlab="Movie Ranking", col="blue")
mtext("Histograms for Variables Pre Transformation", side = 3, line = -1,outer = TRUE)
```
#### Movie Year
Data for movie year time is strongley skewed to the left.  Whilst we apply all transforamion to se the effect the most applicable transformation to left skewed data is square root and reciprocal transformation, other transformations including Box Cox with a lambda value of auto is applied. In this case the transformations had little effect in producing a symmetrical distribution. 

```{r}
# Movie Year Transformations - Left Skewed
# Square Transforamton
year_sq1_trn <- IMDB_Custom$year^2
# Box Cox
year_BC_trn <- BoxCox(IMDB_Custom$year, lambda = "auto")
#log transformation
year_log_trn <- log10(IMDB_Custom$year)
year_ln_trn <- log(IMDB_Custom$year)
# Reciprocal Transformation
year_recip_trn <- 1/IMDB_Custom$year
```

```{r}
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(year_sq1_trn,main="\nSquare Root", xlab="Year", col="blue")
hist(year_BC_trn,main="\nBox Cox", xlab="Year", col="blue")
hist(year_log_trn,main="\nLog 10", xlab="Year", col="blue")
hist(year_ln_trn,main="\nNatural Log", xlab="Year", col="blue")
hist(year_recip_trn,main="\nReciprocal Log", xlab="Year", col="blue")
mtext("Histograms for Year Variables Post Transformation", side = 3, line = -1,outer = TRUE)
```

#### Movie Running Time
Data for movie running times apper to be reasonably symmetrical, lets see if it can be improved. Again, all transforamion are applied to see the effect. Again none of the transformations applied had any strong impact on improving the distrubution. 

```{r}
# Movie Year Transformations - Left Skewed
# Square Transforamton
running_time_sq1_trn <- IMDB_Custom$running_time^2
# Box Cox
running_time_BC_trn <- BoxCox(IMDB_Custom$running_time, lambda = "auto")
#log transformation
running_time_log_trn <- log10(IMDB_Custom$running_time)
running_time_ln_trn <- log(IMDB_Custom$running_time)
# Reciprocal Transformation
running_time_recip_trn <- 1/IMDB_Custom$running_time
```

```{r}
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(running_time_sq1_trn,main="\nSquare Root", xlab="Running Time", col="blue")
hist(running_time_BC_trn,main="\nBox Cox", xlab="Running Time", col="blue")
hist(running_time_log_trn,main="\nLog 10", xlab="Running Time", col="blue")
hist(running_time_ln_trn,main="\nNatural Log", xlab="Running Time", col="blue")
hist(running_time_recip_trn,main="\nReciprocal", xlab="Running Time", col="blue")
mtext("Histograms for Running Times Variables Post Transformation", side = 3, line = -1,outer = TRUE)
```

#### Movie Votes
Data for movie votes is very strongly  skewed to the right. Whilst I have applied all transformation to see the effect the most applicable transformation to right skewed data is log, natural log and square root transformation. Here the both the log transformations applied appear to have a reasonable impact on improving the symmetry of distribution. 

```{r}
# Movie Votes Transformations - Left Skewed
# Square Transforamton
movie_votes_sq1_trn <- IMDB_Custom$movie_votes^2
# Box Cox
movie_votes_BC_trn <- BoxCox(IMDB_Custom$movie_votes, lambda = "auto")
#log transformation
movie_votes_log_trn <- log10(IMDB_Custom$movie_votes)
movie_votes_ln_trn <- log(IMDB_Custom$movie_votes)
# Reciprocal Transformation
movie_votes_recip_trn <- 1/IMDB_Custom$movie_votes
```

```{r}
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(movie_votes_sq1_trn,main="\nSquare Root", xlab="Movie Votes", col="blue")
hist(movie_votes_BC_trn,main="\nBox Cox", xlab="Movie Votes", col="blue")
hist(movie_votes_log_trn,main="\nLog 10", xlab="Movie Votes", col="blue")
hist(movie_votes_ln_trn,main="\nNatural Log", xlab="Movie Votes", col="blue")
hist(movie_votes_recip_trn,main="\nReciprocal", xlab="Movie Votes", col="blue")
mtext("Histograms for Movie Votess Variables Post Transformation", side = 3, line = -1,outer = TRUE)
```

#### Movie Ranking
Data for movie running times apper to be reasonably symmetrical, lets see if it can be improved. Again, all transforamion are applied to see the effect. As per the plots below Square root and Box Cox transformations appear to have a moderate impact on improving the symmetry of the distrubution. 

```{r}
# Movie Votes Transformations - Left Skewed
# Square Transforamton
movie_ranking_sq1_trn <- IMDB_Custom$movie_ranking^2
# Box Cox
movie_ranking_BC_trn <- BoxCox(IMDB_Custom$movie_ranking, lambda = "auto")
#log transformation
movie_ranking_log_trn <- log10(IMDB_Custom$movie_ranking)
movie_ranking_ln_trn <- log(IMDB_Custom$movie_ranking)
# Reciprocal Transformation
movie_ranking_recip_trn <- 1/IMDB_Custom$movie_ranking
```

```{r}
# Plot histograms on untransformed variables. 
par(mfrow=c(1,5))
hist(movie_ranking_sq1_trn,main="\nSquare Root", xlab="Movie Votes", col="blue")
hist(movie_ranking_BC_trn,main="\nBox Cox", xlab="Movie Votes", col="blue")
hist(movie_ranking_log_trn,main="\nLog 10", xlab="Movie Votes", col="blue")
hist(movie_ranking_ln_trn,main="\nNatural Log", xlab="Movie Votes", col="blue")
hist(movie_ranking_recip_trn,main="\nReciprocal", xlab="Movie Votes", col="blue")
mtext("Histograms for Movie Votess Variables Post Transformation", side = 3, line = -1,outer = TRUE)
```


