library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(stringr)
library(lubridate)
The purpose for this assignment is to apply various data pre-processing techniques on two movie datasets before conducting our analysis. The dataset contains information from Rotten Tomatoes, a website that keeps track of all reviews for each movie and aggregates the results and Internet Movie Database IMDB, an online database of information related to films, television programs.
The two datasets used for this assignment include a 1) movies dataset and 2) movies budget dataset which consist of 15,452 rows and 45,466 rows respectively. Each row represents a different movie and each column contains variables that apply to that particular movie. In order to simplify the process and achieve the main aim of the analysis, only 8 variables from the first dataset and 2 variables from the second dataset were selected. Our first task for this assignment was choose which variables to include for analysis from each dataset. Once we had chosen which variables to include, we joined the two datasets using the primary key ImdbID.
Understand the data is one of key requirement of this assignment, in this part we had to check the type of each variable and convert them to the correct variable type if necessary. Next is tidy and manipulate data. We created two new variables days between DVD and movie release and gross profit, which provide an insight into the details of the movie industry.
Data Scan function that will detect for missing values, inconsistencies or obvious errors in dataset. We also commenced the scan for numeric data for outliers including calculating the z-scores and finding how many outliers based on z-scores. Final part will be the transform at least one variable in dataset. Gross profit variable has been selected and we provide the histogram of transformed variable.
Movie dataset The data set contains information of all movies which were released in the years of 1990-2017. The movie dataset can be found at https://www.kaggle.com/beyjin/movies-1990-to-2017/data. Variables that were extracted from this dataset include:
Movie budget dataset This dataset provides critical information on various movies to provide users insight on the successfulness of each movie. This dataset can be found at https://www.kaggle.com/tmdb/tmdb-movie-metadata/data. Variables that were extracted from this dataset include:
#Importing of movie csv
movie <- read.csv("F:/Data Preprocessing/Assignment 3/Movie_Movies.csv",
stringsAsFactors=FALSE, na.string = "N/A")
#Selection and reordering of varible columns
movie <- movie[,c(21,17,10,11,16,14,4,22)]
#Removal of movie duplicate entries
movie <- distinct(movie)
#Preview of movie dataframe
head(movie)
#Importing of budget csv
budget <- read.csv("F:/Data Preprocessing/Assignment 3/movies_metadata.csv",
stringsAsFactors=FALSE, na.string = 0)
#Selection and reordering of varible columns
budget <- budget [,c(7,3,16)]
#Changing of variable column names
colnames(budget)[which(names(budget) == "imdb_id")] <- "imdbID"
colnames(budget)[which(names(budget) == "budget")] <- "Budget"
colnames(budget)[which(names(budget) == "revenue")] <- "Revenue"
#Preview of budget dataframe
head(budget)
#Inner join of movie and budget data frames based on imdbID
movie.budget <- inner_join(movie, budget, by = "imdbID")
#Preview of combined dataframe
head(movie.budget)
To check the type of each variable the str() command was used. This allowed us to identify which variable were associated with the wrong type. As a result, we had to set the correct variable types for four variables including Rated, Released, DVD and Budget. After changing the variable to the correct types, we had a look at the levels in the rated classification variable. We decided that the Rated variable should be re-ordered according to appropriate audience age.
#Changing of variable to appropriate variable type
movie.budget$Rated <- as.factor(movie.budget$Rated)
movie.budget$Released <- as.Date(movie.budget$Released, format = "%d-%b-%y")
movie.budget$DVD <- as.Date(movie.budget$DVD, format = "%d-%b-%y")
movie.budget$Budget <- as.numeric(movie.budget$Budget)
#Checking movie structure and variable type
str(movie.budget)
## 'data.frame': 4769 obs. of 10 variables:
## $ imdbID : chr "tt0172495" "tt0162222" "tt0207201" "tt0130623" ...
## $ Title : chr "Gladiator" "Cast Away" "What Women Want" "Dinosaur" ...
## $ Production: chr "Dreamworks Distribution LLC" "20th Century Fox" "Paramount Pictures" "Buena Vista" ...
## $ Rated : Factor w/ 14 levels "G","NC-17","NOT RATED",..: 7 6 6 5 5 6 6 6 6 6 ...
## $ Runtime : chr "155 min" "143 min" "127 min" "82 min" ...
## $ Released : Date, format: "2000-05-05" "2000-12-22" ...
## $ DVD : Date, format: "2000-11-21" "2001-06-12" ...
## $ imdbRating: num 8.5 7.8 6.4 6.5 6.1 7 6.4 7.4 6.6 7.9 ...
## $ Budget : num 1.03e+08 9.00e+07 7.00e+07 1.28e+08 1.23e+08 ...
## $ Revenue : num 4.58e+08 4.30e+08 3.74e+08 3.54e+08 3.45e+08 ...
#Looking at movie rated classifcation levels
levels(movie.budget$Rated)
## [1] "G" "NC-17" "NOT RATED" "NR" "PG"
## [6] "PG-13" "R" "TV-14" "TV-G" "TV-MA"
## [11] "TV-PG" "TV-Y" "TV-Y7" "UNRATED"
#Rearrange movie rated classification based on appropriate audience age
movie.budget$Rated <- factor(movie.budget$Rated,
levels = c("TV-Y", "TV-Y7", "G", "TV-G", "PG", "TV-PG", "PG-13",
"TV-14","R", "TV-MA", "X", "NC-17", "NOT RATED", "NR", "UNRATED"),
ordered = TRUE)
#Rechecking levels of movie rate classification after re-order
levels(movie.budget$Rated)
## [1] "TV-Y" "TV-Y7" "G" "TV-G" "PG"
## [6] "TV-PG" "PG-13" "TV-14" "R" "TV-MA"
## [11] "X" "NC-17" "NOT RATED" "NR" "UNRATED"
The dataset that we constructed using the movie and movie budget datasets is believed to already be tidy. The reason is because:
Through the use of the mutate() function we were able to create another two new variables. Using the movie release date and DVD date, we were able to calculate the total number of days it took for each movie title to be released on DVD. We were also able to calculate the gross profit of each movie title using the revenue and budget variables.
#Use of mutate() functions to create two new variables
movie.release.gross <-
movie.budget %>%
mutate(Days.between.DVD.and.Movie.release = DVD-Released,
Gross.profit = Revenue-Budget)
## Warning: package 'bindrcpp' was built under R version 3.3.3
head(movie.release.gross)
To figure out if there was any missing values in the dataset we used the any(is.na()) function. We investigated this problem further using the sum(is.na()) and colSums(is.na()) to determine number of missing values throughout the dataset. To address the issue of the missing data we decided to remove all observations that had missing values using the na.omit() function.
# Check for NA's
any(is.na(movie.release.gross))
## [1] TRUE
# Find number of NA's
sum(is.na(movie.release.gross))
## [1] 5018
# Check NA's for each variable
colSums(is.na(movie.release.gross))
## imdbID Title
## 0 0
## Production Rated
## 135 79
## Runtime Released
## 1 3
## DVD imdbRating
## 191 0
## Budget Revenue
## 1389 1294
## Days.between.DVD.and.Movie.release Gross.profit
## 194 1732
# Removal of rows with NA
movie.omit <- na.omit(movie.release.gross)
# Recheck for NA's
any(is.na(movie.omit))
## [1] FALSE
To check for outliers we produced boxplots for all numeric variables in our dataset. Based on first impressions from the boxplots it appears our dataset is believed to contain outliers. We then calculated the z-scores for all three numeric variables and removed observations that had an absolute value of greater than three. After removing the outliers in the dataset, we reproduced the boxplots to review the clean dataset. However, upon review of the clean dataset the boxplot still suggested that there were still outliers in our dataset. Based on results, we believe that these observations deemed as outliers are indeed not the case and that it is due to our dataset not being normally distributed.
# Boxplot of numeric variable
movie.omit$Budget %>% boxplot(main="Box Plot of movie budget",
ylab="Budget ($)", col = "grey")
movie.omit$Revenue %>% boxplot(main="Box Plot of movie revenue",
ylab="Revenue ($)", col = "grey")
movie.omit$Gross.profit %>% boxplot(main="Box Plot of movie gross profit",
ylab="Gross profit ($)", col = "grey")
# Calculate z-score
z.scores.budget <- movie.omit$Budget %>% scores(type = "z")
z.scores.budget %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0260 -0.6779 -0.3297 0.0000 0.2798 7.2450
z.scores.revenue <- movie.omit$Revenue %>% scores(type = "z")
z.scores.revenue %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6781 -0.5640 -0.3512 0.0000 0.1312 13.1900
z.scores.gross <- movie.omit$Gross.profit %>% scores(type = "z")
z.scores.gross %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4880 -0.5203 -0.3406 0.0000 0.1086 14.3700
# Finds the locations of outliers in the hindfoot length variable
which( abs(z.scores.budget) >3 )
## [1] 313 403 526 642 648 650 767 769 893 894 899 902 1021 1022
## [15] 1024 1072 1144 1165 1166 1167 1171 1172 1215 1309 1310 1317 1397 1477
## [29] 1478 1479 1482 1485 1511 1528 1539 1610 1612 1615 1617 1645 1650 1681
## [43] 1684 1747 1748 1752 1755 1894 1896 1897 1899 1950 2037 2038 2039 2043
## [57] 2187 2189 2203 2223 2234 2239 2687
which( abs(z.scores.revenue) >3 )
## [1] 105 200 201 202 307 308 401 402 403 523 524 642 643 767
## [15] 768 769 770 893 894 1021 1022 1023 1024 1086 1165 1166 1167 1168
## [29] 1169 1308 1309 1310 1477 1478 1479 1480 1481 1482 1483 1609 1610 1611
## [43] 1612 1613 1614 1615 1747 1748 1749 1750 1751 1752 1894 1895 1896 1897
## [57] 1898 1899 1900 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2186
## [71] 2187 2188 2189 2190 2223 2409 2473 2612 2687 2852
which( abs(z.scores.gross) >3 )
## [1] 105 200 201 202 307 308 401 402 523 524 642 643 767 768
## [15] 769 770 893 1021 1022 1023 1026 1086 1165 1166 1167 1168 1170 1308
## [29] 1309 1310 1477 1478 1479 1480 1481 1484 1609 1610 1611 1612 1613 1614
## [43] 1616 1747 1748 1749 1751 1894 1895 1896 1897 1898 1899 1900 2037 2038
## [57] 2039 2040 2041 2042 2043 2044 2045 2186 2187 2188 2189 2190 2409 2473
## [71] 2474 2612 2687 2852 2853
# Finds the total number of outliers according to the z-score
length( which(abs(z.scores.budget) >3 ))
## [1] 63
length( which(abs(z.scores.revenue) >3 ))
## [1] 80
length( which(abs(z.scores.gross) >3 ))
## [1] 75
# Exclude outlier
movie.clean.budget <- movie.omit[- which(abs(z.scores.budget) >3 ),]
movie.clean.revenue<- movie.omit[- which(abs(z.scores.budget) >3 ),]
movie.clean.gross <- movie.omit[- which(abs(z.scores.budget) >3 ),]
# Boxplot excluding outlier
movie.clean.budget$Budget %>% boxplot(main="Box Plot of movie budget",
ylab="Budget ($)", col = "grey")
movie.clean.revenue$Revenue %>% boxplot(main="Box Plot of movie revenue",
ylab="Revenue ($)", col = "grey")
movie.clean.gross$Gross.profit %>% boxplot(main="Box Plot of movie gross profit",
ylab="Gross profit ($)", col = "grey")
By producing a histogram of the gross profit variable in our dataset, its shows that the data is right skewed. We decided to transpose the gross profit variable using the mutate() function and do a reciprocal transformation for each observation. The idea behind the use of a reciprocal transformation is to reduce the right skewness of the distribution and make it more normal. After the transformation the distribution of the gross profit variable looks to be more normally distributed.
movie.gross.transpose <-
movie.omit %>%
mutate(Gross.profit.transpose = Gross.profit^(-1))
hist(movie.gross.transpose$Gross.profit,
main = "Histogram of Gross profit", xlab = "Gross profit")
hist(movie.gross.transpose$Gross.profit.transpose,
main = "Histogram of transposed Gross profit", xlab = "Gross profit reciprocal")
Movie dataset: ‘Movie Collection Analysis’ 2018, accessed May 18, 2018, from https://www.kaggle.com/beyjin/movies-1990-to-2017/data.
Movie budget dataset: ‘TMDB 5000 Movie Dataset’ 2017, accessed May 18, 2018, from https://www.kaggle.com/tmdb/tmdb-movie-metadata/data.