The goal of this assignment is to apply a variety of data preprocessing tasks to a series of datasets. For the purposes of this report, we have considered three IMDB Movie datasets, two sourced from the Relational Dataset Repository and another sourced from Kaggle.
The first two datasets have been extracted from a relational database with the schema shown below. Specifically, only the movies table and the movies_genre table are considered in this report.
IMDB Database Movie Schema sourced from the Relational Dataset Repository
The rest of this report is organised as follows. Section 2 explores the completion of tasks 1-9 which are completed using the dataset sourced from the Relational Database Repository. Section 3 explores Task 10 which is completed using a combination of the data used in the previous section and the data sourced from Kaggle.
After extraction from the database, the movies and movies_genres tables are in .csv format and so the read_csv() function from the readr package can be used to import this data into R.
# Load required packages
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(outliers)
#Loading in the data
movies <- read_csv("movies.csv")
movies_genres <- read_csv("movies_genres.csv")
The structure and dimensions of these datasets can be seen below.
The movies dataset contains 388269 observations(rows) and has 4 variables(columns). These variables are id, name, year and rank. On the other hand, the movies_genres dataset contains 395119 observation and 2 variables. These variables are movied_id and genre. The specifics of these variables are described in detail in this section of this report.
#Check on the structure and dimensions of the data frames
# Movies dataset
str(movies)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 388269 obs. of 4 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
$ name: chr "#28" "#7 Train: An Immigrant Journey, The" "$" "$1,000 Reward" ...
$ year: int 2002 2000 1971 1913 1915 1923 1971 1920 1921 1915 ...
$ rank: chr "NULL" "NULL" "6.4" "NULL" ...
- attr(*, "spec")=List of 2
..$ cols :List of 4
.. ..$ id : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ name: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ year: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ rank: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
dim(movies)
[1] 388269 4
# Movie Genres dataset
str(movies_genres)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 395119 obs. of 2 variables:
$ movie_id: int 1 1 2 2 5 6 6 8 8 8 ...
$ genre : chr "Documentary" "Short" "Comedy" "Crime" ...
- attr(*, "spec")=List of 2
..$ cols :List of 2
.. ..$ movie_id: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ genre : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
dim(movies_genres)
[1] 395119 2
# View the movies dataset
head(as.data.frame(movies))
movie_id name year rank
1 0 #28 2002 NULL
2 1 #7 Train: An Immigrant Journey, The 2000 NULL
3 2 $ 1971 6.4
4 3 $1,000 Reward 1913 NULL
5 4 $1,000 Reward 1915 NULL
6 5 $1,000 Reward 1923 NULL
# View the movies_genres dataset
head(as.data.frame(movies_genres))
movie_id genre
1 1 Documentary
2 1 Short
3 2 Comedy
4 2 Crime
5 5 Western
6 6 Comedy
In order to merge both datasets, we rename the id variable in the movies dataset to movie_id using the names function from Base R. These datasets can now be merged using the mutual variable movie_id now present in both datasets.
# Join both datasets
# Rename the movies dataset id variable#
names(movies)[1]<-"movie_id"
movies_combined <- merge(movies,movies_genres, by="movie_id")
The structure and dimensions of this combined dataset, movies_combined can be seen below. This dataset has 395119 observations and 5 variables which are discussed briefly below.
# Checking on the structure and dimensions of the merged dataset
str(movies_combined)
'data.frame': 395119 obs. of 5 variables:
$ movie_id: int 1 1 2 2 5 6 6 8 8 8 ...
$ name : chr "#7 Train: An Immigrant Journey, The" "#7 Train: An Immigrant Journey, The" "$" "$" ...
$ year : int 2000 2000 1971 1971 1923 1971 1971 1921 1921 1921 ...
$ rank : chr "NULL" "NULL" "6.4" "6.4" ...
$ genre : chr "Documentary" "Short" "Comedy" "Crime" ...
dim(movies_combined)
[1] 395119 5
# View the movies_combined dataset
head(movies_combined)
movie_id name year rank genre
1 1 #7 Train: An Immigrant Journey, The 2000 NULL Documentary
2 1 #7 Train: An Immigrant Journey, The 2000 NULL Short
3 2 $ 1971 6.4 Comedy
4 2 $ 1971 6.4 Crime
5 5 $1,000 Reward 1923 NULL Western
6 6 $1,000,000 Duck 1971 5 Comedy
As seen above, there are 5 variables present and each is described briefly below:
movie_id - A unique identifier assigned to every movie present in the IMDB database. It has been read in as an integer.
name - The name of a movie in the database. This has been read in as a character.
year - This variable describes the year a movie was produced. It has been read in as an integer.
rank - This describes the overall rating given to a movie. It has been read in as a character but should be a numeric variable.
genre - This variable describes the classification given to a movie. It has been read in as a character but should be a factor.
These variables are renamed to be slightly more meaningful and thereafter converted to their appropriate types.
# Rename combined variables of the combined dataset
names(movies_combined)[2]<-"movie_name"
names(movies_combined)[3]<-"year_released"
names(movies_combined)[4]<-"movie_rating"
# Convert variable types
movies_combined$movie_rating<-as.numeric(movies_combined$movie_rating)
NAs introduced by coercion
class(movies_combined$movie_rating)
[1] "numeric"
head(movies_combined$movie_rating)
[1] NA NA 6.4 6.4 NA 5.0
movies_combined$genre <- as.factor(movies_combined$genre)
class(movies_combined$genre )
[1] "factor"
head(movies_combined$genre)
[1] Documentary Short Comedy Crime Western Comedy
21 Levels: Action Adult Adventure Animation Comedy Crime Documentary ... Western
it is important to highlight that the movie_rating variable previously had values marked “NULL” which are replaced with NAs by coercion when it is changed to be of type numeric. These two variables are
We shall begin by observing whether the data observed Hadley Wickham’s Tidy Principles.
head(movies_combined)
movie_id movie_name year_released movie_rating
1 1 #7 Train: An Immigrant Journey, The 2000 NA
2 1 #7 Train: An Immigrant Journey, The 2000 NA
3 2 $ 1971 6.4
4 2 $ 1971 6.4
5 5 $1,000 Reward 1923 NA
6 6 $1,000,000 Duck 1971 5.0
genre
1 Documentary
2 Short
3 Comedy
4 Crime
5 Western
6 Comedy
tail(movies_combined)
movie_id movie_name year_released movie_rating genre
395114 378611 nc Selim'in gzdesi 1950 NA Romance
395115 378612 nz de mihlarim 1965 NA Adventure
395116 378612 nz de mihlarim 1965 NA Drama
395117 378613 egar a gerist 1998 NA Comedy
395118 378613 egar a gerist 1998 NA Drama
395119 378614 . 19,99 1998 6.3 Comedy
From the above, we can observe that the data is generally tidy as each variable forms a column, each observations forms a row and each value is in its own cell. However, there are numerous instances where an observation is duplicated as a movie can have several genres. This causes the movie identifier to be duplicated.
For the purposes of this report, we shall consider a movie to have only one genre and so shall remove duplicate rows based on the variable movie_id using the function distinct from the dplyr package as shown below.
movies_distinct<- distinct(movies_combined,movie_id, .keep_all = TRUE)
#movies_distinct2<-movies_combined[!duplicated(movies_combined$movie_id), ]#
head(movies_distinct)
movie_id movie_name year_released movie_rating
1 1 #7 Train: An Immigrant Journey, The 2000 NA
2 2 $ 1971 6.4
3 5 $1,000 Reward 1923 NA
4 6 $1,000,000 Duck 1971 5.0
5 8 $10,000 Under a Pillow 1921 NA
6 9 $100,000 1915 NA
genre
1 Documentary
2 Comedy
3 Western
4 Comedy
5 Animation
6 Drama
Additionally, we shall subset this distinct data frame for observations only within the last 10 years in order to answer the question, What has been the Most Popular Movie Genre in the last 10 years?
# Filtering and subsetting table to 1998-2008
movies_subset <- movies_distinct %>% filter(year_released >= "1998", genre != "Adult")
#Missing values in the dataset
colSums(is.na(movies_subset))
movie_id movie_name year_released movie_rating genre
0 0 0 38395 0
The numeric variable movie_rating is the only variable with missing values. There are 38395 NA values in the variable which is an equivalent of 73.87% of missing data. It is likely that this is because users are not required to rate movies on the website or users only tend to rate bigger budget films that receive a lot of media attention.
We can observe the total number of missing ratings by the genre that they are associated with as shown below.
# Missing total values in rating column of movies_subset grouped by genre
as.data.frame(movies_subset %>% group_by(genre) %>% summarise(missing_rating = sum(is.na(movie_rating))) %>% arrange(desc(missing_rating)))
genre missing_rating
1 Documentary 9533
2 Short 6560
3 Drama 6411
4 Comedy 5440
5 Action 3073
6 Animation 1747
7 Crime 1332
8 Horror 751
9 Music 749
10 Adventure 660
11 Thriller 507
12 Family 453
13 Romance 305
14 Fantasy 277
15 Sci-Fi 245
16 Musical 165
17 Mystery 139
18 War 27
19 Western 21
We observe that the highest non-rated movies are documentaries, drama, short films and comedies. Conversely, Western and War movies have the lowest number of missing ratings.
We find the mean rating, median rating and the number of movies rated per genre as shown below. It is observed that there is not much deviation from the mean for each of the groups.
# Mean of Movie Genres
genres_mean<-as.data.frame(movies_subset %>% group_by(genre) %>% summarise(group_mean = round(mean(movie_rating, na.rm = TRUE),2), group_median=median(movie_rating, na.rm = TRUE), Frequency=n())) %>% arrange(desc(group_mean))
genres_mean
genre group_mean group_median Frequency
1 Documentary 6.91 7.10 10829
2 War 6.62 6.60 31
3 Animation 6.58 6.70 2228
4 Short 6.57 6.70 7603
5 Drama 6.27 6.40 10326
6 Musical 6.22 6.30 189
7 Music 6.17 7.40 756
8 Romance 6.10 6.20 436
9 Mystery 6.03 6.20 201
10 Comedy 5.96 6.10 9359
11 Crime 5.88 6.10 1790
12 Adventure 5.60 5.70 994
13 Action 5.52 5.70 4449
14 Fantasy 5.49 5.80 383
15 Family 5.39 5.40 568
16 Sci-Fi 5.20 5.00 320
17 Thriller 4.99 4.90 851
18 Horror 4.81 4.70 1170
19 Western 4.77 4.55 27
Documentaries have the highest mean of 6.91 and also the highest frequency count of 10829. It is therefore not surprising that this category appears to be the most popular. It is indicative that possibly this category was the most frequently rated in the dataset.
This report proposes to impute the missing values present in the movie_rating variable with the mean of the genre that the missing observation belongs to. This is successfully done below using the mutate function from the dplyr package.
# Impute Missing values with group mean using mutate
movies_imputed<-movies_subset%>%
group_by(genre) %>%
mutate(movie_rating = ifelse(is.na(movie_rating), mean(movie_rating, na.rm = TRUE),movie_rating))
# Check for missing values
sum(is.na(movies_imputed$movie_rating))
[1] 0
# Round off to two decimal places
movies_imputed$movie_rating<-round(movies_imputed$movie_rating,2)
head(movies_imputed)
dim(movies_imputed)
[1] 52510 5
In order to detect for outliers in the movie_rating variable, we shall rely on the movie_subset dataframe created earlier as it is more representative of the dataset. As this is one variable in question, univariate methods of outlier detection and handling shall be considered.
A boxplot is developed for this purpose as shown below.
movies_subset$movie_rating %>% boxplot(main="Box Plot of Movie Ratings", ylab="Rating", col = "pink")
Possible outliers are observed below the lower outlier fence.
sum(is.na(movies_subset$movie_rating))
[1] 38395
mean(!complete.cases(movies_subset$movie_rating))
[1] 0.7311941
As had been noted earlier, there are a high number of missing values within the ratings column, accounting for about 73% of the data in this variable. This is also indicated above.
We shall create a subset of the data without these missing values and see if the outliers are still present.
complete_ratings<-movies_subset[!is.na(movies_subset$movie_rating), ]
complete_ratings$movie_rating %>% boxplot(main="Box Plot of Movie Ratings after removing NA values", ylab="Rating", col = "pink")
Outliers are still present after removing the NA values.
We calculate the z scores as shown below and as observed in the summary we identify that the minimum z score is -3.14310 while the maximum is 2.36846. Outliers are considered to be observations that have an absolute value larger than 3. This method finds that there are 45 observations that meet this criterion i.e. there are 45 outliers.
# Z score method to detect univariate outliers
z.scores <- complete_ratings$movie_rating %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.14310 -0.60407 0.07713 0.00000 0.69641 2.36846
# Establish location of outliers
which( abs(z.scores) > 3 )
[1] 422 924 1498 1833 2285 3276 3426 3857 3878 4088 4419 4456 4600
[14] 4747 5504 5603 5609 5774 5872 6042 6063 6210 6699 7133 7505 7655
[27] 7912 8283 8463 8655 8953 10086 10247 10538 10783 10855 11124 11317 11575
[40] 12356 12460 12618 13192 13368 13807
# Establish number of outliers
length (which( abs(z.scores) >3 ))
[1] 45
We can use the capping method to replace the observations lying outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile. A function is created below to perform this operation and further a boxplot is created to observe whether capping had any effect on the outliers.
# Handling outliers using the capping method
fun <- function(x){
quantiles <- quantile( x, c(.05, .95 ) )
x[ x < quantiles[1] ] <- quantiles[1]
x[ x > quantiles[2] ] <- quantiles[2]
x
}
ratings_capped<-complete_ratings$movie_rating %>% fun
# Observe if capping has been successful to get rid of outliers
ratings_capped %>% boxplot(main="Box Plot of Movie Ratings after Capping", ylab="Rating", col = "pink")
We observe that the capping method has been successful in eliminating the 45 outliers that were present in this variable.
To the combined dataset movies_subset we merge the additional IMDB movie dataset in order to observe the distribution of movie revenues for action movies.
The dataset is in .csv format and so the function read_csv is used to import it into R.
head(moviesB)
[38;5;246m# A tibble: 6 x 12[39m
Rank Title Genre Description Director Actors Year `Runtime (Minut… Rating
[3m[38;5;246m<int>[39m[23m [3m[38;5;246m<chr>[39m[23m [3m[38;5;246m<chr>[39m[23m [3m[38;5;246m<chr>[39m[23m [3m[38;5;246m<chr>[39m[23m [3m[38;5;246m<chr>[39m[23m [3m[38;5;246m<int>[39m[23m [3m[38;5;246m<int>[39m[23m [3m[38;5;246m<dbl>[39m[23m
[38;5;250m1[39m 1 Guard… Acti… A group of in… James G… Chris … [4m2[24m014 121 8.1
[38;5;250m2[39m 2 Prome… Adve… Following clu… Ridley … Noomi … [4m2[24m012 124 7
[38;5;250m3[39m 3 Split Horr… Three girls a… M. Nigh… James … [4m2[24m016 117 7.3
[38;5;250m4[39m 4 Sing Anim… In a city of … Christo… Matthe… [4m2[24m016 108 7.2
[38;5;250m5[39m 5 Suici… Acti… A secret gove… David A… Will S… [4m2[24m016 123 6.2
[38;5;250m6[39m 6 The G… Acti… European merc… Yimou Z… Matt D… [4m2[24m016 103 6.1
[38;5;246m# ... with 3 more variables: Votes [3m[38;5;246m<int>[38;5;246m[23m, `Revenue (Millions)` [3m[38;5;246m<dbl>[38;5;246m[23m,
# Metascore [3m[38;5;246m<int>[38;5;246m[23m[39m
This dataset has 1000 observations and 12 variables as shown below. It contains a mix of numeric and character variables.
str(moviesB)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1000 obs. of 12 variables:
$ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ Title : chr "Guardians of the Galaxy" "Prometheus" "Split" "Sing" ...
$ Genre : chr "Action,Adventure,Sci-Fi" "Adventure,Mystery,Sci-Fi" "Horror,Thriller" "Animation,Comedy,Family" ...
$ Description : chr "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control "| __truncated__ "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize the"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
$ Director : chr "James Gunn" "Ridley Scott" "M. Night Shyamalan" "Christophe Lourdelet" ...
$ Actors : chr "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana" "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron" "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula" "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson" ...
$ Year : int 2014 2012 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Runtime (Minutes) : int 121 124 117 108 123 103 128 89 141 116 ...
$ Rating : num 8.1 7 7.3 7.2 6.2 6.1 8.3 6.4 7.1 7 ...
$ Votes : int 757074 485820 157606 60545 393727 56036 258682 2490 7188 192177 ...
$ Revenue (Millions): num 333 126 138 270 325 ...
$ Metascore : int 76 65 62 59 40 42 93 71 78 41 ...
- attr(*, "spec")=List of 2
..$ cols :List of 12
.. ..$ Rank : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Title : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Genre : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Description : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Director : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Actors : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Year : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Runtime (Minutes) : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Rating : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ Votes : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Revenue (Millions): list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ Metascore : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
dim(moviesB)
[1] 1000 12
For the purposes of this task, we shall select the Title, Year, Rating, Revenue and Genre variables. These variables are renamed in order to be consistent with data previously analysed by this report.
We merge the datasets by movie_name in order to observe what movies these two datasets have in common, filter to include only action movies and rename the columns.
movies_joined_B <- merge(movies_subset,moviesB_,by=c("movie_name"))
#filering movies_joined df to action only
movies_joined_b <-movies_joined_B %>% filter(genre == "Action")
#Subset further to remove double up variables 'year_released.x', 'movie_rating.x' and 'Genre'
movies_joined_c <- movies_joined_b[-c(3,4,9)]
#removing 2 entries that do no have revenue recorded
movies_joined_d <- movies_joined_c[-c(7,8),]
#renaming column names
names(movies_joined_d)[4]<-"year_released"
names(movies_joined_d)[5]<-"movie_rating"
names(movies_joined_d)[6]<-"revenue"
head(movies_joined_d)
movie_name movie_id genre year_released movie_rating revenue
1 300 2238 Action 2006 7.7 210.59
2 Argo 20101 Action 2012 7.7 136.02
3 Avatar 24874 Action 2009 7.8 760.51
4 Beowulf 34191 Action 2007 6.2 82.16
5 Clown 65995 Action 2014 5.7 0.05
6 Dead Awake 80394 Action 2016 4.7 0.01
movies_joined_d$revenue%>% hist(main="Histogram of Movie Revenues for Action Movies", ylab="Rating",col = "pink")
The distribution of movie revenues for action movies shown above reveals that this variable is a right-skewed distribution. There are some movies that make a lot more revenue than others.
We shall attempt a series of transformations to address this right skewed nature of this variable. These transformations are namely:log10 transformation, reciprocal transformation, natural log transformation and square root transformation.
log_revenue<- log10(movies_joined_d$revenue)
log_revenue%>%hist(main="Histogram of Movie Revenues after log10 transformation", xlab="Rating",col = "pink")
# Reciprocal transformation
recip_revenue<-1/movies_joined_d$revenue
recip_revenue%>%hist(main="Histogram of Movie Revenues after reciprocal transformation", xlab="Revenue",col = "pink")
# Natural log transformation
ln_revenue<-log(movies_joined_d$revenue)
ln_revenue %>%hist(main="Histogram of Movie Revenues after natural log transformation", xlab="Revenue",col = "pink")
# Square Root revenue
sqrt_revenue<-sqrt(movies_joined_d$revenue)
sqrt_revenue %>%hist(main="Histogram of Movie Revenues after square root transformation", xlab="Revenue",col = "pink")
The square root transformation is the most successful in addressing the right skew in this variable.