Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(lubridate)
library(outliers)
library(forecast)
In this assignment the first requirement was to merge two datasets which I did using the join function. The next step was to understand my data. In this step I first understood the types of variables in the data that was imported and then made necessary changes like conversion to factors and numeric types. After this step we were asked to check if our dataset satisfies the tidy principle. My dataset did not satisfy the tidy principles and hence the next step required us to convert the dataframe to a tidy format. The genre column in my dataset was not tidy as it did not contain one value. It contained upto 3 genres for each movie with the first genre being the main genre as per the IMDB website. So for this assignment I separated the values and then used the main genre for tmy analysis and dropped the sub genres. The next important part was to deal with the null values which was done by replacing the null values in numerical columns with the mean, and then the numerical columns were checked for outliers. If outliers were detected I handled them using the capping method. The last requirement was to transform a variable into normalised form which was I did using the square root method.
The datasets that I chose for this assignment were extracted from the IMDB website that can be downloaded from https://www.imdb.com/interfaces/. The movies_title dataset contains information of all movies, tv series, short movies etc from 2005 to 2022. This dataset includes attributes like the title type ie whether it is a movie, tv series, documentary etc, primary title of each movie, tconst(which is a unique identifier given to each title), isAdult, startYear, runtimeMinutes and the genres. The second dataset is the ratings dataset that contained attributes like tconst, averageRating and numVotes. The two datasets were joined using the tconst variable that was present in both datasets to produce a new dataset, the movies dataset.
movie_title <- read_csv("movie_title.csv")
Parsed with column specification:
cols(
tconst = [31mcol_character()[39m,
titleType = [31mcol_character()[39m,
primaryTitle = [31mcol_character()[39m,
originalTitle = [31mcol_character()[39m,
isAdult = [32mcol_double()[39m,
startYear = [32mcol_double()[39m,
runtimeMinutes = [32mcol_double()[39m,
genres = [31mcol_character()[39m
)
movie_title
ratings <- read_csv("ratings.csv") #Loading ratings dataset
Parsed with column specification:
cols(
tconst = [31mcol_character()[39m,
averageRating = [32mcol_double()[39m,
numVotes = [32mcol_double()[39m
)
ratings
movies <- left_join(movie_title, ratings, by="tconst") #Joining the two datasets/merging
movies
The movies dataset has 10 attributes and 67237 observations. The types of variables in this dataset are characters, numerics and factors. titleType was converted into factor and it was observed that there are 10 types of titles in this dataframe. The isAdult column signifies whether the movie is Adult or not. 0 is for movies that are not Adult and 1 for movies that are adult and hence this column was also converted to factor.
str(movies)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 67237 obs. of 10 variables:
$ tconst : chr "tt0065791" "tt0068943" "tt0069049" "tt0088751" ...
$ titleType : chr "tvShort" "short" "movie" "movie" ...
$ primaryTitle : chr "Góry o zmierzchu" "Miedzy Wroclawiem a Zielona Góra" "The Other Side of the Wind" "The Naked Monster" ...
$ originalTitle : chr "Góry o zmierzchu" "Miedzy Wroclawiem a Zielona Góra" "The Other Side of the Wind" "The Naked Monster" ...
$ isAdult : num 0 0 0 0 0 0 0 0 0 0 ...
$ startYear : num 2009 2010 2018 2005 2019 ...
$ runtimeMinutes: num 28 11 122 100 20 80 36 73 75 90 ...
$ genres : chr "Short" "Documentary,Short" "Drama" "Comedy,Horror,Sci-Fi" ...
$ averageRating : num 6.5 5.1 6.9 5.6 5.8 6.6 8.6 6.2 NA 4.8 ...
$ numVotes : num 13 30 4926 229 27 ...
dim(movies)
[1] 67237 10
movies$titleType <- factor(movies$titleType)
is.factor(movies$titleType)
[1] TRUE
table(movies$titleType)
movie short tvEpisode tvMiniSeries tvMovie tvSeries tvShort tvSpecial video
13358 13203 23628 449 3784 3626 408 662 8109
videoGame
10
movies <- mutate(movies, isAdult= factor(movies$isAdult, levels = c(0, 1))) #Converting isAdult to factor
movies
According to the Tidy principles each observation must have its own cell. In this dataset a maximum of 3 genres have been provided for each movie in the genre column with the first genre being the main genre of that movie. To convert my dataset into Tidy format, I separated the Genres into 3 columns; “main-genre”, “sub-genre1” and “sub-genre2”. Since I only wanted to consider the main genre in my analysis I dropped the other two columns.
movies<- movies %>% separate('genres',into=c("main-genre", "sub-genre1", "sub-genre2"), sep=",")
Expected 3 pieces. Missing pieces filled with `NA` in 49856 rows [1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 13, 15, 20, 21, 23, 24, 27, 29, 30, 31, ...].
movies
table(movies$`main-genre`)
Action Adult Adventure Animation Biography Comedy Crime Documentary Drama Family
5300 3258 2413 4026 881 13050 3376 9326 11621 786
Fantasy Game-Show History Horror Music Musical Mystery News Reality-TV Romance
340 1058 126 1455 1511 115 237 672 1422 371
Sci-Fi Short Sport Talk-Show Thriller War Western
313 4451 283 506 306 11 23
table(movies$`sub-genre1`)
Adult Adventure Animation Biography Comedy Crime Documentary Drama Family Fantasy
9 2268 1682 37 2831 1820 1024 6848 1579 812
Game-Show History Horror Music Musical Mystery News Reality-TV Romance Sci-Fi
347 652 764 887 149 819 376 1164 2999 435
Short Sport Talk-Show Thriller War Western
9990 441 719 1071 166 36
table(movies$`sub-genre2`)
Adventure Animation Biography Comedy Crime Documentary Drama Family Fantasy Game-Show
4 1357 4 1132 166 73 2471 875 1219 80
History Horror Music Musical Mystery News Reality-TV Romance Sci-Fi Short
320 426 366 116 1921 143 413 1667 725 1487
Sport Talk-Show Thriller War Western
434 368 1243 250 121
movies <- movies %>% select(-(`sub-genre1`),-(`sub-genre2`))
movies
For this step I created a new Column named PopularityIndex by multiplying the averageRating and numVotes data. The Popularity Index gives a measure of the populatrity of the movies. Some movies might be rated highly but received only a handful number of Votes on the other hand some movies might be given an average rating but voted by a large number of users. Hence by the product of the two columns we can know the Popularity of that movies based on both votes and rating.
movies <- mutate(movies, PopularityIndex = averageRating * numVotes )
movies
On scanning the data for NA values I found that the 3 numerical columns had more than 30% values as NA. The NA values were hence not omitted due to the large number and were instead replaced by the mean of the columns. I also checked for NAN and INF values and found that there were none of the two present in my dataset.
sum(is.na(movies)) #Checking the total number of NA in the dataframe
[1] 76614
colSums(is.na(movies)) #Checking the total sum of NA in each column
tconst titleType primaryTitle originalTitle isAdult startYear runtimeMinutes
0 0 0 0 0 0 0
main-genre averageRating numVotes PopularityIndex
0 25538 25538 25538
replaceNAbyMean <- function(x){
(replace(x, is.na(x), mean(x, na.rm = TRUE)))
}
#a<-lapply(movies, replaceNAbyMean)
movies[] <- lapply(movies, replaceNAbyMean)
argument is not numeric or logical: returning NAargument is not numeric or logical: returning NAargument is not numeric or logical: returning NAargument is not numeric or logical: returning NAargument is not numeric or logical: returning NAargument is not numeric or logical: returning NA
colSums(is.na(movies))
tconst titleType primaryTitle originalTitle isAdult startYear runtimeMinutes
0 0 0 0 0 0 0
main-genre averageRating numVotes PopularityIndex
0 0 0 0
colSums(is.na(movies))
tconst titleType primaryTitle originalTitle isAdult startYear runtimeMinutes
0 0 0 0 0 0 0
main-genre averageRating numVotes PopularityIndex
0 0 0 0
is.specialorNA<- function(x){
if (is.numeric(x)) (sum(is.infinite(x)) | sum(is.nan(x)) | sum(is.na(x)))
}
sapply(movies, is.specialorNA)
$tconst
NULL
$titleType
NULL
$primaryTitle
NULL
$originalTitle
NULL
$isAdult
NULL
$startYear
[1] FALSE
$runtimeMinutes
[1] FALSE
$`main-genre`
NULL
$averageRating
[1] FALSE
$numVotes
[1] FALSE
$PopularityIndex
[1] FALSE
Checked for outliers in all 4 numerical columns; runtimeMinutes, averageRating, numVotes and PopularityIndex using boxplots. Found that all 4 have a large number of outliers. I used the capping method to take care of the outliers after which I plotted the box plots again for all 4 variables.
movies$runtimeMinutes %>% boxplot(col="lightblue",main="RUN TIME MINUTES")
movies$averageRating %>% boxplot(col="lightblue",main="Average Rating")
movies$numVotes %>% boxplot(col="lightblue",main="Number of Votes")
movies$PopularityIndex %>% boxplot(col="lightblue",main="Popularity Index")
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
return(x)
}
movies$runtimeMinutes <-cap(movies$runtimeMinutes)
movies$runtimeMinutes %>% boxplot(col="purple",main="RUN TIME MINUTES")
movies$averageRating <-cap(movies$averageRating)
movies$averageRating %>% boxplot(col="purple",main="Average Rating")
movies$numVotes <-cap(movies$numVotes)
movies$numVotes %>% boxplot(col="purple",main="Number of Votes")
movies$PopularityIndex <-cap(movies$PopularityIndex)
movies$PopularityIndex %>% boxplot(col="purple",main="Popularity Index")
In this assignment we were required to transform atleast one variable. I plotted the histogram for runtimeMinutes to check for the distribution. The distribution was not normal and was skewed towards the right. I used a number of transformations including the BoxCox but noticed that sqrt(square root) gave the best result.
hist_averageRating<-hist(movies$runtimeMinutes, col="seagreen",main="Histogram for Runtime Minutes",xlab="Minutes")