Netflix Data Cleaning And Data Manipulating

Project Overview

The aim of this portfolio project was to analyze and present insights on a data set containing movie details from 2015 to 2021. The initial data set required cleaning before any meaningful conclusions could be drawn from it.

To clean the data, various processes were employed. Additionally, instead of using the ggplot2 package, Tableau was used to showcase some of the Tableau skills.

The final result of this project is a thorough analysis and presentation of insights from the cleaned data set, demonstrating the proficiency in data cleaning, data analysis, and data visualization using Tableau.

Project Problems

This portfolio project aimed to answer several questions related to a data set containing movie details from 2015 to 2021. These questions include:

Is the number of genres related to getting a high IMDB score or being nominated for awards?
Is the number of languages related to getting a high IMDB score or being nominated for awards?
Is the number of countries a movie is released in related to getting a high IMDB score or being nominated for awards?

The goal of this project was to use data cleaning, analysis, and visualization techniques to answer these questions and provide valuable insights into the movie industry and Netflix users.

Cleaning the Data Processes

Step-1

Package Installation

library(tidyverse)
library(dplyr)
library(readr)
library(stringr)

Step-2

Loading the Data into R

library(readr)
netflix_raw_file <- read_csv("netflix-rotten-tomatoes-metacritic-imdb_raw_file.csv")

Data Types

glimpse(netflix_raw_file)

## Rows: 15,480
## Columns: 29
## $ Title                   <chr> "Lets Fight Ghost", "HOW TO BUILD A GIRL", "Ce…
## $ Genre                   <chr> "Crime, Drama, Fantasy, Horror, Romance", "Com…
## $ Tags                    <chr> "Comedy Programmes,Romantic TV Comedies,Horror…
## $ Languages               <chr> "Swedish, Spanish", "English", "English", "Tur…
## $ `Series or Movie`       <chr> "Series", "Movie", "Movie", "Series", "Movie",…
## $ `Hidden Gem Score`      <dbl> 4.3, 7.0, 6.4, 7.7, 8.1, 8.6, 8.7, 6.9, 8.3, 5…
## $ `Country Availability`  <chr> "Thailand", "Canada", "Canada", "Belgium,Nethe…
## $ Runtime                 <chr> "< 30 minutes", "1-2 hour", "1-2 hour", "< 30 …
## $ Director                <chr> "Tomas Alfredson", "Coky Giedroyc", "Brendan W…
## $ Writer                  <chr> "John Ajvide Lindqvist", "Caitlin Moran", "Bre…
## $ Actors                  <chr> "Kåre Hedebrant, Per Ragnar, Lina Leandersson,…
## $ `View Rating`           <chr> "R", "R", "Unrated", NA, NA, NA, NA, NA, NA, N…
## $ `IMDb Score`            <dbl> 7.9, 5.8, 4.3, 6.5, 6.3, 7.4, 7.5, 3.9, 6.7, 6…
## $ `Rotten Tomatoes Score` <dbl> 98, 79, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20…
## $ `Metacritic Score`      <dbl> 82, 69, 46, NA, NA, NA, NA, NA, NA, NA, NA, 36…
## $ `Awards Received`       <dbl> 74, 1, NA, 1, NA, NA, 2, NA, 2, NA, NA, NA, 7,…
## $ `Awards Nominated For`  <dbl> 57, NA, NA, NA, 4, NA, 4, NA, 1, NA, 7, 1, 2, …
## $ Boxoffice               <chr> "$2,122,065", "$70,632", "$16,263", NA, NA, NA…
## $ `Release Date`          <chr> "12 Dec 2008", "08 May 2020", "28 Aug 2020", "…
## $ `Netflix Release Date`  <date> 2021-03-04, 2021-03-04, 2021-03-04, 2021-03-0…
## $ `Production House`      <chr> "Canal+, Sandrew Metronome", "Film 4, Monument…
## $ `Netflix Link`          <chr> "https://www.netflix.com/watch/81415947", "htt…
## $ `IMDb Link`             <chr> "https://www.imdb.com/title/tt1139797", "https…
## $ Summary                 <chr> "A med student with a supernatural gift tries …
## $ `IMDb Votes`            <dbl> 205926, 2838, 1720, 1147, 63, 131, 47, 107, 88…
## $ Image                   <chr> "https://occ-0-4708-64.1.nflxso.net/dnm/api/v6…
## $ Poster                  <chr> "https://m.media-amazon.com/images/M/MV5BOWM4N…
## $ `TMDb Trailer`          <chr> NA, "https://www.youtube.com/watch?v=eIbcxPy4o…
## $ `Trailer Site`          <chr> NA, "YouTube", "YouTube", NA, NA, "YouTube", N…

Step-3

Trimming & Naming Conventions (Using of clean_name () function)

netflix_raw_file_trimmed <- netflix_raw_file %>% 
                            mutate_all(str_trim)
library(janitor)
netflix_column_name_changed <- netflix_raw_file_trimmed %>% 
                               clean_names()
netflix_column_name_changed <- netflix_column_name_changed %>% 
                               rename(series_movie = series_or_movie, imdb_link = im_db_link, 
                               imdb_score = im_db_score, imdb_votes = im_db_votes, 
                               tmdb_trailer = tm_db_trailer)
colnames(netflix_column_name_changed)

##  [1] "title"                 "genre"                 "tags"                 
##  [4] "languages"             "series_movie"          "hidden_gem_score"     
##  [7] "country_availability"  "runtime"               "director"             
## [10] "writer"                "actors"                "view_rating"          
## [13] "imdb_score"            "rotten_tomatoes_score" "metacritic_score"     
## [16] "awards_received"       "awards_nominated_for"  "boxoffice"            
## [19] "release_date"          "netflix_release_date"  "production_house"     
## [22] "netflix_link"          "imdb_link"             "summary"              
## [25] "imdb_votes"            "image"                 "poster"               
## [28] "tmdb_trailer"          "trailer_site"

Step-4

Duplicate Removing

(no_of_rows <- nrow(netflix_column_name_changed))

## [1] 15480

duplicate_rows <- netflix_column_name_changed %>% 
                  duplicated.data.frame()
netflix_unique_rows <- netflix_column_name_changed %>%
                       distinct(title, .keep_all = TRUE)
(no_of_rows <- nrow(netflix_unique_rows))

## [1] 15071

##There are 409 titles which are duplicate and removed from the data.

Step-5

Converting to same unit (Using of case_when() function)

netflix_runtime_changed <- netflix_unique_rows %>% 
                           mutate(runtime = case_when(runtime == '1-2 hour' ~ '60-120 minutes', 
                                                      runtime == '> 2 hour' ~ '> 120 minutes',
                                                      runtime == '< 30 minutes' ~ '< 30 minutes',
                                                      runtime == '30-60 minutes' ~ '30-60 minutes',
                                                      runtime == NA ~ 'Unknown'))

Step-6

Filling in Missing Data for Genre Column

The “genre” column had 1706 missing or “NA” values, which needed to be addressed. To fill in the missing data, I decided to use the information available in the “Tags” column.

genre_NA <- netflix_runtime_changed %>% 
            filter(is.na(genre))
nrow(genre_NA)

## [1] 1706

Step-6.1

Need to create a list of unique genres to help fill in missing data in the “genre” column.

#Selecting only genre column without Null value
genre_not_NA <- netflix_runtime_changed %>% 
                filter(!is.na(genre)) %>% 
                select(genre)
#Separating all the genres in the data set using a delimiter (",")
separate_genre <- str_split_fixed(genre_not_NA$genre, ",",15)
separate_genre_1 <- genre_not_NA %>% 
                    separate(genre,into = paste0("col",1:12),sep= ",")
#Unioning all the separated genres into a single list
union_genre <- data.frame(x=unlist(separate_genre_1))
               nrow(union_genre)

## [1] 160380

#Filtering out duplicate values to create a list of distinct genres
unique_genre <- union_genre %>% distinct()
nrow(unique_genre)

## [1] 56

#Trimming for extra space
unique_genre$x <- gsub(" ", "", unique_genre$x)
unique_genre <- union_genre %>% distinct()
nrow(unique_genre)

## [1] 56

# While the initial count suggested that there were 56 unique genres, further investigation revealed that there were only 28 unique genres present.To address this issue, i decided to remove all duplicate values and NA values through Excel. By cleaning up the data in this way ensuring that the data set was complete, accurate, and free from any errors that might have compromised the analysis. This note serves as a reminder of the importance of quality control and error checking throughout the data cleaning process.
#To save the data in excel file
library(openxlsx)
write.xlsx(unique_genre,"unique_genre.xlsx", row.namees = FALSE )
# Now reloading the updated unique_genre file
updated_unique_genre <- read.xlsx("unique_genre_1.xlsx")
nrow(updated_unique_genre)

## [1] 28

Step-6.2

28 unique genres identified present in the data set. Similarly, we will also determine the unique tags in the “tags” column.

# Identifying 12 number of rows in the data set that do not have any genre or tags information.
genre_tags_NA <- netflix_runtime_changed %>% 
                 filter(is.na(tags)) %>% 
                 filter(is.na(genre))
nrow(genre_tags_NA)

## [1] 12

# Identifying 66 number of rows in the data set that do not have any tags information.
tags_NA <- netflix_runtime_changed %>% 
           filter(is.na(tags))
nrow(tags_NA)

## [1] 66

We have identified that there are 44 rows in the data set that have genre information but no tags information, and 12 rows that have no genre or tags information.

#Selecting only tags column without Null value
tags_not_NA <- netflix_runtime_changed %>% 
               filter(!is.na(tags)) %>% 
               filter(!is.na(genre)) %>% 
               select(tags)
nrow(tags_not_NA)

## [1] 13311

#Separating all the tags in the data set using a delimiter (",")
separate_tags <- str_split_fixed(tags_not_NA$tags,",",33)
separate_tags_1 <- tags_not_NA %>% 
                   separate(tags,into = paste0("col_",1:33),",")
nrow(separate_tags_1)

## [1] 13311

#Unioning all the separated tags into a single list
union_tags <- data.frame(x=unlist(separate_tags_1))
nrow(union_tags)

## [1] 439263

#Trimming for extra space
unique_tags_1 <- union_tags %>% 
                 mutate_all(str_trim)
#Filtering out duplicate values to create a list of distinct tags
unique_tags_2 <- unique_tags_1 %>% 
                 distinct(x)
nrow(unique_tags_2)

## [1] 1001

# While the initial count suggested that there were 1000 unique genres, further investigation revealed that there were only 985 unique genres present.To address this issue, i decided to remove all duplicate values and NA values through Excel. By cleaning up the data in this way ensuring that the data set was complete, accurate, and free from any errors that might have compromised the analysis. This note serves as a reminder of the importance of quality control and error checking throughout the data cleaning process.
write.xlsx(unique_tags_2,"unique_tags_2.xlsx", rowNames = FALSE)
# Now reinserting the updated unique_genre file.
updated_unique_tag <- read.xlsx("unique_tags_2_Cleaned.xlsx")
nrow(updated_unique_tag)

## [1] 985

Step-6.3

There are 985 unique tags present in our data set. The next step involved matching each tag with 28 unique genres created earlier to help with this process.

# It is identified that some tags in the data set had more than one genre listed in each cell.First separate the tags column with a space delimiter.
updated_unique_tag$tags_split <- strsplit(as.character(updated_unique_tag$tags), " ")
#Creating a function to match the first four letter of genre and first four letter of tags and if matched then give the genre name.And if more than one genre then add with comma delimiter.
updated_unique_tag$genres <- lapply(updated_unique_tag$tags_split, function(x) {
                             matching_genre <- ""
                             for (word in x) {
                             if (substr(word, 1, 4) %in% substr(updated_unique_genre$genre, 1, 4)) {
                             matching_genre <- paste(matching_genre, updated_unique_genre$genre[substr(updated_unique_genre$genre, 1, 4) == substr(word, 1, 4)], sep = ",")
    }
  }
  matching_genre
})
nrow(updated_unique_tag)

## [1] 985

write.xlsx(updated_unique_tag, "updated_unique_tag.xlsx")
#Now i have removed commas and modify this on excel as it is easy to do in excel as this have only small data set.
tags_genre_list <- read.xlsx("updated_unique_tag__genre_cleaned.xlsx")
nrow(tags_genre_list)

## [1] 985

# Identifying the number of rows which have no genre in this tags list.
tags_have_no_genre <- tags_genre_list %>% 
                      filter(is.na(genre))
nrow(tags_have_no_genre)

## [1] 219

#There are 219 tags have no genres.

Step-6.4 (Use of left_join(), row_number() functions)

We will match the tags in tags column and put genre in them.

# Separating the words in 'genre_NA_Tags_not_NA' and pivoting.
genre_NA_Tags_not_NA <- genre_NA %>% 
                        select(title, genre, tags)
df_sep <- genre_NA_Tags_not_NA %>%
          mutate(row_num = row_number()) %>%
          separate_rows(tags, sep = ",")
nrow(df_sep)

## [1] 8485

# join df_sep with tags_genre_list based on matching words in tags column using join function
df_join <- df_sep %>%
           left_join(tags_genre_list, by = c("tags" = "tags"))
nrow(df_join)

## [1] 8485

# group the data by the original row numbers in genre_NA_Tags_not_NA and concatenate the values in genre.y column without adding null values.
df_result <- df_join %>%
             group_by(row_num) %>%
             summarise(col2 = paste(na.omit(genre.y), collapse = ",")) %>%
             ungroup() %>%
             select(-row_num)
             nrow(df_result)

## [1] 1706

# Merging the result with the original 'genre_NA_Tags_not_NA'
updated_genre_NA_Tags_not_NA <- cbind(genre_NA_Tags_not_NA, df_result)
#Selecting all the rows which have no genre information with "Unknown".
updated_genre_NA_Tags_not_NA$col2[updated_genre_NA_Tags_not_NA$col2 == ""] <- "Unknown"
updated_genre_NA_Tags_not_NA <- updated_genre_NA_Tags_not_NA %>% 
                                select(col2, title, genre,tags)
# Define a function to remove duplicates from each cell
remove_duplicates <- function(x) {
                     if (is.na(x) || x == "") {
                     # Return an empty string for NA or empty cells
                     return("")
                     } else {
                     # Split the cell by comma
                     split_cell <- strsplit(x, ",")[[1]]
                     # Remove duplicates and return the modified cell
                     return(paste0(unique(split_cell), collapse = ","))
                                              }
                                  }
# Applying the remove_duplicates function to each cell. 
updated_genre_NA_Tags_not_NA$new_genre <- apply(updated_genre_NA_Tags_not_NA, 1, remove_duplicates)

# Converting each row to a string and wrap with parentheses
updated_genre_NA_Tags_not_NA$new_genre <- paste0(updated_genre_NA_Tags_not_NA$new_genre)

# Removing row names
rownames(updated_genre_NA_Tags_not_NA) <- NULL
update_genre_tags <-  updated_genre_NA_Tags_not_NA %>% 
                      select(title, new_genre,tags)
# Combining this data with main file before assigning genre file data using left join.
new_file <- netflix_runtime_changed %>% 
            left_join(update_genre_tags, by = c("title" = "title"))
# Assigning the missing genres for the 1706 rows.
new_file$genre[is.na(new_file$genre)] <- new_file$new_genre[is.na(new_file$genre)]
# Removing unnecessary columns
new_file <- new_file %>% 
            select(-tags.y, -new_genre)
# Saving the file
write.csv(new_file, "new_file.csv")
# Now we got all the genre column filled up.

Step-7

Creating a genre tag file

genre <- new_file %>% 
         select(title, genre)
nrow(genre)

## [1] 15071

# Trimming all column
genre <- genre %>% 
         mutate_all(str_trim)
# Adding row number and separating in each row
genre_1 <- genre %>% 
           mutate(row_number = row_number()) %>% 
           separate_rows(genre, sep = ",") %>% 
           ungroup() %>% 
           select(-row_number)
nrow(genre_1)

## [1] 38895

# saving file
write.csv(genre_1, "cleaned_genre_file.csv")

Step-8

Lets work on language column first we will find how many tittles have no languages

langauge_NA <- new_file %>% 
               filter(is.na(languages))
nrow(langauge_NA)

## [1] 1926

So there are 1926 rows have no language. There is no data provided for filling up these empty cells.

Step-8.1

Separating the multiple languages for each movie and create a new file for only tittle and languages column.

# Selecting only title and language column
library(knitr)
language_column <- new_file %>% 
                   select(title, languages)
nrow(language_column)

## [1] 15071

# Trimming all column
language_column <- language_column %>% 
                   mutate_all(str_trim)
# Adding row number and separating in each row
language_column_1 <- language_column %>% 
                     mutate(row_number = row_number()) %>% 
                     separate_rows(languages, sep = ",") %>% 
                     ungroup() %>% 
                     select(-row_number)
nrow(language_column_1)

## [1] 20768

# saving file
write.csv(language_column_1, "cleaned_language_file.csv")
cleaned_language_file <- language_column_1
#To find unique number of languages
unique_language <- cleaned_language_file %>% 
                   select(languages)
unique_language_list <- unique_language %>% 
                        distinct(languages) %>% 
                        filter(!is.na(languages))
nrow(unique_language_list)

## [1] 286

# Netflix movies releases in 198 languages.

There are 198 unique languages of movie & series content release by Netflix.

Step-9

Working on the country_availability column to determine the number of titles that have no country data.

country_NA <- new_file %>% 
              filter(is.na(country_availability))
nrow(country_NA)

## [1] 18

So there are 18 rows have no language. There is no data provided for filling up these empty cells.

Step-9.1

Separating the multiple countries for each movie and create a new file for only tittle and country_availability column.

country <- new_file %>% 
           select(title, country_availability)
nrow(country)

## [1] 15071

# Trimming all column
country <- country %>% 
           mutate_all(str_trim)
# Adding row number and separating in each row
country_1 <- country %>% 
             mutate(row_number = row_number()) %>% 
             separate_rows(country_availability, sep = ",") %>% 
             select(-row_number)
nrow(country_1)

## [1] 190778

# Trimming for blank spaces
cleaned_country_file <- country_1
# saving file
write.csv(cleaned_country_file, "cleaned_country_file.csv")
#To find unique number of countries
unique_country <- cleaned_country_file %>% 
                  select(country_availability)
unique_country_list <- unique_country %>% 
                       distinct(country_availability) %>% 
                       filter(!is.na(country_availability))
nrow(unique_country_list)

## [1] 36

# Netflix movies releases in 36 countries.

Netflix have most of the consumers in this 36 countries. Project Solutions

The project solution was analyzed and visualized using Tableau. To access the visualizations, kindly follow the provided Tableau profile link below. https://public.tableau.com/views/NetflixFinal_16781166617170/Story2?:language=en-GB&:display_count=n&:origin=viz_share_link

Summary on Genre

1.After analyzing the genres of titles on Netflix, it was found that a large proportion of titles have between 1 to 4 genres. This indicates that Netflix tends to produce titles that focus on a specific genre or a combination of a few genres.

2.Additionally, this trend is consistent across both movies and series on the platform.

3.The analysis of award-winning titles on Netflix revealed that titles with 1 to 4 genres tend to receive more awards than those with more genres. Additionally, as the number of genres increases, the number of awards received decreases. It’s worth noting that there are only 1460 titles on the platform with 5 to 12 genres.

4.The analysis of award-nominating titles on Netflix revealed that titles with 2 to 5 genres tend to nominate for more awards than those with more genres. Additionally, as the number of genres increases, the number of nominating for awards decreases.

5.The analysis of IMDB scores for titles on Netflix showed that, in general, series titles tend to have higher scores than movie titles. This suggests that audiences may be more drawn to series content on the platform due to their higher perceived quality.

Summary on Language

1.After analyzing the languages of titles on Netflix, it was found that a large proportion of titles have one language. This indicates that Netflix tends to add titles that focus on a specific language.

2.Additionally, this trend is consistent across both movies and series on the platform.

3.The analysis of award-winning titles on Netflix showed that titles with 1 to 4 languages tend to receive more awards than those with more languages. This trend is consistent for movies only and no such relationship exists for series content.

4.This analysis of award- nominating titles on Netflix revealed that titles with more languages tend to get nominated for awards. However, this trend only applies to movies and not series content.

5.The analysis of IMDB scores for titles on Netflix showed that, in general, series titles tend to have higher scores than movie titles.

Summary on Country

1.The analysis reveals that Netflix tends to release a highest count of movies for single country count, indicating a focus on catering to specific countries’ preferences.

2.In contrast, the highest number of series titles are released in countries where Netflix has a consumer base, indicating a strong global popularity for series content on the platform.

3.The analysis reveals that series with widespread international releases have a higher chance of winning awards, while the opposite is true for those with limited country releases. Interestingly, the study finds no significant relationship between the content of movies and their award-winning potential.

4.The analysis reveals that series with widespread international releases have a higher chance of nominate for awards. The study finds no significant relationship between the content of movies and their award-nominating potential.

5.The analysis of IMDB scores for titles on Netflix showed that, in general, series titles tend to have higher scores than movie titles.

                                                    **END**

Netflix Data Cleaning And Data Manipulating

Himanshu

2023-03-09