Hi everyone👋! This is my Learn by Building (LBB) project as a Scholarship Grantee in the Algoritma Data Science School Vulcan Cohort. Feedback from readers and fellow data scientists is welcome!
Netflix is one of the most popular media and video streaming platforms. They have over 8,000 movies and TV shows available on their platform and, as of mid-2021, over 200M subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration. You can access this dataset through this link: Netflix Dataset.
First of all, the data must be read in and assigned to a variable. Make sure that the file is reachable from the R project directory; here it sits in the data_input folder. I will assign the data to a variable named netflix.
netflix <- read.csv("data_input/netflix_titles.csv")
netflix
dim(netflix)
## [1] 8807 12
names(netflix)
## [1] "show_id" "type" "title" "director" "cast"
## [6] "country" "date_added" "release_year" "rating" "duration"
## [11] "listed_in" "description"
We can conclude several things from the data inspection:
The netflix data contains 8807 rows and 12 columns. The columns are:
- show_id = Unique ID for every movie / TV show
- type = Identifier: a Movie or a TV Show
- title = Title of the movie / TV show
- director = Director's name
- cast = Actors involved in the movie / show
- country = Country where the movie / show was produced
- date_added = Date it was added on Netflix
- release_year = Actual release year of the movie / show
- rating = TV rating of the movie / show
- duration = Total duration, in minutes or number of seasons
- listed_in = Genre
- description = A short summary of the movie / show
str(netflix)
## 'data.frame': 8807 obs. of 12 variables:
## $ show_id : chr "s1" "s2" "s3" "s4" ...
## $ type : chr "Movie" "TV Show" "TV Show" "TV Show" ...
## $ title : chr "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
## $ director : chr "Kirsten Johnson" "" "Julien Leclercq" "" ...
## $ cast : chr "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
## $ country : chr "United States" "South Africa" "" "" ...
## $ date_added : chr "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
## $ release_year: int 2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
## $ rating : chr "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
## $ duration : chr "90 min" "2 Seasons" "1 Season" "1 Season" ...
## $ listed_in : chr "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
## $ description : chr "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
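Based on this structure summary, some columns do not have the correct data type: type, country, and rating are stored as character instead of factor, and date_added is stored as character instead of Date. We must convert them to the correct types.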
netflix$type <- as.factor(netflix$type)
netflix$country <- as.factor(netflix$country)
netflix$rating <- as.factor(netflix$rating)
netflix$date_added <- as.Date(netflix$date_added, format = "%B %d, %Y")
str(netflix)
## 'data.frame': 8807 obs. of 12 variables:
## $ show_id : chr "s1" "s2" "s3" "s4" ...
## $ type : Factor w/ 2 levels "Movie","TV Show": 1 2 2 2 2 2 1 1 2 1 ...
## $ title : chr "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
## $ director : chr "Kirsten Johnson" "" "Julien Leclercq" "" ...
## $ cast : chr "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
## $ country : Factor w/ 749 levels "",", France, Algeria",..: 605 428 1 1 253 1 1 665 508 605 ...
## $ date_added : Date, format: "2021-09-25" "2021-09-24" ...
## $ release_year: int 2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
## $ rating : Factor w/ 18 levels "","66 min","74 min",..: 9 13 13 13 13 13 8 13 11 9 ...
## $ duration : chr "90 min" "2 Seasons" "1 Season" "1 Season" ...
## $ listed_in : chr "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
## $ description : chr "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
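Each column now has the desired data type. Note that the rating levels include values such as "66 min" and "74 min", which suggests that a few durations were entered in the rating column.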
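Because as.Date() returns NA for any string it cannot parse, it can help to check whether a date failed because the value was empty or because of its formatting. The sketch below uses made-up toy values (not rows from the dataset) and assumes an English locale, since "%B" matches month names in the current locale. It also assumes that some values might carry leading whitespace, which trimws() removes before parsing.

```r
# Toy values (assumed for illustration, not taken from the dataset).
raw_dates <- c("September 25, 2021", " August 4, 2017", "")

# Parse once as-is and once after trimming surrounding whitespace.
# "%B" relies on the current locale's month names (English assumed here).
parsed_raw  <- as.Date(raw_dates, format = "%B %d, %Y")
parsed_trim <- as.Date(trimws(raw_dates), format = "%B %d, %Y")

# Non-empty strings that still failed to parse after trimming
unparsed <- raw_dates[is.na(parsed_trim) & trimws(raw_dates) != ""]

print(parsed_raw)
print(parsed_trim)
print(unparsed)
```

If the two parsed vectors differ, the extra NA values come from formatting rather than from genuinely missing dates.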
Next, we check whether the data has missing values using is.na().
colSums(is.na(netflix))
## show_id type title director cast country
## 0 0 0 0 0 0
## date_added release_year rating duration listed_in description
## 98 0 0 0 0 0
There are 98 rows with a missing date_added value. When I checked the data frame, these 98 rows did not have a date_added value before I converted the column. In this case, I will handle the missing values by dropping those rows.
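One caveat about this check: is.na() counts only true NA values, so empty strings such as the "" entries visible in director, cast, and country in the str() output are not counted as missing. A minimal sketch of one way to surface them, using a small inline CSV made up for illustration, is to pass na.strings to read.csv():

```r
# Small inline CSV (made-up rows for illustration only).
csv_text <- "show_id,type,director,country
s1,Movie,Kirsten Johnson,United States
s2,TV Show,,South Africa
s3,TV Show,Julien Leclercq,
"

# Default behaviour: empty character fields are kept as "".
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat empty fields as NA while reading.
with_na <- read.csv(text = csv_text, na.strings = c("", "NA"),
                    stringsAsFactors = FALSE)

print(colSums(is.na(raw)))
print(colSums(is.na(with_na)))
```

Whether empty strings should count as missing depends on the analysis; treating them as NA would also change which rows a later drop of missing values removes.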
I will use the drop_na() function from the tidyr package to drop the rows with missing values.
# Import the tidyr package (it also provides the %>% pipe)
library("tidyr")
# Drop every row that has an NA in any column
netflix_clean <- netflix %>% drop_na()
colSums(is.na(netflix_clean))
## show_id type title director cast country
## 0 0 0 0 0 0
## date_added release_year rating duration listed_in description
## 0 0 0 0 0 0
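Called without column arguments, drop_na() removes rows that have an NA in any column; tidyr also accepts column names, as in drop_na(date_added), to restrict the check. The same distinction can be sketched in base R on a toy data frame (the values below are made up for illustration):

```r
# Toy data frame with NAs in two different columns (illustrative values).
toy <- data.frame(
  title = c("A", "B", "C"),
  date_added = as.Date(c("2021-09-25", NA, "2021-09-24")),
  director = c("X", "Y", NA),
  stringsAsFactors = FALSE
)

# Drop rows with an NA in any column
all_cols <- toy[complete.cases(toy), ]

# Drop rows only when date_added is NA
date_only <- toy[!is.na(toy$date_added), ]

print(all_cols)
print(date_only)
```

In this post only date_added contains NA values, so both approaches would keep the same rows.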
Great!👍 Now the Netflix dataset is ready to be processed and analyzed!
summary(netflix_clean)
## show_id type title director
## Length:8709 Movie :6131 Length:8709 Length:8709
## Class :character TV Show:2578 Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## cast country date_added release_year
## Length:8709 United States :2778 Min. :2008-01-01 Min. :1925
## Class :character India : 971 1st Qu.:2018-04-20 1st Qu.:2013
## Mode :character : 827 Median :2019-07-12 Median :2017
## United Kingdom: 403 Mean :2019-05-23 Mean :2014
## Japan : 241 3rd Qu.:2020-08-26 3rd Qu.:2019
## South Korea : 195 Max. :2021-09-25 Max. :2021
## (Other) :3294
## rating duration listed_in description
## TV-MA :3183 Length:8709 Length:8709 Length:8709
## TV-14 :2133 Class :character Class :character Class :character
## TV-PG : 838 Mode :character Mode :character Mode :character
## R : 799
## PG-13 : 490
## TV-Y7 : 330
## (Other): 936
Here are some brief explanations of the Netflix dataset:
Here are some business questions that we can ask of the Netflix dataset.
table(netflix_clean$type)
##
## Movie TV Show
## 6131 2578
Answer: TV Show
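The counts printed above can also be expressed as shares, which makes the gap between the two types easier to read. A minimal sketch, re-entering the two counts from the table by hand (the variable names are only illustrative):

```r
# Counts copied from the table(netflix_clean$type) output above.
type_counts <- c("Movie" = 6131, "TV Show" = 2578)

# Share of each type, in percent, rounded to one decimal place
type_share <- round(100 * prop.table(type_counts), 1)
print(type_share)

# Type with the most and the fewest titles
most  <- names(which.max(type_counts))
least <- names(which.min(type_counts))
print(most)
print(least)
```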
s <- strsplit(as.character(netflix_clean$country), split = ", ")
netflix_clean_countries_fuul <- data.frame(type = rep(netflix_clean$type, sapply(s, length)), country = unlist(s))
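netflix_clean_countries_fuul$country <- as.character(gsub(",","",netflix_clean_countries_fuul$country))
Because a title can list several countries in a single comma-separated string, we must split the country column so that each country is counted separately.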
country_tab <- table(netflix_clean_countries_fuul$country)
sorted_country_tab <- country_tab[order(country_tab, decreasing = T)]
prop_country_tab <- (prop.table(table(netflix_clean_countries_fuul$country)))
sorted_prop_country_tab <- prop_country_tab[order(prop_country_tab, decreasing = T)]
head(sorted_country_tab)
##
## United States India United Kingdom Canada France
## 3643 1045 787 432 389
## Japan
## 314
head(sorted_prop_country_tab)
##
## United States India United Kingdom Canada France
## 0.36771979 0.10548097 0.07943878 0.04360553 0.03926517
## Japan
## 0.03169476
Answer: United States
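The country-splitting step used for this answer can be tried on a toy data frame to see what it does to each row. This is only a sketch with made-up rows; it uses lengths(s), which is equivalent to the sapply(s, length) call above. Note that an empty country string splits into zero pieces, so that row disappears from the result.

```r
# Toy rows (made up): one multi-country title, one single, one empty.
toy <- data.frame(
  type = c("Movie", "TV Show", "Movie"),
  country = c("United States, India", "Japan", ""),
  stringsAsFactors = FALSE
)

# Split each country string on ", " into a list of character vectors
s <- strsplit(toy$country, split = ", ")

# Repeat each type once per country so the two columns line up
long <- data.frame(
  type = rep(toy$type, lengths(s)),
  country = unlist(s),
  stringsAsFactors = FALSE
)

print(long)
print(table(long$country))
```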
From this project, I learned from scratch how to write an R Markdown document to gain insight into the data I chose, namely the Netflix dataset taken from Kaggle. I also applied programming for data science throughout, from reading and cleaning the data to summarizing it in order to answer the business questions.