Learning by Doing Programming for Data Science

Explanation

Brief

Hi everyone👋! This is my Learning by Doing, or LBB, project as a Scholarship Grantee at Algoritma Data Science School, Vulcan Cohort. Feedback from readers and fellow data scientists is welcome!

Dataset

Netflix is one of the most popular media and video streaming platforms. It has over 8000 movies and TV shows available, and as of mid-2021 it had over 200M subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration. You can access this dataset through this link: Netflix Dataset.

Data Reading

First, the data must be read in and assigned to a variable. Make sure that the data_input folder is in the same directory as the R project. I will assign the data from the data_input folder to netflix.

netflix <- read.csv("data_input/netflix_titles.csv")
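As an optional variant, read.csv() can be told to treat blank fields as missing at read time through its na.strings argument. The sketch below uses a small made-up CSV string (not the real file) to show the effect. Note that applying this to the real data would change the missing-value counts reported later in this document, so it is shown only as an alternative.

```r
# Toy CSV standing in for netflix_titles.csv (contents are for illustration only)
csv_text <- "show_id,type,director
s1,Movie,Kirsten Johnson
s2,TV Show,
s3,TV Show,Julien Leclercq"

# na.strings lists the raw strings that read.csv should read as NA
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# The blank director for s2 is now NA instead of an empty string
colSums(is.na(toy))
```

For the real file, the same na.strings argument could be added to the read.csv() call above.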

Data Inspection

netflix

dim(netflix)

## [1] 8807   12

names(netflix)

##  [1] "show_id"      "type"         "title"        "director"     "cast"        
##  [6] "country"      "date_added"   "release_year" "rating"       "duration"    
## [11] "listed_in"    "description"

We can conclude several things from the data inspection conducted:

The netflix data contains 8807 rows and 12 columns. The columns are:
show_id = Unique ID for every movie / TV show
type = Identifier - a Movie or TV Show
title = Title of the movie / TV show
director = Name of the director of the movie / show
cast = Actors involved in the movie / show
country = Country where the movie / show was produced
date_added = Date it was added on Netflix
release_year = Actual release year of the movie / show
rating = TV rating of the movie / show
duration = Total duration - in minutes or number of seasons
listed_in = Genre
description = A short summary of the movie / show
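Because duration mixes two units (minutes for movies, seasons for TV shows), it is stored as text. A minimal sketch of how the number and the unit could be separated is shown below. The three sample values come from the first rows of the data, and the names duration_value and duration_unit are hypothetical and not used elsewhere in this project.

```r
# Sample values as they appear in the duration column
duration <- c("90 min", "2 Seasons", "1 Season")

# Keep everything before the first space and convert it to an integer
duration_value <- as.integer(sub(" .*$", "", duration))

# Label the unit based on whether the text mentions "min"
duration_unit <- ifelse(grepl("min", duration), "minutes", "seasons")

data.frame(duration, duration_value, duration_unit)
```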

Data Cleansing

str(netflix)

## 'data.frame':    8807 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "Movie" "TV Show" "TV Show" "TV Show" ...
##  $ title       : chr  "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
##  $ director    : chr  "Kirsten Johnson" "" "Julien Leclercq" "" ...
##  $ cast        : chr  "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
##  $ country     : chr  "United States" "South Africa" "" "" ...
##  $ date_added  : chr  "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
##  $ release_year: int  2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
##  $ rating      : chr  "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
##  $ duration    : chr  "90 min" "2 Seasons" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
##  $ description : chr  "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...

Based on this summary, some columns do not have the correct data type, so we must convert them.

netflix$type <- as.factor(netflix$type)
netflix$country <- as.factor(netflix$country)
netflix$rating <- as.factor(netflix$rating)
netflix$date_added <- as.Date(netflix$date_added, format = "%B %d, %Y")
str(netflix)

## 'data.frame':    8807 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : Factor w/ 2 levels "Movie","TV Show": 1 2 2 2 2 2 1 1 2 1 ...
##  $ title       : chr  "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
##  $ director    : chr  "Kirsten Johnson" "" "Julien Leclercq" "" ...
##  $ cast        : chr  "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
##  $ country     : Factor w/ 749 levels "",", France, Algeria",..: 605 428 1 1 253 1 1 665 508 605 ...
##  $ date_added  : Date, format: "2021-09-25" "2021-09-24" ...
##  $ release_year: int  2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
##  $ rating      : Factor w/ 18 levels "","66 min","74 min",..: 9 13 13 13 13 13 8 13 11 9 ...
##  $ duration    : chr  "90 min" "2 Seasons" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
##  $ description : chr  "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...

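Each of these columns now has the desired data type. Note that rating includes levels such as "66 min" and "74 min", which look like duration values entered in the wrong column.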
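When converting text to dates, stray whitespace and the session locale can both affect the result, because %B matches full month names in the current locale. The sketch below is a defensive variant, not the code used above: it trims whitespace with trimws() before parsing. The second sample value (with a leading space) is made up for illustration, and whether the real data contains such values is an assumption.

```r
# Use the C locale so that %B matches English month names
Sys.setlocale("LC_TIME", "C")

# One value from the data, one made-up value with a leading space, one blank
date_added <- c("September 25, 2021", " August 4, 2017", "")

# Trim surrounding whitespace before parsing; blanks become NA
parsed <- as.Date(trimws(date_added), format = "%B %d, %Y")
parsed
```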

Check Missing Value

We must check whether our data has missing values using is.na().

colSums(is.na(netflix))

##      show_id         type        title     director         cast      country 
##            0            0            0            0            0            0 
##   date_added release_year       rating     duration    listed_in  description 
##           98            0            0            0            0            0

We found 98 rows with a missing date_added value. When I checked the data frame, these 98 rows had no date_added value even before the conversion. In this case, I will handle the missing values by dropping those rows.
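Keep in mind that is.na() only counts NA, not empty strings. The str() output earlier shows "" in director and country, yet their missing counts are 0. A minimal sketch of counting empty strings separately, using a toy data frame built from the first four rows of those two columns:

```r
# First four director and country values as shown in the str() output
toy <- data.frame(
  director = c("Kirsten Johnson", "", "Julien Leclercq", ""),
  country  = c("United States", "South Africa", "", ""),
  stringsAsFactors = FALSE
)

# is.na() reports no missing values because "" is not NA
colSums(is.na(toy))

# Comparing against "" counts the blank entries per column
empty_counts <- colSums(toy == "")
empty_counts
```

These blank entries are left as they are in this project; only the NA values in date_added are dropped.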

I will use the tidyr package to drop the rows with missing values using drop_na().

# Import the tidyr package
library("tidyr")

netflix_clean <- netflix %>% drop_na()
colSums(is.na(netflix_clean))

##      show_id         type        title     director         cast      country 
##            0            0            0            0            0            0 
##   date_added release_year       rating     duration    listed_in  description 
##            0            0            0            0            0            0
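Called without arguments, drop_na() removes a row if any column is NA. Here only date_added has NA values, so the result is the same as targeting that column alone, which tidyr also supports with drop_na(date_added). For comparison, a base R sketch of the same column-targeted filter on a toy data frame (the toy values are for illustration only):

```r
# Toy data frame with one missing date
toy <- data.frame(
  show_id = c("s1", "s2", "s3"),
  date_added = as.Date(c("2021-09-25", NA, "2021-09-24"))
)

# Keep only the rows where date_added is not NA
toy_clean <- toy[!is.na(toy$date_added), ]
nrow(toy_clean)
```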

Great!👍 Now the Netflix dataset is ready to be processed and analyzed!

Data Explanation

summary(netflix_clean)

##    show_id               type         title             director        
##  Length:8709        Movie  :6131   Length:8709        Length:8709       
##  Class :character   TV Show:2578   Class :character   Class :character  
##  Mode  :character                  Mode  :character   Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##      cast                     country       date_added          release_year 
##  Length:8709        United States :2778   Min.   :2008-01-01   Min.   :1925  
##  Class :character   India         : 971   1st Qu.:2018-04-20   1st Qu.:2013  
##  Mode  :character                 : 827   Median :2019-07-12   Median :2017  
##                     United Kingdom: 403   Mean   :2019-05-23   Mean   :2014  
##                     Japan         : 241   3rd Qu.:2020-08-26   3rd Qu.:2019  
##                     South Korea   : 195   Max.   :2021-09-25   Max.   :2021  
##                     (Other)       :3294                                      
##      rating       duration          listed_in         description       
##  TV-MA  :3183   Length:8709        Length:8709        Length:8709       
##  TV-14  :2133   Class :character   Class :character   Class :character  
##  TV-PG  : 838   Mode  :character   Mode  :character   Mode  :character  
##  R      : 799                                                           
##  PG-13  : 490                                                           
##  TV-Y7  : 330                                                           
##  (Other): 936

Here is a brief explanation of the Netflix dataset:

Of the 8709 titles, Movies (6131) outnumber TV Shows (2578).
The United States is the most common country of origin, with 2778 titles.
The titles in this data were added to Netflix between January 1, 2008 and September 25, 2021.
The most common rating is TV-MA, with 3183 titles.

Data Manipulation and Transformation

Here are some business questions that we can ask of the Netflix dataset.

Which type of show is less common?

table(netflix_clean$type)

## 
##   Movie TV Show 
##    6131    2578

Answer: TV Show
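The answer can also be pulled out of the table programmatically instead of being read off by eye. A small sketch on a made-up factor (the five values are illustrative, and least_common is a hypothetical name):

```r
# Toy type column with three movies and two TV shows
type <- factor(c("Movie", "TV Show", "TV Show", "Movie", "Movie"))

# Count each type, then take the name of the smallest count
type_tab <- table(type)
least_common <- names(which.min(type_tab))

# Share of each type, rounded for display
round(prop.table(type_tab), 2)
least_common
```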

Sort the countries where shows were produced, from the highest count

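# Split each country string into its individual countries
s <- strsplit(as.character(netflix_clean$country), split = ", ")
# Repeat each show's type once per country so both columns line up
netflix_clean_countries_fuul <- data.frame(type = rep(netflix_clean$type, sapply(s, length)), country = unlist(s))
# Remove any leftover commas from the country names
netflix_clean_countries_fuul$country <- gsub(",", "", netflix_clean_countries_fuul$country)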

A single title can list several comma-separated countries, so we must split the country column into one row per country before counting.

country_tab <- table(netflix_clean_countries_fuul$country)
sorted_country_tab <- country_tab[order(country_tab, decreasing = TRUE)]

prop_country_tab <- prop.table(country_tab)
sorted_prop_country_tab <- prop_country_tab[order(prop_country_tab, decreasing = TRUE)]

head(sorted_country_tab)

## 
##  United States          India United Kingdom         Canada         France 
##           3643           1045            787            432            389 
##          Japan 
##            314

head(sorted_prop_country_tab)

## 
##  United States          India United Kingdom         Canada         France 
##     0.36771979     0.10548097     0.07943878     0.04360553     0.03926517 
##          Japan 
##     0.03169476

Answer: United States
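To make the splitting step easier to follow, here is a self-contained sketch of the same idea on a toy data frame. It is a variant rather than the exact code above: it splits on a bare comma, trims whitespace with trimws(), and drops empty pieces. The value ", France, Algeria" appears among the country levels shown earlier; the other rows are made up for illustration.

```r
# Toy data: one single-country title, one multi-country, one with a leading comma
toy <- data.frame(
  type = c("Movie", "TV Show", "Movie"),
  country = c("United States", "United States, India", ", France, Algeria"),
  stringsAsFactors = FALSE
)

# Split on commas; each list element holds the countries of one title
parts <- strsplit(toy$country, split = ",", fixed = TRUE)

# Build one row per (title, country) pair and trim stray spaces
long <- data.frame(
  type = rep(toy$type, lengths(parts)),
  country = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Drop the empty piece produced by the leading comma
long <- long[long$country != "", ]

# Count titles per country, highest first
country_counts <- sort(table(long$country), decreasing = TRUE)
country_counts
```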

Explanatory Text

In this project, I learned from scratch how to build an R Markdown report to gain insight into the data I chose, namely the Netflix dataset taken from Kaggle. I also applied programming for data science throughout, from reading the data to summarizing it in order to answer the business questions.