1 Intro

In this project, we analyze the Netflix dataset from Kaggle. Netflix is one of the most popular media and video streaming platforms: it offers over 8,000 movies and TV shows and, as of mid-2021, has over 200 million subscribers worldwide. After analyzing the data, we visualize the results.

1.1 Input Data

First, we read the data.

netflix <- read.csv("data/netflix.csv")

Then we look at the first 10 rows of the data.

head(netflix,10)

Next, we check the dimensions of the data.

dim(netflix)
## [1] 8807   12

The data contains 8807 rows and 12 columns.
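As a side note on reading the file: read.csv() can treat empty fields as missing values at import time through its na.strings argument. The sketch below is a minimal illustration on a small temporary CSV. The file contents and the name toy are made up for the example and are not the Kaggle data.

```r
# Write a tiny, made-up CSV to a temporary file (not the real Netflix data).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("show_id,type,director",
    "s1,Movie,Kirsten Johnson",
    "s2,TV Show,"),
  csv_path
)

# na.strings tells read.csv() which raw values to read as NA.
toy <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

dim(toy)             # number of rows and columns
is.na(toy$director)  # the empty director field is read as NA
```

With this option, is.na() would count empty fields directly, at the cost of no longer distinguishing an empty string from a true NA.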

2 Data Cleansing

The first step in conducting data analysis is to ensure that the data to be used is clean.

2.1 Load Libraries

First, we load the required libraries.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)

2.2 Explicit Coercion

Next, we check the data type of every column.

str(netflix)
## 'data.frame':    8807 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "Movie" "TV Show" "TV Show" "TV Show" ...
##  $ title       : chr  "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
##  $ director    : chr  "Kirsten Johnson" "" "Julien Leclercq" "" ...
##  $ cast        : chr  "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
##  $ country     : chr  "United States" "South Africa" "" "" ...
##  $ date_added  : chr  "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
##  $ release_year: int  2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
##  $ rating      : chr  "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
##  $ duration    : chr  "90 min" "2 Seasons" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
##  $ description : chr  "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...

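Based on the output above, several columns have an unsuitable data type: type is stored as character instead of factor, date_added is stored as character instead of date, and release_year is stored as integer instead of date-time. We therefore convert them explicitly.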

netflix$type <- as.factor(netflix$type)
netflix$date_added <- mdy(netflix$date_added)
netflix$release_year <- parse_date_time(netflix$release_year,'y')
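To make the conversions above concrete, here is a minimal sketch on a made-up three-row data frame (toy is a hypothetical name, not the real data). It assumes an English locale for month names and shows that a value mdy() cannot parse, such as an empty string, becomes NA.

```r
library(lubridate)

# Made-up sample rows; the last date is deliberately empty.
toy <- data.frame(
  type = c("Movie", "TV Show", "TV Show"),
  date_added = c("September 25, 2021", "September 24, 2021", ""),
  stringsAsFactors = FALSE
)

toy$type <- as.factor(toy$type)
# quiet = TRUE suppresses the warning about the unparseable empty string.
toy$date_added <- mdy(toy$date_added, quiet = TRUE)

str(toy)
sum(is.na(toy$date_added))  # number of values that failed to parse
```

Counting NA values right after a conversion like this is a cheap way to see how many entries could not be parsed.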

2.3 Check Missing Values

colSums(is.na(netflix))
##      show_id         type        title     director         cast      country 
##            0            0            0            0            0            0 
##   date_added release_year       rating     duration    listed_in  description 
##           10            0            0            0            0            0

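colSums(is.na()) reports only 10 missing values, all in date_added. However, the str() output above shows that columns such as director, cast, and country also contain empty strings, which are missing values as well. We first convert the empty strings in director, cast, country, and rating to NA, then replace every NA with the string “Missing Values”. Because date_added is a date column, its NA values are replaced with a placeholder date instead.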

netflix$director[netflix$director==""] <- NA
netflix$cast[netflix$cast==""] <- NA
netflix$country[netflix$country==""] <- NA
netflix$rating[netflix$rating==""] <- NA
netflix$director[which(is.na(netflix$director))] <- "Missing Values"
netflix$cast[which(is.na(netflix$cast))] <- "Missing Values"
netflix$country[which(is.na(netflix$country))] <- "Missing Values"
netflix$date_added[which(is.na(netflix$date_added))] <- "01-01-01" # placeholder (stored as 0001-01-01), because the date_added column has a date data type
netflix$rating[which(is.na(netflix$rating))] <- "Missing Values"
colSums(is.na(netflix))
##      show_id         type        title     director         cast      country 
##            0            0            0            0            0            0 
##   date_added release_year       rating     duration    listed_in  description 
##            0            0            0            0            0            0

Great! The data no longer contains NA values.
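Because is.na() does not flag empty strings, it can help to count both kinds of missing value in one pass. The following is a minimal sketch on made-up data. The helper name count_missing is hypothetical and not part of any package.

```r
# Made-up sample with empty strings and one NA date.
toy <- data.frame(
  director = c("Kirsten Johnson", "", "Julien Leclercq"),
  country = c("United States", "South Africa", ""),
  date_added = as.Date(c("2021-09-25", NA, "2021-09-24")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per column, count NA values and, for character
# columns only, empty strings as well.
count_missing <- function(df) {
  vapply(df, function(x) {
    if (is.character(x)) sum(is.na(x) | x == "") else sum(is.na(x))
  }, integer(1))
}

missing_counts <- count_missing(toy)
missing_counts
```

The empty-string check is restricted to character columns because comparing a Date column with "" raises an error.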

2.4 Finishing Data Cleansing

First, we check the data again.

head(netflix,10)

Next, we keep only the columns needed for the analysis by dropping show_id, director, cast, country, and description.

netflix <- netflix %>% select(-c(show_id, director, cast, country, description))
head(netflix,10)

Next, we take the first category from the listed_in column.

netflix <- netflix %>% separate(listed_in, c("category", "category2", "category3"), sep = ",") 
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 5078 rows [1, 4,
## 7, 9, 10, 13, 14, 16, 17, 19, 23, 24, 28, 29, 30, 32, 35, 38, 39, 40, ...].
netflix <- netflix %>% select(-c(category2, category3))
head(netflix,10)
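If only the first category is needed, it can also be extracted directly, without creating and then dropping the extra columns. This is a minimal base R sketch on made-up listed_in values. It is an alternative illustration, not the method used above.

```r
# Made-up listed_in values with one, three, and two categories.
listed_in <- c(
  "Documentaries",
  "International TV Shows, TV Dramas, TV Mysteries",
  "Docuseries, Reality TV"
)

# Remove everything from the first comma onward, then trim whitespace.
category <- trimws(sub(",.*$", "", listed_in))
category
```

Values without a comma are returned unchanged, so no fill warning is produced.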

Finally, we create two new columns, year_added and month_added, from the date_added column.

netflix$year_added <- year(netflix$date_added)
netflix$month_added <- month(netflix$date_added, label = TRUE)
head(netflix,10)

Data cleaning is now complete, and the data is ready for analysis and visualization.

3 Data Explanation

summary(netflix)
##       type         title             date_added        
##  Movie  :6131   Length:8807        Min.   :0001-01-01  
##  TV Show:2676   Class :character   1st Qu.:2018-04-03  
##                 Mode  :character   Median :2019-07-01  
##                                    Mean   :2017-01-30  
##                                    3rd Qu.:2020-08-18  
##                                    Max.   :2021-09-25  
##                                                        
##   release_year                    rating            duration        
##  Min.   :1925-01-01 00:00:00   Length:8807        Length:8807       
##  1st Qu.:2013-01-01 00:00:00   Class :character   Class :character  
##  Median :2017-01-01 00:00:00   Mode  :character   Mode  :character  
##  Mean   :2014-03-07 16:27:24                                        
##  3rd Qu.:2019-01-01 00:00:00                                        
##  Max.   :2021-01-01 00:00:00                                        
##                                                                     
##    category           year_added    month_added  
##  Length:8807        Min.   :   1   Jul    : 827  
##  Class :character   1st Qu.:2018   Dec    : 813  
##  Mode  :character   Median :2019   Sep    : 770  
##                     Mean   :2017   Apr    : 764  
##                     3rd Qu.:2020   Oct    : 760  
##                     Max.   :2021   Aug    : 755  
##                                    (Other):4118

  1. In the type column, Movie has 6131 titles and TV Show has 2676 titles.
  2. The most recent title in this data was added to Netflix on September 25th, 2021.
  3. The release years of the Movies/TV Shows on Netflix range from 1925 to 2021.
  4. The latest year in which Netflix added a Movie/TV Show is 2021. The minimum of date_added (0001-01-01) and of year_added (1) comes from the placeholder used for missing dates, not from a real value.
  5. Netflix most often added Movies/TV Shows in July, December, and September.
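The summary above reports a minimum year_added of 1, which reflects the placeholder date used for missing date_added values. A placeholder like this also pulls the mean down. The sketch below uses made-up numbers to show the effect and an alternative that keeps NA and skips it with na.rm = TRUE.

```r
# Made-up years; NA stands for a missing date_added.
year_added <- c(2019, 2021, NA)

# Variant 1: replace NA with the placeholder year 1, as in the cleaning step.
with_placeholder <- year_added
with_placeholder[is.na(with_placeholder)] <- 1
mean_with_placeholder <- mean(with_placeholder)

# Variant 2: keep NA and ignore it when averaging.
mean_without_na <- mean(year_added, na.rm = TRUE)

mean_with_placeholder
mean_without_na
```

Which variant is preferable depends on the analysis. The second keeps summary statistics on the scale of the real data.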

4 Case Study

  1. Compare the number of Movies and TV Shows by year of release.

ggplot(netflix, mapping = aes(x=release_year, fill = type)) +
  geom_histogram()+
  labs(title="Netflix Films Released by Year", x="Release Year", y="Total Film")+
  scale_fill_manual(values = c("Movie" = "Red","TV Show" = "Black"))+
  theme_minimal()+
  theme(plot.title = element_text(face="bold",hjust = 0.5))

Answer: The numbers of both Movies and TV Shows tend to increase with release year. Movies outnumber TV Shows in every year. However, TV Shows also show a marked increase from 2000 to 2021.

  2. Which categories are in the top 10 on Netflix?

top_categories <- netflix %>% group_by(category) %>% count() %>% arrange(desc(n))
top10_categories <- top_categories %>% filter(n>270)

top10_categories
ggplot(top10_categories,mapping=aes(x=n, y=reorder(category,n)))+
  geom_col(aes(fill=n),color = "maroon",show.legend = F)+
  scale_fill_gradient(low="pink",high="#cf2e2e")+
  labs(title = "Netflix's Top 10 Categories", x = "Total Film", y = NULL)+
  theme_minimal()+
  theme(plot.title=element_text(face="bold", hjust = 0.5))+
  geom_label(data=top10_categories[1:5,], mapping=aes(label=n))+
  geom_vline(xintercept = mean(top10_categories$n), col="yellow",linetype=2,lwd=1)

Answer: The plot above shows the top 10 categories on Netflix. The yellow dashed line marks the mean number of titles across these 10 categories. Five categories exceed the mean: Dramas, Comedies, Action & Adventure, Documentaries, and International TV Shows.
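The code above selects the top 10 with a hard-coded count threshold (n>270), which only works for this particular snapshot of the data. As a hedged alternative, dplyr's slice_max() picks the top rows by count directly. The sketch below uses made-up categories and selects the top 2 to keep the example small.

```r
library(dplyr)

# Made-up categories: Dramas x3, Comedies x2, two categories with one title.
toy <- data.frame(
  category = c("Dramas", "Dramas", "Dramas", "Comedies", "Comedies",
               "Documentaries", "Kids' TV"),
  stringsAsFactors = FALSE
)

top2 <- toy %>%
  count(category) %>%
  # with_ties = FALSE returns exactly n rows even when counts are tied.
  slice_max(order_by = n, n = 2, with_ties = FALSE)

top2
```

For the real data, n = 10 would take the place of the threshold. Whether ties should be kept is a judgment call.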

  3. Which ratings are in the top 10 on Netflix?

top_ratings <- netflix %>% group_by(rating) %>% count() %>% arrange(desc(n))
top10_ratings <- top_ratings %>% filter(n>50)

top10_ratings
ggplot(data = top10_ratings, mapping=aes(x=n,y=reorder(rating,n)))+
  geom_col(aes(fill=n), color="black", show.legend = F)+
  scale_fill_gradient(low="#79DAE8",high="#0AA1DD")+
  labs(title="Netflix's Top 10 Ratings", x = "Total Film", y= NULL)+
  theme_minimal()+
  theme(plot.title = element_text(face="bold",hjust=0.5))+
  geom_label(data = top10_ratings[1:2,], mapping=aes(label=n))+
  geom_vline(xintercept = mean(top10_ratings$n), col = "#FCF69C",linetype=2,lwd=1)

Answer: The plot above shows the top 10 ratings on Netflix. The yellow dashed line marks the mean number of titles across these 10 ratings. Two ratings exceed the mean: TV-MA and TV-14.

  4. In which month does Netflix add the most films?

netflix %>% group_by(month_added,type)%>% 
  count() %>% 
  ggplot(aes(x=month_added,y=n,fill=type))+
  geom_col(aes(fill=type))+
  labs(title="Netflix Films Added by Month", x="Month", y="Total Film")+
  theme_minimal()+
  theme(plot.title=element_text(face="bold",hjust=0.5))

Answer: In the plot above, Netflix adds more Movies than TV Shows. The months in which Netflix adds the most films are July, September, and December.

  5. Show the trend in movie duration from 2000 to 2020.

netflix %>% 
  filter(type=='Movie' & year(release_year)>=2000 & year(release_year)<=2020) %>% 
  mutate(movie_duration=substr(duration,1,nchar(as.character(duration))-4)) %>% 
  mutate(movie_duration = as.integer(movie_duration)) %>% 
  group_by(release_year) %>% 
  summarise(avg_duration = mean(movie_duration)) %>% 
  ggplot(aes(x=release_year, y= avg_duration))+
  geom_point() + geom_smooth()+
  labs(title = "Netflix movie duration from 2000 - 2020", x = "Year", y = "Duration(Minutes)")+
  theme_minimal()+
  theme(plot.title=element_text(face="bold",hjust=0.5))

Answer: The plot above shows that the average duration of movies released from 2000 to 2020 trends downward.
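The duration step above strips the trailing " min" and averages per year. The sketch below repeats that idea on made-up rows, including one empty duration, to show how na.rm = TRUE keeps a single unparseable value from turning a whole year's mean into NA. Whether the real data contains such rows is an assumption not checked here.

```r
library(dplyr)

# Made-up movies; the last one has an empty duration.
toy <- data.frame(
  release_year = c(2019, 2019, 2020, 2020),
  duration = c("90 min", "110 min", "80 min", ""),
  stringsAsFactors = FALSE
)

avg <- toy %>%
  # Drop the " min" suffix; an empty string converts to NA.
  mutate(movie_duration = as.integer(sub(" min$", "", duration))) %>%
  group_by(release_year) %>%
  summarise(avg_duration = mean(movie_duration, na.rm = TRUE))

avg
```

Without na.rm = TRUE, the 2020 mean in this example would be NA and the point would be missing from a plot.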

5 Final Conclusion

From the analysis and visualization, we conclude that the numbers of Movies and TV Shows both increase with release year. Movies outnumber TV Shows in every year. However, TV Shows also show a marked increase from 2000 to 2021. The most popular categories on Netflix are Dramas, Comedies, and Action & Adventure.

Two ratings, TV-MA and TV-14, exceed the mean count across the top 10 ratings. Netflix adds more Movies than TV Shows. The months in which Netflix adds the most films are July, September, and December. Finally, the average duration of movies released from 2000 to 2020 trends downward.