In this project, we analyze the Netflix dataset from Kaggle. Netflix is one of the most popular media and video streaming platforms. As of mid-2021, it offers over 8,000 movies and TV shows and has over 200 million subscribers globally. After analyzing the data, we visualize the results of the analysis.
First, we read the data.
netflix <- read.csv("data/netflix.csv")
Then, we look at the first 10 rows of the data.
head(netflix,10)
Next, we check the dimensions of the data.
dim(netflix)
## [1] 8807 12
The data contains 8807 rows and 12 columns.
The first step in conducting data analysis is to ensure that the data to be used is clean.
First, we load the required libraries.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
Next, we check the data type of every column.
str(netflix)
## 'data.frame': 8807 obs. of 12 variables:
## $ show_id : chr "s1" "s2" "s3" "s4" ...
## $ type : chr "Movie" "TV Show" "TV Show" "TV Show" ...
## $ title : chr "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
## $ director : chr "Kirsten Johnson" "" "Julien Leclercq" "" ...
## $ cast : chr "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
## $ country : chr "United States" "South Africa" "" "" ...
## $ date_added : chr "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
## $ release_year: int 2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
## $ rating : chr "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
## $ duration : chr "90 min" "2 Seasons" "1 Season" "1 Season" ...
## $ listed_in : chr "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
## $ description : chr "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
The output above shows that several columns have unsuitable data types: type is better stored as a factor, and date_added and release_year are better stored as dates. We therefore convert them.
netflix$type <- as.factor(netflix$type)
netflix$date_added <- mdy(netflix$date_added)
netflix$release_year <- parse_date_time(netflix$release_year,'y')
Next, we count the missing values in each column.
colSums(is.na(netflix))
##      show_id         type        title     director         cast      country
##            0            0            0            0            0            0
##   date_added release_year       rating     duration    listed_in  description
##           10            0            0            0            0            0
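The zero counts for director, cast, and country may look surprising, since the str() output shows empty strings in those columns. A minimal sketch of why, using a small made-up data frame (the `toy` object and its values are hypothetical, not taken from the dataset):

```r
# Toy data frame (hypothetical values) mixing a real NA and empty strings
toy <- data.frame(
  director   = c("Kirsten Johnson", "", "Julien Leclercq", ""),
  country    = c("United States", "South Africa", "", ""),
  date_added = as.Date(c("2021-09-25", NA, "2021-09-24", "2021-09-24")),
  stringsAsFactors = FALSE
)

# is.na() only sees true NA values, so the blank strings go uncounted
na_counts <- colSums(is.na(toy))

# Count blank strings separately; which() drops the NA that comparing
# a missing date with "" produces
blank_counts <- sapply(toy, function(col) length(which(as.character(col) == "")))

print(na_counts)
print(blank_counts)
```

Here is.na() reports a missing value only for date_added, while the blank-string count picks up the empty director and country entries.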
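The date_added column has 10 missing values. The character columns director, cast, country, and rating can also contain empty strings (visible in the str() output above for director, cast, and country), which is.na() does not count, so we first convert those to NA. We then replace the NA values with the string “Missing Values”. Because date_added is a date column, its missing values get the placeholder date 0001-01-01 instead.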
netflix$director[netflix$director==""] <- NA
netflix$cast[netflix$cast==""] <- NA
netflix$country[netflix$country==""] <- NA
netflix$rating[netflix$rating==""] <- NA
netflix$director[which(is.na(netflix$director))] <- "Missing Values"
netflix$cast[which(is.na(netflix$cast))] <- "Missing Values"
netflix$country[which(is.na(netflix$country))] <- "Missing Values"
netflix$date_added[which(is.na(netflix$date_added))] <- "01-01-01" #because the date_added column has a date data type
netflix$rating[which(is.na(netflix$rating))] <- "Missing Values"
colSums(is.na(netflix))
##      show_id         type        title     director         cast      country
##            0            0            0            0            0            0
##   date_added release_year       rating     duration    listed_in  description
##            0            0            0            0            0            0
Great! We don’t have missing values.
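One caveat with a placeholder date is that it is still a real date, so it takes part in every date statistic computed afterwards. The sketch below compares the two options on made-up dates (the `added` vector is hypothetical). Keeping the NA and passing na.rm = TRUE is shown as an alternative, not as what this analysis does:

```r
# Hypothetical dates, one of them missing
added <- as.Date(c("2019-01-01", "2021-01-01", NA))

# Option 1: fill the gap with a placeholder date in year 1
filled <- added
filled[is.na(filled)] <- as.Date("0001-01-01")

# Option 2: keep the NA and skip it when summarising
mean_filled <- mean(filled)
mean_kept   <- mean(added, na.rm = TRUE)

print(mean_filled)  # pulled centuries into the past by the placeholder
print(mean_kept)    # stays between the two real dates
```

Which option fits depends on whether later steps can tolerate NA values. The placeholder keeps every row but shifts minimums and means.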
First, we check the data again.
head(netflix,10)
Next, we keep only the columns needed for the analysis, dropping show_id, director, cast, country, and description.
netflix <- netflix %>% select(-c(show_id, director, cast, country, description))
head(netflix,10)
Then, we take the first category from the listed_in column.
netflix <- netflix %>% separate(listed_in, c("category", "category2", "category3"), sep = ",")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 5078 rows [1, 4,
## 7, 9, 10, 13, 14, 16, 17, 19, 23, 24, 28, 29, 30, 32, 35, 38, 39, 40, ...].
netflix <- netflix %>% select(c(-"category2", -"category3"))
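Since only the first category is kept, the same result can also be sketched without creating and then dropping helper columns. The example below uses base R on a few made-up listed_in strings (the `listed_in` vector here is hypothetical). It is an alternative illustration, not the method used in this analysis:

```r
# Hypothetical listed_in values in the same comma-separated style
listed_in <- c(
  "Documentaries",
  "International TV Shows, TV Dramas, TV Mysteries",
  "Crime TV Shows, International TV Shows, TV Action & Adventure"
)

# Drop everything from the first comma onward, then trim stray whitespace
category <- trimws(sub(",.*$", "", listed_in))

print(category)
```

A value with no comma is returned unchanged, so no warning about missing pieces is produced.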
head(netflix,10)
Finally, we create two new columns, year_added and month_added, from the date_added column.
netflix$year_added <- year(netflix$date_added)
netflix$month_added <- month(netflix$date_added,label=TRUE)
head(netflix,10)
Data cleaning is now complete, and the data is ready for analysis and visualization.
summary(netflix)
##       type         title             date_added
##  Movie  :6131   Length:8807        Min.   :0001-01-01
##  TV Show:2676   Class :character   1st Qu.:2018-04-03
##                 Mode  :character   Median :2019-07-01
##                                    Mean   :2017-01-30
##                                    3rd Qu.:2020-08-18
##                                    Max.   :2021-09-25
##
##   release_year                    rating            duration
##  Min.   :1925-01-01 00:00:00   Length:8807        Length:8807
##  1st Qu.:2013-01-01 00:00:00   Class :character   Class :character
##  Median :2017-01-01 00:00:00   Mode  :character   Mode  :character
##  Mean   :2014-03-07 16:27:24
##  3rd Qu.:2019-01-01 00:00:00
##  Max.   :2021-01-01 00:00:00
##
##    category           year_added    month_added
##  Length:8807        Min.   :   1   Jul    : 827
##  Class :character   1st Qu.:2018   Dec    : 813
##  Mode  :character   Median :2019   Sep    : 770
##                     Mean   :2017   Apr    : 764
##                     3rd Qu.:2020   Oct    : 760
##                     Max.   :2021   Aug    : 755
##                                    (Other):4118
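The year_added and month_added columns summarised above come from lubridate's year() and month(). For readers who want to see what that derivation amounts to, here is a rough base R sketch on made-up dates (the `date_added` vector is hypothetical). It assumes English month abbreviations via the built-in month.abb constant, whereas lubridate's labels follow the session locale:

```r
# Hypothetical date_added values
date_added <- as.Date(c("2021-09-25", "2020-07-01", "2019-12-31"))

# Year as an integer, like lubridate::year()
year_added <- as.integer(format(date_added, "%Y"))

# Month as an ordered factor with English abbreviations (Jan < Feb < ... < Dec),
# similar in spirit to lubridate::month(label = TRUE)
month_number <- as.integer(format(date_added, "%m"))
month_added  <- factor(month.abb[month_number], levels = month.abb, ordered = TRUE)

print(year_added)
print(month_added)
```

Storing the month as an ordered factor is what keeps the months in calendar order on a plot axis instead of alphabetical order.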
1. Compare the number of Movies and TV Shows by year of release
ggplot(netflix, mapping = aes(x=release_year, fill = type)) +
geom_histogram()+
labs(title="Netflix Films Released by Year", x="Release Year", y="Total Film")+
scale_fill_manual(values = c("Movie" = "Red","TV Show" = "Black"))+
theme_minimal()+
theme(plot.title = element_text(face="bold",hjust = 0.5))
Answer: The number of both Movies and TV Shows tends to increase with release year. Movies outnumber TV Shows in every year. However, TV Shows also show a marked increase from 2000 to 2021.
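A reading such as "Movies outnumber TV Shows in every year" is hard to confirm from a stacked histogram alone. One way to check it numerically is to cross-tabulate titles by release year and type. The sketch below does this on a tiny made-up data frame (`toy` and its rows are hypothetical), so its result says nothing about the real data:

```r
# Hypothetical titles: release year and type only
toy <- data.frame(
  release_year = c(2019, 2019, 2019, 2020, 2020, 2021),
  type         = c("Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show")
)

# Cross-tabulate: one row per year, one column per type
counts <- table(toy$release_year, toy$type)

# Years in which movies strictly outnumber TV shows
movie_lead <- counts[, "Movie"] > counts[, "TV Show"]

print(counts)
print(movie_lead)
```

Applied to the cleaned data, all(movie_lead) would indicate whether the claim holds for every release year.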
top_categories <- netflix %>% group_by(category) %>% count() %>% arrange(desc(n))
top10_categories <- top_categories %>% filter(n>270)
top10_categories
ggplot(top10_categories,mapping=aes(x=n, y=reorder(category,n)))+
geom_col(aes(fill=n),color = "maroon",show.legend = F)+
scale_fill_gradient(low="pink",high="#cf2e2e")+
labs(title = "Netflix's Top 10 Categories", x = "Total Film", y = NULL)+
theme_minimal()+
theme(plot.title=element_text(face="bold", hjust = 0.5))+
geom_label(data=top10_categories[1:5,], mapping=aes(label=n))+
geom_vline(xintercept = mean(top10_categories$n), col="yellow",linetype=2,lwd=1)
Answer: The plot above shows the top 10 categories on Netflix. The yellow dashed line marks the average count across these categories. Five categories exceed the average: Dramas, Comedies, Action & Adventure, Documentaries, and International TV Shows.
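The top 10 above is selected with a count threshold (n > 270), which has to be retuned whenever the data changes. A possible alternative is to select by rank. The sketch below uses base R on made-up labels (the `category` vector is hypothetical) and also lists which of the selected categories lie above the group mean:

```r
# Hypothetical category labels
category <- c("Dramas", "Comedies", "Dramas", "Documentaries",
              "Dramas", "Comedies", "Kids' TV")

# Count each category and sort from most to least frequent
counts <- sort(table(category), decreasing = TRUE)

# Keep the top 2 by rank instead of by a hand-picked count threshold
top2 <- head(counts, 2)

# Categories whose count exceeds the mean of the selected group
above_mean <- names(top2)[top2 > mean(top2)]

print(top2)
print(above_mean)
```

With head(counts, 10) the same idea would give a top 10 regardless of how large the counts are.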
top_ratings <- netflix %>% group_by(rating) %>% count() %>% arrange(desc(n))
top10_ratings <- top_ratings %>% filter(n>50)
top10_ratings
ggplot(data = top10_ratings, mapping=aes(x=n,y=reorder(rating,n)))+
geom_col(aes(fill=n), color="black", show.legend = F)+
scale_fill_gradient(low="#79DAE8",high="#0AA1DD")+
labs(title="Netflix's Top 10 Ratings", x = "Total Film", y= NULL)+
theme_minimal()+
theme(plot.title = element_text(face="bold",hjust=0.5))+
geom_label(data = top10_ratings[1:2,], mapping=aes(label=n))+
geom_vline(xintercept = mean(top10_ratings$n), col = "#FCF69C",linetype=2,lwd=1)
Answer: The plot above shows the top 10 ratings on Netflix. The yellow dashed line marks the average count across these ratings. Two ratings exceed the average: TV-MA and TV-14.
netflix %>% group_by(month_added,type)%>%
count() %>%
ggplot(aes(x=month_added,y=n,fill=type))+
geom_col(aes(fill=type))+
labs(title="Netflix Films Added by Month", x="Month", y="Total Film")+
theme_minimal()+
theme(plot.title=element_text(face="bold",hjust=0.5))
Answer: In the plot above, more Movies than TV Shows are added in every month. The months in which Netflix adds the most titles are July, December, and September.
netflix %>%
filter(type=='Movie' & release_year>="2000-01-01" & release_year<="2020-01-02") %>%
mutate(movie_duration=substr(duration,1,nchar(as.character(duration))-4)) %>%
mutate(movie_duration = as.integer(movie_duration)) %>%
group_by(release_year) %>%
summarise(avg_duration = mean(movie_duration)) %>%
ggplot(aes(x=release_year, y= avg_duration))+
geom_point() + geom_smooth()+
labs(title = "Netflix movie duration from 2000 - 2020", x = "Year", y = "Duration(Minutes)")+
theme_minimal()+
theme(plot.title=element_text(face="bold",hjust=0.5))
Answer: The plot above shows that the average duration of movies released from 2000 to 2020 has a downward trend.
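The duration step above removes the last four characters of each value, which assumes every entry ends in " min". A slightly more defensive sketch strips the suffix by pattern and lets unparseable values become NA. It uses base R on made-up rows (`movies` and its values are hypothetical, including the blank duration):

```r
# Hypothetical movie rows; the blank duration mimics a missing value
movies <- data.frame(
  release_year = c(2000, 2000, 2010, 2010, 2020),
  duration     = c("120 min", "100 min", "95 min", "", "90 min"),
  stringsAsFactors = FALSE
)

# Strip the " min" suffix by pattern rather than by character position;
# a blank string becomes NA instead of a wrong number
movies$minutes <- as.integer(sub(" min$", "", movies$duration))

# Average per year; the formula interface of aggregate() drops rows
# with NA minutes by default (na.action = na.omit)
avg_duration <- aggregate(minutes ~ release_year, data = movies, FUN = mean)

print(avg_duration)
```

In a dplyr pipeline the equivalent safeguard would be mean(movie_duration, na.rm = TRUE), so that one unparseable value does not turn a whole year's average into NA.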
From the analysis and visualization above, we conclude that the number of movies and TV shows tends to increase with release year. Movies outnumber TV shows in every year. However, TV shows also show a marked increase from 2000 to 2021. The most popular categories on Netflix are Dramas, Comedies, and Action & Adventure.
Two ratings exceed the average count: TV-MA and TV-14. More Movies than TV Shows are added in every month, and the months in which Netflix adds the most titles are July, December, and September. Finally, the average duration of movies released from 2000 to 2020 has a downward trend.