1. Introduction 📣


1.1 About Netflix

Netflix, Inc. is an American subscription streaming service and production company. It offers a library of films and television series through distribution deals as well as its own productions, known as Netflix Originals.


2. Ask Phase ❓


2.1 Business task

This analysis is to show how Netflix subscribers have grown in four different regions from Q1 2018 to Q2 2020. Also, I will seek for the trend of subscribers growth quarterly between 2018 and 2020.

2.2 Breakdown problem into small questions

  • What is the proportion of subscribers take up per region quarterly?
  • How much we earned quarterly by region from Q1 2018 to Q2 2020?
  • What is the trend of subscribers & revenue growth quarterly per region?
  • Did we see any recurring pattern and related seasonality per region?
  • Are there any external factors impacting the pattern of subscribers & revenue growth?


3. Prepare Phase 📚


3.1 Data description

This dataset contains four CSV files.

  • DataNetflixRevenue2020_V2.csv: three columns Area, Years, Revenue.

  • DataNetflixSubscriber2020_V2.csv: three columns Area, Years, Subscribers.

  • NetflixSubscribersbyCountryfrom2018toQ2_2020.csv: eleven columns Area, Q1 - 2018, Q2 - 2018, Q3 - 2018, Q4 - 2018, Q1 - 2019, Q2 - 2019, Q3 - 2019, Q4 - 2019, Q1 - 2020, Q2 - 2020

  • Netflix'sRevenue2018toQ2_2020.csv: eleven columns Area, Q1 - 2018, Q2 - 2018, Q3 - 2018, Q4 - 2018, Q1 - 2019, Q2 - 2019, Q3 - 2019, Q4 - 2019, Q1 - 2020, Q2 - 2020

3.2 Accessibility and privacy of data

This dataset can access through the account of PARIA AGHARABI on Kaggle Netflix subscribers and revenue by country

3.3 Installing packages and loading libraries

#Loading libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggplot2)
library(dplyr)
library(here)
## here() starts at /Users/florence/Downloads
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggpubr)
library(readr)

3.4 Importing and previewing dataset

I will import 4 csv files and then view, clean, format and organize the data.


  • DataNetflixRevenue2020_V2.csv

DataNetflixRevenue2020_V2.csv is a Long Data since all the observations in a single column.

#Importing Daily Activity dataset:

data_revenue_long <- read.csv(file = "~/Downloads/Netflix/DataNetflixRevenue2020_V2.csv")
head(data_revenue_long)
colnames(data_revenue_long)
## [1] "Area"    "Years"   "Revenue"
str(data_revenue_long)
## 'data.frame':    40 obs. of  3 variables:
##  $ Area   : chr  "United States and Canada" "Europe,  Middle East and Africa" "Latin America" "Asia-Pacific" ...
##  $ Years  : chr  "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" ...
##  $ Revenue: num  1.98e+09 8.87e+08 5.40e+08 1.99e+08 2.05e+09 ...


  • DataNetflixSubscriber2020_V2.csv
data_subscribers_long <- read.csv(file = "~/Downloads/Netflix/DataNetflixSubscriber2020_V2.csv")
head(data_subscribers_long)
colnames(data_subscribers_long)
## [1] "Area"        "Years"       "Subscribers"
str(data_subscribers_long)
## 'data.frame':    40 obs. of  3 variables:
##  $ Area       : chr  "United States and Canada" "Europe,  Middle East and Africa" "Latin America" "Asia-Pacific" ...
##  $ Years      : chr  "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" ...
##  $ Subscribers: int  60909000 29339000 21260000 7394000 61870000 31317000 22795000 8372000 63010000 33836000 ...


  • NetflixsRevenue2018toQ2_2020.csv
data_revenue_wide <- read.csv(file = "~/Downloads/Netflix/NetflixsRevenue2018toQ2_2020.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on '~/Downloads/Netflix/NetflixsRevenue2018toQ2_2020.csv'
head(data_revenue_wide)
colnames(data_revenue_wide)
##  [1] "Area"      "Q1...2018" "Q2...2018" "Q3...2018" "Q4...2018" "Q1...2019" "Q2...2019" "Q3...2019" "Q4...2019" "Q1...2020" "Q2...2020"
str(data_revenue_wide)
## 'data.frame':    4 obs. of  11 variables:
##  $ Area     : chr  "United States and Canada" "Europe, Middle East, and Africa" "Latin America" "Asia-Pacific"
##  $ Q1...2018: int  1976157000 886649000 540182000 199117000
##  $ Q2...2018: int  2049546000 975497000 568071000 221252000
##  $ Q3...2018: int  2094850000 1004749000 562307000 248691000
##  $ Q4...2018: num  2.16e+09 1.10e+09 5.67e+08 2.77e+08
##  $ Q1...2019: num  2.26e+09 1.23e+09 6.30e+08 3.20e+08
##  $ Q2...2019: num  2.50e+09 1.32e+09 6.77e+08 3.49e+08
##  $ Q3...2019: num  2.62e+09 1.43e+09 7.41e+08 3.82e+08
##  $ Q4...2019: num  2.67e+09 1.56e+09 7.46e+08 4.18e+08
##  $ Q1...2020: num  2.70e+09 1.72e+09 7.93e+08 4.84e+08
##  $ Q2...2020: num  2.84e+09 1.89e+09 7.85e+08 5.69e+08


  • NetflixSubscribersbyCountryfrom2018toQ2_2020.csv
data_subscribers_wide <- read.csv(file = "~/Downloads/Netflix/NetflixSubscribersbyCountryfrom2018toQ2_2020.csv")
head(data_subscribers_wide)
colnames(data_subscribers_wide)
##  [1] "Area"      "Q1...2018" "Q2...2018" "Q3...2018" "Q4...2018" "Q1...2019" "Q2...2019" "Q3...2019" "Q4...2019" "Q1...2020" "Q2...2020"
str(data_subscribers_wide)
## 'data.frame':    4 obs. of  11 variables:
##  $ Area     : chr  "United States and Canada" "Europe, Middle East, and Africa" "Latin America" "Asia-Pacific"
##  $ Q1...2018: int  60909000 29339000 21260000 7394000
##  $ Q2...2018: int  61870000 31317000 22795000 8372000
##  $ Q3...2018: int  63010000 33836000 24115000 9461000
##  $ Q4...2018: int  64757000 37818000 26077000 10607000
##  $ Q1...2019: int  66633000 42542000 27547000 12141000
##  $ Q2...2019: int  66501000 44229000 27890000 12942000
##  $ Q3...2019: int  67114000 47355000 29380000 14485000
##  $ Q4...2019: int  67662000 51778000 31417000 16233000
##  $ Q1...2020: int  69969000 58734000 34318000 19835000
##  $ Q2...2020: int  72904000 61483000 36068000 22492000


4. Process Phase 🗄


We will take up cleaning up data to get ride of any possible errors, inaccuracies, or inconsistencies.

4.1 Checking for missing data

As we got to know more about our data, I will process to check for any errors.

 data_revenue_long <- data_revenue_long %>%
  drop_na() %>%
  distinct()
 data_subscribers_long <- data_subscribers_long %>%
  drop_na() %>%
  distinct()
 data_revenue_wide <- data_revenue_wide %>%
  drop_na() %>%
  distinct()
 data_subscribers_wide <- data_subscribers_wide %>%
  drop_na() %>%
  distinct()

4.2 Checking for duplicates


I will be inspecting for any duplicates.

sum(duplicated(data_revenue_long))
## [1] 0
sum(duplicated(data_subscribers_long))
## [1] 0
sum(duplicated(data_revenue_wide))
## [1] 0
sum(duplicated(data_subscribers_wide))
## [1] 0

4.3 Clean and rename columns

To ensure the column name are using the same syntax and format, I will clean and reformat column name as follows.

data_revenue_wide <- data_revenue_wide %>%
rename(Q1_2018 = Q1...2018) %>%
rename(Q2_2018 = Q2...2018) %>%
rename(Q3_2018 = Q3...2018) %>%
rename(Q4_2018 = Q4...2018) %>%
rename(Q1_2019 = Q1...2019) %>%
rename(Q2_2019 = Q2...2019) %>%
rename(Q3_2019 = Q3...2019) %>%
rename(Q4_2019 = Q4...2019) %>%
rename(Q1_2020 = Q1...2020) %>%
rename(Q2_2020 = Q2...2020) 
head(data_revenue_wide)

We do the same to data_subscribers_wide.

data_subscribers_wide <- data_subscribers_wide %>%
rename(Q1_2018 = Q1...2018) %>%
rename(Q2_2018 = Q2...2018) %>%
rename(Q3_2018 = Q3...2018) %>%
rename(Q4_2018 = Q4...2018) %>%
rename(Q1_2019 = Q1...2019) %>%
rename(Q2_2019 = Q2...2019) %>%
rename(Q3_2019 = Q3...2019) %>%
rename(Q4_2019 = Q4...2019) %>%
rename(Q1_2020 = Q1...2020) %>%
rename(Q2_2020 = Q2...2020) 
head(data_subscribers_wide)


5. Analyze Phase 🔎


Now it’s time to analyse the trend of the subscribers and revenue growth quarterly per region.

5.1 Define region

We can classified four regions in our data set as following:

  • United States and Canada
  • Europe, Middle Ease and Africa
  • Latin America
  • Asia-Pacific

5.2 Calculate the propotion of revenue and subscribers per region

Add new column named “total” in data_revenue_widewhich add up all the revenue in 2018, 2019 and 2020 per region.

total_2018_subscribers_wide

total_2018_subscribers_wide<- data_subscribers_wide %>%
  group_by(Area) %>%
  summarise(Q1_2018, Q2_2018, Q3_2018, Q4_2018) %>%
  group_by(Area) %>%
  mutate(total = sum(Q1_2018,Q2_2018,Q3_2018,Q4_2018))
  head(total_2018_subscribers_wide)


df_2018_subsribers_wide

Now I will create a new dataframe including “total” and “proportion” by region.

df_2018_subscribers_wide <- total_2018_subscribers_wide %>%
  group_by(Area) %>%
  mutate(proportion = total / sum(total_2018_subscribers_wide$total)) %>%
  mutate(percent = proportion) %>%
  mutate(proportion= percent(proportion)) %>%
  summarise(total, proportion, percent)
head(df_2018_subscribers_wide)

I will do the same to file 2019.

total_2019_subsribers_wide

total_2019_subsribers_wide <- data_subscribers_wide %>%
  group_by(Area) %>%
  summarise(Q1_2019, Q2_2019, Q3_2019, Q4_2019) %>%
  group_by(Area) %>%
  mutate(total = sum(Q1_2019,Q2_2019,Q3_2019,Q4_2019)) 
  head(total_2019_subsribers_wide)


df_2019_subsribers_wide

I will also create the new dataframe that contains variable of “total” and “proportion” for 2019.

df_2019_subsribers_wide <- total_2019_subsribers_wide %>%
  group_by(Area) %>%
  mutate(proportion = total / sum(total_2019_subsribers_wide$total)) %>%
  mutate(percent = proportion) %>%
  mutate(proportion= percent(proportion)) %>%
  summarise(total, proportion, percent)
head(df_2019_subsribers_wide)

Now for 2020.

total_2020_subscribers_wide

total_2020_subscribers_wide <- data_subscribers_wide %>%
  group_by(Area) %>%
  summarise(Q1_2020, Q2_2020) %>%
  group_by(Area) %>%
  mutate(total = sum(Q1_2020,Q2_2020)) 
  head(total_2020_subscribers_wide)


df_2020_subscribers_wide

df_2020_subscribers_wide <- total_2020_subscribers_wide %>%
  mutate(proportion = total / sum(total_2020_subscribers_wide$total)) %>% 
  mutate(percent = proportion) %>% 
  mutate(proportion= percent(proportion)) %>%
  summarise(total, proportion, percent) 
head(df_2020_subscribers_wide)

5.3 Calculate the distribution of subscribers quarterly in 2018, 2019, 2020


proportion_2018_subscribers_wide

proportion_2018_subscribers_wide <- total_2018_subscribers_wide %>%
  group_by(Area) %>%
  mutate(Q1_proportion = Q1_2018 / sum(total_2018_subscribers_wide$Q1_2018)) %>%
  mutate(Q1_percent = as.double(Q1_proportion ))%>%
  mutate(Q1_proportion = percent(Q1_proportion))%>%
  mutate(Q2_proportion = Q2_2018 / sum(total_2018_subscribers_wide$Q2_2018)) %>%
  mutate(Q2_percent = as.double(Q2_proportion ))%>%
  mutate(Q2_proportion = percent(Q2_proportion))%>%
  mutate(Q3_proportion = Q3_2018 / sum(total_2018_subscribers_wide$Q3_2018)) %>%
  mutate(Q3_percent = as.double(Q3_proportion ))%>%
  mutate(Q3_proportion = percent(Q3_proportion))%>%
  mutate(Q4_proportion = Q4_2018 / sum(total_2018_subscribers_wide$Q4_2018)) %>%
  mutate(Q4_percent = as.double(Q4_proportion ))%>%
  mutate(Q4_proportion = percent(Q4_proportion))%>%
  summarise(Q1_proportion, Q1_percent, Q2_proportion, Q2_percent, Q3_proportion, Q3_percent, Q4_proportion, Q4_percent)
head(proportion_2018_subscribers_wide)


proportion_2019_subscribers_wide

proportion_2019_subscribers_wide <- total_2019_subsribers_wide %>%
  group_by(Area) %>%
  mutate(Q1_proportion = Q1_2019 / sum(total_2019_subsribers_wide$Q1_2019))  %>%
  mutate(Q1_percent = as.double(Q1_proportion ))%>%
  mutate(Q1_proportion = percent(Q1_proportion))%>%
  mutate(Q2_proportion = Q2_2019 / sum(total_2019_subsribers_wide$Q2_2019))  %>%
  mutate(Q2_percent = as.double(Q2_proportion ))%>%
  mutate(Q2_proportion = percent(Q2_proportion))%>%
  mutate(Q3_proportion = Q3_2019 / sum(total_2019_subsribers_wide$Q3_2019))  %>%
  mutate(Q3_percent = as.double(Q3_proportion ))%>%
  mutate(Q3_proportion = percent(Q3_proportion))%>%
  mutate(Q4_proportion = Q4_2019 / sum(total_2019_subsribers_wide$Q4_2019))  %>%
  mutate(Q4_percent = as.double(Q4_proportion ))%>%
  mutate(Q4_proportion = percent(Q4_proportion))%>%

  summarise(Q1_proportion, Q1_percent, Q2_proportion, Q2_percent, Q3_proportion, Q3_percent, Q4_proportion, Q4_percent)
head(proportion_2019_subscribers_wide)


proportion_2020_subscribers_wide

proportion_2020_subscribers_wide <- total_2020_subscribers_wide %>%
  group_by(Area) %>%
  mutate(Q1_proportion = Q1_2020 / sum(total_2020_subscribers_wide$Q1_2020)) %>%
  mutate(Q1_percent = as.double(Q1_proportion ))%>%
  mutate(Q1_proportion = percent(Q1_proportion))%>%
  mutate(Q2_proportion = Q2_2020 / sum(total_2020_subscribers_wide$Q2_2020)) %>%
  mutate(Q2_percent = as.double(Q2_proportion ))%>%
  mutate(Q2_proportion = percent(Q2_proportion))%>%
  summarise(Q1_proportion, Q1_percent, Q2_proportion, Q2_percent)
head(proportion_2020_subscribers_wide)


6. Share Phase 🪢


We will use data visualization to clearly display the proportion of Netflix subscribers per region, and the proportion of each region quarterly from Q1 2018 to Q2 2020. Furthermore,

6.1 Data visualization: Netflix subscribers per region

The proportion of Netflix subscribers per region

p1<-ggplot(df_2018_subscribers_wide, aes(x="", y=percent, fill=Area))+
  geom_bar(width =1, stat = "identity")+
  coord_polar("y")+
  theme_minimal()+
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust=0.3, vjust=2, size=12, face = "bold"),
        legend.position = "none")+
  geom_text(aes(label= proportion), size = 2, position = position_stack(vjust=0.5))+
  scale_fill_manual(values = c("Asia-Pacific" = "#ff3333", "Europe, Middle East, and Africa"="#ffa64d", "Latin America" = "#88cc00", "United States and Canada" = "#4db8ff"))+
labs(title = "Netflix subscirbers per region", subtitle = "2018")


p2<-ggplot(df_2019_subsribers_wide, aes(x="", y=percent, fill=Area))+
  geom_bar(width =1, stat = "identity")+
  coord_polar("y")+
  theme_minimal()+
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust=0.1, vjust=5, size=9, face = "bold"),
        legend.position = "none")+
  geom_text(aes(label= proportion), size = 2, position = position_stack(vjust=0.5))+
  scale_fill_manual(values = c("Asia-Pacific" = "#ff3333", "Europe, Middle East, and Africa"="#ffa64d", "Latin America" = "#88cc00", "United States and Canada" = "#4db8ff"))+
labs(title = "", subtitle = "2019")

p3<-ggplot(df_2020_subscribers_wide, aes(x="", y=percent, fill=Area))+
  geom_bar(width =1, stat = "identity")+
  coord_polar("y")+
  theme_minimal()+
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust=0.1, vjust=5, size=9, face = "bold"),
        legend.position = "none")+
  geom_text(aes(label= proportion), size = 2, position = position_stack(vjust=0.5))+
  scale_fill_manual(values = c("Asia-Pacific" = "#ff3333", "Europe, Middle East, and Africa"="#ffa64d", "Latin America" = "#88cc00", "United States and Canada" = "#4db8ff"))+
labs(title = "", subtitle =  "2020")

ggarrange(p1,p2,p3, 
      common.legend=TRUE, 
      hjust=-0.8,
      ncol = 2, nrow = 2)

6.2 Data visualization : Netflix subscribers quarterly from Q1 2018 to Q2 2020

data_subscribers_long %>%
ggplot()+
 geom_col(aes(x=Area, y=Subscribers, fill= Years),position = "dodge")+
theme(axis.text.x = element_text(angle=90))+
  labs(title="Netflix Subscribers Quarterly per region")

data_subscribers_long %>%

ggplot()+
 geom_col(aes(x=factor(Years,level = c("Q1 - 2018", "Q2 - 2018", "Q3 - 2018", "Q4 - 2018", "Q1 - 2019", "Q2 - 2019", "Q3 - 2019", "Q4 - 2019", "Q1 - 2020", "Q2 - 2020")), y=Subscribers, fill= Area),position = "dodge")+

theme(axis.text.x = element_text(angle = 90))+
  labs(title="Netflix Subscribers Quarterly", subtitle = "Time: Q1-2018 to Q2-2020", x="Years")

7. Act Phase 📍

Netflix is one of the world largest entertainment and streaming media service. After we go through the four csv files and visualize them, I could draw conclusion as follows:


Thank you for your interest in my Netflix Data Analytics report