Netflix, Inc. is an American subscription streaming service and production company. It offers a library of films and television series through distribution deals as well as its own productions, known as Netflix Originals.
This analysis is to show how Netflix subscribers have grown in four different regions from Q1 2018 to Q2 2020. Also, I will seek for the trend of subscribers growth quarterly between 2018 and 2020.
This dataset contains four CSV files.
DataNetflixRevenue2020_V2.csv
: three columns Area,
Years, Revenue.
DataNetflixSubscriber2020_V2.csv
: three columns
Area, Years, Subscribers.
NetflixSubscribersbyCountryfrom2018toQ2_2020.csv
:
eleven columns Area, Q1 - 2018, Q2 - 2018, Q3 - 2018, Q4 - 2018, Q1 -
2019, Q2 - 2019, Q3 - 2019, Q4 - 2019, Q1 - 2020, Q2 - 2020
Netflix'sRevenue2018toQ2_2020.csv
: eleven columns
Area, Q1 - 2018, Q2 - 2018, Q3 - 2018, Q4 - 2018, Q1 - 2019, Q2 - 2019,
Q3 - 2019, Q4 - 2019, Q1 - 2020, Q2 - 2020
This dataset can access through the account of PARIA AGHARABI on Kaggle Netflix subscribers and revenue by country
#Loading libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2)
library(dplyr)
library(here)
## here() starts at /Users/florence/Downloads
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggpubr)
library(readr)
I will import 4 csv files and then view, clean, format and organize the data.
DataNetflixRevenue2020_V2.csv
DataNetflixRevenue2020_V2.csv
is a Long Data since all
the observations in a single column.
#Importing Daily Activity dataset:
data_revenue_long <- read.csv(file = "~/Downloads/Netflix/DataNetflixRevenue2020_V2.csv")
head(data_revenue_long)
colnames(data_revenue_long)
## [1] "Area" "Years" "Revenue"
str(data_revenue_long)
## 'data.frame': 40 obs. of 3 variables:
## $ Area : chr "United States and Canada" "Europe, Middle East and Africa" "Latin America" "Asia-Pacific" ...
## $ Years : chr "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" ...
## $ Revenue: num 1.98e+09 8.87e+08 5.40e+08 1.99e+08 2.05e+09 ...
DataNetflixSubscriber2020_V2.csv
data_subscribers_long <- read.csv(file = "~/Downloads/Netflix/DataNetflixSubscriber2020_V2.csv")
head(data_subscribers_long)
colnames(data_subscribers_long)
## [1] "Area" "Years" "Subscribers"
str(data_subscribers_long)
## 'data.frame': 40 obs. of 3 variables:
## $ Area : chr "United States and Canada" "Europe, Middle East and Africa" "Latin America" "Asia-Pacific" ...
## $ Years : chr "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" "Q1 - 2018" ...
## $ Subscribers: int 60909000 29339000 21260000 7394000 61870000 31317000 22795000 8372000 63010000 33836000 ...
NetflixsRevenue2018toQ2_2020.csv
data_revenue_wide <- read.csv(file = "~/Downloads/Netflix/NetflixsRevenue2018toQ2_2020.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on '~/Downloads/Netflix/NetflixsRevenue2018toQ2_2020.csv'
head(data_revenue_wide)
colnames(data_revenue_wide)
## [1] "Area" "Q1...2018" "Q2...2018" "Q3...2018" "Q4...2018" "Q1...2019" "Q2...2019" "Q3...2019" "Q4...2019" "Q1...2020" "Q2...2020"
str(data_revenue_wide)
## 'data.frame': 4 obs. of 11 variables:
## $ Area : chr "United States and Canada" "Europe, Middle East, and Africa" "Latin America" "Asia-Pacific"
## $ Q1...2018: int 1976157000 886649000 540182000 199117000
## $ Q2...2018: int 2049546000 975497000 568071000 221252000
## $ Q3...2018: int 2094850000 1004749000 562307000 248691000
## $ Q4...2018: num 2.16e+09 1.10e+09 5.67e+08 2.77e+08
## $ Q1...2019: num 2.26e+09 1.23e+09 6.30e+08 3.20e+08
## $ Q2...2019: num 2.50e+09 1.32e+09 6.77e+08 3.49e+08
## $ Q3...2019: num 2.62e+09 1.43e+09 7.41e+08 3.82e+08
## $ Q4...2019: num 2.67e+09 1.56e+09 7.46e+08 4.18e+08
## $ Q1...2020: num 2.70e+09 1.72e+09 7.93e+08 4.84e+08
## $ Q2...2020: num 2.84e+09 1.89e+09 7.85e+08 5.69e+08
NetflixSubscribersbyCountryfrom2018toQ2_2020.csv
data_subscribers_wide <- read.csv(file = "~/Downloads/Netflix/NetflixSubscribersbyCountryfrom2018toQ2_2020.csv")
head(data_subscribers_wide)
colnames(data_subscribers_wide)
## [1] "Area" "Q1...2018" "Q2...2018" "Q3...2018" "Q4...2018" "Q1...2019" "Q2...2019" "Q3...2019" "Q4...2019" "Q1...2020" "Q2...2020"
str(data_subscribers_wide)
## 'data.frame': 4 obs. of 11 variables:
## $ Area : chr "United States and Canada" "Europe, Middle East, and Africa" "Latin America" "Asia-Pacific"
## $ Q1...2018: int 60909000 29339000 21260000 7394000
## $ Q2...2018: int 61870000 31317000 22795000 8372000
## $ Q3...2018: int 63010000 33836000 24115000 9461000
## $ Q4...2018: int 64757000 37818000 26077000 10607000
## $ Q1...2019: int 66633000 42542000 27547000 12141000
## $ Q2...2019: int 66501000 44229000 27890000 12942000
## $ Q3...2019: int 67114000 47355000 29380000 14485000
## $ Q4...2019: int 67662000 51778000 31417000 16233000
## $ Q1...2020: int 69969000 58734000 34318000 19835000
## $ Q2...2020: int 72904000 61483000 36068000 22492000
We will take up cleaning up data to get ride of any possible errors, inaccuracies, or inconsistencies.
As we got to know more about our data, I will process to check for any errors.
data_revenue_long <- data_revenue_long %>%
drop_na() %>%
distinct()
data_subscribers_long <- data_subscribers_long %>%
drop_na() %>%
distinct()
data_revenue_wide <- data_revenue_wide %>%
drop_na() %>%
distinct()
data_subscribers_wide <- data_subscribers_wide %>%
drop_na() %>%
distinct()
I will be inspecting for any duplicates.
sum(duplicated(data_revenue_long))
## [1] 0
sum(duplicated(data_subscribers_long))
## [1] 0
sum(duplicated(data_revenue_wide))
## [1] 0
sum(duplicated(data_subscribers_wide))
## [1] 0
To ensure the column name are using the same syntax and format, I will clean and reformat column name as follows.
data_revenue_wide <- data_revenue_wide %>%
rename(Q1_2018 = Q1...2018) %>%
rename(Q2_2018 = Q2...2018) %>%
rename(Q3_2018 = Q3...2018) %>%
rename(Q4_2018 = Q4...2018) %>%
rename(Q1_2019 = Q1...2019) %>%
rename(Q2_2019 = Q2...2019) %>%
rename(Q3_2019 = Q3...2019) %>%
rename(Q4_2019 = Q4...2019) %>%
rename(Q1_2020 = Q1...2020) %>%
rename(Q2_2020 = Q2...2020)
head(data_revenue_wide)
We do the same to data_subscribers_wide
.
data_subscribers_wide <- data_subscribers_wide %>%
rename(Q1_2018 = Q1...2018) %>%
rename(Q2_2018 = Q2...2018) %>%
rename(Q3_2018 = Q3...2018) %>%
rename(Q4_2018 = Q4...2018) %>%
rename(Q1_2019 = Q1...2019) %>%
rename(Q2_2019 = Q2...2019) %>%
rename(Q3_2019 = Q3...2019) %>%
rename(Q4_2019 = Q4...2019) %>%
rename(Q1_2020 = Q1...2020) %>%
rename(Q2_2020 = Q2...2020)
head(data_subscribers_wide)
Now it’s time to analyse the trend of the subscribers and revenue growth quarterly per region.
We can classified four regions in our data set as following:
Add new column named “total” in
data_revenue_wide
which add up all the revenue in 2018, 2019
and 2020 per region.
total_2018_subscribers_wide
total_2018_subscribers_wide<- data_subscribers_wide %>%
group_by(Area) %>%
summarise(Q1_2018, Q2_2018, Q3_2018, Q4_2018) %>%
group_by(Area) %>%
mutate(total = sum(Q1_2018,Q2_2018,Q3_2018,Q4_2018))
head(total_2018_subscribers_wide)
df_2018_subsribers_wide
Now I will create a new dataframe including “total” and “proportion” by region.
df_2018_subscribers_wide <- total_2018_subscribers_wide %>%
group_by(Area) %>%
mutate(proportion = total / sum(total_2018_subscribers_wide$total)) %>%
mutate(percent = proportion) %>%
mutate(proportion= percent(proportion)) %>%
summarise(total, proportion, percent)
head(df_2018_subscribers_wide)
I will do the same to file 2019.
total_2019_subsribers_wide
total_2019_subsribers_wide <- data_subscribers_wide %>%
group_by(Area) %>%
summarise(Q1_2019, Q2_2019, Q3_2019, Q4_2019) %>%
group_by(Area) %>%
mutate(total = sum(Q1_2019,Q2_2019,Q3_2019,Q4_2019))
head(total_2019_subsribers_wide)
df_2019_subsribers_wide
I will also create the new dataframe that contains variable of “total” and “proportion” for 2019.
df_2019_subsribers_wide <- total_2019_subsribers_wide %>%
group_by(Area) %>%
mutate(proportion = total / sum(total_2019_subsribers_wide$total)) %>%
mutate(percent = proportion) %>%
mutate(proportion= percent(proportion)) %>%
summarise(total, proportion, percent)
head(df_2019_subsribers_wide)
Now for 2020.
total_2020_subscribers_wide
total_2020_subscribers_wide <- data_subscribers_wide %>%
group_by(Area) %>%
summarise(Q1_2020, Q2_2020) %>%
group_by(Area) %>%
mutate(total = sum(Q1_2020,Q2_2020))
head(total_2020_subscribers_wide)
df_2020_subscribers_wide
df_2020_subscribers_wide <- total_2020_subscribers_wide %>%
mutate(proportion = total / sum(total_2020_subscribers_wide$total)) %>%
mutate(percent = proportion) %>%
mutate(proportion= percent(proportion)) %>%
summarise(total, proportion, percent)
head(df_2020_subscribers_wide)
proportion_2018_subscribers_wide
proportion_2018_subscribers_wide <- total_2018_subscribers_wide %>%
group_by(Area) %>%
mutate(Q1_proportion = Q1_2018 / sum(total_2018_subscribers_wide$Q1_2018)) %>%
mutate(Q1_percent = as.double(Q1_proportion ))%>%
mutate(Q1_proportion = percent(Q1_proportion))%>%
mutate(Q2_proportion = Q2_2018 / sum(total_2018_subscribers_wide$Q2_2018)) %>%
mutate(Q2_percent = as.double(Q2_proportion ))%>%
mutate(Q2_proportion = percent(Q2_proportion))%>%
mutate(Q3_proportion = Q3_2018 / sum(total_2018_subscribers_wide$Q3_2018)) %>%
mutate(Q3_percent = as.double(Q3_proportion ))%>%
mutate(Q3_proportion = percent(Q3_proportion))%>%
mutate(Q4_proportion = Q4_2018 / sum(total_2018_subscribers_wide$Q4_2018)) %>%
mutate(Q4_percent = as.double(Q4_proportion ))%>%
mutate(Q4_proportion = percent(Q4_proportion))%>%
summarise(Q1_proportion, Q1_percent, Q2_proportion, Q2_percent, Q3_proportion, Q3_percent, Q4_proportion, Q4_percent)
head(proportion_2018_subscribers_wide)
proportion_2019_subscribers_wide
proportion_2019_subscribers_wide <- total_2019_subsribers_wide %>%
group_by(Area) %>%
mutate(Q1_proportion = Q1_2019 / sum(total_2019_subsribers_wide$Q1_2019)) %>%
mutate(Q1_percent = as.double(Q1_proportion ))%>%
mutate(Q1_proportion = percent(Q1_proportion))%>%
mutate(Q2_proportion = Q2_2019 / sum(total_2019_subsribers_wide$Q2_2019)) %>%
mutate(Q2_percent = as.double(Q2_proportion ))%>%
mutate(Q2_proportion = percent(Q2_proportion))%>%
mutate(Q3_proportion = Q3_2019 / sum(total_2019_subsribers_wide$Q3_2019)) %>%
mutate(Q3_percent = as.double(Q3_proportion ))%>%
mutate(Q3_proportion = percent(Q3_proportion))%>%
mutate(Q4_proportion = Q4_2019 / sum(total_2019_subsribers_wide$Q4_2019)) %>%
mutate(Q4_percent = as.double(Q4_proportion ))%>%
mutate(Q4_proportion = percent(Q4_proportion))%>%
summarise(Q1_proportion, Q1_percent, Q2_proportion, Q2_percent, Q3_proportion, Q3_percent, Q4_proportion, Q4_percent)
head(proportion_2019_subscribers_wide)
proportion_2020_subscribers_wide
proportion_2020_subscribers_wide <- total_2020_subscribers_wide %>%
group_by(Area) %>%
mutate(Q1_proportion = Q1_2020 / sum(total_2020_subscribers_wide$Q1_2020)) %>%
mutate(Q1_percent = as.double(Q1_proportion ))%>%
mutate(Q1_proportion = percent(Q1_proportion))%>%
mutate(Q2_proportion = Q2_2020 / sum(total_2020_subscribers_wide$Q2_2020)) %>%
mutate(Q2_percent = as.double(Q2_proportion ))%>%
mutate(Q2_proportion = percent(Q2_proportion))%>%
summarise(Q1_proportion, Q1_percent, Q2_proportion, Q2_percent)
head(proportion_2020_subscribers_wide)
Netflix is one of the world largest entertainment and streaming media service. After we go through the four csv files and visualize them, I could draw conclusion as follows: