This project analyzes the Netflix Userbase Dataset to understand user behavior, subscription patterns, and platform growth. The dataset contains 2,500 rows, one per user, with features such as subscription type, monthly revenue, join date, and user demographics.
Three key figures were created to examine distinct facets of the Netflix userbase:
1. Geographical Subscription Distribution: A grouped bar chart shows the distribution of subscription plans across countries. This figure type makes side-by-side comparison easy and helps identify the prevalent subscription types in each region.
2. Age Distribution of Users: A histogram shows the age distribution of Netflix users. It describes the demographic composition of the userbase, which can inform targeted content recommendations and marketing strategies.
3. Monthly Revenue by Subscription Type: A boxplot shows the distribution of monthly revenue for each subscription type. It exposes revenue variation within and between plans, which can inform pricing decisions.
A time series analysis explored the trend in user join dates over time. A scatter plot showcased the relationship between user join dates and their respective countries. This visual exploration contributes to understanding regional user dynamics and potential influences on user engagement.
The insights derived from this analysis hold strategic implications for Netflix. Understanding subscription patterns, user demographics, and temporal dynamics empowers data-driven decision-making. It informs marketing strategies, content localization efforts, and user experience enhancements, contributing to sustained user growth and engagement.
The dataset used for this analysis is the Netflix Userbase Dataset, sourced from Kaggle (https://www.kaggle.com/datasets/arnavsmayan/netflix-userbase-dataset/data). It comprises a snapshot of a sample Netflix userbase, showcasing various aspects of user subscriptions, revenue, account details, and activity. The dataset consists of 2,500 rows, with each row representing a unique user. There are 10 feature variables: User ID, Subscription Type, Monthly Revenue, Join Date, Last Payment Date, Country, Age, Gender, Device, and Plan Duration.
The Netflix Userbase Dataset was cleaned in R. Columns were renamed to snake_case with the rename function, and the two date columns (join_date and last_payment_date) were parsed from day-month-year text and rewritten as “%Y-%m-%d” strings with the mutate function. Because format returns text, both columns are stored as character strings and are converted back with as.Date wherever date arithmetic or a date axis is needed.
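library(dplyr)      # %>%, rename(), mutate(), group_by(), summarise()
library(ggplot2)    # ggplot() and the geom_*/scale_* functions
library(lubridate)  # floor_date()
library(here)       # here() for building output paths
netflix_user <- read.csv("data/Netflix Userbase.csv") %>%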
rename(user_id = User.ID,
subscription_type = Subscription.Type,
monthly_revenue = Monthly.Revenue,
join_date = Join.Date,
last_payment_date = Last.Payment.Date,
country = Country,
age = Age,
gender = Gender,
device = Device) %>%
mutate(join_date = format(as.Date(join_date, format = "%d-%m-%y"), "%Y-%m-%d"),
last_payment_date = format(as.Date(last_payment_date, format = "%d-%m-%y"), "%Y-%m-%d"))
head(netflix_user)
## user_id subscription_type monthly_revenue join_date last_payment_date
## 1 1 Basic 10 2022-01-15 2023-06-10
## 2 2 Premium 15 2021-09-05 2023-06-22
## 3 3 Standard 12 2023-02-28 2023-06-27
## 4 4 Standard 12 2022-07-10 2023-06-26
## 5 5 Basic 10 2023-05-01 2023-06-28
## 6 6 Premium 15 2022-03-18 2023-06-27
## country age gender device Plan.Duration
## 1 United States 28 Male Smartphone 1 Month
## 2 Canada 35 Female Tablet 1 Month
## 3 United Kingdom 42 Male Smart TV 1 Month
## 4 Australia 51 Female Laptop 1 Month
## 5 Germany 33 Male Smartphone 1 Month
## 6 France 29 Female Smart TV 1 Month
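The cleaning step stores the dates as ISO-formatted text, because `format()` returns character strings rather than `Date` objects. The minimal sketch below illustrates the difference using three join dates written in the day-month-two-digit-year layout that the `"%d-%m-%y"` format string implies. The variable names are illustrative and not part of the original analysis.

```r
# Three join dates in the assumed raw layout (day-month-2-digit year).
raw_dates <- c("15-01-22", "05-09-21", "28-02-23")

# as.Date() parses the text into Date objects.
parsed <- as.Date(raw_dates, format = "%d-%m-%y")

# format() turns the Dates back into text, here in year-month-day order.
iso_text <- format(parsed, "%Y-%m-%d")

print(class(parsed))    # class of the parsed column
print(class(iso_text))  # class after format()
print(iso_text)
```

Keeping the parsed `Date` column would avoid the later `as.Date()` calls. Storing text works equally well as long as it is parsed again before date arithmetic or `scale_x_date()`.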
Different regions may exhibit preferences for specific genres, languages, or cultural content. Analyzing geographic distribution informs content localization strategies, allowing Netflix to curate and promote content that resonates with specific audiences. Understanding which regions contribute most to the userbase aids in strategic decision-making for market expansion. It guides resource allocation, marketing efforts, and the prioritization of content licensing in regions with growth potential.
A horizontal bar chart was created to visualize the distribution of
Netflix users across different countries. The dataset, named
user_country, was derived from the original
netflix_user dataset by grouping it based on the countries
and summarizing the user count for each country.
The ggplot function was employed to build the bar chart, specifying
the aesthetics with the x-axis representing countries
(country), the y-axis denoting the user count
(country_count), and differentiating countries by fill
color.
The choice of a bar chart was intentional as it effectively
communicates the variation in user counts across countries. The
horizontal orientation, achieved with the coord_flip
function, enhances readability, especially when dealing with a large
number of countries. This orientation allows for clearer labeling on the
y-axis, facilitating easier comparison.
To streamline the presentation, the legend was removed using the
guides function, as the fill color directly corresponds to
the countries. The overall aesthetic was refined by centering the plot
title using the theme function.
user_country <- netflix_user %>%
group_by(country) %>%
summarise(country_count = n())
p <- ggplot(data = user_country,
aes(x = country, y = country_count, fill = country))
p1 <- p + geom_bar(stat = "identity") +
labs(title = "Number of Netflix Users by Country",
x = "Country",
y = "Count") +
coord_flip() +
guides(fill = "none") +
theme(plot.title = element_text(hjust = 0.5))
p1
ggsave(here("images", "user_country.png"), plot = p1)
## Saving 7 x 5 in image
From the plot above we can see that the users in this sample are spread across several countries, with the United States, Spain, and Canada among those with the most users. Because the dataset is a sample, this shows the breadth of the userbase rather than Netflix’s actual market share in each country.
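A ranking such as the one described here is easier to read when the bars are sorted by count. The sketch below shows one possible way to do that with `count()` and `reorder()`. The `toy_users` data frame is made up for illustration and stands in for `netflix_user`; it is not the original data.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for netflix_user; only the country column matters here.
toy_users <- data.frame(
  country = c("Spain", "Canada", "Spain", "France", "Canada", "Spain")
)

# count() is shorthand for group_by() followed by summarise(n()).
toy_country <- toy_users %>%
  count(country, name = "country_count")

# reorder() sorts the bars by count; geom_col() is equivalent to
# geom_bar(stat = "identity").
p_sorted <- ggplot(toy_country,
                   aes(x = reorder(country, country_count),
                       y = country_count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Count")

print(toy_country)
```

Sorting is a presentation choice. The unsorted chart conveys the same counts, but the order of the top countries is harder to pick out.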
Gender-based analysis helps in tailoring content recommendations. Understanding viewing habits and preferences across genders allows Netflix to enhance its recommendation algorithms, providing a more personalized and engaging user experience. Gender distribution insights inform targeted marketing strategies. Adapting promotional efforts based on gender preferences contributes to more effective and resonant advertising campaigns.
A bar chart was crafted to visually represent the gender distribution
of Netflix users. The dataset, netflix_user, was utilized
with the ggplot function, specifying aesthetics to map the
x-axis to the ‘gender’ variable.
The chosen figure type is a bar chart in which the y-axis shows the
proportion of users of each gender rather than the raw count, which
suits a comparison of relative group sizes. In geom_bar, the y
aesthetic is mapped to the computed variable prop, and group = 1 makes
the proportions be calculated over all users instead of within each
bar (where every bar would have a proportion of 1).
p <- ggplot(data = netflix_user,
mapping = aes(x = gender))
p2 <- p + geom_bar(mapping = aes(y = after_stat(prop), group = 1)) +
labs(title = "Gender Distribution of Users",
x = "Gender",
y = "Proportion") +
theme(plot.title = element_text(hjust = 0.5))
p2
ggsave(here("images", "user_gender.png"), plot = p2)
## Saving 7 x 5 in image
In terms of gender distribution, the sample shows a fairly even split between male and female users. This balance suggests that the platform attracts both groups in similar numbers, although it says nothing about what each group watches.
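The bar heights in a proportion chart can be cross-checked numerically. The base R sketch below computes the same kind of proportions with `table()` and `prop.table()`. The `toy_gender` vector is made up for illustration and stands in for `netflix_user$gender`.

```r
# Made-up gender labels standing in for netflix_user$gender.
toy_gender <- c("Female", "Male", "Female", "Male", "Male")

# table() counts each label; prop.table() divides each count by the total.
gender_share <- prop.table(table(toy_gender))

print(round(gender_share, 2))
```

Running the same two calls on the real column would give the exact shares behind the "fairly even split" reading of the chart.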
Different age groups may have distinct content preferences. Analyzing age distribution guides content curation efforts, helping Netflix offer a diverse library that appeals to users across age demographics. Understanding the age composition aids in optimizing the user interface and experience. It informs decisions related to user interface design, feature prioritization, and the implementation of age-appropriate content filters.
A histogram was generated to illustrate the distribution of Netflix
user ages. A summary dataframe, named user_age, was first derived from
the original netflix_user dataset by grouping the data on age and
counting the users at each age with the summarise function. This table
is printed for inspection only; the histogram itself is built directly
from the age column of netflix_user.
The figure type chosen is a histogram, as it is well-suited for
displaying the distribution of continuous data, such as age. The ggplot
function was utilized to initialize the plot, with the x-axis mapped to
the ‘age’ variable. The geom_histogram function was then
applied to create the histogram, specifying parameters such as the
number of bins (15) and transparency (alpha = 0.7) for visual
clarity.
user_age <- netflix_user %>%
group_by(age) %>%
summarise(age_count = n())
user_age
## # A tibble: 26 × 2
## age age_count
## <int> <int>
## 1 26 1
## 2 27 87
## 3 28 115
## 4 29 104
## 5 30 116
## 6 31 115
## 7 32 92
## 8 33 93
## 9 34 88
## 10 35 105
## # ℹ 16 more rows
p <- ggplot(data = netflix_user,
aes(x = age))
p3 <- p + geom_histogram(bins = 15, alpha = 0.7) +
labs(title = "Distribution of Subscribers Age",
x = "Age",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
p3
ggsave(here("images", "user_age.png"), plot = p3)
## Saving 7 x 5 in image
The summary table lists 26 distinct ages beginning at 26, and the sample includes users as old as 51, so the userbase spans roughly the late twenties to the early fifties. No users are younger than 26, which means the 20-30 bracket is represented only by ages 26 to 29. In the rows shown, each age from 27 to 35 has between 87 and 116 users, while age 26 has a single user, so no one age dominates that part of the range. This suggests that the platform draws users from across this adult age range rather than from one narrow cohort.
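Statements about age brackets can be checked by binning the ages explicitly. The sketch below uses base R's `cut()` for that purpose. The `toy_age` vector and the bracket boundaries are assumptions chosen for illustration, not values taken from the dataset.

```r
# Made-up ages standing in for netflix_user$age.
toy_age <- c(27, 29, 30, 34, 38, 41, 45, 51)

# right = FALSE makes each bracket include its lower bound:
# [20,30), [30,40), [40,50), [50,60).
brackets <- cut(toy_age, breaks = c(20, 30, 40, 50, 60), right = FALSE)

# Count how many users fall in each bracket.
bracket_counts <- table(brackets)
print(bracket_counts)
```

Applied to the real `age` column, the same call would show how many users each ten-year bracket actually contains.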
Understanding which devices are predominantly used to access Netflix allows for the optimization of the user interface and experience on those devices. Tailoring the platform to the specifications of popular devices ensures a seamless and enjoyable user experience. Different devices may have varying capabilities, screen sizes, and resolutions. Analyzing device usage helps in optimizing content delivery and format. For instance, content may need to be adapted for smaller screens or different aspect ratios to enhance viewing quality.
It also aids in resource allocation, especially in terms of technological infrastructure. For example, if a significant portion of users accesses Netflix through smart TVs, the platform can allocate resources to enhance the performance and features of the TV app.
A bar chart was crafted to visualize the distribution of devices used by Netflix users. The process involved creating a new dataframe, named ‘devices’, derived from the original ‘netflix_user’ dataset. The ‘devices’ dataframe was formed by grouping the data based on device types and summarizing the count of users for each device type using the ‘summarise’ function.
The ggplot function was employed to initialize the plot, with aesthetics mapped to the x-axis representing device types (‘device’), the y-axis representing user counts (‘device_count’), and fill color distinguishing between device types.
The ‘geom_bar’ function was then applied to create the bar chart, utilizing the “identity” statistic to use the counts directly.
The bar chart effectively communicates the distribution of Netflix users across different device types, allowing for easy comparison and identification of the most prevalent devices.
devices <- netflix_user %>%
group_by(device) %>%
summarise(device_count = n())
devices
## # A tibble: 4 × 2
## device device_count
## <chr> <int>
## 1 Laptop 636
## 2 Smart TV 610
## 3 Smartphone 621
## 4 Tablet 633
p <- ggplot(data = devices,
aes(x = device, y = device_count, fill = device))
p4 <- p + geom_bar(stat = "identity") +
labs(title = "Distribution of Devices",
x = "Devices Type",
y = "Number of Users") +
guides(fill = "none") +
theme(plot.title = element_text(hjust = 0.5))
p4
ggsave(here("images", "devices.png"), plot = p4)
## Saving 7 x 5 in image
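As we can see, the most common device is the laptop (636 users), followed by the tablet (633), smartphone (621), and smart TV (610). The differences are small: each device accounts for roughly a quarter of the 2,500 users, so no device clearly dominates. Tablets and smartphones together make up about half of the userbase (1,254 users), which shows that mobile viewing is substantial but not that users prefer it over laptops or smart TVs. Its popularity could reflect the flexibility and convenience of these devices, as users can watch content on the go.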
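Since the four device counts are close, it helps to express them as shares and to ask whether the gaps are larger than chance would produce. The sketch below does both with base R, using the counts from the summary table above. The chi-squared goodness-of-fit test against equal shares is an addition for illustration and was not part of the original analysis.

```r
# Device counts copied from the summary table above.
device_count <- c(Laptop = 636, `Smart TV` = 610, Smartphone = 621, Tablet = 633)

# Share of the userbase on each device, in percent.
device_share <- round(100 * device_count / sum(device_count), 1)
print(device_share)

# Goodness-of-fit test; with no p argument, chisq.test() compares the
# counts against equal expected shares.
uniform_test <- chisq.test(device_count)
print(uniform_test$p.value)
```

A large p-value here would be consistent with the differences between devices being sampling noise rather than a real preference.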
Analyzing subscription type distribution informs decisions regarding pricing strategies. It helps evaluate the popularity of each subscription tier, guiding potential adjustments or the introduction of new plans to optimize revenue without compromising user satisfaction.
A bar chart was crafted to visually represent the distribution of
Netflix users across different subscription types. The process commenced
by creating a new dataframe, named sub_type, which was
derived from the original netflix_user dataset. This
dataframe was formed by grouping the data based on subscription types
and summarizing the count of users for each subscription type using the
summarise function.
The ggplot function was employed to initialize the plot, with
aesthetics mapped to the x-axis representing subscription types
(subscription_type), the y-axis representing user counts
(subtype_count), and fill color distinguishing between
different subscription types.
The geom_bar function was then applied to create the bar
chart, utilizing the “identity” statistic to use the counts
directly.
sub_type <- netflix_user %>%
group_by(subscription_type) %>%
summarise(subtype_count = n())
sub_type
## # A tibble: 3 × 2
## subscription_type subtype_count
## <chr> <int>
## 1 Basic 999
## 2 Premium 733
## 3 Standard 768
p <- ggplot(data = sub_type,
aes(x = subscription_type, y = subtype_count, fill = subscription_type))
p5 <- p + geom_bar(stat = "identity") +
labs(title = "Distribution of Subscription Types",
x = "Subscription Type",
y = "Count") +
guides(fill = "none") +
theme(plot.title = element_text(hjust = 0.5))
p5
ggsave(here("images", "sub_type.png"), plot = p5)
## Saving 7 x 5 in image
The distribution of subscription types shows that the Basic subscription is the most common, with 999 users (about 40%), followed by Standard with 768 (about 31%) and Premium with 733 (about 29%). This suggests that a significant portion of users leans towards the most cost-effective option. The Basic subscription, known for its affordability, provides users with access to the Netflix library but limits streaming quality to standard definition and allows only one device to stream at a time. Its popularity indicates that many users prioritize a budget-friendly plan, even if it comes with some limitations in streaming features.
Understanding the distribution of subscription plans can be crucial for strategic decision-making within the company. It can help Netflix tailor its offerings, marketing strategies, and content recommendations to better suit the preferences and demands of users in specific regions. Moreover, identifying patterns in subscription choices may contribute to refining pricing models or introducing targeted promotions to attract and retain users.
A grouped bar chart was created to depict the distribution of subscription plan counts by country in the Netflix userbase.
The country_sub dataframe is generated by grouping the original netflix_user dataset based on the country and subscription type, and then summarizing the count of users for each combination using the summarise function.
The ggplot function is utilized to initialize the plot, specifying aesthetics with the x-axis representing countries (country), the y-axis denoting the number of subscriptions (N), and differentiating subscription types by fill color.
The geom_col function is applied with the “dodge2” position parameter to create a grouped bar chart, facilitating a clear comparison of subscription types within each country.
country_sub <- netflix_user %>%
group_by(country, subscription_type) %>%
summarise(N = n(), .groups = "drop")
p <- ggplot(data = country_sub,
mapping = aes(x = country, y = N, fill = subscription_type))
p6 <- p + geom_col(position = "dodge2") +
labs(title = "Subscription Plan Counts by Country",
x = "Country",
y = "Number of Subscriptions",
fill = "Subscription Type") +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5))
p6
ggsave(here("images", "country_sub.png"), plot = p6)
## Saving 7 x 5 in image
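Raw counts in a grouped bar chart are hard to compare when countries differ in size, so it can help to look at each plan's share within its country. The sketch below shows one way to compute that with dplyr. The `toy_users` data frame and the `share` column are made up for illustration and are not part of the original analysis.

```r
library(dplyr)

# Made-up stand-in for netflix_user with only the two grouping columns.
toy_users <- data.frame(
  country = c("Spain", "Spain", "Spain",
              "Canada", "Canada", "Canada", "Canada"),
  subscription_type = c("Basic", "Basic", "Premium",
                        "Standard", "Standard", "Basic", "Premium")
)

toy_country_sub <- toy_users %>%
  group_by(country, subscription_type) %>%
  # "drop_last" keeps the result grouped by country only.
  summarise(N = n(), .groups = "drop_last") %>%
  # Within each country, divide each plan's count by the country total.
  mutate(share = N / sum(N)) %>%
  ungroup()

print(toy_country_sub)
```

Mapping `share` instead of `N` to the y-axis would turn the chart into a comparison of plan mix that does not depend on how many users each country has.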
Understanding how revenue is distributed among different subscription types helps in identifying the primary sources of income for Netflix.
A boxplot was created to visually represent the distribution of
monthly revenue across different subscription types in the Netflix
userbase. The ggplot function was employed to initialize the plot,
specifying aesthetics with the x-axis representing subscription types
(subscription_type), the y-axis representing monthly
revenue (monthly_revenue), and fill color distinguishing
between subscription types.
A boxplot is particularly effective for visualizing the central
tendency, spread, and presence of outliers within a dataset. The
geom_boxplot function was applied to generate the boxplot,
providing a clear representation of the distribution of monthly revenue
for each subscription type.
p <- ggplot(data = netflix_user,
mapping = aes(x = subscription_type, y = monthly_revenue, fill = subscription_type))
p7 <- p + geom_boxplot() +
labs(title = "Monthly Revenue by Subscription Type",
x = "Subscription Type",
y = "Monthly Revenue") +
guides(fill = "none") +
theme(plot.title = element_text(hjust = 0.5))
p7
ggsave(here("images", "rev_subtype.png"), plot = p7)
## Saving 7 x 5 in image
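As expected, the Premium subscription type brings in the most revenue per user per month: although the Basic subscription has the most users, Premium’s higher price point yields more revenue from each subscriber. Total revenue per plan also depends on the number of subscribers, so per-user revenue alone does not identify the largest source of income.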
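A boxplot shows revenue per user, while the question of which plan earns the most overall depends on totals. The base R sketch below computes both the median and the total per plan. The `toy_rev` data frame holds made-up values for illustration only.

```r
# Made-up stand-in for netflix_user; revenue values are illustrative only.
toy_rev <- data.frame(
  subscription_type = c("Basic", "Basic", "Standard", "Standard", "Premium"),
  monthly_revenue   = c(10, 11, 12, 13, 15)
)

# Median revenue per user for each plan (the centre line of each box).
median_rev <- tapply(toy_rev$monthly_revenue, toy_rev$subscription_type, median)

# Total revenue per plan, which also depends on how many users hold it.
total_rev <- tapply(toy_rev$monthly_revenue, toy_rev$subscription_type, sum)

print(median_rev)
print(total_rev)
```

In this toy data the plan with the highest median is not the plan with the highest total, which is why the two summaries are worth reporting side by side.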
Analyzing user join dates helps assess how the Netflix userbase has been growing over time. It provides a historical perspective on user acquisition, indicating periods of accelerated growth or potential fluctuations. Examining join dates can reveal any seasonal patterns or trends in user sign-ups. For example, there might be increased subscriptions during holidays, promotional events, or specific seasons, which can inform marketing strategies.
If Netflix introduces new features, original content, or expands its service to new regions, analyzing user join dates helps understand how these events impact user acquisition. Analyzing join dates can be complemented with data on user retention and engagement. This holistic view helps in understanding not only how many users are joining but also how well they are retained over time.
In this R code snippet, a time series analysis of user join dates in
the Netflix userbase is performed. The user_join dataframe
is created by transforming the original netflix_user
dataset. The ‘join_date’ column is first converted to a date format
using as.Date, and a new column named ‘join_month’ is
created to represent the month in which each user joined.
The data is then grouped by ‘join_month,’ and the number of users
joining in each month is summarized using the summarise
function.
A line plot is generated using the ggplot2 library, where the x-axis
represents the join dates (‘join_month’), and the y-axis represents the
count of users joining in each month (‘join_count’). Both line segments
(geom_line) and points (geom_point) are added
to provide a comprehensive view of the trend.
Furthermore, the x-axis breaks and labels are formatted to display
months
(scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m")).
user_join <- netflix_user %>%
mutate(join_date = as.Date(join_date)) %>%
mutate(join_month = floor_date(join_date, "month")) %>%
group_by(join_month) %>%
summarise(join_count = n())
user_join
## # A tibble: 22 × 2
## join_month join_count
## <date> <int>
## 1 2021-09-01 3
## 2 2021-10-01 3
## 3 2021-11-01 4
## 4 2021-12-01 4
## 5 2022-01-01 8
## 6 2022-02-01 5
## 7 2022-03-01 13
## 8 2022-04-01 19
## 9 2022-05-01 40
## 10 2022-06-01 295
## # ℹ 12 more rows
p <- ggplot(user_join, aes(x = join_month,
y = join_count))
p8 <- p + geom_line() +
geom_point() +
labs(title = "Number of Users Joining Over Time",
x = "Date",
y = "Number of Users") +
scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
p8
ggsave(here("images", "user_join.png"), plot = p8)
## Saving 7 x 5 in image
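Monthly join counts show when users signed up, whereas growth of the userbase is easier to see as a running total. The sketch below computes that total with `cumsum()` for the first ten months listed in the table above. The `cumulative_users` name is illustrative and not part of the original analysis.

```r
# First ten months and join counts, copied from the table above.
join_month <- seq(as.Date("2021-09-01"), by = "month", length.out = 10)
join_count <- c(3, 3, 4, 4, 8, 5, 13, 19, 40, 295)

# Running total of users who have joined up to and including each month.
cumulative_users <- cumsum(join_count)

print(data.frame(join_month, join_count, cumulative_users))
```

Plotting `cumulative_users` against `join_month` would give a growth curve to set beside the monthly line plot.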
Understanding where new users are joining from helps in assessing the geographical distribution of the Netflix userbase. It allows for insights into which countries contribute most to user growth. The analysis can reveal the effectiveness of marketing strategies or regional campaigns. Patterns in join dates may align with specific marketing efforts, helping to assess the success of promotions or advertising campaigns in different countries.
Join date analysis can highlight periods associated with the launch of Netflix in new countries or the introduction of localized content. This information is crucial for evaluating the impact of expansions on user acquisition. Examining the relationship between join dates and countries allows for the identification of potential differences in user engagement across regions. Understanding when users join and how they engage with the platform can inform content localization and user experience improvements. The insights gained from this analysis can inform strategic decisions related to resource allocation, marketing budgets, and content localization efforts. It helps Netflix tailor its strategies to better meet the needs and preferences of users in different regions.
A scatter plot was created to visualize the relationship between the join date of Netflix users and their respective countries.
The join_date column in the netflix_user
dataset is converted to a date format using the as.Date
function. The ggplot function is then used to initiate the plot, with
aesthetics mapped to the x-axis representing join dates
(join_date), the y-axis denoting countries
(country), and each point colored according to its
respective country.
The geom_point function is applied to generate the
scatter plot, where each point represents a user, and its position is
determined by the join date and the corresponding country.
netflix_user$join_date <- as.Date(netflix_user$join_date)
p <- ggplot(netflix_user, aes(x = join_date,
y = country,
color = country))
p9 <- p + geom_point() +
labs(title = "Relationship Between Join Date and Country",
x = "",
y = "Country") +
scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
p9
ggsave(here("images", "join_country.png"), plot = p9)
## Saving 7 x 5 in image
The scatter plot reveals that the majority of new user growth in several countries was concentrated in the latter half of 2022. Since then, there has been a notable decline in new user additions, particularly in France, Canada, Brazil, and Australia when compared to other countries. This could suggest challenges specific to these countries, such as market saturation or changing user demands.
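A reading like this one can be backed with counts by tabulating joins per country and period. The base R sketch below labels each join date with its half-year and cross-tabulates the result. The `toy_joins` data frame and the `period` labels are made up for illustration and do not reproduce the dataset.

```r
# Made-up stand-in for netflix_user with parsed join dates.
toy_joins <- data.frame(
  country   = c("France", "France", "Spain", "Spain", "Spain", "France"),
  join_date = as.Date(c("2022-08-14", "2022-11-02", "2022-09-30",
                        "2023-02-11", "2023-03-05", "2022-05-20"))
)

# Label each join date with its year and half (H1 = Jan-Jun, H2 = Jul-Dec).
month_num <- as.integer(format(toy_joins$join_date, "%m"))
toy_joins$period <- paste0(format(toy_joins$join_date, "%Y"),
                           ifelse(month_num <= 6, "-H1", "-H2"))

# Cross-tabulate the number of joins by country and period.
joins_by_period <- table(toy_joins$country, toy_joins$period)
print(joins_by_period)
```

On the real data, a table of this shape would show directly which countries added fewer users after the second half of 2022.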
Leveraging the successful strategies from the peak growth period can be essential for Netflix to sustain user growth. For countries experiencing stagnation in new user additions, a reassessment of market strategies may be necessary. This could involve adjusting promotional activities or offering customized content to better align with local preferences.