Executive summary

This project analyzes the Netflix Userbase Dataset to understand user behavior, subscription patterns, and platform growth. The dataset contains 2,500 rows, one per user, with features such as subscription type, monthly revenue, join date, and user demographics.

Three of the figures highlight distinct facets of the Netflix userbase:

  1. Geographical Subscription Distribution: A grouped bar chart illustrates the distribution of subscription plans across countries. This figure type allows easy comparison and helps identify the prevalent subscription types in each region.
  2. Age Distribution of Users: A histogram depicts the age distribution of Netflix users. It describes the demographic composition of the userbase, which supports targeted content recommendations and marketing strategies.
  3. Monthly Revenue by Subscription Type: A boxplot visualizes the distribution of monthly revenue across subscription types. It shows how revenue varies within and between plans, which supports pricing strategy considerations.

A time series analysis explored the trend in user join dates over time. A scatter plot showcased the relationship between user join dates and their respective countries. This visual exploration contributes to understanding regional user dynamics and potential influences on user engagement.

The insights derived from this analysis hold strategic implications for Netflix. Understanding subscription patterns, user demographics, and temporal dynamics empowers data-driven decision-making. It informs marketing strategies, content localization efforts, and user experience enhancements, contributing to sustained user growth and engagement.

Data background

The dataset used for this analysis is the Netflix Userbase Dataset, sourced from Kaggle (https://www.kaggle.com/datasets/arnavsmayan/netflix-userbase-dataset/data). It comprises a snapshot of a sample Netflix userbase, showcasing various aspects of user subscriptions, revenue, account details, and activity. The dataset consists of 2,500 rows, with each row representing a unique user. There are 10 feature variables provided:

  1. User ID: A unique identifier for each user.
  2. Subscription Type: The type of subscription a user has (Basic, Standard, or Premium).
  3. Monthly Revenue: The revenue generated from a user’s subscription every month.
  4. Join Date: The date the user joined Netflix.
  5. Last Payment Date: The last date the user made a payment.
  6. Country: The country in which the user is located.
  7. Age: The age of the user.
  8. Gender: The gender of the user.
  9. Device: The primary device used by the user to access Netflix.
  10. Plan Duration: The duration of the subscription plan.

Data cleaning

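The Netflix Userbase Dataset was cleaned and reshaped in R. Columns were renamed to snake_case with the rename function, and the date columns (join_date and last_payment_date) were parsed from day-month-year text and rewritten in the “%Y-%m-%d” format with the mutate function. Because format() returns text, the cleaned date columns are character strings; later sections convert them back with as.Date before plotting. The Plan.Duration column was left under its original name.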

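library(dplyr)      # rename, mutate, group_by, summarise, %>%
library(ggplot2)    # plotting
library(lubridate)  # floor_date
library(here)       # project-relative paths for ggsave

netflix_user <- read.csv("data/Netflix Userbase.csv") %>%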
    rename(user_id = User.ID,
           subscription_type = Subscription.Type,
           monthly_revenue = Monthly.Revenue,
           join_date = Join.Date,
           last_payment_date = Last.Payment.Date,
           country = Country,
           age = Age,
           gender = Gender,
           device = Device) %>%
    mutate(join_date = format(as.Date(join_date, format = "%d-%m-%y"), "%Y-%m-%d"),
           last_payment_date = format(as.Date(last_payment_date, format = "%d-%m-%y"), "%Y-%m-%d"))
head(netflix_user)
##   user_id subscription_type monthly_revenue  join_date last_payment_date
## 1       1             Basic              10 2022-01-15        2023-06-10
## 2       2           Premium              15 2021-09-05        2023-06-22
## 3       3          Standard              12 2023-02-28        2023-06-27
## 4       4          Standard              12 2022-07-10        2023-06-26
## 5       5             Basic              10 2023-05-01        2023-06-28
## 6       6           Premium              15 2022-03-18        2023-06-27
##          country age gender     device Plan.Duration
## 1  United States  28   Male Smartphone       1 Month
## 2         Canada  35 Female     Tablet       1 Month
## 3 United Kingdom  42   Male   Smart TV       1 Month
## 4      Australia  51 Female     Laptop       1 Month
## 5        Germany  33   Male Smartphone       1 Month
## 6         France  29 Female   Smart TV       1 Month
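
One consequence of wrapping as.Date() in format() is that the cleaned columns hold character strings rather than Date objects, which is why later sections call as.Date() again. A minimal base R sketch shows the difference. The vector raw_dates is hypothetical: three made-up values in the dd-mm-yy layout that the cleaning code assumes, not values read from the CSV.

```r
# Hypothetical raw values in the dd-mm-yy layout assumed by the cleaning code
raw_dates <- c("15-01-22", "05-09-21", "28-02-23")

# Parsing alone yields Date objects, which print as YYYY-MM-DD by default
parsed <- as.Date(raw_dates, format = "%d-%m-%y")

# Wrapping the result in format() converts it back to character strings
formatted <- format(parsed, "%Y-%m-%d")

class(parsed)     # "Date"
class(formatted)  # "character"

# Date objects support date arithmetic directly; strings do not
diff(range(parsed))
```

If the columns were kept as Date objects (that is, without the outer format() call), the later as.Date() conversions would be unnecessary; whether that suits the rest of the workflow is a judgment call.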

Understanding Netflix’s Users

Figure 1: Geographic Distribution

Different regions may exhibit preferences for specific genres, languages, or cultural content. Analyzing geographic distribution informs content localization strategies, allowing Netflix to curate and promote content that resonates with specific audiences. Understanding which regions contribute most to the userbase aids in strategic decision-making for market expansion. It guides resource allocation, marketing efforts, and the prioritization of content licensing in regions with growth potential.

A horizontal bar chart was created to visualize the distribution of Netflix users across countries. Its input, a data frame named user_country, was derived from the original netflix_user data frame by grouping on country and counting the users in each country.

The ggplot function was employed to build the bar chart, specifying the aesthetics with the x-axis representing countries (country), the y-axis denoting the user count (country_count), and differentiating countries by fill color.

The choice of a bar chart was intentional as it effectively communicates the variation in user counts across countries. The horizontal orientation, achieved with the coord_flip function, enhances readability, especially when dealing with a large number of countries. This orientation allows for clearer labeling on the y-axis, facilitating easier comparison.

To streamline the presentation, the legend was removed using the guides function, as the fill color directly corresponds to the countries. The overall aesthetic was refined by centering the plot title using the theme function.

user_country <- netflix_user %>%
    group_by(country) %>%
    summarise(country_count = n())

p <- ggplot(data = user_country, 
            aes(x = country, y = country_count, fill = country))

p1 <- p + geom_bar(stat = "identity") +
    labs(title = "Number of Netflix Users by Country", 
         x = "Country", 
         y = "Count") +
    coord_flip() +
    guides(fill = "none") +
    theme(plot.title = element_text(hjust = 0.5))
p1

ggsave(here("images", "user_country.png"), plot = p1)
## Saving 7 x 5 in image

From the plot above we can see that Netflix’s users are spread across several countries. The countries with the most users are the United States, Spain, and Canada, among others. Netflix’s wide geographic reach is a testament to its global appeal.

Figure 2: Gender Distribution

Gender-based analysis helps in tailoring content recommendations. Understanding viewing habits and preferences across genders allows Netflix to enhance its recommendation algorithms, providing a more personalized and engaging user experience. Gender distribution insights inform targeted marketing strategies. Adapting promotional efforts based on gender preferences contributes to more effective and resonant advertising campaigns.

A bar chart was crafted to visually represent the gender distribution of Netflix users. The dataset, netflix_user, was utilized with the ggplot function, specifying aesthetics to map the x-axis to the ‘gender’ variable.

The chosen figure type is a bar chart in which the y-axis represents the proportion of users of each gender. This choice is effective for illustrating the relative distribution of users across gender categories. In geom_bar, y is mapped to the computed proportion (prop) with group = 1, so that each bar is expressed as a share of all users rather than as a raw count.

p <- ggplot(data = netflix_user,
            mapping = aes(x = gender))

p2 <- p + geom_bar(mapping = aes(y = after_stat(prop), group = 1)) +
    labs(title = "Gender Distribution of Users", 
         x = "Gender",
         y = "Proportion") +
    theme(plot.title = element_text(hjust = 0.5))
p2

ggsave(here("images", "user_gender.png"), plot = p2)
## Saving 7 x 5 in image

The bar chart shows a fairly even split between male and female users in this sample. This balance suggests that the platform attracts both groups to a similar degree, although the dataset contains no viewing data that would show whether their content preferences differ.
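
The proportions plotted on the y-axis can also be tabulated directly, which makes the split easy to report as numbers. The sketch below uses base R on a hypothetical toy vector (gender is made up here and merely stands in for netflix_user$gender).

```r
# Hypothetical gender values standing in for netflix_user$gender
gender <- c("Male", "Female", "Female", "Male", "Female", "Male", "Male", "Female")

# Count each category, then divide by the total to get proportions
gender_counts <- table(gender)
gender_props <- prop.table(gender_counts)

round(gender_props, 3)
```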

Figure 3: Age Distribution

Different age groups may have distinct content preferences. Analyzing age distribution guides content curation efforts, helping Netflix offer a diverse library that appeals to users across age demographics. Understanding the age composition aids in optimizing the user interface and experience. It informs decisions related to user interface design, feature prioritization, and the implementation of age-appropriate content filters.

A histogram was generated to illustrate the distribution of Netflix user ages. As a first step, a summary data frame named user_age was derived from netflix_user by grouping on age and counting the users at each age with the summarise function. This table is printed for inspection only; the histogram itself is drawn from the unaggregated netflix_user data so that geom_histogram can do the binning.

The figure type chosen is a histogram, as it is well-suited for displaying the distribution of continuous data, such as age. The ggplot function was utilized to initialize the plot, with the x-axis mapped to the ‘age’ variable. The geom_histogram function was then applied to create the histogram, specifying parameters such as the number of bins (15) and transparency (alpha = 0.7) for visual clarity.

user_age <- netflix_user %>%
    group_by(age) %>%
    summarise(age_count = n())
user_age
## # A tibble: 26 × 2
##      age age_count
##    <int>     <int>
##  1    26         1
##  2    27        87
##  3    28       115
##  4    29       104
##  5    30       116
##  6    31       115
##  7    32        92
##  8    33        93
##  9    34        88
## 10    35       105
## # ℹ 16 more rows
p <- ggplot(data = netflix_user, 
            aes(x = age))

p3 <- p + geom_histogram(bins = 15, alpha = 0.7) +
    labs(title = "Distribution of Subscriber Ages", 
         x = "Age",
         y = "Count") +
    theme(plot.title = element_text(hjust = 0.5))
p3

ggsave(here("images", "user_age.png"), plot = p3)
## Saving 7 x 5 in image

The age distribution of Netflix users in this sample is fairly even rather than sharply peaked. The summary table lists 26 distinct ages, the youngest being 26, and the counts shown for ages 27 to 35 are of similar size (between 87 and 116 users each). The most common age group is around 30-40 years, with a considerable number of users in their late 20s and in the 40-50 bracket as well. Within the range the sample covers, Netflix’s content appears to reach adults of different ages, which is a positive sign for the company’s ability to maintain a diverse user base. The dataset says little about users younger than 26.
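
Statements about age brackets are easier to check when the brackets are counted explicitly. A minimal base R sketch follows; the ages vector is hypothetical and stands in for netflix_user$age, and the ten-year break points are an assumption chosen for illustration.

```r
# Hypothetical ages standing in for netflix_user$age
ages <- c(26, 28, 31, 35, 38, 42, 44, 47, 51)

# Ten-year brackets; right = FALSE makes each interval [lower, upper)
brackets <- cut(ages,
                breaks = c(20, 30, 40, 50, 60),
                right = FALSE,
                labels = c("20-29", "30-39", "40-49", "50-59"))

# Number of users per bracket, plus the observed minimum and maximum age
bracket_counts <- table(brackets)
bracket_counts
range(ages)
```

Checking range() alongside the bracket counts guards against describing a bracket such as 20-29 as well represented when only its upper end is actually populated.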

Understanding Netflix’s Content Consumption

Figure 4: Device Usage Distribution

Understanding which devices are predominantly used to access Netflix allows for the optimization of the user interface and experience on those devices. Tailoring the platform to the specifications of popular devices ensures a seamless and enjoyable user experience. Different devices may have varying capabilities, screen sizes, and resolutions. Analyzing device usage helps in optimizing content delivery and format. For instance, content may need to be adapted for smaller screens or different aspect ratios to enhance viewing quality.

It also aids in resource allocation, especially in terms of technological infrastructure. For example, if a significant portion of users accesses Netflix through smart TVs, the platform can allocate resources to enhance the performance and features of the TV app.

A bar chart was crafted to visualize the distribution of devices used by Netflix users. The process involved creating a new dataframe, named ‘devices’, derived from the original ‘netflix_user’ dataset. The ‘devices’ dataframe was formed by grouping the data based on device types and summarizing the count of users for each device type using the ‘summarise’ function.

The ggplot function was employed to initialize the plot, with aesthetics mapped to the x-axis representing device types (‘device’), the y-axis representing user counts (‘device_count’), and fill color distinguishing between device types.

The ‘geom_bar’ function was then applied to create the bar chart, utilizing the “identity” statistic to use the counts directly.

The bar chart effectively communicates the distribution of Netflix users across different device types, allowing for easy comparison and identification of the most prevalent devices.

devices <- netflix_user %>%
    group_by(device) %>%
    summarise(device_count = n())
devices
## # A tibble: 4 × 2
##   device     device_count
##   <chr>             <int>
## 1 Laptop              636
## 2 Smart TV            610
## 3 Smartphone          621
## 4 Tablet              633
p <- ggplot(data = devices, 
            aes(x = device, y = device_count, fill = device))

p4 <- p + geom_bar(stat = "identity") +
    labs(title = "Distribution of Devices", 
         x = "Device Type", 
         y = "Number of Users") +
    guides(fill = "none") + 
    theme(plot.title = element_text(hjust = 0.5))
p4

ggsave(here("images", "devices.png"), plot = p4)
## Saving 7 x 5 in image

Device usage is almost evenly split. Laptops are the most common device (636 users), followed by tablets (633), smartphones (621), and smart TVs (610), a gap of only 26 users between first and last in a sample of 2,500. Portable devices (laptops, tablets, and smartphones) together account for 1,890 users, about 76% of the sample, which may reflect the flexibility and convenience of watching on the go. Given how small the differences are, no single device stands out as clearly preferred.
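
To judge whether the differences between devices are meaningful, the counts from the summary table can be converted to shares and compared with an even split. The base R sketch below copies the four counts from the output above; the chi-squared goodness-of-fit test is an added illustration, not part of the original analysis.

```r
# Device counts copied from the summary table above
device_count <- c(Laptop = 636, `Smart TV` = 610, Smartphone = 621, Tablet = 633)

# Share of the userbase per device, in percent
device_share <- round(100 * device_count / sum(device_count), 2)
device_share

# Chi-squared goodness-of-fit test against equal shares (the default p)
fit <- chisq.test(device_count)
fit$statistic
fit$p.value
```

The test statistic is about 0.68 on 3 degrees of freedom, far below the 5% critical value of roughly 7.81, so these counts are consistent with users being spread evenly across the four device types.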

Understanding Netflix’s Subscription Habits

Figure 5: Subscription Type Distribution

Analyzing subscription type distribution informs decisions regarding pricing strategies. It helps evaluate the popularity of each subscription tier, guiding potential adjustments or the introduction of new plans to optimize revenue without compromising user satisfaction.

A bar chart was crafted to visually represent the distribution of Netflix users across different subscription types. The process commenced by creating a new dataframe, named sub_type, which was derived from the original netflix_user dataset. This dataframe was formed by grouping the data based on subscription types and summarizing the count of users for each subscription type using the summarise function.

The ggplot function was employed to initialize the plot, with aesthetics mapped to the x-axis representing subscription types (subscription_type), the y-axis representing user counts (subtype_count), and fill color distinguishing between different subscription types.

The geom_bar function was then applied to create the bar chart, utilizing the “identity” statistic to use the counts directly.

sub_type <- netflix_user %>%
    group_by(subscription_type) %>%
    summarise(subtype_count = n())
sub_type
## # A tibble: 3 × 2
##   subscription_type subtype_count
##   <chr>                     <int>
## 1 Basic                       999
## 2 Premium                     733
## 3 Standard                    768
p <- ggplot(data = sub_type, 
            aes(x = subscription_type, y = subtype_count, fill = subscription_type))

p5 <- p + geom_bar(stat = "identity") +
    labs(title = "Distribution of Subscription Types", 
         x = "Subscription Type", 
         y = "Count") +
    guides(fill = "none") + 
    theme(plot.title = element_text(hjust = 0.5))
p5

ggsave(here("images", "sub_type.png"), plot = p5)
## Saving 7 x 5 in image

The distribution of subscription types shows a clear preference for the Basic subscription (999 users, about 40% of the sample), followed by Standard (768) and Premium (733). This suggests that a significant portion of users leans towards the most cost-effective option. The Basic subscription, known for its affordability, provides access to the Netflix library but limits streaming quality to standard definition and allows only one device to stream at a time. Its popularity indicates that many users prioritize a budget-friendly plan even when it comes with limited streaming features.

Figure 6: Distribution of Subscription Plans by Country

Understanding the distribution of subscription plans can be crucial for strategic decision-making within the company. It can help Netflix tailor its offerings, marketing strategies, and content recommendations to better suit the preferences and demands of users in specific regions. Moreover, identifying patterns in subscription choices may contribute to refining pricing models or introducing targeted promotions to attract and retain users.

A grouped bar chart was created to depict the distribution of subscription plan counts by country in the Netflix userbase.

The country_sub dataframe is generated by grouping the original netflix_user dataset based on the country and subscription type, and then summarizing the count of users for each combination using the summarise function.

The ggplot function is utilized to initialize the plot, specifying aesthetics with the x-axis representing countries (country), the y-axis denoting the number of subscriptions (N), and differentiating subscription types by fill color.

The geom_col function is applied with the “dodge2” position parameter to create a grouped bar chart, facilitating a clear comparison of subscription types within each country.

country_sub <- netflix_user %>%
  group_by(country, subscription_type) %>%
  summarise(N = n(), .groups = "drop")
p <- ggplot(data = country_sub,
            mapping = aes(x = country, y = N, fill = subscription_type))

p6 <- p + geom_col(position = "dodge2") +
    labs(title = "Subscription Plan Counts by Country",
         x = "Country", 
         y = "Number of Subscriptions", 
         fill = "Subscription Type") +
    theme(legend.position = "bottom",
          plot.title = element_text(hjust = 0.5))
p6

ggsave(here("images", "country_sub.png"), plot = p6)
## Saving 7 x 5 in image
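
Raw counts make countries with many users dominate the chart, so comparing plan preferences across countries is often easier with within-country shares. The base R sketch below shows one way to compute them. The toy data frame is hypothetical and stands in for netflix_user.

```r
# Hypothetical rows standing in for netflix_user
toy <- data.frame(
  country = c("Spain", "Spain", "Spain", "Spain", "Canada", "Canada"),
  subscription_type = c("Basic", "Basic", "Premium", "Standard", "Basic", "Premium")
)

# Cross-tabulate counts, then normalise each row (country) to sum to 1
counts <- table(toy$country, toy$subscription_type)
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```

In ggplot2, a comparable view is available by passing position = "fill" to geom_col instead of "dodge2", which stacks each country's bars to a common height of 1.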

Understanding Netflix’s Revenue Generation

Figure 7: Revenue Distribution by Subscription Type

Understanding how revenue is distributed among different subscription types helps in identifying the primary sources of income for Netflix.

A boxplot was created to visually represent the distribution of monthly revenue across different subscription types in the Netflix userbase. The ggplot function was employed to initialize the plot, specifying aesthetics with the x-axis representing subscription types (subscription_type), the y-axis representing monthly revenue (monthly_revenue), and fill color distinguishing between subscription types.

A boxplot is particularly effective for visualizing the central tendency, spread, and presence of outliers within a dataset. The geom_boxplot function was applied to generate the boxplot, providing a clear representation of the distribution of monthly revenue for each subscription type.

p <- ggplot(data = netflix_user,
            mapping = aes(x = subscription_type, y = monthly_revenue, fill = subscription_type))

p7 <- p + geom_boxplot() +
     labs(title = "Monthly Revenue by Subscription Type", 
         x = "Subscription Type", 
         y = "Monthly Revenue") +
    guides(fill = "none") + 
    theme(plot.title = element_text(hjust = 0.5))
p7

ggsave(here("images", "rev_subtype.png"), plot = p7)
## Saving 7 x 5 in image

As expected, the Premium subscription type brings in the most revenue per user per month. Even though the Basic subscription has the most users, the higher price point of the Premium subscription leads to more revenue per user.
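
Revenue per user and total revenue answer different questions, and a plan can lead on one while trailing on the other. The base R sketch below computes both by subscription type. The toy data frame and its revenue values are hypothetical and do not come from the dataset.

```r
# Hypothetical rows standing in for netflix_user
toy <- data.frame(
  subscription_type = c("Basic", "Basic", "Basic", "Standard", "Premium"),
  monthly_revenue   = c(10, 10, 11, 12, 15)
)

# Mean revenue per user and total revenue, by subscription type
mean_rev  <- tapply(toy$monthly_revenue, toy$subscription_type, mean)
total_rev <- tapply(toy$monthly_revenue, toy$subscription_type, sum)

mean_rev
total_rev
```

In this toy example Premium has the highest mean revenue per user, while Basic contributes the largest total simply because it has the most users.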

Figure 8: The trend in the number of users joining Netflix over time

Analyzing user join dates helps assess how the Netflix userbase has been growing over time. It provides a historical perspective on user acquisition, indicating periods of accelerated growth or potential fluctuations. Examining join dates can reveal any seasonal patterns or trends in user sign-ups. For example, there might be increased subscriptions during holidays, promotional events, or specific seasons, which can inform marketing strategies.

If Netflix introduces new features, original content, or expands its service to new regions, analyzing user join dates helps understand how these events impact user acquisition. Analyzing join dates can be complemented with data on user retention and engagement. This holistic view helps in understanding not only how many users are joining but also how well they are retained over time.

In this R code snippet, a time series analysis of user join dates in the Netflix userbase is performed. The user_join dataframe is created by transforming the original netflix_user dataset. The ‘join_date’ column is first converted to a date format using as.Date, and a new column named ‘join_month’ is created to represent the month in which each user joined.

The data is then grouped by ‘join_month,’ and the number of users joining in each month is summarized using the summarise function.

A line plot is generated using the ggplot2 library, where the x-axis represents the join dates (‘join_month’), and the y-axis represents the count of users joining in each month (‘join_count’). Both line segments (geom_line) and points (geom_point) are added to provide a comprehensive view of the trend.

Furthermore, the x-axis breaks and labels are formatted to display months (scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m")).

user_join <- netflix_user %>%
    mutate(join_date = as.Date(join_date)) %>%
    mutate(join_month = floor_date(join_date, "month")) %>%
    group_by(join_month) %>%
    summarise(join_count = n())
user_join
## # A tibble: 22 × 2
##    join_month join_count
##    <date>          <int>
##  1 2021-09-01          3
##  2 2021-10-01          3
##  3 2021-11-01          4
##  4 2021-12-01          4
##  5 2022-01-01          8
##  6 2022-02-01          5
##  7 2022-03-01         13
##  8 2022-04-01         19
##  9 2022-05-01         40
## 10 2022-06-01        295
## # ℹ 12 more rows
p <- ggplot(user_join, aes(x = join_month, 
                           y = join_count))
p8 <- p + geom_line() +
    geom_point() +
    labs(title = "Number of Users Joining Over Time",
         x = "Date",
         y = "Number of Users") +
    scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          plot.title = element_text(hjust = 0.5))
p8

ggsave(here("images", "user_join.png"), plot = p8)
## Saving 7 x 5 in image

The line plot shows three kinds of periods:

  1. Stability Periods (2021-09 to 2022-05 and 2022-12 to 2023-07):
     • During these intervals, the number of users joining Netflix each month remained low and relatively stable.
  2. Periods of Rapid Increase (2022-05 to 2022-07 and 2022-09 to 2022-10):
     • Significant spikes in user joins were observed during these periods, signaling a surge in user acquisition. The rapid increase may reflect successful strategies, marketing campaigns, or external factors that attracted a large number of new users within relatively short timeframes.
  3. Periods of Rapid Decrease (2022-07 to 2022-09 and 2022-10 to 2022-12):
     • Conversely, these periods saw a sharp decline in the number of users joining Netflix. The rapid decrease may be attributed to various factors, such as seasonal fluctuations, changes in market dynamics, or temporary shifts in user interest.
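
The periods listed above can be made more precise by computing the month-over-month change in join counts. The base R sketch below uses only the first ten rows printed in the user_join output; the remaining twelve rows are not shown in this report and are therefore left out.

```r
# First ten rows of the user_join output shown above
join_month <- seq(as.Date("2021-09-01"), by = "month", length.out = 10)
join_count <- c(3, 3, 4, 4, 8, 5, 13, 19, 40, 295)

# Change relative to the previous month (the first month has no predecessor)
change <- c(NA, diff(join_count))

monthly <- data.frame(join_month, join_count, change)
monthly

# Month with the largest absolute increase in this excerpt
monthly$join_month[which.max(monthly$change)]
```

Within this excerpt the largest jump is from May to June 2022, when monthly joins rose by 255 (from 40 to 295).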

Figure 9: Relationship between join date and country

Understanding where new users are joining from helps in assessing the geographical distribution of the Netflix userbase. It allows for insights into which countries contribute most to user growth. The analysis can reveal the effectiveness of marketing strategies or regional campaigns. Patterns in join dates may align with specific marketing efforts, helping to assess the success of promotions or advertising campaigns in different countries.

Join date analysis can highlight periods associated with the launch of Netflix in new countries or the introduction of localized content. This information is crucial for evaluating the impact of expansions on user acquisition. Examining the relationship between join dates and countries allows for the identification of potential differences in user engagement across regions. Understanding when users join and how they engage with the platform can inform content localization and user experience improvements. The insights gained from this analysis can inform strategic decisions related to resource allocation, marketing budgets, and content localization efforts. It helps Netflix tailor its strategies to better meet the needs and preferences of users in different regions.

A scatter plot was created to visualize the relationship between the join date of Netflix users and their respective countries.

The join_date column in the netflix_user dataset is converted to a date format using the as.Date function. The ggplot function is then used to initiate the plot, with aesthetics mapped to the x-axis representing join dates (join_date), the y-axis denoting countries (country), and each point colored according to its respective country.

The geom_point function is applied to generate the scatter plot, where each point represents a user, and its position is determined by the join date and the corresponding country.

netflix_user$join_date <- as.Date(netflix_user$join_date)

p <- ggplot(netflix_user, aes(x = join_date, 
                              y = country,
                              color = country))
p9 <- p + geom_point() +
    labs(title = "Relationship Between Join Date and Country",
         x = "",
         y = "Country") +
    scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          plot.title = element_text(hjust = 0.5))
p9

ggsave(here("images", "join_country.png"), plot = p9)
## Saving 7 x 5 in image

The scatter plot shows that most new user growth in several countries was concentrated in the latter half of 2022. Since then, new user additions have declined notably, particularly in France, Canada, Brazil, and Australia compared with other countries. This could point to challenges specific to these countries, such as market saturation or changing user demands.
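
Because many points overlap in a scatter plot of this kind, a table of joins per country and period can back up the visual impression. The base R sketch below labels each join date with its calendar half-year and cross-tabulates the result. The toy data frame is hypothetical and stands in for netflix_user, and the half-year grouping is an assumption made for illustration.

```r
# Hypothetical rows standing in for netflix_user
toy <- data.frame(
  country   = c("France", "France", "Spain", "Spain", "Spain"),
  join_date = as.Date(c("2022-08-14", "2022-11-02", "2022-10-20",
                        "2023-02-05", "2023-04-17"))
)

# Label each join date with its calendar half-year, e.g. "2022-H2"
month_num <- as.integer(format(toy$join_date, "%m"))
toy$period <- paste0(format(toy$join_date, "%Y"), "-H", ifelse(month_num <= 6, 1, 2))

# Joins per country and period
joins <- table(toy$country, toy$period)
joins
```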

Leveraging the successful strategies from the peak growth period can be essential for Netflix to sustain user growth. For countries experiencing stagnation in new user additions, a reassessment of market strategies may be necessary. This could involve adjusting promotional activities or offering customized content to better align with local preferences.