The dataset I will be examining in this project contains data about Netflix users. For example, this dataset contains information about subscriptions, revenue, account details, and activity. The source for this dataset can be found here Link to source.
This dataset contains 2500 observations and 10 variables that include the following:
User ID that records the individual id as a numeric variable
Subscription Type: a character variable that records the type of subscription plan.
Monthly Revenue: a numeric variable that shows the amount of revenue per month.
Join Date: a character variable that records the date a user joined.
Last Payment Date: a character variable that records the last date of payment.
Country: a character variable that shows the origin of the user.
Age: a numeric variable that shows the age of the user.
Gender: a character variable that shows the gender of the user.
Device: a character variable that records the device type the user uses.
Plan Duration: a character variable that shows the duration of plan.
library(tidyverse)
setwd("/Users/mikea/Desktop/Datasets")
df <- read_csv("Netflix Userbase.csv")
str(df, 10)
## spc_tbl_ [2,500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ User ID : num [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
## $ Subscription Type: chr [1:2500] "Basic" "Premium" "Standard" "Standard" ...
## $ Monthly Revenue : num [1:2500] 10 15 12 12 10 15 12 10 12 15 ...
## $ Join Date : chr [1:2500] "15-01-22" "05-09-21" "28-02-23" "10-07-22" ...
## $ Last Payment Date: chr [1:2500] "10-06-23" "22-06-23" "27-06-23" "26-06-23" ...
## $ Country : chr [1:2500] "United States" "Canada" "United Kingdom" "Australia" ...
## $ Age : num [1:2500] 28 35 42 51 33 29 46 39 37 44 ...
## $ Gender : chr [1:2500] "Male" "Female" "Male" "Female" ...
## $ Device : chr [1:2500] "Smartphone" "Tablet" "Smart TV" "Laptop" ...
## $ Plan Duration : chr [1:2500] "1 Month" "1 Month" "1 Month" "1 Month" ...
## - attr(*, "spec")=
## .. cols(
## .. `User ID` = col_double(),
## .. `Subscription Type` = col_character(),
## .. `Monthly Revenue` = col_double(),
## .. `Join Date` = col_character(),
## .. `Last Payment Date` = col_character(),
## .. Country = col_character(),
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Device = col_character(),
## .. `Plan Duration` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
#glimpse(df)
dim(df)
## [1] 2500 10
colSums(is.na(df))
## User ID Subscription Type Monthly Revenue Join Date
## 0 0 0 0
## Last Payment Date Country Age Gender
## 0 0 0 0
## Device Plan Duration
## 0 0
# Great! no NA's
My two questions that I want to investigate in this project are:
Which country has the most users?
Which country has the most users subscribed to the basic Subscriptions plan?
Which country has the most users?
Which country has the most users subscribed to the basic Subscriptions plan?
str(df)
## spc_tbl_ [2,500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ User ID : num [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
## $ Subscription Type: chr [1:2500] "Basic" "Premium" "Standard" "Standard" ...
## $ Monthly Revenue : num [1:2500] 10 15 12 12 10 15 12 10 12 15 ...
## $ Join Date : chr [1:2500] "15-01-22" "05-09-21" "28-02-23" "10-07-22" ...
## $ Last Payment Date: chr [1:2500] "10-06-23" "22-06-23" "27-06-23" "26-06-23" ...
## $ Country : chr [1:2500] "United States" "Canada" "United Kingdom" "Australia" ...
## $ Age : num [1:2500] 28 35 42 51 33 29 46 39 37 44 ...
## $ Gender : chr [1:2500] "Male" "Female" "Male" "Female" ...
## $ Device : chr [1:2500] "Smartphone" "Tablet" "Smart TV" "Laptop" ...
## $ Plan Duration : chr [1:2500] "1 Month" "1 Month" "1 Month" "1 Month" ...
## - attr(*, "spec")=
## .. cols(
## .. `User ID` = col_double(),
## .. `Subscription Type` = col_character(),
## .. `Monthly Revenue` = col_double(),
## .. `Join Date` = col_character(),
## .. `Last Payment Date` = col_character(),
## .. Country = col_character(),
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Device = col_character(),
## .. `Plan Duration` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(df)
## # A tibble: 6 × 10
## `User ID` `Subscription Type` `Monthly Revenue` `Join Date`
## <dbl> <chr> <dbl> <chr>
## 1 1 Basic 10 15-01-22
## 2 2 Premium 15 05-09-21
## 3 3 Standard 12 28-02-23
## 4 4 Standard 12 10-07-22
## 5 5 Basic 10 01-05-23
## 6 6 Premium 15 18-03-22
## # ℹ 6 more variables: `Last Payment Date` <chr>, Country <chr>, Age <dbl>,
## # Gender <chr>, Device <chr>, `Plan Duration` <chr>
tail(df)
## # A tibble: 6 × 10
## `User ID` `Subscription Type` `Monthly Revenue` `Join Date`
## <dbl> <chr> <dbl> <chr>
## 1 2495 Basic 15 23-07-22
## 2 2496 Premium 14 25-07-22
## 3 2497 Basic 15 04-08-22
## 4 2498 Standard 12 09-08-22
## 5 2499 Standard 13 12-08-22
## 6 2500 Basic 15 13-08-22
## # ℹ 6 more variables: `Last Payment Date` <chr>, Country <chr>, Age <dbl>,
## # Gender <chr>, Device <chr>, `Plan Duration` <chr>
colSums(is.na(df))
## User ID Subscription Type Monthly Revenue Join Date
## 0 0 0 0
## Last Payment Date Country Age Gender
## 0 0 0 0
## Device Plan Duration
## 0 0
Here is a data source that contains information about Netflix users from different countries.
df <- df %>%
mutate(`Subscription Type` = factor(`Subscription Type`)) %>%
mutate(Gender = factor(Gender)) %>%
mutate(Country = factor(Country))
str(df)
## tibble [2,500 × 10] (S3: tbl_df/tbl/data.frame)
## $ User ID : num [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
## $ Subscription Type: Factor w/ 3 levels "Basic","Premium",..: 1 2 3 3 1 2 3 1 3 2 ...
## $ Monthly Revenue : num [1:2500] 10 15 12 12 10 15 12 10 12 15 ...
## $ Join Date : chr [1:2500] "15-01-22" "05-09-21" "28-02-23" "10-07-22" ...
## $ Last Payment Date: chr [1:2500] "10-06-23" "22-06-23" "27-06-23" "26-06-23" ...
## $ Country : Factor w/ 10 levels "Australia","Brazil",..: 10 3 9 1 5 4 2 7 8 6 ...
## $ Age : num [1:2500] 28 35 42 51 33 29 46 39 37 44 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1 2 1 2 1 ...
## $ Device : chr [1:2500] "Smartphone" "Tablet" "Smart TV" "Laptop" ...
## $ Plan Duration : chr [1:2500] "1 Month" "1 Month" "1 Month" "1 Month" ...
most_users <- df %>%
group_by(Country) %>%
summarise(total_users = n()) %>%
arrange(desc(total_users))
most_users
## # A tibble: 10 × 2
## Country total_users
## <fct> <int>
## 1 Spain 451
## 2 United States 451
## 3 Canada 317
## 4 Australia 183
## 5 Brazil 183
## 6 France 183
## 7 Germany 183
## 8 Italy 183
## 9 Mexico 183
## 10 United Kingdom 183
barplot(most_users$total_users, names.arg = most_users$Country,
xlab = "Country", ylab = "Total Users",
main = "Total Users by Country",
las = 2, cex.names = 0.7)
most_popular_per_country <- df %>%
group_by(Country, `Subscription Type`) %>%
filter(`Subscription Type` == "Basic") %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
group_by(Country)
## `summarise()` has grouped output by 'Country'. You can override using the
## `.groups` argument.
most_popular_per_country
## # A tibble: 10 × 3
## # Groups: Country [10]
## Country `Subscription Type` count
## <fct> <fct> <int>
## 1 United States Basic 199
## 2 Italy Basic 176
## 3 Germany Basic 149
## 4 Brazil Basic 146
## 5 Canada Basic 145
## 6 Spain Basic 110
## 7 France Basic 36
## 8 Australia Basic 31
## 9 Mexico Basic 4
## 10 United Kingdom Basic 3
barplot(most_popular_per_country$count, names.arg = most_popular_per_country$Country,
main = "Number of Basic Subscription Plans by Country",
xlab = "Country",
ylab = "Count",
las = 2, cex.names = 0.7)
popularity <- df %>%
group_by(Device) %>%
summarise(device_usage = n()) %>%
arrange(desc(Device))
popularity
## # A tibble: 4 × 2
## Device device_usage
## <chr> <int>
## 1 Tablet 633
## 2 Smartphone 621
## 3 Smart TV 610
## 4 Laptop 636
device_freq <- table(popularity$Device)
colors <- rainbow(length(device_freq))
pie(device_freq, labels = names(device_freq), col = colors)
legend("topright", legend = names(device_freq), fill = colors)
title("Popularity of Devices")
Through my EDA exploration using a dataset based on users of Netflix. I followed Rodger Pang’s advice and got a feel for the data. Meaning, I looked at the structure, beginning, and end of the data to have a better understanding of the data I am working with. I concluded that they were no NA’s in this dataset. I also had to change three variables into factors in order to better represent their unique values.
After cleaning I determined that most Netflix users are from the United States or Spain. I represented my findings using a bar plot to identify the number of users of each country within this dataset. Next, I wanted to determine which country has the most users subscribed to their basic subscription plan. I discovered that again the United States was first followed behind by Italy and Germany. Additionally, I also used a pie chart to determine the most common device type to watch Netflix on.
Do younger subscribers tend to watch more on their phone vs. Smart TV?
Null Hypothesis (H₀): there are no significant differences in the average age among subscribers using different devices.
Alternative Hypothesis (Hₐ): there are significant differences in the average age among subscribers using different devices.
I think my dataset meets the requirement to use an ANOVA test because I will be using both categorical and quantitative variables. Also it would be interesting to see the results as we can then do some marketing advertising towards younger audiences to increase the subscriptions of Netflix.
anova_result <- aov(Age ~ Device, data = df)
print(summary(anova_result))
## Df Sum Sq Mean Sq F value Pr(>F)
## Device 3 50 16.58 0.322 0.809
## Residuals 2496 128485 51.48
Since the p-value (Pr(>F)) is greater than the common significance level of 0.05, we fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that there are significant differences in the average age among subscribers using different devices.
boxplot(Age ~ Device, data = df, main = "Age Distribution by Device",
xlab = "Device", ylab = "Age")
As shown through our testing we fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that there are significant differences in the average age among subscribers using different devices.
Also, through our box plots we can see that the majority of our data or age range, is between 32- 45. Meaning, we would not run advertising campaigns on younger or older audiences. Additionally, we might post more movies or shows that we believe people between these ages might enjoy so that they continue to be subscribers of Netflix.
summary(df)
## User ID Subscription Type Monthly Revenue Join Date
## Min. : 1.0 Basic :999 Min. :10.00 Length:2500
## 1st Qu.: 625.8 Premium :733 1st Qu.:11.00 Class :character
## Median :1250.5 Standard:768 Median :12.00 Mode :character
## Mean :1250.5 Mean :12.51
## 3rd Qu.:1875.2 3rd Qu.:14.00
## Max. :2500.0 Max. :15.00
##
## Last Payment Date Country Age Gender
## Length:2500 Spain :451 Min. :26.0 Female:1257
## Class :character United States:451 1st Qu.:32.0 Male :1243
## Mode :character Canada :317 Median :39.0
## Australia :183 Mean :38.8
## Brazil :183 3rd Qu.:45.0
## France :183 Max. :51.0
## (Other) :732
## Device Plan Duration
## Length:2500 Length:2500
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Is there a relationship between a user’s age and how much the company makes in a month?
lm_model <- lm(`Monthly Revenue` ~ Age, data = df)
summary(lm_model)
##
## Call:
## lm(formula = `Monthly Revenue` ~ Age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5671 -1.5024 -0.4477 1.4926 2.5523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.701332 0.185624 68.425 <2e-16 ***
## Age -0.004973 0.004705 -1.057 0.291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.687 on 2498 degrees of freedom
## Multiple R-squared: 0.000447, Adjusted R-squared: 4.69e-05
## F-statistic: 1.117 on 1 and 2498 DF, p-value: 0.2906
It seems that the “Age” variable is not statistically significant in predicting “Monthly Revenue.” Meaning, there is not enough evidence to conclude that “Age” has a significant impact on “Monthly Revenue.”
The R-squared value of 0.000447 suggests that less than 0.05% of the variation in Monthly Revenue is explained by the Age variable. A weak relationship between the two.
plot(df$Age, df$`Monthly Revenue`, main = "Scatter Plot: Monthly Revenue vs. Age",
xlab = "Age", ylab = "Monthly Revenue", pch = 16, col = "blue")
abline(lm_model, col = "red")
Also, since the quantitative variables in this dataset are few, its hard to accurately perform a regression test.
One concern is about the appropriate data collection and Use. Meaning, will this company sell this information to 3rd parties companies? For example, depending on the device users use to watch Netflix, will they receive more ads to buy new devices? This is important to think about given that many companies about backdoor deals with others.
Another concern is about the storage and security of the data. For example, how is data stored? Especially personal identifiable data? Of course, in this day and age there are many hackers and it’s important to think about how safe the data really is. Another concern is what if Netflix users unsubscribe and leave its service. How long should the company keep the data? In the recent years there has been a lot more streaming service competition and its important to know who has and how long are data will be kept in these companies.