Final Project

Problem 1

The dataset I will be examining in this project contains data about Netflix users. For example, this dataset contains information about subscriptions, revenue, account details, and activity. The source for this dataset can be found here Link to source.

This dataset contains 2500 observations and 10 variables that include the following:

User ID that records the individual id as a numeric variable

Subscription Type: a character variable that records the type of subscription plan.

Monthly Revenue: a numeric variable that shows the amount of revenue per month.

Join Date: a character variable that records the date a user joined.

Last Payment Date: a character variable that records the last date of payment.

Country: a character variable that shows the origin of the user.

Age: a numeric variable that shows the age of the user.

Gender: a character variable that shows the gender of the user.

Device: a character variable that records the device type the user uses.

Plan Duration: a character variable that shows the duration of plan.

Problem 2

library(tidyverse)
setwd("/Users/mikea/Desktop/Datasets")
df <- read_csv("Netflix Userbase.csv")
str(df, 10)
## spc_tbl_ [2,500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ User ID          : num [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Subscription Type: chr [1:2500] "Basic" "Premium" "Standard" "Standard" ...
##  $ Monthly Revenue  : num [1:2500] 10 15 12 12 10 15 12 10 12 15 ...
##  $ Join Date        : chr [1:2500] "15-01-22" "05-09-21" "28-02-23" "10-07-22" ...
##  $ Last Payment Date: chr [1:2500] "10-06-23" "22-06-23" "27-06-23" "26-06-23" ...
##  $ Country          : chr [1:2500] "United States" "Canada" "United Kingdom" "Australia" ...
##  $ Age              : num [1:2500] 28 35 42 51 33 29 46 39 37 44 ...
##  $ Gender           : chr [1:2500] "Male" "Female" "Male" "Female" ...
##  $ Device           : chr [1:2500] "Smartphone" "Tablet" "Smart TV" "Laptop" ...
##  $ Plan Duration    : chr [1:2500] "1 Month" "1 Month" "1 Month" "1 Month" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `User ID` = col_double(),
##   ..   `Subscription Type` = col_character(),
##   ..   `Monthly Revenue` = col_double(),
##   ..   `Join Date` = col_character(),
##   ..   `Last Payment Date` = col_character(),
##   ..   Country = col_character(),
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Device = col_character(),
##   ..   `Plan Duration` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
#glimpse(df)
dim(df)
## [1] 2500   10

Missing Values

colSums(is.na(df))
##           User ID Subscription Type   Monthly Revenue         Join Date 
##                 0                 0                 0                 0 
## Last Payment Date           Country               Age            Gender 
##                 0                 0                 0                 0 
##            Device     Plan Duration 
##                 0                 0
# Great! no NA's

Problem 3

My two questions that I want to investigate in this project are:

Which country has the most users?

Which country has the most users subscribed to the basic Subscriptions plan?

Problem 4

Exploratory Data Analysis Checklist

Formulate your question

Which country has the most users?

Which country has the most users subscribed to the basic Subscriptions plan?

Read in your data - Completed above.

Check the packaging - Completed.

Run str()

str(df)
## spc_tbl_ [2,500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ User ID          : num [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Subscription Type: chr [1:2500] "Basic" "Premium" "Standard" "Standard" ...
##  $ Monthly Revenue  : num [1:2500] 10 15 12 12 10 15 12 10 12 15 ...
##  $ Join Date        : chr [1:2500] "15-01-22" "05-09-21" "28-02-23" "10-07-22" ...
##  $ Last Payment Date: chr [1:2500] "10-06-23" "22-06-23" "27-06-23" "26-06-23" ...
##  $ Country          : chr [1:2500] "United States" "Canada" "United Kingdom" "Australia" ...
##  $ Age              : num [1:2500] 28 35 42 51 33 29 46 39 37 44 ...
##  $ Gender           : chr [1:2500] "Male" "Female" "Male" "Female" ...
##  $ Device           : chr [1:2500] "Smartphone" "Tablet" "Smart TV" "Laptop" ...
##  $ Plan Duration    : chr [1:2500] "1 Month" "1 Month" "1 Month" "1 Month" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `User ID` = col_double(),
##   ..   `Subscription Type` = col_character(),
##   ..   `Monthly Revenue` = col_double(),
##   ..   `Join Date` = col_character(),
##   ..   `Last Payment Date` = col_character(),
##   ..   Country = col_character(),
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Device = col_character(),
##   ..   `Plan Duration` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Look at the top and the bottom of your data

head(df)
## # A tibble: 6 × 10
##   `User ID` `Subscription Type` `Monthly Revenue` `Join Date`
##       <dbl> <chr>                           <dbl> <chr>      
## 1         1 Basic                              10 15-01-22   
## 2         2 Premium                            15 05-09-21   
## 3         3 Standard                           12 28-02-23   
## 4         4 Standard                           12 10-07-22   
## 5         5 Basic                              10 01-05-23   
## 6         6 Premium                            15 18-03-22   
## # ℹ 6 more variables: `Last Payment Date` <chr>, Country <chr>, Age <dbl>,
## #   Gender <chr>, Device <chr>, `Plan Duration` <chr>
tail(df)
## # A tibble: 6 × 10
##   `User ID` `Subscription Type` `Monthly Revenue` `Join Date`
##       <dbl> <chr>                           <dbl> <chr>      
## 1      2495 Basic                              15 23-07-22   
## 2      2496 Premium                            14 25-07-22   
## 3      2497 Basic                              15 04-08-22   
## 4      2498 Standard                           12 09-08-22   
## 5      2499 Standard                           13 12-08-22   
## 6      2500 Basic                              15 13-08-22   
## # ℹ 6 more variables: `Last Payment Date` <chr>, Country <chr>, Age <dbl>,
## #   Gender <chr>, Device <chr>, `Plan Duration` <chr>

Check your “n”s

colSums(is.na(df))
##           User ID Subscription Type   Monthly Revenue         Join Date 
##                 0                 0                 0                 0 
## Last Payment Date           Country               Age            Gender 
##                 0                 0                 0                 0 
##            Device     Plan Duration 
##                 0                 0

Validate with at least one external data source

Here is a data source that contains information about Netflix users from different countries.

Link to source

1st Question Answered

df <- df %>% 
  mutate(`Subscription Type` = factor(`Subscription Type`)) %>% 
  mutate(Gender = factor(Gender)) %>% 
  mutate(Country = factor(Country))

str(df)
## tibble [2,500 × 10] (S3: tbl_df/tbl/data.frame)
##  $ User ID          : num [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Subscription Type: Factor w/ 3 levels "Basic","Premium",..: 1 2 3 3 1 2 3 1 3 2 ...
##  $ Monthly Revenue  : num [1:2500] 10 15 12 12 10 15 12 10 12 15 ...
##  $ Join Date        : chr [1:2500] "15-01-22" "05-09-21" "28-02-23" "10-07-22" ...
##  $ Last Payment Date: chr [1:2500] "10-06-23" "22-06-23" "27-06-23" "26-06-23" ...
##  $ Country          : Factor w/ 10 levels "Australia","Brazil",..: 10 3 9 1 5 4 2 7 8 6 ...
##  $ Age              : num [1:2500] 28 35 42 51 33 29 46 39 37 44 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Device           : chr [1:2500] "Smartphone" "Tablet" "Smart TV" "Laptop" ...
##  $ Plan Duration    : chr [1:2500] "1 Month" "1 Month" "1 Month" "1 Month" ...
most_users <- df %>% 
  group_by(Country) %>% 
  summarise(total_users = n()) %>% 
  arrange(desc(total_users))
most_users
## # A tibble: 10 × 2
##    Country        total_users
##    <fct>                <int>
##  1 Spain                  451
##  2 United States          451
##  3 Canada                 317
##  4 Australia              183
##  5 Brazil                 183
##  6 France                 183
##  7 Germany                183
##  8 Italy                  183
##  9 Mexico                 183
## 10 United Kingdom         183
barplot(most_users$total_users, names.arg = most_users$Country, 
        xlab = "Country", ylab = "Total Users", 
        main = "Total Users by Country", 
        las = 2, cex.names = 0.7)  

2nd Question Answered

most_popular_per_country <- df %>%
  group_by(Country, `Subscription Type`) %>%
  filter(`Subscription Type` == "Basic") %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  group_by(Country)
## `summarise()` has grouped output by 'Country'. You can override using the
## `.groups` argument.
most_popular_per_country
## # A tibble: 10 × 3
## # Groups:   Country [10]
##    Country        `Subscription Type` count
##    <fct>          <fct>               <int>
##  1 United States  Basic                 199
##  2 Italy          Basic                 176
##  3 Germany        Basic                 149
##  4 Brazil         Basic                 146
##  5 Canada         Basic                 145
##  6 Spain          Basic                 110
##  7 France         Basic                  36
##  8 Australia      Basic                  31
##  9 Mexico         Basic                   4
## 10 United Kingdom Basic                   3
barplot(most_popular_per_country$count, names.arg = most_popular_per_country$Country,
        main = "Number of Basic Subscription Plans by Country",
        xlab = "Country",
        ylab = "Count",
        las = 2, cex.names = 0.7)  

I wondered What Devices Were Most Mommonly Used?

popularity <- df %>% 
  group_by(Device) %>% 
  summarise(device_usage = n()) %>% 
  arrange(desc(Device))
popularity
## # A tibble: 4 × 2
##   Device     device_usage
##   <chr>             <int>
## 1 Tablet              633
## 2 Smartphone          621
## 3 Smart TV            610
## 4 Laptop              636
device_freq <- table(popularity$Device)
 colors <- rainbow(length(device_freq))
pie(device_freq, labels = names(device_freq), col = colors)
legend("topright", legend = names(device_freq), fill = colors)
title("Popularity of Devices")

Problem 5

Through my EDA exploration using a dataset based on users of Netflix. I followed Rodger Pang’s advice and got a feel for the data. Meaning, I looked at the structure, beginning, and end of the data to have a better understanding of the data I am working with. I concluded that they were no NA’s in this dataset. I also had to change three variables into factors in order to better represent their unique values.

After cleaning I determined that most Netflix users are from the United States or Spain. I represented my findings using a bar plot to identify the number of users of each country within this dataset. Next, I wanted to determine which country has the most users subscribed to their basic subscription plan. I discovered that again the United States was first followed behind by Italy and Germany. Additionally, I also used a pie chart to determine the most common device type to watch Netflix on.

Problem 6

Do younger subscribers tend to watch more on their phone vs. Smart TV?

Null Hypothesis (H₀): there are no significant differences in the average age among subscribers using different devices.

Alternative Hypothesis (Hₐ): there are significant differences in the average age among subscribers using different devices.

I think my dataset meets the requirement to use an ANOVA test because I will be using both categorical and quantitative variables. Also it would be interesting to see the results as we can then do some marketing advertising towards younger audiences to increase the subscriptions of Netflix.

anova_result <- aov(Age ~ Device, data = df)
print(summary(anova_result))
##               Df Sum Sq Mean Sq F value Pr(>F)
## Device         3     50   16.58   0.322  0.809
## Residuals   2496 128485   51.48

Since the p-value (Pr(>F)) is greater than the common significance level of 0.05, we fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that there are significant differences in the average age among subscribers using different devices.

boxplot(Age ~ Device, data = df, main = "Age Distribution by Device",
        xlab = "Device", ylab = "Age")

Final Thoughts

As shown through our testing we fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that there are significant differences in the average age among subscribers using different devices.

Also, through our box plots we can see that the majority of our data or age range, is between 32- 45. Meaning, we would not run advertising campaigns on younger or older audiences. Additionally, we might post more movies or shows that we believe people between these ages might enjoy so that they continue to be subscribers of Netflix.

summary(df)
##     User ID       Subscription Type Monthly Revenue  Join Date        
##  Min.   :   1.0   Basic   :999      Min.   :10.00   Length:2500       
##  1st Qu.: 625.8   Premium :733      1st Qu.:11.00   Class :character  
##  Median :1250.5   Standard:768      Median :12.00   Mode  :character  
##  Mean   :1250.5                     Mean   :12.51                     
##  3rd Qu.:1875.2                     3rd Qu.:14.00                     
##  Max.   :2500.0                     Max.   :15.00                     
##                                                                       
##  Last Payment Date           Country         Age          Gender    
##  Length:2500        Spain        :451   Min.   :26.0   Female:1257  
##  Class :character   United States:451   1st Qu.:32.0   Male  :1243  
##  Mode  :character   Canada       :317   Median :39.0                
##                     Australia    :183   Mean   :38.8                
##                     Brazil       :183   3rd Qu.:45.0                
##                     France       :183   Max.   :51.0                
##                     (Other)      :732                               
##     Device          Plan Duration     
##  Length:2500        Length:2500       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

Problem 7

Is there a relationship between a user’s age and how much the company makes in a month?

lm_model <- lm(`Monthly Revenue` ~ Age, data = df)
summary(lm_model)
## 
## Call:
## lm(formula = `Monthly Revenue` ~ Age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5671 -1.5024 -0.4477  1.4926  2.5523 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.701332   0.185624  68.425   <2e-16 ***
## Age         -0.004973   0.004705  -1.057    0.291    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.687 on 2498 degrees of freedom
## Multiple R-squared:  0.000447,   Adjusted R-squared:  4.69e-05 
## F-statistic: 1.117 on 1 and 2498 DF,  p-value: 0.2906

It seems that the “Age” variable is not statistically significant in predicting “Monthly Revenue.” Meaning, there is not enough evidence to conclude that “Age” has a significant impact on “Monthly Revenue.”

The R-squared value of 0.000447 suggests that less than 0.05% of the variation in Monthly Revenue is explained by the Age variable. A weak relationship between the two.

plot(df$Age, df$`Monthly Revenue`, main = "Scatter Plot: Monthly Revenue vs. Age",
     xlab = "Age", ylab = "Monthly Revenue", pch = 16, col = "blue")

abline(lm_model, col = "red")

Also, since the quantitative variables in this dataset are few, its hard to accurately perform a regression test.

Problem 8

One concern is about the appropriate data collection and Use. Meaning, will this company sell this information to 3rd parties companies? For example, depending on the device users use to watch Netflix, will they receive more ads to buy new devices? This is important to think about given that many companies about backdoor deals with others.

Another concern is about the storage and security of the data. For example, how is data stored? Especially personal identifiable data? Of course, in this day and age there are many hackers and it’s important to think about how safe the data really is. Another concern is what if Netflix users unsubscribe and leave its service. How long should the company keep the data? In the recent years there has been a lot more streaming service competition and its important to know who has and how long are data will be kept in these companies.