Netflix vs. Hulu: Analyzing Customer Retention Strategies - Hulu

Project Objective

What customer retention strategies does Hulu use?

Step 1: Install and load libraries

#install.packages("gclus")
#install.packages("ggpubr")
#install.packages("tidyverse")
#install.packages("readxl")
library(ggpubr)
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gclus)
## Loading required package: cluster
library(readxl)

###Step 2: Import Hulu data and inspect it

rev_df <- read_excel("Hulu excel (2).xlsx")
head(rev_df)
## # A tibble: 6 × 4
##    Subs Pricing Revenue Rating
##   <dbl>   <dbl>   <dbl>  <dbl>
## 1  25.2    12.0   0.702   0.97
## 2  26.8    12.0   0.792   0.97
## 3  27.9    12.0   0.891   0.97
## 4  28.5    12.0   1.03    0.97
## 5  30.4    12.0   1.01    0.78
## 6  32.1    12.0   1.29    0.91
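
The import step does not show any cleaning, so a few basic data-quality checks may be worth running before the analysis. The sketch below is an illustration only: `rev_sample` rebuilds just the six rows printed by `head(rev_df)` above (with the rounded values as printed), and `check_data()` is a hypothetical helper, not part of the original analysis. With the full file loaded, the same helper could be called on `rev_df`.

```r
# First six rows as printed by head(rev_df); values are rounded by the tibble printer.
rev_sample <- data.frame(
  Subs    = c(25.2, 26.8, 27.9, 28.5, 30.4, 32.1),
  Pricing = c(12.0, 12.0, 12.0, 12.0, 12.0, 12.0),
  Revenue = c(0.702, 0.792, 0.891, 1.03, 1.01, 1.29),
  Rating  = c(0.97, 0.97, 0.97, 0.97, 0.78, 0.91)
)

# Hypothetical helper: basic checks before modelling.
check_data <- function(df) {
  list(
    n_rows       = nrow(df),                                   # number of observations
    all_numeric  = all(vapply(df, is.numeric, logical(1))),    # cor() and lm() need numeric columns
    missing      = colSums(is.na(df)),                         # missing values per column
    n_duplicated = sum(duplicated(df))                         # fully duplicated rows
  )
}

checks <- check_data(rev_sample)
str(checks)
```

If `missing` or `n_duplicated` were non-zero on the full data, those rows would need a decision (drop, impute, or keep) before Steps 3 to 6.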

###Step 3: Summarize the data

summary(rev_df)
##       Subs          Pricing         Revenue          Rating     
##  Min.   :25.20   Min.   :11.99   Min.   :0.702   Min.   :0.610  
##  1st Qu.:31.68   1st Qu.:11.99   1st Qu.:1.224   1st Qu.:0.760  
##  Median :42.20   Median :11.99   Median :1.841   Median :0.880  
##  Mean   :39.40   Mean   :13.09   Mean   :1.787   Mean   :0.847  
##  3rd Qu.:46.45   3rd Qu.:13.49   3rd Qu.:2.349   3rd Qu.:0.970  
##  Max.   :48.50   Max.   :17.99   Max.   :2.782   Max.   :0.990
Interpretation: On average, Hulu has 39.4 subscriptions (in the units recorded in the dataset), with prices ranging from $11.99 to $17.99 and revenue averaging about 1.79 (again in the dataset's units). Customer ratings are relatively high, with a mean of 0.85, a minimum of 0.61, and a maximum of 0.99. Subscription count and revenue vary moderately (Subs ranges from 25.2 to 48.5 and Revenue from 0.70 to 2.78), while customer satisfaction stays within a comparatively narrow band.
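
`summary()` reports quartiles but not a direct measure of spread, so claims about variability can be backed up with the standard deviation and the coefficient of variation (sd divided by mean). The following is a minimal sketch under stated assumptions: `rev_sample` contains only the six rows printed by `head(rev_df)`, and `describe_spread()` is a hypothetical helper name. In this small sample Pricing is constant, so its spread is zero; the full `rev_df` would give different numbers.

```r
# First six rows as printed by head(rev_df); values are rounded by the tibble printer.
rev_sample <- data.frame(
  Subs    = c(25.2, 26.8, 27.9, 28.5, 30.4, 32.1),
  Pricing = c(12.0, 12.0, 12.0, 12.0, 12.0, 12.0),
  Revenue = c(0.702, 0.792, 0.891, 1.03, 1.01, 1.29),
  Rating  = c(0.97, 0.97, 0.97, 0.97, 0.78, 0.91)
)

# Hypothetical helper: mean, standard deviation, and coefficient of variation per column.
describe_spread <- function(df) {
  m <- vapply(df, mean, numeric(1))
  s <- vapply(df, sd, numeric(1))
  data.frame(mean = m, sd = s, cv = s / m, row.names = names(df))
}

spread <- describe_spread(rev_sample)
print(round(spread, 3))
```

Because the coefficient of variation is unit-free, it allows the spread of Subs, Revenue, and Rating to be compared even though they are measured on different scales.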

Step 4: Create a pairwise scatterplot matrix

pairs(~ Subs + Revenue + Pricing + Rating, data = rev_df)

Interpretation: The scatterplot matrix shows a strong, nearly linear positive relationship between subscriptions and revenue (correlation 0.99, see Step 5), indicating that as subscriptions increase, revenue also increases. Ratings show weak to moderate negative correlations with all other variables (between -0.21 and -0.44), suggesting that customer satisfaction (ratings) is not strongly related to subscription numbers, pricing, or revenue.
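
Since the interpretation quotes correlation values alongside the plot, it can help to print each coefficient directly in the scatterplot matrix. The sketch below passes a custom panel function to `pairs()` through its `upper.panel` argument. It is illustrative only: `panel_cor()` is a hypothetical name, and `rev_sample` holds just the six rows printed by `head(rev_df)`. Pricing is left out here because it is constant in those six rows, which would make its correlation undefined; with the full `rev_df` the original formula `~ Subs + Revenue + Pricing + Rating` could be used.

```r
# First six rows as printed by head(rev_df); values are rounded by the tibble printer.
rev_sample <- data.frame(
  Subs    = c(25.2, 26.8, 27.9, 28.5, 30.4, 32.1),
  Pricing = c(12.0, 12.0, 12.0, 12.0, 12.0, 12.0),
  Revenue = c(0.702, 0.792, 0.891, 1.03, 1.01, 1.29),
  Rating  = c(0.97, 0.97, 0.97, 0.97, 0.78, 0.91)
)

# Hypothetical panel function: writes the Pearson correlation in the middle of the panel.
panel_cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))   # switch the panel to unit coordinates
  on.exit(par(old))                 # restore the previous coordinates on exit
  text(0.5, 0.5, sprintf("%.2f", cor(x, y)), cex = 1.5)
}

# Scatterplots below the diagonal, correlation coefficients above it.
pairs(~ Subs + Revenue + Rating, data = rev_sample, upper.panel = panel_cor)
```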

Step 5: Analyze the correlation matrix for the dataset

corr <- cor(rev_df)
corr
##               Subs    Pricing    Revenue     Rating
## Subs     1.0000000  0.6952897  0.9858598 -0.4417660
## Pricing  0.6952897  1.0000000  0.7879706 -0.2068254
## Revenue  0.9858598  0.7879706  1.0000000 -0.3899287
## Rating  -0.4417660 -0.2068254 -0.3899287  1.0000000
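Interpretation: Subs and Revenue are almost perfectly correlated (0.986), and Pricing is positively correlated with both Subs (0.695) and Revenue (0.788). Rating is negatively correlated with all three variables (-0.442 with Subs, -0.390 with Revenue, and -0.207 with Pricing). For the regression model fitted in Step 6, the residuals range from -1.71 to 1.10, with the majority of values clustered around 0, showing that the model's predictions are fairly close to the actual values.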
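
`cor()` returns point estimates only. With a small dataset it may be useful to attach a confidence interval and p-value to each coefficient, which `cor.test()` from the stats package provides. The sketch below is an illustration under stated assumptions: `cor_with_ci()` is a hypothetical wrapper, and `rev_sample` contains only the six rows printed by `head(rev_df)`, so its numbers will not match the matrix above, which uses the full data.

```r
# First six rows as printed by head(rev_df); values are rounded by the tibble printer.
rev_sample <- data.frame(
  Subs    = c(25.2, 26.8, 27.9, 28.5, 30.4, 32.1),
  Pricing = c(12.0, 12.0, 12.0, 12.0, 12.0, 12.0),
  Revenue = c(0.702, 0.792, 0.891, 1.03, 1.01, 1.29),
  Rating  = c(0.97, 0.97, 0.97, 0.97, 0.78, 0.91)
)

# Hypothetical wrapper: Pearson correlation with its 95% confidence interval and p-value.
cor_with_ci <- function(x, y) {
  ct <- cor.test(x, y, method = "pearson")
  c(r       = unname(ct$estimate),
    lower   = ct$conf.int[1],
    upper   = ct$conf.int[2],
    p_value = ct$p.value)
}

res <- cor_with_ci(rev_sample$Subs, rev_sample$Revenue)
print(round(res, 3))
```

A wide interval signals that a correlation, even a seemingly large one, is estimated imprecisely from few observations.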

Step 6: Fit and summarize a linear regression model

model <- lm(Subs ~ Revenue + Pricing + Rating, data = rev_df)
summary(model)
## 
## Call:
## lm(formula = Subs ~ Revenue + Pricing + Rating, data = rev_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7149 -0.3920  0.1116  0.5398  1.1043 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.4357     2.3517  12.942 6.83e-10 ***
## Revenue      13.8194     0.5085  27.177 8.10e-15 ***
## Pricing      -1.0191     0.1958  -5.205 8.68e-05 ***
## Rating       -2.8333     1.7022  -1.665    0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8539 on 16 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9893 
## F-statistic: 588.1 on 3 and 16 DF,  p-value: < 2.2e-16
Interpretation: The regression shows that Revenue has a strong positive relationship with the dependent variable, Subs (coefficient 13.82, p < 0.001), while Pricing has a significant negative effect once Revenue is held constant (coefficient -1.02, p < 0.001). Rating is not statistically significant at the 5% level (p = 0.115), so this model provides no clear evidence that ratings affect subscriptions. The model explains 99.1% of the variance in Subs (R-squared = 0.991, adjusted R-squared = 0.989), indicating a very strong fit on the 20 observations.
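
Two follow-up checks may help when reading these coefficients: confidence intervals via `confint()`, and variance inflation factors (VIF), since the correlation matrix in Step 5 shows that the predictors Revenue and Pricing are themselves correlated (0.79). The sketch below is illustrative and rests on assumptions: `vif_manual()` is a hypothetical helper that computes VIF as 1 / (1 - R²) from auxiliary regressions, and `rev_sample` holds only the six rows printed by `head(rev_df)`. Pricing is constant in those rows, so the example uses Revenue and Rating only; on the full `rev_df` the predictors would be `c("Revenue", "Pricing", "Rating")` with the original formula.

```r
# First six rows as printed by head(rev_df); values are rounded by the tibble printer.
rev_sample <- data.frame(
  Subs    = c(25.2, 26.8, 27.9, 28.5, 30.4, 32.1),
  Pricing = c(12.0, 12.0, 12.0, 12.0, 12.0, 12.0),
  Revenue = c(0.702, 0.792, 0.891, 1.03, 1.01, 1.29),
  Rating  = c(0.97, 0.97, 0.97, 0.97, 0.78, 0.91)
)

# Reduced model for the sample (Pricing omitted because it does not vary here).
fit <- lm(Subs ~ Revenue + Rating, data = rev_sample)

# 95% confidence intervals for the coefficients.
ci <- confint(fit, level = 0.95)
print(ci)

# Hypothetical helper: VIF for each predictor, from regressing it on the other predictors.
# Requires at least two predictor names.
vif_manual <- function(df, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

vifs <- vif_manual(rev_sample, c("Revenue", "Rating"))
print(round(vifs, 3))
```

A VIF close to 1 suggests a predictor is largely independent of the others, while larger values indicate that its coefficient is harder to separate from theirs.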