Introduction

Today, Netflix, Hulu, and Amazon Prime Video all compete for subscribers. Alongside these three major subscription services, YouTube is a major platform where content creators freely share their work and gain viewers. One may wonder, however, whether traffic to YouTube decreases as the subscription services become more popular. After all, there are only so many hours in a day, and if we spend several of them watching Netflix, we are presumably watching less YouTube. This project is an attempt to answer that question.

Objective
The objective of this project is to investigate whether adding the variable ‘number of subscriptions to a monthly video service’ (such as Netflix, Amazon Prime Video, or Hulu) accounts for a significantly greater proportion of the variance in the amount of time spent on YouTube per day. The nuisance variables included are age, gender, number of credit hours currently being taken, and college classification. This project was a final assignment for the class SDS 358 at UT Austin in Spring 2018 and was updated in February 2019.
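
The question "does an added variable account for significantly more variance?" is usually answered by comparing two nested linear models with a partial F-test. The sketch below illustrates the idea on simulated data, not the survey responses. The column name n.subs and the choice of only two nuisance variables are assumptions made for brevity.

```r
# Simulated, hypothetical data; NOT the survey responses.
set.seed(1)
n <- 42
sim <- data.frame(
  age       = sample(18:24, n, replace = TRUE),
  credit.hr = sample(12:18, n, replace = TRUE),
  n.subs    = sample(0:3, n, replace = TRUE)  # hypothetical subscription count
)
sim$timespent <- rnorm(n, mean = 60, sd = 20)

# Reduced model: nuisance variables only
reduced <- lm(timespent ~ age + credit.hr, data = sim)
# Full model: nuisance variables plus the variable of interest
full <- lm(timespent ~ age + credit.hr + n.subs, data = sim)

# Partial F-test: does adding n.subs significantly reduce the residual variance?
cmp <- anova(reduced, full)
print(cmp)

# Increase in R-squared attributable to the added variable
delta.r2 <- summary(full)$r.squared - summary(reduced)$r.squared
print(delta.r2)
```

A small p-value in the second row of the anova() table would indicate that the added variable explains significantly more variance than the nuisance variables alone.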

Hypothesis
Having a subscription to any of the three major subscription services (Netflix, Hulu, Amazon Prime Video) has a significant impact on how much time one spends on YouTube.

Methodology

Sample
The sample data was gathered via a Google Form Survey from 43 current college students. The survey collected data about their age (years), gender (male, female), number of credit hours being taken, college classification (Business, Engineering, Liberal Arts, Natural Science, Others), the subscription services they use (Netflix, Amazon Prime, Hulu, none of the above), and time spent on YouTube (minutes).

Data

We removed one observation with an invalid, non-numeric response for time spent on YouTube. We then renamed the variables and added the logical variables ‘netflix’, ‘hulu’, and ‘amazonprime’, where TRUE indicates a subscription and FALSE indicates no subscription. Furthermore, the variables ‘college’ and ‘timespent’ had to be preprocessed to reduce variation in the answers. This issue arose because the survey asked students to type out their answers instead of presenting them with choices (e.g., “CNS” vs. “Natural science”, or “45 minutes” vs. “Around 2 hr”). Below, we display the first 6 elements of the preprocessed dataset.
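
As a side illustration of the free-text cleanup described above, the following base R sketch converts typed answers such as “45 minutes” or “Around 2 hr” into minutes. The helper to_minutes is hypothetical and is not the code used in the analysis (that code is in the Appendix). It assumes every answer names its unit, so a bare number would be returned as 0 and would need separate handling.

```r
# Hypothetical helper: convert free-text durations to numeric minutes.
to_minutes <- function(x) {
  x <- tolower(x)
  # Return the first number captured by `pattern` in each answer, or 0 if absent
  extract <- function(pattern) {
    m <- regmatches(x, regexec(pattern, x))
    vapply(m, function(g) if (length(g) == 2) as.numeric(g[2]) else 0,
           numeric(1))
  }
  hours   <- extract("([0-9]+\\.?[0-9]*) *h")  # e.g. "2 hr", "1.5 hours"
  minutes <- extract("([0-9]+) *m")            # e.g. "45 minutes", "30min"
  hours * 60 + minutes
}

to_minutes(c("45 minutes", "Around 2 hr", "1 hour 30 min"))
# returns 45 120 90
```

Extracting the hour and minute parts separately avoids the chain of string substitutions, at the cost of silently returning 0 for answers with no recognizable unit.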

Assumptions

We noticed that the response variable (time spent on YouTube) is right-skewed, so we applied a square-root transformation. Homoscedasticity was checked and confirmed with the Residuals vs. Fitted plot. Normality of the residuals was also checked.
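
The checks described above can be sketched as follows, using simulated right-skewed data rather than the survey responses. The Shapiro-Wilk test is an addition for illustration only; the report does not state which normality check was used.

```r
# Simulated, hypothetical data; NOT the survey responses.
set.seed(2)
n <- 42
sim <- data.frame(
  age       = sample(18:24, n, replace = TRUE),
  credit.hr = sample(12:18, n, replace = TRUE)
)
# Build a right-skewed response that is linear on the square-root scale
sim$timespent <- (20 - 0.5 * sim$credit.hr + rnorm(n, sd = 2))^2

fit.sqrt <- lm(sqrt(timespent) ~ age + credit.hr, data = sim)

# Residuals vs. Fitted (homoscedasticity) and Normal Q-Q (normality)
par(mfrow = c(1, 2))
plot(fit.sqrt, which = 1)
plot(fit.sqrt, which = 2)

# Shapiro-Wilk test on the residuals as a numeric complement to the Q-Q plot
sw <- shapiro.test(residuals(fit.sqrt))
print(sw)
```

A roughly flat band of residuals in the first plot and points close to the line in the second are the patterns one would hope to see.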

Training and Testing

We split the dataset into training (75%) and testing sets and fit the final model with the train() function of the caret package. Performance was estimated with 25 bootstrap resamples, giving a training RMSE of 4.52. We then used the model to predict on the testing set, giving an RMSE of 5.62. Both RMSE values are on the square-root scale of the response.
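
Because the model is fit to the square root of the response, the RMSE values are in square-root-of-minutes units. The sketch below computes RMSE by hand on both scales. The function rmse and all numbers are hypothetical and serve only to show the difference between the two scales.

```r
# Root mean squared error between predictions and observations
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

obs.minutes <- c(30, 60, 120, 45)  # hypothetical observed minutes
pred.sqrt   <- c(6, 7, 10, 7)      # hypothetical predictions on the sqrt scale

rmse(pred.sqrt, sqrt(obs.minutes))  # error on the sqrt scale, as in the report
rmse(pred.sqrt^2, obs.minutes)      # error after squaring back to minutes
```

The two values are not directly comparable, so any RMSE should be reported together with the scale it was computed on.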

Results

The final model was significant and accounted for 37.25% of the variance in the square-root-transformed response (F(2, 30) = 8.903, p < 0.001). Gender, college, and subscriptions were not significant predictors. However, age (t(30) = -3.511, p = 0.001) and number of credit hours (t(30) = -3.927, p < 0.001) were significant predictors of time spent on YouTube per day. In conclusion, we could not confirm our initial hypothesis that subscriptions have a significant effect on the amount of time one watches YouTube. Instead, age and the number of credit hours taken were the significant factors: the older a student is, or the more credit hours they take, the less time they spend on YouTube.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9949 -2.8057  0.2372  2.4555  8.9313 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.8856    12.0586   4.800 4.09e-05 ***
## credit.hr    -1.0104     0.2573  -3.927 0.000466 ***
## age          -1.7223     0.4906  -3.511 0.001435 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.16 on 30 degrees of freedom
## Multiple R-squared:  0.3725, Adjusted R-squared:  0.3306 
## F-statistic: 8.903 on 2 and 30 DF,  p-value: 0.0009215
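
To read the coefficients above in minutes, a prediction on the square-root scale can be squared. The sketch below uses the reported estimates. The helper predict_minutes and the example student (15 credit hours, age 20) are hypothetical. Squaring is a naive back-transformation that ignores the residual variance, so the result is a rough figure rather than an exact expected value.

```r
# Coefficient estimates copied from the summary output above
b0       <- 57.8856
b.credit <- -1.0104
b.age    <- -1.7223

# Hypothetical helper: predicted minutes per day via naive back-transformation.
# Only meaningful while the sqrt-scale prediction is non-negative.
predict_minutes <- function(credit.hr, age) {
  root <- b0 + b.credit * credit.hr + b.age * age  # prediction on the sqrt scale
  root^2
}

predict_minutes(credit.hr = 15, age = 20)
# about 68.6 minutes (8.2836 squared)
```

Because the model is linear in the square root, one additional credit hour or year of age does not correspond to a fixed number of minutes. The change depends on where the student starts.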

Limitations

The sample size was small (N = 43 responses, 42 after cleaning), with possible response bias. The question was not leading, but the average respondent may underestimate or overestimate the time they spend on YouTube, so the data may not be accurate. Moreover, the residuals were slightly right-skewed. A possible confounding variable is a subscription to YouTube Red, which was not considered.

Implications

One change we would make is to replace the retrospective survey, in which participants estimate their time spent on YouTube, with a more active one in which each participant logs their time spent on YouTube for 3 days, in order to obtain more accurate data. Moreover, the data was collected specifically from current college students or recent college graduates of The University of Texas at Austin. Therefore, we cannot extend the conclusion beyond this university.

Appendix

Complete code used for the analysis is given below.

library(dplyr)
library(car)
library(caret)
youtube <- read.csv("youtube_response.csv", stringsAsFactors = FALSE)

youtube <- youtube %>%
    slice(-5) %>% # remove bad response
    transmute(id = 1:42,
        age = Age.,
        gender = as.factor(Gender),
        credit.hr = Number.of.credit.hours.you.re.taking.in.college.this.semester.,
               
        # Preprocess the variable college
        college = What.college.is.your.major.under.,
        college = sub("^Natural.*|CNS|^Computer.*|^Maritime.*", 
                      "Natural Science", college), # remove variation in Natural Science
        college = sub("Engineering ", "Engineering", college), # fix spacing error
        college = sub("^[Bb]usiness.*|McCombs", "Business", college), # remove variation in Business
        college = sub("COLA", "Liberal Arts", college), # remove variation in Liberal Arts
        college = sub("^Arts.*|Communication|Education|Undecided|Fine Arts", "Other", college), # Combine other majors into a single category
        college = as.factor(college),  
        netflix = grepl("Netflix", Are.you.subscribed.to.any.of.these.services.),
        hulu = grepl("Hulu", Are.you.subscribed.to.any.of.these.services.),
        amazonprime = grepl("Amazon", Are.you.subscribed.to.any.of.these.services.),
        
        # Preprocess the variable timespent
        timespent = How.much.time.do.you.spend.on.Youtube.per.day.,
        timespent = sub(" ?(hr|hrs|hour|hours) ?", "hr", timespent), # remove variation in hours
        timespent = sub(" ?(min|mins|minute|minutes) ?", "min", timespent), # remove variation in minutes
        timespent = sub(".*?([0-9]hr([0-9]+min)?).*", "\\1", timespent), # remove unnecessary text
        timespent = sub("([0-9])$", "\\1hr", timespent), # fix id = 17 which had no hr/min specification
        timespent = sub("^([0-9]+min)", "0hr\\1", timespent), # for those without an hr specification, add 0hr
        timespent = sub("([0-9]hr)$", "\\10", timespent), # for those without a min specification, add 0 min
        timespent = sub("min", "", timespent), # remove the word 'min' to make converting to time easier
        # Convert time characters into numeric minute values
        timespent = sapply(strsplit(timespent, "hr"), function(x){
                        x <- as.numeric(x)
                        x[1] * 60 + x[2]
                    })
    )


# full model on the raw scale; note that `.` uses every other column, including id, as a predictor
lm.youtube <- lm(timespent ~ ., data = youtube)
summary(lm.youtube)
anova(lm.youtube)

par(mfrow= c(2,2))
plot(lm.youtube) # diagnostic plots for the full model


# check conditions on the linear model, fix as needed
hist(youtube$timespent, breaks = 10) # hmm right skewed, try log and sqrt
plot(x = youtube$timespent)
hist(log(youtube$timespent + 1), breaks = 10) # + 1 guards against log(0), matching the model below
hist(sqrt(youtube$timespent), breaks = 10)
#try log 
loglm.youtube <- lm(log(timespent + 1) ~ ., data = youtube)
summary(loglm.youtube)
par(mfrow= c(2,2))
plot(loglm.youtube)

#try sqrt

sqrtlm.youtube <- lm(sqrt(timespent) ~., data = youtube)
summary(sqrtlm.youtube)
plot(sqrtlm.youtube)

# answer the main question, whether addition of netflix/hulu/amazonprime affects time spent on youtube

anova(sqrtlm.youtube)

# seems like not, but credit.hr may be a significant predictor

fit2 <- lm(sqrt(timespent) ~ credit.hr, data = youtube)
#summary(fit2) #not significant by itself, try adding age

fit3 <- lm(sqrt(timespent) ~ credit.hr + age, data = youtube)
#summary(fit3) # both significant now, maybe an interaction?

fit4 <- lm(sqrt(timespent) ~ credit.hr + age + credit.hr * age, data = youtube)
#summary(fit4) #nope

# fit3 seems to be the best, and we doubt adding other predictors will improve the fit

# divide into train and testing set

set.seed(401)
inTrain <- createDataPartition(y = youtube$timespent, p = 0.75, list = FALSE)
training <- youtube[inTrain,]
testing <- youtube[-inTrain,]

lmFit <- train(sqrt(timespent) ~ credit.hr + age, data = training, method = "lm")
summary(lmFit$finalModel)
print(lmFit)

lmpred <- predict(lmFit, testing)
RMSE(lmpred, sqrt(testing$timespent))