Today, Netflix, Hulu, and Amazon Prime Video all compete for subscribers. Beyond these three major subscription services, YouTube is a major platform where content creators freely share their work and build an audience. One may wonder, however, whether traffic to YouTube decreases as the subscription services become more and more popular. After all, we have only so much time in our days, and if we spend several hours per day watching Netflix, surely we are watching less YouTube. This project is an attempt to answer that question.
Objective
The objective of this project is to investigate whether adding the variable ‘number of subscriptions to a monthly video service’ (Netflix, Amazon Prime Video, or Hulu) accounts for a significantly greater proportion of the variance in the amount of time spent on YouTube per day. The nuisance variables included are age, gender, number of credit hours currently being taken, and college classification. This project was a final assignment for the class SDS 358 at UT Austin in Spring 2018 and was updated in February 2019.
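In R, this kind of question is often phrased as a comparison of nested models: fit a model with only the nuisance variables, add the subscription indicators, and test whether the added block explains significantly more variance. The sketch below uses the variable names defined in the Data section; the analysis later in this post instead examines the terms of a single model (with a square-root-transformed response) using anova().

# Sketch of a nested-model comparison (variable names are defined in the Data section below)
fit_base <- lm(timespent ~ age + gender + credit.hr + college, data = youtube)
fit_full <- update(fit_base, . ~ . + netflix + hulu + amazonprime)
anova(fit_base, fit_full) # F-test: do the subscription variables add explanatory power?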
Hypothesis
Having a subscription to any of the three major subscription services (Netflix, Hulu, Amazon Prime Video) has a significant impact on how much time one spends on YouTube.
Sample
The sample data were gathered via a Google Forms survey of 43 current college students. The survey collected their age (years), gender (male, female), number of credit hours being taken, college classification (Business, Engineering, Liberal Arts, Natural Science, Other), the subscription services they use (Netflix, Amazon Prime, Hulu, none of the above), and time spent on YouTube per day (minutes).
Data
We removed one observation with an unusable, non-numeric response for time spent on YouTube. We then renamed the variables and added the logical variables ‘netflix’, ‘hulu’, and ‘amazonprime’, where TRUE indicates a subscription and FALSE indicates no subscription. Furthermore, the variables ‘college’ and ‘timespent’ had to be preprocessed to reduce variation in the answers. This issue arose because the survey asked students to type out their answers instead of choosing from a fixed set of options (e.g., “CNS” vs. “Natural science”, “45 minutes” vs. “Around 2 hr”). Below, we display the first six rows of the preprocessed dataset.
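Once the cleaned data frame (named youtube in the complete code at the end of this post) has been built, the preview and a quick structure check can be produced with:

head(youtube) # first six rows of the preprocessed dataset
str(youtube)  # variable types: factors, logical subscription flags, numeric timespent (minutes)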
Assumptions
We noticed that the response variable (time spent on YouTube) was right-skewed, so we applied a square-root transformation. Homoscedasticity was checked and confirmed with the Residuals vs. Fitted plot, and normality of the residuals was also checked.
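The checks below are a condensed sketch of the diagnostics in the complete code at the end of this post: compare the raw and square-root-transformed response, then inspect the standard lm diagnostic plots.

# Response distribution before and after the square-root transformation
hist(youtube$timespent, breaks = 10)       # raw minutes: right-skewed
hist(sqrt(youtube$timespent), breaks = 10) # square-root scale: closer to symmetric

# Standard diagnostic plots for the transformed model
sqrtlm.youtube <- lm(sqrt(timespent) ~ . - id, data = youtube)
par(mfrow = c(2, 2))
plot(sqrtlm.youtube) # Residuals vs. Fitted, Normal Q-Q, Scale-Location, Residuals vs. Leverage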
Training and Testing
We divided the dataset into training (75%) and testing (25%) sets and used the train() function from the caret package to fit the final model. Under caret's default of 25 bootstrap resamples, the resulting training RMSE was 4.52 (on the square-root scale). We then used the model to predict on the testing data, which gave an RMSE of 5.62.
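The essential caret workflow is sketched below; it mirrors the complete code at the end of this post. With no trainControl supplied, train() defaults to 25 bootstrap resamples, which is where the training RMSE estimate comes from.

library(caret)

set.seed(401)
inTrain  <- createDataPartition(y = youtube$timespent, p = 0.75, list = FALSE)
training <- youtube[inTrain, ]
testing  <- youtube[-inTrain, ]

# Fit the final model; the default resampling is 25 bootstrap samples
lmFit <- train(sqrt(timespent) ~ credit.hr + age, data = training, method = "lm")
lmFit$resample # per-resample RMSE values behind the reported training estimate

# Predict on the held-out test set; predictions and RMSE are on the square-root scale
lmpred <- predict(lmFit, testing)
RMSE(lmpred, sqrt(testing$timespent))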
The final model was significant and accounted for 37.25% of the variance (F(2, 30) = 8.903, p < 0.001). Gender, college, and the subscription variables were not significant. However, age (t(30) = -3.511, p < 0.01) and number of credit hours (t(30) = -3.927, p < 0.001) were significant predictors of time spent on YouTube per day. In conclusion, our initial hypothesis that the number of subscriptions would have a significant effect on the amount of time one spends watching YouTube could not be confirmed. Instead, we found that age and the number of credit hours taken in college are the significant factors: the older we are, or the more credit hours we take, the less time we spend on YouTube. The model summary is shown below, followed by a worked example of back-transforming its coefficients.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9949 -2.8057  0.2372  2.4555  8.9313
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  57.8856    12.0586   4.800 4.09e-05 ***
## credit.hr    -1.0104     0.2573  -3.927 0.000466 ***
## age          -1.7223     0.4906  -3.511 0.001435 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.16 on 30 degrees of freedom
## Multiple R-squared: 0.3725, Adjusted R-squared: 0.3306
## F-statistic: 8.903 on 2 and 30 DF, p-value: 0.0009215
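To make the coefficients concrete, here is an illustrative back-transformation for a hypothetical student profile (age 20, taking 15 credit hours); because the response was modeled on the square-root scale, the prediction must be squared to recover minutes.

# Illustrative prediction from the fitted coefficients above (hypothetical profile)
sqrt_pred <- 57.8856 - 1.0104 * 15 - 1.7223 * 20 # predicted sqrt(minutes)
sqrt_pred^2 # about 69 minutes of YouTube per day

# Equivalent, using the trained model object:
# predict(lmFit, newdata = data.frame(credit.hr = 15, age = 20))^2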
Limitations
The sample size was small (N = 43), with possible response bias. The question was not leading, but respondents may underestimate or overestimate the time they spend on YouTube, so the self-reported data may not be accurate. Moreover, the residuals were slightly right-skewed. A possible confounding variable is a subscription to YouTube Red, which was not considered.
Implications
One thing I would change is the data-collection method: rather than a survey in which participants reflect on and estimate their time spent on YouTube, I would use a more active design in which each participant logs their time spent on YouTube for three days, yielding more accurate data. Moreover, the data were collected specifically from current college students and recent graduates of The University of Texas at Austin, so we cannot extend the conclusions beyond this university.
The complete code used for the analysis is given below.
library(dplyr)
library(car)
library(caret)
youtube <- read.csv("youtube_response.csv", stringsAsFactors = FALSE)
youtube <- youtube %>%
  slice(-5) %>% # remove bad response
  transmute(
    id = 1:42,
    age = Age.,
    gender = as.factor(Gender),
    credit.hr = Number.of.credit.hours.you.re.taking.in.college.this.semester.,
    # Preprocess the variable college
    college = What.college.is.your.major.under.,
    college = sub("^Natural.*|CNS|^Computer.*|^Maritime.*",
                  "Natural Science", college), # remove variation in Natural Science
    college = sub("Engineering ", "Engineering", college), # fix spacing error
    college = sub("^[Bb]usiness.*|McCombs", "Business", college), # remove variation in Business
    college = sub("COLA", "Liberal Arts", college), # remove variation in Liberal Arts
    college = sub("^Arts.*|Communication|Education|Undecided|Fine Arts", "Other", college), # combine other majors into a single category
    college = as.factor(college),
    netflix = grepl("Netflix", Are.you.subscribed.to.any.of.these.services.),
    hulu = grepl("Hulu", Are.you.subscribed.to.any.of.these.services.),
    amazonprime = grepl("Amazon", Are.you.subscribed.to.any.of.these.services.),
    # Preprocess the variable timespent
    timespent = How.much.time.do.you.spend.on.Youtube.per.day.,
    timespent = sub(" ?(hr|hrs|hour|hours) ?", "hr", timespent), # remove variation in hours
    timespent = sub(" ?(min|mins|minute|minutes) ?", "min", timespent), # remove variation in minutes
    timespent = sub(".*?([0-9]hr([0-9]+min)?).*", "\\1", timespent), # remove unnecessary text
    timespent = sub("([0-9])$", "\\1hr", timespent), # fix id = 17 which had no hr/min specification
    timespent = sub("^([0-9]+min)", "0hr\\1", timespent), # for those without an hr specification, add 0hr
    timespent = sub("([0-9]hr)$", "\\10", timespent), # for those without a min specification, add 0 min
    timespent = sub("min", "", timespent), # remove the word 'min' to make converting to time easier
    # Convert time characters into numeric minute values
    timespent = sapply(strsplit(timespent, "hr"), function(x) {
      x <- as.numeric(x)
      x[1] * 60 + x[2]
    })
  )
lm.youtube <- lm(timespent ~ . - id, data = youtube) # full model; exclude the id column from the predictors
summary(lm.youtube)
anova(lm.youtube)
# check conditions on the linear model, fix as needed
par(mfrow = c(2, 2))
plot(lm.youtube) # diagnostic plots for the untransformed model
hist(youtube$timespent, breaks = 10) # hmm right skewed, try log and sqrt
plot(x = youtube$timespent) # index plot of the raw response
hist(log(youtube$timespent), breaks = 10)
hist(sqrt(youtube$timespent), breaks = 10)
# try a log transformation (adding 1 avoids log(0))
loglm.youtube <- lm(log(timespent + 1) ~ . - id, data = youtube)
summary(loglm.youtube)
par(mfrow = c(2, 2))
plot(loglm.youtube)
# try a square-root transformation
sqrtlm.youtube <- lm(sqrt(timespent) ~ . - id, data = youtube)
summary(sqrtlm.youtube)
plot(sqrtlm.youtube)
# answer the main question, whether addition of netflix/hulu/amazonprime affects time spent on youtube
anova(sqrtlm.youtube)
# seems like not, but credit.hr may be a significant predictor
fit2 <- lm(sqrt(timespent) ~ credit.hr, data = youtube)
#summary(fit2) #not significant by itself, try adding age
fit3 <- lm(sqrt(timespent) ~ credit.hr + age, data = youtube)
#summary(fit3) # both significant now, maybe an interaction?
fit4 <- lm(sqrt(timespent) ~ credit.hr * age, data = youtube) # main effects plus their interaction
#summary(fit4) #nope
# fit3 seems to be the best, and we doubt adding other predictors will improve the fit
# divide into train and testing set
set.seed(401)
inTrain <- createDataPartition(y = youtube$timespent, p = 0.75, list = FALSE)
training <- youtube[inTrain,]
testing <- youtube[-inTrain,]
lmFit <- train(sqrt(timespent) ~ credit.hr + age, data = training, method = "lm")
summary(lmFit$finalModel)
print(lmFit)
lmpred <- predict(lmFit, testing)
RMSE(lmpred, sqrt(testing$timespent))