TED Talks

I downloaded a .csv file with data on TED Talks from kaggle.com. TED Talks are short, conference-style talks, mostly in science and technology, aimed at educated non-specialists. My question: is there a relationship between the duration of a TED talk and how many views it gets? To answer it, I will fit a linear model of views on duration, using the duration and views fields in the data.

pth <- 'C:\\Users\\Nate\\Documents\\DataSet\\ted_main.csv'
# read.csv() already returns a data frame, so no pipe or data.frame() call is needed
ted <- read.csv(pth)
ncol(ted)
## [1] 17
nrow(ted)
## [1] 2550

The data frame has 17 columns and 2,550 rows, which matches the figures listed on Kaggle, so the data were loaded successfully.
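The same check can be written as an assertion, so that a truncated or malformed download fails loudly instead of being noticed by eye. This is only a minimal sketch: `toy` is a hypothetical stand-in data frame, not the Kaggle file, and with the real data the expected values would be 2550 and 17.

```r
# Hypothetical stand-in; with the real data, use `ted` instead of `toy`.
toy <- data.frame(duration = c(135, 848, 5256),
                  views    = c(50443, 1124524, 47227110))

# dim() returns c(rows, columns) in a single call.
dim(toy)

# stopifnot() raises an error if the shape is not what we expect.
stopifnot(nrow(toy) == 3, ncol(toy) == 2)
```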

To get an idea of the range and distribution of both variables, we will make histograms. Both variables appear to have substantial outliers (the maximum number of views, 47,227,110, is far above the third quartile of 1,700,760), so I viewed both in natural-log space.

summary(ted$duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   577.0   848.0   826.5  1046.8  5256.0
hist(log(ted$duration))

summary(ted$views)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    50443   755793  1124524  1698297  1700760 47227110
hist(log(ted$views))
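To illustrate why the log scale helps with right-skewed counts such as views, here is a small sketch on simulated data. The variable `sim_views` and the log-normal parameters are assumptions chosen for illustration; they are not estimated from the TED data.

```r
set.seed(1)
# Simulated right-skewed "views" (log-normal); not the TED data.
sim_views <- rlnorm(2550, meanlog = 14, sdlog = 0.8)

# On the raw scale the mean sits well above the median, a sign of right skew.
mean(sim_views) > median(sim_views)

# On the log scale the mean and median nearly coincide.
abs(mean(log(sim_views)) - median(log(sim_views)))

# Side-by-side histograms make the difference visible.
op <- par(mfrow = c(1, 2))
hist(sim_views, main = "Raw scale")
hist(log(sim_views), main = "Log scale")
par(op)
```

The same mean-above-median pattern shows up in the summary of views above (mean 1,698,297 against a median of 1,124,524).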

Linear Model

Now that the data are loaded, we will fit a linear regression of views on duration.

fit <- lm(ted$views ~ ted$duration)
summary(fit)
## 
## Call:
## lm(formula = ted$views ~ ted$duration)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2667314  -932675  -554748    28483 45418926 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1429186.7   119912.2  11.919   <2e-16 ***
## ted$duration     325.6      132.2   2.463   0.0138 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2496000 on 2548 degrees of freedom
## Multiple R-squared:  0.002376,   Adjusted R-squared:  0.001984 
## F-statistic: 6.068 on 1 and 2548 DF,  p-value: 0.01383
res <- resid(fit)
summary(res)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2667314  -932675  -554748        0    28483 45418926
hist(res)

Since \(p = 0.01383 < 0.05\), the slope is statistically significant at the 5% level, so there is evidence of a correlation between duration and views. The fit explains almost none of the variation, however (\(R^2 = 0.002376\)). Furthermore, the histogram of the residuals does not appear to be normally distributed, which violates the linear model's assumption of normally distributed residuals.
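A histogram is one way to judge normality; a Q-Q plot and a Shapiro-Wilk test are common complements. The sketch below uses simulated data with a heavy right tail as a stand-in (the names `x`, `y`, `fit_sim`, and `res_sim` are hypothetical); with the real model, `res` would take the place of `res_sim`.

```r
set.seed(42)
# Simulated stand-in for the TED data: heavy right tail in y, no real trend.
n <- 500
x <- runif(n, 135, 5256)
y <- rlnorm(n, meanlog = 14, sdlog = 0.8)

fit_sim <- lm(y ~ x)
res_sim <- resid(fit_sim)

# Q-Q plot: points should follow the line if the residuals are normal.
qqnorm(res_sim)
qqline(res_sim, col = "red")

# Shapiro-Wilk test (base R, needs 3 <= n <= 5000).
# A small p-value suggests the residuals are not normal.
sw <- shapiro.test(res_sim)
sw$p.value
```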

Visualization

To check whether the linear model is reasonable, we will create a scatter plot and see how well the data follow the regression line.

plot(ted$duration, ted$views, col = 'Blue')
abline(fit, col = 'Red')

It seems as if this correlation is driven by outliers. To test this, we will perform a log-log transformation of the data to reduce their influence. Before that, we plot the residuals against duration.

plot(ted$duration, res, col = 'Orange')
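One way to probe whether a few points drive a slope is Cook's distance, which measures how much each observation moves the fitted line. This is a swapped-in diagnostic rather than the method used in this analysis, shown on simulated data with one deliberately extreme point; all names below are hypothetical.

```r
set.seed(7)
# Simulated data: no underlying trend, plus one extreme point at long duration.
x <- c(runif(200, 135, 1500), 5000)
y <- c(rnorm(200, mean = 1e6, sd = 2e5), 4e7)

fit_all <- lm(y ~ x)

# Cook's distance for every observation; the largest marks the most influential.
cd <- cooks.distance(fit_all)
worst <- which.max(cd)

# Refit without the most influential point and compare slopes.
fit_trim <- lm(y[-worst] ~ x[-worst])
coef(fit_all)[2]
coef(fit_trim)[2]
```

If the slope collapses toward zero once the flagged point is removed, the apparent trend was mostly the work of that point.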

Log-Log Transformation

Note that R's log() defaults to the natural logarithm, base e. I chose to keep this base because switching to base 10 would compress the log of duration into a narrow range, roughly 2.1 to 3.7.
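The choice of base affects only the scale of the axes, not the fitted relationship: in a log-log regression the slope is the same in any base, and the intercept changes by a constant factor. A quick sketch on simulated data (the coefficients 13.5 and 0.06 are illustrative assumptions, not the TED estimates):

```r
set.seed(3)
# Simulated positive data; not the TED data.
x <- runif(300, 135, 5256)
y <- exp(13.5 + 0.06 * log(x) + rnorm(300, sd = 0.7))

fit_e  <- lm(log(y) ~ log(x))      # natural log
fit_10 <- lm(log10(y) ~ log10(x))  # base 10

# The slope is identical in either base ...
all.equal(unname(coef(fit_e)[2]), unname(coef(fit_10)[2]))

# ... and the intercepts differ only by the factor log(10).
all.equal(unname(coef(fit_e)[1]), unname(coef(fit_10)[1]) * log(10))
```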

fit2 <- lm(log(ted$views) ~ log(ted$duration))
summary(fit2)
## 
## Call:
## lm(formula = log(ted$views) ~ log(ted$duration))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2495 -0.4413 -0.0478  0.3734  3.6626 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       13.56589    0.20516  66.123   <2e-16 ***
## log(ted$duration)  0.06261    0.03096   2.023   0.0432 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7711 on 2548 degrees of freedom
## Multiple R-squared:  0.001603,   Adjusted R-squared:  0.001211 
## F-statistic: 4.091 on 1 and 2548 DF,  p-value: 0.04322
res2 <- resid(fit2)
summary(res2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.24952 -0.44135 -0.04776  0.00000  0.37344  3.66258
hist(res2)

Although p increased to 0.04322, it is still below the 0.05 significance threshold (95% confidence), so there is still a statistically significant correlation between duration and views. The residuals here look approximately normally distributed.

Visualization II

plot(log(ted$duration), log(ted$views), col = 'Cyan')
abline(fit2, col = 'Magenta')

The slope of the line, 0.06261, is very close to zero. The statistical significance may owe more to the large number of data points (2,550) than to a clear correlation.
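To put the slope in practical terms: in a log-log model, multiplying duration by k multiplies the fitted views by k raised to the slope. The sketch below plugs in the estimate and standard error from the summary(fit2) output; the variable names are hypothetical and the interval uses a normal approximation.

```r
# Estimates copied from the summary(fit2) output.
b  <- 0.06261   # slope on log(duration)
se <- 0.03096   # its standard error

# Doubling duration multiplies fitted views by 2^b.
doubling_effect <- 2^b
(doubling_effect - 1) * 100   # percent change in fitted views when duration doubles

# Approximate 95% interval for the slope (normal approximation).
ci <- b + c(-1, 1) * qnorm(0.975) * se
ci
```

Doubling the length of a talk is therefore associated with an increase in views of only about 4%, and the lower end of the interval sits just above zero.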

plot(log(ted$duration), res2, col = 'Orange')

Summary

Even though our findings are statistically significant, it isn’t obvious that they are practically significant. The trend line is nearly horizontal through these data, and \(R^2\) is below 0.003 in both models. Note also that for the numerous outliers in views, the residuals are almost as large as the raw values, because the fitted values are so much smaller than the outliers. This makes the residuals of the untransformed model appear non-random and casts doubt on whether that linear model is valid.
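The role of sample size can be made concrete with a small calculation. For a simple regression, the t statistic of the slope depends only on the correlation r and the sample size n. The sketch below takes \(R^2\) from the log-log fit and asks what the p-value would be at different sample sizes; `p_for_n` is a hypothetical helper written for this illustration.

```r
# R-squared copied from the summary(fit2) output.
r2 <- 0.001603
r  <- sqrt(r2)

# t statistic and two-sided p-value for a correlation r at sample size n.
p_for_n <- function(n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

p_for_n(2550)  # the actual sample size; close to the reported 0.043
p_for_n(100)   # same correlation, smaller sample: far from significant
```

The same weak correlation that is significant with 2,550 talks would not come close to significance with 100, which supports reading the result as a large-sample effect rather than a strong relationship.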