DATA 605 Discussion 11

TED Talks

I downloaded a .csv file of TED Talks data from kaggle.com. TED Talks are short conference-style talks, mostly in science and technology fields, aimed at an educated non-specialist audience. My question: is there a relationship between the duration of a TED talk and the number of views it gets? To answer it, I will fit a linear model of the views field on the duration field.

library(magrittr)  # provides the %>% pipe operator
pth <- 'C:\\Users\\Nate\\Documents\\DataSet\\ted_main.csv'
ted <- pth %>% read.csv()  # read.csv() already returns a data frame
ncol(ted)

## [1] 17

nrow(ted)

## [1] 2550

The data frame has 17 columns and 2,550 rows, which matches the dimensions listed on Kaggle, so the data loaded successfully.

To get an idea of the range and distribution of both variables, we will make histograms. Both variables appear to have significant outliers (in the summaries below, the maximum of each is far above its third quartile), so I plot them on a natural log scale.

summary(ted$duration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   577.0   848.0   826.5  1046.8  5256.0

hist(log(ted$duration))

summary(ted$views)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    50443   755793  1124524  1698297  1700760 47227110

hist(log(ted$views))

Linear Model

Now that the data are loaded, we will fit a linear regression of views on duration.

fit <- lm(ted$views ~ ted$duration)
summary(fit)

## 
## Call:
## lm(formula = ted$views ~ ted$duration)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2667314  -932675  -554748    28483 45418926 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1429186.7   119912.2  11.919   <2e-16 ***
## ted$duration     325.6      132.2   2.463   0.0138 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2496000 on 2548 degrees of freedom
## Multiple R-squared:  0.002376,   Adjusted R-squared:  0.001984 
## F-statistic: 6.068 on 1 and 2548 DF,  p-value: 0.01383

res <- resid(fit)
summary(res)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2667314  -932675  -554748        0    28483 45418926

hist(res)

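Since \(p = 0.01383 < 0.05\), there is a statistically significant linear relationship between duration and views at the 5% significance level. However, the R-squared of 0.002376 means that duration explains only about 0.2% of the variance in views. Furthermore, the histogram of the residuals does not appear to be normally distributed (the median residual is -554,748 while the maximum is 45,418,926, a strong right skew), which violates the normality assumption for the residuals of a linear model.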
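The visual check of the residual histogram can be backed up with a Q-Q plot and a Shapiro-Wilk test. The sketch below is only an illustration: it runs on simulated, right-skewed data, not on the TED file, and the names `fit_sim`, `res_sim`, and `sw` as well as the simulation parameters are assumptions made for the example.

```r
# Simulated stand-in for the TED data; these values are NOT the real dataset.
set.seed(1)
n <- 500
duration <- runif(n, 135, 5256)
# Log-normal noise gives a long right tail, similar in shape to view counts.
views <- exp(13.5 + 0.06 * log(duration) + rnorm(n, sd = 0.77))

fit_sim <- lm(views ~ duration)
res_sim <- resid(fit_sim)

# Q-Q plot: points bending away from the reference line suggest non-normal residuals.
qqnorm(res_sim)
qqline(res_sim, col = 'Red')

# Shapiro-Wilk test (accepts 3 to 5000 observations);
# a small p-value is evidence against normality.
sw <- shapiro.test(res_sim)
sw$p.value
```

The same two checks could be applied to `res` from the fit above, since 2,550 observations fall within the range that `shapiro.test()` accepts.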

Visualization

To judge whether the linear model is valid, we will create a scatter plot and see how well the data fit the regression line.

plot(ted$duration, ted$views, col = 'Blue')
abline(fit, col = 'Red')

It seems that this correlation is driven by the presence of outliers. To test this, we will perform a log-log transformation of the data to reduce their influence.

plot(ted$duration, res, col = 'Orange')
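The suspicion that a few outliers drive the fit can also be probed with Cook's distance, which measures how much each observation influences the fitted coefficients. This is a minimal sketch on made-up data with one planted outlier, not an analysis of the TED file; all variable names here are hypothetical.

```r
# Made-up data with no true relationship between x and y.
set.seed(2)
n <- 200
x <- runif(n, 100, 1000)
y <- 1000 + rnorm(n, sd = 50)

# Plant a single extreme point, far from the rest in both x and y.
x_out <- c(x, 5000)
y_out <- c(y, 5000)

fit_out   <- lm(y_out ~ x_out)  # fit including the outlier
fit_clean <- lm(y ~ x)          # fit without it

# Cook's distance flags the observations with the most influence on the fit.
cd <- cooks.distance(fit_out)
most_influential <- unname(which.max(cd))
most_influential

# Compare the slopes with and without the planted point.
slope_out   <- unname(coef(fit_out)[2])
slope_clean <- unname(coef(fit_clean)[2])
c(with_outlier = slope_out, without_outlier = slope_clean)
```

In this toy case the single planted point produces a clearly positive slope that disappears once it is dropped. Running `cooks.distance(fit)` on the TED model would show whether a handful of talks behave the same way.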

Log-Log Transformation

Note that R's log() defaults to the natural logarithm, base e. I chose to keep this default, as switching to base 10 would compress duration into a narrow range of roughly 2.1 to 3.7.
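The choice of base affects only the scale of the axes, not the conclusions of a log-log fit: changing the base of both variables leaves the slope and its p-value unchanged and rescales the intercept by 1/ln(10). The sketch below illustrates this on simulated data (not the TED file); the names `fit_e` and `fit_10` and the simulation parameters are assumptions for the example.

```r
# Simulated stand-in data; NOT the real TED dataset.
set.seed(3)
n <- 300
duration <- runif(n, 135, 5256)
views <- exp(13.5 + 0.06 * log(duration) + rnorm(n, sd = 0.77))

fit_e  <- lm(log(views) ~ log(duration))      # natural log
fit_10 <- lm(log10(views) ~ log10(duration))  # base 10

slope_e  <- unname(coef(fit_e)[2])
slope_10 <- unname(coef(fit_10)[2])
all.equal(slope_e, slope_10)                  # same slope in either base

# The intercept is rescaled by 1 / log(10).
int_e  <- unname(coef(fit_e)[1])
int_10 <- unname(coef(fit_10)[1])
all.equal(int_10, int_e / log(10))

# The slope's p-value is also unchanged.
p_e  <- summary(fit_e)$coefficients[2, 4]
p_10 <- summary(fit_10)$coefficients[2, 4]
all.equal(p_e, p_10)

# Range of duration in base 10, using the min and max from the summary above.
log10(c(135, 5256))
```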

fit2 <- lm(log(ted$views) ~ log(ted$duration))
summary(fit2)

## 
## Call:
## lm(formula = log(ted$views) ~ log(ted$duration))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2495 -0.4413 -0.0478  0.3734  3.6626 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       13.56589    0.20516  66.123   <2e-16 ***
## log(ted$duration)  0.06261    0.03096   2.023   0.0432 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7711 on 2548 degrees of freedom
## Multiple R-squared:  0.001603,   Adjusted R-squared:  0.001211 
## F-statistic: 4.091 on 1 and 2548 DF,  p-value: 0.04322

res2 <- resid(fit2)
summary(res2)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.24952 -0.44135 -0.04776  0.00000  0.37344  3.66258

hist(res2)

Although p increased to 0.04322, it is still below the 0.05 threshold (the 5% significance level, or 95% confidence). There is still a statistically significant relationship between duration and views. The residuals here look approximately normally distributed.

Visualization II

plot(log(ted$duration), log(ted$views), col = 'Cyan')
abline(fit2, col = 'Magenta')

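The slope of the line, 0.06261, is very close to zero, and the R-squared of 0.001603 shows that log duration explains less than 0.2% of the variance in log views. The statistical significance likely owes more to the large number of data points (2,550) than to any strong relationship.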
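A short calculation makes the gap between statistical significance and effect size concrete. In a simple linear regression, R-squared can be recovered from the slope's t-statistic as t^2 / (t^2 + df), and in a log-log model the slope gives the multiplicative change in views when duration doubles. The inputs below are copied from the summary(fit2) output above; the variable names are only for illustration.

```r
# Values taken from the summary(fit2) output above.
t_val <- 2.023    # t value for log(ted$duration)
df    <- 2548     # residual degrees of freedom
slope <- 0.06261  # estimated log-log slope

# R-squared implied by the t-statistic in a one-predictor regression.
r2 <- t_val^2 / (t_val^2 + df)
r2

# Multiplicative change in predicted views when duration doubles.
doubling_factor <- 2^slope
doubling_factor
```

With 2,548 degrees of freedom, a t-statistic just above 2 is enough for p < 0.05 even though it corresponds to an R-squared of roughly 0.0016, and doubling a talk's length is associated with only about a 4% increase in predicted views.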

plot(log(ted$duration), res2, col = 'Orange')

Nathan Cooper

October 30, 2017
