Last Updated on Sat Jun 10 23:03:19 2017
Published at http://rpubs.com/enghanwenalvin/283651


Part 1: You would like to understand what factors will determine the attractiveness of a pitch. The pitch attractiveness is measured by whether the story has convinced other members to contribute. How would you approach this problem? Pick a dependent variable (or a few dependent variables) that will give you some insights on contribution behaviors. Discuss why you choose this variable (or this set of variables) as your dependent variable(s). Then, given each dependent variable, select an appropriate model and relevant independent variables. Justify your choices with a few sentences. Finally, show the output of your model in STATA or R and discuss your interpretations of the results.

To approach this question, let us first IMPORT the dataset.

library(readxl)  # provides read_excel()
Crowdfunding <- read_excel("C:/Users/engha/Desktop/Alvin/NYU MSBA/Digital Marketing Analytics/Crowdfunding.xls")

Note: Please import the dataset from your relevant directory.


We begin with an EXPLORATORY first look at the dataset as needed.

head(Crowdfunding)
## # A tibble: 6 × 15
##   story_id abs_week rel_week story_readtime story_views
##      <dbl>    <dbl>    <dbl>          <dbl>       <dbl>
## 1      860       19        1           1711          17
## 2      860       20        2           2171          24
## 3      860       21        3           1608          30
## 4      860       22        4           2411          29
## 5      860       23        5             53           6
## 6      860       24        6            215           9
## # ... with 10 more variables: story_funding_duration <dbl>,
## #   story_length <dbl>, focal_page <dbl>, pitch_contributors <dbl>,
## #   tags <dbl>, pitch_views <dbl>, arindex <dbl>, pitch_budget <dbl>,
## #   video <dbl>, insights <dbl>
str(Crowdfunding)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1848 obs. of  15 variables:
##  $ story_id              : num  860 860 860 860 860 860 860 860 860 860 ...
##  $ abs_week              : num  19 20 21 22 23 24 25 26 27 28 ...
##  $ rel_week              : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ story_readtime        : num  1711 2171 1608 2411 53 ...
##  $ story_views           : num  17 24 30 29 6 9 12 6 10 7 ...
##  $ story_funding_duration: num  294 294 294 294 294 294 294 294 294 294 ...
##  $ story_length          : num  3620 3620 3620 3620 3620 3620 3620 3620 3620 3620 ...
##  $ focal_page            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ pitch_contributors    : num  17 17 17 17 17 17 17 17 17 17 ...
##  $ tags                  : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ pitch_views           : num  355 355 355 355 355 355 355 355 355 355 ...
##  $ arindex               : num  11.6 11.6 11.6 11.6 11.6 ...
##  $ pitch_budget          : num  1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 ...
##  $ video                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ insights              : num  77 73 73 66 66 56 56 52 65 55 ...
summary(Crowdfunding)
##     story_id       abs_week        rel_week     story_readtime    
##  Min.   : 860   Min.   : 2.00   Min.   : 1.00   Min.   :     0.0  
##  1st Qu.: 976   1st Qu.:34.00   1st Qu.:13.00   1st Qu.:     0.0  
##  Median :1007   Median :49.00   Median :26.00   Median :    34.0  
##  Mean   :1004   Mean   :47.01   Mean   :27.99   Mean   :   799.8  
##  3rd Qu.:1048   3rd Qu.:62.00   3rd Qu.:41.00   3rd Qu.:   437.0  
##  Max.   :1110   Max.   :74.00   Max.   :73.00   Max.   :127455.0  
##   story_views     story_funding_duration  story_length     focal_page    
##  Min.   :  0.00   Min.   :  4.00         Min.   : 1853   Min.   :0.0000  
##  1st Qu.:  1.00   1st Qu.: 26.00         1st Qu.: 6324   1st Qu.:0.0000  
##  Median :  3.00   Median : 68.00         Median :12004   Median :0.0000  
##  Mean   :  9.91   Mean   : 86.62         Mean   :13431   Mean   :0.2511  
##  3rd Qu.:  7.00   3rd Qu.:114.00         3rd Qu.:18304   3rd Qu.:1.0000  
##  Max.   :960.00   Max.   :300.00         Max.   :72000   Max.   :1.0000  
##  pitch_contributors      tags         pitch_views        arindex     
##  Min.   :  1.00     Min.   : 2.000   Min.   :  33.0   Min.   : 5.80  
##  1st Qu.:  6.00     1st Qu.: 3.000   1st Qu.: 167.0   1st Qu.:10.50  
##  Median : 11.00     Median : 4.000   Median : 366.0   Median :11.80  
##  Mean   : 24.72     Mean   : 4.429   Mean   : 616.1   Mean   :12.18  
##  3rd Qu.: 26.00     3rd Qu.: 5.000   3rd Qu.: 728.0   3rd Qu.:13.20  
##  Max.   :144.00     Max.   :13.000   Max.   :5461.0   Max.   :19.70  
##   pitch_budget         video            insights     
##  Min.   :  25000   Min.   :0.00000   Min.   :  0.00  
##  1st Qu.:  50000   1st Qu.:0.00000   1st Qu.: 35.00  
##  Median :  80000   Median :0.00000   Median : 56.00  
##  Mean   : 156216   Mean   :0.05249   Mean   : 51.73  
##  3rd Qu.: 160000   3rd Qu.:0.00000   3rd Qu.: 72.00  
##  Max.   :1000000   Max.   :1.00000   Max.   :100.00

We can make a few observations:
* The dataset contains 15 variables with 1,848 observations.
* All the variables are numeric.
* There is no missing data (which is good).
* Many variables have repeated values across rows because the dataset is set up as one row per story per week (absolute/relative weeks).
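These observations can also be checked programmatically rather than by eye. The following is a minimal sketch on a made-up stand-in for the weekly data (the `toy` data frame and its values are hypothetical, not taken from Crowdfunding.xls); it checks the dimensions, the column types, the number of missing values, and which columns are constant within a story.

```r
# Toy stand-in for the weekly panel; all values here are made up for illustration.
toy <- data.frame(
  story_id           = c(1, 1, 1, 2, 2),
  rel_week           = c(1, 2, 3, 1, 2),
  story_views        = c(17, 24, 30, 5, 8),
  pitch_contributors = c(17, 17, 17, 6, 6)
)

dims        <- dim(toy)                      # rows and columns
all_numeric <- all(sapply(toy, is.numeric))  # TRUE if every column is numeric
n_missing   <- sum(is.na(toy))               # total count of missing cells

# A column is "story-level" if it takes a single value within each story,
# which is what produces the repeated rows noted above.
is_story_level <- sapply(toy[-1], function(col) {
  all(tapply(col, toy$story_id, function(v) length(unique(v)) == 1))
})
print(is_story_level)
```

The same calls should work on Crowdfunding itself once it has been imported.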

Furthermore, some of the variables, such as pitch_contributors, appear fairly skewed, as we can see from this chart.

library(ggplot2)  # plotting functions used throughout
ggplot(Crowdfunding, aes(pitch_contributors)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Total Number of Contributors",
       x = "Total Number of Contributors")

Clearly, most stories attract relatively few contributors.
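The skew is also visible in the summary statistics above, where the mean of pitch_contributors (24.72) is more than twice the median (11). As a rough sketch of how skew can be quantified and how a log transformation reduces it, consider the following. The `skewness` helper and the simulated `toy` vector are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
# Simulated right-skewed counts standing in for pitch_contributors.
toy <- round(rlnorm(200, meanlog = 2.4, sdlog = 1)) + 1

skew_raw <- skewness(toy)       # strongly positive for right-skewed data
skew_log <- skewness(log(toy))  # much closer to zero after taking logs

print(c(raw = skew_raw, log = skew_log))
```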

To make the dataset tidier for ease of exploration, we can TRANSFORM the data by grouping the stories together and taking log of those variables that exhibit significant skew.

library(dplyr)  # provides %>%, group_by() and summarise()
Crowdfunding_story <- Crowdfunding %>% 
  group_by(story_id) %>% 
  summarise(abs_week_start = min(abs_week),
            abs_week_end = max(abs_week),
            rel_week = n(), 
            story_readtime = log(sum(story_readtime)),
            story_views = log(sum(story_views)),
            story_funding_duration = log(max(story_funding_duration)),
            story_length = log(max(story_length)),
            focal_page = log(mean(focal_page)),
            pitch_contributors = log(max(pitch_contributors)),
            tags = log(max(tags)),
            pitch_views = log(max(pitch_views)),
            arindex = max(arindex),
            pitch_budget = log(max(pitch_budget)/1000),
            video = mean(video),
            insights = mean(insights))
head(Crowdfunding_story)
## # A tibble: 6 × 16
##   story_id abs_week_start abs_week_end rel_week story_readtime story_views
##      <dbl>          <dbl>        <dbl>    <int>          <dbl>       <dbl>
## 1      860             19           74       56       9.728837    5.634790
## 2      871              6           74       69      10.932017    6.539586
## 3      918              5           74       70      10.636793    6.095825
## 4      923              2           74       73       8.147578    3.891820
## 5      940              2           74       73       9.701004    5.187386
## 6      958              5           74       70       9.609385    5.129899
## # ... with 10 more variables: story_funding_duration <dbl>,
## #   story_length <dbl>, focal_page <dbl>, pitch_contributors <dbl>,
## #   tags <dbl>, pitch_views <dbl>, arindex <dbl>, pitch_budget <dbl>,
## #   video <dbl>, insights <dbl>
dim(Crowdfunding_story)
## [1] 37 16

Note that we have transformed a few of the variables:
* abs_week has been decomposed into the start and end of the observation period (abs_week_start and abs_week_end respectively)
* story_readtime now shows the log total time visitors spent at the story URL for the entire observation period
* story_views now shows the log total number of visitors to the story URL for the entire observation period
* focal_page now shows the log proportion of time the story is on the first page
* pitch_budget has been scaled to be in thousands and in log form
* insights now shows the average Google Search Trends for pitch keywords for the entire observation period
* story_funding_duration, story_length, pitch_contributors, tags and pitch_views have also all been expressed in log form
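To make the collapse from weekly rows to one row per story concrete, here is a minimal base R sketch of the same idea on made-up data. The `toy` data frame and the `collapse_story` helper are hypothetical and cover only a few of the columns; the analysis itself uses the dplyr pipeline shown earlier.

```r
# Toy weekly panel; values are made up for illustration.
toy <- data.frame(
  story_id           = c(1, 1, 1, 2, 2),
  abs_week           = c(72, 73, 74, 73, 74),
  story_views        = c(17, 24, 30, 5, 8),
  pitch_contributors = c(17, 17, 17, 6, 6)
)

# Hypothetical helper: reduce all weekly rows of one story to a single row.
collapse_story <- function(d) {
  data.frame(
    story_id           = d$story_id[1],
    abs_week_start     = min(d$abs_week),
    abs_week_end       = max(d$abs_week),
    rel_week           = nrow(d),                     # number of weekly rows
    story_views        = log(sum(d$story_views)),     # weekly flow: sum, then log
    pitch_contributors = log(max(d$pitch_contributors)) # story-level: max, then log
  )
}

toy_story <- do.call(rbind, lapply(split(toy, toy$story_id), collapse_story))
print(toy_story)
```

The design choice mirrors the bullets above: variables that vary by week are summed or averaged, while variables that are constant within a story are simply picked up with max().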

We can see that there are 37 stories. Notably, we can also now easily see that regardless of when a story was first observed, its last observation is always in week 74. This makes abs_week_end a constant, which can be ignored. Indeed, with abs_week_end fixed, rel_week is simply a linear transformation of abs_week_start (in the rows shown above, rel_week = abs_week_end - abs_week_start + 1), so only one of the two carries any information.
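The linear relationship can be verified directly on the six rows printed by head(Crowdfunding_story) above. This small check uses only those printed values; the name `first_six` is introduced here for illustration.

```r
# Values copied from the head(Crowdfunding_story) output above.
first_six <- data.frame(
  story_id       = c(860, 871, 918, 923, 940, 958),
  abs_week_start = c(19, 6, 5, 2, 2, 5),
  abs_week_end   = rep(74, 6),
  rel_week       = c(56, 69, 70, 73, 73, 70)
)

# If no week is missing for a story, the row count equals the week span.
implied_weeks  <- first_six$abs_week_end - first_six$abs_week_start + 1
weeks_match    <- all(implied_weeks == first_six$rel_week)
end_is_constant <- length(unique(first_six$abs_week_end)) == 1

print(c(weeks_match = weeks_match, end_is_constant = end_is_constant))
```

Because of this exact dependence, a regression can estimate a coefficient for at most one of abs_week_start and rel_week, and none for abs_week_end.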

Another somewhat peculiar observation in the original dataset is that for some observations, even when there are visitors to the story URL, the time spent is zero. This could mean that visitors who reach the URL but do not click on the story are recorded as having spent no time. This is not an issue after the transformation, since read time is aggregated by story.


Picking the Dependent Variable

Since we would like to understand the factors determining the attractiveness of a pitch, as measured by whether the story has convinced other members to contribute, the obvious dependent variable would be pitch_contributors, the total number of contributors who actually did fund the pitch. It is reasonable to assume that rational contributors would fund a pitch only if they found the pitch attractive! Another possible dependent variable would be insights, as one would expect potential contributors to search for pitch keywords if they thought the pitch was attractive. However, search interest does not necessarily translate into actual contributions, hence we will use pitch_contributors as the dependent variable.

Let us VISUALIZE the data to aid our exploration further.

We begin with a simple scatter plot of the story funding duration and pitch_contributors. Note that we will not concern ourselves with story_id, abs_week, rel_week, story_readtime, story_views and focal_page since these can be regarded as ex-post variables; i.e. they are measured after the pitch has been funded.

ggplot(Crowdfunding_story, aes(story_funding_duration, pitch_contributors)) + 
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(title = "Number of Days Pitch Spent being Funded vs Total Number of Contributors",
       x = "(Log) Number of Days Pitch Spent being Funded",
       y = "(Log) Total Number of Contributors")

It appears that there is some positive relationship between the number of days a pitch spent being funded and the total number of contributors (correlation of 0.405352). This makes intuitive sense. An attractive pitch should get a longer airtime and get more contributions.
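The correlations quoted in this section are plain Pearson correlations between each (transformed) variable and pitch_contributors. A minimal sketch of how such figures can be computed follows, using simulated data in place of Crowdfunding_story; the `toy` data frame and its coefficients are assumptions made purely for illustration.

```r
set.seed(42)
n <- 37  # same number of stories as in Crowdfunding_story

# Simulated stand-ins for two log-transformed predictors and the outcome.
toy <- data.frame(
  story_funding_duration = rnorm(n, mean = 4, sd = 1),
  story_length           = rnorm(n, mean = 9, sd = 0.7)
)
toy$pitch_contributors <- 0.4 * toy$story_funding_duration + rnorm(n)

# Pearson correlation for a single pair.
r_duration <- cor(toy$story_funding_duration, toy$pitch_contributors)

# Correlation of every predictor with the dependent variable in one call.
predictors <- setdiff(names(toy), "pitch_contributors")
cors <- sapply(toy[predictors], cor, y = toy$pitch_contributors)

# With only 37 observations, a test of H0: rho = 0 is a useful sanity check.
p_duration <- cor.test(toy$story_funding_duration, toy$pitch_contributors)$p.value

print(round(cors, 3))
```

On the real data, replacing `toy` with Crowdfunding_story should reproduce the quoted values.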

What about story length?

ggplot(Crowdfunding_story, aes(story_length, pitch_contributors)) + 
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(title = "Length of Story vs Total Number of Contributors",
       x = "(Log) Length of Story (in characters)",
       y = "(Log) Total Number of Contributors")

The relationship seems to be a negative one, albeit less apparent. The correlation between story length and pitch_contributors is -0.1609324. Apparently, succinct stories are more attractive, which is reasonable.

Let us turn to the number of tags associated with a story. We assume that the number of tags is known when the pitch is made.

ggplot(Crowdfunding_story, aes(tags, pitch_contributors)) + 
  geom_bar(stat = 'identity') +
  labs(title = "Number of Story Tags vs Total Number of Contributors",
       x = "(Log) Number of Story Tags",
       y = "(Log) Total Number of Contributors")

The correlation between the tags and pitch_contributors is quite low at 0.0509702. The sign is positive, which matches expectations; the more tags a story has, the wider its likely appeal, which should make it more attractive.

Next, we consider pitch_views. A priori, we would expect this to be positively correlated with pitch_contributors. The more people view a pitch, the more likely it will be funded.

ggplot(Crowdfunding_story, aes(pitch_views, pitch_contributors)) + 
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(title = "Number of Visitors to pitch URL vs Total Number of Contributors",
       x = "(Log) Number of Visitors to pitch URL",
       y = "(Log) Total Number of Contributors")

As expected, a strong positive relationship exists between pitch_views and pitch_contributors. The correlation is a high 0.5952077.

Let us now take a look at the Automated Readability Index value for the story text. Presumably, a story can only be given a value after it has been written and not when it was pitched, but we proceed anyhow.

ggplot(Crowdfunding_story, aes(arindex, pitch_contributors)) + 
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(title = "Automated Readability Index Value For Story vs Total Number of Contributors",
       x = "Automated Readability Index Value",
       y = "(Log) Total Number of Contributors")

There appears to be a positive relationship between the Automated Readability Index value and pitch_contributors. The correlation is 0.3331197.

Next, we turn our attention to pitch_budget. We would expect this to strongly influence the number of contributors.

ggplot(Crowdfunding_story, aes(pitch_budget, pitch_contributors)) + 
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(title = "Funding Target of Story vs Total Number of Contributors",
       x = "(Log) Funding Target of Story (in thousands)",
       y = "(Log) Total Number of Contributors")

The relationship is a strong positive one, with a high correlation of 0.6440522.

Let’s see whether having a video affects pitch_contributors.

ggplot(Crowdfunding_story, aes(video, pitch_contributors, group = video)) + 
  geom_boxplot() +
  labs(title = "Inclusion of Video vs Total Number of Contributors",
       x = "Video (0 = No, 1 = Yes)",
       y = "(Log) Total Number of Contributors")

The median is roughly the same, although the variability is higher for stories without a video, which is not surprising since there are not many stories with a video.

Last but not least, let’s see how the average Google Search trends for pitch keywords influence pitch_contributors.

ggplot(Crowdfunding_story, aes(insights, pitch_contributors)) + 
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(title = "Google Search Trends for Pitch Keywords vs Total Number of Contributors",
       x = "Google Search Trends for Pitch Keywords (0-100, relative to Peak)",
       y = "(Log) Total Number of Contributors")

The relationship appears negative, with a correlation of -0.333706, which is somewhat counter-intuitive at first glance as one would have thought more searches would point to a pitch being more attractive. But it could also point to a story’s lack of clarity and focus such that people would have to make more searches.


Let us now turn to MODEL SELECTION to tease out the factors determining the attractiveness of a pitch. To avoid imposing any prior assumptions about the model structure, we begin with backward stepwise regression, which allows us to start with all the variables and progressively eliminate variables using the AIC model fit criterion. Linear regression is appropriate here since the dependent variable is continuous.

# direction is an argument of step(), not of lm()
step(lm(pitch_contributors ~ ., data = Crowdfunding_story), direction = "backward")
## Start:  AIC=-7.63
## pitch_contributors ~ story_id + abs_week_start + abs_week_end + 
##     rel_week + story_readtime + story_views + story_funding_duration + 
##     story_length + focal_page + tags + pitch_views + arindex + 
##     pitch_budget + video + insights
## 
## 
## Step:  AIC=-7.63
## pitch_contributors ~ story_id + abs_week_start + abs_week_end + 
##     story_readtime + story_views + story_funding_duration + story_length + 
##     focal_page + tags + pitch_views + arindex + pitch_budget + 
##     video + insights
## 
## 
## Step:  AIC=-7.63
## pitch_contributors ~ story_id + abs_week_start + story_readtime + 
##     story_views + story_funding_duration + story_length + focal_page + 
##     tags + pitch_views + arindex + pitch_budget + video + insights
## 
##                          Df Sum of Sq    RSS     AIC
## - focal_page              1   0.06310 14.188 -9.4661
## - story_readtime          1   0.08150 14.206 -9.4182
## - tags                    1   0.10350 14.228 -9.3609
## - video                   1   0.14702 14.272 -9.2479
## - insights                1   0.19133 14.316 -9.1332
## - story_views             1   0.27155 14.396 -8.9265
## <none>                                14.125 -7.6310
## - story_funding_duration  1   0.85041 14.975 -7.4678
## - pitch_views             1   1.10588 15.230 -6.8420
## - arindex                 1   1.27546 15.400 -6.4322
## - story_length            1   1.34580 15.470 -6.2636
## - abs_week_start          1   1.65573 15.780 -5.5297
## - story_id                1   1.79862 15.923 -5.1962
## - pitch_budget            1   2.51206 16.637 -3.5745
## 
## Step:  AIC=-9.47
## pitch_contributors ~ story_id + abs_week_start + story_readtime + 
##     story_views + story_funding_duration + story_length + tags + 
##     pitch_views + arindex + pitch_budget + video + insights
## 
##                          Df Sum of Sq    RSS      AIC
## - story_readtime          1   0.05720 14.245 -11.3172
## - tags                    1   0.10640 14.294 -11.1897
## - video                   1   0.13441 14.322 -11.1172
## - insights                1   0.18244 14.370 -10.9934
## - story_views             1   0.22822 14.416 -10.8757
## <none>                                14.188  -9.4661
## - story_funding_duration  1   1.09104 15.279  -8.7249
## - pitch_views             1   1.20930 15.397  -8.4396
## - arindex                 1   1.22631 15.414  -8.3988
## - story_length            1   1.28361 15.471  -8.2615
## - abs_week_start          1   2.25566 16.443  -6.0069
## - story_id                1   2.28308 16.471  -5.9452
## - pitch_budget            1   2.52273 16.710  -5.4108
## 
## Step:  AIC=-11.32
## pitch_contributors ~ story_id + abs_week_start + story_views + 
##     story_funding_duration + story_length + tags + pitch_views + 
##     arindex + pitch_budget + video + insights
## 
##                          Df Sum of Sq    RSS      AIC
## - tags                    1   0.08358 14.329 -13.1008
## - video                   1   0.12207 14.367 -13.0015
## - insights                1   0.21691 14.462 -12.7581
## <none>                                14.245 -11.3172
## - story_views             1   0.87959 15.124 -11.1003
## - story_funding_duration  1   1.15474 15.400 -10.4333
## - story_length            1   1.32804 15.573 -10.0192
## - pitch_views             1   1.32943 15.574 -10.0159
## - arindex                 1   1.33279 15.578 -10.0079
## - abs_week_start          1   2.81526 17.060  -6.6444
## - story_id                1   2.82484 17.070  -6.6236
## - pitch_budget            1   2.85555 17.100  -6.5571
## 
## Step:  AIC=-13.1
## pitch_contributors ~ story_id + abs_week_start + story_views + 
##     story_funding_duration + story_length + pitch_views + arindex + 
##     pitch_budget + video + insights
## 
##                          Df Sum of Sq    RSS      AIC
## - video                   1   0.09320 14.422 -14.8609
## - insights                1   0.19996 14.528 -14.5880
## <none>                                14.329 -13.1008
## - story_views             1   0.81889 15.147 -13.0444
## - story_funding_duration  1   1.13853 15.467 -12.2718
## - story_length            1   1.24618 15.575 -12.0151
## - arindex                 1   1.39770 15.726 -11.6569
## - pitch_views             1   1.45175 15.780 -11.5300
## - abs_week_start          1   2.75174 17.080  -8.6009
## - pitch_budget            1   2.77573 17.104  -8.5490
## - story_id                1   2.84524 17.174  -8.3989
## 
## Step:  AIC=-14.86
## pitch_contributors ~ story_id + abs_week_start + story_views + 
##     story_funding_duration + story_length + pitch_views + arindex + 
##     pitch_budget + insights
## 
##                          Df Sum of Sq    RSS      AIC
## - insights                1   0.23550 14.657 -16.2616
## - story_views             1   0.73715 15.159 -15.0164
## <none>                                14.422 -14.8609
## - story_funding_duration  1   1.14635 15.568 -14.0309
## - story_length            1   1.15652 15.578 -14.0067
## - arindex                 1   1.30459 15.726 -13.6567
## - pitch_views             1   1.87431 16.296 -12.3400
## - pitch_budget            1   2.68413 17.106 -10.5455
## - abs_week_start          1   2.72650 17.148 -10.4540
## - story_id                1   3.08187 17.503  -9.6951
## 
## Step:  AIC=-16.26
## pitch_contributors ~ story_id + abs_week_start + story_views + 
##     story_funding_duration + story_length + pitch_views + arindex + 
##     pitch_budget
## 
##                          Df Sum of Sq    RSS     AIC
## <none>                                14.657 -16.262
## - story_views             1    0.8569 15.514 -16.159
## - story_length            1    1.2373 15.895 -15.263
## - story_funding_duration  1    1.3451 16.002 -15.013
## - arindex                 1    1.6948 16.352 -14.213
## - pitch_views             1    1.9083 16.565 -13.733
## - abs_week_start          1    2.9561 17.613 -11.464
## - pitch_budget            1    3.0257 17.683 -11.318
## - story_id                1    3.2800 17.937 -10.790
## 
## Call:
## lm(formula = pitch_contributors ~ story_id + abs_week_start + 
##     story_views + story_funding_duration + story_length + pitch_views + 
##     arindex + pitch_budget, data = Crowdfunding_story, direction = "backward")
## 
## Coefficients:
##            (Intercept)                story_id          abs_week_start  
##               11.96662                -0.01251                 0.05074  
##            story_views  story_funding_duration            story_length  
##                0.17841                -0.34716                -0.24853  
##            pitch_views                 arindex            pitch_budget  
##                0.33678                 0.08225                 0.40518

The regression suggests the following independent variables: story_id, abs_week_start, story_views, story_funding_duration, story_length, pitch_views, arindex and pitch_budget. Let us run the linear regression model to check the significance of the variables (dropping story_id since it serves as a unique identifier and should not affect pitch_contributors).

model_backward <- lm(pitch_contributors ~ abs_week_start + 
    story_views + story_funding_duration + story_length + pitch_views + 
    arindex + pitch_budget, data = Crowdfunding_story) 
summary(model_backward)
## 
## Call:
## lm(formula = pitch_contributors ~ abs_week_start + story_views + 
##     story_funding_duration + story_length + pitch_views + arindex + 
##     pitch_budget, data = Crowdfunding_story)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07204 -0.34683 -0.01779  0.30475  1.49957 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            -1.89405    1.88708  -1.004   0.3238  
## abs_week_start          0.00428    0.01147   0.373   0.7118  
## story_views             0.17354    0.15156   1.145   0.2616  
## story_funding_duration  0.09299    0.13743   0.677   0.5040  
## story_length           -0.18818    0.17375  -1.083   0.2877  
## pitch_views             0.30328    0.19118   1.586   0.1235  
## arindex                 0.06979    0.04939   1.413   0.1683  
## pitch_budget            0.47158    0.18091   2.607   0.0143 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7865 on 29 degrees of freedom
## Multiple R-squared:  0.6005, Adjusted R-squared:  0.5041 
## F-statistic: 6.228 on 7 and 29 DF,  p-value: 0.0001657

At the 5% level, the only significant variable is pitch_budget. The adjusted R-squared is moderate at 0.5041.
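As a quick check on the summary output, the adjusted R-squared can be recomputed from the multiple R-squared, the number of stories (n = 37) and the number of predictors (p = 7). The helper name `adj_r2` is introduced here for illustration; the formula is the standard one.

```r
# Adjusted R-squared penalises R-squared for the number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from summary(model_backward) above:
# Multiple R-squared 0.6005, 37 stories, 7 predictors (29 residual df).
adj_backward <- adj_r2(0.6005, n = 37, p = 7)
print(round(adj_backward, 4))
```

The gap between 0.6005 and the adjusted figure reflects how costly seven predictors are with only 37 observations.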

Let us also apply the forward stepwise regression, which starts with a null model and progressively adds variables until the next best variable does not decrease AIC.

null <- lm(pitch_contributors ~ 1, data = Crowdfunding_story)
full <- lm(pitch_contributors ~ ., data = Crowdfunding_story)
step(null, scope = list(lower = null, upper = full), direction = "forward")
## Start:  AIC=9.16
## pitch_contributors ~ 1
## 
##                          Df Sum of Sq    RSS     AIC
## + pitch_budget            1   18.6257 26.277 -8.6626
## + pitch_views             1   15.9077 28.995 -5.0207
## + story_funding_duration  1    7.3779 37.525  4.5208
## + insights                1    5.0003 39.902  6.7939
## + arindex                 1    4.9828 39.920  6.8102
## + story_views             1    4.0693 40.833  7.6473
## + story_readtime          1    3.2957 41.607  8.3418
## <none>                                44.902  9.1622
## + abs_week_start          1    2.2588 42.644  9.2525
## + rel_week                1    2.2588 42.644  9.2525
## + story_length            1    1.1629 43.739 10.1913
## + focal_page              1    1.0974 43.805 10.2467
## + story_id                1    0.9113 43.991 10.4036
## + tags                    1    0.1167 44.786 11.0660
## + video                   1    0.0088 44.894 11.1550
## 
## Step:  AIC=-8.66
## pitch_contributors ~ pitch_budget
## 
##                          Df Sum of Sq    RSS      AIC
## + pitch_views             1    4.9010 21.376 -14.3004
## + story_views             1    2.4471 23.830 -10.2795
## + arindex                 1    1.5181 24.759  -8.8644
## + insights                1    1.3894 24.887  -8.6726
## <none>                                26.277  -8.6626
## + story_readtime          1    1.2423 25.034  -8.4546
## + story_funding_duration  1    0.9524 25.324  -8.0286
## + rel_week                1    0.5671 25.710  -7.4698
## + abs_week_start          1    0.5671 25.710  -7.4698
## + story_length            1    0.3625 25.914  -7.1766
## + story_id                1    0.1318 26.145  -6.8487
## + focal_page              1    0.1316 26.145  -6.8484
## + video                   1    0.1313 26.145  -6.8479
## + tags                    1    0.0119 26.265  -6.6793
## 
## Step:  AIC=-14.3
## pitch_contributors ~ pitch_budget + pitch_views
## 
##                          Df Sum of Sq    RSS     AIC
## + arindex                 1   1.91731 19.459 -15.777
## + story_id                1   1.77831 19.598 -15.514
## <none>                                21.376 -14.300
## + insights                1   1.06523 20.311 -14.192
## + story_length            1   0.78222 20.594 -13.680
## + story_views             1   0.55776 20.818 -13.279
## + story_funding_duration  1   0.26428 21.111 -12.761
## + abs_week_start          1   0.18547 21.190 -12.623
## + rel_week                1   0.18547 21.190 -12.623
## + story_readtime          1   0.17614 21.200 -12.607
## + tags                    1   0.02752 21.348 -12.348
## + focal_page              1   0.01083 21.365 -12.319
## + video                   1   0.00080 21.375 -12.302
## 
## Step:  AIC=-15.78
## pitch_contributors ~ pitch_budget + pitch_views + arindex
## 
##                          Df Sum of Sq    RSS     AIC
## + story_id                1   1.07282 18.386 -15.876
## <none>                                19.459 -15.777
## + insights                1   0.47313 18.985 -14.688
## + story_funding_duration  1   0.39876 19.060 -14.544
## + story_views             1   0.37278 19.086 -14.493
## + story_length            1   0.31586 19.143 -14.383
## + focal_page              1   0.14206 19.316 -14.049
## + video                   1   0.11055 19.348 -13.988
## + story_readtime          1   0.10107 19.357 -13.970
## + abs_week_start          1   0.01834 19.440 -13.812
## + rel_week                1   0.01834 19.440 -13.812
## + tags                    1   0.00033 19.458 -13.778
## 
## Step:  AIC=-15.88
## pitch_contributors ~ pitch_budget + pitch_views + arindex + story_id
## 
##                          Df Sum of Sq    RSS     AIC
## + abs_week_start          1   1.18258 17.203 -16.336
## + rel_week                1   1.18258 17.203 -16.336
## <none>                                18.386 -15.876
## + insights                1   0.52057 17.865 -14.938
## + story_length            1   0.26197 18.124 -14.407
## + focal_page              1   0.22419 18.162 -14.330
## + story_views             1   0.18223 18.203 -14.244
## + story_funding_duration  1   0.05334 18.332 -13.983
## + story_readtime          1   0.03025 18.355 -13.937
## + tags                    1   0.01383 18.372 -13.904
## + video                   1   0.00120 18.384 -13.878
## 
## Step:  AIC=-16.34
## pitch_contributors ~ pitch_budget + pitch_views + arindex + story_id + 
##     abs_week_start
## 
##                          Df Sum of Sq    RSS     AIC
## + story_funding_duration  1   1.00188 16.201 -16.556
## <none>                                17.203 -16.336
## + insights                1   0.57281 16.630 -15.589
## + story_length            1   0.44339 16.760 -15.302
## + story_views             1   0.31953 16.884 -15.029
## + story_readtime          1   0.18160 17.021 -14.728
## + focal_page              1   0.04039 17.163 -14.423
## + tags                    1   0.00328 17.200 -14.343
## + video                   1   0.00027 17.203 -14.336
## 
## Step:  AIC=-16.56
## pitch_contributors ~ pitch_budget + pitch_views + arindex + story_id + 
##     abs_week_start + story_funding_duration
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        16.201 -16.556
## + story_length    1   0.68713 15.514 -16.159
## + insights        1   0.38138 15.820 -15.437
## + story_views     1   0.30671 15.895 -15.263
## + story_readtime  1   0.19455 16.007 -15.003
## + focal_page      1   0.00894 16.192 -14.576
## + video           1   0.00384 16.197 -14.565
## + tags            1   0.00178 16.199 -14.560
## 
## Call:
## lm(formula = pitch_contributors ~ pitch_budget + pitch_views + 
##     arindex + story_id + abs_week_start + story_funding_duration, 
##     data = Crowdfunding_story)
## 
## Coefficients:
##            (Intercept)            pitch_budget             pitch_views  
##                8.99460                 0.40182                 0.43347  
##                arindex                story_id          abs_week_start  
##                0.09471                -0.01153                 0.04170  
## story_funding_duration  
##               -0.29573
model_forward <- lm(pitch_contributors ~ pitch_budget + pitch_views + 
    arindex + abs_week_start + story_funding_duration, data = Crowdfunding_story) 
summary(model_forward)
## 
## Call:
## lm(formula = pitch_contributors ~ pitch_budget + pitch_views + 
##     arindex + abs_week_start + story_funding_duration, data = Crowdfunding_story)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98695 -0.35855 -0.00198  0.32649  1.63245 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)   
## (Intercept)            -3.2395511  0.9502579  -3.409  0.00183 **
## pitch_budget            0.4589356  0.1787402   2.568  0.01528 * 
## pitch_views             0.4015112  0.1687998   2.379  0.02372 * 
## arindex                 0.0802402  0.0483398   1.660  0.10701   
## abs_week_start         -0.0005177  0.0108671  -0.048  0.96231   
## story_funding_duration  0.1076388  0.1365852   0.788  0.43664   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7841 on 31 degrees of freedom
## Multiple R-squared:  0.5756, Adjusted R-squared:  0.5071 
## F-statistic: 8.408 on 5 and 31 DF,  p-value: 4.095e-05

At the 5% level, the significant variables are pitch_budget and pitch_views. The adjusted R-squared remains moderate at 0.5071.
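The AIC values printed by step() can be reproduced by hand from the RSS column. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients including the intercept. The sketch below applies this to the first two rows of the forward search; the helper name `step_aic` is an assumption for illustration, and small differences are expected because the printed RSS values are rounded.

```r
# AIC as used by step() for lm fits (up to an additive constant).
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

n_stories <- 37

# Null model: intercept only, RSS = 44.902 (printed AIC: 9.1622).
aic_null <- step_aic(44.902, n = n_stories, edf = 1)

# After adding pitch_budget: RSS = 26.277 (printed AIC: -8.6626).
aic_budget <- step_aic(26.277, n = n_stories, edf = 2)

print(round(c(null = aic_null, pitch_budget = aic_budget), 3))
```

This also makes clear why a variable is only added when it reduces the RSS enough to offset the penalty of 2 per extra coefficient.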

We triangulate the regression results using the leaps package, which performs an exhaustive search for the best subsets of the variables for predicting pitch_contributors in linear regression.

library(leaps)
leaps = regsubsets(pitch_contributors ~ ., data = Crowdfunding, nbest = 1)
plot(leaps, scale = "r2")

The “best” subset (the most parsimonious model without decreasing r-squared) selects story_funding_duration, tags, pitch_views, arindex, pitch_budget and insights as the independent variables. Let us run the regression on these variables.

model_leaps <- lm(pitch_contributors ~ story_funding_duration + tags + pitch_views + arindex + pitch_budget + insights, data = Crowdfunding_story) 
summary(model_leaps)
## 
## Call:
## lm(formula = pitch_contributors ~ story_funding_duration + tags + 
##     pitch_views + arindex + pitch_budget + insights, data = Crowdfunding_story)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.94203 -0.43889  0.05399  0.40341  1.44719 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            -2.522279   1.269562  -1.987   0.0562 .
## story_funding_duration  0.121315   0.132477   0.916   0.3671  
## tags                   -0.022830   0.311103  -0.073   0.9420  
## pitch_views             0.381478   0.148095   2.576   0.0152 *
## arindex                 0.068968   0.046580   1.481   0.1491  
## pitch_budget            0.431857   0.181406   2.381   0.0238 *
## insights               -0.007212   0.007348  -0.981   0.3342  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7846 on 30 degrees of freedom
## Multiple R-squared:  0.5887, Adjusted R-squared:  0.5065 
## F-statistic: 7.158 on 6 and 30 DF,  p-value: 8.372e-05

Again, only pitch_views and pitch_budget are significant at the 5% level, consistent with the forward stepwise regression.

The residuals from the model also appear to be fairly well-behaved.

plot(residuals(model_forward))


Our approach has taken us from DATA IMPORT to EXPLORATION to TRANSFORMATION to VISUALIZATION to MODEL SELECTION. We are now ready to INTERPRET the results.

Let us run the regression again using just pitch_budget and pitch_views.

model_chosen <- lm(pitch_contributors ~ pitch_budget + pitch_views, data = Crowdfunding_story) 
summary(model_chosen)
## 
## Call:
## lm(formula = pitch_contributors ~ pitch_budget + pitch_views, 
##     data = Crowdfunding_story)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2311 -0.3890  0.1162  0.3372  1.3734 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -2.4299     0.8595  -2.827  0.00781 **
## pitch_budget   0.5774     0.1658   3.481  0.00139 **
## pitch_views    0.4052     0.1451   2.792  0.00853 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7929 on 34 degrees of freedom
## Multiple R-squared:  0.524,  Adjusted R-squared:  0.4959 
## F-statistic: 18.71 on 2 and 34 DF,  p-value: 3.312e-06

The model is given by: (log)pitch_contributors = -2.4299 + 0.5774 x (log)pitch_budget + 0.4052 x (log)pitch_views

A 1% increase in the funding target of the associated story is associated with an approximately 0.6% increase in the number of contributors, holding pitch views constant. A larger funding target could indicate a story's importance and relevance, or the story could be pitched by a more established journalist with wider appeal and higher credibility, making it more attractive.

A 1% increase in the number of visitors to the pitch URL is associated with about a 0.4% increase in the number of contributors. This conforms with expectations as well: the more views a pitch receives, the more contributions it is likely to attract.
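The percentage readings of the log-log coefficients can be checked with a little arithmetic. This is a sketch using the rounded coefficients from the summary output; the helper `elasticity_effect` is a hypothetical name introduced only for illustration. The result does not depend on the base of the logarithm used in the transformation.

```r
# Percent change in the outcome implied by a log-log coefficient `beta`
# when the predictor rises by `pct` percent (other predictors held fixed).
elasticity_effect <- function(beta, pct = 1) {
  ((1 + pct / 100)^beta - 1) * 100
}

beta_budget <- 0.5774  # pitch_budget coefficient from model_chosen
beta_views  <- 0.4052  # pitch_views coefficient from model_chosen

effect_budget_1  <- elasticity_effect(beta_budget, 1)
effect_views_1   <- elasticity_effect(beta_views, 1)
effect_budget_10 <- elasticity_effect(beta_budget, 10)

print(round(c(budget_1pct  = effect_budget_1,
              views_1pct   = effect_views_1,
              budget_10pct = effect_budget_10), 3))
```

For small changes the effect is close to the coefficient itself, which is why a 1% change reads as roughly 0.6% and 0.4%. For larger changes, such as 10%, the exact figure drifts slightly below ten times the coefficient.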


Part 2: Now, you want to build and estimate a regression model where the dependent variable is a measure of how much time people spent on reading a story or how many people actually read a story (story_readtime or story_views, respectively). What variables would you choose as independent variables? Show the output of the final model in STATA or R. Provide in a few sentences your main takeaways from the analyses.

Let the dependent variable be story_readtime. As above, we start with backward stepwise regression.

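# direction = "backward" is passed to lm() here, where it is ignored (with a warning);
# step() still performs backward elimination, its default when no scope is supplied.
step(lm(story_readtime ~ ., data = Crowdfunding_story, direction = "backward"))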
## Start:  AIC=-84.14
## story_readtime ~ story_id + abs_week_start + abs_week_end + rel_week + 
##     story_views + story_funding_duration + story_length + focal_page + 
##     pitch_contributors + tags + pitch_views + arindex + pitch_budget + 
##     video + insights
## 
## 
## Step:  AIC=-84.14
## story_readtime ~ story_id + abs_week_start + abs_week_end + story_views + 
##     story_funding_duration + story_length + focal_page + pitch_contributors + 
##     tags + pitch_views + arindex + pitch_budget + video + insights
## 
## 
## Step:  AIC=-84.14
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration + 
##     story_length + focal_page + pitch_contributors + tags + pitch_views + 
##     arindex + pitch_budget + video + insights
## 
##                          Df Sum of Sq     RSS     AIC
## - story_length            1    0.0000  1.7861 -86.142
## - pitch_contributors      1    0.0103  1.7964 -85.930
## - arindex                 1    0.0146  1.8007 -85.841
## - video                   1    0.0166  1.8027 -85.801
## - insights                1    0.0261  1.8122 -85.606
## - story_funding_duration  1    0.0324  1.8185 -85.478
## - pitch_views             1    0.0453  1.8314 -85.215
## - tags                    1    0.0541  1.8402 -85.039
## - focal_page              1    0.0812  1.8673 -84.497
## <none>                                 1.7861 -84.143
## - story_id                1    0.2052  1.9913 -82.118
## - abs_week_start          1    0.2248  2.0109 -81.757
## - pitch_budget            1    0.5067  2.2928 -76.902
## - story_views             1   23.9456 25.7317  12.562
## 
## Step:  AIC=-86.14
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration + 
##     focal_page + pitch_contributors + tags + pitch_views + arindex + 
##     pitch_budget + video + insights
## 
##                          Df Sum of Sq    RSS     AIC
## - pitch_contributors      1    0.0117  1.798 -87.901
## - arindex                 1    0.0146  1.801 -87.841
## - video                   1    0.0197  1.806 -87.736
## - insights                1    0.0262  1.812 -87.604
## - story_funding_duration  1    0.0329  1.819 -87.466
## - pitch_views             1    0.0468  1.833 -87.185
## - tags                    1    0.0572  1.843 -86.975
## - focal_page              1    0.0856  1.872 -86.411
## <none>                                 1.786 -86.142
## - story_id                1    0.2084  1.995 -84.059
## - abs_week_start          1    0.2394  2.025 -83.489
## - pitch_budget            1    0.5186  2.305 -78.711
## - story_views             1   31.2025 32.989  19.754
## 
## Step:  AIC=-87.9
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration + 
##     focal_page + tags + pitch_views + arindex + pitch_budget + 
##     video + insights
## 
##                          Df Sum of Sq    RSS     AIC
## - video                   1    0.0201  1.818 -89.490
## - arindex                 1    0.0254  1.823 -89.382
## - insights                1    0.0319  1.830 -89.250
## - story_funding_duration  1    0.0440  1.842 -89.007
## - tags                    1    0.0566  1.854 -88.754
## - pitch_views             1    0.0737  1.872 -88.414
## - focal_page              1    0.0862  1.884 -88.169
## <none>                                 1.798 -87.901
## - story_id                1    0.2822  2.080 -84.506
## - abs_week_start          1    0.3057  2.103 -84.091
## - pitch_budget            1    0.5387  2.337 -80.203
## - story_views             1   31.5245 33.322  18.126
## 
## Step:  AIC=-89.49
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration + 
##     focal_page + tags + pitch_views + arindex + pitch_budget + 
##     insights
## 
##                          Df Sum of Sq    RSS     AIC
## - insights                1     0.028  1.846 -90.918
## - story_funding_duration  1     0.041  1.859 -90.670
## - arindex                 1     0.043  1.861 -90.619
## - tags                    1     0.050  1.868 -90.476
## - pitch_views             1     0.059  1.877 -90.311
## - focal_page              1     0.088  1.906 -89.745
## <none>                                 1.818 -89.490
## - story_id                1     0.264  2.082 -86.472
## - abs_week_start          1     0.300  2.118 -85.843
## - pitch_budget            1     0.519  2.337 -82.197
## - story_views             1    31.716 33.533  16.360
## 
## Step:  AIC=-90.92
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration + 
##     focal_page + tags + pitch_views + arindex + pitch_budget
## 
##                          Df Sum of Sq    RSS     AIC
## - story_funding_duration  1     0.054  1.900 -91.851
## - tags                    1     0.058  1.904 -91.777
## - pitch_views             1     0.063  1.909 -91.680
## - arindex                 1     0.066  1.913 -91.609
## - focal_page              1     0.097  1.943 -91.019
## <none>                                 1.846 -90.918
## - story_id                1     0.287  2.133 -87.572
## - abs_week_start          1     0.328  2.174 -86.871
## - pitch_budget            1     0.493  2.340 -84.155
## - story_views             1    31.842 33.689  14.531
## 
## Step:  AIC=-91.85
## story_readtime ~ story_id + abs_week_start + story_views + focal_page + 
##     tags + pitch_views + arindex + pitch_budget
## 
##                  Df Sum of Sq    RSS     AIC
## - pitch_views     1     0.052  1.952 -92.856
## - arindex         1     0.059  1.960 -92.712
## - tags            1     0.063  1.963 -92.644
## - focal_page      1     0.065  1.965 -92.608
## <none>                         1.900 -91.851
## - story_id        1     0.369  2.269 -87.288
## - abs_week_start  1     0.418  2.318 -86.495
## - pitch_budget    1     0.533  2.434 -84.699
## - story_views     1    31.816 33.717  12.562
## 
## Step:  AIC=-92.86
## story_readtime ~ story_id + abs_week_start + story_views + focal_page + 
##     tags + arindex + pitch_budget
## 
##                  Df Sum of Sq    RSS     AIC
## - tags            1     0.056  2.008 -93.802
## - focal_page      1     0.061  2.013 -93.722
## - arindex         1     0.076  2.028 -93.436
## <none>                         1.952 -92.856
## - story_id        1     0.356  2.308 -88.653
## - pitch_budget    1     0.486  2.438 -86.628
## - abs_week_start  1     0.621  2.573 -84.642
## - story_views     1    38.430 40.382  17.236
## 
## Step:  AIC=-93.8
## story_readtime ~ story_id + abs_week_start + story_views + focal_page + 
##     arindex + pitch_budget
## 
##                  Df Sum of Sq    RSS     AIC
## - arindex         1     0.063  2.072 -94.656
## - focal_page      1     0.074  2.082 -94.471
## <none>                         2.008 -93.802
## - story_id        1     0.349  2.358 -89.871
## - pitch_budget    1     0.454  2.463 -88.257
## - abs_week_start  1     0.649  2.658 -85.438
## - story_views     1    38.373 40.382  15.236
## 
## Step:  AIC=-94.66
## story_readtime ~ story_id + abs_week_start + story_views + focal_page + 
##     pitch_budget
## 
##                  Df Sum of Sq    RSS     AIC
## - focal_page      1     0.111  2.182 -94.729
## <none>                         2.072 -94.656
## - story_id        1     0.338  2.410 -91.058
## - pitch_budget    1     0.392  2.464 -90.241
## - abs_week_start  1     0.591  2.663 -87.366
## - story_views     1    38.374 40.445  13.294
## 
## Step:  AIC=-94.73
## story_readtime ~ story_id + abs_week_start + story_views + pitch_budget
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                         2.182 -94.729
## - story_id        1     0.300  2.482 -91.963
## - pitch_budget    1     0.414  2.596 -90.306
## - abs_week_start  1     0.484  2.667 -89.312
## - story_views     1    38.358 40.540  11.381
## 
## Call:
## lm(formula = story_readtime ~ story_id + abs_week_start + story_views + 
##     pitch_budget, data = Crowdfunding_story, direction = "backward")
## 
## Coefficients:
##    (Intercept)        story_id  abs_week_start     story_views  
##       1.808821        0.002201       -0.010881        0.988757  
##   pitch_budget  
##       0.130066
model_backward2 <- lm(story_readtime ~ abs_week_start + story_views + pitch_budget, data = Crowdfunding_story) 
summary(model_backward2)
## 
## Call:
## lm(formula = story_readtime ~ abs_week_start + story_views + 
##     pitch_budget, data = Crowdfunding_story)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60426 -0.16132  0.06613  0.18145  0.39172 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.097579   0.329424  12.439 5.26e-14 ***
## abs_week_start -0.004552   0.002888  -1.576   0.1246    
## story_views     0.985151   0.043749  22.518  < 2e-16 ***
## pitch_budget    0.089763   0.051657   1.738   0.0916 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2743 on 33 degrees of freedom
## Multiple R-squared:  0.9407, Adjusted R-squared:  0.9354 
## F-statistic: 174.6 on 3 and 33 DF,  p-value: < 2.2e-16

At the 5% level, the only significant independent variable is story_views. This makes sense: the more visitors to the story URL, the more total time visitors spend there. The adjusted R-squared is high at 0.9354.

We cross-check the results using the forward stepwise regression.

null2 <- lm(story_readtime ~ 1, data = Crowdfunding_story)
full2 <- lm(story_readtime ~ ., data = Crowdfunding_story)
step(null2, scope = list(lower = null2, upper = full2), direction = "forward")
## Start:  AIC=6.59
## story_readtime ~ 1
## 
##                          Df Sum of Sq    RSS     AIC
## + story_views             1    39.054  2.835 -91.043
## + pitch_views             1     5.433 36.457   3.453
## + story_length            1     3.694 38.195   5.176
## + pitch_contributors      1     3.075 38.815   5.772
## <none>                                41.889   6.592
## + pitch_budget            1     1.154 40.735   7.559
## + insights                1     0.755 41.134   7.919
## + arindex                 1     0.309 41.580   8.318
## + story_funding_duration  1     0.112 41.778   8.493
## + focal_page              1     0.083 41.807   8.519
## + story_id                1     0.038 41.851   8.559
## + abs_week_start          1     0.019 41.871   8.576
## + rel_week                1     0.019 41.871   8.576
## + video                   1     0.001 41.888   8.591
## + tags                    1     0.001 41.888   8.591
## 
## Step:  AIC=-91.04
## story_readtime ~ story_views
## 
##                          Df Sum of Sq    RSS     AIC
## + pitch_budget            1  0.166171 2.6692 -91.278
## <none>                                2.8354 -91.043
## + abs_week_start          1  0.125883 2.7095 -90.723
## + rel_week                1  0.125883 2.7095 -90.723
## + story_funding_duration  1  0.087265 2.7481 -90.200
## + tags                    1  0.053715 2.7817 -89.751
## + pitch_views             1  0.030268 2.8051 -89.440
## + insights                1  0.020784 2.8146 -89.315
## + focal_page              1  0.018500 2.8169 -89.285
## + pitch_contributors      1  0.017978 2.8174 -89.278
## + video                   1  0.005334 2.8301 -89.113
## + story_length            1  0.002489 2.8329 -89.076
## + story_id                1  0.000038 2.8354 -89.044
## + arindex                 1  0.000001 2.8354 -89.043
## 
## Step:  AIC=-91.28
## story_readtime ~ story_views + pitch_budget
## 
##                          Df Sum of Sq    RSS     AIC
## + pitch_contributors      1  0.270252 2.3990 -93.227
## + story_funding_duration  1  0.263201 2.4060 -93.119
## + abs_week_start          1  0.186847 2.4824 -91.963
## + rel_week                1  0.186847 2.4824 -91.963
## + pitch_views             1  0.171962 2.4973 -91.742
## <none>                                2.6692 -91.278
## + tags                    1  0.075125 2.5941 -90.334
## + insights                1  0.062434 2.6068 -90.153
## + story_length            1  0.012722 2.6565 -89.454
## + arindex                 1  0.009343 2.6599 -89.407
## + focal_page              1  0.005252 2.6640 -89.351
## + story_id                1  0.002450 2.6668 -89.312
## + video                   1  0.000888 2.6684 -89.290
## 
## Step:  AIC=-93.23
## story_readtime ~ story_views + pitch_budget + pitch_contributors
## 
##                          Df Sum of Sq    RSS     AIC
## + story_funding_duration  1  0.181583 2.2174 -94.140
## + abs_week_start          1  0.128732 2.2702 -93.268
## + rel_week                1  0.128732 2.2702 -93.268
## <none>                                2.3990 -93.227
## + tags                    1  0.084684 2.3143 -92.557
## + pitch_views             1  0.060486 2.3385 -92.172
## + insights                1  0.022387 2.3766 -91.574
## + focal_page              1  0.012139 2.3868 -91.415
## + arindex                 1  0.000605 2.3984 -91.237
## + story_id                1  0.000182 2.3988 -91.230
## + story_length            1  0.000110 2.3989 -91.229
## + video                   1  0.000034 2.3990 -91.228
## 
## Step:  AIC=-94.14
## story_readtime ~ story_views + pitch_budget + pitch_contributors + 
##     story_funding_duration
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        2.2174 -94.140
## + tags            1  0.081449 2.1359 -93.524
## + abs_week_start  1  0.053925 2.1635 -93.050
## + rel_week        1  0.053925 2.1635 -93.050
## + insights        1  0.050862 2.1665 -92.998
## + pitch_views     1  0.033097 2.1843 -92.696
## + focal_page      1  0.017020 2.2004 -92.425
## + story_id        1  0.008691 2.2087 -92.285
## + video           1  0.003661 2.2137 -92.201
## + arindex         1  0.001522 2.2159 -92.165
## + story_length    1  0.000309 2.2171 -92.145
## 
## Call:
## lm(formula = story_readtime ~ story_views + pitch_budget + pitch_contributors + 
##     story_funding_duration, data = Crowdfunding_story)
## 
## Coefficients:
##            (Intercept)             story_views            pitch_budget  
##                3.94359                 1.01082                 0.18181  
##     pitch_contributors  story_funding_duration  
##               -0.09046                -0.07084
model_forward2 <- lm(story_readtime ~ story_views + pitch_budget + pitch_contributors + 
    story_funding_duration, data = Crowdfunding_story) 
summary(model_forward2)
## 
## Call:
## lm(formula = story_readtime ~ story_views + pitch_budget + pitch_contributors + 
##     story_funding_duration, data = Crowdfunding_story)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73039 -0.09914  0.02503  0.17985  0.44545 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.94359    0.34809  11.329 9.83e-13 ***
## story_views             1.01082    0.04409  22.928  < 2e-16 ***
## pitch_budget            0.18181    0.06572   2.767  0.00933 ** 
## pitch_contributors     -0.09046    0.05483  -1.650  0.10874    
## story_funding_duration -0.07084    0.04376  -1.619  0.11531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2632 on 32 degrees of freedom
## Multiple R-squared:  0.9471, Adjusted R-squared:  0.9404 
## F-statistic: 143.1 on 4 and 32 DF,  p-value: < 2.2e-16

The significant variables are story_views (consistent with backward stepwise regression, with the same sign) and pitch_budget. The adjusted R-squared is high at 0.9404.
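The AIC values in the stepwise traces above can be reproduced from the reported RSS values. The sketch below assumes the formula used by `extractAIC` for linear models, n * log(RSS / n) + 2 * k, and takes n = 37 from the residual degrees of freedom; the helper `step_aic` is a hypothetical name used for illustration.

```r
# step() ranks linear models by n * log(RSS / n) + 2 * k, where k is the
# number of estimated coefficients, intercept included (see ?extractAIC).
step_aic <- function(rss, n, k) {
  n * log(rss / n) + 2 * k
}

n_stories <- 37  # 32 residual df + 5 coefficients in model_forward2

aic_null    <- step_aic(rss = 41.889, n = n_stories, k = 1)  # story_readtime ~ 1
aic_views   <- step_aic(rss = 2.8354, n = n_stories, k = 2)  # + story_views
aic_forward <- step_aic(rss = 2.2174, n = n_stories, k = 5)  # final forward model

print(round(c(null = aic_null, views = aic_views, forward = aic_forward), 2))
```

Each added variable costs 2 AIC points, so a variable is only kept when it lowers n * log(RSS / n) by more than that. This explains why story_views, which cuts the RSS from 41.889 to 2.8354, dominates every step.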

We again triangulate the regression results using the leaps package.

leaps2 = regsubsets(story_readtime ~ ., data = Crowdfunding, nbest = 1)
plot(leaps2, scale = "r2")

The ‘best’ subset comprises story_views and pitch_budget, consistent with the forward stepwise regression. From the model we see that a 1% increase in story_views is associated with an approximately 1% increase in story_readtime, while a 1% increase in pitch_budget is associated with about a 0.2% increase in story_readtime.


Part 3: Note that the dependent variable above is a count of number of seconds people took to read a story or a count of number of views for a given story. Would estimating an OLS regression, even with panel data models, give you precise estimates? If not, what kind of model would you want to run to get estimates that are more precise and less biased?

Since the dependent variable in part 2 is measured in terms of count values, estimating an OLS regression, even with panel data models, would not give precise estimates. This is essentially because OLS regression assumes that true values are normally distributed around the expected value and can take any real value, which is not the case for count values. A more appropriate model would be a Poisson or Negative Binomial model.

Let us run a Poisson regression using the variables chosen by the forward stepwise regression.

model_poisson <- glm(story_readtime ~ story_views + pitch_budget + pitch_contributors + story_funding_duration, data = Crowdfunding, family = poisson)
summary(model_poisson)
## 
## Call:
## glm(formula = story_readtime ~ story_views + pitch_budget + pitch_contributors + 
##     story_funding_duration, family = poisson, data = Crowdfunding)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -181.57   -32.73   -29.91    -7.72   398.33  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            6.216e+00  1.488e-03 4176.96   <2e-16 ***
## story_views            5.554e-03  4.318e-06 1286.11   <2e-16 ***
## pitch_budget           2.657e-07  8.381e-09   31.70   <2e-16 ***
## pitch_contributors     3.127e-03  5.447e-05   57.41   <2e-16 ***
## story_funding_duration 2.391e-04  1.282e-05   18.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6209386  on 1847  degrees of freedom
## Residual deviance: 3274157  on 1843  degrees of freedom
## AIC: 3282374
## 
## Number of Fisher Scoring iterations: 7

While all the variables are highly significant, the mean of story_readtime (799.8425325) is far smaller than its variance (18344073). The Poisson model assumes the two are equal, so the data are heavily over-dispersed, the standard errors above are understated, and the model does not fit well.
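The degree of over-dispersion can be gauged from figures already reported. This is a rough sketch that hard-codes the mean, variance, residual deviance and degrees of freedom quoted above. The variance-to-mean ratio is a marginal check only; the deviance ratio is the more direct indicator for the fitted model.

```r
# Summary figures quoted in the text and in the Poisson output above.
mean_readtime <- 799.8425325
var_readtime  <- 18344073
resid_dev     <- 3274157  # residual deviance of model_poisson
resid_df      <- 1843     # residual degrees of freedom

# Under a well-fitting Poisson model both ratios should be close to 1.
var_to_mean    <- var_readtime / mean_readtime
deviance_ratio <- resid_dev / resid_df

print(round(c(var_to_mean = var_to_mean, deviance_ratio = deviance_ratio), 1))

# With the fitted object available, the Pearson version would be:
# sum(residuals(model_poisson, type = "pearson")^2) / df.residual(model_poisson)
```

Both ratios are several orders of magnitude above 1, which supports moving to a model with a separate dispersion parameter.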

A better model would be the Negative Binomial Regression, which relaxes the assumption that the variance be equal to the mean. Let us try fitting it.

library(MASS)  # provides glm.nb()
model_nb <- glm.nb(story_readtime ~ story_views + pitch_budget + pitch_contributors + 
    story_funding_duration, data = Crowdfunding)
summary(model_nb)
## 
## Call:
## glm.nb(formula = story_readtime ~ story_views + pitch_budget + 
##     pitch_contributors + story_funding_duration, data = Crowdfunding, 
##     init.theta = 0.147262933, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2500  -1.4268  -0.5946   0.0395   2.4646  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             4.909e+00  9.808e-02  50.052   <2e-16 ***
## story_views             7.152e-02  1.538e-03  46.515   <2e-16 ***
## pitch_budget           -1.834e-07  6.706e-07  -0.274    0.784    
## pitch_contributors      5.586e-03  3.999e-03   1.397    0.162    
## story_funding_duration  2.643e-04  8.492e-04   0.311    0.756    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1473) family taken to be 1)
## 
##     Null deviance: 2451.5  on 1847  degrees of freedom
## Residual deviance: 1943.2  on 1843  degrees of freedom
## AIC: 19351
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.14726 
##           Std. Err.:  0.00495 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -19339.05500

Note that only story_views is significant at the 5% level now. A one-unit increase in story_views is associated with a 0.07 increase in log(story_readtime), i.e. the expected time visitors spend at the story URL is multiplied by exp(0.0715) ≈ 1.07, an increase of about 7%.
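Because the negative binomial model uses a log link, exponentiating a coefficient gives a multiplicative effect on expected read time (an incidence rate ratio). The sketch below applies this to the reported story_views coefficient; the variable names are illustrative.

```r
# story_views coefficient copied from summary(model_nb).
coef_views <- 7.152e-02

# Incidence rate ratio: factor by which expected read time is multiplied
# for each additional story view, other predictors held fixed.
irr_views    <- exp(coef_views)
pct_per_view <- (irr_views - 1) * 100

# Effects compound multiplicatively: ten extra views multiply the
# expected read time by exp(10 * coefficient), not by 10 * irr_views.
irr_10_views <- exp(10 * coef_views)

print(round(c(irr = irr_views, pct = pct_per_view, irr_10 = irr_10_views), 3))

# With the fitted object available, exp(coef(model_nb)) gives all ratios at once.
```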