Last Updated on Sat Jun 10 23:03:19 2017
Published at http://rpubs.com/enghanwenalvin/283651
To approach this question, let us first IMPORT the dataset.
library(readxl)  # provides read_excel()
Crowdfunding <- read_excel("C:/Users/engha/Desktop/Alvin/NYU MSBA/Digital Marketing Analytics/Crowdfunding.xls")
Note: Please import the dataset from your relevant directory.
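As a minimal sketch of a more portable import, the path can be assembled with file.path() and checked before reading. The directory below is a hypothetical placeholder, not the author's actual location, and readxl is assumed to be installed.

```r
# Hypothetical location: change data_dir to wherever Crowdfunding.xls is stored.
data_dir  <- file.path("~", "data")
data_path <- file.path(data_dir, "Crowdfunding.xls")

if (file.exists(data_path)) {
  # read_excel() comes from the readxl package
  Crowdfunding <- readxl::read_excel(data_path)
} else {
  message("File not found at ", data_path, "; update data_dir before importing.")
}
```

Checking file.exists() first gives a readable message instead of a read error when the directory differs between machines.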
We begin with an EXPLORATORY first look at the dataset as needed.
head(Crowdfunding)
## # A tibble: 6 × 15
## story_id abs_week rel_week story_readtime story_views
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 860 19 1 1711 17
## 2 860 20 2 2171 24
## 3 860 21 3 1608 30
## 4 860 22 4 2411 29
## 5 860 23 5 53 6
## 6 860 24 6 215 9
## # ... with 10 more variables: story_funding_duration <dbl>,
## # story_length <dbl>, focal_page <dbl>, pitch_contributors <dbl>,
## # tags <dbl>, pitch_views <dbl>, arindex <dbl>, pitch_budget <dbl>,
## # video <dbl>, insights <dbl>
str(Crowdfunding)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1848 obs. of 15 variables:
## $ story_id : num 860 860 860 860 860 860 860 860 860 860 ...
## $ abs_week : num 19 20 21 22 23 24 25 26 27 28 ...
## $ rel_week : num 1 2 3 4 5 6 7 8 9 10 ...
## $ story_readtime : num 1711 2171 1608 2411 53 ...
## $ story_views : num 17 24 30 29 6 9 12 6 10 7 ...
## $ story_funding_duration: num 294 294 294 294 294 294 294 294 294 294 ...
## $ story_length : num 3620 3620 3620 3620 3620 3620 3620 3620 3620 3620 ...
## $ focal_page : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pitch_contributors : num 17 17 17 17 17 17 17 17 17 17 ...
## $ tags : num 4 4 4 4 4 4 4 4 4 4 ...
## $ pitch_views : num 355 355 355 355 355 355 355 355 355 355 ...
## $ arindex : num 11.6 11.6 11.6 11.6 11.6 ...
## $ pitch_budget : num 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 ...
## $ video : num 0 0 0 0 0 0 0 0 0 0 ...
## $ insights : num 77 73 73 66 66 56 56 52 65 55 ...
summary(Crowdfunding)
## story_id abs_week rel_week story_readtime
## Min. : 860 Min. : 2.00 Min. : 1.00 Min. : 0.0
## 1st Qu.: 976 1st Qu.:34.00 1st Qu.:13.00 1st Qu.: 0.0
## Median :1007 Median :49.00 Median :26.00 Median : 34.0
## Mean :1004 Mean :47.01 Mean :27.99 Mean : 799.8
## 3rd Qu.:1048 3rd Qu.:62.00 3rd Qu.:41.00 3rd Qu.: 437.0
## Max. :1110 Max. :74.00 Max. :73.00 Max. :127455.0
## story_views story_funding_duration story_length focal_page
## Min. : 0.00 Min. : 4.00 Min. : 1853 Min. :0.0000
## 1st Qu.: 1.00 1st Qu.: 26.00 1st Qu.: 6324 1st Qu.:0.0000
## Median : 3.00 Median : 68.00 Median :12004 Median :0.0000
## Mean : 9.91 Mean : 86.62 Mean :13431 Mean :0.2511
## 3rd Qu.: 7.00 3rd Qu.:114.00 3rd Qu.:18304 3rd Qu.:1.0000
## Max. :960.00 Max. :300.00 Max. :72000 Max. :1.0000
## pitch_contributors tags pitch_views arindex
## Min. : 1.00 Min. : 2.000 Min. : 33.0 Min. : 5.80
## 1st Qu.: 6.00 1st Qu.: 3.000 1st Qu.: 167.0 1st Qu.:10.50
## Median : 11.00 Median : 4.000 Median : 366.0 Median :11.80
## Mean : 24.72 Mean : 4.429 Mean : 616.1 Mean :12.18
## 3rd Qu.: 26.00 3rd Qu.: 5.000 3rd Qu.: 728.0 3rd Qu.:13.20
## Max. :144.00 Max. :13.000 Max. :5461.0 Max. :19.70
## pitch_budget video insights
## Min. : 25000 Min. :0.00000 Min. : 0.00
## 1st Qu.: 50000 1st Qu.:0.00000 1st Qu.: 35.00
## Median : 80000 Median :0.00000 Median : 56.00
## Mean : 156216 Mean :0.05249 Mean : 51.73
## 3rd Qu.: 160000 3rd Qu.:0.00000 3rd Qu.: 72.00
## Max. :1000000 Max. :1.00000 Max. :100.00
We can make a few observations:
* The dataset contains 15 variables with 1,848 observations.
* All the variables are numeric.
* There is no missing data (which is good).
* Many variables repeat the same value across rows because the dataset holds one row per story per week (indexed by the absolute/relative weeks).
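The observations in the list above can also be checked programmatically. The sketch below uses only the first six rows of four columns, copied from the head() and str() output, so it illustrates the checks rather than reproducing them on the full dataset.

```r
# First six rows of four columns, copied from the head()/str() output above
sample_rows <- data.frame(
  story_id           = rep(860, 6),
  abs_week           = 19:24,
  story_views        = c(17, 24, 30, 29, 6, 9),
  pitch_contributors = rep(17, 6)
)

# Missing values per column; all zeros means nothing is missing
missing_per_column <- colSums(is.na(sample_rows))
print(missing_per_column)

# Distinct values per column within each story: a count of 1 marks a
# story-level variable that is simply repeated on every weekly row
distinct_per_story <- aggregate(. ~ story_id, data = sample_rows,
                                FUN = function(x) length(unique(x)))
print(distinct_per_story)
```

On the full data, the same two calls applied to Crowdfunding would separate the weekly variables from the story-level ones.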
Furthermore, some of the variables, such as pitch_contributors, appear fairly skewed, as we can see from this chart.
library(ggplot2)  # provides ggplot() and the geoms used below
ggplot(Crowdfunding, aes(pitch_contributors)) +
geom_histogram(binwidth = 10) +
labs(title = "Total Number of Contributors",
x = "Total Number of Contributors")
Clearly, most stories attract relatively few contributors.
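Skew can also be quantified rather than judged by eye. The sketch below defines a simple moment-based skewness helper (our own, not from any package) and applies it to a hypothetical vector of contributor counts chosen only to mimic a long right tail; the numbers are not taken from the dataset.

```r
# Moment-based sample skewness; positive values indicate a long right tail
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Hypothetical contributor counts (illustrative only)
contributors <- c(1, 3, 6, 6, 8, 11, 11, 14, 26, 40, 144)

skew_raw <- skewness(contributors)       # skew of the raw counts
skew_log <- skewness(log(contributors))  # skew after a log transform

print(c(raw = skew_raw, log = skew_log))
```

A much smaller value after taking logs is the usual motivation for log-transforming count-like variables before a linear regression.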
To make the dataset tidier for ease of exploration, we can TRANSFORM the data by grouping the stories together and taking log of those variables that exhibit significant skew.
library(dplyr)  # provides %>%, group_by() and summarise()
Crowdfunding_story <- Crowdfunding %>%
group_by(story_id) %>%
summarise(abs_week_start = min(abs_week),
abs_week_end = max(abs_week),
rel_week = n(),
story_readtime = log(sum(story_readtime)),
story_views = log(sum(story_views)),
story_funding_duration = log(max(story_funding_duration)),
story_length = log(max(story_length)),
focal_page = log(mean(focal_page)),
pitch_contributors = log(max(pitch_contributors)),
tags = log(max(tags)),
pitch_views = log(max(pitch_views)),
arindex = max(arindex),
pitch_budget = log(max(pitch_budget)/1000),
video = mean(video),
insights = mean(insights))
head(Crowdfunding_story)
## # A tibble: 6 × 16
## story_id abs_week_start abs_week_end rel_week story_readtime story_views
## <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 860 19 74 56 9.728837 5.634790
## 2 871 6 74 69 10.932017 6.539586
## 3 918 5 74 70 10.636793 6.095825
## 4 923 2 74 73 8.147578 3.891820
## 5 940 2 74 73 9.701004 5.187386
## 6 958 5 74 70 9.609385 5.129899
## # ... with 10 more variables: story_funding_duration <dbl>,
## # story_length <dbl>, focal_page <dbl>, pitch_contributors <dbl>,
## # tags <dbl>, pitch_views <dbl>, arindex <dbl>, pitch_budget <dbl>,
## # video <dbl>, insights <dbl>
dim(Crowdfunding_story)
## [1] 37 16
Note that we have transformed a few of the variables:
* abs_week has been decomposed into the start and end of the observation period (abs_week_start and abs_week_end respectively)
* story_readtime now shows the log total time visitors spent at the story URL for the entire observation period
* story_views now shows the log total number of visitors to the story URL for the entire observation period
* focal_page now shows the log proportion of time the story is on the first page
* pitch_budget has been scaled to be in thousands and in log form
* insights now shows the average Google Search Trends for pitch keywords for the entire observation period
* story_funding_duration, story_length, pitch_contributors, tags and pitch_views have also all been expressed in log form
We can see that there are 37 stories. Notably, we can also now easily see that regardless of when a story was first observed, its last observation always falls in week 74. This makes abs_week_end a constant, which can be ignored. Indeed, since rel_week counts the weeks from abs_week_start to abs_week_end, abs_week_start and rel_week are simply linear transformations of one another.
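The linear dependence can be verified directly on the six stories printed by head(Crowdfunding_story). The sketch below copies those values by hand; the relation rel_week = abs_week_end - abs_week_start + 1 is our reading of the output, not something stated in the data documentation.

```r
# Values copied from the head(Crowdfunding_story) output above
stories <- data.frame(
  story_id       = c(860, 871, 918, 923, 940, 958),
  abs_week_start = c(19, 6, 5, 2, 2, 5),
  abs_week_end   = rep(74, 6),
  rel_week       = c(56, 69, 70, 73, 73, 70)
)

# Number of observed weeks implied by the start and end weeks
implied_weeks <- stories$abs_week_end - stories$abs_week_start + 1

# TRUE if rel_week carries no information beyond abs_week_start
weeks_match <- all(implied_weeks == stories$rel_week)
print(weeks_match)
```

Variables that are exact linear combinations of one another cannot all be estimated in the same regression, so at most one of the three week variables is useful as a predictor.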
Another somewhat peculiar observation in the original dataset is that for some observations, even when there are visitors to the story URL, the recorded time spent can be 0 seconds. It could suggest that visitors who visit the URL but don’t click on the story are captured as not having spent any time. This is not an issue after the transformation, since read time is summed over all weeks of a story before the log is taken.
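To see why the order of operations matters, consider a hypothetical story with some zero-readtime weeks (the values below are made up for illustration). Logging each week separately produces -Inf, whereas logging the story total stays finite as long as at least one week is positive.

```r
# Hypothetical weekly read times for one story, including zero weeks
weekly_readtime <- c(1711, 0, 1608, 0, 53)

# Logging week by week yields -Inf wherever the read time is zero
log_weekly <- log(weekly_readtime)

# Logging the story total, as in the summarise() step, is finite here
log_total <- log(sum(weekly_readtime))

print(log_weekly)
print(log_total)
```

A story whose total is zero would still give -Inf, so it is worth checking the summed values before taking logs.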
Picking the Dependent Variable
Since we would like to understand the factors determining the attractiveness of a pitch, as measured by whether the story has convinced other members to contribute, the obvious dependent variable would be pitch_contributors, the total number of contributors who actually did fund the pitch. It is reasonable to assume that rational contributors would fund a pitch only if they found the pitch attractive! Another possible dependent variable would be insights, as one would expect potential contributors to search for pitch keywords if they thought the pitch was attractive. However, searches do not necessarily translate into actual contributions, hence we will use pitch_contributors as the dependent variable.
Let us VISUALIZE the data to aid our exploration further.
We begin with a simple scatter plot of the story funding duration and pitch_contributors. Note that we will not concern ourselves with story_id, abs_week, rel_week, story_readtime, story_views and focal_page since these can be regarded as ex-post variables; i.e. they are measured after the pitch has been funded.
ggplot(Crowdfunding_story, aes(story_funding_duration, pitch_contributors)) +
geom_point() +
geom_smooth(se = FALSE, method = 'lm') +
labs(title = "Number of Days Pitch Spent being Funded vs Total Number of Contributors",
x = "(Log) Number of Days Pitch Spent being Funded",
y = "(Log) Total Number of Contributors")
It appears that there is some positive relationship between the number of days a pitch spent being funded and the total number of contributors (correlation of 0.405352). This makes intuitive sense. An attractive pitch should get a longer airtime and get more contributions.
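The correlation figures quoted in this section can be obtained with cor(). The sketch below shows the call on a hypothetical stand-in data frame (the values are illustrative and will not reproduce 0.405352); on the real data the same call would use the columns of Crowdfunding_story.

```r
# Hypothetical stand-in for Crowdfunding_story (illustrative values only)
toy_story <- data.frame(
  story_funding_duration = log(c(4, 26, 68, 114, 300)),
  pitch_contributors     = log(c(6, 1, 26, 11, 144))
)

# Pearson correlation between the two logged variables
r <- cor(toy_story$story_funding_duration, toy_story$pitch_contributors)

# cor.test() adds a p-value and a confidence interval, which helps when
# the number of stories is small
ct <- cor.test(toy_story$story_funding_duration, toy_story$pitch_contributors)

print(r)
print(ct$conf.int)
```

With only 37 stories, the confidence interval around a correlation of about 0.4 is wide, so the sign is more informative than the exact magnitude.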
What about story length?
ggplot(Crowdfunding_story, aes(story_length, pitch_contributors)) +
geom_point() +
geom_smooth(se = FALSE, method = 'lm') +
labs(title = "Length of Story vs Total Number of Contributors",
x = "(Log) Length of Story (in characters)",
y = "(Log) Total Number of Contributors")
The relationship seems to be a negative one, albeit less apparent. The correlation between story length and pitch_contributors is -0.1609324. Apparently, succinct stories are more attractive, which is reasonable.
Let us turn to the number of tags associated with a story. We assume that the number of tags is known when the pitch is made.
ggplot(Crowdfunding_story, aes(tags, pitch_contributors)) +
geom_bar(stat = 'identity') +
labs(title = "Number of Story Tags vs Total Number of Contributors",
x = "(Log) Number of Story Tags",
y = "(Log) Total Number of Contributors")
The correlation between the tags and pitch_contributors is quite low at 0.0509702. The sign is positive, which matches expectations; the more tags a story has, the wider its likely appeal, which should make it more attractive.
Next, we consider pitch_views. A priori, we would expect this to be positively correlated with pitch_contributors. The more people view a pitch, the more likely it will be funded.
ggplot(Crowdfunding_story, aes(pitch_views, pitch_contributors)) +
geom_point() +
geom_smooth(se = FALSE, method = 'lm') +
labs(title = "Number of Visitors to pitch URL vs Total Number of Contributors",
x = "(Log) Number of Visitors to pitch URL",
y = "(Log) Total Number of Contributors")
As expected, a strong positive relationship exists between pitch_views and pitch_contributors. The correlation is a high 0.5952077.
Let us now take a look at the Automated Readability Index value for the story text. Presumably, a story can only be given a value after it has been written and not when it was pitched, but we proceed anyhow.
ggplot(Crowdfunding_story, aes(arindex, pitch_contributors)) +
geom_point() +
geom_smooth(se = FALSE, method = 'lm') +
labs(title = "Automated Readability Index Value For Story vs Total Number of Contributors",
x = "Automated Readability Index Value",
y = "(Log) Total Number of Contributors")
There appears to be a positive relationship between the Automated Readability Index value and pitch_contributors. The correlation is 0.3331197.
Next, we turn our attention to the pitch budget. We would expect this to strongly influence the number of contributors.
ggplot(Crowdfunding_story, aes(pitch_budget, pitch_contributors)) +
geom_point() +
geom_smooth(se = FALSE, method = 'lm') +
labs(title = "Funding Target of Story vs Total Number of Contributors",
x = "(Log) Funding Target of Story (in thousands)",
y = "(Log) Total Number of Contributors")
The relationship is a strong positive one, with a high correlation of 0.6440522.
Let’s see whether having a video affects pitch_contributors.
ggplot(Crowdfunding_story, aes(video, pitch_contributors, group = video)) +
geom_boxplot() +
labs(title = "Inclusion of Video vs Total Number of Contributors",
x = "Video (0 = No, 1 = Yes)",
y = "(Log) Total Number of Contributors")
The median is roughly the same, although the variability is higher for stories without a video, which is not surprising since there are not many stories with a video.
Last but not least, let’s see how the average Google Search trends for pitch keywords influence pitch_contributors.
ggplot(Crowdfunding_story, aes(insights, pitch_contributors)) +
geom_point() +
geom_smooth(se = FALSE, method = 'lm') +
labs(title = "Google Search Trends for Pitch Keywords vs Total Number of Contributors",
x = "Google Search Trends for Pitch Keywords (0-100, relative to Peak)",
y = "(Log) Total Number of Contributors")
The relationship appears negative, with a correlation of -0.333706, which is somewhat counter-intuitive at first glance as one would have thought more searches would point to a pitch being more attractive. But it could also point to a story’s lack of clarity and focus such that people would have to make more searches.
Let us now turn to MODEL SELECTION to tease out the factors determining the attractiveness of a pitch. To avoid imposing any prior assumptions about the model structure, we begin with backward stepwise regression, which allows us to start with all the variables and progressively eliminate the variables using the AIC model fit criterion. Linear regression is appropriate here since the dependent variable is continuous.
step(lm(pitch_contributors ~ ., data = Crowdfunding_story), direction = "backward")
## Start: AIC=-7.63
## pitch_contributors ~ story_id + abs_week_start + abs_week_end +
## rel_week + story_readtime + story_views + story_funding_duration +
## story_length + focal_page + tags + pitch_views + arindex +
## pitch_budget + video + insights
##
##
## Step: AIC=-7.63
## pitch_contributors ~ story_id + abs_week_start + abs_week_end +
## story_readtime + story_views + story_funding_duration + story_length +
## focal_page + tags + pitch_views + arindex + pitch_budget +
## video + insights
##
##
## Step: AIC=-7.63
## pitch_contributors ~ story_id + abs_week_start + story_readtime +
## story_views + story_funding_duration + story_length + focal_page +
## tags + pitch_views + arindex + pitch_budget + video + insights
##
## Df Sum of Sq RSS AIC
## - focal_page 1 0.06310 14.188 -9.4661
## - story_readtime 1 0.08150 14.206 -9.4182
## - tags 1 0.10350 14.228 -9.3609
## - video 1 0.14702 14.272 -9.2479
## - insights 1 0.19133 14.316 -9.1332
## - story_views 1 0.27155 14.396 -8.9265
## <none> 14.125 -7.6310
## - story_funding_duration 1 0.85041 14.975 -7.4678
## - pitch_views 1 1.10588 15.230 -6.8420
## - arindex 1 1.27546 15.400 -6.4322
## - story_length 1 1.34580 15.470 -6.2636
## - abs_week_start 1 1.65573 15.780 -5.5297
## - story_id 1 1.79862 15.923 -5.1962
## - pitch_budget 1 2.51206 16.637 -3.5745
##
## Step: AIC=-9.47
## pitch_contributors ~ story_id + abs_week_start + story_readtime +
## story_views + story_funding_duration + story_length + tags +
## pitch_views + arindex + pitch_budget + video + insights
##
## Df Sum of Sq RSS AIC
## - story_readtime 1 0.05720 14.245 -11.3172
## - tags 1 0.10640 14.294 -11.1897
## - video 1 0.13441 14.322 -11.1172
## - insights 1 0.18244 14.370 -10.9934
## - story_views 1 0.22822 14.416 -10.8757
## <none> 14.188 -9.4661
## - story_funding_duration 1 1.09104 15.279 -8.7249
## - pitch_views 1 1.20930 15.397 -8.4396
## - arindex 1 1.22631 15.414 -8.3988
## - story_length 1 1.28361 15.471 -8.2615
## - abs_week_start 1 2.25566 16.443 -6.0069
## - story_id 1 2.28308 16.471 -5.9452
## - pitch_budget 1 2.52273 16.710 -5.4108
##
## Step: AIC=-11.32
## pitch_contributors ~ story_id + abs_week_start + story_views +
## story_funding_duration + story_length + tags + pitch_views +
## arindex + pitch_budget + video + insights
##
## Df Sum of Sq RSS AIC
## - tags 1 0.08358 14.329 -13.1008
## - video 1 0.12207 14.367 -13.0015
## - insights 1 0.21691 14.462 -12.7581
## <none> 14.245 -11.3172
## - story_views 1 0.87959 15.124 -11.1003
## - story_funding_duration 1 1.15474 15.400 -10.4333
## - story_length 1 1.32804 15.573 -10.0192
## - pitch_views 1 1.32943 15.574 -10.0159
## - arindex 1 1.33279 15.578 -10.0079
## - abs_week_start 1 2.81526 17.060 -6.6444
## - story_id 1 2.82484 17.070 -6.6236
## - pitch_budget 1 2.85555 17.100 -6.5571
##
## Step: AIC=-13.1
## pitch_contributors ~ story_id + abs_week_start + story_views +
## story_funding_duration + story_length + pitch_views + arindex +
## pitch_budget + video + insights
##
## Df Sum of Sq RSS AIC
## - video 1 0.09320 14.422 -14.8609
## - insights 1 0.19996 14.528 -14.5880
## <none> 14.329 -13.1008
## - story_views 1 0.81889 15.147 -13.0444
## - story_funding_duration 1 1.13853 15.467 -12.2718
## - story_length 1 1.24618 15.575 -12.0151
## - arindex 1 1.39770 15.726 -11.6569
## - pitch_views 1 1.45175 15.780 -11.5300
## - abs_week_start 1 2.75174 17.080 -8.6009
## - pitch_budget 1 2.77573 17.104 -8.5490
## - story_id 1 2.84524 17.174 -8.3989
##
## Step: AIC=-14.86
## pitch_contributors ~ story_id + abs_week_start + story_views +
## story_funding_duration + story_length + pitch_views + arindex +
## pitch_budget + insights
##
## Df Sum of Sq RSS AIC
## - insights 1 0.23550 14.657 -16.2616
## - story_views 1 0.73715 15.159 -15.0164
## <none> 14.422 -14.8609
## - story_funding_duration 1 1.14635 15.568 -14.0309
## - story_length 1 1.15652 15.578 -14.0067
## - arindex 1 1.30459 15.726 -13.6567
## - pitch_views 1 1.87431 16.296 -12.3400
## - pitch_budget 1 2.68413 17.106 -10.5455
## - abs_week_start 1 2.72650 17.148 -10.4540
## - story_id 1 3.08187 17.503 -9.6951
##
## Step: AIC=-16.26
## pitch_contributors ~ story_id + abs_week_start + story_views +
## story_funding_duration + story_length + pitch_views + arindex +
## pitch_budget
##
## Df Sum of Sq RSS AIC
## <none> 14.657 -16.262
## - story_views 1 0.8569 15.514 -16.159
## - story_length 1 1.2373 15.895 -15.263
## - story_funding_duration 1 1.3451 16.002 -15.013
## - arindex 1 1.6948 16.352 -14.213
## - pitch_views 1 1.9083 16.565 -13.733
## - abs_week_start 1 2.9561 17.613 -11.464
## - pitch_budget 1 3.0257 17.683 -11.318
## - story_id 1 3.2800 17.937 -10.790
##
## Call:
## lm(formula = pitch_contributors ~ story_id + abs_week_start +
## story_views + story_funding_duration + story_length + pitch_views +
## arindex + pitch_budget, data = Crowdfunding_story, direction = "backward")
##
## Coefficients:
## (Intercept) story_id abs_week_start
## 11.96662 -0.01251 0.05074
## story_views story_funding_duration story_length
## 0.17841 -0.34716 -0.24853
## pitch_views arindex pitch_budget
## 0.33678 0.08225 0.40518
The regression suggests the following independent variables: story_id, abs_week_start, story_views, story_funding_duration, story_length, pitch_views, arindex and pitch_budget. Let us run the linear regression model to check the significance of the variables (dropping story_id since it serves as a unique identifier and should not affect pitch_contributors).
model_backward <- lm(pitch_contributors ~ abs_week_start +
story_views + story_funding_duration + story_length + pitch_views +
arindex + pitch_budget, data = Crowdfunding_story)
summary(model_backward)
##
## Call:
## lm(formula = pitch_contributors ~ abs_week_start + story_views +
## story_funding_duration + story_length + pitch_views + arindex +
## pitch_budget, data = Crowdfunding_story)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07204 -0.34683 -0.01779 0.30475 1.49957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.89405 1.88708 -1.004 0.3238
## abs_week_start 0.00428 0.01147 0.373 0.7118
## story_views 0.17354 0.15156 1.145 0.2616
## story_funding_duration 0.09299 0.13743 0.677 0.5040
## story_length -0.18818 0.17375 -1.083 0.2877
## pitch_views 0.30328 0.19118 1.586 0.1235
## arindex 0.06979 0.04939 1.413 0.1683
## pitch_budget 0.47158 0.18091 2.607 0.0143 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7865 on 29 degrees of freedom
## Multiple R-squared: 0.6005, Adjusted R-squared: 0.5041
## F-statistic: 6.228 on 7 and 29 DF, p-value: 0.0001657
At the 5% level, the only significant variable is pitch_budget. The adjusted R-squared is moderate at 0.5041.
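The adjusted R-squared can be recovered from the multiple R-squared, the number of stories and the number of predictors. The sketch below applies the standard formula to the figures printed in the summary above (R-squared 0.6005, 37 stories, 7 predictors); adj_r2 is a helper name of our own.

```r
# Adjusted R-squared: penalises R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Figures taken from summary(model_backward): 37 stories, 7 predictors
adj_backward <- adj_r2(r2 = 0.6005, n = 37, p = 7)
print(round(adj_backward, 4))

# Residual degrees of freedom reported by summary(): n - p - 1
residual_df <- 37 - 7 - 1
print(residual_df)
```

The gap between 0.6005 and the adjusted value reflects the cost of estimating seven coefficients from only 37 stories.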
Let us also apply the forward stepwise regression, which starts with a null model and progressively adds variables until the next best variable does not decrease AIC.
null <- lm(pitch_contributors ~ 1, data = Crowdfunding_story)
full <- lm(pitch_contributors ~ ., data = Crowdfunding_story)
step(null, scope = list(lower = null, upper = full), direction = "forward")
## Start: AIC=9.16
## pitch_contributors ~ 1
##
## Df Sum of Sq RSS AIC
## + pitch_budget 1 18.6257 26.277 -8.6626
## + pitch_views 1 15.9077 28.995 -5.0207
## + story_funding_duration 1 7.3779 37.525 4.5208
## + insights 1 5.0003 39.902 6.7939
## + arindex 1 4.9828 39.920 6.8102
## + story_views 1 4.0693 40.833 7.6473
## + story_readtime 1 3.2957 41.607 8.3418
## <none> 44.902 9.1622
## + abs_week_start 1 2.2588 42.644 9.2525
## + rel_week 1 2.2588 42.644 9.2525
## + story_length 1 1.1629 43.739 10.1913
## + focal_page 1 1.0974 43.805 10.2467
## + story_id 1 0.9113 43.991 10.4036
## + tags 1 0.1167 44.786 11.0660
## + video 1 0.0088 44.894 11.1550
##
## Step: AIC=-8.66
## pitch_contributors ~ pitch_budget
##
## Df Sum of Sq RSS AIC
## + pitch_views 1 4.9010 21.376 -14.3004
## + story_views 1 2.4471 23.830 -10.2795
## + arindex 1 1.5181 24.759 -8.8644
## + insights 1 1.3894 24.887 -8.6726
## <none> 26.277 -8.6626
## + story_readtime 1 1.2423 25.034 -8.4546
## + story_funding_duration 1 0.9524 25.324 -8.0286
## + rel_week 1 0.5671 25.710 -7.4698
## + abs_week_start 1 0.5671 25.710 -7.4698
## + story_length 1 0.3625 25.914 -7.1766
## + story_id 1 0.1318 26.145 -6.8487
## + focal_page 1 0.1316 26.145 -6.8484
## + video 1 0.1313 26.145 -6.8479
## + tags 1 0.0119 26.265 -6.6793
##
## Step: AIC=-14.3
## pitch_contributors ~ pitch_budget + pitch_views
##
## Df Sum of Sq RSS AIC
## + arindex 1 1.91731 19.459 -15.777
## + story_id 1 1.77831 19.598 -15.514
## <none> 21.376 -14.300
## + insights 1 1.06523 20.311 -14.192
## + story_length 1 0.78222 20.594 -13.680
## + story_views 1 0.55776 20.818 -13.279
## + story_funding_duration 1 0.26428 21.111 -12.761
## + abs_week_start 1 0.18547 21.190 -12.623
## + rel_week 1 0.18547 21.190 -12.623
## + story_readtime 1 0.17614 21.200 -12.607
## + tags 1 0.02752 21.348 -12.348
## + focal_page 1 0.01083 21.365 -12.319
## + video 1 0.00080 21.375 -12.302
##
## Step: AIC=-15.78
## pitch_contributors ~ pitch_budget + pitch_views + arindex
##
## Df Sum of Sq RSS AIC
## + story_id 1 1.07282 18.386 -15.876
## <none> 19.459 -15.777
## + insights 1 0.47313 18.985 -14.688
## + story_funding_duration 1 0.39876 19.060 -14.544
## + story_views 1 0.37278 19.086 -14.493
## + story_length 1 0.31586 19.143 -14.383
## + focal_page 1 0.14206 19.316 -14.049
## + video 1 0.11055 19.348 -13.988
## + story_readtime 1 0.10107 19.357 -13.970
## + abs_week_start 1 0.01834 19.440 -13.812
## + rel_week 1 0.01834 19.440 -13.812
## + tags 1 0.00033 19.458 -13.778
##
## Step: AIC=-15.88
## pitch_contributors ~ pitch_budget + pitch_views + arindex + story_id
##
## Df Sum of Sq RSS AIC
## + abs_week_start 1 1.18258 17.203 -16.336
## + rel_week 1 1.18258 17.203 -16.336
## <none> 18.386 -15.876
## + insights 1 0.52057 17.865 -14.938
## + story_length 1 0.26197 18.124 -14.407
## + focal_page 1 0.22419 18.162 -14.330
## + story_views 1 0.18223 18.203 -14.244
## + story_funding_duration 1 0.05334 18.332 -13.983
## + story_readtime 1 0.03025 18.355 -13.937
## + tags 1 0.01383 18.372 -13.904
## + video 1 0.00120 18.384 -13.878
##
## Step: AIC=-16.34
## pitch_contributors ~ pitch_budget + pitch_views + arindex + story_id +
## abs_week_start
##
## Df Sum of Sq RSS AIC
## + story_funding_duration 1 1.00188 16.201 -16.556
## <none> 17.203 -16.336
## + insights 1 0.57281 16.630 -15.589
## + story_length 1 0.44339 16.760 -15.302
## + story_views 1 0.31953 16.884 -15.029
## + story_readtime 1 0.18160 17.021 -14.728
## + focal_page 1 0.04039 17.163 -14.423
## + tags 1 0.00328 17.200 -14.343
## + video 1 0.00027 17.203 -14.336
##
## Step: AIC=-16.56
## pitch_contributors ~ pitch_budget + pitch_views + arindex + story_id +
## abs_week_start + story_funding_duration
##
## Df Sum of Sq RSS AIC
## <none> 16.201 -16.556
## + story_length 1 0.68713 15.514 -16.159
## + insights 1 0.38138 15.820 -15.437
## + story_views 1 0.30671 15.895 -15.263
## + story_readtime 1 0.19455 16.007 -15.003
## + focal_page 1 0.00894 16.192 -14.576
## + video 1 0.00384 16.197 -14.565
## + tags 1 0.00178 16.199 -14.560
##
## Call:
## lm(formula = pitch_contributors ~ pitch_budget + pitch_views +
## arindex + story_id + abs_week_start + story_funding_duration,
## data = Crowdfunding_story)
##
## Coefficients:
## (Intercept) pitch_budget pitch_views
## 8.99460 0.40182 0.43347
## arindex story_id abs_week_start
## 0.09471 -0.01153 0.04170
## story_funding_duration
## -0.29573
model_forward <- lm(pitch_contributors ~ pitch_budget + pitch_views +
arindex + abs_week_start + story_funding_duration, data = Crowdfunding_story)
summary(model_forward)
##
## Call:
## lm(formula = pitch_contributors ~ pitch_budget + pitch_views +
## arindex + abs_week_start + story_funding_duration, data = Crowdfunding_story)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98695 -0.35855 -0.00198 0.32649 1.63245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.2395511 0.9502579 -3.409 0.00183 **
## pitch_budget 0.4589356 0.1787402 2.568 0.01528 *
## pitch_views 0.4015112 0.1687998 2.379 0.02372 *
## arindex 0.0802402 0.0483398 1.660 0.10701
## abs_week_start -0.0005177 0.0108671 -0.048 0.96231
## story_funding_duration 0.1076388 0.1365852 0.788 0.43664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7841 on 31 degrees of freedom
## Multiple R-squared: 0.5756, Adjusted R-squared: 0.5071
## F-statistic: 8.408 on 5 and 31 DF, p-value: 4.095e-05
At the 5% level, the significant variables are pitch_budget and pitch_views. The adjusted R-squared remains moderate at 0.5071.
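The AIC values printed by step() for a linear model can be reproduced from the RSS column, which makes the selection criterion less of a black box. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * (number of coefficients). The sketch below plugs in RSS values copied from the forward stepwise output; step_aic is a helper name of our own.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * n_coef
step_aic <- function(rss, n, n_coef) {
  n * log(rss / n) + 2 * n_coef
}

n_stories <- 37

# RSS values copied from the forward stepwise output above
aic_null   <- step_aic(44.902, n_stories, n_coef = 1)  # intercept only
aic_budget <- step_aic(26.277, n_stories, n_coef = 2)  # + pitch_budget
aic_views  <- step_aic(21.376, n_stories, n_coef = 3)  # + pitch_views

print(round(c(aic_null, aic_budget, aic_views), 2))
```

Each added variable must reduce n * log(RSS / n) by more than 2 to lower the AIC, which is why the forward search stops once no remaining variable clears that bar.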
We triangulate the regression results using the leaps package, which performs an exhaustive search for the best subsets of the variables for predicting pitch_contributors in linear regression.
library(leaps)
leaps = regsubsets(pitch_contributors ~ ., data = Crowdfunding, nbest = 1)
plot(leaps, scale = "r2")
The “best” subset (the most parsimonious model without decreasing r-squared) selects story_funding_duration, tags, pitch_views, arindex, pitch_budget and insights as the independent variables. Let us run the regression on these variables.
model_leaps <- lm(pitch_contributors ~ story_funding_duration + tags + pitch_views + arindex + pitch_budget + insights, data = Crowdfunding_story)
summary(model_leaps)
##
## Call:
## lm(formula = pitch_contributors ~ story_funding_duration + tags +
## pitch_views + arindex + pitch_budget + insights, data = Crowdfunding_story)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.94203 -0.43889 0.05399 0.40341 1.44719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.522279 1.269562 -1.987 0.0562 .
## story_funding_duration 0.121315 0.132477 0.916 0.3671
## tags -0.022830 0.311103 -0.073 0.9420
## pitch_views 0.381478 0.148095 2.576 0.0152 *
## arindex 0.068968 0.046580 1.481 0.1491
## pitch_budget 0.431857 0.181406 2.381 0.0238 *
## insights -0.007212 0.007348 -0.981 0.3342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7846 on 30 degrees of freedom
## Multiple R-squared: 0.5887, Adjusted R-squared: 0.5065
## F-statistic: 7.158 on 6 and 30 DF, p-value: 8.372e-05
Again, only pitch_views and pitch_budget are significant at the 5% level, consistent with the forward stepwise regression.
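For readers who want to see where the significance stars come from, the t value and p-value of a coefficient can be recomputed from its estimate and standard error. The sketch below uses the pitch_views row of summary(model_leaps), with 30 residual degrees of freedom as reported there.

```r
# pitch_views row from summary(model_leaps)
estimate  <- 0.381478
std_error <- 0.148095
resid_df  <- 30

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = resid_df)

print(round(c(t = t_value, p = p_value), 4))
```

A p-value below 0.05 corresponds to the single star shown next to pitch_views in the output.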
The residuals from the model also appear to be fairly well-behaved.
plot(residuals(model_forward))
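An index plot of the residuals is a quick first check; a slightly more formal look could use standardised residuals and a normality test. The sketch below is only an illustration of those calls: it fits a model to R's built-in mtcars data as a stand-in, since the crowdfunding data is not bundled here, and its results say nothing about model_forward.

```r
# Stand-in model on a built-in dataset (not the crowdfunding data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardised residuals; values beyond +/-2 deserve a closer look
std_res <- rstandard(fit)
flagged <- which(abs(std_res) > 2)
print(flagged)

# Shapiro-Wilk test of normality on the raw residuals
normality <- shapiro.test(residuals(fit))
print(normality$p.value)
```

Replacing fit with model_forward would apply the same checks to the model discussed above.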
Our approach has taken us from DATA IMPORT to EXPLORATION to TRANSFORMATION to VISUALIZATION to MODEL SELECTION. We are now ready to INTERPRET the results.
Let us run the regression again using just pitch_budget and pitch_views.
model_chosen <- lm(pitch_contributors ~ pitch_budget + pitch_views, data = Crowdfunding_story)
summary(model_chosen)
##
## Call:
## lm(formula = pitch_contributors ~ pitch_budget + pitch_views,
## data = Crowdfunding_story)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2311 -0.3890 0.1162 0.3372 1.3734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4299 0.8595 -2.827 0.00781 **
## pitch_budget 0.5774 0.1658 3.481 0.00139 **
## pitch_views 0.4052 0.1451 2.792 0.00853 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7929 on 34 degrees of freedom
## Multiple R-squared: 0.524, Adjusted R-squared: 0.4959
## F-statistic: 18.71 on 2 and 34 DF, p-value: 3.312e-06
The model is given by: (log)pitch_contributors = -2.4299 + 0.5774 x (log)pitch_budget + 0.4052 x (log)pitch_views
A 1% increase in the funding target of the associated story is associated with an approximately 0.6% increase in the number of contributors, holding pitch views constant. A larger funding target could be indicative of a story's importance and relevance, or the story could be pitched by a more established journalist with wider appeal and higher credibility, hence making it more attractive.
A 1% increase in the number of visitors to the pitch URL is associated with about a 0.4% increase in the number of contributors, holding the funding target constant. This conforms with expectations as well: the more pitch views there are, the more likely there will be contributions.
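Because both sides of the model are in logs, the coefficients act as elasticities, and the percentage effects can be computed exactly rather than read off as approximations. The sketch below uses the two coefficients from summary(model_chosen); pct_change is a helper name of our own.

```r
# Exact percentage change in the outcome of a log-log model when a
# predictor rises by pct_increase percent, other predictors held fixed
pct_change <- function(elasticity, pct_increase) {
  ((1 + pct_increase / 100)^elasticity - 1) * 100
}

# Coefficients from summary(model_chosen)
budget_1pct  <- pct_change(0.5774, 1)   # 1% larger funding target
views_1pct   <- pct_change(0.4052, 1)   # 1% more pitch views
budget_10pct <- pct_change(0.5774, 10)  # 10% larger funding target

print(round(c(budget_1pct, views_1pct, budget_10pct), 3))
```

For small changes the result is close to the coefficient itself, while for larger changes such as 10% the exact figure falls somewhat below ten times the coefficient.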
Now let the dependent variable be story_readtime. As above, we start with backward stepwise regression.
step(lm(story_readtime ~ ., data = Crowdfunding_story), direction = "backward")
## Start: AIC=-84.14
## story_readtime ~ story_id + abs_week_start + abs_week_end + rel_week +
## story_views + story_funding_duration + story_length + focal_page +
## pitch_contributors + tags + pitch_views + arindex + pitch_budget +
## video + insights
##
##
## Step: AIC=-84.14
## story_readtime ~ story_id + abs_week_start + abs_week_end + story_views +
## story_funding_duration + story_length + focal_page + pitch_contributors +
## tags + pitch_views + arindex + pitch_budget + video + insights
##
##
## Step: AIC=-84.14
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration +
## story_length + focal_page + pitch_contributors + tags + pitch_views +
## arindex + pitch_budget + video + insights
##
## Df Sum of Sq RSS AIC
## - story_length 1 0.0000 1.7861 -86.142
## - pitch_contributors 1 0.0103 1.7964 -85.930
## - arindex 1 0.0146 1.8007 -85.841
## - video 1 0.0166 1.8027 -85.801
## - insights 1 0.0261 1.8122 -85.606
## - story_funding_duration 1 0.0324 1.8185 -85.478
## - pitch_views 1 0.0453 1.8314 -85.215
## - tags 1 0.0541 1.8402 -85.039
## - focal_page 1 0.0812 1.8673 -84.497
## <none> 1.7861 -84.143
## - story_id 1 0.2052 1.9913 -82.118
## - abs_week_start 1 0.2248 2.0109 -81.757
## - pitch_budget 1 0.5067 2.2928 -76.902
## - story_views 1 23.9456 25.7317 12.562
##
## Step: AIC=-86.14
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration +
## focal_page + pitch_contributors + tags + pitch_views + arindex +
## pitch_budget + video + insights
##
## Df Sum of Sq RSS AIC
## - pitch_contributors 1 0.0117 1.798 -87.901
## - arindex 1 0.0146 1.801 -87.841
## - video 1 0.0197 1.806 -87.736
## - insights 1 0.0262 1.812 -87.604
## - story_funding_duration 1 0.0329 1.819 -87.466
## - pitch_views 1 0.0468 1.833 -87.185
## - tags 1 0.0572 1.843 -86.975
## - focal_page 1 0.0856 1.872 -86.411
## <none> 1.786 -86.142
## - story_id 1 0.2084 1.995 -84.059
## - abs_week_start 1 0.2394 2.025 -83.489
## - pitch_budget 1 0.5186 2.305 -78.711
## - story_views 1 31.2025 32.989 19.754
##
## Step: AIC=-87.9
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration +
## focal_page + tags + pitch_views + arindex + pitch_budget +
## video + insights
##
## Df Sum of Sq RSS AIC
## - video 1 0.0201 1.818 -89.490
## - arindex 1 0.0254 1.823 -89.382
## - insights 1 0.0319 1.830 -89.250
## - story_funding_duration 1 0.0440 1.842 -89.007
## - tags 1 0.0566 1.854 -88.754
## - pitch_views 1 0.0737 1.872 -88.414
## - focal_page 1 0.0862 1.884 -88.169
## <none> 1.798 -87.901
## - story_id 1 0.2822 2.080 -84.506
## - abs_week_start 1 0.3057 2.103 -84.091
## - pitch_budget 1 0.5387 2.337 -80.203
## - story_views 1 31.5245 33.322 18.126
##
## Step: AIC=-89.49
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration +
## focal_page + tags + pitch_views + arindex + pitch_budget +
## insights
##
## Df Sum of Sq RSS AIC
## - insights 1 0.028 1.846 -90.918
## - story_funding_duration 1 0.041 1.859 -90.670
## - arindex 1 0.043 1.861 -90.619
## - tags 1 0.050 1.868 -90.476
## - pitch_views 1 0.059 1.877 -90.311
## - focal_page 1 0.088 1.906 -89.745
## <none> 1.818 -89.490
## - story_id 1 0.264 2.082 -86.472
## - abs_week_start 1 0.300 2.118 -85.843
## - pitch_budget 1 0.519 2.337 -82.197
## - story_views 1 31.716 33.533 16.360
##
## Step: AIC=-90.92
## story_readtime ~ story_id + abs_week_start + story_views + story_funding_duration +
## focal_page + tags + pitch_views + arindex + pitch_budget
##
## Df Sum of Sq RSS AIC
## - story_funding_duration 1 0.054 1.900 -91.851
## - tags 1 0.058 1.904 -91.777
## - pitch_views 1 0.063 1.909 -91.680
## - arindex 1 0.066 1.913 -91.609
## - focal_page 1 0.097 1.943 -91.019
## <none> 1.846 -90.918
## - story_id 1 0.287 2.133 -87.572
## - abs_week_start 1 0.328 2.174 -86.871
## - pitch_budget 1 0.493 2.340 -84.155
## - story_views 1 31.842 33.689 14.531
##
## Step: AIC=-91.85
## story_readtime ~ story_id + abs_week_start + story_views + focal_page +
## tags + pitch_views + arindex + pitch_budget
##
## Df Sum of Sq RSS AIC
## - pitch_views 1 0.052 1.952 -92.856
## - arindex 1 0.059 1.960 -92.712
## - tags 1 0.063 1.963 -92.644
## - focal_page 1 0.065 1.965 -92.608
## <none> 1.900 -91.851
## - story_id 1 0.369 2.269 -87.288
## - abs_week_start 1 0.418 2.318 -86.495
## - pitch_budget 1 0.533 2.434 -84.699
## - story_views 1 31.816 33.717 12.562
##
## Step: AIC=-92.86
## story_readtime ~ story_id + abs_week_start + story_views + focal_page +
## tags + arindex + pitch_budget
##
## Df Sum of Sq RSS AIC
## - tags 1 0.056 2.008 -93.802
## - focal_page 1 0.061 2.013 -93.722
## - arindex 1 0.076 2.028 -93.436
## <none> 1.952 -92.856
## - story_id 1 0.356 2.308 -88.653
## - pitch_budget 1 0.486 2.438 -86.628
## - abs_week_start 1 0.621 2.573 -84.642
## - story_views 1 38.430 40.382 17.236
##
## Step: AIC=-93.8
## story_readtime ~ story_id + abs_week_start + story_views + focal_page +
## arindex + pitch_budget
##
## Df Sum of Sq RSS AIC
## - arindex 1 0.063 2.072 -94.656
## - focal_page 1 0.074 2.082 -94.471
## <none> 2.008 -93.802
## - story_id 1 0.349 2.358 -89.871
## - pitch_budget 1 0.454 2.463 -88.257
## - abs_week_start 1 0.649 2.658 -85.438
## - story_views 1 38.373 40.382 15.236
##
## Step: AIC=-94.66
## story_readtime ~ story_id + abs_week_start + story_views + focal_page +
## pitch_budget
##
## Df Sum of Sq RSS AIC
## - focal_page 1 0.111 2.182 -94.729
## <none> 2.072 -94.656
## - story_id 1 0.338 2.410 -91.058
## - pitch_budget 1 0.392 2.464 -90.241
## - abs_week_start 1 0.591 2.663 -87.366
## - story_views 1 38.374 40.445 13.294
##
## Step: AIC=-94.73
## story_readtime ~ story_id + abs_week_start + story_views + pitch_budget
##
## Df Sum of Sq RSS AIC
## <none> 2.182 -94.729
## - story_id 1 0.300 2.482 -91.963
## - pitch_budget 1 0.414 2.596 -90.306
## - abs_week_start 1 0.484 2.667 -89.312
## - story_views 1 38.358 40.540 11.381
##
## Call:
## lm(formula = story_readtime ~ story_id + abs_week_start + story_views +
## pitch_budget, data = Crowdfunding_story, direction = "backward")
##
## Coefficients:
## (Intercept) story_id abs_week_start story_views
## 1.808821 0.002201 -0.010881 0.988757
## pitch_budget
## 0.130066
model_backward2 <- lm(story_readtime ~ abs_week_start + story_views + pitch_budget, data = Crowdfunding_story)
summary(model_backward2)
##
## Call:
## lm(formula = story_readtime ~ abs_week_start + story_views +
## pitch_budget, data = Crowdfunding_story)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60426 -0.16132 0.06613 0.18145 0.39172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.097579 0.329424 12.439 5.26e-14 ***
## abs_week_start -0.004552 0.002888 -1.576 0.1246
## story_views 0.985151 0.043749 22.518 < 2e-16 ***
## pitch_budget 0.089763 0.051657 1.738 0.0916 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2743 on 33 degrees of freedom
## Multiple R-squared: 0.9407, Adjusted R-squared: 0.9354
## F-statistic: 174.6 on 3 and 33 DF, p-value: < 2.2e-16
At the 5% level, the only significant independent variable is story_views; pitch_budget is significant only at the 10% level (p = 0.0916). This makes sense: the more visitors to the story URL, the more total time visitors spend there. The adjusted R-squared is high at 0.9354.
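The AIC column in the step() tables can be checked by hand. For a linear model, step() scores a fit as n * log(RSS / n) + 2 * p, where p counts the estimated coefficients including the intercept. The sketch below plugs in the values reported for the final backward step, taking n = 37 from the 33 residual degrees of freedom plus 4 coefficients in the summary above. The variable names are illustrative only.

```r
# Values read off the final backward step table; the names are illustrative.
n   <- 37      # observations: 33 residual df + 4 coefficients in model_backward2
rss <- 2.182   # RSS on the <none> row of the final step
p   <- 5       # intercept + story_id + abs_week_start + story_views + pitch_budget

# Form used by extractAIC() for lm fits (it omits constants that AIC() includes)
aic_by_hand <- n * log(rss / n) + 2 * p
print(aic_by_hand)  # close to the -94.729 in the table; RSS is rounded there
```

Because every candidate model is scored on the same n, only the differences in AIC between rows matter for the selection.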
We cross-check the results using the forward stepwise regression.
null2 <- lm(story_readtime ~ 1, data = Crowdfunding_story)
full2 <- lm(story_readtime ~ ., data = Crowdfunding_story)
step(null2, scope = list(lower = null2, upper = full2), direction = "forward")
## Start: AIC=6.59
## story_readtime ~ 1
##
## Df Sum of Sq RSS AIC
## + story_views 1 39.054 2.835 -91.043
## + pitch_views 1 5.433 36.457 3.453
## + story_length 1 3.694 38.195 5.176
## + pitch_contributors 1 3.075 38.815 5.772
## <none> 41.889 6.592
## + pitch_budget 1 1.154 40.735 7.559
## + insights 1 0.755 41.134 7.919
## + arindex 1 0.309 41.580 8.318
## + story_funding_duration 1 0.112 41.778 8.493
## + focal_page 1 0.083 41.807 8.519
## + story_id 1 0.038 41.851 8.559
## + abs_week_start 1 0.019 41.871 8.576
## + rel_week 1 0.019 41.871 8.576
## + video 1 0.001 41.888 8.591
## + tags 1 0.001 41.888 8.591
##
## Step: AIC=-91.04
## story_readtime ~ story_views
##
## Df Sum of Sq RSS AIC
## + pitch_budget 1 0.166171 2.6692 -91.278
## <none> 2.8354 -91.043
## + abs_week_start 1 0.125883 2.7095 -90.723
## + rel_week 1 0.125883 2.7095 -90.723
## + story_funding_duration 1 0.087265 2.7481 -90.200
## + tags 1 0.053715 2.7817 -89.751
## + pitch_views 1 0.030268 2.8051 -89.440
## + insights 1 0.020784 2.8146 -89.315
## + focal_page 1 0.018500 2.8169 -89.285
## + pitch_contributors 1 0.017978 2.8174 -89.278
## + video 1 0.005334 2.8301 -89.113
## + story_length 1 0.002489 2.8329 -89.076
## + story_id 1 0.000038 2.8354 -89.044
## + arindex 1 0.000001 2.8354 -89.043
##
## Step: AIC=-91.28
## story_readtime ~ story_views + pitch_budget
##
## Df Sum of Sq RSS AIC
## + pitch_contributors 1 0.270252 2.3990 -93.227
## + story_funding_duration 1 0.263201 2.4060 -93.119
## + abs_week_start 1 0.186847 2.4824 -91.963
## + rel_week 1 0.186847 2.4824 -91.963
## + pitch_views 1 0.171962 2.4973 -91.742
## <none> 2.6692 -91.278
## + tags 1 0.075125 2.5941 -90.334
## + insights 1 0.062434 2.6068 -90.153
## + story_length 1 0.012722 2.6565 -89.454
## + arindex 1 0.009343 2.6599 -89.407
## + focal_page 1 0.005252 2.6640 -89.351
## + story_id 1 0.002450 2.6668 -89.312
## + video 1 0.000888 2.6684 -89.290
##
## Step: AIC=-93.23
## story_readtime ~ story_views + pitch_budget + pitch_contributors
##
## Df Sum of Sq RSS AIC
## + story_funding_duration 1 0.181583 2.2174 -94.140
## + abs_week_start 1 0.128732 2.2702 -93.268
## + rel_week 1 0.128732 2.2702 -93.268
## <none> 2.3990 -93.227
## + tags 1 0.084684 2.3143 -92.557
## + pitch_views 1 0.060486 2.3385 -92.172
## + insights 1 0.022387 2.3766 -91.574
## + focal_page 1 0.012139 2.3868 -91.415
## + arindex 1 0.000605 2.3984 -91.237
## + story_id 1 0.000182 2.3988 -91.230
## + story_length 1 0.000110 2.3989 -91.229
## + video 1 0.000034 2.3990 -91.228
##
## Step: AIC=-94.14
## story_readtime ~ story_views + pitch_budget + pitch_contributors +
## story_funding_duration
##
## Df Sum of Sq RSS AIC
## <none> 2.2174 -94.140
## + tags 1 0.081449 2.1359 -93.524
## + abs_week_start 1 0.053925 2.1635 -93.050
## + rel_week 1 0.053925 2.1635 -93.050
## + insights 1 0.050862 2.1665 -92.998
## + pitch_views 1 0.033097 2.1843 -92.696
## + focal_page 1 0.017020 2.2004 -92.425
## + story_id 1 0.008691 2.2087 -92.285
## + video 1 0.003661 2.2137 -92.201
## + arindex 1 0.001522 2.2159 -92.165
## + story_length 1 0.000309 2.2171 -92.145
##
## Call:
## lm(formula = story_readtime ~ story_views + pitch_budget + pitch_contributors +
## story_funding_duration, data = Crowdfunding_story)
##
## Coefficients:
## (Intercept) story_views pitch_budget
## 3.94359 1.01082 0.18181
## pitch_contributors story_funding_duration
## -0.09046 -0.07084
model_forward2 <- lm(story_readtime ~ story_views + pitch_budget + pitch_contributors +
story_funding_duration, data = Crowdfunding_story)
summary(model_forward2)
##
## Call:
## lm(formula = story_readtime ~ story_views + pitch_budget + pitch_contributors +
## story_funding_duration, data = Crowdfunding_story)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73039 -0.09914 0.02503 0.17985 0.44545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.94359 0.34809 11.329 9.83e-13 ***
## story_views 1.01082 0.04409 22.928 < 2e-16 ***
## pitch_budget 0.18181 0.06572 2.767 0.00933 **
## pitch_contributors -0.09046 0.05483 -1.650 0.10874
## story_funding_duration -0.07084 0.04376 -1.619 0.11531
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2632 on 32 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9404
## F-statistic: 143.1 on 4 and 32 DF, p-value: < 2.2e-16
At the 5% level, the significant variables are story_views (consistent with the backward stepwise regression, with the same sign) and pitch_budget. The adjusted R-squared is high at 0.9404.
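As a side note, step() returns the final fitted model, so the selected formula does not have to be retyped into a second lm() call. The sketch below shows the pattern on the built-in mtcars data, which is only a stand-in for Crowdfunding_story; the object names are hypothetical.

```r
# Stand-in data: the built-in mtcars data set replaces Crowdfunding_story here.
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the step-by-step tables; the return value is the final lm fit.
forward_fit <- step(null_fit,
                    scope = list(lower = formula(null_fit), upper = formula(full_fit)),
                    direction = "forward",
                    trace = 0)

# The stored object can be summarised like any other lm fit.
print(summary(forward_fit))
```

Storing the result this way keeps the reported summary tied to whatever the selection procedure actually chose.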
We again triangulate the regression results using the leaps package.
leaps2 <- regsubsets(story_readtime ~ ., data = Crowdfunding, nbest = 1)
plot(leaps2, scale = "r2")
The ‘best’ subset comprises story_views and pitch_budget, consistent with the forward stepwise regression. Reading the coefficients of the forward stepwise model as elasticities, a 1% increase in story_views is associated with an approximately 1% increase in story_readtime, while a 1% increase in pitch_budget is associated with an increase of about 0.2% in story_readtime.
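The percentage effects quoted above can be reproduced from the forward stepwise coefficients. This is a minimal sketch that assumes, as the interpretation does, that both the response and the predictors enter the regression in logs, so that each coefficient acts as an elasticity.

```r
# Coefficients copied from summary(model_forward2)
beta <- c(story_views = 1.01082, pitch_budget = 0.18181)

# Exact percentage change in story_readtime for a 1% increase in each predictor,
# assuming a log-log specification
pct_change <- (1.01^beta - 1) * 100
print(round(pct_change, 2))
```

For small changes the exact figure is nearly identical to the coefficient itself, which is why the coefficients are usually read directly as percentages.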
Since the dependent variable in part 2 is measured as a count, an OLS regression, even within a panel data model, would not give reliable estimates. OLS assumes that the response is normally distributed around its expected value and can take any real value, whereas counts are non-negative integers. A more appropriate model would be a Poisson or Negative Binomial model.
Let us run a Poisson regression using the variables chosen by the forward stepwise regression.
model_poisson <- glm(story_readtime ~ story_views + pitch_budget + pitch_contributors + story_funding_duration, data = Crowdfunding, family = poisson)
summary(model_poisson)
##
## Call:
## glm(formula = story_readtime ~ story_views + pitch_budget + pitch_contributors +
## story_funding_duration, family = poisson, data = Crowdfunding)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -181.57 -32.73 -29.91 -7.72 398.33
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.216e+00 1.488e-03 4176.96 <2e-16 ***
## story_views 5.554e-03 4.318e-06 1286.11 <2e-16 ***
## pitch_budget 2.657e-07 8.381e-09 31.70 <2e-16 ***
## pitch_contributors 3.127e-03 5.447e-05 57.41 <2e-16 ***
## story_funding_duration 2.391e-04 1.282e-05 18.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 6209386 on 1847 degrees of freedom
## Residual deviance: 3274157 on 1843 degrees of freedom
## AIC: 3282374
##
## Number of Fisher Scoring iterations: 7
While all the variables are highly significant, the mean of story_readtime (799.8425325) is far smaller than its variance (18344073). The Poisson model assumes the two are equal, so the data are heavily over-dispersed and the model fits poorly: the residual deviance is 3274157 on only 1843 degrees of freedom.
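Over-dispersion can also be gauged from the fit itself. The sketch below first divides the reported residual deviance by its degrees of freedom, then shows the Pearson dispersion statistic on simulated data. The simulated data and all object names are hypothetical and are not the Crowdfunding set.

```r
# Rough check from the reported Poisson output: residual deviance per degree of freedom.
# A well-specified Poisson model would give a value near 1.
deviance_ratio <- 3274157 / 1843
print(deviance_ratio)

# Simulated illustration (hypothetical data, not the Crowdfunding set)
set.seed(42)
x <- rnorm(500)
y <- rnbinom(500, mu = exp(3 + 0.5 * x), size = 0.5)  # over-dispersed counts
fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: values far above 1 indicate over-dispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

The same statistic could be computed on model_poisson by substituting it for fit.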
A better model would be the Negative Binomial Regression, which relaxes the assumption that the variance be equal to the mean. Let us try fitting it.
model_nb <- glm.nb(story_readtime ~ story_views + pitch_budget + pitch_contributors +
story_funding_duration, data = Crowdfunding)
summary(model_nb)
##
## Call:
## glm.nb(formula = story_readtime ~ story_views + pitch_budget +
## pitch_contributors + story_funding_duration, data = Crowdfunding,
## init.theta = 0.147262933, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2500 -1.4268 -0.5946 0.0395 2.4646
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.909e+00 9.808e-02 50.052 <2e-16 ***
## story_views 7.152e-02 1.538e-03 46.515 <2e-16 ***
## pitch_budget -1.834e-07 6.706e-07 -0.274 0.784
## pitch_contributors 5.586e-03 3.999e-03 1.397 0.162
## story_funding_duration 2.643e-04 8.492e-04 0.311 0.756
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.1473) family taken to be 1)
##
## Null deviance: 2451.5 on 1847 degrees of freedom
## Residual deviance: 1943.2 on 1843 degrees of freedom
## AIC: 19351
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.14726
## Std. Err.: 0.00495
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -19339.05500
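Note that only story_views is significant at the 5% level now. A one-unit increase in story_views is associated with a 0.07 increase in the log of expected story_readtime, i.e. expected story_readtime is multiplied by exp(0.0715) ≈ 1.07, an increase of about 7%. The warning in the output indicates that the alternation limit was reached while fitting theta, so these estimates should be read with some caution.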
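With the log link used by glm.nb, exponentiating a coefficient gives a multiplicative effect on the expected count rather than an additive number of seconds. A minimal sketch using the reported story_views estimate (the variable names are illustrative):

```r
# Coefficient on story_views copied from summary(model_nb)
b_views <- 7.152e-02

irr <- exp(b_views)      # multiplicative change in expected story_readtime per extra view
pct <- (irr - 1) * 100   # the same effect expressed as a percentage
print(c(irr = irr, pct = pct))
```

With the fitted object available, exp(coef(model_nb)) would return the same ratio for every term at once.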