Part1. Research Scenario Description

Kickstarter is a public benefit corporation that maintains a global crowdfunding platform focused on creativity. As of December 2019, Kickstarter has received more than $4.6 billion in pledges from 17.2 million backers to fund 445,000 projects, such as films, music, stage shows, comics, journalism, video games, technology, publishing, and food-related projects. People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This Kickstarter dataset consists of data points such as project’s name and category, crowdfunding’s deadline, goal, launched date, pledged amount, project’s condition (successful or failed) etc of a list of more than 300,000 Kickstarter projects. Having these data, we would like to predict what kinds of projects are most likely to be successful based on the given data. Such as looking into questions like whether backer number is correlated with pledged amount, and whether project categories, funding period, funding goal, pledged amount have impact on project success by using techniques like linear regression, ANOVA, and logistic regression.

Part2. Dataset description

Data was obtained under the topic Kickstarter Projects dataset from kaggle website, following link: https://www.kaggle.com/kemical/kickstarter-projects. The original dataset consists of 15 columns and 378661 rows of Kickstarter projects, which then being reduced to 11 columns including name, category, main_category(category of campaign), currency(currency used to support), deadline(deadline for crowdfunding), launched(date launched), pledged(amount pledged by “crowd”), state(failed or successful), backers, country(country pledged from), usd_pledged_real(amount of money pledged), usd_goal_real(the amount of money that a creator needs to complete their project) for 497 random projects. Amount of Pledged and goal have been converted to US dollars (conversion from Fixer.io API). Three outliers in goal have been removed after testing the data using boxplot.

Part3. Research Questions

  • Q1.How many projects are successful?
  • Q2.Which main categories appear the most frequently?
  • Q3. How many projects exceeded their funding goal by 50% or more? What are their categories?
  • Q4.What are the funding periods?
  • Q5.Whether Kickstarter projects’ main category, pledged amount, backers number, and goal have impact on project success(state)?
  • Q6.Whether backer number is correlated with pledged amount?
  • Q7.What’s the median goal amount for successful projects?

Part4. Results

#Preparing the data

## Warning: Missing column names filled in: 'X1' [1]
## 
## ── Column specification ──────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   name = col_character(),
##   category = col_character(),
##   main_category = col_character(),
##   currency = col_character(),
##   deadline = col_date(format = ""),
##   launched = col_date(format = ""),
##   state = col_character(),
##   backers = col_double(),
##   country = col_character(),
##   usd_pledged_real = col_double(),
##   usd_goal_real = col_double(),
##   goal_flag = col_double(),
##   goal_flag2 = col_double()
## )

Q1.How many projects are successful?

p <- prop.table(table(ks$state))
p
## 
##    canceled      failed        live  successful   suspended   undefined 
## 0.086519115 0.498993964 0.006036217 0.388329980 0.006036217 0.014084507
cols = c("#40DCA7","#106E85","#FFD31F","#009db2","#DA92AA")
piepercent = paste(round(100*table(ks$state)/sum(table(ks$state))), "%")
pie(p, label = paste(names(table(ks$state)), piepercent, sep=", "), col=cols, main = "Project State")

Q2.Which main categories appear the most frequently?

library(ggplot2)
scdf <- data.frame(table(ks$main_category))
scdf2 <- scdf[order(scdf$Freq, decreasing = T),]
scdf2$Percent <- scdf2$Freq/sum(scdf2$Freq)
p <- ggplot(data=scdf2, aes(x=Var1, y = Percent)) +  geom_bar(colour=c("#106E85"), fill=c("#40DCA7"), stat = "identity")
p + coord_flip()

#What are the total amount of pledged money of each main categories?

head(scdf2)
##            Var1 Freq    Percent
## 7  Film & Video   98 0.19718310
## 11        Music   65 0.13078471
## 1           Art   47 0.09456740
## 9         Games   44 0.08853119
## 5        Design   39 0.07847082
## 13   Publishing   39 0.07847082
n1 = subset(ks, main_category == "Film & Video")
sprintf("The total amount of money pledged of Film & Video is %s",sum(n1$usd_pledged_real))
## [1] "The total amount of money pledged of Film & Video is 367626.16"
n2 = subset(ks, main_category == "Music")
sprintf("The total amount of money pledged of Music is %s",sum(n2$usd_pledged_real))
## [1] "The total amount of money pledged of Music is 272649.65"
n3 = subset(ks, main_category == "Art")
sprintf("The total amount of money pledged of Art is %s",sum(n3$usd_pledged_real))
## [1] "The total amount of money pledged of Art is 102194.27"

Q3.How many projects exceeded their funding goal by 50% or more? What are their categories?

ks$goal_flag <- ifelse(ks$usd_pledged_real >= (ks$usd_goal_real)*1.5, 1, 0)
prop.table(table(ks$goal_flag))*100
## 
##       0       1 
## 86.1167 13.8833
ca <- subset(ks, ks$goal_flag == 1)
barplot(table(ca$main_category),col= c("#DA92AA"), xlab ="categories", ylab = "freq")

barplot(table(ca$category),col= c("#40DCA7"), xlab ="categories", ylab = "freq")

Q4.What are the funding periods?

ks$launched <- as.Date(ks$launched, "%Y/%m/%d")
period <- ks$deadline-ks$launched
table(period)
## period
##   3   5   7  11  12  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28 
##   1   1   7   2   1   4   4   3   4   8   1  12   7   5   6   4  14   4   3   4 
##  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48 
##   8 219  13  15   4   3  16   1   5   3   2  11   2   3   3   4  20   3   2   1 
##  49  50  51  52  53  54  55  56  57  59  60  90 
##   1   7   1   1   2   1   2   2   1   2  41   3
data.frame(table(period))
##    period Freq
## 1       3    1
## 2       5    1
## 3       7    7
## 4      11    2
## 5      12    1
## 6      14    4
## 7      15    4
## 8      16    3
## 9      17    4
## 10     18    8
## 11     19    1
## 12     20   12
## 13     21    7
## 14     22    5
## 15     23    6
## 16     24    4
## 17     25   14
## 18     26    4
## 19     27    3
## 20     28    4
## 21     29    8
## 22     30  219
## 23     31   13
## 24     32   15
## 25     33    4
## 26     34    3
## 27     35   16
## 28     36    1
## 29     37    5
## 30     38    3
## 31     39    2
## 32     40   11
## 33     41    2
## 34     42    3
## 35     43    3
## 36     44    4
## 37     45   20
## 38     46    3
## 39     47    2
## 40     48    1
## 41     49    1
## 42     50    7
## 43     51    1
## 44     52    1
## 45     53    2
## 46     54    1
## 47     55    2
## 48     56    2
## 49     57    1
## 50     59    2
## 51     60   41
## 52     90    3
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(data.frame(table(period)), x = ~period, y = ~Freq, type = 'bar',
        mode = "markers", marker = list(color = c("#FFD31F")))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

Q5.Whether Kickstarter projects’ category, main category, pledged amount, backers number, and goal have impact on project success(state)?

ks$goal_flag2 <- ifelse(ks$usd_pledged_real >= (ks$usd_goal_real), 1, 0)
m <- lm(ks$goal_flag2 ~ ks$backers+ks$usd_pledged_real+ks$usd_goal_real+ks$backers*ks$usd_pledged_real*ks$usd_goal_real)
summary(m)
## 
## Call:
## lm(formula = ks$goal_flag2 ~ ks$backers + ks$usd_pledged_real + 
##     ks$usd_goal_real + ks$backers * ks$usd_pledged_real * ks$usd_goal_real)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1685 -0.3430 -0.2415  0.4847  0.6461 
## 
## Coefficients:
##                                                   Estimate Std. Error t value
## (Intercept)                                      3.509e-01  2.373e-02  14.782
## ks$backers                                       1.506e-03  1.784e-04   8.446
## ks$usd_pledged_real                              7.573e-06  2.112e-06   3.586
## ks$usd_goal_real                                -4.404e-06  7.432e-07  -5.926
## ks$backers:ks$usd_pledged_real                  -6.743e-09  1.153e-09  -5.847
## ks$backers:ks$usd_goal_real                     -1.490e-08  2.398e-09  -6.216
## ks$usd_pledged_real:ks$usd_goal_real            -4.850e-11  8.475e-11  -0.572
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real  6.361e-14  1.460e-14   4.357
##                                                 Pr(>|t|)    
## (Intercept)                                      < 2e-16 ***
## ks$backers                                      3.46e-16 ***
## ks$usd_pledged_real                             0.000369 ***
## ks$usd_goal_real                                5.87e-09 ***
## ks$backers:ks$usd_pledged_real                  9.17e-09 ***
## ks$backers:ks$usd_goal_real                     1.09e-09 ***
## ks$usd_pledged_real:ks$usd_goal_real            0.567411    
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4165 on 489 degrees of freedom
## Multiple R-squared:  0.2866, Adjusted R-squared:  0.2763 
## F-statistic: 28.06 on 7 and 489 DF,  p-value: < 2.2e-16
anova(m)
## Analysis of Variance Table
## 
## Response: ks$goal_flag2
##                                                  Df Sum Sq Mean Sq F value
## ks$backers                                        1  6.423  6.4230 37.0215
## ks$usd_pledged_real                               1  0.023  0.0228  0.1315
## ks$usd_goal_real                                  1 10.460 10.4603 60.2923
## ks$backers:ks$usd_pledged_real                    1 10.043 10.0427 57.8851
## ks$backers:ks$usd_goal_real                       1  1.081  1.0810  6.2308
## ks$usd_pledged_real:ks$usd_goal_real              1  2.752  2.7515 15.8596
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real   1  3.294  3.2936 18.9842
## Residuals                                       489 84.838  0.1735        
##                                                    Pr(>F)    
## ks$backers                                      2.361e-09 ***
## ks$usd_pledged_real                               0.71701    
## ks$usd_goal_real                                4.830e-14 ***
## ks$backers:ks$usd_pledged_real                  1.439e-13 ***
## ks$backers:ks$usd_goal_real                       0.01288 *  
## ks$usd_pledged_real:ks$usd_goal_real            7.859e-05 ***
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real 1.607e-05 ***
## Residuals                                                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Influence of backers number on project success

x <- glm (ks$goal_flag2 ~ ks$backers, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary (x)
## 
## Call:
## glm(formula = ks$goal_flag2 ~ ks$backers, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4958  -0.6815  -0.6513   0.7390   1.8192  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.442685   0.141254 -10.213  < 2e-16 ***
## ks$backers   0.025253   0.003183   7.935 2.11e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 667.49  on 496  degrees of freedom
## Residual deviance: 488.58  on 495  degrees of freedom
## AIC: 492.58
## 
## Number of Fisher Scoring iterations: 8
exp(cbind(OR = coef (x), confint.default (x)))
##                    OR     2.5 %    97.5 %
## (Intercept) 0.2362925 0.1791489 0.3116633
## ks$backers  1.0255746 1.0191970 1.0319921
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
ks$prob <- predict(x, type =c("response"))
g <- roc(ks$goal_flag2 ~ ks$prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot (g)

print(g)
## 
## Call:
## roc.formula(formula = ks$goal_flag2 ~ ks$prob)
## 
## Data: ks$prob in 300 controls (ks$goal_flag2 0) < 197 cases (ks$goal_flag2 1).
## Area under the curve: 0.895
cor(ks$backers,ks$goal_flag2)
## [1] 0.2324091
plot_ly(ks, x = ~ks$backers, y = ~ks$state, type = 'scatter',
        mode = "markers", marker = list(color = c("#009db2")))

#Influence of amount money of goal on project success

cor(ks$usd_goal_real,ks$goal_flag2)
## [1] -0.2062566
plot_ly(ks, x = ~ks$usd_goal_real, y = ~ks$state, type = 'histogram',
        mode = "markers", marker = list(color = c("#DA92AA")))
## Warning: 'histogram' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'y', 'text', 'hovertext', 'orientation', 'histfunc', 'histnorm', 'cumulative', 'nbinsx', 'xbins', 'nbinsy', 'ybins', 'autobinx', 'autobiny', 'bingroup', 'hovertemplate', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

#Influence of funding period on project success

pperiod <- stringr::str_remove_all(as.character(period), "[^0-9]")
newks <- data.frame(ks, pperiod)
f <- lm(newks$goal_flag2 ~ newks$pperiod)
anova(f)
## Analysis of Variance Table
## 
## Response: newks$goal_flag2
##                Df  Sum Sq Mean Sq F value Pr(>F)
## newks$pperiod  51  14.244 0.27930  1.1874 0.1855
## Residuals     445 104.669 0.23521

#Influence of main category on project success

sks <- subset(ks, ks$state=="successful")
sks1 <- data.frame(table(sks$main_category))
sks2 <- sks1[order(sks1$Freq, decreasing = T),]
scdf2$Percent <- scdf2$Freq/sum(scdf2$Freq)*100
sks3 <- data.frame(sks2,scdf2$Percent)
head(sks3)
##            Var1 Freq scdf2.Percent
## 7  Film & Video   33     19.718310
## 11        Music   28     13.078471
## 1           Art   22      9.456740
## 5        Design   17      8.853119
## 13   Publishing   17      7.847082
## 8          Food   14      7.847082
plot_ly(sks3, x = ~sks3$Var1, y = ~sks3$scdf2.Percent, type = 'bar',
        mode = "markers", marker = list(color = c("#009db2")))
## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

Q6.Whether backer number is correlated with pledged amount?

lm(ks$usd_pledged_real~ks$backers)
## 
## Call:
## lm(formula = ks$usd_pledged_real ~ ks$backers)
## 
## Coefficients:
## (Intercept)   ks$backers  
##      166.99        91.72
anova(lm(ks$usd_pledged_real~ks$backers))
## Analysis of Variance Table
## 
## Response: ks$usd_pledged_real
##             Df     Sum Sq    Mean Sq F value    Pr(>F)    
## ks$backers   1 7.5590e+11 7.5590e+11  566.03 < 2.2e-16 ***
## Residuals  495 6.6104e+11 1.3354e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot_ly(ks, x = ~ks$backers, y = ~ks$usd_pledged_real, type = 'scatter',
        mode = "markers", marker = list(color = c("#106E85")))
ks2 <- subset(ks, ks$usd_pledged_real < 200000)
plot_ly(ks2, x = ~ks2$backers, y = ~ks2$usd_pledged_real, type = 'scatter',
        mode = "markers", marker = list(color = c("#106E85")))

Q7.What’s the median goal amount for successful projects?

ks3 <- subset(ks2, ks2$goal_flag2=="1")
median(ks3$usd_goal_real)
## [1] 4000
t <- data.frame(table(ks3$usd_goal_real))
plot_ly(ks3, x = ~t$Var1 , y = ~t$Freq , type = 'scatter',
        mode = "markers", marker = list(color = c("#009db2")))

Part5. Conclusion

39% of projects on Kickstarter successfully finished crowdfunding. Film & Video projects were the most popular category. 13.88% of the projects exceeded their funding goal by over 50%. Typical funding period is 30 days.

Backers and pledged amount have significant weak positive influence on project success, while goal amount has weak negative impact on it. Project success also differ by main categories, Film & Video, Music and Art were more likely to success. Further, more backer means more pledged amount. The median goal amount for successful projects was 4000. So, people want their projects to be more successful should control the goal.