Kickstarter is a public benefit corporation that maintains a global crowdfunding platform focused on creativity. As of December 2019, Kickstarter has received more than $4.6 billion in pledges from 17.2 million backers to fund 445,000 projects, such as films, music, stage shows, comics, journalism, video games, technology, publishing, and food-related projects. People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This Kickstarter dataset consists of data points such as project’s name and category, crowdfunding’s deadline, goal, launched date, pledged amount, project’s condition (successful or failed) etc of a list of more than 300,000 Kickstarter projects. Having these data, we would like to predict what kinds of projects are most likely to be successful based on the given data. Such as looking into questions like whether backer number is correlated with pledged amount, and whether project categories, funding period, funding goal, pledged amount have impact on project success by using techniques like linear regression, ANOVA, and logistic regression.
Data was obtained under the topic Kickstarter Projects dataset from kaggle website, following link: https://www.kaggle.com/kemical/kickstarter-projects. The original dataset consists of 15 columns and 378661 rows of Kickstarter projects, which then being reduced to 11 columns including name, category, main_category(category of campaign), currency(currency used to support), deadline(deadline for crowdfunding), launched(date launched), pledged(amount pledged by “crowd”), state(failed or successful), backers, country(country pledged from), usd_pledged_real(amount of money pledged), usd_goal_real(the amount of money that a creator needs to complete their project) for 497 random projects. Amount of Pledged and goal have been converted to US dollars (conversion from Fixer.io API). Three outliers in goal have been removed after testing the data using boxplot.
#Preparing the data
## Warning: Missing column names filled in: 'X1' [1]
##
## ── Column specification ──────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## name = col_character(),
## category = col_character(),
## main_category = col_character(),
## currency = col_character(),
## deadline = col_date(format = ""),
## launched = col_date(format = ""),
## state = col_character(),
## backers = col_double(),
## country = col_character(),
## usd_pledged_real = col_double(),
## usd_goal_real = col_double(),
## goal_flag = col_double(),
## goal_flag2 = col_double()
## )
##
## canceled failed live successful suspended undefined
## 0.086519115 0.498993964 0.006036217 0.388329980 0.006036217 0.014084507
cols = c("#40DCA7","#106E85","#FFD31F","#009db2","#DA92AA")
piepercent = paste(round(100*table(ks$state)/sum(table(ks$state))), "%")
pie(p, label = paste(names(table(ks$state)), piepercent, sep=", "), col=cols, main = "Project State")library(ggplot2)
scdf <- data.frame(table(ks$main_category))
scdf2 <- scdf[order(scdf$Freq, decreasing = T),]
scdf2$Percent <- scdf2$Freq/sum(scdf2$Freq)
p <- ggplot(data=scdf2, aes(x=Var1, y = Percent)) + geom_bar(colour=c("#106E85"), fill=c("#40DCA7"), stat = "identity")
p + coord_flip()#What are the total amount of pledged money of each main categories?
## Var1 Freq Percent
## 7 Film & Video 98 0.19718310
## 11 Music 65 0.13078471
## 1 Art 47 0.09456740
## 9 Games 44 0.08853119
## 5 Design 39 0.07847082
## 13 Publishing 39 0.07847082
n1 = subset(ks, main_category == "Film & Video")
sprintf("The total amount of money pledged of Film & Video is %s",sum(n1$usd_pledged_real))## [1] "The total amount of money pledged of Film & Video is 367626.16"
n2 = subset(ks, main_category == "Music")
sprintf("The total amount of money pledged of Music is %s",sum(n2$usd_pledged_real))## [1] "The total amount of money pledged of Music is 272649.65"
n3 = subset(ks, main_category == "Art")
sprintf("The total amount of money pledged of Art is %s",sum(n3$usd_pledged_real))## [1] "The total amount of money pledged of Art is 102194.27"
ks$goal_flag <- ifelse(ks$usd_pledged_real >= (ks$usd_goal_real)*1.5, 1, 0)
prop.table(table(ks$goal_flag))*100##
## 0 1
## 86.1167 13.8833
ca <- subset(ks, ks$goal_flag == 1)
barplot(table(ca$main_category),col= c("#DA92AA"), xlab ="categories", ylab = "freq")## period
## 3 5 7 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## 1 1 7 2 1 4 4 3 4 8 1 12 7 5 6 4 14 4 3 4
## 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 8 219 13 15 4 3 16 1 5 3 2 11 2 3 3 4 20 3 2 1
## 49 50 51 52 53 54 55 56 57 59 60 90
## 1 7 1 1 2 1 2 2 1 2 41 3
## period Freq
## 1 3 1
## 2 5 1
## 3 7 7
## 4 11 2
## 5 12 1
## 6 14 4
## 7 15 4
## 8 16 3
## 9 17 4
## 10 18 8
## 11 19 1
## 12 20 12
## 13 21 7
## 14 22 5
## 15 23 6
## 16 24 4
## 17 25 14
## 18 26 4
## 19 27 3
## 20 28 4
## 21 29 8
## 22 30 219
## 23 31 13
## 24 32 15
## 25 33 4
## 26 34 3
## 27 35 16
## 28 36 1
## 29 37 5
## 30 38 3
## 31 39 2
## 32 40 11
## 33 41 2
## 34 42 3
## 35 43 3
## 36 44 4
## 37 45 20
## 38 46 3
## 39 47 2
## 40 48 1
## 41 49 1
## 42 50 7
## 43 51 1
## 44 52 1
## 45 53 2
## 46 54 1
## 47 55 2
## 48 56 2
## 49 57 1
## 50 59 2
## 51 60 41
## 52 90 3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(data.frame(table(period)), x = ~period, y = ~Freq, type = 'bar',
mode = "markers", marker = list(color = c("#FFD31F")))## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
ks$goal_flag2 <- ifelse(ks$usd_pledged_real >= (ks$usd_goal_real), 1, 0)
m <- lm(ks$goal_flag2 ~ ks$backers+ks$usd_pledged_real+ks$usd_goal_real+ks$backers*ks$usd_pledged_real*ks$usd_goal_real)
summary(m)##
## Call:
## lm(formula = ks$goal_flag2 ~ ks$backers + ks$usd_pledged_real +
## ks$usd_goal_real + ks$backers * ks$usd_pledged_real * ks$usd_goal_real)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1685 -0.3430 -0.2415 0.4847 0.6461
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.509e-01 2.373e-02 14.782
## ks$backers 1.506e-03 1.784e-04 8.446
## ks$usd_pledged_real 7.573e-06 2.112e-06 3.586
## ks$usd_goal_real -4.404e-06 7.432e-07 -5.926
## ks$backers:ks$usd_pledged_real -6.743e-09 1.153e-09 -5.847
## ks$backers:ks$usd_goal_real -1.490e-08 2.398e-09 -6.216
## ks$usd_pledged_real:ks$usd_goal_real -4.850e-11 8.475e-11 -0.572
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real 6.361e-14 1.460e-14 4.357
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## ks$backers 3.46e-16 ***
## ks$usd_pledged_real 0.000369 ***
## ks$usd_goal_real 5.87e-09 ***
## ks$backers:ks$usd_pledged_real 9.17e-09 ***
## ks$backers:ks$usd_goal_real 1.09e-09 ***
## ks$usd_pledged_real:ks$usd_goal_real 0.567411
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real 1.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4165 on 489 degrees of freedom
## Multiple R-squared: 0.2866, Adjusted R-squared: 0.2763
## F-statistic: 28.06 on 7 and 489 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: ks$goal_flag2
## Df Sum Sq Mean Sq F value
## ks$backers 1 6.423 6.4230 37.0215
## ks$usd_pledged_real 1 0.023 0.0228 0.1315
## ks$usd_goal_real 1 10.460 10.4603 60.2923
## ks$backers:ks$usd_pledged_real 1 10.043 10.0427 57.8851
## ks$backers:ks$usd_goal_real 1 1.081 1.0810 6.2308
## ks$usd_pledged_real:ks$usd_goal_real 1 2.752 2.7515 15.8596
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real 1 3.294 3.2936 18.9842
## Residuals 489 84.838 0.1735
## Pr(>F)
## ks$backers 2.361e-09 ***
## ks$usd_pledged_real 0.71701
## ks$usd_goal_real 4.830e-14 ***
## ks$backers:ks$usd_pledged_real 1.439e-13 ***
## ks$backers:ks$usd_goal_real 0.01288 *
## ks$usd_pledged_real:ks$usd_goal_real 7.859e-05 ***
## ks$backers:ks$usd_pledged_real:ks$usd_goal_real 1.607e-05 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Influence of backers number on project success
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = ks$goal_flag2 ~ ks$backers, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4958 -0.6815 -0.6513 0.7390 1.8192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.442685 0.141254 -10.213 < 2e-16 ***
## ks$backers 0.025253 0.003183 7.935 2.11e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 667.49 on 496 degrees of freedom
## Residual deviance: 488.58 on 495 degrees of freedom
## AIC: 492.58
##
## Number of Fisher Scoring iterations: 8
## OR 2.5 % 97.5 %
## (Intercept) 0.2362925 0.1791489 0.3116633
## ks$backers 1.0255746 1.0191970 1.0319921
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.formula(formula = ks$goal_flag2 ~ ks$prob)
##
## Data: ks$prob in 300 controls (ks$goal_flag2 0) < 197 cases (ks$goal_flag2 1).
## Area under the curve: 0.895
## [1] 0.2324091
plot_ly(ks, x = ~ks$backers, y = ~ks$state, type = 'scatter',
mode = "markers", marker = list(color = c("#009db2")))#Influence of amount money of goal on project success
## [1] -0.2062566
plot_ly(ks, x = ~ks$usd_goal_real, y = ~ks$state, type = 'histogram',
mode = "markers", marker = list(color = c("#DA92AA")))## Warning: 'histogram' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'y', 'text', 'hovertext', 'orientation', 'histfunc', 'histnorm', 'cumulative', 'nbinsx', 'xbins', 'nbinsy', 'ybins', 'autobinx', 'autobiny', 'bingroup', 'hovertemplate', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
#Influence of funding period on project success
pperiod <- stringr::str_remove_all(as.character(period), "[^0-9]")
newks <- data.frame(ks, pperiod)
f <- lm(newks$goal_flag2 ~ newks$pperiod)
anova(f)## Analysis of Variance Table
##
## Response: newks$goal_flag2
## Df Sum Sq Mean Sq F value Pr(>F)
## newks$pperiod 51 14.244 0.27930 1.1874 0.1855
## Residuals 445 104.669 0.23521
#Influence of main category on project success
sks <- subset(ks, ks$state=="successful")
sks1 <- data.frame(table(sks$main_category))
sks2 <- sks1[order(sks1$Freq, decreasing = T),]
scdf2$Percent <- scdf2$Freq/sum(scdf2$Freq)*100
sks3 <- data.frame(sks2,scdf2$Percent)
head(sks3)## Var1 Freq scdf2.Percent
## 7 Film & Video 33 19.718310
## 11 Music 28 13.078471
## 1 Art 22 9.456740
## 5 Design 17 8.853119
## 13 Publishing 17 7.847082
## 8 Food 14 7.847082
plot_ly(sks3, x = ~sks3$Var1, y = ~sks3$scdf2.Percent, type = 'bar',
mode = "markers", marker = list(color = c("#009db2")))## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
39% of projects on Kickstarter successfully finished crowdfunding. Film & Video projects were the most popular category. 13.88% of the projects exceeded their funding goal by over 50%. Typical funding period is 30 days.
Backers and pledged amount have significant weak positive influence on project success, while goal amount has weak negative impact on it. Project success also differ by main categories, Film & Video, Music and Art were more likely to success. Further, more backer means more pledged amount. The median goal amount for successful projects was 4000. So, people want their projects to be more successful should control the goal.