1. Project Overview

Kickstarter (Links to an external site.)Links to an external site. is a global community built around creativity and creative projects. Kickstarter maintains a global crowdfunding platform focused on creativity and merchandising. This model traces its roots to the subscription model of arts patronage, where artists would go directly to their audiences to fund their work. Kickstarter has reportedly received more than $4 billion in pledges from 15.5 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology, and food-related projects.

The original dataset of this project contains 378,661 records of project pledged on kickstarter, from April 2009 to January 2018. Variables that we are interested in include main_category, deadline,launched, state, backers, usd_pledged_real, and usd_goal_real.


df <- read.csv("ks-projects-201801.csv", header = TRUE)

df$launched_date <- as.Date(df$launched, format = "%Y-%m-%d")
df$deadline <- as.Date(df$deadline)

#Remove records with launched_date of 1970-01-01.
df <- df[!(df$launched_date == "1970-01-01"),]

summary(df)
##        ID                                  name       
##  Min.   :5.971e+03   New EP/Music Development:    41  
##  1st Qu.:5.383e+08   Canceled (Canceled)     :    13  
##  Median :1.075e+09   Music Video             :    11  
##  Mean   :1.075e+09   N/A (Canceled)          :    11  
##  3rd Qu.:1.610e+09   Cancelled (Canceled)    :    10  
##  Max.   :2.147e+09   Debut Album             :    10  
##                      (Other)                 :378558  
##            category           main_category       currency     
##  Product Design: 22314   Film & Video: 63583   USD    :295359  
##  Documentary   : 16139   Music       : 51917   GBP    : 34132  
##  Music         : 15726   Publishing  : 39873   EUR    : 17405  
##  Tabletop Games: 14180   Games       : 35231   CAD    : 14962  
##  Shorts        : 12357   Technology  : 32569   AUD    :  7950  
##  Video Games   : 11830   Design      : 30069   SEK    :  1788  
##  (Other)       :286108   (Other)     :125412   (Other):  7058  
##     deadline               goal                          launched     
##  Min.   :2009-05-03   Min.   :        0   2009-09-15 05:56:28:     2  
##  1st Qu.:2013-06-08   1st Qu.:     2000   2010-06-30 17:29:43:     2  
##  Median :2015-01-14   Median :     5200   2011-02-08 04:29:48:     2  
##  Mean   :2014-11-01   Mean   :    49082   2011-02-25 09:58:36:     2  
##  3rd Qu.:2016-04-28   3rd Qu.:    16000   2011-03-03 17:55:38:     2  
##  Max.   :2018-03-03   Max.   :100000000   2011-03-07 17:11:18:     2  
##                                           (Other)            :378642  
##     pledged                state           backers        
##  Min.   :       0   canceled  : 38773   Min.   :     0.0  
##  1st Qu.:      30   failed    :197719   1st Qu.:     2.0  
##  Median :     620   live      :  2799   Median :    12.0  
##  Mean   :    9683   successful:133956   Mean   :   105.6  
##  3rd Qu.:    4076   suspended :  1845   3rd Qu.:    56.0  
##  Max.   :20338986   undefined :  3562   Max.   :219382.0  
##                                                           
##     country        usd.pledged       usd_pledged_real  
##  US     :292621   Min.   :       0   Min.   :       0  
##  GB     : 33672   1st Qu.:      17   1st Qu.:      31  
##  CA     : 14756   Median :     395   Median :     624  
##  AU     :  7839   Mean   :    7037   Mean   :    9059  
##  DE     :  4171   3rd Qu.:    3035   3rd Qu.:    4050  
##  N,0"   :  3797   Max.   :20338986   Max.   :20338986  
##  (Other): 21798   NA's   :3797                         
##  usd_goal_real       launched_date       
##  Min.   :        0   Min.   :2009-04-21  
##  1st Qu.:     2000   1st Qu.:2013-05-07  
##  Median :     5500   Median :2014-12-10  
##  Mean   :    45455   Mean   :2014-09-28  
##  3rd Qu.:    15500   3rd Qu.:2016-03-24  
##  Max.   :166361391   Max.   :2018-01-02  
##

2. Number of Projects Launched on Kickstarter

The first question we would like to answer is how many projects are launching on Kickstarter everyday.

2.1 Exploratory Data Analysis

According to the historical data, the average number of projects launched is 120.The histogram shows that the distribution has strong right skeweness, because of a few outliers whose numbers are around 800 and 1000.


summary(count(df,launched_date)$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    48.0   103.0   119.5   174.0   928.0

hist(count(df,launched_date)$n,
     main = ("Histogram of Daily Number of Projects Launched on Kickstarter"),
     xlab = ("Number of Projects"))

2.2 Time Series

2.2.1 Trend and Seasonality

We aggregate the data by the project launched dates. According to the time plot, this time series has strong seasonality, but the trend is not very clear. Further, the correlogram verifies the seasonality of 7 observations, which means a weekly seasonality; meanwhile, the trend is not very significant as heights of spikes decreases very slow. Further, the subseries plot shows that Sunday and Monday have fewer projects launched comparing with other days.

On the other hand, as illustrated in the time plot, there exists a yearly seasonality that January usually has the lowest volumes. Thus, we use msts() function to define a multi-seasonality time series.

df_launched <- count(df,launched_date)

p1 <- autoplot(ts(df_launched$n)) + ggtitle("Time Plot of Daily Number of Projects Launched on Kickstarter") +
  xlab("Days") + ylab("Number of Projects")

p2 <- ggAcf(ts(df_launched$n)) + ggtitle("ACF Plot of Number of Projects Launched")

p3 <- ggsubseriesplot(ts(df_launched$n, frequency = 7)) + ggtitle("Subseries Plot of Number of Projects Launched")
                      
grid.arrange(p1, grid.arrange(p2, p3, ncol =  2), nrow = 2)

2.2.2 Outliers

The second observation from the time plots is the outliers happening between 1800 to 2000 days. These outliers are dramatically larger than the other records, which will result in great variance when forecasting. In general, there are two ways to handle these outliers. The first method is to replace the outliers with some reasonable values. In this project, we are using tsclean() function to clean missing values and replace outliers.
Another option is to create a dummy variable, whose values are equal to one when the corresponding records are outliers, and otherwise zero. However, by using dummy variable, we will limit forecasting method to regression. Given these outlier events only happened once over past 9 years, it is acceptable to disregard these events when forecasting.


p1 <- autoplot(ts(df_launched$n)) + xlab("Days") + ylab("Number of Projects") + ggtitle("Original Time Series")

p2 <- autoplot(ts(tsclean(df_launched$n))) + xlab("Days") + ylab("Number of Projects") + ggtitle("Replace Outliers")

grid.arrange(p1, p2, nrow = 2)


df_launched$n <- tsclean(df_launched$n)

launched_proj <- msts(df_launched$n, seasonal.periods=c(7,365.25))

2.2.3 Decomposition

According to the decomposition results, we show that the weekly seasonality is changing overtime; however, the yearly seasonality is constant. The range of trend is relatively to the original time series, because there is little trend seen in the data.


autoplot(mstl(launched_proj)) + xlab("Days")

2.2.4 Forecast

Given the time series has multiple seasonality and the seasonality are changing overtime, ETS method and TBATS method are capable to handle this situation. Since the TBATS model is very slow to train when the time series has long history, we only use the last 1500 data points for this model.

launched_proj %>%
  stlf() %>% autoplot(include = length(launched_proj)-1500) + xlab("Days") -> p1

launched_proj %>%
  subset(start = length(launched_proj)-1500) %>%
  tbats()-> fit2

fc2 <- forecast(fit2)

autoplot(fc2) +xlab("Days") -> p2

grid.arrange(p1, p2, nrow = 2)

According to the forecasting plots above, both ETS and TBATS models show the weekly and yearly seasonality, and the forecasts are reasonable. However, when considering the prediction interval, ETS has too much wider prediction interval which may give negative values. Therefore, the TBATS model is the best fit for this time series.

checkresiduals(fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from TBATS
## Q* = 580.54, df = 267.2, p-value < 2.2e-16
## 
## Model df: 33.   Total lags used: 300.2

fc2$model

## TBATS(0.008, {2,2}, -, {<7,3>, <365.25,6>})
## 
## Call: tbats(y = .)
## 
## Parameters
##   Lambda: 0.007677
##   Alpha: 0.06056698
##   Gamma-1 Values: -0.0005109216 0.000345694
##   Gamma-2 Values: -0.0002019023 2.987982e-05
##   AR coefficients: 0.72583 -0.183192
##   MA coefficients: -0.31963 -0.004066
## 
## Seed States:
##               [,1]
##  [1,]  4.795724464
##  [2,] -0.431193817
##  [3,] -0.291501070
##  [4,] -0.110092587
##  [5,]  0.260500265
##  [6,]  0.119603253
##  [7,]  0.044829461
##  [8,] -0.076588296
##  [9,] -0.053762266
## [10,]  0.025025470
## [11,]  0.039507469
## [12,]  0.044201968
## [13,]  0.063631424
## [14,] -0.042017089
## [15,] -0.094896707
## [16,] -0.107621035
## [17,] -0.078048693
## [18,] -0.026116379
## [19,] -0.001684443
## [20,]  0.000000000
## [21,]  0.000000000
## [22,]  0.000000000
## [23,]  0.000000000
## attr(,"lambda")
## [1] 0.007676753
## 
## Sigma: 0.2161688
## AIC: 21340.4

Although the short-term forecasts look reasonable, the residuals demonstrate that there is information left unexplained by the model. More sophisticated model may provide better results.

3. Number of Successful Projects

3.1 Exploratory Data Analysis

According to the histogram, the distribution of daily successful projects is centered around 50 with slightly skewness. The average is 43.69 daily. In the previous section, we know that the average projects launched are 119.5. So roughly calculating, the average successful rate is around 37%.


df_success <- df[df$state == "successful",]

df_pledged <- aggregate(df_success$usd_pledged_real, by = list(Deadline = df_success$deadline), FUN = sum)

df_pledged$proj <- count(df_success, deadline)$n

summary(df_pledged$proj)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   29.00   44.00   43.68   59.00  158.00

par(mfrow = c(1,2))

hist(df_pledged$proj,
           main = "Histogram of Daily Successful Projects",
           xlab = "Number of Projects")
boxplot(df_pledged$proj,
        main = "Boxplot of Daily Successful Projects",
        xlab = "Number of Projects")

3.2 Time Series

3.3.1 Trend and Seasonality

According to the time series, there is a strong yearly seasonality and an increasing trend overtime. The correlogram shows the existence of a weekly seasonality. Apart from that, by using the tsclean() function, the variance is reduced.

p1 <- autoplot(ts(df_pledged$proj)) + ggtitle("Time Plot of Successful Projects on Kickstarter") +
  xlab("Days") + ylab("US Dollars")

p2 <- ggAcf(tsclean(ts(df_pledged$proj))) + ggtitle("ACF Plot of Succesful Projects")

p3 <- ggsubseriesplot(tsclean(ts(df_pledged$proj, frequency = 7))) + ggtitle("Subseries Plot of Succesful Projects")

p4 <- autoplot(tsclean(ts(df_pledged$proj))) + ggtitle("Time Plot of Successful Projects (with outlier replacements)") +
  xlab("Days") + ylab("US Dollars")

grid.arrange(p1, p4, grid.arrange(p2, p3, ncol =  2), nrow = 3)

3.3.2 Decomposition

According to the decomposition plots, there is an increasing trend overtime. Weekly seasonality has large variance and is increasing significantly in the beginning 2 years. The yearly seasonality is constant. In this case, the weekly seasonality has relatively narrow ranges compared to the other components, because the weekly seasonality is week.

successful_proj <- tsclean(msts(df_pledged$proj, seasonal.periods=c(7,365.25) ))

autoplot(mstl(successful_proj)) + xlab("Days") + ggtitle("Decomposition of Successful Projects")

3.3.3 Forecast

Given the time series has multiple seasonalities and the seasonalities are changing overtime, ETS method and TBATS method are capable to handle this situation. Since the TBATS model is very slow to train when the time series has long history, we only use the last 1500 data points for this model.

successful_proj %>%
  stlf() %>% autoplot(include = length(successful_proj)-1500) + xlab("Days") + ylab("Number of Projects") -> p1

successful_proj %>%
  subset(start = length(successful_proj)-1500) %>%
  tbats()-> fit2

fc2 <- forecast(fit2)

autoplot(fc2) +xlab("Days") + ylab("Number of Projects") -> p2

grid.arrange(p1, p2, nrow = 2)

According to the forecasting plots above, both ETS and TBATS models show the weekly and yearly seasonality, and the forecasts are reasonable. The TBATS model has more smooth results, but it losses the representation of large variance of the original time series. Further, the prediction interval of TBATS model is wider than the ETS model.

Though the prediction interval of the ETS model has some negative values, the ETS model has better representation of the original time series. Considering we will only use the forecasting values instead of the prediction internal, the ETS model is the best choice.

checkresiduals(stlf(successful_proj))

## Warning in checkresiduals(stlf(successful_proj)): The fitted degrees of
## freedom is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 1782.6, df = 611.4, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 613.4

successful_proj_fc <- stlf(successful_proj)$mean

Although the short-term forecasts look reasonable, the residuals demonstrate that there is information left unexplained by the model. More sophisticated model may provide better results.

4. Number of Backers

4.1 Exploratory Data Analysis

According to the histogram, the distribution is dramatically right skewed, which is very similar with the pledged amount. Thus, we may expect that the correlation between these two variables will be stronger. For successful projects, Kickstarter has an average of 11527 backers, and the median is 8773.


df_pledged$backers <- aggregate(df_success$backers, by = list(Deadline = df_success$deadline), FUN = sum)$x

summary(df_pledged$backers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2    3020    8773   11527   15550  231312

par(mfrow = c(1,2))

hist(df_pledged$backers,
           main = "Histogram of Number of Backers",
           xlab = "Number of Backers")
boxplot(df_pledged$proj,
        main = "Histogram of Number of Backers",
        xlab = "Number of Backers")

4.2 Time Series

4.2.1 Trend and Seasonality

According to the time plot, the number of backers varies significantly overtime. The correlogram shows a strong weekly seasonality and the subseries plot also shows a slightly increasing trend overtime. The yearly seasonality is not clear. There are several outliers existing in this time series. By using tsclean() function, we can replace the outliers with reasonable values, resulting in small variance.

p1 <- autoplot(ts(df_pledged$backers)) + ggtitle("Time Plot of Number of Backers on Kickstarter") + xlab("Days") + ylab("Number of Backers")

p2 <- ggAcf(tsclean(ts(df_pledged$backers))) + ggtitle("ACF Plot of Number of Backers")

p3 <- ggsubseriesplot(tsclean(ts(df_pledged$backers, frequency = 7))) + ggtitle("Subseries Plot of Number of Backers")

p4 <- autoplot(tsclean(ts(df_pledged$backers))) + ggtitle("Time Plot of Number of Backers (with outlier replacements)") +
  xlab("Days") + ylab("US Dollars")

grid.arrange(p1, p4, grid.arrange(p2, p3, ncol = 2), nrow = 3)

4.2.2 Decomposition

The decomposition results show an increasing trend. The weekly seasonality changes dramatically and the yearly seasonality changes a little bit. Since the variance of this time series is significant, thus, the decomposition remainder is large as well. In this case, the weekly seasonality has relatively narrow ranges compared to the other components, because the weekly seasonality is weak.

backers <- tsclean(msts(df_pledged$backers, seasonal.periods=c(7,365.25)))

autoplot(mstl(backers)) + xlab("Days") + ggtitle("Decomposition of Number of Backers")

4.2.3 Forecast

fcast1 <- stlf(backers)
p1 <- autoplot(fcast1, include = length(backers)-1500) +
  xlab("Days") + ylab("Number of Backers")

fit2 <- tbats(subset(backers, start = length(backers)-1500))
fcast2 <- forecast(fit2)
p2 <- autoplot(fcast2, include = length(backers) - 1500) +
  xlab("Days") + ylab("Number of Backers")

grid.arrange(p1, p2, nrow = 2)

According to the forecasting plots above, both ETS and TBATS models show the weekly and yearly seasonality, and the forecasts are reasonable. The ETS model has some forecasts near zero, which is not reasonable for this case. On the other hand, the TBATS model has more smooth results, but it losses the representation of large variance of the original time series. Further, the prediction interval of TBATS model is wider than the ETS model.

checkresiduals(fcast2)

## 
##  Ljung-Box test
## 
## data:  Residuals from TBATS(0.008, {5,1}, -, {<7,3>, <365.25,6>})
## Q* = 327.81, df = 263.2, p-value = 0.004117
## 
## Model df: 37.   Total lags used: 300.2

backers_fc <- fcast2$mean

According to the residual plots, the residual distribution is centered around zero. Though there is some autocorrelation left in the residuals, the overall residual is acceptable. Therefore, the TBATS model is the best choice.

5. Daily Pledged Amount

5.1 Exploratory Data Analysis

The amount of successful pledges has large variance with lots of outliers, as illustrated in the boxplot. The distribution is dramatically right skewed. The average amount that Kickstarter collects for successful projects is 990,182 US dollars in one day, and the median is 677,253 US dollars.


summary(df_pledged$x)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##       35   221292   677253   990182  1296458 21255947

par(mfrow = c(1,2))
hist(df_pledged$x,
           main = "Histogram of Daily Pledged US Dollars",
           xlab = "US Dollars")
boxplot(df_pledged$x,
        main = "Boxplot of Daily Pledged US Dollars",
        xlab = "US Dollars")

5.2 Time Series

5.2.1 Trend and Seasonality

Amount of Successful Pledged US Dollars

According to the time plot, the amount of successful pledges varies significantly overtime. The correlogram shows a strong weekly seasonality and the subseries plot also shows a slightly increasing trend overtime. The yearly seasonality is not clear. There are several outliers existing in this time series. By using tsclean() function, we can replace the outliers with reasonable values, resulting in small variance.

p1 <- autoplot(ts(df_pledged$x)) + ggtitle("Time Plot of Successful Pledged US Dollars on Kickstarter") +
  xlab("Days") + ylab("US Dollars")

p2 <- ggAcf(tsclean(ts(df_pledged$x))) + ggtitle("ACF Plot of Successful Pledged US Dollars")

p3 <- ggsubseriesplot(tsclean(ts(df_pledged$x, frequency = 7))) + ggtitle("Subseries Plot of Successful Pledged US Dollars")

p4 <- autoplot(tsclean(ts(df_pledged$x))) + ggtitle("Time Plot of Successful Pledged US Dollars (with outlier replacements)") +
  xlab("Days") + ylab("US Dollars")

grid.arrange(p1, p4, grid.arrange(p2, p3, ncol =  2), nrow = 3)

5.2.2 Decomposition

The decomposition results show an increasing trend. The weekly seasonality changes dramatically and the yearly seasonality changes a little bit. In this case, the weekly and yearly seasonality have relatively narrow ranges, compared to the trend component. The reminders are small, which means the decomposition is good.

pledged <- tsclean(msts(df_pledged$x, seasonal.periods=c(7,365.25)))

autoplot(mstl(pledged)) + xlab("Days") + ggtitle("Decomposition of Successful Pledged US Dollars")

5.2.3 Covariance

It is expected that when the number of successful projects increases, Kickstarter will pledge more money.Same for the number of backers, when there are more supporters, Kickstarters will collect more money.

According to the scatter plot, there is some covariance between the number of successful projects and pledged US Dollars. But the correlation is not very strong. On the other hand, there is a strong correlation between number of backers and pledged amount. Therefore, in the forecasting, we use the number of backers as a predictor.

p1 <- ggplot(df_pledged, aes(x = proj, y = x)) + geom_point() +
    xlab("Number of Successful Projects") +
    ylab("Pledged US Dollars")

p2 <- ggplot(df_pledged, aes(x = backers, y = x)) + geom_point() +
  xlab("Number of Backers") +
  ylab("Pledged US Dollars")

grid.arrange(p1, p2, ncol = 2)

5.2.4 Forecast

First, we build a dynamic regression model with number of backers as a predictor. For forecasting, we use the forecasting results of successful projects from the previous section.

According to the forecasting plot, the results show the seasonality. However, the variance is too small compared to the original time series. Because the seasonality is changing overtime. Moreover, the prediction interval is too wide. Further, the residual plots show that there is lots of information unexplained. Thus, the dynamic regression model is not a good fit.

fit1 <- auto.arima(pledged, xreg = df_pledged$backers)
fcast1 <- forecast(fit1, xreg = backers_fc)

autoplot(fcast1, include = length(pledged)-1500) +
  ggtitle("Dynamic Regression Model with Predictor") +
  xlab("Days") + ylab("Pledged US Dollars")

checkresiduals(fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(4,1,1) errors
## Q* = 3605.2, df = 607.4, p-value < 2.2e-16
## 
## Model df: 6.   Total lags used: 613.4

Given adding a predictor into a forecasting model may not increase the performance, in this section, we build two forecast models using the variable of pledged US Dollars only.

fcast2 <- stlf(pledged)
p1 <- autoplot(fcast2, include = length(successful_proj)-1500) +
  xlab("Days") + ylab("Pledged US Dollars")

fit3 <- tbats(subset(pledged, start = length(pledged) - 1500))
fcast3 <- forecast(fit3)
p2 <- autoplot(fcast3, include = length(successful_proj)-1500) +
  xlab("Days") + ylab("Pledged US Dollars")

grid.arrange(p1, p2, nrow = 2)

checkresiduals(fcast2)

## Warning in checkresiduals(fcast2): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 1428.1, df = 611.4, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 613.4

According to the forecasting plots above, both ETS and TBATS models show the weekly and yearly seasonality, and the forecasts are reasonable. The ETS model has better representative of the original patterns, and the prediction interval is relatively small. The TBATS forecasts are more smooth, but the prediction interval is dramatically large. Moreover, even though the residual plots show that there is information unexplained, especially a yearly seasonality, in the residual, the overall plots meet the requirements. Therefore, the ETS model is a good fit.

6 Other Analysis

6.1 Basic Statistics


#Total number of projects launched
nrow(df)
## [1] 378654

#Total number of successful projects
sum(df$state == "successful")
## [1] 133956

#Total amount of pledges (US Dollars)
sum(df[df$state == "successful",]$usd_pledged_real)
## [1] 3036889046

#Total number of backers
sum(df$backers)
## [1] 39993219

6.2 Main Category

#Number of main categories
nlevels(df$main_category)

## [1] 15

#Number of launched projects in each categories
df_category <- count(df, main_category)

#Number of successful projects in each categories
df_category$successful <- count(df[df$state == "successful",], main_category)$n

#Successful Ratio
df_category$successful_ratio <- df_category$successful/df_category$n

df_category

## # A tibble: 15 x 4
##    main_category     n successful successful_ratio
##    <fct>         <int>      <int>            <dbl>
##  1 Art           28152      11510            0.409
##  2 Comics        10819       5842            0.540
##  3 Crafts         8809       2115            0.240
##  4 Dance          3768       2338            0.620
##  5 Design        30069      10550            0.351
##  6 Fashion       22816       5593            0.245
##  7 Film & Video  63583      23623            0.372
##  8 Food          24602       6085            0.247
##  9 Games         35231      12518            0.355
## 10 Journalism     4755       1012            0.213
## 11 Music         51917      24197            0.466
## 12 Photography   10779       3305            0.307
## 13 Publishing    39873      12300            0.308
## 14 Technology    32569       6434            0.198
## 15 Theater       10912       6534            0.599

ggplot(df, aes(main_category)) + geom_bar(aes(fill = state)) +
  xlab("Main Categories") + ylab("Number of projects launched") +
  ggtitle("Histogram of Projects Launched on Kickstarter")

DSCI-FXP-SP19- Project #3 Kickstarter Projects Forecast

Yiling He

April 25, 2019