Over the years, major crowdfunding platforms such as Kickstarter, GoFundMe, RocketHub and CircleUp have come under scrutiny for their unconventional means of investment. Nathan Resnick, in his article ‘Why Kickstarter Is Corrupted’, describes how paid advertising, investor-backed campaigns and the involvement of crowdfunding agencies have changed what Kickstarter used to stand for. In such challenging times, this project analyzes the projects launched on Kickstarter to identify the keys to a successful Kickstarter project. Though success on Kickstarter does not guarantee product success in the market, backers could use this analysis to judge whether projects are investment-worthy.
The data used for the analysis was collected from Kickstarter.com and is available on Kaggle. The raw dataset contains details about 378,661 projects launched between 2009 and 2017, and presents many details that can be used to predict the final state of a Kickstarter project.
Analysis Methodology:
My analysis of the Kickstarter Projects is broken down into the following sections.
The following steps were performed in the data cleaning/wrangling stage to prepare the data in a format suitable for analysis.
In the Data exploration stage, the response and predictor variables were analyzed by plotting univariate and bi-variate charts. The descriptive analysis forms a baseline to understand which variables could potentially be important in the predictive analysis.
With the data exploration completed, the modeling step was attempted. Classification algorithms such as Logistic Regression, GAM, Classification Trees and Random Forest were applied to the dataset to identify the projects that will be successful on Kickstarter.
The following packages were used for the analysis; the purpose of each package is noted alongside.
library(dplyr) # Data wrangling tasks
library(tidyr) # Data wrangling tasks
library(ggplot2) # Plotting/ Data visualization tasks
library(lubridate) # Date/ Time manipulation
library(magrittr) # Pipe operator
library(corrplot) # Correlation function
library(formattable) # Data Preview section
library(knitr) # Data Preview section
library(broom) # Glance function
library(boot) # Bootstrapping function
library(glmnet) # Cross validation function
library(mgcv) # GAM function
library(verification) # ROC plots
library(rpart) # Classification Tree
library(rpart.plot) # Classification Tree plot
library(caret) # Confusion Matrix function
library(randomForest) # Random Forest function
The Data Preparation section contains the logic and the steps performed in bringing the data to a form suitable for analysis. Each tab below explains the related steps. The cleaned data can be previewed at the Preview tab with details about variables presented in the Data Description tab.
Dataset:
As noted above, the data was collected from Kickstarter.com and downloaded from Kaggle, where the data dictionary and other details about the dataset can also be found. The raw dataset contains 378,661 projects launched between 2009 and 2017.
Import:
I downloaded the dataset from Kaggle and read it as a data frame using the read.csv() function.
ks <- read.csv("ks-projects-201801.csv")
str(ks)
## 'data.frame': 378661 obs. of 15 variables:
## $ ID : int 1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
## $ name : Factor w/ 375765 levels ""," IT’S A HOT CAPPUCCINO NIGHT ",..: 332493 135633 364946 344770 77274 206067 293430 69281 284103 290686 ...
## $ category : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 56 124 59 42 114 40 ...
## $ main_category : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 7 8 8 8 5 7 ...
## $ currency : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 14 14 14 ...
## $ deadline : Factor w/ 3164 levels "2009-05-03","2009-05-16",..: 2288 3042 1333 1017 2247 2463 1996 2448 1790 1863 ...
## $ goal : num 1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
## $ launched : Factor w/ 378089 levels "1970-01-01 01:00:00",..: 243292 361975 80409 46557 235943 278600 187500 274014 139367 153766 ...
## $ pledged : num 0 2421 220 1 1283 ...
## $ state : Factor w/ 6 levels "canceled","failed",..: 2 2 2 2 1 4 4 2 1 1 ...
## $ backers : int 0 15 3 1 14 224 16 40 58 43 ...
## $ country : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 23 23 23 ...
## $ usd.pledged : num 0 100 220 1 1283 ...
## $ usd_pledged_real: num 0 2421 220 1 1283 ...
## $ usd_goal_real : num 1534 30000 45000 5000 19500 ...
The raw Kickstarter Projects data has 378,661 observations and 15 variables.
1. Initial data type conversions:
Relevant columns were converted to appropriate data types: ID and name were converted to ‘character’, and the deadline and launched dates to ‘Date’.
ks$ID <- as.character(ks$ID)
ks$name <- as.character(ks$name)
ks$deadline <- as.Date(ks$deadline)
ks$launched <- as.Date(ks$launched)
2. Re-ordering the dataset variables:
The variables were re-ordered into a more meaningful sequence, grouping related fields together.
ks <- ks[, c(1:4, 12, 8, 6, 5, 7, 9, 11, 13:15, 10)]
3. Data sub-setting:
The target variable in the analysis, State, the Final state of the Kickstarter (KS) Project has six levels – Failed (52%), Successful (35%) and Others (Cancelled, Live, Suspended and Undefined, 12%).
ggplot(ks, aes(state)) +
geom_bar() +
ylab("# of Projects") + xlab("Final State") +
ggtitle("Final State of the Kickstarter projects")
For the purpose of this project, only the projects with ‘Failed’ and ‘Successful’ states will be considered.
ks.proj <- ks %>% filter(state == "failed" | state == "successful")
# Re-factor to drop the now-unused levels (canceled, live, suspended, undefined)
ks.proj$state <- as.character(ks.proj$state)
ks.proj$state <- as.factor(ks.proj$state)
summary(ks.proj$state)
## failed successful
## 197719 133956
4. Check for duplicates:
There were no duplicated observations in the dataset. Hence, no changes were made in this step.
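The check itself is not shown in the original code; a minimal version, assuming duplicates are screened on full rows and on the ID field, could look like this:
# Count fully duplicated rows and duplicated project IDs (both expected to be 0)
sum(duplicated(ks.proj))
sum(duplicated(ks.proj$ID))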
5. Check for missing values:
Only 0.063% of the observations had missing values. As these were few, the affected rows were stored in a temporary dataset and removed. The final dataset contains 331,465 observations.
sum(is.na(ks.proj))
## [1] 210
colSums(is.na(ks.proj))
## ID name category main_category
## 0 0 0 0
## country launched deadline currency
## 0 0 0 0
## goal pledged backers usd.pledged
## 0 0 0 210
## usd_pledged_real usd_goal_real state
## 0 0 0
# As it's a very small fraction (0.063%), these rows are removed (kept aside in 'missing')
missing <- ks.proj %>% filter(is.na(usd.pledged))
ks.proj <- na.omit(ks.proj)
6. Feature Creation
The variable ‘Duration’ was created as the difference (in days) between the deadline and launched date. Further, the deadline and launched date fields were separated into the respective year and month as variables, deadline_year, deadline_month and launched_year, launched_month.
ks.proj$duration <- as.numeric(ks.proj$deadline - ks.proj$launched)
ks.proj <- ks.proj %>%
separate(col = "deadline", into = c("deadline_year", "deadline_month", "deadline_day"), sep = "-") %>%
separate(col = "launched", into = c("launched_year", "launched_month", "launched_day"), sep = "-")
## 'data.frame': 331465 obs. of 18 variables:
## $ ID : chr "1000002330" "1000003930" "1000004038" "1000007540" ...
## $ name : chr "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
## $ category : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 124 59 42 96 73 33 ...
## $ main_category : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 8 8 8 13 11 3 ...
## $ country : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 4 23 23 ...
## $ launched_year : chr "2015" "2017" "2013" "2012" ...
## $ launched_month : chr "08" "09" "01" "03" ...
## $ deadline_year : chr "2015" "2017" "2013" "2012" ...
## $ deadline_month : chr "10" "11" "02" "04" ...
## $ duration : num 59 60 45 30 35 20 45 30 30 30 ...
## $ currency : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 2 14 14 ...
## $ goal : num 1000 30000 45000 5000 50000 1000 25000 2500 12500 5000 ...
## $ pledged : num 0 2421 220 1 52375 ...
## $ backers : int 0 15 3 1 224 16 40 0 100 0 ...
## $ usd.pledged : num 0 100 220 1 52375 ...
## $ usd_pledged_real: num 0 2421 220 1 52375 ...
## $ usd_goal_real : num 1534 30000 45000 5000 50000 ...
## $ state : Factor w/ 2 levels "failed","successful": 1 1 1 1 2 2 1 1 2 1 ...
7. Feature Engineering:
The dataset contains many categorical variables with multiple levels, several of which have too few observations. To reduce the number of parameters in the predictive analysis, such levels were consolidated.
a. Country:
KS projects can be launched in 22 countries. However, 94% of the projects launched between 2009 and 2017 came from the US, the UK, Canada and Australia. Projects launched in the remaining countries (Japan, Hong Kong, Singapore, New Zealand, Denmark, Norway, Sweden, the Netherlands, Ireland, Spain, France, Germany, Austria, Italy, Belgium, Luxembourg, Switzerland and Mexico), which together account for about 6% of the dataset, were grouped into a level ‘Other’.
# Reducing levels in Country: group low-volume countries into 'Other'
ks.proj$country <- as.character(ks.proj$country)
ks.proj$country[ks.proj$country %in% c("JP", "LU", "AT", "HK", "SG", "BE", "CH", "IE", "NO", "DK",
"MX", "NZ", "SE", "ES", "IT", "NL", "FR", "DE")] <- "Other"
ks.proj$country <- as.factor(ks.proj$country)
levels(ks.proj$country) # 5 levels
## [1] "AU" "CA" "GB" "Other" "US"
sort(round(prop.table(table(ks.proj$country)),2))
##
## AU CA Other GB US
## 0.02 0.04 0.07 0.09 0.79
b. Launched Year:
The dataset records all KS projects launched between 2009 and 2017. However, only 10% of the projects were launched before 2012. Thus, a new level ‘Before 2012’ was created to include all projects launched in 2009, 2010 and 2011.
# Convert the date-component fields created by separate() from character to factor
# (conversion assumed here; the later structure output shows these fields as factors)
ks.proj <- ks.proj %>% mutate_at(c("launched_year", "launched_month", "deadline_year", "deadline_month"), as.factor)
levels(ks.proj$launched_year) # 9 levels
## [1] "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
round(prop.table(table(ks.proj$launched_year)),2)
##
## 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 0.00 0.03 0.07 0.12 0.12 0.18 0.20 0.15 0.13
# Reducing levels in Launched Year
ks.proj$launched_year <- as.character(ks.proj$launched_year)
ks.proj$launched_year[ks.proj$launched_year %in% c("2009", "2010", "2011")] <- "Before 2012"
ks.proj$launched_year <- as.factor(ks.proj$launched_year)
c. Currency:
Fourteen different currencies appear among the KS projects in the dataset. However, 98% of the projects were denominated in US Dollars, British Pounds, Euros, Canadian Dollars or Australian Dollars. The remaining currencies were grouped under the label ‘Other’.
levels(ks.proj$currency)
## [1] "AUD" "CAD" "CHF" "DKK" "EUR" "GBP" "HKD" "JPY" "MXN" "NOK" "NZD"
## [12] "SEK" "SGD" "USD"
sort(round(prop.table(table(ks.proj$currency)),2))
##
## CHF DKK HKD JPY MXN NOK NZD SEK SGD AUD CAD EUR GBP USD
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.04 0.04 0.09 0.79
# Reducing levels in Currency
ks.proj$currency <- as.character(ks.proj$currency)
ks.proj$currency[ks.proj$currency %in% c("JPY", "HKD", "SGD", "CHF", "NOK", "DKK", "MXN", "NZD",
"SEK")] <- "Other"
ks.proj$currency <- as.factor(ks.proj$currency)
Listed below are the first 50 entries of the Kickstarter Projects data. Each row is a KS project, with details about the project listed in the columns.
head(ks.proj, n = 50) %>%
formattable() %>%
as.datatable(options = list(dom = 't',scrollX = TRUE,scrollCollapse = TRUE))
Details about the Variables in the Dataset are provided below.
variable.type <- lapply(ks.proj, class)
variable.description <- c("ID of Kickstarter Project", "Name of Kickstarter Project",
"Category of Kickstarter Project", "Main Category of Kickstarter Project",
"Country where Kickstarter Project was launched",
"Year when Kickstarter Project was launched",
"Month when Kickstarter Project was launched",
"Year when Kickstarter Project ended",
"Month when Kickstarter Project ended",
"Active Duration of the Kickstarter Project", "Currency of amount pledged",
"Goal of Kickstarter Project in original currency",
"Total amount pledged in original currency", "Number of Backers",
"Conversion in USD of the pledged column (conversion done by Kickstarter)",
"Conversion in USD of the pledged column (conversion from Fixer.io API)",
"Conversion in USD of the goal column (conversion from Fixer.io API)",
"Final State of Kickstarter Project")
variable.name <- colnames(ks.proj)
ks_datadesc <- as_data_frame(cbind(variable.name, variable.type, variable.description))
colnames(ks_datadesc) <- c("Variable Name","Data Type","Variable Description")
kable(ks_datadesc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| ID | character | ID of Kickstarter Project |
| name | character | Name of Kickstarter Project |
| category | factor | Category of Kickstarter Project |
| main_category | factor | Main Category of Kickstarter Project |
| country | factor | Country where Kickstarter Project was launched |
| launched_year | factor | Year when Kickstarter Project was launched |
| launched_month | factor | Month when Kickstarter Project was launched |
| deadline_year | factor | Year when Kickstarter Project ended |
| deadline_month | factor | Month when Kickstarter Project ended |
| duration | numeric | Active Duration of the Kickstarter Project |
| currency | factor | Currency of amount pledged |
| goal | numeric | Goal of Kickstarter Project in original currency |
| pledged | numeric | Total amount pledged in original currency |
| backers | integer | Number of Backers |
| usd.pledged | numeric | Conversion in USD of the pledged column (conversion done by Kickstarter) |
| usd_pledged_real | numeric | Conversion in USD of the pledged column (conversion from Fixer.io API) |
| usd_goal_real | numeric | Conversion in USD of the goal column (conversion from Fixer.io API) |
| state | factor | Final State of Kickstarter Project |
The analysis of the Kickstarter Projects data can be found under the following tabs.
The Descriptive Analysis is sectionally divided into Univariate and Bivariate explorations. The correlations between the predictors can be found in the last tab.
Before we start the Descriptive Analysis, let’s look at the structure of the cleaned dataset, ks.proj, and understand its dimensions.
str(ks.proj)
## 'data.frame': 331465 obs. of 18 variables:
## $ ID : chr "1000002330" "1000003930" "1000004038" "1000007540" ...
## $ name : chr "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
## $ category : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 124 59 42 96 73 33 ...
## $ main_category : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 8 8 8 13 11 3 ...
## $ country : Factor w/ 5 levels "AU","CA","GB",..: 3 5 5 5 5 5 5 2 5 5 ...
## $ launched_year : Factor w/ 7 levels "2012","2013",..: 4 6 2 1 5 3 5 2 2 3 ...
## $ launched_month : Factor w/ 12 levels "01","02","03",..: 8 9 1 3 2 12 2 9 3 9 ...
## $ deadline_year : Factor w/ 10 levels "2009","2010",..: 7 9 5 4 8 6 8 5 5 6 ...
## $ deadline_month : Factor w/ 12 levels "01","02","03",..: 10 11 2 4 4 12 3 10 4 10 ...
## $ duration : num 59 60 45 30 35 20 45 30 30 30 ...
## $ currency : Factor w/ 6 levels "AUD","CAD","EUR",..: 4 6 6 6 6 6 6 2 6 6 ...
## $ goal : num 1000 30000 45000 5000 50000 1000 25000 2500 12500 5000 ...
## $ pledged : num 0 2421 220 1 52375 ...
## $ backers : int 0 15 3 1 224 16 40 0 100 0 ...
## $ usd.pledged : num 0 100 220 1 52375 ...
## $ usd_pledged_real: num 0 2421 220 1 52375 ...
## $ usd_goal_real : num 1534 30000 45000 5000 50000 ...
## $ state : Factor w/ 2 levels "failed","successful": 1 1 1 1 2 2 1 1 2 1 ...
The cleaned dataset has 331,465 observations and 18 variables. The target variable, ‘state’, is categorical, while the predictors are a mix of numerical and categorical variables.
Response Variable: State
The response variable, State, indicates which of the projects Kickstarter deemed successful. Of the 331,465 projects listed in the dataset, 40% of the projects were successful, while the rest failed.
# Final State of the KS project
ggplot(ks.proj, aes(state, fill = state)) +
geom_bar() +
ylab("# of Projects") + xlab("Final State") +
theme(legend.position = "bottom") +
ggtitle("Final State of the Kickstarter projects")
Predictor Variables:
1. Main Category
There are 15 different categories of KS projects listed in the dataset, of which Film & Video, Music and Publishing collectively make up more than 40% of the projects. The figures below show the distribution of KS projects and the success/failure rate across categories.
# 1. Main Categories present in the dataset
p2 <- ggplot(ks.proj, aes(x = main_category, fill = state)) +
geom_bar() +
coord_flip() +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("") +
ggtitle("Main categories of the KS Projects")
# Main categories of the KS Projects - percent to whole
p1 <- ks.proj %>%
count(main_category) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(reorder(main_category, pct), pct)) +
geom_col() +
coord_flip() +
ylab("% of projects") + xlab("") +
ggtitle("Main categories of the KS Projects")
gridExtra::grid.arrange(p1, p2, ncol = 2)
2. Country
Of the 22 countries where KS projects can be launched, 94% of the projects came from the US, the UK, Canada and Australia. Figure 4 shows the distribution of KS projects across countries, along with the associated success rates.
# KS projects by country
ggplot(ks.proj, aes(x = country, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
scale_y_continuous(labels = scales::comma) +
ylab("Number of projects") + xlab("") +
ggtitle("KS Projects by country")
3. Launched Year
Figure 5 shows the distribution of the KS projects based on the project launch year. There’s a clear peak in the number of projects that were launched in 2014 and 2015, of which about 30% were successful.
ggplot(ks.proj, aes(x = launched_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("Year launched") +
ggtitle("KS projects by Year")
4. Launched Month
Figure 6 shows the distribution over time when the KS projects were launched. There was a strong peak between July and August in 2014 and 2015.
df1 <- as.data.frame(table(ks.proj$launched_year, ks.proj$launched_month))
names(df1) <- c('Launched_Year','Launched_Month', 'Freq')
df1 <- df1 %>% group_by(Launched_Year) %>% arrange(desc(Freq))
ggplot(df1, aes(x = Launched_Year, y = `Freq`, fill = Launched_Month)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("KS Projects launched over time") +
scale_y_continuous("# of Projects launched", labels = scales::comma) +
scale_x_discrete("Year")
5. Deadline Year
Figure 7 shows the distribution of the project deadlines over time. Because its distribution closely mirrors those of the launched year and duration variables, the usefulness of this variable is questionable.
ggplot(ks.proj, aes(x = deadline_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("Deadline Year") +
ggtitle("KS projects by Deadline")
6. Active Duration
The figures show the distribution of the variable: both successful and failed projects have a typical duration of around 30 days, with many outliers.
p1 <- ggplot(ks.proj, aes(duration, fill = state)) +
geom_histogram(binwidth = 5) +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("Duration") +
ggtitle("Duration of the KS projects (in days)")
p2 <- ggplot(ks.proj, aes(x = state, y = duration, fill = state)) +
geom_boxplot() +
theme(legend.position = "bottom") +
coord_flip() +
xlab("") + ylab("Duration") +
ggtitle("Active Duration of the KS projects")
gridExtra::grid.arrange(p1, p2, ncol = 2)
7. Currency
Across the 22 countries where KS projects can be launched, 14 unique currencies were used; 98% of the projects were denominated in US Dollars, British Pounds, Euros, Canadian Dollars or Australian Dollars. Figure 9 shows the distribution of KS projects across currencies and the associated success rates. As this plot closely resembles the distribution of the country variable, it might not add much information.
ggplot(ks.proj, aes(x = currency, fill = state)) +
geom_bar() +
scale_y_continuous(labels = scales::comma) +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("") +
ggtitle("Currency of the KS Projects")
8. Backers
Table 4 shows the quantile distribution of the ‘backers’ variable. The distribution is highly skewed, with a very large variance: half of the KS projects had 15 or fewer backers.
quantile(ks.proj$backers, probs = seq(from = 0, to = 1, by = .1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 0 0 1 3 7 15 28 48 84 183
## 100%
## 219382
To help understand the distribution better, a log transformation was applied (Figures 10a, 10b). The ‘backers’ variable shows a large difference between successful and failed KS projects and hence might be a useful variable in the predictive analysis.
# Log transforming the backers field shows the distribution better
# (log(0) is -Inf for zero-backer projects; ggplot drops these values with a warning)
p1 <- ggplot(ks.proj, aes(log(backers), fill = state)) +
geom_density() +
theme(legend.position = "bottom") +
ylab("Density") + xlab("# of Backers (log-transformed)") +
ggtitle("# of Backers of the KS projects")
p2 <- ggplot(ks.proj, aes(x = state, y = log(backers), fill = state)) +
geom_boxplot() +
coord_flip() +
theme(legend.position = "bottom") +
ylab("# of Backers (log-transformed)") + xlab("") +
ggtitle("# of Backers of the KS projects (Log)")
gridExtra::grid.arrange(p1, p2, ncol = 2)
9. Amount Pledged
Table 5 shows the quantile distribution of the ‘amount pledged’ variable. As with ‘backers’, the distribution is highly skewed and the variance is large. From the quantile table, about 20% of the projects had less than 20 USD pledged, while another 20% had more than 6,300 USD pledged.
quantile(ks.proj$usd_pledged_real, probs = seq(from = 0, to = 1, by = .2))
## 0% 20% 40% 60% 80%
## 0.000 20.000 307.982 1681.000 6305.594
## 100%
## 20338986.270
Figures 11a, 11b show the distribution better by using a log transformation. As the ‘Amount Pledged’ variable also shows a large difference between successful and failed KS projects, it might be a useful predictor in the analysis.
# Log transforming the usd_pledged_real field shows the distribution better
p1 <- ggplot(ks.proj, aes(log(usd_pledged_real), fill = state)) +
geom_density() +
theme(legend.position = "bottom") +
xlab("USD pledged (log-transformed)") + ylab("") +
ggtitle("USD pledged for the KS projects")
# Log-transformed usd_pledged_real
p2 <- ggplot(ks.proj, aes(x = state, y = log(usd_pledged_real), fill = state)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ylab("USD pledged (log-transformed)") + xlab("") +
scale_y_continuous(labels = scales::comma) +
coord_flip() +
ggtitle("USD pledged for the KS projects (Log)")
gridExtra::grid.arrange(p1, p2, ncol = 2)
10. Goal
The table below shows the quantile distribution of the ‘goal’ variable. As with the ‘backers’ and ‘amount pledged’ variables, the distribution is highly skewed and the variance is large. Figures 12a, 12b show the distribution using a log transformation.
quantile(ks.proj$usd_goal_real, probs = seq(from = 0, to = 1, by = .2))
## 0% 20% 40% 60% 80%
## 0.01 1500.00 3975.00 8000.00 20000.00
## 100%
## 166361390.71
# Log transforming the goal field shows the distribution better
p1 <- ggplot(ks.proj, aes(log(usd_goal_real), fill = state)) +
geom_density() +
theme(legend.position = "bottom") +
xlab("Goal in USD (log-transformed)") + ylab("") +
ggtitle("KS projects' Goal")
# Log-transformed usd_goal_real
p2 <- ggplot(ks.proj, aes(x = state, y = log(usd_goal_real), fill = state)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ylab("Goal in USD (log-transformed)") + xlab("") +
scale_y_continuous(labels = scales::comma) +
coord_flip() +
ggtitle(" Goal of the KS projects (Log)")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Category by Year
Figure 13 shows that, across the years, KS projects in Music and Film & Video have been consistently numerous; no other category stands out as much. This indicates that category alone might not be a strong predictive variable.
ggplot(ks.proj, aes(launched_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
facet_wrap( ~ main_category) +
ylab("Number of Projects") + xlab("Launched Year") +
ggtitle("KS projects launched over time by Category")
Country by Year
Figure 14 shows that, over the years, KS projects have been launched in the US far more than in any other country. The plot suggests that neither country nor launch year is highly indicative of which projects will succeed on Kickstarter.
ggplot(ks.proj, aes(launched_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
facet_wrap( ~ country) +
ylab("Number of Projects") + xlab("Launched Year") +
ggtitle("KS projects launched over time by Country")
USD Pledged vs Backers and Goal vs Backers
Figure 15a shows a strong positive correlation between USD pledged and backers, and Figure 15b shows a mild correlation between goal and backers. The clear separation of the successful and failed outcomes along these variables suggests they might be strong predictors.
p1 <- ggplot(ks.proj, aes(x = log(backers), y = log(usd_pledged_real))) +
geom_jitter(aes(color = state)) +
theme(legend.position = "bottom") +
ylab("Amount pledged (log)") + xlab("Backers (log)") +
ggtitle("KS projects USD Pledged vs Backers")
# Goal vs Backers
p2 <- ggplot(ks.proj, aes(x = log(backers), y = log(usd_goal_real))) +
geom_jitter(aes(color = state)) +
theme(legend.position = "bottom") +
ylab("Goal (log)") + xlab("Backers (log)") +
ggtitle("KS projects' Goal vs Backers")
gridExtra::grid.arrange(p1, p2, ncol = 2)
From the correlation plot, several strong correlations can be identified. The variables ‘goal’ and ‘pledged’ record the goal and the amount pledged in the original currency at the time of launch, while ‘usd_goal_real’ and ‘usd_pledged_real’ are their USD equivalents (conversion provided by the Fixer.io API). Hence, the variables goal, pledged and usd.pledged will be dropped.
# Correlation between all numerical variables
corMat <- cor(ks.proj[, c(10, 12:17)])
corrplot.mixed(corMat,tl.pos = "lt")
The only other high correlation in Figure 16 is the 0.75 correlation between usd_pledged_real and backers, which was also visible in Figure 15a. Hence, the variance inflation factors of the final model will be checked in later steps.
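The column-dropping code itself is not shown in the original write-up. A minimal sketch, assuming the modeling set (named data to match the splitting code below) keeps only the response and the nine predictors that appear in the models that follow:
# Assumed step: keep the response and the nine predictors used in the later models,
# dropping goal, pledged, usd.pledged and the identifier/date-remnant fields
data <- ks.proj %>%
select(state, main_category, country, launched_year, launched_month,
duration, currency, backers, usd_pledged_real, usd_goal_real)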
The below tabs show the approach to modeling and the results of the modeling exercise.
To build a predictive model that differentiates and classifies the KS projects that will or will not succeed, the following modeling approach was used.
Model Building: Modeling techniques such as Logistic Regression, Generalized Additive Models (GAM), Classification Trees and Random Forest were applied to the training set.
Parameter tuning: Cross validation and Grid search methods were used on the training data to identify the best parameter(s) suited for each of the models previously built.
Model evaluation: Model performance was evaluated for each model on the validation set (20% of the dataset), using AUC and accuracy as the criteria, in order to choose the best performing model for this dataset.
Final testing: The performance of the best model chosen in the previous step was evaluated by measuring its accuracy on the test set.
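As a sketch of how both criteria can be computed for any of the fitted models (a hypothetical helper, not part of the original code; roc.area() comes from the verification package loaded earlier):
# Accuracy and AUC for predicted success probabilities at a given cut-off
eval_model <- function(obs, prob, pcut = 0.5) {
  pred <- ifelse(prob >= pcut, 1, 0)
  acc  <- mean(pred == as.numeric(obs == "successful"))
  auc  <- verification::roc.area(as.numeric(obs == "successful"), prob)$A
  c(accuracy = acc, AUC = auc)
}
# e.g. eval_model(validation.data$state, pred.val, pcut = 0.64)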
The data set was split into a 60-20-20 split for Train, Validation and Test set respectively.
# create Training - Test and Validation set (60 - 20 - 20%)
set.seed(12420360)
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(seq(nrow(data)), nrow(data)*cumsum(c(0,spec)), labels = names(spec)))
res = split(data, g)
train.data <- res$train
test.data <- res$test
validation.data <- res$validate
1. Logistic Regression
A full logistic regression model was built with state as the response variable against the nine predictor variables – main category, country, launched year, launched month, duration, currency, backers, USD pledged and USD goal. Step-wise variable selection was used to identify the most important variables in the logistic regression model.
nullmodel <- glm(state~1, data = train.data, family = "binomial")
fullmodel <- glm(state~., data = train.data, family = "binomial")
#Backward Elimination
model.step.b <- step(fullmodel, direction = 'backward')
## Start: AIC=1023.75
## state ~ main_category + country + launched_year + launched_month +
## duration + currency + backers + usd_pledged_real + usd_goal_real
##
##
## Step: AIC=1023.75
## state ~ main_category + launched_year + launched_month + duration +
## currency + backers + usd_pledged_real + usd_goal_real
##
## Df Deviance AIC
## - currency 5 945 1017
## - launched_month 11 959 1019
## - main_category 14 969 1023
## <none> 942 1024
## - duration 1 945 1025
## - backers 1 954 1034
## - launched_year 6 967 1037
## - usd_pledged_real 1 123274 123354
## - usd_goal_real 1 172009 172089
##
## Step: AIC=1016.78
## state ~ main_category + launched_year + launched_month + duration +
## backers + usd_pledged_real + usd_goal_real
##
## Df Deviance AIC
## - launched_month 11 962 1012
## - main_category 14 972 1016
## <none> 945 1017
## - duration 1 948 1018
## - launched_year 6 972 1032
## - backers 1 978 1048
## - usd_pledged_real 1 123334 123404
## - usd_goal_real 1 172259 172329
##
## Step: AIC=1011.69
## state ~ main_category + launched_year + duration + backers +
## usd_pledged_real + usd_goal_real
##
## Df Deviance AIC
## <none> 962 1012
## - duration 1 966 1014
## - main_category 14 994 1016
## - backers 1 969 1017
## - launched_year 6 989 1027
## - usd_pledged_real 1 123393 123441
## - usd_goal_real 1 172432 172480
#Forward Selection
model.step.f <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = 'forward')
## Start: AIC=268224.2
## state ~ 1
##
## Df Deviance AIC
## + backers 1 190220 190224
## + usd_pledged_real 1 208657 208661
## + main_category 14 258852 258882
## + usd_goal_real 1 260086 260090
## + launched_year 6 265066 265080
## + duration 1 265449 265453
## + currency 5 267089 267101
## + country 4 267146 267156
## + launched_month 11 267867 267891
## <none> 268222 268224
##
## Step: AIC=190224
## state ~ backers
##
## Df Deviance AIC
## + usd_goal_real 1 129495 129501
## + main_category 14 177586 177618
## + duration 1 187547 187553
## + launched_year 6 187697 187713
## + currency 5 189436 189450
## + country 4 189513 189525
## + launched_month 11 190088 190114
## + usd_pledged_real 1 190137 190143
## <none> 190220 190224
##
## Step: AIC=129500.7
## state ~ backers + usd_goal_real
##
## Df Deviance AIC
## + usd_pledged_real 1 1036 1044
## + main_category 14 124239 124273
## + launched_year 6 128885 128903
## + duration 1 129119 129127
## + currency 5 129262 129278
## + country 4 129270 129284
## + launched_month 11 129413 129441
## <none> 129495 129501
##
## Step: AIC=1043.69
## state ~ backers + usd_goal_real + usd_pledged_real
##
## Df Deviance AIC
## + launched_year 6 998.11 1018.1
## + main_category 14 992.15 1028.2
## + launched_month 11 1010.69 1040.7
## <none> 1035.69 1043.7
## + country 4 1027.86 1043.9
## + duration 1 1034.22 1044.2
## + currency 5 1027.58 1045.6
##
## Step: AIC=1018.11
## state ~ backers + usd_goal_real + usd_pledged_real + launched_year
##
## Df Deviance AIC
## + main_category 14 965.77 1013.8
## + duration 1 994.31 1016.3
## <none> 998.11 1018.1
## + launched_month 11 976.27 1018.3
## + country 4 993.84 1021.8
## + currency 5 993.24 1023.2
##
## Step: AIC=1013.77
## state ~ backers + usd_goal_real + usd_pledged_real + launched_year +
## main_category
##
## Df Deviance AIC
## + duration 1 961.69 1011.7
## <none> 965.77 1013.8
## + launched_month 11 948.47 1018.5
## + country 4 963.38 1019.4
## + currency 5 962.41 1020.4
##
## Step: AIC=1011.69
## state ~ backers + usd_goal_real + usd_pledged_real + launched_year +
## main_category + duration
##
## Df Deviance AIC
## <none> 961.69 1011.7
## + launched_month 11 944.78 1016.8
## + country 4 959.61 1017.6
## + currency 5 958.55 1018.5
Both forward and backward selection resulted in the same final model. With AIC as the selection criterion, the chosen logistic regression model had the lowest AIC value of 1012, as shown below.
# Forward and backward selection gives the same final model
model.glm <- glm(state ~ main_category + launched_year + duration +
backers + usd_pledged_real + usd_goal_real, data = train.data, family = "binomial")
summary(model.glm)
##
## Call:
## glm(formula = state ~ main_category + launched_year + duration +
## backers + usd_pledged_real + usd_goal_real, family = "binomial",
## data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 6.55
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.592292 0.473563 7.586 3.31e-14 ***
## main_categoryComics 0.364321 0.735345 0.495 0.62029
## main_categoryCrafts -0.298821 0.417253 -0.716 0.47389
## main_categoryDance 1.786427 1.085384 1.646 0.09979 .
## main_categoryDesign -0.038988 0.446695 -0.087 0.93045
## main_categoryFashion 0.514163 0.463273 1.110 0.26706
## main_categoryFilm & Video 0.859109 0.323352 2.657 0.00789 **
## main_categoryFood -1.155341 0.266997 -4.327 1.51e-05 ***
## main_categoryGames -0.469846 0.368808 -1.274 0.20268
## main_categoryJournalism 0.800839 0.882433 0.908 0.36412
## main_categoryMusic 0.430053 0.306353 1.404 0.16038
## main_categoryPhotography -0.216085 0.403612 -0.535 0.59239
## main_categoryPublishing -0.199835 0.294969 -0.677 0.49810
## main_categoryTechnology 0.036450 0.505447 0.072 0.94251
## main_categoryTheater 1.415904 0.622875 2.273 0.02302 *
## launched_year2013 0.338879 0.612214 0.554 0.57990
## launched_year2014 -1.352207 0.414653 -3.261 0.00111 **
## launched_year2015 -1.143311 0.416774 -2.743 0.00608 **
## launched_year2016 -0.987829 0.441421 -2.238 0.02523 *
## launched_year2017 -0.310882 0.478285 -0.650 0.51570
## launched_yearBefore 2012 0.282530 0.532915 0.530 0.59600
## duration -0.013058 0.005980 -2.184 0.02899 *
## backers 0.035993 0.006546 5.498 3.84e-08 ***
## usd_pledged_real 0.234382 0.010890 21.524 < 2e-16 ***
## usd_goal_real -0.234333 0.010899 -21.501 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268222.18 on 198878 degrees of freedom
## Residual deviance: 961.69 on 198854 degrees of freedom
## AIC: 1011.7
##
## Number of Fisher Scoring iterations: 25
The final selected model implies, for example, that launching a new KS project under the ‘Comics’ main category raises its odds of success by about 44% (exp(0.364) ≈ 1.44) relative to the baseline category, ‘Art’. The ‘Odds’ column in Table 7 shows the relative importance of each field.
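Table 7 is not reproduced here; the odds can be recovered from the fitted coefficients, and the variance inflation check flagged in the correlation section can be run on the same model. A short sketch (vif() assumes the car package, which is not loaded above):
# Odds ratios: exp(coefficient) is the multiplicative change in the odds of success
odds <- exp(coef(model.glm))
round(odds["main_categoryComics"], 2) # ~1.44, i.e. ~44% higher odds than the baseline 'Art'
# Variance inflation factors for the final model (assumes the car package)
car::vif(model.glm)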
Parameter Tuning
To tune the parameter, pcut, which is the optimal cut-off probability for classifying projects into the successful or failure class, a three-fold cross-validation method was used on the training data. Using a symmetric cost function for wrongly classified KS projects, the optimal cut-off probability was calculated.
The optimal cut-off probability is the probability at which the logistic regression model has the least misclassification rate.
# Logistic Regression - Parameter Tuning
# CV to choose cut-off probability
searchgrid = seq(0.4, 0.7, 0.02)
result = cbind(searchgrid, NA)
cost1 <- function(r, pi) {
weight1 = 1
weight0 = 1
c1 = (r == 1) & (pi < pcut) #logical vector - true if actual 1 but predict 0 (False Negative)
c0 = (r == 0) & (pi > pcut) #logical vector - true if actual 0 but predict 1 (False Positive)
return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
pcut <- result[i, 1]
result[i, 2] <- cv.glm(data = train.data, glmfit = model.glm, cost = cost1, K = 3)$delta[2]
}
plot(result, ylab = "CV Cost",main = "Optimal cut-off probability identification")
Plotting the cross-validated cost with different values of pcut resulted in Figure 17. 0.64 was chosen as the optimal cut-off probability with a CV cost of 0.00036.
par(mfrow = c(1,2))
# In-sample Prediction
pred.in <- predict(model.glm, newdata = train.data, type = "response")
prediction.in <- ifelse(pred.in < 0.64,0,1)
table(as.factor(train.data$state), prediction.in)
## prediction.in
## 0 1
## failed 118597 70
## successful 5 80207
roc.plot(train.data$state == "successful", pred.in, main = "In-sample ROC")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999841 0 NA
# Model selection - Validation data
pred.val <- predict(model.glm, newdata = validation.data, type = "response")
prediction.val <- ifelse(pred.val < 0.64,0,1)
table(as.factor(validation.data$state), prediction.val)
## prediction.val
## 0 1
## failed 39324 21
## successful 0 26948
roc.plot(validation.data$state == "successful", pred.val, main = "Validation ROC")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999986 0 NA
2. Generalized Additive Models
Generalized additive models are generalized linear models in which the response depends linearly on unknown smooth functions of the numerical predictor variables. Using GAM, non-linear relationships between the predictors and the response can be tested and used in the final model.
Applying a smoothing term to the predictors duration, backers, pledged and goal produced estimated degrees of freedom of essentially 1 (i.e., linear effects) for the backers, pledged and goal variables, as can be seen in Figure 18.
model.gam <- gam(state ~ main_category + country + launched_year + launched_month +
s(duration) + currency + s(backers) + s(usd_pledged_real) +
s(usd_goal_real), family = binomial, data = train.data)
summary(model.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## state ~ main_category + country + launched_year + launched_month +
## s(duration) + currency + s(backers) + s(usd_pledged_real) +
## s(usd_goal_real)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.551e+03 3.608e+02 -20.928 < 2e-16 ***
## main_categoryComics 3.514e-01 7.530e-01 0.467 0.64076
## main_categoryCrafts -1.920e-01 4.209e-01 -0.456 0.64828
## main_categoryDance 1.913e+00 1.122e+00 1.706 0.08807 .
## main_categoryDesign 6.718e-02 4.567e-01 0.147 0.88307
## main_categoryFashion 5.488e-01 4.625e-01 1.187 0.23538
## main_categoryFilm & Video 8.507e-01 3.240e-01 2.625 0.00865 **
## main_categoryFood -9.137e-01 2.800e-01 -3.264 0.00110 **
## main_categoryGames -4.974e-01 3.746e-01 -1.328 0.18424
## main_categoryJournalism 7.362e-01 8.967e-01 0.821 0.41162
## main_categoryMusic 4.832e-01 3.106e-01 1.556 0.11982
## main_categoryPhotography -2.021e-01 4.087e-01 -0.495 0.62085
## main_categoryPublishing -2.179e-01 2.952e-01 -0.738 0.46051
## main_categoryTechnology 1.355e-01 5.140e-01 0.264 0.79208
## main_categoryTheater 1.349e+00 6.250e-01 2.158 0.03092 *
## countryCA -1.005e-01 5.118e-01 -0.196 0.84427
## countryGB 0.000e+00 0.000e+00 NA NA
## countryOther 0.000e+00 0.000e+00 NA NA
## countryUS 0.000e+00 0.000e+00 NA NA
## launched_year2013 3.385e-01 6.100e-01 0.555 0.57894
## launched_year2014 -1.104e+00 4.227e-01 -2.613 0.00898 **
## launched_year2015 -1.165e+00 4.172e-01 -2.792 0.00524 **
## launched_year2016 -9.649e-01 4.448e-01 -2.169 0.03006 *
## launched_year2017 -2.744e-01 4.807e-01 -0.571 0.56810
## launched_yearBefore 2012 3.540e-01 5.315e-01 0.666 0.50533
## launched_month02 4.598e-01 4.283e-01 1.074 0.28298
## launched_month03 1.446e-01 4.008e-01 0.361 0.71821
## launched_month04 3.571e-01 4.435e-01 0.805 0.42075
## launched_month05 5.193e-01 4.538e-01 1.144 0.25248
## launched_month06 1.056e+00 5.350e-01 1.973 0.04845 *
## launched_month07 -4.954e-01 3.646e-01 -1.359 0.17426
## launched_month08 2.970e-01 4.155e-01 0.715 0.47469
## launched_month09 2.040e-01 4.226e-01 0.483 0.62921
## launched_month10 1.276e-02 3.995e-01 0.032 0.97453
## launched_month11 -6.790e-02 4.133e-01 -0.164 0.86951
## launched_month12 1.267e-01 4.623e-01 0.274 0.78399
## currencyCAD 0.000e+00 0.000e+00 NA NA
## currencyEUR 5.841e-01 5.840e-01 1.000 0.31722
## currencyGBP 5.030e-01 4.945e-01 1.017 0.30900
## currencyOther -1.149e-01 5.980e-01 -0.192 0.84766
## currencyUSD 3.030e-01 4.377e-01 0.692 0.48876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(duration) 1.048 1.095 4.162 0.0488 *
## s(backers) 1.000 1.000 23.210 1.45e-06 ***
## s(usd_pledged_real) 1.000 1.000 439.800 < 2e-16 ***
## s(usd_goal_real) 1.000 1.000 439.037 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 73/77
## R-sq.(adj) = 0.998 Deviance explained = 99.6%
## UBRE = -0.99485 Scale est. = 1 n = 198879
plot(model.gam, shade = TRUE, seWithMean = TRUE, scale = 0, pages = 1)
The final model was therefore built with linear terms on all the predictors. The final GAM has an AIC of 1023.8, comparable to the AIC of 1012 for the final logistic regression model.
model.gam <- gam(state ~ main_category + country + launched_year + launched_month +
duration + currency + backers + usd_pledged_real +
usd_goal_real, family = binomial, data = train.data)
# In-sample Prediction
prob.gam.in <- predict(model.gam, train.data, type = "response")
pred.gam.in <- (prob.gam.in >= 0.64) * 1
table(train.data$state, pred.gam.in, dnn = c("Observation", "Prediction"))
## Prediction
## Observation 0 1
## failed 118598 69
## successful 14 80198
# roc.plot(train.data$state == "successful", prob.gam.in)$roc.vol
# Model selection - Validation data
prob.gam.val <- predict(model.gam, validation.data, type = "response")
pred.gam.val <- (prob.gam.val >= 0.64) * 1
table(validation.data$state, pred.gam.val, dnn = c("Observation", "Prediction"))
## Prediction
## Observation 0 1
## failed 39329 16
## successful 4 26944
roc.plot(validation.data$state == "successful", prob.gam.val, main = "Validation ROC",
ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999984 0 NA
3. Classification Tree
Tree models, among the simplest machine learning models to build and interpret, split the data into two or more homogeneous sets based on the most significant differentiator among the predictor variables.
Building a classification tree on the binary response variable ‘state’ resulted in the tree below (Figure 19), which shows that the predictors backers, goal and usd_pledged are the most significant variables for classifying the response.
The figure also shows the separation between the failed and successful classes: for example, a KS project with fewer than 18 backers, less than 999 USD pledged and a goal above 808 USD is almost certain to fail.
tree.model <- rpart(state ~ ., data = train.data, method = "class")
rpart.plot(tree.model)
Parameter Tuning
In order to prune the initial classification tree, the complexity parameter (Cp) was tuned. The Cp value beyond which the relative error no longer decreases appreciably was chosen as the final Cp (0.011).
plotcp(tree.model)
# Optimal cut-off value
# Pruning the tree
tree.model <- rpart(state ~ ., data = train.data, method = "class", cp = 0.011)
rpart.plot(tree.model)
Pruning the initial decision tree with a Cp value of 0.011 left it unchanged: the initial tree was already at the chosen complexity, so the final decision tree (Figure 20) is identical to the initial one.
par(mfrow = c(1,2))
# In-sample Prediction
tree.predict.in <- predict(tree.model, train.data, type = "class")
tree.pred.in <- predict(tree.model, train.data, type = "prob")
confusionMatrix(train.data$state, tree.predict.in)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 113621 5046
## successful 1753 78459
##
## Accuracy : 0.9658
## 95% CI : (0.965, 0.9666)
## No Information Rate : 0.5801
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9294
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9848
## Specificity : 0.9396
## Pos Pred Value : 0.9575
## Neg Pred Value : 0.9781
## Prevalence : 0.5801
## Detection Rate : 0.5713
## Detection Prevalence : 0.5967
## Balanced Accuracy : 0.9622
##
## 'Positive' Class : failed
##
roc.plot(train.data$state == "successful", tree.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9801735 0 NA
# Model selection - Validation data
tree.predict.val <- predict(tree.model, validation.data, type = "class")
tree.pred.val <- predict(tree.model, validation.data, type = "prob")
confusionMatrix(validation.data$state, tree.predict.val)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 37606 1739
## successful 592 26356
##
## Accuracy : 0.9648
## 95% CI : (0.9634, 0.9662)
## No Information Rate : 0.5762
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9276
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9845
## Specificity : 0.9381
## Pos Pred Value : 0.9558
## Neg Pred Value : 0.9780
## Prevalence : 0.5762
## Detection Rate : 0.5673
## Detection Prevalence : 0.5935
## Balanced Accuracy : 0.9613
##
## 'Positive' Class : failed
##
roc.plot(validation.data$state == "successful", tree.pred.val[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9798618 0 NA
4. Random Forest
Random Forests are an extension of the simple classification tree algorithm: many trees are built on different samples of the data, with a subset of variables used for building each tree. The random forest combines the decisions of the individual trees to make the final decision for each observation. The randomForest package in R automatically calculates the average out-of-bag (OOB) error as the forest is built; the OOB error is analogous to a misclassification rate, calculated on the observations not chosen to build each tree.
set.seed(12420360)
rf.model <- randomForest(formula = state ~ ., data = train.data, importance = TRUE)
print(rf.model)
##
## Call:
## randomForest(formula = state ~ ., data = train.data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.27%
## Confusion matrix:
## failed successful class.error
## failed 118148 519 0.0043735832
## successful 11 80201 0.0001371366
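The OOB estimate printed above can also be read off the fitted object; a quick sketch:
# OOB error after all 500 trees: last row of the error-rate matrix
tail(rf.model$err.rate[, "OOB"], 1) # ~0.0027 (0.27%)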
Parameter Tuning
There are multiple parameters that can be tuned in the Random Forest model. The parameters ntree, mtry and nodesize are, respectively, the number of trees built, the number of variables sampled at each candidate split, and the minimum size of the terminal nodes.
Tuning only mtry with a fixed 500 trees produced the OOB errors shown in Figure 22. From the figure, an mtry of 3 was chosen to preserve variability among the trees in the Random Forest, although larger values (mtry of 9 in particular) produced lower OOB errors.
# Tune only mtry parameter using tuneRF()
set.seed(12420360)
res <- tuneRF(x = subset(train.data, select = -state),
y = train.data$state,
ntreeTry = 500)
## mtry = 3 OOB error = 0.27%
## Searching left ...
## mtry = 2 OOB error = 0.52%
## -0.9512195 0.05
## Searching right ...
## mtry = 6 OOB error = 0.1%
## 0.6097561 0.05
## mtry = 9 OOB error = 0.08%
## 0.25 0.05
print(res)
## mtry OOBError
## 2.OOB 2 0.0052293103
## 3.OOB 3 0.0026800215
## 6.OOB 6 0.0010458621
## 9.OOB 9 0.0007843965
Tuning the other parameters, nodesize and ntree, with mtry fixed at 3 resulted in the lowest OOB error of 0.0027 with a nodesize of 3 and an ntree of 200. These tuned parameters were used to build the final Random Forest model.
# Manually tuning all parameters of the Random Forest
# Establish a list of possible values for nodesize and ntree
nodesize <- seq(3, 8, 2)
ntree <- seq(50, 200, 50)
hyper_grid <- expand.grid(nodesize = nodesize, ntree = ntree)
oob_err <- c()
for (i in 1:nrow(hyper_grid)) {
model <- randomForest(formula = state ~ ., data = train.data,
mtry = 3,
nodesize = hyper_grid$nodesize[i],
ntree = hyper_grid$ntree[i])
oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
# Identify optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
## nodesize ntree
## 10 3 200
# Build the tuned randomForest with the above parameters
rf.model <- randomForest(formula = state ~ ., data = train.data,
mtry = 3, nodesize = 3,
ntree = 200, importance = TRUE)
par(mfrow = c(1,2))
# In-sample Prediction
rf.predict.in <- predict(rf.model, train.data)
rf.pred.in <- predict(rf.model, train.data, type = "prob")
confusionMatrix(train.data$state, rf.predict.in)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 118664 3
## successful 0 80212
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.5967
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5967
## Detection Rate : 0.5967
## Detection Prevalence : 0.5967
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : failed
##
roc.plot(train.data$state == "successful", rf.pred.in[,2], ylab = "True Positive Rate",
xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 1 0 NA
# Model selection - Validation data
rf.predict.val <- predict(rf.model, validation.data)
rf.pred.val <- predict(rf.model, validation.data, type = "prob")
confusionMatrix(validation.data$state, rf.predict.val)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 39175 170
## successful 5 26943
##
## Accuracy : 0.9974
## 95% CI : (0.9969, 0.9977)
## No Information Rate : 0.591
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9945
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9999
## Specificity : 0.9937
## Pos Pred Value : 0.9957
## Neg Pred Value : 0.9998
## Prevalence : 0.5910
## Detection Rate : 0.5909
## Detection Prevalence : 0.5935
## Balanced Accuracy : 0.9968
##
## 'Positive' Class : failed
##
roc.plot(validation.data$state == "successful", rf.pred.val[,2], ylab = "True Positive Rate",
xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999771 0 NA
The final model for classifying the KS projects was determined after comparison of all models’ performance on the validation data (20% of the data).
The best model was chosen based on performance on the validation data, with AUC as the criterion for model selection. On the in-sample data, the Random Forest performed best, recording the highest AUC and an accuracy of 1. On the validation data, the Logistic Regression model and the GAM performed nearly identically; hence, the simpler Logistic Regression model was chosen as the best model, with an accuracy of 0.9996 and an AUC of 0.9999.
The below figure compares the ROC plots across all models for in-sample and validation data.
par(mfrow = c(1,2))
# Plot multiple ROC curves in one - Logistic, GAM, Tree and Random Forest
# Compare different models using ROC curves
# In-sample ROC
rocplot1 <- roc.plot(x = (train.data$state == "successful"),
pred = cbind(pred.in, prob.gam.in, tree.pred.in[,2], rf.pred.in[,2]),
main = "ROC curve: In-sample",
legend = T, leg.text = c("GLM","GAM", "Tree", "Random Forest"))
# Validation set ROC
rocplot2 <- roc.plot(x = (validation.data$state == "successful"),
pred = cbind(pred.val, prob.gam.val, tree.pred.val[,2], rf.pred.val[,2]),
main = "ROC curve: Validation set",
legend = T, leg.text = c("GLM","GAM", "Tree", "Random Forest"))
rocplot1$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999841 0 NA
## 2 Model 2 0.9999845 0 NA
## 3 Model 3 0.9801735 0 NA
## 4 Model 4 1.0000000 0 NA
rocplot2$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999986 0 NA
## 2 Model 2 0.9999984 0 NA
## 3 Model 3 0.9798618 0 NA
## 4 Model 4 0.9999771 0 NA
The Logistic Regression model came out best on both performance criteria, accuracy and AUC. With it chosen as the final model for classifying KS projects as potential successes and failures, the results below were obtained on the test data.
# Model performance - Test data
pred.out <- predict(model.glm, newdata = test.data, type = "response")
prediction.out <- ifelse(pred.out < 0.64,0,1)
table(as.factor(test.data$state), prediction.out)
## prediction.out
## 0 1
## failed 39580 22
## successful 2 26689
roc.plot(test.data$state == "successful", pred.out)$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999739 0 NA
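The test-set accuracy implied by the confusion table above can be computed directly:
# Overall accuracy on the test set
mean(prediction.out == as.numeric(test.data$state == "successful")) # ~0.9996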
References