Over the years, major crowdfunding platforms such as Kickstarter, GoFundMe, RocketHub and CircleUp have come under scrutiny for their unconventional means of investment. Nathan Resnick, in his article ‘Why Kickstarter Is Corrupted’, describes how paid advertising, investor-backed campaigns and the involvement of crowdfunding agencies have changed what Kickstarter used to stand for. In such challenging times, this project analyzes the projects launched on Kickstarter to identify the keys to a successful Kickstarter project. Though success on Kickstarter does not guarantee product success in the market, backers could use this analysis to judge whether projects are investment-worthy.
The data used for the analysis was collected from Kickstarter.com and is available on Kaggle. The raw dataset contains details about 378,661 projects launched between 2009 and 2017, and presents many details that can be used to predict the final state of a Kickstarter project.
Analysis Methodology:
My analysis of the Kickstarter Projects is broken down into the following sections.
The following steps were performed in the data cleaning/wrangling stage to prepare the data in a format suitable for analysis.
In the Data exploration stage, the response and predictor variables were analyzed by plotting univariate and bi-variate charts. The descriptive analysis forms a baseline to understand which variables could potentially be important in the predictive analysis.
With the data exploration completed, the modeling step was attempted. Classification algorithms such as Logistic Regression, GAM, Classification Trees and Random Forest were applied to the dataset to identify the projects that will be successful on Kickstarter.
The following packages were used for the analysis; the purpose of each package is noted alongside.
library(dplyr) # Data wrangling tasks
library(tidyr) # Data wrangling tasks
library(ggplot2) # Plotting/ Data visualization tasks
library(lubridate) # Date/ Time manipulation
library(magrittr) # Pipe operator
library(corrplot) # Correlation function
library(formattable) # Data Preview section
library(knitr) # Data Preview section
library(broom) # Glance function
library(boot) # Bootstrapping function
library(glmnet) # Cross validation function
library(mgcv) # GAM function
library(verification) # ROC plots
library(rpart) # Classification Tree
library(rpart.plot) # Classification Tree plot
library(caret) # Confusion Matrix function
library(randomForest) # Random Forest function
The Data Preparation section contains the logic and the steps performed in bringing the data to a form suitable for analysis. Each tab below explains the related steps. The cleaned data can be previewed at the Preview tab with details about variables presented in the Data Description tab.
Dataset:
As noted above, the data was collected from Kickstarter.com and downloaded from Kaggle, where the data dictionary and other details about the dataset can also be found. The raw dataset contains 378,661 projects launched between 2009 and 2017.
Import:
I downloaded the dataset from Kaggle and read it as a data frame using the read.csv() function.
ks <- read.csv("ks-projects-201801.csv")
str(ks)
## 'data.frame': 378661 obs. of 15 variables:
## $ ID : int 1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
## $ name : Factor w/ 375765 levels ""," IT’S A HOT CAPPUCCINO NIGHT ",..: 332493 135633 364946 344770 77274 206067 293430 69281 284103 290686 ...
## $ category : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 56 124 59 42 114 40 ...
## $ main_category : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 7 8 8 8 5 7 ...
## $ currency : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 14 14 14 ...
## $ deadline : Factor w/ 3164 levels "2009-05-03","2009-05-16",..: 2288 3042 1333 1017 2247 2463 1996 2448 1790 1863 ...
## $ goal : num 1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
## $ launched : Factor w/ 378089 levels "1970-01-01 01:00:00",..: 243292 361975 80409 46557 235943 278600 187500 274014 139367 153766 ...
## $ pledged : num 0 2421 220 1 1283 ...
## $ state : Factor w/ 6 levels "canceled","failed",..: 2 2 2 2 1 4 4 2 1 1 ...
## $ backers : int 0 15 3 1 14 224 16 40 58 43 ...
## $ country : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 23 23 23 ...
## $ usd.pledged : num 0 100 220 1 1283 ...
## $ usd_pledged_real: num 0 2421 220 1 1283 ...
## $ usd_goal_real : num 1534 30000 45000 5000 19500 ...
The raw Kickstarter Projects data has 378,661 observations and 15 variables.
1. Initial data type conversions:
Relevant columns were converted to appropriate data types: ID and name were converted to ‘character’, and the deadline and launched dates to ‘Date’.
ks$ID <- as.character(ks$ID)
ks$name <- as.character(ks$name)
ks$deadline <- as.Date(ks$deadline)
ks$launched <- as.Date(ks$launched)
2. Re-ordering the dataset variables:
The variables were re-ordered into a more meaningful sequence, grouping related fields together.
ks <- ks[, c(1:4, 12, 8, 6, 5, 7, 9, 11, 13:15, 10)]
3. Data sub-setting:
The target variable in the analysis, State, the Final state of the Kickstarter (KS) Project has six levels – Failed (52%), Successful (35%) and Others (Cancelled, Live, Suspended and Undefined, 12%).
ggplot(ks, aes(state)) +
geom_bar() +
ylab("# of Projects") + xlab("Final State") +
ggtitle("Final State of the Kickstarter projects")
For the purpose of this project, only the projects with ‘Failed’ and ‘Successful’ states will be considered.
ks.proj <- ks %>% filter(state == "failed" | state == "successful")
# Re-factor to drop the now-unused levels (canceled, live, suspended, undefined)
ks.proj$state <- as.character(ks.proj$state)
ks.proj$state <- as.factor(ks.proj$state)
summary(ks.proj$state)
## failed successful
## 197719 133956
4. Check for duplicates:
There were no duplicated observations in the dataset. Hence, no changes were made in this step.
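The check itself is not shown in the original code; a minimal version, assuming duplicates are screened on full rows and on the ID field, could look like this:
# Count fully duplicated rows and duplicated project IDs (both expected to be 0)
sum(duplicated(ks.proj))
sum(duplicated(ks.proj$ID))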
5. Check for missing values:
Only 0.063% of the observations had missing values. As these were few, the affected rows were stored in a temporary dataset and removed. The final dataset contains 331,465 observations.
sum(is.na(ks.proj))
## [1] 210
colSums(is.na(ks.proj))
## ID name category main_category
## 0 0 0 0
## country launched deadline currency
## 0 0 0 0
## goal pledged backers usd.pledged
## 0 0 0 210
## usd_pledged_real usd_goal_real state
## 0 0 0
# As it's a very small fraction (0.063%), these rows are removed (kept aside in 'missing')
missing <- ks.proj %>% filter(is.na(usd.pledged))
ks.proj <- na.omit(ks.proj)
6. Feature Creation
The variable ‘Duration’ was created as the difference (in days) between the deadline and launched date. Further, the deadline and launched date fields were separated into the respective year and month as variables, deadline_year, deadline_month and launched_year, launched_month.
ks.proj$duration <- as.numeric(ks.proj$deadline - ks.proj$launched)
ks.proj <- ks.proj %>%
separate(col = "deadline", into = c("deadline_year", "deadline_month", "deadline_day"), sep = "-") %>%
separate(col = "launched", into = c("launched_year", "launched_month", "launched_day"), sep = "-")
## 'data.frame': 331465 obs. of 18 variables:
## $ ID : chr "1000002330" "1000003930" "1000004038" "1000007540" ...
## $ name : chr "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
## $ category : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 124 59 42 96 73 33 ...
## $ main_category : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 8 8 8 13 11 3 ...
## $ country : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 4 23 23 ...
## $ launched_year : chr "2015" "2017" "2013" "2012" ...
## $ launched_month : chr "08" "09" "01" "03" ...
## $ deadline_year : chr "2015" "2017" "2013" "2012" ...
## $ deadline_month : chr "10" "11" "02" "04" ...
## $ duration : num 59 60 45 30 35 20 45 30 30 30 ...
## $ currency : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 2 14 14 ...
## $ goal : num 1000 30000 45000 5000 50000 1000 25000 2500 12500 5000 ...
## $ pledged : num 0 2421 220 1 52375 ...
## $ backers : int 0 15 3 1 224 16 40 0 100 0 ...
## $ usd.pledged : num 0 100 220 1 52375 ...
## $ usd_pledged_real: num 0 2421 220 1 52375 ...
## $ usd_goal_real : num 1534 30000 45000 5000 50000 ...
## $ state : Factor w/ 2 levels "failed","successful": 1 1 1 1 2 2 1 1 2 1 ...
7. Feature Engineering:
The dataset contains many categorical variables with multiple levels, several of which have too few observations. To reduce the number of parameters in the predictive analysis, such levels were consolidated.
a. Country:
KS projects can be launched in 22 countries. However, 94% of the projects launched between 2009 and 2017 came from the US, the UK, Canada and Australia. Projects launched in the remaining countries (Japan, Hong Kong, Singapore, New Zealand, Denmark, Norway, Sweden, the Netherlands, Ireland, Spain, France, Germany, Austria, Italy, Belgium, Luxembourg, Switzerland and Mexico), which together account for about 6% of the dataset, were grouped into a level ‘Other’.
# Reducing levels in Country: group low-volume countries into 'Other'
ks.proj$country <- as.character(ks.proj$country)
ks.proj$country[ks.proj$country %in% c("JP", "LU", "AT", "HK", "SG", "BE", "CH", "IE", "NO", "DK",
"MX", "NZ", "SE", "ES", "IT", "NL", "FR", "DE")] <- "Other"
ks.proj$country <- as.factor(ks.proj$country)
levels(ks.proj$country) # 5 levels
## [1] "AU" "CA" "GB" "Other" "US"
sort(round(prop.table(table(ks.proj$country)),2))
##
## AU CA Other GB US
## 0.02 0.04 0.07 0.09 0.79
b. Launched Year:
The dataset records all KS projects launched between 2009 and 2017. However, only 10% of the projects were launched before 2012. Thus, a new level ‘Before 2012’ was created to include all projects launched in 2009, 2010 and 2011.
# Convert the date-component fields created by separate() from character to factor
# (conversion assumed here; the later structure output shows these fields as factors)
ks.proj <- ks.proj %>% mutate_at(c("launched_year", "launched_month", "deadline_year", "deadline_month"), as.factor)
levels(ks.proj$launched_year) # 9 levels
## [1] "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
round(prop.table(table(ks.proj$launched_year)),2)
##
## 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 0.00 0.03 0.07 0.12 0.12 0.18 0.20 0.15 0.13
# Reducing levels in Launched Year
ks.proj$launched_year <- as.character(ks.proj$launched_year)
ks.proj$launched_year[ks.proj$launched_year %in% c("2009", "2010", "2011")] <- "Before 2012"
ks.proj$launched_year <- as.factor(ks.proj$launched_year)
c. Currency:
Fourteen different currencies appear among the KS projects in the dataset. However, 98% of the projects were denominated in US Dollars, British Pounds, Euros, Canadian Dollars or Australian Dollars. The remaining currencies were grouped under the label ‘Other’.
levels(ks.proj$currency)
## [1] "AUD" "CAD" "CHF" "DKK" "EUR" "GBP" "HKD" "JPY" "MXN" "NOK" "NZD"
## [12] "SEK" "SGD" "USD"
sort(round(prop.table(table(ks.proj$currency)),2))
##
## CHF DKK HKD JPY MXN NOK NZD SEK SGD AUD CAD EUR GBP USD
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.04 0.04 0.09 0.79
# Reducing levels in Currency
ks.proj$currency <- as.character(ks.proj$currency)
ks.proj$currency[ks.proj$currency %in% c("JPY", "HKD", "SGD", "CHF", "NOK", "DKK", "MXN", "NZD",
"SEK")] <- "Other"
ks.proj$currency <- as.factor(ks.proj$currency)
Listed below are the first 50 entries of the Kickstarter Projects data. Each row is a KS project, with details about the project listed in the columns.
head(ks.proj, n = 50) %>%
formattable() %>%
as.datatable(options = list(dom = 't',scrollX = TRUE,scrollCollapse = TRUE))
Details about the Variables in the Dataset are provided below.
variable.type <- lapply(ks.proj, class)
variable.description <- c("ID of Kickstarter Project", "Name of Kickstarter Project",
"Category of Kickstarter Project", "Main Category of Kickstarter Project",
"Country where Kickstarter Project was launched",
"Year when Kickstarter Project was launched",
"Month when Kickstarter Project was launched",
"Year when Kickstarter Project ended",
"Month when Kickstarter Project ended",
"Active Duration of the Kickstarter Project", "Currency of amount pledged",
"Goal of Kickstarter Project in original currency",
"Total amount pledged in original currency", "Number of Backers",
"Conversion in USD of the pledged column (conversion done by Kickstarter)",
"Conversion in USD of the pledged column (conversion from Fixer.io API)",
"Conversion in USD of the goal column (conversion from Fixer.io API)",
"Final State of Kickstarter Project")
variable.name <- colnames(ks.proj)
ks_datadesc <- as_data_frame(cbind(variable.name, variable.type, variable.description))
colnames(ks_datadesc) <- c("Variable Name","Data Type","Variable Description")
kable(ks_datadesc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| ID | character | ID of Kickstarter Project |
| name | character | Name of Kickstarter Project |
| category | factor | Category of Kickstarter Project |
| main_category | factor | Main Category of Kickstarter Project |
| country | factor | Country where Kickstarter Project was launched |
| launched_year | factor | Year when Kickstarter Project was launched |
| launched_month | factor | Month when Kickstarter Project was launched |
| deadline_year | factor | Year when Kickstarter Project ended |
| deadline_month | factor | Month when Kickstarter Project ended |
| duration | numeric | Active Duration of the Kickstarter Project |
| currency | factor | Currency of amount pledged |
| goal | numeric | Goal of Kickstarter Project in original currency |
| pledged | numeric | Total amount pledged in original currency |
| backers | integer | Number of Backers |
| usd.pledged | numeric | Conversion in USD of the pledged column (conversion done by Kickstarter) |
| usd_pledged_real | numeric | Conversion in USD of the pledged column (conversion from Fixer.io API) |
| usd_goal_real | numeric | Conversion in USD of the goal column (conversion from Fixer.io API) |
| state | factor | Final State of Kickstarter Project |
The analysis of the Kickstarter Projects data can be found under the following tabs.
The Descriptive Analysis is sectionally divided into Univariate and Bivariate explorations. The correlations between the predictors can be found in the last tab.
Before we start the Descriptive Analysis, let’s look at the structure of the cleaned dataset, ks.proj, and understand its dimensions.
str(ks.proj)
## 'data.frame': 331465 obs. of 18 variables:
## $ ID : chr "1000002330" "1000003930" "1000004038" "1000007540" ...
## $ name : chr "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
## $ category : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 124 59 42 96 73 33 ...
## $ main_category : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 8 8 8 13 11 3 ...
## $ country : Factor w/ 5 levels "AU","CA","GB",..: 3 5 5 5 5 5 5 2 5 5 ...
## $ launched_year : Factor w/ 7 levels "2012","2013",..: 4 6 2 1 5 3 5 2 2 3 ...
## $ launched_month : Factor w/ 12 levels "01","02","03",..: 8 9 1 3 2 12 2 9 3 9 ...
## $ deadline_year : Factor w/ 10 levels "2009","2010",..: 7 9 5 4 8 6 8 5 5 6 ...
## $ deadline_month : Factor w/ 12 levels "01","02","03",..: 10 11 2 4 4 12 3 10 4 10 ...
## $ duration : num 59 60 45 30 35 20 45 30 30 30 ...
## $ currency : Factor w/ 6 levels "AUD","CAD","EUR",..: 4 6 6 6 6 6 6 2 6 6 ...
## $ goal : num 1000 30000 45000 5000 50000 1000 25000 2500 12500 5000 ...
## $ pledged : num 0 2421 220 1 52375 ...
## $ backers : int 0 15 3 1 224 16 40 0 100 0 ...
## $ usd.pledged : num 0 100 220 1 52375 ...
## $ usd_pledged_real: num 0 2421 220 1 52375 ...
## $ usd_goal_real : num 1534 30000 45000 5000 50000 ...
## $ state : Factor w/ 2 levels "failed","successful": 1 1 1 1 2 2 1 1 2 1 ...
The cleaned dataset has 331,465 observations and 18 variables. The target variable, ‘state’, is categorical, while the predictors are a mix of numerical and categorical variables.
Response Variable: State
The response variable, State, indicates which of the projects Kickstarter deemed successful. Of the 331,465 projects listed in the dataset, 40% of the projects were successful, while the rest failed.
# Final State of the KS project
ggplot(ks.proj, aes(state, fill = state)) +
geom_bar() +
ylab("# of Projects") + xlab("Final State") +
theme(legend.position = "bottom") +
ggtitle("Final State of the Kickstarter projects")
Predictor Variables:
1. Main Category
There are 15 different categories of KS projects listed in the dataset, of which Film & Video, Music and Publishing collectively make up more than 40% of the projects. The figures below show the distribution of KS projects and the success/failure rate across categories.
# 1. Main Categories present in the dataset
p2 <- ggplot(ks.proj, aes(x = main_category, fill = state)) +
geom_bar() +
coord_flip() +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("") +
ggtitle("Main categories of the KS Projects")
# Main categories of the KS Projects - percent to whole
p1 <- ks.proj %>%
count(main_category) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(reorder(main_category, pct), pct)) +
geom_col() +
coord_flip() +
ylab("% of projects") + xlab("") +
ggtitle("Main categories of the KS Projects")
gridExtra::grid.arrange(p1, p2, ncol = 2)
2. Country
Of the 22 countries where KS projects can be launched, 94% of the projects came from the US, the UK, Canada and Australia. Figure 4 shows the distribution of KS projects across countries, along with the associated success rates.
# KS projects by country
ggplot(ks.proj, aes(x = country, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
scale_y_continuous(labels = scales::comma) +
ylab("Number of projects") + xlab("") +
ggtitle("KS Projects by country")
3. Launched Year
Figure 5 shows the distribution of the KS projects based on the project launch year. There’s a clear peak in the number of projects that were launched in 2014 and 2015, of which about 30% were successful.
ggplot(ks.proj, aes(x = launched_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("Year launched") +
ggtitle("KS projects by Year")
4. Launched Month
Figure 6 shows the distribution over time when the KS projects were launched. There was a strong peak between July and August in 2014 and 2015.
df1 <- as.data.frame(table(ks.proj$launched_year, ks.proj$launched_month))
names(df1) <- c('Launched_Year','Launched_Month', 'Freq')
df1 <- df1 %>% group_by(Launched_Year) %>% arrange(desc(Freq))
ggplot(df1, aes(x = Launched_Year, y = `Freq`, fill = Launched_Month)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("KS Projects launched over time") +
scale_y_continuous("# of Projects launched", labels = scales::comma) +
scale_x_discrete("Year")
5. Deadline Year
Figure 7 shows the distribution of the project deadlines over time. Because its distribution closely mirrors those of the launched year and duration variables, the usefulness of this variable is questionable.
ggplot(ks.proj, aes(x = deadline_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("Deadline Year") +
ggtitle("KS projects by Deadline")
6. Active Duration
The figures show the distribution of the variable: both successful and failed projects have a typical duration of around 30 days, with many outliers.
p1 <- ggplot(ks.proj, aes(duration, fill = state)) +
geom_histogram(binwidth = 5) +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("Duration") +
ggtitle("Duration of the KS projects (in days)")
p2 <- ggplot(ks.proj, aes(x = state, y = duration, fill = state)) +
geom_boxplot() +
theme(legend.position = "bottom") +
coord_flip() +
xlab("") + ylab("Duration") +
ggtitle("Active Duration of the KS projects")
gridExtra::grid.arrange(p1, p2, ncol = 2)
7. Currency
Across the 22 countries where KS projects can be launched, 14 unique currencies were used; 98% of the projects were denominated in US Dollars, British Pounds, Euros, Canadian Dollars or Australian Dollars. Figure 9 shows the distribution of KS projects across currencies and the associated success rates. As this plot closely resembles the distribution of the country variable, it might not add much information.
ggplot(ks.proj, aes(x = currency, fill = state)) +
geom_bar() +
scale_y_continuous(labels = scales::comma) +
theme(legend.position = "bottom") +
ylab("Number of projects") + xlab("") +
ggtitle("Currency of the KS Projects")
8. Backers
Table 4 shows the quantile distribution of the ‘backers’ variable. The distribution is highly skewed, with a very large variance: half of the KS projects had 15 or fewer backers.
quantile(ks.proj$backers, probs = seq(from = 0, to = 1, by = .1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 0 0 1 3 7 15 28 48 84 183
## 100%
## 219382
To help understand the distribution better, a log transformation was applied (Figures 10a, 10b). The ‘backers’ variable shows a large difference between successful and failed KS projects and hence might be a useful variable in the predictive analysis.
# Log transforming the backers field shows the distribution better
# (log(0) is -Inf for zero-backer projects; ggplot drops these values with a warning)
p1 <- ggplot(ks.proj, aes(log(backers), fill = state)) +
geom_density() +
theme(legend.position = "bottom") +
ylab("Density") + xlab("# of Backers (log-transformed)") +
ggtitle("# of Backers of the KS projects")
p2 <- ggplot(ks.proj, aes(x = state, y = log(backers), fill = state)) +
geom_boxplot() +
coord_flip() +
theme(legend.position = "bottom") +
ylab("# of Backers (log-transformed)") + xlab("") +
ggtitle("# of Backers of the KS projects (Log)")
gridExtra::grid.arrange(p1, p2, ncol = 2)
9. Amount Pledged
Table 5 shows the quantile distribution of the ‘amount pledged’ variable. As with ‘backers’, the distribution is highly skewed and the variance is large. From the quantile table, about 20% of the projects had less than 20 USD pledged, while another 20% had more than 6,300 USD pledged.
quantile(ks.proj$usd_pledged_real, probs = seq(from = 0, to = 1, by = .2))
## 0% 20% 40% 60% 80%
## 0.000 20.000 307.982 1681.000 6305.594
## 100%
## 20338986.270
Figures 11a, 11b show the distribution better by using a log transformation. As the ‘Amount Pledged’ variable also shows a large difference between successful and failed KS projects, it might be a useful predictor in the analysis.
# Log transforming the usd_pledged_real field shows the distribution better
p1 <- ggplot(ks.proj, aes(log(usd_pledged_real), fill = state)) +
geom_density() +
theme(legend.position = "bottom") +
xlab("USD pledged (log-transformed)") + ylab("") +
ggtitle("USD pledged for the KS projects")
# Log-transformed usd_pledged_real
p2 <- ggplot(ks.proj, aes(x = state, y = log(usd_pledged_real), fill = state)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ylab("USD pledged (log-transformed)") + xlab("") +
scale_y_continuous(labels = scales::comma) +
coord_flip() +
ggtitle("USD pledged for the KS projects (Log)")
gridExtra::grid.arrange(p1, p2, ncol = 2)
10. Goal
The table below shows the quantile distribution of the ‘goal’ variable. As with the ‘backers’ and ‘amount pledged’ variables, the distribution is highly skewed and the variance is large. Figures 12a, 12b show the distribution using a log transformation.
quantile(ks.proj$usd_goal_real, probs = seq(from = 0, to = 1, by = .2))
## 0% 20% 40% 60% 80%
## 0.01 1500.00 3975.00 8000.00 20000.00
## 100%
## 166361390.71
# Log transforming the goal field shows the distribution better
p1 <- ggplot(ks.proj, aes(log(usd_goal_real), fill = state)) +
geom_density() +
theme(legend.position = "bottom") +
xlab("Goal in USD (log-transformed)") + ylab("") +
ggtitle("KS projects' Goal")
# Log-transformed usd_goal_real
p2 <- ggplot(ks.proj, aes(x = state, y = log(usd_goal_real), fill = state)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ylab("Goal in USD (log-transformed)") + xlab("") +
scale_y_continuous(labels = scales::comma) +
coord_flip() +
ggtitle(" Goal of the KS projects (Log)")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Category by Year
Figure 13 shows that, across the years, KS projects in Music and Film & Video have been consistently numerous; no other category stands out as much. This indicates that category alone might not be a strong predictive variable.
ggplot(ks.proj, aes(launched_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
facet_wrap( ~ main_category) +
ylab("Number of Projects") + xlab("Launched Year") +
ggtitle("KS projects launched over time by Category")
Country by Year
Figure 14 shows that, over the years, KS projects have been launched in the US far more than in any other country. The plot suggests that neither country nor launch year is highly indicative of which projects will succeed on Kickstarter.
ggplot(ks.proj, aes(launched_year, fill = state)) +
geom_bar() +
theme(legend.position = "bottom") +
facet_wrap( ~ country) +
ylab("Number of Projects") + xlab("Launched Year") +
ggtitle("KS projects launched over time by Country")
USD Pledged vs Backers and Goal vs Backers
Figure 15a shows a strong positive correlation between USD pledged and backers, and Figure 15b shows a mild correlation between goal and backers. The clear separation of the successful and failed outcomes along these variables suggests they might be strong predictors.
p1 <- ggplot(ks.proj, aes(x = log(backers), y = log(usd_pledged_real))) +
geom_jitter(aes(color = state)) +
theme(legend.position = "bottom") +
ylab("Amount pledged (log)") + xlab("Backers (log)") +
ggtitle("KS projects USD Pledged vs Backers")
# Goal vs Backers
p2 <- ggplot(ks.proj, aes(x = log(backers), y = log(usd_goal_real))) +
geom_jitter(aes(color = state)) +
theme(legend.position = "bottom") +
ylab("Goal (log)") + xlab("Backers (log)") +
ggtitle("KS projects' Goal vs Backers")
gridExtra::grid.arrange(p1, p2, ncol = 2)
From the correlation plot, several strong correlations can be identified. The variables ‘goal’ and ‘pledged’ record the goal and the amount pledged in the original currency at the time of launch, while ‘usd_goal_real’ and ‘usd_pledged_real’ are their USD equivalents (conversion provided by the Fixer.io API). Hence, the variables goal, pledged and usd.pledged will be dropped.
# Correlation between all numerical variables
corMat <- cor(ks.proj[, c(10, 12:17)])
corrplot.mixed(corMat,tl.pos = "lt")
The only other high correlation in Figure 16 is the 0.75 correlation between usd_pledged_real and backers, which was also visible in Figure 15a. Hence, the variance inflation factors of the final model will be checked in later steps.
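The column-dropping code itself is not shown in the original write-up. A minimal sketch, assuming the modeling set (named data to match the splitting code below) keeps only the response and the nine predictors that appear in the models that follow:
# Assumed step: keep the response and the nine predictors used in the later models,
# dropping goal, pledged, usd.pledged and the identifier/date-remnant fields
data <- ks.proj %>%
select(state, main_category, country, launched_year, launched_month,
duration, currency, backers, usd_pledged_real, usd_goal_real)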
The below tabs show the approach to modeling and the results of the modeling exercise.
To build a predictive model that differentiates and classifies the KS projects that will or will not succeed, the following modeling approach was used.
Model Building: Modeling techniques such as Logistic Regression, Generalized Additive Models (GAM), Classification Trees and Random Forest were applied to the training set.
Parameter tuning: Cross validation and Grid search methods were used on the training data to identify the best parameter(s) suited for each of the models previously built.
Model evaluation: Model performance was evaluated for each model on the validation set (20% of the dataset), using AUC and accuracy as the criteria, in order to choose the best performing model for this dataset.
Final testing: The performance of the best model chosen in the previous step was evaluated by measuring its accuracy on the test set.
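As a sketch of how both criteria can be computed for any of the fitted models (a hypothetical helper, not part of the original code; roc.area() comes from the verification package loaded earlier):
# Accuracy and AUC for predicted success probabilities at a given cut-off
eval_model <- function(obs, prob, pcut = 0.5) {
  pred <- ifelse(prob >= pcut, 1, 0)
  acc  <- mean(pred == as.numeric(obs == "successful"))
  auc  <- verification::roc.area(as.numeric(obs == "successful"), prob)$A
  c(accuracy = acc, AUC = auc)
}
# e.g. eval_model(validation.data$state, pred.val, pcut = 0.64)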
The data set was split into a 60-20-20 split for Train, Validation and Test set respectively.
# create Training - Test and Validation set (60 - 20 - 20%)
set.seed(12420360)
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(seq(nrow(data)), nrow(data)*cumsum(c(0,spec)), labels = names(spec)))
res = split(data, g)
train.data <- res$train
test.data <- res$test
validation.data <- res$validate
1. Logistic Regression
A full logistic regression model was built with state as the response variable against the nine predictor variables – main category, country, launched year, launched month, duration, currency, backers, USD pledged and USD goal. Step-wise variable selection was used to identify the most important variables in the logistic regression model.
nullmodel <- glm(state~1, data = train.data, family = "binomial")
fullmodel <- glm(state~., data = train.data, family = "binomial")
#Backward Elimination
model.step.b <- step(fullmodel, direction = 'backward')
## Start: AIC=1023.75
## state ~ main_category + country + launched_year + launched_month +
## duration + currency + backers + usd_pledged_real + usd_goal_real
##
##
## Step: AIC=1023.75
## state ~ main_category + launched_year + launched_month + duration +
## currency + backers + usd_pledged_real + usd_goal_real
##
## Df Deviance AIC
## - currency 5 945 1017
## - launched_month 11 959 1019
## - main_category 14 969 1023
## <none> 942 1024
## - duration 1 945 1025
## - backers 1 954 1034
## - launched_year 6 967 1037
## - usd_pledged_real 1 123274 123354
## - usd_goal_real 1 172009 172089
##
## Step: AIC=1016.78
## state ~ main_category + launched_year + launched_month + duration +
## backers + usd_pledged_real + usd_goal_real
##
## Df Deviance AIC
## - launched_month 11 962 1012
## - main_category 14 972 1016
## <none> 945 1017
## - duration 1 948 1018
## - launched_year 6 972 1032
## - backers 1 978 1048
## - usd_pledged_real 1 123334 123404
## - usd_goal_real 1 172259 172329
##
## Step: AIC=1011.69
## state ~ main_category + launched_year + duration + backers +
## usd_pledged_real + usd_goal_real
##
## Df Deviance AIC
## <none> 962 1012
## - duration 1 966 1014
## - main_category 14 994 1016
## - backers 1 969 1017
## - launched_year 6 989 1027
## - usd_pledged_real 1 123393 123441
## - usd_goal_real 1 172432 172480
#Forward Selection
model.step.f <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), direction = 'forward')
## Start: AIC=268224.2
## state ~ 1
##
## Df Deviance AIC
## + backers 1 190220 190224
## + usd_pledged_real 1 208657 208661
## + main_category 14 258852 258882
## + usd_goal_real 1 260086 260090
## + launched_year 6 265066 265080
## + duration 1 265449 265453
## + currency 5 267089 267101
## + country 4 267146 267156
## + launched_month 11 267867 267891
## <none> 268222 268224
##
## Step: AIC=190224
## state ~ backers
##
## Df Deviance AIC
## + usd_goal_real 1 129495 129501
## + main_category 14 177586 177618
## + duration 1 187547 187553
## + launched_year 6 187697 187713
## + currency 5 189436 189450
## + country 4 189513 189525
## + launched_month 11 190088 190114
## + usd_pledged_real 1 190137 190143
## <none> 190220 190224
##
## Step: AIC=129500.7
## state ~ backers + usd_goal_real
##
## Df Deviance AIC
## + usd_pledged_real 1 1036 1044
## + main_category 14 124239 124273
## + launched_year 6 128885 128903
## + duration 1 129119 129127
## + currency 5 129262 129278
## + country 4 129270 129284
## + launched_month 11 129413 129441
## <none> 129495 129501
##
## Step: AIC=1043.69
## state ~ backers + usd_goal_real + usd_pledged_real
##
## Df Deviance AIC
## + launched_year 6 998.11 1018.1
## + main_category 14 992.15 1028.2
## + launched_month 11 1010.69 1040.7
## <none> 1035.69 1043.7
## + country 4 1027.86 1043.9
## + duration 1 1034.22 1044.2
## + currency 5 1027.58 1045.6
##
## Step: AIC=1018.11
## state ~ backers + usd_goal_real + usd_pledged_real + launched_year
##
## Df Deviance AIC
## + main_category 14 965.77 1013.8
## + duration 1 994.31 1016.3
## <none> 998.11 1018.1
## + launched_month 11 976.27 1018.3
## + country 4 993.84 1021.8
## + currency 5 993.24 1023.2
##
## Step: AIC=1013.77
## state ~ backers + usd_goal_real + usd_pledged_real + launched_year +
## main_category
##
## Df Deviance AIC
## + duration 1 961.69 1011.7
## <none> 965.77 1013.8
## + launched_month 11 948.47 1018.5
## + country 4 963.38 1019.4
## + currency 5 962.41 1020.4
##
## Step: AIC=1011.69
## state ~ backers + usd_goal_real + usd_pledged_real + launched_year +
## main_category + duration
##
## Df Deviance AIC
## <none> 961.69 1011.7
## + launched_month 11 944.78 1016.8
## + country 4 959.61 1017.6
## + currency 5 958.55 1018.5
Both forward and backward selection resulted in the same final model. With AIC as the selection criterion, the chosen logistic regression model had the lowest AIC value of 1012, as shown below.
# Forward and backward selection gives the same final model
model.glm <- glm(state ~ main_category + launched_year + duration +
backers + usd_pledged_real + usd_goal_real, data = train.data, family = "binomial")
summary(model.glm)
##
## Call:
## glm(formula = state ~ main_category + launched_year + duration +
## backers + usd_pledged_real + usd_goal_real, family = "binomial",
## data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 6.55
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.592292 0.473563 7.586 3.31e-14 ***
## main_categoryComics 0.364321 0.735345 0.495 0.62029
## main_categoryCrafts -0.298821 0.417253 -0.716 0.47389
## main_categoryDance 1.786427 1.085384 1.646 0.09979 .
## main_categoryDesign -0.038988 0.446695 -0.087 0.93045
## main_categoryFashion 0.514163 0.463273 1.110 0.26706
## main_categoryFilm & Video 0.859109 0.323352 2.657 0.00789 **
## main_categoryFood -1.155341 0.266997 -4.327 1.51e-05 ***
## main_categoryGames -0.469846 0.368808 -1.274 0.20268
## main_categoryJournalism 0.800839 0.882433 0.908 0.36412
## main_categoryMusic 0.430053 0.306353 1.404 0.16038
## main_categoryPhotography -0.216085 0.403612 -0.535 0.59239
## main_categoryPublishing -0.199835 0.294969 -0.677 0.49810
## main_categoryTechnology 0.036450 0.505447 0.072 0.94251
## main_categoryTheater 1.415904 0.622875 2.273 0.02302 *
## launched_year2013 0.338879 0.612214 0.554 0.57990
## launched_year2014 -1.352207 0.414653 -3.261 0.00111 **
## launched_year2015 -1.143311 0.416774 -2.743 0.00608 **
## launched_year2016 -0.987829 0.441421 -2.238 0.02523 *
## launched_year2017 -0.310882 0.478285 -0.650 0.51570
## launched_yearBefore 2012 0.282530 0.532915 0.530 0.59600
## duration -0.013058 0.005980 -2.184 0.02899 *
## backers 0.035993 0.006546 5.498 3.84e-08 ***
## usd_pledged_real 0.234382 0.010890 21.524 < 2e-16 ***
## usd_goal_real -0.234333 0.010899 -21.501 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268222.18 on 198878 degrees of freedom
## Residual deviance: 961.69 on 198854 degrees of freedom
## AIC: 1011.7
##
## Number of Fisher Scoring iterations: 25
The final selected model implies, for example, that launching a new KS project under the ‘Comics’ main category raises its odds of success by about 44% (exp(0.364) ≈ 1.44) relative to the baseline category, ‘Art’. The ‘Odds’ column in Table 7 shows the relative importance of each field.
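Table 7 is not reproduced here; the odds can be recovered from the fitted coefficients, and the variance inflation check flagged in the correlation section can be run on the same model. A short sketch (vif() assumes the car package, which is not loaded above):
# Odds ratios: exp(coefficient) is the multiplicative change in the odds of success
odds <- exp(coef(model.glm))
round(odds["main_categoryComics"], 2) # ~1.44, i.e. ~44% higher odds than the baseline 'Art'
# Variance inflation factors for the final model (assumes the car package)
car::vif(model.glm)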
Parameter Tuning
To tune the parameter, pcut, which is the optimal cut-off probability for classifying projects into the successful or failure class, a three-fold cross-validation method was used on the training data. Using a symmetric cost function for wrongly classified KS projects, the optimal cut-off probability was calculated.
The optimal cut-off probability is the probability at which the logistic regression model has the least misclassification rate.
# Logistic Regression - Parameter Tuning
# CV to choose cut-off probability
searchgrid = seq(0.4, 0.7, 0.02)
result = cbind(searchgrid, NA)
cost1 <- function(r, pi) {
weight1 = 1
weight0 = 1
c1 = (r == 1) & (pi < pcut) #logical vector - true if actual 1 but predict 0 (False Negative)
c0 = (r == 0) & (pi > pcut) #logical vector - true if actual 0 but predict 1 (False Positive)
return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
pcut <- result[i, 1]
result[i, 2] <- cv.glm(data = train.data, glmfit = model.glm, cost = cost1, K = 3)$delta[2]
}
plot(result, ylab = "CV Cost",main = "Optimal cut-off probability identification")
Plotting the cross-validated cost with different values of pcut resulted in Figure 17. 0.64 was chosen as the optimal cut-off probability with a CV cost of 0.00036.
par(mfrow = c(1,2))
# In-sample Prediction
pred.in <- predict(model.glm, newdata = train.data, type = "response")
prediction.in <- ifelse(pred.in < 0.64,0,1)
table(as.factor(train.data$state), prediction.in)
## prediction.in
## 0 1
## failed 118597 70
## successful 5 80207
roc.plot(train.data$state == "successful", pred.in, main = "In-sample ROC")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999841 0 NA
# Model selection - Validation data
pred.val <- predict(model.glm, newdata = validation.data, type = "response")
prediction.val <- ifelse(pred.val < 0.64,0,1)
table(as.factor(validation.data$state), prediction.val)
## prediction.val
## 0 1
## failed 39324 21
## successful 0 26948
roc.plot(validation.data$state == "successful", pred.val, main = "Validation ROC")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999986 0 NA
2. Generalized Additive Models
Generalized additive models are generalized linear models in which the response depends linearly on unknown smooth functions of the numerical predictor variables. Using GAM, non-linear relationships between the predictors and the response can be tested and used in the final model.
Applying a smoothing term to the predictors duration, backers, pledged and goal produced estimated degrees of freedom of essentially 1 (i.e., linear effects) for the backers, pledged and goal variables, as can be seen in Figure 18.
model.gam <- gam(state ~ main_category + country + launched_year + launched_month +
s(duration) + currency + s(backers) + s(usd_pledged_real) +
s(usd_goal_real), family = binomial, data = train.data)
summary(model.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## state ~ main_category + country + launched_year + launched_month +
## s(duration) + currency + s(backers) + s(usd_pledged_real) +
## s(usd_goal_real)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.551e+03 3.608e+02 -20.928 < 2e-16 ***
## main_categoryComics 3.514e-01 7.530e-01 0.467 0.64076
## main_categoryCrafts -1.920e-01 4.209e-01 -0.456 0.64828
## main_categoryDance 1.913e+00 1.122e+00 1.706 0.08807 .
## main_categoryDesign 6.718e-02 4.567e-01 0.147 0.88307
## main_categoryFashion 5.488e-01 4.625e-01 1.187 0.23538
## main_categoryFilm & Video 8.507e-01 3.240e-01 2.625 0.00865 **
## main_categoryFood -9.137e-01 2.800e-01 -3.264 0.00110 **
## main_categoryGames -4.974e-01 3.746e-01 -1.328 0.18424
## main_categoryJournalism 7.362e-01 8.967e-01 0.821 0.41162
## main_categoryMusic 4.832e-01 3.106e-01 1.556 0.11982
## main_categoryPhotography -2.021e-01 4.087e-01 -0.495 0.62085
## main_categoryPublishing -2.179e-01 2.952e-01 -0.738 0.46051
## main_categoryTechnology 1.355e-01 5.140e-01 0.264 0.79208
## main_categoryTheater 1.349e+00 6.250e-01 2.158 0.03092 *
## countryCA -1.005e-01 5.118e-01 -0.196 0.84427
## countryGB 0.000e+00 0.000e+00 NA NA
## countryOther 0.000e+00 0.000e+00 NA NA
## countryUS 0.000e+00 0.000e+00 NA NA
## launched_year2013 3.385e-01 6.100e-01 0.555 0.57894
## launched_year2014 -1.104e+00 4.227e-01 -2.613 0.00898 **
## launched_year2015 -1.165e+00 4.172e-01 -2.792 0.00524 **
## launched_year2016 -9.649e-01 4.448e-01 -2.169 0.03006 *
## launched_year2017 -2.744e-01 4.807e-01 -0.571 0.56810
## launched_yearBefore 2012 3.540e-01 5.315e-01 0.666 0.50533
## launched_month02 4.598e-01 4.283e-01 1.074 0.28298
## launched_month03 1.446e-01 4.008e-01 0.361 0.71821
## launched_month04 3.571e-01 4.435e-01 0.805 0.42075
## launched_month05 5.193e-01 4.538e-01 1.144 0.25248
## launched_month06 1.056e+00 5.350e-01 1.973 0.04845 *
## launched_month07 -4.954e-01 3.646e-01 -1.359 0.17426
## launched_month08 2.970e-01 4.155e-01 0.715 0.47469
## launched_month09 2.040e-01 4.226e-01 0.483 0.62921
## launched_month10 1.276e-02 3.995e-01 0.032 0.97453
## launched_month11 -6.790e-02 4.133e-01 -0.164 0.86951
## launched_month12 1.267e-01 4.623e-01 0.274 0.78399
## currencyCAD 0.000e+00 0.000e+00 NA NA
## currencyEUR 5.841e-01 5.840e-01 1.000 0.31722
## currencyGBP 5.030e-01 4.945e-01 1.017 0.30900
## currencyOther -1.149e-01 5.980e-01 -0.192 0.84766
## currencyUSD 3.030e-01 4.377e-01 0.692 0.48876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(duration) 1.048 1.095 4.162 0.0488 *
## s(backers) 1.000 1.000 23.210 1.45e-06 ***
## s(usd_pledged_real) 1.000 1.000 439.800 < 2e-16 ***
## s(usd_goal_real) 1.000 1.000 439.037 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 73/77
## R-sq.(adj) = 0.998 Deviance explained = 99.6%
## UBRE = -0.99485 Scale est. = 1 n = 198879
plot(model.gam, shade = TRUE, seWithMean = TRUE, scale = 0, pages = 1)
The final model was therefore built with linear terms on all the predictors. The final GAM has an AIC of 1023.8, comparable to the AIC of 1012 for the final logistic regression model.
model.gam <- gam(state ~ main_category + country + launched_year + launched_month +
duration + currency + backers + usd_pledged_real +
usd_goal_real, family = binomial, data = train.data)
# In-sample Prediction
prob.gam.in <- predict(model.gam, train.data, type = "response")
pred.gam.in <- (prob.gam.in >= 0.64) * 1
table(train.data$state, pred.gam.in, dnn = c("Observation", "Prediction"))
## Prediction
## Observation 0 1
## failed 118598 69
## successful 14 80198
# roc.plot(train.data$state == "successful", prob.gam.in)$roc.vol
# Model selection - Validation data
prob.gam.val <- predict(model.gam, validation.data, type = "response")
pred.gam.val <- (prob.gam.val >= 0.64) * 1
table(validation.data$state, pred.gam.val, dnn = c("Observation", "Prediction"))
## Prediction
## Observation 0 1
## failed 39329 16
## successful 4 26944
roc.plot(validation.data$state == "successful", prob.gam.val, main = "Validation ROC",
ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999984 0 NA
3. Classification Tree
Tree models, among the simplest machine learning models to build and interpret, split the data into two or more homogeneous sets based on the most significant differentiator among the predictor variables.
Building a classification tree on the binary response variable ‘state’ resulted in the tree below (Figure 19), which shows that the predictors backers, goal and usd_pledged are the most significant variables for classifying the response.
The figure also shows the separation between the failed and successful classes: for example, a KS project with fewer than 18 backers, less than 999 USD pledged and a goal above 808 USD is almost certain to fail.
tree.model <- rpart(state ~ ., data = train.data, method = "class")
rpart.plot(tree.model)
Parameter Tuning
In order to prune the initial classification tree, the complexity parameter (Cp) was tuned. The Cp value beyond which the relative error no longer decreases appreciably was chosen as the final Cp (0.011).
plotcp(tree.model)
# Optimal cut-off value
# Pruning the tree
tree.model <- rpart(state ~ ., data = train.data, method = "class", cp = 0.011)
rpart.plot(tree.model)
Pruning the initial decision tree with a Cp value of 0.011 left it unchanged: the initial tree was already at the chosen complexity, so the final decision tree (Figure 20) is identical to the initial one.
par(mfrow = c(1,2))
# In-sample Prediction
tree.predict.in <- predict(tree.model, train.data, type = "class")
tree.pred.in <- predict(tree.model, train.data, type = "prob")
confusionMatrix(train.data$state, tree.predict.in)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 113621 5046
## successful 1753 78459
##
## Accuracy : 0.9658
## 95% CI : (0.965, 0.9666)
## No Information Rate : 0.5801
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9294
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9848
## Specificity : 0.9396
## Pos Pred Value : 0.9575
## Neg Pred Value : 0.9781
## Prevalence : 0.5801
## Detection Rate : 0.5713
## Detection Prevalence : 0.5967
## Balanced Accuracy : 0.9622
##
## 'Positive' Class : failed
##
roc.plot(train.data$state == "successful", tree.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9801735 0 NA
# Model selection - Validation data
tree.predict.val <- predict(tree.model, validation.data, type = "class")
tree.pred.val <- predict(tree.model, validation.data, type = "prob")
confusionMatrix(validation.data$state, tree.predict.val)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 37606 1739
## successful 592 26356
##
## Accuracy : 0.9648
## 95% CI : (0.9634, 0.9662)
## No Information Rate : 0.5762
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9276
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9845
## Specificity : 0.9381
## Pos Pred Value : 0.9558
## Neg Pred Value : 0.9780
## Prevalence : 0.5762
## Detection Rate : 0.5673
## Detection Prevalence : 0.5935
## Balanced Accuracy : 0.9613
##
## 'Positive' Class : failed
##
roc.plot(validation.data$state == "successful", tree.pred.val[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9798618 0 NA
4. Random Forest
Random Forests are an extension of the simple classification tree algorithm: many trees are built on different samples of the data, with a subset of variables used for building each tree. The random forest combines the decisions of the individual trees to make the final decision for each observation. The randomForest package in R automatically calculates the average out-of-bag (OOB) error as the forest is built; the OOB error is analogous to a misclassification rate, calculated on the observations not chosen to build each tree.
set.seed(12420360)
rf.model <- randomForest(formula = state ~ ., data = train.data, importance = TRUE)
print(rf.model)
##
## Call:
## randomForest(formula = state ~ ., data = train.data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.27%
## Confusion matrix:
## failed successful class.error
## failed 118148 519 0.0043735832
## successful 11 80201 0.0001371366
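The OOB estimate printed above can also be read off the fitted object; a quick sketch:
# OOB error after all 500 trees: last row of the error-rate matrix
tail(rf.model$err.rate[, "OOB"], 1) # ~0.0027 (0.27%)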
Parameter Tuning
There are multiple parameters that can be tuned in the Random Forest model. The parameters ntree, mtry and nodesize are, respectively, the number of trees built, the number of variables sampled at each candidate split, and the minimum size of the terminal nodes.
Tuning only mtry with a fixed 500 trees produced the OOB errors shown in Figure 22. From the figure, an mtry of 3 was chosen to preserve variability among the trees in the Random Forest, although larger values (mtry of 9 in particular) produced lower OOB errors.
# Tune only mtry parameter using tuneRF()
set.seed(12420360)
res <- tuneRF(x = subset(train.data, select = -state),
y = train.data$state,
ntreeTry = 500)
## mtry = 3 OOB error = 0.27%
## Searching left ...
## mtry = 2 OOB error = 0.52%
## -0.9512195 0.05
## Searching right ...
## mtry = 6 OOB error = 0.1%
## 0.6097561 0.05
## mtry = 9 OOB error = 0.08%
## 0.25 0.05
print(res)
## mtry OOBError
## 2.OOB 2 0.0052293103
## 3.OOB 3 0.0026800215
## 6.OOB 6 0.0010458621
## 9.OOB 9 0.0007843965
Tuning the other parameters, nodesize and ntree, with mtry fixed at 3 resulted in the lowest OOB error of 0.0027 with a nodesize of 3 and an ntree of 200. These tuned parameters were used to build the final Random Forest model.
# Manually tuning all parameters of the Random Forest
# Establish a list of possible values for nodesize and ntree
nodesize <- seq(3, 8, 2)
ntree <- seq(50, 200, 50)
hyper_grid <- expand.grid(nodesize = nodesize, ntree = ntree)
oob_err <- c()
for (i in 1:nrow(hyper_grid)) {
model <- randomForest(formula = state ~ ., data = train.data,
mtry = 3,
nodesize = hyper_grid$nodesize[i],
ntree = hyper_grid$ntree[i])
oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
# Identify optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
## nodesize ntree
## 10 3 200
# Build the tuned randomForest with the above parameters
rf.model <- randomForest(formula = state ~ ., data = train.data,
mtry = 3, nodesize = 3,
ntree = 200, importance = TRUE)
par(mfrow = c(1,2))
# In-sample Prediction
rf.predict.in <- predict(rf.model, train.data)
rf.pred.in <- predict(rf.model, train.data, type = "prob")
confusionMatrix(train.data$state, rf.predict.in)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 118664 3
## successful 0 80212
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.5967
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5967
## Detection Rate : 0.5967
## Detection Prevalence : 0.5967
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : failed
##
roc.plot(train.data$state == "successful", rf.pred.in[,2], ylab = "True Positive Rate",
xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 1 0 NA
# Model selection - Validation data
rf.predict.val <- predict(rf.model, validation.data)
rf.pred.val <- predict(rf.model, validation.data, type = "prob")
confusionMatrix(validation.data$state, rf.predict.val)
## Confusion Matrix and Statistics
##
## Reference
## Prediction failed successful
## failed 39175 170
## successful 5 26943
##
## Accuracy : 0.9974
## 95% CI : (0.9969, 0.9977)
## No Information Rate : 0.591
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9945
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9999
## Specificity : 0.9937
## Pos Pred Value : 0.9957
## Neg Pred Value : 0.9998
## Prevalence : 0.5910
## Detection Rate : 0.5909
## Detection Prevalence : 0.5935
## Balanced Accuracy : 0.9968
##
## 'Positive' Class : failed
##
roc.plot(validation.data$state == "successful", rf.pred.val[,2], ylab = "True Positive Rate",
xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999771 0 NA
The final model for classifying the KS projects was determined after comparison of all models’ performance on the validation data (20% of the data).
The best model was chosen based on performance on the validation data, with AUC as the criterion for model selection. On the in-sample data, the Random Forest performed best, recording the highest AUC and an accuracy of 1. On the validation data, the Logistic Regression model and the GAM performed nearly identically; hence, the simpler Logistic Regression model was chosen as the best model, with an accuracy of 0.9996 and an AUC of 0.9999.
The below figure compares the ROC plots across all models for in-sample and validation data.
par(mfrow = c(1,2))
# Plot multiple ROC curves in one - Logistic, GAM, Tree and Random Forest
# Compare different models using ROC curves
# In-sample ROC
rocplot1 <- roc.plot(x = (train.data$state == "successful"),
pred = cbind(pred.in, prob.gam.in, tree.pred.in[,2], rf.pred.in[,2]),
main = "ROC curve: In-sample",
legend = T, leg.text = c("GLM","GAM", "Tree", "Random Forest"))
# Validation set ROC
rocplot2 <- roc.plot(x = (validation.data$state == "successful"),
pred = cbind(pred.val, prob.gam.val, tree.pred.val[,2], rf.pred.val[,2]),
main = "ROC curve: Validation set",
legend = T, leg.text = c("GLM","GAM", "Tree", "Random Forest"))
rocplot1$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999841 0 NA
## 2 Model 2 0.9999845 0 NA
## 3 Model 3 0.9801735 0 NA
## 4 Model 4 1.0000000 0 NA
rocplot2$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999986 0 NA
## 2 Model 2 0.9999984 0 NA
## 3 Model 3 0.9798618 0 NA
## 4 Model 4 0.9999771 0 NA
The Logistic Regression model came out best on both performance criteria, accuracy and AUC. With it chosen as the final model for classifying KS projects as potential successes and failures, the results below were obtained on the test data.
# Model performance - Test data
pred.out <- predict(model.glm, newdata = test.data, type = "response")
prediction.out <- ifelse(pred.out < 0.64,0,1)
table(as.factor(test.data$state), prediction.out)
## prediction.out
## 0 1
## failed 39580 22
## successful 2 26689
roc.plot(test.data$state == "successful", pred.out)$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.9999739 0 NA
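The test-set accuracy implied by the confusion table above can be computed directly:
# Overall accuracy on the test set
mean(prediction.out == as.numeric(test.data$state == "successful")) # ~0.9996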
References