Introduction (Bussiness Understanding)

Initial coin offers (ICOs) are a new method for startups to raise funding. For an entrepreneur, ICOs can provide funding at all stages with virtually no transaction expenses. For an investor, ICOs may help them earn higher profits.
In this report, I will produce a model for predicting whether a startup will reach its fundraising goal using the potential predictors.

Data Understanding

The dataset contains 20 columns, apart from column “ID” and response variable column “goal”, there are 10 numerical columns, and 8 categorical columns. However, after further analysis, column “enddate” and “startdate” can be converted to a “Date” object, which is a numerical type.

data <- read.csv("data_LUBS5990M_202122.csv")

data <- data %>% mutate(startdate = as.Date(startdate, format = "%d/%m/%Y")) %>% 
  mutate(enddate = as.Date(enddate, format = "%d/%m/%Y"))

glimpse(data, width = 60)

## Rows: 1,181
## Columns: 20
## $ ID                   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
## $ goal                 <chr> "N", "N", "Y", "Y", "Y", "N",…
## $ enddate              <date> 2018-04-15, 2017-08-25, 2017…
## $ startdate            <date> 2018-03-01, 2017-07-25, 2017…
## $ coinNum              <dbl> 1.275e+07, 0.000e+00, 2.220e+…
## $ teamSize             <int> 14, 13, 9, 7, 20, 10, 4, 22, …
## $ country_region       <chr> "Estonia", "Singapore", "Sing…
## $ categories           <chr> "Charity,Education,Health,Sma…
## $ overallrating        <dbl> 2.6, 2.1, 2.5, 2.1, 4.3, 3.5,…
## $ ratingTeam           <dbl> 2.4, 0.0, 0.0, 1.0, 4.4, 3.3,…
## $ ratingProduct        <dbl> 2.4, 0.0, 0.0, 2.0, 4.0, 3.3,…
## $ platform             <chr> "Ethereum", "unknown", "Ether…
## $ acceptingCurrencyNum <chr> "1", "1", "unknown", "2", "1"…
## $ whitepaper           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ video                <int> 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,…
## $ socialMedia          <int> 0, 0, 3, 0, 2, 3, 0, 3, 0, 0,…
## $ GitHub               <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,…
## $ teamLinkedIn         <chr> "86%", "0%", "0%", "0%", "100…
## $ teamPhotos           <chr> "100%", "100%", "100%", "100%…
## $ CEOPhoto             <int> 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,…

Missing values and outliers can have a great impact on the classification performance for some models, as we should first check them and make the dataset tidy. According to the dataset summary information above, there is no missing values.

For the 10 numerical columns, we can check the distribution of each feature, and examine whether there exists outliers.

data %>% select(coinNum, teamSize, overallrating, ratingTeam, ratingProduct, 
                whitepaper, video, socialMedia, GitHub, CEOPhoto) %>% 
  gather(key = "Type", value = "value") %>% 
  ggplot() + 
  geom_histogram(aes(x = value, fill = Type), bins = 30) + 
  facet_wrap(~Type, scales = "free") + 
  theme_bw() + 
  theme(legend.position = "none")

median_coinNum <- data %>% select(coinNum) %>% pull(coinNum) %>% median()

According to the distribution of 10 numerical predictor variables, there are some outliers in the variable “coinNum”. The median of the variable “coinNum” is 6^{7}. We can set the “coinNum” higher than 99% quantile as the 99% quantile, and then set the “coinNum” lower than 1% quantile as the 1% quantile. After that the logarithm of the variable should be taken.

A <- ggplot(data) + 
  geom_histogram(aes(coinNum), bins = 30) + 
  ggtitle("Before Transformation") + 
  theme_bw()

B <- data %>% mutate(quantile_99 = quantile(coinNum, .99), 
                quantile_01 = quantile(coinNum, .01)) %>% 
  mutate(coinNum = ifelse(coinNum > quantile_99, quantile_99, coinNum),
         coinNum = ifelse(coinNum < quantile_01, quantile_01, coinNum)) %>% 
  mutate(coinNum_log = log(coinNum + 1)) %>% 
  ggplot() + 
  geom_histogram(aes(coinNum_log), bins = 30) + 
  ggtitle("After Transformation") + 
  theme_bw()

ggarrange(A, B)

data <- data %>% mutate(quantile_99 = quantile(coinNum, .99), 
                quantile_01 = quantile(coinNum, .01)) %>% 
  mutate(coinNum = ifelse(coinNum > quantile_99, quantile_99, coinNum),
         coinNum = ifelse(coinNum < quantile_01, quantile_01, coinNum)) %>% 
  mutate(coinNum_log = log(coinNum + 1)) %>% 
  select(-quantile_01, -quantile_99, -coinNum)

According to the figure above, after such transformation, the distribution of variable “coinNum” looks much better.

Data Preparation

Instead of being limited to the existing predictor variables, some predictor variables can be created based on the existing variables.

First, the duration of the fundraising campaign can be created using the start and end dates. After analyzing, the start date and end date of the following 3 samples are wrong.

data %>% mutate(duration = enddate - startdate) %>% 
  filter(duration < 0) %>% 
  select(ID, goal, enddate, startdate, duration)

Apparently, these samples are wrong, and we should take the absolute value of the duration.

A <- data %>% mutate(duration = enddate - startdate) %>% 
  mutate(duration = ifelse(duration < 0, -duration, duration)) %>% 
  ggplot() + 
  geom_histogram(aes(duration), bins = 30) + 
  ggtitle("Before Transformation") + 
  theme_bw()

B <- data %>% mutate(duration = enddate - startdate) %>% 
  mutate(duration = ifelse(duration < 0, -duration, duration)) %>% 
  mutate(duration_log = log(duration + 1)) %>% 
  ggplot() + 
  geom_histogram(aes(duration_log), bins = 30) + 
  ggtitle("After Transformation") + 
  theme_bw()

ggarrange(A, B)

data <- data %>% mutate(duration = enddate - startdate) %>% 
  mutate(duration = ifelse(duration < 0, -duration, duration)) %>% 
  mutate(duration_log = log(duration + 1)) %>% 
  select(-duration)

According to the figure above, the distribution of duration is left-skewed, however, a normal distribution generally contributes to better prediction performance. As a result, I add 1 to the “duration” variable and then take the logarithm. It can be seen that, the transformed variable has a better distribution.

glimpse(data, width = 60)

## Rows: 1,181
## Columns: 21
## $ ID                   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
## $ goal                 <chr> "N", "N", "Y", "Y", "Y", "N",…
## $ enddate              <date> 2018-04-15, 2017-08-25, 2017…
## $ startdate            <date> 2018-03-01, 2017-07-25, 2017…
## $ teamSize             <int> 14, 13, 9, 7, 20, 10, 4, 22, …
## $ country_region       <chr> "Estonia", "Singapore", "Sing…
## $ categories           <chr> "Charity,Education,Health,Sma…
## $ overallrating        <dbl> 2.6, 2.1, 2.5, 2.1, 4.3, 3.5,…
## $ ratingTeam           <dbl> 2.4, 0.0, 0.0, 1.0, 4.4, 3.3,…
## $ ratingProduct        <dbl> 2.4, 0.0, 0.0, 2.0, 4.0, 3.3,…
## $ platform             <chr> "Ethereum", "unknown", "Ether…
## $ acceptingCurrencyNum <chr> "1", "1", "unknown", "2", "1"…
## $ whitepaper           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ video                <int> 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,…
## $ socialMedia          <int> 0, 0, 3, 0, 2, 3, 0, 3, 0, 0,…
## $ GitHub               <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,…
## $ teamLinkedIn         <chr> "86%", "0%", "0%", "0%", "100…
## $ teamPhotos           <chr> "100%", "100%", "100%", "100%…
## $ CEOPhoto             <int> 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,…
## $ coinNum_log          <dbl> 16.36104, 0.00000, 21.52077, …
## $ duration_log         <dbl> 3.8286414, 3.4657359, 1.79175…

According to the glimpse of the dataset, there are some categorical predictor variables with order can be converted to numerical variables. For example, categorical predictor variables “acceptingCurrencyNum”, “teamLinkedInn”, and “teamPhotos.

For “acceptingCurrencyNum”, the unique values of this categorical predictor variable are listed as follow:

data %>% select(acceptingCurrencyNum) %>% pull(acceptingCurrencyNum) %>% unique()

##  [1] "1"       "unknown" "2"       "3"       "4"       "10"      "7"      
##  [8] "5"       "8"       "6"       "9"       "11"

freq_acceptingCurrencyNum <- data %>% select(acceptingCurrencyNum) %>% 
  pull(acceptingCurrencyNum) %>% table()

freq_acceptingCurrencyNum

## .
##       1      10      11       2       3       4       5       6       7       8 
##     571       3       1     185     122      79      33      19       8       4 
##       9 unknown 
##       3     153

percent_unknown <- round(freq_acceptingCurrencyNum['unknown'] / 
                           sum(freq_acceptingCurrencyNum), 4)

It can be seen that this categorical variable is ordinal, all categories can be converted into numerical values directly except the category “unknown”. The “unknown” can be considered as the missing value (the percent is 0.1296), and we can fill it using the median value.

data <- data %>% mutate(acceptingCurrencyNum = as.numeric(acceptingCurrencyNum)) %>% 
  mutate(acceptingCurrencyNum = ifelse(is.na(acceptingCurrencyNum),
                                       median(na.omit(acceptingCurrencyNum)),
                                       acceptingCurrencyNum))

## Warning in mask$eval_all_mutate(quo): 强制改变过程中产生了NA

ggplot(data) + 
  geom_histogram(aes(x = acceptingCurrencyNum), bins = 30) + 
  ggtitle("Distribution of acceptingCurrencyNum") + 
  theme_bw()

According to the figure “Distribution of acceptingCurrencyNum”, we can see that the frequency decrease exponentially when the number of acceptingCurrencyNum increases.

For “teamLinkedInn”, the unique values of this categorical predictor variable are listed as follow:

data %>% select(teamLinkedIn) %>% pull(teamLinkedIn) %>% unique()

##  [1] "86%"  "0%"   "100%" "90%"  "95%"  "29%"  "23%"  "88%"  "50%"  "96%" 
## [11] "91%"  "60%"  "33%"  "89%"  "69%"  "98%"  "66%"  "73%"  "67%"  "70%" 
## [21] "94%"  "71%"  "64%"  "17%"  "58%"  "47%"  "44%"  "43%"  "55%"  "63%" 
## [31] "8%"   "59%"  "20%"  "85%"  "83%"  "57%"  "54%"  "74%"  "25%"  "72%" 
## [41] "62%"  "11%"  "79%"  "37%"  "31%"  "68%"  "93%"  "38%"  "36%"  "97%" 
## [51] "87%"  "75%"  "41%"  "12%"  "56%"  "21%"  "6%"   "81%"  "45%"  "80%" 
## [61] "40%"  "65%"  "92%"  "42%"  "77%"  "15%"  "13%"  "30%"  "22%"  "82%" 
## [71] "18%"  "84%"  "78%"  "3%"   "14%"  "53%"  "61%"  "76%"  "46%"  "27%"

According to the unique value of variable “teamLinkedInn”, this categorical predictor variable can be converted into numerical predictor variable directly by removing the “%” symbol.

data <- data %>% mutate(teamLinkedIn = substr(teamLinkedIn, 
                                      start = 1, 
                                      stop = nchar(teamLinkedIn) - 1)) %>% 
  mutate(teamLinkedIn = as.numeric(teamLinkedIn))
data %>% 
  ggplot() + 
  geom_histogram(aes(x = teamLinkedIn), bins = 30) + 
  ggtitle("Distribution of teamLinkedInn") + 
  theme_bw()

According to the figure above, the transformed numerical predictor variable “teamLinkedInn” is not normalized, and there is no missing values.

For “teamLinkedInn”, the unique values of this categorical predictor variable are listed as follow:

data %>% select(teamPhotos) %>% pull(teamPhotos) %>% unique()

##  [1] "100%" "57%"  "0%"   "88%"  "86%"  "90%"  "95%"  "71%"  "78%"  "60%" 
## [11] "80%"  "93%"  "70%"  "62%"  "59%"  "97%"  "92%"  "89%"  "45%"  "94%" 
## [21] "91%"  "75%"  "22%"  "96%"  "82%"  "48%"  "83%"  "79%"  "84%"  "50%" 
## [31] "26%"  "55%"  "8%"   "85%"  "17%"  "33%"  "38%"  "76%"  "72%"

Same as variable “teamLinkedInn”, this categorical predictor variable can be converted into numerical predictor variable directly by removing the “%” symbol.

data <- data %>% mutate(teamPhotos = substr(teamPhotos, 
                                      start = 1, 
                                      stop = nchar(teamPhotos) - 1)) %>% 
  mutate(teamPhotos = as.numeric(teamPhotos))

data %>% 
  ggplot() + 
  geom_histogram(aes(x = teamPhotos), bins = 30) + 
  ggtitle("Distribution of teamPhotos") + 
  theme_bw()

According to the figure above, almost all teams have team Photos.

After these transformation, there are only 3 categorical predictor variables (country_region, categories, platform) in the dataset.

glimpse(data, width = 60)

## Rows: 1,181
## Columns: 21
## $ ID                   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
## $ goal                 <chr> "N", "N", "Y", "Y", "Y", "N",…
## $ enddate              <date> 2018-04-15, 2017-08-25, 2017…
## $ startdate            <date> 2018-03-01, 2017-07-25, 2017…
## $ teamSize             <int> 14, 13, 9, 7, 20, 10, 4, 22, …
## $ country_region       <chr> "Estonia", "Singapore", "Sing…
## $ categories           <chr> "Charity,Education,Health,Sma…
## $ overallrating        <dbl> 2.6, 2.1, 2.5, 2.1, 4.3, 3.5,…
## $ ratingTeam           <dbl> 2.4, 0.0, 0.0, 1.0, 4.4, 3.3,…
## $ ratingProduct        <dbl> 2.4, 0.0, 0.0, 2.0, 4.0, 3.3,…
## $ platform             <chr> "Ethereum", "unknown", "Ether…
## $ acceptingCurrencyNum <dbl> 1, 1, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ whitepaper           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ video                <int> 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,…
## $ socialMedia          <int> 0, 0, 3, 0, 2, 3, 0, 3, 0, 0,…
## $ GitHub               <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,…
## $ teamLinkedIn         <dbl> 86, 0, 0, 0, 100, 90, 100, 95…
## $ teamPhotos           <dbl> 100, 100, 100, 100, 100, 100,…
## $ CEOPhoto             <int> 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,…
## $ coinNum_log          <dbl> 16.36104, 0.00000, 21.52077, …
## $ duration_log         <dbl> 3.8286414, 3.4657359, 1.79175…

num_country <- length(data %>% pull(country_region) %>% unique())

There are total 103 in the dataset, however, the number of observations of most countries is very small. The first 10 countries with highest frequency can be seen as follow:

country_region_num <- data %>% select(country_region) %>% group_by(country_region) %>% 
  summarise(num = n()) %>% arrange(desc(num)) %>% slice(1:10)

country_region_num

To construct a model with high classification performance without over fitting, it is necessary to create new categorical predictor variable from this categorical predictor variable. In this case, the country whose number of observations is smaller than 10 will be encoded as “other”.

data_num_10 <- data %>% select(country_region) %>% group_by(country_region) %>% 
  summarise(num = n()) %>% arrange(desc(num)) %>% filter(num >= 10)

data <- data %>% mutate(country_region = 
                  ifelse(country_region %in% data_num_10$country_region,
                         country_region,
                         "other"))

data %>% select(country_region) %>% group_by(country_region) %>% 
  summarise(num = n()) %>% 
  ggplot() + 
  geom_col(aes(x = fct_rev(fct_reorder(country_region, num)), y = num, fill = country_region)) + 
  ggtitle("Number of Samples in Each Country") + 
  xlab("Country") + 
  ylab("Frequency") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  theme(legend.position = "none")

According to the figure above, after re-encode the countries with 10 or fewer samples as “other”, the number of categories is reasonable.

num_platform <- length(data %>% pull(platform) %>% unique())

There are total 50 in the dataset, however, the number of observations of most platforms is very small. The first 10 platforms with highest frequency can be seen as follow:

platform_num <- data %>% select(platform) %>% group_by(platform) %>% 
  summarise(num = n()) %>% arrange(desc(num)) %>% slice(1:10)

platform_num

platform_share <- round(sum(platform_num$num[1]) / nrow(data), 4)

According to the number of samples in each platform, the top 1 platform takes 0.829 of the market share, and I think all other platforms should be encoded as “other”.

data_num_100 <- data %>% select(platform) %>% group_by(platform) %>% 
  summarise(num = n()) %>% arrange(desc(num)) %>% filter(num >= 100)

data <- data %>% mutate(platform = 
                  ifelse(platform %in% data_num_100$platform,
                         platform,"other"))

Follow that, the categorical predictor variable “categories” should be processed as it typically contains multiple categories. First of all, we need to check how many categories this predictor variable contains after split each element by comma.

categories_vector <- c()

for (i in 1:nrow(data)){
  res <- strsplit(data$categories[i], ",")
  categories_vector <- append(categories_vector, res[[1]])
}

num_categories <- length(unique(categories_vector))

sort(table(categories_vector))

## categories_vector
##                     Art             Electronics         Virtual Reality 
##                      12                      13                      17 
##                   Legal                 Charity                  Sports 
##                      19                      20                      25 
##           Manufacturing                  Energy       Casino & Gambling 
##                      30                      32                      34 
##               Education                 Tourism                   Other 
##                      38                      38                      46 
##                Big Data                  Health             Real estate 
##                      54                      55                      56 
## Artificial Intelligence                  Retail                   Media 
##                      59                      66                      70 
##           Communication          Smart Contract          Infrastructure 
##                      91                     100                     106 
##                 Banking           Entertainment                Internet 
##                     112                     128                     130 
##                Software              Investment       Business services 
##                     166                     206                     290 
##          Cryptocurrency                Platform 
##                     439                     583

After analyzing, the total number of categories is 29. From the frequency table above, it can be seen that the minimum number of observations of a category is 12. In this case, I want to add additional three binary categorical variables, which is “categories_Platform”, “categories_Cryptocurrency”, and “categories_Business_services” respectively. Each variable indicates whether certain category exists in the category list of the startup company.

data <- data %>% 
  mutate(categories_Platform = 
           ifelse(str_detect(categories, "Platform"), "YES", "NO")) %>% 
  mutate(categories_Cryptocurrency = 
           ifelse(str_detect(categories, "Cryptocurrency"), "YES", "NO")) %>% 
  mutate(categories_Business_services = 
           ifelse(str_detect(categories, "Business services"), "YES", "NO")) %>% 
  select(-categories)

At last, the original “category” variable can be removed, and the dataset is clean now.

glimpse(data, width = 60)

## Rows: 1,181
## Columns: 23
## $ ID                           <int> 1, 2, 3, 4, 5, 6, 7, …
## $ goal                         <chr> "N", "N", "Y", "Y", "…
## $ enddate                      <date> 2018-04-15, 2017-08-…
## $ startdate                    <date> 2018-03-01, 2017-07-…
## $ teamSize                     <int> 14, 13, 9, 7, 20, 10,…
## $ country_region               <chr> "Estonia", "Singapore…
## $ overallrating                <dbl> 2.6, 2.1, 2.5, 2.1, 4…
## $ ratingTeam                   <dbl> 2.4, 0.0, 0.0, 1.0, 4…
## $ ratingProduct                <dbl> 2.4, 0.0, 0.0, 2.0, 4…
## $ platform                     <chr> "Ethereum", "other", …
## $ acceptingCurrencyNum         <dbl> 1, 1, 1, 2, 1, 1, 1, …
## $ whitepaper                   <int> 1, 1, 1, 1, 1, 1, 1, …
## $ video                        <int> 1, 1, 0, 0, 1, 1, 1, …
## $ socialMedia                  <int> 0, 0, 3, 0, 2, 3, 0, …
## $ GitHub                       <int> 0, 1, 0, 0, 1, 1, 1, …
## $ teamLinkedIn                 <dbl> 86, 0, 0, 0, 100, 90,…
## $ teamPhotos                   <dbl> 100, 100, 100, 100, 1…
## $ CEOPhoto                     <int> 0, 0, 0, 1, 0, 1, 1, …
## $ coinNum_log                  <dbl> 16.36104, 0.00000, 21…
## $ duration_log                 <dbl> 3.8286414, 3.4657359,…
## $ categories_Platform          <chr> "NO", "NO", "YES", "N…
## $ categories_Cryptocurrency    <chr> "NO", "NO", "NO", "YE…
## $ categories_Business_services <chr> "NO", "NO", "YES", "Y…

Modelling

Relationship among Predictors

Before fitting classification model to predict whether the startup achieved their funding goal, it is necessary to analyze the relationship of variables. Finally, total 21 predictors variables are selected. It can be seen as follow:

data %>% select(-ID, -goal) %>% colnames()

##  [1] "enddate"                      "startdate"                   
##  [3] "teamSize"                     "country_region"              
##  [5] "overallrating"                "ratingTeam"                  
##  [7] "ratingProduct"                "platform"                    
##  [9] "acceptingCurrencyNum"         "whitepaper"                  
## [11] "video"                        "socialMedia"                 
## [13] "GitHub"                       "teamLinkedIn"                
## [15] "teamPhotos"                   "CEOPhoto"                    
## [17] "coinNum_log"                  "duration_log"                
## [19] "categories_Platform"          "categories_Cryptocurrency"   
## [21] "categories_Business_services"

To analyze which predictor has the highest correlation with the target variable, the pearson correlation coefficient is calculated, and the corresponding correlation matrix is plotted.

data.cor <- data %>% mutate(goal = ifelse(goal == "Y", 1, 0)) %>% 
  select(goal, teamSize, overallrating, ratingTeam, ratingProduct, acceptingCurrencyNum, 
         whitepaper, video, socialMedia, GitHub, teamLinkedIn, teamPhotos, CEOPhoto,
         coinNum_log, duration_log) %>% 
  cor()
corrplot(data.cor)

overallrating_coef <- round(data.cor['goal', 'overallrating'], 4)

According to the correlation coefficient matrix above, it can be seen that predictor variables “teamSize”, “overallrating”, “ratingTeam”, “ratingProduct”, “socialMedia”, and “coinNum_log” are most related to the target variable “goal”. These predictor variables are all positively correlated with the target variable. Among these predictors, variable “overallrating” has the highest correlation coefficient 0.5173.
There also exists collinearity in these predictor variables, such as “overallrating” - “teamSize”, “ratingTeam” - “teamSize”, “ratingProduct” - “teamSize”, “ratingTeam” - “overallrating”, “ratingProduct” - “overallrating”, and “video” - “overallrating”.

Split Traning and Validation Dataset

To evaluate the performance of classification models, we should first split the dataset into training dataset and test dataset. In order for reproducibility, a random seed 1 was set before splitting the dataset.

set.seed(1)
ns <- nrow(data)
idx <- sample(seq(1, ns), ns)
ntr <- floor(0.7 * ns)
idx_tr <- idx[1:ntr]
idx_te <- idx[(ntr+1):ns]
data_tr <- data %>% slice(idx_tr) %>% select(-ID)
data_te <- data %>% slice(idx_te) %>% select(-ID)
nte <- nrow(data_te)

70% percent of samples are selected as training dataset, and the reaming 30% percent of samples as test dataset. At last, the training dataset has 826 samples, and the test dataset has 355 samples.

Model Selection

Generally, there are many machine learning classification models can be used for this dataset, such as K Nearest Neighbor (KNN), Naive Bayes (NB), Decision Trees (DT), Random Forest (RF), Support Vector Machine (SVM), and Neural Network (NN). However, the predictor variables of this dataset are not purely numerical, which means K Nearest Neighbor model and Neural Network model cannot be easily implemented without further encoding. Not only that, the Naive Bayes model is not suitable for this dataset since this model assumes all the predictor variables are independent. However, according to the correlation matrix, this assumption is not satisfied.

As a result, in this case, to determine the best model for classification, I use cross validation to test the classification performance of Decision Tree Model, Random Forest Model, and Support Vector Machine Model.

set.seed(1)
train_control <- trainControl('cv', 5, savePred = T, classProbs = T)

model.dt <- train(goal ~ ., data = data_tr, method="rpart", trControl = train_control)

model.rf <- train(goal ~ ., data = data_tr, method="rf", trControl = train_control)

model.svm <- train(goal ~ ., data = data_tr, method="svmRadial", trControl = train_control)

The cross validation result of Decision Tree Model is:

model.dt$results

acc_dt <- round(max(model.dt$results['Accuracy']), 4)

The cross validation result of Random Forest Model is:

model.rf$results

acc_rf <- round(max(model.rf$results['Accuracy']), 4)

The cross validation result of Support Vector Machine Model is:

model.svm$results

acc_svm <- round(max(model.svm$results['Accuracy']), 4)

The highest accuracy of three models are 0.7724, 0.793, and 0.7494 respectively. As the samples are balanced, the model with highest accuracy and lowest accuracy variance should be selected. As the result, I think the most appropriate model for this dataset is Random Forest.

Evaluation

To evaluate the prediction performance of the three selected models above, the ROC curve is created for three models.

data_te$goal <- as.factor(data_te$goal)
levels(data_te$goal) <- c("Y", "N")

pred_response <- predict(model.dt, data_te, type="prob")
pred_response <- pred_response[,2]
real_class <- data_te$goal
# Y: Positive; N: Negative;
roc_score <- roc(real_class, pred_response, levels = c("N", "Y"), direction = ">")
plot(roc_score, main = "ROC curve -- Decision Tree")

auc.dt <- round(roc_score$auc, 4)

pred_response <- predict(model.rf, data_te, type="prob")
pred_response <- pred_response[,2]
real_class <- data_te$goal
# Y: Positive; N: Negative;
roc_score <- roc(real_class, pred_response, levels = c("N", "Y"), direction = ">")
plot(roc_score, main = "ROC curve -- Random Forest")

auc.rf <- round(roc_score$auc, 4)

pred_response <- predict(model.svm, data_te, type="prob")
pred_response <- pred_response[,2]
real_class <- data_te$goal
# Y: Positive; N: Negative;
roc_score <- roc(real_class, pred_response, levels = c("N", "Y"), direction = ">")
plot(roc_score, main = "ROC curve -- Support Vector Machine")

auc.svm <- round(roc_score$auc, 4)

According to the three ROC curves above, the area under the curve is different for different model. The AUC of Decision Tree, Random Forest, and Support Vector Machine models is 0.7709, 0.8756, and 0.8308 respectively.

Based on the ROC curve, we can determine the optimal cutoff point which can maximize the sum of sensitivities and specificities.

pred_response <- predict(model.dt, data_te, type="prob")
pred_response <- pred_response[,2]
real_class <- data_te$goal
# Y: Positive; N: Negative;
roc_score <- roc(real_class, pred_response, levels = c("N", "Y"), direction = ">")
# Determine the optimal cutoff point
idx <- which.max(roc_score$sensitivities + roc_score$specificities)
cutoff <- 0.5
pred_class <- ifelse(pred_response > cutoff, "Y", "N")
# Performance
(cm <- table(Real = data_te$goal, Predicted = pred_class)) # Confusion Matrix

##     Predicted
## Real   N   Y
##    Y 146  43
##    N  40 126

TP <- cm[1,1]; FN <- cm[1,2]; FP <- cm[2,1]; TN <- cm[2,2];
acc <- round((TP + TN) / (TP + FN + FP + TN), 4)
sens <- round(TP / (TP + FP), 4)
spec <- round(TN / (FN + TN), 4)
auc <- round(roc_score$auc, 4)

fprintf("AUC: %8.2f; Accuracy: %8.2f%%; Sensitivity: %8.2f%%; Specificity: %8.2f%%", 
        auc, acc * 100, sens * 100, spec * 100)

## AUC:     0.77; Accuracy:    76.62%; Sensitivity:    78.49%; Specificity:    74.56%

According to the confusion matrix above, the final accuracy of Decision Tree model on test dataset is 0.7662, Sensitivity is 0.7849, and Specificity is 0.7456.

pred_response <- predict(model.rf, data_te, type="prob")
pred_response <- pred_response[,2]
real_class <- data_te$goal
# Y: Positive; N: Negative;
roc_score <- roc(real_class, pred_response, levels = c("N", "Y"), direction = ">")
# Determine the optimal cutoff point
idx <- which.max(roc_score$sensitivities + roc_score$specificities)
cutoff <- 0.5
pred_class <- ifelse(pred_response > cutoff, "Y", "N")
# Performance
(cm <- table(Real = data_te$goal, Predicted = pred_class)) # Confusion Matrix

##     Predicted
## Real   N   Y
##    Y 151  38
##    N  30 136

TP <- cm[1,1]; FN <- cm[1,2]; FP <- cm[2,1]; TN <- cm[2,2];
acc <- round((TP + TN) / (TP + FN + FP + TN), 4)
sens <- round(TP / (TP + FP), 4)
spec <- round(TN / (FN + TN), 4)
auc <- round(roc_score$auc, 4)

fprintf("AUC: %8.2f; Accuracy: %8.2f%%; Sensitivity: %8.2f%%; Specificity: %8.2f%%", 
        auc, acc * 100, sens * 100, spec * 100)

## AUC:     0.88; Accuracy:    80.85%; Sensitivity:    83.43%; Specificity:    78.16%

According to the confusion matrix above, the final accuracy of Random Forest model on test dataset is 0.8085, Sensitivity is 0.8343, and Specificity is 0.7816.

pred_response <- predict(model.svm, data_te, type="prob")
pred_response <- pred_response[,2]
real_class <- data_te$goal
# Y: Positive; N: Negative;
roc_score <- roc(real_class, pred_response, levels = c("N", "Y"), direction = ">")
# Determine the optimal cutoff point
idx <- which.max(roc_score$sensitivities + roc_score$specificities)
cutoff <- 0.5
pred_class <- ifelse(pred_response > cutoff, "Y", "N")
# Performance
(cm <- table(Real = data_te$goal, Predicted = pred_class)) # Confusion Matrix

##     Predicted
## Real   N   Y
##    Y 137  52
##    N  35 131

TP <- cm[1,1]; FN <- cm[1,2]; FP <- cm[2,1]; TN <- cm[2,2];
acc <- round((TP + TN) / (TP + FN + FP + TN), 4)
sens <- round(TP / (TP + FP), 4)
spec <- round(TN / (FN + TN), 4)
auc <- round(roc_score$auc, 4)

fprintf("AUC: %8.2f; Accuracy: %8.2f%%; Sensitivity: %8.2f%%; Specificity: %8.2f%%", 
        auc, acc * 100, sens * 100, spec * 100)

## AUC:     0.83; Accuracy:    75.49%; Sensitivity:    79.65%; Specificity:    71.58%

According to the confusion matrix above, the final accuracy of Support Vector Machine model on test dataset is 0.7549, Sensitivity is 0.7965, and Specificity is 0.7158.

As a result, Random Forest Model has the best classification performance, with highest accuracy, area under the curve, sensitivity, and specificity.

Concolusion (Deployment)

Based on the fitted Random Forest, we can conclude the importance of each predictor using the “total decrease in node impurities from splitting on the variable, averaged over all trees”. The idea of importance is that, if a variable is not important, rearranging the value of this variable won’t decrease the out-of-bag prediction accuracy.

var_imp <- varImp(model.rf)
var_imp <- data.frame(var_imp$importance)
Index <- row.names(var_imp)
var_imp$Index <- Index

var_imp %>% arrange(desc(Overall)) %>% slice(1:10) %>% 
  ggplot() + 
  geom_col(aes(x = fct_rev(fct_reorder(Index, Overall)), y = Overall, fill = Index)) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  xlab("Predictors") + 
  ylab("Importance") + 
  ggtitle("Importance of Each Predictors")

The top-10 most important predictors are shown as above. It means higher overall rating score for the quality of an ICO project given by investment experts is the most related factor to whether the startup can achieve their funding goal. Interestingly, If the startup company issues higher number of coins, the chance of achieving their funding goal will increase. In addition to this, the starting date and end date of the funding raising campaign also greatly affect whether the startup can achieve their funding goal. Apart from this, the rating score for the project team and the product or service, the number of team members for the fundraising project, the duration of the funding raising campaign, the LinkedIn profile, and social media also have significant effect on the result.
As a result, to increase the probability of achieve funding goal, the startup company should issue more coins, have more team members, and try to complete the LinkedIn profile and increase the activity level of the company in social media during the campaign period.

report

2022-05-09