Results

Kickstarter Projects: Exploratory Analysis

Introduction

What makes a Kickstarter project more likely to reach its funding goal? This data set from Kaggle contains features describing over 370K Kickstarter projects, along with the final state of each one. Which model can best predict whether a project succeeded?

library(e1071)
library(lubridate)
library(tidyverse)
library(corrplot)
library(reshape2)
library(DMwR)
library(MASS)
library(car)
library(caret)
library(rpart) 
library(rstatix)  # possibly checking for homogeneity of covariance matrices
library(rpart.plot)
library(ROCR) # only for binary classification
library(pROC)
df <- read.csv("kickstarter.csv")
df <- cbind(data.frame(Id = as.integer(rownames(df))), df)   # add a row Id for melting
df_melt = melt(subset(df, select=c(Id,Goal,Pledged,Backers)), id.vars = "Id")

ggplot(aes(value), data = df_melt) + geom_histogram(stat = "bin", fill = "navyblue") + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These features have a large range, but most values cluster near zero rather than in the thousands; the distributions are heavily right-skewed.
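
Given the heavy right skew, a log-scale view can make these distributions easier to compare. A minimal sketch reusing df_melt from above (the +1 offset is only there to keep zero values on the plot):

ggplot(aes(value + 1), data = df_melt) +
  geom_histogram(bins = 30, fill = "navyblue") +   # same histograms on a log10 x-axis
  scale_x_log10() +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Distributions on a Log Scale", x = "Value (log scale)", y = "Count")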

corrplot::corrplot(cor(dplyr::select(df, c(Goal,Pledged,Backers)), use = "complete.obs"), method = 'circle', type = 'lower', insig='blank', addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=TRUE)

The only notable correlation among the numerical predictors is between Backers and Pledged, and it makes sense that they are positively correlated: more backers generally means more money pledged.
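
For reference, the exact coefficient can be computed directly (a quick check using the same df as above):

cor(df$Backers, df$Pledged, use = "complete.obs")   # Pearson correlation between Backers and Pledged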

Change the Launched and Deadline columns to a Date type using the lubridate package.

df <- df %>% mutate(Launched = ymd_hms(Launched), Deadline = ymd(Deadline))

Add some extra features related to time, such as whether the project was launched on a weekend.

df <- df %>% mutate(LaunchMonth = month(Launched),
                    LaunchIsWeekend = ifelse(wday(Launched) %in% c(6,7,1), 1, 0),  # with lubridate's default week start, 6/7/1 = Fri/Sat/Sun
                    DeadlineMonth = month(Deadline),
                    Campaign_Length = DeadlineMonth - LaunchMonth)                 # approximate campaign length in months

Drop unnecessary columns.

df <- df %>% dplyr::select(-c("ID", "Name", "Launched", "Deadline"))

Create training and test sets.

df$State <- as.factor(df$State)
set.seed(1234)
df <- df[sample(1:nrow(df)), ]    # shuffle the rows
comments <- df[1:100000, ]        # work with a 100,000-row subset to keep training time manageable
sample_set <- sample(nrow(comments), round(nrow(comments)*0.75), replace=FALSE)
comments_train <- comments[sample_set, ]
comments_test <- comments[-sample_set, ]

Random Forest Model

The model uses all predictors except Category and Subcategory. If a factor level appears in the test set but not in the training set, prediction throws a fatal error, and R offers no straightforward built-in way to reconcile the level sets, so these two high-cardinality factors are omitted.
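
If Category were wanted as a predictor, one possible workaround (not used here; a sketch assuming Category is present in both comments_train and comments_test) is to force the test set's factor levels to match the training set's before predicting:

comments_train$Category <- factor(comments_train$Category)
comments_test$Category  <- factor(comments_test$Category,
                                  levels = levels(comments_train$Category))
# Test rows whose Category was never seen in training become NA and could be dropped.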

rf_mod <- train(
  State ~ Goal+Pledged+Backers+LaunchMonth+LaunchIsWeekend+DeadlineMonth+Campaign_Length,
  data = comments_train,
  metric = "Accuracy",
  method = "rf",
  trControl = trainControl(method = "boot632", number=3),
  tuneGrid = expand.grid(.mtry = 3)   # mtry=sqrt(num predictors)
)

rf_pred <- predict(rf_mod, comments_test)
confusionMatrix(rf_pred, comments_test$State)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Canceled Failed  Live Successful Suspended
##   Canceled         32     96     1          1         2
##   Failed         2498  13009   165          1        99
##   Live              2      8     0          1         0
##   Successful       52     52    26       8932        21
##   Suspended         0      1     0          1         0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8789          
##                  95% CI : (0.8748, 0.8829)
##     No Information Rate : 0.5266          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7747          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Canceled Class: Failed Class: Live
## Sensitivity                  0.01238        0.9881     0.00000
## Specificity                  0.99554        0.7665     0.99956
## Pos Pred Value               0.24242        0.8248     0.00000
## Neg Pred Value               0.89738        0.9830     0.99232
## Prevalence                   0.10336        0.5266     0.00768
## Detection Rate               0.00128        0.5204     0.00000
## Detection Prevalence         0.00528        0.6309     0.00044
## Balanced Accuracy            0.50396        0.8773     0.49978
##                      Class: Successful Class: Suspended
## Sensitivity                     0.9996          0.00000
## Specificity                     0.9906          0.99992
## Pos Pred Value                  0.9834          0.00000
## Neg Pred Value                  0.9997          0.99512
## Prevalence                      0.3574          0.00488
## Detection Rate                  0.3573          0.00000
## Detection Prevalence            0.3633          0.00008
## Balanced Accuracy               0.9951          0.49996
prediction <- predict(rf_mod, comments_test, type="prob")

roc_object <- pROC::multiclass.roc(comments_test$State, prediction, percent=TRUE)

print(roc_object)
## 
## Call:
## multiclass.roc.default(response = comments_test$State, predictor = prediction,     percent = TRUE)
## 
## Data: multivariate predictor prediction with 5 levels of comments_test$State: Canceled, Failed, Live, Successful, Suspended.
## Multi-class area under the curve: 77.02%
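
The confusion matrix shows that the minority classes (Canceled, Live, Suspended) are almost never predicted; overall accuracy is driven by the Failed and Successful classes. One possible remedy, not applied here, would be to balance the training set before fitting, analogous to the upsampling used for the glass data below. A sketch with caret::upSample:

comments_train_bal <- upSample(x = dplyr::select(comments_train, -State),
                               y = comments_train$State, yname = "State")
table(comments_train_bal$State)   # all classes now match the largest class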

Glass Types: Exploratory Analysis

Introduction

The UCI Machine Learning Repository has a data set for glass classification, which has also been uploaded to Kaggle. The goal is to predict the type of glass from its refractive index and the measured amounts of several chemical elements in the sample. Which type of model is best for this task?

library(tidyverse)
library(corrplot)
library(reshape2)
library(DMwR)
library(MASS)
library(car)
library(caret)
library(rpart) 
library(rstatix)  # possibly checking for homogeneity of covariance matrices
library(rpart.plot)
library(ROCR) # only for binary classification
library(pROC)
glass <- read_csv('https://raw.githubusercontent.com/djunga/DATA622/main/glass.csv', col_types = 'nnnnnnnnnnf')

Data Exploration

glimpse(glass)
## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.~
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13~
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,~
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,~
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72~
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,~
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,~
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,~
## $ Type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~

The glass types in the raw data are labeled 1, 2, 3, 5, 6, and 7; there is no type 4. Relabel them as 1 through 6 for consistency.

glass$Type <- factor(glass$Type, labels = c(1,2,3,4,5,6))

Add unique ID numbers for each observation.

glass <- cbind(data.frame(Id = as.integer(rownames(glass))), glass) 

Distributions of predictors

glass_melt = melt(subset(glass, select=-c(Type)), id.vars = "Id")

ggplot(aes(value), data = glass_melt) + geom_histogram(stat = "bin", fill = "navyblue") + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are many zeroes in Mg, K, Ba, and Fe. The RI, Na, Al, Si, and Ca predictors look only roughly normally distributed. Could there be a relationship between glass type and the presence of zeroes?

Which glass types contain these zeroes?

zeroes <- glass %>%
  group_by(Type) %>%
  summarize(RI_Zero = sum(RI == 0), Na_Zero = sum(Na == 0), Mg_Zero = sum(Mg == 0),
            Al_Zero = sum(Al == 0), Si_Zero = sum(Si == 0), K_Zero  = sum(K == 0),
            Ca_Zero = sum(Ca == 0), Ba_Zero = sum(Ba == 0), Fe_Zero = sum(Fe == 0)) %>%
  as.data.frame()   # number of zero values per predictor within each glass type
head(zeroes)
##   Type RI_Zero Na_Zero Mg_Zero Al_Zero Si_Zero K_Zero Ca_Zero Ba_Zero Fe_Zero
## 1    1       0       0       0       0       0      1       0      67      45
## 2    2       0       0       9       0       0      3       0      70      44
## 3    3       0       0       0       0       0      1       0      16      12
## 4    4       0       0       7       0       0      9       0      11      11
## 5    5       0       0       3       0       0      0       0       9       9
## 6    6       0       0      23       0       0     16       0       3      23
  • Ba and Fe contain the most zeroes.
  • Note that the classes are imbalanced: the glass types are not represented equally.
  • Because of this imbalance, raw zero counts are not directly comparable across classes; within-class proportions would be more informative (see the sketch after this list).
  • The significance of zero for these predictors is unknown; they may be stand-ins for missing values.
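
A sketch of those within-class proportions for the predictors that contain zeroes, using the glass data as it stands here (before any balancing):

glass %>%
  group_by(Type) %>%
  summarize(Mg_ZeroProp = mean(Mg == 0),   # proportion of zero values within each type
            K_ZeroProp  = mean(K == 0),
            Ba_ZeroProp = mean(Ba == 0),
            Fe_ZeroProp = mean(Fe == 0))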

Data Summary

summary(glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   4:13  
##  3rd Qu.:0.10000   5: 9  
##  Max.   :0.51000   6:29

Correlation plot

Which predictors are correlated with one another? Strong correlations among predictors matter later when we consider models that assume independence.

corrplot::corrplot(cor(dplyr::select(glass, 'RI':'Fe'), use = "complete.obs"), method = 'circle', type = 'lower', insig='blank', addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)

Class imbalance

round(prop.table(table(dplyr::select(glass, Type))),2)
## 
##    1    2    3    4    5    6 
## 0.33 0.36 0.08 0.06 0.04 0.14

We use upsampling (caret::upSample) to balance the classes; minority-class rows are resampled with replacement until every class matches the largest one.

glass <- upSample(x = glass[, -ncol(glass)], y = glass$Type, yname = "Type")
table(glass$Type)   
## 
##  1  2  3  4  5  6 
## 76 76 76 76 76 76
round(prop.table(table(dplyr::select(glass, Type))),2)
## 
##    1    2    3    4    5    6 
## 0.17 0.17 0.17 0.17 0.17 0.17

Now there are 456 observations, up from 214 initially.

Model Building

Takeaways from the data exploration:

  • There are a lot of zeroes, which may or may not be missing values.
  • The predictors are not normally distributed.
  • Since the predictors are not normally distributed, Box’s M-test for homogeneity of covariance matrices is not reliable here, so we cannot assume the covariance matrices are equal (see the sketch after this list).
  • The classes were imbalanced, but we balanced them.
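
For reference, this is how homogeneity of covariance matrices could be checked with rstatix::box_m if the normality assumption held. A sketch only, using glass_orig, a hypothetical copy of the data saved before the upSample() call above; the result would not be trustworthy for this data.

box_m(dplyr::select(glass_orig, RI:Fe), glass_orig$Type)   # Box's M-test; glass_orig is a hypothetical pre-upsampling copy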

Since the predictors are not normally distributed and we cannot establish that the covariance matrices are equal, linear and quadratic discriminant analysis are poor fits for this task. Naive Bayes is also unattractive: it assumes class-conditional independence, and the correlation matrix shows that several predictors are clearly related; in addition, the predictors are all continuous and would have to be discretized to use a simple Naive Bayes implementation. Logistic regression is better suited to binary classification and makes assumptions that this data does not satisfy.

We decide on a random forest. Its basic building block is the decision tree, which makes few assumptions about the underlying data and can handle missing and noisy values. Class imbalance would hinder its performance, but we balanced the data earlier. A random forest will likely yield higher predictive accuracy than a single tree, at the cost of explainability.

Create train and test sets

We use a 75-25 train-to-test split.

set.seed(1234)
sample_set <- sample(nrow(glass), round(nrow(glass)*0.75), replace=FALSE)
glass_train <- glass[sample_set, ]
glass_test <- glass[-sample_set, ]

Fit model

For the random forest model, we specify the 0.632 bootstrap as the resampling technique with 3 resampling iterations. The mtry parameter is typically set to the square root of the number of predictors, which rounds to 3 here.

rf_mod <- train(
  Type ~ .,
  data = glass_train,
  metric = "Accuracy",
  method = "rf",
  trControl = trainControl(method = "boot632", number=3),
  tuneGrid = expand.grid(.mtry = 3)   # mtry=sqrt(num predictors)
)

rf_pred <- predict(rf_mod, glass_test)
confusionMatrix(rf_pred, glass_test$Type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5  6
##          1 20  2  0  0  0  0
##          2  3 14  0  0  0  0
##          3  1  0 15  0  0  0
##          4  0  3  0 18  0  0
##          5  0  0  0  0 16  0
##          6  0  0  0  0  0 22
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9211          
##                  95% CI : (0.8554, 0.9633)
##     No Information Rate : 0.2105          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9049          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.8333   0.7368   1.0000   1.0000   1.0000    1.000
## Specificity            0.9778   0.9684   0.9899   0.9688   1.0000    1.000
## Pos Pred Value         0.9091   0.8235   0.9375   0.8571   1.0000    1.000
## Neg Pred Value         0.9565   0.9485   1.0000   1.0000   1.0000    1.000
## Prevalence             0.2105   0.1667   0.1316   0.1579   0.1404    0.193
## Detection Rate         0.1754   0.1228   0.1316   0.1579   0.1404    0.193
## Detection Prevalence   0.1930   0.1491   0.1404   0.1842   0.1404    0.193
## Balanced Accuracy      0.9056   0.8526   0.9949   0.9844   1.0000    1.000

ROC and AUC

The multi-class AUC was calculated with the pROC package. Since the response has more than two classes, pROC's multiclass.roc computes an overall AUC but, per the documentation, the result cannot be plotted as a single ROC curve.

prediction <- predict(dt_mod, glass_test, type="vector")

roc_object <- multiclass.roc(glass_test$Type, prediction, percent=TRUE)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
print(roc_object)
## 
## Call:
## multiclass.roc.default(response = glass_test$Type, predictor = prediction,     percent = TRUE)
## 
## Data: prediction with 6 levels of glass_test$Type: 1, 2, 3, 4, 5, 6.
## Multi-class area under the curve: 89.59%
# Documentation of pROC's multiclass.roc says it cannot be plotted. 
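
While a single multi-class ROC curve cannot be drawn, the pairwise (one-vs-one) comparisons that multiclass.roc stores in roc_object$rocs can be plotted one at a time. A sketch, assuming roc_object from above:

plot.roc(roc_object$rocs[[1]], print.auc = TRUE)   # first of the 15 pairwise ROC curves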

We would have liked to see plots of the individual trees to understand how decisions were made in each one and which predictors were selected at each split. However, compared to a single decision tree, a random forest acts more like a black box: we cannot inspect its inner workings, and we have traded explainability of the result for higher accuracy. A single decision tree was also tested on this data, and its accuracy was about 16% lower. A sketch of how such a tree could be fit and inspected follows.
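
For comparison, a single decision tree can be fit and visualized directly with rpart and rpart.plot. This is a minimal sketch of how such a comparison tree could be built; the dt_mod used for the AUC above is not shown in this section.

dt_mod <- rpart(Type ~ ., data = glass_train, method = "class")   # single classification tree
rpart.plot(dt_mod)                                 # visualize the splits and selected predictors
dt_pred <- predict(dt_mod, glass_test, type = "class")
confusionMatrix(dt_pred, glass_test$Type)          # accuracy is noticeably lower than the random forest's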