What makes a Kickstarter more likely to reach its funding goal? This data set from Kaggle contains features describing over 370K Kickstarter Projects and includes the state of each one. Which model can predict whether a project succeeded?
library(e1071)
library(lubridate)
library(tidyverse)
library(corrplot)
library(reshape2)
library(DMwR)
library(MASS)
library(car)
library(caret)
library(rpart)
library(rstatix) # possibly checking for homogeneity of covariance matrices
library(rpart.plot)
library(ROCR) # only for binary classification
library(pROC)
df <- read.csv("kickstarter.csv")
df <- cbind(data.frame(Id = as.integer(rownames(df))), df)
df_melt = melt(subset(df, select=c(Id,Goal,Pledged,Backers)), id.vars = "Id")
ggplot(aes(value), data = df_melt) + geom_histogram(stat = "bin", fill = "navyblue") + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These features span a large range, but most values cluster near zero rather than in the thousands.
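Because of that skew, a log scale makes the shapes easier to compare. A minimal sketch reusing the df_melt frame above (zero values are dropped since log10(0) is undefined):
ggplot(aes(value), data = dplyr::filter(df_melt, value > 0)) + geom_histogram(bins = 30, fill = "navyblue") + facet_wrap(~variable, scales = "free") + scale_x_log10() + labs(title = "Distributions of Continuous Variables (log10 scale)", x = "Value (log10)", y = "Count")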
corrplot::corrplot(cor(dplyr::select(df, c(Goal,Pledged,Backers)), use = "complete.obs"), method = 'circle', type = 'lower', insig='blank', addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=TRUE)
The only notable correlation among the numeric predictors is between Backers and Pledged, and it makes sense that they are positively correlated: more backers generally means more money pledged.
Change the Launched and Deadline columns to a Date type using the lubridate package.
df <- df %>% mutate(Launched = ymd_hms(Launched), Deadline = ymd(Deadline))
Add some extra features related to time, such as whether the project was launched on a weekend.
# wday(): 1 = Sunday, ..., 6 = Friday, 7 = Saturday, so c(6, 7, 1) groups Friday launches with the weekend.
# Campaign_Length is a difference in calendar months and can be negative if a campaign spans a year boundary.
df <- df %>% mutate(LaunchMonth = month(Launched), LaunchIsWeekend = ifelse(wday(Launched) %in% c(6,7,1), 1, 0), DeadlineMonth = month(Deadline), Campaign_Length = DeadlineMonth - LaunchMonth)
Drop unnecessary columns.
df <- df %>% dplyr::select(-c("ID", "Name", "Launched", "Deadline")) # dplyr:: avoids select() being masked by MASS
Create training and test sets.
df$State <- as.factor(df$State)
set.seed(1234)
# shuffle the rows, then keep a 100K-row subset to make training time manageable
df <- df[sample(1:nrow(df)), ]
comments <- df[1:100000, ]
# 75/25 train-test split
sample_set <- sample(nrow(comments), round(nrow(comments)*0.75), replace=FALSE)
comments_train <- comments[sample_set, ]
comments_test <- comments[-sample_set, ]
The model uses all predictors except Category and Subcategory: the test set can contain factor levels that never appear in the training set, and predicting on unseen factor levels throws a fatal error in R, so those columns are omitted.
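One possible workaround, not used here, is to align the test set's factor levels with the training set before predicting. A minimal sketch, assuming the Category column from the raw data is still present (train_levels and comments_test2 are illustrative names):
# keep only test rows whose Category appeared in training, then re-level the factor
train_levels <- unique(comments_train$Category)
comments_test2 <- comments_test[comments_test$Category %in% train_levels, ]
comments_test2$Category <- factor(comments_test2$Category, levels = train_levels)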
rf_mod <- train(
State ~ Goal+Pledged+Backers+LaunchMonth+LaunchIsWeekend+DeadlineMonth+Campaign_Length,
data = comments_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "boot632", number=3),
tuneGrid = expand.grid(.mtry = 3) # mtry=sqrt(num predictors)
)
rf_pred <- predict(rf_mod, comments_test)
confusionMatrix(rf_pred, comments_test$State)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Canceled Failed Live Successful Suspended
## Canceled 32 96 1 1 2
## Failed 2498 13009 165 1 99
## Live 2 8 0 1 0
## Successful 52 52 26 8932 21
## Suspended 0 1 0 1 0
##
## Overall Statistics
##
## Accuracy : 0.8789
## 95% CI : (0.8748, 0.8829)
## No Information Rate : 0.5266
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7747
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Canceled Class: Failed Class: Live
## Sensitivity 0.01238 0.9881 0.00000
## Specificity 0.99554 0.7665 0.99956
## Pos Pred Value 0.24242 0.8248 0.00000
## Neg Pred Value 0.89738 0.9830 0.99232
## Prevalence 0.10336 0.5266 0.00768
## Detection Rate 0.00128 0.5204 0.00000
## Detection Prevalence 0.00528 0.6309 0.00044
## Balanced Accuracy 0.50396 0.8773 0.49978
## Class: Successful Class: Suspended
## Sensitivity 0.9996 0.00000
## Specificity 0.9906 0.99992
## Pos Pred Value 0.9834 0.00000
## Neg Pred Value 0.9997 0.99512
## Prevalence 0.3574 0.00488
## Detection Rate 0.3573 0.00000
## Detection Prevalence 0.3633 0.00008
## Balanced Accuracy 0.9951 0.49996
prediction <- predict(rf_mod, comments_test, type="prob")
roc_object <- pROC::multiclass.roc(comments_test$State, prediction, percent=TRUE)
print(roc_object)
##
## Call:
## multiclass.roc.default(response = comments_test$State, predictor = prediction, percent = TRUE)
##
## Data: multivariate predictor prediction with 5 levels of comments_test$State: Canceled, Failed, Live, Successful, Suspended.
## Multi-class area under the curve: 77.02%
The UCI Machine Learning Repository has a data set for glass classification. It has also been uploaded to Kaggle. The goal is to predict the type of glass from predictors related to a selection of the sample’s chemical attributes. Which type of model is best for this task?
library(tidyverse)
library(corrplot)
library(reshape2)
library(DMwR)
library(MASS)
library(car)
library(caret)
library(rpart)
library(rstatix) # possibly checking for homogeneity of covariance matrices
library(rpart.plot)
library(ROCR) # only for binary classification
library(pROC)
glass <- read_csv('https://raw.githubusercontent.com/djunga/DATA622/main/glass.csv', col_types = 'nnnnnnnnnn')
glimpse(glass)
## Rows: 214
## Columns: 10
## $ RI <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.~
## $ Na <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13~
## $ Mg <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,~
## $ Al <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,~
## $ Si <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72~
## $ K <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,~
## $ Ca <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,~
## $ Ba <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Fe <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,~
## $ Type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
The glass types are 1, 2, 3, 5, 6, and 7; there is no type 4. Relabel them as consecutive levels 1 through 6 for consistency (original types 5, 6, and 7 become 4, 5, and 6).
glass$Type <- factor(glass$Type, labels = c(1,2,3,4,5,6))
Add unique ID numbers for each observation.
glass <- cbind(data.frame(Id = as.integer(rownames(glass))), glass)
glass_melt = melt(subset(glass, select=-c(Type)), id.vars = "Id")
ggplot(aes(value), data = glass_melt) + geom_histogram(stat = "bin", fill = "navyblue") + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are many zeroes in Mg, K, Ba, and Fe. The RI, Na, Al, Si, and Ca predictors look only roughly normally distributed. Curiously, could there be a relationship between glass type and the presence of zeroes? Which Types account for all these zeroes?
# Count the zero values per glass Type for each predictor. Types with no zero rows drop out
# of a grouped summary, so they are padded back in with explicit zero counts.
ba_zero <- glass %>%
  group_by(Type) %>%
  filter(Ba == 0) %>%
  summarize(Ba_Zero = n())
mg_zero <- glass %>%
  group_by(Type) %>%
  filter(Mg == 0) %>%
  summarize(Mg_Zero = n())
mg_zero <- rbind(mg_zero, data.frame(Type = c(1, 3), Mg_Zero = c(0, 0))) %>% arrange(Type)
fe_zero <- glass %>%
  group_by(Type) %>%
  filter(Fe == 0) %>%
  summarize(Fe_Zero = n())
# RI, Na, Al, Si, and Ca contain no zeroes at all
ri_zero <- data.frame(Type = c(1, 2, 3, 4, 5, 6), RI_Zero = c(0, 0, 0, 0, 0, 0))
na_zero <- data.frame(Type = c(1, 2, 3, 4, 5, 6), Na_Zero = c(0, 0, 0, 0, 0, 0))
al_zero <- data.frame(Type = c(1, 2, 3, 4, 5, 6), Al_Zero = c(0, 0, 0, 0, 0, 0))
si_zero <- data.frame(Type = c(1, 2, 3, 4, 5, 6), Si_Zero = c(0, 0, 0, 0, 0, 0))
k_zero <- glass %>%
  group_by(Type) %>%
  filter(K == 0) %>%
  summarize(K_Zero = n())
k_zero <- rbind(k_zero, c(Type = 5, K_Zero = 0)) %>% arrange(Type)
ca_zero <- data.frame(Type = c(1, 2, 3, 4, 5, 6), Ca_Zero = c(0, 0, 0, 0, 0, 0))
# combine, keeping the Type column plus one zero-count column per predictor
zeroes <- cbind(ri_zero, na_zero, mg_zero, al_zero, si_zero, k_zero, ca_zero, ba_zero, fe_zero)
zeroes <- zeroes[, c(1, 2, 4, 6, 8, 10, 12, 14, 16, 18)]
head(zeroes)
## Type RI_Zero Na_Zero Mg_Zero Al_Zero Si_Zero K_Zero Ca_Zero Ba_Zero Fe_Zero
## 1 1 0 0 0 0 0 1 0 67 45
## 2 2 0 0 9 0 0 3 0 70 44
## 3 3 0 0 0 0 0 1 0 16 12
## 4 4 0 0 7 0 0 9 0 11 11
## 5 5 0 0 3 0 0 0 0 9 9
## 6 6 0 0 23 0 0 16 0 3 23
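For reference, the same per-class zero counts can be computed more compactly with dplyr's across(); a sketch (zero_counts is an illustrative name):
zero_counts <- glass %>%
  group_by(Type) %>%
  summarize(across(RI:Fe, ~ sum(.x == 0), .names = "{.col}_Zero"))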
Ba and Fe contain the most zeroes.
summary(glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 4:13
## 3rd Qu.:0.10000 5: 9
## Max. :0.51000 6:29
Which predictors are most likely to influence glass type?
corrplot::corrplot(cor(dplyr::select(glass, 'RI':'Fe'), use = "complete.obs"), method = 'circle', type = 'lower', insig='blank', addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)
round(prop.table(table(dplyr::select(glass, Type))),2)
##
## 1 2 3 4 5 6
## 0.33 0.36 0.08 0.06 0.04 0.14
We use upsampling to balance the classes.
# upSample() resamples the minority classes with replacement until every class matches the
# largest one. Drop the Id helper column so it is not carried into the model, and rename the
# response column (upSample names it "Class") back to Type.
glass <- upSample(x = dplyr::select(glass, -Id, -Type), y = glass$Type)
colnames(glass)[ncol(glass)] <- "Type"
table(glass$Type)
##
## 1 2 3 4 5 6
## 76 76 76 76 76 76
round(prop.table(table(dplyr::select(glass, Type))),2)
##
## 1 2 3 4 5 6
## 0.17 0.17 0.17 0.17 0.17 0.17
Now there are 456 observations, up from 214 initially.
Takeaways from the data exploration:
Since the predictors are not normally distributed and the covariance matrices do not appear to be equal, we cannot reasonably use linear or quadratic discriminant analysis for classification. Naive Bayes is also a poor fit: it assumes the predictors are conditionally independent given the class, and the correlation matrix suggests they are not. The predictors are also all continuous, so we would have to discretize them (or model them with per-class Gaussians) to use Naive Bayes at all. Logistic regression is better suited to binary classification, and it too makes assumptions about the data that this data does not satisfy.
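A minimal sketch of how these claims could be checked (not shown above): Shapiro-Wilk tests per predictor for normality, and Box's M test from the rstatix package (loaded earlier for exactly this purpose) for homogeneity of covariance matrices. Box's M can fail when a predictor is nearly constant within a class, hence the try() wrapper.
# p-values well below 0.05 indicate departures from normality
sapply(dplyr::select(glass, RI:Fe), function(x) shapiro.test(x)$p.value)
# Box's M test: a small p-value indicates unequal covariance matrices across glass types
try(rstatix::box_m(dplyr::select(glass, RI:Fe), glass$Type))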
We decide on a random forest. Its basic building block is the decision tree, which makes no distributional assumptions about the underlying data and can handle missing and noisy values. Class imbalance would hinder its performance, but we balanced the data above. A random forest will likely yield higher prediction accuracy than a single tree, in exchange for less explainable results.
We use a 75-25 train-to-test split.
set.seed(1234)
sample_set <- sample(nrow(glass), round(nrow(glass)*0.75), replace=FALSE)
glass_train <- glass[sample_set, ]
glass_test <- glass[-sample_set, ]
For the random forest model, we specify the 0.632 bootstrap as the resampling technique with 3 resampling iterations. The mtry parameter is typically set to the square root of the number of predictors, which evaluates to 3 here (nine predictors).
rf_mod <- train(
Type ~ .,
data = glass_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "boot632", number=3),
tuneGrid = expand.grid(.mtry = 3) # mtry=sqrt(num predictors)
)
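If desired, caret could also tune mtry over a small grid rather than fixing it at 3. A sketch, not run here (rf_tuned is an illustrative name, and the grid multiplies training time by its size):
rf_tuned <- train(
  Type ~ .,
  data = glass_train,
  metric = "Accuracy",
  method = "rf",
  trControl = trainControl(method = "boot632", number = 3),
  tuneGrid = expand.grid(.mtry = c(2, 3, 5, 7))
)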
rf_pred <- predict(rf_mod, glass_test)
confusionMatrix(rf_pred, glass_test$Type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6
## 1 20 2 0 0 0 0
## 2 3 14 0 0 0 0
## 3 1 0 15 0 0 0
## 4 0 3 0 18 0 0
## 5 0 0 0 0 16 0
## 6 0 0 0 0 0 22
##
## Overall Statistics
##
## Accuracy : 0.9211
## 95% CI : (0.8554, 0.9633)
## No Information Rate : 0.2105
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9049
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.8333 0.7368 1.0000 1.0000 1.0000 1.000
## Specificity 0.9778 0.9684 0.9899 0.9688 1.0000 1.000
## Pos Pred Value 0.9091 0.8235 0.9375 0.8571 1.0000 1.000
## Neg Pred Value 0.9565 0.9485 1.0000 1.0000 1.0000 1.000
## Prevalence 0.2105 0.1667 0.1316 0.1579 0.1404 0.193
## Detection Rate 0.1754 0.1228 0.1316 0.1579 0.1404 0.193
## Detection Prevalence 0.1930 0.1491 0.1404 0.1842 0.1404 0.193
## Balanced Accuracy 0.9056 0.8526 0.9949 0.9844 1.0000 1.000
ROC and AUC
These were calculated using the pROC package. Because the response has more than two classes, the pROC documentation notes that the resulting multiclass ROC object cannot be plotted directly.
# Coerce the random forest's class predictions to numeric scores so that
# multiclass.roc() can compute the pairwise one-vs-one AUCs.
prediction <- as.numeric(predict(rf_mod, glass_test))
roc_object <- multiclass.roc(glass_test$Type, prediction, percent=TRUE)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
print(roc_object)
##
## Call:
## multiclass.roc.default(response = glass_test$Type, predictor = prediction, percent = TRUE)
##
## Data: prediction with 6 levels of glass_test$Type: 1, 2, 3, 4, 5, 6.
## Multi-class area under the curve: 89.59%
# Documentation of pROC's multiclass.roc says it cannot be plotted.
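Although the multiclass ROC object itself cannot be plotted, one-vs-rest curves can still be drawn per class from the predicted probabilities. A minimal sketch (probs, classes, and roc_i are illustrative names):
probs <- predict(rf_mod, glass_test, type = "prob")
classes <- levels(glass_test$Type)
# one ROC curve per class: that class as "cases" vs. all other classes as "controls"
for (i in seq_along(classes)) {
  roc_i <- pROC::roc(glass_test$Type == classes[i], probs[, i])
  plot(roc_i, add = (i != 1), col = i)
}
legend("bottomright", legend = classes, col = seq_along(classes), lwd = 2)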
We would have liked to see plots of the individual trees to understand how decisions were made in each one and which predictors were selected at each level. However, compared to a single decision tree, a random forest acts more like a black box: we cannot easily inspect its inner workings. We have traded the explainability of the result for higher accuracy. We also tested a single decision tree on the same data, and its accuracy was about 16% worse.