Kickstarter Cheat Sheet!

Before you read the analysis below, let me thank you for taking the time to look at my writing; it means a lot to me. This analysis extends my previous one, which I did on the same dataset and published on RPubs. If you’d like to see that earlier analysis, done in base R, you can visit this link.

Everything you see here is the result of my study in the Data Visualization class at Algoritma Academy and is written entirely from my experience and knowledge so far. To see what I’ve learned in more detail, you can visit the Algoritma Academy learning syllabus.

If something is wrong or missing, please feel free to contact me. I’d love to discuss it with you. Thank you.

Background and motivation

This dataset is about Kickstarter, a platform for raising money for creative projects. It was published by Mickaël Mouillé and last updated in 2018. To download the dataset and see the details, you can visit Kaggle.

Background

One of the biggest obstacles to developing a project or business is money. Sometimes we already have an idea for our business, and maybe we already know how the business will run, but due to a lack of funds, it just doesn’t get off the ground.

For that reason, Kickstarter was born. It’s a platform where people can share their visions for creative work with the communities that will come together to fund them. Simply put, it’s a platform for bringing creative projects to life.

Motivation

The main objective of this analysis is to improve on my previous analysis by using more powerful R packages, such as lubridate and ggplot2, to present the insights from this dataset more clearly and more beautifully.

Also, by the end of this analysis, we will try to conclude the most effective way to get our project funded on Kickstarter.

Setting Up

# Call the packages we will be using
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(ggridges)

Data Exploration

Reading data & checking for missing values

# Data Input
data <- read.csv("data/kickstarter.csv")
str(data)
## 'data.frame':    378661 obs. of  15 variables:
##  $ ID              : int  1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
##  $ name            : chr  "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
##  $ category        : chr  "Poetry" "Narrative Film" "Narrative Film" "Music" ...
##  $ main_category   : chr  "Publishing" "Film & Video" "Film & Video" "Music" ...
##  $ currency        : chr  "GBP" "USD" "USD" "USD" ...
##  $ deadline        : chr  "2015-10-09" "2017-11-01" "2013-02-26" "2012-04-16" ...
##  $ goal            : num  1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
##  $ launched        : chr  "2015-08-11 12:12:28" "2017-09-02 04:43:57" "2013-01-12 00:20:50" "2012-03-17 03:24:11" ...
##  $ pledged         : num  0 2421 220 1 1283 ...
##  $ state           : chr  "failed" "failed" "failed" "failed" ...
##  $ backers         : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country         : chr  "GB" "US" "US" "US" ...
##  $ usd.pledged     : num  0 100 220 1 1283 ...
##  $ usd_pledged_real: num  0 2421 220 1 1283 ...
##  $ usd_goal_real   : num  1534 30000 45000 5000 19500 ...
# Changing data type and checking for missing values
data$launched <- ymd_hms(data$launched)
data$deadline <- ymd(data$deadline)
colSums(is.na(data))
##               ID             name         category    main_category 
##                0                0                0                0 
##         currency         deadline             goal         launched 
##                0                0                0                0 
##          pledged            state          backers          country 
##                0                0                0                0 
##      usd.pledged usd_pledged_real    usd_goal_real 
##             3797                0                0

Our dataset consists of 378661 observations and 15 variables. It has 3797 missing values, all in the usd.pledged variable. Because we won’t use this variable, we will drop it, along with ID, usd_pledged_real, and usd_goal_real.
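To put the missing values in perspective, 3797 out of 378661 rows is roughly 1% of the data. Below is a minimal sketch of how that share could be computed per column. The `toy` data frame and its values are made up for illustration and only stand in for the real dataset.

```r
# Toy stand-in for the Kickstarter data (values are made up for illustration)
toy <- data.frame(
  pledged     = c(0, 2421, 220, 1),
  usd.pledged = c(0, NA, 220, NA)
)

# Share of missing values per column, in percent
na_share <- colMeans(is.na(toy)) * 100
na_share
```

On the real data, the same call on `data` should show a value only for usd.pledged, in line with the colSums() output above.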

Subsetting and dividing the dataset

# Drop unused variables
data_clean <- subset(data, select = -c(ID, usd.pledged, usd_pledged_real, usd_goal_real))
str(data_clean)
## 'data.frame':    378661 obs. of  11 variables:
##  $ name         : chr  "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
##  $ category     : chr  "Poetry" "Narrative Film" "Narrative Film" "Music" ...
##  $ main_category: chr  "Publishing" "Film & Video" "Film & Video" "Music" ...
##  $ currency     : chr  "GBP" "USD" "USD" "USD" ...
##  $ deadline     : Date, format: "2015-10-09" "2017-11-01" ...
##  $ goal         : num  1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
##  $ launched     : POSIXct, format: "2015-08-11 12:12:28" "2017-09-02 04:43:57" ...
##  $ pledged      : num  0 2421 220 1 1283 ...
##  $ state        : chr  "failed" "failed" "failed" "failed" ...
##  $ backers      : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country      : chr  "GB" "US" "US" "US" ...

Currencies

As you may have noticed, our data contains several currencies. Amounts in different currencies can’t be compared directly because they don’t have the same value. First, let’s check which currencies exist in our dataset.

# Checking for currencies
unique(data_clean$currency)
##  [1] "GBP" "USD" "CAD" "AUD" "NOK" "EUR" "MXN" "SEK" "NZD" "CHF" "DKK" "HKD"
## [13] "SGD" "JPY"
# Make currency table
currency <- as.data.frame(sort(table(data_clean$currency), decreasing = TRUE))
names(currency) <- c("currency", "nrow")

# Make the plot
options(scipen = 999)
ggplot(currency, aes(x = currency, y = nrow)) +
  geom_segment(aes(x = currency, xend = currency, y=0, yend = nrow)) +
  geom_point(color = "orange", size = 4) +
  theme_bw() +
  # Apply the tweaks after the complete theme so they are not overwritten
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  labs(title = "Table of Currency",
       x = "",
       y = "Number of rows")

As you can see, almost all of our data uses USD. To keep things simple, we will subset the dataset again and keep only the USD rows.

# Subsetting the data
data_clean <- data_clean[data_clean$currency == "USD",]

Campaign states

Success and failure are not the only things that can happen to a campaign on Kickstarter. Other outcomes are possible along the way. To see which states exist in our dataset, we have to look at the state variable.

# Get the states names
unique(data_clean$state)
## [1] "failed"     "canceled"   "successful" "undefined"  "live"      
## [6] "suspended"
# See the portion of each state
state <- as.data.frame(table(data_clean$state))
names(state) <- c("States", "nrow")

# Create a pie chart to see the bigger picture
ggplot(state, aes(x = "", y = nrow, fill = States)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar("y", start = 0) +
  labs(title = "Portions of States") +
  theme_void()

Now that we know the failed and successful states cover more than 50% of our data, let’s remove the other states to make our dataset more straightforward.

# Remove the unnecessary states
data_clean <- data_clean[data_clean$state == "failed" | data_clean$state == "successful",]
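A pie chart makes it hard to read exact shares, so here is a small sketch of how the share of each state could be checked numerically before filtering. The `toy_state` vector is made up for illustration; on the real data it would be the state column.

```r
# Made-up example states standing in for data_clean$state
toy_state <- c("failed", "failed", "successful", "canceled",
               "successful", "live", "failed", "successful")

# Share of each state in percent
state_share <- round(prop.table(table(toy_state)) * 100, 1)
state_share

# Combined share of the two states we keep
kept <- sum(state_share[c("failed", "successful")])
kept
```

If the combined share of failed and successful campaigns is high, dropping the other states costs us little data.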

Categories

Two variables have ‘category’ in their names. To avoid confusion, let’s rename them: main_category becomes category, and the original category becomes sub-category.

# Change variable names
names(data_clean)[2:3] <- c("sub-category","category")
names(data_clean)
##  [1] "name"         "sub-category" "category"     "currency"     "deadline"    
##  [6] "goal"         "launched"     "pledged"      "state"        "backers"     
## [11] "country"

Perfect! Now our dataset is ready to be analyzed.

Business questions

Before we begin, it is helpful to determine what we are really looking for in our analysis.

By looking at the variable names above, we can make some questions that we might be able to answer through analyzing this dataset.

  • What categories have the most submissions, and which were the most supported?
  • What is the most reasonable goal to be supported?
  • What deadlines make the most sense in setting up a campaign?

Answering the business questions

1. What categories have the most submissions, and which were the most supported?

To answer this question, let’s first find out whether there are any trends over time on Kickstarter based on categories.

# Subset the dataset for categories and launch year
trends <- data_clean[, c("category", "launched")]
trends$year <- year(trends$launched)
trends <- trends[order(trends$year),]

# Make a ridge plot
ggplot(trends, aes(x = year, y = category, fill = category)) +
  geom_density_ridges() +
  # year is numeric, so use a continuous scale with one break per year
  scale_x_continuous(breaks = unique(trends$year)) +
  labs(x = "", y = "", title = "Category Trends") +
  theme_bw() +
  theme(legend.position = "none")
## Picking joint bandwidth of 0.245

As we can see above, there are no significant differences among the categories from year to year. So we can say that the number of submissions in each category follows much the same pattern over the years.

OK, so we don’t find any category trends on Kickstarter. That’s alright, we can still see which categories have had the most submissions in past years.

# Submission per category
cat_sum <- as.data.frame(table(data_clean$category))
names(cat_sum) <- c("category", "nrow")

# Make the plot, highlighting categories with more than 25000 campaigns
top <- cat_sum$nrow > 25000
ggplot(cat_sum, aes(x = category, y = nrow)) +
  geom_segment(aes(x = category, xend = category, y = 0, yend = nrow),
               color = ifelse(top, "orange", "black"),
               size = ifelse(top, 1.3, 0.7)) +
  geom_point(color = ifelse(top, "orange", "black"),
             size = ifelse(top, 5, 2)) +
  labs(x = "", y = "Number of rows", title = "Number of Campaign per Category") +
  theme_bw() +
  coord_flip()

Now we know that the top 3 categories with the most submissions on Kickstarter are Publishing, Music, and Film & Video, with more than 25000 submissions each.

But wait, many submissions don’t necessarily mean many successes, right? What if these 3 categories only win in number of submissions but not in number of successful campaigns? Let’s find out!

# Count campaigns per category and state
top_cat <- as.data.frame(xtabs( ~ category + state, data = data_clean))
names(top_cat) <- c("category", "States", "nrow")

# Make a plot of the share of each state within each category
ggplot(top_cat, aes(x = category, y = nrow, fill = States)) +
  geom_bar(position = "fill", stat = "identity") +
  # Reference lines at 50% and 60%
  geom_hline(yintercept = 0.5) +
  geom_hline(yintercept = 0.6) +
  labs(x = "", y = "Proportion of campaigns", title = "Percentages of Success Rate from each Category") +
  coord_flip() +
  theme_bw()

See, of the top 3 categories by submissions we found before, only Music has a success rate over 50%. And it doesn’t even beat categories like Theater, Dance, and Comics, which have success rates of around 60%.

Great, from the analyses above, we can answer the first question. So far we can say that the Film & Video category has the most submissions on Kickstarter, with more than 45000, while Dance is the most successful category, with more than 60% of its campaigns ending up successful. But Music is the best category overall, because it has a success rate above 50% and is also 1 of the 3 most submitted categories on Kickstarter.
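The success rates above were read off the stacked bar chart. As a rough sketch, they could also be computed directly, which makes it easier to rank the categories. The `toy` data frame below is made up for illustration and is far smaller than the real data.

```r
# Made-up campaigns standing in for data_clean
toy <- data.frame(
  category = c("Music", "Music", "Music", "Dance", "Dance",
               "Games", "Games", "Games", "Games"),
  state    = c("successful", "successful", "failed", "successful", "failed",
               "failed", "failed", "successful", "failed")
)

# Success rate per category: the mean of a logical vector is the share of TRUE
success_rate <- tapply(toy$state == "successful", toy$category, mean)

# Rank categories from highest to lowest success rate
ranked <- sort(success_rate, decreasing = TRUE)
ranked
```

Applied to the real data, a table like this would let us confirm which categories actually sit above the 50% and 60% lines.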

2. What is the most reasonable goal to be supported?

The main reason Kickstarter exists is to gather funds. In other words, to gather money. Before we set our goal when launching a campaign, we have to think about how much money we can raise on Kickstarter.

We can do this by looking at previous campaigns and asking:

  • how much was their goal when they first launched the campaign, and
  • how much were the backers willing to pay to support the campaign?

By answering these questions, we are able to provide numbers that are reasonable for our campaign objectives.

# Summary of the goal of the successful campaign
summary(data_clean[data_clean$state == "successful",]$goal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1500    4000    9693   10000 2000000
# Summary of the goal of the failed campaign
summary(data_clean[data_clean$state == "failed",]$goal)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         0      2600      7500     60659     20000 100000000

The median goal of failed campaigns is almost twice that of successful ones, and their mean is far higher than their median, which points to extreme outliers. So we can clearly see that successful campaigns tend to have lower goals and fewer outliers.

# Create a goal segment based on the summary of our data above
data_clean$goal_segment <- sapply(data_clean$goal, function(i){
  if(i < 2000){
    "< 2.000 USD"
  } else if(i >= 2000 & i < 15000){
    "2.000 - 15000 USD"
  } else if(i >= 15000 & i < 5000000) {
    "15.000 - 5.000.000 USD"
  } else {
    "> 5.000.000 USD"
  }
})

str(data_clean)
## 'data.frame':    261511 obs. of  12 variables:
##  $ name        : chr  "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" "Monarch Espresso Bar" ...
##  $ sub-category: chr  "Narrative Film" "Narrative Film" "Music" "Restaurants" ...
##  $ category    : chr  "Film & Video" "Film & Video" "Music" "Food" ...
##  $ currency    : chr  "USD" "USD" "USD" "USD" ...
##  $ deadline    : Date, format: "2017-11-01" "2013-02-26" ...
##  $ goal        : num  30000 45000 5000 50000 1000 25000 12500 5000 200000 2500 ...
##  $ launched    : POSIXct, format: "2017-09-02 04:43:57" "2013-01-12 00:20:50" ...
##  $ pledged     : num  2421 220 1 52375 1205 ...
##  $ state       : chr  "failed" "failed" "failed" "successful" ...
##  $ backers     : int  15 3 1 224 16 40 100 0 0 11 ...
##  $ country     : chr  "US" "US" "US" "US" ...
##  $ goal_segment: chr  "15.000 - 5.000.000 USD" "15.000 - 5.000.000 USD" "2.000 - 15000 USD" "15.000 - 5.000.000 USD" ...
# Subset campaign that has got more than 1 USD goal
goals <- data_clean[data_clean$goal > 1,]

# Create the goal distribution
ggplot(goals, aes(x = state, y = goal)) +
  geom_jitter(aes(col = goal_segment)) +
  geom_boxplot(alpha=0.5) +
  scale_y_log10()+
  labs(x= "State", y= "Goal", col = "Goal Segment", title = "Distribution of Goal") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) 

From the boxplots, we can see that the goals of failed campaigns are more widely spread and have more outliers than those of successful ones.
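The goal_segment column was built with sapply(), which calls the function once per row. As an alternative sketch, base R’s cut() can do the same binning in one vectorized call. The breaks and labels below mirror the ones used earlier, while `toy_goal` is a made-up vector chosen to hit the boundaries.

```r
# Made-up goals, including values exactly on the segment boundaries
toy_goal <- c(500, 2000, 14999, 15000, 4999999, 5000000)

# Same boundaries and labels as in the sapply() version
breaks <- c(-Inf, 2000, 15000, 5000000, Inf)
labels <- c("< 2.000 USD", "2.000 - 15000 USD",
            "15.000 - 5.000.000 USD", "> 5.000.000 USD")

# right = FALSE makes each bin include its lower bound and exclude its upper
# bound, which matches the "i >= lower & i < upper" conditions
goal_segment <- as.character(cut(toy_goal, breaks = breaks,
                                 labels = labels, right = FALSE))
goal_segment
```

On a few hundred thousand rows this should run noticeably faster than the row-by-row version, and the boundaries live in one place if we want to change them.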

How about the backers? Are they willing to pay as much as 100M USD to support a campaign?

# Summary of the fund of the successful campaign
summary(data_clean[data_clean$state == "successful",]$pledged)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        1     2061     5184    23224    13192 20338986
# Summary of the fund of the failed campaign
summary(data_clean[data_clean$state == "failed",]$pledged)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       5     106    1331     700  757353

The answer is clearly no! The best-funded failed campaign only got 757353 USD, far below the 100M goal we found before. But, surprisingly, one of the successful campaigns received more than 20M USD in pledges.

Let’s see the distribution.

# Create a pledged segment based on the summary of our data above
data_clean$pledged_segment <- sapply(data_clean$pledged, function(i){
  if(i < 50){
    "< 50 USD"
  } else if(i >= 50 & i < 5000){
    "50 - 5.000 USD"
  } else if(i >= 5000 & i <= 5000000) {
    "5.000 - 5.000.000 USD"
  } else {
    "> 5.000.000 USD"
  }
})

str(data_clean)
## 'data.frame':    261511 obs. of  13 variables:
##  $ name           : chr  "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" "Monarch Espresso Bar" ...
##  $ sub-category   : chr  "Narrative Film" "Narrative Film" "Music" "Restaurants" ...
##  $ category       : chr  "Film & Video" "Film & Video" "Music" "Food" ...
##  $ currency       : chr  "USD" "USD" "USD" "USD" ...
##  $ deadline       : Date, format: "2017-11-01" "2013-02-26" ...
##  $ goal           : num  30000 45000 5000 50000 1000 25000 12500 5000 200000 2500 ...
##  $ launched       : POSIXct, format: "2017-09-02 04:43:57" "2013-01-12 00:20:50" ...
##  $ pledged        : num  2421 220 1 52375 1205 ...
##  $ state          : chr  "failed" "failed" "failed" "successful" ...
##  $ backers        : int  15 3 1 224 16 40 100 0 0 11 ...
##  $ country        : chr  "US" "US" "US" "US" ...
##  $ goal_segment   : chr  "15.000 - 5.000.000 USD" "15.000 - 5.000.000 USD" "2.000 - 15000 USD" "15.000 - 5.000.000 USD" ...
##  $ pledged_segment: chr  "50 - 5.000 USD" "50 - 5.000 USD" "< 50 USD" "5.000 - 5.000.000 USD" ...
# Subset campaigns that were pledged more than 0 USD
pledged <- data_clean[data_clean$pledged > 0,]

# Create the fund distribution
ggplot(pledged, aes(state, pledged)) +
  geom_jitter(aes(col = pledged_segment)) +
  geom_boxplot(alpha=0.5) +
  scale_y_log10()+
  labs(x= "State", y= "Pledged Amount", col = "Pledged Amount Segments", title = "Pledged Amount Distribution") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) 

In contrast to the goal boxplot, successful campaigns have more widely spread pledged values. This means that when backers love your campaign, they don’t hesitate to support you. As we can see, some campaigns raised more than 5M USD.

Nice, now we know that in order to get your campaign fully funded, you have to set your goal reasonably. According to our data, successful campaigns have a median goal of 4000 USD and a mean of about 9700 USD, and half of them set a goal between 1500 and 10000 USD. But this range is not a must, because the resources needed for each campaign vary, and backers on Kickstarter will fund you anyway if they like your campaign.

3. What time is the best time to post a campaign, and what about the deadline?

OK, besides the category and the goal, which are so important when you launch a new campaign, time is also an important thing to think about. You have to set your deadline carefully. Why? Because it indicates how confident you are in the campaign you have created.

Imagine that you really believe in your project and your campaign, and you are very sure that people will like your idea. You will probably set a short deadline, and people will notice this.

But before we talk about the deadline, let’s figure out the best time to post a new campaign. Is it at the end of the year, when people are on holiday?

To figure this out, let’s look at the number of campaigns and the number of backers for each month.

# Add month category to the dataset
launched_data <- data_clean
launched_data$month <- month(launched_data$launched)

# Create new monthly table
monthly <- as.data.frame(table(launched_data$month))

# Add the backers sum for each month
backers <- aggregate(backers ~ month, data = launched_data, FUN = sum)
monthly$backers <- backers$backers

# Add percentage columns
names(monthly) <- c("month", "nrow", "nbackers")
monthly$percent_nrow <- monthly$nrow / sum(monthly$nrow) * 100
monthly$percent_nbackers <- monthly$nbackers / sum(monthly$nbackers) * 100
monthly
##    month  nrow nbackers percent_nrow percent_nbackers
## 1      1 19501  2374469     7.457048         7.401581
## 2      2 20866  2605794     7.979014         8.122656
## 3      3 23945  3073898     9.156403         9.581808
## 4      4 23000  2876331     8.795041         8.965961
## 5      5 23057  3115525     8.816838         9.711565
## 6      6 22882  2787139     8.749919         8.687936
## 7      7 25604  2893750     9.790793         9.020259
## 8      8 22838  2609142     8.733093         8.133092
## 9      9 21464  2834052     8.207685         8.834171
## 10    10 22720  3044754     8.687971         9.490961
## 11    11 21271  2622338     8.133883         8.174226
## 12    12 14363  1243373     5.492312         3.875783
# Make a plot
ggplot(monthly, aes(x=as.numeric(month))) + 
  # Number of submitted campaigns each month
  geom_line(aes(y = percent_nrow), color = "orange") +
  geom_point(aes(y = percent_nrow), shape=21, color="orange", fill="orange", size=3) +
  
  # Number of backers each month
  geom_line(aes(y = percent_nbackers), color="black") +
  geom_point(aes(y = percent_nbackers), shape=21, color="black", fill="black", size=3) +
  
  theme_bw() +
  scale_x_discrete(limits = unique(monthly$month)) +
  
  labs(x="Month", y="Percentage of Total Data", title = "Number of Campaigns and Backers each Month")

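From our line plot above, we can conclude that March is one of the best months to launch your campaign. March accounts for more than 9% of both campaign launches and backers. The only other month that does is July, which has the most launches (9.8%) but a smaller share of backers than March. December is clearly the weakest month on both measures.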
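Reading peaks off a line plot can be tricky when several months are close together. As a quick sketch, the percentages from the monthly table printed above can be checked directly. The numbers below are copied from that output; the variable names `peak_launch`, `peak_backers`, and `both_above_9` are my own.

```r
# Percentages copied from the monthly table above
monthly <- data.frame(
  month = 1:12,
  percent_nrow = c(7.457048, 7.979014, 9.156403, 8.795041, 8.816838, 8.749919,
                   9.790793, 8.733093, 8.207685, 8.687971, 8.133883, 5.492312),
  percent_nbackers = c(7.401581, 8.122656, 9.581808, 8.965961, 9.711565, 8.687936,
                       9.020259, 8.133092, 8.834171, 9.490961, 8.174226, 3.875783)
)

# Month with the largest share of launches, and of backers
peak_launch  <- month.name[monthly$month[which.max(monthly$percent_nrow)]]
peak_backers <- month.name[monthly$month[which.max(monthly$percent_nbackers)]]

# Months where both shares exceed 9%
both_above_9 <- month.name[monthly$month[monthly$percent_nrow > 9 &
                                         monthly$percent_nbackers > 9]]

peak_launch
peak_backers
both_above_9
```

This shows that the single peaks fall in different months for launches and backers, while March and July are the months where both shares are above 9%.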

Now, for the duration of the campaign. We can find the best campaign duration by comparing durations across states. According to the Kickstarter help page, a campaign can last anywhere from 1 to 60 days, so we will keep only campaigns with a duration of at most 60 days.

# Find the duration of each campaign
data_clean$duration <- interval(data_clean$launched, data_clean$deadline) / days(1)

# Subset for max 60 days duration
data_clean <- data_clean[data_clean$duration <= 60,]

# Make the plot
ggplot(data_clean, aes(x=duration, group=state, fill=state)) +
  geom_density() +
  # geom_density() plots a density estimate, not a percentage
  labs(x = "Deadline Interval (days)", y = "Density", title = "Most Common Deadline Intervals") +
  theme_bw()

The most successful deadline so far is around 30 days, and the peak there is far higher than at any other interval.
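To back up what the density plot suggests, here is a small base R sketch that computes the duration in days and then the median duration per state. It uses difftime() instead of lubridate’s interval(), purely so the example runs without extra packages. The `toy` campaigns are made up for illustration, and all times are assumed to be in UTC.

```r
# Made-up campaigns standing in for data_clean (times assumed to be UTC)
toy <- data.frame(
  launched = as.POSIXct(c("2015-08-11 12:00:00", "2017-09-02 00:00:00",
                          "2013-01-12 00:00:00", "2012-03-17 00:00:00"),
                        tz = "UTC"),
  deadline = as.Date(c("2015-10-09", "2017-10-02", "2013-02-11", "2012-04-16")),
  state    = c("failed", "successful", "successful", "failed")
)

# Treat each deadline as midnight UTC, then take the difference in days
deadline_utc <- as.POSIXct(format(toy$deadline), tz = "UTC")
toy$duration <- as.numeric(difftime(deadline_utc, toy$launched, units = "days"))

# Median duration per state
median_by_state <- tapply(toy$duration, toy$state, median)
median_by_state
```

On the real data, a median close to 30 days for successful campaigns would agree with the peak we see in the plot.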

Combined with the months from the previous plot, we can now say that March is one of the busiest months on Kickstarter, with more than 9% of campaign launches and backer support happening in that month. Also, a longer deadline interval doesn’t guarantee the success of a campaign, as most successful campaigns have deadline intervals of only around 30 days.

Conclusion

From our analysis above, we can confidently say that to be fully funded on Kickstarter you should pick one of the categories with the highest success rates, such as Comics, Dance, Music, and Theater. Also, set your goal as low as possible, but don’t focus only on your goal amount, as backers will still support you if they like your idea. Last but not least, place your deadline around 30 days after you launch your campaign, because most successful campaigns have this kind of timeframe.