Exploring Kickstarter Success

Before diving into analysis, let’s begin by taking care of some preliminaries:

setwd("~/Documents/MBA 677/Assignment 1")
getwd()
library(ggplot2)
library(tidyr)
library(dplyr)
library(scales)
library(knitr)
library(stats)
library(RColorBrewer)

Let’s begin by reading in our data frame and completing some initial exploration. I expect to reformat the data into a long format too.

Kick <- read.csv('A2_kickstart_WORKING.csv')
str(Kick)
## 'data.frame':    13 obs. of  7 variables:
##  $ Kickstarter_Category : Factor w/ 13 levels "Art","Comics",..: 8 6 4 9 12 11 7 1 2 13 ...
##  $ Projects_Launched    : int  2796 9600 1882 9086 831 5634 1828 3783 1170 1787 ...
##  $ Projects_Fully_Funded: int  911 3891 759 5067 312 1666 688 1837 542 1194 ...
##  $ Pledge_Amt           : int  83144565 57951876 50124041 34953600 29003932 15311251 11117486 10477939 9242233 7084968 ...
##  $ Pledge_Ct            : int  1378143 647361 536469 522441 270912 262738 138204 155782 177070 95225 ...
##  $ Success_Rate         : num  0.326 0.405 0.403 0.558 0.375 0.296 0.376 0.486 0.463 0.668 ...
##  $ Avg_Pledge_Val       : num  60.3 89.5 93.4 66.9 107.1 ...
qplot(Kick, x=Kick$Pledge_Amt, y=Kick$Success_Rate)

#Trying something nicer in ggplot proper
plot1 <- ggplot(Kick, aes(x=(Pledge_Amt/1000000), y=Success_Rate)) +
  geom_text(aes(label=Kickstarter_Category)) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Kickstarter Success Rate") +
  scale_x_continuous(labels=dollar) +
  xlab("Pledges in Millions $")
plot1

Extending this first plot just a little bit, let’s see if there’s a discernable relationship between these variables:

plot1.1 <- plot1 + geom_smooth(method = "lm", se = FALSE, alpha = 0.2) +
  coord_cartesian(ylim = c(0,.8))
plot1.1

By adding the linear trendline, we can more-easily see that there is indeed a negative relationship between success rate and money pledged. However, we’ve introduced some issues in terms of legibility.

One other initial finding to file away - the fine arts categories enjoy the top success rates:

TopCat <- Kick %>% select (Kickstarter_Category, Success_Rate) %>%   arrange(desc(Success_Rate)) %>% top_n(4) %>% rename("Kickstarter Category"= Kickstarter_Category) %>% rename("Success Rate" = Success_Rate)
TopCat$`Success Rate` <- TopCat$`Success Rate` * 100
kable(TopCat)
Kickstarter Category Success Rate
Dance 74.4
Theater 66.8
Music 55.8
Art 48.6

Finally, let’s take a look at the relationship between average size of pledge and success rate - still just exploring these relationships.

plot2 <- ggplot(Kick, aes(x=(Avg_Pledge_Val), y=Success_Rate)) +
  geom_text(aes(label=Kickstarter_Category)) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Kickstarter Success Rate") +
  scale_x_continuous(labels=dollar) +
  xlab("Average Pledge in $")
plot2

# This view isn't showing a strong relationship - let's try adding the geom_smooth to see if anything jumps out.  
plot2 + geom_smooth()

Based on these two exploratory plots, I’m not identifying a strong relationship between the average pledge size and the success rate.

Note that this graph’s x-axis does not begin at zero - this would be a red flag if we were to use this as a final visualization, and must be addressed, lest the graph be called misleading.

Another question we might find interesting is to explore how many total projects and pledges there were by category:

ggplot(Kick,aes(reorder(Kickstarter_Category, desc(Pledge_Ct)), Pledge_Ct/1000, fill=Success_Rate)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x  = element_text(angle=90, vjust = 0.5)) +
  xlab("Kickstarter Category") +
  ylab("Number of Pledges (000's)") +
  scale_fill_gradient(low = "#8e0152", high = "#006837")

This graph shows another interesting result; some of the categories with the highest success rates (the fine arts we noted earlier, in green) enjoy the lowest number of pledges.

Since we also know the average pledge size for these categories was unremarkable, we can deduce that the projects in those categories tended to be smaller in scope than those in the larger categories.

Taken all together, these findings suggest that Kickstarter projects most likely to be successful are those with lower numbers of pledges, lower total pledge amount, and in the Fine Arts. Let’s think about a vis to highlight these points.

The Final Product

First, we’ll identify a “meta-category” for the Fine Arts, Popular Arts, and Consumer Goods.

Kick.Focused <- Kick %>% 
  arrange(desc(Success_Rate)) %>% 
  filter(row_number() %in% c(1, 2, 3, 11, 12, 13))
str(Kick.Focused)
## 'data.frame':    6 obs. of  7 variables:
##  $ Kickstarter_Category : Factor w/ 13 levels "Art","Comics",..: 3 13 9 8 11 5
##  $ Projects_Launched    : int  512 1787 9086 2796 5634 1659
##  $ Projects_Fully_Funded: int  381 1194 5067 911 1666 434
##  $ Pledge_Amt           : int  1773304 7084968 34953600 83144565 15311251 6317799
##  $ Pledge_Ct            : int  23807 95225 522441 1378143 262738 83067
##  $ Success_Rate         : num  0.744 0.668 0.558 0.326 0.296 0.262
##  $ Avg_Pledge_Val       : num  74.5 74.4 66.9 60.3 58.3 76.1
Kick.Focused$Kickstarter_Category
## [1] Dance      Theater    Music      Games      Publishing Fashion   
## 13 Levels: Art Comics Dance Design Fashion Film & Video Food ... Theater
#Meta_Cat <- factor(c("Fine_Art", "Popular_Art"))
#Meta_Cat
Kick.Meta <- Kick %>% mutate(Meta_Cat = if_else(row_number() %in% c(1,2), "Popular Art", if_else(row_number() %in% c(3), "Consumer Goods", if_else(row_number() %in% c(4), "Fine Art", if_else(row_number() %in% c(5,6,7), "Consumer Goods", if_else(row_number() %in% c(8), "Fine Art", if_else(row_number() %in% c(9), "Popular Art", if_else(row_number() %in% c(10), "Fine Art", if_else(row_number() %in% c(11), "Consumer Goods", if_else(row_number() %in% c(12), "Popular Art", "Fine Art"))))))))))
str(Kick.Meta)
## 'data.frame':    13 obs. of  8 variables:
##  $ Kickstarter_Category : Factor w/ 13 levels "Art","Comics",..: 8 6 4 9 12 11 7 1 2 13 ...
##  $ Projects_Launched    : int  2796 9600 1882 9086 831 5634 1828 3783 1170 1787 ...
##  $ Projects_Fully_Funded: int  911 3891 759 5067 312 1666 688 1837 542 1194 ...
##  $ Pledge_Amt           : int  83144565 57951876 50124041 34953600 29003932 15311251 11117486 10477939 9242233 7084968 ...
##  $ Pledge_Ct            : int  1378143 647361 536469 522441 270912 262738 138204 155782 177070 95225 ...
##  $ Success_Rate         : num  0.326 0.405 0.403 0.558 0.375 0.296 0.376 0.486 0.463 0.668 ...
##  $ Avg_Pledge_Val       : num  60.3 89.5 93.4 66.9 107.1 ...
##  $ Meta_Cat             : chr  "Popular Art" "Popular Art" "Consumer Goods" "Fine Art" ...
#I want to see Meta_Cat as a factor var
Kick.Meta$Meta_Cat <- factor(Kick.Meta$Meta_Cat)
str(Kick.Meta)
## 'data.frame':    13 obs. of  8 variables:
##  $ Kickstarter_Category : Factor w/ 13 levels "Art","Comics",..: 8 6 4 9 12 11 7 1 2 13 ...
##  $ Projects_Launched    : int  2796 9600 1882 9086 831 5634 1828 3783 1170 1787 ...
##  $ Projects_Fully_Funded: int  911 3891 759 5067 312 1666 688 1837 542 1194 ...
##  $ Pledge_Amt           : int  83144565 57951876 50124041 34953600 29003932 15311251 11117486 10477939 9242233 7084968 ...
##  $ Pledge_Ct            : int  1378143 647361 536469 522441 270912 262738 138204 155782 177070 95225 ...
##  $ Success_Rate         : num  0.326 0.405 0.403 0.558 0.375 0.296 0.376 0.486 0.463 0.668 ...
##  $ Avg_Pledge_Val       : num  60.3 89.5 93.4 66.9 107.1 ...
##  $ Meta_Cat             : Factor w/ 3 levels "Consumer Goods",..: 3 3 1 2 1 1 1 2 3 2 ...

It may have been a bit verbose, but we now have a data frame with a new factor variable which we’ll use to emphasize the success rates enjoyed by Fine Arts on Kickstarter.

final <- ggplot(Kick.Meta, aes(x=(Pledge_Amt/1000000), y=Success_Rate)) +
  geom_point(aes(fill = Meta_Cat), size = 5, shape=23, alpha = 0.75) +
  scale_y_continuous(labels = percent_format(), limit = c(0,1)) +
  ylab("Kickstarter Success Rate") +
  scale_x_continuous(labels=dollar) +
  xlab("Pledges in Millions $") +
  scale_fill_brewer(name = "Kickstarter Project Type", palette="Set1", breaks = c("Fine Art", "Popular Art" , "Consumer Goods")) +
  ggtitle("On Kickstarter, Small Fine Arts Projects Find Success") +
  geom_smooth(method = "lm", se=FALSE, color = "#984ea3", show.legend = FALSE, alpha = 0.75) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5))
final

#Based on classmate feedback; adding a simple Kable table of the encoding from the meta groups.  
last_kable <- select(Kick.Meta, c(8,1)) %>% arrange(Meta_Cat) %>% rename("Meta Category" = Meta_Cat) %>% rename ("Kickstarter Category" = Kickstarter_Category)
kable(last_kable)
Meta Category Kickstarter Category
Consumer Goods Design
Consumer Goods Technology
Consumer Goods Publishing
Consumer Goods Food
Consumer Goods Fashion
Fine Art Music
Fine Art Art
Fine Art Theater
Fine Art Dance
Popular Art Games
Popular Art Film & Video
Popular Art Comics
Popular Art Photography

With the final graph, we draw out a few things:

Fine Art projects enjoy a higher success rate than other types. As the total amount pledged for a category increases, the success rate actually decreases - this is perhaps a surprising conclusion. The linear regression line is intended to help make this relationship clear. Most categories of projects enjoy a little less than 50/50 shot of meeting their fundraising goals.

Notes Regarding Encoding Choices:

  1. We have some minor overplotting on the final graph. To address that, the author has applied an alpha transparency to the geom_points and geom_smooth layers.
  2. I selected diamonds as a shape attribute and increased the size somewhat in order to improve legibility.
  3. I created a “meta-category” to summarize the 13 categorical labels provided. I made this choice to highlight and simplify the message for the audience. It required some dplyr and base R code to create the new factor variable in my dataframe, and I went about that in a fairly verbose manner (check out Rpubs for some ugliness!).
  4. In general, in my work life, I am always concerned about dropping data - as part of this course, I’m learning that it is acceptable to drop data under certain circumstances; our role as the analyst is to develop an understanding of the story the data are telling us, make that story consumable, and put it into a form where it can be shared.
  5. I attempted converting the data to long form but actually found it hindered my exploration of the data - the natural groupings at category seemed to make sense, and after much experimentation, I reverted to a dataframe schema similar to the one we started with.
  6. I made what, for me, is a bold title choice - designed to reinforce my primary message and spur the reader to decide whether they agree with the premise. I’ll be interested to hear any feedback on that choice.

Comparing with the Economist:

  1. I made a few choices specifically designed to simplify the story. As such, I didn’t set out with the goal of including as much data as humanly possible in the final vis.
  2. My work makes use of color both to add interest and to communicate an important mapping of the underlying data. I used several rounds of trial-and-error to select a color.brewer palette.
  3. While I tried some exploratory plots (see Rpubs if interested!) using the Average Pledge per Category, I ultimately decided to drop that variable from the analysis - much as they did. I didn’t see a remarkable relationship in the Average Pledge data, and wanted to really focus on the story I was trying to tell.