Kickstarter Data Exploration

ABSTRACT
INTRODUCTION
DATA AND METHODOLOGY
RESULTS
CONCLUSION
APPENDIX

ABSTRACT

In this report, we analyze Kickstarter projects using Jonathan Leland’s Kickstarter dataset, which covers 2009 to 2020. Our primary objectives were to identify key factors influencing project outcomes and to understand time trends in crowdfunding. We found that the Film & Video category contains the greatest number of projects and the Dance category the fewest. Dance also had the highest success rate, while Technology had the lowest. That the smallest category has the highest success rate is a natural starting point for future investigation. We also found a positive trend in the number of projects launched over time, both across all categories and in a model restricted to the Art and Games categories. This study lays the groundwork for more in-depth explorations of Kickstarter project dynamics.

INTRODUCTION

Jonathan Leland’s Kickstarter dataset is a comprehensive collection that includes the details of over 500,000 projects. It offers insight into one of the most popular crowdfunding platforms. Spanning over a decade, from 2009 to 2020, the dataset covers project goals, outcomes, backer engagement, geographical location, categories, and creator details. Its breadth allows researchers, analysts, and enthusiasts to explore patterns and trends that have emerged over the years.

DATA AND METHODOLOGY

The dataset contains 21 variables:

##  [1] "CASEID"                       "NAME"                        
##  [3] "PID"                          "CATEGORY"                    
##  [5] "CATEGORY_ID"                  "SUBCATEGORY"                 
##  [7] "SUBCATEGORY_ID"               "PROJECT_PAGE_LOCATION_NAME"  
##  [9] "PROJECT_PAGE_LOCATION_STATE"  "PROJECT_PAGE_LOCATION_COUNTY"
## [11] "UID"                          "LAUNCHED_DATE"               
## [13] "DEADLINE_DATE"                "PROJECT_CURRENCY"            
## [15] "GOAL_IN_ORIGINAL_CURRENCY"    "PLEDGED_IN_ORIGINAL_CURRENCY"
## [17] "GOAL_IN_USD"                  "PLEDGED_IN_USD"              
## [19] "BACKERS_COUNT"                "STATE"                       
## [21] "URL_NAME"

This analysis focuses on four variables: CATEGORY, LAUNCHED_DATE, PLEDGED_IN_USD, and STATE. It’s important to note that incomplete or missing data can be an issue for some variables. For example, some projects might have missing values for pledged amounts or launch dates. Additionally, because this is an observational dataset, the available variables might not cover every aspect relevant to a project’s success or failure. Some influential factors might not be captured in the dataset.

We will be working to answer the following 3 research questions:

Which categories contain the greatest quantity of campaigns? The least?
Which project categories were tied to the most successful campaigns? The most unsuccessful campaigns? (And how do you define “successful”?)
Is there any time trend regarding the number of campaigns launched?

RESULTS

Question 1: First, we examined which categories contain the most and the fewest campaigns. As Figure 1 shows, Film & Video is the most common project category and Dance is the least common.

Question 2: We wanted to determine what Kickstarter project categories were tied to the most successful campaigns and, conversely, which were tied to the most unsuccessful campaigns. There are four states: successful, failed, canceled, and suspended, as seen in Figure 2.

We’ll focus on successful and unsuccessful projects for the rest of this analysis. We defined successful projects as the projects that have a successful state, while unsuccessful projects are the projects that have a canceled, failed, or suspended state.
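The outcome definition above amounts to a simple mapping from the four states to two outcomes. The following is a minimal sketch in Python rather than the R used for the report's own analysis; the helper name `classify_state` is hypothetical and not taken from the original code.

```python
# The three states that the report groups together as "unsuccessful".
UNSUCCESSFUL_STATES = {"failed", "canceled", "suspended"}


def classify_state(state: str) -> str:
    """Collapse the four Kickstarter states into the report's two outcomes."""
    normalized = state.strip().lower()
    if normalized == "successful":
        return "successful"
    if normalized in UNSUCCESSFUL_STATES:
        return "unsuccessful"
    # Any state outside the four listed in the report is treated as an error.
    raise ValueError(f"unexpected state: {state!r}")


for raw in ["successful", "failed", "canceled", "suspended"]:
    print(raw, "->", classify_state(raw))
```

Raising an error for unknown states is a design choice in this sketch: it keeps unexpected values from being silently counted as failures.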

Next, we wanted to focus on whether projects in certain categories have a higher success rate than others.

Specifically, we want to focus on the proportion of successful projects.

From the above bar graph, we can see that the Technology category has the lowest proportion of successful projects. The Comics, Dance, and Theater categories each have a success proportion of roughly 60%, and Music sits just above 50%. Some categories clearly have a higher proportion of successful projects than others. To determine whether there is a significant association between the two categorical variables (category and outcome), we conduct a chi-square test of independence. All three test assumptions are met; the justification is in the Appendix.

Null Hypothesis: The proportion of successful projects is the same across all categories. There is no association between project success and project category.

Alternative Hypothesis: The proportion of successful projects differs across at least two categories. There is an association between project success and project category.

##               
##                successful unsuccessful
##   Art               18955        22500
##   Comics            10666         6894
##   Crafts             3061         8856
##   Dance              2650         1648
##   Design            17000        26503
##   Fashion            9615        23451
##   Film & Video      28597        47211
##   Food               7844        22914
##   Games             24028        32672
##   Journalism         1351         4514
##   Music             31854        31632
##   Photography        4175         8471
##   Publishing        17831        34251
##   Technology         9423        35283
##   Theater            7408         4941
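The category sizes and success proportions discussed in this section can be recomputed directly from the counts printed above. The sketch below does so in Python (the report itself was produced in R) using only the standard library; the variable names are illustrative.

```python
# Counts copied from the contingency table above: (successful, unsuccessful).
counts = {
    "Art": (18955, 22500),
    "Comics": (10666, 6894),
    "Crafts": (3061, 8856),
    "Dance": (2650, 1648),
    "Design": (17000, 26503),
    "Fashion": (9615, 23451),
    "Film & Video": (28597, 47211),
    "Food": (7844, 22914),
    "Games": (24028, 32672),
    "Journalism": (1351, 4514),
    "Music": (31854, 31632),
    "Photography": (4175, 8471),
    "Publishing": (17831, 34251),
    "Technology": (9423, 35283),
    "Theater": (7408, 4941),
}

# Total projects and share of successful projects per category.
totals = {cat: s + u for cat, (s, u) in counts.items()}
rates = {cat: s / (s + u) for cat, (s, u) in counts.items()}

largest = max(totals, key=totals.get)
smallest = min(totals, key=totals.get)
best = max(rates, key=rates.get)
worst = min(rates, key=rates.get)

# Print categories from highest to lowest success rate.
for cat in sorted(rates, key=rates.get, reverse=True):
    print(f"{cat:<13}{totals[cat]:>7}{rates[cat]:>8.1%}")
print("largest:", largest, "| smallest:", smallest)
print("highest rate:", best, "| lowest rate:", worst)
```

By these counts, Dance has the highest success rate (about 61.7%) and Technology the lowest (about 21.1%).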

## 
##  Pearson's Chi-squared test
## 
## data:  myTable
## X-squared = 23166, df = 14, p-value < 2.2e-16

Because the p-value is below 2.2e-16 (effectively 0), we reject the null hypothesis. We have strong evidence of a statistically significant association between project success and project category. However, the chi-square test is sensitive to sample size: with more than 500,000 projects in the table, even very small differences between categories would appear statistically significant. The test therefore tells us that an association exists, not how large or practically important it is.
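To make the sample-size caveat concrete, the test statistic can be recomputed from the printed counts and paired with an effect size. The sketch below is a Python cross-check using only the standard library, not the R call that produced the output above. It also computes Cramér's V, which is an addition for illustration and not part of the original analysis.

```python
import math

# Counts copied from the contingency table above: (successful, unsuccessful).
counts = {
    "Art": (18955, 22500),
    "Comics": (10666, 6894),
    "Crafts": (3061, 8856),
    "Dance": (2650, 1648),
    "Design": (17000, 26503),
    "Fashion": (9615, 23451),
    "Film & Video": (28597, 47211),
    "Food": (7844, 22914),
    "Games": (24028, 32672),
    "Journalism": (1351, 4514),
    "Music": (31854, 31632),
    "Photography": (4175, 8471),
    "Publishing": (17831, 34251),
    "Technology": (9423, 35283),
    "Theater": (7408, 4941),
}

rows = list(counts.values())
n_cols = 2
col_totals = [sum(row[j] for row in rows) for j in range(n_cols)]
n = sum(col_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total.
chi_sq = 0.0
for row in rows:
    row_total = sum(row)
    for j, observed in enumerate(row):
        expected = row_total * col_totals[j] / n
        chi_sq += (observed - expected) ** 2 / expected

df = (len(rows) - 1) * (n_cols - 1)

# Cramér's V rescales chi-square by sample size (assumption: added here only
# as an illustrative effect size; it does not appear in the original report).
cramers_v = math.sqrt(chi_sq / (n * min(len(rows) - 1, n_cols - 1)))

print(f"chi-square = {chi_sq:.0f}, df = {df}, Cramér's V = {cramers_v:.3f}")
```

The statistic should agree with the reported X-squared of 23166 on 14 degrees of freedom. Cramér's V comes out near 0.21, which under common rules of thumb is a modest association despite the extremely small p-value.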

Now, let’s dive deeper into the financial aspect of Kickstarter projects by examining the average pledged amounts across various categories. Understanding how different categories attract financial support can offer additional insights into backers’ preferences and the overall dynamics of crowdfunding. The following bar plot represents the average values of each of the categories. This highlights the differences in the mean pledged amounts for each of the categories, where the Comics category has an average pledge amount far above the average pledge amount of projects in the Journalism category.

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `PLEDGED_IN_USD = as.numeric(PLEDGED_IN_USD)`.
## Caused by warning:
## ! NAs introduced by coercion

According to this graphic, we can see that the project categories Art and Games are visually roughly equivalent - but are they statistically different from each other? To determine if the average amounts are statistically different across the two groups, we will perform a two-sample t-test. Justifications and assumptions are discussed in the Appendix.

Null Hypothesis: the average pledged amount for successful projects in the Art category equals the average pledged amount for successful projects in the Games category.

Alternative Hypothesis: the average pledged amount for successful projects in the Art category differs from the average pledged amount for successful projects in the Games category.

## 
##  Welch Two Sample t-test
## 
## data:  Artds$PLEDGED_IN_USD and Gamesds$PLEDGED_IN_USD
## t = -0.23654, df = 48489, p-value = 0.813
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.350310  4.197987
## sample estimates:
## mean of x mean of y 
##  223.0487  223.6249

Our two-sample t-test gives a p-value of 0.813. Since this is greater than 0.05, we fail to reject the null hypothesis: there is insufficient evidence to conclude that the average pledged amounts in the Art and Games categories differ. The 95% confidence interval for the difference in means (-5.35 to 4.20) contains 0, which is consistent with this conclusion.
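The numbers in the Welch test output are linked by simple arithmetic, which can be checked as follows. This is a Python sketch, not the original R code. It assumes that with 48,489 degrees of freedom the t quantile is close enough to the normal quantile (about 1.96) to recover the standard error from the confidence interval.

```python
from statistics import NormalDist

# Values copied from the Welch two-sample t-test output above.
mean_art, mean_games = 223.0487, 223.6249
ci_low, ci_high = -5.350310, 4.197987

# Point estimate of the difference in means (Art minus Games).
diff = mean_art - mean_games

# The interval is diff +/- quantile * standard error, so the half-width
# divided by the quantile recovers the standard error.
half_width = (ci_high - ci_low) / 2
z = NormalDist().inv_cdf(0.975)  # assumption: normal approximation to t
std_error = half_width / z

# The test statistic is the difference divided by its standard error.
t_stat = diff / std_error
ci_contains_zero = ci_low < 0 < ci_high

print(f"difference = {diff:.4f}")
print(f"standard error = {std_error:.3f}")
print(f"t = {t_stat:.4f}")
print(f"interval contains 0: {ci_contains_zero}")
```

The recovered t statistic should be close to the reported -0.23654. The difference of roughly 0.58 USD is small compared with a standard error of roughly 2.4 USD, which is why the test does not reject the null hypothesis.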

Question 3: The crowdfunding landscape is not static, and understanding how the number of projects launched evolves over time can offer valuable insights. The success of a crowdfunding project may not only be influenced by its category but also by broader trends in engagement, market demand, and platform dynamics. By examining the patterns of project launches, we aim to uncover potential time trends that could shed light on the evolving preferences and behaviors of backers.

## 
## Call:
## lm(formula = count ~ LAUNCHED_DATE, data = project_ct3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -187.47  -63.54   -5.53   48.85  361.06 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.810e+02  1.955e+01  -14.37   <2e-16 ***
## LAUNCHED_DATE  2.521e-02  1.181e-03   21.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.92 on 3150 degrees of freedom
## Multiple R-squared:  0.1263, Adjusted R-squared:  0.1261 
## F-statistic: 455.6 on 1 and 3150 DF,  p-value: < 2.2e-16

As we can see, the overall trend from 2009 to 2020 is positive in both the data and the smoothed line. The coefficient for LAUNCHED_DATE represents the estimated change in the dependent variable (count) for a one-unit increase in the predictor variable (LAUNCHED_DATE). In this case, a one-unit increase (one day) in LAUNCHED_DATE is associated with an estimated increase of 2.521e-02 (approximately 0.0252) in the count of projects. This suggests a small but positive relationship between LAUNCHED_DATE and the count of projects, and the R-squared of 0.1263 indicates that launch date alone explains only about 13% of the variation in counts. Additionally, the plot shows a clear gap between counts of 50 and 100. This gap appeared after we filtered our dataset to counts under 500 (to remove outliers), and further investigation is needed to find its root cause.
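To put the per-day slope on a more readable scale, the fitted line can be evaluated at specific dates. The Python sketch below rests on two assumptions that the output does not state explicitly: that LAUNCHED_DATE is stored as an R Date (numerically, days since 1970-01-01), and that count is the number of projects launched on a given day. The function names are hypothetical.

```python
from datetime import date

# Coefficients copied from the lm() summary above.
INTERCEPT = -281.0       # -2.810e+02
SLOPE_PER_DAY = 0.02521  # 2.521e-02

# Assumption: dates are measured in days since 1970-01-01, as R Dates are.
EPOCH = date(1970, 1, 1)


def days_since_epoch(d: date) -> int:
    """Numeric value of a date under the days-since-1970 convention."""
    return (d - EPOCH).days


def fitted_count(d: date) -> float:
    """Fitted count from the linear model for a given launch date."""
    return INTERCEPT + SLOPE_PER_DAY * days_since_epoch(d)


# The slope is per day, so multiplying by 365 gives the change over a year.
growth_per_year = SLOPE_PER_DAY * 365

print(f"fitted count on 2010-01-01: {fitted_count(date(2010, 1, 1)):.1f}")
print(f"fitted count on 2015-01-01: {fitted_count(date(2015, 1, 1)):.1f}")
print(f"fitted increase per year: {growth_per_year:.1f}")
```

Under these assumptions, the slope corresponds to roughly 9 additional projects per year in the fitted count, which is easier to interpret than 0.0252 per day.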

Based on our Residuals vs Fitted diagnostic plot, we can see a slight parabolic pattern in the residuals around the 0 line. This suggests that the relationship may not be linear. Although there are some outliers that could be removed in a future research project, the residuals largely follow the 45-degree line in our Q-Q plot. Additionally, the red line in the Scale-Location plot is roughly horizontal across the plot. This gives us evidence that the assumption of homoscedasticity is likely satisfied for this model.

Next, we focused specifically on the number of projects in the Art and Games categories. The figure below shows an increase in the number of projects in the Games category from 2015 to 2020, while the count of Art-related projects has stayed roughly the same throughout the years. This provides evidence that Games-related projects have increased in popularity over the past decade or so.

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## LAUNCHED_DATE    1 121767  121767  1666.6 <2e-16 ***
## CATEGORY         1  36399   36399   498.2 <2e-16 ***
## Residuals     7802 570024      73                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
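The F values in the ANOVA table follow from the sums of squares and degrees of freedom, and the same numbers give the share of variation the model explains. The Python sketch below reproduces that arithmetic from the printed values, so small differences from the R output are expected because the table is rounded.

```python
# Values copied from the ANOVA table above.
ss_date, df_date = 121767, 1
ss_category, df_category = 36399, 1
ss_resid, df_resid = 570024, 7802

# Mean square = sum of squares / degrees of freedom.
ms_resid = ss_resid / df_resid

# Each F value is the term's mean square divided by the residual mean square.
f_date = (ss_date / df_date) / ms_resid
f_category = (ss_category / df_category) / ms_resid

# Share of the total sum of squares accounted for by the two model terms.
ss_model = ss_date + ss_category
r_squared = ss_model / (ss_model + ss_resid)

print(f"residual mean square = {ms_resid:.2f}")
print(f"F (LAUNCHED_DATE) = {f_date:.1f}")
print(f"F (CATEGORY) = {f_category:.1f}")
print(f"R-squared = {r_squared:.3f}")
```

Both terms are highly significant, yet together they account for only about 22% of the variation in counts, so most of the variation remains unexplained by this model.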

The diagnostic plots from this multiple linear regression are similar to the previous diagnostic plots. In the Residuals vs Fitted plot, the red line lies almost flat along the 0 line. However, the Q-Q plot shows strong deviation near both ends, with several outliers, which suggests the residuals are not normally distributed. The red line in the Scale-Location plot is roughly horizontal across the plot, which gives us evidence that the assumption of homoscedasticity is likely satisfied for this model. Overall, the four plots show us that the current regression model might not be the best way to understand our data.

CONCLUSION

In this project, we analyzed some of the key factors related to Kickstarter projects. The Film & Video category had the greatest number of projects, while the Dance category had the fewest. Despite its small size, the Dance category had the highest success rate, which is a natural topic for future research. At the other end of the spectrum, the Technology category had the lowest success rate. Additionally, we observed a small but positive trend in the number of projects launched over time. We fit a simple linear regression as well as a multiple linear regression. Our diagnostic plots showed that the relationship between time and number of projects may not be linear, which creates an opportunity to fit different models to our data.

While the current analysis provides valuable insight, there are opportunities for improvement. One consideration is to analyze a smaller, more focused subset of the data. Additionally, other variables in the dataset, such as location, goal amount, and number of backers, might present interesting options for further investigation. Lastly, fitting a logarithmic regression model to either of our time-trend datasets could prove insightful.

APPENDIX

Chi-Square Test of Independence

We performed a Chi-Square Test of Independence to determine if there is a significant association between category and success.

Assumptions:

Count data: the table we created (myTable) lists counts for the number of successful and unsuccessful projects within each category.
Independence Assumption: the observations are independent of each other. One project does not impact another project. Additionally, the sample size for this dataset is sufficiently large.
Each expected count is at least 5: the expected count for every cell in myTable is greater than 5.
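The expected-count condition can be verified from the counts printed in the Results section. The sketch below is a Python cross-check (the report's own analysis was run in R). It computes each expected count as row total times column total divided by the grand total, and reports the smallest one.

```python
# Counts copied from the contingency table in the Results section:
# (successful, unsuccessful) per category.
counts = {
    "Art": (18955, 22500),
    "Comics": (10666, 6894),
    "Crafts": (3061, 8856),
    "Dance": (2650, 1648),
    "Design": (17000, 26503),
    "Fashion": (9615, 23451),
    "Film & Video": (28597, 47211),
    "Food": (7844, 22914),
    "Games": (24028, 32672),
    "Journalism": (1351, 4514),
    "Music": (31854, 31632),
    "Photography": (4175, 8471),
    "Publishing": (17831, 34251),
    "Technology": (9423, 35283),
    "Theater": (7408, 4941),
}

col_totals = [sum(row[j] for row in counts.values()) for j in range(2)]
n = sum(col_totals)

# Expected count under independence for every cell of the table.
expected = {
    cat: tuple(sum(row) * col_total / n for col_total in col_totals)
    for cat, row in counts.items()
}

# Find the category holding the smallest expected count.
min_cat, min_expected = min(
    ((cat, min(cells)) for cat, cells in expected.items()),
    key=lambda pair: pair[1],
)

print(f"smallest expected count: {min_expected:.1f} ({min_cat})")
print("all expected counts at least 5:", min_expected >= 5)
```

The smallest expected count falls in the Dance row and is in the thousands, far above the threshold of 5.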

Two Sample T-Test

We performed a two-sample t-test to compare the mean pledged amounts in the Art and Games categories.