Kickstarter Projects

Anna Pająk, Monika Dyczewska, Daria Skarbek

Published on 16.05.2022

Description of the dataset

The Kickstarter Projects dataset contains data on more than 300,000 initiatives from the Kickstarter crowdfunding platform. Each entry describes one fundraiser launched on the platform: its identification, categorisation, timeframe, financial goals and results.

Discussion of data mining goals and success criteria

Using the values of the available variables, we may be able to predict the outcome of a project, which is one of five possibilities: Failed, Live, Successful, Canceled or Suspended. The prediction would be based on the type of project (category, main category), the country, the amount of money set as the goal, the number of people who support the project and the amount of money they pledge.

Characteristics of the dataset

All the data come from the Kickstarter platform and are collected in two .csv files, one for each of two points in time: the first reflects the state of the platform content in December 2016 and the second in January 2018. The former set contains information about 321616 unique projects, whilst in the latter this number reaches 375765.

Description of the attributes

| Attribute_name | Type | Meaning | Special.Values |
|---|---|---|---|
| ID | nominal (nominal lvl of measurement) | Identification number of a Kickstarter project in the database. | Integer number. |
| name | nominal (nominal lvl of measurement) | Name of a Kickstarter project in the database, given by its creators. | String value, all characters allowed. |
| category | nominal (nominal lvl of measurement) | Classification of projects into a narrow category. | Many subgroups, for example: Comics, 3D Printing, Crafts, Science Fiction. In total 162 categories. |
| main_category | nominal (nominal lvl of measurement) | Classification of projects into a broader campaign category. | Subgroups, for example: Art, Comics, Music, Technology, Food, Film & Video, Crafts, Dance. In total 17 categories. |
| currency | nominal (nominal lvl of measurement) | Currency in which the project’s data (goal, pledged) are given. | Currency codes, for example: USD, GBP, NZD, CAD. |
| deadline | date (interval lvl of measurement) | Date on which the crowdfunding of the project ends. | Date and time format. |
| goal | numerical (ratio lvl of measurement) | Target amount of money the project must reach to be classified as successful. | Decimal value. |
| launched | date (interval lvl of measurement) | Date on which the project was launched. | Date and time format. |
| pledged | numerical (ratio lvl of measurement) | Amount of money pledged by the ‘crowd’ to the project, in the project’s currency. | Decimal value. |
| state | nominal (nominal lvl of measurement) | Current state of the project; depends on the amount of money set as the goal and the amount reached. | Failed: the project hasn’t reached the goal within the deadline; Live: active project; Successful: the project has reached the goal; Canceled: the project has been canceled by the creator; Suspended: the project has been suspended for some legal reason. |
| backers | numerical (ratio lvl of measurement) | Number of people who support the project (backers). | Non-negative integer. |
| country | nominal (nominal lvl of measurement) | Country the project originates from. | Country codes, for example: US, GB, CA, ES, NZ. |
| usd_pledged | numerical (ratio lvl of measurement) | Conversion of the pledged value to USD using a Kickstarter converter. The column appears as usd.pledged in the loaded data. | Decimal value converted from the pledged column. |
| usd_pledged_real | numerical (ratio lvl of measurement) | Conversion of the pledged value to USD using a fixer.io converter. | Decimal value converted from the pledged column. |
| usd_goal_real | numerical (ratio lvl of measurement) | Conversion of the goal value to USD using a fixer.io converter. | Decimal value converted from the goal column. |

Data overview

To begin with, let’s extract the first few rows of the data set to get familiar with its contents and structure:
| ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd.pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000 | 2015-08-11 12:12:28 | 0 | failed | 0 | GB | 0 | 0 | 1533.95 |
| 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000 | 2017-09-02 04:43:57 | 2421 | failed | 15 | US | 100 | 2421 | 30000.00 |
| 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000 | 2013-01-12 00:20:50 | 220 | failed | 3 | US | 220 | 220 | 45000.00 |
| 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000 | 2012-03-17 03:24:11 | 1 | failed | 1 | US | 1 | 1 | 5000.00 |
| 1000011046 | Community Film Project: The Art of Neighborhood Filmmaking | Film & Video | Film & Video | USD | 2015-08-29 | 19500 | 2015-07-04 08:35:03 | 1283 | canceled | 14 | US | 1283 | 1283 | 19500.00 |
| 1000014025 | Monarch Espresso Bar | Restaurants | Food | USD | 2016-04-01 | 50000 | 2016-02-26 13:38:27 | 52375 | successful | 224 | US | 52375 | 52375 | 50000.00 |

Project categories and states

One of the basic characteristics of each project is its classification into a theme category and subcategory. Another is the state of the project, which will be the subject of our further research and predictions. The graph below presents the number of projects assigned to each main category, broken down by project state.

For further analysis we will use only the projects that provide the most relevant information: those that had already concluded at the moment the data were downloaded from the platform (successful, failed or cancelled). Records in the remaining state categories will not be taken into consideration and will be excluded from the data set.
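The filtering step described above could look roughly like the sketch below. It is not the original analysis code: the small `projects` data frame is a made-up stand-in for the real data, and `finished_states` is a name chosen here for illustration. The state labels are assumed to be spelled as in the raw data (lower case, "canceled").

```r
# Toy stand-in for the Kickstarter data (hypothetical rows, not real projects).
projects <- data.frame(
  ID    = 1:6,
  state = c("failed", "successful", "canceled", "live", "suspended", "failed"),
  stringsAsFactors = FALSE
)

# States treated as concluded; spelling assumed to follow the raw data.
finished_states <- c("successful", "failed", "canceled")

# Keep only the rows whose state is one of the concluded states.
finished <- projects[projects$state %in% finished_states, ]

# Count the remaining projects per state.
print(table(finished$state))
```

Using `%in%` rather than chained `==` comparisons keeps the list of retained states in one place, so it is easy to adjust.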

The following graph presents the proportions of successful, failed and cancelled projects within the main theme categories. Note that the categories are ordered by descending number of projects, as in the previous graph, “Number and state structure of projects by category”. The percentage of successful projects seems to be unrelated to the overall number of projects. In particular, “Dance”, the category with the lowest number of registered projects, is also the category with the highest percentage of successes.
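The proportions behind such a graph can be computed with a contingency table. The sketch below uses a made-up `projects` data frame (the counts are illustrative only) and shows one possible way to obtain per-category shares and to order the categories by size; it is not necessarily how the original figure was produced.

```r
# Toy data: two main categories with a handful of hypothetical projects each.
projects <- data.frame(
  main_category = c("Dance", "Dance", "Dance", "Games", "Games", "Games", "Games"),
  state = c("successful", "successful", "failed",
            "failed", "failed", "canceled", "successful"),
  stringsAsFactors = FALSE
)

# Number of projects per (main category, state) pair.
counts <- table(projects$main_category, projects$state)

# Share of each state within a category: every row sums to 1.
shares <- prop.table(counts, margin = 1)

# Order categories by descending number of projects, as in the graph.
ordered_shares <- shares[order(rowSums(counts), decreasing = TRUE), ]

print(round(ordered_shares, 2))
```

In this toy example the smaller category ("Dance") has the higher success share, which mirrors the kind of pattern described above without reproducing the real figures.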

Which project categories gain the most backers?

Boxplots of the logarithm of the number of backers, drawn per project category, show which topics attract more sponsors. More people are eager to fund “Games”, “Design” and “Comics”, while “Crafts” and “Journalism” attract the fewest backers.
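A minimal sketch of the transformation behind such boxplots follows. The text does not say how projects with zero backers were handled, so the use of log10(backers + 1) is an assumption made here, and the `projects` data frame contains invented values.

```r
# Toy data: hypothetical backer counts for two categories.
projects <- data.frame(
  main_category = c("Games", "Games", "Games", "Crafts", "Crafts", "Crafts"),
  backers       = c(120, 45, 900, 0, 3, 12),
  stringsAsFactors = FALSE
)

# Assumption: add 1 before taking the logarithm so that projects
# with zero backers are kept instead of becoming -Inf.
projects$log_backers <- log10(projects$backers + 1)

# Median of the transformed counts per category, largest first.
medians <- tapply(projects$log_backers, projects$main_category, median)
print(sort(medians, decreasing = TRUE))
```

Calling `boxplot(log_backers ~ main_category, data = projects)` on the same data frame would draw the corresponding boxplots.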

Another interesting piece of information is how the number of registered projects changed over time:
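The yearly counts behind such a plot could be derived from the launched column, for example as sketched here. The four timestamps are taken from the sample rows shown earlier; extracting the year with `substr` is a choice made for this sketch and relies on the "YYYY-MM-DD hh:mm:ss" layout seen in that sample.

```r
# Launch timestamps copied from the sample rows of the data set.
projects <- data.frame(
  launched = c("2015-08-11 12:12:28", "2017-09-02 04:43:57",
               "2013-01-12 00:20:50", "2015-07-04 08:35:03"),
  stringsAsFactors = FALSE
)

# The first four characters of the timestamp hold the year.
projects$launch_year <- substr(projects$launched, 1, 4)

# Number of projects launched in each year.
per_year <- table(projects$launch_year)
print(per_year)
```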

Geographical diversification of projects

The Kickstarter platform is dominated by projects created in the United States, with a significant representation of undertakings established in Great Britain.

As stated previously, the majority of projects originated in the US. Nevertheless, we decided to check whether the country of origin could influence a project’s success. The graphs in this section show each country’s success and failure rates in particular years, as well as the year in which the country first appeared in the project registry.

Countries with the highest average success rate
country avg_sr
US 0.3977778
SG 0.3550000
HK 0.3450000

Countries with the highest average failure rate
country avg_fr
IT 0.7033333
AT 0.6600000
CH 0.6366667
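The text does not spell out how avg_sr and avg_fr were obtained. One plausible reading, given that rates are shown per year, is the mean of the yearly rates for each country. The sketch below illustrates that reading on invented data; the `year` column and the two-step aggregation are assumptions, not the authors' documented method.

```r
# Toy data: hypothetical projects from two countries in two years.
projects <- data.frame(
  country = c("US", "US", "US", "US", "IT", "IT", "IT", "IT"),
  year    = c(2015, 2015, 2016, 2016, 2015, 2015, 2016, 2016),
  state   = c("successful", "failed", "successful", "successful",
              "failed", "failed", "successful", "failed"),
  stringsAsFactors = FALSE
)

# TRUE for successful projects; the mean of a logical vector is a rate.
projects$success <- projects$state == "successful"

# Step 1: success rate per country and year.
yearly <- aggregate(success ~ country + year, data = projects, FUN = mean)

# Step 2 (assumed): average the yearly rates for each country.
avg_sr <- aggregate(success ~ country, data = yearly, FUN = mean)
names(avg_sr)[2] <- "avg_sr"

# Highest average success rate first.
avg_sr <- avg_sr[order(avg_sr$avg_sr, decreasing = TRUE), ]
print(avg_sr)
```

Averaging yearly rates weights every year equally regardless of how many projects it contains, which differs from a single pooled success rate per country. The failure rate would follow the same pattern with `state == "failed"`.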

Data quality

Missing Values

In this section we look at the quality of our data. First, we check the original data for missing values.
There are 3797 missing values, all of them in the column usd_pledged. From the description of the dataset, we know that this column holds the value of the pledged column converted to USD using a Kickstarter converter. The missing values are therefore of little importance: to compare projects, usd_pledged_real (converted using a different converter) can be used instead.

Name Category Main.category Currency Deadline Goal Launched Pledged State Backers Country Usd_pledged Usd_pledged_real Usd_goal_real
0 0 0 0 0 0 0 0 0 0 0 3797 0 0
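A per-column count of missing values like the one above can be obtained in base R as sketched below. The data frame is a small made-up stand-in with two missing entries; the column name usd.pledged follows the name used elsewhere in this report's code.

```r
# Toy data with two missing values in usd.pledged (hypothetical rows).
projects <- data.frame(
  pledged          = c(0, 2421, 220, 1),
  usd.pledged      = c(0, 100, NA, NA),
  usd_pledged_real = c(0, 2421, 220, 1)
)

# is.na() gives a logical matrix; column sums count the NAs per column.
na_counts <- colSums(is.na(projects))
print(na_counts)

# Names of the columns that contain at least one missing value.
print(names(na_counts)[na_counts > 0])
```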

Outliers

To look for outliers, we check only two columns, usd_pledged_real and usd_goal_real, because values must be expressed in the same currency to be comparable.

The boxplot of the first variable, usd_goal_real, is dominated by a small number of extremely large goals: the third quartile is 15,500 USD, while the maximum exceeds 166 million USD. Almost 100,000 observations lie below the first quartile (25%) and about 95,000 lie above the third (75%), so roughly 200,000 observations fall between the 25% and 75% quantiles. With the usual 1.5 × IQR whiskers the lower fence would be negative (2,000 − 1.5 × 13,500 = −18,250), so the points drawn outside the whiskers all belong to the upper tail.
Identifying outliers with the R function identify_outliers returns around 45,500 observations, with a mean goal of 322,148 USD.

| Quantile | Value_of_quartile | Num_of_observations |
|---|---|---|
| 25% | 2000 | 98786 |
| 50% | 5500 | 102437 |
| 75% | 15500 | 94586 |
| 100% | 166361391 | 94767 |

| Num_of_outliers | Mean_outlier_value |
|---|---|
| 45508 | 322148 |
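For readers who want to reproduce this kind of summary without extra packages, here is a base R sketch of the 1.5 × IQR rule, which is the rule that identify_outliers from the rstatix package is documented to apply. It is a stand-in, not the report's own code, and the `goal_usd` vector is invented so that its quartiles match the ones in the table.

```r
# Toy vector of goals in USD (hypothetical values).
goal_usd <- c(500, 1000, 2000, 3000, 5500, 8000, 15500, 20000, 1e6)

# Quartiles of the toy data.
q <- quantile(goal_usd, probs = c(0.25, 0.5, 0.75))

# Interquartile range and the two fences of the 1.5 * IQR rule.
iqr         <- q[["75%"]] - q[["25%"]]
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# An observation is flagged when it falls outside the fences.
is_outlier <- goal_usd < lower_fence | goal_usd > upper_fence

# Number of outliers and their mean value.
print(sum(is_outlier))
print(mean(goal_usd[is_outlier]))
```

Because the lower fence is negative here, only very large goals can be flagged, which matches the behaviour described for usd_goal_real.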

The second numeric variable, usd_pledged_real, also shows a large number of outliers (around 50,000).

| Quantile | Value_of_quartile | Num_of_observations |
|---|---|---|
| 25% | 31.00 | 95071 |
| 50% | 624.33 | 94697 |
| 75% | 4050.00 | 94652 |
| 100% | 20338986.27 | 94679 |

| Num_of_outliers | Mean_outlier_value |
|---|---|
| 50578 | 58253.41 |

Data integrity

Here we verify that the integrity of the data in this set was preserved. The checks show that the columns with numeric values are consistent and contain reasonable values. The column usd_pledged has some missing values, which we omit for this test because they do not compromise data integrity.

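# Load the January 2018 snapshot of the data
k2018 <- read.csv(file = "ks-projects-201801.csv", sep = ",", dec = ".")
k2 <- k2018

# Every monetary and count column must have been read as numeric
stopifnot(is.numeric(k2$goal))
stopifnot(is.numeric(k2$pledged))
stopifnot(is.numeric(k2$backers))
stopifnot(is.numeric(k2$usd.pledged))
stopifnot(is.numeric(k2$usd_pledged_real))
stopifnot(is.numeric(k2$usd_goal_real))

# Goals must be positive; pledged amounts and backer counts must be non-negative
stopifnot(k2$goal > 0)
stopifnot(k2$pledged >= 0)
stopifnot(k2$backers >= 0)
# usd.pledged contains missing values, which are dropped before the check
stopifnot(na.omit(k2$usd.pledged) >= 0)
stopifnot(k2$usd_pledged_real >= 0)
stopifnot(k2$usd_goal_real >= 0)

print('The integrity is preserved')
## [1] "The integrity is preserved"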

In this dataset we didn’t notice any values of unknown meaning or ones that would be inconsistent with the rest.

Revision of validity of goals for further data mining

Based on this review, we expect to be able to assess the outcome of a project from the variables given in the data set. The data are of good quality: the only missing values are in a column that has a complete substitute, and the numeric columns passed the integrity checks. In the exploratory data analysis we also noticed relationships between several variables and the outcome of a project. The data set therefore supports the data mining goals stated at the beginning.