Kickstarter Projects

Anna Pająk, Monika Dyczewska, Daria Skarbek

Published on 16.05.2022

Description of the dataset

The Kickstarter Projects dataset contains data on more than 300 000 initiatives from the Kickstarter crowdfunding platform. Each entry describes one fundraiser launched on the platform: its identification, categorisation, timeframe, financial goal and results.

Discussion of data mining goals and success criteria

Using the given variables, we may be able to predict the outcome of a project (one of five possibilities: Failed, Live, Successful, Canceled or Suspended) from the type of project (category, main category), the country, the amount of money set as the goal, the number of people who support the project and the amount of money they pledge.

Characteristics of the dataset

All the data come from the Kickstarter platform and are collected in two .csv files, one per time point: the first reflects the state of the platform content in December 2016 and the second in January 2018. The former contains information about 321616 unique projects, the latter about 375765.
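The comparison of the two snapshots can be sketched in a few lines of R. This is only an illustration under stated assumptions: `snapshot_2016` and `snapshot_2018` are small made-up data frames standing in for the two .csv files, and `count_unique_projects` is a hypothetical helper, not part of the original analysis.

```r
# Toy stand-ins for the two snapshots (assumption: the real files
# would be loaded with read.csv() and contain an ID column).
snapshot_2016 <- data.frame(ID = c(1, 2, 3, 3))
snapshot_2018 <- data.frame(ID = c(1, 2, 3, 4, 5))

# Hypothetical helper: number of distinct project IDs in a snapshot.
count_unique_projects <- function(df) length(unique(df$ID))

n_2016 <- count_unique_projects(snapshot_2016)
n_2018 <- count_unique_projects(snapshot_2018)

# IDs that appear only in the later snapshot.
new_ids <- setdiff(snapshot_2018$ID, snapshot_2016$ID)
```

Counting distinct IDs rather than rows guards against duplicated records inflating the totals.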

Description of the attributes

Attribute_name	Type	Meaning	Special.Values
ID	nominal (nominal lvl of measurement)	Identification number of a kickstarter project in the database.	Integer number.
name	nominal (nominal lvl of measurement)	Specific name of a kickstarter project in the database given by creators.	String value, all characters allowed.
category	nominal (nominal lvl of measurement)	Classification of projects into a narrow category.	Many subgroups for example: Comics, 3D Printing, Crafts, Science Fiction. In total 162 categories.
main_category	nominal (nominal lvl of measurement)	Classification of projects into a broader campaign category.	Subgroups for example: Art, Comics, Music, Technology, Food, Film & Video, Crafts, Dance. In total 17 categories.
currency	nominal (nominal lvl of measurement)	Currency the project’s data are given in (goal, pledged).	Codes for currencies for example: USD, GBP, NZD, CAD.
deadline	nominal (interval lvl of measurement)	Date of ending the crowdfunding of the project.	Date and time format.
goal	numerical (ratio lvl of measurement)	Target amount of money to classify the project as successful.	Decimal value.
launched	nominal (interval lvl of measurement)	Date of launching the project.	Date and time format.
pledged	numerical (ratio lvl of measurement)	Amount of money pledged by ‘crowd’ to the project in the project’s currency.	Decimal value.
state	nominal (nominal lvl of measurement)	Current state of the project - depends on the amount of money set as a goal and reached.	Failed - project hasn’t reached the goal within the deadline; Live - active project; Successful - project has reached the goal; Canceled - project has been canceled by the creator; Suspended - project has been suspended for some legal reason.
backers	numerical (ratio lvl of measurement)	Number of people that support the project (backers).	Non-negative integer.
country	nominal (nominal lvl of measurement)	Country the project originates from.	Codes for countries for example: US, GB, CA, ES, NZ.
usd_pledged	numerical (ratio lvl of measurement)	Conversion of the pledged value to USD using a kickstarter converter.	Decimal value of converted value from pledged column.
usd_pledged_real	numerical (ratio lvl of measurement)	Conversion of the pledged value to USD using a fixer.io converter.	Decimal value of converted value from pledged column.
usd_goal_real	numerical (ratio lvl of measurement)	Conversion of the goal value to USD using a fixer.io converter.	Decimal value of converted value from goal column.

Data overview

To begin, let’s look at the first few rows of the data set to get familiar with its contents and structure:

ID	name	category	main_category	currency	deadline	goal	launched	pledged	state	backers	country	usd.pledged	usd_pledged_real	usd_goal_real
1000002330	The Songs of Adelaide & Abullah	Poetry	Publishing	GBP	2015-10-09	1000	2015-08-11 12:12:28	0	failed	0	GB	0	0	1533.95
1000003930	Greeting From Earth: ZGAC Arts Capsule For ET	Narrative Film	Film & Video	USD	2017-11-01	30000	2017-09-02 04:43:57	2421	failed	15	US	100	2421	30000.00
1000004038	Where is Hank?	Narrative Film	Film & Video	USD	2013-02-26	45000	2013-01-12 00:20:50	220	failed	3	US	220	220	45000.00
1000007540	ToshiCapital Rekordz Needs Help to Complete Album	Music	Music	USD	2012-04-16	5000	2012-03-17 03:24:11	1	failed	1	US	1	1	5000.00
1000011046	Community Film Project: The Art of Neighborhood Filmmaking	Film & Video	Film & Video	USD	2015-08-29	19500	2015-07-04 08:35:03	1283	canceled	14	US	1283	1283	19500.00
1000014025	Monarch Espresso Bar	Restaurants	Food	USD	2016-04-01	50000	2016-02-26 13:38:27	52375	successful	224	US	52375	52375	50000.00

Project categories and states

One of the basic characteristics of each project is its classification into a theme category and subcategory. Another is the state of the project, which will be the subject of our further research and predictions. The graph below presents the number of projects assigned to each main category, broken down by project state.

For further analysis we will use only the projects that provide the most relevant information: those that had already ended at the moment the data was downloaded from the platform (successful, failed or canceled). Records in any other state will not be taken into consideration, so they are excluded from the data set.
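The filtering step described above could look roughly like this. The data frame `projects` is a made-up example, and the names `finished_states` and `finished` are assumptions introduced for illustration only.

```r
# Toy data with one project in each of the five documented states.
projects <- data.frame(
  ID    = 1:5,
  state = c("failed", "live", "successful", "canceled", "suspended"),
  stringsAsFactors = FALSE
)

# States of projects that had already ended when the data was downloaded.
finished_states <- c("successful", "failed", "canceled")

# Keep only finished projects and drop the unused state levels.
finished <- projects[projects$state %in% finished_states, ]
finished$state <- factor(finished$state, levels = finished_states)
```

Re-creating the factor with explicit levels keeps later tables and plots free of empty "live" or "suspended" groups.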

The following graph presents the proportions of successful, failed and canceled projects within the main theme categories. Note that the categories are ordered by descending number of projects, the same as in one of the previous graphs (“Number and state structure of projects by category”). The percentage of successful projects seems to be unrelated to the overall number of projects. In particular, “Dance”, the category with the lowest number of registered projects, is also the category with the highest percentage of successes.
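The proportions behind such a graph can be computed with a contingency table. The sketch below uses invented data and the assumed names `finished`, `counts` and `shares`; it is not the code that produced the original figure.

```r
# Toy data: a small category with many successes, a larger one with few.
finished <- data.frame(
  main_category = c("Dance", "Dance", "Music", "Music", "Music", "Music"),
  state = c("successful", "successful",
            "successful", "failed", "failed", "canceled")
)

# Number of projects per category and state.
counts <- table(finished$main_category, finished$state)

# Share of each state within a category (every row sums to 1).
shares <- prop.table(counts, margin = 1)

# Order categories by descending number of projects, as in the graph.
shares <- shares[order(rowSums(counts), decreasing = TRUE), ]
```

Normalising within rows is what makes categories of very different sizes comparable.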

Which project categories gain the most backers?

Having plotted boxplots of the log of the number of backers for each project category, we can see which topics attract more sponsors. More people are eager to fund “Games”, “Design” and “Comics”, while “Crafts” and “Journalism” attract the fewest backers.
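A minimal sketch of the transformation behind these boxplots follows. It assumes `log1p` (that is, log(1 + x)) is used so that projects with zero backers are kept; the original report does not state which logarithm was applied. The data frame is invented.

```r
# Toy data: two categories with different numbers of backers.
projects <- data.frame(
  main_category = c("Games", "Games", "Crafts", "Crafts"),
  backers       = c(120, 80, 0, 3)
)

# Assumption: log1p() instead of log(), because log(0) is -Inf.
projects$log_backers <- log1p(projects$backers)

# Median log-backers per category, highest first.
median_log_backers <- sort(
  tapply(projects$log_backers, projects$main_category, median),
  decreasing = TRUE
)

# The boxplots themselves could then be drawn with:
# boxplot(log_backers ~ main_category, data = projects)
```

Comparing medians on the log scale gives the same ranking that the boxplot centre lines show.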

Another interesting piece of information is how the number of registered projects changed over time:

Geographical diversification of projects

The Kickstarter platform is dominated by projects created in the United States, with a significant share of projects from Great Britain.

As stated previously, the majority of projects originated in the US. However, we decided to check whether the country could influence a project’s success. The graphs in this section show the countries’ success and failure rates in particular years. They also show when each country first appeared in the project registry.

Countries with the highest average success rate
country	avg_sr
US	0.3977778
SG	0.3550000
HK	0.3450000

Countries with the highest average failure rate
country	avg_fr
IT	0.7033333
AT	0.6600000
CH	0.6366667
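One way such averages could be obtained is sketched below. It assumes that avg_sr is the mean of a country’s yearly success rates, which the report does not spell out; the data and the intermediate name `yearly` are made up for illustration.

```r
# Toy data: launch timestamps in the same format as the launched column.
projects <- data.frame(
  country  = c("US", "US", "US", "US", "IT", "IT", "IT"),
  launched = c("2015-01-10 10:00:00", "2015-03-02 09:30:00",
               "2016-05-20 12:00:00", "2016-07-01 08:15:00",
               "2016-02-11 14:00:00", "2016-09-09 16:45:00",
               "2017-04-04 11:20:00"),
  state    = c("successful", "failed", "successful", "successful",
               "failed", "failed", "successful")
)

# The year is the first four characters of the timestamp.
projects$year    <- substr(projects$launched, 1, 4)
projects$success <- projects$state == "successful"

# Success rate per country and year, then the mean over years.
yearly <- aggregate(success ~ country + year, data = projects, FUN = mean)
avg_sr <- aggregate(success ~ country, data = yearly, FUN = mean)
names(avg_sr)[2] <- "avg_sr"
avg_sr <- avg_sr[order(avg_sr$avg_sr, decreasing = TRUE), ]
```

Averaging yearly rates weights every year equally, so a year with few projects counts as much as a busy one; pooling all projects per country would give a different figure.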

Data quality

Missing Values

In this section we look at the quality of our data. First, we check the original data for missing values.
We can observe that there are over 3000 missing values, all of them in the column usd_pledged. From the description of the dataset, we know that this column holds the value of the pledged column converted to USD with the Kickstarter converter. Hence, these gaps are of little importance: to compare projects we can use usd_pledged_real, which holds the same amount converted with a different converter.

Name	Category	Main.category	Currency	Deadline	Goal	Launched	Pledged	State	Backers	Country	Usd_pledged	Usd_pledged_real	Usd_goal_real
0	0	0	0	0	0	0	0	0	0	0	3797	0	0
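A per-column count like the one in the table can be produced with `colSums(is.na(...))`. The data frame below is a small invented example; only the column names follow the dataset.

```r
# Toy data with a single missing value in usd.pledged.
projects <- data.frame(
  pledged          = c(0, 2421, 220),
  usd.pledged      = c(0, NA, 220),
  usd_pledged_real = c(0, 2421, 220)
)

# Number of missing values in every column.
missing_per_column <- colSums(is.na(projects))

# Names of the columns that contain at least one missing value.
columns_with_gaps <- names(missing_per_column)[missing_per_column > 0]
```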

Outliers

To look at outliers, we will check only two columns - usd_pledged_real and usd_goal_real, because we must ensure that the values are converted to the same currency to be able to compare them.

The boxplot of the first variable (usd_goal_real) is strongly influenced by a few very high observations. Looking at the number of observations in each quartile, almost 100 000 lie below the first quartile (25%) and a similar number lie above the third (75%); these are the observations that fall outside the box. The remaining observations, about 200 000, fall between the first and third quartiles.
Identifying outliers with the R function identify_outliers returns around 45 000 observations with a mean goal of 322 148 USD.

	Value_of_quartile	Num_of_observations
25%	2000	98786
50%	5500	102437
75%	15500	94586
100%	166361391	94767
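The report does not show how the counts per quartile were obtained. One possible approach, given here only as an assumption, is to cut the variable at its quantiles and tabulate the resulting bins; `goal_usd` is invented data.

```r
# Toy goal values in USD.
goal_usd <- c(100, 500, 2000, 5500, 8000, 15500, 40000, 1e6)

# Quartile boundaries: minimum, 25%, 50%, 75%, maximum.
q <- quantile(goal_usd, probs = c(0, 0.25, 0.5, 0.75, 1))

# Assign every observation to the quartile range it falls in.
bins <- cut(goal_usd, breaks = q, include.lowest = TRUE,
            labels = c("25%", "50%", "75%", "100%"))

# Number of observations per quartile range.
quartile_counts <- table(bins)
```

With real data the bins are rarely of equal size, because many projects share the same round goal value and ties all land on one side of a boundary.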

Num_of_outliers	Mean_outlier_value
45508	322148
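The summary above can be approximated in base R with the usual boxplot rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. We assume here that identify_outliers follows this rule; `find_outliers` is a hypothetical helper and `goal_usd` is invented data.

```r
# Hypothetical helper: values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
find_outliers <- function(x) {
  q     <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr   <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[2]] + 1.5 * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

# Toy goal values in USD with one extreme observation.
goal_usd <- c(1000, 2000, 3000, 5500, 8000, 15500, 20000, 5e6)

outliers <- find_outliers(goal_usd)

# Same layout as the summary table in the report.
outlier_summary <- data.frame(
  Num_of_outliers    = length(outliers),
  Mean_outlier_value = mean(outliers)
)
```

Because the rule is based on quartiles, a heavily right-skewed variable such as a funding goal will always produce many flagged values on the upper side.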

The second numeric variable, usd_pledged_real, also shows a large number of outliers (around 50 000).

	Value_of_quartile	Num_of_observations
25%	31.00	95071
50%	624.33	94697
75%	4050.00	94652
100%	20338986.27	94679

Num_of_outliers	Mean_outlier_value
50578	58253.41

Data integrity

Here we check that the integrity of the data in this set was preserved. The test shows that the columns with numeric values are consistent and contain reasonable values. The column usd_pledged has some missing values, but we omitted them for this test because they do not break data integrity.

# Load the January 2018 snapshot and work on a copy.
k2018 <- read.csv(file = "ks-projects-201801.csv", sep = ",", dec = ".")
k2 <- k2018

# Every money and count column must be numeric.
stopifnot(is.numeric(k2$goal))
stopifnot(is.numeric(k2$pledged))
stopifnot(is.numeric(k2$backers))
stopifnot(is.numeric(k2$usd.pledged))
stopifnot(is.numeric(k2$usd_pledged_real))
stopifnot(is.numeric(k2$usd_goal_real))

# Values must lie in a sensible range; missing usd.pledged values are skipped.
stopifnot(k2$goal>0)
stopifnot(k2$pledged>=0)
stopifnot(k2$backers>=0)
stopifnot(na.omit(k2$usd.pledged)>=0)
stopifnot(k2$usd_pledged_real>=0)
stopifnot(k2$usd_goal_real>=0)

print('The integrity is preserved')

## [1] "The integrity is preserved"

In this dataset we didn’t notice any values of unknown meaning or ones that would be inconsistent with the rest.

Revision of validity of goals for further data mining

Looking at the data, we can say that we will be able to assess the expected results of projects based on the variables given in the data set. The data is of good quality, so it allows for accurate analysis and reliable results. In the exploratory data analysis, we noticed relationships between several variables and the outcome of a project. The data set therefore supports the goals set for further data mining.