Description of the dataset
The Kickstarter Projects dataset contains data on more than 300 000 initiatives from the Kickstarter crowdfunding platform. Each entry describes one fundraiser launched on the platform: its identification, categorisation, timeframe, goals and financial results.
Discussion of data mining goals and success criteria
Using the given variables, we may be able to predict the outcome of a project (one of five possibilities: Failed, Live, Successful, Canceled or Suspended) from the type of project (category, main category), the country, the amount of money set as the goal, the number of people who support the project and the amount of money they pledge.
Characteristics of the dataset
All the data come from the Kickstarter platform and are collected in two .csv files, one for each of two points in time: the first reflects the state of the platform in December 2016 and the second in January 2018. The former contains information about 321616 unique projects, while the latter contains 375765.
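Counts of unique projects such as those above can be obtained by counting distinct ID values in each snapshot. The following is a minimal sketch on two toy data frames; the names `k2016_toy` and `k2018_toy` and their contents are invented for illustration and stand in for the real files.

```r
# Toy stand-ins for the two snapshots (hypothetical data, not the real files).
k2016_toy <- data.frame(ID = c(1, 2, 3, 3),
                        state = c("failed", "live", "successful", "successful"))
k2018_toy <- data.frame(ID = c(1, 2, 3, 4, 5),
                        state = c("failed", "failed", "successful", "live", "canceled"))

# Number of distinct projects in each snapshot (duplicated IDs count once).
n_unique_2016 <- length(unique(k2016_toy$ID))
n_unique_2018 <- length(unique(k2018_toy$ID))

# Projects that appear only in the later snapshot.
new_ids <- setdiff(k2018_toy$ID, k2016_toy$ID)

print(c(n_unique_2016, n_unique_2018))
print(new_ids)
```

With the real data, the toy data frames would be replaced by the result of `read.csv()` on each file.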
Description of the attributes
| Attribute name | Type | Meaning | Special values |
|---|---|---|---|
| ID | nominal (nominal lvl of measurement) | Identification number of a kickstarter project in the database. | Integer number. |
| name | nominal (nominal lvl of measurement) | Specific name of a kickstarter project in the database given by creators. | String value, all characters allowed. |
| category | nominal (nominal lvl of measurement) | Classification of projects into a narrow category. | Many subgroups for example: Comics, 3D Printing, Crafts, Science Fiction. In total 162 categories. |
| main_category | nominal (nominal lvl of measurement) | Classification of projects into a broader campaign category. | Subgroups, for example: Art, Comics, Music, Technology, Food, Film & Video, Crafts, Dance. In total 17 categories. |
| currency | nominal (nominal lvl of measurement) | Currency the project’s data are given in (goal, pledged). | Currency codes, for example: USD, GBP, NZD, CAD. |
| deadline | date (interval lvl of measurement) | End date of the project’s crowdfunding campaign. | Date and time format. |
| goal | numerical (ratio lvl of measurement) | Target amount of money to classify the project as successful. | Decimal value. |
| launched | date (interval lvl of measurement) | Date of launching the project. | Date and time format. |
| pledged | numerical (ratio lvl of measurement) | Amount of money pledged by ‘crowd’ to the project in the project’s currency. | Decimal value. |
| state | nominal (nominal lvl of measurement) | Current state of the project - depends on the amount of money set as a goal and reached. | Failed - project hasn’t reached the goal within the deadline; Live - active project; Successful - project has reached the goal; Canceled - project has been canceled by the creator; Suspended - project has been suspended for some legal reason. |
| backers | numerical (ratio lvl of measurement) | Number of people that support the project (backers). | Non-negative integer. |
| country | nominal (nominal lvl of measurement) | Country the project originates from. | Country codes, for example: US, GB, CA, ES, NZ. |
| usd_pledged | numerical (ratio lvl of measurement) | Conversion of the pledged value to USD using a kickstarter converter. | Decimal value of converted value from pledged column. |
| usd_pledged_real | numerical (ratio lvl of measurement) | Conversion of the pledged value to USD using a fixer.io converter. | Decimal value of converted value from pledged column. |
| usd_goal_real | numerical (ratio lvl of measurement) | Conversion of the goal value to USD using a fixer.io converter. | Decimal value of converted value from goal column. |
Data overview
To begin, let’s look at the first few rows of the data set to become familiar with its contents and structure:
| ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd.pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000 | 2015-08-11 12:12:28 | 0 | failed | 0 | GB | 0 | 0 | 1533.95 |
| 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000 | 2017-09-02 04:43:57 | 2421 | failed | 15 | US | 100 | 2421 | 30000.00 |
| 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000 | 2013-01-12 00:20:50 | 220 | failed | 3 | US | 220 | 220 | 45000.00 |
| 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000 | 2012-03-17 03:24:11 | 1 | failed | 1 | US | 1 | 1 | 5000.00 |
| 1000011046 | Community Film Project: The Art of Neighborhood Filmmaking | Film & Video | Film & Video | USD | 2015-08-29 | 19500 | 2015-07-04 08:35:03 | 1283 | canceled | 14 | US | 1283 | 1283 | 19500.00 |
| 1000014025 | Monarch Espresso Bar | Restaurants | Food | USD | 2016-04-01 | 50000 | 2016-02-26 13:38:27 | 52375 | successful | 224 | US | 52375 | 52375 | 50000.00 |
Project categories and states
One of the basic characteristics of each project is its classification into a theme category and subcategory. Another is the state of the project, which will be the subject of our further research and predictions. The graph below presents the number of projects assigned to each main category, broken down by project state.
For further analysis we will use only the projects that provide the most relevant information: those that had already concluded when the data were downloaded from the platform (successful, failed or canceled). Records in the remaining state categories will not be taken into consideration and are excluded from the data set.
The following graph presents the proportions of successful, failed and canceled projects within the main theme categories. Note that the categories are ordered by descending number of projects (as in the earlier graph, “Number and state structure of projects by category”). The percentage of successful projects seems to be unrelated to the overall number of projects. In particular, “Dance”, the category with the lowest number of registered projects, is also the category with the highest percentage of successes.
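The filtering step described above can be sketched in base R as follows. The data frame `projects_toy` is invented for illustration, and the lower-case state labels follow the spelling seen in the sample rows.

```r
# Toy data (hypothetical); the real data frame has many more columns.
projects_toy <- data.frame(
  ID    = 1:6,
  state = c("failed", "live", "successful", "canceled", "suspended", "successful"),
  stringsAsFactors = FALSE
)

# States of projects that had already concluded at download time.
concluded_states <- c("successful", "failed", "canceled")

# Keep only the concluded projects.
concluded <- projects_toy[projects_toy$state %in% concluded_states, ]

# Store state as a factor restricted to the three retained levels.
concluded$state <- factor(concluded$state, levels = concluded_states)

print(table(concluded$state))
```

Restricting the factor levels keeps the dropped states from appearing as empty groups in later tables and plots.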
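Proportions of this kind can be computed with a contingency table normalised by row. The sketch below uses invented toy data (`concluded_toy`), so the numbers do not reflect the real dataset.

```r
# Toy data (hypothetical values).
concluded_toy <- data.frame(
  main_category = c("Dance", "Dance", "Dance", "Games", "Games", "Games", "Games"),
  state = c("successful", "successful", "failed",
            "failed", "failed", "canceled", "successful")
)

# Counts of projects per main category and state.
counts <- table(concluded_toy$main_category, concluded_toy$state)

# Share of each state within a category (each row sums to 1).
shares <- prop.table(counts, margin = 1)

# Order the categories by descending number of projects.
shares <- shares[order(-rowSums(counts)), ]

print(round(shares, 2))
```

In this toy example "Dance" has fewer projects than "Games" but a higher share of successes, which mirrors the kind of pattern discussed above.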
Which project categories gain the most backers?
Having plotted boxplots of the logarithm of the number of backers for each project category, we can see which topics attract more sponsors. More people are eager to fund “Games”, “Design” and “Comics”, while “Crafts” and “Journalism” attract the fewest backers.
It is also interesting to see how the number of registered projects changed over time.
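A comparison of this kind can be sketched as below. Because some projects have zero backers, the sketch uses log10(backers + 1); this offset is an assumption made here, as the text does not say how zeros were handled. The data are invented.

```r
# Toy data (hypothetical values).
projects_toy <- data.frame(
  main_category = c("Games", "Games", "Games", "Crafts", "Crafts", "Crafts"),
  backers = c(0, 99, 999, 0, 9, 19)
)

# Log-transform; the +1 offset (an assumption) keeps zero-backer projects defined.
projects_toy$log_backers <- log10(projects_toy$backers + 1)

# Median of the transformed values per category, highest first.
median_log <- sapply(split(projects_toy$log_backers, projects_toy$main_category), median)
median_log <- sort(median_log, decreasing = TRUE)

print(median_log)
```

Calling `boxplot(log_backers ~ main_category, data = projects_toy)` on the same data frame would draw the corresponding boxplots.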
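One way to obtain such counts is to extract the launch year from the `launched` column and tabulate it. The sketch below uses four launch timestamps taken from the sample rows shown earlier; the variable names are illustrative.

```r
# Launch timestamps from the sample rows in the data overview.
launched <- c("2015-08-11 12:12:28", "2017-09-02 04:43:57",
              "2013-01-12 00:20:50", "2015-07-04 08:35:03")

# Parse the date part (first 10 characters) and extract the year.
launch_date <- as.Date(substr(launched, 1, 10))
launch_year <- format(launch_date, "%Y")

# Number of projects launched in each year.
projects_per_year <- table(launch_year)

print(projects_per_year)
```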
Geographical diversification of projects
The Kickstarter platform is dominated by projects created in the United States, with a quite significant representation of undertakings established in Great Britain.
As stated previously, the majority of projects originated in the US. However, we decided to check whether the country could influence a project’s success. The graphs in this section show countries’ success and failure rates in particular years. They also show when each country first appeared in the project registry.
| country | avg_sr |
|---|---|
| US | 0.3977778 |
| SG | 0.3550000 |
| HK | 0.3450000 |

| country | avg_fr |
|---|---|
| IT | 0.7033333 |
| AT | 0.6600000 |
| CH | 0.6366667 |
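Columns such as avg_sr can be derived by computing a success rate per country and year and then averaging those yearly rates per country. Whether the tables above were built exactly this way is an assumption; the sketch only illustrates the idea on invented data.

```r
# Toy data (hypothetical values).
projects_toy <- data.frame(
  country = c("US", "US", "US", "US", "IT", "IT", "IT", "IT"),
  year    = c(2016, 2016, 2017, 2017, 2016, 2016, 2017, 2017),
  state   = c("successful", "failed", "successful", "successful",
              "failed", "failed", "successful", "failed"),
  stringsAsFactors = FALSE
)

# Indicator of success for each project.
projects_toy$success <- projects_toy$state == "successful"

# Success rate per country and year.
yearly <- aggregate(success ~ country + year, data = projects_toy, FUN = mean)

# Average of the yearly rates per country, highest first.
avg_sr <- aggregate(success ~ country, data = yearly, FUN = mean)
names(avg_sr)[2] <- "avg_sr"
avg_sr <- avg_sr[order(-avg_sr$avg_sr), ]

print(avg_sr)
```

Averaging yearly rates gives every year equal weight, so it can differ from the pooled success rate when the number of projects varies strongly between years.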
Data quality
Missing Values
In this section we will look at the quality of our data. First, we check the original data for missing values.
We can observe that we are dealing with 3797 missing values, all of which are in the column usd_pledged. From the description of the dataset, we know that this column holds the value of the pledged column converted to USD using the Kickstarter converter. Hence these missing values are of little importance: to compare projects, the value of usd_pledged_real (converted using a different converter) can be used instead.
| Name | Category | Main.category | Currency | Deadline | Goal | Launched | Pledged | State | Backers | Country | Usd_pledged | Usd_pledged_real | Usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3797 | 0 | 0 |
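A per-column count of missing values like the one above can be produced with `colSums(is.na(...))`. The sketch below runs on a small invented data frame with one missing value.

```r
# Toy data (hypothetical values) with one missing entry in usd.pledged.
projects_toy <- data.frame(
  pledged          = c(0, 2421, 220),
  usd.pledged      = c(0, NA, 220),
  usd_pledged_real = c(0, 2421, 220)
)

# is.na() marks missing cells; colSums() counts them per column.
missing_per_column <- colSums(is.na(projects_toy))

print(missing_per_column)
```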
Outliers
To look at outliers, we will check only two columns, usd_pledged_real and usd_goal_real, because the values must be expressed in the same currency to be comparable.
When we look at the boxplot of the first variable (usd_goal_real), we can see that it is strongly influenced by very high observations. Looking at the number of observations that fall within each quartile, almost 100 000 lie at or below the first quartile (25%) and roughly 95 000 lie above the third (75%). These are the observations that lie outside the box of the boxplot. The remaining observations (about 200 000) fall between the first and third quartiles.
Identifying outliers with the R function identify_outliers returns around 45 000 observations with a mean goal of 322 148 USD.
| Quantile | Value_of_quartile | Num_of_observations |
|---|---|---|
| 25% | 2000 | 98786 |
| 50% | 5500 | 102437 |
| 75% | 15500 | 94586 |
| 100% | 166361391 | 94767 |

| Num_of_outliers | Mean_outlier_value |
|---|---|
| 45508 | 322148 |
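The two tables above can be approximated in base R. The sketch below counts observations per quartile bin and flags outliers with the usual boxplot rule (values more than 1.5 times the interquartile range beyond the first or third quartile). Treating this rule as the one used for the reported counts is an assumption, and the goal values are invented.

```r
# Toy goal values in USD (hypothetical), including one extreme observation.
goal <- c(1000, 2000, 3000, 5000, 5500, 8000, 15000, 20000, 1000000)

# Quartiles of the variable.
q <- quantile(goal, probs = c(0.25, 0.5, 0.75))

# Number of observations in each quartile bin (intervals closed on the right).
bin_counts <- as.vector(table(cut(goal, breaks = c(-Inf, unname(q), Inf))))

# Boxplot rule: flag values beyond 1.5 * IQR from the quartiles.
iqr   <- q[["75%"]] - q[["25%"]]
lower <- q[["25%"]] - 1.5 * iqr
upper <- q[["75%"]] + 1.5 * iqr
is_outlier <- goal < lower | goal > upper

n_outliers   <- sum(is_outlier)
mean_outlier <- mean(goal[is_outlier])

print(bin_counts)
print(c(n_outliers, mean_outlier))
```

A single very large goal is enough to pull the mean of the flagged values far above the third quartile, which is the effect visible in the reported mean outlier value.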
The second numeric variable, usd_pledged_real, also shows a large number of outliers (around 50 000).
| Quantile | Value_of_quartile | Num_of_observations |
|---|---|---|
| 25% | 31.00 | 95071 |
| 50% | 624.33 | 94697 |
| 75% | 4050.00 | 94652 |
| 100% | 20338986.27 | 94679 |

| Num_of_outliers | Mean_outlier_value |
|---|---|
| 50578 | 58253.41 |
Data integrity
Here we verify that the integrity of the data in this set was preserved. The test shows that the data in the numeric columns are consistent and have reasonable values. The column usd_pledged (named usd.pledged after import into R) has some missing values, but we omitted them for this test because they do not violate data integrity.
k2018 <- read.csv(file = "ks-projects-201801.csv", sep = ",", dec = ".")
k2 <- k2018
stopifnot(is.numeric(k2$goal))
stopifnot(is.numeric(k2$pledged))
stopifnot(is.numeric(k2$backers))
stopifnot(is.numeric(k2$usd.pledged))
stopifnot(is.numeric(k2$usd_pledged_real))
stopifnot(is.numeric(k2$usd_goal_real))
stopifnot(k2$goal>0)
stopifnot(k2$pledged>=0)
stopifnot(k2$backers>=0)
stopifnot(na.omit(k2$usd.pledged)>=0)
stopifnot(k2$usd_pledged_real>=0)
stopifnot(k2$usd_goal_real>=0)
print('The integrity is preserved')
## [1] "The integrity is preserved"
In this dataset we didn’t notice any values of unknown meaning or ones that would be inconsistent with the rest.
Revision of validity of goals for further data mining
Looking at the data, we can say that we will be able to assess the expected results of projects based on the variables in the data set. The data are of good quality, which should allow accurate analysis and reliable results. In the exploratory data analysis, we noticed relationships between several variables and the outcome of a project. The data set therefore supports the data mining goals set out above.