## state backers_count blurb
## Length:182958 Min. : 0.0 Length:182958
## Class :character 1st Qu.: 3.0 Class :character
## Mode :character Median : 24.0 Mode :character
## Mean : 132.4
## 3rd Qu.: 80.0
## Max. :105857.0
##
## category.color category.id category.name category.parent_id
## Min. : 51627 Min. : 1 Length:182958 Min. : 1.00
## 1st Qu.: 6526716 1st Qu.: 34 Class :character 1st Qu.:10.00
## Median :14867664 Median :239 Mode :character Median :12.00
## Mean :12088863 Mean :167 Mean :11.78
## 3rd Qu.:16743775 3rd Qu.:298 3rd Qu.:16.00
## Max. :16776056 Max. :389 Max. :26.00
## NA's :16740
## category.position category.slug category.urls.web.discover
## Min. : 1.000 Length:182958 Length:182958
## 1st Qu.: 3.000 Class :character Class :character
## Median : 6.000 Mode :character Mode :character
## Mean : 7.202
## 3rd Qu.:10.000
## Max. :19.000
##
## converted_pledged_amount country created_at
## Min. : 0 Length:182958 Min. :2009-04-21 17:35:35
## 1st Qu.: 85 Class :character 1st Qu.:2013-03-22 02:47:18
## Median : 1368 Mode :character Median :2014-11-27 16:17:33
## Mean : 11039 Mean :2014-10-05 21:21:56
## 3rd Qu.: 5866 3rd Qu.:2016-04-10 16:22:50
## Max. :10266845 Max. :2018-10-08 04:31:57
##
## is_backing is_starrable is_starred
## Length:182958 Length:182958 Length:182958
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## launched_at usd_pledged usd_type
## Min. :2009-04-24 19:52:03 Min. : 0 Length:182958
## 1st Qu.:2013-05-17 19:38:54 1st Qu.: 85 Class :character
## Median :2015-01-22 07:06:12 Median : 1367 Mode :character
## Mean :2014-11-16 02:54:22 Mean : 10981
## 3rd Qu.:2016-05-23 06:54:53 3rd Qu.: 5858
## Max. :2018-10-09 21:01:51 Max. :10266846
##
is_backing, is_starrable, and is_starred can all be dropped because each has only one distinct value.
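A minimal sketch of how constant columns like these could be detected and dropped in base R. The data frame `projects` and its values are illustrative stand-ins, not the real Kickstarter data.

```r
# Toy data frame standing in for the Kickstarter data (values are illustrative).
projects <- data.frame(
  state        = c("successful", "failed", "live"),
  is_backing   = c("false", "false", "false"),
  is_starrable = c("false", "false", "false"),
  stringsAsFactors = FALSE
)

# Count distinct values per column; one distinct value means the column is constant.
# Note: unique() treats NA as a value of its own.
n_distinct <- vapply(projects, function(col) length(unique(col)), integer(1))
constant_cols <- names(n_distinct)[n_distinct == 1]

# Keep only the columns that vary.
projects_clean <- projects[, setdiff(names(projects), constant_cols), drop = FALSE]
```

Checking the distinct count rather than eyeballing the summary also catches constant columns that are numeric or dates.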
The category slug is the finest level of detail for category data. This can be seen by comparing the grouped counts for category slug with the grouped counts for all category-related variables combined: every pair of counts matches.
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [99] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [113] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [127] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [141] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [169] TRUE
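The comparison behind the output above could be reproduced along these lines. This is a sketch on a toy data frame `cats` with made-up rows, and it is not necessarily the code that produced the original output.

```r
# Illustrative category columns; the rows and parent ids are invented for the example.
cats <- data.frame(
  category.slug      = c("games/playing cards", "games/playing cards",
                         "art/painting", "music/rock"),
  category.name      = c("Playing Cards", "Playing Cards", "Painting", "Rock"),
  category.parent_id = c(12, 12, 1, 14),
  stringsAsFactors = FALSE
)
cats$n <- 1L

# Row counts grouped by slug alone.
by_slug <- aggregate(n ~ category.slug, data = cats, FUN = sum)

# Row counts grouped by every category variable together.
# Caveat: the formula interface drops rows with NA in a grouping column,
# so the NAs in the real category.parent_id would need handling first.
by_all <- aggregate(n ~ category.slug + category.name + category.parent_id,
                    data = cats, FUN = sum)

# Align the two tables on slug and compare the counts pairwise.
merged <- merge(by_slug, by_all, by = "category.slug", suffixes = c(".slug", ".all"))
same_counts <- merged$n.slug == merged$n.all
```

If the slug is the finest level, `by_all` has exactly as many rows as `by_slug` and `same_counts` is all TRUE.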
category.color provides the same information as category.parent_id. category.name is unique within its parent category but duplicated across parents. Each category.slug maps to a single category.position, but a given category.position is shared by several slugs.
Category slug therefore carries the most distinguishing information for the model. The open question is whether the parent category alone, or some other combination, differentiates projects better. Our chosen variable selection technique can make this decision for each category/parent combination.
Since “category.urls.web.discover” is never null, a flag indicating the presence of the URL would be constant and carry no information.
## [1] "http://www.kickstarter.com/discover/categories/games/playing%20cards"
A sample shows that it is just a restatement of the category, so it can be safely dropped.
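One way to check that the URL adds nothing beyond the category is to decode it and compare its tail with the slug. The sketch below uses the sampled URL; the slug value `"games/playing cards"` is an assumption about how the matching category.slug is stored.

```r
url  <- "http://www.kickstarter.com/discover/categories/games/playing%20cards"
slug <- "games/playing cards"  # assumed value of category.slug for this row

# Decode percent-escapes (%20 becomes a space), then strip everything up to the category path.
decoded      <- utils::URLdecode(url)
url_category <- sub("^.*/discover/categories/", "", decoded)

# TRUE when the URL carries no information beyond the slug.
url_matches_slug <- identical(url_category, slug)
```

Applied to the whole column, a check like this would confirm the redundancy for every row rather than for a single sample.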
## # A tibble: 22 x 2
## country num
## <chr> <int>
## 1 AT 268
## 2 AU 3682
## 3 BE 337
## 4 CA 7048
## 5 CH 349
## 6 DE 1902
## 7 DK 604
## 8 ES 1051
## 9 FR 1406
## 10 GB 16240
## # ... with 12 more rows
country is a clean column with no NAs. No additional cleanup is required.
If we used either of these as the response, it would probably require a transformation. They should not be used as predictors because they are not populated until the campaign is under way.
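As an illustration of the kind of transformation meant here, the sketch below applies `log1p` to the five-number summary of usd_pledged reported earlier. Treating usd_pledged as the candidate response, and choosing a log transform, are both assumptions made for the example.

```r
# Min, quartiles, and max of usd_pledged, taken from the summary output.
pledged <- c(0, 85, 1367, 5858, 10266846)

# log1p(x) = log(1 + x): handles the zeros and compresses the long right tail.
log_pledged <- log1p(pledged)

# expm1 inverts the transform when predictions need to be reported in dollars.
back_transformed <- expm1(log_pledged)
```

The raw values span seven orders of magnitude, while on the log scale they fall between 0 and about 16, which is usually easier for a model to fit.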
This could be turned into valuable information with a little work.
Many of the blurbs are related to the category. We will experiment with adding all category-related words to the stop words.
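A rough sketch of that experiment in base R: split the category slugs into words, add them to a stop-word list, and filter blurb tokens against the result. The slugs, the tiny `base_stop_words` list, and the blurb are all made up for illustration; a real run would use a full stop-word lexicon.

```r
slugs <- c("games/playing cards", "art/painting")  # illustrative slugs
base_stop_words <- c("a", "the", "of", "and")      # hypothetical, deliberately tiny list

# Break each slug into lowercase words on any non-letter character.
category_words <- unique(unlist(strsplit(tolower(slugs), "[^a-z]+")))
category_words <- category_words[nzchar(category_words)]

# Extend the stop words with the category vocabulary.
stop_words <- union(base_stop_words, category_words)

# Tokenize an invented blurb the same way and drop the stop words.
blurb  <- "A deck of playing cards inspired by the art of painting"
tokens <- unlist(strsplit(tolower(blurb), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]
kept   <- tokens[!tokens %in% stop_words]
```

Only words that say something beyond the category survive the filter, which is the information the blurb could add on top of category.slug.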
Making the cut for our next stage should be:
backers_count and pledged could be target variables if we decide to pivot.