##     state           backers_count         blurb          
##  Length:182958      Min.   :     0.0   Length:182958     
##  Class :character   1st Qu.:     3.0   Class :character  
##  Mode  :character   Median :    24.0   Mode  :character  
##                     Mean   :   132.4                     
##                     3rd Qu.:    80.0                     
##                     Max.   :105857.0                     
##                                                          
##  category.color      category.id  category.name      category.parent_id
##  Min.   :   51627   Min.   :  1   Length:182958      Min.   : 1.00     
##  1st Qu.: 6526716   1st Qu.: 34   Class :character   1st Qu.:10.00     
##  Median :14867664   Median :239   Mode  :character   Median :12.00     
##  Mean   :12088863   Mean   :167                      Mean   :11.78     
##  3rd Qu.:16743775   3rd Qu.:298                      3rd Qu.:16.00     
##  Max.   :16776056   Max.   :389                      Max.   :26.00     
##                                                      NA's   :16740     
##  category.position category.slug      category.urls.web.discover
##  Min.   : 1.000    Length:182958      Length:182958             
##  1st Qu.: 3.000    Class :character   Class :character          
##  Median : 6.000    Mode  :character   Mode  :character          
##  Mean   : 7.202                                                 
##  3rd Qu.:10.000                                                 
##  Max.   :19.000                                                 
##                                                                 
##  converted_pledged_amount   country            created_at                 
##  Min.   :       0         Length:182958      Min.   :2009-04-21 17:35:35  
##  1st Qu.:      85         Class :character   1st Qu.:2013-03-22 02:47:18  
##  Median :    1368         Mode  :character   Median :2014-11-27 16:17:33  
##  Mean   :   11039                            Mean   :2014-10-05 21:21:56  
##  3rd Qu.:    5866                            3rd Qu.:2016-04-10 16:22:50  
##  Max.   :10266845                            Max.   :2018-10-08 04:31:57  
##                                                                           
##   is_backing        is_starrable        is_starred       
##  Length:182958      Length:182958      Length:182958     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   launched_at                   usd_pledged         usd_type        
##  Min.   :2009-04-24 19:52:03   Min.   :       0   Length:182958     
##  1st Qu.:2013-05-17 19:38:54   1st Qu.:      85   Class :character  
##  Median :2015-01-22 07:06:12   Median :    1367   Mode  :character  
##  Mean   :2014-11-16 02:54:22   Mean   :   10981                     
##  3rd Qu.:2016-05-23 06:54:53   3rd Qu.:    5858                     
##  Max.   :2018-10-09 21:01:51   Max.   :10266846                     
## 

is_backing, is_starrable, and is_starred can all be dropped because each has only one distinct value.
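A minimal sketch of how such constant columns could be detected and dropped. The data frame name `projects` and its values are invented for illustration; only the column names come from the summary above.

```r
# Toy stand-in for the Kickstarter data; the values are made up.
projects <- data.frame(
  state        = c("successful", "failed", "live"),
  is_backing   = c("false", "false", "false"),
  is_starrable = c("false", "false", "false"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column (an NA would count as a value).
n_distinct_values <- vapply(projects, function(col) length(unique(col)), integer(1))

# Columns with a single distinct value cannot help separate observations.
constant_cols <- names(n_distinct_values)[n_distinct_values <= 1]

# Keep everything else.
projects_reduced <- projects[, !(names(projects) %in% constant_cols), drop = FALSE]
```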

Null Summary

Categorizing the Categories

The category slug is the finest level of detail for category data. This can be seen by comparing the grouped counts for category slug with the grouped counts for all category-related variables.

##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [99] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [113] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [127] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [141] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [169] TRUE
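One way such a comparison could be set up, shown as a sketch on invented rows rather than the code that produced the output above. It compares, row by row, the size of each slug group with the size of the group formed by all category columns. Because category.parent_id contains NAs, the sketch recodes them to a placeholder string first (the label "none" is an arbitrary choice).

```r
# Toy rows; the values are made up for illustration.
projects <- data.frame(
  category.slug      = c("games/playing cards", "games/playing cards",
                         "art/painting", "music"),
  category.name      = c("Playing Cards", "Playing Cards", "Painting", "Music"),
  category.parent_id = c(12, 12, 1, NA),
  stringsAsFactors = FALSE
)

# Recode missing parent ids so those rows still form a group.
parent_key <- ifelse(is.na(projects$category.parent_id), "none",
                     as.character(projects$category.parent_id))

ones <- rep(1L, nrow(projects))

# Group size when grouping by slug alone.
n_by_slug <- ave(ones, projects$category.slug, FUN = sum)

# Group size when grouping by every category-related column.
n_by_all <- ave(ones, projects$category.slug, projects$category.name,
                parent_key, FUN = sum)

# If slug is the finest level, adding columns never splits a group.
same_size <- n_by_slug == n_by_all
```

If every element of `same_size` is TRUE, the additional category columns add no finer grouping than the slug already provides.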

category.color provides the same information as category.parent_id. category.name is unique within its parent category but is duplicated across parent categories. Each category.slug has a single category.position, but a given category.position is shared by several slugs.
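Relationships like these can be checked by testing whether one column determines another. The helper below, `determines()`, is a hypothetical name introduced for this sketch, and the rows are invented.

```r
# Toy rows; the values are made up for illustration.
projects <- data.frame(
  category.slug      = c("art/painting", "art/sculpture",
                         "games/playing cards", "music/rock"),
  category.color     = c(100, 100, 200, 300),
  category.parent_id = c(1, 1, 12, 14),
  category.position  = c(3, 7, 3, 5),
  stringsAsFactors = FALSE
)

# TRUE when every value of `from` is paired with exactly one value of `to`.
determines <- function(data, from, to) {
  pairs <- unique(data[, c(from, to)])
  !any(duplicated(pairs[[from]]))
}

color_to_parent  <- determines(projects, "category.color", "category.parent_id")
parent_to_color  <- determines(projects, "category.parent_id", "category.color")
slug_to_position <- determines(projects, "category.slug", "category.position")
position_to_slug <- determines(projects, "category.position", "category.slug")
```

Two columns carry the same information when the check holds in both directions, as it does for colour and parent id in this toy data.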

Category slug provides the most distinguishing information for the model. The open question is whether the parent category matters, or whether some other combination of category variables differentiates best. Our chosen variable selection technique can make this decision for each category/parent combination.

URL

Because “category.urls.web.discover” is never null, a flag indicating the presence of the URL would carry no information.

## [1] "http://www.kickstarter.com/discover/categories/games/playing%20cards"

A sample shows that it is just a restatement of the category, so it can be safely dropped.
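Rather than relying on a single sample, the redundancy could be checked across all rows by decoding the URL path and comparing it with the slug. This sketch assumes the slug for the sampled URL is "games/playing cards" and that every URL follows the same ".../discover/categories/" pattern; neither is confirmed by the output above.

```r
# The URL is the sample shown above; the slug value is an assumption.
urls  <- "http://www.kickstarter.com/discover/categories/games/playing%20cards"
slugs <- "games/playing cards"

# Strip everything up to and including the fixed prefix.
paths <- sub("^.*/discover/categories/", "", urls)

# Decode percent-escapes such as %20 (URLdecode works on one string at a time).
decoded <- vapply(paths, utils::URLdecode, character(1), USE.NAMES = FALSE)

# TRUE for each row where the URL adds nothing beyond the slug.
url_matches_slug <- decoded == slugs
```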

Country

## # A tibble: 22 x 2
##    country   num
##      <chr> <int>
##  1      AT   268
##  2      AU  3682
##  3      BE   337
##  4      CA  7048
##  5      CH   349
##  6      DE  1902
##  7      DK   604
##  8      ES  1051
##  9      FR  1406
## 10      GB 16240
## # ... with 12 more rows

This is a clean column with no NAs. No additional cleanup required.
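A base R sketch of the two checks behind this statement: a count per country and a test for missing values. The country codes below are a made-up sample, and the tibble above was presumably produced with dplyr rather than this code.

```r
# Toy country column; the values are invented.
country <- c("GB", "US", "GB", "CA", "US", "US")

# Count rows per country, naming the count column `num` as in the output above.
country_counts <- as.data.frame(table(country = country),
                                responseName = "num",
                                stringsAsFactors = FALSE)

# TRUE would mean the column needs cleanup.
has_missing <- anyNA(country)
```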

Numerics

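If we used one of the numeric outcome variables (such as backers_count or the pledged amounts) as the response, it would probably require a transform: they are heavily right-skewed (backers_count has a median of 24, a mean of 132.4, and a maximum of 105,857). They shouldn’t be used as predictors because their values are not known before the campaign runs.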
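As an illustration of the kind of transform that might be used, the sketch below applies log(1 + x) to the quantiles of backers_count reported in the summary. The choice of `log1p` is an assumption, picked because the minimum is 0 and a plain logarithm would be undefined there.

```r
# Minimum, quartiles, and maximum of backers_count from the summary above.
backers_count <- c(0, 3, 24, 80, 105857)

# log1p(x) computes log(1 + x), so zero backers maps to 0 instead of -Inf.
log_backers <- log1p(backers_count)

# expm1() reverses the transform when predictions must be reported as counts.
recovered <- expm1(log_backers)
```

On the transformed scale the maximum is no longer thousands of times larger than the median, which tends to suit linear models better.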

Blurb

This could be turned into valuable information with a little work.

Many of the blurbs are related to the category. We will experiment with including all category-related words in the stop words.
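A base R sketch of that idea, assuming the category words are taken from the category names. The blurbs, the category names, the tiny stop word list, and the `tokenize()` helper are all invented for illustration; a real analysis would likely use a text-mining package and a fuller stop word list.

```r
# Invented examples, not rows from the data.
category_names <- c("Playing Cards", "Tabletop Games")
blurbs <- c("A deck of playing cards for the modern magician",
            "Tabletop games with a twist")

# Lower-case the text and split it on anything that is not a letter.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words)]
}

# A deliberately tiny general-purpose stop word list.
base_stop_words <- c("a", "of", "for", "the", "with")

# Extend the stop words with every word that appears in a category name.
category_words <- unique(tokenize(category_names))
stop_words <- union(base_stop_words, category_words)

# Tokens left in each blurb once stop words are removed.
blurb_tokens <- lapply(blurbs, function(blurb) {
  tokens <- tokenize(blurb)
  tokens[!(tokens %in% stop_words)]
})
```

What remains in each blurb is the vocabulary that does not simply repeat the category.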

Summary

Making the cut for our next stage should be:

backers_count and pledged could be target variables if we decide to pivot.