Avatar Columns

These are all URL links to the same image, stored in three separate fields. There are no NA values and most rows have a unique URL. I cannot imagine using the URLs themselves, but we might consider a unique/not-unique flag.

creator.avatar.medium

Unique Values (levels): 154852

Is Null: 0

Proportion Unique: 0.8463800435

creator.avatar.small

Unique Values (levels): 154852

Is Null: 0

Proportion Unique: 0.8463800435

creator.avatar.thumb

Unique Values (levels): 154852

Is Null: 0

Proportion Unique: 0.8463800435
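
The figures above are consistent with Proportion Unique being the number of distinct values divided by the number of rows. A minimal sketch in base R of how these counts and a unique/not-unique flag could be computed, using a made-up vector (the URLs and the name avatar_shared are illustrative and not taken from the data):

```r
# Toy stand-in for one avatar column (hypothetical URLs, not real data).
avatar <- c("https://example.com/a.png",
            "https://example.com/b.png",
            "https://example.com/a.png",
            "https://example.com/c.png")

n_unique    <- length(unique(avatar))      # number of distinct URLs
n_null      <- sum(is.na(avatar))          # number of NA values
prop_unique <- n_unique / length(avatar)   # distinct URLs divided by rows

# 1 when the URL appears in more than one row, 0 when it appears only once.
avatar_shared <- as.integer(duplicated(avatar) | duplicated(avatar, fromLast = TRUE))
```

Since the three avatar fields point to the same image, one flag built from any one of them would probably be enough.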

creator.chosen_currency

There are thirteen chosen currencies with different levels of adoption, plus 182k+ rows with no value at all (an empty string, not NA). This seems like it might be an interesting categorical variable if we create a level of "none" to replace the empty values.

It might also be worth a flag comparing this to the actual currency used (same vs. different).

Currency Count
(empty) 182443
AUD 9
CAD 31
CHF 3
DKK 1
EUR 95
GBP 38
HKD 5
JPY 13
MXN 17
NZD 4
SEK 9
SGD 1
USD 289

NA’s: 0
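
A rough sketch of the recoding and the same/different flag described above, in base R. The data frame is made up, and the names chosen_currency_clean and chosen_vs_actual are hypothetical; only creator.chosen_currency and currency come from the data set.

```r
# Toy data; column names mirror the report, values are made up.
df <- data.frame(
  creator.chosen_currency = factor(c("", "USD", "", "EUR", "CAD")),
  currency                = factor(c("USD", "USD", "GBP", "GBP", "CAD"))
)

chosen <- as.character(df$creator.chosen_currency)
chosen[chosen == ""] <- "none"            # explicit level for the empty strings
df$chosen_currency_clean <- factor(chosen)

# Compare against the campaign currency; keep "none" separate when nothing was chosen.
df$chosen_vs_actual <- ifelse(
  chosen == "none", "none",
  ifelse(chosen == as.character(df$currency), "same", "different")
)
```

Keeping "none" as its own value in the comparison avoids counting the empty rows as "different", which would otherwise swamp the flag.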

creator.id

This is an interesting variable. Some IDs repeat, which may mean these creators are serial entrepreneurs who are good at this, or that they are retrying after a failed campaign. It might be good to have one column for a repeat-ID flag and another for a successful previous campaign.

Duplicated Id’s: 24844

Proportion Unique: 0.8642092721
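
One possible way to build the repeat-ID flag and a prior-success column, sketched in base R on made-up rows. The columns launched_at and state are assumptions (an ordering column and an outcome column are needed, but neither is described in this section), and the flag names are hypothetical.

```r
# Toy data; launched_at and state are hypothetical column names.
df <- data.frame(
  creator.id  = c(1, 2, 1, 3, 1, 2),
  launched_at = c(10, 20, 30, 40, 50, 5),
  state       = c("successful", "failed", "failed",
                  "successful", "successful", "successful")
)

# Order campaigns chronologically within each creator.
df <- df[order(df$creator.id, df$launched_at), ]

# 1 when the creator has more than one campaign in the data.
df$repeat_creator <- as.integer(ave(df$creator.id, df$creator.id, FUN = length) > 1)

# Number of successes strictly before the current campaign, per creator.
success <- as.integer(df$state == "successful")
df$prior_successes <- ave(success, df$creator.id, FUN = function(x) cumsum(x) - x)
df$prior_success_flag <- as.integer(df$prior_successes > 0)
```

Subtracting the current row from the running total keeps a campaign's own outcome out of its prior-success count, so the flag does not leak the result being predicted.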

creator.name

As with the IDs, there is some redundancy here, even more than with the IDs, which might mean multiple paths to entrepreneurship. This might be a good column to create flags from, such as multiple campaigns and prior success.

Number of Unique Names: 151484

Proportion of Unique Names: 0.827971447

creator.is_registered

ALL TRUE, so there is nothing to be gleaned from this column and it can be removed.

Table of True & False Values:

Value Count
True 182958

Proportion Unique: 0.00000546573531

creator.slug

Here there are many missing values (empty rather than NA), many duplicated slugs, and many original slugs. Because the column comes in as a factor, the empty cells are not NA, so we would need to recode them.

Number of Unique Slugs: 42011

Proportion of Unique Slugs: 0.229621006

Number of Duplicated Slugs: 140947

Proportion of Duplicated Slugs: 0.770378994
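
A minimal sketch of that recoding in base R, on a made-up factor (the names slug_clean and has_slug are hypothetical):

```r
# Toy factor with empty strings standing in for missing slugs.
slug <- factor(c("alice", "", "bob", "", "alice"))

slug_chr <- as.character(slug)
slug_chr[!is.na(slug_chr) & slug_chr == ""] <- NA   # empty string becomes a true NA
slug_clean <- factor(slug_chr)                      # the "" level is dropped

n_missing <- sum(is.na(slug_clean))                 # now counted as missing
has_slug  <- as.integer(!is.na(slug_clean))         # optional 0/1 presence flag
```

Once the empty strings are NA, the unique and duplicated counts can be recomputed on the non-missing slugs only, so the empty value no longer inflates the duplicate count.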

creator.urls.api.user

There are VERY FEW duplicates (0.3%) and almost all campaigns have an api.user, so I would consider dropping this field, or at best reducing it to a 1 for has and 0 for does not.

Number of Unique API Users: 182389

Proportion of Unique API Users: 0.996889997

Proportion of Duplicated API Users: 0.00311000339

creator.urls.web.user

There are also few duplicates here, though more than for api.user, at about 13.6%. It may be a useful flag to include.

Number of Unique web.user : 158114

Proportion of Unique web.user : 0.864209272

Proportion of Duplicated web.user : 0.135790728

currency

Count
AUD 3682
CAD 7048
CHF 349
DKK 604
EUR 8098
GBP 16240
HKD 333
JPY 95
MXN 1075
NOK 368
NZD 670
SEK 921
SGD 271
USD 143204

currency_symbol

The symbols do not translate cleanly into R, but there is significant variation, so we should very likely use this column. There are also no missing values, which is useful.

Count
$ 156283
€ 8098
£ 16240
¥ 95
Fr 349
kr 1893

currency_trailing_code

I have no idea what this is, but it varies by a discernible amount, so it might be good to use this variable as a 0/1 flag and see how it combines with the other currency codes to differentiate success.

Count
False 24782
True 158176

Proportions:

Proportion
False 0.135451852
True 0.864548148

current_currency

This is good: these codes differ from the creator's chosen currency, and they separate US from Canadian dollars. It could be useful to leave in.

Count
CAD 1069
USD 181889

Proportions:

Proportion
CAD 0.00584287104
USD 0.99415712896

deadline (needs converting to datetime)

I think this could be a GREAT adjunct to the other dates. We could calculate the differences between the start and the proposed end, the actual start and the actual end, and the proposed and the actual end, then create flags for under/over and see whether these date relationships affect confidence in a campaign and, through that, the outcome.

Number of Unique Deadline Dates: 171038
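
A sketch of the conversion and one of the date differences, in base R. It assumes deadline is stored as Unix epoch seconds, which this report does not confirm, and it uses a hypothetical launched_at column for the start date. The 30-day cut-off is arbitrary and only there to illustrate an under/over flag.

```r
# Toy data; epoch-second storage and the launched_at column are assumptions.
df <- data.frame(
  launched_at = c(1500000000, 1500086400),
  deadline    = c(1503024000, 1500691200)
)

df$deadline_dt <- as.POSIXct(df$deadline,    origin = "1970-01-01", tz = "UTC")
df$launched_dt <- as.POSIXct(df$launched_at, origin = "1970-01-01", tz = "UTC")

# Planned campaign length in days.
df$planned_days <- as.numeric(difftime(df$deadline_dt, df$launched_dt, units = "days"))

# Under/over flag against an arbitrary 30-day threshold.
df$over_30_days <- as.integer(df$planned_days > 30)
```

If the dates turn out to be stored as strings instead, as.POSIXct with a format argument would replace the origin-based call.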

disable_communication

This is interesting, and there are no NA values. As it is binary, it seems prudent to test whether or not disabling communication with entrepreneurs affects the outcome of a campaign.

Frequency:

Count
False 182345
True 613

Proportions:

Proportion
False 0.99664950426
True 0.00335049574

friends

Proportion of NA Values: 1

Continuous Variables

fx_rate

Frequency of Rates

Count
0.00888622 67
0.00889989 28
0.05296891 786
0.05327321 289
0.11145669 706
0.11209032 215
0.12167293 270
0.12259650 98
0.12756973 281
0.12759960 52
0.15413583 430
0.15496150 174
0.65518091 518
0.65849475 148
0.71320267 2734
0.71353141 933
0.72586745 201
0.72739916 70
0.76665011 5265
0.77189202 1736
0.85460225 4
0.93028444 15
1.00000000 142367
1.00531005 261
1.00852303 88
1.14999632 6057
1.15610946 2039
1.30437600 884
1.30994986 11502
1.31748313 4621
1.50002759 2
1.70866715 117


I am not sure what this variable is either. The spread was hard to capture in a histogram because of the disproportion, and a summary was useless, so I went with a table. It shows a wide variety of rates, with most campaigns having a rate of 1. I think this might be a good column to keep in if it is not an outcome-related rate!

If it is not outcome related, we could try leaving it as a ratio or creating levels of factor, either could work.

goal

There is a HUGE disparity between the median and the mean, caused by right skew from a few VERY large goals. This is a difficult thing to negotiate. We could create a scaled amount, but that makes proper scaling on predictions essential, and it is not always reproducible if we get values outside the current range. It might be worth creating an outlier flag for the truly huge goals in this set and allowing some of the noise from this variable to spill into the flag.

Summary

summary(my_data$goal)
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
## 0.00000e+00 1.50000e+03 5.00000e+03 4.79738e+04 1.40000e+04 1.00000e+08

The goals are so heavily skewed that NO single plot can convey this distribution.
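
One option for the skew, sketched in base R: a log transform (which does not depend on the range of the data, so it stays reproducible for new values) together with an outlier flag. The toy vector reuses the five quantile values from the summary above; the fence rule and its 1.5 multiplier are a common convention swapped in here for illustration, not something this analysis has settled on, and the name huge_goal is hypothetical.

```r
# Min, quartiles and max from the summary above, used as a toy vector.
goal <- c(0, 1500, 5000, 14000, 100000000)

log_goal <- log10(goal + 1)   # the +1 keeps a goal of zero finite

# Tukey-style upper fence on the log scale.
q <- unname(quantile(log_goal, c(0.25, 0.75)))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# 1 for goals above the fence, 0 otherwise.
huge_goal <- as.integer(log_goal > upper_fence)
```

The fence would need to be computed once on the training data and stored, so that the same threshold is applied when scoring new campaigns.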

id

All IDs in this column are unique, so it acts as a row identifier rather than a predictor in its own right. The repeat-campaign idea belongs with creator.id: it may be that repeated campaigns are more likely to succeed or fail based on prior performance. We should look into this with a prior successful campaign 0/1 flag and a prior unsuccessful campaign flag, or something like it. This means sorting by date and establishing the order of successes.

Duplicated Id’s: 0

Proportion Unique: 1

Na Breakdown

The only column with true NA values was friends, and it was 100% NA, so it should be excluded. Some of the categorical columns have empty levels. We might need to account for them in our model by creating unknown levels.
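
A short sketch of how the NA breakdown and the exclusion could be done in base R, on a made-up data frame (the values are illustrative; only the column names come from the data set):

```r
# Toy data: friends is entirely NA, the other columns are complete.
df <- data.frame(
  friends  = c(NA, NA, NA),
  currency = c("USD", "CAD", "USD"),
  goal     = c(1500, 5000, 14000)
)

na_prop <- colMeans(is.na(df))            # proportion of NA per column
all_na  <- names(na_prop)[na_prop == 1]   # columns that are entirely NA

# Drop the all-NA columns and keep the rest as a data frame.
df_kept <- df[, !(names(df) %in% all_na), drop = FALSE]
```

This only catches true NA values; empty factor levels would have to be recoded first to show up in the breakdown.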