Avatar Columns

These three fields are all URL links to the same image. There are NO NA values and most rows have a unique URL. I cannot imagine using the URLs themselves, but we might consider a flag for unique versus not unique.

creator.avatar.medium

Unique Values (levels): 154852

Is Null: 0

Proportion Unique: 0.8463800435

creator.avatar.small

Unique Values (levels): 154852

Is Null: 0

Proportion Unique: 0.8463800435

creator.avatar.thumb

Unique Values (levels): 154852

Is Null: 0

Proportion Unique: 0.8463800435
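The three statistics reported for each avatar field, and the unique/not-unique flag suggested above, can be sketched as follows. This is a minimal illustration on a toy data frame: `my_data` is the name used elsewhere in this report, while `profile_column` and `avatar_shared` are hypothetical names introduced here.

```r
# Toy stand-in for the real data; the values are illustrative only.
my_data <- data.frame(
  creator.avatar.thumb = c("a.jpg", "b.jpg", "a.jpg", "c.jpg"),
  stringsAsFactors = FALSE
)

# Summarise one column: distinct values, NA count, share of distinct values.
profile_column <- function(x) {
  c(unique_values = length(unique(x)),
    is_null = sum(is.na(x)),
    proportion_unique = length(unique(x)) / length(x))
}

avatar_profile <- profile_column(my_data$creator.avatar.thumb)

# 1 when the URL appears on more than one row, 0 when it appears only once.
shared <- duplicated(my_data$creator.avatar.thumb) |
  duplicated(my_data$creator.avatar.thumb, fromLast = TRUE)
my_data$avatar_shared <- as.integer(shared)
```

Because the three avatar fields appear to carry the same information, one flag built from any of them would probably be enough.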

creator.chosen_currency

There are thirteen chosen currencies with different levels of adoption, plus 182k+ rows with no default value (empty, but not NA). This seems like it might be an interesting categorical variable if we create a level of none to replace the empty values.

It might also be worth a flag comparing it to the actual currency used: same or different.

Var1	Freq
(empty)	182443
AUD	9
CAD	31
CHF	3
DKK	1
EUR	95
GBP	38
HKD	5
JPY	13
MXN	17
NZD	4
SEK	9
SGD	1
USD	289
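A minimal sketch of the two ideas above: recoding the empty level to none, and flagging whether the chosen currency matches the campaign currency. The data frame is a toy example, and `chosen_currency` and `chosen_vs_actual` are hypothetical column names.

```r
# Toy stand-in; the empty string plays the role of the unlabeled level.
my_data <- data.frame(
  creator.chosen_currency = factor(c("", "USD", "EUR", "")),
  currency = factor(c("USD", "USD", "GBP", "CAD"))
)

# Work on character copies so the empty level can be renamed safely.
chosen <- as.character(my_data$creator.chosen_currency)
chosen[chosen == ""] <- "none"
my_data$chosen_currency <- factor(chosen)

# Three-way flag: none, same as the campaign currency, or different.
same <- chosen == as.character(my_data$currency)
my_data$chosen_vs_actual <- ifelse(chosen == "none", "none",
                                   ifelse(same, "same", "different"))
```

With only 515 non-empty rows in the table above, the same and different groups will be small, so the flag may carry little signal beyond none versus chosen.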

NA’s

creator.id

This is an interesting variable. Some IDs repeat, which may mean the creators are serial entrepreneurs who are good at this, or that they are retrying a failed campaign. It might be good to have one column flagging a repeat ID and another for a previous success.

Duplicated IDs: 24844

Proportion Unique: 0.8642092721
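One way the repeat-ID flag and the previous-success column might be built is sketched below. It assumes a launch date and an outcome column are available; `launched_at` and `state` are hypothetical names here, as are the three derived columns.

```r
# Toy stand-in; launched_at and state are assumed columns, not confirmed ones.
campaigns <- data.frame(
  creator.id = c(1, 2, 1, 3, 1, 2),
  launched_at = c(10, 20, 30, 40, 50, 60),
  state = c("failed", "successful", "successful", "failed", "failed", "failed"),
  stringsAsFactors = FALSE
)

# Order each creator's campaigns in time so "prior" is well defined.
campaigns <- campaigns[order(campaigns$creator.id, campaigns$launched_at), ]

# 1 when the creator appears on more than one row.
campaigns$repeat_creator <- as.integer(
  duplicated(campaigns$creator.id) |
    duplicated(campaigns$creator.id, fromLast = TRUE)
)

# Running counts within each creator, excluding the current campaign.
success <- as.integer(campaigns$state == "successful")
campaigns$prior_campaigns <- ave(success, campaigns$creator.id,
                                 FUN = function(x) seq_along(x) - 1L)
campaigns$prior_successes <- ave(success, campaigns$creator.id,
                                 FUN = function(x) cumsum(x) - x)
```

Subtracting the current row from the cumulative sum keeps the outcome being predicted out of its own feature.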

creator.name

As with the IDs, there is some redundancy here, even more than with the IDs, which might mean multiple paths to entrepreneurship. This might be a good column to create flags from, such as multiple campaigns and prior success.

Number of Unique Names: 151484

Proportion of Unique Names: 0.827971447

creator.is_registered

ALL TRUE, so there is nothing to be gleaned from this column and it can be removed.

Table of True & False Values:

Var1	Freq
True	182958

Proportion Unique: 0.00000546573531

creator.slug

Here there are many missing values (empty, not NA) and many duplicated slugs, as well as many original slugs. Because the values come in as factors, the empty cells are not NA, so we would need to recode them.

Number of Unique Slugs: 42011

Proportion of Unique Slugs: 0.229621006

Number of Duplicated Slugs: 140947

Proportion of Duplicated Slugs: 0.770378994
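The recoding mentioned above could look something like this sketch, which turns empty factor values into true NA and derives a has-slug flag. The vector is a toy example and `has_slug` is a hypothetical name.

```r
# Toy stand-in for creator.slug as it arrives: a factor with an empty level.
slug <- factor(c("alice", "", "bob", "", "alice"))

# Convert to character first, then replace empty strings with NA.
slug_chr <- as.character(slug)
slug_chr[slug_chr == ""] <- NA

# 1 when a slug is present, 0 when it was empty.
has_slug <- as.integer(!is.na(slug_chr))

# Count of rows that now register as missing.
n_missing <- sum(is.na(slug_chr))
```

If the file is read with `read.csv`, passing `na.strings = c("", "NA")` should achieve the same recoding at import time for every column at once.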

creator.urls.api.user

There are VERY FEW duplicates (0.3%) and almost all campaigns have an api.user, so I would consider killing this field, or at best coding it as 1 for has and 0 for does not.

Number of Unique API Users: 182389

Proportion of Unique API Users: 0.996889997

Proportion of Duplicated API Users: 0.00311000339

creator.urls.web.user

There are also few duplicates here, but more than for api.user, at about 13.6%. It may be a useful flag to include.

Number of Unique web.user : 158114

Proportion of Unique web.user : 0.864209272

Proportion of Duplicated web.user : 0.135790728

currency

	Count
AUD	3682
CAD	7048
CHF	349
DKK	604
EUR	8098
GBP	16240
HKD	333
JPY	95
MXN	1075
NOK	368
NZD	670
SEK	921
SGD	271
USD	143204

currency_symbol

The symbols do not survive the import into R (the non-ASCII ones show up as mis-encoded characters in the table below), but there is significant variation and we should very likely use this column. There are also no missing values, which is useful.

	Count
$	156283
â‚¬	8098
Â£	16240
Â¥	95
Fr	349
kr	1893

currency_trailing_code

I have no idea what this is, but it varies by a discernible amount, so it might be good to use this variable as a 0/1 flag and see how it combines with the other currency codes to differentiate success.

	Count
False	24782
True	158176

Proportions:

	Proportion
False	0.135451852
True	0.864548148

current_currency

This is good: there is a difference between the creator's chosen currency and these codes, and a split between US and Canadian dollars. It could be useful to leave in.

	Count
CAD	1069
USD	181889

Proportions:

	Proportion
CAD	0.00584287104
USD	0.99415712896

deadline (needs converting to datetime)

I think this could be a GREAT adjunct to the other dates. We could calculate the differences between the start and the proposed end, the actual start and the actual end, and the proposed and actual end, then create flags for under/over and see whether these date relationships affect confidence in a campaign and, through it, the outcome.

Number of Unique Deadline Dates: 171038
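A hedged sketch of the conversion and one date difference follows. It assumes the date columns are stored as Unix epoch seconds, which this report does not confirm, and it uses a hypothetical `launched_at` column for the start date; the 30-day cutoff is only an example.

```r
# Toy stand-in; epoch-second storage and launched_at are assumptions.
campaigns <- data.frame(
  launched_at = c(1500000000, 1510000000),
  deadline = c(1502592000, 1515184000)
)

# Convert epoch seconds to date-times, fixing the zone to UTC.
campaigns$launched_dt <- as.POSIXct(campaigns$launched_at,
                                    origin = "1970-01-01", tz = "UTC")
campaigns$deadline_dt <- as.POSIXct(campaigns$deadline,
                                    origin = "1970-01-01", tz = "UTC")

# Planned campaign length in days.
campaigns$planned_days <- as.numeric(
  difftime(campaigns$deadline_dt, campaigns$launched_dt, units = "days")
)

# Example under/over flag: 1 when the planned run exceeds 30 days.
campaigns$over_30_days <- as.integer(campaigns$planned_days > 30)
```

The other differences described above would follow the same pattern once the remaining date columns are converted.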

disable_communication

This is interesting, and there are no NA values. As it is binary, it seems prudent to test whether disabling communication with entrepreneurs affects the outcome of a campaign.

Frequency:

	Count
False	182345
True	613

Proportions:

	Proportion
False	0.99664950426
True	0.00335049574
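As a first look at whether this flag relates to the outcome, a cross-tabulation with row proportions may be enough. The sketch below uses a toy data frame and assumes an outcome column named `state`, which is a hypothetical name here.

```r
# Toy stand-in; state is an assumed outcome column.
my_data <- data.frame(
  disable_communication = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  state = c("successful", "failed", "successful", "failed", "failed"),
  stringsAsFactors = FALSE
)

# Counts of each outcome within each value of the flag.
counts <- table(my_data$disable_communication, my_data$state)

# Row proportions: outcome shares within FALSE and within TRUE.
row_rates <- prop.table(counts, margin = 1)
```

On the full data, `chisq.test(counts)` or `fisher.test(counts)` could then check whether the difference in rates is larger than chance would suggest, bearing in mind that only 613 rows have the flag set.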

friends

Proportion of NA Values: 1

Continuous Variables

fx_rate

Frequency of Rates

	Count
0.00888622	67
0.00889989	28
0.05296891	786
0.05327321	289
0.11145669	706
0.11209032	215
0.12167293	270
0.12259650	98
0.12756973	281
0.12759960	52
0.15413583	430
0.15496150	174
0.65518091	518
0.65849475	148
0.71320267	2734
0.71353141	933
0.72586745	201
0.72739916	70
0.76665011	5265
0.77189202	1736
0.85460225	4
0.93028444	15
1.00000000	142367
1.00531005	261
1.00852303	88
1.14999632	6057
1.15610946	2039
1.30437600	884
1.30994986	11502
1.31748313	4621
1.50002759	2
1.70866715	117

I am not sure what this variable is either. The spread was hard to capture in a histogram because the counts are so disproportionate, and the summary was useless, so I went with a table. It shows a wide variety of rates, with most campaigns having a rate of 1. I think this might be a good column to keep if it is not an outcome-related rate!

If it is not outcome related, we could try leaving it as a ratio or creating factor levels from it; either could work.
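If the factor route is taken, one possible coding is three levels around a rate of 1, as in this sketch. The level names and the tolerance are assumptions made for illustration, and the sample values are taken from the table above.

```r
# A few rates from the frequency table, used as a toy vector.
fx_rate <- c(0.00888622, 0.76665011, 1, 1.30994986)

# Treat values within a small tolerance of 1 as exactly 1.
is_one <- abs(fx_rate - 1) < 1e-8

# Three ordered levels: below one, one, above one.
fx_level <- factor(
  ifelse(is_one, "one", ifelse(fx_rate < 1, "below_one", "above_one")),
  levels = c("below_one", "one", "above_one")
)
```

Keeping the raw ratio alongside the factor would let a model use whichever form turns out to be more informative.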

goal

There is a HUGE disparity between the median and the mean, caused by right skew from a few VERY large goals. This is a difficult thing to negotiate. We could create a scaled amount, but that makes proper scaling at prediction time essential, and it is not always reproducible if we get values outside the current range. It might be worth creating an outlier flag for the super wicked huge goals in this set, allowing some of the slop/noise from this variable to spill over to the flag.

Summary

summary(my_data$goal)

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
## 0.00000e+00 1.50000e+03 5.00000e+03 4.79738e+04 1.40000e+04 1.00000e+08

The goals are so heavily skewed that NO single plot can convey this distribution.
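A minimal sketch of the outlier flag, paired with a log transform as one alternative to rescaling. The cutoff uses the common rule of the third quartile plus 1.5 times the interquartile range, which is an assumption rather than a choice made in this report; the toy vector reuses the summary values above, and `log_goal` and `goal_outlier` are hypothetical names.

```r
# Toy vector built from the summary statistics shown above.
goal <- c(0, 1500, 5000, 14000, 1e8)

# log1p handles the zero goal, since log1p(0) is 0.
log_goal <- log1p(goal)

# Assumed cutoff: third quartile plus 1.5 times the interquartile range.
quartiles <- unname(quantile(goal, c(0.25, 0.75)))
cutoff <- quartiles[2] + 1.5 * (quartiles[2] - quartiles[1])

# 1 for goals above the cutoff, 0 otherwise.
goal_outlier <- as.integer(goal > cutoff)
```

The cutoff should be computed once on the training data and stored, so that new campaigns are flagged against the same threshold.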

id

This column looked interesting because repeated IDs could show whether prior performance makes later campaigns more likely to succeed or fail. The counts below show no duplicates, however, so every row has its own id and no repeat flag can be built from this column alone. A prior successful campaign 0/1 flag and a prior unsuccessful campaign flag, or something like them, would have to come from creator.id instead, which means sorting by date and establishing the order of successes.

Duplicated IDs: 0

Proportion Unique: 1

Na Breakdown

The only column with true NA values was friends, and it was 100% NA, so it should be excluded. Some of the categorical columns have empty levels. We might need to account for them in our model by creating unknown levels.

Creator Variables

Bethany Poulin

December 16, 2018
