These are all URL links to the same image for three separate fields. There are no NA values, and most rows have a unique URL. I cannot imagine using the URLs themselves, but we might consider using a flag: unique / not unique.
Unique Values (levels): 154852
Is Null: 0
Proportion Unique: 0.8463800435
Unique Values (levels): 154852
Is Null: 0
Proportion Unique: 0.8463800435
Unique Values (levels): 154852
Is Null: 0
Proportion Unique: 0.8463800435
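The unique / not-unique flag suggested above could be built along these lines. This is only a minimal sketch: the data frame `urls` and the column name `image_url` are made up for illustration and are not taken from the data set.

```r
# Toy data; the column name `image_url` is hypothetical.
urls <- data.frame(
  image_url = c("a.jpg", "b.jpg", "a.jpg", "c.jpg"),
  stringsAsFactors = FALSE
)

# A value is "not unique" if it occurs more than once anywhere in the column.
# duplicated() alone skips the first occurrence, so scan from both ends.
is_repeated <- duplicated(urls$image_url) |
  duplicated(urls$image_url, fromLast = TRUE)

urls$url_unique <- as.integer(!is_repeated)  # 1 = appears once, 0 = repeated

# Same kind of statistic as reported above: distinct values / number of rows.
prop_unique <- length(unique(urls$image_url)) / nrow(urls)
```

Scanning from both ends marks every copy of a repeated URL, not just the second and later ones, which is probably what a model would want from this flag.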
There are thirteen chosen currencies with different levels of adoption, and 182k+ rows with no default value (empty, but not NA). This might be an interesting categorical variable if we create a level of none to replace the empty values.
It might also be worth a flag comparing it to the actual currency used: same or different.
Var1 | Freq |
---|---|
(blank) | 182443 |
AUD | 9 |
CAD | 31 |
CHF | 3 |
DKK | 1 |
EUR | 95 |
GBP | 38 |
HKD | 5 |
JPY | 13 |
MXN | 17 |
NZD | 4 |
SEK | 9 |
SGD | 1 |
USD | 289 |
0
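As a rough sketch of both ideas (a none level for the empty values, and a flag against the actual currency), something like the following might work. The column names `chosen_currency` and `currency` are assumptions for illustration, and the toy rows are invented.

```r
# Toy data; `chosen_currency` and `currency` are hypothetical column names.
df <- data.frame(
  chosen_currency = factor(c("", "USD", "", "EUR", "CAD")),
  currency        = factor(c("USD", "USD", "GBP", "USD", "CAD"))
)

# Rename the empty-string level to an explicit "none" level.
levels(df$chosen_currency)[levels(df$chosen_currency) == ""] <- "none"

# Compare as character so that differing factor level sets do not matter.
chosen <- as.character(df$chosen_currency)
actual <- as.character(df$currency)

# Three-way flag: no default chosen, chosen matches actual, chosen differs.
df$currency_match <- factor(
  ifelse(chosen == "none", "none",
         ifelse(chosen == actual, "same", "different")),
  levels = c("none", "same", "different")
)
```

A three-level flag is used here rather than a plain same/different because nearly all rows have no chosen currency, and lumping those in with "different" would hide that.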
This is an interesting variable. Some IDs repeat, which may mean the creators are serial entrepreneurs who are good at this, or that they are retrying a failed campaign. It might be good to have one column flagging a repeat ID and one column flagging a previous success.
Duplicated Id’s: 24844
Proportion Unique: 0.8642092721
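One possible way to derive the repeat-ID and previous-success columns is sketched below. It assumes a creator ID, a launch date and an outcome column; the names `creator_id`, `launched` and `state`, and the outcome labels, are hypothetical stand-ins, and the rows are invented.

```r
# Toy data; `creator_id`, `launched` and `state` are hypothetical names.
df <- data.frame(
  creator_id = c(1, 2, 1, 1, 3, 2),
  launched   = as.Date(c("2015-01-10", "2015-02-01", "2015-06-15",
                         "2016-03-01", "2016-04-20", "2016-05-05")),
  state      = c("failed", "successful", "successful",
                 "failed", "successful", "successful"),
  stringsAsFactors = FALSE
)

# Sort so that each creator's campaigns are in launch order.
df <- df[order(df$creator_id, df$launched), ]

# Running count of earlier outcomes: cumulative sum minus the current row.
prior_count <- function(hit) cumsum(hit) - hit
df$prior_success <- ave(as.integer(df$state == "successful"),
                        df$creator_id, FUN = prior_count)
df$prior_failure <- ave(as.integer(df$state == "failed"),
                        df$creator_id, FUN = prior_count)

# Simple 0/1 flags derived from the counts.
df$repeat_creator    <- as.integer(df$prior_success + df$prior_failure > 0)
df$has_prior_success <- as.integer(df$prior_success > 0)
```

Subtracting the current row from the cumulative sum keeps a campaign's own outcome out of its features, so the flags only describe what was known before launch.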
As with the IDs, there is some redundancy here, even more than with the IDs, which might mean multiple paths to entrepreneurship. This might be a good column to create flags from, such as multiple kicks and prior success.
Number of Unique Names: 151484
Proportion of Unique Names: 0.827971447
ALL TRUE so there is nothing to be gleaned from this column, and it can be removed
Table of True & False Values:
Var1 | Freq |
---|---|
True | 182958 |
Proportion: 0.00000546573531
Here there are many missing values (not NA, just empty) and many duplicated slugs, as well as many original slugs. Because the values come in as factors, the empty cells are not NA, so we would need to recode them.
Number of Unique Slugs: 42011
Proportion of Unique Slugs: 0.229621006
Number of Duplicated Slugs: 140947
Proportion of Duplicated Slugs: 0.770378994
There are VERY FEW duplicates (0.3%), and almost all campaigns have an api.user, so I would consider killing this field, or at best giving a 1 for has and a 0 for does not.
Number of Unique API Users: 182389
Proportion of Unique API Users: 0.996889997
Proportion of Duplicated API Users: 0.00311000339
There are also few duplicates here, but more than for api.user, at about 13%. It may be a useful flag to include.
Number of Unique web.user: 158114
Proportion of Unique web.user: 0.864209272
Proportion of Duplicated web.user: 0.135790728
Currency | Count |
---|---|
AUD | 3682 |
CAD | 7048 |
CHF | 349 |
DKK | 604 |
EUR | 8098 |
GBP | 16240 |
HKD | 333 |
JPY | 95 |
MXN | 1075 |
NOK | 368 |
NZD | 670 |
SEK | 921 |
SGD | 271 |
USD | 143204 |
The symbols do not all translate cleanly into the R universe, but there is significant variation and we should very likely use these. There are also no missing values, which is useful.
Symbol | Count |
---|---|
$ | 156283 |
€ | 8098 |
£ | 16240 |
¥ | 95 |
Fr | 349 |
kr | 1893 |
I have no idea what this is, but it varies by a discernible amount, so it might be good to use this variable as a 0/1 flag and see how it combines with the currency codes to differentiate success.
Value | Count |
---|---|
False | 24782 |
True | 158176 |
Proportions:
Value | Proportion |
---|---|
False | 0.135451852 |
True | 0.864548148 |
This is good: there is a difference between the creator's desired code and these codes, and between US and Canadian. It could be useful to leave in.
Currency | Count |
---|---|
CAD | 1069 |
USD | 181889 |
Proportions:
Currency | Proportion |
---|---|
CAD | 0.00584287104 |
USD | 0.99415712896 |
I think this could be a GREAT adjunct to the other dates. We could calculate the differences between the start and the proposed end, between the actual start and the actual end, and between the proposed and the actual end, then create flags for under/over and see whether these date relationships affect confidence in a campaign and, through it, the outcome.
Number of Unique Deadline Dates: 171038
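The date differences and under/over flags described above might look roughly like this. It is a sketch under assumptions: the column names `launched`, `deadline` and `actual_end` are hypothetical, the dates are invented, and the real columns may need converting to `Date` first.

```r
# Toy data; all column names are hypothetical stand-ins.
df <- data.frame(
  launched   = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  deadline   = as.Date(c("2016-01-31", "2016-03-02", "2016-03-31")),
  actual_end = as.Date(c("2016-01-31", "2016-02-20", "2016-04-05"))
)

# Planned and actual campaign length in days.
df$planned_days <- as.numeric(difftime(df$deadline, df$launched,
                                       units = "days"))
df$actual_days  <- as.numeric(difftime(df$actual_end, df$launched,
                                       units = "days"))

# Positive = ended after the deadline, negative = ended early.
df$end_gap_days <- as.numeric(difftime(df$actual_end, df$deadline,
                                       units = "days"))

# Under / on-time / over flag taken from the sign of the gap.
df$end_flag <- factor(sign(df$end_gap_days),
                      levels = c(-1, 0, 1),
                      labels = c("under", "on_time", "over"))
```

Keeping both the numeric gap and the flag leaves the choice between them to the modelling step.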
This is interesting, and there are no NA values. Since it is a binary, it seems prudent to test whether or not disabling communication with entrepreneurs affects the outcome of a campaign.
Frequency:
Value | Count |
---|---|
False | 182345 |
True | 613 |
Proportions:
Value | Proportion |
---|---|
False | 0.99664950426 |
True | 0.00335049574 |
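A test of whether the flag is associated with the outcome could be sketched as below. The counts are invented purely for illustration, the column names `disable_communication` and `successful` are assumptions, and Fisher's exact test is just one reasonable choice; with 613 True rows in the real data, `chisq.test` would likely serve as well.

```r
# Toy data, invented for illustration only; column names are hypothetical.
df <- data.frame(
  disable_communication = c(rep(FALSE, 12), rep(TRUE, 6)),
  successful = c(rep(TRUE, 8), rep(FALSE, 4),   # communication enabled
                 rep(TRUE, 1), rep(FALSE, 5))   # communication disabled
)

# 2x2 contingency table of the flag against the outcome.
tab <- table(disabled   = df$disable_communication,
             successful = df$successful)

# Fisher's exact test does not rely on large cell counts.
result <- fisher.test(tab)

# Success rate within each group, for a quick read of the direction.
rates <- prop.table(tab, margin = 1)[, "TRUE"]
```

`result$p.value` gives the test result and `rates` shows which group succeeds more often. An association found this way would not by itself show that disabling communication causes the outcome.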
Proportion of NA Values: 1
Frequency of Rates
Rate | Count |
---|---|
0.00888622 | 67 |
0.00889989 | 28 |
0.05296891 | 786 |
0.05327321 | 289 |
0.11145669 | 706 |
0.11209032 | 215 |
0.12167293 | 270 |
0.12259650 | 98 |
0.12756973 | 281 |
0.12759960 | 52 |
0.15413583 | 430 |
0.15496150 | 174 |
0.65518091 | 518 |
0.65849475 | 148 |
0.71320267 | 2734 |
0.71353141 | 933 |
0.72586745 | 201 |
0.72739916 | 70 |
0.76665011 | 5265 |
0.77189202 | 1736 |
0.85460225 | 4 |
0.93028444 | 15 |
1.00000000 | 142367 |
1.00531005 | 261 |
1.00852303 | 88 |
1.14999632 | 6057 |
1.15610946 | 2039 |
1.30437600 | 884 |
1.30994986 | 11502 |
1.31748313 | 4621 |
1.50002759 | 2 |
1.70866715 | 117 |
I am not sure what this variable is either. The spread was hard to capture in a histogram because of the disproportion, and a summary was useless, so I went with a table, which shows there is a wide variety of rates, with most campaigns having a rate of 1. I think this might be a good column to keep if it is not an outcome-related rate!
If it is not outcome related, we could try leaving it as a ratio or creating levels of factor, either could work.
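If the factor route is taken, one simple grouping is around the dominant value of 1. The sketch below uses a few rates copied from the table above; the column name `rate` and the three level names are assumptions, and other cut points could work just as well.

```r
# A few rates taken from the frequency table; `rate` is a hypothetical name.
df <- data.frame(rate = c(0.00888622, 0.71320267, 1.00000000,
                          1.14999632, 1.30994986))

# Option 1: keep the numeric ratio as is (nothing to do).

# Option 2: collapse into three levels around the dominant value 1.
df$rate_level <- factor(
  ifelse(df$rate < 1, "below_1",
         ifelse(df$rate > 1, "above_1", "equal_1")),
  levels = c("below_1", "equal_1", "above_1")
)
```

On the real column an exact comparison with 1 may need a small tolerance if the rates are stored with floating-point noise.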
There is a HUGE disparity between the median and the mean, caused by right skew from a few VERY large goals. This is a difficult thing to negotiate. We could create a scaled amount, but that makes proper scaling of predictions essential, and it is not always reproducible if we get values outside the current range. It might be worth creating an outlier flag for the super wicked huge goals in this set and allowing some of the slop/noise from this variable to spill into the flag.
Summary
```r
summary(my_data$goal)
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
## 0.00000e+00 1.50000e+03 5.00000e+03 4.79738e+04 1.40000e+04 1.00000e+08
```
The goal is so heavily skewed that NO single plot can convey this distribution.
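An outlier flag for the goal could be sketched as follows. The upper fence here is a Tukey-style rule built from the quartiles in the summary above, which is one convention among several. The `log1p` transform is an extra suggestion, not something proposed in the notes; it is a fixed function, so it does not depend on the range of the training data.

```r
# Toy goals spanning the range in the summary above (0 to 1e8).
goal <- c(0, 1500, 5000, 14000, 50000, 1e8)

# Tukey-style upper fence computed from the quartiles reported above.
q1 <- 1500
q3 <- 14000
upper_fence <- q3 + 1.5 * (q3 - q1)   # 14000 + 1.5 * 12500

# 0/1 flag for very large goals.
goal_outlier <- as.integer(goal > upper_fence)

# log1p compresses the right tail and is defined at goal = 0.
goal_log <- log1p(goal)
```

The fence would have to be stored and reused when scoring new campaigns, so that the flag means the same thing outside the current data.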
This is a very interesting column in that most but not all IDs are unique. It may be that repeated campaigns are more likely to succeed or fail based on prior performance. We should look into this with a 0/1 flag for a prior successful campaign and another for a prior unsuccessful campaign, or something like it. This means sorting by date and establishing the order of successes.
Duplicated Id’s: 0
Proportion Unique: 1
The only column with true NA values was friends, and it was 100% NA, so it should be excluded. Some of the categorical columns have empty levels. We might need to account for them in our model by creating unknown levels.
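Both clean-up steps can be applied across a whole data frame at once. The following is a minimal sketch on invented data; apart from `friends`, the column names are hypothetical, and "unknown" is just one possible label for the empty level.

```r
# Toy data; column names other than `friends` are hypothetical.
df <- data.frame(
  friends  = c(NA, NA, NA),
  slug     = factor(c("art", "", "music")),
  currency = factor(c("USD", "CAD", "USD"))
)

# Drop columns that are entirely NA.
all_na <- vapply(df, function(col) all(is.na(col)), logical(1))
df <- df[, !all_na, drop = FALSE]

# For factor columns, rename the empty-string level to "unknown".
for (name in names(df)) {
  if (is.factor(df[[name]]) && "" %in% levels(df[[name]])) {
    levels(df[[name]])[levels(df[[name]]) == ""] <- "unknown"
  }
}
```

Factor columns without an empty level, such as `currency` here, are left untouched.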