1 Introduction

킥스타터(영어: Kickstarter)는 2009년 시작된 미국의 크라우드 펀딩 서비스이다. 영화, 음악, 공연예술, 만화, 비디오게임 등 다양한 분야 프로젝트의 투자를 유치했다. 프로젝트에 기부하여 일정금액이 넘으면 돈을 제공하고, 목표액을 넘지 못하면 투자를 하지 않아도 된다. 투자자는 돈이 아닌 해당 시제품, 감사인사, 티셔츠, 작가와의 식사 등 다른 유무형 형태의 보상을 받는다.

더 자세한 정보를 원한다면 https://www.kickstarter.com/

1.1 Aim

1) Data Analysis

- 킥스타터 고객들의 특성을 분석
- 킥스타터에서 성공/실패한 프로젝트의 특성을 분석

2) Data Analytics

- Decision Tree를 활용하여 성공/실패한 프로젝트의 유형을 살펴보고 프로젝트의 성공을 위한 제안 제시

2 Collect the data

2.1 Import libraries & Load data

- Load Libraries

- Load & Overview data

ksdata <- read.csv("./input/ks-projects-201801.csv", fileEncoding="latin1")
fillColor = "#FFA07A"
fillColor2 = "#F1C40F"

str(ksdata)

## 'data.frame':    378661 obs. of  15 variables:
##  $ ID              : int  1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
##  $ name            : Factor w/ 375765 levels "","    ITâ\u0080\u0099S A HOT CAPPUCCINO NIGHT  ",..: 332538 135688 364964 344805 77347 206129 293461 69358 284136 290718 ...
##  $ category        : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 56 124 59 42 114 40 ...
##  $ main_category   : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 7 8 8 8 5 7 ...
##  $ currency        : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 14 14 14 ...
##  $ deadline        : Factor w/ 3164 levels "2009-05-03","2009-05-16",..: 2288 3042 1333 1017 2247 2463 1996 2448 1790 1863 ...
##  $ goal            : num  1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
##  $ launched        : Factor w/ 378089 levels "1970-01-01 01:00:00",..: 243292 361975 80409 46557 235943 278600 187500 274014 139367 153766 ...
##  $ pledged         : num  0 2421 220 1 1283 ...
##  $ state           : Factor w/ 6 levels "canceled","failed",..: 2 2 2 2 1 4 4 2 1 1 ...
##  $ backers         : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country         : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 23 23 23 ...
##  $ usd.pledged     : num  0 100 220 1 1283 ...
##  $ usd_pledged_real: num  0 2421 220 1 1283 ...
##  $ usd_goal_real   : num  1534 30000 45000 5000 19500 ...

kable(head(ksdata, 5))

ID	name	category	main_category	currency	deadline	goal	launched	pledged	state	backers	country	usd.pledged	usd_pledged_real	usd_goal_real
1000002330	The Songs of Adelaide & Abullah	Poetry	Publishing	GBP	2015-10-09	1000	2015-08-11 12:12:28	0	failed	0	GB	0	0	1533.95
1000003930	Greeting From Earth: ZGAC Arts Capsule For ET	Narrative Film	Film & Video	USD	2017-11-01	30000	2017-09-02 04:43:57	2421	failed	15	US	100	2421	30000.00
1000004038	Where is Hank?	Narrative Film	Film & Video	USD	2013-02-26	45000	2013-01-12 00:20:50	220	failed	3	US	220	220	45000.00
1000007540	ToshiCapital Rekordz Needs Help to Complete Album	Music	Music	USD	2012-04-16	5000	2012-03-17 03:24:11	1	failed	1	US	1	1	5000.00
1000011046	Community Film Project: The Art of Neighborhood Filmmaking	Film & Video	Film & Video	USD	2015-08-29	19500	2015-07-04 08:35:03	1283	canceled	14	US	1283	1283	19500.00

#kable(tail(ksdata, 5))

2.2 Missing values

- Checking for any missing values

2.2.1 Visualization

options(repr.plot.width=6, repr.plot.height=8)
missing_data <- ksdata %>% summarise_all(funs(sum(is.na(.))/n()))
missing_data <- gather(missing_data, key = "variables", value = "percent_missing") 
ggplot(missing_data, aes(x = reorder(variables, percent_missing), y = percent_missing)) +
  geom_bar(stat = "identity", fill = "red", aes(color = I('white')), size = 0.1)+coord_flip()+ theme_few()

2.2.2 Table

sapply(ksdata, function(x) sum(is.na(x)))

##               ID             name         category    main_category 
##                0                0                0                0 
##         currency         deadline             goal         launched 
##                0                0                0                0 
##          pledged            state          backers          country 
##                0                0                0                0 
##      usd.pledged usd_pledged_real    usd_goal_real 
##             3797                0                0

2.2.3 Summary

’usd.pledged’에 대한 N/A 값만 존재하는 것으로 보인다. ’usd_pledged_real’을 사용할 것이므로 이 열을 제거하고 ’usd_pledged_real’의 이름을 ’usd_pledged’로 바꾼다. 마찬가지로, ’usd_goal_real’을 사용하여 동일한 작업을 수행하고 ’usd_goal’이라는 이름을 지정한다.

2.2.4 Preprocessing

ksdata <- ksdata[,-13]
colnames(ksdata)[13] <- "usd_pledged"
colnames(ksdata)[14] <- "usd_goal"

3 Exploratory Data Analysis

3.1 가장 인기있는 프로젝트 유형은 무엇입니까?

3.1.1 인기있는 프로젝트 유형

이 질문은 프로젝트 스타터의 관점에서 카테고리 (데이터 집합의 main_category라고 함)와 하위 카테고리 (데이터 집합의 카테고리라고 함)의 두 가지 수준을 기준으로 답한다. 먼저 카테고리별로 프로젝트 수를 조사해보겠다.

cat.freq <- ksdata %>%
  group_by(main_category) %>%
  summarize(count=n()) %>%
  arrange(desc(count))

cat.freq$main_category <- factor(cat.freq$main_category, levels=cat.freq$main_category)

ggplot(cat.freq, aes(main_category, count, fill=count)) + geom_bar(stat="identity") + 
  ggtitle("Projects by Category") + xlab("Project Category") + ylab("Frequency") + 
  geom_text(aes(label=count), vjust=-0.5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="null") + 
  scale_fill_gradient(low="skyblue1", high="royalblue4")

3.1.1.1 Summary

’Film & Video’는 가장 인기있는 프로젝트 범주에 속하며 가장 인기가 적은 카테고리는 ’Dance’이다. 이제 하위 범주에 대해 동일한 작업을 수행할 것이다. 개별적으로 플롯하기에는 너무 많은(159개의) 하위 카테고리가 있으므로 가장 많은 수의 프로젝트로 10 개의 하위 카테고리를 볼 것이다.

3.1.2 인기있는 하위 카테고리 프로젝트 유형

subcat.freq <- ksdata %>%
  group_by(category) %>%
  summarize(count=n()) %>%
  arrange(desc(count))

subcat.freq$category <- factor(subcat.freq$category, levels=subcat.freq$category)

ggplot(head(subcat.freq, 10), aes(category, count, fill=count)) + geom_bar(stat="identity") + 
  ggtitle("Projects by Subcategory") + xlab("Project Subcategory") + ylab("Frequency") + 
  geom_text(aes(label=count), vjust=-0.5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="null") + 
  scale_fill_gradient(low="skyblue1", high="royalblue4")

3.1.2.1 Summary

‘Product Design’은 여기에서 가장 인기있는 하위 범주이며 ’Design’ 범주에 속한다.

3.1.3 카테고리별 추이

# Converting the launch and deadline Dates to correct format:
ksdata$launch_date <- as.Date(ksdata$launched, "%Y-%m-%d")
ksdata$deadline_date <- as.Date(ksdata$deadline, "%Y-%m-%d")

# also addding month and year columns for the deadline and launch dates:
ksdata$launch_year <- substr(ksdata$launched, 1,4)
ksdata$launch_mth <- substr(ksdata$launch_date, 1,7)

ksdata$final_year <- substr(ksdata$deadline,1,4)
ksdata$final_mth <- substr(ksdata$deadline, 1, 7)


ggplot(data=ksdata, aes(x=launch_year)) +
  geom_bar(colour="black", fill=fillColor2) +
  ylab('Count') +
  facet_wrap(~main_category) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme_minimal()

3.2 어떤 종류의 프로젝트가 펀딩을 하고 있습니까?

3.2.1 모금이 가장 많이 된 프로젝트

위의 질문과 비슷하게 느껴질 수 있지만, 후원자의 관점에서 가장 인기 있는 프로젝트를 금액 기준으로 정렬해보았다(상위 15개).

kable(head(ksdata[order(-ksdata$usd_pledged), c(2,3,13)], 15))

	name	category	usd_pledged
157271	Pebble Time - Awesome Smartwatch, No Compromises	Product Design	20338986
250255	COOLEST COOLER: 21st Century Cooler that’s Actually Cooler	Product Design	13285226
216630	Pebble 2, Time 2 + All-New Pebble Core	Product Design	12779843
289916	Kingdom Death: Monster 1.5	Tabletop Games	12393140
282417	Pebble: E-Paper Watch for iPhone and Android	Product Design	10266846
293862	The World’s Best TRAVEL JACKET with 15 Features \|\| BAUBAX	Product Design	9192056
187653	Exploding Kittens	Tabletop Games	8782572
6666	OUYA: A New Kind of Video Game Console	Gaming Hardware	8596475
309631	THE 7th CONTINENT â What Goes Up, Must Come Down.	Tabletop Games	7072757
271277	The Everyday Backpack, Tote, and Sling	Product Design	6565782
75901	Fidget Cube: A Vinyl Desk Toy	Product Design	6465690
368574	Shenmue 3	Video Games	6333296
30042	Pono Music - Where Your Soul Rediscovers Music	Sound	6225355
89482	Bring Back MYSTERY SCIENCE THEATER 3000	Television	5764229
148586	The Veronica Mars Movie Project	Narrative Film	5702153

3.2.1.1 Summary

여기에 있는 많은 프로젝트가 Product Design 하위 범주에 속한다. 마찬가지로, 가장 많은 후원자가 있는 프로젝트의 상위 15 개 프로젝트를 나열해 보겠다.

kable(head(ksdata[order(-ksdata$backers), c(2,3,11)], 15))

	name	category	backers
187653	Exploding Kittens	Tabletop Games	219382
75901	Fidget Cube: A Vinyl Desk Toy	Product Design	154926
292245	Bring Reading Rainbow Back for Every Child, Everywhere!	Web	105857
148586	The Veronica Mars Movie Project	Narrative Film	91585
182658	Double Fine Adventure	Video Games	87142
23405	Bears vs Babies - A Card Game	Tabletop Games	85581
157271	Pebble Time - Awesome Smartwatch, No Compromises	Product Design	78471
239176	Torment: Tides of Numenera	Video Games	74405
272925	Project Eternity	Video Games	73986
38292	Yooka-Laylee - A 3D Platformer Rare-vival!	Video Games	73206
215085	ZNAPS -The $9 Magnetic Adapter for your mobile devices	Technology	70122
368574	Shenmue 3	Video Games	69320
282417	Pebble: E-Paper Watch for iPhone and Android	Product Design	68929
293644	Mighty No. 9	Video Games	67226
216630	Pebble 2, Time 2 + All-New Pebble Core	Product Design	66673

3.2.1.2 Summary

가장 일반적인 하위 카테고리는 Video Games인 것으로 보인다. 이제는 각 카테고리에 대해 기부된 금액을 집계하여 총 금액을 확인할 것이다.

3.2.2 카테고리별 모금 총액

pledged.tot <- ksdata %>%
  group_by(main_category) %>%
  summarize(total=sum(usd_pledged)) %>%
  arrange(desc(total))

pledged.tot$main_category <- factor(pledged.tot$main_category, levels=pledged.tot$main_category)

ggplot(pledged.tot, aes(main_category, total/1000000, fill=total)) + geom_bar(stat="identity") + 
  ggtitle("Total Amount Pledged by Category") + xlab("Project Category") + 
  ylab("Amount Pledged (USD millions)") + 
  geom_text(aes(label=paste0("$", round(total/1000000,1))), vjust=-0.5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="null") + 
  scale_fill_gradient(low="skyblue1", high="royalblue4")

3.2.2.1 Summary

‘Games’, ‘Design’, ’Technology’는 지금까지 가장 높은 수익을 올리는 범주이다. 후원자 수(객단가)를 고려하는 것이 중요하므로 각 카테고리에 대해 후원자 당 보장되는 평균 금액을 봐야한다. 각 카테고리에 대해 기부 된 총액을 각 카테고리에 대한 후원자 수로 나누어 계산한다.

3.2.3 1인당 평균 후원금액

pledged.avg <- ksdata %>%
  group_by(main_category) %>%
  summarize(pledged=sum(usd_pledged), backers=sum(backers)) %>%
  mutate(avg=pledged/backers) %>%
  arrange(desc(avg))

pledged.avg$main_category <- factor(pledged.avg$main_category, levels=pledged.avg$main_category)

ggplot(pledged.avg, aes(main_category, avg, fill=avg)) + geom_bar(stat="identity") + 
  ggtitle("Average Amount Pledged per Backer") + xlab("Project Category") + 
  ylab("Amount Pledged (USD)") + 
  geom_text(aes(label=paste0("$", round(avg,2))), vjust=-0.5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="null") + 
  scale_fill_gradient(low="skyblue1", high="royalblue4")

3.2.3.1 Summary

‘Technology’는 1인당 평균 후원금액이 가장 높은 반면, ’Comics’는 가장 적은 금액을 보였다. 흥미로운 점은 ’Games’의 경우 (이전 그래프에서 볼 수 있듯) 총액이 많았지만 ’Technology의’ 평균과 2배 정도 차이가 난다.

이어서 박스 플롯을 사용하여 개별 프로젝트에 대해 약속된 금액의 분배를 검토할 것이다. 이상치가 많은(자금 조달이 거의 없거나 엄청나게 많은) 프로젝트들로 인해 박스 플롯이 맨 아래에 삐걱 거리는 것처럼 보일 것이다.

3.2.4 Box plot으로 데이터의 분포 확인하기

ggplot(ksdata, aes(main_category, usd_pledged, fill=main_category)) + geom_boxplot() + 
  ggtitle("Amount Pledged vs. Project Category") + xlab("Project Category") + 
  ylab("Amount Pledged (USD)") + 
  theme(plot.title=element_text(size=15, face="bold", hjust=0.5), 
        axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="null") + 
  coord_cartesian(ylim=c(0,20000))

3.2.4.1 Summary

‘Design’과 ’Games’는 다른 카테고리에 비해 상위 4분위수를 가지고 있다. ’Design’, ‘Dance’, ’Theater’는 다른 카테고리에 비해 평균 금액이 높다. ’Comics’는 놀갑게도 상위 4분위수와 중간값을 가지고 있는데, 비록 다른 카테고리와 비교해 볼 때 금액이 낮고(총계와 중앙값 모두), 1인당 후원금액 또한 낮았으므로, 이 카테고리는 개별 프로젝트 마다 평균 후원자의 수가 상대적으로 많아야 한다는 것을 알 수 있다. 한편 ’Technology’는 (총계와 평균치 모두) 많았음에도 불구하고 중앙값이 낮다. 이를 통해 이상치가 많이 있음을 시사한다. ’Crafts’와 ’Journalism’은 다른 범주에 비해 매우 작은 IQR과 낮은 중간 값을 가지고 있다.

3.3 얼마나 많은 자금이 필요합니까?

3.3.1 목표 금액이 가장 높은 프로젝트

자금 목표가 가장 높은 상위 15개 프로젝트를 살펴보겠다.

kable(head(ksdata[order(-ksdata$usd_goal), c(2,3,14,10)], 15))

	name	category	usd_goal	state
47804	FUCK Potato Salad. Paleo Potato Brownies!	Food	166361391	failed
196532	A Celtic Lovestory	Drama	151395870	failed
367929	Hydroponic’s Skyscraper(un gratte-ciel hydroponique)e-solar	Technology	110169772	failed
222209	DER NEANDERTALER (Kinofilm mit Starbesetzung in 3D)	Movie Theaters	107369868	failed
226162	Kitchen from Austria in Switzerland	Restaurants	104057190	failed
23470	Help me start a fair media company, Trump Media	Video	100000000	failed
33085	Sage: Warrior Of The Fallen Gods (Movie)	Fantasy	100000000	failed
55010	Kybernan Holographic Gaming Network	Video Games	100000000	failed
72776	Art Is Fabulous	Art	100000000	failed
72991	Wax Apple	Documentary	100000000	failed
77677	Bring back Pontiac	Technology	100000000	failed
145944	Let’s prove the earth is FLAT! (Suspended) (Suspended)	Video	100000000	suspended
158460	The Return Of The Bell Witch Movie	Shorts	100000000	failed
160292	The Book on Comedy 10 GABAGILLTRILLBILLIOPLEXIAN!!!!!!!!!!!!	Comics	100000000	failed
211027	Virtual Reality Playground in Buffalo, NY	Live Games	100000000	failed

3.3.1.1 Summary

일시 중단된 프로젝트 하나를 제외한 다른 모든 프로젝트의 목표가 너무 높게 설정되어 아이디어에 비해 비합리적인 것으로 보인다. 그러면 성공적으로 자금을 지원받은 프로젝트 중 상위 15개를 살펴보겠다.

3.3.2 성공한 프로젝트 중 목표 금액이 가장 높은 프로젝트

goal.tops <- ksdata[ksdata$state=="successful",]

kable(head(goal.tops[order(-goal.tops$usd_goal), c(2,3,14,10)], 15))

	name	category	usd_goal	state
355331	Elite: Dangerous	Video Games	2015609	successful
89482	Bring Back MYSTERY SCIENCE THEATER 3000	Television	2000000	successful
134463	Camelot Unchained	Video Games	2000000	successful
148586	The Veronica Mars Movie Project	Narrative Film	2000000	successful
171323	WISH I WAS HERE	Narrative Film	2000000	successful
368574	Shenmue 3	Video Games	2000000	successful
324292	Blue Mountain State: The Movie	Narrative Film	1500000	successful
111637	The Newest Hottest Spike Lee Joint	Narrative Film	1250000	successful
337373	The Bards Tale IV	Video Games	1250000	successful
141639	Obduction	Video Games	1100000	successful
272925	Project Eternity	Video Games	1100000	successful
30719	Shroud of the Avatar: Forsaken Virtues	Video Games	1000000	successful
37515	Pathfinder Online: A Fantasy Sandbox MMO	Video Games	1000000	successful
189800	ARKYD: A Space Telescope for Everyone	Space Exploration	1000000	successful
216630	Pebble 2, Time 2 + All-New Pebble Core	Product Design	1000000	successful

3.3.2.1 Summary

여기에 나열된 일부 프로젝트는 최고 모금 프로젝트 목록(상위 15개)에도 있었다. 여기에 나열된 가장 일반적인 하위 카테고리는 ’Video Games’과 ’Narrative Film’이다. 각 카테고리의 평균 프로젝트 목표 금액을 살펴보겠다. 아마도 나중에 어떤 프로젝트 유형이 성공 또는 실패했는지에 대한 통찰력을 줄 것이다.

3.3.3 각 카테고리의 평균 프로젝트 목표 금액

goal.avg <- ksdata %>%
  group_by(main_category) %>%
  summarize(goal=sum(usd_goal), projects=n()) %>%
  mutate(avg=goal/projects) %>%
  arrange(desc(avg))

goal.avg$main_category <- factor(goal.avg$main_category, levels=goal.avg$main_category)

ggplot(goal.avg, aes(main_category, avg, fill=avg)) + geom_bar(stat="identity") + 
  ggtitle("Average Project Goal") + xlab("Project Category") + ylab("Project Goal (USD)") + 
  geom_text(aes(label=paste0("$", round(avg,0))), vjust=-0.5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="null") + 
  scale_fill_gradient(low="skyblue1", high="royalblue4")

3.3.3.1 Summary

‘Technology’, ‘Journalism’, ‘Film & Video’는 평균 목표금액이 가장 높다. 반대로 ’Dance’, ‘Crafts’, ’Photography’는 하위 그룹에 속한다. 여기에서 박스 플롯을 사용하여 개별 프로젝트에 대한 프로젝트 목표 금액의 분포를 확인할 것이다.

3.3.4 프로젝트 카테고리별 목표 금액

ggplot(ksdata, aes(main_category, usd_goal, fill=main_category)) + geom_boxplot() + 
  ggtitle("Project Goal vs. Project Category") + xlab("Project Category") +
  ylab("Project Goal (USD)") + 
  theme(plot.title=element_text(size=15, face="bold", hjust=0.5), 
        axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="null") + 
  coord_cartesian(ylim=c(0,60000))

3.3.4.1 Summary

‘Technology’는 엄청나게 높은 4 분위수 및 중간 값을 가지고 있다. ’Technology’에 비해 높지는 않지만, ’Design’과 ’Food’ 또한 상대적으로 높은 4 분위수와 중앙값을 가지고 있다. 이 두 카테고리의 평균 프로젝트 목표는 ‘Journalism’ 및 ’Film & Video’보다 낮았지만 중간 및 상위 분위 값이 높았으므로 목표 금액이 적은 많은 프로젝트가 있어야하며 후자는 많은 (높은 금액의)이상치를 가지거나, 둘다이여야 한다.

서로 다른 종류의 프로젝트에 대해 모금액과 목표 금액을 살펴본 결과 각각의 분포가 서로 어떻게 비교되는지 확인해보겠다. 자금 조달이 거의 없는 프로젝트가 많기 때문에(즉, 오른쪽으로 꼬리가 긴 분포를 가진 그래프) 위해 두 변수에 로그 변환을 사용하여 분포를 시각화할 것이다.

3.3.5 목표 금액과 모금액의 분포

usd.amounts <- gather(ksdata, type, amount, usd_pledged, usd_goal, factor_key=T)

ggplot(usd.amounts, aes(amount, fill=type)) + 
  geom_histogram(alpha=0.5, position="identity") + 
  ggtitle("Distribution of USD Pledged vs. USD Goal") + xlab("USD") + 
  ylab("Frequency") + scale_fill_discrete("Type", labels=c("USD Pledged", "USD Goal"))

usd.amounts <- gather(ksdata, type, amount, usd_pledged, usd_goal, factor_key=T)

ggplot(usd.amounts, aes(log(amount+1), fill=type)) + 
  geom_histogram(alpha=0.5, position="identity") + 
  ggtitle("Distribution of log(USD Pledged) vs. log(USD Goal)") + xlab("log(USD + 1)") + 
  ylab("Frequency") + scale_fill_discrete("Type", labels=c("USD Pledged", "USD Goal"))

3.3.5.1 Summary

목표금액 정규분포의 형태를 보이고 있다. 반면에 모금액은 다봉분포(bimodal distribution, 서로 다른 두 개의 최빈값을 갖는 연속확률분포) 형태를 보인다. 모금액은 목표 금액 분포보다 왼쪽에 위치하고 있으며, 많은 프로젝트가 필요한 자금을 조달받지 못했음을 보여주고 있다.

목표 금액과 모금액을 비교할 때 흥미로운 점은 모금 달성 비율(모금달성률)이다. 그런 비율이 가장 높은 상위 15개 프로젝트의 목록을 살펴보겠다.

3.3.6 목표 달성률이 가장 높았던 프로젝트

ksdata$ratio <- ksdata$usd_pledged/ksdata$usd_goal

kable(head(ksdata[order(-ksdata$ratio), c(2,3,13,14,15)], 15))

	name	category	usd_pledged	usd_goal	launch_date
369177	VULFPECK /// The Beautiful Game	Music	104277.89	1.00	2016-08-18
186097	Re-covering with Friends	Rock	68764.10	1.00	2016-10-14
360722	VULFPECK /// Thrill of the Arts	Music	55266.57	1.00	2015-08-10
76291	Energy Hook	Video Games	41535.01	1.00	2013-05-10
81369	Band of Brothers 2nd Chance	Tabletop Games	32843.00	1.00	2016-07-12
285409	CLOCKWRIGHT: Large-Scale Analogue Time Machines	Sculpture	27588.23	1.00	2015-12-14
369215	Heartbreak Hits â Theo Katzman’s Second LP	Music	24675.00	1.00	2016-11-04
290937	Penny Arcade’s Podcast, “Downloadable Content”: The Return	Radio & Podcasts	230360.67	10.00	2013-05-08
15289	THE ‘mi8’ RISES \| The Best Wireless Duo Stereo Sound System	Hardware	22603.00	1.00	2016-09-13
164450	CO-OP THE GAME (Canceled)	Video Games	16461.73	0.97	2013-09-09
127905	Worst Game Ever	Tabletop Games	15804.00	1.00	2014-05-29
244399	Y (A Generation)	Documentary	15066.00	1.00	2010-10-09
354532	Graveface Archival Series	Music	14328.77	1.00	2014-08-01
150243	Braiiiins! …it will not die!	Tabletop Games	12984.00	1.00	2015-02-16
367450	Multi-Purpose, All-Occasion Greeting Cards	Printing	25150.30	2.00	2015-10-15

3.3.6.1 Summary

대부분의 프로젝트 목표가 1 달러이다. 최소 목표가 1,000 달러 이상인 프로젝트만 살펴 보겠다.

3.3.7 목표 달성률이 가장 높았던 프로젝트 (목표 금액 $1,000 이상)

goal.min <- ksdata[ksdata$usd_goal>=1000,]

kable(head(goal.min[order(-goal.min$ratio), c(2,3,13,14,15)], 15))

	name	category	usd_pledged	usd_goal	launch_date
187653	Exploding Kittens	Tabletop Games	8782572.0	10000.00	2015-01-20
293862	The World’s Best TRAVEL JACKET with 15 Features \|\| BAUBAX	Product Design	9192055.7	20000.00	2015-07-07
75901	Fidget Cube: A Vinyl Desk Toy	Product Design	6465690.3	15000.00	2016-08-30
132758	2015 CES Best, First Domestic Robot That Tracks Intruders!	Robots	365538.0	1000.00	2014-12-29
137446	The Component Collector	Tabletop Games	327806.7	1000.00	2017-06-01
97372	Joking Hazard	Tabletop Games	3246588.5	10000.00	2016-02-09
23405	Bears vs Babies - A Card Game	Tabletop Games	3215679.8	10000.00	2016-10-18
250255	COOLEST COOLER: 21st Century Cooler that’s Actually Cooler	Product Design	13285226.4	50000.00	2014-07-08
311864	ClickPack Pro \| The Best Functional Anti-theft BackPack	Product Design	1206179.0	5000.00	2017-03-27
1874	Redefining Italian Luxury Watches - Filippo Loreti	Design	5020667.0	20877.92	2016-11-16
146395	SMART BELT - KevlarÂ® Core Indestructible & Micro Adjustment	Design	1174805.1	5000.00	2017-05-25
289928	Gravity: The Weighted Blanket for Sleep, Stress and Anxiety	Product Design	4729263.1	21500.00	2017-04-26
359433	HALLAM new york SMART JACKET 2.0 for TRAVEL, for OUTDOOR	Wearables	1076751.1	5000.00	2016-12-10
65216	Unstable Unicorns	Tabletop Games	1865140.9	10000.00	2017-08-17
309631	THE 7th CONTINENT â What Goes Up, Must Come Down.	Tabletop Games	7072757.0	40000.00	2017-09-26

3.3.7.1 Summary

가장 많은 후원금을 받았던 프로젝트가 여기에서도 보인다. 달성률이 높은 프로젝트에서 자주 보이는 카테고리는 ’Tabletop Games’와 ’Product Design’이다.

3.4 어떤 유형의 프로젝트가 성공을 했고 실패를 했습니까?

3.4.1 프로젝트 상태별 분석

프로젝트 상태(성공, 실패, 취소 등)별로 세분화해서 분석해보겠다.

state.freq <- ksdata %>%
  group_by(state) %>%
  summarize(count=n()) %>%
  arrange(desc(count))

state.freq$state <- factor(state.freq$state, levels=state.freq$state)

ggplot(state.freq, aes(state, count, fill=count)) + geom_bar(stat="identity") + 
  ggtitle("Projects by Status") + xlab("Project Status") + ylab("Frequency") + 
  geom_text(aes(label=count), vjust=-0.5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12), legend.position="null") + 
  scale_fill_gradient(low="skyblue1", high="royalblue4")

3.4.1.1 Summary

실패한 프로젝트가 성공한 프로젝트보다 많다. 대부분의 프로젝트가 빛을 보지 못하는 것 같다. 프로젝트를 서로 다른 두 개의 카테고리로 그룹화 할 수 있는데, “완료”된 프로젝트 (마감일에 도달한 프로젝트, 즉 성공적인 프로젝트 및 실패한 프로젝트)와 “완료되지 못한”프로젝트 (마감일에 도달하지 않은 프로젝트, 즉 아직 진행중인 프로젝트나 취소 또는 일시 중지된 프로젝트)이다. 이 작업을 수행하고 각 그룹의 프로젝트 상태 비율을 살펴 보겠다.

3.4.2 좀 더 세분화(그룹화)된 프로젝트 상태별 분석

state.grp <- ksdata %>%
  filter(state!="undefined") %>%
  mutate(grp=ifelse(state %in% c("successful", "failed"), "complete", "incomplete")) %>%
  group_by(grp, state) %>%
  summarize(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(grp, desc(state))

state.grp$state <- factor(state.grp$state, levels=state.grp$state)

ggplot(state.grp, aes(grp, pct, fill=state)) + geom_bar(stat="identity") + 
  ggtitle("Project Status by Completion") + xlab("Project Completion") + ylab("Percentage") + 
  scale_x_discrete(labels=c("Complete", "Incomplete")) + 
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_brewer(name="Project Status", 
                    labels=c("Successful", "Failed", "Suspended", "Live", "Cancelled"), 
                    palette="Set1") + 
  geom_text(aes(label=paste0(round(pct*100,1),"%")), position=position_stack(vjust=0.5), 
            colour="white", size=5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12), legend.position="bottom", 
        legend.title=element_text(size=12, face="bold"))

3.4.2.1 Summary

완료된(마감일에 도달한) 프로젝트의 약 60%가 펀딩에 실패하고, 약 40% 정도가 성공했다는 것을 알 수 있다. 완료되지 못한 (아직 진행중이거나, 취소되었거나 일시 중지된) 프로젝트의 약 90%가 취소되었다.

3.4.3 카테고리별 프로젝트 성공/실패 여부

state.pct <- ksdata %>%
  filter(state %in% c("successful", "failed")) %>%
  group_by(main_category, state) %>%
  summarize(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(state), pct)

state.pct$main_category <- factor(state.pct$main_category, 
                                  levels=state.pct$main_category[1:(nrow(state.pct)/2)])

ggplot(state.pct, aes(main_category, pct, fill=state)) + geom_bar(stat="identity") + 
  ggtitle("Success vs. Failure Rate by Project Category") + 
  xlab("Project Category") + ylab("Percentage") + scale_y_continuous(labels=scales::percent) + 
  scale_fill_discrete(name="Project Status", breaks=c("successful", "failed"),
                      labels=c("Success", "Failure")) + 
  geom_text(aes(label=paste0(round(pct*100,1),"%")), position=position_stack(vjust=0.5), 
            colour="white", size=5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12), legend.position="bottom", 
        legend.title=element_text(size=12, face="bold")) + coord_flip()

3.4.3.1 Summary

‘Dance’, ‘Theater’, ‘Comics’은 가장 높은 성공률을 보이며, ’Technology’, ‘Journalism’, ’Crafts’가 가장 성공률이 낮다. ’Dance’와 ’Comics’은 모금액에서 높은 중앙값을 보였었고, 목표액에서는 낮은 중간값을 보였었다.

‘Technology’, ‘Journalism’, ’Crafts’은 모금액의 중간값이 낮았고, ’Technology’는 목표액의 중앙값이 높았다. 일반적으로 모금액이 높고, 목표금액이 낮을 수록 성공할 확률이 높다. 흥미롭게도 낮은 목표액의 중간 값을 가진 ’Crafts’는 목표액의 중간값이 낮았음에도 불구하고 성공률이 낮았다. 사람들이 전반적으로 이 카테고리에 관심이 많지 않다는 것을 알 수 있었다.

3.5 프로젝트 기간이 성공에 영향을 줍니까?

킥스타터의 최대 프로젝트 기간은 60일이다. Kickstarter는 프로젝트를 30일 이하로 설정할 것을 권장한다. 그들의 추론은 30 일 이내에 자금 지원을 받지 못한 프로젝트도 마감 기한까지 자금을 조달하지 못할 것이라고 말하고 있다.

프로젝트 마감일과 프로젝트 시작일의 차이를 계산한 다음 전체 일수로 나누어 각 프로젝트의 기간을 알 수 있다. 이제 프로젝트의 성공률을 일 단위로 계산해보겠다.

3.5.1 프로젝트 기간별 성공률

ksdata$length <- interval(ymd_hms(ksdata$launched), ymd(ksdata$deadline)) %/% days(1)

length.pct <- ksdata %>%
  filter(state %in% c("successful", "failed"), length <= 61) %>%
  group_by(length, state) %>%
  summarize(count=n()) %>%
  mutate(pct=count/sum(count))

ggplot(length.pct[length.pct$state=="successful",], aes(length, pct)) + 
  geom_point(colour="royalblue4", size=2.5) + ggtitle("Success Rate vs. Project Length") + 
  xlab("Project Length (Days)") + ylab("Success Rate (%)") + 
  scale_x_continuous(breaks=c(0,10,20,30,40,50,60)) + geom_vline(xintercept=30, colour="red") + 
  theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"))

3.5.1.1 Summary

전반적으로 30일을 초과하는 프로젝트는 30일 미만의 프로젝트보다 성공률이 낮으므로 킥스타터의 말은 일리가 있다. 실제로 30일을 초과하는 프로젝트의 경우 성공률과 프로젝트의 기간 간에 역 선형관계(음의 선형관계)를 보이고 있다(30일 이전 상향 추세선, 30일 이후 하향 추세선). 프로젝트 길이에 따른 프로젝트 분포를 살펴보겠다.

3.5.2 프로젝트 기간에 따른 프로젝트 분포

ggplot(ksdata[ksdata$length <= 61,], aes(length)) + geom_density(colour="royalblue4", size=1) + 
  ggtitle("Distribution of Projects by Campaign Length") + xlab("Project Length (Days)") + 
  ylab("Density (%)") + scale_x_continuous(breaks=c(0,10,20,30,40,50,60)) + 
  geom_vline(xintercept=30, colour="red") + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"))

3.5.2.1 Summary

대다수의 프로젝트가 30일 이내로 설정되어 있는 것처럼 보인다.

3.6 연도별로 프로젝트 현황은 어떻게 됩니까?

연도별로 런칭된 프로젝트 수와 성공 / 실패율을 살펴 보겠다.

3.6.1 연도별 프로젝트 런칭 수

year.freq <- ksdata %>%
  filter(year(launched)!="1970") %>%
  group_by(year=year(launched)) %>%
  summarize(count=n())

ggplot(year.freq, aes(year, count, fill=count)) + geom_bar(stat="identity") + 
  ggtitle("Number of Projects by Launch Year") + xlab("Year") + ylab("Frequency") + 
  scale_x_discrete(limits=c(2009:2018)) + 
  geom_text(aes(label=paste0(count)), vjust=-0.5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12), legend.position="null") + 
  scale_fill_gradient(low="skyblue1", high="royalblue4")

3.6.1.1 Summary

2015년은 프로젝트가 가장 많은 해였다. 2016년과 2017년의 프로젝트 수가 그 전(2014, 2015년)에 비해 프로젝트 수가 점차 감소하고 있는 것으로 보인다.

3.6.2 연도별 프로젝트 성공/실패율

state.pct2 <- ksdata %>%
  filter(year(launched)!="1970", state %in% c("successful", "failed")) %>%
  group_by(year=year(launched), state) %>%
  summarize(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(state))

ggplot(state.pct2, aes(year, pct, fill=state)) + geom_bar(stat="identity") + 
  ggtitle("Success vs. Failure Rate by Year Launched") + 
  xlab("Year") + ylab("Percentage") + scale_x_discrete(limits=c(2009:2017)) + 
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_discrete(name="Project Status", breaks=c("successful", "failed"),
                      labels=c("Success", "Failure")) + 
  geom_text(aes(label=paste0(round(pct*100,1),"%")), position=position_stack(vjust=0.5), 
            colour="white", size=5) + theme_economist() + 
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12), legend.position="bottom", 
        legend.title=element_text(size=12, face="bold"))

3.6.2.1 Summary

흥미롭게도 2014년과 2015년은 프로젝트 성공률이 가장 낮았다. 어쩌면 많은 사람들이 당시에 크라우드 펀딩에 뛰어들려고 시도했지만 성공하지 못했던 것으로 보인다. 년도를 기준으로 프로젝트 유형간에 차이가 있는지 살펴보갰다. 히트맵은 각 카테고리의 유형 및 연도의 프로젝트 수를 보여준다.

3.6.3 년도/프로젝트 유형별 차이 비교 시각화

cat.year <- ksdata %>%
  filter(!year(launched) %in% c("1970", "2018")) %>%
  group_by(main_category, year=year(launched)) %>%
  summarize(count=n())

cat.year2 <- t(matrix(cat.year$count, nrow=9))
colnames(cat.year2) <- c("2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017")
rownames(cat.year2) <- levels(ksdata$main_category)

heatmap.2(cat.year2, dendrogram="row", Colv=F, trace="none", margins=c(10,10))

3.6.3.1 Summary

히트맵에서 주목할만한 부분은 2015년에 ‘Technology’, ‘Film & Video’와 ’Music’가 인가가 매우 많았던 것을 알 수 있었다. ’Food’, ‘Fashion’, ’Art’는 그 다음으로 인기가 많았던 것을 알 수 있었다.

3.7 국가별로 프로젝트 현황은 어떻게 됩니까?

프로젝트가 시작된 국가를 확인할 것이다. 국가별 프로젝트 성공률과 모금 총액과 성공률 그리고 지리적 히트 맵으로 시각화할 것이다.

3.7.1 국가별 프로젝트 성공 현황

ks5 <- ksdata %>%
  group_by(country, state) %>%
  select(goal, pledged, backers) %>%
  summarise(freq=n()) %>%
  mutate(categoryPerc = freq/sum(freq)) %>%
  filter(state == "successful") %>%
  arrange(desc(categoryPerc))

ks55 <- head(ks5, 10)

p5 <- ggplot(ks55, aes(x = reorder(country, -categoryPerc), y = round(categoryPerc*100, digits = 1))) +
         geom_bar(stat = "identity", fill = fillColor2) + theme(axis.text.x = element_text(angle=60, hjust=1)) + geom_text(aes(label=round(categoryPerc*100, digits = 1)), position=position_dodge(width=0.9), vjust=-0.25) +
  theme_economist()

p5+labs(y="Success %", x="Country", title="Top 10 Country by Successful project %")

3.7.1.1 Summary

킥스타터는 글로벌 플랫폼이다. 미국이 아닌 여러 나라의 메이커들도 자신만의 프로젝트를 시작하기 위한 자금을 모금하기 위해 진행한다. 위의 차트에서 볼 수 있듯이 미국은 프로젝트 성공률이 가장 높은 나라지만 다른 나라들 또한 크게 뒤떨어져 있지 않다. 미국에서 시작된 프로젝트의 성공률은 37.4%이며, 영국에서 시작된 프로젝트의 성공률은 35.8%이다. 아시아 지역에서는 홍콩의 프로젝트 성공률이 35%로 영국과 비슷한 양상을 보이고 있으며, 싱가포르 (32.1%)는 상위 5위에 머무르고 있다. 유럽에서는 영국 다음으로 덴마크 (32.3%), 뉴질랜드 (31%), 프랑스 (30.9%), 룩셈부르크 (30.6%), 스웨덴 (29%)이 뒤를 이었다.

3.7.2 국가별 프로젝트 모금 현황

ks6 <- ksdata %>%
  group_by(country, state) %>%
  select(goal, pledged, backers) %>%
  summarise(sumpledge=sum(pledged)) %>%
  filter(state == "successful") %>%
  arrange(desc(sumpledge))
 
ks66 <- head(ks6, 10)
 
p6 <- ggplot(ks66, aes(x = reorder(country, -sumpledge), y = round(sumpledge/1000000, digits = 0))) +
         geom_bar(stat = "identity", fill = fillColor2) + theme(axis.text.x = element_text(angle=60, hjust=1)) + geom_text(aes(label=round(sumpledge/1000000, digits = 0)), position=position_dodge(width=0.9), vjust=-0.25) +
  theme_economist()

p6 + labs(y="Money raised (USD Million)", x="Country", title="Top 10 Country by Money raised (in USD Million)")

3.7.2.1 Summary

각국의 프로젝트 모금액을 비교해 보면, 미국은 나머지 모든 국가를 합친 것보다 월등히 높다는 것을 알 수 있다. 2018년 1월까지 총 31억 2000만 달러의 자금이 조달되었으며, 그 중에서 미국은 25억 3700만 달러를 모았다(한화로 2조 8607억원 수준으로 이는 81.3%에 달한다). 자금 조달 측면에서 영국은 1억 3800만 달러(한화로 1,557억) 수준이고, 미국을 제외한 국가(영국 포함)의 총합은 5억 8300만 달러(한화 6,578억) 수준이다.

3.7.3 미국과 미국이 아닌 국가들의 프로젝트 성공률 비교

킥스타터는 전세계적으로 사용되고 있기 때문에, 카테고리 성공 여부는 프로젝트가 미국 또는 미국 이외의 지역에서 시작되었는지에 따라 결정된다. 이를 위해 데이터를 정리하고 분류하여 미국과 미국이 아닌 지역에서 성공률로 나누었다(아래의 차트는 미국 및 미국 외 국가의 카테고리 성공율을 비교한 것이다). 전반적으로 미국의 성공률이 조금 더 높다.

ks7 <- ksdata %>%
  filter(country %in% c('US')) %>%
  group_by(main_category, state) %>%
  select(goal, pledged, backers) %>%
  summarise(freq=n()) %>%
  mutate(us = freq/sum(freq)) %>%
  filter(state == "successful") %>%
  arrange(desc(us))

us_success <- select(ks7, main_category, us)


ks8 <- ksdata %>%
  filter(!country %in% c('US')) %>%
  group_by(main_category, state) %>%
  select(goal, pledged, backers) %>%
  summarise(freq=n()) %>%
  mutate(non_us = freq/sum(freq)) %>%
  filter(state == "successful") %>%
  arrange(desc(non_us))

non_us_success <- select(ks8, main_category, non_us)


p7 <- ggplot(ks7, aes(x = reorder(main_category, -us), y = round(us*100, digits = 0))) +
         geom_bar(stat = "identity", fill = fillColor2) + theme(axis.text.x = element_text(angle=60, hjust=1)) + geom_text(aes(label=round(us*100, digits = 0)), position=position_dodge(width=0.3), vjust=-0.25) + labs(y="Success %", x="Main Category", title="USA - Successful project %") + theme_economist()+ theme(axis.text.x = element_text(angle = 90, hjust = 1))



p8 <- ggplot(ks8, aes(x = reorder(main_category, -non_us), y = round(non_us*100, digits = 0))) +
         geom_bar(stat = "identity", fill = fillColor2) + theme(axis.text.x = element_text(angle=60, hjust=1)) + geom_text(aes(label=round(non_us*100, digits = 0)), position=position_dodge(width=0.3), vjust=-0.25) + labs(y="Success %", x="Main Category", title="Non USA - Successful project %") + theme_economist() + theme(axis.text.x = element_text(angle = 90, hjust = 1))


ggarrange(p7, p8, ncol = 2)

4 Statistics Analysis

4.1 미국과 미국이 아닌 국가에서의 성공률은 통계적으로 차이가 있습니까?

4.1.1 독립성 검정 (피어슨 카이제곱 검정)

\[H0(귀무가설)\] 두 개의 범주형 변수 간에는 연관성이 없고, 행 변수와 열 변수는 모집단에서 독립적이며, 카테고리의 성공률에 영향을 미친다. \[H1(대립가설)\] 두 개의 범주형 변수 간에는 연관성이 없고, 행 변수와 열 변수는 독립적이지 않으며, 카테고리의 성공에 영향을 주지 않는다.

compare_main <- left_join(us_success, non_us_success, by = c("main_category" = "main_category"))
compare_main

## # A tibble: 15 x 3
## # Groups:   main_category [?]
##    main_category    us non_us
##    <fct>         <dbl>  <dbl>
##  1 Dance         0.647  0.463
##  2 Theater       0.599  0.596
##  3 Comics        0.541  0.535
##  4 Music         0.502  0.286
##  5 Art           0.426  0.345
##  6 Film & Video  0.381  0.329
##  7 Games         0.380  0.298
##  8 Design        0.354  0.342
##  9 Photography   0.318  0.274
## 10 Publishing    0.314  0.287
## 11 Food          0.268  0.158
## 12 Fashion       0.248  0.236
## 13 Crafts        0.247  0.220
## 14 Journalism    0.225  0.176
## 15 Technology    0.219  0.155

chisq.test(data.frame(compare_main$us, compare_main$non_us))

## 
##  Pearson's Chi-squared test
## 
## data:  data.frame(compare_main$us, compare_main$non_us)
## X-squared = 0.074268, df = 14, p-value = 1

4.1.1.1 Summary

결과의 p-값이 1에 가깝기 때문에 귀무 가설을 채택할 수 있으며, 카테고리의 성공에 영향을 준다고(성공에 차이가 있다고) 말할 수 있다. 모든 범주에서 미국 내 성공확률은 미국을 제외한 국가의 성공률보다 높다. 따라서 미국에서 프로젝트를 시작하는 것이 성공률을 더 높인다고 말할 수 있다.

4.1.2 분산 분석 (Anova Test)

킥스타터에서 ‘Games’와 ’Design’ 카테고리는 가장 많은 모금액($741, $734 millon)이 형성되었는데, 이 카테고리에 의해 모금된 금액이 통계적으로 차이가 있는지 아니면 비슷한지 확인해보겠다. 분석을 위해 상위 1,000개의 프로젝트를 선택하고 Anova 분석을 시행하겠다.

ks9 <- ksdata %>%
  filter(main_category %in% c('Games')) %>%
  select(pledged) %>%
  arrange(desc(pledged))

ks10 <- ksdata %>%
  filter(main_category %in% c('Design')) %>%
  select(pledged) %>%
  arrange(desc(pledged))

ks9 <- head(ks9, 1000)
ks10 <- head(ks10, 1000)

sprintf("Mean amount pledged on Design Category: %f", mean(ks9$pledged))

## [1] "Mean amount pledged on Design Category: 500963.445080"

## [1] "Mean amount pledged on Design Category: 399344.450060"
sprintf("Mean amount pledged on Games Category: %f", mean(ks10$pledged))

## [1] "Mean amount pledged on Games Category: 519495.490640"

## [1] "Mean amount pledged on Games Category: 387663.683480"
#ks11 <- round(ks9[sample(1:nrow(ks9), 1000, replace = FALSE), ], digits = -3)
#ks12 <- round(ks10[sample(1:nrow(ks10), 1000, replace = FALSE), ], digits = -3)
ks9 <- round(ks9, digits= -3)
ks10 <- round(ks10, digits = -3)
scores = data.frame(ks9, ks10)
#boxplot(scores)
scores = stack(scores)
resultaov <- aov(values ~ ind, data=scores, var.equal = T)
summary(resultaov)

##               Df    Sum Sq   Mean Sq F value Pr(>F)
## ind            1 1.718e+11 1.718e+11   0.165  0.684
## Residuals   1998 2.077e+15 1.040e+12

4.1.2.1 Summary

P값(유의확률)은 0.77로 유의수준인 0.05보다 높으므로 귀무가설을 채택할 수 있으며, 두 카테고리 모두 통계적으로 같은 금액의 평균값을 가진 1,000개의 프로젝트가 있다고 말할 수 있다.

4.1.3 국가별 차이 비교 시각화

countries.freq <- ksdata %>%
  filter(country!='N,0"') %>%
  group_by(country) %>%
  summarize(count=n())

countries.match <- joinCountryData2Map(countries.freq, joinCode="ISO2", nameJoinColumn="country")

## 22 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 220 codes from the map weren't represented in your data

mapCountryData(countries.match, nameColumnToPlot="count", 
               mapTitle="Number of Projects by Country", catMethod="logFixedWidth", 
               colourPalette="heat")

4.1.3.1 Summary

프로젝트는 일본과 북아메리카, 유럽 및 오세아니아에서 많이 발생하는 것으로 보인다. 미국과 영국은 규모가 작지만 프로젝트의 수가 가장 많다.

4.2 킥스타터의 모금액과 펀딩에 참여한 사람 수와의 관계

킥스타터의 프로젝트는 펀딩에 참여한 사람들이 금전적으로 지원을 해주는 것이기 때문에 후원자의 수에 따라 프로젝트 성공 및 금액의 여부가 결정된다. 아래의 그림은 후원자의 수와 모금액의 선형관계를 보여주고 있다. 모금액과 후원자 수의 관계를 이해하기 위해 다음 모델을 사용하여 회귀 분석을 실시했다.

4.2.1 Linear Regression Model

\[Pledgedamount = A_{0} + B_{1} * backers\]

xyplot(ksdata$pledged~ksdata$backers, data = ksdata, xlab= "Number of backers", ylab= "Amount pledged (USD)", panel = function(x, y, ...) {
                 panel.xyplot(x, y, ...)
                 panel.lmline(x, y, ...)
               })

4.2.2 선형 회귀 모델의 가정

회귀 분석의 결과와 회귀 분석의 가정을 만족시키는 것을 보여주는 차트는 아래에 표시해두었다.

resultslm <- lm(ksdata$pledged~ksdata$backers, data = ksdata)
summary(resultslm)

## 
## Call:
## lm(formula = ksdata$pledged ~ ksdata$backers, data = ksdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7803302    -2280    -1773    -1674 14405275 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1698.8299   109.0555   15.58   <2e-16 ***
## ksdata$backers   75.5950     0.1194  633.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66660 on 378659 degrees of freedom
## Multiple R-squared:  0.5142, Adjusted R-squared:  0.5142 
## F-statistic: 4.008e+05 on 1 and 378659 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(resultslm)

4.2.2.1 Summary

후원자의 p값(유의확률)은 0.01 미만으로 이는 후원자의 수가 모금 총액에 상당한 영향을 미친다는 것을 의미한다. 선형회귀 모델의 식은 아래와 같다.

\[Pledgedamount = 1698.9 + 75.6 ∗ backers\]

후원자의 기울기는 75.6이므로 각 후원자가 킥스타터에 평균적으로 $ 75.6에 기여했다고 가정할 수 있다. 따라서 더 많은 수의 후원자를 유치할 수 있는 프로젝트가 더 많은 돈을 모으게 됨으로써 성공적으로 진행될 것이다.

5 Predictive Analystics

프로젝트의 성공에 영향을 끼치는 요소 찾기

5.1 Decison Tree (CART Algorithm)

의사결정나무 알고리즘을 데이터에 적용하여 변수 중요도와 성공과 실패를 끼치는 요소가 무엇인지 파악하겠다.

5.1.1 Setting

full <- ksdata %>%
  filter(state %in% c("successful", "failed"))

full <- full %>%
  select(c(4, 7, 9, 10, 11, 12))

full$state <- ifelse(full$state == 'failed', 0, 1)

kick_dummy <- data.frame(full, dummy(full$main_category))
kick_dummy <- data.frame(kick_dummy, dummy(full$country))
head(kick_dummy)

### lets prepare and keep data in the proper format
feature <- kick_dummy
response <- as.factor(kick_dummy$state)
feature$state <- as.factor(kick_dummy$state)


### For Cross validation purpose will keep 20% of data aside from my orginal train set
## This is just to check how well my data works for unseen data
set.seed(1234)
ind <- createDataPartition(kick_dummy$state, times=1, p=0.8, list=FALSE)

train_val <- feature[ind, ]
test_val <- feature[-ind, ]

prop.table(table(full$state))*100
prop.table(table(train_val$state))*100
prop.table(table(test_val$state))*100

5.1.2 Training a model on the data

## Training
dt <- rpart(as.factor(state)~., data = train_val, cp = 0.1^20) # 모든 변수 사용, Full tree 생성

xerror_min_which <- which.min(dt$cptable[, "xerror"])
xerror_min <- min(dt$cptable[, "xerror"])

printcp(dt) # cptable 출력

## 
## Classification tree:
## rpart(formula = as.factor(state) ~ ., data = train_val, cp = 0.1^20)
## 
## Variables actually used in tree construction:
## [1] backers       goal          main_category pledged      
## 
## Root node error: 107198/265340 = 0.404
## 
## n= 265340 
## 
##            CP nsplit rel error    xerror       xstd
## 1  6.0837e-01      0 1.0000000 1.0000000 0.00235792
## 2  7.5710e-02      1 0.3916304 0.3916304 0.00175366
## 3  2.8536e-02      3 0.2402097 0.2403310 0.00142276
## 4  2.5210e-02      5 0.1831377 0.1832870 0.00125825
## 5  1.2640e-02      7 0.1327170 0.1329409 0.00108330
## 6  1.1316e-02      9 0.1074367 0.1078005 0.00098073
## 7  6.5859e-03     11 0.0848057 0.0851788 0.00087593
## 8  5.0281e-03     13 0.0716338 0.0707756 0.00080085
## 9  4.0206e-03     15 0.0615776 0.0607381 0.00074343
## 10 2.8475e-03     17 0.0535364 0.0536484 0.00069972
## 11 2.5327e-03     21 0.0421463 0.0461669 0.00065011
## 12 1.6465e-03     23 0.0370809 0.0386854 0.00059602
## 13 1.6278e-03     25 0.0337879 0.0334521 0.00055483
## 14 1.6255e-03     27 0.0305323 0.0322021 0.00054451
## 15 1.6092e-03     31 0.0240303 0.0315398 0.00053895
## 16 1.4553e-03     33 0.0208120 0.0259333 0.00048927
## 17 9.0953e-04     35 0.0179015 0.0210918 0.00044168
## 18 8.7688e-04     37 0.0160824 0.0179481 0.00040770
## 19 6.8565e-04     39 0.0143286 0.0158492 0.00038328
## 20 5.6904e-04     41 0.0129573 0.0146365 0.00036841
## 21 3.8713e-04     43 0.0118193 0.0142260 0.00036324
## 22 3.3116e-04     45 0.0110450 0.0133118 0.00035144
## 23 3.2183e-04     47 0.0103827 0.0125096 0.00034074
## 24 2.9385e-04     49 0.0097390 0.0112689 0.00032349
## 25 2.6586e-04     51 0.0091513 0.0101961 0.00030777
## 26 2.1922e-04     53 0.0086196 0.0092352 0.00029297
## 27 2.0989e-04     55 0.0081811 0.0087875 0.00028580
## 28 2.0523e-04     57 0.0077613 0.0085543 0.00028200
## 29 1.9123e-04     59 0.0073509 0.0081252 0.00027486
## 30 1.7724e-04     61 0.0069684 0.0074815 0.00026378
## 31 1.6791e-04     64 0.0064367 0.0072389 0.00025948
## 32 1.6325e-04     65 0.0062688 0.0069964 0.00025511
## 33 1.3526e-04     67 0.0059423 0.0068098 0.00025170
## 34 1.3060e-04     75 0.0048602 0.0062874 0.00024187
## 35 1.1894e-04     77 0.0045990 0.0059982 0.00023626
## 36 1.0261e-04     81 0.0041232 0.0053732 0.00022364
## 37 9.7950e-05     83 0.0039180 0.0051867 0.00021973
## 38 8.3957e-05     85 0.0037221 0.0051027 0.00021795
## 39 7.4628e-05     90 0.0033023 0.0047296 0.00020985
## 40 6.9964e-05     94 0.0030038 0.0042258 0.00019838
## 41 6.5300e-05     96 0.0028639 0.0042258 0.00019838
## 42 4.6643e-05     98 0.0027333 0.0037874 0.00018782
## 43 4.2911e-05    100 0.0026400 0.0035262 0.00018124
## 44 4.1978e-05    105 0.0024254 0.0035262 0.00018124
## 45 3.7314e-05    107 0.0023415 0.0033863 0.00017761
## 46 3.4205e-05    109 0.0022668 0.0031997 0.00017266
## 47 3.2650e-05    115 0.0020616 0.0031810 0.00017215
## 48 2.3321e-05    119 0.0019310 0.0031624 0.00017165
## 49 1.5548e-05    121 0.0018844 0.0031344 0.00017089
## 50 1.3993e-05    124 0.0018377 0.0030411 0.00016833
## 51 9.3285e-06    126 0.0018097 0.0030411 0.00016833
## 52 4.6643e-06    130 0.0017724 0.0030504 0.00016859
## 53 3.1095e-06    132 0.0017631 0.0030318 0.00016807
## 54 1.0000e-20    135 0.0017538 0.0030411 0.00016833

5.1.3 Cross Validation: Pruning

plotcp(dt) # cpplot 출력

abline(v = xerror_min_which, lty = 2, col = "red")
text(xerror_min_which, xerror_min, labels = round(xerror_min_which, 2), pos = 3, col = "red")

# pruning
dt_prune <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])

5.1.4 Evaluating model performance

- training accuracy

pred_tr_dt <- predict(dt_prune, type = "class") # class(범주형)으로 예측

#CrossTable(x = train_val$state, y = pred_tr_dt, prop.chisq = FALSE)

confusionMatrix(train_val$state, pred_tr_dt)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 157998    144
##          1     45 107153
##                                           
##                Accuracy : 0.9993          
##                  95% CI : (0.9992, 0.9994)
##     No Information Rate : 0.5956          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9985          
##  Mcnemar's Test P-Value : 1.015e-12       
##                                           
##             Sensitivity : 0.9997          
##             Specificity : 0.9987          
##          Pos Pred Value : 0.9991          
##          Neg Pred Value : 0.9996          
##              Prevalence : 0.5956          
##          Detection Rate : 0.5955          
##    Detection Prevalence : 0.5960          
##       Balanced Accuracy : 0.9992          
##                                           
##        'Positive' Class : 0               
##

- test accuracy

pred_te_dt <- predict(dt_prune, test_val, type = "class")

#CrossTable(x = test_val$state, y = pred_te_dt, prop.chisq = FALSE)

confusionMatrix(test_val$state, pred_te_dt, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39526    51
##          1    20 26738
##                                           
##                Accuracy : 0.9989          
##                  95% CI : (0.9987, 0.9992)
##     No Information Rate : 0.5962          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9978          
##  Mcnemar's Test P-Value : 0.0003704       
##                                           
##             Sensitivity : 0.9981          
##             Specificity : 0.9995          
##          Pos Pred Value : 0.9993          
##          Neg Pred Value : 0.9987          
##              Prevalence : 0.4038          
##          Detection Rate : 0.4031          
##    Detection Prevalence : 0.4034          
##       Balanced Accuracy : 0.9988          
##                                           
##        'Positive' Class : 1               
##

5.1.5 Visualization

# plotting
plot(dt_prune, margin = 0.1)
text(dt_prune, use.n = T)

fancyRpartPlot(dt_prune, cex = 1) #fancy tree

5.1.6 Feature Importance

barplot(dt_prune$variable.importance, ylim = c(0, 55))

dt_prune$variable.importance

##                                                                                                         pledged 
##                                                                                                    9.003919e+04 
##                                                                                                         backers 
##                                                                                                    7.713038e+04 
##                                                                                                            goal 
##                                                                                                    3.196962e+04 
##                                                                                                   main_category 
##                                                                                                    1.012345e+04 
##       X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdDesign 
##                                                                                                    2.491352e+03 
##        X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdGames 
##                                                                                                    2.379017e+03 
##       X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdComics 
##                                                                                                    1.484687e+03 
##   X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdTechnology 
##                                                                                                    4.364026e+02 
##                                                                                                         country 
##                                                                                                    2.497828e+02 
##      X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdTheater 
##                                                                                                    1.839015e+02 
##         X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdFood 
##                                                                                                    1.075711e+02 
##        X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdMusic 
##                                                                                                    8.531380e+01 
##         X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdN.0. 
##                                                                                                    4.432206e+01 
##        X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdDance 
##                                                                                                    3.981948e+01 
## X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdFilm...Video 
##                                                                                                    1.836672e+01 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdGB 
##                                                                                                    1.213305e+01 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdBE 
##                                                                                                    7.046398e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdIT 
##                                                                                                    5.240522e+00 
##      X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdFashion 
##                                                                                                    5.124196e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdJP 
##                                                                                                    3.822880e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdSG 
##                                                                                                    3.196868e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdCH 
##                                                                                                    2.937949e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdDE 
##                                                                                                    2.932639e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdES 
##                                                                                                    2.778649e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdDK 
##                                                                                                    2.665810e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdNO 
##                                                                                                    2.385620e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdHK 
##                                                                                                    2.327269e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdLU 
##                                                                                                    2.031737e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdCA 
##                                                                                                    1.659574e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdAT 
##                                                                                                    1.468975e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdNL 
##                                                                                                    1.008650e+00 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdFR 
##                                                                                                    4.457456e-01 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdIE 
##                                                                                                    4.993410e-02 
##          X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdArt 
##                                                                                                    6.642238e-03 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdMX 
##                                                                                                    4.310621e-03 
##           X.Users.mac.codebox_modified.dev.data.analytics.project.kickstarter.kickstarter.analysis.181030.RmdSE 
##                                                                                                    6.661809e-05

6 Conclusion

6.1 Summary

킥스타터에서 프로젝트의 성공에 영향을 미치는 요인에는 여러 가지가 있다. ‘Dance’, ‘Theater’ 기반의 프로젝트는 위험이 낮기 때문에 약 65% 정도의 높은 성공률을 보이고 있다. 음악은 52.7%의 성공률을 보이지만 Chiptune은 Kickstarter의 모든 하위 카테고리 중에서 75 %의 성공률을 기록했다. ’Games’의 성공률은 43.9%이며, 탁상형 게임과 비디오 게임이 다른 유형의 프로젝트보다 성공률이 높았다. 반면에 ’Technology’는 23.8%로 성공률이 가장 낮았다. 미국과 미국 외의 국가에서의 프로젝트 성공률이 비슷해보일지라도, 실제 자금 조달 규모에서 미국은 모든 국가를 능가한다(전체의 81.3%를 차지하며 한화로 2조 8607억 수준).

‘Design’과 ’Games’, ‘Comics’, ‘Technology’ 분야가 펀딩에 성공할 확률이 높고, 게임과 디자인, 기술 프로젝트는 더 많은 모금이 가능한 걸 확인할 수 있었다. 지난 7년간의 데이터를 살펴보면 킥 스타터의 각각의 후원자들은 평균 75.6 달러를 펀딩에 기여했다. 따라서 더 많은 수의 후원자가 프로젝트를 성공시키기 위한 프로젝트에 더 많은 자금을 지원할 수 있게 방안을 모색해야 한다.

6.2 Suggestion

영어 콘텐츠를 마련한다.
- 추가 정보를 모국어로만 제공하면 후원을 받기 어려움
목표 금액을 적게 잡고 확실히 달성한다.
- 프로젝트 성공 자체를 뉴스거리로 삼아 홍보에 활용할 수 있다.
- 프로젝트가 대성공한 경우, 일단 접고 다시 만드는 방법도 사용할 만하다.
프로젝트를 시작하기 전에 적어도 10명 정도의 후원자를 확보한다.
- 흥행하는 느낌을 연출할 수 있다.
- 목표 금액의 30%를 달성하면 ‘성공 가능성이 높다’는 이유로 후원하는 사람이 늘어난다.
소액으로도 보상을 받을 수 있는 후원 플랜을 갖춘다.
- 10달러 선에서 보상이 걸린 후원 플랜을 갖춘다.
- 평균 후원액이 200달러 이하가 되도록 한다.
- 달성 직전의 후원이 소액이면 부담 없이 참여할 수 있다

Kickstarter data analytics

Andrew Jang

10/14/2018