1 TED Quartile EDA


This document records the EDA performed on the output of the work organized in ted_quartie.rmd. It mainly loads and processes the following data files.


Data covered in the TED Quartile EDA


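1.0.1 Information to obtain in this document

The basic items to work out in advance are as follows.

  • Mean of view counts
  • Variance of view counts
  • Do the view counts follow a normal distribution?
  • Box plot of view counts
  • Density curve of view counts
  • Scatter plot of view counts

Items can be added or removed at any time.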



1.1 Data munging


This step restructures and cleans the TED Quartile Middle (25%~75%) data in preparation for EDA.


1.1.1 Loading libraries

library(dplyr) # for data refinement
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) #for visualization (EDA)
library(mvnormtest) # for normality testing (mshapiro.test, a multivariate Shapiro-Wilk test)


1.1.2 Loading the data

This section loads the data.

ted_Q1 = read.csv("ted_quartile_middle.csv", sep=",", stringsAsFactors=FALSE, header=TRUE)
str(ted_Q1)
## 'data.frame':    1092 obs. of  30 variables:
##  $ X           : int  17 19 21 22 26 27 28 30 31 33 ...
##  $ NEWST       : chr  "p.1-17" "p.1-19" "p.1-21" "p.1-22" ...
##  $ NAME        : chr  "Alex Kipman" "Latif Nasser" "Adam Foss" "Meron Gribetz" ...
##  $ ROLE        : chr  "Inventor" "Radio researcher" "Juvenile justice reformer" "Founder and CEO_ Meta" ...
##  $ TITLE       : chr  "A futuristic vision of the age of holograms" "You have no idea where camels really come from" "A prosecutors vision for a better justice system" "A glimpse of the future through an augmented reality headset" ...
##  $ POSITION    : chr  "TED2016" "TED Talks Live" "TED2016" "TED2016" ...
##  $ DURATION    : chr  "19:05" "12:27" "15:57" "10:54" ...
##  $ DAYS        : chr  "2016-02" "2015-11" "2016-02" "2016-02" ...
##  $ SUB_COUNT   : chr  "4" "7" "5" "10" ...
##  $ TOTAL_VIEWS : int  712809 1151849 852920 683227 1090977 661699 1006802 1043401 1086686 838626 ...
##  $ TOPICS      : chr  "NASA_Communication_Computers_Creativity_Design_Engineering_Exploration_Future_Innovation_Interface design_Invention_Microsoft_P"| __truncated__ "Ancient world_Animals_Biology_Biosphere_Curiosity_Environment_Evolution_History_Life_Nature_Paleontology_Science__" "Criminal Justice_Big problems_Choice_Compassion_Decision-making_Education_Government_Inequality_Law_Leadership_Policy_Race_Soci"| __truncated__ "Senses_Augmented reality_Brain_Computers_Creativity_Cyborg_Demo_Design_Engineering_Entrepreneur_Innovation_Interface design_Inv"| __truncated__ ...
##  $ Beautiful   : int  97 44 158 35 56 407 481 95 143 6 ...
##  $ OK          : int  61 29 14 44 25 33 32 36 29 52 ...
##  $ Funny       : int  10 344 6 16 62 56 29 4 41 15 ...
##  $ Unconvincing: int  42 18 2 45 11 54 5 0 19 51 ...
##  $ Fascinating : int  323 302 81 211 185 119 151 52 284 96 ...
##  $ Informative : int  177 302 264 121 181 63 209 109 221 235 ...
##  $ Ingenious   : int  92 51 33 88 75 58 56 8 167 61 ...
##  $ Persuasive  : int  11 37 299 14 70 214 263 2 50 71 ...
##  $ Inspiring   : int  170 113 474 135 354 722 870 28 369 140 ...
##  $ Courageous  : int  6 11 198 15 47 340 124 2 23 19 ...
##  $ Obnoxious   : int  25 15 3 12 13 16 8 1 6 33 ...
##  $ Confusing   : int  11 4 1 10 3 45 15 10 3 9 ...
##  $ Longwinded  : int  41 22 3 8 12 28 1 7 13 13 ...
##  $ Jaw.dropping: int  208 64 19 79 8 49 40 3 77 14 ...
##  $ Role_refine : chr  "Science" "Science" "Public" "Management" ...
##  $ posValue    : int  1155 1297 1546 758 1063 2061 2255 339 1404 709 ...
##  $ negValue    : int  119 59 9 75 39 143 29 18 41 106 ...
##  $ totalRate   : int  1274 1356 1555 833 1102 2204 2284 357 1445 815 ...
##  $ meanValues  : int  61 29 14 44 25 33 32 36 29 52 ...

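1.1.3 Adding posPer and negPer

These columns express, as percentages, how much of each talk's totalRate is accounted for by posValue and negValue respectively. For example, if 80 of 100 total ratings are positive, posPer is 80% and negPer is 20%.

# This must later be applied to the bottom and top data as well, not only to middle.
ted_Q1$posPer = (ted_Q1$posValue/ted_Q1$totalRate) * 100 # share of positive ratings (%)
ted_Q1$negPer = (ted_Q1$negValue/ted_Q1$totalRate) * 100 # share of negative ratings (%)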
head(ted_Q1, 10)
##     X  NEWST            NAME
## 1  17 p.1-17     Alex Kipman
## 2  19 p.1-19    Latif Nasser
## 3  21 p.1-21       Adam Foss
## 4  22 p.1-22   Meron Gribetz
## 5  26 p.1-26      Joe Gebbia
## 6  27 p.1-27    Casey Gerald
## 7  28 p.1-28 Tshering Tobgay
## 8  30 p.1-30  Laura Robinson
## 9  31 p.1-31    Caleb Harper
## 10 33 p.1-33 Travis Kalanick
##                                                                      ROLE
## 1                                                                Inventor
## 2                                                        Radio researcher
## 3                                               Juvenile justice reformer
## 4                                                   Founder and CEO_ Meta
## 5                                          Designer_ co-founder of Airbnb
## 6                                                                American
## 7                                                Prime Minister of Bhutan
## 8                                                         Ocean scientist
## 9  Principal Investigator and Director of the Open Agriculture Initiative
## 10                                      Problem solver-­in-­chief_ Uber
##                                                           TITLE
## 1                   A futuristic vision of the age of holograms
## 2                You have no idea where camels really come from
## 3              A prosecutors vision for a better justice system
## 4  A glimpse of the future through an augmented reality headset
## 5                                  How Airbnb designs for trust
## 6                                           The gospel of doubt
## 7   This country isnt just carbon neutral — its carbon negative
## 8              The secrets I find on the mysterious ocean floor
## 9               This computer will grow your food in the future
## 10                Ubers plan to get more people into fewer cars
##            POSITION DURATION    DAYS SUB_COUNT TOTAL_VIEWS
## 1           TED2016    19:05 2016-02         4      712809
## 2    TED Talks Live    12:27 2015-11         7     1151849
## 3           TED2016    15:57 2016-02         5      852920
## 4           TED2016    10:54 2016-02        10      683227
## 5           TED2016    15:51 2016-02        13     1090977
## 6           TED2016    18:19 2016-02         7      661699
## 7           TED2016    18:54 2016-02        12     1006802
## 8      TEDxBrussels    11:21 2014-12         8     1043401
## 9  TEDGlobal>Geneva    15:55 2015-12         7     1086686
## 10          TED2016    19:18 2016-02         9      838626
##                                                                                                                                                                                                                                                                              TOPICS
## 1                                                                                     NASA_Communication_Computers_Creativity_Design_Engineering_Exploration_Future_Innovation_Interface design_Invention_Microsoft_Potential_Prediction_Product design_Technology_Visualizations__
## 2                                                                                                                                                                Ancient world_Animals_Biology_Biosphere_Curiosity_Environment_Evolution_History_Life_Nature_Paleontology_Science__
## 3                                                                                                                                Criminal Justice_Big problems_Choice_Compassion_Decision-making_Education_Government_Inequality_Law_Leadership_Policy_Race_Social change_Society__
## 4                                                                Senses_Augmented reality_Brain_Computers_Creativity_Cyborg_Demo_Design_Engineering_Entrepreneur_Innovation_Interface design_Invention_Neuroscience_Potential_Prediction_Product design_Technology_Visualizations__
## 5                                                                            Behavioral economics_Business_Collaboration_Community_Culture_Design_Economics_Entrepreneur_Future_Innovation_Potential_Privacy_Product design_Relationships_Social change_Technology_Urban planning__
## 6                                                                                                                                                                                 God_Big problems_Business_Capitalism_Community_Education_Faith_Inequality_Social change_Society__
## 7  Buddhism_Alternative energy_Beauty_Big problems_Biosphere_Climate change_Democracy_Development_Economics_Environment_Future_Global issues_Goal-setting_Government_Green_Happiness_Humanity_Innovation_Leadership_Nature_Politics_Pollution_Sustainability_Trees_World cultures__
## 8                                                                                                     TEDx_Adventure_Ancient world_Animals_Beauty_Biodiversity_Biology_Biosphere_Chemistry_Climate change_Environment_Exploration_Future_History_Life_Nature_Oceans_Science_Water__
## 9   Vaccines_Agriculture_Biomechanics_Biosphere_Biotech_Botany_Chemistry_Climate change_Collaboration_Data_Design_Ebola_Education_Engineering_Environment_Food_Future_Garden_Green_Innovation_Nature_Open-source_Potential_Science_Software_Sustainability_Technology_Virus_Water__
## 10                                                     Brand_Internet_Business_Cars_China_Cities_Economics_Entrepreneur_Environment_Future_Green_India_Innovation_Invention_Investment_Mobility_Pollution_Potential_Society_Software_Sustainability_Technology_Transportation_Web__
##    Beautiful OK Funny Unconvincing Fascinating Informative Ingenious
## 1         97 61    10           42         323         177        92
## 2         44 29   344           18         302         302        51
## 3        158 14     6            2          81         264        33
## 4         35 44    16           45         211         121        88
## 5         56 25    62           11         185         181        75
## 6        407 33    56           54         119          63        58
## 7        481 32    29            5         151         209        56
## 8         95 36     4            0          52         109         8
## 9        143 29    41           19         284         221       167
## 10         6 52    15           51          96         235        61
##    Persuasive Inspiring Courageous Obnoxious Confusing Longwinded
## 1          11       170          6        25        11         41
## 2          37       113         11        15         4         22
## 3         299       474        198         3         1          3
## 4          14       135         15        12        10          8
## 5          70       354         47        13         3         12
## 6         214       722        340        16        45         28
## 7         263       870        124         8        15          1
## 8           2        28          2         1        10          7
## 9          50       369         23         6         3         13
## 10         71       140         19        33         9         13
##    Jaw.dropping Role_refine posValue negValue totalRate meanValues
## 1           208     Science     1155      119      1274         61
## 2            64     Science     1297       59      1356         29
## 3            19      Public     1546        9      1555         14
## 4            79  Management      758       75       833         44
## 5             8         Art     1063       39      1102         25
## 6            49      Public     2061      143      2204         33
## 7            40         Vip     2255       29      2284         32
## 8             3     Science      339       18       357         36
## 9            77    Humanist     1404       41      1445         29
## 10           14  Management      709      106       815         52
##      posPer     negPer
## 1  90.65934  9.3406593
## 2  95.64897  4.3510324
## 3  99.42122  0.5787781
## 4  90.99640  9.0036014
## 5  96.46098  3.5390200
## 6  93.51180  6.4882033
## 7  98.73030  1.2697023
## 8  94.95798  5.0420168
## 9  97.16263  2.8373702
## 10 86.99387 13.0061350
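Since the same two columns are needed for the bottom and top data as well, the calculation can be wrapped in a small helper. The following is only a sketch: the function name `add_rating_percent` is hypothetical, and the toy data frame copies the posValue, negValue, and totalRate values of the first three talks shown above.

```r
# Hypothetical helper: adds posPer and negPer (in %) to any data frame
# that has posValue, negValue, and totalRate columns.
add_rating_percent <- function(df) {
  df$posPer <- df$posValue / df$totalRate * 100
  df$negPer <- df$negValue / df$totalRate * 100
  df
}

# Toy rows copied from the first three talks in the output above.
ratings <- data.frame(
  posValue  = c(1155, 1297, 1546),
  negValue  = c(119, 59, 9),
  totalRate = c(1274, 1356, 1555)
)

ratings <- add_rating_percent(ratings)
ratings

# When totalRate is posValue + negValue, the two percentages sum to 100.
ratings$posPer + ratings$negPer
```

Under these assumptions the same call could be reused on the bottom and top data frames once they are loaded.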


1.1.4 Checking basic statistics

Here we check the mean, variance, standard deviation, median, and quartiles.

#summarise(ted_Q1, mean(TOTAL_VIEWS)) # mean: 436,564.5
#summarise(ted_Q1, var(TOTAL_VIEWS)) # variance: approx. 2.05e10 (the square of the standard deviation)
#summarise(ted_Q1, sd(TOTAL_VIEWS)) # standard deviation: 143,277.8
#summarise(ted_Q1, median(TOTAL_VIEWS)) # median: 453897

quantile(ted_Q1$TOTAL_VIEWS) # quartiles
##     0%    25%    50%    75%   100% 
##  48071 336880 453897 562141 647809
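The separate summarise() calls can also be combined into one call that returns every statistic in a single row. This is a minimal sketch, not the code used for the figures above: the `views` data frame is a stand-in built from the TOTAL_VIEWS values of the first five talks printed by head(), and the output column names are illustrative.

```r
library(dplyr)

# Stand-in data: TOTAL_VIEWS of the first five talks shown by head() earlier.
views <- data.frame(
  TOTAL_VIEWS = c(712809, 1151849, 852920, 683227, 1090977)
)

# One summarise() call producing all basic statistics at once.
view_stats <- views %>%
  summarise(
    n_talks      = n(),
    mean_views   = mean(TOTAL_VIEWS),
    var_views    = var(TOTAL_VIEWS),
    sd_views     = sd(TOTAL_VIEWS),
    median_views = median(TOTAL_VIEWS)
  )

view_stats
```

Keeping the variance and the standard deviation side by side makes it easy to confirm that one is the square of the other.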

Dataset cleaning ends here; next we write the EDA code using ggplot2.



2 EDA

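To begin the TED Quartile Middle (25%~75%) EDA, this section collects code for checking the basic shape of the data.


2.1 Univariate EDA



2.1.1 [Univariate EDA] Variable setup A. x: view count

gg_ted_m = ggplot(ted_Q1, aes(x=TOTAL_VIEWS)) # base plot object with ted_Q1's view count (TOTAL_VIEWS) mapped to x

The code above creates a plot object that maps the view count (TOTAL_VIEWS) in ted_Q1 to x.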



2.1.2 [Univariate EDA] A-1. Histogram - x: view count

gg_ted_m + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
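The message above asks for an explicit bin width. The sketch below shows how one could be set; the data are simulated stand-ins, and the width of 25,000 views is an assumed value chosen only for illustration, not one taken from this analysis.

```r
library(ggplot2)

# Simulated stand-in for ted_Q1: 1092 view counts within the observed range.
set.seed(42)
views <- data.frame(TOTAL_VIEWS = runif(1092, min = 48071, max = 647809))

# binwidth is given in the units of x, so each bar here spans 25,000 views.
p <- ggplot(views, aes(x = TOTAL_VIEWS)) +
  geom_histogram(binwidth = 25000)

p
```

Because binwidth is expressed in the units of the x variable, a value suited to view counts will be in the thousands rather than a small number such as 0.5.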


2.1.3 [Univariate EDA] A-2. Density Graph - x: view count

gg_ted_m + geom_density()
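The density curve gives a visual impression of the distribution; the question of whether the view counts follow a normal distribution can also be checked with a Q-Q plot and a Shapiro-Wilk test. The sketch below uses stats::shapiro.test, which handles a single variable (mshapiro.test from mvnormtest is the multivariate version). The data are simulated from a uniform distribution as a stand-in, so the result says nothing about the TED data itself.

```r
library(ggplot2)

# Simulated stand-in for ted_Q1$TOTAL_VIEWS (uniform, so clearly not normal).
set.seed(1)
views <- data.frame(TOTAL_VIEWS = runif(1092, min = 48071, max = 647809))

# Q-Q plot: points far from the reference line indicate non-normality.
qq_plot <- ggplot(views, aes(sample = TOTAL_VIEWS)) +
  geom_qq() +
  geom_qq_line()
qq_plot

# Shapiro-Wilk test (valid for 3 to 5000 observations).
# A small p-value is evidence against normality.
normality <- shapiro.test(views$TOTAL_VIEWS)
normality$p.value
```

The test does not require standardizing the data first: it checks for normality with any mean and variance, not specifically the standard normal distribution.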


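2.2 Multivariate EDA

These are EDA graphs that use two or more variables.


2.2.1 [Multivariate EDA] B-1. Box plot - x: factor(role), y: view count

  • Outliers can be found per role: outliers exist in the Religion and Humanist categories.
  • The view-count distribution can be seen per role: Religion mostly sits in the lower view-count range, and the roles with the widest view-count spread are Vip, Explorer, and Media, in that order. In contrast, Humanist and Management are located in the higher view-count range.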
ggplot(ted_Q1, aes(factor(Role_refine), TOTAL_VIEWS)) + geom_boxplot()
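To list the outliers rather than only see them, the usual box plot rule (more than 1.5 IQR beyond the quartiles) can be applied per role. This is a sketch on made-up data: the role names "RoleA" and "RoleB" and all values are hypothetical, and quantile() may give slightly different cut-offs than the hinges geom_boxplot() uses.

```r
library(dplyr)

# Made-up data: two hypothetical roles, one with an obvious outlier.
talks <- data.frame(
  Role_refine = c(rep("RoleA", 6), rep("RoleB", 6)),
  TOTAL_VIEWS = c(10, 11, 12, 13, 14, 100, 20, 21, 22, 23, 24, 25)
)

# Keep only rows outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] within each role.
outliers <- talks %>%
  group_by(Role_refine) %>%
  filter(
    TOTAL_VIEWS < quantile(TOTAL_VIEWS, 0.25) - 1.5 * IQR(TOTAL_VIEWS) |
      TOTAL_VIEWS > quantile(TOTAL_VIEWS, 0.75) + 1.5 * IQR(TOTAL_VIEWS)
  ) %>%
  ungroup()

outliers
```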


2.2.2 [Multivariate EDA] B-2. Histogram - x: view count, fill: role

ggplot(ted_Q1, aes(x=TOTAL_VIEWS)) + geom_histogram(aes(fill=Role_refine)) # histogram; binwidth is not an aesthetic, so it is removed from aes()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


2.2.3 [Multivariate EDA] B-3. Scatter plot - x: factor(role), y: view count

  • Outliers can be found per role.
  • The view-count distribution can be seen per role.
  • All points are the same size (size=8).
ggplot(ted_Q1, aes(x=factor(Role_refine), y=TOTAL_VIEWS)) + geom_point(position="identity", size=8, alpha=0.4)

2.2.4 [Multivariate EDA] B-4. Density plot - x: view count, fill: factor(role)

  • The curve that sags on the left is Religion.
  • For roles with only a few talks, it is questionable whether they should be represented this way.
ggplot(ted_Q1, aes(x=TOTAL_VIEWS, fill=factor(Role_refine))) + geom_density(alpha=.3)
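One way to act on the concern about roles with few talks is to count the talks per role before reading the density curves. The sketch below uses a made-up data frame, and the cut-off `min_talks` is an assumed, illustrative threshold rather than a recommended value.

```r
library(dplyr)

# Made-up data: role labels only, with deliberately uneven group sizes.
talks <- data.frame(
  Role_refine = c(rep("Science", 5), rep("Public", 3), "Religion")
)

# Number of talks per role, largest group first.
role_counts <- talks %>%
  count(Role_refine, sort = TRUE, name = "n_talks")
role_counts

# Roles below the assumed threshold deserve caution when interpreting density.
min_talks <- 3
small_roles <- role_counts %>%
  filter(n_talks < min_talks)
small_roles
```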

2.2.5 [Multivariate EDA] B-5. Density plot + facet - x: view count, colour: role

  • The curve that sags on the left is Religion.
  • For roles with only a few talks, it is questionable whether they should be represented this way.
ggplot(ted_Q1, aes(x=TOTAL_VIEWS, colour=Role_refine)) + geom_density() + facet_wrap(~Role_refine, ncol=3, scales="free_y")

#ggplot(ted_Q1, aes(x=TOTAL_VIEWS, fill=factor(Role_refine))) + geom_density(alpha=.3)

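2.2.6 [Multivariate EDA] B-6. Scatter+facet+smooth - x: positive %, y: view count, colour: role

  • Tried with the question "within each role, does a higher positive % go with a higher view count?" in mind.
  • Religion and Explorer have few samples overall, so they should be read with caution.
  • There appears to be some relationship, but it does not show up clearly.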
ggplot(ted_Q1, aes(x=posPer, y=TOTAL_VIEWS, colour=Role_refine)) + geom_point(position="identity", size=4, alpha=0.5) + facet_grid(. ~ Role_refine)+ geom_smooth(method = "lm")

2.2.7 [Multivariate EDA] B-7. Scatter+facet+smooth - x: negative %, y: view count, colour: role

  • The approach is the same as in the previous graph, with positive % replaced by negative %. Put as a question: "do talks with a higher negative % also have lower view counts?"
  • There appears to be some relationship, but it does not show up clearly.
ggplot(ted_Q1, aes(x=negPer, y=TOTAL_VIEWS, colour=Role_refine)) + geom_point(position="identity", size=4, alpha=0.5) + facet_wrap(~Role_refine, ncol=3, scales="free_y")+ geom_smooth(method="lm")
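Since the relationship is hard to judge by eye, a per-role correlation coefficient can put a number on it. The following is a sketch on made-up data: the role names and values are hypothetical, and the Pearson correlation is one assumed choice of measure, matching the linear fit drawn by geom_smooth(method = "lm").

```r
library(dplyr)

# Made-up data: two hypothetical roles with opposite trends.
talks <- data.frame(
  Role_refine = rep(c("RoleA", "RoleB"), each = 4),
  posPer      = c(90, 92, 94, 96, 90, 92, 94, 96),
  TOTAL_VIEWS = c(100, 200, 300, 400, 400, 300, 200, 100)
)
talks$negPer <- 100 - talks$posPer

# Pearson correlation between each percentage and the view count, per role.
role_cor <- talks %>%
  group_by(Role_refine) %>%
  summarise(
    n_talks = n(),
    cor_pos = cor(posPer, TOTAL_VIEWS),
    cor_neg = cor(negPer, TOTAL_VIEWS)
  )

role_cor
```

In the rows printed earlier, posValue + negValue equals totalRate, so negPer is 100 - posPer. When that holds, the negative % plot is a mirror image of the positive % plot and the two correlations have the same magnitude with opposite signs.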