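This document records the EDA carried out on the output of the work organized in ted_quartie.rmd. It mainly loads and processes the data file described below.
**Data covered in the TED Quartile EDA**
This analysis uses ted_quartile_middle. The basic information prepared in advance is listed below.
Items can be added or removed at any time.
The work restructures and cleans the TED Quartile Middle (25%~75%) data so that it can be used for EDA.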
library(dplyr) # data wrangling
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # visualization (EDA)
library(mvnormtest) # multivariate normality test (mshapiro.test)
Loading the data
ted_Q1 = read.csv("ted_quartile_middle.csv", sep=",", stringsAsFactors=FALSE, header=TRUE)
str(ted_Q1)
## 'data.frame': 1092 obs. of 30 variables:
## $ X : int 17 19 21 22 26 27 28 30 31 33 ...
## $ NEWST : chr "p.1-17" "p.1-19" "p.1-21" "p.1-22" ...
## $ NAME : chr "Alex Kipman" "Latif Nasser" "Adam Foss" "Meron Gribetz" ...
## $ ROLE : chr "Inventor" "Radio researcher" "Juvenile justice reformer" "Founder and CEO_ Meta" ...
## $ TITLE : chr "A futuristic vision of the age of holograms" "You have no idea where camels really come from" "A prosecutors vision for a better justice system" "A glimpse of the future through an augmented reality headset" ...
## $ POSITION : chr "TED2016" "TED Talks Live" "TED2016" "TED2016" ...
## $ DURATION : chr "19:05" "12:27" "15:57" "10:54" ...
## $ DAYS : chr "2016-02" "2015-11" "2016-02" "2016-02" ...
## $ SUB_COUNT : chr "4" "7" "5" "10" ...
## $ TOTAL_VIEWS : int 712809 1151849 852920 683227 1090977 661699 1006802 1043401 1086686 838626 ...
## $ TOPICS : chr "NASA_Communication_Computers_Creativity_Design_Engineering_Exploration_Future_Innovation_Interface design_Invention_Microsoft_P"| __truncated__ "Ancient world_Animals_Biology_Biosphere_Curiosity_Environment_Evolution_History_Life_Nature_Paleontology_Science__" "Criminal Justice_Big problems_Choice_Compassion_Decision-making_Education_Government_Inequality_Law_Leadership_Policy_Race_Soci"| __truncated__ "Senses_Augmented reality_Brain_Computers_Creativity_Cyborg_Demo_Design_Engineering_Entrepreneur_Innovation_Interface design_Inv"| __truncated__ ...
## $ Beautiful : int 97 44 158 35 56 407 481 95 143 6 ...
## $ OK : int 61 29 14 44 25 33 32 36 29 52 ...
## $ Funny : int 10 344 6 16 62 56 29 4 41 15 ...
## $ Unconvincing: int 42 18 2 45 11 54 5 0 19 51 ...
## $ Fascinating : int 323 302 81 211 185 119 151 52 284 96 ...
## $ Informative : int 177 302 264 121 181 63 209 109 221 235 ...
## $ Ingenious : int 92 51 33 88 75 58 56 8 167 61 ...
## $ Persuasive : int 11 37 299 14 70 214 263 2 50 71 ...
## $ Inspiring : int 170 113 474 135 354 722 870 28 369 140 ...
## $ Courageous : int 6 11 198 15 47 340 124 2 23 19 ...
## $ Obnoxious : int 25 15 3 12 13 16 8 1 6 33 ...
## $ Confusing : int 11 4 1 10 3 45 15 10 3 9 ...
## $ Longwinded : int 41 22 3 8 12 28 1 7 13 13 ...
## $ Jaw.dropping: int 208 64 19 79 8 49 40 3 77 14 ...
## $ Role_refine : chr "Science" "Science" "Public" "Management" ...
## $ posValue : int 1155 1297 1546 758 1063 2061 2255 339 1404 709 ...
## $ negValue : int 119 59 9 75 39 143 29 18 41 106 ...
## $ totalRate : int 1274 1356 1555 833 1102 2204 2284 357 1445 815 ...
## $ meanValues : int 61 29 14 44 25 33 32 36 29 52 ...
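This step expresses, as percentages, how much of each talk's totalRate comes from posValue and how much from negValue. For example, if 80 of 100 ratings are positive, posPer is 80% and negPer is 20%.
# TODO: apply the same step to the bottom and top quartile data, not only the middle.
ted_Q1$posPer = (ted_Q1$posValue / ted_Q1$totalRate) * 100 # positive share in percent (full column name; posVal only worked through partial matching)
ted_Q1$negPer = (ted_Q1$negValue / ted_Q1$totalRate) * 100 # negative share in percent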
head(ted_Q1, 10)
## X NEWST NAME
## 1 17 p.1-17 Alex Kipman
## 2 19 p.1-19 Latif Nasser
## 3 21 p.1-21 Adam Foss
## 4 22 p.1-22 Meron Gribetz
## 5 26 p.1-26 Joe Gebbia
## 6 27 p.1-27 Casey Gerald
## 7 28 p.1-28 Tshering Tobgay
## 8 30 p.1-30 Laura Robinson
## 9 31 p.1-31 Caleb Harper
## 10 33 p.1-33 Travis Kalanick
## ROLE
## 1 Inventor
## 2 Radio researcher
## 3 Juvenile justice reformer
## 4 Founder and CEO_ Meta
## 5 Designer_ co-founder of Airbnb
## 6 American
## 7 Prime Minister of Bhutan
## 8 Ocean scientist
## 9 Principal Investigator and Director of the Open Agriculture Initiative
## 10 Problem solver-in-chief_ Uber
## TITLE
## 1 A futuristic vision of the age of holograms
## 2 You have no idea where camels really come from
## 3 A prosecutors vision for a better justice system
## 4 A glimpse of the future through an augmented reality headset
## 5 How Airbnb designs for trust
## 6 The gospel of doubt
## 7 This country isnt just carbon neutral — its carbon negative
## 8 The secrets I find on the mysterious ocean floor
## 9 This computer will grow your food in the future
## 10 Ubers plan to get more people into fewer cars
## POSITION DURATION DAYS SUB_COUNT TOTAL_VIEWS
## 1 TED2016 19:05 2016-02 4 712809
## 2 TED Talks Live 12:27 2015-11 7 1151849
## 3 TED2016 15:57 2016-02 5 852920
## 4 TED2016 10:54 2016-02 10 683227
## 5 TED2016 15:51 2016-02 13 1090977
## 6 TED2016 18:19 2016-02 7 661699
## 7 TED2016 18:54 2016-02 12 1006802
## 8 TEDxBrussels 11:21 2014-12 8 1043401
## 9 TEDGlobal>Geneva 15:55 2015-12 7 1086686
## 10 TED2016 19:18 2016-02 9 838626
## TOPICS
## 1 NASA_Communication_Computers_Creativity_Design_Engineering_Exploration_Future_Innovation_Interface design_Invention_Microsoft_Potential_Prediction_Product design_Technology_Visualizations__
## 2 Ancient world_Animals_Biology_Biosphere_Curiosity_Environment_Evolution_History_Life_Nature_Paleontology_Science__
## 3 Criminal Justice_Big problems_Choice_Compassion_Decision-making_Education_Government_Inequality_Law_Leadership_Policy_Race_Social change_Society__
## 4 Senses_Augmented reality_Brain_Computers_Creativity_Cyborg_Demo_Design_Engineering_Entrepreneur_Innovation_Interface design_Invention_Neuroscience_Potential_Prediction_Product design_Technology_Visualizations__
## 5 Behavioral economics_Business_Collaboration_Community_Culture_Design_Economics_Entrepreneur_Future_Innovation_Potential_Privacy_Product design_Relationships_Social change_Technology_Urban planning__
## 6 God_Big problems_Business_Capitalism_Community_Education_Faith_Inequality_Social change_Society__
## 7 Buddhism_Alternative energy_Beauty_Big problems_Biosphere_Climate change_Democracy_Development_Economics_Environment_Future_Global issues_Goal-setting_Government_Green_Happiness_Humanity_Innovation_Leadership_Nature_Politics_Pollution_Sustainability_Trees_World cultures__
## 8 TEDx_Adventure_Ancient world_Animals_Beauty_Biodiversity_Biology_Biosphere_Chemistry_Climate change_Environment_Exploration_Future_History_Life_Nature_Oceans_Science_Water__
## 9 Vaccines_Agriculture_Biomechanics_Biosphere_Biotech_Botany_Chemistry_Climate change_Collaboration_Data_Design_Ebola_Education_Engineering_Environment_Food_Future_Garden_Green_Innovation_Nature_Open-source_Potential_Science_Software_Sustainability_Technology_Virus_Water__
## 10 Brand_Internet_Business_Cars_China_Cities_Economics_Entrepreneur_Environment_Future_Green_India_Innovation_Invention_Investment_Mobility_Pollution_Potential_Society_Software_Sustainability_Technology_Transportation_Web__
## Beautiful OK Funny Unconvincing Fascinating Informative Ingenious
## 1 97 61 10 42 323 177 92
## 2 44 29 344 18 302 302 51
## 3 158 14 6 2 81 264 33
## 4 35 44 16 45 211 121 88
## 5 56 25 62 11 185 181 75
## 6 407 33 56 54 119 63 58
## 7 481 32 29 5 151 209 56
## 8 95 36 4 0 52 109 8
## 9 143 29 41 19 284 221 167
## 10 6 52 15 51 96 235 61
## Persuasive Inspiring Courageous Obnoxious Confusing Longwinded
## 1 11 170 6 25 11 41
## 2 37 113 11 15 4 22
## 3 299 474 198 3 1 3
## 4 14 135 15 12 10 8
## 5 70 354 47 13 3 12
## 6 214 722 340 16 45 28
## 7 263 870 124 8 15 1
## 8 2 28 2 1 10 7
## 9 50 369 23 6 3 13
## 10 71 140 19 33 9 13
## Jaw.dropping Role_refine posValue negValue totalRate meanValues
## 1 208 Science 1155 119 1274 61
## 2 64 Science 1297 59 1356 29
## 3 19 Public 1546 9 1555 14
## 4 79 Management 758 75 833 44
## 5 8 Art 1063 39 1102 25
## 6 49 Public 2061 143 2204 33
## 7 40 Vip 2255 29 2284 32
## 8 3 Science 339 18 357 36
## 9 77 Humanist 1404 41 1445 29
## 10 14 Management 709 106 815 52
## posPer negPer
## 1 90.65934 9.3406593
## 2 95.64897 4.3510324
## 3 99.42122 0.5787781
## 4 90.99640 9.0036014
## 5 96.46098 3.5390200
## 6 93.51180 6.4882033
## 7 98.73030 1.2697023
## 8 94.95798 5.0420168
## 9 97.16263 2.8373702
## 10 86.99387 13.0061350
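As a quick check of the percentage formula, here is a minimal sketch on a small hypothetical data frame. The name `toy` and its values are made up for illustration and are not taken from ted_quartile_middle.csv. It assumes that totalRate equals posValue + negValue, which is what the first rows above suggest.

```r
# Hypothetical toy data: the first row has 100 ratings, 80 of them positive.
toy <- data.frame(posValue = c(80, 45), negValue = c(20, 5))
toy$totalRate <- toy$posValue + toy$negValue  # assumption: total = pos + neg

# Share of positive and negative ratings, in percent.
toy$posPer <- toy$posValue / toy$totalRate * 100
toy$negPer <- toy$negValue / toy$totalRate * 100

print(toy)

# Under the assumption above, the two shares always add up to 100.
stopifnot(all(abs(toy$posPer + toy$negPer - 100) < 1e-9))
```

If the real totalRate ever included ratings that are neither positive nor negative, the two shares would no longer sum to 100, so the final check is also a cheap way to test that assumption on ted_Q1.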
Here we check the mean, variance, standard deviation, median, and quartiles.
#summarise(ted_Q1, mean(TOTAL_VIEWS)) # mean: 436,564.5
#summarise(ted_Q1, var(TOTAL_VIEWS)) # variance: about 2.05e10 (the square of the standard deviation below)
#summarise(ted_Q1, sd(TOTAL_VIEWS)) # standard deviation: 143,277.8
#summarise(ted_Q1, median(TOTAL_VIEWS)) # median: 453,897
quantile(ted_Q1$TOTAL_VIEWS) # quartiles
## 0% 25% 50% 75% 100%
## 48071 336880 453897 562141 647809
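The same summary statistics can be collected in one place. The sketch below uses only base R and a hypothetical vector `views` (made-up numbers, not the TED data); in the real analysis `views` would be `ted_Q1$TOTAL_VIEWS`.

```r
# Hypothetical view counts, used only to illustrate the calls.
views <- c(120000, 250000, 310000, 480000, 640000)

stats <- c(
  mean   = mean(views),
  var    = var(views),     # sample variance (n - 1 denominator)
  sd     = sd(views),
  median = median(views)
)
print(stats)
print(quantile(views))     # 0%, 25%, 50%, 75%, 100%

# Sanity check: the variance is the square of the standard deviation.
stopifnot(isTRUE(all.equal(var(views), sd(views)^2)))
```

The final check is useful when copying figures into comments by hand, since a dropped digit in the variance shows up immediately against the standard deviation.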
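Data cleaning ends here; the EDA code using ggplot2 follows.
For the TED Quartile Middle (25%~75%) EDA, we first write code to check the basic shape of the data.
gg_ted_m = ggplot(ted_Q1, aes(x=TOTAL_VIEWS)) # base plot object with the view count (TOTAL_VIEWS) of ted_Q1 mapped to x
The code above creates a plot object that maps the view count (TOTAL_VIEWS) of ted_Q1 to the x axis; the geoms below are added to it.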
gg_ted_m + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_ted_m + geom_density()
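The `stat_bin()` message above asks for an explicit bin width. One common choice is the Freedman-Diaconis rule, sketched below; this is a swapped-in rule of thumb, not a choice made in the original analysis. The vector `views` is hypothetical random data, and the plot is only built if ggplot2 is installed.

```r
# Hypothetical view counts; in the real analysis use ted_Q1$TOTAL_VIEWS.
set.seed(1)
views <- round(runif(200, min = 50000, max = 650000))

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
bw <- 2 * IQR(views) / length(views)^(1/3)
print(bw)

# Build the histogram only if ggplot2 is available, so the sketch
# also runs without it. Call print(p) to draw the plot.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(TOTAL_VIEWS = views),
                       ggplot2::aes(x = TOTAL_VIEWS)) +
    ggplot2::geom_histogram(binwidth = bw)
}
```

Note that `binwidth` is passed as an argument to `geom_histogram()`, outside `aes()`, and is expressed in the units of the x variable (views).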
The following EDA plots use two or more variables.
ggplot(ted_Q1, aes(factor(Role_refine), TOTAL_VIEWS)) + geom_boxplot()
ggplot(ted_Q1, aes(x=TOTAL_VIEWS)) + geom_histogram(aes(fill=Role_refine), bins=30) # histogram; bins/binwidth are parameters, not aesthetics, so they go outside aes()
ggplot(ted_Q1, aes(x=factor(Role_refine), y=TOTAL_VIEWS)) + geom_point(position="identity", size=8, alpha=0.4)
ggplot(ted_Q1, aes(x=TOTAL_VIEWS, fill=factor(Role_refine))) + geom_density(alpha=.3)
ggplot(ted_Q1, aes(x=TOTAL_VIEWS, colour=Role_refine)) + geom_density() + facet_wrap(~Role_refine, ncol=3, scales="free_y")
#ggplot(ted_Q1, aes(x=TOTAL_VIEWS, fill=factor(Role_refine))) + geom_density(alpha=.3)
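The box plot by role can be backed with numbers. The following sketch computes the median, interquartile range, and group size per `Role_refine` value using base R; the data frame `toy` is hypothetical and only reuses category names that appear in the data above.

```r
# Hypothetical rows; the numbers are made up for illustration.
toy <- data.frame(
  Role_refine = c("Science", "Science", "Public", "Public", "Art"),
  TOTAL_VIEWS = c(400000, 600000, 300000, 500000, 450000)
)

# Per-role median and IQR: the centre line and the height of each box.
med <- tapply(toy$TOTAL_VIEWS, toy$Role_refine, median)
iqr <- tapply(toy$TOTAL_VIEWS, toy$Role_refine, IQR)
n   <- table(toy$Role_refine)

by_role <- data.frame(
  Role_refine = names(med),
  median      = as.vector(med),
  iqr         = as.vector(iqr),
  n           = as.vector(n)
)
print(by_role)
```

Reporting `n` next to each median helps when reading the faceted plots, since a role with only a few talks can produce a misleading box or density curve.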
ggplot(ted_Q1, aes(x=posPer, y=TOTAL_VIEWS, colour=Role_refine)) + geom_point(size=4, alpha=0.5) + facet_grid(. ~ Role_refine) + geom_smooth(method="lm")
ggplot(ted_Q1, aes(x=negPer, y=TOTAL_VIEWS, colour=Role_refine)) + geom_point(size=4, alpha=0.5) + facet_wrap(~Role_refine, ncol=3, scales="free_y") + geom_smooth(method="lm")
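The `geom_smooth(method="lm")` layers draw one least-squares line per role. To read the slopes as numbers, a minimal sketch is to fit `lm()` separately for each role. The data frame `toy` below is hypothetical, with made-up and perfectly linear values chosen so the slopes are easy to verify by hand.

```r
# Hypothetical data: two roles with made-up linear relations.
toy <- data.frame(
  Role_refine = rep(c("Science", "Public"), each = 4),
  posPer      = c(90, 92, 94, 96, 90, 92, 94, 96),
  TOTAL_VIEWS = c(400000, 420000, 440000, 460000,
                  500000, 490000, 480000, 470000)
)

# One least-squares fit per role, analogous to one smooth line per facet.
slopes <- sapply(split(toy, toy$Role_refine), function(d) {
  coef(lm(TOTAL_VIEWS ~ posPer, data = d))[["posPer"]]
})
print(slopes)  # change in views per one percentage point of posPer
```

On the real data the same call would use `ted_Q1` in place of `toy`. A slope close to zero within a role would suggest that the share of positive ratings says little about the view count for that role.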