Seminar (2)

Learning objectives

Intro to Graphics
Components of Graphics
ggplot2(1)
dplyr(1) - Data Verbs
base package vs. ggplot2

reference: UC Berkeley Stat 133 Data Computing By Adam Lucas

1. Intro to Graphics

This session uses the following package: mosaicData.

install.packages("mosaicData")

Loading data into R with data()

The data() function loads a specific dataset from a specific package. For example, the CPS85 (1985 Current Population Survey) dataset in the mosaicData package can be loaded as follows.

wage	educ	race	sex	hispanic	south	married	exper	union	age	sector
9.0	10	W	M	NH	NS	Married	27	Not	43	const
5.5	12	W	M	NH	NS	Married	20	Not	38	sales
3.8	12	W	F	NH	NS	Single	4	Not	22	sales
10.5	12	W	F	NH	NS	Married	29	Not	47	clerical
15.0	12	W	M	NH	NS	Married	40	Union	58	const
9.0	16	W	F	NH	NS	Married	27	Not	49	clerical

mosaicData contains many datasets, and once the package is attached (i.e. library(mosaicData)) every dataset it contains becomes available in the session. When only one particular dataset is needed rather than all of them, data() is the useful tool.

data("CPS85",package="mosaicData")

The command above loads only the CPS85 dataset.

Basic formulas in R

A formula in R uses ~ to express a relationship between variables. Just as y ~ x reads as "y as a function of x", the formula wage ~ age treats age as the independent variable and wage as the dependent variable.
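As a minimal sketch of how a formula behaves as an ordinary R object, the snippet below uses the built-in mtcars data instead of CPS85 (an assumption made only so that it runs without installing mosaicData). The names f and fit are illustrative.

```r
# A formula is an ordinary R object; mtcars ships with base R.
f <- mpg ~ wt                  # mpg (dependent) described by wt (independent)
class(f)                       # the class of the object is "formula"
all.vars(f)                    # variables named in the formula, response first

# The same formula can be handed to a modelling function such as lm()
fit <- lm(f, data = mtcars)
coef(fit)                      # intercept and slope of the fitted line
```

The same object could be passed to other formula-aware functions, which is why the notation is worth learning before moving on to graphics.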

2. Components of Graphics

Let's look at the elements of a graphic as used in the ggplot2 package.

Glyphs and Data

The word glyph has its origin in hieroglyphs (pictographic writing).

Data Glyphs (Geoms)

A data glyph is a visual representation of data.

Some are very simple (e.g. dots),
some summarise many data points (e.g. histograms),
and some represent the data in a more complex form (e.g. a confidence interval for the expected conditional mean).

See: http://docs.ggplot2.org/current/

Components of graphs made with the ggplot2 package:

Let's look at them using the mosaicData::CPS85 dataset.

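library(dplyr)    # provides the %>% pipe used below
library(ggplot2)  # provides ggplot(), aes() and the geoms
data(CPS85, package="mosaicData")
head(CPS85)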

wage	educ	race	sex	hispanic	south	married	exper	union	age	sector
9.0	10	W	M	NH	NS	Married	27	Not	43	const
5.5	12	W	M	NH	NS	Married	20	Not	38	sales
3.8	12	W	F	NH	NS	Single	4	Not	22	sales
10.5	12	W	F	NH	NS	Married	29	Not	47	clerical
15.0	12	W	M	NH	NS	Married	40	Union	58	const
9.0	16	W	F	NH	NS	Married	27	Not	49	clerical

Frame = the rectangular region in which the glyphs are drawn.

ggplot()

Aesthetics = properties of the frame or glyphs that are tied to variables in the data table. For example, the colour, shape, and position of points are aesthetics.

Scales = control the mapping between variables in the data table and aesthetics.

CPS85 %>% ggplot(aes(x=age,y=wage))

Glyph = the geometrical objects inside the frame.

CPS85 %>% ggplot(aes(x=age,y=wage)) + geom_point()

Graphical Attributes = properties of the glyphs that are not tied to any variable in the data table, for example transparency (alpha) or a colour that applies to every point (as in this case).

CPS85 %>% ggplot(aes(x=age,y=wage)) + geom_point(alpha=.2, colour="red")
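The difference between an aesthetic and a graphical attribute comes down to whether the value sits inside aes(). The sketch below illustrates this with the built-in mtcars data (substituted for CPS85 only so that it runs without extra packages); p_mapped and p_fixed are illustrative names.

```r
library(ggplot2)

# Aesthetic: colour is mapped to a variable inside aes(),
# so ggplot2 builds a colour scale and a legend for it.
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)))

# Graphical attribute: colour and alpha are constants set outside aes(),
# so every point looks the same and no legend is drawn.
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "red", alpha = 0.2)
```

Printing either object (for example print(p_mapped)) draws the corresponding plot.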

Facets = a display split into multiple panels, one for each level of a categorical variable.

CPS85 %>% ggplot(aes(x=age,y=wage)) + geom_point() + facet_grid(married ~ .)

Guides = whatever lets the viewer read off the scales (mappings) used in the graph.

CPS85 %>% ggplot(aes(x=age,y=wage)) + geom_point(aes(shape=sex)) + facet_grid(married ~ .)

Examples of guides:

Axes and tick labels

Legends

Labels on faceted graphics

Layers = when two or more glyphs are drawn in the same graph, the graph is said to have layers.

CPS85 %>% ggplot(aes(x=age,y=wage)) +geom_point(colour = "pink", size = 4) + geom_point(colour = "black", size = 1.5)

Example:

A subset of the mosaic::NHANES data and its graphical representation.

sbp	dbp	sex	smoker
129	75	male	never
105	62	female	never
122	72	male	never
128	83	female	former
123	90	male	former
122	77	male	current

The components are:

Frame: rectangular region

Glyph: points

Facets: sex

Aesthetics: The frame's aesthetics are x and y. The points' aesthetic is color (mapped to smoker).

Scales: x=sbp, y=dbp, color=smoker

Graphical attributes: size, alpha, shape

Guides: tick mark on axes, labels on faceted graphs, legend

Layers: none

The corresponding ggplot2 command is:

p <- ggplot(df, aes(x = sbp, y = dbp)) + 
  xlab("Systolic BP") + ylab("Diastolic BP")
p + geom_point(size=5, aes(color=smoker), alpha=.8, shape=17) +facet_grid(. ~ sex)

Discuss

Think about each of the components:

Frame:

Glyphs:

Facets:

Aesthetics:

Scales:

Graphical attributes:

Guides:

Layers:

The following graph is taken from nytimes. (prediction of 36 senate seats from different polling organizations)

3. `ggplot2`(1)

The ggplot2 package is built on the graph components covered above (i.e. glyphs, aesthetics, frames, scales, layers). Together these components are called the grammar of graphics. From here on, glyphs are referred to as geoms.

examples

Here is the mosaicData::CPS85 dataset:

data(CPS85,package="mosaicData")
head(CPS85)

wage	educ	race	sex	hispanic	south	married	exper	union	age	sector
9.0	10	W	M	NH	NS	Married	27	Not	43	const
5.5	12	W	M	NH	NS	Married	20	Not	38	sales
3.8	12	W	F	NH	NS	Single	4	Not	22	sales
10.5	12	W	F	NH	NS	Married	29	Not	47	clerical
15.0	12	W	M	NH	NS	Married	40	Union	58	const
9.0	16	W	F	NH	NS	Married	27	Not	49	clerical

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point()

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=sex))

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=sex)) + facet_grid(married ~ .)

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=married)) + ylim(0,30)

Material on ggplot2 is available through the following links:

ggplot

Rstudio

Discuss

Using CPS85, create the following.

4. Data Verbs

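Data tables rarely come glyph ready (a state in which they can be graphed with ggplot2 without further processing). Some manipulation of the data is therefore needed, a process called wrangling. A data verb is a function that takes a data table as input and returns the desired data table as output.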

summarise() and group_by()
select()
mutate()
filter()
arrange()

1) `summarise()` and `group_by()`

summarise() and group_by() are the most commonly used data verbs:

summarise() uses reduction functions such as n(), sum(), or mean() to collapse many cases into a single case.

For example:

head(DataComputing::BabyNames)

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880

BabyNames %>% summarise(num_cases=n())  #gives the number of rows.

num_cases
1792091

Here n() is a reduction function that works only inside dplyr verbs such as summarise().

Instead of summarise(num_cases = n()), the nrow() or tally() functions give the same count.

BabyNames %>% nrow()

## [1] 1792091

BabyNames %>% tally()

n
1792091

tally() counts as a data verb because its output is a data table, whereas nrow() does not because its output is an integer.
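The distinction can be checked directly. This sketch uses the built-in mtcars data in place of BabyNames (an assumption so that it runs without the DataComputing package); the object names are illustrative.

```r
library(dplyr)

rows_int <- mtcars %>% nrow()                       # a plain integer
rows_tbl <- mtcars %>% tally()                      # one-row data frame, column n
rows_sum <- mtcars %>% summarise(num_cases = n())   # one-row data frame, column num_cases

is.data.frame(rows_int)   # not a data table, so nrow() is not a data verb
is.data.frame(rows_tbl)   # a data table, so tally() is a data verb
```

Because tally() and summarise() return data tables, their output can be piped straight into another data verb, which is not possible with the bare integer from nrow().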

BabyNames %>% summarise(average=mean(count)) #gives the average of counts

average
186.0496

mean() and sum() are reduction functions that take a variable (e.g. count) as their argument.

To get a summary per name as the output data table, use group_by() together with summarise().

BabyNames %>% 
  group_by(name) %>%
  summarise(num_cases=n()) %>%
  head()

name	num_cases
Aaban	6
Aabha	2
Aabid	1
Aabriella	1
Aadam	22
Aadan	8

To split each name further by the subgroup sex, pass both name and sex to group_by() (note the order: the variable that comes second is the subgroup).

BabyNames %>% 
  group_by(name, sex) %>%
  summarise(num_cases=n(), sum_cases=sum(count))

## # A tibble: 102,690 x 4
## # Groups:   name [?]
##         name   sex num_cases sum_cases
##        <chr> <chr>     <int>     <int>
##  1     Aaban     M         6        56
##  2     Aabha     F         2        12
##  3     Aabid     M         1         5
##  4 Aabriella     F         1         5
##  5     Aadam     M        22       177
##  6     Aadan     M         8       104
##  7   Aadarsh     M        13       140
##  8     Aaden     F         1         5
##  9     Aaden     M        13      3677
## 10    Aadesh     M         3        15
## # ... with 102,680 more rows
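The "Groups: name" line in the output above shows that summarise() removes only the last grouping variable, so the result is still grouped by name. A small sketch with the built-in mtcars data (substituted for BabyNames so that it runs as is) shows the same two-level grouping and, assuming dplyr 1.0.0 or later, how the .groups argument returns an ungrouped table instead.

```r
library(dplyr)

by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%                 # am is the subgroup within cyl
  summarise(num_cases = n(),            # rows in each (cyl, am) combination
            mean_mpg  = mean(mpg),      # a second reduction in the same call
            .groups   = "drop")         # return a table with no grouping left

by_cyl_am
```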

What are the three most popular names in BabyNames?

BabyNames %>%
  group_by(name) %>%
  summarise(tot=sum(count)) %>%
  arrange(desc(tot)) %>%
  head(3)

name	tot
James	5114325
John	5095590
Robert	4809858

Discuss

We want to make a bar plot of how many first-choice votes each candidate received. Use data verbs to turn the following data into a glyph-ready table.

#not glyph ready
head(DataComputing::Minneapolis2013)

Precinct	First	Second	Third	Ward
P-10	BETSY HODGES	undervote	undervote	W-7
P-06	BOB FINE	MARK ANDREW	undervote	W-10
P-09	KURTIS W. HANNA	BOB FINE	MIKE GOULD	W-10
P-05	BETSY HODGES	DON SAMUELS	undervote	W-13
P-01	DON SAMUELS	undervote	undervote	W-5
P-04	undervote	undervote	undervote	W-6

In other words, the data should end up in the following form.

#glyph ready
FirstPlaceTally

candidate	total
BETSY HODGES	28935
MARK ANDREW	19584
DON SAMUELS	8335
CAM WINTON	7511
JACKIE CHERRYHOMES	3524
BOB FINE	2094
DAN COHEN	1798
STEPHANIE WOODRUFF	1010
MARK V ANDERSON	975
undervote	834
DOUG MANN	779
OLE SAVIOR	693
ALICIA K. BENNETT	351
JAMES EVERETT	347
ABDUL M RAHAMAN “THE ROCK”	338
CAPTAIN JACK SPARROW	264
TONY LANE	219
MIKE GOULD	204
KURTIS W. HANNA	200
JAYMIE KELLY	196
CHRISTOPHER CLARK	188
CHRISTOPHER ROBIN ZIMMERMAN	170
JEFFREY ALAN WAGNER	164
TROY BENJEGERDES	148
NEAL BAXTER	145
GREGG A. IVERSON	144
UWI	117
JOSHUA REA	108
MERRILL ANDERSON	108
BILL KAHN	97
JOHN LESLIE HARTWIG	97
overvote	93
EDMUND BERNARD BRUYERE	70
RAHN V. WORKCUFF	65
JAMES “JIMMY” L. STROUD, JR.	64
BOB “AGAIN” CARNEY JR	56
CYD GORMAN	39
JOHN CHARLES WILSON	37

Reorder the data from the precinct with the most ballots cast to the precinct with the fewest.

Minneapolis2013 %>%
  group_by(Precinct) %>%
  summarise(count=n()) %>%     # n() finds how many cases there are
  arrange(desc(count))

Precinct	count
P-06	9711
P-02	9551
P-08	9430
P-03	8703
P-05	8490
P-07	8104
P-04	7753
P-01	7301
P-09	5342
P-10	1561
P-04D	852
P-02D	822
P-05A	742
P-03A	730
P-01C	505
P-6C	504

#or


Minneapolis2013 %>%
  group_by(Precinct) %>%
  tally(sort=TRUE)

Precinct	n
P-06	9711
P-02	9551
P-08	9430
P-03	8703
P-05	8490
P-07	8104
P-04	7753
P-01	7301
P-09	5342
P-10	1561
P-04D	852
P-02D	822
P-05A	742
P-03A	730
P-01C	505
P-6C	504

Order the precincts as in the previous task, but this time by the number of ballots that list “BETSY HODGES” as First.

Minneapolis2013 %>%
    filter(First =="BETSY HODGES") %>% 
    group_by(Precinct) %>%
    tally(sort=TRUE)

Precinct	n
P-06	3762
P-02	3739
P-08	3480
P-07	3073
P-05	2895
P-01	2793
P-03	2663
P-04	2571
P-09	1943
P-10	486
P-02D	326
P-04D	319
P-05A	303
P-03A	295
P-01C	161
P-6C	126

Find how many ballots list “BETSY HODGES” as both First and Second.

Minneapolis2013 %>%
  filter(Second == "BETSY HODGES", First == "BETSY HODGES") %>%
  tally()   #could also use nrow here

n
222

Minneapolis2013 %>%
  group_by(First,Second) %>%
  tally() %>%
  filter(First=="BETSY HODGES", Second=="BETSY HODGES")

## # A tibble: 1 x 3
## # Groups:   First [1]
##          First       Second     n
##          <chr>        <chr> <int>
## 1 BETSY HODGES BETSY HODGES   222

2) `select()`

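The select function picks one or more variables from a data table. Reasons to use it include:

to simplify the data table you are working with
to give a more suitable name to a variable that needs renaming

The select function takes a data table as input and returns a data table made up of only the selected variables.

Let's look at an example with the BabyNames data table: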

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880

Selecting only the name and year variables:

name	year
Mary	1880
Anna	1880
Emma	1880
Elizabeth	1880
Minnie	1880
Margaret	1880

To give a variable a new name, pass an argument such as when=year to the select function:

BabyNames %>% select( name, when=year )

name	when
Mary	1880
Anna	1880
Emma	1880
Elizabeth	1880
Minnie	1880
Margaret	1880
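Renaming inside select() also drops every variable that is not listed. As a hedged sketch on the built-in mtcars data (used here in place of BabyNames so that it runs without extra packages), the snippet below contrasts it with dplyr's rename(), which keeps all the other variables. The new name weight is illustrative.

```r
library(dplyr)

# select() keeps only the listed variables, renaming wt on the way
renamed <- mtcars %>% select(mpg, weight = wt)
names(renamed)

# rename() changes the name but keeps every other variable
kept <- mtcars %>% rename(weight = wt)
ncol(kept)
```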

3) `filter()`

filter() is used to drop unwanted cases. It is the counterpart of select(), seen above: select() takes particular variables as input, whereas filter() takes particular cases.

In select(), the arguments are the names of variables.

BabyNames %>% select( year, count ) %>% head()

year	count
1880	7065
1880	2604
1880	2003
1880	1939
1880	1746
1880	1578

In filter(), the cases are chosen by a criterion, meaning a logical expression that evaluates to true or false, built with operators such as ==, >, <, or %in%. For example, the following command filters out the boys and returns only the girls' names.

BabyNames %>% filter( sex=="F") %>%
  sample_n( size=6 )

	name	sex	count	year
505956	Sintia	F	5	1979
544977	Iman	F	48	1983
12108	Zella	F	108	1890
1061450	Malanya	F	5	2013
408178	Pura	F	7	1970
699230	Dolly	F	38	1994

The following gives names of babies born after 1990.

BabyNames %>% filter( year > 1990 ) %>% 
  sample_n( size=6 )

	name	sex	count	year
81527	Corinthia	F	19	1994
362438	Cayleigh	F	74	2004
371508	Railynn	F	9	2004
126598	Khalique	M	6	1995
161074	Shacora	F	16	1997
274132	Porscha	F	19	2001

Girls among the babies born after 1990:

BabyNames %>% filter( year > 1990, sex=="F") %>%
  head()

	name	sex	count	year
312913	Elaria	F	5	2008
271471	Areiana	F	5	2006
104141	Ashling	F	8	1997
65755	Tiandra	F	37	1995
163591	Marykate	F	85	2001
411818	Kemauri	F	5	2013

There can be more than one criterion. When given several criteria, filter() keeps only the cases that satisfy all of them.

To express "or" instead (the babies who are female or born after 1990):

BabyNames %>% filter( year>1990 | sex=="F")
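To make the difference between the comma (and) and | (or) concrete, here is a small sketch on the built-in mtcars data, again substituted for BabyNames so that it runs as is. The thresholds are arbitrary illustrations.

```r
library(dplyr)

# Comma-separated criteria are combined with AND
both <- mtcars %>% filter(mpg > 25, am == 1)

# A single criterion using | keeps cases that satisfy at least one side
either <- mtcars %>% filter(mpg > 25 | am == 1)

nrow(both)     # never larger than nrow(either)
nrow(either)
```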

Names of babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010)) %>%
  sample_n( size=6 )

	name	sex	count	year
56355	Fraida	F	7	2000
14166	Danyell	M	27	1980
39850	Ka	M	10	1990
72041	Tomias	M	6	2000
30031	Holliann	F	7	1990
30848	Adrine	F	6	1990

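Combining filter() with group_by() gets the desired result efficiently. For example, to find the names whose count is above 100 in every row (that is, whose smallest count is still above 100), keep each name's minimum-count row and then require that minimum to exceed 100: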

BabyNames %>% group_by(name) %>%
  filter(count==min(count)) %>%
  filter(count>100) %>%
  head()

## # A tibble: 4 x 4
## # Groups:   name [4]
##       name   sex count  year
##      <chr> <chr> <int> <int>
## 1   Jessie     M   143  1881
## 2 Jacqueli     F   157  1989
## 3 Cassandr     F   152  1989
## 4 Christop     M  1082  1989
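The key point is that, after group_by(), a reduction function inside filter() is evaluated separately within each group. A minimal sketch on the built-in mtcars data (standing in for BabyNames) keeps the least fuel-efficient car or cars of each cylinder count; the name lowest is illustrative.

```r
library(dplyr)

lowest <- mtcars %>%
  group_by(cyl) %>%
  filter(mpg == min(mpg)) %>%   # min() is computed within each cyl group
  ungroup()

lowest %>% select(cyl, mpg)
```

Without the group_by() step, min(mpg) would be the single minimum over the whole table and only the overall worst cars would survive.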

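To keep only the baby names that have been in use for more than 100 years, count the rows for each name. Note that n() counts rows, and BabyNames has one row per name, sex, and year, so a name given to both sexes is counted twice for the same year; n_distinct(year) would count distinct years instead.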

BabyNames %>% group_by(name) %>%
  summarise(years_used=n()) %>%
  filter(years_used>100) %>%
  head()

name	years_used
Aaron	218
Abbie	176
Abby	156
Abe	134
Abel	150
Abigail	169

4) `mutate()`

The mutate function modifies an existing variable or creates a new (derived) variable. It changes variables only and never removes cases.

For example, the CountryData data table has the variables pop and area. To get the population density, mutate can create a new variable.

  CountryData %>% 
  mutate( popDensity=pop/area ) %>% 
  select( country, pop, area, popDensity) %>%
  sample_n(size=6)

	country	pop	area	popDensity
106	Howland Island	NA	2	NA
235	Turkmenistan	5171943	488100	10.596073
92	Greece	10775557	131957	81.659609
218	Suriname	573311	163820	3.499640
126	Korea, South	49039986	99720	491.776835
28	Bolivia	10631486	1098581	9.677471
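The same pattern works on any data table. As a sketch that runs without the DataComputing package, the snippet below derives a power-to-weight ratio from the built-in mtcars data; hp_per_wt is a hypothetical variable name chosen for illustration.

```r
library(dplyr)

with_ratio <- mtcars %>%
  mutate(hp_per_wt = hp / wt) %>%    # new variable computed row by row
  select(hp, wt, hp_per_wt)

nrow(with_ratio) == nrow(mtcars)     # mutate() never drops cases
```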

5) `arrange()`

The arrange function reorders the cases. It does not modify variables; that's a job for mutate(). Nor does it filter out cases. It leaves the cases themselves untouched and simply sorts them by a criterion the user specifies.

For example, below are the results of the 2013 Minneapolis mayoral election, tallied by counting first choices only.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  head()

First	total
ABDUL M RAHAMAN “THE ROCK”	338
ALICIA K. BENNETT	351
BETSY HODGES	28935
BILL KAHN	97
BOB “AGAIN” CARNEY JR	56
BOB FINE	2094

The result above is sorted alphabetically by candidate name. If the goal is to compare totals across candidates, sorting by total in descending order may be a more effective way to present the data.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  arrange( desc(total) ) %>%
  head()

First	total
BETSY HODGES	28935
MARK ANDREW	19584
DON SAMUELS	8335
CAM WINTON	7511
JACKIE CHERRYHOMES	3524
BOB FINE	2094
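arrange() also accepts several sort keys, with later keys breaking ties in earlier ones. A brief sketch on the built-in mtcars data (used instead of Minneapolis2013 so that it runs as is):

```r
library(dplyr)

sorted <- mtcars %>%
  select(cyl, mpg) %>%
  arrange(cyl, desc(mpg))   # ascending cyl, then descending mpg within each cyl

head(sorted, 3)
```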

Data verb languages

We have been using the dplyr package. It is important to recognise, though, that dplyr is only one of several ways to do data wrangling. Others include:

dplyr - used in R
data.table - used in R (for big data)
SQL - used on database servers

The same operation in each notation:

dplyr: BabyNames %>% group_by(year, sex) %>% summarise( nNames=n() )
data.table (with BabyNames stored as a data.table): BabyNames[, .(nNames = .N), by = .(year, sex)]
SQL: SELECT year, sex, COUNT(*) AS nNames FROM BabyNames GROUP BY year, sex
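Base R offers yet another notation for the same kind of grouped count. The sketch below is an illustration on the built-in mtcars data rather than BabyNames (an assumption so that it runs without extra data), comparing dplyr with base R's aggregate().

```r
library(dplyr)

# dplyr: group, then count the rows in each group
dplyr_counts <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(n = n(), .groups = "drop")

# base R: aggregate() with length() as the reduction function
base_counts <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = length)
names(base_counts)[3] <- "n"   # the counted column is named mpg by default

dplyr_counts
base_counts
```

Both tables hold the same counts; only the row order and the notation differ.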

5. base package vs. `ggplot2`

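Let's use the mtcars data set.

# In mtcars, am == 0 is automatic and am == 1 is manual,
# so despite its name mtcars_m holds the automatic cars.
mtcars_m <- mtcars %>% 
  filter(am==0)

# mtcars_a holds the manual cars (am == 1)
mtcars_a <- mtcars %>%
  filter(am==1) 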
  
head(mtcars_a)

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1

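Before the ggplot package was developed, graphing in R was handled by the base R package. Many researchers still work with base graphics, so let's compare the base package with ggplot and look at the differences.

Example 1

In the base package, a scatter plot of mpg against wt for the cars in mtcars_m (am == 0, which mtcars documents as automatic transmission) is drawn with the following command.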

plot( mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))

So far there seems to be no problem. To add the cars in mtcars_a (am == 1, manual transmission) to the plot above as a new layer, the commands become:

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
points( mtcars_a$wt,mtcars_a$mpg, col="blue")

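Limitations of the base package:

The plot is not redrawn: when points are added, the axes keep the range set by the first plot() call.
The plot is drawn only as an image; it is not represented as an object. In ggplot the graph is an object, so changes are easy to apply.
A legend, if needed, has to be built by hand. For example, a legend explaining what each colour in the plot above means must be created manually.

With ggplot: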

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) + facet_wrap(~am)
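The "image versus object" point can be checked directly. In the sketch below (the names base_result, p, and p2 are illustrative), the base plot() call returns nothing to work with, while the ggplot object can be stored and extended afterwards.

```r
library(ggplot2)

# Base graphics draw straight onto the device; plot() returns NULL
base_result <- plot(mtcars$wt, mtcars$mpg)
is.null(base_result)

# ggplot() returns an object that can be stored and extended later
p  <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = as.factor(cyl)))
p2 <- p + facet_wrap(~ am)    # reuse p and add facets without starting over
print(p2)
```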

Discuss

The vector precip holds annual precipitation for a number of cities. Using the base package, make the following histogram with the hist function (hint: try hist(precip)). Then make the histogram with ggplot's geom_histogram() function. Because ggplot takes a data table as input, an as.data.frame(precip) step will be needed.

ggplot2.org

Example 2

Suppose we want to fit a linear model of mpg ~ wt (weight).

# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars_m)
carModel

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars_m)
## 
## Coefficients:
## (Intercept)           wt  
##      31.416       -3.786

Let's add the linear model and a legend to the plot drawn earlier with the base package.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
carModel <- lm(mpg ~ wt, data = mtcars_m)
abline(carModel, lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

Alternatively, if a separate linear model is fitted for each cylinder count, the plot looks like this.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==4)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==6)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==8)), lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

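Of course, the lapply() function can express the same plot more concisely.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
cyl_levels <- sort(unique(mtcars_m$cyl))   # 4, 6, 8: one line per level, not per row
invisible(lapply(seq_along(cyl_levels), function(i) {
  # line i uses colour i, matching col = as.factor(cyl) in plot()
  abline(lm(mpg ~ wt, mtcars_m, subset = (cyl == cyl_levels[i])), col = i)
}))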
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

All of this, however, is very simple to do in ggplot.

Note that the color aesthetic is placed in the ggplot frame rather than in geom_point, so that the points and the regression lines share the same colours. The legend is also produced automatically, with no extra work.

mtcars_m %>% ggplot(aes(x = wt, y = mpg, col = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
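Because the colour aesthetic also defines groups, geom_smooth(method = "lm") fits one regression per cylinder count. As a rough check of what those lines represent, the sketch below fits the same per-group models with base R only; cars0, fits, and coefs are illustrative names, and cars0 is built with subset() so that the snippet runs without dplyr.

```r
# Same cars as mtcars_m in the text (am == 0), built with base R only
cars0 <- subset(mtcars, am == 0)

# Fit mpg ~ wt separately for each cylinder count
fits  <- lapply(split(cars0, cars0$cyl), function(g) lm(mpg ~ wt, data = g))

# One column per cylinder count; rows are the intercept and the wt slope
coefs <- sapply(fits, coef)
coefs
```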

SeoHyeong Jeong

2018/01/11
