Data Analysis

Day 2 Study Notes

Creating Derived Variables

In R, a derived variable is a new variable created inside a data frame from the other variables already in that data frame.

Procedure

To walk through the procedure, load the cars data and extract its column names. The cars data is a data frame with 2 variables and 50 observations.

The variable names are as follows.

## [1] "speed" "dist"
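
The code for this step is not shown in these notes. A minimal sketch, assuming the built-in cars dataset from R's datasets package, might look like this:

```r
# Load the built-in cars dataset (ships with R's datasets package)
data(cars)

# Column names of the data frame
names(cars)

# Number of observations (rows) and variables (columns)
dim(cars)
```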

Before creating the derived variable, the variable names were changed to Korean: 속도 (speed) and 거리 (distance).

The first six rows of the result are shown below.

##   속도 거리
## 1    4    2
## 2    4   10
## 3    7    4
## 4    7   22
## 5    8   16
## 6    9   10
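
The renaming step could be sketched as follows. This assumes a UTF-8 session so that the Korean column names display correctly; the exact call used in the notes is not shown.

```r
data(cars)

# Rename the columns: 속도 = speed, 거리 = distance
names(cars) <- c("속도", "거리")

# Show the first six rows
head(cars)
```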

The new derived variable is named ‘속도x거리’ (speed x distance), and its value is the product of speed and distance.

##   속도 거리 속도x거리
## 1    4    2         8
## 2    4   10        40
## 3    7    4        28
## 4    7   22       154
## 5    8   16       128
## 6    9   10        90
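
One way to produce the table above, written as a sketch with base R indexing (the original code is not shown, so the use of `[[ ]]` here is an assumption):

```r
data(cars)
names(cars) <- c("속도", "거리")

# Derived variable: product of speed (속도) and distance (거리)
cars[["속도x거리"]] <- cars[["속도"]] * cars[["거리"]]

head(cars)

# The data frame now has one more column
dim(cars)
```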

As a result, the cars data has been updated to a data frame with 3 variables and 50 observations.

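A new variable can also be created whose values depend on a condition, as follows. In the output, "느림" means slow and "빠름" means fast.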

## [1] "느림" "느림" "느림" "느림" "빠름" "빠름"
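
A conditional variable of this kind is typically built with ifelse(). In the sketch below the cut-off of 7 is an assumption, chosen only because it is consistent with the six labels printed above; the actual threshold is not shown in the notes, and `speed_label` is a hypothetical name.

```r
data(cars)
names(cars) <- c("속도", "거리")

# Assumed rule: speed above 7 is "빠름" (fast), otherwise "느림" (slow)
speed_label <- ifelse(cars[["속도"]] > 7, "빠름", "느림")

head(speed_label)
```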

Examples Using the hflights Package

#install.packages("hflights")
require(hflights)
## Loading required package: hflights

With the package loaded, functions can now be applied to the hflights data.

##       Year          Month          DayofMonth      DayOfWeek        DepTime    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000   Min.   :   1  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000   1st Qu.:1021  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000   Median :1416  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948   Mean   :1396  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000   3rd Qu.:1801  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000   Max.   :2400  
##                                                                  NA's   :2905  
##     ArrTime     UniqueCarrier        FlightNum      TailNum         
##  Min.   :   1   Length:227496      Min.   :   1   Length:227496     
##  1st Qu.:1215   Class :character   1st Qu.: 855   Class :character  
##  Median :1617   Mode  :character   Median :1696   Mode  :character  
##  Mean   :1578                      Mean   :1962                     
##  3rd Qu.:1953                      3rd Qu.:2755                     
##  Max.   :2400                      Max.   :7290                     
##  NA's   :3066                                                       
##  ActualElapsedTime    AirTime         ArrDelay          DepDelay      
##  Min.   : 34.0     Min.   : 11.0   Min.   :-70.000   Min.   :-33.000  
##  1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000   1st Qu.: -3.000  
##  Median :128.0     Median :107.0   Median :  0.000   Median :  0.000  
##  Mean   :129.3     Mean   :108.1   Mean   :  7.094   Mean   :  9.445  
##  3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000   3rd Qu.:  9.000  
##  Max.   :575.0     Max.   :549.0   Max.   :978.000   Max.   :981.000  
##  NA's   :3622      NA's   :3622    NA's   :3622      NA's   :2905     
##     Origin              Dest              Distance          TaxiIn       
##  Length:227496      Length:227496      Min.   :  79.0   Min.   :  1.000  
##  Class :character   Class :character   1st Qu.: 376.0   1st Qu.:  4.000  
##  Mode  :character   Mode  :character   Median : 809.0   Median :  5.000  
##                                        Mean   : 787.8   Mean   :  6.099  
##                                        3rd Qu.:1042.0   3rd Qu.:  7.000  
##                                        Max.   :3904.0   Max.   :165.000  
##                                                         NA's   :3066     
##     TaxiOut         Cancelled       CancellationCode      Diverted       
##  Min.   :  1.00   Min.   :0.00000   Length:227496      Min.   :0.000000  
##  1st Qu.: 10.00   1st Qu.:0.00000   Class :character   1st Qu.:0.000000  
##  Median : 14.00   Median :0.00000   Mode  :character   Median :0.000000  
##  Mean   : 15.09   Mean   :0.01307                      Mean   :0.002853  
##  3rd Qu.: 18.00   3rd Qu.:0.00000                      3rd Qu.:0.000000  
##  Max.   :163.00   Max.   :1.00000                      Max.   :1.000000  
##  NA's   :2947
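
The output above has the shape of a summary() call. A sketch that should reproduce it, assuming the hflights package is installed:

```r
library(hflights)

# Per-column summary statistics (min, quartiles, mean, max, NA counts)
summary(hflights)

# Number of rows and columns
dim(hflights)
```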

The select() function can be used to choose variables.

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
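
The select() calls themselves are not shown; only the package start-up messages and a str() listing survive. A hedged sketch of typical usage follows, where the chosen columns and the name `date_cols` are illustrative assumptions:

```r
library(dplyr)
library(hflights)

# Inspect the structure of the full data frame (21 variables)
str(hflights)

# Keep only a few named columns
date_cols <- select(hflights, Year, Month, DayofMonth)
head(date_cols)

# Select a contiguous range of columns with ":"
head(select(hflights, Year:DayOfWeek))

# Drop a range of columns with a leading minus sign
head(select(hflights, -(Year:DayOfWeek)))
```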

The filter() function can be used to keep only the rows that satisfy a condition.

##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     2          1         2    1401    1539            AA       428
## 2 2011     2          2         3    1420    1530            AA       428
## 3 2011     2          3         4    1405    1504            AA       428
## 4 2011     2          4         5    1516    1614            AA       428
## 5 2011     2          5         6    1358    1505            AA       428
## 6 2011     2          6         7    1350    1452            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N474AA                98      73       29        1    IAH  DFW      224
## 2  N463AA                70      42       20       20    IAH  DFW      224
## 3  N548AA                59      40       -6        5    IAH  DFW      224
## 4  N425AA                58      39       64       76    IAH  DFW      224
## 5  N4UCAA                67      40       -5       -2    IAH  DFW      224
## 6  N560AA                62      44      -18      -10    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      7      18         0                         0
## 2      8      20         0                         0
## 3      7      12         0                         0
## 4      7      12         0                         0
## 5     12      15         0                         0
## 6      4      14         0                         0
## [1] 17128
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     8         16         2     556     859            DL      1884
## 2 2011     8         17         3    1545    1836            DL         8
## 3 2011     8         17         3    1800    2117            DL        54
## 4 2011     8         17         3     700    1007            DL       810
## 5 2011     8         17         3     632     907            DL      1512
## 6 2011     8         17         3     910    1203            DL      1590
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N951DL               123      99      -15       -4    IAH  ATL      689
## 2  N338NB               111      91      -28       -5    IAH  ATL      689
## 3  N345NW               137      90       12        5    IAH  ATL      689
## 4  N969DL               127      98       -8       -2    IAH  ATL      689
## 5  N317NB               155     136      -18       -5    IAH  MSP     1034
## 6  N992DL               113      95      -19       -5    IAH  ATL      689
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1     16       8         0                         0
## 2     10      10         0                         0
## 3     25      22         0                         0
## 4     18      11         0                         0
## 5      7      12         0                         0
## 6      5      13         0                         0
## [1] 20176
## [1] 37304
## [1] 508
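
The filter conditions behind these results are not shown. The first table contains only Month 2 rows, so a single-month filter is consistent with it, and the printed counts 17128 + 20176 = 37304 are what an OR of two disjoint conditions would give. The sketch below illustrates that pattern; the specific months and object names are assumptions.

```r
library(dplyr)
library(hflights)

# Rows for a single month
feb <- filter(hflights, Month == 2)
head(feb)
nrow(feb)

# A second single-month filter
aug <- filter(hflights, Month == 8)

# "|" means OR; a comma or "&" between conditions means AND
feb_or_aug <- filter(hflights, Month == 2 | Month == 8)
nrow(feb_or_aug)
```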

Rows can be sorted with arrange().

##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      7      13         0                         0
## 2      6       9         0                         0
## 3      5      17         0                         0
## 4      9      22         0                         0
## 5      9       9         0                         0
## 6      6      13         0                         0
##        Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 227491 2011    12          6         2    1307    1600            WN       471
## 227492 2011    12          6         2    1818    2111            WN      1191
## 227493 2011    12          6         2    2047    2334            WN      1674
## 227494 2011    12          6         2     912    1031            WN       127
## 227495 2011    12          6         2     656     812            WN       621
## 227496 2011    12          6         2    1600    1713            WN      1597
##        TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 227491  N632SW               113      98        0        7    HOU  TPA      781
## 227492  N284WN               113      97       -9        8    HOU  TPA      781
## 227493  N366SW               107      94        4        7    HOU  TPA      781
## 227494  N777QC                79      61       -4       -3    HOU  TUL      453
## 227495  N727SW                76      64      -13       -4    HOU  TUL      453
## 227496  N745SW                73      59      -12        0    HOU  TUL      453
##        TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 227491      5      10         0                         0
## 227492      5      11         0                         0
## 227493      4       9         0                         0
## 227494      4      14         0                         0
## 227495      3       9         0                         0
## 227496      3      11         0                         0
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011    12         15         4    2113    2217            AA       426
## 2 2011    12         16         5    2004    2128            AA       426
## 3 2011    12         18         7    2007    2113            AA       426
## 4 2011    12         19         1    2108    2223            AA       426
## 5 2011    12         20         2    2008    2107            AA       426
## 6 2011    12         21         3    2025    2124            AA       426
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N433AA                64      44       47       63    IAH  DFW      224
## 2  N588AA                84      39       -2       -6    IAH  DFW      224
## 3  N4XHAA                66      46      -17       -3    IAH  DFW      224
## 4  N4YDAA                75      54       53       58    IAH  DFW      224
## 5  N434AA                59      41      -23       -2    IAH  DFW      224
## 6  N589AA                59      43       -6       15    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      6      14         0                         0
## 2      8      37         0                         0
## 3      7      13         0                         0
## 4     11      10         0                         0
## 5      6      12         0                         0
## 6      5      11         0                         0
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011    12         19         1    2108    2223            AA       426
## 2 2011    12         26         1    2013    2118            AA       426
## 3 2011    12          5         1     558     926            AA       466
## 4 2011    12         12         1     609     921            AA       466
## 5 2011    12         19         1     603     913            AA       466
## 6 2011    12         26         1     558     912            AA       466
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N4YDAA                75      54       53       58    IAH  DFW      224
## 2  N4YLAA                65      43      -12        3    IAH  DFW      224
## 3  N3CTAA               148     116        6       -2    IAH  MIA      964
## 4  N3CEAA               132     113        1        9    IAH  MIA      964
## 5  N3GWAA               130     115       -7        3    IAH  MIA      964
## 6  N3GWAA               134     112       -8       -2    IAH  MIA      964
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1     11      10         0                         0
## 2      8      14         0                         0
## 3     11      21         0                         0
## 4     10       9         0                         0
## 5      5      10         0                         0
## 6      6      16         0                         0
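
The last two tables above are consistent with sorting by month in descending order, and then additionally by day of week. The calls are not shown, so the following is a sketch under that assumption:

```r
library(dplyr)
library(hflights)

# Sort by month, largest first
by_month_desc <- arrange(hflights, desc(Month))
head(by_month_desc)

# Sort by month descending, then by day of week ascending
by_month_dow <- arrange(hflights, desc(Month), DayOfWeek)
head(by_month_dow)
```

arrange() sorts ascending by default; wrapping a column in desc() reverses the order for that column only.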

The mutate() function is used to add new columns or modify existing ones.

##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted gain
## 1      7      13         0                         0  -10
## 2      6       9         0                         0  -10
## 3      5      17         0                         0    0
## 4      9      22         0                         0    0
## 5      9       9         0                         0   -8
## 6      6      13         0                         0   -6
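
In the table above, every gain value equals ArrDelay minus DepDelay (for example, -10 - 0 = -10 in the first row). A sketch that matches this, with `with_gain` as a hypothetical object name:

```r
library(dplyr)
library(hflights)

# Add a column holding the difference between arrival and departure delay
with_gain <- mutate(hflights, gain = ArrDelay - DepDelay)

head(with_gain)
```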

The summarise() function collapses the data into the summary values you want to inspect.

##      delay
## 1 9.444951
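
The value 9.444951 agrees with the mean of DepDelay reported by summary() earlier (9.445), so the call was probably close to the sketch below; `delay_summary` is a hypothetical name.

```r
library(dplyr)
library(hflights)

# Mean departure delay, ignoring missing values
delay_summary <- summarise(hflights, delay = mean(DepDelay, na.rm = TRUE))

delay_summary
```

Without na.rm = TRUE the result would be NA, because DepDelay contains missing values.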

Grouped Summaries

Summaries by group can be computed by calling group_by() directly or by chaining the steps with the %>% operator.

## Adding missing grouping variables: `TailNum`
## `summarise()` ungrouping output (override with `.groups` argument)
hflights %>%
  group_by(TailNum) %>%  
  select(Month,DayofMonth, DayOfWeek) %>%  
  summarise(
          AveMonth = mean(Month, na.rm=TRUE),
          AveDayofMonth = mean(DayofMonth, na.rm=TRUE),
          AveDayOfWeek = mean(DayOfWeek, na.rm=TRUE)
)
## Adding missing grouping variables: `TailNum`
## `summarise()` ungrouping output (override with `.groups` argument)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3,320 x 4
##    TailNum  AveMonth AveDayofMonth AveDayOfWeek
##    <chr>       <dbl>         <dbl>        <dbl>
##  1 ""           3.25          10.0         4.32
##  2 "N0EGMQ"     6.42          16.4         3.42
##  3 "N10156"     6.23          14.0         3.83
##  4 "N10575"     8.49          13.2         3.26
##  5 "N11106"     7.37          15.0         4.10
##  6 "N11107"     7.18          14.8         4.04
##  7 "N11109"     6.24          15.4         3.92
##  8 "N11113"     6.77          15.4         3.98
##  9 "N11119"     7.07          16.0         4.03
## 10 "N11121"     6.46          16.2         4.05
## # ... with 3,310 more rows
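
For comparison, the same grouped summary can be written without the pipe by storing the grouped data in an intermediate object. This is a sketch; `by_tail` and `tail_summary` are hypothetical names.

```r
library(dplyr)
library(hflights)

# Step 1: group the flights by aircraft tail number
by_tail <- group_by(hflights, TailNum)

# Step 2: compute one row of averages per group
tail_summary <- summarise(
  by_tail,
  AveMonth      = mean(Month, na.rm = TRUE),
  AveDayofMonth = mean(DayofMonth, na.rm = TRUE),
  AveDayOfWeek  = mean(DayOfWeek, na.rm = TRUE)
)

tail_summary
```

The piped form reads top to bottom in the order the steps run, which is why it is usually preferred once more than two steps are involved.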

dplyr Exercise

hflights %>%
  group_by(TailNum) %>%
  select(FlightNum, TailNum, AirTime, Distance, Origin) %>%
  filter(Origin=="HOU", Distance>1000) %>%
  summarise(
    AveAirTime = mean(AirTime, na.rm = TRUE),
    AveDistance = mean(Distance, na.rm = TRUE)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 715 x 3
##    TailNum  AveAirTime AveDistance
##    <chr>         <dbl>       <dbl>
##  1 ""             NaN        1270.
##  2 "N178JB"       189.       1428 
##  3 "N179JB"       182        1428 
##  4 "N183JB"       186.       1428 
##  5 "N184JB"       181.       1428 
##  6 "N187JB"       185.       1428 
##  7 "N190JB"       184.       1428 
##  8 "N192JB"       187        1428 
##  9 "N193JB"       190        1428 
## 10 "N197JB"       180.       1428 
## # ... with 705 more rows
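
The NaN in the first row is what mean() returns when na.rm = TRUE removes every value in a group, leaving nothing to average. A minimal illustration with made-up values:

```r
# Hypothetical group in which every AirTime value is missing
air_time <- c(NA_integer_, NA_integer_)

# All values are dropped, so the mean of an empty vector is NaN
mean(air_time, na.rm = TRUE)
```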

ggplot2

First install and load the package, then load the midwest data, as shown below.

#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2
data("midwest", package = "ggplot2")
summary(midwest)
##       PID          county             state                area        
##  Min.   : 561   Length:437         Length:437         Min.   :0.00500  
##  1st Qu.: 670   Class :character   Class :character   1st Qu.:0.02400  
##  Median :1221   Mode  :character   Mode  :character   Median :0.03000  
##  Mean   :1437                                         Mean   :0.03317  
##  3rd Qu.:2059                                         3rd Qu.:0.03800  
##  Max.   :3052                                         Max.   :0.11000  
##     poptotal         popdensity          popwhite          popblack      
##  Min.   :   1701   Min.   :   85.05   Min.   :    416   Min.   :      0  
##  1st Qu.:  18840   1st Qu.:  622.41   1st Qu.:  18630   1st Qu.:     29  
##  Median :  35324   Median : 1156.21   Median :  34471   Median :    201  
##  Mean   :  96130   Mean   : 3097.74   Mean   :  81840   Mean   :  11024  
##  3rd Qu.:  75651   3rd Qu.: 2330.00   3rd Qu.:  72968   3rd Qu.:   1291  
##  Max.   :5105067   Max.   :88018.40   Max.   :3204947   Max.   :1317147  
##  popamerindian        popasian         popother        percwhite    
##  Min.   :    4.0   Min.   :     0   Min.   :     0   Min.   :10.69  
##  1st Qu.:   44.0   1st Qu.:    35   1st Qu.:    20   1st Qu.:94.89  
##  Median :   94.0   Median :   102   Median :    66   Median :98.03  
##  Mean   :  343.1   Mean   :  1310   Mean   :  1613   Mean   :95.56  
##  3rd Qu.:  288.0   3rd Qu.:   401   3rd Qu.:   345   3rd Qu.:99.07  
##  Max.   :10289.0   Max.   :188565   Max.   :384119   Max.   :99.82  
##    percblack       percamerindan        percasian        percother      
##  Min.   : 0.0000   Min.   : 0.05623   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 0.1157   1st Qu.: 0.15793   1st Qu.:0.1737   1st Qu.:0.09102  
##  Median : 0.5390   Median : 0.21502   Median :0.2972   Median :0.17844  
##  Mean   : 2.6763   Mean   : 0.79894   Mean   :0.4872   Mean   :0.47906  
##  3rd Qu.: 2.6014   3rd Qu.: 0.38362   3rd Qu.:0.5212   3rd Qu.:0.48050  
##  Max.   :40.2100   Max.   :89.17738   Max.   :5.0705   Max.   :7.52427  
##    popadults          perchsd        percollege        percprof      
##  Min.   :   1287   Min.   :46.91   Min.   : 7.336   Min.   : 0.5203  
##  1st Qu.:  12271   1st Qu.:71.33   1st Qu.:14.114   1st Qu.: 2.9980  
##  Median :  22188   Median :74.25   Median :16.798   Median : 3.8142  
##  Mean   :  60973   Mean   :73.97   Mean   :18.273   Mean   : 4.4473  
##  3rd Qu.:  47541   3rd Qu.:77.20   3rd Qu.:20.550   3rd Qu.: 4.9493  
##  Max.   :3291995   Max.   :88.90   Max.   :48.079   Max.   :20.7913  
##  poppovertyknown   percpovertyknown percbelowpoverty percchildbelowpovert
##  Min.   :   1696   Min.   :80.90    Min.   : 2.180   Min.   : 1.919      
##  1st Qu.:  18364   1st Qu.:96.89    1st Qu.: 9.199   1st Qu.:11.624      
##  Median :  33788   Median :98.17    Median :11.822   Median :15.270      
##  Mean   :  93642   Mean   :97.11    Mean   :12.511   Mean   :16.447      
##  3rd Qu.:  72840   3rd Qu.:98.60    3rd Qu.:15.133   3rd Qu.:20.352      
##  Max.   :5023523   Max.   :99.86    Max.   :48.691   Max.   :64.308      
##  percadultpoverty percelderlypoverty    inmetro         category        
##  Min.   : 1.938   Min.   : 3.547     Min.   :0.0000   Length:437        
##  1st Qu.: 7.668   1st Qu.: 8.912     1st Qu.:0.0000   Class :character  
##  Median :10.008   Median :10.869     Median :0.0000   Mode  :character  
##  Mean   :10.919   Mean   :11.389     Mean   :0.3432                     
##  3rd Qu.:13.182   3rd Qu.:13.412     3rd Qu.:1.0000                     
##  Max.   :43.312   Max.   :31.162     Max.   :1.0000

The ggplot() function can then be used to visualize the analysis.

A scatter plot can be created as follows.
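
The plotting code and figure are missing from these notes. A minimal sketch, assuming area on the x-axis and poptotal on the y-axis (the variables actually used are not stated):

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Map area to x and poptotal to y, then draw one point per county
p <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point()

p
```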

To add a regression line, use the following approach.

## `geom_smooth()` using formula 'y ~ x'
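
The message above is what geom_smooth() prints when no formula is given. A sketch with a linear fit, again assuming area and poptotal as the plotted variables:

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Scatter plot plus a straight-line fit (method = "lm") with its confidence band
p <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm")

p
```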

### Fine-Tuning the Graph

The axis ranges of the graph can be adjusted as follows.

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
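
The "Removed ... rows" warnings are typical of xlim() and ylim(), which discard observations outside the limits before the regression line is fitted. In this sketch the limit values and the plotted variables are assumptions:

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# xlim()/ylim() drop points outside the range, so the fit uses fewer rows
p <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlim(c(0, 0.1)) +
  ylim(c(0, 1000000))

p
```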

Outliers often make a plot hard to read. In that case, they can be handled as follows.

## `geom_smooth()` using formula 'y ~ x'
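
No rows were removed in the output above, which is consistent with zooming via coord_cartesian() rather than dropping data. A sketch under that assumption, with illustrative limits:

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# coord_cartesian() only zooms the view; all rows still feed the regression fit
p <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 1000000))

p
```

Because outliers stay in the data, the fitted line can differ from the one obtained with xlim() and ylim().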

A title and axis labels can then be added.

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).

## Warning: Removed 5 rows containing missing values (geom_point).
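
Two plots appear to have been drawn here, which would fit the two common ways of labelling a plot. Both are sketched below; the title and label text are made up for illustration.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

base_plot <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlim(c(0, 0.1)) +
  ylim(c(0, 1000000))

# Option 1: separate helper functions
p1 <- base_plot + ggtitle("Area vs Population") + xlab("Area") + ylab("Population")

# Option 2: a single labs() call
p2 <- base_plot + labs(title = "Area vs Population", x = "Area", y = "Population")

p1
p2
```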

The color of the points can be changed.

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
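
A fixed color is set as an argument of the geom, outside aes(). The color names and point size below are arbitrary choices for illustration:

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Constant aesthetics go outside aes(): every point gets the same color and size
p <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(color = "steelblue", size = 3) +
  geom_smooth(method = "lm", color = "firebrick") +
  xlim(c(0, 0.1)) +
  ylim(c(0, 1000000))

p
```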

The plot can be split into panels.

## `geom_smooth()` using formula 'y ~ x'
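
Splitting into panels is done with faceting. A sketch that draws one panel per state, assuming the same area and poptotal mapping as before:

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# facet_wrap() repeats the plot once for each value of state
p <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ state)

p
```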

Combining the methods above, the data can be visualized as follows.

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 437 x 28
##      PID county state  area poptotal popdensity popwhite popblack popamerindian
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702            98
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496            19
##  3   563 BOND   IL    0.022    14991       681.    14477      429            35
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127            46
##  5   565 BROWN  IL    0.018     5836       324.     5264      547            14
##  6   566 BUREAU IL    0.05     35688       714.    35157       50            65
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1             8
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111            30
##  9   569 CASS   IL    0.024    13437       560.    13384       16             8
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559           331
## # ... with 427 more rows, and 19 more variables: popasian <int>,
## #   popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## #   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## #   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>, percchildbelowpovert <dbl>,
## #   percadultpoverty <dbl>, percelderlypoverty <dbl>, inmetro <int>,
## #   category <chr>

midwest$degreeofdensity <- ifelse(midwest$popdensity >= 10000, "high", "low")
table(midwest$degreeofdensity)
## 
## high  low 
##   23  414
g + facet_wrap(~ state, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'
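
The object g used above is not defined in the visible code. One plausible reconstruction, offered only as a sketch, colors the points by the new degreeofdensity variable; the x and y variables are assumptions.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Derived variable: label counties by population density
midwest$degreeofdensity <- ifelse(midwest$popdensity >= 10000, "high", "low")

# Assumed definition of g: points colored by density class, one overall fit
g <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(color = degreeofdensity)) +
  geom_smooth(method = "lm")

# One panel per state, arranged in two rows
g + facet_wrap(~ state, nrow = 2)
```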

ggplot Practice with the mtcars Data

Load the mtcars data.

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
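
The summary above can be reproduced with the built-in mtcars dataset; a minimal sketch:

```r
# Load the built-in mtcars dataset and summarise each column
data(mtcars)
summary(mtcars)
```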

Next, select the disp and hp variables and draw a scatter plot. A categorical variable can be mapped to the point color, as shown below.

## `geom_smooth()` using formula 'y ~ x'
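
The notes do not say which categorical variable was used or which variable went on which axis. The sketch below assumes disp on x, hp on y, and cyl (number of cylinders) as the color variable:

```r
library(ggplot2)
data(mtcars)

# factor(cyl) makes ggplot2 treat the cylinder count as discrete categories
p <- ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm") +
  labs(color = "cyl")

p
```

Mapping color inside geom_point() rather than in the top-level aes() keeps a single regression line for all cars instead of one line per group.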