Data Analysis ### Creating Derived Variables

In R, a derived variable is a new column created inside a data frame from variables the data frame already contains.

To illustrate, we load the cars data and extract its column names. The cars data is a data frame with 2 variables and 50 observations.

The variable names are as follows.

## [1] "speed" "dist"
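The code for this step is not shown in the source. A minimal sketch, assuming the built-in `cars` dataset from R's datasets package, could look like this:

```r
# Load the built-in cars dataset (assumed to be the one used here)
data(cars)

# Number of observations (rows) and variables (columns)
dim(cars)

# Column names
names(cars)
```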

Before creating the derived variable, the variable names were changed to Korean: 속도 (speed) and 거리 (distance).

The first six rows of the result are shown below.

##   속도 거리
## 1    4    2
## 2    4   10
## 3    7    4
## 4    7   22
## 5    8   16
## 6    9   10
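The renaming code is hidden in the source. One way to do it, as a sketch, is to assign a new character vector to `names()`; the exact call the author used is an assumption.

```r
# Start from a fresh copy of the built-in dataset
data(cars)

# Replace both column names at once; the order must match the existing columns
names(cars) <- c("속도", "거리")

# head() returns the first six rows by default
head(cars)
```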

The new derived variable is named '속도x거리' (speed x distance), and its value is the product of 속도 and 거리.

##   속도 거리 속도x거리
## 1    4    2         8
## 2    4   10        40
## 3    7    4        28
## 4    7   22       154
## 5    8   16       128
## 6    9   10        90

As a result, the cars data is now a data frame with 3 variables and 50 observations.
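The source does not show how the product column was built. A minimal sketch, assuming plain element-wise multiplication of the two columns (the `[[ ]]` form is used here only so the non-ASCII name works in any locale):

```r
# Recreate the renamed data frame so this sketch runs on its own
data(cars)
names(cars) <- c("속도", "거리")

# Element-wise product of the two existing columns, stored as a third column
cars[["속도x거리"]] <- cars[["속도"]] * cars[["거리"]]

head(cars)
dim(cars)   # rows, columns
```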

0.0.1 Creating a new variable from a condition

# Label each row "빠름" (fast) when 속도 (speed) exceeds 7, otherwise "느림" (slow)
cars$속도정도 <- ifelse(cars$속도 > 7, "빠름", "느림")
head(cars)
##   속도 거리 속도x거리 속도정도
## 1    4    2         8     느림
## 2    4   10        40     느림
## 3    7    4        28     느림
## 4    7   22       154     느림
## 5    8   16       128     빠름
## 6    9   10        90     빠름
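In the output above, a speed of 7 falls on the "느림" (slow) side because the comparison is a strict `>`. A small sketch of the same `ifelse()` logic on a standalone vector; the English labels and the vector name are illustrative, not from the source:

```r
# Speeds taken from the first rows shown above
speed <- c(4, 7, 8, 9)

# ifelse() is vectorised: the test is evaluated once per element
label <- ifelse(speed > 7, "fast", "slow")
label

# Count how many values land in each category
table(label)
```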

0.0.2 Installing and loading the hflights package

#install.packages("hflights")
require(hflights)
## Loading required package: hflights
hflights <- hflights
#View(hflights)
summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek        DepTime    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000   Min.   :   1  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000   1st Qu.:1021  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000   Median :1416  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948   Mean   :1396  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000   3rd Qu.:1801  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000   Max.   :2400  
##                                                                  NA's   :2905  
##     ArrTime     UniqueCarrier        FlightNum      TailNum         
##  Min.   :   1   Length:227496      Min.   :   1   Length:227496     
##  1st Qu.:1215   Class :character   1st Qu.: 855   Class :character  
##  Median :1617   Mode  :character   Median :1696   Mode  :character  
##  Mean   :1578                      Mean   :1962                     
##  3rd Qu.:1953                      3rd Qu.:2755                     
##  Max.   :2400                      Max.   :7290                     
##  NA's   :3066                                                       
##  ActualElapsedTime    AirTime         ArrDelay          DepDelay      
##  Min.   : 34.0     Min.   : 11.0   Min.   :-70.000   Min.   :-33.000  
##  1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000   1st Qu.: -3.000  
##  Median :128.0     Median :107.0   Median :  0.000   Median :  0.000  
##  Mean   :129.3     Mean   :108.1   Mean   :  7.094   Mean   :  9.445  
##  3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000   3rd Qu.:  9.000  
##  Max.   :575.0     Max.   :549.0   Max.   :978.000   Max.   :981.000  
##  NA's   :3622      NA's   :3622    NA's   :3622      NA's   :2905     
##     Origin              Dest              Distance          TaxiIn       
##  Length:227496      Length:227496      Min.   :  79.0   Min.   :  1.000  
##  Class :character   Class :character   1st Qu.: 376.0   1st Qu.:  4.000  
##  Mode  :character   Mode  :character   Median : 809.0   Median :  5.000  
##                                        Mean   : 787.8   Mean   :  6.099  
##                                        3rd Qu.:1042.0   3rd Qu.:  7.000  
##                                        Max.   :3904.0   Max.   :165.000  
##                                                         NA's   :3066     
##     TaxiOut         Cancelled       CancellationCode      Diverted       
##  Min.   :  1.00   Min.   :0.00000   Length:227496      Min.   :0.000000  
##  1st Qu.: 10.00   1st Qu.:0.00000   Class :character   1st Qu.:0.000000  
##  Median : 14.00   Median :0.00000   Mode  :character   Median :0.000000  
##  Mean   : 15.09   Mean   :0.01307                      Mean   :0.002853  
##  3rd Qu.: 18.00   3rd Qu.:0.00000                      3rd Qu.:0.000000  
##  Max.   :163.00   Max.   :1.00000                      Max.   :1.000000  
##  NA's   :2947

0.0.3 Select

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
str(hflights)
## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
newhflights <- select(hflights, Origin, Distance, TailNum) 
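The result of `select()` is not printed above. As a sketch, the same call can be tried on a small stand-in data frame; `flights_toy` is hypothetical, with values copied from the `str()` output, and the minus-sign form is shown only as a related usage.

```r
library(dplyr)

# Hypothetical three-row stand-in for hflights
flights_toy <- data.frame(
  Month    = c(1L, 1L, 1L),
  Origin   = c("IAH", "IAH", "IAH"),
  Dest     = c("DFW", "DFW", "DFW"),
  Distance = c(224L, 224L, 224L),
  TailNum  = c("N576AA", "N557AA", "N541AA")
)

# Keep only the named columns, in the order given
picked <- select(flights_toy, Origin, Distance, TailNum)
names(picked)

# A minus sign drops columns instead of keeping them
dropped <- select(flights_toy, -Month, -Dest)
names(dropped)
```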

0.0.4 Filter

head(
  filter(hflights, Month==2)
)
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     2          1         2    1401    1539            AA       428
## 2 2011     2          2         3    1420    1530            AA       428
## 3 2011     2          3         4    1405    1504            AA       428
## 4 2011     2          4         5    1516    1614            AA       428
## 5 2011     2          5         6    1358    1505            AA       428
## 6 2011     2          6         7    1350    1452            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N474AA                98      73       29        1    IAH  DFW      224
## 2  N463AA                70      42       20       20    IAH  DFW      224
## 3  N548AA                59      40       -6        5    IAH  DFW      224
## 4  N425AA                58      39       64       76    IAH  DFW      224
## 5  N4UCAA                67      40       -5       -2    IAH  DFW      224
## 6  N560AA                62      44      -18      -10    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      7      18         0                         0
## 2      8      20         0                         0
## 3      7      12         0                         0
## 4      7      12         0                         0
## 5     12      15         0                         0
## 6      4      14         0                         0
hflights1 <- filter(hflights, Month==2)
nrow(hflights1)
## [1] 17128
head(
  filter(hflights, Month==8)
)
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     8         16         2     556     859            DL      1884
## 2 2011     8         17         3    1545    1836            DL         8
## 3 2011     8         17         3    1800    2117            DL        54
## 4 2011     8         17         3     700    1007            DL       810
## 5 2011     8         17         3     632     907            DL      1512
## 6 2011     8         17         3     910    1203            DL      1590
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N951DL               123      99      -15       -4    IAH  ATL      689
## 2  N338NB               111      91      -28       -5    IAH  ATL      689
## 3  N345NW               137      90       12        5    IAH  ATL      689
## 4  N969DL               127      98       -8       -2    IAH  ATL      689
## 5  N317NB               155     136      -18       -5    IAH  MSP     1034
## 6  N992DL               113      95      -19       -5    IAH  ATL      689
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1     16       8         0                         0
## 2     10      10         0                         0
## 3     25      22         0                         0
## 4     18      11         0                         0
## 5      7      12         0                         0
## 6      5      13         0                         0
hflights2 <- filter(hflights, Month==8)
nrow(hflights2)
## [1] 20176
hflights3 <- filter(hflights, Month==2 | Month==8)
nrow(hflights3)
## [1] 37304
hflights4 <- filter(hflights, Month==2 , Dest=="DFW")
nrow(hflights4)
## [1] 508
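The counts above are consistent with `|` acting as OR: 17128 + 20176 = 37304. A comma between conditions acts as AND. A minimal sketch on a hypothetical five-row data frame (`toy` is not from the source) shows both forms and their equivalents:

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  Month = c(2L, 2L, 8L, 8L, 5L),
  Dest  = c("DFW", "ATL", "DFW", "MSP", "DFW")
)

n_or   <- nrow(filter(toy, Month == 2 | Month == 8))    # OR: either month
n_in   <- nrow(filter(toy, Month %in% c(2, 8)))         # same rows via %in%
n_and  <- nrow(filter(toy, Month == 2, Dest == "DFW"))  # comma means AND
n_and2 <- nrow(filter(toy, Month == 2 & Dest == "DFW")) # explicit AND

c(n_or, n_in, n_and, n_and2)
```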

0.0.5 Sorting with Arrange

hflights5 <- arrange(hflights, Month)
head(hflights5)
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      7      13         0                         0
## 2      6       9         0                         0
## 3      5      17         0                         0
## 4      9      22         0                         0
## 5      9       9         0                         0
## 6      6      13         0                         0
tail(hflights5)
##        Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 227491 2011    12          6         2    1307    1600            WN       471
## 227492 2011    12          6         2    1818    2111            WN      1191
## 227493 2011    12          6         2    2047    2334            WN      1674
## 227494 2011    12          6         2     912    1031            WN       127
## 227495 2011    12          6         2     656     812            WN       621
## 227496 2011    12          6         2    1600    1713            WN      1597
##        TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 227491  N632SW               113      98        0        7    HOU  TPA      781
## 227492  N284WN               113      97       -9        8    HOU  TPA      781
## 227493  N366SW               107      94        4        7    HOU  TPA      781
## 227494  N777QC                79      61       -4       -3    HOU  TUL      453
## 227495  N727SW                76      64      -13       -4    HOU  TUL      453
## 227496  N745SW                73      59      -12        0    HOU  TUL      453
##        TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 227491      5      10         0                         0
## 227492      5      11         0                         0
## 227493      4       9         0                         0
## 227494      4      14         0                         0
## 227495      3       9         0                         0
## 227496      3      11         0                         0
hflights6 <- arrange(hflights, desc(Month))
head(hflights6)
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011    12         15         4    2113    2217            AA       426
## 2 2011    12         16         5    2004    2128            AA       426
## 3 2011    12         18         7    2007    2113            AA       426
## 4 2011    12         19         1    2108    2223            AA       426
## 5 2011    12         20         2    2008    2107            AA       426
## 6 2011    12         21         3    2025    2124            AA       426
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N433AA                64      44       47       63    IAH  DFW      224
## 2  N588AA                84      39       -2       -6    IAH  DFW      224
## 3  N4XHAA                66      46      -17       -3    IAH  DFW      224
## 4  N4YDAA                75      54       53       58    IAH  DFW      224
## 5  N434AA                59      41      -23       -2    IAH  DFW      224
## 6  N589AA                59      43       -6       15    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      6      14         0                         0
## 2      8      37         0                         0
## 3      7      13         0                         0
## 4     11      10         0                         0
## 5      6      12         0                         0
## 6      5      11         0                         0
hflights7 <- arrange(hflights, desc(Month), DayOfWeek)
head(hflights7)
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011    12         19         1    2108    2223            AA       426
## 2 2011    12         26         1    2013    2118            AA       426
## 3 2011    12          5         1     558     926            AA       466
## 4 2011    12         12         1     609     921            AA       466
## 5 2011    12         19         1     603     913            AA       466
## 6 2011    12         26         1     558     912            AA       466
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N4YDAA                75      54       53       58    IAH  DFW      224
## 2  N4YLAA                65      43      -12        3    IAH  DFW      224
## 3  N3CTAA               148     116        6       -2    IAH  MIA      964
## 4  N3CEAA               132     113        1        9    IAH  MIA      964
## 5  N3GWAA               130     115       -7        3    IAH  MIA      964
## 6  N3GWAA               134     112       -8       -2    IAH  MIA      964
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1     11      10         0                         0
## 2      8      14         0                         0
## 3     11      21         0                         0
## 4     10       9         0                         0
## 5      5      10         0                         0
## 6      6      16         0                         0
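With two sort keys, `arrange()` orders by the first and uses the second only to break ties. A small sketch on hypothetical data (`toy` is illustrative) makes the ordering easy to check by eye:

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  Month     = c(12L, 1L, 12L, 1L),
  DayOfWeek = c(5L, 3L, 1L, 7L)
)

# Month descending, then DayOfWeek ascending within each month
sorted <- arrange(toy, desc(Month), DayOfWeek)
sorted
```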

0.0.6 Manipulating columns

head(
  mutate(hflights,
        gain = ArrDelay - DepDelay,
        gain_per_hour = gain / (AirTime/60)
  )
)
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted gain gain_per_hour
## 1      7      13         0                         0  -10     -15.00000
## 2      6       9         0                         0  -10     -13.33333
## 3      5      17         0                         0    0       0.00000
## 4      9      22         0                         0    0       0.00000
## 5      9       9         0                         0   -8     -10.90909
## 6      6      13         0                         0   -6      -8.00000
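The arithmetic can be checked by hand on the first three rows shown above. For row 1, gain = -10 - 0 = -10 and gain_per_hour = -10 / (40/60) = -15. A minimal sketch reproducing this, with `first_rows` as an illustrative name:

```r
library(dplyr)

# Values copied from the first three rows of the output above
first_rows <- data.frame(
  AirTime  = c(40, 45, 48),
  ArrDelay = c(-10, -9, -8),
  DepDelay = c(0, 1, -8)
)

# A column created earlier in the same mutate() call (gain)
# can be used by a later expression (gain_per_hour)
result <- mutate(first_rows,
                 gain = ArrDelay - DepDelay,
                 gain_per_hour = gain / (AirTime / 60))
result
```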

0.0.7 Summarise

summarise(hflights,
          delay = mean(DepDelay,
                       na.rm = TRUE)
          )
##      delay
## 1 9.444951
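`na.rm = TRUE` matters here because DepDelay contains missing values (2905 NA's in the summary earlier). A sketch on a hypothetical four-value column shows what happens without it:

```r
library(dplyr)

# Hypothetical data with one missing delay
toy <- data.frame(DepDelay = c(0, 1, -8, NA))

# Without na.rm the mean of a column containing NA is NA
with_na <- summarise(toy, delay = mean(DepDelay))

# With na.rm = TRUE the NA is dropped before averaging
without_na <- summarise(toy, delay = mean(DepDelay, na.rm = TRUE))

with_na
without_na
```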

0.0.8 Grouped summaries

hflights %>%
group_by(TailNum) %>%
select(Month, DayofMonth, DayOfWeek) %>%
summarise(
      AveMonth = mean(Month, na.rm = TRUE),
      AveDayOfMonth = mean(DayofMonth, na.rm = TRUE),
      AveDayOfWeek = mean(DayOfWeek, na.rm = TRUE)
                )
## Adding missing grouping variables: `TailNum`
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3,320 x 4
##    TailNum  AveMonth AveDayOfMonth AveDayOfWeek
##    <chr>       <dbl>         <dbl>        <dbl>
##  1 ""           3.25          10.0         4.32
##  2 "N0EGMQ"     6.42          16.4         3.42
##  3 "N10156"     6.23          14.0         3.83
##  4 "N10575"     8.49          13.2         3.26
##  5 "N11106"     7.37          15.0         4.10
##  6 "N11107"     7.18          14.8         4.04
##  7 "N11109"     6.24          15.4         3.92
##  8 "N11113"     6.77          15.4         3.98
##  9 "N11119"     7.07          16.0         4.03
## 10 "N11121"     6.46          16.2         4.05
## # … with 3,310 more rows
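The "Adding missing grouping variables" message appears because `select()` omits TailNum after grouping, so dplyr adds it back. A sketch of the same pattern without the intermediate `select()`, on hypothetical data (`toy`, the tail numbers and `n_flights` are illustrative):

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  TailNum   = c("N1", "N1", "N2"),
  Month     = c(1L, 3L, 6L),
  DayOfWeek = c(2L, 4L, 7L)
)

# One output row per TailNum; summarise() can use any column directly
by_tail <- toy %>%
  group_by(TailNum) %>%
  summarise(
    AveMonth     = mean(Month, na.rm = TRUE),
    AveDayOfWeek = mean(DayOfWeek, na.rm = TRUE),
    n_flights    = n()
  )
by_tail
```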

0.0.9 dplyr exercise 1

head(
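  # start from hflights
  hflights %>%
    # drop rows with a missing (empty-string) TailNum
    filter(TailNum != "") %>%
    # select the columns of interest
    select(TailNum, FlightNum, AirTime, Distance, Origin) %>%
    # keep flights that depart from HOU or whose Distance exceeds 1000
    filter(Origin == "HOU" | Distance > 1000) %>%
    # average AirTime and Distance per aircraft
    group_by(TailNum) %>%
    summarise(
      AveAirTime = mean(AirTime, na.rm = TRUE),
      AveDistance = mean(Distance, na.rm = TRUE)
    )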
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
##   TailNum AveAirTime AveDistance
##   <chr>        <dbl>       <dbl>
## 1 N0EGMQ        182.       1379 
## 2 N10156        142.       1113.
## 3 N10575        136.       1042 
## 4 N11106        141.       1093.
## 5 N11107        143.       1094.
## 6 N11109        144.       1082.

0.0.10 Installing the ggplot2 package

#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2

0.0.11 Loading the data

data("midwest", package = "ggplot2")
summary(midwest)
##       PID          county             state                area        
##  Min.   : 561   Length:437         Length:437         Min.   :0.00500  
##  1st Qu.: 670   Class :character   Class :character   1st Qu.:0.02400  
##  Median :1221   Mode  :character   Mode  :character   Median :0.03000  
##  Mean   :1437                                         Mean   :0.03317  
##  3rd Qu.:2059                                         3rd Qu.:0.03800  
##  Max.   :3052                                         Max.   :0.11000  
##     poptotal         popdensity          popwhite          popblack      
##  Min.   :   1701   Min.   :   85.05   Min.   :    416   Min.   :      0  
##  1st Qu.:  18840   1st Qu.:  622.41   1st Qu.:  18630   1st Qu.:     29  
##  Median :  35324   Median : 1156.21   Median :  34471   Median :    201  
##  Mean   :  96130   Mean   : 3097.74   Mean   :  81840   Mean   :  11024  
##  3rd Qu.:  75651   3rd Qu.: 2330.00   3rd Qu.:  72968   3rd Qu.:   1291  
##  Max.   :5105067   Max.   :88018.40   Max.   :3204947   Max.   :1317147  
##  popamerindian        popasian         popother        percwhite    
##  Min.   :    4.0   Min.   :     0   Min.   :     0   Min.   :10.69  
##  1st Qu.:   44.0   1st Qu.:    35   1st Qu.:    20   1st Qu.:94.89  
##  Median :   94.0   Median :   102   Median :    66   Median :98.03  
##  Mean   :  343.1   Mean   :  1310   Mean   :  1613   Mean   :95.56  
##  3rd Qu.:  288.0   3rd Qu.:   401   3rd Qu.:   345   3rd Qu.:99.07  
##  Max.   :10289.0   Max.   :188565   Max.   :384119   Max.   :99.82  
##    percblack       percamerindan        percasian        percother      
##  Min.   : 0.0000   Min.   : 0.05623   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 0.1157   1st Qu.: 0.15793   1st Qu.:0.1737   1st Qu.:0.09102  
##  Median : 0.5390   Median : 0.21502   Median :0.2972   Median :0.17844  
##  Mean   : 2.6763   Mean   : 0.79894   Mean   :0.4872   Mean   :0.47906  
##  3rd Qu.: 2.6014   3rd Qu.: 0.38362   3rd Qu.:0.5212   3rd Qu.:0.48050  
##  Max.   :40.2100   Max.   :89.17738   Max.   :5.0705   Max.   :7.52427  
##    popadults          perchsd        percollege        percprof      
##  Min.   :   1287   Min.   :46.91   Min.   : 7.336   Min.   : 0.5203  
##  1st Qu.:  12271   1st Qu.:71.33   1st Qu.:14.114   1st Qu.: 2.9980  
##  Median :  22188   Median :74.25   Median :16.798   Median : 3.8142  
##  Mean   :  60973   Mean   :73.97   Mean   :18.273   Mean   : 4.4473  
##  3rd Qu.:  47541   3rd Qu.:77.20   3rd Qu.:20.550   3rd Qu.: 4.9493  
##  Max.   :3291995   Max.   :88.90   Max.   :48.079   Max.   :20.7913  
##  poppovertyknown   percpovertyknown percbelowpoverty percchildbelowpovert
##  Min.   :   1696   Min.   :80.90    Min.   : 2.180   Min.   : 1.919      
##  1st Qu.:  18364   1st Qu.:96.89    1st Qu.: 9.199   1st Qu.:11.624      
##  Median :  33788   Median :98.17    Median :11.822   Median :15.270      
##  Mean   :  93642   Mean   :97.11    Mean   :12.511   Mean   :16.447      
##  3rd Qu.:  72840   3rd Qu.:98.60    3rd Qu.:15.133   3rd Qu.:20.352      
##  Max.   :5023523   Max.   :99.86    Max.   :48.691   Max.   :64.308      
##  percadultpoverty percelderlypoverty    inmetro         category        
##  Min.   : 1.938   Min.   : 3.547     Min.   :0.0000   Length:437        
##  1st Qu.: 7.668   1st Qu.: 8.912     1st Qu.:0.0000   Class :character  
##  Median :10.008   Median :10.869     Median :0.0000   Mode  :character  
##  Mean   :10.919   Mean   :11.389     Mean   :0.3432                     
##  3rd Qu.:13.182   3rd Qu.:13.412     3rd Qu.:1.0000                     
##  Max.   :43.312   Max.   :31.162     Max.   :1.0000
options(scipen = 999)
ggplot(midwest, aes(x=area, y=poptotal))

0.0.12 Creating a scatter plot

midwest %>%
  ggplot(aes(x=area, y=poptotal)) +
  geom_point()

0.0.13 Adding a regression line

midwest %>%
  ggplot(aes(x=area, y=poptotal)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

0.0.14 Adjusting the plot range

g <- midwest %>%
  ggplot(aes(x=area, y=poptotal)) +
  geom_point() +
  geom_smooth(method = "lm")
g + xlim(c(0, 0.1)) + ylim(c(0,100000))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 86 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).

0.0.15 Zooming with coord_cartesian

g1 <- g +
  coord_cartesian(xlim = c(0,0.1), ylim = c(0,1000000))
g1
## `geom_smooth()` using formula 'y ~ x'
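`xlim()` and `ylim()` set scale limits, so rows outside them are dropped before the regression line is fitted, which explains the "Removed 86 rows" warnings earlier. `coord_cartesian()` only zooms the view and fits on all rows, so it gives no such warning. As a sketch, the rows outside the earlier limits can be counted directly and compared with the number in the warnings (`outside` is an illustrative name):

```r
library(ggplot2)

data("midwest", package = "ggplot2")

# TRUE for rows outside the limits passed to xlim(c(0, 0.1)) and ylim(c(0, 100000))
outside <- with(midwest,
                area < 0 | area > 0.1 | poptotal < 0 | poptotal > 100000)

# Number of rows the scale limits would discard
sum(outside)
```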

0.0.16 Title and axis labels

g1 +
  labs(title = "Area VS population",
       subtitle = "midwest",
       y = "Population",
       x = "area",
       caption = "Midwest Population"
       )
## `geom_smooth()` using formula 'y ~ x'

0.0.17 Changing colors 1

g <- midwest %>%
  ggplot(aes(x = area, y = poptotal)) +
  geom_point(col = "blue", size = 0.5) +
  geom_smooth(method = "lm", col = "red", size = 0.5)
g + xlim(c(0, 0.1)) + ylim(c(0,100000))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 86 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).

0.0.18 Changing colors 2

g <- midwest %>%
  ggplot(aes(x = area, y = poptotal)) +
  geom_point(aes(col=state), size = 0.5) +
  geom_smooth(method = "lm", col = "red", size = 0.5)
g + xlim(c(0, 0.1)) + ylim(c(0,100000))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 86 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).

0.0.19 Faceting the plot

g <- midwest %>%
  ggplot(aes(x=area, y=poptotal)) +
  geom_point() +
  geom_smooth(method = "lm")
g + facet_wrap(~state, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'

midwest
## # A tibble: 437 x 28
##      PID county state  area poptotal popdensity popwhite popblack popamerindian
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702            98
##  2   562 ALEXA… IL    0.014    10626       759      7054     3496            19
##  3   563 BOND   IL    0.022    14991       681.    14477      429            35
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127            46
##  5   565 BROWN  IL    0.018     5836       324.     5264      547            14
##  6   566 BUREAU IL    0.05     35688       714.    35157       50            65
##  7   567 CALHO… IL    0.017     5322       313.     5298        1             8
##  8   568 CARRO… IL    0.027    16805       622.    16519      111            30
##  9   569 CASS   IL    0.024    13437       560.    13384       16             8
## 10   570 CHAMP… IL    0.058   173025      2983.   146506    16559           331
## # … with 427 more rows, and 19 more variables: popasian <int>, popother <int>,
## #   percwhite <dbl>, percblack <dbl>, percamerindan <dbl>, percasian <dbl>,
## #   percother <dbl>, popadults <int>, perchsd <dbl>, percollege <dbl>,
## #   percprof <dbl>, poppovertyknown <int>, percpovertyknown <dbl>,
## #   percbelowpoverty <dbl>, percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>
hist(midwest$popdensity)

midwest$degreeofdensity <- ifelse(midwest$popdensity >= 10000, "high", "low")
table(midwest$degreeofdensity)
## 
## high  low 
##   23  414
g <- midwest %>%
  ggplot(aes(x=area, y=poptotal)) +
  geom_point(aes(col=degreeofdensity)) +
  geom_smooth(method = "lm")

g + facet_wrap( ~ state, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'

0.0.20 ggplot practice

mtcars <- mtcars
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
plot(mtcars)

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# Using the mtcars data, pick two continuous variables and draw a scatter plot.
# Use a categorical variable to set the color of the points.
mtcars$names <- rownames(mtcars)

m <- mtcars %>%
  ggplot(aes(x = mpg, y = disp)) +
  geom_point(aes(col = as.factor(vs)), size = 2) +
  geom_smooth(method = "lm") +
  geom_text(aes(label = names), size = 3)
m
## `geom_smooth()` using formula 'y ~ x'