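### Creating derived variables
In R, a derived variable is a new variable that is added to a data frame and computed from variables the data frame already contains.
To demonstrate, we load the cars data and extract its column names.
The cars data is a data frame with 2 variables and 50 observations.
The variable names are as follows.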
## [1] "speed" "dist"
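The code that produced this output is not shown in the source. A minimal sketch, assuming `cars` is the built-in dataset from the `datasets` package, might look like this:

```r
# Assumption: `cars` is the built-in dataset that ships with R.
data(cars)

# Column names of the data frame
names(cars)

# Number of observations (rows) and variables (columns)
dim(cars)
```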
Before creating the derived variable, the variable names were changed to Korean: 속도 (speed) and 거리 (distance).
The first six rows of the result are shown below.
## 속도 거리
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
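The renaming step itself is not shown in the source. One way to do it in base R, assuming the two columns are renamed in their original order, is sketched here:

```r
# Assumption: start from the built-in cars dataset.
data(cars)

# Replace the column names in order: speed -> 속도, dist -> 거리
names(cars) <- c("속도", "거리")

# Show the first six rows
head(cars)
```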
The new derived variable is named '속도x거리'; its value is the product of speed (속도) and distance (거리).
## 속도 거리 속도x거리
## 1 4 2 8
## 2 4 10 40
## 3 7 4 28
## 4 7 22 154
## 5 8 16 128
## 6 9 10 90
As a result, the cars data is now a data frame with 3 variables and 50 observations.
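The source does not show the code for this step. A minimal sketch, assuming the columns were renamed as described earlier, could be:

```r
# Assumption: rebuild the renamed data frame from the built-in dataset.
data(cars)
names(cars) <- c("속도", "거리")

# Derived variable: element-wise product of the two existing columns.
# [[ ]] is used so the non-ASCII column name needs no special quoting.
cars[["속도x거리"]] <- cars[["속도"]] * cars[["거리"]]

head(cars)
dim(cars)
```

Because `*` is vectorised, the product is computed for all 50 rows at once, with no explicit loop.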
cars$속도정도 <- ifelse(cars$속도 > 7 , "빠름", "느림")
head(cars)
## 속도 거리 속도x거리 속도정도
## 1 4 2 8 느림
## 2 4 10 40 느림
## 3 7 4 28 느림
## 4 7 22 154 느림
## 5 8 16 128 빠름
## 6 9 10 90 빠름
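`ifelse()` evaluates its condition element by element, which is why a speed of exactly 7 is labelled 느림 (slow) in the output above: the test is a strict `> 7`. The sketch below uses a hypothetical vector `speed` to show that boundary case, and that a missing value stays missing rather than falling into either label.

```r
# Hypothetical input vector, including the boundary value 7 and a missing value
speed <- c(4, 7, 8, NA)

# Strictly greater than 7 -> "빠름" (fast), otherwise "느림" (slow); NA stays NA
speed_label <- ifelse(speed > 7, "빠름", "느림")
speed_label
```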
#install.packages("hflights")
require(hflights)
## Loading required package: hflights
hflights <- hflights
#View(hflights)
summary(hflights)
## Year Month DayofMonth DayOfWeek DepTime
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000 Min. : 1
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000 1st Qu.:1021
## Median :2011 Median : 7.000 Median :16.00 Median :4.000 Median :1416
## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948 Mean :1396
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000 3rd Qu.:1801
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000 Max. :2400
## NA's :2905
## ArrTime UniqueCarrier FlightNum TailNum
## Min. : 1 Length:227496 Min. : 1 Length:227496
## 1st Qu.:1215 Class :character 1st Qu.: 855 Class :character
## Median :1617 Mode :character Median :1696 Mode :character
## Mean :1578 Mean :1962
## 3rd Qu.:1953 3rd Qu.:2755
## Max. :2400 Max. :7290
## NA's :3066
## ActualElapsedTime AirTime ArrDelay DepDelay
## Min. : 34.0 Min. : 11.0 Min. :-70.000 Min. :-33.000
## 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000 1st Qu.: -3.000
## Median :128.0 Median :107.0 Median : 0.000 Median : 0.000
## Mean :129.3 Mean :108.1 Mean : 7.094 Mean : 9.445
## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000 3rd Qu.: 9.000
## Max. :575.0 Max. :549.0 Max. :978.000 Max. :981.000
## NA's :3622 NA's :3622 NA's :3622 NA's :2905
## Origin Dest Distance TaxiIn
## Length:227496 Length:227496 Min. : 79.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 376.0 1st Qu.: 4.000
## Mode :character Mode :character Median : 809.0 Median : 5.000
## Mean : 787.8 Mean : 6.099
## 3rd Qu.:1042.0 3rd Qu.: 7.000
## Max. :3904.0 Max. :165.000
## NA's :3066
## TaxiOut Cancelled CancellationCode Diverted
## Min. : 1.00 Min. :0.00000 Length:227496 Min. :0.000000
## 1st Qu.: 10.00 1st Qu.:0.00000 Class :character 1st Qu.:0.000000
## Median : 14.00 Median :0.00000 Mode :character Median :0.000000
## Mean : 15.09 Mean :0.01307 Mean :0.002853
## 3rd Qu.: 18.00 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :163.00 Max. :1.00000 Max. :1.000000
## NA's :2947
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
str(hflights)
## 'data.frame': 227496 obs. of 21 variables:
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : chr "" "" "" "" ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
newhflights <- select(hflights, Origin, Distance, TailNum)
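`select()` keeps columns by name. The sketch below is only an illustration on a hypothetical toy data frame, `flights_toy`, which borrows a few hflights column names. It shows the same call as above along with two common variations.

```r
library(dplyr)

# Hypothetical toy data with a few hflights-style columns
flights_toy <- data.frame(
  Origin   = c("IAH", "HOU", "IAH"),
  Distance = c(224L, 781L, 689L),
  TailNum  = c("N576AA", "N632SW", "N951DL"),
  TaxiIn   = c(7L, 5L, 16L),
  TaxiOut  = c(13L, 10L, 8L)
)

# Keep columns by name, in the order listed
by_name <- select(flights_toy, Origin, Distance, TailNum)

# Drop a column with a minus sign
without_tail <- select(flights_toy, -TailNum)

# Keep columns whose names share a prefix
taxi_only <- select(flights_toy, starts_with("Taxi"))

names(by_name)
names(without_tail)
names(taxi_only)
```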
head(
filter(hflights, Month==2)
)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 2 1 2 1401 1539 AA 428
## 2 2011 2 2 3 1420 1530 AA 428
## 3 2011 2 3 4 1405 1504 AA 428
## 4 2011 2 4 5 1516 1614 AA 428
## 5 2011 2 5 6 1358 1505 AA 428
## 6 2011 2 6 7 1350 1452 AA 428
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N474AA 98 73 29 1 IAH DFW 224
## 2 N463AA 70 42 20 20 IAH DFW 224
## 3 N548AA 59 40 -6 5 IAH DFW 224
## 4 N425AA 58 39 64 76 IAH DFW 224
## 5 N4UCAA 67 40 -5 -2 IAH DFW 224
## 6 N560AA 62 44 -18 -10 IAH DFW 224
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1 7 18 0 0
## 2 8 20 0 0
## 3 7 12 0 0
## 4 7 12 0 0
## 5 12 15 0 0
## 6 4 14 0 0
hflights1 <- filter(hflights, Month==2)
nrow(hflights1)
## [1] 17128
head(
filter(hflights, Month==8)
)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 8 16 2 556 859 DL 1884
## 2 2011 8 17 3 1545 1836 DL 8
## 3 2011 8 17 3 1800 2117 DL 54
## 4 2011 8 17 3 700 1007 DL 810
## 5 2011 8 17 3 632 907 DL 1512
## 6 2011 8 17 3 910 1203 DL 1590
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N951DL 123 99 -15 -4 IAH ATL 689
## 2 N338NB 111 91 -28 -5 IAH ATL 689
## 3 N345NW 137 90 12 5 IAH ATL 689
## 4 N969DL 127 98 -8 -2 IAH ATL 689
## 5 N317NB 155 136 -18 -5 IAH MSP 1034
## 6 N992DL 113 95 -19 -5 IAH ATL 689
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1 16 8 0 0
## 2 10 10 0 0
## 3 25 22 0 0
## 4 18 11 0 0
## 5 7 12 0 0
## 6 5 13 0 0
hflights2 <- filter(hflights, Month==8)
nrow(hflights2)
## [1] 20176
hflights3 <- filter(hflights, Month==2 | Month==8)
nrow(hflights3)
## [1] 37304
hflights4 <- filter(hflights, Month==2 , Dest=="DFW")
nrow(hflights4)
## [1] 508
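In the calls above, `|` combines conditions with OR, while a comma between conditions means AND. This is why the February and August counts add up to the combined count (17128 + 20176 = 37304). The sketch below illustrates both forms on a hypothetical toy data frame, `flights_toy`; `%in%` is shown as an equivalent way to write the OR condition.

```r
library(dplyr)

# Hypothetical toy data
flights_toy <- data.frame(
  Month = c(1L, 2L, 2L, 8L, 8L, 8L),
  Dest  = c("DFW", "DFW", "ATL", "DFW", "ATL", "MSP")
)

# OR: rows from February or August
feb_or_aug  <- filter(flights_toy, Month == 2 | Month == 8)
feb_or_aug2 <- filter(flights_toy, Month %in% c(2, 8))

# AND: a comma between conditions behaves like &
feb_dfw  <- filter(flights_toy, Month == 2, Dest == "DFW")
feb_dfw2 <- filter(flights_toy, Month == 2 & Dest == "DFW")

nrow(feb_or_aug)
nrow(feb_dfw)
```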
hflights5 <- arrange(hflights, Month)
head(hflights5)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N576AA 60 40 -10 0 IAH DFW 224
## 2 N557AA 60 45 -9 1 IAH DFW 224
## 3 N541AA 70 48 -8 -8 IAH DFW 224
## 4 N403AA 70 39 3 3 IAH DFW 224
## 5 N492AA 62 44 -3 5 IAH DFW 224
## 6 N262AA 64 45 -7 -1 IAH DFW 224
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1 7 13 0 0
## 2 6 9 0 0
## 3 5 17 0 0
## 4 9 22 0 0
## 5 9 9 0 0
## 6 6 13 0 0
tail(hflights5)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 227491 2011 12 6 2 1307 1600 WN 471
## 227492 2011 12 6 2 1818 2111 WN 1191
## 227493 2011 12 6 2 2047 2334 WN 1674
## 227494 2011 12 6 2 912 1031 WN 127
## 227495 2011 12 6 2 656 812 WN 621
## 227496 2011 12 6 2 1600 1713 WN 1597
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 227491 N632SW 113 98 0 7 HOU TPA 781
## 227492 N284WN 113 97 -9 8 HOU TPA 781
## 227493 N366SW 107 94 4 7 HOU TPA 781
## 227494 N777QC 79 61 -4 -3 HOU TUL 453
## 227495 N727SW 76 64 -13 -4 HOU TUL 453
## 227496 N745SW 73 59 -12 0 HOU TUL 453
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 227491 5 10 0 0
## 227492 5 11 0 0
## 227493 4 9 0 0
## 227494 4 14 0 0
## 227495 3 9 0 0
## 227496 3 11 0 0
hflights6 <- arrange(hflights, desc(Month))
head(hflights6)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 12 15 4 2113 2217 AA 426
## 2 2011 12 16 5 2004 2128 AA 426
## 3 2011 12 18 7 2007 2113 AA 426
## 4 2011 12 19 1 2108 2223 AA 426
## 5 2011 12 20 2 2008 2107 AA 426
## 6 2011 12 21 3 2025 2124 AA 426
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N433AA 64 44 47 63 IAH DFW 224
## 2 N588AA 84 39 -2 -6 IAH DFW 224
## 3 N4XHAA 66 46 -17 -3 IAH DFW 224
## 4 N4YDAA 75 54 53 58 IAH DFW 224
## 5 N434AA 59 41 -23 -2 IAH DFW 224
## 6 N589AA 59 43 -6 15 IAH DFW 224
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1 6 14 0 0
## 2 8 37 0 0
## 3 7 13 0 0
## 4 11 10 0 0
## 5 6 12 0 0
## 6 5 11 0 0
hflights7 <- arrange(hflights, desc(Month), DayOfWeek)
head(hflights7)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 12 19 1 2108 2223 AA 426
## 2 2011 12 26 1 2013 2118 AA 426
## 3 2011 12 5 1 558 926 AA 466
## 4 2011 12 12 1 609 921 AA 466
## 5 2011 12 19 1 603 913 AA 466
## 6 2011 12 26 1 558 912 AA 466
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N4YDAA 75 54 53 58 IAH DFW 224
## 2 N4YLAA 65 43 -12 3 IAH DFW 224
## 3 N3CTAA 148 116 6 -2 IAH MIA 964
## 4 N3CEAA 132 113 1 9 IAH MIA 964
## 5 N3GWAA 130 115 -7 3 IAH MIA 964
## 6 N3GWAA 134 112 -8 -2 IAH MIA 964
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1 11 10 0 0
## 2 8 14 0 0
## 3 11 21 0 0
## 4 10 9 0 0
## 5 5 10 0 0
## 6 6 16 0 0
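With several sort keys, `arrange()` sorts by the first key and uses later keys only to break ties. A small sketch on a hypothetical data frame, `flights_toy`, makes the ordering easy to check by eye:

```r
library(dplyr)

# Hypothetical toy data
flights_toy <- data.frame(
  Month     = c(1L, 12L, 12L, 6L),
  DayOfWeek = c(6L, 4L, 1L, 2L)
)

# Month descending first; within the same Month, DayOfWeek ascending
sorted <- arrange(flights_toy, desc(Month), DayOfWeek)
sorted
```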
head(
mutate(hflights,
gain = ArrDelay - DepDelay,
gain_per_hour = gain / (AirTime/60)
)
)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N576AA 60 40 -10 0 IAH DFW 224
## 2 N557AA 60 45 -9 1 IAH DFW 224
## 3 N541AA 70 48 -8 -8 IAH DFW 224
## 4 N403AA 70 39 3 3 IAH DFW 224
## 5 N492AA 62 44 -3 5 IAH DFW 224
## 6 N262AA 64 45 -7 -1 IAH DFW 224
## TaxiIn TaxiOut Cancelled CancellationCode Diverted gain gain_per_hour
## 1 7 13 0 0 -10 -15.00000
## 2 6 9 0 0 -10 -13.33333
## 3 5 17 0 0 0 0.00000
## 4 9 22 0 0 0 0.00000
## 5 9 9 0 0 -8 -10.90909
## 6 6 13 0 0 -6 -8.00000
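Note that `gain_per_hour` uses `gain`, which is created earlier in the same `mutate()` call. The sketch below reproduces the arithmetic for the first three rows of the output above in a hypothetical data frame, `delays_toy`. For the first row, gain = -10 - 0 = -10 minutes, and -10 / (40 / 60) = -15 minutes per hour of air time.

```r
library(dplyr)

# Values copied from the first three rows of the output above
delays_toy <- data.frame(
  ArrDelay = c(-10, -9, -8),
  DepDelay = c(0, 1, -8),
  AirTime  = c(40, 45, 48)
)

with_gain <- mutate(delays_toy,
  gain          = ArrDelay - DepDelay,   # minutes gained (negative) or lost in the air
  gain_per_hour = gain / (AirTime / 60)  # AirTime is in minutes, so convert to hours
)
with_gain
```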
summarise(hflights,
delay = mean(DepDelay,
na.rm = TRUE)
)
## delay
## 1 9.444951
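The `na.rm = TRUE` argument matters here because `DepDelay` contains missing values (the summary earlier reports NA's for it). A sketch on a hypothetical data frame, `delays_toy`, shows what happens without it and how to count the missing values in the same call:

```r
library(dplyr)

# Hypothetical toy data with one missing delay
delays_toy <- data.frame(DepDelay = c(0, 1, -8, NA, 43))

delay_summary <- summarise(delays_toy,
  delay_raw = mean(DepDelay),                # NA, because one value is missing
  delay     = mean(DepDelay, na.rm = TRUE),  # mean of the four observed values
  n_missing = sum(is.na(DepDelay))           # number of missing values
)
delay_summary
```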
hflights %>%
group_by(TailNum) %>%
select(Month, DayofMonth, DayOfWeek) %>%
summarise(
AveMonth = mean(Month, na.rm = TRUE),
AveDayOfMonth = mean(DayofMonth, na.rm = TRUE),
AveDayOfWeek = mean(DayOfWeek, na.rm = TRUE)
)
## Adding missing grouping variables: `TailNum`
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3,320 x 4
## TailNum AveMonth AveDayOfMonth AveDayOfWeek
## <chr> <dbl> <dbl> <dbl>
## 1 "" 3.25 10.0 4.32
## 2 "N0EGMQ" 6.42 16.4 3.42
## 3 "N10156" 6.23 14.0 3.83
## 4 "N10575" 8.49 13.2 3.26
## 5 "N11106" 7.37 15.0 4.10
## 6 "N11107" 7.18 14.8 4.04
## 7 "N11109" 6.24 15.4 3.92
## 8 "N11113" 6.77 15.4 3.98
## 9 "N11119" 7.07 16.0 4.03
## 10 "N11121" 6.46 16.2 4.05
## # … with 3,310 more rows
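The two messages in this output have simple causes. The first appears because `select()` was called after `group_by()` without listing the grouping column, so dplyr adds `TailNum` back. The second is a reminder about how `summarise()` handles grouping in its result. A sketch that avoids both, on a hypothetical data frame `flights_toy` and assuming dplyr 1.0.0 or later for the `.groups` argument, might be:

```r
library(dplyr)

# Hypothetical toy data
flights_toy <- data.frame(
  TailNum   = c("N1", "N1", "N2"),
  Month     = c(1L, 3L, 8L),
  DayOfWeek = c(6L, 2L, 4L),
  Distance  = c(224L, 224L, 689L)
)

by_tail <- flights_toy %>%
  # select the grouping column explicitly, before grouping
  select(TailNum, Month, DayOfWeek) %>%
  group_by(TailNum) %>%
  summarise(
    AveMonth     = mean(Month),
    AveDayOfWeek = mean(DayOfWeek),
    .groups = "drop"  # return an ungrouped result and silence the message
  )
by_tail
```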
head(
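# start from hflights
hflights %>%
# drop rows whose TailNum is an empty string
filter(TailNum != "") %>%
# keep only the columns needed
select(TailNum, FlightNum, AirTime, Distance, Origin) %>%
# keep flights that depart from HOU or have Distance > 1000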
filter(Origin == "HOU" | Distance > 1000) %>%
group_by(TailNum) %>%
summarise(
AveAirTime = mean(AirTime, na.rm = TRUE),
AveDistance = mean(Distance, na.rm = TRUE)
)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
## TailNum AveAirTime AveDistance
## <chr> <dbl> <dbl>
## 1 N0EGMQ 182. 1379
## 2 N10156 142. 1113.
## 3 N10575 136. 1042
## 4 N11106 141. 1093.
## 5 N11107 143. 1094.
## 6 N11109 144. 1082.
#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2
data("midwest", package = "ggplot2")
summary(midwest)
## PID county state area
## Min. : 561 Length:437 Length:437 Min. :0.00500
## 1st Qu.: 670 Class :character Class :character 1st Qu.:0.02400
## Median :1221 Mode :character Mode :character Median :0.03000
## Mean :1437 Mean :0.03317
## 3rd Qu.:2059 3rd Qu.:0.03800
## Max. :3052 Max. :0.11000
## poptotal popdensity popwhite popblack
## Min. : 1701 Min. : 85.05 Min. : 416 Min. : 0
## 1st Qu.: 18840 1st Qu.: 622.41 1st Qu.: 18630 1st Qu.: 29
## Median : 35324 Median : 1156.21 Median : 34471 Median : 201
## Mean : 96130 Mean : 3097.74 Mean : 81840 Mean : 11024
## 3rd Qu.: 75651 3rd Qu.: 2330.00 3rd Qu.: 72968 3rd Qu.: 1291
## Max. :5105067 Max. :88018.40 Max. :3204947 Max. :1317147
## popamerindian popasian popother percwhite
## Min. : 4.0 Min. : 0 Min. : 0 Min. :10.69
## 1st Qu.: 44.0 1st Qu.: 35 1st Qu.: 20 1st Qu.:94.89
## Median : 94.0 Median : 102 Median : 66 Median :98.03
## Mean : 343.1 Mean : 1310 Mean : 1613 Mean :95.56
## 3rd Qu.: 288.0 3rd Qu.: 401 3rd Qu.: 345 3rd Qu.:99.07
## Max. :10289.0 Max. :188565 Max. :384119 Max. :99.82
## percblack percamerindan percasian percother
## Min. : 0.0000 Min. : 0.05623 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.1157 1st Qu.: 0.15793 1st Qu.:0.1737 1st Qu.:0.09102
## Median : 0.5390 Median : 0.21502 Median :0.2972 Median :0.17844
## Mean : 2.6763 Mean : 0.79894 Mean :0.4872 Mean :0.47906
## 3rd Qu.: 2.6014 3rd Qu.: 0.38362 3rd Qu.:0.5212 3rd Qu.:0.48050
## Max. :40.2100 Max. :89.17738 Max. :5.0705 Max. :7.52427
## popadults perchsd percollege percprof
## Min. : 1287 Min. :46.91 Min. : 7.336 Min. : 0.5203
## 1st Qu.: 12271 1st Qu.:71.33 1st Qu.:14.114 1st Qu.: 2.9980
## Median : 22188 Median :74.25 Median :16.798 Median : 3.8142
## Mean : 60973 Mean :73.97 Mean :18.273 Mean : 4.4473
## 3rd Qu.: 47541 3rd Qu.:77.20 3rd Qu.:20.550 3rd Qu.: 4.9493
## Max. :3291995 Max. :88.90 Max. :48.079 Max. :20.7913
## poppovertyknown percpovertyknown percbelowpoverty percchildbelowpovert
## Min. : 1696 Min. :80.90 Min. : 2.180 Min. : 1.919
## 1st Qu.: 18364 1st Qu.:96.89 1st Qu.: 9.199 1st Qu.:11.624
## Median : 33788 Median :98.17 Median :11.822 Median :15.270
## Mean : 93642 Mean :97.11 Mean :12.511 Mean :16.447
## 3rd Qu.: 72840 3rd Qu.:98.60 3rd Qu.:15.133 3rd Qu.:20.352
## Max. :5023523 Max. :99.86 Max. :48.691 Max. :64.308
## percadultpoverty percelderlypoverty inmetro category
## Min. : 1.938 Min. : 3.547 Min. :0.0000 Length:437
## 1st Qu.: 7.668 1st Qu.: 8.912 1st Qu.:0.0000 Class :character
## Median :10.008 Median :10.869 Median :0.0000 Mode :character
## Mean :10.919 Mean :11.389 Mean :0.3432
## 3rd Qu.:13.182 3rd Qu.:13.412 3rd Qu.:1.0000
## Max. :43.312 Max. :31.162 Max. :1.0000
options(scipen = 999)
ggplot(midwest, aes(x=area, y=poptotal))
midwest %>%
ggplot(aes(x=area, y=poptotal)) +
geom_point()
midwest %>%
ggplot(aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
g <- midwest %>%
ggplot(aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method = "lm")
g + xlim(c(0, 0.1)) + ylim(c(0,100000))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 86 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).
g1 <- g +
coord_cartesian(xlim = c(0,0.1), ylim = c(0,1000000))
g1
## `geom_smooth()` using formula 'y ~ x'
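The two zooming approaches are not equivalent. `xlim()` and `ylim()` set scale limits, so rows outside the range are dropped before `geom_smooth()` fits its line, which is what the "Removed ... rows" warnings report. `coord_cartesian()` only changes the visible window and leaves the data untouched. The sketch below imitates that difference without plotting: it fits `lm()` once on all rows and once on the rows inside the limits used with `xlim()` and `ylim()` above. The names `inside`, `fit_all` and `fit_inside` are illustrative.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Rows that fall inside the limits passed to xlim() and ylim() above
inside <- with(midwest,
               area >= 0 & area <= 0.1 & poptotal >= 0 & poptotal <= 100000)

# Number of rows the scale limits would discard
sum(!inside)

# Line fitted to all rows (what coord_cartesian() keeps)
fit_all <- lm(poptotal ~ area, data = midwest)

# Line fitted only to the rows inside the limits (what xlim()/ylim() leave)
fit_inside <- lm(poptotal ~ area, data = midwest[inside, ])

coef(fit_all)
coef(fit_inside)
```

The two sets of coefficients differ, so the regression line drawn after `xlim()`/`ylim()` is not the same line as the one shown with `coord_cartesian()`.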
g1 +
labs(title = "Area VS population",
subtitle = "midwest",
y = "Population",
x = "area",
caption = "Midwest Population"
)
## `geom_smooth()` using formula 'y ~ x'
g <- midwest %>%
ggplot(aes(x = area, y = poptotal)) +
geom_point(col = "blue", size = 0.5) +
geom_smooth(method = "lm", col = "red", size = 0.5)
g + xlim(c(0, 0.1)) + ylim(c(0,100000))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 86 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).
g <- midwest %>%
ggplot(aes(x = area, y = poptotal)) +
geom_point(aes(col=state), size = 0.5) +
geom_smooth(method = "lm", col = "red", size = 0.5)
g + xlim(c(0, 0.1)) + ylim(c(0,100000))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 86 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).
g <- midwest %>%
ggplot(aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method = "lm")
g + facet_wrap(~state, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
midwest
## # A tibble: 437 x 28
## PID county state area poptotal popdensity popwhite popblack popamerindian
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
## 2 562 ALEXA… IL 0.014 10626 759 7054 3496 19
## 3 563 BOND IL 0.022 14991 681. 14477 429 35
## 4 564 BOONE IL 0.017 30806 1812. 29344 127 46
## 5 565 BROWN IL 0.018 5836 324. 5264 547 14
## 6 566 BUREAU IL 0.05 35688 714. 35157 50 65
## 7 567 CALHO… IL 0.017 5322 313. 5298 1 8
## 8 568 CARRO… IL 0.027 16805 622. 16519 111 30
## 9 569 CASS IL 0.024 13437 560. 13384 16 8
## 10 570 CHAMP… IL 0.058 173025 2983. 146506 16559 331
## # … with 427 more rows, and 19 more variables: popasian <int>, popother <int>,
## # percwhite <dbl>, percblack <dbl>, percamerindan <dbl>, percasian <dbl>,
## # percother <dbl>, popadults <int>, perchsd <dbl>, percollege <dbl>,
## # percprof <dbl>, poppovertyknown <int>, percpovertyknown <dbl>,
## # percbelowpoverty <dbl>, percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
hist(midwest$popdensity)
midwest$degreeofdensity <- ifelse(midwest$popdensity >= 10000, "high", "low")
table(midwest$degreeofdensity)
##
## high low
## 23 414
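The same derived variable and frequency table can also be written in the dplyr style used earlier in this section. This is a sketch of an alternative, not the method used in the source; `density_counts` is an illustrative name.

```r
library(dplyr)
data("midwest", package = "ggplot2")

density_counts <- midwest %>%
  # if_else() is dplyr's stricter counterpart to base ifelse()
  mutate(degreeofdensity = if_else(popdensity >= 10000, "high", "low")) %>%
  # count() tabulates the rows in each category
  count(degreeofdensity)

density_counts
```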
g <- midwest %>%
ggplot(aes(x=area, y=poptotal)) +
geom_point(aes(col=degreeofdensity)) +
geom_smooth(method = "lm")
g + facet_wrap( ~ state, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
mtcars <- mtcars
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
plot(mtcars)
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# Using the mtcars data, pick two continuous variables and draw a scatter plot.
# Use a categorical variable to set the colour of the points.
mtcars$names <- rownames(mtcars)
m <- mtcars %>%
ggplot(aes(x = mpg, y = disp)) +
geom_point(aes(col = as.factor(vs)), size = 2) +
geom_smooth(method = "lm") +
geom_text(aes(label = names), size = 3)
m
## `geom_smooth()` using formula 'y ~ x'
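One possible refinement of this plot is sketched below: a readable legend and fewer overlapping labels. It assumes the coding given in the `mtcars` help page (`vs`: 0 = V-shaped, 1 = straight). The names `cars_plot`, `engine` and `m2` are illustrative.

```r
library(ggplot2)

# Work on a copy so the original mtcars is left unchanged
cars_plot <- mtcars
cars_plot$names <- rownames(cars_plot)

# Assumption from the mtcars help page: vs is 0 = V-shaped, 1 = straight
cars_plot$engine <- factor(cars_plot$vs, levels = c(0, 1),
                           labels = c("V-shaped", "straight"))

m2 <- ggplot(cars_plot, aes(x = mpg, y = disp)) +
  geom_point(aes(colour = engine), size = 2) +
  geom_smooth(method = "lm", formula = y ~ x) +
  # nudge labels above the points and skip labels that would overlap
  geom_text(aes(label = names), size = 3, vjust = -0.8, check_overlap = TRUE) +
  labs(colour = "Engine shape")
m2
```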