Data Exploration with hflights

2. Data exploration

2.1 Meaning of each column

?hflights

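Year, Month, DayofMonth: date of departure
DayOfWeek: day of week of departure (useful for removing weekend effects)
DepTime, ArrTime: departure and arrival times (local time, hhmm)
UniqueCarrier: unique abbreviation for the carrier (airline)
FlightNum: flight number
TailNum: airplane tail number (aircraft registration number)
ActualElapsedTime: elapsed time of the flight (minutes); conceptually ArrTime - DepTime, though the two hhmm local-time values cannot be subtracted directly
AirTime: flight time (minutes), the time spent in the air
ArrDelay, DepDelay: arrival and departure delays (minutes)
Origin, Dest: origin and destination airport codes
Distance: flight distance (miles)
TaxiIn, TaxiOut: taxi-in and taxi-out times (minutes), the time the aircraft spends moving on the ground
Cancelled: cancellation indicator: 1 = yes, 0 = no
CancellationCode: reason for cancellation: A = carrier, B = weather, C = national aviation system, D = security
Diverted: diverted indicator: 1 = yes, 0 = no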

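Knowing the precise meaning of each column is really a matter of aviation domain expertise. A data analyst should therefore talk to experts in the airline industry to understand these columns as accurately as possible.

2.2 How many missing values does each column have?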

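A column in an R data frame can hold three kinds of special values that behave like missing data: NA, NaN, and infinite values (Inf, -Inf). Let's check every column for each of them.
The corresponding test functions are:
NA: is.na, NaN: is.nan, Infinite: is.infinite
Note that is.na() also returns TRUE for NaN, so the NA counts below would include any NaN values.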
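
The difference between the three test functions is easiest to see on a small vector. The sketch below uses a made-up vector (not hflights data) and only base R:

```r
# Toy vector (not from hflights): one NA, one NaN, one infinite value.
x <- c(1, NA, NaN, Inf, 5)

is.na(x)        # TRUE for both NA and NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE only for Inf / -Inf

# Counting each kind by summing the logical vectors
na_count  <- sum(is.na(x))
nan_count <- sum(is.nan(x))
inf_count <- sum(is.infinite(x))
```

Because is.na() covers NaN as well, a column with zero NA hits cannot contain NaN values either.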

library(dplyr) # needed for the %>% operator
hflights %>% sapply(is.na) %>% class() # the result is a matrix (class "matrix" "array")

[1] "matrix" "array"

hflights %>% sapply(is.na) %>% colSums()

             Year             Month        DayofMonth         DayOfWeek 
                0                 0                 0                 0 
          DepTime           ArrTime     UniqueCarrier         FlightNum 
             2905              3066                 0                 0 
          TailNum ActualElapsedTime           AirTime          ArrDelay 
                0              3622              3622              3622 
         DepDelay            Origin              Dest          Distance 
             2905                 0                 0                 0 
           TaxiIn           TaxiOut         Cancelled  CancellationCode 
             3066              2947                 0                 0 
         Diverted 
                0

hflights %>% sapply(is.nan) %>% colSums()

             Year             Month        DayofMonth         DayOfWeek 
                0                 0                 0                 0 
          DepTime           ArrTime     UniqueCarrier         FlightNum 
                0                 0                 0                 0 
          TailNum ActualElapsedTime           AirTime          ArrDelay 
                0                 0                 0                 0 
         DepDelay            Origin              Dest          Distance 
                0                 0                 0                 0 
           TaxiIn           TaxiOut         Cancelled  CancellationCode 
                0                 0                 0                 0 
         Diverted 
                0

hflights %>% sapply(is.infinite) %>% colSums()

             Year             Month        DayofMonth         DayOfWeek 
                0                 0                 0                 0 
          DepTime           ArrTime     UniqueCarrier         FlightNum 
                0                 0                 0                 0 
          TailNum ActualElapsedTime           AirTime          ArrDelay 
                0                 0                 0                 0 
         DepDelay            Origin              Dest          Distance 
                0                 0                 0                 0 
           TaxiIn           TaxiOut         Cancelled  CancellationCode 
                0                 0                 0                 0 
         Diverted 
                0

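hflights %>% sapply(is.na) %>% class() shows that the result is a matrix (class "matrix", "array"), so functions that operate on tables, such as colSums(), can be applied to it. The check shows that only NA values are present; there are no NaN or Inf values.
How to handle the missing values will be decided after some more exploration.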
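
Raw counts are easier to judge as a share of all rows. A minimal sketch, assuming a small made-up data frame named toy in place of hflights:

```r
# Toy data frame standing in for hflights (values are made up).
toy <- data.frame(
  DepTime  = c(1400, NA, 1530, 900),
  ArrDelay = c(-10, NA, NA, 5),
  Origin   = c("IAH", "HOU", "IAH", "IAH")
)

# is.na() on a data frame returns a logical matrix,
# so colSums() and colMeans() can be applied to it directly.
na_counts <- colSums(is.na(toy))
na_share  <- round(100 * colMeans(is.na(toy)), 1)  # percent of rows that are NA
```

The same two lines should work on hflights itself, since is.na() accepts any data frame.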

2.3 Checking the composition of each column

# table() is part of base R, so no extra package is needed
hflights %>% sapply(table) %>% sapply(length) # number of distinct non-NA values per column

             Year             Month        DayofMonth         DayOfWeek 
                1                12                31                 7 
          DepTime           ArrTime     UniqueCarrier         FlightNum 
             1207              1283                15              3740 
          TailNum ActualElapsedTime           AirTime          ArrDelay 
             3320               435               398               463 
         DepDelay            Origin              Dest          Distance 
              429                 2               116               159 
           TaxiIn           TaxiOut         Cancelled  CancellationCode 
               96               143                 2                 5 
         Diverted 
                2

The data cover the 12 months of 2011, with 15 carriers (UniqueCarrier), 2 origins (Origin), and 116 destinations (Dest).
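
One detail worth keeping in mind: table() drops NA by default, so the counts above are distinct non-missing values. The sketch below, on a made-up data frame, compares this with length(unique()), which counts NA as a value of its own:

```r
# Toy data frame (made up); DepDelay contains one NA.
toy <- data.frame(
  Dest     = c("DFW", "DFW", "LAX", "MIA"),
  DepDelay = c(3, NA, 3, -2)
)

# Distinct values via table(): NA is not counted
n_by_table  <- sapply(toy, function(col) length(table(col)))

# Distinct values via unique(): NA is counted as one more value
n_by_unique <- sapply(toy, function(col) length(unique(col)))
```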

2.4 Distribution of flights by month and by day of week, as text and as plots

library(ggplot2)   # needed for ggplot()

hflights %>% group_by(Month) %>% summarise(n=n())

# A tibble: 12 x 2
   Month     n
   <int> <int>
 1     1 18910
 2     2 17128
 3     3 19470
 4     4 18593
 5     5 19172
 6     6 19600
 7     7 20548
 8     8 20176
 9     9 18065
10    10 18696
11    11 18021
12    12 19117

hflights %>% 
  group_by(Month) %>%
  summarise(n=n()) %>% 
  ggplot()+
  aes(x = as.factor(Month), y = n, fill=as.factor(Month))+
  geom_col()+
  xlab("Month")+
  ylab("Number of flights")+
  labs(title = "Flights per month")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))+
  guides(fill=guide_legend(title="Month"))+
  theme(legend.position = "none")+
  theme(axis.title.y = element_text(size = 15))

hflights %>% group_by(DayOfWeek) %>% summarise(n=n())

# A tibble: 7 x 2
  DayOfWeek     n
      <int> <int>
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058

hflights %>% 
  group_by(DayOfWeek) %>%
  summarise(n=n()) %>% 
  ggplot()+
  aes(x = as.factor(DayOfWeek), y=n, fill=as.factor(DayOfWeek))+
  geom_col()+
  xlab("Day of Week")+
  ylab("Number of flights") +
  labs(title = "Flights per day of week")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))+
  guides(fill=guide_legend(title = "Day Of Week"))+
  theme(legend.position = "none")+
  theme(axis.title.y = element_text(size = 15))

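By month, July and August look like the peak season. By day of week, day 6 has the fewest flights (27,629) and day 5 the most (34,972); assuming DayOfWeek is coded 1 = Monday through 7 = Sunday, these are Saturday and Friday respectively.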
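
Reading weekdays off integer codes is error-prone, so it can help to attach labels first. A small sketch using the counts from the table above; the labels assume that DayOfWeek is coded 1 = Monday through 7 = Sunday, which is an assumption to verify against the dataset documentation:

```r
# Flights per DayOfWeek, copied from the summarise() output above.
day_counts <- c(34360, 31649, 31926, 34902, 34972, 27629, 32058)

# ASSUMPTION: 1 = Monday, ..., 7 = Sunday
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
names(day_counts) <- day_labels

busiest  <- names(which.max(day_counts))  # label of the largest count
quietest <- names(which.min(day_counts))  # label of the smallest count
```

In a plot, the same labels could be applied with factor(DayOfWeek, levels = 1:7, labels = day_labels).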

2.5 Number of flights, cancellations, and cancellation rate by carrier

hflights %>% 
  ggplot()+aes(x = UniqueCarrier, fill = UniqueCarrier)+
  geom_bar()+
  labs(title = "Flights per carrier")+
  ylab("Number of flights")+
  xlab("Carrier")+
  theme(plot.title = element_text(hjust = 0.5, size = 25))+
  theme(axis.title = element_text(size = 15))+
  theme(legend.position = "none")

hflights %>% group_by(UniqueCarrier) %>% filter(Cancelled==1) %>% summarise(n=n())

# A tibble: 14 x 2
   UniqueCarrier     n
   <chr>         <int>
 1 AA               60
 2 B6               18
 3 CO              475
 4 DL               42
 5 EV               76
 6 F9                6
 7 FL               21
 8 MQ              135
 9 OO              224
10 UA               34
11 US               46
12 WN              703
13 XE             1132
14 YV                1

hflights %>% 
  filter(Cancelled==1) %>% 
  ggplot()+aes(x = UniqueCarrier, fill=UniqueCarrier)+
  geom_bar()+
  labs(title = "Cancelled flights per carrier")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))+
  theme(legend.position = "none")

hflights %>% group_by(UniqueCarrier) %>% summarise(n=n()) -> total
hflights %>% group_by(UniqueCarrier) %>% filter(Cancelled==1) %>% summarise(CCC=n())-> Cancel
# all = TRUE keeps carriers with no cancellations; their CCC is NA after the merge
merge(total, Cancel, by = "UniqueCarrier", all = TRUE) %>%
  mutate(CCC=case_when(is.na(CCC) ~ 0L, TRUE ~ CCC)) %>%  # replace NA with 0 cancellations
  mutate(Cancelled_ratio=100*CCC/n) %>% 
  ggplot()+
  aes(x = UniqueCarrier, y = Cancelled_ratio, fill = UniqueCarrier)+
  geom_col()+
  labs(title = "Cancellation rate per carrier")+
  ylab("Cancellation rate (%)")+
  xlab("Carrier")+
  theme(plot.title = element_text(hjust = 0.5, size = 25))+
  theme(axis.title = element_text(size = 15))+
  theme(legend.position = "none")

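Comparing the number of flights with the cancellation rate shows that the highest cancellation rates are found among carriers with relatively few flights, although not every small carrier has a high rate.
EV has the highest rate at roughly 3.5% (76 cancelled flights), while AS, which does not appear in the cancellation table above, has no cancellations at all.
The next step is therefore to look at the reasons for cancellation.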
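
The cancellation rate can also be computed in a single pass, without building two tables and merging them: the mean of a 0/1 indicator is the share of 1s. A minimal base-R sketch on made-up rows (the data frame toy is hypothetical):

```r
# Toy flights (made up): AS has no cancelled flight at all.
toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "AA", "AA", "AS", "AS", "EV", "EV"),
  Cancelled     = c(0, 1, 0, 0, 0, 0, 1, 1)
)

# Mean of the 0/1 indicator per carrier, as a percentage.
# Carriers without cancellations get 0 directly, so no NA handling is needed.
cancel_rate <- 100 * tapply(toy$Cancelled, toy$UniqueCarrier, mean)
```

With dplyr, the equivalent would be a summarise() of 100 * mean(Cancelled) after group_by(UniqueCarrier).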

2.6 Cancellation reasons by carrier

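# Count cancellation reasons per carrier, then plot only codes A, B, C, and D as a bar chart.
# The unnamed first column of the table is the empty code, i.e. flights that were not cancelled.
hflights %>% select(UniqueCarrier, CancellationCode) %>% table()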

             CancellationCode
UniqueCarrier           A     B     C     D
           AA  3184    20    29    11     0
           AS   365     0     0     0     0
           B6   677     5    13     0     0
           CO 69557    37   436     2     0
           DL  2599    13    27     2     0
           EV  2128    60    14     2     0
           F9   832     2     4     0     0
           FL  2118     8    12     1     0
           MQ  4513    39    71    25     0
           OO 15837   121    87    15     1
           UA  2038    21    10     3     0
           US  4036    27    17     2     0
           WN 44640   517   181     5     0
           XE 71921   331   751    50     0
           YV    78     1     0     0     0

hflights %>% 
  group_by(UniqueCarrier) %>% 
  count(CancellationCode) %>% 
  filter(CancellationCode!="") %>% 
  ggplot()+aes(x = UniqueCarrier, y = n, fill=CancellationCode)+
  geom_bar(stat = "identity", position = "dodge")+
  labs(title = "Cancellations by reason: A = carrier, B = weather, C = national aviation system, D = security")+
  theme(plot.title = element_text(hjust = 0.5))
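
Absolute counts are dominated by the largest carriers, so it can help to look at the share of each reason within a carrier as well. The sketch below copies four rows from the cross-tabulation above (codes A to D only) and uses prop.table() with margin = 1 to normalise each row:

```r
# Cancellation counts by reason for four carriers, copied from the table above.
reasons <- rbind(
  CO = c(A = 37,  B = 436, C = 2,  D = 0),
  EV = c(A = 60,  B = 14,  C = 2,  D = 0),
  WN = c(A = 517, B = 181, C = 5,  D = 0),
  XE = c(A = 331, B = 751, C = 50, D = 0)
)

# Row-wise percentages: each carrier's cancellations sum to 100
reason_share <- round(100 * prop.table(reasons, margin = 1), 1)
reason_share
```

For these four carriers, most CO and XE cancellations are attributed to weather (B), while most EV and WN cancellations are attributed to the carrier (A).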

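Next, let's look at how much departure delay (DepDelay) and arrival delay (ArrDelay) occurs.
A negative delay means that the flight departed or arrived early.
We now look at the number of delays and the total delay time per carrier, excluding the negative values.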
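
Excluding early flights also needs care with NA, because comparisons with NA return NA. A short sketch on a made-up delay vector:

```r
# Toy departure delays in minutes (made up): negative = early, NA = missing.
dep_delay <- c(-3, 0, 12, NA, 45, -1, 90)

# Keep only real delays: drop NA, negative values, and on-time flights (0)
late <- dep_delay[!is.na(dep_delay) & dep_delay > 0]

n_late     <- length(late)  # number of delayed flights
total_late <- sum(late)     # total delay in minutes
mean_late  <- mean(late)    # mean delay among delayed flights
```

dplyr's filter(DepDelay > 0) behaves the same way, since it drops rows where the condition evaluates to NA.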

2.7 Departure and arrival delays by carrier: counts, total time, and other statistics

# Distribution of departure and arrival delays across all carriers
hflights %>% ggplot()+aes(DepDelay)+stat_count(color='blue', alpha=0.5)

Warning: Removed 2905 rows containing non-finite values (stat_count).

hflights %>% ggplot()+aes(ArrDelay)+stat_count(color='blue', alpha=0.5)

Warning: Removed 3622 rows containing non-finite values (stat_count).

# Distribution of departure delays by carrier
hflights %>% ggplot()+aes(x = DepDelay, group = UniqueCarrier, color=UniqueCarrier)+geom_line(stat="count")

Warning: Removed 2905 rows containing non-finite values (stat_count).

# Number of departure delays of more than 30 minutes, by carrier
hflights %>% filter(DepDelay>30) %>% group_by(UniqueCarrier) %>%
  ggplot()+aes(UniqueCarrier, fill=UniqueCarrier)+
  geom_bar()+
  labs(title = "Departure delays over 30 minutes")+ theme(plot.title = element_text(hjust = 0.5, size = 20))+
  theme(legend.position = "none")

# Delays of more than 60 minutes, by carrier; it is worth trying several thresholds.

hflights %>% filter(DepDelay>60) %>% ggplot()+aes(x = DepDelay, group = UniqueCarrier, color=UniqueCarrier)+geom_line(stat="count")

hflights %>% filter(ArrDelay>60) %>% ggplot()+aes(x = ArrDelay, group = UniqueCarrier, color=UniqueCarrier)+geom_line(stat="count")
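
Instead of rerunning the pipeline once per threshold, the counts for several thresholds can be computed together. A minimal sketch on a made-up delay vector; the threshold values are only examples:

```r
# Toy departure delays in minutes (made up).
dep_delay  <- c(-3, 0, 12, NA, 45, 31, 90, 61, 30)
thresholds <- c(15, 30, 60)

# For each threshold, count the flights delayed by MORE than that many minutes.
# na.rm = TRUE ignores flights with a missing delay.
n_over <- sapply(thresholds, function(t) sum(dep_delay > t, na.rm = TRUE))
names(n_over) <- paste0(">", thresholds)
n_over
```

Note that a strict comparison (>) excludes a delay of exactly 30 minutes; use >= if "30 minutes or more" is intended.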

# Look at departure and arrival delays with boxplots
hflights %>% 
  ggplot()+
  aes(x = DepDelay, y = UniqueCarrier, fill=UniqueCarrier)+
  geom_boxplot()

Warning: Removed 2905 rows containing non-finite values (stat_boxplot).

hflights %>% 
  ggplot()+
  aes(x = ArrDelay, y = UniqueCarrier, fill=UniqueCarrier)+
  geom_boxplot()+
  coord_flip()

Warning: Removed 3622 rows containing non-finite values (stat_boxplot).

# Summary statistics of departure and arrival delays by carrier (delayed flights only)
hflights %>% group_by(UniqueCarrier) %>% filter(DepDelay>0) %>% summarise(mean=mean(DepDelay, na.rm = T), max=max(DepDelay, na.rm = T), sum=sum(DepDelay, na.rm = T))

# A tibble: 15 x 4
   UniqueCarrier  mean   max    sum
   <chr>         <dbl> <int>  <int>
 1 AA             24.7   970  29116
 2 AS             20.8   172   2639
 3 B6             43.5   310  11059
 4 CO             17.9   981 719252
 5 DL             32.4   730  30015
 6 EV             49.3   479  32659
 7 F9             22.7   275   6637
 8 FL             33.4   507  16552
 9 MQ             37.9   931  62763
10 OO             24.6   360 167882
11 UA             28.8   869  29281
12 US             26.5   425  22005
13 WN             21.9   548 630786
14 XE             26.9   628 718548
15 YV             24.5    54    367

hflights %>% group_by(UniqueCarrier) %>% filter(DepDelay>0) %>% summarise(mean=mean(DepDelay, na.rm = T)) %>% 
  ggplot()+aes(x = UniqueCarrier, y = mean, fill=UniqueCarrier)+
  geom_col()+
  labs(title = "Mean departure delay per carrier")+
  theme(plot.title = element_text(hjust = 0.5, size = 20))

hflights %>% group_by(UniqueCarrier) %>% filter(ArrDelay>0) %>% summarise(mean=mean(ArrDelay, na.rm = T), max=max(ArrDelay, na.rm = T), sum=sum(ArrDelay, na.rm = T))

# A tibble: 15 x 4
   UniqueCarrier  mean   max    sum
   <chr>         <dbl> <int>  <int>
 1 AA             28.5   978  27443
 2 AS             22.9   183   3643
 3 B6             45.5   335  12097
 4 CO             22.1   957 753521
 5 DL             32.1   701  32221
 6 EV             40.2   469  31389
 7 F9             18.7   277   8652
 8 FL             27.9   500  18302
 9 MQ             38.8   918  64521
10 OO             24.1   380 203870
11 UA             32.5   861  32773
12 US             20.7   433  27265
13 WN             25.3   499 522865
14 XE             24.2   634 857147
15 YV             18.7    72    691

hflights %>% group_by(UniqueCarrier) %>% filter(ArrDelay>0) %>% summarise(mean=mean(ArrDelay, na.rm = T)) %>% 
  ggplot()+aes(x = UniqueCarrier, y = mean, fill=UniqueCarrier)+
  geom_col()+
  labs(title = "Mean arrival delay per carrier")+
  theme(plot.title = element_text(hjust = 0.5, size = 20))

# Show the mean departure and arrival delays per carrier in a single plot
library(tidyverse) # attaches tidyr, which provides gather() and spread()
hflights %>% group_by(UniqueCarrier) %>% 
  filter(DepDelay>0 & ArrDelay>0) %>% 
  summarise(dep_mean=mean(DepDelay, na.rm = T), arr_mean=mean(ArrDelay, na.rm = T)) %>% 
  # reshape the wide table into a long table
  gather(key = "dep_arr", value = "mean", -UniqueCarrier) %>% 
  ggplot()+
  aes(x = UniqueCarrier, y = mean, group=dep_arr, fill=dep_arr)+
  geom_col(position = "dodge")+
  labs(title = "Mean departure and arrival delay per carrier")+
  theme(plot.title = element_text(hjust = 0.5, size = 20))

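# The plot shows that carriers with longer mean departure delays also tend to have longer mean arrival delays.
# This is to be expected, since a late departure usually carries over to the arrival.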
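
The link between departure and arrival delay can also be checked per flight rather than per carrier, for example with a correlation coefficient. A minimal sketch on made-up values; the data frame toy is hypothetical and the resulting number says nothing about hflights itself:

```r
# Toy delays in minutes (made up); the last flight has missing values.
toy <- data.frame(
  DepDelay = c(-5, 0, 10, 30, 60, NA),
  ArrDelay = c(-12, -3, 8, 25, 70, NA)
)

# Pearson correlation, using only rows where both delays are present
r <- cor(toy$DepDelay, toy$ArrDelay, use = "complete.obs")
r
```

Unlike the bar chart, this keeps early flights (negative delays) in the calculation, so it does not depend on filtering to delayed flights only.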

Semper

2020 9 5

0. Purpose of this document

1. data import