Tornado Data

Question a)

How many tornadoes occurred in WA (Washington) and MS (Mississippi)?

sum(st=="WA" | st == "MS")
## [1] 530




Question b)

How many tornadoes occurred in WA before 2012?

sum(st=="WA" & yr <= 2012 )
## [1] 13




Question c)

Get all tornadoes that occurred in WA in 2011, 2013, and 2014.

sum(st=="WA" & yr %in% c(2011,2013,2014))
## [1] 7

The List is as follows:

filter(df_t, st=="WA" & yr %in% c(2011,2013,2014))
##     yr mo dy       date     time tz st stf stn    f
## 1 2011  5 27 2011-05-27 14:50:00  3 WA  53   1 EF-0
## 2 2011 10  5 2011-10-05 16:19:00  3 WA  53   2 EF-0
## 3 2013  3 21 2013-03-21 18:00:00  3 WA  53   0 EF-0
## 4 2013  9 30 2013-09-30 08:20:00  3 WA  53   0 EF-1
## 5 2014  4 27 2014-04-27 18:30:00  3 WA  53   0 EF-0
## 6 2014  8 13 2014-08-13 19:30:00  3 WA  53   0 EF-0
## 7 2014 10 23 2014-10-23 14:40:00  3 WA  53   0 EF-1




Question d)

Which ’mo’ (month) had the most tornadoes in HI (Hawaii)? How do you know.

month.name[max((df_t[st=="HI",] %>% group_by(mo)%>% tally())$n)]
## [1] "February"

There were two tornadoes in the month of Feb, other months only had one each (as seen in table below).

df_t %>% group_by(st, mo) %>% filter(st=="HI") %>% tally()
## # A tibble: 5 × 3
## # Groups:   st [1]
##   st       mo     n
##   <chr> <int> <int>
## 1 HI        2     2
## 2 HI        3     1
## 3 HI        4     1
## 4 HI        9     1
## 5 HI       12     1




Question e)

Get the count of tornadoes by year.

df_t %>% group_by(yr) %>% tally()
## # A tibble: 9 × 2
##      yr     n
##   <int> <int>
## 1  2007  1116
## 2  2008  1738
## 3  2009  1182
## 4  2010  1315
## 5  2011  1777
## 6  2012   956
## 7  2013   939
## 8  2014   906
## 9  2015  1212







Airline Data

Question a) i)

Find the carrier (UniqueCarrier), the flight number (FlightNum), and origin (origin) of all the flights that: Departed (DepTime) after 9pm and flew (Dest) to Nashville (’BNA’).

df_a %>%
  select(UniqueCarrier, FlightNum, Origin) %>%
  filter(DepTime>2100 & Dest=="BNA")
##   UniqueCarrier FlightNum Origin
## 1            WN      2181    BWI
## 2            OH      5421    CVG
## 3            DL       938    ATL
## 4            AA      2435    ORD
## 5            DH      7332    ORD




Question a) ii)

Find the carrier (UniqueCarrier), the flight number (FlightNum), and origin (origin) of all the flights that: departed (DepTime) after 9pm, but Originated (Origin) from Nashville ’BNA’ or landed (Dest) in Memphis ’MEM’.

df_a %>%
  select(UniqueCarrier, FlightNum, Origin) %>%
  filter(DepTime>2100 & (Origin=="BNA" | Dest =="MEM"))
##    UniqueCarrier FlightNum Origin
## 1             WN      1447    BNA
## 2             WN      2658    BNA
## 3             WN      1511    BNA
## 4             WN      1732    BNA
## 5             WN      2769    BNA
## 6             WN       459    BNA
## 7             OO      6838    ORD
## 8             DL       579    ATL
## 9             FL      1723    ATL
## 10            NW       279    DTW
## 11            DH      6270    CVG




Question a) iii)

Find the carrier (UniqueCarrier), the flight number (FlightNum), and origin (origin) of all the flights that: were early (ArrDelay) and strictly less than 8 minutes of travel (AirTime). Note that a negative delay is a flight that arrived early.

df_a %>%
  select(UniqueCarrier, FlightNum, Origin) %>%
  filter(ArrDelay<0 & AirTime<8)
##    UniqueCarrier FlightNum Origin
## 1             OH      5495    ATL
## 2             OH      5106    SWF
## 3             OH      5165    FNT
## 4             OH      5340    BOS
## 5             OH      5379    DCA
## 6             OH      5791    HOU
## 7             OH      5949    ROC
## 8             OH      5762    SBN
## 9             OH      5242    PIT
## 10            OH      5945    SLC
## 11            EV      4197    ATL
## 12            EV      4527    ATL
## 13            EV      4588    ATL
## 14            EV      4755    ATL
## 15            EV      4784    DFW
## 16            EV      4909    CVG
## 17            HA        90    LNY
## 18            DH      7459    TYS




Question a) iv)

Find the carrier (UniqueCarrier), the flight number (FlightNum), and origin (origin) of all the flights that: arrived (ArrDelay) more than two hours late, but didn’t leave late (DepDelay).

df_a %>%
  select(UniqueCarrier, FlightNum, Origin) %>%
  filter(ArrDelay>120 & DepDelay<=0)
##   UniqueCarrier FlightNum Origin
## 1            TZ       505    PIE




Question b) i)

Arrange: Sort flights to find the 5 least delayed (DepDelay) flights

df_a %>%
      arrange(-desc(DepDelay)) %>%
    slice(1:5) 
##   Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 2004     3         25         4    2047       2113    2225       2227
## 2 2004     3         25         4     911        935    1053       1109
## 3 2004     3         25         4    1053       1115    1129       1159
## 4 2004     3         25         4    2109       2130    2159       2218
## 5 2004     3         25         4    1015       1035    1226       1243
##   UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime
## 1            MQ      3647  N817MQ                98             74      52
## 2            OH      5260  N933CA               102             94      75
## 3            NW       707  N986US                36             44      28
## 4            AA      1685  N4XGAA                50             48      36
## 5            NW      1926   N9346                71             68      46
##   ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1       -2      -26    DFW  SGF      364      3      43         0
## 2      -16      -24    RIC  CVG      413      1      26         0
## 3      -30      -22    GTF  FCA      146      2       6         0
## 4      -19      -21    HOU  AUS      148      5       9         0
## 5      -17      -20    MDW  DTW      229     12      13         0
##   CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay
## 1                         0            0            0        0             0
## 2                         0            0            0        0             0
## 3                         0            0            0        0             0
## 4                         0            0            0        0             0
## 5                         0            0            0        0             0
##   LateAircraftDelay
## 1                 0
## 2                 0
## 3                 0
## 4                 0
## 5                 0




Question b) ii)

Arrange: Sort flights to find the 5 least delayed (ArrDelay) flights

df_a %>%
      arrange(-desc(ArrDelay)) %>%
    slice(1:5) 
##   Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime
## 1 2004     3         25         4     600        605     751        843
## 2 2004     3         25         4     557        600     722        811
## 3 2004     3         25         4    1530       1540    1720       1809
## 4 2004     3         25         4    1355       1401    1702       1750
## 5 2004     3         25         4    1647       1651    1932       2020
##   UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime
## 1            OH      5071  N374CA               111            158      88
## 2            OH      5710  N473CA                85            131      75
## 3            OH      5082  N510CA               110            149      95
## 4            OO      6704  N964SW               121            169     314
## 5            DL      1181  N375DA               225            269     205
##   ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
## 1      -52       -5    BUF  ATL      712     10      13         0
## 2      -49       -3    BGM  CVG      506      2       8         0
## 3      -49      -10    BUF  ATL      712      6       9         0
## 4      -48        0    MFR  DEN      964      5       6         0
## 5      -48       -4    BOS  DFW     1562      5      15         0
##   CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay
## 1                         0            0            0        0             0
## 2                         0            0            0        0             0
## 3                         0            0            0        0             0
## 4                         0            0            0        0             0
## 5                         0            0            0        0             0
##   LateAircraftDelay
## 1                 0
## 2                 0
## 3                 0
## 4                 0
## 5                 0




Bonus Question

Transform using the Airline data: create a new data frame with only the columns ’DepDelay’, ’ArrDelay’, ’Origin’, ’Dest’, AirTime’, and ’Distance’. In addition keep only the observation for flights that were delayed (DepDelay) by more than 1 hour.

new_df_b1 <- subset(df_a, DepDelay>60)
new_df_b <-select(new_df_b1,c(DepDelay, ArrDelay, Origin, Dest, AirTime, Distance))




Question a)

Add a new column reporting departure delays minus the median departure delay to the new data frame

med_diff <- new_df_b$DepDelay - median(new_df_b$DepDelay)
new_df_b01 <- cbind(new_df_b,med_diff)
head(kable(new_df_b01),10)
##  [1] "|      | DepDelay| ArrDelay|Origin |Dest | AirTime| Distance| med_diff|"
##  [2] "|:-----|--------:|--------:|:------|:----|-------:|--------:|--------:|"
##  [3] "|8     |      112|      102|DEN    |SNA  |     116|      846|       17|"
##  [4] "|23    |       65|       78|ORD    |ROC  |      72|      528|      -30|"
##  [5] "|44    |       73|       72|IAH    |ORD  |     139|      925|      -22|"
##  [6] "|65    |      164|      144|BUF    |ORD  |      79|      473|       69|"
##  [7] "|66    |      130|      138|ORD    |MSP  |      57|      334|       35|"
##  [8] "|78    |      122|      144|ORD    |SAN  |     238|     1723|       27|"
##  [9] "|86    |       65|      180|CMH    |ORD  |      71|      296|      -30|"
## [10] "|107   |       86|       79|LAX    |DEN  |     109|      862|       -9|"




Question b)

Convert the departure and arrival delays columns from minutes into hours

conv_depdel_hrs <- substr(times((new_df_b01$DepDelay%/%60 +  new_df_b01$DepDelay%%60 /60)/24), 1, 5)
conv_arrdel_hrs <- substr(times((new_df_b01$ArrDelay%/%60 +  new_df_b01$ArrDelay%%60 /60)/24), 1, 5)
new_df_b02 <- subset(new_df_b01, select = -c(1:2))
new_df_b03 <- cbind(new_df_b02,conv_arrdel_hrs,conv_depdel_hrs)
head(kable(new_df_b03),10)
##  [1] "|      |Origin |Dest | AirTime| Distance| med_diff|conv_arrdel_hrs |conv_depdel_hrs |"
##  [2] "|:-----|:------|:----|-------:|--------:|--------:|:---------------|:---------------|"
##  [3] "|8     |DEN    |SNA  |     116|      846|       17|01:42           |01:52           |"
##  [4] "|23    |ORD    |ROC  |      72|      528|      -30|01:18           |01:05           |"
##  [5] "|44    |IAH    |ORD  |     139|      925|      -22|01:12           |01:13           |"
##  [6] "|65    |BUF    |ORD  |      79|      473|       69|02:24           |02:44           |"
##  [7] "|66    |ORD    |MSP  |      57|      334|       35|02:18           |02:10           |"
##  [8] "|78    |ORD    |SAN  |     238|     1723|       27|02:24           |02:02           |"
##  [9] "|86    |CMH    |ORD  |      71|      296|      -30|03:00           |01:05           |"
## [10] "|107   |LAX    |DEN  |     109|      862|       -9|01:19           |01:26           |"




Question c)

Add a column with the average flight speed (in mph)

avg_flt_spd_in_mph <- new_df_b1$Distance / new_df_b1$AirTime * 60
new_df_b04 <- cbind(new_df_b03,avg_flt_spd_in_mph)
head(kable(new_df_b04),10)
##  [1] "|      |Origin |Dest | AirTime| Distance| med_diff|conv_arrdel_hrs |conv_depdel_hrs | avg_flt_spd_in_mph|"
##  [2] "|:-----|:------|:----|-------:|--------:|--------:|:---------------|:---------------|------------------:|"
##  [3] "|8     |DEN    |SNA  |     116|      846|       17|01:42           |01:52           |          437.58621|"
##  [4] "|23    |ORD    |ROC  |      72|      528|      -30|01:18           |01:05           |          440.00000|"
##  [5] "|44    |IAH    |ORD  |     139|      925|      -22|01:12           |01:13           |          399.28058|"
##  [6] "|65    |BUF    |ORD  |      79|      473|       69|02:24           |02:44           |          359.24051|"
##  [7] "|66    |ORD    |MSP  |      57|      334|       35|02:18           |02:10           |          351.57895|"
##  [8] "|78    |ORD    |SAN  |     238|     1723|       27|02:24           |02:02           |          434.36975|"
##  [9] "|86    |CMH    |ORD  |      71|      296|      -30|03:00           |01:05           |          250.14085|"
## [10] "|107   |LAX    |DEN  |     109|      862|       -9|01:19           |01:26           |          474.49541|"




Question d)

Report your new table using the function summary()

summary(new_df_b04)
##     Origin              Dest              AirTime           Distance     
##  Length:783         Length:783         Min.   :-1395.0   Min.   :  56.0  
##  Class :character   Class :character   1st Qu.:   55.0   1st Qu.: 316.0  
##  Mode  :character   Mode  :character   Median :   90.0   Median : 590.0  
##                                        Mean   :  101.1   Mean   : 679.2  
##                                        3rd Qu.:  134.0   3rd Qu.: 867.0  
##                                        Max.   :  346.0   Max.   :2704.0  
##     med_diff      conv_arrdel_hrs    conv_depdel_hrs    avg_flt_spd_in_mph
##  Min.   :-34.00   Length:783         Length:783         Min.   :-11.35    
##  1st Qu.:-21.00   Class :character   Class :character   1st Qu.:328.80    
##  Median :  0.00   Mode  :character   Mode  :character   Median :393.05    
##  Mean   : 10.97                                         Mean   :378.69    
##  3rd Qu.: 29.00                                         3rd Qu.:442.68    
##  Max.   :985.00                                         Max.   :543.11







Data Import Steps

Tornadoes

df_t <- (read.csv(file_url1))
attach(df_t)

Back to Top

Airline

df_a1 <- (read.csv(file_url2))
df_a <- na.omit(df_a1)
attach(df_a)

Back to Top