Tidying and Transforming Data

1.

A .csv file of the data was created and it is being loaded below. While loading the null fields are given “NA”. The first column is named Airline and second column is named Status.

flight <- read.table("https://raw.githubusercontent.com/irene908/DATA-607/master/Assignment%205_Delays.csv", header=TRUE, sep=",", na.strings = c("", "NA"))
names(flight)[1] <- "Airline"
names(flight)[2] <- "Status"
flight

##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

2.

The empty row is deleted.

flight <- flight[!apply(is.na(flight[1:7]),1,all), ]
flight

##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

3.

Update the missing fields using the following for loop.

for(i in 2:nrow(flight)) {
  
  if(is.na(flight$Airline[i])){
    flight$Airline[i] <- flight$Airline[i-1]
  }
}
flight

##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

4.

COnverting the “wide” data into a “long” data. To do this the gather() function of tidyr is used.

long_flight <- gather(flight, "City", "Count", 3:7)

datatable(long_flight, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))

5.

The spread() function in tidyr is applied to the Status field.

status_spread <- spread(long_flight,Status,Count)

datatable(status_spread, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))

6.

group_by() and summarize() functions of dplyr are used to find the total number of flights that were on time by each airline.

overall_ontime <- status_spread %>% group_by(Airline) %>% summarize(avg_ontime = mean(`on time`))
overall_ontime

## # A tibble: 2 x 2
##   Airline avg_ontime
##   <chr>        <dbl>
## 1 ALASKA        655.
## 2 AM WEST      1288.

7.

Compare the per-city on-time performance for both airlines

per_city_ontime <- status_spread %>% group_by(Airline,City) %>% summarize(avg_ontime = mean(`on time`))

datatable(per_city_ontime, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))

8. Plotting

ggplot(per_city_ontime ,aes(x= City, y=avg_ontime, fill=Airline)) +
    geom_bar(stat="identity", position=position_dodge())

9.

Summarize the delay data

overall_delay <- status_spread %>% group_by(Airline) %>% summarize(avg_delay = mean(`delayed`))
overall_delay

## # A tibble: 2 x 2
##   Airline avg_delay
##   <chr>       <dbl>
## 1 ALASKA       100.
## 2 AM WEST      157.

per_city_delay <- status_spread %>% group_by(Airline,City) %>% summarize(avg_delay = mean(`delayed`))

datatable(per_city_delay, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))

ggplot(per_city_delay ,aes(x= City, y=avg_delay, fill=Airline)) +
    geom_bar(stat="identity", position=position_dodge())

d <- status_spread %>% dplyr::group_by(Airline) %>% 
  dplyr::summarise(d_max=max(delayed), 
                   d_min=min(delayed),
                   d_avg=mean(delayed),
                   d_sd=sd(delayed),
                   d_median=median(delayed),
                   IQR=IQR(delayed))

datatable(d, extensions = 'Scroller', options = list(scrollY = 100))

Conclusion

From the above analysis it is clear that AM WEST had more number of on time flights when compared to ALASKA. When we consider the on time flights in Seattle we can see that ALASKA has more on time flights compared to AM WEST.

So this proves that judging an airline based on all the flights is not the best option. We should go further on to each destination to see which airline is better for that destination.

Assignment 5

Irene Jacob

2020-09-26

Tidying and Transforming Data

1.

2.

3.

4.

5.

6.

7.

8. Plotting

9.

Conclusion