Given chart describes arrival delays for two airlines across five destinations. We have to
library(stringr)
library(dplyr)## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)## Warning: package 'tidyr' was built under R version 3.4.3
library(knitr)myurl <- "https://raw.githubusercontent.com/Harpreet1984/DATA607/master/Flight.csv"
flight <- read.csv(myurl, header= TRUE,sep=",",stringsAsFactors=FALSE)
flight## X X.1 LOS.ANGELES Phoenix San.diego San.Francisco Seattle
## 1 ALASKA ontime 497 221 212 503 1841
## 2 delayed 63 12 20 102 305
## 3 NA NA NA NA NA
## 4 Amwest ontime 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
From the above table we see By default, the read.csv() function named first two columns “X”" and “X1”. We need to rename these with proper labels. Here i converted flight data into tibble
flightdata <- flight
library(tibble)## Warning: package 'tibble' was built under R version 3.4.3
flightdata <- as_data_frame(flightdata)
flightdata <- flightdata %>% rename(Airline = X, Status = X.1)
flightdata## # A tibble: 5 x 7
## Airline Status LOS.ANGELES Phoenix San.diego San.Francisco Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 ALASKA ontime 497 221 212 503 1841
## 2 "" delayed 63 12 20 102 305
## 3 "" "" NA NA NA NA NA
## 4 Amwest ontime 694 4840 383 320 201
## 5 "" delayed 117 415 65 129 61
Here we are removing the empty rows if any and then replacing missing values in Airline column
flightdata <- flightdata %>% filter(Status != '')
flightdata## # A tibble: 4 x 7
## Airline Status LOS.ANGELES Phoenix San.diego San.Francisco Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 ALASKA ontime 497 221 212 503 1841
## 2 "" delayed 63 12 20 102 305
## 3 Amwest ontime 694 4840 383 320 201
## 4 "" delayed 117 415 65 129 61
flightdata$Airline <- ifelse(flightdata$Airline == "", lag(flightdata$Airline),flightdata$Airline)
kable(flightdata) | Airline | Status | LOS.ANGELES | Phoenix | San.diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | ontime | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 63 | 12 | 20 | 102 | 305 |
| Amwest | ontime | 694 | 4840 | 383 | 320 | 201 |
| Amwest | delayed | 117 | 415 | 65 | 129 | 61 |
—-gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.
—-spread() function spreads a key-value pair across multiple columns.
flightdata <- flightdata %>%
gather(city, frequency, 3:length(flightdata))
flightdata <- flightdata %>% spread("Status",frequency)
kable(flightdata)| Airline | city | delayed | ontime |
|---|---|---|---|
| ALASKA | LOS.ANGELES | 63 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San.diego | 20 | 212 |
| ALASKA | San.Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| Amwest | LOS.ANGELES | 117 | 694 |
| Amwest | Phoenix | 415 | 4840 |
| Amwest | San.diego | 65 | 383 |
| Amwest | San.Francisco | 129 | 320 |
| Amwest | Seattle | 61 | 201 |
—– mutate() function - Mutate adds new variables and preserves existing to add total time column
flightdata <- flightdata %>% mutate(Total.time = ontime + delayed) %>%
mutate(ontime_percentage = ontime / Total.time) %>%
mutate(delay_percentage = delayed / Total.time)
kable(flightdata)| Airline | city | delayed | ontime | Total.time | ontime_percentage | delay_percentage |
|---|---|---|---|---|---|---|
| ALASKA | LOS.ANGELES | 63 | 497 | 560 | 0.8875000 | 0.1125000 |
| ALASKA | Phoenix | 12 | 221 | 233 | 0.9484979 | 0.0515021 |
| ALASKA | San.diego | 20 | 212 | 232 | 0.9137931 | 0.0862069 |
| ALASKA | San.Francisco | 102 | 503 | 605 | 0.8314050 | 0.1685950 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 0.8578751 | 0.1421249 |
| Amwest | LOS.ANGELES | 117 | 694 | 811 | 0.8557337 | 0.1442663 |
| Amwest | Phoenix | 415 | 4840 | 5255 | 0.9210276 | 0.0789724 |
| Amwest | San.diego | 65 | 383 | 448 | 0.8549107 | 0.1450893 |
| Amwest | San.Francisco | 129 | 320 | 449 | 0.7126949 | 0.2873051 |
| Amwest | Seattle | 61 | 201 | 262 | 0.7671756 | 0.2328244 |
Here we can perform analysis on Flight arrival status based on flights delayed and flights reached on-time.
library(ggplot2)
myggplot <- ggplot(flightdata, aes(x = city, y = delay_percentage,group = Airline,color = Airline,shape=Airline) ,xlab = "city" , ylab = "Delay") +
geom_line() + geom_point((aes(shape=Airline)))
myggplotBased on the above graphs we can say delay rates of ALASKA Airlines is less compared to AMWEST Airlines Also we see that most delays happens at the city of Sanfrancisco and least delays at Phoenix