The MYSQL database table created for this assignment contains arrival delays for two airlines across five destinations
Please note below additional packages installed. The formattable package is used to transform vectors and data frames into more readable and impactful tabular formats. Package ‘conflicted’ solved an issue with the rename func as it pertains to knit/html.
install.packages(“formattable”) install.packages(“conflicted”)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RMySQL)
## Loading required package: DBI
library(formattable)
##DBconnect was utilized in this assignment to authenticate/pull in the flight delay information into R
my.database = dbConnect(MySQL(), user='root', password = 'Password1', dbname='flightsandstatus', host='localhost')
dbListTables(my.database)
## [1] "airlines_info1"
Using fetch function we are able to select all the row from our flightsandstatus database [table = airlines_info1]
airlines_status <- fetch(dbSendQuery(my.database, "SELECT * FROM airlines_info1 ORDER BY ID"))
dim(airlines_status)
## [1] 4 8
colnames(airlines_status)
## [1] "ID" "Airline" "Status"
## [4] "cities_losangeles" "cities_phoenix" "cities_sandiego"
## [7] "cities_sanfrancisco" "cities_seattle"
summary(airlines_status)
## ID Airline Status cities_losangeles
## Min. :1.00 Length:4 Length:4 Min. : 62.0
## 1st Qu.:1.75 Class :character Class :character 1st Qu.:103.2
## Median :2.50 Mode :character Mode :character Median :307.0
## Mean :2.50 Mean :342.5
## 3rd Qu.:3.25 3rd Qu.:546.2
## Max. :4.00 Max. :694.0
## cities_phoenix cities_sandiego cities_sanfrancisco cities_seattle
## Min. : 12.0 Min. : 20.00 Min. :102.0 Min. : 61
## 1st Qu.: 168.8 1st Qu.: 53.75 1st Qu.:122.2 1st Qu.: 166
## Median : 318.0 Median :138.50 Median :224.5 Median : 253
## Mean :1372.0 Mean :170.00 Mean :263.5 Mean : 602
## 3rd Qu.:1521.2 3rd Qu.:254.75 3rd Qu.:365.8 3rd Qu.: 689
## Max. :4840.0 Max. :383.00 Max. :503.0 Max. :1841
Experimented with the tibble func below for the first time. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data.
airline_statusinfo <- as_tibble(airlines_status)
airline_statusinfo
## # A tibble: 4 x 8
## ID Airline Status cities_losangel~ cities_phoenix cities_sandiego
## <int> <chr> <chr> <int> <int> <int>
## 1 1 ALASKA on ti~ 497 221 212
## 2 2 ALASKA delay~ 62 12 20
## 3 3 AM WEST on ti~ 694 4840 383
## 4 4 AM WEST delay~ 117 415 65
## # ... with 2 more variables: cities_sanfrancisco <int>, cities_seattle <int>
Note the use of the conflicted library below was helpful in tackling a difficult issue of knitting to html with the rename func used in our R code. Similarly from assignment 2, it is ideal for us to rename often time certain data which is not easy to understand/or to better the appearance of certain data such as columns. Also note the dplyr::rename being utilized in this scenario.
library(conflicted)
library(dplyr)
airlinestatusinfo9 <- dplyr::rename(airline_statusinfo,"Los Angeles"="cities_losangeles",
"Phoenix" = "cities_phoenix",
"San Diego" = "cities_sandiego",
"San Francisco" = "cities_sanfrancisco",
"Seattle" = "cities_seattle"
)
print(airlinestatusinfo9)
## # A tibble: 4 x 8
## ID Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <int> <chr> <chr> <int> <int> <int> <int> <int>
## 1 1 ALASKA on ti~ 497 221 212 503 1841
## 2 2 ALASKA delay~ 62 12 20 102 305
## 3 3 AM WEST on ti~ 694 4840 383 320 201
## 4 4 AM WEST delay~ 117 415 65 129 61
Note below we are using the gather, select and spread func to tidy/better organize the flights info to later on be able to manipulate and do calculations. We were encouraged to use a ‘wide’ structure so that we are later able to transform it.
airline_statusinfo8 <- airlinestatusinfo9 %>%
gather(City, count, -ID, -Airline, -Status) %>%
select(Airline, Status, City, count) %>%
spread(Status, count)
print(airline_statusinfo8)
## # A tibble: 10 x 4
## Airline City delayed `on time`
## <chr> <chr> <int> <int>
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
The formattable package is used to transform vectors and data frames into more readable and impactful tabular formats.
formattable(airline_statusinfo8)
Airline | City | delayed | on time |
---|---|---|---|
ALASKA | Los Angeles | 62 | 497 |
ALASKA | Phoenix | 12 | 221 |
ALASKA | San Diego | 20 | 212 |
ALASKA | San Francisco | 102 | 503 |
ALASKA | Seattle | 305 | 1841 |
AM WEST | Los Angeles | 117 | 694 |
AM WEST | Phoenix | 415 | 4840 |
AM WEST | San Diego | 65 | 383 |
AM WEST | San Francisco | 129 | 320 |
AM WEST | Seattle | 61 | 201 |
airline_statusinfo7 <- airline_statusinfo8 %>%
dplyr::mutate(
ontime_rate = `on time` / (delayed + `on time`),
delay_rate = delayed / (delayed + `on time`),
flights = delayed + `on time`,
percent_on_time = round(`on time`/flights*100, 2),
)
formattable(airline_statusinfo7)
Airline | City | delayed | on time | ontime_rate | delay_rate | flights | percent_on_time |
---|---|---|---|---|---|---|---|
ALASKA | Los Angeles | 62 | 497 | 0.8890877 | 0.11091234 | 559 | 88.91 |
ALASKA | Phoenix | 12 | 221 | 0.9484979 | 0.05150215 | 233 | 94.85 |
ALASKA | San Diego | 20 | 212 | 0.9137931 | 0.08620690 | 232 | 91.38 |
ALASKA | San Francisco | 102 | 503 | 0.8314050 | 0.16859504 | 605 | 83.14 |
ALASKA | Seattle | 305 | 1841 | 0.8578751 | 0.14212488 | 2146 | 85.79 |
AM WEST | Los Angeles | 117 | 694 | 0.8557337 | 0.14426634 | 811 | 85.57 |
AM WEST | Phoenix | 415 | 4840 | 0.9210276 | 0.07897241 | 5255 | 92.10 |
AM WEST | San Diego | 65 | 383 | 0.8549107 | 0.14508929 | 448 | 85.49 |
AM WEST | San Francisco | 129 | 320 | 0.7126949 | 0.28730512 | 449 | 71.27 |
AM WEST | Seattle | 61 | 201 | 0.7671756 | 0.23282443 | 262 | 76.72 |
Using ggplot2 to gather the delay information based on airline
library(ggplot2)
Alaska_delays<-dplyr::filter(airline_statusinfo7, Airline=="ALASKA")
Alaska_delays
## # A tibble: 5 x 8
## Airline City delayed `on time` ontime_rate delay_rate flights percent_on_time
## <chr> <chr> <int> <int> <dbl> <dbl> <int> <dbl>
## 1 ALASKA Los ~ 62 497 0.889 0.111 559 88.9
## 2 ALASKA Phoe~ 12 221 0.948 0.0515 233 94.8
## 3 ALASKA San ~ 20 212 0.914 0.0862 232 91.4
## 4 ALASKA San ~ 102 503 0.831 0.169 605 83.1
## 5 ALASKA Seat~ 305 1841 0.858 0.142 2146 85.8
AMWEST_delays<-dplyr::filter(airline_statusinfo7, Airline=="AM WEST")
AMWEST_delays
## # A tibble: 5 x 8
## Airline City delayed `on time` ontime_rate delay_rate flights percent_on_time
## <chr> <chr> <int> <int> <dbl> <dbl> <int> <dbl>
## 1 AM WEST Los ~ 117 694 0.856 0.144 811 85.6
## 2 AM WEST Phoe~ 415 4840 0.921 0.0790 5255 92.1
## 3 AM WEST San ~ 65 383 0.855 0.145 448 85.5
## 4 AM WEST San ~ 129 320 0.713 0.287 449 71.3
## 5 AM WEST Seat~ 61 201 0.767 0.233 262 76.7
compare_data <- airline_statusinfo7 %>%
group_by(Airline) %>%
dplyr::summarize(OnTime = sum(`on time`), Delayed = sum(delayed),
PercentOnTime = round(OnTime/sum(OnTime,Delayed),2))%>%
select(Airline, PercentOnTime)
## `summarise()` ungrouping output (override with `.groups` argument)
compare_data %>% dplyr::arrange(desc(PercentOnTime))
## # A tibble: 2 x 2
## Airline PercentOnTime
## <chr> <dbl>
## 1 AM WEST 0.89
## 2 ALASKA 0.87
ggplot(airline_statusinfo7, aes(x = Airline, y=delay_rate, fill = City)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("Delays")
ggplot(airline_statusinfo7, aes(x = Airline, y=ontime_rate, fill = City)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("On time rate")
in conclusion we can witness on the chart how AM WEST has slightly higher delay percentage than Alaska. Also we see how ontime rate is very similar/close for both airlines.