The database to store the table with status, cities, etc is to be created using SQL.
CREATE DATABASE delays;
USE delays;
CREATE TABLE `arrival` (
`airline` varchar(25) NOT NULL,
`status` varchar(45) NOT NULL,
`Los Angeles` int,
`Phoenix` int,
`San Diego` int,
`San Francisco` int,
`Seattle` int
);
INSERT INTO arrival
VALUES
('ALASKA', 'on time', 497,221,212,503,1841),
('', 'delayed',62,12,20,102,305),
('','',NULL,NULL,NULL,NULL,NULL),
('AMWEST', 'on time', 694,4840,383,320,201),
('', 'delayed',117,415,65,129,61);
We can also create table using Excel or right here, Using R. Once the table is redy, we will write the data in csv file.
arrivals <- rbind(c("airline", "status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "on Time", 497, 221, 212, 503, 1841),
c(NA, "delayed", 62, 12, 20, 102, 305),
c(NA,NA,NA,NA,NA,NA,NA),
c("AM WEST", "on Time", 694, 4840, 383, 320, 201),
c(NA, "delayed", 117, 415, 65, 129, 61))
write.table(arrivals, file = "arrivals.csv", sep = ",", col.names=F, row.names=F)
To load SQL database using R, RMySQL library is used.
db = dbConnect(MySQL(), user='root', password = '336261', dbname='delays', host='localhost')
Once R is connected to the database, we can view and select necessary table from the database.
arrivals <- dbGetQuery(db, 'SELECT * FROM arrival')
Let’s see some information about the table and what’s inside to check if data was imported correctly.
summary(arrivals)
## airline status Los Angeles Phoenix
## Length:5 Length:5 Min. : 62.0 Min. : 12.0
## Class :character Class :character 1st Qu.:103.2 1st Qu.: 168.8
## Mode :character Mode :character Median :307.0 Median : 318.0
## Mean :342.5 Mean :1372.0
## 3rd Qu.:546.2 3rd Qu.:1521.2
## Max. :694.0 Max. :4840.0
## NA's :1 NA's :1
## San Diego San Francisco Seattle
## Min. : 20.00 Min. :102.0 Min. : 61
## 1st Qu.: 53.75 1st Qu.:122.2 1st Qu.: 166
## Median :138.50 Median :224.5 Median : 253
## Mean :170.00 Mean :263.5 Mean : 602
## 3rd Qu.:254.75 3rd Qu.:365.8 3rd Qu.: 689
## Max. :383.00 Max. :503.0 Max. :1841
## NA's :1 NA's :1 NA's :1
head(arrivals,n=5)
## airline status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AMWEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
The file will be downloaded from the Github repository to csv file using read.csv function.
arrivals <- read.csv("https://raw.githubusercontent.com/ex-pr/DATA607/week-6/arrivals.csv", header=TRUE, sep=",", check.names=FALSE)
By checking the data downloaded, we have 5 column with 7 rows.
summary(arrivals)
## airline status Los Angeles Phoenix
## Length:5 Length:5 Min. : 62.0 Min. : 12.0
## Class :character Class :character 1st Qu.:103.2 1st Qu.: 168.8
## Mode :character Mode :character Median :307.0 Median : 318.0
## Mean :342.5 Mean :1372.0
## 3rd Qu.:546.2 3rd Qu.:1521.2
## Max. :694.0 Max. :4840.0
## NA's :1 NA's :1
## San Diego San Francisco Seattle
## Min. : 20.00 Min. :102.0 Min. : 61
## 1st Qu.: 53.75 1st Qu.:122.2 1st Qu.: 166
## Median :138.50 Median :224.5 Median : 253
## Mean :170.00 Mean :263.5 Mean : 602
## 3rd Qu.:254.75 3rd Qu.:365.8 3rd Qu.: 689
## Max. :383.00 Max. :503.0 Max. :1841
## NA's :1 NA's :1 NA's :1
head(arrivals)
## airline status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on Time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on Time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Tidy data should follow the rules: each variable is its own column, each observation is its own row, and each value is its own cell. At the current moment, we have “wide data” an can transform it to “long data” as it is better to work with.
We are going to work with data from csv file.
First, remove NA row.
arrivals <- filter(arrivals, rowSums(is.na(arrivals)) != ncol(arrivals))
head(arrivals)
## airline status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on Time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on Time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
We will transform data to create 2 additional columns “Delayed”, “On time”. At the end, we will have a table with columns ““Airline”, “City”, “Delayed”, “On time”.
In airline column, we will spread name of airlines for NA values.
arrivals <- fill(arrivals, airline)
head(arrivals)
## airline status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on Time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on Time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
We will use gather function to transform columns with city names into rows and move rows with “On Time”, “Delayed” to column “count”. Then, we will use spread to split column “column” into 2 columns “On Time”, “Delayed”.
arrivals_tidy <- arrivals %>%
gather('Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle', key = "cities", value = "count") %>%
spread(status,count)
head(arrivals_tidy)
## airline cities delayed on Time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
We can check which company has more delays, thus we should avoid buying connecting flights with this company.
arrivals_tidy %>%
group_by(airline) %>%
summarise(all_delayed = sum(delayed), all_on_time=sum(`on Time`)) %>%
mutate(percent_delay = round(all_delayed/(all_delayed+all_on_time)*100))
## # A tibble: 2 x 4
## airline all_delayed all_on_time percent_delay
## <chr> <int> <int> <dbl>
## 1 ALASKA 501 3274 13
## 2 AM WEST 787 6438 11
ggplot(arrivals_tidy, aes(x=airline, y=delayed, color=airline)) +
geom_boxplot()
We can also check which airport is famous for the most delays.
delay_percent <- arrivals_tidy %>%
group_by(cities) %>%
summarise(all_delayed = sum(delayed), all_on_time=sum(`on Time`)) %>%
mutate(percent_delay = round(all_delayed/(all_delayed+all_on_time)*100))
delay_percent
## # A tibble: 5 x 4
## cities all_delayed all_on_time percent_delay
## <chr> <int> <int> <dbl>
## 1 Los Angeles 179 1191 13
## 2 Phoenix 427 5061 8
## 3 San Diego 85 595 12
## 4 San Francisco 231 823 22
## 5 Seattle 366 2042 15
ggplot(delay_percent, aes(x = cities, y = percent_delay)) +
geom_bar(stat="identity",fill='blue') +
labs(title="Delayed flights by city", x= "Cities", y = "Delayed flights, %") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
Finally, we will check on delays for each airline in each airport. In case we are in San Francisco and we can choose better airline. We will add another column to illustrate delay % for each city and company.
arrivals_tidy <- arrivals_tidy %>%
group_by(airline, cities) %>%
mutate(total = sum(`on Time`, delayed), delayed_percent = round((delayed*100 / total),digits=0))
arrivals_tidy <- arrivals_tidy %>%
select(-total)
head(arrivals_tidy)
## # A tibble: 6 x 5
## # Groups: airline, cities [6]
## airline cities delayed `on Time` delayed_percent
## <chr> <chr> <int> <int> <dbl>
## 1 ALASKA Los Angeles 62 497 11
## 2 ALASKA Phoenix 12 221 5
## 3 ALASKA San Diego 20 212 9
## 4 ALASKA San Francisco 102 503 17
## 5 ALASKA Seattle 305 1841 14
## 6 AM WEST Los Angeles 117 694 14
ggplot(data=arrivals_tidy, aes(x=airline, y = delayed_percent, fill= airline))+
geom_bar(stat="identity")+
facet_wrap(~cities)+
labs(title="Delayed flights by city and airlines", x= "Airlines", y = "Delayed, %")+
theme(plot.title = element_text(hjust = 0.5))
During the work, we have learned how to organize data in a way called “tidy data” using package tidyverse, how to transform wide format to long format.
By analyzing data, we see that in San Francisco 22% of flights are delayed. In San Francisco, AM West company has 29% of delays while Alaska has 17%. Based on the analysis of the airport/airline delays, we can conclude that people should avoid airports of San Francisco as most likely, the flight will be late. Also, if it happens that you are in San Francisco, avoid AM West airlines, it is better to choose another company, especially if you have a connecting flight. As well as in general, AM West company has more delays in each airport comparing to Alaska airline.