Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. Perform analysis to compare the arrival delays for the two airlines.
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/lburenkov/flights607/main/airlines%20607.csv"
flights <- read_csv(url)
## Rows: 6 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Column1, Column2, Column3, Column4, Column5, Column6, Column7
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
flights
## # A tibble: 6 × 7
## Column1 Column2 Column3 Column4 Column5 Column6 Column7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA On time 497 221 212 503 1841
## 3 <NA> Delayed 62 12 20 102 305
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 AM WEST On time 694 4840 383 320 201
## 6 <NA> Delayed 117 415 65 129 61
names(flights)[1:2] <- c("Airline", "Delay Status")
flights
## # A tibble: 6 × 7
## Airline `Delay Status` Column3 Column4 Column5 Column6 Column7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA On time 497 221 212 503 1841
## 3 <NA> Delayed 62 12 20 102 305
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 AM WEST On time 694 4840 383 320 201
## 6 <NA> Delayed 117 415 65 129 61
###Tidyng up the data
Deleting first row
flights = flights[-1,]
flights
## # A tibble: 5 × 7
## Airline `Delay Status` Column3 Column4 Column5 Column6 Column7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ALASKA On time 497 221 212 503 1841
## 2 <NA> Delayed 62 12 20 102 305
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 AM WEST On time 694 4840 383 320 201
## 5 <NA> Delayed 117 415 65 129 61
names(flights)[3:7] <- c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seatle")
flights
## # A tibble: 5 × 7
## Airline `Delay Status` `Los Angeles` Phoenix `San Diego` `San Francisco`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ALASKA On time 497 221 212 503
## 2 <NA> Delayed 62 12 20 102
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 AM WEST On time 694 4840 383 320
## 5 <NA> Delayed 117 415 65 129
## # ℹ 1 more variable: Seatle <chr>
flights1 <- flights |>
fill(Airline) |>
pivot_longer(cols=3:7,names_to= "City",values_to = "Count")
glimpse(flights1)
## Rows: 25
## Columns: 4
## $ Airline <chr> "ALASKA", "ALASKA", "ALASKA", "ALASKA", "ALASKA", "ALAS…
## $ `Delay Status` <chr> "On time", "On time", "On time", "On time", "On time", …
## $ City <chr> "Los Angeles", "Phoenix", "San Diego", "San Francisco",…
## $ Count <chr> "497", "221", "212", "503", "1841", "62", "12", "20", "…
flights1
## # A tibble: 25 × 4
## Airline `Delay Status` City Count
## <chr> <chr> <chr> <chr>
## 1 ALASKA On time Los Angeles 497
## 2 ALASKA On time Phoenix 221
## 3 ALASKA On time San Diego 212
## 4 ALASKA On time San Francisco 503
## 5 ALASKA On time Seatle 1841
## 6 ALASKA Delayed Los Angeles 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San Diego 20
## 9 ALASKA Delayed San Francisco 102
## 10 ALASKA Delayed Seatle 305
## # ℹ 15 more rows
Changing column Count ‘chr’ to numeric
flights1[, c(4)] <- sapply(flights1[, c(4)], as.numeric)
flights1
## # A tibble: 25 × 4
## Airline `Delay Status` City Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA On time Los Angeles 497
## 2 ALASKA On time Phoenix 221
## 3 ALASKA On time San Diego 212
## 4 ALASKA On time San Francisco 503
## 5 ALASKA On time Seatle 1841
## 6 ALASKA Delayed Los Angeles 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San Diego 20
## 9 ALASKA Delayed San Francisco 102
## 10 ALASKA Delayed Seatle 305
## # ℹ 15 more rows
###Analyzing the data Who had the higher proportion of flight delays? When comparing the delayed flights, Alaska had more raw delayed flights.
ggplot(flights1, aes(x=Airline,y=Count,fill = `Delay Status`)) +
geom_bar(stat="identity", position = position_fill(reverse= TRUE)) +
facet_wrap(~City) +
labs(title = "Flight Delays Per City", y = "Proportion of Flights") +
theme(legend.position = c(.85,.25), plot.title = element_text(hjust = 0.5))
## Warning: Removed 5 rows containing missing values (`position_stack()`).
Conclusions: Both airlines show significant delays but overall AM West Airlines has more delays than Alaska airline. It will be interesting to have more data to get further conclusions and analysis.