Loading packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.4.1
## Loading required package: DBI
library(DBI)
library(knitr)
library(rstudioapi)
Turn the given dataset on arlines and their arrival time into a .CSV, and tidy and transform the data for further analysis. Then analyze the data to compare the delays for both airlines.
myDB<-dbConnect(MySQL(), user='andreina.arias80', password=rstudioapi::askForPassword(prompt = 'Password: '), dbname='andreina.arias80', host='cunydata607sql.mysql.database.azure.com')
Graphic image of the provided data frame
knitr::include_graphics("~/Documents/Screenshot 2024-09-28.png")
The chart
above describes arrival delays for two airlines across five
destinations. Your task is to: (1) Create a .CSV file (or optionally, a
MySQL database!) that includes all of the information above.
##Accessing the data frame from SQL and assigning the SQL data frame to a variable in R
Arrival=dbSendQuery(myDB,"SELECT * FROM airline_arrival")
Airline_Arrival= fetch(Arrival)
Airline_Arrival
## PK Unknown Unknown2 Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1 1 Alaska on time 497 221 212 503 1841
## 2 2 <NA> delayed 62 12 20 102 305
## 3 3 AM West on time 694 4840 383 320 201
## 4 4 <NA> delayed 117 415 65 129 61
You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
##Tidying data by adding “Airlines and”arrivals to the headers that have a missing title and added values “Alaska” and “AM West” where it was missing
names(Airline_Arrival)[names(Airline_Arrival)=='Unknown']<-'Airlines'
names(Airline_Arrival)[names(Airline_Arrival)=='Unknown2']<-'Arrivals'
Airline_Arrival[2,2]<-"Alaska"
Airline_Arrival[4,2]<-"AM West"
Airline_Arrival
## PK Airlines Arrivals Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1 1 Alaska on time 497 221 212 503 1841
## 2 2 Alaska delayed 62 12 20 102 305
## 3 3 AM West on time 694 4840 383 320 201
## 4 4 AM West delayed 117 415 65 129 61
##Created a long verse of the data frame by created a column to place all the locations instead of having the locations in their on colomn creating a wide table.
Airline_Arrival_Long<-Airline_Arrival%>%
pivot_longer(cols = c('Los_Angeles', 'Phoenix', 'San_Diego', 'San_Francisco', 'Seattle'), names_to = "Locations", values_to = "Flight_arrival_count" )
Airline_Arrival_Long
## # A tibble: 20 × 5
## PK Airlines Arrivals Locations Flight_arrival_count
## <int> <chr> <chr> <chr> <int>
## 1 1 Alaska on time Los_Angeles 497
## 2 1 Alaska on time Phoenix 221
## 3 1 Alaska on time San_Diego 212
## 4 1 Alaska on time San_Francisco 503
## 5 1 Alaska on time Seattle 1841
## 6 2 Alaska delayed Los_Angeles 62
## 7 2 Alaska delayed Phoenix 12
## 8 2 Alaska delayed San_Diego 20
## 9 2 Alaska delayed San_Francisco 102
## 10 2 Alaska delayed Seattle 305
## 11 3 AM West on time Los_Angeles 694
## 12 3 AM West on time Phoenix 4840
## 13 3 AM West on time San_Diego 383
## 14 3 AM West on time San_Francisco 320
## 15 3 AM West on time Seattle 201
## 16 4 AM West delayed Los_Angeles 117
## 17 4 AM West delayed Phoenix 415
## 18 4 AM West delayed San_Diego 65
## 19 4 AM West delayed San_Francisco 129
## 20 4 AM West delayed Seattle 61
##Grouped the data for Airline and Arrivals in order to get the averages and total count of their flights arrival.
Arrival_summary_of_airlines<-Airline_Arrival_Long%>%
group_by(Airlines, Arrivals)%>%
summarise(Arrival_flight_count=sum(Flight_arrival_count), aver_flight_arrival=mean(Flight_arrival_count))
Arrival_summary_of_airlines
## # A tibble: 4 × 4
## # Groups: Airlines [2]
## Airlines Arrivals Arrival_flight_count aver_flight_arrival
## <chr> <chr> <int> <dbl>
## 1 AM West delayed 787 157.
## 2 AM West on time 6438 1288.
## 3 Alaska delayed 501 100.
## 4 Alaska on time 3274 655.
##Since we will be only looking at the delays for both airlines, a dataset will be created with just the values of the airline’s delayed counts
delayed_Airline_summary<-Arrival_summary_of_airlines|>
filter(Arrivals=="delayed")
delayed_Airline_summary
## # A tibble: 2 × 4
## # Groups: Airlines [2]
## Airlines Arrivals Arrival_flight_count aver_flight_arrival
## <chr> <chr> <int> <dbl>
## 1 AM West delayed 787 157.
## 2 Alaska delayed 501 100.
##Created a bar look at the delayed flight for each airline in different locations.
Airline_delay_plot<-Airline_Arrival_Long%>%
filter(Arrivals=="delayed")
ggplot(Airline_delay_plot,aes(Locations, Flight_arrival_count, fill=Airlines))+
geom_col(position="dodge")
The data frame provided on the arrival times of different airlines in different locations had to be tidy in order to analyze. Using a few functions from tidyverse and dplyr, I was able to tidy the data and filter out the only data I needed to make my analysis on the delayed flights. Based on the average of flights arrivals I was able to see that the flight with the most delays was AM West and major of it’s delayed occurred in Phoenix. Alaska seemed to have the lowest delay in Phoenix, while ti has the highest delay in Seattle. Therefore the best airline to use for Phoneix would be Alaska and the best for Seattle would be AM West.