Loading packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.4.1
## Loading required package: DBI
library(DBI)
library(knitr)
library(rstudioapi)

Introduction

Turn the given dataset on arlines and their arrival time into a .CSV, and tidy and transform the data for further analysis. Then analyze the data to compare the delays for both airlines.

Data set was created in SQL, accessing the SQL database on R.

myDB<-dbConnect(MySQL(), user='andreina.arias80', password=rstudioapi::askForPassword(prompt = 'Password: '), dbname='andreina.arias80', host='cunydata607sql.mysql.database.azure.com')

Graphic image of the provided data frame

knitr::include_graphics("~/Documents/Screenshot 2024-09-28.png")

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.

##Accessing the data frame from SQL and assigning the SQL data frame to a variable in R

Arrival=dbSendQuery(myDB,"SELECT * FROM airline_arrival")
Airline_Arrival= fetch(Arrival)
Airline_Arrival
##   PK Unknown Unknown2 Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1  1  Alaska  on time         497     221       212           503    1841
## 2  2    <NA>  delayed          62      12        20           102     305
## 3  3 AM West  on time         694    4840       383           320     201
## 4  4    <NA>  delayed         117     415        65           129      61

You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

##Tidying data by adding “Airlines and”arrivals to the headers that have a missing title and added values “Alaska” and “AM West” where it was missing

names(Airline_Arrival)[names(Airline_Arrival)=='Unknown']<-'Airlines'
names(Airline_Arrival)[names(Airline_Arrival)=='Unknown2']<-'Arrivals'
Airline_Arrival[2,2]<-"Alaska"
Airline_Arrival[4,2]<-"AM West"
Airline_Arrival
##   PK Airlines Arrivals Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1  1   Alaska  on time         497     221       212           503    1841
## 2  2   Alaska  delayed          62      12        20           102     305
## 3  3  AM West  on time         694    4840       383           320     201
## 4  4  AM West  delayed         117     415        65           129      61

##Created a long verse of the data frame by created a column to place all the locations instead of having the locations in their on colomn creating a wide table.

Airline_Arrival_Long<-Airline_Arrival%>%
  pivot_longer(cols = c('Los_Angeles', 'Phoenix', 'San_Diego', 'San_Francisco', 'Seattle'), names_to = "Locations", values_to = "Flight_arrival_count" )
Airline_Arrival_Long
## # A tibble: 20 × 5
##       PK Airlines Arrivals Locations     Flight_arrival_count
##    <int> <chr>    <chr>    <chr>                        <int>
##  1     1 Alaska   on time  Los_Angeles                    497
##  2     1 Alaska   on time  Phoenix                        221
##  3     1 Alaska   on time  San_Diego                      212
##  4     1 Alaska   on time  San_Francisco                  503
##  5     1 Alaska   on time  Seattle                       1841
##  6     2 Alaska   delayed  Los_Angeles                     62
##  7     2 Alaska   delayed  Phoenix                         12
##  8     2 Alaska   delayed  San_Diego                       20
##  9     2 Alaska   delayed  San_Francisco                  102
## 10     2 Alaska   delayed  Seattle                        305
## 11     3 AM West  on time  Los_Angeles                    694
## 12     3 AM West  on time  Phoenix                       4840
## 13     3 AM West  on time  San_Diego                      383
## 14     3 AM West  on time  San_Francisco                  320
## 15     3 AM West  on time  Seattle                        201
## 16     4 AM West  delayed  Los_Angeles                    117
## 17     4 AM West  delayed  Phoenix                        415
## 18     4 AM West  delayed  San_Diego                       65
## 19     4 AM West  delayed  San_Francisco                  129
## 20     4 AM West  delayed  Seattle                         61
  1. Perform analysis to compare the arrival delays for the two airlines.

##Grouped the data for Airline and Arrivals in order to get the averages and total count of their flights arrival.

Arrival_summary_of_airlines<-Airline_Arrival_Long%>%
  group_by(Airlines, Arrivals)%>%
  summarise(Arrival_flight_count=sum(Flight_arrival_count), aver_flight_arrival=mean(Flight_arrival_count))
Arrival_summary_of_airlines
## # A tibble: 4 × 4
## # Groups:   Airlines [2]
##   Airlines Arrivals Arrival_flight_count aver_flight_arrival
##   <chr>    <chr>                   <int>               <dbl>
## 1 AM West  delayed                   787                157.
## 2 AM West  on time                  6438               1288.
## 3 Alaska   delayed                   501                100.
## 4 Alaska   on time                  3274                655.

##Since we will be only looking at the delays for both airlines, a data set will be created with just the values of the airline’s delayed counts

delayed_Airline_summary<-Arrival_summary_of_airlines|>
  filter(Arrivals=="delayed")
delayed_Airline_summary
## # A tibble: 2 × 4
## # Groups:   Airlines [2]
##   Airlines Arrivals Arrival_flight_count aver_flight_arrival
##   <chr>    <chr>                   <int>               <dbl>
## 1 AM West  delayed                   787                157.
## 2 Alaska   delayed                   501                100.

##Created a bar look at the delayed flight for each airline in different locations.

Airline_delay_plot<-Airline_Arrival_Long%>%
         filter(Arrivals=="delayed")
ggplot(Airline_delay_plot,aes(Locations, Flight_arrival_count, fill=Airlines))+
         geom_col(position="dodge")

  1. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your GitHub repository. and  The URL for your rpubs.com web page.

Conclusion- Alaska with the least delays.

The data frame provided on the arrival times of different airlines in different locations had to be tidy in order to analyze. Using a few functions from tidyverse and dplyr, I was able to tidy the data and filter out the only data I needed to make my analysis on the delayed flights. Based on the average of flights arrivals I was able to see that the flight with the most delays was AM West and most of it’s delayed occurred in Phoenix. Alaska seemed to have the lowest delay in Phoenix, while it has the highest delay in Seattle. Therefore the best airline to use for Phoenix would be Alaska and the best for Seattle would be AM West. But based on the means for the delays, the best airline would be Alaska since it has a lower mean of delays and on the plot we could see that in most locations Alaska had a lower count of delayed flights compared to AM West.