Week 5 Assignment

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

First, we pull the data from my personal GitHub repository. The data was created in Excel and uploaded to GitHub as a CSV file.

Data <- read.csv("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/Flights%20wk5.csv")

print(Data)

##         X      X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA  on time         497     221       212           503    1841
## 2  ALASKA delayed           62      12        20           102     305
## 3 AM WEST  on time         694    4840       383           320     201
## 4 AM WEST delayed          117     415        65           129      61

Install packages

In the code below, I load the tidyverse package, which includes the tidyr and dplyr packages.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)
library(dplyr)

# Code to pivot the destination city columns into a new variable called Destination

Data2 <- Data %>%
  pivot_longer(
    cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
    names_to = "Destination",
    values_to = "Frequency"
  )

head(Data2)

## # A tibble: 6 × 4
##   X      X.1        Destination   Frequency
##   <chr>  <chr>      <chr>             <int>
## 1 ALASKA "on time"  Los.Angeles         497
## 2 ALASKA "on time"  Phoenix             221
## 3 ALASKA "on time"  San.Diego           212
## 4 ALASKA "on time"  San.Francisco       503
## 5 ALASKA "on time"  Seattle            1841
## 6 ALASKA "delayed " Los.Angeles          62

In the code below, I rename the unnamed columns (X and X.1) to Airline and Status. The “.” in the destination names is replaced with a space in the step after that.

colnames(Data2)[colnames(Data2) %in% c("X", "X.1")] <- c("Airline", "Status")
head(Data2)

## # A tibble: 6 × 4
##   Airline Status     Destination   Frequency
##   <chr>   <chr>      <chr>             <int>
## 1 ALASKA  "on time"  Los.Angeles         497
## 2 ALASKA  "on time"  Phoenix             221
## 3 ALASKA  "on time"  San.Diego           212
## 4 ALASKA  "on time"  San.Francisco       503
## 5 ALASKA  "on time"  Seattle            1841
## 6 ALASKA  "delayed " Los.Angeles          62
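An alternative to indexing into colnames() is dplyr's rename(), which pairs each new name with the old one explicitly. This is a minimal sketch, assuming a small stand-in data frame (`toy`, a hypothetical name) with the same X and X.1 columns rather than the real Data2:

```r
library(dplyr)

# Hypothetical stand-in for Data2 with the same unnamed columns
toy <- data.frame(
  X = c("ALASKA", "AM WEST"),
  X.1 = c("on time", "delayed"),
  Frequency = c(497, 117)
)

# rename() takes new_name = old_name pairs
toy_renamed <- toy %>%
  rename(Airline = X, Status = X.1)

names(toy_renamed)
```

Because each old name is spelled out next to its replacement, this form does not depend on the order of the columns.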

To replace the “.” in the destination names with a space, I used the following code.

Data2$Destination <- gsub("\\.", " ", Data2$Destination)

head(Data2)

## # A tibble: 6 × 4
##   Airline Status     Destination   Frequency
##   <chr>   <chr>      <chr>             <int>
## 1 ALASKA  "on time"  Los Angeles         497
## 2 ALASKA  "on time"  Phoenix             221
## 3 ALASKA  "on time"  San Diego           212
## 4 ALASKA  "on time"  San Francisco       503
## 5 ALASKA  "on time"  Seattle            1841
## 6 ALASKA  "delayed " Los Angeles          62
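The quoted "delayed " in the output above suggests that the Status values carry a trailing space, which could cause trouble when filtering on the exact string "delayed". A minimal sketch of removing it with base R's trimws(), using a small stand-in vector (`status`, a hypothetical name) instead of the real column:

```r
# Stand-in for Data2$Status; note the trailing space in "delayed "
status <- c("on time", "delayed ", "on time", "delayed ")

# trimws() strips leading and trailing whitespace from each element
status_clean <- trimws(status)

unique(status_clean)
```

Applied to the data in this document, the same idea would read Data2$Status <- trimws(Data2$Status).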

str(Data2)

## tibble [20 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Airline    : chr [1:20] "ALASKA" "ALASKA" "ALASKA" "ALASKA" ...
##  $ Status     : chr [1:20] "on time" "on time" "on time" "on time" ...
##  $ Destination: chr [1:20] "Los Angeles" "Phoenix" "San Diego" "San Francisco" ...
##  $ Frequency  : int [1:20] 497 221 212 503 1841 62 12 20 102 305 ...

summary(Data2)

##    Airline             Status          Destination          Frequency      
##  Length:20          Length:20          Length:20          Min.   :  12.00  
##  Class :character   Class :character   Class :character   1st Qu.:  92.75  
##  Mode  :character   Mode  :character   Mode  :character   Median : 216.50  
##                                                           Mean   : 550.00  
##                                                           3rd Qu.: 435.50  
##                                                           Max.   :4840.00

average_flights <- mean(Data2$Frequency)
average_flights

## [1] 550

We can create a ggplot to show the number of flights to each destination and how many of them were delayed.

ggplot(Data2, aes(x = Destination, y = Frequency, fill = Status)) +
  geom_bar(stat = "identity") +
  labs(title = "Flights by Destination and Status",
       x = "Destination",
       y = "Flights",
       fill = "Status")
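Since the stacked counts above are dominated by the busiest destinations, a proportional view can make it easier to see how often flights are delayed at each one. The following is only a sketch: the data frame `flights` is a hypothetical stand-in rebuilt from the counts printed earlier, and the choice of geom_col(position = "fill") with one panel per airline is an assumption, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical stand-in for Data2, rebuilt from the counts printed earlier
flights <- data.frame(
  Airline = rep(c("ALASKA", "AM WEST"), each = 10),
  Status = rep(rep(c("on time", "delayed"), each = 5), times = 2),
  Destination = rep(c("Los Angeles", "Phoenix", "San Diego",
                      "San Francisco", "Seattle"), times = 4),
  Frequency = c(497, 221, 212, 503, 1841,
                62, 12, 20, 102, 305,
                694, 4840, 383, 320, 201,
                117, 415, 65, 129, 61)
)

# position = "fill" rescales each bar to 1, so the fill shows proportions
p <- ggplot(flights, aes(x = Destination, y = Frequency, fill = Status)) +
  geom_col(position = "fill") +
  facet_wrap(~ Airline) +
  labs(title = "Share of Flights by Destination and Status",
       x = "Destination",
       y = "Proportion of flights",
       fill = "Status")

p
```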

The next plot compares the total number of flights for the two airlines and the number of delayed flights for each.

It appears that AM WEST has a higher number of flights and thus more delayed flights.

ggplot(Data2, aes(x = Airline, y = Frequency, fill = Status)) +
  geom_bar(stat = "identity")
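The absolute counts in the plot above can be complemented by the share of delayed flights per airline. The sketch below is one way to compute it with dplyr; `flights` is a hypothetical stand-in rebuilt from the counts printed earlier (with the trailing space in "delayed " already removed), and the column names Total, Delayed, and DelayRate are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for Data2, rebuilt from the counts printed earlier
flights <- data.frame(
  Airline = rep(c("ALASKA", "AM WEST"), each = 10),
  Status = rep(rep(c("on time", "delayed"), each = 5), times = 2),
  Destination = rep(c("Los Angeles", "Phoenix", "San Diego",
                      "San Francisco", "Seattle"), times = 4),
  Frequency = c(497, 221, 212, 503, 1841,
                62, 12, 20, 102, 305,
                694, 4840, 383, 320, 201,
                117, 415, 65, 129, 61)
)

# Total flights, delayed flights, and delayed share for each airline
delay_rates <- flights %>%
  group_by(Airline) %>%
  summarise(
    Total = sum(Frequency),
    Delayed = sum(Frequency[Status == "delayed"]),
    DelayRate = Delayed / Total
  )

delay_rates
```

With these counts, ALASKA has 501 delayed flights out of 3775 (about 13.3%) and AM WEST has 787 out of 7225 (about 10.9%), so AM WEST has more delayed flights in absolute terms but a smaller delayed share overall.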

Bishoy Sokkar

2024-02-24