Assignment on RPubs
Rmd on Github
The purpose of this assignment is to tidy and transform the wide-format flights table using tidyr and dplyr. The wide-format table was created beforehand as a CSV file named airlines.csv.
#Calling the stringr, tidyr, dplyr, ggplot2 libraries
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#Reading the airlines.csv file
flights <- read.csv("https://raw.githubusercontent.com/logicalschema/DATA607/master/week5/airlines.csv",
sep = ",",
header = TRUE)
Here is an initial view of the imported CSV data.
head(flights)
In its current form the data is not ready for analysis and needs to be cleaned up and reshaped. Some column names need to be changed, the factor columns need to be converted to character, and empty values need to be filled.
names(flights)
## [1] "ï.." "X" "Los.Angeles" "Phoenix"
## [5] "San.Diego" "San.Francisco" "Seattle"
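The `ï..` name is most likely the file's UTF-8 byte order mark (BOM) being read as part of the first header. As a minimal sketch, assuming that is the cause, `read.csv()` can be told to skip the BOM with `fileEncoding = "UTF-8-BOM"`. The temporary file and its contents below are made up for illustration and are not the assignment's data.

```r
# Write a tiny, hypothetical CSV that starts with a UTF-8 BOM (bytes EF BB BF).
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("Airline,Status\nALASKA,on time\n")),
  tmp
)

# fileEncoding = "UTF-8-BOM" drops the BOM so the first header is read cleanly.
with_bom <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(with_bom)
```

In airlines.csv the first two headers appear to be blank, so they would still need to be renamed after reading the file this way.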
#Rename the columns "ï..", "X", "Los.Angeles", "San.Diego", and "San.Francisco"
flights <- rename(flights, Airline = "ï..")
flights <- rename(flights, Status = "X")
flights <- rename(flights, "Los Angeles" = "Los.Angeles")
flights <- rename(flights, "San Diego" = "San.Diego")
flights <- rename(flights, "San Francisco" = "San.Francisco")
#Remove empty rows
flights <- flights %>% na.omit()
#Check the class of the Airline Column
class(flights$Airline)
## [1] "factor"
#Converts factor columns to character
flights <- flights %>% mutate_if(is.factor, as.character)
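#Replace empty strings with NA, then fill the NA values in Airline with the previous non-NA value
#(na_if is applied to character columns only; newer dplyr versions reject "" for numeric columns)
flights <- flights %>% mutate(across(where(is.character), ~ na_if(.x, "")))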
flights <- flights %>% fill(Airline)
head(flights)
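The fill step can be illustrated on a small made-up table. This is a sketch only: `toy` and its values are hypothetical, chosen to mimic a layout in which the airline name appears once per pair of rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table: the airline name is present only on the first row of each pair.
toy <- data.frame(
  Airline = c("ALASKA", "", "AM WEST", ""),
  Status  = c("on time", "delayed", "on time", "delayed"),
  stringsAsFactors = FALSE
)

toy_filled <- toy %>%
  mutate(Airline = na_if(Airline, "")) %>%  # empty strings become NA
  fill(Airline)                             # each NA takes the last non-NA value above it

toy_filled
```

`fill()` carries values downward by default, which is why the rows must still be in their original order when it is called.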
#Gather the city columns "Los Angeles" through "Seattle" into a new column Destination, with the counts stored in a new column Number
flights <- gather(flights, key = "Destination", value = "Number", "Los Angeles":"Seattle" )
#Spread the Status column into one column per distinct value ("delayed", "on time"), filled with the values from Number
flights <- spread(flights, Status, Number)
flights <- rename(flights, On_Time = "on time")
flights <- rename(flights, Delayed = "delayed")
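The same reshape can be written with `pivot_longer()` and `pivot_wider()`, the newer tidyr functions that supersede `gather()` and `spread()`. The sketch below uses a hypothetical toy table (`toy_wide`, with made-up city names and counts), not the assignment data.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per airline and status, one column per city.
toy_wide <- data.frame(
  Airline = c("A", "A", "B", "B"),
  Status  = c("on time", "delayed", "on time", "delayed"),
  City1   = c(10, 1, 20, 4),
  City2   = c(30, 3, 40, 8),
  stringsAsFactors = FALSE
)

toy_tidy <- toy_wide %>%
  # City columns become rows: one row per airline, status, and destination.
  pivot_longer(cols = City1:City2, names_to = "Destination", values_to = "Number") %>%
  # Status values become columns, so each airline/destination pair is one row.
  pivot_wider(names_from = Status, values_from = Number)

toy_tidy
```

One difference to keep in mind: `spread()` orders the new columns alphabetically, while `pivot_wider()` keeps them in order of first appearance.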
## Data is Ready for Use
Let’s see the new data columns and values after cleanup.
names(flights)
## [1] "Airline" "Destination" "Delayed" "On_Time"
flights
Now, we are ready to utilize the data.
los_angeles <- flights %>% filter(Destination == "Los Angeles")
phoenix <- flights %>% filter(Destination == "Phoenix")
san_diego <- flights %>% filter(Destination == "San Diego")
san_francisco <- flights %>% filter(Destination == "San Francisco")
seattle <- flights %>% filter(Destination == "Seattle")
#One bar chart per city: the bar height is the proportion of delayed flights, Delayed/(Delayed + On_Time)
#geom_col() draws bars from precomputed values; the y-axis label names the city shown in each panel
p1 <- ggplot(los_angeles, aes(x = Airline, y = Delayed/(Delayed + On_Time), fill = Airline)) +
geom_col(alpha = .5, colour = "black") +
ylab("Los Angeles") +
theme(text = element_text(size = 8))
p2 <- ggplot(phoenix, aes(x = Airline, y = Delayed/(Delayed + On_Time), fill = Airline)) +
geom_col(alpha = .5, colour = "black") +
ylab("Phoenix") +
theme(text = element_text(size = 8))
p3 <- ggplot(san_diego, aes(x = Airline, y = Delayed/(Delayed + On_Time), fill = Airline)) +
geom_col(alpha = .5, colour = "black") +
ylab("San Diego") +
theme(text = element_text(size = 8))
p4 <- ggplot(san_francisco, aes(x = Airline, y = Delayed/(Delayed + On_Time), fill = Airline)) +
geom_col(alpha = .5, colour = "black") +
ylab("San Francisco") +
theme(text = element_text(size = 8))
p5 <- ggplot(seattle, aes(x = Airline, y = Delayed/(Delayed + On_Time), fill = Airline)) +
geom_col(alpha = .5, colour = "black") +
ylab("Seattle") +
theme(text = element_text(size = 8))
#multiplot() is not part of ggplot2; it is assumed here to come from the Rmisc package
Rmisc::multiplot(p1, p2, p3, p4, p5, cols=2)
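The per-city panels could also be drawn as a single faceted plot, which avoids filtering the data once per city. This is a sketch under stated assumptions: `toy_flights` is a hypothetical stand-in with the same column names as the cleaned data (Airline, Destination, Delayed, On_Time), and its counts are made up.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned table; the counts are invented for illustration.
toy_flights <- data.frame(
  Airline     = c("A", "A", "B", "B"),
  Destination = c("City1", "City2", "City1", "City2"),
  Delayed     = c(1, 3, 4, 8),
  On_Time     = c(9, 27, 16, 32),
  stringsAsFactors = FALSE
)

# One panel per destination; bar height is the proportion of delayed flights.
p_facets <- ggplot(toy_flights,
                   aes(x = Airline, y = Delayed / (Delayed + On_Time), fill = Airline)) +
  geom_col(alpha = .5, colour = "black") +
  facet_wrap(~ Destination, ncol = 2) +
  ylab("Proportion delayed") +
  theme(text = element_text(size = 8))

p_facets
```

With facets, all panels share one y-axis scale by default, which makes the delay rates directly comparable across cities.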
Overall, it seems that AM West has a higher proportion of delayed flights than Alaska in each of the 5 cities, and that San Francisco has the highest delay rate of the cities recorded. If you are catching a connecting flight, schedule more lag time when the connection goes through San Francisco.
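The comparison rests on the ratio Delayed / (Delayed + On_Time). As a sketch, that ratio can be tabulated directly instead of being read off the plots. The table `toy_counts` and its numbers are hypothetical and only mirror the column names of the cleaned data.

```r
library(dplyr)

# Hypothetical counts with the same columns as the cleaned flights table.
toy_counts <- data.frame(
  Airline     = c("A", "A", "B", "B"),
  Destination = c("City1", "City2", "City1", "City2"),
  Delayed     = c(1, 3, 4, 8),
  On_Time     = c(9, 27, 16, 32),
  stringsAsFactors = FALSE
)

delay_rates <- toy_counts %>%
  mutate(Delay_Rate = Delayed / (Delayed + On_Time)) %>%  # share of flights delayed
  arrange(desc(Delay_Rate))                               # highest delay rate first

delay_rates
```

Sorting by the rate puts the airline and destination pairs with the most frequent delays at the top of the table.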