Rmarkdown Output
I am using prettydoc with cayman theme for my rmarkdown this week. prettydoc has great documentation in this link https://prettydoc.statr.me/index.html
Problem Statement
(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
Git-Hub
The data, .rmd used in this assignment can be found at https://github.com/forhadakbar/data607fall2019/tree/master/Week%2005
Loading package
The tidyverse includes both tidyr and dplyr. I found Amazing documentation here on tidyverse https://www.tidyverse.org/packages/. Kudos to http://hadley.nz/
Create a .csv file
One way is to manually create a .csv and push it to github. Then we can read the file using read.csv from github raw link.
Another way is to create the dataset in R using rbind and then write a table to get a .csv file as output. Then we can read the file using read.csv from either github or working directory. I would write the .csv file to my locale github repro week 05 folder then commit and push to my github using either bash command or git client so that i can get a raw link of the .csv file. I am adding all missing values at 3rd row to replicate same structure given in the assignment and to practice tidying and transformations.
flights <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "on time", 497, 221, 212, 503, 1841),
c(NA, "delayed", 62, 12, 20, 102, 305),
c(NA, NA, NA, NA, NA, NA, NA),
c("AM WEST", "on time", 694, 4840, 383, 320, 201),
c(NA, "delayed", 117, 415, 65, 129, 61))
#write the data in a .csv file
write.table(flights, file = "D:/CUNY-DataScience/Git-Hub/data607fall2019/Week 05/flight.csv", sep = ",", col.names=F, row.names=F)Read .CSV file into R and fill in missing values
I am using github raw link so that anyone can run the code and have the access to the .csv file created. Then i will Fill in the airline values and have a look using kable function from knitr package.
flights <-read.csv(paste("https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2005/flight.csv"))
#Fill in the airline values
flights[2,1] <- flights[1,1]
flights[5,1] <- flights[4,1]
# replace space with underscore to avoid using backtick later
flights[,2] <- sapply(flights[,2], str_replace, " ", "_")
kable(flights)| NA. | NA..1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on_time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on_time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Tidy dataset
Goals of making any data tidy or perform wrangling on it mainly depends on use case. We can consider following main data wrangling goals when performing any wrangling operation.
Make data suitable to use with a particular piece of software. For example R in our course context
Reveal information by changing format and structure of the data
Now let’s analyse why the the flights dataset is untidy in context of R. R prefers just one format of the data. We are calling it tidy here. There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row
- Each value must have its own cell
So in order to any dataset to be tidy it needs to maintain all three above rules, flights dataset doesn’t fulfill all three rules.
Flights dataset has most common problem as some of the column names are not names of variables, but values of a variable. The column names Los.Angeles, Phoenix, San.Diego San.Francisco and Seattle represent values of the location or state variable, and each row represents five observations, not one. We can use gathering and spreading to tidy this dataset. We can also make separate column for on time and delayed if needed. Also, there is a blank row and two column names are missing. R reading the missing column name as NA. and NA..1
My Approach
First, I will remove the blank row and rename the missing column name. Then i will use gathering from tidyr to get long form of the dataset to make it tidy. Later, i will use spreading from tidyr to get number of flights arrived on time and delay in separate column. Then i will do analysis to get insight like compare airlines performance using dplyr.
I am interested in not only overall airlines performace but also performace by destination city.
I will use pipe (%>%) opertor instead of chaining method as i am fairly new to piping. The history of pipe (%>%) in R and how it gets introduced in tidyverse is amazing to read. In short pipe (%>%) takes the output of one statement and makes it the input of the next statement. When reading it, we can think of it as a “THEN”. I studied pipe operator and it’s history from https://www.datacamp.com/community/tutorials/pipe-r-tutorial
Why tidyr
tidyr is a package that reshapes the layout of a tables. Two main function:
- gather() Convert variable into observation
- spread () Convert observation into variable
Transformation and using gather()
long_tidy_flights <- flights %>%
na.omit() %>% #remove blank row
rename(airline = NA., status = NA..1) %>% #name missing column name
gather(3:7, key ="destination", value = "number_of_flights" )#Transform the data from its wide to a more normalized long format
kable(long_tidy_flights)| airline | status | destination | number_of_flights |
|---|---|---|---|
| ALASKA | on_time | Los.Angeles | 497 |
| ALASKA | delayed | Los.Angeles | 62 |
| AM WEST | on_time | Los.Angeles | 694 |
| AM WEST | delayed | Los.Angeles | 117 |
| ALASKA | on_time | Phoenix | 221 |
| ALASKA | delayed | Phoenix | 12 |
| AM WEST | on_time | Phoenix | 4840 |
| AM WEST | delayed | Phoenix | 415 |
| ALASKA | on_time | San.Diego | 212 |
| ALASKA | delayed | San.Diego | 20 |
| AM WEST | on_time | San.Diego | 383 |
| AM WEST | delayed | San.Diego | 65 |
| ALASKA | on_time | San.Francisco | 503 |
| ALASKA | delayed | San.Francisco | 102 |
| AM WEST | on_time | San.Francisco | 320 |
| AM WEST | delayed | San.Francisco | 129 |
| ALASKA | on_time | Seattle | 1841 |
| ALASKA | delayed | Seattle | 305 |
| AM WEST | on_time | Seattle | 201 |
| AM WEST | delayed | Seattle | 61 |
Here i used gather() from tidyr package after removing blank row and name misssing column names to transform dataset from its untidy form to a normalized tidy long form which fullfill all three rules a dataset need to fulfil to become tidy for r. gather() convert variable into observation. Here, variables like Los.Angeles, Phoenix, San.Diego, San.Francisco, Seattle became observation.
Using spread ()
wide_tidy_flights<- long_tidy_flights %>%
spread(key = status, value = number_of_flights)
kable(wide_tidy_flights)| airline | destination | delayed | on_time |
|---|---|---|---|
| ALASKA | Los.Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San.Diego | 20 | 212 |
| ALASKA | San.Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los.Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San.Diego | 65 | 383 |
| AM WEST | San.Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
Here i used spread() from tidyr package to transform dataset from its long form to a wide form to have seprate column for on time and delayed filghts in order to better analyse the airlines performance. spread() convert obseration into variables. Here, observation like Delayed and On Time became variables.
why use dplyr
Often datsets contain more informatio than they display. dplyr helps access that information. I will use as much all functions from dplyr to practice. Main function:
- select() Extract existing variable
- filter() Extract existing observations
- mutate() Derive new variable (from existing variable)
- summarise() Change the unit of analysis
Use mutate to add columns for total, percent on time, and percent delayed
tidyflights <- wide_tidy_flights %>%
mutate(total_flights = delayed + on_time, # use mutate from dplyr to add columns
percent_on_time = round(on_time/total_flights*100, 2),
percent_delayed = round(delayed/total_flights*100, 2))
kable(tidyflights)| airline | destination | delayed | on_time | total_flights | percent_on_time | percent_delayed |
|---|---|---|---|---|---|---|
| ALASKA | Los.Angeles | 62 | 497 | 559 | 88.91 | 11.09 |
| ALASKA | Phoenix | 12 | 221 | 233 | 94.85 | 5.15 |
| ALASKA | San.Diego | 20 | 212 | 232 | 91.38 | 8.62 |
| ALASKA | San.Francisco | 102 | 503 | 605 | 83.14 | 16.86 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 85.79 | 14.21 |
| AM WEST | Los.Angeles | 117 | 694 | 811 | 85.57 | 14.43 |
| AM WEST | Phoenix | 415 | 4840 | 5255 | 92.10 | 7.90 |
| AM WEST | San.Diego | 65 | 383 | 448 | 85.49 | 14.51 |
| AM WEST | San.Francisco | 129 | 320 | 449 | 71.27 | 28.73 |
| AM WEST | Seattle | 61 | 201 | 262 | 76.72 | 23.28 |
Here i used mutate function which takes a data frame and return the data frame with new variables added to it. As i had replace space in “on time” with underscore i didn’t need to use backtick. I used round to round the percentage to 2 decimel points.
Let’s see it visually using ggplot2
ggplot(tidyflights, aes(x=destination, y=percent_delayed, fill=airline)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
geom_text(aes(label=percent_delayed), vjust=.5, hjust=1,position= position_dodge(width=0.9), color="black") +
ggtitle("Percentage of Flights Delayed by Airline by City") +
xlab("Destination") + ylab("Percentage of Flights Delayed") +
coord_flip() ggplot(tidyflights, aes(x=destination, y=percent_on_time, fill=airline)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
geom_text(aes(label=percent_on_time), vjust=.5, hjust=1,position= position_dodge(width=0.9), color="black") +
ggtitle("Percentage of Flights on time by Airline by City") +
xlab("Destination") + ylab("Percentage of Flights on time") +
coord_flip() Arrange to see performance by destination
| airline | destination | delayed | on_time | total_flights | percent_on_time | percent_delayed |
|---|---|---|---|---|---|---|
| ALASKA | Phoenix | 12 | 221 | 233 | 94.85 | 5.15 |
| AM WEST | Phoenix | 415 | 4840 | 5255 | 92.10 | 7.90 |
| ALASKA | San.Diego | 20 | 212 | 232 | 91.38 | 8.62 |
| ALASKA | Los.Angeles | 62 | 497 | 559 | 88.91 | 11.09 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 85.79 | 14.21 |
| AM WEST | Los.Angeles | 117 | 694 | 811 | 85.57 | 14.43 |
| AM WEST | San.Diego | 65 | 383 | 448 | 85.49 | 14.51 |
| ALASKA | San.Francisco | 102 | 503 | 605 | 83.14 | 16.86 |
| AM WEST | Seattle | 61 | 201 | 262 | 76.72 | 23.28 |
| AM WEST | San.Francisco | 129 | 320 | 449 | 71.27 | 28.73 |
Here i used arrange() to arrage the row by percent_on_time by descending order. We can see ALASKA airlines has 4 out of the top five spots for percentage of on time flights.
Use dplyr summarize and select function to compare overall airlines performance
compare_airlines <- tidyflights %>%
group_by(airline) %>%
summarize(on_time = sum(on_time), delayed = sum(delayed),
percent_on_time = round(on_time/sum(on_time,delayed)*100,2),
percent_delay = round(delayed/sum(on_time,delayed)*100,2))%>% # use summarise to change unit of analysis
select(airline, percent_on_time,percent_delay) # use select to extract columns
compare_airlines %>% arrange(desc(percent_on_time))## # A tibble: 2 x 3
## airline percent_on_time percent_delay
## <fct> <dbl> <dbl>
## 1 AM WEST 89.1 10.9
## 2 ALASKA 86.7 13.3
First i group the rows by airlines. Then use summarise change units of analysis. Here we used sum(). Finally use select() to extract variable and then display them in percent_delay order by percent_on_time using arrange().
ggplot(compare_airlines, aes(x=airline, y=percent_delay, fill=airline)) +
geom_bar(stat="identity", position=position_dodge(), color="black", width = 0.5) +
coord_flip() +
ggtitle("Overall Percentage of Flights Delayed by Airline") +
xlab("Airline") + ylab("Percentage of Flights Delayed")We can see AM WEST overall performance is better than ALASKA. This seems odd as ALASKA airlines has 4 out of the top five spots for percentage of on time flights. I noticed AM West has a very large number of flights to Phoenix with a high on time percentage to that destination. I would like to investigate if we filter out phoenix then what happen to overall performace.
Use dplyr to filter out Phoenix to see the comparison accross the other destinations
filter_Phoenix <- tidyflights %>%
filter(destination!="Phoenix")%>%
group_by(airline) %>%
summarize(on_time = sum(on_time), delayed = sum(delayed),
percent_on_time = round(on_time/sum(on_time,delayed)*100,2),
percent_delay = round(delayed/sum(on_time,delayed)*100,2))%>% # use summarise to change unit of analysis
select(airline, percent_on_time,percent_delay) # use select to extract columns
filter_Phoenix %>% arrange(desc(percent_on_time))## # A tibble: 2 x 3
## airline percent_on_time percent_delay
## <fct> <dbl> <dbl>
## 1 ALASKA 86.2 13.8
## 2 AM WEST 81.1 18.9
| airline | percent_on_time | percent_delay |
|---|---|---|
| ALASKA | 86.19 | 13.81 |
| AM WEST | 81.12 | 18.88 |
When we filter out Phoenix we can see that now ALASKA Airlines has a better on time percentage.
Conclusion
We can conclude that if someone is flying to Phoenix then take AM West. But if flying to any of the other destinations you have a better chance of being on time with ALASKA airlines. We can also see how performance vary depending on destination. In this assignment we learn how to formate dataset, tidy it using tidr and get useful insight using dplyr package.
Source
I used following sources to study and learn all necessary skills to complete this assignment.
kable: https://www.rdocumentation.org/packages/knitr/versions/1.1/topics/kable
Prettydoc: https://prettydoc.statr.me/index.html
Pipe: https://www.datacamp.com/community/tutorials/pipe-r-tutorial
tidyverse: https://www.tidyverse.org/
Book: https://r4ds.had.co.nz/tidy-data.html#spreading-and-gathering
Video on tidyr and dplyr: https://rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/?fbclid=IwAR2Elr92Jbv8S2E3XV2NX5e9zlRHxAahOHrzvNQF8gR3uD1yoxIt_U6-xoE