Homework 5: Data Wrangling with tidyr and dplyr.
Load libraries
library(dplyr)
library(tidyr)
library(knitr)
Step 1: Read airlines CSV file, and assign names to columns with no names.
- Column 1 –> name as “Airlines”
- Column 2 –> name as “Arrival.Status”
- The option na.strings = c(“”, “NA”) is used in the read.table() to set blank data to ‘NA’.
file_source = "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Assignment5/Airlines.csv"
#results in data frame
airlines <- read.table(file_source, header=TRUE, sep=",", na.strings = c("", "NA"))
names(airlines)[1] <- "Airlines"
names(airlines)[2] <- "Arrival.Status"
#display data
kable(airlines)
ALASKA |
on time |
497 |
221 |
212 |
503 |
1841 |
NA |
delayed |
62 |
12 |
20 |
102 |
305 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
AM WEST |
on time |
694 |
4840 |
383 |
320 |
201 |
NA |
delayed |
117 |
415 |
65 |
129 |
61 |
Step 2: Remove blank rows.
A blank row is identified as a row with “NA” for all the variables.
airlines <- airlines[!apply(is.na(airlines[1:7]),1,all), ]
row.names(airlines) <- NULL
# display. airlines does not have blank row anymore
kable(airlines)
ALASKA |
on time |
497 |
221 |
212 |
503 |
1841 |
NA |
delayed |
62 |
12 |
20 |
102 |
305 |
AM WEST |
on time |
694 |
4840 |
383 |
320 |
201 |
NA |
delayed |
117 |
415 |
65 |
129 |
61 |
Step 3: Assign an airline name to rows that have a blank airline name.
An ASSUMPTION is made here that every single row that has a blank airline name has THE SAME airline name to the row above it. In addition, this code ASSUMES that the first row will always have a value for the airline name that is NOT ‘NA’. These assumptions are true for this specific file. This process would be the same if this file has 5 rows or more.
for(i in 2:nrow(airlines)) {
if(is.na(airlines$Airlines[i])){
airlines$Airlines[i] <- airlines$Airlines[i-1]
}
}
#display
kable(airlines)
ALASKA |
on time |
497 |
221 |
212 |
503 |
1841 |
ALASKA |
delayed |
62 |
12 |
20 |
102 |
305 |
AM WEST |
on time |
694 |
4840 |
383 |
320 |
201 |
AM WEST |
delayed |
117 |
415 |
65 |
129 |
61 |
Step 5: ‘Spread’ Arrival.Status so that each distinct value becames a variable (column).
- I will be using tidyr::spread() function
- Key-Value pair is going to be “Arrival.Status” and “Count” respectively.
- The “Count” column was generated during Step 4.
airlines_transform2 <- tidyr::spread(airlines_transform1, Arrival.Status, Count)
#display
kable(airlines_transform2)
ALASKA |
Los.Angeles |
62 |
497 |
ALASKA |
Phoenix |
12 |
221 |
ALASKA |
San.Diego |
20 |
212 |
ALASKA |
San.Francisco |
102 |
503 |
ALASKA |
Seattle |
305 |
1841 |
AM WEST |
Los.Angeles |
117 |
694 |
AM WEST |
Phoenix |
415 |
4840 |
AM WEST |
San.Diego |
65 |
383 |
AM WEST |
San.Francisco |
129 |
320 |
AM WEST |
Seattle |
61 |
201 |