I have created a csv file with the information and uploaded to Github. I will be using this file to manipulate into tidy form.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Read the file from GitHub and save to a data frame
airlines <- read.csv("https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Assignment_4/Tifdying%20and%20Transforming%20Data.csv")
print(airlines)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
To make the data tidy we would have to think about reformatting the table and present it with a way that each column is a variable, each row is an observation, and each cell is a value. First we have to think about the empty values in our table. I used excel to create my table and for the cells that have no value I tabbed to the next cell or I traverse the cells with the arrows. These cells were either recorded as empty cells “” or null values when converted to csv files. R has many build in functions that can take care of NA values, so the first step I did to clean up my data is to remove all empty cells and making them NA. Since my data did not have any other odd characters I did not do any more cleaning other than converting the empty cells to NA values. However, in other cases I think we would have to think about handling extra spaces, special characters, or corrupt data with regular expression.
# Taking care of empty spaces and set them to NA
airlines[airlines ==""] <- NA
Then I began organizing my data. The empty columns were renamed column X and X.1 after R load the data into data frame. I renamed my columns to meaningful names and correct formatting.
# Using clean_names() part of the janitor package to replace the . in the names and remove space for the column name
airlines <- airlines %>%
rename(Airline = X, Status = X.1) %>%
clean_names()
# Using fill() part of dplyer to fill the data in the column with the value from before until a different value is encountered
airlines <- airlines %>%
fill(airline, .direction = "down")
print(airlines)
## airline status los_angeles phoenix san_diego san_francisco seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 Alaska <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
I do notice there is a row of NA values in my table but I know that R
and the tidyverse library can handle this very easily. I tried looking
at the data the “long” way and “wide” way with the pivot_longer and
pivot_wider function.
In pivot_longer, I condensed the columns and made a “location” column to
include all the locations the flights arrive to and a “count” column to
count the number of flights on time or delayed. Using the values_drop_na
variable in pivot_longer functiion, I was able to drop the row with NA
values.
# Looking at data using pivot_longer
airlines_long <- airlines %>%
pivot_longer(
cols = los_angeles:seattle,
names_to = "location",
values_to = "count",
values_drop_na = TRUE
) %>%
print()
## # A tibble: 20 × 4
## airline status location count
## <chr> <chr> <chr> <int>
## 1 Alaska on time los_angeles 497
## 2 Alaska on time phoenix 221
## 3 Alaska on time san_diego 212
## 4 Alaska on time san_francisco 503
## 5 Alaska on time seattle 1841
## 6 Alaska delayed los_angeles 62
## 7 Alaska delayed phoenix 12
## 8 Alaska delayed san_diego 20
## 9 Alaska delayed san_francisco 102
## 10 Alaska delayed seattle 305
## 11 AM WEST on time los_angeles 694
## 12 AM WEST on time phoenix 4840
## 13 AM WEST on time san_diego 383
## 14 AM WEST on time san_francisco 320
## 15 AM WEST on time seattle 201
## 16 AM WEST delayed los_angeles 117
## 17 AM WEST delayed phoenix 415
## 18 AM WEST delayed san_diego 65
## 19 AM WEST delayed san_francisco 129
## 20 AM WEST delayed seattle 61
In pivot_wider I dropped the row with NA values before I expand the columns with pivot_wider since I am trying to expand with the “status” variable against all the other location columns. I ended up with location columns of “on time” or “delayed”
# Looking at data using pivot_wider
airlines_wide <- airlines %>%
drop_na() %>% # drop the rows that has na values in there
pivot_wider(
names_from = status,
values_from = los_angeles:seattle
) %>%
print()
## # A tibble: 2 × 11
## airline `los_angeles_on time` los_angeles_delayed `phoenix_on time`
## <chr> <int> <int> <int>
## 1 Alaska 497 62 221
## 2 AM WEST 694 117 4840
## # ℹ 7 more variables: phoenix_delayed <int>, `san_diego_on time` <int>,
## # san_diego_delayed <int>, `san_francisco_on time` <int>,
## # san_francisco_delayed <int>, `seattle_on time` <int>, seattle_delayed <int>
I think it is easier with pivot_longer to manipulate data later since I can think in columns vs thinking in rows.
I tried two different ways of analyzing with the “longer” representation of data to compare the arrival delays of each airline. I think it would be best to represent as percentage of flights on time over all flights taken. First attempt I break the data into two tables and then join them to calculate the percentage. Second attempt I tried not breaking into other tables and used only group_by, mutate, and summarise to come up with final data. In my first attempt I can see all the calculated columns and gave me more insight but the result of both attempts are the same.
Around 89 percent of AM WEST flights arrive on time and around 87 percent of Alaska flights arrive to the locations on time.
# Break into different summary tables
total_flights <- airlines_long %>%
group_by(airline) %>%
summarise(all_loc = sum(count)) #%>%
total_status <- airlines_long %>%
group_by(airline, status) %>%
summarise(total_status = sum(count))
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
attempt_1 <- total_status %>%
left_join(total_flights) %>%
mutate(percentage = total_status/all_loc*100) %>%
print()
## Joining with `by = join_by(airline)`
## # A tibble: 4 × 5
## # Groups: airline [2]
## airline status total_status all_loc percentage
## <chr> <chr> <int> <int> <dbl>
## 1 AM WEST delayed 787 7225 10.9
## 2 AM WEST on time 6438 7225 89.1
## 3 Alaska delayed 501 3775 13.3
## 4 Alaska on time 3274 3775 86.7
# Second attempt without breaking into different tables
attempt_2 <- airlines_long %>%
group_by(airline) %>%
mutate(all_loc = sum(count),
percentage = count/all_loc*100) %>%
group_by(airline, status, all_loc) %>%
summarise(total_per = sum(percentage)) %>%
print()
## `summarise()` has grouped output by 'airline', 'status'. You can override using
## the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: airline, status [4]
## airline status all_loc total_per
## <chr> <chr> <int> <dbl>
## 1 AM WEST delayed 7225 10.9
## 2 AM WEST on time 7225 89.1
## 3 Alaska delayed 3775 13.3
## 4 Alaska on time 3775 86.7
I used Microsoft Excel to generate the table for Assignment 4 and saved as csv file then uploaded to Github to load for my analysis. The table from the assignment had various empty column headers and cells. First step after loading the table in R I noticed that the empty column headers were assigned names like X and X.1 and the empty cells were taken as either empty spaces or null values. I renamed the column headers to “airlines” and “status” and changed all the empty spaces to NA values since R has built in functions to take care of NA values. Then I used the fill function going down for column “airlines” so that the values will be filled in until the next value encountered. To take care of the empty row in the table, I used the values_drop_na value in pivot_longer function to remove the empty row and used the arrival location as key called location to expand the table.
For my analysis, I first group the data by airline to get the total flights arrived for each airlines then calculated the percentage with that information. Use the mutate function to add the information to the table and then group again by airline and status. Use the summary function I was able to find the percentage in which flights arrive on time or delayed per each airline. I find that around 89 percent of AM WEST flights arrive on time and around 87 percent of Alaska flights arrive to the locations on time.