DATA607_Assignment

1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

I have created a csv file with the information and uploaded to Github. I will be using this file to manipulate into tidy form.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

# Read the file from GitHub and save to a data frame
airlines <- read.csv("https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Assignment_4/Tifdying%20and%20Transforming%20Data.csv")
print(airlines)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

2) Tidy data

To make the data tidy we would have to think about reformatting the table and present it with a way that each column is a variable, each row is an observation, and each cell is a value. First we have to think about the empty values in our table. I used excel to create my table and for the cells that have no value I tabbed to the next cell or I traverse the cells with the arrows. These cells were either recorded as empty cells “” or null values when converted to csv files. R has many build in functions that can take care of NA values, so the first step I did to clean up my data is to remove all empty cells and making them NA. Since my data did not have any other odd characters I did not do any more cleaning other than converting the empty cells to NA values. However, in other cases I think we would have to think about handling extra spaces, special characters, or corrupt data with regular expression.

# Taking care of empty spaces and set them to NA
airlines[airlines ==""] <- NA

Then I began organizing my data. The empty columns were renamed column X and X.1 after R load the data into data frame. I renamed my columns to meaningful names and correct formatting.

# Using clean_names() part of the janitor package to replace the . in the names and remove space for the column name
airlines <- airlines %>% 
  rename(Airline = X, Status = X.1) %>% 
  clean_names()

# Using fill() part of dplyer to fill the data in the column with the value from before until a different value is encountered
airlines <- airlines %>% 
  fill(airline, .direction = "down")

print(airlines)

##   airline  status los_angeles phoenix san_diego san_francisco seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3  Alaska    <NA>          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

I do notice there is a row of NA values in my table but I know that R and the tidyverse library can handle this very easily. I tried looking at the data the “long” way and “wide” way with the pivot_longer and pivot_wider function.
In pivot_longer, I condensed the columns and made a “location” column to include all the locations the flights arrive to and a “count” column to count the number of flights on time or delayed. Using the values_drop_na variable in pivot_longer functiion, I was able to drop the row with NA values.

# Looking at data using pivot_longer
airlines_long <- airlines %>% 
  pivot_longer(
    cols = los_angeles:seattle,
    names_to = "location",
    values_to = "count",
    values_drop_na = TRUE
  ) %>% 
  print()

## # A tibble: 20 × 4
##    airline status  location      count
##    <chr>   <chr>   <chr>         <int>
##  1 Alaska  on time los_angeles     497
##  2 Alaska  on time phoenix         221
##  3 Alaska  on time san_diego       212
##  4 Alaska  on time san_francisco   503
##  5 Alaska  on time seattle        1841
##  6 Alaska  delayed los_angeles      62
##  7 Alaska  delayed phoenix          12
##  8 Alaska  delayed san_diego        20
##  9 Alaska  delayed san_francisco   102
## 10 Alaska  delayed seattle         305
## 11 AM WEST on time los_angeles     694
## 12 AM WEST on time phoenix        4840
## 13 AM WEST on time san_diego       383
## 14 AM WEST on time san_francisco   320
## 15 AM WEST on time seattle         201
## 16 AM WEST delayed los_angeles     117
## 17 AM WEST delayed phoenix         415
## 18 AM WEST delayed san_diego        65
## 19 AM WEST delayed san_francisco   129
## 20 AM WEST delayed seattle          61

In pivot_wider I dropped the row with NA values before I expand the columns with pivot_wider since I am trying to expand with the “status” variable against all the other location columns. I ended up with location columns of “on time” or “delayed”

# Looking at data using pivot_wider
airlines_wide <- airlines %>% 
  drop_na() %>% # drop the rows that has na values in there
  pivot_wider(
    names_from = status,
    values_from = los_angeles:seattle
  ) %>% 
  print()

## # A tibble: 2 × 11
##   airline `los_angeles_on time` los_angeles_delayed `phoenix_on time`
##   <chr>                   <int>               <int>             <int>
## 1 Alaska                    497                  62               221
## 2 AM WEST                   694                 117              4840
## # ℹ 7 more variables: phoenix_delayed <int>, `san_diego_on time` <int>,
## #   san_diego_delayed <int>, `san_francisco_on time` <int>,
## #   san_francisco_delayed <int>, `seattle_on time` <int>, seattle_delayed <int>

I think it is easier with pivot_longer to manipulate data later since I can think in columns vs thinking in rows.

Perform analysis to compare the arrival delays for the two airlines.

I tried two different ways of analyzing with the “longer” representation of data to compare the arrival delays of each airline. I think it would be best to represent as percentage of flights on time over all flights taken. First attempt I break the data into two tables and then join them to calculate the percentage. Second attempt I tried not breaking into other tables and used only group_by, mutate, and summarise to come up with final data. In my first attempt I can see all the calculated columns and gave me more insight but the result of both attempts are the same.

Around 89 percent of AM WEST flights arrive on time and around 87 percent of Alaska flights arrive to the locations on time.

# Break into different summary tables 
total_flights <- airlines_long %>% 
  group_by(airline) %>% 
  summarise(all_loc = sum(count)) #%>% 
total_status <- airlines_long %>% 
  group_by(airline, status) %>% 
  summarise(total_status = sum(count))

## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.

attempt_1 <- total_status %>%
  left_join(total_flights) %>% 
  mutate(percentage = total_status/all_loc*100) %>% 
  print()

## Joining with `by = join_by(airline)`

## # A tibble: 4 × 5
## # Groups:   airline [2]
##   airline status  total_status all_loc percentage
##   <chr>   <chr>          <int>   <int>      <dbl>
## 1 AM WEST delayed          787    7225       10.9
## 2 AM WEST on time         6438    7225       89.1
## 3 Alaska  delayed          501    3775       13.3
## 4 Alaska  on time         3274    3775       86.7

# Second attempt without breaking into different tables
attempt_2 <- airlines_long %>% 
  group_by(airline) %>% 
  mutate(all_loc = sum(count),
         percentage = count/all_loc*100) %>% 
  group_by(airline, status, all_loc) %>% 
  summarise(total_per = sum(percentage)) %>% 
  print()

## `summarise()` has grouped output by 'airline', 'status'. You can override using
## the `.groups` argument.

## # A tibble: 4 × 4
## # Groups:   airline, status [4]
##   airline status  all_loc total_per
##   <chr>   <chr>     <int>     <dbl>
## 1 AM WEST delayed    7225      10.9
## 2 AM WEST on time    7225      89.1
## 3 Alaska  delayed    3775      13.3
## 4 Alaska  on time    3775      86.7

4 Conclusion

I used Microsoft Excel to generate the table for Assignment 4 and saved as csv file then uploaded to Github to load for my analysis. The table from the assignment had various empty column headers and cells. First step after loading the table in R I noticed that the empty column headers were assigned names like X and X.1 and the empty cells were taken as either empty spaces or null values. I renamed the column headers to “airlines” and “status” and changed all the empty spaces to NA values since R has built in functions to take care of NA values. Then I used the fill function going down for column “airlines” so that the values will be filled in until the next value encountered. To take care of the empty row in the table, I used the values_drop_na value in pivot_longer function to remove the empty row and used the arrival location as key called location to expand the table.

For my analysis, I first group the data by airline to get the total flights arrived for each airlines then calculated the percentage with that information. Use the mutate function to add the information to the table and then group again by airline and status. Use the summary function I was able to find the percentage in which flights arrive on time or delayed per each airline. I find that around 89 percent of AM WEST flights arrive on time and around 87 percent of Alaska flights arrive to the locations on time.

DATA607_Assignment_4

Ying Fang Lee

9/29/2024

1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

2) Tidy data

Perform analysis to compare the arrival delays for the two airlines.

4 Conclusion