DATA607 5th Week Assignment

Assignment – Tidying and Transforming Data

The goal of this week’s assignment is to work with data structures and utilize tidyverse and dplyr for data manipulation. We will focus on both wide and tall tables, practicing the transformation of data from one format to another.

The assignment comprises Three sections, and this R-Markdown document mirrors the same structure, introducing sub-titles for each part of the assignment.

Code Initiation

## [1] "All required packages are installed"

#1 Section: Creation of a wide database

The goal is to create a .CSV file (or optionally, a MySQL database) that includes all the information displayed in the following format. I’ve already created a CSV file named Airlines.csv, stored in the /Data folder. Our objective is to create a “wide” structure, similar to how the information is presented below. This way, we can practice tidying and transformations, as described in the subsequent sections

Wide structure sample

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#2: Data Transformation

We read the data from the file and then change the wide to long format. I will use tidyr and dplyr as needed to tidy and transform the data.

# first we read the data, 
# data will be read from local, or GitHub 


# I wanted to create a function that takes an address as input. The function should first try to load the file from a local location. If that fails, it should then attempt to load the file from an assumed web address. Finally, if both attempts are unsuccessful, the function should prompt the operator to choose a file manually from their local computer. 

load_file_KP <- function(file_name) {
  # Attempt to load the file
  file_data <- tryCatch({
    # Try to read from a local file
    if (file.exists(file_name)) {
      # Read the local file
      con <- file(file_name, "r")
      df <- read.csv(con,  check.names = FALSE,
                     na.strings = "", dec = ".", quote = "\"")
      #lines <- readLines(con, encoding = "unknown")
      close(con)
      return(list(data = df, message = "File loaded successfully,", method = "Local"))
    } else {
      # Attempt to read from a URL
      web_file <- tryCatch({
        read.csv(file_name,  check.names = FALSE,
                     na.strings = "", dec = ".", quote = "\"")
        #readLines(file_name)
      }, error = function(e) {
        NULL  # Return NULL to indicate failure
      })
      if (!is.null(web_file)) {
        return(list(data = web_file, message = "File loaded successfully, ", method = "Web"))
      }
    }
    NULL  # Return NULL to indicate failure
  }, error = function(e) {
    NULL  # Return NULL to indicate failure
  })
  
  # If loading the file was not successful, prompt the operator to select a file manually
  if (is.null(file_data)) {
    file_path <- file.choose()  # Prompt to choose a file
    # Read the file if a file path was selected
    if (file_path != "") {
      manual_file <- tryCatch({
        read.csv(file_path, check.names = FALSE,
                     na.strings = "", dec = ".", quote = "\"")
        #readLines(file_path)
      }, error = function(e) {
        NULL  # Return NULL to indicate failure
      })
      if (!is.null(manual_file)) {
        return(list(data = manual_file, message = "Manual file loaded successfully, ", method = "Manual"))
      }
    } else {
      stop("No file selected. Exiting.")  # Stop execution if no file was selected
    }
  }
  
  # Return the file data
  return(file_data)
}

#read the data into RStudio from DATA folder  

#test written function 
#local file
file_name_1 <- "Data/AirLines.csv"
result <- load_file_KP(file_name_1)
file_name_2 <- "https://raw.githubusercontent.com/koohpi/DATA607_5th_ASSGNMNT/main/Data/Airlines.csv"

#df <- read.csv("Data/Airlines.csv", na.strings = "", dec = ".", quote = "\"", #check.names = FALSE)
df <- read.csv(file_name_2, na.strings = "", dec = ".", quote = "\"", check.names = FALSE)



#KP_DF <- result$data #pass loaded data 

KP_DF <- df #pass loaded data 
# Print the result

paste(result$message, "using", result$method, "method.", sep = " ") #use past

## [1] "File loaded successfully, using Local method."

cat("\n", "Here is the head of the laoded data:", "\n")

## 
##  Here is the head of the laoded data:

#glimpse(KP_DF)
head(KP_DF)

##                   Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA    <NA>        NA            NA    <NA>
## 4 AM WEST on time         694   4,840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

#2: Changing the wide to long

Now that the data is loaded, let’s use dplyr to transform the format from wide to long.

tidyr offers various functionalities, including reshaping data using pivot_longer and pivot_wider. It also provides functions for handling missing values (NA), either by removing them or replacing them. You can find a helpful tidyr cheatsheet by clicking on the link .

First, I’ll tidy the data. Upon examining it, I noticed some NA values that need removal, as well as an empty column that should be populated with the names of the airlines. Finally, we’ll pivot the data to create a long-format database.

In the following code, I utilize tidyr to create a long format using filter, select, pivot_longer, and cbind. The resulting data will be stored in KP_DF_Long.

library(dplyr)
library(tidyverse)
#let's change the datafarme to tibble
colnames(KP_DF)[1:2] <- c("Airlines", "Status")
head(KP_DF)

##   Airlines  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1   ALASKA on time         497     221       212           503   1,841
## 2     <NA> delayed          62      12        20           102     305
## 3     <NA>    <NA>          NA    <NA>        NA            NA    <NA>
## 4  AM WEST on time         694   4,840       383           320     201
## 5     <NA> delayed         117     415        65           129      61

KP_DF_T<- as_tibble(KP_DF)

# remove na using drop_na
KP_DF_T <- KP_DF_T %>% 
  drop_na( `Los Angeles`)

# use fill to repalce the missign value 
KP_DF_T<- KP_DF_T %>% 
  fill( Airlines, .direction = "down")

#change the column that are not numeric to nuemric 
KP_DF_T$Phoenix <- as.integer(gsub(",", "",KP_DF_T$Phoenix)) # change the character to integer 
KP_DF_T$Seattle <- as.integer(gsub(",", "",KP_DF_T$Seattle))

print("Here is how clean data looks like")

## [1] "Here is how clean data looks like"

KP_DF_T

## # A tibble: 4 × 7
##   Airlines Status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>    <chr>           <int>   <int>       <int>           <int>   <int>
## 1 ALASKA   on time           497     221         212             503    1841
## 2 ALASKA   delayed            62      12          20             102     305
## 3 AM WEST  on time           694    4840         383             320     201
## 4 AM WEST  delayed           117     415          65             129      61

#I will use pivot_longer to change the data, in this case we want to change the staus to delayed and on_time and we want to chaneg the names to Dest, and chaneg the table to a 4- column table with UniqueCarrier, Dest, Delayed, On_time. 

#filter based on Staus Delayed to chose only the delayed flight, and put them on a column called Dest and Delayed
KP_DF_Long <- KP_DF_T %>% filter(Status == "delayed") %>%
  pivot_longer(cols = 3:ncol(KP_DF_T), names_to = "Dest", values_to = "Delayed", values_drop_na = FALSE)

#filter based on Staus On_time to choose only the On time flights, and put them on a column called Dest and On_time
KP_DF_Long <- KP_DF_T %>% filter(Status == "on time") %>%
  pivot_longer(cols = 3:ncol(KP_DF_T), names_to = "Dest", values_to = "On_time", values_drop_na = FALSE) %>%
  select(On_time) %>% cbind(KP_DF_Long)

#slect only the important column and add the On_time to the last column 
KP_DF_Long <- KP_DF_Long %>%
  select(Airlines,Dest, Delayed, On_time)
print("Here is the Long version of the data")

## [1] "Here is the Long version of the data"

head(KP_DF_Long ,10 )

##    Airlines          Dest Delayed On_time
## 1    ALASKA   Los Angeles      62     497
## 2    ALASKA       Phoenix      12     221
## 3    ALASKA     San Diego      20     212
## 4    ALASKA San Francisco     102     503
## 5    ALASKA       Seattle     305    1841
## 6   AM WEST   Los Angeles     117     694
## 7   AM WEST       Phoenix     415    4840
## 8   AM WEST     San Diego      65     383
## 9   AM WEST San Francisco     129     320
## 10  AM WEST       Seattle      61     201

#3: Analyses using dplyr

We will use dplyr to perform analysis to compare the arrival delays for the two airlines, and do some more analyses. I will use this to find and arrange them in order for the flight with highest delayed and highest on time.

# lets comapre which air line has the most delayed on the destination, we need to find the max delay among all and find the associated Airlines 

print("Airline with the highest Delayed flights")

## [1] "Airline with the highest Delayed flights"

KP_DF_Long %>%
  filter(Delayed == max(Delayed)) %>%
  slice_max(order_by = Delayed, n = 1) %>%
  select(Airlines,Dest, Delayed)

##   Airlines    Dest Delayed
## 1  AM WEST Phoenix     415

# which destination and Airlines with the highest On_time 
print("Airline with the highest On time flights")

## [1] "Airline with the highest On time flights"

KP_DF_Long %>%
  filter(On_time == max(On_time)) %>%
  slice_max(order_by = On_time, n = 1) %>%
  select(Airlines,Dest, On_time)

##   Airlines    Dest On_time
## 1  AM WEST Phoenix    4840

#fidnd the Airlines that has the most sum of on_time for all destiantion and the one that has the most delayed for among the destination 

print("Airlines with the maximum sum of delayed flights over the destinations")

## [1] "Airlines with the maximum sum of delayed flights over the destinations"

KP_DF_Long %>%
  group_by(Airlines) %>%
  summarize(
    max_delayed_all = sum(Delayed)
  ) %>% arrange(desc(max_delayed_all))

## # A tibble: 2 × 2
##   Airlines max_delayed_all
##   <chr>              <int>
## 1 AM WEST              787
## 2 ALASKA               501

print("Airlines with the maximum sum of On_time fligth over the destinations")

## [1] "Airlines with the maximum sum of On_time fligth over the destinations"

KP_DF_Long %>%
  group_by(Airlines) %>%
  summarize(
    max_ontime_all = sum(On_time)
  )  %>% arrange(desc(max_ontime_all))

## # A tibble: 2 × 2
##   Airlines max_ontime_all
##   <chr>             <int>
## 1 AM WEST            6438
## 2 ALASKA             3274

# Fidn the destination that has the heighest and lowest On_time and delayed. 

print("Destinations with the maximum sum of delayed flights over the Airlines")

## [1] "Destinations with the maximum sum of delayed flights over the Airlines"

KP_DF_Long %>%
  group_by(Dest) %>%
  summarize(
    max_delayed_all = sum(Delayed)
  ) %>% arrange(desc(max_delayed_all))

## # A tibble: 5 × 2
##   Dest          max_delayed_all
##   <chr>                   <int>
## 1 Phoenix                   427
## 2 Seattle                   366
## 3 San Francisco             231
## 4 Los Angeles               179
## 5 San Diego                  85

print("Destinations with the maximum sum of On_time fligths over the Airlines")

## [1] "Destinations with the maximum sum of On_time fligths over the Airlines"

KP_DF_Long %>%
  group_by(Dest) %>%
  summarize(
    max_OnTime_all = sum(On_time)
  ) %>% arrange(desc(max_OnTime_all))

## # A tibble: 5 × 2
##   Dest          max_OnTime_all
##   <chr>                  <int>
## 1 Phoenix                 5061
## 2 Seattle                 2042
## 3 Los Angeles             1191
## 4 San Francisco            823
## 5 San Diego                595