Report Zero

Jakub Flizikowski, ID: 197566

2024-03-12

Report overview

This report covers the tasks from the first lab. It presents the solution to each task in R alongside the corresponding code in Python. Each exercise in the “Exercises” section below gives the task instruction, the result produced in R, and the code for that task in both languages, R and Python.

Exercises

Exercise 1

Read the dataframe “flights”, then keep only the flights from the 1st of January. Save the output to a new dataframe called “mydata”.

## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Python and R code that produce the same output, for comparison:
Python:

mydata = flights[(flights['day'] == 1) & (flights['month'] == 1)]

R language:

mydata <- flights %>% filter(day==1 & month == 1)
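The Python line above can be tried without the full dataset. The following is a minimal sketch on a small hypothetical frame (the four rows are made up for illustration and are not taken from the real flights table); it also shows `DataFrame.query` as an equivalent spelling of the same filter.

```python
import pandas as pd

# Hypothetical toy stand-in for the flights table (illustrative values only).
flights = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [1, 1, 2, 12],
    "day": [1, 2, 1, 1],
    "dep_time": [517, 533, 542, 544],
})

# Keep rows where both conditions hold. Each comparison needs its own
# parentheses because & binds more tightly than == in Python.
mydata = flights[(flights["month"] == 1) & (flights["day"] == 1)]

# The same filter expressed as a query string.
mydata_query = flights.query("month == 1 and day == 1")

print(mydata)
```

Both forms keep only the single row dated 1 January; which one to use is largely a matter of taste.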

Exercise 2

Filter flights operated by United (UA), American (AA), or Delta (DL):

## # A tibble: 139,504 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # ℹ 139,494 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Python and R code that produce the same output, for comparison:
Python:

filtered_data = flights[flights['carrier'].isin(['UA', 'AA', 'DL'])]

R language:

flights %>% filter(carrier %in% c("UA", "AA", "DL"))

Exercise 3

Filter flights that departed between midnight and 6 am (inclusive). Don’t forget flights that left at exactly midnight (2400).

## # A tibble: 9,373 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 9,363 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

In total, 9,373 flights departed between midnight and 6 am.

Python and R code that produce the same output, for comparison:
Python:

filtered_flights = flights[(flights['dep_time'].between(0, 600)) | (flights['dep_time'] == 2400)]

R language:

flights %>% filter(between(dep_time, 0, 600) | dep_time == 2400)
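To see how the boundary values behave, here is a small sketch on hypothetical departure times (the values, including the missing one, are invented for illustration). It assumes that `dep_time` is stored as an HHMM number, as in the output above, and that some rows may have no departure time.

```python
import numpy as np
import pandas as pd

# Hypothetical departure times in HHMM form; NaN stands for a missing value.
flights = pd.DataFrame(
    {"dep_time": [1.0, 600.0, 601.0, 2359.0, 2400.0, np.nan]}
)

# Series.between is inclusive on both ends by default, so 600 is kept.
# Midnight is encoded as 2400, so it needs its own condition.
is_early = flights["dep_time"].between(0, 600) | (flights["dep_time"] == 2400)
early = flights[is_early]

print(early)
```

In this sketch the rows 1, 600 and 2400 are kept, while 601, 2359 and the missing value are dropped, because comparisons against NaN evaluate to False.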

Exercise 4

We also need to make sure the data is ordered in a certain manner. This is easily done in R with the arrange() function; here the flights are sorted by departure delay (dep_delay) in ascending order.

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Python and R code that produce the same output, for comparison:
Python:

sorted_data = flights.sort_values(by=['dep_delay'])

R language:

flights %>% arrange(dep_delay)
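A short sketch of the sorting behaviour on a hypothetical four-row frame (the flight numbers and delays are made up) may help here. It illustrates that `sort_values` places missing values last by default in both directions, which matches what `arrange()` does with NA.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN stands for a flight with no recorded delay.
flights = pd.DataFrame({
    "flight": [1, 2, 3, 4],
    "dep_delay": [12.0, -43.0, np.nan, 0.0],
})

# Ascending order: the earliest departures (most negative delay) come first.
earliest_first = flights.sort_values(by="dep_delay")

# Descending order: the largest delays come first; NaN still goes last.
latest_first = flights.sort_values(by="dep_delay", ascending=False)

print(earliest_first)
print(latest_first)
```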

Exercise 5

Find the 10 most delayed flights (dep_delay) using a ranking function.

## # A tibble: 10 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_delay_rank <int>

Python and R code that produce the same output, for comparison:
Python:

flights['dep_delay_rank'] = flights['dep_delay'].rank(method='min', ascending=True)
sorted_data = flights.sort_values(by='dep_delay_rank', ascending=False)
top_10 = sorted_data.head(10)

R language:

flights %>%
  mutate(dep_delay_rank = min_rank(dep_delay)) %>%
  arrange(desc(dep_delay_rank)) %>%
  head(10)

# alternative approach suggested in the lab
flights %>%
  arrange(-dep_delay) %>%
  head(10)
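The ranking approach and the sorting shortcut can be compared on a small hypothetical frame (flight numbers and delays are invented, with one tie and one missing value). This sketch ranks in descending order so that the largest delay gets rank 1, which is a variation on the code above rather than a copy of it, and uses `DataFrame.nlargest` as the pandas counterpart of the shortcut.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a tie (120) and a missing delay.
flights = pd.DataFrame({
    "flight": [11, 22, 33, 44, 55],
    "dep_delay": [5.0, 120.0, np.nan, 120.0, -3.0],
})

# method="min" gives tied values the same, smallest rank, like min_rank() in R.
# With ascending=False the largest delay receives rank 1; NaN stays unranked.
ranked = flights.assign(
    dep_delay_rank=flights["dep_delay"].rank(method="min", ascending=False)
)
top_3 = ranked[ranked["dep_delay_rank"] <= 3].sort_values("dep_delay_rank")

# Shortcut: take the 3 rows with the largest dep_delay directly.
shortcut = flights.nlargest(3, "dep_delay")

print(top_3)
print(shortcut)
```

Filtering on the rank can return more rows than requested when ties straddle the cutoff, whereas `nlargest` always returns at most the requested number; which behaviour is preferable depends on the task.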

Exercise 6

Which carrier has the worst delays?

## # A tibble: 10 × 2
##    carrier delay
##    <chr>   <dbl>
##  1 OO       60.6
##  2 YV       51.1
##  3 9E       49.3
##  4 EV       48.3
##  5 F9       47.6
##  6 VX       43.8
##  7 FL       41.1
##  8 WN       40.7
##  9 B6       40.0
## 10 AA       38.3

By this measure (the mean arrival delay among flights that arrived late), the most delayed carrier is “OO”.

Python and R code that produce the same output, for comparison:
Python:

filtered_flights = flights[flights['arr_delay'] > 0]
delay_by_carrier = filtered_flights.groupby('carrier')['arr_delay'].mean().reset_index()
delay_by_carrier_sorted = delay_by_carrier.sort_values(by='arr_delay', ascending=False)
top_10_delayed_carriers = delay_by_carrier_sorted.head(10)

R language:

flights %>% 
  filter(arr_delay>0) %>%
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay)) %>%
  arrange(desc(delay)) %>%
  head(10)
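The filter, group and summarise steps can be checked by hand on a small hypothetical frame (the carriers are real codes, but the delay values are invented for illustration). The column name `delay` is chosen here only to mirror the R output.

```python
import pandas as pd

# Hypothetical toy data; negative arr_delay means the flight arrived early.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA", "DL"],
    "arr_delay": [10.0, 30.0, -5.0, 50.0, -2.0],
})

# Keep only flights that arrived late, then average the delay per carrier.
late = flights[flights["arr_delay"] > 0]
delay_by_carrier = (
    late.groupby("carrier", as_index=False)["arr_delay"]
    .mean()
    .rename(columns={"arr_delay": "delay"})
    .sort_values("delay", ascending=False)
)

print(delay_by_carrier)
```

Because early and on-time arrivals are filtered out first, the result measures how late a carrier is when it is late, not how often it is late. In this sketch DL disappears entirely, since its only flight arrived early.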

Report summary

This report shows how to solve the tasks from the first lab and compares the R and Python code for each of them. For simple tasks such as these, the differences between the two languages turn out to be small, if not insignificant.