Report overview
This report presents the solutions to the tasks from the first lab. For each exercise in the “Exercises” section below, it gives the task instruction, the result produced in R, and the code for that task in both R and Python for comparison.
Exercises
Exercise 1
Read the dataframe “flights”, then keep only the flights from the 1st of January. Save the output to a new dataframe called “mydata”.
## # A tibble: 842 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Python and R code producing the same output, for comparison:
Python:
filtered_data = flights[(flights['day'] == 1) & (flights['month'] == 1)]
R language:
mydata <- flights %>% filter(day==1 & month == 1)
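The Python filter above can be tried without the full dataset. The following is a minimal sketch, assuming pandas is installed; the tiny `flights` table here is hypothetical toy data, not the real one.

```python
import pandas as pd

# Hypothetical toy stand-in for the flights table (not the real data).
flights = pd.DataFrame({
    "month": [1, 1, 2, 12],
    "day": [1, 2, 1, 1],
    "dep_time": [517, 533, 542, 544],
})

# Each condition needs its own parentheses, because & binds more
# tightly than == in Python.
mydata = flights[(flights["month"] == 1) & (flights["day"] == 1)]
print(mydata)
```

Only the first toy row has both month and day equal to 1, so it is the only row kept.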
Exercise 2
Filter flights operated by United (UA), American (AA), or Delta (DL):
## # A tibble: 139,504 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 600 -1 941 910
## 10 2013 1 1 559 600 -1 854 902
## # ℹ 139,494 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Python and R code producing the same output, for comparison:
Python:
filtered_data = flights[flights['carrier'].isin(['UA', 'AA', 'DL'])]
R language:
flights %>% filter(carrier %in% c("UA", "AA", "DL"))
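As a small illustration of what `isin` does, the sketch below compares it with the longer chain of `|` conditions it replaces. The toy `flights` table and the names `wanted`, `by_isin`, `by_or`, and `same_rows` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data; the carrier codes are the ones used in the exercise.
flights = pd.DataFrame({
    "carrier": ["UA", "B6", "AA", "DL", "EV"],
    "flight": [1545, 725, 1141, 461, 5708],
})

wanted = ["UA", "AA", "DL"]

# Membership test, analogous to %in% in R.
by_isin = flights[flights["carrier"].isin(wanted)]

# The same filter written as explicit OR conditions.
by_or = flights[
    (flights["carrier"] == "UA")
    | (flights["carrier"] == "AA")
    | (flights["carrier"] == "DL")
]

# True when both filters select exactly the same rows.
same_rows = by_isin.equals(by_or)
print(same_rows)
```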
Exercise 3
Filter flights that departed between midnight and 6 am (inclusive). Don’t forget flights that left at exactly midnight (2400).
## # A tibble: 9,373 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 9,363 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
There are 9,373 flights that departed between midnight and 6 am.
Python and R code producing the same output, for comparison:
Python:
filtered_flights = flights[(flights['dep_time'].between(0, 600)) | (flights['dep_time'] == 2400)]
R language:
flights %>% filter(between(dep_time, 0, 600) | dep_time == 2400)
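The boundary behaviour of this filter matters, so here is a minimal sketch on hypothetical toy departure times. It relies on pandas `Series.between` being inclusive on both ends by default, and on comparisons with missing values evaluating to false, so cancelled flights with no `dep_time` are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical departure times around the boundaries, plus one missing value.
flights = pd.DataFrame({
    "dep_time": [1, 559, 600, 601, 2359, 2400, np.nan],
})

# between(0, 600) keeps 600 itself; the second condition adds exact midnight.
early = flights[
    flights["dep_time"].between(0, 600) | (flights["dep_time"] == 2400)
]
print(early)
```

On this toy data the rows with 601, 2359, and the missing value are excluded, while 600 and 2400 are kept.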
Exercise 4
We also need to make sure the data is ordered in a certain way. In R this is easily done with the arrange() function; here the flights are sorted by departure delay in ascending order.
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Python and R code producing the same output, for comparison:
Python:
sorted_data = flights.sort_values(by=['dep_delay'])
R language:
flights %>% arrange(dep_delay)
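One detail worth checking when comparing the two sorts is where missing delays end up. The sketch below uses hypothetical toy data and spells out `na_position="last"`, which is already the pandas default and matches arrange() placing missing values at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing departure delay.
flights = pd.DataFrame({
    "flight": [101, 102, 103, 104],
    "dep_delay": [2.0, np.nan, -6.0, 4.0],
})

# Ascending sort; rows with a missing dep_delay go to the end.
sorted_data = flights.sort_values(by="dep_delay", na_position="last")
print(sorted_data)
```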
Exercise 5
Find the 10 most delayed flights (dep_delay) using a ranking function.
## # A tibble: 10 × 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_delay_rank <int>
Python and R code producing the same output, for comparison:
Python:
flights['dep_delay_rank'] = flights['dep_delay'].rank(method='min', ascending=True)
sorted_data = flights.sort_values(by='dep_delay_rank', ascending=False)
top_10 = sorted_data.head(10)
R language:
flights %>%
  mutate(dep_delay_rank = min_rank(dep_delay)) %>%
  arrange(desc(dep_delay_rank)) %>%
  head(10)
# alternative approach suggested in the labs
flights %>%
  arrange(-dep_delay) %>%
  head(10)
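The second R approach, sorting in descending order and taking the first rows, has a direct pandas counterpart. This is a sketch on hypothetical toy data that takes the top 2 rather than the top 10. `DataFrame.nlargest` is shown as an alternative under the assumption that tie order does not matter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one flight has a missing delay.
flights = pd.DataFrame({
    "flight": [11, 12, 13, 14, 15],
    "dep_delay": [5.0, 1301.0, np.nan, -3.0, 1137.0],
})

# Sort descending and take the first rows; missing values are placed last.
top_2 = flights.sort_values(by="dep_delay", ascending=False).head(2)

# Equivalent shortcut that avoids an explicit full sort in user code.
top_2_alt = flights.nlargest(2, "dep_delay")
print(top_2)
```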
Exercise 6
Which carrier has the worst delays?
## # A tibble: 10 × 2
## carrier delay
## <chr> <dbl>
## 1 OO 60.6
## 2 YV 51.1
## 3 9E 49.3
## 4 EV 48.3
## 5 F9 47.6
## 6 VX 43.8
## 7 FL 41.1
## 8 WN 40.7
## 9 B6 40.0
## 10 AA 38.3
By this measure (mean arrival delay over flights that arrived late), the most delayed carrier is “OO”.
Python and R code producing the same output, for comparison:
Python:
filtered_flights = flights[flights['arr_delay'] > 0]
delay_by_carrier = filtered_flights.groupby('carrier')['arr_delay'].mean().reset_index()
delay_by_carrier_sorted = delay_by_carrier.sort_values(by='arr_delay', ascending=False)
top_10_delayed_carriers = delay_by_carrier_sorted.head(10)
R language:
flights %>%
  filter(arr_delay>0) %>%
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay)) %>%
  arrange(desc(delay)) %>%
  head(10)
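The grouped summary can also be written as one chained pandas expression that names the result column `delay`, as in the R output. This is a sketch on hypothetical toy data; the numbers are invented for illustration and are not taken from the flights dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: arrival delays in minutes for three carriers.
flights = pd.DataFrame({
    "carrier": ["OO", "OO", "AA", "AA", "AA", "UA"],
    "arr_delay": [60.0, 40.0, 10.0, -5.0, 30.0, np.nan],
})

# Keep only flights that arrived late; missing delays are dropped as well.
delayed = flights[flights["arr_delay"] > 0]

# Mean delay per carrier, sorted from worst to best.
delay_by_carrier = (
    delayed.groupby("carrier", as_index=False)
    .agg(delay=("arr_delay", "mean"))
    .sort_values(by="delay", ascending=False)
    .head(10)
)
print(delay_by_carrier)
```

Because early and on-time arrivals are filtered out first, the mean describes how late a carrier is when it is late, not its average delay over all flights.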
Report summary
This report shows how to solve the tasks from the first lab and compares R and Python on those tasks. For simple tasks like these, the differences between the two languages are small, if not insignificant.