HDS 2.1-2.2
Loading R Packages
To begin, copy these two lines (without the number sign) to the console and run them, one at a time:
#install.packages("devtools")
#devtools::install_github("hellodata-science/hellodatascience")
This will install the hellodatascience package (be patient). You will also need to install the tidyverse and nycflights23 packages the ordinary way (again, in the console, without the #):
#install.packages("tidyverse")
#install.packages("nycflights23")
Now you can load the packages you just installed. Insert the code for loading them here:
library(hellodatascience)
library(tidyverse)
library(nycflights23)
Data Frames
Read Section 2.1 of Hello Data Science. In that section, you found that the planets data has 8 rows and 7 columns. Note that I didn’t type the number of rows or columns; R calculated them and inserted them into my text. This is important because if the data changes (someone adds a row for Pluto, for example), I don’t have to change my text.
The flights data in the nycflights23 package gives on-time data for all flights scheduled to depart from one of the three New York City airports in 2023. Type ?flights in the console to see the help page. Write a sentence (like the one in the previous paragraph) that describes the number of rows and columns in the flights dataset by mixing R code into your sentence: The flights data in the nycflights23 package has 435352 rows and 19 columns.
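As a rough sketch of how a sentence like this can be computed rather than typed, the numbers come from nrow() and ncol(); in an R Markdown or Quarto source file the same calls would sit inline in the text, for example `r nrow(flights)`. The helper name describe_dims is hypothetical, and the built-in mtcars data stands in for flights so that the example runs without any installed packages.

```r
# Hypothetical helper: builds the sentence from the data itself,
# so the numbers update whenever the data changes.
describe_dims <- function(df, name) {
  paste0(
    "The ", name, " data has ",
    format(nrow(df), big.mark = ","), " rows and ",
    ncol(df), " columns."
  )
}

# mtcars ships with R, so this runs without installing anything.
sentence <- describe_dims(mtcars, "mtcars")
print(sentence)

# With nycflights23 loaded, the same call would be:
# describe_dims(flights, "flights")
```

If someone adds a row to the data, rerunning the code updates the sentence with no edits to the text.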
Getting to Know Data
Read Section 2.2 of Hello Data Science. Print the first few rows of the flights data (you certainly don’t want to print them all!):
head(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
tail(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 12 31 2207 2200 7 58 115
2 2023 12 31 2218 2224 -6 304 325
3 2023 12 31 2243 2150 53 143 56
4 2023 12 31 2248 2259 -11 338 350
5 2023 12 31 2326 2325 1 412 405
6 2023 12 31 2345 2255 50 425 347
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Use the glimpse function to get to know the variables in the flights data:
dplyr::glimpse(flights)
Rows: 435,352
Columns: 19
$ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 1, 18, 31, 33, 36, 503, 520, 524, 537, 547, 549, 551, 5…
$ sched_dep_time <int> 2038, 2300, 2344, 2140, 2048, 500, 510, 530, 520, 545, …
$ dep_delay <dbl> 203, 78, 47, 173, 228, 3, 10, -6, 17, 2, -10, -9, -7, -…
$ arr_time <int> 328, 228, 500, 238, 223, 808, 948, 645, 926, 845, 905, …
$ sched_arr_time <int> 3, 135, 426, 2352, 2252, 815, 949, 710, 818, 852, 901, …
$ arr_delay <dbl> 205, 53, 34, 166, 211, -7, -1, -25, 68, -7, 4, -13, -14…
$ carrier <chr> "UA", "DL", "B6", "B6", "UA", "AA", "B6", "AA", "UA", "…
$ flight <int> 628, 393, 371, 1053, 219, 499, 996, 981, 206, 225, 800,…
$ tailnum <chr> "N25201", "N830DN", "N807JB", "N265JB", "N17730", "N925…
$ origin <chr> "EWR", "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "EWR",…
$ dest <chr> "SMF", "ATL", "BQN", "CHS", "DTW", "MIA", "BQN", "ORD",…
$ air_time <dbl> 367, 108, 190, 108, 80, 154, 192, 119, 258, 157, 164, 1…
$ distance <dbl> 2500, 760, 1576, 636, 488, 1085, 1576, 719, 1400, 1065,…
$ hour <dbl> 20, 23, 23, 21, 20, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6, 6,…
$ minute <dbl> 38, 0, 44, 40, 48, 0, 10, 30, 20, 45, 59, 0, 59, 0, 0, …
$ time_hour <dttm> 2023-01-01 20:00:00, 2023-01-01 23:00:00, 2023-01-01 2…
Describe how these two functions are similar and how they differ:
The glimpse function displays the data transposed relative to the head and tail functions. The head and tail functions print a tibble with one row per flight and one column per variable, while glimpse prints one row per variable, followed by that variable's data type and its first few values. As a result, glimpse gives a clearer view of all 19 variables and their data types, whereas head and tail show only the columns that fit on the screen but are easier to read as a table.
In the end, all three functions show only a portion of the flights data: glimpse lists every variable name, while head and tail show only the first or last 6 rows.
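The comparison above can be tried on a smaller scale. This is only a sketch: the built-in mtcars data stands in for flights, and the glimpse call is skipped if dplyr is not installed.

```r
# mtcars ships with R; it stands in for flights in this sketch.
first_rows <- head(mtcars)         # first 6 rows by default
last_rows  <- tail(mtcars, n = 3)  # n sets how many rows to keep

# head() and tail() keep every column and only cut down the rows.
dim(first_rows)
dim(last_rows)

# glimpse() prints one line per variable: name, type, first few values.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(mtcars)
}
```

Note that mtcars is a plain data frame, so head and tail print all of its columns; a tibble such as flights prints only the columns that fit and lists the rest underneath.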