HDS 2.1-2.2
Loading R Packages
To begin, copy these two lines (without the number sign) to the console and run them, one at a time:
#install.packages("devtools")
#devtools::install_github("hellodata-science/hellodatascience")
This will install the hellodatascience package (be patient). You will also need to install the tidyverse and nycflights23 packages the ordinary way (again, in the console, without the #):
#install.packages("tidyverse")
#install.packages("nycflights23")
Now you can load the packages you just installed. Insert the code for loading them here:
library(devtools)
library(hellodatascience)
library(tidyverse)
library(nycflights23)
Data Frames
Read Section 2.1 of Hello Data Science. In that section, you found that the planets data has 8 rows and 7 columns. Note that I didn’t type the number of rows or columns; R calculated them and inserted them into my text. This is important because if the data changes (someone adds a row for Pluto, for example), I don’t have to change my text.
The flights data in the nycflights23 package gives on-time data for all flights scheduled to depart from one of the three New York City airports in 2023. Type ?flights in the console to see the help page. Write a sentence (like the one in the previous paragraph) that describes the number of rows and columns in the flights dataset by mixing R code into your sentence:
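One way to write such a sentence, as a sketch: assuming this is an R Markdown or Quarto document, inline code is written as a backtick, the letter r, an expression, and a closing backtick, and R replaces it with the result when the document is rendered. The chunk below builds the same sentence as a string so it can be checked in the console first. The helper name dims_sentence is made up for illustration and is not part of any package.

```r
library(nycflights23)

# Hypothetical helper (not from any package): describe a data frame's size.
dims_sentence <- function(data, name) {
  paste0(
    "The ", name, " data has ",
    format(nrow(data), big.mark = ","), " rows and ",
    ncol(data), " columns."
  )
}

dims_sentence(flights, "flights")

# The inline equivalent, typed directly into the text of the document:
# The flights data has `r nrow(flights)` rows and `r ncol(flights)` columns.
```

Because the counts come from nrow() and ncol(), the sentence stays correct if the data changes.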
Getting to Know Data
Read Section 2.2 of Hello Data Science. Print the first few rows of the flights data (you certainly don’t want to print them all!):
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Use the glimpse function to get to know the variables in the flights data:
glimpse(flights)
Rows: 435,352
Columns: 19
$ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 1, 18, 31, 33, 36, 503, 520, 524, 537, 547, 549, 551, 5…
$ sched_dep_time <int> 2038, 2300, 2344, 2140, 2048, 500, 510, 530, 520, 545, …
$ dep_delay <dbl> 203, 78, 47, 173, 228, 3, 10, -6, 17, 2, -10, -9, -7, -…
$ arr_time <int> 328, 228, 500, 238, 223, 808, 948, 645, 926, 845, 905, …
$ sched_arr_time <int> 3, 135, 426, 2352, 2252, 815, 949, 710, 818, 852, 901, …
$ arr_delay <dbl> 205, 53, 34, 166, 211, -7, -1, -25, 68, -7, 4, -13, -14…
$ carrier <chr> "UA", "DL", "B6", "B6", "UA", "AA", "B6", "AA", "UA", "…
$ flight <int> 628, 393, 371, 1053, 219, 499, 996, 981, 206, 225, 800,…
$ tailnum <chr> "N25201", "N830DN", "N807JB", "N265JB", "N17730", "N925…
$ origin <chr> "EWR", "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "EWR",…
$ dest <chr> "SMF", "ATL", "BQN", "CHS", "DTW", "MIA", "BQN", "ORD",…
$ air_time <dbl> 367, 108, 190, 108, 80, 154, 192, 119, 258, 157, 164, 1…
$ distance <dbl> 2500, 760, 1576, 636, 488, 1085, 1576, 719, 1400, 1065,…
$ hour <dbl> 20, 23, 23, 21, 20, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6, 6,…
$ minute <dbl> 38, 0, 44, 40, 48, 0, 10, 30, 20, 45, 59, 0, 59, 0, 0, …
$ time_hour <dttm> 2023-01-01 20:00:00, 2023-01-01 23:00:00, 2023-01-01 2…
Describe how these two functions are similar and how they differ:
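They both give you a quick look at the data without printing all of it. head() shows the first few rows (six by default) in the usual layout, with one column per variable. glimpse() switches the rows with the columns: each variable gets its own line showing its name, its type, and its first several values, so you can see every variable at once.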
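To see the difference side by side, here is a small sketch. head() takes an n argument for the number of rows to keep, while glimpse() lists every variable on its own line. The object name first_three is just for illustration.

```r
library(tidyverse)
library(nycflights23)

# head(): same layout as the full table, but only the first n rows
first_three <- head(flights, n = 3)
dim(first_three)   # number of rows, then number of columns

# glimpse(): one line per variable (name, type, first values)
glimpse(first_three)
```

Running glimpse() on the three-row result makes the flip easy to see: the three rows become three values on each variable's line.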