HDS 2.1-2.2

Loading R Packages

To begin, copy these two lines (without the number sign) to the console and run them, one at a time:

#install.packages("devtools")
#devtools::install_github("hellodata-science/hellodatascience")

This will install the hellodatascience package (be patient). You will also need to install the tidyverse and nycflights23 packages the ordinary way (again, in the console, without the #):

#install.packages("tidyverse")
#install.packages("nycflights23")
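If you are not sure whether a package is already installed, a check like the following can save a reinstall. This is only a sketch: `missing_packages` is a hypothetical helper name, not something from the book or from any of these packages, and the install line is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: returns the names in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "nycflights23"))
print(to_install)

# Uncomment to install whatever is missing (run in the console, not in the document):
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means both packages are already available and the install step can be skipped.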

Now you can load the packages you just installed. Insert the code for loading them here:

library(tidyverse)
library(hellodatascience)
library(nycflights23)
library(devtools)

Data Frames

Read Section 2.1 of Hello Data Science. In that section, you found that the planets data has 8 rows and 7 columns. Note that I didn’t type the number of rows or columns; R calculated them and inserted them into my text. This is important because if the data changes (someone adds a row for Pluto, for example), I don’t have to change my text.

The flights data in the nycflights23 package gives on-time data for all flights scheduled to depart from one of the three New York City airports in 2023. Type ?flights in the console to see the help page. Write a sentence (like the one in the previous paragraph) that describes the number of rows and columns in the flights dataset by mixing R code into your sentence:

The dataset flights has 435352 rows and 19 columns.
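The numbers in that sentence come from `nrow()` and `ncol()`. Assuming the source file is R Markdown or Quarto, the inline form is a backtick, the letter r, a space, the expression, and a closing backtick, for example `r nrow(flights)`. The sketch below builds the same kind of sentence in ordinary R code; `describe_dims` is a hypothetical helper name, and the built-in `mtcars` data frame stands in for flights so the example runs without extra packages.

```r
# Hypothetical helper: builds the sentence from whatever data frame it is given,
# so the counts are computed by R rather than typed by hand.
describe_dims <- function(df, name) {
  paste0("The dataset ", name, " has ", nrow(df), " rows and ", ncol(df), " columns.")
}

# mtcars ships with R; it is used here only as a stand-in data frame
sentence <- describe_dims(mtcars, "mtcars")
print(sentence)

# With nycflights23 loaded, the equivalent call would be:
# describe_dims(flights, "flights")
```

Because the counts are computed each time the code runs, the sentence stays correct if rows or columns are added to the data.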

Getting to Know Data

Read Section 2.2 of Hello Data Science. Print the first few rows of the flights data (you certainly don’t want to print them all!):

head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Use the glimpse function to get to know the variables in the flights data:

glimpse(flights)
Rows: 435,352
Columns: 19
$ year           <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 1, 18, 31, 33, 36, 503, 520, 524, 537, 547, 549, 551, 5…
$ sched_dep_time <int> 2038, 2300, 2344, 2140, 2048, 500, 510, 530, 520, 545, …
$ dep_delay      <dbl> 203, 78, 47, 173, 228, 3, 10, -6, 17, 2, -10, -9, -7, -…
$ arr_time       <int> 328, 228, 500, 238, 223, 808, 948, 645, 926, 845, 905, …
$ sched_arr_time <int> 3, 135, 426, 2352, 2252, 815, 949, 710, 818, 852, 901, …
$ arr_delay      <dbl> 205, 53, 34, 166, 211, -7, -1, -25, 68, -7, 4, -13, -14…
$ carrier        <chr> "UA", "DL", "B6", "B6", "UA", "AA", "B6", "AA", "UA", "…
$ flight         <int> 628, 393, 371, 1053, 219, 499, 996, 981, 206, 225, 800,…
$ tailnum        <chr> "N25201", "N830DN", "N807JB", "N265JB", "N17730", "N925…
$ origin         <chr> "EWR", "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "EWR",…
$ dest           <chr> "SMF", "ATL", "BQN", "CHS", "DTW", "MIA", "BQN", "ORD",…
$ air_time       <dbl> 367, 108, 190, 108, 80, 154, 192, 119, 258, 157, 164, 1…
$ distance       <dbl> 2500, 760, 1576, 636, 488, 1085, 1576, 719, 1400, 1065,…
$ hour           <dbl> 20, 23, 23, 21, 20, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6, 6,…
$ minute         <dbl> 38, 0, 44, 40, 48, 0, 10, 30, 20, 45, 59, 0, 59, 0, 0, …
$ time_hour      <dttm> 2023-01-01 20:00:00, 2023-01-01 23:00:00, 2023-01-01 2…

Describe how these two functions are similar and how they differ:

Both functions give you a sneak peek of the data so you can see what it looks like, but they differ in how they display it. The head function (like tail) prints a few complete rows, with the variables as columns; if there are too many variables to fit on the screen, the extra ones are only listed by name below the table. The glimpse function turns the display on its side: each variable gets its own line showing its name, its type, and its first several values, so every variable is visible. The glimpse function also reports the total number of rows and columns, which head does not.
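The comparison above can be checked on a smaller data frame. This is a minimal sketch that assumes the dplyr package (part of the tidyverse) is installed; the built-in `mtcars` data frame is a stand-in for flights, chosen only so the example runs quickly.

```r
library(dplyr)  # provides glimpse()

# head() returns a shorter data frame: the first 6 rows by default, all columns kept
first_rows <- head(mtcars)
print(dim(first_rows))

# glimpse() prints one line per variable (name, type, first values)
# and starts with the total row and column counts of the full data
glimpse(mtcars)
```

Note that head returns a new, smaller data frame that you can store and reuse, while glimpse is mainly used for what it prints.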