#install.packages("devtools")
#devtools::install_github("hellodata-science/hellodatascience")
HDS 2.1-2.2
Loading R Packages
To begin, copy these two lines (without the number sign) to the console and run them, one at a time:
install.packages("devtools")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)
library(devtools)
Loading required package: usethis
devtools::install_github("hellodata-science/hellodatascience")
Skipping install of 'hellodatascience' from a github remote, the SHA1 (221969b7) has not changed since last install.
  Use `force = TRUE` to force installation
This will install the hellodatascience package (be patient). You will also need to install the tidyverse and nycflights23 packages the ordinary way (again, in the console, without the #):
install.packages("tidyverse")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)
install.packages("nycflights23")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)
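The "Skipping install" message above shows that reinstalling a package that is already present is unnecessary. A minimal sketch of checking first, assuming a hypothetical helper named `needs_install` (it is not part of the worksheet or of any package) built on base R's `requireNamespace()`:

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet installed.
# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
needs_install <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_pkgs <- needs_install(c("tidyverse", "nycflights23"))
print(missing_pkgs)

# To install only what is missing, run this in the console:
# install.packages(missing_pkgs)
```

The `install.packages()` call is left commented out so that the sketch only reports what is missing and never installs anything on its own.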
Now you can load the packages you just installed. Insert the code for loading them here:
library(devtools)
library(tidyverse)
library(nycflights23)
Data Frames
Read Section 2.1 of Hello Data Science. In that section, you found that the planets data has `r nrow(planets)` rows and `r ncol(planets)` columns. Note that I didn’t type the number of rows or columns; R calculated them and inserted them into my text. This is important because if the data changes (someone adds a row for Pluto, for example), I don’t have to change my text.
The flights data in the nycflights23 package gives on-time data for all flights scheduled to depart from one of the three New York City airports in 2023. Type ?flights in the console to see the help page. Write a sentence (like the one in the previous paragraph) that describes the number of rows and columns in the flights dataset by mixing R code into your sentence:
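One way to try out such a sentence before placing it in the text is to build it in the console. The sketch below uses the built-in `mtcars` data frame as a stand-in (an assumption, so that it runs without nycflights23 installed); the same two calls, `nrow()` and `ncol()`, apply to `flights`.

```r
# Stand-in data frame: mtcars ships with base R, so no packages are needed.
df <- mtcars

# nrow() and ncol() are computed by R, not typed by hand.
sentence <- paste0("The data has ", nrow(df), " rows and ", ncol(df), " columns.")
print(sentence)
```

In the R Markdown text itself, each computed value is written as inline code, for example `r nrow(flights)`, and R replaces it with the number when the document is rendered.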
Getting to Know Data
Read Section 2.2 of Hello Data Science. Print the first few rows of the flights data (you certainly don’t want to print them all!):
library(tidyverse)
library(nycflights13)
Attaching package: 'nycflights13'
The following objects are masked from 'package:nycflights23':
airlines, airports, flights, planes, weather
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
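The masking message explains why the rows printed here are from 2013 rather than 2023: nycflights13 was attached after nycflights23, so the bare name `flights` now resolves to the nycflights13 version. The `::` operator names the package explicitly, so `nycflights23::flights` should still reach the 2023 data. A minimal base R sketch of the same mechanism, with a hypothetical local function standing in for the masking package:

```r
# Hypothetical stand-in: a local `nrow` that masks base R's nrow(),
# the way nycflights13's `flights` masks nycflights23's `flights`.
nrow <- function(x) "masked version"

masked_result <- nrow(mtcars)          # the bare name finds the most recent definition
explicit_result <- base::nrow(mtcars)  # `::` reaches the original in package base

print(masked_result)
print(explicit_result)

# Remove the local definition so the bare name works normally again.
rm(nrow)
```

Loading only the package that is actually needed avoids the ambiguity in the first place.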
Use the glimpse function to get to know the variables in the flights data:
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
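Describe how these two functions are similar and how they differ:
These two functions are similar because both give a quick preview of the data without printing every row, and both show the type of each variable. They differ in layout. head() prints the first six rows as a table, so only the columns that fit on the screen are shown (8 of the 19 here), and the rest are listed by name. glimpse() reports the total number of rows and columns, then lists every variable on its own line with its type and its first several values.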
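The contrast between a row-wise preview and a variable-wise overview can be tried without any extra packages. The sketch below is an illustration under two assumptions: the built-in `mtcars` data frame stands in for `flights`, and base R's `str()` stands in for `glimpse()`, which prints a similar one-line-per-variable summary.

```r
# Stand-in data frame from base R (32 rows, 11 columns).
df <- mtcars

# head() returns a smaller data frame: the first 6 rows by default.
first_rows <- head(df)
print(dim(first_rows))

# The n argument controls how many rows are kept.
first_three <- head(df, n = 3)
print(dim(first_three))

# str() does not subset; it summarizes every variable on its own line.
str(df)
```

Because head() returns a data frame, its result can be stored and passed to other functions, whereas the overview functions are mainly used for what they print.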