Reference: Ch. 1-4 of https://moderndive.netlify.app/ or
https://r4ds.had.co.nz for a deeper dive
Let’s load the required packages and see what data is included:
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
glimpse() gives a compact overview of the flights data frame, showing its dimensions and the type of each column:
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
?flights
## starting httpd help server ... done
The flights table contains more than one carrier (airline):
unique(flights$carrier)
## [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV"
## [16] "OO"
To keep only the rows for one (or more) carriers, use the filter() function.
For example, to keep only flights from United Airlines (UA):
ua_flights <- flights %>%
  filter(carrier == "UA")
unique(ua_flights$carrier)
## [1] "UA"
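To keep more than one carrier at once, the %in% operator is a common choice. The sketch below is only an illustration: toy_flights is a small made-up table standing in for flights, so that the snippet runs on its own.

```r
library(dplyr)

# Made-up stand-in for the flights table (illustration only)
toy_flights <- tibble(
  carrier   = c("UA", "AA", "DL", "UA", "B6"),
  dep_delay = c(2, 4, -6, NA, -1)
)

# Keep rows whose carrier matches any value in the vector
ua_or_aa <- toy_flights %>%
  filter(carrier %in% c("UA", "AA"))

unique(ua_or_aa$carrier)
```

The same filter() call should work on the real flights table by swapping toy_flights for flights.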
select() keeps only the named columns, and head() shows the first six rows:
flights %>%
select(year, month, day, dep_time, flight, carrier) %>%
head()
## # A tibble: 6 x 6
## year month day dep_time flight carrier
## <int> <int> <int> <int> <int> <chr>
## 1 2013 1 1 517 1545 UA
## 2 2013 1 1 533 1714 UA
## 3 2013 1 1 542 1141 AA
## 4 2013 1 1 544 725 B6
## 5 2013 1 1 554 461 DL
## 6 2013 1 1 554 1696 UA
mutate() creates new columns within the data frame. The make_date() function from the lubridate package makes it easy to build a date vector from its parts (make_datetime() does the same for datetimes). Here we use the integer columns year, month, and day to make a new column called date_flight:
flights %>%
select(year, month, day) %>%
mutate(date_flight = make_date(year, month, day)) %>%
head()
## # A tibble: 6 x 4
## year month day date_flight
## <int> <int> <int> <date>
## 1 2013 1 1 2013-01-01
## 2 2013 1 1 2013-01-01
## 3 2013 1 1 2013-01-01
## 4 2013 1 1 2013-01-01
## 5 2013 1 1 2013-01-01
## 6 2013 1 1 2013-01-01
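If the time of day matters too, make_datetime() from lubridate takes hour and minute as additional arguments. A minimal sketch, using a made-up two-row table (toy_times and sched_dep are hypothetical names) rather than the real flights data:

```r
library(dplyr)
library(lubridate)

# Made-up rows with the same column names as flights (illustration only)
toy_times <- tibble(
  year   = 2013L,
  month  = 1L,
  day    = 1L,
  hour   = c(5, 6),
  minute = c(15, 0)
)

# Build a datetime column from the five component columns (UTC by default)
toy_times <- toy_times %>%
  mutate(sched_dep = make_datetime(year, month, day, hour, minute))

toy_times
```

Note that make_datetime() uses the UTC time zone unless a tz argument is supplied.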
group_by() is a function from the dplyr package (part of the tidyverse) that groups a data frame by one or more key columns. The data itself is unchanged, but later operations are applied per group:
flights_grouped_carrier <- flights %>%
group_by(carrier)
flights_grouped_carrier
## # A tibble: 336,776 x 19
## # Groups: carrier [16]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
From there we can compute per-group summaries, such as the number of rows and the average dep_delay for each carrier:
flights_grouped_carrier %>%
  summarise(cnt = n(),
            avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
## # A tibble: 16 x 3
## carrier cnt avg_delay
## <chr> <int> <dbl>
## 1 F9 685 20.2
## 2 EV 54173 20.0
## 3 YV 601 19.0
## 4 FL 3260 18.7
## 5 WN 12275 17.7
## 6 9E 18460 16.7
## 7 B6 54635 13.0
## 8 VX 5162 12.9
## 9 OO 32 12.6
## 10 UA 58665 12.1
## 11 MQ 26397 10.6
## 12 DL 48110 9.26
## 13 AA 32729 8.59
## 14 AS 714 5.80
## 15 HA 342 4.90
## 16 US 20536 3.78
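The na.rm = TRUE argument matters here because dep_delay contains missing values. The sketch below uses a small made-up table (toy_flights is not the real data) to show what happens to a group mean with and without it:

```r
library(dplyr)

# Made-up delays; one UA value is missing (illustration only)
toy_flights <- tibble(
  carrier   = c("UA", "UA", "AA", "AA", "AA"),
  dep_delay = c(10, NA, 0, 6, 3)
)

delay_summary <- toy_flights %>%
  group_by(carrier) %>%
  summarise(
    cnt         = n(),                            # rows per carrier, NAs included
    avg_with_na = mean(dep_delay),                # NA if any value in the group is NA
    avg_delay   = mean(dep_delay, na.rm = TRUE)   # NAs dropped before averaging
  ) %>%
  arrange(desc(avg_delay))

delay_summary
```

Note that n() still counts the row with the missing delay, so cnt is the number of rows per group, not the number of values that went into the average.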
https://r4ds.had.co.nz/data-visualisation.html
Begin a plot with the ggplot() function
Include the dataset
Choose how to map the data using the aes() aesthetics function
Add layers, like a geom layer, which generates the plot and has additional settings
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
To create a scatter plot from the ua_flights dataset with dep_delay on the x-axis and arr_delay on the y-axis, we would use:
ggplot(data = ua_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.2)
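The same template extends to further aesthetics and layers. As a rough sketch, here is a scatter plot that also maps a third column to colour and adds axis labels; toy_delays is a made-up table so the snippet runs on its own, and the label text is our own choice:

```r
library(ggplot2)

# Made-up delays in minutes (illustration only)
toy_delays <- data.frame(
  dep_delay = c(-2, 5, 30, 60, 120),
  arr_delay = c(-10, 1, 25, 70, 110),
  origin    = c("EWR", "LGA", "JFK", "EWR", "JFK")
)

# colour = origin inside aes() gives each origin airport its own colour
p <- ggplot(data = toy_delays,
            mapping = aes(x = dep_delay, y = arr_delay, colour = origin)) +
  geom_point(alpha = 0.6) +
  labs(x = "Departure delay (min)", y = "Arrival delay (min)")

p
```

Anything placed inside aes() is mapped from the data, while settings such as alpha = 0.6 outside aes() apply to every point.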
Use | and & to conditionally filter a dataset:
early_january_weather <- weather %>%
filter(origin == "LGA" & (month == 1 & day <= 15))
head(early_january_weather)
## # A tibble: 6 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LGA 2013 1 1 1 39.9 26.1 57.3 260 13.8 23.0
## 2 LGA 2013 1 1 2 41 26.1 55.0 260 17.3 25.3
## 3 LGA 2013 1 1 3 41 26.1 55.0 260 16.1 24.2
## 4 LGA 2013 1 1 4 41 26.1 55.0 260 17.3 25.3
## 5 LGA 2013 1 1 5 39.9 25.0 54.8 250 15.0 21.9
## 6 LGA 2013 1 1 6 39.9 25.0 54.8 260 16.1 23.0
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
ggplot(data = early_january_weather,
       mapping = aes(x = time_hour, y = temp)) +
  geom_line() +
  ggtitle("Weather in early January") +
  ylab("Temperature (F)") +
  xlab("Datetime")
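The filter that built early_january_weather only uses &. A small sketch of | (or) is given below, on a made-up table (toy_weather is not the real weather data). Since & binds more tightly than | in R, parentheses change the result:

```r
library(dplyr)

# Made-up rows with the same column names as weather (illustration only)
toy_weather <- tibble(
  origin = c("LGA", "JFK", "EWR", "LGA", "JFK"),
  month  = c(1L, 1L, 2L, 3L, 3L)
)

# (LGA or JFK) and month <= 2: keeps 2 rows
with_parens <- toy_weather %>%
  filter((origin == "LGA" | origin == "JFK") & month <= 2)

# Read as: LGA, or (JFK and month <= 2): keeps 3 rows, including LGA in March
without_parens <- toy_weather %>%
  filter(origin == "LGA" | origin == "JFK" & month <= 2)

nrow(with_parens)
nrow(without_parens)
```

When mixing the two operators, writing the parentheses explicitly makes the intended grouping clear.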
Next lecture we’ll cover tsibble objects (we’re working with tibble objects today) and the autoplot() function which simplifies things:
library(tsibble)
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(feasts)
## Loading required package: fabletools
early_january_weather %>%
tsibble(index = time_hour) %>%
autoplot(temp)
str(early_january_weather)
## tibble [358 x 15] (S3: tbl_df/tbl/data.frame)
## $ origin : chr [1:358] "LGA" "LGA" "LGA" "LGA" ...
## $ year : int [1:358] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:358] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:358] 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int [1:358] 1 2 3 4 5 6 7 8 9 10 ...
## $ temp : num [1:358] 39.9 41 41 41 39.9 ...
## $ dewp : num [1:358] 26.1 26.1 26.1 26.1 25 ...
## $ humid : num [1:358] 57.3 55 55 55 54.8 ...
## $ wind_dir : num [1:358] 260 260 260 260 250 260 250 250 260 270 ...
## $ wind_speed: num [1:358] 13.8 17.3 16.1 17.3 15 ...
## $ wind_gust : num [1:358] 23 25.3 24.2 25.3 21.9 ...
## $ precip : num [1:358] 0 0 0 0 0 0 0 0 0 0 ...
## $ pressure : num [1:358] 1012 1012 1012 1012 1011 ...
## $ visib : num [1:358] 10 10 10 10 10 10 10 10 10 10 ...
## $ time_hour : POSIXct[1:358], format: "2013-01-01 01:00:00" "2013-01-01 02:00:00" ...
### early_january_weather is a tibble.
february <- weather %>%
filter(month == 2)
ggplot(february) +
geom_line(aes(x = time_hour, y = wind_speed))
### Somewhere between February 11 and February 15 there is an outlier: a very high wind speed at one particular time. For the rest of the month, the wind speed appears to stay in roughly the same range.
JFK <- weather %>%
filter(origin == "JFK")
JFK_Monthly_Avg <- JFK %>%
  group_by(month) %>%
  summarize(Monthly_AVG = mean(wind_speed, na.rm = TRUE)) %>%
  arrange(desc(Monthly_AVG))
### March has the highest monthly average wind speed at JFK, with an average of 13.997021 miles per hour.
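Instead of reading the top row off a sorted table, the windiest month can also be pulled out directly with slice_max() from dplyr. A minimal sketch on made-up numbers (toy_weather, avg_by_month and avg_wind are hypothetical names, and the values are not the real JFK data):

```r
library(dplyr)

# Made-up wind speeds for three months (illustration only)
toy_weather <- tibble(
  month      = c(1L, 1L, 2L, 2L, 3L, 3L),
  wind_speed = c(10, 12, 8, NA, 15, 13)
)

avg_by_month <- toy_weather %>%
  group_by(month) %>%
  summarise(avg_wind = mean(wind_speed, na.rm = TRUE))

# Keep only the row with the largest average
windiest <- avg_by_month %>%
  slice_max(avg_wind, n = 1)

windiest
```

By default slice_max() keeps ties, so it can return more than one row if two months share the same maximum.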
In-Class Discussion
### Last class we discussed the different factors that can change the price of a car being sold. The factors discussed were: the age of the car, its accident history, its mileage, its fuel economy (miles per gallon), current market conditions, additional features, and whether the car runs on gas or is electric.