In this module we will explore some of the functionalities of the tidyverse. Tidyverse is a collection of packages that uses the same approach to data manipulation. The central packages in the tidyverse are:
ggplot2 3.3.2
tibble 3.0.3
tidyr 1.1.1
readr 1.3.1 purrr 0.3.4 forcats 0.5.0 dplyr 1.0.1
we will also use stringr, but you may need to install that separately.
Two of the most widely used packages are ggplot2 (graphics) and dplyr (data manipulation). In this module, we will work with these packages and a few others to manipulate data, including joins of datasets.
To receive full credit for module completion, please complete the exercises and answer the questions in this notebook and upload your Rmd and html to the Optional Modules slot in Assignments. The tutorial pulls heavily from Chapter 12 and Chapter 13 (Relational Data) of R for Data Science which can be found here: https://r4ds.had.co.nz/relational-data.html.
Before you get started, please install the package “tidyverse” in the console if you have not already.
When we load the tidyverse, the eight packages listed above are loaded so we don’t need to load them individually. Please go ahead and load the tidyverse with the chunk below. We will also work on data from the nycflights13 package which you can read about here: https://www.rdocumentation.org/packages/nycflights13/versions/1.0.1.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.0.3
As we have learned throughout the semester, there are several key functions (also called verbs) in the tidyverse that we can use in the tidyverse.
select() used to select columns in a dataframe
filter() allows us to select rows of data that we are interested in
group_by() allows us to group rows together by some criteria
summarize() summarizes data, often times once we have grouped rows together or filtered rows.
Here are some new functions that we will use in this tutorial
mutate() Can make new columns in our data
arrange() we use this to arrange the rows
pivot_longer() restructures data from wide to a long (or longer format)
pivot_wider() restructures data from long to a wide (or wider format)
separate() separates one column into multiple columns
string_replace() a function in the stringR package—a way to replace characters in strings
The tidyverse uses what is called “tibbles” instead of dataframes. These are pretty similar, although tibbles are an updated structure to more smoothly manipulate data. Because we brought in the data data with read_csv(), the data will be structured in a tibble. Please read more about tibbles here: https://r4ds.had.co.nz/tibbles.html
We are going to use a set of data that is a part of the tidyverse. The data has four variables: Country, year, TB cases, and population. Run the chunk below and see how the five tables are structured differently.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a #This table shows TB cases
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b #This table shows population
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
Question 1 Discuss (compare/contrast) the structure of each of the tables. (1 point)
The first difference I notice are the sizes of the tables: 6x4, 12x4, 6x3, 3x3 and 3x3. Each of the tables contain country and year, otherwise each have different information.
We can use the mutate() function in dplyr to make new variables. Below I make a new variable called log_population which is the log of population. I will add this variable to table1. When you run the chunk, the variable log_population should now be included in table1.
table1 %>%
mutate(log_population = log(population))
## # A tibble: 6 x 5
## country year cases population log_population
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 16.8
## 2 Afghanistan 2000 2666 20595360 16.8
## 3 Brazil 1999 37737 172006362 19.0
## 4 Brazil 2000 80488 174504898 19.0
## 5 China 1999 212258 1272915272 21.0
## 6 China 2000 213766 1280428583 21.0
Question 2 In the chunk below, use the function mutate to calculate the number of cases per 100,000 in population. Add this new variable (call it rate) to table1 (1 point)
table1 %>%
mutate(rate = cases/100,000)
## # A tibble: 6 x 6
## country year cases population rate `0`
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071 7.45 0
## 2 Afghanistan 2000 2666 20595360 26.7 0
## 3 Brazil 1999 37737 172006362 377. 0
## 4 Brazil 2000 80488 174504898 805. 0
## 5 China 1999 212258 1272915272 2123. 0
## 6 China 2000 213766 1280428583 2138. 0
Let’s now work with table4a. Please run the chunk below so you can see it. This table has three variables, country, 1999 and 2000. Let’s say we wanted to make the data long (it’s currently wide). A long tibble would have two columns, one for country and one for year. In class earlier in the semester, we used the reshape2 package (melt() function) to restructure the data. We could also do this with dplyr in the tidyverse.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
We could use the function pivot_longer(). It is a bit complicated because the two variable names are numberic (not ideal). To work with numeric variable names we can use the mark (`) to surround the number. Some folks call it a tick.
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Question 3 Restructure table 4b into a longer format using the function pivot_longer(). (1 point)
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
Question 4 Restructure table 2 into a wider format using the function pivot_wider(). (1 point)
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>%
pivot_wider(names_from = "year", values_from = "count")
## # A tibble: 6 x 4
## country type `1999` `2000`
## <chr> <chr> <int> <int>
## 1 Afghanistan cases 745 2666
## 2 Afghanistan population 19987071 20595360
## 3 Brazil cases 37737 80488
## 4 Brazil population 172006362 174504898
## 5 China cases 212258 213766
## 6 China population 1272915272 1280428583
Let’s check out table3 again. The variable rate is pretty darn ugly. Please don’t go out into the world and format your data like that. = ). Luckily, there is a function we can use called separate() that allows us to split us a variable into multiple variables based on a separator.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
This chunk employs the separate() function and separates by the forward slash. Please run the chunk.
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
For practice, we will pull in another dataset in a csv file. Please download country_code.csv from Sakai. During the semester, we have used the function read.csv() to read in data in a csv file. The package readR (included in the tidyverse) reads csv files with the function read_csv(). There isn’t a lot of difference between the two functions. The major difference is that read_csv() reads in the data as a tibble, NOT a dataframe. You can read more about read_csv and the differences here: https://medium.com/r-tutorials/r-functions-daily-read-csv-3c418c25cba4.
Run the chunk below to pull in the data. We will call the data table5. Then we will look at the data.
table5<-read_csv("country_code.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## country = col_character(),
## year = col_double(),
## cases = col_double()
## )
table5
## # A tibble: 6 x 3
## country year cases
## <chr> <dbl> <dbl>
## 1 Afghanistan (AFG) 1999 745
## 2 Afghanistan (AFG) 2000 2666
## 3 Brazil (BRA) 1999 37737
## 4 Brazil (BRA) 2000 80488
## 5 China (CHN) 1999 212258
## 6 China (CHN) 2000 213766
Question 5 In the chunk below, please use the separate() function to separate function the name of the country and the country code. Please call the code variable, code. Please use a pipe as we did above. (2 points)
table5 %>%
separate(country, into = c("country", "code"), sep = " ")
## # A tibble: 6 x 4
## country code year cases
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan (AFG) 1999 745
## 2 Afghanistan (AFG) 2000 2666
## 3 Brazil (BRA) 1999 37737
## 4 Brazil (BRA) 2000 80488
## 5 China (CHN) 1999 212258
## 6 China (CHN) 2000 213766
In R we may need to work with character data. Often times character data are messy and may need some cleaning. One way to do this is through the use of the package stringr which is part of the tidyverse. There are two frequently used functions: str_replace() and str_replace_all(). str_replace() will replace the first occurrence in a string and str_replace_all() will replace all occurrences in a string. Let’s play around with this a bit.
I have made the vector colors of five colors. For the fun of it, let’s say I wanted to replace the first e with a &. The arguments of string_replace are string_replace(string, pattern, replacement)
colors<-c("blue", "red", "yellow", "Green", "purple", "red-green")
str_replace(colors, "e", "&")
## [1] "blu&" "r&d" "y&llow" "Gr&en" "purpl&" "r&d-green"
colors
## [1] "blue" "red" "yellow" "Green" "purple" "red-green"
What if we used str_replace_all()? Please run the chunk below.
colors<-c("blue", "red", "yellow", "Green", "purple", "red-green")
str_replace_all(colors, "e", "&")
## [1] "blu&" "r&d" "y&llow" "Gr&&n" "purpl&" "r&d-gr&&n"
colors
## [1] "blue" "red" "yellow" "Green" "purple" "red-green"
What if we wanted to replace all vowel occurrences with a dash?
colors<-c("blue", "red", "yellow", "Green", "purple", "red-green")
str_replace_all(colors, "[aeiou]", "-")
## [1] "bl--" "r-d" "y-ll-w" "Gr--n" "p-rpl-" "r-d-gr--n"
colors
## [1] "blue" "red" "yellow" "Green" "purple" "red-green"
we can also replace all capital letters with lower case letters in the following vector of colors using the function str_to_lower(). Please check out this website: https://stringr.tidyverse.org/reference/case.html
colors<-c("Blue", "Red", "Yellow", "Green", "Purple", "Red-Green")
str_to_lower(colors)
## [1] "blue" "red" "yellow" "green" "purple" "red-green"
colors
## [1] "Blue" "Red" "Yellow" "Green" "Purple" "Red-Green"
Question 6 In the chunk below, remove the () from the code variable in table5 and also make the country codes all lower case (2 points).
table5a<-table5 %>%
separate(country, into = c("country", "code"), sep = " ")
str_replace_all(table5a$code, "[()]", "")
## [1] "AFG" "AFG" "BRA" "BRA" "CHN" "CHN"
str_to_lower(table5a$country)
## [1] "afghanistan" "afghanistan" "brazil" "brazil" "china"
## [6] "china"
table5a
## # A tibble: 6 x 4
## country code year cases
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan (AFG) 1999 745
## 2 Afghanistan (AFG) 2000 2666
## 3 Brazil (BRA) 1999 37737
## 4 Brazil (BRA) 2000 80488
## 5 China (CHN) 1999 212258
## 6 China (CHN) 2000 213766
In data science, we often curate data from a variety of sources and need to join or merge the data into one workable dataset for analysis. This is part of what we call data management or data wrangling. There are a variety of functions we can use to merge two datasets together. In the tidyverse, all of these functions (or, as they call them, verbs) are joins. Please check out this website (the definitions below are direct quotations from this website: https://dplyr.tidyverse.org/reference/join.html.
All of the join functions have a similar syntax: join_type(firstTable, secondTable, by=columnTojoinOn).
inner_join() return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
left_join() return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join() return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join() return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
semi_join() return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
anti_join() return all rows from x where there are not matching values in y, keeping just columns from x.
Please read Chapter 13 on Relational Data: https://r4ds.had.co.nz/relational-data.html. We are going to look at three tables of data from the nycflights13 package that we loaded earlier in the script.
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
airports
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Yo~
## 2 06A Moton Field Municipal A~ 32.5 -85.7 264 -6 A America/Chicago
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Yo~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Yo~
## 6 0A9 Elizabethton Municipal ~ 36.4 -82.2 1593 -5 A America/New_Yo~
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/New_Yo~
## 8 0G7 Finger Lakes Regional A~ 42.9 -76.8 492 -5 A America/New_Yo~
## 9 0P2 Shoestring Aviation Air~ 39.8 -76.6 1000 -5 U America/New_Yo~
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_An~
## # ... with 1,448 more rows
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing m~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 2 N102UW 1998 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 3 N103US 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 4 N104UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 5 N10575 2002 Fixed wing m~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 6 N105UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 7 N107US 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 8 N108UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 9 N109UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 10 N110UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## # ... with 3,312 more rows
weather
## # A tibble: 26,115 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
We can connect data from different data frames (or tibbles) with what are called ‘keys’. Keys are used to identify an observation uniquely. We can check to see if a key has more than one occurrence (and if it does, then it isn’t a key.) Here I am counting the tailnum and filtering for any row that has an n greater than one.
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
Excellent! planes$tailnum is a key.
It is important to note that not all tables have keys! And sometimes a key is made up of a combination of multiple variables.
Question 7 Is the combination of year, month, day, hour a key in the weather tibble? Why or why not? Write your code to answer this question below. (2 points)
weather %>%
count(year, month, day, hour) %>%
filter(n > 1)
## # A tibble: 8,705 x 5
## year month day hour n
## <int> <int> <int> <int> <int>
## 1 2013 1 1 1 3
## 2 2013 1 1 2 3
## 3 2013 1 1 3 3
## 4 2013 1 1 4 3
## 5 2013 1 1 5 3
## 6 2013 1 1 6 3
## 7 2013 1 1 7 3
## 8 2013 1 1 8 3
## 9 2013 1 1 9 3
## 10 2013 1 1 10 3
## # ... with 8,695 more rows
Keys need to be present in the datasets that you are joining.
Okay, let’s select some data to work with. We will use the select() function and a pipe and make a new tibble named flights2.
flights2 <- flights %>%
select(year, month, day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows
We will use a left_join() to join the flights2 and planes tibbles by tailnum.
flights3 <- flights2 %>%
left_join(planes, by = "tailnum")
flights3
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixe~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixe~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixe~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
Question 8 In the chunk above, what variables were added to the flights2 join? What is the difference from between year.x and year.y? (2 points) year.y and type were added to the flights2 left join by tailnum from planes. Flights and planes which were joined have common variables such as years and months, but they correlate to different things so that’s why it was added by the tail number. The difference between year.x and year.y is that we matched the variables in the first table to a different variable in the second table. In other words, the tail number matches the data for both tables, but year.x and year.y ensure the years aren’t mixed up from differing data sets.
Question 9 In the chunk below, rename the variables year.x and year.y. Rename with names that more clearly define these variables. You can learn more about the rename() function here: https://dplyr.tidyverse.org/reference/rename.html(2 points)
flights3Clean <- flights3 %>%
rename(`Year of plane travel information`= year.x,
`Year plane had issue`=year.y)
names(flights3Clean)
## [1] "Year of plane travel information" "month"
## [3] "day" "hour"
## [5] "origin" "dest"
## [7] "tailnum" "carrier"
## [9] "Year plane had issue" "type"
## [11] "manufacturer" "model"
## [13] "engines" "seats"
## [15] "speed" "engine"
flights3Clean
## # A tibble: 336,776 x 16
## `Year of plane ~ month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows, and 8 more variables: `Year plane had
## # issue` <int>, type <chr>, manufacturer <chr>, model <chr>, engines <int>,
## # seats <int>, speed <int>, engine <chr>
Question 10 (from 13.4.6 in R for Data Science). Calculate the average delay time by destination for each airport and then join this with the airport tibble. Sort the tibble by in descending order by average weight time (5 points). %>% count(faa, sort = TRUE) airports2<-airports %>%
airports2
airports%>%
semi_join(flights, c("faa" = "dest"))
## # A tibble: 101 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ABQ Albuquerque International~ 35.0 -107. 5355 -7 A America/Denv~
## 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A America/New_~
## 3 ALB Albany Intl 42.7 -73.8 285 -5 A America/New_~
## 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A America/Anch~
## 5 ATL Hartsfield Jackson Atlant~ 33.6 -84.4 1026 -5 A America/New_~
## 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A America/Chic~
## 7 AVL Asheville Regional Airport 35.4 -82.5 2165 -5 A America/New_~
## 8 BDL Bradley Intl 41.9 -72.7 173 -5 A America/New_~
## 9 BGR Bangor Intl 44.8 -68.8 192 -5 A America/New_~
## 10 BHM Birmingham Intl 33.6 -86.8 644 -6 A America/Chic~
## # ... with 91 more rows
delay<-flights %>%
group_by(arr_delay, dest) %>%
count(arr_delay, sort = FALSE)
delay
## # A tibble: 19,439 x 3
## # Groups: arr_delay, dest [19,439]
## arr_delay dest n
## <dbl> <chr> <int>
## 1 -86 SFO 1
## 2 -79 SFO 1
## 3 -75 LAX 1
## 4 -75 SEA 1
## 5 -74 SEA 1
## 6 -73 SFO 1
## 7 -71 LAX 1
## 8 -71 PDX 1
## 9 -71 SFO 1
## 10 -70 HNL 2
## # ... with 19,429 more rows
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
Congratulations! Once you have completed the module, please upload your Rmd and html to the optional slot on Sakai. You have finished the tidyverse data management module. I hope you improved your R skills!
Reference: Hadley Wickham and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st. ed.). O’Reilly Media, Inc.