Data Management Module

In this module we will explore some of the functionalities of the tidyverse. Tidyverse is a collection of packages that uses the same approach to data manipulation. The central packages in the tidyverse are:

ggplot2 3.3.2
tibble 3.0.3
tidyr 1.1.1
readr 1.3.1 purrr 0.3.4 forcats 0.5.0 dplyr 1.0.1

we will also use stringr, but you may need to install that separately.

Two of the most widely used packages are ggplot2 (graphics) and dplyr (data manipulation). In this module, we will work with these packages and a few others to manipulate data, including joins of datasets.

To receive full credit for module completion, please complete the exercises and answer the questions in this notebook and upload your Rmd and html to the Optional Modules slot in Assignments. The tutorial pulls heavily from Chapter 12 and Chapter 13 (Relational Data) of R for Data Science which can be found here: https://r4ds.had.co.nz/relational-data.html.

Before you get started, please install the package “tidyverse” in the console if you have not already.

When we load the tidyverse, the eight packages listed above are loaded so we don’t need to load them individually. Please go ahead and load the tidyverse with the chunk below. We will also work on data from the nycflights13 package which you can read about here: https://www.rdocumentation.org/packages/nycflights13/versions/1.0.1.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(stringr)
library(nycflights13)

## Warning: package 'nycflights13' was built under R version 4.0.3

As we have learned throughout the semester, there are several key functions (also called verbs) in the tidyverse that we can use in the tidyverse.

select() used to select columns in a dataframe

filter() allows us to select rows of data that we are interested in

group_by() allows us to group rows together by some criteria

summarize() summarizes data, often times once we have grouped rows together or filtered rows.

Here are some new functions that we will use in this tutorial

mutate() Can make new columns in our data

arrange() we use this to arrange the rows

pivot_longer() restructures data from wide to a long (or longer format)

pivot_wider() restructures data from long to a wide (or wider format)

separate() separates one column into multiple columns

string_replace() a function in the stringR package—a way to replace characters in strings

Structure of Tibbles

The tidyverse uses what is called “tibbles” instead of dataframes. These are pretty similar, although tibbles are an updated structure to more smoothly manipulate data. Because we brought in the data data with read_csv(), the data will be structured in a tibble. Please read more about tibbles here: https://r4ds.had.co.nz/tibbles.html

We are going to use a set of data that is a part of the tidyverse. The data has four variables: Country, year, TB cases, and population. Run the chunk below and see how the five tables are structured differently.

table1

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

table4a #This table shows TB cases

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

table4b #This table shows population

## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

Question 1 Discuss (compare/contrast) the structure of each of the tables. (1 point)

The first difference I notice are the sizes of the tables: 6x4, 12x4, 6x3, 3x3 and 3x3. Each of the tables contain country and year, otherwise each have different information.

The mutate() function

We can use the mutate() function in dplyr to make new variables. Below I make a new variable called log_population which is the log of population. I will add this variable to table1. When you run the chunk, the variable log_population should now be included in table1.

table1 %>% 
  mutate(log_population = log(population))

## # A tibble: 6 x 5
##   country      year  cases population log_population
##   <chr>       <int>  <int>      <int>          <dbl>
## 1 Afghanistan  1999    745   19987071           16.8
## 2 Afghanistan  2000   2666   20595360           16.8
## 3 Brazil       1999  37737  172006362           19.0
## 4 Brazil       2000  80488  174504898           19.0
## 5 China        1999 212258 1272915272           21.0
## 6 China        2000 213766 1280428583           21.0

Question 2 In the chunk below, use the function mutate to calculate the number of cases per 100,000 in population. Add this new variable (call it rate) to table1 (1 point)

table1 %>%
  mutate(rate = cases/100,000)

## # A tibble: 6 x 6
##   country      year  cases population    rate   `0`
##   <chr>       <int>  <int>      <int>   <dbl> <dbl>
## 1 Afghanistan  1999    745   19987071    7.45     0
## 2 Afghanistan  2000   2666   20595360   26.7      0
## 3 Brazil       1999  37737  172006362  377.       0
## 4 Brazil       2000  80488  174504898  805.       0
## 5 China        1999 212258 1272915272 2123.       0
## 6 China        2000 213766 1280428583 2138.       0

Making data longer

Let’s now work with table4a. Please run the chunk below so you can see it. This table has three variables, country, 1999 and 2000. Let’s say we wanted to make the data long (it’s currently wide). A long tibble would have two columns, one for country and one for year. In class earlier in the semester, we used the reshape2 package (melt() function) to restructure the data. We could also do this with dplyr in the tidyverse.

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

We could use the function pivot_longer(). It is a bit complicated because the two variable names are numberic (not ideal). To work with numeric variable names we can use the mark (`) to surround the number. Some folks call it a tick.

table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Question 3 Restructure table 4b into a longer format using the function pivot_longer(). (1 point)

table4b

## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

table4b %>%
 pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

## # A tibble: 6 x 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583

Question 4 Restructure table 2 into a wider format using the function pivot_wider(). (1 point)

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table2 %>%
 pivot_wider(names_from = "year", values_from = "count")

## # A tibble: 6 x 4
##   country     type           `1999`     `2000`
##   <chr>       <chr>           <int>      <int>
## 1 Afghanistan cases             745       2666
## 2 Afghanistan population   19987071   20595360
## 3 Brazil      cases           37737      80488
## 4 Brazil      population  172006362  174504898
## 5 China       cases          212258     213766
## 6 China       population 1272915272 1280428583

Separating Data

Let’s check out table3 again. The variable rate is pretty darn ugly. Please don’t go out into the world and format your data like that. = ). Luckily, there is a function we can use called separate() that allows us to split us a variable into multiple variables based on a separator.

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

This chunk employs the separate() function and separates by the forward slash. Please run the chunk.

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

For practice, we will pull in another dataset in a csv file. Please download country_code.csv from Sakai. During the semester, we have used the function read.csv() to read in data in a csv file. The package readR (included in the tidyverse) reads csv files with the function read_csv(). There isn’t a lot of difference between the two functions. The major difference is that read_csv() reads in the data as a tibble, NOT a dataframe. You can read more about read_csv and the differences here: https://medium.com/r-tutorials/r-functions-daily-read-csv-3c418c25cba4.

Run the chunk below to pull in the data. We will call the data table5. Then we will look at the data.

table5<-read_csv("country_code.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   country = col_character(),
##   year = col_double(),
##   cases = col_double()
## )

table5

## # A tibble: 6 x 3
##   country            year  cases
##   <chr>             <dbl>  <dbl>
## 1 Afghanistan (AFG)  1999    745
## 2 Afghanistan (AFG)  2000   2666
## 3 Brazil (BRA)       1999  37737
## 4 Brazil (BRA)       2000  80488
## 5 China (CHN)        1999 212258
## 6 China (CHN)        2000 213766

Question 5 In the chunk below, please use the separate() function to separate function the name of the country and the country code. Please call the code variable, code. Please use a pipe as we did above. (2 points)

table5 %>% 
  separate(country, into = c("country", "code"), sep = " ")

## # A tibble: 6 x 4
##   country     code   year  cases
##   <chr>       <chr> <dbl>  <dbl>
## 1 Afghanistan (AFG)  1999    745
## 2 Afghanistan (AFG)  2000   2666
## 3 Brazil      (BRA)  1999  37737
## 4 Brazil      (BRA)  2000  80488
## 5 China       (CHN)  1999 212258
## 6 China       (CHN)  2000 213766

Cleaning Strings

In R we may need to work with character data. Often times character data are messy and may need some cleaning. One way to do this is through the use of the package stringr which is part of the tidyverse. There are two frequently used functions: str_replace() and str_replace_all(). str_replace() will replace the first occurrence in a string and str_replace_all() will replace all occurrences in a string. Let’s play around with this a bit.

I have made the vector colors of five colors. For the fun of it, let’s say I wanted to replace the first e with a &. The arguments of string_replace are string_replace(string, pattern, replacement)

colors<-c("blue", "red", "yellow", "Green", "purple", "red-green")
str_replace(colors, "e", "&")

## [1] "blu&"      "r&d"       "y&llow"    "Gr&en"     "purpl&"    "r&d-green"

colors

## [1] "blue"      "red"       "yellow"    "Green"     "purple"    "red-green"

What if we used str_replace_all()? Please run the chunk below.

colors<-c("blue", "red", "yellow", "Green", "purple", "red-green")
str_replace_all(colors, "e", "&")

## [1] "blu&"      "r&d"       "y&llow"    "Gr&&n"     "purpl&"    "r&d-gr&&n"

colors

## [1] "blue"      "red"       "yellow"    "Green"     "purple"    "red-green"

What if we wanted to replace all vowel occurrences with a dash?

colors<-c("blue", "red", "yellow", "Green", "purple", "red-green")
str_replace_all(colors, "[aeiou]", "-")

## [1] "bl--"      "r-d"       "y-ll-w"    "Gr--n"     "p-rpl-"    "r-d-gr--n"

colors

## [1] "blue"      "red"       "yellow"    "Green"     "purple"    "red-green"

we can also replace all capital letters with lower case letters in the following vector of colors using the function str_to_lower(). Please check out this website: https://stringr.tidyverse.org/reference/case.html

colors<-c("Blue", "Red", "Yellow", "Green", "Purple", "Red-Green")
str_to_lower(colors)

## [1] "blue"      "red"       "yellow"    "green"     "purple"    "red-green"

colors

## [1] "Blue"      "Red"       "Yellow"    "Green"     "Purple"    "Red-Green"

Question 6 In the chunk below, remove the () from the code variable in table5 and also make the country codes all lower case (2 points).

table5a<-table5 %>% 
  separate(country, into = c("country", "code"), sep = " ")

str_replace_all(table5a$code, "[()]", "")

## [1] "AFG" "AFG" "BRA" "BRA" "CHN" "CHN"

str_to_lower(table5a$country)

## [1] "afghanistan" "afghanistan" "brazil"      "brazil"      "china"      
## [6] "china"

table5a

## # A tibble: 6 x 4
##   country     code   year  cases
##   <chr>       <chr> <dbl>  <dbl>
## 1 Afghanistan (AFG)  1999    745
## 2 Afghanistan (AFG)  2000   2666
## 3 Brazil      (BRA)  1999  37737
## 4 Brazil      (BRA)  2000  80488
## 5 China       (CHN)  1999 212258
## 6 China       (CHN)  2000 213766

Joining Data

In data science, we often curate data from a variety of sources and need to join or merge the data into one workable dataset for analysis. This is part of what we call data management or data wrangling. There are a variety of functions we can use to merge two datasets together. In the tidyverse, all of these functions (or, as they call them, verbs) are joins. Please check out this website (the definitions below are direct quotations from this website: https://dplyr.tidyverse.org/reference/join.html.

All of the join functions have a similar syntax: join_type(firstTable, secondTable, by=columnTojoinOn).

inner_join() return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

left_join() return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join() return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

full_join() return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

semi_join() return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

anti_join() return all rows from x where there are not matching values in y, keeping just columns from x.

Please read Chapter 13 on Relational Data: https://r4ds.had.co.nz/relational-data.html. We are going to look at three tables of data from the nycflights13 package that we loaded earlier in the script.

airlines

## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

airports

## # A tibble: 1,458 x 8
##    faa   name                       lat    lon   alt    tz dst   tzone          
##    <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
##  1 04G   Lansdowne Airport         41.1  -80.6  1044    -5 A     America/New_Yo~
##  2 06A   Moton Field Municipal A~  32.5  -85.7   264    -6 A     America/Chicago
##  3 06C   Schaumburg Regional       42.0  -88.1   801    -6 A     America/Chicago
##  4 06N   Randall Airport           41.4  -74.4   523    -5 A     America/New_Yo~
##  5 09J   Jekyll Island Airport     31.1  -81.4    11    -5 A     America/New_Yo~
##  6 0A9   Elizabethton Municipal ~  36.4  -82.2  1593    -5 A     America/New_Yo~
##  7 0G6   Williams County Airport   41.5  -84.5   730    -5 A     America/New_Yo~
##  8 0G7   Finger Lakes Regional A~  42.9  -76.8   492    -5 A     America/New_Yo~
##  9 0P2   Shoestring Aviation Air~  39.8  -76.6  1000    -5 U     America/New_Yo~
## 10 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     America/Los_An~
## # ... with 1,448 more rows

planes

## # A tibble: 3,322 x 9
##    tailnum  year type          manufacturer   model  engines seats speed engine 
##    <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>  
##  1 N10156   2004 Fixed wing m~ EMBRAER        EMB-1~       2    55    NA Turbo-~
##  2 N102UW   1998 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  3 N103US   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  4 N104UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  5 N10575   2002 Fixed wing m~ EMBRAER        EMB-1~       2    55    NA Turbo-~
##  6 N105UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  7 N107US   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  8 N108UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  9 N109UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 10 N110UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## # ... with 3,312 more rows

weather

## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We can connect data from different data frames (or tibbles) with what are called ‘keys’. Keys are used to identify an observation uniquely. We can check to see if a key has more than one occurrence (and if it does, then it isn’t a key.) Here I am counting the tailnum and filtering for any row that has an n greater than one.

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)

## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>

Excellent! planes$tailnum is a key.

It is important to note that not all tables have keys! And sometimes a key is made up of a combination of multiple variables.

Question 7 Is the combination of year, month, day, hour a key in the weather tibble? Why or why not? Write your code to answer this question below. (2 points)

weather %>% 
  count(year, month, day, hour) %>% 
  filter(n > 1)

## # A tibble: 8,705 x 5
##     year month   day  hour     n
##    <int> <int> <int> <int> <int>
##  1  2013     1     1     1     3
##  2  2013     1     1     2     3
##  3  2013     1     1     3     3
##  4  2013     1     1     4     3
##  5  2013     1     1     5     3
##  6  2013     1     1     6     3
##  7  2013     1     1     7     3
##  8  2013     1     1     8     3
##  9  2013     1     1     9     3
## 10  2013     1     1    10     3
## # ... with 8,695 more rows

Keys need to be present in the datasets that you are joining.

Okay, let’s select some data to work with. We will use the select() function and a pipe and make a new tibble named flights2.

flights2 <- flights %>% 
  select(year, month, day, hour, origin, dest, tailnum, carrier)
flights2

## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows

We will use a left_join() to join the flights2 and planes tibbles by tailnum.

flights3 <- flights2 %>% 
  left_join(planes, by = "tailnum")
flights3

## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type 
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixe~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixe~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixe~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA> 
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

Question 8 In the chunk above, what variables were added to the flights2 join? What is the difference from between year.x and year.y? (2 points) year.y and type were added to the flights2 left join by tailnum from planes. Flights and planes which were joined have common variables such as years and months, but they correlate to different things so that’s why it was added by the tail number. The difference between year.x and year.y is that we matched the variables in the first table to a different variable in the second table. In other words, the tail number matches the data for both tables, but year.x and year.y ensure the years aren’t mixed up from differing data sets.

Question 9 In the chunk below, rename the variables year.x and year.y. Rename with names that more clearly define these variables. You can learn more about the rename() function here: https://dplyr.tidyverse.org/reference/rename.html(2 points)

flights3Clean <- flights3 %>%                     
rename(`Year of plane travel information`= year.x, 
                `Year plane had issue`=year.y)
names(flights3Clean)

##  [1] "Year of plane travel information" "month"                           
##  [3] "day"                              "hour"                            
##  [5] "origin"                           "dest"                            
##  [7] "tailnum"                          "carrier"                         
##  [9] "Year plane had issue"             "type"                            
## [11] "manufacturer"                     "model"                           
## [13] "engines"                          "seats"                           
## [15] "speed"                            "engine"

            flights3Clean

## # A tibble: 336,776 x 16
##    `Year of plane ~ month   day  hour origin dest  tailnum carrier
##               <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1             2013     1     1     5 EWR    IAH   N14228  UA     
##  2             2013     1     1     5 LGA    IAH   N24211  UA     
##  3             2013     1     1     5 JFK    MIA   N619AA  AA     
##  4             2013     1     1     5 JFK    BQN   N804JB  B6     
##  5             2013     1     1     6 LGA    ATL   N668DN  DL     
##  6             2013     1     1     5 EWR    ORD   N39463  UA     
##  7             2013     1     1     6 EWR    FLL   N516JB  B6     
##  8             2013     1     1     6 LGA    IAD   N829AS  EV     
##  9             2013     1     1     6 JFK    MCO   N593JB  B6     
## 10             2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows, and 8 more variables: `Year plane had
## #   issue` <int>, type <chr>, manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

Question 10 (from 13.4.6 in R for Data Science). Calculate the average delay time by destination for each airport and then join this with the airport tibble. Sort the tibble by in descending order by average weight time (5 points). %>% count(faa, sort = TRUE) airports2<-airports %>%

airports2

airports%>%
  semi_join(flights, c("faa" = "dest"))

## # A tibble: 101 x 8
##    faa   name                         lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                      <dbl>  <dbl> <dbl> <dbl> <chr> <chr>        
##  1 ABQ   Albuquerque International~  35.0 -107.   5355    -7 A     America/Denv~
##  2 ACK   Nantucket Mem               41.3  -70.1    48    -5 A     America/New_~
##  3 ALB   Albany Intl                 42.7  -73.8   285    -5 A     America/New_~
##  4 ANC   Ted Stevens Anchorage Intl  61.2 -150.    152    -9 A     America/Anch~
##  5 ATL   Hartsfield Jackson Atlant~  33.6  -84.4  1026    -5 A     America/New_~
##  6 AUS   Austin Bergstrom Intl       30.2  -97.7   542    -6 A     America/Chic~
##  7 AVL   Asheville Regional Airport  35.4  -82.5  2165    -5 A     America/New_~
##  8 BDL   Bradley Intl                41.9  -72.7   173    -5 A     America/New_~
##  9 BGR   Bangor Intl                 44.8  -68.8   192    -5 A     America/New_~
## 10 BHM   Birmingham Intl             33.6  -86.8   644    -6 A     America/Chic~
## # ... with 91 more rows

delay<-flights %>%
  group_by(arr_delay, dest) %>%
count(arr_delay, sort = FALSE)
delay

## # A tibble: 19,439 x 3
## # Groups:   arr_delay, dest [19,439]
##    arr_delay dest      n
##        <dbl> <chr> <int>
##  1       -86 SFO       1
##  2       -79 SFO       1
##  3       -75 LAX       1
##  4       -75 SEA       1
##  5       -74 SEA       1
##  6       -73 SFO       1
##  7       -71 LAX       1
##  8       -71 PDX       1
##  9       -71 SFO       1
## 10       -70 HNL       2
## # ... with 19,429 more rows

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

Congratulations! Once you have completed the module, please upload your Rmd and html to the optional slot on Sakai. You have finished the tidyverse data management module. I hope you improved your R skills!

Reference: Hadley Wickham and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st. ed.). O’Reilly Media, Inc.