Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

library(dplyr)  # provides %>%, select(), sample_n(), and the join functions used below
College <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/salary_potential.csv")

## Rows: 935 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): name, state_name
## dbl (5): rank, early_career_pay, mid_career_pay, make_world_better_percent, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

College

## # A tibble: 935 × 7
##     rank name  state_name early_career_pay mid_career_pay make_world_better_pe…¹
##    <dbl> <chr> <chr>                 <dbl>          <dbl>                  <dbl>
##  1     1 Aubu… Alabama               54400         104500                     51
##  2     2 Univ… Alabama               57500         103900                     59
##  3     3 The … Alabama               52300          97400                     50
##  4     4 Tusk… Alabama               54500          93500                     61
##  5     5 Samf… Alabama               48400          90500                     52
##  6     6 Spri… Alabama               46600          89100                     53
##  7     7 Birm… Alabama               49100          88300                     48
##  8     8 Univ… Alabama               48600          87200                     57
##  9     9 Univ… Alabama               47700          86400                     56
## 10    10 Alab… Alabama               48700          83500                     58
## # ℹ 925 more rows
## # ℹ abbreviated name: ¹make_world_better_percent
## # ℹ 1 more variable: stem_percent <dbl>

Tuition <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_income.csv")

## Rows: 209012 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, state, campus, income_lvl
## dbl (3): total_price, year, net_cost
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Tuition

## # A tibble: 209,012 × 7
##    name                       state total_price  year campus net_cost income_lvl
##    <chr>                      <chr>       <dbl> <dbl> <chr>     <dbl> <chr>     
##  1 Piedmont International Un… NC          20174  2016 On Ca…   11475  0 to 30,0…
##  2 Piedmont International Un… NC          20174  2016 On Ca…   11451  30,001 to…
##  3 Piedmont International Un… NC          20174  2016 On Ca…   16229  48_001 to…
##  4 Piedmont International Un… NC          20174  2016 On Ca…   15592  75,001 to…
##  5 Piedmont International Un… NC          20514  2017 On Ca…   11668. 0 to 30,0…
##  6 Piedmont International Un… NC          20514  2017 On Ca…   11644. 30,001 to…
##  7 Piedmont International Un… NC          20514  2017 On Ca…   16503. 48_001 to…
##  8 Piedmont International Un… NC          20514  2017 On Ca…   15855. 75,001 to…
##  9 Piedmont International Un… NC          20514  2017 On Ca…       0  Over 110,…
## 10 Piedmont International Un… NC          20829  2018 On Ca…   11848. 0 to 30,0…
## # ℹ 209,002 more rows

2. Make data small

Describe the two datasets:

Data 1

Columns: name, rank, state_name
Rows: 10

Data 2

Columns: campus, name, year
Rows: 10

College_small <- College %>% select(name, rank, state_name) %>% sample_n(10)
Tuition_small <- Tuition %>% select(campus, name, year) %>% sample_n(10)

College_small

## # A tibble: 10 × 3
##    name                                   rank state_name
##    <chr>                                 <dbl> <chr>     
##  1 Stevenson University                     15 Maryland  
##  2 University of Wisconsin-Platteville       4 Wisconsin 
##  3 Kennesaw State University                 5 Georgia   
##  4 Nova Southeastern University             14 Florida   
##  5 Taylor University                        10 Indiana   
##  6 Menlo College                            20 California
##  7 Yeshiva University                       15 New-York  
##  8 LeTourneau University                    18 Texas     
##  9 University of Nebraska Medical Center     3 Nebraska  
## 10 Colgate University                        7 New-York

Tuition_small

## # A tibble: 10 × 3
##    campus     name                                         year
##    <chr>      <chr>                                       <dbl>
##  1 Off Campus Marquette University                         2014
##  2 Off Campus South Dakota School of Mines and Technology  2018
##  3 On Campus  Santa Barbara Business College-Ventura       2018
##  4 Off Campus Whittier College                             2014
##  5 Off Campus Cortiva Institute-New Jersey                 2015
##  6 Off Campus Wesleyan College                             2018
##  7 Off Campus Platt College-Los Angeles                    2012
##  8 Off Campus Brigham Young University-Idaho               2018
##  9 On Campus  Gonzaga University                           2011
## 10 Off Campus Johnston Community College                   2013
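
Because sample_n() draws rows at random, the ten rows shown above will differ on each run, and so will the results of every join below. A minimal sketch of how set.seed() makes the draw repeatable, assuming a hypothetical toy tibble `toy` (not the real College data) and an arbitrary seed value:

```r
library(dplyr)

# Hypothetical toy data standing in for College; not the real dataset.
toy <- tibble(name = paste("School", 1:50), rank = 1:50)

# Fixing the seed (123 is an arbitrary choice) makes the random draw repeatable.
set.seed(123)
first_draw <- toy %>% sample_n(10)

set.seed(123)
second_draw <- toy %>% sample_n(10)

# Both draws contain exactly the same rows in the same order.
identical(first_draw, second_draw)
```

Setting the seed once before the two sample_n() calls in this section would likely be enough to keep the small datasets stable between knits.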

3. inner_join

Describe the resulting data:

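Columns: name, rank, state_name, campus, year
Rows: 0

How is it different from the original two datasets?

None of the names in College_small matched a name in Tuition_small, so the result has no rows. It still has all five columns: the three from College_small plus campus and year from Tuition_small.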

joined_data <- College_small %>% 
    inner_join(Tuition_small)

## Joining with `by = join_by(name)`

joined_data

## # A tibble: 0 × 5
## # ℹ 5 variables: name <chr>, rank <dbl>, state_name <chr>, campus <chr>,
## #   year <dbl>
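
An empty inner join usually means the key column shares no values between the two samples. A minimal sketch of checking the overlap in `name` before joining, and of naming the key explicitly with `by`, using hypothetical toy tibbles `x` and `y` rather than the real data:

```r
library(dplyr)

# Hypothetical toy data: only "B College" appears in both tables.
x <- tibble(name = c("A University", "B College"), rank = c(1, 2))
y <- tibble(name = c("B College", "C Institute"), year = c(2018, 2014))

# Names present in both tables; an empty result here predicts an empty inner join.
shared_names <- intersect(x$name, y$name)
shared_names

# Naming the key explicitly avoids relying on the automatic "Joining with" guess.
matched <- x %>% inner_join(y, by = "name")
matched
```

With one shared name, the inner join keeps a single row that carries the columns of both tables.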

4. left_join

Describe the resulting data:

Columns: name, rank, state_name, campus, year
Rows: 10

How is it different from the original two datasets?

It keeps all 10 rows of College_small and adds the campus and year columns from Tuition_small. No names matched in Tuition_small, so campus and year are NA in every row.

left <- College_small %>%
    left_join(Tuition_small)

## Joining with `by = join_by(name)`

left

## # A tibble: 10 × 5
##    name                                   rank state_name campus  year
##    <chr>                                 <dbl> <chr>      <chr>  <dbl>
##  1 Stevenson University                     15 Maryland   <NA>      NA
##  2 University of Wisconsin-Platteville       4 Wisconsin  <NA>      NA
##  3 Kennesaw State University                 5 Georgia    <NA>      NA
##  4 Nova Southeastern University             14 Florida    <NA>      NA
##  5 Taylor University                        10 Indiana    <NA>      NA
##  6 Menlo College                            20 California <NA>      NA
##  7 Yeshiva University                       15 New-York   <NA>      NA
##  8 LeTourneau University                    18 Texas      <NA>      NA
##  9 University of Nebraska Medical Center     3 Nebraska   <NA>      NA
## 10 Colgate University                        7 New-York   <NA>      NA

5. right_join

Describe the resulting data:

Columns: name, rank, state_name, campus, year
Rows: 10

How is it different from the original two datasets?

It keeps all 10 rows of Tuition_small (y) along with the columns from both datasets. No names matched in College_small (x), so rank and state_name are NA in every row.

right <- College_small %>%
    right_join(Tuition_small)

## Joining with `by = join_by(name)`

right

## # A tibble: 10 × 5
##    name                                         rank state_name campus      year
##    <chr>                                       <dbl> <chr>      <chr>      <dbl>
##  1 Marquette University                           NA <NA>       Off Campus  2014
##  2 South Dakota School of Mines and Technology    NA <NA>       Off Campus  2018
##  3 Santa Barbara Business College-Ventura         NA <NA>       On Campus   2018
##  4 Whittier College                               NA <NA>       Off Campus  2014
##  5 Cortiva Institute-New Jersey                   NA <NA>       Off Campus  2015
##  6 Wesleyan College                               NA <NA>       Off Campus  2018
##  7 Platt College-Los Angeles                      NA <NA>       Off Campus  2012
##  8 Brigham Young University-Idaho                 NA <NA>       Off Campus  2018
##  9 Gonzaga University                             NA <NA>       On Campus   2011
## 10 Johnston Community College                     NA <NA>       Off Campus  2013

6. full_join

Describe the resulting data:

Columns: name, rank, state_name, campus, year
Rows: 20

How is it different from the original two datasets?

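It keeps every row from both datasets. Because no names matched, no rows were combined, so the result has 20 rows (10 + 10) instead of 10, with NA in the columns that come from the other dataset.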

full <- College_small %>%
    full_join(Tuition_small)

## Joining with `by = join_by(name)`

full

## # A tibble: 20 × 5
##    name                                         rank state_name campus      year
##    <chr>                                       <dbl> <chr>      <chr>      <dbl>
##  1 Stevenson University                           15 Maryland   <NA>          NA
##  2 University of Wisconsin-Platteville             4 Wisconsin  <NA>          NA
##  3 Kennesaw State University                       5 Georgia    <NA>          NA
##  4 Nova Southeastern University                   14 Florida    <NA>          NA
##  5 Taylor University                              10 Indiana    <NA>          NA
##  6 Menlo College                                  20 California <NA>          NA
##  7 Yeshiva University                             15 New-York   <NA>          NA
##  8 LeTourneau University                          18 Texas      <NA>          NA
##  9 University of Nebraska Medical Center           3 Nebraska   <NA>          NA
## 10 Colgate University                              7 New-York   <NA>          NA
## 11 Marquette University                           NA <NA>       Off Campus  2014
## 12 South Dakota School of Mines and Technology    NA <NA>       Off Campus  2018
## 13 Santa Barbara Business College-Ventura         NA <NA>       On Campus   2018
## 14 Whittier College                               NA <NA>       Off Campus  2014
## 15 Cortiva Institute-New Jersey                   NA <NA>       Off Campus  2015
## 16 Wesleyan College                               NA <NA>       Off Campus  2018
## 17 Platt College-Los Angeles                      NA <NA>       Off Campus  2012
## 18 Brigham Young University-Idaho                 NA <NA>       Off Campus  2018
## 19 Gonzaga University                             NA <NA>       On Campus   2011
## 20 Johnston Community College                     NA <NA>       Off Campus  2013
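
The row counts above are the extreme case where nothing matches. A minimal sketch of how the three outer joins behave when one name does match, using hypothetical toy tibbles `x` and `y` rather than the real data:

```r
library(dplyr)

# Hypothetical toy data: "B College" is the only name shared by both tables.
x <- tibble(name = c("A University", "B College"), rank = c(1, 2))
y <- tibble(name = c("B College", "C Institute"), year = c(2018, 2014))

left_toy  <- x %>% left_join(y, by = "name")   # every row of x
right_toy <- x %>% right_join(y, by = "name")  # every row of y
full_toy  <- x %>% full_join(y, by = "name")   # every row of x and of y

# Row counts per join: the shared name is counted once in the full join.
c(left = nrow(left_toy), right = nrow(right_toy), full = nrow(full_toy))
```

Here the full join has three rows rather than four, because the matching row from each table is combined into one.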

7. semi_join

Describe the resulting data:

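Columns: name, rank, state_name
Rows: 0

How is it different from the original two datasets?

semi_join keeps only the rows of College_small (x) that have a match in Tuition_small (y), and it never adds columns from y. No names matched, so no rows were returned.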

semi <- College_small %>%
    semi_join(Tuition_small)

## Joining with `by = join_by(name)`

semi

## # A tibble: 0 × 3
## # ℹ 3 variables: name <chr>, rank <dbl>, state_name <chr>

8. anti_join

Describe the resulting data:

Columns: name, rank, state_name
Rows: 10

How is it different from the original two datasets?

anti_join keeps the rows of College_small that have no match in Tuition_small. Since no names matched, all 10 rows were returned, with only the College_small columns.

anti <- College_small %>%
    anti_join(Tuition_small)

## Joining with `by = join_by(name)`

anti

## # A tibble: 10 × 3
##    name                                   rank state_name
##    <chr>                                 <dbl> <chr>     
##  1 Stevenson University                     15 Maryland  
##  2 University of Wisconsin-Platteville       4 Wisconsin 
##  3 Kennesaw State University                 5 Georgia   
##  4 Nova Southeastern University             14 Florida   
##  5 Taylor University                        10 Indiana   
##  6 Menlo College                            20 California
##  7 Yeshiva University                       15 New-York  
##  8 LeTourneau University                    18 Texas     
##  9 University of Nebraska Medical Center     3 Nebraska  
## 10 Colgate University                        7 New-York
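
semi_join and anti_join are complements: one keeps the rows of x that have a match in y, the other keeps the rows that do not. A minimal sketch of that relationship, using hypothetical toy tibbles `x` and `y` rather than the real data:

```r
library(dplyr)

# Hypothetical toy data: "B College" is the only name shared by both tables.
x <- tibble(name = c("A University", "B College"), rank = c(1, 2))
y <- tibble(name = c("B College", "C Institute"), year = c(2018, 2014))

kept    <- x %>% semi_join(y, by = "name")  # rows of x with a match in y
dropped <- x %>% anti_join(y, by = "name")  # rows of x without a match in y

# Together the two results account for every row of x,
# and neither result gains the year column from y.
nrow(kept) + nrow(dropped) == nrow(x)
names(kept)
```

This is why the semi_join result above has 0 rows and the anti_join result has all 10: with no matching names, every row of College_small falls on the anti_join side.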

Meghan Hopps

2024-6-13
