Week 9: Apply it to your data 8

1. Import your data

big_tech_stock_prices <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-02-07/big_tech_stock_prices.csv')

## Rows: 45088 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): stock_symbol
## dbl  (6): open, high, low, close, adj_close, volume
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

big_tech_companies <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-02-07/big_tech_companies.csv')

## Rows: 14 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): stock_symbol, company
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Data1

Columns: stock_symbol, date, open, high, low, close, adj_close, volume
Rows: 45088

Data 2

Columns: stock_symbol, company
Rows: 14

set.seed(1234)

big_tech_stock_prices_small <- big_tech_stock_prices %>%
  select(stock_symbol, date, open, high, low, close, volume) %>%
  sample_n(10)

big_tech_companies_small <- big_tech_companies %>%
  select(stock_symbol, company)

3. inner_join

Describe the resulting data:

Columns:8 (stock_symbol, date, open, high, low, close, volume, company)
Rows:10

How is it different from the original two datasets? it combines both datasets into one, adding the company name column to the stock price rows. Only keeps rows where stock_symbol exists in BOTH datasets.

inner_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")

## # A tibble: 10 × 8
##    stock_symbol date         open   high    low  close    volume company        
##    <chr>        <date>      <dbl>  <dbl>  <dbl>  <dbl>     <dbl> <chr>          
##  1 ORCL         2018-05-29  46.5   46.8   46.0   46.3   12427800 Oracle Corpora…
##  2 ORCL         2018-09-06  48.0   48.2   47.5   47.7   18456100 Oracle Corpora…
##  3 TSLA         2010-08-03   1.4    1.46   1.39   1.46  18457500 Tesla, Inc.    
##  4 CSCO         2018-07-27  43.3   43.6   42.4   42.6   22050900 Cisco Systems,…
##  5 NFLX         2016-04-06 105.   106.   104.   105.     9605800 Netflix, Inc.  
##  6 NVDA         2011-04-06   4.43   4.43   4.33   4.36  85698400 NVIDIA Corpora…
##  7 GOOGL        2014-07-02  29.6   29.7   29.5   29.5   23916000 Alphabet Inc.  
##  8 CSCO         2018-06-27  42.7   43.2   42.3   42.3   20794800 Cisco Systems,…
##  9 IBM          2010-11-03 138.   138.   136.   138.     5420477 International …
## 10 AAPL         2020-06-03  81.2   81.6   80.6   81.3  104491200 Apple Inc.

4. left_join

Describe the resulting data:

Columns:8 (same as above)
Rows:10

How is it different from the original two datasets? keeps all 10 rows from the left dataset (big_tech_stock_prices_small) and adds the company column from the right. Since every stock symbol in the prices data has a match in companies, the result looks the same as inner_join here.

left_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")

## # A tibble: 10 × 8
##    stock_symbol date         open   high    low  close    volume company        
##    <chr>        <date>      <dbl>  <dbl>  <dbl>  <dbl>     <dbl> <chr>          
##  1 ORCL         2018-05-29  46.5   46.8   46.0   46.3   12427800 Oracle Corpora…
##  2 ORCL         2018-09-06  48.0   48.2   47.5   47.7   18456100 Oracle Corpora…
##  3 TSLA         2010-08-03   1.4    1.46   1.39   1.46  18457500 Tesla, Inc.    
##  4 CSCO         2018-07-27  43.3   43.6   42.4   42.6   22050900 Cisco Systems,…
##  5 NFLX         2016-04-06 105.   106.   104.   105.     9605800 Netflix, Inc.  
##  6 NVDA         2011-04-06   4.43   4.43   4.33   4.36  85698400 NVIDIA Corpora…
##  7 GOOGL        2014-07-02  29.6   29.7   29.5   29.5   23916000 Alphabet Inc.  
##  8 CSCO         2018-06-27  42.7   43.2   42.3   42.3   20794800 Cisco Systems,…
##  9 IBM          2010-11-03 138.   138.   136.   138.     5420477 International …
## 10 AAPL         2020-06-03  81.2   81.6   80.6   81.3  104491200 Apple Inc.

5. right_join

Describe the resulting data:

Columns:8 (same as above)
Rows:14

How is it different from the original two datasets? keeps all 14 rows from the right dataset (big_tech_companies_small). Companies that had no matching rows in the small prices sample will show up with NA for the price columns.

right_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")

## # A tibble: 16 × 8
##    stock_symbol date         open   high    low  close    volume company        
##    <chr>        <date>      <dbl>  <dbl>  <dbl>  <dbl>     <dbl> <chr>          
##  1 ORCL         2018-05-29  46.5   46.8   46.0   46.3   12427800 Oracle Corpora…
##  2 ORCL         2018-09-06  48.0   48.2   47.5   47.7   18456100 Oracle Corpora…
##  3 TSLA         2010-08-03   1.4    1.46   1.39   1.46  18457500 Tesla, Inc.    
##  4 CSCO         2018-07-27  43.3   43.6   42.4   42.6   22050900 Cisco Systems,…
##  5 NFLX         2016-04-06 105.   106.   104.   105.     9605800 Netflix, Inc.  
##  6 NVDA         2011-04-06   4.43   4.43   4.33   4.36  85698400 NVIDIA Corpora…
##  7 GOOGL        2014-07-02  29.6   29.7   29.5   29.5   23916000 Alphabet Inc.  
##  8 CSCO         2018-06-27  42.7   43.2   42.3   42.3   20794800 Cisco Systems,…
##  9 IBM          2010-11-03 138.   138.   136.   138.     5420477 International …
## 10 AAPL         2020-06-03  81.2   81.6   80.6   81.3  104491200 Apple Inc.     
## 11 ADBE         NA          NA     NA     NA     NA           NA Adobe Inc.     
## 12 AMZN         NA          NA     NA     NA     NA           NA Amazon.com, In…
## 13 CRM          NA          NA     NA     NA     NA           NA Salesforce, In…
## 14 INTC         NA          NA     NA     NA     NA           NA Intel Corporat…
## 15 META         NA          NA     NA     NA     NA           NA Meta Platforms…
## 16 MSFT         NA          NA     NA     NA     NA           NA Microsoft Corp…

6. full_join

Describe the resulting data:

Columns:8 (same as above)
Rows:14

How is it different from the original two datasets? keeps every row from both datasets. Any stock symbol that appears in one but not the other gets NA for the missing columns. You see all companies even if they had no sampled price rows.

full_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")

## # A tibble: 16 × 8
##    stock_symbol date         open   high    low  close    volume company        
##    <chr>        <date>      <dbl>  <dbl>  <dbl>  <dbl>     <dbl> <chr>          
##  1 ORCL         2018-05-29  46.5   46.8   46.0   46.3   12427800 Oracle Corpora…
##  2 ORCL         2018-09-06  48.0   48.2   47.5   47.7   18456100 Oracle Corpora…
##  3 TSLA         2010-08-03   1.4    1.46   1.39   1.46  18457500 Tesla, Inc.    
##  4 CSCO         2018-07-27  43.3   43.6   42.4   42.6   22050900 Cisco Systems,…
##  5 NFLX         2016-04-06 105.   106.   104.   105.     9605800 Netflix, Inc.  
##  6 NVDA         2011-04-06   4.43   4.43   4.33   4.36  85698400 NVIDIA Corpora…
##  7 GOOGL        2014-07-02  29.6   29.7   29.5   29.5   23916000 Alphabet Inc.  
##  8 CSCO         2018-06-27  42.7   43.2   42.3   42.3   20794800 Cisco Systems,…
##  9 IBM          2010-11-03 138.   138.   136.   138.     5420477 International …
## 10 AAPL         2020-06-03  81.2   81.6   80.6   81.3  104491200 Apple Inc.     
## 11 ADBE         NA          NA     NA     NA     NA           NA Adobe Inc.     
## 12 AMZN         NA          NA     NA     NA     NA           NA Amazon.com, In…
## 13 CRM          NA          NA     NA     NA     NA           NA Salesforce, In…
## 14 INTC         NA          NA     NA     NA     NA           NA Intel Corporat…
## 15 META         NA          NA     NA     NA     NA           NA Meta Platforms…
## 16 MSFT         NA          NA     NA     NA     NA           NA Microsoft Corp…

7. semi_join

Describe the resulting data:

Columns:7 (only the stock price columns — no company column added)
Rows:10

How is it different from the original two datasets? keeps only the rows from the left dataset that have a match in the right dataset, but does NOT add any columns from the right. It’s a filtering tool, not a merging tool.

semi_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")

## # A tibble: 10 × 7
##    stock_symbol date         open   high    low  close    volume
##    <chr>        <date>      <dbl>  <dbl>  <dbl>  <dbl>     <dbl>
##  1 ORCL         2018-05-29  46.5   46.8   46.0   46.3   12427800
##  2 ORCL         2018-09-06  48.0   48.2   47.5   47.7   18456100
##  3 TSLA         2010-08-03   1.4    1.46   1.39   1.46  18457500
##  4 CSCO         2018-07-27  43.3   43.6   42.4   42.6   22050900
##  5 NFLX         2016-04-06 105.   106.   104.   105.     9605800
##  6 NVDA         2011-04-06   4.43   4.43   4.33   4.36  85698400
##  7 GOOGL        2014-07-02  29.6   29.7   29.5   29.5   23916000
##  8 CSCO         2018-06-27  42.7   43.2   42.3   42.3   20794800
##  9 IBM          2010-11-03 138.   138.   136.   138.     5420477
## 10 AAPL         2020-06-03  81.2   81.6   80.6   81.3  104491200

8. anti_join

Describe the resulting data:

Columns:7 (only the stock price columns)
Rows:0

How is it different from the original two datasets? keeps only rows from the left dataset that do NOT have a match in the right. Since all 10 sampled stock symbols exist in big_tech_companies_small, nothing gets returned — every row had a match.

anti_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")

## # A tibble: 0 × 7
## # ℹ 7 variables: stock_symbol <chr>, date <date>, open <dbl>, high <dbl>,
## #   low <dbl>, close <dbl>, volume <dbl>

Week 9: Apply it to your data 8

Josh Rockefeller

2022-10-05

1. Import your data

2. Make data small

3. inner_join

4. left_join

5. right_join

6. full_join

7. semi_join

8. anti_join