big_tech_stock_prices <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-02-07/big_tech_stock_prices.csv')
## Rows: 45088 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): stock_symbol
## dbl (6): open, high, low, close, adj_close, volume
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
big_tech_companies <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-02-07/big_tech_companies.csv')
## Rows: 14 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): stock_symbol, company
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data1
Data 2
set.seed(1234)
big_tech_stock_prices_small <- big_tech_stock_prices %>%
select(stock_symbol, date, open, high, low, close, volume) %>%
sample_n(10)
big_tech_companies_small <- big_tech_companies %>%
select(stock_symbol, company)
Describe the resulting data:
How is it different from the original two datasets? it combines both datasets into one, adding the company name column to the stock price rows. Only keeps rows where stock_symbol exists in BOTH datasets.
inner_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")
## # A tibble: 10 × 8
## stock_symbol date open high low close volume company
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ORCL 2018-05-29 46.5 46.8 46.0 46.3 12427800 Oracle Corpora…
## 2 ORCL 2018-09-06 48.0 48.2 47.5 47.7 18456100 Oracle Corpora…
## 3 TSLA 2010-08-03 1.4 1.46 1.39 1.46 18457500 Tesla, Inc.
## 4 CSCO 2018-07-27 43.3 43.6 42.4 42.6 22050900 Cisco Systems,…
## 5 NFLX 2016-04-06 105. 106. 104. 105. 9605800 Netflix, Inc.
## 6 NVDA 2011-04-06 4.43 4.43 4.33 4.36 85698400 NVIDIA Corpora…
## 7 GOOGL 2014-07-02 29.6 29.7 29.5 29.5 23916000 Alphabet Inc.
## 8 CSCO 2018-06-27 42.7 43.2 42.3 42.3 20794800 Cisco Systems,…
## 9 IBM 2010-11-03 138. 138. 136. 138. 5420477 International …
## 10 AAPL 2020-06-03 81.2 81.6 80.6 81.3 104491200 Apple Inc.
Describe the resulting data:
How is it different from the original two datasets? keeps all 10 rows from the left dataset (big_tech_stock_prices_small) and adds the company column from the right. Since every stock symbol in the prices data has a match in companies, the result looks the same as inner_join here.
left_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")
## # A tibble: 10 × 8
## stock_symbol date open high low close volume company
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ORCL 2018-05-29 46.5 46.8 46.0 46.3 12427800 Oracle Corpora…
## 2 ORCL 2018-09-06 48.0 48.2 47.5 47.7 18456100 Oracle Corpora…
## 3 TSLA 2010-08-03 1.4 1.46 1.39 1.46 18457500 Tesla, Inc.
## 4 CSCO 2018-07-27 43.3 43.6 42.4 42.6 22050900 Cisco Systems,…
## 5 NFLX 2016-04-06 105. 106. 104. 105. 9605800 Netflix, Inc.
## 6 NVDA 2011-04-06 4.43 4.43 4.33 4.36 85698400 NVIDIA Corpora…
## 7 GOOGL 2014-07-02 29.6 29.7 29.5 29.5 23916000 Alphabet Inc.
## 8 CSCO 2018-06-27 42.7 43.2 42.3 42.3 20794800 Cisco Systems,…
## 9 IBM 2010-11-03 138. 138. 136. 138. 5420477 International …
## 10 AAPL 2020-06-03 81.2 81.6 80.6 81.3 104491200 Apple Inc.
Describe the resulting data:
How is it different from the original two datasets? keeps all 14 rows from the right dataset (big_tech_companies_small). Companies that had no matching rows in the small prices sample will show up with NA for the price columns.
right_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")
## # A tibble: 16 × 8
## stock_symbol date open high low close volume company
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ORCL 2018-05-29 46.5 46.8 46.0 46.3 12427800 Oracle Corpora…
## 2 ORCL 2018-09-06 48.0 48.2 47.5 47.7 18456100 Oracle Corpora…
## 3 TSLA 2010-08-03 1.4 1.46 1.39 1.46 18457500 Tesla, Inc.
## 4 CSCO 2018-07-27 43.3 43.6 42.4 42.6 22050900 Cisco Systems,…
## 5 NFLX 2016-04-06 105. 106. 104. 105. 9605800 Netflix, Inc.
## 6 NVDA 2011-04-06 4.43 4.43 4.33 4.36 85698400 NVIDIA Corpora…
## 7 GOOGL 2014-07-02 29.6 29.7 29.5 29.5 23916000 Alphabet Inc.
## 8 CSCO 2018-06-27 42.7 43.2 42.3 42.3 20794800 Cisco Systems,…
## 9 IBM 2010-11-03 138. 138. 136. 138. 5420477 International …
## 10 AAPL 2020-06-03 81.2 81.6 80.6 81.3 104491200 Apple Inc.
## 11 ADBE NA NA NA NA NA NA Adobe Inc.
## 12 AMZN NA NA NA NA NA NA Amazon.com, In…
## 13 CRM NA NA NA NA NA NA Salesforce, In…
## 14 INTC NA NA NA NA NA NA Intel Corporat…
## 15 META NA NA NA NA NA NA Meta Platforms…
## 16 MSFT NA NA NA NA NA NA Microsoft Corp…
Describe the resulting data:
How is it different from the original two datasets? keeps every row from both datasets. Any stock symbol that appears in one but not the other gets NA for the missing columns. You see all companies even if they had no sampled price rows.
full_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")
## # A tibble: 16 × 8
## stock_symbol date open high low close volume company
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ORCL 2018-05-29 46.5 46.8 46.0 46.3 12427800 Oracle Corpora…
## 2 ORCL 2018-09-06 48.0 48.2 47.5 47.7 18456100 Oracle Corpora…
## 3 TSLA 2010-08-03 1.4 1.46 1.39 1.46 18457500 Tesla, Inc.
## 4 CSCO 2018-07-27 43.3 43.6 42.4 42.6 22050900 Cisco Systems,…
## 5 NFLX 2016-04-06 105. 106. 104. 105. 9605800 Netflix, Inc.
## 6 NVDA 2011-04-06 4.43 4.43 4.33 4.36 85698400 NVIDIA Corpora…
## 7 GOOGL 2014-07-02 29.6 29.7 29.5 29.5 23916000 Alphabet Inc.
## 8 CSCO 2018-06-27 42.7 43.2 42.3 42.3 20794800 Cisco Systems,…
## 9 IBM 2010-11-03 138. 138. 136. 138. 5420477 International …
## 10 AAPL 2020-06-03 81.2 81.6 80.6 81.3 104491200 Apple Inc.
## 11 ADBE NA NA NA NA NA NA Adobe Inc.
## 12 AMZN NA NA NA NA NA NA Amazon.com, In…
## 13 CRM NA NA NA NA NA NA Salesforce, In…
## 14 INTC NA NA NA NA NA NA Intel Corporat…
## 15 META NA NA NA NA NA NA Meta Platforms…
## 16 MSFT NA NA NA NA NA NA Microsoft Corp…
Describe the resulting data:
How is it different from the original two datasets? keeps only the rows from the left dataset that have a match in the right dataset, but does NOT add any columns from the right. It’s a filtering tool, not a merging tool.
semi_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")
## # A tibble: 10 × 7
## stock_symbol date open high low close volume
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ORCL 2018-05-29 46.5 46.8 46.0 46.3 12427800
## 2 ORCL 2018-09-06 48.0 48.2 47.5 47.7 18456100
## 3 TSLA 2010-08-03 1.4 1.46 1.39 1.46 18457500
## 4 CSCO 2018-07-27 43.3 43.6 42.4 42.6 22050900
## 5 NFLX 2016-04-06 105. 106. 104. 105. 9605800
## 6 NVDA 2011-04-06 4.43 4.43 4.33 4.36 85698400
## 7 GOOGL 2014-07-02 29.6 29.7 29.5 29.5 23916000
## 8 CSCO 2018-06-27 42.7 43.2 42.3 42.3 20794800
## 9 IBM 2010-11-03 138. 138. 136. 138. 5420477
## 10 AAPL 2020-06-03 81.2 81.6 80.6 81.3 104491200
Describe the resulting data:
How is it different from the original two datasets? keeps only rows from the left dataset that do NOT have a match in the right. Since all 10 sampled stock symbols exist in big_tech_companies_small, nothing gets returned — every row had a match.
anti_join(big_tech_stock_prices_small, big_tech_companies_small, by = "stock_symbol")
## # A tibble: 0 × 7
## # ℹ 7 variables: stock_symbol <chr>, date <date>, open <dbl>, high <dbl>,
## # low <dbl>, close <dbl>, volume <dbl>