Lab 7: Rectangling and Webscraping

Author

Amanda Rose Knudsen

Overview

This is a two part assignment. In the first part of the assignment you will practice rectangling on a dataset from the repurrrsive package. In the second part you will combine the rvest package along with functions and iteration to scrape data on foreign linked political action committees from the website open secrets.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(repurrrsive)
library(robotstxt)
library(rvest)

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
library(stringr)

Rectangling

Problem 1: Load the repurrrsive package to get access to get access to the got_chars dataset. In section 23.4.2 of R4DS, there is code that extracts data from the got_chars list and converts it into a tibble with information on each character and a separate tibble which contains information on the titles held by each character. Perform similar operations to create separate tibbles containing the aliases, allegiances, books, and TV series of each Game of Thrones character.

chars <- tibble(got_chars = got_chars)
chars |> 
  unnest_wider(got_chars)
# A tibble: 30 × 18
   url           id name  gender culture born  died  alive titles aliases father
   <chr>      <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
 1 https://w…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
 2 https://w…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
 3 https://w…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
 4 https://w…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
 5 https://w…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
 6 https://w…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
 7 https://w…  1295 Cres… Male   ""      "In … "In … FALSE <chr>  <chr>   ""    
 8 https://w…   130 Aria… Female "Dorni… "In … ""    TRUE  <chr>  <chr>   ""    
 9 https://w…  1303 Daen… Female "Valyr… "In … ""    TRUE  <chr>  <chr>   ""    
10 https://w…  1319 Davo… Male   "Weste… "In … ""    TRUE  <chr>  <chr>   ""    
# ℹ 20 more rows
# ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
#   books <list>, povBooks <list>, tvSeries <list>, playedBy <list>
characters <- chars |> 
  unnest_wider(got_chars) 
characters
# A tibble: 30 × 18
   url           id name  gender culture born  died  alive titles aliases father
   <chr>      <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
 1 https://w…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
 2 https://w…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
 3 https://w…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
 4 https://w…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
 5 https://w…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
 6 https://w…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
 7 https://w…  1295 Cres… Male   ""      "In … "In … FALSE <chr>  <chr>   ""    
 8 https://w…   130 Aria… Female "Dorni… "In … ""    TRUE  <chr>  <chr>   ""    
 9 https://w…  1303 Daen… Female "Valyr… "In … ""    TRUE  <chr>  <chr>   ""    
10 https://w…  1319 Davo… Male   "Weste… "In … ""    TRUE  <chr>  <chr>   ""    
# ℹ 20 more rows
# ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
#   books <list>, povBooks <list>, tvSeries <list>, playedBy <list>

Aliases:

aliases <- chars |> 
  unnest_wider(got_chars) |> 
  select(id, aliases) |> 
  unnest_longer(aliases) |> 
  filter(aliases != "") |> 
  rename(alias = aliases)
aliases
# A tibble: 107 × 2
      id alias             
   <int> <chr>             
 1  1022 Prince of Fools   
 2  1022 Theon Turncloak   
 3  1022 Reek              
 4  1022 Theon Kinslayer   
 5  1052 The Imp           
 6  1052 Halfman           
 7  1052 The boyman        
 8  1052 Giant of Lannister
 9  1052 Lord Tywin's Doom 
10  1052 Lord Tywin's Bane 
# ℹ 97 more rows

Allegiances:

allegiances <- chars |> 
  unnest_wider(got_chars) |> 
  select(id, allegiances) |> 
  unnest_longer(allegiances) |> 
  filter(allegiances != "") |> 
  rename(allegiance = allegiances)
allegiances
# A tibble: 33 × 2
      id allegiance                       
   <int> <chr>                            
 1  1022 House Greyjoy of Pyke            
 2  1052 House Lannister of Casterly Rock 
 3  1074 House Greyjoy of Pyke            
 4  1166 House Nymeros Martell of Sunspear
 5   130 House Nymeros Martell of Sunspear
 6  1303 House Targaryen of King's Landing
 7  1319 House Baratheon of Dragonstone   
 8  1319 House Seaworth of Cape Wrath     
 9   148 House Stark of Winterfell        
10   149 House Oakheart of Old Oak        
# ℹ 23 more rows

Books:

books <- chars |> 
  unnest_wider(got_chars) |> 
  select(id, books) |> 
  unnest_longer(books) |> 
  filter(books != "") |> 
  rename(book = books)
books
# A tibble: 77 × 2
      id book                     
   <int> <chr>                    
 1  1022 A Game of Thrones        
 2  1022 A Storm of Swords        
 3  1022 A Feast for Crows        
 4  1052 A Feast for Crows        
 5  1052 The World of Ice and Fire
 6  1074 A Game of Thrones        
 7  1074 A Clash of Kings         
 8  1074 A Storm of Swords        
 9  1109 A Clash of Kings         
10  1166 A Game of Thrones        
# ℹ 67 more rows

TV series

tvSeries <- chars |> 
  unnest_wider(got_chars) |> 
  select(id, tvSeries) |> 
  unnest_longer(tvSeries) |> 
  filter(tvSeries != "") |> 
  rename(tv_series = tvSeries)
tvSeries
# A tibble: 93 × 2
      id tv_series
   <int> <chr>    
 1  1022 Season 1 
 2  1022 Season 2 
 3  1022 Season 3 
 4  1022 Season 4 
 5  1022 Season 5 
 6  1022 Season 6 
 7  1052 Season 1 
 8  1052 Season 2 
 9  1052 Season 3 
10  1052 Season 4 
# ℹ 83 more rows

Webscraping Open Secrets

In this assignment we will scrape and work with data foreign connected PACs that donate to US political campaigns. In the United States, only American citizens and green card holders can contribute to federal elections, but the American divisions of foreign companies can form political action committees (PACs) and collect contributions from their American employees.

First, we will get data foreign connected PAC contributions in the 2022 election cycle. Then, you will use a similar approach to get data such contributions from previous years so that we can examine trends over time.

In order to complete this assignment you will need a Chrome browser with the Selector Gadget extension installed.

In addition to tidyverse, you will need to install and load the packages robotstxt and rvest

Problem 2:

  • Check that open secrets allows you to webscrape by running the paths_allowed function on the url https://www.opensecrets.org.
paths_allowed("https://www.opensecrets.org")

 www.opensecrets.org                      
[1] TRUE
  • Write a function called scrape_pac() that scrapes information from the Open Secrets webpage for foreign connected PAC contributions in a given year. The url for this data is https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024. This function should take the url of the webpage as its only input and should output a data frame. The variables of this data-frame should be renamed so that they are in snake_case format (lower_case_and_underscores_for_spaces, see R4DS section 2.3). Use str_squish() to remove excess whitespace from the Country of Origin/Parent Company variables, and add a new column which records the year by extracting from the input url.

Hint: If you have trouble finding the right elements to search for using the selector gadget try looking for a table element.

scrape_pac <- function(url) {
  webpage <- read_html(url)
  
  pac_table <- webpage |> 
    html_elements("table") |> 
    html_table()
  
  pac_table <- pac_table[[1]]
  
  pac_table <- pac_table |> 
    rename_with(~ . |> 
                   str_replace_all("\\s", "_") |> 
                   str_replace_all("[//-/(/)]", "_") |> 
                   str_replace_all("__", "_") |> 
                   str_replace_all("_$", "") |> 
                   str_to_lower()) |> 
    mutate(
      country_of_origin_parent_company = 
        str_squish(country_of_origin_parent_company),
      year = str_extract(url, "\\d{4}$")
    )
}  
  • Test your function on the urls for 2024, 2022, and 2000, and show the first several rows of each of the outputs. Does the function seem to do what you expected it to do?
url_2024 <- 
  "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024"
pac_data_2024 <- scrape_pac(url_2024)

head(pac_data_2024)
# A tibble: 6 × 6
  pac_name_affiliate             country_of_origin_pa…¹ total dems  repubs year 
  <chr>                          <chr>                  <chr> <chr> <chr>  <chr>
1 Accenture (Accenture)          Ireland/Accenture plc  $38,… $18,… $20,5… 2024 
2 Acreage Holdings               Canada/Acreage Holdin… $0    $0    $0     2024 
3 Air Liquide America            France/L'Air Liquide … $21,… $3,5… $17,5… 2024 
4 Airbus Group                   Netherlands/Airbus Gr… $268… $78,… $189,… 2024 
5 Alexion Pharmaceuticals (Astr… UK/AstraZeneca PLC     $115… $61,… $54,0… 2024 
6 Alkermes Inc                   Ireland/Alkermes Plc   $73,… $37,… $36,0… 2024 
# ℹ abbreviated name: ¹​country_of_origin_parent_company
url_2022 <- 
  "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2022"
pac_data_2022 <- scrape_pac(url_2022)

head(pac_data_2022)
# A tibble: 6 × 6
  pac_name_affiliate             country_of_origin_pa…¹ total dems  repubs year 
  <chr>                          <chr>                  <chr> <chr> <chr>  <chr>
1 Accenture (Accenture)          Ireland/Accenture plc  $3,0… $0    $3,000 2022 
2 Acreage Holdings               Canada/Acreage Holdin… $0    $0    $0     2022 
3 Air Liquide America            France/L'Air Liquide … $17,… $14,… $2,500 2022 
4 Airbus Group                   Netherlands/Airbus Gr… $193… $82,… $111,… 2022 
5 Alexion Pharmaceuticals (Astr… UK/AstraZeneca PLC     $186… $104… $82,2… 2022 
6 Alkermes Inc                   Ireland/Alkermes Plc   $84,… $34,… $50,0… 2022 
# ℹ abbreviated name: ¹​country_of_origin_parent_company
url_2000 <- 
  "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2000"
pac_data_2000 <- scrape_pac(url_2000)

head(pac_data_2000)
# A tibble: 6 × 6
  pac_name_affiliate        country_of_origin_parent_…¹ total dems  repubs year 
  <chr>                     <chr>                       <chr> <chr> <chr>  <chr>
1 7-Eleven                  Japan/Ito-Yokado            $8,5… $1,5… $7,000 2000 
2 ABB Group                 Switzerland/Asea Brown Bov… $46,… $17,… $28,5… 2000 
3 Accenture                 UK/Accenture plc            $75,… $23,… $52,9… 2000 
4 ACE INA                   UK/ACE Group                $38,… $12,… $26,0… 2000 
5 Acuson Corp (Siemens AG)  Germany/Siemens AG          $2,0… $2,0… $0     2000 
6 Adtranz (DaimlerChrysler) Germany/DaimlerChrysler AG  $10,… $10,… $500   2000 
# ℹ abbreviated name: ¹​country_of_origin_parent_company

After a fair amount of agonizing trial and error, I am very glad to report that yes, this works as expected and intended.

Problem 3:

  • Construct a vector called urls that contains the URLs for each webpage that contains information on foreign-connected PAC contributions for a given year (combine seq and string functions). Using the map_dfr function from the purrr package, apply the scrape_pac() function over urls in a way that will result in a data frame called pac_all that contains the data for all of the years.
years <- seq(2000, 2024, by = 2)

urls <- 
  paste0(
    "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/",
    years)

pac_all <- map_dfr(urls, scrape_pac)

pac_all
# A tibble: 2,639 × 6
   pac_name_affiliate            country_of_origin_pa…¹ total dems  repubs year 
   <chr>                         <chr>                  <chr> <chr> <chr>  <chr>
 1 7-Eleven                      Japan/Ito-Yokado       $8,5… $1,5… $7,000 2000 
 2 ABB Group                     Switzerland/Asea Brow… $46,… $17,… $28,5… 2000 
 3 Accenture                     UK/Accenture plc       $75,… $23,… $52,9… 2000 
 4 ACE INA                       UK/ACE Group           $38,… $12,… $26,0… 2000 
 5 Acuson Corp (Siemens AG)      Germany/Siemens AG     $2,0… $2,0… $0     2000 
 6 Adtranz (DaimlerChrysler)     Germany/DaimlerChrysl… $10,… $10,… $500   2000 
 7 AE Staley Manufacturing (Tat… UK/Tate & Lyle         $24,… $10,… $14,0… 2000 
 8 AEGON USA (AEGON NV)          Netherlands/Aegon NV   $58,… $10,… $47,7… 2000 
 9 AIM Management Group          UK/AMVESCAP            $25,… $10,… $15,0… 2000 
10 Air Liquide America           France/L'Air Liquide … $0    $0    $0     2000 
# ℹ 2,629 more rows
# ℹ abbreviated name: ¹​country_of_origin_parent_company
pac_all |> 
  group_by(country_of_origin_parent_company) |> 
  summarise(n = n())
# A tibble: 448 × 2
   country_of_origin_parent_company      n
   <chr>                             <int>
 1 Australia/Austal Ltd                  6
 2 Australia/BHP Billiton                1
 3 Australia/Boral Ltd                   4
 4 Australia/Broken Hill Proprietary     1
 5 Australia/CSL Ltd                    10
 6 Australia/CSR Ltd                     1
 7 Australia/Computershare Limited       1
 8 Australia/GenesisCare                 1
 9 Australia/Glenfarne Group             2
10 Australia/LNG Limited                 1
# ℹ 438 more rows
pac_all <- pac_all |> 
  separate(
    country_of_origin_parent_company,
    into = c("country_of_origin", "parent_company"),
    sep = "/",
    extra = "merge",
    fill = "right"
    )
unique(pac_all$country_of_origin)
 [1] "Japan"                "Switzerland"          "UK"                  
 [4] "Germany"              "Netherlands"          "France"              
 [7] "Canada"               "Ireland"              "Singapore"           
[10] "Mexico"               "Australia"            "Sweden"              
[13] "Italy"                "Belgium"              "Denmark"             
[16] "South Korea"          "Panama"               "Luxembourg"          
[19] "South Africa"         "Malaysia"             "Spain"               
[22] "Israel"               "Taiwan"               "Portugal"            
[25] "Brazil"               "Russia"               "Norway"              
[28] "United Arab Emirates" "China"                "Finland"             
[31] "Saudi Arabia"        
str(pac_all)
tibble [2,639 × 7] (S3: tbl_df/tbl/data.frame)
 $ pac_name_affiliate: chr [1:2639] "7-Eleven" "ABB Group" "Accenture" "ACE INA" ...
 $ country_of_origin : chr [1:2639] "Japan" "Switzerland" "UK" "UK" ...
 $ parent_company    : chr [1:2639] "Ito-Yokado" "Asea Brown Boveri" "Accenture plc" "ACE Group" ...
 $ total             : chr [1:2639] "$8,500" "$46,000" "$75,984" "$38,500" ...
 $ dems              : chr [1:2639] "$1,500" "$17,000" "$23,000" "$12,500" ...
 $ repubs            : chr [1:2639] "$7,000" "$28,500" "$52,984" "$26,000" ...
 $ year              : chr [1:2639] "2000" "2000" "2000" "2000" ...
pac_all2 <- pac_all |> 
  mutate(
    total = parse_number(total),
    dems = parse_number(dems),
    repubs = parse_number(repubs)
  )
Warning: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `total = parse_number(total)`.
Caused by warning:
! 13 parsing failures.
row col expected  actual
 39  -- a number -$3,000
230  -- a number -$4,000
430  -- a number -$9,050
763  -- a number -$2,000
946  -- a number -$2,500
... ... ........ .......
See problems(...) for more details.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
pac_all3 <- pac_all |> 
  mutate(
    total = parse_number(if_else(
      str_starts(total, "-$"), 
      paste0("-", gsub("[$,]", "", total)), 
      gsub("[$,]", "", total)
    )),
    dems = parse_number(if_else(
      str_starts(dems, "-$"), 
      paste0("-", gsub("[$,]", "", dems)), 
      gsub("[$,]", "", dems)
    )),
    repubs = parse_number(if_else(
      str_starts(repubs, "-$"), 
      paste0("-", gsub("[$,]", "", repubs)), 
      gsub("[$,]", "", repubs)
    ))
  )
str(pac_all3)
tibble [2,639 × 7] (S3: tbl_df/tbl/data.frame)
 $ pac_name_affiliate: chr [1:2639] "7-Eleven" "ABB Group" "Accenture" "ACE INA" ...
 $ country_of_origin : chr [1:2639] "Japan" "Switzerland" "UK" "UK" ...
 $ parent_company    : chr [1:2639] "Ito-Yokado" "Asea Brown Boveri" "Accenture plc" "ACE Group" ...
 $ total             : num [1:2639] 8500 46000 75984 38500 2000 ...
 $ dems              : num [1:2639] 1500 17000 23000 12500 2000 10000 10000 10500 10000 0 ...
 $ repubs            : num [1:2639] 7000 28500 52984 26000 0 ...
 $ year              : chr [1:2639] "2000" "2000" "2000" "2000" ...
  • Clean this combined dataset by separating the country of origin from the parent company (use separate_wider_delim or another tool of your choice, you will need to be cautious with some special cases in this column) and by converting the strings in the total, dems, and repubs columns into numbers. Print out the top 10 rows over your dataset after completing these steps.
head(pac_all3, 10)
# A tibble: 10 × 7
   pac_name_affiliate  country_of_origin parent_company total  dems repubs year 
   <chr>               <chr>             <chr>          <dbl> <dbl>  <dbl> <chr>
 1 7-Eleven            Japan             Ito-Yokado      8500  1500   7000 2000 
 2 ABB Group           Switzerland       Asea Brown Bo… 46000 17000  28500 2000 
 3 Accenture           UK                Accenture plc  75984 23000  52984 2000 
 4 ACE INA             UK                ACE Group      38500 12500  26000 2000 
 5 Acuson Corp (Sieme… Germany           Siemens AG      2000  2000      0 2000 
 6 Adtranz (DaimlerCh… Germany           DaimlerChrysl… 10500 10000    500 2000 
 7 AE Staley Manufact… UK                Tate & Lyle    24000 10000  14000 2000 
 8 AEGON USA (AEGON N… Netherlands       Aegon NV       58250 10500  47750 2000 
 9 AIM Management Gro… UK                AMVESCAP       25000 10000  15000 2000 
10 Air Liquide America France            L'Air Liquide…     0     0      0 2000 
  • Calculate the total contributions from PACs linked to Canada and Mexico each year and plot how these contributions change over time.
mexico_canada_totals <- pac_all3 |> 
  filter(country_of_origin %in% c("Canada", "Mexico")) |> 
  select(country_of_origin, total, year) |> 
  group_by(country_of_origin, year) |> 
  summarise(total_contribution = sum(total))
`summarise()` has grouped output by 'country_of_origin'. You can override using
the `.groups` argument.
ggplot(mexico_canada_totals, 
       aes(x = year, y = total_contribution, color = country_of_origin)) +
  geom_point()+
  labs(
    title = 
      "Total Contributions from PACs linked to Canada and Mexico",
    subtitle = "2000-2024 Total Contributions via OpenSecrets.org"
  )

  • Find the 5 countries who over the entire time period of the dataset have the greatest total contribution from affiliated PACs. Then calculate the total contribution for each of those countries for each year of the data and make a plot of it to visualize how the contributions have changed over time.
pac_all3 |> 
  group_by(country_of_origin) |> 
  summarize(total_amount = sum(total)) |> 
  arrange(desc(total_amount))
# A tibble: 31 × 2
   country_of_origin total_amount
   <chr>                    <dbl>
 1 UK                    60684096
 2 Switzerland           36718269
 3 Germany               28439114
 4 Japan                 18110947
 5 Ireland               14531967
 6 Canada                12344489
 7 Netherlands           12344376
 8 France                11619790
 9 Belgium                6917400
10 Denmark                4521338
# ℹ 21 more rows
top_10_country_contributions <- pac_all3 |> 
  filter(country_of_origin %in% 
           c("UK", "Switzerland", "Germany", "Japan", "Ireland", 
             "Netherlands", "Canada", "France", "Belgium", "Denmark")) |> 
  select(country_of_origin, total, year) |> 
  group_by(country_of_origin, year) |> 
  summarise(total_contribution = sum(total)) |> 
  mutate(year = as.numeric(year))
`summarise()` has grouped output by 'country_of_origin'. You can override using
the `.groups` argument.
ggplot(top_10_country_contributions, 
       aes(x = year, y = total_contribution, color = country_of_origin)) +
  geom_line() +
  labs(
    title = 
      "Contributions from PACs linked to Countries with Highest Total Overall",
    subtitle = "2000-2024 Total Contributions via OpenSecrets.org"
  )