Web Scraping

Overview

This is a two-part assignment. In the first part you will practice rectangling on a dataset from the repurrrsive package. In the second part you will combine the rvest package with functions and iteration to scrape data on foreign-linked political action committees from the OpenSecrets website.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.1
## Warning: package 'ggplot2' was built under R version 4.4.1
## Warning: package 'stringr' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(repurrrsive)
## Warning: package 'repurrrsive' was built under R version 4.4.1
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten
library(rvest)
## Warning: package 'rvest' was built under R version 4.4.1
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(robotstxt)
## Warning: package 'robotstxt' was built under R version 4.4.1
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.1
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tsibble     1.1.5     ✔ fable       0.3.4
## ✔ tsibbledata 0.4.1     ✔ fabletools  0.4.2
## ✔ feasts      0.3.2
## Warning: package 'tsibble' was built under R version 4.4.1
## Warning: package 'tsibbledata' was built under R version 4.4.1
## Warning: package 'feasts' was built under R version 4.4.1
## Warning: package 'fabletools' was built under R version 4.4.1
## Warning: package 'fable' was built under R version 4.4.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ jsonlite::flatten()  masks purrr::flatten()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(rvest)
library(dplyr)
library(stringr)
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.1
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.4.1
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.1
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Rectangling

Problem 1:

Load the repurrrsive package to get access to the got_chars dataset. In section 23.4.2 of R4DS, there is code that extracts data from the got_chars list and converts it into a tibble with information on each character, plus a separate tibble containing the titles held by each character. Perform similar operations to create separate tibbles containing the aliases, allegiances, books, and TV series of each Game of Thrones character.

chars <- tibble(json = got_chars)
head(chars)
## # A tibble: 6 × 1
##   json             
##   <list>           
## 1 <named list [18]>
## 2 <named list [18]>
## 3 <named list [18]>
## 4 <named list [18]>
## 5 <named list [18]>
## 6 <named list [18]>
chars |>
  unnest_wider(json)
## # A tibble: 30 × 18
##    url           id name  gender culture born  died  alive titles aliases father
##    <chr>      <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
##  1 https://w…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
##  2 https://w…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
##  3 https://w…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
##  4 https://w…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
##  5 https://w…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
##  6 https://w…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
##  7 https://w…  1295 Cres… Male   ""      "In … "In … FALSE <chr>  <chr>   ""    
##  8 https://w…   130 Aria… Female "Dorni… "In … ""    TRUE  <chr>  <chr>   ""    
##  9 https://w…  1303 Daen… Female "Valyr… "In … ""    TRUE  <chr>  <chr>   ""    
## 10 https://w…  1319 Davo… Male   "Weste… "In … ""    TRUE  <chr>  <chr>   ""    
## # ℹ 20 more rows
## # ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
## #   books <list>, povBooks <list>, tvSeries <list>, playedBy <list>
titles <- chars |> 
  unnest_wider(json) |> 
  select(id, titles) |> 
  unnest_longer(titles) |> 
  filter(titles != "") |> 
  rename(title = titles)
titles
## # A tibble: 52 × 2
##       id title                                                                  
##    <int> <chr>                                                                  
##  1  1022 Prince of Winterfell                                                   
##  2  1022 Lord of the Iron Islands (by law of the green lands)                   
##  3  1052 Acting Hand of the King (former)                                       
##  4  1052 Master of Coin (former)                                                
##  5  1074 Lord Captain of the Iron Fleet                                         
##  6  1074 Master of the Iron Victory                                             
##  7  1166 Captain of the Guard at Sunspear                                       
##  8  1295 Maester                                                                
##  9   130 Princess of Dorne                                                      
## 10  1303 Queen of the Andals and the Rhoynar and the First Men, Lord of the Sev…
## # ℹ 42 more rows

Aliases

aliases <- chars |>
  unnest_wider(json) |>
  select(id, aliases) |>
  unnest_longer(aliases) |>
  filter(aliases != "")

aliases
## # A tibble: 107 × 2
##       id aliases           
##    <int> <chr>             
##  1  1022 Prince of Fools   
##  2  1022 Theon Turncloak   
##  3  1022 Reek              
##  4  1022 Theon Kinslayer   
##  5  1052 The Imp           
##  6  1052 Halfman           
##  7  1052 The boyman        
##  8  1052 Giant of Lannister
##  9  1052 Lord Tywin's Doom 
## 10  1052 Lord Tywin's Bane 
## # ℹ 97 more rows

Allegiances

allegiance <- chars |>
  unnest_wider(json) |>
  select(id, allegiances) |>
  unnest_longer(allegiances) |>
  filter(allegiances != "") |>
  rename(allegiance = allegiances)

allegiance
## # A tibble: 33 × 2
##       id allegiance                       
##    <int> <chr>                            
##  1  1022 House Greyjoy of Pyke            
##  2  1052 House Lannister of Casterly Rock 
##  3  1074 House Greyjoy of Pyke            
##  4  1166 House Nymeros Martell of Sunspear
##  5   130 House Nymeros Martell of Sunspear
##  6  1303 House Targaryen of King's Landing
##  7  1319 House Baratheon of Dragonstone   
##  8  1319 House Seaworth of Cape Wrath     
##  9   148 House Stark of Winterfell        
## 10   149 House Oakheart of Old Oak        
## # ℹ 23 more rows

Books

books <- chars |>
  unnest_wider(json) |>
  select(id, books) |>
  unnest_longer(books) |>
  filter(books != "") |>
  rename(book = books)
books |>
  head(10)
## # A tibble: 10 × 2
##       id book                     
##    <int> <chr>                    
##  1  1022 A Game of Thrones        
##  2  1022 A Storm of Swords        
##  3  1022 A Feast for Crows        
##  4  1052 A Feast for Crows        
##  5  1052 The World of Ice and Fire
##  6  1074 A Game of Thrones        
##  7  1074 A Clash of Kings         
##  8  1074 A Storm of Swords        
##  9  1109 A Clash of Kings         
## 10  1166 A Game of Thrones

TV series

tv_series <- chars |>
  unnest_wider(json) |>
  select(id, tvSeries) |>
  unnest_longer(tvSeries) |>
  filter(tvSeries != "") |>
  rename(season = tvSeries)

tv_series
## # A tibble: 93 × 2
##       id season  
##    <int> <chr>   
##  1  1022 Season 1
##  2  1022 Season 2
##  3  1022 Season 3
##  4  1022 Season 4
##  5  1022 Season 5
##  6  1022 Season 6
##  7  1052 Season 1
##  8  1052 Season 2
##  9  1052 Season 3
## 10  1052 Season 4
## # ℹ 83 more rows

Web Scraping OpenSecrets

In this assignment we will scrape and work with data on foreign-connected PACs that donate to US political campaigns. In the United States, only American citizens and green card holders can contribute to federal elections, but the American divisions of foreign companies can form political action committees (PACs) and collect contributions from their American employees.

First, we will get data on foreign-connected PAC contributions in the 2022 election cycle. Then, you will use a similar approach to get data on such contributions from previous years so that we can examine trends over time.

In order to complete this assignment you will need a Chrome browser with the Selector Gadget extension installed.

In addition to tidyverse, you will need to install and load the robotstxt and rvest packages.
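
If they are not already installed:

install.packages(c("robotstxt", "rvest"))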

Problem 2:

  • Check that OpenSecrets allows you to scrape by running the paths_allowed() function on the URL https://www.opensecrets.org.

  • Write a function called scrape_pac() that scrapes information from the OpenSecrets webpage for foreign-connected PAC contributions in a given year. The URL for this data is https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024. This function should take the URL of the webpage as its only input and should output a data frame. The variables of this data frame should be renamed so that they are in snake_case format (lower_case_with_underscores_for_spaces, see R4DS section 2.3). Use str_squish() to remove excess whitespace from the Country of Origin/Parent Company variable, and add a new column that records the year, extracted from the input URL.

Hint: If you have trouble finding the right elements to search for using Selector Gadget, try looking for a table element (a quick check is sketched after the paths_allowed() call below).

  • Test your function on the URLs for 2024, 2022, and 2020, and show the first several rows of each of the outputs. Does the function seem to do what you expected it to do?
paths_allowed("https://www.opensecrets.org")
##  www.opensecrets.org
## [1] TRUE
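
Following the hint above, one quick sanity check before writing the full function (a sketch; read_html_live() drives a headless Chrome session via chromote, and the bare "table" selector is an assumption about the page structure):

# If this returns a non-empty node set, the page exposes a parseable table
read_html_live("https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024") |>
  html_elements("table")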

Function

scrape_pac <- function(url) {
  # Read the live webpage content
  site <- read_html_live(url)
  
  # Extract the table using XPath
  site_table <- site |>
    html_elements(xpath = '//*[@id="DataTables_Table_0"]') |>
    html_table()
  
  # Check if the table extraction was successful and is a valid data frame
  if (length(site_table) == 0 || !is.data.frame(site_table[[1]])) {
    warning("No valid table found on the page.")
    return(NULL)
  }
  
  # Assuming there's only one table returned, take the first one
  site_table <- site_table[[1]]
  
  # Clean column names to snake_case if valid
  if (!is.null(names(site_table))) {
    site_table <- site_table |>
      clean_names(case = "snake")
  }
  
  # Use str_squish to clean whitespace in 'country_of_origin/parent_company' column
  site_table <- site_table |>
    mutate(across(contains("country_of_origin"), ~ str_squish(.)))
  
  # Extract the year from the URL and add it as a new column
  year <- str_extract(url, "\\d{4}")
  site_table <- site_table |>
    mutate(year = year)
  
  # Release memory after processing
  rm(site)
  gc()
  
  # Return the cleaned table
  return(site_table)
}
# Testing the 2024, 2022, and 2020 URLs
url_2024 <- "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024"
url_2022 <- "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2022"
url_2020 <- "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2020"
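
scrape_pac() uses read_html_live() because the OpenSecrets table appears to be rendered client-side. If the table turns out to be served in the static HTML, a lighter variant (a sketch, untested against the live site) would avoid the headless-browser dependency:

# Static-read fallback: assumes the table is present in the initial HTML
scrape_pac_static <- function(url) {
  read_html(url) |>
    html_element("table") |>
    html_table() |>
    clean_names(case = "snake") |>
    mutate(
      across(contains("country_of_origin"), str_squish),
      year = str_extract(url, "\\d{4}")
    )
}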

Test

result_2024 <- scrape_pac(url_2024)
## Warning in scrape_pac(url_2024): No valid table found on the page.
# Display the result
head(result_2024)
## NULL
result_2022 <- scrape_pac(url_2022)
## Warning in scrape_pac(url_2022): No valid table found on the page.
head(result_2022)
## NULL
result_2020 <- scrape_pac(url_2020)
## Warning in scrape_pac(url_2020): No valid table found on the page.
head(result_2020)
## NULL

Problem 3:

  • Construct a vector called urls that contains the URLs for each webpage with information on foreign-connected PAC contributions for a given year (combine seq() and string functions). Then, using the map_dfr() function from the purrr package, apply scrape_pac() over urls to build a data frame called pac_all that contains the data for all of the years.

    Vector of URLs, scrape pages

    # Sequence of election cycles from 2000 to 2024, every two years
    years <- seq(2000, 2024, 2)
    urls <- as.character(str_glue("https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/{years}"))
    
    # Print urls to confirm
    print(urls)
    ##  [1] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2000"
    ##  [2] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2002"
    ##  [3] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2004"
    ##  [4] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2006"
    ##  [5] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2008"
    ##  [6] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2010"
    ##  [7] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2012"
    ##  [8] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2014"
    ##  [9] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2016"
    ## [10] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2018"
    ## [11] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2020"
    ## [12] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2022"
    ## [13] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024"
    pac_all <- map_dfr(urls, scrape_pac)
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
    ## Warning in .f(.x[[i]], ...): No valid table found on the page.
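
    Since purrr 1.0.0, map_dfr() has been superseded; an equivalent idiom (a sketch producing the same pac_all) is map() followed by list_rbind(), which silently drops the NULL results from failed scrapes:

    pac_all <- map(urls, scrape_pac) |>
      list_rbind()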

    Cleaning table

  • Clean this combined dataset by separating the country of origin from the parent company (use separate_wider_delim() or another tool of your choice; you will need to be cautious with some special cases in this column) and by converting the strings in the total, dems, and repubs columns into numbers. Print out the top 10 rows of your dataset after completing these steps.

    pac_all <- pac_all |>
      separate_wider_delim(
        country_of_origin_parent_company,
        delim = "/",
        names = c("country_of_origin", "parent_company"),
        too_few = "align_start"
      )
    pac_all <- pac_all |>
      mutate(
        total = parse_number(total),
        dems = parse_number(dems),
        repubs = parse_number(repubs)
      )
    head(pac_all,10)
    ## # A tibble: 10 × 7
    ##    pac_name_affiliate country_of_origin parent_company  total  dems repubs year 
    ##    <chr>              <chr>             <chr>           <dbl> <dbl>  <dbl> <chr>
    ##  1 7-Eleven           Japan             Ito-Yokado       7070  2528   4542 2002 
    ##  2 ABB Group          Switzerland       Asea Brown Bo…  34500 18750  15750 2002 
    ##  3 Accenture          UK                Accenture plc  220947 94242 126705 2002 
    ##  4 ACE INA            UK                ACE Group       77250 31500  45750 2002 
    ##  5 Acuson Corp (Siem… Germany           Siemens AG          0     0      0 2002 
    ##  6 AE Staley Manufac… UK                Tate & Lyle     28500 12500  16000 2002 
    ##  7 AEGON USA (AEGON … Netherlands       Aegon NV       164500 73500  91000 2002 
    ##  8 AIM Management Gr… UK                AMVESCAP        24500  8500  16000 2002 
    ##  9 Air Liquide Ameri… France            L'Air Liquide…   2650  2000    650 2002 
    ## 10 Alcatel USA        France            Alcatel         13750  4500   9250 2002
    # Export to CSV so the document can be re-rendered without re-running the scrape
    write.csv(pac_all, "pac_all.csv", row.names = FALSE)
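
    A quick check on the special cases the prompt warns about (a sketch): tabulating the distinct country values makes rows where the slash split went wrong easy to spot.

    pac_all |>
      count(country_of_origin, sort = TRUE)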

    Neighbors

  • Calculate the total contributions from PACs linked to Canada and Mexico each year and plot how these contributions change over time.

    neighbors_total <- pac_all |>
      filter(country_of_origin %in% c("Canada", "Mexico")) |>
      group_by(year) |>
      summarize(total_contributions = sum(total, na.rm = TRUE))
    
    
    # Plot the combined total contributions over time
    ggplot(neighbors_total, aes(x = year, y = total_contributions)) +
      geom_col(fill = "skyblue", color = "blue") +
      labs(
        title = "Total Contributions from Nosy Neighbors (Canada and Mexico) Over Time",
        x = "Year",
        y = "Total Contributions"
      ) +
      theme_minimal()

    Top 5

  • Find the 5 countries who over the entire time period of the dataset have the greatest total contribution from affiliated PACs. Then calculate the total contribution for each of those countries for each year of the data and make a plot of it to visualize how the contributions have changed over time.

top_countries <- pac_all |>
  group_by(country_of_origin) |>
  summarize(total_contribution = sum(total, na.rm = TRUE)) |>
  arrange(desc(total_contribution)) |>
  slice(1:5)
top_countries_v <- top_countries$country_of_origin
top_countries
## # A tibble: 5 × 2
##   country_of_origin total_contribution
##   <chr>                          <dbl>
## 1 UK                           1180698
## 2 France                        508730
## 3 Netherlands                   350200
## 4 Germany                       211400
## 5 Switzerland                    85220
top_countries_v
## [1] "UK"          "France"      "Netherlands" "Germany"     "Switzerland"
top_5_y <- pac_all |>
  filter(country_of_origin %in% top_countries_v) |>
  group_by(country_of_origin, year) |>
  summarize(yearly_total = sum(total, na.rm = TRUE))
## `summarise()` has grouped output by 'country_of_origin'. You can override using
## the `.groups` argument.
# Plotting

ggplotly(
  ggplot(top_5_y, aes(x = year, y = yearly_total, color = country_of_origin, group = country_of_origin)) +
    geom_line(linewidth = 1) +
    labs(
      title = "Annual Contributions from Affiliated PACs for Top 5 Countries",
      x = "Year",
      y = "Total Contribution",
      color = "Country"
    ) +
    theme_minimal() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1)
    )
)
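
One optional refinement (a sketch): year comes out of scrape_pac() as a character column, so the x axis above is ordered alphabetically rather than numerically (the two happen to coincide for these four-digit years). Converting it to integer before plotting gives a true continuous axis:

top_5_y <- top_5_y |>
  mutate(year = as.integer(year))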