Overview
This is a two part assignment. In the first part of the assignment
you will practice rectangling on a dataset from the
repurrrsive
package. In the second part you will combine
the rvest
package along with functions and iteration to
scrape data on foreign linked political action committees from the
website open secrets.
## Warning: package 'tidyverse' was built under R version 4.4.1
## Warning: package 'ggplot2' was built under R version 4.4.1
## Warning: package 'stringr' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'repurrrsive' was built under R version 4.4.1
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
## Warning: package 'rvest' was built under R version 4.4.1
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
## Warning: package 'robotstxt' was built under R version 4.4.1
## Warning: package 'fpp3' was built under R version 4.4.1
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tsibble 1.1.5 ✔ fable 0.3.4
## ✔ tsibbledata 0.4.1 ✔ fabletools 0.4.2
## ✔ feasts 0.3.2
## Warning: package 'tsibble' was built under R version 4.4.1
## Warning: package 'tsibbledata' was built under R version 4.4.1
## Warning: package 'feasts' was built under R version 4.4.1
## Warning: package 'fabletools' was built under R version 4.4.1
## Warning: package 'fable' was built under R version 4.4.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ jsonlite::flatten() masks purrr::flatten()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
## Warning: package 'janitor' was built under R version 4.4.1
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## Warning: package 'ggridges' was built under R version 4.4.1
## Warning: package 'plotly' was built under R version 4.4.1
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Load the repurrrsive
package to get access to get access
to the got_chars
dataset. In section 23.4.2 of R4DS, there
is code that extracts data from the got_chars
list and
converts it into a tibble with information on each character and a
separate tibble which contains information on the titles held by each
character. Perform similar operations to create separate tibbles
containing the aliases, allegiances, books, and TV series of each Game
of Thrones character.
## # A tibble: 6 × 1
## json
## <list>
## 1 <named list [18]>
## 2 <named list [18]>
## 3 <named list [18]>
## 4 <named list [18]>
## 5 <named list [18]>
## 6 <named list [18]>
## # A tibble: 30 × 18
## url id name gender culture born died alive titles aliases father
## <chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
## 1 https://w… 1022 Theo… Male "Ironb… "In … "" TRUE <chr> <chr> ""
## 2 https://w… 1052 Tyri… Male "" "In … "" TRUE <chr> <chr> ""
## 3 https://w… 1074 Vict… Male "Ironb… "In … "" TRUE <chr> <chr> ""
## 4 https://w… 1109 Will Male "" "" "In … FALSE <chr> <chr> ""
## 5 https://w… 1166 Areo… Male "Norvo… "In … "" TRUE <chr> <chr> ""
## 6 https://w… 1267 Chett Male "" "At … "In … FALSE <chr> <chr> ""
## 7 https://w… 1295 Cres… Male "" "In … "In … FALSE <chr> <chr> ""
## 8 https://w… 130 Aria… Female "Dorni… "In … "" TRUE <chr> <chr> ""
## 9 https://w… 1303 Daen… Female "Valyr… "In … "" TRUE <chr> <chr> ""
## 10 https://w… 1319 Davo… Male "Weste… "In … "" TRUE <chr> <chr> ""
## # ℹ 20 more rows
## # ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
## # books <list>, povBooks <list>, tvSeries <list>, playedBy <list>
titles <- chars |>
unnest_wider(json) |>
select(id, titles) |>
unnest_longer(titles) |>
filter(titles != "") |>
rename(title = titles)
titles
## # A tibble: 52 × 2
## id title
## <int> <chr>
## 1 1022 Prince of Winterfell
## 2 1022 Lord of the Iron Islands (by law of the green lands)
## 3 1052 Acting Hand of the King (former)
## 4 1052 Master of Coin (former)
## 5 1074 Lord Captain of the Iron Fleet
## 6 1074 Master of the Iron Victory
## 7 1166 Captain of the Guard at Sunspear
## 8 1295 Maester
## 9 130 Princess of Dorne
## 10 1303 Queen of the Andals and the Rhoynar and the First Men, Lord of the Sev…
## # ℹ 42 more rows
aliases <- chars |>
unnest_wider(json) |>
select(id, aliases) |>
unnest_longer(aliases) |>
filter(aliases != "") |>
rename(aliases = aliases)
aliases
## # A tibble: 107 × 2
## id aliases
## <int> <chr>
## 1 1022 Prince of Fools
## 2 1022 Theon Turncloak
## 3 1022 Reek
## 4 1022 Theon Kinslayer
## 5 1052 The Imp
## 6 1052 Halfman
## 7 1052 The boyman
## 8 1052 Giant of Lannister
## 9 1052 Lord Tywin's Doom
## 10 1052 Lord Tywin's Bane
## # ℹ 97 more rows
allegiance <- chars |>
unnest_wider(json) |>
select(id, allegiances) |>
unnest_longer(allegiances) |>
filter(allegiances != "") |>
rename(allegiance = allegiances)
allegiance
## # A tibble: 33 × 2
## id allegiance
## <int> <chr>
## 1 1022 House Greyjoy of Pyke
## 2 1052 House Lannister of Casterly Rock
## 3 1074 House Greyjoy of Pyke
## 4 1166 House Nymeros Martell of Sunspear
## 5 130 House Nymeros Martell of Sunspear
## 6 1303 House Targaryen of King's Landing
## 7 1319 House Baratheon of Dragonstone
## 8 1319 House Seaworth of Cape Wrath
## 9 148 House Stark of Winterfell
## 10 149 House Oakheart of Old Oak
## # ℹ 23 more rows
books<- chars|>
unnest_wider(json)|>
select(id, books)|>
unnest_longer(books)|>
filter(books!="")|>
rename(book=books)
books|>
head(10)
## # A tibble: 10 × 2
## id book
## <int> <chr>
## 1 1022 A Game of Thrones
## 2 1022 A Storm of Swords
## 3 1022 A Feast for Crows
## 4 1052 A Feast for Crows
## 5 1052 The World of Ice and Fire
## 6 1074 A Game of Thrones
## 7 1074 A Clash of Kings
## 8 1074 A Storm of Swords
## 9 1109 A Clash of Kings
## 10 1166 A Game of Thrones
tv_series<- chars|>
unnest_wider(json)|>
select(id,tvSeries)|>
unnest_longer(tvSeries)|>
filter(tvSeries!="")|>
rename(season=tvSeries)
tv_series
## # A tibble: 93 × 2
## id season
## <int> <chr>
## 1 1022 Season 1
## 2 1022 Season 2
## 3 1022 Season 3
## 4 1022 Season 4
## 5 1022 Season 5
## 6 1022 Season 6
## 7 1052 Season 1
## 8 1052 Season 2
## 9 1052 Season 3
## 10 1052 Season 4
## # ℹ 83 more rows
In this assignment we will scrape and work with data foreign connected PACs that donate to US political campaigns. In the United States, only American citizens and green card holders can contribute to federal elections, but the American divisions of foreign companies can form political action committees (PACs) and collect contributions from their American employees.
First, we will get data foreign connected PAC contributions in the 2022 election cycle. Then, you will use a similar approach to get data such contributions from previous years so that we can examine trends over time.
In order to complete this assignment you will need a Chrome browser with the Selector Gadget extension installed.
In addition to tidyverse
, you will need to install and
load the packages robotstxt
and rvest
Check that open secrets allows you to webscrape by running the
paths_allowed
function on the url
https://www.opensecrets.org
.
Write a function called scrape_pac()
that scrapes
information from the Open Secrets webpage for foreign connected PAC
contributions in a given year. The url
for this data is https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024.
This function should take the url of the webpage as its only input and
should output a data frame. The variables of this data-frame should be
renamed so that they are in snake_case
format
(lower_case_and_underscores_for_spaces
, see R4DS section
2.3). Use str_squish()
to remove excess whitespace from the
Country of Origin/Parent Company variables, and add a new column which
records the year by extracting from the input url.
Hint: If you have trouble finding the right elements to search for using the selector gadget try looking for a table element.
## www.opensecrets.org
## [1] TRUE
scrape_pac <- function(url) {
# Read the live webpage content
site <- read_html_live(url)
# Extract the table using XPath
site_table <- site |>
html_nodes(xpath = '//*[@id="DataTables_Table_0"]') |>
html_table(fill = TRUE)
# Check if the table extraction was successful and is a valid data frame
if (length(site_table) == 0 || !is.data.frame(site_table[[1]])) {
warning("No valid table found on the page.")
return(NULL)
}
# Assuming there's only one table returned, take the first one
site_table <- site_table[[1]]
# Clean column names to snake_case if valid
if (!is.null(names(site_table))) {
site_table <- site_table |>
clean_names(case = "snake")
}
# Use str_squish to clean whitespace in 'country_of_origin/parent_company' column
site_table <- site_table |>
mutate(across(contains("country_of_origin"), ~ str_squish(.)))
# Extract the year from the URL and add it as a new column
year <- str_extract(url, "\\d{4}")
site_table <- site_table |>
mutate(year = year)
# Release memory after processing
rm(site)
gc()
# Return the cleaned table
return(site_table)
}
# testing 24 ,22, 20
url_2024 <- "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024"
url_2022<- "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2022"
url_2020<-"https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2020"
## Warning in scrape_pac(url_2024): No valid table found on the page.
## NULL
## Warning in scrape_pac(url_2022): No valid table found on the page.
## NULL
## Warning in scrape_pac(url_2020): No valid table found on the page.
## NULL
Construct a vector called urls
that contains the
URLs for each webpage that contains information on foreign-connected PAC
contributions for a given year (combine seq
and string
functions). Using the map_dfr
function from the
purrr
package, apply the scrape_pac()
function
over urls
in a way that will result in a data frame called
pac_all
that contains the data for all of the years.
# sequence from 2000 - 2024 every two years
years <- seq(2000, 2024,2)
urls <- as.character(str_glue("https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/{years}"))
# Print urls to confirm
print(urls)
## [1] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2000"
## [2] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2002"
## [3] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2004"
## [4] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2006"
## [5] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2008"
## [6] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2010"
## [7] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2012"
## [8] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2014"
## [9] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2016"
## [10] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2018"
## [11] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2020"
## [12] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2022"
## [13] "https://www.opensecrets.org/political-action-committees-pacs/foreign-connected-pacs/2024"
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
## Warning in .f(.x[[i]], ...): No valid table found on the page.
Clean this combined dataset by separating the country of origin
from the parent company (use separate_wider_delim
or
another tool of your choice, you will need to be cautious with some
special cases in this column) and by converting the strings in the
total
, dems
, and repubs
columns
into numbers. Print out the top 10 rows over your dataset after
completing these steps.
pac_all <- pac_all %>%
separate_wider_delim(
country_of_origin_parent_company,
delim = "/",
names = c("country_of_Origin", "parent_company"),
too_few = "align_start"
)
pac_all <- pac_all %>%
mutate(
total = parse_number(total),
dems = parse_number(dems),
repubs = parse_number(repubs)
)
## # A tibble: 10 × 7
## pac_name_affiliate country_of_origin parent_company total dems repubs year
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 7-Eleven Japan Ito-Yokado 7070 2528 4542 2002
## 2 ABB Group Switzerland Asea Brown Bo… 34500 18750 15750 2002
## 3 Accenture UK Accenture plc 220947 94242 126705 2002
## 4 ACE INA UK ACE Group 77250 31500 45750 2002
## 5 Acuson Corp (Siem… Germany Siemens AG 0 0 0 2002
## 6 AE Staley Manufac… UK Tate & Lyle 28500 12500 16000 2002
## 7 AEGON USA (AEGON … Netherlands Aegon NV 164500 73500 91000 2002
## 8 AIM Management Gr… UK AMVESCAP 24500 8500 16000 2002
## 9 Air Liquide Ameri… France L'Air Liquide… 2650 2000 650 2002
## 10 Alcatel USA France Alcatel 13750 4500 9250 2002
#exporting to csv for rendering without buggy scraping
write.csv(pac_all, "pac_all.csv", row.names = FALSE)
Calculate the total contributions from PACs linked to Canada and Mexico each year and plot how these contributions change over time.
neighbors_total <- pac_all %>%
filter(country_of_origin %in% c("Canada", "Mexico")) %>%
group_by(year) %>%
summarize(total_contributions = sum(total, na.rm = TRUE))
# Plot the combined total contributions over time
ggplot(neighbors_total, aes(x = year, y = total_contributions)) +
geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
labs(
title = "Total Contributions fromNosy Neighbors (Canada and Mexico) Over Time",
x = "Year",
y = "Total Contributions"
) +
theme_minimal()
Find the 5 countries who over the entire time period of the dataset have the greatest total contribution from affiliated PACs. Then calculate the total contribution for each of those countries for each year of the data and make a plot of it to visualize how the contributions have changed over time.
top_countries <- pac_all %>%
group_by(country_of_origin) %>%
summarize(total_contribution = sum(total, na.rm = TRUE)) %>%
arrange(desc(total_contribution)) %>%
slice(1:5)
top_countries_v <- top_countries$country_of_origin
top_countries
## # A tibble: 5 × 2
## country_of_origin total_contribution
## <chr> <dbl>
## 1 UK 1180698
## 2 France 508730
## 3 Netherlands 350200
## 4 Germany 211400
## 5 Switzerland 85220
## [1] "UK" "France" "Netherlands" "Germany" "Switzerland"
top_5_y<- pac_all |>
filter(country_of_origin %in% top_countries_v) |>
group_by(country_of_origin, year) |>
summarize(yearly_total = sum(total, na.rm = TRUE))
## `summarise()` has grouped output by 'country_of_origin'. You can override using
## the `.groups` argument.
#plotting
ggplotly(
ggplot(top_5_y, aes(x = year, y = yearly_total, color = country_of_origin, group = country_of_origin)) +
geom_line(size = 1) +
labs(
title = "Annual Contributions from Affiliated PACs for Top 5 Countries",
x = "Year",
y = "Total Contribution",
color = "Country"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.