── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 10 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): wiki_pages
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Wiki Scraping
Wiki Scraping - In Class
Install the necessary packages, then load the CSV file from my OneDrive.
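As a minimal sketch of the loading step: the OneDrive link is not given here, so the path below is a hypothetical temporary file standing in for the downloaded CSV. The column name `wiki_pages` and its character type come from the column specification printed above. Passing `col_types` also quiets the readr message.

```r
library(readr)

# Hypothetical stand-in for the CSV downloaded from OneDrive.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("wiki_pages", "R (programming language)", "Web scraping"), csv_path)

# Declare the single column as character so readr does not have to guess.
wiki_pages_df <- read_csv(
  csv_path,
  col_types = cols(wiki_pages = col_character())
)
```

With the real file, `csv_path` would simply point to wherever the download was saved.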
We got our data by running the data-harvesting task in a loop, using the link provided by our professor:
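library(rvest)

# Preallocate one slot per URL so the loop can assign by index
wiki_pages <- character(length(wikis))

for (i in seq_along(wikis)) {
  # Read the page and keep the text of its first <h1> element
  wiki_pages[i] <-
    wikis[i] %>%
    read_html() %>%
    html_element("h1") %>%
    html_text2()
  Sys.sleep(1)  # pause between requests to avoid hammering the server
}

Here is a visual of the distribution of the number of characters in the h1 of every page pulled: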
# Histogram of h1 lengths; str_length() counts the characters in each title
wiki_pages_df %>%
  ggplot(aes(x = str_length(wiki_pages))) +
  geom_histogram()
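The scraping loop and the character counts above can also be tried offline. The sketch below is only an illustration, not the code used in class: the `pages` vector holds made-up HTML snippets in place of downloaded Wikipedia pages, and `get_h1()` is a hypothetical helper. It uses `purrr::map_chr()` instead of a `for` loop and `rvest::minimal_html()` to parse the inline HTML.

```r
library(rvest)
library(purrr)
library(stringr)

# Hypothetical stand-ins for downloaded pages, so the sketch needs no network.
pages <- c(
  "<h1>Web scraping</h1><p>Body text</p>",
  "<h1>R (programming language)</h1>",
  "<p>No heading on this page</p>"
)

# Return the text of the first <h1> in one page, or NA if there is none.
get_h1 <- function(html) {
  html %>%
    minimal_html() %>%
    html_element("h1") %>%
    html_text2()
}

# One title per page, then the number of characters in each title.
titles <- map_chr(pages, get_h1)
title_lengths <- str_length(titles)
```

For live pages, `get_h1()` would call `read_html()` on a URL instead of `minimal_html()`, with a `Sys.sleep(1)` between requests as in the loop. Using `html_element()` rather than `html_elements()` guarantees exactly one value per page, which is what `map_chr()` expects.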