Wiki Scraping

Wiki Scraping - In Class

Install and load the necessary packages, then load the CSV file from my OneDrive.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 10 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): wiki_pages

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
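As a minimal sketch of the loading step, the snippet below reads a one-column CSV and declares the column type up front, which also quiets the column-specification message. The temporary file and its two example rows are hypothetical stand-ins, since the actual OneDrive link is not shown here.

```r
library(readr)

# Hypothetical stand-in for the downloaded file: a tiny CSV written to a temp path
csv_path <- tempfile(fileext = ".csv")
write_lines(c("wiki_pages", "Web scraping", "Data"), csv_path)

# Declaring col_types pins wiki_pages to character and suppresses the
# column-specification message that read_csv() prints by default
wiki_pages_df <- read_csv(
  csv_path,
  col_types = cols(wiki_pages = col_character())
)
```

Passing `show_col_types = FALSE` instead would also silence the message, but it leaves the column type to be guessed.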

We collected our data by looping the scraping task over the links provided by our professor, pulling the h1 heading from each page.

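library(rvest)  # provides read_html(), html_element(), html_text2()

# Pre-allocate one slot per URL in `wikis`
wiki_pages <- character(length(wikis))

for (i in seq_along(wikis)) {
  wiki_pages[i] <-
    wikis[i] %>%
    read_html() %>%
    html_element("h1") %>%  # first h1 only, so each page yields one string
    html_text2()
  Sys.sleep(1)  # pause between requests to avoid hammering the server
}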
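The extraction step inside the loop can be tried without any network access. The sketch below parses an inline HTML snippet in place of a downloaded page; `get_h1` and `sample_page` are hypothetical names introduced for illustration only.

```r
library(rvest)

# Hypothetical helper: return the text of the first <h1> in a parsed page.
# html_element() (singular) returns one match, so the result is one string.
get_h1 <- function(page) {
  html_text2(html_element(page, "h1"))
}

# Inline HTML stands in for a downloaded wiki page
sample_page <- minimal_html("<h1>Web scraping</h1><p>Body text</p>")
heading <- get_h1(sample_page)
```

Using `html_element()` rather than `html_elements()` matters when a page has more than one h1: the plural form returns every match, which does not fit into a single slot of the results vector.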

Here is a histogram of the number of characters in the h1 heading of each page pulled.

wiki_pages_df %>%
  ggplot(aes(x = nchar(wiki_pages))) +  # count characters; a histogram needs a numeric x
  geom_histogram()                      # layers must be joined to ggplot() with +
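The character-count distribution described above could be sketched as follows, assuming the headings sit in a character column named `wiki_pages`. The three example titles, the `n_chars` column, and the bin width are illustrative assumptions, not the class data.

```r
library(ggplot2)

# Hypothetical example headings standing in for the scraped h1 text
example_df <- data.frame(wiki_pages = c("Web scraping", "Data", "HTML"))

# nchar() turns each heading into its character count, giving the
# numeric variable that a histogram requires
example_df$n_chars <- nchar(example_df$wiki_pages)

p <- ggplot(example_df, aes(x = n_chars)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Characters in h1", y = "Number of pages")
```

Printing `p` draws the plot. Setting `binwidth` explicitly avoids the default of 30 bins, which is usually too many for a ten-row data set.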