Wikipedia Analysis

Loading Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
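
The scraping loop later in this document uses rvest (read_html(), html_element(), html_text2()). rvest is installed with the tidyverse but is not attached by library(tidyverse), so it needs to be loaded on its own.

library(rvest)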

Loading Data

This code loads the URLs of 10 random Wikipedia pages from a csv file that was created in another R script.

wikis <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/moningk1_xavier_edu/EZ0pbGmf9XJPhVgDIqF3j6EBWvB0IPvAArLuEkThlu0zMg?e=QEVaeY&download=1")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 1467 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): <!DOCTYPE html>

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
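
The warning above suggests the share link returned an HTML page rather than the csv (the single column is named <!DOCTYPE html>); problems() shows the rows readr flagged. Also, read_csv() returns a tibble, while the loop below indexes wikis as a character vector of URLs, so the URL column needs to be pulled out. A minimal sketch, assuming the csv's first column holds the page URLs:

problems(wikis)   # rows readr could not parse cleanly

wikis <- 
  wikis %>% 
  pull(1)         # keep just the column of page URLs as a character vector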

Creating a Vector to Extract Page Titles

This code creates an empty vector that the Wikipedia page titles will be added to.

wiki_pages <- c()
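
Growing a vector one element at a time inside a loop works, but pre-allocating it to the number of URLs is slightly more efficient. A minimal sketch, assuming wikis is the character vector of page URLs:

wiki_pages <- character(length(wikis))   # one empty slot per page title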

Retrieving 10 Pages via a Loop

For each URL in the wikis vector, this code reads the web page, selects the main-header (h1) node, and extracts its text into wiki_pages, sleeping for 1 second between requests.

for(i in seq_along(wikis)) {
  wiki_pages[i] <-
    wikis[i] %>% 
    read_html() %>%          # download and parse the page
    html_element("h1") %>%   # select the single main-header node
    html_text2()             # extract the header text

  Sys.sleep(1)               # wait 1 second between requests
}
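
The same scrape can also be written without an explicit loop using purrr, which is attached with the tidyverse. A minimal sketch, assuming wikis is a character vector of page URLs; scrape_title is just an illustrative helper name:

scrape_title <- function(url) {
  Sys.sleep(1)               # pause between requests
  url %>% 
    read_html() %>% 
    html_element("h1") %>% 
    html_text2()
}

wiki_pages <- map_chr(wikis, scrape_title)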

Showing Histogram

This code produces a histogram of the number of characters in the titles of the Wikipedia pages.

wiki_pages %>% nchar() %>% hist()
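
Since ggplot2 is already attached, the same distribution can also be drawn with geom_histogram(). A minimal sketch; the binwidth here is an arbitrary choice:

tibble(title = wiki_pages) %>% 
  ggplot(aes(x = nchar(title))) +
  geom_histogram(binwidth = 2) +   # bin width chosen arbitrarily
  labs(x = "Characters in page title", y = "Number of pages")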