Wikipedia In-Class Activity

Loading Packages

Loading the tidyverse attaches readr, which provides the read_csv() function used below, along with dplyr and the %>% pipe used throughout this activity.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading Data

First, I gathered 10 random Wikipedia page titles and saved them in the CSV file read in below.

wiki_pages <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/pelles1_xavier_edu/EaQRLiUivwdPreSOVTNjWEIBxnFK9I_skaju21f715VOYw?download=1")
Rows: 10 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): wiki_pages

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
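
To double-check that the titles loaded as expected, a quick inspection like the one below works. This is just a minimal sketch that previews the tibble created by read_csv() above; it isn't part of the original analysis.

# Preview the structure of the loaded tibble: one character column, wiki_pages
glimpse(wiki_pages)

# Show the first few titles
wiki_pages %>% head()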

How I Extracted The Data

  1. Built a vector containing 10 copies of a URL that redirects to a random Wikipedia page
  2. Created an empty vector to store the scraped titles
  3. Looped through each URL, reading its HTML with rvest, extracting just the <h1> page title, and storing it in the vector created in step 2
library(rvest)  # provides read_html(), html_element(), and html_text2()

# A vector of 10 copies of a URL that redirects to a random Wikipedia page
wikis <- 
  rep("http://asayanalytics.com/bored", 10)

# An empty vector to hold the scraped page titles
wiki_pages <- 
  c()

for(i in seq_along(wikis)) {
  # Read each page's HTML, pull the first <h1> (the page title), and keep its text
  wiki_pages[i] <-
    wikis[i] %>% 
    read_html() %>% 
    html_element("h1") %>% 
    html_text2()
  Sys.sleep(1)  # pause one second between requests
}
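
As an aside, the same extraction could be written without an explicit loop. The sketch below is not the code used for this activity: it assumes a hypothetical helper called scrape_title() and uses purrr::map_chr() (attached with the tidyverse) together with rvest.

library(rvest)

# Hypothetical helper: scrape the <h1> title from one URL
scrape_title <- function(url) {
  Sys.sleep(1)                # stay polite between requests
  url %>% 
    read_html() %>% 
    html_element("h1") %>% 
    html_text2()
}

# map_chr() applies the helper to every URL and returns a character vector of titles
wiki_pages <- map_chr(wikis, scrape_title)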

Analyzing the Data

I counted the number of characters in each title and created a histogram.

wiki_pages %>% pull(wiki_pages) %>% nchar() %>% hist()

I found that 7 of the 10 page titles had between 10 and 30 characters, following a roughly normal distribution, with 2 outliers of 40 to 60 characters each.
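
Since ggplot2 comes with the tidyverse, the same histogram could also be drawn with it. This is only an alternative sketch to hist(), and it assumes wiki_pages is the tibble read from the CSV, with its wiki_pages column holding the titles.

wiki_pages %>% 
  mutate(title_length = nchar(wiki_pages)) %>%  # characters per title
  ggplot(aes(x = title_length)) +
  geom_histogram(binwidth = 10) +               # 10-character bins, matching the summary above
  labs(x = "Title length (characters)", y = "Number of pages")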