Scrape HTML tables with R

Scraping HTML tables

Let me show you how to scrape HTML tables with R. The tables are on this page (http://www.mishou.be/2021/10/04/pythonr-sample-data-for-data-analysis/). You can also learn how to scrape HTML tables with Python here (http://www.mishou.be/2021/10/04/pythonr-sample-data-for-data-analysis/).

# load packages: htmltab for scraping, tidyverse for tibbles
library(htmltab)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# retrieve the first table on the page (which = 1 selects it by position)
url <- "http://www.mishou.be/2021/10/04/pythonr-sample-data-for-data-analysis/"
df1 <- htmltab(url, which = 1)
# convert the data frame to a tibble for compact printing
df1_tibble <- as_tibble(df1)
df1_tibble

## # A tibble: 200 × 7
##    id    english japanese nationality department classes gender
##    <chr> <chr>   <chr>    <chr>       <chr>      <chr>   <chr> 
##  1 1     17.8    75.6     japan       literature 2       male  
##  2 2     64.4    53.3     nepal       literature 2       male  
##  3 3     86.7    31.1     nepal       literature 1       male  
##  4 4     60      62.2     indonesia   literature 2       male  
##  5 5     42.2    80       japan       literature 1       male  
##  6 6     33.3    75.6     japan       literature 1       male  
##  7 7     28.9    60       japan       literature 2       male  
##  8 8     53.3    88.9     japan       literature 1       male  
##  9 9     42.2    60       japan       literature 1       male  
## 10 10    40      80       japan       literature 1       male  
## # … with 190 more rows
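
In the printed tibble every column is typed `<chr>`, including id, english, japanese and classes, so numeric summaries will not work until those columns are converted. Here is a minimal sketch of one way to do that with dplyr, assuming the column names shown above. The name `df1_sample` is hypothetical: it is a hand-copied stand-in for the first three rows of the output, so the example runs without network access. With the real data you would start from `df1_tibble` instead.

```r
library(dplyr)
library(tibble)

# stand-in for the scraped table: first three rows copied from the output above,
# with every column stored as character, as in the scraped result
df1_sample <- tibble(
  id          = c("1", "2", "3"),
  english     = c("17.8", "64.4", "86.7"),
  japanese    = c("75.6", "53.3", "31.1"),
  nationality = c("japan", "nepal", "nepal"),
  department  = c("literature", "literature", "literature"),
  classes     = c("2", "2", "1"),
  gender      = c("male", "male", "male")
)

# convert the score columns to double and the id/classes columns to integer;
# the remaining columns stay as character
df1_typed <- df1_sample %>%
  mutate(
    across(c(english, japanese), as.numeric),
    across(c(id, classes), as.integer)
  )

# the score columns can now be summarised numerically
mean(df1_typed$english)
```

Converting explicitly, column by column, keeps the choice of types visible. Whether classes is better treated as an integer or as a category depends on the analysis, so that part is an assumption.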


http://www.mishou.be/

11/4/2021
