Introduction

Today you will get practice scraping data. To get started, load packages tidyverse and rvest. If package rvest is not installed, install it first with install.packages("rvest").

Package rvest was written by Hadley Wickham and contains functions that serve as wrappers around packages xml2 and httr to make it easy to download and then manipulate HTML and XML. The main functions are

  • read_html(): read HTML or XML

  • html_nodes(): extract pieces out of HTML documents using CSS selectors

  • html_table(): parse an HTML table into a data frame

  • html_text(): extract the text from HTML elements
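
As a rough illustration of how these functions fit together, the sketch below parses a small, made-up HTML string rather than a live web page. The page content, the class name note, and the object names are all assumptions chosen for the example.

```r
library(rvest)

# A made-up HTML page supplied as a string instead of a URL
page <- read_html(
  "<html><body>
     <p class='note'>First note</p>
     <p class='note'>Second note</p>
     <table>
       <tr><th>team</th><th>wins</th></tr>
       <tr><td>A</td><td>10</td></tr>
       <tr><td>B</td><td>7</td></tr>
     </table>
   </body></html>"
)

# Select the paragraphs with class "note" and pull out their text
notes <- html_text(html_nodes(page, "p.note"))

# Select every table and parse each one; the result is a list
tables <- html_table(html_nodes(page, "table"))
toy.table <- tables[[1]]
```

Here notes is a character vector with one entry per matched paragraph, and toy.table holds the two data rows with columns team and wins.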


Kenpom

Scrape

You will scrape some college basketball statistics using SelectorGadget. The advanced metrics are available at https://kenpom.com/.

Task 1

Use SelectorGadget along with the following functions in package rvest: read_html(), html_nodes(), html_text() to scrape the large table on the main page. Only scrape the data and not the column headers. Save the result as an object named kenpom.vec.

Check that the length of kenpom.vec is 7413.
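
The idea of scraping only the data cells can be sketched on a made-up table. The selector "td" used here is an assumption that fits this toy table only; the selector for the real page should come from SelectorGadget.

```r
library(rvest)

# Made-up table standing in for a scraped page
page <- read_html(
  "<table>
     <tr><th>team</th><th>wins</th><th>losses</th></tr>
     <tr><td>A</td><td>10</td><td>2</td></tr>
     <tr><td>B</td><td>7</td><td>5</td></tr>
   </table>"
)

# "td" matches data cells only, so the <th> header cells are skipped
cells <- html_text(html_nodes(page, "td"))

# One entry per data cell: 2 rows times 3 columns
length(cells)
```

The cells come back as one flat character vector, ordered row by row, which is why a length check is a quick way to confirm that nothing extra was selected.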

Organize the data

Task 2

You will organize the data for the 353 NCAAB teams into a 353 by 21 matrix, with one row per team. You will then remove the odd-numbered columns from 7 through 21 (columns 7, 9, ..., 21). Each of these columns holds the rank corresponding to the adjacent statistic. If you look on https://kenpom.com/ you will see small numbers next to some of the statistics such as AdjO, AdjD, etc. These are the values we are removing. Remove the chunk option eval=FALSE.

Check the dimensions of kenpom.mat.
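
A scaled-down sketch of the reshaping step, using a made-up vector with two teams and five values per team (the names toy.vec and toy.mat are hypothetical). Function matrix() fills column by column by default, so byrow = TRUE is needed when the values arrive row by row.

```r
# Made-up scraped values: team, statistic, rank, statistic, rank
toy.vec <- c("A", "10.5", "1", "99.1", "2",
             "B", "8.2", "2", "101.3", "1")

# Fill row by row so that each row holds one team
toy.mat <- matrix(toy.vec, ncol = 5, byrow = TRUE)

# Drop the odd columns 3 through 5, which hold the ranks
toy.mat <- toy.mat[, -seq(from = 3, to = 5, by = 2)]

dim(toy.mat)
```

The same pattern, with a different number of columns and a different seq() range, should carry over to the full data.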

Task 3

Turn kenpom.mat into a tibble and save the result as kenpom. Assign the column names as given below.

c("rank", "team", "conf", "w_l", "adjem", "adjo", "adjd", "adjt", "luck", "adjem_sos", "oppo_sos", "oppd_sos", "adjem_nc")

End the chunk with glimpse(kenpom).
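
One possible way to go from a character matrix to a named tibble, shown on a made-up two-column matrix. Setting the column names before the conversion is a choice made for this sketch; assigning names to the tibble afterwards would also work.

```r
library(tibble)
library(dplyr)

# Made-up character matrix with two teams
toy.mat <- matrix(c("1", "A",
                    "2", "B"), ncol = 2, byrow = TRUE)

# Name the columns first so the tibble inherits them
colnames(toy.mat) <- c("rank", "team")
toy <- as_tibble(toy.mat)

glimpse(toy)
```

Both columns are still character at this point, because every entry of a character matrix is a string.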

Task 4

Use function separate() to turn the w_l column into two columns: “wins”, “losses”.
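
A minimal sketch of separate() on made-up records, assuming the wins and losses are joined by a hyphen as in "30-5".

```r
library(tidyr)
library(tibble)

# Made-up win-loss records
toy <- tibble(team = c("A", "B"), w_l = c("30-5", "28-7"))

# Split w_l at the hyphen into two new columns
toy.sep <- separate(toy, col = w_l, into = c("wins", "losses"), sep = "-")
```

The new columns are still character vectors; separate() does not convert types unless asked to.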

Task 5

Change the variable type of rank and wins:adjem_nc from character to double (numeric).
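
One way to convert several columns at once is mutate() with across(), sketched here on a made-up tibble with fewer columns. This assumes dplyr version 1.0.0 or later, where across() is available.

```r
library(dplyr)
library(tibble)

# Made-up tibble in which every column is character
toy <- tibble(rank = c("1", "2"),
              team = c("A", "B"),
              wins = c("30", "28"),
              losses = c("5", "7"),
              adjem = c("35.7", "31.2"))

# Convert rank and the range wins:adjem, leaving team as character
toy <- toy %>%
  mutate(across(c(rank, wins:adjem), as.double))
```

The range wins:adjem selects every column from wins through adjem by position, so the column order matters.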

Task 6

Add an additional variable to kenpom called win_percentage. Variable win_percentage is the number of wins divided by total games played (wins + losses).
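
The calculation can be sketched on made-up numbers: a team with 30 wins and 5 losses has played 35 games, so its win percentage is 30 / 35, or about 0.857.

```r
library(dplyr)
library(tibble)

# Made-up numeric wins and losses
toy <- tibble(team = c("A", "B"), wins = c(30, 28), losses = c(5, 7))

# Wins divided by total games played
toy <- toy %>%
  mutate(win_percentage = wins / (wins + losses))
```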

Task 7

Remove the extra blank space at the end of each team’s name with substr(team, start = 1, stop = nchar(team) - 1).
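
A short sketch of the substr() approach on made-up team names. Note that substr() drops the last character whatever it is, so it is only safe when every name really ends in one extra space; base R function trimws() is shown as an alternative that removes trailing whitespace only.

```r
# Made-up team names, each with one trailing space
team <- c("Duke ", "Michigan St. ")

# Keep everything except the last character
team.clean <- substr(team, start = 1, stop = nchar(team) - 1)

# Alternative: strip whitespace from the right side only
team.trim <- trimws(team, which = "right")
```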

Team comparison

Task 8

Create a 353 by 353 matrix of the pairwise differences in adjem between all 353 teams. Metric adjem is the difference between a team’s offensive and defensive efficiency. Save the result as kenpom.adjem.mat. The diagonal of the resulting matrix should be 0, since each team’s difference with itself is 0.

Function outer() in R will allow you to do this in one line of code. Below is a simple example of function outer() in action.

outer(X = c(1:4), Y = c(1:4), FUN = "-")
     [,1] [,2] [,3] [,4]
[1,]    0   -1   -2   -3
[2,]    1    0   -1   -2
[3,]    2    1    0   -1
[4,]    3    2    1    0

Task 9

Add the team names to the rows and columns of kenpom.adjem.mat.
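
A small sketch of labeling a difference matrix, using three made-up teams and made-up adjem values (the object names are hypothetical).

```r
# Made-up metric values for three made-up teams
adjem <- c(30.5, 25.0, -4.5)
teams <- c("A", "B", "C")

# Entry [i, j] is adjem[i] - adjem[j]
toy.adjem.mat <- outer(X = adjem, Y = adjem, FUN = "-")

# Label the rows and the columns with the team names
rownames(toy.adjem.mat) <- teams
colnames(toy.adjem.mat) <- teams

toy.adjem.mat
```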

Task 10

On Saturday, April 6, 2019, Michigan State plays Texas Tech in the Final Four. Use matrix kenpom.adjem.mat to find the difference in adjem between “Michigan St.” and “Texas Tech”. Recall that you can subset matrices by their row and column names.
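
Subsetting by name can be sketched on a made-up 2 by 2 difference matrix with hypothetical teams "A" and "B". Because each entry is a row value minus a column value, swapping the two names flips the sign.

```r
# Made-up difference matrix; matrix() fills column by column
toy.mat <- matrix(c(0, -5.5, 5.5, 0), nrow = 2,
                  dimnames = list(c("A", "B"), c("A", "B")))

# Row "A", column "B": the value for A minus the value for B
a.minus.b <- toy.mat["A", "B"]

# Swapping the names gives the same difference with the opposite sign
b.minus.a <- toy.mat["B", "A"]
```

The names must match the row and column labels exactly, including punctuation such as the period in an abbreviated team name.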

For details on some of the advanced metrics see https://kenpom.com/blog/ratings-methodology-update/.

Choice scrape

Pick any website, scrape some data, organize the data, and visualize the data with ggplot() or a similar visualization package.