Assignment 2: Web Scraping

In this section, I scrape the USM athletics site for sporting event information. First, I load the libraries used in the assignment.

Data Before Cleaning

The raw event titles below show that the data need to be cleaned before they can be used: each string combines the sport, in parentheses, with the event name.

## Warning: package 'rvest' was built under R version 3.6.3
## Loading required package: xml2
## Warning: package 'stringr' was built under R version 3.6.3
## Warning: package 'reshape2' was built under R version 3.6.3
## [1] "(Wrestling) NCAA Regional Tournament"                                   
## [2] "(Men's Indoor Track & Field) Tufts Last Chance Qualifier"               
## [3] "(Women's Indoor Track & Field) Tufts Last Chance Qualifier"             
## [4] "(Men's Indoor Track & Field) NCAA Division III National Championships"  
## [5] "(Women's Indoor Track & Field) NCAA Division III National Championships"
## [6] "(Wrestling) NCAA Division III National Championships"

Splitting Columns

We use the colsplit() function from reshape2 to split each title into separate sport and event columns.

##                          sport                                    event
## 1                    Wrestling                 NCAA Regional Tournament
## 2   Men's Indoor Track & Field              Tufts Last Chance Qualifier
## 3 Women's Indoor Track & Field              Tufts Last Chance Qualifier
## 4   Men's Indoor Track & Field NCAA Division III National Championships
## 5 Women's Indoor Track & Field NCAA Division III National Championships
## 6                    Wrestling NCAA Division III National Championships
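
The split above can be sketched as follows. This is a minimal example, assuming every title follows the "(Sport) Event" pattern; the vector name `titles`, the data frame name `events`, and the splitting pattern are assumptions for illustration, not necessarily the code that produced the output.

```r
library(reshape2)

# Hypothetical input: three of the scraped titles shown earlier
titles <- c(
  "(Wrestling) NCAA Regional Tournament",
  "(Men's Indoor Track & Field) Tufts Last Chance Qualifier",
  "(Women's Indoor Track & Field) NCAA Division III National Championships"
)

# Split on the first ") "; with two names, colsplit() returns two pieces
events <- colsplit(titles, pattern = "\\) ", names = c("sport", "event"))

# Drop the leading "(" that remains on the sport column
events$sport <- sub("^\\(", "", events$sport)

print(events)
```

Splitting on ") " rather than on a space keeps multi-word sports such as "Men's Indoor Track & Field" in one piece.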

Showing Month, Time, and Day

## [1] "(All day) Mar 1"  "(All day) Mar 7"  "(All day) Mar 7"  "(All day) Mar 13"
## [5] "(All day) Mar 13" "(All day) Mar 13"

Combining all the variables into a single POSIXct column

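It is helpful to combine the time, month, and day variables into a single POSIXct column, as I attempt in this section. Note that the conversion below returns NA for every row, so the date strings were not parsed successfully.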

## [1] NA NA NA NA NA NA
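
NA values like these usually mean the strings did not match the format given to the conversion. One possible cause, which cannot be confirmed from the output alone, is the "(All day)" prefix or the missing year. The sketch below shows one way to parse such strings under those assumptions; `raw_dates`, `event_year`, and the year value itself are hypothetical, since the scraped strings carry no year.

```r
# Hypothetical input: date strings in the form shown earlier
raw_dates <- c("(All day) Mar 1", "(All day) Mar 7", "(All day) Mar 13")

# Remove the parenthesised time label, leaving "Mar 1", "Mar 7", ...
month_day <- sub("^\\([^)]*\\) *", "", raw_dates)

# Separate the month abbreviation from the day number
parts     <- strsplit(month_day, " ")
month_num <- match(sapply(parts, `[`, 1), month.abb)  # locale-independent lookup
day_num   <- as.integer(sapply(parts, `[`, 2))

# Assumption: the year is not in the scraped text, so it is supplied by hand
event_year <- 2020

# ISOdatetime() returns POSIXct; "(All day)" events are placed at midnight
event_time <- ISOdatetime(event_year, month_num, day_num, 0, 0, 0, tz = "UTC")

print(event_time)
```

Matching against `month.abb` avoids relying on the `%b` format code, which depends on the session locale.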

Dataframe containing usable information

The data frame below keeps the event time, sport, and event name. Which of these columns are usable depends on what I intend to do with the data.

##   event_time                        sport
## 1       <NA>                    Wrestling
## 2       <NA>   Men's Indoor Track & Field
## 3       <NA> Women's Indoor Track & Field
## 4       <NA>   Men's Indoor Track & Field
## 5       <NA> Women's Indoor Track & Field
## 6       <NA>                    Wrestling
##                                      event
## 1                 NCAA Regional Tournament
## 2              Tufts Last Chance Qualifier
## 3              Tufts Last Chance Qualifier
## 4 NCAA Division III National Championships
## 5 NCAA Division III National Championships
## 6 NCAA Division III National Championships

SelectorGadget attempts to always return the shortest syntax

The nodeset below contains the link nodes matched by the selector. The html_text() function extracts the text from each node.

## {xml_nodeset (6)}
## [1] <a href="/athletics/wrestling-ncaa-regional-tournament-0">(Wrestling) NCA ...
## [2] <a href="/athletics/mens-indoor-track-field-tufts-last-chance-qualifier"> ...
## [3] <a href="/athletics/womens-indoor-track-field-tufts-last-chance-qualifier ...
## [4] <a href="/athletics/mens-indoor-track-field-ncaa-division-iii-national-ch ...
## [5] <a href="/athletics/womens-indoor-track-field-ncaa-division-iii-national- ...
## [6] <a href="/athletics/wrestling-ncaa-division-iii-national-championships">( ...
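
The extraction step can be sketched on a small inline HTML fragment, so the example runs without a network connection. The wrapper `div.events` and the selector built on it are hypothetical; the real page structure and the selector returned by SelectorGadget are not shown in the output above.

```r
library(rvest)

# Hypothetical fragment reusing two of the links from the nodeset above
page <- xml2::read_html(paste0(
  '<div class="events">',
  '<a href="/athletics/wrestling-ncaa-regional-tournament-0">',
  '(Wrestling) NCAA Regional Tournament</a>',
  '<a href="/athletics/wrestling-ncaa-division-iii-national-championships">',
  '(Wrestling) NCAA Division III National Championships</a>',
  '</div>'
))

# Select the link nodes with a CSS selector (assumed for this fragment)
links <- html_nodes(page, "div.events a")

# html_text() returns the visible text; html_attr() returns an attribute value
titles <- html_text(links)
hrefs  <- html_attr(links, "href")

print(titles)
print(hrefs)
```

Keeping the nodeset in `links` makes it possible to pull both the title text and the link target from the same selection.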