In this section, I scrape the USM athletics site for sporting event information. First, I load the libraries used in the assignment.
The scraped event titles below show that the data need to be cleaned before they can be analyzed: each string holds both the sport, in parentheses, and the event name.
## Warning: package 'rvest' was built under R version 3.6.3
## Loading required package: xml2
## Warning: package 'stringr' was built under R version 3.6.3
## Warning: package 'reshape2' was built under R version 3.6.3
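The warnings above name three packages, so the loading step presumably resembled the sketch below. Only rvest, stringr, and reshape2 are confirmed by the output; wrapping the calls in `suppressPackageStartupMessages()` is my own addition to quiet the start-up messages, not something the report shows.

```r
# Packages named in the warnings above.
# suppressPackageStartupMessages() hides start-up messages such as
# "Loading required package: xml2"; it is optional.
suppressPackageStartupMessages({
  library(rvest)     # read_html(), html_nodes(), html_text()
  library(stringr)   # string helpers
  library(reshape2)  # colsplit()
})
```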
## [1] "(Wrestling) NCAA Regional Tournament"
## [2] "(Men's Indoor Track & Field) Tufts Last Chance Qualifier"
## [3] "(Women's Indoor Track & Field) Tufts Last Chance Qualifier"
## [4] "(Men's Indoor Track & Field) NCAA Division III National Championships"
## [5] "(Women's Indoor Track & Field) NCAA Division III National Championships"
## [6] "(Wrestling) NCAA Division III National Championships"
I use the colsplit() function from reshape2 to split each title into a sport column and an event column.
## sport event
## 1 Wrestling NCAA Regional Tournament
## 2 Men's Indoor Track & Field Tufts Last Chance Qualifier
## 3 Women's Indoor Track & Field Tufts Last Chance Qualifier
## 4 Men's Indoor Track & Field NCAA Division III National Championships
## 5 Women's Indoor Track & Field NCAA Division III National Championships
## 6 Wrestling NCAA Division III National Championships
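The split shown above could be reproduced roughly as follows. This is a minimal sketch: the pattern passed to `colsplit()` in the report is not shown, so splitting on the closing `") "` and then stripping the opening parenthesis is an assumption, and the `titles` vector is a hand-typed subset of the strings printed earlier.

```r
library(reshape2)

# Three of the titles printed earlier, typed in by hand.
titles <- c("(Wrestling) NCAA Regional Tournament",
            "(Men's Indoor Track & Field) Tufts Last Chance Qualifier",
            "(Wrestling) NCAA Division III National Championships")

# Split at the ") " that closes the sport label. Because two names are
# given, colsplit() produces at most two pieces per string.
events <- colsplit(titles, pattern = "\\) ", names = c("sport", "event"))

# The sport column still carries the opening parenthesis; remove it.
events$sport <- sub("^\\(", "", events$sport)

print(events)
```

Splitting on the closing parenthesis rather than on the first space keeps multi-word sports such as "Men's Indoor Track & Field" in one piece.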
## [1] "(All day) Mar 1" "(All day) Mar 7" "(All day) Mar 7" "(All day) Mar 13"
## [5] "(All day) Mar 13" "(All day) Mar 13"
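It is helpful to combine the time, month, and day variables into a single POSIXct column, which I attempt in this section. The conversion returns NA for every row, as the output below shows, so the raw strings (for example "(All day) Mar 1") evidently do not match the format that was supplied.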
## [1] NA NA NA NA NA NA
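One way to obtain non-missing values from strings like these is sketched below. It is not the report's code. The page shows no year, so `assumed_year` is a placeholder I chose, and midnight stands in for the missing clock time of "(All day)" events. Matching against `month.abb` keeps the parsing independent of the system locale.

```r
# Raw strings as printed in the output above.
raw_time <- c("(All day) Mar 1", "(All day) Mar 7", "(All day) Mar 7",
              "(All day) Mar 13", "(All day) Mar 13", "(All day) Mar 13")

# ASSUMPTION: the year is not present in the scraped text.
assumed_year <- 2020

# Drop the leading "(...)" label, leaving "Mar 1", "Mar 7", ...
month_day <- sub("^\\([^)]*\\)\\s*", "", raw_time)

# Separate the month abbreviation from the day number.
parts <- strsplit(month_day, " ", fixed = TRUE)
month <- match(vapply(parts, `[`, character(1), 1), month.abb)
day   <- as.integer(vapply(parts, `[`, character(1), 2))

# Build a POSIXct vector; midnight is a placeholder for all-day events.
event_time <- ISOdatetime(assumed_year, month, day,
                          hour = 0, min = 0, sec = 0, tz = "UTC")
```

If only the calendar date matters, a `Date` column would be a simpler target than POSIXct.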
Whether to keep a single combined date-time column depends on what I intend to use the data for.
## event_time sport
## 1 <NA> Wrestling
## 2 <NA> Men's Indoor Track & Field
## 3 <NA> Women's Indoor Track & Field
## 4 <NA> Men's Indoor Track & Field
## 5 <NA> Women's Indoor Track & Field
## 6 <NA> Wrestling
## event
## 1 NCAA Regional Tournament
## 2 Tufts Last Chance Qualifier
## 3 Tufts Last Chance Qualifier
## 4 NCAA Division III National Championships
## 5 NCAA Division III National Championships
## 6 NCAA Division III National Championships
In this section, I use the html_text() function to extract the text from each node in the node set shown below.
## {xml_nodeset (6)}
## [1] <a href="/athletics/wrestling-ncaa-regional-tournament-0">(Wrestling) NCA ...
## [2] <a href="/athletics/mens-indoor-track-field-tufts-last-chance-qualifier"> ...
## [3] <a href="/athletics/womens-indoor-track-field-tufts-last-chance-qualifier ...
## [4] <a href="/athletics/mens-indoor-track-field-ncaa-division-iii-national-ch ...
## [5] <a href="/athletics/womens-indoor-track-field-ncaa-division-iii-national- ...
## [6] <a href="/athletics/wrestling-ncaa-division-iii-national-championships">( ...
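The extraction step can be illustrated on a small inline document. The HTML below is a hand-written stand-in containing only the two links whose `href` values are printed in full above; the real page structure and the CSS selector used in the report are not shown, so the `"a"` selector is an assumption.

```r
library(rvest)

# Hand-written stand-in for the scraped page (not the real site markup).
page <- read_html('<div>
  <a href="/athletics/wrestling-ncaa-regional-tournament-0">(Wrestling) NCAA Regional Tournament</a>
  <a href="/athletics/wrestling-ncaa-division-iii-national-championships">(Wrestling) NCAA Division III National Championships</a>
</div>')

links  <- html_nodes(page, "a")     # xml_nodeset of <a> elements
titles <- html_text(links)          # character vector of the link text
hrefs  <- html_attr(links, "href")  # the matching relative URLs

print(titles)
```

`html_text()` returns one string per node, which is why the result can be passed straight to a splitting function such as colsplit().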