Ashleigh Barty is at the top of her game as world #1 for the 4th consecutive year. This year she won the Australian Open, her third Grand Slam singles title and her first at her home slam. As an Aussie, she has inspired many young tennis players and been a role model for the game.
Sadly, this week she decided to formally retire from the sport. It's not her first time, but I feel like she means it this time. Upon hearing the news, I decided to change my data source for this project to her career stats and shed some light on her wonderful career.
# The following packages should be loaded before running this report.
library(magrittr) # for useful pipes
library(knitr) # useful for compiling data tables
library(readr) # Useful for importing data
library(readxl) # useful for reading data in excel spreadsheets
library(dplyr) # useful to group data together
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr) # For changing the structure of the data
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(rvest) # needed to scrape data from html sources
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
I searched the internet to find a good summary of Ash’s wins / losses through the years.
I found what I was looking for on this website. “https://www.tennisstats247.com/players/A-Barty-3627/”
The table I’m interested in is Ash’s career singles stats at the top of the page.
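The table describes a number of statistics about her game per year, with Year being the categorical variable.
The quantitative variables include Rank, Titles, W and L (overall wins and losses), HW and HL (hard court wins and losses), CW and CL (clay court wins and losses), and GW and GL (grass court wins and losses).
Now that we understand the data, let's import it into R as a data.frame.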
# First we'll identify the URL for the webpage in our parameters.
url_data <- "https://www.tennisstats247.com/players/A-Barty-3627/"
# Next we're going to read the HTML page into R by piping this URL into the read_html() function of the rvest package.
url_data %>%
read_html()
## {html_document}
## <html xmlns="http://www.w3.org/1999/xhtml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="allbody">\n<script type="text/javascript" lang="javascript" src ...
We can see the components of the HTML document, such as head and body.
I need to inspect the HTML document in my browser to find the CSS selector (the table id) of the table I am interested in. I can do this using the developer tools in Chrome.
Let's save the CSS selector in our parameters now:
css_selector <- "#gvCareerStats"
Now we can extract the career stats table into R.
url_data %>%
read_html() %>%
html_element(css = css_selector)
## {html_node}
## <table class="gridAlternate careerStats" cellspacing="0" id="gvCareerStats">
## [1] <tr>\n<th scope="col">\n<span title="Year">Year</span>\n</th>\n<th scope ...
## [2] <tr>\n<td>\n2022\n</td>\n<td>\n1\n</td>\n<td>\n2\n</td>\n<td>\n11\n</td> ...
## [3] <tr>\n<td>\n2021\n</td>\n<td>\n1\n</td>\n<td>\n5\n</td>\n<td>\n42\n</td> ...
## [4] <tr>\n<td>\n2020\n</td>\n<td>\n1\n</td>\n<td>\n1\n</td>\n<td>\n11\n</td> ...
## [5] <tr>\n<td>\n2019\n</td>\n<td>\n1\n</td>\n<td>\n4\n</td>\n<td>\n59\n</td> ...
## [6] <tr>\n<td>\n2018\n</td>\n<td>\n15\n</td>\n<td>\n2\n</td>\n<td>\n46\n</td ...
## [7] <tr>\n<td>\n2017\n</td>\n<td>\n17\n</td>\n<td>\n1\n</td>\n<td>\n42\n</td ...
## [8] <tr>\n<td>\n2016\n</td>\n<td>\n272\n</td>\n<td>\n0\n</td>\n<td>\n11\n</t ...
## [9] <tr>\n<td>\n2014\n</td>\n<td>\n223\n</td>\n<td>\n0\n</td>\n<td>\n15\n</t ...
## [10] <tr>\n<td>\n2013\n</td>\n<td>\n190\n</td>\n<td>\n0\n</td>\n<td>\n12\n</t ...
## [11] <tr>\n<td>\n2012\n</td>\n<td>\n175\n</td>\n<td>\n2\n</td>\n<td>\n24\n</t ...
## [12] <tr>\n<td>\n2011\n</td>\n<td>\n679\n</td>\n<td>\n0\n</td>\n<td>\n0\n</td ...
As we can see, the data is still in HTML format, but it's starting to come together.
Next we’ll need to import it as a data table using the html_table() function.
barty_tbl <- url_data %>%
read_html() %>%
html_element(css = css_selector) %>%
html_table()
barty_tbl
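To see how the selector step behaves without depending on the live site, here is a minimal sketch that runs the same html_element() and html_table() calls against a small inline page. The inline HTML and the extra table with id "other" are assumptions made up for illustration; the two data rows reuse values shown in the output above.

```r
library(rvest)

# Hypothetical stand-in for the real page: two tables, only one has the id we want.
page <- minimal_html('
  <table id="other"><tr><th>x</th></tr><tr><td>99</td></tr></table>
  <table id="gvCareerStats">
    <tr><th>Year</th><th>Rank</th><th>Titles</th></tr>
    <tr><td>2022</td><td>1</td><td>2</td></tr>
    <tr><td>2021</td><td>1</td><td>5</td></tr>
  </table>
')

# "#gvCareerStats" is an id selector, so only the matching table is returned.
node <- html_element(page, css = "#gvCareerStats")

# html_table() parses the node into a tibble, using the <th> cells as column names.
demo_tbl <- html_table(node)
demo_tbl
```

If the selector matched nothing, html_element() would return a missing node rather than the table, so printing the node first (as done above with the real page) is a cheap sanity check.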
Let's inspect the data structure and class.
dim(barty_tbl)
## [1] 11 11
class(barty_tbl)
## [1] "tbl_df" "tbl" "data.frame"
It seems R has automatically imported the data as a tibble, which is fine, but for the sake of this exercise we’ll need to change the class to a “data.frame”.
barty_df <- as.data.frame(barty_tbl, stringsAsFactors = FALSE)
class(barty_df)
## [1] "data.frame"
Let's look at the data in more detail:
str(barty_df)
## 'data.frame': 11 obs. of 11 variables:
## $ Year : int 2022 2021 2020 2019 2018 2017 2016 2014 2013 2012 ...
## $ Rank : int 1 1 1 1 15 17 272 223 190 175 ...
## $ Titles: int 2 5 1 4 2 1 0 0 0 2 ...
## $ W : int 11 42 11 59 46 42 11 15 12 24 ...
## $ L : int 0 7 4 14 19 16 4 10 12 9 ...
## $ HW : int 11 21 11 40 27 31 2 10 6 15 ...
## $ HL : int 0 3 4 11 11 10 1 4 6 6 ...
## $ CW : int 0 14 0 11 7 5 0 3 6 4 ...
## $ CL : int 0 4 0 2 5 3 0 4 4 2 ...
## $ GW : int 0 7 0 8 12 6 9 2 0 5 ...
## $ GL : int 0 0 0 1 3 3 3 2 2 1 ...
A few observations from the str() output: the data frame has 11 observations of 11 variables, and every column, including Year, was imported as an integer.
# Convert the Year vector to character, as it's our categorical variable
barty_df$Year <- as.character(barty_df$Year)
# Convert the remaining columns to numeric using lapply() (list apply) so that we can run some analysis on the numbers.
barty_df[2:11] <- lapply(barty_df[2:11], as.numeric)
# Check structure of barty_df now to confirm class of each vector
str(barty_df)
## 'data.frame': 11 obs. of 11 variables:
## $ Year : chr "2022" "2021" "2020" "2019" ...
## $ Rank : num 1 1 1 1 15 17 272 223 190 175 ...
## $ Titles: num 2 5 1 4 2 1 0 0 0 2 ...
## $ W : num 11 42 11 59 46 42 11 15 12 24 ...
## $ L : num 0 7 4 14 19 16 4 10 12 9 ...
## $ HW : num 11 21 11 40 27 31 2 10 6 15 ...
## $ HL : num 0 3 4 11 11 10 1 4 6 6 ...
## $ CW : num 0 14 0 11 7 5 0 3 6 4 ...
## $ CL : num 0 4 0 2 5 3 0 4 4 2 ...
## $ GW : num 0 7 0 8 12 6 9 2 0 5 ...
## $ GL : num 0 0 0 1 3 3 3 2 2 1 ...
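The conversion pattern used here can be tried on a smaller data frame. This is only a sketch: demo_df is a hypothetical name, and its three rows are copied from the first rows of the table above.

```r
# Three rows copied from the table above; every column starts as an integer.
demo_df <- data.frame(
  Year   = c(2022L, 2021L, 2020L),
  Titles = c(2L, 5L, 1L),
  W      = c(11L, 42L, 11L)
)

# Year is a label rather than a quantity, so store it as character.
demo_df$Year <- as.character(demo_df$Year)

# lapply() returns a list of converted columns; assigning that list back into
# demo_df[2:3] replaces those columns while keeping the data frame structure.
demo_df[2:3] <- lapply(demo_df[2:3], as.numeric)

# Show the class of each column after conversion.
sapply(demo_df, class)
```

Assigning into demo_df[2:3] rather than overwriting demo_df matters: lapply() on its own returns a plain list, not a data frame.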
# Count the total number of titles she won
sum(barty_df$Titles)
## [1] 17
# Check the avg number of Wins she's had per year
mean(barty_df$W)
## [1] 24.81818
In my opinion, the data is tidy because each observation (Year) is in a row and the corresponding variables are in columns. There are no missing values (NAs) in this file.
The only thing I would like to update in this case is the column names, as they may be hard to understand unless spelled out completely. To do so, we’ll use the colnames() function.
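The claim about missing values can be checked in code rather than by eye. The sketch below rebuilds only the Year and Titles columns from the values shown above (check_df and missing_years are hypothetical names) and also looks for years that are absent from the table altogether, which anyNA() cannot detect because an absent row is not an NA.

```r
# Year and Titles values copied from the table output above.
check_df <- data.frame(
  Year   = c("2022", "2021", "2020", "2019", "2018", "2017",
             "2016", "2014", "2013", "2012", "2011"),
  Titles = c(2, 5, 1, 4, 2, 1, 0, 0, 0, 2, 0)
)

# TRUE if any cell in the data frame is NA.
anyNA(check_df)

# Count of NAs per column, useful when anyNA() returns TRUE.
colSums(is.na(check_df))

# Years in the 2011-2022 range that have no row at all.
missing_years <- setdiff(2011:2022, as.integer(check_df$Year))
missing_years
```

With these values the only year without a row is 2015, so the table has no NAs but is not a continuous yearly series.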
colnames(barty_df) <- c("Year", "Rank", "Titles", "Wins", "Losses", "Hard Court Wins", "Hard Court Loss", "Clay Court Win", "Clay Court Loss", "Grass Court Win", "Grass Court Loss")
# we will see the output further down once we inspect the data again
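Next we’ll create a list holding the numbers 1 to 11, one for each of the 11 levels of our categorical variable (Year). Note that list(1:11) produces a list with a single element, the vector 1 to 11.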
my.list <- list(1:11)
my.list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11
Then we’ll extract the Year column as “years” and join it with my.list using the cbind() function, calling the result “yearslist”.
# Save the Year column (barty_df[1] returns a one-column data frame)
years <- barty_df[1]
years
# Join my.list to the Year column using cbind() and view the result
yearslist <- cbind(my.list, years)
yearslist
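A small base R sketch of the two ideas in this step: how list() differs from as.list(), and what cbind() returns when one argument is a data frame. All names here (one_element, many_elements, years_demo, combined) are hypothetical and chosen only for illustration.

```r
# list() wraps its argument as a single element; as.list() splits it up.
one_element   <- list(1:3)     # one element: the vector 1, 2, 3
many_elements <- as.list(1:3)  # three elements: 1, 2 and 3

length(one_element)
length(many_elements)

# A one-column data frame, like barty_df[1].
years_demo <- data.frame(Year = c("2022", "2021", "2020"))

# When any argument is a data frame, cbind() uses its data frame method,
# so the result is a data frame with one column per input.
combined <- cbind(id = 1:3, years_demo)
combined
```

Naming the argument (id = 1:3) gives the new column a readable name; an unnamed list argument ends up with an automatically generated one.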
Next we need to select the first 10 observations, with all of their variables, and convert them to a matrix.
# Using the head() function, we'll select the first 10 rows (n = 10) and save them as sub10
sub10 <- head(barty_df, n = 10)
sub10
# Convert sub10 to a matrix using the as.matrix() function and confirm.
bartymatrix <- as.matrix(sub10)
bartymatrix
## Year Rank Titles Wins Losses Hard Court Wins Hard Court Loss
## 1 "2022" " 1" "2" "11" " 0" "11" " 0"
## 2 "2021" " 1" "5" "42" " 7" "21" " 3"
## 3 "2020" " 1" "1" "11" " 4" "11" " 4"
## 4 "2019" " 1" "4" "59" "14" "40" "11"
## 5 "2018" " 15" "2" "46" "19" "27" "11"
## 6 "2017" " 17" "1" "42" "16" "31" "10"
## 7 "2016" "272" "0" "11" " 4" " 2" " 1"
## 8 "2014" "223" "0" "15" "10" "10" " 4"
## 9 "2013" "190" "0" "12" "12" " 6" " 6"
## 10 "2012" "175" "2" "24" " 9" "15" " 6"
## Clay Court Win Clay Court Loss Grass Court Win Grass Court Loss
## 1 " 0" "0" " 0" "0"
## 2 "14" "4" " 7" "0"
## 3 " 0" "0" " 0" "0"
## 4 "11" "2" " 8" "1"
## 5 " 7" "5" "12" "3"
## 6 " 5" "3" " 6" "3"
## 7 " 0" "0" " 9" "3"
## 8 " 3" "4" " 2" "2"
## 9 " 6" "4" " 0" "2"
## 10 " 4" "2" " 5" "1"
class(bartymatrix)
## [1] "matrix" "array"
str(bartymatrix)
## chr [1:10, 1:11] "2022" "2021" "2020" "2019" "2018" "2017" "2016" "2014" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:10] "1" "2" "3" "4" ...
## ..$ : chr [1:11] "Year" "Rank" "Titles" "Wins" ...
The result is a character matrix because a matrix can hold only one data type. The data frame mixes a character column (Year) with numeric columns, so as.matrix() coerces every value to character, which is why all the numbers are printed in quotes.
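The coercion can be reproduced on a few rows. This is a sketch under the assumption that a numeric matrix is what you would want for calculations; mixed, m_chr and m_num are hypothetical names, and the values are copied from the first three rows above.

```r
# A character column next to numeric columns, as in barty_df.
mixed <- data.frame(
  Year   = c("2022", "2021", "2020"),
  Wins   = c(11, 42, 11),
  Losses = c(0, 7, 4)
)

# With a character column present, every cell becomes a character string.
m_chr <- as.matrix(mixed)
typeof(m_chr)

# Dropping Year before converting keeps the matrix numeric;
# the years can be kept as row names instead of as a column.
m_num <- as.matrix(mixed[c("Wins", "Losses")])
rownames(m_num) <- mixed$Year
typeof(m_num)

# Arithmetic works on the numeric matrix but not on the character one.
colSums(m_num)
```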
Next we’ll select the first and last variables (Year and Grass Court Loss) and save them as a new .rds file called “bartyvar2.rds”.
# Step 9 - subset data frame including only the first and last variable in the dataset
bartyvar2 <- barty_df[c(1, 11)]
bartyvar2
# Save it as an .rds file and list the files in my assignment directory to confirm.
saveRDS(bartyvar2, "bartyvar2.rds")
list.files("C:/Users/HFCMasters/Documents/Mels documents/RMIT/Assignment 1/Assignment 1")
## character(0)
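list.files() returns character(0) when the folder it is given is empty or does not exist. Since saveRDS() was given only a file name, the file is written to the current working directory, which may not be the folder being listed. A minimal sketch of a more direct check, assuming a temporary folder and hypothetical names (demo_sub, rds_path), is to test the exact path and read the file back:

```r
# Two rows copied from the table above; check.names = FALSE keeps the spaces
# in the column name.
demo_sub <- data.frame(
  Year = c("2022", "2021"),
  `Grass Court Loss` = c(0, 0),
  check.names = FALSE
)

# Build one full path and use it for both saving and checking.
rds_path <- file.path(tempdir(), "bartyvar2_demo.rds")
saveRDS(demo_sub, rds_path)

# Confirm the file exists at exactly that path.
file.exists(rds_path)

# Read it back and compare with the original object.
restored <- readRDS(rds_path)
all.equal(restored, demo_sub)
```

Using the same path variable for saving and checking avoids the mismatch between where the file was written and where it was looked for.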
Ash Barty wins Australian Open 2022. Photo taken from http.gma.news.tv, all rights reserved.