Ash Barty career stats

Ashleigh Barty is at the top of her game as world No. 1 for the fourth consecutive year. This year she won the Australian Open, her first title there and her third Grand Slam singles title. As an Aussie, she has inspired a generation of young tennis players and been a role model for the game.

Sadly, this week she decided to formally retire from the sport. It's not her first time, but I feel like she means it this time. Upon hearing the news, I decided to change my data source for this project to her career stats and shed some light on her wonderful career.

Setup

# Load the packages needed to run this report
library(magrittr) # for useful pipes 
library(knitr) # useful for compiling data tables
library(readr) # Useful for importing data
library(readxl) # useful for reading data in excel spreadsheets
library(dplyr) # useful to group data together
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr) # For changing the structure of the data 
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(rvest) # needed to scrape data from html sources
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding

Data Description

I searched the internet to find a good summary of Ash’s wins / losses through the years.

I found what I was looking for on this website: https://www.tennisstats247.com/players/A-Barty-3627/

The table I’m interested in is Ash’s career singles stats at the top of the page.

Import Data

The table describes a number of statistics about her game per year, with Year being the categorical variable.

The quantitative variables include:

  • Rank
  • Titles won
  • Overall Wins
  • Overall Losses
  • Hard Court Wins
  • Hard Court Losses
  • Clay Court Wins
  • Clay Court Losses
  • Grass Court Wins
  • Grass Court Losses

Now that we understand the data, let's import it into R as a data.frame.

# First we'll identify the URL for the webpage in our parameters.

url_data <- "https://www.tennisstats247.com/players/A-Barty-3627/"

# Next we're going to read the HTML page into R by piping this URL into the read_html() function of the rvest package.

url_data %>%
  read_html()
## {html_document}
## <html xmlns="http://www.w3.org/1999/xhtml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="allbody">\n<script type="text/javascript" lang="javascript" src ...

We can see the components of the HTML document, such as head and body.

I need to inspect the HTML document in my browser to find the CSS selector (the table id) of the table I am interested in. I can do this using the developer tools in Chrome.

Let's save the CSS selector in our parameters now:

css_selector <- "#gvCareerStats"

Now we can extract the career stats table into R.

url_data %>% 
  read_html() %>%
  html_element(css = css_selector)
## {html_node}
## <table class="gridAlternate careerStats" cellspacing="0" id="gvCareerStats">
##  [1] <tr>\n<th scope="col">\n<span title="Year">Year</span>\n</th>\n<th scope ...
##  [2] <tr>\n<td>\n2022\n</td>\n<td>\n1\n</td>\n<td>\n2\n</td>\n<td>\n11\n</td> ...
##  [3] <tr>\n<td>\n2021\n</td>\n<td>\n1\n</td>\n<td>\n5\n</td>\n<td>\n42\n</td> ...
##  [4] <tr>\n<td>\n2020\n</td>\n<td>\n1\n</td>\n<td>\n1\n</td>\n<td>\n11\n</td> ...
##  [5] <tr>\n<td>\n2019\n</td>\n<td>\n1\n</td>\n<td>\n4\n</td>\n<td>\n59\n</td> ...
##  [6] <tr>\n<td>\n2018\n</td>\n<td>\n15\n</td>\n<td>\n2\n</td>\n<td>\n46\n</td ...
##  [7] <tr>\n<td>\n2017\n</td>\n<td>\n17\n</td>\n<td>\n1\n</td>\n<td>\n42\n</td ...
##  [8] <tr>\n<td>\n2016\n</td>\n<td>\n272\n</td>\n<td>\n0\n</td>\n<td>\n11\n</t ...
##  [9] <tr>\n<td>\n2014\n</td>\n<td>\n223\n</td>\n<td>\n0\n</td>\n<td>\n15\n</t ...
## [10] <tr>\n<td>\n2013\n</td>\n<td>\n190\n</td>\n<td>\n0\n</td>\n<td>\n12\n</t ...
## [11] <tr>\n<td>\n2012\n</td>\n<td>\n175\n</td>\n<td>\n2\n</td>\n<td>\n24\n</t ...
## [12] <tr>\n<td>\n2011\n</td>\n<td>\n679\n</td>\n<td>\n0\n</td>\n<td>\n0\n</td ...

As we can see, the data is still in HTML format, but it's starting to come together.

Next we’ll need to import it as a data table using the html_table() function.

barty_tbl <- url_data %>% 
  read_html() %>% 
  html_element(css = css_selector) %>% 
  html_table()

barty_tbl
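
The selector-then-table steps can also be tried offline. The following is a minimal sketch on a made-up inline page: the table id demoStats and its three columns are assumptions for illustration only, not part of the real website. It uses minimal_html() from rvest to build the page from a string, then applies the same html_element() and html_table() calls.

```r
library(rvest)

# A tiny hypothetical page with one table; the id "demoStats" is made up
page <- minimal_html('
  <table id="demoStats">
    <tr><th>Year</th><th>W</th><th>L</th></tr>
    <tr><td>2021</td><td>42</td><td>7</td></tr>
    <tr><td>2019</td><td>59</td><td>14</td></tr>
  </table>
')

# Select the table by its id, then parse it into a tibble
demo_tbl <- page %>%
  html_element(css = "#demoStats") %>%
  html_table()

demo_tbl
```

Because the selector starts with "#", it matches on the element's id attribute, which is why the id found in the browser's developer tools is all that is needed.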

Let's inspect the data structure and class.

dim(barty_tbl)
## [1] 11 11
class(barty_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"

It seems R has automatically imported the data as a tibble, which is fine, but for the sake of this exercise we'll need to change the class to a “data.frame”.

barty_df <- as.data.frame(barty_tbl, stringsAsFactors = FALSE)

class(barty_df)
## [1] "data.frame"

Inspect dataset and variables

Let's look at the data in more detail:

str(barty_df)
## 'data.frame':    11 obs. of  11 variables:
##  $ Year  : int  2022 2021 2020 2019 2018 2017 2016 2014 2013 2012 ...
##  $ Rank  : int  1 1 1 1 15 17 272 223 190 175 ...
##  $ Titles: int  2 5 1 4 2 1 0 0 0 2 ...
##  $ W     : int  11 42 11 59 46 42 11 15 12 24 ...
##  $ L     : int  0 7 4 14 19 16 4 10 12 9 ...
##  $ HW    : int  11 21 11 40 27 31 2 10 6 15 ...
##  $ HL    : int  0 3 4 11 11 10 1 4 6 6 ...
##  $ CW    : int  0 14 0 11 7 5 0 3 6 4 ...
##  $ CL    : int  0 4 0 2 5 3 0 4 4 2 ...
##  $ GW    : int  0 7 0 8 12 6 9 2 0 5 ...
##  $ GL    : int  0 0 0 1 3 3 3 2 2 1 ...

A few observations using the str() function:

  • the data contains 11 observations of 11 variables
  • each variable is stored as an integer, so all are treated as quantitative
  • we can see that it's sorted by Year, which is really a qualitative or categorical variable in this case
  • the column names are abbreviated and might need to be expanded to make them easier to understand

# Convert the Year vector to character as it's our categorical variable

barty_df$Year <- as.character(barty_df$Year)

# Convert each of the remaining columns to numeric using lapply() (list apply) so that we can run some analysis on the numbers.

barty_df[2:11] <- lapply(barty_df[2:11], as.numeric)

# Check structure of barty_df now to confirm class of each vector

str(barty_df)
## 'data.frame':    11 obs. of  11 variables:
##  $ Year  : chr  "2022" "2021" "2020" "2019" ...
##  $ Rank  : num  1 1 1 1 15 17 272 223 190 175 ...
##  $ Titles: num  2 5 1 4 2 1 0 0 0 2 ...
##  $ W     : num  11 42 11 59 46 42 11 15 12 24 ...
##  $ L     : num  0 7 4 14 19 16 4 10 12 9 ...
##  $ HW    : num  11 21 11 40 27 31 2 10 6 15 ...
##  $ HL    : num  0 3 4 11 11 10 1 4 6 6 ...
##  $ CW    : num  0 14 0 11 7 5 0 3 6 4 ...
##  $ CL    : num  0 4 0 2 5 3 0 4 4 2 ...
##  $ GW    : num  0 7 0 8 12 6 9 2 0 5 ...
##  $ GL    : num  0 0 0 1 3 3 3 2 2 1 ...
# Count the total number of titles she won 

sum(barty_df$Titles)
## [1] 17
# Check the avg number of Wins she's had per year

mean(barty_df$W)
## [1] 24.81818
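
The conversion and summary steps can be reproduced on a small scale. This is a sketch under the assumption that a four-row excerpt (the 2022 to 2019 rows, with only the Year, Titles and W columns) is enough to show the pattern; demo_df is a hypothetical name and the totals apply to these four rows only.

```r
# Hypothetical excerpt: the 2022-2019 rows, three columns only
demo_df <- data.frame(
  Year = c(2022L, 2021L, 2020L, 2019L),
  Titles = c(2L, 5L, 1L, 4L),
  W = c(11L, 42L, 11L, 59L)
)

# Year is a label, not a quantity, so store it as character
demo_df$Year <- as.character(demo_df$Year)

# Convert the remaining columns to numeric in one pass
demo_df[2:3] <- lapply(demo_df[2:3], as.numeric)

# Confirm the class of every column, then summarise
col_classes <- sapply(demo_df, class)
total_titles <- sum(demo_df$Titles)
mean_wins <- mean(demo_df$W)

col_classes
```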

Tidy

In my opinion, the data is tidy because each observation (Year) is in a row and the corresponding variables are in columns. There are no missing values (NAs) in this file.
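
The claim about missing values can be checked rather than eyeballed. Below is a minimal sketch on a hypothetical three-row excerpt (the 2022 to 2020 rows, using the abbreviated column names); it also adds a consistency check, which is my own assumption about what a sensible sanity test looks like: wins on each surface should add up to overall wins.

```r
# Hypothetical excerpt of the career table (2022-2020 rows)
demo_df <- data.frame(
  Year = c("2022", "2021", "2020"),
  W = c(11, 42, 11),
  HW = c(11, 21, 11),
  CW = c(0, 14, 0),
  GW = c(0, 7, 0),
  stringsAsFactors = FALSE
)

# TRUE if any cell in the data frame is NA
any_missing <- anyNA(demo_df)

# Number of NA values in each column
na_per_column <- colSums(is.na(demo_df))

# Hard, clay and grass wins should sum to overall wins in every row
surfaces_match <- all(demo_df$HW + demo_df$CW + demo_df$GW == demo_df$W)
```

The same three calls should work unchanged on the full barty_df.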

The only thing I would like to update in this case is the column names, as they may be hard to understand unless spelled out completely. To do so, we'll use the colnames() function.

colnames(barty_df) <- c("Year", "Rank", "Titles", "Wins", "Losses", "Hard Court Wins", "Hard Court Loss", "Clay Court Win", "Clay Court Loss", "Grass Court Win", "Grass Court Loss")

# we will see the output further down once we inspect the data again
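
One consequence of names that contain spaces, such as "Hard Court Wins", is that they can no longer be typed bare after the $ operator. A small sketch of the two usual ways to reach such a column, using a hypothetical two-row data frame:

```r
# Hypothetical two-row excerpt with abbreviated names
demo_df <- data.frame(Year = c("2022", "2021"), HW = c(11, 21), HL = c(0, 3))
colnames(demo_df) <- c("Year", "Hard Court Wins", "Hard Court Loss")

# Names with spaces need backticks when used with $
hard_wins <- demo_df$`Hard Court Wins`

# A quoted string inside [[ ]] works as well
hard_losses <- demo_df[["Hard Court Loss"]]
```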

Create a list

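Next we'll create a list holding the numbers 1 to 11, one for each of the 11 years (our categorical variable). Note that list(1:11) stores the whole sequence as a single list element.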

my.list <- list(1:11)

my.list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11
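
The `[[1]]` in the output shows that the list has only one element, which is the whole vector 1 to 11. As a short sketch of the difference, as.list() would instead give one element per number; the variable names here are hypothetical.

```r
# A list with a single element: the vector 1..11
one_element <- list(1:11)

# A list with eleven elements, one per number
eleven_elements <- as.list(1:11)

length(one_element)
length(eleven_elements)
```

Either form can be bound to the years afterwards; which one is appropriate depends on whether the index should be treated as one column or as separate items.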

Join the list

Then we'll save the Year column as years and join it with my.list using the cbind() function, calling the result “yearslist”.

# save the Year column (barty_df[1] returns a one-column data frame, not a plain vector)

years <- barty_df[1]

years
# join the list using cbind to the years category and view 

yearslist <- cbind(my.list, years)

yearslist
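
Single-bracket indexing such as barty_df[1] keeps the result as a one-column data frame, which is why cbind() returns a data frame here. A minimal sketch of the distinction on a hypothetical three-row data frame:

```r
demo_df <- data.frame(
  Year = c("2022", "2021", "2020"),
  Rank = c(1, 1, 1),
  stringsAsFactors = FALSE
)

years_df <- demo_df[1]     # single brackets keep a one-column data frame
years_vec <- demo_df[[1]]  # double brackets extract the column as a vector

# Binding an index list to the one-column data frame gives a data frame
index_list <- list(1:3)
joined <- cbind(index_list, years_df)

class(years_df)
class(years_vec)
dim(joined)
```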

Subsetting (10 observations) and convert to matrix

Next we need to select the first 10 observations with all of their variables and convert the result to a matrix.

# Using the head() function, we'll select number of rows = 10 and save as sub10
sub10 <- head(barty_df, n = 10)

sub10
# Convert sub10 to a matrix using the as.matrix() function and confirm. 

bartymatrix <- as.matrix(sub10)

bartymatrix
##    Year   Rank  Titles Wins Losses Hard Court Wins Hard Court Loss
## 1  "2022" "  1" "2"    "11" " 0"   "11"            " 0"           
## 2  "2021" "  1" "5"    "42" " 7"   "21"            " 3"           
## 3  "2020" "  1" "1"    "11" " 4"   "11"            " 4"           
## 4  "2019" "  1" "4"    "59" "14"   "40"            "11"           
## 5  "2018" " 15" "2"    "46" "19"   "27"            "11"           
## 6  "2017" " 17" "1"    "42" "16"   "31"            "10"           
## 7  "2016" "272" "0"    "11" " 4"   " 2"            " 1"           
## 8  "2014" "223" "0"    "15" "10"   "10"            " 4"           
## 9  "2013" "190" "0"    "12" "12"   " 6"            " 6"           
## 10 "2012" "175" "2"    "24" " 9"   "15"            " 6"           
##    Clay Court Win Clay Court Loss Grass Court Win Grass Court Loss
## 1  " 0"           "0"             " 0"            "0"             
## 2  "14"           "4"             " 7"            "0"             
## 3  " 0"           "0"             " 0"            "0"             
## 4  "11"           "2"             " 8"            "1"             
## 5  " 7"           "5"             "12"            "3"             
## 6  " 5"           "3"             " 6"            "3"             
## 7  " 0"           "0"             " 9"            "3"             
## 8  " 3"           "4"             " 2"            "2"             
## 9  " 6"           "4"             " 0"            "2"             
## 10 " 4"           "2"             " 5"            "1"
class(bartymatrix)
## [1] "matrix" "array"
str(bartymatrix)
##  chr [1:10, 1:11] "2022" "2021" "2020" "2019" "2018" "2017" "2016" "2014" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:11] "Year" "Rank" "Titles" "Wins" ...

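Why have we ended up with this structure?

Because a matrix can hold only one data type. We converted Year to character earlier, so as.matrix() coerces every column to character: for a data frame with any non-numeric column it returns a character matrix, formatting the numeric columns as text (which is why some values are padded with spaces, such as “ 0”).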
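
The effect is easy to reproduce on a small scale. The sketch below uses a hypothetical three-row data frame and shows one possible way to keep the numbers numeric: drop the character Year column before converting and keep the years as row names instead. This is an illustration, not a step the assignment requires.

```r
demo_df <- data.frame(
  Year = c("2022", "2021", "2020"),
  Wins = c(11, 42, 11),
  Losses = c(0, 7, 4),
  stringsAsFactors = FALSE
)

# With the character Year column included, every cell becomes character
chr_matrix <- as.matrix(demo_df)
typeof(chr_matrix)

# Dropping Year first keeps the matrix numeric; Year lives on as row names
num_matrix <- as.matrix(demo_df[-1])
rownames(num_matrix) <- demo_df$Year
typeof(num_matrix)
```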

Subsetting (first and last variable)

Next we'll select the first and last variables (Year and Grass Court Loss) and save them as a new .rds file called “bartyvar2.rds”.

# Step 9 - subset data frame including only the first and last variable in the dataset
bartyvar2 <- barty_df[c(1, 11)]
bartyvar2
# Save it as an .rds file and list the files in my assignment folder to confirm.
saveRDS(bartyvar2, "bartyvar2.rds")

list.files("C:/Users/HFCMasters/Documents/Mels documents/RMIT/Assignment 1/Assignment 1")
## character(0)
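
The character(0) result means list.files() found nothing at that path. saveRDS() with a bare file name writes to the current working directory (see getwd()), which may be a different folder. A more direct confirmation is to test for the file and read it back; the sketch below does this with a hypothetical two-row data frame and a temporary path so that it can run anywhere.

```r
demo_df <- data.frame(
  Year = c("2022", "2021"),
  GL = c(0, 0),
  stringsAsFactors = FALSE
)

# tempfile() gives a throwaway path; the report itself writes to the working directory
rds_path <- tempfile(fileext = ".rds")
saveRDS(demo_df, rds_path)

# Confirm the file exists, then read it back and compare with the original
saved_ok <- file.exists(rds_path)
restored <- readRDS(rds_path)
round_trip_ok <- isTRUE(all.equal(restored, demo_df))
```

In the report itself, file.exists("bartyvar2.rds") would perform the same check against the working directory.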

Ash Barty wins Australian Open 2022, photo taken from http.gma.news.tv all rights reserved

The End