Important

You must also publish your report to RPubs (see here) and add this RPubs link to the comments/description section in Turnitin while uploading your report. This online version of the report will be used for marking. Failure to submit your link will delay your feedback and risk late penalties.

Load packages

library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(mlr)
library(lubridate)
library(forcats)
library(infotheo)
library(ggplot2)

Executive Summary

The data is loaded from the working directory and joined using the 1-1 player_id variable. Providing a singular dataset with demographic and ranking variables. Next, playing hand (hand) is factorised and the labels are expanded, the country code is factorised and date variables are converted to such, finally, a summary of the columns and structure of the data frame is displayed. We then separate the date columns, so as each atomic component has an individual column, finally data type conversions occur. We then create a player age column, that represents the player’s age at the time of the ranking, this is calculated by, subtracting birth year, from the ranking year. Then we scan the data frame for missing values, locating a single row missing all demographic information and remove this row. We then convert null values in ranking points and tours variables to 0, before finally converting null values in playing hand, birth year, month, day and age to 0. we then assess for outliers visually using boxplots for all variables, we determine outliers to be present within ranking points and player age. As these values are legitimate, removing them from the data would not be an option, we decided to bin these variables using the equal width method. Finally, we assess the distribution of the ranking points variable, and detect a heavy positive skew, to better inform our analysis we decide to apply a log function as to normalise this variable.

Data

The Players dataset contains historical descriptive information regarding female tennis players, whilst the Rankings dataset contains information regarding world ranking at specified points in history - the rankings dataset ranges from 1984 to 2017, whilst the players set has information regarding players born between 1918 and 2003.

The source:

source: #https://www.kaggle.com/joaoevangelista/wta-matches-and-rankings

variable descriptions:

steps:

players <- read_csv("players.csv")
rankings <- read_csv("rankings.csv")

playRank <- left_join(rankings, players, by = "player_id") 

head(players)
head(rankings)
head(playRank)

Understand

In this step, we have defined the hand variable as a factor and expanded upon the ‘R’ & ‘L’ labels into ‘Right’ & ‘Left labels, also defining the ’country_code’ as a factor - thus satisfying min requirement 4. Next, we have defined the date variables into ‘date’ data type satisfying min requirement 2 and 3. Finally, we display the structure of the data frame and a summary of the columns.

playRank$hand <- factor(playRank$hand,
                        levels = c("R", "L"),
                        labels = c("Right", "Left"))

playRank$country_code <- factor(playRank$country_code) 

playRank$ranking_date <- ymd(playRank$ranking_date)
## Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'zone/tz/2018c.1.0/
## zoneinfo/Australia/Melbourne'
playRank$birth_date <- ymd(playRank$birth_date)

str(playRank)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1597902 obs. of  10 variables:
##  $ ranking_date  : Date, format: "2000-01-01" "2000-01-01" ...
##  $ ranking       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ player_id     : num  200001 200128 200748 200033 200096 ...
##  $ ranking_points: num  6074 4841 4378 3021 2658 ...
##  $ tours         : num  NA NA 13 15 NA NA NA NA NA NA ...
##  $ first_name    : chr  "Martina" "Lindsay" "Venus" "Serena" ...
##  $ last_name     : chr  "Hingis" "Davenport" "Williams" "Williams" ...
##  $ hand          : Factor w/ 2 levels "Right","Left": 1 1 1 1 1 2 1 1 1 1 ...
##  $ birth_date    : Date, format: "1980-09-30" "1976-06-08" ...
##  $ country_code  : Factor w/ 111 levels "ALB","ALG","ANG",..: 92 106 106 106 33 106 33 7 33 33 ...
summarizeColumns(playRank)  %>% kable(caption = "Feature Summary Prior to Data Preprocessing") 
## Warning in mde(x): NAs introduced by coercion
## Warning in mde(x): NAs introduced by coercion
Feature Summary Prior to Data Preprocessing
name type na mean disp median mad min max nlevs
ranking_date Date 0 NA NA NA NA 1 1523 1779
ranking integer 0 542.54714 355.260745 498 429.9540 1 1515 0
player_id numeric 1 203828.81259 3627.806206 202642 2333.6124 200001 220593 0
ranking_points numeric 19446 134.83696 467.849367 17 22.2390 0 13615 0
tours numeric 1087461 12.99519 9.042693 12 11.8608 0 41 0
first_name character 1 NA NA NA NA 1 16494 3075
last_name character 1 NA NA NA NA 1 7104 6693
hand factor 720000 NA NA NA NA 84647 793255 2
birth_date Date 402 NA NA NA NA 1 2063 5757
country_code factor 1 NA NA NA NA 42 177267 111

Tidy & Manipulate Data I

steps:

An inital check of the dataframe confirms both date variables (ranking_date & birth_date) contain multiple variables (year, month and day), and thus violating the tidy data principles.

Using the tidyr::seperate function, we break ‘ranking_date’ and ‘birth_date’ into the atomic components (year, month and day).

Finally define new variables as numeric.

playRank <- separate(playRank,ranking_date, into = c("ranking_date_year", "ranking_date_month", "ranking_date_day"),sep = '-')
playRank <- separate(playRank,birth_date, into = c("birth_date_year", "birth_date_month", "birth_date_day"),sep = '-')

playRank <- playRank %>%
  mutate(
    ranking_date_day   = as.numeric(ranking_date_day),
    ranking_date_month = as.numeric(ranking_date_month),
    ranking_date_year  = as.numeric(ranking_date_year),
    
    birth_date_day   = as.numeric(birth_date_day),
    birth_date_month = as.numeric(birth_date_month),
    birth_date_year  = as.numeric(birth_date_year),
    
    player_id      = as.integer(player_id),
    ranking_points = as.integer(ranking_points),
    tours          = as.integer(tours)
  )

Tidy & Manipulate Data II

steps: 1. create a column ‘age’, to store the age of the player in the ranking year.

playRank <- playRank %>%
  mutate(player_age = ranking_date_year - birth_date_year)

head(playRank)

Scan I

Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values. steps: * scan data for missing values * We locate 1 observation missing, both first and last name, also missing the player_id - we delete this observation.

colSums(is.na(playRank))
##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  1              19446 
##              tours         first_name          last_name 
##            1087461                  1                  1 
##               hand    birth_date_year   birth_date_month 
##             720000                402                402 
##     birth_date_day       country_code         player_age 
##                402                  1                402
which(is.na(playRank$first_name))
## [1] 1593272
which(is.na(playRank$last_name))
## [1] 1593272
which(is.na(playRank$player_id))
## [1] 1593272
playRank<-playRank[-1593272,]

colSums(is.na(playRank))
##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0              19446 
##              tours         first_name          last_name 
##            1087461                  0                  0 
##               hand    birth_date_year   birth_date_month 
##             719999                401                401 
##     birth_date_day       country_code         player_age 
##                401                  0                401

steps:

  • Recode the Na values of rannking points, and tours to 0
playRank$ranking_points[is.na(playRank$ranking_points)]<- 0

playRank$tours[is.na(playRank$tours)]<- 0

colSums(is.na(playRank))
##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0                  0 
##              tours         first_name          last_name 
##                  0                  0                  0 
##               hand    birth_date_year   birth_date_month 
##             719999                401                401 
##     birth_date_day       country_code         player_age 
##                401                  0                401

steps:

  • Recode the Na in ‘hand’ to ‘unknown’
playRank$hand <- fct_explicit_na(playRank$hand, "unknown")

colSums(is.na(playRank))
##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0                  0 
##              tours         first_name          last_name 
##                  0                  0                  0 
##               hand    birth_date_year   birth_date_month 
##                  0                401                401 
##     birth_date_day       country_code         player_age 
##                401                  0                401

steps:

  • Recode the Na values of birth information with 0
playRank$birth_date_year[is.na(playRank$birth_date_year)] <- 0

playRank$birth_date_month[is.na(playRank$birth_date_month)] <- 0

playRank$birth_date_day[is.na(playRank$birth_date_day)] <- 0

playRank$player_age[is.na(playRank$player_age)] <- 0

colSums(is.na(playRank))
##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0                  0 
##              tours         first_name          last_name 
##                  0                  0                  0 
##               hand    birth_date_year   birth_date_month 
##                  0                  0                  0 
##     birth_date_day       country_code         player_age 
##                  0                  0                  0

Scan II

steps: * Scan the numeric data for outliers. + ranking
+ ranking_points
+ ranking_date_year
+ ranking_date_month
+ ranking_date_day
+ birth_date_year
+ birth_date_month
+ birth_date_day
+ age

Although, there are various outliers within the dataset - we have decided rather than removing any legitimate values, binning age and ranking points would be a more useful transformation for analysis.

boxplot(playRank$ranking ~ playRank$hand, main = "decting outliers in ranking", ylab = "ranking", xlab = "hands")

boxplot(playRank$ranking_points ~ playRank$hand, main="decting outliers in ranking_points", ylab = "ranking_points", xlab = "hands")

boxplot(playRank$ranking_date_year ~ playRank$hand, main="detecting outliers in ranking_date_year", ylab = "ranking_date_year", xlab = "hands")

boxplot(playRank$ranking_date_month ~ playRank$hand, main="detecting outliers in ranking_date_month", ylab = "ranking_date_month", xlab = "hands")

boxplot(playRank$ranking_date_day ~ playRank$hand, main="detecting outliers in ranking_date_day", ylab = "ranking_date_day", xlab = "hands")

boxplot(playRank$birth_date_month ~ playRank$hand, main="detecting outliers in birth_date_month", ylab = "birth_month_month", xlab = "hands")

boxplot(playRank$birth_date_day ~ playRank$hand, main="detecting outliers in birth_date_day", ylab = "birth_date_day", xlab = "hands")

boxplot(playRank$player_age ~ playRank$hand, main = "detecting outliers in age", ylab = 'age', xlab ='hands')

  • Select variables to apply equal width binning upon

  • Perform equal width binning using infotheo::discretize

to_bin <- playRank %>% select(ranking_points, player_age)

playRank_binned <- 
  discretize(to_bin, disc = "equalwidth")
  playRank %>% bind_cols(playRank_binned)

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

Finally, we display the distribution of the ranking points variable Then apply a logarithmic transformation on the ranking_points variable And, display the new distribution

a <- ggplot(playRank, aes(ranking_points)) + geom_histogram(colour = "white") + ggtitle("Historgram of player ranking points")
a

playRank <- playRank %>%
  mutate(log_rankingPoints = log(ranking_points))

b <- ggplot(playRank, aes(log_rankingPoints)) + geom_histogram(colour = "white") + ggtitle("Historgram of log player ranking points")   
b