Load packages

library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(mlr)
library(lubridate)
library(forcats)
library(infotheo)
library(ggplot2)

Executive Summary

The two datasets are loaded from the working directory and joined on the player_id key, giving a single dataset with demographic and ranking variables. Playing hand (hand) is converted to a factor with expanded labels, the country code is converted to a factor, and the date variables are converted to the Date type; a summary of the columns and the structure of the data frame is then displayed. We separate each date column so that every atomic component (year, month, day) has its own column, and apply the remaining data type conversions. We create a player age column representing the player’s age at the time of the ranking, calculated by subtracting birth year from ranking year. Scanning the data frame for missing values locates a single row missing all demographic information, which we remove. We then recode missing values in the ranking points and tours variables to 0, missing values in playing hand to ‘unknown’, and missing values in birth year, month, day and age to 0. We assess outliers visually using boxplots and find them in ranking points and player age. Because these values are legitimate, removing them is not an option, so we bin both variables using the equal-width method. Finally, we assess the distribution of the ranking points variable, detect a heavy positive skew, and apply a log transformation to reduce it.

Data

The Players dataset contains historical descriptive information about female tennis players, while the Rankings dataset contains their world rankings at specified points in time. The rankings range from 1984 to 2017, and the players dataset covers players born between 1918 and 2003.

The source:

https://www.kaggle.com/joaoevangelista/wta-matches-and-rankings

variable descriptions:

‘players.csv’:
- player_id: Unique identifying code of the player
- first_name: First name of the player
- last_name: Last name of the player
- hand: Preferred playing hand of the player
- birth_date: Birth date of the player
- country_code: A 3-letter code referencing the native country of the player

‘rankings.csv’:
- ranking_date: Specific date relating to the ranking
- ranking: The player ranking, 1 equating to the highest possible rank
- player_id: Unique identifying code of the player
- ranking_points: Cumulative total of ranking points
- tours: The meaning of this variable could not be determined from the source

steps:

Read the files from working directory
Join the ‘players.csv’ to ‘rankings.csv’ by ‘player_id’ and create a new dataframe ‘playRank’
Preview all dataframes

players <- read_csv("players.csv")
rankings <- read_csv("rankings.csv")

playRank <- left_join(rankings, players, by = "player_id") 

head(players)

head(rankings)

head(playRank)
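The join step above relies on ‘player_id’ identifying exactly one row in the players table. The following is a minimal base-R sketch of how that assumption could be checked; the toy data frames and their values are made up for illustration, and `merge(..., all.x = TRUE)` stands in for `left_join`.

```r
# Toy stand-ins for players.csv and rankings.csv (values are hypothetical)
players_toy <- data.frame(
  player_id  = c(1L, 2L, 3L),
  first_name = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
rankings_toy <- data.frame(
  player_id = c(1L, 1L, 2L, 4L),  # player 4 has no demographic record
  ranking   = c(1L, 2L, 5L, 9L)
)

# The lookup table should hold one row per key, otherwise the join duplicates rows
key_is_unique <- !any(duplicated(players_toy$player_id))

# Left join: all.x = TRUE keeps every ranking row, matched or not
joined_toy <- merge(rankings_toy, players_toy, by = "player_id", all.x = TRUE)

# Ranking rows whose player_id has no match end up with NA demographics
unmatched_ids <- setdiff(rankings_toy$player_id, players_toy$player_id)
```

If `key_is_unique` is TRUE, the joined table has the same number of rows as the rankings table, and `unmatched_ids` lists the keys that will carry missing demographic values.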

Understand

In this step, we define the hand variable as a factor and expand the ‘R’ and ‘L’ labels to ‘Right’ and ‘Left’. We also define ‘country_code’ as a factor, satisfying minimum requirement 4. Next, we convert the date variables to the Date type, satisfying minimum requirements 2 and 3. Finally, we display the structure of the data frame and a summary of the columns.

playRank$hand <- factor(playRank$hand,
                        levels = c("R", "L"),
                        labels = c("Right", "Left"))

playRank$country_code <- factor(playRank$country_code) 

playRank$ranking_date <- ymd(playRank$ranking_date)

## Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'zone/tz/2018c.1.0/
## zoneinfo/Australia/Melbourne'

playRank$birth_date <- ymd(playRank$birth_date)

str(playRank)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1597902 obs. of  10 variables:
##  $ ranking_date  : Date, format: "2000-01-01" "2000-01-01" ...
##  $ ranking       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ player_id     : num  200001 200128 200748 200033 200096 ...
##  $ ranking_points: num  6074 4841 4378 3021 2658 ...
##  $ tours         : num  NA NA 13 15 NA NA NA NA NA NA ...
##  $ first_name    : chr  "Martina" "Lindsay" "Venus" "Serena" ...
##  $ last_name     : chr  "Hingis" "Davenport" "Williams" "Williams" ...
##  $ hand          : Factor w/ 2 levels "Right","Left": 1 1 1 1 1 2 1 1 1 1 ...
##  $ birth_date    : Date, format: "1980-09-30" "1976-06-08" ...
##  $ country_code  : Factor w/ 111 levels "ALB","ALG","ANG",..: 92 106 106 106 33 106 33 7 33 33 ...

summarizeColumns(playRank)  %>% kable(caption = "Feature Summary Prior to Data Preprocessing")

## Warning in mde(x): NAs introduced by coercion

## Warning in mde(x): NAs introduced by coercion

Feature Summary Prior to Data Preprocessing
name	type	na	mean	disp	median	mad	min	max	nlevs
ranking_date	Date	0	NA	NA	NA	NA	1	1523	1779
ranking	integer	0	542.54714	355.260745	498	429.9540	1	1515	0
player_id	numeric	1	203828.81259	3627.806206	202642	2333.6124	200001	220593	0
ranking_points	numeric	19446	134.83696	467.849367	17	22.2390	0	13615	0
tours	numeric	1087461	12.99519	9.042693	12	11.8608	0	41	0
first_name	character	1	NA	NA	NA	NA	1	16494	3075
last_name	character	1	NA	NA	NA	NA	1	7104	6693
hand	factor	720000	NA	NA	NA	NA	84647	793255	2
birth_date	Date	402	NA	NA	NA	NA	1	2063	5757
country_code	factor	1	NA	NA	NA	NA	42	177267	111
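One property of the factor conversion is worth keeping in mind when reading the ‘na’ column: any value not listed in `levels` becomes NA. The sketch below uses a made-up vector in which "U" is a hypothetical extra code; whether the raw hand column contains such codes is not shown in this report, and running `table(players$hand, useNA = "ifany")` before recoding would confirm it.

```r
# Hypothetical raw values: "U" is an assumed extra code, NA is a true missing value
hand_raw <- c("R", "L", "U", NA, "R")

# Same recoding pattern as in the report: only "R" and "L" are declared as levels
hand_fct <- factor(hand_raw, levels = c("R", "L"), labels = c("Right", "Left"))

# Values outside the declared levels are silently converted to NA
n_missing_before <- sum(is.na(hand_raw))
n_missing_after  <- sum(is.na(hand_fct))

table(hand_fct, useNA = "ifany")
```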

Tidy & Manipulate Data I

steps:

An initial check of the data frame confirms that both date variables (ranking_date and birth_date) each hold multiple variables (year, month and day), violating the tidy data principles.

Using the tidyr::separate function, we break ‘ranking_date’ and ‘birth_date’ into their atomic components (year, month and day).

Finally, we convert the new variables to numeric.

playRank <- separate(playRank,ranking_date, into = c("ranking_date_year", "ranking_date_month", "ranking_date_day"),sep = '-')
playRank <- separate(playRank,birth_date, into = c("birth_date_year", "birth_date_month", "birth_date_day"),sep = '-')

playRank <- playRank %>%
  mutate(
    ranking_date_day   = as.numeric(ranking_date_day),
    ranking_date_month = as.numeric(ranking_date_month),
    ranking_date_year  = as.numeric(ranking_date_year),
    
    birth_date_day   = as.numeric(birth_date_day),
    birth_date_month = as.numeric(birth_date_month),
    birth_date_year  = as.numeric(birth_date_year),
    
    player_id      = as.integer(player_id),
    ranking_points = as.integer(ranking_points),
    tours          = as.integer(tours)
  )
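For comparison, the same year, month and day components can be extracted from a Date vector without splitting strings. This is a minimal base-R sketch on made-up dates, not the method used in this report; the object names are illustrative.

```r
# Example dates, including a missing value
d <- as.Date(c("2000-01-01", "1980-09-30", NA))

# format() extracts each component as text; as.integer() makes it numeric
date_parts <- data.frame(
  year  = as.integer(format(d, "%Y")),
  month = as.integer(format(d, "%m")),
  day   = as.integer(format(d, "%d"))
)

date_parts
```

Missing dates stay missing in all three components, which matches the behaviour of the separate-then-convert approach.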

Tidy & Manipulate Data II

steps:

Create a column ‘player_age’ to store the age of the player in the ranking year, calculated as ranking year minus birth year.

playRank <- playRank %>%
  mutate(player_age = ranking_date_year - birth_date_year)

head(playRank)
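Subtracting years gives the age the player reaches during the ranking year, which can be one year higher than the age on the ranking date itself. The sketch below contrasts the two on made-up dates; `exact_age` is a hypothetical helper written for this illustration and is not part of the report's pipeline.

```r
# Hypothetical helper: age in completed years on a given date
exact_age <- function(birth, on) {
  b <- as.POSIXlt(birth)
  o <- as.POSIXlt(on)
  age <- o$year - b$year
  # Subtract one year when the birthday has not yet occurred in that year
  not_yet <- (o$mon < b$mon) | (o$mon == b$mon & o$mday < b$mday)
  age - as.integer(not_yet)
}

birth  <- as.Date(c("1980-09-30", "1976-06-08"))
ranked <- as.Date(c("2000-01-01", "2000-07-01"))

# Year difference, as used in this report
year_diff <- as.integer(format(ranked, "%Y")) - as.integer(format(birth, "%Y"))

# Age in completed years on the ranking date
ages <- exact_age(birth, ranked)
```

For the first pair the year difference is 20 while the exact age is 19, because the ranking date falls before the birthday; for the second pair both give 24.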

Scan I

This step scans the data for missing values, inconsistencies and obvious errors, fulfilling minimum requirement #7.

steps:

Scan the data for missing values
One observation is missing its first name, last name and player_id; we delete this observation

colSums(is.na(playRank))

##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  1              19446 
##              tours         first_name          last_name 
##            1087461                  1                  1 
##               hand    birth_date_year   birth_date_month 
##             720000                402                402 
##     birth_date_day       country_code         player_age 
##                402                  1                402

which(is.na(playRank$first_name))

## [1] 1593272

which(is.na(playRank$last_name))

## [1] 1593272

which(is.na(playRank$player_id))

## [1] 1593272

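# Drop the row with no player_id (row 1593272 above) by condition rather than by hard-coded index
playRank <- playRank[!is.na(playRank$player_id), ]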

colSums(is.na(playRank))

##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0              19446 
##              tours         first_name          last_name 
##            1087461                  0                  0 
##               hand    birth_date_year   birth_date_month 
##             719999                401                401 
##     birth_date_day       country_code         player_age 
##                401                  0                401

steps:

Recode the NA values of ranking_points and tours to 0

playRank$ranking_points[is.na(playRank$ranking_points)]<- 0

playRank$tours[is.na(playRank$tours)]<- 0

colSums(is.na(playRank))

##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0                  0 
##              tours         first_name          last_name 
##                  0                  0                  0 
##               hand    birth_date_year   birth_date_month 
##             719999                401                401 
##     birth_date_day       country_code         player_age 
##                401                  0                401

steps:

Recode the NA values in ‘hand’ to ‘unknown’

playRank$hand <- fct_explicit_na(playRank$hand, "unknown")

colSums(is.na(playRank))

##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0                  0 
##              tours         first_name          last_name 
##                  0                  0                  0 
##               hand    birth_date_year   birth_date_month 
##                  0                401                401 
##     birth_date_day       country_code         player_age 
##                401                  0                401

steps:

Recode the NA values of the birth date components and player_age to 0

playRank$birth_date_year[is.na(playRank$birth_date_year)] <- 0

playRank$birth_date_month[is.na(playRank$birth_date_month)] <- 0

playRank$birth_date_day[is.na(playRank$birth_date_day)] <- 0

playRank$player_age[is.na(playRank$player_age)] <- 0

colSums(is.na(playRank))

##  ranking_date_year ranking_date_month   ranking_date_day 
##                  0                  0                  0 
##            ranking          player_id     ranking_points 
##                  0                  0                  0 
##              tours         first_name          last_name 
##                  0                  0                  0 
##               hand    birth_date_year   birth_date_month 
##                  0                  0                  0 
##     birth_date_day       country_code         player_age 
##                  0                  0                  0
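Recoding missing values to 0 keeps every row, but the zeros then take part in any later summary or plot. The base-R sketch below, on made-up ages, shows the effect on a mean and one way to remember which values were originally missing; the indicator column name is hypothetical.

```r
# Made-up ages with one missing value
age <- c(19, 24, NA, 31)

# Record which entries were missing before recoding
age_missing <- is.na(age)

# Recode NA to 0, as in the steps above
age_zero <- ifelse(is.na(age), 0, age)

# The zero pulls the mean down relative to the mean of the known values
mean_zero  <- mean(age_zero)
mean_known <- mean(age, na.rm = TRUE)
```

Keeping an indicator such as `age_missing` alongside the recoded column lets later steps exclude the placeholder zeros when that is appropriate.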

Scan II

steps:

Scan the numeric data for outliers:
+ ranking
+ ranking_points
+ ranking_date_year
+ ranking_date_month
+ ranking_date_day
+ birth_date_year
+ birth_date_month
+ birth_date_day
+ player_age

Although there are various outliers within the dataset, they are legitimate values, so rather than removing them we bin player age and ranking points, which is a more useful transformation for analysis.

boxplot(playRank$ranking ~ playRank$hand, main = "detecting outliers in ranking", ylab = "ranking", xlab = "hands")

boxplot(playRank$ranking_points ~ playRank$hand, main = "detecting outliers in ranking_points", ylab = "ranking_points", xlab = "hands")

boxplot(playRank$ranking_date_year ~ playRank$hand, main="detecting outliers in ranking_date_year", ylab = "ranking_date_year", xlab = "hands")

boxplot(playRank$ranking_date_month ~ playRank$hand, main="detecting outliers in ranking_date_month", ylab = "ranking_date_month", xlab = "hands")

boxplot(playRank$ranking_date_day ~ playRank$hand, main="detecting outliers in ranking_date_day", ylab = "ranking_date_day", xlab = "hands")

boxplot(playRank$birth_date_month ~ playRank$hand, main="detecting outliers in birth_date_month", ylab = "birth_date_month", xlab = "hands")

boxplot(playRank$birth_date_day ~ playRank$hand, main="detecting outliers in birth_date_day", ylab = "birth_date_day", xlab = "hands")

boxplot(playRank$player_age ~ playRank$hand, main = "detecting outliers in age", ylab = 'age', xlab ='hands')
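The points drawn beyond the whiskers in these boxplots can also be listed numerically. A minimal sketch on made-up values, using base R's `boxplot.stats`, which applies the same rule as `boxplot` (points more than 1.5 times the hinge spread beyond the hinges):

```r
# Made-up values with one extreme observation
x <- c(10, 12, 13, 15, 16, 18, 20, 95)

# Values that boxplot() would draw as individual points
outliers <- boxplot.stats(x)$out

# The same fences computed by hand from the lower and upper hinges
hinges <- fivenum(x)[c(2, 4)]
fences <- c(hinges[1] - 1.5 * diff(hinges),
            hinges[2] + 1.5 * diff(hinges))
```

Here the fences are 2.75 and 28.75, so only the value 95 is flagged.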

Select variables to apply equal width binning upon
Perform equal width binning using infotheo::discretize

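to_bin <- playRank %>% select(ranking_points, player_age)

# Equal-width binning; rename the outputs so they do not clash with the original columns
playRank_binned <- discretize(to_bin, disc = "equalwidth") %>%
  setNames(c("ranking_points_bin", "player_age_bin"))

# Attach the bin indices to the main data frame (the original discarded this result)
playRank <- playRank %>% bind_cols(playRank_binned)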
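To make the equal-width method concrete, the sketch below reproduces the idea in base R with `cut`: the range of the variable is split into intervals of identical width and each value is assigned the index of its interval. The helper name, the number of bins and the sample values are hypothetical choices for illustration.

```r
# Hypothetical helper: assign each value to one of nbins equally wide intervals
equal_width_bins <- function(x, nbins) {
  breaks <- seq(min(x), max(x), length.out = nbins + 1)
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# Made-up, heavily right-skewed values
points <- c(0, 5, 17, 40, 120, 900, 13615)

bins <- equal_width_bins(points, nbins = 3)
table(bins)
```

With a heavily skewed variable, nearly all observations fall into the first bin and the largest value sits alone in the last one, which is worth bearing in mind when interpreting the binned ranking points.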

Transform

This step applies an appropriate transformation to one of the variables, fulfilling minimum requirement #9.

We first display the distribution of the ranking_points variable, then apply a logarithmic transformation to it, and finally display the new distribution.

a <- ggplot(playRank, aes(ranking_points)) + geom_histogram(colour = "white") + ggtitle("Histogram of player ranking points")
a

# ranking_points contains zeros, and log(0) is -Inf, so use log1p(x) = log(1 + x)
playRank <- playRank %>%
  mutate(log_rankingPoints = log1p(ranking_points))

b <- ggplot(playRank, aes(log_rankingPoints)) + geom_histogram(colour = "white") + ggtitle("Histogram of log(1 + player ranking points)")
b
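A short base-R sketch of how a log transform behaves on a skewed variable that contains zeros, as ranking_points does once missing values have been recoded to 0. The sample values are made up.

```r
# Made-up, right-skewed values including a zero
points <- c(0, 5, 17, 40, 120, 900, 13615)

# The plain logarithm is not finite at zero
log_plain <- log(points)
n_infinite <- sum(is.infinite(log_plain))

# log1p(x) computes log(1 + x), which maps zero to zero and stays finite
log_shift <- log1p(points)

# The transformed values span a far narrower range than the raw values
range(points)
range(log_shift)
```

Plotting functions typically drop non-finite values, so with a plain logarithm the zero-point rows would disappear from the histogram rather than appear at its left edge.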

MATH2349 Semester 1, 2018

Assignment 3

Qiushun Liang s3584868 Fletcher Herman s3551652
