You must also publish your report to RPubs (see here) and add this RPubs link to the comments/description section in Turnitin while uploading your report. This online version of the report will be used for marking. Failure to submit your link will delay your feedback and risk late penalties.
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(mlr)
library(lubridate)
library(forcats)
library(infotheo)
library(ggplot2)
The data is loaded from the working directory and joined using the 1-1 player_id variable. Providing a singular dataset with demographic and ranking variables. Next, playing hand (hand) is factorised and the labels are expanded, the country code is factorised and date variables are converted to such, finally, a summary of the columns and structure of the data frame is displayed. We then separate the date columns, so as each atomic component has an individual column, finally data type conversions occur. We then create a player age column, that represents the player’s age at the time of the ranking, this is calculated by, subtracting birth year, from the ranking year. Then we scan the data frame for missing values, locating a single row missing all demographic information and remove this row. We then convert null values in ranking points and tours variables to 0, before finally converting null values in playing hand, birth year, month, day and age to 0. we then assess for outliers visually using boxplots for all variables, we determine outliers to be present within ranking points and player age. As these values are legitimate, removing them from the data would not be an option, we decided to bin these variables using the equal width method. Finally, we assess the distribution of the ranking points variable, and detect a heavy positive skew, to better inform our analysis we decide to apply a log function as to normalise this variable.
The Players dataset contains historical descriptive information regarding female tennis players, whilst the Rankings dataset contains information regarding world ranking at specified points in history - the rankings dataset ranges from 1984 to 2017, whilst the players set has information regarding players born between 1918 and 2003.
source: #https://www.kaggle.com/joaoevangelista/wta-matches-and-rankings
players <- read_csv("players.csv")
rankings <- read_csv("rankings.csv")
playRank <- left_join(rankings, players, by = "player_id")
head(players)
head(rankings)
head(playRank)
In this step, we have defined the hand variable as a factor and expanded upon the ‘R’ & ‘L’ labels into ‘Right’ & ‘Left labels, also defining the ’country_code’ as a factor - thus satisfying min requirement 4. Next, we have defined the date variables into ‘date’ data type satisfying min requirement 2 and 3. Finally, we display the structure of the data frame and a summary of the columns.
playRank$hand <- factor(playRank$hand,
levels = c("R", "L"),
labels = c("Right", "Left"))
playRank$country_code <- factor(playRank$country_code)
playRank$ranking_date <- ymd(playRank$ranking_date)
## Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'zone/tz/2018c.1.0/
## zoneinfo/Australia/Melbourne'
playRank$birth_date <- ymd(playRank$birth_date)
str(playRank)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1597902 obs. of 10 variables:
## $ ranking_date : Date, format: "2000-01-01" "2000-01-01" ...
## $ ranking : int 1 2 3 4 5 6 7 8 9 10 ...
## $ player_id : num 200001 200128 200748 200033 200096 ...
## $ ranking_points: num 6074 4841 4378 3021 2658 ...
## $ tours : num NA NA 13 15 NA NA NA NA NA NA ...
## $ first_name : chr "Martina" "Lindsay" "Venus" "Serena" ...
## $ last_name : chr "Hingis" "Davenport" "Williams" "Williams" ...
## $ hand : Factor w/ 2 levels "Right","Left": 1 1 1 1 1 2 1 1 1 1 ...
## $ birth_date : Date, format: "1980-09-30" "1976-06-08" ...
## $ country_code : Factor w/ 111 levels "ALB","ALG","ANG",..: 92 106 106 106 33 106 33 7 33 33 ...
summarizeColumns(playRank) %>% kable(caption = "Feature Summary Prior to Data Preprocessing")
## Warning in mde(x): NAs introduced by coercion
## Warning in mde(x): NAs introduced by coercion
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| ranking_date | Date | 0 | NA | NA | NA | NA | 1 | 1523 | 1779 |
| ranking | integer | 0 | 542.54714 | 355.260745 | 498 | 429.9540 | 1 | 1515 | 0 |
| player_id | numeric | 1 | 203828.81259 | 3627.806206 | 202642 | 2333.6124 | 200001 | 220593 | 0 |
| ranking_points | numeric | 19446 | 134.83696 | 467.849367 | 17 | 22.2390 | 0 | 13615 | 0 |
| tours | numeric | 1087461 | 12.99519 | 9.042693 | 12 | 11.8608 | 0 | 41 | 0 |
| first_name | character | 1 | NA | NA | NA | NA | 1 | 16494 | 3075 |
| last_name | character | 1 | NA | NA | NA | NA | 1 | 7104 | 6693 |
| hand | factor | 720000 | NA | NA | NA | NA | 84647 | 793255 | 2 |
| birth_date | Date | 402 | NA | NA | NA | NA | 1 | 2063 | 5757 |
| country_code | factor | 1 | NA | NA | NA | NA | 42 | 177267 | 111 |
An inital check of the dataframe confirms both date variables (ranking_date & birth_date) contain multiple variables (year, month and day), and thus violating the tidy data principles.
Using the tidyr::seperate function, we break ‘ranking_date’ and ‘birth_date’ into the atomic components (year, month and day).
Finally define new variables as numeric.
playRank <- separate(playRank,ranking_date, into = c("ranking_date_year", "ranking_date_month", "ranking_date_day"),sep = '-')
playRank <- separate(playRank,birth_date, into = c("birth_date_year", "birth_date_month", "birth_date_day"),sep = '-')
playRank <- playRank %>%
mutate(
ranking_date_day = as.numeric(ranking_date_day),
ranking_date_month = as.numeric(ranking_date_month),
ranking_date_year = as.numeric(ranking_date_year),
birth_date_day = as.numeric(birth_date_day),
birth_date_month = as.numeric(birth_date_month),
birth_date_year = as.numeric(birth_date_year),
player_id = as.integer(player_id),
ranking_points = as.integer(ranking_points),
tours = as.integer(tours)
)
steps: 1. create a column ‘age’, to store the age of the player in the ranking year.
playRank <- playRank %>%
mutate(player_age = ranking_date_year - birth_date_year)
head(playRank)
Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values. steps: * scan data for missing values * We locate 1 observation missing, both first and last name, also missing the player_id - we delete this observation.
colSums(is.na(playRank))
## ranking_date_year ranking_date_month ranking_date_day
## 0 0 0
## ranking player_id ranking_points
## 0 1 19446
## tours first_name last_name
## 1087461 1 1
## hand birth_date_year birth_date_month
## 720000 402 402
## birth_date_day country_code player_age
## 402 1 402
which(is.na(playRank$first_name))
## [1] 1593272
which(is.na(playRank$last_name))
## [1] 1593272
which(is.na(playRank$player_id))
## [1] 1593272
playRank<-playRank[-1593272,]
colSums(is.na(playRank))
## ranking_date_year ranking_date_month ranking_date_day
## 0 0 0
## ranking player_id ranking_points
## 0 0 19446
## tours first_name last_name
## 1087461 0 0
## hand birth_date_year birth_date_month
## 719999 401 401
## birth_date_day country_code player_age
## 401 0 401
steps:
playRank$ranking_points[is.na(playRank$ranking_points)]<- 0
playRank$tours[is.na(playRank$tours)]<- 0
colSums(is.na(playRank))
## ranking_date_year ranking_date_month ranking_date_day
## 0 0 0
## ranking player_id ranking_points
## 0 0 0
## tours first_name last_name
## 0 0 0
## hand birth_date_year birth_date_month
## 719999 401 401
## birth_date_day country_code player_age
## 401 0 401
steps:
playRank$hand <- fct_explicit_na(playRank$hand, "unknown")
colSums(is.na(playRank))
## ranking_date_year ranking_date_month ranking_date_day
## 0 0 0
## ranking player_id ranking_points
## 0 0 0
## tours first_name last_name
## 0 0 0
## hand birth_date_year birth_date_month
## 0 401 401
## birth_date_day country_code player_age
## 401 0 401
steps:
playRank$birth_date_year[is.na(playRank$birth_date_year)] <- 0
playRank$birth_date_month[is.na(playRank$birth_date_month)] <- 0
playRank$birth_date_day[is.na(playRank$birth_date_day)] <- 0
playRank$player_age[is.na(playRank$player_age)] <- 0
colSums(is.na(playRank))
## ranking_date_year ranking_date_month ranking_date_day
## 0 0 0
## ranking player_id ranking_points
## 0 0 0
## tours first_name last_name
## 0 0 0
## hand birth_date_year birth_date_month
## 0 0 0
## birth_date_day country_code player_age
## 0 0 0
steps: * Scan the numeric data for outliers. + ranking
+ ranking_points
+ ranking_date_year
+ ranking_date_month
+ ranking_date_day
+ birth_date_year
+ birth_date_month
+ birth_date_day
+ age
Although, there are various outliers within the dataset - we have decided rather than removing any legitimate values, binning age and ranking points would be a more useful transformation for analysis.
boxplot(playRank$ranking ~ playRank$hand, main = "decting outliers in ranking", ylab = "ranking", xlab = "hands")
boxplot(playRank$ranking_points ~ playRank$hand, main="decting outliers in ranking_points", ylab = "ranking_points", xlab = "hands")
boxplot(playRank$ranking_date_year ~ playRank$hand, main="detecting outliers in ranking_date_year", ylab = "ranking_date_year", xlab = "hands")
boxplot(playRank$ranking_date_month ~ playRank$hand, main="detecting outliers in ranking_date_month", ylab = "ranking_date_month", xlab = "hands")
boxplot(playRank$ranking_date_day ~ playRank$hand, main="detecting outliers in ranking_date_day", ylab = "ranking_date_day", xlab = "hands")
boxplot(playRank$birth_date_month ~ playRank$hand, main="detecting outliers in birth_date_month", ylab = "birth_month_month", xlab = "hands")
boxplot(playRank$birth_date_day ~ playRank$hand, main="detecting outliers in birth_date_day", ylab = "birth_date_day", xlab = "hands")
boxplot(playRank$player_age ~ playRank$hand, main = "detecting outliers in age", ylab = 'age', xlab ='hands')
Select variables to apply equal width binning upon
Perform equal width binning using infotheo::discretize
to_bin <- playRank %>% select(ranking_points, player_age)
playRank_binned <-
discretize(to_bin, disc = "equalwidth")
playRank %>% bind_cols(playRank_binned)
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.
Finally, we display the distribution of the ranking points variable Then apply a logarithmic transformation on the ranking_points variable And, display the new distribution
a <- ggplot(playRank, aes(ranking_points)) + geom_histogram(colour = "white") + ggtitle("Historgram of player ranking points")
a
playRank <- playRank %>%
mutate(log_rankingPoints = log(ranking_points))
b <- ggplot(playRank, aes(log_rankingPoints)) + geom_histogram(colour = "white") + ggtitle("Historgram of log player ranking points")
b