Required packages

library(readr)
library(tidyr)
package <U+393C><U+3E31>tidyr<U+393C><U+3E32> was built under R version 3.4.4
library(dplyr)

Attaching package: <U+393C><U+3E31>dplyr<U+393C><U+3E32>

The following objects are masked from <U+393C><U+3E31>package:stats<U+393C><U+3E32>:

    filter, lag

The following objects are masked from <U+393C><U+3E31>package:base<U+393C><U+3E32>:

    intersect, setdiff, setequal, union
library(Hmisc)
package <U+393C><U+3E31>Hmisc<U+393C><U+3E32> was built under R version 3.4.4Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: <U+393C><U+3E31>Hmisc<U+393C><U+3E32>

The following objects are masked from <U+393C><U+3E31>package:dplyr<U+393C><U+3E32>:

    src, summarize

The following objects are masked from <U+393C><U+3E31>package:base<U+393C><U+3E32>:

    format.pval, units
library(outliers)
library(lubridate)
package <U+393C><U+3E31>lubridate<U+393C><U+3E32> was built under R version 3.4.4
Attaching package: <U+393C><U+3E31>lubridate<U+393C><U+3E32>

The following object is masked from <U+393C><U+3E31>package:base<U+393C><U+3E32>:

    date

Executive Summary

The 3 datasets pertaining information related to formula one seasons were sourced from the data science platform Kaggle. The 3 datasets were first read into as data frames Laptimes, Drivers and Races. The data frame F1 was subsequently created by first joining Laptimes with Drivers using the key variable DriverID; the resulting data frame was then joined with Races using the key variable raceID to form F1.

The variables of F1 were assessed to determine their data type, those returning a data type of factors were further gauged to check their factor levels. A few variables were subsequently converted into data types better suited for their nature.

F1 was then checked to establish whether or not it conformed to the tidy data principles. Upon inspection it seemed that they did conform. 2 variables were then generated to facilitate calculations; one variable was created to determine the time passed by in weeks between today and when the race was held. A second variable, lap_time was established `by converting the variable “time” into a numeric nature to handle milliseconds.

The whole data frame was scanned to identify missing and special values. Only one value was found to be missing, the value in question was identified/located and dealt with by recoding the missing value.

The whole data frame was scanned to identify outliers using Univariate Outlier Detection method. Once identified and located, these values were dealt with by excluding them. The imputing and capping methods were not deemed suitable.

A histogram of the distribution of the lap time was produced to give a better insight into the data, however it resulted in a highly positively skewed distribution. A reciprocal square transformation was performed on the data to decrease the skewness and convert the distribution into a normal distribution.

Data

This dataset contains information relating to the formula one seasons from 1950 up to 2017. 3 datasets will be merged, with the first one holding information pertaining to the lap times for every race, including driverid, lap number, position and time; the second data set will contain a list of every F1 driver intcluding their first name, last name, date of birth and nationallity; the third data set will cover details of every race, including year, date, time, circuit and round.

Contents:-

Data set 1 Laptimes

raceID: Key variable

driverID: Key variable

Lap: The lap number being timed

Position: The current position of the driver whilst performing the lap

Time: The time registered for the lap

Data Set 2 Drivers:

DriverID : Key variable

forename: First Name

surname:

dob: date of birth

nationality

Data set 3 Races

raceid: Key Variable

round: The race number in the F1 championship calendar

name: The name of the grandprix

date: The date the race was held

These datasets were sourced from: url: https://www.kaggle.com/cjgdev/formula-1-race-data-19502017/data

setwd("C:/Users/Robby/Desktop/Master of Analytics/Yr 3 Sem 1/Data Pre-Processing/Assignment 3/Data")
The working directory was changed to C:/Users/Robby/Desktop/Master of Analytics/Yr 3 Sem 1/Data Pre-Processing/Assignment 3/Data inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
Laptimes <- read.csv("laptimes.csv")
Drivers <- read.csv("drivers.csv")
Races <- read.csv("races.csv")
head (Laptimes)
head (Drivers)
head (Races)
F1 <- (Lapdriver <- Laptimes %>% left_join(Drivers, by = "driverId")) %>% left_join(Races, by = "raceId")
head (F1)

Understand

The function class() was used to determine the data type, of the merged dataset F1. The data types returned were either integers or factors in nature. Once the the data types were returned, the variables “nationality” and “name” were further assessed using the function levels() to check the factor levels. The variable “position” was converted into an ordered factor type using the function as.factor() and then ordered using levels = c(“”). The variables “dob” and “date” were converted into the date type using the function as.date() to facilitate up-coming calculations.

cat("raceid:")
raceid:
class(F1$raceId)
[1] "integer"
cat("driverId:")
driverId:
class(F1$driverId)
[1] "integer"
cat("lap:")
lap:
class(F1$lap)
[1] "integer"
cat("Position:")
Position:
class(F1$position)
[1] "integer"
cat("time:")
time:
class(F1$time)
[1] "factor"
cat("forename:")
forename:
class(F1$forename)
[1] "factor"
cat("surname")
surname
class(F1$surname)
[1] "factor"
cat("dob")
dob
class(F1$dob)
[1] "factor"
cat("nationality")
nationality
class(F1$nationality)
[1] "factor"
cat("round")
round
class(F1$round)
[1] "integer"
cat("name")
name
class(F1$name)
[1] "factor"
cat("date")
date
class(F1$date)
[1] "factor"
cat ("\n","Nationalities:-", "\n")

 Nationalities:- 
levels(F1$nationality)
 [1] "American"          "American-Italian"  "Argentine"         "Argentine-Italian"
 [5] "Australian"        "Austrian"          "Belgian"           "Brazilian"        
 [9] "British"           "Canadian"          "Chilean"           "Colombian"        
[13] "Czech"             "Danish"            "Dutch"             "East German"      
[17] "Finnish"           "French"            "German"            "Hungarian"        
[21] "Indian"            "Indonesian"        "Irish"             "Italian"          
[25] "Japanese"          "Liechtensteiner"   "Malaysian"         "Mexican"          
[29] "Monegasque"        "New Zealander"     "Polish"            "Portuguese"       
[33] "Rhodesian"         "Russian"           "South African"     "Spanish"          
[37] "Swedish"           "Swiss"             "Thai"              "Uruguayan"        
[41] "Venezuelan"       
cat ("\n","GrandPrix:-", "\n")

 GrandPrix:- 
levels(F1$name)
 [1] "Abu Dhabi Grand Prix"          "Argentine Grand Prix"         
 [3] "Australian Grand Prix"         "Austrian Grand Prix"          
 [5] "Azerbaijan Grand Prix"         "Bahrain Grand Prix"           
 [7] "Belgian Grand Prix"            "Brazilian Grand Prix"         
 [9] "British Grand Prix"            "Caesars Palace Grand Prix"    
[11] "Canadian Grand Prix"           "Chinese Grand Prix"           
[13] "Dallas Grand Prix"             "Detroit Grand Prix"           
[15] "Dutch Grand Prix"              "European Grand Prix"          
[17] "French Grand Prix"             "German Grand Prix"            
[19] "Hungarian Grand Prix"          "Indian Grand Prix"            
[21] "Indianapolis 500"              "Italian Grand Prix"           
[23] "Japanese Grand Prix"           "Korean Grand Prix"            
[25] "Luxembourg Grand Prix"         "Malaysian Grand Prix"         
[27] "Mexican Grand Prix"            "Monaco Grand Prix"            
[29] "Moroccan Grand Prix"           "Pacific Grand Prix"           
[31] "Pescara Grand Prix"            "Portuguese Grand Prix"        
[33] "Russian Grand Prix"            "San Marino Grand Prix"        
[35] "Singapore Grand Prix"          "South African Grand Prix"     
[37] "Spanish Grand Prix"            "Swedish Grand Prix"           
[39] "Swiss Grand Prix"              "Turkish Grand Prix"           
[41] "United States Grand Prix"      "United States Grand Prix West"
F1$position <- as.factor(F1$position) 
cat ("\n")
F1$position <-  factor(F1$position,levels = c("1","2",  "3",  "4",  "5",  "6",  "7",  "8",  "9",  "10", "11", "12", "13", "14", "15", "16", "17", "18", "19","20", "21", "22", "23", "24"), ordered=TRUE )
cat("\n","New class for Postion:" )

 New class for Postion:
class(F1$position)
[1] "ordered" "factor" 
F1$dob <- as.Date(F1$dob, format = "%d/%m/%Y")
cat("\n","New class for dob")

 New class for dob
class(F1$dob)
[1] "Date"
F1$date <- as.Date(F1$date, format = "%d/%m/%Y")
cat("\n", "New class for date")

 New class for date
class(F1$date)
[1] "Date"

Tidy & Manipulate Data I

The data set F1 seems to conform to the tidy data principles since :-

each variable had its own column; each observation had its own row; *each value had its own cell.

Futhermore no values are being utilised as column headers in this data set, worth noting that the dob and date variables are not being split up into day, month and year. This is to facilitate upcoming calculations to be performed on these variables.

Tidy & Manipulate Data II

The variable “time_elapsed” was created to determine the time passed by in weeks between now and when the race was held. This was achieved using the function mutate() to create the variable and difftime() to work out the duration.

The variable lap_time was generated to facilitate calculations,the nature type was set to numeric to handle milliseconds. The function as.numeric() was utilised on the variable “time”.

F1 <- F1 %>% mutate(time_elapsed = round(difftime(today(), F1$date, unit="weeks"),2))
package <U+393C><U+3E31>bindrcpp<U+393C><U+3E32> was built under R version 3.4.4
F1 <- F1 %>% mutate(lap_time = F1$time <-as.numeric(ms(F1$time)))
head (F1)

Scan I

Utilising “sapply(F1, function(x) sum(is.na(x)))” enabled the scanning of all variables of the data set for missing values. A single value was found to be missing.

“which(is.na(F1$lap))” helped to locate the missing value identified in the variable “lap”, which represents the lap number being timed. Upon inspection of the missing entry the value used to recode the missing data was “20” following logical sequence of the entries since the preceding entry being 19 and the ensuing entry “21”. The function impute() helped to achieve this task.

“sapply(F1, function(x) sum(is.nan(x)))”" enabled the scanning of all variables of the data set for special values. None were found.

cat("Number of missing Values", "\n","\n")
Number of missing Values 
 
 sapply(F1, function(x) sum(is.na(x)))
      raceId     driverId          lap     position         time     forename      surname 
           0            0            1            0            0            0            0 
         dob  nationality        round         name         date time_elapsed     lap_time 
           0            0            0            0            0            0            0 
 cat("\n", "Missing value is found at entry")

 Missing value is found at entry
 which(is.na(F1$lap))
[1] 78
  cat("\n","Number of special Values", "\n","\n")

 Number of special Values 
 
 sapply(F1, function(x) sum(is.nan(x)))
      raceId     driverId          lap     position         time     forename      surname 
           0            0            0            0            0            0            0 
         dob  nationality        round         name         date time_elapsed     lap_time 
           0            0            0            0            0            0            0 
 
 sapply(F1, function(x) sum(is.infinite(x)))
      raceId     driverId          lap     position         time     forename      surname 
           0            0            0            0            0            0            0 
         dob  nationality        round         name         date time_elapsed     lap_time 
           0            0            0            0            0            0            0 
 F1$lap <- impute(F1$lap, "20")

Scan II

The variable “lap_time” was the only one numeric variable that needed scanning.

Univariate Outlier Detection method was used, the scores() function calculated the z-scores of the data.the summary() output shows that the minimum z score is -0.61819 and the maximum is 67.57.

Using which( abs(z.scores) >3 ), we can identify the z-scores whose absolute value is greater than 3.

length (which( abs(z.scores) >3 )) give the total number of outliers.

For the purpose of this assignment the outliers were excluded using the function F1_clean<- F1$lap_time[ - which( abs(z.scores) >3 )].

However, this was done solely for the purpose of the assignment, these outliers should not be excluded as they can be typically be accounted for by in-race incidents where the lap time could ballooned up well over the average lap time. Another factor could be rain interrupting the race as the conditions are deemed too dangerous to race yet the lap time is not stopped. These lap times would still be considered valid. The best approach would be to investigate each outlier in the data set before considering excluding the data due to data entry error or data processing error.

z.scores <- F1$lap_time %>%  scores(type = "z")
z.scores %>% summary()
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.61819 -0.28518 -0.09869  0.00000  0.16549 67.56851 
which( abs(z.scores) >3 )
  [1]    929   6360   6438   6516   6594   6670   6780   6952   7029   7105   7183   7260   7337
 [14]   7414   7490   7632   7737   7811   7886   7917   7986   8021   8090   8166   8235   8304
 [27]   8373   8427   8496   8562   8622   8691   8739   8808   8877   8946   9014   9041   9109
 [40]   9177   9245   9313  16276  16277  25038  25094  25150  25206  25262  25318  25374  25428
 [53]  25484  25530  25586  25642  25698  25754  25810  25865  25921  25977  26032  26087  26144
 [66]  26198  26252  27124  41943  55165  55243  55321  55399  55477  55555  55633  55705  55783
 [79]  55861  55939  56017  56095  56156  56219  56304  56410  56487  56590  59260  72914  75091
 [92]  88120  88164  88208  88252  88298  88342  88385  88429  88473  88516  88559  88602  88646
[105]  88686  88729  88772  88813  88856  88900  88943  88984  88986 103393 105367 111794 111851
[118] 111872 111929 111985 112042 112099 112156 112214 112286 112343 112399 112456 112512 112550
[131] 112606 112663 112752 112753 116639 116650 118645 119009 119725 120335 123708 125252 127280
[144] 128226 128813 130211 132762 139458 139545 139652 139717 139759 139852 139877 139920 139962
[157] 139988 150766 151842 152238 154756 156851 158365 158642 159131 160744 162419 163917 167180
[170] 167846 170491 173849 174349 176000 176243 178051 178061 185173 187750 190034 190125 191660
[183] 191829 196305 196360 196613 196843 196897 197251 197438 200540 200968 201866 202191 202909
[196] 208354 209501 213076 213109 213125 213251 213665 213983 214145 215173 218633 218997 224969
[209] 227937 229802 231048 231664 241850 247096 249079 260061 260595 269451 269462 271981 274105
[222] 274698 276052 277690 278060 278063 278084 279061 281284 282939 285032 285584 285587 286218
[235] 288041 290150 293880 295511 297320 299364 306934 306940 314520 314529 314587 314588 314648
[248] 314707 314766 314826 314862 314920 314921 314981 315040 315041 315058 315059 315118 315177
[261] 315178 315240 315306 315318 315319 315337 315338 317774 320582 320830 323277 323278 338106
[274] 338558 344868 361897 362564 362745 363740 363917 370221 372745 373739 377445 378672 378727
[287] 378782 378837 378855 378900 378918 378973 379028 379083 379138 379193 379232 379278 379333
[300] 379363 379418 379439 379443 379497 379550 379603 379658 379713 379744 381010 381393 388602
[313] 396341 396385 396429 396473 396517 396561 396606 396650 396699 396743 396787 396830 396874
[326] 396918 396965 397008 397053 397840 404047 404055 404118 404126 404208 404216 404279 404287
[339] 404350 404358 404421 404429 404492 404500 404563 404571 404634 404642 404705 404713 404765
[352] 404773 404811 404819 404882 404890 404973 404981 405044 405052 405115 405123 405197 405205
[365] 405268 405276 407244 413601 413652 413701 413752 413800 413825 413876 413927 413978 414002
[378] 414062 414113 414163 414214 414272 414323 414374
length (which( abs(z.scores) >3 ))
[1] 384
F1_clean<- F1$lap_time[ - which( abs(z.scores) >3 )]

Transform

Using hist() to produce a histogram on the distribution of the lap time resulted in a highly positively skewed distribution.

A reciprocal square transformation was performed on the data to decrease the skewness and convert the distribution into a normal distribution. A better distribution was then observed.

hist(F1$lap_time)

Rec_sqr_lap_time <- (F1$lap_time)^-2
hist(Rec_sqr_lap_time)

---
title: "MATH2349 Semester 1, 2018"
author: "Robby Bookun S3587469"
subtitle: Assignment 3
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

## Required packages 

```{r}

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(lubridate)
```


## Executive Summary 

  The 3 datasets pertaining information related to formula one seasons were sourced from the data science platform Kaggle. The 3 datasets were first read into as data frames Laptimes, Drivers and Races. The data frame F1 was subsequently created by first joining Laptimes with Drivers using the key variable DriverID; the resulting data frame was then joined with Races using the key variable raceID to form F1.

  The variables of F1 were assessed to determine their data type, those returning a data type of factors were further gauged to check their factor levels. A few variables were subsequently converted into data types better suited for their nature.
  
  F1 was then checked to establish whether or not it conformed to the tidy data principles. Upon inspection it seemed that they did conform. 2 variables were then generated to facilitate calculations; one variable was created to determine the time passed by in weeks between today and when the race was held. A second variable, lap_time was established `by converting the variable "time" into a numeric nature to handle milliseconds.
  
  The whole data frame was scanned to identify missing and special values. Only one value was found to be missing, the value in question was identified/located and dealt with by recoding the missing value. 

  The whole data frame was scanned to identify outliers using Univariate Outlier Detection method. Once identified and located, these values were dealt with by excluding them. The imputing and capping methods were not deemed suitable. 

  A histogram of the distribution of the lap time was produced to give a better insight into the data, however it resulted in a highly positively skewed distribution. A reciprocal square transformation was performed on the data to decrease the skewness and convert the distribution into a normal distribution.



## Data 


This dataset contains information relating to the formula one seasons from 1950 up to 2017. 3 datasets will be merged, with the first one holding information pertaining to the lap times for every race, including driverid, lap number, position and time; the second data set will contain a list of every F1 driver intcluding their first name, last name, date of birth and nationallity; the third data set will cover details of every race, including year, date, time, circuit and round. 

Contents:- 

Data set 1 Laptimes

raceID: Key variable

driverID: Key variable

Lap: The lap number being timed

Position: The current position of the driver whilst performing the lap

Time: The time registered for the lap

Data Set 2 Drivers:

DriverID : Key variable

forename: First Name

surname: 

dob: date of birth

nationality

Data set 3 Races

raceid: Key Variable

round: The race number in the F1 championship calendar

name: The name of the grandprix

date: The date the race was held

These datasets were sourced from:
url: https://www.kaggle.com/cjgdev/formula-1-race-data-19502017/data



```{r}
setwd("C:/Users/Robby/Desktop/Master of Analytics/Yr 3 Sem 1/Data Pre-Processing/Assignment 3/Data")

Laptimes <- read.csv("laptimes.csv")

Drivers <- read.csv("drivers.csv")

Races <- read.csv("races.csv")

head (Laptimes)

head (Drivers)

head (Races)

F1 <- (Lapdriver <- Laptimes %>% left_join(Drivers, by = "driverId")) %>% left_join(Races, by = "raceId")

head (F1)
```

## Understand 

The function class() was used to determine the data type, of the merged dataset F1. The data types returned were either integers or factors in nature. 
Once the the data types were returned, the variables "nationality" and "name" were further assessed using the function levels() to check the factor levels. 
The variable "position" was converted into an ordered factor type using the function as.factor() and then ordered using levels = c("").
The variables "dob" and "date" were converted into the date type using the function as.date() to facilitate up-coming calculations. 

```{r}
cat("raceid:")
class(F1$raceId)
cat("driverId:")
class(F1$driverId)
cat("lap:")
class(F1$lap)
cat("Position:")
class(F1$position)
cat("time:")
class(F1$time)
cat("forename:")
class(F1$forename)
cat("surname")
class(F1$surname)
cat("dob")
class(F1$dob)
cat("nationality")
class(F1$nationality)
cat("round")
class(F1$round)
cat("name")
class(F1$name)
cat("date")
class(F1$date)

cat ("\n","Nationalities:-", "\n")
levels(F1$nationality)
cat ("\n","GrandPrix:-", "\n")
levels(F1$name)

F1$position <- as.factor(F1$position) 
cat ("\n")
F1$position <-  factor(F1$position,levels = c("1","2",  "3",  "4",  "5",  "6",  "7",  "8",  "9",  "10", "11", "12", "13", "14", "15", "16", "17", "18", "19","20", "21", "22", "23", "24"), ordered=TRUE )
cat("\n","New class for Postion:" )
class(F1$position)

F1$dob <- as.Date(F1$dob, format = "%d/%m/%Y")
cat("\n","New class for dob")
class(F1$dob)

F1$date <- as.Date(F1$date, format = "%d/%m/%Y")
cat("\n", "New class for date")

class(F1$date)

```


##	Tidy & Manipulate Data I 

The data set F1 seems to conform to the tidy data principles since :-

  *each variable had its own column;
  *each observation had its own row;
  *each value had its own cell.

Futhermore no values are being utilised as column headers in this data set, worth noting that the dob and date variables are not being split up into day, month and year. This is to facilitate upcoming calculations to be performed on these variables.



```{r}


```

##	Tidy & Manipulate Data II 

The variable "time_elapsed" was created to determine the time passed by in weeks between now and when the race was held. This was achieved using the function mutate() to create the variable and difftime() to work out the duration. 

The variable lap_time was generated to facilitate calculations,the nature type was set to numeric to handle milliseconds. The function as.numeric() was utilised on the variable "time". 
```{r}

F1 <- F1 %>% mutate(time_elapsed = round(difftime(today(), F1$date, unit="weeks"),2))

F1 <- F1 %>% mutate(lap_time = F1$time <-as.numeric(ms(F1$time)))

head (F1)
```


##	Scan I 

Utilising "sapply(F1, function(x) sum(is.na(x)))" enabled the scanning of all variables of the data set for missing values. A single value was found to be missing.

"which(is.na(F1$lap))" helped to locate the missing value identified in the variable "lap", which represents the lap number being timed. Upon inspection of the missing entry the value used to recode the missing data was "20" following logical sequence of the entries since the preceding entry being 19 and the ensuing entry "21". The function impute() helped to achieve this task.

"sapply(F1, function(x) sum(is.nan(x)))"" enabled the scanning of all variables of the data set for special values. None were found.

```{r}

cat("Number of missing Values", "\n","\n")
 sapply(F1, function(x) sum(is.na(x)))

 cat("\n", "Missing value is found at entry")
 which(is.na(F1$lap))

  cat("\n","Number of special Values", "\n","\n")
 sapply(F1, function(x) sum(is.nan(x)))
 
 sapply(F1, function(x) sum(is.infinite(x)))

 F1$lap <- impute(F1$lap, "20")

```


##	Scan II

The variable "lap_time" was the only one numeric variable that needed scanning.

Univariate Outlier Detection method was used, the scores() function calculated the z-scores of the data.the summary() output shows that the minimum z score is -0.61819 and the maximum is 67.57. 

Using which( abs(z.scores) >3 ), we can identify the z-scores whose absolute value is greater than 3.

length (which( abs(z.scores) >3 )) give the total number of outliers.

For the purpose of this assignment the outliers were excluded using the function F1_clean<- F1$lap_time[ - which( abs(z.scores) >3 )].

However, this was done solely for the purpose of the assignment, these outliers should not be excluded as they can be typically be accounted for by in-race incidents where the lap time could ballooned up well over the average lap time. Another factor could be rain interrupting the race as the conditions are deemed too dangerous to race yet the lap time is not stopped. These lap times would still be considered valid. The best approach would be to investigate each outlier in the data set before considering excluding the data  due to data entry error or data processing error. 

```{r}

z.scores <- F1$lap_time %>%  scores(type = "z")

z.scores %>% summary()

which( abs(z.scores) >3 )

length (which( abs(z.scores) >3 ))

F1_clean<- F1$lap_time[ - which( abs(z.scores) >3 )]

```


##	Transform 
Using hist() to produce a histogram on the distribution of the lap time resulted in a highly positively skewed distribution.

A reciprocal square transformation was performed on the data to decrease the skewness and convert the distribution into a normal distribution. A better distribution was then observed.

```{r}

hist(F1$lap_time)

Rec_sqr_lap_time <- (F1$lap_time)^-2

hist(Rec_sqr_lap_time)
```