MATH2349 Semester 2, 2019

Required packages

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(outliers)
library(stringr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(ggplot2)
library(dplyr)
library(readr)

Executive Summary

In this report, Formula 1 data from 1950-2017 is chosen. The data consists of different tables which are drivers, results, race tracks and others. Four individual tables were chosen to be performed pre-processing tasks. The results and race data dataframe are combined. Other than that, the new column, ‘age’ was created in drivers dataframe. Lastly, boxplots were plotted and used to determine the z-score.

Data

In this assignment 3, four datasets about Formula 1 data from 1950 through to the 2017 season from one same survey are used, which are results.csv, races.csv, circuits.csv and drivers.csv.

The data sources are from https://www.kaggle.com/cjgdev/formula-1-race-data-19502017.

Variables name for 4 different datasets are stated below: For Results dataset,

raceId
driverId
positionOrder
points
fastestLapTime
fastestLapSpeed

For Races dataset，

raceId
year
round
circuitId
name
date

For Circuits dataset,

circuitId
circuitRef
name
location
country

For Drivers dataset,

driverId
driverRef
forename
surname
dob
nationality

results <- read_csv("formula-1-race-data-19502017/results.csv")

## Parsed with column specification:
## cols(
##   resultId = col_double(),
##   raceId = col_double(),
##   driverId = col_double(),
##   constructorId = col_double(),
##   number = col_double(),
##   grid = col_double(),
##   position = col_double(),
##   positionText = col_character(),
##   positionOrder = col_double(),
##   points = col_double(),
##   laps = col_double(),
##   time = col_character(),
##   milliseconds = col_double(),
##   fastestLap = col_double(),
##   rank = col_double(),
##   fastestLapTime = col_time(format = ""),
##   fastestLapSpeed = col_double(),
##   statusId = col_double()
## )

## Warning: 1 parsing failure.
##   row             col               expected actual                                       file
## 23765 fastestLapSpeed no trailing characters  :42.6 'formula-1-race-data-19502017/results.csv'

dim(results)

## [1] 23777    18

head(results)

results$fastestLapSpeed<-as.numeric(results$fastestLapSpeed)

Secondly, the races dataset is imported.

races <- read_csv("formula-1-race-data-19502017/races.csv")

## Parsed with column specification:
## cols(
##   raceId = col_double(),
##   year = col_double(),
##   round = col_double(),
##   circuitId = col_double(),
##   name = col_character(),
##   date = col_date(format = ""),
##   time = col_time(format = ""),
##   url = col_character()
## )

## Warning: 731 parsing failures.
## row  col   expected actual                                     file
##  90 time time like         'formula-1-race-data-19502017/races.csv'
##  91 time time like         'formula-1-race-data-19502017/races.csv'
##  92 time time like         'formula-1-race-data-19502017/races.csv'
##  93 time time like         'formula-1-race-data-19502017/races.csv'
##  94 time time like         'formula-1-race-data-19502017/races.csv'
## ... .... .......... ...... ........................................
## See problems(...) for more details.

#Check the dimension of races dataset.

dim(races)

## [1] 997   8

# Check the first six rows of the Races dataset.

head(races)

circuits <- read_csv("formula-1-race-data-19502017/circuits.csv")

## Parsed with column specification:
## cols(
##   circuitId = col_double(),
##   circuitRef = col_character(),
##   name = col_character(),
##   location = col_character(),
##   country = col_character(),
##   lat = col_double(),
##   lng = col_double(),
##   alt = col_double(),
##   url = col_character()
## )

dim(circuits)

## [1] 73  9

head(circuits)

drivers <- read_csv("formula-1-race-data-19502017/drivers.csv",col_types = cols(dob = col_date(format = "%d/%m/%Y")))

dim(drivers)

## [1] 842   9

head(drivers)

# Remove duplicated columns
results<-distinct(results)
circuits<-distinct(circuits)
races<-distinct(races)
drivers<-distinct(drivers)

# Check the column existence again after removement.
dim(results)

## [1] 23777    18

dim(circuits)

## [1] 73  9

dim(races)

## [1] 997   8

dim(drivers)

## [1] 842   9

# For higher accuracy of data frame, the redundant columns have been removed.

circuits <- circuits[,c(1:5)]
results<-results[, c(2:3, 9:10, 14, 16:17)]
races<-races[, -c(7,8)]
drivers<-drivers[, c(1:2, 5:8)]

# From observation, both circuits and races dataframe have same column name with 'Name'. To advoid the unclear condition happened, both column names are changed to more specific name.

colnames(circuits)[which(names(circuits) == "name")] <-"Circuitname"
colnames(races)[which(names(races) == "name")] <-"Grandprixname"

# Combine the existing four dataframe into two different dataframes.

newresults <- left_join(results, races, by = "raceId")
newraces <- left_join(races, circuits, by = "circuitId")

# Print out the newresults dataframe

head(newresults)

# Print out the newraces dataframe

head(newraces)

Understand

Check types of attributes in both new dataframes by using str() function.

str(newresults)

## Classes 'tbl_df', 'tbl' and 'data.frame':    23777 obs. of  12 variables:
##  $ raceId         : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ driverId       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ positionOrder  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ points         : num  10 8 6 5 4 3 2 1 0 0 ...
##  $ fastestLap     : num  39 41 41 58 43 50 22 20 15 23 ...
##  $ fastestLapTime : 'hms' num  01:27:00 01:27:00 01:28:00 01:28:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ fastestLapSpeed: num  218 218 217 215 218 ...
##  $ year           : num  2008 2008 2008 2008 2008 ...
##  $ round          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ circuitId      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Grandprixname  : chr  "Australian Grand Prix" "Australian Grand Prix" "Australian Grand Prix" "Australian Grand Prix" ...
##  $ date           : Date, format: "2008-03-16" "2008-03-16" ...

From the table above, we found that the points for the dataframe are set as numeric variables. However, it’s not convenient to be analysed and it is better in ordered factor. Throughout years, we found that the scoring system of the Formula 1 competiton experienced several changes. Therefore, we have divided the datafrane by years into five different set of scoring system which are 1950-1960, 1961-1990, 1991-2002, 2003-2009 and 2010-2017.

Upon inspection of the results_2 dataset, we can see that whilst the points attribute is filled with only integers, it is not very helpful in that state and must be moved into an ordered factor. However, due to numerous changes to the Formula 1 points system over the years, the data must be subset to account for these changes.

#Subset the dataframe into different year frame.

results5060<- subset(newresults, year<=1960)
results6190 <- subset(newresults, year >=1961 & year <=1990)
results9102<- subset(newresults, year>=1991 & year<=2002)
results0309 <- subset (newresults, year >=2003 & year <=2009)
results1017 <- subset(newresults, year >=2010 & year <=2017)

#Transform points variables into ordered factor.
results5060$points <- factor (results5060$points, levels = c("8", "6", "4", "3", "2", "0"), ordered=TRUE)

results6190$points <- factor (results6190$points,
                                   levels = c("9", "6", "4", "3", "2", "1", "0"), ordered=TRUE)

results9102$points <- factor (results9102$points, 
                            levels = c("10", "6", "4", "3", "2", "1", "0"), ordered=TRUE)

results0309$points <- factor (results0309$points,
                                   levels = c("10", "8", "6", "5", "4", "3", "2", "1", "0"), ordered=TRUE)

results1017$points <- factor (results1017$points, 
                                   levels = c("25", "18", "15", "10", "8", "6", "4", "2", "1", "0"), ordered=TRUE)

# Display the ordered factor.

levels(results5060$points)

## [1] "8" "6" "4" "3" "2" "0"

levels(results6190$points)

## [1] "9" "6" "4" "3" "2" "1" "0"

levels(results9102$points)

## [1] "10" "6"  "4"  "3"  "2"  "1"  "0"

levels(results0309$points)

## [1] "10" "8"  "6"  "5"  "4"  "3"  "2"  "1"  "0"

levels(results1017$points)

##  [1] "25" "18" "15" "10" "8"  "6"  "4"  "2"  "1"  "0"

Tidy & Manipulate Data I

Firstly, check N/A existing in driver dataframe.

colSums(is.na(drivers))

##    driverId   driverRef    forename     surname         dob nationality 
##           0           0           0           0           1           0

We can check the location of the N/A located in the dataframe.

which(is.na(drivers$dob))

## [1] 415

From the column shown above, we found that the driver who doesn’t have dob is Ray Reed.

Therefore, we replace the dob with his actual birthdate in wikipedia.

drivers$dob[is.na(drivers$dob)] <- "1932/04/30"
colSums(is.na(drivers))

##    driverId   driverRef    forename     surname         dob nationality 
##           0           0           0           0           0           0

Secondly, check the existence of N/A in all dataframes.

colSums(is.na(results5060))

##          raceId        driverId   positionOrder          points 
##               0               0               0             123 
##      fastestLap  fastestLapTime fastestLapSpeed            year 
##            2230            2230            2230               0 
##           round       circuitId   Grandprixname            date 
##               0               0               0               0

colSums(is.na(results6190))

##          raceId        driverId   positionOrder          points 
##               0               0               0               9 
##      fastestLap  fastestLapTime fastestLapSpeed            year 
##           10739           10739           10739               0 
##           round       circuitId   Grandprixname            date 
##               0               0               0               0

colSums(is.na(results9102))

##          raceId        driverId   positionOrder          points 
##               0               0               0               3 
##      fastestLap  fastestLapTime fastestLapSpeed            year 
##            4817            4817            4817               0 
##           round       circuitId   Grandprixname            date 
##               0               0               0               0

colSums(is.na(results0309))

##          raceId        driverId   positionOrder          points 
##               0               0               0               3 
##      fastestLap  fastestLapTime fastestLapSpeed            year 
##             433             433             433               0 
##           round       circuitId   Grandprixname            date 
##               0               0               0               0

colSums(is.na(results1017))

##          raceId        driverId   positionOrder          points 
##               0               0               0             162 
##      fastestLap  fastestLapTime fastestLapSpeed            year 
##             175             175             176               0 
##           round       circuitId   Grandprixname            date 
##               0               0               0               0

The N/A existed in all dataframe could be originally not filled by any number or it was half point due to disqualification or various race issues. Therefore, we can filled N/A value with 0 since the majority of half-points was allocated because of the disqualification.

results5060$points [is.na(results5060$points)] <- 0
results6190$points [is.na(results6190$points)] <- 0
results9102$points [is.na(results9102$points)] <- 0
results0309$points [is.na(results0309$points)] <- 0
results1017$points [is.na(results1017$points)] <- 0

To prevent the redundancy of dataframe, fill the time column with NULL.

# The time column is filled with NULL.

results5060$time <- NULL
results6190$time <- NULL
results9102$time <- NULL
results0309$time <- NULL
results1017$time <- NULL

Tidy & Manipulate Data II

Create a new column from the existing column ‘dob’, which by calculating the year difference till now.

mutate(drivers, age = as.period(interval(start = drivers$dob, end = today()))) %>%.$age %>%.@year %>%
mutate(drivers, age = .)

Scan I

As mentioned previously, some fields in each table are filled with NA’s. We can realised that fasterst lap, fastest laptime and fastest speed are only shown after 2003. This might be caused by the technology problem. Therefore, in order to prevent the low accuracy, these 3 columns will be removed from 1950-2002.

However, there are some NAs in the results0309 and results1017. However, it might be caused by the failure of joining competition, accident happened during the race or disqualification given before the race. Therefore, the NA’s will be remained and omit them when dealing with outliers. What is noteworthy is the treatment of how the fastestlap, fastestlaptime, and fastestspeed is recorded in

results5060$fastestLap <- NULL
results5060$fastestLapSpeed <- NULL
results5060$fastestLapTime <- NULL
results6190$fastestLap <- NULL
results6190$fastestLapSpeed <- NULL
results6190$fastestLapTime <- NULL
results9102$fastestLap <- NULL
results9102$fastestLapSpeed <- NULL
results9102$fastestLapTime <- NULL

Scan II

In this report, boxplots are used to determine the outliers existence in the dataset above. As the test is only apply on the numeric variable and the fastest speed is the only available numeric variable. 2003-2009 and 2010-2017 were picked to be used the test. Z score will be calculated and any row which have z-score higher than 3 will be removed. The boxplot will be plotted again to determine the outliers.

na.omit(results0309$fastestLapSpeed) %>% boxplot(main="Boxplot of the Fastest recorded speeds from 2003-2009, N/A's omitted", ylab = "Speed", col="grey")

na.omit(results1017$fastestLapSpeed) %>% boxplot(main="Boxplot of the Fastest recorded speeds from 2010-2017, N/A's omitted",
ylab="Speed", col="grey")

z.scores.lapspeed2003<- na.omit(results0309$fastestLapSpeed) %>% scores(type = "z")

z.scores.lapspeed2003 %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -5.42242 -0.34600  0.07048  0.00000  0.53428  2.45058

cleaned2003<-na.omit(results0309$fastestLapSpeed)[which (abs(z.scores.lapspeed2003)>3)]
# Show the z-score summary after cleaned.
z.cleanscores2003 <- cleaned2003 %>% scores(type = "z")
z.cleanscores2003 %>% summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.4511 -0.2757  0.2163  0.0000  0.6288  1.4909

# Plot the boxplot again.

cleaned2003 %>% boxplot(main = "Box Plot of Fastest Lap speed from 2003-2009, outliers removed", ylab = "Length", col = "grey")

However, after the z-score test, the boxplot is showing that one in the bottom.

Next, we apply the same test on 2010-2017 data.

cleaned2010<-na.omit(results1017$fastestLapSpeed)[which (abs(z.scores.lapspeed2003)>3)]

# Print out the z score after cleaning.
z.cleanscores2010 <- cleaned2010 %>% scores(type = "z")
z.cleanscores2010 %>% summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1767 -1.0039  0.4099  0.0000  0.6391  2.2228

cleaned2010 %>% boxplot(main = "Box Plot of Fastest Lap speed from 2010-2017, outliers removed", ylab = "Speed", col = "grey")

## Transform

Histogram is plotted and it’s based on the fastest lap data for 2003-09 and 2010-17.

hist(results0309$fastestLapSpeed,
     main = "Histogram of Fastest Speed recorded for '2003-2009'", xlab = "Speed (km/h)")

squared0309<-results0309$fastestLapSpeed^2
hist(squared0309)

By observing from histogram above, we realised that the data is left-screwed. The square transformaation will be applied on the data above.

Next, the histogram of the fastest speed between 2010 and 2017 was plotted.

hist(results1017$fastestLapSpeed,
     main = "Histogram of Fastest Speed recorded for '2010-2017'", xlab = "Speed (km/h)")

The histogram above is showing that the data is left-screwed as well. Therefore, square transformation will be applied as well.

squared1017<-results1017$fastestLapSpeed^2
hist(squared1017)