Formula One, abbreviated to F1, is the highest class of open-wheeled auto racing defined by the Fédération Internationale de l'Automobile (FIA), motorsport's world governing body. The "formula" in the name refers to a set of rules to which all participants and vehicles must conform. The F1 World Championship season consists of a series of races, known as Grands Prix, usually held on purpose-built circuits, and in a few cases on closed city streets. The results of each race are combined to determine two annual championships, one for drivers and one for constructors. (https://en.wikipedia.org/wiki/Formula_One)

I chose Formula One as the topic of my final project because I want to understand how a data-driven sport entertains its audience and to learn how this industry works. To me, the most fascinating part of the sport is its technical side, which draws on advanced automotive and aeronautical engineering. While Formula One can be seen as car manufacturers showcasing their ability in racing, it also drives the rapid evolution of the technologies behind the sport. Racing telemetry collects data such as speed, stability, tire wear, and aerodynamics; combined with data analytics, it allows engineers to evaluate how a car performs around the track and to figure out what needs to change. The audience can expect improved engineering with each passing season. Data is therefore a trade secret for each team, because whether a team can shave hundredths of a second off its lap times depends on the details in those data. Another reason I want to investigate Formula One is that it is a game of numbers. For example, each McLaren car generates 1.5 GB of data per race. Each weekend the Grand Prix results are broadcast on TV so that the audience can follow the race and stay updated. More data, including practice and warm-up laps, are available at Formula One Live Timing. So there are plenty of data for interesting analysis in this sport. The third reason I think Formula One is fun is that participants must work out the best strategy to win, because all participants' cars must conform to the "formula", the set of rules referred to in the name. As such, Formula One is a sport that mixes technology, brains, speed, and numbers, which makes it perfect for people who like sport and strategy games at the same time.

Goal

The goal of this project is to apply the R language and MySQL, which I learned in the Data Acquisition and Management course, to collect, structure, and visualize data in the context of Formula One. At the same time, I will learn the history of the sport and how the industry works by digging into the race results. Formula One rules have become more and more complicated in their regulation of costs and safety. Though this gives dedicated fans more fun, it makes it difficult for someone new to the sport to enjoy it right away. To understand how the technical and sporting regulations shape the sport, I will extract information from the archives and analyze it quantitatively.

Data Science Workflow

To reach this goal, I will use the OSEMN model. That is, I will execute a data science workflow that includes:

For the API, JSON files were downloaded and stored. For PDFs, PDFTables (https://pdftables.com/) was used to extract the data and convert it to .xlsx files, which were then converted to .txt files. The R packages "RCurl", "jsonlite", and "XML" were used to download data from the websites.

Packages used in this project: RCurl, XML, tidyr, dplyr, stringr, knitr, ggplot2, jsonlite.

library(RCurl)
library(XML)
library(tidyr)
library(dplyr)
library(stringr)
library(knitr)
library(ggplot2)
library(jsonlite)

The raw data were downloaded from the resources mentioned above. Data tidying and reorganizing were performed to transform the data into a structure that is easy to analyze and model. For example, unnecessary variables were removed, and some values were converted to numeric or character type depending on whether they would be used as continuous or categorical variables. The R packages "stringr", "tidyr", "dplyr", and "knitr" were used. "RMySQL" was used for data cleaning, transformation, and storage.

After cleaning the data at hand, basic statistics were computed, such as viewing the distribution of the data and checking the mean and standard deviation of some variables. Some sport-related analyses were also performed. For example, I checked whether any driver failed to qualify in the qualifying session.
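The type conversion described above can be sketched as follows. This is a minimal base R illustration, not the project's actual cleaning code; the data frame `raw` and its values are hypothetical.

```r
# Toy standings rows as they might come out of an HTML table: every column is text.
# (Hypothetical values, used only to illustrate the conversion step.)
raw <- data.frame(
  Pos  = c("1", "2", "EX"),
  Team = c("Team A", "Team B", "Team C"),
  Pts  = c("48", "40", "0"),
  stringsAsFactors = FALSE
)

tidy <- raw
# Positions and points are used as continuous variables; non-numeric
# entries such as "EX" become NA (suppressWarnings hides the coercion warning).
tidy$Pos <- suppressWarnings(as.numeric(tidy$Pos))
tidy$Pts <- as.numeric(tidy$Pts)
# Team is used as a categorical variable.
tidy$Team <- factor(tidy$Team)

str(tidy)
```

Checking for NA values after the conversion is a quick way to spot entries that were not plain numbers in the source.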

The path to the championship: plot the position changes of each team and driver to see their chronological performance.

Practice time utilization: to reproduce how each driver uses the track in practice, the accumulated time derived from the lap time record was calculated and plotted. The segmentation into stints (the time between pit stops) as time elapses can be seen in the plot.

Race chart: the gap to the leader of each driver at each lap was plotted to view the relative position changes over time.

Reproduce the fight for the lead: calculating and plotting the difference in lap times between two fighting cars at each lap gives a close look at the battle.

Championship: (List of Formula One World Championship points scoring systems, from Wikipedia) The total points of the first- and second-placed drivers were calculated, and the best performers were defined by their points and the number of championships won.
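The race chart and lead-fight items above both reduce to simple arithmetic on lap times. A minimal sketch in base R, assuming a long table with one row per car and lap; the data frame `laps` and its lap times are hypothetical:

```r
# Hypothetical lap times in seconds for three cars over three laps.
laps <- data.frame(
  No    = rep(c(44, 6, 5), each = 3),
  lapNo = rep(1:3, times = 3),
  secs  = c(75.7, 70.1, 70.0,
            76.4, 70.0, 70.2,
            77.0, 70.5, 69.8)
)

# Cumulative race time per car.
laps$cumtime <- ave(laps$secs, laps$No, FUN = cumsum)

# Gap to the leader: each car's cumulative time minus the smallest
# cumulative time recorded on the same lap.
laps$gap <- laps$cumtime - ave(laps$cumtime, laps$lapNo, FUN = min)

# Lap-time difference between two cars fighting for position
# (positive means car 6 was slower than car 44 on that lap).
duel <- laps$secs[laps$No == 6] - laps$secs[laps$No == 44]
```

This sketch assumes all cars are on the same lap; lapped cars need the extra handling discussed in the race chart section.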

Data collecting and Analyzing

1. Path to the championship

As someone new to Formula One, I want to start by getting to know the teams and the drivers. The official Formula One website did not disappoint a new fan like me: it archives the race results, driver standings, team standings, and fastest lap awards from 1950 to the current season.

First, I will look at how each team fought for the championship through the years. I also expect to see teams come and go over F1 history. As the Constructors' Championship was not awarded until 1958, data from 1958 to 2017 were collected.

# The process is very slow while knitting the R Markdown file, so I commented out the following code and saved the data; the chunk is not evaluated.

finalTable <- data.frame()
htmlTable <- data.frame()
for (i in 1958:2017){
  html_url <- paste0("https://www.formula1.com/en/results.html/",i,"/team.html")
  fetch_html <- getURL(html_url)
  htmlContent <- htmlParse(fetch_html,encoding = "UTF-8")
  htmlTable <- readHTMLTable(htmlContent, stringsAsFactors = FALSE)
  Sys.sleep(0.5)
  htmlTable <- htmlTable[[1]]
  htmlTable$year <- rep(i,nrow(htmlTable))
  finalTable <- rbind(finalTable,htmlTable)
}

colnames(finalTable) <- c("Empty1","Pos","Team","Pts","Empty2","year")
finalTable$Empty1 <-NULL
finalTable$Empty2 <-NULL

finalTable$Pos <- as.numeric(finalTable$Pos)

write.csv(finalTable,"F1_team.csv")

# F1_team.csv was uploaded to the GitHub repo and then downloaded.

finalTable <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_final%20project/F1_team.csv", header=TRUE, sep=",", stringsAsFactors = F)

kable(head(finalTable,n=10))

X	Pos	Team	Pts	year
1	1	Vanwall	48	1958
2	2	Ferrari	40	1958
3	3	Cooper Climax	31	1958
4	4	BRM	18	1958
5	5	Maserati	6	1958
6	6	Lotus Climax	3	1958
7	1	Cooper Climax	40	1959
8	2	Ferrari	32	1959
9	3	BRM	18	1959
10	4	Lotus Climax	5	1959

kable(tail(finalTable, n=5))

	X	Pos	Team	Pts	year
620	620	6	Toro Rosso	13	2017
621	621	7	Haas Ferrari	8	2017
622	622	8	Renault	6	2017
623	623	9	Sauber Ferrari	0	2017
624	624	10	McLaren Honda	0	2017
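The download loop above grows `finalTable` with `rbind()` on every iteration. A common alternative is to collect one data frame per season in a list and bind once at the end. The sketch below shows only that pattern and runs offline: `fetch_standings()` is a hypothetical stand-in for the download-and-parse step and returns a placeholder row.

```r
# Build one standings URL per season, following the pattern in the loop above.
years <- 1958:2017
urls <- paste0("https://www.formula1.com/en/results.html/", years, "/team.html")

# fetch_standings() is a hypothetical stand-in for the download-and-parse step;
# here it only returns a one-row placeholder so the pattern runs offline.
fetch_standings <- function(url, year) {
  data.frame(Pos = NA_real_, Team = NA_character_, Pts = NA_real_,
             year = year, stringsAsFactors = FALSE)
}

# Collect one data frame per season in a list, then bind once at the end.
# This avoids growing a data frame with rbind() on every iteration.
pieces <- Map(fetch_standings, urls, years)
allSeasons <- do.call(rbind, unname(pieces))
```

In a real run, the body of `fetch_standings()` would hold the `getURL()`, `htmlParse()`, and `readHTMLTable()` calls plus the `Sys.sleep()` pause.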

tally <- as.data.frame(table(finalTable$Team))
team1 <- tally[which(tally$Freq == 1),]
team2 <- tally[which(tally$Freq == 2),]
team3 <- tally[which(tally$Freq == 3),]
team4 <- tally[which(tally$Freq == 4),]
team5 <- tally[which(tally$Freq == 5),]
team6 <- tally[which(tally$Freq == 6),]
team7 <- tally[which(tally$Freq == 7),]
team8 <- tally[which(tally$Freq == 8),]
team9 <- tally[which(tally$Freq == 9),]
team10 <- tally[which(tally$Freq == 10),]
# Teams that appear in the standings in at least 10 seasons
teamLong <- tally[which(tally$Freq >= 10), ]

longlastTeam.name <- as.character(teamLong$Var1)

# %in% keeps every long-lasting team, however many names there are
longlastTeam <- finalTable[which(finalTable$Team %in% longlastTeam.name),]

longlastTeam <- group_by(longlastTeam,Team)
    
ggplot(data = longlastTeam, aes(x=year, y=Pos, color = Team, group = Team)) +
    geom_point() +
    geom_line() +
    labs(y="Position")+
    facet_grid(Team ~ .)

Only teams that appeared in the top 10 in at least 10 seasons were plotted. As shown in the figure, Ferrari is the only name that has been there from the very beginning of the roughly 70-year history of F1. It held first place for six consecutive years, from 1999 to 2004. The teams with Ford in their names, Brabham Ford, McLaren Ford, and Tyrrell Ford, were active between the 1970s and the 1990s. One of them, Brabham Ford, disappeared in the 1980s, and the rest disappeared after the mid-1990s. Right after that, McLaren Mercedes began to appear in the top 10. Renault entered the top 10 in the late 1970s and kept progressing towards first place. It seems to have run out of luck in the mid-1980s and did not return to the top 10 until the 2000s. The comings and goings and the ups and downs of these teams reflect changes among the car manufacturers, such as mergers and acquisitions. Ford has not come back to the top 10 since the 1990s.
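The selection of long-running teams can be sketched compactly with `table()` and `%in%`. This is an illustration on hypothetical data (`standings`, `min_seasons`), not a replacement for the figure above:

```r
# Hypothetical standings: team "A" appears in 3 seasons, "B" in 2, "C" in 1.
standings <- data.frame(
  Team = c("A", "B", "A", "C", "A", "B"),
  year = c(2001, 2001, 2002, 2002, 2003, 2003),
  stringsAsFactors = FALSE
)

min_seasons <- 2  # the analysis above uses 10; 2 keeps the toy example small

# Count seasons per team, then keep the names that reach the threshold.
counts <- table(standings$Team)
keep <- names(counts)[counts >= min_seasons]

# %in% works for any number of names, so no fixed-length chain of `|` is needed.
longRunning <- standings[standings$Team %in% keep, ]
```

Because the threshold is a single variable, changing the cut-off does not require touching the filter.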

1.2 Path to the champion - Driver

# The process stalls while knitting the R Markdown file, so I commented out the following code and saved the data; the chunk is not evaluated.

driverTable <- data.frame()
htmlTable <- data.frame()
for (i in 1950:2017){
  html_url <- paste0("https://www.formula1.com/en/results.html/",i,"/drivers/94/great-britain/race-result.html")
  fetch_html <- getURL(html_url)
  htmlContent <- htmlParse(fetch_html,encoding = "UTF-8")
  htmlTable <- readHTMLTable(htmlContent, stringsAsFactors = FALSE)
  Sys.sleep(0.5)
  htmlTable <- htmlTable[[1]]
  htmlTable$year <- rep(i,nrow(htmlTable))
  driverTable <- rbind(driverTable,htmlTable)
}
  
colnames(driverTable) <- c("Empty1","Pos","Driver","Nationality","Car","Pts","Empty2","year")

driverTable$Empty1 <-NULL
driverTable$Empty2 <-NULL

driverTable$Pos <- as.numeric(driverTable$Pos)

write.csv(driverTable,"F1_driver.csv")

# F1_driver.csv was uploaded to the GitHub repo and then downloaded.

driverTable <- read.csv(file="https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_final%20project/F1_driver.csv", header=TRUE, sep=",", stringsAsFactors = F)

driverTable50 <- driverTable[which(driverTable$year < 1960),]
driverTable60 <- driverTable[which(driverTable$year >= 1960 & driverTable$year < 1970),]
driverTable70 <- driverTable[which(driverTable$year >= 1970 & driverTable$year < 1980),]
driverTable80 <- driverTable[which(driverTable$year >= 1980 & driverTable$year < 1990),]
driverTable90 <- driverTable[which(driverTable$year >= 1990 & driverTable$year < 2000),]
driverTable00 <- driverTable[which(driverTable$year >= 2000 & driverTable$year < 2010),]
driverTable10 <- driverTable[which(driverTable$year >= 2010 & driverTable$year < 2020),]

driverTable50$Driver<- str_replace_all(driverTable50$Driver,"[:space:]?\n[:space:]+", " ")
driverTable50$FirstName <- str_extract(driverTable50$Driver, "[:alpha:]+")
driverTable50$LastName <- str_extract(driverTable50$Driver, "[A-Z]{3}\\b")
driverTable50 <- transform(driverTable50, Name = paste(FirstName,LastName, sep = ""))

drivertally50 <- as.data.frame(table(driverTable50$Name))
colnames(drivertally50) <- c("Name","Freq")
d50.1 <- drivertally50[which(drivertally50$Freq == max(drivertally50$Freq)),]
d50.2 <- drivertally50[which(drivertally50$Freq == (max(drivertally50$Freq)-1)),]
d50.3 <- drivertally50[which(drivertally50$Freq == (max(drivertally50$Freq)-2)),]
d50.top3.tally <- rbind(d50.1,d50.2,d50.3)

sort(d50.top3.tally$Freq, decreasing = T)

d50.top3.name <- as.character(d50.top3.tally$Name)

length(d50.top3.name)

driverTable50$Name <-as.character(driverTable50$Name)

# %in% keeps every selected driver, however many names tie for the top three frequencies
d50.top3 <- driverTable50[which(driverTable50$Name %in% d50.top3.name),]

d50.top3 <- group_by(d50.top3,Name)
 
ggplot(data = d50.top3, aes(x=year, y=Pos, color = Name, group = Name)) +
    geom_point() +
    geom_line() +
    labs(y="Position")+
    facet_grid(Name ~ .)

From the figure we can see that Juan Manuel Fangio Déramo was the most successful racing driver of the 1950s. He took first place five times in ten years.

Some F1 stars of the 1990s are still racing, so I wanted to get to know the well-known drivers from 1990 until now. The code below looks at the most recent period, 2010 to 2017.

driverTable10$Driver <-str_replace_all(driverTable10$Driver,"[:space:]?\n[:space:]+", " ")
driverTable10$FirstName <- str_extract(driverTable10$Driver, "[:alpha:]+")
driverTable10$LastName <- str_extract(driverTable10$Driver, "[A-Z]{3}\\b")
driverTable10 <- transform(driverTable10, Name = paste(FirstName,LastName, sep = ""))

drivertally10 <- as.data.frame(table(driverTable10$Name))
colnames(drivertally10) <- c("Name","Freq")
d10.1 <- drivertally10[which(drivertally10$Freq == max(drivertally10$Freq)),]
d10.2 <- drivertally10[which(drivertally10$Freq == (max(drivertally10$Freq)-1)),]
d10.3 <- drivertally10[which(drivertally10$Freq == (max(drivertally10$Freq)-2)),]
d10.top3.tally <- rbind(d10.1,d10.2,d10.3)

sort(d10.top3.tally$Freq, decreasing = T)

d10.top3.name <- as.character(d10.top3.tally$Name)

length(d10.top3.name)

driverTable10$Name <-as.character(driverTable10$Name)

# %in% replaces the chain of comparisons, which mistakenly referenced d50.top3.name[5]
d10.top3 <- driverTable10[which(driverTable10$Name %in% d10.top3.name),]

d10.top3 <- group_by(d10.top3,Name)
 
ggplot(data = d10.top3, aes(x=year, y=Pos, color = Name, group = Name)) +
    geom_point() +
    geom_line() +
    labs(y="Position")+
    facet_grid(Name ~ .)

It would be too much to show every driver who entered the top 10 since 2010, so only the drivers who appeared in the top 10 most frequently were selected. From the figure we can see that Sebastian Vettel was champion in four consecutive years, from 2010 to 2013. Lewis Hamilton was No. 1 in 2014 and 2015. Hamilton's teammate Nico Rosberg was the 2016 champion.

2. Grand Prix of Austria, July 2016

In this project, I took data from the Grand Prix of Austria in July 2016 to see how the numbers can tell the story of this particular season and this particular race. The report has three parts: practice, qualifying, and the race.
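The decade subsets and the "top three frequencies" selection used above can each be written in a line or two. A sketch in base R on hypothetical inputs (`years`, `names_seen`):

```r
years <- c(1950, 1959, 1960, 1999, 2010, 2017)

# Integer division maps every year to the first year of its decade.
decade <- (years %/% 10) * 10

# split() returns one element per decade, replacing seven separate subsets.
byDecade <- split(years, decade)

# Hypothetical appearance counts: a = 5, b = 4, c = 1, d = 3.
names_seen <- rep(c("a", "b", "c", "d"), times = c(5, 4, 1, 3))
counts <- table(names_seen)

# Keep names whose count is within 2 of the maximum, which matches
# selecting Freq == max, max - 1, and max - 2 separately.
top <- names(counts)[counts >= max(counts) - 2]
```

Applied to the driver table, `split(driverTable, (driverTable$year %/% 10) * 10)` would give one data frame per decade.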

2.1 Practice and Qualifying

Since 2006, three practice sessions have been held before the Grand Prix race; the first on Friday morning and the second on Friday afternoon (Thursday at Monaco). Both sessions last one and a half hours. While individual practice sessions are not compulsory, a driver must take part in at least one practice session to be eligible for the race. Teams use practice time to work on car set-up in preparation for qualifying and the race.

To see how each driver uses the practice time, the timing sheets published as PDFs by the FIA (http://www.fia.com/f1-archives) were extracted with https://pdftables.com/ and converted to a tab-delimited text file. A hundredth of a second can make a difference to the race result. The crucial precision, milliseconds to three decimal places, is preserved in a tab-delimited text file.

While learning how to read a timing sheet, I found a very interesting blog by Joe Saward, a long-time F1 journalist. In one of his articles, Lap Charts (Joe Saward, JOEBLOGF1, April 14, 2011, https://joesaward.wordpress.com/2011/04/14/lap-charts/), he describes how F1 journalists keep lap charts. Lap charters develop their own marks to record the action on a given lap. For example, in Joe Saward's charts, an arrow indicates that a car is catching the one in front, two lines indicate a battle, a circled number indicates a pit stop, small horizontal lines between numbers indicate the gap, and a circular arrow indicates a spin.

# Grand Prix of Austria, July 2016, First Practice: AuJy.p1
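One way to make sure the three decimal places survive the import is to read the lap time column as text and parse it later. A minimal sketch, assuming a two-column tab-delimited layout; the string `txt` and its rows are hypothetical:

```r
# Two hypothetical rows in a tab-delimited layout like the one described above.
txt <- "lapNo\ttime\n2\t1:13.590\n3\t1:12.700\n"

# Reading every column as character keeps the lap time string, including
# trailing zeros, exactly as it appears in the file.
laps <- read.delim(text = txt, colClasses = "character")

laps$time
```

The minutes, seconds, and milliseconds can then be extracted from the string without any rounding on input.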

AuJy.p1 <- read.table("https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_final%20project/practice_lap_times_55_raw%20-%20Copy.txt",fill = TRUE)
colnames(AuJy.p1) <- c("lapNo","pit","time","No","Name","V6","V7")
AuJy.p1 <- AuJy.p1[-1, ]
AuJy.p1.fix <- AuJy.p1[AuJy.p1$pit == "P",]
AuJy.p1.fix$pit <- NULL
AuJy.p1.fix$V6 <- paste(AuJy.p1.fix$V6,AuJy.p1.fix$V7)
AuJy.p1.fix$V7 <- NULL

colnames(AuJy.p1.fix) <- c("lapNo","pit","time","No","Name")

AuJy.p1.fix$pit <- str_replace(AuJy.p1.fix$pit, "11","1")

AuJy.p1.rest <- AuJy.p1[AuJy.p1$pit != "P",]
AuJy.p1.rest$V7 <- NULL
AuJy.p1.rest$Name <- paste(AuJy.p1.rest$Name,AuJy.p1.rest$V6)
AuJy.p1.rest$V6 <- NULL
AuJy.p1 <- rbind(AuJy.p1.fix,AuJy.p1.rest)

AuJy.p1$lapNo <- as.character(AuJy.p1$lapNo)
AuJy.p1$lapNo <- str_extract(AuJy.p1$lapNo,"\\d+")
AuJy.p1$lapNo <- as.numeric(AuJy.p1$lapNo)

AuJy.p1$No <- as.numeric(AuJy.p1$No)
AuJy.p1$time <- as.character(AuJy.p1$time)
AuJy.p1 <-arrange(AuJy.p1,No,lapNo)

# find out the driver who first recorded a laptime during the session
start.time <- AuJy.p1[which(AuJy.p1$lapNo == 1),]
start.time$hour <- str_extract(start.time$time,"\\d+")
start.time$hour <- as.numeric(start.time$hour)

start.time$minute <- str_extract(start.time$time,":\\d+:")
start.time$minute <- str_replace_all(start.time$minute,":","")
start.time$minute <- as.numeric(start.time$minute)

start.time$second <- as.numeric(str_sub(start.time$time, -2,-1))

earliest.start <- start.time[which(start.time$hour == min(start.time$hour)),]

if (nrow(earliest.start) == 1){
  basetime <- earliest.start$hour * 60 * 60 + earliest.start$minute * 60 + earliest.start$second
}else
{
  earliest.start <- earliest.start[which(earliest.start$minute == min(earliest.start$minute)),]
}

if (nrow(earliest.start) == 1){
  basetime <- earliest.start$hour * 60 * 60 + earliest.start$minute * 60 + earliest.start$second
}else
{
  earliest.start <- earliest.start[which(earliest.start$second == min(earliest.start$second)),]
  basetime <- earliest.start$hour * 60 * 60 + earliest.start$minute * 60 + earliest.start$second
}

# Set basetime as the time recorded by the driver who first recorded a lap time, converted to seconds.

# The first "lap time" in AuJy.p1 is special: it is not a lap time but the time of day at which the driver completed the first lap of the session. So the first "lap time" and the other lap times are calculated separately.

# Calculate the first "lap time":
# convert the time to seconds and store it as "ctime", with basetime set to 0.
start.time$ctime <- start.time$hour*60*60+ start.time$minute*60+ start.time$second
start.time$ctime <- start.time$ctime-basetime

#remove first 'laptime' from the dataset and calculate the other laptimes.
intermediate <- AuJy.p1[which(AuJy.p1$lapNo != 1),]

#convert time to seconds and annotated as "ctime"
intermediate$minute <- as.numeric(str_extract(intermediate$time,"\\d+"))

intermediate$second <- str_extract(intermediate$time,":\\d+.")
intermediate$second <- str_extract(intermediate$second,"\\d+")
intermediate$second <- as.numeric(intermediate$second)

intermediate$milisecond <- str_extract(intermediate$time, "\\.\\d+")
intermediate$milisecond <- str_extract(intermediate$milisecond,"\\d+")
intermediate$milisecond <- as.numeric(intermediate$milisecond) 
  
intermediate$ctime <- intermediate$minute * 60 + intermediate$second +(intermediate$milisecond)/1000

#attach ctime to the original dataset by combining start.time and intermediate.
start.time$hour <- NULL
start.time$minute <- NULL
start.time$second <- NULL
  
intermediate$minute <- NULL
intermediate$second <- NULL
intermediate$milisecond <- NULL

AuJy.p1 <- rbind(start.time,intermediate)

AuJy.p1 <- arrange(AuJy.p1,No,lapNo)

#calculate the cumulated time
AuJy.p1 <- AuJy.p1 %>% 
  group_by(No) %>% 
  mutate(cumtime = cumsum(ctime))

kable(head(AuJy.p1,n=5))

lapNo	pit	time	No	Name	ctime	cumtime
1	1	10:01:54	1	F. NASR	14.000	14.000
2	0	18:59.955	1	F. NASR	1139.955	1153.955
3	0	1:13.592	1	F. NASR	73.592	1227.547
4	0	1:12.738	1	F. NASR	72.738	1300.285
5	0	1:11.766	1	F. NASR	71.766	1372.051

green <- do.call(rbind, by(AuJy.p1, AuJy.p1$No, function(x) x[which.min(x$ctime), ] ))

ggplot(data = AuJy.p1, aes(x=cumtime, y=Name, color = Name, group = Name)) +
    geom_point() +
    geom_line() +
    geom_point(data=AuJy.p1[which(AuJy.p1$pit == 1), ], aes(x=cumtime, y=Name),color="black",pch = 1,size = 1.7)+
    geom_point(data=green, aes(x=cumtime, y=Name),color="green",pch = 1,size = 3)+
  labs( title="Practice 1")+
  theme(legend.position="none")

The distance between two dots is the time taken to complete one lap, which may be the lap time alone or the lap time plus a pit stop. The lap chart shows how each driver completes each lap and uses the practice time.

In the lap chart, laps on which the driver stopped in the pit are marked with a black circle. The fastest lap of each driver is marked with a green circle.
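The time conversion behind the chart can be condensed into two small helpers. This is a base R sketch rather than the code used above; `lap_to_seconds()` and `clock_to_seconds()` are hypothetical helper names, and the input values are the F. NASR rows from the table above.

```r
# Convert "m:ss.mmm" lap times to seconds.
lap_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Convert "hh:mm:ss" time of day to seconds since midnight.
clock_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Values taken from the F. NASR rows in the table above.
first_lap <- clock_to_seconds("10:01:54")
basetime  <- first_lap - 14          # the table shows ctime = 14 for this row
ctime     <- c(first_lap - basetime, lap_to_seconds(c("18:59.955", "1:13.592")))
cumtime   <- cumsum(ctime)
```

The resulting `cumtime` values agree with the first three rows of the table (14, 1153.955, 1227.547).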

Saturday's qualifying session, designed to take about an hour, is split into three distinct segments: Q1, Q2 and Q3.

Q1: Lasts for 18 minutes, at the end of which time the six slowest drivers are eliminated from qualifying and 16 advance to Q2.

Q2: After a short break, the times are reset and the 16 remaining cars run in a 15-minute session, at the end of which the slowest six are eliminated from qualifying, leaving 10 to progress to Q3.

Q3: After a further break, the times are reset and a final 12-minute session is held to decide pole position and the starting order for the top ten grid places.
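The elimination format above can be expressed as a small function. A sketch under the assumption that each car's best time in a segment is available as a named numeric vector; `advance()` is a hypothetical helper and the times are made up:

```r
# Return the car numbers that advance from a segment, given each car's
# best time in seconds; the slowest `n_out` cars are eliminated.
advance <- function(best, n_out) {
  ranked <- names(sort(best))            # fastest first
  head(ranked, length(ranked) - n_out)
}

# Hypothetical best times for five cars, eliminating the slowest two.
q1 <- c("44" = 66.9, "6" = 66.5, "5" = 66.8, "12" = 68.4, "94" = 68.9)
advance(q1, n_out = 2)

# With a full field of 22 cars, eliminating six leaves 16 for Q2.
field <- setNames(seq(66, by = 0.1, length.out = 22), 1:22)
q2_entrants <- advance(field, n_out = 6)
```

Calling `advance()` again on the Q2 times with `n_out = 6` would leave the ten cars that contest Q3.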

AuJy.qli <- read.table("https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_final%20project/qulifying_lap_time_58_raw.txt",fill=T,row.names=NULL)

colnames(AuJy.qli) <- c("lapNo","pit","time","No","FN","LN")
AuJy.qli$Name <- paste(AuJy.qli$FN,AuJy.qli$LN)
AuJy.qli$FN <- NULL
AuJy.qli$LN <- NULL

AuJy.qli$time <- as.character(AuJy.qli$time)

# find out the driver who first recorded a laptime during the session
start.time.qli <-AuJy.qli[which(AuJy.qli$lapNo == 1),]
start.time.qli$hour <- str_extract(start.time.qli$time,"\\d+")
start.time.qli$hour <- as.numeric(start.time.qli$hour)

start.time.qli$minute <- str_extract(start.time.qli$time,":\\d+:")
start.time.qli$minute <- str_replace_all(start.time.qli$minute,":","")
start.time.qli$minute <- as.numeric(start.time.qli$minute)

start.time.qli$second <- as.numeric(str_sub(start.time.qli$time, -2,-1))

earliest.start.qli <- start.time.qli[which(start.time.qli$hour == min(start.time.qli$hour)),]

if (nrow(earliest.start.qli) == 1){
  basetime.qli <- earliest.start.qli$hour * 60 * 60 + earliest.start.qli$minute * 60 + earliest.start.qli$second
}else
{
  earliest.start.qli <- earliest.start.qli[which(earliest.start.qli$minute == min(earliest.start.qli$minute)),]
}

if (nrow(earliest.start.qli) == 1){
  basetime.qli <- earliest.start.qli$hour * 60 * 60 + earliest.start.qli$minute * 60 + earliest.start.qli$second
}else
{
  earliest.start.qli <- earliest.start.qli[which(earliest.start.qli$second == min(earliest.start.qli$second)),]
  basetime.qli <- earliest.start.qli$hour * 60 * 60 + earliest.start.qli$minute * 60 + earliest.start.qli$second
}

# Set basetime.qli as the time recorded by the driver who first recorded a lap time, converted to seconds.

# The first "lap time" in AuJy.qli is special: it is not a lap time but the time of day at which the driver completed the first lap of the session. So the first "lap time" and the other lap times are calculated separately.

# Calculate the first "lap time":
# convert the time to seconds and store it as "ctime", with basetime.qli set to 0.
start.time.qli$ctime <- start.time.qli$hour*60*60+ start.time.qli$minute*60+ start.time.qli$second
start.time.qli$ctime <- start.time.qli$ctime-basetime.qli

#remove first 'laptime' from the dataset and calculate the other laptimes.
intermediate.qli <- AuJy.qli[which(AuJy.qli$lapNo != 1),]

#convert time to seconds and annotated as "ctime"
intermediate.qli$minute <- as.numeric(str_extract(intermediate.qli$time,"\\d+"))

intermediate.qli$second <- str_extract(intermediate.qli$time,":\\d+.")
intermediate.qli$second <- str_extract(intermediate.qli$second,"\\d+")
intermediate.qli$second <- as.numeric(intermediate.qli$second)

intermediate.qli$milisecond <- str_extract(intermediate.qli$time, "\\.\\d+")
intermediate.qli$milisecond <- str_extract(intermediate.qli$milisecond,"\\d+")
intermediate.qli$milisecond <- as.numeric(intermediate.qli$milisecond) 
  
intermediate.qli$ctime <- intermediate.qli$minute * 60 + intermediate.qli$second +(intermediate.qli$milisecond)/1000

#attach ctime to the original dataset by combining start.time and intermediate.
start.time.qli$hour <- NULL
start.time.qli$minute <- NULL
start.time.qli$second <- NULL
  
intermediate.qli$minute <- NULL
intermediate.qli$second <- NULL
intermediate.qli$milisecond <- NULL

AuJy.qli <- rbind(start.time.qli,intermediate.qli)

AuJy.qli$lapNo <- as.numeric(AuJy.qli$lapNo)
AuJy.qli <- arrange(AuJy.qli,No,lapNo)

#calculate the cumulated time
AuJy.qli <- AuJy.qli %>% 
  group_by(No) %>% 
  mutate(cumtime = cumsum(ctime))

kable(head(AuJy.qli,n=5))

lapNo	pit	time	No	Name	ctime	cumtime
1	0	14:10:35	3	D. RICCIARDO	525.000	525.000
2	0	1:07.824	3	D. RICCIARDO	67.824	592.824
3	0	1:24.580	3	D. RICCIARDO	84.580	677.404
4	0	1:07.500	3	D. RICCIARDO	67.500	744.904
5	1	1:25.712	3	D. RICCIARDO	85.712	830.616

green <- do.call(rbind, by(AuJy.qli, AuJy.qli$No, function(x) x[which.min(x$ctime), ] ))

ggplot(data = AuJy.qli, aes(x=cumtime, y=Name, color = Name, group = Name)) +
    geom_point() +
    geom_line() +
    geom_point(data=AuJy.qli[which(AuJy.qli$pit == 1), ], aes(x=cumtime, y=Name),color="black",pch = 1,size = 1.7)+
    geom_point(data=green, aes(x=cumtime, y=Name),color="green",pch = 1,size = 3)+
    labs( title="Qualifying")+
    theme(legend.position="none")

In the lap chart, laps on which the driver stopped in the pit are marked with a black circle, and the fastest lap of each driver is marked with a green circle. From the qualifying lap chart, we can see that some drivers stopped running before the end of the session. All drivers pitted at similar times in qualifying for the Austrian Grand Prix in July 2016.

json_url <- "http://ergast.com/api/f1/2016/9/qualifying.json"
jsonFile <- getURL(json_url)
jsonContent <- fromJSON(jsonFile)
json_table5 <- jsonContent$MRData

json_table6 <-json_table5$RaceTable
json_table7 <-json_table6$Races
json_table8 <-json_table7$QualifyingResults
json_table9 <-json_table8[[1]]
json_table10 <-json_table9[["Driver"]]
json_table11 <-json_table9[["Constructor"]]
json_table9$Driver <- NULL
json_table9$Constructor <- NULL
json_table<- cbind(json_table9,json_table10)
json_table <- cbind(json_table,json_table11)
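# Note: "url" and "nationality" each occur twice after cbind(). Selecting by name matches the first (driver) column both times, so url.1 and nationality.1 in the output repeat the driver values rather than the constructor values.
json_table <- json_table[,c("number","position", "Q1", "Q2", "Q3", "driverId","permanentNumber", "code", "url", "givenName", "familyName", "dateOfBirth", "nationality", "constructorId", "url", "name", "nationality")]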
json_table$permanentNumber <- NULL
kable(head(json_table,n=10))

number	position	Q1	Q2	Q3	driverId	code	url	givenName	familyName	dateOfBirth	nationality	constructorId	url.1	name	nationality.1
44	1	1:06.947	1:06.228	1:07.922	hamilton	HAM	http://en.wikipedia.org/wiki/Lewis_Hamilton	Lewis	Hamilton	1985-01-07	British	mercedes	http://en.wikipedia.org/wiki/Lewis_Hamilton	Mercedes	British
6	2	1:06.516	1:06.403	1:08.465	rosberg	ROS	http://en.wikipedia.org/wiki/Nico_Rosberg	Nico	Rosberg	1985-06-27	German	mercedes	http://en.wikipedia.org/wiki/Nico_Rosberg	Mercedes	German
27	3	1:07.385	1:07.257	1:09.285	hulkenberg	HUL	http://en.wikipedia.org/wiki/Nico_H%C3%BClkenberg	Nico	Hülkenberg	1987-08-19	German	force_india	http://en.wikipedia.org/wiki/Nico_H%C3%BClkenberg	Force India	German
5	4	1:06.761	1:06.602	1:09.781	vettel	VET	http://en.wikipedia.org/wiki/Sebastian_Vettel	Sebastian	Vettel	1987-07-03	German	ferrari	http://en.wikipedia.org/wiki/Sebastian_Vettel	Ferrari	German
22	5	1:07.653	1:07.572	1:09.900	button	BUT	http://en.wikipedia.org/wiki/Jenson_Button	Jenson	Button	1980-01-19	British	mclaren	http://en.wikipedia.org/wiki/Jenson_Button	McLaren	British
7	6	1:07.240	1:06.940	1:09.901	raikkonen	RAI	http://en.wikipedia.org/wiki/Kimi_R%C3%A4ikk%C3%B6nen	Kimi	Räikkönen	1979-10-17	Finnish	ferrari	http://en.wikipedia.org/wiki/Kimi_R%C3%A4ikk%C3%B6nen	Ferrari	Finnish
3	7	1:07.500	1:06.840	1:09.980	ricciardo	RIC	http://en.wikipedia.org/wiki/Daniel_Ricciardo	Daniel	Ricciardo	1989-07-01	Australian	red_bull	http://en.wikipedia.org/wiki/Daniel_Ricciardo	Red Bull	Australian
77	8	1:07.148	1:06.911	1:10.440	bottas	BOT	http://en.wikipedia.org/wiki/Valtteri_Bottas	Valtteri	Bottas	1989-08-29	Finnish	williams	http://en.wikipedia.org/wiki/Valtteri_Bottas	Williams	Finnish
33	9	1:07.131	1:06.866	1:11.153	max_verstappen	VER	http://en.wikipedia.org/wiki/Max_Verstappen	Max	Verstappen	1997-09-30	Dutch	red_bull	http://en.wikipedia.org/wiki/Max_Verstappen	Red Bull	Dutch
19	10	1:07.419	1:07.145	1:11.977	massa	MAS	http://en.wikipedia.org/wiki/Felipe_Massa	Felipe	Massa	1981-04-25	Brazilian	williams	http://en.wikipedia.org/wiki/Felipe_Massa	Williams	Brazilian
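In the table above, `url.1` and `nationality.1` repeat the driver's values. That is what base R does when a data frame with duplicated column names is indexed by name. The sketch below reproduces the effect on two tiny hypothetical data frames and shows one possible way around it, prefixing the second set of names before binding:

```r
driver      <- data.frame(url = c("d1", "d2"), nationality = c("British", "German"),
                          stringsAsFactors = FALSE)
constructor <- data.frame(url = c("c1", "c2"), nationality = c("German", "Indian"),
                          stringsAsFactors = FALSE)

# cbind() keeps both copies of the duplicated names.
both <- cbind(driver, constructor)

# Selecting by name matches the first "url" both times, so the second
# column silently repeats the driver values.
picked <- both[, c("url", "url")]

# Prefixing the constructor columns before binding keeps them distinct.
names(constructor) <- paste0("constructor_", names(constructor))
fixed <- cbind(driver, constructor)
```

The `constructor_` prefix is an arbitrary choice for this illustration; any scheme that makes the names unique before selection would do.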

#convert qualifying time Q1 time to seconds and annotated as "ctime"
json_table$Q1m <- as.numeric(str_extract(json_table$Q1,"\\d+"))

json_table$Q1s <- str_extract(json_table$Q1,":\\d+.")
json_table$Q1s <- as.numeric(str_extract(json_table$Q1s,"\\d+"))

json_table$Q1ms <- str_extract(json_table$Q1, "\\.\\d+")
json_table$Q1ms <- as.numeric(str_extract(json_table$Q1ms,"\\d+")) 
  
json_table$Q1ctime <- json_table$Q1m * 60 + json_table$Q1s +(json_table$Q1ms)/1000

#convert qualifying time Q2 time to seconds and annotated as "ctime"
json_table$Q2m <- as.numeric(str_extract(json_table$Q2,"\\d+"))

json_table$Q2s <- str_extract(json_table$Q2,":\\d+.")
json_table$Q2s <- as.numeric(str_extract(json_table$Q2s,"\\d+"))

json_table$Q2ms <- str_extract(json_table$Q2, "\\.\\d+")
json_table$Q2ms <- as.numeric(str_extract(json_table$Q2ms,"\\d+")) 
  
json_table$Q2ctime <- json_table$Q2m * 60 + json_table$Q2s +(json_table$Q2ms)/1000

#convert qualifying time Q3 time to seconds and annotated as "ctime"
json_table$Q3m <- as.numeric(str_extract(json_table$Q3,"\\d+"))

json_table$Q3s <- str_extract(json_table$Q3,":\\d+.")
json_table$Q3s <- as.numeric(str_extract(json_table$Q3s,"\\d+"))

json_table$Q3ms <- str_extract(json_table$Q3, "\\.\\d+")
json_table$Q3ms <- as.numeric(str_extract(json_table$Q3ms,"\\d+")) 
  
json_table$Q3ctime <- json_table$Q3m * 60 + json_table$Q3s +(json_table$Q3ms)/1000

hist(json_table$Q1ctime,breaks = 10,xlab = "Q1 qualifying time",main = "Q1 - Grand Prix of Austria July 3, 2016")

summary(json_table$Q1ctime)

107% rule: The 107% rule is a sporting regulation affecting Formula One racing qualifying sessions. During the first phase of qualifying, any driver who fails to set a lap within 107 percent of the fastest time in the first qualifying session will not be allowed to start the race.
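As a worked example of the rule, the cutoff is simply 1.07 times the fastest Q1 time. The sketch below uses the Q1 times of the first three rows of the qualifying table above, converted to seconds; treating a missing Q1 time as a failure is an assumption of this sketch.

```r
# Q1 times of the first three rows of the table above, in seconds.
q1 <- c(hamilton = 66.947, rosberg = 66.516, hulkenberg = 67.385)

# The cutoff is 107% of the fastest Q1 time.
cutoff <- 1.07 * min(q1, na.rm = TRUE)

# A driver with no Q1 time (NA) cannot be within the limit, so NA counts
# as a failure here; this treatment is an assumption of the sketch.
outside <- is.na(q1) | q1 > cutoff
```

With a fastest time of 66.516 s, the cutoff is 71.17212 s, so none of these three drivers falls outside the limit.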

q <-min(json_table$Q1ctime)*1.07
qualifying.failed <- json_table[which(json_table$Q1ctime > q),]
paste("There is/are",nrow(qualifying.failed),"racer(s) who failed the 107% rule in 2016 Season Round 9 (Grand Prix of Austria - July 3, 2016).")

2.2 Race chart

AuJy.race <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA607/master/DA607_final%20project/history_chart_11_australiaJuly2016.csv")
AuJy.race <- AuJy.race [,1:4]
kable(head(AuJy.race,n=3))

lapNo	No	gap	time
1	44	0	1:15.725
1	22	0.664	1:16.389
1	7	1.312	1:17.037

#change class of variable from factor to character for further manipulation
AuJy.race$gap <- as.character(AuJy.race$gap)
AuJy.race$time <- as.character(AuJy.race$time)
AuJy.race$minute <-as.numeric(str_extract(AuJy.race$time,"\\d+"))
AuJy.race$second <-str_extract(AuJy.race$time,":\\d+")
AuJy.race$second <-as.numeric(str_extract(AuJy.race$second,"\\d+"))
AuJy.race$milisecond <-as.numeric(str_sub(AuJy.race$time,-3,-1))
AuJy.race$timeinsecond <- AuJy.race$minute*60 + AuJy.race$second + (AuJy.race$milisecond)/1000

AuJy.race <- AuJy.race %>% 
  group_by(No) %>% 
  mutate(cumtime=cumsum(timeinsecond))


# add the laptime of fastest racer in the lap to the laptime of racer who is one lap behind  
find.behind <- subset(AuJy.race, gap == "1 LAP")

find.lap <- levels(as.factor(find.behind$lapNo))
find.lap <- as.numeric(find.lap)
print("find.lap contains the lap numbers in which a racer is one lap behind the leader:")
print(find.lap)

find.leader<- AuJy.race[which(AuJy.race$lapNo == 25 & AuJy.race$gap == "0"),]
lap <- data.frame()
for(i in c(find.lap)){
  lap <- AuJy.race[which(AuJy.race$lapNo == i & AuJy.race$gap == "0"),]
  find.leader<- union(find.leader,lap)
}

leader.behind <- rbind(find.leader,find.behind)

leader.behind <- arrange(leader.behind,lapNo)

single.behind <- spread(leader.behind,gap,cumtime)
names(single.behind)[names(single.behind)=="0"]<- "zero"
names(single.behind)[names(single.behind)=="1 LAP"]<- "onelap"

require(zoo)

single.behind <- transform(single.behind, zero = na.locf(zero))

OneLap <- single.behind[which(is.na(single.behind$onelap) == F),]
OneLap$cumtime <- OneLap$zero + OneLap$onelap

#OneLap$timeinsecond <- OneLap$zero + OneLap$onelap

OneLap <- arrange(OneLap,No,lapNo)

# "1 LAP" does not mean that one more lap time of the fastest racer should be added to the gap, so the following three lines are not right and will not be run.
#OneLap <- OneLap %>% 
#  group_by(No) %>% 
#  mutate(cumtime=cumsum(timeinsecond))

OneLap$zero <- NULL
OneLap$onelap <- NULL

#OneLap$timeinsecond <- NULL

#OneLap$cumtime <- NULL

OneLap$gap <- "1 LAP"
OneLap <- OneLap[,c("lapNo","No","gap","time","minute","second","milisecond","timeinsecond","cumtime")]

#names(OneLap)[names(OneLap)=="cumtime"] <- "timeinsecond"
  
removeOneLap <- AuJy.race[which(AuJy.race$gap != "1 LAP"),]

AuJy.race <- union(removeOneLap,OneLap)

AuJy.race <- arrange(AuJy.race,lapNo,timeinsecond)

# rank the racers in each lap
sorted <- AuJy.race %>% 
  group_by(lapNo) %>% 
  mutate(rank=row_number(cumtime)) %>% 
  arrange(lapNo,rank)
sorted$No <- as.character(sorted$No)

# plot all the drivers
ggplot(data = sorted, aes(x=lapNo, y=rank, color = No, group=No)) +
    geom_point() +
    geom_line() +
    scale_y_reverse()

The race chart shows how each driver's position changes over the course of the race.

# plot the five drivers who started close to each other
ggplot(data = subset(sorted,No == 44 | No == 7 | No == 3 | No == 6 | No == 33), aes(x=lapNo, y=rank, color = No, group=No)) +
    geom_point() +
    geom_line() +
    scale_x_continuous(breaks=seq(0,72, by = 4))+
    # use one reversed y scale; a scale_y_reverse() added after scale_y_continuous() replaces it and drops the breaks
    scale_y_reverse(breaks=seq(0,15, by = 1))

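# plot only Hamilton (No.44) and Rosberg (No.6)
ggplot(data = subset(sorted,No == 44 | No == 6 ), aes(x=lapNo, y=rank, color = No, group=No)) +
    geom_point() +
    geom_line() +
    scale_x_continuous(breaks=seq(0,72, by = 4))+
    # use one reversed y scale; a scale_y_reverse() added after scale_y_continuous() replaces it and drops the breaks
    scale_y_reverse(breaks=seq(0,15, by = 1))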

Above is a race chart for the Austrian Grand Prix in July 2016. The horizontal axis is the lap number, and the vertical axis shows the position of each car. The performance of each car can be read along its line from left to right. The sudden drops in the lines are due to pit stops.

Lewis Hamilton's car: No.44
Nico Erik Rosberg's car: No.6
Max Verstappen's car: No.33
Daniel Ricciardo's car: No.3
Kimi Räikkönen's car: No.7

From the above two figures, we can see how Hamilton and Rosberg fought for the win. You may wonder why the positions of Hamilton and Rosberg changed on the last lap: the two cars collided on the final lap.

Hamilton was the pole sitter. Rosberg, starting fifth, made good progress. They made their way past Verstappen (No.33), and for many laps Rosberg ran ahead of Hamilton. Rosberg was criticised and penalised for the collision because he only began to apply full lock, or much steering at all, after he had ploughed into the side of Hamilton. The contact ruined his front wing and he finished fourth.

2.3 Reproduce the fight for the lead

# calculate the gap between each car and the leading car on each lap. The FIA history chart contains gap records, but the gap is not shown for laps on which the car is in the pit. The missing values can be derived from the lap times.
a <- data.frame(sorted)
a <- mutate(a, calgap=0)
k <- 1
n <- 1
for(i in 1:71){
  for(j in 1:nrow(a[which(a$lapNo == i),])){
    if (i==1){
      a[j,11] <- a[j,9] - a[i,9]
      k <- 1+j
    }else{
      a[k,11] <- a[k,9] - a[n,9]
      k <- k+1
      n <- k-j
    }
  }
  n<-n+nrow(a[which(a$lapNo == i),])
} 

a$diff <- a$calgap*(-1)
a$No <- as.character(a$No)

ggplot(data = a, aes(x=lapNo, y=diff, color = No, group=No)) +
    geom_point() +
    geom_line()

When the gaps of all drivers are plotted together, as in the figure above, the differences are hard to see.
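
The per-lap gap that the nested loop above fills in row by row can also be expressed as a grouped operation. This is a minimal sketch, assuming a data frame with the columns lapNo, No, and cumtime as in `sorted`. The toy values are made up, and this is an alternative formulation rather than the code used to produce the figures.

```r
library(dplyr)

# Toy data standing in for `sorted`: two laps, three cars (values are made up)
toy <- data.frame(
  lapNo   = c(1, 1, 1, 2, 2, 2),
  No      = c("44", "22", "7", "44", "22", "7"),
  cumtime = c(75.7, 76.4, 77.0, 150.1, 152.0, 151.5)
)

gaps <- toy %>%
  group_by(lapNo) %>%
  mutate(calgap = cumtime - min(cumtime),  # seconds behind the leader of that lap
         diff   = -calgap) %>%             # negated so the leader is plotted on top
  ungroup()
```

Grouping by lapNo avoids tracking row indices by hand, so the result does not depend on the row order or on the number of cars still running on each lap.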

# plot six of the front-running drivers (No.27 retired at lap 67 according to the race chart above, so he was not picked for this graph)
ggplot(data = subset(a, No == 44 | No == 22 | No == 7 | No == 6 | No == 3 | No==33), aes(x=lapNo, y=diff, color = No, group=No)) +
    geom_point() +
    geom_line()+
    scale_x_continuous(breaks=seq(0,72, by = 4))+
    labs(y="Gap")

Above is a race history chart for the Austrian Grand Prix in July 2016. The horizontal axis is the lap number, and the vertical axis shows how far each car is behind the race time of the reference, the leading car.

From the race history chart above we can see the performance of the first six cars in the race, shown by the lines shifting up and down the graph. The performance of each car can be read along its line from left to right. The sudden drops in the lines are due to the time lost in pit stops.

Lewis Hamilton's car: No.44
Nico Erik Rosberg's car: No.6
Max Verstappen's car: No.33
Daniel Ricciardo's car: No.3
Kimi Räikkönen's car: No.7
Jenson Button's car: No.22

We can see Hamilton pulling away at the front, closely followed by Räikkönen, while Rosberg appears to have pitted and then soon caught up and led the race. We also see Verstappen and Ricciardo battling, with Räikkönen joining their battle after lap 33, while Button was far behind from the beginning. They all went to pit at lap 64. In the final stint, Hamilton and Rosberg collided; Hamilton secured the win, while Rosberg finished fourth, 16.710 s behind the winner.

b <- a[which(a$lapNo==71 & a$No==44),]
c <- a[which(a$lapNo==71 & a$No==6),]
d <- rbind(b,c)
kable(d)

	lapNo	No	gap	time	minute	second	milisecond	timeinsecond	cumtime	rank	calgap	diff
1400	71	44	0	1:13.030	1	13	30	73.030	5258.107	1	0.00	0.00
1403	71	6	16.710	1:30.339	1	30	339	90.339	5274.817	4	16.71	-16.71

3. Most Successful Formula 1 Drivers in the Past Decade

# Winners (finishing position 1) of every race in the 2006 season

json_url <- "http://ergast.com/api/f1/2006/results/1.json"
jsonFile <- getURL(json_url)
jsonContent <- fromJSON(jsonFile)
json_table5 <- jsonContent$MRData

json_table6 <-json_table5$RaceTable
json_table7 <-json_table6$Races
json_table8 <-json_table7$Results
json_table9 <- data.frame(json_table8[[1]])[,c(1,2,4)]

for( i in 2:18){
  json_table9 <- rbind(json_table9,data.frame(json_table8[[i]])[,c(1,2,4)])
}

json_table10 <- json_table8[[1]][["Driver"]]

for( i in 2:18){
  if (i %in% c(2,4,5,10,11,12,15,16)){
     b <- json_table8[[i]][["Driver"]]
     b$permanentNumber <- NA 
     json_table10 <- rbind(json_table10,b)
  }else{
     c <- json_table8[[i]][["Driver"]]
     json_table10 <- rbind(json_table10,c)
  }
}

json_table11 <- json_table8[[1]][["Constructor"]]
for(i in 2:18){
 json_table11 <- rbind(json_table11,json_table8[[i]][["Constructor"]])
}

json_table12 <- data.frame(json_table8[[1]][["FastestLap"]])[,1:2]
json_table12 <- json_table12[,1:2]
for( i in 2:18){
  json_table12 <- rbind(json_table12,data.frame(json_table8[[i]][["FastestLap"]])[,1:2])
}

json_table13 <- json_table8[[1]][["FastestLap"]]
json_table14 <- json_table13[["AverageSpeed"]]

for( i in 2:18){
     c <- json_table8[[i]][["FastestLap"]]
     d <- c[["AverageSpeed"]]
     json_table14 <- rbind(json_table14,d)
}
json_table15 <- cbind(json_table12,json_table14)

e <- cbind(json_table9,json_table10)
json_table.results <-cbind(e,json_table15)
names(json_table.results)[names(json_table.results) == "driverId"]<-"name"
json_table.results$year<- "2006"
json_table.results <- json_table.results[c("year","number","position", "points" ,"name", "permanentNumber", "code", "url","givenName", "familyName", "dateOfBirth","nationality", "rank", "lap", "units", "speed")]
json_table.results1 <- json_table.results[,c("year","number","position", "points" ,"name", "rank", "lap", "units", "speed")]
kable(head(json_table.results,n=2))

year	number	position	points	name	permanentNumber	code	url	givenName	familyName	dateOfBirth	nationality	rank	lap	units	speed
2006	1	1	10	alonso	14	ALO	http://en.wikipedia.org/wiki/Fernando_Alonso	Fernando	Alonso	1981-07-29	Spanish	3	21	kph	210.551
2006	2	1	10	fisichella	NA	FIS	http://en.wikipedia.org/wiki/Giancarlo_Fisichella	Giancarlo	Fisichella	1973-01-14	Italian	2	16	kph	209.402
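
The chunk above walks through the nested Driver, Constructor, and FastestLap fields with hard-coded loop bounds and a hand-picked list of rows that lack permanentNumber. As a hedged alternative sketch, the nested columns can be flattened and the races stacked in one step. It assumes that `fromJSON` comes from the jsonlite package (the original does not say which package is used) and uses a tiny mock string in place of the real response; the mock mirrors only the fields the chunk above reads.

```r
library(jsonlite)
library(dplyr)

# A tiny mock of the response shape (two races, one result each);
# the second driver has no permanentNumber, as happens in the real data.
mock_json <- '{"MRData":{"RaceTable":{"Races":[
  {"round":"1","Results":[{"number":"1","position":"1","points":"10",
    "Driver":{"driverId":"alonso","permanentNumber":"14"},
    "FastestLap":{"rank":"3","lap":"21",
      "AverageSpeed":{"units":"kph","speed":"210.551"}}}]},
  {"round":"2","Results":[{"number":"2","position":"1","points":"10",
    "Driver":{"driverId":"fisichella"},
    "FastestLap":{"rank":"2","lap":"16",
      "AverageSpeed":{"units":"kph","speed":"209.402"}}}]}
]}}}'

races <- fromJSON(mock_json)$MRData$RaceTable$Races

# Flatten the nested Driver / FastestLap columns of each race, then stack
# the races; bind_rows() fills columns missing in some races with NA.
results <- bind_rows(lapply(races$Results, jsonlite::flatten))
```

The flattened columns get dotted names such as Driver.driverId and FastestLap.AverageSpeed.speed, and the row without a permanentNumber receives NA automatically, so no list of special-case indices is needed.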

# Second-place finishers of every race in the 2006 season

json_url <- "http://ergast.com/api/f1/2006/results/2.json"
jsonFile <- getURL(json_url)
jsonContent <- fromJSON(jsonFile)
json_table5 <- jsonContent$MRData

json_table6 <-json_table5$RaceTable
json_table7 <-json_table6$Races
json_table8 <-json_table7$Results
json_table9 <- data.frame(json_table8[[1]])[,c(1,2,4)]

for( i in 2:18){
  json_table9 <- rbind(json_table9,data.frame(json_table8[[i]])[,c(1,2,4)])
}

json_table10 <- json_table8[[1]][["Driver"]]
json_table10 <- json_table10[["driverId"]]

for( i in 2:18){
     f <- json_table8[[i]][["Driver"]][["driverId"]]
     json_table10 <- rbind(json_table10,f)
}

json_table11 <- json_table8[[1]][["Constructor"]]
for(i in 2:18){
 json_table11 <- rbind(json_table11,json_table8[[i]][["Constructor"]])
}

json_table12 <- data.frame(json_table8[[1]][["FastestLap"]])[,1:2]
json_table12 <- json_table12[,1:2]
for( i in 2:18){
  json_table12 <- rbind(json_table12,data.frame(json_table8[[i]][["FastestLap"]])[,1:2])
}

json_table13 <- json_table8[[1]][["FastestLap"]]
json_table14 <- json_table13[["AverageSpeed"]]

for( i in 2:18){
     c <- json_table8[[i]][["FastestLap"]]
     d <- c[["AverageSpeed"]]
     json_table14 <- rbind(json_table14,d)
}

json_table.results2 <- cbind(json_table9,json_table10,json_table12,json_table14)

names(json_table.results2)[names(json_table.results2) == "json_table10"] <- "name" 
json_table.results2$year<- "2006"

json_table.results2 <- json_table.results2[c("year","number","position", "points" ,"name", "rank", "lap", "units", "speed")]

kable(head(json_table.results2,n=2))

year	number	position	points	name	rank	lap	units	speed
2006	5	2	8	michael_schumacher	2	38	kph	210.576
2006	1	2	8	alonso	1	45	kph	210.487

To be concise, the code for collecting race results from 2007 to 2016, which is similar to the two chunks above, is hidden and only the data are shown here.

year	number	position	points	name	rank	lap	units	speed
2008	22	1	10	hamilton	2	39	kph	218.300
2008	1	1	10	raikkonen	2	37	kph	209.158

year	number	position	points	name	rank	lap	units	speed
2008	3	2	8	heidfeld	3	41	kph	217.586
2008	4	2	8	kubica	6	39	kph	208.033

year	number	position	points	name	rank	lap	units	speed
2010	8	1	25	alonso	1	45	kph	191.706
2010	1	1	25	button	6	52	kph	213.804

year	number	position	points	name	rank	lap	units	speed
2010	7	2	18	massa	5	38	kph	189.392
2010	11	2	18	kubica	8	52	kph	213.138

year	number	position	points	name	rank	lap	units	speed
2012	3	1	25	button	1	56	kph	214.053
2012	5	1	25	alonso	7	53	kph	196.250

year	number	position	points	name	rank	lap	units	speed
2012	1	2	18	vettel	2	57	kph	213.503
2012	15	2	18	perez	3	54	kph	197.531

year	number	position	points	name	rank	lap	units	speed
2014	6	1	25	rosberg	1	19	kph	206.436
2014	44	1	25	hamilton	1	53	kph	193.611

year	number	position	points	name	rank	lap	units	speed
2014	20	2	18	kevin_magnussen	6	49	kph	205.131
2014	6	2	18	rosberg	2	55	kph	191.946

year	number	position	points	name	rank	lap	units	speed
2016	6	1	25	rosberg	3	21	kph	210.815
2016	6	1	25	rosberg	1	41	kph	206.210

year	number	position	points	name	rank	lap	units	speed
2016	44	2	18	hamilton	4	48	kph	210.608
2016	7	2	18	raikkonen	3	39	kph	204.745

year	number	position	points	name	rank	lap	units	speed
2007	6	1	10	raikkonen	1	41	kph	223.978
2007	1	1	10	alonso	2	42	kph	206.014

year	number	position	points	name	rank	lap	units	speed
2007	1	2	8	alonso	2	20	kph	221.178
2007	2	2	8	hamilton	1	22	kph	206.355

year	number	position	points	name	rank	lap	units	speed
2009	22	1	10	button	3	17	kph	216.891
2009	22	1	5	button	1	18	kph	206.483

year	number	position	points	name	rank	lap	units	speed
2009	23	2	8	barrichello	14	43	kph	214.344
2009	6	2	4	heidfeld	10	17	kph	201.392

year	number	position	points	name	rank	lap	units	speed
2013	7	1	25	raikkonen	1	56	kph	213.845
2013	1	1	25	vettel	3	45	kph	198.661

year	number	position	points	name	rank	lap	units	speed
2013	3	2	18	alonso	3	53	kph	213.162
2013	2	2	18	webber	6	45	kph	198.190

year	number	position	points	name	rank	lap	units	speed
2015	44	1	25	hamilton	1	50	kph	209.915
2015	5	1	25	vettel	3	46	kph	193.501

year	number	position	points	name	rank	lap	units	speed
2015	6	2	18	rosberg	2	47	kph	209.577
2015	44	2	18	hamilton	2	45	kph	193.501

# combine the 2006 to 2016 data (10 seasons; the 2011 data were discarded because of missing information from the ergast API)

championship <- rbind(json_table.results1,json_table.results2,json_table.results3,json_table.results4,json_table.results5,json_table.results6,json_table.results7,json_table.results8,json_table.results9,json_table.results10,json_table.results11,json_table.results12,json_table.results13,json_table.results14,json_table.results15,json_table.results16,json_table.results19,json_table.results20,json_table.results21,json_table.results22)

championship$points <- as.numeric(championship$points)
championship$speed <- as.numeric(championship$speed)
championship$position <- as.numeric(championship$position)
points <- championship %>% 
  group_by(name) %>% 
  summarise(points=sum(points),maxspeed=max(speed),avg.speed=mean(speed))
first <- championship[which(championship$position == 1),]
second <- championship[which(championship$position == 2),]

position.1 <- first %>% 
  group_by(name) %>% 
  summarise(first = n())

position.2 <- second %>% 
  group_by(name) %>% 
  summarise(second=n())

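# full_join keeps drivers who appear in only one of the two tables (left_join would drop winners who never finished second)
a <- full_join(position.2,position.1,by="name")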
a <- gather(a,position,counts,2:3)

ggplot(data=a, aes(x=name, y=counts, fill=position)) +
    geom_bar(stat="identity", position=position_dodge())+
    coord_flip() + 
     geom_hline(yintercept = 20, col="red")+
    theme(axis.text.y=element_text(angle=0, hjust=1))+
     labs(y="", x="Drivers Name",
              title="Who Won the Most Races in the Last Decade")
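
The first-place and second-place counts built above with two filtered tables, a join, and gather() can also be obtained in a single grouped count. This is a minimal sketch on made-up rows, assuming a data frame with the columns name and position as in `championship`.

```r
library(dplyr)

# Made-up finishing positions standing in for `championship`
toy <- data.frame(
  name     = c("hamilton", "rosberg", "hamilton", "vettel", "rosberg", "hamilton"),
  position = c(1, 2, 1, 1, 1, 2)
)

podium_counts <- toy %>%
  filter(position %in% c(1, 2)) %>%                                # keep first and second places
  mutate(position = ifelse(position == 1, "first", "second")) %>%
  count(name, position) %>%                                        # one row per driver and place
  rename(counts = n)
```

The result is already in the long format that the bar chart expects (name, position, counts), and drivers who have only wins or only second places are kept.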

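# group belongs inside aes(); outside it the argument is silently ignored
ggplot(championship,aes(x=name,y=speed,group=name))+
  geom_boxplot()+
  labs(y="Speed (kph)",title="Fastest-lap average speed by driver")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))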

From the boxplot, we can see that not every driver's speed distribution looks normal. The speed distributions of the four most successful drivers look approximately normal, with medians around 200 kph. Fisichella had the highest median speed.
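
The medians read off the boxplot can be checked numerically. This is a minimal sketch on made-up speeds, assuming a data frame with the columns name and speed as in `championship`; the driver names below are placeholders.

```r
library(dplyr)

# Made-up speeds (kph) for two placeholder drivers
toy <- data.frame(
  name  = c("driver_a", "driver_a", "driver_a", "driver_b", "driver_b", "driver_b"),
  speed = c(198, 200, 205, 209, 210, NA)
)

speed_summary <- toy %>%
  group_by(name) %>%
  summarise(median_speed = median(speed, na.rm = TRUE),  # middle value per driver
            mean_speed   = mean(speed, na.rm = TRUE)) %>%
  arrange(desc(median_speed))                            # highest median first
```

Sorting by the median puts the driver with the highest median speed in the first row, which makes a statement such as the one about Fisichella easy to verify against the real data.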

famous <- a[which(a$counts > 20 ),]
kable(famous)

The racer who won the most races in the past 10 years is Lewis Hamilton, who took first place 45 times. The second most successful racer is Sebastian Vettel, who won 30 times. Nico Erik Rosberg and Fernando Alonso Díaz are tied, as they both won 23 times. (Results could vary, as the 2011 results were not collected because of a technical issue.)

When these top drivers did not take first place, they were very close to it. It is easy to understand why Hamilton, Rosberg, and Alonso took the most second places among all the racers in the past 10 years.

Conclusion and Discussion:

1. Ferrari is a team that has finished in the top 10 since the very beginning of the roughly 70-year history of F1.

2. Sebastian Vettel and Lewis Hamilton are the only two drivers who won championships from 2010 to 2015.

3. Timing sheet data can be used to show how practice time is utilized. From the session utilization chart, it seems that all the participants use the time to set up their cars.

4. The qualifying chart can be used to see which drivers were eliminated in Q1 and Q2.

5. The race chart shows how each driver's position changes during the race. The race history chart, based on gap times, shows the battles between the cars.

6. The racer who won the most races in the past 10 years is Lewis Hamilton, who took first place 45 times. Sebastian Vettel, Nico Erik Rosberg, and Fernando Alonso Díaz are also famous racers, as each of them won more than 20 races.

7. The average speeds of the four famous racers are about the same, around 200 kph. This suggests that consistency is the key for a champion.

8. Technically, there are different sources for collecting data, and they come in different formats. The results of an analysis or a model will vary depending on the strategy used to filter the data.

9. I used PDFtable to extract data from PDF files and store it as .txt or .csv files, which is a feature that we did not cover in class.

name	position	counts
alonso	second	24
hamilton	second	24
rosberg	second	22
alonso	first	23
hamilton	first	45
rosberg	first	23
vettel	first	30

1. Tony Hirst, Wrangling F1 Data With R - Leanpub. (The lap utilization chart idea in this project comes from Tony's book. The book also introduces useful data resources, F1 knowledge, and analysis methods.)

Sporting and Data Analytics: An Exploration into Formula One

Motivation