library(readr)
library(rvest)
library(dplyr)
library(tidyr)
library(validate)
library(Hmisc)
library(forecast)
library(stringr)
library(lubridate)
library(outliers)
library(MVN)
library(caret)
library(MASS)
library(ggplot2)
library(knitr)
library(mlr)
In this assignment, the goal is to apply the tools and skills learned during the course to preprocess data for further analysis. First, the data was imported from csv files with selected attributes, and the files were then merged into a new data frame. After merging, the structure and dimensions of the data were checked for errors and issues. Column classes were converted to appropriate types. The data was found to satisfy the definition of tidy data. A new numeric variable was created from existing variables in the data frame. Routine checks were run for missing values and outliers. Finally, the new variable was used to plot the distribution of total goals scored per match across all seasons.
The data was downloaded from http://www.football-data.co.uk/englandm.php. It contains the history of goals scored in the English Premier League (EPL) for three consecutive seasons, 2010-11 to 2012-13. All data is in csv format, ready for use within standard spreadsheet applications.
An overview of the attributes chosen for our analysis: Date, HomeTeam, AwayTeam, FTHG (Full Time Home Team Goals), FTAG (Full Time Away Team Goals)
EPL Season (2010-2011)
setwd("C:/Users/Wel/Downloads/R/Assign 3")
soccerdata1<-read.csv("EPL_Season_10_11.csv")
EPL_10_11 <- soccerdata1[,2:6]
head(EPL_10_11)
EPL Season (2011-2012)
setwd("C:/Users/Wel/Downloads/R/Assign 3")
soccerdata2<-read.csv("EPL_Season_11_12.csv")
EPL_11_12 <- soccerdata2[,2:6]
head(EPL_11_12)
EPL Season (2012-2013)
setwd("C:/Users/Wel/Downloads/R/Assign 3")
soccerdata3<-read.csv("EPL_Season_12_13.csv")
EPL_12_13 <- soccerdata3[,2:6]
head(EPL_12_13)
EPLSeason <- rbind(EPL_10_11,EPL_11_12,EPL_12_13)
head(EPLSeason)
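The three import steps above repeat the same code, and `[,2:6]` depends on the column order of each file. A minimal sketch of a name-based alternative follows. The helper `read_season` is hypothetical, the column names are assumed to match the headers of the downloaded files, and the temporary file with made-up rows only stands in for a real season file.

```r
# Columns kept for the analysis (names assumed to match the CSV header)
keep_cols <- c("Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG")

# Hypothetical helper: read one season file and keep the columns by name
read_season <- function(path) {
  season <- read.csv(path, stringsAsFactors = FALSE)
  season[, keep_cols]
}

# Stand-in for a real season file (made-up rows, not from the data)
demo_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR",
  "E0,14/08/10,Team A,Team B,3,0,H",
  "E0,14/08/10,Team C,Team D,1,1,D"
), demo_file)

# With the real data, `files` would hold the three season file names
files <- c(demo_file, demo_file)
demo_seasons <- do.call(rbind, lapply(files, read_season))
str(demo_seasons)
```

Reading with `stringsAsFactors = FALSE` keeps the team names as character from the start, so no later conversion is needed for them.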
str(EPLSeason)
## 'data.frame': 1140 obs. of 5 variables:
## $ Date : Factor w/ 310 levels "01/01/11","01/02/11",..: 46 46 46 46 46 46 46 46 51 54 ...
## $ HomeTeam: Factor w/ 25 levels "Arsenal","Aston Villa",..: 2 4 6 7 15 16 19 20 10 12 ...
## $ AwayTeam: Factor w/ 25 levels "Arsenal","Aston Villa",..: 18 8 9 17 3 11 5 14 1 13 ...
## $ FTHG : int 3 1 0 6 2 0 0 2 1 3 ...
## $ FTAG : int 0 0 0 0 2 0 4 1 1 0 ...
dim(EPLSeason)
## [1] 1140 5
Here we convert the team name variables from factor to character.
EPLSeason$HomeTeam <- as.character(EPLSeason$HomeTeam)
EPLSeason$AwayTeam <- as.character(EPLSeason$AwayTeam)
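The Date column is also read as a factor. A possible conversion to the Date class is sketched below on toy values; the day/month/two-digit-year format is an assumption based on the factor levels shown in the `str()` output and should be checked against the raw files.

```r
# Toy values in the same text format as the Date column
# (assumed to be day/month/two-digit year)
date_text <- factor(c("14/08/10", "01/01/11", "19/05/13"))

# Go through character explicitly, then parse with the assumed format
match_date <- as.Date(as.character(date_text), format = "%d/%m/%y")

match_date
class(match_date)
```

Applied to the data frame, the same call would read `EPLSeason$Date <- as.Date(as.character(EPLSeason$Date), format = "%d/%m/%y")`.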
The structure and dimensions of the data frame show that it conforms to the definition of tidy data: each variable is a column and each match is a row. Hence we will proceed to manipulate the data.
An extra attribute FTTG (Full Time Total Goals) has been calculated from FTHG and FTAG attributes.
EPLSeason$FTTG <- EPLSeason$FTHG+EPLSeason$FTAG
head(EPLSeason)
Data was scanned for missing values and none was found.
colSums(is.na(EPLSeason))
## Date HomeTeam AwayTeam FTHG FTAG FTTG
## 0 0 0 0 0 0
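Beyond missing values, a few row-level consistency rules can be checked in the same way. The sketch below uses base R on toy rows; the rule names and the toy values are illustrative assumptions, not part of the original analysis.

```r
# Toy rows with the same columns as EPLSeason (values are illustrative)
matches <- data.frame(
  HomeTeam = c("Team A", "Team B", "Team C"),
  AwayTeam = c("Team B", "Team C", "Team A"),
  FTHG = c(2L, 0L, 1L),
  FTAG = c(1L, 0L, 3L),
  stringsAsFactors = FALSE
)
matches$FTTG <- matches$FTHG + matches$FTAG

# Each rule yields one logical per row; TRUE means the row passes
checks <- data.frame(
  no_missing     = complete.cases(matches),
  goals_non_neg  = matches$FTHG >= 0 & matches$FTAG >= 0,
  total_is_sum   = matches$FTTG == matches$FTHG + matches$FTAG,
  distinct_teams = matches$HomeTeam != matches$AwayTeam
)

# Number of rows failing each rule
colSums(!checks)
```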
A boxplot of total goals per match shows two outliers across the three seasons. After careful inspection of the data, there is no valid reason to exclude or modify them, so we will keep them in the analysis.
EPLSeason$FTTG %>% boxplot(main="Box Plot of Total Goals", ylab="Goals", col = "grey")
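To see which values the boxplot flags, `boxplot.stats()` returns the points beyond the whiskers without drawing anything. The sketch below uses toy counts in place of `EPLSeason$FTTG`.

```r
# Toy total-goal counts standing in for EPLSeason$FTTG
total_goals <- c(0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 10)

# boxplot.stats() uses the same 1.5 * IQR whisker rule as boxplot()
flagged <- boxplot.stats(total_goals)$out
flagged

# Positions of the flagged values, e.g. to inspect the matching rows
which(total_goals %in% flagged)
```

With the real data, indexing `EPLSeason` by those positions would list the matches behind the outliers.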
The variable FTTG was used to compare the total number of goals scored in a match, across all seasons, with a Poisson distribution. The Poisson probability mass function (PMF) was evaluated at the sample mean and overlaid on a histogram of the data.
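# Sample mean of total goals, used as the Poisson rate
mu <- mean(EPLSeason$FTTG, na.rm = TRUE)
# Integer grid covering mean +/- 4 standard deviations, truncated at 0
X <- seq(max(0, round(mu - 4 * sqrt(mu))), round(mu + 4 * sqrt(mu)))
# Poisson probabilities on the grid and their rounded values for labels
PMF <- dpois(X, mu)
dplist <- round(PMF, 3)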
hist(EPLSeason$FTTG,main="Distribution of total goals scored in a match in EPL\n All Seasons(2010-2013)",xlab="Total Goals Scored",ylab="Proportion / PMF",freq = FALSE, col="lightblue")
points(X,dpois(X,mu),type="p",col="black")
lines(X,dpois(X,mu),type="l",col="blue")
for(i in seq_along(PMF)){text(X[i], dplist[[i]]+0.01, dplist[[i]], cex = 0.6, pos = 4, col = "red")}
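With its default breaks, `hist()` groups integer counts into right-closed intervals and places 0 and 1 in the same first bar, so the bar heights may not line up with the PMF points. A sketch of a more direct comparison follows: it tabulates the observed share of matches at each goal count next to the Poisson probability, and centres the histogram bars on the integers. The data here is simulated and the rate of 2.8 is an arbitrary assumption, not a result from the EPL data.

```r
# Simulated goal counts standing in for EPLSeason$FTTG (toy data)
set.seed(1)
goals <- rpois(1140, lambda = 2.8)

mu_hat <- mean(goals)
support <- 0:max(goals)

# Observed share of matches at each goal count, keeping empty counts as 0
observed <- as.numeric(table(factor(goals, levels = support))) / length(goals)

# Poisson probabilities at the same integer values
expected <- dpois(support, mu_hat)

comparison <- data.frame(goals = support,
                         observed = round(observed, 3),
                         poisson = round(expected, 3))
comparison

# Bars centred on the integers, so each bar corresponds to one PMF point
hist(goals, breaks = seq(-0.5, max(goals) + 0.5, by = 1), freq = FALSE,
     main = "Total goals per match (toy data)", xlab = "Total goals")
points(support, expected, pch = 16)
```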
Source: “England Football Results Betting Odds | Premiership Results & Betting Odds”. Football-data.co.uk. N.p., 2017. Web. 16 Apr. 2017.