Data preprocessing is an important step in data analysis: it prepares the data for further statistical analysis. Packages such as readr, dplyr, tidyr, outliers and many more are used to perform data preprocessing operations.
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
This analysis report aims to describe the general information, fitness levels and overall performance of well-known football players in “FIFA 19”, the latest edition of the top-ranked football video game. Because the game bases its virtual ratings on the players’ actual performances, the attributes recorded for each player indicate which skills a footballer should work on to improve his overall rating. The dataset, “Player_overall”, combines two datasets, “Players_details” and “Players_stats”. The final dataset has 15 variables: general information about the players (name, ID, nationality, club, etc.) and their physical and skill statistics (agility, stamina, weight, short passing, etc.), which describe each player’s fitness and performance and serve the main objective of this assignment. The techniques used in this report include data type conversions, descriptive statistics, visualisation of variables with boxplots and histograms, imputing missing values, finding and removing outliers, and transforming the data.
As mentioned in the section above, the main dataset “Player_overall” is made up of two datasets: “Players_details” and “Players_stats”.
The “Players_details” dataset provides general information about a player: ID, Name, Age, Nationality, Club, Value (in euros), Jersey Number and Joined (date of first appearance).
The “Players_stats” dataset covers the physical statistics that determine a player’s performance in the game: ID, Preferred Foot (Left or Right), Position (on the field), Weight (in lbs), ShortPassing, Acceleration, Agility and Stamina. The data was collected from the player database of the official FIFA 19 game.
Steps undertaken:
1) Read both CSV files with the read_csv() function from the readr package.
2) Use head() to preview the observations in each dataset and str() to summarise its variables.
3) Use merge() to join the two datasets on the variable ID into the main dataset “Player_overall”.
4) Use dim() and str() to inspect the newly formed dataset.
Players_details <- read_csv(file = "Footballers_data.csv")
## Parsed with column specification:
## cols(
## ID = col_double(),
## Name = col_character(),
## Age = col_double(),
## Nationality = col_character(),
## Club = col_character(),
## Value = col_character(),
## `Jersey Number` = col_double(),
## Joined = col_character()
## )
head(Players_details)
str(Players_details)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18207 obs. of 8 variables:
## $ ID : num 158023 20801 190871 193080 192985 ...
## $ Name : chr "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "De Gea" ...
## $ Age : num 31 33 26 27 27 27 32 31 32 25 ...
## $ Nationality : chr "Argentina" "Portugal" "Brazil" "Spain" ...
## $ Club : chr "FC Barcelona" "Juventus" "Paris Saint-Germain" "Manchester United" ...
## $ Value : chr "\200110.5M" "\20077M" "\200118.5M" "\20072M" ...
## $ Jersey Number: num 10 7 10 1 7 10 10 9 15 1 ...
## $ Joined : chr "Jul 1, 2004" "Jul 10, 2018" "Aug 3, 2017" "Jul 1, 2011" ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Name = col_character(),
## .. Age = col_double(),
## .. Nationality = col_character(),
## .. Club = col_character(),
## .. Value = col_character(),
## .. `Jersey Number` = col_double(),
## .. Joined = col_character()
## .. )
Players_stats <- read_csv(file = "Footballers_data_2.csv")
## Parsed with column specification:
## cols(
## ID = col_double(),
## `Preferred Foot` = col_character(),
## Position = col_character(),
## Weight = col_character(),
## ShortPassing = col_double(),
## Acceleration = col_double(),
## Agility = col_double(),
## Stamina = col_double()
## )
head(Players_stats)
str(Players_stats)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18207 obs. of 8 variables:
## $ ID : num 158023 20801 190871 193080 192985 ...
## $ Preferred Foot: chr "Left" "Right" "Right" "Right" ...
## $ Position : chr "RF" "ST" "LW" "GK" ...
## $ Weight : chr "159lbs" "183lbs" "150lbs" "168lbs" ...
## $ ShortPassing : num 90 81 84 50 92 89 93 82 78 29 ...
## $ Acceleration : num 91 89 94 57 78 94 80 86 76 43 ...
## $ Agility : num 91 87 96 60 79 95 93 82 78 67 ...
## $ Stamina : num 72 88 81 43 90 83 89 90 84 41 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. `Preferred Foot` = col_character(),
## .. Position = col_character(),
## .. Weight = col_character(),
## .. ShortPassing = col_double(),
## .. Acceleration = col_double(),
## .. Agility = col_double(),
## .. Stamina = col_double()
## .. )
# Joining the two datasets on ID
Player_overall <- merge(Players_details, Players_stats,by="ID")
head(Player_overall)
str(Player_overall)
## 'data.frame': 18207 obs. of 15 variables:
## $ ID : num 16 41 80 164 657 ...
## $ Name : chr "Luis García" "Iniesta" "E. Belözoglu" "G. Pinzi" ...
## $ Age : num 37 34 37 37 35 33 40 35 38 39 ...
## $ Nationality : chr "Spain" "Spain" "Turkey" "Italy" ...
## $ Club : chr "KAS Eupen" "Vissel Kobe" "Medipol Basaksehir FK" "Padova" ...
## $ Value : chr "\200750K" "\20021.5M" "\2004M" "\200240K" ...
## $ Jersey Number : num 10 8 5 11 8 27 1 22 18 11 ...
## $ Joined : chr "Jul 19, 2014" "Jul 16, 2018" "Jul 9, 2015" "Aug 31, 2017" ...
## $ Preferred Foot: chr "Right" "Right" "Left" "Right" ...
## $ Position : chr "RCM" "LF" "CM" "LCM" ...
## $ Weight : chr "143lbs" "150lbs" "159lbs" "168lbs" ...
## $ ShortPassing : num 76 90 86 69 72 62 37 39 68 54 ...
## $ Acceleration : num 56 70 54 65 33 38 49 30 35 45 ...
## $ Agility : num 62 79 68 69 60 49 55 30 35 62 ...
## $ Stamina : num 64 55 61 74 50 48 39 30 34 55 ...
dim(Player_overall)
## [1] 18207 15
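merge() performs an inner join by default, so an ID missing from either table would be dropped silently and a duplicated ID would multiply rows. In this report all three tables have 18207 rows, which is consistent with a one-to-one match. The following is a minimal sketch of how such a check could be done before joining; `details_toy` and `stats_toy` are hypothetical stand-ins with made-up values, not the FIFA data.

```r
# Toy stand-ins for the two source tables (values are made up for illustration)
details_toy <- data.frame(ID = c(1, 2, 3), Name = c("A", "B", "C"))
stats_toy   <- data.frame(ID = c(1, 2, 4), Stamina = c(70, 65, 80))

# The join key should be unique within each table
dup_details <- anyDuplicated(details_toy$ID) > 0
dup_stats   <- anyDuplicated(stats_toy$ID) > 0

# IDs present in only one table are dropped by an inner join
only_in_details <- setdiff(details_toy$ID, stats_toy$ID)
only_in_stats   <- setdiff(stats_toy$ID, details_toy$ID)

# merge() defaults to an inner join (all = FALSE)
joined_toy <- merge(details_toy, stats_toy, by = "ID")
nrow(joined_toy)
```

If either `only_in_*` vector is non-empty, `all = TRUE`, `all.x = TRUE` or `all.y = TRUE` can be passed to merge() to keep the unmatched rows.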
The variables of Player_overall are summarised. The variable Preferred Foot is converted to a factor using factor(). The variable Jersey Number is converted to character type using as.character().
# Summarising the variables of Player_overall
Player_overall %>% summary()
## ID Name Age Nationality
## Min. : 16 Length:18207 Min. :16.00 Length:18207
## 1st Qu.:200316 Class :character 1st Qu.:21.00 Class :character
## Median :221759 Mode :character Median :25.00 Mode :character
## Mean :214298 Mean :25.12
## 3rd Qu.:236530 3rd Qu.:28.00
## Max. :246620 Max. :45.00
##
## Club Value Jersey Number Joined
## Length:18207 Length:18207 Min. : 1.00 Length:18207
## Class :character Class :character 1st Qu.: 8.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :19.55
## 3rd Qu.:26.00
## Max. :99.00
## NA's :60
## Preferred Foot Position Weight ShortPassing
## Length:18207 Length:18207 Length:18207 Min. : 7.00
## Class :character Class :character Class :character 1st Qu.:54.00
## Mode :character Mode :character Mode :character Median :62.00
## Mean :58.69
## 3rd Qu.:68.00
## Max. :93.00
## NA's :48
## Acceleration Agility Stamina
## Min. :12.00 Min. :14.0 Min. :12.00
## 1st Qu.:57.00 1st Qu.:55.0 1st Qu.:56.00
## Median :67.00 Median :66.0 Median :66.00
## Mean :64.61 Mean :63.5 Mean :63.22
## 3rd Qu.:75.00 3rd Qu.:74.0 3rd Qu.:74.00
## Max. :97.00 Max. :96.0 Max. :96.00
## NA's :48 NA's :48 NA's :48
# Converting the variable Preferred Foot to a factor
Player_overall$`Preferred Foot`<- factor(Player_overall$`Preferred Foot`, levels=c('Left','Right'),ordered=TRUE)
class(Player_overall$`Preferred Foot`)
## [1] "ordered" "factor"
# data type conversion for jersey number
class(Player_overall$`Jersey Number`)
## [1] "numeric"
Player_overall$`Jersey Number` <- as.character(Player_overall$`Jersey Number`)
#checking conversion of data type
class(Player_overall$`Jersey Number`)
## [1] "character"
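Weight is still stored as character in the merged data (for example "159lbs"), so it cannot be summarised numerically as it stands. The conversion below is not part of the original analysis; it is a minimal sketch that assumes every non-missing value ends in "lbs", and `weight_chr` is a hypothetical example vector.

```r
# Example values in the same format as the Weight column, including a missing value
weight_chr <- c("159lbs", "183lbs", NA, "150lbs")

# Strip the trailing unit and convert the remainder to numeric;
# missing values stay missing
weight_num <- as.numeric(sub("lbs$", "", weight_chr))
weight_num
```

Applied to the real column, a value that does not follow this pattern would become NA with a warning, which is a useful signal that the format assumption does not hold.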
The data is already tidy: each variable has its own column, each observation its own row, and each value its own cell.
is.na() combined with colSums() was used to detect missing values in the dataset. After the missing values are found, imputation is carried out. Imputation is the process of replacing missing data with substituted values; here the substituted value is the mean of the variable, as in the commented code. NOTE: Due to the maximum page limit of 20, the imputation code is not run and is instead provided as comments.
colSums(is.na(Player_overall))
## ID Name Age Nationality Club
## 0 0 0 0 241
## Value Jersey Number Joined Preferred Foot Position
## 0 60 1553 48 60
## Weight ShortPassing Acceleration Agility Stamina
## 48 48 48 48 48
#####
#impute(Player_overall$ShortPassing, mean)
#impute(Player_overall$Acceleration, mean)
#impute(Player_overall$Agility, mean)
#impute(Player_overall$Stamina, mean)
#colSums(is.na(Player_overall))
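Since the imputation code above is not run, the following minimal sketch shows what mean imputation does, using base R only. `stamina_toy` is a hypothetical vector with made-up values. Note that an imputed vector has to be assigned back to its column for the change to persist.

```r
# Toy vector with two missing values (made up for illustration)
stamina_toy <- c(72, 88, NA, 43, 90, NA)

# Mean of the observed values only
stamina_mean <- mean(stamina_toy, na.rm = TRUE)

# Replace each missing value with that mean and keep the result
stamina_imputed <- ifelse(is.na(stamina_toy), stamina_mean, stamina_toy)

# No missing values remain
sum(is.na(stamina_imputed))
```

Mean imputation leaves the mean unchanged but shrinks the variance, so it is best suited to variables with few missing values, such as the 48 missing entries per skill variable here.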
Before performing data transformation, outliers in the data need to be dealt with. To detect outliers, boxplots were drawn with the base R boxplot() function for the variables Agility and Stamina.
#Boxplot for variable Agility
boxplot(Player_overall$Agility,main="Boxplot of agility",xlab = "Agility",col = "Lightblue")
#Creating a separate vector of outliers in the variable
outliers_agility <- boxplot(Player_overall$Agility, plot = FALSE)$out
#checking the outlier values
print(outliers_agility)
## [1] 23 25 19 22 18 21 25 18 25 25 23 25 25 14 24 26 21 24 22 22 22 20 25
## [24] 23 26 19 25 26 26 24 19 25 23 22 26 21 23 18 25 25 26 21 25 19 26 21
## [47] 26 22 26 26 25 24 24 24 21 26 22 22 25 26 26 23 26 26 24 21 22 21 24
## [70] 25 25 25 24 22 23 24 24 23 25 24 26 22 22 25 23 22 23 23 24 25 23 22
## [93] 23 25 21 25 25 24 26 26 22 22 21 26 26 24 24 25 19 18 22 23 22 22 24
## [116] 26 21 23 22 25 15 23 24 22 22 22 22 22 21 25 25 25 25 23 23 23 24 23
## [139] 22 25 22 22 25 23 22 22 26 23 25 22 26 25 25 22 23 22 25 23 21 25 22
## [162] 26 21 24 22 26 25 23 25 21 23 25 23 24 22 24 22 24 22 26 26 24 23 26
## [185] 26 26 25 20 26 25 23 19
Player_overall[which(Player_overall$Agility %in% outliers_agility),]
#Removing the outliers
Player_overall <- Player_overall[!(Player_overall$Agility %in% outliers_agility),]
#Checking the boxplot
boxplot(Player_overall$Agility,main="Boxplot of agility_filtered",xlab = "Agility",col = "Lightblue")
#####
#Boxplot for variable Stamina
boxplot(Player_overall$Stamina,main="Boxplot of stamina",xlab = "Stamina",col = "orange")
#Creating a separate vector of outliers in the variable
outliers_stamina <- boxplot(Player_overall$Stamina, plot = FALSE)$out
#checking the outlier values
print(outliers_stamina)
## [1] 27 26 26 26 28 27 28 28 27 25 25 27 22 26 23 26 19 23 28 28 24 28 22
## [24] 22 22 25 27 22 18 28 21 24 26 28 27 24 12 16 26 23 27 28 28 20 20 24
## [47] 18 21 22 25 28 27 19 28 28 23 20 28 26 28 27 18 25 28 23 26 25 25 18
## [70] 25 28 28 24 15 21 20 24 17 20 26 21 28 20 24 25 28 26 23 27 28 27 26
## [93] 28 26 26 23 22 24 24 27 22 26 26 27 28 20 24 27 21 23 23 25 23 28 27
## [116] 21 26 19 22 22 18 23 26 18 27 25 20 25 21 22 17 28 28 28 25 20 28 26
## [139] 25 28 20 27 22 25 28 25 18 22 28 28 26 19 22 23 27 22 24 21 19 27 24
## [162] 21 26 21 18 14 28 23 28 21 17 28 28 27 27 22 24 21 21 23 27 22 24 28
## [185] 17 27 20 23 25 28 23 20 17 28 22 23 17 27 25 22 21 24 27 28 19 28 27
## [208] 26 25 26 21 27 20 23 21 26 27 22 21 25 19 20 14 28 28 22 24 23 22 20
## [231] 19 24 26 18 23 21 28 27 24 24 26 25 24 27 25 23 21 27 24 26 28 25 24
## [254] 25 18 28 28 22 25 28 21 25 21 28 28 28 24 20 25 23 20 28 28 18 27 17
## [277] 28 22 25 27 28 24 26 27 20 28 21 27 28 25 23 28 20 20 24 25 24 23 28
## [300] 23 22 28 22 28 21 25 27 23 22 17 28 22 28 27 26 24 25 18 28 25 28 28
## [323] 27 27 25 24 23 21 25 21 26 21 16 25 20 25 28 19 18 24 21 25 22 23 28
## [346] 23 20 21 21 24 24 27 22 23 19 23 15 25 25 17 25 20 26 23 27 22 27 26
## [369] 27 20 23 25 18 19 27 26 28 28 25 28 26 28 26 24 27 20 22 21 19 21 25
## [392] 24 20 24 23 25 23 22 28 28 16 16 25 21 23 21 25 25 13 27 27 20 28 20
## [415] 26 23 26 17 18 21 27 28 25 28 21 28 23 25 28 19 25 23 27 20 20 24 26
## [438] 20 23 17 19 23 20 18 18 25 26 26 26 27 22 25 17 26 28 20 27 23 21 20
## [461] 17 21 16 17 20 24 25 24 28 20 21 28 25 27 22 21 26 16 22 25 19 22 26
## [484] 22 20 25 25 22 28 22 21 28 28 28 21 26 28 25 27 21 18 25 28 28 27 22
## [507] 27 26 19 18 24 28 27 17 18 20 21 26 24 23 24 21 27 23 28 23 27 18 16
## [530] 24 24 22 25 16 27 16 23 20 22 17 21 20 14 28 19 28 20 23 28 22 24 28
## [553] 24 27 28 17 20 24 18 14 21 26 17 21 21 18 25 20 24 17 19 24 25 23 21
## [576] 28 17 27 24 17 18 19 17 17 18 28 16 23 25 17 25 21 24 20 18 19 24 23
## [599] 19 21 22 24 24 18 24 20 25 20 28 18 23 20 20 25 19 20 19 24 17 17 27
## [622] 27 26 27 25 22 16 25 26 26 23 16 19 16 22 22 24 18 18 17 19 22 24 20
## [645] 23 21 23 23 22 19 22 26 24 16 22 22 18 24 26 16 22 27 22 21 17 24 24
## [668] 15 21 20 21 27 19 24 23 25 27 25 19 24 27 17 18 17 18 21 18 20 28 22
## [691] 17 20 28 21 26 26 27 21 26 26 15 21 24 25 20 22 27 23 24 27 26 24 26
## [714] 18 26 18 25 21 16 24
Player_overall[which(Player_overall$Stamina %in% outliers_stamina),]
#Removing the outliers
Player_overall <- Player_overall[!(Player_overall$Stamina %in% outliers_stamina),]
#Checking the boxplot
boxplot(Player_overall$Stamina,main="Boxplot of stamina_filtered",xlab = "Stamina",col = "orange")
#####
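The values returned in `$out` by boxplot() are those lying more than 1.5 box widths beyond the hinges. The following minimal sketch reproduces that rule by hand on a hypothetical vector (`agility_toy`, made-up values) and compares the result with boxplot.stats(), the function boxplot() relies on.

```r
# Toy vector with one very low and one very high value (made up for illustration)
agility_toy <- c(14, 55, 60, 62, 66, 70, 74, 78, 96)

# fivenum() gives minimum, lower hinge, median, upper hinge, maximum
hinges <- fivenum(agility_toy)[c(2, 4)]
box_width <- diff(hinges)

# Points further than 1.5 box widths from the hinges are flagged
lower_fence <- hinges[1] - 1.5 * box_width
upper_fence <- hinges[2] + 1.5 * box_width
flagged <- agility_toy[agility_toy < lower_fence | agility_toy > upper_fence]

# Compare with the values boxplot.stats() reports as outliers
identical(flagged, boxplot.stats(agility_toy)$out)
```

Filtering with a logical index such as `x[!(x %in% flagged)]` keeps all rows when nothing is flagged, whereas a negative `which()` index would return an empty result in that case.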
To check whether the variables are normally distributed, we plot histograms of Agility and Stamina using the hist() function.
# Transforming data
# Plotting histogram of Agility
hist(Player_overall$Agility,main="Histogram of Agility",xlab="Agility",col="Lightblue")
# Plotting histogram of Stamina
hist(Player_overall$Stamina,main="Histogram of Stamina",xlab="Stamina",col="orange")
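A histogram gives a visual impression of the shape; a numeric measure of skewness can complement it. The report itself does not apply a transformation, so the following is only a sketch of how one could be assessed. The `skewness()` helper and `values_toy` are hypothetical, and the log transformation is just one option, suited to right-skewed positive data; a left-skewed variable would call for a different choice.

```r
# Moment-based sample skewness; 0 for a symmetric distribution
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy right-skewed data (made up for illustration)
values_toy <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 20)

# Skewness before and after a log transformation
skew_raw <- skewness(values_toy)
skew_log <- skewness(log(values_toy))
c(raw = skew_raw, log = skew_log)
```

A transformed variable whose skewness is closer to 0 is closer to symmetric, which can then be confirmed by plotting its histogram with hist().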
From the above analysis we conclude that, of the given performance attributes, the variable ShortPassing has the lowest mean, which suggests that players need to work on their passing more than on the other attributes in order to improve their overall rating. Acceleration has the highest mean of all the variables, which suggests that players have been working hard on their speed and sprint workouts. A link to the source data is given in the references so that other player attributes can be checked.
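The comparison of means behind this conclusion can be sketched as follows. The values are the means printed by summary() earlier in the report, which were computed before the outliers were removed; the vector name `attribute_means` is hypothetical.

```r
# Means as reported by summary(Player_overall) earlier in the report
attribute_means <- c(ShortPassing = 58.69, Acceleration = 64.61,
                     Agility = 63.5, Stamina = 63.22)

# Attribute with the lowest and the highest mean
lowest  <- names(which.min(attribute_means))
highest <- names(which.max(attribute_means))
c(lowest = lowest, highest = highest)
```

On the full data frame the same means could be obtained with colMeans() on the four numeric columns, using `na.rm = TRUE` because each of them contains missing values.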
References:
Karan Gadiya (January 2019). FIFA 19 dataset. https://sofifa.com/
Source link: https://github.com/amanthedorkknight/fifa18-all-player-statistics/tree/master/2019