Required packages

library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(stringr)
library(lubridate)
library(outliers)
library(MVN)
library(ggplot2)
library(knitr)
library(gdata)
library(mlr)
library(forecast)

Executive Summary

Data Preprocessing plays an important role in Statistical Analysis of data as it impacts the results based on data.Data Preprocessing is a technique of transforming raw data into a understandable format.We have performed all the necessary steps required like get,understand, tidy,manipulate , scan & transform.

At the first , we imported the two datasets. Both the datasets were sorted and a new dataset was formed by using “Inner Join” function.This newly formed subset is understood in terms of its class & structure and for some variables were converted into factors.

Then the data was checked for its tidyness and for some variables ,rules were applied to seperate alpha-numeric variables. ‘dplyr’ package was also explored for setting the carrier years of a player using mutate () function by evaluating end_year-start_year. For checking the missing values if any , is.na function was used and the missing values were imputed using median value for the particular variable.A new subset was created with 2 numeric and 1 character variable and it was scanned using Turkey’s fencing method and outliers were detected using z-score and for those variables showing outliers were cleaned by capping function.

Data

The dataset used for this project was about an ‘American Football League’. It is one of prime sports played in North America .The data was taken from kaggle.com’, an open data source software.The link for the dataset is:https://www.kaggle.com/kendallgillies/nflstatistics#Basic_Stats.csv

There were 2 datasets choosen for this assignment namely- 1).Basic.csv which contained statistics of the players.The dataset contained variables as follows:Age, Birthday,Birthplace , Height, Current Status, College Attended,etc

2).Game.csv which conatined statistics of Game Logs for each player.Each player have some specific roles in some areas. We had choosen Game statistics of player on the offensive line and stored it in ‘Game.csv’ file.It contained following variables namely :Player_id, Name, Position, Year,Team, Games Played , Season,etc.

The working directory was first checked and then set with directory where the data files were present.The file was imported using read_csv function which is similar as read.csv but is efficient and faster.

getwd()
[1] "G:/Semester-2/Data Pre-processing"
setwd("G:/Semester-2/Data Pre-processing/Assignment")
The working directory was changed to G:/Semester-2/Data Pre-processing/Assignment inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
play<- read_csv("Basic.csv")
Parsed with column specification:
cols(
  Age = col_double(),
  `Birth Place` = col_character(),
  Birthday = col_character(),
  College = col_character(),
  `Current Status` = col_character(),
  `Current Team` = col_character(),
  Experience = col_character(),
  `Height (inches)` = col_double(),
  `High School` = col_character(),
  `High School Location` = col_character(),
  Name = col_character(),
  Number = col_double(),
  `Player Id` = col_character(),
  Position = col_character(),
  `Weight (lbs)` = col_double(),
  `Years Played` = col_character()
)
head(play)

Using head function , first few rows of the data are displayed.

game<- read_csv("Game.csv")
Parsed with column specification:
cols(
  `Player Id` = col_character(),
  Name = col_character(),
  Position = col_character(),
  Year = col_double(),
  Season = col_character(),
  Week = col_double(),
  `Game Date` = col_character(),
  `Home or Away` = col_character(),
  Opponent = col_character(),
  Outcome = col_character(),
  Score = col_character(),
  `Games Played` = col_double(),
  `Games Started` = col_character()
)
head(game)

The dataset below is subset of original dataset in which 7 variables were taken from 16 variables . The varibles choosen were : PlayerId,Experience, Weight, Height, Birthday, College Attendend, Current Status.

play_new <- play[,c(13,3,4,5,7,8,15)]
head(play_new)

Now the columns of second dataset i.e Game.csv were sorted & changed to improve the quality of dataset.From the 13 variables of dataset , 2 variables were formed after applying summarise function and were grouped by ‘Player_ID’

game_new <- game %>% group_by(`Player Id`) %>% summarise(start_year=min(Year,na.rm=FALSE),end_year=max(Year,na.rm = FALSE),                                              total_games=sum(`Games Played`,na.rm=FALSE))
head(game_new)

The subsets of the main data (play_new & game_new ) were joined together and named as football using inner_join function on variable player_ID.

football <- play_new %>% inner_join(game_new,by="Player Id")
head(football)

Understand

firstly , class function was used to check the class/type.

class(football$`Height (inches)`)
[1] "numeric"

The class of Height varible is seen to be numeric.Hence, to convert it to integer format we have used as.integer function.

football$`Height (inches)`<-as.integer(football$`Height (inches)`)
class(football$`Height (inches)`)
[1] "integer"
class(football$Birthday)
[1] "character"

Next the class for variable Birthday was shown as character as the reason being date & time values are imported as character value in R by default. Therefore ,we converted it into suitable date-time form.

football$Birthday <- mdy(football$Birthday)
class(football$Birthday)
[1] "Date"

Finally, the Variable ‘Current Status’ is converted to factor variable and the variable ‘Experience’ is sorted , factored ,labelled and ordered as per experinec of players.

class(football$`Current Status`)
[1] "character"
football$`Current Status`<-as.factor(football$`Current Status`)
class(football$`Current Status`)
[1] "factor"
football$Experience <- factor( c(football$Experience), 
                              levels = c("0 Season", "1 Season", "1st season",   "2 Seasons",
                                         "2nd season",   "3 Seasons",    "3rd season",   "4 Seasons", 
                                         "4th season",   "5 Seasons",    "5th season",   "6 Seasons", 
                                         "6th season",   "7 Seasons",    "7th season",   "8 Seasons",
                                         "8th season",   "9 Seasons",    "9th season",   "10 Seasons",
                                         "10th season",  "11 Seasons",   "11th season",  "12 Seasons",
                                         "13 Seasons",   "14 Seasons",   "14th season",  "17 Seasons",
                                         "18 Seasons",   "18th season"), 
                              labels = c("0 Season", "1 Season", "1 Season", "2 Seasons", 
                                         "2 Seasons",    "3 Seasons",    "3 Seasons",    "4 Seasons",
                                         "4 Seasons",    "5 Seasons",    "5 Seasons",    "6 Seasons",
                                         "6 Seasons",    "7 Seasons",    "7 Seasons",    "8 Seasons",
                                         "8 Seasons",    "9 Seasons",    "9 Seasons",    "10 Seasons",
                                         "10 Seasons",   "11 Seasons",   "11 Seasons",   "12 Seasons",  
                                         "13 Seasons",   "14 Seasons",   "14 Seasons",   "17 Seasons",
                                         "18 Seasons",   "18 Seasons"), ordered = TRUE)

Tidy & Manipulate Data I

As per statistican Hardley Wickham and Grolemund ,there are 3 tidy rule as: 1) Each variable in the dataset is placed in its own column. 2) Each observation is placed in its own row. 3) Each value is placed in its own cell. and the data that satisfies these 3 rules is known as tidy data.

In our original dataset, their was 1 variable namely playerID which was a combination of firstname,lastname and player_ID.As we were not confirmed about the firstname and familyname, we decided to seperate the variable on basis of ID. The below code shows how player_ID is seperated into Name and ID by using seperate() of tidyr package.

football_1 <-football %>% separate (`Player Id`,into= c("Names","ID"),sep='/')
head(football_1)

Tidy & Manipulate Data II

Mutate() function is used for creation of a new column called career_years. By looking at the data we see that some players have started playing and given up in same year.But when we take the difference of their start_year & end_year we get 0 which is logically incorrect.To solve the issue we have added 1 explicity .

mutate(football_1,career_years=end_year-start_year+1)

Scan I

The numeric variables are further subsetted into football_subset with variables ID, Height & Weight to scan any missing values/inconsistences.

football_subset<- football_1[,c(2,7,8)]
football_subset

To calculate the total number of missing values in each column , we used colsum function. We found 1 missing value in Height column. which() function is then applied to find the location of NA in that variable

colSums(is.na(football_subset))
             ID Height (inches)    Weight (lbs) 
              0               1               0 
which(is.na(football_subset$`Height (inches)`))
[1] 253

As their are many ways to handle the missing values in dataset such as replacing missing values with constants or a mean or median values of the that variable ,here we have imputed the missing value with ’Median Value’of height column.After imputing the values, the variable is checked again to see if changes are applied.

football_subset$`Height (inches)`[is.na(football_subset$`Height (inches)`)] <-median(football_subset$`Height (inches)`,na.rm = TRUE)
colSums(is.na(football_subset))
             ID Height (inches)    Weight (lbs) 
              0               0               0 

Scan II

We know methods for detecting outliers are distance based method and Turkey’s method of outlier detection.We therefore created box plot using Turkey’s method and outliers are seen.

football_subset$`Height (inches)` %>% boxplot(main="Boxplot for Height",ylab="Height In Inches")

A box plot is standardised way of displaying distribution of data based on summary statistics and shows the outliers and what their values are. The distance based (Z-score) helps in confirming the number of outliers detected. We then calculated the z-score for height variable .Also we know that an observation is said to be an outlier if the absolute value of its z-score is > 3.Hence, according to z-score the value is 1 which is computed using length function.

zscores_football_height <-football_subset$`Height (inches)` %>% scores(type="z")
zscores_football_height %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.8395 -0.7722 -0.1588  0.0000  0.4547  2.9085 
length (which( abs(zscores_football_height) >3 ))
[1] 1

Their are multiple ways to take care of outliers like excluding ,imputing , capping & binning. But before we use any of the methods mentioned above , we need to analysed it properly .Due to insufficient reasons to measure the certaninty of outliers we then used user-defined capping method.

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}
football_height_capped <- football_subset$`Height (inches)` %>% cap()

To cross check if any outliers are present , we plotted a boxplot for football_height_capped.

football_height_capped %>% boxplot(main="Boxplot for Height",ylab="Height In Inches")

Transform

Before the data is used to prepare a model , we applied some transformations on the tidy data. Data Transformation is used in many situations like - 1) To alter the scale of variable 2) To transform non-linear into linear relationship 3) To reduce skewness for making it a normal distribution

The distribution of height variable is then visualised using a histogram. The histogram appears to be normally distributed.Therefore we did not applied any further transforamtion .

hist(football_height_capped)



---
title: "MATH2349 Semester 2, 2019"
author: "Poorvashi Pokhun S3390602  & Ankit Munot s3764950"
subtitle: Assignment 3
output:
  html_notebook: default
  html_document:
    df_print: paged
---

## Required packages 

```{r}
library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(stringr)
library(lubridate)
library(outliers)
library(MVN)
library(ggplot2)
library(knitr)
library(gdata)
library(mlr)
library(forecast)
```


## Executive Summary 
Data Preprocessing plays an important role in Statistical Analysis of data as it impacts the results based on data.Data Preprocessing is a technique of transforming raw data into a understandable format.We have performed all the necessary steps required like get,understand, tidy,manipulate , scan & transform.

At the first , we imported the two datasets. Both the datasets were sorted and a new dataset was formed by using "Inner Join" function.This newly formed subset is understood in terms of its class  & structure and for some variables were converted  into factors.

Then the data was checked for its tidyness and for some variables ,rules were applied to seperate alpha-numeric variables. 'dplyr' package was also explored for setting the carrier years of a player using mutate () function by evaluating end_year-start_year.
For checking the missing values if any , is.na function was used and the missing values were imputed using median value for the particular variable.A new subset was created with 2 numeric and 1 character variable and it was scanned using Turkey's fencing method and outliers were detected using z-score and for those variables showing outliers were cleaned by capping function.


## Data 
The dataset used for this project was about an 'American Football League'. It is one of prime sports played in North America .The data was taken from kaggle.com', an open data source software.The link for the dataset is:https://www.kaggle.com/kendallgillies/nflstatistics#Basic_Stats.csv

There were 2 datasets choosen for this assignment namely-
1).Basic.csv which contained statistics of the players.The dataset contained variables as follows:Age, Birthday,Birthplace , Height, Current Status, College Attended,etc

2).Game.csv which conatined statistics of Game Logs for each player.Each player have some specific roles in some areas. We had choosen Game statistics of player on the offensive line and stored it in 'Game.csv' file.It contained following variables namely :Player_id, Name, Position, Year,Team, Games Played , Season,etc.

The working directory was first checked and then set with directory where the data files were present.The file was imported using read_csv function which is similar as read.csv but is efficient and faster.


```{r}

getwd()
setwd("G:/Semester-2/Data Pre-processing/Assignment")

play<- read_csv("Basic.csv")

head(play)
```
Using head function , first few rows of the data are displayed.

```{r}
game<- read_csv("Game.csv")
head(game)

```
The dataset below is subset of original dataset in which 7 variables were taken from 16 variables . The varibles choosen were : PlayerId,Experience, Weight, Height, Birthday, College Attendend, Current Status.

```{r}

play_new <- play[,c(13,3,4,5,7,8,15)]

head(play_new)

```

Now the columns of second dataset i.e Game.csv were sorted & changed to improve the quality of dataset.From the 13 variables of dataset , 2 variables were formed after applying summarise function and were grouped by 'Player_ID'

```{r}

game_new <- game %>% group_by(`Player Id`) %>% summarise(start_year=min(Year,na.rm=FALSE),end_year=max(Year,na.rm = FALSE),                                              total_games=sum(`Games Played`,na.rm=FALSE))


head(game_new)

```

The subsets of the main data (play_new & game_new ) were joined together and named as football using inner_join function on variable player_ID.

```{r}

football <- play_new %>% inner_join(game_new,by="Player Id")

head(football)

```


## Understand 
 firstly , class function was used to check the class/type.

```{r}

class(football$`Height (inches)`)

```
The class of Height varible is seen to be numeric.Hence, to convert it to integer format we have used as.integer function.

```{r}

football$`Height (inches)`<-as.integer(football$`Height (inches)`)
class(football$`Height (inches)`)

class(football$Birthday)

```

Next the class for variable Birthday was shown as character as the reason being date & time values are imported as character value in R by default. Therefore ,we converted it into suitable date-time form. 

```{r}
football$Birthday <- mdy(football$Birthday)
class(football$Birthday)

```

Finally, the  Variable 'Current Status' is converted to factor variable and the variable 'Experience' is sorted , factored ,labelled and ordered as per experinec of players.


```{r}
class(football$`Current Status`)

football$`Current Status`<-as.factor(football$`Current Status`)

class(football$`Current Status`)


football$Experience <- factor( c(football$Experience), 
                              levels = c("0 Season", "1 Season", "1st season",   "2 Seasons",
                                         "2nd season",   "3 Seasons",    "3rd season",   "4 Seasons", 
                                         "4th season",   "5 Seasons",    "5th season",   "6 Seasons", 
                                         "6th season",   "7 Seasons",    "7th season",   "8 Seasons",
                                         "8th season",   "9 Seasons",    "9th season",   "10 Seasons",
                                         "10th season",  "11 Seasons",   "11th season",  "12 Seasons",
                                         "13 Seasons",   "14 Seasons",   "14th season",  "17 Seasons",
                                         "18 Seasons",   "18th season"), 
                              labels = c("0 Season", "1 Season", "1 Season", "2 Seasons", 
                                         "2 Seasons",    "3 Seasons",    "3 Seasons",    "4 Seasons",
                                         "4 Seasons",    "5 Seasons",    "5 Seasons",    "6 Seasons",
                                         "6 Seasons",    "7 Seasons",    "7 Seasons",    "8 Seasons",
                                         "8 Seasons",    "9 Seasons",    "9 Seasons",    "10 Seasons",
                                         "10 Seasons",   "11 Seasons",   "11 Seasons",   "12 Seasons",  
                                         "13 Seasons",   "14 Seasons",   "14 Seasons",   "17 Seasons",
                                         "18 Seasons",   "18 Seasons"), ordered = TRUE)

```


##	Tidy & Manipulate Data I 
As per statistican Hardley Wickham and Grolemund ,there are 3 tidy rule as:
1) Each variable in the dataset is placed in its own column.
2) Each observation is placed in its own row.
3) Each value is placed in its own cell.
and the data that satisfies these 3 rules is known as tidy data.

In our original dataset, their was 1 variable namely playerID which was a combination of firstname,lastname and player_ID.As we were not confirmed about the firstname and familyname, we decided to seperate  the variable on basis of ID.
The below code shows how player_ID is seperated into Name and ID by using seperate() of tidyr package.

```{r}


football_1 <-football %>% separate (`Player Id`,into= c("Names","ID"),sep='/')

head(football_1)

```

##	Tidy & Manipulate Data II 

Mutate() function is used for creation of a new column called career_years.
By looking at the data we see that some players have started playing and given up in same year.But when we take the difference of their start_year & end_year we get 0 which is logically incorrect.To solve the issue we have added 1 explicity .

```{r}
mutate(football_1,career_years=end_year-start_year+1)

```


##	Scan I 

The numeric variables are further subsetted into football_subset with variables ID, Height & Weight to scan any missing values/inconsistences. 
```{r}

football_subset<- football_1[,c(2,7,8)]

football_subset

```

To calculate the total number of missing values in each column , we used colsum function.
We found 1 missing value in Height column.
which() function is then applied to find the location of NA in that variable

```{r}

colSums(is.na(football_subset))

which(is.na(football_subset$`Height (inches)`))

```
As their are many ways to handle the missing values in dataset such as replacing missing values with constants or a mean or median values of the that variable ,here we have imputed the missing value with 'Median Value'of height column.After imputing the values, the variable is checked again to see if changes are applied.

```{r}

football_subset$`Height (inches)`[is.na(football_subset$`Height (inches)`)] <-median(football_subset$`Height (inches)`,na.rm = TRUE)

colSums(is.na(football_subset))

```
##	Scan II

We know methods for detecting outliers are distance based method and Turkey's method of outlier detection.We therefore created box plot using Turkey's method and outliers are seen.

```{r}
football_subset$`Height (inches)` %>% boxplot(main="Boxplot for Height",ylab="Height In Inches")

```


A box plot is standardised way of displaying distribution of data based on summary statistics and shows the outliers and what their values are.
The distance based (Z-score) helps in confirming the number of outliers detected.
We then calculated the z-score for height variable .Also we know that an observation is said to be an outlier if the absolute value of its z-score is > 3.Hence, according to z-score the value is 1 which is computed using length function.

```{r}

zscores_football_height <-football_subset$`Height (inches)` %>% scores(type="z")

zscores_football_height %>% summary()

length (which( abs(zscores_football_height) >3 ))


```
Their are multiple ways to take care of outliers like excluding ,imputing , capping & binning. But before we use any of the methods mentioned above , we need to analysed it properly .Due to insufficient reasons to measure the certaninty of outliers we then used user-defined capping method.  
```{r}

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

football_height_capped <- football_subset$`Height (inches)` %>% cap()

```

To cross check if any outliers are present , we plotted a boxplot for football_height_capped. 

```{r}

football_height_capped %>% boxplot(main="Boxplot for Height",ylab="Height In Inches")



```




##	Transform 

Before the data is used to prepare a model , we applied some transformations on the tidy data.
Data Transformation is used in many situations like -
1) To alter the scale of variable 
2) To transform non-linear into linear relationship 
3) To reduce skewness for making it a normal distribution

The distribution of height variable is then visualised using a histogram.
The histogram appears to be normally distributed.Therefore we did not applied any further transforamtion .

```{r}
hist(football_height_capped)

```


<br>
<br>
