Setup

Loading the necessary packages to reproduce the report:

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(deductive)
library(validate)
library(outliers)
library(kableExtra)
library(car)

Read WHO Data

Importing the WHO.csv dataset using readr.

# This is an R chunk for reading the WHO data.
WHO <- read_csv("WHO.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  country = col_character(),
  iso2 = col_character(),
  iso3 = col_character()
)
See spec(...) for full column specifications.
head(WHO)
NA

Tidy Task 1:

In WHO dataset, from 5th column to the last column represents values instead of variables, thus those columns were transformed from wide to long format by using gather() function.

# Task 1
WHO_1 <- WHO %>% gather(code, value, 5:60) 
WHO_1

Tidy Task 2:

Now, each value in the new column “code” containes separate pieces of information which are separated by "_" sign. Concatenated values in “code” column was stored separated separate() function and stored in three new columns(“New”, “CaseType”, “sex_age”). The column “sex_age” contains combined values of “sex” and “age”, separate() function was used again to split these values into gender and age group.

# Task 2
WHO_1 <- WHO_1 %>% separate(code , into= c("New", "CaseType", "sex_age"), sep = "_")
WHO_1 <- separate(WHO_1, 'sex_age', into = c("Sex","Age"), sep = 1) 
WHO_1

Tidy Task 3:

The “rel”, “ep”, “sn”, and “sp” keys need to be in their own columns as we will treat each of these as a separate variable. The column “CaseType” contains these keys.To decompose CaseType in various columns spread() function is used.

# Task 3
WHO_1 <- WHO_1 %>% spread('CaseType', value)

Tidy Task 4:

To achieve final Tidy format, the two categorical variables “Sex” and “Age”, needed to be converted as factors. Apporipiate steps are performed to achieve this using mutate() function (as mentioned in the requirements).

# Task 4

#Factorizing and labeling age column
WHO_1 <- WHO_1 %>% mutate(Age = factor(Age, levels = c("014","1524","2534","3544","4554","5564","65"), labels = c("<15","15-24","25-34","35-44","45-54","55-64","65>="), ordered = TRUE))

#Factorizing gender column
WHO_1 <- WHO_1 %>% mutate(Sex = factor(Sex))
WHO_1
class(WHO_1$Sex)
[1] "factor"
class(WHO_1$Age)
[1] "ordered" "factor" 

Task 5: Filter & Select

The columns iso2 and New are redundant and hence those were dropped using select() function. Three countries were selected using filter() function.


# Task 5 
WHO_subset <- WHO_1 %>% select( -(iso2), -(New)) %>% filter( country %in% c("Sri Lanka", "Romania", "Portugal"))
head(WHO_subset)
NA

Dimension of the subsetted dataset.

dim(WHO_subset)
[1] 1428    9

Read Species and Surveys data sets

Importing the Species and Surveys data sets using readr.


# Reading the Species and Surveys data sets.
species <- read_csv("species.csv")
Parsed with column specification:
cols(
  species_id = col_character(),
  genus = col_character(),
  species = col_character(),
  taxa = col_character()
)
head(species)

surveys <- read_csv("surveys.csv")
Parsed with column specification:
cols(
  record_id = col_double(),
  month = col_double(),
  day = col_double(),
  year = col_double(),
  species_id = col_character(),
  sex = col_character(),
  hindfoot_length = col_double(),
  weight = col_double()
)
head(surveys)
NA
NA

Task 6: Join

To combine surveys and species data frames, left_join() function was used with the key variable species_id. The combined data frame was renamed as surveys_combined.

# Task 6.
surveys_combined <- left_join(surveys , species, by="species_id")
head(surveys_combined)
str(surveys_combined)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    35549 obs. of  11 variables:
 $ record_id      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ month          : num  7 7 7 7 7 7 7 7 7 7 ...
 $ day            : num  16 16 16 16 16 16 16 16 16 16 ...
 $ year           : num  1977 1977 1977 1977 1977 ...
 $ species_id     : chr  "NL" "NL" "DM" "DM" ...
 $ sex            : chr  "M" "M" "F" "M" ...
 $ hindfoot_length: num  32 33 37 36 35 14 NA 37 34 20 ...
 $ weight         : num  NA NA NA NA NA NA NA NA NA NA ...
 $ genus          : chr  "Neotoma" "Neotoma" "Dipodomys" "Dipodomys" ...
 $ species        : chr  "albigula" "albigula" "merriami" "merriami" ...
 $ taxa           : chr  "Rodent" "Rodent" "Rodent" "Rodent" ...

Task 7: Calculate

The variable “month” of “surveys_combined” is recorded as a numerical variable, to get monthly average, “month” column was converted as a factor. Then the columns “month”, “weight” and “hindfoot_length”, fitered by the species_id=“NL”, was stored in a new data frame “NL”. Aggregate function was used to compute monthly averages of weight and hindfoot length in a new data frame.


# Task 7.
surveys_combined$month <- factor(surveys_combined$month) # Factorizing the "month" column
#str(surveys_combined$month)
NL <- surveys_combined %>% filter(species_id == "NL") #Filtering the dataset by species id = "NL"

The NL dataframe is derived according to the specifications. Monthly averages of weight and hindfoot length variables will be calculated from the following code.


NL <- NL %>%select(month, weight, hindfoot_length) # Selecting the columns month, weight and hindfoot_length
# Storing the monthly averages of weight and hindfoot length in a new data frame
NL_averages <- aggregate(NL[,c("weight","hindfoot_length")], by = list(NL$month), FUN = mean, na.rm=TRUE)
NL_averages <- NL_averages %>% rename(Month = Group.1, Average_weight = weight, Average_hindfoot_length = hindfoot_length)
NL_averages %>% kable() %>% kable_styling(bootstrap_options = "condensed")#Output the results
Month Average_weight Average_hindfoot_length
1 179.3443 32.54098
2 181.3818 32.82353
3 177.4516 32.75862
4 153.0690 32.02439
5 142.7536 31.60000
6 143.7879 32.18889
7 141.7415 32.35398
8 152.5100 32.07143
9 164.9920 32.50427
10 169.1364 32.43119
11 170.6373 32.59574
12 175.1515 32.34848

Task 8: Missing Values

The dataframe surveys_combined was subsetted by selecting records only for the year 1989 and renamed as surveys_combined_year. Then, the total missing values in the weight column of surveys_combined_year data frame grouped by species were calculated. Missing values in the weight column were replaced with the mean values of each species. Imputed data was saved as surveys_weight_imputed.

# Task 8 
surveys_combined$year <- factor(surveys_combined$year) # Factorizing the "year" column
surveys_combined_year <- surveys_combined %>% filter(year == "1989") # Filtering the year
head(surveys_combined_year)

kable(surveys_combined_year %>% group_by(species_id) %>% summarise(Missing_values = sum(is.na(weight)))) %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left") # Total missing values in weight column grouped by species_id

species_id Missing_values
AB 31
AH 30
BA 0
CB 1
CS 1
DM 6
DO 2
DS 4
NL 6
OL 0
OT 1
OX 0
PC 1
PE 6
PF 0
PH 0
PM 0
PP 1
RF 0
RM 9
SA 3
SC 1
SF 0
SH 0
SS 18
UP 2
NA 7


weight_mean_by_species <- aggregate(surveys_combined_year[,c("weight")], by = list(surveys_combined_year$species_id), FUN = mean, na.rm=TRUE)
weight_mean_by_species <- weight_mean_by_species %>% rename(species_id = Group.1)# A new dataframe with average weights by species id for the year 1989
weight_mean_by_species %>% kable() %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left")

species_id weight
AB NaN
AH NaN
BA 7.000000
CB NaN
CS NaN
DM 44.349206
DO 51.025641
DS 121.896552
NL 151.278688
OL 31.947368
OT 24.663366
OX 20.000000
PC NaN
PE 21.935672
PF 7.230769
PH 30.500000
PM 20.222222
PP 17.409091
RF 13.346939
RM 10.411311
SA NaN
SC NaN
SF 54.800000
SH 76.345455
SS NaN
UP NaN


surveys_weight_imputed <- surveys_combined_year %>% group_by(species_id) %>% mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = TRUE), weight)) # Mean by species type imputation for weight column

#surveys_weight_imputed <- left_join(surveys_combined_year, weight_mean_by_species, by= "species_id") %>% mutate(weight = ifelse(is.na(weight.x), weight.y, weight.x)) %>% select(-weight.y, -weight.x) # adding weight_mean_by_species dataset to surveys_combined_year dataset to fill the missing values in the weight column
head(surveys_weight_imputed)
View(surveys_weight_imputed)

Checking Imputation As we have replaced the missing values in the “weight” column with the mean values by species type, the count of total NA values in surveys_weight_imputed df’s weight column should be less than that value of surveys_combined_year. However, it can be noticed that the NA count is non zero in the weight column of surveys_weight_imputed data frame. This issue will be discussed in the next task.

sum(is.na(surveys_combined_year$weight))
[1] 130
sum(is.na(surveys_weight_imputed$weight))
[1] 95

Task 9: Special Values

Inspecting the weight column in surveys_weight_imputed dataframe for any special values (i.e., NaN, Inf, -Inf).

# Task 9

cat("missing value count in weights of surveys_combined_year: ", sum(is.na(surveys_combined_year$weight)))
missing value count in weights of surveys_combined_year:  130
cat("\nSpecial value count in weights of surveys_weight_imputed: ",sum(is.nan(surveys_weight_imputed$weight)))

Special value count in weights of surveys_weight_imputed:  95

It can be noticed that some of the missing values are recorded as special values, “NAN” which stands for “Not A Number”" and is a logical vector of a length 1 and applies to numerical values. In part 8 when we calculated the mean using aggrigate() function by passing an argument na.rm=TRUE, which ignores missing values. There are groups in surveys_combined_year in which all values in weight column are missing values. The groups in our case are (“AB”, “AH”,“CB”,“CS”). When applying the function mean(….,na.rm=TRUE) on the fore mentioned species_id’s weight column, as all values are NA’s, it will select zero values and divide it by total values selected(0), which will return “NAN”.

Task 10: Outliers

In the surveys_combined data frame, the variable hindfoot length was inspected for possible univariate outliers. Different approaches can be used to detect univariate outliers such as; - Tukey’s method of outlier detection - Box plots - z-score method

First a boxplot was constructed to visually identify whether there are possible outliers in the data set. Very few number of outliers were visible in the boxplot. In addition, the boxplot depicted a skewness in the distribution of the hindfoot_length variable, which does not support the normality assumption. And hence for the outlier detection task, the two non parametric methods, Tukey’s method and boxplot were used as they don’t require any assumptions about the distribution.

OutVals = boxplot(surveys_combined$hindfoot_length,main="Boxplot for hindfoot_length", ylab="Hindfoot Length", col ="seagreen")$out

which(surveys_combined$hindfoot_length %in% OutVals)
[1] 10574 30425
Boxplot(surveys_combined$hindfoot_length,id=TRUE, ylab="Hindfoot Length", col ="seagreen")#boxplot with location of the outliers
[1] 10574 30425

It is visible from the Boxplot that 2 observations are outside of the top wisker which denotes (Q3+1.5xIQR). Hence, qualified to be considered as outliers according to Tukey’s Method.

Handeling Outliers

Only two outliers were detected and it is a very small amount when compared to the complete dataset surveys_combined which has over 31,000 observations. Thus the effect of the outlier on the data distribution parameters will be negligible. Hence, excluding outliers will be the best suited approach to handle outliers in dataset surveys_combined.

# Task 10. 

summary(surveys_combined$hindfoot_length) # summary statistics before removing outliers
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2.00   21.00   32.00   29.29   36.00   70.00    4111 
surveys_combined <- surveys_combined[-c(10574,30425),] #Excluding outliers(rows) from the dataframe
summary(surveys_combined$hindfoot_length) # summary statistics after removing outliers
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2.00   21.00   32.00   29.29   36.00   58.00    4111 
surveys_combined$hindfoot_length %>%  boxplot(main="Box Plot of Hindfoot Length", ylab="Hindfoot Length", col = "lightblue") # checking the imputed dataset



---
title: "MATH2349 Semester 2, 2019"
author: "Bodiyabaduge Dewsri Lalithi Perera(S3762890) & Ayush Ranjan(S3802541)"
subtitle: Assignment 2
output:
  html_notebook: default
---

## Setup

Loading the necessary packages to reproduce the report:

```{r, echo = TRUE, message=FALSE}
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(deductive)
library(validate)
library(outliers)
library(kableExtra)
library(car)
```


## Read WHO Data

Importing the WHO.csv dataset using readr.

```{r}
# This is an R chunk for reading the WHO data.
WHO <- read_csv("WHO.csv")
head(WHO)

```



## Tidy Task 1:

In WHO dataset, from 5th column to the last column represents values instead of variables, thus those columns were transformed from wide to long format by using gather() function.
```{r, echo=TRUE}
# Task 1
WHO_1 <- WHO %>% gather(code, value, 5:60) 
WHO_1
```



## Tidy Task 2:

Now, each value in the new column "code" containes separate pieces of information which are separated by "_" sign. Concatenated values in "code" column was stored separated separate() function and stored in three new columns("New", "CaseType", "sex_age").
The column "sex_age" contains combined values of "sex" and "age", separate() function was used again to split these values into gender and age group.

```{r}
# Task 2
WHO_1 <- WHO_1 %>% separate(code , into= c("New", "CaseType", "sex_age"), sep = "_")
WHO_1 <- separate(WHO_1, 'sex_age', into = c("Sex","Age"), sep = 1) 
WHO_1
```



## Tidy Task 3:

The “rel”, “ep”, “sn”, and “sp” keys need to be in their own columns as we will treat each of these as a separate variable.
The column "CaseType" contains these keys.To decompose CaseType in various columns spread() function is used.

```{r}
# Task 3
WHO_1 <- WHO_1 %>% spread('CaseType', value)
WHO_1
```



## Tidy Task 4: 

To achieve final Tidy format, the two categorical variables "Sex" and "Age", needed to be converted as factors.
Apporipiate steps are performed to achieve this using mutate() function (as mentioned in the requirements).

```{r}
# Task 4

#Factorizing and labeling age column
WHO_1 <- WHO_1 %>% mutate(Age = factor(Age, levels = c("014","1524","2534","3544","4554","5564","65"), labels = c("<15","15-24","25-34","35-44","45-54","55-64","65>="), ordered = TRUE))

#Factorizing gender column
WHO_1 <- WHO_1 %>% mutate(Sex = factor(Sex))
WHO_1
class(WHO_1$Sex)
class(WHO_1$Age)
```



## Task 5: Filter & Select

The columns `iso2` and `New` are redundant and hence those were dropped using select() function. Three countries were selected using filter() function.

```{r}

# Task 5 
WHO_subset <- WHO_1 %>% select( -(iso2), -(New)) %>% filter( country %in% c("Sri Lanka", "Romania", "Portugal"))
head(WHO_subset)

```
Dimension of the subsetted dataset.
```{r}
dim(WHO_subset)

```



## Read Species and Surveys data sets

Importing the **Species** and **Surveys** data sets using readr.

```{r}

# Reading the Species and Surveys data sets.
species <- read_csv("species.csv")
head(species)

surveys <- read_csv("surveys.csv")
head(surveys)


```



## Task 6: Join  

To combine `surveys` and `species` data frames, left_join() function was used with the key variable `species_id`. The combined data frame was renamed as `surveys_combined`.


```{r}
# Task 6.
surveys_combined <- left_join(surveys , species, by="species_id")
head(surveys_combined)
str(surveys_combined)

```



## Task 7: Calculate 

The variable "month" of "surveys_combined" is recorded as a numerical variable, to get monthly average, "month" column was converted as a factor. Then the columns "month", "weight" and "hindfoot_length", fitered by the **species_id="NL"**, was stored in a new data frame "NL". 
Aggregate function was used to compute monthly averages of weight and hindfoot length in a new data frame.


```{r}

# Task 7.
surveys_combined$month <- factor(surveys_combined$month) # Factorizing the "month" column
#str(surveys_combined$month)
NL <- surveys_combined %>% filter(species_id == "NL") #Filtering the dataset by species id = "NL"

```


The **NL** dataframe is derived according to the specifications. Monthly averages of weight and hindfoot length variables will be calculated from the following code.


```{r}

NL <- NL %>%select(month, weight, hindfoot_length) # Selecting the columns month, weight and hindfoot_length
# Storing the monthly averages of weight and hindfoot length in a new data frame
NL_averages <- aggregate(NL[,c("weight","hindfoot_length")], by = list(NL$month), FUN = mean, na.rm=TRUE)
NL_averages <- NL_averages %>% rename(Month = Group.1, Average_weight = weight, Average_hindfoot_length = hindfoot_length)
NL_averages %>% kable() %>% kable_styling(bootstrap_options = "condensed")#Output the results
```




## Task 8: Missing Values

The dataframe `surveys_combined` was subsetted by selecting records only for the year 1989 and renamed as `surveys_combined_year`.
Then, the total missing values in the `weight` column of `surveys_combined_year` data frame grouped by `species` were calculated. Missing values in the `weight` column were replaced with the mean values of each species. Imputed data was saved as `surveys_weight_imputed`. 

```{r}
# Task 8 
surveys_combined$year <- factor(surveys_combined$year) # Factorizing the "year" column
surveys_combined_year <- surveys_combined %>% filter(year == "1989") # Filtering the year
head(surveys_combined_year)

kable(surveys_combined_year %>% group_by(species_id) %>% summarise(Missing_values = sum(is.na(weight)))) %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left") # Total missing values in weight column grouped by species_id

weight_mean_by_species <- aggregate(surveys_combined_year[,c("weight")], by = list(surveys_combined_year$species_id), FUN = mean, na.rm=TRUE)
weight_mean_by_species <- weight_mean_by_species %>% rename(species_id = Group.1)# A new dataframe with average weights by species id for the year 1989
weight_mean_by_species %>% kable() %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left")

surveys_weight_imputed <- surveys_combined_year %>% group_by(species_id) %>% mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = TRUE), weight)) # Mean by species type imputation for weight column

#surveys_weight_imputed <- left_join(surveys_combined_year, weight_mean_by_species, by= "species_id") %>% mutate(weight = ifelse(is.na(weight.x), weight.y, weight.x)) %>% select(-weight.y, -weight.x) # adding weight_mean_by_species dataset to surveys_combined_year dataset to fill the missing values in the weight column
head(surveys_weight_imputed)
View(surveys_weight_imputed)
```

**Checking Imputation**
As we have replaced the missing values in the “weight” column with the mean values by species type, the count of total NA values in **surveys_weight_imputed** df's weight column should be less than that value of **surveys_combined_year**. However, it can be noticed that the NA count is non zero in the weight column of **surveys_weight_imputed** data frame. This issue will be discussed in the next task.

```{r}
sum(is.na(surveys_combined_year$weight))
sum(is.na(surveys_weight_imputed$weight))
```



## Task 9: Special Values

Inspecting the `weight` column in `surveys_weight_imputed` dataframe for any special values (i.e., NaN, Inf, -Inf). 

```{r}
# Task 9

cat("missing value count in weights of surveys_combined_year: ", sum(is.na(surveys_combined_year$weight)))

cat("\nSpecial value count in weights of surveys_weight_imputed: ",sum(is.nan(surveys_weight_imputed$weight)))

```
It can be noticed that some of the missing values are recorded as special values, "NAN" which stands for "Not A Number"" and is a logical vector of a length 1 and applies to numerical values. In part 8 when we calculated the mean using aggrigate() function by passing an argument na.rm=TRUE, which ignores missing values. There are groups in surveys_combined_year in which all values in weight column are missing values. The groups in our case are (“AB”, “AH”,“CB”,“CS”). When applying the function mean(....,na.rm=TRUE) on the fore mentioned species_id's weight column, as all values are NA's, it will select zero values and divide it by total values selected(0), which will return "NAN".


## Task 10: Outliers

In the `surveys_combined` data frame, the variable hindfoot length was inspected for possible univariate outliers. 
Different approaches can be used to detect univariate outliers such as;
 - Tukey’s method of outlier detection
 - Box plots
 - z-score method
 
First a boxplot was constructed to visually identify whether there are possible outliers in the data set. Very few number of outliers were visible in the boxplot. In addition, the boxplot depicted a skewness in the distribution of the hindfoot_length variable, which does not support the normality assumption. And hence for the outlier detection task, the two non parametric methods, Tukey's method and boxplot were used as they don't require any assumptions about the distribution.
```{r}
OutVals = boxplot(surveys_combined$hindfoot_length,main="Boxplot for hindfoot_length", ylab="Hindfoot Length", col ="seagreen")$out
which(surveys_combined$hindfoot_length %in% OutVals)
Boxplot(surveys_combined$hindfoot_length,id=TRUE, ylab="Hindfoot Length", col ="seagreen")#boxplot with location of the outliers

```

It is visible from the Boxplot that 2 observations are outside of the top wisker which denotes (Q3+1.5xIQR). Hence, qualified to be considered as **outliers** according to Tukey's Method. 

**Handeling Outliers**

Only two outliers were detected and it is a very small amount when compared to the complete dataset surveys_combined which has over 31,000 observations. Thus the effect of the outlier on the data distribution parameters will be negligible. Hence, excluding outliers will be the best suited approach to handle outliers in dataset surveys_combined.

```{r}
# Task 10. 

summary(surveys_combined$hindfoot_length) # summary statistics before removing outliers

surveys_combined <- surveys_combined[-c(10574,30425),] #Excluding outliers(rows) from the dataframe
summary(surveys_combined$hindfoot_length) # summary statistics after removing outliers

surveys_combined$hindfoot_length %>%  boxplot(main="Box Plot of Hindfoot Length", ylab="Hindfoot Length", col = "lightblue") # checking the imputed dataset

```




<br>
<br>
