Setup
Loading the necessary packages to reproduce the report:
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(deductive)
library(validate)
library(outliers)
library(kableExtra)
library(car)
Read WHO Data
Importing the WHO.csv dataset using readr.
# This is an R chunk for reading the WHO data.
WHO <- read_csv("WHO.csv")
Parsed with column specification:
cols(
.default = col_double(),
country = [31mcol_character()[39m,
iso2 = [31mcol_character()[39m,
iso3 = [31mcol_character()[39m
)
See spec(...) for full column specifications.
head(WHO)
NA
Tidy Task 1:
In WHO dataset, from 5th column to the last column represents values instead of variables, thus those columns were transformed from wide to long format by using gather() function.
# Task 1
WHO_1 <- WHO %>% gather(code, value, 5:60)
WHO_1
Tidy Task 2:
Now, each value in the new column “code” containes separate pieces of information which are separated by "_" sign. Concatenated values in “code” column was stored separated separate() function and stored in three new columns(“New”, “CaseType”, “sex_age”). The column “sex_age” contains combined values of “sex” and “age”, separate() function was used again to split these values into gender and age group.
# Task 2
WHO_1 <- WHO_1 %>% separate(code , into= c("New", "CaseType", "sex_age"), sep = "_")
WHO_1 <- separate(WHO_1, 'sex_age', into = c("Sex","Age"), sep = 1)
WHO_1
Tidy Task 3:
The “rel”, “ep”, “sn”, and “sp” keys need to be in their own columns as we will treat each of these as a separate variable. The column “CaseType” contains these keys.To decompose CaseType in various columns spread() function is used.
# Task 3
WHO_1 <- WHO_1 %>% spread('CaseType', value)
Tidy Task 4:
To achieve final Tidy format, the two categorical variables “Sex” and “Age”, needed to be converted as factors. Apporipiate steps are performed to achieve this using mutate() function (as mentioned in the requirements).
# Task 4
#Factorizing and labeling age column
WHO_1 <- WHO_1 %>% mutate(Age = factor(Age, levels = c("014","1524","2534","3544","4554","5564","65"), labels = c("<15","15-24","25-34","35-44","45-54","55-64","65>="), ordered = TRUE))
#Factorizing gender column
WHO_1 <- WHO_1 %>% mutate(Sex = factor(Sex))
WHO_1
class(WHO_1$Sex)
[1] "factor"
class(WHO_1$Age)
[1] "ordered" "factor"
Task 5: Filter & Select
The columns iso2 and New are redundant and hence those were dropped using select() function. Three countries were selected using filter() function.
# Task 5
WHO_subset <- WHO_1 %>% select( -(iso2), -(New)) %>% filter( country %in% c("Sri Lanka", "Romania", "Portugal"))
head(WHO_subset)
NA
Dimension of the subsetted dataset.
dim(WHO_subset)
[1] 1428 9
Read Species and Surveys data sets
Importing the Species and Surveys data sets using readr.
# Reading the Species and Surveys data sets.
species <- read_csv("species.csv")
Parsed with column specification:
cols(
species_id = [31mcol_character()[39m,
genus = [31mcol_character()[39m,
species = [31mcol_character()[39m,
taxa = [31mcol_character()[39m
)
head(species)
surveys <- read_csv("surveys.csv")
Parsed with column specification:
cols(
record_id = [32mcol_double()[39m,
month = [32mcol_double()[39m,
day = [32mcol_double()[39m,
year = [32mcol_double()[39m,
species_id = [31mcol_character()[39m,
sex = [31mcol_character()[39m,
hindfoot_length = [32mcol_double()[39m,
weight = [32mcol_double()[39m
)
head(surveys)
NA
NA
Task 6: Join
To combine surveys and species data frames, left_join() function was used with the key variable species_id. The combined data frame was renamed as surveys_combined.
# Task 6.
surveys_combined <- left_join(surveys , species, by="species_id")
head(surveys_combined)
str(surveys_combined)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 35549 obs. of 11 variables:
$ record_id : num 1 2 3 4 5 6 7 8 9 10 ...
$ month : num 7 7 7 7 7 7 7 7 7 7 ...
$ day : num 16 16 16 16 16 16 16 16 16 16 ...
$ year : num 1977 1977 1977 1977 1977 ...
$ species_id : chr "NL" "NL" "DM" "DM" ...
$ sex : chr "M" "M" "F" "M" ...
$ hindfoot_length: num 32 33 37 36 35 14 NA 37 34 20 ...
$ weight : num NA NA NA NA NA NA NA NA NA NA ...
$ genus : chr "Neotoma" "Neotoma" "Dipodomys" "Dipodomys" ...
$ species : chr "albigula" "albigula" "merriami" "merriami" ...
$ taxa : chr "Rodent" "Rodent" "Rodent" "Rodent" ...
Task 7: Calculate
The variable “month” of “surveys_combined” is recorded as a numerical variable, to get monthly average, “month” column was converted as a factor. Then the columns “month”, “weight” and “hindfoot_length”, fitered by the species_id=“NL”, was stored in a new data frame “NL”. Aggregate function was used to compute monthly averages of weight and hindfoot length in a new data frame.
# Task 7.
surveys_combined$month <- factor(surveys_combined$month) # Factorizing the "month" column
#str(surveys_combined$month)
NL <- surveys_combined %>% filter(species_id == "NL") #Filtering the dataset by species id = "NL"
The NL dataframe is derived according to the specifications. Monthly averages of weight and hindfoot length variables will be calculated from the following code.
NL <- NL %>%select(month, weight, hindfoot_length) # Selecting the columns month, weight and hindfoot_length
# Storing the monthly averages of weight and hindfoot length in a new data frame
NL_averages <- aggregate(NL[,c("weight","hindfoot_length")], by = list(NL$month), FUN = mean, na.rm=TRUE)
NL_averages <- NL_averages %>% rename(Month = Group.1, Average_weight = weight, Average_hindfoot_length = hindfoot_length)
NL_averages %>% kable() %>% kable_styling(bootstrap_options = "condensed")#Output the results
| Month |
Average_weight |
Average_hindfoot_length |
| 1 |
179.3443 |
32.54098 |
| 2 |
181.3818 |
32.82353 |
| 3 |
177.4516 |
32.75862 |
| 4 |
153.0690 |
32.02439 |
| 5 |
142.7536 |
31.60000 |
| 6 |
143.7879 |
32.18889 |
| 7 |
141.7415 |
32.35398 |
| 8 |
152.5100 |
32.07143 |
| 9 |
164.9920 |
32.50427 |
| 10 |
169.1364 |
32.43119 |
| 11 |
170.6373 |
32.59574 |
| 12 |
175.1515 |
32.34848 |
Task 8: Missing Values
The dataframe surveys_combined was subsetted by selecting records only for the year 1989 and renamed as surveys_combined_year. Then, the total missing values in the weight column of surveys_combined_year data frame grouped by species were calculated. Missing values in the weight column were replaced with the mean values of each species. Imputed data was saved as surveys_weight_imputed.
# Task 8
surveys_combined$year <- factor(surveys_combined$year) # Factorizing the "year" column
surveys_combined_year <- surveys_combined %>% filter(year == "1989") # Filtering the year
head(surveys_combined_year)
kable(surveys_combined_year %>% group_by(species_id) %>% summarise(Missing_values = sum(is.na(weight)))) %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left") # Total missing values in weight column grouped by species_id
| species_id |
Missing_values |
| AB |
31 |
| AH |
30 |
| BA |
0 |
| CB |
1 |
| CS |
1 |
| DM |
6 |
| DO |
2 |
| DS |
4 |
| NL |
6 |
| OL |
0 |
| OT |
1 |
| OX |
0 |
| PC |
1 |
| PE |
6 |
| PF |
0 |
| PH |
0 |
| PM |
0 |
| PP |
1 |
| RF |
0 |
| RM |
9 |
| SA |
3 |
| SC |
1 |
| SF |
0 |
| SH |
0 |
| SS |
18 |
| UP |
2 |
| NA |
7 |
weight_mean_by_species <- aggregate(surveys_combined_year[,c("weight")], by = list(surveys_combined_year$species_id), FUN = mean, na.rm=TRUE)
weight_mean_by_species <- weight_mean_by_species %>% rename(species_id = Group.1)# A new dataframe with average weights by species id for the year 1989
weight_mean_by_species %>% kable() %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left")
| species_id |
weight |
| AB |
NaN |
| AH |
NaN |
| BA |
7.000000 |
| CB |
NaN |
| CS |
NaN |
| DM |
44.349206 |
| DO |
51.025641 |
| DS |
121.896552 |
| NL |
151.278688 |
| OL |
31.947368 |
| OT |
24.663366 |
| OX |
20.000000 |
| PC |
NaN |
| PE |
21.935672 |
| PF |
7.230769 |
| PH |
30.500000 |
| PM |
20.222222 |
| PP |
17.409091 |
| RF |
13.346939 |
| RM |
10.411311 |
| SA |
NaN |
| SC |
NaN |
| SF |
54.800000 |
| SH |
76.345455 |
| SS |
NaN |
| UP |
NaN |
surveys_weight_imputed <- surveys_combined_year %>% group_by(species_id) %>% mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = TRUE), weight)) # Mean by species type imputation for weight column
#surveys_weight_imputed <- left_join(surveys_combined_year, weight_mean_by_species, by= "species_id") %>% mutate(weight = ifelse(is.na(weight.x), weight.y, weight.x)) %>% select(-weight.y, -weight.x) # adding weight_mean_by_species dataset to surveys_combined_year dataset to fill the missing values in the weight column
head(surveys_weight_imputed)
View(surveys_weight_imputed)
Checking Imputation As we have replaced the missing values in the “weight” column with the mean values by species type, the count of total NA values in surveys_weight_imputed df’s weight column should be less than that value of surveys_combined_year. However, it can be noticed that the NA count is non zero in the weight column of surveys_weight_imputed data frame. This issue will be discussed in the next task.
sum(is.na(surveys_combined_year$weight))
[1] 130
sum(is.na(surveys_weight_imputed$weight))
[1] 95
Task 9: Special Values
Inspecting the weight column in surveys_weight_imputed dataframe for any special values (i.e., NaN, Inf, -Inf).
# Task 9
cat("missing value count in weights of surveys_combined_year: ", sum(is.na(surveys_combined_year$weight)))
missing value count in weights of surveys_combined_year: 130
cat("\nSpecial value count in weights of surveys_weight_imputed: ",sum(is.nan(surveys_weight_imputed$weight)))
Special value count in weights of surveys_weight_imputed: 95
It can be noticed that some of the missing values are recorded as special values, “NAN” which stands for “Not A Number”" and is a logical vector of a length 1 and applies to numerical values. In part 8 when we calculated the mean using aggrigate() function by passing an argument na.rm=TRUE, which ignores missing values. There are groups in surveys_combined_year in which all values in weight column are missing values. The groups in our case are (“AB”, “AH”,“CB”,“CS”). When applying the function mean(….,na.rm=TRUE) on the fore mentioned species_id’s weight column, as all values are NA’s, it will select zero values and divide it by total values selected(0), which will return “NAN”.
Task 10: Outliers
In the surveys_combined data frame, the variable hindfoot length was inspected for possible univariate outliers. Different approaches can be used to detect univariate outliers such as; - Tukey’s method of outlier detection - Box plots - z-score method
First a boxplot was constructed to visually identify whether there are possible outliers in the data set. Very few number of outliers were visible in the boxplot. In addition, the boxplot depicted a skewness in the distribution of the hindfoot_length variable, which does not support the normality assumption. And hence for the outlier detection task, the two non parametric methods, Tukey’s method and boxplot were used as they don’t require any assumptions about the distribution.
OutVals = boxplot(surveys_combined$hindfoot_length,main="Boxplot for hindfoot_length", ylab="Hindfoot Length", col ="seagreen")$out

which(surveys_combined$hindfoot_length %in% OutVals)
[1] 10574 30425
Boxplot(surveys_combined$hindfoot_length,id=TRUE, ylab="Hindfoot Length", col ="seagreen")#boxplot with location of the outliers
[1] 10574 30425

It is visible from the Boxplot that 2 observations are outside of the top wisker which denotes (Q3+1.5xIQR). Hence, qualified to be considered as outliers according to Tukey’s Method.
Handeling Outliers
Only two outliers were detected and it is a very small amount when compared to the complete dataset surveys_combined which has over 31,000 observations. Thus the effect of the outlier on the data distribution parameters will be negligible. Hence, excluding outliers will be the best suited approach to handle outliers in dataset surveys_combined.
# Task 10.
summary(surveys_combined$hindfoot_length) # summary statistics before removing outliers
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2.00 21.00 32.00 29.29 36.00 70.00 4111
surveys_combined <- surveys_combined[-c(10574,30425),] #Excluding outliers(rows) from the dataframe
summary(surveys_combined$hindfoot_length) # summary statistics after removing outliers
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2.00 21.00 32.00 29.29 36.00 58.00 4111
surveys_combined$hindfoot_length %>% boxplot(main="Box Plot of Hindfoot Length", ylab="Hindfoot Length", col = "lightblue") # checking the imputed dataset

---
title: "MATH2349 Semester 2, 2019"
author: "Bodiyabaduge Dewsri Lalithi Perera(S3762890) & Ayush Ranjan(S3802541)"
subtitle: Assignment 2
output:
  html_notebook: default
---

## Setup

Loading the necessary packages to reproduce the report:

```{r, echo = TRUE, message=FALSE}
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(deductive)
library(validate)
library(outliers)
library(kableExtra)
library(car)
```


## Read WHO Data

Importing the WHO.csv dataset using readr.

```{r}
# This is an R chunk for reading the WHO data.
WHO <- read_csv("WHO.csv")
head(WHO)

```



## Tidy Task 1:

In WHO dataset, from 5th column to the last column represents values instead of variables, thus those columns were transformed from wide to long format by using gather() function.
```{r, echo=TRUE}
# Task 1
WHO_1 <- WHO %>% gather(code, value, 5:60) 
WHO_1
```



## Tidy Task 2:

Now, each value in the new column "code" containes separate pieces of information which are separated by "_" sign. Concatenated values in "code" column was stored separated separate() function and stored in three new columns("New", "CaseType", "sex_age").
The column "sex_age" contains combined values of "sex" and "age", separate() function was used again to split these values into gender and age group.

```{r}
# Task 2
WHO_1 <- WHO_1 %>% separate(code , into= c("New", "CaseType", "sex_age"), sep = "_")
WHO_1 <- separate(WHO_1, 'sex_age', into = c("Sex","Age"), sep = 1) 
WHO_1
```



## Tidy Task 3:

The “rel”, “ep”, “sn”, and “sp” keys need to be in their own columns as we will treat each of these as a separate variable.
The column "CaseType" contains these keys.To decompose CaseType in various columns spread() function is used.

```{r}
# Task 3
WHO_1 <- WHO_1 %>% spread('CaseType', value)
WHO_1
```



## Tidy Task 4: 

To achieve final Tidy format, the two categorical variables "Sex" and "Age", needed to be converted as factors.
Apporipiate steps are performed to achieve this using mutate() function (as mentioned in the requirements).

```{r}
# Task 4

#Factorizing and labeling age column
WHO_1 <- WHO_1 %>% mutate(Age = factor(Age, levels = c("014","1524","2534","3544","4554","5564","65"), labels = c("<15","15-24","25-34","35-44","45-54","55-64","65>="), ordered = TRUE))

#Factorizing gender column
WHO_1 <- WHO_1 %>% mutate(Sex = factor(Sex))
WHO_1
class(WHO_1$Sex)
class(WHO_1$Age)
```



## Task 5: Filter & Select

The columns `iso2` and `New` are redundant and hence those were dropped using select() function. Three countries were selected using filter() function.

```{r}

# Task 5 
WHO_subset <- WHO_1 %>% select( -(iso2), -(New)) %>% filter( country %in% c("Sri Lanka", "Romania", "Portugal"))
head(WHO_subset)

```
Dimension of the subsetted dataset.
```{r}
dim(WHO_subset)

```



## Read Species and Surveys data sets

Importing the **Species** and **Surveys** data sets using readr.

```{r}

# Reading the Species and Surveys data sets.
species <- read_csv("species.csv")
head(species)

surveys <- read_csv("surveys.csv")
head(surveys)


```



## Task 6: Join  

To combine `surveys` and `species` data frames, left_join() function was used with the key variable `species_id`. The combined data frame was renamed as `surveys_combined`.


```{r}
# Task 6.
surveys_combined <- left_join(surveys , species, by="species_id")
head(surveys_combined)
str(surveys_combined)

```



## Task 7: Calculate 

The variable "month" of "surveys_combined" is recorded as a numerical variable, to get monthly average, "month" column was converted as a factor. Then the columns "month", "weight" and "hindfoot_length", fitered by the **species_id="NL"**, was stored in a new data frame "NL". 
Aggregate function was used to compute monthly averages of weight and hindfoot length in a new data frame.


```{r}

# Task 7.
surveys_combined$month <- factor(surveys_combined$month) # Factorizing the "month" column
#str(surveys_combined$month)
NL <- surveys_combined %>% filter(species_id == "NL") #Filtering the dataset by species id = "NL"

```


The **NL** dataframe is derived according to the specifications. Monthly averages of weight and hindfoot length variables will be calculated from the following code.


```{r}

NL <- NL %>%select(month, weight, hindfoot_length) # Selecting the columns month, weight and hindfoot_length
# Storing the monthly averages of weight and hindfoot length in a new data frame
NL_averages <- aggregate(NL[,c("weight","hindfoot_length")], by = list(NL$month), FUN = mean, na.rm=TRUE)
NL_averages <- NL_averages %>% rename(Month = Group.1, Average_weight = weight, Average_hindfoot_length = hindfoot_length)
NL_averages %>% kable() %>% kable_styling(bootstrap_options = "condensed")#Output the results
```




## Task 8: Missing Values

The dataframe `surveys_combined` was subsetted by selecting records only for the year 1989 and renamed as `surveys_combined_year`.
Then, the total missing values in the `weight` column of `surveys_combined_year` data frame grouped by `species` were calculated. Missing values in the `weight` column were replaced with the mean values of each species. Imputed data was saved as `surveys_weight_imputed`. 

```{r}
# Task 8 
surveys_combined$year <- factor(surveys_combined$year) # Factorizing the "year" column
surveys_combined_year <- surveys_combined %>% filter(year == "1989") # Filtering the year
head(surveys_combined_year)

kable(surveys_combined_year %>% group_by(species_id) %>% summarise(Missing_values = sum(is.na(weight)))) %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left") # Total missing values in weight column grouped by species_id

weight_mean_by_species <- aggregate(surveys_combined_year[,c("weight")], by = list(surveys_combined_year$species_id), FUN = mean, na.rm=TRUE)
weight_mean_by_species <- weight_mean_by_species %>% rename(species_id = Group.1)# A new dataframe with average weights by species id for the year 1989
weight_mean_by_species %>% kable() %>% kable_styling(bootstrap_options = "condensed", full_width = FALSE, position = "left")

surveys_weight_imputed <- surveys_combined_year %>% group_by(species_id) %>% mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = TRUE), weight)) # Mean by species type imputation for weight column

#surveys_weight_imputed <- left_join(surveys_combined_year, weight_mean_by_species, by= "species_id") %>% mutate(weight = ifelse(is.na(weight.x), weight.y, weight.x)) %>% select(-weight.y, -weight.x) # adding weight_mean_by_species dataset to surveys_combined_year dataset to fill the missing values in the weight column
head(surveys_weight_imputed)
View(surveys_weight_imputed)
```

**Checking Imputation**
As we have replaced the missing values in the “weight” column with the mean values by species type, the count of total NA values in **surveys_weight_imputed** df's weight column should be less than that value of **surveys_combined_year**. However, it can be noticed that the NA count is non zero in the weight column of **surveys_weight_imputed** data frame. This issue will be discussed in the next task.

```{r}
sum(is.na(surveys_combined_year$weight))
sum(is.na(surveys_weight_imputed$weight))
```



## Task 9: Special Values

Inspecting the `weight` column in `surveys_weight_imputed` dataframe for any special values (i.e., NaN, Inf, -Inf). 

```{r}
# Task 9

cat("missing value count in weights of surveys_combined_year: ", sum(is.na(surveys_combined_year$weight)))

cat("\nSpecial value count in weights of surveys_weight_imputed: ",sum(is.nan(surveys_weight_imputed$weight)))

```
It can be noticed that some of the missing values are recorded as special values, "NAN" which stands for "Not A Number"" and is a logical vector of a length 1 and applies to numerical values. In part 8 when we calculated the mean using aggrigate() function by passing an argument na.rm=TRUE, which ignores missing values. There are groups in surveys_combined_year in which all values in weight column are missing values. The groups in our case are (“AB”, “AH”,“CB”,“CS”). When applying the function mean(....,na.rm=TRUE) on the fore mentioned species_id's weight column, as all values are NA's, it will select zero values and divide it by total values selected(0), which will return "NAN".


## Task 10: Outliers

In the `surveys_combined` data frame, the variable hindfoot length was inspected for possible univariate outliers. 
Different approaches can be used to detect univariate outliers such as;
 - Tukey’s method of outlier detection
 - Box plots
 - z-score method
 
First a boxplot was constructed to visually identify whether there are possible outliers in the data set. Very few number of outliers were visible in the boxplot. In addition, the boxplot depicted a skewness in the distribution of the hindfoot_length variable, which does not support the normality assumption. And hence for the outlier detection task, the two non parametric methods, Tukey's method and boxplot were used as they don't require any assumptions about the distribution.
```{r}
OutVals = boxplot(surveys_combined$hindfoot_length,main="Boxplot for hindfoot_length", ylab="Hindfoot Length", col ="seagreen")$out
which(surveys_combined$hindfoot_length %in% OutVals)
Boxplot(surveys_combined$hindfoot_length,id=TRUE, ylab="Hindfoot Length", col ="seagreen")#boxplot with location of the outliers

```

It is visible from the Boxplot that 2 observations are outside of the top wisker which denotes (Q3+1.5xIQR). Hence, qualified to be considered as **outliers** according to Tukey's Method. 

**Handeling Outliers**

Only two outliers were detected and it is a very small amount when compared to the complete dataset surveys_combined which has over 31,000 observations. Thus the effect of the outlier on the data distribution parameters will be negligible. Hence, excluding outliers will be the best suited approach to handle outliers in dataset surveys_combined.

```{r}
# Task 10. 

summary(surveys_combined$hindfoot_length) # summary statistics before removing outliers

surveys_combined <- surveys_combined[-c(10574,30425),] #Excluding outliers(rows) from the dataframe
summary(surveys_combined$hindfoot_length) # summary statistics after removing outliers

surveys_combined$hindfoot_length %>%  boxplot(main="Box Plot of Hindfoot Length", ylab="Hindfoot Length", col = "lightblue") # checking the imputed dataset

```




<br>
<br>
