Setup
Loading the packages required for the assignment
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(kableExtra)
library(knitr)
Read WHO Data
WHO data is read from WHO.csv using the read_csv() function of the readr package and stored as a data frame named who_data.
who_data <- read_csv("WHO.csv")
dim(who_data)
[1] 7240 60
Tidy Task 1:
Tidying data using gather():
The WHO data set is not in tidy format. It is in wide format. Here we use the tidyr function gather() to reshape the data set into long format.
who_data <- who_data %>%
gather(c(5:60), key = "code", value = "value" )
The data set after reshape:
who_data
Dimensions of the dataset:
dim(who_data)
[1] 405440 6
Tidy Task 2:
Separating the code column:
The code column contains the information of four different variables. This information can be split into 4 more columns. In this task we separate the code column into new, var, sex, and age columns.
after_tidy_2 <- who_data %>%
# separating the columns based on the separator '_'
separate(code, into = c("new", "var", "sex_age"), sep = c("_")) %>%
# sex and age information doesn't have a separator, this is separated after the first charachter (m/f)
separate(sex_age, into = c("sex", "age"), 1)
The dataset after separating the code column:
after_tidy_2
Dimensions of the dataset:
dim(after_tidy_2)
[1] 405440 9
Tidy Task 3:
Tidying var column using spread():
The var column contains four keys - rel, ep, sn, and sp. Each of these keys can be a separate variable (column). We use spread() to place these keys into their own variables with the values from the value variable.
after_tidy_3 <- after_tidy_2 %>% spread(key = var, value = value)
The dataset after spreading the var column:
after_tidy_3
Structure of the dataset:
str(after_tidy_3)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 101360 obs. of 11 variables:
$ country: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ iso2 : chr "AF" "AF" "AF" "AF" ...
$ iso3 : chr "AFG" "AFG" "AFG" "AFG" ...
$ year : num 1980 1981 1982 1983 1984 ...
$ new : chr "new" "new" "new" "new" ...
$ sex : chr "m" "m" "m" "m" ...
$ age : chr "014" "014" "014" "014" ...
$ ep : num NA NA NA NA NA NA NA NA NA NA ...
$ rel : num NA NA NA NA NA NA NA NA NA NA ...
$ sn : num NA NA NA NA NA NA NA NA NA NA ...
$ sp : num NA NA NA NA NA NA NA NA NA NA ...
Dimensions of the dataset:
dim(after_tidy_3)
[1] 101360 11
Tidy Task 4:
Factoring sex and age variables using mutate() and factor():
The sex and age variables are categorical and need to be factorised. We use mutate() to factorise the two variables. Here we create appropriate labels for each level in the age variable.
after_tidy_4 <- after_tidy_3 %>%
mutate(sex = factor(sex, levels = c("m","f")),
age = factor(age,
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="),
ordered = TRUE
)
)
Levels of the sex and age varaibles after factoring:
levels(after_tidy_4$sex)
[1] "m" "f"
levels(after_tidy_4$age)
[1] "<15" "15-24" "25-34" "35-44" "45-54" "55-64" "65>="
Structure of the sex and age variables after factoring:
str(after_tidy_4$sex)
Factor w/ 2 levels "m","f": 1 1 1 1 1 1 1 1 1 1 ...
str(after_tidy_4$age)
Ord.factor w/ 7 levels "<15"<"15-24"<..: 1 1 1 1 1 1 1 1 1 1 ...
The dataset after factoring the sex and age variables:
after_tidy_4
Dimensions of the dataset:
dim(after_tidy_4)
[1] 101360 11
Task 5: Filter & Select
Dropping the iso2 and new columns and filtering three countries:
The columns iso2 and new columns are redundant and hence are dropped. After this, the dataset is filtered to obtain data for three countries - India, Australia, United Arab Emirates. The filtered dataset is saved as WHO_subset.
WHO_subset <- after_tidy_4 %>%
select(-c("iso2", "new")) %>%
filter(country == "India" | country == "Australia" | country == "United Arab Emirates" )
The dataset after dropping and filtering:
WHO_subset
The following code shows that WHO_subset has the filtered data for only 3 countries:
unique(WHO_subset$country)
[1] "Australia" "India" "United Arab Emirates"
Dimensions of the dataset:
dim(WHO_subset)
[1] 1428 9
Read Species and Surveys data sets
The species and surveys data sets are read from species.csv and surveys.csv using the read_csv() function of the readr package and stored as data frames named species and surveys respectively.
species <- read_csv("species.csv")
surveys <- read_csv("surveys.csv")
Task 6: Join
Combining surveys and species data:
The datasets surveys and species are joined by the key variable species_id. We use left_join() function to add the species variables (genus,taxa,species) to surveys data and save this as a new dataframe surveys_combined.
surveys_combined <- left_join(surveys, species, by = "species_id")
surveys_combined
To compare, here are the Dimensions of the datasets:
dim(species)
[1] 54 4
dim(surveys)
[1] 35549 8
dim(surveys_combined)
[1] 35549 11
Task 8: Missing Values
Filtering the surveys_combined data:
We filter the surveys_combined data for the year 1997 and save it as the data set surveys_combined_year.
surveys_combined_year <- surveys_combined %>%
filter(year == 1997)
Counting the NA values in weight variable for each species:
The surveys_combined_year data is grouped by species and the number of missing values (NA) is calculated.
surveys_combined_year %>%
group_by(species) %>%
summarise(`NA Count` = sum(is.na(weight))) %>%
filter(`NA Count` != 0)
The output above shows the 13 species which have missing (NA) values, along with the count of the corresponding NA values. The 3 remaining species not displayed here are spilosoma - 12, torridus - 2 and NA - 1
Imputing the NA (missing) values:
Now we impute the NA values with the mean weight for each species using mutate(). We save the imputed data as surveys_weight_imputed.
# group the data by species and impute the NA values with the mean of each species
surveys_weight_imputed <- surveys_combined_year %>%
group_by(species) %>%
mutate(weight = ifelse(is.na(weight), mean(weight, na.rm=TRUE), weight))
The code above is written by referring to the code from Stackoverflow: Link to code
Checking if the imputation was successful in the surveys_weight_imputed data:
surveys_weight_imputed %>%
group_by(species) %>%
summarise(`NA Count` = sum(is.na(weight))) %>%
filter(`NA Count` != 0)
From the output above, we can observe that the surveys_weight_imputed still has few species with missing values. But we can note that some of the missing values have been imputed sucessfully. Of the 13 species with missing values, the imputation succesfully imputed the missing values for 8 species. The remaining 5 species still show missing values. We inspect and explain why this happens in the next task.
Task 9: Special Values
Checking for special values in weight column in surveys_weight_imputed:
sum(sapply(surveys_weight_imputed$weight, function(x) (is.infinite(x) | is.nan(x) )) )
[1] 54
Here are the number of special values before we imputed the surveys_combined_year data:
sum(sapply(surveys_combined_year$weight, function(x) (is.infinite(x) | is.nan(x) )) )
[1] 0
Observing the outputs above, we can conclude that the special values were generated as a result of the imputation. Also the function - is.na() counts NA as well as NaN values, which is why we got the count for NA values after imputation. In reality, we were getting the count of NaN values since all the NA values were imputed. We inspect the reason for the NaN values showing up in the data below.
Examining the data for a species that showed missing values after imputation:
filter(surveys_weight_imputed, species == "harrisi")
All the special values in the weight column for species = harrisi are NaN values. We get NaN values because for some species such as harrisi, the weight column has only NA values. When we calculate the mean for these species, to impute the NA values (as performed in task 8) we use the parameter na.rm = TRUE to ignore NA values when calculating the mean. For species such as harrisi, when calculating the mean, all the values are ignored restulting in NaN being returned as the mean. This caused NaN values to be imputed in the weight column.
Consider the following example to understand how the NaN values were created:
mean(c(NA, NA), na.rm = TRUE)
[1] NaN
When the mean is calculated for a vector with only NAs, na.rm = TRUE ignores all the NA values, hence resulting in the denominator of the mean to be zero, which evaluates to NaN.
Task 10: Outliers
---
title: "MATH2349 Semester 2, 2019"
author:
- Maaz Shaikh - S3795603
- Vaishnavi Narayana Naik - S3797442
subtitle: Assignment 2
output:
  html_notebook: default
---

<hr>

## Setup

##### Loading the packages required for the assignment

```{r , echo = TRUE, message=FALSE}

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(kableExtra)
library(knitr)
```

## Read WHO Data

WHO data is read from `WHO.csv` using the `read_csv()` function of the `readr` package and stored as a data frame named `who_data`.

```{r, echo = TRUE, message = FALSE}

who_data <- read_csv("WHO.csv")
dim(who_data)

```


## Tidy Task 1:

#### Tidying data using `gather()`:
The WHO data set is not in tidy format. It is in wide format. Here we use the `tidyr` function `gather()` to reshape the data set into long format.
```{r, echo=TRUE}
who_data <- who_data %>%
  gather(c(5:60), key = "code", value = "value" )
```

<br>

##### __The data set after reshape:__
```{r}
who_data
```

<br>

##### __Dimensions of the dataset:__
```{r}
dim(who_data)

```

<hr>

## Tidy Task 2:

#### Separating the `code` column: 

The `code` column contains the information of four different variables. This information can be split into 4 more columns.
In this task we separate the `code` column into `new`, `var`, `sex`, and `age` columns.

```{r}
after_tidy_2 <- who_data %>%
  # separating the columns based on the separator '_'
  separate(code, into = c("new", "var", "sex_age"), sep = c("_")) %>% 
    # sex and age information doesn't have a separator, this is separated after the first charachter (m/f)
    separate(sex_age, into = c("sex", "age"), 1)
```

<br>

##### The dataset after separating the `code` column:
```{r}
after_tidy_2
```

<br>

##### __Dimensions of the dataset:__
```{r}
dim(after_tidy_2)
```

<hr>

## Tidy Task 3:

#### Tidying `var` column using `spread()`:
The `var` column contains four keys - `rel`, `ep`, `sn`, and `sp`. Each of these keys can be a separate variable (column). We use `spread()` to place these keys into their own variables with the values from the `value` variable. 
```{r}
after_tidy_3 <- after_tidy_2 %>% spread(key = var, value = value)
```

<br>

##### __The dataset after spreading the `var` column:__
```{r}
after_tidy_3
```

<br>

##### __Structure of the dataset:__
```{r}
str(after_tidy_3)
```

<br>

##### __Dimensions of the dataset:__
```{r}
dim(after_tidy_3)
```

<hr>

## Tidy Task 4: 

#### Factoring `sex` and `age` variables using `mutate()` and `factor()`:
The `sex` and `age` variables are categorical and need to be factorised. We use `mutate()` to factorise the two variables. Here we create appropriate labels for each level in the age variable. 
```{r}
after_tidy_4 <- after_tidy_3 %>%
  mutate(sex = factor(sex, levels = c("m","f")),
         age = factor(age, 
                      levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
                      labels = c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>="),
                      ordered = TRUE
                      )
         )
```

<br>

##### __Levels of the `sex` and `age` varaibles after factoring:__
```{r}
levels(after_tidy_4$sex)
levels(after_tidy_4$age)
```

<br>

##### __Structure of the `sex` and `age` variables after factoring:__
```{r}
str(after_tidy_4$sex)
str(after_tidy_4$age)
```

<br>

##### __The dataset after factoring the `sex` and `age` variables:__
```{r}
after_tidy_4
```

<br>

##### __Dimensions of the dataset:__
```{r}
dim(after_tidy_4)
```

<hr>

## Task 5: Filter & Select

#### Dropping the `iso2` and `new` columns and filtering three countries:
The columns `iso2` and `new` columns are redundant and hence are dropped. After this, the dataset is filtered to obtain data for three countries - `India`, `Australia`, `United Arab Emirates`. The filtered dataset is saved as `WHO_subset`.
```{r}
WHO_subset <- after_tidy_4 %>%
  select(-c("iso2", "new")) %>%
  filter(country == "India" | country == "Australia" | country == "United Arab Emirates" )
```

<br>

##### __The dataset after dropping and filtering:__
```{r}
WHO_subset
```

<br>

##### The following code shows that `WHO_subset` has the filtered data for only 3 countries:
```{r}
unique(WHO_subset$country)
```

<br>

##### __Dimensions of the dataset:__
```{r}
dim(WHO_subset)
```

<hr>

## Read Species and Surveys data sets

The `species` and `surveys` data sets are read from `species.csv` and `surveys.csv` using the `read_csv()` function of the `readr` package and stored as data frames named `species` and `surveys` respectively.

```{r, echo=TRUE, message=FALSE}
species <- read_csv("species.csv")
surveys <- read_csv("surveys.csv")
```

<hr>

## Task 6: Join  

#### Combining `surveys` and `species` data:

The datasets `surveys` and `species` are joined by the key variable `species_id`. We use `left_join()` function to add the species variables (`genus`,`taxa`,`species`) to `surveys` data and save this as a new dataframe `surveys_combined`.

```{r}
surveys_combined <- left_join(surveys, species, by = "species_id")
surveys_combined
```

<br>

##### __To compare, here are the Dimensions of the datasets:__

```{r}
dim(species)
dim(surveys)
dim(surveys_combined)

```

<hr>

## Task 7: Calculate 

#### Calculating average `weight` and `hindfoot length` for one species - `albigula`:

We first filter the `surveys_combined` data set for the species _'albigula'_ and group it by `month`.
```{r}
albigula_data <- surveys_combined %>%
  filter(species == "albigula") %>%
  group_by(month)
albigula_data
```

<br>

We then calclate the average of both weight and hindfoot length for the species _'albigula'_:
```{r}
 albigula_data %>%
  summarise(`Average Weight` = mean(weight, na.rm = TRUE),
            `Average Hindfoot Length` = mean(hindfoot_length, na.rm = TRUE)
  )
```

<hr>

## Task 8: Missing Values

#### Filtering the `surveys_combined` data:

We filter the `surveys_combined` data for the year `1997` and save it as the data set `surveys_combined_year`.
```{r}
surveys_combined_year  <- surveys_combined %>% 
  filter(year == 1997)
```

<br>

#### Counting the `NA` values in `weight` variable for each species:

The `surveys_combined_year` data is grouped by species and the number of missing values (`NA`) is calculated.
```{r}
surveys_combined_year %>% 
  group_by(species) %>% 
  summarise(`NA Count` = sum(is.na(weight))) %>% 
  filter(`NA Count` != 0)
```
The output above shows the `13` species which have missing (`NA`) values, along with the count of the corresponding `NA` values.  The 3 remaining species not displayed here are `spilosoma`	- 12, `torridus` - 2	and `NA` -	1	

<br>

#### Imputing the `NA` (missing) values:

Now we impute the `NA` values with the mean weight for each species using `mutate()`. We save the imputed data as `surveys_weight_imputed`.
```{r}
# group the data by species and impute the NA values with the mean of each species
surveys_weight_imputed <- surveys_combined_year %>% 
  group_by(species) %>%
  mutate(weight = ifelse(is.na(weight), mean(weight, na.rm=TRUE), weight))
```
_The code above is written by referring to the code from Stackoverflow: [Link to code](https://stackoverflow.com/questions/55345593/impute-missing-data-with-mean-by-group)_

<br>

##### Checking if the imputation was successful in the `surveys_weight_imputed` data:
```{r}
surveys_weight_imputed %>% 
  group_by(species) %>% 
  summarise(`NA Count` = sum(is.na(weight))) %>% 
  filter(`NA Count` != 0)
```

From the output above, we can observe that the `surveys_weight_imputed` still has few species with missing values. But  we can note that some of the missing values have been imputed sucessfully. Of the `13` species with missing values, the imputation succesfully imputed the missing values for `8` species. The remaining `5` species still show missing values. We inspect and explain why this happens in the next task.

<hr>

## Task 9: Special Values

#### Checking for special values in `weight` column in `surveys_weight_imputed`:
```{r}
sum(sapply(surveys_weight_imputed$weight, function(x) (is.infinite(x) | is.nan(x) )) )
```

##### Here are the number of special values before we imputed the `surveys_combined_year` data:
```{r}
sum(sapply(surveys_combined_year$weight, function(x) (is.infinite(x) | is.nan(x) )) )
```

Observing the outputs above, we can conclude that the special values were generated as a result of the imputation. Also the function - `is.na()` counts `NA` as well as `NaN` values, which is why we got the count for `NA` values after imputation. In reality, we were getting the count of `NaN` values since all the `NA` values were imputed. We inspect the reason for the `NaN` values showing up in the data below.

<br>

#### Examining the data for a species that showed missing values after imputation:
```{r}
filter(surveys_weight_imputed, species == "harrisi")
```

All the special values in the weight column for _species = `harrisi`_ are `NaN` values. We get `NaN` values because for some species such as `harrisi`, the weight column has only NA values. When we calculate the mean for these species, to impute the NA values (as performed in task 8) we use the parameter `na.rm = TRUE` to ignore NA values when calculating the mean. For species such as `harrisi`, when calculating the mean, all the values are ignored restulting in `NaN` being returned as the mean. This caused `NaN` values to be imputed in the weight column.

<br>

##### Consider the following example to understand how the NaN values were created:
```{r}
mean(c(NA, NA), na.rm = TRUE)
```

When the mean is calculated for a vector with only NAs, na.rm = TRUE ignores all the NA values, hence resulting in the denominator of the mean to be zero, which evaluates to `NaN`.

<hr>

## Task 10: Outliers

#### Finding outliers in `hindfoot_length` variable:

We try to detect outliers by using a boxplot.
```{r}
survery_outliers <- boxplot(surveys_combined$hindfoot_length)$out
```
##### The outliers are:
```{r}
survery_outliers
```

Since there are a few outliers, we check the percentage of outliers in `hindfoot length` variable. 
```{r}
(length(survery_outliers)/nrow(surveys_combined)) * 100
```

Since the number of outliers is exteremly small `(0.0056%)` compared to the entire dataset, the best way to handle these outliers is to delete them. Therefore we delete the two outliers found. 
```{r}
cat("Number of rows before deleting outliers:" , nrow(surveys_combined) , "\n")
surveys_combined <- surveys_combined[-which(surveys_combined$hindfoot_length %in% survery_outliers),]
cat("Number of rows after deleting outliers:" , nrow(surveys_combined) , "\n")
```

<br>

##### Boxplot after removing outliers:
```{r}
val<-boxplot(surveys_combined$hindfoot_length)$out
```

Checking if any outliers remain:
```{r}
val
```

<br>
<br>
