Setup

The packages necessary for the assignment have been loaded below.

library(readr)
library(tidyr)
library(dplyr)
library(outliers)

Read WHO Data

The WHO Data has been read in and displayed below.

setwd("C:/Users/Jaimee-Lee/Documents/R/Working Directory/Data Preprocessing/Data Sets/Assignment 2")
WHO <- read_csv("WHO.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   country = col_character(),
##   iso2 = col_character(),
##   iso3 = col_character()
## )
## See spec(...) for full column specifications.
WHO

Tidy Task 1:

After investigating the original dimensions, the WHO data has been reshaped as requested below.

dim(WHO)
## [1] 7240   60
WHO <- WHO %>% gather(new_sp_m014:new_rel_f65, key = "code", value = "value")
WHO

Tidy Task 2:

The WHO data has been reshaped as requested below.

WHO <- WHO %>% separate(col = code , into = c("new","var","sex/age"), sep = "_")  %>% separate(col = "sex/age", into = c("sex","age"), sep = 1)
WHO

Tidy Task 3:

The WHO data has been reshaped as requested below.

WHO <- WHO %>% spread(key = "var", value = "value")
WHO

Tidy Task 4:

The variables age and sex have been factorised below using mutate(). Labels have also been added to the age variable.

WHO <- WHO %>% mutate(sex = factor(sex, levels = c("f", "m")), age = factor(age, levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
                                   labels = c("<15","15-24","25-34","35-44","45-54","55-64","65>="), ordered = TRUE))
str(WHO$sex)
##  Factor w/ 2 levels "f","m": 2 2 2 2 2 2 2 2 2 2 ...
str(WHO$age)
##  Ord.factor w/ 7 levels "<15"<"15-24"<..: 1 1 1 1 1 1 1 1 1 1 ...
WHO

Task 5: Filter & Select

The variables iso2 and new have been removed from the data, the data has then been subset to include only three countries (Chad, Albania and Japan).

WHO_subset <- WHO %>% select(-iso2, -new) %>% filter(WHO$country == "Chad" | WHO$country == "Albania" | WHO$country == "Japan")
WHO_subset

Read Species and Surveys data sets

The Species and Surveys datasets have been read in and displayed below.

setwd("C:/Users/Jaimee-Lee/Documents/R/Working Directory/Data Preprocessing/Data Sets/Assignment 2")
species <- read_csv("species.csv")
## Parsed with column specification:
## cols(
##   species_id = col_character(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character()
## )
surveys <- read_csv("surveys.csv")
## Parsed with column specification:
## cols(
##   record_id = col_double(),
##   month = col_double(),
##   day = col_double(),
##   year = col_double(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_double(),
##   weight = col_double()
## )
species
surveys

Task 6: Join

The Surveys and Species data frames have been combined below on the species_id variable and then named surveys_combined.

surveys_combined <- inner_join(surveys, species, by = "species_id")
surveys_combined

Task 7: Calculate

The species BA has been selected from the dataset, the average weight and hindfoot length in each month has been calculated, while excluding missing values.

BA <- surveys_combined %>%  filter(species_id == "BA")
BA
BA_by_month <- BA %>% group_by(month)
BA_by_month
BA_summary <- BA_by_month %>% summarise(mean_weight = mean(weight, na.rm = TRUE), mean_hindfoot_length = mean(hindfoot_length, na.rm = TRUE)) 
BA_summary
BA_summary%>% round(digits = 2)

Task 8: Missing Values

The year 1991 has been selected from the the dataset, and saved as surveys_combined_year. The total missing values for the weight column were then found, grouped by species. These missing values have then been replaced with the mean weight of the species they belong to. The imputed data has then been saved as surveys_weight_imputed.

surveys_combined_year <- surveys_combined %>% filter(surveys_combined$year == "1991")
surveys_combined_year
surveys_combined_year %>% group_by(species) %>% summarise(missing_values = sum(is.na(weight)))
surveys_weight_imputed <- surveys_combined_year %>%
         group_by(species) %>%
         mutate(weight = replace_na(weight, mean(weight, na.rm = TRUE)))
surveys_weight_imputed

Task 9: Inconsistencies or Special Values

The weight column has been inspected for special values using is.finite(), is.infinite(), is.nan(). The output of these functions reveal that there are 1278 finite values (i.e. real numbers) in the dataset(total: 1332 observations), leaving 54 values that are possibly not numbers. This notion is further confirmed by is.infinite() reutrning TRUE for all values and is.nan() returning 54 values as TRUE.

Which() was used to locate the NaN values, three were selected randomly in order to ascertain why their weight values were NaN. This reveals that the observations belong to species that did not have their missing values imputed with means in the last task. This is due to those particular species’ not having any data recorded under the weight variable with which to create a mean to be imputed (e.g. harrisi and billineata).

sum(is.finite(surveys_weight_imputed$weight))
## [1] 1278
sum(is.infinite(surveys_weight_imputed$weight))
## [1] 0
sum(is.nan(surveys_weight_imputed$weight))
## [1] 54
which(is.nan(surveys_weight_imputed$weight))
##  [1]   10   38  130  187  189  273  305  310  311  391  420  428  446  474
## [15]  481  498  522  541  548  554  560  570  655  687  713  717  732  735
## [29]  736  762  765  767  791  801  857  896  939 1009 1014 1035 1081 1085
## [43] 1098 1146 1156 1161 1182 1195 1206 1209 1220 1226 1258 1285
print(surveys_weight_imputed[273,])
## # A tibble: 1 x 11
## # Groups:   species [1]
##   record_id month   day  year species_id sex   hindfoot_length weight genus
##       <dbl> <dbl> <dbl> <dbl> <chr>      <chr>           <dbl>  <dbl> <chr>
## 1     18462     3    13  1991 AH         <NA>               NA    NaN Ammo~
## # ... with 2 more variables: species <chr>, taxa <chr>
print(surveys_weight_imputed[522,])
## # A tibble: 1 x 11
## # Groups:   species [1]
##   record_id month   day  year species_id sex   hindfoot_length weight genus
##       <dbl> <dbl> <dbl> <dbl> <chr>      <chr>           <dbl>  <dbl> <chr>
## 1     18713     5    14  1991 AH         <NA>               NA    NaN Ammo~
## # ... with 2 more variables: species <chr>, taxa <chr>
print(surveys_weight_imputed[1014,])
## # A tibble: 1 x 11
## # Groups:   species [1]
##   record_id month   day  year species_id sex   hindfoot_length weight genus
##       <dbl> <dbl> <dbl> <dbl> <chr>      <chr>           <dbl>  <dbl> <chr>
## 1     19218    11    13  1991 AB         <NA>               NA    NaN Amph~
## # ... with 2 more variables: species <chr>, taxa <chr>

Task 10: Outliers

Below the outliers in teh dataset have been identified and handled.

First, The mean, median and mode were calculated in order to determine whether the data followed a normal distribution or not. The data did not follow a normal distribution, therefore boxplot was used to identify the outliers, the outliers were then saved as surveys_combined_outliers.
Using arrange() the outliers were viewed, the values were determined to be of the albigula and ordii. The quantiles and IQR were then both calculated for both of those species, these values were then used in Tukeys’ Method to ascertain whether the outliers were infact lying outside these ranges for their relevant species.
In doing so, it was found that the two outliers were in fact unusual of their respective species. It was thought that it was possible that the outliers were possibly due to data entry errors. Therefore, capping was employed to deal with them so as to not disturb the data too much or resort to excluding the observations completely.

#Descriptive statistics to determine the distribution
surveys_combined %>% summarise(Mean = mean(hindfoot_length, na.rm = TRUE),Median = median(hindfoot_length, na.rm = TRUE))
#Boxplot showing outliers
surveys_combined_outliers <- surveys_combined$hindfoot_length %>% boxplot(main = "Boxplot of Hindfoot Length", ylab = "Length in cm")

#Outliers
surveys_combined_outliers$out
## [1] 70 64
#Showing outliers (top two observations)
surveys_combined %>% arrange(desc(hindfoot_length))
#Calculating quantiles and IQR for both species
surveys_combined %>% filter(species == 'Albigula') %>% summarise(Q1 = quantile(hindfoot_length, probs = .25, na.rm = TRUE),
                       Q3 = quantile(hindfoot_length, probs = .75, na.rm = TRUE),
                       IQR = IQR(hindfoot_length, na.rm = TRUE))
surveys_combined %>% filter(species == 'Ordii') %>% summarise(Q1 = quantile(hindfoot_length, probs = .25, na.rm = TRUE),
                       Q3 = quantile(hindfoot_length, probs = .75, na.rm = TRUE),
                       IQR = IQR(hindfoot_length, na.rm = TRUE))
#Calculating Tukeys' mMethod for each species
#Albigula
31-1.5*2
## [1] 28
33+1.5*2
## [1] 36
#Ordii
35-1.5*2
## [1] 32
37-1.5*2
## [1] 34
#Capping function
cap <- function(x){
  quantiles <- quantile(x, c(.05, 0.25, 0.75, 0.95), na.rm = T)
  x[x < quantiles[2] - 1.5 * IQR(x, na.rm = T)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x, na.rm = T)] <- quantiles[4]
  return(x)
}

#Capping the outliers
hindfoot_sub <- surveys_combined[,7]
hindfoot_cap <- as.data.frame(sapply(hindfoot_sub, FUN = cap))

#The maximum value now excludes the outliers, meaning the capping was successful
summary(hindfoot_cap)
##  hindfoot_length
##  Min.   : 2.00  
##  1st Qu.:21.00  
##  Median :32.00  
##  Mean   :29.29  
##  3rd Qu.:36.00  
##  Max.   :58.00  
##  NA's   :3348