Loading Relevant Packages

Install and load the necessary packages to reproduce the report here:

library(readr)
library(tidyr)
library(dplyr)
library(kableExtra)
library(Hmisc)
library(outliers)
library(stringr)

Read WHO Data

Read the WHO data and displaying the first 6 observations using head() function.

who <- read_csv("WHO.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   country = col_character(),
##   iso2 = col_character(),
##   iso3 = col_character()
## )
## See spec(...) for full column specifications.
head(who)

We notice, variable names beyound year are actually supposed to be values, hence we need to gather the data and tidy it up slightly using gather() function.

Tidy Task 1:

Using gather() function, the dataset has been reshaped and displayed below.

who_gathered <- gather(who,"code", "value", 5:60)
who_gathered

Tidy Task 2:

After gathering the data, it is seen that the code variable has 4 seperate components in the observation - New/old cases, Case type (rel/ep/sp/sn/), gender (m/f) and age bracket. The code variable needs to be deconstructed and seperated into 4 variables - new, var, sex and age - using seperate() function.

who_separated <- who_gathered %>% separate(code, c("new", "var", "sex"), sep = "_")
who_separated <- who_separated %>% separate(sex, c("sex", "age"), sep = "(?<=[A-Za-z])(?=[0-9])")

Now that the dataset the code column has been reshaped, the resultant dataset consists of 9 variables and 405440 observations and is displayed below.

who_separated

Tidy Task 3:

It is seen that the var column of in the dataset has four components -ep, rel, sn, sp - which need to converted into variables as well. To do this, spread() function was used to create 4 new variables by replacing the “var” and “value” variables.

who_spread <- spread(who_separated, var, value)

The new dataset now contains 11 variables and and 101360 observations. The first handful observations are displayed below.

who_spread

Tidy Task 4:

In the final stage of tidying the data, the two character variables “sex” and “age” need to be converted into factors. This was done in the following manner with the help of the mutate() function.

  1. sex variable was factored into two levels “m” and “f”
  2. age variable was converted into an ordered factor with 7 levels - <15, 15-24, 25-34, 35-44, 45-54, 55-64, 65>=.
who_final <- who_spread %>% mutate(sex = factor(sex, levels = c("m","f")))
## Warning: package 'bindrcpp' was built under R version 3.4.4
who_final <- who_final %>% mutate(age = factor(age, levels = c("014","1524", "2534","3544","4554","5564","65"), labels = c("<15","15-24", "25-34","35-44","45-54","55-64","65>="), ordered = TRUE))

The final tidy dataset in displayed below.

who_final

Task 5: Filter & Select

In the final task a subset is created called who_subset, in which only the data pertaining to “Australia”, “India” and “Japan” has been filtered from the main dataset. The variables iso2 and new have also been dropped in this subset due to redundancy.

who_subset <- who_final %>% select(-(iso2),-(new)) %>% filter(country == c("Australia", "India", "Japan"))
## Warning in country == c("Australia", "India", "Japan"): longer object
## length is not a multiple of shorter object length

The subset for the data pertaining to “Australia”, “India” and “Japan” consists of 9 variables and 477 observations.

who_subset

Read Species and Surveys data sets

For the next series of tasks, the first step is to read in the species and surveys data.

species <- read_csv("species.csv")
## Parsed with column specification:
## cols(
##   species_id = col_character(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character()
## )
surveys <- read_csv("surveys.csv")
## Parsed with column specification:
## cols(
##   record_id = col_integer(),
##   month = col_integer(),
##   day = col_integer(),
##   year = col_integer(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_integer(),
##   weight = col_integer()
## )

The first 6 observations of surveys table is displayed below.

head(surveys) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
record_id month day year species_id sex hindfoot_length weight
1 7 16 1977 NL M 32 NA
2 7 16 1977 NL M 33 NA
3 7 16 1977 DM F 37 NA
4 7 16 1977 DM M 36 NA
5 7 16 1977 DM M 35 NA
6 7 16 1977 PF M 14 NA

The first 6 observations of species table is displayed below.

head(species) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
species_id genus species taxa
AB Amphispiza bilineata Bird
AH Ammospermophilus harrisi Rodent
AS Ammodramus savannarum Bird
BA Baiomys taylori Rodent
CB Campylorhynchus brunneicapillus Bird
CM Calamospiza melanocorys Bird

Task 6: Join

Both the datasets have specied_id as a common variable, it will be used as the primary key for joining them both. The variables “genus”, “species” and “taxa” from the species table are added to the surveys table using left_join() function.

surveys_combined <- surveys %>% left_join(species, by = "species_id")

The first 10 observations of this new combined table are displayed below.

head(surveys_combined,10) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
record_id month day year species_id sex hindfoot_length weight genus species taxa
1 7 16 1977 NL M 32 NA Neotoma albigula Rodent
2 7 16 1977 NL M 33 NA Neotoma albigula Rodent
3 7 16 1977 DM F 37 NA Dipodomys merriami Rodent
4 7 16 1977 DM M 36 NA Dipodomys merriami Rodent
5 7 16 1977 DM M 35 NA Dipodomys merriami Rodent
6 7 16 1977 PF M 14 NA Perognathus flavus Rodent
7 7 16 1977 PE F NA NA Peromyscus eremicus Rodent
8 7 16 1977 DM M 37 NA Dipodomys merriami Rodent
9 7 16 1977 DM F 34 NA Dipodomys merriami Rodent
10 7 16 1977 PF F 20 NA Perognathus flavus Rodent

Task 7: Calculate

Next, the average weights and hindfoot length of the “flavus” species across each month is calculated. Missing values were removed during the calculation of means. The table below shows the results of the calculation.

surveys_combined %>% filter(species == "flavus") %>% group_by(month) %>% summarise(average_weight = mean(weight, na.rm = TRUE), average_hidfoot_length = mean(hindfoot_length, na.rm = TRUE)) %>% kable("html", caption = "Flavus species monthly data") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
Flavus species monthly data
month average_weight average_hidfoot_length
1 7.592593 15.50000
2 7.980952 15.57944
3 7.703226 15.54015
4 8.220994 15.92771
5 8.288000 15.48000
6 7.939130 15.73077
7 8.292208 15.73381
8 8.262626 15.44554
9 7.913669 15.44118
10 7.790698 15.58857
11 7.093220 15.26050
12 7.778846 15.61538

The weights of the flavus species is seen to be increasing in the months numbered 4,5,7 and 8.

Task 8: Missing Values

The data from year 2000 is selected and a new data frame called surveys_combined_year is created and a few observations are displayed.

surveys_combined_year <- surveys_combined %>% filter(year == "2000")
head(surveys_combined_year)

The total number of missing values in weight column grouped by species for surveys_combined_year is displayed below.

surveys_combined_year %>% group_by(species) %>% summarise(total_missing = sum(is.na(weight))) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
species total_missing
albigula 3
baileyi 10
bilineata 3
eremicus 2
gramineus 1
harrisi 12
hispidus 0
leucopus 0
maniculatus 0
megalotis 0
merriami 4
ordii 2
penicillatus 12
sp. 5
spilosoma 4
torridus 9
NA 43

Now that the number of missing values is known (110), they will be replaced with mean weight of the species and saved to a new data frame surveys_weight_imputed. Species with “NA” have 43 missing values, so these will need to be addressed at a later stage.

surveys_weight_imputed <- surveys_combined_year
surveys_weight_imputed$weight[is.na(surveys_weight_imputed$weight)] <- ave(surveys_weight_imputed$weight, surveys_weight_imputed$species, FUN = function(x)mean(x, na.rm = T))[is.na(surveys_weight_imputed$weight)]

Now that the missing values have been imputed into surveys_weight_imputed, we need to inspect the weight column for further inconsistencies.

Task 9: Inconsistencies or Special Values

On checking for discrepancy or special values, a check for infinite and NaN values was undertaken.

is.special <- function(x){
if (is.numeric(x)) !is.finite(x) else is.na(x)
}
which(is.special(surveys_weight_imputed$weight))
##  [1]   31   37   59   60  116  117  125  168  199  220  260  262  263  290
## [15]  291  305  318  342  410  422  423  562  563  564  565  644  729  737
## [29]  738  739  810  828  829  873  878  908  933  934 1009 1010 1011 1012
## [43] 1013 1014 1015 1016 1017 1018 1019 1114 1135 1208 1209 1210 1260 1284
## [57] 1331 1344 1345 1346 1397 1456 1457 1508 1531 1550 1551 1552
sum(is.special(surveys_weight_imputed$weight))
## [1] 68

We see that a lot of values (68) are special, hence we need to check for the reason through a traceback.

First, a check is done to see whether the species variable had NA values to start with, as imputing mean would give a default NaN value.

which(is.na(surveys_weight_imputed$species))
##  [1]  125  199  262  263  342  422  423  562  563  564  565  644  738  739
## [15]  829  933  934 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019
## [29] 1114 1208 1209 1210 1284 1344 1345 1346 1397 1456 1457 1508 1550 1551
## [43] 1552
sum(is.special(surveys_weight_imputed$species))
## [1] 43

From the results, we see that out of 68 special values found in the weight column, 43 of them were a result of NA observation in the species column.

As for the remaining inconsistencies, we check the mean weights by species that were supposed to be imputed using aggregate() function.

par(mfrow = c(1,2))
aggregate(surveys_combined_year[,8], list(surveys_combined_year$species), mean, na.rm = TRUE) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
Group.1 weight
albigula 179.30769
baileyi 30.87890
bilineata NaN
eremicus 21.61538
gramineus NaN
harrisi NaN
hispidus 73.37500
leucopus 21.00000
maniculatus 20.50000
megalotis 11.40000
merriami 43.12664
ordii 49.22472
penicillatus 16.78862
sp. NaN
spilosoma NaN
torridus 25.30345

From the tables above we see that 5 species have NaN as the mean value that was supposed to be imputed. So a traceback to the dataset shows from the surveys_combined_year dataset:

This accounts for the remaining 25 special values.

Task 10: Outliers

Firstly, a check through a boxplot is done, to visually see existence of outliers.

boxplot(surveys_combined$hindfoot_length, main="Box Plot of Hindfoot Length in Surveys_Combined", ylab ="Hind Foot Length", col = "dodgerblue3")

From the boxplot, only two outliers are visible. To confirm this, z score test will also be undertaken to see how many observations surpass the threshold value of 3.

z.scores <- surveys_combined$hindfoot_length[!is.na(surveys_combined$hindfoot_length)] %>%  scores(type = "z")
z.scores %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.8530 -0.8665  0.2835  0.0000  0.7017  4.2565
which(abs(z.scores) > 3 )
## [1]  1355  8617 18682 26611

From the results of z-score summary, it is seen that surveys_combined shows 4 observations with absolute z score > 3.

In a boxplot, an outlier is defined as any value in a data set that falls beyond the range of -1.5IQR to 1.5IQR, whereas in a z-score approach, an outlier is defined as any abolute z-score value greater than 3. Z-scores are calculated by taking the difference of values of observations and the sample mean, whole divided by the standard deviation.

It is because of this difference, we see a mismatch in the number of outliers detected by both the boxplot and the z-score approach.

Since it is only 4 observations (1355, 8617, 18682 and 26611) in a data frame consisting of 35549 observations, it is efficient to exclude and remove these observations from the set.

surveys_combined_clean <- surveys_combined[ -c(1355,8617,18682,26611),]
dim(surveys_combined_clean)
## [1] 35545    11

The cleaned data frame ‘surveys_combined_clean’ now consists of 35545 observations, with the 4 outliers in the hindfoot_length variable being omitted.