MATH2349 Semester 2, 2018

Loading Relevant Packages

Install and load the necessary packages to reproduce the report here:

library(readr)
library(tidyr)
library(dplyr)
library(kableExtra)
library(Hmisc)
library(outliers)
library(stringr)

Read WHO Data

Read the WHO data and displaying the first 6 observations using head() function.

who <- read_csv("WHO.csv")

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   country = col_character(),
##   iso2 = col_character(),
##   iso3 = col_character()
## )

## See spec(...) for full column specifications.

head(who)

We notice, variable names beyound year are actually supposed to be values, hence we need to gather the data and tidy it up slightly using gather() function.

Tidy Task 1:

Using gather() function, the dataset has been reshaped and displayed below.

who_gathered <- gather(who,"code", "value", 5:60)
who_gathered

Tidy Task 2:

After gathering the data, it is seen that the code variable has 4 seperate components in the observation - New/old cases, Case type (rel/ep/sp/sn/), gender (m/f) and age bracket. The code variable needs to be deconstructed and seperated into 4 variables - new, var, sex and age - using seperate() function.

who_separated <- who_gathered %>% separate(code, c("new", "var", "sex"), sep = "_")
who_separated <- who_separated %>% separate(sex, c("sex", "age"), sep = "(?<=[A-Za-z])(?=[0-9])")

Now that the dataset the code column has been reshaped, the resultant dataset consists of 9 variables and 405440 observations and is displayed below.

who_separated

Tidy Task 3:

It is seen that the var column of in the dataset has four components -ep, rel, sn, sp - which need to converted into variables as well. To do this, spread() function was used to create 4 new variables by replacing the “var” and “value” variables.

who_spread <- spread(who_separated, var, value)

The new dataset now contains 11 variables and and 101360 observations. The first handful observations are displayed below.

who_spread

Tidy Task 4:

In the final stage of tidying the data, the two character variables “sex” and “age” need to be converted into factors. This was done in the following manner with the help of the mutate() function.

sex variable was factored into two levels “m” and “f”
age variable was converted into an ordered factor with 7 levels - <15, 15-24, 25-34, 35-44, 45-54, 55-64, 65>=.

who_final <- who_spread %>% mutate(sex = factor(sex, levels = c("m","f")))

## Warning: package 'bindrcpp' was built under R version 3.4.4

who_final <- who_final %>% mutate(age = factor(age, levels = c("014","1524", "2534","3544","4554","5564","65"), labels = c("<15","15-24", "25-34","35-44","45-54","55-64","65>="), ordered = TRUE))

The final tidy dataset in displayed below.

who_final

Task 5: Filter & Select

In the final task a subset is created called who_subset, in which only the data pertaining to “Australia”, “India” and “Japan” has been filtered from the main dataset. The variables iso2 and new have also been dropped in this subset due to redundancy.

who_subset <- who_final %>% select(-(iso2),-(new)) %>% filter(country == c("Australia", "India", "Japan"))

## Warning in country == c("Australia", "India", "Japan"): longer object
## length is not a multiple of shorter object length

The subset for the data pertaining to “Australia”, “India” and “Japan” consists of 9 variables and 477 observations.

who_subset

Read Species and Surveys data sets

For the next series of tasks, the first step is to read in the species and surveys data.

species <- read_csv("species.csv")

## Parsed with column specification:
## cols(
##   species_id = col_character(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character()
## )

surveys <- read_csv("surveys.csv")

## Parsed with column specification:
## cols(
##   record_id = col_integer(),
##   month = col_integer(),
##   day = col_integer(),
##   year = col_integer(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_integer(),
##   weight = col_integer()
## )

The first 6 observations of surveys table is displayed below.

head(surveys) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)

record_id	month	day	year	species_id	sex	hindfoot_length	weight
1	7	16	1977	NL	M	32	NA
2	7	16	1977	NL	M	33	NA
3	7	16	1977	DM	F	37	NA
4	7	16	1977	DM	M	36	NA
5	7	16	1977	DM	M	35	NA
6	7	16	1977	PF	M	14	NA

The first 6 observations of species table is displayed below.

head(species) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)

species_id	genus	species	taxa
AB	Amphispiza	bilineata	Bird
AH	Ammospermophilus	harrisi	Rodent
AS	Ammodramus	savannarum	Bird
BA	Baiomys	taylori	Rodent
CB	Campylorhynchus	brunneicapillus	Bird
CM	Calamospiza	melanocorys	Bird

Task 6: Join

Both the datasets have specied_id as a common variable, it will be used as the primary key for joining them both. The variables “genus”, “species” and “taxa” from the species table are added to the surveys table using left_join() function.

surveys_combined <- surveys %>% left_join(species, by = "species_id")

The first 10 observations of this new combined table are displayed below.

head(surveys_combined,10) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)

record_id	month	day	year	species_id	sex	hindfoot_length	weight	genus	species	taxa
1	7	16	1977	NL	M	32	NA	Neotoma	albigula	Rodent
2	7	16	1977	NL	M	33	NA	Neotoma	albigula	Rodent
3	7	16	1977	DM	F	37	NA	Dipodomys	merriami	Rodent
4	7	16	1977	DM	M	36	NA	Dipodomys	merriami	Rodent
5	7	16	1977	DM	M	35	NA	Dipodomys	merriami	Rodent
6	7	16	1977	PF	M	14	NA	Perognathus	flavus	Rodent
7	7	16	1977	PE	F	NA	NA	Peromyscus	eremicus	Rodent
8	7	16	1977	DM	M	37	NA	Dipodomys	merriami	Rodent
9	7	16	1977	DM	F	34	NA	Dipodomys	merriami	Rodent
10	7	16	1977	PF	F	20	NA	Perognathus	flavus	Rodent

Task 7: Calculate

Next, the average weights and hindfoot length of the “flavus” species across each month is calculated. Missing values were removed during the calculation of means. The table below shows the results of the calculation.

surveys_combined %>% filter(species == "flavus") %>% group_by(month) %>% summarise(average_weight = mean(weight, na.rm = TRUE), average_hidfoot_length = mean(hindfoot_length, na.rm = TRUE)) %>% kable("html", caption = "Flavus species monthly data") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)

Flavus species monthly data
month	average_weight	average_hidfoot_length
1	7.592593	15.50000
2	7.980952	15.57944
3	7.703226	15.54015
4	8.220994	15.92771
5	8.288000	15.48000
6	7.939130	15.73077
7	8.292208	15.73381
8	8.262626	15.44554
9	7.913669	15.44118
10	7.790698	15.58857
11	7.093220	15.26050
12	7.778846	15.61538

The weights of the flavus species is seen to be increasing in the months numbered 4,5,7 and 8.

Task 8: Missing Values

The data from year 2000 is selected and a new data frame called surveys_combined_year is created and a few observations are displayed.

surveys_combined_year <- surveys_combined %>% filter(year == "2000")
head(surveys_combined_year)

The total number of missing values in weight column grouped by species for surveys_combined_year is displayed below.

surveys_combined_year %>% group_by(species) %>% summarise(total_missing = sum(is.na(weight))) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)

species	total_missing
albigula	3
baileyi	10
bilineata	3
eremicus	2
gramineus	1
harrisi	12
hispidus	0
leucopus	0
maniculatus	0
megalotis	0
merriami	4
ordii	2
penicillatus	12
sp.	5
spilosoma	4
torridus	9
NA	43

Now that the number of missing values is known (110), they will be replaced with mean weight of the species and saved to a new data frame surveys_weight_imputed. Species with “NA” have 43 missing values, so these will need to be addressed at a later stage.

surveys_weight_imputed <- surveys_combined_year
surveys_weight_imputed$weight[is.na(surveys_weight_imputed$weight)] <- ave(surveys_weight_imputed$weight, surveys_weight_imputed$species, FUN = function(x)mean(x, na.rm = T))[is.na(surveys_weight_imputed$weight)]

Now that the missing values have been imputed into surveys_weight_imputed, we need to inspect the weight column for further inconsistencies.

Task 9: Inconsistencies or Special Values

On checking for discrepancy or special values, a check for infinite and NaN values was undertaken.

is.special <- function(x){
if (is.numeric(x)) !is.finite(x) else is.na(x)
}
which(is.special(surveys_weight_imputed$weight))

##  [1]   31   37   59   60  116  117  125  168  199  220  260  262  263  290
## [15]  291  305  318  342  410  422  423  562  563  564  565  644  729  737
## [29]  738  739  810  828  829  873  878  908  933  934 1009 1010 1011 1012
## [43] 1013 1014 1015 1016 1017 1018 1019 1114 1135 1208 1209 1210 1260 1284
## [57] 1331 1344 1345 1346 1397 1456 1457 1508 1531 1550 1551 1552

sum(is.special(surveys_weight_imputed$weight))

## [1] 68

We see that a lot of values (68) are special, hence we need to check for the reason through a traceback.

First, a check is done to see whether the species variable had NA values to start with, as imputing mean would give a default NaN value.

which(is.na(surveys_weight_imputed$species))

##  [1]  125  199  262  263  342  422  423  562  563  564  565  644  738  739
## [15]  829  933  934 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019
## [29] 1114 1208 1209 1210 1284 1344 1345 1346 1397 1456 1457 1508 1550 1551
## [43] 1552

sum(is.special(surveys_weight_imputed$species))

## [1] 43

From the results, we see that out of 68 special values found in the weight column, 43 of them were a result of NA observation in the species column.

As for the remaining inconsistencies, we check the mean weights by species that were supposed to be imputed using aggregate() function.

par(mfrow = c(1,2))
aggregate(surveys_combined_year[,8], list(surveys_combined_year$species), mean, na.rm = TRUE) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)

Group.1	weight
albigula	179.30769
baileyi	30.87890
bilineata	NaN
eremicus	21.61538
gramineus	NaN
harrisi	NaN
hispidus	73.37500
leucopus	21.00000
maniculatus	20.50000
megalotis	11.40000
merriami	43.12664
ordii	49.22472
penicillatus	16.78862
sp.	NaN
spilosoma	NaN
torridus	25.30345

From the tables above we see that 5 species have NaN as the mean value that was supposed to be imputed. So a traceback to the dataset shows from the surveys_combined_year dataset:

species bilineata, there are only 3 observations and all three have NA weights, hence imputing weight gave NaN
species gramnineus, there is only 1 observations and it has NA weight, hence imputing weight gave NaN
species bilineata, there are only 12 observations and all twelve have NA weights, hence imputing weight gave NaN
species sp., there are only 5 observations and all five have NA weights, hence imputing weight gave NaN
species spilosoma, there are only 4 observations and all four have NA weights, hence imputing weight gave NaN

This accounts for the remaining 25 special values.

Task 10: Outliers

Firstly, a check through a boxplot is done, to visually see existence of outliers.

boxplot(surveys_combined$hindfoot_length, main="Box Plot of Hindfoot Length in Surveys_Combined", ylab ="Hind Foot Length", col = "dodgerblue3")

From the boxplot, only two outliers are visible. To confirm this, z score test will also be undertaken to see how many observations surpass the threshold value of 3.

z.scores <- surveys_combined$hindfoot_length[!is.na(surveys_combined$hindfoot_length)] %>%  scores(type = "z")
z.scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.8530 -0.8665  0.2835  0.0000  0.7017  4.2565

which(abs(z.scores) > 3 )

## [1]  1355  8617 18682 26611

From the results of z-score summary, it is seen that surveys_combined shows 4 observations with absolute z score > 3.

In a boxplot, an outlier is defined as any value in a data set that falls beyond the range of -1.5IQR to 1.5IQR, whereas in a z-score approach, an outlier is defined as any abolute z-score value greater than 3. Z-scores are calculated by taking the difference of values of observations and the sample mean, whole divided by the standard deviation.

It is because of this difference, we see a mismatch in the number of outliers detected by both the boxplot and the z-score approach.

Since it is only 4 observations (1355, 8617, 18682 and 26611) in a data frame consisting of 35549 observations, it is efficient to exclude and remove these observations from the set.

surveys_combined_clean <- surveys_combined[ -c(1355,8617,18682,26611),]
dim(surveys_combined_clean)

## [1] 35545    11

The cleaned data frame ‘surveys_combined_clean’ now consists of 35545 observations, with the 4 outliers in the hindfoot_length variable being omitted.