Install and load the necessary packages to reproduce the report here:
library(readr)
library(tidyr)
library(dplyr)
library(kableExtra)
library(Hmisc)
library(outliers)
library(stringr)
Read the WHO data and displaying the first 6 observations using head() function.
who <- read_csv("WHO.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## country = col_character(),
## iso2 = col_character(),
## iso3 = col_character()
## )
## See spec(...) for full column specifications.
head(who)
We notice, variable names beyound year are actually supposed to be values, hence we need to gather the data and tidy it up slightly using gather() function.
Using gather() function, the dataset has been reshaped and displayed below.
who_gathered <- gather(who,"code", "value", 5:60)
who_gathered
After gathering the data, it is seen that the code variable has 4 seperate components in the observation - New/old cases, Case type (rel/ep/sp/sn/), gender (m/f) and age bracket. The code variable needs to be deconstructed and seperated into 4 variables - new, var, sex and age - using seperate() function.
who_separated <- who_gathered %>% separate(code, c("new", "var", "sex"), sep = "_")
who_separated <- who_separated %>% separate(sex, c("sex", "age"), sep = "(?<=[A-Za-z])(?=[0-9])")
Now that the dataset the code column has been reshaped, the resultant dataset consists of 9 variables and 405440 observations and is displayed below.
who_separated
It is seen that the var column of in the dataset has four components -ep, rel, sn, sp - which need to converted into variables as well. To do this, spread() function was used to create 4 new variables by replacing the “var” and “value” variables.
who_spread <- spread(who_separated, var, value)
The new dataset now contains 11 variables and and 101360 observations. The first handful observations are displayed below.
who_spread
In the final stage of tidying the data, the two character variables “sex” and “age” need to be converted into factors. This was done in the following manner with the help of the mutate() function.
who_final <- who_spread %>% mutate(sex = factor(sex, levels = c("m","f")))
## Warning: package 'bindrcpp' was built under R version 3.4.4
who_final <- who_final %>% mutate(age = factor(age, levels = c("014","1524", "2534","3544","4554","5564","65"), labels = c("<15","15-24", "25-34","35-44","45-54","55-64","65>="), ordered = TRUE))
The final tidy dataset in displayed below.
who_final
In the final task a subset is created called who_subset, in which only the data pertaining to “Australia”, “India” and “Japan” has been filtered from the main dataset. The variables iso2 and new have also been dropped in this subset due to redundancy.
who_subset <- who_final %>% select(-(iso2),-(new)) %>% filter(country == c("Australia", "India", "Japan"))
## Warning in country == c("Australia", "India", "Japan"): longer object
## length is not a multiple of shorter object length
The subset for the data pertaining to “Australia”, “India” and “Japan” consists of 9 variables and 477 observations.
who_subset
For the next series of tasks, the first step is to read in the species and surveys data.
species <- read_csv("species.csv")
## Parsed with column specification:
## cols(
## species_id = col_character(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character()
## )
surveys <- read_csv("surveys.csv")
## Parsed with column specification:
## cols(
## record_id = col_integer(),
## month = col_integer(),
## day = col_integer(),
## year = col_integer(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_integer(),
## weight = col_integer()
## )
The first 6 observations of surveys table is displayed below.
head(surveys) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
| record_id | month | day | year | species_id | sex | hindfoot_length | weight |
|---|---|---|---|---|---|---|---|
| 1 | 7 | 16 | 1977 | NL | M | 32 | NA |
| 2 | 7 | 16 | 1977 | NL | M | 33 | NA |
| 3 | 7 | 16 | 1977 | DM | F | 37 | NA |
| 4 | 7 | 16 | 1977 | DM | M | 36 | NA |
| 5 | 7 | 16 | 1977 | DM | M | 35 | NA |
| 6 | 7 | 16 | 1977 | PF | M | 14 | NA |
The first 6 observations of species table is displayed below.
head(species) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
| species_id | genus | species | taxa |
|---|---|---|---|
| AB | Amphispiza | bilineata | Bird |
| AH | Ammospermophilus | harrisi | Rodent |
| AS | Ammodramus | savannarum | Bird |
| BA | Baiomys | taylori | Rodent |
| CB | Campylorhynchus | brunneicapillus | Bird |
| CM | Calamospiza | melanocorys | Bird |
Both the datasets have specied_id as a common variable, it will be used as the primary key for joining them both. The variables “genus”, “species” and “taxa” from the species table are added to the surveys table using left_join() function.
surveys_combined <- surveys %>% left_join(species, by = "species_id")
The first 10 observations of this new combined table are displayed below.
head(surveys_combined,10) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
| record_id | month | day | year | species_id | sex | hindfoot_length | weight | genus | species | taxa |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 16 | 1977 | NL | M | 32 | NA | Neotoma | albigula | Rodent |
| 2 | 7 | 16 | 1977 | NL | M | 33 | NA | Neotoma | albigula | Rodent |
| 3 | 7 | 16 | 1977 | DM | F | 37 | NA | Dipodomys | merriami | Rodent |
| 4 | 7 | 16 | 1977 | DM | M | 36 | NA | Dipodomys | merriami | Rodent |
| 5 | 7 | 16 | 1977 | DM | M | 35 | NA | Dipodomys | merriami | Rodent |
| 6 | 7 | 16 | 1977 | PF | M | 14 | NA | Perognathus | flavus | Rodent |
| 7 | 7 | 16 | 1977 | PE | F | NA | NA | Peromyscus | eremicus | Rodent |
| 8 | 7 | 16 | 1977 | DM | M | 37 | NA | Dipodomys | merriami | Rodent |
| 9 | 7 | 16 | 1977 | DM | F | 34 | NA | Dipodomys | merriami | Rodent |
| 10 | 7 | 16 | 1977 | PF | F | 20 | NA | Perognathus | flavus | Rodent |
Next, the average weights and hindfoot length of the “flavus” species across each month is calculated. Missing values were removed during the calculation of means. The table below shows the results of the calculation.
surveys_combined %>% filter(species == "flavus") %>% group_by(month) %>% summarise(average_weight = mean(weight, na.rm = TRUE), average_hidfoot_length = mean(hindfoot_length, na.rm = TRUE)) %>% kable("html", caption = "Flavus species monthly data") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
| month | average_weight | average_hidfoot_length |
|---|---|---|
| 1 | 7.592593 | 15.50000 |
| 2 | 7.980952 | 15.57944 |
| 3 | 7.703226 | 15.54015 |
| 4 | 8.220994 | 15.92771 |
| 5 | 8.288000 | 15.48000 |
| 6 | 7.939130 | 15.73077 |
| 7 | 8.292208 | 15.73381 |
| 8 | 8.262626 | 15.44554 |
| 9 | 7.913669 | 15.44118 |
| 10 | 7.790698 | 15.58857 |
| 11 | 7.093220 | 15.26050 |
| 12 | 7.778846 | 15.61538 |
The weights of the flavus species is seen to be increasing in the months numbered 4,5,7 and 8.
The data from year 2000 is selected and a new data frame called surveys_combined_year is created and a few observations are displayed.
surveys_combined_year <- surveys_combined %>% filter(year == "2000")
head(surveys_combined_year)
The total number of missing values in weight column grouped by species for surveys_combined_year is displayed below.
surveys_combined_year %>% group_by(species) %>% summarise(total_missing = sum(is.na(weight))) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
| species | total_missing |
|---|---|
| albigula | 3 |
| baileyi | 10 |
| bilineata | 3 |
| eremicus | 2 |
| gramineus | 1 |
| harrisi | 12 |
| hispidus | 0 |
| leucopus | 0 |
| maniculatus | 0 |
| megalotis | 0 |
| merriami | 4 |
| ordii | 2 |
| penicillatus | 12 |
| sp. | 5 |
| spilosoma | 4 |
| torridus | 9 |
| NA | 43 |
Now that the number of missing values is known (110), they will be replaced with mean weight of the species and saved to a new data frame surveys_weight_imputed. Species with “NA” have 43 missing values, so these will need to be addressed at a later stage.
surveys_weight_imputed <- surveys_combined_year
surveys_weight_imputed$weight[is.na(surveys_weight_imputed$weight)] <- ave(surveys_weight_imputed$weight, surveys_weight_imputed$species, FUN = function(x)mean(x, na.rm = T))[is.na(surveys_weight_imputed$weight)]
Now that the missing values have been imputed into surveys_weight_imputed, we need to inspect the weight column for further inconsistencies.
On checking for discrepancy or special values, a check for infinite and NaN values was undertaken.
is.special <- function(x){
if (is.numeric(x)) !is.finite(x) else is.na(x)
}
which(is.special(surveys_weight_imputed$weight))
## [1] 31 37 59 60 116 117 125 168 199 220 260 262 263 290
## [15] 291 305 318 342 410 422 423 562 563 564 565 644 729 737
## [29] 738 739 810 828 829 873 878 908 933 934 1009 1010 1011 1012
## [43] 1013 1014 1015 1016 1017 1018 1019 1114 1135 1208 1209 1210 1260 1284
## [57] 1331 1344 1345 1346 1397 1456 1457 1508 1531 1550 1551 1552
sum(is.special(surveys_weight_imputed$weight))
## [1] 68
We see that a lot of values (68) are special, hence we need to check for the reason through a traceback.
First, a check is done to see whether the species variable had NA values to start with, as imputing mean would give a default NaN value.
which(is.na(surveys_weight_imputed$species))
## [1] 125 199 262 263 342 422 423 562 563 564 565 644 738 739
## [15] 829 933 934 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019
## [29] 1114 1208 1209 1210 1284 1344 1345 1346 1397 1456 1457 1508 1550 1551
## [43] 1552
sum(is.special(surveys_weight_imputed$species))
## [1] 43
From the results, we see that out of 68 special values found in the weight column, 43 of them were a result of NA observation in the species column.
As for the remaining inconsistencies, we check the mean weights by species that were supposed to be imputed using aggregate() function.
par(mfrow = c(1,2))
aggregate(surveys_combined_year[,8], list(surveys_combined_year$species), mean, na.rm = TRUE) %>% kable("html") %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = F)
| Group.1 | weight |
|---|---|
| albigula | 179.30769 |
| baileyi | 30.87890 |
| bilineata | NaN |
| eremicus | 21.61538 |
| gramineus | NaN |
| harrisi | NaN |
| hispidus | 73.37500 |
| leucopus | 21.00000 |
| maniculatus | 20.50000 |
| megalotis | 11.40000 |
| merriami | 43.12664 |
| ordii | 49.22472 |
| penicillatus | 16.78862 |
| sp. | NaN |
| spilosoma | NaN |
| torridus | 25.30345 |
From the tables above we see that 5 species have NaN as the mean value that was supposed to be imputed. So a traceback to the dataset shows from the surveys_combined_year dataset:
This accounts for the remaining 25 special values.
Firstly, a check through a boxplot is done, to visually see existence of outliers.
boxplot(surveys_combined$hindfoot_length, main="Box Plot of Hindfoot Length in Surveys_Combined", ylab ="Hind Foot Length", col = "dodgerblue3")
From the boxplot, only two outliers are visible. To confirm this, z score test will also be undertaken to see how many observations surpass the threshold value of 3.
z.scores <- surveys_combined$hindfoot_length[!is.na(surveys_combined$hindfoot_length)] %>% scores(type = "z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.8530 -0.8665 0.2835 0.0000 0.7017 4.2565
which(abs(z.scores) > 3 )
## [1] 1355 8617 18682 26611
From the results of z-score summary, it is seen that surveys_combined shows 4 observations with absolute z score > 3.
In a boxplot, an outlier is defined as any value in a data set that falls beyond the range of -1.5IQR to 1.5IQR, whereas in a z-score approach, an outlier is defined as any abolute z-score value greater than 3. Z-scores are calculated by taking the difference of values of observations and the sample mean, whole divided by the standard deviation.
It is because of this difference, we see a mismatch in the number of outliers detected by both the boxplot and the z-score approach.
Since it is only 4 observations (1355, 8617, 18682 and 26611) in a data frame consisting of 35549 observations, it is efficient to exclude and remove these observations from the set.
surveys_combined_clean <- surveys_combined[ -c(1355,8617,18682,26611),]
dim(surveys_combined_clean)
## [1] 35545 11
The cleaned data frame ‘surveys_combined_clean’ now consists of 35545 observations, with the 4 outliers in the hindfoot_length variable being omitted.