The packages necessary for the assignment have been loaded below.
library(readr)
library(tidyr)
library(dplyr)
library(outliers)
The WHO Data has been read in and displayed below.
setwd("C:/Users/Jaimee-Lee/Documents/R/Working Directory/Data Preprocessing/Data Sets/Assignment 2")
WHO <- read_csv("WHO.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## country = col_character(),
## iso2 = col_character(),
## iso3 = col_character()
## )
## See spec(...) for full column specifications.
WHO
After investigating the original dimensions, the WHO data has been reshaped as requested below.
dim(WHO)
## [1] 7240 60
WHO <- WHO %>% gather(new_sp_m014:new_rel_f65, key = "code", value = "value")
WHO
The WHO data has been reshaped as requested below.
WHO <- WHO %>% separate(col = code , into = c("new","var","sex/age"), sep = "_") %>% separate(col = "sex/age", into = c("sex","age"), sep = 1)
WHO
The WHO data has been reshaped as requested below.
WHO <- WHO %>% spread(key = "var", value = "value")
WHO
The variables age and sex have been factorised below using mutate(). Labels have also been added to the age variable.
WHO <- WHO %>% mutate(sex = factor(sex, levels = c("f", "m")), age = factor(age, levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
labels = c("<15","15-24","25-34","35-44","45-54","55-64","65>="), ordered = TRUE))
str(WHO$sex)
## Factor w/ 2 levels "f","m": 2 2 2 2 2 2 2 2 2 2 ...
str(WHO$age)
## Ord.factor w/ 7 levels "<15"<"15-24"<..: 1 1 1 1 1 1 1 1 1 1 ...
WHO
The variables iso2 and new have been removed from the data, the data has then been subset to include only three countries (Chad, Albania and Japan).
WHO_subset <- WHO %>% select(-iso2, -new) %>% filter(WHO$country == "Chad" | WHO$country == "Albania" | WHO$country == "Japan")
WHO_subset
The Species and Surveys datasets have been read in and displayed below.
setwd("C:/Users/Jaimee-Lee/Documents/R/Working Directory/Data Preprocessing/Data Sets/Assignment 2")
species <- read_csv("species.csv")
## Parsed with column specification:
## cols(
## species_id = col_character(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character()
## )
surveys <- read_csv("surveys.csv")
## Parsed with column specification:
## cols(
## record_id = col_double(),
## month = col_double(),
## day = col_double(),
## year = col_double(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_double(),
## weight = col_double()
## )
species
surveys
The Surveys and Species data frames have been combined below on the species_id variable and then named surveys_combined.
surveys_combined <- inner_join(surveys, species, by = "species_id")
surveys_combined
The species BA has been selected from the dataset, the average weight and hindfoot length in each month has been calculated, while excluding missing values.
BA <- surveys_combined %>% filter(species_id == "BA")
BA
BA_by_month <- BA %>% group_by(month)
BA_by_month
BA_summary <- BA_by_month %>% summarise(mean_weight = mean(weight, na.rm = TRUE), mean_hindfoot_length = mean(hindfoot_length, na.rm = TRUE))
BA_summary
BA_summary%>% round(digits = 2)
The year 1991 has been selected from the the dataset, and saved as surveys_combined_year. The total missing values for the weight column were then found, grouped by species. These missing values have then been replaced with the mean weight of the species they belong to. The imputed data has then been saved as surveys_weight_imputed.
surveys_combined_year <- surveys_combined %>% filter(surveys_combined$year == "1991")
surveys_combined_year
surveys_combined_year %>% group_by(species) %>% summarise(missing_values = sum(is.na(weight)))
surveys_weight_imputed <- surveys_combined_year %>%
group_by(species) %>%
mutate(weight = replace_na(weight, mean(weight, na.rm = TRUE)))
surveys_weight_imputed
The weight column has been inspected for special values using is.finite(), is.infinite(), is.nan(). The output of these functions reveal that there are 1278 finite values (i.e. real numbers) in the dataset(total: 1332 observations), leaving 54 values that are possibly not numbers. This notion is further confirmed by is.infinite() reutrning TRUE for all values and is.nan() returning 54 values as TRUE.
Which() was used to locate the NaN values, three were selected randomly in order to ascertain why their weight values were NaN. This reveals that the observations belong to species that did not have their missing values imputed with means in the last task. This is due to those particular species’ not having any data recorded under the weight variable with which to create a mean to be imputed (e.g. harrisi and billineata).
sum(is.finite(surveys_weight_imputed$weight))
## [1] 1278
sum(is.infinite(surveys_weight_imputed$weight))
## [1] 0
sum(is.nan(surveys_weight_imputed$weight))
## [1] 54
which(is.nan(surveys_weight_imputed$weight))
## [1] 10 38 130 187 189 273 305 310 311 391 420 428 446 474
## [15] 481 498 522 541 548 554 560 570 655 687 713 717 732 735
## [29] 736 762 765 767 791 801 857 896 939 1009 1014 1035 1081 1085
## [43] 1098 1146 1156 1161 1182 1195 1206 1209 1220 1226 1258 1285
print(surveys_weight_imputed[273,])
## # A tibble: 1 x 11
## # Groups: species [1]
## record_id month day year species_id sex hindfoot_length weight genus
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 18462 3 13 1991 AH <NA> NA NaN Ammo~
## # ... with 2 more variables: species <chr>, taxa <chr>
print(surveys_weight_imputed[522,])
## # A tibble: 1 x 11
## # Groups: species [1]
## record_id month day year species_id sex hindfoot_length weight genus
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 18713 5 14 1991 AH <NA> NA NaN Ammo~
## # ... with 2 more variables: species <chr>, taxa <chr>
print(surveys_weight_imputed[1014,])
## # A tibble: 1 x 11
## # Groups: species [1]
## record_id month day year species_id sex hindfoot_length weight genus
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 19218 11 13 1991 AB <NA> NA NaN Amph~
## # ... with 2 more variables: species <chr>, taxa <chr>
Below the outliers in teh dataset have been identified and handled.
First, The mean, median and mode were calculated in order to determine whether the data followed a normal distribution or not. The data did not follow a normal distribution, therefore boxplot was used to identify the outliers, the outliers were then saved as surveys_combined_outliers.
Using arrange() the outliers were viewed, the values were determined to be of the albigula and ordii. The quantiles and IQR were then both calculated for both of those species, these values were then used in Tukeys’ Method to ascertain whether the outliers were infact lying outside these ranges for their relevant species.
In doing so, it was found that the two outliers were in fact unusual of their respective species. It was thought that it was possible that the outliers were possibly due to data entry errors. Therefore, capping was employed to deal with them so as to not disturb the data too much or resort to excluding the observations completely.
#Descriptive statistics to determine the distribution
surveys_combined %>% summarise(Mean = mean(hindfoot_length, na.rm = TRUE),Median = median(hindfoot_length, na.rm = TRUE))
#Boxplot showing outliers
surveys_combined_outliers <- surveys_combined$hindfoot_length %>% boxplot(main = "Boxplot of Hindfoot Length", ylab = "Length in cm")
#Outliers
surveys_combined_outliers$out
## [1] 70 64
#Showing outliers (top two observations)
surveys_combined %>% arrange(desc(hindfoot_length))
#Calculating quantiles and IQR for both species
surveys_combined %>% filter(species == 'Albigula') %>% summarise(Q1 = quantile(hindfoot_length, probs = .25, na.rm = TRUE),
Q3 = quantile(hindfoot_length, probs = .75, na.rm = TRUE),
IQR = IQR(hindfoot_length, na.rm = TRUE))
surveys_combined %>% filter(species == 'Ordii') %>% summarise(Q1 = quantile(hindfoot_length, probs = .25, na.rm = TRUE),
Q3 = quantile(hindfoot_length, probs = .75, na.rm = TRUE),
IQR = IQR(hindfoot_length, na.rm = TRUE))
#Calculating Tukeys' mMethod for each species
#Albigula
31-1.5*2
## [1] 28
33+1.5*2
## [1] 36
#Ordii
35-1.5*2
## [1] 32
37-1.5*2
## [1] 34
#Capping function
cap <- function(x){
quantiles <- quantile(x, c(.05, 0.25, 0.75, 0.95), na.rm = T)
x[x < quantiles[2] - 1.5 * IQR(x, na.rm = T)] <- quantiles[1]
x[x > quantiles[3] + 1.5 * IQR(x, na.rm = T)] <- quantiles[4]
return(x)
}
#Capping the outliers
hindfoot_sub <- surveys_combined[,7]
hindfoot_cap <- as.data.frame(sapply(hindfoot_sub, FUN = cap))
#The maximum value now excludes the outliers, meaning the capping was successful
summary(hindfoot_cap)
## hindfoot_length
## Min. : 2.00
## 1st Qu.:21.00
## Median :32.00
## Mean :29.29
## 3rd Qu.:36.00
## Max. :58.00
## NA's :3348