Required packages

Following packages have been used in this report:

library(readr) # Used for importing csv files
library(tidyr) # Used for tidying data
library(dplyr) # Used for data manipulation
library(knitr) # Used for creating nice tables
library(kableExtra) # Used for table formatting
library(magrittr) # Used for pipe operation
library(stringi) # Used for string manipulation
library(editrules) # Used to deal with inconsistencies in data

Executive Summary

The aim of this report is to carry out data pre-processing. This involves - importing, understanding, tidying, manipulating, scanning and transforming data.

First, we identify two open datasets (country wise malnutrition and life expectancy) having at least one common attribute (country) and import them into the R workspace. Next we subset the data by dropping unwanted columns and rename the columns suitably. The untidy dataset is reshaped into tidy format, according to the tidy data principles. To avoid any inconsistencies while merging the datasets, we perform upper case conversion on the common column ‘Country’. We use left join to merge the datasets. It ensures any data loss as a result of merging does not affect the research goal.

Next, we understand the data by checking its volume, attributes and structure. We validate if all the different data types of variables, are classified correctly and apply type conversions for incorrect classifications.

We demonstrate data manipulation through creation of a new column from the existing columns.

Next, we scan for missing values and choose to omit the NA values. Replacing it with a constant/mean/median value wouldn’t make sense in our case, as each of the values are country specific and cannot be approximated. Dataset is scanned for special values and obvious errors. It is validated against pre-defined rules and no issues are observed in the summary of errors.

Tukey’s method of outlier detection is used to plot the box plot for numerical variables. Observed outliers for population column are valid and not handled. For example, countries like India, China etc have very large population compared to other countries. On a similar reasoning, since the outliers observed in all the box plots are actual valid values, we retain the outliers. The distribution of all numerical variables are explored through histograms. To decrease the right skewness of one of the variable and convert it to a normal distribution, we apply the square root transformation.

All the above steps ensure that our data is error free, consistent and ready for further statistical analysis.

Data

Data is sourced from Kaggle.

The first open data set used in this assignment, consists of country wise life expectancy data segregated by gender (male/female) and also the overall life expectancy.
Dataset 1 URL: https://www.kaggle.com/amansaxena/lifeexpectancy?select=Life_expectancy_dataset.csv

The second data set comprises of malnutrition data of countries across the globe from 1983 to 2019.
Dataset 2 URL: https://www.kaggle.com/ruchi798/malnutrition-across-the-globe?select=country-wise-average.csv

Variables description:
Data set 1 - Life_expectancy_dataset.csv
Rank: Numerical variable, life expectancy rank
Country: Character variable, name of the country
Overall Life: Numerical variable, overall life expectancy of male and female population
Male Life: Numerical variable, life expectancy of males
Female Life: Numerical variable, life expectancy of females
Continent: Categorical variable with 6 levels, continent to which a country belongs to - Africa, Europe, Asia, Oceania, North America, South America

Data set 2 - country-wise-average.csv
Country: Character variable, name of the country
Income Classification: Categorical variable with 4 levels, Low Income: 0, Lower Middle Income: 1, Upper Middle Income: 2, High Income: 3
Severe Wasting: Numerical variable, average severe wasting percentage. It indicates the percentage of children aged under 5 years who are below minus three standard deviations from median weight-for-height
Wasting: Numerical variable, average wasting percentage. It indicates the percentage of children aged under 5 years who are below minus two standard deviations from median weight-for-height
Overweight: Numerical variable, average overweight percentage. It indicates the percentage of children aged under 5 years who are above two standard deviations from median weight-for-height
Stunting: Numerical variable, average stunting percentage. It indicates the percentage of children aged under 5 years who are below minus two standard deviations from median height-for-age
Underweight: Numerical variable, average underweight percentage. It indicates the percentage of children aged under 5 years who are below minus two standard deviations from median weight-for-age
U5 Population (`000s): Numerical variable, population of children below the age of five, expressed in thousands

Read in the data sets using read_csv() function. Rename the columns appropriately and drop any columns that are not required.

#Research goal - To compare and correlate malnutrition to average life expectancy.
df1 <- read_csv("Life_expectancy_dataset.csv") # Read first data set

── Column specification ─────────────────────────────────────────────────────────────────────────────
cols(
  Rank = col_double(),
  Country = col_character(),
  `Overall Life` = col_double(),
  `Male Life` = col_double(),
  `Female Life` = col_double(),
  Continent = col_character()
)
colnames(df1)[1] <- "Global Rank" # Rename column name "Rank" to "Global Rank"
colnames(df1)[4] <- "Male" # Rename column name "Male Life" to "Male"
colnames(df1)[5] <- "Female" # Rename column name "Female Life" to "Female"
df1 <- df1[ -c(3) ] # Drop the overall "Overall Life" column, as it is not required

df2 <- read_csv("country-wise-average.csv") # Read second data set

── Column specification ─────────────────────────────────────────────────────────────────────────────
cols(
  Country = col_character(),
  `Income Classification` = col_double(),
  `Severe Wasting` = col_double(),
  Wasting = col_double(),
  Overweight = col_double(),
  Stunting = col_double(),
  Underweight = col_double(),
  `U5 Population ('000s)` = col_double()
)


Let us take a look at the top 6 rows of our datasets.

Data set 1

knitr::kable(head(df1), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Head of data set 1 - untidy format
Global Rank Country Male Female Continent
1 Monaco 85.6 93.5 Europe
2 Japan 81.7 88.5 Asia
3 Singapore 82.3 87.8 Asia
4 Macau; China 81.6 87.6 Asia
5 San Marino 80.7 86.1 Europe
6 Iceland 80.9 85.3 Europe


Our first data set is in untidy format as column names (Male and Female) are values (of life expectancy) instead of variables and doesn’t satisfy the tidy data principles. Let us convert it to tidy format using pivot_longer() function from the tidyr package and check the top 6 rows.

df1_tidy <- pivot_longer(df1, names_to = "Gender", values_to = "Life Expectancy", cols = 3:4) # Convert to tidy format

knitr::kable(head(df1_tidy), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Display top 6 rows of the tidy dataset 1
Global Rank Country Continent Gender Life Expectancy
1 Monaco Europe Male 85.6
1 Monaco Europe Female 93.5
2 Japan Asia Male 81.7
2 Japan Asia Female 88.5
3 Singapore Asia Male 82.3
3 Singapore Asia Female 87.8


Now, lets take a peek into the Data set 2

knitr::kable(head(df2), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Head of data set 2 
Country Income Classification Severe Wasting Wasting Overweight Stunting Underweight U5 Population ('000s)
AFGHANISTAN 0 3.033333 10.350000 5.12500 47.77500 30.375000 4918.5615
ALBANIA 2 4.075000 7.760000 20.80000 24.16000 7.700000 232.8598
ALGERIA 2 2.733333 5.942857 12.83333 19.57143 7.342857 3565.2131
ANGOLA 1 2.400000 6.933333 2.55000 42.63333 23.600000 3980.0540
ARGENTINA 2 0.200000 2.150000 11.12500 10.02500 2.600000 3613.6517
ARMENIA 2 1.600000 3.940000 13.62000 16.12000 3.480000 204.1452


The second data set is already in tidy format and does not require to be reshaped.

Next we proceed to merge both the data sets, for further analysis.
Since we are trying to compare and correlate malnutrition to average life expectancy, we merge the datasets using a left_join() as per below. Prior to merging, we convert the values of the Country column in df1_tidy to upper case, to match the casing of the common column between the data sets.

df1_tidy$Country <-  stringi::stri_trans_toupper(df1_tidy$Country) #Convert to upper case
#Source - https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs

df_combined <- left_join(df2,df1_tidy,by="Country") # We do a left join for the data sets as we intend to find correlation between malnutrition and life expectancy across the countries.


Lets check the top 6 rows of our combined data set.

knitr::kable(head(df_combined), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped"))
Country Income Classification Severe Wasting Wasting Overweight Stunting Underweight U5 Population ('000s) Global Rank Continent Gender Life Expectancy
AFGHANISTAN 0 3.033333 10.350000 5.12500 47.77500 30.375000 4918.5615 221 Asia Male 49.9
AFGHANISTAN 0 3.033333 10.350000 5.12500 47.77500 30.375000 4918.5615 221 Asia Female 52.7
ALBANIA 2 4.075000 7.760000 20.80000 24.16000 7.700000 232.8598 59 Europe Male 75.7
ALBANIA 2 4.075000 7.760000 20.80000 24.16000 7.700000 232.8598 59 Europe Female 81.2
ALGERIA 2 2.733333 5.942857 12.83333 19.57143 7.342857 3565.2131 79 Africa Male 75.5
ALGERIA 2 2.733333 5.942857 12.83333 19.57143 7.342857 3565.2131 79 Africa Female 78.2


This meets the minimum requirement 1 of the specification.

Understand

In this section, let us inspect and understand the data by checking its volume, attributes and structure.

The dimensions of the data frame can be checked as below. It consists of 273 rows and 12 columns.

dim(df_combined)
[1] 273  12


str() function gives the structure of the data frame.

str(df_combined)
spec_tbl_df [273 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Country              : chr [1:273] "AFGHANISTAN" "AFGHANISTAN" "ALBANIA" "ALBANIA" ...
 $ Income Classification: num [1:273] 0 0 2 2 2 2 1 1 2 2 ...
 $ Severe Wasting       : num [1:273] 3.03 3.03 4.08 4.08 2.73 ...
 $ Wasting              : num [1:273] 10.35 10.35 7.76 7.76 5.94 ...
 $ Overweight           : num [1:273] 5.12 5.12 20.8 20.8 12.83 ...
 $ Stunting             : num [1:273] 47.8 47.8 24.2 24.2 19.6 ...
 $ Underweight          : num [1:273] 30.38 30.38 7.7 7.7 7.34 ...
 $ U5 Population ('000s): num [1:273] 4919 4919 233 233 3565 ...
 $ Global Rank          : num [1:273] 221 221 59 59 79 79 207 207 75 75 ...
 $ Continent            : chr [1:273] "Asia" "Asia" "Europe" "Europe" ...
 $ Gender               : chr [1:273] "Male" "Female" "Male" "Female" ...
 $ Life Expectancy      : num [1:273] 49.9 52.7 75.7 81.2 75.5 78.2 54.8 57.2 74 80.4 ...
 - attr(*, "spec")=
  .. cols(
  ..   Country = col_character(),
  ..   `Income Classification` = col_double(),
  ..   `Severe Wasting` = col_double(),
  ..   Wasting = col_double(),
  ..   Overweight = col_double(),
  ..   Stunting = col_double(),
  ..   Underweight = col_double(),
  ..   `U5 Population ('000s)` = col_double()
  .. )


We can see that by default, the read_csv() function has allocated the following data types for each of the columns.

Country - Character variable
Income Classification - Double numeric variable
Severe Wasting - Double numeric variable
Wasting - Double numeric variable
Overweight - Double numeric variable
Stunting - Double numeric variable
Underweight - Double numeric variable
U5 Population (’000s) - Double numeric variable
Global Rank - Double numeric variable
Continent - Character variable
Gender - Character variable
Life Expectancy - Double numeric variable

However, columns - Income Classification, Continent and Gender are incorrectly classified. We can go ahead and perform type conversions on the column names as below -

df_combined$`Income Classification` <- factor(df_combined$`Income Classification`, 
                                              levels = c(0,1,2,3),
                                              labels = c('Low Income', 'Lower Middle Income', 'Upper Middle Income', 'High Income'),
                                              ordered = TRUE) #Convert from numeric to an ordered factor

df_combined$Continent <- factor(df_combined$Continent,
                                levels = c('Africa', 'Europe', 'Asia', 'Oceania', 'North America', 'South America')) # Convert from character to factor
df_combined$Gender <- factor(df_combined$Gender,
                                levels = c('Male', 'Female'))#Convert from character to factor


Lets check the structure of the dataframe again

str(df_combined)
spec_tbl_df [273 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Country              : chr [1:273] "AFGHANISTAN" "AFGHANISTAN" "ALBANIA" "ALBANIA" ...
 $ Income Classification: Ord.factor w/ 4 levels "Low Income"<"Lower Middle Income"<..: 1 1 3 3 3 3 2 2 3 3 ...
 $ Severe Wasting       : num [1:273] 3.03 3.03 4.08 4.08 2.73 ...
 $ Wasting              : num [1:273] 10.35 10.35 7.76 7.76 5.94 ...
 $ Overweight           : num [1:273] 5.12 5.12 20.8 20.8 12.83 ...
 $ Stunting             : num [1:273] 47.8 47.8 24.2 24.2 19.6 ...
 $ Underweight          : num [1:273] 30.38 30.38 7.7 7.7 7.34 ...
 $ U5 Population ('000s): num [1:273] 4919 4919 233 233 3565 ...
 $ Global Rank          : num [1:273] 221 221 59 59 79 79 207 207 75 75 ...
 $ Continent            : Factor w/ 6 levels "Africa","Europe",..: 3 3 2 2 1 1 1 1 6 6 ...
 $ Gender               : Factor w/ 2 levels "Male","Female": 1 2 1 2 1 2 1 2 1 2 ...
 $ Life Expectancy      : num [1:273] 49.9 52.7 75.7 81.2 75.5 78.2 54.8 57.2 74 80.4 ...
 - attr(*, "spec")=
  .. cols(
  ..   Country = col_character(),
  ..   `Income Classification` = col_double(),
  ..   `Severe Wasting` = col_double(),
  ..   Wasting = col_double(),
  ..   Overweight = col_double(),
  ..   Stunting = col_double(),
  ..   Underweight = col_double(),
  ..   `U5 Population ('000s)` = col_double()
  .. )

We can see that our data set includes multiple data types - character, numeric and factors. Also we’ve applied data type conversions from numeric to factor, and character to factor. We have at least one factor variable which is ordered and labeled. Thus it meets the minimum requirements 2, 3 and 4.

The details of column names, row names, column specifications and class of the variable can be obtained using the attributes() function.

attributes(df_combined) #Describe the attributes of a variable
$names
 [1] "Country"               "Income Classification" "Severe Wasting"        "Wasting"              
 [5] "Overweight"            "Stunting"              "Underweight"           "U5 Population ('000s)"
 [9] "Global Rank"           "Continent"             "Gender"                "Life Expectancy"      

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24
 [25]  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48
 [49]  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
 [97]  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
[121] 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
[169] 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
[193] 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
[217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
[241] 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264
[265] 265 266 267 268 269 270 271 272 273

$spec
cols(
  Country = col_character(),
  `Income Classification` = col_double(),
  `Severe Wasting` = col_double(),
  Wasting = col_double(),
  Overweight = col_double(),
  Stunting = col_double(),
  Underweight = col_double(),
  `U5 Population ('000s)` = col_double()
)

$class
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 


Tidy & Manipulate Data I

According to the Tidy data principles, if one of the below conditions is violated, then its is categorized as untidy data:
1) Each variable must have its own column.
2) Each observation must have its own row.
3) Each value must have its own cell.

In our case, the first data set “df1” is untidy.
This is because the column names “Male Life” and “Female Life”, represent the values (life expectancy) for the gender variable and each row represents 2 observations and not 1. Since the column headers are values instead of variables, it violates the first condition mentioned above.

We use pivot_longer() function from tidyr package to reshape the data into tidy format. It converts data from wide format to long format.
Required steps to tidy the data have already been applied prior to merging it with data set 2. Kindly refer the section “Data”.

Tidy & Manipulate Data II

Let us create a new variable by using the mutate() function from the dplyr package.

We create a new variable ‘U5 over/under-weight population(’000s)’ to compute the population (in thousands) of children aged below 5 years, who are either underweight or overweight, using the existing columns.

# This is the R chunk for the Tidy & Manipulate Data II 
df_combined_mutated <- mutate(df_combined, 'U5 over/under-weight population(\'000s)'=(df_combined$Overweight+df_combined$Underweight)*df_combined$`U5 Population ('000s)`/100) #Use of mutate function
kable(head(df_combined_mutated[c(1,2,5,7,8,11:13)]), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Display the top 6 rows of the mutated and subsetted data frame.
Country Income Classification Overweight Underweight U5 Population ('000s) Gender Life Expectancy U5 over/under-weight population('000s)
AFGHANISTAN Low Income 5.12500 30.375000 4918.5615 Male 49.9 1746.08933
AFGHANISTAN Low Income 5.12500 30.375000 4918.5615 Female 52.7 1746.08933
ALBANIA Upper Middle Income 20.80000 7.700000 232.8598 Male 75.7 66.36504
ALBANIA Upper Middle Income 20.80000 7.700000 232.8598 Female 81.2 66.36504
ALGERIA Upper Middle Income 12.83333 7.342857 3565.2131 Male 75.5 719.32419
ALGERIA Upper Middle Income 12.83333 7.342857 3565.2131 Female 78.2 719.32419


This meets the minimum requirement 6.

Scan I

Let us check for missing values in our data set using is.na() function. Further using the sum() function we can obtain the total count.
We use sapply() function to apply it to the entire data frame.

# This is the R chunk for the Scan I
sapply(df_combined, function(x) sum( is.na(x) )) #Check for missing values
              Country Income Classification        Severe Wasting               Wasting 
                    0                     0                    23                     4 
           Overweight              Stunting           Underweight U5 Population ('000s) 
                    6                     2                     4                     0 
          Global Rank             Continent                Gender       Life Expectancy 
                   31                    31                    31                    31 


As we can see, we have many NA values in our data set. We choose to omit the NA values.Replacing it with a constant/mean/median/mode value doesn’t make much sense in our case as each of the values differ from country to country and cannot be approximated.

The missing values are removed using complete.cases() function.

df_combined <- df_combined[complete.cases(df_combined), ]
sapply(df_combined, function(x) sum( is.na(x) )) #Check for missing values again
              Country Income Classification        Severe Wasting               Wasting 
                    0                     0                     0                     0 
           Overweight              Stunting           Underweight U5 Population ('000s) 
                    0                     0                     0                     0 
          Global Rank             Continent                Gender       Life Expectancy 
                    0                     0                     0                     0 


All missing values are now omitted.


Next, let us scan the data set for Special values.
We create a function called is.special() that returns TRUE if the numerical variable has Inf, -Inf or NaN values.
This is applied to the data set using sapply() function

is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
} #Function to scan for Inf, -Inf and NaN values.

sapply(df_combined, function(x) sum( is.special(x) ))
              Country Income Classification        Severe Wasting               Wasting 
                    0                     0                     0                     0 
           Overweight              Stunting           Underweight U5 Population ('000s) 
                    0                     0                     0                     0 
          Global Rank             Continent                Gender       Life Expectancy 
                    0                     0                     0                     0 


The output indicates that there are no special values.

Further, let us scan the dataset for obvious errors.
We define custom rules and place it in a file called editrules.txt

  1. The categorical variables (factors- Income Classification, Continent, Gender) can hold only predefined values as mentioned below.
  2. Life Expectancy mandatorily needs to be a value between 0 to 120.
Rules <- editfile("editrules.txt", type = "all")#Populate constraints defined under editrules.txt file to Rules variable
Rules #Display the rules

Data model:
dat1 : `Income Classification` %in% c('High Income', 'Low Income', 'Lower Middle Income', 'Upper Middle Income')
dat2 : Continent %in% c('Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America')
dat3 : Gender %in% c('Female', 'Male') 

Edit set:
num1 : 0 <= `Life Expectancy`
num2 : `Life Expectancy` <= 100 


Our dataframe is validated against the pre-defined rules using the violatedEdits() function and we obtain the summary of errors.

violated <- violatedEdits(Rules,df_combined)
summary(violated)
No violations detected, 0 checks evaluated to NA
NULL

We do not see any errors, meaning all the values in our dataset satisfy the conditions defined under editfiles.txt
If required, we could use deducorrect package and correctionRules() function to correct the obvious inconsistencies and errors.

Scan II

Using Tukey’s method of outlier detection, we plot the box plot for each of the numerical variables.

df_combined$`U5 Population ('000s)` %>% boxplot(main="Boxplot of under 5 years Population in thousands", ylab="Population", col = "grey") #Boxplot of U5 Population ('000s)


Observation: Eventhough we see outliers, we choose to retaim those values. For example countries like India, China etc have a population over a billion and are identified as valid outliers.

#Box plot of numerical variables
par(mfrow=c(2,2))
df_combined$`Severe Wasting` %>% boxplot(main="Boxplot of severe wasting percentage", ylab="Percentage", col = "grey")
df_combined$`Wasting` %>% boxplot(main="Boxplot of wasting percentage", ylab="Percentage", col = "grey")
df_combined$`Overweight` %>% boxplot(main="Boxplot of over-weight percentage", ylab="Percentage", col = "grey")
df_combined$`Underweight` %>% boxplot(main="Boxplot of under-weight percentage", ylab="Percentage", col = "grey")
par(mfrow=c(1,1))


On a similar reasoning, since the outliers observed in the above box plots (> 1.5 * IQR) are not errors, but actual values which are valid, we do not remove the outliers.

No outliers are detected in the stunting and Life Expectancy columns.

#Boxplot of stunting percentage and life expectancy
par(mfrow=c(1,2))
df_combined$Stunting %>% boxplot(main="Boxplot of stunting percentage", ylab="Percentage", col = "grey")
df_combined$`Life Expectancy` %>% boxplot(main="Boxplot of life expectancy", ylab="age in years", col = "grey")
par(mfrow=c(1,1))

Transform

Let us explore the histograms of all the numerical variables.

# Histogram plots of all numerical variables
par(mfrow =c(3,3))

U5_population_in_thousands <-  df_combined$`U5 Population ('000s)`
hist(U5_population_in_thousands)
severe_wasting <- df_combined$`Severe Wasting`
hist(severe_wasting)
wasting <- df_combined$`Wasting`
hist(wasting)
overweight<- df_combined$`Overweight`
hist(overweight)
stunting <- df_combined$Stunting
hist(stunting)
underweight <- df_combined$`Underweight`
hist(underweight)
life_expectancy <- df_combined$`Life Expectancy`
hist(life_expectancy)

par(mfrow=c(1,1))

To apply a suitable transformation, let us choose the histogram plot of the variable wasting.

hist(wasting)

We observe that this distribution is right skewed. To decrease the right skewness and convert the distribution into a normal distribution we apply the square root transformation

sqrt_wasting <- sqrt(wasting)
hist(sqrt_wasting)

Observation - The square root transformation has reduced the right skewness and has improved the symmetry of the distribution

We can also apply log transformations to reduce the right skewness.

log_wasting <- log10(wasting)
hist(log_wasting)

ln_wasting <- log(wasting)
hist(ln_wasting)

Compared to the log transformations, we can see that square root transformation gives a better result.

References:


1) Create Awesome HTML Table with knitr::kable and kableExtra, Hao Zhu, viewed 13 May 2021: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
2) Data set 1: https://www.kaggle.com/amansaxena/lifeexpectancy?select=Life_expectancy_dataset.csv
3) Data set 2: https://www.kaggle.com/ruchi798/malnutrition-across-the-globe?select=country-wise-average.csv
4) Conversion to upper case, viewed 11 May 2021: https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs
5) Concepts discussed under module notes 1 to 7:
https://rare-phoenix-161610.appspot.com/secured/index.html

---
title: "Data Wrangling (Data Preprocessing)"
author: "Yeshaswi Aralaguppe Muralidhar - S3825856"
subtitle: Practical assessment 2
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

## Required packages 

Following packages have been used in this report:
```{r, echo = TRUE, warnings = FALSE}
library(readr) # Used for importing csv files
library(tidyr) # Used for tidying data
library(dplyr) # Used for data manipulation
library(knitr) # Used for creating nice tables
library(kableExtra) # Used for table formatting
library(magrittr) # Used for pipe operation
library(stringi) # Used for string manipulation
library(editrules) # Used to deal with inconsistencies in data
```

## Executive Summary 

The aim of this report is to carry out data pre-processing. This involves - importing, understanding, tidying, manipulating, scanning and transforming data.<br>

First, we identify two open datasets (country wise malnutrition and life expectancy) having at least one common attribute (country) and import them into the R workspace. Next we subset the data by dropping unwanted columns and rename the columns suitably. The untidy dataset is reshaped into tidy format, according to the tidy data principles. To avoid any inconsistencies while merging the datasets, we perform upper case conversion on the common column ‘Country’. We use left join to merge the datasets. It ensures any data loss as a result of merging does not affect the research goal.<br>

Next, we understand the data by checking its volume, attributes and structure. We validate if all the different data types of variables, are classified correctly and apply type conversions for incorrect classifications.<br>

We demonstrate data manipulation through creation of a new column from the existing columns.<br>

Next, we scan for missing values and choose to omit the NA values. Replacing it with a constant/mean/median value wouldn’t make sense in our case, as each of the values are country specific and cannot be approximated.
Dataset is scanned for special values and obvious errors. It is validated against pre-defined rules and no issues are observed in the summary of errors.<br>

Tukey’s method of outlier detection is used to plot the box plot for numerical variables. Observed outliers for population column are valid and not handled. For example, countries like India, China etc have very large population compared to other countries. On a similar reasoning, since the outliers observed in all the box plots are actual valid values, we retain the outliers.
The distribution of all numerical variables are explored through histograms. To decrease the right skewness of one of the variable and convert it to a normal distribution, we apply the square root transformation.<br>

All the above steps ensure that our data is error free, consistent and ready for further statistical analysis.<br>


## Data 

Data is sourced from Kaggle. <br>

The first open data set used in this assignment, consists of country wise life expectancy data segregated by gender (male/female) and also the overall life expectancy.<br>
**Dataset 1 URL:** <a href=https://www.kaggle.com/amansaxena/lifeexpectancy?select=Life_expectancy_dataset.csv>https://www.kaggle.com/amansaxena/lifeexpectancy?select=Life_expectancy_dataset.csv<a/>

The second data set comprises of malnutrition data of countries across the globe from 1983 to 2019.<br>
**Dataset 2 URL:** <a href=https://www.kaggle.com/ruchi798/malnutrition-across-the-globe?select=country-wise-average.csv>https://www.kaggle.com/ruchi798/malnutrition-across-the-globe?select=country-wise-average.csv<a/> <br>

**Variables description**:<br>
**Data set 1 - Life_expectancy_dataset.csv** <br>
**Rank**: Numerical variable, life expectancy rank <br>
**Country**: Character variable, name of the country<br>
**Overall Life**: Numerical variable, overall life expectancy of male and female population <br>
**Male Life**: Numerical variable, life expectancy of males <br>
**Female Life**: Numerical variable, life expectancy of females <br>
**Continent**: Categorical variable with 6 levels, continent to which a country belongs to - Africa, Europe, Asia, Oceania, North America, South America <br><br>

**Data set 2 - country-wise-average.csv** <br>
**Country**: Character variable, name of the country <br>
**Income Classification**: Categorical variable with 4 levels, Low Income: 0, Lower Middle Income: 1, Upper Middle Income: 2, High Income: 3<br>
**Severe Wasting**: Numerical variable, average severe wasting percentage. It indicates the percentage of children aged under 5 years who are below minus three standard deviations from median weight-for-height <br>
**Wasting**: Numerical variable, average wasting percentage. It indicates the percentage of children aged under 5 years who are below minus two standard deviations from median weight-for-height  <br>
**Overweight**: Numerical variable, average overweight percentage. It indicates the percentage of children aged under 5 years who are above two standard deviations from median weight-for-height <br>
**Stunting**: Numerical variable, average stunting percentage. It indicates the percentage of children aged under 5 years who are below minus two standard deviations from median height-for-age<br>
**Underweight**: Numerical variable, average underweight percentage. It indicates the percentage of children aged under 5 years who are below minus two standard deviations from median weight-for-age<br>
**U5 Population (`000s)**: Numerical variable, population of children below the age of five, expressed in thousands<br>

Read in the data sets using read_csv() function. Rename the columns appropriately and drop any columns that are not required.
```{r}
#Research goal - To compare and correlate malnutrition to average life expectancy.
df1 <- read_csv("Life_expectancy_dataset.csv") # Read first data set

colnames(df1)[1] <- "Global Rank" # Rename column name "Rank" to "Global Rank"
colnames(df1)[4] <- "Male" # Rename column name "Male Life" to "Male"
colnames(df1)[5] <- "Female" # Rename column name "Female Life" to "Female"
df1 <- df1[ -c(3) ] # Drop the overall "Overall Life" column, as it is not required

df2 <- read_csv("country-wise-average.csv") # Read second data set
```
<br>
Let us take a look at the top 6 rows of our datasets. <br>
<br>
**Data set 1**
```{r}
knitr::kable(head(df1), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Head of data set 1 - untidy format
```
<br>
Our first data set is in untidy format as column names (Male and Female) are values (of life expectancy) instead of variables and doesn't satisfy the tidy data principles. Let us convert it to tidy format using pivot_longer() function from the tidyr package and check the top 6 rows.
```{r}
df1_tidy <- pivot_longer(df1, names_to = "Gender", values_to = "Life Expectancy", cols = 3:4) # Convert to tidy format

knitr::kable(head(df1_tidy), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Display top 6 rows of the tidy dataset 1
```
<br>
Now, lets take a peek into the **Data set 2**
```{r}
knitr::kable(head(df2), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Head of data set 2 
```
<br>
The second data set is already in tidy format and does not require to be reshaped.<br><br>

Next we proceed to merge both the data sets, for further analysis.<br>
Since we are trying to compare and correlate malnutrition to average life expectancy, we merge the datasets using a left_join() as per below. Prior to merging, we convert the values of the Country column in df1_tidy to upper case, to match the casing of the common column between the data sets. 

```{r}
df1_tidy$Country <-  stringi::stri_trans_toupper(df1_tidy$Country) #Convert to upper case
#Source - https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs

df_combined <- left_join(df2,df1_tidy,by="Country") # We do a left join for the data sets as we intend to find correlation between malnutrition and life expectancy across the countries.
```
<br>
Lets check the top 6 rows of our combined data set.

```{r}
knitr::kable(head(df_combined), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped"))
```
<br>
This meets the minimum requirement 1 of the specification.
<br>

## Understand 

In this section, let us inspect and understand the data by checking its volume, attributes and structure.<br>
<br>
The dimensions of the data frame can be checked as below. It consists of 273 rows and 12 columns.
```{r}
dim(df_combined)
```
<br>
**str()** function gives the structure of the data frame.
```{r}
str(df_combined)
```
<br>
We can see that by default, the read_csv() function has allocated the following data types for each of the columns.<br> <br>
**Country** - *Character variable* <br>
**Income Classification** - *Double numeric variable* <br>
**Severe Wasting** - *Double numeric variable* <br>
**Wasting** - *Double numeric variable* <br>
**Overweight** - *Double numeric variable* <br>
**Stunting** - *Double numeric variable* <br>
**Underweight** - *Double numeric variable* <br>
**U5 Population ('000s)** - *Double numeric variable* <br>
**Global Rank** - *Double numeric variable* <br>
**Continent** - *Character variable* <br>
**Gender** - *Character variable* <br>
**Life Expectancy** - *Double numeric variable* <br>

However, columns - **Income Classification**, **Continent** and **Gender** are incorrectly classified.
We can go ahead and perform type conversions on the column names as below -
```{r}
df_combined$`Income Classification` <- factor(df_combined$`Income Classification`, 
                                              levels = c(0,1,2,3),
                                              labels = c('Low Income', 'Lower Middle Income', 'Upper Middle Income', 'High Income'),
                                              ordered = TRUE) #Convert from numeric to an ordered factor

df_combined$Continent <- factor(df_combined$Continent,
                                levels = c('Africa', 'Europe', 'Asia', 'Oceania', 'North America', 'South America')) # Convert from character to factor
df_combined$Gender <- factor(df_combined$Gender,
                                levels = c('Male', 'Female'))#Convert from character to factor
```
<br>
Lets check the structure of the dataframe again
```{r}
str(df_combined)

```

We can see that our data set includes multiple data types - character, numeric and factors. Also we've applied data type conversions from numeric to factor, and character to factor. We have at least one factor variable which is ordered and labeled. Thus it meets the minimum requirements 2, 3 and 4. <br>

The details of column names, row names, column specifications and class of the variable can be obtained using the attributes() function.
```{r}
attributes(df_combined) #Describe the attributes of a variable
```
<br>

##	Tidy & Manipulate Data I 

According to the Tidy data principles, if one of the below conditions is violated, then its is categorized as untidy data:<br>
1) Each variable must have its own column.<br>
2) Each observation must have its own row.<br>
3) Each value must have its own cell.<br>

In our case, the first data set **"df1"** is untidy. <br>
This is because the column names **"Male Life"** and **"Female Life"**, represent the **values** (life expectancy) for the gender variable and each row represents 2 observations and not 1. Since the column headers are values instead of variables, it violates the first condition mentioned above.<br><br>

We use pivot_longer() function from tidyr package to reshape the data into tidy format.
It converts data from wide format to long format.<br>
Required steps to tidy the data have already been applied prior to merging it with data set 2. Kindly refer the section "Data".<br><br>


##	Tidy & Manipulate Data II 

Let us create a new variable by using the mutate() function from the dplyr package.<br>

We create a new variable 'U5 over/under-weight population('000s)'  to compute the population (in thousands) of children aged below 5 years, who are either underweight or overweight, using the existing columns.

```{r}
# This is the R chunk for the Tidy & Manipulate Data II 
df_combined_mutated <- mutate(df_combined, 'U5 over/under-weight population(\'000s)'=(df_combined$Overweight+df_combined$Underweight)*df_combined$`U5 Population ('000s)`/100) #Use of mutate function
kable(head(df_combined_mutated[c(1,2,5,7,8,11:13)]), format = "html", align = 'l')%>%
kable_styling(bootstrap_options = c("striped")) # Display the top 6 rows of the mutated and sub-setted data frame.
```
<br>This meets the minimum requirement 6.<br><br>


##	Scan I 

Let us check for missing values in our data set using **is.na()** function. Further using the **sum()** function we can obtain the total count.<br>
We use sapply() function to apply it to the entire data frame.

```{r}
# This is the R chunk for the Scan I
sapply(df_combined, function(x) sum( is.na(x) )) #Check for missing values
```
<br>
As we can see, we have many NA values in our data set. We choose to omit the NA values.Replacing it with a constant/mean/median/mode value doesn't make much sense in our case as each of the values differ from country to country and cannot be approximated.<br>

The missing values are removed using complete.cases() function.
```{r}
df_combined <- df_combined[complete.cases(df_combined), ]
sapply(df_combined, function(x) sum( is.na(x) )) #Check for missing values again
```
<br> All missing values are now omitted.

<br>
Next, let us scan the data set for Special values.<br>
We create a function called is.special() that returns TRUE if the numerical variable has Inf, -Inf or NaN values.<br>
This is applied to the data set using sapply() function
```{r}
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
} #Function to scan for Inf, -Inf and NaN values.

sapply(df_combined, function(x) sum( is.special(x) ))
```
<br>
The output indicates that there are no special values.<br>

Further, let us scan the dataset for obvious errors.<br>
We define custom rules and place it in a file called **editrules.txt**<br>

  1)  The categorical variables (factors- Income Classification, Continent, Gender) can hold only predefined values as mentioned below.<br>
  2)  Life Expectancy mandatorily needs to be a value between 0 to 120.<br>

```{r}
Rules <- editfile("editrules.txt", type = "all")#Populate constraints defined under editrules.txt file to Rules variable
Rules #Display the rules
```
<br>
Our dataframe is validated against the pre-defined rules using the violatedEdits() function and we obtain the summary of errors.<br>
```{r}
violated <- violatedEdits(Rules,df_combined)
summary(violated)

```
We do not see any errors, meaning all the values in our dataset satisfy the conditions defined under editfiles.txt <br> 
If required, we could use deducorrect package and correctionRules() function to correct the obvious inconsistencies and errors. <br><br>

##	Scan II

Using Tukey’s method of outlier detection, we plot the box plot for each of the numerical variables.
```{r}
df_combined$`U5 Population ('000s)` %>% boxplot(main="Boxplot of under 5 years Population in thousands", ylab="Population", col = "grey") #Boxplot of U5 Population ('000s)
```
<br>
Observation: Eventhough we see outliers, we choose to retaim those values. For example countries like India, China etc have a population over a billion and are identified as valid outliers.

```{r}
#Box plot of numerical variables
par(mfrow=c(2,2))
df_combined$`Severe Wasting` %>% boxplot(main="Boxplot of severe wasting percentage", ylab="Percentage", col = "grey")
df_combined$`Wasting` %>% boxplot(main="Boxplot of wasting percentage", ylab="Percentage", col = "grey")
df_combined$`Overweight` %>% boxplot(main="Boxplot of over-weight percentage", ylab="Percentage", col = "grey")
df_combined$`Underweight` %>% boxplot(main="Boxplot of under-weight percentage", ylab="Percentage", col = "grey")
par(mfrow=c(1,1))
```
<br>
On a similar reasoning, since the outliers observed in the above box plots (> 1.5 * IQR) are not errors, but actual values which are valid, we do not remove the outliers.<br>

No outliers are detected in the stunting and Life Expectancy columns.
```{r}
#Boxplot of stunting percentage and life expectancy
par(mfrow=c(1,2))
df_combined$Stunting %>% boxplot(main="Boxplot of stunting percentage", ylab="Percentage", col = "grey")
df_combined$`Life Expectancy` %>% boxplot(main="Boxplot of life expectancy", ylab="age in years", col = "grey")
par(mfrow=c(1,1))
```
##	Transform 

Let us explore the histograms of all the numerical variables.

```{r}
# Histogram plots of all numerical variables
par(mfrow =c(3,3))

U5_population_in_thousands <-  df_combined$`U5 Population ('000s)`
hist(U5_population_in_thousands)
severe_wasting <- df_combined$`Severe Wasting`
hist(severe_wasting)
wasting <- df_combined$`Wasting`
hist(wasting)
overweight<- df_combined$`Overweight`
hist(overweight)
stunting <- df_combined$Stunting
hist(stunting)
underweight <- df_combined$`Underweight`
hist(underweight)
life_expectancy <- df_combined$`Life Expectancy`
hist(life_expectancy)

par(mfrow=c(1,1))

```
To apply a suitable transformation, let us choose the histogram plot of the variable wasting.<br>

```{r}
hist(wasting)
```

We observe that this distribution is right skewed. To decrease the right skewness and convert the distribution into a normal distribution we apply the **square root transformation**

```{r}
sqrt_wasting <- sqrt(wasting)
hist(sqrt_wasting)

```
Observation - The square root transformation has reduced the right skewness and has improved the symmetry of the distribution<br>

We can also apply log transformations to reduce the right skewness.
```{r}
log_wasting <- log10(wasting)
hist(log_wasting)
```

```{r}
ln_wasting <- log(wasting)
hist(ln_wasting)
```
Compared to the log transformations, we can see that square root transformation gives a better result.<br>

## References: 
 <br> 
1) Create Awesome HTML Table with knitr::kable and kableExtra, Hao Zhu, viewed 13 May 2021: <a href=https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html> <https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html><a/><br>
2) Data set 1: <a href=https://www.kaggle.com/amansaxena/lifeexpectancy?select=Life_expectancy_dataset.csv> <https://www.kaggle.com/amansaxena/lifeexpectancy?select=Life_expectancy_dataset.csv><a/><br>
3) Data set 2: <a href=https://www.kaggle.com/ruchi798/malnutrition-across-the-globe?select=country-wise-average.csv> <https://www.kaggle.com/ruchi798/malnutrition-across-the-globe?select=country-wise-average.csv><a/><br>
4) Conversion to upper case, viewed 11 May 2021: <a href=https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs> <https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs><a/><br>
5) Concepts discussed under module notes 1 to 7: <br> <a href=https://rare-phoenix-161610.appspot.com/secured/index.html> <https://rare-phoenix-161610.appspot.com/secured/index.html><a/>
<br>
<br>
