MATH2349 Data Wrangling

Required packages

install.packages("dplyr")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

install.packages("tidyr")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

install.packages("outliers")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

install.packages("forecast")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(outliers)
library(ggplot2)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

In this assignment, we combine three data sets (Blood Pressure Male, Blood Pressure Female, and Obesity Rate for both genders) to explore whether there is a relationship between blood pressure and obesity rate across countries for each gender.

First, because the blood pressure data sets cover the years 1980 to 2008, we calculate each country's average blood pressure over those years and store it in a new column, so that it can be compared with the obesity rate.

Second, we add a “Gender” column to both Blood Pressure Male and Blood Pressure Female, then combine the two into a single data set for both genders, named BP.

We then check whether the BP and Obesity data sets follow the “Tidy Data Principles”. In the BP data set the years are spread across columns. Because we only need the average blood pressure, we do not gather the years into a single column; doing so would make the data set harder to work with. Instead, we subset BP.

Next, we subset BP and Obesity to keep only the columns needed for the comparison, and join the two into our final data set, named Comparison.

Structure and variable types are then checked and converted where necessary: Gender is converted from character to factor.

Missing values and errors are scanned for and handled. All numeric variables are scanned for outliers; for example, an outlier in “Average_Blood_Pressure” is identified and removed.

Lastly, we check whether the numeric variables are normally distributed. Because the obesity rate is right-skewed, a Box-Cox transformation is applied to bring it closer to a normal distribution.

Finally, we have the tidy data set ready for further processing.

Data

Three data sets are used in this assignment.

Blood Pressure Men: sourced from https://data.world/brianray/gapminder-blood-pressure-sbp-m. This data set provides the age-standardized mean systolic blood pressure (SBP) of men across 199 countries. Its only variable is men’s blood pressure, measured in mm Hg.

Blood Pressure Women: sourced from https://data.world/brianray/gapminder-blood-pressure-sbp-w. This data set provides the age-standardized mean systolic blood pressure (SBP) of women across 199 countries. Its only variable is women’s blood pressure, measured in mm Hg.

Global Obesity Rate: sourced from https://data.world/xprizeai-health/global-obesity-rates-2014. This data set provides the percentage of the population that is obese across 179 countries. The variables are the obesity percentage for men, for women, and for both genders combined.

Data Input

Obesity <- read_csv("health-global-obesity-rates-2014_obese_nations_iso.csv")

## Parsed with column specification:
## cols(
##   code = col_character(),
##   Country = col_character(),
##   value = col_double(),
##   Female = col_double(),
##   Male = col_double()
## )

BP_Male <- read_csv("Blood pressure SBP men mmHg.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `SBP male (mm Hg), age standardized mean` = col_character()
## )

## See spec(...) for full column specifications.

BP_Female <- read_csv("Blood pressure SBP women mmHg.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `SBP female (mm Hg), age standardized mean` = col_character()
## )
## See spec(...) for full column specifications.

head(Obesity)

head(BP_Male)

head(BP_Female)

Get the Average Blood Pressure for Each Country

To compare blood pressure with obesity rate, we first calculate each country's average blood pressure over the years 1980 to 2008.

BP_Male <- BP_Male %>% mutate(Average_Blood_Pressure = rowMeans(BP_Male[,2:30]))
head(BP_Male)

BP_Female <- BP_Female %>% mutate(Average_Blood_Pressure = rowMeans(BP_Female[,2:30]))
head(BP_Female)
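
The positional index 2:30 assumes that the year columns sit in exactly those positions. A minimal sketch of the same averaging step that selects the year columns by name instead is shown below. The data frame toy_bp and its values are made up for illustration and are not taken from the Gapminder files.

```r
# Hypothetical toy data: one label column followed by year columns.
toy_bp <- data.frame(
  Country = c("A", "B"),
  `1980` = c(120, 130),
  `1981` = c(122, 128),
  `1982` = c(124, 126),
  check.names = FALSE
)

# Pick every column whose name is a four-digit year.
year_cols <- grep("^[0-9]{4}$", names(toy_bp), value = TRUE)

# Row-wise mean over the year columns only; na.rm = TRUE skips missing years.
toy_bp$Average_Blood_Pressure <- rowMeans(toy_bp[, year_cols], na.rm = TRUE)

toy_bp$Average_Blood_Pressure
```

Selecting by name keeps the calculation correct if a column is added or reordered in the source file.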

Combine BP_Male and BP_Female

We then add a Gender column to each data set, rename the first column of each to Country, and stack BP_Male and BP_Female into a single data set.

BP_Male <- BP_Male %>% mutate(Gender= "Male")
head(BP_Male)

BP_Female <- BP_Female %>% mutate(Gender = "Female")
head(BP_Female)

colnames(BP_Male)[1] <- c("Country")

colnames(BP_Female)[1] <- c("Country")

BP <- rbind(BP_Male, BP_Female)
head(BP)
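
As an alternative sketch, dplyr's bind_rows() can stack the two tables and create the Gender column in one step through its .id argument. The tibbles toy_male and toy_female below are hypothetical stand-ins with made-up values.

```r
library(dplyr)

# Hypothetical stand-ins for the male and female tables.
toy_male   <- tibble(Country = c("A", "B"), Average_Blood_Pressure = c(128, 131))
toy_female <- tibble(Country = c("A", "B"), Average_Blood_Pressure = c(122, 125))

# The argument names ("Male", "Female") become the values of the Gender column.
toy_both <- bind_rows(Male = toy_male, Female = toy_female, .id = "Gender")

toy_both
```

Unlike rbind(), bind_rows() matches columns by name, so it does not rely on both tables having their columns in the same order.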

Tidy & Manipulate Data I

In the Obesity data set, gender is spread across two columns (Female and Male), which violates the “Tidy Data Principles”: gender is a variable and should occupy a single column.
We therefore gather columns 4 and 5 into a single “Gender” column, with the corresponding percentages stored in a value column.

Obesity <- Obesity %>% gather(4:5, key = "Gender", value = "Percentage_Per_Genger")
colnames(Obesity)[3] <- c("Percentage for Both Gender")
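
The same reshaping can be sketched with pivot_longer(), which tidyr now recommends over gather(). The data frame toy_obesity and the column name Percentage are hypothetical and used only for illustration.

```r
library(tidyr)

# Hypothetical wide table: one column per gender.
toy_obesity <- data.frame(
  Country = c("A", "B"),
  Female  = c(10.5, 20.1),
  Male    = c(8.2, 15.3)
)

# Collapse the two gender columns into a key column and a value column.
toy_long <- pivot_longer(
  toy_obesity,
  cols      = c(Female, Male),
  names_to  = "Gender",
  values_to = "Percentage"
)

toy_long
```

Naming the columns explicitly, rather than using positions 4:5, makes the step robust to changes in column order.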

Tidy & Manipulate Data II

From the BP data set we need the columns “Country”, “Gender” and “Average_Blood_Pressure”. From the Obesity data set we need “Country”, “Gender” and the obesity percentage per gender (stored as “Percentage_Per_Genger”). We subset both data sets to keep only these columns for further processing.

Subset Both Data Set

BP_Update <- BP[,c(1,31,32)]

Obesity_Update <- Obesity[,c(2,4,5)]

Combine BP_Update with Obesity

Comparison <- Obesity_Update %>% left_join(BP_Update, by = c("Country" = "Country", "Gender" = "Gender"))
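
A left join keeps every row of the left table and fills unmatched rows with NA, so it is worth checking which keys found no partner. A minimal sketch using anti_join() on hypothetical toy tables (all names and values below are made up):

```r
library(dplyr)

# Hypothetical obesity table with two countries.
toy_obesity <- tibble(
  Country    = c("A", "A", "B", "B"),
  Gender     = c("Female", "Male", "Female", "Male"),
  Percentage = c(10.5, 8.2, 20.1, 15.3)
)

# Hypothetical blood pressure table that lacks country "B".
toy_bp <- tibble(
  Country = c("A", "A"),
  Gender  = c("Female", "Male"),
  Average_Blood_Pressure = c(122, 128)
)

# Rows of toy_obesity with no match in toy_bp on Country and Gender.
unmatched <- anti_join(toy_obesity, toy_bp, by = c("Country", "Gender"))

unmatched
```

Unmatched rows often come from countries that are spelled differently in the two sources, which can be fixed before joining instead of being dropped later.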

Understand

We use str() to inspect the type of each variable. As shown below, “Country” is character. “Percentage_Per_Genger” and “Average_Blood_Pressure” are both numeric, as expected for measurements. “Gender”, however, is stored as character; because it is a categorical variable, it should be converted to a factor.

str(Comparison)

## tibble [358 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Country               : chr [1:358] "Afghanistan" "Angola" "Albania" "Andorra" ...
##  $ Gender                : chr [1:358] "Female" "Female" "Female" "Female" ...
##  $ Percentage_Per_Genger : num [1:358] 4.1 14.2 18.7 30.5 45.1 28.9 22 38.7 28.8 16.3 ...
##  $ Average_Blood_Pressure: num [1:358] 124 130 130 129 127 ...

Convert Character to Factor

# Convert Gender to a factor with an explicit level order
Comparison$Gender <- factor(Comparison$Gender, levels = c("Male", "Female"))
class(Comparison$Gender)

## [1] "factor"

Scan I

We scan Comparison for missing values. sum(is.na(x)) gives the total number of missing values, colSums(is.na(x)) gives the number per column, and which(is.na(x)) gives their positions. All of the missing values are in Average_Blood_Pressure, and it is not reasonable to estimate or derive a country's blood pressure from the other fields. We therefore exclude the rows with missing values and compare blood pressure and obesity rate among the remaining rows.

Scan for Missing Value

sum(is.na(Comparison))

## [1] 30

colSums(is.na(Comparison))

##                Country                 Gender  Percentage_Per_Genger 
##                      0                      0                      0 
## Average_Blood_Pressure 
##                     30

which(is.na(Comparison))

##  [1] 1099 1102 1108 1112 1116 1121 1164 1193 1211 1222 1225 1230 1241 1245 1250
## [16] 1278 1281 1287 1291 1295 1300 1343 1372 1390 1401 1404 1409 1420 1424 1429

Excluding Missing Value

Comparison <- Comparison[complete.cases(Comparison),]
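
Called on a whole table, which(is.na(x)) returns single linear indices, which are hard to map back to rows. A small sketch on hypothetical toy data shows how arr.ind = TRUE reports row and column positions instead, followed by the same complete.cases() filter:

```r
# Hypothetical toy table with one missing blood pressure value.
toy <- data.frame(
  Country = c("A", "B", "C"),
  Average_Blood_Pressure = c(122, NA, 128)
)

# Row and column position of every missing value.
na_positions <- which(is.na(toy), arr.ind = TRUE)
na_positions

# Keep only the rows without any missing value.
toy_complete <- toy[complete.cases(toy), ]
toy_complete
```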

Scan II

We scan both Average_Blood_Pressure and the obesity rate for outliers, by gender, using box plots. A box plot flags a value as a potential outlier when it falls below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 − Q1. As shown below, the obesity rate has no outliers under this rule, but Average_Blood_Pressure may have some.

boxplot(Comparison$Average_Blood_Pressure ~ Comparison$Gender, main="Average_Blood_Pressure", ylab = "BP", xlab = "Gender", col = "olivedrab")

boxplot(Comparison$Percentage_Per_Genger ~ Comparison$Gender, main="Obesity_Rate", ylab = "Percentage", xlab = "Gender", col = "lightgoldenrod2")
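
The 1.5×IQR rule behind the box plot can be sketched by hand. The vector x below is made up for illustration. Note that boxplot() derives its fences from hinges rather than from quantile(), so the two can differ slightly on some samples.

```r
# Hypothetical blood pressure values with one unusually high reading.
x <- c(118, 120, 121, 122, 123, 124, 125, 126, 150)

# First and third quartiles and the interquartile range.
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: values outside [lower, upper] are flagged as potential outliers.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

flagged <- x[x < lower | x > upper]
flagged
```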

Check If Data is Normally Distributed

We plot a histogram for each gender. As shown below, both male and female blood pressure look approximately normally distributed, so we can use the z-score method to identify outliers.

Comparison_M <- Comparison %>% filter(Gender == "Male") 
Comparison_F <- Comparison %>% filter(Gender == "Female")

Comparison_M$Average_Blood_Pressure %>% hist(main="Distribution of Male", col = "cadetblue")

Comparison_F$Average_Blood_Pressure %>% hist(main="Distribution of Female", col = "lightpink3")
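
A histogram gives only a visual impression of normality. As a hedged sketch of one possible numerical complement, the Shapiro-Wilk test from base R's stats package can be run on the same kind of vector. The simulated vector toy_bp below is hypothetical and stands in for a blood pressure column.

```r
# Simulated stand-in for a blood pressure column (not real data).
set.seed(1)
toy_bp <- rnorm(100, mean = 125, sd = 4)

# Shapiro-Wilk test: a small p-value is evidence against normality.
result <- shapiro.test(toy_bp)
result$p.value
```

A large p-value does not prove normality; it only means the test found no strong evidence against it.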

Check for Outliers

As is shown below, male blood pressure has no outliers, and female blood pressure has 1 outlier.

z.scores <- Comparison_M$Average_Blood_Pressure %>%  scores(type = "z")
z.scores %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.71330 -0.65027  0.08499  0.00000  0.61710  2.03545

which( abs(z.scores) >3 )

## integer(0)

length (which( abs(z.scores) >3 ))

## [1] 0

z.scores <- Comparison_F$Average_Blood_Pressure %>%  scores(type = "z")
z.scores %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.12803 -0.66847  0.08651  0.00000  0.75100  1.91250

which( abs(z.scores) >3 )

## [1] 123

length (which( abs(z.scores) >3 ))

## [1] 1

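# Keep the values within 3 standard deviations of the mean.
# Logical indexing stays correct even when no value is flagged.
Result_clean <- Comparison_F$Average_Blood_Pressure[abs(z.scores) <= 3]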
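
The z-score rule can be sketched in base R on a made-up vector, assuming (as we understand the outliers package) that scores(type = "z") standardises with the sample mean and standard deviation. The sketch also shows why negative indexing with which() is risky when nothing is flagged.

```r
# Hypothetical values with no extreme observation.
x <- c(120, 122, 124, 126, 128)

# z-score: distance from the mean in units of the sample standard deviation.
z <- (x - mean(x)) / sd(x)

# Pitfall: which() returns integer(0) here, and x[-integer(0)] is empty.
dropped_everything <- x[-which(abs(z) > 3)]

# Safe version: a logical mask keeps all values when none is flagged.
kept <- x[abs(z) <= 3]

length(dropped_everything)
length(kept)
```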

Transform

Based on the histograms below, Percentage_Per_Genger is clearly right-skewed, while Average_Blood_Pressure is approximately normally distributed.

Comparison$Percentage_Per_Genger %>% hist(main="Distribution of Obesity Rate", col = "lightgoldenrod2")

Comparison$Average_Blood_Pressure %>% hist(main="Distribution of Blood Pressure", col = "olivedrab")

Transformation into Normal Distribution

We use BoxCox() from the forecast package, with lambda chosen automatically, to reduce the right skew and bring the distribution closer to normal.

boxcox_Percentage_Per_Genger<- BoxCox(Comparison$Percentage_Per_Genger,lambda = "auto")
boxcox_Percentage_Per_Genger %>% hist(main="Distribution of Obesity Rate", col = "lightgoldenrod2")
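
For reference, the standard Box-Cox transformation for positive values is (x^lambda − 1) / lambda, with log(x) as the special case lambda = 0. A minimal base-R sketch with a fixed lambda follows. The helper box_cox() and the vector x are hypothetical, and this is not the forecast implementation, which also selects lambda when lambda = "auto".

```r
# Hypothetical helper: Box-Cox transformation for strictly positive x.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Made-up right-skewed values.
x <- c(1, 2, 4, 8, 50)

box_cox(x, 0)    # same as log(x)
box_cox(x, 0.5)  # square-root-like compression of large values
box_cox(x, 1)    # same as x - 1, i.e. no change in shape
```

Values of lambda below 1 compress the right tail, which is why the transformation helps with right-skewed variables.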