MATH2349 Data Wrangling

REQUIRED PACKAGES

Importing the required packages.

#Import packages
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(readr)

EXECUTIVE SUMMARY

The data consist of the birth rates and counts of babies born in New Zealand from 2000 to 2017. There are two datasets: one gives the count of babies born by mother's age and period, and the other gives the birth rate by mother's age and period. The two datasets are combined to gain insights from the data.

After combining the datasets, the data type of each column is checked and the data are checked against the Tidy Data principles. Following this, a new variable is created which gives the population count of the mothers. This gives insight into the number of babies delivered by the mothers.

Further, the combined dataset is checked for null values and for outliers. There is 1 outlier in the dataset, which is handled using the capping method. Finally, the distribution of every numerical variable is checked. For skewed data, appropriate transformations are used to bring the variable closer to a normal distribution.

DATA

The following datasets from data.world are used:

bd-dec18-age-specific-birth-rates.csv (https://data.world/nz-stats-nz/257553f6-16e4-4466-83fa-5cc44c1fc56b/workspace/file?filename=bd-dec18-age-specific-birth-rates.csv)
bd-dec18-births-by-mothers-age.csv (https://data.world/nz-stats-nz/257553f6-16e4-4466-83fa-5cc44c1fc56b/workspace/file?filename=bd-dec18-births-by-mothers-age.csv)

Importing the datasets :

#Import Dataset
setwd("F:/Data Wrangling")
Birthrates <- read.csv("bd-dec18-age-specific-birth-rates.csv", stringsAsFactors = FALSE)
Birthrates <- select(Birthrates, c(1:3))
Birthcounts <- read.csv("bd-dec18-births-by-mothers-age.csv", stringsAsFactors = FALSE)

Variable Description :

Birthrates dataset :
+ Period: Year in which baby was born
+ Mothers_Age: Age brackets to represent mother’s age when baby was born
+ Age_specific_birth_rate: Birth rate of babies for particular year and the mother’s age

Birthcounts dataset :
+ Period: Year in which baby was born
+ Mothers_Age: Age brackets to represent mother’s age when baby was born
+ Count: Count of babies for particular year and the mother’s age

Using the head() function to display both the datasets.

#Display Data

head(Birthrates)

head(Birthcounts)

Both datasets are joined using the inner_join() function from the dplyr package. They are joined on the common columns Period and Mothers_Age.

#Joining Datasets

Birth <- inner_join(Birthrates,Birthcounts)

## Joining, by = c("Period", "Mothers_Age")

#Display joined data

head(Birth)
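
An inner join keeps only the rows whose keys appear in both tables, so it can drop rows without any warning. The sketch below uses small stand-in data frames (the 2001 and 2002 values are illustrative, not taken from the datasets). It names the key columns explicitly and uses anti_join() to list rows that would be lost.

```r
library(dplyr)

# Toy stand-ins for Birthrates and Birthcounts (some values are illustrative only)
rates <- data.frame(
  Period = c(2000, 2000, 2001),
  Mothers_Age = c("15–19", "20–24", "15–19"),
  Age_specific_birth_rate = c(28.2, 77.5, 27.9),
  stringsAsFactors = FALSE
)
counts <- data.frame(
  Period = c(2000, 2000, 2002),
  Mothers_Age = c("15–19", "20–24", "15–19"),
  Count = c(3786, 9879, 3600),
  stringsAsFactors = FALSE
)

# Naming the key columns makes the join explicit
joined <- inner_join(rates, counts, by = c("Period", "Mothers_Age"))

# Rows of `rates` with no partner in `counts` are dropped by the inner join;
# anti_join() returns them so the loss is visible
unmatched <- anti_join(rates, counts, by = c("Period", "Mothers_Age"))

nrow(joined)     # rows present in both tables
nrow(unmatched)  # rows of `rates` that found no match
```

If the anti-join returns zero rows in both directions, the inner join has not discarded any observations.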

UNDERSTAND

We used the str() function to check the structure of the data. This includes the data types of the columns along with the number of rows in the whole dataset.

#Structure of the dataset
str(Birth)

## 'data.frame':    152 obs. of  4 variables:
##  $ Period                 : int  2000 2000 2000 2000 2000 2000 2000 2000 2001 2001 ...
##  $ Mothers_Age            : chr  "Under 15" "15–19" "20–24" "25–29" ...
##  $ Age_specific_birth_rate: num  0.2 28.2 77.5 113.5 113.1 ...
##  $ Count                  : int  30 3786 9879 15765 17163 8430 1497 51 30 3747 ...

The Mothers_Age column is stored as character. We convert it to an ordered factor using the factor() function, supplying the levels in their natural order.

#Converting the Mothers_Age column to factor

Birth$Mothers_Age <- Birth$Mothers_Age %>% 
  factor(levels = c("Under 15", "15–19", "20–24", "25–29", "30–34", "35–39", "40–44", "45 and over"),
         ordered = TRUE)

Checking the class of the Mothers_Age column along with its levels. This is done using the class() and levels() functions from base R.

# Checking the class of the column
class(Birth$Mothers_Age)

## [1] "ordered" "factor"

#Checking the levels of the column
levels(Birth$Mothers_Age)

## [1] "Under 15"    "15–19"       "20–24"       "25–29"       "30–34"      
## [6] "35–39"       "40–44"       "45 and over"
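
One thing worth checking after this conversion is that every label matched a level. The age brackets above are written with an en dash, and factor() silently turns any label that is not in levels into NA. The sketch below uses a hypothetical input vector whose last label has a plain hyphen to show the effect, and then shows the comparisons an ordered factor allows.

```r
age_levels <- c("Under 15", "15–19", "20–24", "25–29",
                "30–34", "35–39", "40–44", "45 and over")

# Hypothetical input: the last label uses a plain hyphen instead of an en dash
raw_age <- c("Under 15", "20–24", "45 and over", "15-19")

age <- factor(raw_age, levels = age_levels, ordered = TRUE)

# Labels that match no level become NA, so count them before going further
n_unmatched <- sum(is.na(age))

# Ordered factors support comparisons that follow the level order
younger <- age[1] < age[2]   # "Under 15" comes before "20–24"
older   <- age[3] > age[2]   # "45 and over" comes after "20–24"
```

For the Birth data, sum(is.na(Birth$Mothers_Age)) being 0 after the conversion would confirm that all labels matched.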

TIDY AND MANIPULATE DATA I

To check the 1st Tidy Data principle, we use the head() function. This shows us that each variable has its own column.

head(Birth)

To check the 2nd Tidy Data principle, we use the nrow() and count() functions. This shows us that every observation has its own row. nrow() and count() give the same output of 152.

nrow(Birth)

## [1] 152

count(Birth)

To check the 3rd Tidy Data principle, we use the length() function. This shows us that every cell has a value for every column.

length(Birth$Period)

## [1] 152

length(Birth$Mothers_Age)

## [1] 152

length(Birth$Age_specific_birth_rate)

## [1] 152

length(Birth$Count)

## [1] 152

From the above steps, all the conditions of the Tidy Data principles have been satisfied.
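
As a complementary check, the "one observation per row" condition can be tested directly by looking for repeated key combinations. This is a minimal sketch on a toy data frame with the same key columns as Birth (some values are illustrative). It assumes that Period and Mothers_Age together identify an observation.

```r
# Toy data with the same key columns as Birth (some values are illustrative)
toy <- data.frame(
  Period = c(2000, 2000, 2001, 2001),
  Mothers_Age = c("15–19", "20–24", "15–19", "20–24"),
  Count = c(3786, 9879, 3747, 9700),
  stringsAsFactors = FALSE
)

# One observation per row: no (Period, Mothers_Age) pair may repeat
n_duplicates <- sum(duplicated(toy[, c("Period", "Mothers_Age")]))

# Appending a copy of the first row creates exactly one repeated key
toy_dup <- rbind(toy, toy[1, ])
n_duplicates_dup <- sum(duplicated(toy_dup[, c("Period", "Mothers_Age")]))

# Every cell holds a value: no NA anywhere in the table
n_missing <- sum(is.na(toy))
```

A result of 0 for both n_duplicates and n_missing supports the second and third conditions more directly than comparing row counts.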

TIDY & MANIPULATE DATA II

We create a new variable from the existing variables Count and Age_specific_birth_rate. The new column is named Mother_Population and is calculated by dividing Count by Age_specific_birth_rate and multiplying by 100. This variable gives the count of mothers in a particular age bracket in that year.

We use the mutate() function to create the new variable Mother_Population from Count and Age_specific_birth_rate.

# Create a new variable, Mother_Population, using the mutate() function.

Birth <- Birth %>% mutate(Mother_Population = (Count / Age_specific_birth_rate) * 100)

Using the head() function to display the contents of the new variable, Mother_Population.

#Display contents of Mother_Population

head(Birth$Mother_Population)

## [1] 15000.00 13425.53 12747.10 13889.87 15175.07 16057.14
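
The first three values can be reproduced by hand from the Count and Age_specific_birth_rate values shown earlier in the str() output. The sketch below repeats the calculation in base R. The factor of 100 is taken from the code above; this report does not state the unit of the published rate, so the scale of the result should be read with that in mind.

```r
# First three rows of the joined data, as shown in the str() output
count <- c(30, 3786, 9879)
rate  <- c(0.2, 28.2, 77.5)

# Same formula as the mutate() call: Count divided by rate, times 100
mother_population <- (count / rate) * 100

round(mother_population, 2)
# 15000.00 13425.53 12747.10
```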

SCAN I

We check for missing values in the dataset using the is.na() function.
If the total is 0, there are no missing values in the dataset; otherwise, we check each column individually.

#Check for any missing values using is.na() function.
sum(is.na(Birth))

## [1] 0
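
Had the total been greater than 0, the per-column check mentioned above could look like the following sketch. The toy data frame and its NA positions are hypothetical.

```r
# Hypothetical data with two missing cells
toy <- data.frame(
  Period = c(2000, 2001, 2002),
  Count = c(30, NA, 51),
  Age_specific_birth_rate = c(0.2, 0.2, NA)
)

# Total number of missing cells, as in the check above
total_missing <- sum(is.na(toy))

# Missing cells per column
missing_by_column <- colSums(is.na(toy))

# Row numbers that contain at least one missing value
rows_with_missing <- which(!complete.cases(toy))
```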

SCAN II

To check for outliers in the dataset, we use the boxplot() function.
A boxplot is plotted for each numeric variable in the dataset.

#Boxplot for each variable in the dataset using the boxplot function
Birth$Age_specific_birth_rate %>% boxplot(main="Box plot for Age Specific Birth Rate", ylab="Birth Rates")

Birth$Count %>% boxplot(main="Box plot for Count" , ylab= "Count of Babies")

Birth$Mother_Population %>% boxplot(main= "Box plot for Mothers Population" , ylab = "Count of Mothers")

From the Mothers Population plot it can be seen that there is 1 outlier present. We check for the outlying value.

#Check the outlying value

Birth_Mother_Population_outlier <- Birth$Mother_Population %>% boxplot(main= "Box plot for Mothers Population" , ylab = "Count of Mothers")

Birth %>% filter(Mother_Population %in% Birth_Mother_Population_outlier$out)
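
The outlying values can also be obtained without drawing a plot. The sketch below uses boxplot.stats() from base R, which applies the same 1.5 * IQR rule that boxplot() uses by default. The vector x is hypothetical and only mimics the shape of the data, with one value far above the rest.

```r
# Hypothetical values: most lie between 12000 and 16000, one sits at 21000
x <- c(12700, 13400, 13900, 14500, 15000, 15200, 16000, 21000)

# $out holds the points beyond 1.5 * IQR from the hinges
outliers <- boxplot.stats(x)$out

# Positions of the outlying values in the original vector
outlier_positions <- which(x %in% outliers)
```

Using %in% rather than == keeps the lookup correct when more than one outlier is returned.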

It can be seen that the outlying value is 21000.
This can be dealt with using the capping method.
We define a capping function, cap(), and apply it to the Mother_Population variable. Capping replaces outliers with percentile values of the data: values below the lower fence of the boxplot (Q1 - 1.5 * IQR) are replaced with the 5th percentile, and values above the upper fence (Q3 + 1.5 * IQR) are replaced with the 95th percentile. Here the outlier lies above the upper fence.

#Creating a capping function
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

#Applying the capping function to the Mother_Population variable.

Birth$Mother_Population <- Birth$Mother_Population %>% cap()
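
To see what cap() does to a single high value, it helps to run it on a small vector where the arithmetic can be followed by hand. The sketch below repeats the function so that it runs on its own. The input vector is hypothetical.

```r
# Same capping function as above, repeated so this example is self-contained
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Hypothetical data: nine small values and one extreme value
x <- c(1:9, 100)

# Q1 = 3.25, Q3 = 7.75, IQR = 4.5, so the upper fence is 7.75 + 6.75 = 14.5
# 100 lies above the fence and is replaced by the 95th percentile of x
capped <- cap(x)

capped[10]
# 59.05
```

Because the 95th percentile is computed from data that still include the extreme value, the replacement in a small sample can remain well above the fence. Replotting the boxplot after capping is therefore a worthwhile check.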

Replotting the boxplot for the Mother_Population variable to check if the outlier has been handled.

Birth$Mother_Population %>% boxplot()

TRANSFORM

We plot histograms of the Count, Mother_Population and Age_specific_birth_rate variables to check their skewness.
These histograms are saved as "before transformation" objects for each variable.

Count_Before_Tranformation <- hist(Birth$Count, main="Histogram of Count before Transformation", xlab= "Count", col="Light Blue")

Age_Before_Transformation <- hist(Birth$Age_specific_birth_rate, main = "Histogram of Age Specific Birth Rate before Transformation", xlab = "Birth Rate", col = "Light Blue")

Mother_Before_Transformation <- hist(Birth$Mother_Population, main = "Histogram of Mother Population before Transformation", xlab = "Mother Population", col = "Light Blue")

To reduce the skewness of the distributions, we use the Box-Cox transformation via the BoxCox() function from the forecast package.
We save the histogram of each transformed variable as an "after transformation" object.

#Performing boxcox transformation
Age_specific_birth_rate_boxcox <- BoxCox(Birth$Age_specific_birth_rate, lambda = "auto")
count_boxcox <- BoxCox(Birth$Count,lambda = "auto")
Mother_Population_boxcox <- BoxCox(Birth$Mother_Population, lambda = "auto")

#Plotting the transformed histograms

Age_After_Transformation <- hist(Age_specific_birth_rate_boxcox, main = "Histogram of Age Specific Birth Rate After Transformation", xlab = "Birth Rate", col = "Light Blue")

Count_After_Transformation <- hist(count_boxcox, main="Histogram of Count After Transformation", xlab = "Count", col="Light Blue")

Mother_After_Transformation <- hist(Mother_Population_boxcox, main="Histogram of Mother Population After Transformation", xlab = "Mother Population", col="Light Blue")

Finally, the variables are transformed, which reduces the skewness and brings their distributions closer to normal.
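
For reference, the Box-Cox transformation of a positive value x with parameter lambda is (x^lambda - 1) / lambda, with log(x) used when lambda is 0. The sketch below writes this out in base R as a hypothetical helper, box_cox(), which is not part of the forecast package. It applies the helper to simulated right-skewed data and uses shapiro.test() as one possible numerical check of normality alongside the histograms. The choice lambda = 0 is an assumption made for the simulated data only.

```r
# Hypothetical helper: Box-Cox transformation for strictly positive data
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Simulated right-skewed data (log-normal), not the Birth data
set.seed(1)
skewed <- rlnorm(200)

# lambda = 0 is the log transform, which suits log-normal data
transformed <- box_cox(skewed, lambda = 0)

# Shapiro-Wilk test: a larger p-value means less evidence against normality
p_before <- shapiro.test(skewed)$p.value
p_after  <- shapiro.test(transformed)$p.value
```

Comparing p_before and p_after gives a number to report next to the before and after histograms.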