Importing the required packages.
#Import packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(readr)
The following data consist of the birth rate and count of babies born in New Zealand from 2000 to 2017. There are two datasets: one gives the count of babies born by mother's age and period, and the other gives the birth rate by mother's age and period. The two datasets are combined to gain insights from the data.
After combining the datasets, the data type of each column was checked, following the Tidy Data principles. A new variable was then created that gives the population count of the mothers. This gave insight into the number of babies delivered by the mothers.
Further, the combined dataset was checked for null values and for outliers. One outlier was found in the dataset and was handled by capping. Finally, the distribution of every numerical variable was checked. For any skewed variable, an appropriate transformation was applied to bring it closer to a normal distribution.
bd-dec17-age-specific-birth-rates.csv (https://data.world/nz-stats-nz/257553f6-16e4-4466-83fa-5cc44c1fc56b/workspace/file?filename=bd-dec18-age-specific-birth-rates.csv)
bd-dec17-births-by-mothers-age.csv (https://data.world/nz-stats-nz/257553f6-16e4-4466-83fa-5cc44c1fc56b/workspace/file?filename=bd-dec18-births-by-mothers-age.csv)
#Import Dataset
setwd("F:/Data Wrangling")
Birthrates <- read.csv("bd-dec18-age-specific-birth-rates.csv", stringsAsFactors = FALSE)
Birthrates <- select(Birthrates, c(1:3))
Birthcounts <- read.csv("bd-dec18-births-by-mothers-age.csv", stringsAsFactors = FALSE)
Birthrates dataset:
+ Period: Year in which the baby was born
+ Mothers_Age: Age bracket representing the mother’s age when the baby was born
+ Age_specific_birth_rate: Birth rate for a particular year and mother’s age bracket
Birthcounts dataset:
+ Period: Year in which the baby was born
+ Mothers_Age: Age bracket representing the mother’s age when the baby was born
+ Count: Count of babies born for a particular year and mother’s age bracket
#Display Data
head(Birthrates)
head(Birthcounts)
#Joining Datasets
Birth <- inner_join(Birthrates,Birthcounts)
## Joining, by = c("Period", "Mothers_Age")
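The join above relies on dplyr matching the shared column names automatically. As a minimal sketch, the keys can also be named explicitly, and `anti_join()` can show which rows an inner join would silently drop. The data frames `rates` and `counts` and all values in them are made-up stand-ins, not the real data.

```r
library(dplyr)

# Toy stand-ins for the two datasets (values are illustrative only)
rates <- data.frame(
  Period = c(2000, 2000, 2001),
  Mothers_Age = c("15–19", "20–24", "15–19"),
  Age_specific_birth_rate = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)
counts <- data.frame(
  Period = c(2000, 2000, 2002),
  Mothers_Age = c("15–19", "20–24", "15–19"),
  Count = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Name the key columns explicitly instead of relying on automatic matching
joined <- inner_join(rates, counts, by = c("Period", "Mothers_Age"))

# Rows of `rates` with no partner in `counts`; inner_join drops these
unmatched <- anti_join(rates, counts, by = c("Period", "Mothers_Age"))

nrow(joined)
nrow(unmatched)
```

Naming the keys suppresses the "Joining, by" message and protects the join if another shared column is added later.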
#Display joined data
head(Birth)
#Structure of the dataset
str(Birth)
## 'data.frame': 152 obs. of 4 variables:
## $ Period : int 2000 2000 2000 2000 2000 2000 2000 2000 2001 2001 ...
## $ Mothers_Age : chr "Under 15" "15–19" "20–24" "25–29" ...
## $ Age_specific_birth_rate: num 0.2 28.2 77.5 113.5 113.1 ...
## $ Count : int 30 3786 9879 15765 17163 8430 1497 51 30 3747 ...
#Converting the Mothers_Age column to an ordered factor
Birth$Mothers_Age <- Birth$Mothers_Age %>%
  factor(levels = c("Under 15", "15–19", "20–24", "25–29", "30–34", "35–39", "40–44", "45 and over"),
         ordered = TRUE)
# Checking the class of the column
class(Birth$Mothers_Age)
## [1] "ordered" "factor"
#Checking the levels of the column
levels(Birth$Mothers_Age)
## [1] "Under 15" "15–19" "20–24" "25–29" "30–34"
## [6] "35–39" "40–44" "45 and over"
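To illustrate why the age brackets are stored as an ordered factor, the sketch below uses a small made-up vector with the same level labels. Comparisons and sorting then follow the declared order rather than alphabetical order. The names `ages`, `age_levels` and `ages_ord` are illustrative only.

```r
# Toy vector using the same age-bracket labels as the dataset
ages <- c("20–24", "Under 15", "45 and over", "15–19")
age_levels <- c("Under 15", "15–19", "20–24", "25–29",
                "30–34", "35–39", "40–44", "45 and over")
ages_ord <- factor(ages, levels = age_levels, ordered = TRUE)

# Comparisons follow the declared level order, not alphabetical order
younger_than_20 <- ages_ord < "20–24"

# Sorting also respects the level order
sorted_ages <- as.character(sort(ages_ord))

younger_than_20
sorted_ages
```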
head(Birth)
nrow(Birth)
## [1] 152
count(Birth)
length(Birth$Period)
## [1] 152
length(Birth$Mothers_Age)
## [1] 152
length(Birth$Age_specific_birth_rate)
## [1] 152
length(Birth$Count)
## [1] 152
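The column-by-column length checks above can also be written in one pass. This is a small sketch on a made-up data frame (`toy` and its values are assumptions for illustration) that collects the length and class of every column.

```r
# Toy data frame standing in for the joined dataset
toy <- data.frame(
  Period = c(2000L, 2000L, 2001L),
  Mothers_Age = c("15–19", "20–24", "15–19"),
  Count = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Length and (first) class of every column in one pass
col_lengths <- sapply(toy, length)
col_classes <- vapply(toy, function(col) class(col)[1], character(1))

# Every column should hold exactly one value per row
all_columns_complete <- all(col_lengths == nrow(toy))

col_lengths
col_classes
all_columns_complete
```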
From the above steps, the conditions of the Tidy Data principles are satisfied: each variable has its own column, each observation has its own row, and each value has its own cell.
We create a new variable from the existing variables Count and Age_specific_birth_rate. The new column is named Mother_Population and is calculated by dividing Count by Age_specific_birth_rate and multiplying the result by 100. This variable gives an estimate of the population of mothers in a particular age bracket for a given year.
# Create a new variable, Mother_Population, using the mutate() function.
Birth <- Birth %>% mutate(Mother_Population = (Count / Age_specific_birth_rate) * 100)
#Display contents of Mother_Population
head(Birth$Mother_Population)
## [1] 15000.00 13425.53 12747.10 13889.87 15175.07 16057.14
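As a worked check of the calculation, the sketch below reapplies the formula to the first five Count and Age_specific_birth_rate values printed by str(Birth). The variable names are illustrative. The multiplier of 100 is the one used in this analysis; the right constant depends on how the rate is defined, so if the rate is expressed per 1,000 women (an assumption the source does not confirm), a multiplier of 1000 would be needed to recover a head count.

```r
# First five rows as printed by str(Birth)
count <- c(30, 3786, 9879, 15765, 17163)
rate  <- c(0.2, 28.2, 77.5, 113.5, 113.1)

# Multiplier used in the analysis; depends on the unit of the rate
multiplier <- 100

mother_population <- count / rate * multiplier
round(mother_population, 2)
```

The rounded results agree with the first five values shown by head(Birth$Mother_Population).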
We check for missing values in the dataset using the is.na() function.
If the sum is 0, there are no missing values in the dataset; otherwise we check each column individually.
#Check for any missing values using is.na() function.
sum(is.na(Birth))
## [1] 0
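If the total had not been 0, a per-column breakdown would be the next step. The following is a minimal sketch on a made-up data frame (`toy` is illustrative, not the real data) using base R only.

```r
# Toy data frame with deliberately missing values
toy <- data.frame(
  Period = c(2000, 2001, NA),
  Count  = c(30, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows that contain at least one missing value
rows_with_na <- toy[!complete.cases(toy), ]

na_per_column
nrow(rows_with_na)
```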
To check for outliers in the dataset, we use the boxplot() function.
A boxplot is plotted for each numeric variable in the dataset.
#Boxplot for each variable in the dataset using the boxplot function
Birth$Age_specific_birth_rate %>% boxplot(main="Box plot for Age Specific Birth Rate", ylab="Birth Rates")
Birth$Count %>% boxplot(main="Box plot for Count" , ylab= "Count of Babies")
Birth$Mother_Population %>% boxplot(main= "Box plot for Mothers Population" , ylab = "Count of Mothers")
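#Check the outlying value; boxplot() returns the outliers in its $out element
Birth_Mother_Population_outlier <- Birth$Mother_Population %>% boxplot(main = "Box plot for Mothers Population", ylab = "Count of Mothers")
#Use %in% so the filter also works if more than one outlier is found
Birth %>% filter(Mother_Population %in% Birth_Mother_Population_outlier$out)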
It can be seen that the outlying value is 21000.
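The outlying values can also be extracted without drawing a plot. As a small sketch, `boxplot.stats()` from the grDevices package (loaded by default in R) returns the same kind of `$out` element. The vector `x` below is made up for illustration.

```r
# Toy vector with one obviously extreme value
x <- c(1:19, 100)

# boxplot.stats() computes the box plot statistics without plotting
stats <- boxplot.stats(x)

# Values lying beyond 1.5 * IQR from the hinges
outliers <- stats$out
outliers
```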
This outlier can be dealt with using the capping method.
We define a capping function, cap(), and apply it to the Mother_Population variable. The function replaces outliers with nearby percentile values that are not outliers. Values below the lower fence of the box plot (Q1 - 1.5 * IQR) are replaced with the 5th percentile. Similarly, values above the upper fence (Q3 + 1.5 * IQR) are replaced with the 95th percentile. Here the outlier lies above the upper fence of the boxplot.
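#Creating a capping function
cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  # Compute both fences once, before any value is replaced
  lower_fence <- quantiles[2] - 1.5 * IQR(x)
  upper_fence <- quantiles[3] + 1.5 * IQR(x)
  # Cap values outside the fences at the 5th / 95th percentile
  x[x < lower_fence] <- quantiles[1]
  x[x > upper_fence] <- quantiles[4]
  x
}
#Applying the capping function to the Mother_Population variable.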
Birth$Mother_Population <- Birth$Mother_Population %>% cap()
Birth$Mother_Population %>% boxplot()
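To make the effect of capping concrete, here is a self-contained sketch on a made-up vector. The helper `cap_sketch()` is a hypothetical name that follows the same percentile-capping idea described above; it is an illustration, not a drop-in replacement for the function used on the real data.

```r
# Hypothetical helper: cap values beyond the 1.5 * IQR fences
# at the 5th / 95th percentile
cap_sketch <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  lower_fence <- q[2] - 1.5 * IQR(x)
  upper_fence <- q[3] + 1.5 * IQR(x)
  x[x < lower_fence] <- q[1]
  x[x > upper_fence] <- q[4]
  x
}

# Toy vector: 1 to 19 plus one extreme value
x <- c(1:19, 100)
capped <- cap_sketch(x)

# Only the extreme value changes; it is pulled down to the 95th percentile
which(capped != x)
max(capped)
```

Because the replacement is the 95th percentile of the original data, the capped value can still sit above the other observations, but it no longer lies beyond the upper fence in this example.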
We plot histograms of the Count, Mother_Population and Age_specific_birth_rate variables to check their skewness.
The histogram of each variable is saved with a Before_Transformation suffix.
Count_Before_Transformation <- hist(Birth$Count, main = "Histogram of Count before Transformation", xlab = "Count", col = "lightblue")
Age_Before_Transformation <- hist(Birth$Age_specific_birth_rate, main = "Histogram of Age Specific Birth Rate before Transformation", xlab = "Birth Rate", col = "lightblue")
Mother_Before_Transformation <- hist(Birth$Mother_Population, main = "Histogram of Mother Population before Transformation", xlab = "Mother Population", col = "lightblue")
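A histogram shows skewness visually; a numeric coefficient can complement it. The sketch below computes the moment-based skewness in base R. The function name `skewness_moment()` and the example vectors are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: moment-based skewness coefficient
# (third central moment divided by the 1.5 power of the second)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)   # no skew expected
right_skewed <- c(1, 1, 1, 1, 10)  # long right tail

skew_symmetric <- skewness_moment(symmetric)
skew_right     <- skewness_moment(right_skewed)

skew_symmetric
skew_right
```

A value near 0 suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.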
To reduce the skewness of the variables, we use the BoxCox() function from the forecast package.
The transformed histogram of each variable is saved with an After_Transformation suffix.
#Performing boxcox transformation
Age_specific_birth_rate_boxcox <- BoxCox(Birth$Age_specific_birth_rate, lambda = "auto")
count_boxcox <- BoxCox(Birth$Count,lambda = "auto")
Mother_Population_boxcox <- BoxCox(Birth$Mother_Population, lambda = "auto")
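For readers unfamiliar with the transformation, the sketch below writes out the standard textbook Box-Cox formula by hand for positive data: (x^lambda - 1) / lambda when lambda is not 0, and log(x) when lambda is 0. The function `box_cox_manual()` is a hypothetical name; the forecast implementation chooses lambda itself when `lambda = "auto"` and may differ in details such as the handling of non-positive values.

```r
# Hypothetical helper: textbook Box-Cox transform for positive data
box_cox_manual <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) {
    log(x)                    # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- c(1, 4, 9, 16)

bc_log  <- box_cox_manual(x, 0)    # same as log(x)
bc_sqrt <- box_cox_manual(x, 0.5)  # equals 2 * (sqrt(x) - 1)
bc_id   <- box_cox_manual(x, 1)    # equals x - 1, shape unchanged

bc_log
bc_sqrt
bc_id
```

Smaller values of lambda compress the right tail more strongly, which is why the transformation helps with right-skewed variables.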
#Plotting the transformed histograms
Age_After_Transformation <- hist(Age_specific_birth_rate_boxcox, main = "Histogram of Age Specific Birth Rate After Transformation", xlab = "Birth Rate", col = "lightblue")
Count_After_Transformation <- hist(count_boxcox, main = "Histogram of Count After Transformation", xlab = "Count", col = "lightblue")
Mother_After_Transformation <- hist(Mother_Population_boxcox, main = "Histogram of Mother Population After Transformation", xlab = "Mother Population", col = "lightblue")
Finally, the histograms of the transformed variables show that the Box-Cox transformation reduces the skewness and brings each variable closer to a normal distribution.