MATH2349 Semester 1, 2020

Required packages

*Loading required packages.

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

*The data sets are csv file which contains Birth rate and Death rate of different countries from year 2009-2018. To make this data set ready for analysis, I did following steps of pre-processing:

*Imported data from each sheet respectively.

*Tidyed each data set and merged them all together.

*Checked data set structure and each variable’s data type.

*Converted data type if the variable was not suitable for data type.

*Created new variable “difference” represents the difference in birth rate and death rate for a country.

*Checked missing values, special values and obvious errors for all variables and with proper methodology.

*Checked outliers for all numeric variables, and fixed with capping methodology.

*Transformed the “Birth_rate” variable with logarithmic method to decrease the skewness and convert the distribution into a normal distribution.

Data

*The data set is taken from open source Wolrd Bank open Data.

*data set Source:-https://data.worldbank.org/.

*Following are two data sets and their variable description:-

1.Birth rate:-it contains birth rate of different countires in years from 2009-2018.This data set has 14 variables:-

*Series Name:-it contains information that birth rate is taken per 1000 people.

*Series Code:-it contains unique code for series Name.

*Country Name:-it contains name of the Country.

*Country Code:-it contains unique code for Country Name.

*2009 [YR2009]- 2018[YR2018]:-this 10 coloumns contains birth rate from year 2009-2018.

2.Death rate:-it contains death rate of different countires in years from 2009-2018.This data set has 14 variables:-

*Series Name:-it contains information that death rate is taken per 1000 people.

*Series Code:-it contains unique code for series Name.

*Country Name:-it contains name of the Country.

*Country Code:-it contains unique code for Country Name.

*2009 [YR2009]- 2018[YR2018]:-this 10 coloumns contains death rate from year 2009-2018.

*Steps taken in this chunk are:-

1.Setting working directory to ~DATA Wrangling/Assignment-2 where both data sets are saved.

2.Importing Birth rate data set in Birth object.

3.Importing Death rate set in Death object.

4.Showing variables with 6 observations of both the objects with the help of head function.

setwd("C:/Users/61450/Desktop/DATA Wrangling/Assignment-2")
Birth<-read_csv("birth-rate.csv")

## Parsed with column specification:
## cols(
##   `Series Name` = col_character(),
##   `Series Code` = col_character(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `2009 [YR2009]` = col_character(),
##   `2010 [YR2010]` = col_character(),
##   `2011 [YR2011]` = col_character(),
##   `2012 [YR2012]` = col_character(),
##   `2013 [YR2013]` = col_character(),
##   `2014 [YR2014]` = col_character(),
##   `2015 [YR2015]` = col_character(),
##   `2016 [YR2016]` = col_character(),
##   `2017 [YR2017]` = col_character(),
##   `2018 [YR2018]` = col_character()
## )

Death<-read_csv("death-rate.csv")

## Parsed with column specification:
## cols(
##   `Series Name` = col_character(),
##   `Series Code` = col_character(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `2009 [YR2009]` = col_character(),
##   `2010 [YR2010]` = col_character(),
##   `2011 [YR2011]` = col_character(),
##   `2012 [YR2012]` = col_character(),
##   `2013 [YR2013]` = col_character(),
##   `2014 [YR2014]` = col_character(),
##   `2015 [YR2015]` = col_character(),
##   `2016 [YR2016]` = col_character(),
##   `2017 [YR2017]` = col_character(),
##   `2018 [YR2018]` = col_character()
## )

head(Birth)

head(Death)

Tidy & Manipulate Data I

*The data sets Birth and Death are untid as they does not meet the “Tidy data principles”-“Each variable must have its own coloumn”.

*The columns 2009 [YR2009]-2018[YR2018] in both the data sets are actually represeting same variable “year” and values in this coloumns are representing another same variable “birth rate” and “death rate”.

*To tidy the data sets i have gathered all the year variables in one coloumn“year” and their values i.e birth rate and death rate in another coloumns named “Birth_rate” and “death_rate”.

Birth<-Birth%>%gather(`2009 [YR2009]`,`2010 [YR2010]`,`2011 [YR2011]`,`2012 [YR2012]`,`2013 [YR2013]`,`2014 [YR2014]`,`2015 [YR2015]`,`2016 [YR2016]`,`2017 [YR2017]`,`2018 [YR2018]`,key="year",value="Birth_rate")

Death<-Death%>%gather(`2009 [YR2009]`,`2010 [YR2010]`,`2011 [YR2011]`,`2012 [YR2012]`,`2013 [YR2013]`,`2014 [YR2014]`,`2015 [YR2015]`,`2016 [YR2016]`,`2017 [YR2017]`,`2018 [YR2018]`,key="year",value="Death_rate")

##Merging Data sets

*Merging data sets birth and Death by left join and joining by common variables Country Name,Country code and Year.

*Showing merged data set variables using head function.

Birth_death<-Birth%>%left_join(Death,by=c("Country Name","Country Code","year"))

head(Birth_death)

Understand

*Checking the structure of data frame using str function and finding incorrect data type.

*Changing Birth_rate and Death_rate variable data type from character to numeric since it contains numeric values instead of character.

*Data types of variables Coutnry Name, Country code,year should be factor instead of character and hence changing their data types from character to factor.

*Showing levels of the converted factor variables to ensure the conversion took properly.

str(Birth_death)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2420 obs. of  9 variables:
##  $ Series Name.x: chr  "Birth rate, crude (per 1,000 people)" "Birth rate, crude (per 1,000 people)" "Birth rate, crude (per 1,000 people)" "Birth rate, crude (per 1,000 people)" ...
##  $ Series Code.x: chr  "SP.DYN.CBRT.IN" "SP.DYN.CBRT.IN" "SP.DYN.CBRT.IN" "SP.DYN.CBRT.IN" ...
##  $ Country Name : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Country Code : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ year         : chr  "2009 [YR2009]" "2009 [YR2009]" "2009 [YR2009]" "2009 [YR2009]" ...
##  $ Birth_rate   : chr  "40.903" "11.945" "23.932" ".." ...
##  $ Series Name.y: chr  "Death rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" ...
##  $ Series Code.y: chr  "SP.DYN.CDRT.IN" "SP.DYN.CDRT.IN" "SP.DYN.CDRT.IN" "SP.DYN.CDRT.IN" ...
##  $ Death_rate   : chr  "8.584" "6.756" "4.643" ".." ...

Birth_death$Birth_rate<-Birth_death$Birth_rate%>%as.numeric()

## Warning in function_list[[k]](value): NAs introduced by coercion

Birth_death$Death_rate<-Birth_death$Death_rate%>%as.numeric()

## Warning in function_list[[k]](value): NAs introduced by coercion

Birth_death$year <- factor(Birth_death$year,ordered = TRUE)

levels(Birth_death$year)

##  [1] "2009 [YR2009]" "2010 [YR2010]" "2011 [YR2011]" "2012 [YR2012]"
##  [5] "2013 [YR2013]" "2014 [YR2014]" "2015 [YR2015]" "2016 [YR2016]"
##  [9] "2017 [YR2017]" "2018 [YR2018]"

Birth_death$`Country Name`<- factor(Birth_death$`Country Name`, ordered = TRUE)

levels(Birth_death$`Country Name`)

##   [1] "Afghanistan"                    "Albania"                       
##   [3] "Algeria"                        "American Samoa"                
##   [5] "Andorra"                        "Angola"                        
##   [7] "Antigua and Barbuda"            "Argentina"                     
##   [9] "Armenia"                        "Aruba"                         
##  [11] "Australia"                      "Austria"                       
##  [13] "Azerbaijan"                     "Bahamas, The"                  
##  [15] "Bahrain"                        "Bangladesh"                    
##  [17] "Barbados"                       "Belarus"                       
##  [19] "Belgium"                        "Belize"                        
##  [21] "Benin"                          "Bermuda"                       
##  [23] "Bhutan"                         "Bolivia"                       
##  [25] "Bosnia and Herzegovina"         "Botswana"                      
##  [27] "Brazil"                         "British Virgin Islands"        
##  [29] "Brunei Darussalam"              "Bulgaria"                      
##  [31] "Burkina Faso"                   "Burundi"                       
##  [33] "Cabo Verde"                     "Cambodia"                      
##  [35] "Cameroon"                       "Canada"                        
##  [37] "Cayman Islands"                 "Central African Republic"      
##  [39] "Chad"                           "Channel Islands"               
##  [41] "Chile"                          "China"                         
##  [43] "Colombia"                       "Comoros"                       
##  [45] "Congo, Dem. Rep."               "Congo, Rep."                   
##  [47] "Costa Rica"                     "Cote d'Ivoire"                 
##  [49] "Croatia"                        "Cuba"                          
##  [51] "Curacao"                        "Cyprus"                        
##  [53] "Czech Republic"                 "Denmark"                       
##  [55] "Djibouti"                       "Dominica"                      
##  [57] "Dominican Republic"             "Ecuador"                       
##  [59] "Egypt, Arab Rep."               "El Salvador"                   
##  [61] "Equatorial Guinea"              "Eritrea"                       
##  [63] "Estonia"                        "Eswatini"                      
##  [65] "Ethiopia"                       "Faroe Islands"                 
##  [67] "Fiji"                           "Finland"                       
##  [69] "France"                         "French Polynesia"              
##  [71] "Gabon"                          "Gambia, The"                   
##  [73] "Georgia"                        "Germany"                       
##  [75] "Ghana"                          "Gibraltar"                     
##  [77] "Greece"                         "Greenland"                     
##  [79] "Grenada"                        "Guam"                          
##  [81] "Guatemala"                      "Guinea"                        
##  [83] "Guinea-Bissau"                  "Guyana"                        
##  [85] "Haiti"                          "Honduras"                      
##  [87] "Hong Kong SAR, China"           "Hungary"                       
##  [89] "Iceland"                        "India"                         
##  [91] "Indonesia"                      "Iran, Islamic Rep."            
##  [93] "Iraq"                           "Ireland"                       
##  [95] "Isle of Man"                    "Israel"                        
##  [97] "Italy"                          "Jamaica"                       
##  [99] "Japan"                          "Jordan"                        
## [101] "Kazakhstan"                     "Kenya"                         
## [103] "Kiribati"                       "Korea, Dem. People’s Rep."     
## [105] "Korea, Rep."                    "Kosovo"                        
## [107] "Kuwait"                         "Kyrgyz Republic"               
## [109] "Lao PDR"                        "Latvia"                        
## [111] "Lebanon"                        "Lesotho"                       
## [113] "Liberia"                        "Libya"                         
## [115] "Liechtenstein"                  "Lithuania"                     
## [117] "Luxembourg"                     "Macao SAR, China"              
## [119] "Madagascar"                     "Malawi"                        
## [121] "Malaysia"                       "Maldives"                      
## [123] "Mali"                           "Malta"                         
## [125] "Marshall Islands"               "Mauritania"                    
## [127] "Mauritius"                      "Mexico"                        
## [129] "Micronesia, Fed. Sts."          "Moldova"                       
## [131] "Monaco"                         "Mongolia"                      
## [133] "Montenegro"                     "Morocco"                       
## [135] "Mozambique"                     "Myanmar"                       
## [137] "Namibia"                        "Nauru"                         
## [139] "Nepal"                          "Netherlands"                   
## [141] "New Caledonia"                  "New Zealand"                   
## [143] "Nicaragua"                      "Niger"                         
## [145] "Nigeria"                        "North Macedonia"               
## [147] "Northern Mariana Islands"       "Norway"                        
## [149] "Oman"                           "Pakistan"                      
## [151] "Palau"                          "Panama"                        
## [153] "Papua New Guinea"               "Paraguay"                      
## [155] "Peru"                           "Philippines"                   
## [157] "Poland"                         "Portugal"                      
## [159] "Puerto Rico"                    "Qatar"                         
## [161] "Romania"                        "Russian Federation"            
## [163] "Rwanda"                         "Samoa"                         
## [165] "San Marino"                     "Sao Tome and Principe"         
## [167] "Saudi Arabia"                   "Senegal"                       
## [169] "Serbia"                         "Seychelles"                    
## [171] "Sierra Leone"                   "Singapore"                     
## [173] "Sint Maarten (Dutch part)"      "Slovak Republic"               
## [175] "Slovenia"                       "Solomon Islands"               
## [177] "Somalia"                        "South Africa"                  
## [179] "South Sudan"                    "Spain"                         
## [181] "Sri Lanka"                      "St. Kitts and Nevis"           
## [183] "St. Lucia"                      "St. Martin (French part)"      
## [185] "St. Vincent and the Grenadines" "Sudan"                         
## [187] "Suriname"                       "Sweden"                        
## [189] "Switzerland"                    "Syrian Arab Republic"          
## [191] "Tajikistan"                     "Tanzania"                      
## [193] "Thailand"                       "Timor-Leste"                   
## [195] "Togo"                           "Tonga"                         
## [197] "Trinidad and Tobago"            "Tunisia"                       
## [199] "Turkey"                         "Turkmenistan"                  
## [201] "Turks and Caicos Islands"       "Tuvalu"                        
## [203] "Uganda"                         "Ukraine"                       
## [205] "United Arab Emirates"           "United Kingdom"                
## [207] "United States"                  "Uruguay"                       
## [209] "Uzbekistan"                     "Vanuatu"                       
## [211] "Venezuela, RB"                  "Vietnam"                       
## [213] "Virgin Islands (U.S.)"          "West Bank and Gaza"            
## [215] "Yemen, Rep."                    "Zambia"                        
## [217] "Zimbabwe"

Birth_death$`Country Code`<- factor(Birth_death$`Country Code`, ordered = TRUE)

levels(Birth_death$`Country Code`)

##   [1] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "ARM" "ASM" "ATG" "AUS" "AUT"
##  [13] "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ"
##  [25] "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHI" "CHL"
##  [37] "CHN" "CIV" "CMR" "COD" "COG" "COL" "COM" "CPV" "CRI" "CUB" "CUW" "CYM"
##  [49] "CYP" "CZE" "DEU" "DJI" "DMA" "DNK" "DOM" "DZA" "ECU" "EGY" "ERI" "ESP"
##  [61] "EST" "ETH" "FIN" "FJI" "FRA" "FRO" "FSM" "GAB" "GBR" "GEO" "GHA" "GIB"
##  [73] "GIN" "GMB" "GNB" "GNQ" "GRC" "GRD" "GRL" "GTM" "GUM" "GUY" "HKG" "HND"
##  [85] "HRV" "HTI" "HUN" "IDN" "IMN" "IND" "IRL" "IRN" "IRQ" "ISL" "ISR" "ITA"
##  [97] "JAM" "JOR" "JPN" "KAZ" "KEN" "KGZ" "KHM" "KIR" "KNA" "KOR" "KWT" "LAO"
## [109] "LBN" "LBR" "LBY" "LCA" "LIE" "LKA" "LSO" "LTU" "LUX" "LVA" "MAC" "MAF"
## [121] "MAR" "MCO" "MDA" "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT" "MMR" "MNE"
## [133] "MNG" "MNP" "MOZ" "MRT" "MUS" "MWI" "MYS" "NAM" "NCL" "NER" "NGA" "NIC"
## [145] "NLD" "NOR" "NPL" "NRU" "NZL" "OMN" "PAK" "PAN" "PER" "PHL" "PLW" "PNG"
## [157] "POL" "PRI" "PRK" "PRT" "PRY" "PSE" "PYF" "QAT" "ROU" "RUS" "RWA" "SAU"
## [169] "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SMR" "SOM" "SRB" "SSD" "STP" "SUR"
## [181] "SVK" "SVN" "SWE" "SWZ" "SXM" "SYC" "SYR" "TCA" "TCD" "TGO" "THA" "TJK"
## [193] "TKM" "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "TZA" "UGA" "UKR" "URY" "USA"
## [205] "UZB" "VCT" "VEN" "VGB" "VIR" "VNM" "VUT" "WSM" "XKX" "YEM" "ZAF" "ZMB"
## [217] "ZWE"

Tidy & Manipulate Data II

*Creating a new variable “difference” which is a difference of birth rate and death rate for a given year and country with the help of mutate function.

*Rearranging sequence of the variables.

Birth_death<-mutate(Birth_death,difference=Birth_rate-Death_rate)

Birth_death%>%select(`Country Name`,`Country Code`,`Series Name.x`,`Series Code.x`,Birth_rate,`Series Name.y`,`Series Code.y`,Death_rate)

Scan I

*Checking total number of NAs in all the variables of Birth_death data set using colsums function.

*Filterting observations which has series Name.x “Birth rate, crude (per 1,000 people)”.I have used this method as the data set contains values other than Birth rate information which is not required and irrelevant.

*Replacing NAs of Birth_rate and Death_rate with 0.I have used this method as NA in this variable means the information of Birth rate and Death rate is not available for that particular country and year, also Birth rate and Death rate are numeric variables and hence replacing it with 0 number.

*Since difference variable is difference of Birth_rate and Death_rate to eliminate NAs of “difference” variable again storing difference of Birth_rate and Death_rate in difference coloumn.

*To confirm that all NAs of data frame is handled checking NAs of all variables again using colSums function and found that all NAs are handled.

colSums(is.na(Birth_death))

## Series Name.x Series Code.x  Country Name  Country Code          year 
##           150           250           250           250             0 
##    Birth_rate Series Name.y Series Code.y    Death_rate    difference 
##           368           150           250           367           369

Birth_death<-Birth_death%>%filter(`Series Name.x`== "Birth rate, crude (per 1,000 people)")

colSums(is.na(Birth_death))

## Series Name.x Series Code.x  Country Name  Country Code          year 
##             0             0             0             0             0 
##    Birth_rate Series Name.y Series Code.y    Death_rate    difference 
##           118             0             0           117           119

Birth_death$Birth_rate[is.na(Birth_death$Birth_rate)] = 0
Birth_death$Death_rate[is.na(Birth_death$Death_rate)] = 0

Birth_death$difference<-Birth_death$Birth_rate-Birth_death$Death_rate

colSums(is.na(Birth_death))

## Series Name.x Series Code.x  Country Name  Country Code          year 
##             0             0             0             0             0 
##    Birth_rate Series Name.y Series Code.y    Death_rate    difference 
##             0             0             0             0             0

Scan II

*Using boxplot visualisation to detect outliers of all numeric variable i.e,Birth_rate,Death_rate and difference.

*It is found that Death_rate has outliers present.

*Using capping method to deal with outliers of Death_rate variable, the reason behind choosing capping method is:- Every value of each variable is unique, if using MVN to detect and delete the outlier would influence the further analysis.

*Checking the results by comparing descriptive statistics summary.

boxplot(Birth_death$Birth_rate~Birth_death$year, main = "Birth rate by year", ylab = "Birth rate", xlab = "Year")

boxplot(Birth_death$Death_rate~Birth_death$year, main = "Death rate by year", ylab = "Death rate", xlab = "Year")

boxplot(Birth_death$difference~Birth_death$year, main = "difference in rate by year", ylab = "Difference in rate", xlab = "Year")

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

Birth_death$Death_rate<-Birth_death$Death_rate%>%cap

summary(Birth_death)

##  Series Name.x      Series Code.x              Country Name   Country Code 
##  Length:2170        Length:2170        Afghanistan   :  10   ABW    :  10  
##  Class :character   Class :character   Albania       :  10   AFG    :  10  
##  Mode  :character   Mode  :character   Algeria       :  10   AGO    :  10  
##                                        American Samoa:  10   ALB    :  10  
##                                        Andorra       :  10   AND    :  10  
##                                        Angola        :  10   ARE    :  10  
##                                        (Other)       :2110   (Other):2110  
##             year       Birth_rate    Series Name.y      Series Code.y     
##  2009 [YR2009]:217   Min.   : 0.00   Length:2170        Length:2170       
##  2010 [YR2010]:217   1st Qu.:11.16   Class :character   Class :character  
##  2011 [YR2011]:217   Median :17.40   Mode  :character   Mode  :character  
##  2012 [YR2012]:217   Mean   :19.84                                        
##  2013 [YR2013]:217   3rd Qu.:27.92                                        
##  2014 [YR2014]:217   Max.   :50.22                                        
##  (Other)      :868                                                        
##    Death_rate       difference    
##  Min.   : 0.000   Min.   :-6.500  
##  1st Qu.: 5.492   1st Qu.: 2.889  
##  Median : 7.200   Median :11.601  
##  Mean   : 7.318   Mean   :12.483  
##  3rd Qu.: 9.265   3rd Qu.:20.919  
##  Max.   :14.900   Max.   :38.570  
##

Transform

*I have selected Birth_rate numeric variable to transform.

*Plotting histogram of Birth_rate and found that the distribution is right skewed.

*Trying to reduce right skewness with logarithms, reciprocals, roots and Box-Cox methods.

*After comparing, logarithems method would be taken for the proper method.

*Applying logarithems method to the variable of data frame.

hist(Birth_death$Birth_rate,breaks=20)

log_Birth_death <- log10(Birth_death$Birth_rate)
hist(log_Birth_death, breaks = 20)

Birth_death_recip <- 1/Birth_death$Birth_rate
hist(Birth_death_recip, breaks = 20)

sqrt_Birth_death <- sqrt(Birth_death$Birth_rate)
hist(sqrt_Birth_death, breaks = 20)

boxcox_Birth_death <- BoxCox(Birth_death$Birth_rate, lambda = "auto")
hist(boxcox_Birth_death, breaks = 20)

Birth_death$Birth_rate <- log10(Birth_death$Birth_rate)

##References

Birth rate data set:- https://databank.worldbank.org/source/world-development-indicators#

*Death rate data set:-https://databank.worldbank.org/source/world-development-indicators#

*CANVAS-> DATA WRANGLING-> Modules https://rmit.instructure.com/courses/67186/modules.